Entity Matching Guide
Introduction
This page covers details you should know before matching entities in your data. Some of the advice here is aimed at your group's lead researchers because it addresses larger, project-level decisions about URI use. Other concepts and examples are for the research assistants who do the actual entity matching work.
LINCS’s Approach to URIs
LINCS’s strategy is to prioritize re-using existing URIs as identifiers whenever possible. We do this to enrich the linked data that exists online while limiting the number of duplicate identifiers we add to the LOD ecosystem that people need to choose between.
The principles we follow are:
- Every entity in the LINCS triplestore has one primary URI that acts as the identifier to represent that real-world concept.
- We try to consistently use the same URI for the same entity across all datasets.
- This requires incoming datasets to match entities against datasets already in LINCS and prioritize the URIs that LINCS is already using.
- Each entity can have many owl:sameAs relationships to connect it to additional equivalent URIs from other sources.
One of the benefits of the LINCS knowledge graph is that our contributing datasets contain obscure entities that are not well represented elsewhere online. However, this creates a challenge for entity matching, as we often cannot find external URIs for these entities. In these cases, projects that have the capacity to do so will mint URIs using a project-specific namespace. As a final option, LINCS will mint URIs using the namespace http://id.lincsproject.ca/.
Almost any named thing in your data should be matched to an entity so that we can represent it with a URI. Examples of things we would try to match include:
- People
- Places
- Companies
- Specific objects
- Categories of objects
- Creative works
- Materials
- Abstract concepts
- Political movements
Choosing a Source
LINCS has a selection of LOD sources, or authority files, that we tend to use. Below is a description of what we like about each of our most commonly used sources and where it fits in our order of preference. However, each project has its own priorities, which may lead to a different order of preference, and the domain of your data will affect this as well. Before you start matching entities, we suggest that you investigate the authority files of interest to ensure you are comfortable connecting your data to each source. You should also look for domain-specific sources not listed here.
DBpedia
DBpedia is a good source for entities that are notable enough to have Wikipedia pages. We tend to use Wikidata before DBpedia when the same entity is in both.
GeoNames
GeoNames is our first choice source for modern geographic locations. If you cannot find a location in GeoNames then, in order of LINCS preference, try Getty TGN, VIAF, or Wikidata.
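The order of preference for places can be sketched as a simple lookup. This is only an illustration, not LINCS tooling: the function name `pick_place_uri` and the `candidates` dictionary (source name mapped to the URI found during matching) are hypothetical, and the URIs use placeholder identifiers.

```python
# Order of preference for modern geographic locations, as described above.
PLACE_SOURCE_PREFERENCE = ["GeoNames", "Getty TGN", "VIAF", "Wikidata"]


def pick_place_uri(candidates):
    """Return (source, uri) for the most preferred source that has a match.

    `candidates` is a hypothetical dict mapping a source name to the URI
    found for one place. Returns None when no listed source has a match.
    """
    for source in PLACE_SOURCE_PREFERENCE:
        uri = candidates.get(source)
        if uri:
            return source, uri
    return None


# Example: a place matched in TGN and Wikidata but not in GeoNames or VIAF.
# XXX and QYYY are placeholders, not real identifiers.
found = {
    "Wikidata": "http://www.wikidata.org/entity/QYYY",
    "Getty TGN": "http://vocab.getty.edu/tgn/XXX",
}
print(pick_place_uri(found))
```

A project with different priorities would only need to reorder the list.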
Getty
There are 4 separate datasets in Getty that you can search through for different types of entities:
- Art and Architecture Thesaurus (AAT)
- Union List of Artist Names (ULAN)
- Getty Thesaurus of Geographic Names (TGN)
- Cultural Objects Name Authority (CONA)
For person entities, LINCS uses Getty as the second choice after VIAF when the people are likely to be artists.
LINCS
If you plan on publishing your transformed data with LINCS, then you will need to match entities against existing LINCS data. This helps us prevent having the same entity in our triplestore under multiple URIs.
LOC
The Library of Congress (LOC) is a good place for concepts and types. There are many different groups of terms within LOC so you will have to browse to find relevant groupings.
VIAF
The Virtual International Authority File (VIAF) is our first choice for bibliographic records as well as people and companies connected to those records—like authors and publishers.
VIAF also contains geographic locations, but we more often use GeoNames for those.
Wikidata
Wikidata contains more than 100 million entities covering a large variety of types. We often use Wikidata when we cannot find an entity in a domain-specific source. Wikidata comes with the caveat that it is community-created, so the way that entities are defined is subject to frequent changes.
Because Wikidata is so widely used, it is a good place to find out about other domain-specific sources. If you search for an entity on Wikidata and scroll to the “identifiers” section at the bottom of the page, you can see URIs for that entity from other authority files. This can help you learn about other authorities relevant to your domain. However, be sure to research those newly found authority files and ensure that the identifiers they provide are valid LOD URIs.
Wikidata acts as a bridge between many authority files. Once you have one external URI, from VIAF for example, you can query Wikidata using SPARQL to find the Wikidata URI that corresponds to a VIAF URI. This can help you add additional owl:sameAs links to your data or switch the authority of preference.
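As a minimal sketch of the VIAF-to-Wikidata lookup described above, the following Python builds a SPARQL query using Wikidata's "VIAF ID" property (P214) and, optionally, sends it to the public Wikidata Query Service. The function names and the User-Agent string are hypothetical, and the VIAF number in the example is a placeholder rather than a real record.

```python
import json
import re
import urllib.parse
import urllib.request

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def viaf_id_from_uri(viaf_uri):
    """Extract the numeric identifier from a VIAF permalink."""
    match = re.fullmatch(r"https?://viaf\.org/viaf/(\d+)/?", viaf_uri)
    if match is None:
        raise ValueError(f"Not a VIAF URI: {viaf_uri}")
    return match.group(1)


def build_viaf_lookup_query(viaf_uri):
    """Build a SPARQL query for Wikidata items carrying this VIAF ID."""
    viaf_id = viaf_id_from_uri(viaf_uri)
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "SELECT ?item WHERE {\n"
        f'  ?item wdt:P214 "{viaf_id}" .\n'
        "}"
    )


def run_query(query):
    """Send the query to the public endpoint (requires network access)."""
    url = WIKIDATA_SPARQL_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    request = urllib.request.Request(
        url, headers={"User-Agent": "entity-matching-example/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    # Each binding holds one matching Wikidata entity URI.
    return [b["item"]["value"] for b in data["results"]["bindings"]]


# Placeholder VIAF number, used only to show the shape of the query.
print(build_viaf_lookup_query("http://viaf.org/viaf/123456789"))
```

Calling `run_query` on the generated query should return zero, one, or several Wikidata entity URIs. More than one result suggests duplicate records that need manual review.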
Matching Vocabulary Terms
The concepts and techniques for matching entities in your data apply to vocabulary terms as well. Whenever possible, choose a vocabulary term that is already used frequently in LINCS data. This will help connect your data to others. If there are multiple vocabulary terms that match yours, you can typically use multiple.
Specific vocabularies that are already in use in LINCS data include:
- DBpedia
- GeoNames
- Getty Art & Architecture Thesaurus (AAT)
- Getty Thesaurus of Geographic Names (TGN)
- Getty Union List of Artist Names (ULAN)
- Homosaurus
- Library of Congress Subject Headings
- Library of Congress Names
- MARC List for Languages
- MARC Relators
- Nomenclature for Museum Cataloging
- VIAF
- Wikidata
See our Vocabularies documentation for additional background and the Vocabulary Browser to find vocabulary terms created by or used in LINCS.
Valid URIs
If you would like to match entities in your data against a source not listed in our documentation, first check that it is a source of linked data and that it hosts a permanent URI for each entity. If you are unsure about using a source, check with the Transformation Team.
When using a URI from any source, make sure that its namespace and formatting are exactly as the source lists them. The URI should be the permanent link for the entity, which is not necessarily the link you see in the address bar of your web browser.
Here are the namespaces of the sources we frequently use, with common errors listed:
DBpedia
http://dbpedia.org/resource/
- Not https://dbpedia.org/page/
- Not
GeoNames
https://sws.geonames.org/
- https not http
Getty
- These start with http://vocab.getty.edu/ followed by the vocabulary name. Be careful not to use the page URIs that start with http://vocab.getty.edu/page/.
- AAT: http://vocab.getty.edu/aat/
- ULAN: http://vocab.getty.edu/ulan/
- TGN: http://vocab.getty.edu/tgn/
LOC
- There are multiple valid namespaces within LOC data, typically beginning with http://id.loc.gov/authorities/
- The URI should be listed under “URIs” within an entity’s web page
- Note the use of http not https, and do not include .html at the end
VIAF
http://viaf.org/viaf/
- You can find this listed as Permalink within a record’s web page
- There should not be a trailing /
Wikidata
http://www.wikidata.org/entity/
- http not https
- /entity/ not /wiki/
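The common errors listed above are mechanical enough to check in code. The following is a minimal sketch, assuming you want to rewrite a pasted link into the listed namespace. The function name `normalize_uri` and the rewrite rules are one reading of the list, not an official LINCS validator. Treating https DBpedia, Getty, and VIAF links as errors is an assumption based on the namespaces being listed with http.

```python
import re

# (pattern, replacement) pairs covering the common errors listed above.
URI_FIXES = [
    (r"^https?://dbpedia\.org/(page|resource)/", "http://dbpedia.org/resource/"),
    (r"^https?://sws\.geonames\.org/", "https://sws.geonames.org/"),
    (r"^https?://vocab\.getty\.edu/(page/)?", "http://vocab.getty.edu/"),
    (r"^https?://id\.loc\.gov/", "http://id.loc.gov/"),
    (r"^https?://viaf\.org/viaf/", "http://viaf.org/viaf/"),
    (r"^https?://www\.wikidata\.org/(wiki|entity)/", "http://www.wikidata.org/entity/"),
]


def normalize_uri(uri):
    """Rewrite a pasted link into the namespace form listed above."""
    uri = uri.strip()
    for pattern, replacement in URI_FIXES:
        uri = re.sub(pattern, replacement, uri)
    # LOC URIs should not end with .html
    if uri.startswith("http://id.loc.gov/") and uri.endswith(".html"):
        uri = uri[: -len(".html")]
    # VIAF record URIs should not have a trailing slash
    if uri.startswith("http://viaf.org/viaf/"):
        record = uri[len("http://viaf.org/viaf/"):].rstrip("/")
        uri = "http://viaf.org/viaf/" + record
    return uri


# A browser address-bar link rewritten to the entity URI.
print(normalize_uri("https://www.wikidata.org/wiki/Q36774"))
```

A check like this only catches formatting mistakes. It cannot confirm that the URI identifies the entity you intended.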
Non-LOD Sources
If you cannot find an LOD source for a URI, but can find references to the entity in documents on the web, there are ways we can include those in your conceptual mapping.
For example, the Map of Early Modern London (MoEML) project included the Wikipedia pages in which entities were mentioned:
<https://mapoflondon.uvic.ca/MORE14>
rdf:type crm:E21_Person ;
rdfs:label "Dame Alice More (née Harpur)"@en ;
crm:P129i_is_subject_of <https://en.wikipedia.org/wiki/Alice_More> .
<https://en.wikipedia.org/wiki/Alice_More>
rdf:type crm:E73_Information_Object ;
crm:P2_has_type <http://www.wikidata.org/entity/Q36774> .
Minting URIs
When a project cannot find an existing URI for an entity, does not approve of the URIs it finds, or does not have the capacity to match all entities, we can mint new URIs instead.
The first option is for the project or data owner to mint and host new URIs. It is then the responsibility of the data owner to maintain those URIs, keeping them stable and online.
Examples of namespaces that contributing projects used to mint their own URIs include:
https://mapoflondon.uvic.ca/
https://personography.1890s.ca/
https://anthologiagraeca.org/api/
If your project is not able to commit to minting and hosting URIs, then LINCS can mint them for you under the namespace http://id.lincsproject.ca/.
Note that your data will end up containing entities in the http://id.lincsproject.ca/ namespace, because CIDOC CRM introduces intermediate nodes for events that do not have URIs elsewhere in LOD sources.
URIs in your Data
When multiple LINCS projects each use the same URI as an entity’s primary identifier, people will be able to easily view the merged version of those records in ResearchSpace and query each dataset to see the individual contributions. This shared use of primary URIs helps with the “linked-ness” of the linked data.
With that said, we do have projects that choose to use their own identifiers as the primary identifiers—even if the same entity is already in LINCS—so that their whole dataset is consistent. That choice is ultimately up to your research team.
Here are some examples of how URIs found through entity matching can be added to your data, using Map of Early Modern London (MoEML) data as an illustrative sample:
Option 1
When we find an external URI for an entity, we use that as the primary identifier for that entity in LINCS. We then have two sub-choices for how the project specific URI could be used:
- Project URIs become the objects of owl:sameAs relationships:
<http://www.wikidata.org/entity/QYYY> owl:sameAs <https://mapoflondon.uvic.ca/XXX> .
- Project URIs become identifiers for the entities:
<http://www.wikidata.org/entity/QYYY> crm:P1_is_identified_by <http://id.lincsproject.ca/AAA> .
<http://id.lincsproject.ca/AAA> rdf:type crm:E42_Identifier .
<http://id.lincsproject.ca/AAA> crm:P190_has_symbolic_content "FLEM1" .
<http://id.lincsproject.ca/AAA> crm:P2_has_type <http://id.lincsproject.ca/BBB> .
<http://id.lincsproject.ca/BBB> rdfs:label "Map of Early Modern London Project Identifier" .
When there is no existing URI for an entity from any authority source, we have two more choices:
- Use project URIs as the primary identifier
- Mint a LINCS URI and use that as the primary identifier and then connect to the project URI using one of the choices above
Option 2
Every person in the data would have a MoEML URI as the primary identifier for that entity.
We would connect those entities to their matched external URIs using owl:sameAs:
<https://mapoflondon.uvic.ca/XXX> owl:sameAs <http://www.wikidata.org/entity/QYYY> .
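The difference between the two options comes down to which URI is the subject of the owl:sameAs statement. A minimal sketch that emits the statement as an N-Triples line for either option is shown below. The helper names and the `prefer_external` flag are hypothetical, and XXX and QYYY are the same placeholders used above.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triple(primary_uri, equivalent_uri):
    """Format one owl:sameAs statement as an N-Triples line."""
    return f"<{primary_uri}> <{OWL_SAME_AS}> <{equivalent_uri}> ."


def link_entity(project_uri, external_uri, prefer_external=True):
    """Link a project URI to its matched external URI.

    prefer_external=True follows Option 1 (external URI is primary);
    prefer_external=False follows Option 2 (project URI is primary).
    Returns None when no external match was found.
    """
    if external_uri is None:
        return None
    if prefer_external:
        return same_as_triple(external_uri, project_uri)
    return same_as_triple(project_uri, external_uri)


project = "https://mapoflondon.uvic.ca/XXX"
external = "http://www.wikidata.org/entity/QYYY"
print(link_entity(project, external, prefer_external=True))   # Option 1
print(link_entity(project, external, prefer_external=False))  # Option 2
```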
De-duplication within your Data
As explained in the Match Entities step, we recommend that you enhance your source data with internal unique identifiers for each entity before starting to match entities. These can be temporary identifiers that will be replaced before your data is published. The benefit is that if you extract entities from your text and match them, the internal identifiers let you easily put the new URIs in the correct locations in the source or transformed data.
Depending on your approach, having these internal identifiers can also help you de-duplicate your own data before you start matching it against external sources. If you do not de-duplicate your data first, you will still effectively de-duplicate it as long as you assign the same external URI to every occurrence of an entity in your data. The downside is that entity matching may take longer if you look up each instance of the same thing separately.
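One way to picture this is the sketch below, which assigns a temporary identifier to each distinct mention so that every entity is looked up only once. It is illustrative only: the function name, the `tmp-` identifier format, and the sample names are assumptions, and comparing names after case-folding is a deliberately naive rule that would wrongly merge two different people who share a name.

```python
from itertools import count


def assign_internal_ids(mentions, prefix="tmp"):
    """Give each distinct entity mention one temporary internal identifier.

    Mentions are compared after collapsing whitespace and case-folding.
    Returns a list of (mention, internal_id) pairs in the original order.
    """
    ids = {}
    counter = count(1)
    result = []
    for mention in mentions:
        key = " ".join(mention.split()).casefold()
        if key not in ids:
            ids[key] = f"{prefix}-{next(counter):04d}"
        result.append((mention, ids[key]))
    return result


# Illustrative mentions extracted from a text.
mentions = ["Alice More", "alice  more", "Thomas More"]
tagged = assign_internal_ids(mentions)
for mention, internal_id in tagged:
    print(internal_id, mention)

# Only the distinct identifiers need to be matched against external sources.
to_match = sorted({internal_id for _, internal_id in tagged})
print(to_match)
```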
Placeholder URIs
Projects often do not have entity matching complete by the time we start setting up the code or tools to implement the rest of the transformation. In these cases, we introduce placeholder URIs that can be swapped out in the final data once entity matching is complete. LINCS, for example, uses the namespace http://temp.lincsproject.ca/ to represent a placeholder.
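Swapping placeholders out can be sketched as a simple substitution over triples. This is a minimal illustration rather than the LINCS pipeline: the function name, the tuple representation of triples, and the example identifiers are hypothetical.

```python
TEMP_NAMESPACE = "http://temp.lincsproject.ca/"


def swap_placeholders(triples, final_uris):
    """Replace placeholder URIs in (subject, predicate, object) triples.

    `final_uris` maps each placeholder URI to the URI chosen during
    entity matching. Returns the rewritten triples and the set of
    placeholders that still have no final URI.
    """
    unresolved = set()

    def resolve(term):
        if term.startswith(TEMP_NAMESPACE):
            if term in final_uris:
                return final_uris[term]
            unresolved.add(term)
        return term

    swapped = [tuple(resolve(term) for term in triple) for triple in triples]
    return swapped, unresolved


# Hypothetical data: one placeholder has been matched, one has not.
triples = [
    ("http://temp.lincsproject.ca/person1",
     "http://www.w3.org/2000/01/rdf-schema#label",
     "Dame Alice More (née Harpur)"),
    ("http://temp.lincsproject.ca/place1",
     "http://www.w3.org/2000/01/rdf-schema#label",
     "London"),
]
final_uris = {"http://temp.lincsproject.ca/person1": "https://mapoflondon.uvic.ca/MORE14"}
swapped, unresolved = swap_placeholders(triples, final_uris)
print(swapped[0][0])
print(unresolved)
```

Reporting the unresolved placeholders makes it easy to confirm that none remain before the data is published.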