Match Entities
Introduction
Entity matching, also called reconciliation, entity linking, and named entity disambiguation, is the step where we add unique identifiers, in the form of URIs, to your data to represent each unique entity. The goal is to use the same identifier every time the same real-world thing is mentioned in your data, in other LINCS data, and, ideally, in linked data elsewhere on the web. By using the same identifier, we can connect all of the statements made about that entity to create a rich and informative graph about it.
This page covers how entity matching fits into the data transformation workflows and the available tools, while our Entity Matching Guide gets into the details of how to match entities.
For a list of all entity matching tools that LINCS uses, see the tool catalogue.
Resources Needed
Because entity matching is so time-consuming, we recommend starting to match entities in your data as early as possible. You can follow our Entity Matching Guide to start matching entities even before you have committed to the rest of the LINCS transformation process.
Entity matching can be completed in tandem with the other transformation steps. Use placeholder values in the transformation until your research team has finished matching entities. Once URIs have been found, they can be added to either the source data or to the transformed data to replace the placeholder values. The dataset, however, will not be published by LINCS until either the research team finishes their entity matching or LINCS and the research team come to an agreement that no more entity matching can take place, and new URIs need to be minted for the remaining entities. Note that once the data is published, the research team can continue to enhance it, including further entity matching.
Entity matching will always be completed by your research team because it requires domain knowledge to ensure you are choosing the correct identifiers. This is a great task for undergraduate or graduate research assistants.
The Transformation Team can offer guidance on this step, particularly on how to set up your data for entity matching and how to merge the URIs back into your data.
Time Commitment
Entity matching tends to be the slowest part of the transformation process. It can be sped up with tools that perform automated linking, but at the cost of accuracy. The loss in accuracy is greater for the kind of data coming into LINCS because it often references obscure or historically overlooked entities that are not well represented in existing LOD sources.
LINCS’s approach is to mix automation with manual review:
- Start with tools that automatically suggest candidate matches for entities
- If possible, apply filtering based on the context for each entity in your data and the authority data to separate trustworthy suggestions from ones that need review
- Have students manually review the uncertain candidates
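The three steps above can be sketched as a small triage function. This is a minimal illustration, not a LINCS tool: the `Candidate` class, the `triage` function, the choice of name and birth year as context, and the `example.org` URIs are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """One suggested match from an authority source (hypothetical shape)."""
    uri: str
    name: str
    birth_year: Optional[int] = None


def triage(entity, candidates):
    """Accept a candidate only when exactly one agrees with the entity's context.

    Everything else is sent to a human reviewer together with the candidate URIs.
    """
    agreeing = [
        c for c in candidates
        if c.name.casefold() == entity["name"].casefold()
        and c.birth_year is not None
        and c.birth_year == entity.get("birth_year")
    ]
    if len(agreeing) == 1:
        return "accept", agreeing[0].uri
    return "review", [c.uri for c in candidates]


candidates = [
    Candidate("http://example.org/authority/1", "Jane Doe", 1850),
    Candidate("http://example.org/authority/2", "Jane Doe", 1902),
]
print(triage({"name": "Jane Doe", "birth_year": 1850}, candidates))
print(triage({"name": "Jane Doe"}, candidates))
```

The first call has enough context to single out one candidate. The second has no birth year, so it falls through to manual review. In practice, the fields you compare will depend on what context your data and the authority data share.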
With this mixed approach, you can estimate the time needed by assuming that each entity in your data will take a person 1-5 minutes to match. Where your project falls in this range depends on how familiar the person is with the data and whether they will need to spend time researching the entities to confirm a match.
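As a rough illustration of the 1-5 minute estimate, the arithmetic can be written out as follows. The function name `review_hours` and the dataset size of 3000 entities are hypothetical values chosen for the example.

```python
def review_hours(n_entities, min_minutes=1, max_minutes=5):
    """Return the (low, high) estimate of human matching time in hours."""
    return n_entities * min_minutes / 60, n_entities * max_minutes / 60


low, high = review_hours(3000)
print(f"{low:.0f} to {high:.0f} hours")  # 3000 entities: 50 to 250 hours
```

A range this wide is one reason to start matching early and to filter out trustworthy automated suggestions before manual review begins.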
For large datasets, it is not always feasible to carefully match all entities. Our strategy for this has been:
- Match as many entities as is feasible
- When the data is otherwise ready for publication, your team can discuss with LINCS when to stop matching and mint URIs for the remaining unmatched entities
- Once the data is published, you can slowly continue to add URIs for the unmatched entities in ResearchSpace Review
| | Research Team | Ontology Team | Transformation Team | Storage Team |
|---|---|---|---|---|
| Set Up your Data | ✓ | | ✓ | |
| Match your Entities | ✓ | | | |
| Merge Matched Entities Data | ✓ | | ✓ | |
| Choose Vocabularies | ✓ | ✓ | | |
Set Up your Data
For each workflow there are typically two options for setting up your data:
- Use a tool that takes your data in its original format and allows you to add URIs to the source data. For example, LEAF Writer lets you annotate XML and TEI data with entity types and URIs.
- In this case, you should not need to do any setup beyond the typical Clean Data step.
- Use a script or query tool to pull entities out of your data, along with contextual information about those entities. Then use a tool such as VERSD or OpenRefine to find URIs for the entities. Finally, use another script or query tool to insert those URIs back into either the source data or the transformed data.
- This case will typically result in one or more spreadsheets where each row represents one entity and the columns contain contextual details about the entity. For example, you may have one internal unique identifier per row to represent a person, and then columns for their name, birth date, and death date so that you can quickly check if candidate URIs are correct.
- LINCS typically uses custom scripts to complete this step. The Transformation Team can offer advice and sample scripts from previous transformations.
Consider enhancing your source data with internal unique identifiers for each entity. These can be temporary identifiers that will be replaced before your data is published. The benefit is that after you extract entities from your data and match them, the identifiers tell you exactly where each new URI belongs in the source or transformed data.
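The sketch below illustrates the second option together with temporary identifiers: it collects unique people from some source records, gives each one a temporary identifier, and writes one row per entity with an empty `uri` column to fill in during matching. The record fields, the `temp:person/` identifier pattern, and the function names are assumptions for the example. They are not taken from a LINCS script.

```python
import csv
import io

FIELDS = ["temp_id", "name", "birth", "death", "uri"]


def build_entity_table(records):
    """Collect unique people from source records and assign temporary IDs."""
    table = {}
    for record in records:
        key = (record["name"], record.get("birth"), record.get("death"))
        if key not in table:
            table[key] = {
                "temp_id": f"temp:person/{len(table) + 1}",  # hypothetical pattern
                "name": record["name"],
                "birth": record.get("birth", ""),
                "death": record.get("death", ""),
                "uri": "",  # filled in later during entity matching
            }
    return list(table.values())


def to_csv(rows):
    """Serialize the entity table as CSV text, one row per entity."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


records = [
    {"name": "Jane Doe", "birth": "1850", "death": "1920"},
    {"name": "John Roe", "birth": "1861"},
    {"name": "Jane Doe", "birth": "1850", "death": "1920"},  # repeated mention
]
print(to_csv(build_entity_table(records)))
```

The resulting spreadsheet can be loaded into a matching tool, and the `temp_id` column is what lets you put each confirmed URI back in the right place afterwards.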
Match your Entities
- Structured Data
- Semi-Structured Data
- TEI Data
- Natural Language Data
For structured data, extract entities and their context from your source data. This process may require a custom script, but often it will be as easy as using a simplified version of your source spreadsheets or using an online tool to transform structured data into a spreadsheet. To find and confirm URIs, use VERSD if your data is bibliographic and OpenRefine otherwise. Note that OpenRefine accepts a broad range of starting file types so you may be able to skip the initial extraction step.
Particularly for small datasets, you may find it sufficient to manually look up URIs and add them directly to your source data, or to wait and add them to the transformed data.
As usual, your options for semi-structured data depend on your specific data:
- Use LEAF Writer to review and add URIs directly to source XML documents.
- Create a custom script to extract entities and their context. To find and confirm URIs, use VERSD if your data is bibliographic and OpenRefine otherwise.
- Manually look up URIs and add them directly to your source data, or wait and add them to the transformed data.
To take advantage of the automated triple extraction possible with the TEI workflow, it is best that you insert entity URIs directly into your source TEI files. LEAF Writer is our preferred tool for this step.
Alternatively, you may choose to change the source TEI files manually or with custom scripts. You can still add further URIs to the transformed data at the end.
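If you take the custom script route, a script along these lines could add URIs to a TEI file. This is a minimal sketch that assumes people are tagged with `persName`, that URIs go in the `ref` attribute, and that you already have a mapping from names to confirmed URIs. The function `add_refs`, the sample document, and the `example.org` URI are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def add_refs(tei_xml, uri_by_name):
    """Set ref="URI" on each persName whose text has a confirmed match."""
    ET.register_namespace("", TEI_NS)  # keep TEI as the default namespace on output
    root = ET.fromstring(tei_xml)
    for element in root.iter(f"{{{TEI_NS}}}persName"):
        name = "".join(element.itertext()).strip()
        # Leave existing ref attributes untouched.
        if name in uri_by_name and "ref" not in element.attrib:
            element.set("ref", uri_by_name[name])
    return ET.tostring(root, encoding="unicode")


sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>'
    "<persName>Jane Doe</persName> met <persName>Unknown Person</persName>."
    "</p></body></text></TEI>"
)
print(add_refs(sample, {"Jane Doe": "http://example.org/authority/person/1"}))
```

Matching on the name string alone is only safe when each name is unambiguous within the document. Keying the mapping on an internal identifier attribute instead avoids attaching one URI to two different people who share a name.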
This step of the natural language data workflow is still in progress. Check back over the next few months as we release the tools described here.
The APIs that we use to automatically extract triples from natural language data combine the tasks of identifying entities, choosing URIs for entities, and extracting the relationships between entities. If you are using this approach, you do not need to do anything for this step. However, you may want to review the URIs that the tool suggests for your data and double-check any that have conflicting information. You should also do additional look-ups to find URIs for entities that our tools did not match, as these automated tools do not search through all authority files. Refer to the documentation for the tool you are using to confirm which authority files it connects to.
Merge Entity-Matched Data
The research team and Transformation Team will use a custom script or the Linked Data Enhancement API to merge the new URIs with either the cleaned version of the source data or the transformed data.
The Transformation Team will mint new URIs for anything that could not be matched.
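To make the merge step concrete, here is a minimal sketch of what a custom merge script might do: replace each placeholder with its matched URI and mint a new URI for anything left unmatched. It does not reproduce the Linked Data Enhancement API or the scheme LINCS uses for minting. The `temp_id` column, the `MINT_BASE` namespace, and the use of UUIDs are assumptions for the example.

```python
import uuid

MINT_BASE = "http://example.org/minted/"  # hypothetical namespace, not a LINCS one


def merge_uris(rows, matches):
    """Fill in the uri for each row, minting one when no match was found."""
    merged = []
    for row in rows:
        uri = matches.get(row["temp_id"])
        if not uri:
            # Derive the minted URI from the placeholder so reruns are stable.
            uri = MINT_BASE + str(uuid.uuid5(uuid.NAMESPACE_URL, row["temp_id"]))
        merged.append({**row, "uri": uri})
    return merged


rows = [
    {"temp_id": "temp:person/1", "name": "Jane Doe"},
    {"temp_id": "temp:person/2", "name": "John Roe"},
]
matches = {"temp:person/1": "http://example.org/authority/person/1"}
for row in merge_uris(rows, matches):
    print(row["temp_id"], "->", row["uri"])
```

Deriving the minted URI deterministically from the placeholder means the script can be rerun as more matches are confirmed without changing the URIs already assigned to unmatched entities.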
Choose Vocabularies
As with matching entities, you also need to choose vocabulary terms to use in your data and include their URIs. These vocabulary terms are often used to add fine-grained types to entities and relationships, compared to the broad types that CIDOC CRM uses. Choosing the appropriate vocabulary terms for your project will require you to explore the terms’ definitions to find ones that fit. When possible, prioritize terms that are already frequently used within LINCS data to increase the connections, and the potential for interesting queries, between your data and other LINCS data.
See our Vocabularies documentation for additional background and the Vocabulary Browser to find vocabulary terms created by or used in LINCS.