
Match Entities

Introduction

Entity matching, also called reconciliation, entity linking, and named entity disambiguation, is the step where we add unique identifiers in the form of URIs to your data to represent each unique entity. The goal is to use the same identifier every time the same real-world thing is mentioned in your data, in other LINCS data, and, ideally, in linked data elsewhere on the web. By using the same identifier, we can connect all of the statements made about that entity to create a rich and informative graph about it.

This page covers how entity matching fits into the data transformation workflows and the available tools, while our Entity Matching Guide gets into the details of how to match entities.

For a list of all entity matching tools that LINCS uses, see the tool catalogue.

Resources Needed

Because this process is so time-consuming, we recommend starting to match entities in your data as early as possible. You can follow our Entity Matching Guide to start matching entities even before you have committed to the rest of the LINCS transformation process.

Entity matching can be completed in tandem with the other transformation steps. Use placeholder values in the transformation until your research team has finished matching entities. Once URIs have been found, they can be added to either the source data or to the transformed data to replace the placeholder values. The dataset, however, will not be published by LINCS until either the research team finishes their entity matching or LINCS and the research team come to an agreement that no more entity matching can take place, and new URIs need to be minted for the remaining entities. Note that once the data is published, the research team can continue to enhance it, including further entity matching.

note

Entity matching will always be completed by your research team because it requires domain knowledge to ensure you are choosing the correct identifiers. This is a great task for undergraduate or graduate research assistants.

The Transformation Team can offer guidance on this step, particularly on how to set up your data for entity matching and how to merge the URIs back into your data.

Time Commitment

Entity matching tends to be the slowest part of the transformation process. It can be sped up with tools that perform automated linking, but this comes at the cost of accuracy. The loss in accuracy is greater for the type of data coming into LINCS because it references more obscure or historically overlooked entities that are not well represented in existing linked open data (LOD) sources.

LINCS’s approach is to mix automation with manual review:

  1. Start with tools that automatically suggest candidate matches for entities
  2. If possible, apply filtering based on the context for each entity in your data and the authority data to separate trustworthy suggestions from ones that need review
  3. Have students manually review the uncertain candidates
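
The first two steps can be sketched roughly as follows. This is only an illustration, not a LINCS tool: the `Candidate` record, the `triage` function, and the rule that a suggestion is trustworthy only when both the name and the birth year agree with your data are all assumptions made for this example. Real filtering rules will depend on your data and on the authority source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """A hypothetical candidate match suggested by an automated tool."""
    uri: str
    name: str
    birth_year: Optional[int]


def triage(entity_name, entity_birth_year, candidates):
    """Split candidates into trusted matches and ones needing manual review.

    Assumed rule: a candidate is trusted only when the names match
    (ignoring case and surrounding whitespace) and both birth years
    are known and equal.
    """
    trusted, needs_review = [], []
    for cand in candidates:
        same_name = cand.name.strip().lower() == entity_name.strip().lower()
        same_year = (
            entity_birth_year is not None
            and cand.birth_year is not None
            and cand.birth_year == entity_birth_year
        )
        if same_name and same_year:
            trusted.append(cand)
        else:
            needs_review.append(cand)
    return trusted, needs_review
```

Anything that lands in `needs_review` would then go to a person for step 3.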

With that mixed approach, you can estimate the time needed by assuming that each entity in your data will take a human 1-5 minutes to match. Where an entity falls in this range depends on how familiar the person is with the data and whether they need to spend time researching the entity to confirm a match.
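
As a rough worked example of that estimate, the sketch below converts an entity count into a range of review hours. The function name and the count of 3,000 entities are hypothetical; only the 1-5 minute range comes from the guidance above.

```python
def estimate_review_hours(entity_count, min_minutes=1, max_minutes=5):
    """Return (low, high) estimated hours of manual review."""
    low = entity_count * min_minutes / 60
    high = entity_count * max_minutes / 60
    return low, high


low, high = estimate_review_hours(3000)
print(f"{low:.0f} to {high:.0f} hours")  # prints "50 to 250 hours"
```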

For large datasets, it is not always feasible to carefully match all entities. Our strategy for this has been:

  1. Match as many entities as is feasible
  2. When the data is otherwise ready for publication but entity matching is incomplete, discuss with LINCS when to stop matching and mint URIs for the remaining unmatched entities
  3. Once the data is published, you can slowly continue to add URIs for the unmatched entities in ResearchSpace Review

Set Up your Data

For each workflow there are typically two options for setting up your data:

  • Use a tool that takes your data in its original format and allows you to add URIs to the source data. For example, LEAF Writer lets you annotate XML and TEI data with entity types and URIs.
    • In this case, you should not need to do any setup beyond the typical Clean Data step.
  • Use a script or query tool to pull entities out of your data, along with contextual information about those entities. Then use a tool such as VERSD or OpenRefine to find URIs for the entities. Finally, use another script or query tool to insert those URIs back into either the source data or the transformed data.
    • This case will typically result in one or more spreadsheets where each row represents one entity and the columns contain contextual details about the entity. For example, you may have one internal unique identifier per row to represent a person, and then columns for their name, birth date, and death date so that you can quickly check if candidate URIs are correct.
    • LINCS typically uses custom scripts to complete this step. The Transformation Team can offer advice and sample scripts from previous transformations.
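
A minimal sketch of the second option's extraction step is shown below. It assumes the entities have already been pulled out of the source data as a list of dictionaries; the function name, the field names, and the empty `uri` column are choices made for this illustration and do not reflect the scripts LINCS actually uses.

```python
import csv
import io


def entities_to_csv(people, out):
    """Write one row per person, with context columns for checking candidate URIs."""
    fieldnames = ["internal_id", "name", "birth_date", "death_date", "uri"]
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for person in people:
        writer.writerow({
            "internal_id": person["internal_id"],
            "name": person["name"],
            "birth_date": person.get("birth_date", ""),
            "death_date": person.get("death_date", ""),
            "uri": "",  # left empty here; filled in during matching
        })


# Example: write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
entities_to_csv(
    [{"internal_id": "temp:1", "name": "Jane Doe", "birth_date": "1850"}],
    buffer,
)
```

The resulting spreadsheet could then be loaded into a matching tool, with the `internal_id` column kept so that the URIs can be merged back later.
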
warning

Consider enhancing your source data with internal unique identifiers for each entity. These can be temporary identifiers that will be replaced before your data is published. The benefit is that after you extract entities from your data and match them, the identifiers let you easily put each new URI in the correct location in the source or transformed data.
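
One simple way to add such identifiers is sketched below. It assumes each record already represents a single distinct entity; the `temp:person:` prefix and the `internal_id` field name are made up for this example.

```python
def add_temporary_ids(records, prefix="temp:person:"):
    """Return copies of the records with a sequential temporary identifier added."""
    return [
        {**record, "internal_id": f"{prefix}{index:04d}"}
        for index, record in enumerate(records, start=1)
    ]
```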

Match your Entities

For structured data, extract entities and their context from your source data. This process may require a custom script, but often it will be as easy as using a simplified version of your source spreadsheets or using an online tool to transform structured data into a spreadsheet. To find and confirm URIs, use VERSD if your data is bibliographic and OpenRefine otherwise. Note that OpenRefine accepts a broad range of starting file types so you may be able to skip the initial extraction step.

Particularly for small datasets, you may find it sufficient to manually look up URIs and add them directly to your source data, or to wait and add them to the transformed data.

Merge Entity-Matched Data

The research team and Transformation Team will use a custom script or the Linked Data Enhancement API to merge the new URIs with either the cleaned version of the source data or the transformed data.

The Transformation Team will mint new URIs for anything that could not be matched.
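
The sketch below illustrates what a custom merge script might do; it does not use or represent the Linked Data Enhancement API. It assumes each record carries an `internal_id`, and that matching produced a mapping from those identifiers to URIs. The `https://example.org/entity/` namespace and the way unmatched identifiers are turned into URIs are placeholders only; actual minting is done by the Transformation Team.

```python
def merge_uris(records, matches, mint_base="https://example.org/entity/"):
    """Attach a URI to every record, keyed by its internal identifier.

    records: list of dicts, each with an "internal_id" key.
    matches: dict mapping internal_id -> matched URI.
    Unmatched records receive a stand-in URI under mint_base, which is
    a made-up namespace for this example.
    """
    merged = []
    for record in records:
        internal_id = record["internal_id"]
        uri = matches.get(internal_id)
        if not uri:
            # No match found: build a stand-in for a newly minted URI.
            uri = mint_base + internal_id.replace(":", "-")
        merged.append({**record, "uri": uri})
    return merged
```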

Choose Vocabularies

As with matching entities, you also need to choose vocabulary terms to use in your data and include their URIs. These vocabulary terms are often used to add fine-grained types to entities and relationships, in contrast to the broad types that CIDOC CRM uses. Choosing the appropriate vocabulary terms for your project will require you to explore the terms’ definitions to find ones that fit. When possible, prioritize terms that are already frequently used within LINCS data to increase the connections and potential for interesting queries between your data and other LINCS data.

See our Vocabularies documentation for additional background and the Vocabulary Browser to find vocabulary terms created by or used in LINCS.