LINCS will convert existing data into Linked Open Data (LOD) by extracting linked data entities and relationships from heterogeneous datasets. The Conversion component involves both significant tool adaptation and data conversion.
The processes involved are:
- Detect entities at the heart of human history and culture.
- Disambiguate or link entities to one or more records of those entities stored in reference knowledge graphs from the LOD cloud.
- Detect relationships between entities within the source materials through natural language processing and machine learning.
- Validate the results to ensure sufficient precision.
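The four steps above can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not the LINCS implementation: the `REFERENCE_KG` lookup table, the `example.org` IRIs, the `co-founded` pattern, and the fixed confidence score are all hypothetical. They stand in for the reference knowledge graphs and the trained NLP and machine learning models the project describes.

```python
import re
from dataclasses import dataclass

# Hypothetical reference knowledge graph: surface form -> candidate IRIs.
# The example.org IRIs are placeholders, not real LOD identifiers.
REFERENCE_KG = {
    "Virginia Woolf": ["http://example.org/kg/person/woolf"],
    "Hogarth Press": ["http://example.org/kg/org/hogarth-press"],
}


@dataclass
class Relation:
    subject: str
    predicate: str
    obj: str
    confidence: float


def detect_entities(text):
    """Step 1: find known surface forms in the text (toy gazetteer lookup)."""
    return [name for name in REFERENCE_KG if name in text]


def link_entity(name):
    """Step 2: link a mention to one or more reference records."""
    return REFERENCE_KG.get(name, [])


def detect_relations(text, entities):
    """Step 3: a fixed pattern standing in for NLP/ML relation extraction."""
    relations = []
    for subj in entities:
        for obj in entities:
            if subj == obj:
                continue
            pattern = (
                re.escape(subj) + r"\s+co-founded\s+(?:the\s+)?" + re.escape(obj)
            )
            if re.search(pattern, text):
                # 0.9 is an assumed score; a real model would supply its own.
                relations.append(Relation(subj, "coFounded", obj, 0.9))
    return relations


def validate(relations, threshold=0.8):
    """Step 4: keep only relations at or above the confidence threshold."""
    return [r for r in relations if r.confidence >= threshold]


text = "Virginia Woolf co-founded the Hogarth Press in 1917."
entities = detect_entities(text)
links = {name: link_entity(name) for name in entities}
accepted = validate(detect_relations(text, entities))
```

Raising `threshold` trades recall for precision: fewer relations survive, but those that do are less likely to be false positives.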
Existing tools and processes will be adapted to convert datasets to RDF triples, supporting the most common formats used by the community: relational data, plain text, and semi-structured text. All LINCS tools will build on award-winning algorithms developed at Alberta that perform named entity recognition (NER) and named entity disambiguation (NED) with respect to a known knowledge graph, incorporate hand-tagged datasets as training data for the models, and provide an interface for tuning parameters. This component will be generic and modular, and will work with several open-source algorithms for entity detection.
To quickly mobilize a large set of relevant data, LINCS prioritizes dataset conversion, starting with the most LOD-ready content within the first six months. Core researcher datasets will receive the fullest processing, including human vetting. Other datasets will be processed automatically, with confidence levels set for high precision to minimize false positives; the resulting millions of triples will make these materials immediately accessible and ready for vetting as researchers engage with them. Conversion takes two forms: translation of a relational dataset into RDF, pointing back to the source dataset where possible; and extraction of LOD from a dataset composed of natural or semi-structured language, which will always point back to the source. In both cases, LINCS data-conversion tools will track the provenance of the data for scholarly purposes.
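The first type of conversion, from a relational record to RDF with a pointer back to its source, might look roughly like the sketch below. It is a minimal illustration, not the LINCS tooling: the `BASE` namespace, the IRI layout, and the column-to-predicate mapping are assumptions. The `prov:wasDerivedFrom` and `rdfs:label` predicates come from the W3C PROV-O and RDF Schema vocabularies, and the output is written as N-Triples strings.

```python
PROV_DERIVED = "http://www.w3.org/ns/prov#wasDerivedFrom"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BASE = "http://example.org/"  # placeholder namespace (assumption)


def literal(value):
    """Render a value as a plain N-Triples string literal (minimal escaping)."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def row_to_triples(table, row, key="id"):
    """Convert one relational row into N-Triples lines.

    The first triple records provenance by linking the new entity back to
    the source row; the rest map each non-empty column to a predicate.
    """
    subject = f"<{BASE}entity/{table}/{row[key]}>"
    source = f"<{BASE}source/{table}/row/{row[key]}>"
    triples = [f"{subject} <{PROV_DERIVED}> {source} ."]
    for column, value in row.items():
        if column == key or value is None:
            continue  # skip the primary key and empty cells
        if column == "name":
            predicate = RDFS_LABEL
        else:
            predicate = f"{BASE}property/{column}"
        triples.append(f"{subject} <{predicate}> {literal(value)} .")
    return triples


# Hypothetical row from a "person" table.
row = {"id": 7, "name": "Example Author", "birth_year": 1882, "death_year": None}
triples = row_to_triples("person", row)
```

Attaching provenance to the entity keeps the example short. Recording it per statement would need named graphs or reification, which this sketch does not attempt.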