Get Started with Conversion Workflows
Introduction
LINCS has developed a series of conversion workflows to cover the most common starting points for creating Linked Open Data (LOD).
The general information about contributing data to LINCS, along with the initial steps of expressing interest and completing the data intake interview, applies to all workflows; see the Publish Data with LINCS and Learn about Contributing pages.
The Four Conversion Workflows
Browse through the following four tabs for an overview of each workflow and to understand how to categorize your data. The rest of the pages in this conversion workflow documentation cover each individual conversion step in order. Each step contains these same four tabs so that you can tailor the instructions to your data.
- Structured Data
- Semi-Structured Data
- TEI Data
- Natural Language Data
Structured Data
We count data as structured if:
- The entities are all tagged individually (e.g., one entity per spreadsheet cell or XML element)
- And the entities are connected, either:
  - In a hierarchical way (e.g., nested XML elements)
  - With relationships between entities expressed following a clearly defined schema and data structure (e.g., spreadsheet headings relating columns of entities together)
Data Example
Here are data samples from two projects published with LINCS that began as structured data.
The Canadian Centre for Ethnomusicology data started as several spreadsheets with a row for each artifact.
| ID | Title | placeMade | placeMadeID | material | materialID |
| --- | --- | --- | --- | --- | --- |
| CCEA-L1995.63 | Bamboo Flute | Edmonton | https://sws.geonames.org/5946768 | bamboo | http://www.wikidata.org/entity/Q27891820 |
| CCEA1995.21 | Pair of Taiko Drums | Shinano | https://sws.geonames.org/1852136 | hide | http://www.wikidata.org/entity/Q3291230 |
The University of Saskatchewan Art Collection data began as an XML file with a parent element for each art object.
<?xml version="1.0" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description>
    <ObjectIdentifier>1910.001.001</ObjectIdentifier>
    <AcquistionDate>1910</AcquistionDate>
    <ObjectTitle>Portrait of Thomas Copland</ObjectTitle>
    <ArtistName url="http://www.wikidata.org/entity/Q100921439">Victor Albert Long</ArtistName>
    <Medium url="http://vocab.getty.edu/aat/300015050">oil paint</Medium>
    <Category url="http://vocab.getty.edu/aat/300033618">painting</Category>
  </rdf:Description>
  <rdf:Description>
    <ObjectIdentifier>2018.026.001</ObjectIdentifier>
    <AcquistionDate>2018</AcquistionDate>
    <ObjectTitle>Grace</ObjectTitle>
    <ArtistName url="http://www.wikidata.org/entity/Q19609740">Lori Blondeau</ArtistName>
    <Medium url="http://vocab.getty.edu/aat/300265621">inkjet print</Medium>
    <Category url="http://vocab.getty.edu/aat/300046300">Photograph</Category>
  </rdf:Description>
</rdf:RDF>
Workflow Overview
This workflow is the most customizable and allows for the most curation because the entities and relationships are clearly defined in the source data. We typically create a custom conceptual mapping for each dataset, reusing past mappings where possible, and convert the data using the 3M mapping tool.
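To give a sense of what this conversion produces, here is a minimal sketch of the kind of triples a single spreadsheet row from the CCE example above might become. It is an illustration only: the actual conversion is done with a custom 3M mapping, and the CIDOC CRM classes, properties, and base URI used below are simplified assumptions rather than the project's actual mapping.

```python
# Illustrative only: a single CCE spreadsheet row expressed as RDF triples
# with rdflib. The real workflow uses a custom 3M mapping; the CIDOC CRM
# classes/properties and the base URI here are simplified assumptions.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
BASE = Namespace("http://example.org/cce/")  # hypothetical base URI

row = {
    "ID": "CCEA-L1995.63",
    "Title": "Bamboo Flute",
    "placeMadeID": "https://sws.geonames.org/5946768",
    "materialID": "http://www.wikidata.org/entity/Q27891820",
}

g = Graph()
artifact = BASE[row["ID"]]
g.add((artifact, RDF.type, CRM["E22_Human-Made_Object"]))
g.add((artifact, RDFS.label, Literal(row["Title"])))
g.add((artifact, CRM["P45_consists_of"], URIRef(row["materialID"])))

# The place of making attaches to a production event rather than to the object.
production = BASE[row["ID"] + "_production"]
g.add((artifact, CRM["P108i_was_produced_by"], production))
g.add((production, RDF.type, CRM["E12_Production"]))
g.add((production, CRM["P7_took_place_at"], URIRef(row["placeMadeID"])))

print(g.serialize(format="turtle"))
```

In the actual workflow, the equivalent mapping rules are defined in 3M rather than written as code.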
Semi-Structured Data
Data Example
Here is a simplified excerpt from the Orlando Project data, which started its conversion process as hand-annotated XML documents.
<DATE>By March 1643</DATE>, early in this year of fierce <TOPIC>Civil War</TOPIC> fighting, <NAME>Dorothy Osborne</NAME>’s mother moved with her children from <PLACE>Chicksands</PLACE> to the fortified port of <PLACE>St Malo</PLACE>.
Workflow Overview
This workflow is still being created. Check back as we release additional details and access to the tools discussed.
This workflow requires custom work by the Research Team. It uses LEAF-Writer with the LINCS API behind the scenes to help you identify entities and relationships in your XML documents in a semi-automated way.
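As a rough illustration of what that entity identification works with, here is a minimal sketch, not the LEAF-Writer or LINCS API pipeline, that pulls the tagged entities out of a hand-annotated fragment like the Orlando example above.

```python
# Illustrative only: list the entities already tagged in a hand-annotated
# fragment like the Orlando excerpt. The actual workflow uses LEAF-Writer
# with the LINCS API rather than a script like this.
import xml.etree.ElementTree as ET

fragment = (
    "<p><DATE>By March 1643</DATE>, early in this year of fierce "
    "<TOPIC>Civil War</TOPIC> fighting, <NAME>Dorothy Osborne</NAME>'s mother "
    "moved with her children from <PLACE>Chicksands</PLACE> to the fortified "
    "port of <PLACE>St Malo</PLACE>.</p>"
)

root = ET.fromstring(fragment)  # the sentence is wrapped in <p> so it parses as one element
for element in root:            # the direct children are the annotated entities
    print(element.tag, "->", element.text)
# NAME and PLACE elements become candidate entities to reconcile, while
# DATE and TOPIC elements supply context for relationships.
```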
TEI Data
Data Example
Here is a sample of TEI data.
<person xml:id="h-nr254">
  <idno type="Wikidata">http://www.wikidata.org/entity/Q28911659</idno>
  <persName>
    <name>Braderman, Joan</name>
  </persName>
  <persName type="preferred">
    <forename>Joan</forename>
    <surname>Braderman</surname>
  </persName>
  <floruit>
    <date>1977</date>
  </floruit>
  <occupation cert="high">Film maker</occupation>
  <affiliation cert="high">Heresies Collective</affiliation>
</person>
Workflow Overview
This workflow includes a simple web-based user interface, LINCS XTriples, where you can upload your TEI document and select a conversion template. The tool is designed to produce CIDOC CRM triples from TEI files that conform to the templates in LEAF-Writer. Then you will use the last steps of the structured data workflow to enhance, validate, and ingest your data into LINCS.
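To make the idea concrete, here is a minimal sketch, not LINCS XTriples itself, of reading the kind of values a conversion template would map from the person record shown above. Note that real TEI files declare the TEI namespace (http://www.tei-c.org/ns/1.0); it is omitted here to match the simplified sample.

```python
# Illustrative only: pull the values a conversion template would map to
# CIDOC CRM from the (simplified, namespace-free) TEI person record above.
import xml.etree.ElementTree as ET

tei = """
<person xml:id="h-nr254">
  <idno type="Wikidata">http://www.wikidata.org/entity/Q28911659</idno>
  <persName type="preferred">
    <forename>Joan</forename>
    <surname>Braderman</surname>
  </persName>
  <occupation cert="high">Film maker</occupation>
</person>
"""

person = ET.fromstring(tei)
wikidata_uri = person.findtext("idno[@type='Wikidata']")
preferred = person.find("persName[@type='preferred']")
name = " ".join("".join(preferred.itertext()).split())
occupation = person.findtext("occupation")

print(wikidata_uri, name, occupation, sep=" | ")
# A conversion template would turn these values into triples, for example an
# E21_Person identified by the Wikidata URI, with the name and occupation attached.
```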
Workflow Limitations
- This workflow is intended to extract a small set of possible relationships from TEI files in an automated way. If you are working with files that do not conform to the LINCS templates in LEAF-Writer, please modify and use the XSLTs linked from the LINCS XTriples documentation to transform your TEI.
Natural Language Data
Natural language data includes documents that are entirely plain text (like a written biography saved as a TXT file) and documents in other formats that have plain text embedded within them. For example, if an XML document has certain tags that always contain full sentences of text, then we can extract that text and use the natural language workflow.
Data Example
Here is a simplified excerpt from the Orlando Project data, where we pulled out natural language text from XML documents.
By March 1643, early in this year of fierce Civil War fighting, Dorothy Osborne’s mother moved with her children from Chicksands to the fortified port of St Malo.
Workflow Overview
This workflow is still being created. Check back as we release additional details and access to the tools discussed.
This workflow uses LINCS APIs to perform automated named entity recognition (NER) and relation extraction (RE) on natural language text files. These APIs are able to output triples that have relationships from a set list of options. After you verify the output, you can use a second API that will convert your triples into CIDOC CRM for final validation and ingestion into LINCS.
This level of automation is meant to be a faster, though less precise, conversion method than the structured conversion workflow or a manual treatment of natural language texts. If your Research Team has the time, you can add more manual curation, using the tools as a starting point. NERVE is a great option for this type of data when you are looking for a balance between automation and manual editing.
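The LINCS NER and RE APIs are not shown here, but the following sketch uses spaCy as a stand-in to give a sense of the kind of entity output these tools produce before you verify it and convert the results to CIDOC CRM.

```python
# Illustrative only: spaCy stands in for the LINCS NER API to show what
# named entity recognition output looks like on the Orlando sentence above.
import spacy  # pip install spacy; python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
text = (
    "By March 1643, early in this year of fierce Civil War fighting, "
    "Dorothy Osborne's mother moved with her children from Chicksands "
    "to the fortified port of St Malo."
)

doc = nlp(text)
for ent in doc.ents:
    # e.g. "March 1643" DATE, "Dorothy Osborne" PERSON, "St Malo" GPE
    print(ent.text, ent.label_)
```

After entities like these are verified, relation extraction pairs them into triples, and the second API described above converts those triples into CIDOC CRM for validation and ingestion into LINCS.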
Workflow Limitations
- This workflow is designed for English text, and we are working towards French support. Some individual tools may support other languages; if so, this is explained in that tool’s documentation.
- This workflow is best suited to documents containing factual statements about real-world people, places, and written works, such as biographies or other non-fiction descriptive text.
- Text extracted from sources like social media or poor-quality Optical Character Recognition (OCR) is unlikely to result in high-quality LOD, unless significant data cleaning can be done by the Research Team before conversion.
Interested in contributing your converted data to LINCS? Review the Contribute Data to LINCS and Learn about Contributing pages to understand if your data is a good fit for LINCS, what is required of your team, what support LINCS can offer, and how to contact LINCS.
Common Questions
Below is a list of questions that are important to consider before, during, and after the conversion process.
Before Conversion
Are the conversion workflows automated?
The projects contributing to LINCS vary in the content and format of their source data. To accommodate this diversity, LINCS has prioritized workflows made up of independent steps that can be completed in different ways, depending on the type of data and the needs of the Research Team. This means that we do not have a single automated workflow.
Each workflow contains a mix of steps that are more automated and others that require manual work. Similarly, some steps will feed into the next without any additional effort, while others may require data manipulation.
Not ready to commit to the entire process?
There are parts of the conversion process that the Research Team can begin before fully committing to converting data with LINCS. In particular, cleaning and reconciling your data will help you understand your data better and make it easier for you to work with it, even if you do not convert it to LOD and contribute it to the LINCS triplestore. For more information, see Data Cleaning and Reconciliation.
These conversion steps and associated tools can be used even if you are not contributing to LINCS. For example, you can use these steps if you have data from a domain outside of the LINCS Areas of Inquiry and plan to contribute it to another LOD project.
During Conversion
How does LINCS collaborate?
Once the conversion process has started, the Research Team will be connected with LINCS team members who will help with each step of the process. The Research Team will also need to provide LINCS with contact information for their team members so that everyone can be kept informed.
Conversion is an iterative process. The Research Team can expect to have regular meetings with LINCS team members to discuss the conversion process and work collaboratively. The more time that the Research Team can make available for these meetings and for the work required throughout the process, the faster the data will be prepared.
When will my data be public?
After your dataset is converted, it will be ingested into the LINCS triplestore as a trial. This ingestion allows the Research Team to view their dataset in ResearchSpace and look for errors. While the dataset is technically public, it is not yet published LOD. The dataset should not be used in publications at this stage.
After errors have been spotted and changes have been made, the final version of the dataset will be uploaded to the LINCS triplestore. The final dataset will now be accessible via the official version of ResearchSpace and is published LOD. This data can be used in publications and shared with others. It will also be available to others who want to use and connect to the data, except in limited, mutually agreed-upon circumstances.
After Conversion
What if I want to edit my data?
Once the dataset is published to ResearchSpace, the Research Team can make changes to the data directly in ResearchSpace. Changes made by the Research Team affect the version of the dataset that is in the LINCS triplestore, which means that the conversion workflow does not need to be repeated.
What if I want to add more data later?
If the Research Team wants to add more data after the conversion process, LINCS can rerun the data conversion without repeating the consultation process if the new data has the exact same structure as the initial data. If the new data does not have the same structure, the conversion process will need to be altered and repeated. Steps like reconciliation will always need to be redone if there are new entities in the new data. Note that the new data can then be merged with the existing project or can be made into a new, separate project.