Export Data
Introduction
Before converting your data, you need to prepare a version of it that is easy to share and work with. The idea is to export all of the data that you want to convert from your project’s data store and save it in a format suited for your next conversion step. Because this is a custom step for each project, relying on your unique data store, LINCS can only guide you on what the outcome should look like.
Resources Needed
If your data needs to be extracted from a data store like a relational database, you will likely need support from your database administrator. Support from a team member with basic programming knowledge (e.g., undergraduate Python experience) is useful for restructuring your data.
| | Research Team | Ontology Team | Conversion Team | Storage Team |
|---|---|---|---|---|
| Identify Source Data | ✓ | ✓ | | |
| Extract Source Data | ✓ | | | |
| Send Source Data | ✓ | | | |
This step could range from a one-hour task of downloading text files and moving them to a shared location to a task of a few days spent exporting complex relational data into XML files that group the relevant information.
Identify Source Data
The Research Team identifies which dataset or part of a dataset they hope to convert. You will need to work with your technical team to answer:
- Where is the data stored?
- How can you access it?
- Whose help do you need to access it? To export it? To restructure it?
Many projects have their data available to view on a website. When we talk about exporting data, we mean that you need to find out where the data is actually stored: where is the website getting its data from? You will need to go straight to that source, or check whether there is an API or a data download link that can handle the export for you.
If you plan to contribute your data to LINCS, your Research Team, the Ontology Team, and the Conversion Team will meet to discuss the structure and contents of the identified dataset. We will discuss the overarching research goals of the Research Team so LINCS can create Linked Open Data (LOD) that is useful and meaningful.
It is helpful for the Research Team to come to this meeting prepared with some research questions they are hoping to answer.
Expected Outputs
The expected output depends on the structure of your data and the conversion tools you plan to use. It is helpful to review the rest of the conversion workflow steps before completing this step, particularly the Implement Conceptual Mapping step.
The guidance that follows is organized by data type, in this order:
- Structured Data
- Semi-Structured Data
- TEI Data
- Natural Language Data
Structured Data
If you plan to use 3M for conversion, as LINCS typically does for structured data, then export the data and transform it into one or more XML documents following the suggestions in Preparing Data for 3M.
Otherwise, you can:
- Set up a custom data export, in the programming language of your choice, that outputs your data as LINCS-compliant Resource Description Framework (RDF), as defined by your Develop Conceptual Mapping step.
- Find a tool online that suits your data and export your data into the format the tool expects as input.
See the Implement Conceptual Mapping page for help deciding if 3M or a custom solution is right for your data.
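As a rough illustration of the 3M route for structured data, the sketch below exports rows from a relational database into XML that groups each record with its related information. Everything specific here is an assumption made for illustration: the SQLite tables `person` and `work`, their columns, and the element names are hypothetical. The layout that 3M actually expects is the one described in Preparing Data for 3M, so treat this only as a starting point.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_sample_database() -> sqlite3.Connection:
    """Create a tiny in-memory database (hypothetical schema, for demonstration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE work (person_id INTEGER, title TEXT, year INTEGER)")
    conn.executemany(
        "INSERT INTO person VALUES (?, ?)",
        [(1, "Ada Example"), (2, "Ben Sample")],
    )
    conn.executemany(
        "INSERT INTO work VALUES (?, ?, ?)",
        [(1, "Collected Letters", 1902), (1, "First Poems", 1890)],
    )
    return conn


def export_people_to_xml(conn: sqlite3.Connection) -> ET.Element:
    """Build one <person> element per row, nesting that person's works inside it."""
    root = ET.Element("people")
    people = conn.execute("SELECT id, name FROM person ORDER BY id").fetchall()
    for person_id, name in people:
        person = ET.SubElement(root, "person", id=str(person_id))
        ET.SubElement(person, "name").text = name
        works = conn.execute(
            "SELECT title, year FROM work WHERE person_id = ? ORDER BY year",
            (person_id,),
        )
        for title, year in works:
            work = ET.SubElement(person, "work")
            ET.SubElement(work, "title").text = title
            ET.SubElement(work, "year").text = str(year)
    return root


if __name__ == "__main__":
    xml_root = export_people_to_xml(build_sample_database())
    print(ET.tostring(xml_root, encoding="unicode"))
```

Grouping related rows under their parent record at export time means each record can be read on its own, without joining tables again later.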
Semi-Structured Data
Typically, it is easiest to export data into individual XML files that can be worked on one at a time. However, if you have short documents with highly connected content, you may prefer to combine them into a single XML file.
Be sure to give each file a unique and meaningful name, such as a unique document identifier.
Each file will have the extension .xml and, while it is fine for the XML to contain tags unique to your project, it needs to be valid, well-formed XML. You can check this using an online XML validator.
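If you would rather check your exported XML files locally than paste them into an online validator, a minimal sketch using Python's standard library follows. The function names are hypothetical, and the check only confirms that each file is well-formed (it parses without errors); it does not validate against a schema.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict, Optional


def check_xml_text(xml_text: str) -> Optional[str]:
    """Return None if the text is well-formed XML, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


def find_malformed_files(folder: str) -> Dict[str, str]:
    """Map the name of each .xml file in the folder that fails to parse to its error message."""
    problems = {}
    for path in sorted(Path(folder).glob("*.xml")):
        try:
            ET.parse(str(path))
        except ET.ParseError as err:
            problems[path.name] = str(err)
    return problems
```

Calling `find_malformed_files("export")` on a hypothetical `export` folder returns an empty dictionary when every file parses cleanly.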
TEI Data
Export the data so you end up with an individual file for each TEI document. Be sure to give each file a unique and meaningful name, such as a unique document identifier.
Each file will have a .xml extension.
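One way to arrive at one file per TEI document is sketched below, under the assumption that your documents currently sit together in a single `teiCorpus` element and that each `TEI` element carries an `xml:id` you can use as the file name. Your data may be stored quite differently, and the function name `split_tei_corpus` is hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Serialize TEI elements with a default namespace instead of an "ns0:" prefix.
ET.register_namespace("", TEI_NS)


def split_tei_corpus(corpus_xml: str, out_dir: str) -> List[str]:
    """Write each <TEI> child of a <teiCorpus> to its own file, named after its xml:id."""
    corpus = ET.fromstring(corpus_xml)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for tei in corpus.findall(f"{{{TEI_NS}}}TEI"):
        doc_id = tei.get(XML_ID)
        if not doc_id:
            raise ValueError("Every <TEI> element needs an xml:id to name its file")
        file_name = f"{doc_id}.xml"
        ET.ElementTree(tei).write(
            str(out / file_name), encoding="utf-8", xml_declaration=True
        )
        written.append(file_name)
    return written
```

Naming files after an identifier that already exists in the data keeps the link between each exported file and its source record.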
Natural Language Data
Export the data as cleanly as possible into plain text .txt files. Be sure to give each file a unique and meaningful name, such as a unique document identifier. If you extracted text that was embedded within documents, you will likely want to keep track of where each excerpt came from, for example by saving the offset of the text in the original document.
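To illustrate keeping track of where excerpts came from, here is a minimal sketch that writes each excerpt to its own .txt file and records its character offsets in a manifest. The file naming pattern, the `manifest.json` file, and the function name are all assumptions for illustration, and the code assumes the excerpts are listed in the order they appear in the source text.

```python
import json
from pathlib import Path
from typing import Dict, List


def save_excerpts(
    source_id: str, source_text: str, excerpts: List[str], out_dir: str
) -> List[Dict]:
    """Write each excerpt to a .txt file and record its start/end offsets in the source."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    manifest = []
    search_from = 0  # excerpts are assumed to appear in document order
    for number, excerpt in enumerate(excerpts, start=1):
        start = source_text.find(excerpt, search_from)
        if start == -1:
            raise ValueError(f"Excerpt {number} was not found in {source_id}")
        end = start + len(excerpt)
        file_name = f"{source_id}_excerpt{number:03d}.txt"
        (out / file_name).write_text(excerpt, encoding="utf-8")
        manifest.append(
            {"file": file_name, "source": source_id, "start": start, "end": end}
        )
        search_from = end
    (out / "manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8"
    )
    return manifest
```

With the offsets saved, `source_text[start:end]` recovers each excerpt from the original document.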
Images or PDFs that contain text need to be processed (e.g., through OCR) by the Research Team so we have well-formatted, clean, typically English, plain text documents. Similarly, formats like Microsoft Word documents should be saved as plain text, as the formatting embedded in such file types is not useful here.
Depending on the conversion tools you choose, you may find it best to save long texts in a single file, or to split them into smaller files by chapters, sections, paragraphs, or sentences. Read through the rest of the conversion workflow steps to get an idea of what will best suit your data and remember you can experiment with multiple options using a sample of your data.
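If you want to experiment with splitting a long text into smaller units, the sketch below splits on blank lines to get paragraphs. This is only one simple convention and an assumption about how your text marks paragraph breaks; chapters, sections, or sentences would need their own rules.

```python
import re
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split text into paragraphs wherever one or more blank lines occur."""
    pieces = re.split(r"\n\s*\n", text)
    # Drop surrounding whitespace and skip any empty pieces.
    return [piece.strip() for piece in pieces if piece.strip()]
```

Running this on a small sample of your data first makes it easy to compare the single-file and split-file options before committing to one.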
Send Source Data to LINCS
For LINCS to provide guidance as you convert your data, we need to see the data.
If the data is publicly available on a website or through an API, then please share links and documentation. This documentation helps us see the meaning that you are showing through the data.
We also need a copy or sample of the exported data that you prepared based on this page so that we can understand your technical needs.