
Export Data

Introduction

Before converting your data, you need to prepare a version of it that is easy to share and work with. The idea is to export all of the data that you want to convert from your project’s data store and save it in a format suited to your next conversion step. Because this step is customized for each project and depends on your unique data store, LINCS can only guide you on what the outcome should look like.

Resources Needed

If your data needs to be extracted from a data store like a relational database, you will likely need support from your database administrator. Support from a team member with basic programming knowledge (e.g., undergraduate Python experience) is useful for restructuring your data.

This step could range from a one-hour task of downloading text files and moving them to a shared location to a multi-day task of exporting complex relational data into XML files that group the relevant information.

Research Team, Ontology Team, Conversion Team, Storage Team

  • Identify your Source Data
  • Send a Representative Sample
  • Export your Full Dataset
  • Send your Full Dataset

Identify your Source Data

The research team identifies which dataset or part of a dataset they hope to convert. You will need to work with your technical team to answer:

  • Where is the data stored?
  • How can you access it?
  • Whose help do you need to access it? To export it? To restructure it?
  • Can you make changes directly to it during the cleaning and reconciliation steps?
  • Will you convert all of your data? Or only some fields?

Many projects make their data available to view on a website. When we talk about exporting data, we mean finding out where the data is actually stored: where is the website getting the data from? You will need to go straight to that source, or find out whether an API or a data download link can handle the export for you.

Choose your Source Data

LINCS emphasizes finding your source data because that will be the copy of your data that we use as the starting point for your conversion workflow. For many of the workflows, we develop code and mappings that rely on the structure of your source data remaining constant—even if the contents change through cleaning and reconciliation as we go.

Think of your source data as the place where you will go to make changes if you find data errors during the conversion process. Would you go back to the database, edit there, re-export, then re-run the LINCS conversion code? This would mean you have improved the version of your data that you are likely to use for other purposes. Or would you rather export your data into an easy-to-use format like a spreadsheet, clean your data in the sheet, run the LINCS conversion code, and leave changes to your true source data for a future project?

An important consideration is whether you intend to continue converting more data to add to LINCS after the initial conversion. If yes, it is better to spend time at the start making sure the export and cleaning steps are easily repeatable without needing to waste time duplicating manual work like creating a spreadsheet with special formatting. If not, and this is a one-time conversion, it is safe to prioritize speed.

If you plan to contribute your data to LINCS, your research team, the Ontology Team, and the Conversion Team will meet to discuss the structure and contents of the identified dataset. We will discuss the overarching research goals of the research team so we can create linked open data (LOD) that is useful and meaningful.

It is helpful for the research team to come to this meeting prepared with some research questions they are hoping to answer.

note

A common scenario in the structured data conversion workflow is a project whose relational database is the true initial source of its data. Such a project has a few options for what to count as the source of data when working with LINCS:

  1. Treat that relational database as the source. While we are converting data, if errors need to be fixed in the data, the research team makes changes directly to their database. Their database administrator then creates a copy or a data dump of that database, which includes all of the data and the schema that tells us how the data is organized. LINCS’s conversion code takes that data dump as the input or starting point.
  2. If there is an API that allows us to request data from their database, then they still treat the database as the source like in the previous option. But we do not need a database dump to be regenerated each time significant changes are made. Instead, the conversion code would consider the API as the starting point, where we can call on it from the conversion code to get up-to-date data.
  3. The research team does not have access, permission, or capacity to make changes directly to the source database. Instead, they create a one-time export of their database into a format they like working in such as a spreadsheet or XML document. Cleaning and reconciliation happen directly in that new format and the conversion code uses that new format as the starting point.
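
As a rough illustration of the one-time export in the third option, the sketch below copies one table out of a relational database into CSV text. It uses Python's built-in `sqlite3` module and an in-memory database as a stand-in for a real project database; the `person` table, its columns, and the `export_table` helper are all hypothetical, and a real export would depend on your own database system and schema.

```python
import csv
import io
import sqlite3


def export_table(conn, table, out):
    """Write every row of `table` to `out` as CSV, with a header row first."""
    # Only export tables that really exist. This also avoids pasting
    # arbitrary text into the SQL statement below.
    known = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in known:
        raise ValueError(f"unknown table: {table}")

    cursor = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out)
    # cursor.description holds one tuple per column; item 0 is the name.
    writer.writerow(col[0] for col in cursor.description)
    writer.writerows(cursor)
    return out


# Hypothetical stand-in for a project database (assumption, not LINCS data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, name TEXT, birth_year INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(1, "Ada Lovelace", 1815), (2, "Mary Shelley", 1797)],
)

buffer = export_table(conn, "person", io.StringIO())
print(buffer.getvalue())
```

Keeping the export in a small script like this, instead of copying rows by hand, is one way to make the step repeatable if more data will be converted later.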

Send a Representative Sample

When you start working with LINCS, send us a representative data sample so that we can help create a conceptual mapping.

A representative sample:

  • Must include all of the fields you want to be present in your converted data. For example, a spreadsheet sample has a column for every category of data.
  • Must not include fields that should remain private to your institution or that will not be useful in LOD form. Examples include internal database IDs, institutional information about object purchasing, or personal information about still-living persons.
  • Must include blank fields or placeholders for data you will add before conversion is complete. If adding the fields is not possible, communicate to LINCS the changes you intend to make.
  • Should not include blank fields or placeholders for data you will not have time to add in the near future. Those fields are usually best left for a second round of conversion once you are comfortable with the process.

info

LINCS will request a representative sample in our first meeting. Come prepared and kick-start the conversion process.
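
The criteria for a representative sample can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names, the `PRIVATE_FIELDS` set, and the `build_sample` helper are hypothetical, and taking the first few rows is a simplification, since a truly representative sample should be chosen to cover the variety in your data.

```python
import csv
import io
from itertools import islice

# Hypothetical fields that should stay private to the institution.
PRIVATE_FIELDS = {"internal_id", "purchase_price"}


def build_sample(reader, size):
    """Return up to `size` rows with the private fields dropped.

    Blank values are kept so they can act as placeholders for data
    that will be added before conversion is complete.
    """
    return [
        {key: value for key, value in row.items() if key not in PRIVATE_FIELDS}
        for row in islice(reader, size)
    ]


# Hypothetical export, inlined here so the example is self-contained.
export = io.StringIO(
    "internal_id,title,creator,date,purchase_price\n"
    "17,Letter to a friend,Jane Doe,,120\n"
    "18,Travel diary,,1843,75\n"
)

sample = build_sample(csv.DictReader(export), size=2)
print(sample[0])
```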

Export your Full Dataset

The expected output depends on the structure of your data and the conversion tools you plan to use. It is helpful to review the rest of the conversion workflow steps before completing this step, particularly the Implement Conceptual Mapping step.

If you plan to use 3M for conversion, as LINCS typically does for structured data, then export the data and transform it into XML document(s) following the suggestions in Preparing Data for 3M.
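
To give a sense of what transforming exported rows into XML can look like, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The element names (`records`, `record`, and the field names) and the `rows_to_xml` helper are assumptions made for illustration; they are not the structure recommended in Preparing Data for 3M, which you should follow for a real export.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows, root_tag="records", record_tag="record"):
    """Group each row's fields under one record element."""
    root = ET.Element(root_tag)
    for row in rows:
        record = ET.SubElement(root, record_tag)
        for field, value in row.items():
            # Assumption: blank values are skipped instead of being
            # written as empty elements. Field names must also be
            # valid XML element names.
            if value in (None, ""):
                continue
            ET.SubElement(record, field).text = str(value)
    return root


# Hypothetical rows, for example as read from a spreadsheet export.
rows = [
    {"title": "Letter to a friend", "creator": "Jane Doe", "date": ""},
    {"title": "Travel diary", "creator": "", "date": "1843"},
]

root = rows_to_xml(rows)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Whether blank fields are skipped or kept as empty elements is a choice to settle with whoever writes the mapping, since the mapping relies on the structure staying constant.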

Otherwise, see the Implement Conceptual Mapping page for help deciding whether 3M or a custom solution is right for your data.

Send your Full Dataset

As we get further into the conversion process, if LINCS is completing any of the conversion steps with you, we will need a copy of your full dataset before work can continue.

If the data is publicly available on a website or through an API, then please share links and documentation. This documentation helps us understand the meaning you intend the data to convey.