
Clean Data

Introduction

In the data cleaning step, you ensure your data is consistent in how it expresses entities and the relationships between them. Your data needs to be consistent and clean to be mapped and converted.

Resources Needed

When should I clean my data?

Clean your data at this early stage. Doing so makes the rest of the conversion steps easier and makes your converted data more accurate and more meaningful.

info

Data cleaning is often time-consuming, so LINCS recommends that you start as soon as you can. You can follow the tips on this page and in our Data Cleaning Guide even before committing to the entire conversion process.

Typically, it is fastest to do bulk cleaning tasks on your original data because:

  • Your team is already familiar with the format
  • With structured data, you can efficiently change a whole spreadsheet column at once, and many common tools can help
  • Cleaning your original data also improves it for uses other than publishing with LINCS
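As an illustration of a bulk, column-wide change, here is a minimal Python sketch that tidies stray whitespace in every value of a chosen column. It is only one possible approach, not a LINCS-provided script: the sample data, the column names `name` and `birth_place`, and the helpers `clean_column` and `tidy_whitespace` are all hypothetical.

```python
import csv
import io


def clean_column(rows, column, fix):
    """Return new rows with `fix` applied to every value in one column."""
    return [{**row, column: fix(row[column])} for row in rows]


def tidy_whitespace(value):
    """Trim the ends and collapse runs of whitespace into single spaces."""
    return " ".join(value.split())


# Hypothetical sample data standing in for an exported spreadsheet.
SAMPLE_CSV = """name,birth_place
  Margaret  Atwood ,Ottawa
Alice Munro,  Wingham
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Apply the same fix to each column that needs it; `rows` is left unchanged.
cleaned = clean_column(rows, "name", tidy_whitespace)
cleaned = clean_column(cleaned, "birth_place", tidy_whitespace)

for row in cleaned:
    print(row)
```

Because `clean_column` returns new rows instead of editing in place, you can compare the original and cleaned values before deciding to keep a change.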

Still, you can continue with more cleaning at the Validate and Enhance Step and, once the conversion is complete, you can edit your data directly in ResearchSpace.

info

You may find it easiest to apply your data cleaning directly to the data store where it comes from and then follow the Export Data step after.

Alternatively, you may choose to export your data into a format that is easy to work with and clean that exported version.

This decision will depend on how easy it is to edit the data in the data store, and whether you want the version of the data in the original data store to continue matching the cleaned version of the data.

How long will it take?

There is a trade-off between initial cleaning time and data quality. Each research team needs to decide when to stop cleaning and move on to the next steps. You may also choose to split your full dataset into parts: clean and convert one part first, and then repeat for each remaining part as you have time.
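If you decide to work through your dataset in parts, one simple way to divide it is by the value of an existing field. The sketch below assumes a hypothetical `collection` field and made-up records; any field that divides your data into meaningful units would work the same way.

```python
def split_by_field(records, field):
    """Group records into parts keyed by the value of `field`.

    Parts appear in the order their key is first seen in the data.
    """
    parts = {}
    for record in records:
        parts.setdefault(record[field], []).append(record)
    return parts


# Hypothetical records; the "collection" field is an assumption.
records = [
    {"id": 1, "collection": "letters"},
    {"id": 2, "collection": "diaries"},
    {"id": 3, "collection": "letters"},
    {"id": 4, "collection": "photographs"},
]

parts = split_by_field(records, "collection")
for name, part in parts.items():
    print(name, [record["id"] for record in part])
```

Each part can then be cleaned and converted on its own schedule.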

Depending on the size of the dataset and how clean it is to begin with, cleaning usually takes anywhere from a few hours to a few weeks of part-time research assistant work.

Who cleans my data?

Data cleaning is always done by the Research Team because you are the experts in your own data. This task is likely well suited to a research assistant in your group.

Teams: Research Team, Ontology Team, Conversion Team, Storage Team
Clean Source Data: Research Team
Send Cleaned Data: Research Team

info

The LINCS Conversion Team can offer guidance specific to your data. First, though, see our Data Cleaning Guide, which is a good starting point: it covers the data cleaning steps and tools we have used with previous research teams. For more information about data cleaning tools, see Data Cleaning.

Cleaning Tools

OpenRefine is LINCS’s preferred tool for cleaning structured data because it offers built-in functionality and good documentation for many of the cleaning tasks outlined in our Data Cleaning Guide.

We also use a mix of spreadsheet editors like Google Sheets and Microsoft Excel, as well as custom Python scripts.
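To give a sense of what a small custom Python script might look like, the sketch below flags values that probably refer to the same entity but are written inconsistently. It is an illustrative example rather than a script LINCS distributes, and it works in the spirit of key-collision clustering: the `fingerprint` and `find_variants` helpers and the sample author names are hypothetical.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Build a comparison key: case-folded, no punctuation, sorted unique tokens."""
    no_punctuation = value.casefold().translate(
        str.maketrans("", "", string.punctuation)
    )
    return " ".join(sorted(set(no_punctuation.split())))


def find_variants(values):
    """Group values sharing a fingerprint; keep groups with two or more spellings."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


# Hypothetical values from a single "author" column.
authors = ["Atwood, Margaret", "Margaret Atwood", "margaret atwood", "Alice Munro"]

for key, variants in find_variants(authors).items():
    print(key, "->", variants)
```

A script like this only suggests candidates. Someone who knows the data should still review each group before merging, since two different entities can share a fingerprint.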

If your data is in a relational database or datastore with an editing interface, discuss options with your database administrator as there may already be editing methods in place.

Expected Outputs

The output from this step should be the same as the output from the Export Data step, but with your cleaning applied. As in that step, send LINCS a copy of your cleaned data if you would like approval or if LINCS is helping to implement your next conversion steps.