In the data cleaning step, you ensure your data is consistent in how it expresses entities and the relationships between them. Your data needs to be consistent and clean to be mapped and converted.
When should I clean my data?
Cleaning your data at this early stage makes the remaining conversion steps easier and your converted data more accurate and more meaningful.
Data cleaning is often time-consuming, so LINCS recommends you start as soon as you can. Go ahead and follow the tips on this page and in our Data Cleaning Guide, even before committing to the entire conversion process.
Typically, it is fastest to do bulk cleaning tasks on your original data because:
- Your team is already familiar with the format
- In the case of structured data, for example, it is efficient to make changes to a whole column of a spreadsheet at once and there are many common tools to help
- Cleaning your original data also improves it for uses other than publishing with LINCS
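As a rough illustration of a bulk fix applied to a whole column at once, here is a minimal sketch using Python's standard `csv` module. The column names, the sample rows, and the helpers `clean_column` and `normalize_place` are hypothetical and only stand in for your own exported spreadsheet and cleaning rules.

```python
import csv
import io


def clean_column(rows, column, clean):
    """Return new rows with `clean` applied to every value in `column`."""
    return [{**row, column: clean(row[column])} for row in rows]


def normalize_place(value):
    # Trim stray whitespace and collapse internal runs of spaces.
    return " ".join(value.split())


# Hypothetical sample standing in for a spreadsheet exported as CSV.
SAMPLE = "name,birthplace\nAda Lovelace,  London \nMary Shelley,London\n"

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
cleaned = clean_column(rows, "birthplace", normalize_place)
print([row["birthplace"] for row in cleaned])
```

The same pattern works with any per-value cleaning function, so one pass can fix a whole column without touching the other fields.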
Still, you can continue cleaning at the Validate and Enhance step and, once the conversion is complete, edit your data directly in ResearchSpace.
You may find it easiest to clean your data directly in the data store it comes from and then follow the Export Data step afterwards.
Alternatively, you may choose to export your data into a format that is easy to work with and clean that version of the data.
This decision will depend on how easy it is to edit the data in the data store, and whether you want the version of the data in the original data store to continue matching the cleaned version of the data.
How long will it take?
There is a trade-off between initial cleaning time and data quality. Each research team needs to decide when to stop cleaning and move on to the next steps. You may also choose to split your full dataset into parts, cleaning and converting one part first and then repeating the process for each remaining part as you have time.
Depending on the size of the dataset and how clean it is to begin with, cleaning usually ranges from a few hours of work to a few weeks of part-time research assistant work.
Who cleans my data?
Data cleaning is always done by the Research Team because you are the experts in your own data. This task is likely well suited to a research assistant in your group.
| |Research Team|Ontology Team|Conversion Team|Storage Team|
|Clean Source Data|✓| | | |
|Send Cleaned Data|✓| | | |
The LINCS Conversion Team can offer you guidance specific to your data. First, though, see our Data Cleaning Guide, a good starting point that covers the data cleaning steps and tools we have used with previous research teams. For more information about data cleaning tools, see Data Cleaning.
- Structured Data
- Semi-Structured Data
- TEI Data
- Natural Language Data
For structured data, LINCS uses a mix of spreadsheet editors, such as Google Sheets and Microsoft Excel, and custom Python scripts.
If your data is in a relational database or datastore with an editing interface, discuss options with your database administrator as there may already be editing methods in place.
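Before editing a spreadsheet column or database field, it can help to list values that probably refer to the same entity but are spelled differently. The sketch below is one possible approach, not a LINCS tool: the function `variant_groups` and the sample publisher names are hypothetical, and the matching rule (ignore case, punctuation, and spacing) is an assumption you would adjust for your own data.

```python
import re
from collections import defaultdict


def variant_groups(values):
    """Group values that differ only by case, punctuation, or spacing."""
    groups = defaultdict(set)
    for value in values:
        # Drop punctuation, fold case, and collapse whitespace to build a key.
        key = re.sub(r"[^\w\s]", "", value).casefold()
        key = " ".join(key.split())
        groups[key].add(value)
    # Keep only keys that were written in more than one way.
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}


# Hypothetical values from one column of a dataset.
publishers = [
    "Macmillan & Co.",
    "macmillan & co",
    "Hogarth Press",
    "Macmillan  & Co.",
]

for key, spellings in variant_groups(publishers).items():
    print(key, "->", spellings)
```

Each reported group is only a candidate for merging; a person who knows the data should confirm that the spellings really name the same entity before changing them.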
For semi-structured data, LINCS often uses custom Python scripts with XML parsing libraries because each project has its own data structure and needs.
LEAF-Writer may be useful if you have XML data.
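To give a sense of what a small custom script with an XML parsing library can look like, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The element names (`record`, `name`, `year`), the sample document, and the helper `normalize_text` are hypothetical; a real script would follow your project's own structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML standing in for a project's exported records.
SAMPLE = """<records>
  <record><name>  Woolf,
      Virginia </name><year>1882</year></record>
  <record><name>Woolf, Virginia</name><year> 1882</year></record>
</records>"""


def normalize_text(root, tags):
    """Collapse whitespace in the text of every element whose tag is in `tags`.

    Only the text before an element's first child is touched, so this
    suits simple leaf elements rather than mixed content.
    """
    changed = 0
    for element in root.iter():
        if element.tag in tags and element.text:
            cleaned = " ".join(element.text.split())
            if cleaned != element.text:
                element.text = cleaned
                changed += 1
    return changed


root = ET.fromstring(SAMPLE)
count = normalize_text(root, {"name", "year"})
print(count, [name.text for name in root.iter("name")])
```

Returning the number of changed elements makes it easy to confirm that the script touched roughly as many values as you expected.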
For TEI created by hand, it is best to associate your file with a customized schema, so that you can check as you go that your TEI is valid and that your tag use matches what you intend for your project. If you are new to schema customization, or would be happier working with an existing schema, use LEAF-Writer to create your TEI: it will automatically validate your TEI against a schema.
More advanced users can write XSLTs to create a new, cleaner version of their TEI files.
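Alongside schema validation, a quick tag inventory can show whether your tag use matches what you intend. The sketch below is an assumption-laden illustration in Python, not a replacement for validating against a schema: it only checks that the file is well-formed XML and counts element names. The `ALLOWED` set, the helper `tag_usage`, and the trimmed sample (which omits the `teiHeader` and would not be schema-valid TEI) are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The TEI namespace, in the form ElementTree prefixes onto tag names.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Trimmed, hypothetical sample; real TEI files also need a teiHeader.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p><persName>Mary Shelley</persName> met <name>Byron</name> in
       <placeName>Geneva</placeName>.</p>
  </body></text>
</TEI>"""


def tag_usage(xml_string, allowed):
    """Count element names and list any that are not in `allowed`."""
    # fromstring raises ParseError if the XML is not well-formed.
    root = ET.fromstring(xml_string)
    counts = Counter(el.tag.replace(TEI_NS, "") for el in root.iter())
    unexpected = sorted(tag for tag in counts if tag not in allowed)
    return counts, unexpected


# Hypothetical list of the elements a project has decided to use.
ALLOWED = {"TEI", "text", "body", "p", "persName", "placeName"}

counts, unexpected = tag_usage(SAMPLE, ALLOWED)
print(dict(counts))
print("Unexpected tags:", unexpected)
```

Here the generic `name` element is flagged because the hypothetical project decided to use `persName` for people, which is the kind of inconsistency worth fixing before conversion.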
For natural language data, LINCS typically uses custom Python scripts, or manual changes in a text editor for smaller fixes.
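A small script for plain text might look like the following sketch, which uses only Python's standard library. The specific rules (Unicode NFC normalization, replacing non-breaking spaces, collapsing runs of spaces and tabs, trimming each line) are assumptions chosen for illustration, and `clean_text` and the sample string are hypothetical; pick the rules that suit your own texts.

```python
import re
import unicodedata


def clean_text(text):
    """Apply a few conservative whitespace and encoding fixes to plain text."""
    # Combine base letters and combining accents into single characters.
    text = unicodedata.normalize("NFC", text)
    # Replace non-breaking spaces with ordinary spaces.
    text = text.replace("\u00a0", " ")
    # Collapse runs of spaces and tabs, leaving line breaks alone.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading and trailing whitespace on every line.
    return "\n".join(line.strip() for line in text.splitlines())


# Hypothetical sample with a combining accent, a non-breaking space, and a tab.
raw = "Cafe\u0301  de\u00a0Flore \nParis\t1926"
print(clean_text(raw))
```

Because the function leaves line breaks in place, the cleaned text still lines up with the original if you need to compare the two versions.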
The output from this step should be the same as the output from the Export Data step, except with cleaning applied to the data. Similar to that step, send LINCS a copy of your cleaned data if you would like approval or if LINCS is helping to implement your next conversion steps.