Preparing Data for 3M
Introduction
To use 3M to transform your data into CIDOC CRM triples, your starting data—or source data—must be formatted as XML. This page covers the guiding principles that LINCS follows when preparing source data for transformation using 3M.
Starting with XML Data
If your source data is already in XML, there may be small changes you can make to the starting structure to make it easier to work with in 3M.
Starting with non-XML Data
If your source data is in a different format than XML, you may be able to transform it into XML fairly easily and then use 3M.
Unsure if you should transform your data to XML and use 3M, or transform your data in a different way? Read more about your options in our Data Transformation Workflow Documentation.
How Should my XML Look?
Because you will set up custom mappings in 3M, there is no specific XML structure or naming scheme that you must follow. However, this page suggests structures that make transformation easier.
What you name the elements in your source XML matters only to the extent that you can easily keep track of what the information means and that you are consistent throughout your data. The names of the XML elements do not end up in the final RDF data, so they are only important if you are releasing your XML elsewhere.
The important criteria for your XML are:
- Your XML is valid XML. There are online validators you can use to check.
- Elements in your XML have sub-elements for each relationship you expect to see in the output RDF.
- Relevant text is nested such that you can have meaningful labels in the output.
- There are consistent unique identifiers throughout the data so that the final data does not end up representing the same entity with more than one identifier.
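As a rough local alternative to an online validator, the first criterion can be checked with a few lines of Python. This is a minimal sketch that only tests whether the text parses as XML (well-formedness), not whether it follows any particular schema. The function name `is_well_formed` and the sample strings are illustrative and are not part of 3M or the LINCS tooling.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Illustrative samples: the second one has a mismatched closing tag.
good = "<city><city_id>116</city_id><city_name>Colchis</city_name></city>"
bad = "<city><city_id>116</city_name></city>"

print(is_well_formed(good))
print(is_well_formed(bad))
```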
Examples
Here are some examples of input XML and the corresponding mapping diagrams from previous projects. The structure for your own source XML will depend on your own mapping diagrams, but this should give you a sense of what the next step will look like.
Anthologia graeca
The Anthologia graeca data started as a relational database, from which we extracted an XML file for each major type of object in the data. Within each of those XML files, we have an XML element for each instance of that object type. We then have sub-elements for every relationship that should be present in the final RDF. This often required us to join tables when we queried the relational data to make sure we had access to the right information for detailed labels.
Here is the CRM diagram for this project.
The city_sample.xml file is an example where we had a <city> element for each instance of a city. The sub-elements allowed us to give details about that city in the RDF.
<city>
<city_id>116</city_id>
<city_unique_id>122</city_unique_id>
<city_name>Colchis</city_name>
<city_name_language_id>eng</city_name_language_id>
<city_urn_id>5737</city_urn_id>
<city_urn>https://www.wikidata.org/entity/Q183150</city_urn>
</city>
Just as we had <city> elements, we also had separate <city_names> and <city_alt_urn> elements for every alternative name or URN for a city. This let us create a new name relationship for each instance of <city_names> and a new URN relationship for each instance of <city_alt_urn>, and connect those back to the corresponding <city> using the <city_id> element that was common across all elements.
Here is the output RDF from this project's 3M mapping. You will see that our 3M mapping created temporary URIs for each object and for many of the intermediate nodes that connect objects. For cities, they take the form <http://temp.lincsproject.ca/anthologie/place/city/8>. This is where unique internal identifiers come in handy. We can start with XML that is broken down into small elements that each represent a single relationship, and as long as we always build temporary URIs with the same format and the same unique identifier, all of those relationships will be connected properly in the final RDF.
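The principle can be illustrated outside of 3M with a short Python sketch. In 3M itself, URIs are produced by the mapping's generators, so the code below is not how 3M works internally. It only shows that two separate elements sharing an identifier end up attached to one URI. The helper `temp_city_uri`, the wrapping `<root>` element, and the `<city_urn>` sub-element inside `<city_alt_urn>` are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Base taken from the temporary URI form shown above.
TEMP_CITY_BASE = "http://temp.lincsproject.ca/anthologie/place/city/"


def temp_city_uri(city_id):
    """Build a temporary URI from a city's internal identifier (hypothetical helper)."""
    return TEMP_CITY_BASE + city_id.strip()


xml_text = """
<root>
  <city>
    <city_id>116</city_id>
    <city_name>Colchis</city_name>
  </city>
  <city_alt_urn>
    <city_id>116</city_id>
    <city_urn>https://www.wikidata.org/entity/Q183150</city_urn>
  </city_alt_urn>
</root>
"""

# Group every piece of information under the URI built from its <city_id>.
statements = defaultdict(list)
for element in ET.fromstring(xml_text):
    uri = temp_city_uri(element.findtext("city_id"))
    for child in element:
        if child.tag != "city_id":
            statements[uri].append((child.tag, child.text))

for uri, facts in statements.items():
    print(uri, facts)
```

Both elements contribute to a single entry because they share the identifier 116. If the two elements used different identifiers for the same city, the output would contain two unrelated URIs.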
After that, we have a final step where we replace those temporary URIs with persistent ones. These can be URIs that you already have and will keep persistent, external identifiers from existing authority files, or LINCS-minted ones.
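A replacement step of this kind could be sketched as a lookup from temporary to persistent URIs applied to N-Triples text. This is a simplified illustration, not the LINCS post-processing code. The function `replace_temp_uris`, the sample triple, and the choice of the Wikidata URI as the persistent identifier for city 116 are assumptions. A production script would more likely use an RDF library than a regular expression.

```python
import re

# Matches any URI in angle brackets under the temporary LINCS namespace.
TEMP_URI = re.compile(r"<(http://temp\.lincsproject\.ca/[^>]+)>")


def replace_temp_uris(ntriples, persistent):
    """Swap temporary URIs for persistent ones; unknown URIs are left unchanged."""
    def swap(match):
        temp = match.group(1)
        return "<" + persistent.get(temp, temp) + ">"

    return TEMP_URI.sub(swap, ntriples)


# Hypothetical mapping and triple, for illustration only.
persistent = {
    "http://temp.lincsproject.ca/anthologie/place/city/116":
        "https://www.wikidata.org/entity/Q183150",
}
triple = (
    "<http://temp.lincsproject.ca/anthologie/place/city/116> "
    '<http://www.w3.org/2000/01/rdf-schema#label> "Colchis" .'
)

result = replace_temp_uris(triple, persistent)
print(result)
```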
University of Saskatchewan Art Collection
The University of Saskatchewan Art Collection had a simpler starting XML structure where each parent <rdf:Description> element represented one art object and all information associated with that object was nested as a sub-element. All entities were matched with official URIs so there was no need for other internal identifiers.
Here is the CRM diagram for this project.
This project shows how XML attributes, rather than completely separate XML elements, can give details about other parts of the data. For example:
<rdf:Description>
…
<Category url="http://vocab.getty.edu/aat/300033618">painting</Category>
…
</rdf:Description>
This allowed us to use the URI <http://vocab.getty.edu/aat/300033618> every time that an object fell into the category of painting. The element value “painting” became the label for the <http://vocab.getty.edu/aat/300033618> object in the final data.
There are many suitable alternatives for what the input XML could have looked like and how it could have been broken down. Here are two examples that would have worked equally well in 3M, and there are certainly other valid ways to represent the same data.
Example 1:
<rdf:Description>
…
<Category>http://vocab.getty.edu/aat/300033618</Category>
…
</rdf:Description>
<Category>
<url>http://vocab.getty.edu/aat/300033618</url>
<name>painting</name>
</Category>
Example 2:
<rdf:Description>
…
<CategoryID>40</CategoryID>
…
</rdf:Description>
<CategoryName>
<ID>40</ID>
<name>painting</name>
</CategoryName>
<CategoryURL>
<ID>40</ID>
<URL>http://vocab.getty.edu/aat/300033618</URL>
</CategoryURL>
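To see why these layouts are interchangeable, the following Python sketch reads the attribute-based layout and the Example 2 layout and recovers the same URI and label pair from each. It is only an illustration of the data being equivalent, not a description of how 3M reads XML. The wrapping `<root>` element and the `xmlns:rdf` declaration are added here so that the snippets parse on their own, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Layout used by the project: URI in an attribute, label as element text.
attribute_layout = f"""
<root xmlns:rdf="{RDF}">
  <rdf:Description>
    <Category url="http://vocab.getty.edu/aat/300033618">painting</Category>
  </rdf:Description>
</root>
"""

# Example 2: the object holds an internal ID; name and URL live in separate elements.
id_layout = f"""
<root xmlns:rdf="{RDF}">
  <rdf:Description>
    <CategoryID>40</CategoryID>
  </rdf:Description>
  <CategoryName><ID>40</ID><name>painting</name></CategoryName>
  <CategoryURL><ID>40</ID><URL>http://vocab.getty.edu/aat/300033618</URL></CategoryURL>
</root>
"""


def from_attributes(xml_text):
    """Return (uri, label) pairs from <Category url="...">label</Category>."""
    root = ET.fromstring(xml_text)
    return [(c.get("url"), c.text) for c in root.iter("Category")]


def from_ids(xml_text):
    """Return (uri, label) pairs by joining on the shared internal ID."""
    root = ET.fromstring(xml_text)
    names = {e.findtext("ID"): e.findtext("name") for e in root.iter("CategoryName")}
    urls = {e.findtext("ID"): e.findtext("URL") for e in root.iter("CategoryURL")}
    return [(urls[e.text], names[e.text]) for e in root.iter("CategoryID")]


print(from_attributes(attribute_layout))
print(from_ids(id_layout))
```

Both calls yield the same pair, so the choice between layouts mostly comes down to which one is easier to produce from your source data and to map in 3M.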
YellowNineties
This project had a similarly simple XML structure where each parent <rdf:Description> tag represented one person and the sub-elements represented all information about that person.
This data required a pre-processing step where we added attributes to the XML to make it easier to add meaningful labels to the RDF and to prevent creating multiple identifiers for the same relationship. One example is when there was a relationship between two people like “colleague_of”. This was originally expressed from each side using only the person identifiers, like this:
<rdf:Description rdf:about="https://personography.1890s.ca/persons/2617">
…
<y90s_colleague_of rdf:resource="https://personography.1890s.ca/persons/3036"/>
</rdf:Description>
<rdf:Description rdf:about="https://personography.1890s.ca/persons/3036">
…
<y90s_colleague_of rdf:resource="https://personography.1890s.ca/persons/2617"/>
</rdf:Description>
This resulted in the one colleague_of relationship coming out as two separate events in our final data. To fix this, we added a unique identifier and a label to the relationship in the source XML, like this:
<rdf:Description rdf:about="https://personography.1890s.ca/persons/2617">
…
<y90s_colleague_of
rdf:resource="https://personography.1890s.ca/persons/3036"
relationship_id="colleague_of#26173036"
relationship_label="Colleague relationship between Cahan, Abraham and Hapgood, Norman" />
</rdf:Description>
<rdf:Description rdf:about="https://personography.1890s.ca/persons/3036">
…
<y90s_colleague_of
rdf:resource="https://personography.1890s.ca/persons/2617"
relationship_id="colleague_of#26173036"
relationship_label="Colleague relationship between Cahan, Abraham and Hapgood, Norman" />
</rdf:Description>
This way we could represent the relationship using the unique identifier “colleague_of#26173036”. Even if we defined that relationship twice, starting from each of the relationship participants’ <rdf:Description> tags, all the information would be merged correctly because of the shared identifier.
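A pre-processing step like this could be sketched in Python as follows. This is an assumption about how such a script might look, not the script used for the project. The identifier is built by sorting the two person numbers so that both directions of the relationship produce the same value, which matches the format of the identifier shown above. The `names` lookup uses placeholder names, and the function names and the `<rdf:RDF>` wrapper are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ABOUT = f"{{{RDF}}}about"
RESOURCE = f"{{{RDF}}}resource"
PERSONS = "https://personography.1890s.ca/persons/"

xml_text = f"""
<rdf:RDF xmlns:rdf="{RDF}">
  <rdf:Description rdf:about="{PERSONS}2617">
    <y90s_colleague_of rdf:resource="{PERSONS}3036"/>
  </rdf:Description>
  <rdf:Description rdf:about="{PERSONS}3036">
    <y90s_colleague_of rdf:resource="{PERSONS}2617"/>
  </rdf:Description>
</rdf:RDF>
"""

# Placeholder names; a real script would look these up in the source data.
names = {f"{PERSONS}2617": "Person A", f"{PERSONS}3036": "Person B"}


def person_number(uri):
    """Return the numeric identifier at the end of a person URI."""
    return int(uri.rstrip("/").rsplit("/", 1)[-1])


def add_relationship_attributes(root, names):
    """Add a shared relationship_id and relationship_label to each colleague_of element."""
    for description in root.iter(f"{{{RDF}}}Description"):
        subject = description.get(ABOUT)
        for relation in description.findall("y90s_colleague_of"):
            other = relation.get(RESOURCE)
            # Sort so that both directions yield the same id and label.
            first, second = sorted([subject, other], key=person_number)
            relation.set(
                "relationship_id",
                f"colleague_of#{person_number(first)}{person_number(second)}",
            )
            relation.set(
                "relationship_label",
                f"Colleague relationship between {names[first]} and {names[second]}",
            )


root = ET.fromstring(xml_text)
add_relationship_attributes(root, names)
ids = [r.get("relationship_id") for r in root.iter("y90s_colleague_of")]
print(ids)
```

Note that joining two numbers without a separator can in principle produce the same string for different pairs, so a separator between the two numbers may be a safer choice in your own data.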
FAQ
I am unsure of the best way to set up my XML. How do I get started?
Follow through our guided 3M tutorial using our sample data so that you understand the basics of 3M.
Once you know how mapping projects work in 3M, it will be easier for you to think about what will work best for your data.
You can also manually make a few small samples of your data using the different setups that you are considering and play around with creating sample 3M mappings for them. Try to set up the same single-relation mapping in 3M for each data sample. Which setup made it easiest to get all the information you needed?
How do I transform my data into XML?
We tend to create custom Python scripts to transform other structured data formats into XML. Check back soon for sample scripts.
There are also many tools online that can help you depending on the format of your source data. OpenRefine is one example that allows for various input and output formats.
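In the meantime, here is a minimal sketch of what a custom Python script for tabular data might look like. It is not an official LINCS sample script. It assumes a CSV file whose column headers are already valid XML element names, and the function name `csv_to_xml`, the `<root>` wrapper, and the sample row layout are illustrative.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Illustrative input; in practice this would be read from a file.
csv_text = """city_id,city_name,city_urn
116,Colchis,https://www.wikidata.org/entity/Q183150
"""


def csv_to_xml(csv_file, row_tag, root_tag="root"):
    """Turn each CSV row into one element, with one sub-element per non-empty cell."""
    root = ET.Element(root_tag)
    for row in csv.DictReader(csv_file):
        record = ET.SubElement(root, row_tag)
        for column, value in row.items():
            if value:  # skip empty cells so no empty elements are produced
                ET.SubElement(record, column).text = value
    return root


root = csv_to_xml(io.StringIO(csv_text), row_tag="city")
ET.indent(root)  # pretty-print; requires Python 3.9 or later
print(ET.tostring(root, encoding="unicode"))
```

To run it on a real file, pass an open file object instead of the `io.StringIO` wrapper, for example `open("cities.csv", newline="", encoding="utf-8")`, where the file name is a placeholder.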