Research Data Infrastructure

LINCS will leverage existing solutions and expand Compute Canada infrastructure to mobilize large-scale, heterogeneous datasets as linked open data (LOD) for humanities research. The Source Datasets will contribute to the development of LINCS’ Research Data Infrastructure through:

1) a Conversion toolkit by which LINCS will mobilize, enrich, and interlink research data;

2) a national linked data Storage system; and

3) an Access system to filter, query, analyze, visualize, and annotate cultural materials; to modify, evaluate, correct, or reject automated semantic enrichments; and to allow others to use the conversion tools.
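To make the conversion step concrete, the following is a minimal sketch of turning one tabular record into linked data statements, serialized as N-Triples. It is an illustration under stated assumptions, not the LINCS toolkit: the example.org namespace, the record fields (id, name, same_as), and the function names are all hypothetical.

```python
# Hypothetical namespace for illustration; not a LINCS identifier scheme.
BASE = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def escape_literal(text):
    """Escape backslashes, double quotes, and line breaks for an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(record):
    """Convert one record (a dict with assumed keys) into N-Triples lines."""
    subject = "<" + BASE + "person/" + record["id"] + ">"
    label = escape_literal(record["name"])
    lines = [
        # Mobilize: state what kind of thing the record describes.
        subject + " <" + RDF_TYPE + "> <" + BASE + "Person> .",
        # Enrich: attach a human-readable label.
        subject + " <" + RDFS_LABEL + '> "' + label + '" .',
    ]
    # Interlink: point to the same entity in another dataset, if known.
    if record.get("same_as"):
        lines.append(subject + " <" + OWL_SAME_AS + "> <" + record["same_as"] + "> .")
    return lines


if __name__ == "__main__":
    example = {
        "id": "p1",
        "name": "Jane Doe",
        "same_as": "http://example.org/authority/123",
    }
    print("\n".join(record_to_ntriples(example)))
```

The three statements correspond loosely to the toolkit's three verbs: a type assertion mobilizes the record, a label enriches it, and an owl:sameAs link connects it to another dataset.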

The LINCS System Diagram provides a more detailed map of the proposed infrastructure. LINCS will adhere to the Semantic Web standards established by Tim Berners-Lee and other architects of the World Wide Web, drawing on practical solutions implemented for cultural data. It will adopt, adapt and extend best practices established by large projects including Europeana, the Digital Public Library of America (DPLA) and LD4P, and will learn from enrichment-oriented platforms like the Australian HuNI and European InterEdition projects, and from cultural heritage providers in Canada who are experimenting with LOD.

LINCS will reuse existing ontologies where possible, creating well-documented bilingual ontologies as needed for domain-specific knowledge where they do not yet exist. LINCS adopts standard algorithms for natural language processing (NLP) and entity matching for its conversion processes and its tools, and will build on methods used by other large-scale LOD conversion projects including Linked Data for Production (LD4P), to which we are linked through.
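As an illustration of entity matching in its simplest form, the sketch below compares a name found in a source dataset against labels in an authority list and accepts the closest one only if it clears a similarity threshold. This is not the algorithm LINCS uses: the string-similarity measure (Python's standard difflib), the threshold of 0.85, the identifiers, and the function names are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop commas, and collapse runs of whitespace."""
    return " ".join(name.lower().replace(",", " ").split())


def best_match(mention, authority, threshold=0.85):
    """Return (identifier, score) for the closest authority label, or None.

    `authority` maps hypothetical identifiers to preferred labels.
    """
    best = None
    for identifier, label in authority.items():
        score = SequenceMatcher(None, normalize(mention), normalize(label)).ratio()
        if best is None or score > best[1]:
            best = (identifier, score)
    # Reject weak candidates so that a human can review them instead.
    if best is not None and best[1] >= threshold:
        return best
    return None


if __name__ == "__main__":
    authority = {"ex:1": "Margaret Atwood", "ex:2": "Alice Munro"}
    print(best_match("Margaret Attwood", authority))  # close spelling variant
    print(best_match("Leonard Cohen", authority))     # no plausible candidate
```

Returning None below the threshold, rather than the nearest label regardless of score, leaves room for the human review of automated enrichments that the Access system is meant to support.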

For interfaces, LINCS extends or adopts several existing tools and platforms designed specifically for humanities researchers, rather than generic ones such as the BRAT annotation platform. We will adopt the Pelagios interface for spatial markup, adding gazetteers. LINCS extends the Canadian Writing Research Collaboratory and the Voyant text analysis and visualization suite, the most used Compute Canada tool in 2016. The notebook method we will use to document the processes for converting diverse datasets is now well established as a means of spreading coding skills, as opposed to developing one-size-fits-all platforms. For its access plans, LINCS builds in spirit on successful models, such as the Social Networks and Archival Context (SNAC) project for an access interface, the DPLA for a developer interface, and the Humanities Networked Infrastructure (HuNI) interface for engaging users with linked data.

Through LINCS, Canadian researchers will have unprecedented access to cultural heritage content, and our cultural heritage will receive unprecedented exposure. This includes copyrighted data. Our published record is so young that much of it remains under copyright, and the resulting lack of access to digital collections for analysis has significantly impeded research on Canadian culture. LINCS will be able to elucidate, for instance, the massive but protected HathiTrust Digital Library (HTDL) datasets. Access tools will ensure data mobilization across the full spectrum of mainstream and technical researchers.

Where possible, tools will be implemented as stand-alone Web services, with Application Programming Interfaces (APIs) to support third-party use. Components will be modular: able to function individually, integrated within a workflow, or built into another system. The architecture will support further tool development for the wide range of use cases emerging from the academy and beyond. Code will be open-source, and open design principles will allow others to build interfaces for their own data and plug LINCS tools into other environments. Workshops will be held on conversion tools, knowledge graphs, and integrating LINCS tools into partner interfaces. Experienced researchers will have tools to prepare data for LINCS, and the less technical will be supported by partners such as CWRC.
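The modular design described above can be sketched as a set of small components that each take a record and return a new one, so that any component can run on its own or be chained into a workflow. The component names, record fields, and the run_workflow helper are hypothetical and serve only to show the pattern, not the LINCS architecture.

```python
from functools import reduce


def normalize_whitespace(record):
    """Stand-alone component: collapse runs of whitespace in the text field."""
    return {**record, "text": " ".join(record["text"].split())}


def tag_language(record, default="en"):
    """Stand-alone component: add a language tag if the record lacks one."""
    return {**record, "lang": record.get("lang", default)}


def run_workflow(record, steps):
    """Apply each component in order, passing the result to the next one."""
    return reduce(lambda current, step: step(current), steps, record)


if __name__ == "__main__":
    raw = {"text": "  Anne   of Green Gables "}
    # Used individually ...
    print(normalize_whitespace(raw))
    # ... or integrated within a workflow.
    print(run_workflow(raw, [normalize_whitespace, tag_language]))
```

Because every component shares the same record-in, record-out shape, a third party could wrap any one of them in a Web service or slot it into a different system without changing the others.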

LINCS will be both generic, in permitting different query types on data converted from very different sources, and precise, in allowing researchers to drill down to specific domain vocabularies and highly specialized subsets of content.