Marvels of API Communication: The Diffbot NLP Vetting Tool
- LINCS Project
- February 24, 2023
— Mohammed Marzookh Farook & Ananya Rao, LINCS Junior Research Analysts
If you’ve ever listened in on a technical conversation between computer scientists, you might’ve heard them refer to something called an API. What exactly is an API? Let’s break it down.
An application is any software that performs a distinct function. Microsoft Excel and Google Sheets are good examples of common applications that lots of people use—even those who are not computer scientists.
Now, every application has specific rules for how to interact with it. For example, most users of Microsoft Excel know that they can bold the text in a cell by clicking on the B button on the ribbon toolbar that appears at the top of Excel’s interface. An interface is a contract that defines rules for software interaction.
Put these two terms together, application program and interface, and you get Application Programming Interface, or API. An API is a way for two or more applications to communicate with each other using requests and responses. As Junior Research Analysts working for LINCS, we’ve had plenty of opportunities to work with APIs as we try to join together the applications that make up LINCS.
One of the API-based tools that we have worked on is the Diffbot NLP Vetting Workflow. The purpose of the Diffbot NLP Vetting Workflow is to assess the performance of the Diffbot Natural Language Processing (NLP) API on LINCS datasets. The Diffbot NLP API uses machine learning and computer vision algorithms to look for meaning in text. In particular, it identifies entities (e.g., people, organizations, products) and data about them (e.g., the relationships between them).
For example, when provided with the text, “On 11 November 1741, Abigail Smith (later Abigail Adams) was born in Boston, Massachusetts,” Diffbot created the following knowledge graph:
Using data from the web, Diffbot was able to identify the text’s entities (circles) and also the relationships (arrows) between them, which it refers to as facts. Diffbot’s AI algorithm consistently identifies entities and facts correctly in many of the applications for which it is used outside of LINCS; however, error-free identification is not always possible given the complexity of the data LINCS is working with. To help with this, we wanted to create a tool that would analyze the accuracy of what Diffbot was telling us: the Diffbot NLP Vetting Workflow.
In essence, the Diffbot NLP Vetting Workflow is a few steps contained in a Jupyter Notebook; using it does not require extensive programming experience or in-depth knowledge of how Diffbot works. Behind the scenes, however, the workflow’s few lines point to more complex Python code that handles the API communication and adheres to Diffbot’s many standards.
Starting with plain text as input, the workflow sends a request to Diffbot to identify all of the entities and facts in the text. The response that comes back looks something like this:
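For readers curious about the request side of this exchange, here is a minimal sketch of how such a request could be assembled in Python. It is not the LINCS code: the endpoint URL, the `fields` and `token` query parameters, and the `content` and `lang` payload keys are assumptions based on our reading of Diffbot's public documentation, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for Diffbot's Natural Language API; check the current docs.
DIFFBOT_NL_URL = "https://nl.diffbot.com/v1/"


def build_diffbot_request(text, token, fields=("entities", "facts")):
    """Build (but do not send) a POST request asking for entities and facts."""
    # Query string: which parts of the analysis we want, plus the API token.
    query = urllib.parse.urlencode({"fields": ",".join(fields), "token": token})
    # Assumed payload keys: the text to analyse and its language.
    payload = {"content": text, "lang": "en"}
    return urllib.request.Request(
        url=f"{DIFFBOT_NL_URL}?{query}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON body of the response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(build_diffbot_request(text, token))` with a valid token would perform the actual network call; building the request separately makes it easy to inspect before anything is sent.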
This result, which takes the form of a JSON response, is not all that easy to assess. To make it easier to vet the data, our program calls on the Google Sheets API to generate a spreadsheet:
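The step from JSON to spreadsheet rows can be sketched roughly as follows. This is an illustration only: the response keys (`entities`, `name`, `confidence`, `allUris`, `mentions`, `beginOffset`, `endOffset`) are assumptions about the shape of Diffbot's response, and the column layout and the function name `entities_to_rows` are hypothetical rather than the layout the workflow actually uses.

```python
# Hypothetical column layout; the real sheet has more columns.
HEADER = ["Entity", "Confidence", "Wikidata link", "Approved (yes/no)", "Context"]


def entities_to_rows(response, text, window=30):
    """Flatten the entities in a parsed JSON response into spreadsheet rows."""
    rows = [HEADER]
    for entity in response.get("entities", []):
        # Pick the first URI that points at Wikidata, if there is one.
        uris = entity.get("allUris", [])
        wikidata = next((uri for uri in uris if "wikidata.org" in uri), "")
        # Show a little of the surrounding text for the first mention.
        context = ""
        mentions = entity.get("mentions", [])
        if mentions:
            start = max(0, mentions[0].get("beginOffset", 0) - window)
            end = min(len(text), mentions[0].get("endOffset", 0) + window)
            context = text[start:end]
        rows.append([
            entity.get("name", ""),
            round(entity.get("confidence", 0.0), 2),
            wikidata,
            "",  # left blank for the human reviewer
            context,
        ])
    return rows
```

A list of rows like this is the kind of value that can then be written to a sheet through the Google Sheets API.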
Through further API requests, our program then applies some default formatting in Google Sheets: hiding, resizing, and freezing columns, as well as applying bold and colours to the text. Each formatting operation is associated with a different API request; a lot of trial and error was involved in creating something that would be easy for a human to review.
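To give a sense of what "a different API request" per formatting operation means, the sketch below builds a request body of the kind accepted by the Google Sheets API's `spreadsheets.batchUpdate` method. The request types (`updateDimensionProperties`, `updateSheetProperties`, `repeatCell`) come from the Sheets API; the function name, the column indexes, and the pixel width are made-up values for illustration, not the ones the workflow uses.

```python
def formatting_requests(sheet_id=0, hidden_col=1, wide_col=4, wide_px=400, frozen_cols=1):
    """Return one Sheets API request per formatting operation."""

    def column_range(index):
        # A single column, using the API's zero-based, end-exclusive indexes.
        return {
            "sheetId": sheet_id,
            "dimension": "COLUMNS",
            "startIndex": index,
            "endIndex": index + 1,
        }

    return [
        # Hide a column the reviewer does not need to see.
        {"updateDimensionProperties": {
            "range": column_range(hidden_col),
            "properties": {"hiddenByUser": True},
            "fields": "hiddenByUser",
        }},
        # Widen a column that holds long text.
        {"updateDimensionProperties": {
            "range": column_range(wide_col),
            "properties": {"pixelSize": wide_px},
            "fields": "pixelSize",
        }},
        # Freeze the leftmost column(s) so they stay visible while scrolling.
        {"updateSheetProperties": {
            "properties": {
                "sheetId": sheet_id,
                "gridProperties": {"frozenColumnCount": frozen_cols},
            },
            "fields": "gridProperties.frozenColumnCount",
        }},
        # Make the header row bold.
        {"repeatCell": {
            "range": {"sheetId": sheet_id, "startRowIndex": 0, "endRowIndex": 1},
            "cell": {"userEnteredFormat": {"textFormat": {"bold": True}}},
            "fields": "userEnteredFormat.textFormat.bold",
        }},
    ]


# The body that would be sent in a single batchUpdate call.
body = {"requests": formatting_requests()}
```

Sending the body requires an authenticated Sheets client, which is left out here to keep the example self-contained.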
The resulting spreadsheet shows the entity, Diffbot’s confidence in its own prediction (a number between 0 and 1, with 0 being least confident and 1 being most confident), a link to a matching entity in Wikidata, and more. A person reviewing the data can use all of this information to confirm or correct Diffbot’s work. To make this review as simple as possible, the last column in the sheet shows where in the original text Diffbot found the entity, so that the reviewer can refer back to the original easily if further checks are needed. Using the information in the spreadsheet, the reviewer can then either approve or reject Diffbot’s results (by entering yes or no in the spreadsheet; columns L, M, and N in the image above).
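One simple way to summarize a column of yes/no decisions is the share of reviewed rows that were approved. The sketch below assumes that definition; it is an illustration, not necessarily how the workflow scores a review, and the function name is hypothetical.

```python
def approval_rate(decisions):
    """Return the fraction of reviewed cells marked "yes", or None if none were reviewed."""
    # Normalize case and whitespace; ignore blank or unrecognized cells.
    normalized = [cell.strip().lower() for cell in decisions if cell]
    reviewed = [cell for cell in normalized if cell in ("yes", "no")]
    if not reviewed:
        return None
    return reviewed.count("yes") / len(reviewed)


# Example: three reviewed rows (two approved) and one row left blank.
rate = approval_rate(["yes", "No", " yes ", ""])
```

Blank cells are skipped rather than counted as rejections, so a partially reviewed sheet is not penalized for rows nobody has looked at yet.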
The newly vetted information (as well as overall accuracy scores for the data being reviewed) then continues its path onward, through a further series of steps that involve both applications and people—a multistage process best saved for future blog posts. At the end, what started out as a sentence in a document can be discovered as entities and relationships in LINCS’s triplestore. And, thanks to the Diffbot NLP Vetting Workflow, those using LINCS’s data can have confidence that the content they’re exploring is accurate!