by Pieter Botha
LINCS Technical Manager
There are many graph databases out there that support RDF: Virtuoso, GraphDB, Stardog, AnzoGraph, and RDFox, to name just a few popular ones. But if your requirements for a triplestore include open source, as they do for our CFI-funded LINCS project, then Blazegraph and Apache Jena Fuseki are two of your most mature options. This article compares Blazegraph and Jena Fuseki, two contenders for the LINCS graph database. Thanks to Angus Addlesee, whose article comparing Blazegraph with commercial triplestores inspired the testing methodology for this post.
Blazegraph, previously known as Bigdata, is a capable triplestore that scales to billions of triples, with thousands of proven use cases. In fact, it was so good that AWS acquired the Blazegraph trademark almost five years ago and hired some of its staff, including the CEO. Unfortunately, that meant most of Blazegraph’s development expertise went into a competing product: Amazon Neptune. Although official Blazegraph releases have slowed down, it still supports SPARQL 1.1 and is by no means outdated.
Apache’s Fuseki, along with the entire Jena project and its plugins, is still actively developed as of October 2020. It supports SPARQL 1.1 Update and gets new features and enhancements with each release, which arrives roughly every quarter. We know that Fuseki can scale, as shown here by loading the entire Wikidata dump. But what is its query performance like, and how does it compare to Blazegraph? Let’s find out!
Trying to have a fair competition in a matchup like this is very difficult. Different products almost always have different strengths and selective benchmarking can easily skew results. Getting one-sided results was not the intention here, but I did choose a small set of tests, as an exhaustive test suite would require a book and not an article. My testing involved loading this Olympic sports dataset with ~1.8m triples and then executing some timed SPARQL queries using the built-in web interface of both triplestores.
I used this Dockerfile from the LINCS project to create a Fuseki instance based on the latest v3.16 release. It is a basic TDB2 configuration with a full-text index on all rdfs:label properties.
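For readers who don’t want to open the linked config.ttl, the general shape of such a setup is a TDB2 dataset wrapped in a jena-text Lucene index. Below is a minimal sketch using the standard Jena assembler vocabulary; the dataset name, file paths, and node names are placeholders, not the actual LINCS values:

```turtle
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX fuseki: <http://jena.apache.org/fuseki#>
PREFIX tdb2:   <http://jena.apache.org/2016/tdb#>
PREFIX text:   <http://jena.apache.org/text#>

<#service> rdf:type fuseki:Service ;
    fuseki:name "olympics" ;             # placeholder dataset name
    fuseki:serviceQuery "sparql" ;
    fuseki:serviceUpdate "update" ;
    fuseki:dataset <#text_dataset> .

# The text dataset wraps the TDB2 store with a Lucene index
<#text_dataset> rdf:type text:TextDataset ;
    text:dataset <#tdb_dataset> ;
    text:index <#index_lucene> .

<#tdb_dataset> rdf:type tdb2:DatasetTDB2 ;
    tdb2:location "/fuseki/databases/olympics" .   # placeholder path

<#index_lucene> rdf:type text:TextIndexLucene ;
    text:directory <file:/fuseki/lucene> ;         # placeholder path
    text:entityMap [
        a text:EntityMap ;
        text:entityField "uri" ;
        text:defaultField "label" ;
        # Index the object of every rdfs:label triple
        text:map ( [ text:field "label" ; text:predicate rdfs:label ] )
    ] .
```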
The tests were executed on an 8th-generation Core i5 with SSDs and plenty of RAM. Neither triplestore was “warmed up”, and queries were executed in the same order and the same number of times to keep the playing field as level as possible.
The SPARQL queries used these prefixes:
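The original prefix block did not survive in this copy of the article; for a dataset like this it would have looked something like the following, where the `walls:` namespace is a stand-in for the dataset’s own vocabulary rather than its actual prefix:

```sparql
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>
# Stand-in namespace for the Olympic sports dataset's own terms
PREFIX walls: <http://example.org/olympics/>
```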
Queries were executed twice and both results were recorded.
For most projects, it is important to know how long loading data into the triplestore will take. Since our dataset is relatively small (< 2 million triples), I was able to use the web interface of both triplestores to load the Turtle file without any issues.
Fuseki: 57s, 30.3s
Blazegraph: 57s, 21.5s
The second run was an update with the same .ttl file used in the first run. No actual changes were made to the graph.
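Loading through the web UI is equivalent to issuing a SPARQL 1.1 Update `LOAD` against the update endpoint, which is handy for scripting the same test. A sketch, where the file path is a placeholder and the server must be configured to allow `file:` URLs:

```sparql
# Load a local Turtle file into the default graph (path is a placeholder)
LOAD <file:///data/olympics.ttl>
```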
The simplest of queries to just see how many triples are in the dataset.
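A triple count is the standard form of this query, along the lines of:

```sparql
# Count every triple in the default graph
SELECT (COUNT(*) AS ?triples)
WHERE { ?s ?p ?o }
```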
Fuseki: 0.8s, 0.5s
Blazegraph: 0.02s, 0.01s
It looks like Blazegraph did some pre-aggregation here while loading the data.
Fuseki: 7.7s, 5.0s
Blazegraph: 7.0s, 4.2s
Blazegraph was consistently faster and pulls further ahead with this typical query.
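The text of this query was not preserved here; a representative join-and-aggregate query over a dataset like this would have roughly the following shape. The `walls:` prefix and the `wonMedal` predicate are illustrative stand-ins, not the dataset’s actual schema:

```sparql
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX walls: <http://example.org/olympics/>

# Illustrative shape only: medal counts per athlete (predicate is hypothetical)
SELECT ?athleteLabel (COUNT(?medal) AS ?medals)
WHERE {
  ?athlete rdfs:label ?athleteLabel ;
           walls:wonMedal ?medal .
}
GROUP BY ?athleteLabel
ORDER BY DESC(?medals)
LIMIT 10
```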
Using the full-text index efficiently required slightly different queries because Fuseki performed very slowly unless the full-text search was the first filter.
Fuseki: 0.2s, 0.1s
Blazegraph: 0.3s, 0.1s
Moving the full-text filter to the top for Blazegraph too made it perform faster than Fuseki—0.08s for the first run and 0.04s for the second run.
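For reference, the two engines use different full-text syntaxes, which is why the queries had to differ. Sketches of an equivalent label search in each, with an illustrative search term:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX text: <http://jena.apache.org/text#>
PREFIX bds:  <http://www.bigdata.com/rdf/search#>

# Fuseki (jena-text): the text:query pattern should come first
SELECT ?s ?label WHERE {
  ?s text:query (rdfs:label "london") ;
     rdfs:label ?label .
}

# Blazegraph: bds:search binds the matching literals
SELECT ?s ?label WHERE {
  ?label bds:search "london" .
  ?s rdfs:label ?label .
}
```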
Fuseki: did not finish before the configured timeout of 10 minutes
Blazegraph: 7.0s, 6.0s
This query joins against a graph fetched over the internet from dbpedia.org.
Fuseki: 7.9s, 7.5s
Blazegraph: 0.5s, 0.4s
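Federated queries of this kind use SPARQL 1.1’s `SERVICE` keyword to push part of the pattern to a remote endpoint. An illustrative version, where the join variables and DBpedia properties are examples rather than the benchmark query itself:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>

# Join local labels against DBpedia over HTTP (illustrative pattern)
SELECT ?s ?label ?abstract WHERE {
  ?s rdfs:label ?label .
  SERVICE <http://dbpedia.org/sparql> {
    ?city rdfs:label ?label ;
          dbo:abstract ?abstract .
    FILTER (lang(?abstract) = "en")
  }
}
LIMIT 10
```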
I must admit that I was somewhat surprised by the results. Blazegraph performed consistently better than Fuseki in this scenario. Fuseki’s failure to finish the complex query could well be an indexing problem; be that as it may, Blazegraph ran the same query just fine straight out of the box. Blazegraph also beat Fuseki by more than an order of magnitude on the federated query.
One possible explanation for these one-sided results is that Blazegraph’s indexes are better suited to this dataset and that Fuseki’s indexes need more tuning. Please feel free to look at the config.ttl I used to configure Fuseki, and let me know in the comments if I missed an obvious optimization or misconfigured something.
Fuseki configuration follow-up
Although I tried to configure Fuseki with the simplest full-text index possible, I feared that a misconfiguration might have been the cause of the comparatively disappointing performance. To rule out that possibility, and for the sake of completeness, I reran the benchmarks against default Fuseki databases created from the admin portal without any customization.
The results were more or less aligned with the original performance figures. So, it seems that vanilla Fuseki is just considerably slower than Blazegraph for this dataset and queries.