Property Path use in Wikidata Queries

September 28th, 2018 9:06 AM

I recently began taking a look at the Wikidata query logs that were published a couple of months ago and wanted to look into how some features of SPARQL were being used on Wikidata. The first thing I’ve looked at is the use of property paths: how often paths are used, what path operators are used, and with what frequency.

Using the “interval 3” logs (2017-08-07–2017-09-03 representing ~78M successful queries1), I found that ~25% of queries used property paths. The vast majority of these use just a single property path, but there are queries that use as many as 19 property paths:

Pct. Count Number of Paths
74.3048% 58161337 0 paths used in query
24.7023% 19335490 1 paths used in query
0.6729% 526673 2 paths used in query
0.2787% 218186 4 paths used in query
0.0255% 19965 3 paths used in query
0.0056% 4387 7 paths used in query
0.0037% 2865 8 paths used in query
0.0030% 2327 9 paths used in query
0.0011% 865 6 paths used in query
0.0008% 604 11 paths used in query
0.0006% 434 5 paths used in query
0.0005% 398 10 paths used in query
0.0002% 156 12 paths used in query
0.0001% 110 15 paths used in query
0.0001% 101 19 paths used in query
0.0001% 56 13 paths used in query
0.0000% 12 14 paths used in query

I normalized IRIs and variable names used in the paths so that I could look at just the path operators and the structure of the paths. The type of path operators used skews heavily towards * (ZeroOrMore) as well as sequence and inverse paths that can be rewritten as simple BGPs. Here are the structures representing at least 0.1% of the paths in the dataset:

Pct. Count Path Structure
49.3632% 10573772 ?s <iri1> * ?o .
39.8349% 8532772 ?s <iri1> / <iri2> ?o .
4.6857% 1003694 ?s <iri1> / ( <iri2> * ) ?o .
1.8983% 406616 ?s ( <iri1> + ) / ( <iri2> * ) ?o .
1.4626% 313290 ?s ( <iri1> * ) / <iri2> ?o .
1.1970% 256401 ?s ( ^ <iri1> ) / ( <iri2> * ) ?o .
0.7339% 157212 ?s <iri1> + ?o .
0.1919% 41110 ?s ( <iri1> / ( <iri2> * ) ) / ( ^ <iri3> ) ?o .
0.1658% 35525 ?s <iri1> / <iri2> / <iri3> ?o .
0.1496% 32035 ?s <iri1> / ( <iri1> * ) ?o .
0.1124% 11889 ?s ( <iri1> / <iri2> ) / ( <iri3> * ) ?o .

There are also some rare but interesting uses of property paths in these logs:

Pct. Count Path Structure
0.0499% 5274 ?s ( ( <iri1> / ( <iri2> * ) ) / ( <iri3> / ( <iri2> * ) ) ) / ( <iri4> / ( <iri2> * ) ) ?o .
0.0015% 157 ?s ( <iri1> / <iri2> / <iri3> / <iri4> / <iri5> / <iri6> / <iri7> / <iri8> / <iri9> ) * ?o .
0.0003% 28 ?s ( ( ( ( <iri1> / <iri2> / <iri3> ) ? ) / ( <iri4> ? ) ) / ( <iri5> * ) ) / ( <iri6> / ( <iri7> ? ) ) ?o .

Without further investigation it’s hard to say if these represent meaningful queries or are just someone playing with SPARQL and/or Wikidata, but I found them curious.

  1. These numbers don’t align exactly with the Wikidata query dumps as there were some that I couldn’t parse with my tools. ↩︎

SPARQL Blank Nodes

August 5th, 2017 11:40 AM

I recently ran across what I believe to be a mistake in the SPARQL 1.1 Query Language, and thought I’d add some detail here.

Section 5.1.1 of the SPARQL 1.1 specification says:

When using blank nodes of the form _:abc, labels for blank nodes are scoped to the basic graph pattern. A label can be used in only a single basic graph pattern in any query.

However, during the parsing of a SPARQL property paths into the SPARQL algebra, many property path expressions do not result in basic graph patterns. Only those expressions that result in “adjacent triple patterns” produce a basic graph pattern. That means that a graph pattern such as:

_:s <p> ?x ;
    <q>* ?y .

does not result in a BGP. Intuitively, I think this should be allowed. It’s intention seems clear. However, it results in two primary algebraic components: a basic graph pattern with the triple pattern :s <p> ?x, and a property path :s ZeroOrMorePath(<q>) ?y. This certainly breaks the rule about only using blank nodes in a single basic graph pattern.

The language in section 5.1.1 originated in SPARQL 1.0, and I believe was just overlooked during the update to the language that added property paths.

When handling blank node labels, instead of following the exact language of the specification, I believe SPARQL implementations should instead allow blank node labels that appear in any adjacent set of basic graph patterns and property paths.

Andy Seaborne helpfully added this issue to the SPARQL 1.1 Errata.

Feature for SPARQL 1.2

July 6th, 2017 6:15 PM

Jindřich Mynarz recently posted a good list of “What I would like to see in SPARQL 1.2” and I thought I’d add a few comments as well as some of my own wished-for features.

Explicit ordering in GROUP_CONCAT, and quads support for both the HTTP Graph Store Protocol and CONSTRUCT queries (items 2, 5, and 8 in Jindřich’s list) seem like obvious improvements to SPARQL with a clear path forward for semantics and implementation.

Here are the some of the other wished-for features:

  • Explicitly specify the REDUCED modifier (#1)

    As an implementor, I quite like the fact that REDUCED is “underspecified.” It allows optimization opportunities that are much cheaper than a full DISTINCT would be, while still reducing result cardinality. I think it’s unfortunate that REDUCED hasn’t seen much use over the years, but I’m not sure what a better-specified REDUCED operator would do different from DISTINCT.

  • Property path quantifiers (#3)

    The challenge of supporting path quantifiers like elt{n,m} is figuring out what the result cardinality should be. The syntax for this was standardized during the development of SPARQL 1.1, but we couldn’t find consensus on whether elt{n,m} should act like a translation to an equivalent BGP/UNION pattern or like the arbitrary length paths (which do not introduce duplicate results). For small values of n and m, the translation approach seems natural, but as they grow, it’s not obvious that use cases would only want the translation semantics and not the non-duplicating semantics.

    Perhaps a new syntax could be developed which would allow the query author to indicate the desired cardinality semantics.

  • Date time/duration arithmetic functions (#6)

    This seems like a good idea, and very useful to some users, though it would substantially increase the size and number of the built-in functions and operators.

  • Support for non-scalar-producing aggregates (#9)

    I’m interested to see how this plays out as a SPARQL extension in systems like Stardog. It likely has a lot of interesting uses, but I worry that it would greatly complicate the query and data models, leading to calls to extend the semantics of RDF, and add new query forms, operators, and functions.

  • Structured serialization format for SPARQL queries (#10)

    I’m indifferent to this. I suspect some people would benefit from such a format, but I don’t think I’ve ever had need for one (where I couldn’t just parse a query myself and use the resulting AST) and it would be another format to support for implementors.

Beyond that, here are some other things I’d like to see worked on (either standardization, or cross-implementation support):

  • Support for window functions

  • Explicit support for named graphs in SERVICE blocks

    This can be partially accomplished right now for hard-coded graphs by using an endpoint url with the default-graph-uri query parameter, but I’d like more general support that could work dynamically with the active graph when the SERVICE block is evaluated.

  • Structured errors for use in the SPARQL Protocol

    My preference for this would be using the RFC7807 “Problem Details” JSON format, with a curated list of IRIs and associated metadata representing common error types (syntax errors, query-to-complex or too-many-requests refusals, etc.). There’s a lot of potential for smarter clients if errors contain structured data (e.g. SPARQL editors can highlight/fix syntax issues; clients could choose alternate data sources such as triple pattern fragments when the endpoint is overwhelmed).

SPARQL Limit by Resource

March 20th, 2017 7:17 PM

As part of work on the Attean Semantic Web toolkit, I found some time to work through limit-by-resource, an oft-requested SPARQL feature and one that my friend Kjetil lobbied for during the SPARQL 1.1 design phase. As I recall, the biggest obstacle to pursuing limit-by-resource in SPARQL 1.1 was that nobody had a clear idea of how to fit it nicely into the existing SPARQL syntax and semantics. With hindsight, and some time spent working on a prototype, I now suspect that this was because we first needed to nail down the design of aggregates and let aggregation become a first-class feature of the language.

Now, with a standardized syntax and semantics for aggregation in SPARQL, limit-by-resource seems like just a small enhancement to the existing language and implementations by the addition of window functions. I implemented a RANK operator in Attean, used in conjunction with the already-existing GROUP BY. RANK works on groups just like aggregates, but instead of producing a single row for each group, the rows of the group are sorted, and given an integer rank which is bound to a new variable. The groups are then “un-grouped,” yielding a single result set. Limit-by-resource, then, is a specific use-case for ranking, where groups are established by the resource in question, ranking is either arbitrary or user-defined, and a filter is added to only keep rows with a rank less than a given threshold.

I think the algebraic form of these operations should be relatively intuitive and straightforward. New Window and Ungroup algebra expressions are introduced akin to Aggregation and AggregateJoin, respectively. Window(G, var, WindowFunc, args, order comparators) operates over a set of grouped results (either the output of Group or another Window), and Ungroup(G) flattens out a set of grouped results into a multiset.

If we wanted to use limit-by-resource to select the two eldest students per school, we might end up with something like this:

        ?rank <= 2,
                Group((?school), BGP(?p :name ?name . ?p :school ?school . ?p :age ?age .)),
    {?age, ?name, ?school}

Students with their ages and schools are matched with a BGP. Grouping is applied based on the school. Rank with ordering by age is applied so that, for example, the result for the eldest student in each school is given ?rank=1, the second eldest ?rank=2, and so on. Finally, we apply a filter so that we keep only results where ?rank is 1 or 2.

The syntax I prototyped in Attean allows a single window function application applied after a GROUP BY clause:

SELECT ?age ?name ?school WHERE {
    ?p :name ?name ;
       :school ?school ;
       :age ?age .
GROUP BY ?school
RANK(DESC(?age)) AS ?rank
HAVING (?rank <= 2)

However, a more complete solution might align more closely with existing SQL window function syntaxes, allowing multiple functions to be used at the same time (appearing syntactically in the same place as aggregation functions).

SELECT ?age ?name ?school WHERE {
    ?p :name ?name ;
       :school ?school ;
       :age ?age .
GROUP BY ?school


SELECT ?age ?name ?school WHERE {
        SELECT ?age ?name ?school (RANK(GROUP BY ?school ORDER BY DESC(?age)) AS ?rank) WHERE {
            ?p :name ?name ;
               :school ?school ;
               :age ?age .
    FILTER(?rank <= 2)

Swifty Serd

February 17th, 2017 10:04 AM

I recently found myself wanting a simple and efficient way to parse RDF files in Swift, and ended up writing a small library that uses David Robillard’s excellent Serd library for RDF syntax. The resulting projects (available on GitHub) are swift-serd, a low-level wrapper around the C API, and SerdParser, a more Swift-y higher-level API that allows parsing of RDF content and handling the resulting triples.

The resulting code benefits from serd’s performance and standards-compliance, and allows very simple parsing of N-Triples and Turtle files:

import SerdParser

// extract all foaf:name triples
let parser = SerdParser()
let count = try parser.parse(file: filename) { (s, p, o) in
    if case .iri("") = p {
        print("\(s) has name \(o) .")
print("\(count) triples processed")


November 8th, 2016 11:47 PM

I’m devastated and heartbroken. I’m worried about the people in my life who would not have health insurance if not for Obamacare. I’m worried for Muslims, immigrants, the disabled, women, and anyone else who will be on the receiving end of the harassment and intolerance that has been normalized by our President-elect. I’m worried that we will dangerously backtrack on progress made in fighting climate change, and waste time obstinately defending fossil fuels and opposing green alternatives. I’m worried about living in a world where truth and facts don’t seem to carry any weight, where science is viewed with skepticism, and experts are dismissed out of hand.

I’m worried, but resolved. Turns out, there’s a lot of work to do.

Election Day

November 8th, 2016 6:46 AM

President Obama:

We are not as divided as our politics suggests

I’m drawing inspiration from this. Regardless of what happens today, there will be a lot of work to do. I hope we can come together, with the understanding that we all want to make this country better. That will require listening to others, and trying to understand their concerns. And it will require being open to facts and evidence, even when they contradict beliefs. It’s election day. Let’s do this.


December 31st, 2015 7:56 PM

At the airport for the last flight of 2015.

2015 Travel Map

Happy New Year!

ISWC 2014

October 28th, 2014 6:44 PM

Last week I attended ISWC 2014 in Riva del Garda and had a very nice time.

Riva del Garda

I presented a paper at the Developers Workshop about designing the new PerlRDF and SPARQLKit systems using trait-based programming. Covering more PerlRDF work, Kjetil presented his work on RDF::LinkedData.

Answering SPARQL Queries over Databases under OWL 2 QL Entailment Regime by Kontchakov, et al. caught my attention during the main conference. The authors show a technique to map query answering under OWL QL entailment to pure SPARQL 1.1 queries under simple entailment, allowing OWL QL reasoning to be used with SPARQL 1.1 systems that do not otherwise support OWL. An important challenge highlighted by the work (and echoed by the audience after the presentation) was the limited performance of existing property path implementations. It seems as if there might be a subset of SPARQL 1.1 property paths (e.g. with limited nesting of path operators) that could be implemented more efficiently while providing enough expressivity to allow things like OWL reasoning.

There seemed to be a lot of interest in more flexible solutions to SPARQL query answering. Ruben’s Linked Data Fragments work addresses one part of this issue (though I remain somewhat skeptical about it as a general-purpose solution). That work was received very well at the conference, winning the Best Demo award. In conversations I had, there was also a lot of interest in the development of SPARQL systems that could handle capacity and scalability issues more gracefully (being smart enough to use appropriate HTTP error codes when a query cannot be answered due to system load or the complexity of the query, allowing a fallback to linked data fragments, for example).

A Year of Travel

December 19th, 2013 8:57 PM

I flew 52 flights in 2013, totaling 68,266 miles.

Air Tahiti Nui Wing

45 flights, and 60,817 of those miles were a part of our around-the-world trip.

2013 Travel Map

Upon returning to the US we moved to California, driving across the country in October.

You're not in Kansas anymore! foursquare badge

I expect there’ll be quite a bit less traveling in 2014, but hopefully just as much adventure.