1. More Complete Resultset Retrieval from Large
Heterogeneous RDF Sources
Andre Valdestilhas Tommaso Soru Muhammad Saleem
AKSW Group, University of Leipzig, Germany
November 24, 2019
Valdestilhas et al. (AKSW) More Complete Resultset Retrieval from Large Heterogeneous RDF Sources, November 24, 2019 1 / 17
4. Motivation
Where to find RDF datasets?
9,960 raw RDF datasets
658,206 datasets (HDT files) in LODLaundromat
559 endpoints
Which dataset? ...
Different formats [1]
Query more than 221 billion triples (> 5 terabytes)
[1] Serialization.
5. Example
Where to find RDF datasets?
Authors that have a paper of type poster/demo in the proceedings of ISWC 2008 [2]
[2] Query from FedBench.
6. Example
Where to find RDF datasets?
Authors that have a paper of type poster/demo in the proceedings of ISWC 2008 [3]
4 HDT datasets [4] contain data that can answer the query
[3] Query from FedBench.
[4] Semantic Web Dog Food from LOD Laundromat datasets.
7. Motivation
Approaches
(+) Multiple SPARQL endpoints
(-) 90% are dump files
(+) Dereferenceable URIs
(-) 43% of the URIs are non-dereferenceable
[Figure: data sources (Endpoint, HDT file, file.rdf, Dump_file_2)]
WIMU (Where is my URI?)
(+) Data from non-dereferenceable URIs
(-) No SPARQL query
8. The approach
A hybrid SPARQL query engine
Collects data from multiple SPARQL endpoints
Collects data from RDF dumps, including HDT files
Uses Link Traversal, obtaining data from non-dereferenceable URIs using WIMU [a]
[a] Where is my URI? (WIMU): http://wimu.aksw.org/
Resulting in
More complete results
Experiments with 3 state-of-the-art SPARQL query benchmarks,
LargeRDFBench, FedBench and FEASIBLE
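The idea of a hybrid engine that merges answers from several kinds of sources can be illustrated with a minimal sketch. This is not the wimuQ implementation: the engine stubs, the `run_hybrid` name and the triple representation are assumptions made only for illustration.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]
Engine = Callable[[str], Iterable[Triple]]


def run_hybrid(query: str, engines: Dict[str, Engine]) -> Set[Triple]:
    """Run every engine on the query and return the deduplicated union."""
    results: Set[Triple] = set()
    for name, engine in engines.items():
        try:
            results.update(engine(query))
        except Exception as err:
            # A failing source is skipped so the others can still contribute.
            print(f"{name} failed: {err}")
    return results


# Hypothetical stand-ins for real query processors.
def endpoint_stub(query: str) -> Iterable[Triple]:
    return [("s1", "p1", "o1"), ("s2", "p2", "o2")]


def dump_stub(query: str) -> Iterable[Triple]:
    return [("s2", "p2", "o2"), ("s3", "p3", "o3")]


def broken_stub(query: str) -> Iterable[Triple]:
    raise ConnectionError("endpoint unreachable")


ENGINES = {"endpoint": endpoint_stub, "dumps": dump_stub, "traversal": broken_stub}
merged = run_hybrid("SELECT ?p ?o WHERE { <http://uri.com> ?p ?o }", ENGINES)
print(len(merged))  # 3 distinct triples; the duplicate is counted once
```

Using a set for the union means a triple returned by two engines is counted once, which matters when comparing result counts across engines.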
9. The approach
[Figure: wimuQ query execution engine]
Input query: SELECT ?p ?o WHERE { <http://uri.com> ?p ?o }
The figure numbers its steps 1 to 6; the labels that survive are:
Extract URIs (http://uri1.com, http://uri2.com, ..., http://uriN.com)
WIMU, over the available sources (endpoint, HDT file, dump.bz2, file.rdf, ...)
Source filtering
Query processors: SPARQL endpoint, SPARQL-a-lot, traversal based, data dumps
Union of the results
Results: <subject1> <predicate1> <object1>, <subject2> <predicate2> <object2>, ..., <subjectN> <predicateN> <objectN>
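The "Extract URIs" step can be sketched as follows. This is a simplification and an assumption on our part, not the wimuQ code: it only picks up full IRIs written in angle brackets and ignores prefixed names such as `foaf:name`.

```python
import re
from typing import List

# Matches http(s) IRIs written as <...> in a SPARQL query string.
URI_PATTERN = re.compile(r"<(https?://[^<>\s]+)>")


def extract_uris(sparql: str) -> List[str]:
    """Return the distinct IRIs of a query, in order of first appearance."""
    seen: List[str] = []
    for uri in URI_PATTERN.findall(sparql):
        if uri not in seen:
            seen.append(uri)
    return seen


QUERY = "SELECT ?p ?o WHERE { <http://uri.com> ?p ?o }"
print(extract_uris(QUERY))  # ['http://uri.com']
```

A production version would more likely rely on a SPARQL parser so that PREFIX declarations are expanded before the lookup.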
10. The approach
The source selection
Identify relevant datasets from WIMU
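One possible way to picture source selection is a lookup from URIs to the datasets that contain them, followed by a ranking. The sketch below uses a toy in-memory dictionary in place of the WIMU service; the index contents, the `select_sources` name and the ranking by number of matched URIs are all assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical stand-in for the WIMU index: URI -> datasets that mention it.
TOY_INDEX: Dict[str, List[str]] = {
    "http://uri1.com": ["dump_a.hdt", "dump_b.hdt"],
    "http://uri2.com": ["dump_b.hdt"],
}


def select_sources(uris: Iterable[str], index: Dict[str, List[str]], top_k: int = 2) -> List[str]:
    """Rank datasets by how many of the query URIs they contain."""
    hits: Counter = Counter()
    for uri in set(uris):
        for dataset in set(index.get(uri, [])):
            hits[dataset] += 1
    # Sort by descending hit count, then by name for a stable order.
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [dataset for dataset, _ in ranked[:top_k]]


query_uris = ["http://uri1.com", "http://uri2.com", "http://uriN.com"]
print(select_sources(query_uris, TOY_INDEX))  # ['dump_b.hdt', 'dump_a.hdt']
```

URIs that are absent from the index, like `http://uriN.com` here, simply contribute no candidate dataset.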
11. Evaluation
Hypothesis: Automatically identifying relevant sources in heterogeneous RDF data, even with non-dereferenceable URIs, can improve resultset retrieval
Metrics: Coverage and runtime
Approaches: FedX (endpoints), SQUIN (traversal-based), SPARQL-a-lot and WIMU (dumps)
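As an illustration of the coverage metric, the sketch below assumes that coverage means the share of queries returning at least one result, which matches how the deck reports "queries with results". The query names and counts are made up.

```python
from typing import Dict


def coverage(result_counts: Dict[str, int]) -> float:
    """Fraction of queries that returned at least one result."""
    if not result_counts:
        return 0.0
    answered = sum(1 for count in result_counts.values() if count > 0)
    return answered / len(result_counts)


# Made-up result counts per query, for illustration only.
toy_counts = {"q1": 12, "q2": 0, "q3": 3, "q4": 1}
print(f"{coverage(toy_counts):.0%}")  # 75%
```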
12. Evaluation
Experimental setup
Datasets: 221.7 billion triples (> 5 terabytes)
Queries: 415 queries from FedBench, LargeRDFBench and FEASIBLE
Each query executed 5 times
Hardware: 200 GB HD, 8 GB RAM, 2.70 GHz single-core processor
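Repeating each query and averaging the runtime could look like the sketch below. It is only an illustration of the measurement idea: the `average_runtime` helper and the injectable `clock` parameter are assumptions, and the deck does not say how the runs were timed or aggregated.

```python
import time
from statistics import mean
from typing import Callable


def average_runtime(run_query: Callable[[], object], repeats: int = 5,
                    clock: Callable[[], float] = time.perf_counter) -> float:
    """Execute run_query `repeats` times and return the mean duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = clock()
        run_query()
        durations.append(clock() - start)
    return mean(durations)


# Usage with a placeholder workload standing in for a real query execution.
seconds = average_runtime(lambda: sum(range(10_000)))
print(f"average over 5 runs: {seconds:.6f} s")
```

Passing the clock as a parameter keeps the helper easy to check with a fake clock, without waiting on real queries.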
13. Evaluation
Coverage: overall, 76% of the queries return results (zero results = non-public endpoints or non-indexed data)
[Figure: average number of results (log scale) per query category. Categories: CD, LS, LD (FedBench); Simple, Comp, Large, Chs (LargeRDFBench); Dbpedia, SWDF (FEASIBLE). Series: EndPoints (FedX), SPARQL-a-lot, LinkTraversal (SQUIN), wimuDumps, wimuQ.]
Approaches and the best coverage
FedBench 55% endpoints
LargeRDFBench 81% wimuDumps
FEASIBLE 98% wimuDumps
Observation
Combining these query processing engines yields more complete resultset retrieval
14. Evaluation
Number of datasets
Discovering more datasets does not imply more results
15. Evaluation
Runtime
Total Average 17 minutes across 3 benchmarks (wimuDumps 2 min, Endpoints
13 min, SPARQL-a-lot 58 sec, LinkTraversal-SQUIN 36 sec)
[Figure: average runtime in minutes (log scale) per query category. Categories: CD, LS, LD (FedBench); Simple, Comp, Large, Chs (LargeRDFBench); Dbpedia, SWDF (FEASIBLE). Series: EndPoints (FedX), SPARQL-a-lot, LinkTraversal (SQUIN), wimuDumps, wimuQ.]
Interesting: wimuQ takes 91% of its results from wimuDumps and only 7% from SPARQL endpoints. A possible reason: SPARQL endpoint federation splits the query among multiple endpoints, so the network and the number of intermediate results influence the runtime
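The total average runtime reported on this slide can be cross-checked against the per-engine figures. The sketch assumes, and this is our reading rather than something the slide states, that the 17-minute total is the sum of the four per-engine averages.

```python
# Per-engine average runtimes from the slide, converted to seconds.
PER_ENGINE_SECONDS = {
    "wimuDumps": 2 * 60,
    "Endpoints": 13 * 60,
    "SPARQL-a-lot": 58,
    "LinkTraversal-SQUIN": 36,
}

total_seconds = sum(PER_ENGINE_SECONDS.values())
minutes, seconds = divmod(total_seconds, 60)

print(f"{minutes} min {seconds} s")  # 16 min 34 s
print(round(total_seconds / 60))     # 17, matching the reported total
```

Under that assumption the endpoint federation accounts for 780 of the 994 seconds, which is consistent with the remark that network and intermediate results dominate the runtime.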
16. Conclusion & Future work
Conclusion
A hybrid SPARQL query processing engine to execute SPARQL queries over a
large amount of heterogeneous RDF data
Evaluation on real-world datasets using state-of-the-art federated and non-federated query benchmarks (FedBench, LargeRDFBench and FEASIBLE)
We present the first federated SPARQL query processing engine that executes
SPARQL queries over a total of 221.7 billion triples
Future work
Add more URIs to the WIMU index and use Triple Pattern Fragments
A large-scale approach to study the relations and similarity among the datasets
17. That’s all Folks!
Thanks!
Questions?
GitHub repository: https://github.com/firmao/wimuT
Prototype: https://w3id.org/wimuq/
Contact: valdestilhas@informatik.uni-leipzig.de
Special thanks to my PhD advisor, Prof. Dr. rer. nat. Thomas Riechert