Wikidata on local Blazegraph: Expected an RDF value here, found '' [line 1]

Published 2019-07-09 03:13

Question:

We (Thomas and Wolfgang) have installed Wikidata and Blazegraph locally, following the instructions here: https://github.com/wikimedia/wikidata-query-rdf/blob/master/docs/getting-started.md
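
The steps we followed boil down to roughly the following sketch (the exact name of the dist archive depends on the build and is an assumption here):

# clone and build the query service as per the getting-started guide
git clone https://github.com/wikimedia/wikidata-query-rdf.git
cd wikidata-query-rdf
mvn package
# unpack the service distribution (archive name may differ for your build)
cd dist/target
tar xvzf service-0.3.0-SNAPSHOT-dist.tar.gz
cd service-0.3.0-SNAPSHOT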

The mvn package command was successful:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] parent ............................................. SUCCESS [ 54.103 s]
[INFO] Shared code ........................................ SUCCESS [ 23.085 s]
[INFO] Wikidata Query RDF Testing Tools ................... SUCCESS [ 11.698 s]
[INFO] Blazegraph extension to improve performance for Wikibase SUCCESS [02:12 min]
[INFO] Blazegraph Service Package ......................... SUCCESS [01:02 min]
[INFO] Wikidata Query RDF Tools ........................... SUCCESS [02:19 min]
[INFO] Wikibase RDF Query Service ......................... SUCCESS [ 25.466 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS

We are both using

java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)

We both downloaded the latest-all.ttl.gz e.g.

31064651574 Jan  3 19:30 latest-all.ttl.gz

from https://dumps.wikimedia.org/wikidatawiki/entities/, which took some 4 hours.
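
For reference, the download itself is just a direct fetch of that file, e.g.:

# grab the full Turtle dump (about 31 GB) into the data directory
wget -P data https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz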

The munge step created 424 files like "wikidump-000000001.ttl.gz" in data/split:

~/wikidata/wikidata-query-rdf/dist/target/service-0.3.0-SNAPSHOT$ ./munge.sh -f data/latest-all.ttl.gz -d data/split -l en,de 
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
08:23:02.391 [main] INFO  org.wikidata.query.rdf.tool.Munge - Switching to data/split/wikidump-000000001.ttl.gz
08:24:21.249 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 10000 entities at (105, 47, 33)
08:25:07.369 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 20000 entities at (162, 70, 41)
08:25:56.862 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 30000 entities at (186, 91, 50)
08:26:43.594 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 40000 entities at (203, 109, 59)
08:27:24.042 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 50000 entities at (224, 126, 67)
08:28:00.770 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 60000 entities at (244, 142, 75)
08:28:32.670 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 70000 entities at (272, 161, 84)
08:29:12.529 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 80000 entities at (261, 172, 91)
08:29:47.764 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 90000 entities at (272, 184, 98)
08:30:20.254 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 100000 entities at (286, 196, 105)
08:30:20.256 [main] INFO  org.wikidata.query.rdf.tool.Munge - Switching to data/split/wikidump-000000002.ttl.gz
08:30:55.058 [main] INFO  org.wikidata.query.rdf.tool.Munge - Processed 110000 entities at (286, 206, 112)

When Thomas tried to load one file into Blazegraph with

./loadRestAPI.sh -n wdq -d data/split/wikidump-000000001.ttl.gz

he got the error below. Trying to import from the UPDATE tab of Blazegraph also didn't work.

What can be done to fix this?

ERROR: uri=[file:/home/tsc/projects/TestSPARQL/wikidata-query-rdf-0.2.1/dist/target/service-0.2.1/data/split/wikidump-000000001.ttl.gz], context-uri=[]
java.util.concurrent.ExecutionException: org.openrdf.rio.RDFParseException: Expected an RDF value here, found '' [line 1]
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at com.bigdata.rdf.sail.webapp.BigdataServlet.submitApiTask(BigdataServlet.java:281)
    at com.bigdata.rdf.sail.webapp.InsertServlet.doPostWithURIs(InsertServlet.java:397)
    at com.bigdata.rdf.sail.webapp.InsertServlet.doPost(InsertServlet.java:116)
    at com.bigdata.rdf.sail.webapp.RESTServlet.doPost(RESTServlet.java:303)
    at com.bigdata.rdf.sail.webapp.MultiTenancyServlet.doPost(MultiTenancyServlet.java:192)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:808)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:497)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.openrdf.rio.RDFParseException: Expected an RDF value here, found '' [line 1]
    at org.openrdf.rio.helpers.RDFParserHelper.reportFatalError(RDFParserHelper.java:441)
    at org.openrdf.rio.helpers.RDFParserBase.reportFatalError(RDFParserBase.java:671)
    at org.openrdf.rio.turtle.TurtleParser.reportFatalError(TurtleParser.java:1306)
    at org.openrdf.rio.turtle.TurtleParser.parseValue(TurtleParser.java:637)
    at org.openrdf.rio.turtle.TurtleParser.parseSubject(TurtleParser.java:449)
    at org.openrdf.rio.turtle.TurtleParser.parseTriples(TurtleParser.java:383)
    at org.openrdf.rio.turtle.TurtleParser.parseStatement(TurtleParser.java:261)
    at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:216)
    at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:159)
    at com.bigdata.rdf.sail.webapp.InsertServlet$InsertWithURLsTask.call(InsertServlet.java:556)
    at com.bigdata.rdf.sail.webapp.InsertServlet$InsertWithURLsTask.call(InsertServlet.java:414)
    at com.bigdata.rdf.task.ApiTaskForIndexManager.call(ApiTaskForIndexManager.java:68)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

Answer 1:

The loadRestAPI.sh script is basically the one mentioned in:

https://wiki.blazegraph.com/wiki/index.php/Bulk_Data_Load#Command_line

so it should be possible to use the command line tool directly instead of the REST API.
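
A minimal sketch of such a direct call, following the command line example on that wiki page (the jar name/classpath is an assumption here; use whatever jar ships Blazegraph in your dist):

# bulk load a munged chunk with the DataLoader class instead of the REST API
java -cp blazegraph.jar com.bigdata.rdf.store.DataLoader \
  -namespace wdq RWStore.properties data/split/wikidump-000000001.ttl.gz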

Also, the whole process seems quite awkward. The tooling relies on the .gz file, which is some 25% bigger than the .bz2 file and takes longer to download. Uncompressing the .bz2 file is quicker than the munge process. My assumption is that processing the uncompressed 230 GB file, e.g.

230033083334 Jan 4 07:29 wikidata-20180101-all-BETA.ttl

in "chunk-wise" fashion might work better. But first we need to see what makes the import choke.

My first issue was that the shell script runBlazegraph.sh gave an error about a missing mwservices.json.

I assume a file like https://github.com/wikimedia/wikidata-query-deploy/blob/master/mwservices.json is expected.

So I tried to fix this with

wget https://raw.githubusercontent.com/wikimedia/wikidata-query-deploy/master/mwservices.json

although I doubt this is of much relevance.

The actual call

./loadRestAPI.sh -n wdq -d data/split/wikidump-000000001.ttl.gz 
Loading with properties...
quiet=false
verbose=0
closure=false
durableQueues=true
#Needed for quads
#defaultGraph=
com.bigdata.rdf.store.DataLoader.flush=false
com.bigdata.rdf.store.DataLoader.bufferCapacity=100000
com.bigdata.rdf.store.DataLoader.queueCapacity=10
#Namespace to load
namespace=wdq
#Files to load
fileOrDirs=data/split/wikidump-000000001.ttl.gz
#Property file (if creating a new namespace)
propertyFile=/home/wf/wikidata/wikidata-query-rdf/dist/target/service-0.3.0-SNAPSHOT/RWStore.properties
<?xml version="1.0"?><data modified="0" milliseconds="493832"/>DATALOADER-SERVLET: Loaded wdq with properties: /home/wf/wikidata/wikidata-query-rdf/dist/target/service-0.3.0-SNAPSHOT/RWStore.properties

worked for me on an Ubuntu 16.04 LTS server with Java 1.8.0_151, so I believe we have to look into more details to fix Thomas' problem.

See also https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidata-query/Documentation for more details.

To check the results I used an SSH tunnel to my Ubuntu server

ssh -L 9999:localhost:9999 user@server

and then

http://localhost:9999/bigdata/namespace/wdq/sparql

in the browser of my local machine (laptop).
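
Instead of the browser, the endpoint can also be queried through the tunnel with curl, for example (the count query is just an illustration):

# ask the local Blazegraph instance for its triple count
curl -G http://localhost:9999/bigdata/namespace/wdq/sparql \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }'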

The second import also worked OK.

Then I checked the database content with the following SPARQL query:

SELECT ?type (COUNT(?type) AS ?typecount)
WHERE {
  ?subject a ?type.
}
GROUP by ?type
ORDER by desc(?typecount)
LIMIT 7

giving the result

type                                              typecount
<http://wikiba.se/ontology#BestRank>                2938060
schema:Article                                      2419109
<http://wikiba.se/ontology#QuantityValue>             78105
<http://wikiba.se/ontology#TimeValue>                 61553
<http://wikiba.se/ontology#GlobecoordinateValue>      57032
<http://wikiba.se/ontology#GeoAutoPrecision>           3462
<http://www.wikidata.org/prop/novalue/P17>              531

Given the import experience, I would say that the munge and loadRestAPI calls can be run somewhat in parallel, since the loadRestAPI step is apparently the slower one.
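
A sketch of such a loader loop, including per-file timing (loadRestAPI.sh is the script from the dist; the load-times.log file name is made up):

# load each munged chunk in order and record how long each one took
for f in data/split/wikidump-*.ttl.gz; do
  start=$(date +%s)
  ./loadRestAPI.sh -n wdq -d "$f"
  echo "$f $(( $(date +%s) - start )) secs" >> load-times.log
done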

Initially it takes some 5 minutes per gz file to import. This rate later drops, and some files actually took up to 1 hour 15 minutes on Wolfgang's server.

Loading all the data will probably take 10 days or more on Wolfgang's first machine, so please stay tuned for the final result.

Currently 358 of 440 files have been imported after 158 hours on this machine. At this point the wikidata.jnl file is 250 GBytes in size and some 1,700 million statements have been imported.

The loading statistics are quite erratic. Loading one of the *.ttl.gz files takes anything from 87 to 11496 secs on Wolfgang's machine; the average is 944 secs at this time. At certain points during the import the time per gz file jumps way up, e.g. from 805 to 4943 secs or from 4823 to 11496 secs; after that the timing seems to settle at a higher level before dropping back to as little as 293 or 511 secs. This timing behavior makes it very difficult to predict how long the full import will take.

Given that the loading took so long, Wolfgang configured a second import machine slightly differently:

  1. Machine: 8 cores, 56 GByte RAM, 6 Terabyte 5,400 rpm hard disk
  2. Machine: 8 cores, 32 GByte RAM, one 512 GByte 7,200 rpm hard disk and one 480 GByte SSD

The second machine has the data to be imported on the 7,200 rpm hard disk and the Blazegraph journal file on the SSD.
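
Putting the journal on the SSD only requires pointing the journal path in RWStore.properties at the SSD mount; assuming the stock file sets com.bigdata.journal.AbstractJournal.file=wikidata.jnl, something like this does it (the /ssd path is just an example):

# move the Blazegraph journal file to the SSD before starting the import
sed -i 's|^com.bigdata.journal.AbstractJournal.file=.*|com.bigdata.journal.AbstractJournal.file=/ssd/wikidata.jnl|' RWStore.properties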

The second machine's import shows better timing behavior; after 3.8 days the import had finished with the following statistics:

    |   days |   hours |         mins |         secs |
----+--------+---------+--------------+--------------+
MIN |  0.0 d |   0.0 h |     1.2 mins |      74 secs |      
MAX |  0.0 d |   1.1 h |    64.4 mins |    3863 secs |
AVG |  0.0 d |   0.2 h |    12.3 mins |     738 secs | 
TOT |  3.8 d |  90.2 h |  5414.6 mins |  324878 secs |

The first machine is still not finished after 10 days:

    |   days |   hours |         mins |         secs |
----+--------+---------+--------------+--------------+
SUM | 10.5 d | 252.6 h | 15154.7 mins |  909281 secs |
----+--------+---------+--------------+--------------+
MIN |  0.0 d |   0.0 h |     1.5 mins |      87 secs |
MAX |  0.3 d |   7.3 h |   440.5 mins |   26428 secs |
AVG |  0.0 d |   0.6 h |    36.4 mins |    2185 secs |
TOT | 11.1 d | 267.1 h | 16029.0 mins |  961739 secs |
----+--------+---------+--------------+--------------+
ETA |  0.6 d |  14.6 h |   874.3 mins |   52458 secs |
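
For the record, summary figures like the MIN/MAX/AVG/TOT rows above can be computed from a per-file timing log such as the one sketched earlier with a small awk one-liner (column 2 holds the seconds):

# summarize the per-file load times
awk '{s+=$2; if(min==""||$2<min)min=$2; if($2>max)max=$2}
     END{printf "MIN %d  MAX %d  AVG %d  TOT %d secs\n", min, max, s/NR, s}' load-times.log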