Jena parsing issue for freebase RDF dump (Jan 2014)

Posted 2019-02-18 06:36

Question:

I am trying to parse the Freebase dump file freebase-rdf-2014-01-12-00-00.gz (25 GB) using Jena. Jena has reported many issues with bad data, for example that 150.0 is not valid and that true and false values are not valid. I resolved these by adding double quotes around the decimal and true/false values in the dump file. However, Jena is still reporting issues (currently: org.apache.jena.riot.RiotException: [line: 161083, col: 110] Illegal object: [MINUS]).

Is there any way to preprocess this data so that I don't have to fix each issue one by one? My Java code:

    // Open TDB dataset
    String directory = "D:/test_dump";
    Dataset dataset = TDBFactory.createDataset(directory);

    // Assume we want the default model, or we could get a named model here
    Model tdb = dataset.getDefaultModel();

    // Read the input file - only needs to be done once
    String source = "D:/test_dump/fixed-freebase-second-rdf.gz";
    FileManager.get().readModel( tdb, source, "N-TRIPLES" ); 

Answer 1:

The data is in Turtle format, not N-Triples. They use various Turtle abbreviations (like true for "true"^^xsd:boolean or number -27 for "-27"^^xsd:integer).

There may still be errors, as their dumps have also contained illegal syntax, e.g. use of $ in prefix names without the necessary \ escape.

Adding quotes around things changes the RDF.
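To make the difference concrete, here is a minimal sketch of what the Turtle abbreviations stand for when written out in full N-Triples form. The class and method names (`TurtleLiteralSketch`, `expand`) are hypothetical, and the patterns follow the numeric literal forms of the Turtle grammar. It handles a single object token only, so it illustrates the mapping and is not a substitute for a Turtle parser.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: class and method names are hypothetical.
class TurtleLiteralSketch {
    private static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    // Numeric literal shapes as defined by the Turtle grammar.
    private static final Pattern INTEGER = Pattern.compile("[+-]?[0-9]+");
    private static final Pattern DECIMAL = Pattern.compile("[+-]?[0-9]*\\.[0-9]+");
    private static final Pattern DOUBLE = Pattern.compile(
            "[+-]?([0-9]+\\.[0-9]*|\\.[0-9]+|[0-9]+)[eE][+-]?[0-9]+");

    // Returns the full typed-literal form of a bare Turtle literal token,
    // or the token unchanged if it is not one of the abbreviations.
    static String expand(String token) {
        if (token.equals("true") || token.equals("false")) {
            return typed(token, "boolean");
        }
        if (INTEGER.matcher(token).matches()) {
            return typed(token, "integer");
        }
        if (DECIMAL.matcher(token).matches()) {
            return typed(token, "decimal");
        }
        if (DOUBLE.matcher(token).matches()) {
            return typed(token, "double");
        }
        return token;
    }

    // Builds "lexical"^^<datatype IRI>, the long form that N-Triples requires.
    private static String typed(String lexical, String type) {
        return "\"" + lexical + "\"^^<" + XSD + type + ">";
    }
}
```

With this sketch, `expand("150.0")` yields `"150.0"^^<http://www.w3.org/2001/XMLSchema#decimal>`, whereas quoting by hand yields `"150.0"`, a plain string literal with no numeric datatype. That is why the edited file no longer describes the same RDF.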



Answer 2:

Note: this is a copy of my answer from the answers.semanticweb.com question, Does the Freebase RDF dump conform to the w3 n-triples spec? The short answer is that the data is in the Turtle serialization, not N-Triples. Turtle supports various abbreviations, e.g., true for "true"^^xsd:boolean.

Even in the example data on Data Dumps there's incorrect N-Triples:

    <http://rdf.freebase.com/ns/g.11vjz1ynm>  <http://rdf.freebase.com/ns/measurement_unit.dated_percentage.rate> 4.5 .

It looks more like their data is in Notation 3 (N3) or Turtle format than N-Triples. In fact, this post on freebase-discuss from Shawn Simister on 29 August 2013 says (emphasis added):

I've been working on a new version of the Freebase RDF dumps which will address many of the issues that have been discovered since we first started publishing the data as RDF. … The biggest change in these dumps is that the format has switched to N-Triples from Turtle. In practice this a very minimal change since N-Triples is a subset of Turtle which follows the same one-triple-per-line format that we have now.

A later post (31 October 2013) touches on the boolean issues:

Hmm, yeah it appears that this is a bug. Turtle supports true and false as equivalent to "true"^^xsd:boolean and "false"^^xsd:boolean but even though N-Triples is a subset of Turtle it doesn't support the simplified boolean syntax.

It's worth reading more of that thread. It's a bit frustrating though, because when people are writing things like, "you can just use "true"," it's not clear whether they mean true, or "true". It sounds like some of the people don't actually care about valid RDF so much, or the difference between an untyped plain literal "true" and the boolean typed literal "true"^^xsd:boolean that can be abbreviated as true. At any rate, the short answer looks like it's "use a Turtle or N3 parser."
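As a minimal sketch of that advice applied to the code in the question, the following reads the gzip-compressed dump with Jena's Turtle parser instead of the N-Triples one. Several things here are assumptions: the package names are those of Jena 3 and later (Jena 2.x kept the core classes under com.hp.hpl.jena), the class name `LoadFreebaseAsTurtle` and the helper `loadTurtleGz` are hypothetical, and the path to the dump is a guess based on the file name given in the question. It points at the original, unmodified dump, since quotes added by hand change the data. Any genuinely illegal syntax still in the dump may still stop the parse.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.jena.query.Dataset;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.tdb.TDBFactory;

// Illustrative sketch only: class and helper names are hypothetical.
class LoadFreebaseAsTurtle {

    // Decompress the file and parse it as Turtle, adding the triples to the model.
    static void loadTurtleGz(Model model, String path) throws IOException {
        try (InputStream in = new GZIPInputStream(new FileInputStream(path))) {
            RDFDataMgr.read(model, in, Lang.TURTLE);
        }
    }

    public static void main(String[] args) throws IOException {
        // Open the TDB dataset, as in the question
        Dataset dataset = TDBFactory.createDataset("D:/test_dump");
        try {
            Model tdb = dataset.getDefaultModel();
            // Assumed location of the original dump, without the hand-added quotes
            loadTurtleGz(tdb, "D:/test_dump/freebase-rdf-2014-01-12-00-00.gz");
        } finally {
            dataset.close();
        }
    }
}
```

Passing `Lang.TURTLE` explicitly matters here because the file name ends in .gz only, so the parser has no extension from which to infer the syntax.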