I am trying to parse the Freebase dump file freebase-rdf-2014-01-12-00-00.gz (25 GB) using Jena.
Jena has reported many issues with bad data.
For example, 150.0 is reported as not valid, and so are the values true and false.
I resolved these issues by adding double quotes around the decimals and the true/false values in the dump file.
However, Jena is still reporting issues (currently: org.apache.jena.riot.RiotException: [line: 161083, col: 110] Illegal object: [MINUS]).
Is there any way to preprocess this data so that I don't have to fix each issue one by one? My Java code:
// Open TDB dataset
String directory = "D:/test_dump";
Dataset dataset = TDBFactory.createDataset(directory);
// Assume we want the default model, or we could get a named model here
Model tdb = dataset.getDefaultModel();
// Read the input file - only needs to be done once
String source = "D:/test_dump/fixed-freebase-second-rdf.gz";
FileManager.get().readModel( tdb, source, "N-TRIPLES" );
Note: this is a copy of my answer from the answers.semanticweb.com question, Does the Freebase RDF dump conform to the w3 n-triples spec? The short answer is that the data is in the Turtle serialization, not N-Triples. Turtle supports various abbreviations, e.g., true for "true"^^xsd:boolean. Even in the example data on Data Dumps there's incorrect N-Triples:
It looks more like their data is in Notation 3 (N3) or Turtle format than N-Triples. In fact, this post on the freebase-discuss from Shawn Simister on 29 August 2013 says (emphasis added):
A later post (31 October 2013) touches on the boolean issues:
It's worth reading more of that thread. It's a bit frustrating, though, because when people write things like, "you can just use "true"," it's not clear whether they mean true or "true". It sounds like some people don't care that much about valid RDF, or about the difference between an untyped plain literal "true" and the boolean typed literal "true"^^xsd:boolean, which can be abbreviated as true. At any rate, the short answer looks like it's "use a Turtle or N3 parser." With the code in the question, that means passing "TURTLE" instead of "N-TRIPLES" as the language argument.
The data is in Turtle format, not N-Triples. They use various Turtle abbreviations (like true for "true"^^xsd:boolean, or the number -27 for "-27"^^xsd:integer).
There may still be errors, as their dumps have also contained illegal syntax, e.g. use of $ in prefix names without the necessary \ escape.
Adding quotes around things changes the RDF.
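To make that last point concrete, here is a minimal sketch in plain Java (no Jena required) of what the Turtle shorthand forms stand for. The class name TurtleShorthand and the method expand are hypothetical names chosen for illustration, and the xsd: prefix is assumed to be bound to the XML Schema namespace. It only classifies a single token and is not a substitute for a real Turtle parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; it is not part of Jena.
// It maps a single Turtle shorthand token to the typed literal it abbreviates.
class TurtleShorthand {
    // Lexical forms of the Turtle numeric shorthands.
    private static final Pattern INTEGER = Pattern.compile("[+-]?[0-9]+");
    private static final Pattern DECIMAL = Pattern.compile("[+-]?[0-9]*\\.[0-9]+");
    private static final Pattern DOUBLE = Pattern.compile(
        "[+-]?([0-9]+\\.[0-9]*[eE][+-]?[0-9]+|\\.[0-9]+[eE][+-]?[0-9]+|[0-9]+[eE][+-]?[0-9]+)");

    // Returns the expanded typed literal, or the token unchanged if it is
    // not one of the boolean/numeric shorthands.
    static String expand(String token) {
        if (token.equals("true") || token.equals("false")) return typed(token, "boolean");
        if (INTEGER.matcher(token).matches()) return typed(token, "integer");
        if (DECIMAL.matcher(token).matches()) return typed(token, "decimal");
        if (DOUBLE.matcher(token).matches()) return typed(token, "double");
        return token;
    }

    // Builds "lexical"^^xsd:type (assumes the xsd: prefix is declared).
    private static String typed(String lexical, String xsdType) {
        return "\"" + lexical + "\"^^xsd:" + xsdType;
    }

    public static void main(String[] args) {
        for (String token : new String[] {"true", "-27", "150.0"}) {
            String quotedOnly = "\"" + token + "\"";   // what adding quotes produces
            String expanded = expand(token);           // what the shorthand means
            System.out.println(token + "  quoted: " + quotedOnly + "  actual: " + expanded);
        }
    }
}
```

Wrapping 150.0 in quotes produces the plain literal "150.0", whereas the unquoted token in the dump denotes "150.0"^^xsd:decimal. These are different RDF terms, which is why quoting the values by hand silently changes the data instead of repairing it.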