I am trying to parse the Freebase dump file freebase-rdf-2014-01-12-00-00.gz (25 GB) using Jena.
Jena has reported many issues with bad data, for example that 150.0 is not valid and that true and false values are not valid.
I resolved these by adding double quotes around the decimal and true/false values in the dump file.
However, Jena still reports errors. The current one is org.apache.jena.riot.RiotException: [line: 161083, col: 110] Illegal object: [MINUS]
Is there any way to preprocess this data so that I don't have to fix each issue one by one?
My Java code:
// Open TDB dataset
String directory = "D:/test_dump";
Dataset dataset = TDBFactory.createDataset(directory);
// Assume we want the default model, or we could get a named model here
Model tdb = dataset.getDefaultModel();
// Read the input file - only needs to be done once
String source = "D:/test_dump/fixed-freebase-second-rdf.gz";
FileManager.get().readModel( tdb, source, "N-TRIPLES" );
The data is in Turtle format, not N-Triples. They use various Turtle abbreviations (like true for "true"^^xsd:boolean or the number -27 for "-27"^^xsd:integer).
There may still be errors, as their dumps have also contained illegal syntax, e.g. use of $ in prefixed names without the necessary \ escape.
Adding quotes around things changes the RDF: "150.0" is a string literal, not the xsd:decimal value that the bare 150.0 stands for in Turtle.
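To make the difference concrete, here is a minimal sketch of what each Turtle abbreviation stands for when written out in full N-Triples form. It is only an illustration: the class name TurtleAbbreviations and the method expand are hypothetical, the patterns are simplified versions of the Turtle grammar's numeric forms, and it looks at a single object token rather than parsing a whole line.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Jena): maps a bare Turtle literal token
// to the typed literal it abbreviates, written in N-Triples syntax.
final class TurtleAbbreviations {
    private static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    // Simplified forms of Turtle's INTEGER, DECIMAL and DOUBLE tokens.
    private static final Pattern INTEGER = Pattern.compile("[+-]?[0-9]+");
    private static final Pattern DECIMAL = Pattern.compile("[+-]?[0-9]*\\.[0-9]+");
    private static final Pattern DOUBLE = Pattern.compile(
            "[+-]?([0-9]+\\.[0-9]*|\\.[0-9]+|[0-9]+)[eE][+-]?[0-9]+");

    // Returns the expanded typed literal, or the token unchanged if it is
    // not one of the abbreviated forms (e.g. an IRI or a quoted literal).
    static String expand(String token) {
        if (token.equals("true") || token.equals("false")) {
            return typed(token, "boolean");
        }
        if (INTEGER.matcher(token).matches()) {
            return typed(token, "integer");
        }
        if (DECIMAL.matcher(token).matches()) {
            return typed(token, "decimal");
        }
        if (DOUBLE.matcher(token).matches()) {
            return typed(token, "double");
        }
        return token;
    }

    // Builds "lexical"^^<datatype IRI> for the given XSD type name.
    private static String typed(String lexical, String type) {
        return "\"" + lexical + "\"^^<" + XSD + type + ">";
    }
}
```

Under these assumptions, expand("-27") gives "-27"^^<http://www.w3.org/2001/XMLSchema#integer>, whereas simply wrapping the token in quotes gives the string "-27", which is a different RDF term. In practice none of this rewriting should be needed: in the question's code, passing "TURTLE" instead of "N-TRIPLES" as the syntax name to readModel is presumably the simpler change, since a Turtle parser handles these abbreviations itself.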
Note: this is a copy of my answer to the answers.semanticweb.com question, Does the Freebase RDF dump conform to the w3 n-triples spec? The short answer is that the data is in the Turtle serialization, not N-Triples. Turtle supports various abbreviations, e.g., true for "true"^^xsd:boolean.
Even the example data on the Data Dumps page contains invalid N-Triples:
<http://rdf.freebase.com/ns/g.11vjz1ynm> <http://rdf.freebase.com/ns/measurement_unit.dated_percentage.rate> 4.5 .
It looks more like their data is in Notation 3 (N3) or Turtle format than N-Triples. In fact, this post on the freebase-discuss list from Shawn Simister on 29 August 2013 says (emphasis added):
I've been working on a new version of
the Freebase RDF dumps which will
address many of the issues that have
been discovered since we first started
publishing the data as RDF. …
The biggest change in these dumps is
that the format has switched to
N-Triples from Turtle. In practice
this a very minimal change since
N-Triples is a subset of Turtle which
follows the same one-triple-per-line
format that we have now.
A later post (31 October 2013) touches on the boolean issues:
Hmm, yeah it appears that this is a
bug. Turtle supports true and false as
equivalent to "true"^^xsd:boolean and
"false"^^xsd:boolean but even though
N-Triples is a subset of Turtle it
doesn't support the simplified boolean
syntax.
It's worth reading more of that thread. It's a bit frustrating though, because when people are writing things like, "you can just use "true"," it's not clear whether they mean true, or "true". It sounds like some of the people don't actually care about valid RDF so much, or the difference between an untyped plain literal "true" and the boolean typed literal "true"^^xsd:boolean that can be abbreviated as true. At any rate, the short answer looks like it's "use a Turtle or N3 parser."