Import Freebase to Triplestore

Posted 2020-05-27 08:42

I'm currently planning a large project that involves big data.

I have already searched, and every result tells me that it's not possible to import Freebase into any triplestore without third-party tools such as BaseKB or Freebase to RDF.

As far as I can see, the dump is already available as RDF, so what is the problem with importing the dump into my 4store triplestore and accessing the data via SPARQL?

3 answers
我命由我不由天
#2 · 2020-05-27 09:17

You are probably getting search results from at least two, if not three, different data sets:

  1. the old quad format dump
  2. the early RDF dumps
  3. (perhaps) the current RDF dump

The format in #1 is the one that required conversion. The early RDF dumps (#2) were syntactically invalid, so they wouldn't import into most tools. The RDF dump has been improving over time. I'm not sure whether it's still true that it won't import at all without preprocessing, but, regardless, it will almost certainly be more useful if you pre-process it to remove redundancy, normalize it to the format that works best for your application, etc.

Did you try importing the current dump? What were your results?
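As a rough illustration of that kind of pre-processing, here is a minimal Python sketch that streams a gzipped dump and drops triples whose predicate you don't need. It assumes one triple per line with tab-separated subject, predicate and object. The predicate in `SKIP_PREDICATES` and the function names are placeholders for illustration, not a vetted list or an existing tool.

```python
import gzip

# Assumption: predicates the application does not need. This name is
# illustrative only, not a vetted list of redundant Freebase predicates.
SKIP_PREDICATES = {"ns:type.type.instance"}


def filter_triples(lines, skip_predicates=SKIP_PREDICATES):
    """Yield dump lines whose predicate is not in skip_predicates.

    Assumes one triple per line with tab-separated subject, predicate
    and object. Lines that do not fit that shape are passed through.
    """
    for line in lines:
        fields = line.split("\t")
        if len(fields) >= 3 and fields[1] in skip_predicates:
            continue
        yield line


def filter_dump(src_path, dst_path):
    """Stream a gzipped dump through filter_triples without loading it all."""
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        dst.writelines(filter_triples(src))
```

Because it works line by line, memory use stays flat regardless of dump size. Whether a given predicate is really redundant depends on your application.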

干净又极端
#3 · 2020-05-27 09:34

For everybody having problems importing the Freebase dump:

1) Keep your RDF/Turtle parser up to date. (The latest version of raptor2 can recognize the '.', e.g. in ns:common.topic.notable_for.example.)

2) The dump must be cleaned up before you can import it. I used this script: http://people.apache.org/~andy/Freebase20121223/ (fixit)

3) The Turtle specification only allows these characters in URIs:

IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'

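So it's very important to add one line to the fixit script at line 80, between the existing substitutions and the assignment to $obj:

$X =~ s/\\>/%3E/g ;
$X =~ s/\\.//g ;

# Add this line: delete every character the grammar above disallows
$X =~ s/[\x00-\x20\<\>\"\{\}\|\^\`]//g ;

$obj = "<".$X.">" ;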

As a result, invalid syntax like this:

<http://www.wikipedia.org/object?key={invalid_braces}>

becomes

<http://www.wikipedia.org/object?key=invalid_braces>
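For anyone who would rather not patch the Perl script, the same clean-up step can be sketched in Python. This is only a minimal illustration, not part of fixit: the names `DISALLOWED` and `clean_iri` are made up here. As in the added Perl line, the backslash is left alone, so existing `\uXXXX` escapes survive.

```python
import re

# Characters the Turtle IRIREF production excludes between '<' and '>'.
# The backslash is deliberately not listed, so \uXXXX escapes are kept.
DISALLOWED = re.compile(r'[\x00-\x20<>"{}|^`]')


def clean_iri(term):
    """Strip disallowed characters from a '<...>' term and re-wrap it."""
    if not (term.startswith("<") and term.endswith(">")):
        return term  # not an IRI term; leave it untouched
    inner = term[1:-1]
    return "<" + DISALLOWED.sub("", inner) + ">"


print(clean_iri("<http://www.wikipedia.org/object?key={invalid_braces}>"))
# prints <http://www.wikipedia.org/object?key=invalid_braces>
```

Note that this deletes the offending characters rather than percent-encoding them, so two URIs that differ only in those characters would collapse into one.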
干净又极端
#4 · 2020-05-27 09:35

The problem with the Freebase Turtle dump is that it is not compliant with the W3C Turtle specification.

1) According to http://www.w3.org/TR/turtle/#sec-grammar, the character '.' can only appear at the end of a triple, but the Freebase dump has lots of '.' characters before the end of the triple. I read somewhere that '/' is not allowed outside a URI either, so they chose to use '.' instead.

The latest raptor2 library can work around this use of '.', but older versions cannot.

2) I think the way blank nodes are emitted is also invalid. For example, line 141567 reads: ns:m.01000m1 ns:common.topic.notable_for .
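To find statements like the one on line 141567 before handing the dump to a parser, a rough Python check along these lines may help. It is only a heuristic sketch. It assumes one statement per line, so Turtle's multi-line `;` and `,` forms would be flagged falsely. The names `TERM`, `count_terms` and `incomplete_statements` are invented for this example.

```python
import re

# One RDF term: an <IRI>, a quoted literal with an optional language tag
# or datatype, or any other run of non-whitespace (e.g. a prefixed name).
TERM = re.compile(r'<[^>]*>|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^\S+)?|\S+')


def count_terms(line):
    """Count the terms on one statement line, ignoring the final '.'."""
    body = line.strip()
    if body.endswith("."):
        body = body[:-1]  # drop the statement terminator
    return len(TERM.findall(body))


def incomplete_statements(lines):
    """Yield (line_number, line) for statements with fewer than three terms."""
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith(("@", "#")):
            continue  # skip blank lines, directives and comments
        if count_terms(stripped) < 3:
            yield number, stripped
```

Run over the dump, this lists the line numbers that have a subject and predicate but no object, which you can then drop or repair during clean-up.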
