I admit that this is basically a duplicate of "Use freebase data on local server?", but I need more detailed answers than have already been given there.
I've fallen absolutely in love with Freebase. What I want now is to create a very simple Freebase clone for storing content that may not belong on Freebase itself but can be described using the Freebase schema. In other words, I want a simple and elegant way to store data the way Freebase itself does, and to use that data easily in a Python (CherryPy) web application.
Chapter 2 of the MQL reference guide states:
The database that underlies Metaweb is fundamentally different than the relational databases that you may be familiar with. Relational databases store data in the form of tables, but the Metaweb database stores data as a graph of nodes and relationships between those nodes.
Which I guess means that I should be using either a triplestore or a graph database such as Neo4j? Does anybody here have any experience with using one of those from a Python environment?
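To make the "graph of nodes and relationships" idea concrete, here is a minimal sketch of a toy in-memory triple store in plain Python. It is not how Freebase or any real graph database is implemented; the class name, ids and property names are all made up for illustration.

```python
from collections import defaultdict


class TripleStore:
    """Toy in-memory graph: each fact is a (subject, predicate, object) triple."""

    def __init__(self):
        self._triples = set()
        self._by_subject = defaultdict(set)

    def add(self, subject, predicate, obj):
        triple = (subject, predicate, obj)
        self._triples.add(triple)
        self._by_subject[subject].add(triple)

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the pattern; None is a wildcard."""
        if subject is not None:
            candidates = self._by_subject.get(subject, set())
        else:
            candidates = self._triples
        return sorted(
            t for t in candidates
            if (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


store = TripleStore()
# Hypothetical ids and property names, for illustration only.
store.add("/en/blade_runner", "/type/object/type", "/film/film")
store.add("/en/blade_runner", "/film/film/directed_by", "/en/ridley_scott")
store.add("/en/ridley_scott", "/type/object/type", "/people/person")

# Pattern query: everything typed as a film.
films = store.match(predicate="/type/object/type", obj="/film/film")
```

Every fact is just one row of three values, so adding a new kind of relationship needs no schema change, which is the main contrast with a table-per-type relational layout.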
(What I've actually tried so far is to create a relational database schema which would be able to easily store Freebase topics, but I'm having issues with configuring the mappings in SQLAlchemy).
Things I'm looking into
UPDATE [28/12/2011]:
I found an article on the Freebase blog that describes the proprietary tuple store / database Freebase themselves use (graphd): http://blog.freebase.com/2008/04/09/a-brief-tour-of-graphd/
And this is the extra code for my other answer. The meat is in edb.py. Run it from a Python console and follow the examples, or use the web2py controller and run it in your browser.
Save this as edb.py:
And here is a sample web2py controller (just copy edb.py in the web2py models directory):
Have a look at https://cayley.io. I believe it is written by the same author and uses the same principles as graphd, the backend of Freebase, before Google killed it.
Regarding the data, you will probably want to run something like this to clean up the Freebase DB dumps, or use datahub.
SPARQL is the query language for RDF; it lets you write SQL-like queries. Most RDF databases implement SPARQL interfaces. Moreover, Freebase allows you to export data as RDF, so you could potentially load that data directly into an RDF database and query it with SPARQL.
I would have a look at this tutorial to get a better sense of SPARQL.
If you are going to handle a big dataset like Freebase, I would use 4store together with any of the Python clients. 4store exposes SPARQL via HTTP: you make HTTP requests to assert, remove and query data. It can also return result sets as JSON, which is really handy with Python. I have used this infrastructure in several projects, with Django rather than CherryPy, but I guess that difference doesn't really matter.
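As a rough sketch of what querying over HTTP looks like from Python, the snippet below builds a standard SPARQL Protocol GET request and flattens a SPARQL JSON results document into plain dicts. It uses only the standard library. The endpoint URL is an assumption (host, port and path depend on how your server was started), and the helper names are mine, not part of any client library.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; adjust host, port and path to your own server.
ENDPOINT = "http://localhost:8000/sparql/"


def build_query_request(query, endpoint=ENDPOINT):
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def parse_bindings(payload):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(payload)
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in doc["results"]["bindings"]
    ]


def run_query(query, endpoint=ENDPOINT):
    """Send the query and return the parsed rows (requires a running server)."""
    with urllib.request.urlopen(build_query_request(query, endpoint)) as response:
        return parse_bindings(response.read().decode("utf-8"))


# Parsing can be tried offline against a hand-written sample document.
SAMPLE = (
    '{"head": {"vars": ["s"]}, "results": {"bindings": '
    '[{"s": {"type": "uri", "value": "http://example.org/a"}}]}}'
)
rows = parse_bindings(SAMPLE)
```

With a server running, `run_query("SELECT * WHERE { ?s ?p ?o } LIMIT 10")` would return a list of dicts keyed by variable name. The same request shape should work against any endpoint that follows the SPARQL Protocol, since nothing here is specific to one store.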
My 2 cents...
I use a little bit of Java code to convert the Freebase data dump into RDF: https://github.com/castagna/freebase2rdf
I use Apache Jena's TDB store to load the RDF data and Fuseki to serve the data via SPARQL protocol over HTTP.
See also:
Good news for Freebase dump users: Freebase now offers an RDF dump: http://wiki.freebase.com/wiki/Data_dumps . It is in Turtle format, so it is very convenient to use with any graph database designed for RDF.
My suggestion is also 4store: http://4store.org/ . It is simple and easy to use. You can run SPARQL operations through plain HTTP requests.
One tricky thing in my project is that the "." used in the Freebase dump (to represent shortened URLs) is not recognized by 4store. So I add angle brackets "<>" to all the columns containing "." and deal with the shortened URLs myself.
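A minimal sketch of that kind of rewrite, under some assumptions: each triple is one tab-separated line ending in ".", shortened names use an `ns:` prefix, and that prefix stands for `http://rdf.freebase.com/ns/`. Check the prefix declarations at the top of your own dump before relying on this; the function names are hypothetical.

```python
# Assumed prefix mapping; check the @prefix lines of your own dump.
PREFIXES = {"ns:": "http://rdf.freebase.com/ns/"}


def expand_term(term):
    """Rewrite a dotted prefixed name as a full IRI in angle brackets."""
    if term.startswith(("<", '"')):
        return term  # already an IRI or a literal: leave untouched
    for prefix, base in PREFIXES.items():
        if term.startswith(prefix) and "." in term:
            return "<" + base + term[len(prefix):] + ">"
    return term


def fix_line(line):
    """Expand every tab-separated column of one triple line."""
    body = line.rstrip("\n")
    terminator = ""
    if body.endswith("."):
        # Keep the statement-ending dot out of the last column.
        body, terminator = body[:-1].rstrip(), "."
    columns = [expand_term(col) for col in body.split("\t")]
    return "\t".join(columns) + ("\t" + terminator if terminator else "")


fixed = fix_line('ns:m.0cwtm\tns:type.object.name\t"Example"@en\t.\n')
```

Literals and terms that are already bracketed pass through unchanged, so the function can be run over every line of the file without first sorting lines by kind.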
This is what worked for me. It allows you to load all of a Freebase dump in a standard MySQL installation on less than 100GB of disk. The key is understanding the data layout in a dump and then transforming it (optimizing it for space and speed).
Freebase notions you should understand before you attempt to use this (all taken from the documentation):
Some other important Freebase specifics:
[{'id':'/','mid':null}]
'/m/0cwtm' is a human);
'/m/03lmb2f' of type '/film/performance' is NOT a Topic (I choose to think of these as what Blank Nodes in RDF are, although this may not be philosophically accurate), while '/m/04y78wb' of type '/film/director' (among others) is;
Transforms
(see the Python code at the bottom)
TRANSFORM 1 (from shell, split links from namespaces ignoring notable_for and non /lang/en text):
TRANSFORM 2 (from Python console, split freebase_ns.tsv into freebase_ns_types.tsv, freebase_ns_props.tsv plus 15 others which we ignore for now)
TRANSFORM 3 (from Python console, convert property and destination to mids)
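The general idea of this step can be sketched as follows. This is an illustration only, not the code from e.py or parse.py: the row layouts, function names and mids below are assumptions. It builds a lookup from human-readable ids to mids, then rewrites the property and destination columns wherever a mapping exists.

```python
def build_key_index(namespace_rows):
    """Map each human-readable id (namespace + '/' + key) to its mid."""
    index = {}
    for mid, namespace, key in namespace_rows:
        full_id = namespace.rstrip("/") + "/" + key
        index[full_id] = mid
    return index


def to_mids(link_rows, index):
    """Replace property and destination with mids where a mapping exists."""
    for source, prop, destination in link_rows:
        # Values that are already mids, or have no known key, pass through.
        yield source, index.get(prop, prop), index.get(destination, destination)


# Made-up rows and mids, for illustration only.
namespace_rows = [
    ("/m/0aaa1", "/film/film", "directed_by"),
    ("/m/0bbb2", "/en", "ridley_scott"),
]
link_rows = [("/m/0ccc3", "/film/film/directed_by", "/en/ridley_scott")]

index = build_key_index(namespace_rows)
converted = list(to_mids(link_rows, index))
```

For a full dump the lookup dict gets large, so in practice you would likely restrict it to the namespaces you actually need or keep it in an on-disk store.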
TRANSFORM 4 (from MySQL console, load freebase_links_mids.tsv, freebase_ns_props_mids.tsv and freebase_ns_types.tsv in DB):
Code
Save this as e.py:
Save this as parse.py:
Notes:
e.get_namespaced_data( 'freebase_ns_types.tsv' )
And the standard disclaimer here: it has been a few months since I did this. I believe it is mostly correct, but I apologize if my notes missed something. Unfortunately the project I needed it for fell through the cracks, but I hope this helps someone else. If something isn't clear, drop a comment here.