I'm trying to build the tree graph of Wikipedia articles and their categories. What do I need in order to do that?
From this site (http://dumps.wikimedia.org/enwiki/latest/), I've downloaded:
- enwiki-latest-page.sql.gz
- enwiki-latest-categorylinks.sql.gz
- enwiki-20141106-category.sql.gz
I tried following the answer here (Wikipedia Category Hierarchy from dumps), but the categorylinks table doesn't seem to have the same schema (no pageId column).
What's the right way to build the hierarchy?
Bonus question: How can I tell which of the 35M pages in enwiki-latest-page.sql.gz are articles (supposedly about 5M according to Wikipedia statistics)?
Thanks
I ran into the same problem with the Japanese Wikipedia.
I solved this problem as follows:
Yes, it turns out this Stack Overflow answer was right. It referenced the right datasets, but I was too dense to understand how to relate them to each other.
Thanks to @svick for leading me through the individual steps in a private chat.
For the benefit of others, I've explicitly detailed the relationship between the datasets and the exact steps to traverse the graph in a blog post that summarizes our private chat:
Parsing Wikipedia Page Hierarchy
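
For anyone who just wants the gist, here is a minimal sketch of how the two tables can be related. It is not the code from the blog post. It assumes the rows have already been parsed out of the SQL dumps into tuples, and that the dumps use the MediaWiki column names of that era: `page_id`, `page_namespace`, `page_title`, `page_is_redirect` in page, and `cl_from`, `cl_to`, `cl_type` in categorylinks. Here `cl_from` is the `page_id` of the member page and `cl_to` is the title of the containing category, which is why there is no pageId column to join on. The function names and the sample rows are made up for illustration. For the bonus question, articles roughly correspond to pages with `page_namespace = 0` that are not redirects.

```python
from collections import defaultdict, deque

NS_ARTICLE = 0    # main (article) namespace
NS_CATEGORY = 14  # "Category:" namespace


def build_hierarchy(page_rows, categorylinks_rows):
    """Relate categorylinks to page (hypothetical helper).

    page_rows:          (page_id, page_namespace, page_title, page_is_redirect)
    categorylinks_rows: (cl_from, cl_to, cl_type)
    Returns two dicts keyed by category title: child categories and articles.
    """
    pages = {pid: (ns, title, bool(redir))
             for pid, ns, title, redir in page_rows}
    subcats = defaultdict(set)
    articles = defaultdict(set)
    for cl_from, cl_to, cl_type in categorylinks_rows:
        member = pages.get(cl_from)
        if member is None:
            continue  # link from a page that is not in the page table
        ns, title, is_redirect = member
        if cl_type == "subcat" and ns == NS_CATEGORY:
            subcats[cl_to].add(title)
        elif cl_type == "page" and ns == NS_ARTICLE and not is_redirect:
            articles[cl_to].add(title)
    return subcats, articles


def articles_under(root, subcats, articles):
    """Breadth-first walk from a root category, collecting article titles.

    The `seen` set matters: the category graph is not a strict tree and
    can contain cycles.
    """
    seen = {root}
    queue = deque([root])
    found = set()
    while queue:
        cat = queue.popleft()
        found |= articles.get(cat, set())
        for child in subcats.get(cat, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return found


# Invented sample rows, only to show the shape of the data.
page_rows = [
    (1, 14, "Science", 0),
    (2, 14, "Physics", 0),
    (3, 0, "Quantum_mechanics", 0),
    (4, 0, "QM", 1),                 # redirect, not counted as an article
    (5, 1, "Quantum_mechanics", 0),  # talk page, wrong namespace
]
categorylinks_rows = [
    (2, "Science", "subcat"),  # Physics is a subcategory of Science
    (3, "Physics", "page"),
    (4, "Physics", "page"),
    (1, "Physics", "subcat"),  # deliberate cycle: Science inside Physics
]

subcats, articles = build_hierarchy(page_rows, categorylinks_rows)
print(articles_under("Science", subcats, articles))
```

With the real dumps the in-memory dicts get large, so loading both tables into a database and joining `cl_from` against `page_id` there may be more practical; the traversal logic stays the same.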