I am working on a Wikipedia dump to compute the PageRank of roughly 5,700,000 pages.
The files are preprocessed and hence are not in XML.
They are taken from
and the format is:
from_page(1): to(12) to(13) to(14)..
from_page(2): to(21) to(22)..
from_page(5,700,000): to(xy) to(xz)
and so on. So basically this amounts to constructing a 5,700,000 × 5,700,000 matrix, which would break my 4 GB of RAM if stored densely. Since it is very sparse, it should be easier to store using scipy.sparse.lil_matrix or scipy.sparse.dok_matrix. Now my issue is:
How on earth do I go about converting the .txt file with the link information into a sparse matrix? Do I read it and build it as a normal N×N matrix and then convert it, or what? I have no idea.
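For scale, a dense float64 matrix of that size would need about 5.7e6 × 5.7e6 × 8 bytes, roughly 260 TB, so going through a normal N×N array is not an option. One possible route, as a minimal sketch rather than a definitive answer: collect only the (from, to) index pairs and hand them to scipy.sparse.coo_matrix, then convert to CSR for the matrix-vector products PageRank needs. The function name build_link_matrix and the assumption that page ids are 1-based and no larger than the page count are mine, not something the data description guarantees.

```python
import numpy as np
from scipy.sparse import coo_matrix


def build_link_matrix(pairs, n_pages):
    """Build an n_pages x n_pages CSR matrix with a 1 at [from, to] per link.

    Assumption: page ids are 1-based integers no larger than n_pages.
    """
    # Flatten the (from, to) pairs into one int64 array, shifting to 0-based.
    edges = np.fromiter(
        (idx - 1 for pair in pairs for idx in pair), dtype=np.int64
    ).reshape(-1, 2)
    rows, cols = edges[:, 0], edges[:, 1]
    data = np.ones(len(rows), dtype=np.float32)
    # COO stores only the nonzero entries; duplicate links are summed
    # when the matrix is converted to CSR.
    return coo_matrix((data, (rows, cols)), shape=(n_pages, n_pages)).tocsr()


# Tiny illustrative graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
links = [(1, 2), (1, 3), (2, 3), (3, 1)]
matrix = build_link_matrix(links, n_pages=3)
```

Memory then grows with the number of links rather than with N², which is the point of going through the coordinate format instead of filling a lil_matrix or dok_matrix entry by entry.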
Also, the links for a single page sometimes span multiple lines, so what would be the correct way to handle that?
For example, a random excerpt looks like this:
1: 2 3 5 64636 867
2:355 776 2342 676 232
3: 545 64646 234242 55455 141414 454545 43
4234 5545345 2423424545
4:454 6776
The file looks exactly like this: no commas and no delimiters apart from whitespace and the colon after the source page.
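Given that layout, one way the wrapped lines might be handled is to treat any line containing a colon as the start of a new source page and any line without one as a continuation of the previous page. The sketch below rests on exactly that assumption (it is inferred from the excerpt, not from any format specification), and the name iter_links is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def iter_links(lines: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Yield (from_page, to_page) pairs from lines shaped like the excerpt.

    Assumption: a line containing ':' starts a new source page, and a line
    without ':' continues the link list of the most recent source page.
    """
    current = None  # source page of the record currently being read
    for line in lines:
        head, sep, tail = line.partition(":")
        if sep:
            current = int(head)  # int() tolerates surrounding whitespace
            targets = tail
        else:
            targets = line  # continuation line: no "from:" prefix
        if current is None:
            continue  # ignore anything before the first "n:" header
        for token in targets.split():
            yield current, int(token)


sample = [
    "1: 2 3 5 64636 867",
    "2:355 776 2342 676 232",
    "3: 545 64646 234242 55455 141414 454545 43",
    "4234 5545345 2423424545",
    "4:454 6776",
]
pairs = list(iter_links(sample))
```

Because it is a generator, the same function can be fed an open file object and will read the file line by line instead of loading it all at once.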
Any information on sparse matrix construction and data handling across lines would be helpful.