I'm trying to use this lda package to process a term-document matrix csv file with 39568 rows and 27519 columns containing counting/natural numbers only.
Problem: I'm getting a MemoryError with my approach of reading the file and storing it in a numpy array.
Goal: Get the numbers from the TDM csv file and convert them to a numpy array so I can use that array as input for lda.
import numpy as np
with open("Results/TDM - Matrix Only.csv", 'r') as matrix_file:
    matrix = np.array([[int(value) for value in line.strip().split(',')] for line in matrix_file])
I've also tried building the array with numpy's append, vstack and concatenate, and I still get the MemoryError.
Is there a way to avoid the MemoryError?
Edit:
I've tried using dtype int32 and int and it gives me:
WindowsError: [Error 8] Not enough storage is available to process this command
I've also tried using dtype float64 and it gives me:
OverflowError: cannot fit 'long' into an index-sized integer
These errors come from the following two snippets:
fp = np.memmap("Results/TDM-memmap.txt", dtype='float64', mode='w+', shape=(len(documents), len(vocabulary)))
matrix = np.genfromtxt("Results/TDM.csv", dtype='float64', delimiter=',', skip_header=1)
fp[:] = matrix[:]
and
with open("Results/TDM.csv", 'r') as tdm_file:
    vocabulary = [value for value in tdm_file.readline().strip().split(',')]
    fp = np.memmap("Results/TDM-memmap.txt", dtype='float64', mode='w+', shape=(len(documents), len(vocabulary)))
    for idx, line in enumerate(tdm_file):
        fp[idx] = np.array(line.strip().split(','))
Other info that might matter
- Win10 64bit
- 8GB RAM (7.9 usable) | usage climbs from roughly 3GB to a peak of 5.5GB (so around 2GB used by the script) before it reports MemoryError
- Python 2.7.10 [MSC v.1500 32 bit (Intel)]
- Using PyCharm Community Edition 5.0.3
Since your word counts will be almost all zeros, it would be much more efficient to store them in a scipy.sparse matrix. For example:
X is now an (ndocs, nwords) scipy.sparse.lil_matrix, and words is a list corresponding to the columns of X:
You could pass X directly to lda.LDA.fit, although it will probably be faster to convert it to a scipy.sparse.csr_matrix first:
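A minimal sketch of the steps above, assuming the CSV has a single header row of words followed by one row of integer counts per document (as in the question's Results/TDM.csv). The helper name read_sparse_csv is hypothetical, not part of scipy or lda, and the two-pass read is just one possible way to fill the matrix without ever holding a dense copy:

```python
import numpy as np
from scipy import sparse


def read_sparse_csv(path, dtype=np.int32):
    """Read a term-document CSV into a sparse LIL matrix.

    Hypothetical helper (not part of scipy or lda). Assumes the first
    line is a header of words and every other line holds integer counts.
    Returns (X, words), where X has shape (ndocs, nwords).
    """
    # First pass: read the header and count the data rows, so the
    # matrix can be allocated with its final shape.
    with open(path, 'r') as f:
        words = f.readline().strip().split(',')
        ndocs = sum(1 for line in f if line.strip())

    X = sparse.lil_matrix((ndocs, len(words)), dtype=dtype)

    # Second pass: parse one row at a time and keep only nonzero counts.
    with open(path, 'r') as f:
        f.readline()  # skip the header
        row = 0
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            counts = np.array([int(v) for v in line.strip().split(',')],
                              dtype=dtype)
            cols = np.flatnonzero(counts)  # sorted column indices
            # A LIL matrix stores, per row, a list of column indices
            # and a list of the matching values.
            X.rows[row] = cols.tolist()
            X.data[row] = counts[cols].tolist()
            row += 1
    return X, words


# Usage sketch (path from the question; n_topics is an arbitrary choice):
#   X, words = read_sparse_csv("Results/TDM.csv")
#   X = X.tocsr()                  # CSR conversion suggested above
#   model = lda.LDA(n_topics=20)
#   model.fit(X)
```

For scale, a dense 39568 x 27519 array has 1,088,871,792 entries, which is about 8.7 GB as float64 or about 4.4 GB as int32. Either is more than a 32-bit Python process can address, whereas the sparse matrix only stores the nonzero counts.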