I am using the Google Colab environment.
The file I am using is a CSV file and can be found here:
https://drive.google.com/open?id=1v7Mm6S8BVtou1iIfobY43LRF8MgGdjfU
Warning: it has several million rows.
This code runs within a minute in a Google Colab Python 3 notebook. I tried it several times with no problems.
from numpy import genfromtxt
my_data = genfromtxt('DlRefinedRatings.csv', delimiter=',' , dtype=int)
print(my_data[0:50])
The code below, on the other hand, runs for several minutes before disconnecting from Google Colab's server. I tried multiple times. Eventually Colab gives me a 'running out of memory' warning.
from numpy import genfromtxt
my_data = genfromtxt('DlRefinedRatings.csv', delimiter=',' , dtype=int, names=True)
print(my_data[0:50])
It seems that there used to be an issue with names=True in Python 3, but that issue was fixed:
https://github.com/numpy/numpy/issues/5411
I checked which version I was using in Colab, and it was up to date:
import numpy as np
print(np.version.version)
>1.14.3
With
my_data = genfromtxt('DlRefinedRatings.csv', delimiter=',' , dtype=int, max_rows=100)
I got a (100,4) int array.
With names=True
it took a long time and then issued a long list of errors, all identical except for the line number (even with max_rows):
Line #4121986 (got 4 columns instead of 3)
The header line is screwy, with an initial blank name:
In [753]: !head ../Downloads/refinedRatings.csv
,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
5,2,26,4
7,2,33,4
8,2,301,5
9,2,2686,5
10,2,3753,5
So, based on the names, it thinks there are 3 columns, but all data lines have 4; hence the error. I don't know why it ignores max_rows
in this case.
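A quick way to see the mismatch without loading several million rows is to count the fields in the header and in the first data line. This is only a sketch: `sample` is a stand-in holding the first lines shown above rather than the real file, and it merely counts fields; it does not reproduce genfromtxt's own header parsing.

```python
import csv
from io import StringIO

# Stand-in for the top of the real file (copied from the `head` output above).
sample = ",user_id,book_id,rating\n0,1,258,5\n1,2,4081,4\n"

reader = csv.reader(StringIO(sample))
header = next(reader)      # ['', 'user_id', 'book_id', 'rating']
first_row = next(reader)   # ['0', '1', '258', '5']

# Names that are not blank: the leading empty field is dropped here.
named = [name for name in header if name.strip()]

print(len(header), len(named), len(first_row))
```

The header has four fields but only three usable names, while the data line has four values, which matches the "got 4 columns instead of 3" message.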
But with my own names
In [755]: np.genfromtxt('../Downloads/refinedRatings.csv', delimiter=',', dtype=int,
     ...:               max_rows=10, names='foo,bar,dat,me')
Out[755]:
array([(-1, -1, -1, -1), ( 0, 1, 258, 5), ( 1, 2, 4081, 4),
( 2, 2, 260, 5), ( 3, 2, 9296, 5), ( 5, 2, 26, 4),
( 7, 2, 33, 4), ( 8, 2, 301, 5), ( 9, 2, 2686, 5),
(10, 2, 3753, 5)],
dtype=[('foo', '<i8'), ('bar', '<i8'), ('dat', '<i8'), ('me', '<i8')])
The first record, (-1, -1, -1, -1), is the bad header line, with -1 in place of the strings it couldn't turn into ints. So we should use skip_header to drop it.
Alternatively, skip the header and do without names:
In [756]: np.genfromtxt('../Downloads/refinedRatings.csv', delimiter=',', dtype=int,
     ...:               max_rows=10, skip_header=1)
Out[756]:
array([[ 0, 1, 258, 5],
[ 1, 2, 4081, 4],
[ 2, 2, 260, 5],
[ 3, 2, 9296, 5],
[ 5, 2, 26, 4],
[ 7, 2, 33, 4],
[ 8, 2, 301, 5],
[ 9, 2, 2686, 5],
[ 10, 2, 3753, 5],
[ 11, 2, 8519, 5]])
In sum, skip the header, and use your own names if you want a structured array.
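Putting the two pieces together might look like the sketch below. It reads from an in-memory string so it can run anywhere; the name `idx` for the unnamed first column is my own choice, not something from the file, and for the real data you would pass the file path instead of `StringIO(sample)`.

```python
from io import StringIO

import numpy as np

# Stand-in for the top of the real file (copied from the `head` output above).
sample = ",user_id,book_id,rating\n0,1,258,5\n1,2,4081,4\n2,2,260,5\n"

data = np.genfromtxt(
    StringIO(sample),
    delimiter=",",
    dtype=int,
    skip_header=1,  # drop the header line with the blank first name
    names=["idx", "user_id", "book_id", "rating"],  # "idx" is a made-up name
)

print(data.dtype.names)
print(data["book_id"])
```

Each column is then available by field name, for example `data["rating"]`.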