Read a small random sample from a big CSV file int

2019-01-08 16:23发布

问题:

The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?

回答1:

Assuming no header in the CSV file:

import pandas
import random

n = 1000000 #number of records in file
s = 10000 #desired sample size
filename = "data.txt"
skip = sorted(random.sample(xrange(n),n-s))
df = pandas.read_csv(filename, skiprows=skip)

would be better if read_csv had a keeprows, or if skiprows took a callback func instead of a list.

With header and unknown file length:

import pandas
import random

filename = "data.txt"
n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
s = 10000 #desired sample size
skip = sorted(random.sample(xrange(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)


回答2:

@dlm's answer is great but since v0.20.0, skiprows does accept a callable. The callable receives as an argument the row number.

If you can specify what percent of lines you want, rather than how many lines, you don't even need to get the file size and you just need to read through the file once. Assuming a header on the first row:

import pandas as pd
import random
p = 0.01  # 1% of the lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
         filename,
         header=0, 
         skiprows=lambda i: i>0 and random.random() > p
)

Or, if you want to take every nth line:

n = 100  # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)


回答3:

This is not in Pandas, but it achieves the same result much faster through bash:

shuf -n 100000 data/original.tsv > data/sample.tsv

The shuf command will shuffle the input and the and the -n argument indicates how many lines we want in the output.

Relevant question: https://unix.stackexchange.com/q/108581

Benchmark on a 7M lines csv available here (2008):

Top answer:

def pd_read():
    filename = "2008.csv"
    n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
    s = 100000 #desired sample size
    skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
    df = pandas.read_csv(filename, skiprows=skip)
    df.to_csv("temp.csv")

%time pd_read()
CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
Wall time: 18.9 s

While using shuf:

time shuf -n 100000 2008.csv > temp.csv

real    0m1.583s
user    0m1.445s
sys     0m0.136s

So shuf is about 12x faster and importantly does not read the whole file into memory.



回答4:

Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once.

Say you want m samples. First, the algorithm keeps the first m samples. When it sees the i-th sample (i > m), with probability m/i, the algorithm uses the sample to randomly replace an already selected sample.

By doing so, for any i > m, we always have a subset of m samples randomly selected from the first i samples.

See code below:

import random

n_samples = 10
samples = []

for i, line in enumerate(f):
    if i < n_samples:
        samples.append(line)
    elif random.random() < n_samples * 1. / (i+1):
            samples[random.randint(0, n_samples-1)] = line


回答5:

The following code reads first the header, and then a random sample on the other lines:

import pandas as pd
import numpy as np

filename = 'hugedatafile.csv'
nlinesfile = 10000000
nlinesrandomsample = 10000
lines2skip = np.random.choice(np.arange(1,nlinesfile+1), (nlinesfile-nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)


回答6:

No pandas!

import random
from os import fstat
from sys import exit

f = open('/usr/share/dict/words')

# Number of lines to be read
lines_to_read = 100

# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000

def is_EOF():
    return f.tell() >= fstat(f.fileno()).st_size

# To accumulate the read lines
sampled_lines = []

for n in xrange(lines_to_read):
    bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
    f.seek(bytes_to_skip, 1)
    # After skipping "bytes_to_skip" bytes, we can stop in the middle of a line
    # Skip current entire line
    f.readline()
    if not is_EOF():
        sampled_lines.append(f.readline())
    else:
        # Go to the begginig of the file ...
        f.seek(0, 0)
        # ... and skip lines again
        f.seek(bytes_to_skip, 1)
        # If it has reached the EOF again
        if is_EOF():
            print "You have skipped more lines than your file has"
            print "Reduce the values of:"
            print "   min_bytes_to_skip"
            print "   max_bytes_to_skip"
            exit(1)
        else:
            f.readline()
            sampled_lines.append(f.readline())

print sampled_lines

You'll end up with a sampled_lines list. What kind of statistics do you mean?



回答7:

use subsample

pip install subsample
subsample -n 1000 file.csv > file_1000_sample.csv


回答8:

class magic_checker:
    def __init__(self,target_count):
        self.target = target_count
        self.count = 0
    def __eq__(self,x):
        self.count += 1
        return self.count >= self.target

min_target=100000
max_target = min_target*2
nlines = randint(100,1000)
seek_target = randint(min_target,max_target)
with open("big.csv") as f:
     f.seek(seek_target)
     f.readline() #discard this line
     rand_lines = list(iter(lambda:f.readline(),magic_checker(nlines)))

#do something to process the lines you got returned .. perhaps just a split
print rand_lines
print rand_lines[0].split(",")

something like that should work I think