The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?
This is not in Pandas, but it achieves the same result much faster through bash:
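A minimal sketch of the command (the file names are placeholders; on macOS, the GNU coreutils version is typically installed as `gshuf`):

```bash
# Draw 10,000 random lines from a large CSV without shuffling the whole file in memory.
shuf -n 10000 input.csv > sample.csv
```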
The `shuf` command will shuffle the input, and the `-n` argument indicates how many lines we want in the output. Relevant question: https://unix.stackexchange.com/q/108581
Benchmark on a 7M-line CSV available here (2008): timing the top answer from that question against `shuf` shows that `shuf` is about 12x faster and, importantly, does not read the whole file into memory.

Assuming no header in the CSV file:
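A sketch of that approach, assuming the total number of rows `n` is known in advance (the file name and counts here are placeholders):

```python
import random

import pandas as pd

n = 7_000_000  # total number of rows in the file, assumed known (placeholder)
s = 10_000     # desired sample size
filename = "data.csv"  # placeholder path

# Skip a random selection of n - s row indices; read_csv keeps the remaining s rows.
skip = sorted(random.sample(range(n), n - s))
df = pd.read_csv(filename, header=None, skiprows=skip)
```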
It would be better if `read_csv` had a `keeprows` argument, or if `skiprows` took a callback function instead of a list.
With header and unknown file length:
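A sketch for that case, counting the lines in one cheap pass first (names are placeholders):

```python
import random

import pandas as pd

filename = "data.csv"  # placeholder path
s = 10_000             # desired sample size

# One cheap pass to count the data rows (excluding the header).
with open(filename) as f:
    n = sum(1 for _ in f) - 1

# Skip n - s of the data rows; row 0 (the header) is never in the skip list,
# so read_csv still picks up the column names.
skip = sorted(random.sample(range(1, n + 1), n - s))
df = pd.read_csv(filename, skiprows=skip)
```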
Use `subsample`:
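Assuming this refers to the `subsample` package on PyPI, usage would look roughly like this (file names are placeholders):

```bash
pip install subsample
subsample -n 10000 big.csv > sample.csv
```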
The following code first reads the header, and then takes a random sample of the other lines:
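A sketch of that idea, assuming the number of data rows is known up front (all names are placeholders). Since row 0 is never in the skip array, `read_csv` parses the header first and then only the sampled lines:

```python
import numpy as np
import pandas as pd

filename = "big.csv"   # placeholder path
n_rows = 7_000_000     # number of data rows, assumed known here (placeholder)
sample_size = 10_000

# Rows to skip: everything except sample_size randomly chosen data rows.
# Row 0 (the header) is never in this array, so it is always read.
lines_to_skip = np.random.choice(
    np.arange(1, n_rows + 1), size=n_rows - sample_size, replace=False
)
df = pd.read_csv(filename, skiprows=lines_to_skip)
```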
Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once.
Say you want m samples. First, the algorithm keeps the first m lines. When it sees the i-th line (i > m), with probability m/i it uses that line to replace a uniformly chosen line in the current sample.
By doing so, for any i > m, we always have a subset of m samples randomly selected from the first i lines.
See code below:
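A minimal sketch of that reservoir-sampling idea (the file name and sample size are placeholders, and I assume the first line is a header that should be kept aside):

```python
import io
import random

import pandas as pd

def reservoir_sample(path, m, seed=None):
    """Keep a uniform random sample of m lines from a file, reading it once."""
    rng = random.Random(seed)
    reservoir = []
    with open(path) as f:
        header = next(f)  # keep the header out of the sampling
        for i, line in enumerate(f, start=1):
            if i <= m:
                reservoir.append(line)       # keep the first m lines
            else:
                j = rng.randint(1, i)        # with probability m/i ...
                if j <= m:
                    reservoir[j - 1] = line  # ... replace a random kept line
    return header, reservoir

header, lines = reservoir_sample("big.csv", 10_000)
df = pd.read_csv(io.StringIO(header + "".join(lines)))
```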
No pandas!
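A sketch of what that might look like, assuming the total line count is known (all names are placeholders):

```python
import random

filepath = "my_file.csv"   # placeholder path
file_length = 7_000_000    # total number of lines, assumed known (placeholder)
sample_size = 10_000

# Pick which line numbers to keep, then stream the file once.
keep = set(random.sample(range(file_length), sample_size))

sampled_lines = []
with open(filepath) as f:
    for i, line in enumerate(f):
        if i in keep:
            sampled_lines.append(line)
```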
You'll end up with a `sampled_lines` list. What kind of statistics do you mean?