Python: random N lines from large file (no duplicates)

Posted 2019-01-20 12:48

I need to use Python to take N lines at random from a large txt file. These files are basically tab-delimited tables. My task has the following constraints:

  • These files may contain headers (some have multi-line headers).
  • Headers need to appear in the output in the same order.
  • Each line can be taken only once.
  • The largest file currently is about 150GB (about 60 000 000 lines).
  • Lines are roughly the same length in a file, but may vary between different files.
  • I will usually be taking 5000 random lines (but may need up to 1 000 000).

Currently I have written the following code:

import os
import random

# `options` and `args` come from the script's option parser (not shown)
inputSize = os.path.getsize(options.input)
usedPositions = []  # Start positions of the lines already written to output

with open(options.input) as input:
    with open(options.output, 'w') as output:

        # Handling of header lines
        for i in range(int(options.header)):
            output.write(input.readline())
            usedPositions.append(input.tell())

        # Find and write all random lines
        for j in range(int(args[0])):
            input.seek(random.randrange(inputSize))  # Seek to a random position (probably mid-line)
            input.readline()  # Read the (probably partial) line; the next readline() yields a complete one
            while input.tell() in usedPositions:  # Take a new line if the current one is already taken
                input.seek(random.randrange(inputSize))
                input.readline()
            usedPositions.append(input.tell())  # Record the line's start position
            randomLine = input.readline()  # Complete line
            if len(randomLine) == 0:  # Wrap around to the first data line if EOF was reached
                input.seek(0)
                for i in range(int(options.header)):  # Skip the header lines
                    input.readline()
                randomLine = input.readline()
            output.write(randomLine)

This code seems to be working correctly.

I am aware that this code favours lines that follow the longest lines in the input, because seek() is most likely to land inside a long line, and it is the next line that gets written to output. This is irrelevant, as lines in an input file are roughly the same length. I am also aware that this code loops forever if N is larger than the number of lines in the input file. I will not implement a check for this, as counting the lines takes a lot of time.

RAM and HDD limitations are irrelevant. I am only concerned about the speed of the program. Is there a way to further optimize this code? Or perhaps there is a better approach?

EDIT: To clarify, the lines in one file have roughly the same length. However, I have multiple files that this script needs to run on, and the average line length differs between them. For example, file A may have ~100 characters per line and file B ~50 000 characters per line. I do not know the average line length of any file beforehand.

5 Answers
贪生不怕死
#2 · 2019-01-20 13:20
  • An obvious improvement would be to use a set() for your usedPositions variable: membership lookup in a set is O(1) versus O(n) for a list, and since you may accumulate up to 10^6 used positions, lookup time is not negligible.
  • Use xrange instead of range in your for loops; allocating a full list of integers isn't necessary. (Both tweaks are sketched below.)
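
A minimal, self-contained sketch of both tweaks (Python 2, to match the question's code; the loop counts are arbitrary stand-ins for args[0] and the file positions):

import random

used_positions = set()            # set membership tests are O(1); a list's are O(n)

for j in xrange(5000):            # xrange yields lazily; range would allocate a full list
    pos = random.randrange(10 ** 6)
    while pos in used_positions:  # this lookup runs on every retry, so O(1) matters
        pos = random.randrange(10 ** 6)
    used_positions.add(pos)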
劳资没心,怎么记你
#3 · 2019-01-20 13:22

If you need a uniform sample of N lines from your file, you need to know the exact number of lines to pick from; seeking to random byte offsets doesn't give you that, and longer lines skew the results in favour of the lines that directly follow them.

Luckily, you only need to read your file once to pick those N lines; the technique is known as reservoir sampling. You basically keep the first N lines (in random order), then randomly replace already-picked lines with new ones, with a diminishing probability based on the number of lines read so far.

For N == 1, the nth line read replaces the previous random pick when randint(0, n - 1) < 1, i.e. with probability 1/n: the second line has a 50% chance of being picked, the third a 33.33% chance, etc. For larger N, one of the already-picked lines in your set is replaced at random as more lines are read, with the same distribution.

In "Python random lines from subfolders", Blkknght wrote a very helpful function for picking a random sample of size N from an iterable:

import random

def random_sample(n, items):
    results = []

    for i, v in enumerate(items):
        r = random.randint(0, i)
        if r < n:
            if i < n:
                results.insert(r, v) # add first n items in random order
            else:
                results[r] = v # at a decreasing rate, replace random items

    if len(results) < n:
        raise ValueError("Sample larger than population.")

    return results

This is trivial to combine with your requirements to preserve a set of headers:

from itertools import islice

with open(options.input) as input:
    with open(options.output, 'w') as output:

        # Handling of header lines
        # Use islice to avoid buffer issues with .readline()
        for line in islice(input, int(options.header)):
            output.write(line)

        # Pick a random sample
        for line in random_sample(int(args[0]), input):
            output.write(line)

This will read your whole file in one go, pick a uniform random sample, and write it out to the output file. Thus, this has Θ(L) complexity, with L being the number of lines in the file.
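
As a quick sanity check (my own example, not part of the original answer), random_sample works on any iterable, not just file objects:

print(random_sample(3, range(100)))  # three distinct values from 0-99, each equally likely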

萌系小妹纸
#4 · 2019-01-20 13:29

Untested (and requires reading the file twice):

import random

N = 5000
with open('file.in') as fin:
    line_count = sum(1 for i in fin)
    fin.seek(0)
    to_take = set(random.sample(xrange(line_count), N))
    for lineno, line in enumerate(fin):
        if lineno in to_take:
            pass # use it

However, since you mention that lines are "roughly" the same size, you could use os.path.getsize and divide it by the average line length (either already known, or sniffed from the first few lines of the file), then use that to estimate line_count; it would be close enough for a random sample. A sketch of that estimate follows.
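
A minimal sketch of that estimate, assuming the average is sniffed from the first 1000 lines (the probe size and helper name are my own):

import os
from itertools import islice

def estimate_line_count(path, probe_lines=1000):
    # Sniff the average line length from the first probe_lines lines...
    with open(path) as f:
        lengths = [len(line) for line in islice(f, probe_lines)]
    avg_len = float(sum(lengths)) / len(lengths)  # float division, also under Python 2
    # ...then extrapolate from the total file size.
    return int(os.path.getsize(path) / avg_len)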

You could also mmap the file and use a combination of file size, average line length, a best guess of the number of lines, and a random offset to 'seek', then search backwards or forwards for the nearest start of line. (Since mmap lets you treat the file like a string, you can use .index with an offset, or re if you really wanted to.) A sketch follows.
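
A minimal sketch of the mmap variant (equally untested; the helper name is my own, and the duplicate-offset bookkeeping from the question is left out):

import mmap
import os
import random

def pick_random_line_mmap(path):
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        pos = random.randrange(size)         # land on a random byte; longer lines are hit more often
        start = mm.rfind(b'\n', 0, pos) + 1  # scan back to the previous newline (-1 + 1 == 0 at file start)
        end = mm.find(b'\n', start)          # scan forward to the end of that line
        if end == -1:                        # last line with no trailing newline
            end = size
        line = mm[start:end]
        mm.close()
        return line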

Fickle 薄情
#5 · 2019-01-20 13:36

I believe it would be faster to randomly choose N line numbers and then go over the file once, line by line, taking the lines whose numbers are in your set. Currently you have to seek to a random place for each random number, so it's O(N*M), where M is the size of the file. What I suggest is O(M).

迷人小祖宗
#6 · 2019-01-20 13:38

There is only one way to avoid a sequential read of the whole file up to the last line you are sampling, and I am surprised that none of the answers so far has mentioned it:

Seek to an arbitrary location inside the file and read some bytes; since you have a typical line length, as you said, 3 or 4 times that value should do it. Then split the chunk you read on the newline characters ("\n") and pick the second field: that is a complete line at a random position.

Also, in order to seek into the file consistently, it should be opened in "binary read" mode, so the conversion of end-of-line markers has to be handled manually.

This technique can't tell you the number of the line that was read, so keep the selected line's offset in the file to avoid repetition:

#! /usr/bin/python
# coding: utf-8

import random, os


CHUNK_SIZE = 1000
PATH = "/var/log/cron"

def pick_next_random_line(file, offset):
    file.seek(offset)
    chunk = file.read(CHUNK_SIZE)
    lines = chunk.split(os.linesep)
    # Make some provision in case you had not read at least one full line here
    line_offset = offset + len(os.linesep) + chunk.find(os.linesep)
    return line_offset, lines[1]

def get_n_random_lines(path, n=5):
    length = os.stat(path).st_size
    results = []
    result_offsets = set()
    with open(path, "rb") as input:  # binary mode, as discussed above
        for x in range(n):
            while True:
                offset, line = pick_next_random_line(input, random.randint(0, length - CHUNK_SIZE))
                if offset not in result_offsets:
                    result_offsets.add(offset)
                    results.append(line)
                    break
    return results

if __name__ == "__main__":
    print get_n_random_lines(PATH)