Collect data in chunks from stdin: Python

Posted 2019-09-19 11:37

Question:

I have the following Python code, where I collect data from standard input into a list and run SyntaxNet on it. The data is in the form of JSON objects, from which I will extract the text field and feed it to SyntaxNet.

data = []
for line in sys.stdin:
    data.append(line)
run_syntaxnet(data)    ##This is a function##

I am doing this because I do not want to run SyntaxNet on every single tweet, since that would take a very long time and hurt performance.

Also, when I run this code on very large data, I do not want to keep collecting input forever and run out of memory. So I want to collect the data in chunks, maybe 10000 tweets at a time, and run SyntaxNet on each chunk. Can someone help me with how to do this?

Also, I want to understand what the maximum length of the list data can be so that I do not run out of memory.

EDIT:

I used the code:

data = []
for line in sys.stdin:
    data.append(line)
    if len(data) == 10000:
        run_syntaxnet(data)    ##This is a function##
        data = []

which runs perfectly fine if the number of rows in the input data is a multiple of 10000. I am not sure what to do with the remainder of the rows.

For example, if the total number of rows is 12000, the first 10000 rows get processed as I want, but the remaining 2000 are left out since the condition len(data) == 10000 is never met again.

I want to do something like:

if len(data) > 10000 or 'EOF of input file is reached':
    run_syntaxnet(data)

Can someone tell me how to check for EOF of the input file? Thanks in advance!

PS: All the data fed to the Python file comes from Pig streaming. Also, I cannot afford to actually count the number of rows in the input data and send it as a parameter, since I have millions of rows and the counting itself would take forever.

Answer 1:

I think this is all you need:

data = []
for line in sys.stdin:
    data.append(line)
    if len(data) == 10000:
        run_syntaxnet(data)    ##This is a function##
        data = []

Once the list gets to 10000 entries, run the function and reset your data list. The maximum size of the list will vary from machine to machine, depending on how much memory you have, so it is probably best to try different lengths and find out what is optimal.
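
If it helps to gauge that limit, one rough sketch is to measure the memory a batch actually holds with sys.getsizeof; this only counts object sizes, so treat the result as a lower bound (the helper name batch_footprint_bytes is just illustrative):

import sys

def batch_footprint_bytes(batch):
    """Rough estimate of the memory held by one batch of input lines."""
    # size of the list object plus the size of each line string;
    # interpreter overhead is ignored, so this is only a lower bound
    return sys.getsizeof(batch) + sum(sys.getsizeof(line) for line in batch)

Printing this for the first batch at a few different chunk sizes makes it easier to pick a value that fits the machine.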



Answer 2:

I would gather the data into chunks and process those chunks when they get "large":

import sys  # the loop below reads lines from standard input

LARGE_DATA = 10   # small value for illustration; the question uses 10000

data = []
for line in sys.stdin:
    data.append(line)
    if len(data) > LARGE_DATA:
        run_syntaxnet(data)
        data = []

# the for loop exits once stdin reaches EOF, so whatever is still
# in data is the remainder; process it only if it is non-empty
if data:
    run_syntaxnet(data)
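
As a side note, the same chunking can be written as a reusable generator built on itertools.islice, which drops the manual counter and handles the final partial chunk in one place. A minimal sketch, reusing run_syntaxnet and the 10000-line batch size from the question (in_chunks is just an illustrative name):

import sys
from itertools import islice

def in_chunks(lines, size=10000):
    """Yield successive lists of at most `size` lines from an iterable."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:          # nothing left once stdin hits EOF
            return
        yield chunk

for batch in in_chunks(sys.stdin):
    run_syntaxnet(batch)       # the question's own function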