I have the following Python code, where I collect data from standard input into a list and run SyntaxNet on it. The data is in the form of JSON objects, from which I extract the text field and feed it to SyntaxNet.
data = []
for line in sys.stdin:
    data.append(line)
run_syntaxnet(data)  ## This is a function ##
I am doing this because I do not want SyntaxNet to run once per tweet, since that would take a very long time and hurt performance.
Also, on very large input I do not want to keep collecting lines forever and run out of memory. So I want to collect the data in chunks, maybe 10000 tweets at a time, and run SyntaxNet on each chunk. Can someone help me do this?
Also, I want to understand what the maximum length of the list data can be so that I do not run out of memory.
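There is no fixed maximum list length; it depends on how much RAM is free and how large each line is. A rough back-of-envelope estimate with sys.getsizeof (the sample line and the 8-byte-per-slot list overhead are assumptions for 64-bit CPython, not measured values from my data):

```python
import sys

# A hypothetical tweet line, used only to estimate per-line cost.
sample = '{"text": "an average-length tweet of roughly 140 characters padded out a bit"}'

line_bytes = sys.getsizeof(sample)  # bytes for one string object
slot_bytes = 8                      # one pointer per list slot on 64-bit CPython
n = 10000

approx = n * (line_bytes + slot_bytes)
print(f"~{approx / 1024 / 1024:.1f} MiB for {n} lines of this size")
```

So a 10000-line chunk of tweet-sized strings is on the order of a few MiB, which is why chunking keeps memory bounded regardless of the total input size.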
EDIT:
I used the code:
data = []
for line in sys.stdin:
    data.append(line)
    if len(data) == 10000:
        run_syntaxnet(data)  ## This is a function ##
        data = []
which runs perfectly fine as long as the number of rows in the input is a multiple of 10000. I am not sure what to do with the remaining rows.
For example, if the total number of rows is 12000, the first 10000 rows get processed as I want, but the last 2000 are left out, since the condition len(data) == 10000 is never met again.
I want to do something like:

if len(data) == 10000 or EOF_of_input_reached:
    run_syntaxnet(data)
Can someone tell me how to check for the EOF of input file? Thanks in advance!
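One common pattern sidesteps the explicit EOF check entirely: the for loop over sys.stdin terminates exactly when EOF is reached, so any partial chunk can be flushed right after the loop. A minimal sketch (process_in_chunks and the handler wiring are illustrative names, not part of the original code):

```python
import sys

def process_in_chunks(stream, handler, chunk_size=10000):
    """Call handler on each full chunk of lines from stream,
    then once more on the final partial chunk (the EOF case)."""
    data = []
    for line in stream:
        data.append(line)
        if len(data) == chunk_size:
            handler(data)
            data = []
    # Reaching this point means the stream hit EOF;
    # flush whatever remains (fewer than chunk_size lines).
    if data:
        handler(data)

# Usage with the Pig streaming input:
# process_in_chunks(sys.stdin, run_syntaxnet)
```

With 12000 input rows and chunk_size=10000, handler would be called twice: once with 10000 lines and once with the remaining 2000, with no row counting needed up front.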
PS: All the data fed to the Python script comes from Pig streaming. Also, I cannot afford to actually count the number of rows in the input and pass it as a parameter, since I have millions of rows and counting alone would take forever.