Efficient file reading in python with need to spli

Posted 2020-02-16 03:55

Question:

I've traditionally been reading in files with:

file = open(fullpath, "r")
allrecords = file.read()
delimited = allrecords.split('\n')
for record in delimited[1:]:
    record_split = record.split(',')

and

with open(os.path.join(txtdatapath,pathfilename), "r") as data:
  datalines = (line.rstrip('\r\n') for line in data)
  for record in datalines:
    split_line = record.split(',')
    if len(split_line) > 1:

But when I process these files with multiprocessing, I get a MemoryError. How can I best read in files line by line when the text file I'm reading needs to be split on '\n'?

Here is the multiprocessing code:

pool = Pool()
fixed_args = (targetdirectorytxt, value_dict)
varg = ((filename,) + fixed_args for filename in readinfiles)
op_list = pool.map_async(PPD_star, list(varg), chunksize=1)     
while not op_list.ready():
  print("Number of files left to process: {}".format(op_list._number_left))
  time.sleep(60)
op_list = op_list.get()
pool.close()
pool.join()

Here is the error log:

Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Python27\lib\threading.py", line 810, in __bootstrap_inner
    self.run()
  File "C:\Python27\lib\threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "C:\Python27\lib\multiprocessing\pool.py", line 380, in _handle_results
    task = get()
MemoryError

I'm trying to install pathos as Mike has kindly suggested but I'm running into issues. Here is my install command:

pip install https://github.com/uqfoundation/pathos/zipball/master --allow-external pathos --pre

But here are the error messages that I get:

Downloading/unpacking https://github.com/uqfoundation/pathos/zipball/master
  Running setup.py (path:c:\users\xxx\appdata\local\temp\2\pip-1e4saj-b
uild\setup.py) egg_info for package from https://github.com/uqfoundation/pathos/
zipball/master

Downloading/unpacking ppft>=1.6.4.5 (from pathos==0.2a1.dev0)
  Running setup.py (path:c:\users\xxx\appdata\local\temp\2\pip_build_jp
tyuser\ppft\setup.py) egg_info for package ppft

    warning: no files found matching 'python-restlib.spec'
Requirement already satisfied (use --upgrade to upgrade): dill>=0.2.2 in c:\pyth
on27\lib\site-packages\dill-0.2.2-py2.7.egg (from pathos==0.2a1.dev0)
Requirement already satisfied (use --upgrade to upgrade): pox>=0.2.1 in c:\pytho
n27\lib\site-packages\pox-0.2.1-py2.7.egg (from pathos==0.2a1.dev0)
Downloading/unpacking pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
  Could not find any downloads that satisfy the requirement pyre==0.8.2.0-pathos
 (from pathos==0.2a1.dev0)
  Some externally hosted files were ignored (use --allow-external pyre to allow)
.
Cleaning up...
No distributions at all found for pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)

Storing debug log for failure in C:\Users\xxx\pip\pip.log

I'm installing on 64-bit Windows 7. In the end I managed to install with easy_install.

But now I get a failure because I cannot open that many files:

Finished reading in Exposures...
Reading Samples from:  C:\XXX\XXX\XXX\
Traceback (most recent call last):
  File "events.py", line 568, in <module>
    mdrcv_dict = ReadDamages(damage_dir, value_dict)
  File "events.py", line 185, in ReadDamages
    res = thpool.amap(mppool.map, [rstrip]*len(readinfiles), files)
  File "C:\Python27\lib\site-packages\pathos-0.2a1.dev0-py2.7.egg\pathos\multipr
ocessing.py", line 230, in amap
    return _pool.map_async(star(f), zip(*args)) # chunksize
  File "events.py", line 184, in <genexpr>
    files = (open(name, 'r') for name in readinfiles[0:])
IOError: [Errno 24] Too many open files: 'C:\\xx.csv'

Currently, using the multiprocessing library, I pass parameters and dictionaries into my function, open a mapped file, and then output a dictionary. Here is an example of how I currently do it. What would be the smart way to do this with pathos?

def PP_star(args_flat):
    return PP(*args_flat)

def PP(pathfilename, txtdatapath, my_dict):
    return com_dict

fixed_args = (targetdirectorytxt, my_dict)
varg = ((filename,) + fixed_args for filename in readinfiles)
op_list = pool.map_async(PP_star, list(varg), chunksize=1)

How can I perform the same function with pathos.multiprocessing?

Answer 1:

Just iterate over the lines instead of reading the whole file, like this:

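with open(os.path.join(txtdatapath,pathfilename), "r") as data:
    for dataline in data:
        # split the current line; the file is never loaded whole
        split_line = dataline.rstrip('\r\n').split(',')
        if len(split_line) > 1:
            pass  # process split_line here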
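
The same idea can be wrapped in a generator so that only one record is held in memory at a time. This is a minimal sketch, assuming Python 3 and comma-separated text; `iter_records` and its `skip_header` parameter are hypothetical names, not part of the original code.

```python
import os
import tempfile


def iter_records(path, delimiter=",", skip_header=False):
    """Yield one split record at a time instead of loading the whole file."""
    with open(path, "r") as handle:
        if skip_header:
            next(handle, None)  # drop the first line, like delimited[1:]
        for line in handle:
            fields = line.rstrip("\r\n").split(delimiter)
            if len(fields) > 1:  # ignore blank or single-field lines
                yield fields


if __name__ == "__main__":
    # hypothetical sample data, written to a temporary file for the demo
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
        tmp.write("id,value\n1,10\n\n2,20\n")
    try:
        for fields in iter_records(tmp.name, skip_header=True):
            print(fields)
    finally:
        os.remove(tmp.name)
```

Because the worker consumes the generator record by record, peak memory depends on the longest line rather than on the file size. What the worker returns to the parent process still has to fit in memory, so returning a compact result per file is likely to help as well.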


Answer 2:

Let's say we have file1.txt:

hello35
1234123
1234123
hello32
2492wow
1234125
1251234
1234123
1234123
2342bye
1234125
1251234
1234123
1234123
1234125
1251234
1234123

file2.txt:

1234125
1251234
1234123
hello35
2492wow
1234125
1251234
1234123
1234123
hello32
1234125
1251234
1234123
1234123
1234123
1234123
2342bye

and so on, through file5.txt:

1234123
1234123
1234125
1251234
1234123
1234123
1234123
1234125
1251234
1234125
1251234
1234123
1234123
hello35
hello32
2492wow
2342bye

I'd suggest using a hierarchical parallel map to read your files quickly. A fork of multiprocessing (called pathos.multiprocessing) can do this.

>>> import pathos
>>> thpool = pathos.multiprocessing.ThreadingPool()
>>> mppool = pathos.multiprocessing.ProcessingPool()
>>> 
>>> def rstrip(line):
...     return line.rstrip()
... 
>>> # get your list of files
>>> fnames = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt', 'file5.txt']
>>> # open the files
>>> files = (open(name, 'r') for name in fnames)
>>> # read each file in asynchronous parallel
>>> # while reading and stripping each line in parallel
>>> res = thpool.amap(mppool.map, [rstrip]*len(fnames), files)
>>> # get the result when it's done
>>> res.ready()
True
>>> data = res.get()
>>> # if not using a files iterator -- close each file by uncommenting the next line
>>> # files = [file.close() for file in files]
>>> data[0]
['hello35', '1234123', '1234123', 'hello32', '2492wow', '1234125', '1251234', '1234123', '1234123', '2342bye', '1234125', '1251234', '1234123', '1234123', '1234125', '1251234', '1234123']
>>> data[1]
['1234125', '1251234', '1234123', 'hello35', '2492wow', '1234125', '1251234', '1234123', '1234123', 'hello32', '1234125', '1251234', '1234123', '1234123', '1234123', '1234123', '2342bye']
>>> data[-1]
['1234123', '1234123', '1234125', '1251234', '1234123', '1234123', '1234123', '1234125', '1251234', '1234125', '1251234', '1234123', '1234123', 'hello35', 'hello32', '2492wow', '2342bye']

However, if you want to check how many files you have left to finish, you might want to use an "iterated" map (imap) instead of an "asynchronous" map (amap). See this post for details: Python multiprocessing - tracking the process of pool.map operation
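
The iterated-map approach to progress reporting can be sketched with the standard library alone. This is an illustration under assumptions, not pathos code: it assumes Python 3 and uses `multiprocessing.pool.ThreadPool` so that it runs without pickling concerns (`multiprocessing.Pool` offers the same `imap_unordered` method). `run_with_progress` and `count_chars` are hypothetical names.

```python
from multiprocessing.pool import ThreadPool


def run_with_progress(pool, func, items, report=print):
    """Apply func to every item and report how many items are still pending."""
    items = list(items)
    remaining = len(items)
    results = []
    # imap_unordered yields each result as soon as its task has finished
    for result in pool.imap_unordered(func, items):
        remaining -= 1
        report("Number of files left to process: {}".format(remaining))
        results.append(result)
    return results


def count_chars(name):
    # hypothetical stand-in for the real per-file worker
    return name, len(name)


if __name__ == "__main__":
    pool = ThreadPool(2)
    try:
        out = run_with_progress(pool, count_chars, ["a.csv", "bb.csv", "ccc.csv"])
    finally:
        pool.close()
        pool.join()
    print(sorted(out))
```

Counting completed results this way avoids polling the private `_number_left` attribute used in the question. Results arrive in completion order, so the worker returns the file name alongside its value.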

Get pathos here: https://github.com/uqfoundation
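
If the number of input files is large, opening all of them up front can hit the operating system's limit on open files, which is the "Too many open files" error shown in the question. One possible workaround is to pass file names to the workers so that each file is opened and closed inside the worker. The following is a minimal sketch under assumptions: it uses the standard library's `multiprocessing.pool.ThreadPool` rather than pathos, assumes Python 3, and `read_stripped` and `read_all` are hypothetical names.

```python
import os
import tempfile
from multiprocessing.pool import ThreadPool


def read_stripped(path):
    """Open one file, strip line endings, and close the file before returning."""
    with open(path, "r") as handle:
        return [line.rstrip("\r\n") for line in handle]


def read_all(paths, workers=4):
    """Read every file by name; at most `workers` files are open at once."""
    pool = ThreadPool(workers)
    try:
        # map preserves the order of paths in the returned list
        return pool.map(read_stripped, paths)
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    # hypothetical sample files, created in a temporary directory for the demo
    with tempfile.TemporaryDirectory() as folder:
        names = []
        for index in range(3):
            name = os.path.join(folder, "file{}.txt".format(index + 1))
            with open(name, "w") as handle:
                handle.write("hello35\n1234123\n")
            names.append(name)
        print(read_all(names))
```

With this layout the number of simultaneously open files is bounded by the pool size instead of by the number of input files.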



Answer 3:

Try this:

for line in file('file.txt'):
    print line.rstrip()

Of course, instead of printing the lines you could also add them to a list or perform some other operation on them.