What is the fastest way to remove a number from the beginning of each line?

Published 2019-05-27 05:18

Question:

I have 1000 files each having one million lines. Each line has the following form:

a number,a text

I want to remove the number from the beginning of every line of every file, including the comma.

Example:

14671823,aboasdyflj -> aboasdyflj

What I'm doing is:

os.system("sed -i -- 's/^.*,//g' data/*")

and it works fine but it's taking a huge amount of time.

What is the fastest way to do this?

I'm coding in Python.

Answer 1:

This is much faster:

cut -f2 -d ',' data.txt > tmp.txt && mv tmp.txt data.txt

On a file with 11 million rows it took less than one second.

To use this on several files in a directory, use:

TMP=/pathto/tmpfile
for file in dir/*; do
    cut -f2 -d ',' "$file" > "$TMP" && mv "$TMP" "$file"
done

It is worth mentioning that editing in place often takes much longer than writing to a separate file. I tried your sed command but switched from in-place editing to a temporary file; the total time went down from 26 s to 9 s.
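
If you want to drive that temp-file variant from Python, a minimal sketch with subprocess instead of os.system could look like this (it assumes GNU sed, the data/ directory from the question, and uses os.replace as the rename step that mv performs):

import glob
import os
import subprocess

# Rewrite each file through a temporary file rather than editing in place.
for path in glob.glob("data/*"):
    tmp = path + ".tmp"
    with open(tmp, "w") as out:
        # same substitution as the question's sed command
        subprocess.run(["sed", "s/^.*,//g", path], stdout=out, check=True)
    os.replace(tmp, path)  # move the temp file back over the original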



Answer 2:

I would use GNU awk (to leverage its -i inplace editing of files) with , as the field separator, so there is no expensive regex manipulation:

awk -F, -i inplace '{print $2}' file.txt

For example, if the filenames have a common prefix like file, you can use shell globbing:

awk -F, -i inplace '{print $2}' file*

awk treats each file as a separate argument and applies the in-place modification to each one.


As a side note, you could simply run the shell command directly in the shell instead of wrapping it in os.system(), which is insecure and discouraged in favor of subprocess.
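
If it does have to run from Python, a rough sketch with subprocess.run might look like this (assuming GNU awk is installed and the files live under data/ as in the question):

import glob
import subprocess

files = glob.glob("data/*")
# pass the files as separate arguments; gawk edits each one in place
subprocess.run(["awk", "-F,", "-i", "inplace", "{print $2}", *files], check=True)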



Answer 3:

This is probably pretty fast, and it is native Python: it reduces explicit loops and uses csv.reader and csv.writer, which are compiled in most implementations:

import csv, glob, os

for f1 in glob.glob("*.txt"):
    f2 = f1 + ".new"
    with open(f1, newline="") as fr, open(f2, "w", newline="") as fw:
        # keep every field after the first one, dropping the number and its comma
        csv.writer(fw).writerows(x[1:] for x in csv.reader(fr))
    os.remove(f1)
    os.rename(f2, f1)  # move the new file back over the old one

The writerows part might be even faster by using map and operator.itemgetter with a slice (this needs an extra import operator) to remove the generator expression:

csv.writer(fw).writerows(map(operator.itemgetter(slice(1, None)), csv.reader(fr)))

Also:

  • it is portable to all systems, including Windows without MSYS installed
  • it stops with an exception if anything goes wrong, so the input is never destroyed
  • the temporary file is deliberately created on the same filesystem, so the delete + rename is very fast (moving the temp file onto the input across filesystems would require shutil.move and would copy the data)
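
If the text after the comma never needs CSV quoting, a plain-string variant of the same temp-file rewrite might be even simpler (a sketch, not benchmarked):

import glob
import os

for f1 in glob.glob("*.txt"):
    f2 = f1 + ".new"
    with open(f1) as fr, open(f2, "w") as fw:
        for line in fr:
            # keep everything after the first comma
            fw.write(line.partition(",")[2])
    os.remove(f1)
    os.rename(f2, f1)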


Answer 4:

You can take advantage of your multicore system, along with the other users' tips on handling a single file faster.

import multiprocessing
import os
import queue

FILES = ['a', 'b', 'c', 'd']
CORES = 4

def handler(q, i):
    while True:
        try:
            f = q.get(block=False)
        except queue.Empty:
            return
        # process one file with the cut pipeline; the tmp name is unique per worker
        os.system("cut -f2 -d ',' {f} > tmp{i} && mv tmp{i} {f}".format(f=f, i=i))

if __name__ == "__main__":
    q = multiprocessing.Queue(len(FILES))
    for f in FILES:
        q.put(f)

    processes = [multiprocessing.Process(target=handler, args=(q, i))
                 for i in range(CORES)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

    print("Done!")
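
A shorter sketch of the same idea with multiprocessing.Pool, using a hypothetical process_file helper and os.replace instead of shelling out for the mv:

import multiprocessing
import os
import subprocess

def process_file(path):
    # hypothetical helper: the cut pipeline from Answer 1, one file per call
    tmp = path + ".tmp"  # per-file temp name, safe across workers
    with open(tmp, "w") as out:
        subprocess.run(["cut", "-f2", "-d", ",", path], stdout=out, check=True)
    os.replace(tmp, path)

if __name__ == "__main__":
    with multiprocessing.Pool(4) as pool:
        pool.map(process_file, ['a', 'b', 'c', 'd'])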