What is the fastest way to remove a number from the beginning of every line?

Published 2019-05-27 04:23

I have 1000 files each having one million lines. Each line has the following form:

a number,a text

I want to remove the number from the beginning of every line of every file, including the comma.

Example:

14671823,aboasdyflj -> aboasdyflj

What I'm doing is:

os.system("sed -i -- 's/^.*,//g' data/*")

and it works fine, but it's taking a huge amount of time.

What is the fastest way to do this?

I'm coding in python.

4 Answers
我想做一个坏孩纸
#2 · 2019-05-27 05:10

This is much faster:

cut -f2 -d ',' data.txt > tmp.txt && mv tmp.txt data.txt

On a file with 11 million rows it took less than one second.

To use this on several files in a directory, use:

TMP=/pathto/tmpfile
for file in dir/*; do
    cut -f2 -d ',' "$file" > "$TMP" && mv "$TMP" "$file"
done

It's worth mentioning that editing a file in place often takes much longer than writing to a separate file. I tried your sed command but switched from in-place editing to a temporary file: total time went down from 26 s to 9 s.
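
Since you're driving this from Python anyway, here is a minimal sketch of the same temp-file approach using subprocess instead of os.system (the data/ glob and the .tmp suffix are my assumptions, not part of the original answer):

import glob
import os
import subprocess

for path in glob.glob("data/*"):  # assumed location of the input files
    tmp = path + ".tmp"
    with open(tmp, "w") as out:
        # cut writes the second comma-delimited field of each line to the temp file
        subprocess.run(["cut", "-f2", "-d", ",", path], stdout=out, check=True)
    os.replace(tmp, path)  # rename the temp file over the original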

狗以群分
#3 · 2019-05-27 05:11

This is probably pretty fast, and it's native Python. It reduces Python-level loops and uses csv.reader & csv.writer, which are backed by compiled code in most implementations:

import csv, glob, os

for f1 in glob.glob("*.txt"):
    f2 = f1 + ".new"
    with open(f1) as fr, open(f2, "w", newline="") as fw:
        # keep everything after the first field; passing the row slice (a list)
        # to writerows prevents the text from being split into characters
        csv.writer(fw).writerows(x[1:] for x in csv.reader(fr))
    os.remove(f1)
    os.rename(f2, f1)  # move the new file back over the old one

The writerows part could perhaps be made even faster by using map & operator.itemgetter to remove the inner loop (this also needs import operator):

csv.writer(fw).writerows(map(operator.itemgetter(slice(1, None)), csv.reader(fr)))

Also:

  • it's portable to all systems, including Windows without MSYS installed
  • it stops with an exception if something goes wrong, avoiding destroying the input
  • the temporary file is created on the same filesystem on purpose, so deleting + renaming is super fast (as opposed to moving the temp file to the input across filesystems, which would require shutil.move and would copy the data)
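
If the csv module turns out to be overhead you don't need, a minimal plain-Python sketch of the same rewrite, splitting each line on its first comma only; this assumes every line actually contains a comma (it raises IndexError otherwise, which again protects the input):

import glob
import os

for f1 in glob.glob("*.txt"):  # same file pattern as above
    f2 = f1 + ".new"
    with open(f1) as fr, open(f2, "w") as fw:
        for line in fr:
            # split on the first comma only and keep the remainder
            fw.write(line.split(",", 1)[1])
    os.remove(f1)
    os.rename(f2, f1)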
成全新的幸福
#4 · 2019-05-27 05:24

I would use GNU awk (to leverage its -i inplace editing of files) with , as the field separator; no expensive regex manipulation:

awk -F, -i inplace '{print $2}' file.txt

For example, if the filenames have a common prefix like file, you can use shell globbing:

awk -F, -i inplace '{print $2}' file*

awk treats each file as a separate argument and applies the in-place modification to each.


As a side note, you could simply run the shell command directly in the shell instead of wrapping it in os.system(), which is insecure and effectively superseded by subprocess.
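
If you do want to drive it from Python, a minimal sketch using subprocess (the data/* glob is an assumption; GNU awk must be on PATH):

import glob
import subprocess

files = glob.glob("data/*")  # expand the glob in Python; no shell needed
# each file is passed as a separate argument and edited in place by GNU awk
subprocess.run(["awk", "-F,", "-i", "inplace", "{print $2}", *files], check=True)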

欢心
#5 · 2019-05-27 05:28

You can take advantage of your multicore system, combined with the other answers' tips for handling a single file faster.

import multiprocessing
import os
import queue

FILES = ['a', 'b', 'c', 'd']
CORES = 4

def handler(q, i):
    while True:
        try:
            f = q.get(block=False)
        except queue.Empty:
            return  # no more files to process
        os.system("cut -f2 -d ',' {f} > tmp{i} && mv tmp{i} {f}".format(f=f, i=i))

if __name__ == '__main__':
    q = multiprocessing.Queue(len(FILES))
    for f in FILES:
        q.put(f)

    processes = [multiprocessing.Process(target=handler, args=(q, i))
                 for i in range(CORES)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

    print("Done!")