What is the fastest way to remove a number from the beginning of every line?

Published 2019-05-27 04:23

I have 1000 files each having one million lines. Each line has the following form:

a number,a text

I want to remove the number from the beginning of every line of every file, including the comma.

Example:

14671823,aboasdyflj -> aboasdyflj

What I'm doing is:

os.system("sed -i -- 's/^.*,//g' data/*")

and it works fine, but it's taking a huge amount of time.

What is the fastest way to do this?

I'm coding in python.

4 Answers
我想做一个坏孩纸
#2 · 2019-05-27 05:10

This is much faster:

cut -f2 -d ',' data.txt > tmp.txt && mv tmp.txt data.txt

On a file with 11 million rows it took less than one second.

To use this on several files in a directory, use:

TMP=/pathto/tmpfile
for file in dir/*; do
    cut -f2 -d ',' "$file" > "$TMP" && mv "$TMP" "$file"
done

It's worth mentioning that editing a file in place often takes much longer than writing to a separate file. I tried your sed command but switched from in-place editing to a temporary file: total time went down from 26 s to 9 s.
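
Since you're driving this from Python anyway, here is a minimal sketch of the same temp-file approach using subprocess instead of os.system (the data/ glob and the .tmp suffix are my assumptions, not part of the original answer):

import glob
import os
import subprocess

for path in glob.glob("data/*"):  # assumed location of the input files
    tmp = path + ".tmp"
    with open(tmp, "w") as out:
        # cut writes the second comma-delimited field of each line to the temp file
        subprocess.run(["cut", "-f2", "-d", ",", path], stdout=out, check=True)
    os.replace(tmp, path)  # rename the temp file over the original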

狗以群分
#3 · 2019-05-27 05:11

This is probably pretty fast, and it's native Python. It reduces Python-level loops and uses csv.reader & csv.writer, which are backed by compiled code in most implementations:

import csv, glob, os

for f1 in glob.glob("*.txt"):
    f2 = f1 + ".new"
    with open(f1) as fr, open(f2, "w", newline="") as fw:
        # keep everything after the first field; passing the row slice (a list)
        # to writerows prevents the text from being split into characters
        csv.writer(fw).writerows(x[1:] for x in csv.reader(fr))
    os.remove(f1)
    os.rename(f2, f1)  # move the new file back over the old one

The writerows part could perhaps be made even faster by using map & operator.itemgetter to remove the inner loop (this also needs import operator):

csv.writer(fw).writerows(map(operator.itemgetter(slice(1, None)), csv.reader(fr)))

Also:

  • it's portable to all systems, including Windows without MSYS installed
  • it stops with an exception if something goes wrong, avoiding destroying the input
  • the temporary file is created on the same filesystem on purpose, so deleting + renaming is super fast (as opposed to moving the temp file to the input across filesystems, which would require shutil.move and would copy the data)
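
If the csv module turns out to be overhead you don't need, a minimal plain-Python sketch of the same rewrite, splitting each line on its first comma only; this assumes every line actually contains a comma (it raises IndexError otherwise, which again protects the input):

import glob
import os

for f1 in glob.glob("*.txt"):  # same file pattern as above
    f2 = f1 + ".new"
    with open(f1) as fr, open(f2, "w") as fw:
        for line in fr:
            # split on the first comma only and keep the remainder
            fw.write(line.split(",", 1)[1])
    os.remove(f1)
    os.rename(f2, f1)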
成全新的幸福
#4 · 2019-05-27 05:24

I would use GNU awk (to leverage its -i inplace editing of files) with , as the field separator; no expensive regex manipulation:

awk -F, -i inplace '{print $2}' file.txt

For example, if the filenames have a common prefix like file, you can use shell globbing:

awk -F, -i inplace '{print $2}' file*

awk treats each file as a separate argument and applies the in-place modification to each.


As a side note, you could simply run the shell command directly in the shell instead of wrapping it in os.system(), which is insecure and effectively superseded by subprocess.
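
If you do want to drive it from Python, a minimal sketch using subprocess (the data/* glob is an assumption; GNU awk must be on PATH):

import glob
import subprocess

files = glob.glob("data/*")  # expand the glob in Python; no shell needed
# each file is passed as a separate argument and edited in place by GNU awk
subprocess.run(["awk", "-F,", "-i", "inplace", "{print $2}", *files], check=True)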

欢心
#5 · 2019-05-27 05:28

You can take advantage of your multicore system, combined with the other answers' tips for handling a single file faster.

import multiprocessing
import os
import queue

FILES = ['a', 'b', 'c', 'd']
CORES = 4

def handler(q, i):
    while True:
        try:
            f = q.get(block=False)
        except queue.Empty:
            return  # no more files to process
        os.system("cut -f2 -d ',' {f} > tmp{i} && mv tmp{i} {f}".format(f=f, i=i))

if __name__ == '__main__':
    q = multiprocessing.Queue(len(FILES))
    for f in FILES:
        q.put(f)

    processes = [multiprocessing.Process(target=handler, args=(q, i))
                 for i in range(CORES)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

    print("Done!")