Question:
I have 1000 files each having one million lines. Each line has the following form:
a number,a text
I want to remove the number from the beginning of every line of every file, including the comma.
Example:
14671823,aboasdyflj -> aboasdyflj
What I'm doing is:
os.system("sed -i -- 's/^.*,//g' data/*")
and it works fine but it's taking a huge amount of time.
What is the fastest way to do this?
I'm coding in python.
Answer 1:
This is much faster:
cut -f2 -d ',' data.txt > tmp.txt && mv tmp.txt data.txt
On a file with 11 million rows it took less than one second.
To use this on several files in a directory, use:
TMP=/pathto/tmpfile
for file in dir/*; do
    cut -f2 -d ',' "$file" > "$TMP" && mv "$TMP" "$file"
done
It is worth mentioning that editing a file in place often takes much longer than writing to a separate file. I tried your sed command but switched from in-place editing to a temporary file; the total time went down from 26s to 9s.
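If you would rather drive this from Python than from a shell script, here is a minimal sketch of the same cut-and-replace pattern using subprocess; the data/* glob matches the question's layout, and the .tmp suffix is an assumption for illustration:

import glob
import os
import subprocess

for path in glob.glob("data/*"):  # the question's file layout
    tmp = path + ".tmp"  # temp file in the same directory, so the final rename is cheap
    with open(tmp, "w") as out:
        # same cut command as above, run without a shell
        subprocess.run(["cut", "-f2", "-d", ",", path], stdout=out, check=True)
    os.replace(tmp, path)  # swap the cleaned output over the original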
Answer 2:
I would use GNU awk (to leverage its -i inplace file editing) with , as the field separator; no expensive regex manipulation:
awk -F, -i inplace '{print $2}' file.txt
For example, if the filenames have a common prefix like file, you can use shell globbing:
awk -F, -i inplace '{print $2}' file*
awk will treat each file as a separate argument while applying the in-place modifications.
As a side note, you could simply run the shell command directly in the shell instead of wrapping it in os.system(), which is insecure and discouraged in favor of the subprocess module.
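For reference, a minimal sketch of what the switch to subprocess might look like (file.txt is just a placeholder name; GNU awk is assumed, as above):

import subprocess

# Passing an argument list avoids the shell entirely, sidestepping the
# injection risks of building an os.system() string by interpolation.
subprocess.run(["awk", "-F,", "-i", "inplace", "{print $2}", "file.txt"], check=True)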
Answer 3:
This is probably pretty fast, and it's native Python: it reduces explicit loops and uses csv.reader and csv.writer, which are implemented in C in most implementations:
import csv, glob, os

for f1 in glob.glob("*.txt"):
    f2 = f1 + ".new"
    with open(f1, newline="") as fr, open(f2, "w", newline="") as fw:
        # x[1:] keeps each row as a list; a bare x[1] would be a string,
        # which writerow would split into one-character fields
        csv.writer(fw).writerows(x[1:] for x in csv.reader(fr))
    os.remove(f1)
    os.rename(f2, f1)  # move the new file back over the old one
Maybe the writerows part could be even faster by using map and operator.itemgetter (with import operator added) to remove the inner loop; itemgetter(slice(1, None)) reproduces x[1:], keeping each row a list:

csv.writer(fw).writerows(map(operator.itemgetter(slice(1, None)), csv.reader(fr)))
Also:
- it's portable to all systems, including Windows without MSYS installed
- it stops with an exception in case of a problem, avoiding destroying the input
- the temporary file is created on the same filesystem on purpose, so the delete + rename is super fast (as opposed to moving the temp file to the input across filesystems, which would require shutil.move and would copy the data)
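Since the task only strips everything up to the first comma, a plain-string variant without the csv module is also conceivable; a minimal sketch using the same temp-file-plus-rename pattern (it assumes, per the question, that every line contains a comma):

import glob
import os

for f1 in glob.glob("*.txt"):
    f2 = f1 + ".new"
    with open(f1) as fr, open(f2, "w") as fw:
        for line in fr:
            # partition splits on the first comma only; [2] is everything after it
            fw.write(line.partition(",")[2])
    os.replace(f2, f1)  # same-filesystem rename, so this is cheap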
Answer 4:
You can take advantage of your multicore system, along with the other answers' tips for handling a single file faster.
import multiprocessing
import os
import queue

FILES = ['a', 'b', 'c', 'd']
CORES = 4

def handler(q, i):
    while True:
        try:
            f = q.get(block=False)
        except queue.Empty:  # raised by multiprocessing.Queue when empty
            return
        os.system("cut -f2 -d ',' {f} > tmp{i} && mv tmp{i} {f}".format(**locals()))

if __name__ == '__main__':
    q = multiprocessing.Queue(len(FILES))
    for f in FILES:
        q.put(f)
    processes = [multiprocessing.Process(target=handler, args=(q, i)) for i in range(CORES)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print("Done!")