I have 1000 files each having one million lines. Each line has the following form:
a number,a text
I want to remove the number, including the comma, from the beginning of every line of every file.
Example:
14671823,aboasdyflj -> aboasdyflj
What I'm doing is:
os.system("sed -i -- 's/^.*,//g' data/*")
and it works fine but it's taking a huge amount of time.
What is the fastest way to do this?
I'm coding in python.
This is much faster:
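The command itself did not survive here; a sketch consistent with the description (splitting on the first comma and writing to a new file rather than editing in place) uses `cut` — the filename `data.txt` is a placeholder:

```shell
# Sample input in the question's format (hypothetical filename).
printf '14671823,aboasdyflj\n' > data.txt

# Keep everything from the second comma-separated field onward.
cut -d, -f2- data.txt > data.tmp && mv data.tmp data.txt
cat data.txt   # aboasdyflj
```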
On a file with 11 million rows it took less than one second.
To use this on several files in a directory, use:
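The original snippet is missing as well; a hedged sketch of the per-directory loop, assuming a `cut`-based single-file command and the question's `data/` directory (file names and contents are placeholders):

```shell
# Create sample files (hypothetical names/contents).
mkdir -p data
printf '14671823,aboasdyflj\n' > data/part1
printf '555,foo\n' > data/part2

# Rewrite each file through a temporary file instead of in place.
tmp=$(mktemp)
for f in data/*; do
    cut -d, -f2- "$f" > "$tmp" && mv "$tmp" "$f"
done

cat data/part1   # aboasdyflj
cat data/part2   # foo
```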
A thing worth mentioning is that it often takes much longer to edit files in place than to write to a separate file. I tried your sed command but switched from in-place editing to a temporary file; total time went down from 26s to 9s.
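A minimal sketch of that variant: the same sed program as in the question, but writing to a temporary file and moving it back instead of using `-i` (filenames are placeholders):

```shell
printf '14671823,aboasdyflj\n' > data.txt   # sample input

# Same substitution as the question, minus the in-place flag.
sed 's/^.*,//g' data.txt > data.tmp && mv data.tmp data.txt
cat data.txt   # aboasdyflj
```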
That's probably pretty fast, and it's native Python: reduce the loops and use csv.reader and csv.writer, which are compiled in most implementations. The writerows part could maybe be made even faster by using map and operator.itemgetter to remove the inner loop. Also: put the result back with shutil.move (copying would duplicate the data).
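A sketch of that approach as we read it (the function name and paths are ours, not the answer's):

```python
import csv
import operator


def strip_leading_numbers(src_path, dst_path):
    # csv.reader splits each line on commas; writerows with map and
    # operator.itemgetter(slice(1, None)) drops the first field without an
    # explicit Python-level inner loop.
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, lineterminator="\n")  # keep Unix line endings
        writer.writerows(map(operator.itemgetter(slice(1, None)), reader))
        # Equivalent explicit loop:
        # for row in reader:
        #     writer.writerow(row[1:])


# Afterwards, shutil.move(dst_path, src_path) replaces the original file
# (moving is cheaper than copying the data).
```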
I would use GNU awk (to leverage its -i inplace editing of files) with , as the field separator, so there is no expensive regex manipulation. For example, if the filenames have a common prefix like file, you can use shell globbing; awk will treat each file as a different argument while applying the in-place modifications.

As a side note, you could simply run the shell command directly in the shell instead of wrapping it in os.system(), which is insecure and, by the way, deprecated in favor of subprocess
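A sketch combining these two suggestions, with assumed filenames: awk invoked through subprocess.run with an argument list instead of os.system, using , as the field separator.

```python
import pathlib
import subprocess

# Hypothetical sample file in the question's "number,text" format.
pathlib.Path("data.txt").write_text("14671823,aboasdyflj\n")

# An argument list avoids the shell (and os.system's injection risks);
# -F, makes awk split on commas, so $2 is the text field. With GNU awk the
# in-place form would be:  gawk -i inplace -F, '{print $2}' data/*
result = subprocess.run(
    ["awk", "-F,", "{print $2}", "data.txt"],
    capture_output=True, text=True, check=True,
)
print(result.stdout, end="")  # aboasdyflj
```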
You can take advantage of your multicore system, along with the other users' tips on handling a single file faster.
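For instance, a hedged sketch using multiprocessing.Pool to fan the files out across cores — the worker below uses a plain str.split and assumes every line contains a comma, as in the question; any of the faster single-file recipes could be substituted:

```python
import glob
from multiprocessing import Pool


def process(path):
    # Hypothetical per-file worker: drop everything up to and including the
    # first comma, writing to a new file (assumes every line has a comma).
    with open(path) as src, open(path + ".out", "w") as dst:
        for line in src:
            dst.write(line.split(",", 1)[1])


if __name__ == "__main__":
    # One worker per core; "data/*" is the pattern from the question.
    with Pool() as pool:
        pool.map(process, glob.glob("data/*"))
```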