I am running Python 2.7.
I have three text files: data.txt, find.txt, and replace.txt. find.txt contains several lines that I want to search for in data.txt, replacing each matching section with the content of replace.txt. Here is a simple example:
data.txt
pumpkin
apple
banana
cherry
himalaya
skeleton
apple
banana
cherry
watermelon
fruit
find.txt
apple
banana
cherry
replace.txt
1
2
3
So, in the above example, I want to find every occurrence of the consecutive lines apple, banana, and cherry in the data and replace those lines with 1, 2, 3.
I am having some trouble finding the right approach, as my data.txt is about 1 MB and I want to be as efficient as possible. One dumb way is to read each file into one long string, use replace, and then write the result to a new text file so that all the line breaks are preserved.
# Read each file into one long string.
with open("data.txt", "r") as data:
    data_str = data.read()
with open("find.txt", "r") as find:
    find_str = find.read()
with open("replace.txt", "r") as replace:
    replace_str = replace.read()

# Replace using the strings, not the file objects.
new_data = data_str.replace(find_str, replace_str)

with open("new_data.txt", "w") as new_file:
    new_file.write(new_data)
But this seems so convoluted and inefficient for a large data file like mine. Also, the replace function appears to be deprecated, so that's not good.
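A note on the string approach: in Python 2.7 it is the module-level string.replace function that the documentation lists as deprecated. The str.replace method called on data_str is not deprecated, and a 1 MB file fits comfortably in memory. One caveat is that a plain substring replacement can also match in the middle of a line (for example, pineapple followed by banana and cherry). Below is a minimal sketch that only matches whole lines, assuming the three files have already been read into strings. The function name replace_whole_lines and its parameter names are hypothetical, chosen for illustration.

```python
import re


def replace_whole_lines(data_str, find_str, replace_str):
    """Replace find_str in data_str only where it spans complete lines."""
    # Drop trailing newlines so the pattern can end exactly at a line end.
    find_block = find_str.rstrip("\n")
    replace_block = replace_str.rstrip("\n")
    if not find_block:
        # Nothing to search for; return the data unchanged.
        return data_str
    # With re.MULTILINE, ^ and $ match at the start and end of every line.
    # re.escape makes the search text literal rather than a regex.
    pattern = re.compile("^" + re.escape(find_block) + "$", re.MULTILINE)
    # A function is used as the replacement so that backslashes in
    # replace_block are inserted literally.
    return pattern.sub(lambda match: replace_block, data_str)
```

Because the pattern is anchored to line boundaries, a block that merely ends with the search text, or continues past it on the same line, is left alone.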
Another way is to step through the lines and keep track of the line where a match was found. Something like this:
location = 0
LOOP1:
for find_line in find:
    for i, data_line in enumerate(data).startingAtLine(location):
        if find_line == data_line:
            location = i  # found a possible match
            for idx in range(NUMBER_LINES_IN_FIND):
                if find_line[idx] != data_line[idx+location]:  # compare line by line
                    # if the subsequent lines don't match, go back and search again
                    goto LOOP1
Not fully formed code, I know. I don't even know whether it's possible to search a file starting from a certain line, or between certain lines; I'm just a bit confused about the logic of it all. What is the best way to do this?
Thanks!
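The line-by-line idea can be sketched without goto by loading the lines into lists and comparing a slice of the data against the lines to find. Once the lines are in a list, starting from any line or looking between two lines is just a matter of slicing. This is a minimal sketch, assuming all three files fit in memory; the names replace_line_block, read_lines, and rewrite_file are hypothetical.

```python
def replace_line_block(data_lines, find_lines, replace_lines):
    """Return a new list with each run of find_lines replaced.

    All three arguments are expected to be lists of lines
    without trailing newlines.
    """
    if not find_lines:
        return list(data_lines)
    result = []
    block_len = len(find_lines)
    i = 0
    while i < len(data_lines):
        # Compare the next block_len lines against the block to find.
        if data_lines[i:i + block_len] == find_lines:
            result.extend(replace_lines)
            i += block_len  # skip past the matched block
        else:
            result.append(data_lines[i])
            i += 1
    return result


def read_lines(path):
    """Read a text file into a list of lines, newlines stripped."""
    with open(path) as handle:
        return handle.read().splitlines()


def rewrite_file(data_path, find_path, replace_path, out_path):
    """Apply replace_line_block to files on disk (paths are illustrative)."""
    new_lines = replace_line_block(
        read_lines(data_path), read_lines(find_path), read_lines(replace_path)
    )
    with open(out_path, "w") as handle:
        handle.write("\n".join(new_lines) + "\n")
```

Calling rewrite_file("data.txt", "find.txt", "replace.txt", "new_data.txt") would write the result to a new file. The slice comparison does at most one block-sized comparison per data line, which should be fast enough for a file of about 1 MB.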