How to find and replace multiple lines in text file

Posted 2019-06-22 09:56

Question:

I am running Python 2.7.

I have three text files: data.txt, find.txt, and replace.txt. find.txt contains several lines that I want to search for in data.txt, and I want to replace each matching section with the content of replace.txt. Here is a simple example:

data.txt

pumpkin
apple
banana
cherry
himalaya
skeleton
apple
banana
cherry
watermelon
fruit

find.txt

apple
banana
cherry

replace.txt

1
2
3

So, in the above example, I want to search for all occurrences of the lines apple, banana, and cherry in the data and replace those lines with 1, 2, 3.

I am having some trouble with the right approach to this as my data.txt is about 1MB so I want to be as efficient as possible. One dumb way is to concatenate everything into one long string and use replace, and then output to a new text file so all the line breaks will be restored.

import re

data = open("data.txt", 'r')
find = open("find.txt", 'r')
replace = open("replace.txt", 'r')

data_str = ""
find_str = ""
replace_str = "" 

for line in data: # concatenate it into one long string
    data_str += line

for line in find: # concatenate it into one long string
    find_str += line

for line in replace: 
    replace_str += line


new_data = data_str.replace(find_str, replace_str)  # pass the strings, not the file objects
new_file = open("new_data.txt", "w")
new_file.write(new_data)
new_file.close()

But this seems so convoluted and inefficient for a large data file like mine. Also, the replace function appears to be deprecated so that's not good.

Another way is to step through the lines and keep track of the line where you found a match.

Something like this:

location = 0

LOOP1: 
for find_line in find:
    for i, data_line in enumerate(data).startingAtLine(location):
        if find_line == data_line:
            location = i # found possibility

for idx in range(NUMBER_LINES_IN_FIND):
    if find_line[idx] != data_line[idx+location]  # compare line by line
        #if the subsequent lines don't match, then go back and search again
        goto LOOP1

Not fully formed code, I know. I don't even know if it's possible to search through a file from a certain line on or between certain lines but again, I'm just a bit confused in the logic of it all. What is the best way to do this?

Thanks!

Answer 1:

If the file is large, you want to read and write one line at a time, so the whole thing isn't loaded into memory at once.

# create a dict of find keys and replace values
findlines = open('find.txt').read().split('\n')
replacelines = open('replace.txt').read().split('\n')
find_replace = dict(zip(findlines, replacelines))

with open('data.txt') as data:
    with open('new_data.txt', 'w') as new_data:
        for line in data:
            for key in find_replace:
                if key in line:
                    line = line.replace(key, find_replace[key])
            new_data.write(line)

Edit: I changed the code to use read().split('\n') instead of readlines(), so \n isn't included in the find and replace strings.
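
Note that the loop above substitutes each find line on its own, so a lone "apple" would also become "1". If the lines in find.txt must appear consecutively, as the question describes, one option is to keep a small window of pending lines while streaming through the file. The following is only a sketch under that assumption; replace_blocks and replace_in_file are hypothetical helper names, not part of any library, and lines are compared without their trailing newline.

```python
from collections import deque


def replace_blocks(lines, find_lines, replace_lines):
    """Yield lines, swapping each consecutive run equal to find_lines
    for replace_lines. Lines are expected without trailing newlines."""
    n = len(find_lines)
    if n == 0:
        # nothing to search for: pass the input through unchanged
        for line in lines:
            yield line
        return
    window = deque()  # never holds more than n pending lines
    for line in lines:
        window.append(line)
        if len(window) == n:
            if list(window) == find_lines:
                # the whole block matched: emit the replacement instead
                for out in replace_lines:
                    yield out
                window.clear()
            else:
                # the oldest pending line cannot start a match; emit it
                yield window.popleft()
    # flush lines that were still waiting when the input ended
    for line in window:
        yield line


def replace_in_file(data_path, find_path, replace_path, out_path):
    """Hypothetical wrapper: stream data_path into out_path."""
    with open(find_path) as f:
        find_lines = f.read().splitlines()
    with open(replace_path) as f:
        replace_lines = f.read().splitlines()
    with open(data_path) as data, open(out_path, "w") as out:
        stripped = (line.rstrip("\n") for line in data)
        for line in replace_blocks(stripped, find_lines, replace_lines):
            out.write(line + "\n")
```

Calling replace_in_file("data.txt", "find.txt", "replace.txt", "new_data.txt") would write the result using the file names from the question. Only as many lines as find.txt contains are held in memory at any time.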



Answer 2:

A couple of things here:

replace is not deprecated; see this discussion for details: Python 2.7: replace method of string object deprecated

If you are worried about reading data.txt into memory all at once, you should be able to just iterate over data.txt one line at a time:

data = open("data.txt", 'r')
for line in data:
    pass  # fix the line here

So all that's left is coming up with a whole bunch of find/replace pairs and fixing each line. Check out the zip function for a handy way to do that:

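# splitlines() drops the trailing newlines so the tokens can match inside a line
find = open("find.txt", 'r').read().splitlines()
replace = open("replace.txt", 'r').read().splitlines()
new_data = open("new_data.txt", 'w')
for line in open("data.txt", 'r'):
    # apply every find/replace pair to the current line
    for find_token, replace_token in zip(find, replace):
        line = line.replace(find_token, replace_token)
    new_data.write(line)  # line still ends with its original newline
new_data.close()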
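
On the first point: as far as I know, what Python 2 deprecated is the old replace function in the string module, not the replace method on str objects. Since a 1MB file fits in memory comfortably, the whole-file approach from the question can stay quite short. Here is a rough sketch; replace_block_in_text and replace_block_in_file are made-up names for illustration. One caveat: this is plain substring matching, so a line such as "pineapple" followed by banana and cherry would also match.

```python
def replace_block_in_text(data_str, find_str, replace_str):
    """Replace every occurrence of the multi-line find_str in data_str."""
    # Drop trailing newlines so it does not matter whether find.txt and
    # replace.txt end with a final newline or not.
    return data_str.replace(find_str.rstrip("\n"), replace_str.rstrip("\n"))


def replace_block_in_file(data_path, find_path, replace_path, out_path):
    """Hypothetical wrapper: read everything, replace, write the result."""
    with open(data_path) as f:
        data_str = f.read()
    with open(find_path) as f:
        find_str = f.read()
    with open(replace_path) as f:
        replace_str = f.read()
    with open(out_path, "w") as f:
        f.write(replace_block_in_text(data_str, find_str, replace_str))
```

With the file names from the question, the call would be replace_block_in_file("data.txt", "find.txt", "replace.txt", "new_data.txt").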