Python: find a string in multiple files recursively

Posted 2019-09-10 06:28

Question:

I'm learning Python and would like to search for a keyword in multiple files recursively.

I have an example function that should find every file with the .doc extension under a directory, open each one, and read it. If the keyword is found while reading a file, the function should print that file's path.

If the keyword is not found, Python should move on to the next file.

To do that, I have defined a function which takes two arguments:

def find_word(extension, word):
      # define the path for os.walk
      for dname, dirs, files in os.walk('/rootFolder'):
            #search for file name in files:
            for fname in files:
                  #define the path of each file
                  fpath = os.path.join(dname, fname)
                  #open each file and read it
                  with open(fpath) as f:
                        data=f.read()
                  # if data contains the word
                  if word in data:
                        #print the file path of that file  
                        print (fpath)
                  else: 
                        continue

Could you give me a hand to fix this code?

Thanks,

Answer 1:

import os


def find_word(extension, word):
    for root, dirs, files in os.walk('/DOC'):
        # filter files for given extension:
        files = [fi for fi in files if fi.endswith(".{ext}".format(ext=extension))]
        for filename in files:
            path = os.path.join(root, filename)
            # open each file and read it
            with open(path) as f:
                # split() creates a list of whitespace-separated words,
                # and set() keeps only the unique ones
                words = set(f.read().split())
                if word in words:
                    print(path)
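
One thing to be aware of with the set-of-words test above: it only matches whole whitespace-delimited tokens, so a keyword followed by punctuation is missed. The small sketch below compares it with a substring test and a regex word-boundary test. The sample sentence and variable names are made up for illustration.

```python
import re

text = "Find the keyword, then stop."

# split() breaks on whitespace only, so punctuation stays attached to words
words = set(text.split())
token_match = "keyword" in words  # False here: the token is "keyword,"

# Substring test: matches, but would also match inside "keywords"
substring_match = "keyword" in text

# Whole-word test that tolerates adjacent punctuation
word_match = re.search(r"\bkeyword\b", text) is not None
```

Which test is appropriate depends on whether partial matches should count.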


Answer 2:

.doc files are binary Word documents rather than plain text, so reading them with a simple text editor or Python's open() will not give you usable text. For the newer .docx format you can use a Python module such as python-docx.

Update

For .doc files (the format used before Word 2007) you can also call external tools such as catdoc or antiword. Try the following:

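import subprocess


def doc_to_text(filename):
    # Pass the arguments as a list so the filename needs no shell quoting
    output = subprocess.check_output(['catdoc', '-w', filename])
    # Assumes catdoc emits UTF-8; undecodable bytes are replaced
    return output.decode('utf-8', errors='replace')


print(doc_to_text('fixtures/doc.doc'))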
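
To connect this back to the question, here is a minimal sketch of a recursive search that takes the text extractor as a parameter. The names search_files and read_plain_text are hypothetical, not from the original posts, and the default extractor simply treats each file as UTF-8 text.

```python
import os


def read_plain_text(path):
    # Default extractor (assumption): read the file as UTF-8 text,
    # ignoring bytes that cannot be decoded
    with open(path, encoding="utf-8", errors="ignore") as f:
        return f.read()


def search_files(root, extension, word, extract_text=read_plain_text):
    """Yield paths under root ending in .extension whose text contains word."""
    suffix = "." + extension.lstrip(".")
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(suffix):
                path = os.path.join(dirpath, filename)
                # Substring test on whatever text the extractor returns
                if word in extract_text(path):
                    yield path
```

For .doc files you could pass a converter such as doc_to_text as extract_text, for example search_files('/DOC', 'doc', 'keyword', extract_text=doc_to_text), provided it returns a string.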


Answer 3:

If you are trying to read a .doc file in your code, then this won't work. You will have to change the part where you read the file.

Here are some links on reading a .doc file in Python.

extracting text from MS word files in python

Reading/Writing MS Word files in Python