Finding a heading in word file and copying entire

Posted 2019-08-06 04:44

I have the following situation:

I have several hundred Word files that contain company information. I would like to search these files for specific words to find specific paragraphs and copy just those paragraphs to new Word files. Basically, I just need to reduce each of the original couple hundred documents to a more readable size.

The documents are located in one directory and have different names. From each of them I want to extract particular information, which I need to define individually for each file.

To go about this I started with the following code to first write all file names into a .csv file:

# list all transcript files and print names to .csv

import os
import csv

with open("C:\\Users\\Stef\\Desktop\\Files.csv", 'w', newline='') as f:  # newline='' avoids blank rows on Windows
    writer = csv.writer(f)
    for path, dirs, files in os.walk("C:\\Users\\Stef\\Desktop\\Files"):
        for filename in files:
            writer.writerow([filename])

This works perfectly. Next I open Files.csv and fill in the second column with the keyword that I need to search for in each document.

See picture below for how the .csv file looks:

CSV file
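As a rough sketch of the layout described here, assuming one file name in the first column and one keyword in the second, reading the file back might look like this. The function name `read_keywords` and the sample rows are made up for illustration and are not the actual contents of Files.csv.

```python
import csv
import io


def read_keywords(csv_file):
    """Return (file_name, keyword) pairs, skipping blank or incomplete rows."""
    pairs = []
    for row in csv.reader(csv_file):
        # A blank line yields an empty list; a row without a keyword is ignored too.
        if len(row) >= 2 and row[0].strip() and row[1].strip():
            pairs.append((row[0].strip(), row[1].strip()))
    return pairs


# Hypothetical sample standing in for Files.csv (not the real data)
sample = io.StringIO(
    "Company A.docx,North America\n"
    "\n"
    "Company B.docx,China\n"
    "Company C.docx,\n"
)
print(read_keywords(sample))
```

Skipping incomplete rows is a design choice: it keeps a file without a keyword from raising an IndexError later on.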

The couple hundred Word files I have are structured with different levels of headings. What I want to do now is search for specific headings using the keywords I manually defined in the .csv and then copy the content of the passage that follows to a new file. I uploaded an extract from a Word file: "Presentation" is a 'Heading 1', and "North America" and "China" are 'Heading 2'.

Word example

In this case I would like, for example, to search for the 'Heading 2' "North America" and then copy the text below it ("In total [...] diluted basis.") to a new Word file that has the same name as the old one, just with "_clean.docx" added at the end.

I started with my code as follows:

import os
import csv
import docx

os.chdir('C:\\Users\\Stef\\Desktop')
rootdir = 'C:\\Users\\Stef\\Desktop\\Files'

file_name = []
matched_keyword = []

# column 1: file name, column 2: keyword to search for in that file
with open('Files.csv', newline='') as f:
    for row in csv.reader(f):
        if len(row) < 2:
            continue  # skip blank or incomplete rows
        file_name.append(row[0])
        matched_keyword.append(row[1])

for i, j in zip(file_name, matched_keyword):
    doc = docx.Document(os.path.join(rootdir, i))

After this I was not able to find any working solution. I tried a few things but could not succeed at all. I would greatly appreciate further help.
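One way the missing step might look, as a minimal sketch: walk the paragraphs in order, start collecting once a 'Heading 2' whose text equals the keyword is seen, and stop at the next heading. The function names `section_under_heading` and `docx_paragraphs` are hypothetical. The sketch assumes the headings use Word's built-in styles named 'Heading 1', 'Heading 2' and so on, which python-docx exposes through `paragraph.style.name`. Treating any following heading as the end of the section is also an assumption.

```python
def section_under_heading(paragraphs, keyword, heading_style="Heading 2"):
    """Collect the text between a matching heading and the next heading.

    `paragraphs` is an iterable of (style_name, text) pairs in document order.
    """
    collected = []
    inside = False
    for style_name, text in paragraphs:
        if style_name.startswith("Heading"):
            if inside:
                break  # assumption: any following heading closes the section
            inside = style_name == heading_style and text.strip() == keyword
            continue
        if inside:
            collected.append(text)
    return collected


def docx_paragraphs(path):
    """Read (style_name, text) pairs from a .docx file using python-docx."""
    import docx  # third-party package python-docx, imported only when needed

    document = docx.Document(path)
    return [(p.style.name, p.text) for p in document.paragraphs]
```

Inside the loop above, this could be called as `section_under_heading(docx_paragraphs(os.path.join(rootdir, i)), j)`. Keeping the matching logic separate from the file reading makes it possible to try it on plain tuples first.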

I think the end should then look something like this, but I am not quite sure:

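output = docx.Document()  # new, empty document for the extracted paragraphs
output.save(os.path.splitext(i)[0] + "_clean.docx")  # e.g. name.docx -> name_clean.docx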

I have considered the following questions and ideas:

Extracting MS Word document formatting elements along with raw text information

extracting text from MS word files in python

How can I search a word in a Word 2007 .docx file?
