Accessing text file within subfolder

2020-05-01 07:05发布

问题:

File Structure
I have a folder, called test_folder, which has several subfolders (named different fruit names, as you'll see in my code below) within it. In each subfolder, there is always a metadump.xml file where I am extracting information from.

Current Stance
I have been able to achieve this on an individual basis, where I specify the subfolder path.

import re

in_file = open("C:/.../Downloads/test_folder/apple/metadump.xml")
contents = in_file.read()
in_file.close()

title = re.search('<dc:title rsfieldtitle="Title" 
rsembeddedequiv="Name" rsfieldref="8" rsfieldtype="0">(.+?)</dc:title>', 
contents).group(1)
print(title)

Next Steps
I would like to perform the following function on a larger scale by simply referencing the parent folder C:/.../Downloads/test_folder and making my program find the xml file for each subfolder to extract the desired information, rather than individually specifying every fruit subfolder.

Clarification
Rather than simply obtaining a list of subfolders or a list of xml files within these subfolders, I would like physically access these subfolders to perform this text extraction function from each xml file within each subfolder.

Thanks in advance for your help.

回答1:

You can use Python's os.walk() to traverse all of the subfolders. If the file is metadump.xml, it will open it and extract your title. The filename and the title is displayed:

import os

for root, dirs, files in os.walk(r"C:\...\Downloads\test_folder"):
    for file in files:
        if file == 'metadump.xml':
            filename = os.path.join(root, file) 

            with open(filename) as f_xml:
                contents = f_xml.read()
                title = re.search('<dc:title rsfieldtitle="Title" rsembeddedequiv="Name" rsfieldref="8" rsfieldtype="0">(.+?)</dc:title>', contents).group(1)
                print('{} : {}'.format(filename, title))


回答2:

you can use os.listdir as the following:

import os
parent_folder = 'C:/.../Downloads/test_folder'
subfolders = os.listdir(parent_folder)
for subfolder in subfolders:
    in_file = open(parent_folder+'/'+ subfolder+'/metadump.xml')
    contents = in_file.read()
    in_file.close()
    title = re.search('<dc:title rsfieldtitle="Title" 
    rsembeddedequiv="Name" rsfieldref="8" rsfieldtype="0">(.+?)</dc:title>', 
    contents).group(1)
    print(title)


回答3:

You can do this by using glob module if you are not sure number of subfolders in your folder. recursive=True will make it to check for all subfolders in your folder C:/../Downloads/test_folder/ and gives you list of all the metadump.xml files

import re
import glob
for file in glob.glob("C:/**/Downloads/test_folder/**/metadump.xml", recursive=True):
    with open(file) as in_file:
        contents= in_file.read()
    title = re.search('<dc:title rsfieldtitle="Title" 
rsembeddedequiv="Name" rsfieldref="8" rsfieldtype="0">(.+?)</dc:title>', 
contents).group(1)
    print(title)


回答4:

This might help you:

import os
for root, dirs, files in os.walk("/mydir"):
    for file in files:
        if file.endswith(".xml"):
            print(os.path.join(root, file))