File Structure
I have a folder called test_folder that contains several subfolders (named after different fruits, as you'll see in my code below). Each subfolder always contains a metadump.xml file, from which I am extracting information.
Current Stance
I have been able to achieve this on an individual basis, where I specify the subfolder path.
import re

in_file = open("C:/.../Downloads/test_folder/apple/metadump.xml")
contents = in_file.read()
in_file.close()

title = re.search('<dc:title rsfieldtitle="Title" rsembeddedequiv="Name" rsfieldref="8" rsfieldtype="0">(.+?)</dc:title>', contents).group(1)
print(title)
Next Steps
I would like to perform the same extraction on a larger scale by simply referencing the parent folder C:/.../Downloads/test_folder and having my program find the xml file in each subfolder and extract the desired information, rather than specifying every fruit subfolder individually.
Clarification
Rather than simply obtaining a list of subfolders or a list of the xml files within them, I would like to actually open each subfolder and run this text extraction on the xml file inside it.
Thanks in advance for your help.
You can use Python's os.walk() to traverse all of the subfolders. If a file is named metadump.xml, the code below opens it and extracts your title. The filename and the title are displayed:
import os
import re

for root, dirs, files in os.walk(r"C:\...\Downloads\test_folder"):
    for file in files:
        if file == 'metadump.xml':
            filename = os.path.join(root, file)
            with open(filename) as f_xml:
                contents = f_xml.read()
            # .group(1) raises AttributeError if the element is not found
            title = re.search('<dc:title rsfieldtitle="Title" rsembeddedequiv="Name" rsfieldref="8" rsfieldtype="0">(.+?)</dc:title>', contents).group(1)
            print('{} : {}'.format(filename, title))
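If some metadump.xml files lack the title element, re.search returns None and .group(1) raises AttributeError. Below is a minimal sketch of the same os.walk approach wrapped in a function that skips such files. The names collect_titles and TITLE_RE are hypothetical, the pattern is the one from the question, and the utf-8 encoding is an assumption about your files.

```python
import os
import re

# Pattern copied from the question; the capture group is the title text.
TITLE_RE = re.compile(
    r'<dc:title rsfieldtitle="Title" rsembeddedequiv="Name" '
    r'rsfieldref="8" rsfieldtype="0">(.+?)</dc:title>'
)


def collect_titles(parent_folder, target="metadump.xml"):
    """Return {file path: title} for every `target` file under parent_folder."""
    titles = {}
    for root, dirs, files in os.walk(parent_folder):
        if target not in files:
            continue
        filename = os.path.join(root, target)
        # Assumption: the files are UTF-8 encoded.
        with open(filename, encoding="utf-8") as f_xml:
            match = TITLE_RE.search(f_xml.read())
        if match:  # skip files that do not contain the expected element
            titles[filename] = match.group(1)
    return titles
```

Calling collect_titles on the parent folder would give one entry per subfolder whose metadump.xml contains a matching title, which you can then print or process further.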
You can use os.listdir as follows:
import os
import re

parent_folder = 'C:/.../Downloads/test_folder'
subfolders = os.listdir(parent_folder)

for subfolder in subfolders:
    xml_path = os.path.join(parent_folder, subfolder, 'metadump.xml')
    # os.listdir also returns plain files, so skip entries without the xml
    if not os.path.isfile(xml_path):
        continue
    with open(xml_path) as in_file:
        contents = in_file.read()
    title = re.search('<dc:title rsfieldtitle="Title" rsembeddedequiv="Name" rsfieldref="8" rsfieldtype="0">(.+?)</dc:title>', contents).group(1)
    print(title)
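Note that os.listdir returns files as well as folders, and it only looks one level deep. As a rough sketch of that one-level lookup, the function below uses os.scandir to keep only real subfolders that contain the file. The name find_metadump_files is hypothetical and not part of the original answer.

```python
import os


def find_metadump_files(parent_folder, target="metadump.xml"):
    """Return paths to `target` in the immediate subfolders of parent_folder."""
    found = []
    with os.scandir(parent_folder) as entries:
        for entry in entries:
            if not entry.is_dir():
                continue  # ignore plain files sitting next to the subfolders
            candidate = os.path.join(entry.path, target)
            if os.path.isfile(candidate):
                found.append(candidate)
    return sorted(found)
```

Each returned path can then be opened and searched exactly as in the single-folder version. Files nested more than one level down are deliberately not found here.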
You can do this with the glob module if you do not know how many subfolders your folder has. Passing recursive=True makes the ** pattern check all subfolders of C:/../Downloads/test_folder/ and gives you a list of all the metadump.xml files:
import re
import glob

# The first ** matches any directories between C:/ and Downloads;
# the second finds metadump.xml at any depth below test_folder.
for file in glob.glob("C:/**/Downloads/test_folder/**/metadump.xml", recursive=True):
    with open(file) as in_file:
        contents = in_file.read()
    title = re.search('<dc:title rsfieldtitle="Title" rsembeddedequiv="Name" rsfieldref="8" rsfieldtype="0">(.+?)</dc:title>', contents).group(1)
    print(title)
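With recursive=True, ** matches zero or more directories, so the search also picks up a metadump.xml sitting directly in the starting folder. Here is a small sketch that builds the pattern from a known parent folder instead of a leading **, which avoids scanning the whole drive. The function name find_metadump_recursive is hypothetical.

```python
import glob
import os


def find_metadump_recursive(parent_folder, target="metadump.xml"):
    """Return every `target` file at any depth below parent_folder."""
    pattern = os.path.join(parent_folder, "**", target)
    # recursive=True is required for ** to span multiple directory levels.
    matches = glob.glob(pattern, recursive=True)
    return sorted(os.path.normpath(path) for path in matches)
```

By default glob does not descend into directories whose names start with a dot, so hidden folders are skipped.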
This might help you:
import os

# Print the path of every .xml file under /mydir, at any depth.
for root, dirs, files in os.walk("/mydir"):
    for file in files:
        if file.endswith(".xml"):
            print(os.path.join(root, file))