Parsing text from XML node in Python

2020-01-29 02:40发布

I'm trying to extract URLs from a sitemap like this: https://www.bestbuy.com/sitemap_c_0.xml.gz

I've unzipped and saved the .xml.gz file as an .xml file. The structure looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xhtml="http://www.w3.org/1999/xhtml" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
    <loc>https://www.bestbuy.com/</loc>
    <priority>0.0</priority>
</url>
<url>
    <loc>https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008</loc>
    <priority>0.0</priority>
</url>
<url>
    <loc>https://www.bestbuy.com/site/3d-printers/3d-printing-accessories/pcmcat748300527647.c?id=pcmcat748300527647</loc>
    <priority>0.0</priority>
</url>

I'm attempting to use ElementTree to extract all of the URLs within the loc nodes throughout this file, but struggling to get it working right.

Per the documentation, I'm trying something like this:

import xml.etree.ElementTree as ET
tree = ET.parse('my_local_filepath')
root = tree.getroot()

value = root.findall(".//loc")

However, nothing gets loaded into value. My goal is to extract all of the URLs between the loc nodes and print it out into a new flat file. Where am I going wrong?

2条回答
你好瞎i
2楼-- · 2020-01-29 03:20

You were close in your attempt but like mzjn said in a comment, you didn't account for the default namespace (xmlns="http://www.sitemaps.org/schemas/sitemap/0.9").

Here's an example of how to account for the namespace:

import xml.etree.ElementTree as ET
tree = ET.parse('my_local_filepath')

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

for elem in tree.findall(".//sm:loc", ns):
    print(elem.text)

output:

https://www.bestbuy.com/
https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008
https://www.bestbuy.com/site/3d-printers/3d-printing-accessories/pcmcat748300527647.c?id=pcmcat748300527647

Note that I used the namespace prefix sm, but you could use any NCName.

See here for more information on parsing XML with namespaces in ElementTree.

查看更多
再贱就再见
3楼-- · 2020-01-29 03:37

We can iterate through the URLs, toss them into a list and write them to a file as such:

from xml.etree import ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()

name_space = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

urls = []
for child in root.iter():
    for block in child.findall('{}url'.format(name_space)):
        for url in block.findall('{}loc'.format(name_space)):
            urls.append('{}\n'.format(url.text))

with open('sample_urls.txt', 'w+') as f:
    f.writelines(urls)
  • note we need to append the name space from the open urlset definition to properly parse the xml
查看更多
登录 后发表回答