Parsing text from XML node in Python

I'm trying to extract URLs from a sitemap like this: https://www.bestbuy.com/sitemap_c_0.xml.gz

I've unzipped and saved the .xml.gz file as an .xml file. The structure looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xhtml="http://www.w3.org/1999/xhtml" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
    <loc>https://www.bestbuy.com/</loc>
    <priority>0.0</priority>
</url>
<url>
    <loc>https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008</loc>
    <priority>0.0</priority>
</url>
<url>
    <loc>https://www.bestbuy.com/site/3d-printers/3d-printing-accessories/pcmcat748300527647.c?id=pcmcat748300527647</loc>
    <priority>0.0</priority>
</url>

I'm attempting to use ElementTree to extract all of the URLs within the loc nodes throughout this file, but struggling to get it working right.

Per the documentation, I'm trying something like this:

import xml.etree.ElementTree as ET
tree = ET.parse('my_local_filepath')
root = tree.getroot()

value = root.findall(".//loc")

However, nothing gets loaded into value. My goal is to extract all of the URLs between the loc nodes and print it out into a new flat file. Where am I going wrong?

标签： python xml python-3.x elementtree

2条回答

你好瞎i

2楼-- · 2020-01-29 03:20

You were close in your attempt but like mzjn said in a comment, you didn't account for the default namespace (xmlns="http://www.sitemaps.org/schemas/sitemap/0.9").

Here's an example of how to account for the namespace:

import xml.etree.ElementTree as ET
tree = ET.parse('my_local_filepath')

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

for elem in tree.findall(".//sm:loc", ns):
    print(elem.text)

output:

https://www.bestbuy.com/
https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008
https://www.bestbuy.com/site/3d-printers/3d-printing-accessories/pcmcat748300527647.c?id=pcmcat748300527647

Note that I used the namespace prefix sm, but you could use any NCName.

See here for more information on parsing XML with namespaces in ElementTree.

0人赞添加讨论(0) 举报

再贱就再见

3楼-- · 2020-01-29 03:37

We can iterate through the URLs, toss them into a list and write them to a file as such:

from xml.etree import ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()

name_space = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

urls = []
for child in root.iter():
    for block in child.findall('{}url'.format(name_space)):
        for url in block.findall('{}loc'.format(name_space)):
            urls.append('{}\n'.format(url.text))

with open('sample_urls.txt', 'w+') as f:
    f.writelines(urls)

note we need to append the name space from the open urlset definition to properly parse the xml

0人赞添加讨论(0) 举报

Parsing text from XML node in Python

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间