I'm trying to extract URLs from a sitemap like this: https://www.bestbuy.com/sitemap_c_0.xml.gz
I've unzipped and saved the .xml.gz file as an .xml file. The structure looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xhtml="http://www.w3.org/1999/xhtml" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
<loc>https://www.bestbuy.com/</loc>
<priority>0.0</priority>
</url>
<url>
<loc>https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008</loc>
<priority>0.0</priority>
</url>
<url>
<loc>https://www.bestbuy.com/site/3d-printers/3d-printing-accessories/pcmcat748300527647.c?id=pcmcat748300527647</loc>
<priority>0.0</priority>
</url>
I'm attempting to use ElementTree to extract all of the URLs within the loc nodes throughout this file, but struggling to get it working right.
Per the documentation, I'm trying something like this:
import xml.etree.ElementTree as ET
tree = ET.parse('my_local_filepath')
root = tree.getroot()
value = root.findall(".//loc")
However, nothing gets loaded into value. My goal is to extract all of the URLs between the loc nodes and print it out into a new flat file. Where am I going wrong?
You were close in your attempt but like mzjn said in a comment, you didn't account for the default namespace (
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
).Here's an example of how to account for the namespace:
output:
Note that I used the namespace prefix
sm
, but you could use any NCName.See here for more information on parsing XML with namespaces in ElementTree.
We can iterate through the URLs, toss them into a list and write them to a file as such: