I'm trying to extract URLs from a sitemap like this: https://www.bestbuy.com/sitemap_c_0.xml.gz
I've unzipped and saved the .xml.gz file as an .xml file. The structure looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xhtml="http://www.w3.org/1999/xhtml" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
<loc>https://www.bestbuy.com/</loc>
<priority>0.0</priority>
</url>
<url>
<loc>https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008</loc>
<priority>0.0</priority>
</url>
<url>
<loc>https://www.bestbuy.com/site/3d-printers/3d-printing-accessories/pcmcat748300527647.c?id=pcmcat748300527647</loc>
<priority>0.0</priority>
</url>
I'm attempting to use ElementTree to extract all of the URLs within the loc nodes throughout this file, but I'm struggling to get it working.
Per the documentation, I'm trying something like this:
import xml.etree.ElementTree as ET
tree = ET.parse('my_local_filepath')
root = tree.getroot()
value = root.findall(".//loc")
However, value ends up as an empty list. My goal is to extract every URL inside the loc nodes and write them out to a new flat file. Where am I going wrong?
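
For reference, here is a minimal sketch of a namespace-aware version, assuming the real file matches the excerpt above. The urlset root declares a default namespace (the xmlns attribute), so ElementTree stores each element under a qualified name and a bare ".//loc" matches nothing. The prefix "sm", the helper names extract_locs and write_urls, and the output filename in the usage note are illustrative choices, not anything from the sitemap or the ElementTree docs.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns attribute on <urlset>.
# The "sm" prefix is arbitrary; it only has to match the prefix used in the query.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_path):
    """Return the text of every <loc> element in the sitemap at xml_path."""
    root = ET.parse(xml_path).getroot()
    # Qualify the tag with the namespace prefix so it matches the namespaced elements.
    return [el.text.strip() for el in root.findall(".//sm:loc", NS) if el.text]


def write_urls(urls, out_path):
    """Write one URL per line to out_path."""
    with open(out_path, "w", encoding="utf-8") as fh:
        for url in urls:
            fh.write(url + "\n")
```

With these helpers, something like write_urls(extract_locs('my_local_filepath'), 'urls.txt') would produce the flat file, where 'urls.txt' is a placeholder name.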