I'm trying to crawl all links of a sitemap.xml to re-cache a website. But the recursive option of wget does not work; all I get as a response is:
Remote file exists but does not contain any link -- not retrieving.
But the sitemap.xml is definitely full of "http://..." links.
I tried almost every option of wget, but nothing worked for me:
wget -r --mirror http://mysite.com/sitemap.xml
Does anyone know how to open all the links inside a website's sitemap.xml?
Thanks, Dominic
It seems that wget can't parse XML, so you'll have to extract the links manually. You could do something like the sketch below. I learned this trick here.
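A minimal sketch of that manual extraction, assuming the sitemap lists absolute http(s) URLs and grep is available (the exact command from the linked answer isn't reproduced here):

# Download the sitemap to stdout, pull out every absolute URL,
# and feed the resulting list back into wget.
wget --quiet --output-document=- http://mysite.com/sitemap.xml \
  | grep -Eo 'https?://[^<"]+' \
  | wget --input-file=- --no-verbose --delete-after

The grep step works because a sitemap's <loc> entries are absolute URLs, so no real XML parsing is needed just to get a URL list.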
While this question is older, Google sent me here.
I finally used xsltproc to parse the sitemap.xml:
sitemap-txt.xsl:
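A minimal version of such a stylesheet (assuming the standard http://www.sitemaps.org/schemas/sitemap/0.9 namespace; the original file isn't reproduced here) could look like this:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Turns a standard sitemap.xml into a plain-text list of URLs, one per line -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9">
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:template match="/">
    <xsl:for-each select="sm:urlset/sm:url">
      <xsl:value-of select="sm:loc"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>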
Using it (in this case it is from a cache-prewarming script, so the retrieved pages are not kept ("-o /dev/null"); only some statistics are printed ("-w ....")):
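A sketch of such an invocation (the exact format string passed to "-w" isn't shown here; curl's %{http_code} and %{time_total} variables are used as an example):

# Fetch the sitemap, turn it into a plain URL list, and request each page once.
# Responses are discarded (-o /dev/null); only status code and timing are printed (-w).
wget --quiet --output-document=sitemap.xml http://mysite.com/sitemap.xml
xsltproc sitemap-txt.xsl sitemap.xml \
  | xargs -I{} curl --silent -o /dev/null \
      -w "%{http_code} %{time_total}s {}\n" {}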
(Rewriting this to use wget instead of curl is left as an exercise for the reader ;-) ) What this does is: fetch the sitemap, transform it with xsltproc into a plain list of URLs, and request each URL once so the cache gets repopulated, printing only the per-request statistics.
You can use one of the sitemapping tools. Try Slickplan. It has a site crawler option, and by using it you can import the structure of an existing website and create a visual sitemap from it. Then you can export it to Slickplan's XML format, which contains not only links, but also SEO metadata, page titles (product names), and a bunch of other helpful data.