I'm deploying a replacement site for a client but they don't want all their old pages to end in 404s. Keeping the old URL structure wasn't possible because it was hideous.
So I'm writing a 404 handler that should look for an old page being requested and do a permanent redirect to the new page. Problem is, I need a list of all the old page URLs.
I could do this manually, but I'd be interested to know if there are any apps that would give me a list of relative URLs (e.g. /page/path, not http:/.../page/path) given just the home page. Like a spider, but one that doesn't care about the content other than to find deeper pages.
I didn't mean to answer my own question, but I just thought about running a sitemap generator. The first one I found, http://www.xml-sitemaps.com, has a nice text output. Perfect for my needs.
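If the generator's text output lists absolute URLs, a few lines of Python can reduce them to the relative form the question asks for. This is only a sketch: the function names to_relative and relative_urls are made up for illustration, and it assumes the input is one URL per line.

```python
from urllib.parse import urlsplit


def to_relative(url):
    """Return the path (plus query string, if any) of an absolute URL."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"  # a bare host name maps to the site root
    return f"{path}?{parts.query}" if parts.query else path


def relative_urls(lines):
    """Convert an iterable of absolute URLs to unique relative URLs, keeping order."""
    seen = set()
    result = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines in the exported list
        rel = to_relative(line)
        if rel not in seen:
            seen.add(rel)
            result.append(rel)
    return result
```

For example, to_relative("http://www.oldsite.com/page/path") gives "/page/path".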
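Run wget -r -l0 www.oldsite.com to download the site recursively (-l0 removes the depth limit).
Then just running find www.oldsite.com on the downloaded directory
would reveal all URLs, I believe.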
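Where find isn't available (on Windows, say), the same listing can be sketched in Python. This assumes the mirror already exists on disk; mirrored_paths is a hypothetical helper name, not part of wget.

```python
import os


def mirrored_paths(root):
    """Yield a site-relative URL path for every file under a mirror directory."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Make the path relative to the mirror root and use URL-style slashes.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            yield "/" + rel
```

Calling sorted(mirrored_paths("www.oldsite.com")) would give a list such as /index.html, /page/path.html. Note that wget saves directory URLs as index.html by default, so those entries may need trimming to match the URLs visitors actually requested.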
Alternatively, just serve a custom not-found page on every 404 request!
I.e. if someone follows a wrong link, they get a page telling them the page wasn't found and giving some hints about the site's content.
Here is a list of sitemap generators (from which obviously you can get the list of URLs from a site): http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators
Web Sitemap Generators
The following are links to tools that generate or maintain files in
the XML Sitemaps format, an open standard defined on sitemaps.org and
supported by the search engines such as Ask, Google, Microsoft Live
Search and Yahoo!. Sitemap files generally contain a collection of
URLs on a website along with some meta-data for these URLs. The
following tools generally generate "web-type" XML Sitemap and URL-list
files (some may also support other formats).
Please Note: Google has not tested or verified the features or
security of the third party software listed on this site. Please
direct any questions regarding the software to the software's author.
We hope you enjoy these tools!
Server-side Programs
- Enarion phpSitemapsNG (PHP)
- Google Sitemap Generator (Linux/Windows, 32/64bit, open-source)
- Outil en PHP (French, PHP)
- Perl Sitemap Generator (Perl)
- Python Sitemap Generator (Python)
- Simple Sitemaps (PHP)
- SiteMap XML Dynamic Sitemap Generator (PHP) $
- Sitemap generator for OS/2 (REXX-script)
- XML Sitemap Generator (PHP) $
CMS and Other Plugins:
- ASP.NET - Sitemaps.Net
- DotClear (Spanish)
- DotClear (2)
- Drupal
- ECommerce Templates (PHP) $
- Ecommerce Templates (PHP or ASP) $
- LifeType
- MediaWiki Sitemap generator
- mnoGoSearch
- OS Commerce
- phpWebSite
- Plone
- RapidWeaver
- Textpattern
- vBulletin
- Wikka Wiki (PHP)
- WordPress
Downloadable Tools
- GSiteCrawler (Windows)
- GWebCrawler & Sitemap Creator (Windows)
- G-Mapper (Windows)
- Inspyder Sitemap Creator (Windows) $
- IntelliMapper (Windows) $
- Microsys A1 Sitemap Generator (Windows) $
- Rage Google Sitemap Automator $ (OS-X)
- Screaming Frog SEO Spider and Sitemap generator (Windows/Mac) $
- Site Map Pro (Windows) $
- Sitemap Writer (Windows) $
- Sitemap Generator by DevIntelligence (Windows)
- Sorrowmans Sitemap Tools (Windows)
- TheSiteMapper (Windows) $
- Vigos Gsitemap (Windows)
- Visual SEO Studio (Windows)
- WebDesignPros Sitemap Generator (Java Webstart Application)
- Weblight (Windows/Mac) $
- WonderWebWare Sitemap Generator (Windows)
Online Generators/Services
- AuditMyPc.com Sitemap Generator
- AutoMapIt
- Autositemap $
- Enarion phpSitemapsNG
- Free Sitemap Generator
- Neuroticweb.com Sitemap Generator
- ROR Sitemap Generator
- ScriptSocket Sitemap Generator
- SeoUtility Sitemap Generator (Italian)
- SitemapDoc
- Sitemapspal
- SitemapSubmit
- Smart-IT-Consulting Google Sitemaps XML Validator
- XML Sitemap Generator
- XML-Sitemaps Generator
CMS with integrated Sitemap generators
Google News Sitemap Generators
The following plugins allow
publishers to update Google News Sitemap files, a variant of the
sitemaps.org protocol that we describe in our Help Center. In addition
to the normal properties of Sitemap files, Google News Sitemaps allow
publishers to describe the types of content they publish, along with
specifying levels of access for individual articles. More information
about Google News can be found in our Help Center and Help Forums.
- WordPress Google News plugin
Code Snippets / Libraries
- ASP script
- Emacs Lisp script
- Java library
- Perl script
- PHP class
- PHP generator script
If you believe that a tool should be added or removed for a legitimate
reason, please leave a comment in the Webmaster Help Forum.
The best one I have found is http://www.auditmypc.com/xml-sitemap.asp, which uses Java, has no limit on pages, and even lets you export the results as a raw URL list.
It also uses sessions, so if you are using a CMS, make sure you are logged out before you run the crawl.
So, in an ideal world you'd have a spec for all pages in your site. You would also have a test infrastructure that could hit all your pages to test them.
You're presumably not in an ideal world. Why not do this...?
- Create a mapping between the well-known old URLs and the new ones.
- Redirect when you see an old URL. I'd possibly consider presenting a "this page has moved, its new URL is XXX, you'll be redirected shortly" message.
- If you have no mapping, present a "sorry - this page has moved. Here's a link to the home page" message and redirect them if you like.
- Log all redirects, especially the ones with no mapping. Over time, add mappings for pages that are important.
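The steps above could be sketched roughly as follows. It is framework-agnostic and makes several assumptions: the REDIRECTS entries are invented example paths, resolve is a hypothetical helper, and the 404 handler is expected to turn the returned status and location into an actual response.

```python
import logging

logger = logging.getLogger("redirects")

# Hypothetical mapping of old paths to new paths; fill in from your own URL list.
REDIRECTS = {
    "/old/about.asp": "/about",
    "/old/contact.asp": "/contact",
}


def resolve(old_path, mapping=REDIRECTS, fallback="/"):
    """Return (status_code, location) for a request that reached the 404 handler."""
    new_path = mapping.get(old_path)
    if new_path is not None:
        logger.info("redirect %s -> %s", old_path, new_path)
        return 301, new_path  # known old URL: permanent redirect
    # No mapping yet: log it so that important pages can be added over time.
    logger.warning("no mapping for %s", old_path)
    return 404, fallback  # show the "sorry" page with a link to the fallback
```

Reviewing the "no mapping" log entries every so often shows which old URLs are still being requested and are worth adding to the mapping.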
wget from a Linux box might also be a good option, as there are switches to make it spider a site and to change its output.
EDIT: wget is also available on Windows: http://gnuwin32.sourceforge.net/packages/wget.htm
Write a spider which reads in every HTML file from disk and outputs every "href" attribute of an "a" element (this can be done with a parser). Keep track of which links belong to which page (this is a common task for a MultiMap data structure). After this you can produce a mapping file which acts as the input for the 404 handler.
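A minimal sketch of such a spider, using only Python's standard library. It assumes the old site's pages sit on disk as .html files under one directory; HrefCollector and links_by_page are illustrative names, and a dict of sets stands in for the MultiMap.

```python
from collections import defaultdict
from html.parser import HTMLParser
from pathlib import Path


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_by_page(root):
    """Map each HTML file under root (as a relative path) to the set of hrefs it contains."""
    root = Path(root)
    pages = defaultdict(set)  # page -> set of links, i.e. a simple multimap
    for path in root.rglob("*.html"):
        parser = HrefCollector()
        parser.feed(path.read_text(encoding="utf-8", errors="replace"))
        pages[path.relative_to(root).as_posix()].update(parser.hrefs)
    return dict(pages)
```

The union of all the sets is the list of linked URLs; the per-page keys show where each link came from when building the mapping file.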
I would look into any of the online sitemap generation tools. Personally, I've used this one (Java-based) in the past, but if you do a Google search for "sitemap builder" I'm sure you'll find lots of different options.