How to generate graphical sitemap of large website

Published 2019-03-21 16:47

Question:

I would like to generate a graphical sitemap for my website. There are two stages, as far as I can tell:

  1. crawl the website and analyse the link relationship to extract the tree structure
  2. generate a visually pleasing render of the tree

Does anyone have advice or experience with achieving this, or know of existing work I can build on (ideally in Python)?

I came across some nice CSS for rendering the tree, but it only works for 3 levels.

Thanks

Answer 1:

Here is a Python web crawler, which should make a good starting point. Your general strategy is this:

  • You need to take care that outbound links are never followed, including links on the same domain but higher up than your starting point.
  • As you spider the site, collect a hash of page URLs mapped to a list of all the internal URLs included in each page.
  • Take a pass over this list, assigning a token to each unique URL.
  • Use your hash of {token => [tokens]} to generate a Graphviz file that will lay out the graph for you.
  • Convert the Graphviz output into an imagemap in which each node links to its corresponding web page.
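The collection and DOT-generation steps above can be sketched roughly as follows. This is a minimal illustration, not a full crawler: the function names (`internal_links`, `to_dot`) are my own, it uses only the standard library, and the fetching loop, error handling, and deduplication that a real spider needs are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def internal_links(page_url, html, root):
    """Return absolute links in `html` that stay at or below `root`,
    so outbound links and higher-up paths are never followed."""
    parser = LinkCollector()
    parser.feed(html)
    result = []
    for href in parser.links:
        absolute = urljoin(page_url, href)
        if absolute.startswith(root):
            result.append(absolute)
    return result

def to_dot(link_map):
    """Turn a {url: [urls]} hash into Graphviz DOT text, assigning a
    short token to each unique URL and attaching the URL attribute
    so Graphviz can later emit a clickable imagemap."""
    tokens = {}
    for url in link_map:
        tokens.setdefault(url, f"n{len(tokens)}")
    for targets in link_map.values():
        for url in targets:
            tokens.setdefault(url, f"n{len(tokens)}")
    lines = ["digraph sitemap {"]
    for url, tok in tokens.items():
        lines.append(f'  {tok} [label="{url}", URL="{url}"];')
    for src, targets in link_map.items():
        for dst in targets:
            lines.append(f"  {tokens[src]} -> {tokens[dst]};")
    lines.append("}")
    return "\n".join(lines)
```

Once you have the DOT file, `dot -Tpng -o sitemap.png sitemap.dot` renders the image, and `dot -Tcmapx sitemap.dot` emits the HTML imagemap; the `URL` node attribute is what makes each node clickable in the map.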

The reason you need to do all this is, as leonm noted, that websites are graphs, not trees, and laying out graphs is a harder problem than you can solve with a simple piece of JavaScript and CSS. Graphviz is good at what it does.



Answer 2:

The only automatic way to create a sitemap is to know the structure of your site and write a program that builds on that knowledge. Just crawling the links usually won't work, because links can run between any two pages, so what you get is a graph (i.e., connections between nodes). In the general case there is no way to convert a graph into a tree.

So you must identify the structure of your tree yourself and then crawl the relevant pages to get the titles of the pages.
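If the site's hierarchy happens to mirror its URL paths, one way to "identify the structure yourself" is to nest the crawled URLs by path segment. A minimal sketch, assuming that convention holds (the helper name `build_tree` is mine):

```python
from urllib.parse import urlparse

def build_tree(urls):
    """Nest URLs into a dict-of-dicts tree keyed by path segment,
    assuming the site's hierarchy mirrors its URL paths."""
    root = {}
    for url in urls:
        node = root
        path = urlparse(url).path.strip("/")
        segments = path.split("/") if path else []
        for segment in segments:
            node = node.setdefault(segment, {})
    return root
```

You would still crawl each page afterwards to replace the raw path segments with real page titles.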

As for "but it only works for 3 levels": three levels is more than enough. If you try to create more levels, your sitemap will become unusable (too big, too wide). No one will want to download a 1 MB sitemap and then scroll through 100,000 links. If your site grows that big, you must implement some kind of search instead.



Answer 3:

Please see http://aaron.oirt.rutgers.edu/myapp/docs/W1100_2200.TreeView for advice on how to format tree views. You can also probably modify the example application at http://aaron.oirt.rutgers.edu/myapp/DirectoryTree/index to scrape your pages, if they are organized as directories of HTML files.