Should a highly dynamic website that is constantly generating new pages use a sitemap? If so, how does a site like stackoverflow.com go about regenerating a sitemap? It seems like it would be a drain on precious server resources if it was constantly regenerating a sitemap every time someone adds a question. Does it generate a new sitemap at set intervals (e.g. every four hours)? I'm very curious how large, dynamic websites make this work.
For a highly dynamic site, I wrote a cron job on my server that runs daily. It makes a REST call to my backend, generates a new sitemap covering all the newly created content, and returns it as an XML file. This new sitemap overwrites the previous one, keeping the site up to date with all the changes. Regenerating the sitemap for each newly added piece of dynamic content is not a good approach, I think.
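As a rough sketch of that kind of daily job (the endpoint URL, response shape, and file paths here are assumptions, not from the answer above), the script run by cron might look like this in PHP:

```php
<?php
// Run daily from cron, e.g.: 0 3 * * * php /var/www/scripts/build_sitemap.php
// Hypothetical backend endpoint returning a JSON array of page URLs.
$urls = json_decode(file_get_contents('https://example.com/api/page-urls'), true);

$xml = new SimpleXMLElement(
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"/>'
);
foreach ($urls as $url) {
    // htmlspecialchars() keeps characters like & valid inside the XML.
    $xml->addChild('url')->addChild('loc', htmlspecialchars($url));
}

// Write to a temp file and rename so the overwrite is atomic and
// crawlers never see a half-written sitemap.
file_put_contents('/var/www/public/sitemap.xml.tmp', $xml->asXML());
rename('/var/www/public/sitemap.xml.tmp', '/var/www/public/sitemap.xml');
```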
I would only create a sitemap for the more static pages of the site. For example, on StackOverflow a sitemap could show links for the FAQ, About, Questions, Tags, Users, etc. pages, but not links to the actual questions, all of the tags, or the various users.
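For illustration, such a static sitemap is just a plain urlset file, per the sitemaps.org protocol (the URLs below are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/faq</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/questions</loc></url>
  <url><loc>https://example.com/tags</loc></url>
  <url><loc>https://example.com/users</loc></url>
</urlset>
```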
Even on something like StackOverflow there is a certain amount of static organization: FAQs, tag pages, question pages, user pages, badge pages, and so on. For a very dynamic site, I'd say the best way to approach a sitemap is to map the categorizations, with each node in the sitemap pointing to a page of the dynamically generated data (a node for a question page, a node for a user page, etc.).
Of course, a sitemap may not even be appropriate for a given site; there's a certain amount of judgment call required there.
There's no need to regenerate the sitemap XML each time a question is posted. It's far simpler to generate the XML file on demand, directly from the database, with a little caching.
To reduce load, the sitemap can be split into many sitemaps. Partitioning it by day or month would allow you to tell Google to retrieve today's sitemap frequently, but fetch the sitemap from six months ago only once in a while.
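A sketch of what such a partitioned sitemap index could look like (the filenames and dates are made up for illustration); the lastmod values let a crawler see that only the current partition is still changing:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Old partition: its content is frozen, so lastmod never advances. -->
  <sitemap>
    <loc>https://example.com/sitemap-2009-01.xml</loc>
    <lastmod>2009-01-31</lastmod>
  </sitemap>
  <!-- Current partition: lastmod moves forward as new questions arrive. -->
  <sitemap>
    <loc>https://example.com/sitemap-2009-07.xml</loc>
    <lastmod>2009-07-14</lastmod>
  </sitemap>
</sitemapindex>
```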
On StackOverflow (and all Stack Exchange sites), a sitemap.xml file is maintained that contains a link to every question posted on the system. When a new question is posted, they simply append another entry to the end of the sitemap file. Appending to the end of the file isn't that resource-intensive, but the file does get quite large.
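One way to do that append without rewriting the whole file is to seek back over the closing tag, write the new entry, and restore the tag. A minimal sketch (the file path and URL are hypothetical, and it assumes the file ends exactly with </urlset>):

```php
<?php
// Insert one new <url> entry just before the closing </urlset> tag.
function appendToSitemap(string $path, string $loc): void
{
    $closing = '</urlset>';
    $fp = fopen($path, 'r+');
    flock($fp, LOCK_EX);                        // serialize concurrent writers
    fseek($fp, -strlen($closing), SEEK_END);    // step back over </urlset>
    fwrite($fp, '<url><loc>' . htmlspecialchars($loc) . '</loc></url>' . $closing);
    flock($fp, LOCK_UN);
    fclose($fp);
}

appendToSitemap('/var/www/public/sitemap.xml',
                'https://example.com/questions/12345/example-question');
```

The cost of the write is constant no matter how large the file has grown, which is why the append stays cheap.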
That is the only way search engines like Google can effectively crawl the site.
Jeff Atwood talks about it in a blog post: The Importance of Sitemaps
Google's webmaster help page on sitemaps covers this as well.
I'd like to share my solution here just in case it helps someone as well. It took me reading this question and many others to decide what to do.
My site structure:
- Static pages
  - ...etc
- Dynamic pages (Artists, Albums, Songs)

My approach:
sitemap.xml: This URL generates a <sitemapindex /> with the first item being /sitemap-main.xml. The number of Artists, Albums, Songs, etc. are counted and divided by 1,000 (the number of URLs I want in each sitemap; the limit is 50,000), and I round this number up. So, for example, 1,900 songs = 1.9 = 2: I generate and add the URLs /sitemap-songs-0.xml and /sitemap-songs-1.xml to the index. I repeat this for all the other items. Basically, I am paginating. The output is returned uncached; I want this to always be fresh.
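A sketch of that index generation (countRows() and the table names are stand-ins for whatever query your backend uses; the page size of 1,000 follows the description above):

```php
<?php
// Build the <sitemapindex />: one entry for sitemap-main.xml, then
// ceil(count / 1000) paginated entries per content type.
$perPage = 1000;
$counts = [
    'artists' => countRows('artists'),   // countRows() is a hypothetical DB helper
    'albums'  => countRows('albums'),
    'songs'   => countRows('songs'),     // e.g. 1900 songs -> 2 pages
];

$xml = new SimpleXMLElement(
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"/>'
);
$xml->addChild('sitemap')->addChild('loc', 'https://example.com/sitemap-main.xml');

foreach ($counts as $type => $count) {
    $pages = (int) ceil($count / $perPage);          // round up
    for ($i = 0; $i < $pages; $i++) {
        $xml->addChild('sitemap')
            ->addChild('loc', "https://example.com/sitemap-$type-$i.xml");
    }
}

header('Content-Type: application/xml');             // returned uncached: always fresh
echo $xml->asXML();
```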
sitemap-main.xml: This lists all the static pages. You can actually use a static file for this, as it only needs to be updated once in a while.
sitemap-songs-0.xml, sitemap-albums-0.xml, etc.: I use a single route for these in SlimPhp 2, with a simple switch statement to generate the relevant file. If a page holds the full 1,000 items (the page size specified above), it is complete and won't change, so I cache the file for two weeks. Otherwise, I only cache it for a few hours.
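A minimal sketch of that single route in Slim 2 (the route pattern, the fetch helpers, and the exact cache durations are illustrative assumptions):

```php
<?php
require 'vendor/autoload.php';

$app = new \Slim\Slim();

// One route serves sitemap-songs-0.xml, sitemap-albums-1.xml, and so on.
$app->get('/sitemap-:type-:page.xml', function ($type, $page) use ($app) {
    $perPage = 1000;
    switch ($type) {    // pick the query for each content type
        case 'songs':   $urls = fetchSongUrls((int) $page, $perPage);   break;
        case 'albums':  $urls = fetchAlbumUrls((int) $page, $perPage);  break;
        case 'artists': $urls = fetchArtistUrls((int) $page, $perPage); break;
        default:        $app->notFound();
    }

    // A full page of 1,000 URLs is complete and won't change: cache it longer.
    $app->expires(count($urls) === $perPage ? '+2 weeks' : '+3 hours');
    $app->contentType('application/xml');

    $xml = new SimpleXMLElement(
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"/>'
    );
    foreach ($urls as $url) {
        $xml->addChild('url')->addChild('loc', htmlspecialchars($url));
    }
    echo $xml->asXML();
});

$app->run();
```

The fetch* functions above are placeholders for paginated database queries returning arrays of URLs.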
I hope this helps anyone else implementing their own system.