Let's say we have Twitter, and every profile needs to get indexed by search engines. How does Twitter handle its sitemap? Is there something like a "regex" sitemap for the whole domain, or do they re-generate a sitemap for each user?
How does this work for pages that you don't know in advance, i.e. dynamic pages? Look at Wikipedia, for example: how do they make sure everything gets indexed by search engines?
Many systems use a dynamically generated sitemap.
You can upload any sitemap to Google via Webmaster Tools (the service is free of charge) under Optimization > Sitemaps. It does not have to be sitemap.xml; it can be a JSP or ASPX page too.
Webmaster Tools allows you to upload many different sitemaps for a single website; however, I am not sure what the maximum number of sitemaps is.
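For instance, a site with one page per user profile can serve its sitemap from a small script that builds the XML from the database on every request. Here is a minimal sketch in Python with Flask; get_profiles() and the example.com URLs are placeholders for whatever your application actually uses:

    from flask import Flask, Response

    app = Flask(__name__)

    def get_profiles():
        # Placeholder for a real database query returning (username, last_modified) pairs.
        return [("alice", "2012-06-01"), ("bob", "2012-06-15")]

    @app.route("/sitemap.xml")
    def sitemap():
        # Build one <url> entry per profile page.
        entries = []
        for username, lastmod in get_profiles():
            entries.append(
                "  <url>\n"
                f"    <loc>http://example.com/{username}</loc>\n"
                f"    <lastmod>{lastmod}</lastmod>\n"
                "  </url>"
            )
        xml = (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            + "\n".join(entries)
            + "\n</urlset>"
        )
        return Response(xml, mimetype="application/xml")

Because the XML is built on request, the sitemap stays in sync with the data without a separate re-generation step, which is exactly what the JSP/ASPX sitemap pages mentioned above amount to.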
Some crawlers support a Sitemap directive, allowing multiple sitemaps to be listed in the same robots.txt.
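For example, with example.com standing in for the real domain, each sitemap simply gets its own line:

    Sitemap: http://www.example.com/sitemap1.xml
    Sitemap: http://www.example.com/sitemap2.xml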
EDIT: The Microsoft website is a very good example: its robots.txt file contains a lot of sitemap entries, some static (XML) and some dynamic (ASPX).
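A dynamic sitemap entry is just a URL that happens to be served by a script; an illustrative line (hypothetical URL, not one of Microsoft's actual entries) would be:

    Sitemap: http://www.example.com/sitemap.aspx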
Most likely, they don't bother with a sitemap.
For highly dynamic sites, a sitemap will not help that much. Google will only index a certain amount anyway, and if everything changes before Google gets around to revisiting it, you don't gain much.
For slowly changing sites it is different. The sitemap tells Google, on the one hand, which pages exist that it may not have visited at all yet, and, more importantly, which pages have not changed and thus do not need to be revisited.
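The <lastmod> field is what carries that "has not changed" signal. A single entry in a sitemap.xml looks roughly like this (placeholder URL and date):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/some-page</loc>
        <lastmod>2012-06-01</lastmod>
        <changefreq>monthly</changefreq>
      </url>
    </urlset>

A crawler that trusts the lastmod value can skip pages it has already fetched since that date.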
But the sitemap.xml mechanism just does not scale up to huge and highly dynamic sites such as Twitter.