I would like to periodically check what sub-domains are being listed by Google.
To obtain a list of sub-domains, I type 'site:example.com' into the Google search box - this lists all the sub-domain results (over 20 pages for our domain).
What is the best way to extract only the URLs of the results returned by the 'site:example.com' search?
I was thinking of writing a little Python script that would run the above search and pull the URLs out of the results with a regex, repeating over all the result pages. Is this a good start? Is there a better methodology?
Cheers.
The Google Custom Search API can deliver results in Atom XML format.
See: Getting Started with Google Custom Search
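A minimal sketch of querying it from Python, using the JSON variant of the Custom Search API rather than the Atom feed (the API key, the search engine ID, and the requests dependency are placeholders/assumptions you'd fill in):

```python
# Sketch: page through Custom Search JSON API results for a site: query.
import requests

API_KEY = "your-api-key"          # placeholder: your Google API key
CX = "your-search-engine-id"      # placeholder: your Programmable Search Engine ID

def search(query, start=1):
    """Fetch one page (up to 10 results) of Custom Search results."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": query, "start": start},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

urls = []
for start in range(1, 100, 10):   # the API caps a query at 100 results
    items = search("site:example.com", start=start).get("items", [])
    if not items:
        break
    urls.extend(item["link"] for item in items)

print("\n".join(urls))
```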
Regex is a bad idea for parsing HTML. It's cryptic to read and relies on well-formed HTML.
Try BeautifulSoup for Python. Here's an example script that returns URLs from the first 10 pages of a site:domain.com Google query.
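A sketch of such a script, assuming requests and BeautifulSoup are installed; the User-Agent string and the /url?q= link format are assumptions about Google's result markup, which changes often (and which Google's terms of service forbid scraping at volume - another reason to prefer the API route above):

```python
# Sketch: scrape the first 10 result pages of a site: query with BeautifulSoup.
import urllib.parse

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # Google tends to block the default UA

urls = []
for page in range(10):
    resp = requests.get(
        "https://www.google.com/search",
        params={"q": "site:domain.com", "start": page * 10},
        headers=headers,
        timeout=10,
    )
    soup = BeautifulSoup(resp.text, "html.parser")
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Result links are typically wrapped as /url?q=<real-url>&... (assumption)
        if href.startswith("/url?q="):
            target = urllib.parse.parse_qs(urllib.parse.urlparse(href).query)["q"][0]
            urls.append(target)

for url in urls:
    print(url)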
Running it prints each result URL on its own line.
Of course, with the results collected in a list, you can go on to parse them for sub-domains (see the snippet below). I only got into Python and scraping a few days ago, but this should get you started.
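For instance, a minimal follow-up, assuming the urls list from the sketch above:

```python
# Sketch: reduce the collected URLs to a set of unique host names (sub-domains).
from urllib.parse import urlparse

subdomains = {urlparse(u).netloc for u in urls}  # 'urls' from the script above
for host in sorted(subdomains):
    print(host)
```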