BeautifulSoup-ing a website with login and site search

Posted 2020-08-03 04:36

Question:

I'm trying to scrape the International Maritime Organization's data (https://gisis.imo.org/Public/PAR/Search.aspx) on attacks against shipping vessels between 2002-01-01 and 2005-12-31 (the "is between" option in the site's search engine).

I've previously used the bs4 and requests modules in Python to scrape financial data from Yahoo and weather data from Wunderground, but this site requires a login and password (under the "public" account type). Furthermore, as I said, the data requires a search/filter before I can access the HTML on the page:

Once I click on a row in the results, it expands into a detail view. (Before anyone asks why I don't just download the dataset and pull from there: the download is filtered for some reason, and not all of the columns are included, for example the IMO number.)

Ultimately, the data I am trying to pull is from that detail page, and I need the following (item, CSS path); see the extraction sketch after the list:

  • position of incident

    #ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody > tr:nth-child(1) > td:nth-child(2) > span
    
  • date

    #ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody > tr:nth-child(6) > td.content > span
    
  • ship name

    #ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody > tr:nth-child(4) > td:nth-child(2) > span
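
For reference, once I have the HTML of an expanded detail page, I'd expect the extraction step itself to look roughly like this. The file name is just a stand-in for a saved copy of one detail page, and the tbody in devtools-copied selectors is inserted by the browser, so it may need to be dropped if nothing matches:

    from bs4 import BeautifulSoup

    # "incident_detail.html" is a placeholder: a saved copy of one expanded incident page
    detail_html = open("incident_detail.html", encoding="utf-8").read()
    soup = BeautifulSoup(detail_html, "html.parser")

    selectors = {
        "position": "#ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody > tr:nth-child(1) > td:nth-child(2) > span",
        "date": "#ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody > tr:nth-child(6) > td.content > span",
        "ship_name": "#ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody > tr:nth-child(4) > td:nth-child(2) > span",
    }

    record = {}
    for field, css in selectors.items():
        node = soup.select_one(css)
        # browsers insert <tbody>; retry without it if the raw markup lacks one
        if node is None:
            node = soup.select_one(css.replace(" > tbody", ""))
        record[field] = node.get_text(strip=True) if node else None

    print(record)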
    

Needless to say, this seems like a daunting task. Any recommendations?

Here is the old code I've been using to scrape the weather data (I haven't changed anything yet because I don't know where to start with the login/filter process): http://pythonfiddle.com/get-wx-data

Answer 1:

requests alone isn't going to be enough. You'll want to look into mechanize: http://wwwsearch.sourceforge.net/mechanize/

The nice thing about mechanize is that it maintains state from page to page, unlike requests. (You probably could do it with just requests, but I'm not quite that clever.) Here's an example of a simple login interaction.
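Something along these lines; the login URL and the form field names below are placeholders, so check the actual login page (view source, or the Network tab) for the real ones:

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)  # skip robots.txt handling for this sketch

    # placeholder URL: point this at the actual GISIS login page
    br.open("https://gisis.imo.org/Public/Shared/Login.aspx")

    br.select_form(nr=0)              # first <form> on the page
    br["Username"] = "your_username"  # placeholder field name
    br["Password"] = "your_password"  # placeholder field name
    resp = br.submit()

    print(resp.geturl())  # should land on a logged-in page if the credentials took

mechanize keeps the cookies from that login on every subsequent br.open() call, which is the "state from page to page" part.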

This would be awesome, if the IMO site were that easy. Instead, it's ASP-based, and that means it's relatively irritating to scrape. Some of the details will vary from site to site, so I'll suggest two things in particular: looking at the Network tab of your browser's developer tools and reading this ScraperWiki post on dealing with ASP sites.
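The gist of the ASP problem: every postback has to carry the page's hidden __VIEWSTATE / __EVENTVALIDATION fields back to the server along with your own form fields. Here's a rough sketch of that pattern with requests and BeautifulSoup (mechanize's select_form does much of this bookkeeping for you, but it helps to see what's going on). The date field names are made up; copy the real ones from the request your browser sends when you run the search manually:

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()  # keeps cookies (e.g. the login) across requests
    search_url = "https://gisis.imo.org/Public/PAR/Search.aspx"

    # 1. GET the search page to pick up the hidden ASP.NET state fields
    soup = BeautifulSoup(session.get(search_url).text, "html.parser")

    def hidden(name):
        tag = soup.find("input", {"name": name})
        return tag.get("value", "") if tag else ""

    payload = {
        "__VIEWSTATE": hidden("__VIEWSTATE"),
        "__VIEWSTATEGENERATOR": hidden("__VIEWSTATEGENERATOR"),
        "__EVENTVALIDATION": hidden("__EVENTVALIDATION"),
        # placeholder search fields: the real names (something like
        # "ctl00$bodyPlaceHolder$...") come from the recorded request
        "ctl00$bodyPlaceHolder$txtDateFrom": "2002-01-01",
        "ctl00$bodyPlaceHolder$txtDateTo": "2005-12-31",
    }

    # 2. POST the search back to the same URL
    results = session.post(search_url, data=payload)
    print(results.status_code, len(results.text))

The Network tab will show you exactly which fields the real postback sends; mirror those and you're most of the way there.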

Best of luck!