Python Scraping Web with Session Cookie

2019-06-03 14:31发布

问题:

Hi iam trying to scrap some data off from this URL:

http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1

As you may have noticed, if cookies and session data is not yet set you will be redirected to its base url (http://www.21cineplex.com/)

I tried to do it like this:

def main():
    try:
        cj = CookieJar()
        baseurl = "http://www.21cineplex.com"
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        opener.open(baseurl)

        urllib2.install_opener(opener)
        movieSource = urllib2.urlopen('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1').read()

        splitSource = re.findall(r'<ul class="w462">(.*?)</ul>', movieSource)

        print splitSource

    except Exception, e:
        str(e)
        print "Error occured in main Block"

However, i ended up failing to scrap from that particular URL.

A quick inspection reveals that the website is setting a session ID (PHPSESSID) and make a copy to the client's cookie as such.

The question is how do i mitigate such example?

ps: i've tried to install request (via pip) how ever it gives me (404):

  Getting page https://pypi.python.org/simple/request/
  Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
  Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
  Getting page https://pypi.python.org/simple/
  URLs to search for versions for request:
  * https://pypi.python.org/simple/request/
  Getting page https://pypi.python.org/simple/request/
  Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
  Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
  Could not find any downloads that satisfy the requirement request

Cleaning up...

回答1:

Thanks to @Chainik i got it to work now. I ended up modify my code like this:

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
baseurl = "http://www.21cineplex.com/"
regex = '<ul class="w462">(.*?)</ul>'

opener.open(baseurl)
urllib2.install_opener(opener)

request = urllib2.Request('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1')
request.add_header('Referer', baseurl)

requestData = urllib2.urlopen(request)
htmlText = requestData.read()

Once, the html text is retrieved. It's all about parsing its content.

Cheers



回答2:

Try setting a referer URL, see below.

Without referer URL set (302 redirect):

$ curl -I "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"
HTTP/1.1 302 Moved Temporarily                       
Server: nginx
Date: Thu, 19 Sep 2013 09:19:19 GMT
Content-Type: text/html
Connection: keep-alive
X-Powered-By: PHP/5.4.17
Set-Cookie: PHPSESSID=5effe043db4fd83b2c5927818cb1a7ca; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: kota=3; expires=Fri, 19-Sep-2014 09:19:19 GMT; path=/
Location: http://www.21cineplex.com/

With referer URL set (HTTP/200):

$ curl -I -e "http://www.21cineplex.com/"
"http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 19 Sep 2013 09:19:24 GMT
Content-Type: text/html
Connection: keep-alive
Vary: Accept-Encoding
X-Powered-By: PHP/5.4.17
Set-Cookie: PHPSESSID=a7abd6592c87e0c1a8fab4f855baa0a4; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: kota=3; expires=Fri, 19-Sep-2014 09:19:24 GMT; path=/

To set referer URL using urllib, see this post

-- ab1