Python Scraping Web with Session Cookie

Hi, I am trying to scrape some data from this URL:

http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1

As you may have noticed, if the cookies and session data are not yet set, you are redirected to the base URL (http://www.21cineplex.com/).

I tried to do it like this:

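import re
import urllib2
from cookielib import CookieJar

def main():
    try:
        # Visit the base URL first so the session cookie lands in the jar.
        cj = CookieJar()
        baseurl = "http://www.21cineplex.com"
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        opener.open(baseurl)

        # Make later urllib2.urlopen() calls reuse the cookie-aware opener.
        urllib2.install_opener(opener)
        movieSource = urllib2.urlopen('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1').read()

        splitSource = re.findall(r'<ul class="w462">(.*?)</ul>', movieSource)

        print splitSource

    except Exception, e:
        # Print the actual error instead of silently discarding it.
        print "Error occurred in main block: %s" % e

if __name__ == "__main__":
    main()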

However, I failed to scrape that particular URL.

A quick inspection reveals that the website sets a session ID (PHPSESSID) and stores a copy of it in the client's cookies.

The question is: how do I handle a case like this?

PS: I've tried to install request (via pip), however it gives me a 404:

  Getting page https://pypi.python.org/simple/request/
  Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
  Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
  Getting page https://pypi.python.org/simple/
  URLs to search for versions for request:
  * https://pypi.python.org/simple/request/
  Getting page https://pypi.python.org/simple/request/
  Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
  Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
  Could not find any downloads that satisfy the requirement request

Cleaning up...

Tags: python python-2.7 web-scraping session-cookies

2 Answers

我命由我不由天

Reply #2 · 2019-06-03 14:44

Try setting a Referer URL; see below.

Without a Referer URL set (302 redirect):

$ curl -I "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"
HTTP/1.1 302 Moved Temporarily                       
Server: nginx
Date: Thu, 19 Sep 2013 09:19:19 GMT
Content-Type: text/html
Connection: keep-alive
X-Powered-By: PHP/5.4.17
Set-Cookie: PHPSESSID=5effe043db4fd83b2c5927818cb1a7ca; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: kota=3; expires=Fri, 19-Sep-2014 09:19:19 GMT; path=/
Location: http://www.21cineplex.com/

With referer URL set (HTTP/200):

$ curl -I -e "http://www.21cineplex.com/" "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 19 Sep 2013 09:19:24 GMT
Content-Type: text/html
Connection: keep-alive
Vary: Accept-Encoding
X-Powered-By: PHP/5.4.17
Set-Cookie: PHPSESSID=a7abd6592c87e0c1a8fab4f855baa0a4; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: kota=3; expires=Fri, 19-Sep-2014 09:19:24 GMT; path=/
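
Note that both responses set the same two cookies (PHPSESSID and kota), so the cookies by themselves do not seem to be what separates the 302 from the 200. As a rough sketch, the Set-Cookie values from the first response can be parsed with the standard library to see exactly what the server hands out. This is written for Python 3 (an assumption, since the question uses Python 2.7), and the helper name parse_set_cookie is made up for illustration:

```python
from http.cookies import SimpleCookie

# Set-Cookie values copied from the first curl output above.
SET_COOKIE_HEADERS = [
    "PHPSESSID=5effe043db4fd83b2c5927818cb1a7ca; path=/",
    "kota=3; expires=Fri, 19-Sep-2014 09:19:19 GMT; path=/",
]


def parse_set_cookie(headers):
    """Return {cookie name: value} for a list of Set-Cookie header values.

    Hypothetical helper, not part of any library.
    """
    jar = SimpleCookie()
    for header in headers:
        # load() parses one header value and adds its cookie to the jar.
        jar.load(header)
    return {name: morsel.value for name, morsel in jar.items()}


cookies = parse_set_cookie(SET_COOKIE_HEADERS)
print(cookies)
```

A cookie jar attached to the opener stores these automatically; parsing them by hand is only useful for inspection.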

To set the Referer URL using urllib, see this post
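
For readers on Python 3, where urllib2 became urllib.request and cookielib became http.cookiejar, a minimal sketch of the same idea could look like the following. It only builds the cookie-aware opener and the request with the Referer header; the actual fetch is left as a comment because it needs network access, and whether the site still behaves as in the curl output above is not something this sketch can confirm. The function name build_opener_and_request is made up for illustration:

```python
import urllib.request
from http.cookiejar import CookieJar

BASE_URL = "http://www.21cineplex.com/"
PAGE_URL = "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"


def build_opener_and_request(base_url, page_url):
    """Return a cookie-aware opener and a request carrying a Referer header.

    Hypothetical helper, not part of any library.
    """
    jar = CookieJar()
    # HTTPCookieProcessor stores cookies from responses and replays them.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    request = urllib.request.Request(page_url)
    # Same effect as curl's -e option: send the base URL as the Referer.
    request.add_header("Referer", base_url)
    return opener, request


opener, request = build_opener_and_request(BASE_URL, PAGE_URL)
print(request.get_header("Referer"))

# To actually fetch the page (requires network access):
# html = opener.open(request).read().decode("utf-8", errors="replace")
```

Calling opener.open() directly avoids install_opener(), which would change the behavior of urlopen() for the whole process.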

-- ab1


何必那么认真

Reply #3 · 2019-06-03 14:53

Thanks to @Chainik, I got it to work. I ended up modifying my code like this:

import urllib2
from cookielib import CookieJar

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
baseurl = "http://www.21cineplex.com/"
regex = '<ul class="w462">(.*?)</ul>'

# Hit the base URL first so the session cookies are stored in cj.
opener.open(baseurl)
urllib2.install_opener(opener)

# Send the base URL as the Referer; without it the server answers with a 302.
request = urllib2.Request('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1')
request.add_header('Referer', baseurl)

requestData = urllib2.urlopen(request)
htmlText = requestData.read()

Once the HTML text is retrieved, it's all about parsing its content.
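
As a rough illustration of that parsing step, the regex from the code above can be applied to the retrieved text. The sample markup below is invented (the real page's structure, including the use of li elements, is an assumption), and the sketch is written for Python 3. The re.DOTALL flag is added so that (.*?) can match across line breaks, which the pattern would otherwise refuse to do:

```python
import re

# Hypothetical sample markup; the real page may be structured differently.
SAMPLE_HTML = """
<ul class="w462">
  <li>Movie A</li>
  <li>Movie B</li>
</ul>
<ul class="other"><li>ignored</li></ul>
"""

# Same pattern as in the answer, compiled with DOTALL for multi-line lists.
LIST_RE = re.compile(r'<ul class="w462">(.*?)</ul>', re.DOTALL)
ITEM_RE = re.compile(r"<li>(.*?)</li>", re.DOTALL)


def extract_items(html):
    """Return the text of every <li> found inside a <ul class="w462">."""
    items = []
    for block in LIST_RE.findall(html):
        items.extend(text.strip() for text in ITEM_RE.findall(block))
    return items


titles = extract_items(SAMPLE_HTML)
print(titles)
```

Regexes are fine for a quick extraction like this, but they break easily when the markup changes; a real HTML parser is usually the sturdier choice.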

Cheers
