How to extract certain parts of a web page in Pyth

Target web page: http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm

The section I want to extract:

  <tr>
  <td>Skilled &ndash; Independent (Residence) subclass 885<br />online</td>
  <td>N/A</td>
  <td>N/A</td>
  <td>N/A</td>
  <td>15 May 2011</td>
  <td>N/A</td>
  </tr>

Once the code finds this section by searching the keyword "subclass 885
online", it should then print the date which is within the 5th tag which is "15 May 2011" as shown above.

It's just a monitor for myself to keep an eye on the progress of my immigration application.

标签： python html string

3条回答

够拽才男人

2楼-- · 2020-06-17 06:38

There is a library called Beautiful Soup which does the job you asked for. http://www.crummy.com/software/BeautifulSoup/

0人赞添加讨论(0) 举报

做自己的国王

3楼-- · 2020-06-17 06:45

You might want to use this as a starting point:

Python 2.6.7 (r267:88850, Jun 13 2011, 22:03:32) 
[GCC 4.6.1 20110608 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2, re
>>> from BeautifulSoup import BeautifulSoup
>>> urllib2.urlopen('http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm')
<addinfourl at 139158380 whose fp = <socket._fileobject object at 0x84aa2ac>>
>>> html = _.read()
>>> soup = BeautifulSoup(html)
>>> soup.find(text = re.compile('\\bsubclass 885\\b')).parent.parent.find('td', text = re.compile(' [0-9]{4}$'))
u'15 May 2011'

0人赞添加讨论(0) 举报

爷、活的狠高调

4楼-- · 2020-06-17 06:57

"Beau--ootiful Soo--oop!

Beau--ootiful Soo--oop!

Soo--oop of the e--e--evening,

Beautiful, beauti--FUL SOUP!"

--Lewis Carroll, Alice's Adventures in Wonderland

I think this is exactly what he had in mind!

The Mock Turtle would probably do something like this:

>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> url = 'http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm'
>>> page = urllib2.urlopen(url)
>>> soup = BeautifulSoup(page)
>>> for row in soup.html.body.findAll('tr'):
...     data = row.findAll('td')
...     if data and 'subclass 885online' in data[0].text:
...         print data[4].text
... 
15 May 2011

But I'm not sure it would help, since that date has already passed!

Good luck with the application!

0人赞添加讨论(0) 举报

How to extract certain parts of a web page in Pyth

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间