I'm trying to use mechanize to scrape fares for New York's Metro-North Railroad from this site:
http://as0.mta.info/mnr/fares/choosestation.cfm
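The problem is that when you select the first option, the site uses JavaScript to populate the list of possible destinations. I have written equivalent code in Python, but I can't seem to get it all working. Here's what I have so far: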
import mechanize
import cookielib
from bs4 import BeautifulSoup
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open("http://as0.mta.info/mnr/fares/choosestation.cfm")
br.select_form(name="form1")
br.form.set_all_readonly(False)
origin_control = br.form.find_control("orig_stat", type="select")
origin_control_list = origin_control.items
origin_control.value = [origin_control.items[0].name]
destination_control_list = reFillList(0, origin_control_list)
destination_control = br.form.find_control("dest_stat", type="select")
destination_control.items = destination_control_list
destination_control.value = [destination_control.items[0].name]
response = br.submit()
response_text = response.read()
print(response_text)
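I know I haven't given you the code for the reFillList() method, because it's long, but assume it correctly creates a list of mechanize.option objects. Python doesn't complain about anything, but on submit I get the HTML for this alert:
"Fare information for travel between two lines is not available online. Please contact our Customer Information Center at 511 and ask to speak to a representative for further information."
Am I missing something here? Thanks for all the help!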
If you know the station IDs, it's easy to make the POST request yourself:
import mechanize
import urllib
post_url = 'http://as0.mta.info/mnr/fares/get_fares.cfm'
orig = 295 #BEACON FALLS
dest = 292 #ANSONIA
params = urllib.urlencode({'dest_stat':dest, 'orig_stat':orig })
rq = mechanize.Request(post_url, params)
fares_page = mechanize.urlopen(rq)
print(fares_page.read())
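For anyone on Python 3 without mechanize, the same POST can be sketched with the standard library alone. This is only a sketch: `build_fare_request` is a hypothetical helper name, and it assumes the endpoint still accepts the `orig_stat` and `dest_stat` form fields shown above. It builds the request without sending it, so you can inspect it first.

```python
from urllib.parse import urlencode
from urllib.request import Request

POST_URL = 'http://as0.mta.info/mnr/fares/get_fares.cfm'


def build_fare_request(orig, dest, post_url=POST_URL):
    """Build (but do not send) the POST request for one origin/destination pair.

    Hypothetical helper: the field names are taken from the answer above.
    """
    # Form-encode the two station IDs; Request needs the body as bytes.
    body = urlencode({'orig_stat': orig, 'dest_stat': dest}).encode('ascii')
    # Supplying data makes urllib issue a POST instead of a GET.
    return Request(post_url, data=body)


rq = build_fare_request(295, 292)  # BEACON FALLS to ANSONIA
print(rq.get_method(), rq.full_url)
print(rq.data)
```

To actually fetch the page, pass the request to `urllib.request.urlopen(rq)` and read the response.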
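If you have code that finds the list of destination IDs for a given origin ID (i.e. a variant of refillList()), then you can run this request for each combination: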
import mechanize
import urllib, urllib2
from bs4 import BeautifulSoup

url = 'http://as0.mta.info/mnr/fares/choosestation.cfm'
post_url = 'http://as0.mta.info/mnr/fares/get_fares.cfm'

def get_fares(orig, dest):
    # POST one origin/destination pair and print the resulting fares page
    params = urllib.urlencode({'dest_stat': dest, 'orig_stat': orig})
    rq = mechanize.Request(post_url, params)
    fares_page = mechanize.urlopen(rq)
    print(fares_page.read())

# name the parser explicitly so bs4 does not have to guess one
pool = BeautifulSoup(urllib2.urlopen(url).read(), 'html.parser')

# let's keep our stations organised
stations = {}
# dict by station id
for option in pool.find('select', {'name': 'orig_stat'}).findChildren():
    stations[option['value']] = {'name': option.string}

# iterate over all routes
for origin in stations:
    # get_list_of_dests is not defined here: use your code for this
    destinations = get_list_of_dests(origin)
    stations[origin]['dests'] = destinations
    for destination in destinations:
        print('Processing from %s to %s' % (origin, destination))
        get_fares(origin, destination)
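If you'd rather not depend on BeautifulSoup for the station lookup, the step that builds the stations dict can be sketched with the standard-library HTML parser. The `SelectOptionParser` class and the `SAMPLE_HTML` markup below are made up for illustration. The real page may be structured differently, so treat this as a starting point rather than a drop-in replacement.

```python
from html.parser import HTMLParser


class SelectOptionParser(HTMLParser):
    """Collect {value: label} for the <option>s inside one named <select>."""

    def __init__(self, select_name):
        super().__init__()
        self.select_name = select_name
        self.options = {}
        self._in_select = False      # True while inside the target <select>
        self._current_value = None   # value of the <option> being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'select' and attrs.get('name') == self.select_name:
            self._in_select = True
        elif tag == 'option' and self._in_select:
            self._current_value = attrs.get('value')
            if self._current_value is not None:
                self.options[self._current_value] = ''

    def handle_endtag(self, tag):
        if tag == 'select':
            self._in_select = False
            self._current_value = None
        elif tag == 'option':
            self._current_value = None

    def handle_data(self, data):
        # Only text inside an <option> of the target <select> is kept.
        if self._in_select and self._current_value is not None:
            self.options[self._current_value] += data.strip()


# Hypothetical markup; only the field names and IDs come from the thread.
SAMPLE_HTML = """
<form name="form1">
  <select name="orig_stat">
    <option value="292">ANSONIA</option>
    <option value="295">BEACON FALLS</option>
  </select>
  <select name="dest_stat"></select>
</form>
"""

parser = SelectOptionParser('orig_stat')
parser.feed(SAMPLE_HTML)
stations = {sid: {'name': name} for sid, name in parser.options.items()}
print(stations)
```

The resulting dict has the same shape as the `stations` dict above (keyed by station ID, each entry holding a `name`), so the route loop can stay unchanged.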