I'm trying to learn how to use RCurl (or some other suitable R package if I'm wrong about RCurl being the right tool) to automate the process of submitting search terms to a web form and placing the search results in a data file. The specific problem I'm working on is as follows:
I have a data file giving the license plate number (LPN) and vehicle identification number (VIN) for several automobiles. The California Department of Motor Vehicles (DMV) has a web page search form where you enter the LPN and the last five digits of the VIN, and it returns the vehicle license fee (VLF) payment for either 2010 or 2009 (there's a selector for that on the input form as well). (FYI: this is for a research project to look at the distribution of VLF payments by vehicle make, model, and model year.)
I could go through the tedious process of manually entering the data for each vehicle and then manually typing the result into a spreadsheet. But this is the 21st century and I'd like to automate the process. I want to write a script that will submit each LPN and VIN to the DMV web form and then put the result (the VLF payment) in a new VLF variable in my data file, repeating until it gets to the end of the list of LPNs and VINs. (The DMV web form is here, by the way:
https://www.dmv.ca.gov/FeeCalculatorWeb/vlfForm.do).
My plan was to use getHTMLFormDescription() (in the RHTMLForms package) to find out the names of the input fields and then use getForm() or postForm() (in the RCurl package) to retrieve the output. Unfortunately, I got stuck at the very first step. Here's the R command I used and the output:
> forms = getHTMLFormDescription("https://www.dmv.ca.gov/FeeCalculatorWeb/vlfForm.do")
Error in htmlParse(url, ...) :
File https://www.dmv.ca.gov/FeeCalculatorWeb/vlfForm.do does not exist
Being relatively new to R and almost completely new to HTTP and web scraping, I'm not sure what to do next.
First, does anyone know why I'm getting an error on my getHTMLFormDescription() call? Alternatively, is there another way to figure out the names of the input fields?
Second, can you suggest some sample code to help me get started on actually submitting LPNs and VINs and retrieving the output? Is getForm() or postForm() the right approach or should I be doing something else? If it would help to have some real LPN-VIN combinations to submit, here are three:
LPN VIN
5MXH018 30135
4TOL562 74735
5CWR968 11802
Finally, since you can see I'm a complete novice at this, do you have suggestions on what I need to learn in order to become adept at web scraping of this sort and how to go about learning it (in R or in another language)? Specific suggestions for web sites, books, listservs, other StackOverflow questions, etc. would be great.
Thanks for your help.
Adding to the suggestions by daroczig and Rguy, here is a short piece of code that automates the entire process of extracting the data into a data frame.
# load the required packages
library(XML)
library(plyr)

# construct sample data frame with lpn, vin and year
lpn = rep(c('5MXH018', '4TOL562', '5CWR968'), 2)
vin = rep(c('30135', '74735', '11802'), 2)
year = c(rep(2009, 3), rep(2010, 3))
mydf = data.frame(lpn, vin, year)

# construct function to extract data for one record
get_data = function(df){
  # root url of the fee calculator results page
  root = 'http://www.dmv.ca.gov/wasapp/FeeCalculatorWeb/vlfFees.do?method=calculateVlf&submit=Determine%20VLF'
  # construct url by adding lpn, year and vin
  u = paste(root, '&vehicleLicense=', df$lpn, '&vehicleTaxYear=',
            df$year, '&vehicleVin=', df$vin, sep = "")
  # encode url correctly
  url = URLencode(u)
  # the fifth table on the results page holds the VLF data
  readHTMLTable(url)[[5]]
}

# apply function to every row of mydf and return data frame of results
mydata = adply(mydf, 1, get_data)

# strip the trailing colon and non-breaking spaces from the column names
names(mydata) = gsub(':\302\240\302\240', '', names(mydata))
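To finish up, you could inspect the combined data frame and write it out to a file; a minimal sketch (the file name is just a placeholder):
# check the structure of the results and save them alongside the inputs
str(mydata)
write.csv(mydata, file = "vlf_payments.csv", row.names = FALSE)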
Just use http instead of https and that should solve your problem. Here is the output you get if you try this:
forms = getHTMLFormDescription("http://www.dmv.ca.gov/wasapp/FeeCalculatorWeb/vlfForm.do",
dropButtons = TRUE)
[[1]]
HTML Form: http://search.ca.gov/search
q :[ Search DMV Site ]
$feeRequestForm
HTML Form: http://www.dmv.ca.gov/wasapp/FeeCalculatorWeb/vlfFees.do
vehicleLicense :[ ]
vehicleTaxYear :
vehicleVin :[ ]
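With the field names known, you could also query the form directly with RCurl's getForm(), as the original question suggested. Here is a minimal sketch, assuming the server accepts the same parameters via GET (the method value is taken from the GET URL derived elsewhere in this thread, and the LPN/VIN are the first sample pair from the question):
library(RCurl)
library(XML)
# submit the form fields found above as a GET request;
# .params keeps the form parameters separate from getForm's own arguments
page = getForm("http://www.dmv.ca.gov/wasapp/FeeCalculatorWeb/vlfFees.do",
               .params = list(method = "calculateVlf",
                              vehicleLicense = "5MXH018",
                              vehicleTaxYear = "2010",
                              vehicleVin = "30135"))
# parse the returned HTML and extract all tables on the page
doc = htmlParse(page, asText = TRUE)
tabs = readHTMLTable(doc)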
Here is an example of how to fill a form and get a data table from a Yahoo Sports page.
# load packages for form descriptions and HTML parsing
library(RHTMLForms)
library(XML)

# get form description
url = 'http://ca.sports.yahoo.com/nhl/stats/byteam?cat=teamstats&sort=404'
forms = getHTMLFormDescription(url)
# create a function from the form description, to query the url
efun = createFunction(forms[[3]])
# extract webpage by passing required arguments to the function
page = efun(year = 'season_2009', conference = 'Eastern')
# parse webpage and return html tree
doc = htmlTreeParse(page, asText = TRUE, useInternalNodes = TRUE)
# extract tables from the html tree
tab = readHTMLTable(doc)
I applied this to the webpage you specified, but for some reason the form element vehicleTaxYear is returned incorrectly, which causes errors. Someone with more in-depth knowledge of HTML forms would be able to guide you on how to debug this error.
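As a first debugging step, plain R introspection of the parsed description can show how that element was read (this uses the forms object from the DMV page above; nothing RHTMLForms-specific):
# inspect how the fee form was parsed; the vehicleTaxYear entry
# should show where the description goes wrong
str(forms$feeRequestForm)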
Hope this is useful.
EDIT: I fixed an error. It should be createFunction(forms[[3]]), since we are only interested in the third form.
The details of my comment above:
- Fire up Firefox with Firebug :)
- Fill in the form with one of the desired LPN-VIN pairs and click submit ("Determine VLF").
- Click the "Net" tab in Firebug and check the request sent to the server.
- You will see that a POST request was sent, but the data can also be reached via a GET request, so just right-click on "POST vlfFees.do", choose "Copy URL with all parameters", and you have the required URL.
- You can simply modify the LPN and VIN in that URL (the values after vehicleLicense= and vehicleVin=),
- and then call the modified URL with readHTMLTable from the XML package, which will give you a nice data frame of the required dataset.
So:
library(XML)
datas <- readHTMLTable("http://www.dmv.ca.gov/wasapp/FeeCalculatorWeb/vlfFees.do?method=calculateVlf&submit=Determine%20VLF&vehicleLicense=5CWR968&vehicleTaxYear=2010&vehicleVin=11802")
And get the needed values of the table as @Rguy suggested above:
processed <- datas[[5]][[1]]
paid <- datas[[5]][[2]]
refund <- datas[[5]][[3]]
With this basic example you could easily write a loop to fetch all the required data, but do not be greedy: do not loop without a sleep call (see Sys.sleep). I would fetch only one plate per minute or so, which should not bother the server much; see the sketch below.
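For instance, a minimal sketch of such a polite loop, reusing the mydf sample data frame constructed in the answer above (the one-minute pause is just the suggested rate):
library(XML)
base = paste('http://www.dmv.ca.gov/wasapp/FeeCalculatorWeb/vlfFees.do',
             '?method=calculateVlf&submit=Determine%20VLF', sep = '')
results = vector('list', nrow(mydf))
for (i in seq_len(nrow(mydf))) {
  # build the GET url for this row and fetch the fifth table
  u = paste(base, '&vehicleLicense=', mydf$lpn[i],
            '&vehicleTaxYear=', mydf$year[i],
            '&vehicleVin=', mydf$vin[i], sep = '')
  results[[i]] = readHTMLTable(URLencode(u))[[5]]
  Sys.sleep(60)  # wait a minute between requests to go easy on the server
}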
I meant to post this as a comment after the original post, but do not have enough reputation.
I used the URL that @daroczig provided to obtain the actual data that eipi10 wants by doing the following:
datas <- readHTMLTable("http://www.dmv.ca.gov/wasapp/FeeCalculatorWeb/vlfFees.do?method=calculateVlf&submit=Determine%20VLF&vehicleLicense=5CWR968&vehicleTaxYear=2010&vehicleVin=11802")
processed <- datas[[5]][[1]]
paid <- datas[[5]][[2]]
refund <- datas[[5]][[3]]
In short, the readHTMLTable command is useful for parsing HTML tables straight into R data frames.
I second eipi10's request for how daroczig obtained the URL.