I was wondering if it is possible to "automate" the task of typing entries into search forms and extracting matches from the results. For instance, I have a list of journal articles for which I would like to get DOIs (digital object identifiers). Manually, I would go to the journal's article search page (e.g., http://pubs.acs.org/search/advanced), type in the authors/title/volume (etc.), find the article in the list of returned results, and copy its DOI into my reference list.

I use R and Python for data analysis regularly (I was inspired by a post on RCurl), but I don't know much about web protocols. Is such a thing possible, for instance using something like Python's BeautifulSoup? Are there any good references for doing anything remotely similar to this task? I'm just as interested in learning about web scraping and web-scraping tools in general as in getting this particular task done.

Thanks for your time!
There are many tools for web scraping. A good Firefox plugin is iMacros. It works great and needs no programming knowledge at all; the free version can be downloaded from https://addons.mozilla.org/en-US/firefox/addon/imacros-for-firefox/. The best thing about iMacros is that it can get you started in minutes, and it can also be launched from the bash command line and called from within bash scripts.
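For example, a recorded macro can be triggered from a shell through the plugin's imacros:// URL scheme (a sketch; the macro name mymacro.iim is a placeholder, and this assumes iMacros for Firefox is installed):

    # run a previously recorded macro from the command line
    firefox "imacros://run/?m=mymacro.iim"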
A more advanced step would be Selenium WebDriver. The reason I chose Selenium is that it is documented in a way that suits beginners; reading just the getting-started documentation would get you up and running in no time. Selenium supports Java, Python, PHP, and C#, so if you are familiar with any of these languages, you will be familiar with all the commands needed. I prefer the WebDriver variant of Selenium, as it opens a browser, so that you can check the fields and outputs. After setting up the script using WebDriver, you can easily migrate it to run headless.
To install Selenium, type the following command:
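    # assuming Python and pip are already installed
    pip install selenium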
This will take care of the dependencies and everything needed for you.
To run your script interactively, just open a terminal and type:
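    python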
You will see the Python prompt (>>>), at which you can type in the commands.
Here is a sample script which you can paste into the interpreter; it will search Google for the word "cheese".
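A minimal sketch (assuming Firefox and a matching driver are installed; older Selenium releases use the find_element_by_name style instead of By):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    driver = webdriver.Firefox()                # opens a Firefox window
    driver.get("http://www.google.com")
    box = driver.find_element(By.NAME, "q")     # Google's search box is named "q"
    box.send_keys("cheese" + Keys.RETURN)       # type the query and press Enter
    print(driver.title)                         # the title now reflects the search
    driver.quit()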
I hope that this can give you a head start.
Cheers :)
Beautiful Soup is great for parsing webpages: that's half of what you want to do. Python, Perl, and Ruby all have a version of Mechanize, and that's the other half:
http://wwwsearch.sourceforge.net/mechanize/
Mechanize lets you control a browser:
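A minimal sketch of the Python version, filling out a search form and handing the result to Beautiful Soup (the URL comes from the question; the field name "title" is an assumption, so check the actual form's field names in the page source):

    import mechanize
    from bs4 import BeautifulSoup

    br = mechanize.Browser()
    br.set_handle_robots(False)            # some sites disallow robots
    br.open("http://pubs.acs.org/search/advanced")
    br.select_form(nr=0)                   # pick the first form on the page
    br["title"] = "Some article title"     # field name is an assumption
    response = br.submit()

    # parse the results page with Beautiful Soup
    soup = BeautifulSoup(response.read(), "html.parser")
    for link in soup.find_all("a"):
        print(link.get("href"))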
With Mechanize and Beautiful Soup you have a great start. One extra tool I'd consider is Firebug, as used in this quick Ruby scraping guide:
http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/
Firebug can speed up your construction of XPaths for parsing documents, saving you some serious time.
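Once Firebug has shown you an XPath, you can apply it with a library such as lxml (a sketch; the URL is a placeholder and the XPath expression is a made-up illustration, not one taken from a real results page):

    import urllib.request
    from lxml import html

    page = urllib.request.urlopen("http://example.com/results").read()  # placeholder URL
    doc = html.fromstring(page)
    # hypothetical XPath: collect hrefs of links pointing at doi.org
    dois = doc.xpath('//a[contains(@href, "doi.org")]/@href')
    print(dois)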
Good luck!