Submit data via web form and extract the results

My Python level is novice, and I have never written a web scraper or crawler. I have written Python code that connects to an API and extracts the data I want. For some of the extracted data, I also want to get the gender of the author. I found this website, http://bookblog.net/gender/genie.php, but the downside is that there is no API available. How can I write a Python script that submits data to the form on that page and extracts the returned result? Any guidance would be a great help.

This is the form's DOM:

<form action="analysis.php" method="POST">
<textarea cols="75" rows="13" name="text"></textarea>
<div class="copyright">(NOTE: The genie works best on texts of more than 500 words.)</div>
<p>
<b>Genre:</b>
<input type="radio" value="fiction" name="genre">
fiction&nbsp;&nbsp;
<input type="radio" value="nonfiction" name="genre">
nonfiction&nbsp;&nbsp;
<input type="radio" value="blog" name="genre">
blog entry
</p>
<p>
</form>

This is the results page's DOM:

<p>
<b>The Gender Genie thinks the author of this passage is:</b>
male!
</p>

Tags: python web-crawler web-scraping

3 Answers

戒情不戒烟

#2 · 2019-02-01 11:42

You can use mechanize to submit the form and retrieve the response, and the re module to extract the part you want. For example, the script below does this with the text of your own question:

import re
from mechanize import Browser

text = """
My python level is Novice. I have never written a web scraper 
or crawler. I have written a python code to connect to an api and 
extract the data that I want. But for some the extracted data I want to 
get the gender of the author. I found this web site 
http://bookblog.net/gender/genie.php but downside is there isn't an api 
available. I was wondering how to write a python to submit data to the 
form in the page and extract the return data. It would be a great help 
if I could get some guidance on this."""

browser = Browser()
browser.open("http://bookblog.net/gender/genie.php")

browser.select_form(nr=0)
browser['text'] = text
browser['genre'] = ['nonfiction']

response = browser.submit()

content = response.read()

result = re.findall(
    r'<b>The Gender Genie thinks the author of this passage is:</b> (\w*)!', content)

print(result[0])

What does it do? It creates a mechanize.Browser and goes to the given URL:

browser = Browser()
browser.open("http://bookblog.net/gender/genie.php")

Then it selects the form (since there is only one form to be filled, it will be the first):

browser.select_form(nr=0)

Next, it sets the form fields...

browser['text'] = text
browser['genre'] = ['nonfiction']

... and submits it:

response = browser.submit()

Now we read the body of the response:

content = response.read()

We know that the result appears in this form:

<b>The Gender Genie thinks the author of this passage is:</b> male!

So we create a regex for matching and use re.findall():

result = re.findall(
    r'<b>The Gender Genie thinks the author of this passage is:</b> (\w*)!',
    content)

Now the result is available for your use:

print(result[0])
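
The regex above expects exactly one space after the closing </b> tag, and result[0] raises an IndexError when nothing matches. Here is a minimal sketch of a more forgiving version, run against the sample markup from the question rather than the live site. The helper name extract_gender and the UTF-8 decoding are assumptions for illustration, not part of mechanize or of the original answer.

```python
import re

# Sample response body, copied from the results-page DOM in the question.
SAMPLE_HTML = """<p>
<b>The Gender Genie thinks the author of this passage is:</b>
male!
</p>"""

# \s* tolerates a space, a newline, or nothing after the closing </b> tag.
PATTERN = re.compile(
    r"<b>The Gender Genie thinks the author of this passage is:</b>\s*(\w+)!"
)


def extract_gender(html):
    """Return the predicted gender, or None if the caption is not found."""
    if isinstance(html, bytes):
        # response.read() returns bytes on Python 3; UTF-8 is an assumption.
        html = html.decode("utf-8", errors="replace")
    match = PATTERN.search(html)
    return match.group(1) if match else None


print(extract_gender(SAMPLE_HTML))
```

In the script above, extract_gender(content) could stand in for the re.findall() call, with a check for None before using the value.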


不美不萌又怎样

#3 · 2019-02-01 11:53

You can use mechanize; see its examples for details.

from mechanize import ParseResponse, urlopen, urljoin

uri = "http://bookblog.net"

response = urlopen(urljoin(uri, "/gender/genie.php"))
forms = ParseResponse(response, backwards_compat=False)
form = forms[0]

#print form

form['text'] = 'cheese'
form['genre'] = ['fiction']

print(urlopen(form.click()).read())
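
For context, form.click() above produces a request object for the form's action URL, and urlopen() then sends it. A rough sketch of the equivalent request built by hand with the Python 3 standard library follows. It assumes the form posts URL-encoded fields to analysis.php relative to the page, as the DOM in the question suggests. The request is only constructed here, not sent.

```python
from urllib.parse import parse_qs, urlencode, urljoin
from urllib.request import Request

page_url = "http://bookblog.net/gender/genie.php"

# The form's action is the relative URL "analysis.php", so resolve it
# against the page that contains the form.
action_url = urljoin(page_url, "analysis.php")

# Field names and values come from the form DOM in the question.
fields = {"text": "cheese", "genre": "fiction"}
body = urlencode(fields).encode("ascii")

# Passing data makes urllib treat this as a POST request.
request = Request(action_url, data=body)

print(request.get_method())  # POST
print(request.full_url)
print(parse_qs(body.decode("ascii")))

# Sending it would be: urllib.request.urlopen(request).read()
```

Building the request by hand makes it clear which fields the server receives, which helps when a form library picks the wrong form or control.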


趁早两清

#4 · 2019-02-01 12:04

There is no need to use mechanize; just send the correct form data in a POST request.

Also, using regular expressions to parse HTML is a bad idea. You would be better off using an HTML parser such as lxml.html.

import requests
import lxml.html as lh


def gender_genie(text, genre):
    url = 'http://bookblog.net/gender/analysis.php'
    caption = 'The Gender Genie thinks the author of this passage is:'

    form_data = {
        'text': text,
        'genre': genre,
        'submit': 'submit',
    }

    response = requests.post(url, data=form_data)

    tree = lh.document_fromstring(response.content)

    return tree.xpath("//b[text()=$caption]", caption=caption)[0].tail.strip()


if __name__ == '__main__':
    print(gender_genie('I have a beard!', 'blog'))
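
If lxml is not installed, the same parsing idea can be sketched with the standard library's html.parser instead. This swaps in a different parser from the one used above, and the names GenieResultParser and parse_result are made up for illustration. It runs on the sample markup from the question, so no network access is needed. Unlike the lxml version, it also drops the trailing exclamation mark.

```python
from html.parser import HTMLParser

CAPTION = "The Gender Genie thinks the author of this passage is:"

# Sample response body, copied from the results-page DOM in the question.
SAMPLE_HTML = """<p>
<b>The Gender Genie thinks the author of this passage is:</b>
male!
</p>"""


class GenieResultParser(HTMLParser):
    """Collects the text that follows the <b> element holding CAPTION."""

    def __init__(self):
        super().__init__()
        self._in_bold = False
        self._caption_seen = False
        self.result = None

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._in_bold = True

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_bold = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_bold and text == CAPTION:
            self._caption_seen = True
        elif self._caption_seen and not self._in_bold and text and self.result is None:
            # First non-empty text after the caption, e.g. "male!"
            self.result = text.rstrip("!")


def parse_result(html):
    """Return the text after the caption, or None if it is missing."""
    parser = GenieResultParser()
    parser.feed(html)
    parser.close()
    return parser.result


print(parse_result(SAMPLE_HTML))
```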
