Python get request returning different HTML than v

Posted 2019-07-13 13:58

Question:

I'm trying to extract the fanfiction from an Archive of Our Own URL in order to use the NLTK library to do some linguistic analysis on it. However, every attempt at scraping the HTML from the URL returns everything BUT the fanfic (and the comments form, which I don't need).

First I tried with the built-in urllib library (and BeautifulSoup):

from urllib import request  # urlopen lives in urllib.request
from bs4 import BeautifulSoup

html = request.urlopen("http://archiveofourown.org/works/6846694").read()
soup = BeautifulSoup(html, "html.parser")
soup.prettify()

Then I found out about the Requests library, and how the User Agent could be part of the problem, so I tried this with the same results:

import requests
headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
        'Content-Type': 'text/html',
}
requests.get("http://archiveofourown.org/works/6846694",headers=headers,timeout=5).text

Then I found out about Selenium and PhantomJS, so I installed those and tried this but again - same result:

from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.PhantomJS()
browser.get("http://archiveofourown.org/works/6846694")
soup = BeautifulSoup(browser.page_source, "html.parser")
soup.prettify()

Am I doing something wrong in any of these attempts, or is this an issue with the server?

Answer 1:

The last approach is a step in the right direction if you need the complete page source with all the JavaScript executed and async requests made. You are just missing one thing - you need to give PhantomJS time to load the page before reading the source (pun intentional).

You also need to click "Proceed" to confirm that you agree to see the adult content:

from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.PhantomJS()
driver.get("http://archiveofourown.org/works/6846694")

wait = WebDriverWait(driver, 10)

# click proceed
proceed = wait.until(EC.presence_of_element_located((By.LINK_TEXT, "Proceed")))
proceed.click()

# wait for the content to be present
wait.until(EC.presence_of_element_located((By.ID, "workskin")))

soup = BeautifulSoup(driver.page_source, "html.parser")
soup.prettify()


Answer 2:

Alexce has explained why your code did not give you what you want. If all you want is the text, it is available in the page source once you add the param view_adult=true to the URL:

import requests
from bs4 import BeautifulSoup

url = "http://archiveofourown.org/works/6846694?view_adult=true"

r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")  # the "lxml" parser needs the lxml package installed
chap = soup.select_one("#chapter-1")  # the first chapter
preface = soup.select_one("div.preface.group")  # title, byline, summary and notes

print(preface)
print(chap)

That will give you:

<div class="preface group">
<h2 class="title heading">
      The Complete Works of Emmanuel Allen
    </h2>
<h3 class="byline heading">
<a href="http://archiveofourown.org/users/violue/pseuds/violue" rel="author">violue</a>
</h3>
<div class="summary module" role="complementary">
<h3 class="heading">Summary:</h3>
<blockquote class="userstuff">
<p>Dean Winchester, reluctant business owner, reluctant home owner, and reluctant cat owner, is striking up a very promising friendship with the author of his favorite book series.</p><p>And he has no idea.</p>
</blockquote>
</div>
<div class="notes module" role="complementary">
<h3 class="heading">Notes:</h3>
<blockquote class="userstuff">
<p>Oh yeah, I've got notes.</p><p>
<s>1.) This is complete, though later chapters are still being beta'd. I'll be posting a chapter at a time, whenever the hell I feel like it. Probably every day/every other day because it's hard to just SIT ON ALL THESE CHAPTERS I HAVE WHEN THEY'RE READY TO POST!!!</s>
</p><p>2.) This is of the mostly aimless domestic fluff variety, in that there's no big overarching storyline. But that's pretty common with my stories.  ¯\_(ツ)_/¯ </p><p>3.) There's a bit of <i>me</i> in this story. I am a depressed and surly cat owner living in the Pacific Northwest, and so is Dean, but most of this is just my imagination.</p><p>4.) Thanks to <a href="http://archiveofourown.org/users/Tennyo/works">TENNYO</a>, <a href="http://chiwalker.tumblr.com/">CHIWALKER</a>, <a href="http://buckysbuckhole.tumblr.com/">CASFUCKER</a>, and <a href="http://kelisab.tumblr.com">KELISAB</a> for beta'ing! If you find mistakes in the story, it's all their fault, and you should throw soggy tomatoes at them.</p><p>5.) No, I think that's it. Start reading.</p>
</blockquote>
</div>
</div>
<div class="chapter" id="chapter-1">
<!-- chapter management -->
<div class="chapter preface group" role="complementary">
<h3 class="title">
<a href="/works/6846694/chapters/15628576">Chapter 1</a>: Prologue
    </h3>
<!-- only display byline if different from the main byline -->
</div>
<!--main content-->
<div class="userstuff module" role="article">
<h3 class="landmark heading" id="work">Chapter Text</h3>
<p>“Wow, that’s beautiful!”</p><p>Dean doesn’t even have to look up from his book to know what this customer is talking about. Winchester General Store has a lot of things; food, beer, toiletries, camping gear, used books and more, but the only thing that could be considered “beautiful” in this store is the hand-carved, ornate wooden house sitting in a display case mounted on the wall behind Dean. Actually, “house” isn’t the right word. It started as a house in Dean’s mind, but by the time he was done carving, sanding, polishing, and in some places hot gluing the white oak structure, it had become a mausoleum. A beautiful, <em>inviting </em>mausoleum, but a mausoleum nonetheless. Dean had even borrowed some acrylic paints from Charlie to color the climbing ivy painstakingly carved onto the sides.</p><p>“Thanks, man,” Dean says, setting his book down. Might as well let the guy know this was <em>his </em>hard work.</p><p>The man’s eyes widen. “You <em>made </em>this?”</p><p>“Sure did. Worked on it for two months.” Dean nods toward the twelve pack of Mountain Dew the customer is holding. “You all set?”</p><p>The man puts the case on the counter by the register, and Dean rings it up. “How much?”</p><p>“Eight ninety-nine for the Dew.”</p><p>The man shakes his head. “No, I mean the sculpture. My wife and I just bought a place up in Cougar Falls, and that would look <em>great </em>in the front room.”</p><p>Dean blinks, surprised. He’s gotten a lot of compliments on the mausoleum in the past ten or so months, but no one’s ever assumed it was for sale before.</p><p>“Sorry, man, not for sale.”</p><p>“Come on. Name your price.” Dean gets all sorts of customers here. Locals, people out in the area for camping, people up here to go rafting down Filbert River, and of course, people just passing through on their way to some place bigger and better. This guy falls into the last category.</p><p>“No can do, that thing’s got something important inside. 
Can’t part with it.”</p><p>“Important? Like what?”</p><p>Dean shrugs. “My parents.”</p><p>“W… what?” the man stammers.</p><p>“Yeah. There’s an urn inside. Kinda had to glue the top of the building on to get the urn in there, but you can’t really tell unless you’re real close and looking at just the right angle.”</p><p>“<em>Both </em>of your parents?”</p><p>“Well, my mom died ages ago, and my dad kept her ashes the rest of his life.” Dean turns to look at his carving fondly. “And when my dad died, we had him cremated too. One night I got real drunk, I was still kind of in mourning, and I decided my parents should be together. So I dumped my dad’s ashes into my mom’s urn, and then I gave the urn a good shake,” Dean says, shaking an imaginary urn. “My brother was <em>pissed </em>when I told him, but he’s over it now. Anyway, I made this here structure to keep them in. Sort of an apology gift.”</p><p>The bell over the front door jingles, and Dean turns back to see the customer has taken off. “Don’t you want your Mountain Dew?” he yells, even though the guy’s already outside.</p><p>Jeez. What a wimp. Dean reaches into the display case, patting the top of the mausoleum gently. “What a baby. Am I right, guys?”</p><p>The urn full of Winchester ashes stays silent of course. Dean snickers, picks his book up off the counter, and gets back to reading.</p><p><br/>
<br/>
</p><p> </p>
</div>
<!--/main-->
</div>

Which should hopefully be all you need.
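
Since the end goal is NLTK, you will probably want plain text rather than markup. With BeautifulSoup the usual shortcut is chap.get_text(). The sketch below shows the same idea using only the standard library, so it runs without extra packages. The class name ParagraphText, the helper paragraphs_from_html and the sample string are hypothetical names made up for illustration; the sample is a trimmed-down piece of the output above, not a live response.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of each <p> element, dropping inline tags such as <em>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []  # finished, non-empty paragraphs
        self._parts = None    # text pieces of the <p> being read, or None outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a <p> element.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._parts is not None:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._parts).split())
            if text:
                self.paragraphs.append(text)
            self._parts = None


def paragraphs_from_html(html):
    """Return a list with the plain text of every non-empty paragraph."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical sample, shortened from the chapter markup shown above.
sample = (
    '<div class="userstuff module" role="article">'
    '<h3 class="landmark heading" id="work">Chapter Text</h3>'
    '<p>Dean shrugs. “My parents.”</p>'
    '<p>The man’s eyes widen. “You <em>made </em>this?”</p>'
    '<p> </p></div>'
)

for paragraph in paragraphs_from_html(sample):
    print(paragraph)
```

In practice you would feed it str(chap) from the snippet above instead of the sample, then hand each paragraph to an NLTK tokenizer such as nltk.word_tokenize.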