How to scrape content rendered in popup window wit

Posted 2019-05-20 07:52

I'm trying to use scrapy to get content that is rendered only after a javascript: link is clicked. As the links don't appear to follow a systematic numbering scheme, I don't know how to:

1 - activate a javascript: link to expand a collapsed panel

2 - activate a (now visible) javascript: link to cause the popup to be rendered so that its content (the abstract) can be scraped

The site https://b-com.mci-group.com/EventProgramme/EHA19.aspx contains links to abstracts that will be presented at a conference I plan to attend. The site's export to PDF is buggy, in that it duplicates a lot of data at PDF generation time. Rather than dealing with the bug, I turned to scrapy only to realize that I'm in over my head. I've read:

Can scrapy be used to scrape dynamic content from websites that are using AJAX?

and

How to scrape coupon code of coupon site (coupon code comes on clicking button)

But I don't think I'm able to connect the dots. I've also seen mentions of Selenium, but I'm not sure that I must resort to that.

I have made little progress, and wonder if I can get a push in the right direction, with the following information in hand:

To make the POST request that expands the collapsed panel (item 1 above), I have traced that the on-page JS javascript:ShowCollapsiblePanel(116114,1695,44,191); results in a POST request to TARGETURLOFWEBSITE/EventSessionAjaxService/GetSessionDetailsHtml with payload:

{"eventSessionID":116114,"eventSessionWebSiteSetupViewID":191}

The parameters for eventSessionID and eventSessionWebSiteSetupViewID are clearly in the javascript:ShowCollapsiblePanel text.
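As a rough sketch of that mapping, the payload can be derived from the link text with a regular expression. This assumes, based only on the single example above, that the first argument is eventSessionID and the fourth is eventSessionWebSiteSetupViewID; the names PANEL_CALL and payload_from_href are made up for illustration.

```python
import json
import re

# Matches e.g. "javascript:ShowCollapsiblePanel(116114,1695,44,191);"
PANEL_CALL = re.compile(r"ShowCollapsiblePanel\((\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)")


def payload_from_href(href):
    """Return the POST payload for one ShowCollapsiblePanel link, or None.

    Assumption (inferred from one traced request): argument 1 is
    eventSessionID and argument 4 is eventSessionWebSiteSetupViewID.
    """
    match = PANEL_CALL.search(href)
    if match is None:
        return None
    first, _second, _third, fourth = (int(g) for g in match.groups())
    return {"eventSessionID": first, "eventSessionWebSiteSetupViewID": fourth}


href = "javascript:ShowCollapsiblePanel(116114,1695,44,191);"
body = json.dumps(payload_from_href(href), separators=(",", ":"))
print(body)  # {"eventSessionID":116114,"eventSessionWebSiteSetupViewID":191}
```

Links that do not contain a ShowCollapsiblePanel call give None, so they can be skipped when looping over every href on the page.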

How do I use scrapy to iterate over all of the links of the form javascript:ShowCollapsiblePanel? I tried SgmlLinkExtractor, but it didn't return any of the javascript:ShowCollapsiblePanel() links. I suspect that they don't meet its criteria for "links".

UPDATE

Making progress, I've found that SgmlLinkExtractor is not the right way to go, and the much simpler:

sel.xpath('//a[contains(@href, "javascript:ShowCollapsiblePanel")]').re(r'\((\d+),(\d+),(\d+),(\d+)\)')

in scrapy console returns me all of the numeric parameters for each javascript:ShowCollapsiblePanel() (of course, right now they are all in one long string, but I'm just messing around in the console).
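To get from the console output to one set of parameters per link, the values can be regrouped four at a time. This is a minimal sketch that assumes the selector's .re() call hands back the captured groups as one flat list of strings, which is my understanding of how Scrapy flattens multi-group matches. The helper name group_params and the second row of sample data are hypothetical.

```python
def group_params(flat, size=4):
    """Split a flat list of captured strings into tuples of `size` ints."""
    if len(flat) % size != 0:
        raise ValueError("expected a multiple of %d values, got %d" % (size, len(flat)))
    return [
        tuple(int(value) for value in flat[i:i + size])
        for i in range(0, len(flat), size)
    ]


# First four values come from the question; the rest are made-up sample data.
flat = ["116114", "1695", "44", "191", "116115", "1695", "44", "191"]
for params in group_params(flat):
    print(params)
```

Each tuple then corresponds to one javascript:ShowCollapsiblePanel() link, in document order.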

The next step will be to take the first javascript:ShowCollapsiblePanel() and generate the POST request and analyze the response to see if the response contains what I see when I click the link in the browser.
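That step could look roughly like the following, which only builds the POST request with the standard library and does not send it. The base URL is a placeholder because the real host is elided above as TARGETURLOFWEBSITE, the JSON Content-Type header is an assumption about what the service expects, and build_panel_request is a name invented for this sketch.

```python
import json
import urllib.request

# Path taken from the traced request in the question.
SERVICE_PATH = "/EventSessionAjaxService/GetSessionDetailsHtml"


def build_panel_request(base_url, payload):
    """Build (but do not send) the POST request that expands one panel."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + SERVICE_PATH,
        data=data,
        # Assumption: the endpoint accepts a JSON request body.
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


# Placeholder host; substitute the site's real base URL.
req = build_panel_request(
    "https://example.invalid",
    {"eventSessionID": 116114, "eventSessionWebSiteSetupViewID": 191},
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Inside a spider, the same URL, body, and headers could instead be passed to scrapy.Request with method="POST", and the callback could then check whether the response HTML contains the expanded panel.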

1 Answer
疯言疯语
#2 · 2019-05-20 08:23

I fought with a similar problem, and after much hair-pulling I got the data set I needed with import.io. It has a visual scraper that can run with JavaScript enabled, which did just what I needed, and it's free. Last night I also saw a project on GitHub, built on Scrapy, that looked a lot like the import.io scraper. It's called Portia, but I don't know whether it will do what you want: https://codeload.github.com/scrapinghub/portia/zip/master Good
