Web scraping Oracle (ATG) Commerce

Posted 2019-08-29 05:52

Question:

I am new to web scraping, and I use the following tools and method to scrape pages:

  • I use R (with the packages Curl, XML, etc.) to read a web page from its URL, and the htmlTreeParse function to parse the HTML.
  • Then, to find out where the data I want is located, I first use the developer tools in Chrome to inspect the code.
  • Once I know which node holds the data, I use xpathApply to extract it.
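The workflow above can be sketched roughly as follows. This is only an illustration: the HTML fragment, the class names, and the XPath expression are made up for the example and are not taken from any real site. For a live page, the HTML text would first be downloaded (for example with RCurl's getURL) and then passed to the parser in the same way.

```r
library(XML)

# Hypothetical page fragment standing in for a downloaded product page.
html <- '<html><body>
  <div class="product"><span class="name">Perfume A</span></div>
  <div class="product"><span class="name">Perfume B</span></div>
</body></html>'

# useInternalNodes = TRUE is required for XPath queries on the parsed tree.
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# The XPath is the one you would work out with the browser inspector
# (hypothetical here); xmlValue returns the text content of each match.
product_names <- xpathSApply(
  doc,
  "//div[@class='product']/span[@class='name']",
  xmlValue
)

print(product_names)
```

xpathSApply is used here instead of xpathApply only because it simplifies the result to a character vector; xpathApply would return the same values as a list.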

Usually this works well, but I ran into a problem with this site: http://www.sephora.fr/Parfum/Parfum-Femme/C309/2

  • When you click on the link, the page that loads is actually page 1 of the products.
  • You have to load the URL again (by entering it a second time) to get page 2.
  • When I use my usual process to read the data, the htmlTreeParse function always gives me page 1.

I tried to understand this website better:

  • It seems to be built with Oracle Commerce (ATG Commerce).
  • The "real" URL is hidden, and when you click on a filter (for instance, when you select a brand), you get a URL with a request ID: http://www.sephora.fr/Parfum/Parfum-Femme/C309?_requestid=285099

This URL does not tell me which selection I made.

Could you please help:

  • How can I access more products?

Thank you

Answer 1:

I found the solution: Selenium! I think it is the ultimate tool for web scraping. I have posted several questions about web scraping, and now, with RSelenium, almost everything is possible.
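For readers who want a starting point, here is a minimal sketch of how this could look with RSelenium. It is not the exact code I used. The helper name get_rendered_values, the XPath in the usage comment, and the idea of loading the URL twice (taken from the behaviour described in the question) are assumptions for illustration. The driver drives a real browser, so the page source it returns is what the browser actually rendered.

```r
library(XML)

# Hypothetical helper: load a URL through a Selenium-driven browser, then
# parse the rendered page and extract values with an XPath expression.
# `remDr` is expected to be an opened RSelenium remoteDriver.
get_rendered_values <- function(remDr, url, xpath, loads = 2) {
  # The question reports that the second load shows page 2, so the URL is
  # loaded `loads` times within the same browser session.
  for (i in seq_len(loads)) {
    remDr$navigate(url)
  }
  # getPageSource() returns a list; the first element is the HTML text.
  page_source <- remDr$getPageSource()[[1]]
  doc <- htmlParse(page_source, asText = TRUE)
  xpathSApply(doc, xpath, xmlValue)
}

# Usage sketch (assumes a Selenium server is already running on
# localhost:4444; the XPath below is hypothetical):
#
# library(RSelenium)
# remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L,
#                       browserName = "firefox")
# remDr$open()
# values <- get_rendered_values(
#   remDr,
#   "http://www.sephora.fr/Parfum/Parfum-Femme/C309/2",
#   "//div[@class='product']//span[@class='name']"
# )
# remDr$close()
```

Keeping the driver as an argument means the same browser session is reused across requests, and the parsing step stays the same htmlParse and XPath approach as in the question.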