In R, crawling with rvest fails to get the text

Posted 2019-06-10 06:56

Question:

url <-"http://news.chosun.com/svc/content_view/content_view.html?contid=1999080570392"

hh = read_html(GET(url),encoding = "EUC-KR")

#guess_encoding(hh)

html_text(html_node(hh, 'div.par'))
#html_text(html_nodes(hh ,xpath='//*[@id="news_body_id"]/div[2]/div[3]'))

I'm trying to crawl news data (just for practice) using rvest in R.

When I tried it on the page above, I failed to fetch the text. (The XPath version doesn't work either.)

I don't think the problem is finding the node that contains the text I want. But when I try to extract the text from that node with html_text(), all I get is "" or blanks.

I can't figure out why; I don't have any experience with HTML or crawling.

My guess is that it has something to do with the tag containing the news body having "class" and "data-dzo" attributes (I don't know what the latter is).

If anyone could tell me how to solve this, or point me to search keywords I can Google, I'd appreciate it.

Answer 1:

The site builds quite a bit of the page dynamically, which is why the nodes come back empty in the static HTML that rvest sees. This should help.

The article content is in an XML file whose URL can be constructed from the contid parameter. Pass either a full article HTML URL (like the one in your example) or just the contid value to the function below, and it will return an xml2 xml_document with the parsed XML:

#' Retrieve article XML from chosun.com
#' 
#' @param full_url_or_article_id either a full URL like 
#'        `http://news.chosun.com/svc/content_view/content_view.html?contid=1999080570392`
#'        or just the id (e.g. `1999080570392`)
#' @return xml_document
read_chosun_article <- function(full_url_or_article_id) {

  # httr for the request; xml2 because httr::content() returns a
  # parsed xml2 xml_document for XML responses (rvest isn't needed here)
  require(httr)
  require(xml2)

  full_url_or_article_id <- full_url_or_article_id[1]

  if (grepl("^http", full_url_or_article_id)) {
    contid <- httr::parse_url(full_url_or_article_id)
    contid <- contid$query$contid
  } else {
    contid <- full_url_or_article_id
  }

  # The target article XML URLs are in the following format:
  #
  # http://news.chosun.com/priv/data/www/news/1999/08/05/1999080570392.xml
  #
  # so we need to construct it from substrings in the 'contid'

  sprintf(
    "http://news.chosun.com/priv/data/www/news/%s/%s/%s/%s.xml",
    substr(contid, 1, 4), # year
    substr(contid, 5, 6), # month
    substr(contid, 7, 8), # day
    contid
  ) -> contid_xml_url

  res <- httr::GET(contid_xml_url)

  httr::content(res) # parsed XML, returned as an xml2 xml_document

}

read_chosun_article("http://news.chosun.com/svc/content_view/content_view.html?contid=1999080570392")
## {xml_document}
## <content>
##  [1] <id>1999080570392</id>
##  [2] <site>\n  <id>1</id>\n  <name><![CDATA[www]]></name>\n</site>
##  [3] <category>\n  <id>3N1</id>\n  <name><![CDATA[사람들]]></name>\n  <path ...
##  [4] <type>0</type>
##  [5] <template>\n  <id>2006120400003</id>\n  <fileName>3N.tpl</fileName> ...
##  [6] <date>\n  <created>19990805192041</created>\n  <createdFormated>199 ...
##  [7] <editor>\n  <id>chosun</id>\n  <email><![CDATA[webmaster@chosun.com ...
##  [8] <source><![CDATA[0]]></source>
##  [9] <title><![CDATA[[동정] 이철승, 순국학생 위령제 지내 등]]></title>
## [10] <subTitle/>
## [11] <indexTitleList/>
## [12] <authorList/>
## [13] <masterId>1999080570392</masterId>
## [14] <keyContentId>1999080570392</keyContentId>
## [15] <imageList count="0"/>
## [16] <mediaList count="0"/>
## [17] <body count="1">\n  <page no="0">\n    <paragraph no="0">\n      <t ...
## [18] <copyright/>
## [19] <status><![CDATA[RL]]></status>
## [20] <commentBbs>N</commentBbs>
## ...

read_chosun_article("1999080570392")
## (identical output to the call above)

NOTE: I poked around that site to see whether this violates their terms of service, and it does not seem to, but I relied on Google Translate, which may have made any such clause harder to find. It's important to ensure you can legally (and ethically, if you care about ethics) scrape this content for whatever use you intend.
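
Beyond the terms of service, one quick check you can automate is robots.txt. A minimal sketch using the robotstxt package (an assumption on my part: it isn't used elsewhere in this answer and needs to be installed separately):

# install.packages("robotstxt")  # if needed
# TRUE only means the path isn't disallowed for generic bots; it's a
# necessary condition, not permission in itself
robotstxt::paths_allowed("http://news.chosun.com/priv/data/www/news/")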