How to split item names when writing csv file of s

Posted 2020-05-09 17:03

Question:

I want to create a CSV (or similar Excel-compatible) file from data that I scraped from the web using R. So far I have stored the data like this:

library(rvest)  # provides read_html(), html_nodes() and html_text()
spiegel <- read_html("http://www.spiegel.de/schlagzeilen/")
headlines <- html_nodes(spiegel, ".headline-date")
mydata <- html_text(headlines)

The variable "mydata" now contains the following:

  [1] "(Wirtschaft, 00:00)"       "(Kultur, 23:42)"           "(Sport, 23:38)"            "(Politik, 23:16)"
  [5] "(Sport, 22:29)"            "(Panorama, 21:56)"         "(Sport, 21:39)"            "(Sport, 21:25)"
  [9] "(Sport, 20:23)"            "(Politik, 20:21)"          "(Politik, 20:09)"          "(Wissenschaft, 19:41)"

When I now call write.csv, I want two columns: the first should contain the category (e.g. "Wirtschaft", "Sport") and the second the time. How can I do this in this specific case?

Answer 1:

Remove the parentheses, convert to a tibble (whose single column will be called value), and use separate to split that into two columns. Finally, write it out. Replace stdout() with your filename.

Lines <- c("(Wirtschaft, 00:00)", "(Kultur, 23:42)") # test data

library(dplyr)
library(tidyr)
library(tibble)

Lines %>% 
      gsub("[()]", "", .) %>%
      as_tibble() %>%
      separate(value, into = c("Name", "Time"), sep = ", ") %>%
      write.csv(stdout(), row.names = FALSE)

giving:

"Name","Time"
"Wirtschaft","00:00"
"Kultur","23:42"

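As a sketch of the last step, the same approach can write to a file on disk instead of stdout(). The temporary path below is only an assumption for illustration and is not part of the original answer; substitute your own filename. Reading the file back is a quick way to confirm that the two columns came out as intended.

```r
library(tidyr)
library(tibble)

Lines <- c("(Wirtschaft, 00:00)", "(Kultur, 23:42)")  # test data from the answer

# Hypothetical output path; replace with e.g. "headlines.csv"
out_file <- tempfile(fileext = ".csv")

# Strip the parentheses, then split each string on ", " into two columns
cleaned <- gsub("[()]", "", Lines)
df <- separate(tibble(value = cleaned), value,
               into = c("Name", "Time"), sep = ", ")

write.csv(df, out_file, row.names = FALSE)

# Read the file back to check the result
check <- read.csv(out_file, stringsAsFactors = FALSE)
print(check)
```

Building the tibble with an explicit column name (value = cleaned) avoids relying on the default name that as_tibble() assigns to a bare vector.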

Answer 2:

We can do this in base R with read.csv, after using gsub to replace the parentheses with an empty string (""):

df1 <- read.csv(text = gsub("[()]", "", mydata), header = FALSE,
          col.names = c("Col1", "Col2"), stringsAsFactors = FALSE)
head(df1)
#      Col1   Col2
#1   Kultur  23:42
#2    Sport  23:38
#3  Politik  23:16
#4    Sport  22:29
#5 Panorama  21:56
#6    Sport  21:39

tail(df1)
#          Col1   Col2
#189 einestages  04:26
#190   Panorama  04:26
#191      Sport  04:09
#192    Politik  03:11
#193    Politik  01:56
#194    Politik  00:15
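
The answer above stops at the data frame. Because each input string has a space after the comma, read.csv by default keeps that space at the start of the second column. The following is a minimal sketch, assuming you want trimmed values and a file on disk. strip.white is a documented read.csv argument; the sample vector, the column names and the temporary path are stand-ins chosen for illustration.

```r
# Sample input standing in for the scraped `mydata` vector
mydata <- c("(Wirtschaft, 00:00)", "(Kultur, 23:42)", "(Sport, 23:38)")

# Drop the parentheses and parse each element as one CSV line;
# strip.white = TRUE trims the space that follows each comma
df1 <- read.csv(text = gsub("[()]", "", mydata), header = FALSE,
                col.names = c("Category", "Time"),
                strip.white = TRUE, stringsAsFactors = FALSE)

# Hypothetical output path; replace with your own filename
out_file <- tempfile(fileext = ".csv")
write.csv(df1, out_file, row.names = FALSE)

print(df1)
```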