R: How can I extract an element from a column of d

2019-02-28 23:52发布

问题:

I have a dataset as below.

Because of its large amount of data, I uploaded it through the sparklyr package, so I can use only pipe statements.

pos <- str_sub(csj$helpful,2)
neg1 <- str_sub(csj$helpful,4)
csj <- csj %>% mutate(neg=replace(helpful,stringr::str_sub(csj$helpful,4)==1,0))
csj <- csj %>% mutate(help=pos/neg)
csj
is.null(csj$helpful)

I want to make a column named 'help' which is 'the first number of helpful column/2nd number of helpful column'.

If the 2nd number is 0, I need to change the 2nd number to 1 and then divide it.

The data frame name is csj.

But it doesn't work.

I'll be glad if someone can help me solve this problem.

After I followed @Sebastian Hoyos's advice but still I got this col1,col2,col3 as NAN as below picture. (But the example he gave me worked). How should I solve this problem?

enter image description here

+) After I tried without as.numeric the part then I got this result.

> csj %>%
+   mutate(col1 = stringi::stri_extract_first_regex(csj$helpful, pattern = "[0-9]"),#extract first number
+          col2 = stringi::stri_extract_last_regex(csj$helpful, pattern = "[0-9]"),#extract second
+          col3 = ifelse(col2 == 0, 1, col2 ),#change 0s to 1
+          help = col1/col3) #divide row1 and 3


# Source:   lazy query [?? x 12]
# Database: spark_connection
   `_c0` reviewerID     asin  helpful length_of_review overall unixReviewTime category   col1  col2  col3   help
   <int> <chr>          <chr> <chr>              <dbl> <chr>   <chr>          <chr>      <chr> <chr> <chr> <dbl>
 1     0 A1KLRMWW2FWPL4 31887 [0, 0]               172 5       1297468800     Clothes_s~ ""    ""    NA      NaN
 2     1 A2G5TCU2WDFZ65 31887 [0, 0]               306 5       1358553600     Clothes_s~ ""    ""    NA      NaN
 3     2 A1RLQXYNCMWRWN 31887 [0, 0]               312 5       1357257600     Clothes_s~ ""    ""    NA      NaN
 4     3 A8U3FAMSJVHS5  31887 [0, 0]               405 5       1398556800     Clothes_s~ ""    ""    NA      NaN
 5     4 A3GEOILWLK86XM 31887 [0, 0]               453 5       1394841600     Clothes_s~ ""    ""    NA      NaN
 6     5 A27UF1MSF3DB2  31887 [0, 0]               375 4       1396224000     Clothes_s~ ""    ""    NA      NaN
 7     6 A16GFPNVF4Y816 31887 [0, 0]               334 5       1399075200     Clothes_s~ ""    ""    NA      NaN
 8     7 A2M2APVYIB2U6K 31887 [0, 0]               158 5       1356220800     Clothes_s~ ""    ""    NA      NaN
 9     8 A1NJ71X3YPQNQ9 31887 [0, 0]                96 4       1384041600     Clothes_s~ ""    ""    NA      NaN
10     9 A3EERSWHAI6SO  31887 [7, 8]               532 5       1349568000     Clothes_s~ ""    ""    NA      NaN
# ... with more rows
> 

回答1:

Although this isn't the most elegant string of code, it should get the job done. Since no sample dataset is provided other than a screenshot, I just created a sample with the important elements you were interested in.

csj <- tibble(helpful = rep(c("[0,0]","[0,1]","[0,2]","[1,3]"),100),
                            overall = rep(c(5,4,3,2),100))
#this change the columns and creates the help column
csj %>%
      mutate(col1 = as.numeric(stringi::stri_extract_first_regex(helpful, pattern = "[0-9]")),#extract first number
             col2 = as.numeric(stringi::stri_extract_last_regex(helpful, pattern = "[0-9]")),#extract second
             col3 = ifelse(col2 == 0, 1, row2 ),#change 0s to 1
             help = col1/col3) %>% #divide row1 and 3
      select(helpful, help)#select the rows you wish to keep

This should work as long as you modify the functions to your dataset as needed. Also note that helpful is a character type in your dataset which is why you need to change it to numeric

EDIT: So I looked up some sparklyr and realized why the code isn't working so I created an example for myself to test out.Although I didn't replicate your data completely I came up with enough things to hopefully provide a working solution.

library(sparklyr)
library(dplyr)
library(ggplot2)
library(magrittr) 
sc <- spark_connect(master="local")
#create dataframe
cjs <- tibble(helpful = rep(c("[0,  0]","[0, 1]","[0, 2]","[1, 3]","[,1]",NA,"a"),100),
              overall = rep(c(6,5,4,3,2,1,0),100))

#transfer to sparkly
csj <- copy_to(sc, csj,"cjs")

#this should do the trick
csj %>% 
  mutate(newcol2 = regexp_replace(helpful, "[^0-9,]", " "), 
         newcol3 = as.numeric(substring_index(newcol2, ",", 1)),
         newcol4 = as.numeric(substring_index(newcol2,",",-1)),
         newcol5 = ifelse(newcol4 == 0, 1, newcol4),
         help = newcol3/newcol5) %>% 
  select(starts_with("new"),help) #select the columns you need with help calculated appropriately