I read a text file containing the data below and am trying to convert it to a dataframe:
Id: 1
ASIN: 0827229534
title: Patterns of Preaching: A Sermon Sampler
group: Book
salesrank: 396585
similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X
reviews: total: 2 downloaded: 2 avg rating: 5
Sample dataframe with columns and data:
Id | ASIN | title | group | similar | avg rating
1 | 0827229534 | Patterns of Preaching: A Sermon Sampler | Book | 0804215715 | 5
Code:
text <- readLines("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/sample.txt")
ids <- gsub('Id:\\s+', '', text)
ASIN <- gsub('ASIN:\\s+', '', text)
title <- gsub('title:\\s+', '', text)
group <- gsub('group:\\s+', '', text)
similar <- gsub('similar:\\s+', '', text)
rating <- gsub('avg rating:\\s+', '', text)
This isn't working: I get the full text file as output.
EDIT: Correcting my answer.
Using stringr: this is just a start. Since I'm not a pro at regex, I will let others do the magic. :)
Either you define the rules for every object and do something like this.
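A minimal sketch of the per-field route, assuming the record has already been read into a character vector (inlined here so the example runs without the file). `get_field` is a hypothetical helper name, not something from the question. It also shows why the `gsub` calls in the question return the whole file: `gsub` rewrites the lines that match and passes every other line through unchanged, so the matching line has to be selected first.

```r
library(stringr)

# The sample record from the question, one element per line.
text <- c(
  "Id: 1",
  "ASIN: 0827229534",
  "title: Patterns of Preaching: A Sermon Sampler",
  "group: Book",
  "salesrank: 396585",
  "similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X",
  "reviews: total: 2 downloaded: 2 avg rating: 5"
)

# Hypothetical helper: keep only the lines that start with the label,
# then strip the label so only the value is left.
get_field <- function(lines, label) {
  pattern <- str_c("^\\s*", label, ":\\s+")
  str_remove(str_subset(lines, pattern), pattern)
}

ids   <- get_field(text, "Id")
ASIN  <- get_field(text, "ASIN")
title <- get_field(text, "title")
group <- get_field(text, "group")

# The rating sits in the middle of the reviews line, so it gets its own rule.
rating <- str_match(text, "avg rating:\\s+([0-9.]+)")[, 2]
rating <- rating[!is.na(rating)]

df <- data.frame(ids, ASIN, title, group, rating, stringsAsFactors = FALSE)
```

Everything stays character here, which keeps the leading zero in the ASIN.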
Or you define a general rule, which should work for every line. Something like this:
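One way such a general rule might look, as a sketch only: treat every line as `key: value` and split at the first colon. The regular expression is my own assumption about the file layout, not something taken from the question.

```r
library(stringr)

# The sample record from the question, one element per line.
text <- c(
  "Id: 1",
  "ASIN: 0827229534",
  "title: Patterns of Preaching: A Sermon Sampler",
  "group: Book",
  "salesrank: 396585",
  "similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X",
  "reviews: total: 2 downloaded: 2 avg rating: 5"
)

# One rule for every line: capture the text before the first colon as the key
# and everything after it as the value (later colons stay in the value).
parts  <- str_match(text, "^\\s*([^:]+):\\s*(.*)$")
fields <- setNames(parts[, 3], parts[, 2])

# A named character vector converts directly into a one-row data frame.
df <- as.data.frame(as.list(fields), stringsAsFactors = FALSE)
```

Nested values such as `avg rating` still need a second pass over `fields[["reviews"]]`.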
Using the tidyverse package:
I put the text in a list because I assume that you will want to create a dataframe with more than one item being looked up. If you do, just add a new list item for each readLines that you do.
Notice that mutate looks at each item in the list as an object which is equivalent to using text[[1]]...
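As an illustration of the list-plus-mutate idea, here is a hedged sketch. It assumes one list element per record, and `field_value` is a hypothetical helper name. It uses `purrr::map_chr` to visit each list element, which may differ from how the list was handled originally.

```r
library(dplyr)
library(purrr)
library(stringr)

# One list element per record, for example one readLines() result each.
records <- list(c(
  "Id: 1",
  "ASIN: 0827229534",
  "title: Patterns of Preaching: A Sermon Sampler",
  "group: Book",
  "salesrank: 396585",
  "similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X",
  "reviews: total: 2 downloaded: 2 avg rating: 5"
))

# Hypothetical helper: the value(s) of one field inside a single record,
# collapsed into one string in case the field occurs more than once.
field_value <- function(lines, label) {
  lines %>%
    str_subset(str_c("^\\s*", label, ":")) %>%
    str_remove(str_c("^\\s*", label, ":\\s*")) %>%
    str_c(collapse = ", ")
}

df <- tibble(text = records) %>%
  mutate(
    Id    = map_chr(text, field_value, "Id"),
    ASIN  = map_chr(text, field_value, "ASIN"),
    title = map_chr(text, field_value, "title"),
    group = map_chr(text, field_value, "group")
  ) %>%
  select(-text)
```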
If you have an item occur more than once, you'll need to add
%>% str_c(collapse = ", ")
like I have done; otherwise you can remove it.
UPDATE based on new sample data:
The new sample dataset creates some different challenges that weren't addressed in my original answer.
First, the data is all in a single file and I had assumed it would be in multiple files. It is possible to either separate everything into a list of lists, or to separate everything into a vector of characters. I chose the second option.
Because I chose the second option, I now have to update my code to extract data until a \r is reached (written as \\r in R because of how R handles escapes).
Next, some of the fields are empty! I have to add a check to see whether the result is empty and fix the output if it is. I'm using
%>% ifelse(length(.)==0,NA,.)
to accomplish this.
Note: if you add other fields such as categories: to this search, the code will only capture the first line of text. It will need to be modified to capture more than one line.
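The "extract until \r, then check for empty" idea can be sketched roughly as follows. The assumptions are mine: records are separated by a blank line, line endings are `\r\n`, and the second record is invented purely to show a missing field. `extract_field` is a hypothetical name, and a plain `if`/`else` stands in for the piped `ifelse`.

```r
library(stringr)

# Whole file as one string. Record 2 is made-up placeholder data with no title.
raw <- paste0(
  "Id: 1\r\nASIN: 0827229534\r\n",
  "title: Patterns of Preaching: A Sermon Sampler\r\ngroup: Book\r\n",
  "\r\n",
  "Id: 2\r\nASIN: 0000000000\r\ngroup: Book\r\n"
)

# One string per record, split at the blank line between records.
records <- strsplit(raw, "\r\n\r\n", fixed = TRUE)[[1]]

# Hypothetical helper: capture what follows "label:" up to the next \r or \n,
# and return NA when the field does not occur in the record at all.
extract_field <- function(record, label) {
  pattern <- str_c(label, ":[ \\t]*([^\\r\\n]*)")
  hits <- str_match_all(record, pattern)[[1]][, 2]
  if (length(hits) == 0) NA_character_ else str_c(hits, collapse = ", ")
}

df <- data.frame(
  Id    = vapply(records, extract_field, character(1), label = "Id",
                 USE.NAMES = FALSE),
  ASIN  = vapply(records, extract_field, character(1), label = "ASIN",
                 USE.NAMES = FALSE),
  title = vapply(records, extract_field, character(1), label = "title",
                 USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
```

As the note above says, this only captures the first line of a field, so a multi-line field such as categories: would need a different pattern.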
I am mostly using base R here (apart from zoo and tidyr). The code may be a little long, but it gets the desired results.
Output:
The text file is heavily wrapped, hence the screenshot; my apologies to the community.
The output matches what the OP asked for.
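For readers who want something runnable, here is a rough sketch along the same lines (base R plus zoo and tidyr). It is a reconstruction under my own assumptions: the second record is invented placeholder data, and `pivot_wider` is my choice for the reshaping step.

```r
library(zoo)
library(tidyr)

# Record 1 is the sample from the question; record 2 is made-up placeholder data.
text <- c(
  "Id: 1",
  "ASIN: 0827229534",
  "title: Patterns of Preaching: A Sermon Sampler",
  "group: Book",
  "salesrank: 396585",
  "similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X",
  "reviews: total: 2 downloaded: 2 avg rating: 5",
  "",
  "Id: 2",
  "ASIN: 0000000000",
  "title: Placeholder Title",
  "group: Book",
  "reviews: total: 0 downloaded: 0 avg rating: 0"
)

# Drop blank lines, then split each line into key and value at the first colon.
text <- text[nzchar(trimws(text))]
long <- data.frame(
  key   = sub("^\\s*([^:]+):.*$", "\\1", text),
  value = sub("^\\s*[^:]+:\\s*", "", text),
  stringsAsFactors = FALSE
)

# Tag every line with its record: the Id value on "Id" lines, NA elsewhere,
# then carry the last seen Id forward.
long$Id <- ifelse(long$key == "Id", long$value, NA)
long$Id <- zoo::na.locf(long$Id)

# One row per record, one column per field; fields a record lacks become NA.
wide <- pivot_wider(long[long$key != "Id", ],
                    names_from = key, values_from = value)

# The rating is nested inside the reviews line, so pull it out separately.
wide$rating <- sub(".*avg rating:\\s*", "", wide$reviews)
```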
Here is a different approach using separate_rows and spread to reformat the text file into a dataframe:
Result:
Data:
Note:
Leave an extra blank row at the end of the text file. Otherwise readLines will warn about an incomplete final line when reading the file.
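A sketch of how the separate_rows and spread approach might be written, assuming the whole record sits in a single string with `\n` line breaks. The `record` column and the use of `separate` with `extra = "merge"` are my own assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# The sample record from the question as a single string, one field per line.
raw <- data.frame(
  record = 1L,
  text = paste(
    "Id: 1",
    "ASIN: 0827229534",
    "title: Patterns of Preaching: A Sermon Sampler",
    "group: Book",
    "salesrank: 396585",
    "similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X",
    "reviews: total: 2 downloaded: 2 avg rating: 5",
    sep = "\n"
  ),
  stringsAsFactors = FALSE
)

df <- raw %>%
  # One row per line of the record.
  separate_rows(text, sep = "\n") %>%
  # Split at the first colon only; extra = "merge" keeps later colons in value.
  separate(text, into = c("key", "value"), sep = ":\\s+", extra = "merge") %>%
  # One column per field, one row per record.
  spread(key, value) %>%
  # The rating is nested inside the reviews line, so extract it separately.
  mutate(rating = sub(".*avg rating:\\s*", "", reviews)) %>%
  select(Id, ASIN, title, group, similar, rating)
```

With several records, each one would need its own `record` value so that spread can tell them apart.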