Importing text data into R and removing extraneous

Posted 2019-07-14 18:00

Question:

I have a large text file that contains data from the Uniform Crime Report. Ideally, I would like to import only the data and leave out the extraneous material in the file. The actual data is delimited by spaces, and whenever the data continues onto another "page" the header information repeats. I first tried to import only the data with the following code, adding my own headers manually:

  data <- read.fwf("2010SHRall.txt",
        widths=c(-4,3,8,2,4,5,6,5,4,3,3,4,4,3,3,4,6,5,3,6,26,3),
        skip=5,
        col.names=c("AGE","AGENCY","G","MO","HOM","INC","SIT","VA","VS","VR","VE","OA","OS","OR","OE","WEAP","REL","CIR","SUB","AGENCYNAME","STATE"),
        strip.white=FALSE)

This works, but then it quits at line 51. I'm a novice R programmer; I tried Googling the answer and searching Stack Overflow, but I am at a loss for where to go from here. Here is a link to the text file that I am trying to import. Again, I am trying to import the data and remove any rows that contain header information or other pieces that are not needed for the complete dataset.

Any help anyone could offer would be greatly appreciated.

Answer 1:

This should probably work:

# Read the whole report, one element per line
text <- readLines('/tmp/2010SHRall.txt')
# Data rows follow a line matching group.start and stop at a line matching group.end
group.start <- '^      AGENCY'
group.end <- '(^B)|(^0END OF GROUP)'
data <- character()
inside.group <- FALSE
for (line in text) {
  if (inside.group) {
    if (grepl(group.end, line))
      inside.group <- FALSE       # block finished; stop collecting
    else
      data <- append(data, line)  # data row; keep it
  } else if (grepl(group.start, line)) {
    inside.group <- TRUE          # header found; the rows that follow are data
  }
}
# Parse the collected rows as fixed-width fields
read.fwf(textConnection(data),
         widths=c(-4,3,8,2,4,5,6,5,4,3,3,4,4,3,3,4,6,5,3,6,26,3),
         header=FALSE,
         col.names=c("AGE","AGENCY","G","MO","HOM","INC","SIT","VA","VS","VR","VE","OA","OS","OR","OE","WEAP","REL","CIR","SUB","AGENCYNAME","STATE"),
         strip.white=TRUE)

The loop keeps every line that falls between a line matching group.start and the next line matching group.end, and discards everything else, including the matching lines themselves.
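
On a large file, growing data with append() one line at a time can get slow. The sketch below applies the same start/end filtering but runs the two regular expressions once over the whole vector and records a keep/discard flag per line. The function name keep_group_lines and the sample lines are made up for illustration; they are not from the real 2010SHRall.txt file, so treat this as a starting point, not a tested solution for that file.

```r
# Hypothetical helper: return only the lines between a start marker and the
# next end marker, dropping the marker lines themselves.
keep_group_lines <- function(text, group.start, group.end) {
  is.start <- grepl(group.start, text)  # vectorised: one regex pass per pattern
  is.end   <- grepl(group.end, text)
  keep     <- logical(length(text))     # preallocated, all FALSE
  inside   <- FALSE
  for (i in seq_along(text)) {
    if (inside) {
      if (is.end[i]) inside <- FALSE else keep[i] <- TRUE
    } else if (is.start[i]) {
      inside <- TRUE
    }
  }
  text[keep]
}

# Made-up sample standing in for the report (assumption, not real data)
text <- c(
  "1 UNIFORM CRIME REPORT",
  "      AGENCY  header",
  "    row one",
  "    row two",
  "0END OF GROUP",
  "junk between pages",
  "      AGENCY  header",
  "    row three",
  "B page footer"
)

rows <- keep_group_lines(text, "^      AGENCY", "(^B)|(^0END OF GROUP)")
print(rows)
```

The resulting character vector can then be passed to read.fwf(textConnection(rows), ...) with the same widths and col.names as above.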