Import raw data into R

2019-03-31 06:00发布

问题:

please anyone can help me to import this data into R from a text or dat file. It has space delimited, but cities names should not considered as two names. Like NEW YORK.

1 NEW YORK  7,262,700
2 LOS ANGELES  3,259,340
3 CHICAGO  3,009,530
4 HOUSTON  1,728,910
5 PHILADELPHIA  1,642,900
6 DETROIT  1,086,220
7 SAN DIEGO  1,015,190
8 DALLAS  1,003,520
9 SAN ANTONIO  914,350
10 PHOENIX  894,070

回答1:

For your particular data frame, where true spaces only occur between capital letters, consider using a regular expression:

gsub("(*[A-Z]) ([A-Z]+)", "\\1-\\2", "1 NEW YORK  7,262,700")
# [1] "1 NEW-YORK 7,262,700"
gsub("(*[A-Z]) ([A-Z]+)", "\\1-\\2", "3 CHICAGO  3,009,530")
# [1] "3 CHICAGO  3,009,530"

You can then interpret spaces as field separators.



回答2:

A variation on a theme... but first, some sample data:

cat("1 NEW YORK  7,262,700",
    "2 LOS ANGELES  3,259,340",
    "3 CHICAGO  3,009,530",
    "4 HOUSTON  1,728,910",
    "5 PHILADELPHIA  1,642,900",
    "6 DETROIT  1,086,220",
    "7 SAN DIEGO  1,015,190",
    "8 DALLAS  1,003,520",
    "9 SAN ANTONIO  914,350",
    "10 PHOENIX  894,070", sep = "\n", file = "test.txt")

Step 1: Read the data in with readLines

x <- readLines("test.txt")

Step 2: Figure out a regular expression that you can use to insert delimiters. Here, the pattern seems to be (looking from the end of the lines) a set of numbers and commas preceded by space preceded by some words in ALL CAPS. We can capture those groups and insert some "tab" delimiters (\t). The extra slashes are to properly escape them.

gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x)
#  [1] "1\t NEW YORK  \t7,262,700"     "2\t LOS ANGELES  \t3,259,340" 
#  [3] "3\t CHICAGO  \t3,009,530"      "4\t HOUSTON  \t1,728,910"     
#  [5] "5\t PHILADELPHIA  \t1,642,900" "6\t DETROIT  \t1,086,220"     
#  [7] "7\t SAN DIEGO  \t1,015,190"    "8\t DALLAS  \t1,003,520"      
#  [9] "9\t SAN ANTONIO  \t914,350"    "10\t PHOENIX  \t894,070"  

Step 3: Since we know our gsub is working, and we know that read.delim has a "text" argument that can be used instead of a "file" argument, we can use read.delim directly on the result of gsub:

out <- read.delim(text = gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x), 
                  header = FALSE, strip.white = TRUE)
out
#    V1           V2        V3
# 1   1     NEW YORK 7,262,700
# 2   2  LOS ANGELES 3,259,340
# 3   3      CHICAGO 3,009,530
# 4   4      HOUSTON 1,728,910
# 5   5 PHILADELPHIA 1,642,900
# 6   6      DETROIT 1,086,220
# 7   7    SAN DIEGO 1,015,190
# 8   8       DALLAS 1,003,520
# 9   9  SAN ANTONIO   914,350
# 10 10      PHOENIX   894,070

One possible last step would be to convert the third column to numeric:

out$V3 <- as.numeric(gsub(",", "", out$V3))


回答3:

Expanding on @Hugh's answer I would try the following, although its not particularly efficient.

lines <- scan("cities.txt", sep="\n", what="character")
lines <- unlist(lapply(lines, function(x) { 
  gsub(pattern="(*[a-zA-Z]) ([a-zA-Z]+)", replacement="\\1-\\2", x) 
}))

citiesDF <- data.frame(num  = rep(0, length(lines)), 
                       city = rep("", length(lines)), 
                       population = rep(0, length(lines)),
                       stringsAsFactors=FALSE)

for (i in 1:length(lines)) {
   splitted = strsplit(lines[i], " +")
   citiesDF[i, "num"] <- as.numeric(splitted[[1]][1])
   citiesDF[i, "city"] <- gsub("-", " ", splitted[[1]][2])
   citiesDF[i, "population"] <- as.numeric(gsub(",", "", splitted[[1]][3]))
}