please anyone can help me to import this data into R from a text or dat file. It has space delimited, but cities names should not considered as two names. Like NEW YORK.
1 NEW YORK 7,262,700
2 LOS ANGELES 3,259,340
3 CHICAGO 3,009,530
4 HOUSTON 1,728,910
5 PHILADELPHIA 1,642,900
6 DETROIT 1,086,220
7 SAN DIEGO 1,015,190
8 DALLAS 1,003,520
9 SAN ANTONIO 914,350
10 PHOENIX 894,070
For your particular data frame, where true spaces only occur between capital letters, consider using a regular expression:
gsub("(*[A-Z]) ([A-Z]+)", "\\1-\\2", "1 NEW YORK 7,262,700")
# [1] "1 NEW-YORK 7,262,700"
gsub("(*[A-Z]) ([A-Z]+)", "\\1-\\2", "3 CHICAGO 3,009,530")
# [1] "3 CHICAGO 3,009,530"
You can then interpret spaces as field separators.
A variation on a theme... but first, some sample data:
cat("1 NEW YORK 7,262,700",
"2 LOS ANGELES 3,259,340",
"3 CHICAGO 3,009,530",
"4 HOUSTON 1,728,910",
"5 PHILADELPHIA 1,642,900",
"6 DETROIT 1,086,220",
"7 SAN DIEGO 1,015,190",
"8 DALLAS 1,003,520",
"9 SAN ANTONIO 914,350",
"10 PHOENIX 894,070", sep = "\n", file = "test.txt")
Step 1: Read the data in with readLines
x <- readLines("test.txt")
Step 2: Figure out a regular expression that you can use to insert delimiters. Here, the pattern seems to be (looking from the end of the lines) a set of numbers and commas preceded by space preceded by some words in ALL CAPS. We can capture those groups and insert some "tab" delimiters (\t
). The extra slashes are to properly escape them.
gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x)
# [1] "1\t NEW YORK \t7,262,700" "2\t LOS ANGELES \t3,259,340"
# [3] "3\t CHICAGO \t3,009,530" "4\t HOUSTON \t1,728,910"
# [5] "5\t PHILADELPHIA \t1,642,900" "6\t DETROIT \t1,086,220"
# [7] "7\t SAN DIEGO \t1,015,190" "8\t DALLAS \t1,003,520"
# [9] "9\t SAN ANTONIO \t914,350" "10\t PHOENIX \t894,070"
Step 3: Since we know our gsub
is working, and we know that read.delim
has a "text
" argument that can be used instead of a "file
" argument, we can use read.delim
directly on the result of gsub
:
out <- read.delim(text = gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x),
header = FALSE, strip.white = TRUE)
out
# V1 V2 V3
# 1 1 NEW YORK 7,262,700
# 2 2 LOS ANGELES 3,259,340
# 3 3 CHICAGO 3,009,530
# 4 4 HOUSTON 1,728,910
# 5 5 PHILADELPHIA 1,642,900
# 6 6 DETROIT 1,086,220
# 7 7 SAN DIEGO 1,015,190
# 8 8 DALLAS 1,003,520
# 9 9 SAN ANTONIO 914,350
# 10 10 PHOENIX 894,070
One possible last step would be to convert the third column to numeric:
out$V3 <- as.numeric(gsub(",", "", out$V3))
Expanding on @Hugh's answer I would try the following, although its not particularly efficient.
lines <- scan("cities.txt", sep="\n", what="character")
lines <- unlist(lapply(lines, function(x) {
gsub(pattern="(*[a-zA-Z]) ([a-zA-Z]+)", replacement="\\1-\\2", x)
}))
citiesDF <- data.frame(num = rep(0, length(lines)),
city = rep("", length(lines)),
population = rep(0, length(lines)),
stringsAsFactors=FALSE)
for (i in 1:length(lines)) {
splitted = strsplit(lines[i], " +")
citiesDF[i, "num"] <- as.numeric(splitted[[1]][1])
citiesDF[i, "city"] <- gsub("-", " ", splitted[[1]][2])
citiesDF[i, "population"] <- as.numeric(gsub(",", "", splitted[[1]][3]))
}