Reading in a text file with a SUB (0x1A, Control-Z) character

Posted 2019-01-22 22:42

Following on from my query last week, "reading badly formed csv in R - mismatched quotes", these same CSV files also have embedded control characters such as the ASCII substitute character (SUB), which is decimal 26 or 0x1A. Unfortunately, readLines() seems to truncate the line at this character, so I lose the later fields in these lines and have difficulty matching quotes.

I have tried readBin(), but I can't get it to read this file. I can't cleanly read the file into R to give you an example, and I'm having difficulty creating one in R, so apologies for the lack of a clean example. Thoughts?

Update

Now I'm confused - when I use the code

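 k1 <- 72:75  # assumed value (not given in the post): ASCII codes for "HIJK", inferred from the output in Update 2
 h3 <- paste('1,34,44.4,"', rawToChar(as.raw(c(as.integer(k1), 26, 65))), '",99')
 identical(readLines(textConnection(h3)), h3)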

I get TRUE, which I find quite surprising!

Update 2

> h3
[1] "1,34,44.4,\" HIJK\032A \",99"
> writeLines(h3, 'h3.txt')
> h3a <- readLines('h3.txt')
Warning message:
In readLines("h3.txt") : incomplete final line found on 'h3.txt'
> h3a
[1] "1,34,44.4,\" HIJK"

So readLines() behaves differently when reading from a file than when reading from a textConnection(): on the file, it silently truncates the line at the SUB character.

I would be surprised if it makes a difference, but I'm on R 2.15.2 on 64-bit Windows.

Update 3

Some vague success in solving this...

zb <- file('h3.txt', "rb")
tmp <- readBin(zb, raw(), size=1, n=400) # raw is always of size =1
nchar(tmp)
# [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
close(zb)
tmp
# [1] 31 2c 33 34 2c 34 34 2e 34 2c 22 20 48 49 4a 4b 1a 41 20 22 2c 39 39 0d 0a
rawToChar(tmp)
# [1] "1,34,44.4,\" HIJK\032A \",99\r\n"

That is, if I read the file in as binary and convert it to character afterwards, it seems to work, but this will be tedious for large CSV files.
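Before committing to a binary read for every file, it may help to check which files actually contain a SUB byte. The following is a minimal sketch, not part of the original post: `find_sub` is a hypothetical helper name, and the demo file is written with writeBin() so that its bytes are the same on any platform.

```r
# Hypothetical helper: return the 1-based byte offsets of every SUB (0x1a)
# character in a file, reading the file as raw bytes.
find_sub <- function(path) {
  size <- file.info(path)$size
  bytes <- readBin(path, what = "raw", n = size)
  which(bytes == as.raw(0x1a))
}

# Build a small demo file with exact bytes (same content as h3.txt above).
demo <- tempfile(fileext = ".txt")
writeBin(c(charToRaw('1,34,44.4," HIJK'), as.raw(0x1a), charToRaw('A ",99\r\n')), demo)

sub_pos <- find_sub(demo)
sub_pos
```

An empty result means the file has no SUB bytes and can presumably be read in the usual way.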

Could there be a bug in R that incorrectly detects a Control-Z as end of file on Windows?

2 answers
可以哭但决不认输i
Reply #2 · 2019-01-22 23:00

I think I've figured out a solution. Because there appears to be a problem reading a Control-Z in the middle of a file on Windows, we need to read the file in binary (raw) mode.

fnam <- 'h3.txt'
tmp.bin <- readBin(fnam, raw(), size=1, n=max(2*file.info(fnam)$size, 100))
tmp.char <- rawToChar(tmp.bin)
txt <- unlist(strsplit(tmp.char, '\r\n', fixed=TRUE))
txt

[1] "1,34,44.4,\" HIJK\032A \",99"
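The snippet above splits only on '\r\n', so a file with Unix line endings would come back as a single string. A slightly more general sketch is shown below. `read_lines_raw` is a hypothetical helper name, and the code assumes the file has no embedded NUL bytes (rawToChar() rejects them) and that its text is valid in the current locale.

```r
# Hypothetical helper: read a whole file as raw bytes and split it into lines,
# accepting either CRLF or LF line endings.
read_lines_raw <- function(path) {
  size <- file.info(path)$size
  txt <- rawToChar(readBin(path, what = "raw", n = size))
  # "\r?\n" is a regular expression: an optional CR followed by LF.
  strsplit(txt, "\r?\n")[[1]]
}

# Demo file with one CRLF-terminated and one LF-terminated line.
demo <- tempfile(fileext = ".txt")
writeBin(c(charToRaw('a,"x'), as.raw(0x1a), charToRaw('y"\r\nb,"z"\n')), demo)

lines_out <- read_lines_raw(demo)
lines_out
```

Reading exactly `file.info(path)$size` bytes is enough here; the doubling used above is a safety margin rather than a requirement.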

Update: The following better answer was posted by Duncan Murdoch to R-Devel. Converting it into a function, I get:

sReadLines <- function(fnam) {
    f <- file(fnam, "rb")
    res <- readLines(f)
    close(f)
    res
}
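To get from those lines to a data frame, one option is to pass them to read.csv() through its `text` argument. This is a sketch under stated assumptions rather than part of Duncan Murdoch's answer: `read_lines_binary` is a hypothetical variant of the function above that uses on.exit() so the connection is closed even if readLines() fails, and dropping the SUB characters before parsing assumes they carry no meaning in the data.

```r
# Hypothetical variant of sReadLines(): binary-mode connection, always closed.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, warn = FALSE)
}

# Demo CSV with a header and a SUB byte inside a quoted field.
demo <- tempfile(fileext = ".csv")
writeBin(c(charToRaw('id,label\r\n1," HIJK'), as.raw(0x1a),
           charToRaw('A "\r\n2,"plain"\r\n')), demo)

csv_lines <- read_lines_binary(demo)
# Optional step (assumption: SUB is junk): strip it before parsing.
csv_lines <- gsub("\032", "", csv_lines, fixed = TRUE)
dat <- read.csv(text = csv_lines, stringsAsFactors = FALSE)
dat
```

Passing the lines through `text =` avoids handing the binary-mode connection itself to read.csv().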
叼着烟拽天下
Reply #3 · 2019-01-22 23:04

I also ran into this problem when I used read.csv on a CSV file that contained a SUB (Ctrl-Z) character in the middle of the file.

I solved it with the readr package (if your file is comma-separated):

library(readr)
read_csv("h3.txt")

If your separator is a semicolon (;), then use:

library(readr)
read_csv2("h3.txt")