Removing “NUL” characters (within R)

2019-02-18 05:08发布

I've got a strange text file with a bunch of NUL characters in it (actually about 10 such files), and I'd like to programmatically replace them from within R. Here is a link to one of the files. With the aid of this question I've finally figured out a better-than-ad-hoc way of going into each file and find-and-replacing the nuisance characters. It turns out that each pair of them should correspond to one space ([NUL][NUL]->) to maintain the intended line width of the file (which is crucial for reading these as fixed-width further down the road).

However, for robustness' sake, I prefer a more automable approach to the solution, ideally (for organization's sake) something I could add at the beginning of an R script I'm writing to clean up the files. This question looked promising but the accepted answer is insufficient - readLines throws an error whenever I try to use it on these files (unless I activate skipNul).

Is there any way to get the lines of this file into R so I could use gsub or whatever else to fix this issue without resorting to external programs?

标签: r string nul
1条回答
We Are One
2楼-- · 2019-02-18 05:36

You want to read the file as binary then you can substitute the NULs, e.g. to replace them by spaces:

r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(0x20) ## replace with 0x20 = <space>
writeBin(r, "00staff.txt")
str(readLines("00staff.txt"))
#  chr [1:155432] "000540952Anderson            Shelley J       FW1949     2000R000000000000119460007620            3  0007000704002097907KGKG1616"| __truncated__ ...

You could also substitute the NULs with a really rare character (such as "\01") and work on the string in place, e.g., let's say if you want to replace two NULs ("\00\00") with one space:

r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(1)
a = gsub("\01\01", " ", rawToChar(r), fixed=TRUE)
s = strsplit(a, "\n", TRUE)[[1]]
str(s)
# chr [1:155432] "000540952Anderson            Shelley J       FW1949     2000R000000000000119460007620            3  0007000704002097907KGKG1616"| __truncated__
查看更多
登录 后发表回答