In R, find whether two files differ

Posted 2020-08-15 12:04

Question:

I would like a pure R way to test whether two arbitrary files are different. So, the equivalent to diff -q in Unix, but should work on Windows and without external dependencies.

I'm aware of tools::Rdiff, but it seems to only want to deal with R output files and complains loudly if I feed it something else.

Answer 1:

To avoid reading the files into R's memory, which matters if they are large, compare their MD5 checksums:

library(tools)
md5sum("file_1.txt") == md5sum("file_2.txt")
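
As a minimal sketch, the checksum comparison can be wrapped in a small helper that returns a single unnamed TRUE/FALSE. The function name files_differ and the file-size shortcut are assumptions added for illustration, not part of the original answer. Equal MD5 sums do not strictly prove equal contents, although an accidental collision is extremely unlikely.

```r
library(tools)

# Hypothetical helper: TRUE if the two files have different contents.
files_differ <- function(path_a, path_b) {
  stopifnot(file.exists(path_a), file.exists(path_b))
  # Files of different sizes cannot be identical, so hashing can be skipped.
  if (file.size(path_a) != file.size(path_b)) return(TRUE)
  # md5sum() returns a character vector named by file path;
  # drop the names so the result is a plain logical value.
  unname(md5sum(path_a)) != unname(md5sum(path_b))
}

# Usage with temporary files
f1 <- tempfile(); f2 <- tempfile(); f3 <- tempfile()
writeLines(c("x", "y"), f1)
writeLines(c("x", "y"), f2)
writeLines(c("x", "z"), f3)
files_differ(f1, f2)
files_differ(f1, f3)
```

This uses only base R and the tools package that ships with R, so it should behave the same on Windows and Unix.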


Answer 2:

I realize this is not exactly what you're asking for, but I'm posting it for the benefit of others who land on this question wanting to see the full diff and who are willing to tolerate external dependencies. In that case, diffobj will show you the differences with a real diff that works on Windows, using the same algorithm as GNU diff. In this example, we compare the text of Moby Dick to a version of it with 5 lines modified:

library(diffobj)
diffFile(mob.1.txt, mob.2.txt)   # or `diffChr` if your data is already in R

If you want something faster while still getting the locations of the differences, you can get the shortest edit script from the same package:

ses(readLines(mob.1.txt), readLines(mob.2.txt))
# [1] "1127c1127"   "2435c2435"   "6417c6417"   "13919c13919"

Code to get the Moby Dick data (note that I didn't set a seed, so you'll get different lines):

moby.dick.url <- 'http://www.gutenberg.org/files/2701/2701-0.txt'
moby.dick.raw <- moby.dick.UC <- readLines(moby.dick.url)
to.UC <- sample(length(moby.dick.raw), 5)
moby.dick.UC[to.UC] <- toupper(moby.dick.UC[to.UC])

mob.1.txt <- tempfile()
mob.2.txt <- tempfile()

writeLines(moby.dick.raw, mob.1.txt)
writeLines(moby.dick.UC, mob.2.txt)


Answer 3:

The closest to the Unix command is diffr: it shows a nice side-by-side view with all the differing lines marked in color.

library(diffr)
diffr(filename1, filename2)



Answer 4:

Example solution, using the all.equal utility (documentation: https://stat.ethz.ch/R-manual/R-devel/library/base/html/all.equal.html):

filenameForA <- "my_file_A.txt"
filenameForB <- "my_file_B.txt"
all.equal(readLines(filenameForA), readLines(filenameForB))

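Note that

readLines(filename)

reads all the lines from the file specified by filename. all.equal then returns TRUE if the two sets of lines match, and otherwise a character vector describing the differences, so wrap the call in isTRUE() if you need a plain TRUE/FALSE.

Make sure to read the documentation linked above to understand it fully. Admittedly, if the files are very large, this might not be the best option, since both files are read into memory in full.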
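
A minimal sketch of turning the all.equal result into a single logical value follows. The helper name text_files_equal is an assumption for illustration. Because readLines accepts LF, CRLF, or CR as line endings, this compares the files as text: two files that differ only in their line endings are treated as equal, which is not the same as a byte-for-byte comparison.

```r
# Hypothetical helper: TRUE if both files contain the same lines of text.
text_files_equal <- function(path_a, path_b) {
  # all.equal() returns TRUE on a match, otherwise a character vector
  # describing the mismatch; isTRUE() collapses that to TRUE or FALSE.
  isTRUE(all.equal(readLines(path_a), readLines(path_b)))
}

# Usage with temporary files
fa <- tempfile(); fb <- tempfile(); fc <- tempfile()
writeLines(c("first line", "second line"), fa)
writeLines(c("first line", "second line"), fb)
writeLines(c("first line", "SECOND LINE"), fc)
text_files_equal(fa, fb)
text_files_equal(fa, fc)
```

If only an exact yes/no answer is needed, identical(readLines(path_a), readLines(path_b)) is another option that gives the same kind of result without the descriptive message.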



Answer 5:

all.equal(readLines(f1), readLines(f2))


Tags: r diff