Is there any way in Linux, using C, to generate a diff/patch of two files stored in memory, in a common format (i.e. unified diff, as produced by the command-line diff utility)?
I'm working on a system where I generate two text files in memory, and no external storage is available or desired. I need to create a line-by-line diff of the two files, and since they are mmap'ed, they don't have file names, which prevents me from simply calling system("diff file1.txt file2.txt").
I have file descriptors (fds) available for use, and that's my only entry point to the data. Is there any way to generate a diff/patch by comparing the two open files? If the implementation is MIT/BSD licensed (i.e. non-GPL), so much the better.
Thank you.
Considering the requirements, the best option would be to implement your own in-memory diff -au. You could perhaps adapt the relevant parts of OpenBSD's diff to your needs.
Here's an outline of how you can use the /usr/bin/diff command via pipes to obtain the unified diff between two strings stored in memory:
Create three pipes: I1, I2, and O.
Fork a child process.
In the child process:
Move the read ends of pipes I1 and I2 to descriptors 3 and 4, and the write end of pipe O to descriptor 1.
Close the other ends of those pipes in the child process. Open descriptor 0 for reading from /dev/null, and descriptor 2 for writing to /dev/null.
Execute execl("/usr/bin/diff", "diff", "-au", "/proc/self/fd/3", "/proc/self/fd/4", (char *)NULL);
This executes the diff binary in the child process. It will read the inputs from the two pipes, I1 and I2, and output the differences to pipe O.
The parent process closes the read ends of the I1 and I2 pipes, and the write end of the O pipe.
The parent process writes the comparison data to the write ends of I1 and I2 pipes, and reads the differences from the read end of the O pipe.
Note that the parent process must use select() or poll() or a similar method (preferably with nonblocking descriptors) to avoid deadlock. (Deadlock occurs if both parent and child try to read at the same time, or write at the same time: for example, the parent blocks writing to a full input pipe while the child blocks writing to a full O pipe.) In general, the parent process should never block on a single descriptor, because that is likely to lead to a deadlock.
When the input data has been completely written, the parent process must close the respective write end of the pipe, so that the child process detects the end-of-input. (Unless an error occurs, the write ends must be closed before the child process closes its end of the O pipe.)
When the parent process notices that no more data is available in the O pipe (read() returning 0), either it has already closed the write ends of the I1 and I2 pipes, or there was an error. If there is no error, the data transfer is complete, and the child process can be reaped.
The parent process reaps the child using e.g. waitpid(). Note that if there were any differences, diff returns with exit status 1.
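The steps above can be sketched roughly as follows. This is only a minimal illustration, not a hardened implementation: diff_buffers() and close_all() are hypothetical names, the child passes the pipes' actual descriptor numbers to diff instead of moving them to 3 and 4 (which sidesteps dup2() collisions), descriptors 0 and 2 are left as inherited, and descriptors 0 to 2 are assumed to be open in the parent. SIGPIPE is ignored in the parent for the duration of the call so that an early exit of diff shows up as EPIPE.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static void close_all(int *fd, int n)
{
    for (int k = 0; k < n; k++)
        if (fd[k] >= 0) { close(fd[k]); fd[k] = -1; }
}

/* Hypothetical helper: returns the "diff -au" output for two memory buffers
 * as a malloc()ed string, or NULL on failure. If status is not NULL, it
 * receives the wait status of the diff child. */
char *diff_buffers(const char *a, size_t alen, const char *b, size_t blen,
                   int *status)
{
    /* fd[0],fd[1] = pipe I1; fd[2],fd[3] = pipe I2; fd[4],fd[5] = pipe O */
    int fd[6] = { -1, -1, -1, -1, -1, -1 };
    if (pipe(fd) || pipe(fd + 2) || pipe(fd + 4)) { close_all(fd, 6); return NULL; }

    pid_t pid = fork();
    if (pid == -1) { close_all(fd, 6); return NULL; }

    if (pid == 0) {                      /* child: becomes diff */
        char p1[32], p2[32];
        snprintf(p1, sizeof p1, "/proc/self/fd/%d", fd[0]);
        snprintf(p2, sizeof p2, "/proc/self/fd/%d", fd[2]);
        close(fd[1]); close(fd[3]); close(fd[4]);
        dup2(fd[5], STDOUT_FILENO);
        close(fd[5]);
        execl("/usr/bin/diff", "diff", "-au", p1, p2, (char *)NULL);
        _exit(127);                      /* exec failed */
    }

    /* parent: keep the write ends of I1/I2 and the read end of O */
    void (*old)(int) = signal(SIGPIPE, SIG_IGN);
    close(fd[0]); close(fd[2]); close(fd[5]);
    fcntl(fd[1], F_SETFL, O_NONBLOCK);
    fcntl(fd[3], F_SETFL, O_NONBLOCK);

    const char *src[2] = { a, b };
    size_t left[2] = { alen, blen }, used = 0, cap = 4096;
    struct pollfd p[3] = {
        { fd[1], POLLOUT, 0 }, { fd[3], POLLOUT, 0 }, { fd[4], POLLIN, 0 }
    };
    char *out = malloc(cap);
    int ok = (out != NULL);

    while (ok && p[2].fd >= 0) {
        for (int k = 0; k < 2; k++)      /* input fully written: signal EOF */
            if (p[k].fd >= 0 && left[k] == 0) { close(p[k].fd); p[k].fd = -1; }
        if (poll(p, 3, -1) == -1) {
            if (errno == EINTR) continue;
            ok = 0;
            break;
        }
        for (int k = 0; k < 2; k++) {
            if (p[k].fd < 0 || !(p[k].revents & (POLLOUT | POLLERR))) continue;
            ssize_t n = write(p[k].fd, src[k], left[k]);
            if (n > 0) { src[k] += n; left[k] -= (size_t)n; }
            else if (n == -1 && errno != EAGAIN && errno != EINTR)
                left[k] = 0;             /* e.g. EPIPE: give up on this input */
        }
        if (p[2].revents & (POLLIN | POLLHUP | POLLERR)) {
            if (cap - used < 2) {        /* grow the output buffer */
                char *tmp = realloc(out, cap * 2);
                if (!tmp) { ok = 0; break; }
                out = tmp;
                cap *= 2;
            }
            ssize_t n = read(p[2].fd, out + used, cap - used - 1);
            if (n > 0) used += (size_t)n;
            else if (n == 0 || errno != EINTR) { close(p[2].fd); p[2].fd = -1; }
        }
    }
    for (int k = 0; k < 3; k++)
        if (p[k].fd >= 0) close(p[k].fd);
    while (waitpid(pid, status, 0) == -1 && errno == EINTR)
        ;
    signal(SIGPIPE, old);
    if (!ok) { free(out); return NULL; }
    out[used] = '\0';
    return out;
}
```

A caller would check WIFEXITED() and WEXITSTATUS() on the returned status (1 meaning the inputs differ) and free() the returned string when done.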
You can use a fourth pipe to receive the standard error stream from the child process; diff does not normally output anything to standard error.
You can use a fifth pipe, with its write end marked close-on-exec in the child (FD_CLOEXEC set with fcntl(), or O_CLOEXEC if the pipe is created with pipe2()), to detect execl() errors. The close-on-exec flag means the descriptor is closed when executing another binary, so the parent process can detect that the diff command started successfully by detecting end-of-data on the read end (read() returning 0). If execl() fails, the child can e.g. write the errno value (as a decimal number, or as an int) to this pipe, so that the parent process can read the exact cause of the failure.
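A minimal sketch of that exec-error pipe, on its own and without the data pipes, might look like this. The name spawn_checked() is hypothetical, and the errno value is passed as a raw int:

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork and exec argv[0]. Returns the child pid and
 * stores 0 in *exec_errno if the exec succeeded, or the child's errno if it
 * failed. Returns -1 if the pipe or the fork itself failed. The caller is
 * still responsible for reaping the child with waitpid(). */
pid_t spawn_checked(char *const argv[], int *exec_errno)
{
    int ep[2];
    if (pipe(ep) == -1) return -1;

    pid_t pid = fork();
    if (pid == -1) { close(ep[0]); close(ep[1]); return -1; }

    if (pid == 0) {
        close(ep[0]);
        /* Close-on-exec: a successful exec closes the write end for us. */
        fcntl(ep[1], F_SETFD, FD_CLOEXEC);
        execv(argv[0], argv);
        int err = errno;                 /* only reached if exec failed */
        ssize_t w = write(ep[1], &err, sizeof err);
        (void)w;
        _exit(127);
    }

    close(ep[1]);                        /* parent keeps only the read end */
    int err = 0;
    ssize_t n;
    do {
        n = read(ep[0], &err, sizeof err);
    } while (n == -1 && errno == EINTR);
    close(ep[0]);

    /* End-of-data (n == 0) means the exec succeeded. */
    *exec_errno = (n == (ssize_t)sizeof err) ? err : 0;
    return pid;
}
```

The parent blocks in read() only until the child either execs or reports its failure, so this adds no deadlock risk of its own.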
In all, the complete method (that both records standard error, and detects exec errors) uses 10 descriptors. This should not be an issue in a normal application, but may be important -- for example, consider an internet-facing server with descriptors used by incoming connections.
On Linux you can use the /dev/fd/ pseudo filesystem (a symbolic link to /proc/self/fd). Use snprintf() to construct the path for each file descriptor, e.g. snprintf(path1, PATH_MAX, "/dev/fd/%d", fd1); and likewise for fd2, then run diff on the two paths.