Why doesn't “sort file1 > file1” work?

Posted 2019-01-12 05:32

Question:

When I try to sort a file and save the sorted output back to the same file, like this:

sort file1 > file1;

the contents of file1 are erased altogether, whereas when I try the same thing with the tee command, like this:

sort file1 | tee file1;

it works fine [ed: "works fine" only for small files with lucky timing; it will lose data on large files or with unlucky process scheduling], i.e. it overwrites file1 with the sorted output and also shows it on standard output.

Can someone explain why the first case is not working?

Answer 1:

It doesn't work because '>' redirection implies truncation. To avoid having to keep the whole output of sort in memory before redirecting it to the file, bash truncates the file and sets up the redirection before running sort. Thus the contents of file1 are gone before sort has a chance to read them.
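To see the truncation for yourself without risking a real file, here is a minimal sketch. The scratch directory and the sample words are just assumptions for the demo:

```shell
# Work in a scratch directory so nothing real is touched.
demo_dir=$(mktemp -d)
printf 'pear\napple\nmango\n' > "$demo_dir/file1"

before=$(wc -l < "$demo_dir/file1")

# The shell opens file1 for writing (truncating it) first, and only
# then starts sort, which finds an empty input file.
sort "$demo_dir/file1" > "$demo_dir/file1"

after=$(wc -l < "$demo_dir/file1")
echo "lines before: $before, lines after: $after"
```

The file still exists afterwards, it is just empty.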



Answer 2:

As other people explained, the problem is that the I/O redirection is done before the sort command is executed, so the file is truncated before sort gets a chance to read it. If you think for a bit, the reason why is obvious - the shell handles the I/O redirection, and must do that before running the command.

The sort command has 'always' (since at least Version 7 UNIX) supported a -o option to make it safe to output to one of the input files:

sort -o file1 file1 file2 file3

The trick with tee depends on timing and luck (and probably a small data file). If you had a megabyte or larger file, I expect it would be clobbered, at least in part, by the tee command. That is, if the file is large enough, the tee command would open the file for output and truncate it before sort finished reading it.
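As a rough sketch of the -o approach, including a plain cat afterwards to get the "also show it on standard output" part that the tee pipeline was used for (the scratch directory and sample data are made up for the demo):

```shell
# Scratch directory with two small input files.
demo_dir=$(mktemp -d)
printf 'pear\napple\n' > "$demo_dir/file1"
printf 'mango\nbanana\n' > "$demo_dir/file2"

# With -o, the output file is allowed to be one of the input files.
sort -o "$demo_dir/file1" "$demo_dir/file1" "$demo_dir/file2"

# Show the result on standard output as a separate, safe step.
cat "$demo_dir/file1"
```

Because the display is a separate step after sort has finished, nothing here depends on timing.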



Answer 3:

It's unwise to depend on either of these commands to work the way you expect.

The way to modify a file in place is to write the modified version to a new file, then rename the new file to the original name:

sort file1 > file1.tmp && mv file1.tmp file1

This avoids the problem of reading the file after it's been partially modified, which is likely to mess up the results. It also makes it possible to deal gracefully with errors; if the file is N bytes long, and you only have N/2 bytes of space available on the file system, you can detect the failure creating the temporary file and not do the rename.
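A slightly more defensive sketch of the same write-then-rename idea, assuming mktemp is available. The variable names and sample data are mine, not part of the original one-liner:

```shell
# Scratch directory and sample file for the demo.
workdir=$(mktemp -d)
target="$workdir/file1"
printf 'pear\napple\nmango\n' > "$target"

# Create the temporary file next to the target, so the final mv is a
# rename within the same file system rather than a copy.
tmp=$(mktemp "$target.XXXXXX")

if sort "$target" > "$tmp"; then
    # Only replace the original once the sorted copy is complete.
    mv "$tmp" "$target"
else
    # On failure (e.g. disk full), discard the partial output.
    rm -f "$tmp"
    echo "sort failed; $target left untouched" >&2
fi
```

One thing to be aware of: the file created by mktemp has its own permissions and ownership, so after the rename file1 may not have the same mode as before.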

Or you can rename the original file, then read it and write to a new file with the same name:

mv file1 file1.bak && sort file1.bak > file1

Some commands have options to modify files in place; for example, perl and sed both have -i options (note that the syntax of sed's -i option can vary). But these options work by creating temporary files; it's just done internally.
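One way to observe the "temporary file internally" behaviour is to compare inode numbers before and after. This is only a sketch: it assumes a sed that accepts -i with an attached backup suffix (GNU and BSD sed both do, as far as I know), and the file contents are made up:

```shell
demo_dir=$(mktemp -d)
printf 'one\ntwo\n' > "$demo_dir/file1"

# Record which inode the name file1 points at before editing.
inode_before=$(ls -i "$demo_dir/file1" | awk '{print $1}')

# -i.bak (suffix attached, no space) edits "in place" and keeps a backup.
sed -i.bak 's/one/1/' "$demo_dir/file1"

inode_after=$(ls -i "$demo_dir/file1" | awk '{print $1}')
echo "inode before: $inode_before, inode after: $inode_after"
```

If the two numbers differ, file1 is now a different file that was written separately and renamed into place, which is the same technique as the explicit temporary file above.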



Answer 4:

Bash opens the output file (creating it empty, or truncating it) when it processes the redirection, and only then calls sort.

In the second case, tee may happen to open the file after sort has already read the contents, but that ordering is not guaranteed.



Answer 5:

Redirection is handled first: the shell sets it up before the command runs. So in the first case, > file1 takes effect first and empties the file.
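A quick way to convince yourself that the emptying is done by the shell and not by the command: redirect the output of a command that cannot run at all. In this sketch no_such_command_xyz is a made-up name that is assumed not to exist on your PATH:

```shell
demo_dir=$(mktemp -d)
printf 'some data\n' > "$demo_dir/file1"

# The shell still performs the "> file1" redirection even though the
# command can never run; "|| true" just ignores the resulting error.
no_such_command_xyz > "$demo_dir/file1" 2>/dev/null || true

if [ -s "$demo_dir/file1" ]; then
    echo "file1 still has data"
else
    echo "file1 is empty"
fi
```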



Answer 6:

The first command (sort file1 > file1) doesn't work because, with a redirection operator (> or >>), the shell creates the file (and, for >, truncates it) before the sort command is even invoked; redirections are set up before the command runs.

The second command (sort file1 | tee file1) appears to work when sort happens to read the whole file before tee opens it for writing. That ordering is not guaranteed, so data can still be lost.

So with any similar command, avoid using a redirection operator to read from and write to the same file; use a suitable in-place editor instead (e.g. ex, ed, sed), for example:

ex '+%!sort' -cwq file1

Alternatively, use a utility such as sponge (from moreutils).
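If sponge isn't installed, the idea behind it (soak up all input first, write the file afterwards) can be approximated with a small shell function. This is a rough sketch: soak is a hypothetical name of my own, not a standard utility, and it doesn't try to match everything sponge does:

```shell
# soak FILE: read all of standard input, then overwrite FILE with it.
# "soak" is a hypothetical helper name used for this sketch only.
soak() {
    soak_tmp=$(mktemp) || return 1
    # Buffer stdin to end-of-file first. Only after that is FILE
    # opened for writing, so the upstream command is done reading it.
    cat > "$soak_tmp" && cat "$soak_tmp" > "$1"
    soak_status=$?
    rm -f "$soak_tmp"
    return "$soak_status"
}

# Demo on a scratch file.
demo_dir=$(mktemp -d)
printf 'pear\napple\nmango\n' > "$demo_dir/file1"
sort "$demo_dir/file1" | soak "$demo_dir/file1"
cat "$demo_dir/file1"
```

Unlike the tee pipeline, the target is not opened until the whole input has been received, so the result does not depend on scheduling.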

Luckily, sort has the -o parameter, which writes the results to the given file (as suggested by @Jonathan), so the solution is straightforward: sort -o file1 file1.



Answer 7:

You can use this method:

sort file1 -o file1

This sorts the file and stores the result back in the original file. You can also use this command to remove duplicate lines:

sort -u file1 -o file1