I have a directory with a few hundred txt files. I need to remove all duplicate lines from the existing files. Every line in the entire directory should be unique regardless of which file it is in, so each file needs to be checked against all the others. Is this possible to do without altering the existing file structure? The file names need to stay the same.
Let's say all the files are in a directory "foo" and the total size of the directory is 30 MB.
I think I can do this with comm or awk, but I haven't found a working command line and I'm unfamiliar with the syntax.
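If GNU awk 4.1 or newer is available, its inplace extension can do the whole job in one pass (a minimal sketch, assuming gawk and the foo directory from above):

gawk -i inplace '!seen[$0]++' foo/*.txt

Because the seen array lives in a single awk process, a line is kept only the first time it appears anywhere in the directory; each file is rewritten under its original name, and a file consisting entirely of duplicates is left empty.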
UPDATE
I have tried this line, which I believe prints all the duplicates to the shell, but it is not removing the duplicates from the files.
awk 'NR==FNR{a[$0]="";next}; !($0 in a)' tmp/*
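One portable way to make an approach like this actually rewrite the files (a sketch of my own, not part of the attempt above; the .new suffix is an arbitrary choice) is to send each file's surviving lines to a temporary copy and then move the copies back:

awk '
FNR == 1 {                      # entering a new input file
    if (out) close(out)         # avoid keeping hundreds of files open at once
    out = FILENAME ".new"
    printf "" > out             # create/truncate, so a file with no unique lines ends up empty
}
!seen[$0]++ { print > out }     # keep only the first occurrence of each line, across all files
' foo/*.txt
for f in foo/*.txt; do mv "$f.new" "$f"; done

The script below takes a different route and rewrites each file in place while it is being read.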
awk '{
    if (FNR == 1) {                     # first line of a new input file
        if (fs != lfn && NR != 1) {     # previous file produced no unique lines
            b[lfn]                      # remember it, to be emptied in END
        }
        lfn = FILENAME
    }
    if (!($0 in a)) {                   # first time this line is seen anywhere
        a[$0]
        print $0 > FILENAME             # write it back to the file being read
        fs = FILENAME                   # flag: this file has at least one unique line
    }
}
END {
    if (fs != lfn) {                    # check whether the last file had a unique line
        b[FILENAME]
    }
    for (i in b) {
        close(i)
        printf "" > i                   # truncate files that had no unique lines
    }
}' tmp/*
1st Condition:
if (!($0 in a)) {
    a[$0]
    print $0 > FILENAME
    fs = FILENAME
}
Check whether the current line $0 is already in array a: if not, add it to a and print it to the current file; otherwise ignore the line. The awk built-in variable FILENAME gives the name of the file being read. Whenever at least one unique line is found in the current file, the flag fs is set to FILENAME.
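A side note on the idiom (a throwaway example, not part of the script): referencing a[$0] as a bare statement creates the key with an empty value, while the expression $0 in a only tests membership and creates nothing.

awk 'BEGIN { a["x"]; x = ("x" in a); y = ("y" in a); print x, y }'    # prints: 1 0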
2nd Condition:
if (FNR == 1) {
    if (fs != lfn && NR != 1) {
        b[lfn]
    }
    lfn = FILENAME
}
So when the next file is read (FNR==1), fs (the last file in which a unique line was found) is compared with lfn (the last filename); if they differ, the previous file produced no unique lines, and an entry in array b with index lfn is created (so that the file can later be touched as an empty file).
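To see why FNR==1 marks a file boundary (a throwaway illustration; foo/a.txt and foo/b.txt are hypothetical names): FNR restarts at 1 for every input file, while NR keeps counting across all of them, so FNR==1 && NR!=1 holds exactly at the first line of every file after the first.

awk 'FNR == 1 { print "first line of", FILENAME, "(NR=" NR ")" }' foo/a.txt foo/b.txt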
3rd Condition:
END {
    if (fs != lfn) {
        b[FILENAME]
    }
    for (i in b) {
        close(i)
        printf "" > i
    }
}
In the END block, the check from condition 2 is applied once more to find out whether the last file had any unique lines. The loop over array b then truncates each file in which no unique lines were found, leaving it empty.
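The emptying relies on a small awk detail (shown here with a hypothetical file name): opening a file with > truncates it on the first open, so printing an empty string is enough to leave it empty.

awk 'BEGIN { printf "" > "dups_only.txt"; close("dups_only.txt") }'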
Here I have assumed there is no particular order in which the files are read.
This script is not optimal, but it will do the job.