I need to remove the lines in a file that are just prefixes of other lines, and keep only the unique (longest) ones.
From this:

```
abc/def/ghi/
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
123/456/
123/456/789/
xyz/
```

to this:

```
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
123/456/789/
xyz/
```
I'd appreciate any suggestions.
Answer, in case reordering the output is allowed:
```
sort -r file | awk 'a!~"^"$0{a=$0;print}'
```

- `sort -r file`: sort the lines in reverse; this way, longer lines are placed before shorter lines that share the same prefix.
- `awk 'a!~"^"$0{a=$0;print}'`: parse the sorted output, where `a` holds the previously printed line and `$0` holds the current line. `a!~"^"$0` checks, for each line, whether the current line is not a substring at the beginning of the previous line. If `$0` is not such a prefix (i.e. not a similar prefix), we `print` it and save the new string in `a` (to be compared with the next line). No value has been assigned to `a` before the first line, so `$0` cannot be found in it and the first line is always printed.
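On the sample input from the question, this should produce (the exact order may vary slightly with the locale used by `sort`):

```
xyz/
abc/def/ghi/jkl/two/
abc/def/ghi/jkl/one/
123/456/789/
```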
Step 1: This solution is based on the assumption that reordering the output is allowed. If so, it should be faster to reverse-sort the input file before processing: after the reverse sort, we only need to compare two consecutive lines in each loop, with no need to search the whole file or all the "known prefixes". I understand that a line should be removed if it is a prefix of any other line. Here is an example of removing prefixes from a file when reordering is allowed:
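A minimal sketch of such a script, assuming bash and an input file named `file`; the variable names `p` (previously kept line) and `s` (current line) match the explanation below:

```bash
#!/bin/bash
# Reverse-sort, then drop every line that is a prefix of the previously kept line.
sort -r file | while IFS= read -r s; do
    # If the first ${#s} characters of the previous kept line equal $s,
    # then $s is a prefix of it and can be skipped.
    if [ "${p:0:${#s}}" = "$s" ]; then
        continue
    fi
    printf '%s\n' "$s"
    p=$s
done
```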
Explanations: `${p:0:${#s}}` takes the first `${#s}` (the length of `s`) characters of string `p`.

Test: on the sample input from the question, the same four lines remain, in reverse-sorted order.
Step 2: If you really need to keep the original order, then this script is an example of removing all prefixes without reordering:
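A sketch of such a pipeline, assuming bash and an input file named `file`; '|' is used as an internal delimiter and should be changed if it can occur in the data (see the explanations below):

```bash
#!/bin/bash
cat -n file |                    # number all lines
  sed 's:\t:|:' |                # turn the tab after the number into the '|' delimiter
  sort -r -t'|' -k2 |            # reverse sort on the path field
  while IFS='|' read -r n s; do  # same prefix test as in step 1
      if [ "${p:0:${#s}}" = "$s" ]; then
          continue
      fi
      printf '%s|%s\n' "$n" "$s"
      p=$s
  done |
  sort -n -t'|' -k1 |            # restore the original order using the line numbers
  sed 's:^.*|::'                 # strip the numbering
```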
Explanations:

- `cat -n`: number all the lines
- `sed 's:\t:|:'`: use '|' as the delimiter (you need to change it to another character if needed)
- `sort -r -t'|' -k2`: reverse sort with delimiter '|' on key 2
- `while ... done`: similar to the solution of step 1
- `sort -n -t'|' -k1`: sort back to the original order (numeric sort on the line numbers)
- `sed 's:^.*|::'`: remove the numbering

Test: on the sample input from the question, this produces exactly the desired output, in the original order.
Notes: In both solutions, the most costly operations are the calls to `sort`. The solution in step 1 calls `sort` once, and the solution in step 2 calls `sort` twice. All other operations (`cat`, `sed`, `while`, string comparison, ...) are not at the same level of cost. In the solution of step 2, `cat + sed + while + sed` is "equivalent" to scanning the file 4 times (which, theoretically, can be executed in parallel thanks to the pipes).

The following awk does what is requested; it reads the file twice.
The idea is: the first pass stores every line; the second pass prints a line only if it is not a proper prefix of another line.
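A sketch of such a two-pass awk, assuming the input file is named `file` (one possible form, not necessarily the original answer's exact code):

```awk
# Pass 1 (NR==FNR): remember every line.
# Pass 2: print a line only if it is not a proper prefix of some other line.
awk 'NR == FNR { lines[$0]; next }
     { for (l in lines) if (l != $0 && index(l, $0) == 1) next }
     1' file file
```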
You can also do it by reading the file a single time, but then you store the whole file in memory:
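A sketch of such a single-pass variant (again an assumed form): all lines are kept in memory and the filtering happens in the `END` block, which also preserves the original order.

```awk
awk '{ lines[NR] = $0 }
     END {
         for (i = 1; i <= NR; i++) {
             keep = 1
             for (j = 1; j <= NR; j++)
                 # drop lines[i] if it is a proper prefix of some other line
                 if (i != j && lines[j] != lines[i] && index(lines[j], lines[i]) == 1) {
                     keep = 0; break
                 }
             if (keep) print lines[i]
         }
     }' file
```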
Similar to the solution of Allan, but using `grep -c`:
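A sketch of that construct, assuming the input file is named `file`: a line is kept only if exactly one line (itself) starts with it. Note that the line is used here as a regular expression, so regex metacharacters would need escaping in the general case.

```bash
while IFS= read -r line; do
    # count how many lines begin with $line; keep it if that count is 1
    [ "$(grep -c "^$line" file)" -eq 1 ] && printf '%s\n' "$line"
done < file
```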
Take into account that this construct reads the file (N+1) times, where N is the number of lines.
A quick and dirty way of doing it is the following:
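One possible form of it (an assumed sketch; it relies on the lines containing no whitespace and no regex metacharacters, and assumes the input file is named `file`):

```bash
while IFS= read -r line; do
    # print each line followed by the number of lines it matches in the file
    printf '%s %s\n' "$line" "$(grep -c "$line" file)"
done < file | awk '$2 == 1 { print $1 }'
```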
where you read the input file and print each element together with the number of times it appears in the file, then with awk you print only the lines that appear exactly once.