I have a file example.txt with about 3000 lines, each containing a string. A small example file would be:
>cat example.txt
saudifh
sometestPOIFJEJ
sometextASLKJND
saudifh
sometextASLKJND
IHFEW
foo
bar
I want to check all repeated lines in this file and output them. The desired output would be:
>checkRepetitions.sh
found two equal lines: index1=1 , index2=4 , value=saudifh
found two equal lines: index1=3 , index2=5 , value=sometextASLKJND
I made a script checkRepetitions.sh:
#!/bin/bash
size=$(cat example.txt | wc -l)
for i in $(seq 1 $size); do
    i_next=$((i+1))
    line1=$(cat example.txt | head -n$i | tail -n1)
    for j in $(seq $i_next $size); do
        line2=$(cat example.txt | head -n$j | tail -n1)
        if [ "$line1" = "$line2" ]; then
            echo "found two equal lines: index1=$i , index2=$j , value=$line1"
        fi
    done
done
However, this script is very slow; it takes more than 10 minutes to run, whereas in Python it takes less than 5 seconds. I tried to store the file in memory by doing lines=$(cat example.txt) and then line1=$(cat $lines | cut -d',' -f$i), but this is still very slow...
Here is a relatively efficient (within the limits of the language and runtime) native-bash approach, which you can see running in an online interpreter at https://ideone.com/iFpJr7:
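The following is a sketch of that approach (it assumes bash 4+ for associative arrays, an input file named example.txt, and no blank lines; the exact code behind the link may differ in detail):

#!/usr/bin/env bash
declare -A first_seen=()    # maps line content to the line number where it first appeared
lineNum=0
while IFS= read -r line; do
    lineNum=$((lineNum + 1))
    if [[ ${first_seen[$line]} ]]; then
        # this content was already seen earlier: report the pair of indices
        printf 'found two equal lines: index1=%s , index2=%s , value=%s\n' \
            "${first_seen[$line]}" "$lineNum" "$line"
    else
        first_seen[$line]=$lineNum
    fi
done < example.txt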
Note the use of while read to iterate line-by-line, as described in BashFAQ #1: How can I read a file line-by-line (or field-by-field)?; this permits us to open the file only once and read through it without needing any command substitutions (which fork off subshells) or external commands (which need to be individually started up by the operating system every time they're invoked, and are likewise expensive). The other part of the improvement here is that we read the whole file only once -- implementing an O(n) algorithm -- as opposed to running O(n^2) comparisons as the original code did.
When you do not want to use awk (a good tool for this job, since it parses the input only once), you can run through the lines several times. Sorting is expensive, but this solution avoids the loops you tried. With uniq -d <(sort example.txt) you find all lines that occur more than once. Next, grep will search for these lines (option -f), matching complete lines (-x) without regular expressions (-F), and show the line numbers where they occur (-n).
See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the reasons why your script is so slow.
To give you an idea of the performance difference between a bash script that's written to be as efficient as possible and an equivalent awk script:
bash:
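A sketch of such a bash script (the same single-pass, associative-array technique and caveats as the sketch above, packaged to take the input file as its first argument; the script name is illustrative):

#!/usr/bin/env bash
# usage: ./finddups.sh inputfile   (requires bash 4+ for associative arrays)
declare -A lineNums=()
lineNum=0
while IFS= read -r line; do
    lineNum=$((lineNum + 1))
    if [[ ${lineNums[$line]} ]]; then
        printf 'found two equal lines: index1=%s , index2=%s , value=%s\n' \
            "${lineNums[$line]}" "$lineNum" "$line"
    else
        lineNums[$line]=$lineNum
    fi
done < "$1"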
awk:
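An equivalent awk program might look like this sketch, which applies the same single-pass idea (remembering the first line number at which each distinct line was seen):

awk '
    $0 in firstSeen {    # this content was seen on an earlier line: report the pair
        printf "found two equal lines: index1=%s , index2=%s , value=%s\n", firstSeen[$0], NR, $0
        next
    }
    { firstSeen[$0] = NR }    # remember the line number of the first occurrence
' example.txt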
There are no differences between the output of the two scripts.
The above uses 3rd-run timing to avoid caching issues and was tested against a file generated by an awk script.
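A generator along these lines produces a comparable duplicate-heavy input (the file name dups100k, the line count, and the duplicate ratio are illustrative assumptions, not the original test setup):

awk 'BEGIN{ for (i=1; i<=100000; i++) print i % 1000 }' > dups100k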
When the input file had zero duplicate lines (generated by seq 100000 > nodups100k), the bash script executed in about the same amount of time as it did above, while the awk script executed much faster than it did above.