How to efficiently loop through the lines of a file

Posted 2019-08-05 06:14

I have a file example.txt with about 3000 lines, each containing a string. A small example file would be:

>cat example.txt
saudifh
sometestPOIFJEJ
sometextASLKJND
saudifh
sometextASLKJND
IHFEW
foo
bar

I want to check all repeated lines in this file and output them. The desired output would be:

>checkRepetitions.sh
found two equal lines: index1=1 , index2=4 , value=saudifh
found two equal lines: index1=3 , index2=5 , value=sometextASLKJND

I made a script checkRepetitions.sh:

#!/bin/bash
size=$(cat example.txt | wc -l)
for i in $(seq 1 $size); do
  i_next=$((i+1))
  line1=$(cat example.txt | head -n$i | tail -n1)
  for j in $(seq $i_next $size); do
    line2=$(cat example.txt | head -n$j | tail -n1)
    if [ "$line1" = "$line2" ]; then
      echo "found two equal lines: index1=$i , index2=$j , value=$line1"
    fi
  done
done

However, this script is very slow: it takes more than 10 minutes to run, whereas in Python it takes less than 5 seconds. I tried to keep the file in memory by doing lines=$(cat example.txt) and then line1=$(cat $lines | cut -d',' -f$i), but this is still very slow.
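For reference, here is a minimal sketch (not the original script; it assumes bash 4 or newer for mapfile) of what keeping the file in memory could look like: the lines are read into an array once, so the nested comparison no longer forks head and tail for every lookup, although it is still O(n^2):

#!/bin/bash
# hypothetical in-memory variant of checkRepetitions.sh (requires bash 4+ for mapfile)
mapfile -t lines < example.txt            # read every line into an array once

for ((i = 0; i < ${#lines[@]}; i++)); do
  for ((j = i + 1; j < ${#lines[@]}; j++)); do
    if [[ ${lines[i]} == "${lines[j]}" ]]; then
      # array indices are 0-based; the desired output uses 1-based line numbers
      echo "found two equal lines: index1=$((i+1)) , index2=$((j+1)) , value=${lines[i]}"
    fi
  done
done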

3 answers
Answer 1 · 狗以群分 · 2019-08-05 06:34

Here is a relatively efficient (within the limits of the language and runtime) native-bash approach, which you can see running in an online interpreter at https://ideone.com/iFpJr7:

#!/bin/bash
case $BASH_VERSION in ''|[123].*) echo "ERROR: bash 4.0 required" >&2; exit 1;; esac

# initialize an associative array, mapping each string to the last line it was seen on
declare -A lines=( )
lineNum=0

while IFS= read -r line; do
  lineNum=$(( lineNum + 1 ))
  if [[ ${lines[$line]} ]]; then
     printf 'found two equal lines: index1=%s, index2=%s, value=%s\n' \
       "${lines[$line]}" "$lineNum" "$line"
  fi
  lines[$line]=$lineNum
done <example.txt

Note the use of while read to iterate line-by-line, as described in BashFAQ #1 (How can I read a file line-by-line (or field-by-field)?); this permits us to open the file only once and read through it without needing any command substitutions (which fork off subshells) or external commands (which the operating system has to start up every time they are invoked, and which are likewise expensive).
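For contrast, a minimal sketch (not part of the answer above) of the pattern being avoided: fetching each line with a command substitution forks a subshell plus two external commands per iteration, which is where the bulk of the original script's runtime goes:

size=$(wc -l < example.txt)   # one external call just to get the line count

# expensive: roughly three more process startups for every single line fetched
for ((i = 1; i <= size; i++)); do
  line=$(head -n "$i" example.txt | tail -n 1)
  : "$line"                   # placeholder for whatever per-line work is needed
done

# the while IFS= read -r loop shown above does the same traversal with builtins
# only, so no additional processes are created inside the loop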

The other part of the improvement here is that we're reading the whole file only once -- implementing an O(n) algorithm -- as opposed to running O(n^2) comparisons as the original code did.

Answer 2 · 贪生不怕死 · 2019-08-05 06:39

If you do not want to use awk (a good tool for this job, since it parses the input only once), you can run through the lines several times instead. Sorting is expensive, but this solution avoids the nested loops you tried.

grep -Fnxf <(uniq -d <(sort example.txt)) example.txt

With uniq -d <(sort example.txt) you find all lines that occur more than once. grep then searches for these lines, reading the patterns from that list (-f), matching complete lines only (-x), treating them as fixed strings rather than regular expressions (-F), and printing the line number of each match (-n).
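Run against the example.txt above, this should print something like the following (grep's -n prefixes each match with its line number):

$ grep -Fnxf <(uniq -d <(sort example.txt)) example.txt
1:saudifh
3:sometextASLKJND
4:saudifh
5:sometextASLKJND

Note that this format differs from the requested one, so pairing the indices up into "found two equal lines" messages would still need a small post-processing step.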

Answer 3 · 冷血范 · 2019-08-05 06:54

See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the reasons why your script is so slow.

$ cat tst.awk
# append the current line number to the list of line numbers recorded for this text
{ val2hits[$0] = val2hits[$0] FS NR }
END {
    for (val in val2hits) {
        # split the accumulated line numbers back out into the hits[] array
        numHits = split(val2hits[val],hits)
        if ( numHits > 1 ) {
            printf "found %d equal lines:", numHits
            for ( hitNr=1; hitNr<=numHits; hitNr++ ) {
                printf " index%d=%d ,", hitNr, hits[hitNr]
            }
            print " value=" val
        }
    }
}

$ awk -f tst.awk file
found 2 equal lines: index1=1 , index2=4 , value=saudifh
found 2 equal lines: index1=3 , index2=5 , value=sometextASLKJND

To give you an idea of the performance difference, here is a bash script written to be as efficient as possible and an equivalent awk script:

bash:

$ cat tst.sh
#!/bin/bash
case $BASH_VERSION in ''|[123].*) echo "ERROR: bash 4.0 required" >&2; exit 1;; esac

# initialize an associative array, mapping each string to the last line it was seen on
declare -A lines=( )
lineNum=0

while IFS= read -r line; do
  (( ++lineNum ))
  if [[ ${lines[$line]} ]]; then
     printf 'Content previously seen on line %s also seen on line %s: %s\n' \
       "${lines[$line]}" "$lineNum" "$line"
  fi
  lines[$line]=$lineNum
done < "$1"

$ time ./tst.sh file100k > ou.sh
real    0m15.631s
user    0m13.806s
sys     0m1.029s

awk:

$ cat tst.awk
# the pattern lines[$0] is true (non-empty) only if this text was seen on an earlier line
lines[$0] {
    printf "Content previously seen on line %s also seen on line %s: %s\n", \
       lines[$0], NR, $0
}
# remember the most recent line number on which each distinct line of text appeared
{ lines[$0]=NR }

$ time awk -f tst.awk file100k > ou.awk
real    0m0.234s
user    0m0.218s
sys     0m0.016s

There is no difference in the output of the two scripts:

$ diff ou.sh ou.awk
$

The timings above are from the 3rd run of each script, to avoid caching effects, and were measured against a file generated by the following awk script:

awk 'BEGIN{for (i=1; i<=10000; i++) for (j=1; j<=10; j++) print j}' > file100k

When the input file had zero duplicate lines (generated by seq 100000 > nodups100k), the bash script executed in about the same amount of time as above, while the awk script executed much faster than it did above:

$ time ./tst.sh nodups100k > ou.sh
real    0m15.179s
user    0m13.322s
sys     0m1.278s

$ time awk -f tst.awk nodups100k > ou.awk
real    0m0.078s
user    0m0.046s
sys     0m0.015s