How can I count the most frequently occurring sequences of 3 letters?

I have a sample file like this:

XYZAcc
ABCAccounting
Accounting firm
Accounting Aco
Accounting Acompany
Acoustical consultant

I need to find the most frequently occurring sequences of 3 letters within a word.

The output should be:

acc = 5 aco = 3

Is that possible in Bash?

I have absolutely no idea how to accomplish this with awk, sed, or grep.

Any clue how to do it?

PS: I have not posted an attempt because I have no idea where to start, and I did not want to write an arbitrary awk -F command that would not help anyone.

Tags: linux bash awk sed

3 Answers

Fickle 薄情

#2 · 2020-03-26 04:59

This is an alternative to Ed Morton's solution. It loops less but needs a bit more memory. The idea is to ignore spaces and other non-alphabetic characters while counting: every 3-character window of the line is counted, and windows that contain anything other than a letter are filtered out at the end.

awk -v n=3 '{ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
            END {for(s in a) if (s !~ /[^a-z]/) print s,a[s] }' file
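
To see the effect of the final filter, here is a minimal sketch that runs the same logic on a single made-up line. The helper name trigrams and the input text are assumptions for illustration only; the trailing sort is added just to make the output order predictable.

```shell
# trigrams is a hypothetical helper name; it reads text on stdin.
trigrams() {
    awk -v n=3 '
        # Slide an n-character window over the whole line, spaces included.
        { for (i = length($0) - n + 1; i > 0; --i) a[tolower(substr($0, i, n))]++ }
        # Keep only windows made up entirely of the letters a-z.
        END { for (s in a) if (s !~ /[^a-z]/) print s, a[s] }
    ' | LC_ALL=C sort
}

printf 'Abcd ab cd\n' | trigrams
# prints:
# abc 1
# bcd 1
```

Windows such as "cd " or "d a" are counted in the array but never printed, which is where the extra memory goes.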

With GNU awk you can optimize this by setting the record separator so that each record is a single word. As long as the words contain only letters, the final filtering step is then no longer needed:

awk -v n=3 -v RS='[[:space:]]' '
    (length>=n){ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
    END {for(s in a) print s,a[s] }' file


Root（大扎）

#3 · 2020-03-26 05:01

Here's how to get started with what I THINK you're trying to do:

$ cat tst.awk
BEGIN { stringLgth = 3 }
{
    for (fldNr=1; fldNr<=NF; fldNr++) {
        field = $fldNr
        fieldLgth = length(field)
        if ( fieldLgth >= stringLgth ) {
            maxBegPos = fieldLgth - (stringLgth - 1)
            for (begPos=1; begPos<=maxBegPos; begPos++) {
                string = tolower(substr(field,begPos,stringLgth))
                cnt[string]++
            }
        }
    }
}
END {
    for (string in cnt) {
        print string, cnt[string]
    }
}

$ awk -f tst.awk file | sort -k2,2nr
acc 5
cou 5
cco 4
ing 4
nti 4
oun 4
tin 4
unt 4
aco 3
abc 1
ant 1
any 1
bca 1
cac 1
cal 1
com 1
con 1
fir 1
ica 1
irm 1
lta 1
mpa 1
nsu 1
omp 1
ons 1
ous 1
pan 1
sti 1
sul 1
tan 1
tic 1
ult 1
ust 1
xyz 1
yza 1
zac 1
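
The question asked for output in the form "acc = 5 aco = 3". As a rough sketch, lines of "sequence count" like those above could be ranked and joined onto one line as follows. The helper name format_counts is hypothetical, and the three input lines are copied from the output above; breaking ties alphabetically is an assumption, since the question does not say how ties should be ordered.

```shell
# format_counts is a hypothetical helper; it expects "sequence count" lines on stdin.
format_counts() {
    # Highest count first, ties broken alphabetically by sequence.
    LC_ALL=C sort -k2,2nr -k1,1 |
    # Join all lines as "seq = count" pairs separated by single spaces.
    awk '{ printf "%s%s = %s", sep, $1, $2; sep = " " } END { if (NR) print "" }'
}

printf 'aco 3\nacc 5\ncou 5\n' | format_counts
# prints: acc = 5 cou = 5 aco = 3
```

In practice this would be fed from the script, for example: awk -f tst.awk file | format_counts. Adding head -n after the sort would limit the result to the top few sequences.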


干净又极端

#4 · 2020-03-26 05:16

This might work for you (GNU sed, sort and uniq):

sed -E 's/.(..)/\L&\n\1/;/^\S{3}/P;D' file |
sort |
uniq -c |
sort -s -k1,1rn |
sed -En 's/^\s*(\S+)\s*(\S+)/\2 = \1/;H;$!b;x;s/\n/ /g;s/.//p'

Use the first sed invocation to output every 3-letter sequence in lower case, one per line.

Sort the words.

Count the duplicates.

Sort the counts in reverse numerical order; the stable sort (-s) preserves the alphabetical order among equal counts.

Use the second sed invocation to manipulate the results into the desired format.
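
To make the first step concrete, here is a small sketch that runs only the first sed invocation on one made-up line. It assumes GNU sed, since \L, \S and \n in the replacement are GNU extensions; the helper name sed_trigrams and the input text are hypothetical.

```shell
# sed_trigrams is a hypothetical helper name; requires GNU sed.
sed_trigrams() {
    # s: lower-case the first 3 characters, then start a new line with the last 2 of them.
    # P: print the first line if it begins with 3 non-space characters.
    # D: drop the first line and rerun the script on the remainder.
    sed -E 's/.(..)/\L&\n\1/;/^\S{3}/P;D'
}

printf 'XYZAcc ab\n' | sed_trigrams
# prints:
# xyz
# yza
# zac
# acc
```

Each pass advances the window by one character, and windows that contain a space (such as "cc " here) fail the /^\S{3}/ test and are not printed.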

If you only want sequences that occur more than once, listed in alphabetical order and counted case-sensitively, use:

sed -E 's/.(..)/&\n\1/;/^\S{3}/P;D' file |
sort |
uniq -cd |
sed -En 's/^\s*(\S+)\s*(\S+)/\2 = \1/;H;$!b;x;s/\n/ /g;s/.//p'
