How to put sequential numbers at the end of repeated lines

Posted 2019-09-17 21:02

I have a file with some repeated information. The lines are numbered, followed by a colon, followed by the information. I want to put a sequential number only at the end of the repeated information.

Example.

Input:

1:Jose da Silva
2:Jose da Silva
3:Fulano de Tal
4:Jose da Silva
5:Sicrano Pereira
6:Ze Ruela
7:Sicrano Pereira
8:Jose da Silva

Output:

1:Jose da Silva #1
2:Jose da Silva #2
3:Fulano de Tal
4:Jose da Silva #3
5:Sicrano Pereira #1
6:Ze Ruela
7:Sicrano Pereira #2
8:Jose da Silva #4

[This question differs from this one because here the lines are always different (every line has a different number). My input/output examples may look very similar, but in the real application they are not.]

1 answer
看我几分像从前
Reply #2 · 2019-09-17 21:22

Tweaking my previous answer:

awk -F: 'FNR==NR {count[$2]++; next}
         count[$2]>1 {$0=$0 OFS "#"++times[$2]}
         1' file file

That is: on the first pass over the file, count how many times each second field occurs. On the second pass, append an incrementing number to the lines whose second field appears more than once. So instead of comparing the whole line, it compares the second field, which is everything after the first colon (up to the next colon, if there is one).
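To make the first pass concrete, here is a small sketch that only does the counting and prints the result. It feeds the sample input from the question through printf instead of a file, and the variable name counts and the LC_ALL=C sort at the end are just choices for this illustration (awk's for-in loop has no guaranteed order, so the output is sorted to make it stable).

```shell
# Count how often each second field (the text after the colon) occurs.
counts=$(printf '%s\n' \
    '1:Jose da Silva' '2:Jose da Silva' '3:Fulano de Tal' '4:Jose da Silva' \
    '5:Sicrano Pereira' '6:Ze Ruela' '7:Sicrano Pereira' '8:Jose da Silva' |
  awk -F: '{ count[$2]++ } END { for (name in count) print count[name], name }' |
  LC_ALL=C sort)

printf '%s\n' "$counts"
# 1 Fulano de Tal
# 1 Ze Ruela
# 2 Sicrano Pereira
# 4 Jose da Silva
```

Only the names with a count above 1 get a suffix on the second pass.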

Further explanation:

  • The pattern FNR==NR {actions; next} {more_actions} file1 file2 runs actions while reading the first file and more_actions while reading the second one. It works because FNR restarts at 1 for every file while NR keeps counting across files, so the two are equal only during the first file. This comes in very handy when you want to compare files, as we are doing here. But wait, here we only have one file, right? Yes, but passing it twice also lets us compare the lines of the file against each other. More info about this in Idiomatic awk.
  • So FNR==NR {count[$2]++; next} stores in the array count how many times each 2nd field appears. This way, Jose da Silva is counted 4 times, etc. Note we use $2 as the index of the array: this is the second field based on the delimiter : that we set with -F:. That is, the first field is everything up to the first :, the second field is everything from the first : up to the second one, and so on.
  • count[$2]>1 {$0=$0 OFS "#"++times[$2]} runs while reading the file for the second time. It checks whether the counter for the second field of the current line says that it occurs more than once. If so, it appends some content to the original string $0: OFS "#"++times[$2].
    • OFS is the output field separator. That is, the field separator that is used when printing data. Since we did not set it before running the program, it defaults to a space.
    • "#" this is just some text we want to add before the counter.
    • ++times[$2] this is just a counter to keep track of how many times it was printed so far. Since we have different 2nd fields, we need an array times[] to keep track of each one of them.
  • 1 at the very end of the script we have this 1. This is an idiomatic way to print a line: 1 is a true value and awk's behaviour when an expression is true is to print the current line. That is, to print $0 that can be either the original one or the one with some trailing new content.
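One consequence of the field definition above: if the information itself contains a colon, $2 only holds the part up to that second colon, so different lines could be counted as the same. The following is a possible variation, not part of the original answer, that treats everything after the first colon as the key by using index() and substr(). The sample lines and the names input, result and key are made up for this sketch.

```shell
# Hypothetical input whose information part contains a colon.
input=$(mktemp)
cat > "$input" <<'EOF'
1:Note: call Jose
2:Note: call Maria
3:Note: call Jose
EOF

result=$(awk '
    # key = everything after the first colon, including any later colons
    { key = substr($0, index($0, ":") + 1) }
    # first pass: count each key, then skip the remaining rules
    FNR == NR { count[key]++; next }
    # second pass: number only the keys seen more than once
    count[key] > 1 { $0 = $0 " #" (++times[key]) }
    1
' "$input" "$input")

printf '%s\n' "$result"
# 1:Note: call Jose #1
# 2:Note: call Maria
# 3:Note: call Jose #2
rm -f "$input"
```

With -F: and $2, all three lines would share the key "Note" and would be numbered #1 to #3.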

Output:

$ awk -F: 'FNR==NR {count[$2]++; next} count[$2]>1 {$0=$0 OFS "#"++times[$2]}1' file file
1:Jose da Silva #1
2:Jose da Silva #2
3:Fulano de Tal
4:Jose da Silva #3
5:Sicrano Pereira #1
6:Ze Ruela
7:Sicrano Pereira #2
8:Jose da Silva #4
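
To reproduce this without creating the file by hand, the run can be wrapped in a small script along these lines. It writes the question's sample input to a temporary file (mktemp and the variable names input and result are choices made for this sketch) and runs the same program, laid out one rule per line with comments. The parentheses around ++times[$2] are added only for readability.

```shell
# Write the sample input from the question to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
1:Jose da Silva
2:Jose da Silva
3:Fulano de Tal
4:Jose da Silva
5:Sicrano Pereira
6:Ze Ruela
7:Sicrano Pereira
8:Jose da Silva
EOF

# The file is passed twice so that awk reads it in two passes.
result=$(awk -F: '
    # Pass 1: count occurrences of each second field, skip the other rules.
    FNR == NR { count[$2]++; next }
    # Pass 2: append " #<n>" to lines whose second field is repeated.
    count[$2] > 1 { $0 = $0 OFS "#" (++times[$2]) }
    # A bare true pattern prints the current (possibly modified) line.
    1
' "$input" "$input")

printf '%s\n' "$result"
rm -f "$input"
```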