I'm building a Swedish-English sentence deck for Anki from the Creative Commons-licensed content of tatoeba.org.
Please help me turn sample 1 into sample 2 (preferably in bash).
Sample 1:
a 1
a 2
b 3
c 4
c 5
Sample 2:
a 1<br>2
b 3
c 4<br>5
Duplicates in field 1 are always on consecutive lines.
Thank you!
One way using awk:
awk 'p==$1{printf "<br>%s", $2;next}{if(p){print ""};p=$1;printf "%s", $0}END{print ""}' file
a 1<br>2
b 3
c 4<br>5
perl -ane 'print +($l eq $F[0]) ? "<br>$F[1]" : ($. > 1 ? "\n" : "") . "@F"; $l = $F[0]; END { print "\n" }' file
Try this awk command as well:
awk 'BEGIN {getline; id=$1; line=$0} {if ($1 != id) {print line; line = $0; } else {line = line "<br>" $2;} id=$1;} END {print line;}' file
a 1<br>2
b 3
c 4<br>5
This might work for you (GNU sed):
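sed -r ':a;$!N;s/^((\S+\s).*)\n\2/\1<br>/;ta;P;D' file
Append the next line to the current one and compare their keys. If the keys match, merge the two lines and try again with the following line (without this loop, only the first two lines of a longer run would be merged); otherwise print the current line, delete it, and repeat with the line that was appended.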
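To make the merge-print-delete cycle easier to follow, here is an annotated sketch of the same idea, run on a hypothetical input where key "a" occurs three times in a row. It assumes GNU sed (`-E`, `\S` and `\s` are GNU extensions); the label name `merge` and the variables `input` and `result` are only illustrative.

```shell
# Hypothetical sample input: key "a" occurs three times in a row.
input=$(printf 'a 1\na 2\na 3\nb 4\n')

# Annotated GNU sed sketch of the compare-and-merge cycle.
result=$(printf '%s\n' "$input" | sed -E '
  :merge
  # Append the next input line unless this is the last one.
  $!N
  # If the appended line starts with the same key, fold it into the record.
  s/^((\S+\s).*)\n\2/\1<br>/
  # A successful merge loops back to try the following line too.
  t merge
  # Print the finished record, drop it, and restart with the leftover line.
  P
  D
')
printf '%s\n' "$result"
```

With this input the sketch should print `a 1<br>2<br>3` followed by `b 4`. Because the pattern space never holds more than one record plus one look-ahead line, memory use stays small even for a large file.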
awk '{if(a[$1]){a[$1]=a[$1]"<br>"$2}else{a[$1]=$1FS$2;b[i++]=$1}} END{for(i=0;i in b; i++) print a[b[i]];}' sample1
a 1<br>2
b 3
c 4<br>5
This builds the output in array a and uses array b to preserve the order of the lines.
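The two-array idea can be spelled out in a longer, commented form. The following is only a sketch: the names `merged`, `order`, `count` and `result` are made up for illustration, and the input is a hypothetical case in which key "a" reappears after "b", which shows that this approach does not rely on duplicates being adjacent.

```shell
# Hypothetical input in which key "a" reappears after "b" (not adjacent).
result=$(printf 'a 1\nb 3\na 2\n' | awk '
  {
    if ($1 in merged) {
      # Key seen before: append the new value to its record.
      merged[$1] = merged[$1] "<br>" $2
    } else {
      # First sighting: store the whole line and remember its position.
      merged[$1] = $0
      order[count++] = $1
    }
  }
  END {
    # Emit records in the order their keys first appeared.
    for (i = 0; i < count; i++) print merged[order[i]]
  }
')
printf '%s\n' "$result"
```

This should print `a 1<br>2` followed by `b 3`. The trade-off is that every record is held in memory until the end of the input, whereas the streaming answers only keep the current key.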
Here is another awk solution:
awk 'f!=$1 {printf "%s%s", (NR>1?RS:""), $0; f=$1; next} {printf "<br>%s", $2} END {if (NR) print ""}' file
a 1<br>2
b 3
c 4<br>5