How to transpose or pivot data of a text file in u

2019-09-19 15:45发布

问题:

I have an input text file in unix with this kind of data.

Event_date:20190512044638
Error_code:5858
Event_type:GPRS data
Duration:772
Missing_provider_id:46009

Event_date:20190512044638
Error_code:780678

Event_date:20190512064535
Error_code:5858
Event_type:GPRS data
Duration:2172
Missing_provider_id:722310

i want this data to be in this output format:

Event_date      Error_code  Event_type  Duration  Missing_provider_id
20190512044638  5858        GPRS data   772       46009
20190512044638  780678      
20190512064535  5858        GPRS data   2172      722310

I tried a combination of awk and sed commands, but didn't work out. How can i achieve this output?

Event_date:20190512044638
Error_code:5858
Event_type:GPRS data
Duration:772
Missing_provider_id:46009

Event_date:20190512044638
Error_code:780678

Event_date:20190512064535
Error_code:5858
Event_type:GPRS data
Duration:2172
Missing_provider_id:722310

i want this data to be in this output format:

Event_date      Error_code  Event_type  Duration  Missing_provider_id
20190512044638  5858        GPRS data   772       46009
20190512044638  780678      
20190512064535  5858        GPRS data   2172      722310

回答1:

This awk may do: (tab separated fields)

PS this will fail if on field is missing, all need to come in order.

awk -F: 'NR==1 {print $1,$3,$5,$7,$9} {print $2,$4,$6,$8,$10}'  RS= ORS='\n' OFS='\t' file
Event_date      Error_code      Event_type      Duration        Missing_provider_id
20190512044638  5858    GPRS data       772     46009
20190512044638  780678
20190512064535  5858    GPRS data       2172    722310

A more generic solution:

awk -F: 'NR==1 {print $1,$3,$5,$7,$9} {for(i=2;i<=NF;i+=2) printf "%s\t",$i;print ""}'  RS= ORS='\n' OFS='\t' file
Event_date      Error_code      Event_type      Duration        Missing_provider_id
20190512044638  5858    GPRS data       772     46009
20190512044638  780678
20190512064535  5858    GPRS data       2172    722310

NR==1 {print $1,$3,$5,$7,$9} can be set to some static header like NR==1 {print "F1","F2","F3","F4","F5"} etc



回答2:

Using GNU awk and 2D arrays:

awk '
BEGIN {                         
    r=2                                           # data records in a start from 2
    FS=":"                                        # split at :
    OFS="\t"                                      # tab separated fields
    a[0][0]                                       # initialize a array
}
$0!="" {                                          # for nonempty records
    if(!($1 in a[0])) {                           # add keys to headers when needed
        a[0][$1]=++f                              # for lookups
        a[1][f]=$1                                # for printing
    }
    a[r][a[0][$1]]=$2                             # store value
    next
}
{                                                 # empty record -> new array record
    r++
}
END {                                             # after records are processed
    # delete a[0][0]                              # 
    for(i=1;i<=r;i++)                             # iterate records
        for(j=1;j<=f;j++)                         # iterate fields
            printf "%s%s",a[i][j],(j==f?ORS:OFS)  # output
}
' file | column -t -s $'\t'                       # column used for pretty-print

Output:

Event_date      Error_code  Event_type  Duration  Missing_provider_id
20190512044638  5858        GPRS data   772       46009
20190512044638  780678
20190512064535  5858        GPRS data   2172      722310


回答3:

here is another one

awk -F: -v RS= 'BEGIN   {OFS=FS}
                NR==FNR {for(i=1;i<NF;i+=2)
                           if(!($i in h)) {h[$i]; ho[++c]=$i}; 
                         next}
                FNR==1  {for(i=1;i<=c;i++) printf "%s",ho[i] (i==c?ORS:OFS)}
                        {delete v;
                         for(i=1;i<NF;i+=2) v[$i]=$(i+1);
                         for(i=1;i<=c;i++) printf "%s", v[ho[i]] (i==c?ORS:OFS)}' file{,} | 
column -ts:

Event_date      Error_code  Event_type  Duration  Missing_provider_id
20190512044638  5858        GPRS data   772       46009
20190512044638  780678
20190512064535  5858        GPRS data   2172      722310

no 2D arrays, but needs to scan the file twice to get all header info in order not to keep any data in memory, but to process the lines as they appear.