How to merge rows from the same column using unix

Published 2019-08-12 06:47

Question:

I have a text file that looks like the following:

1000000    45    M    This is a line        This is another line  Another line
                      that breaks into      that also breaks      that has a blank
                      multiple rows         into multiple rows -  row below.
                                            How annoying!

1000001    50    F    I am another          I am well behaved.    
                      column that has
                      text spanning
                      multiple rows

I would like to convert this into a csv file that looks like:

1000000, 45, M, This is a line that breaks into multiple rows, This is another line that also breaks into multiple rows - How annoying!
1000001, 50, F, I am another column that has text spanning multiple rows, I am well behaved.

The text file output comes from a program written in 1984, and I have no way to modify the output. I want it in CSV format so that I can import it into Excel as painlessly as possible. I'm not sure where to start and, rather than reinvent the wheel, was hoping someone could point me in the right direction. Thanks!

== EDIT ==

I've modified the text file to have \n between rows - maybe this will be helpful?

== EDIT 2 ==

I've modified the text file to have a blank row.

Answer 1:

Using GNU awk

gawk '
    BEGIN { FIELDWIDTHS="11 6 5 22 22" }
    length($1) == 11 {
        if ($1 ~ /[^[:blank:]]/) { 
            if (f1) print_line()
            f1=$1; f2=$2; f3=$3; f4=$4; f5=$5
        }
        else { 
            f4 = f4" "$4; f5 = f5" "$5
        }
    }
    function rtrim(str) {
        sub(/[[:blank:]]+$/, "", str)
        return str
    }
    function print_line() {
        gsub(/[[:blank:]]{2,}/, " ", f4); gsub(/"/, "&&", f4)
        gsub(/[[:blank:]]{2,}/, " ", f5); gsub(/"/, "&&", f5)
        printf "%s,%s,%s,\"%s\",\"%s\"\n", rtrim(f1), rtrim(f2), rtrim(f3),f4,f5
    }
    END {if (f1) print_line()}
' file
1000000,45,M,"This is a line that breaks into multiple rows ","This is another line that also breaks into multiple rows - How annoying!"
1000001,50,F,"I am another column that has text spanning multiple rows","I am well behaved. "

I've quoted the last 2 columns in case they contain commas, and doubled any potential inner double quotes.
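This doubled-quote convention is the standard CSV escaping rule, and Python's csv module applies it automatically; a minimal sketch (the sample row is made up):

```python
import csv
import io

# A cell containing both a comma and an embedded double quote.
row = ["1000000", "45", "M", 'He said "hi", then left']

buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(row)
print(buf.getvalue().strip())
# 1000000,45,M,"He said ""hi"", then left"
```

With QUOTE_MINIMAL (the default), only fields containing the delimiter or the quote character get wrapped, and inner quotes are doubled, just like the gawk version does by hand.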



Answer 2:

Here's a Perl script that does what you want. It uses unpack to split the fixed width columns into fields, adding to the previous fields if there is no data in the first column.

As you've mentioned that the widths vary between files, the script works out the widths for itself, based on the content of the first line. The assumption is that there are at least two space characters between each field. It creates a format string like A11 A6 A5 A22 A21, where "A" means an ASCII, space-padded string (unpack trims the trailing padding) and the numbers specify the width of each field.

Inspired by glenn's version, I have wrapped any field containing spaces in double quotes. Whether that's useful or not depends on how you're going to end up using the data. For example, if you want to parse it with another tool and there are commas within the input, it may be helpful. If you don't want it to happen, you can change the map block in both places to simply grep { $_ ne "" }:

use strict;
use warnings;

chomp (my $first_line = <>);
my @fields = split /(?<=\s{2})(?=\S)/, $first_line;
my $format = join " ", map { "A" . length } @fields;
my @cols = unpack $format, $first_line;

while(<>) {    
    chomp( my $line = $_ );
    my @tmp = unpack $format, $line;
    if ($tmp[0] ne '') {
        print join(", ", map { $_ ne "" && /\s/ ? qq/"$_"/ : $_ } @cols), "\n";
        @cols = @tmp;
    }
    else {
        for (1..$#tmp) {
            $cols[$_] .= " $tmp[$_]" if $tmp[$_] ne "";
        }
    }    
}

print join(", ", map { $_ ne "" && /\s/ ? qq/"$_"/ : $_ } @cols), "\n";

Output:

1000000, 45, M, "This is a line that breaks into multiple rows", "This is another line that also breaks into multiple rows - How annoying!"
1000001, 50, F, "I am another column that has text spanning multiple rows", "I am well behaved."
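The width-detection trick in this answer (split the first line wherever two or more spaces precede a non-space) translates directly to other languages; a rough Python sketch of the same idea, using made-up sample lines:

```python
import re

header = "1000000    45    M    This is a line        This is another line"
# Split where two or more spaces are followed by a non-space; the lookbehind
# keeps the trailing padding with the preceding field, as in the Perl version.
fields = re.split(r'(?<=\s{2})(?=\S)', header)
widths = [len(f) for f in fields]
print(widths)  # [11, 6, 5, 22, 20]

# Slice a continuation row into the same columns and strip the padding.
line = " " * 22 + "that breaks into".ljust(22) + "that also breaks"
cols, pos = [], 0
for w in widths:
    cols.append(line[pos:pos + w].strip())
    pos += w
print(cols)  # ['', '', '', 'that breaks into', 'that also breaks']
```

Like the Perl script, this assumes the first line shows every column boundary; if a later line is wider than the first, the last slice simply absorbs the extra characters.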


Answer 3:

Using this awk:

awk -F ' {2,}' -v OFS=', ' 'NF==5{if (p) print a[1], a[2], a[3], a[4], a[5]; 
   for (i=1; i<=NF; i++) a[i]=$i; p=index($0,$4)}
   NF<4 {for(i=2; i<=NF; i++) index($0,$i) == p ? a[4]=a[4] " " $i : a[5]=a[5] " " $i}
   END { print a[1], a[2], a[3], a[4], a[5] }' file
1000000, 45, M, This is a line that breaks into multiple rows, This is another line that also breaks into multiple rows - How annoying!
1000001, 50, F, I am another column that has text spanning multiple rows, I am well behaved.


Answer 4:

You can write a Python script that does this. Read each line and split it: if the line continues the previous record, append its pieces to the previous row; if it starts a new record (the first line, or the line after a blank separator row), add a new row to the result set. Finally, use the csv writer to write the result set to a file.

Something along the lines of:

import csv

results = []
isNewItem = True
with open(filename, 'r') as inputFile:
    for line in inputFile:
        if line.strip() == '':
            isNewItem = True      # blank row: the next row starts a new item
            continue

        temp = line.split()
        if isNewItem:
            results.append(temp)
            isNewItem = False
        else:
            # Continuation row: append each piece to the matching trailing
            # column of the previous row. Note that split() also breaks
            # multi-word cells apart, so this only lines up if each cell is
            # a single word -- the fixed-width answers above handle the
            # general case properly.
            lastRow = results[-1]
            offset = len(lastRow) - len(temp)
            for i, piece in enumerate(temp):
                lastRow[offset + i] += ' ' + piece

with open(csvOutputFileName, 'w', newline='') as outFile:
    csv.writer(outFile).writerows(results)