可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have a text file that looks like the following:
1000000 45 M This is a line This is another line Another line
that breaks into that also breaks that has a blank
multiple rows into multiple rows - row below.
How annoying!
1000001 50 F I am another I am well behaved.
column that has
text spanning
multiple rows
I would like to convert this into a csv file that looks like:
1000000, 45, M, This is a line that breaks into multiple rows, This is another line that also breaks into multiple rows - How annoying!
1000001, 50, F, I am another column that has text spanning multiple rows, I am well behaved.
The text file output comes from a program that was written in 1984, and I have no way to modify the output. I want it in csv format so that I can convert it to Excel as painlessly as possible. I am not sure where to start, and rather than reinvent the wheel, was hoping someone could point me in the right direction. Thanks!
== EDIT ==
I've modified the text file to have \n
between rows - maybe this will be helpful?
== EDIT 2 ==
I've modified the text file to have a blank row.
回答1:
Using GNU awk
gawk '
BEGIN { FIELDWIDTHS="11 6 5 22 22" }
length($1) == 11 {
if ($1 ~ /[^[:blank:]]/) {
if (f1) print_line()
f1=$1; f2=$2; f3=$3; f4=$4; f5=$5
}
else {
f4 = f4" "$4; f5 = f5" "$5
}
}
function rtrim(str) {
sub(/[[:blank:]]+$/, "", str)
return str
}
function print_line() {
gsub(/[[:blank:]]{2,}/, " ", f4); gsub(/"/, "&&", f4)
gsub(/[[:blank:]]{2,}/, " ", f5); gsub(/"/, "&&", f5)
printf "%s,%s,%s,\"%s\",\"%s\"\n", rtrim(f1), rtrim(f2), rtrim(f3),f4,f5
}
END {if (f1) print_line()}
' file
1000000,45,M,"This is a line that breaks into multiple rows ","This is another line that also breaks into multiple rows - How annoying!"
1000001,50,F,"I am another column that has text spanning multiple rows","I am well behaved. "
I've quoted the last 2 columns in case they contain commas, and doubled any potential inner double quotes.
回答2:
Here's a Perl script that does what you want. It uses unpack
to split the fixed width columns into fields, adding to the previous fields if there is no data in the first column.
As you've mentioned that the widths vary between files, the script works out the widths for itself, based on the content of the first line. The assumption is that there are at least two space characters between each field. It creates a format string like A11 A6 A5 A22 A21
, where "A" means any character and the numbers specify the width of each field.
Inspired by glenn's version, I have wrapped any field containing spaces in double quotes. Whether that's useful or not depends on how you're going to end up using the data. For example, if you want to parse it using another tool and there are commas within the input, it may be helpful. If you don't want it to happen, you can change the grep
block in both places to simply grep { $_ ne "" }
:
use strict;
use warnings;
chomp (my $first_line = <>);
my @fields = split /(?<=\s{2})(?=\S)/, $first_line;
my $format = join " ", map { "A" . length } @fields;
my @cols = unpack $format, $first_line;
while(<>) {
chomp( my $line = $_ );
my @tmp = unpack $format, $line;
if ($tmp[0] ne '') {
print join(", ", grep { $_ ne "" && /\s/ ? qq/"$_"/ : $_ } @cols), "\n";
@cols = @tmp;
}
else {
for (1..$#tmp) {
$cols[$_] .= " $tmp[$_]" if $tmp[$_] ne "";
}
}
}
print join(", ", grep { $_ ne "" && /\s/ ? qq/"$_"/ : $_ } @cols), "\n";
Output:
1000000, 45, M, "This is a line that breaks into multiple rows", "This is another line that also breaks into multiple rows - How annoying!"
1000001, 50, F, "I am another column that has text spanning multiple rows", "I am well behaved."
回答3:
Using this awk
:
awk -F ' {2,}' -v OFS=', ' 'NF==5{if (p) print a[1], a[2], a[3], a[4], a[5];
for (i=1; i<=NF; i++) a[i]=$i; p=index($0,$4)}
NF<4 {for(i=2; i<=NF; i++) index($0,$i) == p ? a[4]=a[4] " " $i : a[5]=a[5] $i}
END { print a[1], a[2], a[3], a[4], a[5] }' file
1000000, 45, M, This is a line that breaks into multiple rows, This is another line that also breaks into multiple rows - How annoying!
1000001, 50, F, I am another column that has text spanning multiple rows, I am well behaved.
回答4:
You can write a script in python that does that. Read each line, call split on it, if the line is not empty append to the previous line. If it is, then add the next line to the result set. Finally use the csv write to write the result set to file.
Something along the lines of :
#import csv
inputFile = open(filename, 'r')
isNewItem = True
results = []
for line in inputFile:
if len(results) == 0:
isNewItem = True
else if line == '':
isNewItem = True
continue
else:
inNewItem = False
temp = line.split()
if isNewItem:
results.append(temp)
else
lastRow = results[-1]
combinedRow = []
for leftColumn, rigtColumn in lastRow, temp:
combinedRow.append(leftColumn + rightColumn)
with open(csvOutputFileName, 'w') as outFile:
csv.write(results)