How can I remove newlines that occur inside double quotes (") in a file?
For example:
"one",
"three
four",
"seven"
So I want to remove the \n between three and four. Should I use a regular expression, or do I have to read the file character by character with a program?
To handle specifically those newlines that are inside double-quoted strings and leave alone those that are outside them, using GNU awk (for RT):
gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' file
This works by splitting the file along " characters and removing newlines in every other block. With a file containing
"one",
"three
four",
12,
"seven"
this will give the result
"one",
"threefour",
12,
"seven"
Note that it does not handle escape sequences. If strings in the input data can contain \", such as "He said: \"this is a direct quote.\"", then it will not work as desired.
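If escaped quotes do occur, reading the input character by character is one way around the limitation. The following Python function is only a sketch under stated assumptions: a backslash escapes the next character, line endings are Unix-style (\n), and the name join_quoted_newlines and its replacement parameter are made up for illustration.

```python
def join_quoted_newlines(text, replacement=""):
    """Remove (or replace) newlines that fall inside double-quoted strings.

    Assumption: a backslash escapes the following character, so a
    backslash-escaped quote does not open or close a string.
    """
    out = []
    in_quotes = False
    escaped = False
    for ch in text:
        if escaped:
            # The character right after a backslash is copied verbatim.
            out.append(ch)
            escaped = False
        elif ch == "\\":
            out.append(ch)
            escaped = True
        elif ch == '"':
            # An unescaped quote toggles the inside/outside state.
            in_quotes = not in_quotes
            out.append(ch)
        elif ch == "\n" and in_quotes:
            # Newline inside a quoted string: drop it or substitute it.
            out.append(replacement)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    sample = '"one",\n"three\nfour",\n12,\n"seven"\n'
    print(join_quoted_newlines(sample), end="")
```

Passing replacement=" " joins the pieces with a space instead of gluing them together.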
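You can treat every line starting with " as the beginning of a new record. When such a line arrives, print what has been accumulated so far and start over; any other line is a continuation and is appended to the variable f (joined with FS, a space by default):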
$ awk '/^"/ {if (f) print f; f=$0; next} {f=f FS $0} END {print f}' file
"one",
"three four",
"seven"
Since we always print the previous block of text, note the need for END to print the last stored value after the whole file has been processed.
You can use sed for that:
sed -r '/^"[^"]+$/{:a;N;/",/!ba;s/\n/ /g}' text
The command searches for lines that start with a double quote but contain no further double quote: /^"[^"]+$/
When such a line is found, the label :a marks the start of a loop. The N command appends the next input line to the pattern space. As long as the pattern space does not contain the closing quote followed by a comma (/",/!), ba branches back to label a.
Once the closing quote is found, s/\n/ /g replaces all newlines in the pattern space with a space, and sed prints the buffer automatically. Because the loop looks for ", specifically, a multi-line string that is not followed by a comma (for example at the very end of the file) will not be joined.
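For comparison, the same append-lines-until-the-string-closes loop can be sketched in Python. Note that this sketch swaps the termination test: instead of looking for ", it keeps appending lines while the buffer holds an odd number of double quotes. It assumes there are no escaped quotes in the data, and the function name join_lines_with_open_quote is made up for illustration.

```python
def join_lines_with_open_quote(lines, sep=" "):
    """Yield logical lines, merging physical lines while a quote is open.

    Assumption: quotes are never escaped, so an odd number of double
    quotes in the buffer means a string is still open.
    """
    buffer = None
    for line in lines:
        line = line.rstrip("\n")
        # Append the new line to the pending buffer, like sed's N command.
        buffer = line if buffer is None else buffer + sep + line
        if buffer.count('"') % 2 == 0:
            # All quotes are balanced: the logical line is complete.
            yield buffer
            buffer = None
    if buffer is not None:
        # Input ended with an unterminated string; emit what we have.
        yield buffer


if __name__ == "__main__":
    data = ['"one",\n', '"three\n', 'four",\n', '"seven"\n']
    for logical_line in join_lines_with_open_quote(data):
        print(logical_line)
```

Counting quotes avoids the dependency on a trailing comma, at the cost of misbehaving on input with escaped or unbalanced quotes.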
A simplistic solution:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
    chomp;
    # A line starting with a quote begins a new output line.
    if (m/^\"/) { print "\n"; }
    print;
}
__DATA__
"one",
"three
four",
"seven"
But taking the specific case of CSV-style data, I'd suggest using a Perl module called Text::CSV, which parses CSV properly and treats the 'element with a linefeed' as part of the preceding row.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new( { binary => 1 } );
open( my $input, "<", "input.csv" ) or die $!;
while ( my $row = $csv->getline($input) ) {
    for (@$row) {
        #remove linefeeds in each 'element'.
        s/\n/ /g;
        #print this specific element ('naked' e.g. without quotes).
        print;
        print ",";
    }
    print "\n";
}
close($input);
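The same idea, letting a real CSV parser decide where a field ends, can be sketched with Python's standard csv module. This is an illustration rather than a drop-in replacement: the function name flatten_csv_newlines and the sample data are made up, and unlike the Perl version above it writes every field back quoted (csv.QUOTE_ALL).

```python
import csv
import io


def flatten_csv_newlines(src, dst, replacement=" "):
    """Copy CSV rows from src to dst, replacing newlines inside fields."""
    reader = csv.reader(src)
    # QUOTE_ALL quotes every field on output, numeric-looking ones included.
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        cleaned = [
            field.replace("\r\n", replacement).replace("\n", replacement)
            for field in row
        ]
        writer.writerow(cleaned)


if __name__ == "__main__":
    # Hypothetical sample with a line break inside the second field.
    src = io.StringIO('"one","three\nfour",12\n"seven","eight",9\n')
    dst = io.StringIO()
    flatten_csv_newlines(src, dst)
    print(dst.getvalue(), end="")
```

When reading from a real file, the csv documentation recommends opening it with newline="" so that line breaks embedded in quoted fields reach the parser unchanged.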
Tested in bash.
Purpose: replace each newline inside double quotes with a literal \n.
Works for Unix newlines (\n), Windows newlines (\r\n) and classic Mac newlines (\r).
echo -e '"line1\nline2"'
"line1
line2"
echo -e '"line1\nline2"' | gawk -v RS='"' 'NR % 2 == 0 { gsub(/\r\n|\n|\r/, "\\n") } { printf("%s%s", $0, RT) }'
"line1\nline2"