Match any character (including newlines) in sed

I have a sed command that I want to run on a huge, terrible, ugly HTML file that was created from a Microsoft Word document. All it should do is remove any instance of the string

style='text-align:center; color:blue;
exampleStyle:exampleValue'

The sed command that I am trying to modify is

sed "s/ style='[^']*'//" fileA > fileB

It works great, except that whenever there is a new line inside of the matching text, it doesn't match. Is there a modifier for sed, or something I can do to force matching of any character, including newlines?

I understand that regexps are terrible at XML and HTML, blah blah blah, but in this case, the string patterns are well-formed in that the style attributes always start with a single quote and end with a single quote. So if I could just solve the newline problem, I could cut down the size of the HTML by over 50% with just that one command.

In the end, it turned out that Sinan Ünür's perl script worked best. It was almost instantaneous, and it reduced the file size from 2.3 MB to 850k. Good ol' Perl...

Tags: html coding-style replace sed newline

5 answers

虎瘦雄心在

Reply #2 · 2019-01-27 22:53

Another way is something like this:

$ cat toreplace.txt 
I want to make \
this into one line

I also want to \
merge this line

$ sed -e 'N;N;s/\\\n//g;P;D;' toreplace.txt

Output:

I want to make this into one line

I also want to merge this line

N appends the next input line to the pattern space, P prints the pattern space up to the first newline, and D deletes the pattern space up to the first newline and starts the next cycle with whatever remains.


甜甜的少女心

Reply #3 · 2019-01-27 23:00

sed processes its input one line at a time, so, as I understand it, a plain s command cannot match across a newline.

You could use the following Perl script (untested), though:

#!/usr/bin/perl

use strict;
use warnings;

{
    local $/; # slurp mode
    my $html = <>;
    $html =~ s/ style='[^']*'//g;
    print $html;
}

__END__

A one liner would be:

$ perl -e 'local $/; $_ = <>; s/ style=\047[^\047]*\047//g; print' fileA > fileB


唯我独甜

Reply #4 · 2019-01-27 23:04

You could remove all CR/LF using tr, run sed, and then import into an editor that auto-formats.
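A rough sketch of what that pipeline might look like, assuming the fileA and fileB names from the question. The sample input and the temporary directory are made up for illustration.

```shell
# Hypothetical sample input: one style attribute spans two lines.
workdir=$(mktemp -d)
printf "%s\n" "<p style='text-align:center; color:blue;" \
              "exampleStyle:exampleValue'>Hello</p>" > "$workdir/fileA"

# Delete every CR and LF so the file becomes one long line,
# then let sed strip the attributes from that single line.
tr -d '\r\n' < "$workdir/fileA" | sed "s/ style='[^']*'//g" > "$workdir/fileB"

cat "$workdir/fileB"
```

Note that deleting the line breaks also joins words that were separated only by a newline; using tr '\r\n' '  ' to turn them into spaces instead is one way around that. Either way the result is a single line, which is why the reformatting step in an editor is needed afterwards.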


够拽才男人

Reply #5 · 2019-01-27 23:07

You can try this:

awk '/style/&&/exampleValue/{
    gsub(/style.*exampleValue\047/,"")
}
/style/&&!/exampleValue/{     
    gsub(/style.* /,"")
    f=1        
}
f &&/exampleValue/{  
  gsub(/.*exampleValue\047 /,"")
  f=0
}
1
' file

Sample input and output:

# more file
this is a line
    style='text-align:center; color:blue; exampleStyle:exampleValue'
this is a line
blah
blah
style='text-align:center; color:blue;
exampleStyle:exampleValue' blah blah....

# ./test.sh
this is a line

this is a line
blah
blah
blah blah....


欢心

Reply #6 · 2019-01-27 23:12

Sed reads the input line by line, so processing across more than one line is not simple, but it is not impossible either: you need to make use of sed's branching commands. The following will work; I have commented it to explain what is going on (not the most readable syntax!):

sed "# if the line matches 'style='', then branch to label, 
     # otherwise process next line
     /style='/b style
     b
     # the line contains 'style', try to do a replace
     : style
     s/ style='[^']*'//g
     # if the replace worked, then process next line
     t
     # otherwise append the next line to the pattern space and try again.
     N
     b style
 " fileA > fileB
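For comparison, here is a sketch of a different idiom from the branching script above: instead of appending lines only when needed, it pulls the entire file into the pattern space first and then runs one global substitution. The sample input and the temporary directory are made up for illustration, and since the whole file is held in memory this is only reasonable for files of modest size.

```shell
# Hypothetical sample input: the first style attribute spans two lines.
workdir=$(mktemp -d)
printf "%s\n" "<p style='text-align:center; color:blue;" \
              "exampleStyle:exampleValue'>Hello</p>" \
              "<p style='color:red'>World</p>" > "$workdir/fileA"

# :a        define label a
# $!{N;ba}  on every line but the last, append the next line and loop back
# s///g     once the whole file is in the pattern space, strip all attributes
sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' \
    -e "s/ style='[^']*'//g" "$workdir/fileA" > "$workdir/fileB"

cat "$workdir/fileB"
```

The commands are split across several -e options because some sed implementations do not accept a label or a closing brace followed by a semicolon on the same line.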

