Trouble Getting Regular Expression To Work

I'm trying to use regular expressions to remove certain blocks of code from a text file. So far, most of my substitutions have worked. However, I have two questions:

1) Whenever I remove a chunk of text, blank space is left where the text used to be, rather than the text simply being removed. An example of my regex code is:

$file =~ s/<ul(.*)>//gi;

This removes all lines with the basic format <ul...>, which is what I want. However, as mentioned above, it replaces the tag and all contained data with blank space, and I was wondering how to stop that.

2) Certain regular expressions that should work don't seem to. For instance, I want to remove

<script type="text/javascript"> 

function getCookies() { return ""; }

</script>

I have tried various regexes, but nothing seems to remove these lines. For instance:

$file =~ s/<script type(.*)<\/script>//gi;

This removes the <script type...> and </script> tags, but leaves the

function getCookies() { return ""; }

...intact. I'm unsure as to why this happens, and I would very much like to correct this. How would this be possible? Any help on either of these two questions would be immensely helpful!

Edit: Sorry all, I'm using Perl! Also: I just tried using

$file =~ /<script type(.*)<\/script>/sgi

...as well as /msgi, but neither worked unfortunately. Both the <script type> and </script> tags were removed, but for some reason the

function getCookies() { return ""; }

...section stayed. Here is my entire code, including all regex:

use strict;
use warnings;

my $firstarg;
if ($ARGV[0]){
  $firstarg = $ARGV[0];
}

open (DATA, $ARGV[1]);
my $file = do {local $/; <DATA>};

$file =~ s/<\!DOCTYPE(.*)>//gi;
$file =~ s/<html>//gi;
$file =~ s/<\/html>//gi;
$file =~ s/<title>//gi;
$file =~ s/<\/title>//gi;
$file =~ s/<head>//gi;
$file =~ s/<\/head>//gi;
$file =~ s/<link(.*)>//gi;
$file =~ s/<\link>//gi;
$file =~ s/CDM(.*)\;//gi;
$file =~ s/<\!(.*)->//gi;
$file =~ s/<body(.*)>//gi;
$file =~ s/<\/body>//gi;
$file =~ s/<div(.*)>//gi;
$file =~ s/<\/div>//gi;
$file =~ s/function(.*)>//gi;
$file =~ s/<noscript>//gi;
$file =~ s/<\/noscript>//gi;
$file =~ s/<a(.*)>//gi;
$file =~ s/<\/a>//gi;
$file =~ s/<ul(.*)>//gi;
$file =~ s/<\/ul>//gi;
$file =~ s/<li(.*)>//gi;
$file =~ s/<\/li>//gi;
$file =~ s/<form(.*)>//gi;
$file =~ s/<\/form>//gi;
$file =~ s/<iframe(.*)>//gi;
$file =~ s/<\/iframe>//gi;
$file =~ s/<select(.*)>//gi;
$file =~ s/<\/select>//gi;
$file =~ s/<textarea(.*)>//gi;
$file =~ s/<\/textarea>//gi;
$file =~ s/<b>//gi;
$file =~ s/<\/b>//gi;
$file =~ s/<H1>//gi;
$file =~ s/<H2>//gi;
$file =~ s/<H3>//gi;
$file =~ s/<H4>//gi;
$file =~ s/<H5>//gi;
$file =~ s/<H6>//gi;
$file =~ s/<\/H1>//gi;
$file =~ s/<\/H2>//gi;
$file =~ s/<\/H3>//gi;
$file =~ s/<\/H4>//gi;
$file =~ s/<\/H5>//gi;
$file =~ s/<\/H6>//gi;
$file =~ s/<option(.*)>//gi;
$file =~ s/<\/option>//gi;
$file =~ s/<p>//gi;
$file =~ s/<\/p>//gi;
$file =~ s/<span(.*)>//gi;
$file =~ s/<\/span>//gi;
$file =~ s/<!doctype(.*)>//gi;
$file =~ s/<base(.*)>//gi;
$file =~ s/<br>//gi;
$file =~ s/<hr>//gi;
$file =~ s/<img(.*)>//gi;
$file =~ s/<input(.*)>//gi;
$file =~ s/<link(.*)>//gi;
$file =~ s/<meta(.*)>//gi;
$file =~ s/<script type(.*)<\/script>//gi;
print $file;

Ok, now that I deleted the <script> regex that was causing one problem, another has been created - using:

$file =~ s/<script type(.*)<\/script>//gi;

removes everything in between the first instance of <script ...>, but not the tag itself, nor the repetitions of the tag throughout. Using:

$file =~ s/<script type(.*)<\/script>//mgi;

results in the exact same thing. Using:

$file =~ s/<script type(.*)<\/script>//sgi;

results in the printing of several new line characters, but no other text, same for /msgi. Urgh, the problems never end... :(

NEW EDIT: I would like to apologize for posting a question about parsing HTML using regex. I realize that there is a rather large backlash within the programming community regarding this practice (or attempt at practice, since it seems to fail more often than not). However, I am unfortunately forced to use regex on selected HTML files, ones from which it should be possible to remove the majority, if not all, of the HTML tags. I am not allowed to use a module, despite this being the most obvious and simplest of answers.

Tags: html regex perl

5 Answers

唯我独甜

Reply #2 · 2019-07-24 22:05

To reply to your last comment:

perl -e'$file="<script etc>\nfoo\n</script>bar"; $file =~ s/<script.*script>//gis; print $file'

This does seem to do what you want, as suggested by others. I don't see how it differs from what you're trying, though.

....

Can you add this:

use Data::Dumper;
$Data::Dumper::Useqq=1;
print Dumper($file);

before the regexp and give us the result?

.....

Bingo:

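Lines 5 and 6 of your list of $file =~ substitutions already strip the <script ...> and </script> tags. By the time your later <script type(.*)<\/script> substitution runs, there are no script tags left for it to match, so the text that was between them stays in the output: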

$file =~ s/<\!DOCTYPE(.*)>//gi;
$file =~ s/<html>//gi;
$file =~ s/<\/html>//gi;
$file =~ s/<title>//gi;
$file =~ s/<\/title>//gi;
## Here they come:
$file =~ s/<script(.*)>//gi;
$file =~ s/<\/script>//gi;
$file =~ s/<head>//gi;


We Are One

Reply #3 · 2019-07-24 22:08

You’re going to have to be a lot more careful than that. See both approaches in this answer.


戒情不戒烟

Reply #4 · 2019-07-24 22:09

I'm not sure what programming language you're using, but assuming that you're using Perl, try putting the s modifier at the end of the regex:

$file =~ /<script type(.*)<\/script>/sgi

The /s modifier makes the . match any character, including newlines (normally it doesn't match newlines).
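
A small sketch of the difference, using a made-up three-line string (not the asker's real file):

```perl
use strict;
use warnings;

# Hypothetical sample: a script block that spans several lines.
my $text = "<script>\nfoo\n</script>bar";

# Without /s, "." stops at a newline, so the pattern cannot span lines
# and nothing is replaced.
(my $without_s = $text) =~ s/<script.*<\/script>//gi;

# With /s, "." also matches "\n", so the whole block is removed.
(my $with_s = $text) =~ s/<script.*<\/script>//gis;

print "without /s: [$without_s]\n";
print "with /s:    [$with_s]\n";
```

Note that /m is a different modifier: it only changes where ^ and $ match, and does not affect what . matches.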

Edit: I apologize, I'm not good at Perl, but I did some looking around and I finally realized that the s/ in front is for substitutions. In this case, your regex should be:

$file =~ s/<script type(.*)<\/script>//sgi;

to remove everything, including the script tags. However, if you just want to remove the content between the tags while keeping the tags themselves, it is:

$file =~ s/(<script type="[^"]*"\s*>).*?(<\/script>)/$1$2/sgi;

Notice the $1$2 between the slashes. This text is the replacement text. In this case we are using the text from the capturing groups in place of the original. In your question you were using two slashes in a row (s/<ul(.*)>//gi), which means you're replacing the whole match with an empty string. It seems to me that you're actually looking to replace everything with a blank space (ASCII 0x20), like s/<ul(.*)>/ /gi.

Since your last edit: you'll want to use one regex for the scripts, since you don't want their contents (the non-greedy .*? keeps each match within a single script block):

$file =~ s/(<script type="[^"]*"\s*>).*?(<\/script>)/ /sgi;

and another generic regex for all the other tags:

$file =~ s/<\/?\s*[^>]+>//sgi;

I'm assuming here that you don't want to limit to just the tags you displayed above, you just want to kill all HTML. There is a *nix utility called html2text that does this. You might want to look into using that.
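
As a rough sketch of that two-step approach, the substitutions can be wrapped in one helper. The name strip_html and the sample input are assumptions for illustration, not part of the original code, and the script pattern is loosened slightly so that it does not depend on the exact type="..." attribute:

```perl
use strict;
use warnings;

# Hypothetical helper; the name and structure are illustrative only.
sub strip_html {
    my ($html) = @_;
    # 1) Drop script blocks together with their contents (non-greedy, /s).
    $html =~ s/<script\b[^>]*>.*?<\/script\s*>//gis;
    # 2) Drop every remaining tag, keeping the text between tags.
    $html =~ s/<\/?\s*[^>]+>//gs;
    return $html;
}

my $out = strip_html(qq{<ul class="x"><li>one</li></ul><script type="text/javascript">\nvar a = 1;\n</script><p>two</p>});
print $out, "\n";   # prints "onetwo"
```

This will still misbehave on input such as a ">" inside an attribute value or script text that itself contains "</script>", so it is only reasonable for HTML you control.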


够拽才男人

Reply #5 · 2019-07-24 22:15

If you are not allowed to use anything but Perl regular expressions, then you could adapt the following code, which strips HTML tags from text:

#!/usr/bin/perl -w
use strict;
use warnings;

$_ = do { local $/; <DATA> };

# see http://www.perlmonks.org/?node_id=161281
# ALGORITHM:
#   find < ,
#       comment <!-- ... -->,
#       or comment <? ... ?> ,
#       or one of the start tags which require correspond
#           end tag plus all to end tag
#       or if \s or ="
#           then skip to next "
#           else [^>]
#   >
s{
  <               # open tag
  (?:             # open group (A)
    (!--) |       #   comment (1) or
    (\?) |        #   another comment (2) or
    (?i:          #   open group (B) for /i
      (           #     one of start tags
        SCRIPT |  #     for which
        APPLET |  #     must be skipped
        OBJECT |  #     all content
        STYLE     #     to correspond
      )           #     end tag (3)
    ) |           #   close group (B), or
    ([!/A-Za-z])  #   one of these chars, remember in (4)
  )               # close group (A)
  (?(4)           # if previous case is (4)
    (?:           #   open group (C)
      (?!         #     and next is not : (D)
        [\s=]     #       \s or "="
        ["`']     #       with open quotes
      )           #     close (D)
      [^>] |      #     and not close tag or
      [\s=]       #     \s or "=" with
      `[^`]*` |   #     something in quotes ` or
      [\s=]       #     \s or "=" with
      '[^']*' |   #     something in quotes ' or
      [\s=]       #     \s or "=" with
      "[^"]*"     #     something in quotes "
    )*            #   repeat (C) 0 or more times
  |               # else (if previous case is not (4))
    .*?           #   minimum of any chars
  )               # end if previous char is (4)
  (?(1)           # if comment (1)
    (?<=--)       #   wait for "--"
  )               # end if comment (1)
  (?(2)           # if another comment (2)
    (?<=\?)       #   wait for "?"
  )               # end if another comment (2)
  (?(3)           # if one of tags-containers (3)
    </            #   wait for end
    (?i:\3)       #   of this tag
    (?:\s[^>]*)?  #   skip junk to ">"
  )               # end if (3)
  >               # tag closed
 }{}gsx;         # STRIP THIS TAG

print;

__END__
<html><title>remove script, ul</title>
<script type="text/javascript"> 

function getCookies() { return ""; }

</script>
<body>
<ul><li>1
<li>2
<p>paragraph

Output

remove script, ul


1
2
paragraph

NOTE: This regex doesn't work for nested tag-containers e.g.:

<!DOCTYPE html>
<meta charset="UTF-8">
<title>Nested &lt;object> example</title>
<body>
<object data="uri:here">fallback content for uri:here
  <object data="uri:another">uri:another fallback
  </object>!!!this text should be striped too!!!
</object>

Output

Nested &lt;object> example

!!!this text should be striped too!!!

Don't parse HTML with regexes. Use an HTML parser or a tool built on top of one, e.g. HTML::Parser:

#!/usr/bin/perl -w
use strict;
use warnings;

use HTML::Parser ();

HTML::Parser->new(
    ignore_elements => ["script"],
    ignore_tags => ["ul"],
    default_h => [ sub { print shift }, 'text'],
    )->parse_file(\*DATA) or die "error: $!\n";

__END__
<html><title>remove script, ul</title>
<script type="text/javascript"> 

function getCookies() { return ""; }

</script>
<body>
<ul><li>1
<li>2
<p>paragraph

Output

<html><title>remove script, ul</title>

<body>
<li>1
<li>2
<p>paragraph


beautiful°

Reply #6 · 2019-07-24 22:23

This:

$file =~ s/<div(.*)>//gi;

won't do what you expect. The '*' quantifier is greedy. If you have a line like:

hello<div id="foo"><b>bar!</b>baz

it'll substitute as much as it can, leaving only:

hellobaz

You want one of these instead (a negated character class, or a non-greedy quantifier):

$file =~ s/<div[^>]*>//gi;

$file =~ s/<div.*?>//gi;
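
A quick way to compare the three patterns on the example line above (the variable names are only illustrative):

```perl
use strict;
use warnings;

my $line = 'hello<div id="foo"><b>bar!</b>baz';

(my $greedy  = $line) =~ s/<div(.*)>//gi;    # runs to the LAST ">" on the line
(my $negated = $line) =~ s/<div[^>]*>//gi;   # stops at the first ">"
(my $lazy    = $line) =~ s/<div.*?>//gi;     # also stops at the first ">"

print "$greedy\n";    # hellobaz
print "$negated\n";   # hello<b>bar!</b>baz
print "$lazy\n";      # hello<b>bar!</b>baz
```

The negated class [^>]* is usually the safer choice of the two, since it can never run past the end of the tag, and unlike . it also matches a tag that is split across lines.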
