I'm trying to use regular expressions to remove certain blocks of code from a text file. So far, most of my regular expressions have worked to remove the code. However, I have two questions:
1) Whenever I remove a chunk of text, the place where the text was is filled with blank space, rather than the text simply being removed. An example of my regex code is:
$file =~ s/<ul(.*)>//gi;
Which removes all lines with the basic format <ul...>, which is what I want it to do. However, as mentioned above, it replaces the tag and all contained data with blank space, and I was wondering how to stop this particular substitution.
2) Certain regular expression codes that should work, don't seem to. For instance, I want to remove
<script type="text/javascript">
function getCookies() { return ""; }
</script>
I have tried using various regex codes, but nothing seems to remove these lines. For instance:
$file =~ s/<script type(.*)<\/script>//gi;
Which removes the <script type...> and </script> tags, but leaves the
function getCookies() { return ""; }
...intact. I'm unsure as to why this happens, and I would very much like to correct this. How would this be possible? Any help on either of these two questions would be immensely helpful!
Edit: Sorry all, I'm using Perl! Also: I just tried using
$file =~ /<script type(.*)<\/script>/sgi
...as well as /msgi, but neither worked unfortunately. Both the <script type> and </script> tags were removed, but for some reason the
function getCookies() { return ""; }
...section stayed. Here is my entire code, including all regex:
use strict;
use warnings;
my $firstarg;
if ($ARGV[0]){
$firstarg = $ARGV[0];
}
open (DATA, $ARGV[1]);
my $file = do {local $/; <DATA>};
$file =~ s/<\!DOCTYPE(.*)>//gi;
$file =~ s/<html>//gi;
$file =~ s/<\/html>//gi;
$file =~ s/<title>//gi;
$file =~ s/<\/title>//gi;
$file =~ s/<head>//gi;
$file =~ s/<\/head>//gi;
$file =~ s/<link(.*)>//gi;
$file =~ s/<\link>//gi;
$file =~ s/CDM(.*)\;//gi;
$file =~ s/<\!(.*)->//gi;
$file =~ s/<body(.*)>//gi;
$file =~ s/<\/body>//gi;
$file =~ s/<div(.*)>//gi;
$file =~ s/<\/div>//gi;
$file =~ s/function(.*)>//gi;
$file =~ s/<noscript>//gi;
$file =~ s/<\/noscript>//gi;
$file =~ s/<a(.*)>//gi;
$file =~ s/<\/a>//gi;
$file =~ s/<ul(.*)>//gi;
$file =~ s/<\/ul>//gi;
$file =~ s/<li(.*)>//gi;
$file =~ s/<\/li>//gi;
$file =~ s/<form(.*)>//gi;
$file =~ s/<\/form>//gi;
$file =~ s/<iframe(.*)>//gi;
$file =~ s/<\/iframe>//gi;
$file =~ s/<select(.*)>//gi;
$file =~ s/<\/select>//gi;
$file =~ s/<textarea(.*)>//gi;
$file =~ s/<\/textarea>//gi;
$file =~ s/<b>//gi;
$file =~ s/<\/b>//gi;
$file =~ s/<H1>//gi;
$file =~ s/<H2>//gi;
$file =~ s/<H3>//gi;
$file =~ s/<H4>//gi;
$file =~ s/<H5>//gi;
$file =~ s/<H6>//gi;
$file =~ s/<\/H1>//gi;
$file =~ s/<\/H2>//gi;
$file =~ s/<\/H3>//gi;
$file =~ s/<\/H4>//gi;
$file =~ s/<\/H5>//gi;
$file =~ s/<\/H6>//gi;
$file =~ s/<option(.*)>//gi;
$file =~ s/<\/option>//gi;
$file =~ s/<p>//gi;
$file =~ s/<\/p>//gi;
$file =~ s/<span(.*)>//gi;
$file =~ s/<\/span>//gi;
$file =~ s/<!doctype(.*)>//gi;
$file =~ s/<base(.*)>//gi;
$file =~ s/<br>//gi;
$file =~ s/<hr>//gi;
$file =~ s/<img(.*)>//gi;
$file =~ s/<input(.*)>//gi;
$file =~ s/<link(.*)>//gi;
$file =~ s/<meta(.*)>//gi;
$file =~ s/<script type(.*)<\/script>//gi;
print $file;
Ok, now that I deleted the <script> regex that was causing one problem, another has been created - using:
$file =~ s/<script type(.*)<\/script>//gi;
removes everything in between the first instance of <script ...>, but not the tag itself, nor the repetitions of the tag throughout. Using:
$file =~ s/<script type(.*)<\/script>//mgi;
results in the exact same thing. Using:
$file =~ s/<script type(.*)<\/script>//sgi;
results in the printing of several newline characters, but no other text; same for /msgi.
Urgh, the problems never end... :(
NEW EDIT: I would like to apologize for posting a question about parsing HTML using regex. I realize that there is a rather large backlash within the programming community regarding this practice (or attempt at practice, since this seems to fail more often than not). However, I am unfortunately forced to use regex to process selected HTML files, from which it should be possible to remove the majority, if not all, of the HTML tags. I am not allowed to use a module, despite this being the most obvious and simplest of answers.
To reply to your last comment:
this does seem to do what you want, as suggested by others. I don't see how that is different from what you're trying, though.
....
Can you add this:
before the regexp and give us the result?
.....
Bingo:
line 5 and 6 of your $file =~ list already filter them out:
You’re going to have to be a lot more careful than that. See both approaches in this answer.
I'm not sure what programming language you're using, but assuming that you're in Perl, try putting the s modifier at the end of the regex:
The /s modifier makes the . match any character, including newlines (normally it doesn't match newlines).
Edit: I apologize, I'm not good at Perl, but I did some looking around and I finally realized that the s/ in front is for substitutions. In this case, your regex should be:
to remove everything, including the script tags. However, if you just want the content between the tags it is:
Notice the $1$2 between the slashes. This text is the replacement text. In this case we are using the text from the capturing groups in place of the original. In your question you were using two slashes in a row (s/<ul(.*)>//gi), which means you're replacing the whole match with an empty string. It seems to me that you're actually looking to replace everything with a blank space (ASCII 0x20), like s/<ul(.*)>/ /gi.
Since your last edit - You'll want to use one regex for the scripts since you don't want the contents:
and another generic regex for all the other tags:
I'm assuming here that you don't want to limit to just the tags you displayed above, you just want to kill all HTML. There is a *nix utility called html2text that does this. You might want to look into using that.
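A minimal sketch of that two-step idea, assuming plain Perl with no modules. The sample input, the optional blank-line cleanup, and the exact patterns are illustrative assumptions, not necessarily the regexes originally intended here:

```perl
use strict;
use warnings;

# Hypothetical sample input; the real $file would come from the slurped file.
my $file = <<'HTML';
<html><head><title>Demo</title>
<script type="text/javascript">
function getCookies() { return ""; }
</script>
</head>
<body><ul class="nav"><li>One</li><li>Two</li></ul></body></html>
HTML

# Step 1: drop each <script> element together with its contents.
# /s lets . match newlines; .*? is non-greedy, so each match stops at the
# nearest </script> instead of the last one in the file.
$file =~ s/<script\b[^>]*>.*?<\/script>//sgi;

# Step 2: drop every remaining tag but keep the text between tags.
# [^>]* cannot run past the closing > of the tag it started in.
$file =~ s/<[^>]*>//g;

# Optional: collapse the blank lines left behind where tags used to be.
$file =~ s/\n\s*\n/\n/g;

print $file;
```

The script pass has to run before the generic pass; otherwise the generic pass strips the <script> tags first and leaves the JavaScript body behind as plain text, which matches the symptom described in the question. Like any regex-only approach, this can still be fooled by a literal > inside an attribute value or a comment.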
If you are not allowed to use anything but Perl regular expressions then you could adapt the code to strip HTML tags from a text:
Output
NOTE: This regex doesn't work for nested tag-containers e.g.:
Output
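To make the nesting limitation concrete, here is a small sketch. The input string and the pattern are hypothetical, chosen only to show the failure mode, and are not the code this answer originally contained:

```perl
use strict;
use warnings;

# Hypothetical input: a <div> nested inside another <div>.
my $html = '<div>outer<div>inner</div>tail</div>after';

# Non-greedy container removal stops at the FIRST </div>, which belongs
# to the inner element, so part of the outer element is left behind.
(my $stripped = $html) =~ s/<div\b[^>]*>.*?<\/div>//sgi;

print "$stripped\n";   # prints "tail</div>after"
```

A regex has no way to count matching open and close tags in general, which is why containers of the same kind nested inside each other need a real parser or several passes.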
Don't parse HTML with regexes. Use an HTML parser or a tool built on top of it, e.g., HTML::Parser:
Output
This:
won't do what you expect. The '*' operator is greedy. If you have a line like:
it'll substitute as much as it can, leaving only:
You want:
or
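As a rough illustration of the greedy behaviour described above, the sketch below compares three patterns on one made-up line. The sample line and variable names are assumptions for the demonstration; the non-greedy and negated-class forms are the two usual alternatives, not necessarily the exact ones intended above:

```perl
use strict;
use warnings;

# Hypothetical one-line input with several tags on the same line.
my $line = '<ul class="a">keep me<li>item</li>';

(my $greedy  = $line) =~ s/<ul(.*)>//gi;     # .* runs to the LAST > on the line
(my $lazy    = $line) =~ s/<ul(.*?)>//gi;    # .*? stops at the FIRST >
(my $negated = $line) =~ s/<ul([^>]*)>//gi;  # [^>]* can never cross a >

print "greedy:  '$greedy'\n";    # the whole line is gone
print "lazy:    '$lazy'\n";      # only the <ul ...> tag is removed
print "negated: '$negated'\n";   # same result as the lazy version
```

The negated class is often preferred because it still behaves when /s is added, whereas .* with /s would happily run across newlines to the last > in the entire file.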