Question:
I have a file in moinmoin text format:
* [[ Virtualbox Guest Additions]] (2011/10/17 15:19)
* [[ Abiword Wordprocessor]] (2010/10/27 20:17)
* [[ Sylpheed E-Mail]] (2010/03/30 21:49)
* [[ Kupfer]] (2010/05/16 20:18)
All the words between the '[[' and ']]' are the short description of the entry. I need to extract the whole entry, but not each individual word.
I found an answer for a similar question here: https://stackoverflow.com/a/2700749/819596
but I can't understand the answer: "my @array = $str =~ /( \{ (?: [^{}]* | (?0) )* \} )/xg;"
Anything that works will be accepted, but explanations would help greatly, e.g. what (?0) or /xg does.
Answer 1:
The code will probably look like this:
use warnings;
use strict;
my @subjects; # declaring a lexical variable to store all the subjects
my $pattern = qr/
\[ \[ # matching two `[` signs
\s* # ... and, if any, whitespace after them
([^]]+) # starting from the first non-whitespace symbol, capture all the non-']' symbols
]]
/x;
# main processing loop:
while (<DATA>) { # reading the source file line by line
if (/$pattern/) { # if line is matched by our pattern
push @subjects, $1; # ... push the captured group of symbols into our array
}
}
print $_, "\n" for @subjects; # print our array of subjects, one per line
__DATA__
* [[ Virtualbox Guest Additions]] (2011/10/17 15:19)
* [[ Abiword Wordprocessor]] (2010/10/27 20:17)
* [[ Sylpheed E-Mail]] (2010/03/30 21:49)
* [[ Kupfer]] (2010/05/16 20:18)
As I see it, what you need can be described as follows: in each line of the file, try to find this sequence of symbols:
[[, an opening delimiter,
then 0 or more whitespace symbols,
then all the symbols that make a subject (which should be saved),
then ]], a closing delimiter
As you see, this description translates quite naturally into a regex. The only thing that is probably not needed is the /x regex modifier, which allowed me to comment the pattern extensively.
Answer 2:
If the text will never contain ], you can simply use the following, as previously recommended:
/\[\[ ( [^\]]* ) \]\]/x
The following allows ] in the contained text, but I recommend against incorporating it into a larger pattern:
/\[\[ ( .*? ) \]\]/x
The following allows ] in the contained text, and is the most robust solution:
/\[\[ ( (?:(?!\]\]).)* ) \]\]/x
For example,
if (my ($match) = $line =~ /\[\[ ( (?:(?!\]\]).)* ) \]\]/x) {
print "$match\n";
}
or
my @matches = $file =~ /\[\[ ( (?:(?!\]\]).)* ) \]\]/xg;
/x: Ignore whitespace in the pattern. This allows spaces to be added to make the pattern readable without changing its meaning. Documented in perlre.
/g: Find all matches. Documented in perlop.
(?0) was used to make the pattern recursive, since the linked answer had to deal with arbitrary nesting of curlies.
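To make the difference between the three patterns concrete, here is a small sketch that runs each of them over the same input. The sample text and the helper name extract_all are assumptions made up for illustration; the second entry deliberately contains a single ] inside the brackets.

```perl
use strict;
use warnings;

# Hypothetical sample: the second entry has a lone ']' inside the brackets.
my $text = "* [[ Kupfer]] (2010/05/16 20:18)\n"
         . "* [[ Notes [draft] v2]] (2010/01/01 10:00)\n";

# Return every capture of the given compiled pattern as a list.
sub extract_all {
    my ($re, $str) = @_;
    return $str =~ /$re/g;
}

my $no_bracket = qr/\[\[ ( [^\]]* ) \]\]/x;          # content may not contain ']'
my $lazy       = qr/\[\[ ( .*? ) \]\]/x;             # shortest run up to ']]'
my $tempered   = qr/\[\[ ( (?:(?!\]\]).)* ) \]\]/x;  # any char that does not start ']]'

for my $re ($no_bracket, $lazy, $tempered) {
    print join('|', extract_all($re, $text)), "\n";
}
```

The first pattern finds only the Kupfer entry, because it cannot cross the lone ]; the other two find both entries.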
Answer 3:
\[\[(.*)]]
\[ is a literal [,
] is a literal ],
.* means any sequence of zero or more characters,
and anything enclosed in parentheses is a capturing group, so you can access it later in your script with $1 (or $2 .. $9, depending on how many groups you have).
Put together, this matches two [ and then everything up to the last occurrence of two successive ] on the line.
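Because .* is greedy, "up to the last occurrence" matters as soon as a line holds more than one entry. A minimal sketch, using a made-up line with two entries (the sample data is an assumption, not taken from the question's file):

```perl
use strict;
use warnings;

# Hypothetical line with two bracketed entries.
my $line = "[[ Kupfer]] and [[ Sylpheed E-Mail]]";

# Greedy: .* runs to the LAST "]]" on the line, so one capture spans both entries.
my ($greedy) = $line =~ /\[\[(.*)]]/;

# Non-greedy: .*? stops at the FIRST "]]"; with /g each entry is captured separately.
my @lazy = $line =~ /\[\[(.*?)]]/g;

print "greedy: '$greedy'\n";
print "lazy:   '$_'\n" for @lazy;
```

For the question's file, where each line holds exactly one entry, both forms give the same result.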
Update
On a second read of your question I am suddenly confused: do you need the content between [[ and ]], or the whole line? In the latter case, leave the parentheses out completely and just test whether the pattern matches; there is no need to capture.
Answer 4:
The answer you found is for recursive pattern matching, which I think you don't need.
/x lets you use insignificant whitespace and comments in the regexp.
/g applies the regexp across the whole string. Without it, matching stops at the first match.
/xg is /x and /g combined.
(?0) runs the whole regexp again at that point (recursion).
If I understand correctly, you need something like this:
$text="* [[ Virtualbox Guest Additions]] (2011/10/17 15:19)
* [[ Abiword Wordprocessor]] (2010/10/27 20:17)
* [[ Sylpheed E-Mail]] (2010/03/30 21:49)
* [[ Kupfer]] (2010/05/16 20:18)
";
@array=($text=~/\[\[([^\]]*)\]\]/g);
print join(",",@array);
# this prints " Virtualbox Guest Additions, Abiword Wordprocessor, Sylpheed E-Mail, Kupfer"
Answer 5:
I would recommend using "extract_bracketed" or "extract_delimited" from module Text::Balanced - see here: http://perldoc.perl.org/Text/Balanced.html
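As a rough sketch of how extract_bracketed might be applied to one line of the file: the helper name bracketed_subject is made up for illustration, and the prefix pattern passed as the third argument (skip everything before the first [) is an assumption about how the lines are laid out.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical helper: return the text inside the first [[ ... ]] of a line.
sub bracketed_subject {
    my ($line) = @_;
    # Third argument is the prefix to skip before the bracketed text;
    # here: any run of characters that are not '['.
    my ($extracted) = extract_bracketed($line, '[]', '[^\[]*');
    return unless defined $extracted && length $extracted;
    # $extracted still includes the delimiters, e.g. "[[ Kupfer]]";
    # strip them together with the surrounding whitespace.
    $extracted =~ s/^\[\[\s*|\s*\]\]$//g;
    return $extracted;
}

while (my $line = <DATA>) {
    my $subject = bracketed_subject($line);
    print "$subject\n" if defined $subject;
}

__DATA__
* [[ Virtualbox Guest Additions]] (2011/10/17 15:19)
* [[ Kupfer]] (2010/05/16 20:18)
```

Text::Balanced requires the brackets to be balanced, so it is mostly worth the extra dependency when entries can nest; for the flat format in the question a plain regex is simpler.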
Answer 6:
perl -pe 's/.*\[\[(.*)\]\].*/\1/g' temp
tested below:
> cat temp
* [[ Virtualbox Guest Additions]] (2011/10/17 15:19)
* [[ Abiword Wordprocessor]] (2010/10/27 20:17)
* [[ Sylpheed E-Mail]] (2010/03/30 21:49)
* [[ Kupfer]] (2010/05/16 20:18)
>
> perl -pe 's/.*\[\[(.*)\]\].*/\1/g' temp
Virtualbox Guest Additions
Abiword Wordprocessor
Sylpheed E-Mail
Kupfer
>
- s/.*\[\[(.*)\]\].*/\1/g
- .*\[\[ -> matches any characters up to "[["
- (.*)\]\] -> stores the characters after "[[" up to "]]" in \1
- .* -> matches the rest of the line.
Since the data is now in \1, we can simply use it as the replacement, and -p prints the result to the console.
Answer 7:
my @array = $str =~ /( \{ (?: [^{}]* | (?0) )* \} )/xg;
The 'x' flag means that whitespace is ignored in the regex, to allow for a more readable expression. The 'g' flag means that the result will be a list of all matches from left to right (match globally).
The (?0) recurses into the whole pattern, which here is the same as the first group of parentheses because that group encloses the entire regex. It's a recursive regular expression, equivalent to a set of rules such as:
E := '{' ( NoBrace | E )* '}'
NoBrace := [^{}]*
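A short sketch of that recursive pattern on nested braces may help; the input string is made up for illustration, and recursion with (?0) needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical input: one nested group and one flat group.
my $str = "a {b {c} d} e {f}";

# Each match is a complete, balanced {...} group. The inner {c} is consumed
# by the recursive call (?0), so it is not reported as a separate match.
my @groups = $str =~ /( \{ (?: [^{}]* | (?0) )* \} )/xg;

print "$_\n" for @groups;
# prints:
# {b {c} d}
# {f}
```

The moinmoin entries in the question never nest, which is why the simpler non-recursive patterns are enough there.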