Perl: How to extract a string between brackets

Published 2019-08-02 19:00

在下西门庆

Question:

I have a file in moinmoin text format:

* [[  Virtualbox Guest Additions]] (2011/10/17 15:19)
* [[  Abiword Wordprocessor]] (2010/10/27 20:17)
* [[  Sylpheed E-Mail]] (2010/03/30 21:49)
* [[   Kupfer]] (2010/05/16 20:18)

All the words between the '[[' and ']]' are the short description of the entry. I need to extract the whole entry, but not each individual word.

I found an answer for a similar question here: https://stackoverflow.com/a/2700749/819596 but can't understand the answer: "my @array = $str =~ /( \{ (?: [^{}]* | (?0) )* \} )/xg;"

Anything that works will be accepted, but explanations would help greatly, e.g. what (?0) or /xg does.

Answer 1:

The code will probably look like this:

use warnings; 
use strict;

my @subjects; # declaring a lexical variable to store all the subjects
my $pattern = qr/ 
  \[ \[    # matching two `[` signs
  \s*      # ... and, if any, whitespace after them
  ([^]]+) # starting from the first non-whitespace symbol, capture all the non-']' symbols
  ]]
/x;

# main processing loop:
while (<DATA>) { # reading the source file line by line
  if (/$pattern/) {      # if line is matched by our pattern
    push @subjects, $1;  # ... push the captured group of symbols into our array
  }
}
print $_, "\n" for @subjects; # print the collected subjects, one per line

__DATA__
* [[  Virtualbox Guest Additions]] (2011/10/17 15:19)
* [[  Abiword Wordprocessor]] (2010/10/27 20:17)
* [[  Sylpheed E-Mail]] (2010/03/30 21:49)
* [[   Kupfer]] (2010/05/16 20:18)

As I see it, what you need can be described as follows: in each line of the file, try to find this sequence of symbols...

[[, an opening delimiter, 
then 0 or more whitespace symbols,
then all the symbols that make a subject (which should be saved),
then ]], a closing delimiter

As you can see, this description translates quite naturally into a regex. The only thing that is probably not strictly needed is the /x regex modifier, which allowed me to comment the pattern extensively.

Answer 2:

If the text will never contain ], you can simply use the following as previously recommended:

/\[\[ ( [^\]]* ) \]\]/x

The following allows ] in the contained text, but I recommend against incorporating it into a larger pattern:

/\[\[ ( .*? ) \]\]/x

The following allows ] in the contained text, and is the most robust solution:

/\[\[ ( (?:(?!\]\]).)* ) \]\]/x

For example,

if (my ($match) = $line =~ /\[\[ ( (?:(?!\]\]).)* ) \]\]/x) {
   print "$match\n";
}

my @matches = $file =~ /\[\[ ( (?:(?!\]\]).)* ) \]\]/xg;

/x: Ignore whitespace in the pattern. This allows spaces to be added to make the pattern readable without changing its meaning. Documented in perlre.
/g: Find all matches. Documented in perlop.
(?0) was used to make the pattern recursive, since the linked answer had to deal with arbitrary nesting of curlies. Documented in perlre.
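
To make the difference between the three patterns above concrete, here is a minimal sketch. The sample string is hypothetical (it is not from the question) and was chosen so that one entry contains a single ] inside the brackets:

```perl
use strict;
use warnings;

# Hypothetical sample: the first entry contains a lone ']' inside the brackets.
my $text = '[[a]b]] and [[c]]';

# Negated class: the capture cannot cross any ']', so '[[a]b]]' is skipped.
my @negated  = $text =~ /\[\[ ( [^\]]* ) \]\]/xg;
# Lazy quantifier: shortest run of characters that is followed by ']]'.
my @lazy     = $text =~ /\[\[ ( .*? ) \]\]/xg;
# Lookahead-guarded dot: consume any character that does not start ']]'.
my @tempered = $text =~ /\[\[ ( (?:(?!\]\]).)* ) \]\]/xg;

print "negated:  @negated\n";
print "lazy:     @lazy\n";
print "tempered: @tempered\n";
```

On this input the first pattern finds only the second entry, while the other two find both, including the one with the embedded ].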

Answer 3:

\[\[(.*)]]

\[ is a literal [, ] is a literal ], and .* means any sequence of zero or more characters. Something enclosed in parentheses is a capturing group, so you can access it later in your script with $1 (or $2 .. $9, depending on how many groups you have).

Put together, this matches two [ and then everything up to the last occurrence of two successive ].

Update: On a second read of your question I am suddenly confused: do you need the content between [[ and ]], or the whole line? In the latter case, leave the parentheses out completely and just test whether the pattern matches; there is no need to capture.
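
The "up to the last occurrence" behaviour matters only when a line holds more than one bracketed entry. A small sketch, assuming a hypothetical line with two entries (the question's data has one per line), compares the greedy pattern with a lazy variant:

```perl
use strict;
use warnings;

# Hypothetical line with two bracketed entries, to show the greedy behaviour.
my $line = '* [[one]] and [[two]]';

my ($greedy) = $line =~ /\[\[(.*)]]/;   # runs to the LAST  ']]' on the line
my ($lazy)   = $line =~ /\[\[(.*?)]]/;  # stops at the FIRST ']]' it can reach

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

With one entry per line, as in the question, both variants capture the same text.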

Answer 4:

The answer you found is for recursive pattern matching, which I think you don't need.

/x allows insignificant whitespace and comments in the regexp.
/g applies the regexp repeatedly through the whole string. Without it, matching stops at the first match.
/xg is /x and /g combined.
(?0) runs the whole regexp again at that point (recursion).

If I understand correctly, you need something like this:

$text="* [[  Virtualbox Guest Additions]] (2011/10/17 15:19)
* [[  Abiword Wordprocessor]] (2010/10/27 20:17)
* [[  Sylpheed E-Mail]] (2010/03/30 21:49)
* [[   Kupfer]] (2010/05/16 20:18)
";

@array=($text=~/\[\[([^\]]*)\]\]/g);
print join(",",@array);

# this prints "  Virtualbox Guest Additions,  Abiword Wordprocessor,  Sylpheed E-Mail,   Kupfer"
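
The output above keeps the leading spaces of each entry. If they are unwanted, one possible variation (a sketch, not part of the original answer) puts \s* before the capture group so the padding is matched but not captured:

```perl
use strict;
use warnings;

# Two of the question's lines, joined into one string.
my $text = "* [[  Virtualbox Guest Additions]] (2011/10/17 15:19)\n"
         . "* [[   Kupfer]] (2010/05/16 20:18)\n";

# \s* sits outside the parentheses, so the leading spaces are consumed
# but do not end up in the captured text.
my @array = $text =~ /\[\[\s*([^\]]*)\]\]/g;

print join(",", @array), "\n";
```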

Answer 5:

I would recommend using "extract_bracketed" or "extract_delimited" from module Text::Balanced - see here: http://perldoc.perl.org/Text/Balanced.html
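
A rough sketch of how extract_bracketed might be applied to one line of the question's data. The prefix pattern and the clean-up substitution are my own assumptions about how to adapt it here, not something the answer spells out:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $line = '* [[  Kupfer]] (2010/05/16 20:18)';

# Second argument: the bracket type to balance.
# Third argument: a prefix pattern to skip before the opening bracket
# (assumed here to be "anything that is not '['").
my ($extracted, $remainder) = extract_bracketed($line, '[]', qr/[^\[]*/);

# The extracted text still carries the outer [[ ... ]]; strip the
# delimiters and the leading padding.
my $subject = defined $extracted ? $extracted : '';
$subject =~ s/^\[\[\s*|\]\]$//g;

print "$subject\n";
```

Because the module balances brackets rather than searching for a fixed ]] terminator, it is mostly of interest when entries can themselves contain nested brackets.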

Answer 6:

perl -pe 's/.*\[\[(.*)\]\].*/\1/g' temp

tested below:

> cat temp
        * [[  Virtualbox Guest Additions]] (2011/10/17 15:19)
        * [[  Abiword Wordprocessor]] (2010/10/27 20:17)
        * [[  Sylpheed E-Mail]] (2010/03/30 21:49)
        * [[   Kupfer]] (2010/05/16 20:18)
>
> perl -pe 's/.*\[\[(.*)\]\].*/\1/g' temp
  Virtualbox Guest Additions
  Abiword Wordprocessor
  Sylpheed E-Mail
   Kupfer
>

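s/.*\[\[(.*)\]\].*/\1/g
.*\[\[ -> match any characters up to "[["
(.*)\]\] -> store the characters after "[[" up to "]]" in \1
.* -> match the rest of the line

Since the data is now in \1, the substitution replaces the whole line with it, and -p prints the result to the console. (In Perl, $1 is the preferred spelling of \1 on the replacement side.)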

Answer 7:

my @array = $str =~ /( \{ (?: [^{}]* | (?0) )* \} )/xg;

The 'x' flag means that whitespace is ignored in the regex, to allow for a more readable expression. The 'g' flag means that the result will be a list of all matches from left to right (match *g*lobally).

The (?0) construct recurses into the whole pattern, which here is the same as the first group of parentheses. It makes the regular expression recursive, equivalent to a set of rules such as:

E := '{' ( NoBrace | E )* '}'
NoBrace := [^{}]*
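
As a quick illustration of the recursion, the sketch below runs the same pattern on a hypothetical string with one nested and one flat group of curlies:

```perl
use strict;
use warnings;

# Hypothetical input: '{b {c} d}' is nested, '{f}' is flat.
my $str = 'a {b {c} d} e {f}';

# (?0) re-enters the whole pattern, so an inner { ... } group is consumed
# as part of the outer match. Recursion needs Perl 5.10 or later.
my @array = $str =~ /( \{ (?: [^{}]* | (?0) )* \} )/xg;

print "$_\n" for @array;
```

Each element of @array is a complete top-level { ... } group; the inner {c} is not returned separately because it was consumed inside the outer match.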

Tags: perl matching
