How do I extract and parse quoted strings in Perl?

Posted 2019-10-18 19:05 by 傲

Good day.

I have the following content in a text file, tmp.txt (a very large file):

constant fixup private AlarmFileName = <A "C:\\TMP\\ALARM.LOG">  /* A Format */

constant fixup ConfigAlarms = <U1 0>         /*  U1 Format  */

constant fixup ConfigEvents = <U2 0>         /*  U2 Format  */

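Below is my parsing code. It cannot handle the quoted string "C:\\TMP\\ALARM.LOG" shown above. I don't know how to change the part of the pattern that reads \s+([a-zA-Z0-9])+> so that it handles both a plain [a-zA-Z0-9] string (the 0 above) and a quoted string ("C:\\TMP\\ALARM.LOG" above).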

$source_file = "tmp.txt";
$dest_xml_file = "my.xml";

#Check existance of root directory
open(SOURCE_FILE, "$source_file") || die "Fail to open file $source_file";
open(DEST_XML_FILE, ">$dest_xml_file") || die "Coult not open output file $dest_xml_file";

$x = 0;

print DEST_XML_FILE  "<!-- from tmp.txt-->\n";
while (<SOURCE_FILE>) 
{
    &ConstantParseAndPrint;

}

sub ConstantParseAndPrint
{
 if ($x == 0)
 {

     if(/^\s*(constant)\s*(fixup|\/\*fixup\*\/|)\s*(private|)\s*(\w+)\s+=\s+<([a-zA-Z0-9]+)\s+([a-zA-Z0-9])+>\s*(\/\*\s*(.*?)\s*\*\/|)(\r|\n|\s)/)
                {
                    $name1 = $1;
                    $name2 = $2;
                    $name3 = $3;
                    $name4 = $4;
                    $name5 = $5;
                    $name6 = $6;
                    $name7 = $7;
                    printf DEST_XML_FILE "\t\t$name1";
                    printf DEST_XML_FILE "\t\t$name2";
                    printf DEST_XML_FILE "\t\t$name3";
                    printf DEST_XML_FILE "\t\t$name4";
                    printf DEST_XML_FILE "\t\t$name5";
                    printf DEST_XML_FILE "\t\t$name6";
                    printf DEST_XML_FILE "\t\t$name7";
                    $x = 1;
  }
 }
}

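Thanks for your advice.

**Hello everyone,

Thank you for so many great solutions. I am a newbie, and I would like to do more research based on your posts.

Thank you very much.**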

Answer 1:

#!/usr/bin/perl


$source_file = "tmp.txt";
$dest_xml_file = "my.xml";

#Check existence of root directory
open(SOURCE_FILE, "$source_file") || die "Fail to open file $source_file";
open(DEST_XML_FILE, ">$dest_xml_file") || die "Could not open output file $dest_xml_file";

$x = 0;

print DEST_XML_FILE  "<!-- from tmp.txt-->\n";
while (<SOURCE_FILE>)   
{
    &ConstantParseAndPrint;

}

sub ConstantParseAndPrint
{
    if ($x == 0)
    {

#        if(/^\s*(constant)\s*(fixup|\/\*fixup\*\/|)\s*(private|)\s*(\w+)\s+=\s+<([a-zA-Z0-9]+)\s+([a-zA-Z0-9])+>\s*(\/\*\s*(.*?)\s*\*\/|)(\r|\n|\s)/)
        if(/^\s*(constant)\s*(fixup|\/\*fixup\*\/|)\s*(private|)\s*(\w+)\s+=\s+<([a-zA-Z0-9]+)\s+(["']?)([a-zA-Z0-9.:\\]+)\6>\s*(\/\*\s*(.*?)\s*\*\/|)(\r|\n|\s)/)

                {
                    $name1 = $1;
                    $name2 = $2;
                    $name3 = $3;
                    $name4 = $4;
                    $name5 = $5;
                    $name6 = $7;
                    $name7 = $8;
                    printf DEST_XML_FILE "\t\t$name1";
                    printf DEST_XML_FILE "\t\t$name2";
                    printf DEST_XML_FILE "\t\t$name3";
                    printf DEST_XML_FILE "\t\t$name4";
                    printf DEST_XML_FILE "\t\t$name5";
                    printf DEST_XML_FILE "\t\t$name6";
                    printf DEST_XML_FILE "\t\t$name7\n";
#                    $x = 1;
        }
    }
}

Use the following parsing code:

if(/^\s*(constant)\s*(fixup|\/\*fixup\*\/|)\s*(private|)\s*(\w+)\s+=\s+<([a-zA-Z0-9]+)\s+(["']?)([a-zA-Z0-9.:\\]+)\6>\s*(\/\*\s*(.*?)\s*\*\/|)(\r|\n|\s)/)

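I have added handling for both single and double quotes, and I use a backreference (\6) so that the closing quote must match the opening one. I have also extended the character class for the path: it now includes the colon (:), dot (.), and backslash (\) along with the alphanumeric characters.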
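The quote-matching backreference can be tried in isolation. The following is a minimal sketch, not the answer's full pattern: second_arg is a hypothetical helper, and because the quote is the first capture group here, the backreference is \1 rather than \6.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the argument inside <TYPE ARG> with any
# surrounding quotes removed, or undef when the quotes do not pair up.
sub second_arg {
    my ($text) = @_;
    # (["']?) captures an optional opening quote; \1 then requires the
    # same character (or nothing) right before the closing bracket.
    return $text =~ /<[a-zA-Z0-9]+\s+(["']?)([a-zA-Z0-9.:\\]+)\1>/ ? $2 : undef;
}

for my $sample ('<U1 0>', '<A "C:\\\\TMP\\\\ALARM.LOG">', q{<A "oops'>}) {
    my $arg = second_arg($sample);
    print "$sample => ", defined $arg ? $arg : '(no match)', "\n";
}
```

The third sample opens with a double quote and closes with a single quote, so the backreference rejects it. Note that the character class still excludes spaces, so a quoted path containing a space would not match this pattern.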

Answer 2:

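I'm not going to write your regex for you or give you something to cut and paste into your code. At the rate your regex is going, it will break at the next special case anyway. I will give you a better approach instead.

Split each line into the left-hand and right-hand sides of the assignment.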

my($lhs, $rhs) = split m{\s* = \s*}x, $line, 2;

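Now it is easier to work with the two sides independently. You can extract the information from the left side by simply splitting it on whitespace to get all the flags (constant, fixup, etc.); the last word is the name being assigned to.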

my @flags = split /\s+/, $lhs;
my $name  = pop @flags;

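Then you can filter lines by their @flags if needed.

The value, which is presumably inside the angle brackets, is easy to get. Using a non-greedy regex ensures that it correctly handles something like foo = <bar> /* comment <stuff> */ .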

my($value) = $rhs =~ /<(.*?)>/;

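As you can see, this approach avoids having to guess which special keywords (constant, fixup, private) might appear in the file.

I don't know what else might be in this file; you didn't say.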
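Put together, the steps above might look like the sketch below. The parse_assignment helper and the returned hash layout are hypothetical, and the last step assumes that the bracketed value is a type tag followed by either a double-quoted string or a bare token, which is what the three sample lines suggest but the question does not state.

```perl
use strict;
use warnings;

# Hypothetical helper that strings the split-based steps together.
sub parse_assignment {
    my ($line) = @_;
    my ($lhs, $rhs) = split m{\s* = \s*}x, $line, 2;
    return unless defined $rhs;              # not an assignment line

    my @flags = split ' ', $lhs;             # ' ' also skips leading whitespace
    my $name  = pop @flags;

    my ($value) = $rhs =~ /<(.*?)>/;         # non-greedy: stop at the first '>'
    return unless defined $value;

    # Assumed layout: TYPE, whitespace, then "quoted text" or a bare token.
    my ($type, $arg) = $value =~ /^(\w+)\s+(?:"(.*)"|(\S+))$/
        ? ($1, defined $2 ? $2 : $3)
        : ($value, undef);

    return { name => $name, flags => \@flags, type => $type, arg => $arg };
}

my $rec = parse_assignment(
    'constant fixup private AlarmFileName = <A "C:\\\\TMP\\\\ALARM.LOG">  /* A Format */'
);
print join(' ', @{ $rec->{flags} }), " | $rec->{name} | $rec->{type} | $rec->{arg}\n";
```

Because the flags are whatever words precede the name, a new keyword in the file needs no change to the code. One known limit: the non-greedy match stops at the first '>', so a quoted string that itself contains '>' would be cut short.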

Answer 3:

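You have some major design flaws in your code. I haven't solved your problem, but I have cleaned up your code.

Most importantly, don't use global variables. In a relatively short block of code you are using 3 globals. That is begging for mysterious bugs that are hard to track down, and it will become an even bigger problem as your project grows over time.

Consider using Perl::Critic. It will help you improve your code.

Here is a commented, sanitized version of your code: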

# Always use strict and warnings.
# It prevents bugs.
use strict;
use warnings;

my $source_file   = "tmp.txt";
my $dest_xml_file = "my.xml";

# You aren't checking the existence of anything here:
#Check existance of root directory 
# Is this a TODO item?

# Use 3 argument open with a lexical filehandle.
# Adding $! to your error messages makes them more useful.
open my $source_fh, '<', $source_file
    or die "Fail to open file $source_file - $!";

open my $dest_fh, '>', $dest_xml_file
    or die "Could not open output file $dest_xml_file - $!";

my $x = 0;  # What the heck does this do?  Give it a meaningful name or
            # delete it.

print $dest_fh  "<!-- from tmp.txt-->\n";
while (my $line = <$source_fh>)   
{

    # Don't use global variables.
    # Explicitly pass all data your sub needs.
    # Any values that need to be applied to external 
    # data should be applied by the calling function,
    # from data that is returned.

    $x = ConstantParseAndPrint( $line, $x, $dest_fh );

}

sub ConstantParseAndPrint {
    my $line          = shift;
    my $mystery_value = shift;
    my $fh            = shift;

    if($mystery_value == 0) {

        # qr{} is a handy way to build a regex.
        # using {} instead of // to mark the boundaries helps
        # cut down on the escaping required when your pattern
        # contains the '/' character.

        # Use the x regex modifier to allow whitespace and 
        # comments in your regex.
        # This is very important when you can't avoid using a big, complex regex.

        # But really don't do it this way at all.
        # Do what Schwern says.
        my $line_match = qr{
            ^                      \s*  # Skip leading spaces
            (constant)             \s*  # look for the constant keyword
            (fixup|/\*fixup\*/|)   \s*  # look for the fixup keyword
            (private|)             \s*  # look for the private keyword
            (\w+)                  \s+  # Get parameter name
            =                      \s+  
            <                           # get bracketed values
            ([a-zA-Z0-9]+)         \s+  # First value 
            ([a-zA-Z0-9])+              # Second value
            >                      \s*
            (/\*\s*(.*?)\s*\*/|)        # Find any trailing comment
            (\r|\n|\s)                  # Trailing whitespace
        }x;


        if( $line =~ /$line_match/ ) {

            # Any time you find yourself making variables
            # with names like $foo1, $foo2, etc, use an array.

            my @names = ( $1, $2, $3, $4, $5, $6, $7 );

            # printf is for printing formatted data.  
            # If you aren't using any format codes, use print.

            # Using an array makes it easy to print all the tokens.
            print $fh "\t\t$_" for @names;

            $mystery_value = 1;

        }
    }

    return $mystery_value;
}

As for your parsing problem, follow Schwern's advice. A big, complex regex is a sign that you need to simplify. Break big problems into manageable tasks.

Answer 4:

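As has been mentioned, you need some structure in your regex. In refactoring your code, I made a couple of assumptions:

You don't want to just print the result out in a tab-delimited format.
The only reason for the $x variable is so that you print only one line. (However, a last at the end of the loop would work just as well.)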
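The remark about last can be illustrated with a short sketch. The first_match helper is hypothetical, and an in-memory filehandle stands in for tmp.txt so that the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first line matching $pattern, or undef.
sub first_match {
    my ($fh, $pattern) = @_;
    my $found;
    while (my $line = <$fh>) {
        next unless $line =~ $pattern;
        $found = $line;
        last;                      # stop reading; no flag variable needed
    }
    return $found;
}

# An in-memory filehandle stands in for tmp.txt here.
my $sample = "junk\n"
           . "constant fixup ConfigAlarms = <U1 0>\n"
           . "constant fixup ConfigEvents = <U2 0>\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print first_match($fh, qr/^\s*constant\b/);
close $fh;
```

Unlike the $x flag, last also stops reading the rest of a very large file once the first match has been handled.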

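Having assumed these things, I decided that, in solving your problem, I would:

Show you how to make a nicely modifiable regex.
Code very simple "semantic actions" that store the data and let you use it as you please.

One more thing to note: I changed the input to a __DATA__ section, and the output is restricted to STDERR through the use of Smart::Comments, which helps me inspect my structures.

First, the code preamble.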

use strict;   # always in development!
use warnings; # always in development!
use English qw<$LIST_SEPARATOR>; # It's just helpful.
#use re 'debug';
#use Smart::Comments;

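Note the commented-out use re .... If you really want to see how a regular expression gets parsed, it will put out a lot of information that you probably don't want to see (though you can work your way through it with a little knowledge of regex parsing). It is commented out because it just isn't newbie-friendly and will monopolize your output. (For more about it, see re.)

Also commented out is the use Smart::Comments line. I recommend it, but you can get by with Data::Dumper and print Dumper( \%hash ) lines. (See Smart::Comments.)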
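For readers without Smart::Comments installed, the Data::Dumper fallback could look like this minimal sketch. The %hash contents are a made-up record, not output from the parser below, and the two configuration variables are optional conveniences.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # stable key order makes dumps easier to compare
$Data::Dumper::Indent   = 1;   # more compact nesting

# Hypothetical record, shaped like what a parser for this file might build.
my %hash = (
    ConfigAlarms => { first_arg => 'U1', second_arg => '0' },
);

# Smart::Comments writes to STDERR, so this stand-in does the same.
print STDERR Dumper( \%hash );
```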

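Specifying the expression

But on to the regex. I used the exploded form of regex so that the parts of the whole can be explained (see perlre). We want a single alphanumeric character OR a quoted string (with escapes allowed).

We also use a list of modifier names, so that the "language" can progress.

We build the next regex in a "do block", or as I like to call it, a "localization block", so that I can localize $LIST_SEPARATOR (aka $") to the regex alternation character ('|'). Thus, when I include a list to be interpolated, it is interpolated as an alternation.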
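The localization trick can be seen on its own in the small sketch below. It uses the built-in $" directly rather than the English alias, and $mod_RE is a hypothetical name for a pattern that matches exactly one modifier.

```perl
use strict;
use warnings;

my @mod_names = qw<constant fixup private>;

# $" is the list separator that English.pm calls $LIST_SEPARATOR.
# Localizing it inside the do block keeps the change from leaking out.
my $mod_RE = do {
    local $" = '|';
    qr{ \A (?: @mod_names ) \z }x;   # interpolates as constant|fixup|private
};

for my $word (qw<fixup public>) {
    printf "%-7s %s\n", $word, $word =~ $mod_RE ? 'is a modifier' : 'is not a modifier';
}
```

Adding a keyword then only means adding a word to @mod_names; the pattern itself stays untouched.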

I'll give you time to look at the second regex before talking about it.

# Modifiable list of modifiers
my @mod_names = qw<constant fixup private>;
# Break out the more complex chunks into separate expressions
my $arg2_regex 
    = qr{ \p{IsAlnum}             # accept a single alphanumeric character
        |                         # OR 
          "                       # Starts with a double quote
          (?>                     # -> We just want to group, not capture;
                                  # the '?>' also controls backtracking
              [^\\"\P{IsPrint}]+  # any print character as long as it is not
                                  # a backslash or a double quote
          |   \\"                 # but we will accept a backslash followed by
                                  # a double quote
          |   (\\\\)+             # OR any amount of doubled backslashes
          )*                      # any number of these
          "
        }msx;

my $line_RE 
    = do { local $LIST_SEPARATOR = '|';
           qr{ \A                # the beginning
               \s*               # however much whitespace you need
               # A sequence of modifier names followed by space
               ((?: (?: @mod_names ) \s+ )*)
               ( \p{IsAlnum}+ )  # at least one alphanumeric character
               \s*               # any amount of whitespace
               =                 # an equals sign
               \s*               # any amount of whitespace
               <                 # open angle bracket
                 (\p{IsAlnum}+)  # Alphanumeric identifier
                 \s+             # required whitespace
                 ( $arg2_regex ) # previously specified arg #2 expression
                 [^>]*?
               >                 # close angle bracket
             }msx
             ;   
          };

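The regex just says that we want any number of recognized "modifiers", separated by whitespace, followed by an alphanumeric identifier. (I'm not sure why you don't want underscores; I didn't include them, regardless.)

This is followed by any amount of whitespace and an equals sign. Since the sets of alphanumeric characters, whitespace, and the equals sign are all disjoint, there is no reason to require whitespace. On the other side of the equals sign, the value is delimited by angle brackets, so I don't see any reason to require whitespace on that side either. Before the equals sign, all you have allowed is alphanumerics and whitespace, and on the other side everything has to be inside angle brackets. Requiring whitespace gains you nothing, while not requiring it is more fault-tolerant. Ignore all this and change the *s to +s if you are expecting machine output.

On the other side of the equals sign, we require a pair of angle brackets. The pair contains an alphanumeric argument, followed by a second argument that is either a single alphanumeric character (based on your spec) OR a string that can contain escaped escapes or quotes, and even the closing angle bracket, as long as the string has not ended.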
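The point about a quoted string containing an escaped quote or a closing angle bracket can be checked with a short sketch. It deliberately swaps in a simpler, commonly used quoted-string pattern instead of the answer's $arg2_regex; $quoted_RE and bracket_arg are hypothetical names.

```perl
use strict;
use warnings;

# A simpler stand-in for $arg2_regex (an assumption, not the answer's
# pattern): a double-quoted string whose body is made of ordinary
# characters or backslash-escaped characters.
my $quoted_RE = qr{ " (?> [^"\\]+ | \\. )* " }x;

# Hypothetical helper: returns the second argument of <TYPE ARG>,
# quotes included, or undef when the text does not parse.
sub bracket_arg {
    my ($text) = @_;
    my ($arg) = $text =~ m{ < \w+ \s+ ( $quoted_RE | \w+ ) \s* > }x;
    return $arg;
}

print bracket_arg('<A "C:\\\\TMP\\\\ALARM.LOG">'), "\n";
print bracket_arg('<A "say \\"hi\\" > now">'), "\n";
```

In the second call the '>' inside the quotes is consumed as part of the string, so only the final '>' closes the bracket pair.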

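Storing the data

Once the specification has been made, here is just one of the things you can do with it. I don't know what you want to do with this data besides printing it out, and I'm going to assume that printing is not the whole purpose of the script.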

### $line_RE
my %fixup_map;
while ( my $line = <DATA> ) { 
    ### $line
    my ( $mod_text, $identifier, $first_arg, $second_arg ) 
        = ( $line =~ /$line_RE/ )
        ;
    die 'Did not parse!' unless $identifier;
    $fixup_map{$identifier}
        = { modifiers_for => { map { $_ => 1 } split /\s+/, $mod_text }
          , first_arg     => $first_arg
          , second_arg    => $second_arg
          };

    ### $fixup_map{$identifier} : $fixup_map{$identifier}
}
__DATA__
constant fixup ConfigAlarms  = <U1 0>
constant fixup ConfigAlarms2 = <U1 2>
constant fixup private AlarmFileName = <A "C:\\TMP\\ALARM.LOG">

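At the end you can see the DATA section. When you are at the beginning stage, as you seem to be here, it is most convenient to dispense with the IO logic and use the built-in DATA handle, as I do here.

I collect the modifiers in a hash, so that my semantic actions could be: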

#...
my $data = $fixup_map{$id};
#...
if ( $data->{modifiers_for}{public} ) {
    #...
}

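Soapbox

The main problem, however, is that you don't seem to have a plan. For the second "argument" in the angle brackets, you had a regex that specified only a single alphanumeric character, but you want to expand it to allow escaped strings. I have to expect that you are implementing a small subset and will gradually want to expand it to do other things. If you neglect a good design from the beginning, implementing the full-featured "parser" will only become more and more of a headache.

You may want to implement multi-line values at some point. If you don't know how to get from a single alphanumeric to a quote-delimited argument, then the step from the line-by-line method of adjusting regexes to multi-line values dwarfs that complexity gap.

So I advise you to use the code here only as a guideline for expanding complexity. I'm answering a question and indicating a direction, not designing or coding a project, so my regex code is not as expandable as it probably should be.

If the parsing job were complex enough, I would specify a minimal lookahead grammar for Parse::RecDescent and stick to coding the semantic actions. That's another suggestion.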

Answer 5:

I deliberately removed the captures from the match (you can add them back if you like):

m{^\s*constant\s+fixup\s+(?:private\s+)?\w+\s*=\s*<[^>]+>(?:\s*/\*(?:\s*\w*)+\*/)?$};
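If the captures are wanted, one way to add them back is sketched below. The choice of which parts to capture, the parse_constant helper, and the loosened comment sub-pattern (any text rather than words only, plus optional trailing whitespace) are assumptions, not part of the answer.

```perl
use strict;
use warnings;

# One possible set of captures; the comment sub-pattern is loosened
# to accept any text between the comment delimiters.
my $line_RE = qr{
    ^ \s* constant \s+ fixup \s+
    (?: (private) \s+ )?               # capture 1: optional private flag
    (\w+) \s* = \s*                    # capture 2: name being assigned
    < ( [^>]+ ) >                      # capture 3: text inside the brackets
    (?: \s* /\* \s* (.*?) \s* \*/ )?   # capture 4: optional trailing comment
    \s* $
}x;

# Hypothetical helper: returns a hash reference, or nothing on failure.
sub parse_constant {
    my ($line) = @_;
    my ($private, $name, $value, $comment) = $line =~ $line_RE
        or return;
    return {
        name    => $name,
        value   => $value,
        comment => $comment,
        private => defined $private ? 1 : 0,
    };
}

my $rec = parse_constant('constant fixup private AlarmFileName = <A "C:\\\\TMP\\\\ALARM.LOG">  /* A Format */');
print "$rec->{name}: $rec->{value} ($rec->{comment})\n";
```

Like the original, this pattern requires both constant and fixup, and [^>]+ stops at the first '>', so a quoted value containing '>' would not parse.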

Answer 6:

Normalize first!

$yourstring =~ s,\\,/,g;  # transform '\' into '/'
$yourstring =~ s,/+,/,g;  # transform multiple '/' into one '/'
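Applied to the path from the question, the two substitutions behave as in this small sketch. normalize_path is a hypothetical wrapper added only to make the example self-contained.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the two substitutions above.
sub normalize_path {
    my ($path) = @_;
    $path =~ s,\\,/,g;   # every backslash becomes a forward slash
    $path =~ s,/+,/,g;   # runs of slashes collapse into one
    return $path;
}

print normalize_path('C:\\\\TMP\\\\ALARM.LOG'), "\n";   # prints C:/TMP/ALARM.LOG
```

After normalization the path no longer contains backslashes, so a simpler character class suffices in the main regex. The collapse step would also flatten the leading double slash of a UNC-style path, which may or may not matter for this file.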

Source: How do I extract and parse quoted strings in Perl?

Tags: regex perl parsing
