what is regular expression?

2019-01-22 21:12发布

I know this question seems stupid, but it isn't. I mean what is it exactly. I have a fair understanding of the parsing problem. I know BNF/EBNF, I've written grammar to parse simple context-free languages in one of my college courses. I just never met regular expressions before! The only thing that I remember about it is that context-free grammar can do all what regular expression can do.

Also, is it useful for a usual coding to parse strings? A simple example would be helpful.

标签: regex parsing
3条回答
We Are One
2楼-- · 2019-01-22 21:54

A regular expression is a specialized language for pattern matching. They're used in many text editors and programming languages for string matching.

You can do many more complicated things with regular expressions as well. There's a great O'Reilly book on the subject, and numerous examples on the web.

The thing that you can't do with regular expressions is proper parsing, because regular expressions aren't a sufficient language to encode a grammar. They're specialized for pattern matching, and if you try to use them for parsing something like XML, you'll likely have problems down the road. More specifically, you can't parse arbitrarily nested recursive structures using regular expressions. A simple example of a problem that a regular expression can't solve well is a set of nested braces like you would find in C:

int main() {    
    void func() {
    }   
}

You can make regular expressions solve this up to a certain point, but the memory requirements for this grow arbitrarily large as the number of braces grows. If you're interested in more detail, read this other StackOverflow question about why such a construct is difficult to parse with regular expressions:

Can regular expressions be used to match nested patterns?

Different languages implement regular expressions in different ways, but the Perl implementation is very popular. The family of regular expressions that are compatible with Perl are called PCRE, or Perl-Compatible Regular Expressions. Here's an example in Perl of a regular expression that can match integers:

#!/usr/bin/perl

use strict;
use warnings;

match_string( "one-two" );
match_string( "1-2" );

sub match_string {
   my $string = shift;
   if ( $string =~ /(\d+)/ ) {
      print "$string matches!\n";
      print "matched: ", $1, "\n";
   } else {
      print "$string doesn't match!\n";
   }
}  

$ perl test.pl 
one-two doesn't match!
1-2 matches!
matched: 1

In this example, the regular expression matches one or more examples of a digit. Here's the line:

   if ( $string =~ /(\d+)/ ) {

The way to read this is:

  • inside the conditional, the string is being matched against the regular expression between /'s.
  • the \d character translates to a digit, 0-9.
  • the + means "one or more times."
  • the parens () mean capture this match, and put it into a special variable. Because this is the first match, it's put into $1.

In some languages (such as Perl), you can also use regular expressions for doing substitutions, like this:

substitute_string( "one-two" );
substitute_string( "1-2" );

sub substitute_string {
   my $string = shift;
   print "before: ",  $string, "\n";
   $string =~ s/1/one/g;
   $string =~ s/2/two/g;
   print "after: ",  $string, "\n";
}

$ perl test.pl 
before: one-two
after: one-two
before: 1-2
after: one-two

Hopefully that's enough to get you started!

查看更多
男人必须洒脱
3楼-- · 2019-01-22 21:58

Other people have covered what a regular expression is, and what it can be used for, so I won't rehash previous answers. However, if you're interested in learning about regular expression syntax (ie. how to construct a regular expression), check out the Tutorial section at regular-expression.info; it's probably the most indepth regular expression syntax resource on the internet.

查看更多
你好瞎i
4楼-- · 2019-01-22 22:06

Regular expressions first came around in mathematics and automata theory. A regular expression is simply something which defines a regular language. Without going too much into what "regular" means, think of a language as this way:

  1. A language is made up of strings. English is a language, for example, and its made of strings.
  2. Those strings are made of symbols - called an alphabet. So a string is just a concatenation of symbols from the alphabet.

So you could have a string (which is, remember, just a concatenation of symbols) which is not part of a given language. Or it could be in the language.

So lets say you have an alphabet made of 2 symbols: "0" and "1". And lets say you want to create a language using the symbols in that alphabet. You could create the following rule: "In order for a string to be in my language, it must have only 0's and 1's in it."

So these strings are in your language:

  • 0
  • 1
  • 01
  • 11001101
  • ...etc

These would not be in your language:

  • 2
  • peaches
  • 00101105

That's a pretty simple language. How about this: "In my language, each string [analogous to a valid 'word' in English] must being with a 0, and then can be followed by any number of 0's or 1's"

These are in the language:

  • 0111111
  • 0000000
  • 0101010110001

These are not:

  • 1
  • 10000
  • 1010
  • 2000000

Well rather than defining the language using words - and these languages might get very complex ("1 followed by 2 0's followed by any combination of 1's and 0's ending with a 1"), we came up with this syntax called "regular expressions" to define the language.

The first language would have been:

(0|1)*

(0 or 1, repeated infinitely)

The next: 0(0|1)*

(0, followed by any number of 0's and 1's).

So lets think of programming now. When you create a regex, you are saying "Look at this text. Return to me strings which match this pattern." Which is really saying "I have defined a language. Return to me all strings within this document which are in my language."

So when you create a "regex", you are actually defining a regular language, which is a mathematical concept. (In actuality, perl-like regex define "nonregular" languages, but that is a separate issue.)

By learning the syntax of regex, you are learning the ins and outs of how to create a language, so that later you can see if a given string is "in" the language. Thus, commonly, people say that regex are for pattern matching - which is basically what you are doing when you look at a pattern, and see if it "matches" the rules for your language.

(this was long. does it answer your question at all?)

查看更多
登录 后发表回答