Change delimiter of grep command

2019-06-02 16:35发布

问题:

I am using grep to detect <a href="xxxx"> something here </a>
This is not working when the link is split on two lines in the input. I want to grep to check till it detects a </a> but right now it only is taking the input into grep till it detects a new line.

So if input is like <a href="xxxx"> something here </a> it works, but if input is like

<a href="xxxx">

something here /a>    

, then it doesn't. Any solutions?

回答1:

I'd use awk rather than grep. This should work:

awk '/a href="xxxx">/,/\/a>/' filename



回答2:

I think you would have much less trouble using some xslt tool, but you can do it with sed, awk or an extended version of grep pcregrep, which is capable of multiline pattern (-M).



回答3:

I'd suggest to fold input so openning and closing tags are on the same line, then check the line against the pattern. An idiomatic approach using sed(1):

sed '/<[Aa][^A-Za-z]/{ :A
     /<\/[Aa]>/ bD
     N
     bA
     :D
     /\n/ s// /g
}
# now try your pattern
/<[Aa][^A-Za-z] href="xxx"[^>]*>[^<]*something here[^<]*<\/[Aa]>/ !d'


回答4:

This is probably a repeat question: Grep search strings with line breaks

You can try it with tr '\n' ' 'command as was explained in one of the answers, if all you need is to find the files and not the line numbers.



回答5:

Consider egrep -3 '(<a|</a>)'

"-3" prints up to 3 surrounding lines around each regex match (3 lines before and 3 lines after the match). You can use -1 or -2 as well if that works better.



回答6:

perl -e '$_=join("", <>); m#<a.*?>.*?<.*?/a>#s; print "$&\n";'

So the trick here is that the entire input is read into $_. Then a standard /.../ regex is run. I used the alternate syntax m#...# so that I do not have to backslash "/"s which are used in xml. Finally the "s" postfix makes multiline matches work by making "." also match newlines (note also option "m" which changes the meaning of ^ and $). "$&" is the matched string. It is the result you are looking for. If you want just the inner-text, you can put round brackets around that part and print $1.

I am assuming that you meant </a> rather than /a> as an xml closing delimiter.

Note the .*? is a non-greedy version of .* so for <a>1</a><a>2</a>, it only matches <a>1</a>.

Note that nested nodes may cause problems eg <a><a></a></a>. This is the same as when trying to match nested brackets "(", ")" or "{", "}". This is a more interesting problem. Regex's are normally stateless so they do not by themselves support keeping an unlimited bracket-nesting-depth. When programming parsers, you normally use regex's for low-level string matching and use something else for higher level parsing of tokens eg bison. There are bison grammars for many languages and probably for xml. xslt might even be better but I am not familiar with it. But for a very simple use case, you can also handle nested blocks like this in perl:

Nested bracket-handling code: (this could be easily adapted to handle nested xml blocks)

$_ = "a{b{c}e}f";

my($level)=(1);
s/.*?({|})/$1/; # throw away everything before first match
while(/{|}/g) {
   if($& eq "{") {
      ++$level;
   } elsif($& eq "}") {
      --$level;
      if($level == 1) {
         print "Result: ".$`.$&."\n";
         $_=$'; # reset searchspace to after the match
         last;
      }
   }
}

Result: {b{c}e}