Regex Replacing &#58; to “:” etc

I've got a bunch of strings like:

"Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"

I would like to replace that with

"Hello, here's a test colon:. Here's a test semi-colon;"

And so on for all printable ASCII values.

At present I'm using boost::regex_search to match &#(\d+);, building up a string as I process each match in turn (including appending the substring containing no matches since the last match I found).

Can anyone think of a better way of doing it? I'm open to non-regex methods, but regex seemed a reasonably sensible approach in this case.

Thanks,

Dom

Tags: c++ regex boost ascii ncr

12 Answers

\"骚年 ilove

Reply #2 · 2020-03-24 04:52

Ya know, as long as we're off topic here, perl substitution has an 'e' option. As in evaluate expression. E.g.

echo "Hello, here's a test colon&#58;. Here's a test semi-colon&#59;
Further test &#65;. abc.~.def." |
perl -we 'sub translate { my $x=$_[0]; if ( ($x >= 32) && ($x <= 126) )
{ return sprintf("%c",$x); } else { return "&#".$x.";"; } }
while (<>) { s/&#(1?\d\d);/&translate($1)/ge; print; }'

Pretty-printing that:

#!/usr/bin/perl -w

sub translate
{
  my $x=$_[0];

  if ( ($x >= 32) && ($x <= 126) )
  {
    return sprintf( "%c", $x );
  }
  else
  {
    return "&#" . $x . ";" ;
  }
}

while (<>)
{
  s/&#(1?\d\d);/&translate($1)/ge;
  print;
}

Though perl being perl, I'm sure there's a much better way to write that...

Back to C code:

You could also roll your own finite state machine. But that gets messy and troublesome to maintain later on.
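For what it's worth, a hand-rolled scanner for this particular format can stay fairly small. Here is a minimal C++ sketch, assuming only printable codes (32 to 126) should be decoded and that anything else is copied through unchanged. `decode_printable_ncr` and `check_examples` are made-up names, not something from Boost or the question.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): decode "&#N;" references for
// printable ASCII (32..126) in a single left-to-right pass.
std::string decode_printable_ncr(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        // State 1: look for the fixed "&#" prefix.
        if (in[i] == '&' && i + 1 < in.size() && in[i + 1] == '#') {
            // State 2: accumulate up to three decimal digits.
            std::size_t j = i + 2;
            int value = 0;
            while (j < in.size() && j - (i + 2) < 3 &&
                   in[j] >= '0' && in[j] <= '9') {
                value = value * 10 + (in[j] - '0');
                ++j;
            }
            // State 3: need at least one digit, a ';', and a printable code.
            if (j > i + 2 && j < in.size() && in[j] == ';' &&
                value >= 32 && value <= 126) {
                out += static_cast<char>(value);
                i = j + 1;  // resume after ';' so output is never rescanned
                continue;
            }
        }
        out += in[i++];  // not a decodable reference: copy the character as-is
    }
    return out;
}

// Usage example with the strings from this thread.
void check_examples() {
    assert(decode_printable_ncr("test colon&#58;.") == "test colon:.");
    assert(decode_printable_ncr("&#12; '&#131;'") == "&#12; '&#131;'");
}
```

Because the scan resumes after each `;` and never revisits its own output, `&#38;#38;` comes out as `&#38;` rather than `&`. Unlike the `1?\d\d` pattern, this sketch also accepts a leading zero such as `&#058;`.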


闹够了就滚

Reply #3 · 2020-03-24 04:56

The original problem statement apparently isn't very complete, but if you really want to trigger only on cases that produce characters between 32 and 126, that's a trivial change to the solution I posted earlier. Note that my solution also handles the multiple-patterns case (although this first version wouldn't handle cases where some of the adjacent patterns are in range and others are not).

      dd = "0123456789"
      ccp = "#" span(dd) $ n *lt(n,127) *ge(n,32) ";" *?(s = s char(n))
 +      fence (*ccp | null)
 rdl  line = input                              :f(done)
 repl line "&" *?(s = ) ccp = s                 :s(repl)
      output = line                             :(rdl)
 done
 end

It would not be particularly difficult to handle that case as well (e.g. &#131;&#58; produces "&#131;:"):

      dd = "0123456789"
      ccp = "#" (span(dd) $ n ";") $ enc
 +      *?(s = s (lt(n,127) ge(n,32) char(n), char(10) enc))
 +      fence (*ccp | null)
 rdl  line = input                              :f(done)
 repl line "&" *?(s = ) ccp = s                 :s(repl)
      output = replace(line,char(10),"#")       :(rdl)
 done
 end


Rolldiameter

Reply #4 · 2020-03-24 04:59

The big advantage of using a regex is that it deals with the tricky cases like &#38;#38;. Entity replacement isn't iterative; it's a single step. The regex is also going to be fairly efficient: the two lead characters are fixed, so it will quickly skip anything not starting with &#. Finally, the regex solution is one without a lot of surprises for future maintainers.

I'd say a regex was the right choice.

Is it the best regex, though? You know you need two digits, and if you have three digits, the first one will be a 1. Printable ASCII is, after all, &#32; through &#126; (space to ~). For that reason, you could consider &#1?\d\d;.
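One caveat, sketched in C++ with std::regex standing in for Boost.Regex: the pattern alone still matches codes outside the printable range (10 to 31 and 127 to 199), so a numeric check after the match is probably still wanted. The function names below are made up for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True when `ref` as a whole matches the pattern suggested above.
bool matches_suggested_pattern(const std::string& ref) {
    static const std::regex pattern(R"(&#1?\d\d;)");
    return std::regex_match(ref, pattern);
}

// True when the decimal code is printable ASCII (space through '~').
bool is_printable_ascii(int code) {
    return code >= 32 && code <= 126;
}

// Usage example: the pattern accepts 12 and 131, the range check rejects them.
void check_examples() {
    assert(matches_suggested_pattern("&#58;") && is_printable_ascii(58));
    assert(matches_suggested_pattern("&#12;") && !is_printable_ascii(12));
    assert(matches_suggested_pattern("&#131;") && !is_printable_ascii(131));
}
```

The two Perl answers in this thread deal with the same gap, either with an explicit range test or with a stricter alternation in the pattern.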

As for replacing the content, I'd use the basic algorithm described for boost::regex_replace:

For each match // Using regex_iterator<>
    Print the prefix of the match
    Remove the first 2 and last character of the match (&#;)
    lexical_cast the result to int, then truncate to char and append.

Print the suffix of the last match.
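The steps above can be sketched roughly as follows. This is an illustration, not the Boost code from the answer: it assumes std::regex in place of Boost.Regex, uses a capture group plus std::stoi in place of stripping characters and lexical_cast, and adds a printable-range check. `replace_ncr` and `check_examples` are hypothetical names.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: std::regex stands in for Boost.Regex here.
std::string replace_ncr(const std::string& in) {
    static const std::regex pattern(R"(&#(1?\d\d);)");
    std::string out;
    auto last = in.begin();  // end of the previous match
    for (std::sregex_iterator it(in.begin(), in.end(), pattern), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[0].first);            // prefix since the previous match
        const int code = std::stoi(m[1].str());  // digits between "&#" and ";"
        if (code >= 32 && code <= 126)
            out += static_cast<char>(code);      // printable: emit the character
        else
            out.append(m[0].first, m[0].second); // otherwise keep the reference
        last = m[0].second;
    }
    out.append(last, in.end());                  // suffix after the last match
    return out;
}

// Usage example with the string from the question.
void check_examples() {
    assert(replace_ncr("Hello, here's a test colon&#58;. Here's a test semi-colon&#59;")
           == "Hello, here's a test colon:. Here's a test semi-colon;");
}
```

Since each match is consumed once and the output is never searched again, `&#38;#38;` becomes `&#38;` and stays that way, which is the single-step behaviour described earlier.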


虎瘦雄心在

Reply #5 · 2020-03-24 05:06

Here's another Perl one-liner (see @mrree's answer):

a test file:

$ cat ent.txt 
Hello, &#12; here's a test colon&#58;. 
Here's a test semi-colon&#59; '&#131;'

the one-liner:

$ perl -pe's~&#(1?\d\d);~
> sub{ return chr($1) if (31 < $1 && $1 < 127); $& }->()~eg' ent.txt

or using more specific regex:

$ perl -pe's~&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);~chr($1)~eg' ent.txt

both one-liners produce the same output:

Hello, &#12; here's a test colon:. 
Here's a test semi-colon; '&#131;'


放荡不羁爱自由

Reply #6 · 2020-03-24 05:07

I thought I was pretty good at regex, but I have never seen lambdas being used in a regex. Please enlighten me!

I'm currently using Python and would have solved it with this one-liner:

''.join(chr(int(x)) if i % 2 else x for i, x in enumerate(re.split(r'&#(\d+);', THESTRING)))

Does that make any sense?


Evening l夕情丶

Reply #7 · 2020-03-24 05:10

* Repaired SNOBOL4 Solution
* &#38;#38; -> &#38;
     digit = '0123456789'
main line = input                        :f(end)
     result = 
swap line arb . l
+    '&#' span(digit) . n ';' rem . line :f(out)
     result = result l char(n)           :(swap)
out  output = result line                :(main)
end

