Regex Replacing : to “:” etc

2020-03-24 04:21发布

I've got a bunch of strings like:

"Hello, here's a test colon:. Here's a test semi-colon&#59;"

I would like to replace that with

"Hello, here's a test colon:. Here's a test semi-colon;"

And so on for all printable ASCII values.

At present I'm using boost::regex_search to match &#(\d+);, building up a string as I process each match in turn (including appending the substring containing no matches since the last match I found).

Can anyone think of a better way of doing it? I'm open to non-regex methods, but regex seemed a reasonably sensible approach in this case.

Thanks,

Dom

12条回答
够拽才男人
2楼-- · 2020-03-24 05:10

The existing SNOBOL solutions don't handle the multiple-patterns case properly, due to there only being one "&". The following solution ought to work better:

        dd = "0123456789"
        ccp = "#" span(dd) $ n ";" *?(s = s char(n)) fence (*ccp | null)
   rdl  line = input                              :f(done)
   repl line "&" *?(s = ) ccp = s                 :s(repl)
        output = line                             :(rdl)
   done
   end
查看更多
爷、活的狠高调
3楼-- · 2020-03-24 05:10

I don't know about the regex support in boost, but check if it has a replace() method that supports callbacks or lambdas or some such. That's the usual way to do this with regexes in other languages I'd say.

Here's a Python implementation:

s = "Hello, here's a test colon:. Here's a test semi-colon&#59;"
re.sub(r'&#(1?\d\d);', lambda match: chr(int(match.group(1))), s)

Producing:

"Hello, here's a test colon:. Here's a test semi-colon;"

I've looked some at boost now and I see it has a regex_replace function. But C++ really confuses me so I can't figure out if you could use a callback for the replace part. But the string matched by the (\d\d) group should be available in $1 if I read the boost docs correctly. I'd check it out if I were using boost.

查看更多
成全新的幸福
4楼-- · 2020-03-24 05:10

boost::spirit parser generator framework allows easily to create a parser that transforms desirable NCRs.

// spirit_ncr2a.cpp
#include <iostream>
#include <string>
#include <boost/spirit/include/classic_core.hpp>

int main() {
  using namespace BOOST_SPIRIT_CLASSIC_NS; 

  std::string line;
  while (std::getline(std::cin, line)) {
    assert(parse(line.begin(), line.end(),
         // match "&#(\d+);" where 32 <= $1 <= 126 or any char
         *(("&#" >> limit_d(32u, 126u)[uint_p][&putchar] >> ';')
           | anychar_p[&putchar])).full); 
    putchar('\n');
  }
}
  • compile:
    $ g++ -I/path/to/boost -o spirit_ncr2a spirit_ncr2a.cpp
  • run:
    $ echo "Hello, &#12; here's a test colon&#58;." | spirit_ncr2a
  • output:
    "Hello, &#12; here's a test colon:." 
查看更多
我只想做你的唯一
5楼-- · 2020-03-24 05:15

This will probably earn me some down votes, seeing as this is not a c++, boost or regex response, but here's a SNOBOL solution. This one works for ASCII. Am working on something for Unicode.

        NUMS = '1234567890'
MAIN    LINE = INPUT                                :F(END)
SWAP    LINE ?  '&#' SPAN(NUMS) . N ';' = CHAR( N ) :S(SWAP)
        OUTPUT = LINE                               :(MAIN)
END
查看更多
手持菜刀,她持情操
6楼-- · 2020-03-24 05:15

Here's a version based on boost::regex_token_iterator. The program replaces decimal NCRs read from stdin by corresponding ASCII characters and prints them to stdout.

#include <cassert>
#include <iostream>
#include <string>
#include <boost/lexical_cast.hpp>
#include <boost/regex.hpp>

int main()
{
  boost::regex re("&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);"); // 32..126
  const int subs[] = {-1, 1}; // non-match & subexpr
  boost::sregex_token_iterator end;
  std::string line;

  while (std::getline(std::cin, line)) {
    boost::sregex_token_iterator tok(line.begin(), line.end(), re, subs);

    for (bool isncr = false; tok != end; ++tok, isncr = !isncr) {
      if (isncr) { // convert NCR e.g., '&#58;' -> ':'
        const int d = boost::lexical_cast<int>(*tok);
        assert(32 <= d && d < 127);
        std::cout << static_cast<char>(d);
      }
      else
        std::cout << *tok; // output as is
    }
    std::cout << '\n';
  }
}
查看更多
虎瘦雄心在
7楼-- · 2020-03-24 05:16

Here's a NCR scanner created using Flex:

/** ncr2a.y: Replace all NCRs by corresponding printable ASCII characters. */
%%
&#(1([01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]); { /* accept 32..126 */
  /**recursive: unput(atoi(yytext + 2)); skip '&#'; `atoi()` ignores ';' */
  fputc(atoi(yytext + 2), yyout); /* non-recursive version */
}

To make an executable:

$ flex ncr2a.y
$ gcc -o ncr2a lex.yy.c -lfl

Example:

$ echo "Hello, &#12; here's a test colon&#58;. 
> Here's a test semi-colon&#59; '&#131;'
> &#38;#59; <-- may be recursive" \
> | ncr2a

It prints for non-recursive version:

Hello, &#12; here's a test colon:.
Here's a test semi-colon; '&#131;'
&#59; <-- may be recursive

And the recursive one produces:

Hello, &#12; here's a test colon:.
Here's a test semi-colon; '&#131;'
; <-- may be recursive
查看更多
登录 后发表回答