modify regex to include comma

Posted 2019-08-03 16:01

Question:

I have the following string:

arg1('value1') arg2('value '')2') arg3('user\'~!@#$%^&*_~!@#$%^&"*_-=+[{]}\|;:<.>?21')

The regex to extract the value looks like:

boost::regex re_arg_values("('[^']*(?:''[^']*)*'[^)]*)");

The above regex extracts the values correctly, but when a value contains a comma, the code fails. For example:

  arg1('value1') arg2('value '')2') arg3('user\'~!@#$%^&*_~!@#$%^&"*_-=+[{]}\|;:<.>?21**,**')

How should I modify this regex to allow the comma? Note that the value can contain spaces, special characters, and tabs. The code is in C++.

Thanks in advance.

Answer 1:

I'd not use a regex here.

The goal should be to parse the values: they presumably carry meaning that you need interpreted, not just extracted as raw text.

I'd devise a data structure like:

#include <boost/variant.hpp>
#include <map>
#include <string>
#include <utility>

namespace Config {
    using Key = std::string;
    using Value = boost::variant<int, std::string, bool>;
    using Setting = std::pair<Key, Value>;
    using Settings = std::map<Key, Value>;
}
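To illustrate what a variant-valued settings map buys you, here is a minimal sketch that uses C++17 `std::variant` as a stand-in for `boost::variant`. The `ConfigSketch` namespace and the `describe` helper are hypothetical names introduced for this example only; they are not part of the answer's code.

```cpp
#include <map>
#include <string>
#include <type_traits>
#include <variant>

namespace ConfigSketch {
    using Key = std::string;
    // Assumption: std::variant stands in for boost::variant here.
    using Value = std::variant<int, std::string, bool>;
    using Settings = std::map<Key, Value>;

    // Render a value together with the type it was stored as.
    std::string describe(Value const& v) {
        return std::visit([](auto const& x) -> std::string {
            using T = std::decay_t<decltype(x)>;
            if constexpr (std::is_same_v<T, int>)
                return "int:" + std::to_string(x);
            else if constexpr (std::is_same_v<T, bool>)
                return std::string("bool:") + (x ? "true" : "false");
            else
                return "string:" + x;
        }, v);
    }
}
```

Because each value keeps its type, code that reads the map can branch on int, string, or bool instead of re-parsing text every time.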

A Boost Spirit parser can map 1:1 onto the Config types above:

#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted/std_pair.hpp>

namespace Parser {
    using It = std::string::const_iterator;
    using namespace Config;
    namespace qi = boost::spirit::qi;

    using Skip = qi::blank_type;
    qi::rule<It, std::string()>   quoted_   = "'" >> *(
            "'" >> qi::char_("'") // double ''
          | '\\' >> qi::char_     // any character escaped
          | ~qi::char_("'")       // non-quotes
       ) >> "'";
    qi::rule<It, Key()>           key_      = +qi::char_("a-zA-Z0-9_"); // for example
    qi::rule<It, Value()>         value_    = qi::int_ | quoted_ | qi::bool_;
    qi::rule<It, Setting(), Skip> setting_  = key_ >> '(' >> value_ >> ')';
    qi::rule<It, Settings()>      settings_ = qi::skip(qi::blank) [*setting_];
}

Note how this

  • interprets non-string values correctly
  • specifies what keys look like and parses them too
  • interprets string escapes, so the Value in the map contains the "real" string, after un-escaping
  • ignores whitespace outside values (use space_type and qi::space instead of blank_type and qi::blank if you want to ignore newlines as whitespace as well)

You can use it like:

int main() {
    std::string const input = R"(    arg1('value1') arg2('value '')2') arg3('user\'~!@#$%^&*_~!@#$%^&"*_-=+[{]}\|;:<.>?21**,**'))";

    Config::Settings map;
    if (parse(input.begin(), input.end(), Parser::settings_, map)) {
        for(auto& entry : map)
            std::cout << "config setting {" << entry.first << ", " << entry.second << "}\n";
    }
}

Which prints

config setting {arg1, value1}
config setting {arg2, value ')2}
config setting {arg3, user'~!@#$%^&*_~!@#$%^&"*_-=+[{]}|;:<.>?21**,**}
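The un-escaping visible in that output (the doubled '' collapsing to a single quote, and the backslash being dropped) can also be sketched by hand without Boost. The `unquote` function below is a hypothetical helper, not part of the answer; it assumes it is handed exactly one single-quoted literal and follows the same three alternatives as `quoted_`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode one single-quoted literal such as 'it''s' or 'a\'b'.
// Hypothetical helper that mirrors the three alternatives of quoted_.
std::string unquote(std::string const& literal) {
    if (literal.size() < 2 || literal.front() != '\'' || literal.back() != '\'')
        throw std::invalid_argument("expected a single-quoted literal");

    std::string out;
    std::size_t const end = literal.size() - 1; // index of the closing quote
    for (std::size_t i = 1; i < end; ++i) {
        char const c = literal[i];
        if (c == '\'') {
            // A quote inside the literal is only valid when doubled.
            if (i + 1 >= end || literal[i + 1] != '\'')
                throw std::invalid_argument("unescaped quote inside literal");
            out += '\'';
            ++i; // skip the second quote of the pair
        } else if (c == '\\') {
            // A backslash escapes whatever character follows it.
            if (i + 1 >= end)
                throw std::invalid_argument("dangling backslash");
            out += literal[++i];
        } else {
            out += c; // ordinary character, commas included
        }
    }
    return out;
}
```

A comma needs no special handling here: it is just another character that is neither a quote nor a backslash.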

Live Demo

Live On Coliru

#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted/std_pair.hpp>
#include <iostream>
#include <map>

namespace Config {
    using Key = std::string;
    using Value = boost::variant<int, std::string, bool>;
    using Setting = std::pair<Key, Value>;
    using Settings = std::map<Key, Value>;
}

namespace Parser {
    using It = std::string::const_iterator;
    using namespace Config;
    namespace qi = boost::spirit::qi;

    using Skip = qi::blank_type;
    qi::rule<It, std::string()>   quoted_   = "'" >> *(
            "'" >> qi::char_("'") // double ''
          | '\\' >> qi::char_     // any character escaped
          | ~qi::char_("'")       // non-quotes
       ) >> "'";
    qi::rule<It, Key()>           key_      = +qi::char_("a-zA-Z0-9_"); // for example
    qi::rule<It, Value()>         value_    = qi::int_ | quoted_ | qi::bool_;
    qi::rule<It, Setting(), Skip> setting_  = key_ >> '(' >> value_ >> ')';
    qi::rule<It, Settings()>      settings_ = qi::skip(qi::blank) [*setting_];
}

int main() {
    std::string const input = R"(    arg1('value1') arg2('value '')2') arg3('user\'~!@#$%^&*_~!@#$%^&"*_-=+[{]}\|;:<.>?21**,**'))";

    Config::Settings map;
    if (parse(input.begin(), input.end(), Parser::settings_, map)) {
        for(auto& entry : map)
            std::cout << "config setting {" << entry.first << ", " << entry.second << "}\n";
    }
}

BONUS

For comparison, here's the "same" but using regex:

Live On Coliru

#include <boost/regex.hpp>
#include <boost/range/iterator_range.hpp>
#include <iostream>
#include <map>

namespace Config {
    using Key = std::string;
    using RawValue = std::string;
    using Settings = std::map<Key, RawValue>;

    Settings parse(std::string const& input) {
        Settings settings;

        boost::regex re(R"((\w+)\(('.*?')\))");
        auto f = boost::make_regex_iterator(input, re);

        for (auto& match : boost::make_iterator_range(f, {}))
            settings.emplace(match[1].str(), match[2].str());

        return settings;
    }
}

int main() {
    std::string const input = R"(    arg1('value1') arg2('value '')2') arg3('user\'~!@#$%^&*_~!@#$%^&"*_-=+[{]}\|;:<.>?21**,**'))";

    Config::Settings map = Config::parse(input);
    for(auto& entry : map)
        std::cout << "config setting {" << entry.first << ", " << entry.second << "}\n";
}

Prints

config setting {arg1, 'value1'}
config setting {arg2, 'value ''}
config setting {arg3, 'user\'~!@#$%^&*_~!@#$%^&"*_-=+[{]}\|;:<.>?21**,**'}

Notes:

  • it no longer interprets and converts any values
  • it no longer processes escapes
  • it requires an additional runtime library dependency on boost_regex
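If a regex is still preferred, the truncation of arg2 shown in the output above can be avoided by describing the escapes inside the pattern itself. The following is a sketch only, using `std::regex` (ECMAScript grammar) rather than Boost.Regex; `parse_raw` is a hypothetical name, and the captured values are still raw, so doubled quotes and backslashes are kept as written.

```cpp
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: collect key('raw value') pairs, escapes left untouched.
std::map<std::string, std::string> parse_raw(std::string const& input) {
    // (\w+)                  key
    // \('                    opening parenthesis and quote
    // ((?:''|\\.|[^'\\])*)   body: doubled quote, backslash escape, or any
    //                        character that is neither a quote nor a backslash
    // '\)                    closing quote and parenthesis
    static std::regex const re(R"((\w+)\('((?:''|\\.|[^'\\])*)'\))");

    std::map<std::string, std::string> settings;
    for (std::sregex_iterator it(input.begin(), input.end(), re), last; it != last; ++it)
        settings.emplace((*it)[1].str(), (*it)[2].str());
    return settings;
}
```

With this pattern a comma is matched by the last alternative like any other ordinary character, and arg2 from the sample input is captured as value '')2 instead of stopping at the doubled quote.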


Tags: c++ regex boost