I am trying to get to grips with parsing.
I have some data that comes in a de-de format with additional information at the end of the string.
I managed to get the de-de part right, but I am struggling to get the trailing - and % parsed correctly. I read up on codecvt, but I do not understand the topic.
Here is what I understand so far, along with an example of what I need to do.
#include <string>
#include <locale>
#include <iostream>
#include <sstream>
using namespace std;
#define EXPECT_EQ(actual, expected) { \
if (actual != expected) \
{ \
cout << "expected " << #actual << " to be " << expected << " but was " << actual << endl; \
} \
}
double parse(wstring numstr)
{
double value;
wstringstream is(numstr);
is.imbue(locale("de-de"));
is >> value;
return value;
}
int main()
{
EXPECT_EQ(parse(L"123"), 123); //ok
EXPECT_EQ(parse(L"123,45"), 123.45); //ok
EXPECT_EQ(parse(L"1.000,45"), 1000.45); //ok
EXPECT_EQ(parse(L"2,390%"), 0.0239); //% sign at the end
EXPECT_EQ(parse(L"1.234,56-"), -1234.56); //- sign at the end
}
The output is:
expected parse(L"2,390%") to be 0.0239 but was 2.39
expected parse(L"1.234,56-") to be -1234.56 but was 1234.56
How can I imbue my stream so that it reads the - and % signs the way I need it to?
I'd tackle this head-on: let's get to grips with parsing here.
You'd end up writing that somewhere anyway, so I'd forget about the need to create an (expensive) string stream first.
Weapon Of Choice: Boost Spirit
Note: I parse the string using its iterators directly. My code is pretty generic as to the type of floating-point number used.
You can pretty much search-and-replace double with e.g. boost::multiprecision::cpp_dec_float (or make it a template argument) and carry on parsing. I predict that you need to parse decimal floating-point numbers, not binary floating-point numbers; with binary ones you're losing accuracy in the conversion.
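To illustrate the accuracy point, here is a minimal sketch, assuming IEEE 754 binary64 doubles. The function names and the choice of integer thousandths as the decimal representation are hypothetical, picked only for this illustration; a decimal type such as the one mentioned above would serve the same purpose.

```cpp
#include <cstdint>

// Binary doubles: 0.1, 0.2 and 0.3 are each rounded to the nearest
// representable binary value, so the sum does not compare equal to 0.3.
inline bool binary_sum_is_exact() {
    double a = 0.1, b = 0.2;
    return a + b == 0.3;
}

// Hypothetical decimal alternative: keep amounts as integer thousandths,
// where decimal fractions are represented exactly.
inline bool scaled_sum_is_exact() {
    std::int64_t a = 100, b = 200; // 0.100 and 0.200 in thousandths
    return a + b == 300;           // 0.300 in thousandths
}
```

The first comparison fails under the stated assumption while the second holds, which is why values such as 0,0239 may be better kept in a decimal representation if exactness matters.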
UPDATE: extended sample Live On Coliru
The Simple Grammar
At its core, the grammar is really simple:
if (parse(numstr.begin(), numstr.end(), mynum >> matches['-'] >> matches['%'],
value, sign, pct))
{
if (sign) value = -value;
if (pct) value /= 100;
return value;
}
There you have it. Of course, we need to define mynum so it parses the unsigned real numbers as expected:
using namespace qi;
real_parser<double, de_numpolicy<double> > mynum;
The Magic: real_policies<>
The documentation goes a long way to explaining how to tweak real number parsing using real_policies. Here's the policy I came up with:
template <typename T>
struct de_numpolicy : qi::ureal_policies<T>
{
// No exponent
template <typename It> static bool parse_exp(It&, It const&) { return false; }
template <typename It, typename Attr> static bool parse_exp_n(It&, It const&, Attr&) { return false; }
// Thousands separated numbers
template <typename It, typename Attr>
static bool parse_n(It& first, It const& last, Attr& attr)
{
qi::uint_parser<unsigned, 10, 1, 3> uint3;
qi::uint_parser<unsigned, 10, 3, 3> uint3_3;
if (parse(first, last, uint3, attr)) {
for (T n; qi::parse(first, last, '.' >> uint3_3, n);)
attr = attr * 1000 + n;
return true;
}
return false;
}
template <typename It>
static bool parse_dot(It& first, It const& last) {
if (first == last || *first != ',')
return false;
++first;
return true;
}
};
Full Demo
Live On Coliru
#include <boost/spirit/include/qi.hpp>
#include <cassert>
#include <iostream>
#include <string>
#define EXPECT_EQ(actual, expected) { \
double v = (actual); \
if (v != expected) \
{ \
std::cout << "expected " << #actual << " to be " << expected << " but was " << v << std::endl; \
} \
}
namespace mylib {
namespace qi = boost::spirit::qi;
template <typename T>
struct de_numpolicy : qi::ureal_policies<T>
{
// No exponent
template <typename It> static bool parse_exp(It&, It const&) { return false; }
template <typename It, typename Attr> static bool parse_exp_n(It&, It const&, Attr&) { return false; }
// Thousands separated numbers
template <typename It, typename Attr>
static bool parse_n(It& first, It const& last, Attr& attr)
{
qi::uint_parser<unsigned, 10, 1, 3> uint3;
qi::uint_parser<unsigned, 10, 3, 3> uint3_3;
if (parse(first, last, uint3, attr)) {
for (T n; qi::parse(first, last, '.' >> uint3_3, n);)
attr = attr * 1000 + n;
return true;
}
return false;
}
template <typename It>
static bool parse_dot(It& first, It const& last) {
if (first == last || *first != ',')
return false;
++first;
return true;
}
};
template<typename Char, typename CharT, typename Alloc>
double parse(std::basic_string<Char, CharT, Alloc> const& numstr)
{
using namespace qi;
real_parser<double, de_numpolicy<double> > mynum;
double value;
bool sign, pct;
if (parse(numstr.begin(), numstr.end(), mynum >> matches['-'] >> matches['%'],
value, sign, pct))
{
// std::cout << "DEBUG: " << std::boolalpha << " '" << numstr << "' -> (" << value << ", " << sign << ", " << pct << ")\n";
if (sign) value = -value;
if (pct) value /= 100;
return value;
}
assert(false); // TODO handle errors
}
} // namespace mylib
int main()
{
EXPECT_EQ(mylib::parse(std::string("123")), 123); // ok
EXPECT_EQ(mylib::parse(std::string("123,45")), 123.45); // ok
EXPECT_EQ(mylib::parse(std::string("1.000,45")), 1000.45); // ok
EXPECT_EQ(mylib::parse(std::string("2,390%")), 0.0239); // % sign at the end
EXPECT_EQ(mylib::parse(std::string("1.234,56-")), -1234.56); // - sign at the end
}
If you uncomment the "DEBUG" line, it prints:
DEBUG: '123' -> (123, false, false)
DEBUG: '123,45' -> (123.45, false, false)
DEBUG: '1.000,45' -> (1000.45, false, false)
DEBUG: '2,390%' -> (2.39, false, true)
DEBUG: '1.234,56-' -> (1234.56, true, false)
The codecvt facet is the wrong place to look here. It is only intended to deal with converting an external representation of a character into an internal representation of the same character (e.g., UTF-8 in the file, UTF-32/UCS-4 internally).
For parsing numbers like this, you're looking for the num_get facet. The basic idea is that you'll create a class derived from std::num_get that overrides do_get for (at least) the types of numbers you care about.
In a typical case, you only do a "real" implementation for a few types (e.g., long long and long double) and have the functions for all the smaller types delegate to those, then convert the result to the target type.
Here's a fairly simple num_get facet. For the moment, it only attempts to provide the special processing for type double. To keep the example from getting too outrageously long, I've simplified the processing a bit:
- It doesn't try to parse exponents on the numbers (e.g., the '99' in 1e99).
- It doesn't try to deal with a suffix of %- (but will do -%).
- It's hard-coded to treat ',' as the decimal point and '.' as the thousands separator.
- It makes no attempt at sanity checking thousands separators, e.g., 1.2.3 will parse as 123.
Within those limitations here's some code:
#include <ios>
#include <string>
#include <locale>
#include <iostream>
#include <sstream>
#include <iterator>
#include <cctype>
using namespace std;
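template <class charT, class InputIterator = istreambuf_iterator<charT> >
class read_num : public std::num_get<charT, InputIterator> {
public:
    typedef charT char_type;
    typedef InputIterator iter_type;
protected:
    iter_type do_get(iter_type in, iter_type end, ios_base& str, ios_base::iostate& err, double& val) const {
        double ret = 0.0;
        bool negative = false;
        // Helpers that check for end of input before dereferencing the iterator.
        auto at = [&](charT c) { return in != end && *in == c; };
        auto at_space = [&] { return in != end && std::isspace(*in, str.getloc()); };
        auto at_digit = [&] { return in != end && *in >= '0' && *in <= '9'; };
        while (at_space())
            ++in;
        if (at('-')) { // leading minus sign
            negative = true;
            ++in;
            while (at_space())
                ++in;
        }
        if (!at_digit()) {
            // No digits at all: report failure and leave val untouched.
            err |= ios_base::failbit;
            return in;
        }
        while (at_digit()) {
            ret *= 10;
            ret += *in - '0';
            ++in;
            if (at('.')) // skip a thousands separator
                ++in;
        }
        if (at(',')) { // decimal point
            ++in;
            double place = 10.0;
            while (at_digit()) {
                ret += (*in - '0') / place;
                place *= 10;
                ++in;
            }
        }
        if (at('-')) { // trailing minus sign
            negative = true;
            ++in;
        }
        if (at('%')) { // trailing percent sign
            ret /= 100.0;
            ++in;
        }
        if (in == end)
            err |= ios_base::eofbit;
        if (negative)
            ret = -ret;
        val = ret;
        return in;
    }
};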
Realistically, under the circumstances you probably don't want to do things this way. You probably want to delegate to the existing facet to read the number proper, then at the end of what it parses, look for a - and/or % and react appropriately (and probably diagnose an error if, for example, you find both leading and trailing '-').