I'm trying to get to grips with parsing.
I have some data that comes in a de-de format with additional information at the end of the string.
I managed to get the de-de part right, but I'm struggling to get the trailing - and % parsed correctly. I read up on codecvt, but I do not understand the topic.
Here is a reflection of what I understand so far and an example of what I need to do.
    #include <string>
    #include <locale>
    #include <iostream>
    #include <sstream>
    using namespace std;

    #define EXPECT_EQ(actual, expected) { \
        if (actual != expected) \
        { \
            cout << "expected " << #actual << " to be " << expected << " but was " << actual << endl; \
        } \
    }

    double parse(wstring numstr)
    {
        double value;
        wstringstream is(numstr);
        is.imbue(locale("de-de"));
        is >> value;
        return value;
    }

    int main()
    {
        EXPECT_EQ(parse(L"123"), 123); //ok
        EXPECT_EQ(parse(L"123,45"), 123.45); //ok
        EXPECT_EQ(parse(L"1.000,45"), 1000.45); //ok
        EXPECT_EQ(parse(L"2,390%"), 0.0239); //% sign at the end
        EXPECT_EQ(parse(L"1.234,56-"), -1234.56); //- sign at the end
    }
The output is:
expected parse(L"2,390%") to be 0.0239 but was 2.39
expected parse(L"1.234,56-") to be -1234.56 but was 1234.56
How can I imbue my stream so that it reads the - and % signs the way I need it to?
The codecvt facet is the wrong place to look here. The codecvt facet is only intended to deal with converting an external representation of a character into an internal representation of the same character (e.g., UTF-8 in the file, UTF-32/UCS-4 internally).
For parsing numbers like this, you're looking for the num_get facet. The basic idea is that you'll create a class derived from std::num_get that overrides do_get for (at least) the types of numbers you care about.
In a typical case, you only do a "real" implementation for a few types (e.g., long long and long double) and have the functions for all the smaller types delegate to those, then convert the result to the target type.
Here's a fairly simple num_get facet. For the moment, it only attempts to provide the special processing for type double. To keep the example from getting too outrageously long, I've simplified the processing a bit:
- It doesn't handle a suffix of %- (but will do -%).
- Input such as 1,,,3 will parse as 13.
Within those limitations, here's some code:
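As a rough illustration of such a facet, here is a minimal sketch written under the limitations listed above. It is my own stand-in, not a verbatim listing: the class name suffix_num_get is hypothetical, the de-DE separators (',' for decimals, '.' for grouping) are hard-coded assumptions rather than taken from numpunct, and overflow of very long digit strings is not checked.

```cpp
#include <cstddef>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: reads an unsigned de-DE style number for type double,
// followed by an optional '-' and then an optional '%'.
template <class CharT>
class suffix_num_get : public std::num_get<CharT> {
public:
    using iter_type = typename std::num_get<CharT>::iter_type;
    explicit suffix_num_get(std::size_t refs = 0) : std::num_get<CharT>(refs) {}

protected:
    using std::num_get<CharT>::do_get;  // keep the other overloads visible

    iter_type do_get(iter_type in, iter_type end, std::ios_base&,
                     std::ios_base::iostate& err, double& v) const override {
        unsigned long long mantissa = 0;  // all digits, decimal comma removed
        int frac_digits = 0;              // digits seen after the decimal comma
        bool any_digit = false, in_fraction = false;

        for (; in != end; ++in) {
            const CharT c = *in;
            if (c >= CharT('0') && c <= CharT('9')) {
                mantissa = mantissa * 10 + static_cast<unsigned>(c - CharT('0'));
                any_digit = true;
                if (in_fraction) ++frac_digits;
            } else if (c == CharT(',') && !in_fraction) {
                in_fraction = true;  // assumed decimal separator
            } else if (c == CharT('.') && !in_fraction) {
                // assumed grouping separator: skipped, not validated
            } else {
                break;
            }
        }
        if (!any_digit) {
            v = 0;
            err |= std::ios_base::failbit;
            if (in == end) err |= std::ios_base::eofbit;
            return in;
        }

        bool negative = false;
        if (in != end && *in == CharT('-')) { negative = true; ++in; }
        if (in != end && *in == CharT('%')) { frac_digits += 2; ++in; }  // percent = shift by 2

        double divisor = 1.0;
        for (int i = 0; i < frac_digits; ++i) divisor *= 10.0;
        v = static_cast<double>(mantissa) / divisor;
        if (negative) v = -v;
        if (in == end) err |= std::ios_base::eofbit;
        return in;
    }
};

// Same shape as the question's parse(), but imbued with the facet above.
double parse(const std::wstring& numstr) {
    std::wistringstream is(numstr);
    is.imbue(std::locale(is.getloc(), new suffix_num_get<wchar_t>));
    double value = 0.0;
    is >> value;
    return value;
}
```

The locale takes ownership of the facet, so the bare new is intentional. For narrow streams the same facet would be installed as suffix_num_get<char>.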
Realistically, under the circumstances you probably don't want to do things this way. You probably want to delegate to the existing facet to read the number proper, then at the end of what it parses, look for a - and/or % and react appropriately (and probably diagnose an error if, for example, you find both leading and trailing '-').
I'd tackle this head-on: let's get to grips with parsing here.
You'd end up writing that somewhere anyways, so I'd forget about the need to create an (expensive) string stream first.
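To make the "no string stream" point concrete before bringing in a library, here is a minimal hand-rolled sketch that works directly on the string. It is not the Spirit grammar: the function name parse_de is hypothetical, the de-DE separators are hard-coded, grouping is not validated, and it assumes the global C locale is the default "C" so that std::strtod expects '.' as the decimal point.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: parses input such as L"1.234,56-" or L"2,390%".
// Returns false (leaving out untouched) if the text is not in that shape.
bool parse_de(const std::wstring& text, double& out) {
    std::size_t len = text.size();
    bool percent = false, negative = false;

    // Peel the optional suffixes off the end: '%' comes last, '-' before it.
    if (len > 0 && text[len - 1] == L'%') { percent = true; --len; }
    if (len > 0 && text[len - 1] == L'-') { negative = true; --len; }

    std::string plain;  // the number rewritten in "C" notation, e.g. "1234.56"
    bool seen_comma = false, seen_digit = false;
    for (std::size_t i = 0; i < len; ++i) {
        const wchar_t c = text[i];
        if (c >= L'0' && c <= L'9') {
            plain += static_cast<char>('0' + (c - L'0'));
            seen_digit = true;
        } else if (c == L'.' && !seen_comma) {
            // grouping separator: dropped without checking group sizes
        } else if (c == L',' && !seen_comma) {
            plain += '.';
            seen_comma = true;
        } else {
            return false;  // anything else, including a "%-" suffix, is rejected
        }
    }
    if (!seen_digit) return false;

    double value = std::strtod(plain.c_str(), nullptr);
    if (percent) value /= 100.0;
    out = negative ? -value : value;
    return true;
}
```

Called as parse_de(L"1.234,56-", v), this reports success through the return value and hands back the number in v, so malformed input can be told apart from a legitimate 0.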
Weapon Of Choice: Boost Spirit
The Simple Grammar
At its core, the grammar is really simple:
There you have it. Of course, we need to define mynum so it parses the unsigned real numbers as expected:
The Magic: real_policies<>
The documentation goes a long way to explaining how to tweak real number parsing using real_policies. Here's the policy I came up with:
Full Demo
If you uncomment the "DEBUG" line, it prints: