Toy code in coliru I am using for testing:
http://coliru.stacked-crooked.com/a/4039865d8d4dad52
I am getting used to C++ again after a long hiatus from it. I am writing code that parses a CSV that may have several columns with dates or nulls. My assumption is that every date column has exactly one kind of valid date format though different columns may have different formats.
For each date column, I look for the first value that parses successfully as a date, trying each entry in a std::vector of candidate locales, each of which carries a Boost date_input_facet. The first successful parse gives me the index of the locale that worked. Once I have the right format for that first parsable date, I want to fix it for the rest of the column so that I no longer waste CPU time detecting the format.
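The detect-once-then-reuse idea described above can be sketched as follows. This is only an illustration, not the code from the coliru link: ColumnFormat and its try_parse callback are hypothetical names, and the callback is assumed to return true only when the format at the given index parses the whole value (for example, a wrapper around one of the locales in the vector).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical helper: remembers which format index worked for one column.
// try_parse(value, index) is assumed to return true only when the format
// at `index` parses `value` completely.
class ColumnFormat {
public:
    using TryParse = std::function<bool(const std::string&, std::size_t)>;

    ColumnFormat(std::size_t format_count, TryParse try_parse)
        : format_count_(format_count), try_parse_(std::move(try_parse)) {
        assert(static_cast<bool>(try_parse_));  // a callback is required
    }

    // Returns the index of the format used, or std::nullopt for nulls and
    // for values that do not parse.
    std::optional<std::size_t> parse(const std::string& value) {
        if (value.empty()) return std::nullopt;  // treat empty cells as null
        if (fixed_) {
            // Format already detected: try only that one, no more probing.
            if (try_parse_(value, *fixed_)) return fixed_;
            return std::nullopt;
        }
        for (std::size_t i = 0; i < format_count_; ++i) {
            if (try_parse_(value, i)) {
                fixed_ = i;  // first success fixes the format for the column
                return fixed_;
            }
        }
        return std::nullopt;
    }

    std::optional<std::size_t> fixed_index() const { return fixed_; }

private:
    std::size_t format_count_;
    TryParse try_parse_;
    std::optional<std::size_t> fixed_;
};
```

One ColumnFormat object per date column keeps the columns independent, so different columns can settle on different formats. The sketch needs C++17 for std::optional.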
Here is my array of locales:
const std::vector<std::locale> Date::date_formats = {
std::locale(std::locale::classic(), new date_input_facet("%Y-%m-%d")),
std::locale(std::locale::classic(), new date_input_facet("%Y/%m/%d")),
std::locale(std::locale::classic(), new date_input_facet("%m-%d-%Y")),
std::locale(std::locale::classic(), new date_input_facet("%m/%d/%Y")),
std::locale(std::locale::classic(), new date_input_facet("%d-%b-%Y")),
std::locale(std::locale::classic(), new date_input_facet("%Y%m%d")),
};
I use an array of date strings from 20170101 to 20170131 to test this out. I then print out the original date strings, the date that was parsed, along with the index of the date_formats vector that worked for parsing.
For 20170101 to 20170129, it reports that index 0 worked, which is supposed to be the "%Y-%m-%d" format with the dashes?! Moreover, I have digits where the dashes should go, so it reads 20170101 as 2017-10-, drops the last dash, and interprets it as October 2017, which without a day becomes Oct 1, 2017. Why would it do that when that is not the format it was supposed to use?
Some results that one could see from my coliru (pY is parsed year, etc):
YYYYMMDD pY pM pD format_index
20170101 2017 Oct 1 0
20170102 2017 Oct 1 0
20170103 2017 Oct 1 0
20170104 2017 Oct 1 0
20170105 2017 Oct 1 0
For 20170130 and 20170131, the correct format index (5, for "%Y%m%d") is reported.
Any ideas? I only want the precise format string I passed to be used.
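The output above is consistent with literal format characters being skipped rather than matched (2017, skip one character, month 10, skip one character, no day left), although nothing in this thread confirms that. One way to enforce the exact format regardless of the parser is a layout pre-check before trying a locale. The sketch below is an assumption-laden illustration: matches_layout, take, is_digit and is_alpha are hypothetical helpers, fields are assumed to be zero-padded to fixed widths, and only the specifiers used in the question are covered.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
bool is_alpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }

// Consumes exactly n characters of the given class, starting at pos.
bool take(const std::string& s, std::size_t& pos, std::size_t n, bool (*pred)(char)) {
    assert(pos <= s.size());
    if (s.size() - pos < n) return false;
    for (std::size_t k = 0; k < n; ++k)
        if (!pred(s[pos + k])) return false;
    pos += n;
    return true;
}

// True only if `input` has exactly the layout that `format` describes.
// Supports %Y (4 digits), %m and %d (2 digits each) and %b (3 letters);
// every other format character must match the input literally.
bool matches_layout(const std::string& format, const std::string& input) {
    std::size_t pos = 0;
    for (std::size_t i = 0; i < format.size(); ++i) {
        if (format[i] == '%' && i + 1 < format.size()) {
            switch (format[++i]) {
                case 'Y': if (!take(input, pos, 4, is_digit)) return false; break;
                case 'm':
                case 'd': if (!take(input, pos, 2, is_digit)) return false; break;
                case 'b': if (!take(input, pos, 3, is_alpha)) return false; break;
                default:  return false;  // specifier not covered by this sketch
            }
        } else {
            if (pos >= input.size() || input[pos] != format[i]) return false;
            ++pos;
        }
    }
    return pos == input.size();  // no trailing characters allowed
}
```

This only checks the shape of the string, not the values (month 13 still passes), so it is meant as a guard in front of the real parse, not a replacement for it.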
I've made a multi-format capable date-time parser myself. I, too, found it hard/impossible to get the parsing strict using the facilities in the standard library and boost.
I ended up using strptime, mostly¹.
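To give an idea of what strict parsing with strptime can look like, here is a minimal sketch. It is not the adaptive_parser code: parse_full and first_matching_format are hypothetical names, and strptime is a POSIX function (declared in <time.h>), so this assumes a POSIX platform. The key point is rejecting any parse that does not consume the whole input.

```cpp
#include <cassert>
#include <ctime>
#include <string>
#include <time.h>   // strptime (POSIX, not ISO C++)

// Returns true only if `format` matches and consumes all of `input`.
bool parse_full(const std::string& input, const char* format, std::tm& out) {
    assert(format != nullptr);
    std::tm parsed = {};
    const char* end = strptime(input.c_str(), format, &parsed);
    if (end == nullptr || *end != '\0') return false;  // no match, or leftovers
    out = parsed;
    return true;
}

// Index of the first format that fully matches `input`, or -1 if none does.
// The list is the one from the question, in the same order.
int first_matching_format(const std::string& input, std::tm& out) {
    static const char* const formats[] = {
        "%Y-%m-%d", "%Y/%m/%d", "%m-%d-%Y", "%m/%d/%Y", "%d-%b-%Y", "%Y%m%d",
    };
    const int count = static_cast<int>(sizeof formats / sizeof formats[0]);
    for (int i = 0; i < count; ++i)
        if (parse_full(input, formats[i], out)) return i;
    return -1;
}
```

On a typical glibc system, a string such as 20170101 should fall through the delimited formats, because the literal '-' or '/' has to match the input, and land on "%Y%m%d". Note that %b depends on the C locale in effect.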
adaptive_parser
Intended to be seeded with a list of supported formats, in order of preference. By default, the parser is not adaptive (mode is fixed).
In adaptive modes the format can be required to be:
- sticky (consistently reuse the first matched format)
- ban_failed (remove failed patterns from the list; banning only occurs on successful parse to avoid banning all patterns on invalid input)
- mru (preserves the list but re-orders for performance)
Caution: If formats are ambiguous (e.g. mm-dd-yyyy vs dd-mm-yyyy), allowing re-ordering gives unpredictable results.
⇒ Only use mru when there are no ambiguous formats.
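To make the modes more concrete, here is one possible reading of them as a list update that runs after a successful parse. This is a guess at the semantics from the description above, not the actual adaptive_parser implementation; Mode and update_patterns are hypothetical names, and `hit` is assumed to be the index of the pattern that just matched (so everything before it was tried and failed).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class Mode { fixed, sticky, ban_failed, mru };

// Hypothetical post-match update of the pattern list.
void update_patterns(std::vector<std::string>& patterns, std::size_t hit, Mode mode) {
    if (hit >= patterns.size()) return;  // nothing matched: leave the list alone
    const auto hit_it = patterns.begin() + static_cast<std::ptrdiff_t>(hit);
    switch (mode) {
        case Mode::fixed:
            break;  // not adaptive: the list never changes
        case Mode::sticky: {
            const std::string keep = *hit_it;
            patterns.assign(1, keep);  // only the matched pattern survives
            break;
        }
        case Mode::ban_failed:
            // Drop the patterns that were tried before the hit and failed.
            patterns.erase(patterns.begin(), hit_it);
            break;
        case Mode::mru:
            // Move the hit to the front; the others keep their relative order.
            std::rotate(patterns.begin(), hit_it, hit_it + 1);
            break;
    }
    assert(!patterns.empty());  // a successful match never empties the list
}
```

With mru the order of the remaining patterns changes over time, which is exactly why ambiguous pairs such as mm-dd-yyyy and dd-mm-yyyy become unpredictable: whichever matched most recently is tried first.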
NOTE: The function object is stateful. In algorithms, pass it by reference (std::ref(obj)) to avoid copying the patterns and to ensure correct adaptive behaviour.
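The reason for the std::ref advice is that standard algorithms take their function object by value, so any state it accumulates ends up in a copy. A small sketch with a hypothetical stand-in (CountingParser is not part of adaptive_parser; it just counts calls):

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a stateful parser: counts how often it is called.
struct CountingParser {
    int calls = 0;
    int operator()(const std::string& s) { ++calls; return s.empty() ? 0 : 1; }
};

// Passing the object directly: the algorithm works on a copy.
int calls_seen_by_value(const std::vector<std::string>& in) {
    CountingParser p;
    std::vector<int> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), p);
    return p.calls;  // the original object was never called
}

// Passing std::ref(p): the algorithm calls the original object.
int calls_seen_by_ref(const std::vector<std::string>& in) {
    CountingParser p;
    std::vector<int> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), std::ref(p));
    assert(out.size() == in.size());
    return p.calls;  // one call per element
}
```

For an adaptive parser the lost state would be the learned format, so a by-value copy would silently go back to probing every pattern.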
Demo
I tried the parser on your test data:
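#include "adaptive_parser.h"
#include <boost/date_time/gregorian/greg_date.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <algorithm>   // std::generate, std::transform
#include <functional>  // std::mem_fn
#include <iomanip>     // std::setfill, std::setw
#include <iostream>
#include <iterator>    // std::back_inserter
#include <sstream>
#include <string>
#include <vector>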
class Date {
public:
    Date() : y(0), m(0), d(0) {}
    Date(int yy, int mm, int dd) : y(yy), m(mm), d(dd) {}
    Date(boost::gregorian::date dt) : y(dt.year()), m(dt.month()), d(dt.day()) {}
    Date(std::string const& delimitedString);
    std::string to_string() const;
    int getYear() const { return y; }
    int getMonth() const { return m; }
    int getDay() const { return d; }
private:
    using parser_t = mylib::datetime::adaptive_parser;
    parser_t parser { parser_t::full_match,
        {
            "%Y-%m-%d", "%Y/%m/%d",
            "%m-%d-%Y", "%m/%d/%Y",
            "%d-%b-%Y",
            "%Y%m%d",
        } };
    int y, m, d;
};

Date::Date(const std::string& delimitedString)
{
    using namespace boost::posix_time;
    auto t = ptime({1970,1,1}) + seconds(parser(delimitedString).count());
    *this = Date(t.date());
}

std::string Date::to_string() const
{
    std::ostringstream os;
    os << std::setfill('0')
       << std::setw(4) << y
       << std::setw(2) << m
       << std::setw(2) << d;
    return os.str();
}

int main() {
    std::vector<Date> vec(31);
    std::generate(vec.begin(), vec.end(), [i=1]() mutable { return Date(2017,1,i++); });
    std::vector<std::string> strvec;
    std::transform(vec.begin(), vec.end(), back_inserter(strvec), std::mem_fn(&Date::to_string));
    std::cout << "YYYYMMDD\tpY\tpM\tpD\tformat_index\n";
    for (auto& str : strvec) {
        Date parsed(str);
        std::cout << str
                  << "\t" << parsed.getYear()
                  << "\t" << parsed.getMonth()
                  << "\t" << parsed.getDay()
                  << "\t" << "?"
                  << "\n";
    }
}
Prints:
YYYYMMDD pY pM pD format_index
20170101 2017 1 1 ?
20170102 2017 1 2 ?
20170103 2017 1 3 ?
20170104 2017 1 4 ?
20170105 2017 1 5 ?
20170106 2017 1 6 ?
20170107 2017 1 7 ?
20170108 2017 1 8 ?
20170109 2017 1 9 ?
20170110 2017 1 10 ?
20170111 2017 1 11 ?
20170112 2017 1 12 ?
20170113 2017 1 13 ?
20170114 2017 1 14 ?
20170115 2017 1 15 ?
20170116 2017 1 16 ?
20170117 2017 1 17 ?
20170118 2017 1 18 ?
20170119 2017 1 19 ?
20170120 2017 1 20 ?
20170121 2017 1 21 ?
20170122 2017 1 22 ?
20170123 2017 1 23 ?
20170124 2017 1 24 ?
20170125 2017 1 25 ?
20170126 2017 1 26 ?
20170127 2017 1 27 ?
20170128 2017 1 28 ?
20170129 2017 1 29 ?
20170130 2017 1 30 ?
20170131 2017 1 31 ?
¹ just the timezone stuff needs tweaks, mostly
Using Howard Hinnant's free, open-source C++11/14/17 date/time library, this:
#include "date/date.h"
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int
localeIndexFromString(const std::string& delimitedString)
{
    using namespace std;
    static vector<string> date_formats
    {
        "%Y-%m-%d",
        "%Y/%m/%d",
        "%m-%d-%Y",
        "%m/%d/%Y",
        "%d-%b-%Y",
        "%Y%m%d"
    };
    istringstream is;
    date::year_month_day dt;
    size_t i;
    for (i = 0; i < date_formats.size(); ++i)
    {
        is.clear();
        is.str(delimitedString);
        is >> date::parse(date_formats[i], dt);
        if (!is.fail())
        {
            std::cout << dt.year() << "\t" << dt.month() << "\t" << dt.day();
            return i;
        }
    }
    return -1;
}

int
main()
{
    using namespace date::literals;
    std::vector<date::year_month_day> vec;
    for (auto i = 1; i < 32; ++i)
        vec.push_back(2017_y/jan/i);
    std::vector<std::string> strvec;
    for (auto const& d : vec)
        strvec.push_back(date::format("%Y%m%d", d));
    std::cout << "YYYYMMDD\tpY\tpM\tpD\tformat_index\n";
    for (size_t i=0; i < strvec.size(); ++i)
    {
        std::cout << strvec[i] << "\t";
        int fmt_index = localeIndexFromString(strvec[i]);
        std::cout << "\t" << fmt_index << "\n";
    }
}
Outputs:
YYYYMMDD pY pM pD format_index
20170101 2017 Jan 01 5
20170102 2017 Jan 02 5
20170103 2017 Jan 03 5
20170104 2017 Jan 04 5
20170105 2017 Jan 05 5
20170106 2017 Jan 06 5
20170107 2017 Jan 07 5
20170108 2017 Jan 08 5
20170109 2017 Jan 09 5
20170110 2017 Jan 10 5
20170111 2017 Jan 11 5
20170112 2017 Jan 12 5
20170113 2017 Jan 13 5
20170114 2017 Jan 14 5
20170115 2017 Jan 15 5
20170116 2017 Jan 16 5
20170117 2017 Jan 17 5
20170118 2017 Jan 18 5
20170119 2017 Jan 19 5
20170120 2017 Jan 20 5
20170121 2017 Jan 21 5
20170122 2017 Jan 22 5
20170123 2017 Jan 23 5
20170124 2017 Jan 24 5
20170125 2017 Jan 25 5
20170126 2017 Jan 26 5
20170127 2017 Jan 27 5
20170128 2017 Jan 28 5
20170129 2017 Jan 29 5
20170130 2017 Jan 30 5
20170131 2017 Jan 31 5