C++ boost date_input_facet seems to parse dates un

Posted 2020-07-29 15:46

Toy code in coliru I am using for testing: http://coliru.stacked-crooked.com/a/4039865d8d4dad52

I am getting used to C++ again after a long hiatus from it. I am writing code that parses a CSV that may have several columns with dates or nulls. My assumption is that every date column has exactly one kind of valid date format though different columns may have different formats.

For each date column, I find the first value that parses successfully as a date, trying each entry in a std::vector of candidate locales, each of which carries a boost date_input_facet object. That first successful parse gives me the index in my array of locales that worked. Once I have the appropriate format for the first parsable date, I want to fix that format for the rest of the column so that I no longer waste CPU time detecting the format.

Here is my array of locales:

const std::vector<std::locale> Date::date_formats = {
    std::locale(std::locale::classic(), new date_input_facet("%Y-%m-%d")),
    std::locale(std::locale::classic(), new date_input_facet("%Y/%m/%d")),
    std::locale(std::locale::classic(), new date_input_facet("%m-%d-%Y")),
    std::locale(std::locale::classic(), new date_input_facet("%m/%d/%Y")),
    std::locale(std::locale::classic(), new date_input_facet("%d-%b-%Y")),
    std::locale(std::locale::classic(), new date_input_facet("%Y%m%d")),
};

I use an array of date strings from 20170101 to 20170131 to test this out. I then print out the original date strings, the date that was parsed, along with the index of the date_formats vector that worked for parsing.

For 20170101 to 20170129, it reports that index 0 worked, which is supposed to be the "%Y-%m-%d" format, with dashes. Moreover, where the dashes should go my input has digits, so it seems to read 20170101 as 2017-10-, drop the last dash, and interpret it as October 2017, which without a day becomes Oct 1, 2017. Why would it do that when the input does not match the format it was supposed to use?

Some results that one could see from my coliru (pY is parsed year, etc):

YYYYMMDD    pY     pM   pD  format_index
20170101    2017    Oct 1   0
20170102    2017    Oct 1   0
20170103    2017    Oct 1   0
20170104    2017    Oct 1   0
20170105    2017    Oct 1   0

For 20170130 and 20170131, the correct format index (5) is reported, for "%Y%m%d".

Any ideas? I only want the precise format string I passed to be used.

2 answers
虎瘦雄心在
#2 · 2020-07-29 16:20

Using Howard Hinnant's free, open-source C++11/14/17 date/time library, this:

#include "date/date.h"
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int
localeIndexFromString(const std::string& delimitedString)
{
    using namespace std;
    static vector<string> date_formats
    {
        "%Y-%m-%d",
        "%Y/%m/%d",
        "%m-%d-%Y",
        "%m/%d/%Y",
        "%d-%b-%Y",
        "%Y%m%d"
    };

    istringstream is;
    date::year_month_day dt;
    size_t i;
    for (i = 0; i < date_formats.size(); ++i)
    {
        is.clear();
        is.str(delimitedString);
        is >> date::parse(date_formats[i], dt);
        if (!is.fail())
        {
            std::cout << dt.year() << "\t" << dt.month() << "\t" << dt.day();
            return static_cast<int>(i);  // explicit size_t -> int conversion
        }
    }
    return -1;
}

int
main()
{
    using namespace date::literals;
    std::vector<date::year_month_day> vec;
    for (auto i = 1; i < 32; ++i)
        vec.push_back(2017_y/jan/i);

    std::vector<std::string> strvec;
    for (auto const& d : vec)
        strvec.push_back(date::format("%Y%m%d", d));

    std::cout << "YYYYMMDD\tpY\tpM\tpD\tformat_index\n";

    for (size_t i=0; i < strvec.size(); ++i)
    {
        std::cout << strvec[i] << "\t";
        int fmt_index = localeIndexFromString(strvec[i]);
        std::cout << "\t" << fmt_index << "\n";
    }
}

Outputs:

YYYYMMDD        pY        pM        pD        format_index
20170101        2017      Jan       01        5
20170102        2017      Jan       02        5
20170103        2017      Jan       03        5
20170104        2017      Jan       04        5
20170105        2017      Jan       05        5
20170106        2017      Jan       06        5
20170107        2017      Jan       07        5
20170108        2017      Jan       08        5
20170109        2017      Jan       09        5
20170110        2017      Jan       10        5
20170111        2017      Jan       11        5
20170112        2017      Jan       12        5
20170113        2017      Jan       13        5
20170114        2017      Jan       14        5
20170115        2017      Jan       15        5
20170116        2017      Jan       16        5
20170117        2017      Jan       17        5
20170118        2017      Jan       18        5
20170119        2017      Jan       19        5
20170120        2017      Jan       20        5
20170121        2017      Jan       21        5
20170122        2017      Jan       22        5
20170123        2017      Jan       23        5
20170124        2017      Jan       24        5
20170125        2017      Jan       25        5
20170126        2017      Jan       26        5
20170127        2017      Jan       27        5
20170128        2017      Jan       28        5
20170129        2017      Jan       29        5
20170130        2017      Jan       30        5
20170131        2017      Jan       31        5
Lonely孤独者°
#3 · 2020-07-29 16:28

I've made a multi-format capable date-time parser myself. I, too, found it hard/impossible to get the parsing strict using the facilities in the standard library and boost.

I ended up using strptime - mostly¹.

adaptive_parser

It is intended to be seeded with a list of supported formats, in order of preference. By default, the parser is not adaptive (mode is fixed).

In adaptive modes the format can be required to be

  • sticky (consistently reuse the first matched format)
  • ban_failed (remove failed patterns from the list; banning only occurs on successful parse to avoid banning all patterns on invalid input)
  • mru (preserves the list but re-orders for performance)

Caution: if formats are ambiguous (e.g. mm-dd-yyyy vs dd-mm-yyyy), allowing re-ordering gives unpredictable results.

    ⇒ Only use mru when there are no ambiguous formats.

Note: the function object is stateful. In algorithms, pass it by reference (std::ref(obj)) to avoid copying the patterns and to ensure correct adaptive behaviour.

Demo

I tried the parser on your test data:

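#include "adaptive_parser.h"
#include <boost/date_time/gregorian/greg_date.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <algorithm>   // std::generate, std::transform
#include <functional>  // std::mem_fn
#include <iomanip>     // std::setfill, std::setw
#include <iostream>
#include <iterator>    // std::back_inserter
#include <sstream>
#include <string>
#include <vector>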

class Date{
public:
    Date() : y(0), m(0), d(0) {}
    Date(int yy, int mm, int dd) : y(yy), m(mm), d(dd) {}
    Date(boost::gregorian::date dt) : y(dt.year()), m(dt.month()), d(dt.day()) {}
    Date(std::string const& delimitedString);

    std::string to_string() const;

    int getYear()  const { return y; }
    int getMonth() const { return m; }
    int getDay()   const { return d; }
 private:
    using parser_t = mylib::datetime::adaptive_parser;
    parser_t parser { parser_t::full_match, 
        {
            "%Y-%m-%d", "%Y/%m/%d",
            "%m-%d-%Y", "%m/%d/%Y",
            "%d-%b-%Y",
            "%Y%m%d",
        } };

    int y, m, d;
};

Date::Date(const std::string& delimitedString)
{
    using namespace boost::posix_time;

    auto t = ptime({1970,1,1}) + seconds(parser(delimitedString).count());

    *this = Date(t.date());
}

std::string Date::to_string() const
{
    std::ostringstream os;

    os << std::setfill('0')
       << std::setw(4) << y 
       << std::setw(2) << m 
       << std::setw(2) << d;

    return os.str();
}

int main() {
    std::vector<Date> vec(31);
    std::generate(vec.begin(), vec.end(), [i=1]() mutable { return Date(2017,1,i++); });

    std::vector<std::string> strvec;
    std::transform(vec.begin(), vec.end(), back_inserter(strvec), std::mem_fn(&Date::to_string));

    std::cout << "YYYYMMDD\tpY\tpM\tpD\tformat_index\n";

    for (auto& str : strvec) {
        Date parsed(str);

        std::cout << str 
            << "\t" << parsed.getYear()
            << "\t" << parsed.getMonth()
            << "\t" << parsed.getDay()
            << "\t" << "?"
            << "\n";
    }
}

Prints:

YYYYMMDD    pY  pM  pD  format_index
20170101    2017    1   1   ?
20170102    2017    1   2   ?
20170103    2017    1   3   ?
20170104    2017    1   4   ?
20170105    2017    1   5   ?
20170106    2017    1   6   ?
20170107    2017    1   7   ?
20170108    2017    1   8   ?
20170109    2017    1   9   ?
20170110    2017    1   10  ?
20170111    2017    1   11  ?
20170112    2017    1   12  ?
20170113    2017    1   13  ?
20170114    2017    1   14  ?
20170115    2017    1   15  ?
20170116    2017    1   16  ?
20170117    2017    1   17  ?
20170118    2017    1   18  ?
20170119    2017    1   19  ?
20170120    2017    1   20  ?
20170121    2017    1   21  ?
20170122    2017    1   22  ?
20170123    2017    1   23  ?
20170124    2017    1   24  ?
20170125    2017    1   25  ?
20170126    2017    1   26  ?
20170127    2017    1   27  ?
20170128    2017    1   28  ?
20170129    2017    1   29  ?
20170130    2017    1   30  ?
20170131    2017    1   31  ?

¹ just the timezone stuff needs tweaks, mostly
