Parse without string split

Posted 2020-03-19 03:04

Question:

This is a spin-off from the discussion in some other question.

Suppose I've got to parse a huge number of very long strings. Each string contains a sequence of doubles (in text representation, of course) separated by whitespace. I need to parse the doubles into a List<double>.

The standard parsing technique (using string.Split + double.TryParse) seems to be quite slow: for each of the numbers we need to allocate a string.
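For reference, the baseline described above might look roughly like the sketch below. The class name BaselineParser and the choice of the invariant culture are assumptions made for illustration; they are not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class BaselineParser // hypothetical name, for illustration only
{
    // Splits on spaces and parses each piece.
    // Every piece returned by Split is a freshly allocated string.
    public static List<double> Parse(string s)
    {
        var result = new List<double>();
        string[] tokens = s.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string token in tokens)
        {
            // Assumption: numbers use '.' as the decimal separator.
            if (double.TryParse(token, NumberStyles.Float, CultureInfo.InvariantCulture, out double value))
                result.Add(value);
        }
        return result;
    }
}
```

Besides the token array itself, this allocates one string per number, which is the cost in question.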

I tried to do it the old C-like way: compute the indices of the beginning and the end of each substring containing a number, and parse it "in place", without creating an additional string. (See http://ideone.com/Op6h0; the relevant part is shown below.)

int startIdx, endIdx = 0;
while(true)
{
    startIdx = endIdx;
    // no find_first_not_of in C#
    while (startIdx < s.Length && s[startIdx] == ' ') startIdx++;
    if (startIdx == s.Length) break;
    endIdx = s.IndexOf(' ', startIdx);
    if (endIdx == -1) endIdx = s.Length;
    // how to extract a double here?
}

There is an overload of string.IndexOf that searches only within a given substring, but I failed to find a method for parsing a double from a substring without actually extracting that substring first.

Does anyone have an idea?

Answer 1:

In the .NET Framework there is no managed API to parse a double from a substring. My guess is that allocating the string will be insignificant compared to all the floating point operations in double.Parse.
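On newer runtimes the picture is different: since .NET Core 2.1, double.Parse and double.TryParse have overloads that accept a ReadOnlySpan<char>, and string.AsSpan(start, length) yields a slice of the original string without copying it. A minimal sketch that plugs this into the loop from the question, assuming such a runtime is available and that the numbers use the invariant culture (the class name SpanParser is made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class SpanParser // hypothetical name, for illustration only
{
    public static List<double> Parse(string s)
    {
        var result = new List<double>();
        int startIdx, endIdx = 0;
        while (true)
        {
            startIdx = endIdx;
            // skip the separators in front of the next number
            while (startIdx < s.Length && s[startIdx] == ' ') startIdx++;
            if (startIdx == s.Length) break;
            endIdx = s.IndexOf(' ', startIdx);
            if (endIdx == -1) endIdx = s.Length;
            // AsSpan only records offset and length; no substring is allocated.
            ReadOnlySpan<char> slice = s.AsSpan(startIdx, endIdx - startIdx);
            result.Add(double.Parse(slice, NumberStyles.Float, CultureInfo.InvariantCulture));
        }
        return result;
    }
}
```

This does not apply to the .NET Framework, where these overloads are missing.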

Anyway, you can save the allocation by creating a "buffer" string once, of length 100, consisting of whitespace only. Then, for every substring you want to parse, you copy its chars into this buffer string using unsafe code and fill the rest of the buffer with whitespace. For parsing you can use NumberStyles.AllowTrailingWhite, which causes trailing whitespace to be ignored.

Getting a pointer to string is actually a fully supported operation:

    string l_pos = new string(' ', 100); // don't write to a shared string!
    unsafe
    {
        fixed (char* l_pSrc = l_pos)
        {
            // do some work
        }
    }

C# has special syntax to bind a string to a char*.
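Put together, the buffer idea could be sketched as follows. This is only an illustration under several assumptions: the names BufferParser and ParseAt are invented, the code must be compiled with unsafe blocks enabled, the buffer is not safe to use from several threads, and writing into a string goes against the runtime's assumption that strings are immutable, so treat it as an experiment rather than a recommended pattern.

```csharp
using System;
using System.Globalization;

static class BufferParser // hypothetical name, for illustration only
{
    const int BufferLength = 100;

    // Private buffer created with "new", so it is not a shared (interned) literal.
    static readonly string buffer = new string(' ', BufferLength);

    // Parses the number stored in s[start .. start + length) via the reusable buffer.
    public static unsafe double ParseAt(string s, int start, int length)
    {
        if (start < 0 || length < 0 || start + length > s.Length)
            throw new ArgumentOutOfRangeException(nameof(start));
        if (length > BufferLength)
            throw new ArgumentException("token is longer than the buffer");

        fixed (char* src = s)
        fixed (char* dst = buffer)
        {
            // copy the token, then pad the rest of the buffer with spaces
            for (int i = 0; i < length; i++) dst[i] = src[start + i];
            for (int i = length; i < BufferLength; i++) dst[i] = ' ';
        }

        // NumberStyles.Float includes AllowTrailingWhite, so the padding is ignored.
        return double.Parse(buffer, NumberStyles.Float, CultureInfo.InvariantCulture);
    }
}
```

The startIdx and endIdx values computed in the question's loop would be passed as start and endIdx - startIdx.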



Answer 2:

If you want to do it really fast, I would use a state machine.

This could look like:

enum State
{
    Separator, Sign, Mantissa // etc.
}
State CurrentState = State.Separator;
int Prefix = 0, Exponent = 0, Mantissa = 0;
foreach (var ch in InputString)
{
    switch (CurrentState)
    {
        // set the new CurrentState depending on ch and CurrentState
        case State.Separator:
            GotNewDouble(Prefix, Exponent, Mantissa);
            break;
        // ... other states
    }
}
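A more complete version of this idea might look like the sketch below. It is one possible reading of the outline, not the answerer's actual code: the state names, the StateMachineParser class and the Pow10 helper are assumptions, results are collected in a list instead of being passed to GotNewDouble, only the space character is treated as a separator, and the result is not guaranteed to be correctly rounded for very long mantissas or large exponents.

```csharp
using System;
using System.Collections.Generic;

static class StateMachineParser // hypothetical name, for illustration only
{
    enum State { Separator, Integer, Fraction, ExponentStart, Exponent }

    public static List<double> Parse(string s)
    {
        var result = new List<double>();
        State state = State.Separator;
        double mantissa = 0;   // all digits seen so far, read as one integer
        int scale = 0;         // number of digits after the decimal point
        int exponent = 0;      // digits after 'e' or 'E'
        bool negative = false, negativeExponent = false;

        // One virtual trailing space flushes the last number.
        for (int i = 0; i <= s.Length; i++)
        {
            char ch = i < s.Length ? s[i] : ' ';
            if (ch == ' ')
            {
                if (state != State.Separator)
                {
                    int e = (negativeExponent ? -exponent : exponent) - scale;
                    double value = e >= 0 ? mantissa * Pow10(e) : mantissa / Pow10(-e);
                    result.Add(negative ? -value : value);
                    mantissa = 0; scale = 0; exponent = 0;
                    negative = false; negativeExponent = false;
                    state = State.Separator;
                }
                continue;
            }
            switch (state)
            {
                case State.Separator: // first character of a number
                    state = State.Integer;
                    if (ch == '-') negative = true;
                    else if (ch == '.') state = State.Fraction;
                    else if (ch != '+') mantissa = Digit(ch);
                    break;
                case State.Integer:
                    if (ch == '.') state = State.Fraction;
                    else if (ch == 'e' || ch == 'E') state = State.ExponentStart;
                    else mantissa = mantissa * 10 + Digit(ch);
                    break;
                case State.Fraction:
                    if (ch == 'e' || ch == 'E') state = State.ExponentStart;
                    else { mantissa = mantissa * 10 + Digit(ch); scale++; }
                    break;
                case State.ExponentStart: // optional sign of the exponent
                    state = State.Exponent;
                    if (ch == '-') negativeExponent = true;
                    else if (ch != '+') exponent = Digit(ch);
                    break;
                case State.Exponent:
                    exponent = exponent * 10 + Digit(ch);
                    break;
            }
        }
        return result;
    }

    static int Digit(char ch)
    {
        if (ch < '0' || ch > '9')
            throw new FormatException("unexpected character '" + ch + "'");
        return ch - '0';
    }

    // Power of ten by repeated multiplication.
    static double Pow10(int n)
    {
        double p = 1;
        for (int i = 0; i < n; i++) p *= 10;
        return p;
    }
}
```

The input is scanned exactly once and no intermediate strings are created; the price is that the number format (and its rounding behaviour) is now defined by this code rather than by double.Parse.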


Tags: c# parsing