Parse without string split

This is a spin-off from the discussion in some other question.

Suppose I've got to parse a huge number of very long strings. Each string contains a sequence of doubles (in text representation, of course) separated by whitespace. I need to parse the doubles into a List<double>.

The standard parsing technique (using string.Split + double.TryParse) seems to be quite slow: for each of the numbers we need to allocate a string.

I tried to make it old C-like way: compute the indices of the beginning and the end of substrings containing the numbers, and parse it "in place", without creating additional string. (See http://ideone.com/Op6h0, below shown the relevant part.)

int startIdx, endIdx = 0;
while(true)
{
    startIdx = endIdx;
    // no find_first_not_of in C#
    while (startIdx < s.Length && s[startIdx] == ' ') startIdx++;
    if (startIdx == s.Length) break;
    endIdx = s.IndexOf(' ', startIdx);
    if (endIdx == -1) endIdx = s.Length;
    // how to extract a double here?
}

There is an overload of string.IndexOf, searching only within a given substring, but I failed to find a method for parsing a double from substring, without actually extracting that substring first.

Does anyone have an idea?

标签： c# parsing

2条回答

对你真心纯属浪费

2楼-- · 2020-03-19 03:23

There is no managed API to parse a double from a substring. My guess is that allocating the string will be insignificant compared to all the floating point operations in double.Parse.

Anyway, you can save the allocation by creating a "buffer" string once of length 100 consisting of whitespace only. Then, for every string you want to parse, you copy the chars into this buffer string using unsafe code. You fill the buffer string with whitespace. And for parsing you can use NumberStyles.AllowTrailingWhite which will cause trailing whitespace to be ignored.

Getting a pointer to string is actually a fully supported operation:

    string l_pos = new string(' ', 100); //don't write to a shared string!
    unsafe 
    {
        fixed (char* l_pSrc = l_pos)
        {               
              // do some work
        }
    }

C# has special syntax to bind a string to a char*.

0人赞添加讨论(0) 举报

干净又极端

3楼-- · 2020-03-19 03:29

if you want to do it really fast, i would use a state machine

this could look like:

enum State
{
    Separator, Sign, Mantisse etc.
}
State CurrentState = State.Separator;
int Prefix, Exponent, Mantisse;
foreach(var ch in InputString)
{
    switch(CurrentState)
    { // set new currentstate in dependence of ch and CurrentState
        case Separator:
           GotNewDouble(Prefix, Exponent, Mantisse); 


    }

}

0人赞添加讨论(0) 举报

Parse without string split

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间