String Parsing in C#

Posted 2020-03-04 04:09

Question:

What is the most efficient way to parse a C# string in the form of

"(params (abc 1.3)(sdc 2.0)(www 3.05)....)"

into a struct in the form

struct Params
{
  double abc,sdc,www....;
}

Thanks

EDIT: The structure always has the same parameters (same names, only doubles, known at compile time), but the order is not guaranteed. Only one struct is parsed at a time.

Answer 1:

Depending on your complete grammar, you have a few options. If it's a very simple grammar and you don't have to check it for errors, you could simply go with the code below (which will be fast):

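var input = "(params (abc 1.3)(sdc 2.0)(www 3.05))";
// Split('(') yields: "", "params ", "abc 1.3)", "sdc 2.0)", "www 3.05))"
var tokens = input.Split('(');
var typeName = tokens[1].Trim(); // tokens[0] is empty because the input starts with '('
//you'll need more than the type name (assembly/namespace) so I'll leave that to you
Type t = getStructFromType(typeName);
var obj = TypeDescriptor.CreateInstance(null, t, null, null); // boxed instance of the struct
for(var i = 2;i<tokens.Length;i++)
{
    var innerTokens = tokens[i].Trim(' ', ')').Split(' ');
    var fieldName = innerTokens[0];
    // invariant culture, so "1.3" parses the same way regardless of the machine's locale
    var value = Convert.ToDouble(innerTokens[1], System.Globalization.CultureInfo.InvariantCulture);
    var field = t.GetField(fieldName); // finds public fields only
    field.SetValue(obj, value);
}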

That simple approach, however, requires a well-formed string, or it will misbehave.

If the grammar is a bit more complicated, e.g. nested ( ), then that simple approach won't work.

You could try a regex, but that still requires a rather simple grammar, so if you end up with a complex grammar, your best choice is a real parser. Irony is easy to use since you can write it all in plain C# (some knowledge of BNF is a plus, though).



Answer 2:

using System;

namespace ConsoleApplication1
{
    class Program
    {
        struct Params
        {
            public double abc, sdc;
        };

        static void Main(string[] args)
        {
            string s = "(params (abc 1.3)(sdc 2.0))";
            Params p = new Params();
            object pbox = (object)p; // structs must be boxed for SetValue() to work

            string[] arr = s.Substring(8).Replace(")", "").Split(new char[] { ' ', '(', }, StringSplitOptions.RemoveEmptyEntries);
            for (int i = 0; i < arr.Length; i+=2)
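                // invariant culture, so "1.3" parses the same way regardless of the machine's locale
                typeof(Params).GetField(arr[i]).SetValue(pbox, double.Parse(arr[i + 1], System.Globalization.CultureInfo.InvariantCulture));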
            p = (Params)pbox;
            Console.WriteLine("p.abc={0} p.sdc={1}", p.abc, p.sdc);
        }
    }
}

Note: if you used a class instead of a struct the boxing/unboxing would not be necessary.



Answer 3:

Do you need to support multiple structs? In other words, does this need to be dynamic, or do you know the struct definition at compile time?

Parsing the string with a regex would be the obvious choice.

Here is a regex that will parse your string format:

private static readonly Regex regParser = new Regex(@"^\(params\s(\((?<name>[a-zA-Z]+)\s(?<value>[\d\.]+)\))+\)$", RegexOptions.Compiled);

Running that regex on a string will give you two groups named "name" and "value". The Captures property of each group will contain the names and values.
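
To make the use of the Captures property concrete, here is a minimal sketch that pairs up the captured names and values. The class name RegexParamsParser, the Dictionary return type, and the FormatException on a failed match are assumptions made for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper class, named for illustration only.
public static class RegexParamsParser
{
    // Same pattern as in the answer above.
    private static readonly Regex regParser = new Regex(
        @"^\(params\s(\((?<name>[a-zA-Z]+)\s(?<value>[\d\.]+)\))+\)$",
        RegexOptions.Compiled);

    public static Dictionary<string, double> Parse(string input)
    {
        Match m = regParser.Match(input);
        if (!m.Success)
            throw new FormatException("Input does not match the expected format.");

        // Both groups sit inside the same repeated group, so the i-th name
        // capture belongs to the i-th value capture.
        CaptureCollection names = m.Groups["name"].Captures;
        CaptureCollection values = m.Groups["value"].Captures;

        var result = new Dictionary<string, double>();
        for (int i = 0; i < names.Count; i++)
        {
            // Invariant culture so that "1.3" parses identically on every machine.
            result[names[i].Value] = double.Parse(values[i].Value, CultureInfo.InvariantCulture);
        }
        return result;
    }
}
```

Calling RegexParamsParser.Parse("(params (abc 1.3)(sdc 2.0)(www 3.05))") yields one entry per pair, in whatever order they appear in the input.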

If the struct type is unknown at compile time, then you will need to use reflection to fill in the fields.

If you mean to generate the struct definition at runtime, you will need to use Reflection.Emit to emit the type, or you will need to generate the source code.

Which part are you having trouble with?



Answer 4:

A regex can do the job for you:

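// needs: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
public Dictionary<string, double> ParseString(string input){
    var dict = new Dictionary<string, double>();
    try
    {
        var re = new Regex(@"(?:\(params\s)?(?:\((?<n>[^\s]+)\s(?<v>[^\)]+)\))");
        foreach (Match m in re.Matches(input))
            dict.Add(m.Groups["n"].Value, double.Parse(m.Groups["v"].Value, System.Globalization.CultureInfo.InvariantCulture));
    }
    catch (Exception ex)
    {
        // covers unparsable numbers and duplicate names; keep the original error as the inner exception
        throw new FormatException("Invalid format!", ex);
    }
    return dict;
}

Use it like this: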

string str = "(params (abc 1.3)(sdc 2.0)(www 3.05))";
var parsed = ParseString(str);

// parsed["abc"] would now return 1.3

That might fit better than creating a lot of different structs for every possible input string and using reflection to fill them. I don't think that is worth the effort.

Furthermore I assumed the input string is always in exactly the format you posted.
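
Since the question's edit says the field names are fixed and known at compile time, the dictionary can be copied into the struct without reflection. The following is a minimal sketch under that assumption; the ParamsMapper class, the public fields on Params, and the choice to reject unknown names are all illustrative, not taken from the original answer.

```csharp
using System;
using System.Collections.Generic;

// Assumed shape of the target struct, with public fields for the three names
// used in the question's example.
public struct Params
{
    public double abc, sdc, www;
}

// Hypothetical helper class, named for illustration only.
public static class ParamsMapper
{
    public static Params FromDictionary(IDictionary<string, double> values)
    {
        var p = new Params();
        foreach (KeyValuePair<string, double> kv in values)
        {
            // The order of entries does not matter; each name maps to one field.
            switch (kv.Key)
            {
                case "abc": p.abc = kv.Value; break;
                case "sdc": p.sdc = kv.Value; break;
                case "www": p.www = kv.Value; break;
                default:
                    throw new FormatException("Unknown parameter: " + kv.Key);
            }
        }
        return p;
    }
}
```

Names missing from the dictionary simply leave the corresponding field at 0.0; whether that is acceptable depends on the application.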



Answer 5:

You might consider performing just enough string manipulation to make the input look like standard command line arguments then use an off-the-shelf command line argument parser like NDesk.Options to populate the Params object. You give up some efficiency but you make it up in maintainability.

public Params Parse(string input)
{
    var @params = new Params();
    var argv = ConvertToArgv(input);
    new NDesk.Options.OptionSet
        {
            {"abc=", v => Double.TryParse(v, out @params.abc)},
            {"sdc=", v => Double.TryParse(v, out @params.sdc)},
            {"www=", v => Double.TryParse(v, out @params.www)}
        }
        .Parse(argv);

    return @params;
}

private string[] ConvertToArgv(string input)
{
    return input
        .Replace('(', '-')
        .Split(new[] {')', ' '});
}
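
To see what the option parser actually receives, the conversion can be run on its own. The sketch below repeats the same transformation and adds a variant that drops empty entries; the class name ArgvDemo and the second method are assumptions for illustration, not part of the original answer.

```csharp
using System;

// Hypothetical demo class, named for illustration only.
public static class ArgvDemo
{
    // Same transformation as ConvertToArgv above.
    public static string[] ConvertToArgv(string input)
    {
        return input
            .Replace('(', '-')
            .Split(new[] { ')', ' ' });
    }

    // Variant (an assumption, not from the original answer) that removes the
    // empty strings produced by adjacent delimiters such as "))".
    public static string[] ConvertToArgvNoEmpties(string input)
    {
        return input
            .Replace('(', '-')
            .Split(new[] { ')', ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For "(params (abc 1.3)(sdc 2.0)(www 3.05))" the first method returns -params, -abc, 1.3, -sdc, 2.0, -www, 3.05 and then two empty strings; the second returns only the seven non-empty entries. How a given option parser treats the leading -params entry and any empty strings is worth checking against that library's documentation.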


Answer 6:

Do you want to build a data representation of your defined syntax?

If you are looking for easy maintainability without having to write long regex statements, you could build your own lexer/parser. Here is a prior discussion on SO, with good links in the answers as well, to help you:

Poor man's "lexer" for C#
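
As a rough idea of what such a lexer could look like for this input format, here is a minimal sketch that splits the string into parentheses and atoms. The TokenKind, Token, and SimpleLexer names are made up for illustration and are not taken from the linked discussion.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical token types, named for illustration only.
public enum TokenKind { LeftParen, RightParen, Atom }

public struct Token
{
    public TokenKind Kind;
    public string Text;
}

public static class SimpleLexer
{
    // A token is "(" or ")" or a run of characters that are neither
    // whitespace nor parentheses. Whitespace between tokens is skipped.
    private static readonly Regex tokenPattern = new Regex(@"\(|\)|[^\s()]+");

    public static List<Token> Tokenize(string input)
    {
        var tokens = new List<Token>();
        foreach (Match m in tokenPattern.Matches(input))
        {
            TokenKind kind = m.Value == "(" ? TokenKind.LeftParen
                           : m.Value == ")" ? TokenKind.RightParen
                           : TokenKind.Atom;
            tokens.Add(new Token { Kind = kind, Text = m.Value });
        }
        return tokens;
    }
}
```

A parser would then walk this token list instead of the raw characters, which keeps the grammar rules separate from the details of whitespace handling.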



Answer 7:

I would just do a basic recursive-descent parser. It may be more general than you want, but nothing else will be much faster.
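
For the format in the question, such a parser might look like the sketch below. The grammar in the comments, the ParamsParser class, and the Dictionary result are assumptions chosen for illustration; a version that fills the struct directly would follow the same shape.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical recursive-descent parser, one method per grammar rule.
public sealed class ParamsParser
{
    private readonly string text;
    private int pos;

    private ParamsParser(string text) { this.text = text; }

    // params := '(' "params" pair* ')'
    public static Dictionary<string, double> Parse(string input)
    {
        var p = new ParamsParser(input);
        var result = new Dictionary<string, double>();
        p.Expect('(');
        if (p.ReadAtom() != "params")
            throw new FormatException("Expected 'params'.");
        p.SkipWhitespace();
        while (p.Peek() == '(')
        {
            p.ParsePair(result);
            p.SkipWhitespace();
        }
        p.Expect(')');
        p.SkipWhitespace();
        if (p.pos != p.text.Length)
            throw new FormatException("Unexpected trailing text at position " + p.pos + ".");
        return result;
    }

    // pair := '(' name number ')'
    private void ParsePair(Dictionary<string, double> result)
    {
        Expect('(');
        string name = ReadAtom();
        string number = ReadAtom();
        Expect(')');
        result[name] = double.Parse(number, CultureInfo.InvariantCulture);
    }

    // Returns the current character, or '\0' at the end of the input.
    private char Peek() { return pos < text.Length ? text[pos] : '\0'; }

    private void SkipWhitespace()
    {
        while (pos < text.Length && char.IsWhiteSpace(text[pos])) pos++;
    }

    private void Expect(char c)
    {
        SkipWhitespace();
        if (Peek() != c)
            throw new FormatException("Expected '" + c + "' at position " + pos + ".");
        pos++;
    }

    // atom := one or more characters other than whitespace and parentheses
    private string ReadAtom()
    {
        SkipWhitespace();
        int start = pos;
        while (pos < text.Length && !char.IsWhiteSpace(text[pos])
               && text[pos] != '(' && text[pos] != ')')
            pos++;
        if (pos == start)
            throw new FormatException("Expected a name or number at position " + pos + ".");
        return text.Substring(start, pos - start);
    }
}
```

The parser makes a single pass over the input and reports the position of the first error, which the Split-based and regex-based approaches do not do.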



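Answer 8:

Here's an out-of-the-box approach: rewrite the string as JSON, then use System.Web.Script.Serialization.JavaScriptSerializer.Deserialize. Simply converting ( ) to { } and each space to ":" is not quite enough, because the result has no commas between members and still contains the "params" wrapper, so the replacements below strip the wrapper, quote the names, and add the commas:

// needs: using System.Collections.Generic;
string s = "(params (abc 1.3)(sdc 2.0))";
string body = s.Substring(8, s.Length - 9);   // "(abc 1.3)(sdc 2.0)"
string json = "{" + body
  .Replace(")(", ",\"")   // comma between members, plus the next opening quote
  .Replace("(", "\"")     // opening quote of the first name
  .Replace(")", "")
  .Replace(" ", "\":")    // closing quote and colon after each name
  + "}";                  // {"abc":1.3,"sdc":2.0}

return new System.Web.Script.Serialization.JavaScriptSerializer().Deserialize<Dictionary<string, double>>(json);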