Currently, I have the following C# code to extract a value out of text. If it's XML, I want the value within it; otherwise, if it's not XML, it can just return the text itself.
String data = "...";
try
{
    return XElement.Parse(data).Value;
}
catch (System.Xml.XmlException)
{
    return data;
}
I know exceptions are expensive in C#, so I was wondering if there is a better way to determine whether the text I'm dealing with is XML or not.
I thought of regex testing, but I don't see that as a cheaper alternative. Note that I'm asking for a less expensive method of doing this.
A variation on Colin Burnett's technique: you could do a simple regex at the beginning to see if the text starts with a tag, then try to parse it. Probably >99% of strings you'll deal with that start with a valid element are XML. That way you could skip the regex processing for full-blown valid XML and also skip the exception-based processing in almost every case.
Something like
^<[^>]+>
would probably do the trick.

How about this: take your string or object and toss it into a new XDocument or XElement. Everything resolves using ToString().
I am not sure whether your requirement takes the file format into account, but this question was asked a long time ago and I happened to be searching for something similar, so here is what worked for me, in case it helps anyone who lands here :)
We can use Path.GetExtension(filePath) to check whether the file extension is .xml. If it is, treat the content as XML; otherwise, do whatever else is required.
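A minimal sketch of that extension check, assuming the data comes from a file on disk. The class and method names (`XmlFileCheck`, `HasXmlExtension`) are made up for illustration. The check looks only at the file name and never at the contents, so a badly named file will fool it.

```csharp
using System;
using System.IO;

static class XmlFileCheck
{
    // Hypothetical helper: true when the path ends in ".xml", ignoring case.
    // Only the name is inspected; the file is never opened.
    public static bool HasXmlExtension(string filePath)
    {
        // Path.GetExtension returns the extension including the leading dot,
        // or an empty string when the path has no extension.
        string extension = Path.GetExtension(filePath);
        return string.Equals(extension, ".xml", StringComparison.OrdinalIgnoreCase);
    }
}
```

A caller would then branch on `XmlFileCheck.HasXmlExtension(filePath)` before deciding whether to hand the contents to an XML parser.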
Update: (original post is below) Colin had the brilliant idea of moving the regex instantiation outside of the calls so that the regexes are created only once. Here's the new program:
And here are the new results:
There you have it. Precompiled regexes are the way to go, and pretty efficient to boot.
(original post)
I cobbled together the following program to benchmark the code samples that were provided for this question, both to demonstrate the reasoning behind my post and to evaluate the speed of the provided answers.
Without further ado, here's the program.
And here are the results. Each one was executed 1 million times.
Test 4 took too long: it was still running after 30 minutes, so it was abandoned as too slow. To demonstrate how much slower it was, here is the same test run only 1,000 times.
Extrapolating out to a million executions, it would've taken 3456 seconds, or just over 57 minutes.
This is a good example of why complex regexes are a bad idea if you're looking for efficient code. However, it also showed that a simple regex can still be a good answer in some cases: the small XML 'pre-test' in Colin Burnett's answer created a potentially more expensive base case (the regex was created in case 2), but also a much shorter else case by avoiding the exception.
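For readers who want to see the shape of that pre-test, here is a rough sketch. It is not the benchmarked program. It combines the `^<[^>]+>` pattern suggested earlier in the thread with a regex held in a static field so it is built only once. The class name `XmlValueExtractor`, the method name `ExtractValue`, and the use of `RegexOptions.Compiled` are assumptions made for illustration, and the input is assumed to be non-null.

```csharp
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

static class XmlValueExtractor
{
    // Built once and reused on every call instead of being created per call.
    // RegexOptions.Compiled is an assumption about what "precompiled" meant.
    private static readonly Regex StartsWithTag =
        new Regex(@"^<[^>]+>", RegexOptions.Compiled);

    // Hypothetical helper: returns the element value for XML input,
    // or the input itself for anything else. Assumes data is not null.
    public static string ExtractValue(string data)
    {
        // Cheap pre-test: text that does not open with a tag is returned
        // as-is, so no exception is thrown for plain non-XML text.
        if (!StartsWithTag.IsMatch(data))
        {
            return data;
        }

        try
        {
            return XElement.Parse(data).Value;
        }
        catch (XmlException)
        {
            // It started with a tag but was not well-formed XML.
            return data;
        }
    }
}
```

The try/catch stays in place because the regex only filters out the obvious non-XML cases; the parser is still the one that decides whether the text is well-formed.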
Clue -- all valid XML must start with "<?xml".
You may have to deal with character set differences, but checking plain ASCII, UTF-8 and Unicode will cover 99.5% of the XML out there.
There is no way of validating that the text is XML other than doing something like XElement.Parse. If, for example, the very last close-angle-bracket is missing from the text field, then it's not valid XML, and it's very unlikely that you'll spot this with regex or text parsing. There are a number of illegal characters, illegal sequences, etc. that regex parsing will most likely miss.
All you can hope to do is to short cut your failure cases.
So, if you expect to see lots of non-XML data and the less-expected case is XML then employing RegEx or substring searches to detect angle brackets might save you a little bit of time, but I'd suggest this is only useful if you're batch processing lots of data in a tight loop.
If, instead, this is parsing user-entered data from a web form or a WinForms app, then I think paying the cost of the exception might be better than spending the dev and test effort ensuring that your short-cut code doesn't generate false positive or false negative results.
It's not clear where you're getting your XML from (file, stream, textbox or somewhere else) but remember that whitespace, comments, byte order marks and other stuff can get in the way of simple rules like "it must start with a <".
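One way to short-cut the failure case while allowing for some of that leading noise is sketched below, under stated assumptions: the input is already a string, and only a byte order mark character (U+FEFF) and whitespace are skipped before looking for the first `<`. The names `XmlSniffer`, `IndexOfFirstMeaningfulChar` and `ExtractValue` are invented for illustration. It is a heuristic filter in front of the parser, not a validator.

```csharp
using System.Xml;
using System.Xml.Linq;

static class XmlSniffer
{
    // Hypothetical helper: index of the first character that is neither a
    // byte order mark (U+FEFF) nor whitespace, or -1 if there is none.
    public static int IndexOfFirstMeaningfulChar(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return -1;
        }

        for (int i = 0; i < text.Length; i++)
        {
            // U+FEFF is not classified as whitespace, so test it explicitly.
            if (text[i] != '\uFEFF' && !char.IsWhiteSpace(text[i]))
            {
                return i;
            }
        }

        return -1;
    }

    // Returns the element value when the text parses as XML,
    // otherwise returns the original text unchanged.
    public static string ExtractValue(string data)
    {
        int start = IndexOfFirstMeaningfulChar(data);

        // Short-cut: anything that does not begin with '<' skips the parser,
        // so the exception is never paid for obviously non-XML text.
        if (start < 0 || data[start] != '<')
        {
            return data;
        }

        try
        {
            // Parse from the first '<' so the skipped prefix cannot trip the parser.
            return XElement.Parse(data.Substring(start)).Value;
        }
        catch (XmlException)
        {
            // Started like XML but was not well-formed.
            return data;
        }
    }
}
```

The parser still has the final say, so the exception path remains for text that merely starts with an angle bracket. The check just makes that path rarer.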