I have seen a few similar questions but I am trying to achieve this.
Given a string, str = "The moon is our natural satellite, i.e. it rotates around the Earth!", I want to extract the words and store them in an array. The expected array elements would be:
the
moon
is
our
natural
satellite
i.e.
it
rotates
around
the
earth
I tried using String.Split( ','\t','\r') but this does not work correctly. I also tried removing periods, commas and other punctuation marks, but I want a string like "i.e." to be kept intact too. What is the best way to achieve this? I also tried using Regex.Split, to no avail:
string[] words = Regex.Split(line, @"\W+");
Would surely appreciate some nudges in the right direction.
A regex solution.
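As a rough sketch of what a regex approach could look like (the class and method names here are made up for illustration, and the patterns are one possibility rather than the only one): matching the words with Regex.Matches tends to be easier than splitting on the separators, because Regex.Split with \W+ also returns an empty trailing element when the input ends in punctuation. The second method assumes that an abbreviation is two or more "letter followed by a period" pairs, which covers i.e. and e.g. but not something like etc.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from the original answer.
public static class RegexWordExtractor
{
    // Plain word characters only: "i.e." comes out as "i" and "e".
    public static string[] ExtractPlain(string text) =>
        Regex.Matches(text, @"\w+")
             .Cast<Match>()
             .Select(m => m.Value.ToLowerInvariant())
             .ToArray();

    // Tries dotted abbreviations first (assumed shape: two or more
    // letter-plus-period pairs), then falls back to plain words.
    public static string[] ExtractWithAbbreviations(string text) =>
        Regex.Matches(text, @"(?:\p{L}\.){2,}|\w+")
             .Cast<Match>()
             .Select(m => m.Value.ToLowerInvariant())
             .ToArray();
}
```

Calling RegexWordExtractor.ExtractWithAbbreviations(str) on the string from the question should keep i.e. as a single element and drop the comma and the exclamation mark. Drop the ToLowerInvariant() call if you want the original casing.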
And if you really want to fix that last . on i.e. you could use this.
Here's the code I'm using.
Results:
I suspect the solution you're looking for is much more complex than you think. You're looking for some form of actual language analysis, or at a minimum a dictionary, so that you can determine whether a period is part of a word or ends a sentence. Have you considered the fact that it may do both?
Consider adding a dictionary of allowed "words that contain punctuation." This may be the simplest way to solve your problem.
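To make the dictionary idea concrete, here is a minimal sketch. The class name, the method name and the three entries in the allowed list are all assumptions for illustration; a real list would have to be built for your own text. Each whitespace-separated token is stripped of surrounding punctuation, and a trailing period is put back only when that produces an entry from the list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating the "allowed words" idea.
public static class DictionaryWordExtractor
{
    // Example entries only; extend this set for real input.
    private static readonly HashSet<string> Allowed =
        new HashSet<string> { "i.e.", "e.g.", "etc." };

    public static List<string> Extract(string text)
    {
        var words = new List<string>();
        var separators = new[] { ' ', '\t', '\r', '\n' };
        foreach (string raw in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Strip leading and trailing characters that are not letters or digits.
            int start = 0, end = raw.Length;
            while (start < end && !char.IsLetterOrDigit(raw[start])) start++;
            while (end > start && !char.IsLetterOrDigit(raw[end - 1])) end--;
            if (start == end) continue; // token was punctuation only

            string core = raw.Substring(start, end - start).ToLowerInvariant();

            // Re-attach a period that directly followed the core
            // when the result is a known word, e.g. "i.e" + "." -> "i.e.".
            bool dotFollows = end < raw.Length && raw[end] == '.';
            if (dotFollows && Allowed.Contains(core + ".")) core += ".";

            words.Add(core);
        }
        return words;
    }
}
```

Note that this does not settle the case where the period does both jobs: in "Dogs, cats, etc." the last token comes back as etc. and the fact that the sentence also ended there is lost.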
This works for me.
Results:
You could do some post-processing of the results, removing commas, semicolons, etc.
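A sketch of that kind of post-processing, assuming the first step is a plain split on whitespace (the names and the exact list of trimmed characters below are my own choices, not from the code above). Periods are deliberately left out of the trim list so that i.e. keeps its dots, which also means a sentence-ending period stays attached to its word.

```csharp
using System;
using System.Linq;

// Hypothetical helper: split on whitespace, then trim selected punctuation.
public static class SplitAndTrim
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // No '.' here, so "i.e." survives intact.
    private static readonly char[] Trimmed = { ',', ';', ':', '!', '?', '"', '(', ')' };

    public static string[] Extract(string text) =>
        text.Split(Whitespace, StringSplitOptions.RemoveEmptyEntries)
            .Select(t => t.Trim(Trimmed).ToLowerInvariant())
            .Where(t => t.Length > 0) // drop tokens that were punctuation only
            .ToArray();
}
```

For the string in the question this should give the twelve expected elements, since the only sentence-final mark there is an exclamation mark.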