I'm helpless with regular expressions, so please help me with this problem.
Basically, I am downloading web pages and RSS feeds and want to strip everything except plain words. No periods, commas, ifs, ands, or buts. I literally have a list of the most common words used in English, and I want to strip those too, but I think I know how to do that without a regular expression, because it would be way too long.
How do I strip everything from a chunk of text except words that are delimited by spaces? Everything else goes in the trash.
This works quite well, thanks to Pavel: .split(/[^[:alpha:]]/).uniq!
Try
\b\w+\b
to match whole words.

I think what would fit you best is splitting the string into words. In this case, the
String#split
method would be the better option. It accepts a regexp that matches the separators and splits the source string into array elements at each match. In your case, the separator should be "some non-alphabetic characters". The alphabetic character class is denoted by
[:alpha:]
. So you need to split on the negation of that class, [^[:alpha:]]. You may further filter the result by intersecting the resulting array with an array that contains only English words.
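
A minimal sketch of the split and the intersection described above, not necessarily the exact code from the original answer. The sample text and both word lists (english_words, common_words) are made up for illustration; substitute your own dictionary and your own list of common words.

```ruby
# Hypothetical sample input; in practice this would be the downloaded page text.
text = "Hello, world! The quick brown fox -- the RSS feed, in 2 parts."

# Split on runs of non-alphabetic characters. The "+" collapses adjacent
# separators, and reject(&:empty?) drops the empty string that split
# leaves behind when the text starts with a separator.
words = text.downcase.split(/[^[:alpha:]]+/).reject(&:empty?).uniq

# Hypothetical word lists, standing in for a real dictionary and a real
# list of the most common English words.
english_words = %w[hello world quick brown fox feed parts the in]
common_words  = %w[the in]

# Array#& keeps only the words present in both arrays (intersection).
known = words & english_words

# Array#- removes the common words mentioned in the question.
result = known - common_words

puts result.inspect
```

The sketch uses uniq rather than uniq! because uniq! returns nil when the array has no duplicates, which breaks a method chain. Downcasing first is a design choice so that "The" and "the" count as the same word.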