I have a list of strings that contain banned words. What's an efficient way of checking if a string contains any of the banned words and removing it from the string? At the moment, I have this:
cleaned = String.Join(" ", str.Split().Where(b => !bannedWords.Contains(b,
StringComparer.OrdinalIgnoreCase)).ToArray());
This works fine for single banned words, but not for phrases (e.g. "more than one word"). Any instance of "more than one word"
should also be removed. An alternative I thought of trying is to use the List's Contains method, but that only returns a bool and not an index of the matching word. If I could get an index of the matching word, I could just use String.Replace(bannedWords[i],"");
A simple String.Replace will not work, as it will also remove parts of words. If "sex" is a banned word and you have the word "sextet", which is not banned, you should keep it as is.
Using Regex, you can find whole words and phrases in a text with:
string text = "A sextet is a musical composition for six instruments or voices.";
string word = "sex";
var matches = Regex.Matches(text, @"(?<=\b)" + word + @"(?=\b)");
The matches collection will be empty in this case.
You can use the Regex.Replace method:
foreach (string word in bannedWords) {
text = Regex.Replace(text, @"(?<=\b)" + word + @"(?=\b)", "");
}
Note: I used the following Regex pattern
(?<=prefix)find(?=suffix)
where 'prefix' and 'suffix' are both \b, which denotes word beginnings and ends.
If your banned words or phrases can contain special characters, it would be safer to escape them with Regex.Escape(word).
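As a minimal sketch of the escaped per-word loop (the class and method names `PerWordFilter` and `RemoveBanned` are made up for illustration, and `RegexOptions.IgnoreCase` is an assumption based on the case-insensitive comparison in the question):

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class PerWordFilter
{
    public static string RemoveBanned(string text, IEnumerable<string> bannedWords)
    {
        foreach (string word in bannedWords)
        {
            // Regex.Escape turns characters such as '.' or '+' into literals,
            // so a banned entry "a.b" no longer matches "axb".
            string pattern = @"\b" + Regex.Escape(word) + @"\b";
            text = Regex.Replace(text, pattern, "", RegexOptions.IgnoreCase);
        }
        return text;
    }
}
```

Note that \b only matches between a word character and a non-word character, so this sketch assumes every banned entry starts and ends with a letter, digit or underscore. The removed words leave their surrounding spaces behind.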
Using @zmbq's idea, you could create a Regex pattern once with
string pattern =
@"(?<=\b)(" +
String.Join(
"|",
bannedWords
.Select(w => Regex.Escape(w))
.ToArray()) +
@")(?=\b)";
var regex = new Regex(pattern); // Parsed once and reused; pass RegexOptions.Compiled to compile it to IL
and then apply it repeatedly to different texts with
string result = regex.Replace(text, "");
It doesn't work because you have conflicting definitions.
When you want to look for sub-sentences like "more than one word", you cannot split on whitespace anymore. You'll have to fall back on String.IndexOf().
If it's performance you're after, I assume you're not worried about one-time setup time, but rather about continuous performance. So I'd build one huge regular expression containing all the banned expressions and make sure it's compiled - that's as a setup.
Then I'd try to match it against the text, and replace every match with a blank or whatever you want to replace it with.
The reason for this is that a big regular expression should compile into something comparable to the finite state automaton you would create by hand to handle this problem, so it should run quite nicely.
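A rough sketch of that setup-once, match-many approach might look like this. The class name `BannedPhraseFilter`, the longest-first ordering of the alternatives and the whitespace cleanup are my own assumptions, not something the approach requires:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper: builds one compiled regex from all banned entries.
public sealed class BannedPhraseFilter
{
    private static readonly Regex ExtraSpaces = new Regex(@"\s{2,}");
    private readonly Regex _banned;

    public BannedPhraseFilter(IEnumerable<string> bannedWords)
    {
        // Longest entries first, so "more than one word" is tried before "more".
        string alternatives = string.Join("|",
            bannedWords
                .Where(w => !string.IsNullOrEmpty(w))
                .OrderByDescending(w => w.Length)
                .Select(w => Regex.Escape(w)));

        // One-time setup cost: RegexOptions.Compiled compiles the pattern to IL.
        _banned = new Regex(
            @"\b(?:" + alternatives + @")\b",
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }

    public string Clean(string text)
    {
        string stripped = _banned.Replace(text, "");
        // Collapse the gaps left behind by removed words.
        return ExtraSpaces.Replace(stripped, " ").Trim();
    }
}
```

Create the filter once and call Clean for every incoming text. Whether RegexOptions.Compiled actually pays off depends on how many texts you process, so it is worth measuring.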
Why don't you iterate through the list of banned words and look up each of them in the string by using the method string.IndexOf?
For example, you can remove the banned words and phrases with the following piece of code:
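myForbWords.ForEach(delegate(string item) {
if (item.Length == 0) return; // an empty entry would match forever
int occ;
// Remove every occurrence, not only the first one
while ((occ = str.IndexOf(item, StringComparison.Ordinal)) > -1) str = str.Remove(occ, item.Length);
});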
Type of myForbWords is List<string>.
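Keep in mind that a plain IndexOf lookup also hits word parts such as the "sex" in "sextet". A possible variant that checks the neighbouring characters and ignores case could look like the sketch below. The names `IndexOfFilter` and `RemoveBanned` are hypothetical, and treating letters and digits as word characters is an assumption:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of any library.
public static class IndexOfFilter
{
    public static string RemoveBanned(string str, List<string> banned)
    {
        foreach (string item in banned)
        {
            if (string.IsNullOrEmpty(item)) continue;
            int start = 0;
            while (true)
            {
                int occ = str.IndexOf(item, start, StringComparison.OrdinalIgnoreCase);
                if (occ < 0) break;

                int end = occ + item.Length;
                // Only accept the hit if it is not glued to a letter or digit.
                bool startsWord = occ == 0 || !char.IsLetterOrDigit(str[occ - 1]);
                bool endsWord = end == str.Length || !char.IsLetterOrDigit(str[end]);

                if (startsWord && endsWord)
                {
                    str = str.Remove(occ, item.Length);
                    start = occ;      // continue right where the word was
                }
                else
                {
                    start = occ + 1;  // skip this partial hit
                }
            }
        }
        return str;
    }
}
```

Each removal creates a new string, so for long texts or long banned lists a single regular expression or a StringBuilder is probably the better choice.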