I am writing a script to reverse all genders in a piece of text, so all gendered words are swapped - "man" is swapped with "woman", "she" is swapped with "he", etc. But there is an ambiguity as to whether "her" should be replaced with "him" or "his".
I don’t think so. You could check whether the pronoun is followed by a noun or an adjective and thereby conclude that it is indeed a possessive pronoun. But of course you would have to write a script that is able to do this, and even if you had such a method it would still be wrong in some other cases. A simple pattern matching algorithm won’t help you here.
Good luck with analysing this: http://en.wikipedia.org/wiki/X-bar_theory
Trying to determine whether "her" is a possessive or personal pronoun is harder than trying to determine the class of "him" or "his". However, you would expect both to be used in the same contexts given a large enough corpus. So why not reverse the problem? Take a large corpus and find all occurrences of "him" and "his". Then look at the words surrounding them (just how many words you need to look at is left up to you). With enough training examples, you can estimate the probability that a given set of words in the vicinity of the word indicates "him" or "his". Then you can use those probability estimates on an occurrence of "her" to determine whether you should be using "him" or "his".
As other responses have indicated, you're not going to be perfect. Also, figuring out how big of a neighborhood to use and how to calculate the probabilities is a fair bit of work. You could probably do fairly well using a simple classifier like Naive Bayes.
I suspect, though, you can get a decent bit of accuracy just by looking at patterns in parts of speech and writing some rules. Naturally, you'll miss some, but probably a dozen rules or so will account for the majority of occurrences. I just glanced through about fifty occurrences of "her" in "The Phantom Rickshaw" by Rudyard Kipling and you can easily get 90% accuracy just by the rule:
her_followed_by_noun ? possessive : personal
You can use an off-the-shelf part-of-speech (POS) tagger like the Stanford POS Tagger to automatically determine whether a word is a noun or something else in context. Again, it's not perfect, but it does pretty well.
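As a rough sketch of the "followed by a noun" rule, the snippet below works on tokens that have already been tagged. It assumes the tagger emits Penn Treebank style tags (NN, NNS, NNP, NNPS for nouns); the function name `swap_her_by_next_tag` and the sample sentence are made up for illustration, and no particular tagger is called here.

```python
# Penn Treebank noun tags (assumed tag set; adjust for your tagger).
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def swap_her_by_next_tag(tagged):
    """Replace "her" using the rule: followed by a noun -> "his", else "him".

    tagged is a list of (word, tag) pairs produced by some POS tagger.
    """
    out = []
    for i, (word, _tag) in enumerate(tagged):
        if word.lower() != "her":
            out.append(word)
            continue
        # Tag of the following token, or None at the end of the input.
        next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else None
        new = "his" if next_tag in NOUN_TAGS else "him"
        out.append(new.capitalize() if word[0].isupper() else new)
    return out

# Hand-tagged sample, standing in for real tagger output.
sample = [("I", "PRP"), ("saw", "VBD"), ("her", "PRP$"), ("dog", "NN"),
          ("and", "CC"), ("called", "VBD"), ("her", "PRP"), (".", ".")]
print(" ".join(swap_her_by_next_tag(sample)))
```

The rule as written only looks one token ahead, so something like "her old house" (adjective before the noun) would need an extra rule.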
Edge cases with odd clause structures are hard to get right, but they also occur fairly rarely in most text. It just depends on your data.
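The corpus idea from earlier in this answer (learn from the words around "him" and "his", then apply that to "her") could look something like the sketch below. It is a bare-bones Naive Bayes using only the standard library, with the previous and next token as the whole "neighborhood" and add-one smoothing; the window size, the feature names, and the toy corpus are all assumptions for illustration, and a real corpus would be needed to get meaningful probabilities.

```python
import math
import re
from collections import Counter, defaultdict

LABELS = ("him", "his")

def tokenize(text):
    # Lowercased words plus a few punctuation marks as separate tokens.
    return re.findall(r"[a-z']+|[.,;!?]", text.lower())

def context(tokens, i):
    # Features: the token just before and just after position i.
    prev_tok = tokens[i - 1] if i > 0 else "<s>"
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return ["prev=" + prev_tok, "next=" + next_tok]

def train(tokens):
    label_counts = Counter()
    feat_counts = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if tok in LABELS:
            label_counts[tok] += 1
            feat_counts[tok].update(context(tokens, i))
    vocab = {f for counts in feat_counts.values() for f in counts}
    return label_counts, feat_counts, vocab

def classify(model, tokens, i):
    """Return "him" or "his" for the "her" at tokens[i]."""
    label_counts, feat_counts, vocab = model
    total = sum(label_counts.values())
    best, best_score = None, -math.inf
    for label in LABELS:
        if label_counts[label] == 0:
            continue
        # log P(label) + sum of log P(feature | label), add-one smoothed,
        # with one extra slot in the denominator for unseen features.
        score = math.log(label_counts[label] / total)
        denom = sum(feat_counts[label].values()) + len(vocab) + 1
        for feat in context(tokens, i):
            score += math.log((feat_counts[label][feat] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

# Toy training text, far too small for real use.
corpus = ("I gave him the book. She saw him. It was his book. His dog ran. "
          "They told him, and his hat fell. We like his car.")
model = train(tokenize(corpus))

test_tokens = tokenize("She saw her.")
print(classify(model, test_tokens, test_tokens.index("her")))
```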
OK, based on some of the answers people gave I've got a better idea of how to approach this. Instead of trying to write a script that gets this right 100% of the time I'll just aim to get it right as often as possible. A quick search through some English-language texts shows that "his" appears (very roughly) twice as often as "him", so the default behaviour should be to convert "her" to "his". If I did this and nothing else it should be right about two thirds of the time.
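For anyone who wants to repeat that count on their own texts, here is a minimal sketch. The helper name `him_his_counts` and the sample sentence are made up; run it over a real corpus to get a ratio worth trusting.

```python
import re
from collections import Counter

def him_his_counts(text):
    """Return (count of "him", count of "his") in text, case-insensitively."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w in ("him", "his"))
    return counts["him"], counts["his"]

# Invented sample; substitute the contents of your own text files.
sample = "He took his hat and his coat; she thanked him."
him, his = him_his_counts(sample)

# Fraction of cases where defaulting to "his" would be the right choice.
share = his / (him + his)
print(him, his, share)
```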
Now I'm not interested in finding patterns that would show "her" should be converted to "his", since this is what I would do anyway. I'm only interested in finding patterns that would show "her" should be converted to "him", since these would allow me to lower the error rate. There are two rules I can implement fairly painlessly:
If "her" is followed immediately by a comma or period, it should be converted to "him", as Michael Itzoe said.
If "her" occurs immediately after a preposition, it should be treated as a noun (an object, not a possessive) and replaced with "him", as Rasher said.
And I'll be able to do more than that if I use Part of Speech tagging software. I think I'll get on with doing the easy stuff first :-)
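Here is roughly what the default plus those two rules might look like in Python. This is only a sketch: the preposition list is a short, incomplete sample I picked for illustration, and the function name `swap_her` is hypothetical.

```python
import re

# Small, incomplete preposition list, for illustration only.
PREPOSITIONS = {"to", "for", "with", "at", "from", "by", "about",
                "of", "on", "near", "behind", "beside"}

def swap_her(text):
    # Split into alternating runs of word characters and non-word characters,
    # so joining the tokens back together reproduces the original spacing.
    tokens = re.findall(r"\w+|\W+", text)
    for i, tok in enumerate(tokens):
        if tok.lower() != "her":
            continue
        # Previous word counts only if nothing but whitespace separates it.
        prev_word = ""
        if i >= 2 and tokens[i - 1].isspace():
            prev_word = tokens[i - 2].lower()
        next_gap = tokens[i + 1] if i + 1 < len(tokens) else ""
        if next_gap[:1] in (",", ".") or prev_word in PREPOSITIONS:
            new = "him"   # rule 1 (comma/period) or rule 2 (preposition)
        else:
            new = "his"   # default: the more frequent form
        tokens[i] = new.capitalize() if tok[0].isupper() else new
    return "".join(tokens)

print(swap_her("I gave it to her. Her dog likes her, and she walks with her daily."))
```

One thing to watch: the preposition rule misfires on phrases like "to her house", which it turns into "to him house", so it trades one kind of error for another.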
G'day,
This is one of those cases where you could invest an inordinate amount of time tracking down the automatic solution and finish up with a result that you're going to have to check through anyway.
I'd suggest making your script insert a piece of text that will really stand out at every instance of "her" and would be easily searchable. Maybe even make the script insert both "him" and "his" strings so that you only need to delete one of them after you've seen the context?
You're going to save a lot of time and effort this way. Not to mention blood, sweat and tears even! (-:
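A minimal sketch of that marker idea, assuming a marker format of `[[him|his]]` (the format and the function name `flag_her` are just placeholders; pick anything that is easy to search for):

```python
import re

def flag_her(text):
    """Replace each "her" with a marker offering both candidates."""
    def marker(match):
        # Keep the capitalisation of the original word.
        if match.group()[0].isupper():
            return "[[Him|His]]"
        return "[[him|his]]"
    return re.sub(r"\bher\b", marker, text, flags=re.IGNORECASE)

print(flag_her("She saw her. Her hat fell."))
```

After running it, search the output for "[[" and delete the wrong alternative at each hit.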
Coming up with a fully automatic solution is no mean feat as it will involve scanning a massive corpus of words to determine if the following word is an object.
Sometimes gaining that extra 5 or 10 percent improvement is just not worth the extra effort involved. Except of course as an "it is left as an interesting exercise for the reader..." type problem that some text books seem to love.
Edit: I forgot to mention that finding this "tipping point" is a true art. Definitely one skill that only comes with experience. (-:
Edit: Part II - The Revenge. I also forgot to mention that you can eliminate one edge case though. If the word "her" is followed by punctuation, e.g. "... to her.", "... for her," etc., then you can eliminate the uncertainty for those cases and just replace them with "him". Similarly, if the word is followed by certain classes of words, e.g. "... for her to", the "her" can easily be replaced with "him".
Edit 3: This is not a full list of exceptions but is merely intended as a suggestion for a starting point of the list of items you'll need to look for.
HTH
I will address regex, since that is one of the tags. Regular expressions are insufficiently powerful for parsing human language, because regex does not do recursion, and all human languages are recursive.
When this fact is combined with the other ambiguities in English, such as the way many words can serve multiple functions in a sentence, I think that a reliable automated solution will be a very difficult and costly project.
Given the scope of your project: reversing all gender-related words, it appears that :
Furthermore, regex seems a poor choice of tool too; natural language is just not a regular language ;-).
Instead, you should consider introducing Part-of-Speech (POS) tagging, possibly with a hint of Named Entity Recognition, and then apply substitution rules based on the extra info the tagging supplied.
This may seem like a lot of work, but if for example your scripting language happens to be Python, you can leverage NLTK to implement all this with relatively little effort.
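To give an idea of what "substitution rules based on the tagging" could mean, here is a small sketch. It assumes the tagger uses the Penn Treebank tag set, where possessive "her" is tagged PRP$ and personal "her" is tagged PRP. The `SWAPS` table and the function name `swap_genders` are made up for illustration, the table is nowhere near complete, and the input is hand-tagged rather than produced by a real tagger.

```python
# Keys are (lowercased word, tag); a tag of None means "any tag".
# Illustrative, incomplete table; predicative "his" ("the book is his",
# which should become "hers") is not handled.
SWAPS = {
    ("her", "PRP$"): "his",
    ("her", "PRP"): "him",
    ("his", None): "her",
    ("him", None): "her",
    ("he", None): "she",
    ("she", None): "he",
    ("man", None): "woman",
    ("woman", None): "man",
}

def swap_genders(tagged):
    """tagged: list of (word, tag) pairs from a POS tagger."""
    out = []
    for word, tag in tagged:
        key = word.lower()
        # Prefer a tag-specific rule, then fall back to a tag-independent one.
        new = SWAPS.get((key, tag)) or SWAPS.get((key, None))
        if new is None:
            out.append(word)
        else:
            out.append(new.capitalize() if word[0].isupper() else new)
    return out

# Hand-tagged sample standing in for tagger output.
sample = [("She", "PRP"), ("gave", "VBD"), ("her", "PRP$"), ("book", "NN"),
          ("to", "TO"), ("him", "PRP"), (".", ".")]
print(" ".join(swap_genders(sample)))
```

The accuracy of this approach is only as good as the tagger's PRP versus PRP$ decisions, so the hard part is delegated rather than removed.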