Is there a fairly simple way for a script to tell

Posted 2019-02-06 22:59 by 男人必须洒脱

Question:

I am writing a script to reverse all genders in a piece of text, so all gendered words are swapped - "man" is swapped with "woman", "she" is swapped with "he", etc. But there is an ambiguity as to whether "her" should be replaced with "him" or "his".

Answer 1:

Okay. Let's look at this like a linguist might. I am thinking aloud here.

"Her" is a pronoun. It can either be a:

1. possessive pronoun

This is her book.

2. personal pronoun

Give it to her. (after preposition)

He wrote her a letter. (indirect object)

He treated her for a cold. (direct object)

So let's look at case (1), the possessive pronoun. That is, it is a pronoun in the "genitive" case (meaning it is a noun that is being "possessive". Okay, that detail isn't quite as important as the next one.)

In this case, "her" is acting as a "determiner". Determiners may occur in two places in a sentence (this is a simplification):

Det + Noun ("her book")

Det + Adj + Noun ("her nice book")

So to figure out if her is a determiner, you could have this logic:

a. If the word following "her" is a noun, then "her" is a determiner.

b. If the two words following "her" are an adjective and then a noun, then "her" is a determiner.

And if you establish that "her" is a determiner, then you know that you must replace it with "his", which is also a determiner (aka genitive noun, aka possessive pronoun).

If it doesn't match criteria (a) and (b) above, then you could possibly conclude that it is not a determiner, which means it must be a personal pronoun. In that case, you would replace "her" with "him".
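Rules (a) and (b) can be sketched in a few lines of Python. This is only an illustration: the NOUNS and ADJECTIVES sets are tiny hand-made stand-ins for a real dictionary, the function names are made up, and the tokenization (lowercase, split on spaces, drop a trailing period) is deliberately naive.

```python
# Toy lexicon (an assumption for this sketch); a real script needs a proper
# word list or a part-of-speech tagger.
NOUNS = {"book", "letter", "cold", "house"}
ADJECTIVES = {"nice", "old", "red"}


def replace_her(words, i):
    """Return "his" or "him" for the "her" at index i of a list of lowercase words."""
    nxt = words[i + 1] if i + 1 < len(words) else ""
    nxt2 = words[i + 2] if i + 2 < len(words) else ""
    # Rule (a): Det + Noun
    if nxt in NOUNS:
        return "his"
    # Rule (b): Det + Adj + Noun
    if nxt in ADJECTIVES and nxt2 in NOUNS:
        return "his"
    # Neither rule matched: treat "her" as a personal pronoun.
    return "him"


def swap_her(sentence):
    """Lowercase the sentence, drop a trailing period, and swap each "her"."""
    words = sentence.lower().rstrip(".").split()
    return " ".join(
        replace_her(words, i) if w == "her" else w for i, w in enumerate(words)
    )


print(swap_her("This is her nice book."))  # this is his nice book
print(swap_her("He wrote her a letter."))  # he wrote him a letter
```

The quality of the result depends entirely on the word lists: any noun missing from NOUNS makes a possessive "her" fall through to "him".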

You wouldn't even have to do the tests below, but I'll try to describe them anyway.

Looking at (2) from above: personal pronoun, rather than possessive. This gets trickier.

The examples above show "her" occurring in 3 ways:

(1) Give it to her. (after preposition. we call this the "object of a preposition".)

So you could maybe devise a rule: "If 'her' occurs immediately after a preposition, then it should be treated as a noun, so we would replace it with 'him'".

The next two are tricky. "her" can either be a direct object or an indirect object.

(2) He wrote her a letter. (indirect object)

(3) He treated her for a cold. (direct object)

Syntactically, how can we tell the difference?

A direct object occurs immediately after a verb.

If you have a verb, followed by a noun, then that noun is a direct object. eg:

He treated her.

If you have a verb, followed by a noun, followed by a prepositional phrase, then the noun is a direct object.

He treated her for a cold. ("her" is a noun, and it comes immediately after the verb "treated". "for a cold" is a prepositional phrase.)

Which means that you could say: if you have "Verb + Noun + Prep", then the noun is a direct object. Since the noun is a direct object, it is a personal pronoun, so use "him". (Note: you only have to check for a preposition, not the entire prepositional phrase, since the phrase will always begin with a preposition.)

If it is an indirect object, then you'll have the form "verb + noun + noun".

He wrote her a letter. ("her" is a noun, "letter" is a noun. Well, "a letter" is a "noun phrase", so you'd have to account for determiners as well.)

So... if "her" is a direct object, indirect object, or obj of prep, you could change it to "him", otherwise, change it to "his".

This method seems a lot more complicated, so I'd just start by checking whether "her" is a determiner (see above): if it is a determiner, use "his"; otherwise, just use "him".

So, the above has a lot of simplifications. It doesn't cover "interrupting phrases", or clause structures, or constituency tests, or embedded clauses, or punctuation, or anything like that.

Also, this solution requires a dictionary - a list of "nouns" and "verbs" and "prepositions" so that you can determine the lexical category of each word in the sentence.

And even there, man, natural language processing is hard. You'd want to do some sort of "training" for your model to have a good solution. BUT for very simple things, try some of the stuff described above.

Sorry for being so verbose! (None of the existing answers gave any hard data, or precise linguistic definitions, so here goes.)

Answer 2:

Given the scope of your project (reversing all gender-related words), it appears that:

The "investment" in a more fundamental approach would be justified.
No heuristic based on simple lookup/substitution will adequately serve all or even most cases.

Furthermore, regex too seems a poor choice of tool; natural language is just not a regular language ;-).

Instead, you should consider introducing Part-of-Speech (POS) tagging, possibly with a hint of Named Entity Recognition, and then apply substitution rules based on the extra info the tagging supplied.

This may seem like a lot of work, but if for example your scripting language happens to be Python, you can leverage NLTK to implement all this with a relatively small effort.
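As a rough sketch of the substitution step once tagging is done: the function below takes (word, tag) pairs, which is the shape that nltk.pos_tag returns for a tokenized sentence, and uses the Penn Treebank tags PRP$ (possessive pronoun) and PRP (personal pronoun) to choose the replacement. The function name and the hand-tagged input are assumptions made for illustration; the tags here were written by hand, not produced by a tagger.

```python
# Penn Treebank tags: PRP$ = possessive pronoun, PRP = personal pronoun.
HER_BY_TAG = {"PRP$": "his", "PRP": "him"}


def swap_her_tagged(tagged_tokens):
    """Replace "her" according to the tag a POS tagger assigned to it.

    tagged_tokens is a list of (word, tag) pairs.
    """
    out = []
    for word, tag in tagged_tokens:
        if word.lower() == "her" and tag in HER_BY_TAG:
            repl = HER_BY_TAG[tag]
            # Preserve a leading capital, e.g. "Her" -> "His".
            out.append(repl.capitalize() if word[0].isupper() else repl)
        else:
            out.append(word)
    return out


# Hand-tagged example input (not actual tagger output).
tagged = [("This", "DT"), ("is", "VBZ"), ("her", "PRP$"), ("book", "NN")]
print(swap_her_tagged(tagged))  # ['This', 'is', 'his', 'book']
```

A "her" that the tagger labels with some other tag is left untouched, which makes tagging mistakes easy to spot afterwards.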

Answer 3:

G'day,

This is one of those cases where you could invest an inordinate amount of time tracking down the automatic solution and finish up with a result that you're going to have to check through anyway.

I'd suggest making your script insert a piece of text that will really stand out at every instance of "her" and would be easily searchable. Maybe even make the script insert both "him" and "his" strings so that you only need to delete one of them after you've seen the context?
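A minimal sketch of that idea in Python, assuming a marker string of my own choosing (anything that cannot occur in the text will do):

```python
import re

# Hypothetical marker; pick any string that never appears in your text.
MARKER = "<<HIM|HIS>>"


def mark_her(text):
    """Replace each standalone "her" with a marker for manual review."""
    # \b keeps words such as "there" or "hers" from matching.
    return re.sub(r"\bher\b", MARKER, text, flags=re.IGNORECASE)


print(mark_her("Give her her book."))  # Give <<HIM|HIS>> <<HIM|HIS>> book.
```

You can then search for the marker and delete whichever half does not fit the context.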

You're going to save a lot of time and effort this way. Not to mention blood, sweat and tears even! (-:

Coming up with a fully automatic solution is no mean feat as it will involve scanning a massive corpus of words to determine if the following word is an object.

Sometimes gaining that extra 5 or 10 percent improvement is just not worth the extra effort involved. Except of course as an "it is left as an interesting exercise for the reader..." type problem that some text books seem to love.

Edit: I forgot to mention that finding this "tipping point" is a true art. Definitely one skill that only comes with experience. (-:

Edit: Part II - The Revenge. I also forgot to mention that you can eliminate one edge case though. If the word "her" is followed by punctuation, e.g. "... to her.", "... for her,", etc., then you can eliminate the uncertainty for those cases and just replace them with "him". Similarly, if the word is followed by a certain class of words, e.g. "... for her to", the "her" can easily be replaced with "him".

Edit 3: This is not a full list of exceptions but is merely intended as a suggestion for a starting point of the list of items you'll need to look for.

HTH

Answer 4:

Trying to determine whether her is a possessive or personal pronoun is harder than trying to determine the class of him or his. However, you would expect both to be used in the same contexts given a large enough corpus. So why not reverse the problem? Take a large corpus and find all occurrences of him and his. Then look at the words surrounding them (just how many words you need to look at is left up to you). With enough training examples, you can estimate the probability that a given set of words in the vicinity of the word indicates him or his. Then you can use those probability estimates on an occurrence of her to determine whether you should be using him or his. As other responses have indicated, you're not going to be perfect. Also, figuring out how big of a neighborhood to use and how to calculate the probabilities is a fair bit of work. You could probably do fairly well using a simple classifier like Naive Bayes.
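Here is one way the Naive Bayes idea might look, as a toy sketch only. The corpus is a made-up string of a few sentences, the window is one word on each side, and the function names and the "L:"/"R:" feature prefixes are my own choices; a usable model would need a far larger corpus and some experimentation with the window size.

```python
import math
from collections import Counter, defaultdict


def contexts(words, targets, window=1):
    """Yield (target, context features) for each target word in the list."""
    for i, w in enumerate(words):
        if w in targets:
            left = words[max(0, i - window):i]
            right = words[i + 1:i + 1 + window]
            # Prefix features so left and right neighbours stay distinct.
            yield w, ["L:" + x for x in left] + ["R:" + x for x in right]


def train(words):
    """Count context features around "him" and "his"."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    for label, feats in contexts(words, {"him", "his"}):
        class_counts[label] += 1
        feature_counts[label].update(feats)
    return class_counts, feature_counts


def classify(feats, class_counts, feature_counts):
    """Pick the label with the highest Naive Bayes log-probability."""
    vocab = {f for c in feature_counts.values() for f in c}
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        denom = sum(feature_counts[label].values()) + len(vocab) + 1
        for f in feats:
            # Add-one smoothing so unseen features do not zero the product.
            score += math.log((feature_counts[label][f] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best


# Made-up training text, already tokenized by splitting on spaces.
corpus = ("give it to him . this is his book . she saw him . "
          "he read his letter . talk to him . she took his book .").split()
class_counts, feature_counts = train(corpus)

for sentence in ("give it to her .", "this is her book ."):
    _, feats = next(contexts(sentence.split(), {"her"}))
    print(sentence, "->", classify(feats, class_counts, feature_counts))
```

On this toy data the first sentence comes out as "him" and the second as "his", but that says nothing about accuracy on real text.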

I suspect, though, you can get a decent bit of accuracy just by looking at patterns in parts of speech and writing some rules. Naturally, you'll miss some, but probably a dozen rules or so will account for the majority of occurrences. I just glanced through about fifty occurrences of her in "The Phantom Rickshaw" by Rudyard Kipling and you can easily get 90% accuracy just by the rule:

her_followed_by_noun ? possessive : personal

You can use an off-the-shelf part-of-speech (POS) tagger like the Stanford POS Tagger to automatically determine whether a word is a noun or something else in context. Again, it's not perfect, but it does pretty well.

Edge cases with odd clause structures are hard to get right, but they also occur fairly rarely in most text. It just depends on your data.

Answer 5:

I don’t think so. You could check if the possessive pronoun is followed by a noun or an adjective and thereby conclude that it is indeed a possessive pronoun. But of course you would have to write a script that is able to do this, and even if you had a method it would still be wrong in some other cases. A simple pattern matching algorithm won’t help you here.

Good luck with analysing this: http://en.wikipedia.org/wiki/X-bar_theory

Answer 6:

Definitely no. You would have to do syntactic analysis on your input text (parsing the English language, really; that's where the word “to parse” comes from). That's the only way you can determine with certainty what the “her” in your text stands for; you can't rely on search-and-replace. There are many ways to do that, but none would qualify as “fairly simple”, I think.

Answer 7:

I will address regex, since that is one of the tags. Regular expressions are insufficiently powerful for parsing human language, because regex does not do recursion, and all human languages are recursive.

When this fact is combined with the other ambiguities in English, such as the way many words can serve multiple functions in a sentence, I think that a reliable automated solution will be a very difficult and costly project.

Answer 8:

About the only one I can think of (and I'm sure someone in the comments will prove me wrong!) is that any instance of her followed by punctuation can most probably be replaced with him. But I still agree with the previous answers that you're probably best off doing a manual replace.

Answer 9:

OK, based on some of the answers people gave I've got a better idea of how to approach this. Instead of trying to write a script that gets this right 100% of the time I'll just aim to get it right as often as possible. A quick search through some English-language texts shows that "his" appears (very roughly) twice as often as "him", so the default behaviour should be to convert "her" to "his". If I did this and nothing else it should be right about two thirds of the time.
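The "two thirds" figure follows from the counts: if "his" appears twice as often as "him", then "his" makes up 2 / (2 + 1) of the occurrences. A small sketch for checking this ratio on your own text, assuming (as above) that "her" splits between its two uses the same way "his" and "him" do; the function name is made up:

```python
import re
from collections import Counter


def baseline_accuracy(text):
    """Estimate how often defaulting to "his" would be right, from him/his counts."""
    counts = Counter(re.findall(r"\b(him|his)\b", text.lower()))
    total = counts["him"] + counts["his"]
    if total == 0:
        return None  # no evidence either way
    return counts["his"] / total


sample = "He took his hat and his coat, and she followed him."
print(baseline_accuracy(sample))  # 2 of the 3 matches are "his"
```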

Now I'm not interested in finding patterns that would show "her" should be converted to "his", since this is what I would do anyway. I'm only interested in finding patterns that would show "her" should be converted to "him", since these would allow me to lower the error rate. There are two rules I can implement fairly painlessly:

If "her" is followed immediately by a comma or period, it should be converted to "him", as Michael Itzoe said.
If "her" occurs immediately after a preposition, then it should be treated as a noun, so we would replace it with "him", as Rasher said.
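Put together, the default plus these two rules might look like the sketch below. The preposition list is a short hypothetical sample, the tokenizer is a simple regex of my own, and the function returns a list of tokens rather than rebuilding the original spacing.

```python
import re

# Small hypothetical preposition list; extend as needed.
PREPOSITIONS = {"to", "for", "with", "from", "at", "by", "of", "on", "in", "about"}

# Words (letters and apostrophes) or single punctuation characters.
TOKEN = re.compile(r"[A-Za-z']+|[^A-Za-z'\s]")


def swap_her_rules(text):
    """Return the tokens of text with each "her" swapped for "his" or "him"."""
    tokens = TOKEN.findall(text)
    out = []
    for i, tok in enumerate(tokens):
        if tok.lower() != "her":
            out.append(tok)
            continue
        prev = tokens[i - 1].lower() if i > 0 else ""
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if nxt in {",", "."} or prev in PREPOSITIONS:
            repl = "him"  # rule 1 (punctuation) or rule 2 (preposition)
        else:
            repl = "his"  # default, right roughly two thirds of the time
        # Preserve a leading capital, e.g. "Her" -> "His".
        out.append(repl.capitalize() if tok[0].isupper() else repl)
    return out


print(swap_her_rules("Give it to her."))    # ['Give', 'it', 'to', 'him', '.']
print(swap_her_rules("This is her book."))  # ['This', 'is', 'his', 'book', '.']
```

Note that the preposition rule misfires on phrases like "to her book", where "her" is possessive despite following a preposition, so it is a trade-off rather than a guaranteed improvement.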

And I'll be able to do more than that if I use Part of Speech tagging software. I think I'll get on with doing the easy stuff first :-)