Is there a fairly simple way for a script to tell

Reply #2 · 2019-02-06 22:46

I don't think so. You could check whether the pronoun is followed by a noun or an adjective and conclude from that that it is indeed a possessive pronoun. But you would have to write a script that can do this, and even with such a method it would still be wrong in some cases. A simple pattern-matching algorithm won't help you here.

Good luck with analysing this: http://en.wikipedia.org/wiki/X-bar_theory

别忘想泡老子

Reply #3 · 2019-02-06 22:49

Trying to determine whether her is a possessive or personal pronoun is harder than trying to determine the class of him or his. However, you would expect both to be used in the same contexts given a large enough corpus. So why not reverse the problem? Take a large corpus and find all occurrences of him and his. Then look at the words surrounding them (just how many words you need to look at is left up to you). With enough training examples, you can estimate the probability that a given set of words in the vicinity of the word indicates him or his. Then you can use those probability estimates on an occurrence of her to determine whether you should be using him or his. As other responses have indicated, you're not going to be perfect. Also, figuring out how big of a neighborhood to use and how to calculate the probabilities is a fair bit of work. You could probably do fairly well using a simple classifier like Naive Bayes.
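For what it's worth, here is a minimal sketch of the him/his idea with a hand-rolled Naive Bayes, not a definitive implementation. Everything in it is an assumption made for illustration: the tiny `TOY_CORPUS`, the one-word window on each side, the add-one smoothing, and all the function names. A real run would need a far larger corpus.

```python
import math
import re
from collections import Counter

# Training labels come for free from unambiguous masculine pronouns.
TARGETS = {"him": "personal", "his": "possessive"}

# Hypothetical toy corpus, far too small for real use.
TOY_CORPUS = (
    "She gave him the book. I saw him. He lost his keys. "
    "We met his brother. They told him. She took his hat. "
    "I wrote to him. He found his dog."
)

def tokenize(text):
    # Lowercased words, with punctuation marks kept as separate tokens.
    return re.findall(r"[a-z']+|[.,;:!?]", text.lower())

def context_features(tokens, i, window=1):
    # One feature per neighbouring position, e.g. "-1:to" or "1:.".
    feats = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        word = tokens[j] if 0 <= j < len(tokens) else "<pad>"
        feats.append(f"{offset}:{word}")
    return feats

def train(corpus, window=1):
    label_counts = Counter()
    feat_counts = {label: Counter() for label in TARGETS.values()}
    vocab = set()
    tokens = tokenize(corpus)
    for i, tok in enumerate(tokens):
        if tok in TARGETS:
            label = TARGETS[tok]
            label_counts[label] += 1
            for feat in context_features(tokens, i, window):
                feat_counts[label][feat] += 1
                vocab.add(feat)
    return label_counts, feat_counts, vocab

def classify(model, tokens, i, window=1):
    # Pick the label with the highest log prior plus summed log likelihoods.
    label_counts, feat_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        # Add-one smoothing, with one extra slot for unseen features.
        denom = sum(feat_counts[label].values()) + len(vocab) + 1
        for feat in context_features(tokens, i, window):
            score += math.log((feat_counts[label][feat] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TOY_CORPUS)
sentence = tokenize("I spoke to her.")
print(classify(model, sentence, sentence.index("her")))
```

The window size is the "neighborhood" knob mentioned above; widening it gives more evidence per occurrence but needs more training data to estimate the probabilities well.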

I suspect, though, you can get a decent bit of accuracy just by looking at patterns in parts of speech and writing some rules. Naturally, you'll miss some, but probably a dozen rules or so will account for the majority of occurrences. I just glanced through about fifty occurrences of her in "The Phantom Rickshaw" by Rudyard Kipling and you can easily get 90% accuracy just by the rule:

her_followed_by_noun ? possessive : personal

You can use an off-the-shelf part-of-speech (POS) tagger like the Stanford POS Tagger to automatically determine whether a word is a noun or something else in context. Again, it's not perfect, but it does pretty well.
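As a rough sketch, the one-line rule could look like this in Python once the text has been tagged. It assumes the tagger emits Penn Treebank tags (the tagset the Stanford tagger's English models use) and that the output has been converted into (word, tag) pairs; `NOUN_TAGS` and `classify_her` are names made up for the example.

```python
# Penn Treebank noun tags: singular, plural, proper singular, proper plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def classify_her(tagged, i):
    """Apply the rule to the 'her' at index i of a list of (word, tag) pairs."""
    # Look only at the tag of the next token; no next token means no noun.
    next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else None
    return "possessive" if next_tag in NOUN_TAGS else "personal"

# Hand-tagged example standing in for real tagger output.
tagged = [("She", "PRP"), ("lost", "VBD"), ("her", "PRP$"), ("keys", "NNS"), (".", ".")]
print(classify_her(tagged, 2))
```

The rule deliberately ignores the tag on "her" itself and will miss cases such as an adjective sitting between "her" and the noun, which is part of why it is not perfect.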

Edge cases with odd clause structures are hard to get right, but they also occur fairly rarely in most text. It just depends on your data.

Reply #4 · 2019-02-06 22:50

OK, based on some of the answers people gave I've got a better idea of how to approach this. Instead of trying to write a script that gets this right 100% of the time I'll just aim to get it right as often as possible. A quick search through some English-language texts shows that "his" appears (very roughly) twice as often as "him", so the default behaviour should be to convert "her" to "his". If I did this and nothing else it should be right about two thirds of the time.
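A quick way to check that ratio on your own texts might look like the sketch below. The function name `his_share` is made up here, and the sample sentence is only a stand-in for a real corpus. The returned fraction is the expected accuracy of always converting "her" to "his", assuming "her" splits between the two roles the same way "his" and "him" do.

```python
import re
from collections import Counter

def his_share(text):
    """Fraction of his/him occurrences that are 'his', or None if there are none."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w in ("his", "him"))
    total = counts["his"] + counts["him"]
    return counts["his"] / total if total else None

sample = "He took his hat and his coat; I saw him."
print(his_share(sample))
```

With a 2:1 ratio the share is 2 / (2 + 1), which is where the "about two thirds" figure comes from.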

I'm not interested in finding patterns that show "her" should be converted to "his", since that is what I would do anyway. I'm only interested in patterns that show "her" should be converted to "him", since these would let me lower the error rate. There are two rules I can implement fairly painlessly:

If "her" is followed immediately by a comma or period, it should be converted to "him", as Michael Itzoe said.
If "her" occurs immediately after a preposition, it is the object of that preposition and should be converted to "him", as Rasher said.

And I'll be able to do more than that if I use Part of Speech tagging software. I think I'll get on with doing the easy stuff first :-)
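Roughly, the default plus the two rules could be sketched like this. The preposition list is a short hypothetical sample, not a complete one, and the names `PREPOSITIONS`, `match_case` and `replace_her` are invented for the example.

```python
import re

# Hypothetical, deliberately short preposition list; extend as needed.
PREPOSITIONS = {"to", "for", "with", "from", "at", "by", "of", "about", "on", "in"}

HER_RE = re.compile(r"\bher\b", re.IGNORECASE)

def match_case(source, replacement):
    # Carry the capitalisation of the original word over to the replacement.
    if source.isupper():
        return replacement.upper()
    if source[0].isupper():
        return replacement.capitalize()
    return replacement

def replace_her(text):
    def choose(match):
        after = text[match.end():]
        prev = re.search(r"([A-Za-z']+)\s+$", text[:match.start()])
        prev_word = prev.group(1).lower() if prev else ""
        if after[:1] in {",", "."}:
            target = "him"  # rule 1: immediately followed by a comma or period
        elif prev_word in PREPOSITIONS:
            target = "him"  # rule 2: immediately preceded by a preposition
        else:
            target = "his"  # default: the more frequent form
        return match_case(match.group(), target)
    return HER_RE.sub(choose, text)

print(replace_her("Her dog ran to her quickly."))
```

Sentences like "I saw her yesterday" still come out wrong with these rules, which is expected: they only lower the error rate of the default.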

Rolldiameter

Reply #5 · 2019-02-06 22:51

G'day,

This is one of those cases where you could invest an inordinate amount of time tracking down the automatic solution and finish up with a result that you're going to have to check through anyway.

I'd suggest making your script insert a piece of text that will really stand out at every instance of "her" and would be easily searchable. Maybe even make the script insert both "him" and "his" strings so that you only need to delete one of them after you've seen the context?
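Something along these lines would do the marking. The marker string `[[HIM|HIS]]` and the function name `mark_her` are just examples; pick any marker that cannot occur in your text.

```python
import re

# Hypothetical marker; choose something that never appears in the real text.
MARKER = "[[HIM|HIS]]"

def mark_her(text, marker=MARKER):
    """Replace every standalone 'her' with the marker and report how many were found."""
    return re.subn(r"\bher\b", marker, text, flags=re.IGNORECASE)

marked, count = mark_her("She gave her book to her.")
print(count, marked)
```

The count gives an idea of how much manual checking is left, and searching for the marker takes you to each spot where one of the two alternatives has to be deleted.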

You're going to save a lot of time and effort this way. Not to mention blood, sweat and tears even! (-:

Coming up with a fully automatic solution is no mean feat as it will involve scanning a massive corpus of words to determine if the following word is an object.

Sometimes gaining that extra 5 or 10 percent improvement is just not worth the extra effort involved. Except, of course, as an "it is left as an interesting exercise for the reader..." type of problem that some textbooks seem to love.

Edit: I forgot to mention that finding this "tipping point" is a true art. Definitely one skill that only comes with experience. (-:

Edit: Part II - The Revenge. I also forgot to mention that you can eliminate one edge case, though. If the word "her" is followed by punctuation, e.g. "... to her.", "... for her," etc., then you can eliminate the uncertainty for those cases and just replace them with "him". Similarly, if the word is followed by certain classes of words, e.g. "... for her to", the "her" can easily be replaced with "him".

Edit 3: This is not a full list of exceptions; it is merely intended as a starting point for the list of items you'll need to look for.

HTH

爷的心禁止访问

Reply #6 · 2019-02-06 22:53

I will address regex, since that is one of the tags. Regular expressions are insufficiently powerful for parsing human language, because regex does not do recursion, and all human languages are recursive.

When this fact is combined with the other ambiguities in English, such as the way many words can serve multiple functions in a sentence, I think that a reliable automated solution will be a very difficult and costly project.

干净又极端

Reply #7 · 2019-02-06 22:57

Given the scope of your project (reversing all gender-related words), it appears that:

The "investment" in a more fundamental approach would be justified.
No heuristic based on simple lookup/substitution will adequately serve all or even most cases.

Furthermore, regex seems a poor choice of tool too; natural language is just not a regular language ;-).

Instead, you should consider introducing Part-of-Speech (POS) tagging, possibly with a hint of Named Entity Recognition, and then apply substitution rules based on the extra info the tagging supplied.

This may seem like a lot of work, but if, for example, your scripting language happens to be Python, you can leverage NLTK to implement all this with relatively little effort.
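To give a feel for the substitution step, here is a small sketch that works on already-tagged tokens. It assumes Penn Treebank tags, where PRP$ marks a possessive pronoun and PRP a personal pronoun. In practice the (word, tag) pairs would come from a tagger such as NLTK's `nltk.pos_tag`, which needs its tagger model data installed, so the example uses hand-written tags instead. `HER_BY_TAG` and `swap_her` are names invented for illustration.

```python
# Penn Treebank tags: PRP$ = possessive pronoun, PRP = personal pronoun.
HER_BY_TAG = {"PRP$": "his", "PRP": "him"}

def swap_her(tagged):
    """Return the words of a tagged sentence with each 'her' swapped according to its tag."""
    out = []
    for word, tag in tagged:
        if word.lower() == "her" and tag in HER_BY_TAG:
            replacement = HER_BY_TAG[tag]
            # Keep a leading capital, e.g. at the start of a sentence.
            out.append(replacement.capitalize() if word[0].isupper() else replacement)
        else:
            out.append(word)
    return out

# Hand-written tags standing in for real tagger output.
tagged = [("She", "PRP"), ("lost", "VBD"), ("her", "PRP$"), ("keys", "NNS")]
print(" ".join(swap_her(tagged)))
```

The quality of the result depends entirely on how well the tagger tells the two uses of "her" apart; the substitution table itself is the easy part, and the same pattern extends to the other gendered words.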
