Keyword extraction using KEA or other Python li

Posted 2019-07-29 10:03

Question:

I am working on a keyword extraction project, and I am using Python for it. Let me describe the project first. My goal is to extract keywords (single words are preferred over key-phrases) from a paragraph or a webpage.

I assume that I can crawl reasonably well-structured content from a website.

Let's say I have many paragraphs, all from the same industry. Here is one example paragraph:

About us

We are the greatest bank in the world, which provide the most safe service in the world. Our bank is providing FX, security trading and saving services. Over the past few years, we successfully build up a reliable reputation.

Secondly, I have labeled the keywords in these paragraphs in order to train a supervised learning model.

Finally, I have tried to build a model with KEA, which is a Java program (I call it from Python).

However, the results are very poor. The accuracy is only about 15%. That is, if I give my KEA program a paragraph and it outputs 10 keywords, nearly 85% of them are not desirable keywords.
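To make the accuracy figure concrete, here is a minimal sketch of how such a score could be computed as precision over the top 10 extracted keywords. The function name `precision_at_k`, the predicted list, and the gold labels are all hypothetical illustrations, not actual KEA output.

```python
def precision_at_k(predicted, gold, k=10):
    """Fraction of the top-k predicted keywords that appear in the gold labels."""
    top_k = [w.lower() for w in predicted[:k]]
    gold_set = {w.lower() for w in gold}
    if not top_k:
        return 0.0
    hits = sum(1 for w in top_k if w in gold_set)
    return hits / len(top_k)


# Hypothetical extractor output for the sample paragraph (not real KEA output).
predicted = ["bank", "world", "service", "fx", "trading",
             "saving", "years", "reputation", "safe", "security"]
# Hypothetical hand-assigned labels for the same paragraph.
gold = ["safe", "reliable"]

score = precision_at_k(predicted, gold)
print(f"precision@10 = {score:.2f}")
```

Only one of the ten hypothetical predictions matches a label here, so the score is 0.10, which is in the same range as the roughly 15% described above.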

I have a few questions:

  1. This question is about preparing the training material for KEA. Should the keywords stay in the .txt file in the training data, or should I delete them from the paragraph? The KEA readme file is confusing on this point:

'Delete the author-assigned keyphrases from those documents and put them into separate ".key" files. For example, if your document file is called doc1.txt, move the keyphrases into a new file called "doc1.key". It is important that you put each keyphrase on a separate line in this file!'

So, let's say I use the example paragraph above as training data and assume that 'safe' and 'reliable' are its keywords. Should I delete these two words from the paragraph?

  2. KEA can use SKOS vocabularies. Does that mean that if I use an appropriate SKOS vocabulary for a certain topic (let's say the financial industry), my model's results will be better? If yes, where can I find such SKOS vocabularies, for example one about the financial industry?

  3. Are there any Python libraries that are powerful for this task? Can anyone recommend some?
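For the first question, one possible reading of the readme passage quoted above is that it refers to a separate author-assigned keyphrase list (such as a "Keywords:" line), not to occurrences of those words in the running text. Under that assumption, a minimal sketch for writing the `doc1.txt` / `doc1.key` pair looks like this. The helper name `write_kea_pair` and the temporary output directory are hypothetical, and whether KEA expects the body text to stay unchanged should be checked against the readme.

```python
import tempfile
from pathlib import Path


def write_kea_pair(out_dir, doc_id, text, keywords):
    """Write <doc_id>.txt (document text) and <doc_id>.key (one keyphrase per line)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    txt_path = out_dir / f"{doc_id}.txt"
    key_path = out_dir / f"{doc_id}.key"
    # Assumption: the body text is kept as-is, including the labeled words.
    txt_path.write_text(text.strip() + "\n", encoding="utf-8")
    # The readme asks for each keyphrase on a separate line.
    key_lines = [k.strip() for k in keywords if k.strip()]
    key_path.write_text("\n".join(key_lines) + "\n", encoding="utf-8")
    return txt_path, key_path


paragraph = (
    "We are the greatest bank in the world, which provide the most safe "
    "service in the world. Our bank is providing FX, security trading and "
    "saving services. Over the past few years, we successfully build up a "
    "reliable reputation."
)

out_dir = tempfile.mkdtemp()  # hypothetical location for the training files
txt_path, key_path = write_kea_pair(out_dir, "doc1", paragraph, ["safe", "reliable"])
print(txt_path.name, key_path.name)
```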

Thanks a lot.

Answer 1:

Actually, I tried following this Kaggle example (https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words). However, it is too simple.
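For context, the bag-of-words idea named in that tutorial's title can be sketched with the standard library alone. This is not the tutorial's own code; the `STOP_WORDS` set and the `bag_of_words` helper are illustrative assumptions, and the input is the sample paragraph from the question.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; real projects use fuller lists).
STOP_WORDS = {
    "a", "and", "are", "few", "in", "is", "most", "our", "over",
    "past", "the", "up", "we", "which",
}


def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count non-stop-words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


paragraph = (
    "We are the greatest bank in the world, which provide the most safe "
    "service in the world. Our bank is providing FX, security trading and "
    "saving services. Over the past few years, we successfully build up a "
    "reliable reputation."
)

counts = bag_of_words(paragraph)
# Rank candidate keywords by raw frequency.
for word, freq in counts.most_common(5):
    print(word, freq)
```

Raw counts like these favor frequent words such as "bank" and "world", which may be part of why this approach feels too simple for keyword labeling.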

I would like to see more practical cases, such as how Netflix or Facebook processes people's comments. Could anyone share more information about this kind of text mining with me?