Text mining with PHP [closed]

2019-01-30 12:00发布

问题:

I'm doing a project for a college class I'm taking.

I'm using PHP to build a simple web app that classify tweets as "positive" (or happy) and "negative" (or sad) based on a set of dictionaries. The algorithm I'm thinking of right now is Naive Bayes classifier or decision tree.

However, I can't find any PHP library that helps me do some serious language processing. Python has NLTK (http://www.nltk.org). Is there anything like that for PHP?

I'm planning to use WEKA as the back end of the web app (by calling Weka in command line from within PHP), but it doesn't seem that efficient.

Do you have any idea what I should use for this project? Or should I just switch to Python?

Thanks

回答1:

If you're going to be using a Naive Bayes classifier, you don't really need a whole ton of NL processing. All you'll need is an algorithm to stem the words in the tweets and if you want, remove stop words.

Stemming algorithms abound and aren't difficult to code. Removing stop words is just a matter of searching a hash map or something similar. I don't see a justification to switch your development platform to accomodate the NLTK, although it is a very nice tool.



回答2:

I did a very similar project a while ago - only classifying RSS news items instead of twitter - also using PHP for the front-end and WEKA for the back-end. I used PHP/Java Bridge which was relatively simple to use - a couple of lines added to your Java (WEKA) code and it allows your PHP to call its methods. Here's an example of the PHP-side code from their website:

<?php 
require_once("http://localhost:8087/JavaBridge/java/Java.inc");

$world = new java("HelloWorld");
echo $world->hello(array("from PHP"));
?>

Then (as someone has already mentioned), you just need to filter out the stop words. Keeping a txt file for this is pretty handy for adding new words (they tend to pile up when you start filtering out irrelevant words and account for typos).

The naive-bayes model has strong independent-feature assumptions, i.e. it doesn't account for words that are commonly paired (such as an idiom or phrase) - just taking each word as an independent occurrence. However, it can outperform some of the more complex methods (such as word-stemming, IIRC) and should be perfect for a college class without making it needlessly complex.



回答3:

You can also use the uClassify API to do something similar to Naive Bayes. You basically train a classifier as you would with any algorithm (except here you're doing it via the web interface or by sending xml documents to the API). Then whenever you get a new tweet (or batch of tweets), you call the API to have it classify them. It's fast and you don't have to worry about tuning it. Of course, that means you lose the flexibility you get by controlling the classifier yourself, but that also means less work for you if that in itself is not the goal of the class project.



回答4:

Try open calais - http://viewer.opencalais.com/ . It has api, PHP classes and many more. Also, LingPipe for this task - http://alias-i.com/lingpipe/index.html



回答5:

you can check this library https://github.com/Dachande663/PHP-Classifier very straight forward



回答6:

you can also use thrift or gearman to deal with nltk