I have around 100,000 rows in a MySQL table, where each row has about 8 fields.
I have finally gotten the hang of using Zend Lucene to index and search data from a MySQL table.
Before I fully implement this functionality on my website, I have some questions:
1- Is it possible to determine the size of an index in advance? I ask because the Zend manual says the maximum size of an index is 2GB, and my first thought is that this isn't enough for my table!
2- I have read posts saying that Zend Lucene search is very slow on large indexes, taking up to minutes! Is it faster to use MySQL commands directly (SELECT, LIKE etc.) instead of Zend?
3- Are there any other solutions to my problem, which is to create a search engine for classifieds that has at least these functions and doesn't require full-text MySQL indexes (fields)?
Thanks
SOLR is basically a Java web application, typically run in a servlet container such as Apache Tomcat, that implements a REST interface to query an Apache Lucene index. Yes, you need to be able to run a Java application on your web server. This is an issue for you to work out with your hosting provider.
Clients using your web app don't need to run Java. Your PHP app could make a REST query to the SOLR service and format the results in HTML. A client sees only the HTML output; it never needs to know that the data came from a service implemented in Java.
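As a rough illustration of that flow, here is a minimal sketch of building a query URL for SOLR's select handler and turning its JSON response into HTML. It is written in Python rather than PHP purely for illustration. The host, the port (8983 is the default of SOLR's bundled server; a Tomcat install would likely differ), the core name `classifieds` and the `title` field are all assumptions, not details from the setup described above.

```python
import html
import json
from urllib.parse import urlencode

# Hypothetical SOLR location and core name; adjust for a real deployment.
SOLR_BASE = "http://localhost:8983/solr/classifieds"


def build_select_url(user_query, rows=10):
    """Build a URL for SOLR's /select handler, asking for JSON output."""
    params = {"q": user_query, "rows": rows, "wt": "json"}
    return SOLR_BASE + "/select?" + urlencode(params)


def render_results(response_body):
    """Turn a SOLR JSON response body into a small HTML list.

    Assumes each document has a single-valued 'title' field (hypothetical).
    """
    docs = json.loads(response_body)["response"]["docs"]
    items = "".join(
        "<li>{}</li>".format(html.escape(str(doc.get("title", ""))))
        for doc in docs
    )
    return "<ul>{}</ul>".format(items)
```

In a real page handler the response body would come from an HTTP GET on `build_select_url(...)`, for example with `urllib.request.urlopen`. The browser only ever receives the string returned by `render_results`.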
Zend_Search_Lucene is a pure-PHP implementation that is supposed to work identically to Apache Lucene. The Zend solution even uses an identical index file format. So storage-wise they should be equal.
I used Java Lucene to index the StackOverflow data dump (October 2009). I indexed 1.5 million rows, including about 1 gig of text data. The Lucene index was 1323 MB, whereas the MySQL FULLTEXT index of the same data was only 466 MB.
Using SQL LIKE predicates in lieu of any fulltext indexing solution requires no space of course, because it cannot make use of a conventional index anyway. But in my tests, using LIKE was about 200 times slower than Java Lucene, which was in turn about 40% slower than a MySQL FULLTEXT index on the same data.
See my recent presentation about fulltext indexing solutions with MySQL:
http://www.slideshare.net/billkarwin/practical-full-text-search-with-my-sql
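The point about LIKE not benefiting from a conventional index can be seen in a query plan. The sketch below uses Python's built-in SQLite as a stand-in, not MySQL, so the plan wording differs from MySQL's EXPLAIN output. The table and index names are made up for illustration. The behaviour it shows is the same in spirit: an exact match can seek through the B-tree index, while a substring match with a leading wildcard has to scan.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (an assumption
# made so the example is self-contained).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ads (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE INDEX idx_ads_body ON ads (body)")


def plan(sql, args):
    """Return the plan descriptions SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, args).fetchall()
    # The human-readable detail is the last column of each plan row.
    return [row[-1] for row in rows]


# Exact match: the index can be used to seek straight to the value.
exact_plan = plan("SELECT id FROM ads WHERE body = ?", ("red bike",))

# Substring match with a leading wildcard: every entry must be examined.
like_plan = plan("SELECT id FROM ads WHERE body LIKE ?", ("%bike%",))

print(exact_plan)
print(like_plan)
```

The first plan should report a SEARCH using the index, the second a SCAN. The exact wording depends on the SQLite version.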
It's not surprising that Zend_Search_Lucene can't match the performance and scalability of the Java Lucene technology. PHP's advantage as a language is increasing development efficiency, not runtime efficiency.
Update: I just tried creating an index using Zend_Search_Lucene. Creating an index is far slower with PHP than with the Java Lucene technology, so I only indexed 10,000 documents. This took almost 15 minutes, which would make it take about 36 hours to index the whole collection. Compare this to Java Lucene, which in my test indexed the full collection of 1.5 million documents in under 7 minutes.
The size of the index I created with Zend_Search_Lucene is 8.75 MB. Extrapolating this 150x, I estimate the full index would be 1312.5 MB. So I conclude that Zend_Search_Lucene creates an index of about the same size as the index produced by Java Lucene. This is as expected.
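For anyone who wants to redo the estimate with their own sample, the arithmetic above can be sketched in a few lines of Python. It assumes that index size and indexing time both grow linearly with the number of documents, which is a simplification. The numbers are the ones quoted in this answer.

```python
# Figures quoted in the answer.
SAMPLE_DOCS = 10_000
TOTAL_DOCS = 1_500_000
SAMPLE_INDEX_MB = 8.75
SAMPLE_MINUTES = 15  # "almost 15 minutes", so treat this as an upper bound
JAVA_LUCENE_INDEX_MB = 1323

# Linear extrapolation from the sample to the full collection (assumption).
scale = TOTAL_DOCS / SAMPLE_DOCS
estimated_index_mb = SAMPLE_INDEX_MB * scale
estimated_hours = SAMPLE_MINUTES * scale / 60

# Relative gap between the estimate and the measured Java Lucene index.
size_difference_pct = (
    abs(estimated_index_mb - JAVA_LUCENE_INDEX_MB) / JAVA_LUCENE_INDEX_MB * 100
)

print(scale, estimated_index_mb, estimated_hours, size_difference_pct)
```

With a full 15 minutes per 10,000 documents the time estimate comes out at 37.5 hours, in line with the roughly 36 hours quoted for a sample that took slightly less. The size estimate lands within 1% of the Java Lucene index.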