I'm building a Django site and I am looking for a search engine.

A few candidates:

Lucene/Lucene with Compass/Solr
Sphinx
PostgreSQL built-in full text search
MySQL built-in full text search

Selection criteria:

result relevance and ranking
searching and indexing speed
ease of use and ease of integration with Django
resource requirements - site will be hosted on a VPS, so ideally the search engine wouldn't require a lot of RAM and CPU
scalability
extra features such as "did you mean?", related searches, etc

Anyone who has had experience with the search engines above, or other engines not in the list -- I would love to hear your opinions.

EDIT: As for indexing needs: as users keep entering data into the site, that data needs to be indexed continuously. It doesn't have to be real time, but ideally new data would show up in the index within 15 to 30 minutes.

Tags: mysql postgresql full-text-search lucene sphinx

8 answers

步步皆殇っ

Reply #2 · 2018-12-31 18:40

Apache Solr

Apart from answering the OP's questions, let me share some insights on Apache Solr, from a simple introduction to detailed installation and implementation.

Simple Introduction

Anyone who has had experience with the search engines above, or other engines not in the list -- I would love to hear your opinions.

Solr shouldn't be used to solve real-time problems. As a search engine, though, Solr is more than up to the task and works flawlessly.

Solr works fine on high-traffic web applications (I have read claims that it is not suited for this, but I stand by my statement). It relies mainly on RAM rather than CPU.

result relevance and ranking

Boosting helps you push the results you care about to the top. Say you're searching for the name john in the fields firstname and lastname, and you want matches in the firstname field to count for more. Then you need to boost the firstname field as shown. Note that both field clauses belong inside the single q parameter.

http://localhost:8983/solr/collection1/select?q=firstname:john^2+OR+lastname:john

As you can see, the firstname field is boosted by a factor of 2.
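A minimal Python sketch of building such a boosted query, assuming the same local core (collection1) and the firstname/lastname fields from the example. The helper name build_boosted_query_url is made up for illustration, and the search term is assumed to be a single plain word with no special query characters. It uses only the standard library and URL-encodes the query so that both clauses end up inside the one q parameter.

```python
from urllib.parse import urlencode

# Assumed local Solr core, as used elsewhere in this answer.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_boosted_query_url(term, boosts):
    """Return a select URL that searches `term` in each field with its boost.

    `boosts` maps field name -> boost factor, e.g. {"firstname": 2, "lastname": 1}.
    `term` is assumed to be a plain word (no escaping is done here).
    """
    clauses = []
    for field, boost in boosts.items():
        clause = f"{field}:{term}"
        if boost != 1:
            clause += f"^{boost}"  # Lucene query syntax: field:term^boost
        clauses.append(clause)
    query = " OR ".join(clauses)
    # urlencode percent-encodes ':' and '^' and turns spaces into '+'.
    return SOLR_SELECT_URL + "?" + urlencode({"q": query, "wt": "json"})


if __name__ == "__main__":
    print(build_boosted_query_url("john", {"firstname": 2, "lastname": 1}))
```

The resulting URL can be opened in a browser or fetched from application code in the same way as the hand-written one.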

Getting Started

Download Apache Solr from here. The version used in this walkthrough is 4.8.1. You could download a newer version; I found this one stable.

After downloading the archive, extract it to a folder of your choice, say Downloads, so that it looks like Downloads/solr-4.8.1/

At your prompt, navigate into the directory:

shankar@shankar-lenovo:~$ cd Downloads/solr-4.8.1

So now you are here:

shankar@shankar-lenovo:~/Downloads/solr-4.8.1$

Start the Jetty Application Server

Jetty is available inside the example folder of the solr-4.8.1 directory, so navigate into it and start the Jetty Application Server.

shankar@shankar-lenovo:~/Downloads/solr-4.8.1/example$ java -jar start.jar

Now, do not close the terminal; minimize it and leave it running.

( TIP : Use & after start.jar to make the Jetty Server run in the background )

To check that Apache Solr is running, open this URL in your browser: http://localhost:8983/solr

Running Jetty on custom Port

It runs on port 8983 by default. You can change the port either on the command line, as shown here, or directly inside the jetty.xml file.

java -Djetty.port=9091 -jar start.jar

Download the MySQL Connector/J

This JAR file is the JDBC driver that lets Solr connect to MySQL. Download the Platform Independent version here.

After downloading it, extract the archive, copy mysql-connector-java-5.1.31-bin.jar and paste it into the lib directory:

shankar@shankar-lenovo:~/Downloads/solr-4.8.1/contrib/dataimporthandler/lib

Creating the MySQL table to be linked to Apache Solr

To put Solr to use, you need some tables and data to search. For that, we will use MySQL to create a table and insert some random names. Then we can have Solr connect to MySQL and index that table and its entries.

1. Table Structure

CREATE TABLE test_solr_mysql
 (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name VARCHAR(45) NULL,
  created TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id)
 );

2. Populate the above table

INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jean');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jack');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jason');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Vego');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Grunt');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jasper');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Fred');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jenna');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Rebecca');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Roland');

Getting inside the core and adding the lib directives

1. Navigate to

shankar@shankar-lenovo: ~/Downloads/solr-4.8.1/example/solr/collection1/conf

2. Modifying the solrconfig.xml

Add these two directives to this file:

  <lib dir="../../../contrib/dataimporthandler/lib/" regex=".*\.jar" />
  <lib dir="../../../dist/" regex="solr-dataimporthandler-\d.*\.jar" />

Now add the DIH (Data Import Handler)

<requestHandler name="/dataimport" 
  class="org.apache.solr.handler.dataimport.DataImportHandler" >
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
</requestHandler>

3. Create the db-data-config.xml file

If the file does not exist, create it; then add these lines to it. As you can see in the dataSource line, you need to provide the details of your MySQL database: the database name, username and password.

<dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/yourdbname" user="dbuser" password="dbpass"/>
    <document>
        <entity name="test_solr" query="select CONCAT('test_solr-',id) as rid,name from test_solr_mysql WHERE '${dataimporter.request.clean}' != 'false'
            OR `created` > '${dataimporter.last_index_time}'" >
            <field name="id" column="rid" />
            <field name="solr_name" column="name" />
        </entity>
    </document>
</dataConfig>

( TIP : You can have any number of entities, but watch out for the id field: documents that share the same id overwrite each other, so only one of them ends up in the index. )

4. Modify the schema.xml file

Add this to your schema.xml as shown:

<uniqueKey>id</uniqueKey>
<field name="solr_name" type="string" indexed="true" stored="true" />

Implementation

Indexing

This is where the real work happens. You need to index the data from MySQL into Solr in order to make use of Solr queries.

Step 1: Go to Solr Admin Panel

Open the URL http://localhost:8983/solr in your browser. The screen looks like this.

This is the main Apache Solr Administration Panel

As the marker indicates, go to Logging in order to check whether any of the above configuration has led to errors.

Step 2: Check your Logs

OK, so now you are here. As you can see, there are a lot of yellow messages (WARNINGS). Make sure you don't have any error messages marked in red. Earlier we added a select query to db-data-config.xml; if there were any errors in that query, they would show up here.

This is the logging section of your Apache Solr engine

Fine, no errors. We are good to go. Let's choose collection1 from the list as depicted and select Dataimport.

Step 3: DIH (Data Import Handler)

Using the DIH, Solr connects to MySQL through the configuration file db-data-config.xml and retrieves the 10 records from the database, which then get indexed into Solr.

To do that, choose full-import and check the options Clean and Commit. Now click Execute as shown.

Alternatively, you could use a direct full-import query like this:

http://localhost:8983/solr/collection1/dataimport?command=full-import&commit=true
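The same import can be triggered from a script, which is one way to get the periodic re-indexing the question asks about. The sketch below assumes the local core and the db-data-config.xml shown earlier; the function names are made up for illustration. With that config's WHERE clause, passing clean=false leaves the existing index in place and only picks up rows whose created timestamp is newer than the last index time.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed local Solr core with the /dataimport handler configured as above.
DATAIMPORT_URL = "http://localhost:8983/solr/collection1/dataimport"


def build_import_url(clean):
    """Build a DIH full-import URL.

    clean=True  -> wipe the index and re-import every row.
    clean=False -> keep the index; with the WHERE clause from
                   db-data-config.xml only rows created after the
                   last index time are selected.
    """
    params = {
        "command": "full-import",
        "clean": "true" if clean else "false",
        "commit": "true",
    }
    return DATAIMPORT_URL + "?" + urlencode(params)


def run_import(clean=False, timeout=30):
    """Trigger the import and return the raw response body as text."""
    with urlopen(build_import_url(clean), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Could be scheduled (for example from cron every 15 minutes)
    # to keep the index reasonably fresh.
    print(run_import(clean=False))
```

Running the script needs a live Solr instance; build_import_url on its own only assembles the URL.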

The Data Import Handler

After you click Execute, Solr begins to index the records. If there are any errors, it will say Indexing Failed and you have to go back to the Logging section to see what went wrong.

Assuming there are no errors with this configuration and the indexing completes successfully, you will get this notification.

Indexing Success

Step 4: Running Solr Queries

It seems everything went well. Now you can use Solr queries to query the data that was indexed. Click Query on the left and then press the Execute button at the bottom.

You will see the indexed records as shown.

The corresponding Solr query for listing all the records is

http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true

The indexed data

Well, there are all 10 indexed records. Say we need only the names starting with Ja. In this case you need to target the field solr_name, so your query looks like this.

http://localhost:8983/solr/collection1/select?q=solr_name:Ja*&wt=json&indent=true

The JSON data starting with Ja*
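From application code (a Django view, for instance) the same query can be built and its JSON response parsed with the standard library. This is only a sketch: the helper names are invented here, and SAMPLE_RESPONSE is a hand-written document in the general shape of a Solr JSON response, not output captured from a real server.

```python
import json
from urllib.parse import urlencode

# Assumed local Solr core, as used elsewhere in this answer.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_prefix_query_url(field, prefix):
    """Return a select URL matching values of `field` that start with `prefix`."""
    params = {"q": f"{field}:{prefix}*", "wt": "json"}
    return SOLR_SELECT_URL + "?" + urlencode(params)


def extract_values(response_text, field):
    """Pull `field` out of every document in a Solr JSON response."""
    docs = json.loads(response_text)["response"]["docs"]
    return [doc[field] for doc in docs if field in doc]


# Hand-written sample for illustration only (ids and counts are made up).
SAMPLE_RESPONSE = """
{"responseHeader": {"status": 0},
 "response": {"numFound": 3, "start": 0, "docs": [
   {"id": "test_solr-2", "solr_name": "Jack"},
   {"id": "test_solr-3", "solr_name": "Jason"},
   {"id": "test_solr-6", "solr_name": "Jasper"}]}}
"""

if __name__ == "__main__":
    print(build_prefix_query_url("solr_name", "Ja"))
    print(extract_values(SAMPLE_RESPONSE, "solr_name"))
```

In a real application the response text would come from fetching the URL instead of the sample string.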

That's how you write Solr queries. To read more about it, check this beautiful article.


姐姐魅力值爆表

Reply #3 · 2018-12-31 18:49

Just my two cents on this very old question. I would highly recommend taking a look at Elasticsearch.

Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.

The advantages over other FTS (full text search) Engines are:

RESTful interface
Better scalability
Large community
Built by Lucene developers
Extensive documentation
There are many open source libraries available (including Django)

We are using this search engine in our project and are very happy with it.
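To give a feel for the RESTful, JSON-based interface, here is a minimal Python sketch that assembles a full-text search request. It assumes Elasticsearch is listening on its default local address; the index name "articles" and the field "title" are hypothetical, and the helper name is made up for illustration. The code only builds the URL and JSON body and leaves sending them to whichever HTTP client the project uses.

```python
import json

# Assumed default local Elasticsearch address.
ES_BASE_URL = "http://localhost:9200"


def build_search_request(index, field, text, size=10):
    """Return (url, body) for a full-text `match` query against one field."""
    url = f"{ES_BASE_URL}/{index}/_search"
    body = {
        "query": {"match": {field: text}},  # analyzed full-text match
        "size": size,                       # maximum number of hits to return
    }
    return url, json.dumps(body)


if __name__ == "__main__":
    # "articles" and "title" are placeholder names, not from the thread.
    url, body = build_search_request("articles", "title", "django search")
    print("POST", url)
    print(body)
```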


唯独是你

Reply #4 · 2018-12-31 18:52

I don't know Sphinx, but as for Lucene vs a database full-text search, I think that Lucene performance is unmatched. You should be able to do almost any search in less than 10 ms, no matter how many records you have to search, provided that you have set up your Lucene index correctly.

Here comes the biggest hurdle though: personally, I think integrating Lucene in your project is not easy. Sure, it is not too hard to set it up so you can do some basic search, but if you want to get the most out of it, with optimal performance, then you definitely need a good book about Lucene.

As for CPU & RAM requirements, performing a search in Lucene doesn't tax your CPU too much. Indexing your data does, but you don't do that too often (maybe once or twice a day), so that isn't much of a hurdle.

It doesn't answer all of your questions but in short, if you have a lot of data to search, and you want great performance, then I think Lucene is definitely the way to go. If you're not going to have that much data to search, then you might as well go for a database full-text search. Setting up a MySQL full-text search is definitely easier in my book.


倾城一夜雪

Reply #5 · 2018-12-31 18:55

I am surprised that there isn't more information posted about Solr. Solr is quite similar to Sphinx but has more advanced features (AFAIK as I haven't used Sphinx -- only read about it).

The answer at the link below details a few things about Sphinx which also apply to Solr: Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?

Solr also provides the following additional features:

1. Supports replication
2. Multiple cores (think of these as separate databases with their own configuration and own indexes)
3. Boolean searches
4. Highlighting of keywords (fairly easy to do in application code if you have regex-fu; however, why not let a specialized tool do a better job for you)
5. Update index via XML or delimited file
6. Communicate with the search server via HTTP (it can even return JSON, Native PHP/Ruby/Python)
7. PDF, Word document indexing
8. Dynamic fields
9. Facets
10. Aggregate fields
11. Stop words, synonyms, etc.
12. More Like This...
13. Index directly from the database with custom queries
14. Auto-suggest
15. Cache Autowarming
16. Fast indexing (compare to MySQL full-text search indexing times) -- Lucene uses a binary inverted index format.
17. Boosting (custom rules for increasing relevance of a particular keyword or phrase, etc.)
18. Fielded searches (if a search user knows the field he/she wants to search, they narrow down their search by typing the field, then the value, and ONLY that field is searched rather than everything -- much better user experience)

BTW, there are tons more features; however, I've listed just the features that I have actually used in production. Out of the box, MySQL supports #1, #3, and #11 (limited) on the list above. For the features you are looking for, a relational database isn't going to cut it. I'd eliminate those straight away.

Another benefit is that Solr (well, Lucene actually) is a document database (i.e. NoSQL), so many of the benefits of any other document database can be realized with Solr. In other words, you can use it for more than just search (e.g. for performance). Get creative with it :)


旧时光的记忆

Reply #6 · 2018-12-31 18:58

I would add mnoGoSearch to the list. It is an extremely performant and flexible solution that works like Google: an indexer fetches data from multiple sites. You can use the basic criteria or write your own hooks to maximize search quality. It can also fetch data directly from a database.

The solution is not so well known today, but it meets most needs. You can compile and install it either on a standalone server or even on your main server. It doesn't need as many resources as Solr, as it's written in C and runs perfectly even on small servers.

In the beginning you need to compile it yourself, so it requires some knowledge. I made a tiny script for Debian, which could help. Any adjustments are welcome.

As you are using the Django framework, you could either put a PHP client in the middle or find a solution in Python; I have seen some articles about that.

And, of course, mnoGoSearch is open source, under the GNU GPL.


浮光初槿花落

Reply #7 · 2018-12-31 18:59

SearchTools-Avi said "MySQL text search, which doesn't even index words of three letters or fewer."

FYI, the MySQL fulltext minimum word length has been adjustable since at least MySQL 5.0 (the ft_min_word_len setting). Google 'mysql fulltext min length' for simple instructions.

That said, MySQL fulltext has limitations: for one, it gets slow to update once you reach a million records or so, ...
