Using Lucene like a relational database

Posted 2019-02-07 10:14

I am wondering whether we could achieve some RDBMS capabilities in Lucene.

Example: 1) I have 10,000 project documents (PDF files) whose content has to be indexed to make them searchable. 2) Every document is related to a SINGLE PROJECT. A project has details such as project name, number, start date, end date, location, type, etc.

I have to search the contents of the PDF files for a given keyword, but when displaying the results I want to show the project meta data mentioned in point (2).

My idea is to associate a field called projectId with each PDF file while indexing. Once a search returns that id, we fire a second search to get the project meta data.

This way we could avoid duplicated data. Also, if we want to update the project meta data, we only have to update it in a SINGLE PLACE. Otherwise, if we store the meta data with every indexed PDF document, we will end up updating all of those documents, which is not what I am looking for.

Please advise.

5 answers
对你真心纯属浪费
#2 · 2019-02-07 10:15

If I understand you correctly, you have two questions:

  1. Can I store a project id in Lucene and use it for further searches? Yes, you can. This is a common practice.
  2. Can I use this project id to search Lucene for project meta data? Yes, you can. I do not know if this is a good idea. It depends on the frequency of your meta data updates and your access pattern. If the meta data is relatively static, and you only access it by id, Lucene may be a good place to store it. Otherwise, you can use the project id as a primary key to a database table, which could be a better fit.
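
The second option can be sketched roughly as follows. This is a minimal illustration, not Lucene code: the inverted index is a toy stand-in built from a dictionary, the metadata table uses an in-memory SQLite database, and all names and sample data (`DOCS`, `build_index`, `project`, and so on) are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (doc_id, project_id, extracted PDF text).
DOCS = [
    ("spec.pdf", 1, "bridge load analysis and foundation survey"),
    ("plan.pdf", 1, "construction schedule for the bridge"),
    ("audit.pdf", 2, "energy audit of the office tower"),
]

def build_index(docs):
    """Toy stand-in for the full-text index: term -> set of doc ids."""
    postings = defaultdict(set)
    project_of = {}
    for doc_id, project_id, text in docs:
        # The project id is the only project data kept with the document.
        project_of[doc_id] = project_id
        for term in text.lower().split():
            postings[term].add(doc_id)
    return postings, project_of

def build_project_db():
    """Project meta data lives in one place, keyed by project id."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE project (id INTEGER PRIMARY KEY, name TEXT, location TEXT)"
    )
    db.executemany(
        "INSERT INTO project VALUES (?, ?, ?)",
        [(1, "River Bridge", "Leeds"), (2, "Office Tower", "Oslo")],
    )
    return db

def search(keyword, postings, project_of, db):
    """Step 1: full-text hit -> project id. Step 2: project id -> meta data row."""
    results = []
    for doc_id in sorted(postings.get(keyword.lower(), ())):
        row = db.execute(
            "SELECT name, location FROM project WHERE id = ?",
            (project_of[doc_id],),
        ).fetchone()
        results.append((doc_id, row[0], row[1]))
    return results

postings, project_of = build_index(DOCS)
db = build_project_db()
print(search("bridge", postings, project_of, db))
```

Because the project row is read at query time, a single `UPDATE` on the `project` table changes what every matching document displays, without touching the index.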
趁早两清
#3 · 2019-02-07 10:15

You can use Lucene that way.

Pros:

Full-text search is easy to implement, which is not the case in an RDBMS.

Cons:

Referential integrity: you get it for free in an RDBMS, but in Lucene, you must implement it yourself.
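
As a rough idea of what "implement it yourself" can mean, here is a small sketch of two application-side checks. It is not Lucene API code; the plain dictionaries stand in for the indexed documents and the project records, and the function names are made up for illustration.

```python
def find_orphans(doc_project_ids, known_project_ids):
    """Return ids of documents whose project id has no matching project record."""
    known = set(known_project_ids)
    return sorted(doc for doc, pid in doc_project_ids.items() if pid not in known)

def delete_project(project_id, doc_project_ids, projects):
    """Refuse to delete a project that indexed documents still reference."""
    referencing = sorted(
        doc for doc, pid in doc_project_ids.items() if pid == project_id
    )
    if referencing:
        raise ValueError(
            f"project {project_id} is still referenced by {referencing}"
        )
    projects.pop(project_id, None)

# Hypothetical sample data: document id -> project id, and project id -> name.
doc_project_ids = {"spec.pdf": 1, "plan.pdf": 1, "stray.pdf": 9}
projects = {1: "River Bridge", 2: "Office Tower"}

print(find_orphans(doc_project_ids, projects))  # documents pointing at no project
delete_project(2, doc_project_ids, projects)    # allowed: nothing references project 2
```

An RDBMS would enforce both rules with a foreign key constraint; here they only hold if every write path calls the checks.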

Root(大扎)
#4 · 2019-02-07 10:23

This is definitely possible. But always be aware that you're using Lucene for something it was not intended for. In general, Lucene is designed for full-text search, not for mapping relational content. So the more complex your relational content becomes, the more you'll see performance decrease.

In particular, there are a few areas to keep a close eye on:

  • Storing the value of each field in your index will decrease performance. If you are not overly concerned with sub-second search results, or if your index is relatively small, then this may not be a problem.
  • Also, be aware that if you are not using the default ranking algorithm, and your custom algorithm needs information about the project to calculate the score for each document, this will have a dramatic impact on search performance as well.

If you need a more powerful index that was designed for relational content, there are hierarchical indexing tools out there (one developed by Apache, called Jackrabbit) that are worth looking into.

As your project continues to grow, you might also check out Solr, also developed by Apache, which provides some added functionality, such as faceted search.

甜甜的少女心
#5 · 2019-02-07 10:28

I am not sure about your overall setup, but maybe Hibernate Search is for you. It would allow you to combine the benefits of a relational database with the power of a full-text search engine like Lucene. The meta data could live in the database, maybe together with the original PDF documents, while the Lucene documents contain only the searchable data.

来,给爷笑一个
#6 · 2019-02-07 10:35

Sounds like a perfectly good thing to do. The only limitation you'll have (by storing a reference to the project in Lucene rather than the project data itself) is that you won't be able to query both the document text and the project metadata in a single query, for example "documentText:foo OR projectName:bar". If you have no such requirement, then storing an ID in Lucene that refers to a database row seems like a fine thing to do.
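
To make the limitation concrete: with the data split across two stores, a query like the one above has to be run as two lookups and merged in the application. The sketch below assumes toy in-memory dictionaries in place of the Lucene index and the database, and all names in it are hypothetical.

```python
# Hypothetical sample data; in the real setup the text would sit in the
# full-text index and the project names in the database.
DOC_TEXT = {"a.pdf": "foo report", "b.pdf": "site survey", "c.pdf": "budget notes"}
DOC_PROJECT = {"a.pdf": 1, "b.pdf": 2, "c.pdf": 3}
PROJECT_NAME = {1: "alpha", 2: "bar", 3: "gamma"}

def docs_matching_text(term):
    """Lookup 1: documents whose text contains the term (the index side)."""
    term = term.lower()
    return {doc for doc, text in DOC_TEXT.items() if term in text.lower().split()}

def docs_in_projects_named(term):
    """Lookup 2: documents whose project name contains the term (the database side)."""
    term = term.lower()
    matching = {pid for pid, name in PROJECT_NAME.items() if term in name.lower().split()}
    return {doc for doc, pid in DOC_PROJECT.items() if pid in matching}

def text_or_project(text_term, name_term):
    """Application-side union standing in for 'documentText:X OR projectName:Y'."""
    return sorted(docs_matching_text(text_term) | docs_in_projects_named(name_term))

print(text_or_project("foo", "bar"))
```

The merged list has no single relevance score across both sources, which is part of what is given up by not indexing the project fields alongside the text.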
