Dedicated faceted search engine for dealing with d

Posted 2019-03-09 18:46

Question:

I've been thinking for a while about modeling a typical e-commerce site with an eBay-like taxonomy and attributes that depend on the particular product category.

My first attempt was choosing between EAV and table-per-class inheritance modeling in the database. I chose the latter for performance, but that meant creating a dedicated table for each leaf category in the category tree, with category-specific attributes (such as resolution for TVs) modeled as separate columns.

While performant, this setup is not flexible when you need to add attributes to existing categories or add new categories. Each such change requires the following:
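To make the trade-off concrete, here is a minimal sketch of the two layouts using Python's built-in `sqlite3`. The table names, column names, and sample rows are made up for illustration and are not taken from a real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Table-per-class: one table per leaf category, one column per attribute.
    CREATE TABLE product_tv (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        price      REAL NOT NULL,
        resolution TEXT NOT NULL
    );

    -- EAV: one generic table holds every attribute of every category.
    CREATE TABLE product (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        price REAL NOT NULL
    );
    CREATE TABLE product_attribute (
        product_id INTEGER NOT NULL REFERENCES product(id),
        attribute  TEXT NOT NULL,
        value      TEXT NOT NULL
    );
""")

# Hypothetical sample data, stored once in each layout.
tvs = [(1, "TV A", 499.0, "1920x1080"), (2, "TV B", 899.0, "3840x2160")]
conn.executemany("INSERT INTO product_tv VALUES (?, ?, ?, ?)", tvs)
conn.executemany("INSERT INTO product VALUES (?, ?, ?)", [t[:3] for t in tvs])
conn.executemany(
    "INSERT INTO product_attribute VALUES (?, 'resolution', ?)",
    [(t[0], t[3]) for t in tvs],
)


def tvs_by_resolution_tpc(resolution):
    """Table-per-class: a plain column filter, but only for this one table."""
    rows = conn.execute(
        "SELECT name FROM product_tv WHERE resolution = ? ORDER BY name",
        (resolution,),
    )
    return [r[0] for r in rows]


def products_by_attribute_eav(attribute, value):
    """EAV: works for any attribute, at the cost of a join per filtered attribute."""
    rows = conn.execute(
        """
        SELECT p.name
        FROM product p
        JOIN product_attribute a ON a.product_id = p.id
        WHERE a.attribute = ? AND a.value = ?
        ORDER BY p.name
        """,
        (attribute, value),
    )
    return [r[0] for r in rows]
```

Adding a new attribute to the table-per-class layout means an `ALTER TABLE` plus a new query function, while the EAV layout only needs new rows. On the other hand, every EAV value is stored as text, so the database cannot enforce types.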

  • Alter/create table
  • New form for filtering within such a category by its specific attributes
  • New code for generating db queries for searching and filtering
  • Some new viewmodels/DTOs and views for presenting products from new categories

To cope with that complexity, I think some kind of meta representation of those attributes is needed (possibly even outside the application, in an XML or even an Excel file), so that on each change all of the code mentioned above could be auto-generated (SQL/ORM queries, application code, templates). That would help with development, but testing and an extra deployment would still be needed.

At that point I learned that eBay doesn't really use a relational DB for search, and that their taxonomy is so flexible that they can add new leaf categories quite quickly. Also, their categories probably aren't nodes of a hierarchical tree modeled in a relational DB, but just search attributes (facets).

After a quick look at the most promising dedicated faceted search setup (a separate Solr instance), I'm not sure it would help me stay flexible to taxonomy changes. Solr usually just mirrors the relational DB in some way, so category-specific attributes would still have to be modeled in the DB as DBMS metadata, and things like dynamically generating UI forms for filtering by attribute would be hard, unless:

1) I keep the data in the RDBMS in EAV fashion and overcome its performance problems by using Solr for search (but there would still be problems with EAV messiness, no data integrity enforcement, etc.)

2) I keep just the attribute dictionary (i.e. only the attribute names and types) in the RDBMS and store the specific attribute values in Solr, using it as a kind of non-relational data store in addition to a search facility. I'm not convinced by this solution either (even if it's possible), since the application would be coupled too tightly to Solr (i.e. the product-editing admin CRUD would interact with Solr directly).
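As a rough sketch of option 2, the attribute dictionary could drive both the documents sent to Solr and the facet queries, without any per-category code. The `ATTRIBUTES` dictionary, the `category_s` field, and the helper names are hypothetical. The `_s`/`_i`/`_f` suffixes assume a Solr schema that defines matching dynamic fields, as the stock example schemas typically do. The code only builds request data and does not talk to a server.

```python
from urllib.parse import urlencode

# Hypothetical attribute dictionary, as it might be loaded from the RDBMS:
# category -> list of (attribute name, attribute type).
ATTRIBUTES = {
    "tv": [("resolution", "string"), ("screen_size", "int")],
}

# Assumed mapping from attribute type to a Solr dynamic-field suffix.
SUFFIX = {"string": "_s", "int": "_i", "float": "_f"}


def solr_field(name, type_):
    """Derive the Solr field name for an attribute from its declared type."""
    return name + SUFFIX[type_]


def to_solr_doc(product_id, category, values):
    """Build the document to index; attributes not in the dictionary are dropped."""
    doc = {"id": str(product_id), "category_s": category}
    for name, type_ in ATTRIBUTES[category]:
        if name in values:
            doc[solr_field(name, type_)] = values[name]
    return doc


def facet_params(category, selected):
    """Build query parameters: facet on every attribute, filter on the selected ones."""
    types = dict(ATTRIBUTES[category])
    params = [("q", "*:*"), ("fq", f'category_s:"{category}"'), ("facet", "true")]
    for name, type_ in ATTRIBUTES[category]:
        params.append(("facet.field", solr_field(name, type_)))
    for name, value in selected.items():
        # Note: real code would also have to escape quotes inside `value`.
        params.append(("fq", f'{solr_field(name, types[name])}:"{value}"'))
    return params


query_string = urlencode(facet_params("tv", {"resolution": "1920x1080"}))
```

The resulting `query_string` could be appended to a Solr `/select` URL. Adding an attribute then means adding one entry to the dictionary and reindexing, not changing the schema or the query code.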

What are your thoughts? Do you think code generation is inevitable for this kind of (performant) taxonomy flexibility? How would you handle that? Maybe with a separate EAV-style data dictionary in the DB just for code generation purposes? I guess I could also use something like MongoDB, but the UI code generation (at runtime or not) would still need some kind of metadata.

There are a lot of questions here, but I didn't want to break this up into smaller questions, since I'm interested in a general design approach for dealing with the bigger class of such problems.

Answer 1:

I don't claim to have a definitive answer to all of this. It's a rather open-ended question that you should try to break into smaller parts, and it depends on your actual requirements (in fact, I'm tempted to vote to close it). But I will comment on a few things:

  1. I would forget about modelling this in an RDBMS. Faceted search just doesn't work in a relational schema.
  2. IMO this is not the right place for code generation. You should design your code so it doesn't change with data changes (I'm not talking about schema changes).
  3. Storing metadata / attributes in an Excel spreadsheet seems like a very bad idea. I'd build a UI to edit this metadata, which would be stored in Solr / MongoDB / CouchDB / whatever you choose to manage this.
  4. Solr does not "just mirror relational DB". In fact, Solr is completely independent of relational databases. One of the most common use cases is dumping data from an RDBMS into Solr (denormalizing the data in the process), but Solr is flexible enough to work without any relational data source.
  5. Hierarchical faceting in Solr is still an open research issue. Currently there are two separate approaches being researched (SOLR-64, SOLR-792).
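Point 2 could look roughly like the following sketch: the category attributes are plain data, and one generic piece of code renders the filter form and parses the submitted filters for every category. The `Attribute` class, the `CATEGORY_ATTRIBUTES` table, and both function names are invented for this illustration. In practice the metadata would be loaded from whatever store you choose, not hard-coded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str    # key used in query strings and in the search index
    label: str   # text shown next to the form field
    type: type   # Python type used to parse user input


# Hypothetical metadata; adding a category or attribute is a data change only.
CATEGORY_ATTRIBUTES = {
    "tv": [
        Attribute("resolution", "Resolution", str),
        Attribute("screen_size", "Screen size (in)", int),
    ],
    "book": [Attribute("isbn", "ISBN", str)],
}


def filter_form(category):
    """Describe the filter form for a category as a list of field definitions."""
    return [
        {
            "name": a.name,
            "label": a.label,
            "input": "number" if a.type in (int, float) else "text",
        }
        for a in CATEGORY_ATTRIBUTES[category]
    ]


def parse_filters(category, raw):
    """Convert raw submitted values to typed filters; unknown or empty keys are ignored."""
    parsed = {}
    for a in CATEGORY_ATTRIBUTES[category]:
        if raw.get(a.name):
            parsed[a.name] = a.type(raw[a.name])
    return parsed
```

For example, `parse_filters("tv", {"screen_size": "42", "bogus": "x"})` keeps only the known attribute and converts it to an integer. A template would loop over `filter_form(category)` to render the inputs, so no new view or form class is needed per category.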


Answer 2:

What if you had different types of categories for different types of products?

Taking the eBay example, we would have Products that can be either Books or TV/Displays.

Books have a title and an ISBN, and may be in the sci-fi category, the erotic category, the non-fiction category, or the autobiographical category. Or maybe you have a book that is in the non-fiction, autobiographical, and erotic categories all at once.

Displays have a screen resolution and a power consumption in watts (?), and may be in the flat-screen category, the CRT category, or the HD category.

From a purely relational point of view, you could maybe model this like so:

[Product]-(1)------(1)-[  Book  ]-(n)------(m)-[ book_category ]
| id    |              | title  |              |  name         |
| price |              | ISBN   |
| ...   |
| ...   |-(1)---(1)-[   display  ]-(n)------(m)-[ display_category ]
                    | resolution |              |  name            |
                    |   watts    |

Instead of modeling attributes dependent on a particular product category, you would have different properties and categories dependent on the type/class of product.

See supertypes & subtypes
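As a minimal sketch, the book half of the diagram could be written with Python's built-in `sqlite3` as below. The junction table name `book_book_category`, the sample rows, and the `books_in_category` helper are assumptions added for the example. The display subtype would follow the same pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Supertype: attributes shared by every product.
    CREATE TABLE product (
        id    INTEGER PRIMARY KEY,
        price REAL NOT NULL
    );

    -- Subtype: shares its primary key with the supertype (1:1).
    CREATE TABLE book (
        product_id INTEGER PRIMARY KEY REFERENCES product(id),
        title      TEXT NOT NULL,
        isbn       TEXT NOT NULL
    );

    CREATE TABLE book_category (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );

    -- Junction table for the n:m link between books and their categories.
    CREATE TABLE book_book_category (
        book_id     INTEGER NOT NULL REFERENCES book(product_id),
        category_id INTEGER NOT NULL REFERENCES book_category(id),
        PRIMARY KEY (book_id, category_id)
    );
""")

# Hypothetical sample data: one book that sits in two categories.
conn.execute("INSERT INTO product VALUES (1, 12.5)")
conn.execute("INSERT INTO book VALUES (1, 'Example Memoir', 'ISBN-PLACEHOLDER')")
conn.executemany(
    "INSERT INTO book_category VALUES (?, ?)",
    [(1, "non-fiction"), (2, "autobiographical"), (3, "sci-fi")],
)
conn.executemany("INSERT INTO book_book_category VALUES (1, ?)", [(1,), (2,)])


def books_in_category(name):
    """Return the titles of all books in the named category."""
    rows = conn.execute(
        """
        SELECT b.title
        FROM book b
        JOIN book_book_category bc ON bc.book_id = b.product_id
        JOIN book_category c ON c.id = bc.category_id
        WHERE c.name = ?
        ORDER BY b.title
        """,
        (name,),
    )
    return [r[0] for r in rows]
```

Each new product type still needs its own subtype table here, so this layout keeps type safety and integrity checks but does not remove the schema change that the question is trying to avoid.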