One of the most essential features of any database is query speed. We store data away and want quick access to whatever matches our criteria. Lately, though, schema-less databases have become popular. It's one thing if we have a schema-less database where there actually is an inferred (in-the-head/in-the-app) schema that just hasn't been declared formally to the database.
On the other hand, let's say we truly need an open database where several users have their own schemas for their own individual problem areas. A user would define his own "domain". That domain (a database on an RDBMS server) would have its types (tables in an RDBMS), and those types would have their own properties (columns in an RDBMS). How do I create compound indexes to pull specific objects/documents/records (what have you) from a given domain? My query should select one or more domains (an IN clause), just one type (e.g. a CalendarEvent), against certain columns (start_date >= today, start_date <= today + 1 week, open_for_registration = true, calendar_name = 'Public'). In a database with a fixed schema (implied even if not declared), this is simple: you create a compound index against those columns.
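To make the fixed-schema case concrete, here is a sketch of that query and its supporting compound index in MongoDB/pymongo-style syntax, written as plain Python dicts so it runs without a server. The field names come from the example above; the two domain ids and the fixed "today" date are made-up illustration values.

```python
from datetime import date, timedelta

today = date(2024, 1, 1)  # fixed date so the example is deterministic

# The query from the question: several domains (the IN clause), exactly one
# type, and range/equality conditions on a handful of properties.
query = {
    "domain_id": {"$in": ["domain_a", "domain_b"]},   # hypothetical ids
    "type": "CalendarEvent",
    "start_date": {"$gte": today, "$lte": today + timedelta(weeks=1)},
    "open_for_registration": True,
    "calendar_name": "Public",
}

# A compound index that could serve this query on a fixed schema, as an
# ordered list of (field, direction) pairs in pymongo's create_index style.
# Equality fields are listed before the range field on start_date.
compound_index = [
    ("domain_id", 1),
    ("type", 1),
    ("calendar_name", 1),
    ("open_for_registration", 1),
    ("start_date", 1),
]
```

The ordering (equality predicates first, the range predicate last) is one common rule of thumb for compound index design; the point of the question is that with user-defined schemas you don't know these fields up front.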
The complexity is that we have essentially made a single instance of, let's say, MongoDB act like an RDBMS server with many databases, where each database and its related schema is our "domain".
After busting my brain on this problem for a week and looking at various databases (MongoDB, Neo4j, MySQL, PostgreSQL) I have only found a few possible solutions:
- Index all properties. A property could be represented in a Properties table or as an embedded document in MongoDB. In an RDBMS the property values would have to be serialized to strings. CONS: a) Can only search against one property at a time (no compound indexes), b) everything gets an index so we're incurring needless overhead.
- Index select properties. In PostgreSQL this could be done with a partial index (what SQL Server calls a filtered index). Basically, the property record would have a bit called "indexed" that I would have to maintain; this bit would drive whether the partial index covers that particular property. CONS: a) We can still only search against one property at a time, which rules out compound indexes. The only way I can imagine to mimic a compound index would be to search against each individual indexed property and return the intersection of the PKs.
- Create/maintain database constructs to reflect working indexes. In MongoDB, I could create an "indexables" collection. A document in this collection might look like this: {domain_id: ObjectId(..), type_id: ObjectId(..), fields: {field1: "some int value", field2: "some date value", field3: "some bit value"}}. I would then index the "indexables" collection on {domain_id: 1, type_id: 1, "fields.field1": 1, "fields.field2": 1, "fields.field3": 1}. Every time I create/update a document in my "things" collection I would have to plug its values into the field1, field2, field3 slots of "indexables". (This works nicely with MongoDB because I can plug values of any datatype into those placeholders; in MySQL, using the same pattern, I would have to serialize the values to strings.) I would also have to maintain the domain_id and type_id. Basically, it's an index layer (that I manage myself) built on top of the indexes handled by the database. CONS: There's additional overhead. Whereas the database would normally manage indexes on my behalf, I now have to take care of this myself, and since MongoDB has no concept of transactions I couldn't guarantee that the document and its various index entries were committed in a single step. PROS: I get my compound indexes back, and indexes are maintained at the domain level.
- I have considered allowing users to have their own instances of database X. Or in MongoDB their own collections. But I wondered if this wouldn't create more issues especially as we run up against practical limitations (number of databases or collections allowed). I tossed this idea out after not too much thought.
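The third option, the self-maintained "indexables" layer, can be sketched in a few lines. This is a toy model: plain Python dicts stand in for the two MongoDB collections, and the slot_map (which property of a type feeds which generic index slot) is a hypothetical piece of per-type metadata, not anything MongoDB provides.

```python
# In-memory stand-ins for the two MongoDB collections.
things = {}      # _id -> full document
indexables = {}  # _id -> {domain_id, type_id, fields: {field1, field2, field3}}

# Hypothetical per-type metadata: which property fills which index slot.
slot_map = {
    "CalendarEvent": {"field1": "calendar_name",
                      "field2": "start_date",
                      "field3": "open_for_registration"},
}

def save_thing(_id, domain_id, type_name, type_id, props):
    """Write the document, then maintain its 'indexables' entry by hand.

    These are two separate writes; without transactions they can't be made
    atomic, which is exactly the CON above: a crash between them leaves the
    index layer out of sync with the data.
    """
    things[_id] = {"domain_id": domain_id, "type_id": type_id, **props}
    slots = slot_map[type_name]
    indexables[_id] = {
        "domain_id": domain_id,
        "type_id": type_id,
        "fields": {slot: props.get(prop) for slot, prop in slots.items()},
    }

save_thing(1, "dom1", "CalendarEvent", 42,
           {"calendar_name": "Public", "start_date": "2024-01-03",
            "open_for_registration": True})
```

In real MongoDB the {domain_id, type_id, fields.field1, ...} compound index would then live on the indexables collection, and queries would hit it first to collect _ids before fetching from "things".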
Other ideas? Other kinds of databases that might better handle this problem?
Again, the idea is this: different users manage their own domains. Within a domain there can be items of any "type". Each typed item has properties, and I want to allow users to run queries against their domains to get items of a type whose properties match their conditions (hence the compound indexes).
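The fallback mentioned under the second option, searching each single-property index separately and intersecting the returned PKs, looks like this as a toy in-memory model (the nested dicts stand in for one B-tree per indexed property; the PK sets are made up):

```python
# Single-property "indexes": property name -> value -> set of matching PKs.
prop_index = {
    "calendar_name": {"Public": {1, 2, 3}, "Private": {4}},
    "open_for_registration": {True: {1, 3, 4}, False: {2}},
}

def query_by_intersection(conditions):
    """Look up each equality condition in its own index, then intersect
    the PK sets -- the manual stand-in for a compound index."""
    pk_sets = [prop_index[prop].get(value, set())
               for prop, value in conditions.items()]
    return set.intersection(*pk_sets) if pk_sets else set()

matches = query_by_intersection({"calendar_name": "Public",
                                 "open_for_registration": True})
# matches -> {1, 3}
```

Note the cost this sketch glosses over: each condition produces a full PK set before the intersection, so a selective compound index is replaced by several potentially large scans plus a merge, and range conditions make the per-index lookups even wider.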
One last thought: a domain in itself is not intended to be humongous. It might have 10-20 "types". Within each type there might be as many as 5,000 records in most cases, and say 20,000 in extreme cases.
Unfortunately, this is one of those cases where despite Joel Spolsky's advice I attempted astronaut architecture.
Have you considered Excel? Maybe just indexed flat files :)
Look, the basic problem you're going to have here is that there is no silver bullet. Your idea is fine, but at some point you'll have to accept a set of trade-offs.
You can't index everything. At some point you'll have to identify "commonly-used" queries and build some indexes for those things. Unless you're planning to keep everything in memory, you'll end up creating indexes at some point.
Here's a true limitation: how much hardware can you throw at 5k records? How about 200k records? Is it going to be enough to keep it all in RAM? Part of it in RAM? Just the indexes in RAM?
If you want to let users just stuff in their own "dynamic" schemas, I personally feel that MongoDB is a natural fit. Especially for these small data sets you're indicating.
But it's not a silver bullet by any means. Each of these solutions has its own set of problems. If there were an actual DB that could handle all of the requirements you put forth, let's face it, we'd all be using that DB :)