I'm using an object database (ZODB) to store complex relationships between many objects, but I'm running into performance issues. As a result I've started building indexes to speed up object retrieval and insertion. Here is my story, and I hope you can help.
Initially, when I added an object to the database, I would insert it into a branch dedicated to that object type. To prevent multiple objects representing the same entity, I added a method that iterated over the existing objects in the branch looking for duplicates. This worked at first, but as the database grew, the time it took to load each object into memory and check its attributes grew with the size of the branch and quickly became unacceptable.
To solve that issue I started creating indexes based on the objects' attributes, so that when an object was added it would be saved in the type branch as well as within an attribute-value index branch. For example, say I was saving a person object with attributes firstName = 'John' and lastName = 'Smith': the object would be appended to the person type branch and would also be appended to lists within the attribute index branch under the keys 'John' and 'Smith'.
This saved a lot of time with duplicate checking, since only the objects in the intersection of the relevant attribute indexes needed to be checked against the new object.
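In case it helps to make the scheme concrete, here is a minimal sketch of what I mean; plain dicts and sets stand in for ZODB's BTree containers, and the names (`add_person`, `find_duplicates`) are just illustrative:

```python
# Illustrative sketch: plain dicts/sets stand in for ZODB OOBTrees.
from collections import defaultdict

index = defaultdict(set)   # attribute value -> set of object ids
people = {}                # the "type branch": object id -> object

def add_person(oid, first_name, last_name):
    people[oid] = {"firstName": first_name, "lastName": last_name}
    index[first_name].add(oid)
    index[last_name].add(oid)

def find_duplicates(first_name, last_name):
    # Only the intersection of the index entries needs checking,
    # not the whole type branch.
    return index.get(first_name, set()) & index.get(last_name, set())

add_person(1, "John", "Smith")
add_person(2, "Jane", "Smith")
print(find_duplicates("John", "Smith"))  # {1}
```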
However, I quickly ran into another issue when updating objects. The indexes need to be updated to reflect the fact that they may no longer be accurate. That requires either remembering the old values so the stale index entries can be accessed directly and the object removed, or iterating over all values of an attribute index to find and then remove the object. Either way, performance is quickly beginning to degrade again and I can't figure out a way to solve it.
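For what it's worth, the "remembering old values" option can be made cheap by keeping a reverse index (object id -> values currently indexed for it) next to the forward index, so an update never has to scan. A sketch of that idea, again with dicts standing in for BTrees and all names made up:

```python
# Forward index plus reverse index, so updates never iterate.
from collections import defaultdict

forward = defaultdict(set)   # attribute value -> set of object ids
reverse = {}                 # object id -> values currently indexed

def index_object(oid, values):
    # Remove stale entries using the remembered old values.
    for old in reverse.get(oid, set()):
        forward[old].discard(oid)
    for value in values:
        forward[value].add(oid)
    reverse[oid] = set(values)

index_object(1, {"John", "Smith"})
index_object(1, {"John", "Doe"})   # update: the 'Smith' entry is cleaned up
print(forward["Smith"])  # set()
print(forward["Doe"])    # {1}
```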
Have you had this kind of issue before? What did you do to solve it, or is this just something I have to live with when using OODBMSs?
Thanks in advance for the help.
Yes, repoze.catalog is nice, and well documented.
In short : don't make indexing part of your site structure!
Look at using a container/item hierarchy to store and traverse content item objects; plan to be able to traverse content either (a) by path (graph edges look like a filesystem) or (b) by identifying singleton containers at some distinct location.
Identify your content using either RFC 4122 UUIDs (uuid.UUID type) or 64-bit integers.
Use a central catalog to index (e.g. repoze.catalog); the catalog should be at a known location relative to the root application object of your ZODB. Your catalog will likely index attributes of objects and return record ids (usually integers) on query. Your job is to map those integer ids (perhaps indirecting via UUIDs) to some physical traversal path in the database where you are storing content. It helps if you use zope.location and zope.container for common traversal interfaces over your object graph from root/application downward.
Use zope.lifecycleevent handlers to index content and keep things fresh.
The problem -- generalized
ZODB is too flexible: it is just a persistent object graph with transactions, but this leaves room for you to sink or swim in your own data-structures and interfaces.
The solution -- generalized
Usually, just picking pre-existing idioms from the community around the ZODB will work: zope.lifecycleevent handlers, "containerish" traversal using zope.container and zope.location, and something like repoze.catalog.
More particular
Only when you have exhausted the generalized idioms and know why they won't work should you try building your own indexes using the various flavors of BTrees in ZODB. I actually do this more than I care to admit, but usually with good cause.
In all cases, keep your indexes (search, discovery) and site (traversal and storage) structure distinct.
The idioms for the problem domain
Master ZODB BTrees: you likely want:
- To store content objects as subclasses of Persistent in containers that are subclasses of OOBTree providing container interfaces (see below).
- To store BTrees for your catalog or global indexes, or use packages like repoze.catalog and zope.index to abstract that detail away (hint: catalog solutions typically store indexes as OIBTrees that yield integer record ids for search results; you then typically have some sort of document-mapper utility that translates those record ids into something resolvable in your application, like a UUID (provided you can traverse the graph to the object for that UUID) or a path (the way the Zope2 catalog does)).
IMHO, don't bother with intids and key-references and such (they are less idiomatic and more difficult, and you likely don't need them). Just use a Catalog and DocumentMap from repoze.catalog to get results in integer-to-UUID or integer-to-path form, and then figure out how to get your object. Note that you likely want some utility/singleton whose job is retrieving your object given an id or UUID returned from a search.
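To show the shape of that docid-to-address indirection without depending on repoze.catalog itself, here is a toy stand-in; the class and method names below are illustrative only and are not the real DocumentMap API:

```python
# Toy docid -> address mapping; illustrates the indirection a
# document map provides, NOT the real repoze.catalog API.
import uuid

class DocMap:
    def __init__(self):
        self._docid_to_addr = {}
        self._addr_to_docid = {}
        self._next = 0

    def add(self, address):
        docid = self._next
        self._next += 1
        self._docid_to_addr[docid] = address
        self._addr_to_docid[address] = docid
        return docid

    def address_for_docid(self, docid):
        return self._docid_to_addr.get(docid)

docmap = DocMap()
docid = docmap.add(uuid.uuid4().hex)      # content identified by UUID
# a catalog query returns integer docids; resolve them like this:
address = docmap.address_for_docid(docid)
```

Your resolver utility then turns `address` (a UUID or a path) into the actual content object by traversing from the application root.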
Use zope.lifecycleevent or a similar package that provides synchronous event callback (handler) registration. These handlers are what you should invoke whenever an atomic edit is made to your object (likely once per transaction, but not inside the transaction machinery).
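The pattern looks roughly like this; the toy dispatcher below is not the real zope API (which registers handlers via the component architecture), just an illustration of keeping indexes fresh through synchronous handlers:

```python
# Minimal sketch of the synchronous event-handler pattern that
# zope.lifecycleevent provides; names here are illustrative only.
_handlers = []

def subscribe(handler):
    _handlers.append(handler)

def notify(event):
    # Synchronous: every handler runs before the edit "completes".
    for handler in _handlers:
        handler(event)

class ObjectModified:
    def __init__(self, obj):
        self.obj = obj

reindexed = []  # stands in for the catalog's reindex call

def reindex_on_modify(event):
    reindexed.append(event.obj["id"])

subscribe(reindex_on_modify)

person = {"id": 42, "firstName": "John"}
person["firstName"] = "Johnny"   # atomic edit...
notify(ObjectModified(person))   # ...followed by a notify() call
print(reindexed)  # [42]
```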
Learn the Zope Component Architecture; it is not an absolute requirement, but it is surely helpful, even if just to understand the zope.interface interfaces of upstream packages like zope.container.
Understand how Zope2's ZCatalog does this: a catalog fronts multiple indexes of various sorts; each index searches for its part of a query using its own specialized data structures and returns a sequence of integer record ids. The catalog merges these across indexes with set intersections and returns them as a lazy mapping of "brain" objects containing metadata stubs (each brain has a getObject() method to fetch the actual content object). Getting actual objects from a catalog search relies on the Zope2 idiom of using paths from the root application object to identify the location of each cataloged item.
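A stripped-down version of that pattern, with dicts standing in for the catalog's BTrees and every name (`search`, `Brain`, the sample data) made up for illustration:

```python
# Toy ZCatalog-style search: per-index lookups return integer record
# ids, the catalog intersects them, and results come back as "brains".
people = {
    1: {"firstName": "John", "lastName": "Smith"},
    2: {"firstName": "Jane", "lastName": "Smith"},
}
indexes = {
    "firstName": {"John": {1}, "Jane": {2}},
    "lastName": {"Smith": {1, 2}},
}

class Brain:
    def __init__(self, rid):
        self.rid = rid  # metadata stub; real brains carry more columns
    def getObject(self):
        return people[self.rid]  # resolve the actual content object

def search(**query):
    result = None
    for name, value in query.items():
        rids = indexes[name].get(value, set())
        result = rids if result is None else result & rids
    return [Brain(rid) for rid in sorted(result or set())]

brains = search(firstName="John", lastName="Smith")
print([b.getObject()["firstName"] for b in brains])  # ['John']
```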
Think about using an attribute hash (something like Java's hashCode()), then use the hash value as the key. Python has a built-in hash() function, though I am not very familiar with its details.
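One caveat worth knowing here: Python's built-in hash() is salted per process for strings (hash randomization, on by default since Python 3.3), so hash('Smith') differs between runs and is not usable as a persistent BTree key. A truncated cryptographic digest is stable across processes; the helper name below is just illustrative:

```python
# hash('Smith') is salted per process in Python 3.3+, so it cannot key
# a persistent index. A truncated digest is stable across runs.
import hashlib

def stable_key(value):
    """Derive a stable 64-bit integer key from an attribute value."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

key = stable_key("Smith")
assert key == stable_key("Smith")  # deterministic, unlike hash()
```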