I'm having a little problem coming up with an architecture for tag driven software I'm designing.
What I want to do is store plain text in database, which is liked to an owner and other entities. The plain text is filled with tags just like Twitters hashtags and should be searchable/indexable. That can be done application side and as a result I'm gonna have tons of small chunks of data that need to be processed for business intelligens.
No one is gonna ready the plain text it's only about the analysis which doesn't need to be consistent and can be run asynchron.
I know that Twitter uses several databases: Gizzard and Cassandra for tweets and FlockDb for relations.
I don't feel like using a hybrid to accomplish relations and I don't want to build the next social network either. What I need to do though is analytics over all tags in relation to other entities.
How can I solve the hash tag problem, or how can I process the text to make it work?
I'm really searching for a nice solution not just any solution. I really know how to create a schema for SQL.
Thanks for helping me through that database jungle.
Here is a fairly standard way for representing tags in the database:
The application parses the RAW_TEXT, for each identified tag checks if it's already in TAG and inserts it there if it isn't. Then it just connects the post with the tags extracted from it, by inserting into TAG_POST.
Note the order of fields in the composite primary key of TAG_POST - having TAG_ID at the leading edge facilitates efficient searching for posts of a given tag. If you want to get tags of a given post, flip the order. If you want both, create a composite index in the opposite order from the PK.
If your DBMS supports clustering, TAG_POST would be a good candidate for it.
If your DBMS supports leading-edge index compression (Oracle), consider simply doing this:
A compressed index will store many repetitions of the same TAG_NAME efficiently, so there is no need to "outsource" it to another table just to save space.