可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I need to store user agent strings in a database for tracking and comparing customer behavior and sales performance between different browsers. A pretty plain user agent string is around 100 characters long. It was decided to use a varchar(1024) for holding the useragent data in the database. (I know this is overkill, but that's the idea; it's supposed to accommodate useragent data for years to come and some devices, toolbars, applications are already pushing 500 characters in length.) The table holding these strings will be normalized (each distinct user agent string will only be stored once) and treated like a cache so we don't have to interpret user agents over and over again.

The typical use case is:

User comes to our site, is detected as being a new vistor
New session information is created for this user
Determine if we need to analyze the user agent string or if we have a valid analysis on file for it
If we have it, great, if not, analyze it (currently, we plan on calling a 3rd party API)
Store the pertinent information (browser name, version, os etc.) in a join table tied the existing user session information and pointing to the cache entry

Note: I have a tendency to say 'searching' for the user agent string in the database because it's not a simple look up. But to be clear, the queries are going to use '=' operators, not regexes or LIKE % syntax.

So the speed of looking up the user agent string is paramount. I've explored a few methods of making sure it will have good performance. Indexing the whole column is right out for size reasons. A partial index isn't such a good idea either because most user agents have the distinguishing information at the end; the partial index would have to be fairly long to make it worthwhile by which point its size is causing problems.

So it comes down to a hash function. My thought is to hash the user agent string in web server code and run the select looking for the hash value in the database. I feel like this would minimize the load on the database server (as opposed to having it compute the hash), especially since if the hash isn't found, the code would turn around and ask the database to compute the hash again on the insert.

Hashing to an integer value would offer the best performance at the risk of higher collisions. I'm expecting to see thousands or tens of thousands user agents at the most; even 100,000 user agents would fit reasonably well into a 2^32 size integer with very few collisions which could be deciphered by the webservice with minimal performance impact. Even if you think an integer hash isn't such a good idea, using a 32 character digest (SHA-1, MD5 e.g.) should be much faster for selects than the raw string, right?

My database is MySQL InnoDB engine. The web code will be coming from C# at first and php later (after we consolidate some hosting and authentication) (not that the web code should make a big difference).

Let me apologize at this point if you think this is lame choose-my-hash-algorithm question. I'm really hoping to get some input from people who've done something similar before and their decision process. So, the question:

Which hash would you use for this application?
Would you compute the hash in code or let the db handle it?
Is there a radically different approach for storing/searching long strings in a database?

回答1:

Your idea of hashing long strings to create a token upon which to lookup within a store (cache, or database) is a good one. I have seen this done for extremely large strings, and within high volume environments, and it works great.

"Which hash would you use for this application?"

I don't think the encryption (hashing) algorithm really matters, as you are not hashing to encrypt data, you are hashing to create a token upon which to use as a key to look up longer values. So the choice of hashing algorithm should be based off of speed.

"Would you compute the hash in code or let the db handle it?"

If it were my project, I would do the hashing at the app layer and then pass it through to look up within the store (cache, then database).

"Is there a radically different approach for storing/searching long strings in a database?"

As I mentioned, I think for your specific purpose, your proposed solution is a good one.

Table recommendations (demonstrative only):

user

id int(11) unsigned not null
name_first varchar(100) not null

user_agent_history

user_id int(11) unsigned not null
agent_hash varchar(255) not null

agent

agent_hash varchar(255) not null
browser varchar(100) not null
agent text not null

Few notes on schema:

From your OP it sounds like you need a M:M relationship between user and agent, due to the fact that a user may be using Firefox from work, but then may switch to IE9 at home. Hence the need for the pivot table.
The varchar(255) used for agent_hash is up for debate. MySQL suggests using a varbinary column type for storing hashes, of which there are several types.
I would also suggest either making agent_hash a primary key, or at the very least, adding a UNIQUE constraint to the column.

回答2:

Your hash idea is sound. I've actually used hashing to speed up some searches on millions of records. A hash index will be quicker since each entry is the same size. md5 will likely be fine in your case and will probably give you the shortest hash length. If you are worried about hash collisions, you can add include the length of the agent string.