Optimizing a generated string for storing into a database

Published 2019-09-07 05:51

Question:

I have a 64-bit integer timestamp and a String username to be combined into one string and eventually stored in a database column. Leaving aside why I can't store them in separate columns with appropriate types, my question is how to combine them to get better performance from the underlying database. That would be SQLite, PostgreSQL or MySQL; I'm not sure yet.

I imagine they would use B-trees as indexes, and it would be bad to concatenate as (timestamp-username) because the timestamp generally always progresses and the tree would need frequent rebalancing. username-timestamp should be much better, but each user's records would still grow with every new entry. I was also thinking of storing the timestamp with its bit order reversed.

Is there anything else I can do? Some clever XOR or whatever? What would be a reasonably good scheme? The data will only ever be accessed by requesting the exact generated string; no ranges and such.

The only requirement is relatively fast conversion between the generated string and the source data, in both directions.

UPDATE: Please, guys, I'm looking for information on what kind of string would be better to store as a primary key in a database (one of SQLite, MySQL and PostgreSQL). Perhaps the answer is that it doesn't matter, or that it depends on the DB engine. I don't have a particular problem with the schema I'm using or with the caching solution. I'm just asking whether there is any room to improve, and how. I'd appreciate some on-topic answers.

UPDATE 2: Great answers, but still not definitive for me: does an incremented column make the B-tree index on that column unbalanced? https://stackoverflow.com/a/2362693/520567

Answer 1:

There is a contradiction in your question: you specify you can't split them and store them in separate columns, but then you talk about indexing both parts separately, which you can't do without splitting them.

I can see you really have two choices:

  1. Store them in separate columns
  2. Hash the output to lower the index memory footprint

Ideally you should store them in two columns and create a composite index if you will always search for both together in the same order. It's hard to give accurate advice without more information; however, generally username, timestamp would make logical sense if you query per user, or the reverse if you want to query by timestamp. You could also create an index on each column if you need to search on one or the other.
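The two-column layout with a composite index can be sketched as follows, here using SQLite via Python's standard `sqlite3` module; the table and column names are illustrative:

```python
# Two-column layout with a composite index, as the answer recommends.
# All names (events, username, ts, payload) are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        username  TEXT    NOT NULL,
        ts        INTEGER NOT NULL,
        payload   TEXT
    )
""")
# Composite index: the leading column should match the most common
# lookup pattern (here, queries per user).
conn.execute("CREATE INDEX idx_user_ts ON events (username, ts)")

conn.execute("INSERT INTO events VALUES (?, ?, ?)",
             ("alice", 1567835460, "login"))
row = conn.execute(
    "SELECT payload FROM events WHERE username = ? AND ts = ?",
    ("alice", 1567835460),
).fetchone()
print(row[0])  # -> login
```

With this layout the index covers both exact lookups on (username, ts) and prefix lookups on username alone, without any string concatenation tricks.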

Hashing your generated string

INSERT INTO table (crc_hash_column, value_column_name)
VALUES (CRC32(@generated_value), @generated_value)

would reduce the key to a 32-bit integer (only 4 bytes of index per row), much smaller than the index space required for the equivalent VARCHAR or CHAR.

If you take this approach then you should take measures to handle collisions; due to the birthday paradox they will happen, and become more likely as your dataset grows. Even with collisions, the extra filtering will still yield greater performance than the alternatives, given the much smaller index.

SELECT * FROM table
WHERE crc_hash_column = CRC32(@search_value)
AND value_column_name = @search_value

Using the hash costs a few more CPU cycles, but a CRC32 hash is very quick, so even though you have to rehash on every search, this extra work is tiny compared with the benefit of indexing far less data.
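The hash-then-verify lookup described above can be sketched end to end with `zlib.crc32` (Python's CRC32) and SQLite; names are illustrative:

```python
# Hash-prefix lookup: index a 4-byte CRC32 of the key instead of the
# full string, then re-check the full value to rule out collisions.
# Table and column names are illustrative.
import sqlite3
import zlib

def crc32(s: str) -> int:
    return zlib.crc32(s.encode("utf-8"))

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE keys (
        crc_hash INTEGER NOT NULL,
        value    TEXT    NOT NULL
    )
""")
# Only the small integer hash is indexed.
conn.execute("CREATE INDEX idx_crc ON keys (crc_hash)")

value = "alice:1567835460"
conn.execute("INSERT INTO keys VALUES (?, ?)", (crc32(value), value))

# Filter by hash first (uses the index), then compare the full value
# so a CRC32 collision can never return the wrong row.
row = conn.execute(
    "SELECT value FROM keys WHERE crc_hash = ? AND value = ?",
    (crc32(value), value),
).fetchone()
assert row == (value,)
```

The second predicate on the full value is what makes collisions harmless: at worst the database scans the handful of rows sharing a hash.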

Generally I would prefer the first option, but it's almost impossible to recommend one without knowing your use case.

You should profile both options and see if they fit your requirements.



Answer 2:

That you say you can't keep them in separate columns (you can't even set up a new table with a 1:1 relationship, mirror the data to a materialized view with triggers, or replace the existing table with a view on a corrected table structure?!) means that any solution will be an ugly hack.
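The "view on a corrected table structure" escape hatch mentioned above can be sketched like this: store the parts in proper columns and expose the legacy combined string through a view. SQLite via `sqlite3` here; all names and the `:` separator are illustrative assumptions:

```python
# Corrected schema plus a compatibility view that reproduces the old
# single-string column for legacy readers. Names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events_fixed (
        username TEXT    NOT NULL,
        ts       INTEGER NOT NULL,
        PRIMARY KEY (username, ts)
    )
""")
# The view concatenates the properly typed columns back into the
# legacy combined key.
conn.execute("""
    CREATE VIEW events AS
    SELECT username || ':' || ts AS combined_key, username, ts
    FROM events_fixed
""")

conn.execute("INSERT INTO events_fixed VALUES ('alice', 1567835460)")
row = conn.execute(
    "SELECT combined_key FROM events WHERE username = 'alice'"
).fetchone()
print(row[0])  # -> alice:1567835460
```

New code queries the typed columns (and their composite primary-key index) directly, while anything still expecting the combined string reads it from the view.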

Yes, how much the data changes and how it's structured will affect the efficiency of updates. However, the purpose of an index is to speed up retrieval, and you've given us no information about how the data is accessed or how it's likely to change.

I was thinking to also put timestamp with reverse order of bits

Why? This is more likely to increase index fragmentation than decrease it.

MariaDB supports virtual columns, and indexes on virtual columns, so you can do silly things like throw the rules of normalization out of the window. But if you can't fix a trivial problem in the schema, then replacing the DBMS probably won't be a very practical solution either.

Frankly, if developing a bad solution already costs as much as a proper one, and will likely incur further costs down the line, then choosing the bad solution is a waste of both time and money.