I have a requirement to (very) quickly process strings of a limited range, tallying their values. The input file is of the form:
January 7
March 22
September 87
March 36
and so forth. Because the line widths are identical, I can simply read in a line with fread
reasonably fast, and I've developed a perfect hashing function which works, but I wanted to see if anyone could offer any advice on how to make it even faster. I'll profile each suggestion to see how it goes.
The hashing function is based on the month name to allow fast allocation of the value to a bucket. Bear with me here. I first figured out the minimal number of characters for a perfect hash:
January
February
March
April
May
June
July
August
September
October
November
December
Keep in mind that the months are all nine characters due to the fact I have the entire input line.
Unfortunately, there is no single column to mark a month unique. Column 1 duplicates J
, column 2 duplicates a
, column 3 duplicates r
, column 4 duplicates u
and columns 5 onwards duplicate <space>
(there are other duplicates but one is enough to prevent a single-column hash key).
However, by using the first and fourth column, I get the values Ju
, Fr
, Mc
, Ai
, M<space>
, Je
, Jy
, Au
, St
, Oo
, Ne
and De
, which are unique. There will be no invalid values in this file so I don't have to worry about incorrect buckets for the input data.
By viewing the hex codes for the characters, I found I could get low unique values by just ANDing with strategic values:
FirstChar Hex Binary &0x0f
--------- --- --------- -----
A x41 0100 0001 1
D x44 0100 0100 4
F x46 0100 0110 6
J x4a 0100 1010 10
M x4d 0100 1101 13
N x4e 0100 1110 14
O x4f 0100 1111 15
S x53 0101 0011 3
SecondChar Hex Binary &0x1f
---------- --- --------- -----
<space> x20 0010 0000 0
c x63 0110 0011 3
e x65 0110 0101 5
i x69 0110 1001 9
o x6f 0110 1111 15
r x72 0111 0010 18
t x74 0111 0100 20
u x75 0111 0101 21
y x79 0111 1001 25
and this allowed me to set up a static array to create a (hopefully) blindingly-fast hash function:
#define __ -1
static unsigned int hash (const char *str) {
static unsigned char bucket[] = {
// A S D F J M N O
__, __, __, __, __, __, __, __, __, __, __, __, __, 4, __, __, // space
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
__, __, __, __, __, __, __, __, __, __, __, __, __, 2, __, __, // c
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
__, __, __, __, 11, __, __, __, __, __, 5, __, __, __, 10, __, // e
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
__, 3, __, __, __, __, __, __, __, __, __, __, __, __, __, __, // i
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, 9, // o
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
__, __, __, __, __, __, 1, __, __, __, __, __, __, __, __, __, // r
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
__, __, __, 8, __, __, __, __, __, __, __, __, __, __, __, __, // t
__, 7, __, __, __, __, __, __, __, __, 0, __, __, __, __, __, // u
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
__, __, __, __, __, __, __, __, __, __, 6, __, __, __, __, __ // y
};
return bucket[((unsigned int)(str[3]&0x1f)<<4)|(str[0]&0xf)];
}
Testing that with the code:
#include <stdio.h>
#include <string.h>
// Hash function here.
static char *months[] = {
"January ", "February ", "March ", "April ", "May ", "June ",
"July ", "August ", "September", "October ", "November ", "December "
};
int main (void) {
int i;
for (i = 0; i < sizeof(months)/sizeof(*months); i++)
printf ("%-10s -> %2d\n", months[i], hash(months[i]));
return 0;
}
shows that it's functionally correct:
January -> 0
February -> 1
March -> 2
April -> 3
May -> 4
June -> 5
July -> 6
August -> 7
September -> 8
October -> 9
November -> 10
December -> 11
but I want to know if it can be made faster.
Any suggestions out there? I'm open to any simple optimisations or even a total rewrite if there's something inherently bad with my hashing function.
I don't think this is that important but the final version will be using EBCDIC. The theory will still stand but the AND operation may change slightly since the characters have different code points. I'll be happy with any assistance only on the ASCII front since I'm confident whatever advice is offered will translate okay to EBCDIC.
Looks nice enough for me. The question is if the hash function itself is enough of a bottleneck to justify the ongoing efforts of eliminating one or two more simple binary operations from it. Given that file access seems to be involved, I doubt it, without knowing any details about the overall processing, of course.
EDIT:
Maybe you could see if you find any pair of characters that results in unique lower bits (4, 5 or 6) when added:
If addition won't do, maybe one of the other operations
& | ^
. If this won't help, maybe using three characters.Here's the smallest sequence I could find for EBCDIC-US:
It has 24 elements in the bucket and uses only 2 operations to compute the index:
Second best, 25 items with xor:
(Actually, tab[] should be 32 entries long here, because 0x1f can generate an overflow for incorrect inputs).
Update from Pax: For what it's worth, the first option worked perfectly for EBCDIC code page 500:
I would start with a detailed profile of your larger process to make sure you aren't engaging in premature optimization.
This looks pretty fast on the face of it, but if memory is really cheap it might be better to just use an even sparser array and let your cache do some of the work. For instance (and thinking off the cuff here), what if you simply add the
short
found in the first two bytes to theshort
at the next two. That includes both the first and fourth characters, so at a guess it should produce your 12 distinct values, and it doesn't involve bit field extractions which may not optimize well. Then, make the matchingbucket[]
array have 64K entries, only 12 of which are ever hit. If it works out right, those 12 entries end up occupying some of your D cache and you've traded a handful of arithmetic operations for an index into a cached larger array.But do profile both before and after any mucking about trying to make arithmetic faster, and don't bother optimizing where it won't actually save time. (I know Pax knows this, but its the obligatory warning attached to any optimization discussion.)
Ok, as everyone on SO, I'm all in it for the rep.. ;*) As I wrote in comments above, the lower end of your target architectures has a cache line size of 256 bytes, so you might end up with some cache trashing in your table lookups (your table is more than 256 bytes). An attempt to fold the table using some cheap bit-trick might actually gain some performance.
I've been playing around with your data. You also have the option of column 2 and 3. Haven't figured out a way to get that under 8 bits yet though.
... and as always, profile, make sure it's the best point to apply effort, and profile again afterward, make sure it's faster.
... and you are reading more than one line at a time, right? Fixed record sizes are good that way, that you don't have to search for separators (newlines), and you can read a big chunk of them at a time.
You can reduce the array size by using:
which changes it from 16-by-26 to 16-by-13.
EDIT
If, as suggested by other posts, your strings ARE aligned, so that you may use them as shorts, you may add the first and second short, xor the two bytes together and you'll have a unique 8-bit key (well, seven, actually). Might be worth your while too. This is ASCII though, so might not work in EBCDIC. In ASCII, the keys turn out to be:
This is tested for EBDIC (CCSID 500), the table 32 byte (smaller than yours, same size as x4u's):
Do you really need the mapping between the hash and the month index to do the tallying? You could eliminate a lookup if instead of returning the month you returned the hash and use that for tallying. In x4u's answer the last line of the hash function could look like
and you'd still be able to do the sums, sorting the results only in the end, not inside the loop.