c string compare vs hash compare

2019-04-11 05:15发布

问题:

I need to compare a string to multiple other constant strings in c. I am curious which is faster, to hash the string I am going to compare and compare it to all the other constant string hashes or just compare the strings as strings. thank you in advance

thank you for the answers I am going to be doing many comparisons. can anyone give me a good, fast, low resource intensive algorithm to use? The only hash I know of is MD5 and I have a feeling that is over kill.

I also want to add that the strings are maybe 20 or 30 characters long at the max with most being around 7.

回答1:

Is the comparison going to be done once or many times? If the comparison is going to be done only once then you are likely better off doing a straight comparison. If you are going to need to compare very many strings to this set of constant strings, then you can probably save time in the long run by doing it with hashes.

This is a simple enough problem that you can easily write it both ways and see which works better for a representative set of input.



回答2:

If you are trying to match a subject string against a set of other strings, you might consider using the Aho-Corasick String Matching Algorithm. It uses a trie to match the subject against all of the target strings in a single pass (it's also quite simple to implement).



回答3:

It's difficult to get ahead, string hashing functions are O(n). String comparison is O(n) as well, with a smaller Oh. You would only be ahead if you can store the hash values you compute and use them repeatedly. For both.

Simple sample C hash functions are here.



回答4:

Equality of a hash value does not guarantee equality - a mismatch will guarantee inequality, though. If you're going to need to compare a lot of strings against your collection the a hash would be great - if it's a one-off comparison (unlikely I guess) then strcmp will do nicely.



回答5:

I think if you have a static list of strings, I would store them in a sorted array and then use bsearch to determine if a string is in that list. This returns NULL if it does not exist, or a pointer to the value should it exist and is probably faster than a linear search or hashing.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* cmp function for qsort and bsearch */
static int pstrcmp(const void *a, const void *b)
{
  return strcmp(*(char * const *)a, *(char * const *)b);
}

/* check an input against the list of known strings */
static char *check_for_match(char *input)
{
  static char *static_list[] = { "one", "two", "three", "four", "five" };
  static int nelems;

  /* this sorts the list, for demonstration purposes, but if the list
     is static then it could be sorted prior to compiling */
  if (! nelems)
  {
    nelems = sizeof(static_list) / sizeof(*static_list);
    qsort(static_list, nelems, sizeof(*static_list), pstrcmp);
  }


  return bsearch(&input, static_list, nelems, sizeof(*static_list), pstrcmp);
}

int main(int argc, char *argv[])
{
  if (check_for_match("should_not_match"))
  {
    printf("Match found.\n");
  } else {
    printf("No match found.\n");
  }

  if (check_for_match("two"))
  {
    printf("Match found.\n");
  } else {
    printf("No match found.\n");
  }
  return EXIT_SUCCESS;
}


回答6:

It depends. What's the hashing algorithm? How long are the strings? What's the platform?

Also note that a matching hash doesn't guarantee matching strings.



回答7:

If your constant strings are known at compile time, take a look at the idea of a "perfect hash".

Wikipedia: A perfect hash function for a set S is a hash function that maps distinct elements in S to distinct integers, with no collisions.

That "no collisions" thing saves you work. Possibilities for further reading and implementations are:

  • cmph -- http://cmph.sourceforge.net/
  • gperf -- http://www.gnu.org/software/gperf/


回答8:

It greatly depends on the length of the strings and the complexity of your hash function. Implement and benchmark yourself would be the best answer...



回答9:

Another approach that could work, is to have your constant string sorted and making a dichotomic search of your string, this way you only have at most log2(n) comparisons (that's for example only 10 comparisons for 1024 strings or even only 20 for 1000000 strings). I don't know if it is applicable to your problem but I had really good results with that approach. Hashing is really difficult to get right, the corner cases can get really nasty and the computation of the key can often be quite costly.



回答10:

thank you for the answers I am going to be doing many comparisons. can anyone give me a good, fast, low resource intensive algorithm to use? The only hash I know of is MD5 and I have a feeling that is over kill.

Murmur hash is simple, fast and behaves well on statistical tests.