How to find all brotherhood strings?

Posted 2019-05-12 00:19

Question:

I have a string, and another text file which contains a list of strings.

We call 2 strings "brotherhood strings" when they're exactly the same after sorting alphabetically.

For example, "abc" and "cba" will be sorted into "abc" and "abc", so the original two are brotherhood. But "abc" and "aaa" are not.

So, is there an efficient way to pick out all brotherhood strings from the text file, according to the one string provided?

For example, we have "abc" and a text file which writes like this:

abc
cba
acb
lalala

then "abc", "cba", "acb" are the answers.

Of course, "sort & compare" is an obvious approach, but by "efficient" I mean: is there a way to decide whether a candidate string is a brotherhood of the original one after a single pass over the candidate?

That would be the most efficient way, I think. After all, you cannot tell the answer without reading each candidate string at least once, and sorting usually takes more than one pass over the candidate. So a hash table might be a good solution, but I have no idea what hash function to choose.

Answer 1:

Most efficient algorithm I can think of:

  • Set up a hash table for the original string. Let each letter be the key, and the number of times the letter appears in the string be the value. Call this hash table inputStringTable
  • Parse the input string, and each time you see a character, increment the value of the hash entry by one
  • For each string in the file:
    • Create a new hash table. Call this one brotherStringTable.
    • For each character in the string, increment its entry in brotherStringTable by one. If brotherStringTable[character] > inputStringTable[character], this string is not a brother (one character shows up too many times).
    • Once the string is parsed, compare each inputStringTable value with the corresponding brotherStringTable value. If any differ, the string is not a brother string. If all match, it is.

This will be O(nk), where n is the length of the input string (any strings longer than the input string can be discarded immediately) and k is the number of strings in the file. Any sort based algorithm will be O(nk lg n), so in certain cases, this algorithm is faster than a sort based algorithm.
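A minimal Python sketch of this table-based check (the helper name is_brother and the dict-based tables are illustrative, not part of the original answer):

def is_brother(original, candidate):
    # Strings longer or shorter than the original can never match.
    if len(candidate) != len(original):
        return False
    input_table = {}
    for ch in original:
        input_table[ch] = input_table.get(ch, 0) + 1
    brother_table = {}
    for ch in candidate:
        brother_table[ch] = brother_table.get(ch, 0) + 1
        # One character shows up too many times: not a brother.
        if brother_table[ch] > input_table.get(ch, 0):
            return False
    # Lengths match and no count was exceeded, so the tables agree.
    return brother_table == input_table

print(is_brother("abc", "cba"))     # True
print(is_brother("abc", "lalala"))  # False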



Answer 2:

Sorting each string and then comparing it works out to something like O(N*(S log S + k)), where N is the number of strings, k is the search key length, and S is the average string length.

It seems like counting the occurrences of each character might be a possible way to go here (assuming the strings are of a reasonable length). That gives you O(k+N*S). Whether that's actually faster than the sort & compare is obviously going to depend on the values of k, N, and S.

I think that in practice, the cache-thrashing effect of re-writing all the strings in the sorting case will kill performance, compared to any algorithm that doesn't modify the strings...



Answer 3:

Iterate, sort, compare. That shouldn't be too hard, right?
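A two-line sketch of that idea, assuming key holds the query string and candidates the strings read from the file:

sorted_key = sorted(key)
brothers = [s for s in candidates if sorted(s) == sorted_key]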



Answer 4:

Let's assume your alphabet is from 'a' to 'z' and you can index an array based on the characters. Then, in a 26-element array, you store for each letter the number of times it appears in the input string.

Then you go through the set of strings you're searching, iterating through the characters in each string and decrementing the count associated with each letter in (a copy of) the array of counts from the key string, stopping as soon as any count would drop below zero.

If you finish your loop through the candidate string without having to stop, and you have seen the same number of characters as there were in the input string, it's a match.

This allows you to skip the sorts in favor of a constant-time array copy and a single iteration through each string.

EDIT: Upon further reflection, this is effectively sorting the characters of the first string using a bucket sort.
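A sketch of this scheme, assuming a lowercase 'a'..'z' alphabet (the helper names are illustrative, not from the original answer):

def letter_counts(key):
    counts = [0] * 26
    for ch in key:
        counts[ord(ch) - ord('a')] += 1
    return counts

def is_brother(candidate, key, key_counts):
    if len(candidate) != len(key):
        return False
    remaining = list(key_counts)          # constant-time copy of the key's counts
    for ch in candidate:
        idx = ord(ch) - ord('a')
        remaining[idx] -= 1
        if remaining[idx] < 0:            # a letter appears more often than in the key
            return False
    return True

key = "abc"
counts = letter_counts(key)
print([s for s in ["abc", "cba", "acb", "lalala"] if is_brother(s, key, counts)])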



Answer 5:

I think what will help you is a test for whether two strings are anagrams. Here is how you can do it. I am assuming the strings can contain any of 256 ASCII characters for now.

#include <string.h>
#include <stdbool.h>

#define NUM_ALPHABETS 256
int alphabets[NUM_ALPHABETS];

bool isAnagram(const char *src, const char *dest) {
    size_t len1 = strlen(src);
    size_t len2 = strlen(dest);
    if (len1 != len2)
        return false;

    memset(alphabets, 0, sizeof(alphabets));
    for (size_t i = 0; i < len1; i++)
        alphabets[(unsigned char)src[i]]++;      /* count the characters of src */
    for (size_t i = 0; i < len2; i++) {
        alphabets[(unsigned char)dest[i]]--;     /* consume them from dest */
        if (alphabets[(unsigned char)dest[i]] < 0)
            return false;                        /* dest uses a character more often than src */
    }

    return true;
}

This will run in O(mn) if you have 'm' strings in the file of average length 'n'



Answer 6:

  • Sort your query string
  • Iterate through the Collection, doing the following:
    • Sort current string
    • Compare against query string
    • If it matches, this is a "brotherhood" match, save it/index/whatever you want

That's pretty much it. If you're doing lots of searching, pre-sorting all of your collection will make the routine a lot faster (at the cost of extra memory). If you do this even more, you could pre-sort and save a dictionary (or some hashed collection) keyed on the first character, etc., to find matches much faster.
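If you do go the pre-sorting route, here is a rough sketch of the dictionary idea, keyed on the whole sorted string rather than just the first character (an assumption; the answer only outlines the approach):

from collections import defaultdict

def build_index(strings):
    index = defaultdict(list)
    for s in strings:
        index["".join(sorted(s))].append(s)   # group strings by their sorted form
    return index

index = build_index(["abc", "cba", "acb", "lalala"])
print(index["".join(sorted("abc"))])          # ['abc', 'cba', 'acb']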



Answer 7:

It's fairly obvious that each brotherhood string will have the same histogram of letters as the original. It is trivial to construct such a histogram, and fairly efficient to test whether the input string has the same histogram as the test string (you have to increment or decrement counters at most twice the length of the input string).

The steps would be:

  • Construct the histogram of the test string (zero an array int histogram[128] and increment the position for each character of the test string).
  • For each input string (strings whose length differs from the test string can be rejected up front):
    • For each character c in the input string, test whether histogram[c] is zero. If it is, the string is a non-match: restore the histogram and move on. Otherwise, decrement histogram[c].
    • To restore the histogram, traverse the input string back to its start, incrementing rather than decrementing.

At most, it requires two increments/decrements of an array for each character in the input.
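A sketch of that decrement-and-restore loop over a 128-slot histogram (the up-front length check is an added assumption, matching the note in the list above):

def find_brothers(test, candidates):
    histogram = [0] * 128
    for ch in test:
        histogram[ord(ch)] += 1

    brothers = []
    for s in candidates:
        if len(s) != len(test):              # a different length can never match
            continue
        matched = True
        consumed = 0
        for ch in s:
            if histogram[ord(ch)] == 0:      # this character is over-used: non-match
                matched = False
                break
            histogram[ord(ch)] -= 1
            consumed += 1
        if matched:
            brothers.append(s)
        for ch in s[:consumed]:              # restore the histogram for the next string
            histogram[ord(ch)] += 1
    return brothers

print(find_brothers("abc", ["abc", "cba", "acb", "lalala"]))   # ['abc', 'cba', 'acb']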



Answer 8:

The most efficient answer will depend on the contents of the file. Any algorithm we come up with will have complexity proportional to N (number of words in file) and L (average length of the strings) and possibly V (variety in the length of strings)

If this were a real-world situation, I would start with KISS and not try to overcomplicate it. Checking each candidate's length against the target string is simple but could help avoid lots of n log n sort operations.

def sort_characters(s):
    return "".join(sorted(s))

target = sort_characters("target string")
count = 0
for word in inputfile:      # inputfile: the candidate strings read from the text file
    if len(target) == len(word) and target == sort_characters(word):
        count += 1


Answer 9:

I would recommend:
for each string in the text file:

  1. compare its size with the "source string" (brotherhood strings must be the same size)
  2. compare hashes, computed in an order-independent way, e.g. over the sorted characters or as a sum of character codes (a plain CRC or the default framework hash of the raw string changes when the characters are reordered)
  3. in case of equality, do a finer comparison on the sorted strings.

It's not the fastest algorithm but it will work for any alphabet/encoding.
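A sketch of those three steps, using a sum of character codes as the order-independent hash (an assumption, since the answer does not pin the hash down):

def char_sum(s):
    return sum(ord(ch) for ch in s)

def is_brother(source, candidate):
    if len(candidate) != len(source):                # 1. sizes must match
        return False
    if char_sum(candidate) != char_sum(source):      # 2. cheap order-independent hash filter
        return False
    return sorted(candidate) == sorted(source)       # 3. finer compare on the sorted strings

print(is_brother("abc", "cba"))   # True
print(is_brother("abc", "abd"))   # False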



Answer 10:

Here's another method, which works if you have a relatively small set of possible "letters" in the strings, or good support for large integers. It basically consists of writing a position-independent hash function.

Assign a different prime number for each letter:

int prime[256];   /* one distinct prime per possible character */

prime['a'] = 2;
prime['b'] = 3;
prime['c'] = 5;
/* ... and so on for the rest of the alphabet */

Write a function that runs through a string, repeatedly multiplying the prime associated with each letter into a running product

long long key(const char *string)
{
  long long product = 1;
  while (*string) {
    product *= prime[(unsigned char)*string];   /* multiply in this letter's prime */
    string++;                                   /* advance only after using the character */
  }
  return product;
}

This function will return an integer that is unique to each multiset of letters (as long as it doesn't overflow), independent of the order in which they appear in the string. Once you've got the value for the "key", you can go through the list of strings to match, and perform the same operation.

Time complexity of this is O(N), of course. You can even re-generate the (sorted) search string by factoring the key. The disadvantage, of course, is that the keys do get large pretty quickly if you have a large alphabet.
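The same idea in Python, where arbitrary-precision integers sidestep the key-size problem (the 26-prime table for a lowercase alphabet is an assumption):

import string

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
prime = dict(zip(string.ascii_lowercase, PRIMES))   # one distinct prime per letter

def key(s):
    product = 1
    for ch in s:
        product *= prime[ch]                        # order of multiplication is irrelevant
    return product

print(key("abc") == key("cba"))   # True: same multiset of letters
print(key("abc") == key("aaa"))   # False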



Answer 11:

Here's an implementation. It creates a dict of the letters of the master, and a string version of the same, as string comparisons will be done at C++ speed. When creating a dict of the letters in a trial string, it checks against the master dict in order to fail at the first possible moment: if it finds a letter not in the original, or more of that letter than in the original, it fails. You could replace the strings with integer-based hashes (as per one answer regarding base 26) if that proves quicker. Currently the hash used for comparison looks like a3b1c2 for abacca, with the letters emitted in sorted order.

This should work out O(N log( min(M,K) )) for N strings of length M and a reference string of length K, and requires the minimum number of lookups of the trial string.

master = "abc"
wordset = "def cba accb aepojpaohge abd bac ajghe aegage abc".split()

def dictmaster(str):
    charmap = {}
    for char in str:
        if char not in charmap:
            charmap[char]=1
        else:
            charmap[char] += 1
    return charmap

def dicttrial(str,mastermap):
    trialmap = {}
    for char in str:
        if char in mastermap:
            # check if this means there are more incidences
            # than in the master
            if char not in trialmap:
                trialmap[char]=1
            else:
                trialmap[char] += 1
        else:
            return False
    return trialmap

def dicttostring(hash):
    if hash==False:
        return False
    str = ""
    for char in hash:
        str += char + `hash[char]`
    return str

def testtrial(str,master,mastermap,masterhashstring):
    if len(master) != len(str):
        return False
    trialhashstring=dicttostring(dicttrial(str,mastermap))
    if (trialhashstring==False) or (trialhashstring != masterhashstring):
        return False
    else:
        return True

mastermap = dictmaster(master)
masterhashstring = dicttostring(mastermap)
for word in wordset:
    if testtrial(word,master,mastermap,masterhashstring):
        print word+"\n"