Permutation of string as substring of another

Posted 2019-02-02 17:01

走好不送

Question:

Given a string A and another string B, find whether any permutation of B exists as a substring of A.

For example, if A = "encyclopedia" and B = "dep", then return true, as ped is a permutation of dep and ped is a substring of A.

My solution:

Let length(A) = n and length(B) = m.

I did this in O((n-m+1)*m) by sorting B and then checking A with a window of size m each time.

I need to find a better and faster solution.
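
For reference, the baseline described above might look roughly like the following Java sketch. The class and method names are made up for illustration, and it sorts every window, which is an assumption about how the check was done; under that assumption each check costs O(m log m), so the total would be O((n-m+1) * m log m).

```java
import java.util.Arrays;

// Hypothetical baseline: sort B once, then sort and compare every window of A.
final class SortedWindowBaseline {
    // Returns true if some length-m window of a, once sorted, equals sorted b.
    static boolean containsPermutation(String a, String b) {
        int n = a.length(), m = b.length();
        if (m > n) return false;
        char[] target = b.toCharArray();
        Arrays.sort(target);
        for (int start = 0; start + m <= n; start++) {
            char[] window = a.substring(start, start + m).toCharArray();
            Arrays.sort(window);               // O(m log m) per window
            if (Arrays.equals(window, target)) return true;
        }
        return false;
    }
}
```

Calling SortedWindowBaseline.containsPermutation("encyclopedia", "dep") finds the window "ped".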

Answer 1:

Building a little on the algorithm presented by j_random_hacker in comments, it is possible to find the match in O(|A|+|B|), as follows: (Note: throughout, we use |A| to mean "the length of A".)

Create an integer array count whose domain is the size of the alphabet, initialized to all 0s.
Set distance to 0
For each character B_i in B:
- Decrement count[B_i].
- If the previous count of count[B_i] was 0, also increment distance.
For each character A_i in A:
- Increment count[A_i]
- If i is greater than |B|, decrement count[A_(i-|B|)], the character that just left the window.
- For each of the two count values modified, if the previous value was 0, then increment distance and if the new value is 0 then decrement distance.
- If the result is that distance is 0 then a match has been found.
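
The steps above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name, the helper adjust, and the 256-entry alphabet are not part of the original answer, and indices are 0-based here.

```java
// Sketch of the distance-tracking window; assumes all characters are in 0..255.
final class DistanceWindow {
    static boolean containsPermutation(String a, String b) {
        int m = b.length();
        if (m == 0) return true;
        if (m > a.length()) return false;
        int[] count = new int[256];
        int distance = 0; // number of alphabet symbols whose count is nonzero
        for (int i = 0; i < m; i++) {
            distance += adjust(count, b.charAt(i), -1);
        }
        for (int i = 0; i < a.length(); i++) {
            distance += adjust(count, a.charAt(i), +1);               // char entering
            if (i >= m) distance += adjust(count, a.charAt(i - m), -1); // char leaving
            if (distance == 0) return true;
        }
        return false;
    }

    // Applies delta (+1 or -1) to count[c] and returns the change in distance.
    private static int adjust(int[] count, char c, int delta) {
        int change = 0;
        if (count[c] == 0) change++;   // a zero entry becomes nonzero
        count[c] += delta;
        if (count[c] == 0) change--;   // a nonzero entry becomes zero
        return change;
    }
}
```

Each character of A costs a constant number of array updates, so no per-window comparison of the whole table is needed.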

Note: The algorithm presented by j_random_hacker is also O(|A|+|B|) because the cost of comparing freqA with freqB is O(|alphabet|), which is a constant. However, the above algorithm reduces the comparison cost to a small constant. In addition, it is theoretically possible to make this work even if the alphabet is not a constant size by using the standard trick for uninitialized arrays.

Answer 2:

If I only have to worry about ASCII characters, it can be done in O(n) time with O(1) space. My code also prints the permutations out, but can be easily modified to simply return true at the first instance instead. The main part of the code is located in the printAllPermutations() method. Here is my solution:

Some Background

This is a solution that I came up with; it is somewhat similar to the idea behind the Rabin-Karp algorithm. Before I explain the algorithm, I will go over the math behind it:

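Let S = {A_1, ..., A_N} be a multiset of N prime numbers, and let the sum of the numbers in S equal some integer Q. The idea is to use Q as a fingerprint of S. Note that the sum alone does not determine S uniquely (for example, 3 + 7 = 5 + 5), so equal sums only identify candidates; it is the product of the primes that is unique, by unique factorization.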

Because of this, we know we can map every character to a prime number. I propose a map as follows:

1 -> 1st prime
2 -> 2nd prime
3 -> 3rd prime
...
n -> nth prime

If we do this (which we can, because extended ASCII has only 256 possible characters), then it becomes very easy for us to find each permutation in the larger string B.
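
One caveat is worth checking before relying on this mapping: two different strings can end up with the same prime sum. The Java sketch below uses a hypothetical toy mapping ('a' -> 2, 'b' -> 3, ..., 'f' -> 13) rather than the full table, and the class and method names are made up for illustration. Under that mapping "bf" and "ce" both sum to 16, while their prime products differ.

```java
// Toy check of sum versus product fingerprints; the mapping is illustrative only.
final class PrimeSumCheck {
    // Hypothetical mapping: 'a' -> 2, 'b' -> 3, 'c' -> 5, 'd' -> 7, 'e' -> 11, 'f' -> 13.
    private static final int[] PRIMES = {2, 3, 5, 7, 11, 13};

    // Sum of the primes assigned to the characters of s (s must use 'a'..'f').
    static int primeSum(String s) {
        int sum = 0;
        for (char c : s.toCharArray()) sum += PRIMES[c - 'a'];
        return sum;
    }

    // Product of the same primes; unique per multiset, but overflows quickly.
    static long primeProduct(String s) {
        long product = 1;
        for (char c : s.toCharArray()) product *= PRIMES[c - 'a'];
        return product;
    }
}
```

So a sum match is best treated as a cheap filter, followed by a character-count comparison when it fires.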

The Algorithm:

We will do the following:

1: Calculate the sum of the primes mapped to by each of the characters in A (the smaller string); let's call it smallHash.

2: Create 2 indices (lefti and righti). lefti is initialized to zero, and righti is initialized to the size of A, so the window covers positions lefti through righti - 1 of B. The arrows mark the first and last characters of the window.

ex:     |  |
        v  v
       "abcdabcd"
        ^  ^
        |  |

3: Create a variable currHash, and initialize it to the sum of the prime numbers mapped to by each of the characters in B between lefti and righti - 1 (inclusive).

4: Advance both lefti and righti by 1, each time updating currHash by subtracting the prime for the character that just left the window (at the old lefti) and adding the prime for the character that just entered it (at the old righti).

5: Each time currHash is equal to smallHash, the characters in the window are a candidate permutation, so we print them out. (Different multisets of primes can share a sum, e.g. 3 + 7 = 5 + 5, so a strict implementation would confirm the candidate by comparing character counts.)

6: Continue until the window has reached the end of B (when righti exceeds the length of B).

This solution runs in O(n) time complexity and O(1) space.

The Actual Code:

public class FindPermutationsInString {
    //This is an array containing the first 256 prime numbers
    static int primes[] = 
          {
            2,     3,     5,     7,    11,    13,    17,    19,    23,    29,
            31,    37,    41,    43,    47,    53,    59,    61,    67,    71,
            73,    79,    83,    89,    97,   101,   103,   107,   109,   113,
            127,   131,   137,   139,   149,   151,   157,   163,   167,   173,
            179,   181,   191,   193,   197,   199,   211,   223,   227,   229,
            233,   239,   241,   251,   257,   263,   269,   271,   277,   281,
            283,   293,   307,   311,   313,   317,   331,   337,   347,   349,
            353,   359,   367,   373,   379,   383,   389,   397,   401,   409,
            419,   421,   431,   433,   439,   443,   449,   457,   461,   463,
            467,   479,   487,   491,   499,   503,   509,   521,   523,   541,
            547,   557,   563,   569,   571,   577,   587,   593,   599,   601,
            607,   613,   617,   619,   631,   641,   643,   647,   653,   659,
            661,   673,   677,   683,   691,   701,   709,   719,   727,   733,
            739,   743,   751,   757,   761,   769,   773,   787,   797,   809,
            811,   821,   823,   827,   829,   839,   853,   857,   859,   863,
            877,   881,   883,   887,   907,   911,   919,   929,   937,   941,
            947,   953,   967,   971,   977,   983,   991,   997,  1009,  1013,
           1019,  1021,  1031,  1033,  1039,  1049,  1051,  1061,  1063,  1069,
           1087,  1091,  1093,  1097,  1103,  1109,  1117,  1123,  1129,  1151,
           1153,  1163,  1171,  1181,  1187,  1193,  1201,  1213,  1217,  1223,
           1229,  1231,  1237,  1249,  1259,  1277,  1279,  1283,  1289,  1291,
           1297,  1301,  1303,  1307,  1319,  1321,  1327,  1361,  1367,  1373,
           1381,  1399,  1409,  1423,  1427,  1429,  1433,  1439,  1447,  1451,
           1453,  1459,  1471,  1481,  1483,  1487,  1489,  1493,  1499,  1511,
           1523,  1531,  1543,  1549,  1553,  1559,  1567,  1571,  1579,  1583,
           1597,  1601,  1607,  1609,  1613,  1619
          };

    public static void main(String[] args) {
        String big = "abcdabcd";
        String small = "abcd";
        printAllPermutations(big, small);
    }

    static void printAllPermutations(String big, String small) {

        // If the big one is smaller than the small one,
        // there can't be any permutations, so return
        if (big.length() < small.length()) return;

        // Initialize smallHash to be the sum of the primes
        // corresponding to each of the characters in small.
        int smallHash = primeHash(small, 0, small.length());

        // Initialize righti and lefti.
        int lefti = 0, righti = small.length();

        // Initialize currentHash to be the sum of the primes
        // corresponding to the first small.length() characters of big.
        int currentHash = primeHash(big, 0, righti);

        while (righti <= big.length()) {
            // If the current section of big is a permutation
            // of small, print it out.
            if (currentHash == smallHash)
                System.out.println(big.substring(lefti, righti));

            // Subtract the corresponding prime value in position
            // lefti. Then increment lefti
            currentHash -= primeHash(big.charAt(lefti++));

            if (righti < big.length()) // To prevent index out of bounds
                // Add the corresponding prime value in position righti.
                currentHash += primeHash(big.charAt(righti));

            //Increment righti.
            righti++;
        }

    }

    // Gets the sum of all the nth primes corresponding
    // to n being each of the characters in str, starting
    // from position start, and ending at position end - 1.
    static int primeHash(String str, int start, int end) {
        int value = 0;
        for (int i = start; i < end; i++) {
            value += primeHash(str.charAt(i));
        }
        return value;
    }

    // Gets the n-th prime, where n is the ASCII value of chr
    static int primeHash(Character chr) {
        return primes[chr];
    }
}

Keep in mind, however, that this solution only works when every character is in the range 0-255. A Java char can hold values up to 65,535, and primes[chr] throws an ArrayIndexOutOfBoundsException for anything past the end of the table. Covering all of Unicode would take a table of 1,114,112 primes, which is impractical to hard-code, and the sum over a long string could overflow an int.

Answer 3:

There is a simpler solution to this problem which can be done in linear time.

Here: n = A.size (), m = B.size ()

The idea is to use hashing.

First we hash the characters of string B.

Suppose: B = "dep"

hash_B ['d'] = 1;
hash_B ['e'] = 1;
hash_B ['p'] = 1;

Now we run a loop over the string 'A' for each window of size 'm'.

Suppose: A = "encyclopedia"

First window of size 'm' will have characters {e, n, c}. We will hash them now.

win ['e'] = 1
win ['n'] = 1
win ['c'] = 1

Now we check whether the frequency of each character is the same in both arrays (hash_B [] and win []). Note: the maximum size of hash_B [] or win [] is 26 (assuming lowercase letters only), so this comparison takes constant time.

If they are not the same, we shift our window.

After shifting the window by one, we decrease the count of win ['e'] by 1 and increase the count of win ['y'] by 1.

win ['n'] = 1
win ['c'] = 1
win ['y'] = 1

After the seventh shift, the status of your win array is:

win ['p'] = 1;
win ['e'] = 1;
win ['d'] = 1;

which is the same as the hash_B array. So, print "SUCCESS" and exit.
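
The walkthrough above could be sketched in Java along these lines. The class and method names are hypothetical, and the code assumes both strings contain only lowercase 'a'..'z', matching the 26-entry arrays in the description.

```java
import java.util.Arrays;

// Sketch of the two-array window comparison; assumes lowercase 'a'..'z' only.
final class CountArrayWindow {
    static boolean containsPermutation(String a, String b) {
        int n = a.length(), m = b.length();
        if (m > n) return false;
        int[] hashB = new int[26];   // character counts of B
        int[] win = new int[26];     // character counts of the current window of A
        for (int i = 0; i < m; i++) {
            hashB[b.charAt(i) - 'a']++;
            win[a.charAt(i) - 'a']++;
        }
        for (int start = 0; ; start++) {
            if (Arrays.equals(hashB, win)) return true;   // 26 comparisons at most
            if (start + m >= n) return false;             // no window left to try
            win[a.charAt(start) - 'a']--;                 // drop the leftmost character
            win[a.charAt(start + m) - 'a']++;             // take in the next character
        }
    }
}
```

With A = "encyclopedia" and B = "dep", the arrays become equal once the window reaches "ped".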

Answer 4:

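The idea is clear from the discussion above. An implementation with O(n) time complexity is below. Note that here a is the pattern and b is the text, and both are assumed to contain only lowercase a-z.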

#include <stdio.h>
#include <string.h>

const char *a = "dep";
const char *b = "encyclopedia";

int cnt_a[26];
int cnt_b[26];

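int main(void)
{
    const int len_a = (int)strlen(a);
    const int len_b = (int)strlen(b);

    /* No window of b can hold a permutation of a if a is longer. */
    if (len_a > len_b)
            return 0;

    /* Count the characters of a and of the first window of b. */
    for (int i = 0; i < len_a; i++) {
            cnt_a[a[i]-'a']++;
            cnt_b[b[i]-'a']++;
    }

    /* i is the start of the current window; <= so the last window is checked too. */
    for (int i = 0; i <= len_b-len_a; i++) {
            if (memcmp(cnt_a, cnt_b, sizeof(cnt_a)) == 0)
                    printf("%d\n", i);
            /* Slide only while another character remains to the right. */
            if (i + len_a < len_b) {
                    cnt_b[b[i]-'a']--;
                    cnt_b[b[i+len_a]-'a']++;
            }
    }

    return 0;
}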

Answer 5:

My approach is to first give yourself a big example, such as

a: abbc
b: cbabadcbbabbc

Then literally go through and underline each permutation of a in b (here cbab, cbba and abbc). That suggests:

For i -> b.len: sub = b.substring(i,i+len); isPermuted(sub)?

Here is the code in Java:

class Test {
  public static boolean isPermuted(int [] asciiA, String subB){
    int [] asciiB = new int[26];

    for(int i=0; i < subB.length();i++){
      asciiB[subB.charAt(i) - 'a']++;
    }
    for(int i=0; i < 26;i++){
        if(asciiA[i] != asciiB[i])
        return false;
    }
    return true;
  }
  public static void main(String args[]){
    String a = "abbc";
    String b = "cbabadcbbabbc";
    int len = a.length();
    int [] asciiA = new int[26];
    for(int i=0;i<a.length();i++){
      asciiA[a.charAt(i) - 'a']++;
    }
    for(int i=0;i<b.length()-len+1;i++){
      String sub = b.substring(i,i+len);
      System.out.printf("%s,%s\n",sub,isPermuted(asciiA,sub));
    }
  }
}

Answer 6:

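The function below returns true if some permutation of String B is a substring of String A. It keeps the character counts of a window of length B.length() sliding over A, so that only contiguous characters are matched, and it assumes lowercase a-z.

public boolean isPermutedSubstring(String B, String A){
    int m = B.length();
    if(m == 0) return true;
    if(m > A.length()) return false;
    int[] arr = new int[26]; // count in B minus count in the current window of A

    for(int j = 0; j < m; ++j){
        arr[B.charAt(j) - 'a']++;
    }
    for(int i = 0; i < A.length(); ++i){
        arr[A.charAt(i) - 'a']--;                 // character entering the window
        if(i >= m) arr[A.charAt(i - m) - 'a']++;  // character leaving the window
        if(i >= m - 1 && allZero(arr)) return true;
    }
    return false;
}

private static boolean allZero(int[] arr){
    for(int v : arr) if(v != 0) return false;
    return true;
}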

Answer 7:

Here's a solution that's pretty much rici's answer. https://wandbox.org/permlink/PdzyFvv8yDf3t69l It allocates a little more than 1k stack memory for the frequency table. O(|A| + |B|), no heap allocations.

#include <cstdint>      // uint8_t
#include <cstdio>       // printf
#include <string_view>  // std::string_view

bool is_permuted_substring(std::string_view input_string,
                           std::string_view search_string) {
  if (search_string.empty()) {
    return true;
  }

  if (search_string.length() > input_string.length()) {
    return false;
  }

  int character_frequencies[256]{};
  auto distance = search_string.length();
  for (auto c : search_string) {
    character_frequencies[(uint8_t)c]++;
  }

  for (auto i = 0u; i < input_string.length(); ++i) {
    auto& cur_frequency = character_frequencies[(uint8_t)input_string[i]];
    if (cur_frequency > 0) distance--;
    cur_frequency--;

    if (i >= search_string.length()) {
      auto& prev_frequency = ++character_frequencies[(
          uint8_t)input_string[i - search_string.length()]];
      if (prev_frequency > 0) {
        distance++;
      }
    }

    if (!distance) return true;
  }

  return false;
}

int main() {
  auto test = [](std::string_view input, std::string_view search,
                 auto expected) {
    auto result = is_permuted_substring(input, search);
    printf("%s: is_permuted_substring(\"%.*s\", \"%.*s\") => %s\n",
           result == expected ? "PASS" : "FAIL", (int)input.length(),
           input.data(), (int)search.length(), search.data(),
           result ? "T" : "F");
  };

  test("", "", true);
  test("", "a", false);
  test("a", "a", true);
  test("ab", "ab", true);
  test("ab", "ba", true);
  test("aba", "aa", false);
  test("baa", "aa", true);
  test("aacbba", "aab", false);
  test("encyclopedia", "dep", true);
  test("encyclopedia", "dop", false);

  constexpr char negative_input[]{-1, -2, -3, 0};
  constexpr char negative_search[]{-3, -2, 0};
  test(negative_input, negative_search, true);

  return 0;
}

Answer 8:

I am late to this party...

The question is also discussed in the book Cracking the Coding Interview, 6th Edition, on page 70. The author says it is possible to find all permutations in O(n) (linear) time, but she doesn't write out the algorithm, so I thought I should give it a go.

Here is the C# solution, in case someone is looking for it.

Also, I think (not 100% sure) it finds the count of permutations in O(n) time; note, though, that the inner loops rescan each candidate window, so the worst case is closer to O(n*m).

public int PermutationOfPatternInString(string text, string pattern)
{
    int matchCount = 0;
    Dictionary<char, int> charCount = new Dictionary<char, int>();
    int patLen = pattern.Length;
    foreach (char c in pattern)
    {
        if (charCount.ContainsKey(c))
        {
            charCount[c]++;
        }
        else
        {
            charCount.Add(c, 1);
        }
    }

    var subStringCharCount = new Dictionary<char, int>();

    // loop through each character in the given text (string)....
    for (int i = 0; i <= text.Length - patLen; i++)
    {
        // check if current char and current + length of pattern-th char are in the pattern.
        if (charCount.ContainsKey(text[i]) && charCount.ContainsKey(text[i + patLen - 1]))
        {
            string subString = text.Substring(i, patLen);
            int j = 0;
            for (; j < patLen; j++)
            {
                // there is no point going on if this subString doesnt contain chars that are in pattern...
                if (charCount.ContainsKey(subString[j]))
                {
                    if (subStringCharCount.ContainsKey(subString[j]))
                    {
                        subStringCharCount[subString[j]]++;
                    }
                    else
                    {
                        subStringCharCount.Add(subString[j], 1);
                    }
                }
                else
                {
                    // if any of the chars dont appear in the subString that we are looking for
                    // break this loop and continue...
                    break;
                }
            }

            int x = 0;

            // this will be true only when we have current subString's permutation count
            // matched with pattern's.
            // we need this because the char count could be different 
            if (subStringCharCount.Count == charCount.Count)
            {
                for (; x < patLen; x++)
                {
                    if (subStringCharCount[subString[x]] != charCount[subString[x]])
                    {
                        // if any count dont match then we break from this loop and continue...
                        break;
                    }
                }
            }

            if (x == patLen)
            {
                // this means we have found a permutation of pattern in the text...
                // increment the counter.
                matchCount++;
            }

            subStringCharCount.Clear(); // clear the count map.
        }
    }

    return matchCount;
}

and here is the unit test method...

[TestCase("encyclopedia", "dep", 1)]
[TestCase("cbabadcbbabbcbabaabccbabc", "abbc", 7)]
[TestCase("xyabxxbcbaxeabbxebbca", "abbc", 2)]
public void PermutationOfStringInText(string text, string pattern, int expectedAnswer)
{
    int answer = runner.PermutationOfPatternInString(text, pattern);
    Assert.AreEqual(expectedAnswer, answer);
}
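
For comparison, a count that stays linear in the text length could be sketched in Java as below. This is not the C# code above translated; it is a separate illustration with made-up names, it assumes characters in the range 0..255, and it treats an empty pattern as having no matches (an arbitrary choice).

```java
// Sketch: count windows of text that are permutations of pattern in O(n) time.
final class PermutationCounter {
    static int countPermutations(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0 || m > n) return 0;
        int[] need = new int[256];          // how many of each char the window still lacks
        for (int i = 0; i < m; i++) need[pattern.charAt(i)]++;
        int missing = m;                    // pattern characters not yet covered
        int matches = 0;
        for (int i = 0; i < n; i++) {
            // Entering character: it covers a needed character only if need was positive.
            if (need[text.charAt(i)]-- > 0) missing--;
            // Leaving character: it becomes needed again if its count turns positive.
            if (i >= m && ++need[text.charAt(i - m)] > 0) missing++;
            if (missing == 0) matches++;
        }
        return matches;
    }
}
```

Each character enters and leaves the window once, so the work per character is constant regardless of the pattern length.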