Input:
1) A huge sorted array of string SA;
2) A prefix string P;
Output:
The index of the first string matching the input prefix if any.
If there is no such match, then output will be -1.
Example:
SA = {"ab", "abd", "abdf", "abz"}
P = "abd"
The output should be 1 (index starting from 0).
What's the most algorithm way to do this kind of job?
If you only want to do this once, use binary search, if on the other hand you need to do it for many different prefixes but on the same string array, building a radix tree can be a good idea, after you've built the tree each look up will be very fast.
It can be done in linear time using a Suffix Tree. Building the suffix tree takes linear time.
This is just a modified bisection search:
- Only check as many characters in each element as are in the search string; and
- If you find a match, keep searching backwards (either linearly or by further bisection searches) until you find a non-matching result and then return the index of the last matching result.
The FreeBSD kernel use a Radix tree for its routing table, you should check that.
My current solution in mind is, instead of to find the "prefix", try to find a "virtual prefix".
For example, prefix is “abd", try to find a virtual-prefix “abc(255)". (255) just represents the max char number. After locating the "abc(255)". The next word should be the first word matching "abd" if any.
Are you in the position to precalculate all possible prefixes?
If so, you can do that, then use a binary search to find the prefix in the precalculated table. Store the subscript to the desired value with the prefix.
My solution:
Used binary search.
private static int search(String[] words, String searchPrefix) {
if (words == null || words.length == 0) {
return -1;
}
int low = 0;
int high = words.length - 1;
int searchPrefixLength = searchPrefix.length();
while (low <= high) {
int mid = low + (high - low) / 2;
String word = words[mid];
int compare = -1;
if (searchPrefixLength <= word.length()) {
compare = word.substring(0, searchPrefixLength).compareTo(searchPrefix);
}
if (compare == 0) {
return mid;
} else if (compare > 0) {
high = mid - 1;
} else {
low = mid + 1;
}
}
return -1;
}
Here is a possible solution (in Python), which has O(k.log(n)) time complexity and O(1) additional space complexity (considering n strings and k prefix length).
The rationale behind it to perform a binary search which only considers a given character index of the strings. If these are present, continue to the next character index. If any of the prefix characters cannot be found in any string, it returns immediately.
from typing import List
def first(items: List[str], prefix: str, i: int, c: str, left: int, right: int):
result = -1
while left <= right:
mid = left + ((right - left) // 2)
if ( i >= len(items[mid]) ):
left = mid + 1
elif (c < items[mid][i]):
right = mid - 1
elif (c > items[mid][i]):
left = mid + 1
else:
result = mid
right = mid - 1
return result
def last(items: List[str], prefix: str, i: int, c: str, left: int, right: int):
result = -1
while left <= right:
mid = left + ((right - left) // 2)
if ( i >= len(items[mid]) ):
left = mid + 1
elif (c < items[mid][i]):
right = mid - 1
elif (c > items[mid][i]):
left = mid + 1
else:
result = mid
left = mid + 1
return result
def is_prefix(items: List[str], prefix: str):
left = 0
right = len(items) - 1
for i in range(len(prefix)):
c = prefix[i]
left = first(items, prefix, i, c, left, right)
right = last(items, prefix, i, c, left, right)
if (left == -1 or right == -1):
return False
return True
# Test cases
a = ['ab', 'abjsiohjd', 'abikshdiu', 'ashdi','abcde Aasioudhf', 'abcdefgOAJ', 'aa', 'aaap', 'aas', 'asd', 'bbbbb', 'bsadiojh', 'iod', '0asdn', 'asdjd', 'bqw', 'ba']
a.sort()
print(a)
print(is_prefix(a, 'abcdf'))
print(is_prefix(a, 'abcde'))
print(is_prefix(a, 'abcdef'))
print(is_prefix(a, 'abcdefg'))
print(is_prefix(a, 'abcdefgh'))
print(is_prefix(a, 'abcde Aa'))
print(is_prefix(a, 'iod'))
print(is_prefix(a, 'ZZZZZZiod'))
This gist is available at https://gist.github.com/lopespm/9790d60492aff25ea0960fe9ed389c0f