The generic algorithm to deduce if a string contains all unique characters (and that does not use any other data structures) says to go through the string, iterating each letter against the entire string searching for a match. This approach is O(n^2).
The approach below (written in C) uses an offset when iterating over the string, since, for example, there is no reason to test the last character against the first character: the first character already did that comparison.
My question is this: Is the runtime of the algorithm then O(n!) or something like O(nlogn)?
#include <stdio.h>

int strunique(const char *str)
{
    size_t offset = 1;
    char *scout = (char *)str, *start;

    for (; *scout != '\0'; ++scout, ++offset)
        for (start = (char *)str + offset; *start != '\0'; ++start)
            if (*start == *scout)
                return 0;
    return 1;
}

int main(void)
{
    printf("%d\n", strunique("uniq"));
    printf("%d\n", strunique("repatee"));
    return 0;
}
No, it's still O(n^2). You just slightly improved the constant. You still have two nested loops; the naive count-the-loops way of measuring big-O time should tell you this.
Also, there is no such thing as O(n + (1/2)n). Big-O notation is there to give you an idea of the order of magnitude something should take, and n + (1/2)n = 1.5n. Since big-O drops all constant factors, that would just be O(n).
You can beat O(n^2) without extra memory, though. If nothing else, you can sort the string by ASCII value (O(n log n) time) and then walk the array looking for adjacent duplicates (O(n) time), for O(n + n log n) = O(n log n) total. There are probably other tricks as well.
Note that the sorting approach may not give a better runtime in practice: the naive way has a best-case runtime of O(1) (a duplicate among the first characters), while a sort-first algorithm always has to sort, so its best case is O(n log n). So the best big-O time may not be the best choice.
In addition to other answers, I'd like to point out that the problem can be solved in O(1), without additional memory, and without modifying the contents of the input string.
First, do strnlen(str, 256). If you get more than 255, return 0: by the pigeonhole principle, some character must occur more than once. This operation takes only O(1), since we examine only a bounded prefix of the string.
Now the string is shorter than a constant (256), so use any naive algorithm to finish in only O(1) additional time.
Short answer
If the number of possible characters (not to be confused with the length of the strings) is not fixed (not the case here), the time complexity of your algorithm is O(n^2). If we make the assumption that there are only a fixed number of valid characters (in this case 255, or 4G for 32-bit characters), your algorithm runs in worst-case O(n). Under that assumption, the algorithm can then easily be improved to run in O(1).
Note on asymptotic behavior and big-oh: these are theoretical results. The fact that an algorithm runs in O(1) does not mean it runs in a reasonable amount of time; it means it runs in constant time. So, asymptotically speaking, it won't make any difference whether you enter a string with length 10^1'000 or one with length 10^10'000 (given these lengths are large enough). The time it takes can be more than one hundred times the age of the universe as well.
Explanation
You can do a simple worst-case analysis on the for loops:
for (; *scout != '\0'; ++scout, ++offset)
    for (start = (char *)str + offset; *start != '\0'; ++start)
        //single statement
Now we want to know how many times the single statement (it contains a fixed number of instructions) will be repeated. Since you never modify the content of the string, we know that there is an index n at which the value is \0.
So we can rewrite it as:
for (scout = 0, offset = 0; scout < n; ++scout, ++offset)
    for (start = offset; start < n; ++start)
        //single statement
(I've assumed the string starts at memory address 0; since that's only a shift, it is allowed, and it only makes it simpler to reason about this.)
Now we're going to calculate the number of iterations of the inner for loop (parameterized). That's equal to:
n - o
with o the offset and n the length of the string.
Now we can use this formula to calculate the number of instructions at the outer for-loop level. Here o starts with 0 and iterates through (excluding) n. So the total number of instructions is:
(n - 0) + (n - 1) + ... + (n - (n - 1)) = n + (n - 1) + ... + 1 = n(n + 1)/2
which is O(n^2).
But now one has to ask: is it possible to construct such a string? The answer is no! There are only 255 valid characters (the NUL character is not considered to be a character); only if we cannot make this assumption does the bound above hold. Say the first character is an a (with a an arbitrary char). Then it either matches another a in the string, which can be resolved in O(n) time (loop through the remainder of the string); or all other characters are different from a. In the first case, the algorithm terminates in O(n); in the second case, this means that the second character is different.
Let's say the second character is b. Then we again iterate over the string in O(n), and if we find another b we terminate, after 2n or O(n) steps. If not, we need to try to find a match for the next character c.
The point is that we only need to do this at most 255 times, because there are only 255 valid characters. As a result, the time complexity is 255n, or O(n).
Alternative explanation
Another variant of this explanation is: "if the outer for loop is looking at the i-th character, we know that all characters to the left of i are different from that character (otherwise we would have already rejected earlier)". Now, since there are only 255 characters, and all characters to the left are different from each other and from the current character, we know that for the 256-th character of the string we cannot find a new, different character anymore, because there are no such characters.
Example
Say you have an alphabet with 3 characters (a, b and c) - this only to make it easier to understand the matter. Now say we have the string:

scout
v
b a a a a a b
  ^
  start
It is clear that your algorithm will use O(n) steps: the start pointer will simply iterate over the entire string once, reach the second b, and return.
Now say there is no duplicate of b in the string. In that case the algorithm does not stop after iterating over the string once. But this implies all the other characters must differ from b (after all, we've iterated over the string and didn't find a duplicate). So now consider a string with that condition:

scout
v
b a a a a a a
  ^
  start
Now it is clear that a first attempt to find another b in the remainder of the string will fail. Now your algorithm increments the scout:

  scout
  v
b a a a a a a
    ^
    start
And it starts searching for a. Here we're very lucky: the first character inspected is an a. In any case, if there is a duplicate a, it costs at most two iterations over the string, so O(2n), to find the duplicate.
Now we're reaching the boundary case: there is no a either. In that case, we know the string must begin with:

scout
v
b a c ...
We furthermore know that the remainder of the string cannot contain b's nor a's, because otherwise the scout would never have moved that far. The only remaining possibility is that the remainder of the string contains c's. So the string reads:

    scout
    v
b a c c c c c c c c c c
      ^
      start
This means that after iterating over the string at most 3 times, we will find such duplicate, regardless of the size of the string, or how the characters are distributed among that string.
Modify this to O(1) time
You can easily modify this algorithm to let it run in O(1) time: simply place additional bounds on the indices:
int strunique(const char *str)
{
    size_t offset = 1;
    char *scout = (char *)str, *start, *stop = scout + 255;

    for (; scout < stop && *scout != '\0'; ++scout, ++offset)
        for (start = (char *)str + offset; start <= stop && *start != '\0'; ++start)
            if (*start == *scout)
                return 0;
    return 1;
}
In this case we've bounded the first loop such that it visits at most the first 255 characters, and the inner loop visits only the first 256 (notice the <= instead of <). So the total number of steps is bounded by 255 x 256, or O(1). The explanation above already demonstrates why this is sufficient.
Note: in case char is a 32-bit type, you need to replace 255 by 4'294'967'296, which makes the original algorithm indeed theoretically O(n), but practically O(n^2), in the sense that the constant factor in front of the n is so huge that O(n^2) will outperform it for any realistic input.
Since we combine the string-termination check with the 256-bound check, this algorithm will nearly always run faster than the one proposed above. The only source of potentially extra cycles is the additional testing that ships with the modified for-loops. But since these tests lead to faster termination, in many cases they will not result in additional time.
Big-oh
One can say: "Yes, that's true for strings with length greater than or equal to 256 characters, but what about strings with a size less than 256?". The point is that big-oh analysis deals with asymptotic behavior. Even if the behavior were super-exponential for some strings shorter than a certain length, you don't take these into account.
To emphasize the (problematic) aspect of asymptotic behavior further: one could say that the following algorithm is correct, asymptotically speaking:

int strunique(const char *str) {
    return 0;
}

It always returns false, because "there is a length n0 such that for every input length n > n0 this algorithm will return the correct result". This has not much to do with big-oh itself; it's more to illustrate that one must be careful with saying that an algorithm running in O(1) will outperform an algorithm running in O(n^6) for any reasonable input. Sometimes the constant factor can be gigantic.
Your algorithm is O(N^2). This is easy to verify by simply noting that, in the worst case - a string with all unique characters - every character must be checked against every character that comes after it. That is, worst-case, N*(N-1)/2 = O(N^2) comparisons.
Note that by definition:
f(x) = O(g(x))
if there exists some constant M such that |f(x)| <= M|g(x)| for all sufficiently large x. So if you were to say f(x) = O(n + (1/2)n) (which is incorrect for your algorithm), it would follow that:
f(x) = O(n + (1/2)n)
f(x) <= M * (n + (1/2)n) for some M, n_0, for n >= n_0, by definition
f(x) <= (3/2 * M) n, n >= n_0
f(x) <= M' n, setting M' = 3/2 * M, n >= n_0
f(x) = O(n), by definition
That is, constants always drop out, which is why you might hear the expression that constants don't matter (at least as far as calculating runtime complexity goes - obviously they matter for actual performance).
A string with all unique characters can have length at most 255. In that case, your algorithm runs in O(1) time.
If the string contains duplicate characters, one of those duplicate characters appears in the first 255 elements of the string. Then the worst case is when the first 254 characters of the string are unique and the 255th character is repeated until the end of the string. Then your algorithm runs in O(N) time.
You can guarantee O(1) time for your algorithm by first checking if the length of the string is greater than 255 and failing immediately if so.
All that's assuming that char takes one of 256 values. If you treat the number of possible characters in char as a variable C, then the complexities for your algorithm are O(C^2) in the case that the string contains only unique characters, O(NC) in the case that the string contains duplicates, and you can guarantee O(C^2) time by first checking that the length of the string isn't greater than C.
The optimal algorithm is O(min(N, C)) by first checking the string isn't longer than C, and then using any linear-time duplicate detection algorithm.