I had a job interview today, and was asked about the complexity of std::set_intersection. When answering, I mentioned that
O(n+m)
is equal to:
O(max(n,m))
I was told that this is incorrect. I was unsuccessfully trying to show equivalence with:
O(0.5*(n+m)) ≤ O(max(n,m)) ≤ O(n+m)
My question is: am I really incorrect?
For all m, n ≥ 0 it holds that max(m, n) ≤ m + n, hence max(m, n) ∈ O(m + n); and m + n ≤ 2·max(m, n), hence m + n ∈ O(max(m, n)).
Thus O(max(m, n)) = O(m + n).
ADDENDUM: If f belongs to O(m + n), then there exists a constant D > 0 such that f(n, m) < D·(m + n) for m and n large enough. Since m + n ≤ 2·max(m, n), we get f(n, m) < 2D·max(m, n), so O(m + n) must be a subset of O(max(m, n)). The proof that O(max(m, n)) is a subset of O(m + n) proceeds analogously.
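To spell out the addendum's two inclusions explicitly, here is the same argument written as a chain of implications; it only restates the inequalities already used above:

```latex
% The two inclusions, written out from max(m,n) <= m+n <= 2*max(m,n).
\begin{align*}
f \in O(m+n) &\implies \exists D>0:\ f(m,n) \le D\,(m+n) \text{ for all large } m,n\\
             &\implies f(m,n) \le 2D\,\max(m,n) \quad \text{(since } m+n \le 2\max(m,n)\text{)}\\
             &\implies f \in O(\max(m,n)),\\[4pt]
f \in O(\max(m,n)) &\implies \exists D'>0:\ f(m,n) \le D'\max(m,n) \text{ for all large } m,n\\
             &\implies f(m,n) \le D'(m+n) \quad \text{(since } \max(m,n) \le m+n\text{)}\\
             &\implies f \in O(m+n).
\end{align*}
% Hence O(m+n) = O(max(m,n)).
```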
Well, you are absolutely right that O(n+m) is equal to O(max(n,m)); even more precisely, we can prove Θ(n+m) = Θ(max(n,m)), which is tighter and confirms your statement. The mathematical proof (for both big-O and Θ) is very simple and easy to understand with common sense, and a mathematical proof is a way of saying something in a well-defined, strict way that leaves no ambiguity.
You were nevertheless (wrongly) told that this is incorrect, perhaps because, if we want to be very precise, this is not the most appropriate mathematical way to express that the order of max(m,n) is the same as that of m+n. You used the words "is equal" referring to big-O notation, but what is the definition of big-O notation?
The problem is that you didn't refer to the order of a function, as in "f is O(g)", but instead compared the sets O(f) and O(g), and proving that two infinite sets are equal is not easy (which may have confused the interviewer).
We say that sets A and B are identical (or equal) when they contain the same elements (we do not compare them directly but refer to the elements they contain, so ideally they are finite), and even that test cannot easily be applied to big-O sets.
One more thing: we said that these functions have the same order, which is absolutely correct mathematically, but when optimizing an algorithm you may need to take some lower-order factors into account, and then the two expressions may give you slightly different results; the asymptotic behavior, however, is provably the same.
As you can read on Wikipedia (and in CS courses at every university, and in every algorithms book), the big O/Θ/Ω/ω/o notations help us compare functions and find their order of growth, not compare sets of functions, and this is why you were told you were wrong. Though it is easy to say that O(n) is a subset of O(n^2), it is very difficult to compare infinite sets and say whether two of them are identical. Cantor worked on categorizing infinite sets; for example, we know that the natural numbers are countably infinite and the real numbers are uncountably infinite, so in a sense there are more real numbers than natural numbers even though both sets are infinite. Ordering and categorizing infinite sets gets very complicated and is more of a research topic in mathematics than a way of comparing functions.
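As a concrete instance of the easy direction mentioned above (that O(n) sits strictly inside O(n^2)), the following standard argument is included purely as illustration:

```latex
% Example: O(n) is a proper subset of O(n^2).
% Inclusion: if f(n) <= c*n for all n >= n0, then f(n) <= c*n^2 for all n >= max(n0, 1),
% so every f in O(n) is also in O(n^2).
% Strictness: n^2 is not in O(n): n^2 <= c*n for all n >= n0 would imply n <= c for all
% n >= n0, which fails as soon as n > c.
\[
  O(n) \subsetneq O(n^2).
\]
```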
UPDATE

It turns out you can indeed simply prove that O(n+m) equals O(max(n,m)):

For every function F which belongs to O(n+m), there are constants c and k such that:

F(n,m) ≤ c·(n+m), for all n,m ≥ k

Then it also holds that:

F(n,m) ≤ c·(n+m) ≤ c·2·max(n,m) = 2c·max(n,m), for all n,m ≥ k

so F belongs to O(max(n,m)), and as a result O(m+n) is a subset of O(max(n,m)). Now consider a function F that belongs to O(max(n,m)); then there are constants c and k such that:

F(n,m) ≤ c·max(n,m), for all n,m ≥ k

and we also have:

max(n,m) ≤ n+m, hence F(n,m) ≤ c·(n+m), for all n,m ≥ k

so with c' = c and the same k, by definition F is O(m+n), and as a result O(max(n,m)) is a subset of O(n+m). Because we have proved both inclusions, we have proved that O(max(m,n)) and O(m+n) are equal, and this mathematical proof shows, without any doubt, that you were right.
Finally, note that proving that m+n is O(max(n,m)) and max(n,m) is O(m+n) does not immediately prove that the sets are equal (we need a proof for that, like the one above); as you stated it, it only proves that the functions have the same order, without examining the sets themselves. Though it is easy to see that, in the general case, if f is O(g) and g is O(f), then you can prove the equality of the big-O sets just as we did in the previous paragraph.
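If you want a quick empirical sanity check of the two elementary inequalities the proof rests on, max(n,m) ≤ n+m ≤ 2·max(n,m), a few lines of C++ will do; this is only an illustration, not part of the proof:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

int main() {
    // Check max(n,m) <= n+m <= 2*max(n,m) over a small grid of values.
    for (std::size_t n = 0; n <= 1000; ++n) {
        for (std::size_t m = 0; m <= 1000; ++m) {
            const std::size_t mx = std::max(n, m);
            assert(mx <= n + m);      // max(n,m) is O(n+m)
            assert(n + m <= 2 * mx);  // n+m is O(max(n,m))
        }
    }
}
```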
We'll show by rigorous Big-O analysis that you are indeed correct, given one possible choice of parameter of growth in your analysis. However, this does not necessarily mean that the viewpoint of the interviewer is incorrect, rather that his/her choice of parameter of growth differs. His/her claim that your answer was outright incorrect, however, is questionable: you've possibly simply used two slightly different approaches to analyzing the asymptotic complexity of std::set_intersection, both leading to the general consensus that the algorithm runs in linear time.

Preparations

Let's start by looking at the reference for std::set_intersection at cppreference (emphasis mine):

    Complexity: At most 2·(N1+N2-1) comparisons, where N1 = std::distance(first1, last1) and N2 = std::distance(first2, last2).

std::distance itself is naturally linear (worst case: no random access).
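For reference, here is a minimal usage sketch of the algorithm on two already-sorted ranges; the values are arbitrary and chosen only for illustration:

```cpp
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    // Both input ranges must already be sorted for std::set_intersection.
    std::vector<int> range1{1, 2, 4, 6, 8, 9};   // m = 6 elements
    std::vector<int> range2{2, 3, 6, 7};         // n = 4 elements

    std::vector<int> result;
    std::set_intersection(range1.begin(), range1.end(),
                          range2.begin(), range2.end(),
                          std::back_inserter(result));

    // Prints: 2 6
    for (int x : result) std::cout << x << ' ';
    std::cout << '\n';
}
```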
We'll proceed to briefly recall the basics of Big-O notation.

Big-O notation

We loosely state the definition of a function or algorithm f being in O(g(n)) (to be picky, O(g(n)) being a set of functions, hence f ∈ O(...), rather than the commonly misused f(n) ∈ O(...)). Hence, to show that f ∈ O(g(n)), we need to find a pair of (non-negative) constants (c, n0) that fulfils

    0 ≤ f(n) ≤ c · g(n), for all n ≥ n0.    (+)

We note, however, that this pair is not unique; the problem of finding the constants (c, n0) such that (+) holds is degenerate. In fact, if any such pair of constants exists, there will exist an infinite number of different such pairs.
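To make the non-uniqueness remark concrete, here is a small worked instance; the function 3n + 5 is an arbitrary example chosen for illustration:

```latex
% Example: f(n) = 3n + 5 is in O(n), and the witnessing constants are not unique.
\[
  3n + 5 \le 4n \quad \text{for all } n \ge 5
  \qquad\Rightarrow\qquad (c, n_0) = (4, 5) \text{ satisfies } (+).
\]
\[
  3n + 5 \le 8n \quad \text{for all } n \ge 1
  \qquad\Rightarrow\qquad (c, n_0) = (8, 1) \text{ also satisfies } (+).
\]
% Indeed, any (c, n_0) with c > 3 and n_0 >= 5/(c - 3) works, giving infinitely many pairs.
```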
We proceed with the Big-O analysis of std::set_intersection, based on the already known worst case number of comparisons of the algorithm (we'll consider one such comparison a basic operation).

Applying Big-O asymptotic analysis to the set_intersection example

Now consider two ranges of elements, say range1 and range2, and assume that the number of elements contained in these two ranges are m and n, respectively. We'll use the largest of m and n as the parameter of growth. (We could just as well let k = m+n be the parameter of choice: we would still conclude that std::set_intersection is of linear-time complexity, but rather in terms of k, which is m+n, which is not max(m, n), than in terms of the largest of m and n.) These are simply the preconditions we freely choose to set prior to proceeding with our Big-O notation/asymptotic analysis, and it's quite possible that the interviewer had a preference for analyzing the complexity using k as the parameter of growth rather than the largest of its two components.

Now, from above we know that, as worst case, std::set_intersection will run

    2 · (m + n - 1)

comparisons/basic operations. Since the goal is to find an expression of the asymptotic complexity in terms of Big-O (an upper bound), we may, without loss of generality, assume that n > m, and define

    f(n) ≡ 2 · (m + n - 1), with m < n.    (*)

We proceed to analyze the asymptotic complexity of f(n), in terms of Big-O notation. Let

    g(n) ≡ n,

and note that

    f(n) = 2 · (m + n - 1) ≤ 2 · (n + n - 1) = 4n - 2 < 4n = 4 · g(n).

Now (choose to) let c = 4 and n0 = 1, and we can state the fact that:

    0 ≤ f(n) ≤ c · g(n) = 4n, for all n ≥ n0.    (**)

Given (**), we know from (+) that we've now shown that

    f ∈ O(g(n)) = O(n).

Furthermore, since (*) holds with n = max(m, n) under our assumption, naturally

    f ∈ O(max(m, n))

holds. If we switch our initial assumption and assume that m > n, re-tracing the analysis above will, conversely, yield the similar result

    f ∈ O(m) = O(max(m, n)).
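If you'd like to see the derived bound hold in practice, the sketch below instruments the comparator to count comparisons and checks, for a few random range sizes, that the measured count never exceeds 2·(m+n-1) and hence never exceeds 4·max(m, n); it is an illustration of the analysis above, not a substitute for it:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

// Count comparisons performed by std::set_intersection on two sorted ranges.
std::size_t count_comparisons(const std::vector<int>& a, const std::vector<int>& b) {
    std::size_t comparisons = 0;
    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out),
                          [&](int x, int y) { ++comparisons; return x < y; });
    return comparisons;
}

int main() {
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> value(0, 50);
    std::uniform_int_distribution<std::size_t> size(1, 200);

    for (int trial = 0; trial < 1000; ++trial) {
        std::vector<int> a(size(gen)), b(size(gen));
        for (int& x : a) x = value(gen);
        for (int& x : b) x = value(gen);
        std::sort(a.begin(), a.end());
        std::sort(b.begin(), b.end());

        const std::size_t m = a.size(), n = b.size();
        const std::size_t c = count_comparisons(a, b);
        assert(c <= 2 * (m + n - 1));     // documented worst-case bound
        assert(c <= 4 * std::max(m, n));  // the Big-O witness used above
    }
}
```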
Conclusion

Hence, given two ranges range1 and range2 holding m and n elements, respectively, we've shown that the asymptotic complexity of std::set_intersection applied to these two ranges is indeed

    O(max(m, n)),

where we've chosen the largest of m and n as the parameter of growth of our analysis.

This is, however, not really valid notation (at least not common) when speaking about Big-O notation. When we use Big-O notation to describe the asymptotic complexity of some algorithm or function, we do so with regard to some single parameter of growth (not two of them).
Rather than answering that the complexity is O(max(m, n)), we may, without loss of generality, assume that n describes the number of elements in the range with the most elements, and, given that assumption, simply state that an upper bound for the asymptotic complexity of std::set_intersection is O(n) (linear time).

A speculation as to the interview feedback: as mentioned above, it's possible that the interviewer simply had a firm view that the Big-O notation/asymptotic analysis should have been based on k = m+n as the parameter of growth rather than the largest of its two components. Another possibility could, naturally, be that the interviewer simply, and somewhat confusingly, asked about the worst-case actual number of comparisons of std::set_intersection, while mixing this with the separate matter of Big-O notation and asymptotic complexity.

Final remarks
Finally, note that the analysis of the worst-case complexity of std::set_intersection is not at all representative of the commonly studied non-ordered set intersection problem: the former is applied to ranges that are already sorted (cf. Boost's set_intersection, the origin of std::set_intersection), whereas in the latter we study the computation of the intersection of non-ordered collections.
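For contrast with the sorted-range algorithm above, a typical hash-based approach to the non-ordered problem iterates over the smaller collection and probes the other; the following sketch, using std::unordered_set purely for illustration, shows the idea behind the average-case O(min(|s|, |t|)) behaviour mentioned next:

```cpp
#include <unordered_set>

// Hash-based intersection of two non-ordered collections.
// Average case: O(min(|s|, |t|)) lookups; worst case degrades with hash collisions.
std::unordered_set<int> intersect(const std::unordered_set<int>& s,
                                  const std::unordered_set<int>& t) {
    const auto& smaller = (s.size() <= t.size()) ? s : t;
    const auto& larger  = (s.size() <= t.size()) ? t : s;

    std::unordered_set<int> result;
    for (int x : smaller) {        // iterate over the smaller set only
        if (larger.count(x)) {     // average O(1) hash lookup
            result.insert(x);
        }
    }
    return result;
}
```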
As an example of the latter, the intersection set method of Python applies to non-ordered collections; when applied to, say, sets s and t, it has an average-case and a worst-case complexity of O(min(len(s), len(t))) and O(len(s) * len(t)), respectively. The huge difference between average and worst case in this implementation stems from the fact that hash-based solutions generally work very well in practice, but can, for some applications, theoretically have very poor worst-case performance.

For additional details of the latter problem, see e.g.