Is there an efficient algorithm to enumerate the factors of a number n, in ascending order, without sorting? By “efficient” I mean:
The algorithm avoids a brute-force search for divisors by starting with the prime-power factorization of n.
The runtime complexity of the algorithm is O(d log₂ d) or better, where d is the divisor count of n.
The spatial complexity of the algorithm is O(d).
The algorithm avoids a sort operation. That is, the factors are produced in order rather than produced out of order and sorted afterward. Although enumerating with a simple recursive approach and then sorting is O(d log₂ d), the memory accesses involved in sorting carry a very ugly cost in practice.
A trivial example is n = 360 = 2³ × 3² × 5, which has d=24 factors: { 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 18, 20, 24, 30, 36, 40, 45, 60, 72, 90, 120, 180, 360 }.
A more serious example is n = 278282512406132373381723386382308832000 = 2⁸ × 3⁴ × 5³ × 7² × 11² × 13² × 17 × 19 × 23 × 29 × 31 × 37 × 41 × 43 × 47 × 53 × 59 × 61 × 67 × 71 × 73 × 79, which has d=318504960 factors (obviously too many to list here!). This number, incidentally, has the greatest number of factors of any number up to 2¹²⁸.
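(For reference, the divisor count is the product of one plus each exponent in the prime-power factorization: for 360 it is (3+1)(2+1)(1+1) = 24, and for the large number above it is 9 × 5 × 4 × 3 × 3 × 3 × 2¹⁶ = 318504960, the 2¹⁶ coming from the sixteen primes of multiplicity 1.)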
I could swear that I saw a description of such an algorithm a few weeks ago, with sample code, but now I can't seem to find it anywhere. It used some magic trick of maintaining a list of progenitor indexes in the output list for each prime factor. (Update: I was confusing factor generation with Hamming numbers, which are generated similarly.)
Update
I ended up using a solution which is O(d) in runtime, has extremely low memory overhead (it creates the O(d) output in place), and is significantly faster than any other method I am aware of. I've posted this solution as an answer, with C source code. It is a heavily optimized, simplified version of a beautiful algorithm presented here in Haskell by another contributor, Will Ness. I've selected Will's answer as the accepted answer, as it provided a very elegant solution that matched all the requirements as originally stated.
We can merge streams of multiples, produced so that there are no duplicates in the first place.

Starting with the list [1], for each unique prime factor p, we multiply the list by p iteratively k times (where k is the multiplicity of p) to get k new lists, and merge them together with that incoming list.

In Haskell, foldi arranges the binary merges (which presuppose that there are no duplicates in the streams being merged, for streamlined operation) in a tree-like fashion for speed, whereas foldr works in linear fashion. group is a standard function that groups adjacent equal elements of a list together, and factorize is a well-known trial-division algorithm that produces the prime factors of a number in non-decreasing order.
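(The answer's original Haskell listing is not reproduced in this text; the following is a sketch reconstructed from the description above, so the helper definitions -- in particular factorize and the exact shape of ordfactors -- are my reconstruction rather than the answer's verbatim code.)

```haskell
import Data.List (group)

-- Trial division: prime factors in non-decreasing order, e.g. 360 -> [2,2,2,3,3,5].
factorize :: Integer -> [Integer]
factorize n = go n 2
  where
    go 1 _ = []
    go m p
      | p * p > m      = [m]
      | m `mod` p == 0 = p : go (m `div` p) p
      | otherwise      = go m (p + 1)

-- Merge two ascending lists known to share no elements.
merge :: Ord a => [a] -> [a] -> [a]
merge xs [] = xs
merge [] ys = ys
merge xs@(x:xt) ys@(y:yt)
  | x < y     = x : merge xt ys
  | otherwise = y : merge xs yt

-- Tree-shaped fold: pairs the merges up for speed (foldr would nest them linearly).
foldi :: (a -> a -> a) -> a -> [a] -> a
foldi _ z []     = z
foldi f z (x:xs) = f x (foldi f z (pairs xs))
  where
    pairs (a:b:t) = f a b : pairs t
    pairs t       = t

-- For each unique prime p of multiplicity k, merge the incoming list
-- with its k copies scaled by p, p^2, ..., p^k.
ordfactors :: Integer -> [Integer]
ordfactors = foldr expand [1]
           . map (\ps -> (head ps, length ps))   -- [(prime, multiplicity)]
           . group                               -- [2,2,2,3,3,5] -> [[2,2,2],[3,3],[5]]
           . factorize
  where
    expand (p, k) xs = foldi merge [] (take (k + 1) (iterate (map (p *)) xs))
```

For example, ordfactors 360 produces the 24 factors listed in the question, in ascending order.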
(.) is a functional composition operator, chaining the output of its argument function on its right as input into the one on its left. It (and $) can be read aloud as "of".

In short: Repeatedly pull the next-smallest factor from a heap, then push back every multiple of that factor that is still a factor of n. Use a trick to prevent duplicates from arising, so that the heap size never exceeds d. The time complexity is O(kd log d), where k is the number of distinct prime factors of n.
The key property that we make use of is that if x and y are both factors of n with y = x*p for some integer p >= 2 -- i.e., if the prime factors of x are a proper submultiset of the prime factors of y -- then x < y. This means that it is always safe to output x before any such y is even inserted into the heap. The heap is effectively used only to compare factors in which neither one's prime factors form a submultiset of the other's.
A first attempt: Duplicates slow things down
It will be helpful to first describe an algorithm that produces the right answer, but also produces many duplicates:
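(The answer's original pseudocode is not reproduced in this text; the following is a hedged Haskell sketch of the duplicate-producing version, using a small pairing heap in place of whatever priority queue the original used, and reusing factorize from the sketch above.)

```haskell
import Data.List (group)
-- factorize as defined in the earlier sketch.

-- A tiny pairing heap standing in for a priority queue (duplicates allowed).
data Heap a = Empty | Node a [Heap a]

push :: Ord a => a -> Heap a -> Heap a
push x = meld (Node x [])

meld :: Ord a => Heap a -> Heap a -> Heap a
meld Empty h = h
meld h Empty = h
meld a@(Node x hs) b@(Node y ks)
  | x <= y    = Node x (b : hs)
  | otherwise = Node y (a : ks)

popMin :: Ord a => Heap a -> (a, Heap a)
popMin (Node x hs) = (x, foldr meld Empty hs)
popMin Empty       = error "popMin: empty heap"

-- Naive version: pop the smallest factor, output it, and push m*p for every
-- distinct prime p of n (whenever m*p still divides n). The same factor is
-- reached along several paths, so the output contains duplicates.
naiveFactors :: Integer -> [Integer]
naiveFactors n = go (push 1 Empty)
  where
    primes = map head (group (factorize n))  -- distinct primes of n
    go Empty = []
    go h =
      let (m, h') = popMin h
          kids    = [m * p | p <- primes, n `mod` (m * p) == 0]
      in  m : go (foldr push h' kids)
```

For n = 30 this emits 1, 2, 3, 5, then 6 twice, 10 twice, 15 twice, and multiple copies of 30, illustrating the duplication problem described next.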
The only problem with the above algorithm is that it can generate the same factor many times. For example, if n = 30, then the factor 15 will be generated as a child of the factor 5 (by multiplying by the prime factor 3) and also as a child of the factor 3 (by multiplying by 5). One way around this is to notice that any duplicates must be read out in a contiguous block when they reach the top of the heap, so you can simply check whether the top of the heap is equal to the just-extracted value, and keep extracting and discarding it while it is. But a better approach is possible:
Killing duplicates at the source
How many ways can a factor x be generated? First consider the case in which x contains no prime factors with multiplicity > 1. In that case, if it contains m distinct prime factors, then there are m "parent" factors that will generate it as a "child" in the previous algorithm -- each of these parents consists of some subset of m-1 of the m prime factors, with the remaining prime factor being the one that gets multiplied in to produce the child. (If x has prime factors with multiplicity > 1, there is likewise one parent per distinct prime dividing x.) If we had a way of deciding on exactly one of these parents to be the "chosen one" that actually generates x as a child, and this rule resulted in a test that could be applied to each parent at the time that parent is popped off, then we could avoid ever creating any duplicates in the first place.
We can use the following rule: For any given x, choose the potential parent y that is missing the largest of x's m prime factors. This makes for a simple rule: A parent y produces a child x if and only if x = y*p for some p greater than or equal to every prime factor already in y. This is easy to test for: Just loop through n's prime factors in decreasing order, generating a child for each, and stop after processing the first prime that already divides y. In the previous example, the parent 3 will produce 15, but the parent 5 will not (because 3 < 5) -- so 15 indeed gets produced only once. For n = 30, the complete tree looks like:
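(The tree diagram from the original answer is not reproduced in this text; the following reconstruction follows mechanically from the rule just stated, annotating each child with the multiplication that produces it.)

```
1
├── 2            (1 × 2)
│   ├── 6        (2 × 3)
│   │   └── 30   (6 × 5)
│   └── 10       (2 × 5)
├── 3            (1 × 3)
│   └── 15       (3 × 5)
└── 5            (1 × 5)
```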
Notice that each factor is generated exactly once.
The new, duplicate-free algorithm is as follows:
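(Again, the original pseudocode is not reproduced here; this is a hedged Haskell sketch in the same style as the naive version above, reusing its Heap, push, and popMin as well as factorize. Each heap entry carries the prime that was last multiplied in, so the parent test is a single comparison.)

```haskell
-- Duplicate-free version: entries pair a factor with the prime that produced
-- it; children may only use primes >= that one. Entries are ordered by factor
-- value, which the lexicographic Ord on tuples gives us for free.
orderedFactors :: Integer -> [Integer]
orderedFactors n = go (push (1, 1) Empty)
  where
    primesDesc = reverse (map head (group (factorize n)))  -- distinct primes, largest first
    go Empty = []
    go h =
      let ((m, q), h') = popMin h
          -- loop through primes in decreasing order, stopping once p < q
          kids = [(m * p, p) | p <- takeWhile (>= q) primesDesc
                             , n `mod` (m * p) == 0]
      in  m : go (foldr push h' kids)
```

For n = 30 this yields 1, 2, 3, 5, 6, 10, 15, 30, each exactly once.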
[I'm posting this answer just as a formality for completeness. I've already chosen someone else's answer as the accepted answer.]
Overview of algorithm. In searching for the fastest way to generate an in-memory list of factors (64-bit unsigned values in my case), I settled upon a hybrid algorithm that implements a two-dimensional bucket sort, taking advantage of internal knowledge of the sort keys (i.e., they are just integers and can therefore be computed upon). The specific method is closer to a "ProxMapSort" but with two levels of keys (major, minor) instead of just one. The major key is simply the base-2 logarithm of the value. The minor key is the minimal number of most-significant bits of the value needed to produce a reasonable spread at the second layer of buckets. Factors are produced first into a temporary work array of unsorted factors. Next, their distribution is analyzed, and an array of bucket indexes is allocated and populated. Finally, the factors are stored directly into place in the final sorted array, using insertion sort within each bucket. The vast majority of buckets hold only 1, 2, or 3 values. Examples are given in the source code, which is attached at the bottom of this answer.
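(The author's C source is not reproduced in this text. Purely to illustrate the two-level keying idea, here is a small Haskell sketch; the names and the fixed 6-bit minor-key width are my own illustrative choices, not the original's, and a Map of list buckets stands in for the original's flat index array.)

```haskell
import Data.Bits ((.&.), countLeadingZeros, shiftL, shiftR)
import Data.List (insert)
import qualified Data.Map.Strict as M
import Data.Word (Word64)

minorWidth :: Int
minorWidth = 6  -- illustrative; the real code would size this from the data

-- major = floor(log2 v); minor = the next few bits below the leading bit.
-- Lexicographic order on (major, minor) is consistent with numeric order.
bucketKey :: Word64 -> (Int, Int)
bucketKey v = (major, minor)
  where
    major = 63 - countLeadingZeros v   -- v >= 1 for factor lists
    minor
      | major <= minorWidth = fromIntegral v
      | otherwise           = fromIntegral
          ((v `shiftR` (major - minorWidth)) .&. ((1 `shiftL` minorWidth) - 1))

-- Scatter values into buckets (each kept sorted by insertion, since buckets
-- are tiny in practice), then read the buckets back in key order.
bucketSort :: [Word64] -> [Word64]
bucketSort = concatMap snd . M.toAscList
           . foldr (\v -> M.insertWith (\_ old -> insert v old) (bucketKey v) [v]) M.empty
```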
Spatial complexity. Memory utilization is approximately 4x that of a Quicksort-based solution, but this is actually rather insignificant, as the maximum memory ever used in the worst case (for 64-bit input) is 5.5 MB, of which 4.0 MB is held for only a small handful of milliseconds.
Runtime complexity. Performance is far better than that of a hand-coded Quicksort-based solution: for numbers with a nontrivial number of factors, it is uniformly about 2.5× faster. On my 3.4 GHz Intel i7, it produces the 184,320 factors of 18,401,055,938,125,660,800 in sorted order in 0.0052 seconds -- about 96 clock cycles per factor, or about 35 million factors per second.
Graphs. Memory and runtime performance were profiled for the 47,616 canonical representatives of the equivalence classes of prime signatures of numbers up to 2⁶⁴–1. These are the so-called "highly factorable numbers" of the 64-bit search space.
Total runtime is ~2.5× better than a Quicksort-based solution for nontrivial factor counts (log–log plot of total runtime versus factor count).
The number of sorted factors produced per second is essentially the inverse of the above. Performance on a per-factor basis declines past the sweet spot of approximately 2,000 factors, but not by much. Performance is affected by the L1, L2, and L3 cache sizes, as well as by the number of unique prime factors of the number being factored, which grows roughly with the logarithm of the input value.
Peak memory usage appears as a straight line on the log–log plot, since it is proportional to the number of factors. Note that peak memory usage lasts only for a very brief period: short-lived work arrays are discarded within milliseconds. After the temporary arrays are discarded, what remains is the final list of factors -- the same minimal usage as in the Quicksort-based solution.
Source Code. Attached below is a proof-of-concept program in the C programming language (specifically, C11). It has been tested on x86-64 with Clang/LLVM, although it should work fine with GCC as well.
This answer gives a C implementation that I believe is the fastest and most memory-efficient.
Overview of algorithm. This algorithm is based on the bottom-up merge approach introduced by Will Ness in another answer, but is further simplified so that the lists being merged do not actually ever exist anywhere in memory. The head element of each list is groomed and kept in a small array, while all other elements of the lists are constructed on-the-fly as needed. This use of “phantom lists”—figments of the imagination of the running code—greatly reduces the memory footprint, as well as the volume of memory accesses, both read and write, and also improves spatial locality, which in turn significantly increases the speed of the algorithm. Factors at each level are written directly into their final resting place in the output array, in order.
The basic idea is to produce the factors using mathematical induction on the prime-power factorization. For example, to produce the factors of 360, the factors of 72 are computed and then multiplied by the relevant powers of 5, in this case {1,5}; to produce the factors of 72, the factors of 8 are computed and then multiplied by the relevant powers of 3, in this case {1,3,9}; to produce the factors of 8, the base case 1 is multiplied by the relevant powers of 2, in this case {1,2,4,8}. Thus, at each inductive step, a Cartesian product is calculated between sets of existing factors and sets of prime powers, and the results are reduced back to integers via multiplication.
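(As a rough illustration of the induction only -- this is not the author's phantom-list C code, which avoids materializing the intermediate lists -- one inductive step can be sketched in Haskell, reusing merge from the earlier sketch; step and factorsOf are my own names.)

```haskell
-- One inductive stage: from the sorted factors of n / p^k, produce the
-- sorted factors of n by merging k+1 scaled copies of the list.
step :: [Integer] -> (Integer, Int) -> [Integer]
step fs (p, k) = foldr1 merge [map (p ^ i *) fs | i <- [0 .. k]]

-- Builds 8's factors, then 72's, then 360's, exactly as described above:
-- step [1] (2,3)        == [1,2,4,8]
-- step [1,2,4,8] (3,2)  == [1,2,3,4,6,8,9,12,18,24,36,72]
factorsOf :: [(Integer, Int)] -> [Integer]
factorsOf = foldl step [1]   -- factorsOf [(2,3),(3,2),(5,1)] = factors of 360
```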
Below is an illustration of the process for the number 360. Left to right are memory cells; one row represents one time step, with time flowing downward.
Spatial complexity. Temporary data structures to produce the factors are extremely small: only O(log₂(n)) space is used, where n is the number whose factors are being generated. For example, in the 128-bit implementation of this algorithm in C, a number such as 333,939,014,887,358,848,058,068,063,658,770,598,400 (whose base-2 logarithm is ≈127.97) allocates 5.1 GB to store the list of its 318,504,960 factors, but uses only 360 bytes of additional overhead to produce that list. At most, approximately 5 KB overhead is needed for any 128-bit number.
Runtime complexity. Runtime depends entirely on the exponents of the prime-power factorization (i.e., on the prime signature). For best results, the largest exponents should be processed first and the smallest exponents last, so that the runtime in the final stages is dominated by low-complexity merges -- in practice, often binary merges. Asymptotic runtime is O(c(n) d(n)), where d(n) is the divisor count of n and c(n) is a constant dependent on the prime signature of n. That is, each equivalence class associated with a prime signature has a different constant. Prime signatures dominated by small exponents have smaller constants; prime signatures dominated by large exponents have larger constants. Thus, runtime complexity is clustered by prime signature.
Graphs. Runtime performance was profiled on a 3.4 GHz Intel Core i7 for a sampling of 66,591 values of n, covering distinct divisor counts d(n) up to 160 million. The largest value of n profiled for this graph was 14,550,525,518,294,259,162,294,162,737,840,640,000, which produced its 159,744,000 factors in 2.35 seconds.
The number of sorted factors produced per second is essentially the inverse of the above. Clustering by prime signature is apparent in the data. Performance is also affected by the L1, L2, and L3 cache sizes, as well as by physical RAM limitations.
Source Code. Attached below is a proof-of-concept program in the C programming language (specifically, C11). It has been tested on x86-64 with Clang/LLVM, although it should work fine with GCC as well.