How does Scala's Vector work?

Published 2020-05-20 08:01

Question:

I read this page about the time complexity of Scala collections. According to it, Vector's complexity is eC (effectively constant) for all operations.

It made me wonder what Vector is. I read the document and it says:

Because vectors strike a good balance between fast random selections and fast random functional updates, they are currently the default implementation of immutable indexed sequences. It is backed by a little endian bit-mapped vector trie with a branching factor of 32. Locality is very good, but not contiguous, which is good for very large sequences.

As with everything else about Scala, it's pretty vague. How does Vector actually work?

Answer 1:

The keyword here is Trie. Vector is implemented as a Trie data structure. See http://en.wikipedia.org/wiki/Trie.

More precisely, it is a "bit-mapped vector trie". I've just found a concise enough description of the structure (along with an implementation - apparently in Rust) here:

https://bitbucket.org/astrieanna/bitmapped-vector-trie

The most relevant excerpt is:

A Bitmapped Vector Trie is basically a 32-tree. Level 1 is an array of size 32, of whatever data type. Level 2 is an array of 32 Level 1's. and so on, until: Level 7 is an array of 2 Level 6's.
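
To make that concrete: with a branching factor of 32, each level of the trie consumes 5 bits of the index (2^5 = 32). Here is a small sketch of that index arithmetic in Scala (the helper below is purely illustrative, not code from the actual implementation):

    object TrieIndexDemo extends App {
      // Split an index into 5-bit chunks, one chunk per trie level,
      // most significant chunk first.
      def pathFor(index: Int, depth: Int): Seq[Int] =
        (depth - 1 to 0 by -1).map(level => (index >>> (level * 5)) & 0x1f)

      // Element 70 in a 2-level trie: slot 2 of the root, then slot 6
      // of that child, because 70 = 2 * 32 + 6.
      println(pathFor(70, depth = 2))        // Vector(2, 6)

      // Element 1,000,000 in a 4-level trie.
      println(pathFor(1000000, depth = 4))   // Vector(30, 16, 18, 0)
    }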

UPDATE: In reply to Lai Yu-Hsuan's comment about complexity:

I will have to assume you meant "depth" here :-D. The legend for "eC" says "The operation takes effectively constant time, but this might depend on some assumptions such as maximum length of a vector or distribution of hash keys.".

If you are willing to consider the worst case, and given that there is an upper bound on the maximum size of the vector, then yes, we can indeed say that the complexity is constant. Say we take the maximum size to be 2^32; then the worst case is at most 7 operations, no matter what. Then again, we can always take the worst case for any type of collection, find an upper bound, and call that constant complexity; but for a list, for example, that constant would be 4 billion, which is not quite practical.

But Vector is the opposite: 7 operations is more than practical, and this is why we can afford to treat its complexity as constant in practice.

Another way to look at this: we are not talking about log(2, N) but log(32, N). If you try to plot that, you'll see it is practically a horizontal line. So pragmatically speaking, you'll never see much increase in processing time as the collection grows. Yes, it's still not truly constant (which is why it is marked as "eC" and not just "C"), and you'll be able to see a difference for short vectors (but again, a very small difference, because the number of operations grows so slowly).
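
To put some numbers on log(32, N), here is a quick back-of-the-envelope check (plain arithmetic, nothing specific to the actual implementation):

    object DepthDemo extends App {
      // Levels needed to hold n elements with branching factor 32:
      // ceil(log32(n)) = ceil(ln n / ln 32)
      def depth(n: Long): Int =
        math.ceil(math.log(n.toDouble) / math.log(32)).toInt

      Seq(1000L, 1000000L, 1000000000L, 1L << 32).foreach { n =>
        println(f"$n%13d elements -> ${depth(n)} levels")
      }
      // prints 2, 4, 6 and 7 levels respectively: the curve is almost flat
    }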



Answer 2:

The other answers re 'Trie' are good. But as a close approximation, just for quick understanding:

  • Vector internally uses a tree structure - not a binary tree, but a 32-ary tree
  • Each '32-way node' is an Array of size 32 and can hold either up to 32 references to child nodes or up to 32 pieces of data
  • The tree is balanced in a particular way: it is "n" levels deep, but levels 1 to n-1 are "index-only levels" (100% child references, no data), while level n holds all the data (100% data, no child references). So if the number of data elements is "d", then n = log-base-32(d) rounded upwards (a minimal sketch of this shape follows the list)
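
A very stripped-down sketch of that shape, just to illustrate the mechanics (this is hypothetical demo code, not the real scala.collection.immutable.Vector, which additionally keeps a tail buffer, display pointers, and so on):

    object TrieVectorDemo extends App {

      sealed trait Node
      final case class Branch(children: Array[Node]) extends Node // "index-only" level
      final case class Leaf(values: Array[Int]) extends Node      // data level

      // A persistent 32-ary trie over Int, supporting lookup and functional update.
      final class TrieVector(root: Node, depth: Int, val length: Int) {

        // Random access: walk `depth` nodes, consuming 5 bits of the index per level.
        def apply(i: Int): Int = {
          var node  = root
          var level = depth - 1
          while (level > 0) {
            node = node.asInstanceOf[Branch].children((i >>> (level * 5)) & 0x1f)
            level -= 1
          }
          node.asInstanceOf[Leaf].values(i & 0x1f)
        }

        // Functional update: copy only the nodes on the path to `i`, share everything else.
        def updated(i: Int, value: Int): TrieVector = {
          def copyPath(node: Node, level: Int): Node = node match {
            case Leaf(values) =>
              val copy = values.clone(); copy(i & 0x1f) = value; Leaf(copy)
            case Branch(children) =>
              val slot = (i >>> (level * 5)) & 0x1f
              val copy = children.clone()
              copy(slot) = copyPath(children(slot), level - 1)
              Branch(copy)
          }
          new TrieVector(copyPath(root, depth - 1), depth, length)
        }
      }

      // A 2-level trie holding the values 0 until 1024: one Branch over 32 Leaves.
      val leaves: Array[Node] =
        Array.tabulate(32)(b => Leaf(Array.tabulate(32)(j => b * 32 + j)))
      val v  = new TrieVector(Branch(leaves), depth = 2, length = 1024)
      val v2 = v.updated(70, -1)

      println(v(70))   // 70 -- the original is untouched
      println(v2(70))  // -1
      println(v2(71))  // 71 -- nodes off the copied path are shared, not copied
    }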

Why this? Simple: for performance.

Instead of doing thousands/millions/gazillions of memory allocations for individual data elements, memory is allocated in 32-element chunks. Instead of walking miles deep to find your data, the structure is quite shallow - it's a very wide, short tree. E.g. 5 levels deep can hold 32^5 (about 33.5 million) data elements - roughly 134 MB of 4-byte elements - and each data access walks through at most 5 nodes from the root (whereas a big array would use a single data access). The vector does not proactively allocate memory for all of level n (the data level) - it allocates 32-element chunks as needed. Read performance is somewhat similar to a huge array, while the functional characteristics (power, flexibility and memory efficiency) are somewhat similar to a binary tree.
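
From the outside none of this machinery is visible; you just get array-like reads plus cheap functional updates through the standard Vector API, e.g.:

    val v  = Vector.tabulate(1000000)(i => i)  // a million elements, only a few levels deep
    val x  = v(123456)                // effectively constant-time random access
    val v2 = v.updated(123456, -1)    // copies only the handful of nodes on the path

    assert(v(123456) == 123456)       // the original vector is unchanged (persistence)
    assert(v2(123456) == -1)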

:)



Answer 3:

These may be of interest to you:

  • Ideal Hash Trees by Phil Bagwell.
  • Implementing Persistent Vectors in Scala - Daniel Spiewak
  • More Persistent Vectors: Performance Analysis - Daniel Spiewak
  • Persistent data structures in Scala