Optimizing lazy collections

2019-07-08 09:29发布

This question is about optimizing lazy collections. I will first explain the problem and then give some thoughts for a possible solution. Questions are in bold.

Problem

Swift expects operations on Collections to be O(1). Some operations, especially prefix and suffix-like types, deviate and are on the order of O(n) or higher.

Lazy collections can't iterate through the base collection during initialization since computation should be deferred for as long as possible until the value is actually needed.

So, how can we optimize lazy collections? And of course this begs the question, what constitutes an optimized lazy collection?

Thoughts

The most obvious solution is caching. This means that the first call to a collection's method has an unfavourable time complexity, but subsequent calls to the same or other methods can possibly be computed in O(1). We trade some space complexity to the order of O(n) for faster computation.

Attempting to optimize lazy collections on structs by using caching is impossible since subscript(_ position:) and all other methods that you'd need to implement to conform to LazyProtocolCollection are non-mutating and structs are immutable by default. This means that we have to recompute all operations for every call to a property or method.

This leaves us with classes. Classes are mutable, meaning that all computed properties and methods can internally mutate state. When we use classes to optimize a lazy collection we have two options. First, if the properties of the lazy type are variables then we're bringing ourselves into a world of hurt. If we change a property it could potentially invalidate previously cached results. I can imagine managing the code paths to make properties mutable to be headache inducing. Second, if we use lets we're good; the state set during initialization can't be changed so a cached result doesn't need to be updated. Note that we're only talking about lazy collections with pure methods without side effects here.

But classes are reference types. What are the downsides of using reference types for lazy collections? The Swift standard library doesn't use them for starters.

Any thoughts or thoughts on different approaches?

2条回答
贪生不怕死
2楼-- · 2019-07-08 10:12

I completely agree with Alexander here. If you're storing lazy collections, you're generally doing something wrong, and the cost of repeated accesses is going to constantly surprise you.

These collections already blow up their complexity requirements, it's true:

Note: The performance of accessing startIndex, first, or any methods that depend on startIndex depends on how many elements satisfy the predicate at the start of the collection, and may not offer the usual performance given by the Collection protocol. Be aware, therefore, that general operations on LazyDropWhileCollection instances may not have the documented complexity.

But caching won't fix that. They'll still be O(n) on the first access, so a loop like

for i in 0..<xs.count { print(xs[i]) }

is still O(n^2). Also remember that O(1) and "fast" are not the same thing. It feels like you're trying to get to "fast" but that doesn't fix the complexity promise (that said, lazy structures are already breaking their complexity promises in Swift).

Caching is a net-negative because it makes the normal (and expected) use of lazy data structures slower. The normal way to use lazy data structures is to consume them either zero or one times. If you were going to consume them more than one time, you should use a strict data structure. Caching something that you never use is a waste of time and space.

There are certainly conceivable use cases where you have a large data structure that will be sparsely accessed multiple times, and so caching would be useful, but this isn't the use case lazy was built to handle.

Attempting to optimize lazy collections on structs by using caching is impossible since subscript(_ position:) and all other methods that you'd need to implement to conform to LazyProtocolCollection are non-mutating and structs are immutable by default. This means that we have to recompute all operations for every call to a property or method.

This isn't true. A struct can internally store a reference type to hold its cache and this is common. Strings do exactly this. They include a StringBuffer which is a reference type (for reasons related to a Swift compiler bug, StringBuffer is actually implemented as a struct that wraps a class, but conceptually it is a reference type). Lots of value types in Swift store internal buffer classes this way, which allows them to be internally mutable while presenting an immutable interface. (It's also important for CoW and lots of other performance and memory related reasons.)

Note that adding caching today would also break existing use cases of lazy:

struct Massive {
    let id: Int
    // Lots of data, but rarely needed.
}

// We have lots of items that we look at occassionally
let ids = 0..<10_000_000

// `massives` is lazy. When we ask for something it creates it, but when we're 
// done with it, it's thrown away. If `lazy` forced caching, then everything 
// we accessed would be forever. Also, if the values in `Massive` change over
// time, I certainly may want it to be rebuilt at this point and not cached.
let massives = ids.lazy.map(Massive.init)
let aMassive = massives[10]

This isn't to say a caching data structure wouldn't be useful in some cases, but it certainly isn't always a win. It imposes a lot of costs and breaks some uses while helping others. So if you want those other use cases, you should build a data structure that provides them. But it's reasonable that lazy is not that tool.

查看更多
▲ chillily
3楼-- · 2019-07-08 10:29

Swift's lazy collections are intended to provide one off access to elements. Subsequent access cause redundant computation (e.g. a lazy map sequence would recompute the transform closure.

In the case where you want repeated access to elements, it's best to just slice the portion of the lazy sequence/collection you care about, and create a proper Collection (e.g. an Array) out of it.

The book keeping overhead of lazily evaluating and caching each element would probably be greater than the benefits.

查看更多
登录 后发表回答