What's a space leak?

2020-02-25 22:23发布

I found the haskell wiki page on space leaks, which claims to list examples of real-world leaks, which it doesn't. It doesn't really say what a space leak is; it just links to the page for memory leaks.

What's a space leak?

2条回答
等我变得足够好
2楼-- · 2020-02-25 23:04

As noted in @Rasko's answer, a space leak refers to a situation where a program or specific computation uses more (usually much more) memory than is necessary for the computation and/or expected by the programmer.

Haskell programs tend to be particularly susceptible to space leaks, mostly because of the lazy evaluation model (sometimes complicated by the way IO interacts with this model) and the highly abstract nature of the language which can make it difficult for a programmer to determine exactly how a particular computation is likely to be performed.

It helps to consider a specific example. This Haskell program:

main = print $ sum [1..1000000000]

is an idiomatic way to sum the first billion integers. Compiled with -O2, it runs in a few seconds in constant memory (a few megabytes, basically the runtime overhead).

Now, any programmer would expect a program to sum the first billion integers should run without chewing up memory, but it's actually a little surprising that this Haskell version is well behaved. After all, read literally, it constructs a list of a billion integers before summing them up, so it ought to require at least a few gigabytes (just for storage for the billion integers, not to mention the overhead of a Haskell linked list).

However, lazy evaluation ensures that the list is only generated as it's needed and -- equally importantly -- optimizations performed by the compiler ensure that as list elements are added to the accumulating sum, the program recognizes they are no longer needed and allows them to be garbage collected instead of keeping them around until the end of the computation. So, at any point during the computation, only a sliding "window" into the middle of the list needs to be kept in memory -- earlier elements have been discarded, and later elements are yet to be lazily computed. (In fact, the optimizations go further than this: no list is even constructed, but this is far from obvious to the programmer.)

Soooo... Haskell programmers get used to the idea that tossing around giant (or even infinite) data structures will "just work" with computations automatically using only the memory they need.

But, a minor change to the program, like also printing the length of the list as proof of all the hard work we are doing:

main = let vals = [1..1000000000]
       in print (sum vals, length vals)

suddenly causes space usage to explode to dozens of gigabytes (or in the case of my laptop, to about 13Gigs before it starts swapping hopelessly and I kill it).

This is a space leak. Calculating the sum and length of this list are obviously things that can be done in constant space using a "sliding window" view into the list, but the above program uses much more memory than needed. The reason, it turns out, is that once the list has been given a name vals that's used in two places, the compiler no longer allows the "used" elements to be immediately discarded. If the sum vals is evaluated first, the list is lazily generated and summed, but the entire, giant list is then kept around until length vals can be evaluated.

As a more practical example, you might write a simple program to count words and characters in a file:

main = do txt <- getContents
          print (length txt, length (words txt))

This works fine on small test files up to a couple megabytes, but it's noticeably sluggish on 10meg file, and if you try to run it on a 100meg file, it'll slowly but surely start gobbling up all available memory. Again, the problem is that -- even though the file contents are read lazily into txt -- because txt is used twice, the entire contents are read into memory as a Haskell String type (a memory-inefficient representation of large blocks of text) when, say, length txt is evaluated, and none of that memory can be freed until length (words txt) has also been computed.

Note that:

main = do txt <- getContents
          print $ length txt

and:

main = do txt <- getContents
          print $ length (words txt)

both run quickly in constant space even on big files.

As a side note, fixing the above space leak normally involves rewriting the computation so the characters and words are counted with one pass through the contents, so the compiler can determine that the contents of the file that have already been processed do not need to be kept around in memory until the end of the computation. One possible solution is:

{-# LANGUAGE BangPatterns #-}

import Data.List
import Data.Char

charsWords :: String -> (Int, Int)
charsWords str = let (_, chrs, wrds) = foldl' step (False, 0, 0) str
                 in (chrs, wrds)
  where step (inWord, cs, ws) c =
          let !cs' = succ cs
              !ws' = if not inWord && inWord' then succ ws else ws
              !inWord' = not (isSpace c)
          in (inWord', cs', ws')

main = do txt <- getContents
          print $ charsWords txt

The complexity of this solution (use of bang (!) patterns and an explicit fold instead of length and words) illustrates how tough space leaks can be, especially for new Haskell programmers. And it's not at all obvious that using foldl' instead of foldl makes no difference (but using foldr or foldr' would be a disaster!), that the bangs before cs' and ws' are critical to avoid a space leak, but that the bang before inWord' isn't (though it slightly improves performance), etc.

查看更多
可以哭但决不认输i
3楼-- · 2020-02-25 23:07

A space leak occurs when a computer program uses more memory than necessary. In contrast to memory leaks, where the leaked memory is never released, the memory consumed by a space leak is released, but later than expected. * Source

查看更多
登录 后发表回答