Finding the cause of a memory leak in Ruby

2019-01-10 07:14发布

问题:

I've discovered a memory leak in my Rails code - that is to say, I've found what code leaks but not why it leaks. I've reduced it down to a testcase that doesn't require Rails:

require 'csspool'
require 'ruby-mass'

def report
    puts 'Memory ' + `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i)[1].to_s + 'KB'
    Mass.print
end

report

# note I do not store the return value here
CSSPool::CSS::Document.parse(File.new('/home/jason/big.css'))

ObjectSpace.garbage_collect
sleep 1

report

ruby-mass supposedly lets me see all the objects in memory. CSSPool is a CSS parser based on racc. /home/jason/big.css is a 1.5MB CSS file.

This outputs:

Memory 9264KB

==================================================
 Objects within [] namespace
==================================================
  String: 7261
  RubyVM::InstructionSequence: 1151
  Array: 562
  Class: 313
  Regexp: 181
  Proc: 111
  Encoding: 99
  Gem::StubSpecification: 66
  Gem::StubSpecification::StubLine: 60
  Gem::Version: 60
  Module: 31
  Hash: 29
  Gem::Requirement: 25
  RubyVM::Env: 11
  Gem::Specification: 8
  Float: 7
  Gem::Dependency: 7
  Range: 4
  Bignum: 3
  IO: 3
  Mutex: 3
  Time: 3
  Object: 2
  ARGF.class: 1
  Binding: 1
  Complex: 1
  Data: 1
  Gem::PathSupport: 1
  IOError: 1
  MatchData: 1
  Monitor: 1
  NoMemoryError: 1
  Process::Status: 1
  Random: 1
  RubyVM: 1
  SystemStackError: 1
  Thread: 1
  ThreadGroup: 1
  fatal: 1
==================================================

Memory 258860KB

==================================================
 Objects within [] namespace
==================================================
  String: 7456
  RubyVM::InstructionSequence: 1151
  Array: 564
  Class: 313
  Regexp: 181
  Proc: 113
  Encoding: 99
  Gem::StubSpecification: 66
  Gem::StubSpecification::StubLine: 60
  Gem::Version: 60
  Module: 31
  Hash: 30
  Gem::Requirement: 25
  RubyVM::Env: 13
  Gem::Specification: 8
  Float: 7
  Gem::Dependency: 7
  Range: 4
  Bignum: 3
  IO: 3
  Mutex: 3
  Time: 3
  Object: 2
  ARGF.class: 1
  Binding: 1
  Complex: 1
  Data: 1
  Gem::PathSupport: 1
  IOError: 1
  MatchData: 1
  Monitor: 1
  NoMemoryError: 1
  Process::Status: 1
  Random: 1
  RubyVM: 1
  SystemStackError: 1
  Thread: 1
  ThreadGroup: 1
  fatal: 1
==================================================

You can see the memory going way up. Some of the counters go up, but no objects specific to CSSPool are present. I used ruby-mass's "index" method to inspect the objects that have references like so:

Mass.index.each do |k,v|
    v.each do |id|
        refs = Mass.references(Mass[id])
        puts refs if !refs.empty?
    end
end

But again, this doesn't give me anything related to CSSPool, just gem info and such.

I've also tried outputting "GC.stat"...

puts GC.stat
CSSPool::CSS::Document.parse(File.new('/home/jason/big.css'))
ObjectSpace.garbage_collect
sleep 1
puts GC.stat

Result:

{:count=>4, :heap_used=>126, :heap_length=>138, :heap_increment=>12, :heap_live_num=>50924, :heap_free_num=>24595, :heap_final_num=>0, :total_allocated_object=>86030, :total_freed_object=>35106}
{:count=>16, :heap_used=>6039, :heap_length=>12933, :heap_increment=>3841, :heap_live_num=>13369, :heap_free_num=>2443302, :heap_final_num=>0, :total_allocated_object=>3771675, :total_freed_object=>3758306}

As I understand it, if an object is not referenced and garbage collection happens, then that object should be cleared from memory. But that doesn't seem to be what's happening here.

I've also read about C-level memory leaks, and since CSSPool uses Racc which uses C code, I think this is a possibility. I've run my code through Valgrind:

valgrind --partial-loads-ok=yes --undef-value-errors=no --leak-check=full --fullpath-after= ruby leak.rb 2> valgrind.txt

Results are here. I'm not sure if this confirms a C-level leak, as I've also read that Ruby does things with memory that Valgrind doesn't understand.

Versions used:

  • Ruby 2.0.0-p247 (this is what my Rails app runs)
  • Ruby 1.9.3-p392-ref (for testing with ruby-mass)
  • ruby-mass 0.1.3
  • CSSPool 4.0.0 from here
  • CentOS 6.4 and Ubuntu 13.10

回答1:

It looks like you are entering The Lost World here. I don’t think the problem is with c-bindings in racc either.

Ruby memory management is both elegant and cumbersome. It stores objects (named RVALUEs) in so-called heaps of size of approx 16KB. On a low level, RVALUE is a c-struct, containing a union of different standard ruby object representations.

So, heaps store RVALUE objects, which size is not more than 40 bytes. For such objects as String, Array, Hash etc. this means that small objects can fit in the heap, but as soon as they reach a threshold, an extra memory outside of the Ruby heaps will be allocated.

This extra memory is flexible; is will be freed as soon as an object became GC’ed. That’s why your testcase with big_string shows the memory up-down behaviour:

def report
  puts 'Memory ' + `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`
          .strip.split.map(&:to_i)[1].to_s + 'KB'
end
report
big_var = " " * 10000000
report
big_var = nil 
report
ObjectSpace.garbage_collect
sleep 1
report
# ⇒ Memory 11788KB
# ⇒ Memory 65188KB
# ⇒ Memory 65188KB
# ⇒ Memory 11788KB

But the heaps (see GC[:heap_length]) themselves are not released back to OS, once acquired. Look, I’ll make a humdrum change to your testcase:

- big_var = " " * 10000000
+ big_var = 1_000_000.times.map(&:to_s)

And, voilá:

# ⇒ Memory 11788KB
# ⇒ Memory 65188KB
# ⇒ Memory 65188KB
# ⇒ Memory 57448KB

The memory is not released back to OS anymore, because each element of the array I introduced suits the RVALUE size and is stored in the ruby heap.

If you’ll examine the output of GC.stat after the GC was run, you’ll find that GC[:heap_used] value is decreased as expected. Ruby now has a lot of empty heaps, ready.

The summing up: I don’t think, the c code leaks. I think the problem is within base64 representation of huge image in your css. I have no clue, what’s happening inside parser, but it looks like the huge string forces the ruby heap count to increase.

Hope it helps.



回答2:

Okay, I found the answer. I am leaving my other answer up because that information was very difficult to gather, it is related, and it could help someone else searching for a related issue.

Your problem, however, appears to be due to the fact that Ruby actually does not release memory back to the Operating System once it has acquired it.

Memory Allocation

While Ruby programmers do not often worry about memory allocation, sometimes the following question comes up:

Why did my Ruby process stay so big even after I’ve cleared all references to big objects? I’m /sure/ GC has run several times and freed my big objects and I’m not leaking memory.

A C programmer might ask the same question:

I free()-ed a lot of memory, why is my process still so big?

Memory allocation to user space from the kernel is cheaper in large chunks, thus user space avoids interaction with the kernel by doing more work itself.

User space libraries/runtimes implement a memory allocator (e.g.: malloc(3) in libc) which takes large chunks of kernel memory2 and divides them up into smaller pieces for user space applications to use.

Thus, several user space memory allocations may occur before user space needs to ask the kernel for more memory. Thus if you got a large chunk of memory from the kernel and are only using a small part of that, that large chunk of memory remains allocated.

Releasing memory back to the kernel also has a cost. User space memory allocators may hold onto that memory (privately) in the hope it can be reused within the same process and not give it back to the kernel for use in other processes. (Ruby Best Practices)

So, your objects may very well have been garbage collected and released back to Ruby's available memory, but because Ruby never gives back unused memory to the OS, the rss value for the process remains the same, even after garbage collection. This is actually by design. According to Mike Perham:

...And since MRI never gives back unused memory, our daemon can easily be taking 300-400MB when it’s only using 100-200.

It’s important to note that this is essentially by design. Ruby’s history is mostly as a command line tool for text processing and therefore it values quick startup and a small memory footprint. It was not designed for long-running daemon/server processes. Java makes a similar tradeoff in its client and server VMs.



回答3:

This could be due to the "Lazy Sweeping" feature in Ruby 1.9.3 and above.

Lazy sweeping basically means that, during garbage collection, Ruby only "sweeps" away enough objects to create space for the new objects it needs to create. It does this because, while the Ruby garbage collector runs, nothing else does. This is known as "Stop the world" garbage collection.

Essentially, lazy sweeping reduces the time that Ruby needs to "stop the world." You can read more about lazy sweeping here.

What does your RUBY_GC_MALLOC_LIMIT environment variable look like?

Here is an excerpt from Sam Saffron's blog concerning lazy sweeping and the RUBY_GC_MALLOC_LIMIT:

The GC in Ruby 2.0 comes in 2 different flavors. We have a "full" GC that runs after we allocate more than our malloc_limit and a lazy sweep (partial GC) that will run if we ever run out of free slots in our heaps.

The lazy sweep takes less time than a full GC, however only performs a partial GC. It's goal is to perform a short GC more frequently thus increasing overall throughput. The world stops, but for less time.

The malloc_limit is set to 8MB out of the box, you can raise it by setting the RUBY_GC_MALLOC_LIMIT higher.

Is your RUBY_GC_MALLOC_LIMIT extremely high? Mine is set to 100000000 (100MB). The Default is around 8MB, but for rails apps they recommend it to be quite a bit higher. If yours is too high it could be preventing Ruby from deleting garbage objects, because it thinks it has plenty of room to grow.



回答4:

Building on @mudasobwa's explanation, I finally tracked down the cause. The code in CSSPool was checking the very long data URI for escape sequences. It would call scan on the URI with a regexp that matched an escape sequence or a single character, map those results to unescape, and then join it back into a string. This was effectively allocating a string for every character in the URI. I modified it to gsub the escape sequences, which seems to have the same results (all the tests pass) and greatly reduces the ending memory used.

Using the same test case as originally posted (minus the Mass.print output) this is the result before the change:

Memory 12404KB
Memory 292516KB

and this is the result after the change:

Memory 12236KB
Memory 19584KB