Question:
Using the following benchmark:
def create_genome
"gattaca" * 100
end
def count_frequency_using_chars(sequence)
100000.times do
sequence.chars.group_by{|x| x}.map{|letter, array| [letter, array.count]}
end
end
def count_frequency_using_count(sequence)
100000.times do
["a", "c", "g", "t"].map{|letter| sequence.count(letter)}
end
end
sequence = create_genome
count_frequency_using_chars(sequence)
count_frequency_using_count(sequence)
I found that, in C-based Ruby for both 1.8 and 1.9.2, using String#count(letter)
is approximately 50 times faster than grouping and counting them using Enumerable#group_by
and Array#count
. I was slightly surprised at this, because the String#count
approach reads through the string four times each iteration, whereas the latter only reads through it once.
I tried running the code under ruby-prof and perftools.rb, and both of them merely indicated that String#chars
took 90% of the time, with no break-down of where that 90% of time was spent.
If I had to guess why there's a difference, I'd say that creating 70 million single-character strings would be expensive, but how would I be able to know? (Update: String#chars
wasn't the culprit - see the benchmark for mainly_execute_a_trivial_block
)
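One way to test the guess about allocation cost is to count the string objects each approach creates (a rough sketch using ObjectSpace.count_objects; GC is disabled during each measurement so collected slots don't skew the delta):

```ruby
# Count how many T_STRING heap slots a block allocates.
def string_allocations
  GC.disable
  before = ObjectSpace.count_objects[:T_STRING]
  yield
  delta = ObjectSpace.count_objects[:T_STRING] - before
  GC.enable
  delta
end

sequence = "gattaca" * 100

# chars builds one single-character string per character (700 here).
chars_allocs = string_allocations { sequence.chars }
# count only scans the bytes; almost nothing is allocated.
count_allocs = string_allocations { ["a", "c", "g", "t"].map { |l| sequence.count(l) } }

puts "chars: #{chars_allocs} strings, count: #{count_allocs} strings"
```

Multiplying the chars figure by 100,000 iterations gives the 70 million single-character strings guessed at above.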
Edit: Current benchmarks using 1.9.2 patchlevel 180:
require 'pp'
require 'benchmark'
def create_genome
"gattaca" * 100
end
ZILLION = 100000
def count_frequency_using_count(sequence)
ZILLION.times do
["a", "c", "g", "t"].map{|letter| sequence.count(letter)}
end
end
def count_frequency_using_chars(sequence)
ZILLION.times do
sequence.chars.group_by{|x| x}.map{|letter, array| [letter, array.count]}
end
end
def count_frequency_using_inject_hash(sequence)
ZILLION.times do
sequence.chars.inject(Hash.new(0)) { |h, e| h[e] += 1 ; h }
end
end
def count_frequency_using_each_with_object(sequence)
ZILLION.times do
sequence.chars.each_with_object(Hash.new(0)) { |char, hash| hash[char] += 1}
end
end
def just_group_by(sequence)
ZILLION.times do
sequence.chars.group_by{|x| x}
end
end
def just_chars_and_trivial_block(sequence)
ZILLION.times do
sequence.chars() {}
end
end
def mainly_execute_a_trivial_block(sequence)
ZILLION.times do
sequence.length.times() {}
end
end
def execute_an_empty_loop_instead(sequence)
ZILLION.times do
i = 0
max = sequence.length
until i == max
i += 1
end
end
end
sequence = create_genome
puts RUBY_VERSION
Benchmark.bm do |benchmark|
benchmark.report do
count_frequency_using_count(sequence)
end
benchmark.report do
count_frequency_using_chars(sequence)
end
benchmark.report do
count_frequency_using_inject_hash(sequence)
end
benchmark.report do
count_frequency_using_each_with_object(sequence)
end
benchmark.report do
just_group_by(sequence)
end
benchmark.report do
just_chars_and_trivial_block(sequence)
end
benchmark.report do
mainly_execute_a_trivial_block(sequence)
end
benchmark.report do
execute_an_empty_loop_instead(sequence)
end
end
Results:
user system total real
0.510000 0.000000 0.510000 ( 0.508499) # count_frequency_using_count
23.470000 0.030000 23.500000 ( 23.575487) # count_frequency_using_chars
32.770000 0.050000 32.820000 ( 32.874634) # count_frequency_using_inject_hash
31.880000 0.040000 31.920000 ( 31.942437) # count_frequency_using_each_with_object
22.950000 0.030000 22.980000 ( 22.970500) # just_group_by
13.300000 0.020000 13.320000 ( 13.314466) # just_chars_and_trivial_block
5.660000 0.000000 5.660000 ( 5.661516) # mainly_execute_a_trivial_block
1.930000 0.010000 1.940000 ( 1.934861) # execute_an_empty_loop_instead
Answer 1:
It's nothing to do with the Ruby internals; you are comparing apples with oranges.
In your first example, you group the 700-character string 100,000 times and then count, so the cost is in your grouping logic, not in the counting. In the second approach you only count. And in both approaches you are ultimately just using count.
Just change the first example like this:
def count_frequency_using_chars(sequence)
grouped = sequence.chars.group_by{|x| x}
100000.times do
grouped.map{|letter, array| [letter, array.count]}
end
end
And it's as fast as your second approach.
Edit:
This approach is about 3x faster than count_frequency_using_count
; check the benchmarks:
def count_frequency_using_chars_with_single_group(sequence)
grouped = sequence.chars.group_by{|x| x}
100000.times do
grouped.map{|letter, array| [letter, array.count]}
end
end
def count_frequency_using_count(sequence)
100000.times do
["a", "c", "g", "t"].map{|letter| sequence.count(letter)}
end
end
Benchmark.bm do |benchmark|
benchmark.report do
pp count_frequency_using_chars_with_single_group(sequence)
end
benchmark.report do
pp count_frequency_using_count(sequence)
end
end
user system total real
0.410000 0.000000 0.410000 ( 0.419100)
1.330000 0.000000 1.330000 ( 1.324431)
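For context, hoisting the group_by out of the loop changes what is being measured: only the map over four small arrays is timed, while count still scans the whole string on every iteration. A quick check (a sketch, with the iteration count cut to 1,000 for speed) that keeps group_by inside the loop shows the original ordering again:

```ruby
require 'benchmark'

sequence = "gattaca" * 100
N = 1_000

# group_by re-done on every iteration, as in the question's benchmark
regrouped = Benchmark.realtime do
  N.times { sequence.chars.group_by { |x| x }.map { |l, a| [l, a.count] } }
end
# count re-done on every iteration
counted = Benchmark.realtime do
  N.times { ["a", "c", "g", "t"].map { |l| sequence.count(l) } }
end

puts format("group_by in loop: %.3fs, count in loop: %.3fs", regrouped, counted)
```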
Andrew, regarding your comment that this should be
measuring the character composition of 100000 sequences once each, not the character composition of one sequence 100000 times
: your count approach is still far slower than the group_by approach. I benchmarked large strings as you suggested:
seq = "gattaca" * 10000
#seq length is 70000
arr_seq = (1..10).map {|x| seq}
#10 seq items
and changed the methods to handle multiple sequences:
def count_frequency_using_chars_with_single_group(sequences)
sequences.each do |sequence|
grouped = sequence.chars.group_by{|x| x}
100000.times do
grouped.map{|letter, array| [letter, array.count]}
end
end
end
def count_frequency_using_count(sequences)
sequences.each do |sequence|
100000.times do
["a", "c", "g", "t"].map{|letter| sequence.count(letter)}
end
end
end
Benchmark.bm do |benchmark|
benchmark.report do
pp count_frequency_using_chars_with_single_group(arr_seq)
end
benchmark.report do
pp count_frequency_using_count(arr_seq)
end
end
Processing 10 sequences, each 70,000 characters long, 100,000 times each:
user system total real
3.710000 0.040000 3.750000 ( 23.452889) #modified group_by approach
1039.180000 6.920000 1046.100000 (1071.374730) #simple char count approach
Your simple char-count approach is roughly 46x slower (by real time) than the modified group_by approach for high-volume strings. I ran the above benchmark for just 10 sequences, each 70,000 characters long. Assume this for 100 or 1000 sequences; the simple count would never be an option, right?
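The cache-once idea can be reduced to its essence: compute the frequency hash a single time per sequence and reuse it (a minimal sketch; each_char avoids materialising the intermediate array that chars builds):

```ruby
# Single pass over the string, one hash update per character.
def frequencies(sequence)
  counts = Hash.new(0)
  sequence.each_char { |c| counts[c] += 1 }
  counts
end

freq = frequencies("gattaca" * 100)
p freq  # => {"g"=>100, "a"=>300, "t"=>200, "c"=>100}
```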
Answer 2:
The thing that's slow is the group_by. Even though the count method needs four passes over the string, the group_by method is far slower because group_by does much more work per pass.
Breaking the code out a bit to have a method which does only the group by:
def create_genome
"gattaca" * 100
end
def count_frequency_using_chars(sequence)
100000.times do
sequence.chars.group_by{|x| x}.map{|letter, array| [letter, array.count]}
end
end
def count_frequency_using_count(sequence)
100000.times do
["a", "c", "g", "t"].map{|letter| sequence.count(letter)}
end
end
def just_group_by(sequence)
100000.times do
sequence.chars.group_by{|x| x}
end
end
sequence = create_genome
...
ruby-1.9.2-p180 :068 > puts Time.now()
2011-06-17 11:17:36 -0400
ruby-1.9.2-p180 :069 > count_frequency_using_chars(sequence)
=> 100000
ruby-1.9.2-p180 :070 > puts Time.now()
2011-06-17 11:18:07 -0400
ruby-1.9.2-p180 :071 > count_frequency_using_count(sequence)
=> 100000
ruby-1.9.2-p180 :072 > puts Time.now()
2011-06-17 11:18:08 -0400
ruby-1.9.2-p180 :073 > just_group_by(sequence)
=> 100000
ruby-1.9.2-p180 :074 > puts Time.now()
2011-06-17 11:18:37 -0400
You can see that:
- using group_by took 31 seconds
- using count took 1 second
- just doing the group_by took 29 seconds
While using the group_by to get the information you need is good conceptually, it's doing extra work that you don't need done.
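On Rubies newer than the 1.9.2 used here (2.7 and later), Enumerable#tally does exactly this single-pass count without the extra grouping work:

```ruby
# tally builds the frequency hash in one pass, no intermediate groups.
counts = ("gattaca" * 100).chars.tally
p counts  # => {"g"=>100, "a"=>300, "t"=>200, "c"=>100}
```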
Answer 3:
A seemingly faster way would be to pass over the characters only once:
hash = sequence.chars.inject(Hash.new(0)) { |h, e| h[e] += 1 ; h }
=> {"g"=>100, "a"=>300, "t"=>200, "c"=>100}
but actually it's NOT faster. This was a total copy & paste fail. I'll leave the answer anyway, since it shows how to use Benchmark
from the standard library.
require 'pp'
require 'benchmark'
def count_frequency_using_inject_hash(sequence)
100000.times do
sequence.chars.inject(Hash.new(0)) { |h, e| h[e] += 1 ; h }
end
end
# create_genome, count_frequency_using_chars and count_frequency_using_count
# are as defined in the question above.
sequence = create_genome
Benchmark.bm do |benchmark|
benchmark.report do
pp count_frequency_using_chars(sequence)
end
benchmark.report do
pp count_frequency_using_count(sequence)
end
benchmark.report do
pp count_frequency_using_inject_hash(sequence)
end
end
user system total real
41.500000 0.000000 41.500000 ( 42.484375)
1.312000 0.000000 1.312000 ( 1.406250)
49.265000 0.000000 49.265000 ( 49.348633)
Answer 4:
You can see what's going on better by profiling the VM itself.
A block being yielded an excessive number of times is the culprit here. If you use perftools' profiler for the VM, following the instructions listed under "Profiling the Ruby VM and C extensions" at https://github.com/tmm1/perftools.rb (note: this is more or less vanilla perftools, not perftools.rb), you get:
Removing _init from all stack traces.
Total: 3883 samples
1321 34.0% 34.0% 3883 100.0% rb_yield_0
273 7.0% 41.1% 274 7.1% _IO_str_pbackfail
191 4.9% 46.0% 191 4.9% __i686.get_pc_thunk.bx
171 4.4% 50.4% 171 4.4% _init
131 3.4% 53.7% 3880 99.9% rb_eval
122 3.1% 56.9% 347 8.9% st_lookup
105 2.7% 59.6% 423 10.9% new_dvar
93 2.4% 62.0% 326 8.4% rb_newobj
89 2.3% 64.3% 89 2.3% _setjmp
77 2.0% 66.3% 400 10.3% str_new
67 1.7% 68.0% 357 9.2% dvar_asgn_internal
63 1.6% 69.6% 204 5.3% malloc
62 1.6% 71.2% 3820 98.4% rb_str_each_char
58 1.5% 72.7% 187 4.8% rb_ary_store
55 1.4% 74.1% 55 1.4% rb_memcmp
55 1.4% 75.5% 3883 100.0% rb_yield
# rest snipped for brevity
As you can see, rb_yield_0
is accounting for over a third of the activity, so even if you could optimize everything else, you'd still be slower than if you were using String#count
.
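You can count the yields directly to see the scale: group_by calls its block once per character, so each of the 100,000 iterations performs 700 Ruby-level block calls (70 million in total), while String#count performs none:

```ruby
sequence = "gattaca" * 100

# Count how many times group_by yields to its block in one pass.
yields = 0
sequence.chars.group_by { |c| yields += 1; c }
puts yields  # => 700 block calls for a single grouping pass
```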
You can also confirm this by doing a benchmark where you're just creating a block that doesn't do anything.
require 'pp'
require 'benchmark'
def create_genome
"gattaca" * 100
end
ZILLION = 100000
def mainly_execute_a_trivial_block(sequence)
ZILLION.times do
sequence.length.times() {}
end
end
def execute_an_empty_loop_instead(sequence)
ZILLION.times do
i = 0
max = sequence.length
until i == max
i += 1
end
end
end
sequence = create_genome
puts RUBY_VERSION
Benchmark.bm do |benchmark|
benchmark.report do
pp mainly_execute_a_trivial_block(sequence)
end
benchmark.report do
pp execute_an_empty_loop_instead(sequence)
end
end
gives
user system total real
5.700000 0.000000 5.700000 ( 5.727715) # mainly_execute_a_trivial_block
1.920000 0.000000 1.920000 ( 1.942096) # execute_an_empty_loop_instead