Number crunching in Ruby (optimisation needed)

2019-07-22 05:08发布

站内文章 / Ruby

25 0

仙女界的扛把子

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

Ruby may not be the optimal language for this but I'm sort of comfortable working with this in my terminal so that's what I'm going with.

I need to process the numbers from 1 to 666666 so I pin out all the numbers that contain 6 but doesn't contain 7, 8 or 9. The first number will be 6, the next 16, then 26 and so forth. Then I needed it printed like this (6=6) (16=6) (26=6) and when I have ranges like 60 to 66 I need it printed like (60 THRU 66=6) (SPSS syntax).

I have this code and it works but it's neither beautiful nor very efficient so how could I optimize it?

(silly code may follow)

class Array
  def to_ranges
    array = self.compact.uniq.sort
    ranges = []
    if !array.empty?
      # Initialize the left and right endpoints of the range
      left, right = array.first, nil
      array.each do |obj|
        # If the right endpoint is set and obj is not equal to right's successor 
        # then we need to create a range.
        if right && obj != right.succ
          ranges << Range.new(left,right)
          left = obj
        end
        right = obj
      end
      ranges << Range.new(left,right) unless left == right
    end
    ranges
  end
end

write = ""
numbers = (1..666666).to_a

# split each number in an array containing it's ciphers
numbers = numbers.map { |i| i.to_s.split(//) }

# delete the arrays that doesn't contain 6 and the ones that contains 6 but also 8, 7 and 9
numbers = numbers.delete_if { |i| !i.include?('6') }
numbers = numbers.delete_if { |i| i.include?('7') }
numbers = numbers.delete_if { |i| i.include?('8') }
numbers = numbers.delete_if { |i| i.include?('9') }

# join the ciphers back into the original numbers
numbers = numbers.map { |i| i.join }

numbers = numbers.map { |i| i = Integer(i) }

# rangify consecutive numbers
numbers = numbers.to_ranges

# edit the ranges that go from 1..1 into just 1
numbers = numbers.map do |i|
  if i.first == i.last
    i = i.first
  else
    i = i
  end
end

# string stuff
numbers = numbers.map { |i| i.to_s.gsub(".."," thru ") }

numbers = numbers.map { |i| "(" + i.to_s + "=6)"}

numbers.each { |i| write << " " + i }

File.open('numbers.txt','w') { |f| f.write(write) }

As I said it works for numbers even in the millions - but I'd like some advice on how to make prettier and more efficient.

回答1:

I deleted my earlier attempt to parlez-vous-ruby? and made up for that. I know have an optimized version of x3ro's excellent example.

$,="\n"
puts ["(0=6)", "(6=6)", *(1.."66666".to_i(7)).collect {|i| i.to_s 7}.collect do |s|
    s.include?('6')? "(#{s}0 THRU #{s}6=6)" : "(#{s}6=6)"
end ]

Compared to x3ro's version

... It is down to three lines

... 204.2 x faster (to 66666666)

... has byte-identical output

It uses all my ideas for optimization

gen numbers based on modulo 7 digits (so base-7 numbers)
generate the last digit 'smart': this is what compresses the ranges

So... what are the timings? This was testing with 8 digits (to 66666666, or 823544 lines of output):

$ time ./x3ro.rb >  /dev/null

real    8m37.749s
user    8m36.700s
sys 0m0.976s

$ time ./my.rb >  /dev/null

real    0m2.535s
user    0m2.460s
sys 0m0.072s

Even though the performance is actually good, it isn't even close to the C optimized version I posted before: I couldn't run my.rb to 6666666666 (6x10) because of OutOfMemory. When running to 9 digits, this is the comparative result:

sehe@meerkat:/tmp$ time ./my.rb >  /dev/null

real    0m21.764s
user    0m21.289s
sys 0m0.476s

sehe@meerkat:/tmp$ time ./t2 > /dev/null

real    0m1.424s
user    0m1.408s
sys 0m0.012s

The C version is still some 15x faster... which is only fair considering that it runs on the bare metal.

Hope you enjoyed it, and can I please have your votes if only for learning Ruby for the purpose :)

(Can you tell I'm proud? This is my first encounter with ruby; I started the ruby koans 2 hours ago...)

Edit by @johndouthat:

Very nice! The use of base7 is very clever and this a great job for your first ruby trial :)

Here's a slight modification of your snippet that will let you test 10+ digits without getting an OutOfMemory error:

puts ["(0=6)", "(6=6)"]
(1.."66666666".to_i(7)).each do |i| 
  s = i.to_s(7)
  puts s.include?('6') ? "(#{s}0 THRU #{s}6=6)" : "(#{s}6=6)"
end

# before:
real    0m26.714s
user    0m23.368s
sys 0m2.865s
# after
real    0m15.894s
user    0m13.258s
sys 0m1.724s

回答2:

Exploiting patterns in the numbers, you can short-circuit lots of the loops, like this:

If you define a prefix as the 100s place and everything before it, and define the suffix as everything in the 10s and 1s place, then, looping through each possible prefix:

If the prefix is blank (i.e. you're testing 0-99), then there are 13 possible matches
elsif the prefix contains a 7, 8, or 9, there are no possible matches.
elsif the prefix contains a 6, there are 49 possible matches (a 7x7 grid)
else, there are 13 possible matches. (see the image below)

(the code doesn't yet exclude numbers that aren't specifically in the range, but it's pretty close)

number_range = (1..666_666)
prefix_range = ((number_range.first / 100)..(number_range.last / 100))

for p in prefix_range
  ps = p.to_s

  # TODO: if p == prefix_range.last or p == prefix_range.first,
  # TODO: test to see if number_range.include?("#{ps}6".to_i), etc...

  if ps == '0'
    puts "(6=6) (16=6) (26=6) (36=6) (46=6) (56=6) (60 thru 66) "

  elsif ps =~ /7|8|9/
    # there are no candidate suffixes if the prefix contains 7, 8, or 9.

  elsif ps =~ /6/
    # If the prefix contains a 6, then there are 49 candidate suffixes
    for i in (0..6)
      print "(#{ps}#{i}0 thru #{ps}#{i}6) "
    end
    puts

  else
    # If the prefix doesn't contain 6, 7, 8, or 9, then there are only 13 candidate suffixes.
    puts "(#{ps}06=6) (#{ps}16=6) (#{ps}26=6) (#{ps}36=6) (#{ps}46=6) (#{ps}56=6) (#{ps}60 thru #{ps}66) "

  end
end

Which prints out the following:

(6=6) (16=6) (26=6) (36=6) (46=6) (56=6) (60 thru 66) 
(106=6) (116=6) (126=6) (136=6) (146=6) (156=6) (160 thru 166) 
(206=6) (216=6) (226=6) (236=6) (246=6) (256=6) (260 thru 266) 
(306=6) (316=6) (326=6) (336=6) (346=6) (356=6) (360 thru 366) 
(406=6) (416=6) (426=6) (436=6) (446=6) (456=6) (460 thru 466) 
(506=6) (516=6) (526=6) (536=6) (546=6) (556=6) (560 thru 566) 
(600 thru 606) (610 thru 616) (620 thru 626) (630 thru 636) (640 thru 646) (650 thru 656) (660 thru 666) 
(1006=6) (1016=6) (1026=6) (1036=6) (1046=6) (1056=6) (1060 thru 1066) 
(1106=6) (1116=6) (1126=6) (1136=6) (1146=6) (1156=6) (1160 thru 1166) 
(1206=6) (1216=6) (1226=6) (1236=6) (1246=6) (1256=6) (1260 thru 1266) 
(1306=6) (1316=6) (1326=6) (1336=6) (1346=6) (1356=6) (1360 thru 1366) 
(1406=6) (1416=6) (1426=6) (1436=6) (1446=6) (1456=6) (1460 thru 1466) 
(1506=6) (1516=6) (1526=6) (1536=6) (1546=6) (1556=6) (1560 thru 1566) 
(1600 thru 1606) (1610 thru 1616) (1620 thru 1626) (1630 thru 1636) (1640 thru 1646) (1650 thru 1656) (1660 thru 1666)

etc...

回答3:

Note I don't speak ruby, but I ~~intend to do~~have done a ruby version later just for speed comparison :)

If you just iterate all numbers from 0 to 117648 (ruby <<< 'print "666666".to_i(7)') and print them in base-7 notation, you'll at least have discarded any numbers containing 7,8,9. This includes the optimization suggestion by MrE, apart from lifting the problem to simple int arithmetic instead of char-sequence manipulations.

All that remains, is to check for the presence of at least one 6. This would make the algorithm skip at most 6 items in a row, so I deem it less unimportant (the average number of skippable items on the total range is 40%).

Simple benchmark to 6666666666

(Note that this means outputting 222,009,073 (222M) lines of 6-y numbers)

Staying close to this idea, I wrote this quite highly optimized C code (I don't speak ruby) to demonstrate the idea. I ran it to 282475248 (congruent to 6666666666 (mod 7)) so it was more of a benchmark to measure: 0m26.5s

#include <stdio.h>

static char buf[11]; 
char* const bufend = buf+10;

char* genbase7(int n)
{
    char* it = bufend; int has6 = 0;
    do
    { 
        has6 |= 6 == (*--it = n%7); 
        n/=7; 
    } while(n);

    return has6? it : 0;
}

void asciify(char* rawdigits)
{
    do { *rawdigits += '0'; } 
    while (++rawdigits != bufend);
}

int main()
{
    *bufend = 0; // init

    long i;
    for (i=6; i<=282475248; i++)
    {
        char* b7 = genbase7(i);
        if (b7)
        {
            asciify(b7);
            puts(b7);
        }
    }
}

I also benchmarked another approach, which unsurprisingly ran in less than half the time because

this version directly manipulates the results in ascii string form, ready for display
this version shortcuts the has6 flag for deeper recursion levels
this version also optimizes the 'twiddling' of the last digit when it is required to be '6'
the code is simply shorter...

Running time: 0m12.8s

#include <stdio.h>
#include <string.h>

inline void recursive_permute2(char* const b, char* const m, char* const e, int has6)
{
    if (m<e)
        for (*m = '0'; *m<'7'; (*m)++)
            recursive_permute2(b, m+1, e, has6 || (*m=='6'));
    else
        if (has6)
            for (*e = '0'; *e<'7'; (*e)++)
                puts(b);
        else /* optimize for last digit must be 6 */
            puts((*e='6', b));
}

inline void recursive_permute(char* const b, char* const e)
{
    recursive_permute2(b, b, e-1, 0);
}

int main()
{
    char buf[] = "0000000000"; 
    recursive_permute(buf, buf+sizeof(buf)/sizeof(*buf)-1);
}

Benchmarks measured with:

gcc -O4 t6.c -o t6
time ./t6 > /dev/null

回答4:

$range_start = -1
$range_end = -1
$f = File.open('numbers.txt','w')

def output_number(i)
  if $range_end ==  i-1
    $range_end = i
  elsif $range_start < $range_end
    $f.puts "(#{$range_start} thru #{$range_end})"
    $range_start = $range_end = i
  else
    $f.puts "(#{$range_start}=6)" if $range_start > 0 # no range, print out previous number
    $range_start = $range_end = i
  end
end

'1'.upto('666') do |n|
  next unless n =~ /6/ # keep only numbers that contain 6
  next if n =~ /[789]/ # remove nubmers that contain 7, 8 or 9
  output_number n.to_i
end
if $range_start < $range_end
  $f.puts "(#{$range_start} thru #{$range_end})"
end
$f.close

puts "Ruby is beautiful :)"

回答5:

I came up with this piece of code, which I tried to keep more or less in FP-styling. Probably not much more efficient (as it has been said, with basic number logic you will be able to increase performance, for example by skipping from 19xx to 2000 directly, but that I will leave up to you :)

def check(n)
    n = n.to_s
    n.include?('6') and
    not n.include?('7') and
    not n.include?('8') and
    not n.include?('9')
end

def spss(ranges)
  ranges.each do |range|
    if range.first === range.last
      puts "(" + range.first.to_s + "=6)"
    else
      puts "(" + range.first.to_s + " THRU " + range.last.to_s + "=6)"
    end
  end
end

range = (1..666666)

range = range.select { |n| check(n) }

range = range.inject([0..0]) do |ranges, n|
  temp = ranges.last
  if temp.last + 1 === n
    ranges.pop
    ranges.push(temp.first..n)
  else
    ranges.push(n..n)
  end
end

spss(range)

回答6:

My first answer was trying to be too clever. Here is a much simpler version

class MutablePrintingCandidateRange < Struct.new(:first, :last)
  def to_s
    if self.first == nil and self.last == nil
      ''
    elsif self.first == self.last
      "(#{self.first}=6)"
    else
      "(#{self.first} thru #{self.last})"
    end
  end

  def <<(x)
    if self.first == nil and self.last == nil
      self.first = self.last = x
    elsif self.last == x - 1
      self.last = x
    else
      puts(self) # print the candidates
      self.first = self.last = x # reset the range
    end
  end

end

and how to use it:

numer_range = (1..666_666)
current_range = MutablePrintingCandidateRange.new

for i in numer_range
  candidate = i.to_s

  if candidate =~ /6/ and candidate !~ /7|8|9/
    # number contains a 6, but not a 7, 8, or 9
    current_range << i
  end
end
puts current_range

回答7:

Basic observation: If the current number is (say) 1900 you know that you can safely skip up to at least 2000...

回答8:

(I didn't bother updating my C solution for formatting. Instead I went with x3ro's excellent ruby version and optimized that)

Undeleted:

I still am not sure whether the changed range-notation behaviour isn't actually what the OP wants: This version changes the behaviour of breaking up ranges that are actually contiguous modulo 6; I wouldn't be surprised the OP actually expected .

....
(555536=6)
(555546=6)
(555556 THRU 666666=6)

instead of

....
(666640 THRU 666646=6)
(666650 THRU 666656=6)
(666660 THRU 666666=6)

I'll let the OP decide, and here is the modified version, which runs in 18% of the time as x3ro's version (3.2s instead of 17.0s when generating up to 6666666 (7x6)).

def check(n)
    n.to_s(7).include?('6')
end

def spss(ranges)
  ranges.each do |range|
    if range.first === range.last
      puts "(" + range.first.to_s(7) + "=6)"
    else
      puts "(" + range.first.to_s(7) + " THRU " + range.last.to_s(7) + "=6)"
    end
  end
end

range = (1..117648)

range = range.select { |n| check(n) }

range = range.inject([0..0]) do |ranges, n|
  temp = ranges.last
  if temp.last + 1 === n
    ranges.pop
    ranges.push(temp.first..n)
  else
    ranges.push(n..n)
  end
end

spss(range)

回答9:

My answer below is not complete, but just to show a path (I might come back and continue the answer):

There are only two cases:

1) All the digits besides the lowest one is either absent or not 6

6, 16, ...

2) At least one digit besides the lowest one includes 6

60--66, 160--166, 600--606, ...

Cases in (1) do not include any continuous numbers because they all have 6 in the lowest digit, and are different from one another. Cases in (2) all appear as continuous ranges where the lowest digit continues from 0 to 6. Any single continuation in (2) is not continuous with another one in (2) or with anything from (1) because a number one less than xxxxx0 will be xxxxy9, and a number one more than xxxxxx6 will be xxxxxx7, and hence be excluded.

Therefore, the question reduces to the following:

Get all strings between "" to "66666" that do not include "6"
For each of them ("xxx"), output the string "(xxx6=6)"

Get all strings between "" to "66666" that include at least one "6"
For each of them ("xxx"), output the string "(xxx0 THRU xxx6=6)"

回答10:

The killer here is

numbers = (1..666666).to_a

Range supports iterations so you would be better off by going over the whole range and accumulating numbers that include your segments in blocks. When one block is finished and supplanted by another you could write it out.