Word parser script and implementing memoization

2019-09-11 14:22发布

问题:

Description

Given a dictionary, my program generates two output files, 'sequences.txt' and 'words.txt'.

  • 'sequences' contain every sequence of four letters (A-z) that appear in exactly one word of the dictionary, one sequence per line.
  • 'words' will contain the corresponding words that contain the sequences, in the same order, again one per line.

For example, given spec/fixtures/sample_words.txt dictionary containing only

arrows
carrots
give
me

The outputs should be:

'sequences'             'words'

carr                    carrots
give                    give
rots                    carrots
rows                    arrows
rrot                    carrots
rrow                    arrows

Of course, 'arro' does not appear in the output since it is found in more than one word.

What I've come up with so far

Project structure:

├── Gemfile
├── Gemfile.lock
├── examples
│   └── dictionary.txt
├── lib
│   └── word_sequence_parser.rb
├── main.rb
├── output
├── readme.md
└── spec
    ├── fixtures
    │   └── sample_words.txt
    └── word_sequence_parser_spec.rb

To run the script: ruby main.rb examples/dictionary.txt

main.rb

require_relative 'lib/word_sequence_parser.rb'

dict_path = ARGV.shift

if dict_path.nil?
  dict_path = 'spec/fixtures/sample_words.txt'
end

parser = WordSequenceParser.new(dict_path)

# step 1 - Opens dictionary file and generates a new set of words
parser.set

# step 2 - Parses word sequences
parser.sequence

# step 3 - Prints to files in ./output
parser.dump_text

The script that works

word_sequence_parser.rb

require 'set'

class WordSequenceParser

  def initialize(path)
    @path = path
  end

  def set
    set = Set.new

    File.open(@path) do |f|
      f.each_line do |line|
        set.add(line.chomp.downcase)
      end
    end
    set
  end

  def sequence
    sequences = Set.new
    words = Set.new
    to_remove = Set.new

    set.each do |w|
      letters = w.split(//)
      letters.each_cons(4) do |seq|
        s = seq.join
        if !words.add?(s)
          to_remove.add(s)
        end
        sequences.add( {seq: s, word: w} )
      end
    end
    sequences.delete_if { |hash| to_remove.include?(hash[:seq]) }
  end

  def dump_text
    output_s = File.open( 'output/sequences.txt', 'w' )
    output_w = File.open( 'output/words.txt', 'w' )

    sequence.each do |hash|
      output_s.puts("#{hash[:seq]}")
      output_w.puts("#{hash[:word]}")
    end

    output_s.close
    output_w.close
  end
end

My shot at the script with memorization that doesn't work

require 'set'

class WordSequenceParser

  def initialize(path)
    @path = path
  end

  def set
    set = Set.new

    File.open(@path) do |f|
      f.each_line do |line|
        set.add(line.chomp.downcase)
      end
    end
    set
  end

  def memoize
    @set = set
  end

  def sequence
    sequences = Set.new
    words = Set.new
    to_remove = Set.new

    @set.each do |w|
      letters = w.split(//)
      letters.each_cons(4) do |seq|
        s = seq.join
        if !words.add?(s)
          to_remove.add(s)
        end
        sequences.add( {seq: s, word: w} )
      end
    end
    sequences.delete_if { |hash| to_remove.include?(hash[:seq]) }
  end

  def dump_text
    output_s = File.open( 'output/sequences.txt', 'w' )
    output_w = File.open( 'output/words.txt', 'w' )

    sequence.each do |hash|
      output_s.puts("#{hash[:seq]}")
      output_w.puts("#{hash[:word]}")
    end

    output_s.close
    output_w.close
  end
end

I get this error message when trying to run the script.

../word_sequence_parser.rb:29:in `sequence': undefined method `each'     for nil:NilClass (NoMethodError)
    from main.rb:15:in `<main>'

I've read up on Justin Weiss' article on memoization and for the most part get it. Just having a hard time implementing this technique into something I've already written.

回答1:

It does not work since you never call memoize, so @set is never initialized.

However the real problem here, is that there is nothing to memoize.

Your original code looks pretty good, and if you think about how it works there is no redundant execution of any of the code. Every line that is executed either once, or if more than once, returns a different value.

Thus there is no purpose in memoization.

Lets say however you wanted to call dump_text (or just sequence) multiple times then you would definitely want to memoize sequence as follows:

def sequence
  @sequence ||= begin
    sequences = Set.new
    words = Set.new
    to_remove = Set.new

    set.each do |w|
      letters = w.split(//)
      letters.each_cons(4) do |seq|
        s = seq.join
        if !words.add?(s)
          to_remove.add(s)
        end
        sequences.add( {seq: s, word: w} )
      end
    end
    sequences.delete_if { |hash| to_remove.include?(hash[:seq]) }
  end
end

This will only execute your original sequence calculating code once, then assign @sequence. Every other call to @sequence will reuse the value of @sequence already calculated (because its now not nil.)

I love this question because this was the first thing I remember when my company started using ruby. We had a consultant redoing a lot of old asp.net code, and he had these @foo ||= ... expressions in methods, which I had never seen before.