How can I process huge JSON files as streams in Ruby?

Posted 2019-04-22 18:55

I'm having trouble processing a huge JSON file in Ruby. What I'm looking for is a way to process it entry-by-entry without keeping too much data in memory.

I thought the yajl-ruby gem would do the job, but it consumes all my memory. I've also looked at the Yajl::FFI and JSON::Stream gems, but their documentation clearly states:

For larger documents we can use an IO object to stream it into the parser. We still need room for the parsed object, but the document itself is never fully read into memory.

Here's what I've done with Yajl:

file_stream = File.open(file, "r")
json = Yajl::Parser.parse(file_stream)
json.each do |entry|
    entry.do_something
end
file_stream.close

The memory usage keeps getting higher until the process is killed.

I don't see why Yajl keeps processed entries in memory. Can I somehow free them, or did I just misunderstand the capabilities of the Yajl parser?

If it cannot be done using Yajl: is there a way to do this in Ruby with any other library?

3 Answers
叛逆 · 2019-04-22 19:13

Both @CodeGnome's and @A. Rager's answers helped me understand the solution.

I ended up creating the gem json-streamer that offers a generic approach and spares the need to manually define callbacks for every scenario.
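Roughly, usage looks like this (a minimal sketch based on the gem's README at the time; the option names and 'huge.json' are illustrative and exact method signatures may differ between versions, while do_something is the placeholder from the question):

require 'json/streamer'

File.open('huge.json') do |file|
  streamer = Json::Streamer.parser(file_io: file, chunk_size: 1024)

  # Yield every object found one level below the document root,
  # one at a time, without building the whole document in memory.
  streamer.get(nesting_level: 1) do |entry|
    entry.do_something
  end
end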

ら.Afraid · 2019-04-22 19:21

Your solutions seem to be json-stream and yajl-ffi. Both have a pretty similar example (they're from the same author):

def post_init
  @parser = Yajl::FFI::Parser.new
  @parser.start_document { puts "start document" }
  @parser.end_document   { puts "end document" }
  @parser.start_object   { puts "start object" }
  @parser.end_object     { puts "end object" }
  @parser.start_array    { puts "start array" }
  @parser.end_array      { puts "end array" }
  @parser.key            {|k| puts "key: #{k}" }
  @parser.value          {|v| puts "value: #{v}" }
end

def receive_data(data)
  begin
    @parser << data
  rescue Yajl::FFI::ParserError => e
    close_connection
  end
end

There, he sets up callbacks for the data events that the stream parser can emit.

Given a json document that looks like:

{
  "1": {
    "name": "fred",
    "color": "red",
    "dead": true
  },
  "2": {
    "name": "tony",
    "color": "six",
    "dead": true
  },
  ...
  "n": {
    "name": "erik",
    "color": "black",
    "dead": false
  }
}

One could stream-parse it with yajl-ffi like this:

require 'yajl/ffi'

def parse_dudes file_io, chunk_size
  parser = Yajl::FFI::Parser.new
  object_nesting_level = 0
  current_row = {}
  current_key = nil

  parser.start_object { object_nesting_level += 1 }
  parser.end_object do
    if object_nesting_level.eql? 2
      yield current_row # here we yield the fully collected record to the passed block
      current_row = {}
    end
    object_nesting_level -= 1
  end

  parser.key do |k|
    if object_nesting_level.eql? 2
      current_key = k
    elsif object_nesting_level.eql? 1
      current_row["id"] = k
    end
  end

  parser.value { |v| current_row[current_key] = v }

  # read the file in pieces of at most chunk_size bytes and feed them to the parser
  file_io.each(chunk_size) { |chunk| parser << chunk }
end

File.open('dudes.json') do |f|
  parse_dudes f, 1024 do |dude|
    pp dude
  end
end
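Because current_row is reset after each record is yielded, only one entry is held in memory at a time, no matter how large the file is.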
何必那么认真 · 2019-04-22 19:30

Problem

json = Yajl::Parser.parse(file_stream)

When you invoke Yajl::Parser like this, the entire stream is loaded into memory to create your data structure. Don't do that.

Solution

Yajl provides Parser#parse_chunk, Parser#on_parse_complete, and other related methods that enable you to trigger parsing events on a stream without requiring that the whole IO stream be parsed at once. The README contains an example of how to use chunking instead.

The example given in the README is:

Or let's say you didn't have access to the IO object that contained JSON data, but instead only had access to chunks of it at a time. No problem!

(Assume we're in an EventMachine::Connection instance)

def post_init
  @parser = Yajl::Parser.new(:symbolize_keys => true)
end

def object_parsed(obj)
  puts "Sometimes one pays most for the things one gets for nothing. - Albert Einstein"
  puts obj.inspect
end

def connection_completed
  # once a full JSON object has been parsed from the stream
  # object_parsed will be called, and passed the constructed object
  @parser.on_parse_complete = method(:object_parsed)
end

def receive_data(data)
  # continue passing chunks
  @parser << data
end

Or if you don't need to stream it, it'll just return the built object from the parse when it's done. NOTE: if there are going to be multiple JSON strings in the input, you must specify a block or callback as this is how yajl-ruby will hand you (the caller) each object as it's parsed off the input.

obj = Yajl::Parser.parse(str_or_io)

One way or another, you have to parse only a subset of your JSON data at a time. Otherwise, you are simply instantiating a giant Hash in memory, which is exactly the behavior you describe.
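
Adapted to the file-based case in your question, the same callback approach looks roughly like this (a sketch that assumes your file actually contains a stream of top-level JSON objects, which is the case the README's note covers; file and do_something are the names from your question, and the chunk size is arbitrary). If instead the file is one giant array or hash, on_parse_complete will fire only once with the whole structure, and you would need event-level callbacks like those in the yajl-ffi answer.

require 'yajl'

# Hypothetical handler standing in for whatever you do with each entry.
def process_entry(entry)
  entry.do_something
end

parser = Yajl::Parser.new
parser.on_parse_complete = method(:process_entry)

File.open(file, "r") do |file_stream|
  # Feed the parser fixed-size chunks; the callback fires once per
  # complete top-level JSON value parsed off the stream.
  while (chunk = file_stream.read(4096))
    parser << chunk
  end
end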

Without knowing what your data looks like and how your JSON objects are composed, it isn't possible to give a more detailed explanation than that; as a result, your mileage may vary. However, this should at least get you pointed in the right direction.
