How can I stream JSON from a file?

Published 2019-02-17 13:11

Question:

I will have a possibly very large JSON file and I want to stream from it instead of loading it all into memory. Based on the following statement (emphasis added) from JSON::XS, I believe it won't suit my needs. Is there a Perl 5 JSON module that will stream the results from disk?

In some cases, there is the need for incremental parsing of JSON texts. While this module always has to keep both JSON text and resulting Perl data structure in memory at one time, it does allow you to parse a JSON stream incrementally. It does so by accumulating text until it has a full JSON object, which it then can decode. This process is similar to using decode_prefix to see if a full JSON object is available, but is much more efficient (and can be implemented with a minimum of method calls).

To clarify, the JSON will contain an array of objects. I want to read one object at a time from the file.

Answer 1:

Have you looked at JSON::Streaming::Reader, which shows up first when searching for 'JSON Stream' on search.cpan.org?

Alternatively, there is JSON::SL, found by searching for 'JSON SAX'. Not quite as obvious a search term, but what you describe sounds like what SAX parsers do for XML.



Answer 2:

In terms of ease of use and speed, JSON::SL seems to be the winner:

#!/usr/bin/perl

use strict;
use warnings;

use JSON::SL;

my $p = JSON::SL->new;

#look for everything past the first level (i.e. everything in the array)
$p->set_jsonpointer(["/^"]);

local $/ = \5; #read only 5 bytes at a time
while (my $buf = <DATA>) {
    $p->feed($buf); #parse what you can
    #fetch anything that completed the parse and matches the JSON Pointer
    while (my $obj = $p->fetch) {
        print "$obj->{Value}{n}: $obj->{Value}{s}\n";
    }
}

__DATA__
[
    { "n": 0, "s": "zero" },
    { "n": 1, "s": "one"  },
    { "n": 2, "s": "two"  }
]

JSON::Streaming::Reader was okay, but it is slower and has an overly verbose interface (all of these coderefs are required even though many do nothing):

#!/usr/bin/perl

use strict;
use warnings;

use JSON::Streaming::Reader;

my $p = JSON::Streaming::Reader->for_stream(\*DATA);

my $obj;
my $attr;
$p->process_tokens(
    start_array    => sub {}, #who cares?
    end_array      => sub {}, #who cares?
    end_property   => sub {}, #who cares?
    start_object   => sub { $obj = {}; },     #clear the current object
    start_property => sub { $attr = shift; }, #get the name of the attribute
    #add the value of the attribute to the object
    add_string     => sub { $obj->{$attr} = shift; },
    add_number     => sub { $obj->{$attr} = shift; },
    #object has finished parsing, it can be used now
    end_object     => sub { print "$obj->{n}: $obj->{s}\n"; },
);

__DATA__
[
    { "n": 0, "s": "zero" },
    { "n": 1, "s": "one"  },
    { "n": 2, "s": "two"  }
]

To parse 1,000 records, JSON::SL took 0.2 seconds and JSON::Streaming::Reader took 3.6 seconds (note: JSON::SL was being fed 4k at a time; I had no control over JSON::Streaming::Reader's buffer size).
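
For reference, here is a minimal sketch of how such a timing run might look. The record shape, the record count, and the 4k chunk size are assumptions for illustration, not the exact harness behind the numbers above:

#!/usr/bin/perl

use strict;
use warnings;

use Time::HiRes qw(time);
use JSON::SL;

#build an in-memory array of N small records
my $n    = 1_000;
my $data = '[' . join(',', map { qq({ "n": $_, "s": "rec$_" }) } 0 .. $n - 1) . ']';

my $p = JSON::SL->new;
$p->set_jsonpointer(["/^"]); #match every element of the array

my $start = time;
my $count = 0;
while (length $data) {
    my $chunk = substr $data, 0, 4096, ''; #take (and remove) the next 4k
    $p->feed($chunk);
    $count++ while $p->fetch;
}
printf "parsed %d records in %.3fs\n", $count, time - $start;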



Answer 3:

It does so by accumulating text until it has a full JSON object, which it then can decode.

This is what screws you over: a JSON document is one object.

You need to define more clearly what you want from incremental parsing. Are you looking for one element of a large mapping? What are you trying to do with the information you read out or write back?


I don't know of any library that will incrementally parse JSON data by reading one element of an array at a time. However, this is quite simple to implement yourself using a finite state automaton (basically, your file has the format \s*\[\s*([^,]+,)*([^,]+)?\s*\]\s*, except that you also have to handle commas that appear inside strings and inside nested objects correctly).
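
For what it's worth, here is a minimal sketch of that idea, assuming (as the question states) that the top-level value is an array of objects; bare scalars as array elements would need extra handling. It tracks bracket depth and string/escape state character by character, accumulates one element at a time, and hands each completed element to an ordinary decoder (JSON::PP here, but any decoder would do):

#!/usr/bin/perl

use strict;
use warnings;

use JSON::PP;

my $json   = JSON::PP->new;
my $depth  = 0;  #bracket nesting depth; the outer array is depth 1
my $in_str = 0;  #inside a double-quoted string?
my $esc    = 0;  #was the previous character a backslash?
my $buf    = ''; #accumulates the current array element

local $/ = \5; #read only 5 bytes at a time
while (my $chunk = <DATA>) {
    for my $ch (split //, $chunk) {
        if ($in_str) { #strings may contain ',', '{', etc. -- pass them through
            $buf .= $ch;
            if    ($esc)        { $esc = 0 }
            elsif ($ch eq "\\") { $esc = 1 }
            elsif ($ch eq '"')  { $in_str = 0 }
            next;
        }
        if ($ch eq '{' or $ch eq '[') {
            $depth++;
            $buf .= $ch if $depth > 1; #don't keep the outer '['
        }
        elsif ($ch eq '}' or $ch eq ']') {
            $buf .= $ch if $depth > 1;
            $depth--;
            if ($depth == 1 and length $buf) { #one array element is complete
                my $obj = $json->decode($buf);
                print "$obj->{n}: $obj->{s}\n";
                $buf = '';
            }
        }
        elsif ($ch eq '"') { $in_str = 1; $buf .= $ch }
        elsif ($ch eq ',' and $depth == 1) { } #separator between elements
        else { $buf .= $ch if $depth > 1 }
    }
}

__DATA__
[
    { "n": 0, "s": "zero" },
    { "n": 1, "s": "one"  },
    { "n": 2, "s": "two"  }
]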



Answer 4:

Did you try skipping the opening bracket [ first, and then the commas , :

$json->incr_text =~ s/^ \s* \[ //x;
...
$json->incr_text =~ s/^ \s* , //x;
...
$json->incr_text =~ s/^ \s* \] //x;

as in the third example here: http://search.cpan.org/dist/JSON-XS/XS.pm#EXAMPLES
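
Fleshed out, that pattern looks roughly like this. It is a condensed sketch of the approach from the JSON::XS documentation, not a drop-in solution: the 64-byte reads and the record format are just for demonstration, and an empty array would need an extra check:

#!/usr/bin/perl

use strict;
use warnings;

use JSON::XS;

my $json = JSON::XS->new;

my $seen_open = 0; #have we stripped the leading '[' yet?
my $need_sep  = 0; #do we still owe a ',' or ']' after the last element?
my $done      = 0;

local $/ = \64; #small reads to simulate streaming
while (my $chunk = <DATA>) {
    $json->incr_parse($chunk); #void context: just append to the buffer

    #remove the array's opening bracket before parsing elements
    $seen_open ||= $json->incr_text =~ s/^ \s* \[ //x;
    next unless $seen_open;

    for (;;) {
        if ($need_sep) {
            if    ($json->incr_text =~ s/^ \s* , //x)  { $need_sep = 0 }
            elsif ($json->incr_text =~ s/^ \s* \] //x) { $done = 1; last }
            else                                       { last } #need more data
        }
        my $obj = $json->incr_parse or last; #one complete element, or not yet
        print "$obj->{n}: $obj->{s}\n";
        $need_sep = 1;
    }
    last if $done;
}

__DATA__
[
    { "n": 0, "s": "zero" },
    { "n": 1, "s": "one"  },
    { "n": 2, "s": "two"  }
]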



Answer 5:

If you have control over how you're generating your JSON, then I suggest turning pretty formatting off and printing one object per line. This makes parsing simple, like so:

use Data::Dumper;
use JSON::Parse 'json_to_perl';
use JSON;
use JSON::SL;
my $json_sl = JSON::SL->new();
use JSON::XS;
my $json_xs = JSON::XS->new();
$json_xs = $json_xs->pretty(0);
#$json_xs = $json_xs->utf8(1);
#$json_xs = $json_xs->ascii(0);
#$json_xs = $json_xs->allow_unknown(1);

my ($file) = @ARGV;
unless( defined $file && -f $file )
{
  print STDERR "usage: $0 FILE\n";
  exit 1;
}


#CMD ARGS is a placeholder for whatever command produces the JSON
my @cmd = ( qw( CMD ARGS ), $file );
open my $JSON, '-|', @cmd or die "Failed to exec @cmd: $!";

# local $/ = \4096; #read 4k at a time
while( my $line = <$JSON> )
{
  if( my $obj = json($line) )
  {
     print Dumper($obj);
  }
  else
  {
     die "error: failed to parse line - $line";
  }
  exit if( $. == 5 ); #demo: stop after the first 5 lines
}

exit 0;

sub json
{
  my ($data) = @_;

  return decode_json($data);
}

sub json_parse
{
  my ($data) = @_;

  return json_to_perl($data);
}

sub json_xs
{
  my ($data) = @_;

  return $json_xs->decode($data);
}

sub json_xs_incremental
{
  my ($data) = @_;
  my $result = [];

  $json_xs->incr_parse($data);  # void context, so no parsing
  push( @$result, $_ ) for( $json_xs->incr_parse );

  return $result;
}

sub json_sl_incremental
{
  my ($data) = @_;
  my $result = [];

  $json_sl->feed($data);
  push( @$result, $_ ) for( $json_sl->fetch );
  # ? error: JSON::SL - Got error CANT_INSERT at position 552 at json_to_perl.pl line 82, <$JSON> line 2.

  return $result;
}
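
For completeness, the writing side of this scheme is just a matter of disabling pretty-printing and emitting a newline after each object. A minimal sketch, with made-up records:

#!/usr/bin/perl

use strict;
use warnings;

use JSON::XS;

#one compact JSON object per line; canonical() just makes key order stable
my $json = JSON::XS->new->pretty(0)->canonical(1);

my @records = (
    { n => 0, s => 'zero' },
    { n => 1, s => 'one'  },
    { n => 2, s => 'two'  },
);

print $json->encode($_), "\n" for @records;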


Tags: json perl stream