Parsing binary structure with Perl6 Grammar

What is the best option to parse a binary structure with Perl6?

In Perl5 we have the pack/unpack functions; in Perl6 they seem to be experimental.

Is it possible to use a Perl6 grammar to parse binary data? Let's say I have a file containing records in the following binary format:

struct record {
    short int ut_type;
    char ut_line[UT_LINESIZE];
    char ut_id[4];
    char ut_user[UT_NAMESIZE];
    char ut_host[UT_HOSTSIZE];
};

Is it possible to parse this file with a Perl6 grammar?

Answer 1:

What is the best option to parse a binary structure with Perl6?

Especially given that you know about P5's pack/unpack, the new P5pack module is plausibly the appropriate solution. (I haven't tested it. It's new. Aiui it doesn't implement everything and it doesn't mimic P5's pack slavishly. But it's Liz.)

If the new pure P6 implementation of P5's pack interface linked above does not do what you need done, another obvious solution is using the original P5 functions, executed by a regular perl 5 binary, in your P6 code. The following is incomplete / untested but I mean something roughly like:

use Inline::Perl5;
my \P5 = Inline::Perl5.new;

my $mem = Buf ... ;

my $hex = P5.call('unpack', 'H*', $mem);

(Or, conversely, write the mainline as P5 code, adding P6 code via Inline::Perl6.)

In the current version of P6, which is 6.c, grammars can only process text.

In 2016, P6er "skids" wrote:

"There are enough people wanting binary grammars that I think it will come to be."

At that time they also put together the following related links:

skids's then-new idea for a packing foo { template bar { ... } } construct, plus their much older "very old, low level types thoughts".
ShimmerFairy's "thoughts on possible binary grammars in P6 (incomplete)"
Juerd's "RFC: A more Perl6-esque "unpack"".
smls's "Binary Parsing in Perl 6".
masak's "A first approach to pack/unpack in Perl 6".
jnthn's "talk on Grammar::Generative" (video).

Answer 2:

I totally agree with raiph's answer and comments, and just want to add a bit.

I think about two types of parsing: one where you can figure out the structure from what you find in the data itself, and one where you rely on an external schema that describes how the data are arranged. Binary data can be arranged either way; msgpack is an example of the former.
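To make the "figure it out from what you find" case concrete, here is a minimal sketch in C of how a self-describing format announces its own layout. It assumes the MessagePack convention that the first byte of a value encodes its type (and, for short values, its length or the value itself). The enum, struct, and function names are made up for illustration, and only a handful of type tags are handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; only a few MessagePack type tags are covered. */
enum value_kind { KIND_UINT, KIND_STR, KIND_NIL, KIND_BOOL, KIND_OTHER };

struct value_header {
    enum value_kind kind;
    size_t payload_len;    /* bytes that follow the tag byte (strings only) */
    unsigned small_value;  /* value carried inside the tag itself, if any */
};

/* Inspect one tag byte and report what kind of value it introduces. */
static struct value_header classify_tag(uint8_t tag)
{
    struct value_header h = { KIND_OTHER, 0, 0 };

    if (tag <= 0x7f) {                        /* positive fixint: 0xxxxxxx */
        h.kind = KIND_UINT;
        h.small_value = tag;
    } else if ((tag & 0xe0) == 0xa0) {        /* fixstr: 101xxxxx */
        h.kind = KIND_STR;
        h.payload_len = tag & 0x1f;           /* length is the low 5 bits */
    } else if (tag == 0xc0) {                 /* nil */
        h.kind = KIND_NIL;
    } else if (tag == 0xc2 || tag == 0xc3) {  /* false / true */
        h.kind = KIND_BOOL;
        h.small_value = (tag == 0xc3);
    }
    return h;
}
```

A reader of such a format never needs an external description: classify_tag(0xa5) reports a string whose 5 payload bytes follow immediately.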

The example you are using, unpacking a binary struct, is an example of the latter. It looks like NativeCall CStructs are almost up to doing what you want directly. Perhaps it can already do it and I'm just not aware, but it seems to be lacking the ability to express an embedded sized array. (Is that working yet?)

Without that, your first job is to figure out the structure you're trying to unpack. There are several ways to do this. The first is the easiest, but perhaps most error-prone -- just look at the struct. (I'm going to make up some fake defines so I have real numbers to work with):

record.h:

#define UT_LINESIZE 80
#define UT_IDSIZE   4
#define UT_NAMESIZE 50
#define UT_HOSTSIZE 20

struct record {
    short int ut_type;
    char ut_line[UT_LINESIZE];
    char ut_id[UT_IDSIZE];
    char ut_user[UT_NAMESIZE];
    char ut_host[UT_HOSTSIZE];
};

Looking at that, I can capture the offset and size of each field:

use NativeCall;  # needed here for nativesizeof

constant \type-size := nativesizeof(int16);
constant \line-size := 80;
constant \id-size   := 4;
constant \user-size := 50;
constant \host-size := 20;

constant \record-size := type-size + line-size + id-size + user-size + host-size;

constant \type-offset := 0;
constant \line-offset := type-offset + type-size;
constant \id-offset   := line-offset + line-size;
constant \user-offset := id-offset   + id-size;
constant \host-offset := user-offset + user-size;

Some caveats here -- you have to understand your format well enough to account for any alignment or padding. Your example works out more easily than some others might.
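One way to check the alignment/padding caveat for a particular struct is to ask a C compiler directly. The sketch below repeats the made-up sizes from record.h above so that it is self-contained, and compares the end-to-end sum of the field sizes with what the compiler actually lays out. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Same made-up sizes as in record.h above. */
#define UT_LINESIZE 80
#define UT_IDSIZE   4
#define UT_NAMESIZE 50
#define UT_HOSTSIZE 20

struct record {
    short int ut_type;
    char ut_line[UT_LINESIZE];
    char ut_id[UT_IDSIZE];
    char ut_user[UT_NAMESIZE];
    char ut_host[UT_HOSTSIZE];
};

/* Size the record would have if the fields were laid end to end. */
static size_t packed_size(void)
{
    return sizeof(short int) + UT_LINESIZE + UT_IDSIZE
         + UT_NAMESIZE + UT_HOSTSIZE;
}

/* Number of padding bytes the compiler inserted (0 means none). */
static size_t padding_bytes(void)
{
    assert(sizeof(struct record) >= packed_size());
    return sizeof(struct record) - packed_size();
}

/* Print the real offsets so they can be compared with hand-computed ones. */
static void print_layout(void)
{
    printf("ut_type %zu\n", offsetof(struct record, ut_type));
    printf("ut_line %zu\n", offsetof(struct record, ut_line));
    printf("ut_id   %zu\n", offsetof(struct record, ut_id));
    printf("ut_user %zu\n", offsetof(struct record, ut_user));
    printf("ut_host %zu\n", offsetof(struct record, ut_host));
    printf("total   %zu (padding %zu)\n",
           sizeof(struct record), padding_bytes());
}
```

With these particular sizes the struct is typically free of padding, because every field after ut_type is a char array. A struct with, say, an int following an odd-length char array usually would not be, and the hand-computed offsets would then be wrong.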

That gives us enough information to figure out which bytes in the binary structure map to each field.

Next you need to convert each chunk of bytes into the right Perl type. NativeCall's nativecast routine can do that for you. It can easily turn a chunk of bytes into many Perl data types.

I'm going to make the assumption that your fields are C strings, always correctly terminated by NULs, and are suitable for decoding as UTF8. You can tweak this for other particular cases.
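The NUL-termination assumption matters for fixed-width char fields: a value that fills the whole field may leave no room for a terminator. As a sketch of the defensive alternative, the C helpers below (hypothetical names, not part of any library) measure and copy a field without ever reading past its declared width.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the text stored in a fixed-width field: up to the first NUL,
   or the full width if the field is completely filled. */
static size_t field_len(const char *field, size_t width)
{
    assert(field != NULL);
    const char *nul = memchr(field, '\0', width);
    return nul ? (size_t)(nul - field) : width;
}

/* Copy a field into a caller-supplied buffer and always NUL-terminate it.
   Returns the number of characters copied, excluding the terminator. */
static size_t field_copy(char *dst, size_t dst_size,
                         const char *field, size_t width)
{
    assert(dst != NULL && dst_size > 0);
    size_t n = field_len(field, width);
    if (n >= dst_size)
        n = dst_size - 1;   /* truncate rather than overflow dst */
    memcpy(dst, field, n);
    dst[n] = '\0';
    return n;
}
```

For a 4-byte ut_id field, a destination of 5 bytes is enough to hold any value plus the terminator, whether or not the original field was NUL-terminated.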

use NativeCall;

class record {
    has Int $.ut-type;
    has Str $.ut-line;
    has Str $.ut-id;
    has Str $.ut-user;
    has Str $.ut-host;
}

sub unpack-buf(Mu:U $type, Blob $buf, $offset, $size) {
    nativecast($type, CArray[uint8].new($buf[$offset ..^ $offset+$size]))
}

sub unpack-record($buf) {
    record.new(
        ut-type => unpack-buf(int16, $buf, type-offset, type-size),
        ut-line => unpack-buf(Str,   $buf, line-offset, line-size),
        ut-id   => unpack-buf(Str,   $buf, id-offset,   id-size),
        ut-user => unpack-buf(Str,   $buf, user-offset, user-size),
        ut-host => unpack-buf(Str,   $buf, host-offset, host-size)
    )
}

Then you can seek/read the binary data out of your data file to grab individual records, or just walk through all of it:

my @data = gather {
    given 'data'.IO.open(:bin) {
        while .read(record-size) -> $buf {
            take unpack-record($buf)
        }
        .close
    }
}
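One detail a read loop like this has to consider is a file whose length is not an exact multiple of the record size: the final read can then return fewer bytes than a full record. The C sketch below shows the check, assuming the made-up sizes above and a 2-byte short (so 156 bytes per record). The function name is hypothetical, and a real program would decode each buffer where the comment indicates.

```c
#include <stddef.h>
#include <stdio.h>

/* 2 + 80 + 4 + 50 + 20 with the made-up sizes; assumes a 2-byte short. */
#define RECORD_SIZE 156

/* Read fixed-size records from fp. Returns the number of complete records
   and stores the size of a trailing partial record (0 if none) in *leftover. */
static size_t count_records(FILE *fp, size_t *leftover)
{
    unsigned char buf[RECORD_SIZE];
    size_t records = 0;
    size_t got;

    *leftover = 0;
    while ((got = fread(buf, 1, sizeof buf, fp)) > 0) {
        if (got < sizeof buf) {   /* short read: truncated final record */
            *leftover = got;
            break;
        }
        records++;                /* a real program would decode buf here */
    }
    return records;
}
```

A non-zero leftover usually means the file is damaged or was written with a different record layout than the one assumed here.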

Because we've manually copied things from the C struct definition, any change to that definition has to be mirrored here by hand, and alignment/padding can still bite us.

Another option is to directly read the C header file and use sizeof() and offsetof() to hand us all the numbers. That will naturally take into account alignment/padding. You can even use TCC from within your Perl code to directly access the header file. Instead of all the constant lines above, you can use this to pull everything right out of the C .h file.

use TCC;

my $tcc = TCC.new('');

$tcc.compile: q:to/END/;
    #include <stddef.h>
    #include "record.h"
    size_t record_size() { return sizeof(struct record); }
    size_t type_offset() { return offsetof(struct record, ut_type); }
    size_t type_size()   { return sizeof(short int); }
    size_t line_offset() { return offsetof(struct record, ut_line); }
    size_t line_size()   { return UT_LINESIZE; }
    size_t id_offset()   { return offsetof(struct record, ut_id); }
    size_t id_size()     { return UT_IDSIZE; }
    size_t user_offset() { return offsetof(struct record, ut_user); }
    size_t user_size()   { return UT_NAMESIZE; }
    size_t host_offset() { return offsetof(struct record, ut_host); }
    size_t host_size()   { return UT_HOSTSIZE; }
    END

$tcc.relocate;

my &record-size := $tcc.bind('record_size', :(--> size_t));
my &type-offset := $tcc.bind('type_offset', :(--> size_t));
my &type-size   := $tcc.bind('type_size',   :(--> size_t));
my &line-offset := $tcc.bind('line_offset', :(--> size_t));
my &line-size   := $tcc.bind('line_size',   :(--> size_t));
my &id-offset   := $tcc.bind('id_offset',   :(--> size_t));
my &id-size     := $tcc.bind('id_size',     :(--> size_t));
my &user-offset := $tcc.bind('user_offset', :(--> size_t));
my &user-size   := $tcc.bind('user_size',   :(--> size_t));
my &host-offset := $tcc.bind('host_offset', :(--> size_t));
my &host-size   := $tcc.bind('host_size',   :(--> size_t));