Is there any sample code in C# for reading the Azure Event Hub Archive files (Avro format)?
I am trying to use the Microsoft.Hadoop.Avro library. I dumped the schema using a Java Avro tool, which produces this:
{
    "type": "record",
    "name": "EventData",
    "namespace": "Microsoft.ServiceBus.Messaging",
    "fields": [
        {"name": "SequenceNumber", "type": "long"},
        {"name": "Offset", "type": "string"},
        {"name": "EnqueuedTimeUtc", "type": "string"},
        {"name": "SystemProperties", "type": {"type": "map", "values": ["long", "double", "string", "bytes"]}},
        {"name": "Properties", "type": {"type": "map", "values": ["long", "double", "string", "bytes", "null"]}},
        {"name": "Body", "type": ["null", "bytes"]}
    ]
}
However, I then try to deserialize the file to read the data back in like this:
using (var reader = AvroContainer.CreateReader<EventData>(stream))
{
    using (var streamReader = new SequentialReader<EventData>(reader))
    {
        foreach (EventData dta in streamReader.Objects)
        {
            //stuff here
        }
    }
}
This doesn't work when passing the actual EventData type used on the producer side, so I tried creating a special class marked up with DataContract attributes like this:
[DataContract(Namespace = "Microsoft.ServiceBus.Messaging")]
public class EventData
{
    [DataMember(Name = "SequenceNumber")]
    public long SequenceNumber { get; set; }
    [DataMember(Name = "Offset")]
    public string Offset { get; set; }
    [DataMember(Name = "EnqueuedTimeUtc")]
    public string EnqueuedTimeUtc { get; set; }
    [DataMember(Name = "Body")]
    public ArraySegment<byte> Body { get; set; }
    //[DataMember(Name = "SystemProperties")]
    //public SystemPropertiesCollection SystemProperties { get; set; }
    //[DataMember(Name = "Properties")]
    //public IDictionary<string, object> Properties { get; set; }
}
It errors with the following:
System.Runtime.Serialization.SerializationException occurred
Message=Cannot match the union schema.
Is there a reason no sample code exists from MS for this use case of reading the Avro archive files using C#?
If you're trying to read the Avro files using the Microsoft.Hadoop.Avro library, you can use the following class:
When you're reading your Avro file, you can read it as a dynamic object and then map it onto that class. Here's an example:
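One possible shape for such a class, as a minimal sketch based on the schema dumped in the question. The name `ArchivedEventData` is hypothetical (chosen to avoid clashing with the SDK's own EventData), and the casts assume that Microsoft.Hadoop.Avro's generic reader materializes the map fields as `Dictionary<string, object>` and the `["null","bytes"]` union as `byte[]` or null:

```csharp
using System.Collections.Generic;

// Hypothetical plain class mirroring the archive schema; not part of any SDK.
public class ArchivedEventData
{
    public long SequenceNumber { get; set; }
    public string Offset { get; set; }
    // Kept as a string because the schema declares EnqueuedTimeUtc as "string".
    public string EnqueuedTimeUtc { get; set; }
    public Dictionary<string, object> SystemProperties { get; set; }
    public Dictionary<string, object> Properties { get; set; }
    public byte[] Body { get; set; }

    // Assumption: 'record' is the dynamic record object yielded by a
    // generic (schema-driven) Avro reader, exposing fields by name.
    public ArchivedEventData(dynamic record)
    {
        SequenceNumber = (long)record.SequenceNumber;
        Offset = (string)record.Offset;
        EnqueuedTimeUtc = (string)record.EnqueuedTimeUtc;
        SystemProperties = (Dictionary<string, object>)record.SystemProperties;
        Properties = (Dictionary<string, object>)record.Properties;
        Body = (byte[])record.Body; // null when the event had no body
    }
}
```

Copying from a dynamic record sidesteps the "Cannot match the union schema" problem because no DataContract-to-schema matching is needed for the union-valued fields.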
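A minimal sketch of reading the archive as dynamic records, assuming the Microsoft.Hadoop.Avro NuGet package and a stream positioned at the start of the file. The `ArchiveDump` class and `Dump` method names are hypothetical; the field names come from the schema in the question:

```csharp
using System;
using System.IO;
using Microsoft.Hadoop.Avro.Container;

public static class ArchiveDump
{
    public static void Dump(Stream stream)
    {
        // The generic reader takes the schema from the container header,
        // so no compile-time EventData type is required.
        using (var reader = AvroContainer.CreateGenericReader(stream))
        {
            // Each MoveNext() advances to the next block of records.
            while (reader.MoveNext())
            {
                foreach (dynamic record in reader.Current.Objects)
                {
                    long sequenceNumber = (long)record.SequenceNumber;
                    string offset = (string)record.Offset;
                    string enqueued = (string)record.EnqueuedTimeUtc;
                    byte[] body = (byte[])record.Body; // may be null

                    Console.WriteLine("{0} {1} {2} ({3} bytes)",
                        sequenceNumber, offset, enqueued,
                        body == null ? 0 : body.Length);
                }
            }
        }
    }
}
```

It could be called as `ArchiveDump.Dump(File.OpenRead(path))`, where `path` points at a downloaded archive blob.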
You can refer to this answer for more details.
I tried both the Microsoft.Hadoop.Avro and Apache Avro C# libraries, and they seemed to have exactly the same issue. When just trying to read SequenceNumber, Offset, and EnqueuedTimeUtc, both returned the same garbled data, which appeared to be the codec and schema definition data.
So here's what I found out. I was downloading the blob to a MemoryStream and then trying to deserialize from there. The issue is that the deserializer was not taking the file's header and schema into account and was trying to deserialize from the very beginning of the stream.
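That header can be checked directly: an Avro object container file starts with the four bytes 'O', 'b', 'j', 0x01, followed by metadata that includes the schema and codec, which matches the "garbled" data described above. A minimal sketch, assuming a seekable stream such as a MemoryStream; the helper name is hypothetical:

```csharp
using System;
using System.IO;

public static class AvroContainerCheck
{
    // Magic bytes that open every Avro object container file.
    private static readonly byte[] Magic = { (byte)'O', (byte)'b', (byte)'j', 1 };

    public static bool LooksLikeAvroContainer(Stream stream)
    {
        if (!stream.CanSeek)
            throw new ArgumentException("Stream must be seekable.", nameof(stream));

        stream.Position = 0;
        var header = new byte[Magic.Length];
        int read = 0;
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break; // stream is shorter than the magic
            read += n;
        }

        // Rewind so a container-aware reader starts at the header.
        stream.Position = 0;

        if (read != Magic.Length) return false;
        for (int i = 0; i < Magic.Length; i++)
        {
            if (header[i] != Magic[i]) return false;
        }
        return true;
    }
}
```

If this returns true, the stream should go to a reader that understands the container format rather than to a bare record deserializer.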
What worked was using the Apache Avro C# library: I ran its gen tool to create the C# class from the dumped JSON schema, then used a DataFileReader, which can read the container file from the stream.
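A minimal sketch of that approach, assuming the Apache.Avro NuGet package and an EventData class generated by the gen tool from the schema in the question (so it lands in the Microsoft.ServiceBus.Messaging namespace and would clash with the SDK type of the same name if both were referenced). The `ApacheArchiveReader` wrapper is hypothetical, and the stream needs to be seekable:

```csharp
using System;
using System.IO;
using Avro.File;
using Microsoft.ServiceBus.Messaging; // namespace of the generated EventData class

public static class ApacheArchiveReader
{
    public static void Read(Stream stream)
    {
        // The generated class exposes its Avro schema through the Schema property.
        var evtSample = new EventData();

        // DataFileReader consumes the container header (magic, schema, codec)
        // before handing back records.
        using (var reader = DataFileReader<EventData>.OpenReader(stream, evtSample.Schema))
        {
            while (reader.HasNext())
            {
                EventData evt = reader.Next();
                Console.WriteLine("{0} {1} {2}",
                    evt.SequenceNumber, evt.Offset, evt.EnqueuedTimeUtc);
            }
        }
    }
}
```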
Here, evtSample is an instance of the generated EventData class, and its Schema property returns the schema.
Now to find out if I can do the same thing with the Microsoft.Hadoop.Avro library.
BTW, here is the generated C# class output from the Apache Avro gen tool: