I'm trying to read and parse a large JSON file that cannot fit in memory with the new JSON reader System.Text.Json
in .NET Core 3.0.
The example code from Microsoft takes a ReadOnlySpan<byte>
as input
public static void Utf8JsonReaderLoop(ReadOnlySpan<byte> dataUtf8)
{
var json = new Utf8JsonReader(dataUtf8, isFinalBlock: true, state: default);
while (json.Read())
{
JsonTokenType tokenType = json.TokenType;
ReadOnlySpan<byte> valueSpan = json.ValueSpan;
switch (tokenType)
{
case JsonTokenType.StartObject:
case JsonTokenType.EndObject:
break;
case JsonTokenType.StartArray:
case JsonTokenType.EndArray:
break;
case JsonTokenType.PropertyName:
break;
case JsonTokenType.String:
string valueString = json.GetString();
break;
case JsonTokenType.Number:
if (!json.TryGetInt32(out int valueInteger))
{
throw new FormatException();
}
break;
case JsonTokenType.True:
case JsonTokenType.False:
bool valueBool = json.GetBoolean();
break;
case JsonTokenType.Null:
break;
default:
throw new ArgumentException();
}
}
dataUtf8 = dataUtf8.Slice((int)json.BytesConsumed);
JsonReaderState state = json.CurrentState;
}
What I'm struggling to find out is how to actually use this code with a FileStream
, getting a FileStream
into a ReadOnlySpan<byte>
.
I tried reading the file using the following code and ReadAndProcessLargeFile("latest-all.json");
const int megabyte = 1024 * 1024;
public static void ReadAndProcessLargeFile(string theFilename, long whereToStartReading = 0)
{
FileStream fileStram = new FileStream(theFilename, FileMode.Open, FileAccess.Read);
using (fileStram)
{
byte[] buffer = new byte[megabyte];
fileStram.Seek(whereToStartReading, SeekOrigin.Begin);
int bytesRead = fileStram.Read(buffer, 0, megabyte);
while (bytesRead > 0)
{
ProcessChunk(buffer, bytesRead);
bytesRead = fileStram.Read(buffer, 0, megabyte);
}
}
}
private static void ProcessChunk(byte[] buffer, int bytesRead)
{
var span = new ReadOnlySpan<byte>(buffer);
Utf8JsonReaderLoop(span);
}
It crashes with the error messaage
System.Text.Json.JsonReaderException: 'Expected end of string, but instead reached end of data. LineNumber: 8 | BytePositionInLine: 123335.'
As a reference, here is my working code that's using Newtonsoft.Json
dynamic o;
var serializer = new Newtonsoft.Json.JsonSerializer();
using (FileStream s = File.Open("latest-all.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
while (reader.Read())
{
if (reader.TokenType == JsonToken.StartObject)
{
o = serializer.Deserialize(reader);
}
}
}
Update 2019-10-13: Rewritten the Utf8JsonStreamReader to use ReadOnlySequences internally, added wrapper for JsonSerializer.Deserialize method.
I have created a wrapper around Utf8JsonReader for exactly this purpose:
You can use it as replacement for Utf8JsonReader, or for deserializing json into typed objects (as wrapper around System.Text.Json.JsonSerializer.Deserialize).
Example of usage for deserializing objects from huge JSON array:
Deserialize method reads data from stream until it finds end of the current object. Then it constructs new Utf8JsonReader with data read and calls JsonSerializer.Deserialize.
Other methods are pass through to Utf8JsonReader.
And, as always, don't forget to dispose your objects at the end.