The problem: how do you parse a file when its encoding is only known at runtime?
The encoding could be utf-8, utf-16, latin1, or something else.
The goal is to convert a ubyte[] to a string using the selected encoding, because std.stdio.File.byChunk and std.mmfile.MmFile both give you the data as ubyte[].
Are you trying to convert a text file to UTF-8?
If the answer is yes, Phobos has a function specifically for this: @trusted string toUTF8(in char[] s).
See http://dlang.org/phobos/std_utf.html for details.
Sorry if this is not what you need.
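As a small illustration of what toUTF8 covers: it re-encodes text that is already typed as char, wchar, or dchar data, so it does not by itself interpret raw ubyte[] in an arbitrary encoding. The helper name utf16ToUtf8 below is hypothetical and only wraps the call.

```d
import std.utf : toUTF8;

/// Hypothetical helper: re-encode UTF-16 text as a UTF-8 string.
string utf16ToUtf8(const(wchar)[] text)
{
    // toUTF8 transcodes the UTF-16 code units into a newly allocated string
    return toUTF8(text);
}
```

For example, utf16ToUtf8("héllo"w) yields the same text as a regular string.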
I have found a way, though using std.algorithm.reduce might be better:
import std.stdio;
import std.encoding;

void main( string[] args ){
    File f = File( "pathToAfFile.txt", "r" );
    // The scheme name can be chosen at runtime, e.g. "utf-8" or "latin1"
    auto e = EncodingScheme.create("utf-8");
    foreach( const(ubyte)[] buffer; f.byChunk( 4096 ) ){
        // decode() takes the slice by ref and consumes one code point per call,
        // so there is no need to compute a fixed step with firstSequence
        const(ubyte)[] rest = buffer;
        while( rest.length > 0 )
            write( e.decode( rest ) );
        // Caveat: a multi-byte sequence split across two chunks is not handled here
    }
}
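The same idea can be sketched as a function that takes the encoding name as a runtime argument and returns a D string. This is only a sketch under assumptions: the name toUtf8String is hypothetical, the whole input is assumed to be in memory (so chunk boundaries are not an issue), and invalid sequences are replaced with U+FFFD rather than reported.

```d
import std.encoding : EncodingScheme, INVALID_SEQUENCE;

/// Hypothetical helper: decode `bytes` using the encoding named at runtime
/// and return the text as a UTF-8 string.
string toUtf8String(const(ubyte)[] bytes, string encodingName)
{
    // create() looks the scheme up by name and throws if the name is unknown
    auto scheme = EncodingScheme.create(encodingName);
    string result;
    while (bytes.length > 0)
    {
        const before = bytes.length;
        // safeDecode consumes input from the front of the slice and
        // returns INVALID_SEQUENCE instead of failing on bad input
        dchar c = scheme.safeDecode(bytes);
        if (c == INVALID_SEQUENCE)
            c = '\uFFFD';
        // Defensive guard: always make progress, even on malformed input
        if (bytes.length == before)
            bytes = bytes[1 .. $];
        result ~= c; // appending a dchar to a string encodes it as UTF-8
    }
    return result;
}
```

For example, toUtf8String([0xE9], "latin1") gives "é". Which scheme names are available depends on what std.encoding registers on the platform.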
D strings are already UTF-8. No transcoding is necessary. You can use validate from std.utf to check if the file contains valid UTF-8. If you use readText from std.file, it will do the validation for you.
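A minimal sketch of the validation step, assuming the bytes are already in memory. The function name isValidUtf8 is hypothetical; validate and UTFException come from std.utf.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: returns true if `bytes` is well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    // View the raw bytes as UTF-8 code units without copying
    auto text = cast(const(char)[]) bytes;
    try
    {
        validate(text); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

When the whole file fits in memory, std.file.readText(path) combines the read and the validation in one call.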
File.byChunk returns a range whose front is a ubyte[].
UTF-8 uses 1 to 4 bytes per code point (the original design allowed up to 6), so as long as you always keep at least 6 bytes of data buffered, you can use std.encoding's decode to convert the front of the buffer to a dchar. You can then use std.utf's toUTF8 to convert the result to a regular string instead of a dstring.
The convert function below converts any range of unsigned byte arrays to a string.
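import std.encoding, std.range.primitives, std.stdio, std.utf;

void main()
{
    File input = File("test.txt");
    string data = convert(input.byChunk(512));
    writeln("Data: ", data);
}

// The element-type check is a compile-time property, so it is expressed
// as a template constraint rather than a runtime assert in an in-contract.
string convert(R)(R chunkRange)
if (isInputRange!R && is(ElementType!R : const(ubyte)[]))
{
    const(ubyte)[] inbuffer;
    dchar[] outbuffer;
    while (inbuffer.length > 0 || !chunkRange.empty)
    {
        // Refill so a whole sequence is always available (6 bytes is a safe upper bound)
        while ((inbuffer.length < 6) && !chunkRange.empty)
        {
            inbuffer ~= chunkRange.front;
            chunkRange.popFront();
        }
        // std.encoding.decode works on character types, not ubyte, so view the
        // bytes as UTF-8 code units; decode consumes one code point from text.
        auto text = cast(const(char)[]) inbuffer;
        outbuffer ~= std.encoding.decode(text);
        inbuffer = inbuffer[$ - text.length .. $];
    }
    return toUTF8(outbuffer); // Convert to string instead of dstring
}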
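An alternative to keeping a fixed number of bytes buffered is to check how many bytes at the end of a chunk belong to a UTF-8 sequence that is not yet complete, and carry only those over to the next chunk. The following is a sketch under the assumption that the input is UTF-8; the name incompleteTail is hypothetical and it uses no library calls.

```d
/// Hypothetical helper: number of bytes at the end of `chunk` that start
/// a UTF-8 sequence whose remaining bytes have not arrived yet.
size_t incompleteTail(const(ubyte)[] chunk)
{
    // A UTF-8 sequence is at most 4 bytes, so look back at most 4 bytes
    foreach (size_t back; 1 .. 5)
    {
        if (back > chunk.length)
            break;
        const ubyte b = chunk[$ - back];
        if ((b & 0xC0) == 0x80)
            continue; // continuation byte, keep looking for the lead byte
        // Length announced by the lead byte (invalid lead bytes count as 1)
        const size_t need = (b & 0x80) == 0x00 ? 1
                          : (b & 0xE0) == 0xC0 ? 2
                          : (b & 0xF0) == 0xE0 ? 3
                          : (b & 0xF8) == 0xF0 ? 4
                          : 1;
        // Incomplete only if the sequence needs more bytes than are present
        return need > back ? back : 0;
    }
    return 0;
}
```

A caller could decode chunk[0 .. $ - incompleteTail(chunk)] and prepend the remaining bytes to the next chunk before decoding it.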