I'm working with libxml2's SAX parser to read large XML files. Most callback handlers are given a NULL-terminated char pointer, which can be converted to a regular Swift string using String.fromCString. However, SAX reads the bytes through a buffer, so one of the callbacks (characters) might be called with only part of a string, namely as much as fits in the buffer. This partial string might even start or end partway through a Unicode code point. The callback will be called multiple times, until the complete string has been provided (in chunks).
I'm thinking of either concatenating all chunks until the complete string can be assembled, or somehow detecting code point boundaries in the partial strings and processing only the complete code points up to the first incomplete one.
What would be the best way to handle such circumstances? The processing should be as fast as possible, while still correct. Memory usage should be kept minimal, but not at the cost of performance.
If processing speed is your first goal then I would just collect all characters until the XML element is processed completely and endElement is called. This can be done using NSMutableData from the Foundation framework. So you need a property
var charData : NSMutableData?
which is initialized in startElement:
charData = NSMutableData()
In the characters callback you append all data:
charData!.appendBytes(ch, length: Int(len))
(The forced unwrapping is acceptable here. charData can only be nil if startElement has not been called before, which means that you made a programming error or libxml2 is not working correctly.)
Finally, in endElement, create a Swift string and release the data:
defer {
    // Release data in any case before function returns
    charData = nil
}
guard let string = String(data: charData!, encoding: NSUTF8StringEncoding) else {
    // Handle invalid UTF-8 data situation
    // (a guard body must leave the scope, so return after handling it)
    return
}
// string is the Swift string
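The three steps above can also be gathered in one place. The following is only a minimal sketch, written in current Swift syntax (Data and String.Encoding.utf8) rather than the Swift 2 syntax used above. The ElementTextCollector type and its method names are hypothetical stand-ins for your real libxml2 callback handlers, and unlike the snippets above it uses optional chaining instead of forced unwrapping:

```swift
import Foundation

// Hypothetical helper: the method names mirror the SAX callbacks,
// but these are not libxml2's actual C callback signatures.
final class ElementTextCollector {
    private var charData: Data?

    // Call from startElement: begin with a fresh, empty buffer.
    func startElement() {
        charData = Data()
    }

    // Call from characters: append the raw chunk without decoding it.
    func characters(_ ch: UnsafePointer<UInt8>, length: Int) {
        charData?.append(ch, count: length)
    }

    // Call from endElement: decode once, then release the buffer.
    // Returns nil if no element was open or the bytes are not valid UTF-8.
    func endElement() -> String? {
        defer { charData = nil }
        guard let data = charData else { return nil }
        return String(data: data, encoding: .utf8)
    }
}

// Usage: "A" followed by "é" (0xC3 0xA9), split in the middle of the "é".
let collector = ElementTextCollector()
collector.startElement()
let chunks: [[UInt8]] = [[0x41, 0xC3], [0xA9]]
for chunk in chunks {
    chunk.withUnsafeBufferPointer { buffer in
        collector.characters(buffer.baseAddress!, length: buffer.count)
    }
}
let text = collector.endElement()
```

Because decoding happens only once per element, it does not matter where the chunk boundaries fall.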
The longest legal UTF-8 character is 4 bytes (RFC 3629 Section 3). So you don't need a very big buffer to keep yourself safe. The rules for how many bytes you'll need are pretty easy, too (just look at the first byte). So I would just maintain a buffer that holds from 0 to 3 bytes. When you have the right number, pass it along and try to construct a String. Something like this (only lightly tested, may still have corner cases that don't work):
import Foundation

final class UTF8Parser {
    enum Error: ErrorType {
        case BadEncoding
    }

    // Bytes of an incomplete trailing sequence, carried over to the next call
    var workingBytes: [UInt8] = []

    func updateWithBytes(bytes: [UInt8]) throws -> String {
        workingBytes += bytes
        var string = String()
        var index = 0
        while index < workingBytes.count {
            // The first byte determines the length of the sequence.
            // Bytes that cannot start a sequence are not detected here;
            // they are left for the String initializer below to reject.
            let firstByte = workingBytes[index]
            var numBytes = 0
            if firstByte < 0x80 { numBytes = 1 }
            else if firstByte < 0xE0 { numBytes = 2 }
            else if firstByte < 0xF0 { numBytes = 3 }
            else { numBytes = 4 }

            // Incomplete sequence at the end: keep it for the next call
            if workingBytes.count - index < numBytes {
                break
            }

            let charBytes = workingBytes[index..<index+numBytes]
            guard let newString = String(bytes: charBytes, encoding: NSUTF8StringEncoding) else {
                throw Error.BadEncoding
            }
            string += newString
            index += numBytes
        }
        // Drop everything that has been decoded
        workingBytes.removeFirst(index)
        return string
    }
}
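let parser = UTF8Parser()
var string = ""
string += try parser.updateWithBytes([UInt8(65)])
print(string)   // "A"
// 0xCC starts a 2-byte sequence, so nothing is returned yet
let partial = try parser.updateWithBytes([UInt8(0xCC)])
print(partial)  // empty string
// 0x81 completes U+0301 (combining acute accent)
let rest = try parser.updateWithBytes([UInt8(0x81)])
print(rest)
string += rest
print(string)   // "A" followed by the combining accent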
This is just one way that's kind of straightforward. Another approach that is probably faster would be to walk backwards through the bytes, looking for the last start of a code point (a byte whose top two bits are not "10"). Then you could process everything up to that point in one fell swoop, and special-case just the last few bytes.
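A rough sketch of that backwards walk follows, in current Swift syntax rather than the Swift 2 syntax used above. The names completePrefixLength and ChunkDecoder are made up for illustration, and the code is only lightly checked. It inspects just the tail of the buffer to decide how many bytes can be decoded now, and leaves full validation to the String initializer:

```swift
import Foundation

// Returns how many leading bytes of `bytes` can be decoded now, i.e.
// everything except an incomplete sequence at the very end.
// Malformed input is not rejected here; that is left to the decoder.
func completePrefixLength(_ bytes: [UInt8]) -> Int {
    var i = bytes.count
    var steps = 0
    // Step back over at most three continuation bytes (10xxxxxx).
    while i > 0 && steps < 3 && (bytes[i - 1] & 0xC0) == 0x80 {
        i -= 1
        steps += 1
    }
    guard i > 0 else { return bytes.count }  // no lead byte found at all
    let lead = bytes[i - 1]
    let needed: Int
    if lead < 0x80 { needed = 1 }
    else if (lead & 0xE0) == 0xC0 { needed = 2 }
    else if (lead & 0xF0) == 0xE0 { needed = 3 }
    else if (lead & 0xF8) == 0xF0 { needed = 4 }
    else { return bytes.count }              // not a lead byte: decoder will fail
    let have = bytes.count - (i - 1)
    return have < needed ? i - 1 : bytes.count
}

struct ChunkDecoder {
    // Holds the 0 to 3 bytes of an incomplete trailing sequence.
    private var carry: [UInt8] = []

    // Returns the decoded text of all complete sequences, or nil on bad UTF-8.
    mutating func decode(_ chunk: [UInt8]) -> String? {
        carry += chunk
        let n = completePrefixLength(carry)
        if n == 0 { return "" }
        guard let s = String(bytes: carry[..<n], encoding: .utf8) else {
            carry.removeAll()                // discard the undecodable data
            return nil
        }
        carry.removeFirst(n)
        return s
    }
}

// Usage: the same bytes as in the example above, split inside U+0301.
var decoder = ChunkDecoder()
let first = decoder.decode([0x41, 0xCC])     // 0xCC is held back
let second = decoder.decode([0x81])          // completes U+0301
```

Compared with the per-character loop, this calls the String initializer once per chunk instead of once per character.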