How to deal with buffered strings from C in Swift?

Posted 2020-07-18 11:59

Question:

I'm working with libxml2's SAX parser to read large XML files. Most callback handlers are given a NULL-terminated char pointer, which can be converted to a regular Swift string with String.fromCString. However, SAX reads the bytes through a buffer, so one of the callbacks (characters) might be called with only part of a string, namely as much as fits in the buffer. This partial string might even start or end in the middle of a Unicode code point. The callback is then called multiple times until the complete string has been delivered in chunks.

I'm thinking of either concatenating all chunks until the complete string can be assembled, or somehow detecting code point boundaries in the partial strings and processing only the complete code points, up to the first incomplete one.

What would be the best way to handle such circumstances? The processing should be as fast as possible, while still correct. Memory usage should be kept minimal, but not at the cost of performance.

Answer 1:

If processing speed is your first goal then I would just collect all characters until the XML element is processed completely and endElement is called. This can be done using NSMutableData from the Foundation framework. So you need a property

var charData : NSMutableData?

which is initialized in startElement:

charData = NSMutableData()

In the characters callback you append all data:

charData!.appendBytes(ch, length: Int(len))

(The forced unwrapping is acceptable here. charData can only be nil if startElement has not been called before, which means that you made a programming error or libxml2 is not working correctly.)

Finally in endElement, create a Swift string and release the data:

defer {
    // Release the data in any case before the function returns
    charData = nil
}
guard let string = String(data: charData!, encoding: NSUTF8StringEncoding) else {
    // Handle invalid UTF-8 data here; the else branch of a guard must leave the scope
    return
}
// string is the Swift string
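
Putting the three steps together, a minimal sketch might look like the following. It is written in current Swift syntax (Data, String(data:encoding:) with .utf8) rather than the Swift 2-era names used above. The class name ElementTextCollector and its method names are hypothetical, and the wiring to the actual libxml2 callbacks is left out.

```swift
import Foundation

// Hypothetical helper (not part of libxml2 or Foundation): gathers the raw
// bytes of one element's text and decodes them only once the element ends.
final class ElementTextCollector {
    private var charData: Data?

    // Call from the startElement handler.
    func startElement() {
        charData = Data()
    }

    // Call from the characters handler with the chunk the parser provides.
    // The chunk may start or end in the middle of a code point; that is fine,
    // because nothing is decoded yet.
    func characters(_ ch: UnsafePointer<UInt8>, length: Int) {
        charData?.append(ch, count: length)
    }

    // Call from the endElement handler. Returns nil if startElement was never
    // called or if the collected bytes are not valid UTF-8.
    func endElement() -> String? {
        defer { charData = nil }   // release the buffer in any case
        guard let data = charData else { return nil }
        return String(data: data, encoding: .utf8)
    }
}
```

Because decoding happens once per element, a code point split across two characters calls is reassembled before it is ever interpreted. The cost is that the whole text of an element is held in memory until endElement.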


Answer 2:

The longest legal UTF-8 character is 4 bytes (RFC 3629 Section 3). So you don't need a very big buffer to keep yourself safe. The rules for how many bytes you'll need are pretty easy, too (just look at the first byte). So I would just maintain a buffer that holds from 0 to 3 bytes. When you have the right number of bytes for a character, pass them along and try to construct a String. Something like this (only lightly tested, may still have corner cases that don't work):

import Foundation

final class UTF8Parser {
    enum Error: ErrorType {
        case BadEncoding
    }
    // Bytes of an incomplete trailing character, carried over to the next call
    var workingBytes: [UInt8] = []

    func updateWithBytes(bytes: [UInt8]) throws -> String {

        workingBytes += bytes

        var string = String()
        var index = 0

        while index < workingBytes.count {
            let firstByte = workingBytes[index]
            var numBytes = 0

            // The first byte determines the length of the character
                 if firstByte < 0x80 { numBytes = 1 }
            else if firstByte < 0xC0 { throw Error.BadEncoding } // a continuation byte cannot come first
            else if firstByte < 0xE0 { numBytes = 2 }
            else if firstByte < 0xF0 { numBytes = 3 }
            else                     { numBytes = 4 }

            // Not enough bytes yet: keep the rest for the next call
            if workingBytes.count - index < numBytes {
                break
            }

            let charBytes = workingBytes[index..<index+numBytes]

            guard let newString = String(bytes: charBytes, encoding: NSUTF8StringEncoding) else {
                throw Error.BadEncoding
            }
            string += newString
            index += numBytes
        }

        // Drop the bytes that were decoded; any incomplete tail stays buffered
        workingBytes.removeFirst(index)
        return string
    }
}

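let parser = UTF8Parser()
var string = ""
string += try parser.updateWithBytes([UInt8(65)])

print(string)   // "A"

// 0xCC 0x81 is U+0301 (combining acute accent), delivered one byte at a time
let partial = try parser.updateWithBytes([UInt8(0xCC)])
print(partial)  // empty: the character is still incomplete

let rest = try parser.updateWithBytes([UInt8(0x81)])
print(rest)     // the combining accent on its own

string += rest
print(string)   // "A" followed by the combining accent, rendered as an accented A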

This is just one fairly straightforward way to do it. Another approach that is probably faster would be to walk backwards through the bytes of each chunk, looking for the start of the last code point (the last byte whose top two bits are not "10"). Then you could process everything up to that point in one fell swoop, and special-case just the last few bytes.
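
A rough sketch of that backwards-walking variant is shown below, written in current Swift syntax rather than the Swift 2-era syntax of the class above. The names ChunkedUTF8Decoder, InvalidUTF8 and decode are hypothetical, and validation is delegated to Foundation's String(bytes:encoding:). It is only an illustration of the idea, not tested against real libxml2 input.

```swift
import Foundation

// Hypothetical helper: decodes UTF-8 that arrives in arbitrary chunks by
// holding back an incomplete trailing sequence until the next chunk arrives.
struct ChunkedUTF8Decoder {
    struct InvalidUTF8: Error {}

    // Incomplete trailing sequence from the previous chunk (0 to 3 bytes
    // as long as the input is valid UTF-8).
    private var pending: [UInt8] = []

    mutating func decode(_ chunk: [UInt8]) throws -> String {
        let bytes = pending + chunk
        var end = bytes.count

        // Walk backwards over at most 3 continuation bytes (10xxxxxx)
        // to find the first byte of the last sequence.
        var i = bytes.count - 1
        var steps = 0
        while i >= 0 && steps < 3 && bytes[i] & 0xC0 == 0x80 {
            i -= 1
            steps += 1
        }

        if i >= 0 {
            let lead = bytes[i]
            let needed: Int
            if lead < 0x80 { needed = 1 }
            else if lead & 0xE0 == 0xC0 { needed = 2 }
            else if lead & 0xF0 == 0xE0 { needed = 3 }
            else if lead & 0xF8 == 0xF0 { needed = 4 }
            else { needed = 0 }   // not a valid first byte; rejected by the validation below

            // The last sequence is incomplete: hold it back for the next call.
            if needed > bytes.count - i { end = i }
        }

        // Invalid bytes are not retained: after a throw only the held-back tail remains.
        pending = Array(bytes[end...])
        if end == 0 { return "" }

        // Decode everything before the held-back tail in one go.
        guard let text = String(bytes: bytes[..<end], encoding: .utf8) else {
            throw InvalidUTF8()
        }
        return text
    }
}
```

Only the last few bytes of each chunk are inspected by hand; everything else goes through a single decoding call, which is why this is likely to be cheaper than building the string one character at a time.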