I need to parse data files in "txf" format. The files may contain more than 1000 entries. Since the format is well defined, like JSON, I wanted to write a generic parser, similar to a JSON parser, that can serialise and deserialise txf files.
Unlike JSON, the markup has no way to distinguish an object from an array. If an entry with the same tag occurs more than once, we need to treat it as an array.
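The array rule can be illustrated with a small C++ sketch. It is not part of the parser in this question; the function name `repeatedTags` is made up for illustration, and it assumes the child tags of one object have already been collected in file order.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: returns the tags that occur more than once among the
// direct children of one object. By the rule above, those are the arrays.
std::set<std::string> repeatedTags(const std::vector<std::string>& childTags) {
    std::map<std::string, int> counts;
    for (const auto& tag : childTags) {
        ++counts[tag];
    }
    std::set<std::string> repeated;
    for (const auto& entry : counts) {
        if (entry.second > 1) {
            repeated.insert(entry.first);
        }
    }
    return repeated;
}
```

A tag that appears only once stays a plain object, which also means a one-element array cannot be told apart from a single object in this format.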
# marks the start of an object.
$ marks a member of an object.
/ marks the end of an object.
The following is a sample "txf" file:
#Employees
$LastUpdated=2015-02-01 14:01:00
#Employee
$Id=1
$Name=Employee 01
#Departments
$LastUpdated=2015-02-01 14:01:00
#Department
$Id=1
$Name=Department Name
/Department
/Departments
/Employee
#Employee
/Employee
/Employees
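As a rough illustration of how the three markers drive parsing, here is a minimal C++ sketch that classifies a single line by its first character. The names `LineKind` and `classify` are assumptions made for this example and do not come from the parser shown later.

```cpp
#include <string>

// Hypothetical line categories for the three txf markers.
enum class LineKind { ObjectStart, Member, ObjectEnd, Other };

// Classifies one txf line by its first character.
LineKind classify(const std::string& line) {
    if (line.empty()) {
        return LineKind::Other;
    }
    switch (line[0]) {
        case '#': return LineKind::ObjectStart;
        case '$': return LineKind::Member;
        case '/': return LineKind::ObjectEnd;
        default:  return LineKind::Other;   // body text or continuation line
    }
}
```

Checking one character avoids creating a substring just to test the prefix, which may matter when the file has many lines.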
I was able to create a generic TXF parser using NSScanner, but with more entries the performance needs more tweaking.
I wrote the resulting Foundation object out as a plist and compared the plist parser's performance against the parser I wrote. My parser is around 10 times slower than the plist parser.
Although the plist file is 5 times larger than the txf file and has more markup characters, it still parses faster, so I feel that there is a lot of room for optimization.
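For a comparison like this, both parsers should be timed the same way and over several runs. A minimal sketch of such a harness in C++ follows; `meanMilliseconds` is a hypothetical helper, not the code used for the numbers above.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: runs `work` `runs` times and returns the mean
// wall-clock time per run in milliseconds.
double meanMilliseconds(const std::function<void()>& work, int runs) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < runs; ++i) {
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / runs;
}
```

Passing each parser as a lambda over the same input keeps file loading out of the measurement, assuming the input is read into memory beforehand.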
Any help in that direction is highly appreciated.
EDIT : Including the parsing code
static NSString *const kArray = @"TXFArray";
static NSString *const kBodyText = @"TXFText";

@interface TXFParser ()
/*Temporary variable to hold values of an object*/
@property (nonatomic, strong) NSMutableDictionary *dict;
/*An array to hold the hierarchical data of all nodes encountered while parsing*/
@property (nonatomic, strong) NSMutableArray *stack;
@end

@implementation TXFParser
#pragma mark - Getters
- (NSMutableArray *)stack{
    if (!_stack) {
        _stack = [NSMutableArray new];
    }
    return _stack;
}
#pragma mark -
- (id)objectFromString:(NSString *)txfString{
    [txfString enumerateLinesUsingBlock:^(NSString *string, BOOL *stop) {
        if ([string hasPrefix:@"#"]) {
            [self didStartParsingTag:[string substringFromIndex:1]];
        }else if([string hasPrefix:@"$"]){
            [self didFindKeyValuePair:[string substringFromIndex:1]];
        }else if([string hasPrefix:@"/"]){
            [self didEndParsingTag:[string substringFromIndex:1]];
        }else{
            //[self didFindBodyValue:string];
        }
    }];
    return self.dict;
}
#pragma mark -
- (void)didStartParsingTag:(NSString *)tag{
    [self parserFoundObjectStartForKey:tag];
}
- (void)didFindKeyValuePair:(NSString *)tag{
    NSArray *components = [tag componentsSeparatedByString:@"="];
    NSString *key = [components firstObject];
    NSString *value = [components lastObject];
    if (key.length) {
        self.dict[key] = value?:@"";
    }
}
- (void)didFindBodyValue:(NSString *)bodyString{
    if (!bodyString.length) return;
    bodyString = [bodyString stringByTrimmingCharactersInSet:[NSCharacterSet illegalCharacterSet]];
    if (!bodyString.length) return;
    self.dict[kBodyText] = bodyString;
}
- (void)didEndParsingTag:(NSString *)tag{
    [self parserFoundObjectEndForKey:tag];
}
#pragma mark -
- (void)parserFoundObjectStartForKey:(NSString *)key{
    self.dict = [NSMutableDictionary new];
    [self.stack addObject:self.dict];
}
- (void)parserFoundObjectEndForKey:(NSString *)key{
    NSDictionary *dict = self.dict;
    //Remove the last value of stack
    [self.stack removeLastObject];
    //Load the previous object as dict
    self.dict = [self.stack lastObject];
    //The stack has contents, then we need to append objects
    if ([self.stack count]) {
        [self addObject:dict forKey:key];
    }else{
        //This is root object, wrap with key and assign output
        self.dict = (NSMutableDictionary *)[self wrapObject:dict withKey:key];
    }
}
#pragma mark - Add Objects after finding end tag
- (void)addObject:(id)dict forKey:(NSString *)key{
    //If there is no value, bailout
    if (!dict) return;
    //Check if the dict already has a value for key array.
    NSMutableArray *array = self.dict[kArray];
    //If array key is not found look for another object with same key
    if (array) {
        //Array found add current object after wrapping with key
        NSDictionary *currentDict = [self wrapObject:dict withKey:key];
        [array addObject:currentDict];
    }else{
        id prevObj = self.dict[key];
        if (prevObj) {
            /*
             There is a prev value for the same key. That means we need to wrap that object in a collection.
             1. Remove the object from dictionary,
             2. Wrap it with its key
             3. Add the prev and current value to array
             4. Save the array back to dict
             */
            [self.dict removeObjectForKey:key];
            NSDictionary *prevDict = [self wrapObject:prevObj withKey:key];
            NSDictionary *currentDict = [self wrapObject:dict withKey:key];
            self.dict[kArray] = [@[prevDict,currentDict] mutableCopy];
        }else{
            //Simply add object to dict
            self.dict[key] = dict;
        }
    }
}
/*Wraps Object with a key for the serializer to generate txf tag*/
- (NSDictionary *)wrapObject:(id)obj withKey:(NSString *)key{
    if (!key ||!obj) {
        return @{};
    }
    return @{key:obj};
}
@end
EDIT 2:
A sample TXF file with more than 1000 entries.
Have you considered using pull-style reads and recursive processing? That would avoid reading the whole file into memory, and it would also remove the need to manage your own stack to track how deep you are in the parse.
Below is an example in Swift. The example works with your sample "txf", but not with the Dropbox version; some of your "members" span multiple lines. If this is a requirement, it can easily be implemented in the switch/case "$" section. However, I don't see your own code handling this either. Also, the example doesn't follow correct Swift error handling yet (the parse method would need an additional NSError parameter).
The StreamReader was borrowed from https://stackoverflow.com/a/24648951/95976
Edit
Edit 2
I rewrote the above in C++11 and got it to run in less than 0.05 seconds (release mode) on a 2012 MBA i5 using the updated file on Dropbox. I suspect NSDictionary and NSArray must carry some penalty. The code below can be compiled into an Objective-C project (the file needs to have the extension .mm):
Edit 3
See the link for the full C++ code: https://github.com/tofi9/TxfParser
I did some work on your GitHub source. With the following 2 changes I got an overall improvement of 30%, though the major improvement is from "Optimisation 1".
Optimisation 1 - based on your data, I came up with the following.
Optimisation 2:
Hope it helps you.