I am trying to read a file given in an NSURL and load it into an array, with items separated by a newline character \n.
Here is the way I've done it so far:
```swift
let possList = NSString(contentsOfURL: filePath.URL, encoding: NSUTF8StringEncoding, error: nil)
if let list = possList {
    return list.componentsSeparatedByString("\n") as [String]
} else {
    // return an empty list
    return []
}
```
I'm not very happy with this for a couple of reasons. First, I'm working with files that range from a few kilobytes to hundreds of megabytes in size. As you can imagine, working with strings this large is slow and unwieldy. Second, this freezes up the UI while it's executing, which is again not good.

I've looked into running this code in a separate thread, but I've been having trouble with that, and besides, it still doesn't solve the problem of dealing with huge strings.
What I'd like to do is something along the lines of the following pseudocode:
```
var aStreamReader = new StreamReader(from_file_or_url)
while aStreamReader.hasNextLine {
    currentline = aStreamReader.nextLine()
    list.addItem(currentline)
}
```
How would I accomplish this in Swift?
A few notes about the files I'm reading from: all files consist of short (< 255 characters) strings separated by either \n or \r\n. Their lengths range from ~100 lines to over 50 million lines. They may contain European characters and/or characters with accents.
I've wrapped the code from algal's answer into a convenient class (Swift 4.0).

Update: this code is platform independent (macOS, iOS, Ubuntu).

Usage:

Repository on GitHub
(The code is for Swift 2.2/Xcode 7.3 now. Older versions can be found in the edit history if somebody needs it. An updated version for Swift 3 is provided at the end.)
The following Swift code is heavily inspired by the various answers to How to read data from NSFileHandle line by line?. It reads from the file in chunks, and converts complete lines to strings.
The default line delimiter (\n), string encoding (UTF-8) and chunk size (4096) can be set with optional parameters.

Usage:
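A minimal sketch of a reader along the lines described, reading from the file in chunks via FileHandle and converting complete lines to strings (the names are illustrative, not necessarily the answer's exact code):

```swift
import Foundation

// Reads a file chunk by chunk and hands back complete lines one at a time.
class StreamReader {
    let encoding: String.Encoding
    let chunkSize: Int
    let fileHandle: FileHandle
    let delimiterData: Data
    var buffer: Data
    var atEOF = false

    init?(url: URL, delimiter: String = "\n",
          encoding: String.Encoding = .utf8, chunkSize: Int = 4096) {
        guard let fileHandle = try? FileHandle(forReadingFrom: url),
              let delimiterData = delimiter.data(using: encoding) else { return nil }
        self.fileHandle = fileHandle
        self.delimiterData = delimiterData
        self.encoding = encoding
        self.chunkSize = chunkSize
        self.buffer = Data(capacity: chunkSize)
    }

    deinit { fileHandle.closeFile() }

    /// Returns the next line, or nil on EOF.
    func nextLine() -> String? {
        while !atEOF {
            // If the buffer already holds a delimiter, cut a line off the front.
            if let range = buffer.range(of: delimiterData) {
                let line = String(data: buffer.subdata(in: 0..<range.lowerBound),
                                  encoding: encoding)
                buffer.removeSubrange(0..<range.upperBound)
                return line
            }
            // Otherwise pull in the next chunk from the file.
            let chunk = fileHandle.readData(ofLength: chunkSize)
            if chunk.isEmpty {
                atEOF = true
                if !buffer.isEmpty {
                    // Final line without a trailing delimiter.
                    let line = String(data: buffer, encoding: encoding)
                    buffer.removeAll()
                    return line
                }
            } else {
                buffer.append(chunk)
            }
        }
        return nil
    }
}

// Conforming to Sequence allows the for-in form mentioned below.
extension StreamReader: Sequence {
    func makeIterator() -> AnyIterator<String> {
        return AnyIterator { self.nextLine() }
    }
}
```

With this, usage is either `while let line = reader.nextLine() { ... }` or simply `for line in reader { ... }`.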
You can even use the reader with a for-in loop, by implementing the SequenceType protocol (compare http://robots.thoughtbot.com/swift-sequences).

Update for Swift 3/Xcode 8 beta 6: Also "modernized" to use guard and the new Data value type.

Or you could simply use a Generator. Let's try it out:
It's simple, lazy, and easy to chain with other Swift things like enumerators and functors such as map, reduce and filter, using the lazy() wrapper.

It generalises to all FILE streams, called like:
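A sketch of that generalisation (illustrative names, assuming POSIX getline is available): wrap any open FILE pointer in a lazy sequence of lines.

```swift
import Foundation
#if canImport(Glibc)
import Glibc
#endif

// Lazily yields the lines of any FILE stream, reading via POSIX getline.
func lines(of file: UnsafeMutablePointer<FILE>) -> AnySequence<String> {
    return AnySequence { () -> AnyIterator<String> in
        var line: UnsafeMutablePointer<CChar>? = nil
        var capacity = 0
        return AnyIterator {
            let length = getline(&line, &capacity, file)
            if length == -1 {
                // EOF (or error): release getline's buffer and stop.
                free(line); line = nil; capacity = 0
                return nil
            }
            var s = String(cString: line!)
            if s.hasSuffix("\n") { s.removeLast() }   // strip the delimiter
            return s
        }
    }
}
```

Called like `for line in lines(of: file) { ... }`, or chained lazily, e.g. `lines(of: file).lazy.filter { !$0.isEmpty }`.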
I wanted a version that did not continually modify the buffer or duplicate code, as both are inefficient, and that would allow a buffer of any size (including 1 byte) and any delimiter. It has one public method: readline(). Calling this method returns the String value of the next line, or nil at EOF.

It is called as follows:
I'm late to the game, but here's a small class I wrote for that purpose. After some different attempts (trying to subclass NSInputStream) I found this to be a reasonable and simple approach.

Remember to #import <stdio.h> in your bridging header.

(Note: I'm using Swift 3.0.1 on Xcode 8.2.1 with macOS Sierra 10.12.3)
All of the answers I've seen here missed that the asker could be looking for LF or CRLF. If everything goes well, they could just match on LF and check the returned string for an extra CR at the end. But the general query involves multiple search strings. In other words, the delimiter needs to be a Set<String>, where the set is neither empty nor contains the empty string, instead of a single string.

On my first try at this last year, I tried to do the "right thing" and search for a general set of strings. It was too hard; you need a full-blown parser, state machines, and such. I gave up on it and the project it was part of.

Now I'm doing the project again and facing the same challenge, but this time I'm going to hard-code searching on CR and LF. I don't think anyone would need to search on two semi-independent, semi-dependent characters like this outside of CR/LF parsing.
I'm using the search methods provided by Data, so I'm not doing string encodings and such here; just raw binary processing. Just assume I've got an ASCII superset, like ISO Latin-1 or UTF-8. You can handle string encoding at the next-higher layer, and you can punt on whether a CR/LF with secondary code points attached still counts as a CR or LF.

The algorithm: just keep searching for the next CR and the next LF from your current byte offset.
Here's some code for that:
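A sketch of that search step (illustrative names, not the answer's actual code): given a Data and a byte offset, find the range of the next line terminator, treating a CR immediately followed by an LF as a single CRLF.

```swift
import Foundation

// Finds the range of the next line terminator (CR, LF, or CRLF)
// at or after `offset`, or nil if none remains.
func nextLineTerminator(in data: Data, from offset: Int) -> Range<Int>? {
    let crByte: UInt8 = 0x0D, lfByte: UInt8 = 0x0A
    let crIndex = data[offset...].firstIndex(of: crByte)
    let lfIndex = data[offset...].firstIndex(of: lfByte)
    switch (crIndex, lfIndex) {
    case (nil, nil):
        return nil
    case (nil, let lf?):
        return lf..<(lf + 1)                      // bare LF
    case (let cr?, nil):
        return cr..<(cr + 1)                      // bare CR
    case (let cr?, let lf?):
        if lf < cr { return lf..<(lf + 1) }       // LF comes first
        if lf == cr + 1 { return cr..<(cr + 2) }  // CR + LF pair: one CRLF
        return cr..<(cr + 1)                      // bare CR
    }
}
```

A caller would then take `offset..<terminator.lowerBound` as the line's bytes and continue from `terminator.upperBound`.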
Of course, if you have a Data block whose length is a significant fraction of a gigabyte, you'll take a hit whenever no more CR or LF bytes exist past the current byte offset; you'll search fruitlessly to the end during every iteration. Reading the data in chunks would help:

You have to mix these ideas together yourself, since I haven't done it yet. Consider:
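One hedged way to mix them (illustrative names; this is a sketch, not the answer's code): read the file in fixed-size chunks and split as the bytes arrive, carrying a partial line, and a CR that may straddle a chunk boundary, over to the next chunk.

```swift
import Foundation

// Calls `body` once per line, splitting on CR, LF, or CRLF,
// while reading the file in chunks of `chunkSize` bytes.
func forEachLine(of url: URL, chunkSize: Int = 1 << 16,
                 _ body: (Data) -> Void) throws {
    let handle = try FileHandle(forReadingFrom: url)
    defer { handle.closeFile() }
    let crByte: UInt8 = 0x0D, lfByte: UInt8 = 0x0A
    var pending = Data()       // partial line carried across chunk boundaries
    var lastWasCR = false      // a trailing CR may be the first half of a CRLF
    var chunk = handle.readData(ofLength: chunkSize)
    while !chunk.isEmpty {
        for byte in chunk {
            switch byte {
            case lfByte:
                if lastWasCR {
                    lastWasCR = false   // second half of CRLF; line already emitted
                } else {
                    body(pending); pending = Data()
                }
            case crByte:
                body(pending); pending = Data()
                lastWasCR = true
            default:
                lastWasCR = false
                pending.append(byte)
            }
        }
        chunk = handle.readData(ofLength: chunkSize)
    }
    if !pending.isEmpty { body(pending) }   // final line without a terminator
}
```

The caller decodes each line's bytes at the next-higher layer, e.g. `String(data: $0, encoding: .utf8)`.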
Good luck!