How to efficiently write large files to disk on ba

Update

I have resolved and removed the distracting error. Please read the entire post and feel free to leave comments if any questions remain.

Background

I am attempting to write relatively large files (video) to disk on iOS using Swift 2.0, GCD, and a completion handler, and I would like to know whether there is a more efficient way to do it. The write must not block the main (UI) thread, must support completion logic, and should finish as quickly as possible. My custom objects have an NSData property, so I am currently experimenting with an extension on NSData. An alternative solution might use NSFileHandle or NSStream together with some form of thread-safe behavior, provided it gives much faster throughput than the NSData writeToURL function on which my current solution is based.

What's wrong with NSData Anyway?

Please note the discussion quoted below from the NSData Class Reference (Saving Data). I do perform writes to my temp directory, but the main reason I am having an issue is that I can see a noticeable lag in the UI when dealing with large files. This lag occurs precisely because the NSData write methods are synchronous rather than asynchronous (and the Apple docs note that atomic writes can cause performance issues on "large" files, roughly anything over 1 MB). So when dealing with large files one is at the mercy of whatever internal mechanism is at work within the NSData methods.

I did some more digging and found this info from Apple: "This method is ideal for converting data:// URLs to NSData objects, and can also be used for reading short files synchronously. If you need to read potentially large files, use inputStreamWithURL: to open a stream, then read the file a piece at a time." (NSData Class Reference, Objective-C, +dataWithContentsOfURL). This seems to imply that I could try using streams to write the file out on a background thread if moving the writeToURL call to a background thread (as suggested by @jtbandes) is not sufficient.

The NSData class and its subclasses provide methods to quickly and easily save their contents to disk. To minimize the risk of data loss, these methods provide the option of saving the data atomically. Atomic writes guarantee that the data is either saved in its entirety, or it fails completely. The atomic write begins by writing the data to a temporary file. If this write succeeds, then the method moves the temporary file to its final location.

While atomic write operations minimize the risk of data loss due to corrupt or partially-written files, they may not be appropriate when writing to a temporary directory, the user’s home directory or other publicly accessible directories. Any time you work with a publicly accessible file, you should treat that file as an untrusted and potentially dangerous resource. An attacker may compromise or corrupt these files. The attacker can also replace the files with hard or symbolic links, causing your write operations to overwrite or corrupt other system resources.

Avoid using the writeToURL:atomically: method (and the related methods) when working inside a publicly accessible directory. Instead initialize an NSFileHandle object with an existing file descriptor and use the NSFileHandle methods to securely write the file.
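The advice quoted above can be sketched roughly as follows, in current Swift syntax rather than Swift 2.0. The function name writeExclusively and the error type are hypothetical, and write(contentsOf:) assumes iOS 13.4 or later; this is one possible reading of "initialize an NSFileHandle object with an existing file descriptor", not Apple's reference implementation.

```swift
import Foundation

// Hypothetical error type for this sketch.
enum SecureWriteError: Error {
    case openFailed(code: Int32)
}

/// Hypothetical helper: creates the file exclusively, then writes through a file handle.
func writeExclusively(_ data: Data, toPath path: String) throws {
    // O_CREAT | O_EXCL makes open() fail if anything, including a symbolic link,
    // already exists at the path, so a planted link is never followed.
    let fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0o600)
    guard fd >= 0 else { throw SecureWriteError.openFailed(code: errno) }

    // Wrap the existing descriptor; the handle closes it when it is deallocated.
    let handle = FileHandle(fileDescriptor: fd, closeOnDealloc: true)
    try handle.write(contentsOf: data)
}
```

Because the call is synchronous, it would still need to run off the main thread for large payloads.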

Other Alternatives

One article on Concurrent Programming at objc.io provides interesting options on "Advanced: File I/O in the Background". Some of the options involve use of an InputStream as well. Apple also has some older references to reading and writing files asynchronously. I am posting this question in anticipation of Swift alternatives.

Example of an appropriate answer

Here is an example of an appropriate answer that might satisfy this type of question. (Taken from the Stream Programming Guide, Writing To Output Streams)

Using an NSOutputStream instance to write to an output stream requires several steps:

Create and initialize an instance of NSOutputStream with a repository for the written data. Also set a delegate.
Schedule the stream object on a run loop and open the stream.
Handle the events that the stream object reports to its delegate.
If the stream object has written data to memory, obtain the data by requesting the NSStreamDataWrittenToMemoryStreamKey property.
When there is no more data to write, dispose of the stream object.
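A simplified sketch of these steps in current Swift syntax is shown here. It deliberately skips the delegate and run loop (steps 1 to 3) and writes synchronously in a loop, which a file-backed stream permits, so it is meant to be called from a background queue. The names writeInChunks and StreamWriteError and the 64 KB default chunk size are assumptions made for illustration.

```swift
import Foundation

// Hypothetical error type for this sketch.
enum StreamWriteError: Error {
    case cannotOpen
    case writeFailed
}

/// Hypothetical helper: writes `data` to `url` in fixed-size chunks via an output stream.
func writeInChunks(_ data: Data, to url: URL, chunkSize: Int = 64 * 1024) throws {
    guard let stream = OutputStream(url: url, append: false) else {
        throw StreamWriteError.cannotOpen
    }
    stream.open()
    defer { stream.close() }   // dispose of the stream when there is nothing left to write

    var offset = data.startIndex
    while offset < data.endIndex {
        let end = min(offset + chunkSize, data.endIndex)
        // write() may accept fewer bytes than offered, so advance by its return value.
        let written = data[offset..<end].withUnsafeBytes { raw -> Int in
            guard let base = raw.baseAddress?.assumingMemoryBound(to: UInt8.self) else { return 0 }
            return stream.write(base, maxLength: raw.count)
        }
        guard written > 0 else { throw stream.streamError ?? StreamWriteError.writeFailed }
        offset += written
    }
}
```

Note that this still requires the whole payload to be in memory as a Data value; the memory benefit of streams only appears when the source is itself read in pieces.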

I am looking for the most efficient algorithm for writing extremely large files on iOS. Swift is preferred, but other APIs or even C/Objective-C would suffice; I can translate the algorithm into appropriate Swift-compatible constructs.

Nota Bene

~~I understand the informational error below. It is included for completeness.~~ This question is asking whether or not there is a better algorithm to use for writing large files to disk with a guaranteed dependency sequence (e.g. NSOperation dependencies). If there is, please provide enough information (a description or sample from which I can reconstruct pertinent Swift 2.0 compatible code). Please advise if I am missing any information that would help answer the question.

Note on the extension

I've added a completion handler to the base writeToURL to ensure that no unintended resource sharing occurs. My dependent tasks that use the file should never face a race condition.

extension NSData {

    func writeToURL(named:String, completion: (result: Bool, url:NSURL?) -> Void)  {

        let filePath = NSTemporaryDirectory() + named
        let tmpURL = NSURL( fileURLWithPath:  filePath )

        // self is captured strongly so the data stays alive until the write finishes
        // (force-unwrapping a weak reference here could crash)
        dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), {
            //write to URL atomically
            let written = self.writeToURL(tmpURL, atomically: true)
            let exists = NSFileManager.defaultManager().fileExistsAtPath( filePath )

            // always report the outcome, including a failed write;
            // note that the completion handler runs on this background queue
            completion(result: written && exists, url: tmpURL)
        })

    }
}

This method is used to process the custom objects data from a controller using:

var items = [AnyObject]()
if let video = myCustomClass.data {

    //video is of type NSData
    video.writeToURL("shared.mp4", completion: { (result, url) -> Void in
        guard let url = url where result else { return }

        // the completion handler runs on a background queue; UIKit must be used on the main queue
        dispatch_async(dispatch_get_main_queue(), {
            items.append(url)

            let sharedActivityView = UIActivityViewController(activityItems: items, applicationActivities: nil)

            self.presentViewController(sharedActivityView, animated: true) { () -> Void in
                //finished
            }
        })
    })
}
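For comparison, the same pattern (write on a background queue, then report through a completion handler) might look like this in current Swift syntax. The method name writeInBackground and the callbackQueue parameter are hypothetical and not part of Foundation; the sketch assumes Swift 5 with the Result type.

```swift
import Foundation

extension Data {
    /// Hypothetical helper: writes atomically on a background queue and
    /// delivers the outcome on `callbackQueue` (the main queue by default).
    func writeInBackground(to url: URL,
                           callbackQueue: DispatchQueue = .main,
                           completion: @escaping (Result<URL, Error>) -> Void) {
        let data = self   // Data is a value type; this copy is cheap (copy-on-write)
        DispatchQueue.global(qos: .utility).async {
            let result: Result<URL, Error>
            do {
                try data.write(to: url, options: .atomic)
                result = .success(url)
            } catch {
                result = .failure(error)
            }
            // Every path reports back, so dependent work never waits forever.
            callbackQueue.async { completion(result) }
        }
    }
}
```

Delivering the result on the main queue by default means the caller can present UI directly from the completion handler.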

Conclusion

The Apple Docs on Core Data Performance provide some good advice on dealing with memory pressure and managing BLOBs. This is really one heck of an article with a lot of clues to behavior and how to moderate the issue of large files within your app. Now although it is specific to Core Data and not files, the warning on atomic writing does tell me that I ought to implement methods that write atomically with great care.

With large files, the only safe way to manage writing seems to be adding in a completion handler (to the write method) and showing an activity view on the main thread. Whether one does that with a stream or by modifying an existing API to add completion logic is up to the reader. I've done both in the past and am in the midst of testing for best performance.

Until then, I'm changing the solution to remove all binary data properties from Core Data and replace them with strings that hold asset URLs on disk. I am also leveraging the built-in functionality of Assets Library and PHAsset to grab and store all related asset URLs. When or if I need to copy any assets, I will use standard API methods (export methods on PHAsset/Asset Library) with completion handlers to notify the user of the finished state on the main thread.

(Really useful snippets from the Core Data Performance article)

Reducing Memory Overhead

It is sometimes the case that you want to use managed objects on a temporary basis, for example to calculate an average value for a particular attribute. This causes your object graph, and memory consumption, to grow. You can reduce the memory overhead by re-faulting individual managed objects that you no longer need, or you can reset a managed object context to clear an entire object graph. You can also use patterns that apply to Cocoa programming in general.

You can re-fault an individual managed object using NSManagedObjectContext’s refreshObject:mergeChanges: method. This has the effect of clearing its in-memory property values thereby reducing its memory overhead. (Note that this is not the same as setting the property values to nil—the values will be retrieved on demand if the fault is fired—see Faulting and Uniquing.)

When you create a fetch request you can set includesPropertyValues to NO to reduce memory overhead by avoiding creation of objects to represent the property values. You should typically only do so, however, if you are sure that either you will not need the actual property data or you already have the information in the row cache, otherwise you will incur multiple trips to the persistent store.

You can use the reset method of NSManagedObjectContext to remove all managed objects associated with a context and "start over" as if you'd just created it. Note that any managed object associated with that context will be invalidated, and so you will need to discard any references to and re-fetch any objects associated with that context in which you are still interested. If you iterate over a lot of objects, you may need to use local autorelease pool blocks to ensure temporary objects are deallocated as soon as possible.

If you do not intend to use Core Data’s undo functionality, you can reduce your application's resource requirements by setting the context’s undo manager to nil. This may be especially beneficial for background worker threads, as well as for large import or batch operations.

Finally, Core Data does not by default keep strong references to managed objects (unless they have unsaved changes). If you have lots of objects in memory, you should determine the owning references. Managed objects maintain strong references to each other through relationships, which can easily create strong reference cycles. You can break cycles by re-faulting objects (again by using the refreshObject:mergeChanges: method of NSManagedObjectContext).

Large Data Objects (BLOBs)

If your application uses large BLOBs ("Binary Large OBjects" such as image and sound data), you need to take care to minimize overheads. The exact definition of “small”, “modest”, and “large” is fluid and depends on an application’s usage. A loose rule of thumb is that objects in the order of kilobytes in size are of a “modest” sized and those in the order of megabytes in size are “large” sized. Some developers have achieved good performance with 10MB BLOBs in a database. On the other hand, if an application has millions of rows in a table, even 128 bytes might be a "modest" sized CLOB (Character Large OBject) that needs to be normalized into a separate table.

In general, if you need to store BLOBs in a persistent store, you should use an SQLite store. The XML and binary stores require that the whole object graph reside in memory, and store writes are atomic (see Persistent Store Features) which means that they do not efficiently deal with large data objects. SQLite can scale to handle extremely large databases. Properly used, SQLite provides good performance for databases up to 100GB, and a single row can hold up to 1GB (although of course reading 1GB of data into memory is an expensive operation no matter how efficient the repository).

A BLOB often represents an attribute of an entity—for example, a photograph might be an attribute of an Employee entity. For small to modest sized BLOBs (and CLOBs), you should create a separate entity for the data and create a to-one relationship in place of the attribute. For example, you might create Employee and Photograph entities with a one-to-one relationship between them, where the relationship from Employee to Photograph replaces the Employee's photograph attribute. This pattern maximizes the benefits of object faulting (see Faulting and Uniquing). Any given photograph is only retrieved if it is actually needed (if the relationship is traversed).

It is better, however, if you are able to store BLOBs as resources on the filesystem, and to maintain links (such as URLs or paths) to those resources. You can then load a BLOB as and when necessary.
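A minimal sketch of that last recommendation, in current Swift syntax: the BLOB goes to a file and only its name is kept in the model. VideoRecord, storeBlob, and loadBlob are hypothetical names, and a plain struct stands in for a Core Data entity. Storing a file name relative to a known directory, rather than an absolute path, is an assumption made here because an app's container path is not guaranteed to stay the same.

```swift
import Foundation

/// Hypothetical model: holds a file name instead of the binary data itself.
struct VideoRecord {
    let title: String
    let assetFileName: String
}

/// Writes the BLOB into `directory` and returns the name to store in the model.
func storeBlob(_ blob: Data, named fileName: String, in directory: URL) throws -> String {
    try FileManager.default.createDirectory(at: directory, withIntermediateDirectories: true)
    try blob.write(to: directory.appendingPathComponent(fileName))
    return fileName
}

/// Loads the BLOB only when it is actually needed.
func loadBlob(for record: VideoRecord, in directory: URL) throws -> Data {
    let url = directory.appendingPathComponent(record.assetFileName)
    // .mappedIfSafe lets the system map the file rather than copy it all into memory.
    return try Data(contentsOf: url, options: .mappedIfSafe)
}
```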

Note:

I've moved the logic below into the completion handler (see the code above) and I no longer see any error. As mentioned before this question is about whether or not there is a more performant way to process large files in iOS using Swift.

~~When attempting to process the resulting items array to pass to a UIActvityViewController, using the following logic:~~

if items.count > 0 {
    let sharedActivityView = UIActivityViewController(activityItems: items, applicationActivities: nil)
    self.presentViewController(sharedActivityView, animated: true) { () -> Void in
        //finished
    }
}

I am seeing the following error: Communications error: { count = 1, contents = "XPCErrorDescription" => { length = 22, contents = "Connection interrupted" } }> (please note, I am looking for a better design, not an answer to this error message)

Tags: ios swift multithreading large-files large-data

3 Answers

Emotional °昔

#2 · 2019-01-30 01:07

Current Solution (2018)

Another useful possibility is to invoke a closure whenever the buffer is filled (or after a timed length of recording) to append the data, and a second closure to announce the end of the stream of data. In combination with some of the Photo APIs this could lead to good outcomes. Declarations like the ones below could then be fired during processing:

var dataSpoolingFinished: ((URL?, Error?) -> Void)?
var dataSpooling: ((Data?, Error?) -> Void)?

Handling these closures in your management object may allow you to succinctly handle data of any size while keeping the memory under control.
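One way such a management object might use the two closures is sketched here. DataSpooler, SpoolError, and the queue label are hypothetical names; the sketch assumes iOS 13.4 or later for the throwing FileHandle methods and assumes both closures are set before the first chunk is appended.

```swift
import Foundation

// Hypothetical error type for this sketch.
enum SpoolError: Error {
    case cannotCreateFile
}

/// Hypothetical manager: appends each chunk to a file on a serial queue,
/// so only one chunk at a time needs to be held in memory.
final class DataSpooler {
    var dataSpooling: ((Data?, Error?) -> Void)?
    var dataSpoolingFinished: ((URL?, Error?) -> Void)?

    private let url: URL
    private let handle: FileHandle
    private let queue = DispatchQueue(label: "example.spooler")   // serial by default

    init(url: URL) throws {
        guard FileManager.default.createFile(atPath: url.path, contents: nil) else {
            throw SpoolError.cannotCreateFile
        }
        self.handle = try FileHandle(forWritingTo: url)
        self.url = url
    }

    /// Call whenever a buffer has been filled.
    func append(_ chunk: Data) {
        queue.async {
            do {
                try self.handle.write(contentsOf: chunk)
                self.dataSpooling?(chunk, nil)
            } catch {
                self.dataSpooling?(nil, error)
            }
        }
    }

    /// Call once after the last chunk; runs after all pending appends.
    func finish() {
        queue.async {
            do {
                try self.handle.close()
                self.dataSpoolingFinished?(self.url, nil)
            } catch {
                self.dataSpoolingFinished?(nil, error)
            }
        }
    }
}
```

The serial queue preserves chunk order and guarantees that the finished closure fires only after every earlier append has been written.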

Couple that idea with the use of a recursive method that aggregates pieces of work into a single dispatch_group and there could be some exciting possibilities.

Apple docs state:

DispatchGroup allows for aggregate synchronization of work. You can use them to submit multiple different work items and track when they all complete, even though they might run on different queues. This behavior can be helpful when progress can’t be made until all of the specified tasks are complete.
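As a small illustration of that description (a plain loop rather than the recursive method mentioned earlier), several file writes can be submitted to one group with a single notification when all of them have completed. The function name writeAll and its parameters are hypothetical, and the sketch assumes the Swift 5 language mode.

```swift
import Foundation

/// Hypothetical helper: writes several files concurrently and calls
/// `completion` once, with any failures keyed by file name.
func writeAll(_ files: [(name: String, data: Data)],
              into directory: URL,
              notifyOn queue: DispatchQueue = .main,
              completion: @escaping ([String: Error]) -> Void) {
    let group = DispatchGroup()
    let lock = NSLock()                 // protects `failures` across concurrent writes
    var failures = [String: Error]()

    for file in files {
        DispatchQueue.global(qos: .utility).async(group: group) {
            do {
                try file.data.write(to: directory.appendingPathComponent(file.name))
            } catch {
                lock.lock()
                failures[file.name] = error
                lock.unlock()
            }
        }
    }

    // Runs only after every work item submitted to the group has finished.
    group.notify(queue: queue) {
        completion(failures)
    }
}
```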

File System Programming Guide

Apple's Processing an Entire File Linearly Using Streams article in the FSPG also provided the notion that NSInputStream and NSOutputStream should be inherently thread safe.

Further Refinements

This object doesn't use stream delegation methods. There is plenty of room for other refinements, but this is the basic approach I will take. The main focus on the iPhone is enabling large file management while constraining memory via a buffer (TBD: leverage the outputStream in-memory buffer). To be clear, Apple does mention that its writeToURL convenience functions are only meant for smaller file sizes (which makes me wonder why they don't take care of larger files; these are not edge cases. Note: I will file the question as a bug).

Conclusion

I will have to test further for integrating on a background thread as I don't want to interfere with any NSStream internal queuing. I have some other objects that use similar ideas to manage extremely large data files over the wire. The best method is to keep file sizes as small as possible in iOS to conserve memory and prevent app crashes. The APIs are built with these constraints in mind (which is why attempting unlimited video is not a good idea), so I will have to adapt expectations overall.

(Gist Source, Check gist for latest changes)

import Foundation
import Darwin.Mach.mach_time

class MNGStreamReaderWriter:NSObject {

    var copyOutput:NSOutputStream?
    var fileInput:NSInputStream?
    var outputStream:NSOutputStream? = NSOutputStream(toMemory: ())
    var urlInput:NSURL?

    convenience init(srcURL:NSURL, targetURL:NSURL) {
        self.init()
        self.fileInput  = NSInputStream(URL: srcURL)
        self.copyOutput = NSOutputStream(URL: targetURL, append: false)
        self.urlInput   = srcURL

    }

    func copyFileURLToURL(destURL:NSURL, withProgressBlock block: (fileSize:Double,percent:Double,estimatedTimeRemaining:Double) -> ()){

        //note: destURL is unused; the copy goes to the targetURL passed to init
        guard let copyOutput = self.copyOutput, let fileInput = self.fileInput, let urlInput = self.urlInput else { return }

        let fileSize            = sizeOfInputFile(urlInput)
        let bufferSize          = 4096
        let buffer              = UnsafeMutablePointer<UInt8>.alloc(bufferSize)
        var counter             = 0
        var copySize            = 0

        fileInput.open()
        copyOutput.open()

        //free the buffer and close both streams on every exit path
        defer {
            buffer.dealloc(bufferSize)
            fileInput.close()
            copyOutput.close()
        }

        //start time; mach ticks are converted to seconds using the timebase
        var timebase = mach_timebase_info_data_t()
        mach_timebase_info(&timebase)
        let time0 = mach_absolute_time()

        while fileInput.hasBytesAvailable {

            //read one chunk
            let bytesRead = fileInput.read(buffer, maxLength: bufferSize)
            if bytesRead < 0 {
                print(fileInput.streamError)
                return
            }
            if bytesRead == 0 { break }

            //write exactly the bytes that were read; a single write may be partial
            var offset = 0
            while offset < bytesRead {
                let bytesWritten = copyOutput.write(buffer + offset, maxLength: bytesRead - offset)
                if bytesWritten <= 0 {
                    print(copyOutput.streamError)
                    return
                }
                offset   += bytesWritten
                copySize += bytesWritten
            }

            counter += 1
            if let fileSize = fileSize where fileSize > 0 && counter % 10 == 0 {
                //pass back a progress tuple (floating-point division, not integer division)
                let percent     = Double(copySize) / Double(fileSize)
                let time1       = mach_absolute_time()
                let elapsed     = Double(time1 - time0) * Double(timebase.numer) / Double(timebase.denom) / Double(NSEC_PER_SEC)
                let estTimeLeft = ((1 - percent) / percent) * elapsed

                block(fileSize: Double(copySize), percent: percent, estimatedTimeRemaining: estTimeLeft)
            }
        }

        //send final progress tuple
        block(fileSize: Double(copySize), percent: 1, estimatedTimeRemaining: 0)
    }

    func sizeOfInputFile(src:NSURL) -> Int? {

        guard let path = src.path else { return nil }

        do {
            let attributes = try NSFileManager.defaultManager().attributesOfItemAtPath(path)
            //the size is stored under the NSFileSize key
            return (attributes[NSFileSize] as? NSNumber)?.integerValue

        } catch let inputFileError as NSError {
            print(inputFileError.localizedDescription,inputFileError.localizedRecoverySuggestion)
        }

        return nil
    }


}

Delegation

Here's a similar object that I rewrote from an article (Advanced File I/O in the Background, Eidhof, C., objc.io). With just a few tweaks this could be made to emulate the behavior above. Simply redirect the data to an NSOutputStream in the processDataChunk method.

(Gist Source - Check gist for latest changes)

import Foundation

class MNGStreamReader: NSObject, NSStreamDelegate {

    var callback: ((lineNumber: UInt , stringValue: String) -> ())?
    var completion: ((Int) -> Void)?
    var fileURL:NSURL?
    var inputData:NSData?
    var inputStream: NSInputStream?
    var lineNumber:UInt = 0
    var queue:NSOperationQueue?
    var remainder:NSMutableData?
    var delimiter:NSData?
    //var reader:NSInputStreamReader?

    func enumerateLinesWithBlock(block: (UInt, String)->() , completionHandler completion:(numberOfLines:Int) -> Void ) {

        if self.queue == nil {
            self.queue = NSOperationQueue()
            self.queue!.maxConcurrentOperationCount = 1
        }

        assert(self.queue!.maxConcurrentOperationCount == 1, "Queue can't be concurrent.")
        assert(self.inputStream == nil, "Cannot process multiple input streams in parallel")

        self.callback = block
        self.completion = completion

        if self.fileURL != nil {
            self.inputStream = NSInputStream(URL: self.fileURL!)
        } else if self.inputData != nil {
            self.inputStream = NSInputStream(data: self.inputData!)
        }

        self.inputStream!.delegate = self
        self.inputStream!.scheduleInRunLoop(NSRunLoop.currentRunLoop(), forMode: NSDefaultRunLoopMode)
        self.inputStream!.open()
    }

    convenience init? (withData inbound:NSData) {
        self.init()
        self.inputData = inbound
        self.delimiter = "\n".dataUsingEncoding(NSUTF8StringEncoding)

    }

    convenience init? (withFileAtURL fileURL: NSURL) {
        self.init()

        //only file URLs are supported
        guard fileURL.fileURL else { return nil }

        self.fileURL = fileURL
        self.delimiter = "\n".dataUsingEncoding(NSUTF8StringEncoding)
    }

    @objc func stream(aStream: NSStream, handleEvent eventCode: NSStreamEvent){

        switch eventCode {
        case NSStreamEvent.OpenCompleted:
            //nothing to do yet; falling through here would close the stream as soon as it opens
            break
        case NSStreamEvent.EndEncountered:
            //flush the last line, if any, then tear down
            if let remainder = self.remainder {
                self.emitLineWithData(remainder)
            }
            self.remainder = nil
            self.inputStream!.close()
            self.inputStream = nil

            self.queue!.addOperationWithBlock({ () -> Void in
                self.completion!(Int(self.lineNumber) + 1)
            })

            break
        case NSStreamEvent.ErrorOccurred:
            NSLog("error")
            break
        case NSStreamEvent.HasSpaceAvailable:
            NSLog("HasSpaceAvailable")
            break
        case NSStreamEvent.HasBytesAvailable:
            NSLog("HasBytesAvailable")

            //length (not capacity) must be 4096; otherwise maxLength below is 0 and nothing is read
            if let buffer = NSMutableData(length: 4096) {
                let length = self.inputStream!.read(UnsafeMutablePointer<UInt8>(buffer.mutableBytes), maxLength: buffer.length)
                if 0 < length {
                    buffer.length = length
                    self.queue!.addOperationWithBlock({ [weak self]  () -> Void in
                        self?.processDataChunk(buffer)
                        })
                }
            }
            break
        default:
            break
        }
    }

    func processDataChunk(buffer: NSMutableData) {
        if self.remainder != nil {

            self.remainder!.appendData(buffer)

        } else {

            self.remainder = buffer
        }

        self.remainder!.mng_enumerateComponentsSeparatedBy(self.delimiter!, block: {( component: NSData, last: Bool) in

            if !last {
                self.emitLineWithData(component)
            }
            else {
                if 0 < component.length {
                    self.remainder = (component.mutableCopy() as! NSMutableData)
                }
                else {
                    self.remainder = nil
                }
            }
        })
    }

    func emitLineWithData(data: NSData) {
        let lineNumber = self.lineNumber
        self.lineNumber = lineNumber + 1
        if 0 < data.length {
            if let line = NSString(data: data, encoding: NSUTF8StringEncoding) {
                callback!(lineNumber: lineNumber, stringValue: line as String)
            }
        }
    }
}


放荡不羁爱自由

#3 · 2019-01-30 01:15

You should consider using NSStream (NSOutputStream/NSInputStream). If you choose this approach, keep in mind that the background thread's run loop will need to be started (run) explicitly.

NSOutputStream has a method called outputStreamToFileAtPath:append: which is what you might be looking for.
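A rough sketch of that approach in current Swift syntax follows, where outputStreamToFileAtPath:append: appears as OutputStream(toFileAtPath:append:). The class name RunLoopFileWriter, the error type, and the 16 KB chunk size are hypothetical. The sketch assumes Apple platforms, where stream delegate events are delivered on the run loop the stream is scheduled in.

```swift
import Foundation

enum RunLoopWriteError: Error { case unknown }   // hypothetical error type

/// Hypothetical writer: drives an output stream from a background thread's run loop.
final class RunLoopFileWriter: NSObject, StreamDelegate {
    private let data: Data
    private let stream: OutputStream
    private let completion: (Error?) -> Void
    private var offset: Int
    private var finished = false

    init?(data: Data, path: String, completion: @escaping (Error?) -> Void) {
        guard let stream = OutputStream(toFileAtPath: path, append: false) else { return nil }
        self.data = data
        self.stream = stream
        self.completion = completion
        self.offset = data.startIndex
        super.init()
    }

    func start() {
        // The closure keeps the writer alive until the thread exits.
        let thread = Thread {
            self.stream.delegate = self
            self.stream.schedule(in: .current, forMode: .default)
            self.stream.open()
            // A background thread's run loop does not run by itself; run it until done.
            while !self.finished && RunLoop.current.run(mode: .default, before: .distantFuture) {}
        }
        thread.start()
    }

    func stream(_ aStream: Stream, handle eventCode: Stream.Event) {
        switch eventCode {
        case .hasSpaceAvailable: writeNextChunk()
        case .errorOccurred: finish(with: stream.streamError ?? RunLoopWriteError.unknown)
        default: break
        }
    }

    private func writeNextChunk() {
        guard offset < data.endIndex else { finish(with: nil); return }
        let end = min(offset + 16 * 1024, data.endIndex)
        let written = data[offset..<end].withUnsafeBytes { raw -> Int in
            guard let base = raw.baseAddress?.assumingMemoryBound(to: UInt8.self) else { return 0 }
            return stream.write(base, maxLength: raw.count)
        }
        if written < 0 { finish(with: stream.streamError ?? RunLoopWriteError.unknown); return }
        offset += written
        if offset >= data.endIndex { finish(with: nil) }
    }

    // Called on the background thread; a nil error means the whole payload was written.
    private func finish(with error: Error?) {
        guard !finished else { return }
        finished = true
        stream.close()
        stream.remove(from: .current, forMode: .default)
        stream.delegate = nil
        completion(error)
    }
}
```

The completion handler runs on the background thread, so any UI work would need to be dispatched to the main queue.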