I'm trying to read a zip file, check that it has some required files, and then write all valid files out to another zip file. The basic introduction to java.util.zip has a lot of Java-isms and I'd love to make my code more Scala-native. Specifically, I'd like to avoid the use of vars
. Here's what I have:
val fos = new FileOutputStream("new.zip");
val zipOut = new ZipOutputStream(new BufferedOutputStream(fos));
while (zipIn.available == 1) {
val entry = zipIn.getNextEntry
if (entryIsValid(entry)) {
zipOut.putNewEntry(new ZipEntry("subdir/" + entry.getName())
// read data into the data Array
var data = Array[Byte](1024)
var count = zipIn.read(data, 0, 1024)
while (count != -1) {
zipOut.write(data, 0, count)
count = zipIn.read(data, 0, 1024)
}
}
zipIn.close
}
zipOut.close
I should add that I'm using Scala 2.7.7.
dI don't think there's anything particularly wrong with using Java classes that are designed to work in imperative fashion in the fashion they were designed. Idiomatic Scala includes being able to use idiomatic Java as it was intended, even if the styles do clash a bit.
However, if you want--perhaps as an exercise, or perhaps because it does slightly clarify the logic--to do this in a more functional var-free way, you can do so. In 2.8, it's particularly nice, so even though you're using 2.7.7, I'll give a 2.8 answer.
First, we need to set up the problem, which you didn't entirely, but let's suppose we have something like this:
import java.io._
import java.util.zip._
import scala.collection.immutable.Stream
val fos = new FileOutputStream("new.zip")
val zipOut = new ZipOutputStream(new BufferedOutputStream(fos))
val zipIn = new ZipInputStream(new FileInputStream("old.zip"))
def entryIsValid(ze: ZipEntry) = !ze.isDirectory
Now, given this we want to copy the zip file. The trick we can use is the continually
method in collection.immutable.Stream
. What it does is perform a lazily-evaluated loop for you. You can then take and filter the results to terminate and process what you want. It's a handy pattern to use when you have something that you want to be an iterator, but it isn't. (If the item updates itself you can use .iterate
in Iterable
or Iterator
--that's usually even better.) Here's the application to this case, used twice: once to get the entries, and once to read/write chunks of data:
val buffer = new Array[Byte](1024)
Stream.continually(zipIn.getNextEntry).
takeWhile(_ != null).filter(entryIsValid).
foreach(entry => {
zipOut.putNextEntry(new ZipEntry("subdir/"+entry.getName))
Stream.continually(zipIn.read(buffer)).takeWhile(_ != -1).
foreach(count => zipOut.write(buffer,0,count))
})
}
zipIn.close
zipOut.close
Pay close attention to the .
at the end of some lines! I would normally write this on one long line, but it's nicer to have it wrap so you can see it all here.
Just in case it isn't clear, let's unpack one of the uses of continually
.
Stream.continually(zipIn.read(buffer))
This asks to keep calling zipIn.read(buffer)
for as many times as necessary, storing the integer that results.
.takeWhile(_ != -1)
This specifies how many times are necessary, returning a stream of indefinite length but which will quit when it hits a -1
.
.foreach(count => zipOut.write(buffer,0,count))
This processes the stream, taking each item in turn (the count), and using it to write the buffer. This works in a slightly sneaky way, since you rely upon the fact that zipIn
has just been called to get the next element of the stream--if you tried to do this again, not on a single pass through the stream, it would fail because buffer
would be overwritten. But here it's okay.
So, there it is: a slightly more compact, possibly easier to understand, possibly less easy to understand method that is more functional (though there are still side-effects galore). In 2.7.7, in contrast, I would actually do it the Java way because Stream.continually
isn't available, and the overhead of building a custom Iterator
isn't worth it for this one case. (It would be worth it if I was going to do more zip file processing and could reuse the code, however.)
Edit: The looking-for-available-to-go-zero method is kind of flaky for detecting the end of the zip file. I think the "correct" way is to wait until you get a null
back from getNextEntry
. With that in mind, I've edited the previous code (there was a takeWhile(_ => zipIn.available==1)
that is now a takeWhile(_ != null)
) and provided a 2.7.7 iterator based version below (note how small the main loop is, once you get through the work of defining the iterators, which do admittedly use vars):
val buffer = new Array[Byte](1024)
class ZipIter(zis: ZipInputStream) extends Iterator[ZipEntry] {
private var entry:ZipEntry = zis.getNextEntry
private var cached = true
private def cache { if (entry != null && !cached) {
cached = true; entry = zis.getNextEntry
}}
def hasNext = { cache; entry != null }
def next = {
if (!cached) cache
cached = false
entry
}
}
class DataIter(is: InputStream, ab: Array[Byte]) extends Iterator[(Int,Array[Byte])] {
private var count = 0
private var waiting = false
def hasNext = {
if (!waiting && count != -1) { count = is.read(ab); waiting=true }
count != -1
}
def next = { waiting=false; (count,ab) }
}
(new ZipIter(zipIn)).filter(entryIsValid).foreach(entry => {
zipOut.putNextEntry(new ZipEntry("subdir/"+entry.getName))
(new DataIter(zipIn,buffer)).foreach(cb => zipOut.write(cb._2,0,cb._1))
})
zipIn.close
zipOut.close
Using scala2.8 and tail recursive call :
def copyZip(in: ZipInputStream, out: ZipOutputStream, bufferSize: Int = 1024) {
val data = new Array[Byte](bufferSize)
def copyEntry() {
in getNextEntry match {
case null =>
case entry => {
if (entryIsValid(entry)) {
out.putNextEntry(new ZipEntry("subdir/" + entry.getName()))
def copyData() {
in read data match {
case -1 =>
case count => {
out.write(data, 0, count)
copyData()
}
}
}
copyData()
}
copyEntry()
}
}
}
copyEntry()
}
I'd try something like this (yes, pretty much the same idea sblundy had):
Iterator.continually {
val data = new Array[Byte](100)
zipIn.read(data) match {
case -1 => Array.empty[Byte]
case 0 => new Array[Byte](101) // just to filter it out
case n => java.util.Arrays.copyOf(data, n)
}
} filter (_.size != 101) takeWhile (_.nonEmpty)
It could be simplified like below, but I'm not very fond of it. I'd prefer for read
not to be able to return 0...
Iterator.continually {
val data = new Array[Byte](100)
zipIn.read(data) match {
case -1 => new Array[Byte](101)
case n => java.util.Arrays.copyOf(data, n)
}
} takeWhile (_.size != 101)
Based on http://harrah.github.io/browse/samples/compiler/scala/tools/nsc/io/ZipArchive.scala.html:
private[io] class ZipEntryTraversableClass(in: InputStream) extends Traversable[ZipEntry] {
val zis = new ZipInputStream(in)
def foreach[U](f: ZipEntry => U) {
@tailrec
def loop(x: ZipEntry): Unit = if (x != null) {
f(x)
zis.closeEntry()
loop(zis.getNextEntry())
}
loop(zis.getNextEntry())
}
def writeCurrentEntryTo(os: OutputStream) {
IOUtils.copy(zis, os)
}
}
Without tail-recursion, I'd avoid recursion. You would run the risk to a stack overflow. You could wrap zipIn.read(data)
in an scala.BufferedIterator[Byte]
and go from there.