I'm trying to read an archive that's being tarred, streaming, to stdin, but I'm somehow reading far more data in the pipe than tar is sending.
I run my command like this:
tar -cf - somefolder | ./my-go-binary
The source code is like this:
package main
import (
"bufio"
"io"
"log"
"os"
)
// Read from standard input
func main() {
reader := bufio.NewReader(os.Stdin)
// Read all data from stdin, processing subsequent reads as chunks.
parts := 0
for {
parts++
data := make([]byte, 4<<20) // Read 4MB at a time
_, err := reader.Read(data)
if err == io.EOF {
break
} else if err != nil {
log.Fatalf("Problems reading from input: %s", err)
}
}
log.Printf("Total parts processed: %d\n", parts)
}
For a 100MB tarred folder, I'm getting 1468 chunks of 4MB (that's 6.15GB)! Further, it doesn't seem to matter how large the data []byte
array is: if I set the chunk size to 40MB, I still get ~1400 chunks of 40MB data, which makes no sense at all.
Is there something I need to do to read data from os.Stdin
properly with Go?
Your code is inefficient. It's allocating and initializing data
each time through the loop.
for {
data := make([]byte, 4<<20) // Read 4MB at a time
}
The code for your reader
as an io.Reader
is wrong. For example, you ignore the number of bytes read by _, err := reader.Read(data)
and you don't handle err
errors properly.
Package io
import "io"
type Reader
type Reader interface {
Read(p []byte) (n int, err error)
}
Reader is the interface that wraps the basic Read method.
Read reads up to len(p) bytes into p. It returns the number of bytes
read (0 <= n <= len(p)) and any error encountered. Even if Read
returns n < len(p), it may use all of p as scratch space during the
call. If some data is available but not len(p) bytes, Read
conventionally returns what is available instead of waiting for more.
When Read encounters an error or end-of-file condition after
successfully reading n > 0 bytes, it returns the number of bytes read.
It may return the (non-nil) error from the same call or return the
error (and n == 0) from a subsequent call. An instance of this general
case is that a Reader returning a non-zero number of bytes at the end
of the input stream may return either err == EOF or err == nil. The
next Read should return 0, EOF regardless.
Callers should always process the n > 0 bytes returned before
considering the error err. Doing so correctly handles I/O errors that
happen after reading some bytes and also both of the allowed EOF
behaviors.
Implementations of Read are discouraged from returning a zero byte
count with a nil error, except when len(p) == 0. Callers should treat
a return of 0 and nil as indicating that nothing happened; in
particular it does not indicate EOF.
Implementations must not retain p.
Here's a model file read program that conforms to the io.Reader
interface:
package main
import (
"bufio"
"io"
"log"
"os"
)
func main() {
nBytes, nChunks := int64(0), int64(0)
r := bufio.NewReader(os.Stdin)
buf := make([]byte, 0, 4*1024)
for {
n, err := r.Read(buf[:cap(buf)])
buf = buf[:n]
if n == 0 {
if err == nil {
continue
}
if err == io.EOF {
break
}
log.Fatal(err)
}
nChunks++
nBytes += int64(len(buf))
// process buf
if err != nil && err != io.EOF {
log.Fatal(err)
}
}
log.Println("Bytes:", nBytes, "Chunks:", nChunks)
}
Output:
2014/11/29 10:00:05 Bytes: 5589891 Chunks: 1365
Read the documentation for Read:
Read reads data into p. It returns the number of bytes read into p. It
calls Read at most once on the underlying Reader, hence n may be less
than len(p). At EOF, the count will be zero and err will be io.EOF.
You are not reading 4MB at a time. You are providing buffer space and discarding the integer that would have told you how much the Read actually read. The buffer space is the maximum, but most usually 128k seems to get read per call, at least on my system. Try it out yourself:
// Read from standard input
func main() {
reader := bufio.NewReader(os.Stdin)
// Read all data from stdin, passing the data as parts into the channel
// for processing.
parts := 0
for {
parts++
data := make([]byte, 4<<20) // Read 4MB at a time
amount , err := reader.Read(data)
// WILL NOT BE 4MB!
log.Printf("Read: %v\n", amount)
if err == io.EOF {
break
} else if err != nil {
log.Fatalf("Problems reading from input: %s", err)
}
}
log.Printf("Total parts processed: %d\n", parts)
}
You have to implement the logic for handling the varying read amounts.