How to read a file character by character in Go

Posted 2020-07-11 06:38

Question:

I have some large JSON files I want to parse, and I want to avoid loading all of the data into memory at once. I'd like a function or loop that returns each character one at a time.

I found this example for iterating over words in a string, and the ScanRunes function in the bufio package looks like it could return a character at a time. I also had the ReadRune function from bufio mostly working, but that felt like a pretty heavy approach.

EDIT

I compared 3 approaches. All used a loop to pull content from either a bufio.Reader or a bufio.Scanner.

  1. Read runes in a loop using .ReadRune on a bufio.Reader. Checked for errors from the call to .ReadRune.
  2. Read bytes from a bufio.Scanner after calling .Split(bufio.ScanRunes) on the scanner. Called .Scan and .Bytes on each iteration, checking .Scan call for errors.
  3. Same as #2, but read text from the bufio.Scanner with .Text instead of bytes. Instead of joining a slice of runes with string([]rune), I joined a slice of strings with strings.Join([]string, "") to form the final blobs of text.

The timing for 10 runs of each on a 23 MB json file was:

  1. 0.65 s
  2. 2.40 s
  3. 0.97 s

So it looks like ReadRune is not too bad after all. It also results in a smaller, less verbose call site, because each rune is fetched in one operation (.ReadRune) instead of two (.Scan and .Bytes).

Answer 1:

Just read each rune one by one in the loop... See example

EDIT: Adding code for posterity, in case link ever dies:

package main

import (
    "bufio"
    "fmt"
    "io"
    "log"
    "strings"
)

var text = `
The quick brown fox jumps over the lazy dog #1.
Быстрая коричневая лиса перепрыгнула через ленивую собаку.
`

func main() {
    r := bufio.NewReader(strings.NewReader(text))
    for {
        // ReadRune returns the next rune and its size in bytes.
        c, sz, err := r.ReadRune()
        if err == io.EOF {
            break // end of input
        }
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%q [%d]\n", string(c), sz)
    }
}


Answer 2:

This code reads runes from the input. No cast is necessary, and it is iterator-like:

package main

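import (
    "bufio"
    "fmt"
    "log"
    "strings"
)

func main() {
    in := `{"sample":"json string"}`

    s := bufio.NewScanner(strings.NewReader(in))
    s.Split(bufio.ScanRunes) // emit one UTF-8 encoded rune per token

    for s.Scan() {
        fmt.Println(s.Text())
    }
    // Scan returns false on both EOF and failure; Err tells them apart.
    if err := s.Err(); err != nil {
        log.Fatal(err)
    }
}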


Answer 3:

If it's just about memory usage: the upcoming release (really soon) adds a token-style enhancement to the json decoder. You can see it here:

https://tip.golang.org/pkg/encoding/json/#Decoder.Token



Tags: json parsing go io