How can I create an efficient iterator of chars fr

2019-05-10 02:44发布

问题:

Now that the Read::chars iterator has been officially deprecated, what is the the proper way to obtain an iterator over the chars coming from a Reader like stdin without reading the entire stream into memory?

回答1:

The corresponding issue for deprecation nicely sums up the problems with Read::chars and offers suggestions:

Code that does not care about processing data incrementally can use Read::read_to_string instead. Code that does care presumably also wants to control its buffering strategy and work with &[u8] and &str slices that are as large as possible, rather than one char at a time. It should be based on the str::from_utf8 function as well as the valid_up_to and error_len methods of the Utf8Error type. One tricky aspect is dealing with cases where a single char is represented in UTF-8 by multiple bytes where those bytes happen to be split across separate read calls / buffer chunks. (Utf8Error::error_len returning None indicates that this may be the case.) The utf-8 crate solves this, but in order to be flexible provides an API that probably has too much surface to be included in the standard library.

Of course the above is for data that is always UTF-8. If other character encoding need to be supported, consider using the encoding_rs or encoding crate.

Your own iterator

The most efficient solution in terms of number of I/O calls is to read everything into a giant buffer String and iterate over that:

use std::io::{self, Read};

fn main() {
    let stdin = io::stdin();
    let mut s = String::new();
    stdin.lock().read_to_string(&mut s).expect("Couldn't read");
    for c in s.chars() {
        println!(">{}<", c);
    }
}

You can combine this with an answer from Is there an owned version of String::chars?:

use std::io::{self, Read};

fn reader_chars<R: Read>(mut rdr: R) -> io::Result<impl Iterator<Item = char>> {
    let mut s = String::new();
    rdr.read_to_string(&mut s)?;
    Ok(s.into_chars()) // from https://stackoverflow.com/q/47193584/155423
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();

    for c in reader_chars(stdin.lock())? {
        println!(">{}<", c);
    }

    Ok(())
}

We now have a function that returns an iterator of chars for any type that implements Read.

Once you have this pattern, it's just a matter of deciding where to make the tradeoff of memory allocation vs I/O requests. Here's a similar idea that uses line-sized buffers:

use std::io::{BufRead, BufReader, Read};

fn reader_chars<R: Read>(rdr: R) -> impl Iterator<Item = char> {
    // We use 6 bytes here to force emoji to be segmented for demo purposes
    // Pick more appropriate size for your case
    let reader = BufReader::with_capacity(6, rdr);

    reader
        .lines()
        .flat_map(|l| l) // Ignoring any errors
        .flat_map(|s| s.into_chars())  // from https://stackoverflow.com/q/47193584/155423
}

fn main() {
    // emoji are 4 bytes each
    let data = "              
                            
标签: rust stdin chars