Creating a sliding window iterator of slices of ch

Posted 2020-04-06 06:47

Question:

I am looking for the best way to go from String to Windows<T> using the windows function provided for slices.

I understand how to use windows this way:

fn main() {
    let tst = ['a', 'b', 'c', 'd', 'e', 'f', 'g'];
    let mut windows = tst.windows(3);

    // prints ['a', 'b', 'c']
    println!("{:?}", windows.next().unwrap());
    // prints ['b', 'c', 'd']
    println!("{:?}", windows.next().unwrap());
    // etc...
}

But I am a bit lost when working this problem:

fn main() {
    let tst = String::from("abcdefg");
    let inter = ?; // somehow create a slice of chars from tst
    let mut windows = inter.windows(3);

    // prints ['a', 'b', 'c']
    println!("{:?}", windows.next().unwrap());
    // prints ['b', 'c', 'd']
    println!("{:?}", windows.next().unwrap());
    // etc...
}

Essentially, I am looking for a way to convert a string into a char slice that I can call the windows method on.

Answer 1:

This solution will work for your purpose:

fn main() {
    let tst = String::from("abcdefg");
    let inter = tst.chars().collect::<Vec<char>>();
    let mut windows = inter.windows(3);

    // prints ['a', 'b', 'c']
    println!("{:?}", windows.next().unwrap());
    // prints ['b', 'c', 'd']
    println!("{:?}", windows.next().unwrap());
    // prints ['c', 'd', 'e']
    println!("{:?}", windows.next().unwrap());
}

A String can iterate over its chars, but it does not store them as a slice of char, so you have to collect them into a Vec<char>. The Vec then dereferences to a slice, which is what provides the windows method.
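If each window is needed back as text rather than as a slice of char, the chars of a window can be collected into a String again. A minimal sketch, assuming that one allocation per window is acceptable; `char_window_strings` is a hypothetical helper name, not a standard library function:

```rust
// Hypothetical helper: collects the chars once, then turns each window back into a String.
// Panics if `size` is 0, because slice::windows does.
fn char_window_strings(s: &str, size: usize) -> Vec<String> {
    let chars: Vec<char> = s.chars().collect();
    chars
        .windows(size)
        // String implements FromIterator<&char>, so the window can be collected directly.
        .map(|w| w.iter().collect::<String>())
        .collect()
}
```

Calling it with "abcde" and a size of 3 should give "abc", "bcd" and "cde". An input shorter than the window gives an empty Vec.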



Answer 2:

You can use itertools to walk over windows of any iterator. In the version used here (0.7.8), tuple_windows supports a width of up to 4:

extern crate itertools; // 0.7.8

use itertools::Itertools;

fn main() {
    let input = "日本語";

    for (a, b) in input.chars().tuple_windows() {
        println!("{}, {}", a, b);
    }
}

See also:

  • Are there equivalents to slice::chunks/windows for iterators to loop over pairs, triplets etc?


Answer 3:

The problem that you are facing is that String is really represented as something like a Vec<u8> under the hood, with some APIs to let you access chars. In UTF-8 the representation of a code point can be anything from 1 to 4 bytes, and they are all compacted together for space-efficiency.

The only slice you could get directly of an entire String, without copying everything, would be a &[u8], but you wouldn't know if the bytes corresponded to whole or just parts of code points.
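To see why byte windows are a problem for non-ASCII text, the sketch below tries to decode every byte window as UTF-8. `byte_windows_as_str` is a hypothetical helper written only for this illustration:

```rust
// Hypothetical helper: tries to read each byte window as UTF-8.
// None marks a window that does not line up with code point boundaries.
// Panics if `size` is 0, because slice::windows does.
fn byte_windows_as_str(s: &str, size: usize) -> Vec<Option<&str>> {
    s.as_bytes()
        .windows(size)
        .map(|w| std::str::from_utf8(w).ok())
        .collect()
}
```

For "日本語", where each character takes 3 bytes, a window size of 3 decodes only at byte offsets 0, 3 and 6. The four windows in between cut through a character and come back as None.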

The char type corresponds to a single Unicode scalar value (any code point other than a surrogate) and always has a size of 4 bytes, so that it can hold any possible value. So, if you build a slice of char by copying from a String, the result could be up to 4 times larger.
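The size difference can be checked directly. In this small sketch, `utf8_vs_char_bytes` is a hypothetical helper that compares the byte length of the UTF-8 text with the space the same text needs as a sequence of char values:

```rust
use std::mem::size_of;

// Hypothetical helper: returns (bytes used by the UTF-8 text,
// bytes used by the same text stored as one char per character).
fn utf8_vs_char_bytes(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count() * size_of::<char>())
}
```

For "abcdefg" this gives (7, 28), the full factor of 4. For "日本語" it gives (9, 12), because those characters already take 3 bytes each in UTF-8.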

To avoid making a potentially large, temporary memory allocation, you should consider a lazier approach: iterate through the String, making slices at exactly the char boundaries. Something like this:

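// Yields every window of `win_size` consecutive chars as a &str borrowed from `src`.
// `win_size` must be at least 1, otherwise `win_size - 1` underflows.
fn char_windows<'a>(src: &'a str, win_size: usize) -> impl Iterator<Item = &'a str> {
    src.char_indices().flat_map(move |(from, _)| {
        // Find the last char of the window that starts at byte offset `from`.
        src[from..]
            .char_indices()
            .nth(win_size - 1)
            // `to` is relative to `from`; the end is extended past that char's bytes.
            .map(|(to, c)| &src[from..from + to + c.len_utf8()])
    })
}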

This will give you an iterator where the items are &str, each with 3 chars:

let windows = char_windows(&tst, 3);

for win in windows {
    println!("{:?}", win);
}

The nice thing about this approach is that it does no copying at all: each &str produced by the iterator is still a slice into the original source String.
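The function above rescans up to win_size chars for every window it produces. If that matters for long inputs or wide windows, one possible variant pairs each window's start offset with its end offset, so the string is walked only twice in total. This is a sketch rather than a drop-in replacement; the name `char_windows_linear` is hypothetical, and unlike the version above it asserts that the window size is at least 1:

```rust
// Hypothetical alternative: pairs each window's start offset with its end offset.
fn char_windows_linear<'a>(src: &'a str, win_size: usize) -> impl Iterator<Item = &'a str> {
    assert!(win_size > 0, "win_size must be at least 1");
    // Byte offset at which each char starts.
    let starts = src.char_indices().map(|(i, _)| i);
    // The same offsets plus the end of the string, shifted ahead by `win_size` chars.
    let ends = src
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(src.len()))
        .skip(win_size);
    // zip stops as soon as `ends` runs out, so only full windows are produced.
    starts.zip(ends).map(move |(from, to)| &src[from..to])
}
```

As before, every item is a slice into the original string, so nothing is copied.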


All of that complexity exists because Rust strings are always UTF-8 encoded. If you know for certain that your input string contains only ASCII characters, you can treat it as a sequence of bytes, and taking slices becomes easy:

let tst = String::from("abcdefg");
let inter = tst.as_bytes();
let windows = inter.windows(3);

However, you now have slices of bytes, and you'll need to turn them back into strings to do anything with them:

for win in windows {
    println!("{:?}", String::from_utf8_lossy(win));
}
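If the ASCII assumption cannot be guaranteed up front, it can be checked at runtime before the bytes are windowed. A minimal sketch, where `ascii_windows` is a hypothetical helper that returns None for non-ASCII input instead of producing broken windows:

```rust
// Hypothetical helper: returns None when the input is not pure ASCII.
// Panics if `size` is 0, because slice::windows does.
fn ascii_windows(s: &str, size: usize) -> Option<Vec<&str>> {
    if !s.is_ascii() {
        return None;
    }
    Some(
        s.as_bytes()
            .windows(size)
            // Every sub-slice of ASCII bytes is valid UTF-8, so this cannot fail.
            .map(|w| std::str::from_utf8(w).expect("ASCII is valid UTF-8"))
            .collect(),
    )
}
```

Because the check has already passed, each window can be borrowed as a &str directly from the input, without the lossy conversion.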