How can I find a subsequence in a &[u8] slice?

2020-02-01 07:04发布

I have a &[u8] slice over a binary buffer. I need to parse it, but a lot of the methods that I would like to use (such as str::find) don't seem to be available on slices.

I've seen that I can covert both by buffer slice and my pattern to str by using from_utf8_unchecked() but that seems a little dangerous (and also really hacky).

How can I find a subsequence in this slice? I actually need the index of the pattern, not just a slice view of the parts, so I don't think split will work.

标签: rust
4条回答
唯我独甜
2楼-- · 2020-02-01 07:10

I found the memmem crate useful for this task:

use memmem::{Searcher, TwoWaySearcher};

let search = TwoWaySearcher::new("dog".as_bytes());
assert_eq!(
    search.search_in("The quick brown fox jumped over the lazy dog.".as_bytes()),
    Some(41)
);
查看更多
Evening l夕情丶
3楼-- · 2020-02-01 07:17

Here's a simple implementation based on the windows iterator.

fn find_subsequence(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|window| window == needle)
}

fn main() {
    assert_eq!(find_subsequence(b"qwertyuiop", b"tyu"), Some(4));
    assert_eq!(find_subsequence(b"qwertyuiop", b"asd"), None);
}

The find_subsequence function can also be made generic:

fn find_subsequence<T>(haystack: &[T], needle: &[T]) -> Option<usize>
    where for<'a> &'a [T]: PartialEq
{
    haystack.windows(needle.len()).position(|window| window == needle)
}
查看更多
来,给爷笑一个
4楼-- · 2020-02-01 07:20

I don't think the standard library contains a function for this. Some libcs have memmem, but at the moment the libc crate does not wrap this. You can use the twoway crate however. rust-bio implements some pattern matching algorithms, too. All of those should be faster than using haystack.windows(..).position(..)

查看更多
地球回转人心会变
5楼-- · 2020-02-01 07:26

How about Regex on bytes? That looks very powerful. See this rust playground demo.

use regex::bytes::Regex;

// This shows how to find all null-terminated strings in a slice of bytes
let re = Regex::new(r"(?-u)(?P<cstr>[^\x00]+)\x00").unwrap();
let text = b"foo\x00bar\x00baz\x00";

// Extract all of the strings without the null terminator from each match.
// The unwrap is OK here since a match requires the `cstr` capture to match.
let cstrs: Vec<&[u8]> =
    re.captures_iter(text)
      .map(|c| c.name("cstr").unwrap().as_bytes())
      .collect();
assert_eq!(vec![&b"foo"[..], &b"bar"[..], &b"baz"[..]], cstrs);
查看更多
登录 后发表回答