How can I case fold a string in Rust?

Posted 2019-06-20 18:08

I'm writing a simple full-text search library and need case folding to check whether two words are equal. For this use case, the existing .to_lowercase() and .to_uppercase() methods are not enough.
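To illustrate why lowercasing alone falls short, here is a minimal sketch using only the standard library. The helper name lowercase_eq is hypothetical and exists only for this illustration; the German sharp s is one well-known case where comparing lowercased strings misses a match that case folding would find.

```rust
/// Hypothetical helper: compares two words by lowercasing both sides.
/// This is the naive approach, not real case folding.
fn lowercase_eq(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

/// Returns the lowercased forms of two words so the mismatch can be inspected.
fn lowercased_pair(a: &str, b: &str) -> (String, String) {
    (a.to_lowercase(), b.to_lowercase())
}
```

Calling lowercase_eq("Hello", "hELLO") returns true, but lowercase_eq("straße", "STRASSE") returns false, because "ß" stays "ß" when lowercased while "SS" becomes "ss".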

From a quick search of crates.io, I can find libraries for normalization and word splitting but not case folding. regex-syntax does have case folding code, but it's not exposed in its API.

If there aren't any existing solutions, then I might have to roll my own.
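For reference, a rough hand-rolled approximation might look like the sketch below. It is an assumption for illustration, not the Unicode case folding algorithm: it uppercases and then lowercases each character with the standard library, so it does not use the mappings in Unicode's CaseFolding.txt and can differ from them for some characters. The function name approx_fold is hypothetical.

```rust
/// Hypothetical approximation of case folding: uppercase each character,
/// then lowercase the result. This is NOT the official Unicode case folding
/// and may disagree with it for some characters.
fn approx_fold(s: &str) -> String {
    s.chars()
        .flat_map(char::to_uppercase) // e.g. 'ß' expands to "SS"
        .flat_map(char::to_lowercase) // e.g. "SS" becomes "ss"
        .collect()
}
```

With this sketch, approx_fold("Straße") and approx_fold("STRASSE") both give "strasse", so the two words compare equal.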

Tags: unicode rust
2 answers
疯言疯语
#2 · 2019-06-20 18:26

The unicase crate doesn't expose case folding directly, but it provides a generic wrapper type that implements Eq, Ord and Hash in a case-insensitive manner. The master branch (unreleased) supports both ASCII case folding (as an optimization) and Unicode case folding (though only invariant case folding is supported).
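The wrapper-type idea can be sketched with the standard library alone. The code below is not unicase's API; AsciiCaseless and count_distinct are hypothetical names, and the sketch handles ASCII only. It shows how implementing Eq and Hash consistently lets a case-insensitive key work in a HashSet.

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical wrapper (not unicase's type): compares and hashes its
/// contents while ignoring ASCII case only.
#[derive(Debug, Clone)]
struct AsciiCaseless(String);

impl PartialEq for AsciiCaseless {
    fn eq(&self, other: &Self) -> bool {
        self.0.eq_ignore_ascii_case(&other.0)
    }
}

impl Eq for AsciiCaseless {}

impl Hash for AsciiCaseless {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash the ASCII-lowercased bytes so values that compare equal
        // also produce the same hash.
        for b in self.0.bytes() {
            state.write_u8(b.to_ascii_lowercase());
        }
        // Terminator byte (never valid in UTF-8) to separate adjacent fields.
        state.write_u8(0xff);
    }
}

/// Counts how many words are distinct when ASCII case is ignored.
fn count_distinct(words: &[&str]) -> usize {
    let set: HashSet<AsciiCaseless> = words
        .iter()
        .map(|w| AsciiCaseless((*w).to_string()))
        .collect();
    set.len()
}
```

For example, count_distinct(&["Rust", "RUST", "rust", "Go"]) yields 2. A real implementation would also need Unicode-aware folding, which is what the crate is for.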

做个烂人
#3 · 2019-06-20 18:44

For my use case, I've found the caseless crate to be most useful.

As far as I know, this is the only library which supports normalization. This is important when you want e.g. "㎒" (U+3392 SQUARE MHZ) and "mhz" to match. See Chapter 3 - Default Caseless Matching in the Unicode Standard for details on how this works.

Here's some example code that matches a string case-insensitively:

extern crate caseless;

fn main() {
    let a = "100 ㎒";
    let b = "100 mhz";

    // These strings don't match with just case folding,
    // but do match after compatibility (NFKD) normalization
    assert!(!caseless::default_caseless_match_str(a, b));
    assert!(caseless::compatibility_caseless_match_str(a, b));
}

To get the case folded string directly, you can use the default_case_fold_str function:

let s = "Twilight Sparkle ちゃん";
assert_eq!(caseless::default_case_fold_str(s), "twilight sparkle ちゃん");

Caseless doesn't expose a corresponding function that normalizes as well, but you can write one using the unicode-normalization crate:

extern crate caseless;
extern crate unicode_normalization;
use caseless::Caseless;
use unicode_normalization::UnicodeNormalization;

// NFD, case fold, NFKD, case fold again, NFKD again
fn compatibility_case_fold(s: &str) -> String {
    s.nfd().default_case_fold().nfkd().default_case_fold().nfkd().collect()
}

fn main() {
    let a = "100 ㎒";
    assert_eq!(compatibility_case_fold(a), "100 mhz");
}

Note that multiple rounds of normalization and case folding are needed for a correct result.

(Thanks to BurntSushi5 for pointing me to this library.)
