In Rust it's possible to get UTF-8 from bytes by doing this:
if let Ok(s) = str::from_utf8(some_u8_slice) {
println!("example {}", s);
}
This either works or it doesn't, but Python has the ability to handle errors, e.g.:
s = some_bytes.decode(encoding='utf-8', errors='surrogateescape');
In this example the argument surrogateescape
converts invalid utf-8 sequences to escape-codes, so instead of ignoring or replacing text that can't be decoded, they are replaced with a byte literal expression, which is valid utf-8
. see: Python docs for details.
Does Rust have a way to get a UTF-8 string from bytes which escapes errors instead of failing entirely?
You can either:
Construct it yourself by using the strict UTF-8 decoding which returns an error indicating the position where the decoding failed, which you can then escape. But that's inefficient since you will decode each failed attempt twice.
Try 3rd party crates which provide more customizable charset decoders.
Yes, via
String::from_utf8_lossy
:If you need more control over the process, you can use
std::str::from_utf8
, as suggested by the other answer. However, there's no reason to double-validate the bytes as it suggests.A quickly hacked-up example:
You could extend this to take values to replace bad bytes with, a closure to handle the bad bytes, etc. For example:
See also: