In Rust it's possible to get UTF-8 from bytes by doing this:
if let Ok(s) = str::from_utf8(some_u8_slice) {
println!("example {}", s);
}
This either works or it doesn't, but Python has the ability to handle errors, e.g.:
s = some_bytes.decode(encoding='utf-8', errors='surrogateescape');
In this example the argument surrogateescape
converts invalid utf-8 sequences to escape-codes, so instead of ignoring or replacing text that can't be decoded, they are replaced with a byte literal expression, which is valid utf-8
. see: Python docs for details.
Does Rust have a way to get a UTF-8 string from bytes which escapes errors instead of failing entirely?
Yes, via String::from_utf8_lossy
:
fn main() {
let text = [104, 101, 0xFF, 108, 111];
let s = String::from_utf8_lossy(&text);
println!("{}", s); // he�lo
}
If you need more control over the process, you can use std::str::from_utf8
, as suggested by the other answer. However, there's no reason to double-validate the bytes as it suggests.
A quickly hacked-up example:
use std::str;
fn example(mut bytes: &[u8]) -> String {
let mut output = String::new();
loop {
match str::from_utf8(bytes) {
Ok(s) => {
// The entire rest of the string was valid UTF-8, we are done
output.push_str(s);
return output;
}
Err(e) => {
let (good, bad) = bytes.split_at(e.valid_up_to());
if !good.is_empty() {
let s = unsafe {
// This is safe because we have already validated this
// UTF-8 data via the call to `str::from_utf8`; there's
// no need to check it a second time
str::from_utf8_unchecked(good)
};
output.push_str(s);
}
if bad.is_empty() {
// No more data left
return output;
}
// Do whatever type of recovery you need to here
output.push_str("<badbyte>");
// Skip the bad byte and try again
bytes = &bad[1..];
}
}
}
}
fn main() {
let r = example(&[104, 101, 0xFF, 108, 111]);
println!("{}", r); // he<badbyte>lo
}
You could extend this to take values to replace bad bytes with, a closure to handle the bad bytes, etc. For example:
fn example(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
// ...
handler(&mut output, bad);
// ...
}
let r = example(&[104, 101, 0xFF, 108, 111], |output, bytes| {
use std::fmt::Write;
write!(output, "\\U{{{}}}", bytes[0]).unwrap()
});
println!("{}", r); // he\U{255}lo
See also:
- How do I convert a Vector of bytes (u8) to a string
- How to print a u8 slice as text if I don't care about the particular encoding?.