I come from a Java/C#/JavaScript background and I am trying to implement a Dictionary
that would assign each passed string an id that never changes. The dictionary should be able to return a string by the specified id. This allows to store some data that has a lot of repetitive strings far more efficiently in the file system because only the ids of strings would be stored instead of entire strings.
I thought that a struct with a HashMap
and a Vec
would do but it turned out to be more complicated than that.
I started with the usage of &str
as a key for HashMap
and an item of Vec
like in the following sample. The value of HashMap
serves as an index into Vec
.
pub struct Dictionary<'a> {
values_map: HashMap<&'a str, u32>,
keys_map: Vec<&'a str>
}
impl<'a> Dictionary<'a> {
pub fn put_and_get_key(&mut self, value: &'a str) -> u32 {
match self.values_map.get_mut(value) {
None => {
let id_usize = self.keys_map.len();
let id = id_usize as u32;
self.keys_map.push(value);
self.values_map.insert(value, id);
id
},
Some(&mut id) => id
}
}
}
This works just fine until it turns out that the str
s need to be stored somewhere, preferably in this same struct
as well. I tried to store a Box<str>
in the Vec
and &'a str
in the HashMap
.
pub struct Dictionary<'a> {
values_map: HashMap<&'a str, u32>,
keys_map: Vec<Box<str>>
}
The borrow checker did not allow this of course because it would have allowed a dangling pointer in the HashMap
when an item is removed from the Vec
(or in fact sometimes when another item is added to the Vec
but this is an off-topic here).
I understood that I either need to write unsafe
code or use some form of shared ownership, the simplest kind of which seems to be an Rc
. The usage of Rc<Box<str>>
looks like introducing double indirection but there seems to be no simple way to construct an Rc<str>
at the moment.
pub struct Dictionary {
values_map: HashMap<Rc<Box<str>>, u32>,
keys_map: Vec<Rc<Box<str>>>
}
impl Dictionary {
pub fn put_and_get_key(&mut self, value: &str) -> u32 {
match self.values_map.get_mut(value) {
None => {
let id_usize = self.keys_map.len();
let id = id_usize as u32;
let value_to_store = Rc::new(value.to_owned().into_boxed_str());
self.keys_map.push(value_to_store);
self.values_map.insert(value_to_store, id);
id
},
Some(&mut id) => id
}
}
}
Everything seems fine with regard to ownership semantics, but the code above does not compile because the HashMap
now expects an Rc
, not an &str
:
error[E0277]: the trait bound `std::rc::Rc<Box<str>>: std::borrow::Borrow<str>` is not satisfied
--> src/file_structure/sample_dictionary.rs:14:31
|
14 | match self.values_map.get_mut(value) {
| ^^^^^^^ the trait `std::borrow::Borrow<str>` is not implemented for `std::rc::Rc<Box<str>>`
|
= help: the following implementations were found:
= help: <std::rc::Rc<T> as std::borrow::Borrow<T>>
Questions:
- Is there a way to construct an
Rc<str>
? - Which other structures, methods or approaches could help to resolve this problem. Essentially, I need a way to efficiently store two maps
string-by-id
andid-by-string
and be able to retrieve an id by&str
, i.e. without any excessive allocations.
Annoyingly, not that I know of.
Rc::new
requires aSized
argument, and I am not sure whether it is an actual limitation, or just something which was forgotten.If you look at the signature of
get
you'll notice:As a result, you could search by
&str
ifK
implementsBorrow<str>
.String
implementsBorrow<str>
, so the simplest solution is to simply useString
as a key. Sure it means you'll actually have twoString
instead of one... but it's simple. Certainly, aString
is simpler to use than aBox<str>
(although it uses 8 more bytes).If you want to shave off this cost, you can use a custom structure instead:
And then implement
Borrow<str>
for it. You'll then have 2 allocations per key (1 forRc
and 1 forString
). Depending on the size of yourString
, it might consume less or more memory.If you wish to got further (why not?), here are some ideas:
Dictionary
,