I am working with a TreeMap of Strings, TreeMap<String, String>, and using it to implement a Dictionary of words.
I then have a collection of files, and would like to create a representation of each file in the vector space (space of words) defined by the dictionary.
Each file should have a vector representing it with the following properties:
- the vector should have the same size as the dictionary
- for each word contained in the file, the vector should have a 1 in the position corresponding to the word's position in the dictionary
- for each word not contained in the file, the vector should have a -1 in the position corresponding to the word's position in the dictionary
So my idea is to use a Vector<Boolean> to implement these vectors. (This way of representing documents in a collection is called the Boolean Model - http://www.site.uottawa.ca/~diana/csi4107/L3.pdf)
The problem I am facing in the procedure to create this vector is that I need a way to find the position of a word in the dictionary, something like this:
String key;
int i = get_position_of_key_in_Treemap(key); // <--- purely invented method...
1) Is there any method like this I can use on a TreeMap? If not, could you provide some code to help me implement it myself?
2) Is there an iterator on TreeMap (it's alphabetically ordered on keys) from which I can get the position?
3) Alternatively, should I use another class to implement the dictionary (if you think I can't do what I need with TreeMaps)? If yes, which?
Thanks in advance.
ADDED PART:
The solution proposed by dasblinkenlight looks fine but has a complexity problem (it is linear in the size of the dictionary because the keys are copied into an array), and doing that for each file is not acceptable.
Any other ideas for my questions?
There's no such implementation in the JDK itself. Although TreeMap iterates in natural key ordering, its internal data structures are all based on trees and not arrays (remember that Maps do not order keys, by definition, despite that being a very common use case).
That said, you have to make a choice, as it is not possible to have O(1) computation time for your comparison criteria both for insertion into the Map and for the indexOf(key) calculation. This is because lexicographical order is not stable in a mutable data structure (as opposed to insertion order, for instance). An example: once you insert the first key-value pair (entry) into the map, its position is one. However, depending on the second key inserted, that position might change, as the new key may be "greater" or "lower" than the one already in the Map. You can surely implement this by maintaining and updating an indexed list of keys during the insertion operation, but then you'll have O(n log(n)) for your insert operations (as you will need to re-order an array). That might be desirable or not, depending on your data access patterns.
ListOrderedMap and LinkedMap in Apache Commons both come close to what you need but rely on insertion order. You can check out their implementation and develop your own solution to the problem with little to moderate effort, I believe (that should be just a matter of replacing the ListOrderedMap's internal backing array with a sorted list - TreeList in Apache Commons, for instance).
You can also calculate the index yourself, by counting the number of elements that are lower than the given key (which should be faster than iterating through the list searching for your element, in the most frequent case - as you're not comparing anything).
Once you have constructed your tree map, copy its sorted keys into an array, and use Arrays.binarySearch to look up the index in O(log N) time. If you need the value, do a lookup on the original map too.
Edit: this is how you copy keys into an array:
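A minimal sketch of the copy-then-binary-search idea, assuming the dictionary is complete before any lookups are made. The class and method names (SortedKeyIndex, snapshotKeys, indexOf) are made up for illustration; only keySet(), toArray and Arrays.binarySearch are JDK calls.

```java
import java.util.Arrays;
import java.util.TreeMap;

class SortedKeyIndex {
    // Copy the keys once. A TreeMap's keySet() iterates in ascending key order,
    // so the resulting array is already sorted.
    static String[] snapshotKeys(TreeMap<String, String> dictionary) {
        return dictionary.keySet().toArray(new String[0]);
    }

    // Position of the word in the sorted key array, or -1 if it is not a key.
    static int indexOf(String[] sortedKeys, String word) {
        int pos = Arrays.binarySearch(sortedKeys, word);
        return pos >= 0 ? pos : -1;
    }

    public static void main(String[] args) {
        TreeMap<String, String> dictionary = new TreeMap<>();
        dictionary.put("pear", "a fruit");
        dictionary.put("apple", "another fruit");
        dictionary.put("mango", "yet another fruit");

        String[] keys = snapshotKeys(dictionary); // done once, reused for every file
        System.out.println(indexOf(keys, "mango")); // 1 (apple, mango, pear)
        System.out.println(indexOf(keys, "kiwi"));  // -1, not in the dictionary
    }
}
```

The linear cost of the copy is paid once per dictionary rather than once per file, as long as the dictionary does not change between lookups.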
I had the same problem. So I took the source code of java.util.TreeMap and wrote IndexedTreeMap. It implements my own IndexedNavigableMap:
The implementation is based on updating node weights in the red-black tree when it is changed. The weight is the number of child nodes beneath a given node, plus one for the node itself. For example, when a tree is rotated to the left:
updateWeight simply updates weights up to the root:
And when we need to find the element by index here is the implementation that uses weights:
Also comes in very handy finding the index of a key:
I will implement IndexedTreeSet soon; in the meantime you can use the key set from IndexedTreeMap.
Update: IndexedTreeSet is implemented now.
You can find the result of this work at https://github.com/geniot/indexed-tree-map
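To illustrate the weight idea in isolation, here is a rough sketch that is not the code from that repository. It assumes a plain, unbalanced binary search tree (no red-black rotations), and the names SizeAugmentedTree, weight, indexOf and get are hypothetical. Each node stores the size of its subtree, which is enough to find the index of a key or the key at an index in time proportional to the tree height.

```java
class SizeAugmentedTree {
    private static final class Node {
        final String key;
        Node left, right;
        int weight = 1; // nodes in this subtree, including this node

        Node(String key) { this.key = key; }
    }

    private Node root;

    private static int weight(Node n) { return n == null ? 0 : n.weight; }

    // Insert a key (duplicates are ignored) and refresh weights on the way back up.
    void add(String key) { root = add(root, key); }

    private static Node add(Node n, String key) {
        if (n == null) return new Node(key);
        int c = key.compareTo(n.key);
        if (c < 0) n.left = add(n.left, key);
        else if (c > 0) n.right = add(n.right, key);
        n.weight = 1 + weight(n.left) + weight(n.right);
        return n;
    }

    // Index of the key in sorted order, or -1 if absent.
    int indexOf(String key) {
        int index = 0;
        Node n = root;
        while (n != null) {
            int c = key.compareTo(n.key);
            if (c < 0) {
                n = n.left;
            } else if (c > 0) {
                index += weight(n.left) + 1; // skip the left subtree and this node
                n = n.right;
            } else {
                return index + weight(n.left);
            }
        }
        return -1;
    }

    // Key at the given index in sorted order.
    String get(int index) {
        int remaining = index;
        Node n = root;
        while (n != null) {
            int leftWeight = weight(n.left);
            if (remaining < leftWeight) {
                n = n.left;
            } else if (remaining > leftWeight) {
                remaining -= leftWeight + 1;
                n = n.right;
            } else {
                return n.key;
            }
        }
        throw new IndexOutOfBoundsException("index: " + index);
    }
}
```

Without rebalancing, this sketch degrades to linear time on sorted input; keeping the weights correct through rotations is what a balanced implementation has to add.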
An alternative solution would be to use TreeMap's headMap method. If the word exists in the TreeMap, then the size() of its head map is equal to the index of the word in the dictionary. It may be a bit wasteful compared to my other answer, though.
Here is how you code it in Java:
Here is the output produced by the program:
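A minimal sketch of the headMap approach, assuming a small sample dictionary of my own choosing. The class name HeadMapIndex and the helper indexOf are hypothetical; headMap(toKey) is the standard TreeMap method and returns the keys strictly less than toKey.

```java
import java.util.TreeMap;

class HeadMapIndex {
    // Index of the word among the sorted keys, or -1 if the word is not a key.
    static int indexOf(TreeMap<String, String> dictionary, String word) {
        if (!dictionary.containsKey(word)) {
            return -1;
        }
        // headMap(word) holds every key strictly less than word,
        // so its size is the zero-based position of word.
        return dictionary.headMap(word).size();
    }

    public static void main(String[] args) {
        TreeMap<String, String> dictionary = new TreeMap<>();
        dictionary.put("pear", "");
        dictionary.put("apple", "");
        dictionary.put("mango", "");

        System.out.println(indexOf(dictionary, "apple")); // 0
        System.out.println(indexOf(dictionary, "mango")); // 1
        System.out.println(indexOf(dictionary, "pear"));  // 2
        System.out.println(indexOf(dictionary, "kiwi"));  // -1
    }
}
```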
Have you thought to make the values in your TreeMap contain the position in your dictionary? I am using a BitSet here for my file details.
This doesn't work nearly as well as my other idea below.
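A rough sketch of storing positions as values, assuming the dictionary is fully built before positions are assigned and is not modified afterwards. The names PositionDictionary, build and describe are invented for illustration.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

class PositionDictionary {
    // Build the dictionary, then number the keys in sorted order.
    static TreeMap<String, Integer> build(Iterable<String> words) {
        TreeMap<String, Integer> dictionary = new TreeMap<>();
        for (String word : words) {
            dictionary.put(word, 0); // placeholder until all words are known
        }
        int position = 0;
        for (Map.Entry<String, Integer> entry : dictionary.entrySet()) {
            entry.setValue(position++); // entrySet() iterates in key order
        }
        return dictionary;
    }

    // One bit per dictionary word: set if the file contains the word.
    static BitSet describe(TreeMap<String, Integer> dictionary, Iterable<String> fileWords) {
        BitSet bits = new BitSet(dictionary.size());
        for (String word : fileWords) {
            Integer position = dictionary.get(word); // a single lookup per word
            if (position != null) {
                bits.set(position);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> dictionary = build(Arrays.asList("pear", "apple", "mango"));
        BitSet file = describe(dictionary, Arrays.asList("pear", "apple", "unknown"));
        System.out.println(file); // {0, 2}: apple and pear are present, mango is not
    }
}
```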
Here the building of the file details consists of a single lookup in the TreeMap for each word in the file.
If you were planning to use the value in the dictionary TreeMap for something else you could always compose it with an Integer.
Added
Thinking about it further, if the value field of the Map is earmarked for something you could always use special keys that calculate their own position in the Map and act just like Strings for comparison.
NB: Assumes that once getPosition() has been called, the dictionary is not changed.
I agree with Isolvieira. Perhaps the best approach would be to use a different structure than TreeMap.
However, if you still want to go with computing the index of the keys, a solution would be to count how many keys are lower than the key you are looking for.
Here is a code snippet:
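One way the counting idea could look, as a sketch only: the class name LowerKeyCount and the helper indexOf are hypothetical, and the loop is linear in the number of keys that precede the word.

```java
import java.util.TreeMap;

class LowerKeyCount {
    // Count the keys that sort before the word; that count is the word's index.
    static int indexOf(TreeMap<String, String> dictionary, String word) {
        if (!dictionary.containsKey(word)) {
            return -1; // the word is not in the dictionary
        }
        int index = 0;
        for (String key : dictionary.keySet()) { // ascending key order
            if (key.compareTo(word) >= 0) {
                break; // reached the word itself
            }
            index++;
        }
        return index;
    }

    public static void main(String[] args) {
        TreeMap<String, String> dictionary = new TreeMap<>();
        dictionary.put("pear", "");
        dictionary.put("apple", "");
        dictionary.put("mango", "");
        System.out.println(indexOf(dictionary, "pear")); // 2
    }
}
```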