I'm attempting to construct a suffix trie, and due to strict requirements it must be indexed in memory.
EDIT: The problem is not the tree itself, but actually the way I was reading the file.
I'm attempting to construct a suffix trie, and due to strict requirements it must be indexed in memory.
EDIT: The problem is not the tree itself, but actually the way I was reading the file.
If you're passing the entire text file as a single
string
you could easily run into an out of memory exception with your first loop!When reading the file to construct the trie, you'll need to split each line and probably normalize the characters:
For example (on the output from
dir /b c:\windows
):To appropriately handle larger files, a more compact trie structure would be desirable. I would just have unshared suffixes stored in a separate dictionary:
You would then move the per character logic to your
AddString
of theSuffixNode
:Now you have a much more compact version of the trie, which will greatly decrease the number of child
SuffixNode
s created for any given corpus. Returning to thedir /b c:\windows
example, we can see a practical reduction in nodes:At this point your trie has a more efficient representation. You're left with determining how to deal with terminal node representations in order to ensure lookups are accurate.