I'm attempting to construct a suffix trie, and due to strict requirements it must be indexed in memory.
EDIT: The problem is not the tree itself, but actually the way I was reading the file.
I'm attempting to construct a suffix trie, and due to strict requirements it must be indexed in memory.
EDIT: The problem is not the tree itself, but actually the way I was reading the file.
If you're passing the entire text file as a single string
you could easily run into an out of memory exception with your first loop!
// imagine if s.Length was 100k or so
for (int i = 0; i < s.Length; i++)
{
AddString(s.Substring(i, s.Length-i));
}
When reading the file to construct the trie, you'll need to split each line and probably normalize the characters:
string line;
while (null != (line = reader.ReadLine()))
{
string[] parts = line.Split(' ', ',', '.', '!', '\t', '?'); // naive
foreach (string part in parts)
{
if (part.Length > 0)
{
// make each string uppercase so as to avoid Hello and hello being
// two trie entries
trie.AddSuffix(part.ToUpperInvariant());
}
}
}
For example (on the output from dir /b c:\windows
):
A
D
D
I
N
S
E
D
P
P
C
O
M
P
A
T
P
A
T
C
H
...
To appropriately handle larger files, a more compact trie structure would be desirable. I would just have unshared suffixes stored in a separate dictionary:
// If you add a character, but there is no entry in m_children
// just park the tail end of it here
Dictionary<char, string> m_tails;
You would then move the per character logic to your AddString
of the SuffixNode
:
public void AddString(string s)
{
if (s.Length == 0) return;
char c = s[0];
if (m_children.ContainsKey(c))
{
if (s.Length > 1) m_children[c].AddString(s.Substring(1));
}
else if (m_tails.ContainsKey(c))
{
SuffixNode node = new SuffixNode();
node.AddString(m_tails[c]);
if (s.Length > 1) node.AddString(s.Substring(1));
m_children.Add(c, node);
m_tails.Remove(c);
}
else
{
m_tails.Add(c, s.Length > 1 ? s.Substring(1) : "");
}
}
Now you have a much more compact version of the trie, which will greatly decrease the number of child SuffixNode
s created for any given corpus. Returning to the dir /b c:\windows
example, we can see a practical reduction in nodes:
A
P
P
COMPAT
PATCH
I
T
I
O
N
S
...
At this point your trie has a more efficient representation. You're left with determining how to deal with terminal node representations in order to ensure lookups are accurate.