I have the following code. It reads many files from a directory into a hash map; this is my feature vector. It's somewhat naive in the sense that it does no stemming, but that's not my primary concern right now. I want to know how I can use this data structure as the input to the perceptron algorithm. I guess we call this a bag of words, right?
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class BagOfWords
{
    static Map<String, Integer> bag_of_words = new HashMap<>();

    public static void main(String[] args) throws IOException
    {
        String path = "/home/flavius/atheism";
        File file = new File(path);
        new BagOfWords().iterateDirectory(file);
        for (Map.Entry<String, Integer> entry : bag_of_words.entrySet())
        {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
    }

    private void iterateDirectory(File file) throws IOException
    {
        for (File f : file.listFiles())
        {
            if (f.isDirectory())
            {
                iterateDirectory(f); // recurse into the subdirectory itself, not its parent
            }
            else
            {
                try (BufferedReader br = new BufferedReader(new FileReader(f)))
                {
                    String line;
                    while ((line = br.readLine()) != null)
                    {
                        // split on spaces; each token is a "word"
                        for (String word : line.split(" "))
                        {
                            // increment the count, starting from 0 for unseen words
                            bag_of_words.merge(word, 1, Integer::sum);
                        }
                    }
                }
            }
        }
    }
}
You can see that the path points to a directory called 'atheism'; there's also one called 'sports'. I want to try to linearly separate these two classes of documents, and then assign unseen test documents to either category.
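My rough understanding so far is that each individual document (not the whole directory) needs its own fixed-length feature vector over a shared vocabulary built from the training set, so that a perceptron has something of constant dimension to work with. Something like the sketch below is what I have in mind; the names buildVocabulary and toFeatureVector are just placeholders of mine, not from any library:

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FeatureVectors
{
    // fix the column order once, based on every word seen in the training documents
    static List<String> buildVocabulary(List<Map<String, Integer>> trainingDocs)
    {
        Set<String> vocab = new LinkedHashSet<>();
        for (Map<String, Integer> doc : trainingDocs)
        {
            vocab.addAll(doc.keySet());
        }
        return new ArrayList<>(vocab);
    }

    // one document -> one fixed-length vector of word counts, aligned to the vocabulary
    static double[] toFeatureVector(Map<String, Integer> docCounts, List<String> vocabulary)
    {
        double[] x = new double[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++)
        {
            x[i] = docCounts.getOrDefault(vocabulary.get(i), 0);
        }
        return x;
    }
}

Words in a test document that are not in the training vocabulary would simply be ignored, as far as I can tell.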
How do I do that, and how should I conceptualize it? I'd appreciate a solid reference, a comprehensive explanation, or some kind of pseudocode.
I've not found many informative and lucid references on the web.
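For what it's worth, here is the rough shape of the perceptron training loop I have in mind, assuming one feature vector per document as above, with label +1 for atheism and -1 for sports. The class and method names are mine, just to make the question concrete; I'm not claiming this is the definitive way to do it:

public class Perceptron
{
    double[] w;      // one weight per vocabulary word
    double b = 0.0;  // bias term
    double learningRate = 1.0;

    Perceptron(int vocabularySize)
    {
        w = new double[vocabularySize];
    }

    // predict +1 or -1 from the sign of the dot product w·x + b
    int predict(double[] x)
    {
        double activation = b;
        for (int i = 0; i < w.length; i++)
        {
            activation += w[i] * x[i];
        }
        return activation >= 0 ? 1 : -1;
    }

    // one pass over the training set; repeat for several epochs or until no mistakes are made
    void trainEpoch(double[][] xs, int[] labels)
    {
        for (int n = 0; n < xs.length; n++)
        {
            int predicted = predict(xs[n]);
            if (predicted != labels[n])
            {
                // misclassified: nudge the weights toward the correct side
                for (int i = 0; i < w.length; i++)
                {
                    w[i] += learningRate * labels[n] * xs[n][i];
                }
                b += learningRate * labels[n];
            }
        }
    }
}

The idea would then be to convert unseen test documents with the same vocabulary and call predict() on them. Is that roughly the right picture, or am I missing something conceptually?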