-->

word count frequency in document

2019-05-06 10:24发布

问题:

I have a directory in which I have 1000 txt.files in it. I want to know for every word how many times it occurs in the 1000 document. So say even the word "cow" occured 100 times in X it will still be counted as one. If it occured in a different document it is incremented by one. So the maximum is 1000 if "cow" appears in every single document. How do I do this the easy way without the use of any other external library. Here's what I have so far

     private Hashtable<String, Integer> getAllWordCount()
     private Hashtable<String, Integer> getAllWordCount()
    {
        Hashtable<String, Integer> result = new Hashtable<String, Integer>();
        HashSet<String> words = new HashSet<String>();
        try {   
            for (int j = 0; j < fileDirectory.length; j++){
                File theDirectory = new File(fileDirectory[j]);
                File[] children = theDirectory.listFiles();

                for (int i = 0; i < children.length; i++){
                    Scanner scanner = new Scanner(new FileReader(children[i]));

                    while (scanner.hasNext()){
String text = scanner.next().replaceAll("[^A-Za-z0-9]", "");
                        if (words.contains(text) == false){
                            if (result.get(text) == null)
                                result.put(text, 1);
                            else
                                result.put(text, result.get(text) + 1);
                            words.add(text);
                        }
                    }
                }
                words.clear();
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        System.out.println(result.size());
        return result;
    }

回答1:

You also need a HashSet<String> in which you store each unique word you've read from the current file.

Then after every word read, you should check if it's in the set, if it isn't, increment the corresponding value in the result map (or add a new entry if it was empty, like you already do) and add the word to the set.

Don't forget to reset the set when you start to read a new file though.



回答2:

how about this?

private Hashtable<String, Integer> getAllWordCount()
{
    Hashtable<String, Integer> result = new Hashtable<String, Integer>();
    HashSet<String> words = new HashSet<String>();
    try {   
        for (int j = 0; j < fileDirectory.length; j++){
            File theDirectory = new File(fileDirectory[j]);
            File[] children = theDirectory.listFiles();
            for (int i = 0; i < children.length; i++){
                Scanner scanner = new Scanner(new FileReader(children[i]));
                while (scanner.hasNext()){
                    String text = scanner.next().replaceAll("[^A-Za-z0-9]", "");
                    words.add(text);
                }
                for (String word : words) {
                  Integer count = result.get(word)
                  if (result.get(word) == null) {
                    result.put(word, 1);
                  } else {
                    result.put(word, result.get(word) + 1);
                  }
                }
                words.clear();
            }
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    System.out.println(result.size());
    return result;
}