Searching for matches in 3 million text files [closed]

Posted 2019-07-14 17:03

I have a simple requirement where a user enters a set of words and the system scans over 3 million text files and returns the files which contain those keywords. What would be the most efficient and simple way to implement this without a complex searching / indexing algorithm?

I thought of using the Scanner class for this but have no idea about its performance over such a large number of files. Performance isn't a very high priority, but it should be within an acceptable standard.

Tags: java file-io
5 answers
beautiful°
#2 · 2019-07-14 17:37

it should be within an acceptable standard

We don't know what an acceptable standard is. If we talk about interactive users, there probably won't be a simple solution that scans 3 million files and returns something within, let's say, < 5 seconds.

A reasonable solution would be a search index, potentially based on Lucene.

The major problem with a scanner/grep/find-based solution is that it is slow, won't scale, and that the expensive scanning work has to be done over and over again (unless you store intermediate results... but that would not be simple and would basically be a labor-intensive re-implementation of an indexer). When working with an index, only the creation and updates of the index are expensive; queries are cheap.
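For illustration, a minimal indexing sketch with Lucene could look roughly like the following. This assumes lucene-core and the analyzers module are on the classpath; the "index" and "data" paths and the "path"/"contents" field names are placeholders, not anything from the question.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class Indexer {
    public static void main(String[] args) throws IOException {
        Path indexDir = Paths.get("index"); // where the Lucene index lives (placeholder)
        Path dataDir  = Paths.get("data");  // root of the text files (placeholder)

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), config);
             Stream<Path> files = Files.walk(dataDir)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                try {
                    Document doc = new Document();
                    // store the path so a search result can point back to the file
                    doc.add(new StringField("path", file.toString(), Field.Store.YES));
                    // analyze and index the body; don't store it, to keep the index small
                    String body = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                    doc.add(new TextField("contents", body, Field.Store.NO));
                    writer.addDocument(doc);
                } catch (IOException e) {
                    System.err.println("Skipping " + file + ": " + e.getMessage());
                }
            });
        }
    }
}

Building the index over 3 million files is a one-off (or incremental) cost; every user query afterwards only touches the index.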

一夜七次
#3 · 2019-07-14 17:46

When parsing each text file, I would use a BufferedReader and check each line of text for a match.

try (BufferedReader br = new BufferedReader(new FileReader(file))) {
   String line;
   while ((line = br.readLine()) != null) {
      // Does this line contain the text?
      if (line.contains(text)) {
         System.out.println("Text found");
      }
   }
} // the reader is closed automatically, even if an exception is thrown

I'm not sure if this would be very fast for such a huge number of files.
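To put the snippet above in context, a rough sketch of applying the same line-by-line check to every file under a directory could look like this; the "data" root and the "keyword" search term are placeholders for whatever the user supplies.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class NaiveSearch {
    public static void main(String[] args) throws IOException {
        Path root   = Paths.get("data");  // root directory of the text files (placeholder)
        String text = "keyword";          // the word the user entered (placeholder)

        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                try (BufferedReader br = Files.newBufferedReader(file)) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        if (line.contains(text)) {
                            System.out.println("Text found in " + file);
                            break; // one hit per file is enough
                        }
                    }
                } catch (IOException e) {
                    System.err.println("Skipping " + file + ": " + e.getMessage());
                }
            });
        }
    }
}

Even with the early break, every query still reads every file, so the I/O cost is paid again on each search.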

老娘就宠你
#4 · 2019-07-14 17:47

What would be the most efficient and simple way to implement this without a complex searching / indexing algorithm?

A complex searching/indexing algorithm. There's no need to reinvent the wheel here. Since the user can enter any words, you can't do a simple preprocessing step, but rather have to index all the words in the text. This is what something like Lucene does for you.

There is no fast way to search through text other than preprocessing it and building an index. You can roll your own solution for this or you can just use Lucene.

Naïve text search with no preprocessing will be far, far too slow to be of any use.
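Once an index exists, the query side is small. A hedged sketch, assuming an index was built with a stored "path" field and an analyzed "contents" field (as in the indexing sketch in the earlier answer) and the lucene-queryparser artifact is on the classpath; the query string is a placeholder.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class Searcher {
    public static void main(String[] args) throws Exception {
        // "index", "contents" and "path" must match whatever the indexing step used (placeholders)
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("contents", new StandardAnalyzer())
                    .parse("the words the user entered"); // placeholder query string
            TopDocs hits = searcher.search(query, 100);   // return at most 100 matching files
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("path"));
            }
        }
    }
}

A query like this returns in milliseconds regardless of how many files were indexed, which is exactly the trade-off an index buys you.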

冷血范
#5 · 2019-07-14 17:55

What would be the most efficient and simple way to implement this without a complex searching / indexing algorithm?

If you don't use any kind of indexing algorithm, then each time you submit a search you will need to read every file. The overhead in doing so doesn't lie in the 'matching' algorithm but in the I/O latency. So I wouldn't care too much about what to use for matching; Scanner is a straightforward choice.

If you want to increase performance, you will need to use some sort of pre-processing. You could load the files in memory, size permitting. You could create a set of words per file (an index). There are plenty of algorithms out there to look into, especially the 'word count' examples in Map/Reduce contexts. You might also want to have a look at Java's Fork/Join framework if you want to achieve higher concurrency.
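As an illustration of the "set of words per file" idea, here is a hedged sketch of a simple in-memory inverted index (word -> files), built here with a parallel stream rather than Fork/Join; the "data" directory and the query words are placeholders, and whether the whole index fits in memory for 3 million files depends entirely on the corpus.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

public class InvertedIndex {
    // word -> set of files that contain the word
    private final Map<String, Set<Path>> index = new ConcurrentHashMap<>();

    public void build(Path root) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile)
                 .parallel() // spread the I/O and tokenizing over several threads
                 .forEach(file -> {
                     try (Stream<String> lines = Files.lines(file)) {
                         lines.flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                              .filter(word -> !word.isEmpty())
                              .forEach(word -> index
                                  .computeIfAbsent(word, w -> ConcurrentHashMap.newKeySet())
                                  .add(file));
                     } catch (IOException | UncheckedIOException e) {
                         System.err.println("Skipping " + file + ": " + e.getMessage());
                     }
                 });
        }
    }

    // files that contain every one of the user's words
    public Set<Path> search(Collection<String> words) {
        Set<Path> result = null;
        for (String word : words) {
            Set<Path> hits = index.getOrDefault(word.toLowerCase(), Collections.emptySet());
            if (result == null) {
                result = new HashSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) throws IOException {
        InvertedIndex idx = new InvertedIndex();
        idx.build(Paths.get("data"));                              // placeholder root directory
        System.out.println(idx.search(List.of("alpha", "beta")));  // placeholder query words
    }
}

The expensive pass over all files happens once in build(); after that, each search() is just a few hash lookups and set intersections.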

别忘想泡老子
#6 · 2019-07-14 17:56

Why don't you wrap a system call to grep? You can achieve that through the Runtime class.
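A hedged sketch of such a wrapper, using ProcessBuilder rather than Runtime.exec for easier argument handling; the "keyword" and "data" arguments are placeholders, and this obviously assumes grep is available on the host.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class GrepWrapper {
    public static void main(String[] args) throws IOException, InterruptedException {
        // -r: recurse into the directory, -l: print only matching file names
        // "keyword" and "data" are placeholders for the user's word and the file root
        ProcessBuilder pb = new ProcessBuilder("grep", "-r", "-l", "keyword", "data");
        pb.redirectErrorStream(true); // merge stderr into stdout so nothing is lost

        Process process = pb.start();
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                System.out.println(line); // each line is a matching file path
            }
        }
        int exitCode = process.waitFor(); // grep exits with 1 when nothing matched
        System.out.println("grep exited with " + exitCode);
    }
}

This ties the solution to a Unix-like host and still rescans every file per query, but it keeps the Java side very small.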
