Searching for matches in 3 million text files [closed]

Posted 2019-07-14 17:03

I have a simple requirement where a user enters a set of words and the system scans over 3 million text files and returns the files which contain those keywords. What would be the most efficient and simple way to implement this without a complex searching / indexing algorithm?

I thought of using the Scanner class for this but have no idea about its performance over such a large number of files. Performance isn't a very high priority, but it should be within an acceptable standard.

Tags: java file-io
5 answers
beautiful°
#2 · 2019-07-14 17:37

it should be within an acceptable standard

We don't know what an acceptable standard is. If we talk about interactive users, there probably won't be a simple solution that scans 3 million files and returns something within, let's say, < 5 seconds.

A reasonable solution would be a search index, potentially based on Lucene.

The major problem with a scanner/grep/find-based solution is that it is slow, won't scale, and that the expensive scanning work has to be done over and over again (unless you store intermediate results... but that would not be simple and would basically be a labor-intensive re-implementation of an indexer). When working with an index, only the creation and updates of the index are expensive; queries are cheap.
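For illustration, a minimal indexing sketch with Lucene could look roughly like the following. This assumes lucene-core and the analyzers module are on the classpath; the "index" and "data" paths and the "path"/"contents" field names are placeholders, not anything from the question.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class Indexer {
    public static void main(String[] args) throws IOException {
        Path indexDir = Paths.get("index"); // where the Lucene index lives (placeholder)
        Path dataDir  = Paths.get("data");  // root of the text files (placeholder)

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), config);
             Stream<Path> files = Files.walk(dataDir)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                try {
                    Document doc = new Document();
                    // store the path so a search result can point back to the file
                    doc.add(new StringField("path", file.toString(), Field.Store.YES));
                    // analyze and index the body; don't store it, to keep the index small
                    String body = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                    doc.add(new TextField("contents", body, Field.Store.NO));
                    writer.addDocument(doc);
                } catch (IOException e) {
                    System.err.println("Skipping " + file + ": " + e.getMessage());
                }
            });
        }
    }
}

Building the index over 3 million files is a one-off (or incremental) cost; every user query afterwards only touches the index.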

一夜七次
#3 · 2019-07-14 17:46

When parsing each text file, I would use a BufferedReader and check each line of text for a match.

try (BufferedReader br = new BufferedReader(new FileReader(file))) {
   String line;
   while ((line = br.readLine()) != null) {
      // Does this line contain the text?
      if (line.contains(text)) {
         System.out.println("Text found");
      }
   }
} // the reader is closed automatically, even if an exception is thrown

I'm not sure if this would be very fast for such a huge number of files.
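To put the snippet above in context, a rough sketch of applying the same line-by-line check to every file under a directory could look like this; the "data" root and the "keyword" search term are placeholders for whatever the user supplies.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class NaiveSearch {
    public static void main(String[] args) throws IOException {
        Path root   = Paths.get("data");  // root directory of the text files (placeholder)
        String text = "keyword";          // the word the user entered (placeholder)

        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                try (BufferedReader br = Files.newBufferedReader(file)) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        if (line.contains(text)) {
                            System.out.println("Text found in " + file);
                            break; // one hit per file is enough
                        }
                    }
                } catch (IOException e) {
                    System.err.println("Skipping " + file + ": " + e.getMessage());
                }
            });
        }
    }
}

Even with the early break, every query still reads every file, so the I/O cost is paid again on each search.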

老娘就宠你
#4 · 2019-07-14 17:47

What would be the most efficient and simple way to implement this without a complex searching / indexing algorithm?

A complex searching/indexing algorithm. There's no need to reinvent the wheel here. Since the user can enter any words, you can't do a simple preprocessing step, but rather have to index all the words in the text. This is what something like Lucene does for you.

There is no fast way to search through text other than preprocessing it and building an index. You can roll your own solution for this or you can just use Lucene.

Naïve text search with no preprocessing will be far, far too slow to be of any use.
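Once an index exists, the query side is small. A hedged sketch, assuming an index was built with a stored "path" field and an analyzed "contents" field (as in the indexing sketch in the earlier answer) and the lucene-queryparser artifact is on the classpath; the query string is a placeholder.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class Searcher {
    public static void main(String[] args) throws Exception {
        // "index", "contents" and "path" must match whatever the indexing step used (placeholders)
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("contents", new StandardAnalyzer())
                    .parse("the words the user entered"); // placeholder query string
            TopDocs hits = searcher.search(query, 100);   // return at most 100 matching files
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("path"));
            }
        }
    }
}

A query like this returns in milliseconds regardless of how many files were indexed, which is exactly the trade-off an index buys you.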

冷血范
#5 · 2019-07-14 17:55

What would be the most efficient and simple way to implement this without a complex searching / indexing algorithm?

If you don't use any kind of indexing algorithm, then each time you submit a search you will need to read every file. The overhead in doing so doesn't lie in the 'matching' algorithm but in the I/O latency. So I wouldn't care too much about what to use for matching; Scanner is a straightforward choice.

If you want to increase performance, you will need to use some sort of pre-processing. You could load the files in memory, size permitting. You could create a set of words per file (an index). There are plenty of algorithms out there to look into, especially the 'word count' examples in Map/Reduce contexts. You might also want to have a look at Java's Fork/Join framework if you want to achieve higher concurrency.
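As an illustration of the "set of words per file" idea, here is a hedged sketch of a simple in-memory inverted index (word -> files), built here with a parallel stream rather than Fork/Join; the "data" directory and the query words are placeholders, and whether the whole index fits in memory for 3 million files depends entirely on the corpus.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

public class InvertedIndex {
    // word -> set of files that contain the word
    private final Map<String, Set<Path>> index = new ConcurrentHashMap<>();

    public void build(Path root) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile)
                 .parallel() // spread the I/O and tokenizing over several threads
                 .forEach(file -> {
                     try (Stream<String> lines = Files.lines(file)) {
                         lines.flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                              .filter(word -> !word.isEmpty())
                              .forEach(word -> index
                                  .computeIfAbsent(word, w -> ConcurrentHashMap.newKeySet())
                                  .add(file));
                     } catch (IOException | UncheckedIOException e) {
                         System.err.println("Skipping " + file + ": " + e.getMessage());
                     }
                 });
        }
    }

    // files that contain every one of the user's words
    public Set<Path> search(Collection<String> words) {
        Set<Path> result = null;
        for (String word : words) {
            Set<Path> hits = index.getOrDefault(word.toLowerCase(), Collections.emptySet());
            if (result == null) {
                result = new HashSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) throws IOException {
        InvertedIndex idx = new InvertedIndex();
        idx.build(Paths.get("data"));                              // placeholder root directory
        System.out.println(idx.search(List.of("alpha", "beta")));  // placeholder query words
    }
}

The expensive pass over all files happens once in build(); after that, each search() is just a few hash lookups and set intersections.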

别忘想泡老子
#6 · 2019-07-14 17:56

Why don't you wrap a system call to grep? You can achieve that through the Runtime class.
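A hedged sketch of such a wrapper, using ProcessBuilder rather than Runtime.exec for easier argument handling; the "keyword" and "data" arguments are placeholders, and this obviously assumes grep is available on the host.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class GrepWrapper {
    public static void main(String[] args) throws IOException, InterruptedException {
        // -r: recurse into the directory, -l: print only matching file names
        // "keyword" and "data" are placeholders for the user's word and the file root
        ProcessBuilder pb = new ProcessBuilder("grep", "-r", "-l", "keyword", "data");
        pb.redirectErrorStream(true); // merge stderr into stdout so nothing is lost

        Process process = pb.start();
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                System.out.println(line); // each line is a matching file path
            }
        }
        int exitCode = process.waitFor(); // grep exits with 1 when nothing matched
        System.out.println("grep exited with " + exitCode);
    }
}

This ties the solution to a Unix-like host and still rescans every file per query, but it keeps the Java side very small.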
