I currently have a program that reads a (very large) file in single-threaded mode and builds a search index, but indexing takes too long in a single-threaded environment.
I am now trying to make it work in multithreaded mode, but I am not sure of the best way to achieve that.
My main program creates a BufferedReader and passes the instance to the threads, and each thread uses that BufferedReader instance to read the file.
I don't think this works as expected; instead, each thread seems to read the same lines again and again.
Is there a way to make each thread read only the lines that have not already been read by another thread? Do I need to split the file, or is there a way to implement this without splitting it?
Sample Main program:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.ArrayList;

public class TestMTFile {
    public static void main(String args[]) {
        BufferedReader reader = null;
        ArrayList<Thread> threads = new ArrayList<Thread>();
        try {
            reader = new BufferedReader(new FileReader("test.tsv"));
        } catch (FileNotFoundException e1) {
            e1.printStackTrace();
        }
        for (int i = 0; i <= 10; i++) {
            Runnable task = new ReadFileMT(reader);
            Thread worker = new Thread(task);
            // We can set the name of the thread
            worker.setName(String.valueOf(i));
            // Start the thread, never call method run() directly
            worker.start();
            // Remember the thread for later usage
            threads.add(worker);
        }
        int running = 0;
        int runner1 = 0;
        int runner2 = 0;
        do {
            running = 0;
            for (Thread thread : threads) {
                if (thread.isAlive()) {
                    runner1 = running++;
                }
            }
            if (runner2 != runner1) {
                runner2 = runner1;
                System.out.println("We have " + runner2 + " running threads. ");
            }
        } while (running > 0);
        if (running == 0) {
            System.out.println("Ended");
        }
    }
}
Thread:
import java.io.BufferedReader;
import java.io.IOException;

public class ReadFileMT implements Runnable {
    BufferedReader bReader = null;

    ReadFileMT(BufferedReader reader) {
        this.bReader = reader;
    }

    public synchronized void run() {
        String line;
        try {
            while ((line = bReader.readLine()) != null) {
                try {
                    System.out.println(line);
                } catch (Exception e) {
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Your bottleneck is most likely the indexing, not the file reading. Assuming your indexing system supports multiple threads, you probably want a producer/consumer setup: one thread reads the file and pushes each line into a BlockingQueue (the producer), while multiple threads pull lines from the BlockingQueue and push them into the index (the consumers).
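A minimal sketch of that producer/consumer setup. The class name `ProducerConsumerIndexer`, the method `indexFile`, the queue capacity, and the counter standing in for real indexing work are all my own assumptions, not anything from the question:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class ProducerConsumerIndexer {

    // Sentinel object that tells a consumer no more lines are coming.
    private static final String POISON_PILL = new String("__EOF__");

    /** Reads the file on the calling thread, hands lines to consumerCount workers. */
    public static int indexFile(String path, int consumerCount)
            throws IOException, InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        AtomicInteger indexed = new AtomicInteger();

        List<Thread> consumers = new ArrayList<>();
        for (int i = 0; i < consumerCount; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        String line = queue.take();
                        if (line == POISON_PILL) {   // identity check on the sentinel
                            break;
                        }
                        // Replace this counter with the real indexing work.
                        indexed.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            consumers.add(t);
        }

        // Single producer: the only thread that ever touches the file.
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                queue.put(line);                     // blocks if consumers fall behind
            }
        }
        for (int i = 0; i < consumerCount; i++) {    // one pill per consumer
            queue.put(POISON_PILL);
        }
        for (Thread t : consumers) {
            t.join();
        }
        return indexed.get();
    }
}
```

The bounded queue doubles as back-pressure: if indexing is slower than reading, the producer blocks instead of filling memory with unprocessed lines.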
See this thread - if your files are all on the same disk then you can't do better than reading them with a single thread, although it may be possible to process the files with multiple threads once you've read them into main memory.
First, I agree with @Zim-Zam that the file IO, not the indexing, is likely the rate-determining step. (So I disagree with @jtahlborn.) It depends on how complex the indexing is.
Second, in your code, each thread has its own, independent BufferedReader. Therefore they will all read the entire file. One possible fix is to use a single BufferedReader that they share. Then you need to synchronize the BufferedReader.readLine() method (I think), since the javadocs are silent on whether BufferedReader is thread-safe. And since I think the IO is the bottleneck, the synchronization will become the bottleneck, and I doubt multithreading will gain you much. But give it a try; I have been wrong occasionally. :-)
p.s. I agree with @jtahlborn that a producer/consumer pattern is better than my shared-BufferedReader idea, but that would be much more work for you.
If you can use Java 8, you may be able to do this quickly and easily using the Streams API. Read the file into a MappedByteBuffer, which can open a file of up to 2GB very quickly, then read the lines out of the buffer (you need to make sure your JVM has enough extra memory to hold the file).
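The code that originally followed this answer is not shown. As one hedged sketch of the Streams-API approach, here is a version using Files.lines with a parallel stream rather than a hand-rolled MappedByteBuffer; `StreamIndexer` and the toy key-counting "index" are my own inventions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamIndexer {

    /** Toy index: counts how often each first tab-separated field occurs. */
    public static Map<String, Long> index(String path) throws IOException {
        try (Stream<String> lines = Files.lines(Paths.get(path))) {
            return lines.parallel()   // the common fork/join pool processes chunks concurrently
                        .collect(Collectors.groupingByConcurrent(
                                line -> line.split("\t", 2)[0],
                                Collectors.counting()));
        }
    }
}
```

Files.lines streams the file lazily, so unlike the MappedByteBuffer approach described above, it does not require the whole file to fit in memory; whether the parallel stream helps still depends on how expensive the per-line work is relative to disk IO.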