I am trying to read a very big file using Java. The file contains data like this, meaning each line holds one user ID.
149905320
1165665384
66969324
886633368
1145241312
286585320
1008665352
The file contains around 30 million user IDs. I am trying to read all the user IDs from the file exactly once, meaning each user ID should be processed only once. For example, if there are 30 million user IDs, the program should print each of the 30 million IDs exactly once, using multithreaded code.
Below is the code I have. It is multithreaded, running with 10 threads, but with this program I am not able to make sure that each user ID is selected only once.
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReadingFile {
    public static void main(String[] args) {
        // create thread pool with given size
        ExecutorService service = Executors.newFixedThreadPool(10);
        for (int i = 0; i < 10; i++) {
            service.submit(new FileTask());
        }
    }
}

class FileTask implements Runnable {
    @Override
    public void run() {
        BufferedReader br = null;
        try {
            br = new BufferedReader(new FileReader("D:/abc.txt"));
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
                // do things with line
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null) {
                    br.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
Can anybody help me with this? What am I doing wrong? And what is the fastest way to do this?
You really can't improve on having one thread reading the file sequentially, assuming that you haven't done anything like stripe the file across multiple disks. With one thread, you do one seek and then one long sequential read; with multiple threads you're going to have the threads causing multiple seeks as each gains control of the disk head.
Edit: This is a way to parallelize the line processing while still using serial I/O to read the lines. It uses a BlockingQueue to communicate between threads; the FileTask adds lines to the queue, and the CPUTask reads them and processes them. This is a thread-safe data structure, so no need to add any synchronization to it. You're using put(E e) to add strings to the queue, so if the queue is full (it can hold up to 200 strings, as defined in the declaration in ReadingFile) the FileTask blocks until space frees up; likewise you're using take() to remove items from the queue, so the CPUTask will block until an item is available.
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ReadingFile {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        final int threadCount = 10;
        // BlockingQueue with a capacity of 200
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(200);
        // create thread pool with given size
        ExecutorService service = Executors.newFixedThreadPool(threadCount);
        for (int i = 0; i < (threadCount - 1); i++) {
            service.submit(new CPUTask(queue));
        }
        // wait until the FileTask completes
        service.submit(new FileTask(queue)).get();
        service.shutdownNow(); // interrupt the CPUTasks
        // wait until the CPUTasks terminate
        service.awaitTermination(365, TimeUnit.DAYS);
    }
}
class FileTask implements Runnable {
    private final BlockingQueue<String> queue;

    public FileTask(BlockingQueue<String> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        BufferedReader br = null;
        try {
            br = new BufferedReader(new FileReader("D:/abc.txt"));
            String line;
            while ((line = br.readLine()) != null) {
                // block if the queue is full
                queue.put(line);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            // restore the interrupt status and stop reading
            Thread.currentThread().interrupt();
        } finally {
            try {
                if (br != null) {
                    br.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
class CPUTask implements Runnable {
    private final BlockingQueue<String> queue;

    public CPUTask(BlockingQueue<String> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        String line;
        while (true) {
            try {
                // block if the queue is empty
                line = queue.take();
                // do things with line
            } catch (InterruptedException ex) {
                break; // FileTask has completed
            }
        }
        // poll() returns null if the queue is empty
        while ((line = queue.poll()) != null) {
            // do things with line
        }
    }
}
We are talking about a file of roughly 315 MB with lines separated by newlines. I presume this easily fits into memory. It is implied that there is no particular order among the user IDs that has to be preserved. So I would recommend the following algorithm:
- Get the file length
- Copy each tenth of the file into a byte buffer (a binary copy should be fast)
- Start a thread for processing each of these buffers
- Each thread processes all lines in its area except the first and the last one
- Each thread must return the first and last partial line in its data when done
- The “last” partial line of each thread must be recombined with the “first” one of the thread working on the next file block, because the split may have cut through a line; these recombined lines must then be processed afterwards (see the sketch below)
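Here is a minimal sketch of that idea. It assumes the file fits in memory as stated above, that every chunk contains at least one newline (safe when chunks are tens of megabytes and lines are about 10 bytes), and that lines are ASCII digits separated by Unix '\n' endings (with Windows '\r\n' you would also strip the trailing '\r'). ChunkedFileProcessor and process are placeholder names, not from any library:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedFileProcessor {

    // partial first and last lines of one chunk
    static class Edges {
        final String first; // bytes before the first newline in the chunk
        final String last;  // bytes after the last newline in the chunk
        Edges(String first, String last) { this.first = first; this.last = last; }
    }

    public static void main(String[] args) throws Exception {
        final int chunkCount = 10;
        // the whole file is assumed to fit in memory (~315 MB)
        final byte[] data = Files.readAllBytes(Paths.get("D:/abc.txt"));
        final int chunkSize = (data.length + chunkCount - 1) / chunkCount;

        ExecutorService service = Executors.newFixedThreadPool(chunkCount);
        List<Future<Edges>> futures = new ArrayList<>();

        for (int i = 0; i < chunkCount; i++) {
            final int start = i * chunkSize;
            final int end = Math.min(start + chunkSize, data.length);
            futures.add(service.submit(new Callable<Edges>() {
                @Override
                public Edges call() {
                    // each chunk is ~31 MB and a line is ~10 bytes,
                    // so every chunk contains at least one newline
                    int firstNl = indexOf(data, start, end);
                    int lastNl = lastIndexOf(data, start, end);
                    // process every complete line strictly between the two
                    for (int pos = firstNl + 1; pos < lastNl; ) {
                        int nl = indexOf(data, pos, lastNl + 1);
                        process(new String(data, pos, nl - pos, StandardCharsets.US_ASCII));
                        pos = nl + 1;
                    }
                    return new Edges(
                        new String(data, start, firstNl - start, StandardCharsets.US_ASCII),
                        new String(data, lastNl + 1, end - lastNl - 1, StandardCharsets.US_ASCII));
                }
            }));
        }

        List<Edges> edges = new ArrayList<>();
        for (Future<Edges> f : futures) {
            edges.add(f.get());
        }
        service.shutdown();

        // the leading fragment of chunk 0 is the complete first line
        process(edges.get(0).first);
        // rejoin each chunk's trailing fragment with the next chunk's leading fragment
        for (int i = 0; i < chunkCount - 1; i++) {
            process(edges.get(i).last + edges.get(i + 1).first);
        }
        // the file may not end with a newline
        process(edges.get(chunkCount - 1).last);
    }

    static void process(String line) {
        if (!line.isEmpty()) {
            System.out.println(line); // do things with line
        }
    }

    static int indexOf(byte[] a, int from, int to) {
        for (int i = from; i < to; i++) if (a[i] == '\n') return i;
        return -1;
    }

    static int lastIndexOf(byte[] a, int from, int to) {
        for (int i = to - 1; i >= from; i--) if (a[i] == '\n') return i;
        return -1;
    }
}

Because the chunk boundaries are byte offsets, only the boundary-straddling lines need the recombination step at the end; every other line is processed entirely inside one thread with no synchronization.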
The Fork/Join API introduced in Java 7 is a great fit for this use case. Check out http://docs.oracle.com/javase/tutorial/essential/concurrency/forkjoin.html. If you search, you will find lots of examples out there.
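For instance, a minimal RecursiveAction along these lines could split the line processing; this is only a sketch, assuming the lines have already been read into memory (for 30 million IDs that is a large but feasible list), and LineProcessor and THRESHOLD are illustrative names/values:

import java.util.List;
import java.util.concurrent.RecursiveAction;

// recursively splits a list of already-read lines and processes the halves in parallel
class LineProcessor extends RecursiveAction {
    private static final int THRESHOLD = 100_000; // max lines to handle sequentially; tune to taste
    private final List<String> lines;

    LineProcessor(List<String> lines) {
        this.lines = lines;
    }

    @Override
    protected void compute() {
        if (lines.size() <= THRESHOLD) {
            for (String line : lines) {
                // do things with line
            }
        } else {
            int mid = lines.size() / 2;
            // subList returns a view, so no copying happens here
            invokeAll(new LineProcessor(lines.subList(0, mid)),
                      new LineProcessor(lines.subList(mid, lines.size())));
        }
    }
}

You would start it with something like new ForkJoinPool().invoke(new LineProcessor(Files.readAllLines(Paths.get("D:/abc.txt"), StandardCharsets.US_ASCII))), keeping in mind that readAllLines materializes all 30 million strings in memory first.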