I have to modify a Dropwizard application to improve its running time. The application receives around 3 million URLs daily and downloads and parses them to detect malicious content, but it is currently only able to process about 1 million URLs per day. When I looked at the application I found that it makes a lot of sequential calls. I would like some suggestions on how I can improve it by making it asynchronous, or through other techniques.
The relevant code is as follows:
/* Scheduler */
private long triggerDetection(String startDate, String endDate) {
    long dispatched = 0;
    // urlRequests is assumed to be loaded for the startDate..endDate window
    for (UrlRequest request : urlRequests) {
        if (!validateRequests.isWhitelisted(request)) {
            contentDetectionClient.detectContent(request);
            dispatched++;
        }
    }
    return dispatched; // number of requests dispatched
}
/* Client */
public void detectContent(UrlRequest urlRequest) {
    Client client = new Client();
    URI uri = buildUrl(); // returns the URI of this Dropwizard application's resource method (below)
    ClientResponse response = client.resource(uri)
                                    .type(MediaType.APPLICATION_JSON_TYPE)
                                    .post(ClientResponse.class, urlRequest);
    int status = response.getStatus();
    if (status >= 200 && status < 300) {
        log.info("Completed request for url: {}", urlRequest.getUrl());
    } else {
        log.error("Request failed for url: {}", urlRequest.getUrl());
    }
}

private URI buildUrl() {
    return UriBuilder
            .fromPath(uriConfiguration.getUrl())
            .build();
}
/* Resource Method */
/**
 * Receives the url of the publisher, crawls the content of that url, and applies a detector
 * to check whether the content is malicious.
 *
 * @return the probability of the page being malicious
 * @throws Exception if the crawl call failed
 */
@POST
@Path("/pageDetection")
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
public DetectionScore detectContent(UrlRequest urlRequest) throws Exception {
    return contentAnalysisOrchestrator.detectContentPage(urlRequest);
}
/* Orchestrator */
public DetectionScore detectContentPage(UrlRequest urlRequest) {
    try {
        Pair<Integer, HtmlPage> response = crawler.rawLoad(urlRequest.getUrl());
        String content = response.getValue().text();
        DetectionScore detectionScore = detector.getProbability(urlRequest.getUrl(), content);
        contentDetectionResultDao.insert(urlRequest.getAffiliateId(), urlRequest.getUrl(),
                detectionScore.getProbability() * 1000, detectionScore.getRecommendation(),
                urlRequest.getRequestsPerUrl(), -1, urlRequest.getCreatedAt());
        return detectionScore;
    } catch (IOException e) {
        log.error("Error while analyzing the url: {}", urlRequest.getUrl(), e);
        throw new WebApplicationException(e, Response.Status.INTERNAL_SERVER_ERROR);
    }
}
I was thinking of the following approach:
Instead of calling the Dropwizard resource method via POST, I call
orchestrator.detectContentPage(urlRequest)
directly from the scheduler. The orchestrator can return the detectionScore, and I'll store all the detection scores in a map/table and perform a batch database insertion instead of the individual inserts in the present code.
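For illustration, a minimal JDBC sketch of the batch-insert idea; the table and column names are hypothetical and the column list is trimmed for brevity (if the DAO is JDBI-based, as is common in Dropwizard, its @SqlBatch annotation achieves the same effect):

String sql = "INSERT INTO content_detection_result "
           + "(affiliate_id, url, probability, recommendation) VALUES (?, ?, ?, ?)";
try (PreparedStatement ps = connection.prepareStatement(sql)) {
    for (Map.Entry<UrlRequest, DetectionScore> entry : detectionScores.entrySet()) {
        UrlRequest request = entry.getKey();
        DetectionScore score = entry.getValue();
        ps.setObject(1, request.getAffiliateId());     // setObject: exact column types are assumed
        ps.setString(2, request.getUrl());
        ps.setDouble(3, score.getProbability() * 1000);
        ps.setObject(4, score.getRecommendation());
        ps.addBatch();                                 // queue the row locally
    }
    ps.executeBatch();                                 // send the queued rows together (far fewer round trips)
}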
I would like some comments on the above approach, and on other techniques that could improve the running time. Also, I just read about asynchronous programming in Java but can't figure out how to apply it to the code above, so I would appreciate some help with that as well.
Thanks.
Edit: I can think of three bottlenecks:
- Downloading the webpage
- Inserting the result into the database (the database is on another system)
- Processing appears to happen one URL at a time
The system has 8 GB of memory, of which around 4 GB appears to be free:
$ free -m
             total       used       free     shared    buffers     cached
Mem:          7843       4496       3346          0        193       2339
-/+ buffers/cache:       1964       5879
Swap:         1952        489       1463
CPU usage is also minimal:
top - 13:31:19 up 19 days, 15:39, 3 users, load average: 0.00, 0.00, 0.00
Tasks: 215 total, 1 running, 214 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.5%us, 0.0%sy, 0.0%ni, 99.4%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8031412k total, 4605196k used, 3426216k free, 198040k buffers
Swap: 1999868k total, 501020k used, 1498848k free, 2395344k cached
Inspired by Davide's (great) answer, here is an example of an easy way to parallelise this using simple-react (a library I wrote). Note it is slightly different: it uses the client to drive the concurrency on the server.
Example
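(Sketch only: the fluent API below is from memory of simple-react and may need adjusting for the version you use; urlRequests, validateRequests and contentDetectionClient are the objects from the question.)

// Build a stream of futures backed by a pool of 15 threads, so up to 15
// requests to the server are in flight at once.
new LazyReact(15, 15)
        .fromStream(urlRequests.stream())
        .filter(request -> !validateRequests.isWhitelisted(request))
        .forEach(request -> contentDetectionClient.detectContent(request));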
Explanation
It looks like you can drive the concurrency from the client, which means you can distribute the work across threads on the server side with no additional work. In this example we are making 15 concurrent requests, but you could set it close to the maximum the server can handle. Your application is I/O-bound, so you can use a lot of threads to drive performance.
simple-react works as a stream of Futures. Here we create an async task for each call to the ContentDetection client. With 15 threads available, 15 calls can be made to the server at once.
Java 7
There is a backport of the JDK 8 streams functionality to Java 7 called StreamSupport, and you can also backport lambda expressions via RetroLambda.
To implement the same solution with CompletableFutures, we can create a future task for each eligible URL. UPDATE: I don't think we need to batch them; we can use the Executor to limit the number of active futures. We merely need to join on them all at the end.
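A minimal sketch of that idea, reusing urlRequests, validateRequests and the client from the question; the fixed-size pool is what caps how many requests are in flight at once:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

ExecutorService executor = Executors.newFixedThreadPool(15);   // caps concurrency at 15

List<CompletableFuture<Void>> futures =
        urlRequests.stream()
                   .filter(request -> !validateRequests.isWhitelisted(request))
                   .map(request -> CompletableFuture.runAsync(
                           () -> contentDetectionClient.detectContent(request), executor))
                   .collect(Collectors.toList());

// Join on them all at the end
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
executor.shutdown();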
First, check where you lose the majority of the time.
I suppose that most of the time is lost downloading the URLs.
If downloading the URLs takes more than 90% of the time, you probably can't improve your application much, because the bottleneck is not Java but your network.
Consider the following only if the downloading time is well within your network's capabilities.
If the downloading time is not that high, you can probably improve performance. The standard approach is a producer-consumer chain. See here for details.
Basically, you can split the work as follows: downloading is a producer; parsing is a consumer of the download stage and a producer for the saving stage; saving is the final consumer.
Each stage can be executed by a different number of threads; for example, 3 downloading threads, 5 parsing threads and 1 saving thread, as in the sketch below.
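For illustration, a minimal sketch of that chain, reusing the crawler, detector and DAO from the question (Pair as in the orchestrator code, assumed to be the commons-lang3 tuple); bounded queues give back-pressure between stages, and shutdown handling (e.g. poison pills) is omitted for brevity:

import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.commons.lang3.tuple.Pair;

// Bounded queues between the stages: a full queue blocks the stage before it.
BlockingQueue<UrlRequest> downloadQueue = new LinkedBlockingQueue<>(1000);
BlockingQueue<Pair<UrlRequest, String>> parseQueue = new LinkedBlockingQueue<>(1000);
BlockingQueue<Pair<UrlRequest, DetectionScore>> saveQueue = new LinkedBlockingQueue<>(1000);

ExecutorService downloaders = Executors.newFixedThreadPool(3);
ExecutorService parsers = Executors.newFixedThreadPool(5);
ExecutorService saver = Executors.newSingleThreadExecutor();

Runnable downloadStage = () -> {
    try {
        while (true) {
            UrlRequest request = downloadQueue.take();          // blocks while idle
            try {
                String content = crawler.rawLoad(request.getUrl()).getValue().text();
                parseQueue.put(Pair.of(request, content));      // blocks if parsers lag
            } catch (IOException e) {
                log.error("download failed for url: {}", request.getUrl(), e);
            }
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();                     // stop on shutdownNow()
    }
};

Runnable parseStage = () -> {
    try {
        while (true) {
            Pair<UrlRequest, String> page = parseQueue.take();
            DetectionScore score = detector.getProbability(page.getKey().getUrl(), page.getValue());
            saveQueue.put(Pair.of(page.getKey(), score));
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
};

Runnable saveStage = () -> {
    try {
        while (true) {
            Pair<UrlRequest, DetectionScore> result = saveQueue.take();
            UrlRequest request = result.getKey();
            DetectionScore score = result.getValue();
            contentDetectionResultDao.insert(request.getAffiliateId(), request.getUrl(),
                    score.getProbability() * 1000, score.getRecommendation(),
                    request.getRequestsPerUrl(), -1, request.getCreatedAt());
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
};

for (int i = 0; i < 3; i++) downloaders.submit(downloadStage);
for (int i = 0; i < 5; i++) parsers.submit(parseStage);
saver.submit(saveStage);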
Edited after comments
As supposed, the bottleneck is not CPU time, so there is little to gain from tuning the Java code itself.
If you know how many gigabytes you download each day, you can check whether that is close to the maximum bandwidth of your network.
If it is, there are several options; one is to ask the servers for compressed content (Accept-Encoding: gzip in the request, so responses arrive with Content-Encoding: gzip), thereby reducing the used bandwidth.
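For illustration, a plain-JDK sketch of requesting and decompressing gzipped content (the crawler or HTTP library you use may already do this transparently; the class name here is hypothetical):

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;

public class GzipFetch {
    public static String fetch(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Accept-Encoding", "gzip");   // ask for compressed content

        InputStream in = conn.getInputStream();
        if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
            in = new GZIPInputStream(in);                     // transparently decompress
        }
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.joining("\n"));
        }
    }
}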