I have to modify a Dropwizard application to improve its running time. The application receives around 3 million URLs daily and downloads and parses them to detect malicious content, but it is currently only able to process about 1 million URLs per day. When I looked at the application I found that it makes a lot of sequential calls. I would like some suggestions on how I can improve it by making it asynchronous, or through other techniques.
The relevant code is as follows:
/* Scheduler */
private long triggerDetection(String startDate, String endDate) {
    long dispatched = 0;
    // urlRequests is assumed to be loaded for the startDate..endDate window
    for (UrlRequest request : urlRequests) {
        if (!validateRequests.isWhitelisted(request)) {
            contentDetectionClient.detectContent(request);
            dispatched++;
        }
    }
    return dispatched; // number of requests dispatched
}
/* Client */
public void detectContent(UrlRequest urlRequest) {
    Client client = new Client();
    URI uri = buildUrl(); // returns the URI of this Dropwizard application's resource method (below)
    ClientResponse response = client.resource(uri)
                                    .type(MediaType.APPLICATION_JSON_TYPE)
                                    .post(ClientResponse.class, urlRequest);
    int status = response.getStatus();
    if (status >= 200 && status < 300) {
        log.info("Completed request for url: {}", urlRequest.getUrl());
    } else {
        log.error("Request failed for url: {}", urlRequest.getUrl());
    }
}

private URI buildUrl() {
    return UriBuilder
            .fromPath(uriConfiguration.getUrl())
            .build();
}
/* Resource Method */
/**
 * Receives the url of the publisher, crawls the content of that url, and applies a detector
 * to check whether the content is malicious.
 *
 * @return the probability of the page being malicious
 * @throws Exception if the crawl call failed
 */
@POST
@Path("/pageDetection")
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
public DetectionScore detectContent(UrlRequest urlRequest) throws Exception {
    return contentAnalysisOrchestrator.detectContentPage(urlRequest);
}
/* Orchestrator */
public DetectionScore detectContentPage(UrlRequest urlRequest) {
    try {
        Pair<Integer, HtmlPage> response = crawler.rawLoad(urlRequest.getUrl());
        String content = response.getValue().text();
        DetectionScore detectionScore = detector.getProbability(urlRequest.getUrl(), content);
        contentDetectionResultDao.insert(urlRequest.getAffiliateId(), urlRequest.getUrl(),
                detectionScore.getProbability() * 1000, detectionScore.getRecommendation(),
                urlRequest.getRequestsPerUrl(), -1, urlRequest.getCreatedAt());
        return detectionScore;
    } catch (IOException e) {
        log.error("Error while analyzing the url: {}", urlRequest.getUrl(), e);
        throw new WebApplicationException(e, Response.Status.INTERNAL_SERVER_ERROR);
    }
}
I was thinking of the following approach:
Instead of calling the Dropwizard resource method via POST, I call
orchestrator.detectContentPage(urlRequest)
directly from the scheduler. The orchestrator can return the detectionScore, and I'll store all the detection scores in a map/table and perform a batch database insertion instead of the individual inserts in the present code.
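For illustration, a minimal JDBC sketch of the batch-insert idea; the table and column names are hypothetical and the column list is trimmed for brevity (if the DAO is JDBI-based, as is common in Dropwizard, its @SqlBatch annotation achieves the same effect):

String sql = "INSERT INTO content_detection_result "
           + "(affiliate_id, url, probability, recommendation) VALUES (?, ?, ?, ?)";
try (PreparedStatement ps = connection.prepareStatement(sql)) {
    for (Map.Entry<UrlRequest, DetectionScore> entry : detectionScores.entrySet()) {
        UrlRequest request = entry.getKey();
        DetectionScore score = entry.getValue();
        ps.setObject(1, request.getAffiliateId());     // setObject: exact column types are assumed
        ps.setString(2, request.getUrl());
        ps.setDouble(3, score.getProbability() * 1000);
        ps.setObject(4, score.getRecommendation());
        ps.addBatch();                                 // queue the row locally
    }
    ps.executeBatch();                                 // send the queued rows together (far fewer round trips)
}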
I would like some comments on the above approach, and on other techniques that could improve the running time. Also, I just read about asynchronous programming in Java but can't figure out how to apply it to the code above, so I would appreciate some help with that as well.
Thanks.
Edit: I can think of three bottlenecks:
- Downloading the webpage
- Inserting the result into the database (the database is on another system)
- Processing appears to happen one URL at a time
The system has 8 GB of memory, of which around 4 GB appears to be free:
$ free -m
             total       used       free     shared    buffers     cached
Mem:          7843       4496       3346          0        193       2339
-/+ buffers/cache:       1964       5879
Swap:         1952        489       1463
CPU usage is also minimal:
top - 13:31:19 up 19 days, 15:39, 3 users, load average: 0.00, 0.00, 0.00
Tasks: 215 total, 1 running, 214 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.5%us, 0.0%sy, 0.0%ni, 99.4%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8031412k total, 4605196k used, 3426216k free, 198040k buffers
Swap: 1999868k total, 501020k used, 1498848k free, 2395344k cached
Inspired by Davide's (great) answer, here is an example of an easy way to parallelise this using simple-react (a library I wrote). Note it is slightly different: it uses the client to drive the concurrency on the server.
Example
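(Sketch only: the fluent API below is from memory of simple-react and may need adjusting for the version you use; urlRequests, validateRequests and contentDetectionClient are the objects from the question.)

// Build a stream of futures backed by a pool of 15 threads, so up to 15
// requests to the server are in flight at once.
new LazyReact(15, 15)
        .fromStream(urlRequests.stream())
        .filter(request -> !validateRequests.isWhitelisted(request))
        .forEach(request -> contentDetectionClient.detectContent(request));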
Explanation
It looks like you can drive the concurrency from the client, which means you can distribute the work across threads on the server side with no additional work. In this example we are making 15 concurrent requests, but you could set it close to the maximum the server can handle. Your application is I/O-bound, so you can use a lot of threads to drive performance.
simple-react works as a stream of Futures. Here we create an async task for each call to the ContentDetection client. With 15 threads available, 15 calls can be made to the server at once.
Java 7
There is a backport of the JDK 8 streams functionality to Java 7 called StreamSupport, and you can also backport lambda expressions via RetroLambda.
To implement the same solution with CompletableFutures, we can create a future task for each eligible URL. UPDATE: I don't think we need to batch them; we can use the Executor to limit the number of active futures. We merely need to join on them all at the end.
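A minimal sketch of that idea, reusing urlRequests, validateRequests and the client from the question; the fixed-size pool is what caps how many requests are in flight at once:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

ExecutorService executor = Executors.newFixedThreadPool(15);   // caps concurrency at 15

List<CompletableFuture<Void>> futures =
        urlRequests.stream()
                   .filter(request -> !validateRequests.isWhitelisted(request))
                   .map(request -> CompletableFuture.runAsync(
                           () -> contentDetectionClient.detectContent(request), executor))
                   .collect(Collectors.toList());

// Join on them all at the end
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
executor.shutdown();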
First, check where you lose the majority of the time.
I suppose that most of the time is lost downloading the URLs.
If downloading the URLs takes more than 90% of the time, you probably can't improve your application much, because the bottleneck is not Java but your network.
Consider the following only if the downloading time is well within your network's capabilities.
If the downloading time is not that high, you can probably improve performance. The standard approach is a producer-consumer chain. See here for details.
Basically, you can split the work as follows: downloading is a producer; parsing is a consumer of the download stage and a producer for the saving stage; saving is the final consumer.
Each stage can be executed by a different number of threads; for example, 3 downloading threads, 5 parsing threads and 1 saving thread, as in the sketch below.
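For illustration, a minimal sketch of that chain, reusing the crawler, detector and DAO from the question (Pair as in the orchestrator code, assumed to be the commons-lang3 tuple); bounded queues give back-pressure between stages, and shutdown handling (e.g. poison pills) is omitted for brevity:

import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.commons.lang3.tuple.Pair;

// Bounded queues between the stages: a full queue blocks the stage before it.
BlockingQueue<UrlRequest> downloadQueue = new LinkedBlockingQueue<>(1000);
BlockingQueue<Pair<UrlRequest, String>> parseQueue = new LinkedBlockingQueue<>(1000);
BlockingQueue<Pair<UrlRequest, DetectionScore>> saveQueue = new LinkedBlockingQueue<>(1000);

ExecutorService downloaders = Executors.newFixedThreadPool(3);
ExecutorService parsers = Executors.newFixedThreadPool(5);
ExecutorService saver = Executors.newSingleThreadExecutor();

Runnable downloadStage = () -> {
    try {
        while (true) {
            UrlRequest request = downloadQueue.take();          // blocks while idle
            try {
                String content = crawler.rawLoad(request.getUrl()).getValue().text();
                parseQueue.put(Pair.of(request, content));      // blocks if parsers lag
            } catch (IOException e) {
                log.error("download failed for url: {}", request.getUrl(), e);
            }
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();                     // stop on shutdownNow()
    }
};

Runnable parseStage = () -> {
    try {
        while (true) {
            Pair<UrlRequest, String> page = parseQueue.take();
            DetectionScore score = detector.getProbability(page.getKey().getUrl(), page.getValue());
            saveQueue.put(Pair.of(page.getKey(), score));
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
};

Runnable saveStage = () -> {
    try {
        while (true) {
            Pair<UrlRequest, DetectionScore> result = saveQueue.take();
            UrlRequest request = result.getKey();
            DetectionScore score = result.getValue();
            contentDetectionResultDao.insert(request.getAffiliateId(), request.getUrl(),
                    score.getProbability() * 1000, score.getRecommendation(),
                    request.getRequestsPerUrl(), -1, request.getCreatedAt());
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
};

for (int i = 0; i < 3; i++) downloaders.submit(downloadStage);
for (int i = 0; i < 5; i++) parsers.submit(parseStage);
saver.submit(saveStage);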
Edited after comments
As supposed, the bottleneck is not CPU time, so there is little to gain from tuning the Java code itself.
If you know how many gigabytes you download each day, you can check whether that is close to the maximum bandwidth of your network.
If it is, there are several options; one is to ask the servers for compressed content (Accept-Encoding: gzip in the request, so responses arrive with Content-Encoding: gzip), thereby reducing the used bandwidth.
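For illustration, a plain-JDK sketch of requesting and decompressing gzipped content (the crawler or HTTP library you use may already do this transparently; the class name here is hypothetical):

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;

public class GzipFetch {
    public static String fetch(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Accept-Encoding", "gzip");   // ask for compressed content

        InputStream in = conn.getInputStream();
        if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
            in = new GZIPInputStream(in);                     // transparently decompress
        }
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.joining("\n"));
        }
    }
}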