How to effectively parse text files with Java Streams

Posted 2020-07-25 23:11

Question:

I understand how to get specific data from a file with Java 8 Streams. For example, if we need to get the loaded packages from a file like this:

2015-01-06 11:33:03 b.s.d.task [INFO] Emitting: eVentToRequestsBolt __ack_ack 
2015-01-06 11:33:03 c.s.p.d.PackagesProvider [INFO] ===---> Loaded package com.foo.bar
2015-01-06 11:33:04 b.s.d.executor [INFO] Processing received message source: eventToManageBolt:2, stream: __ack_ack, id: {}, [-6722594615019711369 -1335723027906100557]
2015-01-06 11:33:04 c.s.p.d.PackagesProvider [INFO] ===---> Loaded package co.il.boo
2015-01-06 11:33:04 c.s.p.d.PackagesProvider [INFO] ===---> Loaded package dot.org.biz

we can do

List<String> packageList = Files.lines(Paths.get(args[1]))
        .filter(line -> line.contains("===---> Loaded package"))
        .map(line -> line.split(" "))
        .map(arr -> arr[arr.length - 1])
        .collect(Collectors.toList());

I took (and slightly modified) the code from Parsing File Example.

But what if we also need to get all the dates (and times) of the Emitting: events from the same log file? How can we do this while working with the same Stream?

The only approach I can think of is collect(groupingBy(...)), which would group the Loaded package lines and the Emitting: lines before parsing, and then parse each group (a map entry) separately. But that would build a map holding all the matching raw lines from the log file, which consumes a lot of memory.
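To make the concern concrete, the grouping idea might look roughly like the sketch below. The class name GroupingSketch, the method groupRaw, and the map keys "package" and "emitting" are hypothetical names chosen for illustration. Note that each map value holds complete raw lines, which is exactly the memory cost in question.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class GroupingSketch {
    // Groups the matching raw log lines by kind. Every matching line is kept
    // in memory in full until it is parsed in a second step.
    static Map<String, List<String>> groupRaw(Stream<String> lines) {
        return lines
            .filter(l -> l.contains("===---> Loaded package") || l.contains("Emitting:"))
            .collect(Collectors.groupingBy(
                l -> l.contains("Emitting:") ? "emitting" : "package"));
    }

    public static void main(String[] args) {
        Map<String, List<String>> groups = groupRaw(Stream.of(
            "2015-01-06 11:33:03 b.s.d.task [INFO] Emitting: eVentToRequestsBolt __ack_ack",
            "2015-01-06 11:33:03 c.s.p.d.PackagesProvider [INFO] ===---> Loaded package com.foo.bar"));
        System.out.println(groups);
    }
}
```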

Is there a similar way to effectively extract multiple types of data from Java 8 Streams?

Answer 1:

You can solve this in a more imperative style, without defining new collectors or using third-party libraries. First, define a class that represents the parsing result. It needs two methods: one to accept an input line and one to combine it with another partial result:

class Data {
    List<String> packageDates = new ArrayList<>();
    List<String> emittingDates = new ArrayList<>();

    // Consume single input line
    void accept(String line) {
        if(line.contains("===---> Loaded package"))
            packageDates.add(line.substring(0, "XXXX-XX-XX".length()));
        if(line.contains("Emitting"))
            emittingDates.add(line.substring(0, "XXXX-XX-XX XX:XX:XX".length()));
    }

    // Combine two partial results
    void combine(Data other) {
        packageDates.addAll(other.packageDates);
        emittingDates.addAll(other.emittingDates);
    }
}

Now you can collect in a quite straightforward way:

Data result = Files.lines(Paths.get(args[1]))
    .collect(Data::new, Data::accept, Data::combine);
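One detail worth noting: Files.lines keeps the file open until the stream is closed, so it is usually wrapped in try-with-resources. The following is a minimal self-contained sketch of the same three-argument collect under that assumption. LogData mirrors the Data class above, and LogParseSketch, parseLines and parseFile are hypothetical names, not part of the original answer.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical stand-in for the Data class from the answer.
class LogData {
    final List<String> packageDates = new ArrayList<>();
    final List<String> emittingDates = new ArrayList<>();

    // Consume a single input line, keeping only the timestamp prefix.
    void accept(String line) {
        if (line.contains("===---> Loaded package"))
            packageDates.add(line.substring(0, "XXXX-XX-XX".length()));
        if (line.contains("Emitting"))
            emittingDates.add(line.substring(0, "XXXX-XX-XX XX:XX:XX".length()));
    }

    // Merge another partial result into this one (used by parallel streams).
    void combine(LogData other) {
        packageDates.addAll(other.packageDates);
        emittingDates.addAll(other.emittingDates);
    }
}

class LogParseSketch {
    // Single pass over any stream of lines.
    static LogData parseLines(Stream<String> lines) {
        return lines.collect(LogData::new, LogData::accept, LogData::combine);
    }

    // Same pass over a file; try-with-resources closes the underlying file.
    static LogData parseFile(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return parseLines(lines);
        }
    }

    public static void main(String[] args) {
        LogData result = parseLines(Stream.of(
            "2015-01-06 11:33:03 b.s.d.task [INFO] Emitting: eVentToRequestsBolt __ack_ack",
            "2015-01-06 11:33:03 c.s.p.d.PackagesProvider [INFO] ===---> Loaded package com.foo.bar"));
        System.out.println(result.packageDates + " " + result.emittingDates);
    }
}
```

Only the extracted substrings are retained, so memory use grows with the number of matches rather than with the size of the raw lines.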


Answer 2:

You can use the pairing collector that I wrote in this answer, which is also available in my StreamEx library. For your concrete problem you will also need a filtering collector, which is available in JDK 9 early access builds (as Collectors.filtering) and also in my StreamEx library. If you don't want to use a third-party library, you can copy it from this answer.
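The linked answers are not reproduced here, so the following is only a rough Java 8 sketch of what pairing and filtering collectors of this kind could look like. It is not the StreamEx implementation: the class CollectorSketch and the helper Pair are hypothetical, and the sketch deliberately ignores collector characteristics for brevity.

```java
import java.util.function.BiConsumer;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Predicate;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CollectorSketch {
    // Mutable holder for the two downstream accumulation containers.
    private static final class Pair<A, B> {
        A left;
        B right;
        Pair(A left, B right) { this.left = left; this.right = right; }
    }

    // Passes an element to the downstream collector only if it matches.
    static <T, A, R> Collector<T, ?, R> filtering(
            Predicate<? super T> predicate, Collector<? super T, A, R> downstream) {
        BiConsumer<A, ? super T> acc = downstream.accumulator();
        return Collector.<T, A, R>of(
            downstream.supplier(),
            (a, t) -> { if (predicate.test(t)) acc.accept(a, t); },
            downstream.combiner(),
            downstream.finisher());
    }

    // Feeds every element to both collectors and merges their two results.
    static <T, A1, A2, R1, R2, R> Collector<T, ?, R> pairing(
            Collector<? super T, A1, R1> first,
            Collector<? super T, A2, R2> second,
            BiFunction<? super R1, ? super R2, ? extends R> finisher) {
        BiConsumer<A1, ? super T> acc1 = first.accumulator();
        BiConsumer<A2, ? super T> acc2 = second.accumulator();
        BinaryOperator<A1> comb1 = first.combiner();
        BinaryOperator<A2> comb2 = second.combiner();
        return Collector.<T, Pair<A1, A2>, R>of(
            () -> new Pair<>(first.supplier().get(), second.supplier().get()),
            (pair, item) -> {
                acc1.accept(pair.left, item);
                acc2.accept(pair.right, item);
            },
            (p, q) -> {
                p.left = comb1.apply(p.left, q.left);
                p.right = comb2.apply(p.right, q.right);
                return p;
            },
            pair -> finisher.apply(
                first.finisher().apply(pair.left),
                second.finisher().apply(pair.right)));
    }

    public static void main(String[] args) {
        Collector<String, ?, Long> as = filtering(s -> s.startsWith("a"), Collectors.counting());
        Collector<String, ?, Long> bs = filtering(s -> s.startsWith("b"), Collectors.counting());
        Collector<String, ?, String> both = pairing(as, bs, (x, y) -> x + ":" + y);
        System.out.println(Stream.of("a1", "b1", "a2", "c").collect(both));
    }
}
```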

You will also need to store the results in some data structure. I declared the Data class for this purpose:

class Data {
    List<String> packageDates;
    List<String> emittingDates;

    public Data(List<String> packageDates, List<String> emittingDates) {
        this.packageDates = packageDates;
        this.emittingDates = emittingDates;
    }
}

Putting everything together, you can define a parsingCollector:

Collector<String, ?, List<String>> packageDatesCollector = 
    filtering(line -> line.contains("===---> Loaded package"),
        mapping(line -> line.substring(0, "XXXX-XX-XX".length()), toList()));

Collector<String, ?, List<String>> emittingDatesCollector = 
    filtering(line -> line.contains("Emitting"),
        mapping(line -> line.substring(0, "XXXX-XX-XX XX:XX:XX".length()), toList()));

Collector<String, ?, Data> parsingCollector = pairing(
    packageDatesCollector, emittingDatesCollector, Data::new);

And use it like this:

Data data = Files.lines(Paths.get(args[1])).collect(parsingCollector);
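On newer JDKs the same idea can be expressed without any third-party code. The sketch below assumes Java 12 or later and swaps in Collectors.teeing (added in Java 12) for the pairing collector, together with Collectors.filtering (added in Java 9). The names TeeingSketch and Parsed are hypothetical, with Parsed playing the role of the Data class above.

```java
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class TeeingSketch {
    // Hypothetical result holder, equivalent to the Data class above.
    static final class Parsed {
        final List<String> packageDates;
        final List<String> emittingDates;

        Parsed(List<String> packageDates, List<String> emittingDates) {
            this.packageDates = packageDates;
            this.emittingDates = emittingDates;
        }
    }

    // Builds one collector that fills both lists in a single pass.
    static Collector<String, ?, Parsed> parsingCollector() {
        Collector<String, ?, List<String>> packageDates = Collectors.filtering(
            line -> line.contains("===---> Loaded package"),
            Collectors.mapping(
                line -> line.substring(0, "XXXX-XX-XX".length()),
                Collectors.toList()));

        Collector<String, ?, List<String>> emittingDates = Collectors.filtering(
            line -> line.contains("Emitting"),
            Collectors.mapping(
                line -> line.substring(0, "XXXX-XX-XX XX:XX:XX".length()),
                Collectors.toList()));

        // teeing sends each line to both collectors, then merges the results.
        return Collectors.teeing(packageDates, emittingDates, Parsed::new);
    }

    public static void main(String[] args) {
        Parsed parsed = Stream.of(
            "2015-01-06 11:33:03 b.s.d.task [INFO] Emitting: eVentToRequestsBolt __ack_ack",
            "2015-01-06 11:33:03 c.s.p.d.PackagesProvider [INFO] ===---> Loaded package com.foo.bar")
            .collect(parsingCollector());
        System.out.println(parsed.packageDates + " " + parsed.emittingDates);
    }
}
```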