I have some large-ish text files that I want to process by grouping their lines.
I tried to use the new streaming features, like:
return FileUtils.readLines(...)
.parallelStream()
.map(...)
.collect(groupingBy(pair -> pair[0]));
The problem is that, AFAIK, this generates a Map.
Is there any way to have high level code like the one above that generates, for example, a Stream of Entries?
UPDATE: What I'm looking for is something like Python's itertools.groupby. My files are already sorted (by pair[0]); I just want to load the groups one by one.
I already have an iterative solution. I'm just wondering if there's a more declarative way to do that. Btw, using Guava or another third-party library wouldn't be a big problem.
cyclops-react, a library I contribute to, offers both sharding and grouping functionality that might do what you want.
The groupedStatefullyWhile operator allows elements to be grouped based on the current state of the batch. ReactiveSeq is a single threaded sequential Stream.
This will create a LazyFutureStream (that implements java.util.stream.Stream), that will process the data in the file asynchronously and in parallel. It's lazy and won't start processing until data is pulled through.
The only caveat is that you need to define the shards beforehand, i.e. the 'shards' parameter above, which is a Map of async.Queue instances keyed by the shard key (possibly whatever pair[0] is?).
e.g.
There is a sharding example with video here and test code here
It can be done by collapse with StreamEx.
We can add peek and limit to verify that the calculation is lazy:
The task you want to achieve is quite different from what grouping does.
groupingBy does not rely on the order of the Stream's elements but on the Map's algorithm applied to the classifier Function's result.
What you want is to fold adjacent items having a common property value into one List item. It is not even necessary to have the Stream sorted by that property as long as you can guarantee that all items having the same property value are clustered.
Maybe it is possible to formulate this task as a reduction, but to me the resulting structure looks too complicated.
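To illustrate the point about groupingBy ignoring element order, here is a minimal sketch using only the JDK. The class name GroupingByIgnoresOrder, the helper method group, and the sample pairs are made up for this example. The two "a" pairs are not adjacent in the input, yet they end up in the same Map entry, and nothing is available until the whole input has been consumed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupingByIgnoresOrder {
    // Groups pairs by their first element using the standard collector.
    // The result is a fully built Map, not a lazy sequence of groups.
    static Map<String, List<String[]>> group(List<String[]> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(pair -> pair[0]));
    }

    public static void main(String[] args) {
        // Hypothetical sample data: the two "a" pairs are separated by a "b" pair.
        List<String[]> pairs = Arrays.asList(
            new String[] {"a", "1"},
            new String[] {"b", "2"},
            new String[] {"a", "3"});

        Map<String, List<String[]>> grouped = group(pairs);
        System.out.println(grouped.get("a").size()); // prints 2
        System.out.println(grouped.get("b").size()); // prints 1
    }
}
```

Folding adjacent items, by contrast, would yield three groups for this input: one for the first "a", one for "b", and one for the second "a".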
So, unless direct support for this feature gets added to the Streams, an iterator-based approach looks most pragmatic to me:
The lazy nature of the resulting folded Stream can be best demonstrated by applying it to an infinite stream:
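The code that originally accompanied this answer is not reproduced here, so the following is only a sketch of what such an iterator-based fold might look like, using nothing beyond the JDK. The class name AdjacentGroups, the method name fold, and the key-extractor parameter are assumptions made for this example. It wraps the source stream's iterator, collects each run of adjacent elements with equal keys into a List, and exposes the runs as a sequential Stream.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Objects;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class AdjacentGroups {
    // Lazily folds runs of adjacent elements that share the same key into lists.
    static <T, K> Stream<List<T>> fold(Stream<T> source,
                                       Function<? super T, ? extends K> key) {
        Iterator<T> it = source.iterator();
        Iterator<List<T>> groups = new Iterator<List<T>>() {
            private T pending;          // first element of the next group, if already read
            private boolean hasPending; // separate flag, so null elements are not misread

            @Override
            public boolean hasNext() {
                return hasPending || it.hasNext();
            }

            @Override
            public List<T> next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                T first = hasPending ? pending : it.next();
                hasPending = false;
                pending = null;
                K groupKey = key.apply(first);
                List<T> group = new ArrayList<>();
                group.add(first);
                // Read ahead until the key changes; keep the first non-matching element.
                while (it.hasNext()) {
                    T candidate = it.next();
                    if (Objects.equals(groupKey, key.apply(candidate))) {
                        group.add(candidate);
                    } else {
                        pending = candidate;
                        hasPending = true;
                        break;
                    }
                }
                return group;
            }
        };
        Spliterator<List<T>> sp = Spliterators.spliteratorUnknownSize(
            groups, Spliterator.ORDERED | Spliterator.NONNULL);
        // Sequential on purpose: the grouping depends on encounter order.
        return StreamSupport.stream(sp, false).onClose(source::close);
    }

    public static void main(String[] args) {
        // Infinite source: 0, 1, 2, ... grouped by i / 3.
        Stream<Integer> naturals = Stream.iterate(0, i -> i + 1);
        List<List<Integer>> firstThree = fold(naturals, i -> i / 3)
            .limit(3)
            .collect(Collectors.toList());
        System.out.println(firstThree); // prints [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    }
}
```

Because each group is built only when the downstream asks for it, limit(3) terminates even though the source never ends. The sketch reads one element past the end of each group to detect the key change, so it only needs equal keys to be clustered, not globally sorted.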