Consider the task of fetching a limited amount of data from a source that is, with respect to memory, effectively infinite: I need to fetch counts from a huge table with billions of entries, labeled by timestamp and foreign_key, within a finite but possibly huge time window. For later consumption, however, I need at most gridSize values.
Note: the backing DB is MariaDB and we use Spring Data JPA to connect to it.
One stream-based approach is:

int stepSize = (int) (numberOfCountsInWindow / gridSize) + 1;
StreamUtils
    .zipWithIndex(countStream)
    .filter(countIndexed -> countIndexed.getIndex() % stepSize == 0)
    ...
    <intermediate steps>
    ...
    .collect(Collectors.toList());
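For context, a more complete sketch of that pipeline, assuming StreamUtils comes from the protonpack library and that Count, countRepository and streamWindow are placeholder names for our entity and repository (a Spring Data JPA stream must be consumed inside a transaction and closed, hence the try-with-resources):

import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import com.codepoetics.protonpack.Indexed;
import com.codepoetics.protonpack.StreamUtils;
import org.springframework.transaction.annotation.Transactional;

// hypothetical service method; the real code has more intermediate steps
@Transactional(readOnly = true)
public List<Count> sampleWindow(Instant from, Instant to,
                                long numberOfCountsInWindow, long gridSize) {
    int stepSize = (int) (numberOfCountsInWindow / gridSize) + 1;
    // the JPA stream holds an open DB cursor, so close it deterministically
    try (Stream<Count> countStream = countRepository.streamWindow(from, to)) {
        return StreamUtils
                .zipWithIndex(countStream)                              // Stream<Indexed<Count>>
                .filter(indexed -> indexed.getIndex() % stepSize == 0) // keep every stepSize-th row
                .map(Indexed::getValue)
                .collect(Collectors.toList());
    }
}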
I've tried plenty of other approaches, like:
- a custom Collector that uses an AtomicLong to decide whether the next value is added to the final list or just ignored (see the sketch after this list),
- Flux.defer(() -> ...), to use the Flux back-pressure management,
- Java 11,
- plenty of MySQL options like "useCursorFetch=true" in combination with a finite prefetch size as @QueryHint() (also sketched below),
- ...
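For illustration, the Collector variant as a minimal sketch (sequential streams only, since the shared AtomicLong makes the combiner meaningless for parallel streams; everyNth is a made-up name):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collector;

// keeps only every stepSize-th element, drops the rest
static <T> Collector<T, List<T>, List<T>> everyNth(long stepSize) {
    AtomicLong index = new AtomicLong();
    return Collector.of(
            ArrayList::new,
            (list, element) -> {
                if (index.getAndIncrement() % stepSize == 0) {
                    list.add(element);
                }
            },
            (left, right) -> { left.addAll(right); return left; }
    );
}

which would be used as countStream.collect(everyNth(stepSize)).

And the @QueryHint variant looked roughly like this (CountRepository, Count and streamWindow are again placeholder names; HINT_FETCH_SIZE is Hibernate's "org.hibernate.fetchSize" hint, which together with useCursorFetch=true on the JDBC URL should make the driver fetch rows in batches instead of materializing the whole result set):

import java.time.Instant;
import java.util.stream.Stream;

import javax.persistence.QueryHint;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.jpa.repository.QueryHints;
import org.springframework.data.repository.query.Param;

import static org.hibernate.jpa.QueryHints.HINT_FETCH_SIZE;

public interface CountRepository extends JpaRepository<Count, Long> {

    // stream the window instead of loading it as one List
    @QueryHints(@QueryHint(name = HINT_FETCH_SIZE, value = "1000"))
    @Query("select c from Count c where c.timestamp between :from and :to")
    Stream<Count> streamWindow(@Param("from") Instant from, @Param("to") Instant to);
}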
However, everything I've tried leads to "GC overhead limit exceeded" (with the heap size limited to 2048 MB, which is the default for our applications). The reason is that the CPU is completely busy garbage collecting while memory consumption keeps growing, until the application finally crashes.
What I actually expected (or at least hoped for) was that the stream would "realize" the filtering and just continue with the actual computation, ideally using 100% CPU, and that the garbage collector would easily remove the unused objects, since they are filtered out anyway and don't even need to be parsed (I was hoping the laziness of the stream would help here). However, this doesn't seem to work the way I hoped.
Comments or suggestions are very welcome, since I'd like to at least understand (and at best solve) the situation rather than just accept it.
EDIT: Alternatives to MariaDB (maybe Cassandra?) are also welcome.