One of the nodes in our 3-node cluster is down, and the log file shows the messages below:
INFO [keyspace.core Index WorkPool work thread-2] 2016-09-14 14:05:32,891 AbstractMetrics.java:114 - Cannot record QUEUE latency of 11 minutes because higher than 10 minutes.
INFO [keyspace.core Index WorkPool work thread-2] 2016-09-14 14:05:33,233 AbstractMetrics.java:114 - Cannot record QUEUE latency of 10 minutes because higher than 10 minutes.
WARN [keyspace.core Index WorkPool work thread-2] 2016-09-14 14:05:33,398 Worker.java:99 - Interrupt/timeout detected.
java.util.concurrent.BrokenBarrierException: null
at java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:200) ~[na:1.7.0_79]
at java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:355) ~[na:1.7.0_79]
at com.datastax.bdp.concurrent.FlushTask.bulkSync(FlushTask.java:76) ~[dse-core-4.8.3.jar:4.8.3]
at com.datastax.bdp.concurrent.Worker.run(Worker.java:94) ~[dse-core-4.8.3.jar:4.8.3]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_79]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
WARN [keyspace.core Index WorkPool work thread-2] 2016-09-14 14:05:33,398 Worker.java:99 - Interrupt/timeout detected.
java.util.concurrent.BrokenBarrierException: null
at java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:200) ~[na:1.7.0_79]
at java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:355) ~[na:1.7.0_79]
at com.datastax.bdp.concurrent.FlushTask.bulkSync(FlushTask.java:76) ~[dse-core-4.8.3.jar:4.8.3]
at com.datastax.bdp.concurrent.Worker.run(Worker.java:94) ~[dse-core-4.8.3.jar:4.8.3]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_79]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
INFO [keyspace.core Index WorkPool work thread-4] 2016-09-14 14:05:33,720 AbstractMetrics.java:114 - Cannot record QUEUE latency of 13 minutes because higher than 10 minutes.
INFO [keyspace.core Index WorkPool work thread-4] 2016-09-14 14:05:33,721 AbstractMetrics.java:114 - Cannot record QUEUE latency of 13 minutes because higher than 10 minutes.
Each node has 8 CPUs, 32 GB RAM, and 500 GB of disk space. What could cause only this one node to go down?
I'll answer with some general information here; your case might be more complex. 32 GB of RAM might not be enough for a Solr node; the G1 collector on Java 1.8 has proved better for Solr with heap sizes above 26 GB.
I also don't know your heap sizes, JVM settings, or how many Solr cores you have. However, I've seen errors like this when a node is busy indexing and is struggling to keep up. One of the most common problems I see on Solr nodes is max_solr_concurrency_per_core being left at its default (commented out) in dse.yaml. This typically sets the number of indexing threads to the number of CPU cores, and to compound the problem, you might see 8 cores but, if you have hyper-threading (HT), likely have only 4 physical cores.
Check your dse.yaml and make sure you set it to (number of physical CPU cores) / (number of Solr cores), with 2 at a minimum. This might index more slowly, but it should take the pressure off your node.
I'd recommend this blog post as a good start to tuning DSE Solr:
http://www.datastax.com/dev/blog/tuning-dse-search
Also docs on the subject:
https://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchTune.html
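To make the sizing rule above concrete, here is a rough Python sketch of the arithmetic. The function names are made up for illustration, and halving the logical CPU count under hyper-threading assumes two hardware threads per physical core, which you should verify on your own hardware. The only real setting name used is max_solr_concurrency_per_core from dse.yaml.

```python
def solr_concurrency_per_core(logical_cpus, solr_cores,
                              hyperthreading=True, minimum=2):
    """Suggest a value for max_solr_concurrency_per_core.

    Rule of thumb from the answer above:
        physical CPU cores / number of Solr cores, never below 2.
    Assumption: with hyper-threading, physical cores = logical CPUs / 2.
    """
    physical = logical_cpus // 2 if hyperthreading else logical_cpus
    # Guard against a zero Solr core count to avoid dividing by zero.
    per_core = physical // max(1, solr_cores)
    return max(minimum, per_core)


def dse_yaml_line(value):
    """Format the suggestion as the line to uncomment and set in dse.yaml."""
    return f"max_solr_concurrency_per_core: {value}"


if __name__ == "__main__":
    # 8 logical CPUs with HT (4 physical cores), one Solr core.
    print(dse_yaml_line(solr_concurrency_per_core(8, 1)))
```

For the node described in the question (8 CPUs, assumed to be hyper-threaded) with a single Solr core, this prints max_solr_concurrency_per_core: 4. With four Solr cores the division gives 1, so the minimum of 2 applies.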