Solr (JVM) peak every hour

Posted 2019-04-09 03:15

Question:

SOLVED

In our case the problem was the SuggestRequestHandler (requestHandler name="/suggest"): a facet limit (10) has now been set for it. In addition, the application was making several Solr requests for each single suggest request. Why this led to a peak only at the full hour is still not quite clear...
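
For reference, such a default on the handler in solrconfig.xml could look roughly like the sketch below; the actual defaults of our handler are not reproduced here, so treat everything except facet.limit as an assumption:

<requestHandler name="/suggest" class="solr.SearchHandler">
  <!-- hypothetical defaults; only facet.limit=10 reflects the fix described above -->
  <lst name="defaults">
    <str name="facet">true</str>
    <int name="facet.limit">10</int>
  </lst>
</requestHandler>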

Thank you all for the tips and help - I appreciate it!

Every full hour (12:00, 13:00, 14:00, ..., 20:00, 21:00, 22:00, 23:00) our Solr/Java process has a peak: the CPU usage of the Java process running Solr triples, and response times, which are normally a few milliseconds, go up to 9 seconds. This always lasts 2 to 3 minutes and only happens when there is traffic on our site (a PHP application calls the Java process). Crond was completely disabled, but the problem still occurs on every full hour. And I think we have tried basically every GC and memory combination (or maybe not?).

Does anyone have an idea why this happens? Here are some details:

  • System: 32 GB RAM, 24 cores (mostly shared with php-fpm, but the same problem occurs even with Solr isolated as a test)
  • Solr version 3.6 (on Jetty; temporarily also on Glassfish)
  • OS: RHEL 5.7
  • Multicore setup (4 indexes with 2 cores each)

Used Handlers (solrconfig.xml):

<requestHandler name="standard" class="solr.SearchHandler" default="true">
<requestHandler name="dismax" class="solr.SearchHandler" >
<requestHandler name="/suggest" class="solr.SearchHandler">
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
<requestHandler name="/analysis/document" class="solr.DocumentAnalysisRequestHandler" />
<requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler" />
<requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
<requestHandler name="/admin/ping" class="PingRequestHandler">
<requestHandler name="/debug/dump" class="solr.DumpRequestHandler" >
<requestHandler name="/replication" class="solr.ReplicationHandler" >

(also tested without replication and ping)

Used filters:

<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PortugueseMinimalStemFilterFactory"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.PortugueseMinimalStemFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory" />

Index size: ~100 MB (actually even a bit less)

Current Java Options:

JAVA_OPTS="-Xmx4096m -Xms4096m -XX:+UseGCOverheadLimit -XX:+UseConcMarkSweepGC -XX:+UseTLAB -XX:MaxPermSize=128m -XX:+DisableExplicitGC -Dsun.rmi.dgc.server.gcInterval=300000 -Dsun.rmi.dgc.client.gcInterval=300000 -XX:NewRatio=1 -Xloggc:/shop/logs/live/solr/gc.log -verbose:gc -XX:+PrintGCDateStamps"

The same options with heaps of 1024m, 2048m, 8192m and 12 GB didn't help at all.

Other try:

JAVA_OPTS="-server -Xmx2048m -XX:MaxPermSize=128m -XX:+UseParNewGC     -XX:+UseConcMarkSweepGC -XX:+UseTLAB -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing -XX:CMSIncrementalDutyCycleMin=0 -XX:CMSIncrementalDutyCycle=10 -XX:MaxTenuringThreshold=0 -XX:SurvivorRatio=256 -XX:CMSInitiatingOccupancyFraction=60 -XX:+DisableExplicitGC"

Other try:

JAVA_OPTS="-Xmx2048m -Xms2048m -XX:+UseGCOverheadLimit -XX:+UseConcMarkSweepGC -XX:+UseTLAB -XX:MaxPermSize=128m -XX:+DisableExplicitGC -Djava.util.logging.config.file=/opt/solr-jetty/etc/jetty-logging.properties"

Here is an excerpt of the gc.log (from one of these full-hour incidents):

2013-03-03T19:59:04.157-0300: 8087.754: [GC 3433559K->1788819K(3914560K), 0.0358190 secs]
2013-03-03T19:59:12.031-0300: 8095.628: [GC 3437075K->1792088K(3914560K), 0.0365830 secs]
2013-03-03T19:59:22.419-0300: 8106.016: [GC 3440344K->1803266K(3914560K), 0.0422040 secs]
2013-03-03T19:59:29.044-0300: 8112.641: [GC 3451522K->1815743K(3914560K), 0.0439870 secs]
2013-03-03T19:59:37.002-0300: 8120.599: [GC 3463999K->1821601K(3914560K), 0.0378990 secs]
2013-03-03T19:59:45.468-0300: 8129.065: [GC 3469857K->1822911K(3914560K), 0.0386720 secs]
2013-03-03T19:59:53.750-0300: 8137.347: [GC 3471167K->1829299K(3914560K), 0.0405040 secs]
2013-03-03T20:00:01.829-0300: 8145.426: [GC 3477555K->1832046K(3914560K), 0.0383070 secs]
2013-03-03T20:00:06.327-0300: 8149.924: [GC 3480302K->1831567K(3914560K), 0.0450550 secs]
2013-03-03T20:00:11.123-0300: 8154.719: [GC 3479823K->1843283K(3914560K), 0.0401710 secs]
2013-03-03T20:00:14.360-0300: 8157.957: [GC 3491539K->1854079K(3914560K), 0.0368560 secs]
2013-03-03T20:00:17.419-0300: 8161.015: [GC 3502335K->1855130K(3914560K), 0.0375530 secs]
2013-03-03T20:00:20.006-0300: 8163.603: [GC 3503386K->1861867K(3914560K), 0.0413470 secs]
2013-03-03T20:00:22.726-0300: 8166.323: [GC 3510123K->1870292K(3914560K), 0.0360600 secs]
2013-03-03T20:00:25.420-0300: 8169.017: [GC 3518548K->1872701K(3914560K), 0.0326970 secs]
2013-03-03T20:00:27.138-0300: 8170.735: [GC 3520957K->1873446K(3914560K), 0.0381430 secs]
2013-03-03T20:00:28.748-0300: 8172.345: [GC 3521702K->1889189K(3914560K), 0.0379160 secs]
2013-03-03T20:00:30.404-0300: 8174.001: [GC 3537445K->1887193K(3914560K), 0.0407670 secs]
2013-03-03T20:00:32.713-0300: 8176.309: [GC 3535449K->1892863K(3914560K), 0.0366880 secs]
2013-03-03T20:00:34.791-0300: 8178.388: [GC 3541119K->1899095K(3914560K), 0.0398270 secs]
2013-03-03T20:00:36.533-0300: 8180.129: [GC 3547351K->1910071K(3914560K), 0.0373960 secs]
2013-03-03T20:00:39.037-0300: 8182.634: [GC 3558327K->1904198K(3914560K), 0.0393020 secs]
2013-03-03T20:00:41.548-0300: 8185.144: [GC 3552454K->1912352K(3914560K), 0.0444060 secs]
2013-03-03T20:00:43.771-0300: 8187.368: [GC 3560608K->1919304K(3914560K), 0.0427220 secs]
2013-03-03T20:00:47.411-0300: 8191.008: [GC 3566354K->1918102K(3914560K), 0.0418150 secs]
2013-03-03T20:00:50.925-0300: 8194.522: [GC 3564290K->1930888K(3914560K), 0.0414700 secs]
2013-03-03T20:00:52.991-0300: 8196.588: [GC 3579144K->1933251K(3914560K), 0.0349600 secs]
2013-03-03T20:00:53.027-0300: 8196.624: [GC 1939697K(3914560K), 0.0256300 secs]
2013-03-03T20:00:54.208-0300: 8197.804: [GC 2780505K(3914560K), 0.1424860 secs]
2013-03-03T20:00:55.684-0300: 8199.281: [GC 3029503K->1389766K(3914560K), 0.0370380 secs]
2013-03-03T20:00:58.289-0300: 8201.886: [GC 2213458K->570843K(3914560K), 0.0413220 secs]
2013-03-03T20:01:00.672-0300: 8204.268: [GC 1962741K->319619K(3914560K), 0.0410840 secs]
2013-03-03T20:01:02.906-0300: 8206.503: [GC 1966833K->319605K(3914560K), 0.0453730 secs]
2013-03-03T20:01:06.861-0300: 8210.458: [GC 1967861K->330864K(3914560K), 0.0425570 secs]
2013-03-03T20:01:10.067-0300: 8213.664: [GC 1979120K->336541K(3914560K), 0.0479380 secs]
2013-03-03T20:01:12.587-0300: 8216.184: [GC 1984797K->343203K(3914560K), 0.0376810 secs]

Also, over roughly one day of logging there are only two entries greater than 1 second at all:

grep -oP ", [1-9]..*?secs]$" /shop/logs/live/solr/gc.log
, 1.1727270 secs]
, 1.0390840 secs]
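
To see whether GC pause time actually piles up around the full hour, the per-minute totals can be summed from the same log; this is a rough awk sketch that assumes the line format shown above (minute prefix in the first field, pause value as the second-to-last field):

awk '/secs\]$/ { sum[substr($1, 1, 16)] += $(NF-1) } END { for (m in sum) print m, sum[m], "secs total pause" }' /shop/logs/live/solr/gc.log | sort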

Someone any idea or already had this phenomenon with solr/jvm?

Answer 1:

Don't believe your GC logs unless you include -XX:+PrintGCApplicationStoppedTime in the options. Suspect them even then. There are pauses, and parts of pauses, that can be very long and go unreported unless you include this flag. E.g. I've seen pauses caused by the occasional long-running counted loop taking 15 seconds to reach a safepoint, where the GC log reported only the 0.08 second part of the pause where it actually did some work. There are also plenty of pauses whose causes are not considered part of "GC" and can thereby go unreported by the GC logging flags.
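
For example, stopped-time and safepoint reporting could be added next to the GC logging flags already in use; treat this as a suggestion for a HotSpot JVM of that generation, not a verified configuration:

# appended to the existing options; the exact flag set is a suggestion
JAVA_OPTS="$JAVA_OPTS -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"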

You can try adding jHiccup as an agent to report on observed pause/glitch/stall/hiccups rather than rely in the honesty of the JVM logs. If it shows multi-sec glitches then you'll know your JVM is pausing. If it shows smooth JVM operation, then you know to look at your other configuration parts.
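
For example, jHiccup can be attached as a javaagent without changing the application; the jar path below is hypothetical, and with no agent arguments it just records hiccup data with its default settings:

# path to jHiccup.jar is hypothetical - adjust to wherever it is installed
JAVA_OPTS="$JAVA_OPTS -javaagent:/opt/jHiccup/jHiccup.jar"

The resulting hiccup log can then be compared against the full-hour windows to see whether the JVM itself stalls there.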



Answer 2:

If the index size is just 100 MB and the issue is related to GC, I would start by:

  1. reducing -Xmx to less than 1024m; start at about 256m and see if it's enough (see the sketch after this list)
  2. not using any -XX options at the beginning
  3. using the latest JDK
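
A stripped-down starting point along those lines might look like this; the 256m heap is just the suggested experiment, and only the GC logging already in place is kept:

JAVA_OPTS="-Xms256m -Xmx256m -verbose:gc -Xloggc:/shop/logs/live/solr/gc.log"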