Analyzing Connection Closed Exception in Spring/JP

2019-01-23 06:00发布

问题:

PROBLEM

I have recently been put in charge of a Java web application with code already written and in place. The app receives moderately high traffic and has peak hours of traffic between 11am to 3pm everyday. The application uses Spring, JPA(Hibernate), MYSQL DB. Spring has been configured to use tomcat jdbc connection pool to make connections to the DB. (Details of configuration at the end of the post)

For the past few days, during the application's peak load hours, the application has been going down due to tomcat going unresponsive to requests. It has required tomcat to be restarted multiple times.

Going through the tomcat catalina.out logs, I noticed a whole lot of

Caused by: java.sql.SQLException: Connection has already been closed.
    at org.apache.tomcat.jdbc.pool.ProxyConnection.invoke(ProxyConnection.java:117)
    at org.apache.tomcat.jdbc.pool.JdbcInterceptor.invoke(JdbcInterceptor.java:109)
    at org.apache.tomcat.jdbc.pool.DisposableConnectionFacade.invoke(DisposableConnectionFacade.java:80)
    at com.sun.proxy.$Proxy28.prepareStatement(Unknown Source)
    at org.hibernate.jdbc.AbstractBatcher.getPreparedStatement(AbstractBatcher.java:505)
    at org.hibernate.jdbc.AbstractBatcher.getPreparedStatement(AbstractBatcher.java:423)
    at org.hibernate.jdbc.AbstractBatcher.prepareQueryStatement(AbstractBatcher.java:139)
    at org.hibernate.loader.Loader.prepareQueryStatement(Loader.java:1547)
    at org.hibernate.loader.Loader.doQuery(Loader.java:673)
    at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236)
    at org.hibernate.loader.Loader.loadCollection(Loader.java:1994)
    ... 115 more

These appear frequently just before the crash.

Going further earlier before these exceptions, I noticed a whole lot of Connections being abandoned just before the Connection Closed exceptions.

WARNING: Connection has been abandoned PooledConnection[com.mysql.jdbc.Connection@543c2ab5]:java.lang.Exception
    at org.apache.tomcat.jdbc.pool.ConnectionPool.getThreadDump(ConnectionPool.java:1065)
    at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:782)
    at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:618)
    at org.apache.tomcat.jdbc.pool.ConnectionPool.getConnection(ConnectionPool.java:188)
    at org.apache.tomcat.jdbc.pool.DataSourceProxy.getConnection(DataSourceProxy.java:128)
    at org.hibernate.ejb.connection.InjectedDataSourceConnectionProvider.getConnection(InjectedDataSourceConnectionProvider.java:47)
    at org.hibernate.jdbc.ConnectionManager.openConnection(ConnectionManager.java:423)
    at org.hibernate.jdbc.ConnectionManager.getConnection(ConnectionManager.java:144)
    at org.hibernate.jdbc.AbstractBatcher.prepareQueryStatement(AbstractBatcher.java:139)

These seem to appear frequently just before the Connection Closed exceptions. And these seem to be the first symptoms of impending doom in the logs.

ANALYSIS

Going by the logs, I set out to see if there is any connection pool configuration/mysql configuration that might be causing the issue. Went through a couple of excellent articles that show tuning of the pool for Production environment. Links 1 & 2

Going by these articles, I noticed that:

  1. The below line in JHanik's article (link 1) mentions this

    Setting the value of abandonWhenPercentageFull to 100 would mean that connections are not > considered abandoned unless we've reached our maxActive limit.

    I figured this might be important in my case, because I see a lot of connections being abandoned.

  2. My max_connections setting does not match what is being recommended (in link 2)

    mysql max_connections should be equal to max_active+max_idle

WHAT I TRIED

So, as per the recommendations from the articles, I did the following two things:

  1. Changed abandonWhenPercentageFull to 100
  2. In my MYSQL server, max_connections was set as 500. Increased it to 600 In my connection pool settings, max_active was 200 and max_idle was 50. Changed it to max_active=350, max_idle = 250

THIS DID NOT HELP

The next day, the following observations were made during peak hours:

  1. Tomcat did not go down. The app stayed up throughout peak hours. However the performance went from bad to worse and then the app was barely usable even though it did not really go down.
  2. DB Connection pool, though increased in size, got completely utilized, and I could see 350 active connections to the DB at one point.

FINALLY, MY QUESTION:

It clearly looks like there are issues with the way DB connections are being made from the app server. So I have two directions to take this analysis forward.

My question is which of these should I be taking?

1. The issue is not with the connection pool settings. The code is what is causing the issue

There might be places in the code where DB connections are not being closed. Which is causing the high number of connections being open.

The code uses a GenericDao which is extended in every Dao class. The GenericDao uses Spring's JpaTemplate to fetch a EntityManager instance which in turn is used for all DB operations. My understanding is using the JpaTemplate handles the nitty gritty of closing DB connections internally.

So, where exactly should I looking for possible connection leaks?

2. The issue is with the connections pool/mysql config parameters. However, the optimizations I have put in need to be tuned further

If yes, which parameters should I be looking at? Should I be collecting some data to use to determine more appropriate values for my connection pool. (For eg, for max_active, max_idle, max_connections)


Addendum: Complete Connection Pool configuration

   <bean id="dataSource" class="org.apache.tomcat.jdbc.pool.DataSource" destroy-method="close">
        <property name="driverClassName" value="com.mysql.jdbc.Driver" />
        <property name="url" value="jdbc:mysql://xx.xx.xx.xx" />
        <property name="username" value="xxxx" />
        <property name="password" value="xxxx" />
        <property name="initialSize" value="10" />
        <property name="maxActive" value="350" />
        <property name="maxIdle" value="250" />
        <property name="minIdle" value="90" />
        <property name="timeBetweenEvictionRunsMillis" value="30000" />
        <property name="removeAbandoned" value="true" />
        <property name="removeAbandonedTimeout" value="60" />
        <property name="abandonWhenPercentageFull" value="100" />
        <property name="testOnBorrow" value="true" />
        <property name="validationQuery" value="SELECT 1" />
        <property name="validationInterval" value="30000" />
        <property name="logAbandoned" value="true" />
        <property name="jmxEnabled" value="true" />
    </bean>

回答1:

This is woefully late for the OP, but maybe it will help someone else in the future:

I ran into something similar to this in a production environment with long-running batch jobs. The problem is if your code needs a connection longer than the time specified by property:

name="removeAbandonedTimeout" value="60

and you have enabled:

<property name="removeAbandoned" value="true" />

then it will get disconnected during the processing after 60 seconds. One possible workaround (that didn't work for me) is to enable the interceptor:

jdbcInterceptors="ResetAbandonedTimer"

This will reset the abandoned timer for that connection for every read/write that occurs. Unfortunately in my case, the processing would sometimes still take longer than the timeout before anything was read/written to the database. So I was forced to either bump the timeout length, or disable the removeAbandonded (I chose the former solution).

Hope this helps someone else if they run into something similar!



回答2:

I was recently asked to investigate why production system sometimes goes down. I wanted to share my findings since it involves a correlation of events to take a JVM tomcat app with JDBC issues as outlined above to actually crash the app. This is using mysql as a backend so probably most useful for this scenario but if issue hit on another platform cause likely to be the same.

By simply getting connection closed does not imply the application is broken

This is under a grails application but will be relative to all JVM related apps:

tomcat/context.xml db configuration, notice very small db pool and removeAbandonedTimeout="10" ye right we want things to break

<Resource
 name="jdbc/TestDB"  auth="Container" type="javax.sql.DataSource"
              driverClassName="com.mysql.jdbc.Driver"
              url="jdbc:mysql://127.0.0.1:3306/test"
              username="XXXX"
              password="XXXX"
              testOnBorrow="true"
              testWhileIdle="true"
              testOnReturn="true"
              factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
              removeAbandoned="true"
              logAbandoned="true"
              removeAbandonedTimeout="10"
              maxWait="5000"
              initialSize="1"
              maxActive="2"
              maxIdle="2"
              minIdle="2"
              validationQuery="Select 1" />

A quartz job that runs every minute, not that it matters the app I think dies on first attempt:

class Test2Job {
    static  triggers = {
               cron name: 'test2', cronExpression: "0 0/1 * * * ?"
        }
        def testerService
        def execute() {
        println "starting job2 ${new Date()}"
        testerService.basicTest3()

    }

}

Now our testService with comments so please follow comments:

def dataSource

  /**
   * When using this method in quartz all the jdbc settings appear to get ignored
   * the job actually completes notice huge sleep times compared to basicTest
   * strange and very different behaviour.
   * If I add Tester t = Tester.get(1L) and then execute below query I will get
   * connection pool closed error
   * @return
   */
  def basicTest2() {
      int i=1
      while (i<21) {
          def sql = new Sql(dataSource)
          def query="""select id as id  from tester t
                  where id=:id"""
          def instanceList = sql.rows(query,[id:i as Long],[timeout:90])
          sleep(11000)
          println "-- working on ${i}"
          def sql1 = new Sql(dataSource)
          sql1.executeUpdate(
                  "update tester t set t.name=? where t.id=?",
                  ['aa '+i.toString()+' aa', i as Long])

          i++
          sleep(11000)
      }
      println "run ${i} completed"
  }


  /**
   * This is described in above oddity
   * so if this method is called instead you will see connection closed issues
   */
  def basicTest3() {
      int i=1
      while (i<21) {
          def t = Tester.get(i)
          println "--->>>> test3 t ${t.id}"

          /**
           * APP CRASHER - This is vital and most important
           * Without this declared lots of closed connections and app is working
           * absolutely fine,
           * The test was originally based on execRun() which returns 6650 records or something
           * This test query is returned in time and does not appear to crash app
           *
           * The moment this method is called and please check what it is currently doing. It is simply
           * running a huge query which go beyond the time out values and as explained in previous emails MYSQL states
           *
           * The app is then non responsive and logs clearly show application is broke 
           */
          execRun2()


          def sql1 = new Sql(dataSource)
          sleep(10000)
          sql1.executeUpdate("update tester t set t.name=? where t.id=?",['aa '+i.toString()+' aa', t.id])
          sleep(10000)
          i++
      }

  }


  def execRun2() {
      def query="""select new map (t as tester) from Tester t left join t.children c
left join t.children c
                  left join c.childrena childrena
                  left join childrena.childrenb childrenb
                  left join childrenb.childrenc childrenc , Tester t2 left join t2.children c2 left join t2.children c2
                  left join c2.childrena children2a
                  left join children2a.childrenb children2b
                  left join children2b.childrenc children2c
             where ((c.name like (:name) or
                  childrena.name like (:name) or
                  childrenb.name like (:name) or (childrenc is null or childrenc.name like (:name))) or
                  (
                  c2.name like (:name) or
                  children2a.name like (:name) or
                  children2b.name like (:name) or (children2c is null or children2c.name like (:name))
      ))

          """
      //println "query $query"
      def results = Tester.executeQuery(query,[name:'aa'+'%'],[timeout:90])
      println "Records: ${results.size()}"

      return results
  }


  /**
   * This is no different to basicTest2 and yet
   * this throws a connection closed error and notice it is 20 not 20000
   * quite instantly a connection closed error is thrown when a .get is used vs
   * sql = new Sql(..) is a manuall connection
   *
   */
  def basicTest() {
      int i=1
      while (i<21) {
          def t = Tester.get(i)
          println "--- t ${t.id}"
          sleep(20)
          //println "publishing event ${event}"
          //new Thread({
          //    def event=new PurchaseOrderPaymentEvent(t,t.id)
          //    publishEvent(event)
          //} as Runnable ).start()

          i++
      }
  }

It is only when the query then takes longer than expected time but there has to be another element, the query itself then has to hit sit on MYSQL even though it is killed. MYSQL is eating it away processing it.

I think what is going on is

job 1 - hits app -> hits mysql ->    (9/10 left)
         {timeout} -> app killed  -> mysql running (9/10)
 job 2 - hits app -> hits mysql ->    (8/10 left)
         {timeout} -> app killed  -> mysql running (8/10) 
.....
 job 10 - hits app -> hits mysql ->    (10/10 left)
         {timeout} -> app killed  -> mysql running (10/10)
 job 11 - hits app -> 

If by this time job1 has not completed then we have nothing left in the pool well app is simply broke now.. jdbc errors thrown etc.. never mind if it completes after the crash..

You can monitor what is going on by checking mysql It appeared to run for a longer period which goes against what they have suggested this value should be doing, but then again maybe this isn’t really based on any of this and relates to a problem elsewhere.

Whilst testing noticed there were two states: Sending data / Sending to client:

|  92 | root | localhost:58462 | test | Query   |   80 | Sending data      | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test |
|  95 | root | localhost:58468 | test | Query   |  207 | Sending to client | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test |
|  96 | root | localhost:58470 | test | Query   |  147 | Sending data      | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test |
|  97 | root | localhost:58472 | test | Query   |  267 | Sending data      | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test |
|  98 | root | localhost:58474 | test | Sleep   |   18 |                   | NULL                                                                                                 |
|  99 | root | localhost:58476 | test | Query   |  384 | Sending to client | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test |
| 100 | root | localhost:58478 | test | Query   |  327 | Sending data      | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test |

Sseconds later:

|  91 | root | localhost:58460 | test | Query   |   67 | Sending to client | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test |
|  92 | root | localhost:58462 | test | Query   |  148 | Sending to client | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test |
|  97 | root | localhost:58472 | test | Query   |  335 | Sending to client | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test | |
| 100 | root | localhost:58478 | test | Query   |  395 | Sending to client | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test |

Seconds after that: (all dead)
|  58 | root | localhost       | NULL | Query   |    0 | starting | show processlist |
|  93 | root | localhost:58464 | test | Sleep   |  167 |          | NULL             |
|  94 | root | localhost:58466 | test | Sleep   |  238 |          | NULL             |
|  98 | root | localhost:58474 | test | Sleep   |   74 |          | NULL             |
| 101 | root | localhost:58498 | test | Sleep   |   52 |          | NULL             |

It may be that a script needs to be created to monitor the process list and maybe a deeper result set containing exact queries running to work out which of your queries events is killing your app



回答3:

The code uses a GenericDao which is extended in every Dao class. The GenericDao uses Spring's JpaTemplate to fetch a EntityManager instance which in turn is used for all DB operations. My understanding is using the JpaTemplate handles the nitty gritty of closing DB connections internally.

This is probably the root of your problem, you shouldn't use the JpaTemplate to get the EntityManager this will give you un unmanaged Entitymanager. In fact you shouldn't be using JpaTemplate at all.

It is recommended to write daos based on the plain EntityManager API and simply inject the EntityManager as you normally would do (with @PersistenceContext).

If you really want to use the JpaTemplate use the execute method and pass in a JpaCallback which will give you a managed EntityManager.

Also make sure that you have setup transactions correctly without proper tx setup connections will not be closed as spring doesn't know that it should close the connection.