Jenkins Windows slave connection getting terminate

2019-04-29 19:51发布

问题:

While connecting to windows machine as slave, i am getting following error i think its some network related issue, but need some help where to start looking or what is a possible solution for this.

INFO: Terminated
Aug 01, 2017 10:15:54 PM hudson.remoting.JarCacheSupport$1 run
WARNING: Failed to resolve a jar 06bcb4519543f5ec83cf9d6da9f6cfbe
java.io.IOException: Failed to write to C:\Users\Administrator\.jenkins\cache\jars\06\BCB4519543F5EC83CF9D6DA9F6CFBE.jar
        at hudson.remoting.FileSystemJarCache.retrieve(FileSystemJarCache.java:133)
        at hudson.remoting.JarCacheSupport$1.run(JarCacheSupport.java:64)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:483)
        at java.util.concurrent.FutureTask.run(FutureTask.java:274)
        at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:110)
        at java.lang.Thread.run(Thread.java:809)
Caused by: java.io.IOException: Backing channel 'JNLP4-connect connection to dr2r4m1p21/172.20.238.41:9001' is disconnected.
        at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:192)
        at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:257)
        at com.sun.proxy.$Proxy4.writeJarTo(Unknown Source)
        at hudson.remoting.FileSystemJarCache.retrieve(FileSystemJarCache.java:98)
        ... 5 more
Caused by: java.nio.channels.ClosedChannelException
        at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208)
        at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
        at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
        at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:166)
        at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
        at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
        at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$1500(BIONetworkLayer.java:48)
        at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:247)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1157)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:627)
        at hudson.remoting.Engine$1$1.run(Engine.java:94)
        ... 1 more

Above mentioned stack trace is from salve (Windows) machine and my Jenkins/Master is running on RHEL, i am able to see following stacktrace there.

INFO: Accepted JNLP4-connect connection #113 from /172.20.238.31:60363
Aug 01, 2017 12:45:55 PM jenkins.slaves.DefaultJnlpSlaveReceiver channelClosed
WARNING: Computer.threadPoolForRemoting [#42] for Build_Agent terminated
java.nio.channels.ClosedChannelException
        at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208)
        at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
        at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
        at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
        at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800)
        at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
        at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:311)
        at hudson.remoting.Channel.close(Channel.java:1295)
        at hudson.remoting.Channel.close(Channel.java:1263)
        at jenkins.slaves.DefaultJnlpSlaveReceiver.afterChannel(DefaultJnlpSlaveReceiver.java:173)
        at org.jenkinsci.remoting.engine.JnlpConnectionState$4.invoke(JnlpConnectionState.java:421)
        at org.jenkinsci.remoting.engine.JnlpConnectionState.fire(JnlpConnectionState.java:312)
        at org.jenkinsci.remoting.engine.JnlpConnectionState.fireAfterChannel(JnlpConnectionState.java:418)
        at org.jenkinsci.remoting.engine.JnlpProtocol4Handler$Handler$1.run(JnlpProtocol4Handler.java:334)
        at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

回答1:

  • I observed the same error after our jenkins master was updated. It is likely due to incompatibility between Java 7 (v80) and latest Java 8.
  • Check the java version being used by your master, and the java version of your slave.
  • In my case, I am running swarm-client-2.0-jar-with-dependencies.jar on a linux host, and it was using Java 7.

    java version "1.7.0_80" Java(TM) SE Runtime Environment (build 1.7.0_80-b15) Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)

  • Our jenkins master was upgraded and is now running Java 8

    java version "1.8.0_121" Java(TM) SE Runtime Environment (build 1.8.0_121-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

  • When the java on the slave was updated to Java 8, the connection issues disappeared.


回答2:

I was experiencing a similar error as the OP where the connection to my slave was dropping. The root cause of the issue was not due to a mismatch in Java versions between Jenkins slave and master hosts.

Solution If you are running Jenkins in an EC2 instance on AWS behind an Elastic Load Balancer (ELB), increase the "idle timeout" value under the "attributes" section from the default 60 seconds. I set the new value to 600 and no longer experienced the error.

It appears that if a single command in your build process takes greater than 60 seconds with no log output, the ELB will terminate the session due to idle activity.

Source: https://issues.jenkins-ci.org/browse/JENKINS-44001?focusedCommentId=312412&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-312412



回答3:

I experienced the same issue. I found out that the windows slave switched to a "sleep" mode specially if your jobs are not running against a GUI.

  • For windows... no move of the mouse or keyboard means no activity.

Then to successfully solve it. On a Windows7 slave, here is what I did:

  • Control Panel\Hardware and Sound\Power Options
  • Show additionnal plans
  • select High performance

  • Control Panel\Hardware and Sound\Power Options\Edit Plan Settings

  • turn off display never
  • Change advanced power settings -->turn off hard disk after 10000 min

Should be ok after this procedure



回答4:

in addition to the error log in the post, I got also the error log under the jenkins directory in the slave (for me it was C:\jenkins\jenkins-slave.err.log):

JNLP file http://jenkins.domain.com/computer/my_slave_name/slave-agent.jnlp?encrypt=true has invalid arguments: [#####################################, my_slave_name, -workDir, c:\jenkins, -internalDir, remoting, -url, http://jenkins.domain.com/, -headless, -jar-cache, C:\Users\Administrator.jenkins\cache\jars] Most likely a configuration error in the master "-workDir" is not a valid option

my solution:

1)windows slave level: close the services console in the GUI for all users - this is must. from some reason Microsoft is locking installation/removal of windows services

2)windows slave level: kill all java and jenkins-slave processes (if exist)

3)windows slave level: delete the jenkins slave service (if exist) from cmd: sc delete jenkinsslave-c__jenkins /force (in my case)

4)windows slave level: verify that you have java 8 installed: i'm using jdk1.8.0_151 . uninstall all old java version

5)jenkins master ui level: Change the way the Jenkins is connect to the slave under slave configure --> Launch method: Let Jenkins control this Windows slave as a Windows service (instead of Launch agent via Java Web Start)

6) aws level: Increase the aws elb Idle timeout to 600 (from 60) - like @njtman suggested

7)jenkins master ui level: relaunch the agent in jenkins and wait several minutes.

my environment:

jenkins: 2.89.2 , os: windows 2012 R2, java: jdk1.8.0_151



回答5:

No time to breath for virtual slave...

ok, here how I've solved my special case:

I had some VM's with libvirt/quemu running as slaves. Because the libvirt-plugin was to unreliable for me I've started those VM's on my own. I asked my self: "Why this libvirt-plugin had a mandatory delay time... Impatience...

So if the libvirt-client (slave) is saying hello to jenkins you should probably wait some secs to let this poor guy breath a bit. After starting up.

The slave was a win7 the host a ubuntu 18.04



回答6:

On Windows, I recognized that I needed to add the "-noCertificateCheck" attribute to the arguments of the jenkins-slave.xml in the workdir. We use a cert from a internal PKI on the master and this was the easiest way to work around it (having everything in the internal network).

<arguments>-Xrs  -jar "%BASE%\slave.jar" -jnlpUrl https://jenkins.ourdomain.com/computer/Windows%20build%20server%20-%20Bare%20metal/slave-agent.jnlp -secret abc -noCertificateCheck</arguments>

I recognized this by manually running the agent from the command prompt:

java -jar agent.jar -jnlpUrl https://jenkins.ourdomain.com/computer/Windows%20build%20server%20-%20Bare%20metal/slave-agent.jnlp -secret abc -workDir "D:\agentroot" -noCertificateCheck


回答7:

Well... for me it worked the following solution:

mark the node "temporary offline" and put it back "online" again

reconnect



回答8:

I was facing the same issue , rectified using below steps

  1. Go to Jenkins Url -> Manage Jenkins -> Node -> Your Node ..you will get Custom WorkDir path lets say C:/Jenkins
  2. Open WorkDir Path and delete complete remoting directory
  3. Re-launch the slave