Kubernetes: (1.17) Kubelet won't reconnect to Apiserver after NIC failure (use of closed network connection)

Created on 28 Jan 2020 · 123 comments · Source: kubernetes/kubernetes

We've just upgraded our production cluster to 1.17.2.

Since the update on Saturday, we've had this strange outage: after a NIC bond failure (that recovers shortly after), kubelet has all of its connections broken and won't retry to re-establish them unless it is manually restarted.

Here is the timeline of the last time it occurred:

01:31:16: The kernel detects a failure on the bond interface. It goes on for a while. Eventually it recovers.

Jan 28 01:31:16 baremetal044 kernel: bond-mngmt: link status definitely down for interface eno1, disabling it
...
Jan 28 01:31:37 baremetal044  systemd-networkd[1702]: bond-mngmt: Lost carrier
Jan 28 01:31:37 baremetal044  systemd-networkd[1702]: bond-mngmt: Gained carrier
Jan 28 01:31:37 baremetal044  systemd-networkd[1702]: bond-mngmt: Configured

As expected, all watches are closed. Message is the same for them all:

...
Jan 28 01:31:44 baremetal044 kubelet-wrapper[2039]: W0128 04:31:44.352736    2039 reflector.go:326] object-"namespace"/"default-token-fjzcz": watch of *v1.Secret ended with: very short watch: object-"namespace"/"default-token-fjzcz": Unexpected watch close - watch lasted less than a second and no items received
...

So these messages begin:

`Jan 28 01:31:44 baremetal44 kubelet-wrapper[2039]: E0128 04:31:44.361582 2039 desired_state_of_world_populator.go:320] Error processing volume "disco-arquivo" for pod "pod-bb8854ddb-xkwm9_namespace(8151bfdc-ec91-48d4-9170-383f5070933f)": error processing PVC namespace/disco-arquivo: failed to fetch PVC from API server: Get https://apiserver:443/api/v1/namespaces/namespace/persistentvolumeclaims/disco-arquivo: write tcp baremetal44.ip:42518->10.79.32.131:443: use of closed network connection`

Which I'm guessing shouldn't be a problem for a while, but it never recovers. The event happened at 01:31 AM, and we had to manually restart kubelet around 09:00 to get things back to normal.

# journalctl --since '2020-01-28 01:31'   | fgrep 'use of closed' | cut -f3 -d' ' | cut -f1 -d1 -d':' | sort | uniq -dc
   9757 01
  20663 02
  20622 03
  20651 04
  20664 05
  20666 06
  20664 07
  20661 08
  16655 09
      3 10

Apiservers were up and running, all other nodes were up and running, everything else pretty uneventful. This one was the only one affected (today) by this problem.

Is there any way to mitigate this kind of event?

Would this be a bug?

kind/support sig/api-machinery sig/node

Most helpful comment

I've fixed it by running this bash script every 5 minutes:

#!/bin/bash
output=$(journalctl -u kubelet -n 1 | grep "use of closed network connection")
if [[ $? != 0 ]]; then
  echo "Error not found in logs"
elif [[ $output ]]; then
  echo "Restart kubelet"
  systemctl restart kubelet
fi

All 123 comments

/sig node
/sig api-machinery

Taking a look at the code, the error happens here.

The explanation is that the code assumes it is probably an EOF (IsProbableEOF), while in this case it doesn't seem to be.
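For illustration, here is a simplified re-implementation (a sketch only, not the upstream apimachinery function) of the kind of classification IsProbableEOF performs; errors containing this string are treated as a benign end of the watch stream rather than surfaced as a hard failure:

package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// isProbableEOF is a simplified stand-in for the apimachinery helper discussed
// above: a few well-known error strings are classified as "the stream probably
// just ended", so callers log them quietly instead of treating them as errors.
func isProbableEOF(err error) bool {
	if errors.Is(err, io.EOF) {
		return true
	}
	msg := strings.ToLower(err.Error())
	return strings.Contains(msg, "connection reset by peer") ||
		strings.Contains(msg, "use of closed network connection")
}

func main() {
	err := fmt.Errorf("write tcp 10.0.0.4:42518->10.79.32.131:443: use of closed network connection")
	fmt.Println(isProbableEOF(err)) // true: treated as a short/closed watch, not a hard error
}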

/assign @caesarxuchao

@rikatz can you elaborate on how you tracked down the code you pasted?

My thought is that the reflector would have restarted the watch no matter how it handles the error (code), so it doesn't explain the failure to recover.

Exactly @caesarxuchao so this is our question.

I tracked the error down basically by grepping for it through the code and cross-referencing it with what kubelet was doing at the time (watching secrets) to get to that part.

Not an advanced method, though this seems to be the exact point where the error is produced.

The question is: since the connection is closed, is something flagging this as the watch's EOF instead of treating it as an error?

I don't have anything smarter to add other than that we had another node fail the same way, bringing the occurrences over the last 4 days to 4.

Will try to map whether bond disconnect events are happening on other nodes and whether kubelet is recovering - it might just be bad luck on some recoveries, and not a 100% event.

I think we are seeing this too, but we do not have bonds, we only see these networkd "carrier lost" messages for Calico cali* interfaces, and they are local veth devices.

I have encountered this as well, with no bonds involved. Restarting the node fixes the problem, but just restarting the Kubelet service does not (all API calls fail with "Unauthorized").

I have encountered this as well, with no bonds involved. Restarting the node fixes the problem, but just restarting the Kubelet service does not (all API calls fail with "Unauthorized").

Update: restarting Kubelet did fix the problem after enough time (1 hour?) was allowed to pass.

I am seeing this same behavior. Ubuntu 18.04.3 LTS clean installs. Cluster built with Rancher 2.3.4. I have seen this happen periodically lately and just restarting kubelet tends to fix it for me. Last night all 3 of my worker nodes exhibited this same behavior. I corrected 2 to bring my cluster up. The third is still in this state while I'm digging around.

We are seeing the same issue on CentOS 7, on a cluster freshly built with Rancher (1.17.2). We are using Weave. All 3 worker nodes are showing this issue. Restarting kubelet does not work for us; we have to restart the entire node.

/sig node
/sig api-machinery

Taking a look at the code, the error happens here.

The explanation is that the code assumes it is probably an EOF (IsProbableEOF), while in this case it doesn't seem to be.

We are also seeing the same issue. From the log, we found that after the problem occurred all subsequent requests were still sent on the same connection. It seems that although the client resends the request to the apiserver, the underlying http2 library still maintains the old connection, so all subsequent requests are still sent on it and receive the same use of closed network connection error.

So the question is why http2 still maintains an already closed connection. Maybe the connection it maintains is indeed alive, but some intermediate connection was closed unexpectedly?
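A small diagnostic sketch (not from this thread) that can confirm the behavior described above: net/http/httptrace reports which pooled connection each request goes out on, so repeated requests printing reused=true with the same local address indicate one cached http2 connection being reused for everything. The URL is a placeholder.

package main

import (
	"fmt"
	"net/http"
	"net/http/httptrace"
)

func main() {
	client := &http.Client{}
	for i := 0; i < 3; i++ {
		// GotConn fires once per request and tells us which connection was picked
		// from the transport's pool.
		trace := &httptrace.ClientTrace{
			GotConn: func(info httptrace.GotConnInfo) {
				fmt.Printf("reused=%v local=%v remote=%v\n",
					info.Reused, info.Conn.LocalAddr(), info.Conn.RemoteAddr())
			},
		}
		req, err := http.NewRequest("GET", "https://apiserver.example:6443/healthz", nil)
		if err != nil {
			panic(err)
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
		resp, err := client.Do(req)
		if err != nil {
			fmt.Println("request error:", err)
			continue
		}
		resp.Body.Close()
	}
}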

I have the same issue with a Raspberry Pi cluster on k8s 1.17.3, very often. Based on some older issues, I set the kube API server http2 stream limit to 1000 ("- --http2-max-streams-per-connection=1000"); it was fine for more than 2 weeks, but after that it is now starting again.

Is it possible to rebuild kube-apiserver https://github.com/kubernetes/apiserver/blob/b214a49983bcd70ced138bd2717f78c0cff351b2/pkg/server/secure_serving.go#L50
setting s.DisableHTTP2 to true by default?
Is there a Dockerfile for the official image (k8s.gcr.io/kube-apiserver:v1.17.3)?

Same here (Ubuntu 18.04, Kubernetes 1.17.3).

We also observed this in two of our clusters. Not entirely sure about the root cause, but at least we were able to see this happen in clusters with very high watch counts. I was not able to reproduce it by forcing a high number of watches per kubelet, though (I started pods with 300 secrets per pod, which also resulted in 300 watches per pod in the Prometheus metrics). Setting very low http2-max-streams-per-connection values did not trigger the issue either, but at least I was able to observe some unexpected scheduler and controller-manager behavior (which might have just been overload after endless re-watch loops or something like that).

As a workaround, all of my nodes restart kubelet every night via a local cronjob. After 10 days I can say it works for me; I have no more "use of closed network connection" on my nodes.

@sbiermann
Thank you for posting this. What time interval do you use for the cronjob?

24 hours

I can also confirm this issue, we are not yet on 1.17.3, currently running Ubuntu 19.10:

Linux <STRIPPED>-kube-node02 5.3.0-29-generic #31-Ubuntu SMP Fri Jan 17 17:27:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

NAME                  STATUS   ROLES    AGE   VERSION       INTERNAL-IP   EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION     CONTAINER-RUNTIME
STRIPPED-kube-node02   Ready    <none>   43d   v1.16.6   10.6.0.12     <none>        Ubuntu 19.10   5.3.0-29-generic   docker://19.3.3

I can also confirm this on Kubernetes 1.17.4 deployed through Rancher 2.3.5 on RancherOS 1.5.5 nodes. Restarting the kubelet seems to work for me, I don't have to restart the whole node.

The underlying cause for me seems to be RAM getting close to running out and kswapd0 getting up to 100% CPU usage due to that, since I forgot to set the swappiness to 0 for my Kubernetes nodes. After setting the swappiness to 0 and adding some RAM to the machines, the issue hasn't reoccurred for me yet.

If the underlying issue was "http2 using dead connections", then restarting kubelet should fix the problem. https://github.com/kubernetes/kubernetes/pull/48670 suggested reducing TCP_USER_TIMEOUT can mitigate the problem. I have opened https://github.com/golang/net/pull/55 to add client-side connection health check to the http2 library, but it's going to take more time to land.
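For reference, a Linux-only sketch of the TCP_USER_TIMEOUT idea mentioned above, applied through a dialer control function; the 30-second value is an assumption for illustration, not the value used in the referenced PR:

package main

import (
	"net"
	"net/http"
	"syscall"
	"time"

	"golang.org/x/sys/unix"
)

func main() {
	dialer := &net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				// TCP_USER_TIMEOUT is in milliseconds: unacknowledged writes older
				// than this cause the kernel to error out the connection.
				sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_USER_TIMEOUT, 30000)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
	client := &http.Client{Transport: &http.Transport{DialContext: dialer.DialContext}}
	_ = client // use this client so stalled connections fail within the timeout
}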

If restarting kubelet didn't solve the issue, then probably it's a different root cause.

I have the same issue with v1.17.2 after a network restart, but only one node has this issue (my cluster has five nodes); I can't reproduce it. Restarting kubelet solved the problem.

How can I avoid this issue? Upgrade to the latest version, or is there another way to fix it?

I've fixed it by running this bash script every 5 minutes:

#!/bin/bash
output=$(journalctl -u kubelet -n 1 | grep "use of closed network connection")
if [[ $? != 0 ]]; then
  echo "Error not found in logs"
elif [[ $output ]]; then
  echo "Restart kubelet"
  systemctl restart kubelet
fi

I've created a patch that avoids restarting kubelet, and it seems the problem is resolved.
deadline patch

diff --git a/staging/src/k8s.io/client-go/transport/cache.go b/staging/src/k8s.io/client-go/transport/cache.go
index 7c40848c79f..bd61b39551a 100644
--- a/staging/src/k8s.io/client-go/transport/cache.go
+++ b/staging/src/k8s.io/client-go/transport/cache.go
@@ -38,6 +38,8 @@ const idleConnsPerHost = 25

 var tlsCache = &tlsTransportCache{transports: make(map[tlsCacheKey]*http.Transport)}

+type dialFunc func(network, addr string) (net.Conn, error)
+
 type tlsCacheKey struct {
        insecure   bool
        caData     string
@@ -92,7 +94,7 @@ func (c *tlsTransportCache) get(config *Config) (http.RoundTripper, error) {
                TLSHandshakeTimeout: 10 * time.Second,
                TLSClientConfig:     tlsConfig,
                MaxIdleConnsPerHost: idleConnsPerHost,
-               Dial:                dial,
+               Dial:                setReadDeadlineAfterDial(dial, 30*time.Second),
        })
        return c.transports[key], nil
 }
@@ -111,3 +113,18 @@ func tlsConfigKey(c *Config) (tlsCacheKey, error) {
                serverName: c.TLS.ServerName,
        }, nil
 }
+
+func setReadDeadlineAfterDial(dialer dialFunc, timeout time.Duration) dialFunc {
+       return func(network, addr string) (net.Conn, error) {
+               c, err := dialer(network, addr)
+               if err != nil {
+                       return nil, err
+               }
+
+               if err := c.SetReadDeadline(time.Now().Add(timeout)); err != nil {
+                       return nil, err
+               }
+
+               return c, nil
+       }
+}

@mYmNeo Can you please explain how to rebuild the client-go?

@mYmNeo Can you please explain how to rebuild the client-go?

@ik9999 Apply this patch, then rebuild kubelet and replace the binary.

@mYmNeo How can I reproduce this issue and test this?

I've fixed it by running this bash script every 5 minutes

@ik9999 Thanks, it works.

cc @liggitt

does setting SetReadDeadline mean all watches will close every 30 seconds?

does setting SetReadDeadline mean all watches will close every 30 seconds?

Yes. It's an ugly way to resolve this problem (force-closing the connection).

Just another case:

We are seeing this in Kube 1.16.8 clusters as well. Rebooting the VM can be used to bring the node back to a good state (I suspect a kubelet restart would have worked as well).

In our setup, kubelet talks to a local haproxy instance over localhost, which acts as a TCP load balancer to the multiple backend master instances. We are going to investigate whether adding

option clitcpka    # enables keep-alive only on client side
option srvtcpka    # enables keep-alive only on server side

to our load balancer instances helps alleviate the need for the explicit reboot and leads to a full recovery. Example of the repeated logs:

Apr  8 00:04:25 kube-bnkjtdvd03sqjar31uhg-cgliksp01-cgliksp-00001442 kubelet.service[6175]: E0408 00:04:25.472682    6175 reflector.go:123] object-"ibm-observe"/"sysdig-agent": Failed to list *v1.ConfigMap: Get https://172.20.0.1:2040/api/v1/namespaces/ibm-observe/configmaps?fieldSelector=metadata.name%3Dsysdig-agent&limit=500&resourceVersion=0: write tcp 172.20.0.1:22501->172.20.0.1:2040: use of closed network connection
Apr  8 00:04:25 kube-bnkjtdvd03sqjar31uhg-cgliksp01-cgliksp-00001442 kubelet.service[6175]: E0408 00:04:25.472886    6175 reflector.go:123] object-"default"/"default-token-gvbk5": Failed to list *v1.Secret: Get https://172.20.0.1:2040/api/v1/namespaces/default/secrets?fieldSelector=metadata.name%3Ddefault-token-gvbk5&limit=500&resourceVersion=0: write tcp 172.20.0.1:22501->172.20.0.1:2040: use of closed network connection

Will post an update if that solves our specific problem in case that helps anyone here in the interim.

Curious if there's a config parameter to set an absolute upper bound on a watch time? I found --streaming-idle-connection-timeout but nothing specific for watches.

We are seeing this in kube 1.17.4 after the API server became unhealthy due to "etcd failed: reason withheld".

Hi, guys. I've recompiled the Kubernetes binary with golang 1.14. It seems that the problem has disappeared.

@mYmNeo golang 1.14 + kubernetes v1.17 ?

@mYmNeo golang 1.14 + kubernetes v1.17 ?

@pytimer We are using 1.16.6 without changing any code, just recompiling. I think the root cause may be golang.

Hey! Got the same issue here on k8s 1.17.4. Do we think we could get a 1.17.5 recompiled with go 1.14 if that solves the issue?

Unfortunately, updating to go1.14 requires updates to several key components, so is unlikely to be picked back to Kube 1.17. You can track the issues and progress in https://github.com/kubernetes/kubernetes/pull/88638

Good to know, thx

@callicles has it been confirmed that recompiling with go 1.14 resolved the issue?

I'm seeing an identical issue on 1.16.8 - every so often (sometimes once every couple of days, sometimes every couple of weeks) the node becomes NotReady, with reason Kubelet stopped posting node status, and "use of closed network connection" filling the logs

Go may have a problem dealing with the h2 upgrade.
golang.org/x/net/http2/transport.go

    upgradeFn := func(authority string, c *tls.Conn) http.RoundTripper {
        addr := authorityAddr("https", authority)
        if used, err := connPool.addConnIfNeeded(addr, t2, c); err != nil {
            go c.Close()
            return erringRoundTripper{err}    <--- "use of closed network connection" raised
        }

Hi, guys. I've recompiled the Kubernetes binary with golang 1.14. It seems that the problem has disappeared.

@mYmNeo have you ever reproduced the problem after recompiling with go 1.14?

Hi, guys. I've recompiled the Kubernetes binary with golang 1.14. It seems that the problem has disappeared.

@mYmNeo have you ever reproduced the problem after recompiling with go 1.14?

AFAIK, the problem no longer exists.

Unfortunately, updating to go1.14 requires updates to several key components, so is unlikely to be picked back to Kube 1.17. You can track the issues and progress in #88638

Do you already know if go1.14 will be backported to 1.18?

Do you already know if go1.14 will be backported to 1.18?

I would not expect so. Changes to etcd and bbolt seem to be required to support go1.14, which is a bigger change than is typically made in release branches.

@liggitt Okay thx. Looks like we (at least for our clusters) need a mitigation strategy in the meantime :)

Does this problem only happen after a NIC failure? We're seeing the same error message in our v1.16.8 clusters, but there is no associated NIC failure.

We had at least one instance where the underlying VM had a SCSI error when connecting to a SAN. The SCSI problem resolved itself, but the kubelet never recovered.

The --goaway-chance option was added in 1.18(#88567). Will this option alleviate this problem?

No. That only has an effect if the kubelet is actually able to reach the API server and get a response back.

after a NIC bond failure (that recovers shortly after), kubelet has all of its connections broken and won't retry to re-establish them unless it is manually restarted.

can you please tell us what bond mode you are using? I'm not able to reproduce this on my cluster with active-backup bonds.

After upgrading to Kubernetes 1.16, we also started seeing the use of closed network connection error and kubelet not reconnecting to the apiserver, leaving nodes stuck in NotReady. We weren't able to reproduce the issue by taking down NICs (by setting the links down/up), but we did notice this behavior only happened on clusters that were more heavily loaded.

We did more digging and found that the server-side default in golang is 250 http2 streams per client, while the client-side default is 1000, so my guess is that once kubelet got an error from the apiserver for hitting the http2 stream limit, it never tried to reconnect. After setting --http2-max-streams-per-connection=1000 we didn't see the problem with nodes getting stuck in NotReady as much as originally found during testing. This didn't resolve the issue of kubelet not reconnecting, but it did help us mitigate the issue we were seeing.

After upgrading to Kubernetes 1.16, we also started seeing the use of closed network connection error and kubelet not reconnecting to the apiserver, leaving nodes stuck in NotReady. We weren't able to reproduce the issue by taking down NICs (by setting the links down/up), but we did notice this behavior only happened on clusters that were more heavily loaded.

We did more digging and found that the server-side default in golang is 250 http2 streams per client, while the client-side default is 1000, so my guess is that once kubelet got an error from the apiserver for hitting the http2 stream limit, it never tried to reconnect. After setting --http2-max-streams-per-connection=1000 we didn't see the problem with nodes getting stuck in NotReady as much as originally found during testing. This didn't resolve the issue of kubelet not reconnecting, but it did help us mitigate the issue we were seeing.

Hi, the default server-side http2 streams limit is 1000 in kube-apiserver, which is equal to the client's value.
https://github.com/kubernetes/kubernetes/blob/ae1103726f9aea1f9bbad1b215edfa47e0747dce/staging/src/k8s.io/apiserver/pkg/server/options/recommended.go#L62

@warmchang I think this applies to apiextensions apiservers and the sample apiserver:
https://github.com/kubernetes/kubernetes/blob/ae1103726f9aea1f9bbad1b215edfa47e0747dce/staging/src/k8s.io/apiserver/pkg/server/options/recommended.go#L62

A test with curl without setting --http2-max-streams-per-connection shows this in our apiserver logs (using v1.16):
I0603 10:18:08.038531 1 flags.go:33] FLAG: --http2-max-streams-per-connection="0"

And a curl request shows this in the response:
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!

When I use --http2-max-streams-per-connection=1000 the curl request then shows
* Connection state changed (MAX_CONCURRENT_STREAMS == 1000)!

@jmcmeek @treytabner, You're right. I misread the code. :+1:

Using Kubernetes 1.17.6 and seeing the same here. It looks like kubelet is using a dead http2 connection.
I did notice the inconsistent default value of MAX_CONCURRENT_STREAMS between kube-apiserver and kubelet.

Just set the server-side value to 1000. Will report later.

Rancher/RKE

Add to cluster definition:

 kube-api:
      extra_args:
        http2-max-streams-per-connection: '1000'

Check on master-node:

docker exec -it kubelet bash
apt update && apt-get install -y nghttp2
nghttp -nsv https://127.0.0.1:6443
#Look for SETTINGS_MAX_CONCURRENT_STREAMS

Setting MAX_CONCURRENT_STREAMS to 1000 on the apiserver has no effect on this issue.
I believe this is caused by a flaw in the golang http2 Transport. See above.

Had this issue again last night.
Seems like setting ‘MAX_CONCURRENT_STREAMS’ did not help ☹️

Hi guys. I think I have finally tracked down this issue. We had the same issue happen last night, but successfully recovered with a modified kubelet.

It's not a Kubernetes bug; it's about golang's standard net/http package, which client-go is using too.
I believe there is a flaw in golang.org/x/net/http2/transport.go

I have already reported this to the golang project. Waiting for some discussion.
https://github.com/golang/go/issues/39750

For now I modified the code to have the "http2: perform connection health check" change introduced by https://github.com/golang/net/commit/0ba52f642ac2f9371a88bfdde41f4b4e195a37c0 enabled by default.
It proves to be of some help with this problem, but it is a little slow to respond.
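For readers who want to try the same thing, a minimal sketch of what the health check amounts to for a client that constructs its http2 transport directly (the modification described above instead patches the vendored x/net code so the timeout is on by default; the values below are illustrative):

package main

import (
	"crypto/tls"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	t2 := &http2.Transport{
		TLSClientConfig: &tls.Config{},
		ReadIdleTimeout: 30 * time.Second, // send an h2 PING if no frames are seen for 30s
		PingTimeout:     15 * time.Second, // close the connection if the PING goes unanswered
	}
	client := &http.Client{Transport: t2}
	_ = client // dead connections are now detected and dropped instead of being reused forever
}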

kubelet v1.17.6 logs (compiled with a self-modified golang.org/x/net package)

It does recover from the dead-connection write problem, but it takes a little more time than expected.

Note that performing http2 healthCheck is a log message I intentionally left there to prove that the healthCheck func is being called by the readIdleTimer.

Jun 23 03:14:45 vm10.company.com kubelet[22255]: E0623 03:14:45.912484   22255 kubelet_node_status.go:402] Error updating node status, will retry: error getting node "vm10.company.com": Get "https://vm10.company.com:8443/api/v1/nodes/vm10.company.com?timeout=10s": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:45 vm10.company.com kubelet[22255]: E0623 03:14:45.912604   22255 kubelet_node_status.go:402] Error updating node status, will retry: error getting node "vm10.company.com": Get "https://vm10.company.com:8443/api/v1/nodes/vm10.company.com?timeout=10s": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:45 vm10.company.com kubelet[22255]: E0623 03:14:45.912741   22255 kubelet_node_status.go:402] Error updating node status, will retry: error getting node "vm10.company.com": Get "https://vm10.company.com:8443/api/v1/nodes/vm10.company.com?timeout=10s": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:46 vm10.company.com kubelet[22255]: E0623 03:14:46.367046   22255 controller.go:135] failed to ensure node lease exists, will retry in 400ms, error: Get "https://vm10.company.com:8443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/vm10.company.com?timeout=10s": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:48 vm10.company.com kubelet[22255]: E0623 03:14:47.737579   22255 controller.go:135] failed to ensure node lease exists, will retry in 800ms, error: Get "https://vm10.company.com:8443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/vm10.company.com?timeout=10s": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:49 vm10.company.com kubelet[22255]: E0623 03:14:49.113920   22255 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/kubelet.go:458: Failed to list *v1.Node: Get "https://vm10.company.com:8443/api/v1/nodes?fieldSelector=metadata.name%3Dvm10.company.com&limit=500&resourceVersion=0": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:49 vm10.company.com kubelet[22255]: E0623 03:14:48.744770   22255 reflector.go:153] object-"kube-system"/"flannel-token-zvfwn": Failed to list *v1.Secret: Get "https://vm10.company.com:8443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Dflannel-token-zvfwn&limit=500&resourceVersion=0": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:49 vm10.company.com kubelet[22255]: E0623 03:14:49.599631   22255 reflector.go:153] object-"kube-system"/"coredns": Failed to list *v1.ConfigMap: Get "https://vm10.company.com:8443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dcoredns&limit=500&resourceVersion=0": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:49 vm10.company.com kubelet[22255]: E0623 03:14:49.599992   22255 controller.go:135] failed to ensure node lease exists, will retry in 1.6s, error: Get "https://vm10.company.com:8443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/vm10.company.com?timeout=10s": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:49 vm10.company.com kubelet[22255]: E0623 03:14:49.600182   22255 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/kubelet.go:449: Failed to list *v1.Service: Get "https://vm10.company.com:8443/api/v1/services?limit=500&resourceVersion=0": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:49 vm10.company.com kubelet[22255]: E0623 03:14:49.600323   22255 reflector.go:153] object-"kube-system"/"kube-flannel-cfg": Failed to list *v1.ConfigMap: Get "https://vm10.company.com:8443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dkube-flannel-cfg&limit=500&resourceVersion=0": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:49 vm10.company.com kubelet[22255]: E0623 03:14:49.600463   22255 reflector.go:153] object-"core"/"registrypullsecret": Failed to list *v1.Secret: Get "https://vm10.company.com:8443/api/v1/namespaces/core/secrets?fieldSelector=metadata.name%3Dregistrypullsecret&limit=500&resourceVersion=0": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:49 vm10.company.com kubelet[22255]: E0623 03:14:49.369097   22255 reflector.go:153] object-"kube-system"/"registrypullsecret": Failed to list *v1.Secret: Get "https://vm10.company.com:8443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Dregistrypullsecret&limit=500&resourceVersion=0": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:25:39 vm10.company.com kubelet[22255]: E0623 03:25:39.543880   22255 desired_state_of_world_populator.go:320] Error processing volume "deployment-log-dir" for pod "fluentd-h76lr_core(e95c9200-3a0c-4fea-bd7f-99ac1cc6ae7a)": error processing PVC core/itom-vol-claim: failed to fetch PVC from API server: Get "https://vm10.company.com:8443/api/v1/namespaces/core/persistentvolumeclaims/itom-vol-claim": read tcp 16.155.199.4:41512->16.155.199.4:8443: use of closed network connection
Jun 23 03:25:39 vm10.company.com kubelet[22255]: E0623 03:25:39.666303   22255 kubelet_node_status.go:402] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"MemoryPressure\"},{\"type\":\"DiskPressure\"},{\"type\":\"PIDPressure\"},{\"type\":\"Ready\"}],\"conditions\":[{\"lastHeartbeatTime\":\"2020-06-22T19:25:29Z\",\"type\":\"MemoryPressure\"},{\"lastHeartbeatTime\":\"2020-06-22T19:25:29Z\",\"type\":\"DiskPressure\"},{\"lastHeartbeatTime\":\"2020-06-22T19:25:29Z\",\"type\":\"PIDPressure\"},{\"lastHeartbeatTime\":\"2020-06-22T19:25:29Z\",\"type\":\"Ready\"}]}}" for node "vm10.company.com": Patch "https://vm10.company.com:8443/api/v1/nodes/vm10.company.com/status?timeout=10s": read tcp 16.155.199.4:41512->16.155.199.4:8443: use of closed network connection
Jun 23 03:25:49 vm10.company.com kubelet[22255]: E0623 03:25:49.553078   22255 kubelet_node_status.go:402] Error updating node status, will retry: error getting node "vm10.company.com": Get "https://vm10.company.com:8443/api/v1/nodes/vm10.company.com?timeout=10s": read tcp 16.155.199.4:41718->16.155.199.4:8443: use of closed network connection
Jun 23 03:25:49 vm10.company.com kubelet[22255]: E0623 03:25:49.560723   22255 desired_state_of_world_populator.go:320] Error processing volume "log-location" for pod "fluentd-h76lr_core(e95c9200-3a0c-4fea-bd7f-99ac1cc6ae7a)": error processing PVC core/itom-logging-vol: failed to fetch PVC from API server: Get "https://vm10.company.com:8443/api/v1/namespaces/core/persistentvolumeclaims/itom-logging-vol": read tcp 16.155.199.4:41718->16.155.199.4:8443: use of closed network connection
Jun 23 03:27:29 vm10.company.com kubelet[22255]: I0623 03:27:29.961600   22255 log.go:181] performing http2 healthCheck
Jun 23 03:31:32 vm10.company.com kubelet[22255]: I0623 03:31:31.829860   22255 log.go:181] performing http2 healthCheck
Jun 23 03:31:44 vm10.company.com kubelet[22255]: I0623 03:31:44.570224   22255 log.go:181] performing http2 healthCheck
Jun 23 03:32:13 vm10.company.com kubelet[22255]: I0623 03:32:12.961728   22255 log.go:181] performing http2 healthCheck
Jun 23 03:33:16 vm10.company.com kubelet[22255]: I0623 03:33:15.441808   22255 log.go:181] performing http2 healthCheck
Jun 23 03:33:28 vm10.company.com kubelet[22255]: I0623 03:33:28.233121   22255 log.go:181] performing http2 healthCheck

No more use of closed network connection reported, and kubelet returned to the Ready state.

We got some new potential insights into the issue in our stacks. With some confidence, we assume rare connection drops on the networking/infrastructure level due to high load with respect to connection counts in specific situations, so in our case it wasn't network interfaces flapping. We especially had bad issues with Prometheus federation because of this, since they switched to http2 on the client side. Enabling the http2 health monitor by setting http2.Transport.ReadIdleTimeout as implemented with golang/net#55 entirely resolved the federation issues for us.

The values are currently not exposed, as apimachinery/pkg/util/net/http.go instantiates http.Transport and upgrades it to http2 internally, which does not expose the option until golang/net#74 is merged.

Are there any other workarounds besides the kubelet restart cron job? We have had a cron job in place for a week and it has not stopped the problem from happening.

I have the same issue in v1.17.3.

What I found is that the k8s versions using a specific golang.org/x/net version are the ones in trouble, and this package seems to have been fixed.
https://go-review.googlesource.com/c/net/+/198040

Versions with this problem (v1.16.5 ~ latest release)
golang.org/x/net v0.0.0-20191004110552-13f9640d40b9

Fixed version (master branch)
golang.org/x/net v0.0.0-20200707034311-ab3426394381

Will updating the golang.org/x/net package fix this issue?

Is there a release planned for the maintained k8s versions (v1.16, v1.17, v1.18, ...) to fix this?

What I found is that the k8s versions using a specific golang.org/x/net version are the ones in trouble, and this package seems to have been fixed.
https://go-review.googlesource.com/c/net/+/198040

The mentioned change only _offers_ the possibility to enable the HTTP2 health monitor, but it needs to be enabled by developers (the default is off). Furthermore, it cannot really be set yet, though there is a pull request to give developers access to the health monitor.

I'm currently integrating a reflection-based hotfix that enables the health monitor for our own Kubernetes distribution, in the hope that this helps resolve the issue.


@JensErat thank you for the answer.
If so, can this issue also occur in older versions of k8s (1.13, 1.15, ...)?

I changed the node distro from RancherOS (kernel 4.14.138) to Ubuntu 18.04 (kernel 5.3.0) more than a month ago, and the issue has not appeared since then.
One of my clusters is still on RancherOS, and it has had this issue reproduce 3 times already.

Not 100% sure, but the kernel version probably matters.

Hard to say. We definitely observe(d) the issue with 1.16 to 1.18, but had rare weird "kubelet stuck" occurrences before. We have been digging into such issues for at least a year, but could never correlate anything (single incidents every few weeks, and we have some four-digit number of kubelets running). It got much worse since we installed 1.16, but currently we're more inclined to assume the underlying (also very rare and hard to trace down...) networking issues happen more often. We're running Ubuntu 19.10 with kernel 5.3.0-46-generic but are affected (so it's well possible you actually got a newer patch level). Can you give a hint which exact kernel version/patch level you're running?


It is 5.3.0-59-generic. But we only have ~40 kubelets, so it still might be a coincidence.

As I said above, this issue happens more frequently on heavily loaded clusters. We observed the same issue almost every night before enabling the h2 transport healthCheck.
According to the issue reported to golang upstream, the problem occurs in the socket read loop, which should return an error when reading a closed socket but never does. I also suggested adding error-handling logic on the socket write path to actively detect connection problems, but several days later it seems they do not care much about such rare issues.

A little far from the topic: since the problem is caused by the network socket, which is very close to the kernel, updating the kernel may help, or it may not. (PS: we are using CentOS 7 with kernel 3.10, and it happened almost every day before enabling the healthCheck.)
I spent about 3 days reading the source code of net/http; as far as I can see, enabling the h2 transport healthCheck should help recover from this issue, and we really did escape from this odd situation by doing so.
@JensErat Do you have any concrete proof that enabling the healthCheck helps resolve this issue?

@JensErat Do you have any concrete proof that enabling the healthCheck helps resolve this issue?

We're running Prometheus federation for each of our Kubernetes clusters. Prometheus 2.19.0 introduced http2 (they forgot to mention this in the changelog and had it well hidden in a commit message body; I had to git bisect, deploy, and wait some hours for each run...) and we observed about a dozen incidents per day with federation stuck. I first patched out http2 support again (and the issue was gone), and then set the read timeout directly in golang.org/x/net/http2. Since then, we have not had a single federation-down incident.

I'm currently preparing to roll out a patched Kubernetes release on some clusters, so we should have data in a few days. We'll definitely share our outcomes as soon as we have proper data.


I'm currently preparing to roll out a patched Kubernetes release on some clusters, so we should have data in a few days. We'll definitely share our outcomes as soon as we have proper data.

Thanks for your feedback. That's a very delightful message.
Although the root cause is not very clear, at least we have found a way to recover from the disaster. :D

We have the same issue with k8s v1.14.3, and restarting kubelet fixes the problem.

I know this is silly, but it should work as a temporary workaround:


apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kubelet-face-slapper
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: kubelet-face-slapper
  template:
    metadata:
      labels:
        app: kubelet-face-slapper
    spec:
      # this toleration is to have the daemonset runnable on master nodes
      # remove it if your masters can't run pods    
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/controlplane
        operator: Equal
        value: "true"
      - effect: NoExecute
        key: node-role.kubernetes.io/etcd
        operator: Equal
        value: "true"
      containers:
      - command:
        - /bin/sh
        - -c
        - while true; do sleep 40; docker logs kubelet --since 1m 2>&1 | grep -q "use
          of closed network connection" && (docker restart kubelet ; echo "kubelet
          has been restarted due to connection error") || echo "kubelet connection
          is ok" ;done
        image: docker:stable
        name: kubelet-face-slapper
        volumeMounts:
        - mountPath: /var/run/docker.sock
          name: docker-sock
      volumes:
      - hostPath:
          path: /var/run/docker.sock
          type: File
        name: docker-sock


(This is rancher specific, but can be easily adapted to other distributions by using privileged container and journalctl/systemctl)

The amount of time for sleep and --since must be less than the cluster's pod-eviction-timeout (5m by default).

BTW - docker pause nginx-proxy on rancher worker node makes kubelet produce the same error message.

Temporary workaround for those running K8S on VMware vSphere - disable DRS for the K8S VMs. That will prevent vSphere from moving VMs among hypervisors, eliminating the network disconnections which are causing trouble for the kubelets.

We have some very good news regarding mitigation of the issue using the new golang http2 health check feature: no issues any more. By now, we implemented the "fix" (hardcoded setting of the value in vendored x/net code) in Prometheus, entire Kubernetes and several internal components, observing:

  • no Prometheus federation issues any more
  • kubelet sometimes still reports single "use of closed connection" events but recovers within seconds (we set a ~30 seconds http2 health check window)
  • sometimes we had issues with kubectl watches -- also gone if using patched kubectl
  • we're running an extended E2E testsuite to verify our integration regularly, and observed sporadic test timeouts and flakiness. Guess what? Gone now.

Furthermore, we have been able to get new insights into how to trigger the issue. I can confirm @vi7's observation regarding live migration with some confidence (we could trace it down though), and at least with the NSX version we're running, load balancer changes can also trigger such issues (we have a ticket with VMware to make sure they send reset packets in the future). And very likely lots of other things that drop connections in between, like connection table overflows.

This is a very annoying and somewhat massive issue for some users of Kubernetes (depending on some kind of "brokenness" of the IaaS layer/networking, I guess). Although there are golang discussions about exposing an interface to properly set the values -- do you think there is any chance of getting a PR merged upstream that sets those values through reflection (still better than forking x/net, I guess, like we do right now)? We're fine with providing the code (and validating the fix; we cannot actually reproduce it, but we observe it often enough to be able to confirm whether the fix works).

cc @liggitt

long-term-issue (note to self)

@JensErat thank you for the answer.
If so, can this issue also occur in older versions of k8s (1.13, 1.15, ...)?

I can confirm to see the issue with Kubernetes v1.16.13
We did not see the issue with Kubernetes v1.15.9

When I restore a Kubernetes cluster v1.16.14 from an etcd snapshot backup, this error appears in the kubelet log.
Thanks to @ik9999. I restarted the kubelet and the errors went away.

[root@dev-k8s-master ~]# journalctl -u kubelet -n 1 | grep "use of closed network connection"
Aug 22 11:31:10 dev-k8s-master kubelet[95075]: E0822 11:31:10.565237   95075 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.CSIDriver: Get https://apiserver.cluster.local:6443/apis/storage.k8s.io/v1beta1/csidrivers?limit=500&resourceVersion=0: write tcp 192.168.160.243:58374->192.168.160.243:6443: use of closed network connection
[root@dev-k8s-master ~]# systemctl restart kubelet
[root@dev-k8s-master ssh]# journalctl -u kubelet -n 1 | grep "use of closed network connection"

We hit the same issue on 1.17.3; restarting kubelet resolves it. Is there any stable workaround for this, or when will it be fixed?

v1.18.6 same

@rxwang662001
This is caused by an upstream golang issue. One thing that is certain is that this WON'T be fixed in go 1.15.
Meanwhile, the Kubernetes community is still struggling to migrate to go 1.14 LOL.

Typically, Go releases every 6 months. If everything goes well, maybe we could see this issue resolved upstream next year, and maybe another year until Kubernetes adopts the fix 🥇!
(Just kidding. If you really want this fixed in your stack right now, hacking the h2Transport to enable the healthCheck has proved to work.)

Meanwhile, the Kubernetes community is still struggling to migrate to go 1.14 LOL.

Actually, due to great work by sig-scalability and sig-release to qualify on go1.15 prereleases, Kubernetes 1.19 just released on go1.15. It looks like there is work in progress to expose the http/2 options in go1.16, and I expect we will make use of that as soon as it is available.

Actually, due to great work by sig-scalability and sig-release to qualify on go1.15 prereleases, Kubernetes 1.19 just released on go1.15.

Oops. Sorry for the awkward joke. Didn't pay much attention to the v1.19 release.
Seems we skipped go1.14 entirely on K8S? Wow, that's a big leap 👍

@povsister

Thanks for sharing your solution. Could you add more details on how you made it work?

For now I modified the code to have the "http2: perform connection health check" change introduced by golang/net@0ba52f6 enabled by default.
It proves to be of some help with this problem, but it is a little slow to respond.

What code changes did you put in place? And where, in which file?

@KarthikRangaraju
Refer to this PR to enable the healthCheck when initializing the h2Transport,
or you can do some reflection/unsafe offset hack to access the unexported field at runtime.

And, don't forget to update golang/x/net before doing such stuff.

We have not been able to reproduce this issue although we do face it from time to time.

Since we are not able to identify the root cause of the symptom, we are fixing the symptom regardless.

Our solution:

  • The following script runs every 1 hour. It talks to the kube-apiserver via kubectl, using the kubeconfig file
    kubelet uses (this way there is no privilege escalation).
  • It asks whether the master node considers its own node NotReady. If yes, it triggers a kubelet restart by running the touch command on a file
    that is watched by kubelet-watcher.service for file system changes, which restarts kubelet accordingly.
#!/bin/bash

while true; do
  node_status=$(KUBECONFIG=/etc/kubernetes/kubelet.conf kubectl get nodes | grep $HOSTNAME | awk '{print $2}')
  date=$(date)
  echo "${date} Node status for ${HOSTNAME}: ${node_status}"
  if [ "${node_status}" == "NotReady" ]; then
    echo "${date} Triggering kubelet restart ..."
    # Running touch command on /var/lib/kubelet/config.yaml. This will trigger a kubelet restart.
    # /usr/lib/systemd/system/kubelet-watcher.path & /usr/lib/systemd/system/kubelet-watcher.service
    # are responsible for watching changes in this file
    # and will restart the kubelet process managed by systemd accordingly.
    touch /var/lib/kubelet/config.yaml
  fi

  # Runs every 1 hour
  sleep 3600
done
# cat  /usr/lib/systemd/system/kubelet-watcher.path
[Path]
PathModified=/var/lib/kubelet/config.yaml

[Install]
WantedBy=multi-user.target

# cat /usr/lib/systemd/system/kubelet-watcher.service
[Unit]
Description=kubelet restarter

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl restart kubelet.service

[Install]
WantedBy=multi-user.target

With Kubernetes 1.19.0 the issue still exists, but the message is slightly different.
Sep 11 18:19:39 k8s-node3 kubelet[17382]: E0911 18:19:38.745482 17382 event.go:273] Unable to write event: 'Patch "https://192.168.1.150:6443/api/v1/namespaces/fhem/events/fhem-7c99f5f947-z48zk.1633c689ec861314": read tcp 192.168.1.153:34758->192.168.1.150:6443: use of closed network connection' (may retry after sleeping)
It now contains "(may retry after sleeping)" in the error message.

Is it possible to mitigate this entirely in Kubernetes without waiting for a golang upgrade? For example, can client-go be made to swap out the transport if it hits a "use of closed network connection" or something?

Alternatively, would this issue still occur if using HTTP 1.1, or is it purely HTTP 2 related? If HTTP 1.1 would be immune and doesn't have huge drawbacks, it'd be a really simple workaround to just set GODEBUG=http2client=0 on kubelet, kube-proxy, and the various control-plane processes, or even set GODEBUG=http2server=0 on the apiserver process to make the change universal.

Do we think these would actually mitigate this issue and not cause other major pitfalls, other than some performance impact due to the increase in connection count when not multiplexing over HTTP2?
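For reference, a sketch of what opting out of HTTP/2 amounts to at the Go code level (GODEBUG=http2client=0 has the same effect for the standard library client): a non-nil, empty TLSNextProto map stops h2 negotiation, so requests fall back to HTTP/1.1 keep-alive connections.

package main

import (
	"crypto/tls"
	"net/http"
)

func main() {
	t := &http.Transport{
		// Providing a non-nil, empty map disables HTTP/2 negotiation for this transport.
		TLSNextProto: map[string]func(authority string, c *tls.Conn) http.RoundTripper{},
	}
	client := &http.Client{Transport: t}
	_ = client // all requests now use HTTP/1.1
}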

can client-go be made to swap out the transport if it hits a "use of closed network connection" or something?

not very surgically... transports are currently shared in order to avoid ephemeral port exhaustion in the face of callers that repeatedly construct new clients

would this issue still occur if using HTTP 1.1, or is it purely HTTP 2 related?

as far as I know, HTTP 1.1 can encounter the same issue since idle connections go back into a keep-alive pool (and has fewer options to detect/mitigate it since the ping health check mechanism is not available to it)

Is there a good workaround for projects that use the client? How can we identify when the client is dead, and what's the minimum we need to do to fix it (sounds like restarting our process might be the only option)?

How can we identify when the client is dead

When you repeatedly get a write tcp xxx: use of closed network connection error for an identical URL, that indicates the client is dead. The connection pool inside the Transport has cached a dead TCP connection for the requested host:port.

Is there a good workaround for projects that use the client?

As far as I know, re-constructing the http.Client can fix this problem without restarting the whole application.

what's the minimum we need to do to fix it

It requires source-code-level access to the project. Maybe you can use the mechanism mentioned above to detect a dead client and re-construct a new client when needed. If no one is using the old client, it will be garbage collected.
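An illustrative sketch of that detection idea (the wrapper type, names, and threshold are invented for this example, they are not part of client-go): wrap the client's RoundTripper, count consecutive "use of closed network connection" failures, and let the application decide when to throw the client away and build a fresh one.

package main

import (
	"net/http"
	"strings"
	"sync/atomic"
)

// deadConnDetector counts consecutive "use of closed network connection"
// failures seen by the wrapped RoundTripper.
type deadConnDetector struct {
	next     http.RoundTripper
	failures int64
}

func (d *deadConnDetector) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := d.next.RoundTrip(req)
	if err != nil && strings.Contains(err.Error(), "use of closed network connection") {
		atomic.AddInt64(&d.failures, 1)
	} else {
		atomic.StoreInt64(&d.failures, 0)
	}
	return resp, err
}

// ShouldRebuild reports whether enough consecutive failures have accumulated.
func (d *deadConnDetector) ShouldRebuild() bool {
	return atomic.LoadInt64(&d.failures) >= 3
}

func main() {
	d := &deadConnDetector{next: http.DefaultTransport}
	client := &http.Client{Transport: d}
	_ = client // when d.ShouldRebuild() returns true, construct a new transport/client
}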

I have source code access to my project, but we use the Kubernetes client. When we do watches, it seems that it never detects when the TCP connection is severed like this (and since the watch is handling the HTTP transactions, no errors bubble up to our code to handle).

Yeah, you are right, the http.Client is not exposed by the Kubernetes client.
Currently it's hopeless for a top-level application to do such a workaround at little cost.
If the Kubernetes client does not use http.DefaultClient, it could be fixed by re-constructing the whole Kubernetes client.

For watch requests it gets worse: the Kubernetes client seems to keep retrying the request, and no error is surfaced to the application. I have no good idea for such a situation right now.

The workaround proposed here has been working for us for several weeks. We turned it into a Python script that runs as a daemonset in all of our clusters. We usually see it take action once or twice a week (automatically restarting kubelet), and we have not had any negative impact on our cluster operation as a result. We made it configurable so that it only restarts kubelet if it sees 2+ messages in a 5-minute period. We saw that occasionally we could see one message, and it was not a problem. When the problem happens, you'll see the use of closed network connection error constantly in the kubelet logs.

Please create a pull request and we will investigate this subject matter.

On a single baremetal cluster I’m seeing this happen about 2-4 times every 24 hours. 1.17.12

It happens when the api-server pod restarts, even on a single-node cluster. Connections to the apiserver get lost, so the way to minimize the number of errors is to solve whatever is causing the apiserver to restart.

I'm using haproxy in front of the master node; do you think there is any way to prevent this with some LB configuration?

@shubb30 do you mind sharing your solution with me?

I can confirm that my apiservers are NOT restarting when I experience the issue. I'm using the daemonset and shell trick to monitor for the log entry and then restart kubelet; this has worked fairly well, but I see it as only a temporary workaround.

Here is a modified version of what has been working well for us as a workaround.

Hey y’all!

Do we think this backport could potentially help?
https://github.com/golang/go/issues/40423

Good news: golang/net master has support for configuring http2 transports, which now allows setting the timeouts! https://github.com/golang/net/commit/08b38378de702b893ee869b94b32f833e2933bd2

Done.
PR opened for review.

More good news: Kubernetes does not use the http2 bundled in the standard net/http package, so we do not need to wait for the next Go release. We can directly use https://github.com/golang/net/commit/08b38378de702b893ee869b94b32f833e2933bd2 to fix this issue.
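A minimal sketch of wiring that in, assuming a new enough golang.org/x/net: ConfigureTransports returns the http2.Transport bound to a standard http.Transport, so the health-check timeouts can be set on it. The values below are illustrative, not necessarily what the Kubernetes fix ends up using.

package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	t1 := &http.Transport{} // in client-go this would be the TLS-configured transport from its cache
	t2, err := http2.ConfigureTransports(t1)
	if err != nil {
		panic(err)
	}
	t2.ReadIdleTimeout = 30 * time.Second // send an h2 PING after 30s without frames
	t2.PingTimeout = 15 * time.Second     // drop the connection if the PING is not answered
	client := &http.Client{Transport: t1}
	_ = client
}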

Proposed a fix here: https://github.com/kubernetes/kubernetes/pull/95898
It updates the dependency to the required version and enables the http2 Transport health check by default.
It should help applications that use client-go to communicate with the apiserver (e.g. kubelet) get rid of the "app hangs at write tcp xxx: use of closed connection" problem.

Feel free to leave any comments.

It seems the mentioned #95898 got closed for reasons we do not need to discuss.

Is there any other update in regards of this issue?

https://github.com/kubernetes/kubernetes/pull/95981 (linked above) is in progress to pull in the http/2 fix

Is this issue specific to 1.17.X versions of kubernetes ?

@krmayankk Not entirely sure when it exactly started. But at least 1.17-1.19 have this problem. #95980 should have fixed it though (will be part of the next 1.20 release, hasn't made it into the beta yesterday)

@krmayankk We saw this issue with v1.18.9 as well but it was triggered by a buggy version of Rancher that caused very high network usage. After rolling back to another version no issues observed.

I had this issue but I've now "fixed" it in my small hobby cluster using the workaround in a comment above

I wrote a small ansible playbook to deploy the workaround on nodes as a systemd unit and timer; it might save others with a similar setup some time.

Is there a plan to cherry-pick/backport https://github.com/kubernetes/kubernetes/pull/95981 and https://github.com/kubernetes/kubernetes/issues/87615 to 1.18 release branch?

Is there a plan to cherry-pick #95981 to 1.17 release branch?

This comment discusses backports to the older releases: https://github.com/kubernetes/kubernetes/pull/95981#issuecomment-730561539

I think the answer is "it's tough and could break things so probably not". Kinda the same answer I'd expect from folks running v1.17 when asked, so why not upgrade to v1.20 to get the fix? :laughing:

Backporting this to at least 1.19 would be great since that will make the fix available relatively soon. I suspect some people will hold off on 1.20 due to the deprecation of Docker.

Backporting this to at least 1.19 would be great since that will make the fix available relatively soon.

That has already been done.

I suspect some people will hold off on 1.20 due to the deprecation of Docker.

Nothing has changed in 1.20 with respect to docker other than a deprecation warning. At the end of the deprecation period, dockershim support will be removed.

I'm getting these errors on 1.20 on Raspbian 10. Where does one even start with getting a fix for any of this? It seems like running a cloud-managed cluster is way more cost effective than trying to run it on your own.

For my own clarity: this looks like it should be resolved by #95981, and that made it into 1.20 and was backported to 1.19?

95981 was merged to 1.20, and was cherry-picked to 1.19 in #96770.

/close

@caesarxuchao: Closing this issue.

In response to this:

95981 was merged to 1.20, and was cherry-picked to 1.19 in #96770.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Will there be any backport/cherry pick for v1.16, v1.17 or or v1.18?

@chilicat see https://github.com/kubernetes/kubernetes/pull/95981#issuecomment-730561539. I don't plan to cherry-pick it to 1.18 or older versions.
