Kubernetes: kube-proxy ipvs mode hangs after a few hours, needs manual restart

Created on 15 Nov 2018  ·  142Comments  ·  Source: kubernetes/kubernetes

What happened: upgraded from v1.11.0 to v1.12.2, now kube-proxy hangs, usually in less than a day

What you expected to happen: it should continue working

How to reproduce it (as minimally and precisely as possible): just have a normal kube-proxy running in ipvs mode. It gets stuck usually in less than a day.

Anything else we need to know?:

I've sent SIGABRT to the kube-proxy process to get a stack trace, and the only thing interesting is that when the bug happens goroutine 1 is reading from netlink, whereas when I send SIGABRT and the process is not stuck it is not reading from netlink.

Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: goroutine 1 [syscall]:
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: syscall.Syscall6(0x2d, 0x3, 0xc42081c000, 0x1000, 0x0, 0xc42083deb0, 0xc42083dea4, 0x40fb86, 0x7f2ab9788aa8, 0x0)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: syscall.recvfrom(0x3, 0xc42081c000, 0x1000, 0x1000, 0x0, 0xc42083deb0, 0xc42083dea4, 0x101ffffffffffff, 0x0, 0x1000)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /usr/local/go/src/syscall/zsyscall_linux_amd64.go:1665 +0xa6
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: syscall.Recvfrom(0x3, 0xc42081c000, 0x1000, 0x1000, 0x0, 0x1000, 0x0, 0x0, 0x16ad4c0, 0x1f4d508)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /usr/local/go/src/syscall/syscall_unix.go:252 +0xaf
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/vendor/github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(0xc4204d3440, 0x0, 0x0, 0x0, 0x16ad4c0, 0x1f4d508)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/vishvananda/netlink/nl/nl_linux.go:613 +0x9b
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/vendor/github.com/docker/libnetwork/ipvs.execute(0xc4204d3440, 0xc42083e150, 0x0, 0x1, 0x2, 0xc420423740, 0x1, 0x2)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/docker/libnetwork/ipvs/netlink.go:219 +0xd3
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/vendor/github.com/docker/libnetwork/ipvs.(*Handle).doCmdwithResponse(0xc420573990, 0xc420990750, 0x0, 0xc42003d501, 0xa, 0x10, 0xc420990701, 0xc42083e230, 0x118dd6b)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/docker/libnetwork/ipvs/netlink.go:140 +0x20b
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/vendor/github.com/docker/libnetwork/ipvs.(*Handle).doCmd(0xc420573990, 0xc420990750, 0x0, 0x1, 0x60000020fc6e0, 0xc420990750)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/docker/libnetwork/ipvs/netlink.go:149 +0x48
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/vendor/github.com/docker/libnetwork/ipvs.(*Handle).NewService(0xc420573990, 0xc420990750, 0x0, 0x0)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/docker/libnetwork/ipvs/ipvs.go:117 +0x43
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/pkg/util/ipvs.(*runner).AddVirtualServer(0xc42052a340, 0xc420b0e550, 0x1e, 0xc42083e3b8)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/ipvs/ipvs_linux.go:61 +0x60
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/pkg/proxy/ipvs.(*Proxier).syncService(0xc4201b3e00, 0xc420d53620, 0x21, 0xc420b0e550, 0xc420357901, 0x5, 0xc420d53620)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/proxy/ipvs/proxier.go:1454 +0x635
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/pkg/proxy/ipvs.(*Proxier).syncProxyRules(0xc4201b3e00)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/proxy/ipvs/proxier.go:803 +0x1081
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/pkg/proxy/ipvs.(*Proxier).(k8s.io/kubernetes/pkg/proxy/ipvs.syncProxyRules)-fm()
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/proxy/ipvs/proxier.go:395 +0x2a
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/pkg/util/async.(*BoundedFrequencyRunner).tryRun(0xc42042e3f0)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/async/bounded_frequency_runner.go:217 +0xb6
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/pkg/util/async.(*BoundedFrequencyRunner).Loop(0xc42042e3f0, 0xc4200b0120)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/async/bounded_frequency_runner.go:179 +0x211
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/pkg/proxy/ipvs.(*Proxier).SyncLoop(0xc4201b3e00)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/proxy/ipvs/proxier.go:589 +0x4e
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/cmd/kube-proxy/app.(*ProxyServer).Run(0xc4207f2000, 0xc4207f2000, 0x0)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-proxy/app/server.go:568 +0x3dd
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/cmd/kube-proxy/app.(*Options).Run(0xc4201082c0, 0xc42063e230, 0x0)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-proxy/app/server.go:238 +0x69
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/cmd/kube-proxy/app.NewProxyCommand.func1(0xc4200f2500, 0xc42063e230, 0x0, 0x7)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-proxy/app/server.go:360 +0x156
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute(0xc4200f2500, 0xc4200c4010, 0x7, 0x7, 0xc4200f2500, 0xc4200c4010)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:760 +0x2c1
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc4200f2500, 0xc4205dc090, 0x16076a0, 0xc4207a7ee8)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:846 +0x30a
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute(0xc4200f2500, 0x1608b88, 0x20dcf40)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:794 +0x2b
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: main.main()

The last lines that it logged were:

Nov 15 07:48:52 hex-48b-pm kube-proxy[5389]: I1115 07:48:52.392942    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.33.241:4059/TCP/10.125.2.26:4059
Nov 15 07:48:52 hex-48b-pm kube-proxy[5389]: I1115 07:48:52.392952    5389 graceful_termination.go:160] Trying to delete rs: 10.128.32.132:5432/TCP/10.125.2.13:6432
Nov 15 07:48:52 hex-48b-pm kube-proxy[5389]: E1115 07:48:52.392978    5389 graceful_termination.go:89] Try delete rs "10.128.32.132:5432/TCP/10.125.2.13:6432" err: Failed to delete rs "10.128.32.132:5432/TCP/10.125.2.13:6432", can't 
find the real server
Nov 15 07:48:52 hex-48b-pm kube-proxy[5389]: I1115 07:48:52.392985    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.32.132:5432/TCP/10.125.2.13:6432
Nov 15 07:48:52 hex-48b-pm kube-proxy[5389]: E1115 07:48:52.392992    5389 graceful_termination.go:183] Try flush graceful termination list err
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393147    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.156:8125/TCP/10.125.2.21:8125
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393260    5389 graceful_termination.go:89] Try delete rs "10.128.33.156:8125/TCP/10.125.2.21:8125" err: Failed to delete rs "10.128.33.156:8125/TCP/10.125.2.21:8125", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393272    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.33.156:8125/TCP/10.125.2.21:8125
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393281    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.142:6379/TCP/10.125.6.13:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393310    5389 graceful_termination.go:89] Try delete rs "10.128.33.142:6379/TCP/10.125.6.13:6379" err: Failed to delete rs "10.128.33.142:6379/TCP/10.125.6.13:6379", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393317    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.33.142:6379/TCP/10.125.6.13:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393327    5389 graceful_termination.go:160] Trying to delete rs: 10.128.34.237:6379/TCP/10.125.6.11:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393355    5389 graceful_termination.go:89] Try delete rs "10.128.34.237:6379/TCP/10.125.6.11:6379" err: Failed to delete rs "10.128.34.237:6379/TCP/10.125.6.11:6379", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393362    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.34.237:6379/TCP/10.125.6.11:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393371    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.200:5432/TCP/10.125.6.27:6432
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393397    5389 graceful_termination.go:89] Try delete rs "10.128.33.200:5432/TCP/10.125.6.27:6432" err: Failed to delete rs "10.128.33.200:5432/TCP/10.125.6.27:6432", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393404    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.33.200:5432/TCP/10.125.6.27:6432
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393412    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.228:6379/TCP/10.125.2.63:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393440    5389 graceful_termination.go:89] Try delete rs "10.128.33.228:6379/TCP/10.125.2.63:6379" err: Failed to delete rs "10.128.33.228:6379/TCP/10.125.2.63:6379", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393447    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.33.228:6379/TCP/10.125.2.63:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393455    5389 graceful_termination.go:160] Trying to delete rs: 10.128.34.247:5672/TCP/10.125.2.6:5672
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393482    5389 graceful_termination.go:89] Try delete rs "10.128.34.247:5672/TCP/10.125.2.6:5672" err: Failed to delete rs "10.128.34.247:5672/TCP/10.125.2.6:5672", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393489    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.34.247:5672/TCP/10.125.2.6:5672
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393497    5389 graceful_termination.go:160] Trying to delete rs: 10.128.35.47:4007/TCP/10.125.2.15:4007
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393523    5389 graceful_termination.go:89] Try delete rs "10.128.35.47:4007/TCP/10.125.2.15:4007" err: Failed to delete rs "10.128.35.47:4007/TCP/10.125.2.15:4007", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393530    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.35.47:4007/TCP/10.125.2.15:4007
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393538    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.241:4059/TCP/10.125.2.26:4059
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393563    5389 graceful_termination.go:89] Try delete rs "10.128.33.241:4059/TCP/10.125.2.26:4059" err: Failed to delete rs "10.128.33.241:4059/TCP/10.125.2.26:4059", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393571    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.33.241:4059/TCP/10.125.2.26:4059
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393578    5389 graceful_termination.go:160] Trying to delete rs: 10.128.32.132:5432/TCP/10.125.2.13:6432
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393605    5389 graceful_termination.go:89] Try delete rs "10.128.32.132:5432/TCP/10.125.2.13:6432" err: Failed to delete rs "10.128.32.132:5432/TCP/10.125.2.13:6432", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393613    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.32.132:5432/TCP/10.125.2.13:6432
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393621    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.252:6379/TCP/10.125.6.26:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393648    5389 graceful_termination.go:89] Try delete rs "10.128.33.252:6379/TCP/10.125.6.26:6379" err: Failed to delete rs "10.128.33.252:6379/TCP/10.125.6.26:6379", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393656    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.33.252:6379/TCP/10.125.6.26:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393664    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.237:6379/TCP/10.125.6.34:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393690    5389 graceful_termination.go:89] Try delete rs "10.128.33.237:6379/TCP/10.125.6.34:6379" err: Failed to delete rs "10.128.33.237:6379/TCP/10.125.6.34:6379", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393697    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.33.237:6379/TCP/10.125.6.34:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393704    5389 graceful_termination.go:183] Try flush graceful termination list err
Nov 15 07:50:52 hex-48b-pm kube-proxy[5389]: I1115 07:50:52.393838    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.156:8125/TCP/10.125.2.21:8125
Nov 15 07:50:52 hex-48b-pm kube-proxy[5389]: E1115 07:50:52.393940    5389 graceful_termination.go:89] Try delete rs "10.128.33.156:8125/TCP/10.125.2.21:8125" err: device or resource busy
Nov 15 07:50:52 hex-48b-pm kube-proxy[5389]: I1115 07:50:52.393957    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.142:6379/TCP/10.125.6.13:6379
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420734    5389 proxier.go:1496] Failed to list IPVS destinations, error: invalid argument
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420763    5389 proxier.go:809] Failed to sync endpoint for service: 10.128.33.156:8125/TCP, err: invalid argument
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420889    5389 proxier.go:1485] Failed to get IPVS service, error: Expected only one service obtained=0
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420903    5389 proxier.go:809] Failed to sync endpoint for service: 10.128.35.150:6379/TCP, err: Expected only one service obtained=0
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420950    5389 proxier.go:1455] Failed to add IPVS service "monitoring/extradata-inserter:": file exists
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420961    5389 proxier.go:812] Failed to sync service: 10.128.34.54:80/TCP, err: file exists
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.421886    5389 proxier.go:1544] Failed to add destination: 10.125.27.66:80, error: file exists

If I restart kube-proxy (manually, ugh), it all recovers and works fine. But it will get stuck again, eventually.

Environment:


/kind bug

areipvs kinbug sinetwork triagunresolved

Most helpful comment

The PR has been merged in all release branches, so it will be present in:

  • 1.11.7
  • 1.12.5
  • 1.13.2

All 142 comments

/sig network

the same error. upgrade from 1.11.1. Everything become good except kube-proxy becomse not stable, some time pod can't access svc though their endpoints are ok .. And the same as the name resolution. so i think it may exist an important bug here.

/area ipvs

@gjcarneiro do you resolve this problem?

Well, I switched the cluster back to the default (iptables) mode, appears to be pretty stable since then...

@kubernetes/sig-network-bugs

@berlinsaint: Reiterating the mentions to trigger a notification:
@kubernetes/sig-network-bugs

In response to this:

@kubernetes/sig-network-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

same problem ,both 1.12.2 and 1.13.0-beta1. @m1093782566 @Lion-Wei

@gjcarneiro do you mean that in kernel 4.15 , it has the same problem?

@annProg

same problem ,both 1.12.2 and 1.13.0-beta1. @m1093782566 @Lion-Wei

你邮箱多少啊啊,,认识下。。少年

@annProg

same problem ,both 1.12.2 and 1.13.0-beta1. @m1093782566 @Lion-Wei

你邮箱多少啊啊,,认识下。。少年

点我头像就看到了

I am also seeing this repeatedly in my centralized logging system:

  | Time | sys_name | log_lvl | tag | log_msg
  | November 27th 2018, 17:56:23.501 | k8lab2bs | 6 | kube.kube-system.kube-proxy | lw: remote out of the list: 10.12.12.1:443/TCP/10.12.100.133:6443
  | November 27th 2018, 17:56:23.500 | k8lab2bs | 6 | kube.kube-system.kube-proxy | Deleting rs: 10.12.12.1:443/TCP/10.12.100.133:6443
  | November 27th 2018, 17:56:23.497 | k8lab2bs | 6 | kube.kube-system.kube-proxy | Trying to delete rs: 10.12.12.1:443/TCP/10.12.100.133:6443

I'm running v1.12.3 (fresh install, not an upgrade).

Hi, all. Thanks for the report, I'll make a research and try to figure out what happened, any progress I'll report here.

Besides what happened, which is good to find out, I don't feel confident about IPVS until there is code in place that detects and resets a broken netlink socket. Because from what I can guess from the logs and traces it is that the netlink socket gets into a broken state and the task that reads/writes from/to it will hang forever. I think kube-proxy needs to have code that detects this problem, and if it happens create a new netlink socket from scratch, and maybe restart the task that handles it. But, no, I'm not volunteering, I don't even know Go, sorry.

Because from what I can guess from the logs and traces it is that the netlink socket gets into a broken state and the task that reads/writes from/to it will hang forever.

We upgraded the go netlink version some days ago for supporting graceful termination. The netlink library was running pretty stable before. I am not sure if it's the cause.

1.12.3 with the same error.
After filter the INFO log,
execute kubectl -n kube-system logs kube-proxy-29zh8|egrep ^E; kubectl -n kube-system logs kube-proxy-2pqdn|egrep ^E; kubectl -n kube-system logs kube-proxy-4xw8q|egrep ^E; kubectl -n kube-system logs kube-proxy-6j4bc|egrep ^E; kubectl -n kube-system logs kube-proxy-brbjb|egrep ^E; kubectl -n kube-system logs kube-proxy-r6cg2|egrep ^E; kubectl -n kube-system logs kube-proxy-rpl6t|egrep ^E;
and all the proxy node log is

bash E1129 01:01:10.862180 1 proxier.go:430] Failed to execute iptables-restore for nat: exit status 1 (iptables-restore: line 7 failed E1129 21:09:11.422553 1 proxier.go:1485] Failed to get IPVS service, error: Expected only one service obtained=0 E1129 21:09:11.422699 1 proxier.go:1116] Failed to sync endpoint for service: 172.17.0.1:31800/TCP, err: Expected only one service obtained=0 E1129 01:01:11.618125 1 proxier.go:430] Failed to execute iptables-restore for nat: exit status 1 (iptables-restore: line 7 failed E1129 01:01:11.905626 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:01:41.771596 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:02:11.923872 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:02:42.060658 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:03:12.326762 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:03:42.718417 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:04:12.859933 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:04:42.963527 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:05:13.078486 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:05:43.185141 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:06:13.303090 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:06:43.408582 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:07:13.550358 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:07:43.686235 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:08:13.786326 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:08:43.981764 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:09:14.188053 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:09:44.316316 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:10:14.440466 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:10:44.550092 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:11:14.658190 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:11:44.765396 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:12:14.975724 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:12:45.085632 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:13:15.317679 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:13:45.432469 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:14:15.689681 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:14:45.813565 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:15:33.434931 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:16:03.578017 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:16:33.709744 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:17:03.823325 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:17:33.935191 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:18:04.053183 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:18:34.248994 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:19:04.375946 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:19:34.552779 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:19:36.596182 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:19:47.433846 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:20:17.534206 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:20:47.682350 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:21:17.825175 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:21:47.942548 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:22:18.055820 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:22:48.149494 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:23:18.564985 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:23:48.678462 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:24:18.836217 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:24:48.991344 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:25:19.091032 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:25:49.187178 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:26:19.301770 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:26:49.425812 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:27:19.606711 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:27:49.721224 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:28:20.109117 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:28:50.211574 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:29:20.405007 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:29:50.507766 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:30:20.605471 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:30:50.702798 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:31:20.807400 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:31:50.903049 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:32:21.005618 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:32:51.106378 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:33:21.224447 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:33:51.354916 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:34:21.467898 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:34:51.583216 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:35:21.691099 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:35:51.797151 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:36:21.921057 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:36:52.083821 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:37:02.445737 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:37:07.427811 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:37:37.546362 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:38:07.651769 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:38:37.797858 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:39:08.064112 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:39:38.181089 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:40:08.282892 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:40:38.390167 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:41:08.508549 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:41:38.617921 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:42:08.728384 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:42:38.844074 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:43:08.973517 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:43:39.092095 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:44:09.202329 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:44:39.363721 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:45:09.476240 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:45:39.764944 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:46:09.894587 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:46:40.013388 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:47:10.127894 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:47:40.236119 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:48:10.397111 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:48:40.600084 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:49:10.877034 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:49:41.023593 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:50:11.149894 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:50:41.262320 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:51:11.410407 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:51:41.547057 1 proxier.go:1622] Failed to unbind service addr fe80::1c5d:70ff:fef2:e97b from dummy interface kube-ipvs0: error unbind address: fe80::1c5d:70ff:fef2:e97b from interface: kube-ipvs0, err: cannot assign requested address E1129 01:52:11.609974 1 proxier.go:1544] Failed to add destination: 172.31.200.9:443, error: file exists E1129 01:52:11.610088 1 graceful_termination.go:89] Try delete rs "10.96.0.1:443/TCP/172.31.10.38:6443" err: invalid argument E1129 01:52:11.616690 1 proxier.go:1485] Failed to get IPVS service, error: Expected only one service obtained=0 E1129 01:52:11.616767 1 proxier.go:1116] Failed to sync endpoint for service: 172.18.0.1:31645/TCP, err: Expected only one service obtained=0 E1129 01:01:12.181112 1 proxier.go:430] Failed to execute iptables-restore for nat: exit status 1 (iptables-restore: line 7 failed E1129 01:01:22.278474 1 proxier.go:430] Failed to execute iptables-restore for nat: exit status 1 (iptables-restore: line 7 failed E1129 01:01:21.459378 1 proxier.go:430] Failed to execute iptables-restore for nat: exit status 1 (iptables-restore: line 7 failed E1129 01:01:17.442316 1 proxier.go:430] Failed to execute iptables-restore for nat: exit status 1 (iptables-restore: line 7 failed E1129 20:10:17.955872 1 proxier.go:1455] Failed to add IPVS service "dev-fffrf/mysql-testportal:mysql": file exists E1129 20:10:17.955863 1 graceful_termination.go:89] Try delete rs "10.100.217.68:15004/TCP/192.100.143.225:15004" err: Failed to delete rs "10.100.217.68:15004/TCP/192.100.143.225:15004", can't find the real server E1129 20:10:17.955892 1 proxier.go:1119] Failed to sync service: 172.17.0.1:31645/TCP, err: file exists E1129 20:10:17.956412 1 proxier.go:1485] Failed to get IPVS service, error: invalid argument E1129 20:10:17.956417 1 graceful_termination.go:89] Try delete rs "10.96.0.1:443/TCP/172.31.10.38:6443" err: Failed to delete rs "10.96.0.1:443/TCP/172.31.10.38:6443", can't find the real server E1129 20:10:17.956426 1 proxier.go:1116] Failed to sync endpoint for service: 172.31.200.67:31645/TCP, err: invalid argument E1129 20:10:17.956448 1 graceful_termination.go:89] Try delete rs "10.101.1.98:5672/TCP/192.100.29.180:5672" err: invalid argument E1129 20:10:17.956547 1 graceful_termination.go:183] Try flush graceful termination list err E1129 01:01:17.484679 1 proxier.go:430] Failed to execute iptables-restore for nat: exit status 1 (iptables-restore: line 7 failed E1129 02:52:17.498135 1 graceful_termination.go:89] Try delete rs "10.96.0.1:443/TCP/172.31.10.38:6443" err: invalid argument E1129 02:52:17.498396 1 graceful_termination.go:183] Try flush graceful termination list err E1129 06:28:17.607394 1 graceful_termination.go:89] Try delete rs "10.96.0.1:443/TCP/172.31.10.38:6443" err: device or resource busy E1129 06:28:17.607513 1 graceful_termination.go:183] Try flush graceful termination list err E1129 10:02:17.718458 1 proxier.go:1544] Failed to add destination: 192.100.183.139:3306, error: file exists E1129 10:02:17.718606 1 proxier.go:1544] Failed to add destination: 192.100.183.152:8080, error: invalid argument

BTW. from above my logs , kube-proxy seems like still have many problems with iptables-restore . all my node are UBUNTU 16.04 which have iptable's version 1.6.0.
@m1093782566 @Lion-Wei

We see this happening when upgrading from 1.12.1. it must thus be in the delta of 1.12.1 and 1.12.2.

The issue is graceful termination related and I think we have fixed the bug in v1.13.1.

@m1093782566 will the fix be backported to the 1.12 line? right now this is blocking the 1.12.3 security fix for us.

Got the same with upgrade from 1.11.2 to 1.11.5.

@m1093782566 can you provide more information on the fix? I would be interested to look over the commit.

This is pretty insane, we got bitten really hard earlier today.

We see this happening when upgrading from 1.12.1. it must thus be in the delta of 1.12.1 and 1.12.2.

Got the same with upgrade from 1.11.2 to 1.11.5.

Was the bug chery-picked?

This looks like the PR that fixes the issue: https://github.com/kubernetes/kubernetes/pull/71515

True @Quentin-M

Yes this probably comes from the fix in the delete function (not really related to the title of the PR).

@m1093782566 I can create the cherry-pick for 1.12. Do we want to backport the full PR or only the fix in the delete function? (to avoid changing behavior with UDP flows in 1.12)

Was the bug chery-picked?

@Quentin-M it looks like the bug was introduced by https://github.com/kubernetes/kubernetes/pull/66012 which was cherry-picked to 1.11.5 and 1.12.2 releases:
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.11.md#changelog-since-v1114
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.12.md#other-notable-changes-1

I think all of #71515 should be cherry-picked. Graceful termination overall was only introduced and cherry-picked with #66012 so the behavior change has already happened.

k8s.gcr.io/kube-proxy:v1.13.1-beta.0 still going stale on our clusters. 60% of them got stuck in less than 24 hours. Trying 1.12.1 now.

@Quentin-M : Can you check the logs on kube-proxy and confirm that these types of messages are gone:

Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393664    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.237:6379/TCP/10.125.6.34:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393690    5389 graceful_termination.go:89] Try delete rs "10.128.33.237:6379/TCP/10.125.6.34:6379" err: Failed to delete rs "10.128.33.237:6379/TCP/10.125.6.34:6379", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393697    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.33.237:6379/TCP/10.125.6.34:6379
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420950    5389 proxier.go:1455] Failed to add IPVS service "monitoring/extradata-inserter:": file exists

Both are related to the delete function fixed in the PR. If these are gone and you still see the issue, this fix is not enough

Also, if everything works with 1.12.1, there is a strong probability that the issue comes from graceful termination (#66012)

I think I have an idea: I wonder what happens if we try to add a real-server that is still in graceful termination. In this case we should just increase the weight to 1 and remove it from the gracefultermination list.

I'll try to test this scenario as soon as I can (but it will tricky because I'm at Kubecon next week)

@Lion-Wei what do you think?

Ok scratch that, I just tested and it works fine. Code responsible for it is here: https://github.com/kubernetes/kubernetes/blob/456c351e31517543e0686b2cadf21615d30a738f/pkg/proxy/ipvs/proxier.go#L1538-L1542

In that case, the RS is removed from the gracefulDelete list and from IPVS RS. I wonder if setting the weight back to 1 instead might not be better

I'll try other things based on this part of the logs, which seems to point to trying to sync endpoints for a deleted service

Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420734    5389 proxier.go:1496] Failed to list IPVS destinations, error: invalid argument
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420763    5389 proxier.go:809] Failed to sync endpoint for service: 10.128.33.156:8125/TCP, err: invalid argument
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420889    5389 proxier.go:1485] Failed to get IPVS service, error: Expected only one service obtained=0
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420903    5389 proxier.go:809] Failed to sync endpoint for service: 10.128.35.150:6379/TCP, err: Expected only one service obtained=0
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420950    5389 proxier.go:1455] Failed to add IPVS service "monitoring/extradata-inserter:": file exists
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420961    5389 proxier.go:812] Failed to sync service: 10.128.34.54:80/TCP, err: file exists
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.421886    5389 proxier.go:1544] Failed to add destination: 10.125.27.66:80, error: file exists

Ok after a quick check, the commit that fixes the delete issue is not in v1.13.1-beta.0 nor in v1.14.0-alpha.0 which was cut on November 7th
I think it will be in the next 1.14 alpha release which will be based on master

I'm going to create the cherry pick for 1.13 too

Got an issue with the upgrade from 1.12.0 to 1.13.0, seems like to be related to this one, but not the same.

We cannot fix it by reverting to iptables. Please, help me.

I've posted it on stackoverflow

Got the same on 1.11.5. Temporary reverted from ipvs to iptables.

Same on 1.11.5. Can't roll back to 1.11.4 due to the vulnerability present in it. Had to switch to iptables. Is the fix for this going to be released for 1.11.x?

You can rollback kube-proxy very specifically, without getting affected by the vulnerability.

@Quentin-M Yeah, that thought has struck my mind, but then on the other thought if I did it I'd not have graceful termination, which has been there all along with iptables. Not sure that our customers would like the change.

@Quentin-M does it mean there will be no fix for 1.11.x to address this bug?

@emptywee I created a cherry-pick for 1.11: https://github.com/kubernetes/kubernetes/pull/71848
Hopefully it will get merged very soon and will be in 1.11.6

It has already been merged in 1.12 (will be in 1.12.4) and 1.13 (will be in 1.13.1)

The cherry-pick just got merged in 1.11 too
We just need to wait for the next patch releases in the 3 branches (but if you want to test it before I can easily create a custom kube-proxy build (or, even easier, an hyperkube docker image with the patch)

Thank you very much @lbernail ! A publicly available docker image would be nice to have! I'd love to test it!

The fix is already cherry-picked to v1.11, v1.12 and v1.13.

Would you please test them so that we can close this issue.

@m1093782566 i'm testing this . btw, when will the community publish the small release for this?

Thanks @berlinsaint, please let me the test result.

when will the community publish the small release for this?

The new small release with the fix will be published in less than 2 weeks.

@emptywee : here you go : lbernail/hyperkube:v1.12.4-beta.1
Let us know how it works for you!

@lbernail is it okay to use it with everything else at v1.11.5 ? Or should I bump my kubernetes cluster to v1.12.4 first? Anyway, thanks alot!

@emptywee yes it should work without any problem (I have been running kube-proxy 1.11 on 1.10 clusters for months)

I can also do a 1.11 build later today

@lbernail yeah, would be nice to have an image from the same release version.

@emptywee here you go: lbernail/hyperkube:v1.11.6-beta.1

@m1093782566 @lbernail After one night obversation, the ipvs sync Error log doesn't seen anymore, while
it occurs at least once one night before. But still need more time to verify it, thanks a lot, it made me annoyed long time.

I'll start testing tomorrow on 1.11 and I'll give it a few days too.

@lbernail I tried you lbernail/hyperkube:v1.11.6-beta.1 (Thanks for that!)
We immediately thought things were fixed, as we stopped seeing the can't find the real server log messages. However, after a day and a half, we're seeing issues with stale ipvs entries on 3/6 cluster nodes.
This is the ipvs entries for a service on a broken cluster node.
Before restarting kube-proxy:

-A -t 10.230.65.211:9494 -s rr
-a -t 10.230.65.211:9494 -r 10.230.158.10:9494 -m -w 1
-a -t 10.230.65.211:9494 -r 10.230.158.57:9494 -m -w 1
-a -t 10.230.65.211:9494 -r 10.230.213.245:9494 -m -w 1
-a -t 10.230.65.211:9494 -r 10.230.216.170:9494 -m -w 1
-a -t 10.230.65.211:9494 -r 10.230.216.180:9494 -m -w 0
-a -t 10.230.65.211:9494 -r 10.230.222.151:9494 -m -w 1
-a -t 10.230.65.211:9494 -r 10.230.222.158:9494 -m -w 0
-a -t 10.230.65.211:9494 -r 10.230.222.160:9494 -m -w 1
-a -t 10.230.65.211:9494 -r 10.230.226.82:9494 -m -w 1
-a -t 10.230.65.211:9494 -r 10.230.226.85:9494 -m -w 1
-a -t 10.230.65.211:9494 -r 10.230.226.117:9494 -m -w 0
-a -t 10.230.65.211:9494 -r 10.230.242.229:9494 -m -w 1

After restarting kube-proxy:

-A -t 10.230.65.211:9494 -s rr
-a -t 10.230.65.211:9494 -r 10.230.158.36:9494 -m -w 1
-a -t 10.230.65.211:9494 -r 10.230.213.236:9494 -m -w 1
-a -t 10.230.65.211:9494 -r 10.230.216.129:9494 -m -w 1
-a -t 10.230.65.211:9494 -r 10.230.216.163:9494 -m -w 1
-a -t 10.230.65.211:9494 -r 10.230.226.84:9494 -m -w 1
-a -t 10.230.65.211:9494 -r 10.230.226.124:9494 -m -w 1

@lbernail @bjornryden That was the reason I got into this thread when I switched to IPVS on 1.11.5. Next day I found services in-accessible, so I thought that kube-proxy had hung and wouldn't update entries for services.
That's just to let you know that I had a similar problem (wrong IPs in ipvs table) with v1.11.5 of kube-proxy.

@bjornryden Good news for the deletion errors (this was what the patch was addressing)

The fact that backends are not properly updated is weird because graceful termination should not impact new endpoints (if some stale endpoints were still there for instance I would be less surprised)

Could you share the logs of kube-proxy related to service 10.230.65.211:9494?

Also, any special behavior with this service/deployment? (for instance are pods updated very frequently, or behind an HPA?)

So it's been running for 2 days in my lab cluster with the 1.11.6-beta.1 image by @lbernail in the ipvs mode.
So far so good, it hasn't hung and keeps updating ipvs entries.

One strange observation though:

lvdkbm501 ~ # ipvsadm -L -n -t 10.158.9.61:443
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.158.9.61:443 rr
  -> 10.158.128.28:8443           Masq    1      0          0         
  -> 10.158.130.24:8443           Masq    1      0          0         
lvdkbm501 ~ # ipvsadm -L -n -t 10.158.9.61:443
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.158.9.61:443 rr
  -> 10.158.128.28:8443           Masq    1      0          0         
lvdkbm501 ~ # ipvsadm -L -n -t 10.158.9.61:443
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.158.9.61:443 rr
  -> 10.158.128.28:8443           Masq    1      0          0         
lvdkbm501 ~ # ipvsadm -L -n -t 10.158.9.61:443
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.158.9.61:443 rr
  -> 10.158.128.28:8443           Masq    1      0          0         
  -> 10.158.128.28:8443           Masq    1      0          0         
lvdkbm501 ~ # ipvsadm -L -n -t 10.158.9.61:443
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.158.9.61:443 rr
  -> 10.158.128.28:8443           Masq    1      0          0         
  -> 10.158.129.28:8443           Masq    1      0          0         

First call to ipvsadm shows initial state before I deleted one of the pods for the service. Few next show that one entry is gone, as the pod had been deleted. However, I don't know if it's how it's supposed to be, but it created a second entry with the same IP address as the pod that'd left. And then it got it updated to the correct IP of the new pod that came up to replace the deleted one.

Other than that it's been okay. But the lab cluster isn't as busy as the other ones. The comment from @bjornryden got me a bit worried to try the image in busier clusters, where it can impact our customers.

@emptywee I assume the deleted pod was 10.158.130.24. I don't understand this:

TCP  10.158.9.61:443 rr
  -> 10.158.128.28:8443           Masq    1      0          0         
  -> 10.158.128.28:8443           Masq    1      0          0 

Having the same real server twice doesn't make sense. Can you reproduce it?

The normal behavior on pod deletion should be:

  • if there are no connections to this pod, remove the RS
  • if there are connections, set weight to 0 (established connections continue, no new connections)
  • every minute check all RS with weight 0 (this is what the graceful termination list is for) and remove the ones that don't have connections anymore

@lbernail correct. The deleted pod had that IP address. And yes, it also looked strange to me. I'll try to reproduce when I get a chance. Would you like me to increase log verbosity if I'm able to reproduce the same behavior? If yes, to what level?

@emptywee : it would be great thanks, --v=5 should give us plenty of information
The main thing is I don't even think a RS can appear twice in a VS in IPVS (I never saw that). Which kernel are you running?

@lbernail will set it to --v=5 and will try to reproduce, probably some time over the weekend.

I am on CoreOS stable:

image

4.14 is very recent, so I would be surprised if there was an IPVS issue
I'm going to try and stress a cluster with the patched version to see if I can reproduce the issue from @bjornryden

@lbernail Unfortunately, I could not reproduce that ipvs entry duplication behavior that I've seen only once. Tried multiple times, with different services. However, I haven't seen any entry with weight 0, but this might be explained by the fact that the pods I delete exit very quickly. Also, kube-proxy has been responsive after running for over 5 days. I might give it a try again in the way more busier cluster over here.

@emptywee This is "good" news (even if I'd rather we were able to reproduce it and understand it, maybe it was a display glitch in ipvsadm?)

Weight is only set to 0 if they are some ActiveConn or InActiveConn. To see it simply open a long lived connection to a service before deleting the associated pod (for instance using telnet)

5 days of stability is good news! Let us know how it goes when/if you try on a busier cluster

I have a few small improvements on the way:

Also, 1.11.6, 1.12.4 and 1.13.1 have been released and contain the fix in the delete function

I wrote a short article about the issue over the weekend/. I'll make sure to update it with the fixed version!

@lbernail Sure, will do! Is your image of hyperkube same as the one that's been released as 1.11.6? I have installed your image for kube-proxy only in the more busier cluster and switched to ipvs again for testing. Would you want me to use the official one?

@emptywee My image and the 1.11.6 are basically the same: I built it from the head of the release-1.11 branch a few days before the official release so the difference is a few commits at most, none involving IPVS. So I don't think it's worth the effort of updating your cluster again

Bad news :( just got my kube-proxy failed to update ipvs backends. Switching back to iptables and collecting logs. Had kube-proxy running with --v=5 flag.

@lbernail So, not sure how much of the logs is needed, but we basically encountered a problem with a few services, which did not have their ipvs backends IPs updated. The logs, however, tell us it was updating endpoints properly:

$ cat kube-proxy-issue.log | egrep -e 'c-qa4/ycsdtleuk|10.148.191.69|10.148.192.14|10.148.183.70|10.148.184.67'
I1217 19:12:52.849863       1 graceful_termination.go:160] Trying to delete rs: 10.148.6.202:80/TCP/10.148.192.14:8080
I1217 19:12:52.849948       1 graceful_termination.go:173] Deleting rs: 10.148.6.202:80/TCP/10.148.192.14:8080
I1217 19:12:52.850000       1 graceful_termination.go:160] Trying to delete rs: 10.148.6.202:80/TCP/10.148.191.69:8080
I1217 19:12:52.850069       1 graceful_termination.go:173] Deleting rs: 10.148.6.202:80/TCP/10.148.191.69:8080
I1217 19:12:53.425438       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 19:13:00.501203       1 ipset.go:140] Successfully delete legacy ip set entry: 10.148.191.69,tcp:8080,10.148.191.69 from ip set: KUBE-LOOP-BACK
I1217 19:13:00.543691       1 ipset.go:140] Successfully delete legacy ip set entry: 10.148.192.14,tcp:8080,10.148.192.14 from ip set: KUBE-LOOP-BACK
I1217 19:13:06.916777       1 service.go:309] Adding new service port "c-qa4/ycsdtleuk:http" at 10.148.6.202:80/TCP
I1217 19:13:08.345142       1 proxier.go:1465] Adding new service "c-qa4/ycsdtleuk:http" 10.148.6.202:80/TCP
I1217 19:13:13.939763       1 ipset.go:148] Successfully add entry: 10.148.191.69,tcp:8080,10.148.191.69 to ip set: KUBE-LOOP-BACK
I1217 19:13:14.002663       1 ipset.go:148] Successfully add entry: 10.148.192.14,tcp:8080,10.148.192.14 to ip set: KUBE-LOOP-BACK
I1217 19:27:53.485483       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 19:27:53.485526       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 19:42:53.388638       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 19:42:53.388673       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 19:57:53.643829       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 19:57:53.644032       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 20:12:53.541080       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 20:12:53.541129       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 20:27:53.368129       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 20:27:53.368306       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 20:42:53.539517       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 20:42:53.539542       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 20:57:53.624320       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 20:57:53.624371       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 21:12:53.630561       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 21:12:53.630585       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 21:26:20.897397       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 21:26:20.897463       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 21:26:46.630013       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 21:26:46.630084       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 21:26:57.995544       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.191.69:8080 10.148.192.14:8080]
I1217 21:26:57.995657       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.183.68:8080 10.148.191.69:8080 10.148.192.14:8080]
I1217 21:27:24.296606       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.183.68:8080 10.148.187.41:8080 10.148.191.69:8080 10.148.192.14:8080]
I1217 21:27:53.678172       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.183.68:8080 10.148.187.41:8080 10.148.191.69:8080 10.148.192.14:8080]
I1217 21:28:21.517538       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.183.68:8080 10.148.187.41:8080 10.148.191.69:8080]
I1217 21:28:21.595632       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.183.68:8080 10.148.187.41:8080]
I1217 21:42:53.491595       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.183.68:8080 10.148.187.41:8080]
I1217 21:57:53.492214       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.183.68:8080 10.148.187.41:8080]
I1217 22:05:45.760620       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.183.68:8080]
I1217 22:06:24.890429       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.183.68:8080]
I1217 22:07:15.463144       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.134.231:8080 10.148.183.68:8080]
I1217 22:07:43.201415       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.134.231:8080]
I1217 22:08:18.268005       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.134.231:8080]
I1217 22:08:58.774066       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.134.231:8080 10.148.192.70:8080]
I1217 22:12:53.528515       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.134.231:8080 10.148.192.70:8080]
I1217 22:27:53.522197       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.134.231:8080 10.148.192.70:8080]
I1217 22:29:20.396794       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.134.231:8080 10.148.192.70:8080]
I1217 22:29:43.994895       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.134.231:8080 10.148.192.70:8080]
I1217 22:29:54.382253       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.134.231:8080 10.148.183.70:8080 10.148.192.70:8080]
I1217 22:30:17.895084       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.134.231:8080 10.148.183.70:8080 10.148.184.67:8080 10.148.192.70:8080]
I1217 22:31:45.791871       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.134.231:8080 10.148.183.70:8080 10.148.184.67:8080]
I1217 22:31:45.855011       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.183.70:8080 10.148.184.67:8080]
I1217 22:42:53.652587       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.183.70:8080 10.148.184.67:8080]
I1217 22:57:53.509372       1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.183.70:8080 10.148.184.67:8080]

But when I checked it with ipvsadm, it was still showing the first two: 10.148.191.69:8080 10.148.192.14:8080

Whereas kubectl -n c-qa4 describe svc ycsdtleuk was saying that the endpoints are 10.148.183.70:8080 10.148.184.67:8080, and on other boxes ipvs entries for the same very service were getting updated properly.

$ kubectl -n c-qa4 describe svc ycsdtleuk
Name:              ycsdtleuk
Namespace:         c-qa4
Labels:            country=uk
                   env=qa
                   run=ycsdtleuk
                   stack=c
                   track=4
Annotations:       run: ycsdtleuk
Selector:          load-balancer-ycsdtleuk=true
Type:              ClusterIP
IP:                10.148.6.202
Port:              http  80/TCP
TargetPort:        8080/TCP
Endpoints:         10.148.183.70:8080,10.148.184.67:8080
Session Affinity:  None
Events:            <none>

I was basically looking at the following on two boxes at the same time:

(incorrect ipvs config)
rnqkbm401 ~ # ipvsadm -L -t 10.148.6.202:80
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  rnqkbm401:http rr
  -> 10.148.191.69:http-alt       Masq    1      0          0
  -> 10.148.192.14:http-alt       Masq    1      0          0
(correct ipvs config)
rnqkbw401 ~ # ipvsadm -L -t 10.148.6.202:80
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  rnqkbw401:http rr
  -> 10.148.183.70:http-alt       Masq    1      0          0
  -> 10.148.184.67:http-alt       Masq    1      0          0

There are no obvious error message or anything screaming that would indicate any signs of a problem anywhere. I can send the full log file (18M uncompressed) to you directly, if you'd like.

Long story short: it feels like while the part that listens to services changes is working fine, but the part that is interacting with ipvs module isn't. I am not even seeing it was trying to delete the initial two IP addresses once it added them. Something is getting stuck somewhere.

@emptywee Thank you for the detailed report. Nothing weird in these logs.

What is surprising is that syncProxyRules is run at least every 30s (default) and should reconcile IPVS real servers with endpoints. It calls syncEndpoints which triggers AddRealServer and ipvsHandle.NewDestination

All this happens before the deletion of old endpoints so shouldn't be impacted by graceful termination

Could you send me the full logs so I can see exactly what happens between 21:26 and 21:29?

Two additional questions:

  • a single kube-proxy was impacted, right?
  • was a single service impacted (did the other continue to update normally?)

Next time, try sending SIGABRT to the process, so that it prints tasks states.

When I had the problem, it was always doing something inside libnetwork/ipvs/netlink.go, which leads me to suspect that the netlink system call gets stuck for some reason.

Sorry about my absence, but weekend and then had to focus on reverting to iptables for our setup.

Looking back at the requested logs for the service, @lbernail , we have a lot of these:

I1214   11:48:44.889081       1   graceful_termination.go:160] Trying to delete rs:   172.17.0.1:31663/TCP/10.230.213.204:9300
I1214 11:48:44.889099       1 graceful_termination.go:173]   Deleting rs: 172.17.0.1:31663/TCP/10.230.213.204:9300
I1214 11:48:44.889113       1 graceful_termination.go:160] Trying   to delete rs: 172.17.0.1:31663/TCP/10.230.226.83:9300
I1214 11:48:44.889130       1 graceful_termination.go:173]   Deleting rs: 172.17.0.1:31663/TCP/10.230.226.83:9300
...

We also saw these:

I1214 12:15:44.920364       1 graceful_termination.go:160] Trying to delete rs: 10.230.65.211:9494/TCP/10.230.226.124:9494
I1214 12:15:44.920489       1 graceful_termination.go:173] Deleting rs: 10.230.65.211:9494/TCP/10.230.226.124:9494
I1214 12:15:44.920512       1 graceful_termination.go:93] lw: remote out of the list: 10.230.65.211:9494/TCP/10.230.226.124:9494

Just as others have reported, this only happened on some nodes. By the time I found the problem, we had 3 cluster nodes that was broken for this specific service.

@lbernail Absolutely. Here's the requested piece of the logs, I have included the startup piece as well: https://gist.github.com/emptywee/6f7e9c9e43f10288950d1d8420c038f0

Let me know if you need the whole file. If yes, where you'd like me to send it to, or upload to.

To answer your questions:

  1. The cluster has over 60+ nodes, I checked on a few other nodes, the ipvs entry was correct. Not sure about all entries though. But on the master node, thru which we route all traffic to services, at least two services were not getting updated properly. There could be more, but once we started to get complaints from customers, we immediately heard about two services being not accessible. I collected logs and restarted kube-proxy to minimize impact. I compared the state of the ipvs entries on that master node and one of the worker nodes as part of troubleshooting and realized that it was different. There's a possibility that other nodes could have different entries not updated properly, but since traffic doesn't go via them, we did not notice any impact.

  2. No, at least two services were impacted. Possibly more, but I did not check for that.

Two interesting things from the logs:

  • I1217 21:26:57.995657 1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.183.68:8080 10.148.191.69:8080 10.148.192.14:8080] is never processed by IPVS but no error because you keep the two old RS (we can't be sure, adding a RS does not generate log)
  • I1217 21:28:21.517538 1 endpoints.go:234] Setting endpoints for "c-qa4/ycsdtleuk:http" to [10.148.183.68:8080 10.148.187.41:8080 10.148.191.69:8080] does not trigger the deletion call for 10.148.192.14:8080 (we should have logs there)

@emptywee was IPVS working for you before graceful termination or were you using iptables?

@m1093782566 / @Lion-Wei any idea what could be going here?

@lbernail We waited until we upgrade to 1.11 to switch to ipvs (I believe it's when the GA release happened for this feature) and we upgraded straight to 1.11.5 and it was our first attempt to start using ipvs.

@emptywee : I'm going to add some more Info level logs to try and pinpoint the issue. Would you be ok to run a more verbose version for a few hours? (I completely understand if you'd rather not, but I haven't been able to reproduce on my side yet)

We have been using IPVS since May and had no issues in 1.12.1 and 1.11.1.

We also saw this happen on several, but by far not all nodes, after upgrading to 1.12.3. The affected nodes stopped updating IPVS a few hours after the upgrade, but not in a coordinated way that I could discern.

One thing to note is that kube-proxy also stopped logging completely at that time, so it appears like it got completely stuck.

Log file showing it get stuck, and a goroutine dump after SIGQUIT (lightly redacted)

@matthiasr's log file again shows a goroutine stuck reading from netlink socket. Same as me.

You could argue that there is some logic error causing too many interactions with netlink socket. But if the communication with netlink is not solid, ipvs will get stuck eventually. If the logic error gets fixed, it might take a lot longer to reach this state, but it doesn't mean it cannot happen given enough time.

That's why I would advise to have some sort of timeout in netlink interactions; if the operation times out, close the netlink socket and open a new one automatically, as a recovery mechanism.

@matthiasr OK that's good news (we've also been using IPVS for more than 6 months without any issue but haven't updated to versions using graceful termination on our busiest clusters yet)

Looks like we may have a deadlock somewhere

@lbernail while it's a bit cumbersome to switch the whole cluster to ipvs, if this is going to help nailing this issue down, I am down. Tell me the image name to use and I'll do whatever in my power to help you with this.

We've reverted to IPVS on 1.11.6 to see if we can recreate our problem. Logging set to 5 and running some activity in the cluster over the next days. I won't be working tomorrow, but I will pop online to see if there's an image with increased logging that I should change to.

@matthiasr : 1.12.3 definitely has a problem (see the rest of this thread). Part of it is solved in 1.12.4. You will get a lot less errors in logs. All these should disappear: Failed to add destination: 10.zz.yy.xx, error: file exists

We still have the issue where kube-proxy seems to hang an no longer updates real-servers.
I looked into it a bit more:

  • from @emptywee it seems that after the issue has started we don't see any logs from either proxier.go, ipset.go or graceful_termination.go, which seems to indicate that we are stuck somewhere. @emptywee can you check if you have logs for these files after 21:30?
  • in the go routine dump, there are two syscall receiving on a netlink socket. So we could have a problem there. In addition, to support graceful termination libnetwork/ipvs was upgraded so the behavior may have changed. However, libnetwork code seems to set send/rcv timeouts on the netlink socket (and if it was stuck we should see the number of minutes it has been): https://github.com/docker/libnetwork/blob/a9cd636e37898226332c439363e2ed0ea185ae92/ipvs/ipvs.go#L77-L105

Interestingly only the new version of libnetwork has netlink timeouts:

libnetwork was upgraded in all release branches (1.11, 1.12 and 1.13) to allow for graceful termination. we went from: ba46b928444931e6865d8618dc03622cac79aa6f to a9cd636e37898226332c439363e2ed0ea185ae92

In the older version, the code was: https://github.com/docker/libnetwork/blob/ba46b928444931e6865d8618dc03622cac79aa6f/ipvs/ipvs.go#L65-L87

The only differences in the IPVS code in libnetwork between these 2 commits are:

  • introduction of the timeout
  • addition of ActiveConnections and InactiveConnections to Destination (which was required for graceful termination)

I wonder if this might be related.

After putting some more thoughts on the issue, I think it comes from the fact that we now have two goroutines using the same netlink socket (because the gracefulTerminationManager runs in its own goroutine and shares the utilipvs.Interface with the proxier).

We could probably add a mutex in all calls in pkg/util/ipvs/ipvs_linux.go to make sure a single goroutine is using the netlink socket at any given time.

@m1093782566 what do you think?

@lbernail No, there's no entries from those files anywhere after 21:30.
Here's the last entries for various files:

Last entries from graceful_termination.go

$ cat kube-proxy-issue.log | grep 'graceful_termination.go' | tail
I1217 21:18:20.757262       1 graceful_termination.go:66] Adding rs 10.148.11.105:8080/TCP/10.148.174.35:8081 to graceful delete rsList
I1217 21:18:52.267340       1 graceful_termination.go:160] Trying to delete rs: 10.148.0.145:80/TCP/10.148.188.28:8080
I1217 21:18:52.267456       1 graceful_termination.go:173] Deleting rs: 10.148.0.145:80/TCP/10.148.188.28:8080
I1217 21:18:52.267638       1 graceful_termination.go:160] Trying to delete rs: 10.148.0.145:80/TCP/10.148.145.194:8080
I1217 21:18:52.267755       1 graceful_termination.go:173] Deleting rs: 10.148.0.145:80/TCP/10.148.145.194:8080
I1217 21:18:52.405770       1 graceful_termination.go:160] Trying to delete rs: 10.148.0.145:80/TCP/10.148.188.28:8081
I1217 21:18:52.405906       1 graceful_termination.go:173] Deleting rs: 10.148.0.145:80/TCP/10.148.188.28:8081
I1217 21:18:52.406052       1 graceful_termination.go:160] Trying to delete rs: 10.148.0.145:80/TCP/10.148.145.194:8081
I1217 21:18:52.406156       1 graceful_termination.go:173] Deleting rs: 10.148.0.145:80/TCP/10.148.145.194:8081
I1217 21:18:53.240664       1 graceful_termination.go:160] Trying to delete rs: 10.148.11.105:8080/TCP/10.148.174.35:8081

Last entries from proxier.go:

$ cat kube-proxy-issue.log | grep 'proxier.go' | tail
I1217 21:18:52.570282       1 proxier.go:1023] Port "nodePort for kube-system/kube-dns:dns-tcp" (:30054/tcp) was open before and is still needed
I1217 21:18:52.663609       1 proxier.go:1023] Port "nodePort for ingress-nginx/ingress-nginx:http" (:31513/tcp) was open before and is still needed
I1217 21:18:53.241144       1 proxier.go:1465] Adding new service "c-qa4/yardchange-ws:http" 10.148.1.136:80/TCP
E1217 21:18:53.241277       1 proxier.go:1467] Failed to add IPVS service "c-qa4/yardchange-ws:http": file exists
E1217 21:18:53.241350       1 proxier.go:821] Failed to sync service: 10.148.1.136:80/TCP, err: file exists
I1217 21:18:53.241624       1 proxier.go:1465] Adding new service "c-qa4/receiving-ws-v2-uk:http-admin" 10.148.13.141:8080/TCP
E1217 21:18:53.241723       1 proxier.go:1467] Failed to add IPVS service "c-qa4/receiving-ws-v2-uk:http-admin": file exists
E1217 21:18:53.241761       1 proxier.go:821] Failed to sync service: 10.148.13.141:8080/TCP, err: file exists
E1217 21:18:53.242506       1 proxier.go:1496] Failed to get IPVS service, error: Expected only one service obtained=0
E1217 21:18:53.242558       1 proxier.go:818] Failed to sync endpoint for service: 10.148.11.84:8080/TCP, err: Expected only one service obtained=0

Last entries from ipset.go:

$ cat kube-proxy-issue.log | grep 'ipset.go' | tail
I1217 21:17:14.768694       1 ipset.go:148] Successfully add entry: 10.148.134.229,tcp:8080,10.148.134.229 to ip set: KUBE-LOOP-BACK
I1217 21:17:14.774040       1 ipset.go:148] Successfully add entry: 10.148.134.229,tcp:8081,10.148.134.229 to ip set: KUBE-LOOP-BACK
I1217 21:17:26.643709       1 ipset.go:148] Successfully add entry: 10.148.172.59,tcp:8080,10.148.172.59 to ip set: KUBE-LOOP-BACK
I1217 21:17:26.648504       1 ipset.go:148] Successfully add entry: 10.148.172.59,tcp:8081,10.148.172.59 to ip set: KUBE-LOOP-BACK
I1217 21:17:31.049667       1 ipset.go:148] Successfully add entry: 10.148.136.14,tcp:8080,10.148.136.14 to ip set: KUBE-LOOP-BACK
I1217 21:17:31.053901       1 ipset.go:148] Successfully add entry: 10.148.136.14,tcp:8081,10.148.136.14 to ip set: KUBE-LOOP-BACK
I1217 21:18:18.920550       1 ipset.go:140] Successfully delete legacy ip set entry: 10.148.181.11,tcp:8080,10.148.181.11 from ip set: KUBE-LOOP-BACK
I1217 21:18:18.924752       1 ipset.go:140] Successfully delete legacy ip set entry: 10.148.181.11,tcp:8081,10.148.181.11 from ip set: KUBE-LOOP-BACK
I1217 21:18:21.232725       1 ipset.go:140] Successfully delete legacy ip set entry: 10.148.174.35,tcp:8080,10.148.174.35 from ip set: KUBE-LOOP-BACK
I1217 21:18:21.237619       1 ipset.go:140] Successfully delete legacy ip set entry: 10.148.174.35,tcp:8081,10.148.174.35 from ip set: KUBE-LOOP-BACK

They all ended logging at around 21:18... Does it give any clues?

@emptywee Yes it confirms the hypothesis that we have a deadlock between the two goroutines sharing the netlink socket. They are both stuck waiting to receive messages and keep retrying without success (due to the 3s receive timeout).

I'll try to validate the hypothesis and will start working on a fix as soon as I can

I put 1.12.4 on a test cluster and began redeploying an app, that reproduced the issue in 1.5 hours, so I think I will be able to test a potential fix fairly reliably.

There are again 2 goroutines (1 and 107) hanging on a netlink Receive.

I have a potential fix for the bug, the code is available here: https://github.com/DataDog/kubernetes/commit/9121f0122d805f55f85f42d26c294898fe29a92f
The idea is to protect all netlink calls with a Mutex to avoid the deadlock. I'm not sure this is the best solution but it could solve the issue (and if it works confirm that we've identified the bug).

If you want/can test it, I created two images with the fix:

If you test, let us know if it works and if it doesn't, please share interesting logs and look at the goroutines to see if some seem stuck in netlink calls.

(I did some testing on my side but limited to a small cluster)

I might give it a try on Monday, Dec 24th!

Giving it a go today. Will report results, if any.

This is a bug. I have switched to iptables mode.

I've also upgraded to the beta version and starting our load tests again shortly. We usually break quickly (less than a day) and quite consistently... fingers crossed.

@li-sen definitely a bug. We are working on it

@emptywee and @bjornryden thanks a lot for testing!

@lbernail my pleasure! So it's been about two days since I launched kube-proxy with the 1.11.7-beta.1 image. It hasn't hung on us yet. I can see proxier.go and other goroutines are still alive and printing to the logs. I'll give it some more time since the last couple of days were really slow and quiet, not a lot of deployments going on. Either way, I have a feeling the mutexes that @lbernail added are helping :)

I'll update this issue/thread by the end of the week.

@lbernail

Would you please raise a PR for fixing this issue?

/reopen
(to keep an eye on feedback from tests and before creating cherry-picks if everything works ok)

@lbernail: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen
(to keep an eye on feedback from tests and before creating cherry-picks if everything works ok)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@m1093782566 can you reopen the issue until we are completely sure it is fixed?

I have the same problem of Kube-proxy ipvs, version 1.12.3.
The real servers are wrong and Kube-proxy can't refresh the ipvs rules.

Here are last logs and it stuck here, didn't print new logs anymore.

27 23:31:24.113927       1 graceful_termination.go:93] lw: remote out of the list: 10.96.0.1:443/TCP/10.6.40.32:6443
E1227 23:31:24.113935       1 graceful_termination.go:183] Try flush graceful termination list err
I1227 23:32:24.114182       1 graceful_termination.go:160] Trying to delete rs: 10.96.0.1:443/TCP/10.6.40.32:6443
E1227 23:32:24.114316       1 graceful_termination.go:89] Try delete rs "10.96.0.1:443/TCP/10.6.40.32:6443" err: Failed to delete rs "10.96.0.1:443/TCP/10.6.40.32:6443", can't find the real server
I1227 23:32:24.114332       1 graceful_termination.go:93] lw: remote out of the list: 10.96.0.1:443/TCP/10.6.40.32:6443
I1227 23:32:24.114348       1 graceful_termination.go:160] Trying to delete rs: 10.96.0.1:443/TCP/10.6.40.33:6443
I1227 23:32:24.114397       1 graceful_termination.go:160] Trying to delete rs: 10.6.40.25:32444/TCP/10.244.20.41:32444
E1227 23:32:24.114446       1 graceful_termination.go:183] Try flush graceful termination list err
I1227 23:33:24.114853       1 graceful_termination.go:160] Trying to delete rs: 10.96.0.1:443/TCP/10.6.40.32:6443
E1227 23:33:24.115043       1 graceful_termination.go:89] Try delete rs "10.96.0.1:443/TCP/10.6.40.32:6443" err: Failed to delete rs "10.96.0.1:443/TCP/10.6.40.32:6443", can't find the real server
I1227 23:33:24.115106       1 graceful_termination.go:93] lw: remote out of the list: 10.96.0.1:443/TCP/10.6.40.32:6443
I1227 23:33:24.115128       1 graceful_termination.go:160] Trying to delete rs: 10.96.0.1:443/TCP/10.6.40.33:6443

@VincentFF : there are two issues:

  • failed deletions: fixed in 1.12.4
  • kube-proxy getting stuck due to a deadlock (which is the issue you are describing): hopefully fixed with the PR from yesterday. We are waiting for feedback from @emptywee and @bjornryden who are currently testing the patch. As soon as we have confirmation that the issue is not present anymore I'll create the cherrypicks for 1.13, 1.12 and 1.11 (if the patch works, it will probably be merged in all branches at the beginning of next week and will be part of the next patch releases)

I'm also re-running my experiment https://github.com/kubernetes/kubernetes/issues/71071#issuecomment-449402484 again, with the patch back-ported onto 1.12.4. Will report back!

We haven't had any issues with accessing services while kube-proxy was(and is) running in the IPVS mode and we've had quite a few deployments and changes to the services. So far I'd say the patch is working well for us.

@lbernail Great job on fixing it! Much appreciated!

Ok let's wait until monday so we also get feedback from @matthiasr and @bjornryden (I want to avoid rushing the cherry-picks and make sure that we have a clean fix this time)

@lbernail Good news, thanks to your reply.

No issues after 12+ hours on 1.12.4 with https://github.com/DataDog/kubernetes/commit/9121f0122d805f55f85f42d26c294898fe29a92f – without that I got a lockup in 1.5 hours, so this looks very good!

@lbernail I have another question, please. If the next patch is version 1.12.5?
If I just upgrade kube-proxy to 1.12.5, other units(apiserver, kubelet, controller-manager, scheduler ...) keep 1.12.3, is it ok? Or I have to upgrade all units?

@VincentFF Hopefully it will make it to 1.12.5 yes. You can run kube-proxy 1.12.5 with other units running 1.12.3, no problem

Update: after 24 hours, still no issues, I'm stopping my experiment now & consider this a full success. Looking forward to 1.12.5!

Ok this is very good news. I'm going to create the cherry-pick for all branches in order to get the fix in official releases as soon as possible.
Many thanks for your involvement in helping solve the issue and testing!

I can join the choir of happy testers. 4 days of no trouble happening.

Thanks a lot to all involved, especially @lbernail. Happy New Year!

I build it and deploy to test env, over 24 hours have no issues. Happy new year.

Hello, My env is
Ubuntu 16.04.4 LTS
Kubernetes version (use kubectl version): v1.13.0
Cloud provider or hardware configuration: bare metal, 64 bit linux
Kernel 4.4.0-75-generic

same bug:

I0101 22:20:09.742898 267576 graceful_termination.go:160] Trying to delete rs: 10.200.198.70:9090/TCP/172.200.101.32:9090
E0101 22:20:09.742975 267576 proxier.go:1519] Failed to get IPVS service, error: Expected only one service obtained=0
E0101 22:20:09.743036 267576 proxier.go:843] Failed to sync endpoint for service: 10.200.36.192:80/TCP, err: Expected only one service obtained=0

I need restart kube-proxy fix it

@daigong v1.13.1 have fixed this. In my case, no issues after more than one week.

@daigong : the full fix will actually be in 1.13.2 (it depends on how fast #72426 gets merged)

@lbernail @FrostyLeaf
thanks , I use #72426 to test this bug

Anything is ok now

The PR has been merged in all release branches, so it will be present in:

  • 1.11.7
  • 1.12.5
  • 1.13.2

@annProg

same problem ,both 1.12.2 and 1.13.0-beta1. @m1093782566 @Lion-Wei

你邮箱多少啊啊,,认识下。。少年

一样一样

any chance for a new build in v1.11.x ?

I've been refreshing the releases pages pretty much every hour for the last couple of days in the hope 1.12.5 will appear so i can patch this bug on multiple clusters. :-)

@verwilst haha, same, can't wait for v1.11.7 to get released to upgrade current prod clusters. And v1.12.5 too to start upgrading in lower envs.

I've been refreshing the releases pages pretty much every hour for the last couple of days in the hope 1.12.5 will appear so i can patch this bug on multiple clusters. :-)

Its here :-P

I've been refreshing the releases pages pretty much every hour for the last couple of days in the hope 1.12.5 will appear so i can patch this bug on multiple clusters. :-)

Its here :-P

Why is it not there for 1.11? :(

Was this bug fixed in 1.11.7? I didn't see any release notes.

Yes it is https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.11.md#v1117:

Fix race condition introduced by graceful termination which can lead to a deadlock in kube-proxy (#72361, @lbernail)

Yes I forgot to add "IPVS" in the release-note in the PR, sorry about that
Let us know if it works ok now! (we'll leave the issue open for a weeks just in case)

We've had no issues since upgrading to 1.12.5 🎉

Great news!

@m1093782566 , @thockin I think we can close this issue (no new reports in about 2 months)

For reference, bug impacts initial releases with graceful termination:

  • 1.11.5 - 1.11.6
  • 1.12.2 - 1.12.4
  • 1.13.0 - 1.13.1

Fix is present after:

  • 1.11.7
  • 1.12.5
  • 1.13.2

/close

@m1093782566: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Can we reopen? We're seeing this issue on 1.12.7.

yes, we are seeing the issues in 1.12.9 too

Are you sure it is the same issue?
Can you post the logs of a kube-proxy instance?

sorry, it may have been a false alert... We see this is the logs and its not the same error

I0709 11:59:48.640888       1 graceful_termination.go:160] Trying to delete rs: 192.168.240.51:80/TCP/172.30.71.181:5050
I0709 11:59:48.640304       1 graceful_termination.go:93] lw: remote out of the list: 192.168.240.51:80/TCP/172.30.106.125:5050

for some reason, when the cluster is under load it couldn't route to some of the cluster service IPs. We are still not sure if it is an IPVS issue.

I can confirm the same log in 1.14.4. When does this log come out?

I0712 01:34:49.925012       1 graceful_termination.go:161] Trying to delete rs: 11.3.0.1:443/TCP/10.0.22.42:443
I0712 01:34:49.925057       1 graceful_termination.go:175] Deleting rs: 11.3.0.1:443/TCP/10.0.22.42:443
I0712 01:35:11.907742       1 graceful_termination.go:161] Trying to delete rs: 11.3.0.10:53/TCP/11.2.5.13:53
I0712 01:35:11.907874       1 graceful_termination.go:175] Deleting rs: 11.3.0.10:53/TCP/11.2.5.13:53
I0712 01:35:11.908006       1 graceful_termination.go:161] Trying to delete rs: 11.3.0.10:53/UDP/11.2.5.13:53
I0712 01:35:11.908059       1 graceful_termination.go:175] Deleting rs: 11.3.0.10:53/UDP/11.2.5.13:53
I0712 01:35:21.992487       1 graceful_termination.go:161] Trying to delete rs: 11.3.0.10:53/TCP/11.2.4.11:53
I0712 01:35:21.992594       1 graceful_termination.go:175] Deleting rs: 11.3.0.10:53/TCP/11.2.4.11:53
I0712 01:35:21.992784       1 graceful_termination.go:161] Trying to delete rs: 11.3.0.10:53/UDP/11.2.4.11:53
I0712 01:35:21.992858       1 graceful_termination.go:175] Deleting rs: 11.3.0.10:53/UDP/11.2.4.11:53
I0712 01:36:29.411592       1 graceful_termination.go:161] Trying to delete rs: 11.3.0.199:44134/TCP/11.2.0.5:44134
I0712 01:36:29.411660       1 graceful_termination.go:175] Deleting rs: 11.3.0.199:44134/TCP/11.2.0.5:44134
I0712 01:36:29.534158       1 graceful_termination.go:161] Trying to delete rs: 11.3.0.17:8085/TCP/11.2.0.3:8085
I0712 01:36:29.534394       1 graceful_termination.go:175] Deleting rs: 11.3.0.17:8085/TCP/11.2.0.3:8085
I0712 01:36:50.745469       1 graceful_termination.go:161] Trying to delete rs: 11.3.0.1:443/TCP/10.0.22.162:443
I0712 01:36:50.745508       1 graceful_termination.go:172] Not deleting, RS 11.3.0.1:443/TCP/10.0.22.162:443: 0 ActiveConn, 12 InactiveConn
I0712 01:37:37.475724       1 graceful_termination.go:161] Trying to delete rs: 11.3.0.1:443/TCP/10.0.22.162:443
I0712 01:37:37.475841       1 graceful_termination.go:172] Not deleting, RS 11.3.0.1:443/TCP/10.0.22.162:443: 0 ActiveConn, 4 InactiveConn
I0712 01:38:37.475956       1 graceful_termination.go:161] Trying to delete rs: 11.3.0.1:443/TCP/10.0.22.162:443
I0712 01:38:37.476201       1 graceful_termination.go:172] Not deleting, RS 11.3.0.1:443/TCP/10.0.22.162:443: 0 ActiveConn, 4 InactiveConn
I0712 01:39:37.476324       1 graceful_termination.go:161] Trying to delete rs: 11.3.0.1:443/TCP/10.0.22.162:443
I0712 01:39:37.476585       1 graceful_termination.go:175] Deleting rs: 11.3.0.1:443/TCP/10.0.22.162:443
I0712 01:39:37.476628       1 graceful_termination.go:94] lw: remote out of the list: 11.3.0.1:443/TCP/10.0.22.162:443
I0712 01:40:36.883345       1 graceful_termination.go:161] Trying to delete rs: 11.3.0.245:5473/TCP/10.0.22.25:5473
I0712 01:40:36.883566       1 graceful_termination.go:175] Deleting rs: 11.3.0.245:5473/TCP/10.0.22.25:5473
I0712 01:40:36.886079       1 graceful_termination.go:161] Trying to delete rs: 11.3.0.17:8085/TCP/11.2.1.3:8085
I0712 01:40:36.886310       1 graceful_termination.go:175] Deleting rs: 11.3.0.17:8085/TCP/11.2.1.3:8085
I0712 01:40:46.927450       1 graceful_termination.go:161] Trying to delete rs: 11.3.0.245:5473/TCP/10.0.22.191:5473
I0712 01:40:46.927525       1 graceful_termination.go:175] Deleting rs: 11.3.0.245:5473/TCP/10.0.22.191:5473
I0712 01:40:56.968884       1 graceful_termination.go:161] Trying to delete rs: 11.3.0.245:5473/TCP/10.0.22.75:5473
I0712 01:40:56.968936       1 graceful_termination.go:175] Deleting rs: 11.3.0.245:5473/TCP/10.0.22.75:5473
I0712 01:41:16.747362       1 graceful_termination.go:161] Trying to delete rs: 11.3.0.1:443/TCP/10.0.23.45:443

These are just information logs:

  • Trying to delete rs : a backend pod from removed, trying to delete it. If it has existing connections, make it enter graceful termination, otherwise delete id
  • Deleting rs : the backend has no connection, removing it from IPVS

@andrewsykim Maybe we could now decrease the level of these logs so they only show above v=5 for instance ?

In https://github.com/kubernetes/kubernetes/pull/78395 we bumped the log level from v=0 to v=2, I think v=5 makes sense. https://github.com/kubernetes/kubernetes/pull/80100 for update

Was this page helpful?
0 / 5 - 0 ratings