Kubernetes: use iptables for proxying instead of userspace

Created on 23 Jan 2015  ·  187 Comments  ·  Source: kubernetes/kubernetes

I was playing with iptables yesterday, and I prototyped (well, copied from Google hits and mutated) a set of iptables rules that essentially do all the proxying for us without help from userspace. It's not urgent, but I want to file my notes before I lose them.

This has the additional nice side-effect (as far as I can tell) of preserving the source IP and being a large net simplification. Now kube-proxy would just need to sync Services -> iptables. This has the downside of not being compatible with older iptables and kernels. We had a problem with this before - at some point we need to decide just how far back in time we care to support.

This can probably be optimized further, but in basic testing, I see sticky sessions working and if I comment that part out I see ~equal probability of hitting each backend. I was not able to get deterministic round-robin working properly (with --nth instead of --probability) but we could come back to that if we want.

This sets up a service portal with the backends listed below

iptables -t nat -N TESTSVC
iptables -t nat -F TESTSVC
iptables -t nat -N TESTSVC_A
iptables -t nat -F TESTSVC_A
iptables -t nat -N TESTSVC_B
iptables -t nat -F TESTSVC_B
iptables -t nat -N TESTSVC_C
iptables -t nat -F TESTSVC_C
iptables -t nat -A TESTSVC -m recent --name hostA --rcheck --seconds 1 --reap -j TESTSVC_A
iptables -t nat -A TESTSVC -m recent --name hostB --rcheck --seconds 1 --reap -j TESTSVC_B
iptables -t nat -A TESTSVC -m recent --name hostC --rcheck --seconds 1 --reap -j TESTSVC_C
iptables -t nat -A TESTSVC -m statistic --mode random --probability 0.333 -j TESTSVC_A
iptables -t nat -A TESTSVC -m statistic --mode random --probability 0.500 -j TESTSVC_B
iptables -t nat -A TESTSVC -m statistic --mode random --probability 1.000 -j TESTSVC_C

iptables -t nat -A TESTSVC_A -m recent --name hostA --set -j DNAT -p tcp --to-destination 10.244.4.6:9376
iptables -t nat -A TESTSVC_B -m recent --name hostB --set -j DNAT -p tcp --to-destination 10.244.1.15:9376
iptables -t nat -A TESTSVC_C -m recent --name hostC --set -j DNAT -p tcp --to-destination 10.244.4.7:9376

iptables -t nat -F KUBE-PORTALS-HOST
iptables -t nat -A KUBE-PORTALS-HOST -d 10.0.0.93/32 -m state --state NEW -p tcp -m tcp --dport 80 -j TESTSVC
iptables -t nat -F KUBE-PORTALS-CONTAINER
iptables -t nat -A KUBE-PORTALS-CONTAINER -d 10.0.0.93/32 -m state --state NEW -p tcp -m tcp --dport 80 -j TESTSVC
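
For anyone puzzling over the --probability values: each statistic rule only sees the traffic that earlier rules did not claim, so the first rule takes 1/3 of everything, the second takes 1/2 of the remaining 2/3 (another 1/3 overall), and the last takes whatever is left. A rough sketch of generating those weights for an arbitrary endpoint list (the endpoint array and per-endpoint chain names here are hypothetical, not part of the prototype above):

ENDPOINTS=(10.244.4.6:9376 10.244.1.15:9376 10.244.4.7:9376)  # hypothetical backends
N=${#ENDPOINTS[@]}
for i in $(seq 0 $((N - 1))); do
  # rule i must claim 1/(N-i) of the traffic that reaches it so that
  # every backend ends up with 1/N of the total
  prob=$(awk -v n="$N" -v i="$i" 'BEGIN { printf "%.5f", 1 / (n - i) }')
  iptables -t nat -A TESTSVC -m statistic --mode random --probability "$prob" -j "TESTSVC_$i"
done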
Labels: priority/awaiting-more-evidence, release-note, sig/network, sig/scalability

All 187 comments

Cool! I think we should definitely get this merged in. On a separate note, I was seeing the proxy eat ~30% of a core under heavy load, I have to believe that iptables will give us better performance than that.

We have to prioritize this - it's almost a total rewrite of kube-proxy and
all the tests thereof. It also has back-compat problems (will not work on
older kernels or older iptables binaries).

Maybe implementing it as a parallel option and slowly migrating makes sense?

I'm trying to coax someone else who doesn't know this code well to learn it
and take it on. I really _want_ to tackle it, but it would be better if
someone else learned this space (not you! :)

That said, you also sent (good) email about the massive P1 list - and I
don't think this is on that list yet.

Is this a P2? Might it be worth making it a P3 for now?

I'm hoping to make it work, but we may yet demote it

Doesn't "hope" equate to a P3 that we'll get to if we can?

From discussion with @thockin: This is a requirement in order to support service port ranges, which aren't required for 1.0, but we would like to support eventually.

@thockin "This has the downside of not being compatible with older iptables and kernels." How 'new' would the kernel have to be?

Not TOO new, but we have some users who REALLY want iptables from 2012 to
work.

@thockin thanks. We are using/testing with RHEL/CentOS 6, for example - so it would be nice if we don't have a hard dependency on recent 3.x kernels.

@pweil- we were discussing this the other day
Well, you do need Docker to run, and at some point we have to cut it off.
The back-rev iptables support will not stop me from (eventually) making
this change, and it's going to sting for some people.

With @thockin 's help, we tried the same with udp.

We created a GCE Kubernetes cluster with 3 sky-dns replication controllers.
On the kubernetes-master, we set the following in iptables:
The dns service ip was 10.0.0.10, and the pod endpoints running dns were 10.244.0.5:53, 10.244.3.6:53, 10.244.0.6:53

iptables -t nat -N TESTSVC
iptables -t nat -F TESTSVC
iptables -t nat -N TESTSVC_A
iptables -t nat -F TESTSVC_A
iptables -t nat -N TESTSVC_B
iptables -t nat -F TESTSVC_B
iptables -t nat -N TESTSVC_C
iptables -t nat -F TESTSVC_C
iptables -t nat -N KUBE-PORTALS-HOST
iptables -t nat -F KUBE-PORTALS-HOST

iptables -t nat -A TESTSVC -m recent --name hostA --rcheck --seconds 1 --reap -j TESTSVC_A
iptables -t nat -A TESTSVC -m recent --name hostB --rcheck --seconds 1 --reap -j TESTSVC_B
iptables -t nat -A TESTSVC -m recent --name hostC --rcheck --seconds 1 --reap -j TESTSVC_C

iptables -t nat -A TESTSVC -m statistic --mode random --probability 0.333 -j TESTSVC_A
iptables -t nat -A TESTSVC -m statistic --mode random --probability 0.5 -j TESTSVC_B
iptables -t nat -A TESTSVC -m statistic --mode random --probability 1.000 -j TESTSVC_C

iptables -t nat -A TESTSVC_A -m recent --name hostA --set -j DNAT -p udp --to-destination 10.244.0.5:53
iptables -t nat -A TESTSVC_B -m recent --name hostB --set -j DNAT -p udp --to-destination 10.244.3.6:53
iptables -t nat -A TESTSVC_C -m recent --name hostC --set -j DNAT -p udp --to-destination 10.244.0.6:53
iptables -t nat -A KUBE-PORTALS-HOST -d 10.0.0.10/32 -p udp -m udp --dport 53 -j TESTSVC
iptables -t nat -A OUTPUT -j KUBE-PORTALS-HOST


kubernetes-master>nslookup kubernetes.default.kuberenetes.local 10.0.0.10

We get a response back!

Great stuff! Just FYI (confirming from our face-to-face conversation), it's not safe to run multiple concurrent iptables commands in general (different chains sounds like it might be OK). iptables is a wrapper around libiptc; see the comment on iptc_commit: http://www.tldp.org/HOWTO/Querying-libiptc-HOWTO/mfunction.html

This was apparently fixed in 2013, but maybe only if you pass --wait (?): http://git.netfilter.org/iptables/commit/?id=93587a04d0f2511e108bbc4d87a8b9d28a5c5dd8

The root cause of this is that iptables effectively calls iptables-save / iptables-restore (at least per chain); I've seen a lot of code which just therefore calls iptables-save & iptables-restore rather than doing things through adds and deletes. I may even have some code to do that I could dig up if that is helpful.

It boggles my mind that there's no way to do CAS or LL/SC sorts of ops.

We should add support for --wait, though it is recent enough that GCE's
debian-backports doesn't have it.

Maybe we should do our own locking inside our code to at least prevent us
from stepping on ourselves.
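
A minimal sketch of that kind of self-locking, assuming a small wrapper script is acceptable (the wrapper name and lock path are made up; --wait is only passed when the installed iptables understands it):

#!/bin/bash
# kube-iptables (hypothetical wrapper): serialize our own iptables calls
# behind a private lock; this does nothing for other programs on the host.
LOCK=/var/lock/kube-iptables.lock
WAIT_FLAG=""
iptables --wait -L -n >/dev/null 2>&1 && WAIT_FLAG="--wait"
(
  flock 200
  iptables $WAIT_FLAG "$@"
) 200>"$LOCK"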

What happens in the case of failures in the middle of creating a bunch of rules?

Fair question - we should probably think really hard about what it means to
encounter an error in the middle of this

@thockin From irc today:

The net.ipv4.conf.all.route_localnet sysctl permits 127.0.0.1 to be the target of DNAT rules. From the docs:

route_localnet - BOOLEAN

Do not consider loopback addresses as martian source or destination
while routing. This enables the use of 127/8 for local routing purposes.
default FALSE
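
For reference, enabling it looks something like this (either form):

sysctl -w net.ipv4.conf.all.route_localnet=1
# or equivalently:
echo 1 > /proc/sys/net/ipv4/conf/all/route_localnet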

Would we integrate this into kubelet, or keep it in a separate daemon? Kubelet already watches services in order to populate env vars.

I would like to keep it as a separate binary. There are reasons why you
might want to run this on other machines (e.g. pet VMs) outside of a k8s
cluster in order to gain access to k8s services.

--brendan

Catching up. Regarding what happens in the case of failures (of which there will be many, trust me), I'm a great fan of the anti-entropy approach - store desired state somewhere, and periodically reconcile desired and actual state (by mutating actual state). In this case, perhaps as simple as:

while (true) {
  actualState = iptablesSave()
  if actualState != desiredState { iptablesRestore(desiredState) }
  sleep_a_while()
}
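
A rough shell rendering of that loop, assuming the desired state is kept as an iptables-restore style file (the file path, interval, and counter normalization are illustrative only):

DESIRED=/etc/kube/desired-nat-rules   # hypothetical iptables-restore format file
normalize() { grep -v '^#' | sed 's/\[[0-9]*:[0-9]*\]/[0:0]/'; }
while true; do
  # compare the live nat table (comments and counters stripped) against the desired file
  if ! iptables-save -t nat | normalize | diff -q - <(normalize < "$DESIRED") >/dev/null; then
    iptables-restore -T nat < "$DESIRED"
  fi
  sleep 60
done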

Agree 100% that's the right way to deal with failure in writing iptables
rules.

That's more or less what happens now, isn't it? For each expected rule,
check if it exists and if not, make it.

I agree there's utility in a separate binary, but maybe we link it into
kubelet (the same way cAdvisor is going) and make it a standalone at the
same time.

I'd be curious to hear what the Kubernetes-Mesos folks have to say about whether node components should be more integrated or modular. @jdef?

[[EDITED]] I really like the modularity of the k8s components, for example running a proxy process separate from a kubelet process. If the proxy fails for whatever reason it doesn't take down the kubelet. That's pretty great since Mesos executors don't have a very graceful failover model right now -- and the kubernetes-mesos framework's executor is a kubelet/executor hybrid. This model also lets me run a proxy service on a mesos master and use it as a round-robin balancer for external clients (as in the getting started guide that we submitted).

In terms of packing/shipping binaries, I think it's pretty useful to have the functionality packed together, as in hyperkube. I've also thought about how to package the kubernetes-mesos framework components into minimal Docker containers. Iptables has external library dependencies and that complicates things. So a nice compromise might be to ship the k8sm framework as a Docker that contains the single hyperkube image - but when that framework executes and starts distributing kubelet-executors across the cluster, it basically ships a hyperkube image that can morph into either a kubelet-executor or proxy - and each process can run directly on the host. This basically does an end-run around the iptables-{binaries,libraries}-in-Docker dependency problem.

+1 for modular functionality, +1 for single binary image

@thockin re: anti-entropy: Ah yes. I see proxier.SyncLoop() does that. In which case isn't the answer to @bgrant0607 's question of Feb 26 that errors can be ignored and will be repaired on the next iteration of SyncLoop() (currently 1 minute)? Or perhaps I'm missing something?

@thockin

  1. Are we worried about black-holing network traffic, or is that something that the service/pod author needs to take care of?

With userspace proxy
Assume Virtual IP 10.0.0.11 has 3 endpoints, say 10.240.1.1, 10.240.1.2, 10.240.1.3.
With a user-space proxy, if one endpoint, say 10.240.1.1, didn't work, the proxy would realize that the tcp connection was not established with 10.240.1.1 and could fall back to one of the other 2 endpoints.

With iptables
When we use iptables, there is no fallback mechanism since kubernetes doesn't realize whether the endpoint worked or not.
We could mitigate this if we had some sort of a healthcheck for the endpoints, that would remove non-responsive endpoints.

Or maybe worrying about non-responsive endpoints is not the responsibility of the kubernetes system, and is instead the responsibility of the pod author?

The user can set up readiness probes, which are performed by Kubelet. Probe failure will cause an endpoint to be removed from the endpoints list by the endpoints controller. The service proxy should then remove the endpoint from the target set.

I've been looking into this for GSoC, and I'm wondering:

So ideally we'd detect whether iptables is sufficiently new to use and otherwise continue using kube-proxy?

From https://github.com/GoogleCloudPlatform/kubernetes/issues/5419 it sounds like this would be the ideal approach, with kubelet determining whether to use iptables or start kube-proxy.

I'm also a bit late to GSoC (I was on spring break), so I'm wondering if I can still submit a GSoC proposal for it tomorrow/later today (aside from the 27th deadline, is this still open)?

@BenTheElder Yes, you have until Friday to submit a proposal. There's one other person potentially interested in this topic, but no specific proposal for it yet.

I'm not worried about kernels older than 2012 as much as I am about OSes without iptables entirely, though those are already broken to some degree.

@bgrant0607 thanks!
I think I may select this issue then. It looks interesting.

Readiness probe will work well for application startup but I'm not sure readiness is suitable for mitigating application failure. Upon pod failure, signal must pass from application -> kubelet -> apiserver -> endpoints_controller -> apiserver -> kube-proxy. I'd be interested in understanding the latency between an application failure and an endpoint removal from the kube-proxy rotation. During this period, requests will be proxied to a non-responsive endpoint.

Retry upon connection failure is a common strategy and a reasonably useful feature of many popular load balancers (e.g. haproxy, AWS ELB), and it is handled by the current implementation of kube-proxy. Should this responsibility be migrated to the external LB? What about intra-cluster traffic?

Another thought: with iptables we will likely encounter issues with graceful connection draining upon reconfiguration, versus an actual LB.

Mike raises good points.

I've been reading through the source for kube-proxy and the proxy pkg; is iptables not already being used in the current revision?

What exactly needs doing on this? From the current master source it looks as if iptables is already being used pretty extensively in proxy.

@mikedanese @thockin Readiness is most useful for planned outages. Unplanned outages will always cause some observable errors. The polling interval should generally be long relative to update latency, but we could also put into place re-forwarding rules on the destination node via direct communication between Kubelet and kube-proxy, if the latency via apiserver and etcd is too long and/or that path is not sufficiently reliable.

@BenTheElder The existing rules route traffic through the proxy. The idea here is to use the rules to bypass the proxy.

@bgrant0607 Thanks, that makes complete sense now. Another read through the source and the design docs and I'm almost done writing a draft proposal.

Draft GSoC Proposal: https://gist.github.com/BenTheElder/ac61900595a7ea9ea9b5

I'd especially appreciate feedback on the schedule section. I'm not quite sure about that.
Should I finish early I'd love to work on some other (smaller?) un-taken GSoC issues like:
https://github.com/GoogleCloudPlatform/kubernetes/issues/1651.

Thanks again, Kubernetes takes the cake for friendliest group.

I just want to say that I'm very happy to say that my proposal has been accepted and that I will be working on this over the summer. :smiley:

I'm very excited. Unfortunately I'm in the midst of my finals right now, but starting sometime this weekend I should be on a lot more and working on it, most likely starting with getting https://github.com/GoogleCloudPlatform/kubernetes/pull/7032 finished.

@thockin @brendanburns Can anyone weigh in on whether we want to do this in parallel to the userspace proxy, or how migrating to the reimplementation would work?

It looks like we already prefer iptables >= 1.4.11 (released 2011-May-26).

// Executes the rule check without using the "-C" flag, instead parsing iptables-save.
// Present for compatibility with <1.4.11 versions of iptables.  This is full
// of hack and half-measures.  We should nix this ASAP.
func (runner *runner) checkRuleWithoutCheck(table Table, chain Chain, args ...string) (bool, error) {

Source: https://github.com/GoogleCloudPlatform/kubernetes/blob/aec41967416cf3463b188d72c97e71465e00719d/pkg/util/iptables/iptables.go#L206

Do we actually see hosts older than that?

One approach would obviously be to detect at run time what version of iptables we're running, and do the "best thing" we can given the version e.g. something like:

if (oldversion) {
  load user space proxying module
} else {
  load iptables proxying module
}

I'd caution against having too many branches in the above if statement (ideally just 2), and avoid as far as possible having this sort of if statement in more than one place in the code.

I haven't sifted through the code in detail to figure out how feasible the above approach is.

Also, do all nodes need to implement the same strategy (user space vs iptables proxying), or can each one decide independently?

If each node decides independently, we potentially increase the test surface area proportional to the square of the number of branches in the above if statement (i.e. source_mode x dest_mode), but if we can keep the number of modes to 2, I think that's fine.

Q
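
A rough sketch of such a runtime check, keyed off the 1.4.11 threshold mentioned above (the --proxy-mode flag is hypothetical here, just to show the shape of the decision):

MIN_IPTABLES=1.4.11   # first release with the -C (check) flag
have=$(iptables --version | awk '{print $2}' | tr -d 'v')
if [ "$(printf '%s\n' "$MIN_IPTABLES" "$have" | sort -V | head -n1)" = "$MIN_IPTABLES" ]; then
  exec kube-proxy --proxy-mode=iptables    # hypothetical flag
else
  exec kube-proxy --proxy-mode=userspace
fi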

Oh, and the answer to your question of whether we see old nodes is "yes", unfortunately.

There is much discussion of the above in a separate issue. I'll try to dig it out for you.

@quinton-hoole Thanks!

Also I'm fairly sure we can run user space on one node and iptables on another.

The number of modes should just be 2, except for that hack for when we don't have -C; the nodes that do have -C should be able to run the pure iptables version (I think).

Ah yes, #7528 discusses kernel versions and such.

Thanks.
I didn't see that when I looked for requirements for kubernetes. The only requirements I found were in the networking docs discussing how we assume unique IPs.

We should probably get some documentation written for requirements once we have a better idea what they are.

I've started hacking on this here: https://github.com/BenTheElder/kubernetes/tree/iptables_proxy

Specifically I've moved the user-space implementation behind an interface here:
https://github.com/BenTheElder/kubernetes/commit/4e5d24bb74aca43b0dd37cf5cfee8a34f8eff2bf

I'm not sure now though if the implementation selection should be in cmd/kube-proxy or in pkg/proxy so I may remove this and have the implementation be selected by kube-proxy instead.

Edit: I think in retrospect that it probably makes more sense to select the implementation from kube-proxy.

@BenTheElder I tested Tim's rules with Calico and they work fine. We do all our work in the filter table, so the DNAT rules here have set the appropriate src IP by that point.

More generally, it would be good to have a discussion to define how network plugins can safely alter iptables if Kubernetes is also going to be inserting rules there. Don't want to trample (or be trampled by) Kubernetes rules if they change down the road.

@Symmetric Yes. I'm not sure about the plugins part yet at all, but that does seem pretty important.

I'm probably going to be a bit busy this weekend but Monday I should be starting to work on this fulltime for GSoC implementing the first iteration.
I'd like to be keeping this in mind while doing so but after looking over the network plugins api I'm not really sure what the cleanest way to handle this is.

Do you have any idea of what you'd want this to look like / how it should work?

FWIW, I did something that works for us because the kube-proxy was becoming nearly unresponsive. It's here: https://github.com/MikaelCluseau/kubernetes-iptables-proxy/blob/master/iptables-routing.rb.

I didn't see this thread before, but I ended up with something close, except that I made a mistake with my random statistics match weights :-)

I would also like to share that we've hit the nf_conntrack_max limit quickly. It should probably be increased:

# cat /etc/sysctl.d/nf_conntrack.conf 
net.netfilter.nf_conntrack_max = 1000000
net.nf_conntrack_max           = 1000000

I may not have understood all the need for iptables, but why not use IPVS instead?
It seems more relevant for proxying than iptables...
Here is a simple go implementation: https://github.com/noxiouz/go-ipvs
And just to complete #561, there is also the ktcpvs project.

IPVS appears to also be an abstraction upon netfilter (like iptables). We are able to share some functionality with the existing code by using iptables; and iptables seems like the more flexible/general solution to managing netfilter.

As for #561 and ktcpvs: ktcpvs doesn't appear to have had any development since 2004 and doesn't appear to have features users would want, like URL rewriting. Regardless, #561 is looking for a generic solution to be usable with pluggable balancers.

Side note: that go project doesn't appear to have a license.

iptables will be deprecated "one day" in favor of nftables (nft cli).
Also, using the iptables CLI to create rules doesn't seem to be very robust...

A quick search found me this other MIT project: https://github.com/vieux/go-libipvs
But it seems to be really quite easy to create a simple working one as all complexity is already bulletproofed inside kernel code.

I doubt iptables will be removed from any of the major distros anytime soon, and the iptables CLI is specifically to create and manage rules for netfilter ... ?

An incomplete cgo wrapper like the one linked seems a lot less safe than shelling out to iptables and iptables-restore; we already need iptables for other rules (e.g. nodeports), and with iptables-restore we can do bulk updates with some atomicity.

IPVS further seems to be designed to be used on a load balancing host machine separately from the "real" servers.
This suggests that to be the only supported usage:

2.2. Gotchas: you need an outside client (the director and realservers can't access the virtual service)

To set up and test/run LVS, you need a minimum of 3 machines: client, director, realserver(s).

From the outside, the LVS functions as one machine. The client cannot be one of the machines in the LVS (the director, or realserver). You need an outside client. If you try to access an LVS controlled service (eg http, smtp, telnet) from any of the machines in the LVS; access from the director will hang, access from a realserver will connect to the service locally, bypassing the LVS.

It also looks like IPVS/LVS adds some extra requirements like a heartbeating daemon and extra monitoring processes. We already handle endpoint information and pod health monitoring etc from within kubernetes.

+1 for the iptables approach. We use iptables extensively in Calico and they have proved to be robust and they perform and scale well (assuming you design your rules well). @BenTheElder, should you need any help with anything of the iptables work then please let us know, because we would be happy to chip in.

+1 for iptables and iptables-restore, it's a much less heavy-handed approach
than IPVS/LVS and dictates fewer system requirements (heartbeating daemon,
etc.)

Thanks Alex, I will let you know if I do.

I could use some feedback/input on the current implementation (https://github.com/GoogleCloudPlatform/kubernetes/pull/9210) if anyone has time.

It's mostly complete and currently fully up to date with the upstream master. I need to finish writing the code that compares the generated rules to iptables-save and restores the counters etc, but the rules are generated and (mostly) work, pretty much following the rules outlined in the OP here, with the biggest change being the chain names, which was necessary for automatic generation of names that iptables will accept.

There is an edge case reported here: https://github.com/BenTheElder/kubernetes/issues/3 that may require a change for handling pods connecting to themselves.

I've had some excellent feedback and discussion with @MikaelCluseau and @Symmetric in particular about doing some redesign to handle this and other things (thanks again!); but we could use some more input on the rule design in particular. If anyone else has time to take a look that would be greatly appreciated as I'm not sure what the best route to take is and I want to avoid making any major changes without more input.

The PR itself is pretty big but the relevant rule generation is all in pkg/proxy/proxieriptables.go syncProxyRules() at: https://github.com/BenTheElder/kubernetes/blob/iptables_proxy/pkg/proxy/proxieriptables.go#L286

The existing discussion can be seen (here of course) as well as in the PR comments and at https://github.com/BenTheElder/kubernetes/issues/3 as well as a bit more at https://github.com/BenTheElder/kubernetes/issues/4.

One other issue that needs input:

In the current code the kube-proxy is still included, to handle ONLY the nodePort case. I think we can do away with kube-proxy in this case too, and have proposed some simple iptables rules to do so on Ben's PR.

But these rules still obscure the source IP of any external LB, so they are not ideal. The problem is that if we just DNAT traffic from the LB when it hits a node, then the response packet could come from a different node, and so I don't think the LB will be able to correlate the response with the originating TCP session. Is this concern valid? The implementation would be simpler if we don't have to worry about this.

I suspect there's some magic we can do to keep HTTP proxies happy, but I don't see a way to make this general at L4.

I'm giving the black magic from your PR a try, but it doesn't generate anything for me. It seems the rules begin to be generated with calls to iptables, then an iptables-restore file is produced.

The header part is missing from the produced file, typically the chains that have been populated with iptables calls. Here is the relevant part of the log:

I0807 11:41:24.560063 8369 iptables.go:327] running iptables -N [KUBE-PORTALS-CONTAINER -t nat]
I0807 11:41:24.562361 8369 iptables.go:327] running iptables -C [PREROUTING -t nat -m comment --comment handle ClusterIPs; NOTE: this must be before the NodePort rules -j KUBE-PORTALS-CONTAINER]
I0807 11:41:24.563469 8369 iptables.go:327] running iptables -N [KUBE-PORTALS-HOST -t nat]
I0807 11:41:24.565452 8369 iptables.go:327] running iptables -C [OUTPUT -t nat -m comment --comment handle ClusterIPs; NOTE: this must be before the NodePort rules -j KUBE-PORTALS-HOST]
I0807 11:41:24.566552 8369 iptables.go:327] running iptables -N [KUBE-NODEPORT-CONTAINER -t nat]
I0807 11:41:24.568363 8369 iptables.go:327] running iptables -C [PREROUTING -t nat -m addrtype --dst-type LOCAL -m comment --comment handle service NodePorts; NOTE: this must be the last rule in the chain -j KUBE-NODEPORT-CONTAINER]
I0807 11:41:24.569564 8369 iptables.go:327] running iptables -N [KUBE-NODEPORT-HOST -t nat]
I0807 11:41:24.571458 8369 iptables.go:327] running iptables -C [OUTPUT -t nat -m addrtype --dst-type LOCAL -m comment --comment handle service NodePorts; NOTE: this must be the last rule in the chain -j KUBE-NODEPORT-HOST]
I0807 11:41:24.573392 8369 iptables.go:327] running iptables -C [POSTROUTING -t nat -m comment --comment handle pod connecting to self -s 10.240.240.78/32 -d 10.240.240.78/32 -j MASQUERADE]
I0807 11:41:24.574447 8369 proxier.go:349] Syncing iptables rules.
I0807 11:41:24.575592 8369 proxier.go:399] Chain: PREROUTING, Rule: :PREROUTING ACCEPT [0:0]
I0807 11:41:24.575615 8369 proxier.go:401] Rule: -A PREROUTING -m comment --comment "handle ClusterIPs; NOTE: this must be before the NodePort rules" -j KUBE-PORTALS-CONTAINER
I0807 11:41:24.575625 8369 proxier.go:401] Rule: -A PREROUTING -m addrtype --dst-type LOCAL -m comment --comment "handle service NodePorts; NOTE: this must be the last rule in the chain" -j KUBE-NODEPORT-CONTAINER
I0807 11:41:24.575633 8369 proxier.go:399] Chain: INPUT, Rule: :INPUT ACCEPT [0:0]
I0807 11:41:24.575646 8369 proxier.go:399] Chain: OUTPUT, Rule: :OUTPUT ACCEPT [0:0]
I0807 11:41:24.575658 8369 proxier.go:401] Rule: -A OUTPUT -m comment --comment "handle ClusterIPs; NOTE: this must be before the NodePort rules" -j KUBE-PORTALS-HOST
I0807 11:41:24.575670 8369 proxier.go:401] Rule: -A OUTPUT -m addrtype --dst-type LOCAL -m comment --comment "handle service NodePorts; NOTE: this must be the last rule in the chain" -j KUBE-NODEPORT-HOST
I0807 11:41:24.575683 8369 proxier.go:399] Chain: POSTROUTING, Rule: :POSTROUTING ACCEPT [0:0]
I0807 11:41:24.575691 8369 proxier.go:401] Rule: -A POSTROUTING ! -d 10.0.0.0/8 -o eth0 -j MASQUERADE
I0807 11:41:24.575699 8369 proxier.go:401] Rule: -A POSTROUTING -s 10.240.240.78/32 -d 10.240.240.78/32 -m comment --comment "handle pod connecting to self" -j MASQUERADE
I0807 11:41:24.575709 8369 proxier.go:399] Chain: KUBE-NODEPORT-CONTAINER, Rule: :KUBE-NODEPORT-CONTAINER - [0:0]
I0807 11:41:24.575720 8369 proxier.go:399] Chain: KUBE-NODEPORT-HOST, Rule: :KUBE-NODEPORT-HOST - [0:0]
I0807 11:41:24.575729 8369 proxier.go:399] Chain: KUBE-PORTALS-CONTAINER, Rule: :KUBE-PORTALS-CONTAINER - [0:0]
I0807 11:41:24.575740 8369 proxier.go:399] Chain: KUBE-PORTALS-HOST, Rule: :KUBE-PORTALS-HOST - [0:0]
I0807 11:41:24.581897 8369 proxier.go:603] Syncing rule: :KUBE-PORTALS-HOST - [0:0]
:KUBE-PORTALS-CONTAINER - [0:0]
:KUBE-NODEPORT-HOST - [0:0]
:KUBE-NODEPORT-CONTAINER - [0:0]
:KUBE-SVC-VO8JL93ZeRSf8cnsLpl - [0:0]
:KUBE-SVC-L26cB3JYuxdW5TF84ct - [0:0]
:KUBE-SVC-j2SF8q3nUajS8vOx2qL - [0:0]
:KUBE-SVC-shln2urO8W1aBiB2bWJ - [0:0]
:KUBE-SVC-8jQ3IvijvhJ4ppFj3Ui - [0:0]
[... SNIP ...]

Merging an iptables-save dump with the result produced in verbose mode can be imported and does the right thing.

@bnprss thanks for the report, there were recently a number of untested changes, including switching to a temp file and using the "-T table" flag for iptables-restore, during some rewrites for the review process. I'll put in a fix once I know what caused the regression(s).

@bnprss Like you said the table header is missing ("*nat" should be the first line), it was mistakenly removed and after putting that back in everything seems to be working fine again with seemingly no other bugs (excluding: https://github.com/BenTheElder/kubernetes/issues/3). Thanks again, sorry about that. I've pushed the fix.

Nice job, the rules are loading and seem to work from the inside, but no luck with the external load balancer: no communication from the outside gets an answer.

Huh. Could you move over to the PR and provide some more details? So far
it's been working well but I don't deploy myself beyond local testing and I
don't think any of the other testers were using an external load balancer.
Yep, for the details I'll conduct further inquiries and will produce some logs or leads on the PR, but not before tomorrow; I'd like to have a good backup job in place before potentially breaking something.

Can you compute the token of the rules without the "-_" separators?

@bnprss, great.
The generated rule chains for services are a hash of the service port / endpoint, which is then base64 url encoded and truncated, prefixed with KUBE-SVC-. The code is here: https://github.com/GoogleCloudPlatform/kubernetes/pull/9210/files#diff-d51765b83fe795b469e8a86276b12dc9R321
We chose this as a way to generate valid chain names that will meet the character limit in iptables while still being deterministic.
So it should be possible to replicate externally.
If you mean, can we stop using separators: we probably could, but the "_" comes from the encoded hash and the "-" are all just following the patterns in rule names from the existing userspace proxy.
We could probably use something else without too much trouble if it was necessary.
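
For the curious, a back-of-the-envelope approximation of that naming scheme in shell, assuming a SHA-256 of the service port name, URL-safe base64, truncated (the exact hash input and truncation length in the PR may differ):

svc="default/my-service:http"   # hypothetical service port name
hash=$(echo -n "$svc" | openssl dgst -sha256 -binary | base64 | tr '+/' '-_' | tr -d '=' | cut -c1-16)
echo "KUBE-SVC-${hash}"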

I'm ok with that, and this is really cosmetic! :)
But this differ from the things I saw before :
gce lb rules : a07f76b3b2ec311e59e2642010af0479
gce fw rules : k8s-fw-a7ecad94f3ba511e59e2642010af0479
gce routing rules : default-route-6973e029b504a0e8
gce routing to node : obfuscated_cluster_node-43506797-2eb2-11e5-9e26-42010af04793

this one is nice:
KUBE-SVC-6ADi2TVfn7mFPvBjC56
those ones are funny :
KUBE-SVC-zU6ParcQ-UfW_LdRDUc
KUBE-SVC-y--z1xTUpHPT6sgAUCC

Yeah I'm not exactly a fan of them either, we could perhaps change the hash
encoding.

Yes, it should be possible to only use the truncated part of the SHA; git is ok with that, docker too, and it seems to be the way the other references to kube entities are made. In case of collision in the generated hash, base64 won't help. ;)
I guess @thockin could advise on that point.

I was more-so concerned with valid characters in iptables chain names, which I had difficulty finding a good reference for. I'll look into this further shortly.

@bnprss fyi the PR is pretty unstable and being redone, per thockin, we're trimming out NodePort etc for now and focusing on getting a simpler, cleaner version with support for portals, then working back up to full parity.

I would _not_ try to run this right now, but it'll be back soon hopefully. Breaking the PR up into some smaller related ones then pushing a clean one with the iptables-proxy stuff right now.

For those of you playing at home, I am confident we can get it up to full
parity, but it will be vastly easier to review in stages :)

On Fri, Aug 7, 2015 at 9:35 PM, Benjamin Elder [email protected]
wrote:

My reply to above comment, also soon to be squashed away:

Discussed in IRC:

  • will still need to handle counters, but want to keep parsing of
    state in util/iptables package.
  • still need hashing or similar to handle chain length limits

Otherwise seems like a very clean simplification, will implement after
some more discussion.


Status: the "main" logic is checked in and flag-gated.

I'm working on node ports now. There are a lot of weirdo cases that need special handling. My notes so far:

# Basic node ports:
iptables -t nat -N KUBE-NODEPORTS
iptables -t nat -A PREROUTING -j KUBE-NODEPORTS
iptables -t nat -A OUTPUT -j KUBE-NODEPORTS
iptables -t nat -A KUBE-NODEPORTS -p tcp -m comment --comment "TEST: default/nodeport:p" -m tcp --dport 30241 -j KUBE-SVC-EQKU6GMUKRXBR6NWW53

# To get traffic from node to localhost:nodeport to the service:
echo 1 > /proc/sys/net/ipv4/conf/all/route_localnet
# Mark packets that are destined for services from localhost, then masquerade those
iptables -t nat -I KUBE-SVC-EQKU6GMUKRXBR6NWW53 -s 127.0.0.0/16 -j MARK --set-mark 0x4b000001;
iptables -t nat -A POSTROUTING -m mark --mark 0x4b000001 -j MASQUERADE

# To get traffic from a pod to itself via a service:
for intf in $(ip link list | grep veth | cut -f2 -d:); do brctl hairpin cbr0 $intf on; done
# Mark packets that are destined for each endpoint from the same endpoint, then masquerade those.
# This is hacky, but I don't really know which pods are "local" and I don't really want to right now. (but I will eventually)
iptables -t nat -I KUBE-SEP-HHNEQBOLY57T5MQCFIY -s 10.244.1.6 -j MARK --set-mark 0x4b000001

Been working on a contrib tool for the testing.
So far I'm thinking I'll fire up a server on a node, time the latency of requests to it, see about getting the kube-proxy resource load, and dump the data to CSV for graphing etc.
Hopefully done before Friday, getting more familiar with kubectl right now.

@BenTheElder I just did some reasonably detailed network perf measurements on GCE -- I recommend taking a look at netperf (qperf also gives latency measurements).

netperf is a client/server perf tool; I've packaged both the client and the server in the docker container paultiplady/netserver:ubuntu.2. There are lots of options on netperf, but something like spinning up two netserver pods and running

kubectl exec -t $netserver-pod-1 -- netperf -l 30 -i 10 -I 99,1 -c -j -H $netserver-pod-2-ip -t OMNI -- -T tcp -D -O THROUGHPUT,THROUGHPUT_UNITS,MEAN_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY,LOCAL_CPU_UTIL

should give you a decent spread of stats including latency and throughput. You can run the netserver container using docker run --net=host to do node->pod tests too.

The dockerfile for this container is pretty simple, I can fire it over if you want to extend it into something leaner (e.g. an alpinelinux-based container for quicker pulling).

Thanks I'll look into that.

From this comment though I think we want to do some kind of service request latency test. Right now I'm trying to use the standard nginx container as node X and working on having a test pod on node Y time hitting it repeatedly so we can build a graph.

I'll look at netperf/qperf though, and we can always have multiple tests.
I'd like to get that graph done first though per previous discussion with @thockin

regarding node ports: In #9210 @Symmetric brought this case:

If the traffic flows:
LB -> node1:nodePort
And the service pod is on node2, then the full flow will be:
LB -> node1:nodePort -> node2 -> pod:svcPort
The srcIP will still be the LB, so the response will go
pod -> node2 -> LB
Since node2 can route directly to the LB.

Now we lose the opportunity to un-DNAT to restore the correct source IP for the return packet (that can only happen on node1).

I have reproduced the problem. ACK that it is a real problem. tcpdump shows the packets being DNAT'ed to the (off-machine) pod IP:port, with src intact, but tcpdump on the destination machine does not show anything. I'm not sure what I would expect to happen even if the packets did get there.

I think the only solution is to SNAT. The least-impactful solution would be to _only_ SNAT packets from the LB that are destined off-node, but a) I don't have that info in kube-proxy (could get it at the cost of code) and b) since any policy will have to consider the SNAT case anyway, I can simplify by always SNATing external LB packets. How bad is that for policy engines?

Eventually LBs will be smart enough to only target hosts with pods and traffic will stay local, and then this will be moot.

It gets more complicated though. We have the deprecatedPublicIPs field which we will probably un-deprecate with some tweaks to behavior. I guess we need to do the same for those. But it gets even more complicated - I don't actually KNOW all the public IPs (e.g. the VM has a 1-to-1 NAT external IP). Easy answer - always SNAT node-port packets. What do you think?

I'll test more tomorrow.
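
For concreteness, "always SNAT node-port packets" could reuse the mark-then-masquerade pattern from the notes above, something like (the mark value is illustrative):

# mark everything that enters via the node-port chain, then masquerade it on the
# way out so replies are forced back through the node that accepted the connection
iptables -t nat -I KUBE-NODEPORTS -j MARK --set-mark 0x4b000002
iptables -t nat -A POSTROUTING -m mark --mark 0x4b000002 -j MASQUERADE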

@BenTheElder You could make the netserver pod a service, so that traffic from perf <->
server is going via the service VIP. That way you don't have to do the
sampling/latency calculations yourself...

True. I think @thockin mentioned eventually wanting an e2e latency test as
well. Time permitting there will be a number of different tests and we'll
probably have to account for gce vs AWS etc.
@Symmetric the netperf tests are working nicely. Thanks for the suggestion :-)

I'd like to revisit testing with a "real" load like a web-service later possibly, but after getting the args right, it's giving very nice data so far. I'll post results later when I'm finished cleaning things up.

Glad to hear that's working for you -- there are a bewildering number of
options on that tool, but it has proven very useful for my profiling work.
Definitely better than iperf...

@thockin I think we can live with SNAT for LB traffic. My current thinking is that you'll need to specify a pod's access policy as one of:

  • default is 'allow from [my namespace]', in which case LB packets are dropped
  • 'allow from [list of namespaces]', or 'allow from [all namespaces in the cluster]', again LB packets are always dropped
  • 'allow from all', in which case we don't care if it's from a LB, other node, or wherever

So losing the source IP for LBs only doesn't actually cost us much.

If we can guarantee that the LB is hitting the right node for the service pod, that would be great -- in that case we don't need SNAT, and we can run a tighter ship by whitelisting LB IPs when they are provisioned on a service, and dropping the traffic otherwise.

Regarding publicIPs, I think they will have the same considerations as nodePort, and so we'll need to SNAT them until LBs can hit the right hosts. Which per the above is fine, unless I'm missing some way that they are more evil than nodePort...

As a safety measure, it could be very useful to actually include a flag for the proxy to MASQUERADE everything (acting very close to a userspace proxy). I think it's not very hard to do and a very good way to diagnose, or even fall back to, in case of problems (I'm thinking about vxlan cases).
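
In concrete terms, such a "masquerade everything" escape hatch could look roughly like this (a sketch only, not the actual kube-proxy rules; 10.0.0.0/16 stands in for the service CIDR):

# mark any new connection headed for a service IP...
iptables -t nat -A PREROUTING -d 10.0.0.0/16 -m state --state NEW -j MARK --set-mark 0x4000
# ...and masquerade everything carrying that mark on the way out, so replies
# always come back through the proxying node, as with the userspace proxy
iptables -t nat -A POSTROUTING -m mark --mark 0x4000 -j MASQUERADE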


@MikaelCluseau that's not a bad idea - can you please open a new issue on that specifically, so I don't lose track of it?

Still TODO: fix hairpin, e2e, enable by default

Hi Tim, sorry for not coming back to you, but we had some mess to handle here... I'll pick up the hairpin fix next Saturday, I think.

This was a note for myself - were you planning to tackle some of this? :)

Yes, sure, as I said when we were speaking about e2e testing. I'm going to help; Kubernetes is of great help to me, so I'd better master it as much as possible, and what's better than taking bugs? :-) Feel free to suggest anything of higher priority, but I think hairpin is quite good for a start. It should take place in the kubelet and have a flag to enable it (disabled by default at first). I'll try to work 0.5 to 1 day a week.

AFAIK the only part of this left to do is make it the default which can happen (assuming no blow-ups) some time after v1.1 and this has some miles on it.

Whoo!


some time after v1.1 and this has some miles on it.

Ouch. We were really counting on it for 1.1....
https://github.com/kubernetes/kubernetes/blob/master/docs/roadmap.md

@bgrieder you still can enable it through parameter.

It's IN but not on by default. You can opt-in with a single annotation per
node (and a kube-proxy restart)


@thockin @bnprss ok, but we expect version 1.1 to run on Google Container Engine after release. I wonder what sort of flexibility we will have to 'opt-in with a single annotation per node'. Could you please give us some details on what the process will be, or point us to some documentation?

Once upgraded to 1.1:

$ for node in $(kubectl get nodes -o name); do kubectl annotate $node net.beta.kubernetes.io/proxy-mode=iptables; done

Then SSH to each node and restart kube-proxy (or reboot each node).

If you want to be more cautious, do one or two nodes and then try it out :)
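
For a single-node trial, that could look something like the following (illustrative only: the node name is made up, and the restart command assumes kube-proxy runs as a systemd unit, which depends on how your nodes are provisioned):

kubectl annotate node my-node-1 net.beta.kubernetes.io/proxy-mode=iptables
ssh my-node-1 'sudo systemctl restart kube-proxy'   # or simply reboot the node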

I've marked this issue as "release-note" so that we don't forget to include that magic loop in our 1.1 documentation.

@RichieEscarez

(Just wanted to pop by and say we've been using the iptables proxying for a week now and it seems all ok!)

@thockin Should this be closed, or removed from the 1.1 milestone?

I'll move it to 1.2 just for the default enablement.

Apologies for a potentially dumb question, but regarding the preservation of client IPs:

@thockin I saw in another issue on Sep. 2 that "only intra-cluster traffic retains the client IP" -- is this still true for the 1.2 alpha?

We launched a fresh 1.2 cluster, applied the node annotation, restarted, and still see 10.244.0.1 as the source address for all requests made to a pod running HAProxy.

At this point I'm just trying to figure out whether or not we've missed a setting or I'm trying to achieve something that is not yet possible -- that is seeing the public IP address of the actual client making the request from outside the cluster.

The default still uses userspace mode. You have to set an annotation on the node (net.beta.kubernetes.io/proxy-mode=iptables) and restart the proxy. But that will not expose external client IPs, just intra-cluster IPs.

I'm able to keep the external client IP by DNAT'ing external traffic + routing through a kube-proxy. For instance, if your service network is 10.42.0.0/16 and you have a highly available kube-proxy on the IP 10.10.1.1, you can have the following iptables rule:

-A PREROUTING -i public -p tcp -m tcp --dport 25 -j DNAT --to-destination 10.42.12.34

and the following route:

10.42.0.0/16 via 10.10.1.1 dev edge 

The pod behind then sees the real IP:

Oct 24 02:41:39 email-0yr7n mail.info postfix/smtpd[469]: connect from zed.yyy.ru.[94.102.51.96]

You have to have the right return path for the packet, of course.
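
One way to get that return path right is source-based policy routing on the node hosting the pod, so that replies from the pod go back out via the proxy host. A rough sketch using the example addresses above (the routing table number is arbitrary):

# on the node hosting 10.42.12.34: send traffic sourced from the pod back via the proxy (10.10.1.1)
ip rule add from 10.42.12.34 lookup 100
ip route add default via 10.10.1.1 table 100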

Yeah, if this DNATs to an off-machine backend, you form a triangle without
SNAT. This is the fundamental problem.


Sounds interesting, any link? :-) I'm trying to find a way to be sure the packet will go through the right conntrack rule. I was thinking about replicating conntrack state through the cluster.

I'm a bit lost with what you are trying to achieve.

The way the current iptables proxy is supposed to work is that a packet
arrives at a node, we detect that it is not locally generated, flag it for
SNAT, choose a backend, forward to the backend with SNAT, backend responds
to us, we un-SNAT, un-DNAT, and respond to external user.


Oh, sorry if I'm not clear. I was talking about the SNAT-less case. If every kube-proxy has the same conntrack list, any of them should be able to un-DNAT correctly when the container replies to the customer.

I couldn't see a way without replication that doesn't involve HA to keep a line-shaped structure like this:

[client] ----- [proxy in HA] ------ [node1]
                           `------- [node2]

But if a triangle-shaped thing can work, it should open more possibilities.

[client] ----- [proxy1] ------ [node1]
       `------ [proxy2] ------ [node2]

That would be cute but seems crazy complicated


The way I have yet to explore (I haven't, and I don't like to ask before doing proper research, but since the subject is now open...) is the "Asymmetric multi-path routing" you can see here: http://conntrack-tools.netfilter.org/manual.html#sync-aa . And yes, that would be really really nice :-)

The simplest thing that could possibly work...

  1. proxy1 receives a new connection through an iptables hook (I think I've seen that somewhere), and its LB assigns it to proxy2's node.
  2. proxy1 sends a request like "set up a conntrack entry for {src-ip}:{src-port} -> {pod-ip}:{pod-port}"
  3. proxy2 receives the request, sets up the conntrack entry, and ACKs it to proxy1
  4. proxy1 lets the packet go through the DNAT rule (which puts a conntrack entry in proxy1 too).
  5. when the pod replies, proxy2's host un-DNATs accordingly.
  6. when the client sends another packet on this flow through proxy1, the conntrack entry does the right DNAT too.

This way, the overhead is 2 packets per new connection, rapidly paid back by avoiding un-SNAT + extra routing (since otherwise the packet has to go back through proxy1). A rough sketch of step 3 is given below.

I'm not a network guy so I may assume too much, but that seems reasonable.
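
If anyone wants to experiment with step 3, the conntrack-tools CLI can create entries by hand. Something roughly like this (addresses, ports and timeouts are placeholders, and the exact flags/state values should be checked against conntrack(8)):

# on proxy2's node: pre-create the flow so that replies from the pod get un-DNATed
conntrack -I -p tcp -t 120 \
    -s 203.0.113.7 --sport 40000 \
    -d 10.0.0.93 --dport 80 \
    -r 10.244.1.15 -q 203.0.113.7 \
    --state SYN_SENT -u SEEN_REPLY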

In my case, I was aiming to establish firewall rules, per NodePort service.

It looks like I can add simple ALLOW IP / DROP everything else rules in the INPUT chain, like so:

iptables -A INPUT -s $WHITELISTED_IP -p tcp --dport $CONTAINER_PORT -j ACCEPT
iptables -A INPUT -p tcp --dport $CONTAINER_PORT -j DROP

To apply these rules, what I was envisioning was using annotations on the NodePort services. The annotations would hold whitelisted IPs.

Since I can wait a bit for these rules to apply, I imagined a minutely cron task on each minion coming through and updating the minion's INPUT chain from all the Service annotations.

Is there anything that could cause an issue here? Am I insane?
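
A rough sketch of what that cron job could look like (entirely hypothetical: the annotation key, chain name and use of jq are mine, and NodePort traffic is matched on the node port rather than the container port):

#!/bin/sh
# Rebuild a whitelist chain from Service annotations (illustrative sketch only).
iptables -N SVC-WHITELIST 2>/dev/null
iptables -F SVC-WHITELIST
# make sure the chain is actually consulted
iptables -C INPUT -j SVC-WHITELIST 2>/dev/null || iptables -I INPUT -j SVC-WHITELIST

kubectl get services --all-namespaces -o json | jq -r '
  .items[]
  | select(.spec.type == "NodePort")
  | (.metadata.annotations["example.com/whitelist-ips"] // empty) as $ips
  | .spec.ports[] as $p
  | ($ips | split(",")[]) as $ip
  | "\($ip) \($p.nodePort)"
' | while read ip port; do
  iptables -A SVC-WHITELIST -s "$ip" -p tcp --dport "$port" -j ACCEPT
done

# default-deny the NodePort range (assumes the default 30000-32767)
iptables -A SVC-WHITELIST -p tcp --dport 30000:32767 -j DROP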

@thockin has a better view than me, but I wouldn't use an annotation for this. I think security is orthogonal and should be handled by a separate system, or maybe a network/proxy plugin. If you have Kubernetes, you have etcd, so you could simply store a rule set in a key and update with etcdctl watch/exec:

# while true; do etcdctl watch "/iptables/$(hostname)" && etcdctl get /iptables/$(hostname) |iptables-restore --noflush; done &
# iptables -F my-filter
# iptables -nvL my-filter
Chain my-filter (0 references)
 pkts bytes target     prot opt in     out     source               destination      
# ~nwrk/go/bin/etcdctl set /iptables/$(hostname) >/dev/null <<EOF
*filter
:my-filter -
-A my-filter -j ACCEPT -s 1.2.3.4 -p tcp --dport 80
-A my-filter -j DROP -p tcp --dport 80
COMMIT
EOF
# iptables -nvL my-filter
Chain my-filter (0 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 ACCEPT     tcp  --  *      *       1.2.3.4              0.0.0.0/0            tcp dpt:80
    0     0 DROP       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:80

I think you want #14505


That was the initial approach, with security groups attached to load balancers. I just hit the limit for listeners per network interface pretty fast on AWS, and ran into some hairy logic trying to spread firewall rules across multiple SGs and multiple ELBs for a single kube cluster.

Fortunately we're on to a better solution that doesn't involve screwing around with iptables.

If you just joined, allow me to sum it up. All issues regarding the inability to get the client IP were merged into this issue; however, the solution proposed (and implemented) does not solve it. You currently have no way to access the client's IP. Ha.

@shaylevi2 there's just no good way currently to get the client IP while bouncing through a cloud LB and onto a nodePort. Once cloud LB's catch up, I will jump right on it. But this DOES preserve client IP within the cluster

But this DOES preserve client IP within the cluster

That depends on exactly how cluster networking is set up; eg, it doesn't work right in OpenShift at the moment, because iptables rules don't get run on OVS-internal traffic. So the packets get DNAT'ed going into the service endpoint, but since the source IP is cluster-internal, the response will stay within OVS, so it doesn't hit iptables again, so the DNAT'ing doesn't get reversed, so the client pod doesn't recognize the packets. At the moment, the simplest workaround for this is to fully masquerade the packets going into the endpoint, forcing them to get bounced out of OVS again on the way out. (I'm working on figuring out some way around this.)

Does OVS have a VIP notion internally? You could just get rid of
kube-proxy (c.f. opencontrail)


We had talked about doing essentially the equivalent of the pure-iptables-proxying entirely inside OVS, but that requires OVS conntrack support, which requires a very recent kernel, which we don't want to depend on yet. That's probably the long-term plan though.

(For now it looks like we can make it work by adding a gratuitous extra hop out of OVS for packets with a source IP+port matching a known service endpoint that are coming from a container interface; the node will then possibly un-DNAT it, and then bounce it back into OVS where it can get delivered back to the client pod correctly.)

I'm hoping to write a doc about the Service VIP abstraction and make it
clear that it is an abstraction that can be replaced (and should be in some
cases).


Although iptables/nftables would solve both the TCP and UDP load balancing use cases, I personally think IPVS https://github.com/kubernetes/kubernetes/issues/17470 would be a much better fit, because it is purpose-built for load balancing (read: less ongoing change/maintenance for the k8s team), offers a richer set of load balancing algorithms, has proven stability at near-line-rate speeds, and also has golang libraries ready to manipulate rules.
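
For anyone unfamiliar with it, driving IPVS by hand is quite compact. A minimal sketch of a service VIP with two backends (addresses are placeholders; this is not something kube-proxy does today):

# create a virtual service on the VIP with round-robin scheduling
ipvsadm -A -t 10.0.0.10:80 -s rr
# add two pod backends in masquerading (NAT) mode
ipvsadm -a -t 10.0.0.10:80 -r 10.244.1.5:8080 -m
ipvsadm -a -t 10.0.0.10:80 -r 10.244.2.7:8080 -m
# inspect the resulting table
ipvsadm -L -n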

@thockin, others, as per https://github.com/kubernetes/kubernetes/issues/3760#issuecomment-150743158, I did the annotation, but as rightly mentioned, external client IP is still not seen by app sitting in a container.

How to achieve this, i.e. get the external client IP? In my setup there is no external LB, the service is exposed as a nodePort, and the client makes a plain TCP (not HTTP/WebSocket) connection to my containerized application.

@ashishvyas what version of kube-proxy are you running?

I am running v1.1.3

Follow the directions in https://github.com/kubernetes/kubernetes/issues/3760#issuecomment-143280584 and https://github.com/kubernetes/kubernetes/issues/3760#issuecomment-150743158 but instead of using the annotation named net.beta.kubernetes.io/proxy-mode, use the annotation named net.experimental.kubernetes.io/proxy-mode.

for node in $(kubectl get nodes -o name); do
  kubectl annotate $node net.experimental.kubernetes.io/proxy-mode=iptables;
done

You should see log statements in the beginning of the kube-proxy startup like 'Found experimental annotation' and 'Annotation allows iptables proxy'

The first release that https://github.com/kubernetes/kubernetes/commit/da9a9a94d804c5bfdf3cc86ee76a2bc1a2742d16 made it into was 1.1.4 so net.beta.kubernetes.io/proxy-mode is not working for many people. You are not the first person to run into this.

Because of the way the proxy works, we lose client IP when it comes through
a node port. I know this is not great. It's very much on my mind how to
fix this properly, but it mostly comes down to the capabilities of the
load-balancers (or the other ways by which traffic arrives at a node, such
as DNS-RR)


@thockin , any workaround to temporarily address this is possible right now? If yes, I would recommend providing detailed steps for me and others on this thread would help.

No, there is no real workaround, currently. The problem comes down to the
fact that each kube-proxy might choose a backend on a different node.
Forwarding traffic with the original client IP would have the other node
respond directly which obviously will not work.

The "fix" is to _only_ send traffic for Service S to nodes that have at
least 1 backend for S _and_ to send traffic proportional to how many
backends each node has. Kube-proxy could then choose local backends
exclusively.

Consider 2 nodes and 3 backends. One node necessarily ends up with 2
backends. Whatever routes traffic has to send 2x as much to one node as it
does to another node. We just haven't tackled that problem yet - none of
the cloud load-balancers support this, so it's sort of speculative and
therefore very risky to start working on.


@mikedanese, I don't seem to be able to find 1.1.4 on gcr.io:

$ sudo docker pull gcr.io/google_containers/hyperkube:v1.1.4
Pulling repository gcr.io/google_containers/hyperkube
Tag v1.1.4 not found in repository gcr.io/google_containers/hyperkube
$ sudo docker pull gcr.io/google_containers/hyperkube:v1.1.3
v1.1.3: Pulling from google_containers/hyperkube
Digest: sha256:004dde049951a4004d99e12846e1fc7274fdc5855752d50288e3be4748778ca2
Status: Image is up to date for gcr.io/google_containers/hyperkube:v1.1.3

@thockin Apologies for the long response, I wanted to cover both methods we tried to solve this with so others could understand the challenges we faced with both.

As a bit of background, our main application is a very high performance smart DNS platform (i.e. it needs UDP and needs to do at least 100k+ requests/sec per pod), and its supporting application is an SNI proxy which needs to see the client's real IP address (this is a show-stopper for us). We didn't want to use different networking approaches for different applications, so we decided to standardise on a single network method for all, and we chose IPVS for the reasons I mentioned above (performance/stability/flexibility/purpose-built SLB), but you could probably hack something together using just iptables along these same lines too. We use vxlan (fast, easy, works between sites), but both of these methods should also work with GRE/VXLAN with OVS, or with standard Layer 2 host networking (assuming your hosts are all on the same L2 network).

We distribute incoming end-user traffic using a mixture of anycast and DNS, depending on failover speed requirements or whatever works best for the particular type of service, so we have a fairly even distribution of end-user traffic coming into our nodes, but the problem, as you pointed out, is then getting an even distribution of traffic across pods, regardless of pod location. The other issue is making sure services talking to other services are load balanced effectively.

We attempted two models to address this:

The first method we tried was 2 layers of VIPs. External VIPs (1 per service), which distribute traffic across nodes (based on the pod count for that service on the node), and then Internal VIPs (which run on the node with the pods), which distribute load within the nodes (typically evenly across pods). The limitation of this model was that nodes running external VIPs needed to run two different network namespaces, or run on their own physical nodes. The nice thing with IPVS in DSR (direct server return) mode is that it does not need to see the return traffic; the traffic goes:

Consumer >> (over L3) >> External VIP node >> (1) >> Internal VIP node >> (2) >> Container >> (any which way you want) >> Consumer

(1) IPVS (in DSR mode) on the host with an external VIP picks a _node_ to send traffic to (a "real server" in IPVS terms), and only changes DST MAC address of the packet (ie IP packet arrives unchanged at k8s node). It load balances across nodes based on the number of pods running that service on the node.
(2) IPVS (also in DSR mode) on the k8s node load balances traffic across pods (via the veths to the node). Replies from containers (TCP and UDP) go directly back to consumer of the service.

The upside of this model is that it was really simple to get going and the ruleset was very easy to manage. The downside of this model is that it concentrates all our service requests (but not the replies) through a number of nodes running the External VIPs. We like "shared-nothing", so, enter version 2:

The second model we are now entertaining is a single layer of VIPs with smarter IPVS and iptables configuration.

Consumer >> Any node/local node >> (1) >> Container >> (any which way you want) >> Consumer
or, it might go to another node:
Consumer >> Any node/local node >> (1) >> Remote Node >> (2) >> Container >> (any which way you want) >> Consumer

(1) Traffic hits Primary VIP, traffic is load balanced across all pods in the cluster.
(2) Traffic hits the Secondary VIP; traffic is load balanced only across local pods. This secondary VIP is only used for traffic coming in from other hosts on the network (it's a FWMARK VIP). We mark traffic coming in on any external interface with FWMARK=1234, and that forces the traffic to go to a different ruleset, which prevents loops between nodes.

The primary VIP has a list of local pods and remote hosts with pods (with the weight being 100 for each local pod, and 100 * number of pods for remote nodes). So for example, if 3 pods are running locally on nodeA, and there are two pods running on nodeB, the ruleset on nodeA would look like this:

Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP service.ip.address:0 rr persistent 360
-> pod1.on.nodeA.ip:80 Route 100 0 0
-> pod2.on.nodeA.ip:80 Route 100 0 0
-> pod3.on.nodeA.ip:80 Route 100 0 0
-> interfaceip.of.nodeB:80 Route 200 0 0
FWM 1234 rr
-> pod1.on.nodeA.ip:80 Route 100 0 0
-> pod2.on.nodeA.ip:80 Route 100 0 0
-> pod3.on.nodeA.ip:80 Route 100 0 0

However on nodeB, the IPVS config would look a bit different because it only has two local pods, and three remote pods on nodeA:

Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP service.ip.address:0 rr persistent 360
-> pod1.on.nodeB.ip:80 Route 100 0 0
-> pod2.on.nodeB.ip:80 Route 100 0 0
-> interfaceip.of.nodeA:80 Route 300 0 0
FWM 1234 rr
-> pod1.on.nodeB.ip:80 Route 100 0 0
-> pod2.on.nodeB.ip:80 Route 100 0 0
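
For context, the FWM 1234 entries above are a fwmark-based virtual service. Roughly, the wiring looks like this (interface name, mark value and pod IPs are illustrative):

# mark traffic arriving from outside so it only hits the local-pods ruleset
iptables -t mangle -A PREROUTING -i eth0 -p tcp --dport 80 -j MARK --set-mark 1234
# fwmark-based virtual service balancing only across local pods, in DSR (route) mode
ipvsadm -A -f 1234 -s rr
ipvsadm -a -f 1234 -r 10.244.1.5:80 -g
ipvsadm -a -f 1234 -r 10.244.1.6:80 -g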

Another way would be to switch the FWMARKs around, and use iptables to FWMARK anything in veth+ interfaces (wildcard match) and have the FWMARK match only used for local load balancing.

Because there is no NAT involved here, you need to add the SVC_XXX_YYY IPs in the environment to a loopback or dummy interface when you start each pod, but you could probably also change the IPVS VIPs to do DNAT as well, I don't see why that wouldn't work.
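
The loopback/dummy-interface trick mentioned here is just binding the service VIP inside the pod's network namespace so it will accept DSR traffic addressed to it. For example (the VIP and interface name are placeholders):

# inside the pod's network namespace
ip link add kube-vips type dummy
ip addr add 10.0.0.10/32 dev kube-vips
ip link set kube-vips up
# alternatively: ip addr add 10.0.0.10/32 dev lo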

The end result is the most direct networking, with no need to centralise request processing/routing so it scales much better. The downside is some extra smarts when creating the IPVS rules. We use a little (golang) daemon to do all of this, but I'd consider writing a k8s module for this if I had the time and there was enough interest.

I'm dipping into this issue late, and probably haven't read the full trail in enough detail, but just in case it helps: If I've understood @qoke's post above then they want to use VIPs, not node ports. One of the concerns raised earlier in the thread was the source IP is not maintained when using the iptables kubeproxy. However, it is maintained if you are using the service VIP and not the node port feature I believe. (Anyway, as I say, I haven't really read the full trail, so if these comments are obvious or not helpful, then please ignore! I'll try to make time to read it in depth next week when I'm back from vacation.)

@lxpollitt Correct, the source IP is maintained when using IPVS; however, keep in mind that with this method we had to pretty much do all the networking ourselves, because configuration of IPVS is not supported by kube-proxy. You can also keep the source IP with iptables, but you need two layers of iptables DNAT so that you can "un-NAT" the traffic on the way back.

On my side, with flannel (in vxlan mode) managing my container network, I use kube-proxy (in iptables mode and not masquerading) + flannel in a namespace on my routing nodes. The external requests are DNATed to service IPs and then forwarded through the namespace with the kube-proxy. I haven't done active/active router cluster testing, but this setup allows me to keep the external IP. I mention it FWIW, but I understand it's not the "most direct networking".

I understand that it would be nice if kube-proxy could manage it, but given your specific needs, and especially the fact that the load-balancing is already outside kube-proxy, wouldn't it make sense to code a custom iptables-rules manager watching the kubernetes cluster state and setting up the rules to DNAT VIPs only to the pods of the running host? This could also be a mode for kube-proxy, though, like... well.. I'm not good at names... --proxy-mode=iptables-to-node-pods-only.

Thanks for the detailed writeup. Your solution is interesting, and I spent
a good amount of time today thinking about it. Unfortunately you've
crossed into very specific territory that doesn't work in a generic sense.
Clouds like GCE can't use IPVS gatewaying mode because of the routed
network. Even if gatewaying worked, it doesn't support port remapping,
which Kubernetes does, so it only applies if the service port == target
port.

The challenge with the core of kubernetes is finding ways to handle your
sort of situation generically or to get out of your way and empower you to
set it up yourself. Maybe we could do something with the ipvs encap mode,
but I don't know the perf implications of it.


I think an iptables-to-node-pods-only mode would be interesting to try,
but it has a lot of ripple. The potential for imbalance is very real and
at least the service controller would need to know how to program the
external load-balancers.


@thockin You mean if 2 replicas are on the same node, for instance? I think we could put the case "program the external load-balancers" outside the scope of the kube-proxy, as it has multiple instances, and an external LB programmer should probably be in a "single master" mode. Thus, allowing a kube-proxy mode "iptables-to-node-pods-only" is only the first of a 2-step process.

I think I could try to implement something like that tomorrow: "iptables-to-node-pods-only" mode in kube-proxy, plus a contrib/ip-route-elb that would maintain a Linux routing table with one route per service, with the right weight for each node based on how many endpoints the node has for a given service.

@thockin You mean if 2 replicas are on the same node, for instance? I think we could put the case "program the external load-balancers" outside the scope of the kube-proxy, as it has multiple instances, and an external LB programmer should probably be in a "single master" mode. Thus, allowing a kube-proxy mode "iptables-to-node-pods-only" is only the first of a 2-step process.

"Only proxy to local pods" has to be step 2 of the process. Step 1
has to be changing the service controller to send load-balancers only
to Nodes that have 1 or more backends for a given Service. That step
alone is PROBABLY reasonable, but will require a lot of testing to
make sure we get it right. I think we want to do this eventually,
anyway.

Once that is done, we can talk about making node-ports prefer local
backends if possible, but this step requires a lot more careful
thought. Does that mean _always_ (i.e. never choose a remote
backend if a local one is available) or probabilistic? Should we do
that through the same node port (different nodes will get very
different behaviors) or do we allocate a different port which is used
if and only if this node has 1 or more backends? How do we handle the
imbalance problem?

I think I could try to implement something like that tomorrow: "iptables-to-node-pods-only" mode in kube-proxy,
plus a contrib/ip-route-elb that would maintain a Linux routing table with one route per service, with the right weight
for each node based on how many endpoints the node has for a given service.

If ELB supports weights then it will work better in some regards than
GCE, which doesn't. That's fine, I just didn't think it supported
weights. I don't think it can be contrib, though - it's a pretty
fundamental part of the system.

On 01/16/2016 05:19 AM, Tim Hockin wrote:

"Only proxy to local pods" has to be step 2 of the process. Step 1
has to be changing the service controller to send load-balancers only
to Nodes that have 1 or more backends for a given Service. That step
alone is PROBABLY reasonable, but will require a lot of testing to
make sure we get it right. I think we want to do this eventually,
anyway.

That makes sense.

Once that is done, [...]

So let's see once that is done :-)

If ELB supports weights then it will work better in some regards than
GCE, which doesn't. That's fine, I just didn't think it supported
weights.

Since it's from the manual, and you probably have more than 10x my experience in this kind of networking, I must be completely missing a catch. The ip-route man page says this:

           nexthop NEXTHOP
                  the nexthop of a multipath route. NEXTHOP is a complex value with its own syntax similar to the top level argument lists:

                          via [ FAMILY ] ADDRESS - is the nexthop router.

                          dev NAME - is the output device.

                          weight NUMBER - is a weight for this element of a multipath route reflecting its relative bandwidth or quality.

I don't think it can be contrib, though - it's a pretty
fundamental part of the system.

Since the "E" stands for "external", I felt like it could start there,
at least to get some code to support ideas.

On Fri, Jan 15, 2016 at 2:55 PM, Mikaël Cluseau
[email protected] wrote:

On 01/16/2016 05:19 AM, Tim Hockin wrote:

"Only proxy to local pods" has to be step 2 of the process. Step 1
has to be changing the service controller to send load-balancers only
to Nodes that have 1 or more backends for a given Service. That step
alone is PROBABLY reasonable, but will require a lot of testing to
make sure we get it right. I think we want to do this eventually,
anyway.

That makes sense.

I thought more about this today, I don't think it will be very hard.
Just medium hard.

Once that is done, [...]

So let's see once that is done :-)

fair enough, I just like to know where a set of changes is going :)

If ELB supports weights then it will work better in some regards than
GCE, which doesn't. That's fine, I just didn't think it supported
weights.

Since its from the manual and you have probably more that 10x my
experience in this kind of networking, I must completely ignore a catch.
The man ip-route says this:

nexthop NEXTHOP
the nexthop of a multipath route. NEXTHOP is a
complex value with its own syntax similar to the top level argument lists:

via [ FAMILY ] ADDRESS - is the nexthop
router.

dev NAME - is the output device.

weight NUMBER - is a weight for this
element of a multipath route reflecting its relative bandwidth or quality.

We don't really use linux's notion of IP routing though. None of the
LB implementations I know of use it, anyway. GCE uses Google's cloud
balancer, which doesn't have weights. I don't know if Amazon ELB
does.

I don't think it can be contrib, though - it's a pretty
fundamental part of the system.

Since the "E" stands for "external", I felt like it could start there,
at least to get some code to support ideas.

Sure, we can START in contrib :)

Also, if you want to pursue this you should open 2 bugs, something like:

1) load-balancers for a Service should only target nodes that actually
have a backend for that Service

2) to preserve client IP across load-balancers, kube-proxy should
always prefer local backend if present (xref #1)

and then explain the intention and direction


The risk of a PR without docs is that it is in the wrong direction. It is
much easier to review something as a proposal. I will take a look at your
PR when I get a chance, soon I hope.
On Jan 15, 2016 7:02 PM, "Mikaël Cluseau" [email protected] wrote:

Is it ok to open a pull-request for (1) directly with this name and some
explanations?



Unfortunately you've crossed into very specific territory that doesn't work in a generic sense.
Clouds like GCE can't use IPVS gatewaying mode because of the routed network. Even if gatewaying worked, it doesn't support port remapping, which Kubernetes does, so it only applies if the service port == target port.

On the gatewaying side, this works fine over a Layer 3 network (we use an overlay), and even though we could have got away without an overlay network, we built it this way because we wanted the approach we use to be portable and work in 3rd party clouds (like GCE).

Correct, the limitation with DSR mode is that the service port == target port, but this isn't really an issue unless you have two applications _in the same container_ that need to run on the same port (we spent a lot of time thinking about this and, assuming the "1 application per container" guideline, we couldn't find a single use case for this). We have many containers all running on the same nodes with services inside them all on the same ports, and all load balanced fine. If you _really_ need to remap ports (I'd like to understand the real reasons why, though), you can use IPVS NAT mode instead of the "ROUTE" mode.

The challenge with the core of kubernetes is finding ways to handle your sort of situation generically or to get out of your way and empower you to set it up yourself. Maybe we could do something with the ipvs encap mode, but I don't know the perf implications of it.

What we have done is as generic as we could possibly make it (works here, works in Amazon, and I'm certain it will work in GCE when we have the need to expand), with the only limitation with DSR being that the application running in the pod/container needs to run on the same port as the service which after a lot of internal discussion, nobody has been able to find a scenario where this would be limiting from an E2E application stack perspective.

Now that said, if you remove DSR (IPVS Route mode) from the equation and use IPVS's "NAT mode" instead, then the ports can be remapped and you still get the benefit of IPVS features/performance/etc. The only downside is that NAT adds some performance tax, but (a) it supports UDP, and (b) it's still lightning fast compared to a userspace solution.

@brendandburns @thockin Earlier in the thread you asked for some performance numbers. I wouldn't call this the most comprehensive set of tests, but I expect HTTP is one of the more common workloads in containers, so here are some apache-bench numbers as a starting point:

https://docs.google.com/presentation/d/1vv5Zszt4HDGbuyVlvOe76unHskxPuZQseQnarNbhQVc

DNAT was enabled on IPVS for a fair comparison to the other two solutions (this also means the service port can be different to the target service port). Our workloads might seem a bit unusual to some, but our performance targets probably aren't too dissimilar to others (i.e. squeeze the hardware for all you can get).

Thanks!

I'm not sure where the idea that kube-proxy doesn't do UDP comes from - it
absolutely does, though maybe not perfectly (without a connection it comes
down to timeouts).

It's also worth clarifying the iptables (new) kube-proxy vs the userspace
(legacy) mode.


Good call on the new vs legacy mode - noted and updated.

Also on the UDP side of things, thank you for the clarification; I wasn't aware UDP was fully supported in kube-proxy until now. When we first tried kube-proxy with UDP, we got lots and lots of hangs. Not sure why but we did increase the timeouts and still had issues. We had to find a solution fast so we ended up working around it with IPVS rather than debugging it. At the time it only worked for quite low packet-per-second workloads (under 1k pps) but we haven't re-tested recently.

The major issue with iptables and any high-rate UDP service is netfilter conntrack tables filling up. Even if you increase the conntrack size to 1 million, then some malware-infected end-user DDoS's you or tries to use you for a DNS amplification attack and then your conntrack table just fills up again. Generally speaking, best practice for DNS servers (or any high-rate UDP services) is to disable conntrack (using -j NOTRACK in the raw table) , and if you disable conntrack, iptables NAT and stateful stuff (-m state) breaks.
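
For completeness, the NOTRACK best practice referred to here looks like this, using DNS on port 53 as the example:

# bypass connection tracking for DNS traffic in both directions
iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK
iptables -t raw -A OUTPUT -p udp --sport 53 -j NOTRACK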

Apart from the GIT repo, where would be the best place to look before attempting to create a "k8s.io/kubernetes/pkg/proxy/ipvs" module/package? or is this best left to someone who knows the code base better?

Also, if you would like any specific benchmarks run let me know and I'll see what I can do.

On 01/17/2016 08:50 PM, qoke wrote:

Apart from the GIT repo, where would be the best place to look before
attempting to create a "k8s.io/kubernetes/pkg/proxy/ipvs" module/package?

I believe you can start in contrib too.

FWIW... I used list watches and undelta stores to make an independent binary reacting to the cluster's state in
https://github.com/kubernetes/kubernetes/pull/19755. If the information
in
https://github.com/kubernetes/kubernetes/pull/19755/files#diff-0becc97ac222c3f2838fbfe8446d5375R26
is enough, you should only have to change the call a few lines below
(https://github.com/kubernetes/kubernetes/pull/19755/files#diff-0becc97ac222c3f2838fbfe8446d5375R44).

Please note that I only support clusterIP services in this PoC.

Unfortunately you've crossed into very specific territory that doesn't work in a generic sense.
Clouds like GCE can't use IPVS gatewaying mode because of the routed network. Even if gatewaying worked, it doesn't
support port remapping, which Kubernetes does, so it only applies if the service port == target port.

On the gatewaying side, this works fine over a Layer 3 network (we use an overlay), and even though we could have got away
without an overlay network, we built it this way because we wanted the approach we use to be portable and work in 3rd party
clouds (like GCE).

I'm not sure how it can work. Machine-level static routes don't work
in GCE. Maybe I am missing some technique that you have applied. I
freely admit that I am NOT an expert in this :)

Correct the limitation is that the service port == target port, but this isn't really an issue unless you have two applications in the
same container that need to run on the same port (we spent a lot of time thinking about this, and assuming the "1 application
per container" guideline, we couldn't find a single use case for this). We have many containers all running on the same nodes
with services inside them all on the same ports, and all load balanced fine. In short, the approach we use does not stop you
from running multiple services on the same ports on the same node.

It also applies if you have backends that are changing versions (e.g.
transitioning between etcd1 and etcd2) or any other situation where
the backend port just needs to be different. The problem is that
Kubernetes allows it to be expressed, so we need to make sure people
can actually use it (or else deprecate it and EOL the feature which
seems unlikely).

The challenge with the core of kubernetes is finding ways to handle your sort of situation generically or to get out of your way
and empower you to set it up yourself. Maybe we could do something with the ipvs encap mode, but I don't know the perf
implications of it.

What we have done is as generic as we could possibly make it (works here, works in Amazon, and I'm certain it will work in
GCE when we have the need to expand), with the only limitation being that the application running in the pod/container needs > to run on the same port as the service which after a lot of internal discussion, nobody has been able to find a scenario where
this would be limiting from an E2E application stack perspective.

I really want to understand how. Do you have something more step-by-step?

Also on the UDP side of things, thank you for the clarification; I wasn't aware UDP was fully supported in kube-proxy until now.
When we first tried kube-proxy with UDP, we got lots and lots of hangs. Not sure why but we did increase the timeouts and
still had issues. We had to find a solution fast so we ended up working around it with IPVS rather than debugging it. At the
time it only worked for quite low packet-per-second workloads (under 1k pps) but we haven't re-tested recently.

The major issue with iptables and any high-rate UDP service is netfilter conntrack tables filling up. Even if you increase the
conntrack size to 1 million, then some malware-infected end-user DDoS's you or tries to use you for a DNS amplification
attack and then your conntrack table just fills up again. Generally speaking, best practice for DNS servers (or any high-rate
UDP services) is to disable conntrack (using -j NOTRACK in the raw table), and if you disable conntrack, iptables NAT and
stateful stuff (-m state) breaks.
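
To illustrate what that looks like, here is a minimal sketch of the NOTRACK approach for a DNS workload (the port and chain placement are illustrative; adjust to your environment):

iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK
iptables -t raw -A OUTPUT     -p udp --sport 53 -j NOTRACK
# once these packets bypass conntrack, -m state / -m conntrack matches and any
# DNAT/SNAT/MASQUERADE rules no longer apply to them, so the service has to be
# reachable without NAT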

Yeah, NAT for UDP is really unfortunate. Finding a non-conntrack
solution would be great, but it has to either apply to all
environments or be parameterized by platform (which is hard in its own
way).

Apart from the GIT repo, where would be the best place to look before attempting to create a
"k8s.io/kubernetes/pkg/proxy/ipvs" module/package? or is this best left to someone who knows the code base better?

I opened a github issue about IPVS, but I was using masquerade (NAT)
mode because I could not get it working on GCE without (and because of
the port remapping feature). If port-remapping triggered a less ideal
balancing mode, I could probably live with that and just document it
as such.

We should move this convo there - it's going to get lost here.

On 01/18/2016 12:34 PM, Tim Hockin wrote:

Yeah, NAT for UDP is really unfortunate. Finding a non-conntrack
solution would be great, but it has to either apply to all
environments or be parameterized by platform (which is hard in its own
way).

could stateless NAT work for the case where a (pod, port) couple has at
most one service (many pods to one service)?

This would look like this:

{from: clientIP:clientPort, to: externalIP:externalPort} ---[proxy selects a random pod]---> {from: clientIP:clientPort, to: podIP:targetPort} ---> [routed through the right host...]

On the way back, the firewall will have a rule saying a packet {from:
podIP:targetPort, to: any} should be SNATed to {from:
externalIP:externalPort, to: unchanged}.

To state it in iptables dialect:

iptables -t nat -N stateless-svc-in

iptables -t nat -N stateless-svc-out

iptables -t nat -A stateless-svc-in  -j DNAT -d 1.2.3.4  -p udp --dport 53 --to-destination 10.1.0.1 -m statistic --mode random --probability 0.3333

iptables -t nat -A stateless-svc-in  -j DNAT -d 1.2.3.4  -p udp --dport 53 --to-destination 10.2.0.1 -m statistic --mode random --probability 0.5

iptables -t nat -A stateless-svc-in  -j DNAT -d 1.2.3.4  -p udp --dport 53 --to-destination 10.2.0.2 -m statistic --mode random --probability 1

iptables -t nat -A stateless-svc-out -j SNAT -s 10.1.0.1 -p udp --sport 53 --to-source 1.2.3.4

iptables -t nat -A stateless-svc-out -j SNAT -s 10.2.0.1 -p udp --sport 53 --to-source 1.2.3.4

iptables -t nat -A stateless-svc-out -j SNAT -s 10.2.0.2 -p udp --sport 53 --to-source 1.2.3.4

I don't see where this doesn't work when the packet comes from outside
the cluster.

The way that Services are expressed in Kubernetes allows a single pod to be
fronted by any number of Services, so this breaks down - we don't know what
to SNAT to.

On Sun, Jan 17, 2016 at 6:13 PM, Mikaël Cluseau [email protected]
wrote:

On 01/18/2016 12:34 PM, Tim Hockin wrote:

Yeah, NAT for UDP is really unfortunate. Finding a non-conntrack
solution would be great, but it has to either apply to all
environments or be parameterized by platform (which is hard in its own
way).

could stateless NAT work for the case where a (pod, port) couple has at
most one service (many pods to one service)?

This would look like this:

{from: clientIP:clientPort, to: externalIP:externalPort} ---[proxy selects
a random pod]---> {from: clientIP:clientPort, to: podIP:targetPort} --->
[routed through the right host...]

On the way back, the firewall will have a rule saying a packet {from:
podIP:targetPort, to: any} should be SNATed to {from:
externalIP:externalPort, to: unchanged}.

To state it in iptables dialect:

iptables -t nat -N stateless-svc-in

iptables -t nat -N stateless-svc-out

iptables -t nat -A stateless-svc-in -j DNAT -d 1.2.3.4 -p udp --dport 53
--to-destination 10.1.0.1 -m statistic --mode random --probability 0.3333

iptables -t nat -A stateless-svc-in -j DNAT -d 1.2.3.4 -p udp --dport 53
--to-destination 10.2.0.1 -m statistic --mode random --probability 0.5

iptables -t nat -A stateless-svc-in -j DNAT -d 1.2.3.4 -p udp --dport 53
--to-destination 10.2.0.2 -m statistic --mode random --probability 1

iptables -t nat -A stateless-svc-out -j SNAT -s 10.1.0.1 -p udp --sport 53
--to-source 1.2.3.4

iptables -t nat -A stateless-svc-out -j SNAT -s 10.2.0.1 -p udp --sport 53
--to-source 1.2.3.4

iptables -t nat -A stateless-svc-out -j SNAT -s 10.2.0.2 -p udp --sport 53
--to-source 1.2.3.4

I don't see where this doesn't work when the packet comes from outside
the cluster.


Reply to this email directly or view it on GitHub
https://github.com/kubernetes/kubernetes/issues/3760#issuecomment-172408290
.

On 01/18/2016 03:31 PM, Tim Hockin wrote:

The way that Services are expressed in Kubernetes allows a single pod
to be
fronted by any number of Services, so this breaks down - we don't know
what
to SNAT to.

That's why I limited the case to a many-to-one (my first sentence :-)).
I'm just trying to draw a line around what can be done or not.

Sure, I just have the unfortunate job of pointing out why it's not a
general enough solution :(

On Sun, Jan 17, 2016 at 8:34 PM, Mikaël Cluseau [email protected]
wrote:

On 01/18/2016 03:31 PM, Tim Hockin wrote:

The way that Services are expressed in Kubernetes allows a single pod
to be
fronted by any number of Services, so this breaks down - we don't know
what
to SNAT to.

That's why I limited the case to a many-to-one (my first sentence :-)).
I'm just trying to draw a line around what can be done or not.


Reply to this email directly or view it on GitHub
https://github.com/kubernetes/kubernetes/issues/3760#issuecomment-172421828
.

On 01/18/2016 03:46 PM, Tim Hockin wrote:

Sure, I just have the unfortunate job of pointing out why it's not a
general enough solution :(

Yeah... but it's a FAQ. We could also put, somewhere, "if len(services)
== 1 { implement stateless } else { implement stateful }". But this may
look like a mess to the newbies. It could also be a contrib/elbs/something...

It's not even something we currently track (the number of Services that
front a given pod). We could, I suppose. I am not against it (even if it
seems niche). It sounds like a substantial change to make for having so
many caveats. I would like to think about better answers, still.

On Sun, Jan 17, 2016 at 8:51 PM, Mikaël Cluseau [email protected]
wrote:

On 01/18/2016 03:46 PM, Tim Hockin wrote:

Sure, I just have the unfortunate job of pointing out why it's not a
general enough solution :(

Yeah... but it's a FAQ. We could also put, somewhere, "if len(services)
== 1 { implement stateless } else { implement stateful }". But this may
look like a mess to the newbies. It could also be a
contrib/elbs/something...


Reply to this email directly or view it on GitHub
https://github.com/kubernetes/kubernetes/issues/3760#issuecomment-172425404
.

On 01/18/2016 04:07 PM, Tim Hockin wrote:

It's not even something we currently track (the number of Services that
front a given pod). We could, I suppose. I am not against it (even if it
seems niche). It sounds like a substantial change to make for having so
many caveats. I would like to think about better answers, still.

I agree, but don't have any better idea for now :-(

Even a purpose-built SDN would need to track something, I suppose. Maybe
label-based solutions like MPLS..?

On 01/18/2016 04:18 PM, Mikaël Cluseau wrote:

Even a purpose built SDN would need to track something I suppose.
Maybe label-based solutions like MPLS..?

In the idea of labelling things... if we assign one IP per service + one
IP per endpoint (service+pod couple), and add these endpoint IPs to the
pods, it should work fully stateless:

  • External to host: {from: clientIP:clientPort, to: externalIP:servicePort} -----[ELB selects one endpoint]--------> {from: clientIP:clientPort, to: endpointServiceIP:podPort} --> route to the host
  • Host to pod: {from: clientIP:clientPort, to: endpointServiceIP:podPort} --[standard routing to containers]--> {from: clientIP:clientPort, to: endpointServiceIP:podPort}
  • Pod to host: {from: endpointServiceIP:podPort, to: clientIP:clientPort} --------[standard routing to routers]-----> {from: endpointServiceIP:podPort, to: clientIP:clientPort}
  • Host to external: {from: endpointServiceIP:podPort, to: clientIP:clientPort} --------[ELB SNATs back]------------------> {from: clientIP:clientPort, to: externalIP:servicePort}

I think we can make this work for clusterIPs too.
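
A rough sketch of the node-side plumbing this would need, with purely hypothetical names and addresses (this is not something kube-proxy does today):

# give the pod a per-(service,endpoint) IP in addition to its pod IP
ip netns exec pod123 ip addr add 10.96.1.11/32 dev eth0
# on the host, route that endpoint IP to the pod's veth
ip route add 10.96.1.11/32 dev veth123
# the ELB/router then only needs routes for the endpoint ranges towards the hosts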

I can't make this work on GCE, and I am not sure about AWS - there is a
limited number of static routes available.

I wonder if I could do it by piggybacking 2 IP ranges together in a single
route. It's a lot of IPs to spend, but I guess it only matters for UDP.
I'll have to try it out.

Edit: I tried it out and couldn't make it work, but I'm missing something.
We'll have to add/remove IPs in containers in response to services coming
and going, but I could not make "extra" IPs in a container work (it could
ping but not TCP or UDP, not sure why).

I'll have to try again sometime.

On Sun, Jan 17, 2016 at 10:22 PM, Mikaël Cluseau [email protected]
wrote:

On 01/18/2016 04:18 PM, Mikaël Cluseau wrote:

Even a purpose built SDN would need to track something I suppose.
Maybe label-based solutions like MPLS..?

In the idea of labelling things... if we assign one IP per service + one
IP per endpoint (service+pod couple), and add these endpoint IPs to the
pods, it should work fully stateless:

  • External to host: {from: clientIP:clientPort, to: externalIP:servicePort} -----[ELB selects one endpoint]--------> {from: clientIP:clientPort, to: endpointServiceIP:podPort} --> route to the host
  • Host to pod: {from: clientIP:clientPort, to: endpointServiceIP:podPort} --[standard routing to containers]--> {from: clientIP:clientPort, to: endpointServiceIP:podPort}
  • Pod to host: {from: endpointServiceIP:podPort, to: clientIP:clientPort} --------[standard routing to routers]-----> {from: endpointServiceIP:podPort, to: clientIP:clientPort}
  • Host to external: {from: endpointServiceIP:podPort, to: clientIP:clientPort} --------[ELB SNATs back]------------------> {from: clientIP:clientPort, to: externalIP:servicePort}

I think we can make this work for clusterIPs too.


Reply to this email directly or view it on GitHub
https://github.com/kubernetes/kubernetes/issues/3760#issuecomment-172438133
.


I'm trying on my side to get something (with pure netns on my local host
for now).

I'm trying an approach where I assign service IP ranges to hosts to
reduce the number of routing entries:

cli -- elb -- h1 -- c1

| `--- c2

`--- h2 -- c2

h1_ep_ip_ranges=( 10.1.1.0/24 10.1.2.0/24 )
h2_ep_ip_ranges=( 10.1.3.0/24 )

No ping ATM (packets not going through the PREROUTING chain...), and
need to sleep. More on this tomorrow ;)
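
The idea being that the upstream router/ELB then only needs one route per host range, e.g. (host IPs are made up):

ip route add 10.1.1.0/24 via 192.168.0.1   # h1
ip route add 10.1.2.0/24 via 192.168.0.1   # h1
ip route add 10.1.3.0/24 via 192.168.0.2   # h2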

On 01/18/2016 06:28 PM, Tim Hockin wrote:

I can't make this work on GCE, and I am not sure about AWS - there is a
limited number of static routes available.

I wonder if I could do it by piggybacking 2 IP ranges together in a single
route. It's a lot of IPs to spend, but I guess it only matters for UDP.
I'll have to try it out.

Edit: I tried it out and couldn't make it work, but I'm missing something.
We'll have to add/remove IPs in containers in response to services coming
and going, but I could not make "extra" IPs in a container work (it could
ping but not TCP or UDP, not sure why).

I'll have to try again sometime.

I got a bit farther, but something I should have predicted happened.

I set up a pod with 10.244.2.8/25 as its main interface and 10.244.2.250/25
as its "in-a-service" interface. I was hoping that I could send UDP to
.250 and detect responses, to SNAT them. But of course, if the client is
not in the same /25 (which it cannot be), the default route kicks in, which
comes from the .8 address. tcpdump confirms that responses come from .8
when using UDP.

I am again at a place where I am not sure how to make it work. will think
more on it.
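
For what it's worth, the source-address selection can be seen directly with ip route get (addresses as in the experiment above; the gateway and client IPs are hypothetical, output abridged):

ip addr add 10.244.2.250/25 dev eth0        # secondary "in-a-service" address
ip route get 10.244.9.9                     # some client outside the /25
#   10.244.9.9 via 10.244.2.1 dev eth0 src 10.244.2.8 ...
# replies follow the default route and are sourced from .8,
# unless the socket explicitly binds to 10.244.2.250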

On Mon, Jan 18, 2016 at 2:59 AM, Mikaël Cluseau [email protected]
wrote:

I'm trying on my side to get something (with pure netns on my local host
for now).

I'm trying an approach where I assign service IP ranges to hosts to
reduce the number of routing entries:

cli -- elb -- h1 -- c1

| `--- c2

`--- h2 -- c2

h1_ep_ip_ranges=( 10.1.1.0/24 10.1.2.0/24 )
h2_ep_ip_ranges=( 10.1.3.0/24 )

No ping ATM (packets not going through the PREROUTING chain...), and
need to sleep. More on this tomorrow ;)

On 01/18/2016 06:28 PM, Tim Hockin wrote:

I can't make this work on GCE, and I am not sure about AWS - there is a
limited number of static routes available.

I wonder if I could do it by piggybacking 2 IP ranges together in a
single
route. It's a lot of IPs to spend, but I guess it only matters for UDP.
I'll have to try it out.

Edit: I tried it out and couldn't make it work, but I'm missing
something.
We'll have to add/remove IPs in containers in response to services coming
and going, but I could not make "extra" IPs in a container work (it could
ping but not TCP or UDP, not sure why).

I'll have to try again sometime.


Reply to this email directly or view it on GitHub
https://github.com/kubernetes/kubernetes/issues/3760#issuecomment-172497456
.

It dawns on me (via Abhishek) that even if this works, we STILL have to
track flows somewhere, so it's not stateless in the end anyway.

On Mon, Jan 18, 2016 at 9:50 PM, Tim Hockin [email protected] wrote:

I got a bit farther, but something I should have predicted happened.

I set up a pod with 10.244.2.8/25 as its main interface and
10.244.2.250/25 as its "in-a-service" interface. I was hoping that I
could send UDP to .250 and detect responses, to SNAT them. But of course,
if the client is not in the same /25 (which it can not be) the default
route kicks in, which comes from the .8 address. tcpdump confirms that
responses come from .8 when using UDP.

I am again at a place where I am not sure how to make it work. will think
more on it.

On Mon, Jan 18, 2016 at 2:59 AM, Mikaël Cluseau [email protected]
wrote:

I'm trying on my side to get something (with pure netns on my local host
for now).

I'm trying an approach where I assign service IP ranges to hosts to
reduce the number of routing entries:

cli -- elb -- h1 -- c1

| `--- c2

`--- h2 -- c2

h1_ep_ip_ranges=( 10.1.1.0/24 10.1.2.0/24 )
h2_ep_ip_ranges=( 10.1.3.0/24 )

No ping ATM (packets not going through the PREROUTING chain...), and
need to sleep. More on this tomorrow ;)

On 01/18/2016 06:28 PM, Tim Hockin wrote:

I can't make this work on GCE, and I am not sure about AWS - there is a
limited number of static routes available.

I wonder if I could do it by piggybacking 2 IP ranges together in a
single
route. It's a lot of IPs to spend, but I guess it only matters for UDP.
I'll have to try it out.

Edit: I tried it out and couldn't make it work, but I'm missing
something.
We'll have to add/remove IPs in containers in response to services
coming
and going, but I could not make "extra" IPs in a container work (it
could
ping but not TCP or UDP, not sure why).

I'll have to try again sometime.


Reply to this email directly or view it on GitHub
https://github.com/kubernetes/kubernetes/issues/3760#issuecomment-172497456
.

That's unfortunate :-( not sure why, by the way. I'll try something with MPLS then; I want to learn it anyway.

If you have 2 backends for Service and you want to send more than a single
packet, you need to track flows in some way, don't you? Or are you
assuming it is safe to spray packets at different backends?

On Wed, Jan 20, 2016 at 12:24 PM, Mikaël Cluseau [email protected]
wrote:

That's unfortunate :-( not sure why by the way. I'll try something with
MPLS then, I want to learn it anyway.


Reply to this email directly or view it on GitHub
https://github.com/kubernetes/kubernetes/issues/3760#issuecomment-173348973
.

I kind of assumed that for UDP workloads, yes. Going stateless could also be made optional, even for UDP. @qoke any comment on this?

Also, we could use things like client IP hashing to make the flow more stable while still balanced (I don't know if we can call that "some kind of tracking" :-)).

@MikaelCluseau we use the default IPVS behaviour, which does some very lightweight UDP "stickiness"...

For scheduling UDP datagrams, IPVS load balancer records UDP datagram scheduling with configurable timeout, and the default UDP timeout is 300 seconds. Before UDP connection timeouts, all UDP datagrams from the same socket (protocol, ip address and port) will be directed to the same server.

-- Quoted from http://kb.linuxvirtualserver.org/wiki/IPVS

Of course, this only works if you have many clients talking to a single service, or a single client with varying source ports. If you have a single high-volume client, all sending traffic from the same source port, and you want to load balance this over multiple backends, then you may prefer to use a stateless/spray-and-pray approach.

We load balance a lot of DNS and RADIUS traffic - DNS typically falls into the first category (lots of clients, or clients with lots of source ports), and RADIUS typically falls into the latter category (few clients, lots of packets all from the same IP/port). Rather than using a stateless hash for RADIUS we instead decided to randomize source ports to get an even spread.
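
For anyone curious, the sort of IPVS setup described above can be sketched with ipvsadm along these lines (the VIP, backends and timeouts are illustrative only):

# UDP virtual service with round-robin scheduling
ipvsadm -A -u 10.0.0.10:53 -s rr
ipvsadm -a -u 10.0.0.10:53 -r 10.244.1.5:53 -m   # -m = masquerade/NAT mode
ipvsadm -a -u 10.0.0.10:53 -r 10.244.2.7:53 -m
# tcp / tcpfin / udp timeouts; the 300s UDP timeout is what gives the light "stickiness"
ipvsadm --set 900 120 300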

After reading the whole thread I still can't figure out whether activating the iptables mode for kube-proxy should fix the problem of external IPs being hidden (#10921) or not. We did enable the iptables mode with v1.1 as suggested here but we're still seeing the IPs from the cluster, not the real ones from our users.

Our cluster is in GCE and we just need a load balancer with HTTPS support before we go live. As GCE doesn't support v1.2 alpha we cannot use the new Ingress (which AFAIK supports HTTPS load balancers), so the Network Load Balancer is our only option. But obviously we cannot go live without the ability to log the real IPs of our users.

Some clarification for new users on this would be appreciated. Supporting HTTPS is mandatory for many of us. Thanks!

I have been using the iptables proxy on and off for quite some time and can confirm that the external IPs of clients are still hidden/show cluster IPs.

We've gotten around this so far by running our frontend HTTP/HTTPS proxy in host network mode so that it sees the source IP address.

@maclof thanks for the feedback. Could you share more info about your workaround? What do you mean by running your HTTP/HTTPS proxy in host network mode?

@javiercr we use a pod spec something like this: http://pastie.org/private/zpdelblsob654zif7xus5g

Using host network means that the pod runs in the host machines network, instead of being assigned a cluster IP.

That means when our nginx server binds to port 80/443 it will listen on a host IP and will see the source IP addresses.

I'm using kubernetes 1.1, /opt/bin/kube-proxy ... --proxy-mode=iptables --masquerade-all=false, and routing the cluster IP network through a host running kube-proxy. In this setup, my services are seeing the external IP. I use a highly available network namespace which has an external IP and a route to the hosts:

I0221 01:20:32.695440       1 main.go:224] <A6GSXEKN> Connection from 202.22.xxx.yyy:51954 closed.
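
Concretely, the only extra piece is a route for the cluster IP range pointing at a host running kube-proxy in iptables mode, something like (the CIDR and host IP here are made up):

ip route add 10.0.0.0/16 via 192.168.0.21   # host running kube-proxy --proxy-mode=iptables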

I've learned a lot reading this thread!

As an FYI this doc states that AWS ELB uses round-robin for TCP connections and least connections for http/https: http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/how-elb-works.html#request-routing

I agree that focusing on getting requests only to nodes that run pods, and trying to serve local pods, is the best way to go about it. The nice side benefit of this is that there'll be less node-to-node traffic within the cluster, and I suppose a latency improvement from always servicing requests from service to pod locally (which I guess is of even more benefit if you have nodes in multiple availability zones in the same cluster).

In terms of working with a load balancer that doesn't support weighting, you could solve this with your replication controller by trying to always keep the same number of pods on a node (if there's more than 1 per node) and then distributing evenly between them, even if this means having to move pods off of a node in certain situations and allowing only certain replica counts, e.g. for a 4-node cluster with a service connected to a load balancer, the only acceptable pod replica counts would be 1, 2, 3, 4, 6, 8, 9, 12, 16, 20, etc.

We're also looking to solve for traffic routing to local pods only. I'd be fine with the NodePort going away on a node at times when no pods are present locally for a service. That way a simple load balancer TCP health check would prevent requests from going to those nodes. I think if we can at least solve for the iptables/kube-proxy portion of this, then we'll find out what the load balancing implications are when the pods are not balanced across the cluster. I think there are ways to solve that on load balancers without having to set a weight for each node with an API call.

Load balancers already deal with this using other dynamic methods. Also, depending on what the service you're running is actually doing inside that container for each API call, it may not be able to support 2x the traffic when there are 2 pods on a node vs. one anyway. Whether Kubernetes limits are set, and whether maximum levels of usage are being approached on a pod/node, could play into this as well, which adds yet another layer of complexity to trying to find the right weight setting on the external load balancer.

I'd say, stay away from that level of complexity and not try to set load balancer weight from kubernetes.

@yoshiwaan Can I suggest opening a new issue for the inter-node traffic suggestion, as this issue is now closed. Personally I think a good first step would be to ensure that _if_ a pod is running on the local node, that we route to the local pod. I suspect that this will be sufficient, because you can then scale your RC so that there are pods on every node.

@justinsb +1, also we're running into a problem now where we need to see client IPs and it's basically impossible with the current setup.

This could be way too naive, but I was wondering: what's the difference between userspace mode and iptables mode? I cannot really tell from the user doc.

Userland mode means kube-proxy handles the connections itself by receiving the connection request from the client and opening a socket to the server, which (1) consumes much more CPU and memory and (2) is limited by the number of ports a single process can open (<65k). The iptables mode works at a lower level, in the kernel, and uses connection tracking instead, so it's much lighter and handles a lot more connections*.

(edit) (*) As long as you don't SNAT packets going through, which in turn requires a setup where you are sure packets will cross the connection tracking rules associated with them. For instance, using a routed access design allows you to avoid SNAT, which means the endpoint of the service will see the real client's IP.

@MikaelCluseau
meaning kube-proxy is only responsible for setting up and maintaining iptables rules and we no longer get a random local port for each service in iptables mode, right?

On 04/19/2016 10:51 PM, Emma He wrote:

meaning kube-proxy is only responsible for setting up and maintaining
iptables and we no longer get a random local port for each service in
iptables mode, right?

Yes.

Sorry but I absolutely missed this earlier.

(edit) (*) As long as you don't SNAT packets going through, which in turn requires a setup where you are sure packets will cross the connection tracking rules associated with them. For instance, using a routed access design allows you to avoid SNAT, which means the endpoint of the service will see the real client's IP.

@MikaelCluseau I was thinking iptables adopts SNAT and DNAT, which is not the case according to you. Could you please clarify this for me?

On 04/20/2016 01:59 PM, Emma He wrote:

@MikaelCluseau https://github.com/MikaelCluseau I was thinking
iptables adopts SNAT and DNAT, which is not the case according to you.
Could you please clarify this for me?

It's the tricky part.

(1) Using service/external IPs requires DNAT.
(2) If you are sure reply packets will go through the same conntrack
rule (ie, the same network stack or a replicated conntrack table), you
can skip the SNAT part (ie, MASQUERADE rules).

The condition of (2) is usually ok in routed access network designs
(which is the simplest design I can think of).

For instance, given

  • a client 1.0.1.1,
  • a service 1.0.2.1,
  • a pod implementing the service 1.0.3.1.

Then,

  1. Your router/firewall/loadbalancer/host/whatever receives a packet
    for the service so it sees a packet "1.0.1.1 -> 1.0.2.1";
  2. It DNATs it to the endpoint (pod) so the packet will be "1.0.1.1 ->
    1.0.3.1" in the cluster network;
  3. The pod replies with a packet "1.0.3.1 -> 1.0.1.1";
  4. The packet goes through a router/firewall/loadbalancer/host/whatever
    having the conntrack rule, the conntrack system rewrites the packet
    to "1.0.2.1 -> 1.0.1.1" before sending it back to the client.

If the condition of (2) cannot be met, you have to use SNAT/MASQUERADING
to be sure that the packet will go back through the
router/firewall/loadbalancer/host/whatever's conntrack.
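
In iptables terms, on the router/host for which condition (2) holds, this boils down to something like the following (the port is illustrative):

iptables -t nat -A PREROUTING -d 1.0.2.1 -p tcp --dport 80 -j DNAT --to-destination 1.0.3.1:80
# no SNAT/MASQUERADE rule: conntrack rewrites the reply "1.0.3.1 -> 1.0.1.1"
# back to "1.0.2.1 -> 1.0.1.1" on its way through this same host
# only if replies could bypass this host would you need something like:
# iptables -t nat -A POSTROUTING -d 1.0.3.1 -p tcp --dport 80 -j MASQUERADE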

@MikaelCluseau - drop me an email at my github name at google.com - I have
something for you

On Tue, Apr 19, 2016 at 8:20 PM, Mikaël Cluseau [email protected]
wrote:

On 04/20/2016 01:59 PM, Emma He wrote:

@MikaelCluseau https://github.com/MikaelCluseau I was thinking
iptables adopts SNAT and DNAT, which is not the case according to you.
Could you please clarify this for me?

It's the tricky part.

(1) Using service/external IPs requires DNAT.
(2) If you are sure reply packets will go through the same conntrack
rule (ie, the same network stack or a replicated conntrack table), you
can skip the SNAT part (ie, MASQUERADE rules).

The condition of (2) is usually ok in routed access network designs
(which is the simplest design I can think of).

For instance, given

  • a client 1.0.1.1,
  • a service 1.0.2.1,
  • a pod implementing the service 1.0.3.1.

Then,

  1. Your router/firewall/loadbalancer/host/whatever receives a packet
    for the service so it sees a packet "1.0.1.1 -> 1.0.2.1";
  2. It DNATs it to the endpoint (pod) so the packet will be "1.0.1.1 ->
    1.0.3.1" in the cluster network;
  3. The pod replies with a packet "1.0.3.1 -> 1.0.1.1";
  4. The packet goes through a router/firewall/loadbalancer/host/whatever
    having the conntrack rule, the conntrack system rewrites the packet
    to "1.0.2.1 -> 1.0.1.1" before sending it back to the client.

If the condition of (2) cannot be met, you have to use SNAT/MASQUERADING
to be sure that the packet will go back through the
router/firewall/loadbalancer/host/whatever's conntrack.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
https://github.com/kubernetes/kubernetes/issues/3760#issuecomment-212230959

@justinsb @yoshiwaan did anyone ever create an issue for this? My search fu is failing me, and I have a similar need.

Can I suggest opening a new issue for the inter-node traffic suggestion, as this issue is now closed. Personally I think a good first step would be to ensure that if a pod is running on the local node, that we route to the local pod. I suspect that this will be sufficient, because you can then scale your RC so that there are pods on every node.

I didn't raise it myself

Ahhhhh, I think I found it, this appears to be the feature/fix: https://github.com/kubernetes/features/issues/27

Seems to be beta as of 1.5.x.
