Kubernetes: PetSet (was nominal services)

Created on 27 Jun 2014  ·  160 Comments  ·  Source: kubernetes/kubernetes

@smarterclayton raised this issue in #199: how should Kubernetes support non-load-balanced and/or stateful services? Specifically, Zookeeper was the example.

Zookeeper (or etcd) exhibits 3 common problems:

  1. Identification of the instance(s) clients should contact
  2. Identification of peers
  3. Stateful instances

And it enables master election for other replicated services, which typically share the same problems, and probably need to advertise the elected master to clients.

area/api area/downward-api area/stateful-apps kind/design kind/documentation priority/important-soon sig/network

Most helpful comment

Where can I find docs for this? I'd like to test it out for a database-failover use case.

All 160 comments

Note that we should probably also rename service to lbservice or somesuch to distinguish them from other types of services.

As part of this, I'd remove service objects from the core apiserver and facilitate the use of other load balancers, such as HAProxy and nginx.

It would be nice if the logical definition of a service (the query and/or global name) was able to be used/specialized in multiple ways - as a simple load balancer installed via the infrastructure, as a more feature complete load balancer like nginx or haproxy also offered by the infrastructure, as a queryable endpoint an integrator could poll/wait on (GET /services/foo -> { endpoints: [{host, port}, ...] }), or as information available to hosts to expose local load balancers. Obviously these could be multiple different use cases and as such split into their own resources, but having some flexibility to specify intent (unify under a lb) distinct from mechanism makes it easier to satisfy a wide range of reqts.

@smarterclayton I agree with separating policy and mechanism.

Primitives we need:

  1. The ability to poll/watch a set identified by a label selector. Not sure if there is an issue filed yet.
  2. The ability to query pod IP addresses (#385).

This would be enough to compose with other naming/discovery mechanisms and/or load balancers. We could then build a higher-level layer on top of the core that bundles common patterns with a simple API.
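A minimal sketch of what primitive (1) could look like, purely illustrative (the structures and function names are hypothetical, not a real Kubernetes API): select the set of pods matching a label selector, and diff successive snapshots to observe membership changes.

```python
# Hypothetical sketch of primitive (1): poll a set of pods identified by a
# label selector and report membership changes between polls.

def select_pods(pods, selector):
    """Return the names of pods whose labels match every key/value in selector."""
    return {
        name for name, labels in pods.items()
        if all(labels.get(k) == v for k, v in selector.items())
    }

def diff_membership(previous, current):
    """Compare two snapshots and report which pods joined or left the set."""
    return {"added": current - previous, "removed": previous - current}

pods = {
    "web-1": {"app": "web"},
    "web-2": {"app": "web"},
    "db-1": {"app": "db"},
}
selector = {"app": "web"}
snapshot = select_pods(pods, selector)          # {'web-1', 'web-2'}
pods["web-3"] = {"app": "web"}                  # a new replica appears
changes = diff_membership(snapshot, select_pods(pods, selector))
```

A watch would deliver the same diffs as events instead of requiring the caller to poll.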

Given the two primitives described by @bgrant0607, is it worth keeping this issue open? Or are there more specific issues we can file?

I don't think zookeeper is solved - since you need the unique identifier in each container. I _think_ you could do this with 3 separate replication controllers (one per instance) or a mode on the replication controller.

Service design I think deserves some discussion as Brian notes. Currently it couples an infrastructure abstraction (local proxy) with a mechanism for exposure (environment variables in all containers) with a label query. There is an equally valid use case for an edge proxy that takes L7 hosts/paths and balances them to a label query, as well as supporting protocols like http(s) and web sockets. In addition, services have a hard scale limit today of 60k backends, shared across the entire cluster (the amount of IPs allocated). It should be possible to run a local proxy on a minion that proxies only the services the containers on that host need, and also to avoid containers having to know about the external port. We can move this discussion to #494 if necessary.

Tackling the problem of singleton services and non-auto-scaled services with fixed replication, such as master-slave replicated databases, key-value stores with fixed-size peer groups (e.g., etcd, zookeeper), etc.

The fixed-replication cases require predictable array-like behavior. Peers need to be able to discover and individually address each other. These services generally have their own client libraries and/or protocols, so we don't need to solve the problem of determining which instance a client should connect to, other than to make the instances individually addressable.

Proposal: We should create a new flavor of service, called Cardinal services, which map N IP addresses instead of just one. Cardinal services would perform a stable assignment of these IP addresses to N instances targeted by their label selector (i.e., a specified N, not just however many targets happen to exist). Once we have DNS ( #1261, #146 ), it would assign predictable DNS names based on a provided prefix, with suffixes 0 to N-1. The assignments could be recorded in annotations or labels of the targeted pods.
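The assignment behavior being proposed might be sketched as follows; this is illustrative only, and the annotation key is hypothetical. Each targeted pod gets a stable index in [0, N), an existing assignment is never disturbed, and a replacement pod inherits a freed index; DNS names follow the prefix-plus-suffix scheme.

```python
# Illustrative sketch of the proposed Cardinal assignment: give N targeted
# pods stable indices 0..N-1, record each assignment on the pod (here as an
# annotation), and never disturb an existing assignment.

ANNOTATION = "cardinal.example/index"  # hypothetical key

def assign_indices(pods, n):
    """Assign each pod a stable index in [0, n), preserving prior assignments."""
    used = {p[ANNOTATION] for p in pods.values() if ANNOTATION in p}
    free = sorted(set(range(n)) - used)
    for name in sorted(pods):
        pod = pods[name]
        if ANNOTATION not in pod and free:
            pod[ANNOTATION] = free.pop(0)
    return {name: pod.get(ANNOTATION) for name, pod in pods.items()}

def dns_names(assignments, prefix, domain="cluster.local"):
    """Predictable DNS names, <prefix>-<index>.<domain>, per the proposal."""
    return {f"{prefix}-{i}.{domain}": name
            for name, i in assignments.items() if i is not None}

pods = {"etcd-a": {}, "etcd-b": {}, "etcd-c": {}}
first = assign_indices(pods, 3)       # a, b, c get 0, 1, 2
del pods["etcd-b"]                    # pod 1 dies...
pods["etcd-d"] = {}                   # ...and is replaced
second = assign_indices(pods, 3)      # survivors keep 0 and 2; etcd-d gets 1
```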

This would preserve the decoupling of role assignment from the identities of pods and replication controllers, while providing stable names and IP addresses, which could be used in standard application configuration mechanisms.

Some of the discussion around different types of load balancing happened in the services v2 design: #1107.

I'll file a separate issue for master election.

/cc @smarterclayton @thockin

The assignments would have to carry through into the pods via some environment parameterization mechanism (almost certainly).

For the etcd example, I would create:

  • replication controller cardinality 1: 1 pod, pointing to stable storage volume A
  • replication controller cardinality 2: 1 pod, pointing to stable storage volume B
  • replication controller cardinality 3: 1 pod, pointing to stable storage volume C
  • cardinal service 'etcd' pointing to the pods

If pod 2 dies, replication controller 2 creates a new copy of it and reattaches it to volume B. Cardinal service 'etcd' knows that that pod is new, but how does it know that it should be cardinality 2 (which comes from data stored on volume B)?

Rather than 3 replication controllers, why not a sharding controller, which
looks at a label like "kubernetes.io/ShardIndex" when making decisions. If
you want 3-way sharding, it makes 3 pods with indices 0, 1, 2. I feel like
this was shot down before, but I can't reconstruct the trouble it caused in
my head.

It just seems wrong to place that burden on users if this is a relatively
common scenario.

Do you think it matters if the nominal IP for a given pod changes due to
unrelated changes in the set? For example:

at time 0, pods (A, B, C) make up a cardinal service, with IP's
10.0.0.{1-3} respectively

at time 1, the node which hosts pod B dies

at time 2, the replication controller driving B creates a new pod D

at time 3, the cardinal service changes to (A, C, D) with IP's 10.0.0.{1-3}
respectively

NB: pod C's "stable IP" changed from 10.0.0.3 to 10.0.0.2 when the set
membership changed. I expect this will do bad things to running
connections.

To circumvent this, we would need to have the ordinal values specified
outside of the service, or something else clever. Maybe that is OK, but it
seems fragile and easy to get wrong if people have to deal with it.
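The scenario above can be made concrete with a small illustrative sketch: assigning VIPs by sorted position silently moves pod C's IP when B is replaced by D, while an assignment that remembers prior mappings keeps C stable.

```python
# Illustrative reproduction of the time 0..3 scenario: with position-based
# assignment, pod C's "stable" IP shifts from 10.0.0.3 to 10.0.0.2 when B is
# replaced by D; a sticky assignment that remembers prior mappings does not.

VIPS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

def by_position(members):
    """Naive: the i-th member of the sorted set gets the i-th VIP."""
    return dict(zip(sorted(members), VIPS))

def sticky(members, previous):
    """Keep any VIP a member already holds; hand free VIPs to newcomers."""
    kept = {m: ip for m, ip in previous.items() if m in members}
    free = [ip for ip in VIPS if ip not in kept.values()]
    for m in sorted(members):
        if m not in kept:
            kept[m] = free.pop(0)
    return kept

t0 = by_position({"A", "B", "C"})   # C -> 10.0.0.3
t3 = by_position({"A", "C", "D"})   # C silently moves to 10.0.0.2
fixed = sticky({"A", "C", "D"}, t0) # C keeps 10.0.0.3; D takes 10.0.0.2
```

The sticky variant is essentially "ordinal values specified outside of the service": the mapping itself becomes state that must live somewhere.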


I think a sharding controller makes sense and is probably more useful in context of a cardinal service.

I do think that IP changes based on membership are scary and I can think of a bunch of degenerate edge cases. However, if the cardinality is stored with the pods, the decision is less difficult.

First of all, I didn't intend this to be about sharding -- that's #1064. Let's move sharding discussions to there. We've seen many cases of trying to use an analogous mechanism for sharding, and we concluded that it's not the best way to implement sharding.

Second, my intention is that it shouldn't be necessary to run N replication controllers. It should be possible to use only one, though the number required depends on deployment details (canaries, multiple release tracks, rolling updates, etc.).

Third, I agree we need to consider how this would interact with the durable data proposal (#1515) -- @erictune .

Fourth, I agree we probably need to reflect the identity into the pod. As per #386, ideally a standard mechanism would be used to make the IP and DNS name assignments visible to the pod. How would IP and host aliases normally be surfaced in Linux?

Fifth, I suggested that we ensure assignment stability by recording assignments in the pods via labels or annotations.

Sixth, the problem with a "sharding controller" replacing the replication controller is that I want to decouple role assignment from image/environment management. I see replication controller as providing the latter (see https://github.com/GoogleCloudPlatform/kubernetes/issues/260#issuecomment-57678564).

In the case of durable data, as per #1515:

  1. For durable pods, this proposal would just work. Assignment would be stable for the durable pod.
  2. For separate data objects, the cardinal service would also need to manage role assignment for the data objects, and would defer role assignment to the pods until after they were matched to data. I think this would be straightforward.

/cc @erictune

I think you were trying to make the conversation easier to have, but I am not sure it worked.

Re: sharding - isn't "a set of replicas with distinct identity" basically sharding?

Re: 1 replication controller - we don't have replication controller assigning indices today. I don't think we want that, do we?

Re: telling the pod its own identity - service is a clean layer over pod. It would be messy to tell a service about an impending pod so that it could assign an IP before it exists. I think the ID needs to be part of the pod, e.g. a ShardIndex label or something. We can reflect that into the pod (somehow) and service can use that for assigning IP. If we let the Cardinal service do that itself, then the pod will have already been started by the time it is assigned. We could have per-pod metadata like we do with VMs in GCE, but that's a bigger proposal.

No, providing stable, distinct identities is neither necessary nor sufficient for sharding. See #1064 for details, which I just added there.

No, we don't want replication controller to assign indices. I intentionally proposed that Cardinal services do that instead.

Yes, I expect pods to exist (and potentially to have started) before they are assigned roles (indices). That was deliberate. It also needs to be possible to revoke an assignment.

Possible identity approach: Create a non-persistent IP alias in the container and provide reverse DNS lookup in the DNS implementation.

However, I don't think plumbing the identity into the pod containers is essential for many use cases. If the application is using self-registration, it probably doesn't even need this mechanism.

If the service manager is willing to keep some state aside from what it
currently does, we can just remember the mappings that were previously made
and try to respect them.


Re. remembering the mappings, I agree -- see https://github.com/GoogleCloudPlatform/kubernetes/issues/260#issuecomment-57679787.

I don't know how that's relevant to the comment you replied to, though.

GitHub and email don't mix. Sorry


I like nominal as well. Will hate explaining it to users, but it's very often going to come up in use for config created things (i.e. a nominal mongodb replica set is going to be a replication controller, a nominal service, volumes, so it's already somewhat complex).

----- Original Message -----

Tim suggested "nominal", which I agree is a better fit.

http://www.mathsisfun.com/definitions/nominal-number.html
http://www.mathsisfun.com/definitions/cardinal-number.html
http://www.mathsisfun.com/definitions/ordinal-number.html



Two problems I see:

  1. Role assignment involves setting at least environment variables or volume settings per pod (i.e., the first pod needs to get volume A, and if deleted, its replacement needs that same volume).
  2. If the pods for a nominal service come from a single replication controller, the template has to be modified after creation but before the pod is bound.

This seems to imply role assignment is a controller that sits in between the apiserver and the scheduler as a workflow step. It also implies some form of transformation or override of the template, such that the template without the role assignment doesn't necessarily make sense.

The shard controller seems like a variant of this problem, just for more complicated role assignment.

Example of running zookeeper with one service per instance: http://iocanel.blogspot.com/2014/10/zookeeper-on-kubernetes.html

So I know this is a long-old thread, but it hits a topic near and dear to me ;-)

Provided the system can push forward+reverse records for non-nat'd "nominal-services" to skydns and use names as the ENV injection into the pods that use that service, are there other limitations?

It may look a little weird in the case of ZK where every element in the quorum would use the other elements, eg:
zk1 uses: zk2, zk3
zk2 uses: zk1, zk3
zk3 uses: zk1, zk2

but in theory it should work right? Provided we can add the reverse records for the nominal services.

@bgrant0607 am I missing anything?
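The peer wiring described above (zk1 uses zk2 and zk3, etc.) could be generated from per-member DNS names roughly like this; the names are hypothetical, but the `server.N=host:peerPort:electionPort` line format is ZooKeeper's standard quorum configuration.

```python
# Illustrative only: given per-member nominal DNS names (hypothetical), each
# ZooKeeper member's configuration lists every peer, itself included, so all
# three configs are identical apart from each member's own myid file.

def zk_server_lines(members, peer_port=2888, election_port=3888):
    """ZooKeeper's server.N=host:peerPort:electionPort lines, one per member."""
    return [
        f"server.{i}={host}:{peer_port}:{election_port}"
        for i, host in sorted(members.items())
    ]

members = {
    1: "zk1.ns.cluster.local",
    2: "zk2.ns.cluster.local",
    3: "zk3.ns.cluster.local",
}
lines = zk_server_lines(members)
```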

It becomes slightly weirder when other applications wish to use the overall service it provides. (https://github.com/GoogleCloudPlatform/kubernetes/issues/1542)

@timothysc What you propose would work if zk1, zk2, and zk3 were services, at least once multi-port services are supported.

Yeah, there's an ugly here, though.

A running pod doesn't know what services "front" it. E.g. given a service
of one backend instance, the VIP used to access the service is not known by
the backend (unless it seeks that information by knowing the service
name). In a nominal service(N), you would have N backends and N VIPs (and
presumably N DNS names or an N-address name), but the backends would still
not know their own VIPs. They might observe the pool of DNS names, but
don't know (without probing) which one is self. This would make your ZK
case hard to use the VIPs (also VIPs are observably proxied right now).

Alternatives:

1) set up 1 service(N) and have each instance probe all VIPs for self

2) set up N service(1) instances and use labels and cmdline flags to
manually assign indices, make each ZK backend know (a priori) the DNS name
of each other ZK backend

3) Do DNS for label selectors, assign a DNS name to the ZK replica set,
hope clients use DNS correctly

4) Run kube2zk bridges alongside each ZK that sync kubernetes endpoints ->
ZK config

5) Design an alternative way to assign VIPs or indices that is more
replication-controller-centric than service-centric. Brendan and I
brainstormed some ideas here a while back, but have had no time to follow
up on it.

As for reverse DNS - I am not sure I see the role of it? I'm not against
supporting it (i.e. ask @bketelsen to support it :) but I don't think it
applies here. Traffic never comes "from" a service IP.

Tim

On Sat, Mar 7, 2015 at 8:56 PM, Brian Grant [email protected]
wrote:

@timothysc https://github.com/timothysc What you propose would work if
zk1, zk2, and zk3 were services, at least once multi-port services are
supported.


Reply to this email directly or view it on GitHub
https://github.com/GoogleCloudPlatform/kubernetes/issues/260#issuecomment-77733331
.

(5) sounds like this proposal.

I'm a huge fan of 5 - the pods knowing their role is step one; the pods having a unique DNS identity, or using endpoints to get the other pods' IPs and reacting to them, is step two. Being able to look up a pod's IP in endpoints by a stable role ID would be step three.


I'll have to make some time to write up this idea a bit more, then.


@thockin re: reverse DNS, let's just wave our hands about and consider it a requirement.

ZK will break without it, as will many other distributed systems.

An RC that applied a unique label (members=#) to each pod it creates and tried to maintain a sequence up to N replicas, plus a headless service that created an A and CNAME record for each value of the "member" label (#.service.namespace.local), where the root name service.namespace.local round-robins across 1.service.namespace.local, 2.service.namespace.local, etc., seems logical.

We _could_ use strict pod IPs for those individual DNS labels. Creating fake IPs for each one gets expensive and the container won't be able to send its own IP to someone.


@timothysc re: reverse DNS - what is it reverse DNSing? The source-IP of
connections? None of that works in kube. Connections through services are
proxied, so source-ip doesn't work in the first place (could be fixed).

I don't know ZK - can you explain what they are trying to get by reverse
DNS? It seems like a very fragile assumption.


I think my experience (and maybe Tim's) is that the majority of clustered software today expects the following:

  • each node has a stable identity
  • that identity is a DNS name
  • the IP of the underlying identity does not have to be stable
  • the node expects to identify itself to other nodes either by its stable identity (DNS) or its public IP
  • some of the clustered software expects the IP of the node a client reaches it on to match an IP it can announce itself to others in the cluster and for that IP to be reachable.


On Mon, Mar 9, 2015 at 12:49 PM, Clayton Coleman
[email protected] wrote:

RC that applied a unique label (members=#) to each pod it creates, and tries to create a sequence up to
N replicas, and then a headless service that created an A and CNAME name for each value of "member"
label (#.service.namespace.local), where the root served all of those service.namespace.local -> round
robin -> 1.service.namespace.local, 2.service.namespace.local seems local.

The idea that Brendan and I were bouncing around was basically a pool
of tokens, perhaps part of RC, perhaps not, that each element of an RC
would get assigned. Given that token, other things could be assigned -
VIPs, PDs, generic indices, etc. What's nice is that when a pod dies,
the token is returned to the pool, and when that pod is replaced,
that token is re-used.

The problem comes with working out how to turn "arbitrary token" into
something meaningful.

We _could_ use strict pod IPs for those individual DNS labels. Creating fake IPs for each one gets expensive and the container won't be able to send its own IP to someone.

I haven't tried, but I bet we could make nominal services do full SNAT
and DNAT since they are 1:1. This would be a way to get stable
per-pod IPs without migratable IPs.
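The token-pool idea could look roughly like this; a sketch under stated assumptions (in-memory state, lowest-free-token policy), not an API proposal.

```python
# Sketch of the token-pool idea above: a fixed pool of tokens; each pod in an
# RC acquires one, and a replacement pod reuses the token its predecessor
# released. Tokens could then anchor VIPs, PDs, or generic indices.

class TokenPool:
    def __init__(self, size):
        self.free = list(range(size))   # tokens 0..size-1, lowest first
        self.held = {}                  # pod name -> token

    def acquire(self, pod):
        """Give the pod the lowest free token (idempotent per pod)."""
        if pod not in self.held:
            self.held[pod] = self.free.pop(0)
        return self.held[pod]

    def release(self, pod):
        """Return a dead pod's token to the pool for its replacement."""
        token = self.held.pop(pod)
        self.free = sorted(self.free + [token])

pool = TokenPool(3)
for p in ("pod-a", "pod-b", "pod-c"):
    pool.acquire(p)
pool.release("pod-b")                 # pod-b dies; its token goes back
replacement = pool.acquire("pod-d")   # pod-d re-uses the freed token
```

In a real controller the pool would have to be durable state, which is exactly the "who remembers the mapping" problem discussed earlier in the thread.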


Far be it from me to swim upstream - if enough apps really expect DNS to
work this way, we can make it work this way.

Sketch:

Creating a replication controller R creates a DNS name "$R.rc" which is the
pool of pod IPs that are "under" that RC. Each pod P gets a DNS name
"$P.$R.rc". But what is P? It could be simple indices, but that has
bitten us hard internally. It could be random strings like GenerateName,
but they have to survive pod death/restart (and maybe pod hostnames should
match?).


I kind of like DNS being attached to nominal services rather than RC.

What was the pain point with simple indices internally? P > 5000?


+1 to special-casing 1:1 VIPs. I think that's going to be a common case.

I'm still worried about changing DNS mappings dynamically. I'd prefer an approach that didn't require that.

Something to keep in mind when evaluating alternatives is that I'm 100% certain that we're eventually going to need 2 pods with the same role to coexist at the same time, the old and the new. Therefore, the role can't be tied to object name nor to something else that must both be unique and must be set in stone at pod creation time.

If we tie the role-assignment mechanism to replication controller, that pretty much rules out rolling updates, canaries, etc. I'd like to avoid that if possible.

@smarterclayton Simple indices aren't problematic due to scale. It's due to the model. See my comments from a few minutes ago. Simple indices are the way to go if they can be assigned dynamically independent of pod and RC identity.

Given that one of the problems is system-oriented assumptions, is there something Linux could do for us here? For instance, could we create an IP alias or somesuch in the container for the service VIP?

Hey guys,

I am out all this week. This is a really fun conversation, but it requires
more time than I have at hand. I'd be happy to sit and discuss in real
time next week.

On Tue, Mar 10, 2015 at 3:56 PM, Brian Grant [email protected]
wrote:

Given that one of the problems is system-oriented assumptions, is there
something Linux could do for us here? For instance, could we create an IP
alias or somesuch in the container for the service VIP?



I'm out of pocket next two days - next week it is.


+1 for next week, perhaps a hangout.

Poking this thread again due to related topics popping up (https://github.com/GoogleCloudPlatform/kubernetes/issues/4825#issuecomment-76193417, https://github.com/GoogleCloudPlatform/kubernetes/issues/175#issuecomment-84423902, https://github.com/GoogleCloudPlatform/kubernetes/issues/1607#issuecomment-88177147, #6393).

  1. How do we assign network identities (DNS names and/or IP addresses) to individual pods, which may be singletons or part of a nominal set?
  2. How do we convey the identity to containers within a pod?

Should we schedule a hangout, or use a weekly community hangout?

Also https://github.com/openshift/mongodb/pull/14 which we are starting to prototype what a generic pattern would be for membership set initialization (something we can either put in a container, or library, or...)

@danmcp @mfojtik


IIUC, one piece of that puzzle is master election #1542.

Yeah, we discussed it. For most clusters that can do their own election, membership is the most important part (detect membership from Kube endpoints, apply it to the cluster), but there are always different ways of approaching it. For instance, Cassandra basically uses Kube as the external source of truth for membership, so it's SEP (someone else's problem), and that makes leadership election easier. Although I believe depending on how members lag you could end up with a partition if membership flaps.

For mongo, you want to contact each of the members and either have them join an existing cluster or form a new cluster. If multiple clusters exist, you don't want to make them worse.

Note that the problem of communicating a service IP to a container is similar to communicating an external IP to a VM: https://cloud.google.com/compute/docs/networking. GCE translates the external addresses to/from internal addresses, and external addresses are communicated via the metadata server: http://169.254.169.254/computeMetadata/v1/instance/network-interfaces/0/access-configs/0/external-ip. AWS is similar: http://169.254.169.254/latest/meta-data/public-ipv4.
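For concreteness, a minimal sketch of reading those metadata endpoints from inside an instance. The URLs are the ones cited above; the function names are made up for illustration, and the request only succeeds when run on an actual cloud VM:

```python
# Sketch: how a process discovers its externally-visible address the way VMs
# do on GCE/AWS, via the link-local metadata server. Function names are
# illustrative, not any real library's API.
import urllib.request

GCE_EXTERNAL_IP = (
    "http://169.254.169.254/computeMetadata/v1/"
    "instance/network-interfaces/0/access-configs/0/external-ip"
)
AWS_PUBLIC_IP = "http://169.254.169.254/latest/meta-data/public-ipv4"

def metadata_request(provider):
    """Build the metadata-server request for the instance's external IP."""
    if provider == "gce":
        # GCE requires this header so arbitrary HTTP clients can't read metadata.
        return urllib.request.Request(
            GCE_EXTERNAL_IP, headers={"Metadata-Flavor": "Google"}
        )
    return urllib.request.Request(AWS_PUBLIC_IP)

def external_ip(provider, timeout=2):
    """Fetch the external IP; only works on a real GCE/AWS instance."""
    with urllib.request.urlopen(metadata_request(provider), timeout=timeout) as resp:
        return resp.read().decode()
```

An analogous link-local endpoint (or the downward API) could communicate a service IP to a container.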

@smarterclayton I updated the mongo prototype PR: https://github.com/openshift/mongodb/pull/17

It is now using a one-shot Pod that initializes the replica set.

Re. tying this mechanism to replication/shard controller:

Tying DNS names to replication controllers and/or to pod object names forces an in-place update model. A completely different kind of "in-place" rolling update would need to be implemented, while somehow supporting multiple templates simultaneously with a single controller. It would also mean that in order to perform some kinds of updates (including moving to new hosts), pods would need to be deleted and then re-created with the same names, which would increase downtime compared to the approach where we add a new pod before taking an old one away.

The token pool idea sounds great. I just think it needs to be decoupled from RCs.

This could be coupled to a higher-level controller, if we were to add one. Will comment further on #1743 and/or #503.

This shouldn't really be coupled to the DeploymentController proposed in #1743. It wouldn't work for the scenario of multiple independent release tracks, nor in the case where someone wanted to control their rollouts with a different mechanism, such as the proposed name-swapping rolling update. For similar reasons, I would not tie the DNS names to pod/RC/DC object names.

So we're back to some flavor of service, or an entirely separate controller. Perhaps the endpoints controller could assign indices to service endpoints? Is Endpoints itself a reliable enough place to record this data? If it were lost, all pods would be re-indexed.

To facilitate communication of identity down to the containers in the pod (such as via env. var. substitution discussed in #2316), it would be useful to set a field on the pod when the role were assigned/unassigned, via a subresource. That could also address the durability issue.

We should reserve space in the DNS schema for nominal instances -- #6666.

I could buy that we could try using pod IPs and just remapping DNS for nominal instances, per https://github.com/GoogleCloudPlatform/kubernetes/issues/260#issuecomment-78071406.

Also, I removed this from the 1.0 milestone in Feb. (though not roadmap.md), but the discussion in #6393 suggests that it may be added back.

Suggested by @thockin: Reverse lookup using DNS (PTR records) looks like a reasonable way to go from pod IPs to nominal DNS names:
http://en.wikipedia.org/wiki/Reverse_DNS_lookup

Yeah, would make "who am i" really easy via standard name tools.
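A hedged sketch of what that "who am I" lookup could look like from inside a pod. The in-addr.arpa name construction is standard reverse-DNS mechanics; whether the `gethostbyaddr` call succeeds depends on the cluster DNS actually publishing PTR records:

```python
# Sketch: reverse-DNS "who am I" for a pod. Only ptr_name() is pure logic;
# who_am_i() needs a resolver that serves PTR records for pod IPs.
import socket

def ptr_name(ip):
    """Build the in-addr.arpa name that a PTR query for this IPv4 address uses."""
    return ".".join(reversed(ip.split("."))) + ".in-addr.arpa"

def who_am_i(pod_ip):
    """Return the canonical name for pod_ip, or None if no PTR record exists."""
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(pod_ip)
        return hostname
    except OSError:
        return None  # no PTR record published for this IP

# e.g. a pod at 10.244.1.7 (hypothetical address) would be queried as
# "7.1.244.10.in-addr.arpa"
```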


/subscribe

I am not sure I understand the requirements here, anymore.

There is an inherent N-to-M relationship between services and pods, by the
design of the system. Things like reverse DNS generally expect to get a
single response: given a pod's IP, get a canonical name. Every doc written
about DNS says "don't return multiple reverse records". Since a pod can be
in N services, this single response can not really be service-related. We
could do a reverse lookup to a canonical name and then a TXT lookup on that
name for "what services are you under", but that's hard to maintain and is
essentially a custom protocol anyway.

The reason I thought to attach nominal anything to RCs is that that is a
more concrete relationship. But even that isn't stable: a pod can be created
by one RC, orphaned, adopted by another RC, and destroyed by that one. Or
destroyed manually.

So I am not sure what we want to do with this, short of limiting the number
of services a pod can be "under", which sounds awful.

What are the behavioral requirements? What are we trying to achieve? We
can provide simple set semantics with DNS and headless services. Is that
sufficient? If not, why?

We could go to extremes and set up iptables rules to SNAT/DNAT pods in a
service so that they all see each other on VIPs. E.g. given a Set of pods
(P1, P2, P3) they would get VIPs (V1, V2, V3). Traffic from P1 to V2 would
appear to come from V1, etc. Clients would access V{1,2,3}. But what
problem does that solve, really? It gives stable IPs but isn't this a
general problem for headless services all over?


The goal was very concrete. Real clustered systems have members with an "identity" that is expected to be continuous in the presence of node failures. That identity might be attached to a persistent disk or a "name", but the name they identify as to other systems can't change.

Example: zookeeper has a per-node identity. That identity must remain stable in cluster membership - there _could_ be two pods that think they are host1, but if host1 goes away, host435 cannot replace it. There may be a reused persistent disk for that pod that moves from one place to another, but when it does, the pod has to be able to take on that identity. But we need a way to assign that number.

I suspect that the way I think about nominal services is always about enabling existing software with small (3/5/7) membership, not the totally flexible multi hundred use case. Perhaps we should split that use case from this discussion.


Some kind of controller could dynamically set/unset some field(s) on pods. It just can't be conflated with pod nor RC identity. Tying the network identity to pod or RC identities is totally broken.

I wouldn't try to make reverse DNS lookup work for ordinary or headless services, and I don't think it's unreasonable to impose a max of one nominal service per pod -- which doesn't have to limit other types of services at all.

There are a number of DNS limitations we'll want to push on in the long run (cache invalidation, support for long polling, etc.). Multiple PTR records is perhaps just another item added to that list.

The other alternative is that we can give a service IP to each pod of the nominal service and then solve the problems that creates.

A lot of these services expect that the IP they have (their name) resolves to the IP they connect on - so if they connect to X on 172.1.1.1, then X thinks it's 172.1.1.1. That's not all software, but some of it. Usually it's a DNS name though, which means the IP can change underneath.



I'm afraid this is all still a bit abstract for me.

Real clustered systems have members that have an "identity" which is
expected to be continuous in the presence of node failures. That identity
might be attached to a persistent disk or a "name", but the name the
identify to other systems cant change.

This is not a requirement. It's so vague as to be unimplementable.

Example, zookeeper has a per node identity. That identity must remain
stable in cluster membership

What is an identity? An IP address? A string flag passed to the software?
An env var? How does a zk process learn what its own identity is?

Some kind of controller could dynamically set/unset some field(s) on pods

Which will trigger those pods to restart, and still doesn't answer the
question of what fields, what values, and what logic to tell the pod?

I could see something like expanding annotations into commandline flags
(like we're discussing with env vars) or just into env vars. E.g. the
controller writes annotations["zookeeper.com/index"] = 9, and we convert
$.metadata["zookeeper.com/index"] into a ZK_INDEX env var. But I am making
this up; I have seen no concrete demands that say exactly what zookeeper (or
whatever) needs.

I don't think it's unreasonable to impose a max of one nominal service
per pod

I think it will be hard to implement such restrictions. The system is so
loosely coupled and async that imposing these limits might be worse than
just solving the problems.

we can give a service IP to each pod of the nominal service

This is where we started, but ...

A lot of these services expect that the IP they have (their name)
resolves to the IP they connect on - so if they connect to X on 172.1.1.1,
then X thinks it's 172.1.1.1.

Service IPs do not satisfy this, unless we do something more in-depth like
I mentioned with SNAT/DNAT. And even that has real shortcomings.

I'm not trying to be a pain in the neck, it's just that we have very little
time in the runup to 1.0, and there's nothing here that is clear enough to
even proof-of-concept, much less implement properly.

I am trying to pin down exactly what we want to experience so I can see
what we can implement. Given the repeated references to DNS, I am holding
off the DNS rework until I know what's going on here.


Most of the multi-port examples are clustered systems. For example:
ZK: resolvable hostname or IP address in a config file, plus an "id" (i.e., an index: 1, 2, 3, ...) in a file

We should also look through some DBs.

Whatever we do won't work for everything. We just need to find the sweet spot.
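To make the ZK case concrete, here is a sketch of the per-member files it expects: the full peer list in the config, plus this member's integer id in a `myid` file. The hostnames and file paths are hypothetical stand-ins for whatever stable per-pod names a nominal service would provide:

```python
# Sketch of ZooKeeper's membership inputs: a server.N line per peer in the
# config, and this member's own index in a separate "myid" file. Hostnames
# and paths are placeholders, not a real deployment.
def zk_server_lines(peers):
    """peers: ordered list of resolvable hostnames, one per member (1-based ids)."""
    return [
        f"server.{i}={host}:2888:3888"  # quorum port : leader-election port
        for i, host in enumerate(peers, start=1)
    ]

def write_member_config(peers, my_index, cfg_path="zoo.cfg", myid_path="myid"):
    """Write the peer list and this member's id to the given paths."""
    with open(cfg_path, "w") as f:
        f.write("\n".join(zk_server_lines(peers)) + "\n")
    with open(myid_path, "w") as f:
        f.write(str(my_index) + "\n")
```

The open question in this thread is where `peers` and `my_index` come from: a controller-assigned index plus stable DNS names would be enough to generate both files.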


I think that's wise; we should specifically devote time to solving a set of known examples. We have three going on our side that we can leverage as concrete examples (MongoDB replica set, Zookeeper, MySQL master/slave), as well as the existing examples in kube/examples. Perhaps a working session to hash out the items, set bounds on unsolvable problems, and identify what's left.

Suggest changing the name of this feature since it could also be used for batch jobs to assign a shard ID number.

Batch jobs have somewhat different requirements, so I'd like to keep that separate.

I'm kind of confused why the direction is still unclear here - particularly from the Googlers. Google has been running stateful services under Borg for many years, and it is quite clear what is required for this type of workload:

  • A stable identifier for which one of the set of "shards" (non-identical replicas) the current pod represents (in borg, this is the "task number" - a simple integer). This identifier should stay constant across pod reschedules/restarts.
  • A way to enumerate and contact all the "peer" shards, probably by using their stable identifiers in some way (in borg, this is a chubby prefix along with task number)

.. and we're done.

Note: If resource exclusion is critical for each shard, then the applications will need to do their own distributed locking since there's always the possibility that pod lifetimes can overlap during failures/restarts (perhaps using an etcd lock based on the shard identifier). A potential exotic feature extension is to allow more than one identical replica within each shard, for redundancy/load.

This can be faked right now by creating a unique service/port for each shard and running a replication controller with replicas:1 but it is a bit clumsy to manage large numbers of shards "by hand" like this.

A natural way to implement this in kubernetes might be:

  • Pods get additional environment variables giving their own shard index (integer) and the total number of shards (or perhaps communicate it via the downward API?).
  • ReplicationControllers get a "shards" count (default: 1), and "replicas" is reinterpreted to mean "replicas within each shard" (default: 1). When shrinking the set of replicas, they should kill from the end (to keep the shard indices contiguous). Note changing "shards" will require a rolling restart of the controlled pods to update their "total shards" env var (good, you don't want it happening instantaneously).
  • Services get a similar "shards" count that maps a contiguous range of ports to the regular selector plus an implicit "shard index" selector.
  • Pods can find other shards by using SERVICE_PORT (or some new env var?) + shard index offset.

Note the above gracefully degrades to the current behaviour when shards=1.
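A sketch of how a pod might consume the proposed shard information. SHARD_INDEX, SHARD_COUNT, and the port-offset scheme are hypothetical names taken from the proposal above, not an existing Kubernetes API:

```python
# Sketch of the sharding proposal from a pod's point of view: read the shard
# identity from env vars and derive peer addresses by port offset. All names
# here (SHARD_INDEX, SHARD_COUNT, the base port) are hypothetical.
import os

def my_shard():
    """Return (this pod's shard index, total number of shards)."""
    return int(os.environ["SHARD_INDEX"]), int(os.environ["SHARD_COUNT"])

def peer_ports(base_port, shard_count):
    """One service port per shard: base_port + shard index."""
    return [base_port + i for i in range(shard_count)]
```

With contiguous indices kept by killing from the end, a pod can enumerate every peer without any extra lookup service.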

I generally agree with this (and as you say, it has served the test of time in Borg), although I would advise against going with the "exotic feature extension" of multiple replicas per shard (though we probably need something like that under the covers for migration).

As I mentioned earlier this is closely related to what we need to do to support batch jobs with static work assignment (what I was calling "type 1" here: https://github.com/GoogleCloudPlatform/kubernetes/issues/1624#issuecomment-97622142 )

The other features we need to align if we changed RC (or added something new) is how deployment fits in, specifically:

How do I do a rolling update to:

  1. A regular RC
  2. A PerNodeController
  3. A sharded RC
  4. A batch job?

We might want to either hold implementing deployment until we have a straw man for each of those, because I believe a deployment that only works against a simple RC has some issues.


We have chosen not to prioritize this (and automatically generated persistent volume claims) for 1.0 because there are multiple workarounds, such as creating one service, RC, and persistent volume claim per instance.

While densely, statically assigned task indices in Borg are very widely used, we've learned that they have a number of shortcomings, and have spent some time over the past few years developing alternatives. Many of the problems stem from the fact that the indices are coupled to the lifecycle and identity of the tasks themselves. This makes it difficult to implement a number of deployment strategies, migration, dynamic task management, and many other scenarios. Other complexity stems from statically generating per-index configuration for each task pre-deployment -- this is tantamount to generating one RC per index value. That would be straightforward for someone to do if they really wanted to. Labels could still be used to tear down the whole set.

Per-node/daemon controller: Good point. Densely assigned indices are a bad model for that case. What do you do when a node permanently goes away, for example? Indices become sparse. I recommend we don't support that.

Batch jobs: As discussed in #1624, we'd want to assign indices based on completions, not currently running pods.

As discussed above, index assignment needs to take into account associated storage, such as persistent volume claims -- identity stems from network identity (DNS name and/or IP address) and storage.

Assignment can't be driven by a single replication controller. This simply does not work. Replication controllers aren't durable entities, and we expect there to be multiple RCs per service, for canaries, rolling updates, multiple release tracks, multiple clusters, etc. Even the proposed Deployment object (#1743) doesn't have the right scope.

There are 3 alternatives for assignment:

  1. Associate assignment with a special kind of Service. Service has exactly the right scope. (We'll also eventually need regional services.) This would be sort of a hybrid between regular services and headless services. It's what I envisioned when I originally proposed this mechanism. The assignment would be recorded in Endpoints. Endpoints and/or headless-like DNS would be the obvious ways to enumerate all the peers.
  2. Create a new type of object that's similar to Service, but just for nominal services. We'd likely need a new object to record the assignment, also. This would expand our API surface unnecessarily, IMO.
  3. "Pull" instead of "push". Pods are scheduled to hosts without an explicit controller, even with node constraints (selector) or one of the proposed anti-affinity mechanisms (https://github.com/GoogleCloudPlatform/kubernetes/issues/4301#issuecomment-74355529). This is also similar to service VIP assignment. We could do something similar for nominal services. A pod (or pod template) would specify the index pool from which it wanted an index assigned. Unlike general services, we don't expect pods to need to be assigned multiple indices from different pools. The assignment would be recorded in the pod spec.

     • Pros: Simpler for users. Doesn't require another object. Allows assignment by users.
     • Cons: Different from other kinds of services.

PVC assignment ideally would use the same model.

It's also worth considering how pod migration #3949 would be orchestrated. The replacement pod MUST be created prior to deleting the pod being replaced in order to transfer the container state. This might make the pull model a little problematic. Regardless, the allocation mechanism would need to be made migration-aware.

Other considerations:

  1. How to communicate the index/identity to peers. DNS is the obvious answer.
  2. How to communicate the index/identity to the containers in the pod. Environment variables, reverse DNS, ... The assigned index isn't going to change dynamically, though the DNS binding may be changed while the pod still exists. I'd like to choose a mechanism that applications already expect to work in other environments, and as with the broader downward API discussion (#386), I don't want to couple applications to Kubernetes-specific environment variables, but the new EnvVarSource and env var substitution (#7936) would help avoid that.

I disagree with some of what you said, but let's wait to continue this discussion until after 1.0.

Reviving this old thread to ask a question. Is there any interest in a replication controller naming policy? Specifically nominal naming as discussed above, where all the Pods controlled by this replication controller would have numbered suffixes, something like 0001, 0002 ....

An example use case is a nginx load balancer pointing to these set of pods by domain names. So as pods come and go, the domain names are expected to always be fixed from xxx-0001 to xxx-000N.

@ravigadde Please read my last comment on this issue: https://github.com/GoogleCloudPlatform/kubernetes/issues/260#issuecomment-102107352

Ran into this issue while trying to set up a RabbitMQ container. Rabbit's persistence depends on the hostname, so having variable hostnames means you lose the Mnesia database on container restart.

Tried to remedy this with image config (hostname directly and in Rabbit), env variables, and the downward API. None of these solved the problem - Rabbit still picks up the generated Pod name. Working around it temporarily by switching away from a replication controller, per @mikedanese's suggestion.

If I understand correctly, the rabbitmq pod (created with a replication controller) in the celery-rabbit example will lose data on pod failure even if data is stored on a persistent disk. From the rabbitmq doc:

RabbitMQ names the database directory using the current hostname of the system. If the hostname changes, a new empty database is created. To avoid data loss it's crucial to set up a fixed and resolvable hostname.

There is not a good solution for this now, but you could create a pod (not bound to an RC) with a migratable persistent disk, the caveat being that the pod will need to be manually rescheduled in certain failure cases. That's the only way I can think of to keep the hostname static.

Or on startup symlink the hostname dir to a stable location
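A rough sketch of that symlink workaround for RabbitMQ's hostname-derived Mnesia directory. The paths and directory layout are assumptions for illustration, not a tested recipe:

```python
# Sketch: at container startup, point the hostname-derived Mnesia directory
# at one stable directory on the persistent volume, so a pod with a new
# hostname reuses the old database. Paths are hypothetical.
import os
import socket

def link_hostname_dir(base, stable_name="stable"):
    """Symlink base/rabbit@<hostname> -> base/<stable_name>; return the link path."""
    stable_dir = os.path.join(base, stable_name)
    os.makedirs(stable_dir, exist_ok=True)
    hostname_dir = os.path.join(base, "rabbit@" + socket.gethostname())
    if not os.path.islink(hostname_dir) and not os.path.exists(hostname_dir):
        os.symlink(stable_dir, hostname_dir)
    return hostname_dir
```

This only papers over the symptom; the pod still needs a stable volume, which is the identity problem this thread is about.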


That's an example of why the hostname shouldn't be derived from the pod name -- #4825

To give this a slight kick:

There are several fundamental problems that need to be solved:

  1. Some components need to have a unique identity per pod that is tied to their storage on disk

    1. As a further wrinkle, some of those (zookeeper in some cases) need the hostname to be resolvable and match

  2. Some components that are scaled out need to have persistent storage that is different per pod (pods point to PVCs that differ per pod, tied back to identity)

    1. Sometimes those volumes should be created on demand on scale up

    2. Sometimes those volumes should be reused from a pool and not recreated

  3. Some components may scale out to tens of thousands of instances, where tracking individual allocated identities becomes impractical

    1. The majority of uses for nominal will probably be between 2-7 instances of a pod (most current scalable DBs, most sharded multi-master setups)

  4. Some components would like to not have to implement their own master election, but let the platform manage that by making one of the identities arbitrary (identity 1 becomes the master)
  5. When we solve these problems, users will still need to roll out changes to the pods (via deployments) and ensure that any identity is either preserved or reallocated.

These do not all necessarily need to be solved in the same solution. For instance, I think there is a significant difference between small-sized cluster identity (2-7) and large-scale cluster identity (>7). In the large case, that software is less concerned with gaps, or has an existing consensus / membership protocol. In the small case, the software needs more help in establishing identity. I would separate these into cloud-native (>7) and existing-clustered (2-7) software.

I agree with 1a, 1b, and 2a. 2b sounds like a different problem, though perhaps the solution can reuse the same pattern.

I think scale (3) mainly suggests that our workaround of one service and RC per instance isn't adequate, though I agree with the distinction between cloud-native vs. not.

Master election (4) can be addressed separately.

Also agree with (5).

I think most of the design requirements are clear at this point.

Remaining design questions:

I. Allocate VIPs or allow IPs to change? Closely tied to this is whether containers need to be able to discover their own IPs via Linux or not, since VIPs currently are not discoverable via the system. I assume they do need to be able to discover their IPs, but could use the downward API, as with pod IP (#12595). Allowing IPs to change (due to using pod IPs) implies a limit to the rate of change of the IP, due to DNS TTLs and caching "bugs". At some point, we also intend to make pod IPs migratable, though (#3949), so changing IPs wouldn't be forever.

II. Push (to pods) vs. pull (from pods). Services and RCs are intentionally loosely coupled to pods, and therefore both use the "push" model. Nominal services would be more intrinsically tied to pod identity (though not permanently -- pods must be replaceable). Consequently, I see less motivation for the push model. Other cases of allocation, such as scheduling, and esp. exclusive binding, such as persistent volume claims to persistent volumes, use a request/ack model, aka "pull". That is currently the model I favor for nominal services.

Anyone have opinions on (I)?

Master election is #1542, and is being discussed as part of the HA implementation (e.g., #12884).

I don't know what you mean by push and pull.

I've re-read most of this issue, and I am convinced there is not a single
solution. We're going to need a family of features to build a bridge to
these sorts of apps.

I'm starting with the axiom that once a pod is running you can not really
change a running pod. There are some exceptions to this (e.g. virtual
filesystem contents) but the things that seem to matter (env vars, IP,
hostname) you get stuck with whatever you start with.

I'm also starting with the assertion that the loose coupling between
Service and Pod stays, making this thing we are talking about not really a
Service. Pods do not really know what Services front them. If we change
that, it's not a Service any longer.

I'm just going to do a stream-of-consciousness and see what doesn't stink.

Idea 1: ThingPools and Patches.

Define a new API object or something that lets you define a pool of
things. What is a thing? A Thing is a string (or maybe a JSON blob) that
has an enumerated type. You can create a pool of Things and those Things
are yours to use. Thing types include VUIDs (hostnames), strings, VIPs.

Define a new API construct which can be passed to create operations - a
patch. A Patch is instructions on how to patch the object that is being
created. One of the patch options is "allocate from a ThingPool".

To put these together:

Define a ThingPool { metadata.name: my-quorum-hostnames, type: VUID,
autogenerate: 5, } // creates a pool of 5 valid VUIDs

Define an RC { replicas: 5 ...}

In the RC's create (POST) also send a Patch: { what:
"podTemplate.spec.containers[*].metadata.VUID", pool: "my-quorum-hostnames"
}

The POST operation would apply the patch to the RC - allocating one VUID
per container from the ThingPool. Any time a pod is killed and recreated,
the VUID is returned to the pool and the next pod to be started will get it.

You could use this to generate a pool of strings ("0" to "99") and stick
those into an env var. You could use this to allocate a pool of VIPs and
then assign those VIPs to the pods (would be a new feature - durable pod
IPs - not sure how this will scale :) You could generate a pool of
PersistentVolumeClaim names and patch the claim volume that each pod uses.

This idea is imperfect in many many ways, but I think it best captures the
idea of a set of pods without outright defining a set of pods.
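Spelled out a bit more, purely hypothetically (neither a ThingPool kind nor the patch construct exists in the API today):

```yaml
# Hypothetical: a pool of 5 valid VUIDs (hostnames)
kind: ThingPool
metadata:
  name: my-quorum-hostnames
spec:
  type: VUID
  autogenerate: 5
---
# Hypothetical: a Patch sent along with the RC create (POST), instructing
# the server to allocate one VUID per pod from the pool at creation time
kind: Patch
what: podTemplate.spec.containers[*].metadata.VUID
pool: my-quorum-hostnames
```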

Idea 2: Define a set of pods. Don't pretend it's a service. It's closer
to a Borg Job. It's like an RC but it assigns ordinality to the pods it
controls - shard numbers. It controls the pool of VUIDs (but we don't want
hostname to be something users can set, hmmm...).

I thought I had more ideas, but I don't. I'm wrestling with abstractions
and implementations - it's pointless to define an abstraction we can't
implement. VIPs for nominals sounds great, but I think that will push the
limits of iptables. Providing a set of stable hostnames for a set of pods
seems to be the most important thing, with a set of storage for a set of
pods hot on its tail.


I probably didn't truly mean master election as you think of it - more that
when building out functionality where a set of instances need to initialize
(without explicitly coordinating) having only the instance that thinks it's
"1" talk to other pods is usually sufficient to bootstrap a cluster. An
example is mongodb where you need to initiate the replica set - if pods
that think they are "2" or "3" initiate, you can get in cases where you
initiate a split. But "1" can safely initiate itself each time, and then
try to add the other members (which have persistent state they can use to
determine whether they are already part of a cluster). Depending on the
guarantees provided by the identity service you may not actually get a
successful cluster, but you don't have to create a separate pod outside to
initialize your service (although that's not terrible either).


Clayton Coleman | Lead Engineer, OpenShift

@smarterclayton It's an aside to the nominal services issue, but it isn't safe to bootstrap as you're describing. If "1" is restarted on a new machine and happens to be unable to reach the existing nodes, then it will re-bootstrap and now you have two mongodb's claiming to be authoritative. You need to change the job definition to remove the ability to bootstrap after bootstrapping is complete.

As @thockin might be alluding to, I think we're unnecessarily conflating several similar-but-different features here - and we're including more than is strictly necessary at the k8s level.

I see two quite different use cases described above - differing by source of truth:

_Prescriptive:_ "I want N shards running, and if there are more or less than that, then make changes to bring it back to N."
_Descriptive:_ "I want to just auto-discover all the available shards, and let me find out somehow as they come and go."

I think these are different and should be described differently (although they might present similar metadata to the actual running pods).


The discussion has also incorporated resource fencing and single-access guarantees. Afaics, the _only_ case where this needs to be done within k8s is mounting remote persistent volumes (because the fs mount needs to be done outside the container), and this is pretty easy to do via etcd locks if the underlying remote volume service doesn't already provide fencing internally. All the other cases can be handled by the user jobs themselves, using a distributed lock service (or accessing services that provide locking internally). Asking the jobs to do their own locking opens up all sorts of patterns (fixed assignment, opportunistic assignment, "hot spares", etc) and unreachability/recovery strategies (hard fail, continue-with-last-known, etc).

I suggest moving the general resource fencing feature-request out of this bug (and we implement fencing simply by providing a recipe for running various distributed locking services on k8s with no further involvement by k8s itself).


Just a reminder: there are a _lot_ of ports behind a single VIP. I agree we don't want to migrate IPs since that sounds hard on the underlying network at scale, and awkward for the network to deal with temporary duplicates during failure/recovery edge cases. However, I think it _is_ quite feasible to assign a unique service port to each shard member and proxy data to each pod instance - even for very large jobs. (This suggestion is assuming the proxies are still the direction we want to go with k8s - I haven't kept up with any recent discussion around this area)

By push, I meant another resource, like a Service or controller, would monitor sets of pods via a label selector and bestow "things" (resolvable DNS names, VIPs, PVCs) upon them.

By pull, I meant that the pod would explicitly request allocation of a "thing", which would then be bound to it via something like /binding.

Regarding changing running pods: Changing a pod after creation is different than changing a pod with running containers. The scheduler performs async. initialization, for instance. PodIP is assigned late, also. Discussed more generally in #3585.

Regarding whether this is really a "service" or not:

  • Things need to be allocated across sets of pods
  • Thing (e.g., DNS names) must not be intrinsically tied to a pod's resource name
  • A thing needs to be transferrable from one pod to another
  • The pod needs to "know" (or be able to find out via downward API) what thing (esp. DNS name and VIP) it was assigned.
  • One thing we need to support is resolvable DNS
  • Whether we need VIPs was my one outstanding question. I'm fine with not using VIPs.
  • It's not ideal, but the user could create a second, headless service targeting the whole set for peer discovery via DNS and/or Endpoints.
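For reference, that headless-service workaround for peer discovery over the whole set is just this (names and ports illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: zk-peers
spec:
  clusterIP: None     # headless: no VIP; DNS returns the pod endpoints
  selector:
    app: zookeeper
  ports:
  - name: peer
    port: 2888
```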

Idea (1), ThingPools + Patches is an interesting idea. I would describe it as a "pull" approach. ThingPool isn't bad, but I'm concerned that the Patches would be hard to use, and I wouldn't want to reverse-engineer semantic information from the patches inside the system to provide acceptable behavior around storage lifecycle, etc. I'd prefer something more domain-specific. Also, independent ThingPools for DNS names and PVCs wouldn't work, since they need to be co-allocated.

I'd describe idea (2) as a "push" approach. I don't want to duplicate work on RCs and Deployments, so we shouldn't introduce another pod controller. An assigner would be ok, but is more loosely coupled than necessary and doesn't match precedent for how we handle assignment.

Re. stuffing info into env vars: This has come up for Job, also. I want to stuff the info into an annotation and extend the downward API to be able to request specific annotations. I want to continue to permit users to request only the info they need and to choose env var names.

Re. assigning different ports to each instance: That wouldn't work for many legacy apps, and also isn't compatible with our system design.

Re. prescriptive: Keeping N running is what ReplicationController does.

Re. descriptive: That's headless services, which already exist.

Re. locking: Yes, that's the lease API that's underway.

Storage provisioning is being discussed in #6773 (push) and #12450 (pull).

It's too late now, but I'll somehow try to make time to make a proposal after some sleep.

Quick observation: headless services don't allocate VIPs

We could decouple allocation/assignment from DNS publication. Ignoring how allocation/assignment happens for the moment, we could have a special flavor of headless service that took names from some field on the pods representing hostname.

And, the durable local storage issues are relevant: #7562, #1515, #598

More quick updates as I mostly do other PR reviews:

One advantage of using a service is that we wouldn't need to guarantee that hostnames requested for pods were globally unique, since the service would scope the names published to DNS.

Re. allocation/assignment: I want to avoid adding more template types to ReplicationController, Deployment, Daemon, etc., and I want storage and names to be able to migrate across controllers as easily as across pods (if that's what the user wants).

Re @anguslees's point that it isn't safe to bootstrap that way: it's entirely possible you would have overlap in that case - but in most cases we are talking about persistent storage with locks, so you'd block anyway until GCE/AWS released your fence and you were able to attach. So you sacrifice availability on bootstrapping but not necessarily at runtime (unless you are on unfenced storage, in which case I agree this would not be safe).

Re @anguslees's suggestion to move resource fencing out of this issue: I don't think we can split the objective of running certain types of software out of this - the design says nominal services, but the use cases we are describing are reusing identity and disk. I agree fencing is a property of volumes - but tying the acquisition of a volume to a certain property of the pod instance is not.

Slight twist to @thockin's ThingPool + Patches idea:

Template: list of resources + parameters and substitution. See https://github.com/openshift/origin/blob/master/pkg/template/api/types.go for inspiration. Ref #503, #1743, #4210, #6487.

TemplatePool. Basically ThingPool. Generates densely indexed instances of Template on demand instead of by replicas count.

For fun/consistency: v2 ReplicationController could replicate arbitrary Templates instead of just pods. ref #170 (sort of). Maybe not necessary if all allocation is driven by pod replication.

The main alternative would be something more domain-specific, focused on hostname and PVC pools, but separate pools for each wouldn't work. A single pool would need to be able to allocate tuples of hostnames and (potentially multiple) PVCs. One advantage of the TemplatePool is that someone could use it to allocate one service (perhaps external) per replica, automating the current Zookeeper workaround. Don't know if there might be other resources someone would want to similarly replicate 1-to-1 with pods, but given the rate at which our API is growing, I wouldn't bet there won't be.

@smarterclayton FYI when you reply via email with inline comments, your responses are unreadable, both in gmail and on github. In gmail, there are no line breaks and in github there are no quoting prefixes, so your response is hard to tease apart from the text being quoted.

Yeah, there's something wrong with gmail + github + iPhones.


We had a quick chat today, some highlights:

Talked about the three classes of apps:

  • Apps that are stateless web apps (just works with RCs)
  • Giant-scale apps that use external masters and have existing coherence
    protocols (Cassandra, Infiniband) to handle consensus
  • Small clustered stateful software like MongoDB, RethinkDB, Volt, etcd,
    ZooKeeper, Riak, Redis clusters, which need a few patterns to work well

    • Unique identity (usually)

    • Unique persistent volume per instance that is reused (usually)

    • Stable IP address or DNS name (commonly) - a DNS name is usually a
      stand-in for a stable IP address

    • Stable hostname (rarely)

Today, you can solve unique identity and persistent volume by having N
replication controllers, and stable IP addresses with N services. We would
like to not require N RCs and services for some of those use cases.
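For one instance i of N, that workaround looks roughly like the following, repeated N times (images, names, and the disk are illustrative):

```yaml
# Single-replica RC pinned to its own persistent disk...
apiVersion: v1
kind: ReplicationController
metadata:
  name: db-1
spec:
  replicas: 1
  selector:
    app: db
    instance: "1"
  template:
    metadata:
      labels:
        app: db
        instance: "1"
    spec:
      containers:
      - name: db
        image: mongo:3   # illustrative
        volumeMounts:
        - name: data
          mountPath: /data/db
      volumes:
      - name: data
        gcePersistentDisk:
          pdName: db-disk-1   # assumed pre-created disk
---
# ...plus a service per instance for a stable address.
apiVersion: v1
kind: Service
metadata:
  name: db-1
spec:
  selector:
    app: db
    instance: "1"
  ports:
  - port: 27017
```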

We talked about Tim and Brian's suggestions, in no particular order:

  • TemplatePool is problematic because the replicator needs to have the
    authority to create objects across the entire cluster, and has to know
    where every api endpoint is.

    • You could have an instantiate-template endpoint, but you still need to
      solve the authority delegation problem (with service accounts?)

    • Locating objects and restricting permission are longer-term things

  • The "pod identity" aka vuid field we talked about - the argument was made,
    and generally agreed with, that it does not need to be globally unique,
    only unique within the domain in which the pod expects to use it.

    • A hypothetical index assigner (sparse or dense) could set that value
      post creation, and the kubelet could wait to start pods until they had
      an identity (may need to indicate they are waiting for one)

    • Dense indexes are generally useful for simple things, but because this
      is post creation and decoupled from the kubelet you could have
      different types of assignment algorithms (dense numeric, consistent
      hash ring, random, etc)

  • Simple templating on pods seems like a good place to start, but we need
    to make it consistent with our existing tool chain in a way that doesn't
    totally break it

    • Use the identity field to templatize other fields

  • Add a hostname field on the pod spec that users could set themselves and
    parameterize with the identity
  • Allow a headless service to respond to DNS queries for endpoints by their
    identity (have the endpoints controller materialize the identity from a
    pod into endpoints), which guarantees some stability
  • Somehow allow persistent volume claims to be parameterized, or altered to
    draw from a pool (or have a pool controller ensure there is one volume per
    identity, or have the identity controller draw from pools), needs thought
  • Need tools that make it easy to turn environment / downward API into
    config files (orthogonal, but should be easier)
  • Still some concerns about the pattern


One more issue: TLS certs for members of a nominal services

TLS certs for services in general would also be a good thing (signed
by the cluster CA automatically on request).


Yes: #11725

Re:

  • Add a hostname field on the pod spec that users could set themselves and
    parameterize with the identity
  • Allow a headless service to respond to dns queries for endpoints by their
    identity (have the endpoints controller materialize the identity from a pod
    into endpoints) that guarantees some stability

I believe that we agreed that dns should be supported using the hostnames set on the pod specs (which might be parameterized by their identities once we add that functionality), such that the hostnames seen by the containers would be resolvable by other containers.

In order to set the subdomain, we'd have to tell the pod what service targeted it, explicitly or implicitly. The disadvantage of explicit is that it requires more configuration glue / templating. The disadvantage of implicit is that it would be prone to the challenges we have with loose coupling in the system: creation-order races and inconsistency, 1-to-1 enforcement difficulties, etc. I lean towards explicit. We should just work on improving cross-reference scoping.

Something like:

Hostname string
Subdomain string

above HostNetwork in the PodSpec.

We'd only deliver the Hostname-based DNS if the Subdomain matched that for a service of the right type. We could also smoosh the two into a single field.
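Sketched against those proposed fields (which do not exist in the API yet; this is the shape under discussion, not a working spec):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod-1bd3
spec:
  hostname: foo       # proposed field: what the container sees as its hostname
  subdomain: bar      # proposed field: must match a service of the right type
                      # for the hostname-based DNS record to be published
  containers:
  - name: app
    image: nginx      # illustrative
```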

I think we should re-purpose issue #4825 for that and get someone working on it. It's far more concrete and less work than the rest.

Somehow allow persistent volume claims to be parameterized, or altered to
draw from a pool (or have a pool controller ensure there is one volume per
identity, or have the identity controller draw from pools), needs thought

Two things about claims I think can help here:

  1. Change binding from 1:1 to 1:N

    1. pvc.Spec.VolumeName to pvc.Spec.VolumeNames[] as the binding reference

  2. pvc.Spec.Singleton boolean that denotes whether a claim must be bound exactly once, or may be bound many times for sharing.

An RC can bind to new volumes as needed. I'd figure out how to handle this in the binder.

It would still be important to match the correct volume to the right pod.
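A rough sketch of what those proposed claim changes might look like (neither field exists today; names are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  singleton: false    # proposed: false allows the claim to be bound many times
  volumeNames:        # proposed: 1:N binding references (was spec.volumeName)
  - pv-db-0
  - pv-db-1
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```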

You'd have to have a map of identities to volume names.


On Fri, Aug 21, 2015 at 12:56 AM, Brian Grant [email protected] wrote:

In order to set the subdomain, we'd have to tell the pod what service targeted it,
explicitly or implicitly. The disadvantage of explicit is that it requires more
configuration glue / templating. The disadvantage of implicit is that it would be
prone to the challenges we have with loose coupling in the system: creation-
order races and inconsistency, 1-to-1 enforcement difficulties, etc. I lean
towards explicit. We should just work on improving cross-reference scoping.

Let's be clear - if we DON'T set subdomain, we can't let users provide
their own hostnames. If we DO set subdomain, we have to do so in a
way that is unique-enough (in DNS space, which currently means across
the namespace). Making the subdomain be the name of a Service is
unique enough.

Given some of the limitations of our DNS (a single TLD), this
realistically means we give the users two DNS_LABEL fields - hostname
and nominal-set. For a concrete example:

pod.spec.metadata.name: my-pod-1bd3
pod.spec.metadata.namespace: my-app
pod.spec.identity.name: foo
pod.spec.identity.surname: bar

This would result in the pod seeing:

$ hostname
foo

$ hostname -f
foo.bar.my-app.nom.cluster.local

and DNS serving A and PTR records for foo.bar.my-app.nom.cluster.local

This is one level deeper than our current DNS structure, but I think
our current ndots=5 covers this. Alternately, we could say that a
namespace IS the surname, and put it back on users to assign unique
hostnames in a namespace. The DNS result would be
foo.my-app.pod.cluster.local which mirrors Services. We could even
make them more alike by adding an optional hostname field to Services.

We'd only deliver the Hostname-based DNS if the Subdomain matched that for
a service of the right type. We could also smoosh the two into a single field.

My least favorite part of this is that the Pod is declaring "I am part
of Service 'bar'" (in the above example). Up front. A priori.

The slippery slope is why not allow this in general? It would
simplify a bunch of things that can't make assumptions today (because
of loose coupling) even though in practice we rarely see a single Pod
behind more than one Service. If every Pod was under zero-or-one
Service, does the system get simpler?

I think we should re-purpose issue #4825 for that and get someone working on it. It's far more concrete and less work than the rest.

It's certainly something that could start right away, but (for those
following along at home) it is just the mechanism for implementing pod
identity, not the policy of managing sets of identities.

On Fri, Aug 21, 2015 at 3:17 PM, Mark Turansky [email protected] wrote:

Two things about claims I think can help here:

  1. Change binding from 1:1 to 1:N

This is where I started to, but...

On Fri, Aug 21, 2015 at 12:38 PM, Clayton Coleman
[email protected] wrote:

You'd have to have a map of identities to volume names.

Yep. This is really just turning out to be a templating problem.
Whether you think of the problem as a "pod template" in which some
fields have placeholders that are filled in by a "substitution tuple"
drawn from a pool at runtime, or you think of it as a fully reified pod
spec that is "patched up" with a "substitution tuple" drawn from a pool
at runtime - they are isomorphic and equally unpleasant.

And yet, I think this is one of the larger problems we need to solve.
Should we start to work out straw-man proposals for these?

We have several examples of pods being under two or three services, one for
spreading, one for exposure, and one for testing exposure.

Regarding identity, each pod created by a daemon controller should have an identity assigned that is derived from the node identity and the controller identity. This may be a case where identity is obvious (pod x on host y has identity f(y)) and part of the daemon's responsibility.

@smarterclayton so a common (but not universal) need here is to maintain a stable, ordered list of group members. How might that work when the identity is tied to the node that the pod happens to be scheduled on?

Daemon controller _only_ has identity in the context of the node the pod is
run on. Replication controllers and services are different. Daemon
controller says "there should be one of these running on each node". There
is no other identity that pod can have except its node.

On Fri, Sep 11, 2015 at 1:13 AM, Angus Lees [email protected]
wrote:

@smarterclayton https://github.com/smarterclayton so a common (but not
universal) need here is to maintain a stable, ordered list of group
members. I think that implies the identity should _not_ be tied to the
node that the pod happens to be scheduled on.



Clayton Coleman | Lead Engineer, OpenShift

@smarterclayton ah, this issue has clearly morphed into something more than the original feature request - I'm lost on what we're trying to build here, so I'll back off rather than continue to add noise...

Sorry, didn't mean to imply the point wasn't relevant. The challenge is that we need to define sets and propagate that into the pod. We need the pod to be reactive to the membership in the set (mount volume X that matches pod identity in the set). There is an open question - could a pod be a member of two sets simultaneously and have a unique identity in either?

It may be the case that pods run by a daemon controller have an identity that is "obvious" (the daemon controller operates on a set of nodes, the RC operates on a set of pods).

One requirement of identity is that it be unique in the set. Could we leverage pod names and pod name generation to ensure that uniqueness? For instance, RCs generate random names. Could an RC generate names for the pods in a deterministic fashion based on the set of pods it selects?

Consider an RC that selects pods with label a=b. In order to calculate the next pod to create, it must fetch the pods and reduce them to a count. What if we instead reduced them to a unique name set, and had the RC follow a name-generation pattern within that set? Two RCs that don't overlap but create pods under the same service label would be unable to generate unique names (they may fight for a name, but that conflict is implicit). Names can be reused after they are deleted.

Then persistent volumes in the pod could use PVCs based on the pod name. A PVC pooler could ensure there are enough claims for an observed pool of RCs. The daemon controller can pick names deterministically from a node name. We can also wire DNS names to pod names in a service based on some pattern.

This feels more parsimonious than adding a distinct identity field on the pod. It means that pods still aren't pets, but their names can be (which is not an unreasonable statement). We can't prevent two instances of the same pod process running on the cluster at the same time, but we can make a hard guarantee (and do) that the name is only in use at one spot today.

Meant "observed pool of names" above. Endpoint DNS would be hashpodname.ept.service.namespace.svc.cluster.local. We already have the pod name on the endpoint, so we don't even have to add anything.
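That reduce-to-a-name-set idea can be sketched as follows (a hypothetical helper for illustration, not actual controller code):

```python
def next_pod_name(prefix, existing_names):
    """Pick the lowest index not already in use, so deleted names are
    reused and two controllers racing for the same name conflict on
    create rather than both succeeding."""
    taken = {n for n in existing_names if n.startswith(prefix + "-")}
    i = 0
    while f"{prefix}-{i}" in taken:
        i += 1
    return f"{prefix}-{i}"

next_pod_name("etcd", {"etcd-0", "etcd-2"})  # 'etcd-1'
```

The API server's uniqueness guarantee on pod names is what makes the race between two such controllers safe: the loser of a create race just retries with the next free index.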

related to https://github.com/kubernetes/contrib/tree/master/service-loadbalancer which proposes resolution of this issue as a possible next iteration

Why are we discussing DaemonSet here? That was solved by #12893, IMO.

@smarterclayton Let's not revisit deriving identity from pod names, nor using RCs to assign identity. Problems with those approaches were discussed here: https://github.com/kubernetes/kubernetes/issues/260#issuecomment-102107352

One possibility would be to have an RC option that allows you to generate names the way @smarterclayton described if you want, and distinct sets of features available to you depending on how you set the option. So the things that are hard to implement with a "task index" approach wouldn't be available if you enable the option, but the feature described here would be. And conversely the default behavior (option disabled) would give you those other features but not the one described here.

It seems that the utility of something like the Borg task index abstraction keeps coming up _for certain use cases_ and it's not clear to me why we can't offer it as an option.

DaemonSet I don't think solves the distributed filesystem storage case
(Gluster) where each host needs to have a consistent identity for rejoining
the cluster. I'll dig more but I think it may still need something along
the needs of #260 - I don't know if hostname is sufficient if you reattach
a host disk on a new instance. Maybe I didn't understand what DaemonSet
provided and #12893 delivered, but I don't think that's the problem for a
system host that wants to reuse disk.


As a concrete example, imagine an etcd cluster. It has 7 instances, and 7
volumes. I want (for the sake of argument) the names and volumes etcd-1
through etcd-7 and etcd-pvc-1 through etcd-pvc-7.

A rolling deployment has MaxUnavailable 90% (or equiv), MaxSurge 0%. It
scales down the old RC, and scales up the new RC. The old RC gracefully
deletes etcd-5. It instantly passes out of the effective RC set. The new
RC tries to create etcd-5 (by observation of the set gap). It receives an
"AlreadyExistsException". It treats that as backoff and requeues. Once
etcd-5 is truly gone, it is able to proceed. Once etcd-5 is ready, the DC
moves to the next step.

For the alternative - identity set on pod - the old RC is scaled down. A
new pod is created instantly with a new name. The new pod can't launch
because it needs identity set to figure out which PVC to mount for its
volume. It goes into a wait loop on the Kubelet. The identity controller
watches some label selector set and observes identity etcd-5 is not
allocated. It sets identity etcd-5 onto the new pod. The kubelet observes
that change and then can figure out which PVC to mount, starts the pod
(assuming the volume has been detached from the other node).

Structurally, these are very similar. The latter allows start overlap.
Both require mapping the PVC.
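A minimal sketch of the first alternative's control loop, assuming a create call that fails while the old pod still holds the name (the exception and helper names here are illustrative, not real client-library API):

```python
class AlreadyExists(Exception):
    """Stand-in for the API server's AlreadyExists error."""

def reconcile(desired_names, live_names, create):
    """One pass of the gap-filling loop: create pods for the observed
    gaps in the set, treating AlreadyExists (the name is still held by
    a gracefully terminating pod) as backoff-and-requeue."""
    requeue = []
    for name in sorted(set(desired_names) - set(live_names)):
        try:
            create(name)
        except AlreadyExists:
            requeue.append(name)  # retry once the old pod is truly gone
    return requeue
```

The identity-controller alternative moves this same matching step out of the create path and into an asynchronous label/field assignment, which is what allows the start overlap noted above.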


The thing that makes this problem hard, to me, is not generating a stable unique pool of names or using a pool of PDs. The problem is doing BOTH at the same time. And if you extend that out, why can't just about any identifier in the spec be pooled like this?

As much as I know (mostly second hand) the issues with packed indices, I think they might be livable (especially if we do as David suggests and make them mutually exclusive with other features). But also, the more I stew on it, the less I hate the Patch idea some moron posted up-thread.

This issue has lingered and lingered and I am afraid we've got analysis paralysis.

I wish we had more examples of other pooling use cases. I thought Gluster
would be different, but it's just another variant on the
cassandra/etcd/mongo/zookeeper "I want to reuse a volume and I need to
remember who I said I was before". We haven't really had a role-based
example (I want to make the first N pods coordinators and the next M
workers) to leverage.


DaemonSet:

#12893 set the pod hostname to the node's hostname when using host networking, which is totally reasonable for DaemonSet, since those pods would otherwise use hostPort anyway.

Also, the intent is to integrate DaemonSet with the node controller in order to provide implicit indefinite forgiveness #1574, and to enable "scheduling" prior to scheduling being enabled for nodes in general. Implicit prioritization may also be reasonable.

For a cluster storage system, I could imagine provisioning additional disks on each host that are accessed via hostPath only by the storage system -- they wouldn't be used by Docker nor by emptyDir.

With respect to identity, I don't see a problem with the proposal here: https://github.com/kubernetes/kubernetes/issues/260#issuecomment-133318903

This issue has lingered mainly due to lack of sufficient priority, IMO.

To riff on Clayton (or maybe just to say the same things but dumber):

$ kubectl create -f -
kind: IdentityPool
apiVersion: v1
metadata:
  name: my-etcd-idents
  namespace: default
spec:
  identities:
    - etcd-0
    - etcd-1
    - etcd-2
^D

$ kubectl create -f -
kind: VolumePool
apiVersion: v1
metadata:
  name: my-etcd-volumes
  namespace: default
spec:
  volumes:
      - etcd-disk-0
      - etcd-disk-1
      - etcd-disk-2
^D

$ kubectl create -f -
kind: ReplicationController
apiVersion: v1
metadata:
  name: my-etcd-rc
  namespace: default
spec:
  selector:
    job: etcd
  identityPoolName: my-etcd-idents
  podTemplate:
    containers:
        - name: etcd
          image: coreos/etcd:2.0
          volumeMounts:
              - name: data
                path: /data
    volumes:
        - name: data
          identity:
              volumePoolName: my-etcd-volumes
^D

The implication is that the RC (or something) realizes this is an identity set, assigns an index from the identityPool, assigns that identity to the pod name, and then peeks inside to see if any other identity-centric fields are set - the identity volume in this case.

It's not totally generic (you can't pool any random field), but we can add more identity-aware options as needed.
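The index-assignment step the RC would perform could be sketched like this (field and pool names follow the example above and are not a proposed API):

```python
def assign_identity(pod_index, identities, volumes):
    """Use one primary index to select from every pool-centric field,
    so the identity name and its volume always travel together."""
    if pod_index >= len(identities) or pod_index >= len(volumes):
        raise IndexError("identity/volume pool exhausted")
    return {"name": identities[pod_index], "volume": volumes[pod_index]}

assign_identity(1,
                ["etcd-0", "etcd-1", "etcd-2"],
                ["etcd-disk-0", "etcd-disk-1", "etcd-disk-2"])
# {'name': 'etcd-1', 'volume': 'etcd-disk-1'}
```

Scaling beyond the pool size has to fail (or block) rather than invent new identities, which is one reason a single primary index is simpler than independently allocated pools.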

I have a feeling that if we nailed down a spec we liked on this, it might get done sooner than later.

Even writing a spec takes time.

With respect to IdentityPool/VolumePool, we already decided a long time ago that multiple pools won't work.

The 1.1 code-complete deadline is Friday. Could we please not do this now?

Also related: #170. That proposed to just split out the pod template, but if we go for a templating scheme (which I strongly prefer over patch), then we should consider them together.

I put at least the design on v1.2-candidate.

Sorry I wasn't clear in my sketch - the RC (or IC?) has a concept of
primary pool or index. Once a pod is assigned an index in the IC, that
index is used to index into all pool-centric fields.

If we split the template out, I assume the assumption is that the RC can't
mutate the template. That is a bit trickier.

I'm fine to stop talking about this this week. It just popped into my
inbox and I had been chewing on this idea...


Yeah there were just things that triggered discussions on our end.
Customer problems wait for no release process.


The discussion is all about IP addresses and Pods, but in my experience with Mongo & Zookeeper the IPs of the Pods should stay irrelevant (the Pods should not become pets). The persistent _volume_ needs a nominal number, since this volume records the IP addresses of the other 'volumes'. Whatever Pod mounts that volume should be able to use that recorded IP address somehow. The volume is the pet ...

A DNS name that is constant in time for a volume and assigned to whichever Pod mounts the volume would go a long way, I think.

Changing ensemble membership in Mongo & ZK will always require custom code to run, and I expect the same for most other ensembles. So Replication Controller is the wrong concept; these pets need more of a Membership controller. A Membership controller should be able to handle the initial setup, then incremental changes to the ensemble, and finally tear it down.

Given a constant DNS name based on the mounted volume, plus the possibility of handling ensemble membership with custom code, it should be possible to handle these types of systems, I think.

Yes, pod IP is more of a short term hack.

Where this proposal is going is something above DNS and volume that both
can leverage to stay coupled. The concern is that volume defining DNS name
is the reverse of the coupling we have already established, and we also
want to let other parts of the pod spec leverage that higher concept
(identity) like the downward API on environment. That makes adapting
identity to different ensemble software easier (reduces the custom code per
type).

We need to still ensure there is not a use case beyond ensembles that
overlaps here (haven't heard one suggested) and then try to get to a
concrete proposal that feels minimal and appropriate. Tim I think has
summarized that most recently here.


/ping

You can find an example of setting up a ZK cluster under Kubernetes using the existing capabilities at https://hub.docker.com/r/elevy/zookeeper/. The primary requirement is stable IP addresses and hostnames for the ensemble members. This is accomplished using individual Services for each cluster member. The only wrinkle is that by default ZK will attempt to bind to the IP its hostname resolves to, but this can be avoided through a config parameter.

@smarterclayton @thockin

Thoughts while biking to the office a few days ago:

I'm currently thinking of this as a PetSet controller, or maybe ShardSet controller, if people want to think about it that way, as discussed back at the beginning: https://github.com/kubernetes/kubernetes/issues/260#issuecomment-57671926.

The case we're trying to address is where there are N similar pets, but they are still pets -- they aren't fungible.

I still think tying this to ReplicationController can't work, in part because ReplicationController has only a single template. PetSet/ShardSet needs to be a higher-level abstraction, more like Deployment or DaemonSet. PetSet could potentially generate one PodTemplate object per instance.

I don't think we can just reuse the Deployment API. Among other things, the rolling update rollout strategy now in kubectl and Deployment won't work for this case. Most pets probably would want Recreate updates, and any that wanted rolling updates would not be able to handle any surge above their shard count and would be particular about the update order.

As much as I like the idea, I don't think bottom-up identity allocation is going to work. Too many race conditions with returning allocated identities to the pool vs. allocating new ones. Much like problems @erictune ran into while exploring use of RabbitMQ with Job.

I don't want to represent the scale of the PetSet in more than one place, such as in a controller and in an identity pool, or in a controller and in a service, since that would make scaling tricky/awkward.

I don't think we need general-purpose templates for this case. We just need to stamp out pods with associated storage and predictable, durable network identities. The rough equivalent has been sufficient in Borg. Therefore, I'm leaning strongly towards something along the lines of #12450, but in the pod template rather than in pods themselves. Perhaps an array of persistentVolumeClaimTemplates in addition to a podTemplate. I don't want to add fancy substitution for all the cloud-provider-specific volume sources, especially since we want to deprecate those anyway. Adding additional requirements and/or selectors to PVCs should be sufficient. We'll also need to be able to provision persistent volumes somehow at some point.
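As a sketch of how an array of persistentVolumeClaimTemplates might be stamped out per instance (the `<template>-<set>-<ordinal>` naming is an assumption for illustration, not a settled scheme):

```python
def claims_for_set(set_name, claim_templates, replicas):
    """Stamp out one predictable PVC name per (template, ordinal)
    pair, so a replacement pod created with the same ordinal finds
    the same claims. Naming scheme is illustrative only."""
    return {
        ordinal: [f"{tmpl}-{set_name}-{ordinal}" for tmpl in claim_templates]
        for ordinal in range(replicas)
    }

claims_for_set("etcd", ["data"], 3)
# {0: ['data-etcd-0'], 1: ['data-etcd-1'], 2: ['data-etcd-2']}
```

The key property is that the mapping is a pure function of (set name, template, ordinal), so no separate allocation state needs to be tracked or reclaimed.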

We'll still need to do something about local storage eventually, but I think that's separable.

And then the hostname and headless service changes mentioned above, here:
https://github.com/kubernetes/kubernetes/issues/260#issuecomment-133318903

Working on a proposal now, will have a draft up soon.

... so where are we with PetSet?

I'm building HA database solutions on top of Kubernetes. Some, particularly single-master databases, require having services with endpoints which map to particular pods based on events from those pods. For example:

  1. service "postgres" maps round-robin to all pods in the ReplicaSet, but service "postgres-master" maps to only one pod.
  2. the master pod dies.
  3. failover happens. As part of failover, the new master updates Kube-master via the API to "grab" the postgres-master service.
  4. When it does this, all prior connections to postgres-master are terminated.

@jberkus the initial alpha code to enable PetSets has either been merged or is pending merge. @ingvagabund from my team is going to start putting together examples that make use of PetSet to see what works well and what still needs improvement. I'd recommend you reach out to him if you've got some specific use cases you want to put together and test out.

@ncdc Is everything done for this in v1.3? It is unclear as the proposal is unmerged and there haven't been other PRs referencing this for awhile.

@bprashanth ^

We are landing e2es and continuing to work on examples. It is in alpha
now, but the proposal is going to take the first round of alpha feedback
before merging.


Where can I find docs for this? I'd like to test it out for a database-failover use case.

Ah ha, just what we need a postgres expert :)

see https://github.com/kubernetes/contrib/pull/921 for examples, I can answer any questions about prototyping [db of choice] as a petset. We have a bunch of sketches under the "apps/stateful" label (eg: https://github.com/kubernetes/kubernetes/issues/23790, @philips an etcd example would be great). I haven't written docs yet, will do so toward the last few weeks of 1.3 (still 5 weeks to release after code complete on Friday).

I'm guessing you're going to try automating failover with postgres since that's pretty common. I'll admit that currently that's still not as easy as I'd like it to be, you probably need a watchdog. @jberkus I'd like to hear feedback on what patterns make that easier.

To give you a quick review the petset today gives you consistent network identity (DNS, host name) that matches a network mounted volume, and ordering guarantees. So if you create a petset with replicas: 3, you'll get:
governing service: *.galear.default.svc.cluster.local
mysql-0 - volume0
mysql-1 - volume1: doesn't start till 0 is running and ready
mysql-2 - volume2: doesn't start till 0, 1 are running ready
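The ordering guarantee reduces to a simple predicate (a sketch of the rule, not controller code):

```python
def may_start(ordinal, running_and_ready):
    """Pet N waits until pets 0..N-1 are all running and ready."""
    return all(i in running_and_ready for i in range(ordinal))

may_start(2, {0, 1})  # True: mysql-2 may start
may_start(2, {0})     # False: mysql-1 isn't ready yet
```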

The pods can use DNS for service discovery by looking up SRV records inserted under the governing service. That's what this simple pod does: https://github.com/kubernetes/contrib/pull/921/commits/4425930cea6f45385561313477662d6fb2ee2c62. So if you use the peer-finder through an init container like in the examples above, mysql-1 will not start till the init container sees (mysql-1, mysql-0) in DNS, and writes out the appropriate config.
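The peer-finder's wait condition can be sketched like this (the lookup function is injected here for illustration; the real tool polls SRV records under the governing service):

```python
def peers_ready(expected, lookup_srv):
    """Return True once every expected peer hostname appears in the
    governing service's SRV records; an init container loops on this
    before writing out config and letting the pet start."""
    found = set(lookup_srv())
    return set(expected) <= found

peers_ready(["mysql-0", "mysql-1"], lambda: ["mysql-0"])
# False: mysql-1 not yet published in DNS
peers_ready(["mysql-0", "mysql-1"], lambda: ["mysql-0", "mysql-1"])
# True
```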

The volumes are provisioned by a dynamic provisioner (https://github.com/kubernetes/kubernetes/blob/release-1.2/examples/experimental/persistent-volume-provisioning/README.md), so if you don't have one running in your cluster but just want to prototype, you can simply use emptyDir. The "data-gravity" (https://github.com/kubernetes/kubernetes/issues/7562) case doesn't work yet, but will eventually.

I'll add that currently it's easier to deliver "on-start" notification with a list of peers, through init containers. It's clear that we also require "on-change" notifications. Currently to notice cluster membership changes you need to use a custom pid1. Shared pid namespaces might make this easier, since you can then use a sidecar, this is also something that needs to just work.

I have a watchdog, it's the service failover which is more complicated than I'd like. Will test, thanks!

I also need to support etcd, so there may be lots of testing in my future.

@ncdc What's the status of the alpha code for this? I'd like to start testing / implementing. We need to deploy a cassandra cluster really soon here. I can do it with the existing codebase but it'd be nice to test out the petset stuff.

You can get it if you build from HEAD

@bprashanth merged into the main repo? great, thanks. will do.

embedded yaml in annotation strings? oof, ouch :(. thanks though, will investigate making a cassandra set.

that's json. It's an alpha feature added to a GA object (init containers in pods).
@chrislovecnm is working on Cassandra, might just want to wait him out.

@paralin here is what I am working on. No time to document and get it into k8s repo now, but that is long term plan. https://github.com/k8s-for-greeks/gpmr/tree/master/pet-race-devops/k8s/cassandra Is working for me locally, on HEAD.

Latest C* image in the demo works well.

We do have an issue open for more documentation. Wink wink, nudge @bprashanth

PetSets example with etcd cluster [1].

[1] https://github.com/kubernetes/contrib/pull/1295

Be sure to capture design asks on the proposal doc after you finish review


The petset docs are https://github.com/kubernetes/kubernetes.github.io/blob/release-1.3/docs/user-guide/petset.md and https://github.com/kubernetes/kubernetes.github.io/tree/release-1.3/docs/user-guide/petset/bootstrapping. I plan to close this issue and open a new one that addresses moving petset to beta, unless anyone objects.
