Kubernetes: Kubelet/Kubernetes should work with Swap Enabled

Created on 6 Oct 2017  ·  94 Comments  ·  Source: kubernetes/kubernetes

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug
/kind feature

What happened:

Kubelet/Kubernetes 1.8 does not work with Swap enabled on Linux Machines.

I have found this original issue https://github.com/kubernetes/kubernetes/issues/31676
This PR https://github.com/kubernetes/kubernetes/pull/31996
and last change which enabled it by default https://github.com/kubernetes/kubernetes/commit/71e8c8eba43a0fade6e4edfc739b331ba3cc658a

If Kubernetes does not know how to handle memory eviction when swap is enabled, it should find a way to handle it rather than asking users to get rid of swap.

See, for example, kernel.org's Chapter 11, Swap Management:

The casual reader may think that with a sufficient amount of memory, swap is unnecessary but this brings us to the second reason. A significant number of the pages referenced by a process early in its life may only be used for initialisation and then never used again. It is better to swap out those pages and create more disk buffers than leave them resident and unused.

When running a lot of Node.js/Java applications I have always seen many pages swapped out, simply because they are never used again.

What you expected to happen:

Kubelet/Kubernetes should work with swap enabled. I believe that instead of disabling swap and giving users no choice, Kubernetes should support more use cases and various workloads; some of them may be applications that rely on caches.

I am not sure how Kubernetes decides what to kill during memory eviction, but considering that Linux has this capability, maybe it should align with how Linux does it? https://www.kernel.org/doc/gorman/html/understand/understand016.html

I would suggest rolling back the change that fails when swap is enabled, and revisiting how memory eviction currently works in Kubernetes. Swap can be important for some workloads.

How to reproduce it (as minimally and precisely as possible):

Run kubernetes/kubelet with default settings on a Linux box

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

/sig node
cc @mtaufen @vishh @derekwaynecarr @dims

kind/feature sig/node

Most helpful comment

Not supporting swap as a default? I was surprised to hear this -- I thought Kubernetes was ready for prime time? Swap is one of those features.

This is not really optional in most open use cases -- it is how the Unix ecosystem is designed to run, with the VMM switching out inactive pages.

If the choice is no swap or no memory limits, I'll choose to keep swap any day, and just spin up more hosts when I start paging, and I will still come out saving money.

Can somebody clarify -- is memory eviction only a problem if you are using memory limits in the pod definition, but otherwise it is okay?

It'd be nice to work in a world where I have control over the way an application's memory works so I don't have to worry about poor memory usage, but most applications have plenty of inactive memory space.

I honestly think this recent move to run servers without swap is driven by the PaaS providers trying to coerce people into larger memory instances--while disregarding ~40 years of memory management design. The reality is that the kernel is really good about knowing what memory pages are active or not--let it do its job.

All 94 comments

Support for swap is non-trivial. Guaranteed pods should never require swap. Burstable pods should have their requests met without requiring swap. BestEffort pods have no guarantee. The kubelet right now lacks the smarts to provide the right amount of predictable behavior here across pods.

We discussed this topic at the resource mgmt face to face earlier this year. We are not super interested in tackling this in the near term relative to the gains it could realize. We would prefer to improve reliability around pressure detection, and optimize issues around latency before trying to optimize for swap, but if this is a higher priority for you, we would love your help.

/kind feature

@derekwaynecarr thank you for the explanation! It was hard to find any information/documentation on why swap should be disabled for Kubernetes, which was the main reason I opened this issue. At this point this is not a high priority for me, I just wanted to make sure we have a place where it can be discussed.

There is more context in the discussion here: https://github.com/kubernetes/kubernetes/issues/7294 – having swap available has very strange and bad interactions with memory limits. For example, a container that hits its memory limit would _then_ start spilling over into swap (this appears to be fixed since f4edaf2b8c32463d6485e2c12b7fd776aef948bc – they won't be allowed to use any swap whether it's there or not).

This is a critical use case for us too. We have a cron job that occasionally runs into high memory usage (>30GB) and we don't want to permanently allocate 40+GB nodes. Also, given that we run in three zones (GKE), this allocates 3 such machines (1 in each zone). And this configuration has to be repeated in 3+ production instances and 10+ test instances, making K8s super expensive to use. We are forced to run 25+ 48GB nodes, which incurs a huge cost!
Please enable swap!

A workaround for those who really want swap. If you

  • start kubelet with --fail-swap-on=false
  • add swap to your nodes
  • containers which do not specify a memory requirement will then by default be able to use all of the machine memory, including swap.

That's what we're doing. Or at least, I'm pretty sure it is, I didn't actually implement it personally, but that's what I gather.

This might only really be a viable strategy if none of your containers ever specify an explicit memory requirement...
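
For the record, on a systemd-managed node the workaround above could look roughly like this. This is a sketch only: the drop-in file name is arbitrary, and KUBELET_EXTRA_ARGS is the variable honored by kubeadm's drop-in, so adjust it if your kubelet unit is wired differently.

# Sketch: allow kubelet to start with swap present, then add swap to the node.
sudo mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/kubelet.service.d/90-allow-swap.conf
[Service]
Environment="KUBELET_EXTRA_ARGS=--fail-swap-on=false"
EOF
sudo systemctl daemon-reload && sudo systemctl restart kubelet

# Then actually provision some swap, e.g.:
sudo fallocate -l 2G /swapfile && sudo chmod 600 /swapfile
sudo mkswap /swapfile && sudo swapon /swapfile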

We run in GKE, and I don't know of a way to set those options.

I'd be open to considering adopting zswap if someone can evaluate the implications to memory evictions in kubelet.

I am running Kubernetes on my local Ubuntu laptop and with each restart I have to turn off swap. I also have to worry about not going near the memory limit, since swap is off.

Is there any way to avoid turning off swap after each restart, like some configuration file change in the existing installation?

I don't need swap on nodes running in cluster.

It's just that other applications on my laptop, other than the Kubernetes local dev cluster, need swap to be turned on.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T10:09:24Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T09:42:01Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}

Right now the flag is not working.

# systemctl restart kubelet --fail-swap-on=false
systemctl: unrecognized option '--fail-swap-on=false'

Set the following Kubelet flag: --fail-swap-on=false

thanks @mtaufen

For systems that bootstrap the cluster for you (like Terraform), you may need to modify the kubelet service file.

This worked for me

sudo sed -i '/kubelet-wrapper/a \ --fail-swap-on=false \\\' /etc/systemd/system/kubelet.service

Not supporting swap as a default? I was surprised to hear this -- I thought Kubernetes was ready for prime time? Swap is one of those features.

This is not really optional in most open use cases -- it is how the Unix ecosystem is designed to run, with the VMM switching out inactive pages.

If the choice is no swap or no memory limits, I'll choose to keep swap any day, and just spin up more hosts when I start paging, and I will still come out saving money.

Can somebody clarify -- is memory eviction only a problem if you are using memory limits in the pod definition, but otherwise it is okay?

It'd be nice to work in a world where I have control over the way an application's memory works so I don't have to worry about poor memory usage, but most applications have plenty of inactive memory space.

I honestly think this recent move to run servers without swap is driven by the PaaS providers trying to coerce people into larger memory instances--while disregarding ~40 years of memory management design. The reality is that the kernel is really good about knowing what memory pages are active or not--let it do its job.

This also has the effect that if memory gets exhausted on the node, it can potentially become completely locked up, requiring a restart of the node, rather than just slowing down and recovering a while later.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

I have a high number of disk reads on my cluster nodes (K8s version v1.11.2). Maybe because swap memory is disabled?

https://stackoverflow.com/questions/51988566/high-number-of-disk-reads-in-kubernetes-nodes

@srevenant In the cluster world, the other node's RAM is the new swap. That said, I run two one-node K8s instances where swap makes sense. But this is not the typical use case of K8s.

@srevenant I completely agree with you. Swap has been used on Unix and Linux by default since they were born; in 15 years of working on Linux I don't think I've seen an app that asks for swap to be off.
The thing is, swap is always on by default when we install any Linux distro, so I had to turn it off before installing K8s, and that was a surprise.
The Linux kernel knows well how to manage swap to increase server performance, especially temporarily when the server is about to reach its RAM limit.
Does this mean I must switch swap off for K8s to work well?

I have an interest in making this work, and I have the skills and a number of machines to test on. If I wanted to contribute, where is the best place to start?

@superdave please put together a KEP in kubernetes/community describing how you would like swap to be supported, and present it to sig-node. we would love to have your help.

I am in favor of properly enabling swap in Kubernetes pods. It really does not make sense to not have swap, since almost all containers are custom Linux instances and hence support swap by default.
It's understandable that the feature is complex to implement, but since when did that stop us from moving forward?

I must agree that the swap issue should be solved in Kubernetes, since disabling swap causes node failure when a node runs out of memory. For example, if you have 3 worker nodes (20GB of RAM each) and one node goes down because the RAM limit is reached, the 2 other worker nodes will also go down once all the pods are transferred to them.

You can prevent this by setting memory requests according to the actual
application's need.

If one third of your application's memory is on 2 orders of magnitude
slower storage, will it be able to do any useful work?

@matthiasr You can do that when you have 10-50 services. But when you have a cluster running over 200 services, and half of them are deployed using official Helm charts without any memory requests in them, your hands are tied.

But then, isn’t missing memory requests the problem to be addressed?

@matthiasr in a lot of cases, memory, once mapped into the process, is only used once or never actually used. Those are valid cases and are not memory leaks. When you have swap, those pages eventually get swapped out and may never be swapped in again, yet you free up fast RAM for better use.

Nor is turning swap off a good way to ensure responsiveness. Unless you pin files in memory (a capability K8s should have for executables, at least), the kernel will still swap out any and all file-backed pages in response to memory pressure, or even simply lack of use.

Having swap enabled doesn't markedly change kernel behavior. All it does is provide a space to swap out anonymous pages, or modified pages loaded from COW-mapped files.

You can't turn off swapping entirely, so K8s needs to survive its existence whether or not the special case of anonymous memory swapping is enabled.

That makes this a bug: You're failing to support a kernel feature that can't actually be turned off.

@Baughn

the kernel will still swap out any and all file-backed pages in response to memory pressure, or even simply lack of use. Having swap enabled doesn't markedly change kernel behavior.

You can't turn off swapping entirely,

Can you provide some reference for this so that I could educate myself?

Unless you pin files in memory (a capability K8s should have for executables, at least),

What is the capability you want k8s to use? If a binary is static just copying it over to a tmpfs on the pod should help with paging latency.

@adityakali any thoughts on impact of swap in the kernel when swap is turned off?

Can you provide some reference for this so that I could educate myself?

Like all modern virtual memory OSes, Linux demand pages executables from disk into memory. Under memory pressure, the kernel swaps the actual executable code of your program to/from disk just like any other memory pages (the "swap out" is simply a discard because read-only, but the mechanism is the same), and they will be re-fetched if required again. Same goes for things like string constants, which are typically mmapped read-only from other sections of the executable file. Other mmapped files (common for database-type workloads) are also swapped in+out to their relevant backing files (requiring an actual write-out if they've been modified) in response to memory pressure. The _only_ swapping you disable by "disabling swap" is "anonymous memory" - memory that is _not_ associated with a file (the best examples are the "stack" and "heap" data structures).

There are lots of details I'm skipping over in the above description of course. In particular, executables can "lock" portions of their memory space into ram using the mlock family of syscalls, do clever things via madvise(), it gets complicated when the same pages are shared by multiple processes (eg libc.so), etc. I'm afraid I don't have a more useful pointer to read more other than those manpages, or general things like textbooks or linux kernel source/docs/mailing-list.

So, a practical effect of the above is that as your process gets close to its memory limit, the kernel will be forced to evict _code_ portions and constants from ram. The next time that bit of code or constant value is required, the program will pause, waiting to fetch it back from disk (and evict something else). So even with "swap disabled", you still get the same degradation when your working set exceeds available memory.

Before people read the above and start calling to mlock everything into memory or copy everything into a ramdrive as part of the anti-swap witch hunt, I'd like to repeat that the real resource of interest here is working set size - not total size. A program that works linearly through gigabytes of data in ram might only work on a narrow window of that data at a time. This hypothetical program would work just fine with a large amount of swap and a small ram limit - and it would be terribly inefficient to lock it all into real ram. As you've learned from the above explanation, this is exactly the same as a program that has a large amount of _code_ but only executes a small amount of it at any particular moment.

My latest personal real-world example of something like that is linking the kubernetes executables. I'm currently (ironically) unable to compile kubernetes on my kubernetes cluster because the go link stage requires several gigabytes of (anonymous) virtual memory, even though the working set is much smaller.

To really belabour the "its about working set, not virtual memory" point, consider a program that does lots of regular file I/O and nothing to do with mmap. If you have sufficient ram, the kernel will cache repeatedly-used directory structures and file data in ram and avoid going to disk, and it will allow writes to burst into ram temporarily to optimise disk write-out. Even a "naive" program like this will degrade from ram-speeds to disk-speeds depending on working set size vs available ram. When you pin something into ram unnecessarily (eg: using mlock or disabling swap), you prevent the kernel from using that page of physical ram for something actually useful and (if you didn't have enough ram for working set) you've just moved the disk I/O to somewhere even more expensive.
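
To make the "working set, not virtual memory" point above observable, here is a rough sketch of how one could watch a container's swap and major-fault activity. It assumes cgroup v1 with swap accounting enabled and the cgroupfs driver; the kubepods path is an assumption and varies by setup.

# Sketch: inspect per-container memory.stat counters (cgroup v1, swapaccount=1 assumed).
CG=/sys/fs/cgroup/memory/kubepods/burstable/pod<POD_UID>/<CONTAINER_ID>
grep -E '^total_(rss|cache|swap|pgmajfault) ' "$CG/memory.stat"
# A rising total_pgmajfault (pages re-read from disk) is a much better thrashing
# signal than the container's total virtual memory size.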

@superdave: I too am interested in improving the status-quo here. Please include me if you want another pair of eyes to review a doc, or hands at a keyboard.

My latest personal real-world example of something like that is linking the kubernetes executables. I'm currently (ironically) unable to compile kubernetes on my kubernetes cluster because the go link stage requires several gigabytes of (anonymous) virtual memory, even though the working set is much smaller.

@superdave: I too am interested in improving the status-quo here. Please include me if you want another pair of eyes to review a doc, or hands at a keyboard.

Good summary of the issue at hand! Swap thrashing is the key problem here, indeed; it's something I'm hoping to address rationally. It seems to me, though I haven't really had the time to think through it enough, that some sort of metric of swapping activity (pageins/outs over a given timeframe, perhaps with a percentile rule to avoid over-eager pouncing if a process suddenly spikes temporarily) might be a good way to evaluate things for eviction when swap is enabled. There are a number of metrics, and I suspect we'd want to offer a lot of knobs to twiddle as well as carefully evaluate the likely use cases. I also suspect being able to instrument pods for virtual memory interactions should help people tune better; I'm not familiar enough with what's already offered to say what's there now, but I suspect I'm going to find out.

I also am not familiar enough with the controls we have to know how well we can control swapping behavior in individual pods/containers; it would be helpful to be able to "renice" things for retention or swapping, but developers are obviously always free to try to mlock() when they absolutely need to guarantee things will be resident anyway.

In any case, yes, I absolutely want to move forward on this. I've been swamped at work lately (handling some OOM issues with our own microservices under k8s that would have benefitted from being able to swap under load because 99% of the time they don't need gigs of RAM unless someone makes an inadvisably large request), but please feel free to keep on me about it. I've never participated in the KEP process before, so I'm going to be pretty green at it, but these days I work much better on an interrupt basis than a polling one. :-)

I would like to point out that zram works by piggybacking on swap. If there is no swap on k8s, then there is no memory compression, which is something most non-Linux OSes have enabled by default (cf. Windows, macOS).

We have an Ubuntu instance on k8s that runs a large batch job every night which consumes a lot of memory. As the workload is not predetermined, we are forced to (expensively) allocate 16GB to the node regardless of its actual memory consumption to avoid OOM. With memory compression on our local dev server, the job peaks at only 3GB. Otherwise, during the day, it takes only 1GB of memory. Banning swap, and thus memory compression, is quite a silly move.

I think the main concern here is probably isolation. A typical machine can host a ton of pods, and if memory gets tight, they could start swapping and completely destroy performance for each other. If there's no swap, isolation is much easier.

I think the main concern here is probably isolation. A typical machine can host a ton of pods, and if memory gets tight, they could start swapping and completely destroy performance for each other. If there's no swap, isolation is much easier.

But as explained previously, disabling swap doesn't buy us anything here. In fact, since it increases memory pressure overall, it may force the kernel to drop parts of the working set when it could otherwise have swapped out unused data -- so it makes the situation worse.

Enabling swap, on its own, should actually improve isolation.

But it does buy you a lot, if you run things the way you're supposed to (and the way Google runs things on Borg): all containers should specify the upper memory limit. Borg takes advantage of Google infra and learns the limits if you want it to (from past resource usage and OOM behavior), but there are limits nonetheless.

I'm actually kind of baffled that K8S folks allowed the memory limit to be optional. Unlike CPU, memory pressure has a very non-linear effect on system performance as anyone who has seen a system completely lock up due to swapping will attest. It should really require it by default, and give you a warning if you choose to disable it.

But it does buy you a lot, if you run things the way you're supposed to (and the way Google runs things on Borg): all containers should specify the upper memory limit. Borg takes advantage of Google infra and learns the limits if you want it to (from past resource usage and OOM behavior), but there are limits nonetheless.

I'm actually kind of baffled that K8S folks allowed the memory limit to be optional. Unlike CPU, memory pressure has a very non-linear effect on system performance as anyone who has seen a system completely lock up due to swapping will attest. It should really require it by default, and give you a warning if you choose to disable it.

I think the issue this fails to address is that the upper limit is variable and not always known for some processes. The issue I am dealing with specifically focuses on using k8s to manage 3d model renderer nodes. Depending on the assets for the model and scene being rendered, the amount of ram required can vary quite a bit, and while most renders will be small, the fact that _some_ can be huge means that our requests and limits have to reserve way more memory than we actually need 90% of the time to avoid OOM, rather than the pod occasionally exceeding the configured limit and being able to spill over into swap space.

Yes, and in that case you'd then set your upper limit to "None" or something to that effect. My point is, it shouldn't be the default. Setting it to nothing completely defeats any kind of intelligent workload scheduling, since the master simply doesn't know the size of the "item" (job) it's about put into a "knapsack" (kubelet).

The problem here isn't that your job will be spilled to swap, it's that all other jobs running on that node will be, too. And some (most?) of them won't like it, at all.

Programs on Borg are written to be preemptible (killable, for those not familiar with the jargon) at any time with no effect on data integrity. This is actually something I don't see much outside of Google, and acknowledging your program's potential sudden mortality leads to writing much more reliable software. But I digress.

Systems built with such programs would much prefer those programs to die and be re-incarnated elsewhere in the cell (Borg cluster), rather than continue to suffer on an oversubscribed node. Tail latency can be really problematic otherwise, especially when the number of tasks in a job is large.

Don't get me wrong, I'm not saying this is the only "correct" way to run things. I'm just trying to elucidate the possible rationale that went into the design.

Disclaimer: I'm a former Googler who used Borg to run several very large services, so I know it quite well, and that knowledge largely translates to Kubernetes. I'm not currently with Google, and whatever I write here are my own thoughts.

@1e100: You are conflating "total VM" size with "working set" size. The workload needs to be scheduled based on _working set_ size, and the program will degrade once _working set_ size is exceeded (assuming there is enough total swap available). The reasoning you've stated above also relies on the incorrect assumption that swapping (and other ram<->I/O degradation tradeoffs) won't happen just because swap is disabled.

(I'm also an ex-Google-SRE, and I agree that these common Google myths are almost certainly what went into the decision that it was ok (or even desirable) to disable swap on k8s too. I watched several Google teams go through the learning that disabling swap does not disable swapping, and the aggregate memory waste that follows from only describing a "hard" (oom-kill) limit for memory - these are precisely some of the things I would like to improve with k8s. There are a number of cgroup/swap tunable knobs available now that we didn't have when the borg resource model was initially designed, and I'm convinced we can achieve the desired outcomes without such a throw-baby-out-with-bathwater approach. I will also note the Google tradeoff is _often_ to be less efficient on average in order to achieve a better/known worst-case time (ie: real-time behaviour) and that this is often _not_ the desired choice outside Google - smaller/fewer hosts, more relaxed SLOs, lower budgets, more poorly-defined batch jobs, more use of non-compiled heap-inefficient languages, etc.)

Swapping of anonymous memory won't happen. Anything memory mapped (including program code and data) can and will still swap if there's memory pressure, that's why I suggested that RAM limits should be required by default: to make it less likely there's memory pressure in the first place. For workloads that need even stricter guarantees, there's also mlockall() and low swappiness value.

As a former Google SRE you can't argue that not specifying the upper RAM limit, or enabling tasks to swap whatever they want on a whim is a good thing, unless you just want to be a contrarian. Swapping of memory mapped files is bad enough, introducing even more potential performance cliffs into the mix is not a good thing.

These are shared environments by design, and you want to eliminate the ways for programs to make each others' performance unpredictable, not add new ones. As Google SREs say "hope is not a strategy". Swap thrashing is the easiest way I know to get a Linux machine to completely and irrecoverably lock up, even if you're swapping to SSD. That can't be good even if you're just running one workload on the machine, let alone a couple dozen. Correlated failures can be especially painful in smaller clusters with few tasks/pods.

One can ignore the swap check even today if one wants to, but with the explicit understanding that all bets are off in that case.

Yep, totally agree that we need to have a "size" that we use for scheduling and to avoid (unintentional) overcommit. And we also want to avoid global VM thrash, because Linux has a poor time recovering from that. What we _do_ want is for the kernel to be _able_ to swap out anonymous memory and reuse that ram for something else, where that makes sense, since this is strictly superior to a system that can't do that. Ideally we want to allow individual containers to be able to manage ram/disk tradeoffs, and face the consequences of their own resource (over/underallocated) choices with minimal effect on other containers.

Just to show where I'm going with this, my current strawman proposal is:

  • Metaphor is that each container behaves like it's on a machine by itself with a certain amount of ram (given by limits.memory) and swap.
  • Same as other resources: schedule based on requests.memory, impose limit based on limits.memory. "Memory" in this case means "ram" - swap usage is free.
  • Specifically k8s requests.memory -> cgroup memory.low (scaled down by any overcommit factor); k8s limits.memory -> cgroup memory.high.
  • If a container exceeds its configured memory limit, then it starts to swap - _regardless_ of the amount of free ram available. Thanks to cgroups, this is _not_ just VM usage, but also includes block cache, socket buffers, etc attributable to the container. This prevents us putting memory pressure on other containers (or host processes). When looking for a page to evict, the kernel will look for containers that are exceeding their memory request size.
  • Introduce a total-swap-usage kubelet soft limit where k8s will stop scheduling new pods onto the host (exactly like other "shared filesystems" such as imagefs).
  • If we reach the total-swap-usage hard limit, then start evicting pods based on qos-class/priority and VM size above "requests" (exactly like other "shared filesystems" such as imagefs).
  • If a container greatly exceeds its working set (requests.memory) then it may thrash (if it also exceeds limits.memory or there is not sufficient ram available on the host). We explicitly _don't_ do anything about this through the resources mechanism. If the container is swap-thrashing then it will (presumably) fail liveness/readiness probe checks and be killed through that mechanism (ie: swap-thrashing is fine if we have no configured responsiveness SLAs).

The end result is that the admin is responsible for configuring "enough" swap on each system. Applications should configure limits.memory with the _max_ ram they ever want to use, and requests.memory with their intended working set (including kernel buffers, etc). As with other resources, qos classes guaranteed (limit==request), burstable (limit undefined or !=request), best-effort (no limit or request) still apply. In particular, this encourages burstable processes to declare close to their intended working-set (no large safety buffer), which allows efficient allocation (ideally exactly 100% of ram allocated for working set) and gives a smooth performance degradation when containers exceed that - just like other "forgiving" resources like cpu.

I think this is implementable within Linux cgroups, addresses the isolation concerns, continues the conceptual precedents set by other k8s resources, and degrades to the existing behaviour when swap is disabled (making migration easy). The only open question I have is whether this is _already_ what's implemented (minus the "swapfs" kubelet soft/hard limit) - I need to go and read the actual kubelet/CRI cgroups code before I can write up a concrete proposal and action items.
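
As a rough illustration only of the requests -> memory.low and limits -> memory.high mapping sketched above: on a cgroup v2 host the knobs already exist, although the slice path below is hypothetical and this is not what the kubelet configures today.

# Sketch only: hand-setting the proposed mapping on a cgroup v2 host.
CG=/sys/fs/cgroup/kubepods.slice/kubepods-pod<POD_UID>.slice
echo $((256 * 1024 * 1024)) > "$CG/memory.low"   # requests.memory: protected working set
echo $((640 * 1024 * 1024)) > "$CG/memory.high"  # limits.memory: reclaim/throttle above this
cat "$CG/memory.swap.max"                        # cgroup v2 also offers a per-cgroup swap cap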

Comments/discussion on the above are probably not appropriate in this github issue (it's a poor discussion forum). If there's anything terribly wrong with the above I would welcome feedback to guslees on k8s slack, else I'll write up a proper doc and go through the usual design discussion at some point.

I recommend writing up a formal doc so we can have a better forum for discussion.

Agreed. I'm happy to help write a KEP, because I have some definite ideas for this, but I've never done one before and would rather have a more experienced hand on the rudder.

But, also, I don't have the bandwidth for keeping up with a Slack channel in my spare time; if there's a more asynchronous method of coordinating, let me know.

Just to keep things alive: I'm still very much interested in working on a KEP and/or implementation for this; once things settle down (I have a workshop to prepare for next weekend), I'll try and join the Slack channel.

Hi, is there any public discussion of this issue happening currently? (The k8s Slack is not open to everyone at the moment, and I assume won't be for some time.)

@leonaves there is no discussion going on on Slack currently AFAIK. The last comment from @guslees is the last of the discussion. Note that there will have to be a KEP with details in the kubernetes/enhancements repo to kick things off, and probably mailing list threads too.

There does seem to be light at the end of the tunnel for the Slack re-opening pretty soon as well. Fingers crossed.

As it turns out, I still don't have the mental bandwidth to join yet another Slack channel. I'd still be down with collaborating on this over email.

A workaround for those who really want swap. If you

  • start kubelet with --fail-swap-on=false
  • add swap to your nodes
  • containers which do _not_ specify a memory requirement will then by default be able to use all of the machine memory, including swap.

That's what we're doing. Or at least, I'm pretty sure it is, I didn't actually implement it personally, but that's what I gather.

This might only really be a viable strategy if none of your containers ever specify an explicit memory requirement...

Is this method not working anymore?! I turned on swap and deployed a pod without a memory setting, and got this container:

$ docker inspect <dockerID> | grep Memory
            "Memory": 0,
            "KernelMemory": 0,
            "MemoryReservation": 0,
            "MemorySwap": 0,
            "MemorySwappiness": -1

MemorySwap is "0", which means this container cannot access the swap :(

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale.

/remove-lifecycle stale

Going to drop this here as another reference for readers of this issue: https://chrisdown.name/2018/01/02/in-defence-of-swap.html

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

This feature is really needed in some use cases. Currently we're using k8s for machine learning, and sometimes we need to load big models into memory (in our case sometimes 500MB per API request!), and the limits of physical memory are causing serious issues. Scaling out would work from a technical POV, but the costs would go through the roof. If we had virtual memory as an option it'd be great.

Any chance this ticket gets back to the roadmap again?

Sounds like a case for mmap.

I am also very much interested in this feature. Is there any news on this?

I'd be happy to start looking into this when I have time, which is in short supply right now, but it would be good to have a canonical case or two which exacerbate the problem so it can be more fully characterized (beyond "it starts thrashing and everything goes to hell") and the ultimate approach and fix validated.

Any prospective solution should also consider the security implications of swap. This, obviously, is true of anything running in a Unix environment, but if folks have gotten used to running k8s pods with a pseudo-guarantee of no swap and gotten lazy about memory discipline, this could be a pretty rude surprise if it's enabled by default.

would be good to have a canonical case or two which exacerbate the problem so it can be more fully characterized

That sounds like a KEP.

Any prospective solution should also consider the security implications of swap. This, obviously, is true of anything running in a Unix environment, but if folks have gotten used to running k8s pods with a pseudo-guarantee of no swap and gotten lazy about memory discipline, this could be a pretty rude surprise if it's enabled by default.

That logic would apply to any processes running in a mix of containers, regardless of whether Kubernetes is in use or not.

Agreed! But Docker already explicitly supports running with swap. Kubernetes explicitly does not (though you can force it to). My contention is that it should be at least called out, because it's not in everyone's threat model, especially if they haven't had to think about it previously.

Also yes, @sftim, it does. :-) I think what I'm saying is that I'd like to write/contribute to a KEP, but I would like to see a minimal test case or two which reliably exercises the problem on a given test system before venturing out so that we can be sure we're solving the right problems.

@superdave what kind of test case do you have in mind?

Here's a trivial test:

  1. Set up a cluster with 1 node, 16 GiB of RAM and 64GiB of pagefile.
  2. Try to schedule 20 Pods, each with 1GiB memory request and 1GiB memory limit.
  3. Observe that it doesn't all schedule.

Here's another:

  1. Set up 6 machines, each with 16 GiB of RAM and 64GiB of pagefile.
  2. Try to use kubeadm with default options to configure these machines as a Kubernetes cluster.
  3. Observe that kubeadm isn't happy about swap being in use.

There is a huge shift toward SSDs on most respectable cloud platforms now, and considering that Linux has dedicated optimizations for swapping on SSDs https://lwn.net/Articles/704478/, with the additional possibility of compression, this creates a whole new opportunity to use swap as a predictable and fast resource for extra RAM under memory pressure.
Disabled swap becomes a wasted resource, the same way unused RAM is wasted if it is not used for I/O buffers.
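
For what it's worth, here is a minimal sketch of turning on zswap (a compressed swap-backed page cache) on a node. It assumes a kernel built with zswap support and an existing swap device; the parameter names are the standard upstream ones, and none of this is managed by Kubernetes.

# Sketch: enable zswap on a node (requires zswap in the kernel and swap already configured).
echo 1 | sudo tee /sys/module/zswap/parameters/enabled
echo lz4 | sudo tee /sys/module/zswap/parameters/compressor
echo 20 | sudo tee /sys/module/zswap/parameters/max_pool_percent
sudo grep -r . /sys/kernel/debug/zswap/ 2>/dev/null   # runtime stats, if debugfs is mounted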

@superdave

Agreed! But Docker already explicitly supports running with swap. Kubernetes explicitly does not (though you can force it to). My contention is that it should be at least called out, because it's not in everyone's threat model, especially if they haven't had to think about it previously.

In that case it would be fair to assume kubelet would mlock() its memory space and set its OOM kill priority low to avoid being swapped out or OOM killed, and run all containers in cgroups with swappiness set to 0 by default. If someone wants to benefit from swapping, they could opt in via an option, e.g. enableSwapiness: 50, for particular container(s) in the pod.
No surprises, batteries included.

@sftim Those would demonstrate that a) Kubelet doesn't want to schedule the containers and b) Kubelet won't run with swap on by default. What I'm looking to exercise is the situations way up at the top of the thread, by @derekwaynecarr:

Support for swap is non-trivial. Guaranteed pods should never require swap. Burstable pods should have their requests met without requiring swap. BestEffort pods have no guarantee. The kubelet right now lacks the smarts to provide the right amount of predictable behavior here across pods.

We discussed this topic at the resource mgmt face to face earlier this year. We are not super interested in tackling this in the near term relative to the gains it could realize. We would prefer to improve reliability around pressure detection, and optimize issues around latency before trying to optimize for swap, but if this is a higher priority for you, we would love your help.

And also right below it, from @matthiasr:

There is more context in the discussion here: #7294 – having swap available has very strange and bad interactions with memory limits. For example, a container that hits its memory limit would _then_ start spilling over into swap (this appears to be fixed since f4edaf2 – they won't be allowed to use any swap whether it's there or not).

Both of those give a good view on the issues already seen, but it would be good to get a sense of known, reproducible scenarios that could exacerbate the problems. I can come up with them myself, but if someone else already has done that, it's a wheel I wouldn't mind not reinventing.

@superdave

Agreed! But Docker already explicitly supports running with swap. Kubernetes explicitly does not (though you can force it to). My contention is that it should be at least called out, because it's not in everyone's threat model, especially if they haven't had to think about it previously.

In that case it would be fair to assume kubelet would mlock() its memory space and set its OOM kill priority low to avoid being swapped out or OOM killed, and run all containers in cgroups with swappiness set to 0 by default. If someone wants to benefit from swapping, they could opt in via an option, e.g. enableSwapiness: 50, for particular container(s) in the pod.
No surprises, batteries included.

I think I agree with everything here. Default to the current behavior to avoid unpleasant surprises.

Here is an example of what a simple application could look like, where for some reason a large portion of memory is allocated but never accessed again. Then, once all available memory is filled up, the application either hangs or falls into an endless loop, basically blocking resources or triggering the out-of-memory killer:

#include <iostream>
#include <vector>
#include <unistd.h>

int main() {
  std::vector<int> data;
  try {
    // Keep growing the vector until allocation fails.
    while (true) { data.resize(data.size() + 200); }
  } catch (const std::bad_alloc& ex) {
    std::cerr << "Now we filled up memory, so assume we never access that stuff again "
                 "and just moved on, or we're stuck in an endless loop of some sort...";
    // Hold on to the allocated memory indefinitely without ever touching it again.
    while (true) { usleep(20000); }
  }
  return 0;
}

A workaround for those who really want swap. If you

  • start kubelet with --fail-swap-on=false
  • add swap to your nodes
  • containers which do _not_ specify a memory requirement will then by default be able to use all of the machine memory, including swap.

That's what we're doing. Or at least, I'm pretty sure it is, I didn't actually implement it personally, but that's what I gather.

This might only really be a viable strategy if none of your containers ever specify an explicit memory requirement...

Hi @hjwp, thank you for the information. That really helps a lot!

Could I ask a question following this?

After setting everything up as you said, is there a way to limit swap usage by containers?

I was thinking about setting the --memory-swap param of Docker:
https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details
Currently, my container has no limit on swap usage ( "MemorySwap": -1)

sudo docker inspect 482d70f73c7c | grep Memory
            "Memory": 671088640,
            "KernelMemory": 0,
            "MemoryReservation": 0,
            "MemorySwap": -1,
            "MemorySwappiness": null,

But I couldn't find this param exposed in k8s.

By the way, will the limit on pod memory also restrict swap usage?

My vm-related settings

vm.overcommit_kbytes = 0
vm.overcommit_memory = 1
vm.overcommit_ratio = 50
vm.swappiness = 20
vm.vfs_cache_pressure = 1000

Thank you!

@pai911 I don't think that is possible.

Currently CRI doesn't support that; see this, there is no option like Docker's --memory-swap.

This is a CRI limitation though; the OCI spec supports this option, but it is not exposed at the CRI layer.

One possible (theoretical) solution is to create a privileged DaemonSet which reads annotation data from the pods and then edits the cgroup values manually.

Hi @win-t ,

Thank you for the feedback!

So for now, this option is only for internal use?

Do you happen to know what cgroup value is mapped to this --memory-swap option?

So for now, this option is only for internal use?

Yes, you cannot set this option, as it isn't exposed in k8s

btw, MemorySwap in docker inspect should be the same as Memory according to this; I don't know how you can get -1 in your docker inspect

Do you happen to know what cgroup value is mapped to this --memory-swap option?

  • --memory in Docker maps to memory.limit_in_bytes in cgroup v1
  • --memory-swap in Docker maps to memory.memsw.limit_in_bytes in cgroup v1
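
For reference, a quick way to check those two cgroup v1 values for a running container, assuming Docker's cgroupfs driver and swap accounting enabled (paths differ under the systemd cgroup driver):

# Sketch: read the cgroup v1 files backing --memory and --memory-swap.
CID=$(sudo docker inspect --format '{{.Id}}' <container-name-or-id>)
cat /sys/fs/cgroup/memory/docker/$CID/memory.limit_in_bytes        # --memory
cat /sys/fs/cgroup/memory/docker/$CID/memory.memsw.limit_in_bytes  # --memory-swap (mem+swap)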

@win-t Thank you so much!

I am using the following version

Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:16:51Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.10", GitCommit:"1bea6c00a7055edef03f1d4bb58b773fa8917f11", GitTreeState:"clean", BuildDate:"2020-02-11T20:05:26Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

and I looked at the history; it seems the fix was added in this commit

Maybe it's not included in the version I am running?

So for now, this option is only for internal use?

Yes, you cannot set this option, as it isn't exposed in k8s

btw, MemorySwap in docker inspect should be the same as Memory according to this; I don't know how you can get -1 in your docker inspect

Do you happen to know what cgroup value is mapped to this --memory-swap option?

  • --memory in Docker maps to memory.limit_in_bytes in cgroup v1
  • --memory-swap in Docker maps to memory.memsw.limit_in_bytes in cgroup v1

This is odd.

I was using kops + Debian, and the Docker inspect shows there is no limit on Swap memory
(The Docker inspect info I posted earlier)

But then I switched to Amazon Linux image, and this is what I got

            "Memory": 671088640,
            "KernelMemory": 0,
            "MemoryReservation": 0,
            "MemorySwap": 671088640,
            "MemorySwappiness": null,

I'll do some more investigation and see if this is a bug

So for now, this option is only for internal use?

Yes, you cannot set this option, as it isn't exposed in k8s
btw, MemorySwap in docker inspect should be the same as Memory according to this; I don't know how you can get -1 in your docker inspect

Do you happen to know what cgroup value is mapped to this --memory-swap option?

  • --memory in Docker maps to memory.limit_in_bytes in cgroup v1
  • --memory-swap in Docker maps to memory.memsw.limit_in_bytes in cgroup v1

This is odd.

I was using kops + Debian, and the Docker inspect shows there is no limit on Swap memory
(The Docker inspect info I posted earlier)

But then I switched to Amazon Linux image, and this is what I got

            "Memory": 671088640,
            "KernelMemory": 0,
            "MemoryReservation": 0,
            "MemorySwap": 671088640,
            "MemorySwappiness": null,

I'll do some more investigation and see if this is a bug

I can now reproduce the issue with the official Debian image used by kops.

It seems that this official kops image makes swap memory unlimited:
kope.io/k8s-1.15-debian-stretch-amd64-hvm-ebs-2020-01-17

Reproduction steps:

My kops instance group is defined as the following:

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-03-12T06:33:09Z"
  generation: 5
  labels:
    kops.k8s.io/cluster: solrcluster.k8s.local
  name: node-2
spec:
  additionalUserData:
  - content: |
      #!/bin/sh
      sudo cp /etc/fstab /etc/fstab.bak
      sudo mkfs -t ext4 /dev/nvme1n1
      sudo mkdir /data
      sudo mount /dev/nvme1n1 /data
      echo '/dev/nvme1n1       /data   ext4    defaults,nofail        0       2' | sudo tee -a /etc/fstab
      sudo fallocate -l 2G /data/swapfile
      sudo chmod 600 /data/swapfile
      sudo mkswap /data/swapfile
      sudo swapon /data/swapfile
      echo '/data/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
      sudo sysctl vm.swappiness=10
      sudo sysctl vm.overcommit_memory=1
      echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
      echo 'vm.overcommit_memory=1' | sudo tee -a /etc/sysctl.conf
    name: myscript.sh
    type: text/x-shellscript
  image: kope.io/k8s-1.15-debian-stretch-amd64-hvm-ebs-2020-01-17
  instanceProtection: true
  kubelet:
    failSwapOn: false
  machineType: t3.micro

Steps:

  1. After the cluster is up & running.

  2. Deploy the Solr Helm Chart with the following resource setting

resources:
  limits:
    cpu: "1"
    memory: 640Mi
  requests:
    cpu: 100m
    memory: 256Mi

** Any other Pod should work too

  1. List containers to find a container id
    sudo docker container ls

  2. Inspect a container's memory params
    sudo docker inspect d67a72bba427 | grep Memory

            "Memory": 671088640,
            "KernelMemory": 0,
            "MemoryReservation": 0,
            "MemorySwap": -1,
            "MemorySwappiness": null,

Should I submit the issue somewhere? k8s or kops?

Steps:

  1. After the cluster is up & running.
  2. Deploy the Solr Helm Chart with the following resource setting
resources:
  limits:
    cpu: "1"
    memory: 640Mi
  requests:
    cpu: 100m
    memory: 256Mi

** Any other Pod should work too

  1. List containers to find a container id
    sudo docker container ls
  2. Inspect a container's memory params
    sudo docker inspect d67a72bba427 | grep Memory
            "Memory": 671088640,
            "KernelMemory": 0,
            "MemoryReservation": 0,
            "MemorySwap": -1,
            "MemorySwappiness": null,

Should I submit the issue somewhere? k8s or kops?

I can confirm that I can only see the correct behavior on Amazon Linux
ami-0cbc6aae997c6538a: amzn2-ami-hvm-2.0.20200304.0-x86_64-gp2

            "Memory": 671088640,
            "CpusetMems": "",
            "KernelMemory": 0,
            "MemoryReservation": 0,
            "MemorySwap": 671088640,
            "MemorySwappiness": null,

That is: "MemorySwap" == "Memory"

The other two images both have the same setting, "MemorySwap": -1, which leads to unlimited swap usage.

  • Debian

    • ami-075e61ad77b1269a7: k8s-1.15-debian-stretch-amd64-hvm-ebs-2020-01-17

  • Ubuntu

    • ami-09a4a9ce71ff3f20b: ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200112

So I think it might be a k8s issue?

User stories:

(1) My vendor provided program uses a language runtime that mandates access to the source of program code. Due to this, while initializing, all program source code is arranged in a separate memory arena. Once the program is initialized and the container becomes Ready, this memory will not be accessed (you can't prove this, but it won't). Additionally, the program allocates a few pages to reserve for custom OOM handling. This memory can be swapped out. I don't want this "dead memory" crowding out other cluster applications. I can precisely calculate the amount of memory that will become dead and list it as a request for swap in the Pod spec.

(2) I am running a machine learning or data analysis workload where memory usage can suddenly balloon and retract. It's okay if these applications slow down, as long as they don't terminate and eventually finish. I want to provision the cluster with swap to mitigate evictions when these memory balloons happen. These Pods would have low requests for memory and swap, a moderate memory limit - perhaps enough for baseline + one working set - and a large swap limit.

(3) I am running a web server in an interpreter (e.g. Ruby on Rails) that occasionally needs to fork+exec. Strict memory accounting results in fork failures, which are unacceptable. I want to provision swap so the kernel has the guaranteed memory headroom to cover process behavior between the fork and exec calls. The vm.swappiness value can be set to extremely discourage swapping, and I set up alerts to notify operations if the swap is actually used during production. The pod spec would set the swap request and limit to the same value.

Recently we attempted to migrate all our docker-based services into Kubernetes, but had to abandon the project due to swap being unsupported.

We found that we would have needed to provision 3 times the amount of hardware in order to support the exact same number of containers we were currently running with swap enabled.

The main issue is that our workload consists of a number of containers which can use up to 1GB or so of memory (or swap), but use only around 50MB or so when operating normally.

Not being able to swap meant that we had to design everything for the largest possible load it might need to handle, rather than have a block of swap available when large 'jobs' need to be dealt with.

We ended up abandoning our migration to Kubernetes and have temporarily moved everything onto Swarm instead for the time being in hopes that swap will be supported in the future.

The main issue is that our workload consists of a number of containers which can use up to 1GB or so of memory (or swap), but use only around 50MB or so when operating normally.

One might venture to say that the applications running in those containers are written incredibly poorly.

One might venture to say that the applications running in those containers are written incredibly poorly.

This is kind of irrelevant, and finger-pointing is rarely constructive. The Kubernetes ecosystem is built to support a wide range of application profiles, and singling out one of them like this does not make much sense.

The main issue is that our workload consists of a number of containers which can use up to 1GB or so of memory (or swap), but use only around 50MB or so when operating normally.

One might venture to say that the applications running in those containers are written incredibly poorly.

lol, this is a kernel feature: an application can use madvise(2) on a shm file, and we don't block the madvise syscall,
so users are entitled to leverage this feature in their design; you cannot say they "are written incredibly poorly".

The main issue is that our workload consists of a number of containers which can use up to 1GB or so of memory (or swap), but use only around 50MB or so when operating normally.

One might venture to say that the applications running in those containers are written incredibly poorly.

Your reply indicates you don't understand the workloads many developers are working with.

The workload I mentioned deals with different-sized data sets provided by users, hence the large range of possible resource requirements.

Sure, we _could_ use memory-mapped files to keep memory use consistent. But then we'd need to rewrite any libraries to use memory mapping themselves, which would include more-or-less every existing library... ever.

But then we would have created what is essentially an application-specific pagefile, which would almost certainly perform worse than one managed by the kernel.

Some Kubernetes Deployments Need Swap

I have a valid use case - I'm developing an on-prem product, a Linux distro shipped with kubeadm, with no horizontal scaling by design. To survive opportunistic memory peaks and still function (but slowly), I definitely need swap.

To install kubeadm with swap enabled

  1. Create a file in /etc/systemd/system/kubelet.service.d/20-allow-swap.conf with the content:

    [Service]
    Environment="KUBELET_EXTRA_ARGS=--fail-swap-on=false"
    
  2. Run

    sudo systemctl daemon-reload
    
  3. Initialize kubeadm with flag --ignore-preflight-errors=Swap

    kubeadm init --ignore-preflight-errors=Swap
    

https://stackoverflow.com/a/62158455/3191896

As a naive software developer, it seems perfectly reasonable to me for system pods with time-sensitive workloads to request no-swap behavior, with other workloads pushed (by default) into a best-effort category. Wouldn't that solve all the concerns?

For my own needs, a lot of my apps benefit from caches. That said, if an app suddenly needs a bunch of memory, it would be preferable to push the oldest cache to disk (if those apps don't respond to a request to lower memory pressure) rather than let the new workload run out of memory, or have to back the burst with physical memory, plus more for rolling deployments, plus more for a potential node failure.

@metatick said:

Sure, we could use memory-mapped files to keep memory use consistent. But then we'd need to rewrite any libraries to use memory mapping themselves, which would include more-or-less every existing library... ever.

The Linux standard C library is designed to allow replacing the memory allocator; malloc, realloc and free are called through pointers for that purpose. So you could just LD_PRELOAD a library that overrides them to allocate from an mmapped file.

But then we would have created what is essentially an application-specific pagefile, which would almost certainly perform worse than one managed by the kernel.

It would actually perform exactly like normal swap, because it would be managed by the very same code in the kernel. The only difference would be the lack of a swappiness parameter to adjust its priority.

The only open question I have is whether this is already what's implemented (minus the "swapfs" kubelet soft/hard limit) - I need to go and read the actual kubelet/CRI cgroups code before I can write up a concrete proposal and action items.

@anguslees,
Did you ever get around to checking the behaviour? If so, can you add some resolution or a link to one, please?

Thanks,
Jan

Did you ever get around to checking the behaviour? If so, can you add some resolution or a link to one, please?

I did not. (I did dig into the docker code a bit, but I've forgotten all about it by now and would need to start again)

Other volunteers welcome! I hope I did not steal the oxygen from someone by saying I would work on this and then failed to follow through :(

To add to @metatick 's story:

I currently use Gigalixir as my host, running on top of Kubernetes. It's a web app. Sometimes, clients upload a batch of photos, so my app spins up a bunch of (ugh) ImageMagick processes to resize them. Memory usage spikes, the OOM killer is triggered, and my app goes down (briefly) and the upload is ruined.

I end up having to pay tons more to Gigalixir than I should, just because of spiky usage. As others have mentioned.

You might not like swap from a design perspective, but your decision is costing businesspeople money... and it's wasteful.

Please fix. :)

This is also a very big issue for me. In my use case, I need to run pods that use ~100MB most of the time, but from time to time, when a user triggers specific events, they can burst up to 2GB of RAM for a few minutes before dropping back (and no, it's not because they're written poorly, it's the reality of the workload).
I run nearly a hundred such workloads at a time on 16GB machines with swap on them. I simply cannot move that workload to Kubernetes because it would not work at all. So right now I have my own orchestrator which runs these workloads on non-Kubernetes systems while my main app runs in Kubernetes, and that defeats the purpose of my migration to k8s. Without swap, either the workload gets killed, or I need to permanently waste a lot of RAM for the few minutes that the apps may (or may not) burst.

If you can set a CPU limit, which throttles the CPU for a pod, you should be able to set a memory limit which throttles the memory used by a pod. Killing a pod when it reaches the memory limit is as ridiculous as killing one when it uses more CPU than the CPU limit that was set (sorry, but not every pod is a replica which can be brought down with no consequences).

Kubernetes can't work with swap enabled on the node because it can affect the performance of the whole cluster, fine (though I don't think that's a valid argument). Ideally, what would need to happen is that the pod itself would have a pod-level swap file to which only the processes within its containers would be swapped. This would theoretically throttle the RAM usage and performance (due to swapping) of the pods that exceed their memory limits, just like CPU limits throttle them.
Unfortunately, it doesn't look like cgroups can specify a swap file, only their swappiness, and you can't tell the kernel to "swap if memory usage is above this limit"; it instead seems to decide when to swap based on last access and other metrics.

But in the meantime, why not let swap exist on a node, set swappiness to 0 for the pods that don't have a limit set, and when a limit is set (or some other spec field says "swapInsteadOfKill"), set the swappiness to a non-zero value?
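
For what it's worth, cgroup v1 already exposes a per-cgroup swappiness knob, so the idea above can be sketched manually today. The kubelet does not expose or manage this, and the cgroup path below is an assumption.

# Sketch only: per-container swappiness via cgroup v1; not something kubelet configures.
CG=/sys/fs/cgroup/memory/kubepods/pod<POD_UID>/<CONTAINER_ID>
echo 0 | sudo tee "$CG/memory.swappiness"    # effectively "don't swap" for this container
echo 60 | sudo tee "$CG/memory.swappiness"   # or opt a specific container into swapping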

Besides the discussion about "to swap or not to swap", I find it curious that the behavior described by @pai911 was not further addressed by the k8s team.

I can confirm that kubelet seems to behave differently (and on some OSes not according to the code snippet mentioned above) with regard to the Docker daemon memory settings. Our clusters run on SUSE Linux and we are experiencing the same unlimited swap usage mentioned in https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-598056151

OS details: SUSE Linux Enterprise Server 12 SP4 - Linux 4.12.14-95.45-default

Even without proper support for swap in k8s, I would at least wish that kubelet handled the Docker memory settings consistently regardless of the underlying OS.

I put swap on the agenda to see if there is community appetite or volunteers to help push this forward in 1.21. I have no objection to supporting swap, as I noted in 2017; it just needs to make sure not to confuse kubelet eviction, pod priority, or pod quality of service, and importantly, pods must be able to say whether they tolerate swap or not. All these things are important to ensure pods are portable.

A lot of energy has been focused lately on making things like NUMA aligned memory work, but if there are folks that are less performance sensitive and equally motivated to help move this space forward, we would love help to get a head-start on design of detailed KEP in this space.

I have not kept up with the community process terribly well of late, as things have been super busy for me, but they look like they should be calming down somewhat soon. Is there a way I can engage without having to join a Slack channel?
