How to Fix OOMKilled Errors in Kubernetes
Here's how to diagnose, troubleshoot, and prevent OOMKilled errors in Kubernetes by addressing memory shortages, leaks, and configuration issues.
At first glance, OOMKilled might sound like a song title from a band trying to fuse the style of "MMMBop" with death metal.
In fact, though, OOMKilled has nothing to do with either the Hanson Brothers or Marilyn Manson. Instead, it's a common type of error on Kubernetes that can cause applications to fail.
That's why understanding what causes OOMKilled events and how to troubleshoot them — the topics this article covers — is important for any IT pro who works with cloud-native technology like Kubernetes.
What Is OOMKilled in Kubernetes?
OOMKilled is an error in Kubernetes that means a container was terminated (or "killed") because the system ran out of memory. It's also sometimes called exit code 137 because 137 is the exit code Kubernetes reports when a container stops running due to an OOMKilled error.
OOM stands for out-of-memory in this context, so when a container stops running with an OOMKilled error, it was killed in response to an out-of-memory event.
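If you want to see the exit code itself, you can query it directly with kubectl. Here's a minimal sketch, assuming a pod with a single container (substitute your own pod name):
kubectl get pod <podname> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
An output of 137 confirms the container was stopped by an out-of-memory kill.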
How Process Termination Works on Linux
To understand fully what that means, let's talk about how processes work, and how they can be "killed," on Linux (which is the operating system that usually powers the nodes, or host servers, within a Kubernetes cluster).
Any process running on a Linux server can be terminated if the operating system sends a termination, or "kill," signal to it. You can send that signal manually using a command like the following:
kill 1234
This command tells Linux to terminate a process whose process ID is 1234. (You can find process IDs on Linux using the ps -e command.)
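By default, kill sends the relatively gentle SIGTERM signal. To send the forceful SIGKILL signal instead, you can pass the signal number explicitly; the PID here is just a placeholder, as above:
kill -9 1234
SIGKILL (signal 9) is the signal the kernel uses when it terminates a process during an out-of-memory event, and a process killed by signal 9 exits with code 137 (128 plus 9), which is why OOMKilled containers report exit code 137.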
However, it's not only through manual kill commands that processes can be terminated on Linux. The operating system can also terminate a process automatically if it believes doing so is necessary to keep the system running stably.
This can happen when a system runs critically low on memory. If memory is exhausted entirely, the system is likely to become unstable or crash because processes can no longer allocate the space they need. To prevent this, Linux can decide to terminate specific processes and reclaim their memory.
Linux chooses which processes to kill during periods of low memory using a kernel mechanism known as the OOM killer, which scores each process based on factors such as how much memory it is using. The selection is made automatically, and there's not much you can do to control it directly.
So, if a Linux server happens to decide that the process that runs a container is at the top of the list for termination during an out-of-memory event, the operating system will kill that process, resulting in an OOMKilled error for the container.
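If you have shell access to a node where you suspect this has happened, you can look for the kernel's out-of-memory messages in the system log and inspect the score the kernel assigns to a given process. A rough sketch follows; the grep pattern and PID are illustrative, and reading the kernel log may require elevated privileges:
dmesg | grep -i "out of memory"
cat /proc/1234/oom_score
The higher a process's oom_score, the more likely the kernel is to kill it when memory runs short.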
How to Diagnose and Fix OOMKilled on Kubernetes
Determining whether a Kubernetes application has experienced an OOMKilled error is easy. Simply run the command:
kubectl describe pod <podname>
If an OOMKilled error has occurred, you'll see the following lines in the output of that command:
Last State: Terminated
Reason: OOMKilled
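It's also worth pulling the logs from the previous instance of the container, since the application may have logged useful context right before it ran out of memory. A quick sketch, with the pod name as a placeholder:
kubectl logs <podname> --previous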
These outputs confirm that an OOMKilled error occurred, but they don't tell you much about why. To determine that, you'll need to work through some additional steps:
Check overall memory usage for the node that hosts the application that was OOMKilled (the kubectl top commands sketched just after this list can help here). If almost all of the node's memory is still in use even with the application shut down, the node is likely simply short on memory resources.
If overall node memory usage is reasonable, restart the failed application and monitor its memory consumption. If memory consumption increases steadily over time regardless of the number of requests the app is receiving, you probably have a memory leak issue.
Try moving the application to a different node in your cluster. This will help you determine whether the issue is linked to your application or a specific node.
Assess the resource requests, limits, and quotas you've configured for your workloads to make sure you're not tying up memory on applications that don't need it.
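For the first two steps above, the kubectl top commands are a convenient way to watch memory usage, assuming the metrics-server add-on is installed in your cluster; the pod name and namespace below are placeholders:
kubectl top node
kubectl top pod <podname> --namespace <namespace>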
What Causes OOMKilled Errors?
The troubleshooting steps above should point you toward the underlying cause of the OOMKilled event, and once you know the cause, you can take steps to fix it.
Here's a look at common causes of OOMKilled and how to fix each one:
You don't have enough memory on your servers to support your workloads. In that case, the solution is to add more servers to your cluster.
A bug in your application code creates a memory leak, causing a container to use more and more memory over time until it has exhausted all available memory. To fix this issue, you need to debug and update your code.
You've assigned too much memory to some workloads. Even if workloads with high memory allocations don't need all the memory assigned to them, they tie it up nonetheless, causing other workloads to run short of memory and ultimately leading the operating system to terminate some processes. The solution is to reconfigure memory allocations (a sample manifest appears after this list).
Your workloads are experiencing a surge in demand, causing them to consume more memory than your servers have available. In this case, you may need to set up autoscaling so you can deploy additional servers to meet an increase in demand.
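For the over-allocation and surge scenarios, memory requests and limits are the main knobs to adjust. Below is a minimal sketch of a pod manifest with explicit memory settings; the names, image, and values are purely illustrative and would need tuning for a real workload:
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
  - name: example-app
    image: example/app:latest
    resources:
      requests:
        memory: "256Mi"
      limits:
        memory: "512Mi"
The request tells the scheduler how much memory to reserve for the container, while the limit caps how much the container can consume before it is OOMKilled. Keeping both values realistic for each workload frees memory for the rest of the cluster and makes out-of-memory kills far less likely.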