Out-of-memory errors in containerized Java applications can be very frustrating, especially when they happen in a production environment. These errors occur for various reasons, and understanding the Java memory pool model and the different types of OOM errors helps significantly in identifying and resolving them.
The Java heap is the region where the JVM allocates memory for objects and dynamic data at runtime. It is divided into specific areas, such as the Young Generation and Old Generation, for efficient memory management. Reclamation of heap memory is managed by the Java GC process.
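As a hedged illustration (these are standard HotSpot options; the jar name is hypothetical), the heap size and the generation split are typically controlled on the command line:

```shell
# -Xms/-Xmx set the initial and maximum heap size;
# -XX:NewRatio=2 makes the Old Generation twice the size of the Young Generation
java -Xms512m -Xmx2g -XX:NewRatio=2 -jar app.jar
```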
This memory space stores class-related metadata after parsing the classes. The figure below shows the two sections in the Class Related Memory pool.
These two commands can give class-related stats from the JVM:
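For example (assuming a JDK that ships jcmd, with `<pid>` standing in for the target JVM's process id):

```shell
# Per-classloader statistics: classes loaded and metadata bytes used
jcmd <pid> VM.classloader_stats

# Histogram of loaded classes with instance counts and footprint
jcmd <pid> GC.class_histogram
```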
This is the memory region that stores compiled native code generated by the JIT compiler. It serves as a cache: frequently executed bytecode, referred to as "hot code", is compiled into native machine code to improve the performance of the Java application.
This area includes JIT-Compiled Code, Runtime Stubs, Interpreter Code, and Profiling Information.
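As a sketch using standard HotSpot options (not flags from the original setup), the code cache can be sized and inspected like this:

```shell
# Reserve 256 MiB for the JIT code cache and print its usage on JVM exit
java -XX:ReservedCodeCacheSize=256m -XX:+PrintCodeCache -jar app.jar

# Inspect the code cache of a running JVM (JDK 9+)
jcmd <pid> Compiler.codecache
```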
Each thread has its own stack memory. The purpose of this memory area is to store method-specific data (stack frames, local variables, and return addresses) for each thread.
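A minimal sketch (class and thread names are illustrative): the JVM-wide default stack size comes from the -Xss flag, and the Thread constructor additionally lets you request a per-thread stack size.

```java
public class StackSizeDemo {
    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> System.out.println("worker finished");
        // Request a 256 KiB stack for this thread only; this is a hint
        // the JVM may round up or ignore on some platforms.
        Thread worker = new Thread(null, task, "small-stack-worker", 256 * 1024);
        worker.start();
        worker.join();
    }
}
```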
Symbols are represented as shown in the figure below:
Off-heap memory bypasses the heap to allow faster, GC-free allocation. It is primarily used for efficient low-level I/O operations, mostly in applications with frequent data transfers.
There are two ways you can access off-heap memory:
Direct ByteBuffers can be allocated through ByteBuffer.allocateDirect(). The native memory backing a direct ByteBuffer is reclaimed when the buffer object itself is collected by the GC.
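A minimal sketch of a direct buffer (the class name is illustrative); the allocation lands outside the heap and counts against -XX:MaxDirectMemorySize:

```java
import java.nio.ByteBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        // Allocate 1 MiB of native (off-heap) memory
        ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 1024);
        buf.putInt(42);
        buf.flip();                          // prepare for reading
        System.out.println(buf.isDirect());  // confirms it is off-heap
        System.out.println(buf.getInt());    // reads back the value written
    }
}
```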
This is used to create a memory-mapped file, which allows direct memory access to file contents by mapping a region of a file into the memory of the Java process.
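A self-contained sketch using the standard FileChannel.map API (file name and class name are illustrative):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("mmap-demo", ".bin");
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "rw");
             FileChannel ch = raf.getChannel()) {
            // Map the first 4 KiB of the file into the process address space
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            map.putInt(0, 7);                  // write goes to the mapped region
            map.force();                       // flush changes to the file
            System.out.println(map.getInt(0)); // read back through the mapping
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```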
Native Memory Tracking (NMT) is a JVM feature for tracking the memory pools allocated by the JVM. Below is a sample output.
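Output like this can be produced with the standard JDK tooling (a sketch; `<pid>` stands in for the target process):

```shell
# Start the JVM with NMT enabled (summary or detail level)
java -XX:NativeMemoryTracking=summary -jar app.jar

# Take a baseline, then later report the difference between two snapshots
jcmd <pid> VM.native_memory baseline
jcmd <pid> VM.native_memory summary.diff
```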
In Java, an OOM error occurs when the JVM runs out of memory for object or data structure allocation. Below are the different types of OOM errors commonly seen in Java applications.
This happens when the heap memory is exhausted.
This happens when the allocated Metaspace is not sufficient to store class-related metadata. For more information on Metaspace, refer to the Java Memory Pool model above. From Java 8 onwards, Metaspace is allocated in native memory, not on the heap.
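Metaspace grows in native memory up to an optional cap; as a hedged example (the value is hypothetical), the cap is set with a standard HotSpot flag:

```shell
# Cap Metaspace at 256 MiB; exceeding the cap raises
# java.lang.OutOfMemoryError: Metaspace
java -XX:MaxMetaspaceSize=256m -jar app.jar
```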
This error happens when the JVM spends too much time on GC but reclaims too little space; by default, HotSpot raises it when more than about 98% of total time goes to garbage collection while less than 2% of the heap is recovered. It typically occurs when the heap is almost full and the garbage collector can no longer free much space.
This mostly happens when the application, JNI code, the JVM itself, or third-party libraries exhaust native memory. This error involves native memory, which is managed by the operating system and is used by the JVM for allocations other than the heap.
These are some of the OOM issues I faced at work, along with how I identified their root causes.
We had a streaming data processing application built on Apache Kafka and Apache Flink, deployed on containers managed by Kubernetes. The Java containers were periodically getting OOM-killed.
We started by analyzing a heap dump, but it revealed no clues, as heap usage was not growing. Next, we restarted the containers with NMT (Native Memory Tracking) enabled. NMT can report the difference between two snapshots, and it clearly showed that a sudden spike in the "Other" section (please check the sample output given in the Java Memory Pool model section) was causing the OOM kill. Further to this, we enabled additional native-memory diagnostics on this Java application, which helped us root out the problematic area.
This was a streaming application, and we had enabled the checkpointing feature of Flink, which periodically saves state to distributed storage. That data transfer is an I/O operation and requires byte buffers from native space, so in this case the native memory usage was legitimate. We reconfigured the application with the correct combination of heap and native memory, and things started running fine thereafter.
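As an illustrative sketch (the values are hypothetical, not the ones we actually used), rebalancing inside a fixed container limit means shrinking the heap to leave headroom for direct buffers and other native allocations:

```shell
# In a container with a 4 GiB memory limit, leave room for native memory:
# ~2.5 GiB heap + 1 GiB direct buffers + remainder for Metaspace, stacks, etc.
java -Xmx2560m -XX:MaxDirectMemorySize=1g -jar app.jar
```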
This was another streaming application whose container was getting killed with an OOM error. Since the same service was running fine in another deployment, the root cause was a little hard to identify. One main feature of this service is writing data to the underlying storage.
We started the Java application in both environments with the same diagnostics enabled. Both environments had the same ByteBuffer requirements. However, we noticed that in the environment where it was running fine, the ByteBuffers were getting cleaned up after GC. In the environment throwing OOM, the data flow was lower, and the GC count was far lower than in the other.
There is a YouTube video explaining this exact problem. We had two choices here: either enable explicit GC or reduce the heap size to force earlier GC cycles. For this specific problem, we chose the second approach, and that resolved it.
Once again, a streaming application with checkpointing enabled: whenever checkpointing was happening, the application crashed with java.lang.OutOfMemoryError: unable to create new native thread.
The issue was relatively simple to root cause. We took a thread dump and that revealed that there were close to 1000 threads in the application.
The operating system limits the maximum number of threads per process. This limit can be checked by using the command below.
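On Linux, for example, two standard checks are:

```shell
# Maximum processes/threads allowed for the current user
ulimit -u

# System-wide cap on the number of threads
cat /proc/sys/kernel/threads-max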
We decided to rewrite the application to reduce the total number of threads.
We had a data processing service with a very high input rate. Occasionally, the application would run out of heap memory.
To identify the issue, we decided to collect heap dumps periodically.
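Heap dumps can be collected with standard JDK tooling (a sketch; `<pid>` and the paths are placeholders):

```shell
# On-demand heap dump of a running JVM (live objects only)
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>

# Or have the JVM write a dump automatically when heap OOM occurs
java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -jar app.jar
```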
The heap dump revealed that the application logic that clears the Window (the streaming pipeline's window) collecting the data was not getting triggered because of a thread contention issue.
The fix was to correct the thread contention issue. After that, the application started running smoothly.
Out-of-memory errors in Java applications are very hard to debug, especially when they occur in the native memory space of a container. Understanding the Java memory model helps root-cause the problem to a certain extent.