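In this post, we are going to have a look at memory leaks in Go: what they are, why they are bad, what commonly causes them, and how to identify and investigate them.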
What is a memory leak
Let’s understand what a memory leak is. Quoting Wikipedia:
A memory leak is a type of resource leak that occurs when a computer program incorrectly manages memory allocations in a way that memory which is no longer needed is not released.
Go has a garbage collector that does a very good job of managing memory for us, tracking down memory that is no longer used so it can be returned to the system. Still, there are some cases where we can end up with either memory leaks or a system that needs excessive memory to work.
Why are memory leaks bad
Before looking at the cases that can make a Go program “waste” memory, let’s first discuss why memory leaks are bad and why we need to be mindful of them:
- System reliability and stability is affected. The system might behave unpredictably and crash from OOM (out of memory) errors.
- Inefficient resource utilization resulting in increased costs.
- Performance degradation. As memory leaks accumulate, the performance of the system may degrade, affecting responsiveness and efficiency. Memory leaks also put additional pressure on the garbage collector, which increases CPU usage.
- Difficult to track down and debug, especially for bigger systems with complex interactions.
Common causes for memory leaks in Go
In this section we will see some cases that can cause memory leaks. Some of those cases are specific to Go and others are more general.
Unbounded resource creation
Creating resources without a limit can be seen as a type of memory leak.
For example, if we have a cache that only ever grows, our service will eventually crash with an OOM (out of memory) error.
The solution is to bound the cache, for example by limiting how many items it can hold or by expiring entries with a TTL.
This applies to many resources, like goroutines (more about this later), HTTP connections or open files. We should always be mindful to put limits on resource creation.
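As a minimal sketch of a bounded cache, the hypothetical `boundedCache` type below caps the number of entries and evicts the oldest key once the cap is reached. The type, its method names and the FIFO eviction policy are assumptions made for illustration; a real service might prefer LRU eviction or a TTL.

```go
package main

import "fmt"

// boundedCache is a hypothetical cache that never holds more than
// len(keys) entries. maxEntries must be greater than zero.
type boundedCache struct {
	keys  []string          // ring buffer of keys in insertion order
	next  int               // slot in keys that will be overwritten next
	items map[string]string // cached values
}

func newBoundedCache(maxEntries int) *boundedCache {
	return &boundedCache{
		keys:  make([]string, maxEntries),
		items: make(map[string]string, maxEntries),
	}
}

// Set stores value under key, evicting the oldest key if the cache is full.
func (c *boundedCache) Set(key, value string) {
	if _, exists := c.items[key]; exists {
		c.items[key] = value // update in place, the size does not change
		return
	}
	if len(c.items) == len(c.keys) {
		delete(c.items, c.keys[c.next]) // full: drop the oldest key
	}
	c.keys[c.next] = key
	c.next = (c.next + 1) % len(c.keys)
	c.items[key] = value
}

// Get returns the cached value and whether it was present.
func (c *boundedCache) Get(key string) (string, bool) {
	v, ok := c.items[key]
	return v, ok
}

// Len reports how many entries are currently cached.
func (c *boundedCache) Len() int { return len(c.items) }

func main() {
	c := newBoundedCache(100)
	for i := 0; i < 1000; i++ {
		c.Set(fmt.Sprintf("key-%d", i), "value")
	}
	fmt.Println(c.Len()) // prints 100: the cache stopped growing at its cap
}
```

The important property is that memory use is tied to the configured cap rather than to how many distinct keys the service has ever seen.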
Something you should keep in mind regarding maps in Go: they don’t shrink after elements are deleted (see the Go issue “runtime: shrink map as elements are deleted”, #20135).
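One common workaround for the map behaviour described above is to copy the surviving entries into a freshly allocated map and drop the old one. The sketch below assumes that approach; the `compact` helper is a hypothetical name, not part of the standard library.

```go
package main

import "fmt"

// compact copies the live entries of m into a new map sized for them.
// Once the caller drops the old map, its oversized buckets can be collected.
func compact(m map[int]string) map[int]string {
	fresh := make(map[int]string, len(m))
	for k, v := range m {
		fresh[k] = v
	}
	return fresh
}

func main() {
	m := make(map[int]string)
	for i := 0; i < 1_000_000; i++ {
		m[i] = "value"
	}
	for i := 10; i < 1_000_000; i++ {
		delete(m, i) // entries are removed, but the buckets stay allocated
	}
	m = compact(m) // the old map is now unreachable and can be collected
	fmt.Println(len(m)) // prints 10
}
```

Rebuilding costs a full pass over the live entries, so it only makes sense after a large share of the map has been deleted.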
Long lived references
Keeping references to objects that your service no longer needs can result in memory leaks, as the garbage collector sees the references and cannot free the memory. Some ways you can keep references unintentionally are global variables, never-ending goroutines, maps, or pointers that are never reset.
There is a special case in Go for holding references unintentionally: reslicing a slice (this also applies to strings). Consider a function readDetails that opens a big file and returns only a portion of it, so we slice the data []byte and return it. In Go, slices share the same underlying memory block (Go Slices: usage and internals). That means that even if we are only interested in a very small subset of the data, we are still keeping the whole file referenced in memory.
The correct way here would be to call bytes.Clone(data[5:10]) so that the original data is no longer referenced and can subsequently be collected by the garbage collector.
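The difference can be sketched as follows. `readDetails`, `data` and the `[5:10]` range come from the text; the helper names `extractLeaky` and `extract`, the length check, and the in-memory stand-in for the file are assumptions for illustration. `bytes.Clone` requires Go 1.20 or later.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os"
)

// extractLeaky returns a sub-slice that shares data's backing array, so the
// whole of data stays reachable for as long as the result is in use.
func extractLeaky(data []byte) []byte {
	return data[5:10]
}

// extract copies the five bytes of interest into a new, small backing array.
func extract(data []byte) []byte {
	return bytes.Clone(data[5:10])
}

// readDetails reads the whole file but returns only a copy of bytes 5..9,
// so the file contents can be collected once readDetails returns.
func readDetails(name string) ([]byte, error) {
	data, err := os.ReadFile(name)
	if err != nil {
		return nil, err
	}
	if len(data) < 10 {
		return nil, errors.New("file is shorter than 10 bytes")
	}
	return extract(data), nil
}

func main() {
	data := make([]byte, 1<<20) // stands in for the contents of a big file
	copy(data[5:], "hello")

	leaky := extractLeaky(data)
	cloned := extract(data)

	// cap shows how much of the backing array each result can still reach.
	fmt.Println(string(leaky), cap(leaky))
	fmt.Println(string(cloned), cap(cloned))
}
```

Comparing the two capacities makes the retained memory visible: the resliced result still spans almost the entire buffer, while the clone owns only a few bytes.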
You can also read more information for Go Slices at
Goroutines
The Go runtime does a great job of spawning and managing goroutines (lightweight threads), but as mentioned in the ‘Unbounded resource creation’ section, there is realistically a limit to how many goroutines you can have at any time, bounded by the underlying system your service is running on.
Moreover, inside goroutines you can allocate or reference resources. So you need to make sure that your goroutines are properly terminated and their memory is eventually released.
Consider a function that creates a new goroutine every second to execute a task; each goroutine allocates a big data slice, does some processing and then hangs forever. This code has two problems:
- it creates an unbounded number of goroutines
- because those goroutines never terminate, the resources they allocate are never released
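One way to address both problems is to cap the number of concurrent goroutines and give each one a way to stop. The sketch below is an assumption about how that could look, not the original example: `task`, `runTasks` and `demo` are hypothetical names, the cap is enforced with a buffered channel used as a semaphore, and cancellation is signalled through a `context.Context`.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// task allocates a big slice, does some processing, and then waits.
// Instead of hanging forever it returns once ctx is cancelled, which
// makes the slice unreachable and therefore collectable.
func task(ctx context.Context) {
	data := make([]byte, 1<<20)
	for i := range data {
		data[i] = byte(i) // placeholder for real processing
	}
	<-ctx.Done()
}

// runTasks starts one task per tick until ctx is cancelled, never running
// more than maxConcurrent tasks at once. It waits for all tasks to finish
// and returns how many were started.
func runTasks(ctx context.Context, interval time.Duration, maxConcurrent int) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	sem := make(chan struct{}, maxConcurrent) // counting semaphore
	var wg sync.WaitGroup
	started := 0

	for {
		select {
		case <-ctx.Done():
			wg.Wait() // every task observes ctx, so this returns
			return started
		case <-ticker.C:
			select {
			case sem <- struct{}{}: // a slot is free
			default:
				continue // at capacity: skip this tick instead of piling up
			}
			started++
			wg.Add(1)
			go func() {
				defer wg.Done()
				defer func() { <-sem }() // release the slot
				task(ctx)
			}()
		}
	}
}

// demo runs the scheduler for a short, fixed period.
func demo() int {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	return runTasks(ctx, 10*time.Millisecond, 3)
}

func main() {
	fmt.Println("tasks started:", demo())
}
```

Skipping a tick when the limit is reached is only one possible policy; blocking until a slot frees up or queueing the work are equally valid choices depending on the service.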
Deferring function calls
Deferring a large number of function calls can also cause a type of memory leak. The most common mistake is calling defer inside a loop: deferred calls are pushed onto a stack and only executed, in LIFO order, when the enclosing function returns.
Suppose we are processing files in a for loop and deferring the .Close call for each one. The problem is that if we call processManyFiles with a lot of files, none of them is closed until we are done processing all of them.
The correct way to handle this case is to move opening and processing a file into a separate function, so that when it is called in the for loop each file will be closed before moving on to the next one.
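A minimal sketch of that split is shown here. `processManyFiles` is the name used in the text; `processFile` and the placeholder processing step (printing the file size) are assumptions for illustration.

```go
package main

import (
	"fmt"
	"os"
)

// processFile opens, processes and closes a single file. The deferred Close
// runs when processFile returns, before the caller opens the next file.
func processFile(name string) error {
	f, err := os.Open(name)
	if err != nil {
		return err
	}
	defer f.Close()

	// Placeholder for real processing: report the file size.
	info, err := f.Stat()
	if err != nil {
		return err
	}
	fmt.Println(info.Name(), info.Size())
	return nil
}

// processManyFiles handles the files one at a time, so at most one file
// is open at any moment regardless of how many names are passed in.
func processManyFiles(names []string) error {
	for _, name := range names {
		if err := processFile(name); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := processManyFiles(os.Args[1:]); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}
```

An anonymous function called inside the loop body would achieve the same scoping, but a named function tends to be easier to read and test.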
Not stopping time.Ticker
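Finally, another common cause of memory leaks is time.Ticker. As stated in the docs, before Go 1.23 a ticker's resources were only released when the ticker was stopped. As of Go 1.23 the garbage collector can recover unreferenced tickers even if they have not been stopped, but calling Stop once you are done with a ticker remains the safe habit.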
Correct way to handle a time.Ticker
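A minimal sketch of this pattern: stop the ticker with a deferred `Stop` right after creating it, and leave the loop when a context is cancelled. The `countTicks` and `demoTicks` functions are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// countTicks counts ticks until ctx is done. The deferred Stop releases
// the ticker's resources as soon as the function returns.
func countTicks(ctx context.Context, interval time.Duration) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	ticks := 0
	for {
		select {
		case <-ticker.C:
			ticks++ // placeholder for the periodic work
		case <-ctx.Done():
			return ticks
		}
	}
}

// demoTicks runs countTicks for a short, fixed period.
func demoTicks() int {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	return countTicks(ctx, 10*time.Millisecond)
}

func main() {
	fmt.Println("ticks:", demoTicks())
}
```

Pairing the ticker with a cancellation signal also guarantees that the goroutine running the loop terminates, which ties back to the goroutine section.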
Methods for identifying memory leaks
The most straightforward method to notice if your service is having a memory leak is to observe memory consumption over time.
In distributed systems, observability is the ability to collect data about programs’ execution, modules’ internal states, and the communication among components.
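If no external monitoring is in place yet, memory consumption can also be sampled from inside the process. The sketch below uses `runtime.ReadMemStats` from the standard library; the `heapAllocMiB` helper and the simulated leak in `main` are assumptions for illustration. In a real service the value would more likely be exported as a metric than printed.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapAllocMiB returns the size of allocated heap objects in MiB.
// ReadMemStats briefly stops the world, so it should not be called
// in a tight loop.
func heapAllocMiB() float64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return float64(m.HeapAlloc) / (1 << 20)
}

func main() {
	var retained [][]byte // simulates a leak: allocations are never dropped
	for i := 0; i < 5; i++ {
		retained = append(retained, make([]byte, 10<<20))
		fmt.Printf("sample %d: heap alloc = %.1f MiB (%d buffers retained)\n",
			i, heapAllocMiB(), len(retained))
	}
}
```

A value that keeps climbing across samples, without returning to a stable baseline, is the same signal the memory graphs described in this section are meant to reveal.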
There are some memory patterns that should make you at least suspicious that something is wrong if you notice them in your service.
In the first image, we are seeing a memory graph of our service per pod, and the memory is only growing. There are some big spikes in memory; the memory goes down afterwards but never back to the previous value. This is a good indication that something is kept in memory and never released. So we should investigate what exactly is using this memory and why.
In the second image, the pattern with the abrupt (cliff-like) reduction in memory appears when our service reaches the system’s memory limit and crashes with an out of memory error.
Keep in mind that in both cases noticing the pattern is not enough to conclude that your service has a memory leak. In the first image, your service may still be in its start phase, where it needs to allocate memory for initialisation or “warming caches”. In the second, there is no doubt this is an OOM error, but it may be that your service is simply undersized and does not have enough memory. In both cases you should look deeper to identify the causes.
Investigate memory leaks
So if you suspect a memory leak, where should you look? Where is the culprit? You could simply answer “take a look at your code”, and that would be a valid answer. But in a complex system with external factors and interactions it’s not that straightforward.
Profiling
Profiling can help. From the Go documentation:
Profiling tools analyze the complexity and costs of a Go program such as its memory usage and frequently called functions to identify the expensive sections of a Go program.
Profiling is useful for identifying expensive or frequently called sections of code. The Go runtime provides profiling data in the format expected by the pprof visualization tool. The profiling data can be collected during testing via go test or endpoints made available from the net/http/pprof package. Users need to collect the profiling data and use pprof tools to filter and visualize the top code paths.
The profile type we are interested in for investigating memory leaks is the heap.
Heap profile reports memory allocation samples; used to monitor current and historical memory usage, and to check for memory leaks.
Go has two ways for capturing profiles, either via tests or enabling profiling over HTTP.
go test -cpuprofile cpu.prof -memprofile mem.prof -bench .
and
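package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

func main() {
	...
	// A nil handler means the default mux, which now serves the profiles.
	log.Fatal(http.ListenAndServe(":6060", nil))
}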
We are not going to go into more detail on how to set all this up, as it would be a whole new post. You can read more at net/http/pprof.
It’s worth knowing that go tool pprof supports different types of visualisations for analysing the profiling data you have captured.
But there is a problem with this process: it’s too manual and not always straightforward. When do you capture the profiles? Do you know under what conditions the memory leak happens? Is it in a dev environment, or in production under a specific load?
Continuous profiling
With Continuous Profiling we can collect metrics and data from our Services while serving real traffic.
“Continuous profiling is the process of collecting application performance data in a production environment and making the data available to developers”
There are monitoring platforms like Datadog where you can onboard your service in very few steps, and which provide tools to slice and dice the submitted data to investigate memory leaks or various other performance issues. Personally, one of the best features is comparing a profile to a previous time period or even to a previous version.
The graphs are from Datadog.
The gopher art is from MariaLetta.