Catch Me If You Can — Memory Leaks

A retrospective on a memory leak

Elli engineers vs. memory leak (illus. by Jane Kim)

Introduction

Memory leaks are one of those things that, when they happen, can really throw you in at the deep end. Diagnosing them seems like a challenging task at first: it requires a deep dive into the tools and components your service relies on. This close-up examination not only deepens your understanding of your service landscape, but also gives you insight into how things run under the hood. Although daunting at first glance, memory leaks are essentially a blessing in disguise.

At Elli, we do our best to keep technical debt to a bare minimum. However, incidents still happen, and our approach is to learn from resolving such issues and to share the knowledge.

This article aims to do just that: we walk you through our approach to identifying a memory leak and share our learnings along the way.

Context

Before we dive into the investigation, we need some context on Elli's infrastructure and where the memory leak occurred in the first place.

Elli, among other things, is a Charging Point Operator. We are responsible for connecting charging stations (CSs) to our backend and controlling them via the OCPP protocol, so that our customers can charge their EVs at private or public stations. The CSs are connected to our systems via WebSockets. When it comes to authentication, we support connections via TLS or mutual TLS (mTLS). With TLS, a CS verifies our server certificate and ensures that it is connecting to an Elli backend. With mTLS, we additionally verify that the CS presents a client certificate issued by us.
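To make the distinction concrete, here is a minimal sketch of how the two modes differ on the server side in Node.js. The certificate paths and the CA bundle are hypothetical placeholders, not our production setup.

```javascript
const tls = require('tls');
const fs = require('fs');

// Hypothetical certificate paths, for illustration only.
const server = tls.createServer({
  key: fs.readFileSync('server-key.pem'),
  cert: fs.readFileSync('server-cert.pem'),

  // TLS only: the station verifies our server certificate.
  // For mTLS, we additionally request and verify a client
  // certificate issued by our own CA:
  requestCert: true,                    // ask the station for a certificate
  rejectUnauthorized: true,             // reject certificates we did not issue
  ca: [fs.readFileSync('elli-ca.pem')], // our internal CA bundle
});

server.on('secureConnection', (socket) => {
  // socket.authorized is true when the client certificate
  // chains up to one of the CAs listed above.
  console.log('station connected, mTLS verified:', socket.authorized);
});
```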

On the connectivity side, a server written in Node.js is responsible for handling the upgrade from HTTP to WebSockets and keeping the state of the connections. It is deployed in a Kubernetes cluster and managed by a Horizontal Pod Autoscaler (HPA). Ideally, the HPA follows the traffic load and scales the pods up or down accordingly.
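For a rough idea of what that upgrade step looks like, here is a minimal sketch using the popular ws package in its noServer mode; the real handler also performs authentication and connection bookkeeping.

```javascript
const http = require('http');
const { WebSocketServer } = require('ws');

const server = http.createServer();
const wss = new WebSocketServer({ noServer: true });

// Take over the HTTP Upgrade handshake ourselves, so the station
// can be vetted before the WebSocket connection is accepted.
server.on('upgrade', (request, socket, head) => {
  wss.handleUpgrade(request, socket, head, (ws) => {
    wss.emit('connection', ws, request);
  });
});

wss.on('connection', (ws) => {
  // From here on, OCPP messages flow over this WebSocket.
  ws.on('message', (data) => console.log('OCPP frame:', data.toString()));
});

server.listen(8080);
```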

We maintain tens of thousands of persistent, long-lived TCP connections from charging stations concurrently. This introduces complexity and differs significantly from typical RESTful services. Memory utilisation serves as a proxy metric for load, since it reflects the number of established connections and the application's logic is not compute-heavy. Our pods are long-lived, and scaling on memory led us to the observation that the number of pods was slowly increasing for a constant number of connections. Long story short, we had spotted a memory leak.

Impact assessment

When faced with any kind of production issue, the Elli engineering team immediately assesses the implications of this incident on our customers and the business. So upon discovering this memory leak, we made the following assessment:

The application leaks memory over a matter of days. This means that, without receiving any additional traffic, our infrastructure continues to grow.

When a pod cannot handle additional traffic, the Kubernetes readiness probe ensures it stops receiving new traffic while it keeps serving the established connections. A pod that would normally serve X connections could end up serving only a fraction of its capacity due to the leak, without causing any disruption on the customer side. This means we could readily absorb the impact by simply spinning up more pods.
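In practice, this behaviour can be achieved with a readiness endpoint that reports a pod as full once it approaches its connection capacity, so Kubernetes stops routing new connections to it while established sockets stay open. A simplified sketch; the threshold, the port, and the counter wiring are hypothetical:

```javascript
const http = require('http');

// Hypothetical capacity threshold. The counter would be
// incremented/decremented by the WebSocket handlers (omitted here).
const MAX_CONNECTIONS = 5000;
let activeConnections = 0;

http.createServer((req, res) => {
  if (req.url === '/readyz') {
    // A 503 tells the Kubernetes readiness probe to take this pod
    // out of the Service endpoints; established connections
    // are not affected.
    const ready = activeConnections < MAX_CONNECTIONS;
    res.writeHead(ready ? 200 : 503);
    res.end(ready ? 'ready' : 'at capacity');
    return;
  }
  res.writeHead(404);
  res.end();
}).listen(8081);
```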

The Investigation

Now for the actual technical deep-dive into the memory leak.

Here we explain the tools and methods we used to uncover the source behind the memory leak, what we expected to see from our experiment, and what we actually observed. We included links to the resources that we used in our investigation for your reference.

A quick primer to JS memory

Variables in JavaScript (and most other programming languages) are stored in two places: stack and heap. A stack is usually a continuous region of memory allocating local context for each executing function. Heap is a much larger region storing everything allocated dynamically. This separation is useful to make the execution safer against corruption (stack is more protected) and faster (no need for dynamic garbage collection of the stack frames, fast new frame allocation).
Only primitive types passed by value (Number, Boolean, references to objects) are stored on the stack. Everything else is allocated dynamically from the shared pool of memory called the heap. In JavaScript, you do not have to worry about deallocating objects inside the heap. The garbage collector frees them whenever no one is referencing them. Of course, creating a large number of objects takes a performance toll (someone needs to keep all the bookkeeping), plus causes memory fragmentation.

Source: https://glebbahmutov.com/blog/javascript-stack-size/
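A classic way this model goes wrong in a long-lived, connection-oriented service is a forgotten reference that keeps dead objects reachable. A purely illustrative sketch, not our code:

```javascript
// A registry of connected stations, keyed by station id.
const connections = new Map();

function onStationConnected(stationId, socket) {
  connections.set(stationId, socket);

  socket.on('close', () => {
    // If this cleanup is missing (or never fires), the Map keeps
    // a reference to the closed socket forever. The garbage
    // collector only frees unreachable objects, so every
    // disconnect would leak one socket plus everything it references.
    connections.delete(stationId);
  });
}
```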

Taking a heap snapshot from a production pod

Expectations

We collected regular heap snapshots of our application to look for an accumulation of objects over time. Given the nature of the application, which mostly holds WebSocket connections, we expected the number of TLSSocket objects to match the number of connections in the application. We hypothesised that when a station got disconnected, its socket object was somehow still referenced. Garbage collection only cleans up unreachable objects, so in this case the objects would be left intact.
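One way to capture such snapshots from a running pod is Node's built-in v8 module; a minimal sketch, assuming we can send a signal to the process. Note that writing the snapshot is synchronous and blocks the event loop (more on that caveat below).

```javascript
const v8 = require('v8');

// Write a snapshot on demand, e.g. triggered by a signal sent to
// the production pod. The write blocks the event loop, so use it
// sparingly on a loaded instance.
process.on('SIGUSR2', () => {
  const file = v8.writeHeapSnapshot(); // returns the generated file name
  console.log(`heap snapshot written to ${file}`);
});
```

The resulting .heapsnapshot file can then be loaded and compared in the Memory tab of Chrome DevTools.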

Results

Getting a heap dump from a pod at 90% memory utilisation produced a snapshot in the range of 100 MB. Each pod requests around 1.5 GB of RAM, so the heap accounted for less than 10% of the allocated memory. This looked suspicious…

Where was the rest of the memory allocated? Nonetheless, we continued the analysis. Taking three snapshots at intervals and observing the change in memory over time didn't reveal anything. We noticed neither an accumulation of objects nor any issues with garbage collection. The heap dump looked rather healthy.

Image 1: Taking heap snapshots via Chrome DevTools from a production pod. The number of TLSSocket objects aligns with the pod's current connections, contrary to the expected results.

The TLSSocket objects matched the state of the application. Going back to the first observation, the heap dump was an order of magnitude smaller than the memory utilisation. We thought: "This can't be right. We are looking in the wrong place. We need to take a step back."

In addition, we profiled the application via the Cloud Profiler offered by GCP. We were interested in seeing how objects are allocated over time and potentially identifying the memory leak.

Getting a heap dump blocks the main thread and can potentially kill the application; the profiler, by contrast, can be kept running in production with little overhead.

Cloud Profiler is a continuous profiling tool that is designed for applications running on Google Cloud. It’s a statistical, or sampling profiler with low overhead and is suitable for production environments.
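Enabling it in a Node.js service amounts to starting the agent before the rest of the application loads; the service name and version below are placeholders.

```javascript
// Start the Cloud Profiler agent first, so allocations made during
// application start-up are sampled as well.
require('@google-cloud/profiler').start({
  serviceContext: {
    service: 'cs-connectivity', // hypothetical service name
    version: '1.0.0',
  },
});
```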

Although the profiler contributed to our understanding of what lives in the heap, it still didn't give us any leads in the investigation. On the contrary, it pushed us away from the right direction.

Spoiler alert: the profiler, however, did provide us with quite valuable information during an incident in production where we identified and fixed an aggressive memory leak, but that’s a story for another time.

Memory usage statistics

We needed greater insight into memory usage, so we created dashboards for all the metrics that process.memoryUsage() has to offer (a minimal collection sketch follows the list below):

  • heapTotal and heapUsed refer to V8's memory usage.
  • external refers to the memory usage of C++ objects bound to JavaScript objects managed by V8.
  • rss, the Resident Set Size, is the amount of space occupied in the main memory device (that is, a subset of the total allocated memory) for the process, including all C++ and JavaScript objects and code.
  • arrayBuffers refers to memory allocated for ArrayBuffers and SharedArrayBuffers, including all Node.js Buffers. This is also included in the external value. When Node.js is used as an embedded library, this value may be 0 because allocations for ArrayBuffers may not be tracked in that case.
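Exporting these metrics can be as simple as sampling process.memoryUsage() on an interval; the interval and the plain console output below stand in for a real metrics client.

```javascript
// Periodically export process.memoryUsage() so the values can be
// scraped into dashboards.
setInterval(() => {
  const { rss, heapTotal, heapUsed, external, arrayBuffers } = process.memoryUsage();
  const toMb = (bytes) => (bytes / 1024 / 1024).toFixed(1);
  console.log(JSON.stringify({
    rss_mb: toMb(rss),
    heap_total_mb: toMb(heapTotal),
    heap_used_mb: toMb(heapUsed),
    external_mb: toMb(external),
    array_buffers_mb: toMb(arrayBuffers),
  }));
}, 60_000);
```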
Image 2: A visualisation of RSS content. There is no official up-to-date model of V8's memory as it changes quite frequently. This is our best effort to depict what lives under the RSS so we can have a clearer picture of the memory components that could be leaking. If you'd like to learn more about the garbage collector, we suggest https://v8.dev/blog/trash-talk. Thanks to @mlippautz for the clarification.

As we saw earlier, we were getting ~100 MB heap snapshots from a container with more than 1 GB of memory utilisation. Where is the rest of the memory allocated? Let's have a look.

Image 3: Memory utilisation per pod (95th percentile). It grows over time. Nothing new here, we are aware of the memory leak.
Image 4: The number of connections per pod over time (95th percentile); pods are handling fewer and fewer connections.
Image 5: Heap used memory (95th percentile). Heap is aligned with the size of the snapshots we collected and is stable over time.
Image 6: External memory (95th percentile): small in size and stable.
Image 7: Memory utilisation and Resident Set Size (RSS) (95th percentile). There is correlation — RSS is following the pattern.

What do we know so far? RSS is growing while the heap and external memory are stable, which points us towards the stack. That could mean a method that recurses and never returns, eventually causing a stack overflow. However, the stack can't grow to hundreds of MBs. At this point, we had already tested in a non-production environment with multiple thousands of stations but didn't get any results.

Memory allocator

While brainstorming, we considered memory fragmentation: memory is allocated and freed non-sequentially, leaving small gaps that can't be reused for new allocations. Our application is long-running and does a lot of allocating and freeing, so fragmentation is a valid concern in such cases. Extensive googling led us to a GitHub issue where other folks faced the same situation we did: the same memory leak pattern was observed, and it aligned with our hypothesis.

We decided to put a different memory allocator to the test and switched from musl to jemalloc. It yielded no meaningful improvement. At this point, we knew we needed to take a break. We had to rethink the approach entirely.

Could it be that the leak only appears on mTLS connections?

During our first tests, we had tried to reproduce the issue in a non-production environment with no luck. We ran load tests with thousands of stations, simulating different scenarios and connecting/disconnecting stations for days, but they produced no meaningful results. Still, we had a growing suspicion that there was something we had missed while running these tests.

We hadn't taken into account that our stations can connect via TLS or mTLS. Our first tests included TLS stations but not mTLS, and the reason is simple: we couldn't easily create mTLS stations and the respective client certificates. A recent incident motivated us to minimise the blast radius and split the application's responsibilities so that each deployment would handle TLS and mTLS traffic separately. Eureka! The memory leak appeared only on our mTLS pods, while on the TLS pods memory was stable.

Where do we go from here?

We decided that there were two options: (1) move on to our next suspects, namely a library that handles all the Public Key Infrastructure (PKI) tasks and a potential recursion somewhere in that code path, or (2) live with the leak until we reworked our service entirely.

During the memory leak investigation, many unforeseen topics related to the affected service were brought to our attention. Taking the memory leak and everything else we discovered into account, we decided to improve our service landscape and split the responsibilities of the service. The CS authentication and authorisation flow, among others, would be delegated to the new service, and we would use the right tools for handling PKI tasks.

Summary

Improving our scaling revealed a memory leak that could have gone unnoticed for an indefinite period of time. Prioritising the customer and assessing the impact of the leak was first and foremost. Only then were we able to set the pace of our investigation, since we realised that there was no customer impact. We started with the most obvious place to look when diagnosing a memory leak: the heap. Analysing the heap, however, showed us that we were looking in the wrong place. Further clues were needed, and Node's process API gave us exactly that. In the first results we got, the memory leak appeared in RSS. Finally, analysing all the information gathered, we suspected memory fragmentation.

Changing the memory allocator didn't improve the situation. Changing our approach and splitting the workload between TLS and mTLS, however, helped us narrow down the affected code path.

What were the final outcomes of our investigation?

Our plans to improve scalability, together with the need to address the memory leak, led us to split the service and write a new one that takes care of the CS connectivity flow separately from the other CS specifics.

Did we fix our memory leak?

Time will tell, but I would say the investigation gave us much more than a fix. The experience of probing into the leak helped us grow as developers and pushed our service towards a more resilient and scalable architecture.

Key points and learnings

  • Tough engineering challenges bring people together; we bounced ideas off engineers outside our team.
  • The investigation gave us the motivation to rethink the service, which led to a more scalable architecture.
  • If it aches, it requires your attention; don't ignore it.

Learn More

Follow us on Medium! (illus. by Jane Kim)

If you are interested in finding out more about how we work, please subscribe to the Elli Medium blog and visit our company’s website at elli.eco! See you next time!

About the author

Thanos Amoutzias is a Software Engineer; he develops Elli's Charging Station Management System and drives SRE topics. He is passionate about building reliable services and delivering impactful products. You can find him on LinkedIn and in the 🏔️.

Credits: Thanks to all my colleagues who reviewed and gave feedback on the article!


