Elli Einblicke

01.12.2022

Catch Me If You Can — Memory Leaks

A retrospective on a memory leak

Elli engineers vs. memory leak (illus. by Jane Kim)

Introduction

Memory leaks are one of those things that, when they happen, can really throw you in at the deep end. Diagnosing them seems like a challenging task at first. They require a deep dive into the tools and components your service relies on. This close-up examination not only deepens your understanding of your service landscape, but also gives an insight into how things run under the hood. Although daunting at first glance, memory leaks are essentially a blessing in disguise.

At Elli, we do our best to keep technical debt to a bare minimum. However, incidents still happen, and our approach is to learn from them and share that knowledge as we resolve them.

This article aims to do just that: we walk you through how we went about identifying a memory leak and share our learnings along the way.

Context

Before we dive into repairing the memory leak, we need some context on Elli’s infrastructure and where the memory leak occurred in the first place.

Elli, among other things, is a Charging Point Operator. We are responsible for connecting charging stations (CSs) to our backend and controlling them via the OCPP protocol, so that our customers can charge their EVs at private or public stations. The CSs are connected to our systems via WebSockets. When it comes to authentication, we support connections via TLS or mutual TLS (mTLS). With TLS, a CS verifies our server certificate to ensure that it is connecting to an Elli backend. With mTLS, we also verify that the CS has a client certificate issued by us.
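To make the difference between the two modes concrete in Node.js terms, here is a minimal sketch that builds the options object accepted by `tls.createServer()` or `https.createServer()`. The helper name `buildServerOptions` and its parameters are hypothetical and not taken from our service; `key`, `cert`, `ca`, `requestCert` and `rejectUnauthorized` are standard Node.js TLS options.

```javascript
'use strict';

// Hypothetical helper: returns options for tls.createServer() / https.createServer().
// key/cert: the server's private key and certificate (PEM strings or Buffers).
// clientCa: CA certificate(s) used to validate client certificates (mTLS only).
function buildServerOptions({ key, cert, clientCa, mutual = false }) {
  const options = { key, cert };
  if (mutual) {
    if (!clientCa) {
      throw new Error('mTLS requires a CA to verify client certificates');
    }
    options.requestCert = true;        // ask the client for a certificate
    options.rejectUnauthorized = true; // reject clients whose certificate does not verify
    options.ca = clientCa;             // trust anchor for client certificates
  }
  return options;
}
```

With `mutual: false` the server only presents its own certificate; with `mutual: true` the handshake also fails for any station that cannot present a certificate signed by the given CA.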

On the connectivity side, a server written in Node.js is responsible for handling the UPGRADE logic from HTTP to WebSockets and for keeping the state of the connections. It is deployed in a Kubernetes cluster and managed by a Horizontal Pod Autoscaler (HPA). Ideally, the HPA follows the traffic load and scales the pods up or down accordingly.

We maintain tens of thousands of persistent, long-lived TCP connections from the charging stations concurrently. This introduces complexity and differs significantly from typical RESTful services. Memory utilisation serves as a proxy metric for the load, since it reflects the number of established connections and the application's logic does not require much computation. Our pods are long-lived, and scaling on memory led us to observe that the number of pods was slowly increasing for a constant number of connections. Long story short, we spotted a memory leak.
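For readers unfamiliar with the UPGRADE step, the following is a simplified sketch using only Node.js built-ins. It shows the RFC 6455 handshake response and a small registry that tracks open sockets. `ConnectionRegistry`, `attachUpgradeHandler` and the use of the request URL as a station id are assumptions made for illustration; WebSocket frame parsing is omitted entirely, and in practice a WebSocket library would normally handle all of this.

```javascript
'use strict';
const crypto = require('crypto');

// Fixed GUID defined by RFC 6455 for the WebSocket opening handshake.
const WS_GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11';

// Computes the Sec-WebSocket-Accept value for a client's Sec-WebSocket-Key.
function computeAcceptKey(secWebSocketKey) {
  return crypto.createHash('sha1').update(secWebSocketKey + WS_GUID).digest('base64');
}

// Hypothetical in-memory registry keeping the state of open connections.
class ConnectionRegistry {
  constructor() {
    this.sockets = new Map();
  }

  add(stationId, socket) {
    this.sockets.set(stationId, socket);
    // Drop the entry when the socket closes so it does not stay referenced,
    // but only if a reconnect has not already replaced it.
    socket.once('close', () => {
      if (this.sockets.get(stationId) === socket) this.sockets.delete(stationId);
    });
  }

  get size() {
    return this.sockets.size;
  }
}

// Attaches an 'upgrade' handler to a Node.js http/https server.
function attachUpgradeHandler(server, registry) {
  server.on('upgrade', (req, socket) => {
    const key = req.headers['sec-websocket-key'];
    if (!key) {
      socket.destroy();
      return;
    }
    socket.write(
      'HTTP/1.1 101 Switching Protocols\r\n' +
        'Upgrade: websocket\r\n' +
        'Connection: Upgrade\r\n' +
        `Sec-WebSocket-Accept: ${computeAcceptKey(key)}\r\n\r\n`
    );
    registry.add(req.url, socket); // assumption: the station id is derived from the URL
  });
}
```

The registry size is the number the memory-based scaling is meant to track: every entry is a long-lived socket that holds buffers and TLS state for as long as the station stays connected.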

Impact assessment

When faced with any kind of production issue, the Elli engineering team immediately assesses the implications of this incident on our customers and the business. So upon discovering this memory leak, we made the following assessment:

The application leaks memory over a matter of days. This means that our infrastructure keeps growing even without receiving any additional traffic.

When a pod cannot handle additional traffic, thanks to the readiness probe of Kubernetes, it stops receiving additional traffic but keeps serving the established connections. A pod that would serve X connections could end up serving only a fraction of its capacity due to the leak, without causing any disruption on the customer side. This means that we can readily absorb the impact by simply spinning up more pods.

The Investigation

Now for the actual technical deep-dive into the memory leak.

Here we explain the tools and methods we used to uncover the source behind the memory leak, what we expected to see from our experiment, and what we actually observed. We included links to the resources that we used in our investigation for your reference.

A quick primer to JS memory

Variables in JavaScript (and most other programming languages) are stored in two places: stack and heap. A stack is usually a continuous region of memory allocating local context for each executing function. Heap is a much larger region storing everything allocated dynamically. This separation is useful to make the execution safer against corruption (stack is more protected) and faster (no need for dynamic garbage collection of the stack frames, fast new frame allocation).

Only primitive types passed by value (Number, Boolean, references to objects) are stored on the stack. Everything else is allocated dynamically from the shared pool of memory called the heap. In JavaScript, you do not have to worry about deallocating objects inside the heap. The garbage collector frees them whenever no one is referencing them. Of course, creating a large number of objects takes a performance toll (someone needs to keep all the bookkeeping), plus causes memory fragmentation.

Source: https://glebbahmutov.com/blog/javascript-stack-size/
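To illustrate that the stack is a small, bounded region, the sketch below recurses until V8 refuses to allocate another frame. The function name `measureMaxDepth` is made up for this example, and the depth reached depends on the Node.js version, the frame size and flags such as `--stack-size`, so no specific number should be expected.

```javascript
'use strict';

// Recurses until V8 throws, then reports how deep the call stack got.
function measureMaxDepth() {
  let depth = 0;
  function recurse() {
    depth += 1;
    recurse();
  }
  try {
    recurse();
  } catch (err) {
    // V8 signals stack exhaustion with a RangeError; rethrow anything else.
    if (!(err instanceof RangeError)) throw err;
  }
  return depth;
}

console.log(`call stack overflowed after ${measureMaxDepth()} frames`);
```

Runaway recursion fails fast with an exception like this one instead of quietly consuming memory for days.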

Taking a heap snapshot from a production pod | Heap Snapshots & Profiling

Expectations

We collected regular heap snapshots of our application to see an accumulation of objects over time. Due to the nature of the application, mostly holding WebSocket connections, we expected the TLSSocket objects to match the number of connections in the application. We hypothesised that when a station got disconnected, the object was somehow still referenced. Garbage collection works by cleaning up unreachable objects, so in this case, the objects would be left intact.

Results

A heap dump from a 90% utilised pod came out in the range of 100 MB. Each pod requests around 1.5 GB of RAM, so the heap accounted for less than 10% of the allocated memory. This looked suspicious…

Where was the rest of the memory allocated? Nonetheless, we continued the analysis. Taking three snapshots in intervals and observing the change in memory over time didn’t reveal anything. We didn’t notice an accumulation of objects nor were there any issues with garbage collection. The heap dump looked rather healthy.
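We took our snapshots through Chrome DevTools attached to the pod. As an alternative sketch, Node.js can also write a snapshot to disk itself via the built-in `v8.writeHeapSnapshot()`. The helper name `captureHeapSnapshot`, the file naming scheme and the choice of `SIGUSR2` as a trigger are assumptions for illustration only.

```javascript
'use strict';
const v8 = require('v8');
const os = require('os');
const path = require('path');

// Writes a .heapsnapshot file that can be loaded into the Chrome DevTools Memory tab.
// Generating a snapshot is synchronous: it blocks the event loop and needs a
// significant amount of extra memory while it runs.
function captureHeapSnapshot(dir = os.tmpdir()) {
  const file = path.join(dir, `heap-${process.pid}-${Date.now()}.heapsnapshot`);
  return v8.writeHeapSnapshot(file); // returns the name of the file that was written
}

// Assumption: SIGUSR2 is free to use as a trigger in this process.
process.on('SIGUSR2', () => {
  console.log(`heap snapshot written to ${captureHeapSnapshot()}`);
});
```

Sending `kill -USR2 <pid>` at intervals would produce a series of files that can be compared in the DevTools comparison view, which is the kind of before-and-after analysis described above.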

Image 1: Taking heap snapshots via Chrome DevTools from a production pod. Contrary to our hypothesis, the number of TLSSocket objects matches the pod's current number of connections.

The TLSSocket objects matched the state of the application. Going back to the first observation, the heap dump was an order of magnitude smaller than the memory utilisation. We thought: “This can’t be right. We are looking in the wrong place. We need to take a step back.”

In addition, we profiled the application via the Cloud Profiler offered by GCP. We were interested in seeing how objects are allocated with the passing of time and potentially identifying the memory leak.

Getting a heap dump blocks the main thread and can potentially kill the application. The profiler, by contrast, can be kept running in production with little overhead.

Cloud Profiler is a continuous profiling tool that is designed for applications running on Google Cloud. It’s a statistical (sampling) profiler with low overhead and is suitable for production environments.

Although the profiler improved our understanding of what occupies the heap, it still didn’t give us any leads for the investigation. On the contrary, it steered us away from the right direction.

Spoiler alert: the profiler, however, did provide us with quite valuable information during an incident in production where we identified and fixed an aggressive memory leak, but that’s a story for another time.

Memory usage statistics

We needed greater insights into memory usage. We created dashboards for all metrics that process.memoryUsage() had to offer.

The heapTotal and heapUsed refer to V8’s memory usage.

The external refers to the memory usage of C++ objects bound to JavaScript objects managed by V8.

The rss, Resident Set Size, is the amount of space occupied in the main memory device (that is, a subset of the total allocated memory) for the process, including all C++ and JavaScript objects and code.

The arrayBuffers refers to memory allocated for ArrayBuffers and SharedArrayBuffers, including all Node.js Buffers. This is also included in the external value. When Node.js is used as an embedded library, this value may be 0 because allocations for ArrayBuffers may not be tracked in that case.
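As a minimal sketch of how these values can be collected, the snippet below samples `process.memoryUsage()` and converts each field to MB. The function names, the `otherMb` field and the reporting interval are assumptions for illustration; a real setup would export the values as gauges to a metrics backend instead of logging them.

```javascript
'use strict';

const MB = 1024 * 1024;

// Takes one sample of process.memoryUsage() and converts every field to MB.
function sampleMemory() {
  const { rss, heapTotal, heapUsed, external, arrayBuffers } = process.memoryUsage();
  return {
    rssMb: rss / MB,
    heapTotalMb: heapTotal / MB,
    heapUsedMb: heapUsed / MB,
    externalMb: external / MB,
    arrayBuffersMb: arrayBuffers / MB,
    // Rough remainder: resident memory not explained by the V8 heap or external
    // memory (code, stacks, native allocations, allocator overhead).
    // This is an approximation, not an exact accounting.
    otherMb: (rss - heapTotal - external) / MB,
  };
}

// Reports a sample every `intervalMs` milliseconds.
function startMemoryReporter(intervalMs = 60000, report = console.log) {
  const timer = setInterval(() => report(sampleMemory()), intervalMs);
  timer.unref(); // do not keep the process alive just for reporting
  return timer;
}
```

Plotting `rssMb` next to `heapUsedMb` and `externalMb` over time is what makes a gap between resident memory and the V8-managed memory visible.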

Image 2: A visualisation of RSS content. There is no official up-to-date model of V8’s memory as it’s changing quite frequently. This is our best effort to depict what lives under the RSS so we can have a clearer picture of potential memory components that leak memory. If you would like to learn more about the garbage collector, we suggest https://v8.dev/blog/trash-talk. Thanks to @mlippautz for the clarification.

As we saw earlier, we were getting ~100 MB heap snapshots from a container with more than 1 GB of memory utilisation. Where was the rest of the memory allocated? Let’s have a look.

Image 3: Memory utilisation per pod (95th percentile). It grows over time. Nothing new here, we are aware of the memory leak.

Image 4: The number of connections per pod over time (95th percentile); pods are handling fewer and fewer connections.

Image 5: Heap used memory (95th percentile). Heap is aligned with the size of the snapshots we collected and is stable over time.

Image 6: External memory (95th percentile): small in size and stable.

Image 7: Memory utilisation and Resident Set Size (RSS) (95th percentile). There is a correlation: RSS follows the same pattern.

What do we know so far? RSS is growing while the heap and external memory are stable, which led us to consider the stack. This could mean a method that gets called and never returns, eventually leading to a stack overflow. However, the stack can’t be hundreds of MBs. At this point, we had already tested in a non-production environment with several thousand stations but had not been able to reproduce the leak.
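The comparison we did by eye on the dashboards can also be expressed numerically. The sketch below fits a least-squares line through each metric and reports its growth per hour. The function names are hypothetical and the sample values are synthetic, made up purely to illustrate the pattern of a growing RSS next to a flat heap.

```javascript
'use strict';

// Least-squares slope of ys over xs: how much y changes per unit of x.
function slope(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i += 1) {
    num += (xs[i] - meanX) * (ys[i] - meanY);
    den += (xs[i] - meanX) ** 2;
  }
  return den === 0 ? 0 : num / den;
}

// Returns the growth rate (MB per hour) of each metric in a series of samples.
// samples: [{ hours, rss, heapUsed, external }, ...]
function growthPerHour(samples) {
  const hours = samples.map((s) => s.hours);
  const rates = {};
  for (const metric of ['rss', 'heapUsed', 'external']) {
    rates[metric] = slope(hours, samples.map((s) => s[metric]));
  }
  return rates;
}

// Synthetic numbers in MB, not measurements from our pods.
const syntheticSamples = [
  { hours: 0, rss: 400, heapUsed: 100, external: 5 },
  { hours: 24, rss: 520, heapUsed: 101, external: 5 },
  { hours: 48, rss: 640, heapUsed: 99, external: 5 },
];
console.log(growthPerHour(syntheticSamples));
```

A clearly positive rate for `rss` combined with rates near zero for `heapUsed` and `external` points at memory that V8 does not account for.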

Memory allocator

While brainstorming, we considered memory fragmentation: chunks of memory are allocated non-sequentially leading to small chunks of memory that can’t be used for new allocations. In our case, the application is long-running and does a lot of allocation and freeing. Memory fragmentation is a valid concern in these cases. Extensive googling led us to a GitHub issue where the folks faced the same situation as us. The same pattern of the memory leak was observed, and it aligned with our hypothesis.
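A toy model may help to picture fragmentation. In the sketch below, memory is an array of equally sized slots; this is not how musl, jemalloc or V8 manage memory internally, and the names are made up for the example. It only shows how freed memory can be plentiful in total yet unusable for a larger request.

```javascript
'use strict';

// Toy model: memory is an array of fixed-size slots, true = in use.
// Returns the length of the longest run of adjacent free slots.
function largestFreeRun(slots) {
  let best = 0;
  let current = 0;
  for (const used of slots) {
    current = used ? 0 : current + 1;
    best = Math.max(best, current);
  }
  return best;
}

// Ten one-slot allocations fill the whole region.
const slots = new Array(10).fill(true);
// Every other allocation is then freed.
for (let i = 0; i < slots.length; i += 2) slots[i] = false;

const freeTotal = slots.filter((used) => !used).length;
console.log({ freeTotal, largestFreeRun: largestFreeRun(slots) });
// Half of the region is free, yet a request for two adjacent slots cannot be served.
```

In a real process, an allocator in this situation has to request more memory from the operating system, so RSS grows even though the amount of live data stays the same.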

We decided to put a different memory allocator to the test and switched from musl's allocator to jemalloc. The switch made no meaningful difference. At this point, we knew we needed to take a break. We had to rethink the approach entirely.

Could it be that the leak only appears on mTLS connections?

During our first tests, we tried to reproduce the issue in a non-production environment but had no luck. We ran load tests with thousands of stations simulating different scenarios, connecting/disconnecting stations for days, but they produced no meaningful results. However, we started to have a growing suspicion that there was something we missed while running these tests.

We didn’t take into account that our stations can connect via TLS or mTLS. Our first test included TLS stations, but not mTLS, and the reason for that is simple: we couldn’t easily create mTLS stations and the respective client certificates. A recent incident motivated us to minimise the blast radius and split the application’s responsibilities so that each deployment would handle TLS and mTLS traffic separately. Eureka! The memory leak appears only on our mTLS pods, while on TLS the memory is stable.

Where do we go from here?

We saw two options: (1) move on to our next suspects, namely a library that handles all the Public Key Infrastructure (PKI) tasks and a potential recursion somewhere in that code path, or (2) live with the leak until we rework our service entirely.

During the memory leak investigation, many unforeseen topics were brought to our attention that were related to the affected service. Taking into account the memory leak and everything else we discovered, we decided to improve our service landscape and split the responsibilities of the service. The CS authentication and authorisation flow, among others, would be delegated to the new service and we would use the right tools for handling PKI tasks.

Summary

Improving our scaling revealed a memory leak that could otherwise have gone unnoticed indefinitely. Prioritising the customer and assessing the impact of the leak came first. Once we realised that there was no customer impact, we were able to set the pace of our investigation. We started with the most obvious place to look when diagnosing a memory leak: the heap. Analysing the heap, however, showed us that we were looking in the wrong place. Further clues were needed, and the Node.js process API (process.memoryUsage()) gave us exactly that. The first results showed that the leak appeared in RSS. After analysing all the information gathered, we suspected memory fragmentation.

Changing the memory allocator didn’t improve the situation. Changing our approach and splitting the workload between TLS and mTLS, however, helped us narrow down the affected code path.

What were the final outcomes of our investigation?

Our plans to improve scalability, along with the need to address the memory leak, made us decide to split the service and write a new one that takes care of the CS connectivity flow separately from the other CS specifics.

Did we fix our memory leak?

Time will tell, but I would say the investigation was about much more than that. The experience of probing into the leak helped us grow as developers and helped our service adopt a more resilient and scalable architecture.

Key points and learnings

Tough engineering challenges bring people together; we played ping-pong with ideas with engineers outside of our team.
The investigation gave us the motivation to rethink the service, which led to a more scalable architecture.
If it aches, it requires your attention; don’t ignore it.

References

https://nodejs.org/en/docs/guides/diagnostics/memory/using-heap-snapshot/
Enable remote debugging to a pod via port forwarding: https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/
https://developers.google.com/cast/docs/debugging/remote_debugger

Learn More

Follow us on Medium! (illus. by Jane Kim)

If you are interested in finding out more about how we work, please subscribe to the Elli Medium blog and visit our company’s website at elli.eco! See you next time!

About the author

Thanos Amoutzias is a Software Engineer who develops Elli’s Charging Station Management System and drives SRE topics. He is passionate about building reliable services and delivering impactful products. You can find him on LinkedIn and in the 🏔️.

Credits: Thanks to all my colleagues who reviewed and gave feedback on the article!

Catch Me If You Can — Memory Leaks was originally published in Elli Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.
