Navigating the Evolution of Stream-Aligned Teams: Lessons from Our CI/CD Journey
When they fly like a bird and when they hit the unexpected window hard
Introduction and Scope
At Elli we aim for “stream-aligned teams”. Following the book “Team Topologies” by Manuel Pais and Matthew Skelton, we focus on optimising our workflow for feature delivery and minimising the number of handovers between teams. To reduce handovers, these teams have end-to-end ownership of their services and infrastructure. Ownership means freedom and independence but also a high cognitive load. To reduce this load and upskill teams in cross-team topics like infrastructure or continuous integration and delivery (CI/CD), we have various Communities of Practice (CoPs). We are proud of our CoPs and our culture of continuous improvement. This setup has served us well, but recently we hit some barriers we cannot push through with it. Facing these difficulties, we see the need for dedicated teams that also drive such improvements across the organisation.
In this blog post, I want to share the story of how we got here, what our CoPs have done well and where they fell short. To illustrate my points I will use one CoP as an example: the CI/CD CoP. We see similar issues in other CoPs as well.
Status Quo of CI/CD
Before we start: CI/CD at Elli is in a good state and it is consistently improving. Our services deploy multiple times per week or even per day, as we deploy every merged change. Deployments usually take less than 30 minutes (the median is around 25 minutes). The pie chart shows the 2023 survey results for the statement “I am happy with the state of CI/CD in my team”. In 2023, over 50% strongly agreed, an improvement of over 20% compared to 2022. But we noticed a couple of difficulties with our approach of relying solely on CoPs for topics like CI/CD. This blog post is about these difficulties and barriers, which we face on our way to turning the pie chart fully green.
The early achievements and the following decline
We have around 30% of our working time dedicated to technical and operational excellence (TOX) in software development. Most of this budget is spent on TOX work in each of our feature teams. But it also includes the time we spend in our CoPs’ one-hour weekly meetings and bi-weekly four-hour workshops. Reserving a time budget aside from feature delivery work has brought many benefits across the entire organisation: we went from triggering releases with tags once or twice per story to releasing every merged change. The median deployment frequency is now about 3–4 times per week, with a maximum of 20 for our main frontend. The median run time for our releases is around 25 minutes, including deployments to several environments, end-to-end tests and so on.
Such improvements were either driven by some team with a need for it and then spread through the CoP or driven by the CoP itself. Workshops are especially effective for propagating best practices such as these.
However, the participation in these workshops, relative to the number of engineers working at Elli, has steadily decreased over the years.
To understand why participation is declining, we ran a survey. The main reasons people gave for not joining were:
- Feature pressure in the team
- Sprint rituals and many other meetings collide with CoP meetings
- Engineers prefer working on features
- Topics are not relevant or exciting
The first two points are issues that came with our scale-up. Engineering management needs to emphasise the priority and importance of the TOX budget repeatedly to our engineers to prevent it from falling by the wayside. Regarding scheduling, CoP meetings are visible to everyone in our engineering department-wide calendar. Product Owners and engineers should therefore be aware of them and not schedule any conflicting meetings. Lack of prioritisation and commitment ultimately caused our CoP rituals to fall through.
But now to the really shocking part: topics are not relevant, and engineers would rather do more feature work. This is clearly on the CoPs and must be improved.
Out of the rabbit hole and addressing the needs of engineers
Every year, the CI/CD CoP, led by our senior infrastructure engineer Madhusudan, conducts a survey on the state of DevOps here at Elli. After the 2022 survey, they decided to address these shortcomings with a couple of initiatives:
- Pipeline template repository: A dedicated repository with templates for the initial pipeline setup and continuous improvements.
- Upskilling workshop: An Elli Engineering Academy workshop on our general pipeline setup and fundamentals of Azure Pipelines.
- “Bring your own pipeline” (BYOP) workshops: Teams can look into and improve their pipelines together with experts from the CI/CD CoP.
- Data analysis: Collect data from Elli’s Azure Pipelines in BigQuery for more in-depth analysis of our software delivery processes.
Let’s examine some key insights from the 2022 survey and compare them to the new 2023 results:
- Complexity of pipelines: In 2022, 35% of participants thought our pipelines were complex. This went down to 20% in 2023. The pipeline templates definitely help here, as do the BYOP workshops.
- Lack of understanding of pipelines: In 2022, 45% of participants were not confident in modifying the pipelines. This decreased to 15% in 2023. The upskilling workshops surely did not hurt.
- Flaky pipelines: When PR pipelines fail, they fail for the wrong reasons about 60% of the time, according to both the 2022 and the 2023 survey.
- Slow pipelines: In 2022, 70% thought our pipelines were slow. In 2023, 50% still have the need for speed. 60% of participants in 2022 actively waited for the pull request (PR) pipeline to complete. This dropped to 23% in 2023.
We addressed the first two issues rather well with the measures taken. Regarding workshop participation, we clearly see that the upskilling and BYOP workshops have far higher participation than other workshops (roughly double). However, the Elli Engineering Academy workshops in particular require a lot of time to prepare and facilitate. It is unrealistic to hold such lengthy workshops every week for each of the different engineering topics. So far this year we do not see an improvement in overall participation.
Flaky tests and slow pipelines, however, are more difficult. Let’s look into that in more detail, starting with the drop in engineers waiting for the PR pipeline. We started collecting pipeline data only 3 months after the 2022 survey. For the period we have data for, the median runtime of our PR pipelines is basically constant at about 15 minutes. So either the drop from 60% to 23% is due to improvements made before we collected data (I doubt it), or people adapted to the wait time and do other things in parallel, which is what I often hear when talking to other engineers.
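To make the “basically constant” claim concrete, here is a minimal sketch of how a median runtime per month could be computed from collected run records. It is only an illustration: the `RUNS` list, its values and the function name `median_runtime_by_month` are made up for this example and do not reflect our actual BigQuery schema or data.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical PR pipeline runs: (start time, duration in minutes).
# The values are invented for illustration only.
RUNS = [
    (datetime(2023, 1, 9, 10, 0), 14.0),
    (datetime(2023, 1, 17, 15, 30), 16.5),
    (datetime(2023, 1, 25, 9, 15), 15.0),
    (datetime(2023, 2, 3, 11, 45), 13.5),
    (datetime(2023, 2, 14, 16, 20), 15.5),
    (datetime(2023, 2, 27, 8, 5), 17.0),
]


def median_runtime_by_month(runs):
    """Group run durations by (year, month) and return the median of each group."""
    by_month = defaultdict(list)
    for started_at, minutes in runs:
        by_month[(started_at.year, started_at.month)].append(minutes)
    # Sort the keys so the months come out in chronological order.
    return {month: median(durations) for month, durations in sorted(by_month.items())}


if __name__ == "__main__":
    for (year, month), value in median_runtime_by_month(RUNS).items():
        print(f"{year}-{month:02d}: median {value:.1f} min")
```

A flat series of monthly medians is what “constant over time” looks like in such data. The median is used rather than the mean so that a few unusually long runs do not distort the picture.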
This is something we actually want to avoid. Starting a new task that will occupy you for the next 4 hours and increases the amount of work in the delivery system, all because of a 15-minute wait, is not desirable. So this decrease in people waiting may look better than it actually is. However, sitting around for 15 minutes is not great either. And getting the runtime down to 10–12 minutes does not help. You still wait. You still jump to another task. We need less than 5 minutes, the length of a short bathroom break, and getting the pipeline this fast is hard.
Making the pipelines and tests more robust is also difficult. We have a failure rate of roughly 20% on the PR pipeline, which is also quite constant over time. The survey hints that these 20% are made up of 12% unnecessary failures and only 8% meaningful failures. This shows that teams see this pain point but struggle to improve on it significantly with their regular TOX work. For these hard topics, the small CoP time budget is not enough. What we need in these cases is a full-time effort. For example, an existing team could pick up the topic for one of their TOX sprints. Another option would be to form a task force or even an enabling team that has full-time capacity temporarily, without moving ownership permanently away from CoPs and stream-aligned teams.
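The 12% and 8% figures follow from combining the measured failure rate with the survey answer. The sketch below shows the arithmetic, assuming the 60% from the survey is the share of failed PR pipeline runs that fail for the wrong reasons. The function name `split_failure_rate` is hypothetical and only serves this example.

```python
def split_failure_rate(failure_rate, unnecessary_share):
    """Split an overall failure rate into unnecessary and meaningful parts.

    failure_rate: fraction of all PR pipeline runs that fail (0..1).
    unnecessary_share: fraction of those failures that happen for the
        wrong reasons (0..1); assumed here to be the survey's 60%.
    """
    if not 0 <= failure_rate <= 1 or not 0 <= unnecessary_share <= 1:
        raise ValueError("both inputs must be fractions between 0 and 1")
    unnecessary = failure_rate * unnecessary_share
    meaningful = failure_rate - unnecessary
    return unnecessary, meaningful


if __name__ == "__main__":
    unnecessary, meaningful = split_failure_rate(0.20, 0.60)
    # prints "unnecessary: 12%, meaningful: 8%"
    print(f"unnecessary: {unnecessary:.0%}, meaningful: {meaningful:.0%}")
```

Put differently, under this assumption about 12 out of every 100 PR pipeline runs fail without pointing to a real problem in the change under review.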
We saw early on at Elli that CoPs can drive great improvements across the engineering teams. Especially in our company’s earlier stages, our CoPs tackled realistically achievable and time-bound topics with high impact. But at some point you need to shift gears when an existing way of working does not achieve the desired results. Our CoP approach struggled to generate the desired impact and lost support among engineers. Hence we need to challenge the status quo, be agile and look for better solutions.
Setting goals that are achievable with the CoP capacity, as well as focusing on upskilling and knowledge sharing across teams, shows promising participation levels and improvements. At the same time we should not buckle when continuous improvement gets tough. Instead, we can seek alternative data-driven solutions. Surveys and having honest conversations with our fellow engineers helped us see the need for a proper full-time effort to dig deep down into rabbit holes. We see different possibilities here to address the issue. In the end it is important that we push forward on these topics in collaboration with CoPs and stream-aligned teams and keep the end-to-end ownership in the teams, while supporting them on their improvement efforts.
At Elli, we are always looking to improve the way we work together as a team. If you are interested in finding out more about how we work, please subscribe to the Elli Medium blog and visit our company’s website at elli.eco! See you next time!
About the author
Matthias Förth is a Product Owner and former Software Engineer focused on backend development. His current interests are the streamlining of development and pushing data-driven decision making at Elli.
Navigating the Evolution of Stream-Aligned Teams: Lessons from Our CI/CD Journey was originally published in Elli Engineering on Medium.