Infrastructure Platform Engineering: Organizational Structures and Models

Introduction

(I’d like to thank Jordyn Bonds, Nick Tittley and Brian Bossé for contributing to the discussion leading to this article)

The common belief is that reorganizing infrastructure teams will inherently improve efficiency and reduce costs. However, our key insight is that altering organizational boundaries does not fundamentally change the work required for cloud systems. It does impact the distribution of roles and responsibilities, and how teams interact and prioritize, but not the tasks involved in building and running cloud systems.

Reorganization shifts where the work takes place, not what that work fundamentally demands. Smaller companies may consolidate teams seeking economies of scale. Larger organizations decentralize to limit accumulated complexity. But the work remains the same.

(The 4 canonical org models discussed below. Color coding represents the type of work/responsibility; location shows which part of the org is responsible.)

Over the past two years we conducted hundreds of interviews with engineering leaders, CTOs, CEOs, heads of infrastructure and platform, DevOps organizations, teams and infrastructure engineers, and we’ve found recurring organizational models and tradeoffs across organizations. We’ll describe those models and their tradeoffs, and what we think can be a new version of a model that leverages cloud intelligence to maintain more of the benefits with fewer tradeoffs.

Why Re-Org

Smaller companies often look to consolidate teams to find engineering efficiency gains. The goal of consolidation is usually to bring duplicate infrastructure/cloud-related efforts from multiple product orgs working in silos into centralized teams that work across the organization.

For example, instead of each team building and maintaining their deployment tooling, automation and systems operations, a central team can solve those problems once and then reuse those solutions for multiple orgs. Efficiency gains aren’t only in the engineering capacity needed to build and operate those systems, but also in optimizing the re-use of the infra resources like shared clusters for multiple organizations. Because the central infrastructure team has visibility into all product orgs, they have the ability to optimize within and across resources. This potentially saves dollars and engineering operational cost and is referred to as economies of scale.
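To make the “solve once, reuse everywhere” idea concrete, here is a minimal Python sketch of a centrally owned deployment template that product orgs parameterize rather than rebuild (all names and defaults here are hypothetical illustrations, not any specific company’s tooling):

```python
# Hypothetical sketch: a central team owns one deployment template;
# product orgs supply only the parameters that differ.

from dataclasses import dataclass

@dataclass
class ServiceSpec:
    org: str           # owning product org
    name: str          # service name
    replicas: int = 2  # sensible default chosen centrally

def render_deployment(spec: ServiceSpec) -> dict:
    """Build a deployment config from the shared template.

    The central team maintains this function once; every product
    org reuses it instead of writing its own deployment logic.
    """
    return {
        "service": f"{spec.org}-{spec.name}",
        "replicas": spec.replicas,
        "cluster": "shared-cluster",  # shared infra resource across orgs
        "monitoring": True,           # org-wide standard, applied by default
    }

# Two orgs, one template -- no duplicated deployment logic,
# and both land on the same shared cluster.
payments = render_deployment(ServiceSpec(org="payments", name="ledger"))
search = render_deployment(ServiceSpec(org="search", name="indexer", replicas=4))
```

The shared-cluster default is where the resource-level economies of scale show up: because both orgs deploy through the same path, the central team can see and optimize utilization across them.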

Another, more psychological reason to reorganize infrastructure initiatives is to reduce application developers’ cognitive load. This is done by shifting non-application responsibilities to infrastructure teams and orgs. Product orgs can then treat the infra org as a black-box service provider to request solutions from. The degree of cognitive load reduction depends on the organizational model and the division of roles and responsibilities.

Larger companies have different reasons to re-org. Organizationally, product org leaders prefer to externalize non-core product responsibilities by giving them to central orgs like the infra org. This gives them one less thing to deal with outside of their domain of expertise. They can now set expectations (and blame) on the infrastructure org instead of dealing with it within their organization.

Companies with infra-platform orgs eventually consider decomposing them. The goal is to reduce the complexity of maintaining the custom and convoluted internal abstractions built up over time. They do this by re-tasking product orgs with more platform responsibilities, which causes those orgs to outsource solutions to native cloud providers.

Driven by these consolidation and decomposition forces, infrastructure orgs tend to evolve into a few canonical forms. From our research engaging with dozens of companies, these 4 models represent the primary organizational structures adopted.

4 Common Infrastructure Re-Org Models

Based on the many interviews we’ve had over the past 2 years, these are the 4 common organizational models that we’ve found. In many companies these evolved from one to the other in this order, as explained in The Evolution of Cloud, Infrastructure, and Platform Engineering Organizations blog post.

Product Orgs with Embedded Infra Teams

Most teams and companies start by having each team tackle its own problems separately: each product org takes on the entire set of cloud and infra responsibilities, which works great for moving fast at the cost of fully duplicated work:

This highly decentralized approach embeds infrastructure engineers directly within autonomous product teams. It maximizes flexibility and agility for product orgs to make infrastructure decisions tailored to their specific needs. Technology choices, provisioning, operations, and standards are all owned by each product team. There is complete duplication of work as each team builds out its own independent infrastructure. While this avoids bureaucracy, the duplication incurs substantially higher costs than leveraging centralized infrastructure. It also requires heavy coordination between product orgs for interoperability. This model suits early startups, but companies often transition to more centralized platforms as duplication and complexity mount over time.

Shared Cloud Engineering Team and Product Orgs

The shared engineering team structure consolidates infrastructure engineers from disparate product teams into a central group:

This centralized group targets economies of scale and reduced duplication for aspects like cluster management, network configurations, and IaC templates. The shared team provides infrastructure building blocks and support to the product teams. But the product orgs maintain control over their own application operations and some provisioning needs. Rather than dictate standards, the shared team offers guardrails and guidance. This balances infrastructure consistency with product autonomy. There is less duplication than fully decentralized models but more than a consolidated platform org.

Infrastructure Org and Product Orgs

In this model, an infrastructure organization caters to common infrastructure needs across product groups and provides core services and some degree of standardization:

But product teams retain autonomy over their own application operations and provisioning requirements. There is flexibility in technology choices, with the infrastructure org aiming to curate options without completely restricting product teams. Alignment between infrastructure and product groups is essential for smooth workflows. Duplication of work can occur across product orgs, but is reduced compared to fully decentralized approaches. This structure attempts to balance standardization with product team ownership.

Infrastructure-Platform Org and Product Orgs

The Infrastructure-Platform Org model represents a highly centralized approach with a dedicated platform team owning all core infrastructure and standards:

This singular group defines approved technologies, handles infrastructure provisioning, manages underlying services, and creates custom SDKs and abstractions for product teams. The goals are consistency across the organization and reducing duplication of work and cognitive load on product groups. However, the tight centralization risks the platform org drifting from actual product needs over time. And the accumulation of complex custom systems with convoluted internal abstractions can hinder agility.

Why Re-Orgs Fail

Based on our research, centralizing infrastructure efforts and decomposing them fail for two different sets of reasons.

Centralizing into Infrastructure-Platform Orgs

Multiple engineering leaders mentioned that dogma is aspirational but not practical, and the two must be balanced appropriately. They highlighted how infrastructure experts’ dogmatic viewpoints created frustrating barriers for application teams.

In one example, a services team ran into a Java memory leak. Previously they would SSH into VMs and attach standard tools to debug the application, but this approach no longer worked given containers’ short lifespans, which prevented developers from grabbing heap dumps to analyze memory usage trends. When the team requested SSH access to attach standard debugging tools to running containers, the experts refused because it was considered an anti-pattern. This refusal stemmed from the “cattle, not pets” philosophy, which treats infrastructure as disposable commodities rather than units that can be singled out for special treatment. The experts’ rigid viewpoint prevented pragmatic solutions like temporarily accessing containers to collect diagnostics, and forced the engineers to figure out alternatives, making them miserable and unproductive.

If you can’t change your customer, you must adapt to them.

Other teams called out too many paper cuts in using the internal platform, but the infra-platform team did not understand why their meticulously detailed guides were not sufficient for adoption, believing instead that engineers were resisting change and not following the guides correctly.

At one company, there were growing frustrations between the application team and the infrastructure team. While the infrastructure team firmly believed their detailed onboarding tutorials were easy enough for developers to be productive, the application teams struggled. After months of back-and-forth, the infrastructure team agreed to observe the onboarding process directly via a Google Meet session. What followed was an unexpectedly painful 6-hour tutorial that highlighted the very real issues the application team faced; for example, basic tasks like provisioning a single secret key took nearly 45 minutes to complete.

Rather than proving the developers were at fault, it became a sobering reflection of the infrastructure team’s limited empathy. This moment helped both teams set new expectations to address the underlying usability challenges. One could argue that the application engineers didn’t have the right skillset, but that’s where the centralization mindset shift has to occur: if the platform team can’t solve the application engineers’ problems with the skillsets those engineers mostly have, then they’ve made the wrong design decisions for the platform.

Making the customer happy isn’t the same as solving the customer’s problem.

Infrastructure teams themselves reported feeling that centralized team leaders often did not push back on tech focused users demanding specific solutions. 

At one company, the infrastructure team invested heavily in abstracting its aging Linkerd-based service mesh so it could be replaced with the new hotness, Istio. However, after many months of development, adopting Istio delivered only minor reliability and simplicity improvements that didn’t meet business goals. The significant engineering effort to move to Istio was not justified, and was taken on because leads were not accustomed to pushing back and negotiating with their users. Had the infrastructure team been better equipped, they would have been able to get to the need behind the request instead of feeling limited to performing the exact remedy the customer demanded.

Less mature infrastructure teams also faced issues with application engineers going directly to infrastructure people for help instead of using the official process. While not efficient, this allowed application teams to skip the formal way and get what they needed from infrastructure, especially when the official process was frustrating and slow.

For the application team, going directly to the infrastructure folks was often the easiest way to meet their goals when official channels, like prioritization meetings or extrapolating from documentation, were too time-consuming.

In some cases, these backchannel requests, which often were in private Slack messages, consumed over 30% of the infrastructure team’s time, making it hard to finish planned work. More application teams going directly to infrastructure reinforced this behavior since teams saw it could work.

While going directly can help get important things done despite bureaucracy, heavy use of unofficial requests can hurt trust in the process over time. It also duplicates effort to understand what teams need. 

Heavy use of backchannels is a signal that your leadership structure is failing. Being draconian about it isn’t a solution, fixing leadership is.

Decomposing Infrastructure-Platform Orgs

Decomposition reorgs often disrupt team dynamics, leading to the fragmentation of previously unified workflows. Effective central infrastructure-platform orgs have established mechanisms for cross-organizational prioritization, proactive planning, and a clear understanding of the purpose and scope of central teams compared to product teams. When the org is decomposed, those systems break and must be re-created wherever the org’s responsibilities land, and all users who relied on them have to re-learn how to interact with multiple orgs, both within their product group and outside of it.

When larger organizations break up their infrastructure teams to reduce the perceived accumulated complexity, engineering leadership often overlooks the fact that the total cost of ownership may not significantly decrease. The inherent complexity doesn’t disappear; it merely shifts onto the product organizations. As product features become more interconnected, the costs of engineering alignment and communication grow quadratically with the number of teams (N²), compared to the linear costs facilitated by a central infrastructure team driving consistent standards and abstractions across different product organizations.
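The quadratic-versus-linear claim is easy to make concrete: with N teams aligning directly with each other, there are N(N-1)/2 pairwise channels, whereas a central team acting as a hub needs only one channel per team. A quick sketch:

```python
def pairwise_channels(n_teams: int) -> int:
    """Direct team-to-team alignment: every pair of teams needs its own channel."""
    return n_teams * (n_teams - 1) // 2

def hub_channels(n_teams: int) -> int:
    """A central infrastructure team as hub: one channel per product team."""
    return n_teams

for n in (4, 8, 16, 32):
    print(n, pairwise_channels(n), hub_channels(n))
# At 8 teams the gap is 28 vs 8 channels; at 32 teams it is 496 vs 32.
```

The absolute numbers matter less than the growth rate: every team added to a decomposed structure makes alignment disproportionately more expensive.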

When infrastructure-platforms are decomposed, key personnel are likely to depart as the systems and abstractions they’ve built are slowly put out to pasture, resulting in knowledge gaps. Subtle constraints and design rationales may be forgotten, leading new owners to re-implement currently effective systems and repeat past mistakes due to limited historical awareness of how those choices played out. This problem is further exacerbated when leadership changes, as the lessons learned from organizational evolutions have not been documented or institutionalized, and organizational repetitions have a much larger blast radius that’s hard to unwind.

One of the most commonly underestimated pitfalls is the cost of transitioning. Each time a company transfers responsibilities from one organization to another, there is a significant and costly transitional period that extends the total cost of ownership for many months or even years. These costs can be financial, such as needing to hire for the expertise being moved into an org or implementing new systems that were being offered centrally, as well as non-financial, such as decreased productivity during the transition period as engineers figure out what the new way of doing things should look like and how to account for the increased scope.

Organizational decisions are often influenced by a leader’s background. We have observed a pattern where newly hired engineering leaders who previously worked in companies with infrastructure/platform organizations tend to replicate similar structures without thoroughly analyzing their alignment with the current internal constraints of the companies they’ve joined. Structures that worked in the past may no longer be suitable due to changes in technology, business objectives, or engineering capabilities. Attempting to implement an infra-platform org like the ones that exist at companies like Meta, Google or Amazon will not fit companies like Etsy, Qualtrics and DocuSign. Failing to grasp this context leads to the repetition of previously attempted cycles, risking the recurrence of past mistakes at significant multi-year costs.

Tradeoffs and When to Use What

Product Orgs with Embedded Infra Teams

This model is especially suitable for early startups or smaller organizations where speed and flexibility are paramount. With infrastructure engineers embedded directly within product teams, organizations can benefit from quick decision-making and tailored infrastructure solutions to meet specific product needs. This high degree of autonomy allows for dynamic customization and agility. However, the tradeoff for this flexibility is a higher duplication of work and cost, as each team builds and maintains their own independent infrastructure. Coordination between product orgs for interoperability can also be a challenge.

Pros:

  • Infrastructure solutions fit the specific needs of each product org rather than being generalized
  • Engineers can leverage their existing skills and familiar tools rather than having to learn new mandated tools
  • Avoids abstraction layers that add complexity on top of underlying tools
  • Flexibility to choose whatever tools work best for their needs instead of waiting for central approval

Cons:

  • Duplicated engineering effort as each org builds and maintains infrastructure independently
  • Interoperability is costly due to different technology choices and practices across orgs without central coordination
  • Infrastructure resources tend to be underutilized, increasing total company costs
  • Engineers take time ramping up when switching teams due to different tools/processes
  • Autonomy allows teams to churn infrastructure tech even if not highly beneficial
  • Collaboration across teams requires aligning both technical stacks and engineering cultures

Shared Cloud Engineering Team and Product Orgs

The Shared Cloud Engineering Team model is beneficial for organizations looking to balance infrastructure consistency with product autonomy. By consolidating infrastructure engineers into a central team, organizations can achieve economies of scale, reduce duplication of work, and provide shared resources and guidance to product teams. However, product teams retain control over their application operations and some provisioning needs, allowing for some level of flexibility. The tradeoff is that there is less duplication than fully decentralized models, but more than a consolidated platform org.

Pros:

  • Reduces duplicated engineering effort by doing the work once for most orgs
  • Lowers risks from team member turnover by retaining institutional knowledge within the central team
  • Easier to disseminate best practices across multiple orgs

Cons:

  • Prioritization tends to over-index on urgent fire-fighting needs rather than important ones
  • Lacks organizational structure to support having strong points of view on infrastructure
  • Operates as an order taker rather than guiding infrastructure strategy
  • Hard to enforce guardrails and consistency of cloud resources
  • Product orgs still have the majority of operational responsibilities of clusters and infrastructure tooling

Infrastructure Org and Product Orgs

The Infrastructure Org model is fitting for organizations seeking to balance standardization with product team ownership. With a separate infrastructure organization providing core services and some degree of standardization, organizations can reduce duplication of work compared to fully decentralized models. However, product teams retain autonomy over their application operations and provisioning requirements, allowing for a degree of flexibility. The tradeoff is that alignment between infrastructure and product groups is essential for smooth workflows, and there can still be duplication of work across product orgs.

Pros:

  • Allows for strategic vision and strong opinions on infrastructure
  • Achieves cost efficiencies through shared resources like shared clusters
  • Removes operational burden of managing clusters from product orgs
  • Eases visibility into usage and costs across product orgs
  • Gives product orgs predictability, albeit with some constraints
  • More easily adopts and enforces standards, best practices

Cons:

  • Product orgs must use prescribed infra technologies which may not fit needs
  • Bad central decisions have a large blast radius, negatively impacting all product orgs
  • Difficult for product orgs to adopt new technologies on their own
  • Does not eliminate difficulty of using provided infrastructure
  • Forced migration to new technologies disrupts product orgs

Infrastructure-Platform Org and Product Orgs

The highly centralized Infrastructure-Platform Org model is ideal for larger organizations aiming for consistency across the organization and reducing duplication of work. By having a dedicated platform team owning all core infrastructure and standards, organizations can ensure uniformity and reduce cognitive load on product groups. However, the tradeoff is that the tight centralization can lead to the platform org drifting from actual product needs over time, and the accumulation of complex custom systems can hinder agility.

Pros:

  • Lowers developer cognitive load by providing a unified SDK and abstraction layer
  • Upgrading underlying infrastructure is cheaper overall as it’s done centrally
  • Clear ownership for optimizing and simplifying the platform
  • Limits infrastructure churn and “over-innovation” to one org

Cons:

  • Forces developers to learn company-specific abstractions
  • Once the company aligns on core tech, custom abstractions are seen as low value
  • Brain drain or bad decisions have huge blast radius
  • Perception of total cost of ownership of platform org is very high, creating continuous leadership friction
  • Decomposing the Platform Org model into one of the previous ones requires changes on all product orgs and teams
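To ground the “unified SDK and abstraction layer” tradeoff, here is a minimal sketch of what such a company-specific facade might look like (the class, the stubbed steps, and the dashboard URL are all hypothetical illustrations, not a real SDK):

```python
# Hypothetical platform SDK facade: one call hides provisioning,
# secret wiring, and monitoring behind a company-specific abstraction.

class PlatformClient:
    """The shape of what an internal infra-platform SDK might expose."""

    def __init__(self) -> None:
        self._services: dict[str, dict] = {}

    def deploy_service(self, name: str, image: str) -> dict:
        # In a real SDK these steps would call cloud provider APIs;
        # here they are stubbed to show the shape of the abstraction.
        service = {
            "name": name,
            "image": image,
            "secrets_mounted": True,  # platform handles secret wiring
            "dashboards": f"https://dash.internal/{name}",  # hypothetical URL
        }
        self._services[name] = service
        return service

client = PlatformClient()
svc = client.deploy_service("checkout", "registry.internal/checkout:1.2")
```

The upside is one call in place of many cloud-specific steps; the downside is that engineers must learn this company-specific layer rather than transferable cloud skills, and it is exactly this kind of custom abstraction that accumulates complexity over time.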

Infrastructure Org + Intelligence-Assisted Model

For the majority of organizations that we’ve talked to, building and operating backend systems in the cloud remains unsustainably complex, driving up costs and reliance on scarce engineering expertise. Creating internal platform engineering teams seems promising initially, but these teams tend to accumulate custom implementations and abstraction layers that diverge from engineering needs over time. Platform teams get overloaded as workloads and technologies evolve, and the overhead of sustaining them outweighs the benefits, as complexity is shifted rather than resolved.

The cloud is the platform

We believe that a better approach is to encapsulate this complexity behind accessible and intelligent tools designed for all application developers, while configured and operated by a central infrastructure org. Rather than funnel expertise into platform teams, it should be encoded into solutions that empower all developers. With abilities to reason about guardrails, optimizations, correctness, and evolving best practices baked in, these tools can encapsulate cloud complexity while enhancing productivity. This democratization of expertise through automation represents the future – one where any organization can reliably and sustainably build sophisticated backend systems in the cloud.

That’s where our InfraCopilot journey is headed. If you believe in this vision, let’s make that journey together!

2023 Report: If Platform Engineering is so great, why isn’t it?

Several reports have painted an overwhelmingly optimistic portrait of platform engineering. They portray it as a remedy to solve all efficiency challenges. However, the on-the-ground reality often proves far more nuanced. In this post, we will closely analyze findings from the State of Platform Engineering 2023 Report to highlight inconsistencies between the surveys’ rosy outlook and real-world complexity. 

Read More

Platform Engineering Landmines – Part 1

(Read the full version on the Klotho Blog)

As engineering leaders contemplate internal developer platforms (IDPs), many are unaware of the organizational and cultural “fine print” that comes with them.

After 15 years building platform and development teams, and interviewing countless peers about developer platforms, I decided to share a few stories on patterns and themes related to the unforeseen costs of IDPs. Centralizing platform functions brings about a new level of organizational dynamics that most orgs aren’t equipped to tackle from the get-go. The stories I’ll share are based on interviews with leaders who faced those challenges in real time and were willing to share them with the broader community.

Few industry leaders seem to talk about these realities. The IDP hype suggests it’s a silver bullet – an inevitable evolution for infrastructure engineering organizations. But in my experience, the transition is far more complex.

My hope is that sharing these stories will help you prepare tech orgs for the cultural and organizational costs so you can pave a smoother path to internal platform success.

“I was hired to lead”

In one of the interviews, a CTO from a leading mid-sized tech company with over 500 engineers shared the story that colored their platform engineering journey the most. In preparation for scale, they prioritized recruiting technical visionaries with proven track records from past companies. However, in their eagerness to scale, they overlooked a critical factor: the need for empathy in technical design. Failing to account for and accept the constraints and realities of the application developers proved pivotal to whether the platform succeeded or failed.

In one interview with Alex, an engineering manager in the infrastructure-platform group at the time, he described the buzz in the platform team after they hired a visionary tech lead who had come from a hot container startup. The new lead’s big ideas and confidence about the future of cloud computing got the platform engineering team excited, and Alex admitted sharing in their enthusiasm at first.

Over the next several weeks, Alex had multiple conversations with the new tech lead, especially around the tech lead’s belief that the product teams should fully commit to adopting the platform team’s tools and approaches. ‘It’d be better if they fully committed to using them,’ he would say, focusing on the politics and how the product teams were stuck in the old ways of doing things and unwilling to adopt a true DevOps mentality.

Over months, tensions built up between the tech lead’s stance and product teams pushing back. The product teams highlighted both technical and cultural constraints that didn’t align well with the tech lead’s vision. The developers just didn’t work that way, and expressed that they didn’t have the mental capacity to both build their product quickly and re-shape how they did development. But the tech lead insisted, convinced that their resistance was merely an unwillingness to do the work and learn something new. Eventually the CTO caved and mandated adoption, forcing the product engineering groups to use the new platform for all new services.

After one too many fiery meetings, the product org leadership got involved and made their stance clear: if the tech lead would not budge, the product groups would not adopt the platform and would start looking at self-funding their own platform efforts. At that stage the platform tech lead was let go; his refusal to adjust his worldview and listen to the product teams was the final straw. At that point, Alex wasn’t sure if the platform org would survive.

In a last-ditch effort, Alex and the platform team formed a 20-person software engineering strike team that embedded directly with one of the product groups. The product group’s condition was that the platform team would join their daily rituals, co-design tools fitting their workflows, and align as mutually invested partners. But the decision came with a price.

The platform team’s morale suffered, as those working on the existing platform felt left out from new design decisions. And those focusing on the new platform felt the rest of their teammates were not aware of or embracing the future needs of the company.

As Alex reflects now, bridging disparate worlds is messy and difficult, but the hard-won empathy and insight gained made all the difference. Trust was reestablished between the teams, setting an example within the company of how the platform team cares and empathizes with its users, going the extra mile. However, it also set a dangerous precedent. Leadership had to manage expectations and make clear to all teams that a similar embed would not be possible at that scale again.

It’s easy for platform teams to lose sight of the problems app developers face day-to-day. By immersing in the product team’s reality, Alex believes they not only built tools that empowered but learned to lead with compassion. The lessons were painful but clear – vision must align with reality, and empathy must guide innovation.

vision must align with reality, and empathy must guide innovation

Read the rest

Read the rest of the stories on the official Klotho blog and stay tuned for part 2 of this series! If you’d like to chat about platform engineering you can reach out to me on LinkedIn (just mention this blog post), or check out what I work on.


The Evolution of Cloud, Infrastructure, and Platform Engineering Organizations

(I’d like to thank Brian Bossé and David de Regt for contributing to the discussion leading to this article)

Modern backend systems have reached a level of complexity that organizations struggle to wield. Many find themselves in a constant cycle of hiring, reorganization and reprioritization in order to better align themselves with that complexity, only to find familiar problems occurring again. Perversely, this constant organizational flux further contributes to the complexity of their backend system, creating a negative feedback loop.

Simplicity won’t be found at the end of the infrastructure-platform journey; instead, organizations reach an incremental yet cyclical steady-state of streamlined complexity. Breaking the cycle requires a paradigm shift in how cloud development is approached, something we’ve been working on with Klotho and InfraCopilot for the past several years. Regardless of where you are in the cycle of coping with backend complexity, we can help you step outside of the loop.

How we got here

Small startups have small teams of developers balancing all the development responsibilities from coding to basic CI/CD setups. With the scaling of the company comes more services to connect and deploy, leading to extra burden that draws developers away from their product focus, making way for hiring the first dedicated cloud ‘devops’ engineer.

Once the company moves up to the Series A stage, that engineer is tasked with handling the complexities of operationalizing services across many repositories, usually laid out in a microservice architecture. Their efforts clear the way for the developers to turn their focus back onto the product.

Yet, as the startup’s journey continues, dedicated product teams are spun up to capture more of the opportunity the startup creates. This results in more services, which means more cloud engineering work than one engineer can handle. A shared cloud engineering team is formed, focusing on creating reusable tools and templates.

Despite their expertise, the shared team’s capacity has a cap: supporting everyone’s needs becomes impossible, and with limited organizational tools, prioritizing across varying product groups is equally hard. This forces product teams to come up with their own infrastructure stopgaps, moving away from the shared team’s offerings.

When the divergence becomes large enough and the benefits of centralizing the cloud engineers become more evident, a reorganization is brought about to put in place more directional thinking and more rigorous prioritization.

That starts the shift towards platform engineering and the beginning of the self-reinforcing cycle.

The Cycle

Step 1 – Split out Infrastructure

This is the common entry point for organizations as they scale: cloud engineering work has grown large enough that it becomes worth separating into its own entity, tasked with creating software to facilitate product teams’ delivery. This is often accomplished by consolidating the previous “everything goes” set of technologies onto a standardized set of choices, allowing for economies of scale in creating an ecosystem around them.

Step 2 – Grassroots Platform

However, to fully gel into a cohesive ecosystem sensitive to the company’s particular needs, some software engineering needs to be applied. Often this is noticed first by team members within the infrastructure organization who have a software engineering background and who, with good intent and initiative, take it upon themselves to start creating that software. This early work proves very valuable to product teams, and the desire for the fully realized version rises quickly.

Step 3 – Formal Platform

Responding to that desire, the infrastructure team’s leadership takes on the scope of providing a cohesive platform for their infrastructure. There’s a vision of an easy-to-use, self-service interface to the infrastructure that’s aligned tightly with the organization’s needs, and a high-powered FAANG-like hire is brought in to realize it. These systems take a tremendous amount of work; the new platform team created to build them quickly balloons, and priorities become strained between what the platform needs and what the infrastructure needs.

Step 4 – Infrastructure-Platform Org

If everything goes well and the vision is executed successfully, someone still needs to maintain the software as its underlying technology and the company’s needs naturally shift. The large team created to stand the new platform up calcifies into its own organization, treating the platform it’s responsible for as a product unto itself, further distancing it from the infrastructure organization it came from. As platform teams enter a more operational, iterative refinement phase, they can suffer from many problems, ranging from bored engineers looking to reinvent the wheel to bleeding senior talent to projects that are building something new.

All the while, the company continues to adapt, its engineering practices evolve, and its expertise in backend development increases. Each product group’s ambition grows, and some of the commonalities that brought all parties into a shared platform solution in the first place stop being quite so common. With the underlying technologies now mostly standardized across product groups and the tooling ecosystem maturing, perhaps it would be better to de-scope and go back to just handling the infrastructure…

Conclusion

The industry is now streamlining the complexity of cloud computing, which is long overdue but insufficient. As growing utilization of cloud infrastructure increases complexity even further, it will continue to drive the need for more complex organizations to streamline that technical complexity. The desire for simplicity as we journey from startups to infrastructure-platform orgs isn’t met at the journey’s end; instead, organizations reach an incremental yet cyclical steady state of streamlined complexity that never fully tames it.

Simplicity isn’t found in the infrastructure-platform engineering reorganization loop. To find it, we need a paradigm shift. That paradigm shift is beginning, and it’s ready for you to take part. Learn more about it by visiting Klotho and InfraCopilot.

Search The Deck

Filtering tags is so 90’s

When I was putting together Klotho’s pitch deck, I followed the classic sections everyone suggested: introduction, problem, solution, target market, market size, competition, go-to-market strategy, product or service, team, financials, funding, milestones, and conclusion.

I searched for slide decks, but it was difficult to find the specific sections I wanted to learn from. Opening tens of decks and scrolling through to find that one relevant slide felt wasteful.

That’s when I decided to create a tool that would make it easier to search inside the decks.

Searching Decks

I needed to find pitch-deck slides of a certain type, like the ‘Problem’ slide or ‘Vision’ slide.

There are several pitch-deck sites out there, and with a little scraping, I collected 15k+ slides totaling 5GB of data. The only way to find the relevant slides without manually tagging them was to OCR the images and search the text inside the slides.

However, the resolution of the slides was too low, and the OCR library Tesseract wasn’t able to recognize the text in them.

To solve this, I used the latest GPU-based open source upscaler called Upscayl to upscale each slide to 4x its original size. This created a data set of 150GB of images that was ready for OCR.

Running Tesseract on 150 GB of images on a single machine proved slow; to speed things up, I wrote a Lambda-based, event-driven Klotho application to parallelize the scan.

Klotho

The application would take an image path, pass it to a function that runs Tesseract on it, and then pass the detected text and image path to another klotho::exec_unit that resized and optimized the image into a smaller yet still high-resolution WebP file.

To upload the images, I used the klotho::persist capability to create a data store backed by S3, and manually uploaded the 150 GB of images.

The event driven flow used the klotho::pubsub capability. The processed image was then written into the same klotho::persist‘ed object store but with a different path, and the path + detected text were saved into a klotho::persist‘ed key-value store.
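For illustration, here’s the flow in plain Python with the cloud pieces stubbed out. The function names and in-memory stores are hypothetical stand-ins: `object_store`/`kv_store` play the role of the klotho::persist-backed stores, `publish` stands in for the klotho::pubsub topic, and the OCR/resize calls are faked.

```python
# Sketch of the event-driven pipeline; in the real app the stores are
# klotho::persist-backed, publish is klotho::pubsub, and the handlers run
# as klotho::exec_units on Lambda. Here everything is in-memory and synchronous.

object_store: dict[str, bytes] = {}   # raw + processed images
kv_store: dict[str, str] = {}         # processed path -> detected text

def run_tesseract(image: bytes) -> str:
    # Stand-in for the real OCR call (Tesseract inside a Lambda).
    return "detected text"

def optimize_to_webp(image: bytes) -> bytes:
    # Stand-in for resizing/re-encoding into a smaller WebP.
    return image[: len(image) // 2]

def ocr_handler(path: str) -> None:
    """First unit: OCR the image, then publish it for optimization."""
    text = run_tesseract(object_store[path])
    publish("optimize", {"path": path, "text": text})

def optimize_handler(event: dict) -> None:
    """Second unit: shrink the image, store it under a new path, record the text."""
    processed = optimize_to_webp(object_store[event["path"]])
    processed_path = "processed/" + event["path"]
    object_store[processed_path] = processed
    kv_store[processed_path] = event["text"]

def publish(topic: str, event: dict) -> None:
    # Stand-in for klotho::pubsub; dispatch synchronously here.
    if topic == "optimize":
        optimize_handler(event)

# Simulate one uploaded slide flowing through the pipeline.
object_store["slides/deck1.png"] = b"\x00" * 100
ocr_handler("slides/deck1.png")
```

In the real system each handler invocation runs as its own Lambda, which is what made scanning 150 GB of images parallel rather than serial.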

The processed data-set was only 2GB in size.

Fast search

In order to create a fast, searchable data set, I used Algolia to index the text results from the OCR. Facets such as the startup name and the public image URL for the slide made the UI easy to construct.

For the front-end, React, NextJS, NextUI, static building, and Klotho’s klotho::static_unit capability made a great combo running on AWS’s S3+Cloudfront CDN. Due to the 15k results going over the Algolia free-tier, we decided to sponsor the project.

What I liked

  • The Upscayl GPU-based upscaler quality was impressive despite the low resolution of the sources.
  • I enjoyed the developer experience building the cloud system with Klotho. (Though being one of the founders, I’m biased). I used the open source klotho::persist and klotho::static_unit and the pro klotho::pubsub and klotho::exec_unit capabilities to construct the larger system in a few hours with virtually no infra/platform work – maximum productivity!
  • Tesseract’s OCR produced quality results and worked well in a Lambda-based environment.
  • Algolia APIs had a seamless experience and their starter React components for the front end UI worked as expected.

What I disliked

  • I couldn’t figure out how to run Upscayl in a cloud environment, so I wound up not automating it. That meant that Search the Deck isn’t fully automated (yet), and there are manual steps that have to be taken to add or update the decks.
  • The manual nature of collecting all the slides from all the websites felt unnecessary. Similar projects pop up all the time; there’s no point in re-scraping them.
  • The developer experience for klotho::static was experimental at the time of writing, but I wanted to use it anyway. This wound up being useful input for its next iteration.

Opening the data set

We’ll be releasing the image dataset as a downloadable set, or hosting it in a repository so people can contribute to it. That way the next person that wants to create a fun new version can use that central set and benefit everyone.

Open Source

I wasn’t originally planning to make this an open source project, but it seems like it would be really useful to make it available to everyone.

Help us get 1000+ Github stars within a week and we’ll prioritize the effort to open source it.

Now go and Search the Deck!


How to refactor for Startups 2022: reasons and tradeoffs

(Cross-posted from the Klotho web site)

This blog post is a high level overview on the reasons to refactor code and systems in a startup setting. We cover risks, approaches and tradeoffs to consider in 2022.

How to judge when the cost is worth the gains

Be honest with yourself

The temptation will always be to refactor: real-world code is messy, and engineers don’t like messy code. Make sure there’s a business case for refactoring by measuring how much time the team is spending on directly customer-visible features.

Our research shows that mid-sized companies and fast-growing startups spend 39% of engineering capacity on undifferentiated work, like infrastructure. Split stories into sub-tasks like infrastructure and refactoring so you can measure where your time is going.
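One lightweight way to do that measurement is a sketch like the following, assuming each sub-task is tagged with a category and an hour estimate; the category names and numbers are illustrative.

```python
# Estimate the share of engineering time spent on undifferentiated work,
# given sub-tasks tagged with a category and an hour estimate.
UNDIFFERENTIATED = {"infrastructure", "refactoring", "tooling"}  # illustrative tags

def undifferentiated_share(subtasks: list[dict]) -> float:
    total = sum(t["hours"] for t in subtasks)
    undiff = sum(t["hours"] for t in subtasks if t["category"] in UNDIFFERENTIATED)
    return undiff / total if total else 0.0

# A hypothetical sprint: 30h of features, 12h of infra, 8h of refactoring.
sprint = [
    {"category": "feature", "hours": 30},
    {"category": "infrastructure", "hours": 12},
    {"category": "refactoring", "hours": 8},
]
share = undifferentiated_share(sprint)  # 20 / 50 = 0.4
```

Tracked over a few sprints, a number like this gives you the business case (or lack of one) for investing in refactoring.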

Well-structured is well-maintained

Clear boundaries between modules make them easier to test, deploy, and monitor. Keep an eye on customer-reported bugs, service latency, and how often you have to revert code. On the other end of the software lifecycle, there are multiple indicators that you’re trending towards a bottleneck: how much time it takes to go from design to shipped, the degree of engineering satisfaction, and rising infrastructure costs — these can all be leading indicators that you’ve built up debt.

Inflection points in the business

Code bases tend to organize themselves on three dimensions: team size, pace of new features, and number of customers. When these numbers change significantly, it may be time to look at splitting up components.

If you’re growing the team or increasing the rate of feature development, the limiting factor will be the code’s readability. Start with targeted, opportunistic refactoring. If your customer base is growing, you may need more scalable technologies or workflows.

Control what you can, plan for the rest

Choosing right won’t prevent re-architecting

Refactoring and re-architecture doesn’t mean you made a bad choice earlier. More often, the driving forces behind re-architecture are tied to requirement changes or external factors. There are at least 4 significant dimensions that will force a re-architecture over the lifetime of a product: new feature development rate, engineering team size, the amount of time spent on undifferentiated work, and customer growth. Each of these is progressively harder to directly control.

Start with the easy dials…

Of the four dimensions, the two that are easiest to control are the rate of new features and the engineering team’s size. You can control the rate of new features by being stricter about planning and prioritization. Scaling the team is a slower process, but it’s usually one you can at least plan for.

If both the team and the rate of new features are small, refactoring is unlikely to have a significant impact on the business. At the other end of the spectrum, a large team working on many features may benefit from reorganizing into smaller teams — and you should consider refactoring or re-architecting the code to match. An architecture that enables cleaner organizational and code boundaries will allow the product and company to scale.

…and then move onto the harder ones

The amount of time your team spends on undifferentiated work can be hard to rein in, and customer growth is the hardest measure of all to affect. If these were easy, everyone would minimize undifferentiated work and maximize customer growth! Still, you can get ahead of problems with a careful and proactive approach to refactoring.

The first step is knowing when not to refactor. If your customer growth and the amount of time spent on undifferentiated work are both low, don’t spend time on refactoring: focus instead on impactful, customer-visible features. Similarly, if you have good customer growth and a low amount of undifferentiated work, your team is doing well. Consider tactical refactoring to keep the amount of undifferentiated work from growing, but don’t spend too much time on it.

If your team is spending too much time on undifferentiated work, it’s time to revisit the architecture and move to one that scales better to where your company is today.

If your customer adoption is lower, your priority should be a cheaper architecture that will give you more runway.

If both your customer adoption and the amount of time your team spends on undifferentiated work are high, it may be time to focus on a centralized, optimized solution. This typically takes the form of a dedicated operations team that can efficiently execute on infrastructure tasks. This is a great problem to have — so take a moment to congratulate yourself and your team for getting here!
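That guidance condenses into a small 2x2 decision helper; the thresholds and wording below are illustrative, not a formal rule.

```python
# Condensed 2x2: (customer growth, undifferentiated work) -> rough guidance.
# Wording is an illustrative paraphrase of the advice above.
def refactor_guidance(customer_growth_high: bool, undiff_work_high: bool) -> str:
    if not customer_growth_high and not undiff_work_high:
        return "skip refactoring; focus on customer-visible features"
    if customer_growth_high and not undiff_work_high:
        return "tactical refactoring only, to keep undifferentiated work low"
    if not customer_growth_high and undiff_work_high:
        return "prioritize a cheaper architecture to extend runway"
    return "centralize: dedicated ops team and an optimized architecture"
```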

Have a target, then find shortcuts to get there

Have a plan, even if it’s not perfect

Once you’ve committed to a re-architecture, don’t be afraid to think big. Lean on your engineers to come up with an end state they’d love, and then pare it down as needed. Chances are, the opportunity for a major re-architecture will only come once or twice in a product’s lifecycle, so be prepared to live with any compromise you make. But by the same token, know that even the best-laid plans will go awry as you start implementing.

Make big plans, take little steps

Once you know where you want the code to be, be tactical about how to get it there. Work on one component at a time, or pick components that are as far away as possible from each other. If you haven’t already invested in solid testing, both at the unit and system level, now’s the time. Tests will give you confidence that your changes won’t break existing customer experiences, but they can also help your team come up with its definition of done. When the tests pass, the component is ready!

The best technology is the one you can adapt

The key to reducing the impact of refactoring and re-architecting on startups in particular is to use technology that is adaptable. 

Historically, companies have chosen specific technologies like VMs, serverless, or containers to host their applications. The problem is that switching from one technology to another is prohibitively expensive, and what you need today may not be what you need tomorrow.

An adaptive architecture is one that lets you host your application on any technology equally easily. This lets you adjust the hosting environment on the fly to match your current needs. Specific technology choices like AWS Lambda, Fargate, Kubernetes, gRPC, Linkerd, Azure/GCP become interchangeable.

By reusing existing programming language constructs like functions and event handlers, as well as interfaces that are idiomatic to each language, adaptive architectures make cloud services easier to use.

Look for abstractions and tools that are lightweight, but flexible enough to let you switch technologies. We think Klotho annotations fit the bill, since they let you separate your architecture’s semantic meaning from the deployment configuration — but  with enough investment in runtime libraries and infrastructure automation, you can build out a similar solution yourself.
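A minimal sketch of the idea: keep the business logic as a plain function and let thin adapters map each hosting technology’s event shape onto it. The adapter names and event shapes here are hypothetical, not any particular framework’s API.

```python
# Adaptive-architecture sketch: one host-agnostic function, many thin adapters.

def handle_order(order_id: str, quantity: int) -> dict:
    # Plain business logic with no hosting assumptions baked in.
    return {"order_id": order_id, "status": "accepted", "quantity": quantity}

def lambda_adapter(event: dict, context=None) -> dict:
    # Hypothetical Lambda-style entry point delegating to the same function.
    return handle_order(event["order_id"], int(event["quantity"]))

def http_adapter(query: dict) -> dict:
    # Hypothetical container/VM HTTP entry point delegating to the same function.
    return handle_order(query["order_id"], int(query["quantity"]))

result = lambda_adapter({"order_id": "o-1", "quantity": "2"})
```

Because only the adapters know about the hosting technology, swapping Lambda for a container (or vice versa) means rewriting a few lines of glue, not the application.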

Serverless vs. Microservices: Two Sides of the Same Coin

I just wrote a piece about the confusion the Internet creates around Serverless and Microservices. The “Serverless vs. Microservices” debate presents a dilemma between two supposedly incompatible strategies that must be fundamentally at odds with each other. In reality, they are as similar as two flavors of ice cream – you might prefer chocolate chip, but strawberry will work just as well.

Check it out at the official Klotho blog.

Cloud computing architecture for the next ten years – Part 2

Cloud development has become prohibitively complex, and the current generation of solutions has low-level interfaces that require extensive investment from developers and operators to understand how to configure, learn, assemble, and scale them properly. For a new architectural shift to occur, we need approaches that absorb the cognitive load, not streamline it.

Maintain benefits from existing architectures

There has been a continuous discussion among backend and service developers about whether things should be built using one strategy (monoliths) or the other (microservices). There are no one-size-fits-all solutions, because there’s always a trade-off involved.

Monolithic development offers high productivity, ease of deployment, and a straightforward observability story. Microservices offer flexibility in fault isolation, resource tuning and team autonomy. Unfortunately, microservice-based architectures usually involve piecing things back together – back into the monolith’s basic architecture, but with duct tape. As a result, the benefits of neither solution are fully realized.

When building Klotho, we zoomed out and asked, “What aspects of computer engineering can we apply to bridge that gap?”. We concluded that a key characteristic of the new architecture must be the convenience of monolithic development, coupled with an adaptive system that leverages the benefits found in microservice architectures. Most importantly, it has to reduce the cognitive load for developers while maintaining configurability and control for operators.

By focusing on developer and operator intent, we created a solution based on ease of use through separation of concerns. Using three different programming constructs, Capabilities, Requirements, and Directives, developers and operators can specify what parts of the application should be cloud-aware, what additional tradeoffs Klotho should consider for your application, and what specific overrides are required.

Solution: Developers should write code the way they know best. We leverage their intent early on to determine what backend wiring and analysis is done behind the scenes to properly meet their needs. Requirements and Directives allow developers and operators to provide more fine tuning and controls without developers needing to change the code.
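To make the separation of concerns concrete, here is an illustrative pseudocode sketch of how an annotation-style capability might look; treat the exact shapes as hypothetical, the real syntax lives in the Klotho documentation.

```
// Illustrative pseudocode only, not exact Klotho syntax.

/* @klotho::persist */        // Capability: make this store cloud-aware
var users = new Map();

/* @klotho::expose {
 *   target = "public"        // Directive: an operator override, no code change
 * } */
app.listen(3000);
```

The point of the shape is that developers write ordinary code (a map, a listener) and declare intent in annotations, while operators adjust requirements and directives without touching the application logic.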

Read the rest of the post on the official blog:

Cloud computing architecture for the next ten years

In computing, bigger and more ambitious dreams have always been realized by pushing the limits. Cloud computing is no exception; parallel computing, cluster computing, grid computing, and edge computing are all continuously expanding what we consider to be possible. But they also make development more difficult.

Cloud computing is now in the phase of streamlining complexity. There are several examples of integrated solutions that are optimized for certain workloads or development models: Google’s Anthos, Amazon’s Outposts, Azure’s Stack Hub, and Hashistack.

These solutions bundle together building blocks necessary for larger-scale applications and systems, but they present complicated low-level interfaces that require developers and operators to configure, learn, assemble, and scale appropriately.

It’s similar to the complexity reduction evolution happening in programming languages: Punch cards, assembly, C, C++, Java …

Continuous improvement keeps happening, but at some point, an architecture shift emerges that addresses the accumulation of complexity.

In our first blog post on Klo.Dev, we take a look at a few principles that we view as critical for this architectural shift to emerge, and what we need from products to effectively take us into the new world of cloud computing: