Scalability and Bottlenecks

How to think about scale by starting with what breaks first and why.

Andrews Ribeiro

Founder & Engineer

Track

System Design Interviews - From Basics to Advanced

Step 3 / 19

The problem

A lot of conversations about scale start way too big.

Before anyone proves where the system is hurting, the room is already talking about queues, Kafka, load balancers, CDNs, sharding, and microservices. That can sound sophisticated. It usually does not help you decide what to do next.

Real scaling starts with a simpler question:

if this flow grows 10x, what breaks first?

Until you can answer that, most architecture talk is decoration.

Mental model

Systems almost never break everywhere at once.

Most of the time, the first pain shows up in one specific resource:

  • CPU
  • memory
  • database connections
  • network bandwidth
  • disk I/O
  • a slow external dependency

So thinking about scale is not about imagining an infinite system. It is about finding the first physical or logical limit that gets tight.

A simple way to think about it is:

  1. find the flow that matters most
  2. find the resource that flow consumes
  3. find which resource saturates first
  4. relieve that point before redesigning the rest
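The four steps above can be turned into simple arithmetic before touching any architecture. A minimal sketch, with invented numbers, of estimating which resource saturates first when a flow grows 10x (every figure here is an assumption for illustration):

```python
# Back-of-envelope saturation check with invented numbers.
# For each resource: how much one request consumes, and total capacity.
# The resource with the highest utilization ratio at 10x breaks first.

current_rps = 50          # current requests per second (assumed)
growth = 10               # the "10x" question

# per-request cost and total capacity, in the same unit per resource
resources = {
    "cpu_ms":         (40, 8 * 1000),    # 40 ms of CPU per request, 8 cores
    "db_connections": (0.2, 100),        # avg connections held per request
    "bandwidth_kb":   (300, 125_000),    # 300 KB per response, 1 Gbps link
}

def first_to_saturate(rps):
    """Return the most-utilized resource and the full utilization map."""
    utilization = {
        name: rps * per_request / capacity
        for name, (per_request, capacity) in resources.items()
    }
    return max(utilization, key=utilization.get), utilization

worst, util = first_to_saturate(current_rps * growth)
for name, u in sorted(util.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {u:.0%} utilized at 10x")
print("first bottleneck:", worst)
```

With these numbers, CPU hits 250% while the database connections sit at 100%, so CPU is the answer to "what breaks first" and everything else can wait.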

That avoids two common mistakes:

  • optimizing the wrong part of the system
  • adding complexity before you need it

Breaking it down

A practical bottleneck review usually looks like this:

  1. pick a critical flow
  2. define the metric that matters for that flow
  3. find the resource under the most pressure
  4. choose the smallest change that reduces that pressure

The critical flow might be:

  • checkout
  • login
  • redirect
  • search
  • upload

The metric might be:

  • latency
  • throughput
  • error rate
  • cost

The pressured resource might be:

  • application CPU
  • the database
  • a saturated queue
  • a third-party API

Once you talk in terms of flow, metric, and resource, “scalability” stops being abstract and becomes a diagnosis.

A useful rule is this:

  • if you cannot say what will saturate first, you are not really making an architecture decision yet

Simple example

Imagine an API that generates a heavy PDF on demand.

Every time the user clicks “export report,” the server:

  • loads several datasets
  • builds the file
  • renders the PDF
  • returns the download in the same request

If this system grows, what is the first likely bottleneck?

Not route caching. Not a CDN. Not microservices.

The first likely bottleneck is CPU during PDF generation, plus the time each request holds an instance busy.
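Little's law makes "holds an instance busy" concrete: the average number of workers tied up equals arrival rate times service time. A quick sketch with invented numbers:

```python
# Little's law: L = arrival_rate * service_time
# L = average number of requests in flight, i.e. workers held busy.

arrival_rate = 4.0   # PDF exports per second (assumed)
service_time = 5.0   # seconds each export holds a worker (assumed)

workers_busy = arrival_rate * service_time
print(f"on average {workers_busy:.0f} workers are tied up generating PDFs")
# With, say, 16 workers per instance, this one flow consumes more than a
# full instance before any other traffic is served.
```

At 10x, that becomes 200 busy workers, which is why the fix targets the synchronous request path rather than the database.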

A mature response sounds more like this:

The pain is not mainly in the database. It is in heavy work inside the synchronous request path. I would move PDF generation out of the main route, return 202 Accepted, process it in the background, and let the client poll for status or fetch the file later.
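A minimal sketch of that shape, using an in-memory job store and a thread as stand-ins for a real queue and worker pool (the function names `create_export` and `get_export` are invented for illustration; a production version would use durable storage and a proper job queue):

```python
import threading
import uuid

# job_id -> {"status": ..., "result": ...}; a real system uses durable storage
jobs = {}

def generate_pdf(job_id):
    # Stand-in for the heavy, CPU-bound PDF work, now off the request path.
    jobs[job_id]["result"] = b"%PDF-1.7 ..."
    jobs[job_id]["status"] = "done"

def create_export():
    """POST /exports -> 202 Accepted with a job id; no rendering in-request."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "result": None}
    threading.Thread(target=generate_pdf, args=(job_id,)).start()
    return 202, {"job_id": job_id, "status_url": f"/exports/{job_id}"}

def get_export(job_id):
    """GET /exports/{id} -> the client polls until the job is done."""
    job = jobs[job_id]
    if job["status"] == "done":
        return 200, job["result"]
    return 202, {"status": job["status"]}
```

The request that used to block for the whole render now returns immediately, and the cost moves to the worker pool, which can be sized and scaled independently of the web tier.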

Notice what changed:

  • the bottleneck was named
  • the change attacked the right bottleneck
  • the architecture changed because the flow needed it

That is the opposite of theatre.

Common mistakes

  • starting with your favorite technology instead of the actual bottleneck
  • assuming the database is always the problem
  • ignoring third-party dependencies because they are not “your code”
  • redesigning the whole system before locating the first saturation point
  • looking only at averages and ignoring spikes

Another common mistake is confusing the current bottleneck with the final bottleneck.

Maybe today application CPU saturates first. After you fix that, the next limit might be the database. Scale is usually a chain of bottlenecks, not one final answer.

It is also worth distrusting any solution that claims to fix everything at once. Most of the time you are relieving one pressure point and accepting a new cost in return: more queueing, more observability, more operational work, or more consistency trade-offs.

How a senior thinks

More experienced engineers are usually less impressed by pretty architecture and more obsessed with the real symptom.

The reasoning often sounds like this:

Show me the flow that matters. Show me the metric that is under pressure. Show me the resource underneath it. Then I will choose the smallest change that actually changes the result.

That is senior thinking because it combines two things:

  • diagnosis before change
  • proportionality in the response

Not every scale problem needs a distributed system. Sometimes it needs an index. Sometimes a cache. Sometimes it means moving heavy work out of the request path.

The point is not to think big. The point is to think in proportion to the problem.

What the interviewer wants to see

In system design interviews, talking about scale this way shows maturity fast because you move from slogans to engineering.

The interviewer usually wants to see whether you:

  • locate the critical flow
  • talk about resources, not just components
  • propose a proportional change
  • understand degradation and the next likely bottleneck

Scaling is not adding more boxes to a diagram. It is relieving the point that blocks the system first.
