Scalability and Bottlenecks

How to think about scale by starting with what breaks first and why.

Andrews Ribeiro

Founder & Engineer

Track

System Design Interviews - From Basics to Advanced

Step 3 / 19

The problem

A lot of conversations about scale start way too big.

Before anyone proves where the system is hurting, the room is already talking about queues, Kafka, load balancers, CDNs, sharding, and microservices. That can sound sophisticated. It usually does not help you decide what to do next.

Real scaling starts with a simpler question:

if this flow grows 10x, what breaks first?

Until you can answer that, most architecture talk is decoration.

Mental model

Systems almost never break everywhere at once.

Most of the time, the first pain shows up in one specific resource:

  • CPU
  • memory
  • database connections
  • network bandwidth
  • disk I/O
  • a slow external dependency

So thinking about scale is not about imagining an infinite system. It is about finding the first physical or logical limit that gets tight.

A simple way to think about it is:

  1. find the flow that matters most
  2. find the resource that flow consumes
  3. find which resource saturates first
  4. relieve that point before redesigning the rest
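The four steps above can be turned into simple arithmetic before touching any architecture. A minimal sketch, with invented numbers, of estimating which resource saturates first when a flow grows 10x (every figure here is an assumption for illustration):

```python
# Back-of-envelope saturation check with invented numbers.
# For each resource: how much one request consumes, and total capacity.
# The resource with the highest utilization ratio at 10x breaks first.

current_rps = 50          # current requests per second (assumed)
growth = 10               # the "10x" question

# per-request cost and total capacity, in the same unit per resource
resources = {
    "cpu_ms":         (40, 8 * 1000),    # 40 ms of CPU per request, 8 cores
    "db_connections": (0.2, 100),        # avg connections held per request
    "bandwidth_kb":   (300, 125_000),    # 300 KB per response, 1 Gbps link
}

def first_to_saturate(rps):
    """Return the most-utilized resource and the full utilization map."""
    utilization = {
        name: rps * per_request / capacity
        for name, (per_request, capacity) in resources.items()
    }
    return max(utilization, key=utilization.get), utilization

worst, util = first_to_saturate(current_rps * growth)
for name, u in sorted(util.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {u:.0%} utilized at 10x")
print("first bottleneck:", worst)
```

With these numbers, CPU hits 250% while the database connections sit at 100%, so CPU is the answer to "what breaks first" and everything else can wait.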

That avoids two common mistakes:

  • optimizing the wrong part of the system
  • adding complexity before you need it

Breaking it down

A practical bottleneck review usually looks like this:

  1. pick a critical flow
  2. define the metric that matters for that flow
  3. find the resource under the most pressure
  4. choose the smallest change that reduces that pressure

The critical flow might be:

  • checkout
  • login
  • redirect
  • search
  • upload

The metric might be:

  • latency
  • throughput
  • error rate
  • cost

The pressured resource might be:

  • application CPU
  • the database
  • a saturated queue
  • a third-party API

Once you talk in terms of flow, metric, and resource, “scalability” stops being abstract and becomes a diagnosis.

A useful rule is this:

  • if you cannot say what will saturate first, you are not really making an architecture decision yet

Simple example

Imagine an API that generates a heavy PDF on demand.

Every time the user clicks “export report,” the server:

  • loads several datasets
  • builds the file
  • renders the PDF
  • returns the download in the same request

If this system grows, what is the first likely bottleneck?

Not route caching. Not a CDN. Not microservices.

The first likely bottleneck is CPU during PDF generation, plus the time each request holds an instance busy.
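Little's law makes "holds an instance busy" concrete: the average number of workers tied up equals arrival rate times service time. A quick sketch with invented numbers:

```python
# Little's law: L = arrival_rate * service_time
# L = average number of requests in flight, i.e. workers held busy.

arrival_rate = 4.0   # PDF exports per second (assumed)
service_time = 5.0   # seconds each export holds a worker (assumed)

workers_busy = arrival_rate * service_time
print(f"on average {workers_busy:.0f} workers are tied up generating PDFs")
# With, say, 16 workers per instance, this one flow consumes more than a
# full instance before any other traffic is served.
```

At 10x, that becomes 200 busy workers, which is why the fix targets the synchronous request path rather than the database.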

A mature response sounds more like this:

The pain is not mainly in the database. It is in heavy work inside the synchronous request path. I would move PDF generation out of the main route, return 202 Accepted, process it in the background, and let the client poll for status or fetch the file later.
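A minimal sketch of that shape, using an in-memory job store and a thread as stand-ins for a real queue and worker pool (the function names `create_export` and `get_export` are invented for illustration; a production version would use durable storage and a proper job queue):

```python
import threading
import uuid

# job_id -> {"status": ..., "result": ...}; a real system uses durable storage
jobs = {}

def generate_pdf(job_id):
    # Stand-in for the heavy, CPU-bound PDF work, now off the request path.
    jobs[job_id]["result"] = b"%PDF-1.7 ..."
    jobs[job_id]["status"] = "done"

def create_export():
    """POST /exports -> 202 Accepted with a job id; no rendering in-request."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "result": None}
    threading.Thread(target=generate_pdf, args=(job_id,)).start()
    return 202, {"job_id": job_id, "status_url": f"/exports/{job_id}"}

def get_export(job_id):
    """GET /exports/{id} -> the client polls until the job is done."""
    job = jobs[job_id]
    if job["status"] == "done":
        return 200, job["result"]
    return 202, {"status": job["status"]}
```

The request that used to block for the whole render now returns immediately, and the cost moves to the worker pool, which can be sized and scaled independently of the web tier.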

Notice what changed:

  • the bottleneck was named
  • the change attacked the right bottleneck
  • the architecture changed because the flow needed it

That is the opposite of theatre.

Common mistakes

  • starting with your favorite technology instead of the actual bottleneck
  • assuming the database is always the problem
  • ignoring third-party dependencies because they are not “your code”
  • redesigning the whole system before locating the first saturation point
  • looking only at averages and ignoring spikes

Another common mistake is confusing the current bottleneck with the final bottleneck.

Maybe today application CPU saturates first. After you fix that, the next limit might be the database. Scale is usually a chain of bottlenecks, not one final answer.

It is also worth distrusting any solution that claims to fix everything at once. Most of the time you are relieving one pressure point and accepting a new cost in return: more queueing, more observability, more operational work, or more consistency trade-offs.

How a senior thinks

More experienced engineers are usually less impressed by pretty architecture and more obsessed with the real symptom.

The reasoning often sounds like this:

Show me the flow that matters. Show me the metric that is under pressure. Show me the resource underneath it. Then I will choose the smallest change that actually changes the result.

That is senior thinking because it combines two things:

  • diagnosis before change
  • proportionality in the response

Not every scale problem needs a distributed system. Sometimes it needs an index. Sometimes a cache. Sometimes it means moving heavy work out of the request path.

The point is not to think big. The point is to think in proportion to the problem.

What the interviewer wants to see

In system design interviews, talking about scale this way shows maturity fast because you move from slogans to engineering.

The interviewer usually wants to see whether you:

  • locate the critical flow
  • talk about resources, not just components
  • propose a proportional change
  • understand degradation and the next likely bottleneck

Scaling is not adding more boxes to a diagram. It is relieving the point that blocks the system first.
