May 29 2025
Scalability and Bottlenecks
How to think about scale by starting with what breaks first and why.
Andrews Ribeiro
Founder & Engineer
4 min read · Intermediate · Systems
Track: System Design Interviews - From Basics to Advanced (step 3 of 19)
The problem
A lot of conversations about scale start way too big.
Before anyone proves where the system is hurting, the room is already talking about queues, Kafka, load balancers, CDNs, sharding, and microservices. That can sound sophisticated. It usually does not help you decide what to do next.
Real scaling starts with a simpler question:
if this flow grows 10x, what breaks first?
Until you can answer that, most architecture talk is decoration.
Mental model
Systems almost never break everywhere at once.
Most of the time, the first pain shows up in one specific resource:
- CPU
- memory
- database connections
- network bandwidth
- disk I/O
- a slow external dependency
So thinking about scale is not about imagining an infinite system. It is about finding the first physical or logical limit that gets tight.
A simple way to think about it is:
- find the flow that matters most
- find the resource that flow consumes
- find which resource saturates first
- relieve that point before redesigning the rest
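The steps above can be turned into back-of-the-envelope arithmetic. This is a minimal sketch that assumes resource usage grows linearly with traffic, which real systems rarely do so neatly. For each resource, dividing capacity by current usage gives the load multiple at which that resource saturates. You can then check which limits a 10x growth would cross. The resource names and numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    used: float      # usage at today's load, e.g. CPU cores busy or DB connections open
    capacity: float  # hard limit in the same unit

    def headroom(self) -> float:
        """Load multiple at which this resource saturates (assumes linear growth)."""
        return self.capacity / self.used

def first_to_saturate(resources, growth):
    """Resources that would exceed capacity at `growth` times today's load, tightest first."""
    at_risk = [r for r in resources if r.used * growth > r.capacity]
    return sorted(at_risk, key=lambda r: r.headroom())

# Hypothetical numbers for one critical flow
resources = [
    Resource("app CPU (cores)", used=6, capacity=32),
    Resource("DB connections", used=40, capacity=100),
    Resource("network (Mbit/s)", used=80, capacity=1000),
]

for r in first_to_saturate(resources, growth=10):
    print(f"{r.name}: saturates at ~{r.headroom():.1f}x today's load")
# DB connections: saturates at ~2.5x today's load
# app CPU (cores): saturates at ~5.3x today's load
```

Network is not listed, because 10x its current usage (800 Mbit/s) still fits under 1000. With these made-up numbers, the database connection pool is the first thing to look at, not CPU.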
That avoids two common mistakes:
- optimizing the wrong part of the system
- adding complexity before you need it
Breaking it down
A practical bottleneck review usually looks like this:
- pick a critical flow
- define the metric that matters for that flow
- find the resource under the most pressure
- choose the smallest change that reduces that pressure
Critical flow might be:
- checkout
- login
- redirect
- search
- upload
The metric might be:
- latency
- throughput
- error rate
- cost
The pressured resource might be:
- application CPU
- the database
- a saturated queue
- a third-party API
Once you talk in terms of flow, metric, and resource, “scalability” stops being abstract and becomes a diagnosis.
A useful rule is this:
- if you cannot say what will saturate first, you are not really making an architecture decision yet
Simple example
Imagine an API that generates a heavy PDF on demand.
Every time the user clicks “export report,” the server:
- loads several datasets
- builds the file
- renders the PDF
- returns the download in the same request
If this system grows, what is the first likely bottleneck?
Not route caching. Not a CDN. Not microservices.
The first likely bottleneck is CPU during PDF generation, plus the time each request holds an instance busy.
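Little's Law can put a number on "the time each request holds an instance busy": the average number of requests in service equals the arrival rate times the time each one spends in service. Here is a rough sketch with hypothetical numbers (5 exports per second, 2 seconds of rendering each) and an assumed 70% utilization target:

```python
import math

def busy_workers(arrivals_per_second: float, seconds_in_service: float) -> float:
    """Little's Law: average requests in service = arrival rate x time in service."""
    return arrivals_per_second * seconds_in_service

def workers_needed(arrivals_per_second: float, seconds_in_service: float,
                   target_utilization: float = 0.7) -> int:
    """Workers to provision so average utilization stays at or below the target."""
    return math.ceil(busy_workers(arrivals_per_second, seconds_in_service) / target_utilization)

# Hypothetical load: 5 exports/s today, 2 s of synchronous rendering per export
print(busy_workers(5, 2), workers_needed(5, 2))    # 10 15
print(busy_workers(50, 2), workers_needed(50, 2))  # 100 143  (10x growth)
```

While rendering stays inside the request path, 10x traffic means roughly 10x as many web workers sitting busy. Shortening the render, or moving it out of the request path, changes the `seconds_in_service` term for the web tier.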
A mature response sounds more like this:
The pain is not mainly in the database. It is in heavy work inside the synchronous request path. I would move PDF generation out of the main route, return 202 Accepted, process it in the background, and let the client poll for status or fetch the file later.
Notice what changed:
- the bottleneck was named
- the change attacked the right bottleneck
- the architecture changed because the flow needed it
That is the opposite of theatre.
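Below is a minimal, in-process sketch of the 202 Accepted pattern from the response above. Everything in it is an assumption for illustration:

- the handler names `submit_export` and `get_status`
- the `/exports/{job_id}` URL
- `render_pdf`, which stands in for the real load, build and render work

Handlers return `(status_code, body)` tuples instead of using a web framework. An in-memory `queue.Queue` replaces the durable queue and file storage a production system would need.

```python
import queue
import threading
import uuid

# In-memory job table and queue: illustration only (a real system would use durable storage)
jobs = {}  # job_id -> {"status": "pending" | "running" | "done" | "failed", "result": bytes or None}
jobs_lock = threading.Lock()
work_queue = queue.Queue()

def render_pdf(report_id: str) -> bytes:
    """Hypothetical stand-in for loading datasets, building the file, and rendering the PDF."""
    return f"PDF for {report_id}".encode()

def submit_export(report_id: str):
    """Request handler: enqueue the work and answer immediately with 202 Accepted."""
    job_id = str(uuid.uuid4())
    with jobs_lock:
        jobs[job_id] = {"status": "pending", "result": None}
    work_queue.put((job_id, report_id))
    return 202, {"job_id": job_id, "status_url": f"/exports/{job_id}"}

def get_status(job_id: str):
    """Polling handler: returns the file once done, otherwise the current status."""
    with jobs_lock:
        job = dict(jobs[job_id]) if job_id in jobs else None  # snapshot under the lock
    if job is None:
        return 404, {"error": "unknown job"}
    if job["status"] == "done":
        return 200, {"status": "done", "file": job["result"]}
    return 200, {"status": job["status"]}

def worker():
    """Background worker: the CPU-heavy rendering happens here, off the request path."""
    while True:
        job_id, report_id = work_queue.get()
        with jobs_lock:
            jobs[job_id]["status"] = "running"
        try:
            pdf = render_pdf(report_id)
            with jobs_lock:
                jobs[job_id].update(status="done", result=pdf)
        except Exception:
            with jobs_lock:
                jobs[job_id]["status"] = "failed"
        finally:
            work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

# Usage: the request returns at once; the client polls the status URL later
code, body = submit_export("q2-sales")
print(code)                 # 202
work_queue.join()           # shortcut for the example: wait for the worker instead of polling
print(get_status(body["job_id"]))
```

Note that the bottleneck has moved rather than disappeared. The number of background workers now caps how many PDFs can be rendered per second, but slow renders no longer tie up the request-serving tier.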
Common mistakes
- starting with your favorite technology instead of the actual bottleneck
- assuming the database is always the problem
- ignoring third-party dependencies because they are not “your code”
- redesigning the whole system before locating the first saturation point
- looking only at averages and ignoring spikes and tail latency (p95, p99)
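Here is a small sketch of why averages mislead, using hypothetical latencies: 95 fast requests at 50 ms and 5 slow ones at 2000 ms. It uses the nearest-rank definition of a percentile. Monitoring tools may interpolate slightly differently.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in ms: 95 fast requests and 5 slow ones
latencies = [50] * 95 + [2000] * 5

mean = sum(latencies) / len(latencies)
print(f"mean={mean:.1f} ms, p50={percentile(latencies, 50)} ms, p99={percentile(latencies, 99)} ms")
# mean=147.5 ms, p50=50 ms, p99=2000 ms
```

The mean describes no real request. Most users see 50 ms, while 1 in 20 waits two seconds, and only the tail percentile shows it.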
Another common mistake is confusing the current bottleneck with the final bottleneck.
Maybe today application CPU saturates first. After you fix that, the next limit might be the database. Scale is usually a chain of bottlenecks, not one final answer.
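The chain of bottlenecks can be made concrete with a headroom number per resource. That number is the multiple of today's load at which the resource saturates (capacity divided by current usage). The system as a whole saturates at the smallest of those values. The sketch below assumes each fix fully removes the current tightest limit and shows how far the ceiling actually moves at each step. Resource names and numbers are hypothetical.

```python
def ceilings_after_fixes(headroom):
    """
    headroom maps each resource to the load multiple at which it saturates
    (capacity / current usage). The system saturates at the smallest value.
    Assuming each fix fully removes the tightest limit, return
    (what was fixed, new system ceiling) after each step.
    """
    remaining = dict(headroom)
    steps = [("baseline", min(remaining.values()))]
    while len(remaining) > 1:
        tightest = min(remaining, key=remaining.get)
        del remaining[tightest]
        steps.append((tightest, min(remaining.values())))
    return steps

# Hypothetical headroom per resource for one flow
headroom = {
    "app CPU": 4.0,
    "DB connections": 2.5,
    "network": 12.5,
    "payment API rate limit": 6.0,
}

for fixed, ceiling in ceilings_after_fixes(headroom):
    print(f"{fixed:<24} -> system ceiling ~{ceiling}x")
```

With these numbers, fixing the database only moves the ceiling from 2.5x to 4x. CPU then caps it, and after that the third-party rate limit caps it at 6x. Reaching 10x means working through three bottlenecks in sequence.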
It is also worth distrusting any solution that claims to fix everything at once. Most of the time you are relieving one pressure point and accepting a new cost in return: more queueing, more observability, more operational work, or more consistency trade-offs.
How a senior thinks
More experienced engineers are usually less impressed by pretty architecture and more obsessed with the real symptom.
The reasoning often sounds like this:
Show me the flow that matters. Show me the metric that is under pressure. Show me the resource underneath it. Then I will choose the smallest change that actually changes the result.
That is senior thinking because it combines two things:
- diagnosis before change
- proportionality in the response
Not every scale problem needs a distributed system. Sometimes it needs an index. Sometimes a cache. Sometimes it means moving heavy work out of the request path.
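For the cache case, a small read-through cache with a time-to-live (TTL) is often enough to take load off a slow dependency. This is a sketch, not a production cache: it has no size limit or eviction and is not thread-safe. `fetch_rate` is a hypothetical stand-in for a slow third-party call. The clock can be injected, so expiry can be checked without sleeping.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple

class TTLCache:
    """Read-through cache: reuse a loaded value until it is ttl_seconds old."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[Hashable, Tuple[float, Any]] = {}  # key -> (loaded_at, value)

    def get_or_load(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]             # fresh hit: the slow dependency is not called
        value = loader(key)             # miss or expired: pay the cost once
        self._store[key] = (now, value)
        return value

# Usage with a fake clock and a hypothetical slow dependency
calls = []

def fetch_rate(currency):
    calls.append(currency)              # record every real call
    return 1.1                          # placeholder value

fake_now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])

for _ in range(1000):
    cache.get_or_load("EUR", fetch_rate)
print(len(calls))  # 1: one real call served 1000 reads

fake_now[0] = 61.0
cache.get_or_load("EUR", fetch_rate)
print(len(calls))  # 2: the entry expired, so it was reloaded
```

The price is staleness: callers can see a value up to `ttl_seconds` old. That is the kind of consistency trade-off that relieving one pressure point tends to bring.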
The point is not to think big. The point is to think in proportion to the problem.
What the interviewer wants to see
In system design interviews, talking about scale this way shows maturity fast because you move from slogans to engineering.
The interviewer usually wants to see whether you:
- locate the critical flow
- talk about resources, not just components
- propose a proportional change
- understand degradation and the next likely bottleneck
Scaling is not adding more boxes to a diagram. It is relieving the point that blocks the system first.
Quick summary
What to keep in your head
- Scaling starts by finding the critical flow and what saturates first.
- Bottlenecks usually show up in a concrete resource like CPU, database connections, network, or a slow dependency.
- The mature move is the smallest change that relieves the current pressure point.
- Fixing one bottleneck often exposes the next one. Scale is a sequence of limits, not one perfect architecture.
Practice checklist
Use this when you answer
- Can I name the critical flow and the metric that really matters for it?
- Can I tell whether the likely bottleneck is CPU, database, network, or an external dependency?
- Can I propose the smallest useful change before redesigning the whole system?
- Can I explain what the next likely bottleneck would be after the first fix?