Scalable System Architecture: From 100 to 10k Concurrent Users

Every startup dreams of 'going viral', but for our client's e-commerce platform, viral traffic became a nightmare. During their Black Friday sale, the servers crashed under the load of 10,000 concurrent users. This is the story of how we re-engineered their system to handle 10x that load.

Phase 1: The Diagnosis
The application was a standard monolithic Node.js service connected to a single PostgreSQL instance. It worked perfectly at 500 concurrent users, but at 10,000 we observed severe degradation:
- Event Loop Lag: The single-threaded nature of Node.js meant that CPU-intensive tasks (like PDF invoice generation) blocked all incoming requests.
- Database Locks: Long-running analytics queries locked rows needed for new orders, causing a cascade of failures.
- Memory Leaks: Sessions that should have been stateless were quietly accumulating state in process memory, consuming RAM rapidly.
We used New Relic to identify that the database CPU was pinned at 100% continuously, while the Node.js servers sat mostly idle, waiting on I/O.
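The event-loop lag is easy to reproduce: any synchronous CPU-bound task (like rendering a PDF on the main thread) freezes every other request for its full duration. A minimal sketch:

```javascript
// Simulates a CPU-bound task (e.g. synchronous PDF rendering) hogging
// the single Node.js event-loop thread.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {} // busy-wait: no other callback can run
}

const t0 = Date.now();
blockFor(200);                 // every pending request waits right here
const blockedMs = Date.now() - t0;
console.log(`event loop blocked for ~${blockedMs}ms`);
```

In production the fix is to move such work off the main thread, to worker threads or a job queue, which is exactly what Phase 2 does.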
Phase 2: Decoupling with Event-Driven Architecture
The first step wasn't to rewrite everything, but to offload. We identified that the checkout process was doing too much: validating inventory, charging cards, sending emails, and updating analytics synchronously.
We introduced RabbitMQ to handle these side effects asynchronously. This pattern is often called 'Fire and Forget' for non-critical path items.
```javascript
// ❌ OLD: Synchronous Bottleneck
app.post('/checkout', async (req, res) => {
  await inventory.check(req.body);
  await payment.process(req.body); // Waits ~2s
  await email.sendConfirm(req.body); // Waits ~0.5s
  res.send('Order Placed');
});
```
```javascript
// ✅ NEW: Asynchronous & Non-Blocking
app.post('/checkout', async (req, res) => {
  const isValid = await inventory.check(req.body);
  if (!isValid) return res.status(400).send('Out of Stock');
  // Push to queue and respond immediately
  await rabbitMQ.publish('orders', req.body);
  res.status(202).json({ status: 'Processing', id: req.body.id });
});
```

Phase 3: Database Optimization and Sharding
The database is often the hardest thing to scale. Vertical scaling (buying a bigger server) worked for a while, but it hit a ceiling.
1. Read Replicas
We set up three read replicas on AWS RDS and routed all generic `GET` requests to them, keeping the primary dedicated to write operations (INSERT/UPDATE).
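The routing logic reduces to a small read/write splitter. A sketch with round-robin replica selection (the pool objects are stand-ins for real connection pools, and the regex-based read detection is a deliberate simplification):

```javascript
// Route reads to replicas (round-robin), writes to the primary.
// Pool objects here are placeholders for real pg connection pools.
const primary = { name: 'primary' };
const replicas = [{ name: 'replica-1' }, { name: 'replica-2' }, { name: 'replica-3' }];

let rr = 0;
function pickPool(sql) {
  if (!/^\s*select/i.test(sql)) return primary; // writes go to the primary
  rr = (rr + 1) % replicas.length;              // spread reads across replicas
  return replicas[rr];
}
```

One caveat: replication lag means a read issued immediately after a write may not see it, so read-your-own-writes paths (like the order confirmation page) should stay on the primary.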
2. Horizontal Sharding
We implemented application-level sharding based on `tenant_id`. User data for European customers was routed to `DB_EU`, while US customers went to `DB_US`. This roughly halved the load on any single instance.
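Shard routing is just a deterministic lookup. A sketch (the region mapping and default are hypothetical, illustrating the idea):

```javascript
// Map a tenant to its shard. Here the shard key is the customer's
// region; unknown regions fall back to the US shard.
const SHARDS = { EU: 'DB_EU', US: 'DB_US' };

function shardFor(tenant) {
  return SHARDS[tenant.region] || SHARDS.US;
}
```

Hashing `tenant_id` works too when there is no natural geographic key, but region-based routing has the side benefit of keeping EU data inside the EU for compliance.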
Phase 4: The Caching Layer (Redis)
We realized 80% of database hits were for the same 'Top Selling Products'. By caching these queries in AWS ElastiCache (Redis), we reduced DB load by 70%.
"There are only two hard things in Computer Science: cache invalidation and naming things."
We used a cache-aside strategy with explicit invalidation to keep data consistent: whenever a product was updated, its cache key was immediately deleted, so the next read repopulated it from the database.
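The write path pairs with the read path shown next. Sketched here with in-memory Maps standing in for Postgres and Redis:

```javascript
// Invalidation-on-write, with Maps standing in for Postgres and Redis.
const db = new Map([[1, { id: 1, name: 'Widget', price: 10 }]]);
const cache = new Map();

async function getProduct(id) {
  if (cache.has(id)) return cache.get(id);   // cache hit
  const product = db.get(id);                // miss: read from the DB
  cache.set(id, product);
  return product;
}

async function updateProduct(id, fields) {
  db.set(id, { ...db.get(id), ...fields });  // 1. write to the DB
  cache.delete(id);                          // 2. drop the stale cache entry
}
```

Deleting the key (rather than overwriting it) avoids caching a value that a concurrent write could immediately make stale.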
```javascript
const getProduct = async (id) => {
  const cacheKey = `product:${id}`;
  // 1. Check Redis cache
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);
  // 2. Fetch from DB on a miss
  const { rows } = await db.query('SELECT * FROM products WHERE id = $1', [id]);
  const product = rows[0];
  // 3. Store in cache (TTL 1 hour)
  await redis.setex(cacheKey, 3600, JSON.stringify(product));
  return product;
};
```

Monitoring and Observability
You can't improve what you can't measure. We deployed a robust monitoring stack:
- Prometheus: To scrape metrics (request count, latency, error rates).
- Grafana: To visualize these metrics in real-time dashboards.
- ELK Stack: For centralized logging, allowing us to trace a request ID across multiple microservices.
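Prometheus works by scraping a plain-text metrics endpoint, and the exposition format is simple enough to sketch without a client library (in a real service you would use a library such as `prom-client` instead):

```javascript
// Minimal Prometheus-style counters, rendered in the text exposition format.
const counters = { http_requests_total: 0, http_errors_total: 0 };

function recordRequest(statusCode) {
  counters.http_requests_total++;
  if (statusCode >= 500) counters.http_errors_total++;
}

// What a GET /metrics scrape would return.
function renderMetrics() {
  return Object.entries(counters)
    .map(([name, value]) => `${name} ${value}`)
    .join('\n');
}
```

A scrape after two requests, one of them a 5xx, would show `http_requests_total 2` and `http_errors_total 1` — exactly the counters Grafana turns into rate and error-ratio panels.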


Final Results and Lessons Learned
After a 2-month migration effort, the results spoke for themselves during the Christmas sale:
- 99.99% uptime
- 40ms average response time
- $3k in monthly cloud savings
Scalability isn't just about adding more servers. It's about designing systems that fail gracefully and handle pressure intelligently. The biggest lesson? Don't microservice too early. The monolith served its purpose for 2 years. Only optimize when you have data proving the bottleneck.
Is your infrastructure ready for growth?
Book a System Audit
