Scaling Node.js Microservices: Lessons from Production Systems
Hard-won lessons from building and maintaining high-traffic microservices architectures
When our team first migrated to microservices three years ago, we learned the hard way that distributed systems bring their own special chaos. Through late-night outages and scaling nightmares, here's what actually worked for our Node.js services handling 10k+ RPM.
Service Design: What Actually Matters
The Folder Structure That Survived 3 Major Rewrites
/src
  /services   # Individual microservices
  /lib        # Shared utilities
  /events     # Event schemas
  /types      # Type definitions
  /infra      # Deployment configs
Why this works:
- Clear separation between business logic (services) and infrastructure
- Shared utilities prevent code duplication
- Event schemas enforce contract between services
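Contract enforcement can be as lightweight as a shared type plus a runtime guard at the consumer boundary. A minimal sketch — the `OrderCreatedEvent` name and fields here are illustrative, not our actual schemas:

```typescript
// events/order-created.ts - hypothetical schema for an "order_created" event
export interface OrderCreatedEvent {
  type: "order_created";
  orderId: string;
  userId: string;
  items: { sku: string; quantity: number }[];
}

// Runtime guard so consumers reject malformed payloads at the boundary
// instead of failing deep inside business logic
export function isOrderCreatedEvent(value: unknown): value is OrderCreatedEvent {
  const e = value as OrderCreatedEvent;
  return (
    typeof e === "object" &&
    e !== null &&
    e.type === "order_created" &&
    typeof e.orderId === "string" &&
    typeof e.userId === "string" &&
    Array.isArray(e.items)
  );
}
```

Teams that want richer validation typically reach for a schema library, but even this much catches the most common cross-service drift.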
Real-world example of a payment service:
// services/payment/src/index.ts
import express from "express";
import { initTracing } from "../lib/tracing";
import type { PaymentRequest } from "../types/payment";

const app = express();
const tracer = initTracing("payment-service");

async function processPayment(payload: PaymentRequest) {
  const span = tracer.startSpan("process-payment");
  try {
    // Payment logic
  } finally {
    span.end();
  }
}

// Health check endpoint for Kubernetes
app.get("/health", (req, res) => {
  res.json({
    status: "OK",
    version: process.env.APP_VERSION,
    uptime: process.uptime(),
  });
});
Communication Patterns That Don't Break
Sync vs Async: When to Use What
// Good for critical path operations
app.post("/orders", async (req, res) => {
  const order = await createOrder(req.body);
  await inventoryService.reserveItems(order); // Sync HTTP call
  res.json(order);
});

// Better for non-critical updates
eventBus.subscribe("order_created", async (event) => {
  await analyticsService.trackOrder(event); // Async processing
  await recommendationService.updateSuggestions(event.userId);
});
Critical Decision Factors:
- Data consistency requirements
- Latency sensitivity
- Failure tolerance
- Team ownership boundaries
Event-Driven Architecture Survival Kit
// lib/event-bus.ts - Enhanced with retry logic
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export class EventBus {
  private producers = new Map<string, Producer>();
  private retryPolicy = {
    maxRetries: 3,
    backoff: [1000, 5000, 10000], // Delay in ms before each retry attempt
  };

  async publish(topic: string, event: unknown) {
    let attempt = 0;
    while (attempt <= this.retryPolicy.maxRetries) {
      try {
        const producer = await this.getProducer(topic);
        await producer.send({ value: JSON.stringify(event) });
        return;
      } catch (error) {
        if (attempt === this.retryPolicy.maxRetries) throw error;
        await sleep(this.retryPolicy.backoff[attempt]);
        attempt++;
      }
    }
  }

  // Dead letter queue: persist events that exhausted their retries
  private async handleFailedEvents(event: FailedEvent) {
    await db.failedEvents.create({
      data: {
        payload: JSON.stringify(event),
        error: event.error.message,
        retryCount: 0,
      },
    });
  }
}
Data Management: Hard Lessons Learned
The Cache Strategy That Saved Our Database
// services/product/src/cache.ts - With cache stampede prevention
const cache = new Redis({
  ttl: 30,
  allowStale: true,
  refreshThreshold: 5000, // Refresh cache shortly before expiration
});

async function getProduct(id: string) {
  return await cache.wrap(
    `product:${id}`,
    async () => {
      const product = await db.products.findUnique({ where: { id } });
      if (!product) throw new NotFoundError("Product not found");
      return product;
    },
    {
      ttl: 60, // Extend TTL on successful fetch
    }
  );
}
Cache Invalidation Strategies:
- Time-based expiration
- Write-through on updates
- Event-driven invalidation
- Versioned cache keys
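The last bullet deserves a sketch: versioned cache keys turn invalidation into a counter bump rather than a scan-and-delete over every derived key. This hypothetical helper uses an in-memory `Map` as a stand-in for what would typically live in a Redis hash:

```typescript
// Hypothetical versioned-key helper; stale entries under old versions
// are never read again and simply age out via their TTL.
const versions = new Map<string, number>(); // stand-in for a Redis hash

function cacheKey(entity: string, id: string): string {
  const v = versions.get(`${entity}:${id}`) ?? 1;
  return `${entity}:v${v}:${id}`;
}

// Bumping the version "invalidates" every key built from the old version
function invalidate(entity: string, id: string): void {
  const current = versions.get(`${entity}:${id}`) ?? 1;
  versions.set(`${entity}:${id}`, current + 1);
}
```

The trade-off is extra storage for the orphaned entries until they expire, in exchange for O(1) invalidation.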
Transactional Outbox Pattern Implementation
// lib/outbox.ts - With batch processing and retry tracking
export async function processOutbox() {
  // Atomically claim a batch so concurrent workers can't double-publish
  const events = await db.$transaction(async (tx) => {
    const pending = await tx.outbox.findMany({
      where: { status: "PENDING" },
      take: 100,
      orderBy: { createdAt: "asc" },
    });
    await tx.outbox.updateMany({
      where: { id: { in: pending.map((e) => e.id) } },
      data: { status: "PROCESSING" },
    });
    return pending;
  });

  try {
    await eventBus.publishBatch(events);
    await db.outbox.updateMany({
      where: { id: { in: events.map((e) => e.id) } },
      data: { status: "COMPLETED" },
    });
  } catch (error) {
    await db.outbox.updateMany({
      where: { id: { in: events.map((e) => e.id) } },
      data: {
        status: "FAILED",
        retries: { increment: 1 },
        lastError: (error as Error).message,
      },
    });
  }
}
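How you drive the outbox relay is a separate choice; we poll rather than tail the database log. A minimal sketch of an interval poller — `startOutboxPoller` is a hypothetical helper, and the one thing it must get right is not letting a slow batch overlap the next tick:

```typescript
// Hypothetical poller: runs the relay on a fixed interval, skipping ticks
// while a previous batch is still in flight.
function startOutboxPoller(
  relay: () => Promise<void>,
  intervalMs = 1000
): () => void {
  let running = false;
  const timer = setInterval(async () => {
    if (running) return; // previous batch still in flight; skip this tick
    running = true;
    try {
      await relay();
    } finally {
      running = false;
    }
  }, intervalMs);
  return () => clearInterval(timer); // stop handle for graceful shutdown
}
```

Wiring it up is one line at startup — `const stopPoller = startOutboxPoller(processOutbox)` — and the returned stop handle slots neatly into the SIGTERM handler shown later.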
Deployment Tricks That Actually Scale
The Kubernetes Config That Works
# infra/k8s/payment-deployment.yaml
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  template:
    spec:
      containers:
        - name: payment-service
          resources:
            limits:
              memory: "512Mi"
              cpu: "1000m"
            requests:
              memory: "256Mi"
              cpu: "200m"
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
Pro Tips:
- Always set resource limits
- Use pod anti-affinity for critical services
- Implement graceful shutdown
// Graceful shutdown example
process.on("SIGTERM", async () => {
  logger.info("Starting graceful shutdown");
  // Stop accepting new connections and wait for in-flight requests to finish
  await new Promise<void>((resolve) => server.close(() => resolve()));
  logger.info("HTTP server closed");
  // Close database connections
  await db.$disconnect();
  // Flush metrics
  await metricsClient.flush();
  process.exit(0);
});
Observability: What's Worth Monitoring
The 4 Golden Signals Dashboard
// lib/metrics.ts - Enhanced with error tracking
export function trackCoreMetrics(app: Express) {
  const httpErrors = new Counter({
    name: "http_errors_total",
    help: "Total HTTP errors by status code",
    labelNames: ["status"],
  });

  app.use((req, res, next) => {
    const originalSend = res.send;
    res.send = function (body) {
      if (res.statusCode >= 400) {
        httpErrors.inc({ status: res.statusCode });
      }
      return originalSend.call(this, body);
    };
    next();
  });
}
Critical Metrics to Track:
- Error rates (4xx/5xx responses)
- Latency percentiles (p95, p99)
- System resources (CPU/Memory usage)
- Queue depths (for async processing)
- Cache hit ratios
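For the latency percentiles it helps to be precise about what p95 actually computes. A nearest-rank sketch — fine for ad-hoc analysis of captured samples, though in production you would use histogram buckets rather than keeping raw sample arrays:

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are <= it. Throws on an empty input rather than guessing.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(rank, sorted.length) - 1];
}
```

With 100 latency samples of 1..100 ms, `percentile(samples, 95)` picks the 95th smallest value, which is exactly why a single slow outlier barely moves p95 but dominates p99.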
Production Checklist I Wish I Had
Before Going Live:
- [ ] Implement circuit breakers for external calls
// lib/circuit-breaker.ts
const circuit = new CircuitBreaker(
  async (url) => {
    return axios.get(url);
  },
  {
    timeout: 3000, // Fail fast if the call takes longer than 3s
    errorThresholdPercentage: 50, // Open the circuit at 50% failures
    resetTimeout: 30000, // Try a probe request after 30s
  }
);
- [ ] Set up distributed tracing with 100% sampling
- [ ] Configure auto-scaling thresholds
- [ ] Test failure modes:
- Network partitions
- Database outages
- Third-party API failures
- [ ] Establish SLOs:
- 99.9% availability
- p95 latency < 500ms
- Error budget: 0.1% monthly
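Those SLO numbers translate directly into a downtime allowance, and it is worth doing the arithmetic before an incident forces you to. A back-of-the-envelope sketch:

```typescript
// How many minutes of downtime a given availability target permits
// over a window of N days.
function errorBudgetMinutes(availability: number, windowDays: number): number {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - availability);
}
```

At 99.9% over a 30-day month that is roughly 43 minutes — one bad deploy can burn the entire budget, which is the whole argument for the rollout and probe settings above.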
Post-Launch Monitoring:
# Useful kubectl commands
kubectl top pods --containers
kubectl logs -f deployment/payment-service --tail=100
kubectl describe pod payment-service-xyz
Remember: Microservices are like cats - they work best when you assume they'll misbehave. What war stories have you survived with distributed systems? Let's compare battle scars! 🔥