Building Large Node.js Microservices: What I Learned in 3 Years
Lessons from building and operating high-traffic distributed systems
Three years ago, my team decided to migrate from a single large application to multiple smaller services. I quickly discovered that distributed systems come with failure modes very different from what I expected. After many late nights of incident response and scaling work, I want to share the techniques that actually helped our Node.js applications handle more than 10,000 requests per minute.
How to Organize Your Services: What Actually Matters
The Directory Structure That Survived Many Refactors
```
/src
  /services   # Individual microservices
  /lib        # Shared utilities
  /events     # Event schemas
  /types      # Type definitions
  /infra      # Deployment configs
```
Why this approach held up:
- Business logic (services) stays clearly separated from infrastructure
- Shared utilities keep us from writing the same code in every service
- Centralized event schemas keep all services on the same data contract
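As a concrete illustration of a shared event schema, here is a minimal sketch (the `OrderCreatedEvent` shape and the `isOrderCreatedEvent` guard are illustrative, not our actual schema):

```typescript
// events/order-created.ts - One schema, imported by every service
// that produces or consumes this event
export interface OrderCreatedEvent {
  type: "order_created";
  orderId: string;
  userId: string;
  total: number; // In minor currency units (e.g. cents)
}

// Runtime guard so consumers can reject malformed payloads early
export function isOrderCreatedEvent(
  value: unknown
): value is OrderCreatedEvent {
  const e = value as Partial<OrderCreatedEvent>;
  return (
    typeof value === "object" &&
    value !== null &&
    e.type === "order_created" &&
    typeof e.orderId === "string" &&
    typeof e.userId === "string" &&
    typeof e.total === "number"
  );
}
```

Pairing the compile-time type with a runtime guard matters because events cross process boundaries, where TypeScript's static checks can no longer help you.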
Here is a practical example from our payment service:
```typescript
// services/payment/src/index.ts
import express from "express";
import { initTracing } from "../lib/tracing";

const app = express();
const tracer = initTracing("payment-service");

async function processPayment(payload: PaymentRequest) {
  const span = tracer.startSpan("process-payment");
  try {
    // Payment logic
  } finally {
    span.end(); // Always close the span, even on failure
  }
}

// Health check endpoint for Kubernetes
app.get("/health", (req, res) => {
  res.json({
    status: "OK",
    version: process.env.APP_VERSION,
    uptime: process.uptime(),
  });
});
```
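The Kubernetes config later in this post also probes a separate `/ready` endpoint. Liveness should stay cheap, while readiness can verify dependencies; a minimal sketch of the aggregation logic (`checkDatabase` and `checkCache` in the usage comment are hypothetical probes, not our real ones):

```typescript
type Check = () => Promise<boolean>;

// Run every dependency check; the service is ready only if all pass
async function isReady(
  checks: Record<string, Check>
): Promise<{ ready: boolean; details: Record<string, boolean> }> {
  const details: Record<string, boolean> = {};
  for (const [name, check] of Object.entries(checks)) {
    details[name] = await check().catch(() => false);
  }
  return { ready: Object.values(details).every(Boolean), details };
}

// Usage in the endpoint (checkDatabase / checkCache are hypothetical):
// app.get("/ready", async (_req, res) => {
//   const result = await isReady({ db: checkDatabase, cache: checkCache });
//   res.status(result.ready ? 200 : 503).json(result.details);
// });
```

Returning 503 from `/ready` takes the pod out of the load balancer without restarting it, which is exactly what you want when a dependency is briefly down.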
Communication Methods That Work Reliably
Synchronous vs Asynchronous: When to Choose Each One
```typescript
// Use this for critical operations that must complete immediately
app.post("/orders", async (req, res) => {
  const order = await createOrder(req.body);
  await inventoryService.reserveItems(order); // Direct HTTP call
  res.json(order);
});

// Use this for updates that can happen later
eventBus.subscribe("order_created", async (event) => {
  await analyticsService.trackOrder(event); // Background processing
  await recommendationService.updateSuggestions(event.userId);
});
```
Important Things to Consider:
- How important is data consistency?
- How fast does the response need to be?
- What happens if something fails?
- Which team is responsible for each service?
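One concrete answer to "what happens if something fails?" is to bound every synchronous call with a timeout, so a slow dependency fails fast instead of stalling the request. A minimal sketch (the `withTimeout` helper and the 5000 ms budget are my own illustration, not from our codebase):

```typescript
// Reject a promise if it does not settle within `ms` milliseconds
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Timed out after ${ms}ms`)),
      ms
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}

// Usage: fail fast instead of letting a slow dependency stall the request
// await withTimeout(inventoryService.reserveItems(order), 5000);
```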
Event-Based Architecture: Essential Tools
```typescript
// lib/event-bus.ts - With retry functionality added
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export class EventBus {
  private producers = new Map<string, Producer>();
  private retryPolicy = {
    maxRetries: 3,
    backoff: [1000, 5000, 10000], // Wait longer each time
  };

  async publish(topic: string, event: unknown) {
    let attempt = 0;
    while (attempt <= this.retryPolicy.maxRetries) {
      try {
        const producer = await this.getProducer(topic);
        await producer.send({ value: JSON.stringify(event) });
        return;
      } catch (error) {
        if (attempt === this.retryPolicy.maxRetries) throw error;
        await sleep(this.retryPolicy.backoff[attempt]);
        attempt++;
      }
    }
  }

  // Persist events that exhausted their retries so they can be replayed
  private async handleFailedEvents(event: FailedEvent) {
    await db.failedEvents.create({
      data: {
        payload: JSON.stringify(event),
        error: event.error.message,
        retryCount: 0,
      },
    });
  }
}
```
Managing Data: What I Learned From Mistakes
The Caching Approach That Protected Our Database
```typescript
// services/product/src/cache.ts - Collapses concurrent requests to the database
const cache = new Redis({
  ttl: 30,
  allowStale: true,
  refreshThreshold: 5000, // Refresh the cache before it expires
});

async function getProduct(id: string) {
  return await cache.wrap(
    `product:${id}`,
    async () => {
      const product = await db.products.findUnique({ where: { id } });
      if (!product) throw new NotFoundError("Product not found");
      return product;
    },
    {
      ttl: 60, // Keep data longer when the fetch succeeds
    }
  );
}
```
Cache Invalidation Strategies We Use:
- TTL-based: automatic expiration after a set time
- Write-through: update the cache whenever the data changes
- Event-driven: clear entries when domain events fire
- Versioned keys: use different cache keys for different versions
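The event-driven option can ride on the same event bus as everything else. A minimal sketch, using an in-memory stand-in for the Redis client (the `ProductUpdated` shape and `SimpleCache` API are illustrative, not our real ones):

```typescript
// Minimal cache interface; in production this would be the Redis client
class SimpleCache {
  private store = new Map<string, unknown>();
  set(key: string, value: unknown) { this.store.set(key, value); }
  get(key: string) { return this.store.get(key); }
  del(key: string) { this.store.delete(key); }
}

interface ProductUpdated {
  productId: string;
}

const productCache = new SimpleCache();

// Subscriber: drop the stale entry as soon as the product changes
function onProductUpdated(event: ProductUpdated) {
  productCache.del(`product:${event.productId}`);
}
```

The key design choice is that the subscriber only deletes; the next read repopulates the cache through the normal `getProduct` path, so there is a single code path for writes.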
Outbox Pattern: How to Handle Database and Events Together
```typescript
// lib/outbox.ts - With batch processing and dead letter queue
export async function processOutbox() {
  // Atomically claim a batch so concurrent workers don't double-publish
  const events = await db.$transaction(async (tx) => {
    const pending = await tx.outbox.findMany({
      where: { status: "PENDING" },
      take: 100,
      orderBy: { createdAt: "asc" },
    });
    await tx.outbox.updateMany({
      where: { id: { in: pending.map((e) => e.id) } },
      data: { status: "PROCESSING" },
    });
    return pending;
  });

  if (events.length === 0) return;

  try {
    await eventBus.publishBatch(events);
    await db.outbox.updateMany({
      where: { id: { in: events.map((e) => e.id) } },
      data: { status: "COMPLETED" },
    });
  } catch (error) {
    // Failed batches are retried later; a retry cap moves them to the DLQ
    await db.outbox.updateMany({
      where: { id: { in: events.map((e) => e.id) } },
      data: {
        status: "FAILED",
        retries: { increment: 1 },
        lastError: (error as Error).message,
      },
    });
  }
}
```
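The other half of the pattern is the write side: the business row and its outbox row must be committed in the same transaction. A sketch under assumptions (the `TxClient` interface is a stand-in for whatever transaction client you use, and `createOrderWithOutbox` is my name, not from our code):

```typescript
interface TxClient {
  orders: {
    create(args: { data: Record<string, unknown> }): Promise<{ id: string }>;
  };
  outbox: {
    create(args: { data: Record<string, unknown> }): Promise<unknown>;
  };
}

// Insert the order and its outbox event atomically; a separate relay
// then publishes the event to the bus.
async function createOrderWithOutbox(
  tx: TxClient,
  order: Record<string, unknown>
): Promise<{ id: string }> {
  const created = await tx.orders.create({ data: order });
  await tx.outbox.create({
    data: {
      topic: "order_created",
      payload: JSON.stringify({ orderId: created.id }),
      status: "PENDING",
    },
  });
  return created;
}
```

Because both inserts share one transaction, you can never commit an order without its event, or vice versa, which is the whole point of the outbox pattern.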
Deployment Techniques That Work at Scale
The Kubernetes Configuration That I Found Reliable
```yaml
# infra/k8s/payment-deployment.yaml
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  template:
    spec:
      containers:
        - name: payment-service
          resources:
            limits:
              memory: "512Mi"
              cpu: "1000m"
            requests:
              memory: "256Mi"
              cpu: "200m"
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
```
Important Recommendations:
- Always define resource boundaries
- Use pod anti-affinity for important applications
- Make sure your application shuts down properly
```typescript
// Example of proper shutdown handling
process.on("SIGTERM", async () => {
  logger.info("Starting graceful shutdown");

  // Stop accepting new connections and wait for in-flight requests
  await new Promise<void>((resolve) => {
    server.close(() => {
      logger.info("HTTP server closed");
      resolve();
    });
  });

  // Close database connections only after the server has drained
  await db.$disconnect();

  // Flush any buffered metrics
  await metricsClient.flush();
  process.exit(0);
});
```
Monitoring: What You Should Actually Watch
The Essential Metrics Dashboard
```typescript
// lib/metrics.ts - With error counting functionality
import { Counter } from "prom-client";

export function trackCoreMetrics() {
  const httpErrors = new Counter({
    name: "http_errors_total",
    help: "Total HTTP errors by status code",
    labelNames: ["status"],
  });

  app.use((req, res, next) => {
    // Count on the "finish" event (rather than patching res.send),
    // so the final status code is known
    res.on("finish", () => {
      if (res.statusCode >= 400) {
        httpErrors.inc({ status: String(res.statusCode) });
      }
    });
    next();
  });
}
```
Important Measurements to Monitor:
- Error percentages (4xx/5xx responses)
- Response time percentiles (p95, p99)
- System resource usage (CPU/Memory)
- Queue sizes (for background processing)
- Cache hit ratios
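For the latency percentiles, a metrics library histogram is the usual tool, but the underlying math is simple enough to sketch with a nearest-rank calculation over a window of samples (the `percentile` function below is illustrative, not our production code):

```typescript
// Nearest-rank percentile over a window of duration samples (in ms)
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// Example: p95 over 100 samples picks the 95th-ranked value
```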
Production Preparation List I Wish I Had Earlier
Before Going Live:
- [ ] Add circuit breakers for external API calls
```typescript
// lib/circuit-breaker.ts
// Assuming the opossum package, whose options match this shape
import CircuitBreaker from "opossum";
import axios from "axios";

const circuit = new CircuitBreaker(
  async (url: string) => {
    return axios.get(url);
  },
  {
    timeout: 3000, // Fail the call if it takes longer than 3s
    errorThresholdPercentage: 50, // Open after half the calls fail
    resetTimeout: 30000, // Try a request again after 30s
  }
);
```
- [ ] Configure distributed request tracing with an appropriate sampling rate
- [ ] Set up automatic scaling rules
- [ ] Test different failure scenarios:
- Network connection problems
- Database becoming unavailable
- External API services failing
- [ ] Define service level objectives:
- 99.9% uptime availability
- p95 response time under 500ms
- Error budget: 0.1% per month
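Those numbers translate directly into a downtime budget, and the arithmetic is worth internalizing. A quick sketch (assuming a 30-day month for simplicity):

```typescript
// Monthly downtime budget implied by an availability target
function downtimeBudgetMinutes(availability: number, daysInMonth = 30): number {
  const totalMinutes = daysInMonth * 24 * 60;
  return totalMinutes * (1 - availability);
}

// 99.9% over a 30-day month allows roughly 43.2 minutes of downtime
```

Seeing "99.9%" as "about 43 minutes per month" makes it much easier to decide whether a given deploy process or failover time actually fits the objective.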
After Launch Monitoring:
```bash
# Helpful kubectl commands I use regularly
kubectl top pods --containers
kubectl logs -f deployment/payment-service --tail=100
kubectl describe pod payment-service-xyz
```
The biggest lesson: distributed systems behave unpredictably, so design for failure from day one rather than treating it as an edge case. I would love to hear about your experiences with distributed systems and the challenges you have faced!