Scaling Node.js Microservices: Lessons from Production Systems
Hard-won lessons from building and maintaining high-traffic microservices architectures
When our team first migrated to microservices three years ago, we learned the hard way that distributed systems bring their own special chaos. Through late-night outages and scaling nightmares, here's what actually worked for our Node.js services handling 10k+ RPM.
Service Design: What Actually Matters
The Folder Structure That Survived 3 Major Rewrites
/src
  /services   # Individual microservices
  /lib        # Shared utilities
  /events     # Event schemas
  /types      # Type definitions
  /infra      # Deployment configs
Why this works:
- Clear separation between business logic (services) and infrastructure
- Shared utilities prevent code duplication
- Event schemas enforce contract between services
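Contract enforcement can be as lightweight as a shared type plus a runtime guard at the consumer boundary. A minimal sketch — the `OrderCreatedEvent` name and fields here are illustrative, not our actual schemas:

```typescript
// events/order-created.ts - hypothetical schema for an "order_created" event
export interface OrderCreatedEvent {
  type: "order_created";
  orderId: string;
  userId: string;
  items: { sku: string; quantity: number }[];
}

// Runtime guard so consumers reject malformed payloads at the boundary
// instead of failing deep inside business logic
export function isOrderCreatedEvent(value: unknown): value is OrderCreatedEvent {
  const e = value as OrderCreatedEvent;
  return (
    typeof e === "object" &&
    e !== null &&
    e.type === "order_created" &&
    typeof e.orderId === "string" &&
    typeof e.userId === "string" &&
    Array.isArray(e.items)
  );
}
```

Teams that want richer validation typically reach for a schema library, but even this much catches the most common cross-service drift.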
Real-world example of a payment service:
// services/payment/src/index.ts
import express from "express";
import { initTracing } from "../lib/tracing";
import type { PaymentRequest } from "../types/payment";

const app = express();
const tracer = initTracing("payment-service");

async function processPayment(payload: PaymentRequest) {
  const span = tracer.startSpan("process-payment");
  try {
    // Payment logic
  } finally {
    span.end();
  }
}

// Health check endpoint for Kubernetes
app.get("/health", (req, res) => {
  res.json({
    status: "OK",
    version: process.env.APP_VERSION,
    uptime: process.uptime(),
  });
});
Communication Patterns That Don't Break
Sync vs Async: When to Use What
// Good for critical path operations
app.post("/orders", async (req, res) => {
  const order = await createOrder(req.body);
  await inventoryService.reserveItems(order); // Sync HTTP call
  res.json(order);
});

// Better for non-critical updates
eventBus.subscribe("order_created", async (event) => {
  await analyticsService.trackOrder(event); // Async processing
  await recommendationService.updateSuggestions(event.userId);
});
Critical Decision Factors:
- Data consistency requirements
- Latency sensitivity
- Failure tolerance
- Team ownership boundaries
Event-Driven Architecture Survival Kit
// lib/event-bus.ts - Enhanced with retry logic
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export class EventBus {
  private producers = new Map<string, Producer>();
  private retryPolicy = {
    maxRetries: 3,
    backoff: [1000, 5000, 10000], // Delay in ms before each retry attempt
  };

  async publish(topic: string, event: unknown) {
    let attempt = 0;
    while (attempt <= this.retryPolicy.maxRetries) {
      try {
        const producer = await this.getProducer(topic);
        await producer.send({ value: JSON.stringify(event) });
        return;
      } catch (error) {
        if (attempt === this.retryPolicy.maxRetries) throw error;
        await sleep(this.retryPolicy.backoff[attempt]);
        attempt++;
      }
    }
  }

  // Dead letter queue: persist events that exhausted their retries
  private async handleFailedEvents(event: FailedEvent) {
    await db.failedEvents.create({
      data: {
        payload: JSON.stringify(event),
        error: event.error.message,
        retryCount: 0,
      },
    });
  }
}
Data Management: Hard Lessons Learned
The Cache Strategy That Saved Our Database
// services/product/src/cache.ts - With cache stampede prevention
const cache = new Redis({
  ttl: 30,
  allowStale: true,
  refreshThreshold: 5000, // Refresh cache shortly before expiration
});

async function getProduct(id: string) {
  return await cache.wrap(
    `product:${id}`,
    async () => {
      const product = await db.products.findUnique({ where: { id } });
      if (!product) throw new NotFoundError("Product not found");
      return product;
    },
    {
      ttl: 60, // Extend TTL on successful fetch
    }
  );
}
Cache Invalidation Strategies:
- Time-based expiration
- Write-through on updates
- Event-driven invalidation
- Versioned cache keys
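The last bullet deserves a sketch: versioned cache keys turn invalidation into a counter bump rather than a scan-and-delete over every derived key. This hypothetical helper uses an in-memory `Map` as a stand-in for what would typically live in a Redis hash:

```typescript
// Hypothetical versioned-key helper; stale entries under old versions
// are never read again and simply age out via their TTL.
const versions = new Map<string, number>(); // stand-in for a Redis hash

function cacheKey(entity: string, id: string): string {
  const v = versions.get(`${entity}:${id}`) ?? 1;
  return `${entity}:v${v}:${id}`;
}

// Bumping the version "invalidates" every key built from the old version
function invalidate(entity: string, id: string): void {
  const current = versions.get(`${entity}:${id}`) ?? 1;
  versions.set(`${entity}:${id}`, current + 1);
}
```

The trade-off is extra storage for the orphaned entries until they expire, in exchange for O(1) invalidation.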
Transactional Outbox Pattern Implementation
// lib/outbox.ts - With batch processing and retry tracking
export async function processOutbox() {
  // Atomically claim a batch so concurrent workers can't double-publish
  const events = await db.$transaction(async (tx) => {
    const pending = await tx.outbox.findMany({
      where: { status: "PENDING" },
      take: 100,
      orderBy: { createdAt: "asc" },
    });
    await tx.outbox.updateMany({
      where: { id: { in: pending.map((e) => e.id) } },
      data: { status: "PROCESSING" },
    });
    return pending;
  });

  try {
    await eventBus.publishBatch(events);
    await db.outbox.updateMany({
      where: { id: { in: events.map((e) => e.id) } },
      data: { status: "COMPLETED" },
    });
  } catch (error) {
    await db.outbox.updateMany({
      where: { id: { in: events.map((e) => e.id) } },
      data: {
        status: "FAILED",
        retries: { increment: 1 },
        lastError: (error as Error).message,
      },
    });
  }
}
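How you drive the outbox relay is a separate choice; we poll rather than tail the database log. A minimal sketch of an interval poller — `startOutboxPoller` is a hypothetical helper, and the one thing it must get right is not letting a slow batch overlap the next tick:

```typescript
// Hypothetical poller: runs the relay on a fixed interval, skipping ticks
// while a previous batch is still in flight.
function startOutboxPoller(
  relay: () => Promise<void>,
  intervalMs = 1000
): () => void {
  let running = false;
  const timer = setInterval(async () => {
    if (running) return; // previous batch still in flight; skip this tick
    running = true;
    try {
      await relay();
    } finally {
      running = false;
    }
  }, intervalMs);
  return () => clearInterval(timer); // stop handle for graceful shutdown
}
```

Wiring it up is one line at startup — `const stopPoller = startOutboxPoller(processOutbox)` — and the returned stop handle slots neatly into the SIGTERM handler shown later.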
Deployment Tricks That Actually Scale
The Kubernetes Config That Works
# infra/k8s/payment-deployment.yaml
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  template:
    spec:
      containers:
        - name: payment-service
          resources:
            limits:
              memory: "512Mi"
              cpu: "1000m"
            requests:
              memory: "256Mi"
              cpu: "200m"
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
Pro Tips:
- Always set resource limits
- Use pod anti-affinity for critical services
- Implement graceful shutdown
// Graceful shutdown example
process.on("SIGTERM", async () => {
  logger.info("Starting graceful shutdown");
  // Stop accepting new connections and wait for in-flight requests to finish
  await new Promise<void>((resolve) => server.close(() => resolve()));
  logger.info("HTTP server closed");
  // Close database connections
  await db.$disconnect();
  // Flush metrics
  await metricsClient.flush();
  process.exit(0);
});
Observability: What's Worth Monitoring
The 4 Golden Signals Dashboard
// lib/metrics.ts - Enhanced with error tracking
export function trackCoreMetrics(app: Express) {
  const httpErrors = new Counter({
    name: "http_errors_total",
    help: "Total HTTP errors by status code",
    labelNames: ["status"],
  });

  app.use((req, res, next) => {
    const originalSend = res.send;
    res.send = function (body) {
      if (res.statusCode >= 400) {
        httpErrors.inc({ status: res.statusCode });
      }
      return originalSend.call(this, body);
    };
    next();
  });
}
Critical Metrics to Track:
- Error rates (4xx/5xx responses)
- Latency percentiles (p95, p99)
- System resources (CPU/Memory usage)
- Queue depths (for async processing)
- Cache hit ratios
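For the latency percentiles it helps to be precise about what p95 actually computes. A nearest-rank sketch — fine for ad-hoc analysis of captured samples, though in production you would use histogram buckets rather than keeping raw sample arrays:

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are <= it. Throws on an empty input rather than guessing.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(rank, sorted.length) - 1];
}
```

With 100 latency samples of 1..100 ms, `percentile(samples, 95)` picks the 95th smallest value, which is exactly why a single slow outlier barely moves p95 but dominates p99.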
Production Checklist I Wish I Had
Before Going Live:
- [ ] Implement circuit breakers for external calls
// lib/circuit-breaker.ts
const circuit = new CircuitBreaker(
  async (url) => {
    return axios.get(url);
  },
  {
    timeout: 3000, // Fail fast if the call takes longer than 3s
    errorThresholdPercentage: 50, // Open the circuit at 50% failures
    resetTimeout: 30000, // Try a probe request after 30s
  }
);
- [ ] Set up distributed tracing with 100% sampling
- [ ] Configure auto-scaling thresholds
- [ ] Test failure modes:
- Network partitions
- Database outages
- Third-party API failures
- [ ] Establish SLOs:
- 99.9% availability
- p95 latency < 500ms
- Error budget: 0.1% monthly
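Those SLO numbers translate directly into a downtime allowance, and it is worth doing the arithmetic before an incident forces you to. A back-of-the-envelope sketch:

```typescript
// How many minutes of downtime a given availability target permits
// over a window of N days.
function errorBudgetMinutes(availability: number, windowDays: number): number {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - availability);
}
```

At 99.9% over a 30-day month that is roughly 43 minutes — one bad deploy can burn the entire budget, which is the whole argument for the rollout and probe settings above.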
Post-Launch Monitoring:
# Useful kubectl commands
kubectl top pods --containers
kubectl logs -f deployment/payment-service --tail=100
kubectl describe pod payment-service-xyz
Remember: Microservices are like cats - they work best when you assume they'll misbehave. What war stories have you survived with distributed systems? Let's compare battle scars! 🔥