Microservices · Node.js · Architecture · DevOps

Building Large Node.js Microservices: What I Learned in 3 Years

My experience building and managing high-traffic distributed service systems

9 min read

Three years ago, my team broke a single large application apart into smaller services. I quickly discovered that distributed systems have failure modes the monolith never showed us. After plenty of late-night incidents and hard-won scaling lessons, here is what actually worked for our Node.js services handling more than 10,000 requests per minute.

Service Organization: What Actually Matters

The Directory Structure That Survived Countless Refactors

/src
  /services       # Individual microservices
  /lib            # Shared utilities
  /events         # Event schemas
  /types          # Type definitions
  /infra          # Deployment configs

Why I think this approach is effective:

  • Business logic (services) stays cleanly separated from infrastructure
  • Shared utilities keep us from duplicating the same code across services
  • Event schemas keep every service on the same data contract

Here is a practical example from our payment service:

// services/payment/src/index.ts
import express from "express";
import { initTracing } from "../lib/tracing";
// PaymentRequest comes from the shared /types definitions

const app = express();
const tracer = initTracing("payment-service");

async function processPayment(payload: PaymentRequest) {
  const span = tracer.startSpan("process-payment");
  try {
    // Payment logic
  } finally {
    span.end();
  }
}

// Health check endpoint for Kubernetes
app.get("/health", (req, res) => {
  res.json({
    status: "OK",
    version: process.env.APP_VERSION,
    uptime: process.uptime(),
  });
});
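
The initTracing helper imported above isn't shown in the original listing. A minimal sketch, assuming the service uses OpenTelemetry's @opentelemetry/api (the SDK, exporter, and instrumentations are configured separately at process start):

// lib/tracing.ts (sketch) - assumes OpenTelemetry's @opentelemetry/api
import { trace, Tracer } from "@opentelemetry/api";

export function initTracing(serviceName: string): Tracer {
  // Returns a named tracer; spans started from it are attributed to this service
  return trace.getTracer(serviceName);
}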

Communication Methods That Work Reliably

Synchronous vs Asynchronous: When to Choose Each One

// Synchronous: the caller needs the result before it can respond
app.post("/orders", async (req, res) => {
  const order = await createOrder(req.body);
  await inventoryService.reserveItems(order); // Direct HTTP call
  res.json(order);
});

// Asynchronous: side effects that can safely lag behind the request
eventBus.subscribe("order_created", async (event) => {
  await analyticsService.trackOrder(event); // Background processing
  await recommendationService.updateSuggestions(event.userId);
});

Important Things to Consider:

  • How important is data consistency?
  • How fast does the response need to be?
  • What happens if something fails?
  • Which team is responsible for each service?

Event-Driven Architecture: The Essential Building Blocks

// lib/event-bus.ts - With retry functionality added
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export class EventBus {
  private producers = new Map<string, Producer>();
  private retryPolicy = {
    maxRetries: 3,
    backoff: [1000, 5000, 10000], // Wait longer before each retry (ms)
  };

  async publish(topic: string, event: unknown) {
    let attempt = 0;
    while (attempt <= this.retryPolicy.maxRetries) {
      try {
        const producer = await this.getProducer(topic);
        await producer.send({ value: JSON.stringify(event) });
        return;
      } catch (error) {
        if (attempt === this.retryPolicy.maxRetries) throw error;
        await sleep(this.retryPolicy.backoff[attempt]);
        attempt++;
      }
    }
  }

  // Persist events that exhausted their retries so they can be replayed later
  private async handleFailedEvents(event: FailedEvent) {
    await db.failedEvents.create({
      data: {
        payload: JSON.stringify(event),
        error: event.error.message,
        retryCount: 0,
      },
    });
  }
}
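
Publishing from the order service then looks like this (the event fields here are illustrative):

// Inside the order creation flow
await eventBus.publish("order_created", {
  orderId: order.id,
  userId: order.userId,
  createdAt: new Date().toISOString(),
});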

Managing Data: What I Learned From Mistakes

The Caching Approach That Protected Our Database

// services/product/src/cache.ts - Prevents duplicate requests from hitting the database
// Note: cache here is a thin caching wrapper around Redis (cache-manager-style
// options); the bare Redis client does not take these options directly
const cache = new Redis({
  ttl: 30,
  allowStale: true, // Serve a stale entry while a refresh is in flight
  refreshThreshold: 5000, // Refresh shortly before expiry
});

async function getProduct(id: string) {
  return await cache.wrap(
    `product:${id}`,
    async () => {
      const product = await db.products.findUnique({ where: { id } });
      if (!product) throw new NotFoundError("Product not found");
      return product;
    },
    {
      ttl: 60, // Keep successful results longer
    }
  );
}

Cache Invalidation Strategies:

  1. Time-based expiry (TTL)
  2. Write-through updates when the underlying data changes
  3. Event-driven invalidation (sketched below)
  4. Versioned cache keys
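
For event-driven invalidation (option 3), a minimal sketch, assuming the EventBus from earlier publishes a product_updated topic and the cache wrapper exposes a del method:

// services/product/src/cache-invalidation.ts (sketch)
eventBus.subscribe("product_updated", async (event: { productId: string }) => {
  // Drop the stale entry; the next getProduct call repopulates it via cache.wrap
  await cache.del(`product:${event.productId}`);
});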

The Outbox Pattern: Keeping Database Writes and Events in Sync

// lib/outbox.ts - With batch processing and dead letter queue
export async function processOutbox() {
  const events = await db.$transaction(async (tx) => {
    const events = await tx.outbox.findMany({
      where: { status: "PENDING" },
      take: 100,
      orderBy: { createdAt: "asc" },
    });

    await tx.outbox.updateMany({
      where: { id: { in: events.map((e) => e.id) } },
      data: { status: "PROCESSING" },
    });

    return events;
  });

  try {
    await eventBus.publishBatch(events);
    await db.outbox.updateMany({
      where: { id: { in: events.map((e) => e.id) } },
      data: { status: "COMPLETED" },
    });
  } catch (error) {
    await db.outbox.updateMany({
      where: { id: { in: events.map((e) => e.id) } },
      data: {
        status: "FAILED",
        retries: { increment: 1 },
        lastError: error.message,
      },
    });
  }
}
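
Something has to drive the processor. A minimal sketch using a plain interval (the schedule and logger are assumptions; a cron job or dedicated worker works just as well):

// lib/outbox-worker.ts (sketch)
setInterval(() => {
  processOutbox().catch((error) => {
    // Never let one bad batch kill the worker
    logger.error("outbox processing failed", { error });
  });
}, 1000); // poll every second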

Deployment Techniques That Work at Scale

The Kubernetes Configuration That I Found Reliable

# infra/k8s/payment-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  selector:
    matchLabels:
      app: payment-service
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          image: payment-service:latest # placeholder image
          ports:
            - containerPort: 3000
          resources:
            limits:
              memory: "512Mi"
              cpu: "1000m"
            requests:
              memory: "256Mi"
              cpu: "200m"
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
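
The readiness probe above points at a /ready endpoint that differs from /health on purpose: a pod should only receive traffic once its dependencies are reachable. A minimal sketch, assuming the Prisma-style db client used elsewhere in this post:

// Readiness endpoint (sketch)
app.get("/ready", async (req, res) => {
  try {
    await db.$queryRaw`SELECT 1`; // cheap connectivity check
    res.json({ status: "READY" });
  } catch {
    res.status(503).json({ status: "NOT_READY" });
  }
});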

Important Recommendations:

  • Always define resource requests and limits
  • Use pod anti-affinity for critical workloads
  • Make sure your application shuts down gracefully, as in the handler below:
// Example of graceful shutdown handling
process.on("SIGTERM", async () => {
  logger.info("Starting graceful shutdown");

  // Stop accepting new connections and wait for in-flight requests to finish
  await new Promise<void>((resolve) => {
    server.close(() => {
      logger.info("HTTP server closed");
      resolve();
    });
  });

  // Close database connections
  await db.$disconnect();

  // Flush any buffered metrics
  await metricsClient.flush();

  process.exit(0);
});

Monitoring: What You Should Actually Watch

The Essential Metrics Dashboard

// lib/metrics.ts - With error counting functionality
import { Counter } from "prom-client";
import type { Express } from "express";

export function trackCoreMetrics(app: Express) {
  const httpErrors = new Counter({
    name: "http_errors_total",
    help: "Total HTTP errors by status code",
    labelNames: ["status"],
  });

  // Wrap res.send so every 4xx/5xx response increments the counter
  app.use((req, res, next) => {
    const originalSend = res.send;
    res.send = function (body) {
      if (res.statusCode >= 400) {
        httpErrors.inc({ status: res.statusCode });
      }
      return originalSend.call(this, body);
    };
    next();
  });
}

Key Metrics to Monitor:

  1. Error rates (4xx/5xx responses)
  2. Response time percentiles (p95, p99; see the histogram sketch below)
  3. System resource usage (CPU/memory)
  4. Queue depth (for background processing)
  5. Cache hit ratios
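
For the latency percentiles in item 2, a histogram is the usual tool. A sketch assuming prom-client, which the Counter above appears to come from:

// lib/metrics.ts (sketch) - request duration histogram for p95/p99
import { Histogram } from "prom-client";

const httpDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "route", "status"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

app.use((req, res, next) => {
  const stopTimer = httpDuration.startTimer();
  res.on("finish", () => {
    stopTimer({
      method: req.method,
      route: req.route?.path ?? req.path,
      status: res.statusCode,
    });
  });
  next();
});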

The Production Readiness Checklist I Wish I Had Earlier

Before Going Live:

  1. [ ] Add circuit breakers for external API calls (usage sketch after this list)
// lib/circuit-breaker.ts
import CircuitBreaker from "opossum"; // assuming the opossum package, whose options match those below
import axios from "axios";

const circuit = new CircuitBreaker(
  async (url: string) => {
    return axios.get(url);
  },
  {
    timeout: 3000, // Fail calls that take longer than 3 seconds
    errorThresholdPercentage: 50, // Open the circuit once half of recent calls fail
    resetTimeout: 30000, // Attempt a half-open probe after 30 seconds
  }
);
  2. [ ] Configure distributed request tracing with full sampling
  3. [ ] Set up autoscaling rules
  4. [ ] Test failure scenarios:
    • Network connectivity failures
    • Database becoming unavailable
    • External API outages
  5. [ ] Define service level objectives:
    • 99.9% availability
    • p95 response time under 500ms
    • Error budget: 0.1% per month
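
For the circuit breaker in item 1, calls go through the breaker instead of hitting axios directly. An illustrative usage, assuming the opossum-style API above (the URL and fallback payload are placeholders):

// Serve a degraded response while the breaker is open
circuit.fallback(() => ({ data: { status: "degraded" } }));

// fire() runs the wrapped function; it rejects fast once the breaker opens
const response = await circuit.fire("https://inventory.internal/api/stock");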

Post-Launch Monitoring:

# Helpful kubectl commands I use regularly
kubectl top pods --containers
kubectl logs -f deployment/payment-service --tail=100
kubectl describe pod payment-service-xyz

The biggest lesson: distributed systems fail in ways you never anticipate, so build every service expecting things to break. I'd love to hear about your own experiences with distributed systems and the challenges you ran into!