12 Jan 2020 • 2 min read

Apollo Federation beyond the basics

Apollo provides a solid foundation for building a federated graph, but it offers little observability or tracing out of the box. Here is how we added both.

Federating a GraphQL gateway is straightforward with Apollo — you point it at your subgraphs, and requests start flowing. What’s harder is answering the questions that come later:

Why is this query slow?
Which subgraph is the bottleneck?
How many requests are failing in the “reviews” service specifically?

Out of the box, Apollo Federation gives you little help answering these questions. To run it in production, you need observability.

Step 1: metrics and traces

We added Datadog’s GraphQL integration to our Apollo Gateway and subgraphs. With it, every query execution is automatically traced, tagged with operation name and subgraph, and shipped to Datadog. Suddenly, instead of staring at black boxes, we had flame graphs showing time spent in users vs products vs checkout.
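To make the tagging concrete, here is a minimal sketch of the kind of data that ends up behind those flame graphs. It is not Datadog's API: `recordTiming`, `summarize`, the `CartPage` operation, and the durations are all hypothetical, chosen only to show how timings tagged with operation name and subgraph roll up into a per-subgraph breakdown.

```typescript
// Hypothetical in-memory recorder; a real setup ships these to a tracing backend.
type FetchTiming = { operation: string; subgraph: string; durationMs: number };

const timings: FetchTiming[] = [];

// Record one subgraph fetch, tagged with the operation that triggered it.
function recordTiming(operation: string, subgraph: string, durationMs: number): void {
  timings.push({ operation, subgraph, durationMs });
}

// Total time spent in each subgraph for a given operation.
function summarize(operation: string): Map<string, number> {
  const totals = new Map<string, number>();
  for (const t of timings) {
    if (t.operation !== operation) continue;
    totals.set(t.subgraph, (totals.get(t.subgraph) ?? 0) + t.durationMs);
  }
  return totals;
}

// Made-up sample data for one operation.
recordTiming("CartPage", "users", 12);
recordTiming("CartPage", "products", 48);
recordTiming("CartPage", "checkout", 20);
recordTiming("CartPage", "products", 30);

const byOperation = summarize("CartPage");
```

Grouping by both tags is what lets a dashboard answer "which subgraph is the bottleneck for this operation" rather than only "the gateway is slow."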

Step 2: distributed tracing across subgraphs

In Federation, a single client query might hit four subgraphs in sequence. Without distributed tracing, you only see the gateway’s perspective. With tracing enabled, you see the entire call tree. This helped us uncover that what looked like a “slow product query” was actually time spent waiting on the inventory subgraph.
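A small sketch of why the call tree matters: given a trace as a tree of spans, subtracting each span's child time from its own duration shows where the time actually went. The `Span` shape, the function name, and the numbers are hypothetical, and the calculation assumes child spans run one after another.

```typescript
// Hypothetical span shape; real tracers carry more fields (ids, tags, errors).
type Span = { service: string; durationMs: number; children: Span[] };

// Self time = a span's duration minus the time spent waiting on its children.
// Assumes children run sequentially, so their durations can simply be summed.
function selfTimeByService(
  root: Span,
  totals: Map<string, number> = new Map<string, number>()
): Map<string, number> {
  const childTime = root.children.reduce((sum, c) => sum + c.durationMs, 0);
  const self = Math.max(0, root.durationMs - childTime);
  totals.set(root.service, (totals.get(root.service) ?? 0) + self);
  for (const child of root.children) selfTimeByService(child, totals);
  return totals;
}

// Made-up trace: from the gateway alone, this is just a 420 ms product query.
const exampleTrace: Span = {
  service: "gateway",
  durationMs: 420,
  children: [
    { service: "products", durationMs: 30, children: [] },
    { service: "inventory", durationMs: 370, children: [] },
  ],
};

const selfTimes = selfTimeByService(exampleTrace);
```

In this made-up trace, almost all of the time lands on `inventory`, which is the same shape of finding as the one described above.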

We connected Apollo’s built-in tracing extension to OpenTelemetry, which Datadog ingests natively. Here’s a minimal gateway config:

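import { ApolloServer } from "apollo-server";
import { ApolloGateway } from "@apollo/gateway";
import { ApolloServerPluginInlineTrace } from "apollo-server-core";
import { datadogPlugin } from "apollo-server-datadog";

// The gateway composes the subgraphs listed here into a single schema.
const gateway = new ApolloGateway({
  serviceList: [
    { name: "users", url: "http://localhost:4001" },
    { name: "products", url: "http://localhost:4002" },
  ],
});

const server = new ApolloServer({
  gateway,
  // Subscriptions are not supported through the gateway, so turn them off.
  subscriptions: false,
  plugins: [
    // Apollo's built-in tracing plugin.
    ApolloServerPluginInlineTrace(),
    // Sends traces to Datadog; the API key is read from the environment.
    datadogPlugin({ apiKey: process.env.DD_API_KEY }),
  ],
});

server
  .listen()
  .then(({ url }) => console.log(`🚀 Gateway at ${url}`))
  .catch((err) => {
    // Fail loudly if the gateway cannot start (for example, a subgraph is unreachable).
    console.error("Gateway failed to start", err);
    process.exit(1);
  });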
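The config above covers the gateway side. For the call tree to extend into the subgraphs, each gateway-to-subgraph request also has to carry the trace context. Tracing libraries normally inject this for you, so the sketch below only illustrates the idea by building a W3C Trace Context `traceparent` header by hand. `randomHex`, `buildTraceparent`, and `childHeaders` are hypothetical helpers, and whether your Datadog setup propagates this header or its own format is an assumption to verify.

```typescript
// Random lowercase hex string; good enough for a sketch, not for production ids.
function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// W3C Trace Context format: version-traceId-parentId-flags, all lowercase hex.
function buildTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  if (!/^[0-9a-f]{32}$/.test(traceId)) throw new Error("traceId must be 32 hex chars");
  if (!/^[0-9a-f]{16}$/.test(spanId)) throw new Error("spanId must be 16 hex chars");
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

// Each outgoing subgraph call gets a fresh span id; the trace id stays the same,
// which is what lets the backend stitch the calls into one tree.
function childHeaders(traceId: string, sampled: boolean): Record<string, string> {
  return { traceparent: buildTraceparent(traceId, randomHex(16), sampled) };
}

const traceId = randomHex(32);
const usersHeaders = childHeaders(traceId, true);
const productsHeaders = childHeaders(traceId, true);
```

If you did need to attach headers like these manually in Apollo Gateway, the `willSendRequest` hook on `RemoteGraphQLDataSource` is the usual place to do it.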

Step 3: trade-offs

Every trace adds latency (roughly 2–5 ms per request in our case). That was fine for us, but it is worth measuring in your own setup.
Traces can get large, especially with nested federated queries, so we sampled aggressively.
Observability is only useful if engineers look at the dashboards. We built alerts around P95 latency and error rates to surface problems early.
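Sampling and P95 alerting both come down to small pieces of logic. As an illustration only (not our production code), here is a sketch of each. `shouldSample`, `percentile`, the sample latencies, and the 500 ms threshold are hypothetical; in practice the tracer and the monitoring tool handle both.

```typescript
// Deterministic sampling: hash the trace id so that every service makes the
// same keep/drop decision for a given trace.
function shouldSample(traceId: string, rate: number): boolean {
  let hash = 0;
  for (let i = 0; i < traceId.length; i++) {
    hash = (hash * 31 + traceId.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return hash / 0x100000000 < rate;
}

// Nearest-rank percentile over a window of request latencies.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Made-up latency window with one slow outlier.
const latenciesMs = [80, 95, 110, 120, 130, 140, 150, 180, 220, 900];
const p95 = percentile(latenciesMs, 95);
const shouldAlert = p95 > 500; // hypothetical alert threshold
```

Hashing the trace id instead of rolling a random number per service avoids partial traces, where the gateway keeps a trace but a subgraph drops its half. The outlier in the sample window also shows why we alert on P95 rather than the median: the median here is 130 ms and would hide it.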

The impact

With observability in place, we stopped guessing. One concrete win: we found a hot path where the gateway was making N+1 calls to the reviews subgraph. Without tracing, it just looked like “queries are slow.” With tracing, the call tree made the problem obvious. Fixing it cut median latency in half.
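"Obvious in the call tree" here means one subgraph showing up many times inside a single trace. A minimal sketch of that check, using a hypothetical `FetchSpan` shape, made-up data, and an arbitrary threshold:

```typescript
// Hypothetical flattened view of one trace: one entry per gateway-to-subgraph fetch.
type FetchSpan = { subgraph: string };

// Count fetches per subgraph within one trace and report those above a threshold.
function findRepeatedFetches(spans: FetchSpan[], threshold: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (const span of spans) {
    counts.set(span.subgraph, (counts.get(span.subgraph) ?? 0) + 1);
  }
  const suspects = new Map<string, number>();
  counts.forEach((count, subgraph) => {
    if (count > threshold) suspects.set(subgraph, count);
  });
  return suspects;
}

// Made-up trace: one products fetch followed by one reviews fetch per product.
const fetches: FetchSpan[] = [{ subgraph: "products" }];
for (let i = 0; i < 25; i++) {
  fetches.push({ subgraph: "reviews" });
}

const suspects = findRepeatedFetches(fetches, 5);
```

A per-request latency metric cannot show this pattern, because each individual reviews call may be fast. Only the count within a single trace gives it away.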

Federation unlocks scale, but it also hides complexity. Adding observability and tracing is how you take it from “it works” to “we understand it.”