Apollo Federation beyond the basics
Apollo provides a great foundation for building a federated graph, but by default it offers little in the way of observability and tracing. This post walks through how we added both.
Standing up a federated GraphQL gateway is straightforward with Apollo: you point it at your subgraphs, and requests start flowing. What’s harder is answering the questions that come later:
- Why is this query slow?
- Which subgraph is the bottleneck?
- How many requests are failing in the “reviews” service specifically?
Out of the box, Apollo Federation doesn’t give you much to answer these questions with. To make it production-grade, you need observability.
Step 1: metrics and traces
We added Datadog’s GraphQL integration to our Apollo Gateway and subgraphs. With it, every query execution is automatically traced, tagged with operation name and subgraph, and shipped to Datadog. Suddenly, instead of staring at black boxes, we had flame graphs showing time spent in users vs products vs checkout.
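To make the per-subgraph breakdown concrete, here is a minimal sketch of the aggregation that a flame graph shows visually. The `Span` shape and the `timeBySubgraph` and `slowestSubgraph` helpers are hypothetical and are not Datadog's data model. They only assume that each span carries the operation name and subgraph tags mentioned above, and the sample data is made up.

```typescript
// Hypothetical span shape: one entry per subgraph fetch made for a query.
interface Span {
  operationName: string; // GraphQL operation name, e.g. "ProductPage"
  subgraph: string;      // subgraph that served the fetch
  durationMs: number;    // time spent in that fetch
}

// Sum the time spent in each subgraph for a single operation.
function timeBySubgraph(spans: Span[], operationName: string): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const span of spans) {
    if (span.operationName !== operationName) continue;
    totals[span.subgraph] = (totals[span.subgraph] || 0) + span.durationMs;
  }
  return totals;
}

// Return the subgraph with the largest total, or null when there are no spans.
function slowestSubgraph(totals: Record<string, number>): string | null {
  let slowest: string | null = null;
  for (const subgraph of Object.keys(totals)) {
    if (slowest === null || totals[subgraph] > totals[slowest]) slowest = subgraph;
  }
  return slowest;
}

// Made-up sample data for illustration only.
const sampleSpans: Span[] = [
  { operationName: "ProductPage", subgraph: "users", durationMs: 12 },
  { operationName: "ProductPage", subgraph: "products", durationMs: 40 },
  { operationName: "ProductPage", subgraph: "products", durationMs: 25 },
  { operationName: "Cart", subgraph: "checkout", durationMs: 90 },
];
const productPageTotals = timeBySubgraph(sampleSpans, "ProductPage");
```

Summing durations like this overstates wall-clock time when fetches run in parallel, so treat the totals as a rough ranking rather than a latency figure.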
Step 2: distributed tracing across subgraphs
In Federation, a single client query might hit four subgraphs in sequence. Without distributed tracing, you only see the gateway’s perspective. With tracing enabled, you see the entire call tree. This helped us uncover that what looked like a “slow product query” was actually time spent waiting on the inventory subgraph.
We wired Apollo’s built-in tracing extension with OpenTelemetry, which Datadog ingests natively. Here’s a minimal gateway config:
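import { ApolloServer } from "apollo-server";
import { ApolloGateway } from "@apollo/gateway";
import { ApolloServerPluginInlineTrace } from "apollo-server-core";
import { datadogPlugin } from "apollo-server-datadog";

// List the subgraphs the gateway composes. Newer @apollo/gateway releases
// deprecate serviceList in favor of supergraphSdl.
const gateway = new ApolloGateway({
  serviceList: [
    { name: "users", url: "http://localhost:4001" },
    { name: "products", url: "http://localhost:4002" },
  ],
});

const server = new ApolloServer({
  gateway,
  // Apollo Server 2 option: the gateway does not support subscriptions.
  subscriptions: false,
  plugins: [
    // Include Apollo's inline trace data in responses.
    ApolloServerPluginInlineTrace(),
    // Ship traces to Datadog; the API key comes from the environment.
    datadogPlugin({ apiKey: process.env.DD_API_KEY }),
  ],
});

server.listen().then(({ url }) => console.log(`🚀 Gateway at ${url}`));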
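Seeing the entire call tree depends on every hop sharing one trace ID. The sketch below shows that idea using the W3C Trace Context `traceparent` header, which OpenTelemetry propagates by default. It is an illustration under assumptions, not our gateway code: the helper names are hypothetical, and it only handles version `00` of the header.

```typescript
// Parsed form of a W3C traceparent header (version 00 only in this sketch).
interface TraceContext {
  traceId: string;  // 32 lowercase hex chars, shared by every span in the trace
  parentId: string; // 16 lowercase hex chars, the span that made the call
  sampled: boolean; // lowest bit of the trace-flags byte
}

const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parse an incoming header; returns null for anything malformed.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_V00.exec(header.trim());
  if (match === null) return null;
  // All-zero IDs are invalid per the spec.
  if (/^0+$/.test(match[1]) || /^0+$/.test(match[2])) return null;
  return {
    traceId: match[1],
    parentId: match[2],
    sampled: (parseInt(match[3], 16) & 1) === 1,
  };
}

// Serialize a context back into a header value.
function formatTraceparent(ctx: TraceContext): string {
  return "00-" + ctx.traceId + "-" + ctx.parentId + "-" + (ctx.sampled ? "01" : "00");
}

// Context to send to a subgraph: same trace ID, the gateway's span as parent.
function childContext(incoming: TraceContext, gatewaySpanId: string): TraceContext {
  return { traceId: incoming.traceId, parentId: gatewaySpanId, sampled: incoming.sampled };
}
```

In practice OpenTelemetry's HTTP instrumentation normally does this propagation for you. If you had to forward the header by hand in a gateway, the `willSendRequest` hook of `RemoteGraphQLDataSource` is one place it could be set.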
Step 3: trade-offs
- Every trace adds latency (roughly 2–5 ms per request in our setup). Fine for us, but worth measuring in yours.
- Traces can get big, especially with nested federated queries. We sampled aggressively.
- Observability is only useful if engineers look at the dashboards. We built alerts around P95 latency and error rates to surface problems early.
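Two of the points above lend themselves to a short sketch: sampling, and a P95 check of the kind an alert might evaluate. This is an illustration under assumptions rather than how Datadog implements either. `shouldSample` derives its decision from the trace ID so that every service keeps or drops the same trace, and `percentile` uses the nearest-rank method. The function names, the sample rate, and the latency budget are all hypothetical.

```typescript
// Deterministic head sampling: map the last 8 hex chars of the trace ID to
// a number in [0, 1) and keep the trace when it falls below the rate.
function shouldSample(traceId: string, rate: number): boolean {
  if (rate <= 0) return false;
  if (rate >= 1) return true;
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return bucket < rate; // NaN (malformed ID) compares false, so it is dropped
}

// Nearest-rank percentile: the smallest sample such that at least p percent
// of the samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile needs at least one sample");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Alert condition: does P95 latency exceed the budget (in milliseconds)?
function exceedsLatencyBudget(latenciesMs: number[], budgetMs: number): boolean {
  return percentile(latenciesMs, 95) > budgetMs;
}
```

Because the sampling decision depends only on the trace ID, a trace kept at the gateway is also kept in the subgraphs, so sampled call trees stay complete.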
The impact
With observability in place, we stopped guessing. One concrete win: we found a hot path where the gateway was making N+1 calls to the reviews subgraph. Without tracing, it just looked like “queries are slow.” With tracing, the call tree made the problem obvious. Fixing it cut median latency in half.
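An N+1 pattern shows up in a trace as the same fetch repeated many times against one subgraph. As a rough sketch of how that could be spotted programmatically, the snippet below counts repeated fetches within a single trace. The `FetchSpan` shape, the `findRepeatedFetches` helper, and the threshold are hypothetical; we found ours by reading the call tree, not with this code.

```typescript
// Hypothetical record of one gateway-to-subgraph fetch within a single trace.
interface FetchSpan {
  subgraph: string;  // e.g. "reviews"
  operation: string; // label for the fetch the gateway sent
}

interface RepeatedFetch {
  subgraph: string;
  operation: string;
  calls: number;
}

// Group the spans of one trace by (subgraph, operation) and report groups
// that repeat at least `threshold` times, most frequent first.
function findRepeatedFetches(spans: FetchSpan[], threshold: number): RepeatedFetch[] {
  const groups: Record<string, RepeatedFetch> = {};
  for (const span of spans) {
    const key = span.subgraph + "\u0000" + span.operation;
    if (groups[key] === undefined) {
      groups[key] = { subgraph: span.subgraph, operation: span.operation, calls: 0 };
    }
    groups[key].calls += 1;
  }
  return Object.keys(groups)
    .map((key) => groups[key])
    .filter((group) => group.calls >= threshold)
    .sort((a, b) => b.calls - a.calls);
}
```

A single fetch repeated once per parent item is the signature to look for; batching those lookups into one request is the usual remedy.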
Federation unlocks scale, but it also hides complexity. Adding observability and tracing is how you take it from “it works” to “we understand it.”