Hot Path, Cold Path: Designing a Telemetry Ingestion System for 1M+ Connected Devices



Context

Connected HVAC equipment — commercial chillers, air handlers, heat pumps spread across thousands of sites globally — generates a continuous stream of telemetry: sensor readings, operational states, fault codes, energy consumption data. The volume is relentless and non-negotiable.

The brief: build a platform that could ingest this reliably at scale, store it efficiently, and make it queryable for alerting, dashboards, and energy reporting.

Scale and Constraints

At design time: roughly one million devices in active operation, each emitting telemetry at variable intervals. Some every 30 seconds, some every few minutes. Ingest peaks during business hours and weather events — spikes you can’t predict.

Latency requirements weren’t uniform:

  • Fault alerting: sub-minute end-to-end
  • Operational dashboards: near real-time, seconds
  • Energy reporting: batch, hourly or daily aggregations

This tiered requirement shaped every architecture decision that followed.

The Hot/Cold Path Split

The core design decision was splitting ingestion into two explicit paths based on latency requirements.

Kafka
  ├── Hot path  → real-time write to TimescaleDB + Redis fan-out
  └── Cold path → rules engine, aggregations, deferred processing

Hot path messages are written to TimescaleDB immediately. Cold path messages flow through a separate topic and are processed by a rules engine that can tolerate latency.

In NestJS, two separate @EventPattern handlers manage this:

@EventPattern(TELEMETRY_DATA_GLOBAL_TOPIC_COMBINED)
@TrackKafkaMetrics(TELEMETRY_DATA_GLOBAL_TOPIC_COMBINED)
async handleGlobalTelemetryTopicCombined(
  @Payload() payload: AggregatedTelemetryPayload
) {
  try {
    await this.telemetryService.processTelemetryCombined(payload);
  } catch (error) {
    this.sendToDlqAsync(TELEMETRY_DATA_GLOBAL_TOPIC_COMBINED, payload, error);
  }
}

@EventPattern(TELEMETRY_DATA_GLOBAL_COLD_PATH_TOPIC)
@TrackKafkaMetrics(TELEMETRY_DATA_GLOBAL_COLD_PATH_TOPIC)
@TrackColdPathMetrics()
async handleGlobalColdPathTelemetryTopic(
  @Payload() payload: InputColdPathTelemetryPayload
) {
  try {
    await this.telemetryService.processColdPathTelemetry(payload);
  } catch (error) {
    this.sendToDlqAsync(TELEMETRY_DATA_GLOBAL_COLD_PATH_TOPIC, payload, error);
  }
}

The @TrackColdPathMetrics decorator tracks cold path-specific metrics separately from the hot path — useful for understanding rules engine load independently of ingestion throughput.
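The decorator's internals aren't shown in this post, but the core idea can be sketched as a plain wrapper function (all names here are illustrative): record each handler invocation's duration under a named metric, so cold path timings accumulate separately from hot path ones.

```typescript
// Illustrative sketch of the metrics idea behind @TrackColdPathMetrics,
// written as a plain higher-order function rather than a NestJS decorator.
// Durations land in an in-memory store; a real system would push them to
// a metrics backend instead.
const durationsMs: Record<string, number[]> = {};

function withDurationMetric<A extends unknown[], R>(
  metric: string,
  fn: (...args: A) => Promise<R>
): (...args: A) => Promise<R> {
  return async (...args: A) => {
    const start = Date.now();
    try {
      return await fn(...args);
    } finally {
      // Record the duration even when the handler throws.
      (durationsMs[metric] ??= []).push(Date.now() - start);
    }
  };
}
```

The decorator form does the same thing declaratively at the method level, which is why rules engine load can be read off independently of ingestion throughput.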

Storage: TimescaleDB

We chose TimescaleDB (built on PostgreSQL) as the primary time-series store. The decision came down to:

  • Native time-series optimisations — hypertables partition data by time automatically
  • Continuous aggregates for pre-computing roll-ups without external batch jobs
  • SQL — familiar to the whole team, compatible with existing tooling
  • Compression policies that automatically compress older chunks, significantly reducing storage costs

We maintain multiple granularities: raw event data, 5-minute aggregates, hourly aggregates. The continuous aggregate pipeline runs automatically and keeps reporting queries fast without scanning raw partitions.
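As a rough illustration of what that pipeline looks like in TimescaleDB, here is a 5-minute continuous aggregate with a refresh policy and a compression policy. Table and column names are assumptions, not the production schema:

```sql
-- Illustrative: 5-minute roll-up over a raw telemetry hypertable.
CREATE MATERIALIZED VIEW telemetry_5m
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('5 minutes', ts) AS bucket,
  device_id,
  avg(value) AS avg_value,
  max(value) AS max_value
FROM telemetry_raw
GROUP BY bucket, device_id;

-- Keep the aggregate fresh without external batch jobs.
SELECT add_continuous_aggregate_policy('telemetry_5m',
  start_offset      => INTERVAL '1 hour',
  end_offset        => INTERVAL '5 minutes',
  schedule_interval => INTERVAL '5 minutes');

-- Compress raw chunks once they age out of the hot query window.
ALTER TABLE telemetry_raw SET (timescaledb.compress);
SELECT add_compression_policy('telemetry_raw', INTERVAL '7 days');
```

Reporting queries then hit `telemetry_5m` (or an hourly aggregate built the same way) instead of scanning raw partitions.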

Reliability: DLQ and Republish Queue

Consumer failures happen. Our reliability strategy had two components.

Dead letter queue. When a message fails processing, it’s sent to ${baseTopic}.dlq asynchronously — fire-and-forget, so the failure doesn’t block the next message:

private sendToDlqAsync(baseTopic: string, payload: any, error: Error): void {
  const dlqTopic = `${baseTopic}.dlq`;
  const dlqMessage = {
    error: error.message,
    originalPayload: payload,
    timestamp: new Date().toISOString(),
  };

  const operationId = this.memoryMonitorService.trackAsyncOperationStart('dlq');
  this.kafkaService
    .sendToDlq(dlqTopic, dlqMessage)
    .catch((dlqError) => {
      this.logger.error('Failed to send message to DLQ:', dlqError);
    })
    .finally(() => {
      this.memoryMonitorService.trackAsyncOperationEnd(operationId);
    });
}

The fire-and-forget pattern is intentional. Blocking the consumer while a DLQ write is in-flight means one bad message can halt an entire partition. We accept the risk that a DLQ write might fail; that failure is logged and monitored separately.

BullMQ republish queue. A separate worker processes the DLQ and retries failed messages with backoff. This decouples retry logic from the main consumer and makes retry behaviour independently configurable.
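The retry policy itself is simple to state. The production worker runs on BullMQ over Redis; the sketch below strips that away and shows only the policy in isolation (exponential backoff, capped delay, bounded attempts). All names are illustrative:

```typescript
// Exponential backoff: attempt 1 → 1s, attempt 2 → 2s, attempt 3 → 4s, …
// capped so a long-failing message never waits more than capMs.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60_000): number {
  return Math.min(baseMs * 2 ** (attempt - 1), capMs);
}

// Retry an async operation up to maxAttempts times, sleeping between
// attempts. The sleep function is injectable so tests can skip real waits.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms))
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

In BullMQ this policy is expressed as job options (`attempts` plus an exponential `backoff`) rather than a hand-rolled loop, which is what makes retry behaviour configurable independently of the consumer.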

Redis Subscriptions

Some consumers — primarily the dashboard layer — need push notifications when specific devices emit new telemetry. Rather than polling TimescaleDB, we maintain Redis subscriptions: when a hot path event arrives, the subscription service fans out to any registered listeners for that device.

This avoids read amplification: one ingest event is pushed to N subscribers, rather than N subscribers polling TimescaleDB independently.
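The shape of that fan-out is easy to show in miniature. In production it sits on Redis pub/sub; the in-memory sketch below (illustrative names) captures the same contract: dashboards subscribe to a device ID, and each hot path event is delivered to every registered listener.

```typescript
// In-memory sketch of per-device fan-out. Production uses Redis pub/sub,
// but the subscribe/publish contract is the same.
type Telemetry = { deviceId: string; metric: string; value: number };
type Listener = (event: Telemetry) => void;

class DeviceFanout {
  private listeners = new Map<string, Set<Listener>>();

  // Returns an unsubscribe handle so dashboard sessions can detach cleanly.
  subscribe(deviceId: string, listener: Listener): () => void {
    if (!this.listeners.has(deviceId)) this.listeners.set(deviceId, new Set());
    this.listeners.get(deviceId)!.add(listener);
    return () => this.listeners.get(deviceId)?.delete(listener);
  }

  // Delivers one event to all listeners for that device; returns the
  // number of subscribers reached (one write fans out to N readers).
  publish(event: Telemetry): number {
    const subs = this.listeners.get(event.deviceId);
    if (!subs) return 0;
    for (const listener of subs) listener(event);
    return subs.size;
  }
}
```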

What Was Hard

Device heterogeneity. Every HVAC manufacturer uses slightly different field names for the same logical reading. One device sends rt for indoor temperature, another sends actualTemperature, another sends indoor_temp. The normalisation layer that handles this gracefully turned out to be a whole separate engineering project — enough for its own post.
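At its simplest, that normalisation layer is an alias table mapping each manufacturer's field name onto a canonical reading name. The sketch below uses the three aliases from the example above; the real mapping is far larger and the table shown here is illustrative:

```typescript
// Hypothetical slice of the field-normalisation layer: map vendor-specific
// field names onto canonical ones. The alias table is illustrative.
const FIELD_ALIASES: Record<string, string> = {
  rt: 'indoorTemperature',
  actualTemperature: 'indoorTemperature',
  indoor_temp: 'indoorTemperature',
};

function normaliseReading(raw: Record<string, number>): Record<string, number> {
  const out: Record<string, number> = {};
  for (const [key, value] of Object.entries(raw)) {
    // Unknown fields pass through unchanged so new device types
    // don't silently drop data before an alias is added.
    out[FIELD_ALIASES[key] ?? key] = value;
  }
  return out;
}
```

The hard parts, and what made it a project of its own, are everything this sketch omits: per-manufacturer unit conversions, conflicting aliases, and fields whose meaning depends on device state.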

Outcome

  • Telemetry from 1M+ devices processed daily in production
  • Sub-minute fault alerting end-to-end via the hot path
  • Energy reporting queries that previously scanned billions of rows now return in under 5 seconds via continuous aggregates
  • Clean separation between latency-sensitive and deferred processing, with independent scaling for each path

Building something in this space? Reach out.