Ingest and Visualization for OpenTelemetry Metrics

Published January 23, 2025


by Vadim Korolik

LaunchDarkly is an [open source](https://github.com/highlight/highlight) monitoring platform. If you're interested in learning more, get started at [LaunchDarkly](https://launchdarkly.com).

OpenTelemetry Metrics

OpenTelemetry (OTel) is becoming the de facto standard for observability, providing a unified way to collect, process, and export telemetry data, including traces, logs, and metrics. While traces and logs are crucial for debugging, metrics offer a high-level view of system performance and health. Efficiently storing and querying these metrics is essential for real-time insights, and ClickHouse, a high-performance columnar database, provides an ideal backend for scalable and cost-effective metric ingestion.

At LaunchDarkly, we recently introduced support for OTel metrics ingestion. Below, we describe how we structured the implementation to deliver an efficient OpenTelemetry metrics pipeline using ClickHouse, covering ingestion, aggregation, querying, and visualization.

OTel Metrics Formats

OpenTelemetry metrics are designed to be flexible, supporting various aggregation and encoding formats. The key formats include:

• Gauge: Represents a single numerical value that changes over time, such as CPU usage or memory consumption.
• Counter: A monotonically increasing value, commonly used for request counts or error rates.
• Histogram: Captures the distribution of values over a given time period, useful for tracking request latencies.
• Summary: Similar to histograms but includes percentile calculations for more detailed insights.

OTLP (the OpenTelemetry Protocol) transmits these metric types in a structured format, typically protobuf or JSON. Understanding these formats is crucial for designing an efficient ingestion pipeline that minimizes storage overhead while maximizing query performance.

Building an Ingest Path

LaunchDarkly uses Apache Kafka to buffer data for bulk inserts into ClickHouse. We use the OpenTelemetry Collector to receive, deserialize, and batch data, then export it to our Golang API, which mutates the data before writing it to Apache Kafka. A set of workers (the Apache Kafka Connect ClickHouse exporter) reads the data and writes it to ClickHouse in large batches.

+------------------+
|  OTel Collector  |
+------------------+
         |
         v
+------------------+
|  Highlight API   |
+------------------+
         |
         v
+------------------+
|   Apache Kafka   |
+------------------+
         |
         v
+------------------+
|  Kafka Connect   |
+------------------+
         |
         v
+------------------+
|    ClickHouse    |
+------------------+

OpenTelemetry Collector Setup

The OpenTelemetry Collector is a key component in an OTel pipeline, responsible for receiving, processing, and exporting telemetry data. For metric ingestion into ClickHouse, we configure the collector to receive OTel metrics via the OTLP receiver, process them using built-in processors (e.g., batch and transform), and export them to our API.

Here’s an example OpenTelemetry Collector configuration for exporting metrics to our LaunchDarkly API, which then batch-exports the data to ClickHouse:

receivers:
  awsfirehose/cwmetrics:
    record_type: cwmetrics
  awsfirehose/otlp_v1:
    record_type: otlp_v1
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000

exporters:
  otlphttp:
    endpoint: 'http://pub.prod.vpc.highlight.io:8082/otel'

service:
  pipelines:
    metrics:
      receivers: [ otlp, awsfirehose/cwmetrics, awsfirehose/otlp_v1 ]
      processors: [ batch ]
      exporters: [ otlphttp ]

See our full production OpenTelemetry collector config here.

If you are building an OpenTelemetry pipeline from scratch, you can use the ClickHouse collector exporter for direct writes to the database. For our production use case, we route the data through our API for pre-processing and write buffering via Apache Kafka, but you may find success with the exporter even at large volumes.

exporters:
  clickhouse:
    endpoint: "tcp://clickhouse-server:9000"
    database: "otel_metrics"
    username: "default"
    password: ""

By using the OpenTelemetry Collector as the initial entrypoint for the data, we benefit from the collector’s myriad receivers, which are compatible with different data formats. For instance, as shown in the example above, we also set up a receiver for the AWS Firehose CloudWatch metrics format in the same collector. We’ll be covering cloud integrations in a future blog post, so stay tuned!

Aggregating and Reducing Data Granularity

High-cardinality metrics can quickly balloon in storage size, making efficient aggregation crucial. ClickHouse provides materialized views and TTL-based rollups to downsample data while retaining aggregate insights.
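To illustrate the latter, here is a minimal sketch of a TTL-based rollup, using a hypothetical table and column names rather than our production schema: once rows are older than seven days, ClickHouse collapses all samples that share a metric name and hour bucket into a single averaged row.

CREATE TABLE IF NOT EXISTS metrics_rollup_sketch
(
    MetricName  String,
    Hour        DateTime,  -- hour bucket, e.g. toStartOfHour() of the sample time
    Value       Float64,
    SampleCount UInt64     -- 1 per raw sample at insert time
) ENGINE = MergeTree()
      ORDER BY (MetricName, Hour)
      -- after 7 days, merge all rows sharing (MetricName, Hour) into one
      TTL Hour + INTERVAL 7 DAY
          GROUP BY MetricName, Hour
          SET Value = avg(Value), SampleCount = sum(SampleCount);

Our pipeline instead leans on materialized views for the aggregation, described next.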

Our production data pipeline initially writes metrics in their OTel-native format to one of three tables: metrics_sum, metrics_histogram, or metrics_summary.

The frequency of metric data points can make querying over wide time ranges a challenge. While the OpenTelemetry SDK emitting the metrics may aggregate data client-side, the collector does not perform any additional aggregation.

A real-world example: imagine a 100-node Kubernetes cluster running your application. Each application instance receives many requests per second and emits a number of latency metrics for each API endpoint. Even if the OTel SDK is configured to aggregate metrics down to one-second resolution, each node still produces one row per second for each unique combination of metric name and attributes. Any unique tags emitted on the metrics result in unique metric rows written to ClickHouse. On top of that, all 100 nodes send their respective data, and none of it is aggregated by the collector. The result: thousands of rows written to ClickHouse per second, at fine timestamp granularity.
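To put rough, purely hypothetical numbers on it: 100 nodes × 20 endpoints × 5 distinct attribute combinations is 10,000 unique series. At one row per series per second, that works out to roughly 864 million rows per day before any server-side aggregation.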

Another reason to transform the data is to merge the different OTel metrics formats into a cohesive one that’s easier to query. We went with an approach that solves both problems: aggregating metric values to one-second resolution and merging data between the metric formats.

Below you’ll find the schema we adopted for each OTel metric type, along with the materialized views that perform the aggregations:

CREATE TABLE IF NOT EXISTS metrics_sum
(
    ProjectId UInt32,
    ServiceName LowCardinality(String),
    MetricName String,
    MetricDescription String,
    MetricUnit String,
    Attributes Map(LowCardinality(String), String),
    Timestamp DateTime64(9) CODEC (Delta, ZSTD),
    RetentionDays UInt8 DEFAULT 30,
    -- sum
    Value Float64
    -- other columns omitted for brevity
) ENGINE = MergeTree()
      TTL toDateTime(Timestamp) + toIntervalDay(RetentionDays)
      PARTITION BY toStartOfDay(Timestamp)
      ORDER BY (ProjectId, ServiceName, MetricName, toUnixTimestamp64Nano(Timestamp));

CREATE TABLE IF NOT EXISTS metrics_histogram
(
    ProjectId UInt32,
    ServiceName LowCardinality(String),
    MetricName String,
    MetricDescription String,
    MetricUnit String,
    Attributes Map(LowCardinality(String), String),
    Timestamp DateTime64(9) CODEC (Delta, ZSTD),
    RetentionDays UInt8 DEFAULT 30,
    -- common
    -- histogram
    Count UInt64 CODEC (Delta, ZSTD),
    Sum Float64,
    BucketCounts Array(UInt64),
    ExplicitBounds Array(Float64),
    Min Float64,
    Max Float64
    -- other columns omitted for brevity
) ENGINE = MergeTree()
      TTL toDateTime(Timestamp) + toIntervalDay(RetentionDays)
      PARTITION BY toStartOfDay(Timestamp)
      ORDER BY (ProjectId, ServiceName, MetricName, toUnixTimestamp64Nano(Timestamp));

CREATE TABLE IF NOT EXISTS metrics_summary
(
    ProjectId UInt32,
    ServiceName LowCardinality(String),
    MetricName String,
    MetricDescription String,
    MetricUnit String,
    Attributes Map(LowCardinality(String), String),
    Timestamp DateTime64(9) CODEC (Delta, ZSTD),
    RetentionDays UInt8 DEFAULT 30,
    -- common
    Flags UInt32,
    -- summary
    Count Float64,
    Sum Float64
    -- other columns omitted for brevity
) ENGINE = MergeTree()
      TTL toDateTime(Timestamp) + toIntervalDay(RetentionDays)
      PARTITION BY toStartOfDay(Timestamp)
      ORDER BY (ProjectId, ServiceName, MetricName, toUnixTimestamp64Nano(Timestamp));

-- the destination table which contains the aggregate across metrics formats
CREATE TABLE IF NOT EXISTS default.metrics
(
    ProjectId UInt32,
    ServiceName String,
    MetricName String,
    MetricType Enum8('Empty' = 0, 'Gauge' = 1, 'Sum' = 2, 'Histogram' = 3, 'ExponentialHistogram' = 4, 'Summary' = 5),
    Attributes Map(LowCardinality(String), String),
    Timestamp DateTime CODEC (Delta(4), ZSTD(1)),
    -- meta
    MetricDescription SimpleAggregateFunction(anyLast, String),
    MetricUnit SimpleAggregateFunction(anyLast, String),
    RetentionDays SimpleAggregateFunction(max, UInt8) DEFAULT 30,
    -- histogram
    Min SimpleAggregateFunction(min, Float64),
    Max SimpleAggregateFunction(max, Float64),
    BucketCounts SimpleAggregateFunction(groupArrayArray, Array(UInt64)),
    ExplicitBounds SimpleAggregateFunction(groupArrayArray, Array(Float64)),
    -- common
    Count SimpleAggregateFunction(sum, UInt64),
    Sum SimpleAggregateFunction(sum, Float64)
    -- other columns omitted for brevity
) ENGINE = AggregatingMergeTree()
      PARTITION BY toStartOfDay(Timestamp)
      ORDER BY (ProjectId, ServiceName, MetricName, MetricType, toUnixTimestamp(Timestamp))
      TTL toDateTime(Timestamp) + toIntervalDay(RetentionDays);

CREATE MATERIALIZED VIEW IF NOT EXISTS metrics_sum_mv TO metrics AS
SELECT ProjectId,
       ServiceName,
       MetricName,
       MetricType,
       Attributes,
       toDateTime(toStartOfSecond(Timestamp)) as Timestamp,
       -- meta
       anyLastSimpleState(MetricDescription) as MetricDescription,
       anyLastSimpleState(MetricUnit) as MetricUnit,
       minSimpleState(StartTimestamp) as StartTimestamp,
       maxSimpleState(RetentionDays) as RetentionDays,
       -- sum
       sumSimpleState(1) as Count,
       sumSimpleState(Value) as Sum
-- other columns omitted for brevity
FROM metrics_sum
GROUP BY all;

CREATE MATERIALIZED VIEW IF NOT EXISTS metrics_histogram_mv TO metrics AS
SELECT ProjectId,
       ServiceName,
       MetricName,
       'Histogram' as MetricType,
       Attributes,
       toDateTime(toStartOfSecond(Timestamp)) as Timestamp,
       -- meta
       anyLastSimpleState(MetricDescription) as MetricDescription,
       anyLastSimpleState(MetricUnit) as MetricUnit,
       minSimpleState(StartTimestamp) as StartTimestamp,
       maxSimpleState(RetentionDays) as RetentionDays,
       -- histogram
       minSimpleState(Min) as Min,
       maxSimpleState(Max) as Max,
       groupArrayArraySimpleState(BucketCounts) as BucketCounts,
       groupArrayArraySimpleState(ExplicitBounds) as ExplicitBounds,
       sumSimpleState(Count) as Count,
       sumSimpleState(Sum) as Sum
-- other columns omitted for brevity
FROM metrics_histogram
GROUP BY all;

CREATE MATERIALIZED VIEW IF NOT EXISTS metrics_summary_mv TO metrics AS
SELECT ProjectId,
       ServiceName,
       MetricName,
       'Summary' as MetricType,
       Attributes,
       toDateTime(toStartOfSecond(Timestamp)) as Timestamp,
       -- meta
       anyLastSimpleState(MetricDescription) as MetricDescription,
       anyLastSimpleState(MetricUnit) as MetricUnit,
       minSimpleState(StartTimestamp) as StartTimestamp,
       maxSimpleState(RetentionDays) as RetentionDays,
       -- summary
       sumSimpleState(Count) as Count,
       sumSimpleState(Sum) as Sum
-- other columns omitted for brevity
FROM metrics_summary
GROUP BY all;

Find the full example from our production configuration in our GitHub repo: the metrics schema and the materialized views.

This reduces the volume of stored data by grouping metrics into one-second intervals, balancing granularity and storage efficiency. In the future, we may also aggregate across metric Attributes for keys that are similar across metrics.
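A hypothetical version of that follow-on rollup, sketched below with an illustrative metrics_by_minute table and view that are not part of our production schema, would drop the Attributes map and coarsen timestamps to one-minute resolution:

CREATE TABLE IF NOT EXISTS metrics_by_minute
(
    ProjectId   UInt32,
    ServiceName String,
    MetricName  String,
    MetricType  Enum8('Empty' = 0, 'Gauge' = 1, 'Sum' = 2, 'Histogram' = 3, 'ExponentialHistogram' = 4, 'Summary' = 5),
    Timestamp   DateTime,
    Count       SimpleAggregateFunction(sum, UInt64),
    Sum         SimpleAggregateFunction(sum, Float64)
) ENGINE = AggregatingMergeTree()
      PARTITION BY toStartOfDay(Timestamp)
      ORDER BY (ProjectId, ServiceName, MetricName, MetricType, Timestamp);

-- chained materialized view: fires on every insert into the metrics table above
CREATE MATERIALIZED VIEW IF NOT EXISTS metrics_by_minute_mv TO metrics_by_minute AS
SELECT ProjectId,
       ServiceName,
       MetricName,
       MetricType,
       toStartOfMinute(Timestamp) as Timestamp,
       sumSimpleState(Count)      as Count,
       sumSimpleState(Sum)        as Sum
FROM metrics
GROUP BY all;

Long-range dashboard queries could then read from metrics_by_minute, while recent, high-resolution queries stay on metrics.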

Query Layer

With metrics efficiently ingested and aggregated, querying them is straightforward. We share the ClickHouse query layer across our products and can extract metrics just as we query other data ingested into LaunchDarkly:

SELECT Timestamp,
       toFloat64(Sum / Count) as value
FROM metrics
WHERE ProjectId = ?
  AND Timestamp <= ?
  AND Timestamp >= ?
  AND toString(MetricName) = ?
  AND Attributes[?] = ?
Additional bucketing logic allows us to aggregate the results in a format that’s easily displayed in our dashboards.
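As a rough sketch of that logic, assuming one-minute buckets and the same bound parameters as the query above rather than our exact dashboard query, the bucketing boils down to grouping on toStartOfInterval:

SELECT toStartOfInterval(Timestamp, INTERVAL 1 MINUTE) as bucket,
       sum(Sum) / sum(Count)                           as value
FROM metrics
WHERE ProjectId = ?
  AND MetricName = ?
  AND Timestamp BETWEEN ? AND ?
GROUP BY bucket
ORDER BY bucket

Any empty buckets can be filled on the client, or in ClickHouse itself with the WITH FILL modifier on ORDER BY.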

Conclusion

Building an OpenTelemetry metrics pipeline with ClickHouse offers a scalable and efficient solution for observability. By leveraging OTLP ingestion, data aggregation, SQL-based querying, and visualization tools, organizations can gain deep insights into their applications with minimal storage and performance overhead.

Ready to get started? Try out LaunchDarkly and explore how open-source observability can transform your monitoring stack. 🚀