Using Materialized Views in ClickHouse (vs. Postgres)

Published January 23, 2024

by Vadim Korolik

LaunchDarkly is an open source observability solution. We record sessions, traces, errors and logs to help engineers debug and maintain their web applications.

You could be one of those engineers; check us out on [GitHub](https://github.com/highlight/highlight).

Introduction: The Use Case

Our team recently adopted ClickHouse to store and query high-volume observability data. The implementation was straightforward for the intended aggregate querying, but supporting other access patterns was more complex. For example, we needed to aggregate traces over a large time range to compute average durations, but at the same time find and load a single trace by its ID. In ClickHouse, a table has only one primary index, so only queries that match it run optimally. Thankfully, there’s just the tool for the job…

At a High-level: ClickHouse vs Postgres

PostgreSQL (Postgres) is a versatile SQL database, known for its reliability and used for a wide range of applications. It’s an Online Transaction Processing (OLTP) database, providing real-time, exact results. ClickHouse, on the other hand, specializes in Online Analytical Processing (OLAP), making it better for fast, complex data analysis. In short, ClickHouse performs better at computing aggregate results over a large dataset, while PostgreSQL excels at finding single records based on a known query pattern.
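To make the distinction concrete, here is a rough sketch of the two query shapes (the table and column names are hypothetical, chosen only for illustration):

```sql
-- OLAP-style aggregate over millions of rows: ClickHouse's sweet spot
SELECT toStartOfHour(Timestamp) AS Hour, avg(Duration) AS AvgDuration
FROM traces
GROUP BY Hour;

-- OLTP-style point lookup on an indexed key: where Postgres excels
SELECT * FROM traces WHERE TraceId = 'abc123';
```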

What is a Materialized View?

Both databases provide Materialized Views as a way to transform data into a different structure that can be queried in a performant way. Think of it as a Pivot in Excel or other table-viewing tools. Rather than having to process the data into a different format every time a query is made, a Materialized View remembers the transformation and applies it ahead of time (periodically in Postgres, at insert time in ClickHouse) so that queries can run quickly against the processed form. There are notable differences between PostgreSQL and ClickHouse Materialized Views (MVs), but the use case for MVs in both is similar.
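For reference, this is roughly what the Postgres flavor looks like: the view is populated once at creation and must be refreshed explicitly to pick up new rows (a sketch with hypothetical table and column names):

```sql
-- Postgres: define the transformed view once...
CREATE MATERIALIZED VIEW daily_trace_counts AS
SELECT date_trunc('day', timestamp) AS day, count(*) AS trace_count
FROM traces
GROUP BY 1;

-- ...then re-run the transformation periodically to pick up new data
REFRESH MATERIALIZED VIEW daily_trace_counts;
```

ClickHouse MVs, as we’ll see below, instead transform each batch of rows as it is inserted, so there is no explicit refresh step.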

Read more for a deep dive into an actual use case for setting up a series of ClickHouse Materialized Views.

Deep-dive: Ingesting Traces from an Example LLM App

At LaunchDarkly, we recently launched a new tracing product that records code execution from your application to help debug issues or troubleshoot performance problems. The query engine in our app allows reporting and searching across structured attributes sent with traces. Each trace has a given duration in seconds but can also carry arbitrary numeric properties. Let’s say our trace measures the performance of AI inference for a large language model, and we report the input size in tokens as the num_tokens:123 numeric property.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from highlight_io import H

# Set up OpenTelemetry export with Highlight
H = H(
    "<YOUR_HIGHLIGHT_PROJECT_ID>",
    service_name="llm-inference",
    service_version="14",
    instrument_logging=True,
)

def generate_text(input_str: str):
    # Wrap the code in a span to record inference execution duration
    with H.trace() as span:
        # Load the tokenizer and model
        model = AutoModelForCausalLM.from_pretrained(
            "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
            torch_dtype="auto",
            trust_remote_code=True,
        )
        tokenizer = AutoTokenizer.from_pretrained(
            "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
            trust_remote_code=True,
        )

        # Tokenize the input text
        inputs = tokenizer(input_str, return_tensors="pt")

        # Generate text using the model
        output = model.generate(**inputs, max_length=500)

        # Decode the generated text and attach attributes to the span;
        # num_tokens is the length of the tokenized prompt
        generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
        span.set_attributes({
            "input": input_str,
            "output": generated_text,
            "num_tokens": inputs["input_ids"].shape[1],
        })
        return generated_text

# Example usage
input_text = "The future of AI is"
print(generate_text(input_text))
```

The sample code above runs inference using a popular HuggingFace model. The critical code path is wrapped with a context manager that starts and stops a span, timing the execution while reporting useful attributes that can help us debug the root cause of potential problems. Here, we’re reporting the input text, the output text, and the num_tokens sent to the model during inference.

Now that we’ve started recording traces, we need a way to find interesting ones. In LaunchDarkly, we store each of these trace spans as a row in a ClickHouse table and provide the ability to visualize that data via time-series aggregations to identify trends or learn useful correlations.

For example, let’s say we wanted to search for cases where the inference was slow to see if the size of our input (and the number of input tokens) had an effect on the inference duration. To write a ClickHouse query for that, we first need to understand the traces schema (how the traces are stored in the database). The table DDL looks something like this (see it here in our repo).

```sql
CREATE TABLE traces
(
    Timestamp DateTime64(9),
    UUID UUID,
    TraceId String,
    SpanId String,
    /* ... omitted for brevity ... */
    ServiceName LowCardinality(String),
    ServiceVersion String,
    TraceAttributes Map(LowCardinality(String), String)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
PARTITION BY toDate(Timestamp)
ORDER BY (ProjectId, Timestamp, UUID);
```

While the attributes present on all traces are stored as top-level columns, each trace can also carry arbitrary custom attributes that differ from trace to trace. The TraceAttributes column is a map that stores all of these values without us having to worry about their types: our ingest pipeline writes every attribute value as a string, with the map key holding the attribute name. This lets us mix different trace attributes within the same table regardless of their data type (more complex values can be stored as their JSON string representation).
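For instance, a query correlating input size with duration can pull the attribute out of the map and cast it on the fly. Here’s a sketch of what that might look like, assuming a Duration column among those omitted from the DDL above:

```sql
-- Hourly average duration vs. average num_tokens over the last day
SELECT toStartOfHour(Timestamp) AS Hour,
       avg(Duration) AS AvgDuration,
       avg(toFloat64OrNull(TraceAttributes['num_tokens'])) AS AvgTokens
FROM traces
WHERE ProjectId = 1
  AND Timestamp >= now() - INTERVAL 1 DAY
GROUP BY Hour
ORDER BY Hour;
```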

However, this presents a challenge for search. The LaunchDarkly UI auto-completes attribute keys as you type a query, but scanning the entire table to gather the distinct keys is slow. We also can’t quickly determine which keys have at least one numeric value, since that would require reading every value in the table.
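Without any precomputation, gathering the keys means exploding every map in every row, something like this (illustrative only):

```sql
-- Full scan: reads every TraceAttributes map in the table
SELECT DISTINCT arrayJoin(mapKeys(TraceAttributes)) AS Key
FROM traces
WHERE ProjectId = 1;
```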

This is where materialized views come to the rescue. Let’s create a materialized view for this table that records which trace attributes are numeric, so our UI can recommend keys quickly:

```sql
CREATE MATERIALIZED VIEW trace_keys_mv
TO trace_keys (
    ProjectId Int32,
    Key LowCardinality(String),
    Day DateTime,
    Count UInt64,
    Type String
)
AS
SELECT ProjectId,
       arrayJoin(TraceAttributes).1 AS Key,
       toStartOfDay(Timestamp) AS Day,
       countDistinct(UUID) AS Count,
       if(isNull(toFloat64OrNull(
           arrayJoin(TraceAttributes).2
       )), 'String', 'Numeric') AS Type
FROM traces
GROUP BY ProjectId,
         arrayJoin(TraceAttributes).1,
         Day,
         isNull(toFloat64OrNull(arrayJoin(TraceAttributes).2));
```

We define a materialized view trace_keys_mv that writes into the destination table trace_keys, populating its 5 columns from the SELECT query in the second half of the statement above. In ClickHouse, a materialized view definition looks like a normal SELECT query, but it acts as an insert trigger: each block of data inserted into the source traces table is transformed by the SELECT’s filters, aggregations, and other logic before being written into the destination table. The result is a table with data processed into our desired form. Since the transformation happens as data is inserted into the source table, the materialized view can be queried instantly:

LaunchDarkly traces list showing queried trace data from a ClickHouse materialized view.
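For example, the autocomplete backend can now answer “which numeric keys were seen recently, and how often?” from the small pre-aggregated table instead of scanning traces. A sketch (the exact filters in our product differ):

```sql
-- Top numeric attribute keys over the last week, from the MV's destination table
SELECT Key, sum(Count) AS Total
FROM trace_keys
WHERE ProjectId = 1
  AND Type = 'Numeric'
  AND Day >= toStartOfDay(now() - INTERVAL 7 DAY)
GROUP BY Key
ORDER BY Total DESC
LIMIT 10;
```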

The only downside is that each materialized view you create consumes additional CPU, memory, and disk on the ClickHouse cluster, since inserted data must be processed and written in its new form.

A Materialized View for Fast Row Lookup

In PostgreSQL, searching for a single row can be optimized with an index, and if you have more than one query pattern, you can create multiple indexes over the relevant combinations of columns. In ClickHouse, on the other hand, a table has only one primary key, and it determines which queries perform well. Thankfully, we can use materialized views to maintain copies of the table with a different sorting key, allowing us to query efficiently in other ways. For example, the following materialized view creates a copy of our traces table with ORDER BY (ProjectId, TraceId) rather than ORDER BY (ProjectId, Timestamp, UUID).

```sql
CREATE TABLE IF NOT EXISTS traces_by_id (
    `Timestamp` DateTime64(9),
    `UUID` UUID,
    `TraceId` String,
    `SpanId` String
    -- ... --
) ENGINE = MergeTree
ORDER BY (ProjectId, TraceId)
TTL toDateTime(Timestamp) + toIntervalDay(30);

CREATE MATERIALIZED VIEW IF NOT EXISTS traces_by_id_mv TO traces_by_id (
    `Timestamp` DateTime64(9),
    `UUID` UUID,
    `TraceId` String,
    `SpanId` String
    -- ... --
) AS
SELECT *
FROM traces;
```
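With this table in place, loading a single trace becomes a tight primary-key scan, something like the following (a sketch; the TraceId value is hypothetical):

```sql
-- Fast: ProjectId and TraceId form the prefix of the table's ORDER BY key
SELECT Timestamp, TraceId, SpanId
FROM traces_by_id
WHERE ProjectId = 1
  AND TraceId = 'abc123'
ORDER BY Timestamp;
```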

While the traces table provides efficient lookup of traces for a given time range (sorted by time), the traces_by_id table offers fast lookup of a trace for a given TraceId. We rely on this in the LaunchDarkly traces UI to build a flame graph, where we query for all spans for a given trace:

LaunchDarkly flame graph visualization for a trace looked up by TraceId.

These two examples scratch the surface of how we use ClickHouse materialized views at LaunchDarkly. If you’d like to learn more or look at the code more closely, check out our table definitions and source code in our Apache 2.0 licensed GitHub repository. Thanks!

Grafana dashboard displaying trace metrics powered by ClickHouse materialized views.