Solving large logs with ClickHouse

Embrace engineers share a few key learnings from supporting larger log sizes, including working around a current limitation in ClickHouse and testing several skip indices to optimize query performance and storage cost.

Logs represent one of the three pillars of software observability, and a key component of OpenTelemetry. Log data holds crucial insights that can help developers understand application behavior and troubleshoot issues, making the ability to efficiently search logs essential. Developers need to be able to quickly and accurately locate valuable information embedded within logs.

Embrace is a mobile app observability platform, and our system processes around 2 billion logs a day. We recently set out to increase the amount of log data a customer can send us, while at the same time improving the performance when searching against those larger logs.

We use ClickHouse for our log database, and we wanted to share what we learned while optimizing our system to handle larger log sizes. In this post, we’ll cover:

  • An overview of our log system
  • Testing our ingestion pipeline with larger data volume
  • Writing a more efficient query
  • Reducing query time with skip indices

TL;DR: We initially thought ingestion might be a problem, but after testing, our systems were able to handle the increased log sizes. We’ll share more about our ingestion pipeline in a future post. After increasing our log sizes, we improved our slowest queries from 60 seconds to 1-2 seconds by rewriting our query to work around a current limitation in ClickHouse and testing several skip indices with Bloom filter configurations to find the best match in terms of performance versus storage cost.

An overview of Embrace’s log system

Our existing log ingestion was limited to 128 bytes per message. While this was adequate for many use cases, support for longer log messages was a highly requested feature. Additionally, searches against this data relied on simple LIKE queries, which could be problematic. For example, a search might fail to find relevant results if the query didn’t match exactly, such as when punctuation was omitted.

If you search for the terms “Connection Error,” a regular query using LIKE will look like the following:

SELECT * FROM logs WHERE text LIKE '%Connection Error%';

This will not include the following texts in the result set:

  • Connection:Error
  • Error Connection
  • Error:Connection

To address these issues, we aimed to increase the maximum log length to 4 kilobytes, and to process the logs in a way that would make searches more effective.
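
For illustration, ClickHouse’s hasToken function matches whole tokens delimited by non-alphanumeric characters, so a token-based predicate matches all of the variants above regardless of punctuation or word order. A minimal sketch, using the same hypothetical logs table as before:

-- Token-based search: matches "Connection:Error", "Error Connection",
-- and "Error:Connection" alike, since token order and punctuation don't matter.
SELECT * FROM logs
WHERE hasToken(text, 'Connection') AND hasToken(text, 'Error');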

However, we were concerned about the potential impact on the performance of our ingestion pipeline. We needed to test thoroughly to ensure that storage of larger log messages would not negatively impact ClickHouse or other parts of our infrastructure, and to confirm that searches would remain fast and efficient for our users.

Testing our ingestion pipeline with larger data volume

Ideally, we would be able to use real-world data to test the larger data volume. This would allow us to gradually increase the volume until it reached full production levels, with the ability to stop the process at any point if issues arose. However, we did not have enough real-world data in the necessary volume for such a stress test. While we had examples of the larger log data that customers wanted to send, these examples were limited to a small number of customers and comprised only a few sample values. We needed data on a much larger scale.

A naive approach would be to generate random data to extend the length of the log messages. However, this method has significant drawbacks because the shape of random data differs considerably from real-world data. Random data doesn’t compress as effectively, whereas real-world logs often contain many repeated tokens. Additionally, the length of individual words and punctuation in random data would differ from actual logs, potentially skewing the results of our testing due to differences in data processing during ingestion.

Instead, we synthesized test data using real-world data as a seed. We created a weighted dictionary of all the terms seen in our existing logs, assigning higher weights to terms that occurred more frequently. We then added a secondary log processing stream that ingested data into a separate ClickHouse table in our production database. Using a hash of the incoming log message, we deterministically extended the lengths of messages with terms drawn from the weighted dictionary. This approach ensured that the shape and repetition of terms in the log messages closely resembled real-world data, but with arbitrarily long lengths.
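
One way to build such a weighted dictionary is directly in ClickHouse, counting token frequencies across existing messages. This is only a sketch, reusing the illustrative logs table and text column from earlier rather than our actual schema:

-- Build a weighted dictionary: every distinct token seen in real log
-- messages, weighted by how often it occurs.
SELECT
    term,
    count() AS weight
FROM
(
    SELECT arrayJoin(splitByNonAlpha(text)) AS term
    FROM logs
)
GROUP BY term
ORDER BY weight DESC;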

It’s worth noting that we purposely wanted to test these changes in our production cluster. Although you can do testing in a staging environment, the results can be misleading for large scale benchmarks, as the load the cluster is under and the amount of data are not representative of production circumstances. While you could theoretically spin up a production-like cluster for the purposes of testing, the cost would be large, and it would be a significant undertaking to ensure that the load on the cluster was similar to production.

[Figure: architecture diagram for the test log data pipeline. We created test log data with a weighted dictionary and stored it in our production cluster.]

After deploying these changes to production, we found that our ingestion pipeline and ClickHouse itself had no issues handling the additional data volume. This was a pleasant surprise, though not entirely unexpected, given the significant effort we’ve invested in optimizing our Go-based loaders and our positive historical experience with ClickHouse’s resilience.

However, search performance against the new table was lacking. These search queries regularly exceeded the 60-second time limits we enforce for ClickHouse queries. The next step in our testing was to figure out how we could improve the performance of these queries, either by restructuring the queries themselves or by leveraging ClickHouse’s index support.

Writing a more efficient query

Like many databases, ClickHouse allows you to prefix a query with EXPLAIN to gain insights into what the database will do when executing the query. For example, from the ClickHouse documentation:

EXPLAIN SELECT sum(number) FROM numbers(10) UNION ALL SELECT sum(number) FROM numbers(10) ORDER BY sum(number) ASC FORMAT TSV;

Union
  Expression (Projection)
    Expression (Before ORDER BY and SELECT)
      Aggregating
        Expression (Before GROUP BY)
          SettingQuotaAndLimits (Set limits and quota after reading from storage)
            ReadFromStorage (SystemNumbers)
  Expression (Projection)
    MergingSorted (Merge sorted streams for ORDER BY)
      MergeSorting (Merge sorted blocks for ORDER BY)
        PartialSorting (Sort each block for ORDER BY)
          Expression (Before ORDER BY and SELECT)
            Aggregating
              Expression (Before GROUP BY)
                SettingQuotaAndLimits (Set limits and quota after reading from storage)
                  ReadFromStorage (SystemNumbers)

This provides a general idea of the query execution plan, but it is very high-level and does not offer enough detail to understand the root cause of our performance issues.

For deeper insights, ClickHouse offers the send_logs_level setting, which lets you specify the amount of detail you want from the server-side logs during query execution. By using SET send_logs_level = 'trace', you can obtain more information about the server’s activities, which can be helpful in diagnosing performance problems.
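
For example, from clickhouse-client you can enable trace logs for the session and then rerun the slow query (shown here against the illustrative logs table):

-- Stream detailed server-side execution logs back to the client.
SET send_logs_level = 'trace';

-- Rerun the query under investigation; the streamed logs include details
-- such as which parts were selected and how many rows were read.
SELECT * FROM logs WHERE text LIKE '%Connection Error%';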

Reviewing these logs revealed that an open issue in ClickHouse was affecting our query. In short, when a query includes both an ORDER BY and a LIMIT, ClickHouse reads all the selected columns from disk, even for rows that do not make it into the final result set. Ideally, it should defer reading columns until absolutely necessary. At the time of this writing, there is an open pull request to add a lazy column type that addresses this issue, but the current version of ClickHouse still exhibits this inefficient behavior.

This problem affects all such queries, but it is particularly costly for queries involving columns with large values. Since our message column had grown significantly in size, it was being read for all potential results, leading to substantial overhead (especially with our distributed table, which lets us spread the storage and query cost for this table across our cluster, but which also meant more data being transmitted over the network).

We worked around this issue by adding an additional clause to the WHERE statement of our query: a subquery first applies the ORDER BY ... LIMIT using only the small primary key columns, and the outer query then reads the large values solely for that reduced set of rows, matching them via the primary key.

For example, if you have a table like:

CREATE TABLE example (
    `date` Date,
    `small1` LowCardinality(String),
    `small2` LowCardinality(String),
    `small3` LowCardinality(String),
    `time` DateTime,
    `id` UUID,
    `large` String
)
    ENGINE = MergeTree()
    PARTITION BY date
    ORDER BY (small1, small2, small3, time);

A query like this one would be affected by this issue:

SELECT date, small1, small2, small3, time, id, large
FROM example
WHERE
      date >= today() - 7 AND date < today() AND
      small1 = 'a' AND small2 = 'b'
ORDER BY time DESC, id DESC LIMIT 10

You can instead rewrite the query to first select only the ordering tuple in a subquery, which mitigates the issue:

SELECT date, small1, small2, small3, time, id, large
FROM example
WHERE
      (date, small1, small2, small3, time) IN (
            SELECT date, small1, small2, small3, time
            FROM example
            WHERE
                  date >= today() - 7 AND date < today() AND
                  small1 = 'a' AND small2 = 'b'
            ORDER BY time DESC LIMIT 10 WITH TIES
      )
ORDER BY time DESC, id DESC LIMIT 10

With this change, our slowest search queries for this test were reduced to under 30 seconds, which was below our timeout threshold. However, this was still longer than we wanted our users to wait for search results. With no further obvious optimizations for the query itself, our next step was to investigate whether ClickHouse’s indexing features could help us.

Reducing query time with skip indices

Traditional relational databases typically use secondary B-tree indices to improve performance, providing an efficient way to quickly look up the rows that match a given query. ClickHouse does not support this style of index; instead, it supports data skipping indices, which let ClickHouse determine that it can skip large ranges of data because none of the rows in those ranges can match the query.

It’s worth noting that ClickHouse also supports an experimental full-text index, designed precisely for use cases like ours. During our initial tests, this full-text index performed exceptionally well. However, as this feature is still experimental and has several unresolved issues, it is not recommended for production use. That is why we decided not to use it, and why it is not covered in this post. We plan to revisit the full-text index once it is stable and production-ready.

Testing methodology

We suspected that a data skipping index would help improve our query performance. However, these indices have several tunable parameters that need to be optimized for best performance, and this tuning depends on the characteristics of the data being indexed.

To reliably compare the performance of different index configurations, we needed to test a variety of search terms against different data sets. Our test cases included:

  • Frequently occurring terms with many matching rows
  • Frequently occurring terms with no matching rows
  • Infrequently occurring terms with many matching rows
  • Infrequently occurring terms with no matching rows

For each of these scenarios, our primary metric for comparison was query duration. Secondary metrics included peak memory usage and the number of rows read during query execution.

To minimize variance and ensure reliable results, we wanted to repeat each test case multiple times. Additionally, we wanted to test with a variety of apps to ensure that our results were consistent across different datasets.

Skip index parameters

ClickHouse has several types of data skipping indices, but the Bloom filter skip indices are the ones relevant to our use case. A Bloom filter lets you efficiently determine whether a value is present in a set. It can return false positives for set membership, but never false negatives: the filter may occasionally report that a value is present when it is not, but if it reports that a value is absent, it is definitely absent. This makes Bloom filters a great fit for a skipping index.

There are three types of Bloom filter skip indices in ClickHouse:

  • The bloom_filter type takes only one parameter: false positive rate (between 0 and 1).
  • The tokenbf_v1 type splits a string into tokens by non-alphanumeric characters, and is designed to efficiently search text data similar to our log lines. It takes three parameters:
    • Size of the Bloom filter in bytes (fewer false positives at the expense of storage)
    • Number of hash functions (fewer false positives at the expense of CPU)
    • Seed for the Bloom filter hash functions
  • The ngrambf_v1 type is just like tokenbf_v1, but it indexes substrings (n-grams) of a given length rather than whole word tokens. It is useful when you need to match against partial words, or for languages without word breaks (such as Chinese).

For our use case, tokenbf_v1 was the most suitable as it gave us control over more parameters and our primary need was whole-word matches.

We aimed to tune the Bloom filter size and number of hash functions with our benchmarking, and used a Bloom filter calculator to determine the parameters for our test cases.
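
For illustration, here is roughly what adding a tokenbf_v1 index to the example table from the previous section looks like; the parameter values and index name are placeholders, not the configuration we settled on:

-- Bloom filter over tokens of the large text column.
-- Parameters: filter size in bytes, number of hash functions, hash seed.
ALTER TABLE example
    ADD INDEX large_tokens large TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 4;

-- Token searches such as hasToken can then skip granules whose
-- Bloom filter rules out the token.
SELECT count()
FROM example
WHERE hasToken(large, 'Error');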

Custom benchmarking tool

The ClickHouse distribution includes a clickhouse-benchmark tool, which is excellent for testing the performance of individual queries. Our use case, however, was more complex.

ClickHouse does not allow you to force the use of a specific index for a query, so we needed to ensure that only the index we were testing was present on the table. Additionally, ClickHouse does not materialize the index for existing rows by default, and materializing everything at once could negatively impact cluster performance in production.

This meant that for each run of the test suite, we needed to:

  • Drop any existing indices
  • Create a new skip index with the desired parameters
  • Materialize the index for each partition one by one. This is an asynchronous operation, so we needed to monitor the system.mutations table to detect when the materialization was complete.
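
In ClickHouse terms, each run came down to statements along these lines (the index name, parameters, and partition value are placeholders):

-- Remove the index from the previous run.
ALTER TABLE example DROP INDEX IF EXISTS large_tokens;

-- Create the index with the parameters under test (metadata only;
-- existing rows are not indexed yet).
ALTER TABLE example
    ADD INDEX large_tokens large TYPE tokenbf_v1(65536, 4, 0) GRANULARITY 4;

-- Materialize the index one partition at a time (asynchronous).
ALTER TABLE example MATERIALIZE INDEX large_tokens IN PARTITION '2024-06-01';

-- Poll until the mutation finishes.
SELECT count()
FROM system.mutations
WHERE table = 'example' AND is_done = 0;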

Then, for each test case we needed to:

  • Run the query multiple times (to minimize variance)
  • Wait for all the results to appear in the system.query_log table
  • Aggregate values from the query log
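
One way to make that aggregation straightforward is to tag every benchmark query with a log_comment setting and then group the results from system.query_log. A sketch, with an illustrative tag value:

-- Summarize duration, peak memory, and rows read for one test case,
-- assuming each benchmark query ran with SETTINGS log_comment = '<tag>'.
SELECT
    count()                AS runs,
    avg(query_duration_ms) AS avg_duration_ms,
    max(memory_usage)      AS peak_memory_bytes,
    avg(read_rows)         AS avg_rows_read
FROM system.query_log
WHERE type = 'QueryFinish'
  AND log_comment = 'tokenbf-32768-3-frequent-term';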

Our test runner initially ran each test case multiple times back to back before proceeding to the next test case in the suite. This caused problems, however: the background load on the ClickHouse cluster changed over time, so a single test case could be disproportionately affected by a period of higher load. We instead changed the test runner to run through all test cases once, then repeat the entire suite N times, distributing the variance from load across all the test cases.

These steps were time-consuming, with each execution of the test suite taking around 45 minutes. We wanted the test runner to be resilient, handle errors, retry operations when necessary, and persist state so that it could be resumed in the event of an unexpected crash. This would allow us to run it in the background for hours without constant supervision.

We ended up building a custom tool to run these benchmarks, which significantly improved our testing process by making it easier to iterate and run the suite multiple times. In the end, we ran about 1870 test cases to compare all the different index configurations to the baseline, so speeding up this process was very helpful.

[Figure: results from testing the tokenbf_v1 skip index against the baseline, shown as three bar charts.]

Results from testing

Our testing provided valuable insights into the performance of different Bloom filter configurations. We found that the skip index yielded significantly faster query times in many cases, and performed no worse than the baseline in cases where it was not helpful. Increasing the Bloom filter size generally improved performance. The number of hash functions had a lesser, but still noticeable, impact. Memory usage remained relatively stable across all configurations, and the number of rows read during query execution decreased as the Bloom filter size increased.

However, larger Bloom filters also increased the disk space used by the index. We had to strike a balance between query performance and storage cost. Through iterative testing and adjustments, we settled on a configuration that offered the best compromise between speed, memory usage, and storage requirements.
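
For the storage side of that trade-off, recent ClickHouse versions expose the on-disk footprint of each skip index in the system.data_skipping_indices table; a sketch against the example table:

-- Compare the disk space consumed by each skip index on the table.
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.data_skipping_indices
WHERE table = 'example';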

With the new index in place, most of our queries were taking 1-2 seconds to return, with many taking just milliseconds.

Conclusion

Our journey to increase log limits and improve search performance was a challenging one, involving a deep dive into ClickHouse indices and our queries, numerous testing iterations, and a careful balancing of query performance against storage cost. It was a rewarding experience that significantly improved our product’s capabilities.

Real-world testing was very important here, as it gave us the confidence to move forward without worrying about affecting user experience or reliability. Iterating in small steps helped us balance the different factors and arrive at the best overall performance.

We’ll continue to monitor the system over time, to ensure that changes in the dataset don’t adversely affect the performance of logs in the future, and we’re looking forward to applying the same technique to different areas of our product to further improve performance.

Acknowledgments

We want to thank Altinity for their expertise and support as we researched and implemented support for larger logs in the Embrace architecture.
