Logs are one of the three pillars of software observability and a key component of OpenTelemetry. Log data holds crucial insights that help developers understand application behavior and troubleshoot issues, so the ability to quickly and accurately locate valuable information embedded within logs is essential.
Embrace is a mobile app observability platform, and our system processes around 2 billion logs a day. We recently set out to increase the amount of log data a customer can send us while also improving search performance on those larger logs.
We use ClickHouse for our log database, and we wanted to share what we learned while optimizing our system to handle larger log sizes. In this post, we’ll cover:
- An overview of our log system
- Testing our ingestion pipeline with larger data volume
- Writing a more efficient query
- Reducing query time with skip indices
TL;DR: We initially suspected ingestion would be the bottleneck, but testing showed our systems could handle the increased log sizes (we'll share more about our ingestion pipeline in a future post). On the query side, we cut our slowest queries from 60 seconds down to 1-2 seconds by rewriting them to work around a current limitation in ClickHouse, and by testing several skip indices with Bloom filter configurations to find the best balance of performance versus storage cost.
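To make the skip-index idea concrete before we dig into the details: ClickHouse supports token-based Bloom filter skip indices on string columns, which let the engine skip granules that cannot contain a searched term. The sketch below is illustrative only; the table name `logs`, column `message`, and the `tokenbf_v1` parameters are assumptions for the example, not our production settings.

```sql
-- Illustrative only: add a token-based Bloom filter skip index to a logs table.
-- tokenbf_v1(size_in_bytes, number_of_hash_functions, seed) tokenizes the
-- column and records tokens in a per-granule Bloom filter, so ClickHouse can
-- skip granules that definitely don't contain the searched token.
ALTER TABLE logs
    ADD INDEX message_tokens_idx message TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4;

-- Build the index for already-existing data parts.
ALTER TABLE logs MATERIALIZE INDEX message_tokens_idx;
```

A query such as `SELECT * FROM logs WHERE hasToken(message, 'timeout')` can then use the index to prune granules. The filter size, hash count, and granularity all trade storage cost against skip effectiveness, which is exactly the tuning we cover later in this post.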