Updated: Oct 23
Scenario: The log data emitted from applications is different from transaction data in that only a small portion of that data is deemed useful at any time, but most of it is needed for a few days or months in case it is needed. Most centralized log mgmt products use the same foundational components of Lucene and ElasticSearch, and typically store logs in a database like fashion. The same server cluster is used for indexing and searching, and logs data has to be live in the cluster before it can be searched. The cost of handling logs starts at 10 cents per GB, and becomes costly when volume starts to hit a few TBs per day.
The Ask: A log mgmt product is to be created with these objectives
Indexing logs and Searching logs should not be dependent on each other
Indexing throughput should be 10MBps per core, this will also reduce the overall cost of ingestion
Cost of storing logs should be close to disk storage cost. Compute cost for searching logs should be proportional to the amount of logs searched
Indexing and Search operations should scale horizontally within defined memory and cpu limits
Solution: Below is a schematic the product Episilia
C++ is chosen as the language for its high performance characteristics, to leverage SIMD instructions in json log parsing and for indexing logs
Block storage compatible to S3 is chosen to hold logs and folders are organized by date/time.
There is no central database or metadata storage, S3 itself acts as the metadata holder. This choice was made to keep ops simple. Indexer writes logs to S3, Search reads logs off S3, and archival/restore of logs is a file copy from and to S3. This made Storage separate from Compute”
The data structures to hold the logs and indices are designed to ensure optimal storage and low retrieval latency. Logs are grouped into blocks by their commonality of application id etc and metadata is deduplicated, thus reducing storage required for logs.
Hadoop Sequence File format is chosen for the Log and Index blocks so these files are compatible with other big data products such as Hive.
Log data blocks are compressed with LZ4, a TB of logs typically comes down to 100GB on disk.
Index is at a maximum of 1% of the logs size, so a TB of logs has just 10GB of overhead for index. A Lucene index is anywhere between 10-50%, to compare. The index is optimized using bloom filters and other data structure level optimizations.
Indexer is a core module for data processing and is designed as a pipeline of steps, each step operating as a separate thread connected by a lightweight in-memory queue.
Logs are read off kafka at speeds exceeding 100MBps. Logs are organized by topic/partition id and pushed to a queue, and this thread goes back to reading logs.
Json Parser parses the logs to a metadata/data structure. At about 40MBps, this is the slowest step in the pipeline. Intel SIMD json parser is used which also necessitates use of AVX compatible processors. This is a choice made to get the best performance to cost metrics.
Block Maker re-organizes logs into Data blocks, and creates Index blocks, at 200MBps
Compressor compresses data blocks and writes to files; LZ4 operates at about 150MBps. While GZ compression is better, LZ4 is much faster than GZ and chosen for throughput.
S3 uploaders uploads files every 2 minutes and commits kafka offsets
Kafka is chosen as the data transporter for its throughput and indexer instances scale up and down without missing any messages.
Queues without locks were used with the consumer thread polling for data.
Most method calls were in-lined to avoid method call overhead.
Logging was turned off in production, with an option to turn on logging for short periods at runtime without restarting the instances. All metrics were custom collected and sent out on kafka.
Backpressure is built via a separate thread that monitors process memory usage and pauses Kafka readers based on pre-set memory thresholds.