Reducing disk read operations from statistics collection

CrateDB regularly populates the system table pg_catalog.pg_stats. It contains approximate statistical information at the column level, such as a column’s number of distinct values or most common values. Client applications can use those statistics for insights into data characteristics, but CrateDB’s query planner also uses them. One example is the optimization of joins, where knowing the cardinalities of both join keys can be beneficial.

Statistics are collected regularly in the background, by default every 24 hours. This interval is defined on the cluster level as stats.service.interval. Even though the statistics collection applies sampling, it can generate noticeable read throughput on the disk. To prevent spikes in disk usage, CrateDB 5.1.0 introduced the setting stats.service.max_bytes_per_sec that throttles disk throughput for statistics collection to a maximum of 40 MB/s.

In this article, we discuss strategies for older CrateDB versions (prior to the implementation of throttling) on how to prevent spikes in disk usage.

Collecting statistics less often

By changing the cluster-wide setting stats.service.interval, CrateDB can be configured to collect statistics less often. To do so only once a week, change the setting to 7d by running SET GLOBAL PERSISTENT "stats.service.interval" = '7d'; or changing the setting in crate.yml.

When increasing the interval, keep in mind that statistics can become outdated and impact the query planner to make adverse decisions. If cardinalities change significantly for certain tables, it is advisable to run ANALYZE manually at a convenient time.

Collecting statistics manually

Instead of having CrateDB schedule and execute statistics collection, you can do so externally to gain more control over it.

For this, we disable the automatic statistics collection: SET GLOBAL PERSISTENT "stats.service.interval" = 0;. As a replacement, we set up a bash script using the crash command line client:

#!/bin/bash

CRATEDB_USER="crate"
CRATEDB_PASSWORD=""
CRATEDB_HOST="localhost"

crash --hosts "http://${CRATEDB_USER}:${CRATEDB_PASSWORD}@${CRATEDB_HOST}:4200" \
      -c "ANALYZE;" 1> /dev/null

Using a cronjob or similar scheduling mechanism, the bash script can be executed at a convenient time, such as outside of business hours.

Upgrading CrateDB

The mentioned workarounds are only needed for CrateDB versions older than 5.1.0. If you still see disk usage spikes on later versions, try lowering the throttling (SET GLOBAL PERSISTENT "stats.service.max_bytes_per_sec" = '20mb';).

1 Like