Calculating CDFs in Sqlite
A common SQL task in customer-facing work (less often than checking a quick histogram, but it comes up) is calculating a cumulative distribution function (CDF) to answer questions like:
- What percentage of DB storage utilization is for data older than 12 months? 6 months? 3 months?
- What proportion of total message volume do the chattiest 1000 devices send?
- What are the top 5 applications by logs volume? What is their percentage of the total?
These questions can be answered easily by dropping a couple quick formulas in Excel, but it gets repetitive. It's nice to get quick feedback on cumulative values automatically while iterating on the underlying query.
Here's how I calculate CDFs with sqlite window functions (as a note to myself, because I will never remember the details).
Note: Sqlite has a cume_dist
window function that sound promising but works
only on partition row numbers, not arbitrary expressions.
References:
- SQLite Window Functions
- Histogram vs. eCDF: Marc Brooker on the usefulness of eCDFs.
1. Cumulative Quantity Y by Ascending Date X
Cumulative bloomed diameter by date. Yes, I'm stretching this made up data set beyond any sensible meaning.
select date(ts) as day, sum(diameter) as daily_diameter, sum(diameter) over (rows between unbounded preceding and current row) as cumulative_diameter, round(sum(diameter) over (rows between unbounded preceding and current row) / sum(diameter) over (rows between unbounded preceding and unbounded following), 2) as cdf from blooms where ts > '2023-01-01' group by date(ts) order by ts;
day | daily_diameter | cumulative_diameter | cdf |
---|---|---|---|
2023-04-10 | 7.0 | 7.0 | 0.12 |
2023-04-14 | 5.0 | 12.0 | 0.21 |
2023-04-19 | 5.0 | 17.0 | 0.3 |
2023-04-20 | 40.0 | 57.0 | 1.0 |
The same query, plotted with gnuplot on the command line:
sqlite3 -csv ~/dev/test.db "select date(ts) as day, round(sum(diameter) over (rows between unbounded preceding and current row) / sum(diameter) over (rows between unbounded preceding and unbounded following), 3) as cdf from blooms where ts > '2023-01-01' group by date(ts) order by ts;" | gnuplot -p -e 'set datafile separator ",";\ set term svg; set output "blooms-cdf.svg";\ set title "Blooms Diameter CDF";\ set xlabel ""; set ylabel "Cumulative Diameter";\ set xdata time; set timefmt "%Y-%m-%d"; set format x "%Y-%m-%d"; set xtics rotate;\ plot "-" using 1:2 with lines title ""' && open blooms-cdf.svg
2. Cumulative Quantity Y Ordered by Y
Cumulative bloomed diameter ordered by descending diameter. This example also shows a named window definition and window chaining.
select plant, diameter, sum(diameter) over cume as cumulative_diameter, round(sum(diameter) over cume / sum(diameter) over (diam_desc rows between unbounded preceding and unbounded following), 2) as cdf from blooms where ts > '2023-01-01' window diam_desc as (order by diameter desc), cume as (diam_desc rows between unbounded preceding and current row)
plant | diameter | cumulative_diameter | cdf |
---|---|---|---|
oak | 40.0 | 40.0 | 0.7 |
dogwood | 7.0 | 47.0 | 0.82 |
tulip | 5.0 | 52.0 | 0.91 |
hemlock | 5.0 | 57.0 | 1.0 |
3. Sample Setup
Create a database:
sudo apt install sqlite3 gnuplot sqlite3 test.db
Set up a table:
create table blooms ( plant text, location text, ts timestamptz, diameter double -- Feet ); insert into blooms values ('dogwood', 'deck', '2023-04-10', 7), ('tulip', 'front', '2023-04-19', 5), ('oak', 'side', '2023-04-20', 40), ('hemlock', 'side', '2022-04-15', 4), ('hemlock', 'side', '2023-04-14', 5);