Calculating CDFs in Sqlite

A common SQL task in customer-facing work (less often than checking a quick histogram, but it comes up) is calculating a cumulative distribution function (CDF) to answer questions like:

What percentage of DB storage utilization is for data older than 12 months? 6 months? 3 months?
What proportion of total message volume do the chattiest 1000 devices send?
What are the top 5 applications by logs volume? What is their percentage of the total?

These questions can be answered easily by dropping a couple quick formulas in Excel, but it gets repetitive. It's nice to get quick feedback on cumulative values automatically while iterating on the underlying query.

Here's how I calculate CDFs with sqlite window functions (as a note to myself, because I will never remember the details).

Note: Sqlite has a cume_dist window function that sound promising but works only on partition row numbers, not arbitrary expressions.

References:

SQLite Window Functions
Histogram vs. eCDF: Marc Brooker on the usefulness of eCDFs.

1. Cumulative Quantity Y by Ascending Date X

Cumulative bloomed diameter by date. Yes, I'm stretching this made up data set beyond any sensible meaning.

select
    date(ts) as day,
    sum(diameter) as daily_diameter,
    sum(diameter) over (rows between unbounded preceding and current row) as cumulative_diameter,
    round(sum(diameter) over (rows between unbounded preceding and current row) /
          sum(diameter) over (rows between unbounded preceding and unbounded following), 2) as cdf
from blooms
where ts > '2023-01-01'
group by date(ts)
order by ts;

day	daily_diameter	cumulative_diameter	cdf
2023-04-10	7.0	7.0	0.12
2023-04-14	5.0	12.0	0.21
2023-04-19	5.0	17.0	0.3
2023-04-20	40.0	57.0	1.0

The same query, plotted with gnuplot on the command line:

sqlite3 -csv ~/dev/test.db "select
    date(ts) as day,
    round(sum(diameter) over (rows between unbounded preceding and current row) /
          sum(diameter) over (rows between unbounded preceding and unbounded following), 3) as cdf
from blooms
where ts > '2023-01-01'
group by date(ts)
order by ts;" | gnuplot -p -e 'set datafile separator ",";\
      set term svg; set output "blooms-cdf.svg";\
      set title "Blooms Diameter CDF";\
      set xlabel ""; set ylabel "Cumulative Diameter";\
      set xdata time; set timefmt "%Y-%m-%d"; set format x "%Y-%m-%d"; set xtics rotate;\
      plot "-" using 1:2 with lines title ""' && open blooms-cdf.svg

2. Cumulative Quantity Y Ordered by Y

Cumulative bloomed diameter ordered by descending diameter. This example also shows a named window definition and window chaining.

select
    plant,
    diameter,
    sum(diameter) over cume as cumulative_diameter,
    round(sum(diameter) over cume /
          sum(diameter) over (diam_desc rows between unbounded preceding and unbounded following), 2) as cdf
from blooms
where ts > '2023-01-01'
window diam_desc as (order by diameter desc),
cume as (diam_desc rows between unbounded preceding and current row)

plant	diameter	cumulative_diameter	cdf
oak	40.0	40.0	0.7
dogwood	7.0	47.0	0.82
tulip	5.0	52.0	0.91
hemlock	5.0	57.0	1.0

3. Sample Setup

Create a database:

sudo apt install sqlite3 gnuplot
sqlite3 test.db

Set up a table:

create table blooms (
    plant text,
    location text,
    ts timestamptz,
    diameter double -- Feet
);

insert into blooms values
    ('dogwood', 'deck', '2023-04-10', 7),
    ('tulip', 'front', '2023-04-19', 5),
    ('oak', 'side', '2023-04-20', 40),
    ('hemlock', 'side', '2022-04-15', 4),
    ('hemlock', 'side', '2023-04-14', 5);