October 08, 2023

Calculating CDFs in Sqlite

A common SQL task in customer-facing work (less often than checking a quick histogram, but it comes up) is calculating a cumulative distribution function (CDF) to answer questions like:

  1. What percentage of DB storage utilization is for data older than 12 months? 6 months? 3 months?
  2. What proportion of total message volume do the chattiest 1000 devices send?
  3. What are the top 5 applications by logs volume? What is their percentage of the total?

These questions can be answered easily by dropping a couple quick formulas in Excel, but it gets repetitive. It's nice to get quick feedback on cumulative values automatically while iterating on the underlying query.

Here's how I calculate CDFs with sqlite window functions (as a note to myself, because I will never remember the details).

Note: Sqlite has a cume_dist window function that sound promising but works only on partition row numbers, not arbitrary expressions.

References:

1. Cumulative Quantity Y by Ascending Date X

Cumulative bloomed diameter by date. Yes, I'm stretching this made up data set beyond any sensible meaning.

select
    date(ts) as day,
    sum(diameter) as daily_diameter,
    sum(diameter) over (rows between unbounded preceding and current row) as cumulative_diameter,
    round(sum(diameter) over (rows between unbounded preceding and current row) /
          sum(diameter) over (rows between unbounded preceding and unbounded following), 2) as cdf
from blooms
where ts > '2023-01-01'
group by date(ts)
order by ts;
day daily_diameter cumulative_diameter cdf
2023-04-10 7.0 7.0 0.12
2023-04-14 5.0 12.0 0.21
2023-04-19 5.0 17.0 0.3
2023-04-20 40.0 57.0 1.0

The same query, plotted with gnuplot on the command line:

sqlite3 -csv ~/dev/test.db "select
    date(ts) as day,
    round(sum(diameter) over (rows between unbounded preceding and current row) /
          sum(diameter) over (rows between unbounded preceding and unbounded following), 3) as cdf
from blooms
where ts > '2023-01-01'
group by date(ts)
order by ts;" | gnuplot -p -e 'set datafile separator ",";\
      set term svg; set output "blooms-cdf.svg";\
      set title "Blooms Diameter CDF";\
      set xlabel ""; set ylabel "Cumulative Diameter";\
      set xdata time; set timefmt "%Y-%m-%d"; set format x "%Y-%m-%d"; set xtics rotate;\
      plot "-" using 1:2 with lines title ""' && open blooms-cdf.svg
blooms-cdf.svg
Figure 1: Blooms CDF plot

2. Cumulative Quantity Y Ordered by Y

Cumulative bloomed diameter ordered by descending diameter. This example also shows a named window definition and window chaining.

select
    plant,
    diameter,
    sum(diameter) over cume as cumulative_diameter,
    round(sum(diameter) over cume /
          sum(diameter) over (diam_desc rows between unbounded preceding and unbounded following), 2) as cdf
from blooms
where ts > '2023-01-01'
window diam_desc as (order by diameter desc),
cume as (diam_desc rows between unbounded preceding and current row)
plant diameter cumulative_diameter cdf
oak 40.0 40.0 0.7
dogwood 7.0 47.0 0.82
tulip 5.0 52.0 0.91
hemlock 5.0 57.0 1.0

3. Sample Setup

Create a database:

sudo apt install sqlite3 gnuplot
sqlite3 test.db

Set up a table:

create table blooms (
    plant text,
    location text,
    ts timestamptz,
    diameter double -- Feet
);

insert into blooms values
    ('dogwood', 'deck', '2023-04-10', 7),
    ('tulip', 'front', '2023-04-19', 5),
    ('oak', 'side', '2023-04-20', 40),
    ('hemlock', 'side', '2022-04-15', 4),
    ('hemlock', 'side', '2023-04-14', 5);

RSS

Creative Commons License This work is licensed under a Creative Commons Attribution 4.0 International License.