### Aggregate function MEDIAN in PostgreSQL

Calculating a median in databases has always been painful; median still isn't an ANSI SQL aggregate function. There are two common methods for calculating the median of a column: the first, very old and very slow, is based on self-join alchemy; the second, newer one is based on analytic (window) functions. Below I test some of the newer methods on a table with one million rows:

```
postgres=# create table milrows(a real);
CREATE TABLE
Time: 7,975 ms
postgres=# insert into milrows select random()*1000000 from generate_series(1,1000000);
INSERT 0 1000000
Time: 6863,575 ms
```

A simple scan of this table takes about 200 ms:

```
postgres=# select avg(a) from milrows;
       avg
──────────────────
 499515.883033113
(1 row)

Time: 200,176 ms
```

In 8.4 we can use analytic functions. These functions use a tuplestore, an internal store feature that lets them work with very large tables; the only limit is free disk space.

## Analytic methods

```
-- Joe Celko's method
postgres=# SELECT avg(a)::float
             FROM (SELECT a,
                          row_number() OVER (ORDER BY a asc) AS hi,
                          count(*) OVER () + 1 - row_number() OVER (ORDER BY a) AS lo
                     FROM milrows) qs
            WHERE hi IN (lo-1, lo, lo+1);
      avg
───────────────
 499188.546875
(1 row)

Time: 4922,678 ms

-- Andrew Gierth's method
postgres=# select avg(a)
             from (select a, row_number() over (order by a), count(*) over ()
                     from milrows) s
            where row_number between floor((count::float8-1)/2+1)
                              and ceil((count::float8-1)/2+1);
      avg
───────────────
 499188.546875
(1 row)

Time: 5021,001 ms

-- modified Andrew's method (count(*) over () is slow)
postgres=# select avg(a)
             from (select a, row_number() over (order by a),
                          (select count(*) from milrows) as count
                     from milrows) s
            where row_number between floor((count::float8-1)/2+1)
                              and ceil((count::float8-1)/2+1);
      avg
───────────────
 499188.546875
(1 row)

Time: 3931,922 ms
```
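All three queries share the same arithmetic: take the one or two middle rows of the ordered set and average them. A minimal sketch of that index logic in Python (my own illustration, using the same floor/ceil row bounds as the SQL above):

```python
import math

def median(values):
    """Average the one (odd count) or two (even count) middle
    elements, using the same 1-based floor/ceil row bounds as
    the window-function queries."""
    s = sorted(values)
    n = len(s)
    if n == 0:
        return None
    lo = math.floor((n - 1) / 2 + 1)  # 1-based lower middle row
    hi = math.ceil((n - 1) / 2 + 1)   # 1-based upper middle row
    return (s[lo - 1] + s[hi - 1]) / 2.0

print(median([1, 2, 3, 4]))  # 2.5 -- average of rows 2 and 3
print(median([3, 1, 2]))     # 2.0 -- the single middle row
```

For an even count `lo` and `hi` differ by one, so the two middle rows are averaged; for an odd count they coincide and the single middle row is returned unchanged.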

## Array-based methods

The next methods are based on arrays. These methods are fast, but they are limited by the amount of working memory; for very, very large tables they can consume all of the application's memory.

```
-- Regina's method -- it's not 100% correct
-- http://www.postgresonline.com/journal/index.php?/archives/67-Build-Median-Aggregate-Function-in-SQL.html#extended
CREATE OR REPLACE FUNCTION array_median(double precision[])
RETURNS double precision AS $$
  SELECT CASE WHEN array_upper($1,1) = 0 THEN null
              ELSE asorted[ceiling(array_upper(asorted,1)/2.0)]::double precision
         END
    FROM (SELECT ARRAY(SELECT $1[n]
                         FROM generate_series(1, array_upper($1, 1)) AS n
                        WHERE $1[n] IS NOT NULL
                        ORDER BY $1[n]) As asorted) As foo
$$ LANGUAGE 'sql' IMMUTABLE;

CREATE AGGREGATE median(double precision) (
  SFUNC=array_append,
  STYPE=double precision[],
  FINALFUNC=array_median
);

postgres=# select median(a) from milrows;
^CCancel request sent
ERROR:  canceling statement due to user request
-- killed after 5 minutes!
```

Don't use array_append for bigger arrays (length > 10000).

```
-- My method
postgres=# create or replace function median(anyarray)
           returns double precision as $$
  select ($1[array_upper($1,1)/2+1]::double precision
          + $1[(array_upper($1,1)+1)/2]::double precision) / 2.0;
$$ language sql immutable strict;
CREATE FUNCTION
Time: 1,557 ms

postgres=# select median(array(select a from milrows where a is not null order by a));
    median
───────────────
 499188.546875
(1 row)

Time: 2555,342 ms
```

This week I added support for a median aggregate to the orafce package. You can download it from http://pgfoundry.org/frs/download.php/2472/orafce-3.0.2-devel.tar.gz . The median function uses some 8.4 features and requires 8.4; it isn't supported on PostgreSQL 8.3 and older.
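The trick in "my method" above is that the two 1-based subscripts, `n/2+1` and `(n+1)/2` with integer division, land on the same element for odd `n` and on the two middle elements for even `n`. In 0-based terms (a sketch of the index arithmetic only, not of the SQL function itself):

```python
def median_sorted(s):
    """s must be sorted and non-empty; mirrors the SQL subscripts
    $1[n/2+1] and $1[(n+1)/2] shifted to 0-based indexing."""
    n = len(s)
    return (s[n // 2] + s[(n + 1) // 2 - 1]) / 2.0

print(median_sorted([1, 2, 3, 4]))  # 2.5
print(median_sorted([1, 2, 3]))     # 2.0
```

Note that the sorting is pushed into the `order by` of the calling query, so the function itself only does two array lookups.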

```
-- orafce 3.0.2 median (needs PostgreSQL 8.4 and higher)
postgres=# select median(a::float8) from milrows;
    median
───────────────
 499188.546875
(1 row)

Time: 687,577 ms
```

It's very fast. If your table has about one million rows (1000000), you can use it; at this table size it takes at most about 15 MB of RAM (for one column).
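The 15 MB figure is plausible from a back-of-the-envelope calculation (my own estimate; the overhead factor is an assumption, not something measured from orafce):

```python
n = 1_000_000
raw = n * 8                  # float8 is 8 bytes per value
overhead_factor = 2          # assumed per-element/allocator overhead

print(raw / 1024 ** 2)                    # ~7.6 MB of raw data
print(raw * overhead_factor / 1024 ** 2)  # ~15 MB including overhead
```

So the stated limit is roughly the raw column data doubled, which is consistent with an in-memory array holding one copy of each value plus bookkeeping.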

## 4 Comments:

Pavel,

Sounds cool. I'll have to check out that package.

Thanks,

Regina

Pavel,

Why not submit median() to mainstream PostgreSQL? It doesn't seem particularly Oracle-ish.

[Josh] - My implementation isn't good for very large data sets (but it is better than the other methods currently in use). A real implementation would have to be based on a tuplestore, and that needs a different API than the current aggregate API :( - so it needs a lot of work.

I'm not sure why this is faster, but here's a 4th analytic method that runs a tad faster than the other three:

```
select
    avg(ordered.value)
from
    (
        select
            row_number() over (order by a) as position,
            a as value
        from
            milrows
    ) as ordered
where
    ordered.position = (select ceil((count(a)::float8-1)/2+1) from milrows) or
    ordered.position = (select floor((count(a)::float8-1)/2+1) from milrows)
;
```

Here's the timing:

```
alpha=# create table milrows(a real);
CREATE TABLE
Time: 2.410 ms
alpha=# insert into milrows select random()*1000000 from generate_series(1,1000000);
INSERT 0 1000000
Time: 2474.768 ms
alpha=#
alpha=# select avg(a)
alpha-# from ( select a, row_number() over (order by a),(select count(*) from milrows) as count from milrows ) s
alpha-# where row_number between floor((count::float8-1)/2+1) and ceil((count::float8-1)/2+1)
alpha-# ;
     avg
--------------
 499884.46875
(1 row)

Time: 3242.837 ms
alpha=# select
alpha-#     avg(ordered.value)
alpha-# from
alpha-#     (
alpha(#         select
alpha(#             row_number() over (order by a) as position,
alpha(#             a as value
alpha(#         from
alpha(#             milrows
alpha(#     ) as ordered
alpha-# where
alpha-#     ordered.position = (select ceil((count(a)::float8-1)/2+1) from milrows) or
alpha-#     ordered.position = (select floor((count(a)::float8-1)/2+1) from milrows)
alpha-# ;
     avg
--------------
 499884.46875
(1 row)

Time: 2802.536 ms
alpha=#
```
