Wednesday, November 25, 2009

Aggregate function MEDIAN in PostgreSQL

Searching and calculating Median in databases was terrible. Still median isn't ANSI SQL aggregate function. There are two commons method how to calculate median of some column. First - very old, and very slow based on self join alchemy, second - new - based on analytic function. Now, I will be test some newer methods on one million rows large table:
postgres=# create table milrows(a real);
CREATE TABLE
Time: 7,975 ms
postgres=# insert into milrows select random()*1000000 from generate_series(1,1000000);
INSERT 0 1000000
Time: 6863,575 ms
simple scan on this table takes 200ms
Frompostgres=# select avg(a) from milrows ;
       avg        
──────────────────
 499515.883033113
(1 row)

Time: 200,176 ms
In 8.4 we can use analytic functions. These functions uses TupleStore - internal store feature - it allows work with very large tables - limit is free space on disc.

Analytic methods

--Joe Celko's method
postgres=# SELECT avg(a)::float
              FROM (SELECT a, row_number() OVER (ORDER BY a asc) AS hi,
                              count(*) OVER () + 1 - row_number() OVER (ORDER BY a) AS lo
                       FROM milrows) qs
             WHERE hi IN (lo-1,lo,lo+1);
      avg      
───────────────
 499188.546875
(1 row)

Time: 4922,678 ms

-- Andrew Gierth's method 
postgres=# select avg(a)
   from ( select a, row_number() over (order by a),count(*) over () from milrows ) s
  where row_number between floor((count::float8-1)/2+1) and ceil((count::float8-1)/2+1)
;
      avg      
───────────────
 499188.546875
(1 row)

Time: 5021,001 ms

-- modified Andrew's method (count(*) over () is slow)
postgres=# select avg(a)
   from ( select a, row_number() over (order by a),(select count(*) from milrows) as count from milrows ) s
  where row_number between floor((count::float8-1)/2+1) and ceil((count::float8-1)/2+1)
;
      avg      
───────────────
 499188.546875
(1 row)

Time: 3931,922 ms

Array based methods

Next methods are based on using an arrays. These methods are fast, but limit for this methods is size of operation memory. For very very large tables could to take all application memory.
--Regina's method -- it's not 100% correct http://www.postgresonline.com/journal/index.php?/archives/67-Build-Median-Aggregate-Function-in-SQL.html#extended

CREATE OR REPLACE FUNCTION array_median(double precision[])
  RETURNS double precision AS
$$
    SELECT CASE WHEN array_upper($1,1) = 0 THEN null 
                ELSE asorted[ceiling(array_upper(asorted,1)/2.0)]::double precision END
       FROM (SELECT ARRAY(SELECT $1[n] 
                FROM generate_series(1, array_upper($1, 1)) AS n
               WHERE $1[n] IS NOT NULL
               ORDER BY $1[n]) As asorted) As foo 
$$ LANGUAGE 'sql' IMMUTABLE;

CREATE AGGREGATE median(double precision) (
  SFUNC=array_append,
  STYPE=double precision[],
  FINALFUNC=array_median
);

postgres=# select median(a) from milrows ;
^CCancel request sent
ERROR:  canceling statement due to user request -- killed 5 minutes !don't use array_append for bigger arrays (length > 10000)
postgres=# 

--My method
postgres=# create or replace function median(anyarray) 
returns double precision as $$
  select ($1[array_upper($1,1)/2+1]::double precision + $1[(array_upper($1,1)+1) / 2]::double precision) / 2.0; 
$$ language sql immutable strict;
CREATE FUNCTION
Time: 1,557 ms

Time: 2574,677 ms
postgres=# select median(array(select a from milrows where a is not null order by a));
    median     
───────────────
 499188.546875
(1 row)

Time: 2555,342 ms
This week I added support for median aggregate to orafce package. You can download it from url http://pgfoundry.org/frs/download.php/2472/orafce-3.0.2-devel.tar.gz . Function median use some fetures 8.4 and needs 8.4 - it isn't supported on PostgreSQL 8.3 and older.
-- orafce 3.0.2 median (needs PostgreSQL 8.4 and higher)

postgres=# select median(a::float8) from milrows;
    median     
───────────────
 499188.546875
(1 row)

Time: 687,577 ms
It's very fast - if your table has about one million rows (1000000) you can use it (for this table size takes max. 15MB RAM (for one column)).

Saturday, November 14, 2009

longtime plpgsql misfeature removed

Tom Lane did refactoring of plpgsql source code. These changes are very very important. plpgsql is good language - simple, robust with good error diagnostic. But it had one bizarre behave. plpgsql connects two worlds - procedural ADA like code and SQL statements. Usually there are not problems. But there are one exception - collision of identifiers. Older behave was too simply. Plpgsql identifiers win every-time. It was a source of some bizarre bugs. Look on code: 
postgres=# select * from omega;
 a  
────
 10
 20
 30
(3 rows)

create or replace function foo() 
returns void as $$
#variable_conflict use_variable -- compatible with 8.4 and older
declare a integer; 
begin 
  for a in select a from omega 
  loop 
    raise notice '%', a;
  end loop; 
end; 
$$ language plpgsql;
This code is very simple. Just show content of table omega.
postgres=# select foo();
NOTICE:  <null>
NOTICE:  <null>
NOTICE:  <null>
 foo 
─────
 
(1 row)
or not? Why we don't see values 10,20,30? Because interpret prefer plpgsql identifier against to sql identifier omega.a. This bug is very strange and some time is very difficult to find it. But it is a history. plpgsql 8.5 is much more cleaner. Wrong code raises en exception:
postgres=# 
create or replace function foo() 
returns void as $$
declare a integer;                                              
begin              
  for a in select a from omega 
  loop                         
    raise notice '%', a;
  end loop; 
end; 
$$ language plpgsql;
CREATE FUNCTION
Time: 3,501 ms
postgres=# select foo();
ERROR:  column reference "a" is ambiguous
LINE 1: select a from omega
               ^
DETAIL:  It could refer to either a PL/pgSQL variable or a table column.
QUERY:  select a from omega
CONTEXT:  PL/pgSQL function "foo" line 3 at FOR over SELECT rows
We could to fix this problem and we get a good answer:
postgres=# 
create or replace function foo() 
returns void as $$
declare a integer; 
begin 
  for a in select omega.a from omega 
  loop 
    raise notice '%', a;
  end loop; 
end; 
$$ language plpgsql;
CREATE FUNCTION
Time: 2,289 ms
postgres=# select foo();
NOTICE:  10
NOTICE:  20
NOTICE:  30
 foo 
─────
 
(1 row)
I am very happy from this changes. Thanks Tom.