Re: Why percent_rank is so slower than rank?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jie Li <jay23jack(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why percent_rank is so slower than rank?
Date: 2010-12-09 21:01:23
Message-ID: 29707.1291928483@sss.pgh.pa.us
Lists: pgsql-hackers

Jie Li <jay23jack(at)gmail(dot)com> writes:
> I'm new to window functions. Recently I run some simple queries but
> surprised to find percent_rank is so slower than rank, could anybody tell me
> why?

Huh, interesting. I can reproduce this with toy data, such as

create table inventory1 (inv_date_sk int, inv_item_sk int);
insert into inventory1 select 1, random()* 100000 from generate_series(1,189000);
explain analyze select inv_date_sk,inv_item_sk, percent_rank()over(partition by inv_date_sk order by inv_item_sk) from inventory1;
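(To actually hit the slow path, bump work_mem for the session before the EXPLAIN; the value here is only an illustrative guess -- anything large enough to keep the whole partition's tuplestore in memory will do:)

set work_mem = '256MB';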

The example is *not* particularly slow if you leave work_mem at default.
But if you bump up work_mem enough so that the WindowAgg's internal
tuplestore fits into memory, it slows down like crazy. A bit of quality
time with oprofile shows that all the time is going into this memmove()
in tuplestore_trim():

    /*
     * Slide the array down and readjust pointers.  This may look pretty
     * stupid, but we expect that there will usually not be very many
     * tuple-pointers to move, so this isn't that expensive; and it keeps a
     * lot of other logic simple.
     *
     * In fact, in the current usage for merge joins, it's demonstrable that
     * there will always be exactly one non-removed tuple; so optimize that
     * case.
     */
    if (nremove + 1 == state->memtupcount)
        state->memtuples[0] = state->memtuples[nremove];
    else
        memmove(state->memtuples, state->memtuples + nremove,
                (state->memtupcount - nremove) * sizeof(void *));

We're throwing away one tuple at a time as we advance forward through
the tuplestore, and moving 100000+ tuple pointers each time. Ugh.
This code was all right when written, because (IIRC) the mergejoin
case was actually the only caller. But it's not all right for
WindowAgg's less-predictable usage patterns.
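Back of the envelope, assuming roughly one trim per row over the ~189000-row partition and, on average, about half the pointer array moved per trim:

    189000 trims * ~94500 pointers/trim ~= 1.8e10 pointer moves

which is plenty to explain where all the time is going.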

I thought for a bit about changing things around so that the first-used
tuple slot isn't necessarily state->memtuples[0], but just like the
comment says, that complicates a lot of other logic. And there isn't
any easy place to reclaim the wasted slots later.

What seems like the best bet is to put in a heuristic to make
tuplestore_trim simply not do anything until nremove reaches some
reasonably large amount, perhaps 10% of the number of stored tuples.
This wastes up to 10% of the allotted memory, but that seems tolerable.
We could complicate things a bit more by remembering that so-and-so
many slots are authorized to be removed, and forcing a trim operation
to discard them if we find ourselves in memory trouble. I'm not sure
that extra complication is worthwhile though. Comments?
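For concreteness, here's roughly the shape I'm imagining for tuplestore_trim (untested sketch only; the deferred-removal counter -- call it memtupdeleted -- doesn't exist today, the 10% threshold is pulled out of the air, and the read-pointer and memory-accounting bookkeeping is omitted):

    /*
     * Hypothetical sketch: let dead leading slots accumulate (their tuples
     * already pfree'd and the slots set to NULL at logical-removal time)
     * and skip the expensive slide until they amount to ~10% of the array.
     */
    state->memtupdeleted += nremove;
    if (state->memtupdeleted < state->memtupcount / 10)
        return;                 /* not worth a memmove yet */

    /* compact all deferred removals in one pass */
    memmove(state->memtuples,
            state->memtuples + state->memtupdeleted,
            (state->memtupcount - state->memtupdeleted) * sizeof(void *));
    state->memtupcount -= state->memtupdeleted;
    state->memtupdeleted = 0;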

regards, tom lane
