Re: Why percent_rank is so slower than rank?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jie Li <jay23jack(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why percent_rank is so slower than rank?
Date: 2010-12-09 21:01:23
Message-ID: 29707.1291928483@sss.pgh.pa.us
Lists: pgsql-hackers

Jie Li <jay23jack(at)gmail(dot)com> writes:
> I'm new to window functions. Recently I run some simple queries but
> surprised to find percent_rank is so slower than rank, could anybody tell me
> why?

Huh, interesting. I can reproduce this with toy data, such as

create table inventory1 (inv_date_sk int, inv_item_sk int);
insert into inventory1 select 1, random()* 100000 from generate_series(1,189000);
explain analyze select inv_date_sk,inv_item_sk, percent_rank()over(partition by inv_date_sk order by inv_item_sk) from inventory1;
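(To actually hit the slow path, bump work_mem for the session before the EXPLAIN; the value here is only an illustrative guess -- anything large enough to keep the whole partition's tuplestore in memory will do:)

set work_mem = '256MB';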

The example is *not* particularly slow if you leave work_mem at default.
But if you bump up work_mem enough so that the WindowAgg's internal
tuplestore fits into memory, it slows down like crazy. A bit of quality
time with oprofile shows that all the time is going into this memmove()
in tuplestore_trim():

    /*
     * Slide the array down and readjust pointers.  This may look pretty
     * stupid, but we expect that there will usually not be very many
     * tuple-pointers to move, so this isn't that expensive; and it keeps a
     * lot of other logic simple.
     *
     * In fact, in the current usage for merge joins, it's demonstrable that
     * there will always be exactly one non-removed tuple; so optimize that
     * case.
     */
    if (nremove + 1 == state->memtupcount)
        state->memtuples[0] = state->memtuples[nremove];
    else
        memmove(state->memtuples, state->memtuples + nremove,
                (state->memtupcount - nremove) * sizeof(void *));

We're throwing away one tuple at a time as we advance forward through
the tuplestore, and moving 100000+ tuple pointers each time. Ugh.
This code was all right when written, because (IIRC) the mergejoin
case was actually the only caller. But it's not all right for
WindowAgg's less-predictable usage patterns.
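Back of the envelope, assuming roughly one trim per row over the ~189000-row partition and, on average, about half the pointer array moved per trim:

    189000 trims * ~94500 pointers/trim ~= 1.8e10 pointer moves

which is plenty to explain where all the time is going.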

I thought for a bit about changing things around so that the first-used
tuple slot isn't necessarily state->memtuples[0], but just like the
comment says, that complicates a lot of other logic. And there isn't
any easy place to reclaim the wasted slots later.

What seems like the best bet is to put in a heuristic to make
tuplestore_trim simply not do anything until nremove reaches some
reasonably large amount, perhaps 10% of the number of stored tuples.
This wastes up to 10% of the allotted memory, but that seems tolerable.
We could complicate things a bit more by remembering that so-and-so
many slots are authorized to be removed, and forcing a trim operation
to discard them if we find ourselves in memory trouble. I'm not sure
that extra complication is worthwhile though. Comments?
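For concreteness, here's roughly the shape I'm imagining for tuplestore_trim (untested sketch only; the deferred-removal counter -- call it memtupdeleted -- doesn't exist today, the 10% threshold is pulled out of the air, and the read-pointer and memory-accounting bookkeeping is omitted):

    /*
     * Hypothetical sketch: let dead leading slots accumulate (their tuples
     * already pfree'd and the slots set to NULL at logical-removal time)
     * and skip the expensive slide until they amount to ~10% of the array.
     */
    state->memtupdeleted += nremove;
    if (state->memtupdeleted < state->memtupcount / 10)
        return;                 /* not worth a memmove yet */

    /* compact all deferred removals in one pass */
    memmove(state->memtuples,
            state->memtuples + state->memtupdeleted,
            (state->memtupcount - state->memtupdeleted) * sizeof(void *));
    state->memtupcount -= state->memtupdeleted;
    state->memtupdeleted = 0;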

regards, tom lane
