Tuplestore trimming in window-functions patch

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Subject: Tuplestore trimming in window-functions patch
Date: 2008-12-26 23:28:28
Message-ID: 16925.1230334108@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

The last bit of performance-related hacking that seems to be needed in
the window functions patch is to fix things so that we can trim old
rows from the underlying tuplestore when they're no longer needed.
In particular I think it's critical to be able to do this for the
simple cases of lead() and lag() with constant offsets. This topic
has been mentioned repeatedly in the discussions to date, but nothing
much got done.

The components we seem to be missing are:

1. lead/lag have to be able to tell whether their offset arguments are
constants.

2. The WindowObject API has to be extended so that a window function can
tell the executor that it will never again fetch a row before row X.
(This has to be cheap enough that we can do it on each call.)

3. The tuplestore API itself seems a bit shy of a load --- what we do
not have is a concept that a particular read pointer can move back and
forth, but is guaranteed never to move back before some-defined-place.
(This is clearly an oversight in my recent redesign of that API.)

I don't think it's hard to fix #1. What I propose we do is extend the
existing fmgr.h APIs that allow inquiry about the data types of function
arguments -- say, something along the line of
extern bool get_fn_arg_is_const(FmgrInfo *flinfo, int argnum)
This would at bottom be an "IsA(arg_expr, Const)" type of test but it
seems best to keep the details hidden from caller functions. (Note:
I've already fixed the WindowAgg infrastructure to support
get_fn_expr_argtype and friends --- this really isn't optional if you
expect polymorphic window functions to be common, and the evidence of
the built-in ones is that they are. It's pure luck that none of the
built-in ones has had any need to explicitly find out what data type
it's processing.)

#2 is also easy given a solution to #3; the WinForgetBefore() call would
just translate to some appropriate action on the tuplestore read pointer
associated with the window function's WindowObject.

So I think the problem really reduces to fixing the tuplestore API some
more. One way to go at it would be to re-introduce an explicit notion
of a "mark" pointer associated with (some or all) read pointers, and
have a capability bit saying that a read pointer is allowed to go
backward so long as it doesn't go before its mark pointer. However
I really like the simplicity of the current API design where all read
pointers are alike and there's no explicit notion of a mark pointer.
We could keep that simplicity by having a capability bit that says
"this pointer can move backward so long as it doesn't become the oldest
one". In this design you still need a mark pointer to delimit how
much tuplestore you want to keep, but there's no explicit link between
any one mark pointer and any one active read pointer so far as
tuplestore.c is aware. (Note this'd imply that a WindowObject contains
both a mark pointer and a read pointer for its window function, and the
WinForgetBefore() call actually moves the mark pointer.) A possible
objection is that it's a bit more error-prone to not explicitly say
which other read pointer is intended to be the movement fence for your
read pointer; but I'm not convinced whether that's important or not.

Also, we could really do this without any tuplestore API changes at all,
if we were willing to implement any backing-up of the read pointer as
copying the read pointer from the mark pointer and then advancing the
read pointer. However that could get pretty inefficient if there were
any significant distance between the mark and read pointers.

Another thing that I'm thinking is that it was a mistake to not
explicitly expose tuplestore_trim() as part of the tuplestore API.
It seems that the callers always have a much better idea than
tuplestore.c of when it would be worth trying to trim the tuplestore.
The current WF patch wants to put trim calls into such bizarre places
as select_read_pointer, which is utterly counterintuitive and usually
a waste of cycles too. If we changed the rule to be that we trim when
the caller tells us to, we could do something saner --- I think that
really WindowAgg only needs to trim once per input tuple. We'd have to
add a line or two in some other places like mergejoin, but it'd really
not be much of a problem.

Comments?

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Erik Rijkers 2008-12-27 00:59:56 pg_restore --clean text
Previous Message Tom Lane 2008-12-26 21:41:01 Re: Frames vs partitions: is SQL2008 completely insane?