Quick Links

Rethinking representation of sort/hash semantics in queries and plans

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	pgsql-hackers(at)postgreSQL(dot)org
Subject:	Rethinking representation of sort/hash semantics in queries and plans
Date:	2010-11-28 03:02:33
Message-ID:	8075.1290913353@sss.pgh.pa.us
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

In recent discussions of the plan-tree representation for KNNGIST index
scans, there seemed to be agreement that it'd be a good idea to explicitly
represent the expected sort ordering of the output. While thinking about
how best to do that, it struck me that there's some pretty horrendous
impedance mismatches in the way we do things now. Different parts of the
system use two different representations of sort ordering:

* A sort operator (which can have < or > semantics) plus nulls-first flag

* A btree opfamily plus direction and nulls-first flags

Getting from one of these representations to the other is not exactly
cheap, as it requires one or more syscache lookups. But consider what
happens when we process a simple SELECT ... ORDER BY query to produce
a Sort plan:

1. The parser generates a SortGroupClause, which contains the
sort-operator representation. This involves looking up the default btree
opclass for the ORDER BY column's datatype, then finding the < or > member
of the opclass. (These lookups are buffered by the typcache, but we'll
still have to do them at least once per session.) If you use ORDER BY
USING then you might think it's cheaper ... but we still do a lookup to
verify that the operator is in some btree family.

2. The planner generates a PathKey, which is based on the opfamily
representation, so we have to do get_ordering_op_properties to go back the
other way.

3. If a sort plan is chosen, createplan.c uses get_opfamily_member to go
from the PathKey representation back to sort-operator representation,
because the Sort plan node type stores sort operators.

4. At runtime, tuplesort_begin_heap needs the comparison function for the
sort operator, so it calls get_ordering_op_properties *again* to re-derive
the opfamily, from which it can extract the right support function using
get_opfamily_proc.

Things are only marginally less comical if an indexscan plan is chosen,
and that's only because the IndexScan plan node doesn't contain any
explicit representation of the output sort order. If we were to solve the
KNNGIST issue by installing sort operator lists in IndexScan, it'd be
about as ugly as this. (For some extra amusement, trace through where
build_index_pathkeys' data comes from...)

I have not traced through the behavior for hash-based plans as carefully,
but I believe that there are a similar number of conversions between
operator OID and hash opfamily OID representations.

We got into this mess by revising the planner to use opfamily OIDs to
define sort/hash semantics without changing the structures that are its
input and output APIs. I think it's probably time to fix this. I don't
have any evidence at the moment about what fraction of SearchSysCache load
is coming from these repeated conversions, but it can't be trivial. And
quite aside from any performance benefits, it would be conceptually
cleaner to have only one representation of sort semantics not two.

If you look closely at what we're doing with sort operators
(get_ordering_op_properties pretty much encapsulates this), it turns out
that a sort operator is shorthand for three pieces of information:

1. btree opfamily OID
2. specific input datatype for the opfamily
3. ascending or descending direction

So to fix these problems we'd need to replace sort operator OIDs in
SortGroupClause and plan nodes with those three items. Obviously, this
would be slightly bulkier, but the extra cost added to copying parse and
plan trees should be tiny compared to the avoided syscache lookups.

A possible compromise is to avoid storing the specific input datatype.
In most cases it's the same as the column datatype or expression datatype
that we're feeding to the sort operation, which I believe we can always
get from somewhere else in the plan. However, it can be different in
binary-compatibility cases (such as feeding a specific array type to
anyarray_ops, or varchar to text_ops). If we don't store it then we'd
need to add extra complexity in get_opfamily_proc and similar places
to search for a binary-compatible member function or operator. This
extra cost would be paid only when actually using binary-compatible
cases, but it still seems like it'd be better to pay a little extra
storage to avoid it.

In a similar fashion, I think that the eqop member of SortGroupClause
should be replaced by a hash opfamily OID and specific input datatype
(no direction needed here, of course), and likewise for hash operator
OIDs in hash-based plan node types.

A possible objection to this idea is that instead of having stored rules
and views depending on specific operators, they'd now be shown as
depending on specific opfamilies. If you were to drop the relevant
operators then you'd get failures at runtime because no suitable member
operator or function could be found. However, with the current scheme
it's possible to drop the opfamily that makes use of a particular operator
valid in a given query context; so you can still get a failure, though it
would happen a bit upstream in the planner when it fails to locate the
opfamily. On balance I don't think this is any better or worse, just
different. We don't support dynamic modification of opclasses terribly
well anyhow, given all the caching that is done for them.

Thoughts, objections?

regards, tom lane

Responses

Re: Rethinking representation of sort/hash semantics in queries and plans at 2010-11-28 14:45:54 from Martijn van Oosterhout
Re: Rethinking representation of sort/hash semantics in queries and plans at 2010-11-28 20:02:56 from Dimitri Fontaine
Re: Rethinking representation of sort/hash semantics in queries and plans at 2010-11-28 22:03:20 from Tom Lane

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2010-11-28 03:22:03	Re: Report: Linux huge pages with Postgres
Previous Message	Bruce Momjian	2010-11-28 01:09:11	Re: s/LABEL/VALUE/ for ENUMs