Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype
Date: 2016-06-17 18:34:50
Message-ID: CA+TgmoYkMqdGnjg7LAbxF9DgYUeJS50ecXV1HOVUAs0W_qzvKA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jun 17, 2016 at 1:45 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> I don't mind if you feel the need to refactor this.
>
>> I'm not sure yet. What I think we need to work out first is exactly
>> how polymorphic parallel aggregates ought to identify which concrete
>> data types they are using. There were well-defined rules before,
>> but how do we adapt those to two levels of aggregate evaluation?
>
> After reviewing things a bit, it seems like the first design-level issue
> there is what about polymorphic combine functions. So far as the type
> system is concerned, there's no issue: the declared signature is always
> going to be "combine(foo, foo) returns foo" and it doesn't matter whether
> foo is an ordinary type, a polymorphic one, or INTERNAL. The other
> question is whether the function itself knows what it's operating on in
> the current invocation. For regular polymorphic types this seems no
> different from any other usage. If the transtype is INTERNAL, then the
> type system provides no help; but we could reasonably assume that the
> internal representation itself contains as much information as the combine
> func needs. We could add extra arguments like the "finalfunc_extra"
> option does for the finalfunc, but I don't really see a need --- that hack
> is mainly to satisfy the type system that the finalfunc's signature is
> sane, not to tell the finalfunc something it has no other way to find out.
> So I think we're probably OK as far as the combine function is concerned.

OK....

> The situation is much direr as far as serialize/deserialize are concerned.
> These basically don't work at all for polymorphic transtypes: if you try
> to declare "deserialize(bytea) returns anyarray", the type system won't
> let you. Perhaps that's not an issue because you shouldn't really need
> serialize/deserialize for anything except INTERNAL transtype.

In fact, I think that it won't even let you declare a serialize or
deserialize function for anything other than an INTERNAL transtype
right now. There was a PostGIS use case that made it seem like we
might want to relax that, but currently I believe that's the law of
the land.

> However,
> there's a far bigger problem which is that "deserialize(bytea) returns
> internal" is a blatant security hole, which I see you ignored the defense
> against in opr_sanity.sql. The claim in the comment there that it's okay
> if we check for aggregate context is a joke --- or haven't you heard that
> we allow users to create aggregates? That's not nearly good enough to
> prevent unsafe usage of such functions.

That may be true, but it isn't obvious to me. The user can't simply
say SELECT numeric_avg_deserialize(...) or whatever because the
function won't execute when called that way. How would you arrange to
call such a function in a context where it wouldn't fail outright but
would return a value that you could then hand to some other function
that expects a type-internal input?

> Not to mention that CREATE
> FUNCTION won't allow creation of such functions, so extensions are locked
> out of using this feature.

Oops.

> This has to be redesigned or else reverted entirely. I'm not going to
> take no for an answer.

Please don't act as if I've been refusing to fix bugs. I've been
doing practically nothing else.

> A possible solution is to give deserialize an extra dummy argument, along
> the lines of "deserialize(bytea, internal) returns internal", thereby
> ensuring it can't be called in any non-system-originated contexts.

Seems promising.

> This
> is still rather dangerous if the other argument is variable, as somebody
> might be able to abuse an internal-taking function by naming it as the
> deserialize function for a maliciously-designed aggregate.

The complex and, as far as I can tell, largely undocumented set of
rules about what sorts of internal-taking-and-returning functions we
allow to exist strikes me as a house of cards just waiting for the
next stiff breeze to come along.

> What I'm
> inclined to do to lock it down further is to drop the "serialtype"
> argument to CREATE AGGREGATE, which seems rather pointless (what else
> would you ever use besides bytea?). Instead, insist that
> serialize/deserialize apply *only* when the transtype is INTERNAL, and
> their signatures are exactly "serialize(internal) returns bytea" and
> "deserialize(bytea, internal) returns internal", never anything else.

I originally had the thought that you might want to make the
serialized representation something that could be human-readable. I'd
actually like to have an SQL syntax that outputs the transition state
or, if it's internal, the serialized representation of it. That would
allow us to do partitionwise aggregation given a query like SELECT
avg(foo) FROM parent_table_for_several_foreign_tables. You'd need to
be able to ship something like SELECT PARTIAL avg(foo) FROM t1 to each
child. And it might be nicer if the results came back in some
human-readable format rather than as a bytea, but it's not essential,
of course.

> A different way of locking things down, which might be cleaner in the
> long run, is to invent a new pseudo-type for the sole purpose of being
> the serialization type, that is we'd insist on the signatures being
> "serialize(internal) returns serialized_internal" and
> "deserialize(serialized_internal) returns internal", where
> serialized_internal has the representation properties of bytea but is
> usable for no other purpose than this. Not sure it's worth the trouble
> though.

I think we should break up internal into various kinds of internal
depending on what kind of a thing we've actually got a pointer to. We
don't need more synonyms for bytea, IMO.

> Anyway, setting aside the security angle, it doesn't seem like there is
> anything wrong with the high-level design for polymorphic cases. I'm now
> thinking the problem I saw is just a garden variety implementation bug
> (or bugs). I'm still not very happy with the confusion around aggtype
> though, not least because I suspect it contributed directly to this bug.
> I also feel that both the code comments and the user-facing documentation
> for this feature are well below project standards, eg where's the
> discussion in section 35.10?

Good point. That needs to be addressed -- ideally by David, but if he
doesn't then I guess I'll have to do it. If you have other specific
comments, please send them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David G. Johnston 2016-06-17 18:37:47 Re: Experimental dynamic memory allocation of postgresql shared memory
Previous Message Aleksey Demakov 2016-06-17 18:34:42 Re: Experimental dynamic memory allocation of postgresql shared memory