Re: Combining Aggregates

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila(at)enterprisedb(dot)com>
Subject: Re: Combining Aggregates
Date: 2016-04-05 19:25:51
Message-ID: CA+TgmoaOMHfMUWNHMad3skUhFGBT7PAoVrx0xau6Qq3=SRZQTw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 5, 2016 at 2:52 PM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> Robert Haas wrote:
>> Now, let's suppose that the user sets up a sharded table and then
>> says: SELECT a, SUM(b), AVG(c) FROM sometab. At this point, what we'd
>> like to have happen is that for each child foreign table, we go and
>> fetch partially aggregated results. Those children might be running
>> any version of PostgreSQL - I was not assuming that we'd insist on
>> matching major versions, although of course that could be done - and
>> there would probably need to be a minimum version of PostgreSQL
>> anyway. They could even be running some other database. As long as
>> they can spit out partial aggregates in a format that we can
>> understand, we can deserialize those aggregates and run combine
>> functions on them. But if the remote side is, say, MariaDB, it's
>> probably much easier to get it to spit out something that looks like a
>> PostgreSQL array than it is to make it spit out some bytea blob that's
>> in an entirely PostgreSQL-specific format.
>
> Basing parts of the Postgres sharding mechanism on FDWs sounds
> acceptable. Trying to design things so that *any* FDW can be part of a
> shard, so that you have some shards in Postgres and other shards in
> MariaDB, seems ludicrous to me. Down that path lies madness.
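
A rough sketch of the combine step described in the quoted text above — illustrative Python, not PostgreSQL code, with a transparent (count, sum) pair standing in for whatever serialized partial state the shards would actually agree to emit:

```python
# Sketch: combining partially aggregated results from shards for
# SELECT a, SUM(b), AVG(c) FROM sometab GROUP BY a.
# The (count, sum) representation for AVG is an illustrative stand-in
# for a transparent format (e.g. a two-element array) that any remote
# system could produce; it is not PostgreSQL's internal serialization.

from collections import defaultdict

def combine_shard_results(shard_rows):
    """Merge per-shard partial states. Each shard row is
    (a, sum_b, (count_c, sum_c))."""
    groups = defaultdict(lambda: [0, 0, 0])  # a -> [sum_b, count_c, sum_c]
    for a, sum_b, (count_c, sum_c) in shard_rows:
        st = groups[a]
        st[0] += sum_b      # combine function for SUM is addition
        st[1] += count_c    # combine function for AVG adds counts...
        st[2] += sum_c      # ...and sums; the finalize step divides
    return {a: (sum_b, sum_c / count_c)
            for a, (sum_b, count_c, sum_c) in groups.items()}

# Two shards report partial states for the same group a=1:
shard1 = [(1, 10, (2, 8.0))]   # SUM(b)=10, AVG(c) state: 2 rows, sum 8.0
shard2 = [(1, 5,  (3, 4.0))]   # SUM(b)=5,  AVG(c) state: 3 rows, sum 4.0
print(combine_shard_results(shard1 + shard2))  # {1: (15, 2.4)}
```

The point of the transparent representation is exactly the one made above: any backend that can emit a pair of numbers per group can participate, without knowing anything about PostgreSQL's internal aggregate state layout.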

I'm doubtful that anyone wants to do the work to make that happen, but
I don't agree that we shouldn't care about whether it's possible.
Extensibility is a feature of the system that we've worked hard for,
and we shouldn't casually surrender it. For example, postgres_fdw now
implements join pushdown, and I suspect few other FDW authors will
care to do the work to add similar support to their implementations.
But some may, and it's good that the code is structured in such a way
that they have the option.

Actually, though, MariaDB is a bad example. What somebody's much more
likely to want to do is have PostgreSQL as a frontend accessing data
that's actually stored in Hadoop. There are lots of SQL interfaces to
Hadoop already, so it's clearly a thing people want, and our SQL is
the best SQL (right?) so if you could put that on top of Hadoop
somebody'd probably like it. I'm not planning to try it myself, but I
think it would be cool if somebody else did. I have been very pleased
to see that many of the bits and pieces of infrastructure that I
created for parallel query (parallel background workers, DSM, shm_mq)
have attracted quite a bit of interest from other developers for
totally unrelated purposes, and I think anything we do around
horizontal scalability should be designed the same way: the goal
should be to work with PostgreSQL on the other side, but the bits that
can be made reusable for other purposes should be so constructed.

> In fact, trying to ensure cross-major-version compatibility already
> sounds like asking for too much. Requiring matching major versions
> sounds like a perfectly acceptable restriction to me.

I disagree. One of the motivations (not the only one, by any means)
for wanting logical replication in PostgreSQL is precisely to get
around the fact that physical replication requires matching major
versions. That restriction turns out to be fairly onerous, not least
because when you've got a cluster of several machines you'd prefer
to upgrade them one at a time rather than all at once. That's going to
be even more true with a sharded cluster, which will probably tend to
involve more machines than a replication cluster, maybe a lot more.
If you say that the user has got to shut the entire thing down,
upgrade all the machines, and turn it all back on again, and just hope
it works, that's going to be really painful. I think that we should
treat this more like we do with libpq, where each major release can
add new capabilities that new applications can use, but the old stuff
has got to keep working essentially forever. Maybe the requirements
here are not quite so tight, because it's probably reasonable to say,
e.g. that you must upgrade every machine in the cluster to at least
release 11.1 before upgrading any machine to 11.3 or higher, but the
fewer such requirements we have, the better. Getting people to
upgrade to new major releases is already fairly hard, and anything
that makes it harder is an unwelcome development from my point of
view.
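
A minimal sketch of the kind of version negotiation this libpq-like policy implies (hypothetical names throughout; nothing here is an existing PostgreSQL API): the coordinator asks each shard which partial-aggregate format versions it understands and picks the newest one both sides support, so a mixed-version cluster keeps working through a rolling upgrade.

```python
# Hypothetical sketch of libpq-style capability negotiation between a
# coordinator and a shard: pick the newest partial-aggregate format
# version both sides support, rather than requiring matching majors.

COORDINATOR_SUPPORTED = {1, 2, 3}   # format versions this release can read

def negotiate_format(shard_supported):
    """Return the newest mutually supported format version,
    or None if there is no overlap (shard too old or too new)."""
    common = COORDINATOR_SUPPORTED & set(shard_supported)
    return max(common) if common else None

# A cluster mid-upgrade: an older shard, a newer shard, and one
# outside the supported window.
print(negotiate_format({1, 2}))     # 2    (older shard: fall back)
print(negotiate_format({2, 3, 4}))  # 3    (newer shard: cap at ours)
print(negotiate_format({7}))        # None (no overlap; upgrade required)
```

Under this scheme the only hard rule is the one sketched in the paragraph above: every machine must stay inside some mutually supported window, not on an identical major version.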

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
