From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: I'd like to discuss scaleout at PGCon
Date: 2018-06-01 14:44:43
Message-ID: CAFjFpRcYYvjqF92c=dL5W2t-Viy7CUyyCqfBznVK2tB2BeDVxQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, May 31, 2018 at 11:00 PM, MauMau <maumau307(at)gmail(dot)com> wrote:
> 2018-05-31 22:44 GMT+09:00, Robert Haas <robertmhaas(at)gmail(dot)com>:
>> On Thu, May 31, 2018 at 8:12 AM, MauMau <maumau307(at)gmail(dot)com> wrote:
>>> Oh, I didn't know you support FDW approach mainly for analytics. I
>>> guessed the first target was OLTP read-write scalability.
>>
>> That seems like a harder target to me, because you will have an extra
>> hop involved -- SQL from the client to the first server, then via SQL
>> to a second server. The work of parsing and planning also has to be
>> done twice, once for the foreign table and again for the table. For
>> longer-running queries this overhead doesn't matter as much, but for
>> short-running queries it is significant.
>
> Yes, that extra hop and double parsing/planning were the killer for
> our performance goal when we tried to meet our customer's scaleout
> needs with XL. The application executes 82 DML statements in one
> transaction. Those DMLs consist of INSERTs, UPDATEs and SELECTs that
> each access only one row via a primary key. The target tables are
> only a few, so the application PREPAREs a few statements and EXECUTEs
> them repeatedly. We placed the coordinator node of XL on the same
> host as the application, and the data nodes and GTM on other
> individual hosts.
>
>

I agree that double parsing happens, but I am hesitant to agree with
the double planning claim. We do plan, say, a join between two foreign
tables on the local server, but only to decide whether it is more
efficient to join locally or on the foreign server. That means we
create foreign paths for scans on the foreign tables (perhaps as many
parameterized paths as there are join conditions) and one path for the
join push-down, and that's it. We also create local join paths, but we
need those to decide whether joining locally is efficient and, if so,
which way. We don't create paths describing how the foreign server
would plan that join. That's not double planning, since we do not
create the same paths both locally and on the foreign server.

In order to avoid double parsing, we might want to find a way to pass
a "normalized" parse tree down to the foreign server. We would need to
normalize the OIDs in the parse tree, since those may differ across
the nodes.
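To make the normalization idea concrete, here is a minimal sketch in
Python. The "parse tree" is just a nested dict standing in for a real
one, and the catalog lookups are illustrative assumptions, not
PostgreSQL APIs: each node-local OID is swapped for a schema-qualified
relation name before shipping, and mapped back on the remote side.

```python
# Hypothetical sketch: OIDs are node-local, so replace each one with a
# schema-qualified relation name that both nodes can resolve.
LOCAL_CATALOG = {16384: ("public", "orders")}     # oid -> (schema, relname)
REMOTE_CATALOG = {("public", "orders"): 24576}    # (schema, relname) -> oid

def normalize(node):
    """Walk the tree, replacing node-local OIDs with portable names."""
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if key == "relid":
                out["relname"] = LOCAL_CATALOG[value]   # drop the local OID
            else:
                out[key] = normalize(value)
        return out
    if isinstance(node, list):
        return [normalize(v) for v in node]
    return node

def denormalize(node):
    """On the remote side, map portable names back to local OIDs."""
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if key == "relname":
                out["relid"] = REMOTE_CATALOG[tuple(value)]
            else:
                out[key] = denormalize(value)
        return out
    if isinstance(node, list):
        return [denormalize(v) for v in node]
    return node

parse_tree = {"cmd": "SELECT", "rtable": [{"relid": 16384, "alias": "o"}]}
shipped = normalize(parse_tree)      # safe to send to the foreign server
remote = denormalize(shipped)        # resolved against the remote catalog
```

A real implementation would of course have to handle functions,
operators, types and collations the same way, not just relations.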

>
>
>> I don't know what "node management" and "failure detection/failover"
>> mean specifically. I'd like to hear proposals, though.
>
> That's nothing special or new. Things like:

That's a good summary of what we need here. Thanks for the summary.

>
> * Define a set of nodes that can join the cluster.
> * Initialize or configure a node according to its role in the cluster.
> * Decommission a node from the cluster.
> * Define a node group in which all member nodes have the same data set
> for redundancy.
> * One command to start and shutdown the entire cluster.

Right.

> * System tables to display the member nodes and node groups.

I think we need system tables on each node that store the cluster
configuration as seen by that node, not just display it. But that's a
bit of a detail.
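For illustration, a per-node view of cluster membership might look
like the sketch below, loosely modeled on XL's pgxc_node catalog. All
field names are assumptions for the sketch; the point is that each
node stores its own copy, so two nodes' views can disagree and need
reconciling.

```python
# Hypothetical rows of a per-node cluster-membership table.
from dataclasses import dataclass

@dataclass
class ClusterNode:
    name: str
    role: str      # "coordinator" or "datanode"
    host: str
    port: int
    group: str     # redundancy group whose members hold the same data set

# The configuration "as seen by" one particular node:
local_view = [
    ClusterNode("coord1", "coordinator", "10.0.0.1", 5432, "g0"),
    ClusterNode("data1",  "datanode",    "10.0.0.2", 5432, "g1"),
    ClusterNode("data2",  "datanode",    "10.0.0.3", 5432, "g1"),
]

datanodes = [n.name for n in local_view if n.role == "datanode"]
```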

> * Each node's in-memory view of the current cluster state.
> * How each node monitors which other nodes.

There are multiple ways to do this: either each node monitors the
other nodes, or we have some kind of watchdog or central place (not
necessarily a node; a file shared across the nodes could serve that
purpose as well) from which we can fetch the "last known" state of the
cluster. But that, together with the split-brain problem, is a much
larger problem to solve. I agree we need some way to know the "last
known" status of the cluster and then correct it as nodes report
conflicting observations, and to do all of that without much
communication overhead.
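A minimal sketch of the "last known state" idea: each node records the
last heartbeat it saw from every peer, and a peer is only ever
"suspect", never definitively down, once its lease expires. The class
name and the lease threshold are illustrative assumptions, not an
actual PostgreSQL mechanism.

```python
# Track per-peer heartbeats and report only a "last known" status.
import time

LEASE_SECONDS = 5.0   # assumed grace period before suspecting a peer

class ClusterView:
    def __init__(self, peers):
        self.last_seen = {p: None for p in peers}   # last heartbeat per peer

    def heartbeat(self, peer, now=None):
        self.last_seen[peer] = time.monotonic() if now is None else now

    def status(self, peer, now=None):
        """Return the last-known status; it is never authoritative."""
        now = time.monotonic() if now is None else now
        seen = self.last_seen[peer]
        if seen is None:
            return "unknown"                         # never heard from
        return "up" if now - seen <= LEASE_SECONDS else "suspect"

view = ClusterView(["node1", "node2"])
view.heartbeat("node1", now=100.0)
print(view.status("node1", now=102.0))   # -> up
print(view.status("node1", now=110.0))   # -> suspect (lease expired)
print(view.status("node2", now=110.0))   # -> unknown
```

Note that "suspect" deliberately stops short of declaring a node dead;
resolving that safely is exactly the split-brain problem mentioned
above.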

> * Elect a new primary node within a node group when the current
> primary node fails.

Well, some configurations might want more than one primary or
coordinator node. Having a single primary is itself an SPOF, which
should be avoided. But then there will be some workloads that are
happy with a single primary and the risks arising out of that.

> * Whether each node group should be configured with a master-slaves
> replication topology, or a multi-master topology like MySQL Group
> Replication

Instead of a master-slave configuration, we might want to use logical
replication or some such method to create replicas of tables on
multiple nodes in a cluster and let the optimizer take advantage of
that for join push-down or load balancing.

>
> Some of the above may end up with XL's things like
> pgxc_node/pgxc_group system tables, pgxc_ctl command, CREATE/DROP
> NODE/NODE GROUP commands, etc.
>

I agree.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
