RE: I'd like to discuss scaleout at PGCon

From: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To: 'Sumanta Mukherjee' <sumanta(dot)mukherjee(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "maumau307(at)gmail(dot)com" <maumau307(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: I'd like to discuss scaleout at PGCon
Date: 2020-06-18 04:38:06
Message-ID: TYAPR01MB299054B6A05F102AEC25F1DBFE9B0@TYAPR01MB2990.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Hello,

It seems you didn't include pgsql-hackers.

From: Sumanta Mukherjee <sumanta(dot)mukherjee(at)enterprisedb(dot)com>
> I saw the presentation and it is great, except that it is unclear for both SD and SN whether storage and compute are explicitly separated. Separation of storage and compute would have some cost advantages, as per my understanding. The following two works (refs below) have some information about the usefulness of this technique for scale-out, so it would be interesting to see whether the SN architecture being proposed could be modified to accommodate this and reap the gain.

Thanks. Separation of compute and storage certainly deserves consideration. Unlike the old days, when shared storage was considered a bottleneck because of slow HDDs and FC-SAN, we can now expect high-speed shared storage thanks to flash memory, NVMe-oF, and RDMA.

> 1. Philip A. Bernstein, Colin W. Reid, and Sudipto Das. 2011. Hyder - A
> Transactional Record Manager for Shared Flash. In CIDR 2011.

This is interesting. I'll look into it. Do you know whether there is any product based on Hyder? OTOH, Hyder seems to require drastic changes to adopt in Postgres -- OCC, a log-structured database, etc. I'd like to hear how feasible those are. However, its ability to scale out without data or application partitioning is appealing.

To explore another possibility that would have more affinity with the current Postgres, let me introduce our proprietary product called Symfoware. It's not based on Postgres.

It has shared-nothing scale-out functionality with full ACID based on 2PC, conventional 2PL locking, and distributed deadlock resolution. Although the architecture is shared-nothing, all the database files and transaction logs are stored on shared storage.

The database is divided into "log groups." Each log group has one transaction log and multiple tablespaces (Symfoware calls them "database spaces" rather than tablespaces).

Each DB instance in the cluster owns multiple log groups, and handles reads/writes to the data in the log groups it owns. When a DB instance fails, the surviving DB instances take over the log groups of the failed instance, recover the data using each log group's transaction log, and then accept reads/writes to the data in those log groups. The DBA configures which DB instance initially owns which log groups and which DB instances are candidates to take over which log groups.
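To make the ownership and takeover rules above concrete, here is a minimal sketch in Python. It is only an illustration of the description in this mail, not Symfoware's actual design or API: the names `LogGroup`, `Cluster`, `fail`, and `owned_by` are hypothetical, and treating the candidate list as a preference order is my assumption (the description only says the DBA configures the candidates). Log replay is reduced to a comment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LogGroup:
    """One transaction log plus its tablespaces (hypothetical model)."""
    name: str
    owner: Optional[str]     # instance currently serving reads/writes
    candidates: List[str]    # DBA-configured takeover candidates
                             # (assumed here to be in preference order)


class Cluster:
    def __init__(self, instances: List[str], log_groups: List[LogGroup]):
        self.alive = set(instances)
        self.log_groups = {g.name: g for g in log_groups}

    def owned_by(self, instance: str) -> List[str]:
        """Names of the log groups the given instance currently owns."""
        return sorted(g.name for g in self.log_groups.values()
                      if g.owner == instance)

    def fail(self, instance: str) -> None:
        """Mark an instance as failed and hand its log groups to survivors."""
        self.alive.discard(instance)
        for g in self.log_groups.values():
            if g.owner != instance:
                continue
            # Pick the first configured candidate that is still alive.
            successor = next((c for c in g.candidates if c in self.alive), None)
            # A real system would replay the log group's transaction log
            # from shared storage here, before accepting reads/writes.
            g.owner = successor  # None means the log group is unavailable


# Every instance owns log groups, so none sits idle as a standby.
cluster = Cluster(
    instances=["db1", "db2", "db3"],
    log_groups=[
        LogGroup("lg_a", owner="db1", candidates=["db2", "db3"]),
        LogGroup("lg_b", owner="db1", candidates=["db3", "db2"]),
        LogGroup("lg_c", owner="db2", candidates=["db3"]),
        LogGroup("lg_d", owner="db3", candidates=["db1"]),
    ],
)
cluster.fail("db1")
print(cluster.owned_by("db2"))
print(cluster.owned_by("db3"))
```

In this example the two log groups of the failed db1 are spread across db2 and db3 rather than landing on a single standby, which is the point of giving each log group its own candidate list.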

This way, no server is idle as a standby. All DB instances work hard to process read-write transactions. This "no idle server for HA" is one of the things Oracle RAC users want in terms of cost.

However, unlike Hyder, it still requires data and application partitioning. Can anyone think of a way to eliminate partitioning? Data and application partitioning is what Oracle RAC users want to avoid or cannot tolerate.

Ref: Introduction of the Symfoware shared nothing scale-out called "load share."
https://pdfs.semanticscholar.org/8b60/163593931cebc58e9f637cfb501500230adc.pdf

Regards
Takayuki Tsunakawa

--- below is Sumanta's original mail ---
From: Sumanta Mukherjee <sumanta(dot)mukherjee(at)enterprisedb(dot)com>
Sent: Wednesday, June 17, 2020 5:34 PM
To: Tsunakawa, Takayuki/綱川 貴之 <tsunakawa(dot)takay(at)fujitsu(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>; Merlin Moncure <mmoncure(at)gmail(dot)com>; Robert Haas <robertmhaas(at)gmail(dot)com>; maumau307(at)gmail(dot)com
Subject: Re: I'd like to discuss scaleout at PGCon

Hello,

I saw the presentation and it is great, except that it is unclear for both SD and SN whether storage and compute are explicitly separated. Separation of storage and compute would have some cost advantages, as per my understanding. The following two works (refs below) have some information about the usefulness of this technique for scale-out, so it would be interesting to see whether the SN architecture being proposed could be modified to accommodate this and reap the gain.

1. Philip A. Bernstein, Colin W. Reid, and Sudipto Das. 2011. Hyder - A
Transactional Record Manager for Shared Flash. In CIDR 2011.

2. Dhruba Borthakur. 2017. The Birth of RocksDB-Cloud.
http://rocksdb.blogspot.com/2017/05/the-birth-of-rocksdb-cloud.html

With Regards,
Sumanta Mukherjee.
EnterpriseDB: http://www.enterprisedb.com
