From: Vladimir Churyukin <vladimir(at)churyukin(dot)com>
To: Pierre Barre <pierre(at)barre(dot)sh>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
Date: 2025-07-26 07:42:54
Message-ID: CAFSGpE2xzAz4zefZa8sQLkNajp0hT7LiONQDGSAxigwGG3ii8w@mail.gmail.com
Lists: pgsql-general
Sorry, I was referring to this:
> But when PostgreSQL instances share storage rather than replicate:
> - Consistency seems maintained (same data)
> - Availability seems maintained (client can always promote an accessible node)
> - Partitions between PostgreSQL nodes don't prevent the system from functioning
Some pretty well-known storage/compute-separation systems (Aurora, Neon)
also share the storage between instances, which is why I'm a bit confused
by your reply. I thought you were thinking about this approach too, which
is why I mentioned the kinds of challenges one may run into on that path.
On Sat, Jul 26, 2025 at 12:36 AM Pierre Barre <pierre(at)barre(dot)sh> wrote:
> What you describe doesn’t look like something very useful for the vast
> majority of projects that need a database. Why would you even want that if
> you can avoid it?
>
> If your “single node” can handle tens or hundreds of thousands of requests per
> second, still has very durable and highly available storage, as well as
> fast recovery mechanisms, what’s the point?
>
> I am not trying to cater to extreme outliers that may want something very weird
> like this; that’s just not the set of use cases I want to address, because I
> believe they are few and far between.
>
> Best,
> Pierre
>
> On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
>
> Shared storage would require a lot of extra work. That's essentially
> what AWS Aurora does.
> You will have to have functionality to sync in-memory state between
> nodes, because all the instances will have cached data that can easily
> become stale on any write operation.
> That alone is not that simple. You will have to modify some of the locking
> logic, and most likely make a lot of other changes in a lot of places;
> Postgres simply was not built with the assumption that the storage can be shared.
>
> -Vladimir
>
> On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pierre(at)barre(dot)sh> wrote:
>
> Now, I'm trying to understand how CAP theorem applies here. Traditional
> PostgreSQL replication has clear CAP trade-offs - you choose between
> consistency and availability during partitions.
>
> But when PostgreSQL instances share storage rather than replicate:
> - Consistency seems maintained (same data)
> - Availability seems maintained (client can always promote an accessible
> node)
> - Partitions between PostgreSQL nodes don't prevent the system from
> functioning
>
> It seems that CAP assumes specific implementation details (like nodes
> maintaining independent state) without explicitly stating them.
>
> How should we think about CAP theorem when distributed nodes share storage
> rather than coordinate state? Are the trade-offs simply moved to a
> different layer, or does shared storage fundamentally change the analysis?
>
>   Client with awareness of both PostgreSQL nodes
>          |                            |
>          ↓ (partition here)           ↓
>   PostgreSQL Primary          PostgreSQL Standby
>          |                            |
>          └─────────────┬──────────────┘
>                        ↓
>                Shared ZFS Pool
>                        |
>           6 Global ZeroFS instances
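>
> For the "promote an accessible node" step, the mechanics are just standard
> PostgreSQL promotion; a minimal client-side sketch (hostnames are
> placeholders, and pg_promote() needs PostgreSQL 12+):
>
> ```
> # If the primary stops answering, promote the still-reachable standby
> pg_isready -h pg-primary.example.net -t 3 || \
>   psql -h pg-standby.example.net -U postgres -c "SELECT pg_promote();"
> ```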
>
> Best,
> Pierre
>
> On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
> > Hi Seref,
> >
> > For the benchmarks, I used Hetzner's cloud service with the following
> > setup:
> >
> > - A Hetzner s3 bucket in the FSN1 region
> > - A virtual machine of type ccx63 48 vCPU 192 GB memory
> > - 3 ZeroFS nbd devices (same s3 bucket)
> > - A ZFS striped pool with the 3 devices
> > - 200GB zfs L2ARC
> > - Postgres configured accordingly memory-wise, as well as with
> > synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
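> >
> > In postgresql.conf terms, the non-default settings above boil down to
> > something like this (the memory sizing line is only an illustration for
> > the 192 GB machine, not a value from the run):
> >
> > ```
> > # WAL / commit settings used for the benchmark
> > synchronous_commit = off
> > wal_init_zero = off
> > wal_recycle = off
> >
> > # "configured accordingly memory-wise" - illustrative only
> > shared_buffers = '48GB'
> > ```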
> >
> > Best,
> > Pierre
> >
> > On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
> >> Sorry, this was meant to go to the whole group:
> >>
> >> Very interesting! Great work. Can you clarify how exactly you're
> >> running postgres in your tests? A specific AWS service? What's the test
> >> infrastructure that sits above the file system?
> >>
> >> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pierre(at)barre(dot)sh> wrote:
> >>> Hi everyone,
> >>>
> >>> I wanted to share a project I've been working on that enables
> >>> PostgreSQL to run on S3 storage while maintaining performance comparable to
> >>> local NVMe. The approach uses block-level access rather than trying to map
> >>> filesystem operations to S3 objects.
> >>>
> >>> ZeroFS: https://github.com/Barre/ZeroFS
> >>>
> >>> # The Architecture
> >>>
> >>> ZeroFS provides NBD (Network Block Device) servers that expose S3
> >>> storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built
> >>> on these block devices:
> >>>
> >>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
> >>>
> >>> By providing block-level access and leveraging ZFS's caching
> >>> capabilities (L2ARC), we can achieve microsecond latencies despite the
> >>> underlying storage being in S3.
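> >>>
> >>> As a sketch of how the stack can be assembled with stock tools (device
> >>> names, ports and the cache disk below are illustrative, not prescriptive):
> >>>
> >>> ```
> >>> # Attach the NBD devices exported by a running ZeroFS instance
> >>> nbd-client 127.0.0.1 10809 /dev/nbd0
> >>> nbd-client 127.0.0.1 10810 /dev/nbd1
> >>> nbd-client 127.0.0.1 10811 /dev/nbd2
> >>>
> >>> # Striped ZFS pool over the NBD devices, plus a local disk as L2ARC
> >>> zpool create pgpool /dev/nbd0 /dev/nbd1 /dev/nbd2
> >>> zpool add pgpool cache /dev/nvme0n1
> >>>
> >>> # PostgreSQL then uses a dataset on that pool as its data directory
> >>> zfs create -o mountpoint=/var/lib/postgresql pgpool/pgdata
> >>> ```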
> >>>
> >>> ## Performance Results
> >>>
> >>> Here are pgbench results from PostgreSQL running on this setup:
> >>>
> >>> ### Read/Write Workload
> >>>
> >>> ```
> >>> postgres(at)ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
> >>> starting vacuum...end.
> >>> transaction type: <builtin: TPC-B (sort of)>
> >>> scaling factor: 50
> >>> query mode: simple
> >>> number of clients: 50
> >>> number of threads: 15
> >>> maximum number of tries: 1
> >>> number of transactions per client: 100000
> >>> number of transactions actually processed: 5000000/5000000
> >>> number of failed transactions: 0 (0.000%)
> >>> latency average = 0.943 ms
> >>> initial connection time = 48.043 ms
> >>> tps = 53041.006947 (without initial connection time)
> >>> ```
> >>>
> >>> ### Read-Only Workload
> >>>
> >>> ```
> >>> postgres(at)ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
> >>> starting vacuum...end.
> >>> transaction type: <builtin: select only>
> >>> scaling factor: 50
> >>> query mode: simple
> >>> number of clients: 50
> >>> number of threads: 15
> >>> maximum number of tries: 1
> >>> number of transactions per client: 100000
> >>> number of transactions actually processed: 5000000/5000000
> >>> number of failed transactions: 0 (0.000%)
> >>> latency average = 0.121 ms
> >>> initial connection time = 53.358 ms
> >>> tps = 413436.248089 (without initial connection time)
> >>> ```
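> >>>
> >>> (Both runs use the standard pgbench schema; scaling factor 50 corresponds
> >>> to an initialization along the lines of `pgbench -i -s 50 example`.)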
> >>>
> >>> These numbers are with 50 concurrent clients and the actual data
> >>> stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches,
> >>> while cold data comes from S3.
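> >>>
> >>> To see how much traffic the cache layers absorb versus what falls through
> >>> to the NBD/S3 devices, the usual ZFS tooling applies, e.g. (pool name as
> >>> in the sketch above):
> >>>
> >>> ```
> >>> # Per-vdev and cache-device I/O, refreshed every 5 seconds
> >>> zpool iostat -v pgpool 5
> >>> ```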
> >>>
> >>> ## How It Works
> >>>
> >>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS
> >>> can use like any other block device
> >>> 2. Multiple cache layers hide S3 latency:
> >>>    a. ZFS ARC/L2ARC for frequently accessed blocks
> >>>    b. ZeroFS memory cache for metadata and hot data
> >>>    c. Optional local disk cache
> >>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
> >>> 4. Files are split into 128KB chunks for insertion into ZeroFS's
> >>> LSM-tree
> >>>
> >>> ## Geo-Distributed PostgreSQL
> >>>
> >>> Since each region can run its own ZeroFS instance, you can create
> >>> geographically distributed PostgreSQL setups.
> >>>
> >>> Example architectures:
> >>>
> >>> Architecture 1
> >>>
> >>>
> >>>                      PostgreSQL Client
> >>>                             |
> >>>                             | SQL queries
> >>>                             |
> >>>                      +--------------+
> >>>                      |   PG Proxy   |
> >>>                      |  (HAProxy/   |
> >>>                      |  PgBouncer)  |
> >>>                      +--------------+
> >>>                        /          \
> >>>                       /            \
> >>>             Synchronous            Synchronous
> >>>             Replication            Replication
> >>>                     /                  \
> >>>                    /                    \
> >>>       +---------------+          +---------------+
> >>>       | PostgreSQL 1  |          | PostgreSQL 2  |
> >>>       |   (Primary)   |◄--------►|   (Standby)   |
> >>>       +---------------+          +---------------+
> >>>               |                          |
> >>>               |   POSIX filesystem ops   |
> >>>               |                          |
> >>>       +---------------+          +---------------+
> >>>       |  ZFS Pool 1   |          |  ZFS Pool 2   |
> >>>       | (3-way mirror)|          | (3-way mirror)|
> >>>       +---------------+          +---------------+
> >>>         /     |     \              /     |     \
> >>>        /      |      \            /      |      \
> >>>   NBD:10809 NBD:10810 NBD:10811 NBD:10812 NBD:10813 NBD:10814
> >>>       |        |        |         |         |         |
> >>>   +--------++--------++--------++--------++--------++--------+
> >>>   |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
> >>>   +--------++--------++--------++--------++--------++--------+
> >>>       |        |        |         |         |         |
> >>>   S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
> >>>   (us-east)  (eu-west) (ap-south)  (us-west) (eu-north)  (ap-east)
> >>>
> >>> Architecture 2:
> >>>
> >>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
> >>>               \                          /
> >>>                \                        /
> >>>                  Same ZFS Pool (NBD)
> >>>                          |
> >>>                   6 Global ZeroFS
> >>>                          |
> >>>                      S3 Regions
> >>>
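> >>> The synchronous replication leg in Architecture 1 is plain PostgreSQL
> >>> streaming replication; a minimal sketch of the relevant settings (host
> >>> and application names are placeholders):
> >>>
> >>> ```
> >>> # Primary (postgresql.conf)
> >>> synchronous_standby_names = 'pg2'
> >>>
> >>> # Standby (postgresql.conf)
> >>> primary_conninfo = 'host=pg1.example.net user=replicator application_name=pg2'
> >>> ```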
> >>>
> >>> The main advantages I see are:
> >>> 1. Dramatic cost reduction for large datasets
> >>> 2. Simplified geo-distribution
> >>> 3. Effectively unlimited storage capacity
> >>> 4. Built-in encryption and compression
> >>>
> >>> Looking forward to your feedback and questions!
> >>>
> >>> Best,
> >>> Pierre
> >>>
> >>> P.S. The full project includes a custom NFS filesystem too.
> >>>
> >
>
>
>