From: "Pierre Barre" <pierre(at)barre(dot)sh>
To: "Vladimir Churyukin" <vladimir(at)churyukin(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
Date: 2025-07-26 07:51:15
Message-ID: 44dafe90-9ad6-41ae-b9fe-bea4aaf49a59@app.fastmail.com
Lists: pgsql-general
Ah, by "shared storage" I mean that each node can acquire exclusivity, not that they can both R/W to it at the same time.
> Some pretty well-known cases of storage / compute separation (Aurora, Neon) also share the storage between instances,
That model is cool, but as I was suggesting, I think it's more of a solution for outliers, not something that most projects would or should want.
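
For what it's worth, one way to get that kind of exclusivity at the ZFS layer is OpenZFS's multihost (MMP) feature; a minimal sketch (the pool name is illustrative, and this is just one possible mechanism, not something specific to ZeroFS):

```
# Multihost (MMP) makes a pool refuse a second concurrent import on
# another node (each node needs a unique hostid configured):
zpool set multihost=on tank

# On failover, the surviving node takes the pool over explicitly:
zpool export tank      # on the old primary, if it is still reachable
zpool import -f tank   # on the node being promoted (MMP refuses this
                       # while the pool is still actively imported elsewhere)
```
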
Best,
Pierre
On Sat, Jul 26, 2025, at 09:42, Vladimir Churyukin wrote:
> Sorry, I was referring to this:
>
> > But when PostgreSQL instances share storage rather than replicate:
> > - Consistency seems maintained (same data)
> > - Availability seems maintained (client can always promote an accessible node)
> > - Partitions between PostgreSQL nodes don't prevent the system from functioning
>
> Some pretty well-known cases of storage / compute separation (Aurora, Neon) also share the storage between instances,
> that's why I'm a bit confused by your reply. I thought you were thinking about this approach too; that's why I mentioned the kind of challenges one may run into on that path.
>
>
> On Sat, Jul 26, 2025 at 12:36 AM Pierre Barre <pierre(at)barre(dot)sh> wrote:
>>
>> What you describe doesn’t look like something very useful for the vast majority of projects that need a database. Why would you even want that if you can avoid it?
>>
>> If your “single node” can handle tens or hundreds of thousands of requests per second, while still having very durable and highly available storage, as well as fast recovery mechanisms, what’s the point?
>>
>> I am not trying to cater to extreme outliers that may want something very weird like this; those just aren’t the use cases I want to address, because I believe they are few and far between.
>>
>> Best,
>> Pierre
>>
>> On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
>>> Shared storage would require a lot of extra work. That's essentially what AWS Aurora does.
>>> You will have to have functionality to sync in-memory states between nodes, because all the instances will have cached data that can easily become stale on any write operation.
>>> That alone is not that simple. You would have to modify some of the locking logic, and most likely make a lot of other changes in a lot of places; Postgres just wasn't built with the assumption that the storage can be shared.
>>>
>>> -Vladimir
>>>
>>> On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pierre(at)barre(dot)sh> wrote:
>>>> Now, I'm trying to understand how CAP theorem applies here. Traditional PostgreSQL replication has clear CAP trade-offs - you choose between consistency and availability during partitions.
>>>>
>>>> But when PostgreSQL instances share storage rather than replicate:
>>>> - Consistency seems maintained (same data)
>>>> - Availability seems maintained (client can always promote an accessible node)
>>>> - Partitions between PostgreSQL nodes don't prevent the system from functioning
>>>>
>>>> It seems that CAP assumes specific implementation details (like nodes maintaining independent state) without explicitly stating them.
>>>>
>>>> How should we think about CAP theorem when distributed nodes share storage rather than coordinate state? Are the trade-offs simply moved to a different layer, or does shared storage fundamentally change the analysis?
>>>>
>>>> Client with awareness of both PostgreSQL nodes
>>>> | |
>>>> ↓ (partition here) ↓
>>>> PostgreSQL Primary PostgreSQL Standby
>>>> | |
>>>> └───────────┬───────────────────┘
>>>> ↓
>>>> Shared ZFS Pool
>>>> |
>>>> 6 Global ZeroFS instances
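>>>>
>>>> On the client side, that "awareness of both PostgreSQL nodes" doesn't need anything exotic; a rough sketch using plain libpq multi-host connection strings (host names and data directory are illustrative):
>>>>
>>>> ```
>>>> # libpq tries the hosts in order and, with target_session_attrs=read-write,
>>>> # skips any node that is not currently accepting writes:
>>>> psql "host=pg-primary,pg-standby port=5432 dbname=example target_session_attrs=read-write"
>>>>
>>>> # If the primary becomes unreachable, the standby can be promoted:
>>>> pg_ctl promote -D /var/lib/postgresql/16/main
>>>> ```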
>>>>
>>>> Best,
>>>> Pierre
>>>>
>>>> On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
>>>> > Hi Seref,
>>>> >
>>>> > For the benchmarks, I used Hetzner's cloud service with the following setup:
>>>> >
>>>> > - A Hetzner S3 bucket in the FSN1 region
>>>> > - A virtual machine of type ccx63 (48 vCPU, 192 GB memory)
>>>> > - 3 ZeroFS NBD devices (same S3 bucket)
>>>> > - A ZFS striped pool across the 3 devices
>>>> > - 200 GB of ZFS L2ARC
>>>> > - Postgres configured with memory settings sized for the machine, as well as synchronous_commit = off, wal_init_zero = off and wal_recycle = off (sketched below)
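>>>> >
>>>> > A rough sketch of applying those non-default settings (ALTER SYSTEM is one way to do it; the memory-related settings are omitted here because the exact values depend on the machine):
>>>> >
>>>> > ```
>>>> > psql -c "ALTER SYSTEM SET synchronous_commit = off;"
>>>> > psql -c "ALTER SYSTEM SET wal_init_zero = off;"
>>>> > psql -c "ALTER SYSTEM SET wal_recycle = off;"
>>>> > # These three take effect on reload; memory settings like shared_buffers need a restart.
>>>> > psql -c "SELECT pg_reload_conf();"
>>>> > ```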
>>>> >
>>>> > Best,
>>>> > Pierre
>>>> >
>>>> > On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>>>> >> Sorry, this was meant to go to the whole group:
>>>> >>
>>>> >> Very interesting! Great work. Can you clarify how exactly you're running Postgres in your tests? A specific AWS service? What's the test infrastructure that sits above the file system?
>>>> >>
>>>> >> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pierre(at)barre(dot)sh> wrote:
>>>> >>> Hi everyone,
>>>> >>>
>>>> >>> I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.
>>>> >>>
>>>> >>> ZeroFS: https://github.com/Barre/ZeroFS
>>>> >>>
>>>> >>> # The Architecture
>>>> >>>
>>>> >>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:
>>>> >>>
>>>> >>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>>> >>>
>>>> >>> By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies despite the underlying storage being in S3.
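>>>> >>>
>>>> >>> As a concrete sketch of how the pieces are glued together (nbd-client and OpenZFS are stock tools; the ports, device names and mount point are illustrative, and the ZeroFS server invocation itself is omitted, so check the ZeroFS README for the exact command line):
>>>> >>>
>>>> >>> ```
>>>> >>> # Attach the NBD exports served by ZeroFS (assumed here to listen on
>>>> >>> # localhost ports 10809-10811; exact nbd-client flags vary by version):
>>>> >>> nbd-client 127.0.0.1 10809 /dev/nbd0
>>>> >>> nbd-client 127.0.0.1 10810 /dev/nbd1
>>>> >>> nbd-client 127.0.0.1 10811 /dev/nbd2
>>>> >>>
>>>> >>> # Striped pool across the three devices, plus a local NVMe partition as L2ARC:
>>>> >>> zpool create tank /dev/nbd0 /dev/nbd1 /dev/nbd2
>>>> >>> zpool add tank cache /dev/nvme0n1p4
>>>> >>> zfs create -o mountpoint=/var/lib/postgresql tank/pgdata
>>>> >>> ```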
>>>> >>>
>>>> >>> ## Performance Results
>>>> >>>
>>>> >>> Here are pgbench results from PostgreSQL running on this setup:
>>>> >>>
>>>> >>> ### Read/Write Workload
>>>> >>>
>>>> >>> ```
>>>> >>> postgres(at)ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>>>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>>> >>> starting vacuum...end.
>>>> >>> transaction type: <builtin: TPC-B (sort of)>
>>>> >>> scaling factor: 50
>>>> >>> query mode: simple
>>>> >>> number of clients: 50
>>>> >>> number of threads: 15
>>>> >>> maximum number of tries: 1
>>>> >>> number of transactions per client: 100000
>>>> >>> number of transactions actually processed: 5000000/5000000
>>>> >>> number of failed transactions: 0 (0.000%)
>>>> >>> latency average = 0.943 ms
>>>> >>> initial connection time = 48.043 ms
>>>> >>> tps = 53041.006947 (without initial connection time)
>>>> >>> ```
>>>> >>>
>>>> >>> ### Read-Only Workload
>>>> >>>
>>>> >>> ```
>>>> >>> postgres(at)ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>>>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>>> >>> starting vacuum...end.
>>>> >>> transaction type: <builtin: select only>
>>>> >>> scaling factor: 50
>>>> >>> query mode: simple
>>>> >>> number of clients: 50
>>>> >>> number of threads: 15
>>>> >>> maximum number of tries: 1
>>>> >>> number of transactions per client: 100000
>>>> >>> number of transactions actually processed: 5000000/5000000
>>>> >>> number of failed transactions: 0 (0.000%)
>>>> >>> latency average = 0.121 ms
>>>> >>> initial connection time = 53.358 ms
>>>> >>> tps = 413436.248089 (without initial connection time)
>>>> >>> ```
>>>> >>>
>>>> >>> These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.
>>>> >>>
>>>> >>> ## How It Works
>>>> >>>
>>>> >>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
>>>> >>> 2. Multiple cache layers hide S3 latency:
>>>> >>> a. ZFS ARC/L2ARC for frequently accessed blocks
>>>> >>> b. ZeroFS memory cache for metadata and hot data
>>>> >>> c. Optional local disk cache
>>>> >>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>>>> >>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
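>>>> >>>
>>>> >>> Once the pool is running, a quick way to check that reads are actually being absorbed by the cache layers above rather than going out to S3 is to watch the ARC/L2ARC counters (tool names are from OpenZFS on Linux; the pool name is illustrative):
>>>> >>>
>>>> >>> ```
>>>> >>> zpool iostat -v tank 5     # per-vdev throughput, including the cache device
>>>> >>> arc_summary                # ARC and L2ARC sizes and hit ratios
>>>> >>> grep -E '^(hits|misses|l2_hits|l2_misses)' /proc/spl/kstat/zfs/arcstats
>>>> >>> ```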
>>>> >>>
>>>> >>> ## Geo-Distributed PostgreSQL
>>>> >>>
>>>> >>> Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.
>>>> >>>
>>>> >>> Example architectures:
>>>> >>>
>>>> >>> Architecture 1
>>>> >>>
>>>> >>>
>>>> >>> PostgreSQL Client
>>>> >>> |
>>>> >>> | SQL queries
>>>> >>> |
>>>> >>> +--------------+
>>>> >>> | PG Proxy |
>>>> >>> | (HAProxy/ |
>>>> >>> | PgBouncer) |
>>>> >>> +--------------+
>>>> >>> / \
>>>> >>> / \
>>>> >>> Synchronous Synchronous
>>>> >>> Replication Replication
>>>> >>> / \
>>>> >>> / \
>>>> >>> +---------------+ +---------------+
>>>> >>> | PostgreSQL 1 | | PostgreSQL 2 |
>>>> >>> | (Primary) |◄------►| (Standby) |
>>>> >>> +---------------+ +---------------+
>>>> >>> | |
>>>> >>> | POSIX filesystem ops |
>>>> >>> | |
>>>> >>> +---------------+ +---------------+
>>>> >>> | ZFS Pool 1 | | ZFS Pool 2 |
>>>> >>> | (3-way mirror)| | (3-way mirror)|
>>>> >>> +---------------+ +---------------+
>>>> >>> / | \ / | \
>>>> >>> / | \ / | \
>>>> >>> NBD:10809 NBD:10810 NBD:10811 NBD:10812 NBD:10813 NBD:10814
>>>> >>> | | | | | |
>>>> >>> +--------++--------++--------++--------++--------++--------+
>>>> >>> |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>>>> >>> +--------++--------++--------++--------++--------++--------+
>>>> >>> | | | | | |
>>>> >>> | | | | | |
>>>> >>> S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>>>> >>> (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)
>>>> >>>
>>>> >>> Architecture 2:
>>>> >>>
>>>> >>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>>>> >>> \ /
>>>> >>> \ /
>>>> >>> Same ZFS Pool (NBD)
>>>> >>> |
>>>> >>> 6 Global ZeroFS
>>>> >>> |
>>>> >>> S3 Regions
>>>> >>>
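>>>> >>>
>>>> >>> For Architecture 1, the replication layer itself is just stock PostgreSQL synchronous streaming replication; a minimal sketch (host, user and application names are illustrative):
>>>> >>>
>>>> >>> ```
>>>> >>> # On the primary: wait for the named standby before acknowledging commits
>>>> >>> # (synchronous_commit must be on or stronger here, unlike the
>>>> >>> # single-node benchmark above which used off):
>>>> >>> psql -c "ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (pg2)';"
>>>> >>> psql -c "SELECT pg_reload_conf();"
>>>> >>>
>>>> >>> # On the standby: clone from the primary and start streaming.
>>>> >>> # -R writes primary_conninfo and standby.signal; the standby must connect
>>>> >>> # with application_name=pg2 for the setting above to match it.
>>>> >>> pg_basebackup -h pg1.example.internal -U replicator -D "$PGDATA" -R -X stream
>>>> >>> ```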
>>>> >>>
>>>> >>> The main advantages I see are:
>>>> >>> 1. Dramatic cost reduction for large datasets
>>>> >>> 2. Simplified geo-distribution
>>>> >>> 3. Effectively unlimited storage capacity
>>>> >>> 4. Built-in encryption and compression
>>>> >>>
>>>> >>> Looking forward to your feedback and questions!
>>>> >>>
>>>> >>> Best,
>>>> >>> Pierre
>>>> >>>
>>>> >>> P.S. The full project includes a custom NFS filesystem too.
>>>> >>>
>>>> >
>>>>
>>