Trying out libarchive for reading user-generated WAL tarballs

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Trying out libarchive for reading user-generated WAL tarballs
Date: 2026-04-05 02:42:59
Message-ID: CA+hUKGJYThZZp0AfvEbzNX_ZQ22pTpcjtgT0J_Pb+HAGH=QChw@mail.gmail.com
Lists: pgsql-hackers

Hi,

Here's an experimental patch that gives you optional extra tar (and
potentially zip etc) support if compiled --with-libarchive, but only
for pg_waldump where we expect to meet user-generated archives. The
recent band-aid applied to pg_waldump/t/001_basic.pl becomes:

+# If we don't have libarchive, then we tell tar to stick to ustar format that
+# astreamer_tar.c can decode. Otherwise we should be able to accept anything
+# that any current tar produces.
+@tar_p_flags = tar_portability_options($tar)
+ if !check_pg_config("#define USE_LIBARCHIVE");

I was compelled to try this to avoid being sucked into the rabbit hole
of hacking on tar code, after pg_waldump broke my computer[1]. It
doesn't seem to make much sense to try to speedrun everything that
happened to archiving since 1988 when you're a database project. I
was encouraged by Robert's prediction[2] that we'd probably want to do
precisely this as soon as we started accepting user-generated
archives. I postdict the same!

libarchive is really easy to work with, widely used and seems well put
together. The only thing I was a bit sad about was the lack of an
async-friendly API that would let us push a raw byte stream into it.
So I tried modelling it as a "source only" astreamer that you pump by
calling astreamer_pull() when you want more content to be delivered to
the next streamer.
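
To make the "source only" idea concrete, here is a minimal sketch in plain C. It is not the patch's code and uses neither libarchive nor the real astreamer API: demo_sink, pull_source and source_pull() are hypothetical names invented for illustration, and the in-memory buffer stands in for whatever the archive library would decode. The point is only the control flow, where the caller asks for more content and the source delivers one chunk to the next stage per call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical downstream stage: just accumulates what it is given. */
typedef struct demo_sink
{
	char		buf[256];
	size_t		len;
} demo_sink;

void
demo_sink_content(demo_sink *s, const char *data, size_t len)
{
	/* Toy example: silently truncate rather than grow the buffer. */
	if (len > sizeof(s->buf) - s->len)
		len = sizeof(s->buf) - s->len;
	memcpy(s->buf + s->len, data, len);
	s->len += len;
}

/* Hypothetical pull-style source: owns its input, pushes to "next". */
typedef struct pull_source
{
	const char *data;			/* stand-in for decoded archive content */
	size_t		len;
	size_t		pos;
	size_t		chunk;			/* max bytes delivered per pull */
	demo_sink  *next;
} pull_source;

/*
 * Deliver at most one chunk to the next stage.  Returns 1 if something
 * was delivered, 0 once the input is exhausted.
 */
int
source_pull(pull_source *src)
{
	size_t		n = src->len - src->pos;

	assert(src->chunk > 0);
	if (n == 0)
		return 0;
	if (n > src->chunk)
		n = src->chunk;
	demo_sink_content(src->next, src->data + src->pos, n);
	src->pos += n;
	return 1;
}
```

A caller that only needs the first few bytes can simply stop calling source_pull(), which is the sense in which a pull model can stay incremental.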

I don't immediately see why that'd be a problem, but I may lack
imagination. It's still incremental, can still stop early, and we
don't do any multiplexing or AIO in this or any other uses of
astreamers. It does mean that pg_waldump's read_archive_file() has to
treat this astreamer slightly differently though, which is annoying.
Perhaps that could be fixed if astreamer_file.c provided
"astreamer_file_reader" with the same semantics, so that it could
unconditionally call astreamer_pull(privateInfo->archive_streamer),
instead of doing the read, push-into-stream itself? Just a thought.
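
A rough sketch of what that uniformity could look like, again in plain C with made-up names (demo_source, demo_file_pull(), demo_drain() and demo_count_bytes() are hypothetical and not part of astreamer_file.c or the patch). It assumes every source exposes the same pull callback, so a file-backed reader does the read and the push internally and the caller's loop is identical for any kind of source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical interface: every source exposes the same pull callback. */
typedef struct demo_source demo_source;
struct demo_source
{
	/* Deliver some content downstream; return bytes delivered, 0 at end. */
	size_t		(*pull) (demo_source *self);
	/* Next stage in the chain, plus its private argument. */
	void		(*deliver) (const char *data, size_t len, void *arg);
	void	   *deliver_arg;
	/* Source-specific state; a FILE * for the file-backed reader. */
	void	   *state;
};

/* File-backed source: does the read itself, then pushes it downstream. */
size_t
demo_file_pull(demo_source *self)
{
	char		buf[8192];
	size_t		n;

	assert(self->deliver != NULL);
	n = fread(buf, 1, sizeof(buf), (FILE *) self->state);
	if (n > 0)
		self->deliver(buf, n, self->deliver_arg);
	return n;
}

/* The caller no longer needs to know which kind of source it has. */
size_t
demo_drain(demo_source *src)
{
	size_t		total = 0;
	size_t		n;

	while ((n = src->pull(src)) > 0)
		total += n;
	return total;
}

/* Trivial downstream stage that only counts the bytes it receives. */
void
demo_count_bytes(const char *data, size_t len, void *arg)
{
	(void) data;
	*(size_t *) arg += len;
}
```

With this shape, an archive-library-backed source would only need to supply its own pull callback; the draining loop would not change.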

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGL2dppjO4o28ZY7n_LTWviKLAi-7KZ%3Dtx5w2HGevCEYPA%40mail.gmail.com#0897c3b9c0aa583fef9459a711c7de60
[2] https://www.postgresql.org/message-id/CA+TgmoYg0C4ZkuSD=mag+wbq=0GGiBm+-k1zM7LHJTDpioLYuw@mail.gmail.com

Attachment Content-Type Size
0001-libarchive-Add-configure-and-meson-options.patch text/x-patch 9.0 KB
0002-libarchive-Provide-astreamer_libarchive.c.patch text/x-patch 10.8 KB
0003-fixup-Use-more-efficient-zero-copy-API.patch text/x-patch 4.3 KB
0004-pg_waldump-Use-astreamer_libarchive.c.patch text/x-patch 5.1 KB
