Re: Large files for relations

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Jim Mlodgenski <jimmy76(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Large files for relations
Date: 2023-05-11 23:37:33
Message-ID: CA+hUKGKvcbdTnrKtD33oWf3q1RGYiPz8=wRDdPPtJgOTfsvDOw@mail.gmail.com

On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski <jimmy76(at)gmail(dot)com> wrote:
> On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>> I am not aware of any modern/non-historic filesystem[2] that can't do
>> large files with ease. Anyone know of anything to worry about on that
>> front?
>
> There is some trouble in the ambiguity of what we mean by "modern" and "large files". There are still a large number of users of ext4, where the max file size is 16TB. Switching to a single large file per relation would effectively cut the max table size in half for those users. How would a user with, say, a 20TB table running on ext4 be impacted by this change?

Hrmph. Yeah, that might be a bit of a problem. I see it discussed in
various places that MySQL/InnoDB can't have tables bigger than 16TB on
ext4 because of this, when it's in its default one-file-per-object
mode (as opposed to its big-tablespace-files-to-hold-all-the-objects
mode, like DB2, Oracle, etc., in which case I think you can have
multiple 16TB segment files and get past that ext4 limit). It's
frustrating, because 16TB is still really, really big, and at that
size you probably should be using partitions, or more partitions, to
avoid all kinds of other scalability problems. But however
hypothetical the scenario might be, it should work, and this is
certainly a plausible argument against the "aggressive" plan described
above, with its hard cut-off where we get to drop the segmented mode.
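
To make the 16TB ceiling concrete, here's a minimal standalone sketch
(mine, not from the patch set) that probes whether a filesystem can
represent a file of a given size, by extending a sparse temporary file
with ftruncate() and checking the result; no data is actually written:

#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static bool
fs_supports_file_size(const char *dir, off_t size)
{
	char		path[4096];
	int			fd;
	bool		ok;

	snprintf(path, sizeof(path), "%s/.size_probe.tmp", dir);
	fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
		return false;

	/* ftruncate() only adjusts metadata; EFBIG means over the limit */
	ok = (ftruncate(fd, size) == 0);

	close(fd);
	unlink(path);
	return ok;
}

int
main(void)
{
	/* 20TiB: comfortably past ext4's 16TiB-per-file ceiling */
	off_t		twenty_tb = (off_t) 20 * 1024 * 1024 * 1024 * 1024;

	printf("20TiB file possible here: %s\n",
		   fs_supports_file_size(".", twenty_tb) ? "yes" : "no");
	return 0;
}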

Concretely, a 20TB pg_upgrade in copy mode would fail while trying to
concatenate the segments with the above patches, so you'd have to use
link or reflink mode (you'd probably want one of those anyway, given
the sheer volume of data to copy otherwise, and on ext4 it'd have to
be link mode, since ext4 is also not capable of block-range sharing),
but then you'd be out of luck after N future major releases, according
to that plan where we start deleting the segment code, so you'd need
to organise some smaller partitions before that time comes. Or
pg_upgrade to a target on xfs, etc. I wonder if a future version of
extN will increase its max file size.
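
As an aside, for anyone wondering what "reflink mode" relies on: at
the syscall level it's block-range sharing requested with
ioctl(FICLONE), which XFS and Btrfs accept and ext4 rejects with
EOPNOTSUPP, at which point a tool has to fall back to hard links or a
plain copy. A rough standalone sketch (mine, not pg_upgrade's actual
code):

#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>		/* FICLONE */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	int			src;
	int			dst;

	if (argc != 3)
	{
		fprintf(stderr, "usage: %s SRC DST\n", argv[0]);
		return 1;
	}

	src = open(argv[1], O_RDONLY);
	dst = open(argv[2], O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (src < 0 || dst < 0)
	{
		perror("open");
		return 1;
	}

	/* share the source's blocks copy-on-write; no data is copied */
	if (ioctl(dst, FICLONE, src) < 0)
	{
		if (errno == EOPNOTSUPP || errno == EXDEV)
			fprintf(stderr, "no block-range sharing here, fall back to copy\n");
		else
			perror("ioctl(FICLONE)");
		return 1;
	}

	printf("cloned without copying any data blocks\n");
	return 0;
}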

A less aggressive version of the plan would be to keep the segment
code for the foreseeable future, with no planned cut-off, and make all
of those "piggy-back" transformations that I showed in the patch set
optional. For example, I had it so that CLUSTER would quietly convert
your relation to large format if it was still in segmented format
(might as well, if you're writing all the data out anyway, right?),
but perhaps that could depend on a GUC. Likewise for base backup.
Etc. Then someone concerned about hitting the 16TB limit on ext4
could opt out. Or something like that. It seems funny, though:
that's exactly the user who should want this feature (they have
~16,000 relation segment files).
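
To be clearer about what I mean by "depend on a GUC", here's a sketch
with entirely hypothetical names -- neither this GUC nor the smgr
helpers below exist in the posted patches, so read it as pseudocode in
PostgreSQL style rather than a working patch:

#include "postgres.h"

#include "utils/rel.h"

/* hypothetical GUC, defaulting to on */
bool		convert_rel_to_large_files = true;

/*
 * Hypothetical hook in the CLUSTER/VACUUM FULL rewrite path.  We're
 * writing every block out anyway, so switching the relation from
 * segmented to single-file format is nearly free -- unless the user
 * has opted out because one big file could exceed the filesystem's
 * max file size (e.g. 16TB on ext4).
 */
static void
maybe_convert_to_large_format(Relation rel)
{
	if (!convert_rel_to_large_files)
		return;					/* user opted out: keep 1GB segments */

	if (smgr_is_segmented(RelationGetSmgr(rel)))		/* hypothetical */
		smgr_convert_to_large(RelationGetSmgr(rel));	/* hypothetical */
}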
