Re: Large files for relations

From: MARK CALLAGHAN <mdcallag(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Jim Mlodgenski <jimmy76(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Large files for relations
Date: 2023-05-12 16:41:33
Message-ID: CAFbpF8O2BAyyn0gifSNfrdfUdvjf0vergwKUh9osG-O-W+4_pg@mail.gmail.com
Lists: pgsql-hackers

Repeating what was mentioned on Twitter, because I had some experience with
the topic. With fewer files per table there will be more contention on the
per-inode mutex (which might now be the per-inode rwsem). I haven't read
filesystem source in a long time. Back in the day, and perhaps today, it
was locked for the duration of a write to storage (locked within the
kernel) and was briefly locked while setting up a read.

The workaround for writes was one of:
1) enable the disk write cache or use battery-backed HW RAID to make writes
faster (yes, disks; I encountered this prior to 2010)
2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't
locked for the duration of a write

I have a vague memory that filesystems have improved in this regard.
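
A rough way to check whether this still matters on a given kernel and
filesystem is to time synchronous writes from several threads into one shared
file and then into one file per thread. What follows is only a sketch under
assumptions: the names timed_writes and compare are made up for illustration,
O_DSYNC is used (where the platform provides it) so each write goes to storage,
and the 8kB write size is just borrowed from PostgreSQL's default page size.
Python overhead and fast storage can easily hide any difference, so treat the
numbers as a hint, not a benchmark.

```python
import os
import tempfile
import threading
import time

BLOCK = 8192  # write size; borrowed from PostgreSQL's default page size


def timed_writes(paths, writes_per_thread=50):
    """Run one writer thread per entry in `paths`; return (seconds, bytes)."""
    # O_DSYNC makes each write wait for storage; fall back to 0 if missing.
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_DSYNC", 0)
    payload = b"x" * BLOCK
    written = [0] * len(paths)

    def worker(i, path):
        fd = os.open(path, flags, 0o600)
        try:
            # Each thread owns a disjoint offset range, so when the path is
            # shared the writes never overlap; only the inode is shared.
            base = i * writes_per_thread * BLOCK
            for n in range(writes_per_thread):
                written[i] += os.pwrite(fd, payload, base + n * BLOCK)
        finally:
            os.close(fd)

    threads = [
        threading.Thread(target=worker, args=(i, p))
        for i, p in enumerate(paths)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, sum(written)


def compare(nthreads=4, directory=None):
    """Time the same workload against one shared file and against many files."""
    with tempfile.TemporaryDirectory(dir=directory) as d:
        shared = [os.path.join(d, "shared")] * nthreads
        separate = [os.path.join(d, f"seg{i}") for i in range(nthreads)]
        return {
            "one_file": timed_writes(shared),
            "many_files": timed_writes(separate),
        }
```

Calling compare(directory="/path/on/the/filesystem/under/test") returns the
elapsed time and byte count for each layout. If the single-file run is
consistently slower, that is at least consistent with contention on the
per-inode lock.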

On Thu, May 11, 2023 at 4:38 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:

> On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski <jimmy76(at)gmail(dot)com> wrote:
> > On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> >> I am not aware of any modern/non-historic filesystem[2] that can't do
> >> large files with ease. Anyone know of anything to worry about on that
> >> front?
> >
> > There is some trouble in the ambiguity of what we mean by "modern" and
> > "large files". There are still a large number of users of ext4 where the
> > max file size is 16TB. Switching to a single large file per relation would
> > effectively cut the max table size in half for those users. How would a
> > user with say a 20TB table running on ext4 be impacted by this change?
>
> Hrmph. Yeah, that might be a bit of a problem. I see it discussed in
> various places that MySQL/InnoDB can't have tables bigger than 16TB on
> ext4 because of this, when it's in its default one-file-per-object
> mode (as opposed to its big-tablespace-files-to-hold-all-the-objects
> mode like DB2, Oracle etc, in which case I think you can have multiple
> 16TB segment files and get past that ext4 limit). It's frustrating
> because 16TB is still really, really big and you probably should be
> using partitions, or more partitions, to avoid all kinds of other
> scalability problems at that size. But however hypothetical the
> scenario might be, it should work, and this is certainly a plausible
> argument against the "aggressive" plan described above with the hard
> cut-off where we get to drop the segmented mode.
>
> Concretely, a 20TB pg_upgrade in copy mode would fail while trying to
> concatenate with the above patches, so you'd have to use link or
> reflink mode (you'd probably want link mode anyway, given the sheer
> volume of data to copy otherwise, and reflink is not available here
> since ext4 is not capable of block-range sharing), but then you'd be
> out of luck after N future major releases, according to that plan
> where we start deleting the code, so you'd need to organise some
> smaller partitions before that time comes. Or pg_upgrade to a target
> on xfs etc. I wonder if a future version of extN will increase its
> max file size.
>
> A less aggressive version of the plan would be that we just keep the
> segment code for the foreseeable future with no planned cut off, and
> we make all of those "piggy back" transformations that I showed in the
> patch set optional. For example, I had it so that CLUSTER would
> quietly convert your relation to large format, if it was still in
> segmented format (might as well if you're writing all the data out
> anyway, right?), but perhaps that could depend on a GUC. Likewise for
> base backup. Etc. Then someone concerned about hitting the 16TB
> limit on ext4 could opt out. Or something like that. It seems funny
> though, that's exactly the user who should want this feature (they
> have 16,000 relation segment files).
>
>
>
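
The sizes discussed in the quoted text can be checked with a few lines of
arithmetic. This sketch assumes PostgreSQL's default 8kB block size and 1GB
segment size, 32-bit block numbers, and ext4's 16TiB file limit with 4KiB
filesystem blocks. It also treats the "TB" figures in the thread as binary
units, which is an assumption. The function names are made up for
illustration.

```python
KIB, GIB, TIB = 2**10, 2**30, 2**40

BLOCK_SIZE = 8 * KIB      # assumed: PostgreSQL default block size
SEGMENT_SIZE = 1 * GIB    # assumed: PostgreSQL default segment file size
MAX_BLOCKS = 2**32        # 32-bit block numbers (ignores the reserved invalid value)
EXT4_MAX_FILE = 16 * TIB  # assumed: ext4 limit with 4 KiB filesystem blocks


def segments_needed(table_bytes, segment_size=SEGMENT_SIZE):
    """Number of segment files needed to hold a relation of this size."""
    return -(-table_bytes // segment_size)  # ceiling division


def fits_in_one_file(table_bytes, max_file=EXT4_MAX_FILE):
    """True if the whole relation could live in a single file."""
    return table_bytes <= max_file


max_table = MAX_BLOCKS * BLOCK_SIZE   # 32 TiB, twice the ext4 file limit
segments = segments_needed(16 * TIB)  # -> 16384
fits = fits_in_one_file(20 * TIB)     # -> False
```

Under these assumptions a 16TiB relation is 16384 one-gigabyte segments, which
matches the "16,000 relation segment files" figure, and the largest possible
relation is 32TiB, which is why a single-file layout on ext4 would halve the
maximum table size.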

--
Mark Callaghan
mdcallag(at)gmail(dot)com
