Re: Large files for relations

From: MARK CALLAGHAN <mdcallag(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Jim Mlodgenski <jimmy76(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Large files for relations
Date: 2023-05-15 16:43:17
Message-ID: CAFbpF8OaxX+ZhKb=XTnLxGgJZxC8iTxEF_YeNEjwWWZNG1tAEQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 12, 2023 at 4:02 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:

> On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN <mdcallag(at)gmail(dot)com> wrote:
> > Repeating what was mentioned on Twitter, because I had some experience
> > with the topic. With fewer files per table there will be more contention on
> > the per-inode mutex (which might now be the per-inode rwsem). I haven't
> > read filesystem source in a long time. Back in the day, and perhaps today,
> > it was locked for the duration of a write to storage (locked within the
> > kernel) and was briefly locked while setting up a read.
> >
> > The workaround for writes was one of:
> > 1) enable disk write cache or use battery-backed HW RAID to make writes
> > faster (yes disks, I encountered this prior to 2010)
> > 2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't
> > locked for the duration of a write
> >
> > I have a vague memory that filesystems have improved in this regard.
>
> (I am interpreting your "use XFS" to mean "use XFS instead of ext4".)
>

Yes, although when the decision was made it was probably ext3 -> XFS. We
suffered from "fsync a file == fsync the filesystem" because MySQL binlogs
use buffered IO and are appended to on write. Switching from ext-? to XFS
was an easy perf win, so I don't have much experience with ext-? over the
past decade.

> Right, 80s file systems like UFS (and I suspect ext and ext2, which
>

The late 80s is when I last hacked on Unix filesystem code, excluding
browsing XFS and ext source. Unix was easy back then: one big kernel lock
covered everything.

> some time sooner). Currently our code believes that it is not safe to
> call fdatasync() for files whose size might have changed. There is no
>

Long ago we added code for InnoDB to avoid fsync/fdatasync in some cases
when O_DIRECT was used. That was great for performance, but we forgot to
make sure the syncs were still done when files were extended. Eventually
we fixed that.
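
As a rough illustration of the rule discussed above (use the cheaper
fdatasync only when the file size has not changed, and fall back to fsync
after an extension), here is a minimal Python sketch. It is not InnoDB or
PostgreSQL code: write_and_sync and demo are hypothetical names, it uses
buffered IO rather than O_DIRECT, and it assumes a Unix-like system where
os.pwrite is available.

```python
import os
import tempfile


def write_and_sync(fd, data, offset):
    """Write data at offset, then sync conservatively.

    Hypothetical helper: uses fsync when the write changed the file size,
    and fdatasync when the size is unchanged.
    """
    size_before = os.fstat(fd).st_size
    os.pwrite(fd, data, offset)
    size_after = os.fstat(fd).st_size

    if size_after != size_before:
        # The file was extended, so take the conservative path.
        os.fsync(fd)
        return "fsync"

    # Size unchanged: fdatasync is enough under this rule. Some platforms
    # lack os.fdatasync, so fall back to os.fsync there.
    fdatasync = getattr(os, "fdatasync", os.fsync)
    fdatasync(fd)
    return "fdatasync"


def demo():
    """Extend a temporary file, then overwrite the same range in place."""
    fd, path = tempfile.mkstemp()
    try:
        first = write_and_sync(fd, b"x" * 8192, 0)   # extends the file
        second = write_and_sync(fd, b"y" * 8192, 0)  # overwrites in place
        return first, second
    finally:
        os.close(fd)
        os.unlink(path)
```

The return value only reports which call was chosen, so the decision can
be checked without inspecting the storage layer.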

Thanks for all of the details.

--
Mark Callaghan
mdcallag(at)gmail(dot)com
