Re: Bumping block size to 16K on FreeBSD...

From: David Schultz <dschultz(at)uclink(dot)Berkeley(dot)EDU>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Sean Chittenden <sean(at)chittenden(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bumping block size to 16K on FreeBSD...
Date: 2003-08-29 00:51:12
Message-ID: 20030829005112.GA45785@HAL9000.homeunix.com
Lists: pgsql-hackers

On Thu, Aug 28, 2003, Tom Lane wrote:
> Sean Chittenden <sean(at)chittenden(dot)org> writes:
> > Are there any objections
> > to me increasing the block size for FreeBSD installations to 16K for
> > the upcoming 7.4 release?
>
> I'm a little uncomfortable with introducing a cross-platform variation
> in the standard block size. That would have implications for things
> like whether a table definition that works on FreeBSD could be expected
> to work elsewhere; to say nothing of recommendations for shared_buffer
> settings and suchlike.
>
> Also, there is no infrastructure for adjusting BLCKSZ automatically at
> configure time, and I don't much want to add it.

On recent versions of FreeBSD (and Solaris too, I think), the
default UFS block size is 16K, and file fragments are 2K. This
works great for many workloads, but it kills pgsql's random write
performance unless pgsql uses 16K blocks as well, because an 8K
write into a 16K filesystem block can force a read-modify-write.
Either the filesystem or the database needs to be changed in order
to get decent performance. I have not compared 16K UFS/16K pgsql
to 8K UFS/8K pgsql, so I can't say which option makes more sense.
There probably isn't anything wrong with the pgsql default, except
that it's fixed at compile time.

It's entirely feasible for administrators to create 8K/1K UFS
filesystems specifically for pgsql, but they need to be aware of
the issue. On the other hand, I don't see how it would be a bad
thing if pgsql were able to adapt at runtime either. Thus, I've
come up with two possible fixes:

(1) Document the problem with having a filesystem block size
larger than the database block size. With a simple call to
statvfs(2), the postmaster could warn about this on startup, too.

(2) Make BLCKSZ a runtime constant, stored as part of the database.
Grepping through the source, I didn't see any places
using BLCKSZ where efficiency appeared to be so critical that
you had to have constant folding. Of course, one could introduce
a 'lg2blksz' constant to avoid divides and multiplies.

This would NOT introduce cross-platform incompatibilities, only
efficiency problems with databases that have been moved across
filesystems in some cases. The ability to adapt at database
creation time is also useful in that it allows the database to
be tuned to the characteristics of the particular device on
which it resides.[1]
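
For illustration, the startup check suggested in (1) might look
roughly like the sketch below. This is not code from the pgsql
tree: the function names are made up, and using f_bsize as "the
filesystem block size" is an assumption (some systems report the
preferred I/O size there and the fragment size in f_frsize).

```c
#include <assert.h>
#include <stdio.h>
#include <sys/statvfs.h>

/* Pure comparison, kept separate so it can be checked without a filesystem. */
static int
fs_block_exceeds_db_block(unsigned long fs_block, unsigned long db_block)
{
    return fs_block > db_block;
}

/*
 * Hypothetical startup check: returns 1 (and warns on stderr) if the
 * filesystem holding data_dir reports a block size larger than db_block,
 * 0 if it does not, and -1 if statvfs() fails.
 */
static int
check_block_size(const char *data_dir, unsigned long db_block)
{
    struct statvfs vfs;

    assert(data_dir != NULL);

    if (statvfs(data_dir, &vfs) != 0)
    {
        perror("statvfs");
        return -1;
    }

    /* Assumption: f_bsize is the relevant filesystem block size. */
    if (fs_block_exceeds_db_block((unsigned long) vfs.f_bsize, db_block))
    {
        fprintf(stderr,
                "WARNING: filesystem block size (%lu) is larger than "
                "database block size (%lu); random writes may be slow\n",
                (unsigned long) vfs.f_bsize, db_block);
        return 1;
    }
    return 0;
}
```

The postmaster would call something like
check_block_size(data_directory, BLCKSZ) once at startup and carry
on regardless of the result; the point is only to make the
administrator aware of the mismatch.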

I don't know very much about pgsql, so corrections and feedback
regarding these ideas would be appreciated.
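
To make the 'lg2blksz' idea from (2) concrete, here is a minimal
sketch of how a runtime block size could be carried alongside its
base-2 logarithm, so that block/offset conversions use shifts and
masks instead of divides and multiplies. The struct and function
names are hypothetical, and the sketch assumes the block size is
always a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-database geometry, read once at startup. */
typedef struct BlockGeometry
{
    uint32_t blcksz;    /* block size in bytes; must be a power of two */
    uint32_t lg2blksz;  /* log2(blcksz) */
} BlockGeometry;

/* Returns 0 on success, -1 if blcksz is zero or not a power of two. */
static int
block_geometry_init(BlockGeometry *g, uint32_t blcksz)
{
    assert(g != NULL);

    if (blcksz == 0 || (blcksz & (blcksz - 1)) != 0)
        return -1;

    g->blcksz = blcksz;
    g->lg2blksz = 0;
    while ((UINT32_C(1) << g->lg2blksz) < blcksz)
        g->lg2blksz++;
    return 0;
}

/* Block number -> byte offset: a shift instead of a multiply. */
static uint64_t
block_to_offset(const BlockGeometry *g, uint32_t blkno)
{
    return (uint64_t) blkno << g->lg2blksz;
}

/* Byte offset -> block number: a shift instead of a divide. */
static uint32_t
offset_to_block(const BlockGeometry *g, uint64_t off)
{
    return (uint32_t) (off >> g->lg2blksz);
}

/* Byte offset -> position inside its block: a mask instead of a modulo. */
static uint32_t
offset_within_block(const BlockGeometry *g, uint64_t off)
{
    return (uint32_t) (off & (g->blcksz - 1));
}
```

With a 16K block size, lg2blksz is 14, so block 3 starts at byte
offset 49152 and offset 49153 falls one byte into block 3.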

[1] Right now, the drive's ratio of seek time to transfer time
is mostly hidden by the operating system's clustering and
read-ahead. I tried modifying pgsql to use direct I/O, but
pgsql doesn't seem to do its own clustering or read-ahead,
so that was a loss.
