RE: [Patch] Optimize dropping of relation buffers using dlist

From: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To: 'Thomas Munro' <thomas(dot)munro(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "k(dot)jamison(at)fujitsu(dot)com" <k(dot)jamison(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: [Patch] Optimize dropping of relation buffers using dlist
Date: 2020-10-23 00:56:35
Message-ID: TYAPR01MB2990C1DBBC8985E27CC28ADBFE1A0@TYAPR01MB2990.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
> > I'm probably being silly, but can't we avoid the problem by using fstat()
> > instead of lseek(SEEK_END)? Would they return the same value from the
> > i-node?
>
> Amazingly, st_size can disagree with SEEK_END when using the Linux NFS
> client, but its behaviour is worse. Here's a sequence from a Linux
> NFS client talking to a Linux NFS server with no free space. This
> time, I also replaced the fsync() with sleep(60), just to make it
> clear that SEEK_END offset can move at any time due to asynchronous
> activity in kernel threads:

Thank you for experimenting. That's surely amazing. So, it makes sense for commercial DBMSs and MySQL to preallocate data files... (But IIRC, MySQL has provided an option to allocate a file per table like Postgres relatively recently.)

FWIW, it seems safe to use the nodelalloc mount option with ext4 to disable delayed allocation, while xfs doesn't have such an option.
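
For reference, the fstat() versus lseek(SEEK_END) comparison discussed above could be checked with something like the following. This is only a minimal sketch, not the program Thomas ran; the helper names (size_by_fstat, size_by_seek_end, sizes_agree) are made up for illustration, and on a local file system the two values would normally be expected to match.

```c
#include <fcntl.h>      /* open() flags, for the caller that opens the file */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: file size as reported by fstat(), or -1 on error. */
static off_t
size_by_fstat(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return st.st_size;
}

/*
 * Hypothetical helper: file size as reported by lseek(SEEK_END), or -1 on
 * error.  Note that this moves the file offset to the end of the file.
 */
static off_t
size_by_seek_end(int fd)
{
    return lseek(fd, 0, SEEK_END);
}

/*
 * Hypothetical helper: print both values and return 1 if they agree,
 * 0 if they disagree, -1 if either call failed.
 */
static int
sizes_agree(int fd)
{
    off_t by_stat = size_by_fstat(fd);
    off_t by_seek = size_by_seek_end(fd);

    if (by_stat < 0 || by_seek < 0)
        return -1;

    printf("st_size = %lld, SEEK_END = %lld\n",
           (long long) by_stat, (long long) by_seek);
    return by_stat == by_seek;
}
```

Calling sizes_agree() on a descriptor for a file on the NFS mount, before and after the sleep, would show whether the two values drift apart there. A result of 1 at one moment says nothing about a later moment, since the point of the quoted experiment is that the SEEK_END offset can move asynchronously.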

> > Or, can't we just try to do BufTableLookup() one block after what
> > smgrnblocks() returns?
>
> Unfortunately the problem isn't limited to one block.

You're right. The data file can be extended by multiple blocks between disk writes.

Regards
Takayuki Tsunakawa
