Re: Cache relation sizes?

From: Edmund Horner <ejrh00(at)gmail(dot)com>
To: thomas(dot)munro(at)enterprisedb(dot)com
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cache relation sizes?
Date: 2018-11-09 01:23:42
Message-ID: CAMyN-kCPin_stCMoXCVCq5J557e9-WEFPZTqdpO3j8wzoNVwNQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 7 Nov 2018 at 11:41, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>
> Hello,
>
> PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a
> lot. For example, a fully prewarmed pgbench -S transaction consists
> of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto(). I think
> lseek() is probably about as cheap as a syscall can be so I doubt it
> really costs us much, but it's still a context switch and it stands
> out when tracing syscalls, especially now that all the lseek(SEEK_SET)
> calls are gone (commit c24dcd0cfd).
>
> If we had a different kind of buffer mapping system of the kind that
> Andres has described, there might be a place in shared memory that
> could track the size of the relation. Even if we could do that, I
> wonder if it would still be better to do a kind of per-backend
> lock-free caching, like this:

On behalf of those looking after databases running over NFS (sigh!), I
think this is definitely worth exploring. Looking at the behaviour of
my (9.4.9) server, there's an lseek(SEEK_END) for every relation
(table or index) used by a query, which is a lot of them for a heavily
partitioned database. The lseek counts seem to be the same with
native partitions and 10.4.

As an incredibly rough benchmark, a "SELECT * FROM t ORDER BY pk LIMIT
0" on a table with 600 partitions, which builds a
MergeAppend/IndexScan plan, invokes lseek around 1200 times, and takes
600ms to return when repeated. (It's much slower the first time,
because the backend has to open the files, and read index blocks. I
found that increasing max_files_per_process above the number of
tables/indexes in the query made a huge difference, too!) Testing
separately, 1200 lseeks on that NFS mount takes around 400ms.

I'm aware of other improvements since 9.4.9 that would likely improve
things (the pread/pwrite change; possibly using native partitioning
instead of inheritance), but I imagine reducing lseeks would too.

(Of course, an even better improvement is to not put your data
directory on an NFS mount (sigh).)

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-11-09 01:27:29 Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint
Previous Message Ideriha, Takeshi 2018-11-09 01:19:25 RE: Copy data to DSA area