From: Jim Nasby <jim(at)nasby(dot)net>
To: Mel Gorman <mgorman(at)suse(dot)de>, Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>
Subject: Re: Linux kernel impact on PostgreSQL performance
Date: 2014-01-15 03:34:10
Message-ID: 52D601B2.70806@nasby.net
Lists: pgsql-hackers

On 1/14/14, 4:21 AM, Mel Gorman wrote:
> There is an interesting side-line here. If all IO is initiated by one
> process in postgres then the memory locality will be sub-optimal.
> The consumer of the data may or may not be running on the same
> node as the process that read the data from disk. It is possible to
> migrate this from user space but the interface is clumsy and assumes the
> data is mapped.

That's really not the case in Postgres. There are essentially three main sources of IO requests:

- Individual "backends". These are processes forked off of our startup process (postmaster) to serve user connections. This is always "foreground" IO and should be avoided as much as possible (but it still accounts for a large percentage of our IO).
- autovacuum. This is a set of "clean-up" processes, meant to be low-impact and background-only. Similar to garbage collection in GC'd languages.
- bgwriter. This process is meant to greatly reduce the need for user backends to write data out.

Generally speaking, read requests are most likely to come from user backends. autovacuum can issue them too, but it has a throttling mechanism, so it generally shouldn't make up much of the workload.
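
Roughly speaking, that throttling is cost-based: page accesses accumulate a cost, and once a limit is reached the process naps for a bit. A minimal sketch of the idea in C; the constants and names here are made up for illustration, the real limits are configurable:

/*
 * Illustrative sketch of cost-based IO throttling, loosely modeled on what
 * autovacuum does.  The constants and names are made up for the example;
 * the real limits are configurable.
 */
#include <unistd.h>

#define COST_PER_PAGE_READ   10     /* arbitrary cost charged per page read */
#define COST_LIMIT          200     /* nap once this much cost accumulates */
#define NAP_USEC          20000     /* then sleep 20ms and start over */

static int accumulated_cost = 0;

static void
throttled_page_read(void (*read_one_page)(void))
{
    read_one_page();                        /* issue the actual read */
    accumulated_cost += COST_PER_PAGE_READ;

    if (accumulated_cost >= COST_LIMIT)
    {
        usleep(NAP_USEC);                   /* back off; keep the IO "background" */
        accumulated_cost = 0;
    }
}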

Ideally most write traffic would come from bgwriter (and autovacuum, though again we don't care too much about it). In reality, that's going to depend very heavily on a user's actual workload. To start with, backends normally must write all write-ahead log (WAL) traffic before they finalize (COMMIT) a transaction for the user. COMMIT is sort of similar in idea to fsync: "when this returns, I guarantee I've permanently stored your data."
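
At the syscall level, a backend's COMMIT boils down to roughly the following. This is a deliberately simplified sketch, not the actual code path; the real thing batches records and picks the flush primitive via wal_sync_method:

/*
 * Grossly simplified view of what COMMIT implies at the syscall level:
 * the transaction's remaining WAL, commit record included, must be durable
 * before we report success to the client.
 */
#include <unistd.h>

static int
commit_transaction(int wal_fd, const char *wal_records, size_t len)
{
    /* Append this transaction's remaining WAL, including the commit record. */
    if (write(wal_fd, wal_records, len) != (ssize_t) len)
        return -1;

    /*
     * Don't return from COMMIT until the WAL is on stable storage.  (The
     * actual flush primitive is chosen by wal_sync_method; fdatasync is
     * just one option.)
     */
    if (fdatasync(wal_fd) != 0)
        return -1;

    return 0;   /* now it's safe to tell the client "your data is permanent" */
}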

The amount of WAL data generated for a transaction will vary enormously, even as a percentage of the raw page data written. In some cases a very small amount of WAL (tens to hundreds of bytes) will cover one or more base data pages (8k by default, configurable up to 32k). But to protect against torn page writes, by default we write a complete copy of a data page to WAL the first time that page is dirtied after a checkpoint. So in the opposite scenario we actually write slightly MORE data to WAL than we do to the data pages themselves.

What makes WAL even trickier is that bgwriter tries to write WAL data out before backends need to. On a system with a fairly low transaction rate that can work, but at higher rates most WAL data will be written by a backend trying to issue a COMMIT. Note, however, that COMMIT needs to flush ALL WAL data up to a given point, so one backend that only needs to write 100 bytes can easily end up flushing (and fsync'ing) megabytes of data written by other backends.
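
To make that concrete, here's a rough sketch of the "flush everything up to my commit LSN" effect. The type and the bookkeeping are invented for illustration; the point is that syncing up to my position also covers every byte other backends queued before it:

/*
 * Hedged sketch of the "flush all WAL up to my commit LSN" effect.  The
 * types and bookkeeping are invented for illustration only.
 */
#include <stdint.h>
#include <unistd.h>

typedef uint64_t XLogRecPtr;                /* byte position in the WAL stream */

static XLogRecPtr wal_flushed_up_to = 0;    /* shared state in real life */

static int
flush_wal_through(int wal_fd, const char *wal_buffer, XLogRecPtr my_commit_lsn)
{
    if (my_commit_lsn <= wal_flushed_up_to)
        return 0;                           /* someone already flushed past us */

    /*
     * Write everything between the last durable point and our commit record:
     * possibly megabytes of other backends' WAL, even though our own record
     * may be only ~100 bytes.
     */
    size_t len = (size_t) (my_commit_lsn - wal_flushed_up_to);
    if (write(wal_fd, wal_buffer + wal_flushed_up_to, len) != (ssize_t) len)
        return -1;

    if (fdatasync(wal_fd) != 0)             /* one flush covers all of it */
        return -1;

    wal_flushed_up_to = my_commit_lsn;
    return 0;
}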

Further complicating things is temporary storage, either in the form of user-defined temporary tables or temporary storage needed by the database itself. It's hard to characterize these workloads other than to say that reads and writes to them typically want to move a relatively large amount of data at once.

BTW, because Postgres doesn't have terribly sophisticated memory management, it's very common to create temporary file data that will never, ever, ever actually NEED to hit disk. Where I work, being able to tell the kernel to avoid flushing those files unless the kernel thinks it's got better things to do with that memory would be EXTREMELY valuable, because it's all temp data anyway: if the database or server crashes it's just going to get thrown away. It might be a good idea for Postgres to look at simply putting this data into plain memory and relying on the OS to swap it as needed. That'd be more problematic for temp tables, but in that case mmap might work very well, because that data is currently never shared by other processes (though if we start doing parallel query execution that will change).
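
As a strawman for the "plain memory, let the OS swap it" route for backend-local temp data, anonymous mappings already behave this way: there's no file behind them, so there's nothing for the kernel to flush, and the pages only go to swap if the kernel decides it has better uses for the RAM. A sketch, not a proposal for actual Postgres code:

/*
 * Strawman sketch: back temp data with anonymous memory instead of temp
 * files.  Function names are invented for the example.
 */
#include <stddef.h>
#include <sys/mman.h>

static void *
alloc_temp_space(size_t bytes)
{
    /* Private anonymous mapping: visible only to this backend, no file
     * behind it, so the kernel never writes it back except to swap. */
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return (p == MAP_FAILED) ? NULL : p;
}

static void
free_temp_space(void *p, size_t bytes)
{
    munmap(p, bytes);
}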
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net
