Re: Partitioned checkpointing

From: Takashi Horikawa <t-horikawa(at)aj(dot)jp(dot)nec(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partitioned checkpointing
Date: 2015-09-14 09:42:49
Message-ID: 73FA3881462C614096F815F75628AFCD035590E2@BPXM01GP.gisp.nec.co.jp
Lists: pgsql-hackers

Hi,

I wrote:
> The original purpose is to mitigate the full-page-write rush that occurs
> immediately after the beginning of each checkpoint.
> The amount of FPW at each checkpoint is reduced to 1/16 by the
> 'Partitioned checkpointing.'
Let me show another set of measurement results that clearly illustrates this point; please see the attached DBT2-sync.jpg and DBT2-sync-FPWoff.jpg.

I first noticed the performance dips due to checkpointing when I ran some performance measurements using DBT-2, which implements transactions based on the TPC-C specification. As can be seen in DBT2-sync.jpg, the original 9.5alpha2 showed sharp periodic dips in throughput.

The point here is that I identified those dips as being caused by the full-page-write rush that occurs immediately after the beginning of each checkpoint. As shown in DBT2-sync-FPWoff.jpg, the dips were eliminated when the GUC parameter 'full_page_writes' was set to 'off.' This also indicates that the existing mechanism for spreading buffer sync operations over time worked effectively. Since the only difference between the original 9.5alpha2 runs in DBT2-sync.jpg and DBT2-sync-FPWoff.jpg was the setting of 'full_page_writes,' it follows that the dips were attributable to full-page writes.

'Partitioned checkpointing' was implemented to mitigate the dips by spreading full-page writes over time, and it worked exactly as designed (see DBT2-sync.jpg). It also had a good effect on pgbench, which is why I posted an article with the Partitioned-checkpointing.patch to this mailing list.
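
To illustrate why the full-page writes get spread out, here is a rough sketch of the idea in simplified C (this is NOT the actual patch code; the per-partition redo-pointer array and the function name are only placeholders):
---
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;        /* stands in for PostgreSQL's WAL pointer type */

#define N_PARTITIONS 16

/*
 * Hypothetical: one redo pointer per partition instead of the single
 * global RedoRecPtr.  Each partitioned checkpoint advances only one of
 * these, so only roughly 1/16 of the buffers become "FPW needed" at a time.
 */
static XLogRecPtr PartitionRedoPtr[N_PARTITIONS];

static bool
buffer_needs_fpw(int buf_id, XLogRecPtr page_lsn)
{
    /* the first modification after the buffer's own partition was
     * checkpointed takes a full-page image, analogous to the usual
     * page_lsn <= RedoRecPtr check */
    return page_lsn <= PartitionRedoPtr[buf_id % N_PARTITIONS];
}
---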

As for pgbench, however, I found that full-page writes did not cause the performance dips, because the dips also occurred when 'full_page_writes' was set to 'off.' So, honestly, I do not know exactly why 'Partitioned checkpointing' mitigated the dips in the pgbench runs.

However, it is certain that there are PostgreSQL workloads other than pgbench in which the full-page-write rush causes performance dips and in which 'Partitioned checkpointing' is effective at eliminating (or mitigating) them; DBT-2 is one example.

It is also worth studying why 'Partitioned checkpointing' is effective in pgbench runs; doing so may lead to better approaches.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Takashi Horikawa
> Sent: Saturday, September 12, 2015 12:50 PM
> To: Simon Riggs; Fabien COELHO
> Cc: pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] Partitioned checkpointing
>
> Hi,
>
> > I understand that what this patch does is cutting the checkpoint
> > of buffers in 16 partitions, each addressing 1/16 of buffers, and each with
> > its own wal-log entry, pacing, fsync and so on.
> Right.
> However,
> > The key point is that we spread out the fsyncs across the whole checkpoint
> > period.
> this is not the key point of the 'partitioned checkpointing,' I think.
> The original purpose is to mitigate the full-page-write rush that occurs
> immediately after the beginning of each checkpoint.
> The amount of FPW at each checkpoint is reduced to 1/16 by the
> 'Partitioned checkpointing.'
>
> > This method interacts with the current proposal to improve the
> > checkpointer behavior by avoiding random I/Os, but it could be combined.
> I agree.
>
> > Splitting with N=16 does nothing to guarantee the partitions are equally
> > sized, so there would likely be an imbalance that would reduce the
> > effectiveness of the patch.
> That may be right.
> However, the current method was designed to split the
> buffers so as to balance the load as equally as possible;
> the current patch splits the buffers as follows:
> ---
> 1st round: b[0], b[p], b[2p], … b[(n-1)p]
> 2nd round: b[1], b[p+1], b[2p+1], … b[(n-1)p+1]
> …
> p-th round: b[p-1], b[p+(p-1)], b[2p+(p-1)], … b[(n-1)p+(p-1)]
> ---
> where N is the number of buffers,
> p is the number of partitions, and n = (N / p).
>
> It would be extremely unbalanced if the buffers were divided as follows:
> ---
> 1st round: b[0], b[1], b[2], … b[n-1]
> 2nd round: b[n], b[n+1], b[n+2], … b[2n-1]
> …
> p-th round: b[(p-1)n], b[(p-1)n+1], b[(p-1)n+2], … b[(p-1)n+(n-1)]
> ---
>
>
> I'm afraid I may be missing the point, but
> > 2.
> > Assign files to one of N batches so we can make N roughly equal sized
> > mini-checkpoints
> Splitting buffers along file boundaries makes the FPW-related processing
> (in xlog.c and xloginsert.c) intolerably complicated, because 'Partitioned
> checkpointing' is closely tied to the decision of whether a given buffer
> needs a full-page write at the time its xlog record is inserted.
> # 'partition id = buffer id % number of partitions' is fairly simple.
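
To make the contrast between the two splittings concrete, here is a minimal sketch in C (identifiers such as N_PARTITIONS and NBuffers are placeholders here, not the patch code):
---
#define N_PARTITIONS 16

/* round-robin rule used by the patch: round k handles the buffers with
 * buf_id % N_PARTITIONS == k, so dirty pages tend to spread evenly over rounds */
static int
partition_of(int buf_id)
{
    return buf_id % N_PARTITIONS;
}

/* contiguous rule (the unbalanced alternative): round k handles the k-th
 * block of NBuffers / N_PARTITIONS consecutive buffers, so a hot region of
 * the buffer pool can end up entirely in a single round */
static int
partition_of_contiguous(int buf_id, int NBuffers)
{
    int n = NBuffers / N_PARTITIONS;
    int k = buf_id / n;
    return (k < N_PARTITIONS) ? k : N_PARTITIONS - 1;  /* clamp the remainder */
}
---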
>
> Best regards.
> --
> Takashi Horikawa
> NEC Corporation
> Knowledge Discovery Research Laboratories
>
>
>
> > -----Original Message-----
> > From: Simon Riggs [mailto:simon(at)2ndQuadrant(dot)com]
> > Sent: Friday, September 11, 2015 10:57 PM
> > To: Fabien COELHO
> > Cc: Horikawa Takashi(堀川 隆); pgsql-hackers(at)postgresql(dot)org
> > Subject: Re: [HACKERS] Partitioned checkpointing
> >
> > On 11 September 2015 at 09:07, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> wrote:
> >
> >
> >
> > Some general comments :
> >
> >
> >
> > Thanks for the summary Fabien.
> >
> >
> > I understand that what this patch does is cutting the checkpoint
> > of buffers in 16 partitions, each addressing 1/16 of buffers, and each with
> > its own wal-log entry, pacing, fsync and so on.
> >
> > I'm not sure why it would be much better, although I agree that
> > it may have some small positive influence on performance, but I'm afraid
> > it may also degrade performance in some conditions. So I think that maybe
> > a better understanding of why there is a better performance and focus on
> > that could help obtain a more systematic gain.
> >
> >
> >
> > I think it's a good idea to partition the checkpoint, but not doing it this
> > way.
> >
> > Splitting with N=16 does nothing to guarantee the partitions are equally
> > sized, so there would likely be an imbalance that would reduce the
> > effectiveness of the patch.
> >
> >
> > This method interacts with the current proposal to improve the
> > checkpointer behavior by avoiding random I/Os, but it could be combined.
> >
> > I'm wondering whether the benefits you see are linked to the file
> > flushing behavior induced by fsyncing more often, in which case it is quite
> > close to the "flushing" part of the current "checkpoint continuous flushing"
> > patch, and could be redundant/less efficient than what is done there,
> > especially as tests have shown that the effect of flushing is *much* better
> > on sorted buffers.
> >
> > Another proposal around, suggested by Andres Freund I think, is
> > that checkpoint could fsync files while checkpointing and not wait for the
> > end of the checkpoint. I think that it may also be one of the reasons why
> > your patch does bring benefit, but Andres' approach would be more systematic,
> > because there would be no need to fsync files several times (basically your
> > patch issues 16 fsyncs per file). This suggests that the "partitioning"
> > should be done at a lower level, from within CheckPointBuffers, which
> > would take care of fsyncing files some time after writing buffers to them
> > is finished.
> >
> >
> > The idea to do a partial pass through shared buffers and only write a fraction
> > of dirty buffers, then fsync them, is a good one.
> >
> > The key point is that we spread out the fsyncs across the whole checkpoint
> > period.
> >
> > I think we should be writing out all buffers for a particular file in one
> > pass, then issue one fsync per file. >1 fsyncs per file seems a bad idea.
> >
> > So we'd need logic like this:
> > 1. Run through shared buffers and analyze the files contained in there
> > 2. Assign files to one of N batches so we can make N roughly equal sized
> >    mini-checkpoints
> > 3. Make N passes through shared buffers, writing out files assigned to
> >    each batch as we go
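
(Just to illustrate the three steps above, a condensed sketch in C; the data structures and helper names are placeholders, not an actual implementation, and how to make the batches "roughly equal sized" is left open here.)
---
#include <stddef.h>

#define N_BATCHES 16

typedef struct
{
    int buf_id;     /* index in shared buffers */
    int file_id;    /* file (relation segment) the buffer belongs to */
} BufEntry;

/* stand-ins for the real buffer-write and per-batch fsync calls */
static void write_buffer(int buf_id) { (void) buf_id; }
static void fsync_batch_files(int batch) { (void) batch; }

/* step 2: assign each file to one of N batches (plain round-robin over
 * file ids, purely for illustration) */
static int
batch_of_file(int file_id)
{
    return file_id % N_BATCHES;
}

/* steps 1 and 3: pass over the buffer list once per batch, write only the
 * buffers whose file belongs to the current batch, then fsync those files */
static void
run_mini_checkpoints(const BufEntry *buffers, size_t nbuffers)
{
    for (int batch = 0; batch < N_BATCHES; batch++)
    {
        for (size_t i = 0; i < nbuffers; i++)
            if (batch_of_file(buffers[i].file_id) == batch)
                write_buffer(buffers[i].buf_id);
        fsync_batch_files(batch);
    }
}
---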
> >
> > --
> >
> > Simon Riggs http://www.2ndQuadrant.com/
> > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
image/jpeg 38.5 KB
image/jpeg 31.0 KB
smime.p7s application/pkcs7-signature 6.5 KB
