Re: Performance Improvement by reducing WAL for Update Operation

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Mike Blackwell <mike(dot)blackwell(at)rrd(dot)com>
Subject: Re: Performance Improvement by reducing WAL for Update Operation
Date: 2014-01-14 06:16:11
Message-ID: CAA4eK1JHPiCCmsnSOBoiRwSa-QV9KBdt5crt9005jtZFfyLU_Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 14, 2014 at 2:16 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sat, Jan 11, 2014 at 1:08 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> Yes, currently this applies to update, what I have in mind is that
>> in future if some one wants to use WAL compression for any other
>> operation like 'full_page_writes', then it can be easily extendible.
>>
>> To be honest, I have not evaluated whether such a flag or compression
>> would make sense for full page writes, but I think it should be possible
>> while doing full page write (BkpBlock has RelFileNode) to check such a
>> flag if it's present.
>
> Makes sense.

So shall I change it to string instead of bool and keep the name as
compress_wal or compress_wal_for_opr?

>> The reason of adding the same chunk in head of list is that it uses same
>> technique as pglz_hist_add. Now in pglz, it will not have repeat steps
>> from c~f, as it has concept of good_match which leads to get this done in
>> one go.
>>
>> Being said above, I am really not sure, how much real world data falls
>> in above category and should we try to optimize based on above example,
>> but yes it will save some CPU cycles in current test we are using.
>
> In the Rabin algorithm, we shouldn't try to find a longer match. The
> match should end at the chunk end, period. Otherwise, you lose the
> shift-resistant property of the algorithm.

Okay, it will work well for cases when most chunks in tuple are due
due to special pattern in it, but it will loose out on CPU cycles in
cases where most of the chunks are due to maximum chunk boundary
and most part of new tuple matches with old tuple. The reason is that
if the algorithm have some such property of finding longer matches than
chunk boundaries, then it can save us on calculating hash again and
again when we try to find match in old tuple.
However I think it is better to go with rabin's algorithm instead of adding
optimizations based on our own assumptions, because it is difficult to
predict the real world tuple data.

>>
>> Isn't it similar to how current pglz works, basically it also
>> uses next 4 bytes to calculate index (pglz_hist_idx) but still
>> does byte by byte comparison, here if we try to map to rabin's
>> delta encoding then always chunk size is 1.
>
> I don't quite understand this. The point of the Rabin algorithm is to
> split the old tuple up into chunks and then for those chunks in the
> new tuple. For example, suppose the old tuple is
> abcdefghijklmnopqrstuvwxyz. It might get split like this: abcdef
> hijklmnopqrstuvw xyz. If any of those three chunks appear in the new
> tuple, then we'll use them for compression. If not, we'll just copy
> the literal bytes. If the chunks appear in the new tuple reordered or
> shifted or with stuff inserted between one chunk at the next, we'll
> still find them. Unless I'm confused, which is possible, what you're
> doing is essentially looking at the string and spitting it in those
> three places, but then recording the chunks as being three bytes
> shorter than they really are. I don't see how that can be right.

Today again spending some time on algorithm, I got the bug you
are pointing to and you are right in saying that chunk is shorter.
I think it should not be difficult to address this issue without affecting
most part of algorithm, let me try to handle it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message KONDO Mitsumasa 2014-01-14 06:52:17 Re: gaussian distribution pgbench
Previous Message Craig Ringer 2014-01-14 05:55:16 Re: Case sensitive mode in windows build option