Re: Performance Improvement by reducing WAL for Update Operation

From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Noah Misch'" <noah(at)leadboat(dot)com>
Cc: <hlinnakangas(at)vmware(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Performance Improvement by reducing WAL for Update Operation
Date: 2012-10-27 11:27:46
Message-ID: 004201cdb436$16fa52b0$44eef810$@kapila@huawei.com
Lists: pgsql-hackers

On Saturday, October 27, 2012 4:03 AM Noah Misch wrote:
> On Thu, Oct 25, 2012 at 05:34:14PM +0000, Amit kapila wrote:
> > I have fixed all the review comments you raised on the delta encoding
> > approach (for needs-toast, for now I have kept the code as it will not
> > go into the encoding patch; however, it can be changed).
> > I have also fixed the major comment about this patch from Heikki and
> > Tom [use memcmp to find modified columns].


> Could you elaborate on your reason for continuing to treat TOAST as a
> special case?  As best I recall, the only reason to do so before was
> the fact that TOAST can change the physical representation of a column
> even when the executor did not change its logical content.  Since
> you're no longer relying on the executor's opinion of what changed, a
> TOASTed value is not special.

My thought was that, for the initial version of the patch, leaving TOAST
as a special case would keep the impact smaller and require less testing.
However, in the new version I shall handle TOASTed values as you
suggested.

> > The patch containing the review-comment fixes for the delta encoding
> > method is attached with this mail.
>
> In my previous review, I said:
>
>   Given [not relying on the executor to know which columns changed],
>   why not treat the tuple as an opaque series of bytes and not worry
>   about datum boundaries?  When several narrow columns change together,
>   say a sequence of sixteen smallint columns, you will use fewer binary
>   delta commands by representing the change with a single 32-byte
>   substitution.  If an UPDATE changes just part of a long datum, the
>   delta encoding algorithm will still be able to save considerable
>   space.  That case arises in many forms: changing one word in a long
>   string, changing one element in a long array, changing one field of a
>   composite-typed column.  Granted, this makes the choice of delta
>   encoding algorithm more important.
>
> We may be leaving considerable savings on the table by assuming that
> column boundaries are the only modified-range boundaries worth
> recognizing.  What is your willingness to explore general algorithms
> for choosing such boundaries?  Such an investigation may, of course, be
> a dead end.

For this patch I would like to stay with the delta encoding approach
based on column boundaries.

However, I shall try the general approach separately, and if it gives
positive results I will share them with -hackers.
I will start with VCDIFF; let me know if you have any other algorithm in
mind.
The other point in favor of exploring a standard diff algorithm is that,
if it gives positive results, we could also explore a standard
compression algorithm for full-page writes to reduce WAL further.

> If you conclude that finding sub-column similarity is not worthwhile,
> at least teach your algorithm to aggregate runs of changing or
> unchanging columns into fewer delta instructions.  If a table contains
> twenty unchanging bool columns, you currently use at least 80 bytes to
> encode that fact.  By treating the run of columns as a unit for delta
> encoding purposes, you could encode it in 23 bytes.

Do you mean handling non-contiguous unchanged columns?
I believe contiguous unchanged columns are already handled, as long as
there are no alignment changes.  Example:

create table tbl(f1 int, f2 bool, f3 bool, f4 bool, f5 bool, f6 bool,
                 f7 bool, f8 bool, f9 bool, f10 bool, f11 bool,
                 f12 bool, f13 bool, f14 bool, f15 bool, f16 bool,
                 f17 bool, f18 bool, f19 bool, f20 bool, f21 bool);

insert into tbl values(10, 't','t','t','t','t','t','t','t','t','t',
                           't','t','t','t','t','t','t','t','t','t');

update tbl set f1 = 20;

For the above operation, the delta algorithm reduced the tuple data
written to WAL from 24 bytes to 12 bytes:

4 bytes - IGN command and LEN
4 bytes - ADD command and LEN
4 bytes - data block (the new f1 value)
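
To illustrate, here is a minimal sketch of how those 12 bytes could be
emitted.  The command names follow the HEAP_UPDATE_WAL_OPT_* naming from
your review, but the command values, struct layout, and the handling of
the unchanged trailing columns (not shown) are my assumptions for
illustration, not the patch's actual format:

#include <stdint.h>
#include <string.h>

#define WAL_OPT_ADD 1			/* new data follows the command inline */
#define WAL_OPT_IGN 2			/* skip this many bytes of the old tuple */

typedef struct
{
	uint16_t	command;		/* ADD / IGN / COPY */
	uint16_t	length;			/* bytes covered by the command */
} DeltaCmd;						/* 4 bytes: "command and LEN" above */

/*
 * Emit "skip the old f1, then add the 4-byte new f1" into buf.  The
 * unchanged trailing columns are presumably reconstructed separately.
 */
static int
encode_f1_update(char *buf, const char *new_f1)
{
	DeltaCmd	ign = {WAL_OPT_IGN, 4};
	DeltaCmd	add = {WAL_OPT_ADD, 4};
	int			off = 0;

	memcpy(buf + off, &ign, sizeof(ign));
	off += sizeof(ign);
	memcpy(buf + off, &add, sizeof(add));
	off += sizeof(add);
	memcpy(buf + off, new_f1, 4);		/* the data block */
	off += 4;

	return off;				/* 12 bytes = 4 + 4 + 4, as above */
}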

> The same applies for runs of changing columns.

I shall handle runs of changed columns the same way as runs of unchanged
columns.

>   Like Heikki, I'm left wondering why your custom delta encoding is
>   preferable to an encoding from the literature.  Your encoding has
>   much in common with VCDIFF, even sharing two exact command names.  If
>   a custom encoding is the right thing, code comments or a README
>   section should at least discuss the advantages over an established
>   alternative.
>
> Reflecting on those comments, I'm no longer too concerned about your
> use of a novel format to encode the delta.  For example, VCDIFF is
> designed to be flexible and self-describing; we don't need that.  But I
> would still review it for useful ideas when designing the encoding for
> PostgreSQL.
>
>   Idle thought: it might pay off to use 1-byte sizes and offsets most
>   of the time.  Tuples shorter than 256 bytes are common; for longer
>   tuples, we can afford wider offsets.
>
> I still think this deserves attention.  You currently need two bits to
> select the delta command.  Here are a few strategies to consider:
>
> 1) Store the commands in groups of eight.  Each group starts with two
> bytes representing the eight commands, followed by the corresponding
> 2-byte lengths and ADD data blocks.  The last group may end early.
>
> 2) Add extra delta commands HEAP_UPDATE_WAL_OPT_COPY16, _ADD16, _IGN16.
> Now each command specifier needs three bits.  Again store commands in
> groups of eight, this time three bytes per group header.  Expect a
> 1-byte length for the original commands and a 2-byte length for _*16
> commands.
>
> 3) Store each 2-bit command with a 14-bit length in a uint16.  We would
> need to do something grotty for --with-blocksize=32, where a 14-bit
> length is not always enough.

I shall also refer to the VCDIFF format and, based on your suggestions,
modify the encoding algorithm.
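
As a starting point, strategy (3) might look like the sketch below.  The
names and exact bit assignment are my assumptions, not settled design:

#include <stdint.h>

#define DELTA_CMD_BITS	2
#define DELTA_LEN_BITS	14
#define DELTA_LEN_MAX	((1 << DELTA_LEN_BITS) - 1)		/* 16383 */

/*
 * Pack a 2-bit command and a 14-bit length into one uint16.  The caller
 * must guarantee len <= DELTA_LEN_MAX; with --with-blocksize=32 that
 * does not always hold, which is the "grotty" case you mention.
 */
static inline uint16_t
delta_pack(uint8_t cmd, uint16_t len)
{
	return (uint16_t) (((uint16_t) cmd << DELTA_LEN_BITS) | len);
}

static inline uint8_t
delta_cmd(uint16_t packed)
{
	return (uint8_t) (packed >> DELTA_LEN_BITS);
}

static inline uint16_t
delta_len(uint16_t packed)
{
	return (uint16_t) (packed & DELTA_LEN_MAX);
}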


> I mentioned that heap_delta_{encode,decode} should have
> essentially-inverse argument lists.  That remains unchanged in this
> version.
>
> Review the comments added and moved in your patch; some have become
> obsolete.  At least one is missing detail I requested in the previous
> review.
>
> > + /*
> > +  * get_tuple_info - Gets the tuple offset and value.
> > +  *
> > +  * Calculates the attribute value and the offset at which the
> > +  * attribute ends in the tuple, based on the attribute number and
> > +  * the previously fetched attribute info.
> > +  *
> > +  * offset (I/P and O/P variable) - Input as the end offset of the
> > +  *      previous attribute; zero for the first attribute.
> > +  *      Output as the end of the current attribute in the tuple.
> > +  * usecacheoff (I/P and O/P variable) - whether the attribute
> > +  *      cacheoff can be used.
> > +  */
> > + static void
> > + get_tuple_info(Form_pg_attribute *att, HeapTuple tuple, bits8 *bp,
> > +                bool hasnulls, int attnum, Datum *value,
> > +                uint16 *offset, bool *usecacheoff)
>
> This name is too generic and not particularly tied to what the function
> does.  As far as I can see, this is a variant of heap_getattr() that
> also returns the offset and has some redundant arguments.  I suggest
> heap_getattr_with_offset() for a name and making its argument list more
> like heap_getattr().
>
> > + if (wal_offset < heaptup->t_len)
> > + {
> > +     wal_tup->t_len = wal_offset;
> > +     wal_tup->t_self = heaptup->t_self;
> > +     wal_tup->t_tableOid = heaptup->t_tableOid;
> > +
> > +     return true;
> > + }
>
> The passed-in wal_tup happens to always have MaxHeapTupleSize of
> storage.  However, heap_delta_encode() cannot verify that and does not
> document it as a precondition.  Furthermore, it makes no effort to
> avoid overrunning the buffer before finally checking the accumulated
> length here.  The encoded data could have grown well beyond
> MaxHeapTupleSize.
>
> > +
> > +     memcpy(wal_tup, heaptup, sizeof(HeapTuple));
> > +     return false;
>
> You would need heap_copytuple_with_tuple() here, but perhaps just
> specify the contents of wal_tup as undefined when heap_delta_encode()
> returns false.
>
> > + }
>
> > + #define MinHeapTupleSizeForDeltaUpdate 64
>
> This is now unused.

I shall address all of these review comments in the new version.
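
For the buffer-overrun issue in heap_delta_encode(), I am thinking of
checking the remaining space before emitting each command, roughly as
below.  This is a sketch with hypothetical names; maxlen would be the
caller-supplied buffer size, documented as a precondition:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Append len bytes to the delta buffer without overrunning it.
 * Returning false tells the caller to abandon delta encoding and fall
 * back to logging the whole tuple.
 */
static bool
delta_append(char *buf, uint32_t *used, uint32_t maxlen,
             const void *data, uint32_t len)
{
	if (len > maxlen - *used)
		return false;			/* encoded data would exceed the buffer */

	memcpy(buf + *used, data, len);
	*used += len;
	return true;
}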

> Overall, there's still plenty to do before this patch is ready.  I'm
> marking it Returned with Feedback.  I look forward to the benefits it
> will bring when finished.

Thank you for the review.

With Regards,
Amit Kapila.
