Re: [HACKERS] UPDATE of partition key

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Subject: Re: [HACKERS] UPDATE of partition key
Date: 2018-01-12 10:12:29
Message-ID: CAJ3gD9c9w3SX02WgFtA8SES=Le8XPBZ+a4bNDheYsaY_pQ3RxA@mail.gmail.com
Lists: pgsql-hackers

On 12 January 2018 at 01:18, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Jan 11, 2018 at 6:07 AM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
>> In the first paragraph of my explanation, I was explaining why two
>> Transition capture states does not look like a good idea to me :
>
> Oh, sorry. I didn't read what you wrote carefully enough, I guess.
>
> I see your points. I think that there is probably a general need for
> some refactoring here. AfterTriggerSaveEvent() got significantly more
> complicated and harder to understand with the arrival of transition
> tables, and this patch is adding more complexity still. It's also
> adding complexity in other places to make ExecInsert() and
> ExecDelete() usable for the semi-internal DELETE/INSERT operations
> being produced when we split a partition key update into a DELETE and
> INSERT pair. It would be awfully nice to have some better way to
> separate out each of the different things we might or might not want
> to do depending on the situation: capture old tuple, capture new
> tuple, fire before triggers, fire after triggers, count processed
> rows, set command tag, perform actual heap operation, update indexes,
> etc. However, I don't have a specific idea how to do it better, so
> maybe we should just get this committed for now and perhaps, with more
> eyes on the code, someone will have a good idea.
>
>> Slight correction; it was suggested by Amit Langote; not by David.
>
> Oh, OK, sorry.
>
>> So there are two independent optimizations we are talking about :
>>
>> 1. Create the map only when needed.
>> 2. In case of UPDATE, for partitions that take part in update scans,
>> there should be a single map; there should not be two separate maps,
>> one for accessing per-subplan and the other for accessing per-leaf.
>
> These optimizations aren't completely independent. Optimization #2
> can be implemented in several different ways. The way you've chosen
> to do it is to index the same array in two different ways depending on
> whether or not per-leaf indexing is needed, which I think is
> unacceptable. Another approach, which I proposed upthread, is to
> always build the per-leaf mapping, but you pointed out that this could
> involve doing a lot of unnecessary work in the case where most leaves
> were pruned. However, if you also implement #1, then that problem
> goes away. In other words, depending on the design you choose for #2,
> you may or may not need to also implement optimization #1 to get good
> performance.
>
> To put that another way, I think Amit's idea of keeping a
> subplan-offsets array is a pretty good one. From your comments, you
> do too. But if we want to keep that, then we need a way to avoid the
> expense of populating it for leaves that got pruned, except when we
> are doing update row movement. Otherwise, I don't see much choice but
> to jettison the subplan-offsets array and just maintain two separate
> arrays of mappings.

Ok. So after giving more thought to both our points, here's what I feel
we can do ...

With the two arrays mt_per_leaf_tupconv_maps and
mt_per_subplan_tupconv_maps, we want the following things :
1. Create the maps on demand.
2. If possible, share the maps between the per-subplan and per-leaf
arrays.

For this, option 1 is :

-------

Both arrays' elements are of this structure :

typedef struct TupleConversionMapInfo
{
    uint8       map_required;   /* 0 : not yet known whether map is required */
                                /* 1 : map is created/required */
                                /* 2 : map is not necessary */
    TupleConversionMap *map;
} TupleConversionMapInfo;

Arrays look like this :
TupleConversionMapInfo mt_per_subplan_tupconv_maps[];
TupleConversionMapInfo mt_per_leaf_tupconv_maps[];

When the per-subplan array is to be accessed at index i, a macro
get_tupconv_map(mt_per_subplan_tupconv_maps, i, forleaf=false) will be
called. This will create a new map if necessary, populate the array
element's fields, and also copy this info into the corresponding
element of the per-leaf array; to locate that per-leaf element, we need
a subplan-offsets array. Conversely, if the per-leaf array element is
already populated, its info will be copied into the per-subplan element
in the opposite direction.

When the per-leaf array is to be accessed at index i,
get_tupconv_map(mt_per_leaf_tupconv_maps, i, forleaf=true) will be
called. It will similarly update the per-leaf array element, but it
will not try to access the corresponding per-subplan element, because
we have no offsets array in that direction.

This is what the macro will look like :

#define get_tupconv_map(mapinfo, i, perleaf) \
    ((mapinfo[i].map_required == 2) ? NULL : \
     ((mapinfo[i].map_required == 1) ? mapinfo[i].map : \
      create_new_map(mapinfo, i, perleaf)))

where create_new_map() will take care of populating the array element
in both arrays, and then return the map if one was created, or NULL if
no map is required.

-------

Option 2 :

Elements of both arrays are pointers to a TupleConversionMapInfo structure.
Arrays look like this :
TupleConversionMapInfo *mt_per_subplan_tupconv_maps[];
TupleConversionMapInfo *mt_per_leaf_tupconv_maps[];

typedef struct TupleConversionMapInfo
{
    uint8       map_required;   /* 0 : map is not required, 1 : ... */
    TupleConversionMap *map;
} TupleConversionMapInfo;

So in ExecInitModifyTable(), for each array element of both arrays, we
palloc a TupleConversionMapInfo structure, and wherever applicable the
same palloc'ed structure is shared between the two arrays. This way, a
subplan-offsets array is not required.

In this case, the macro get_tupconv_map() similarly populates the
structure, but it does not have to access the other array, because the
structures are already shared between the two arrays.
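
For illustration only (again, not code from the patch), the allocation
in ExecInitModifyTable() could look roughly like this, where
subplan_index_for_leaf() is a hypothetical helper returning the subplan
index owning a given leaf partition (or -1 if that leaf has no UPDATE
subplan) and num_leaf_parts is the number of leaf partitions :

for (i = 0; i < num_leaf_parts; i++)
{
    TupleConversionMapInfo *info = (TupleConversionMapInfo *)
        palloc0(sizeof(TupleConversionMapInfo));    /* map_required == 0 */
    int         subplan_index = subplan_index_for_leaf(mtstate, i);

    mtstate->mt_per_leaf_tupconv_maps[i] = info;

    /*
     * If this leaf also participates in the UPDATE scans, make the
     * per-subplan array point at the very same structure, so whichever
     * access path builds the map first, the other one sees it.
     */
    if (subplan_index >= 0)
        mtstate->mt_per_subplan_tupconv_maps[subplan_index] = info;
}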

The problem with this option is that, since some of the structures
pointed to by the array elements have to be shared, the two arrays have
to be built together; but in the code the arrays are allocated on
demand at different points, e.g. when update tuple routing is required
versus when transition tables are required. Also, we have to palloc a
TupleConversionMapInfo individually for each array element beforehand,
as opposed to allocating memory for the whole array with a single
palloc as in option 1.

As of this writing, I am working on the code that adds the on-demand
logic, and I anticipate option 1 will turn out better than option 2.
But I would like to know whether you are ok with both of these options.

------------

The reason I am keeping the map_required field inside a structure along
with the map, rather than in a separate array, is so that we can do the
on-demand allocation for both the per-leaf and the per-subplan arrays.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
