Re: heap_sync seems rather oblivious to partitioned tables (wal_level=minimal)

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: heap_sync seems rather oblivious to partitioned tables (wal_level=minimal)
Date: 2018-09-28 02:25:07
Message-ID: 499ce6cb-3235-880d-588d-886e7e12d10c@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2018/09/28 10:23, David Rowley wrote:
> On Thu, 20 Sep 2018 at 11:31, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>
>> On 2018-09-19 12:06:47 +0900, Michael Paquier wrote:
>>> On Wed, Sep 19, 2018 at 01:14:10PM +1200, David Rowley wrote:
>>>> Wouldn't it be better to modify copy.c to just perform the heap_sync
>>>> on just the partitions it touches?
>>>
>>> Yeah, my gut is telling me that this would be the best approach for now,
>>> still I am not sure that this is the best move in the long term.
>>
>> ISTM heap_sync() would be the entirely wrong layer to handle
>> partitioning. For several reasons: 1) With pluggable storage, we want to
>> have multiple different table implementations, doing the syncing on the
>> heap_* for partitions would thus be wrong. 2) In just about all cases we
>> only want to sync a few partitions, there's not really a use-case for
>> doing syncs across all partitions imo.
>
> I started looking into this and noticed that during the check where we
> set the hi_options bits to determine if we can skip WAL and FSM
> checks, we only check the table that's the target of the COPY command.
> That's fine for normal tables, but for partitioned tables, it's pretty
> wrong. We could end up skipping FSM checks for partitions that have
> concurrent DML being performed which could bloat the table. We could
> also skip writing WAL for partitions that should have had WAL written.
>
> To fix this we'd need to also know each partition that we're about to
> route tuples to was also created or truncated in the same transaction,
> but we can't know that just yet since we've not opened up any
> partitions to check yet. I think the best back-patchable fix for this
> is just to disable the optimization for partitioned tables.

Agreed. Data could go into partitions all of which may not pass the tests
used to select the value of hi_options being used to enable this
optimization. So, it's better to avoid the optimization until we find a
way to set hi_option's value per partition. Doing it upfront for all
partitions doesn't seem like a good idea as you already said, so we may
not consider patching it like that in back-branches.

> I also noticed that we support COPY FREEZE for partitioned tables and
> attempt to just check if the partitioned table was created or
> truncated in the current transaction. This is also wrong as even if
> the partitioned table had been just created, it's partitions may have
> already existed and actually be visible to other transactions. It
> seems best to just error out if anyone tries to do COPY FREEZE with a
> partitioned table as there is no option, in this case, to build
> per-partition hi_options lazily since we must error out if FREEZE
> cannot be supported before the COPY starts processing tuples.

Yeah, the same argument seems to apply here. Not all partitions might be
created in the same sub-transaction.

>>> All the other callers of heap_sync don't care about partitioned
>>> tables, so we could add an assertion on RELKIND_PARTITIONED_TABLE.
>>
>> Or rather, it should assert the expected relkinds?
>
> I think an Assert() in heap_sync is a good idea. As far as I know, it
> should only ever be used for normal tables or materialized views. So
> perhaps best to assert the rel is one of those?

Agree with Andres that we should assert relkinds that heap_sync is
designed to handle, instead of, say, asserting that no partitioned tables
are passed to it.

> I've attached a patch to assist discussion on the problem with setting
> the incorrect hi_options for partitioned tables.
>
> Comments are welcome.

Thank you for creating the patch and Tomas for reporting the bug.

Looking at the patch itself, does it seem like both the newly added
comments repeat the same point (that we'll need per-partition hi_options
to enable these optimizations) and are pretty close to each other?

Regards,
Amit

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2018-09-28 02:46:30 Re: heap_sync seems rather oblivious to partitioned tables (wal_level=minimal)
Previous Message Haribabu Kommi 2018-09-28 02:21:08 Re: Pluggable Storage - Andres's take