Re: User defined data types in Logical Replication

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Huong Dangminh <huo-dangminh(at)ys(dot)jp(dot)nec(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Hiroshi Yanagisawa <hir-yanagisawa(at)ut(dot)jp(dot)nec(dot)com>
Subject: Re: User defined data types in Logical Replication
Date: 2017-11-16 08:30:02
Message-ID: CAD21AoAzy2=Ca6H1dpm9_LzCap4zCckWYSUBN=S4kcMNEfzJRw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 15, 2017 at 7:55 PM, Huong Dangminh
<huo-dangminh(at)ys(dot)jp(dot)nec(dot)com> wrote:
> Hi,
>
>> We are getting the below error while trying to use Logical Replication with
>> user defined data types in a C program (when calling the elog function).
>>
>> ERROR: XX000: cache lookup failed for type XXXXX
>>
>
> Sorry for continuously disturbing in this topic, but am I missing something here?

No, but I'd suggest providing a reproduction procedure if possible,
which would help with the investigation.

> I mean that if the type's OID on the PUBLICATION host does not exist in the SUBSCRIPTION host's pg_type,
> it could return an unintended error (the XX000 above) when elog or ereport is executed.
>
> In more detail, it happens in slot_store_error_callback when it tries to call format_type_be(localtypoid) for errcontext.
> slot_store_error_callback is set in the slot_store_cstrings and slot_modify_cstrings functions and is also unset there, so the effect is small, but it does happen.
>

I think I found the cause of this issue, and it is a bug. It can be
reproduced, for example, if the input function of the data type calls
elog() during apply in an environment where the OIDs of the data type
on the publisher and the subscriber are different. The cause is that
we call format_type_be() with remotetypoid. If the OIDs of the data
type differ between publisher and subscriber, we search the syscache
by an OID that doesn't exist on the subscriber.
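
To illustrate the failure mode, here is a toy model in plain C. It is
not PostgreSQL code: TypeRow, lookup_type_name() and the OID values for
"mytype" are made up for the example, and a NULL return merely stands
in for the "cache lookup failed for type" error.

```c
#include <stddef.h>

/* Toy stand-in for one row of a node's pg_type catalog (hypothetical). */
typedef struct TypeRow
{
    unsigned int oid;
    const char  *name;
} TypeRow;

/*
 * Hypothetical helper: find a type name by OID in one node's catalog.
 * Returning NULL models "cache lookup failed for type NNN".
 */
static const char *
lookup_type_name(const TypeRow *catalog, size_t ntypes, unsigned int oid)
{
    for (size_t i = 0; i < ntypes; i++)
    {
        if (catalog[i].oid == oid)
            return catalog[i].name;
    }
    return NULL;
}

/*
 * The same user-defined type was created independently on each node,
 * so it got a different OID on each (values are illustrative only).
 */
static const TypeRow publisher_types[]  = {{25, "text"}, {16390, "mytype"}};
static const TypeRow subscriber_types[] = {{25, "text"}, {16421, "mytype"}};
```

Looking up the publisher's OID 16390 in subscriber_types finds nothing,
while the same OID resolves fine in publisher_types. That is the
situation the error callback runs into.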

As for your patch, I don't think this direction is good. Since the
subscriber already has a LogicalRepTyp cache entry for the type, we
can report the error message using the data type name. So I think this
issue can be fixed by using the remote type name obtained from the
cache.

Also, I'm confused about the errcontext message; currently we store
the local data type OID corresponding to the remote data type name in
the cache, and then we look up the local data type name by the local
data type OID stored in the cache. This means that the local data type
OID and the remote data type OID always refer to the same data type.
We use both data type OIDs for the log message in
slot_store_error_callback, but I think what the function wants to do
is show the different type names when the table definitions on the two
servers differ (e.g. sending jsonb column data to a text column). I
think we should use the type of the local relation's attribute rather
than the remote one.
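
A rough sketch of that idea, again as a toy model in plain C rather
than the attached patch: the remote type name comes from a cache keyed
by the remote OID, and the local type name is whatever the local
column is declared as. ToyRemoteType, toy_remote_type_name(),
toy_error_context(), the message text and the OID values are all
assumptions made for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical cache entry: remote type OID -> name sent by the publisher. */
typedef struct ToyRemoteType
{
    unsigned int remote_oid;
    const char  *typname;
} ToyRemoteType;

/* Illustrative cache contents; OID values are examples only. */
static const ToyRemoteType toy_remote_types[] = {
    {3802, "jsonb"},
    {16390, "mytype"},
};

/* Resolve the remote type name from the cache, never from the local catalog. */
static const char *
toy_remote_type_name(const ToyRemoteType *cache, size_t n,
                     unsigned int remote_oid)
{
    for (size_t i = 0; i < n; i++)
    {
        if (cache[i].remote_oid == remote_oid)
            return cache[i].typname;
    }
    return NULL;
}

/*
 * Build the context line. local_typname is the type of the local
 * relation's attribute, passed in by the caller; a missing remote
 * name is reported as "unknown" instead of raising a second error.
 */
static int
toy_error_context(char *buf, size_t buflen,
                  const char *relname, const char *attname,
                  const char *remote_typname, const char *local_typname)
{
    return snprintf(buf, buflen,
                    "processing remote data for relation \"%s\" column \"%s\", "
                    "remote type %s, local type %s",
                    relname, attname,
                    remote_typname ? remote_typname : "unknown",
                    local_typname);
}
```

With this shape, a jsonb column replicated into a text column would be
reported with two different type names, and no lookup by a remote OID
is ever made against the subscriber's own catalog.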

The attached draft patch fixes this issue, at least in my environment.
Please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
fix_slot_store_error_callback.patch application/octet-stream 5.0 KB
