Re: SUBSCRIPTION command hangs up, segfault occurs in the server process and to cancel the execution.

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: nuko yokohama <nuko(dot)yokohama(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SUBSCRIPTION command hangs up, segfault occurs in the server process and to cancel the execution.
Date: 2017-02-21 18:19:24
Message-ID: CAHGQGwH=DaJXXd7gQkFA9XTzdY9jLhs3xt2-aYf=jHMhFZsL7w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 22, 2017 at 1:54 AM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> On Tue, Feb 21, 2017 at 7:11 AM, nuko yokohama <nuko(dot)yokohama(at)gmail(dot)com> wrote:
>> Hi.
>> While verifying the logical replication of PostgreSQL 10-devel, I found the
>> following problem.

Thanks for the problem report! This problem was also reported before.
https://www.postgresql.org/message-id/A1E9CB90-1FAC-4CAD-8DBA-9AA62A6E97C5@postgrespro.ru

>> * When you run the SUBSCRIPTION command against a table in the same
>> PostgreSQL server, hang up.
>> * Canceling the hung SUBSCRIPTION command with CTRL + C causes a server
>> process occurs Segfault, and the PostgreSQL server to restart.
>>
>> --------
>> [Steps to Reproduce]
>>
>> $ createdb db1 -U postgres
>> $ createdb db2 -U postgres
>> $ psql -U postgres db1 -c "CREATE TABLE test (id int primary key, data
>> text)"
>> $ psql -U postgres db2 -c "CREATE TABLE test (id int primary key, data
>> text)"
>> $ psql -U postgres db1 -c "CREATE PUBLICATION pub_db1_test FOR TABLE test"
>> $ psql -U postgres db2 -c "CREATE SUBSCRIPTION sub_db2_test CONNECTION
>> 'dbname=db1 port=5432 user=postgres' PUBLICATION pub_db1_test"
>>
>> The SUBSCRIPTION command does not end, it hangs up.
>> At this time, the following logs are output in the server log.
>>
>> 2017-02-21 06:58:05.082 JST [22060] LOG: logical decoding found initial
>> starting point at 0/1C06660
>> 2017-02-21 06:58:05.082 JST [22060] DETAIL: 1 transaction needs to finish.
>>
>> Suspending psql (running SUBSCRIPTION) with CTRL + C causes a Segfault in
>> the server process.
>> At this time, the following message is output to the server log.
>>
>> 2017-02-21 07:01:00.246 JST [22059] ERROR: canceling statement due to user
>> request
>> 2017-02-21 07:01:00.246 JST [22059] STATEMENT: CREATE SUBSCRIPTION
>> sub_db2_test CONNECTION 'dbname=db1 port=5432 user=postgres' PUBLICATION
>> pub_db1_test
>> 2017-02-21 07:01:01.006 JST [21445] LOG: server process (PID 22060) was
>> terminated by signal 11: Segmentation fault
>> 2017-02-21 07:01:01.007 JST [21445] LOG: terminating any other active
>> server processes
>>
>> --------
>> [Environment]
>>
>> postgres=# SELECT version();
>> version
>> ------------------------------------------------------------------------------------------------------------
>> PostgreSQL 10devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.2
>> 20140120 (Red Hat 4.8.2-16), 64-bit
>> (1 row)
>>
>> postgres=# SHOW wal_level;
>> wal_level
>> -----------
>> logical
>> (1 row)
>>
>> postgres=# SHOW max_wal_senders;;
>> max_wal_senders
>> -----------------
>> 10
>> (1 row)
>>
>> postgres=# SHOW max_replication_slots;
>> max_replication_slots
>> -----------------------
>> 10
>> (1 row)
>>
>> postgres=# TABLE pg_hba_file_rules ;
>> line_number | type | database | user_name | address |
>> netmask | auth_method | options | error
>> -------------+-------+---------------+------------+-----------+-----------------------------------------+-------------+---------+-------
>> 2 | local | {all} | {all} | |
>> | trust | |
>> 4 | host | {all} | {all} | 127.0.0.1 |
>> 255.255.255.255 | trust | |
>> 6 | host | {all} | {all} | ::1 |
>> ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff | trust | |
>> 9 | local | {replication} | {postgres} | |
>> | trust | |
>> (4 rows)
>>
>>
>
> As log message says, the reason why CREATE SUBSCRIPTION hangs up is
> that it's waiting for another transaction finish.
>
> Regarding SEGV, it occurs by wal sender process creating replication
> slot. When SEV occurs the backtrace is as follows.
>
> (gdb) bt
> #0 0x00000000006e3d47 in resetStringInfo (str=0xea1c60) at stringinfo.c:96
> #1 0x00000000007e4738 in ProcessRepliesIfAny () at walsender.c:1347
> #2 0x00000000007e42c5 in WalSndWaitForWal (loc=23400264) at walsender.c:1142
> #3 0x00000000007e38f6 in logical_read_xlog_page (state=0x2732008,
> targetPagePtr=23396352, reqLen=3912, targetRecPtr=23400240,
> cur_page=0x2733d58 "\225\320\005", pageTLI=0x27328b4) at
> walsender.c:717
> #4 0x00000000005458c8 in ReadPageInternal (state=0x2732008,
> pageptr=23396352, reqLen=3912) at xlogreader.c:556
> #5 0x0000000000545068 in XLogReadRecord (state=0x2732008,
> RecPtr=23400240, errormsg=0x7fff7cf34cf8) at xlogreader.c:255
> #6 0x00000000007d09db in DecodingContextFindStartpoint
> (ctx=0x2731d48) at logical.c:432
> #7 0x00000000007e3a87 in CreateReplicationSlot (cmd=0x26edfa0) at
> walsender.c:796
> #8 0x00000000007e45ae in exec_replication_command
> (cmd_string=0x268b5f8 "CREATE_REPLICATION_SLOT \"sub_db2_test\"
> LOGICAL pgoutput") at walsender.c:1272
> #9 0x0000000000846783 in PostgresMain (argc=1, argv=0x2697f38,
> dbname=0x2697e68 "db1", username=0x2697e40 "masahiko") at
> postgres.c:4064
> #10 0x00000000007b34e0 in BackendRun (port=0x268f4b0) at postmaster.c:4310
> #11 0x00000000007b2c60 in BackendStartup (port=0x268f4b0) at postmaster.c:3982
> #12 0x00000000007af2e0 in ServerLoop () at postmaster.c:1722
> #13 0x00000000007ae9a4 in PostmasterMain (argc=5, argv=0x266c020) at
> postmaster.c:1330
> #14 0x00000000006f487a in main (argc=5, argv=0x266c020) at main.c:228
>
> I guess that the cause of SEGV is that reply_message is reset without
> initialize by initStringInfo. The reply_message and other buffers are
> initialized at beggning of WalSndLoop but resetStringInfo can be
> called by WalSndWaitForWal->ProcessRepliesIfAny without calling
> WalSndLoop when creating replication slot. To fix it, we can
> initialize reply_message in exec_replication_command like follows.

Yes. I pushed the patch which makes walsender always initialize
all three buffers.

Regards,

--
Fujii Masao

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2017-02-21 18:44:27 Re: Resolved typo in a comment
Previous Message Fujii Masao 2017-02-21 18:16:28 Re: Walsender crash