BUG #15778: Replication slot peek query causes memory spike at the server when a big transaction is running

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: nitesh(at)datacoral(dot)co
Subject: BUG #15778: Replication slot peek query causes memory spike at the server when a big transaction is running
Date: 2019-04-24 19:42:51
Message-ID: 15778-0d88c55df57eb082@postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 15778
Logged by: Nitesh Yadav
Email address: nitesh(at)datacoral(dot)co
PostgreSQL version: 9.6.2
Operating system: x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9), 64-bit
Description:

Hi,

Postgres server setup:
The Postgres server is running as an AWS RDS instance.
The server version is PostgreSQL 9.6.2 on x86_64-pc-linux-gnu, compiled by gcc
(GCC) 4.8.3 20140911 (Red Hat 4.8.3-9), 64-bit.
In the parameter group, rds.logical_replication is set to 1, which internally
sets the following parameters: wal_level, max_wal_senders,
max_replication_slots, max_connections.
We are using the test_decoding module to read WAL data through the logical
decoding mechanism.

Application setup:
1. Periodically we run a peek query to retrieve data from the slot, e.g.:
   SELECT * FROM pg_logical_slot_peek_changes('pgldpublic_cdc_slot', NULL,
   NULL, 'include-timestamp', 'on') LIMIT 200000 OFFSET 0;
2. From the result of that query, we use the location of the last transaction
   to remove the consumed data from the slot, e.g.:
   SELECT location, xid FROM
   pg_logical_slot_get_changes('pgldpublic_cdc_slot', 'B92/C7394678',
   NULL, 'include-timestamp', 'on') LIMIT 1;
We run steps 1 and 2 in a loop, reading data in chunks of 200K records at a
time in a given process.
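The two steps above amount to a small consume loop. The following is a minimal Python sketch of that control flow (not part of the original report): MockSlot is a hypothetical in-memory stand-in for the server-side slot, and its get_changes models the intended effect of step 2, namely advancing the slot past the last peeked location.

```python
# Hypothetical in-memory stand-in for a logical replication slot, so the
# peek/get loop can be shown without a live server.
class MockSlot:
    def __init__(self, rows):
        self.rows = list(rows)              # (location, xid, data) tuples

    def peek_changes(self, limit):
        """Like pg_logical_slot_peek_changes ... LIMIT n: read without consuming."""
        return self.rows[:limit]

    def get_changes(self, upto_lsn):
        """Models step 2's intent: consume everything up to and including upto_lsn."""
        self.rows = [r for r in self.rows if r[0] > upto_lsn]


def drain(slot, chunk=200_000):
    """Run steps 1 and 2 in a loop, one chunk of records at a time."""
    out = []
    while True:
        batch = slot.peek_changes(chunk)            # step 1: peek a chunk
        if not batch:
            break
        out.extend(batch)
        slot.get_changes(upto_lsn=batch[-1][0])     # step 2: consume that chunk
    return out
```

Sketch only: real locations are LSN strings, and the actual consumption semantics of pg_logical_slot_get_changes with an upto_lsn bound are what the mock merely approximates.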

------------------------------------------------------------------------------
Behavior reported (Bug):
------------------------------------------------------------------------------

Client-side timeline:
------------------------------------------------------------------------------
We have a process with PID 10032, which started at 2019-04-24T06:47:01.877Z.
At 2019-04-24T06:47:01.893Z it invoked the following peek query:
SELECT * FROM pg_logical_slot_peek_changes('silo2_cdc_slot', NULL, NULL,
'include-timestamp', 'on') LIMIT 500000 OFFSET 0;
Up to 2019-04-24T07:01:41.852Z there was no response from the server, and we
then killed the process.

Server-side timeline:
------------------------------------------------------------------------------
A large transaction had been running on the server since 2019-04-24T06:00:00Z.

At 2019-04-24T06:45:38Z the slot size was around 2 GB; at that moment the
query took around 20 seconds to respond.
At 2019-04-24T06:47:00Z the memory usage of the pg server process with
PID 10032 started growing drastically.
At 2019-04-24T06:54:00Z the memory usage of that process reached 20 GB.
At 2019-04-24T06:58:00Z it reached 41 GB.
At 2019-04-24T07:05:40Z it stayed around 40.98 GB.
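For reference, a PostgreSQL LSN such as FFB/F8F0F3E8 is a pair of hex numbers (high 32 bits / low 32 bits of a 64-bit position), so the amount of WAL between two LSNs is a plain integer difference. A minimal Python sketch (not part of the original report), using the two LSNs that appear in the server logs below (restart point FFB/F8F0F3E8, confirmed point 1003/9B28D200):

```python
def lsn_to_int(lsn: str) -> int:
    """Parse an LSN of the form 'hi/lo' (both hex) into a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def lsn_diff(a: str, b: str) -> int:
    """Bytes of WAL between LSN b and the later LSN a."""
    return lsn_to_int(a) - lsn_to_int(b)

# WAL the decoder must re-read for the 06:47:01 peek (values taken from
# the "streaming transactions committing after ... reading WAL from ..."
# log lines below).
backlog = lsn_diff("1003/9B28D200", "FFB/F8F0F3E8")
print(f"{backlog / 2**30:.1f} GiB of WAL")
```

This puts the decoding backlog at roughly 30 GiB of WAL, which is in the same ballpark as the ~41 GB of memory reported above.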

The following are the anonymized pg server logs:
------------------------------------------------------------------------------
2019-04-24 06:45:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc(at)xyz:[6988]:LOG:
connection authorized: user=abc database=xyz SSL enabled (protocol=TLSv1.2,
cipher=XXXXX-YYY-AES256-GCM-ABC123, compression=off)
2019-04-24 06:45:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(22837):abc(at)xyz:[6989]:LOG:
connection authorized: user=abc database=xyz SSL enabled (protocol=TLSv1.2,
cipher=XXXXX-YYY-AES256-GCM-ABC123, compression=off)
2019-04-24 06:45:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc(at)xyz:[6988]:LOG:
starting logical decoding for slot "pg_slot"
2019-04-24 06:45:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc(at)xyz:[6988]:DETAIL:
streaming transactions committing after 1003/499E5518, reading WAL from
FFB/F8F0F3E8
2019-04-24 06:45:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc(at)xyz:[6988]:STATEMENT:
SELECT * FROM pg_logical_slot_peek_changes('pg_slot', NULL, NULL,
'include-timestamp', 'on') LIMIT 500000 OFFSET 0;
2019-04-24 06:45:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc(at)xyz:[6988]:LOG:
logical decoding found consistent point at FFB/F8F0F3E8
2019-04-24 06:45:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc(at)xyz:[6988]:DETAIL:
Logical decoding will begin using saved snapshot.
2019-04-24 06:45:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc(at)xyz:[6988]:STATEMENT:
SELECT * FROM pg_logical_slot_peek_changes('pg_slot', NULL, NULL,
'include-timestamp', 'on') LIMIT 500000 OFFSET 0;
2019-04-24 06:46:00
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc(at)xyz:[6988]:LOG:
duration: 22294.217 ms execute <unnamed>: SELECT * FROM
pg_logical_slot_peek_changes('pg_slot', NULL, NULL, 'include-timestamp',
'on') LIMIT 500000 OFFSET 0;
2019-04-24 06:46:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc(at)xyz:[6988]:LOG:
starting logical decoding for slot "pg_slot"
2019-04-24 06:46:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc(at)xyz:[6988]:DETAIL:
streaming transactions committing after 1003/499E5518, reading WAL from
FFB/F8F0F3E8
2019-04-24 06:46:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc(at)xyz:[6988]:STATEMENT:
SELECT xid FROM pg_logical_slot_get_changes('pg_slot', '1003/9B28D200',
NULL, 'include-timestamp', 'on') LIMIT 1;
2019-04-24 06:46:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc(at)xyz:[6988]:LOG:
logical decoding found consistent point at FFB/F8F0F3E8
2019-04-24 06:46:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc(at)xyz:[6988]:DETAIL:
Logical decoding will begin using saved snapshot.
2019-04-24 06:46:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc(at)xyz:[6988]:STATEMENT:
SELECT xid FROM pg_logical_slot_get_changes('pg_slot', '1003/9B28D200',
NULL, 'include-timestamp', 'on') LIMIT 1;
2019-04-24 06:47:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc(at)xyz:[6988]:LOG:
duration: 23156.543 ms statement: SELECT xid FROM
pg_logical_slot_get_changes('pg_slot', '1003/9B28D200', NULL,
'include-timestamp', 'on') LIMIT 1;
2019-04-24 06:47:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(10137):abc(at)xyz:[10032]:LOG:
connection authorized: user=abc database=xyz SSL enabled (protocol=TLSv1.2,
cipher=XXXXX-YYY-AES256-GCM-ABC123, compression=off)
2019-04-24 06:47:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(46673):abc(at)xyz:[10034]:LOG:
connection authorized: user=abc database=xyz SSL enabled (protocol=TLSv1.2,
cipher=XXXXX-YYY-AES256-GCM-ABC123, compression=off)
2019-04-24 06:47:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(10137):abc(at)xyz:[10032]:LOG:
starting logical decoding for slot "pg_slot"
2019-04-24 06:47:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(10137):abc(at)xyz:[10032]:DETAIL:
streaming transactions committing after 1003/9B28D200, reading WAL from
FFB/F8F0F3E8
2019-04-24 06:47:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(10137):abc(at)xyz:[10032]:STATEMENT:
SELECT * FROM pg_logical_slot_peek_changes('pg_slot', NULL, NULL,
'include-timestamp', 'on') LIMIT 500000 OFFSET 0;
2019-04-24 06:47:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(10137):abc(at)xyz:[10032]:LOG:
logical decoding found consistent point at FFB/F8F0F3E8
2019-04-24 06:47:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(10137):abc(at)xyz:[10032]:DETAIL:
Logical decoding will begin using saved snapshot.
2019-04-24 06:47:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(10137):abc(at)xyz:[10032]:STATEMENT:
SELECT * FROM pg_logical_slot_peek_changes('pg_slot', NULL, NULL,
'include-timestamp', 'on') LIMIT 500000 OFFSET 0;
2019-04-24 06:50:14
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(60004):abc(at)xyz:[18122]:LOG:
connection authorized: user=abc database=xyz SSL enabled (protocol=TLSv1.2,
cipher=XXXXX-YYY-AES256-GCM-ABC123, compression=off)
2019-04-24 06:55:14
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(29409):abc(at)xyz:[31637]:LOG:
connection authorized: user=abc database=xyz SSL enabled (protocol=TLSv1.2,
cipher=XXXXX-YYY-AES256-GCM-ABC123, compression=off)
2019-04-24 07:00:14
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(43301):abc(at)xyz:[5247]:LOG:
connection authorized: user=abc database=xyz SSL enabled (protocol=TLSv1.2,
cipher=XXXXX-YYY-AES256-GCM-ABC123, compression=off)
2019-04-24 07:02:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(46673):abc(at)xyz:[10034]:LOG:
could not receive data from client: Connection reset by peer
2019-04-24 07:05:14
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(59862):abc(at)xyz:[10813]:LOG:
connection authorized: user=abc database=xyz SSL enabled (protocol=TLSv1.2,
cipher=XXXXX-YYY-AES256-GCM-ABC123, compression=off)
2019-04-24 07:10:14
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(15221):abc(at)xyz:[16183]:LOG:
connection authorized: user=abc database=xyz SSL enabled (protocol=TLSv1.2,
cipher=XXXXX-YYY-AES256-GCM-ABC123, compression=off)

We have the following questions:
------------------------------------------------------------------------------
1) What causes the memory usage spike at the server? Is it the peek query, or
the write of the big transaction running on the server?
2) Why did the peek query take so much time at 2019-04-24T06:47:00Z, whereas
it finished within about 20 s when last run at 2019-04-24T06:45:38.516Z?
3) How can we restrict/limit the memory usage of peek queries on the server
side? Are there configuration parameters we can use?

Regards,
Nitesh
