Re: WIP: Avoid creation of the free space map for small tables

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
Cc: Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Avoid creation of the free space map for small tables
Date: 2019-02-02 12:03:13
Message-ID: CAA4eK1LYUtHkVknVaKQDF4OjP_LzrbLfTyAG-=5PC_0xn8JiDQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Feb 2, 2019 at 7:30 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Jan 28, 2019 at 4:40 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Mon, Jan 28, 2019 at 10:03 AM John Naylor
> > <john(dot)naylor(at)2ndquadrant(dot)com> wrote:
>
> In the past few days, we have done a further analysis of each problem
> and tried to reproduce it. We are successful in generating some form
> of reproducer for 3 out of 4 problems in the same way as it was failed
> in the buildfarm. For the fourth symptom, we have tried a lot (even
> Andrew Dunstan has helped us to run the regression tests with the
> faulty commit on Jacana for many hours, but it didn't got reproduced)
> but not able to regenerate a failure in a similar way. However, I
> have a theory as mentioned below why the particular test could fail
> and the fix for the same is done in the patch. I am planning to push
> the latest version of the patch [1] which has fixes for all the
> symptoms.
>

Today, I have spent some more time to generate a test which can
reproduce the failure though it is with the help of breakpoints. See
below:

> >
> > 4. Failure on jacana:
> > --- c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/box.out
> > 2018-09-26
> > 17:53:33 -0400
> > +++ c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/results/box.out
> > 2019-01-27 23:14:35
> > -0500
> > @@ -252,332 +252,7 @@
> > ('(0,100)(0,infinity)'),
> > ('(-infinity,0)(0,infinity)'),
> > ('(-infinity,-infinity)(infinity,infinity)');
> > -SET enable_seqscan = false;
> > -SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)';
> > ..
> > ..
> > TRAP: FailedAssertion("!(!(fsm_local_map.nblocks > 0))", File:
> > "c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/backend/storage/freespace/freespace.c",
> > Line:
> > 1118)
> > ..
> > 2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:4] LOG: server process
> > (PID 14388) exited with exit code 3
> > 2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:5] DETAIL: Failed process
> > was running: INSERT INTO box_temp
> > VALUES (NULL),
> >
> > I think the reason for this failure is same as previous (as mentioned
> > in point-3), but this can happen in a different way. Say, we have
> > searched the local map and then try to extend a relation 'X' and in
> > the meantime, another backend has extended such that it creates FSM.
> > Now, we will reuse that page and won't clear local map. Now, say we
> > try to insert in relation 'Y' which doesn't have FSM. It will try to
> > set the local map and will find that it already exists, so will fail.
> > Now, the question is how it can happen in this box.sql test. I guess
> > that is happening for some system table which is being populated by
> > Create Index statement executed just before the failing Insert.
> >

Based on the above theory, the test is as below:

Session-1
---------------
postgres=# create table test_1(c1 int, c2 char(1500));
CREATE TABLE
postgres=# create table test_2(c1 int, c2 char(1500));
CREATE TABLE
postgres=# insert into test_1 values(generate_series(1,20),'aaaa');
INSERT 0 20
postgres=# insert into test_2 values(1,'aaaa');
INSERT 0 1

Session-2
----------------
postgres=# analyze test_1;
ANALYZE
postgres=# analyze test_2;
ANALYZE
postgres=# select oid, relname, relpages from pg_class where relname
like 'test%';
oid | relname | relpages
-------+----------------+----------
41835 | test_1 | 4
41838 | test_2 | 1

Till here we can see that test_1 has 4 pages and test2_1 has 1 page.
At this stage, even one more record insertion in test_1 will create a
new page. So, now we have to hit the scenario with the help of
debugger. For session-1, attach debugger and put breakpoint in
RelationGetBufferForTuple().

Session-1
----------------
postgres=# insert into test_1 values(21,'aaaa');

It will hit the breakpoint in RelationGetBufferForTuple(). Now, add
one more breakpoint on line 542 in hio.c, aka below line:
RelationGetBufferForTuple
{
..
else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))

Press continue and stop the debugger at line 542. Attach the debugger
for session-2 and add a breakpoint on line 580 in hio.c, aka below
line:
RelationGetBufferForTuple
{
..
buffer = ReadBufferBI(relation, P_NEW, bistate);

Session-2
---------------
postgres=# insert into test_1 values(22,'aaaa');

It will hit the breakpoint in RelationGetBufferForTuple(). Now
proceed with debugging on session-1 by one step. This is to ensure
that session-1 doesn't get a conditional lock. Now, continue the
debugger in session-2 and after that run vacuum test_1 from session-2,
this will ensure that FSM is created for relation test_1. So the
state of session-2 will be as below:

Session-2
----------------
postgres=# insert into test_1 values(22,'aaaa');
INSERT 0 1
postgres=# vacuum test_1;
VACUUM
postgres=# select oid, relname, relpages from pg_class where relname
like 'test%';
oid | relname | relpages
-------+----------------+----------
41835 | test_1 | 5
41838 | test_2 | 1

Continue the debugger in session-1. Now insert one row in test_2 from
session-1 and kaboom:

Session-1
---------------
postgres=# insert into test_2 values(2,'aaaa');
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Server logs is as below:
TRAP: FailedAssertion("!(!(fsm_local_map.nblocks > 0))", File:
"..\postgresql\src\backend\storage\freespace\freespace.c",
Line: 1118)
2019-02-02 16:48:06.216 IST [4044] LOG: server process (PID 3540)
exited with exit code 3
2019-02-02 16:48:06.216 IST [4044] DETAIL: Failed process was
running: insert into test_2 values(2,'aaaa');

This looks exactly the same as the failure in Jacana.

The patch fixes this case.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message rajan 2019-02-02 12:27:50 Able to do ALTER DEFAULT PRIVILEGES from a user who is not the owner
Previous Message Graham Leggett 2019-02-02 11:34:56 Re: DNS SRV support for LDAP authentication