Re: Bug about drop index concurrently

From: 李杰(慎追) <adger(dot)lj(at)alibaba-inc(dot)com>
To: "Tomas Vondra" <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Bug about drop index concurrently
Date: 2019-10-21 02:36:04
Message-ID: a767219f-2806-4719-9428-3e13a6c909b9.adger.lj@alibaba-inc.com
Lists: pgsql-hackers

Thanks for the quick reply.
And sorry I haven't got back to you sooner.

I have seen this backtrace in our core file as well, and I also looked into the bug on the standby, where DROP INDEX CONCURRENTLY takes no lock that would block concurrent queries.

However, our workload runs a large number of queries on the standby, so this problem occurs quite frequently there. We are trying to solve it, and the workaround we are currently applying is to forbid DROP INDEX CONCURRENTLY.

Of course, we considered applying the master's approach (waiting until no query holds a lock on the index) to the standby as well, but we were worried about increasing the standby's WAL replay delay, so we gave that up.

If you come up with a better solution in the future, please push it into a new release, or post it here. Thank you very much.

regards.

adger.

------------------------------------------------------------------
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Sent: Saturday, October 19, 2019 02:00
To: 李杰(慎追) <adger(dot)lj(at)alibaba-inc(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Bug about drop index concurrently

Hi,

I can trivially reproduce this - it's enough to create a master-standby
setup, and then do this on the master

CREATE TABLE t (a int, b int);
INSERT INTO t SELECT i, i FROM generate_series(1,10000) s(i);

and run pgbench with this script

DROP INDEX CONCURRENTLY IF EXISTS t_a_idx;
CREATE INDEX t_a_idx ON t(a);

while on the standby there's another pgbench running this script

EXPLAIN ANALYZE SELECT * FROM t WHERE a = 10000;

and it fails pretty fast for me. With an extra assert(false) added to
src/backend/access/common/relation.c I get a backtrace like this:

Program terminated with signal SIGABRT, Aborted.
#0 0x00007c32e458fe35 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.29-22.fc30.x86_64
(gdb) bt
#0 0x00007c32e458fe35 in raise () from /lib64/libc.so.6
#1 0x00007c32e457a895 in abort () from /lib64/libc.so.6
#2 0x0000000000a58579 in ExceptionalCondition (conditionName=0xacd9bc "!(0)", errorType=0xacd95b "FailedAssertion", fileName=0xacd950 "relation.c", lineNumber=64) at assert.c:54
#3 0x000000000048d1bd in relation_open (relationId=38216, lockmode=1) at relation.c:64
#4 0x00000000005082e4 in index_open (relationId=38216, lockmode=1) at indexam.c:130
#5 0x000000000080ac3f in get_relation_info (root=0x21698b8, relationObjectId=16385, inhparent=false, rel=0x220ce60) at plancat.c:196
#6 0x00000000008118c6 in build_simple_rel (root=0x21698b8, relid=1, parent=0x0) at relnode.c:292
#7 0x00000000007d485d in add_base_rels_to_query (root=0x21698b8, jtnode=0x2169478) at initsplan.c:114
#8 0x00000000007d48a3 in add_base_rels_to_query (root=0x21698b8, jtnode=0x21693e0) at initsplan.c:122
#9 0x00000000007d8fad in query_planner (root=0x21698b8, qp_callback=0x7ded97 <standard_qp_callback>, qp_extra=0x7fffa4834f10) at planmain.c:168
#10 0x00000000007dc316 in grouping_planner (root=0x21698b8, inheritance_update=false, tuple_fraction=0) at planner.c:2048
#11 0x00000000007da7ca in subquery_planner (glob=0x220d078, parse=0x2168f78, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at planner.c:1012
#12 0x00000000007d942c in standard_planner (parse=0x2168f78, cursorOptions=256, boundParams=0x0) at planner.c:406
#13 0x00000000007d91e8 in planner (parse=0x2168f78, cursorOptions=256, boundParams=0x0) at planner.c:275
#14 0x00000000008e1b0d in pg_plan_query (querytree=0x2168f78, cursorOptions=256, boundParams=0x0) at postgres.c:878
#15 0x0000000000658683 in ExplainOneQuery (query=0x2168f78, cursorOptions=256, into=0x0, es=0x220cd90, queryString=0x21407b8 "explain analyze select * from t where a = 10000;", params=0x0, queryEnv=0x0) at explain.c:367
#16 0x0000000000658386 in ExplainQuery (pstate=0x220cc28, stmt=0x2141728, queryString=0x21407b8 "explain analyze select * from t where a = 10000;", params=0x0, queryEnv=0x0, dest=0x220cb90) at explain.c:255
#17 0x00000000008ea218 in standard_ProcessUtility (pstmt=0x21425c0, queryString=0x21407b8 "explain analyze select * from t where a = 10000;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x220cb90,
completionTag=0x7fffa48355c0 "") at utility.c:675
#18 0x00000000008e9a45 in ProcessUtility (pstmt=0x21425c0, queryString=0x21407b8 "explain analyze select * from t where a = 10000;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x220cb90,
completionTag=0x7fffa48355c0 "") at utility.c:360
#19 0x00000000008e8a0c in PortalRunUtility (portal=0x219c278, pstmt=0x21425c0, isTopLevel=true, setHoldSnapshot=true, dest=0x220cb90, completionTag=0x7fffa48355c0 "") at pquery.c:1175
#20 0x00000000008e871a in FillPortalStore (portal=0x219c278, isTopLevel=true) at pquery.c:1035
#21 0x00000000008e8075 in PortalRun (portal=0x219c278, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x21efb90, altdest=0x21efb90, completionTag=0x7fffa48357b0 "") at pquery.c:765
#22 0x00000000008e207c in exec_simple_query (query_string=0x21407b8 "explain analyze select * from t where a = 10000;") at postgres.c:1215
#23 0x00000000008e636e in PostgresMain (argc=1, argv=0x216c600, dbname=0x216c4e0 "test", username=0x213c3f8 "user") at postgres.c:4236
#24 0x000000000083c71e in BackendRun (port=0x2165850) at postmaster.c:4437
#25 0x000000000083beef in BackendStartup (port=0x2165850) at postmaster.c:4128
#26 0x0000000000838313 in ServerLoop () at postmaster.c:1704
#27 0x0000000000837bbf in PostmasterMain (argc=3, argv=0x213a360) at postmaster.c:1377
#28 0x0000000000759643 in main (argc=3, argv=0x213a360) at main.c:228
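For anyone who wants to try the reproduction above, the two pgbench scripts can be driven like this. This is only a sketch of my own: the file names, ports (5432 for the primary, 5433 for the standby), run duration, and client count are assumptions, not part of Tomas's setup; only the SQL and the database name "test" come from the report itself.

```shell
# Write the two pgbench scripts described above (file names are arbitrary).
cat > drop_create.sql <<'EOF'
DROP INDEX CONCURRENTLY IF EXISTS t_a_idx;
CREATE INDEX t_a_idx ON t(a);
EOF

cat > standby_query.sql <<'EOF'
EXPLAIN ANALYZE SELECT * FROM t WHERE a = 10000;
EOF

# Run both workloads only if pgbench is installed and a server is reachable
# on the assumed primary port; skip quietly otherwise.
if command -v pgbench >/dev/null 2>&1 && pg_isready -q -p 5432; then
    pgbench -n -T 60 -f drop_create.sql  -p 5432 test &   # primary: churn the index
    pgbench -n -T 60 -c 4 -f standby_query.sql -p 5433 test &  # standby: query it
    wait || true   # tolerate failures in environments without the replica setup
fi
```

pgbench's `-n` skips vacuuming its own accounting tables (which don't exist here), `-T` sets the run duration in seconds, and `-f` supplies a custom script; those are standard pgbench options.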

So my guess is the root cause is pretty simple - we close/unlock the
indexes after completing the query, but then EXPLAIN tries to open them
again when producing the explain plan.

I don't have a very good idea how to fix this, as explain has no idea
which indexes will be used by the query, etc.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
