Bug about drop index concurrently

From: 李杰(慎追) <adger(dot)lj(at)alibaba-inc(dot)com>
To: "pgsql-hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Bug about drop index concurrently
Date: 2019-10-16 08:34:31
Message-ID: cd1b1f74-b3ba-4df7-9355-9df2057b7724.adger.lj@alibaba-inc.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hi hackers,

In recently, I discovered a postgres bug, and I hope I can ask you for the best solution.
The problem is as follows:

postgres=# explain analyze select * from xxx where a=500;
ERROR: could not open relation with OID 25989
The structure of my table is as follows:
postgres=# \d xxx
Table "public.xxx"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
a | integer | | |
b | text | | |

postgres=# select count(*) from xxx;
count
--------
800000
(1 row)

postgres=# select * from xxx limit 3;

a | b
---+----------------------------------
1 | 203c51477570aa517cfa317155dcc52c
2 | b58da31baa5c78fee4642fd398bd5909
3 | c7c475bf0a3ca2fc2afc4812a4d44c58

I opened the log file and saw that the index of table xxx was deleted,

postgres=# drop index CONCURRENTLY idx_xxx ;
DROP INDEX

In order to reproduce this bug, I created and deleted the index again and again on the master.
What is hard to understand is that this bug cannot be repeated 100%.
I wrote a script that loops over the master and runs the following two sentences.

postgres=# create index idx_xxx on xxx (a);
postgres=# drop index CONCURRENTLY idx_xxx ;
postgres=# create index idx_xxx on xxx (a);
postgres=# drop index CONCURRENTLY idx_xxx ;
...
...
...
At the same time, I started two clients in the standby,
respectively execute the following sql on the table xxx:

postgres=# explain analyze select * from xxx where a=500;
postgres=# \watch 0.1

After a few minutes, the bug will appear.

I finally confirmed my guess, I used an index scan in the standby query,
but deleted the index on the master at the same time.
Curious, I went to read the source code of Postgres. I found that
regular DROP INDEX commands imposes a AccessExclusiveLock on the table,
while drop index concurrently commands only used ShareUpdateExclusiveLock.

As we all know, only AccessExclusiveLock and AccessShareLock ,a select's lock ,
are mutually exclusive, and AccessShareLock can't block ShareUpdateExclusiveLock.
This is very weird and not desirable.

This is of course, developers must have thought of this, so we can see in the source
code, before the drop index concurrently, will wait for all transactions using this
index to wait for detection.

But this only exists on the master, my query is executed on the standby.
I use the pg_waldump tool to parse the wal file, and analyze the stantup process,
I found that there is no similar operation on the standby, so it will appear that
when I execute the query on the standby, the index will be deleted by others.

I think this is a bug that will affect the user's experience. we need to fix it.
I have imagined that the logic that detects the query transaction and
waits for it to end is implemented on the standby,but this may increase the
log application delay and the delay is exacerbated that cause the master and backup.
This is not desirable if the query concurrency is large.

All in all, I expect that you can provide a solution that can use drop index concurrently
without affecting the master-slave delay.

Sincerely look forward to your reply and thanks.

adger

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2019-10-16 08:40:53 Re: Understanding TupleQueue impact and overheads?
Previous Message Amit Kapila 2019-10-16 06:48:41 Re: maintenance_work_mem used by Vacuum