Deadlock between VACUUM and ALTER TABLE commands

From: Kevin Burke <burke(at)shyp(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Deadlock between VACUUM and ALTER TABLE commands
Date: 2016-04-06 22:22:56
Message-ID: CAEYV4pZzdb6B4gsxaAnAYO35ozkSA2=Le5bR-sF_dEi68a=X7Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello,
Occasionally our test suite gets into a deadlocked state. I was curious why
this happens, and what our options are for dealing with it.

We run ~2900 tests on an Ubuntu machine in sequence against a Postgres
9.4.6 database. There are about 60 tables; each test runs ~3 queries, and
the largest test probably inserts ~1000 rows. To ensure each test starts
with a clean slate, we run the following query in between each test run:

ALTER TABLE table1 DISABLE TRIGGER ALL; DELETE FROM table1; ALTER TABLE
table1 ENABLE TRIGGER ALL;
ALTER TABLE table2 DISABLE TRIGGER ALL; DELETE FROM table2; ALTER TABLE
table2 ENABLE TRIGGER ALL;
...
ALTER TABLE table60 DISABLE TRIGGER ALL; DELETE FROM table60; ALTER TABLE
table60 ENABLE TRIGGER ALL;
ALTER SEQUENCE <sequence> RESTART WITH 1;

We also have a check at the end of every test run that the test has
returned all database connections to the pool, so I don't believe there are
any other database queries in flight at this time.

*Why mess around with DISABLE/ENABLE TRIGGER instead of TRUNCATE?* We
observed that TRUNCATE took about 200ms, but this method takes about 13ms.
Over a our test suite TRUNCATE is going to more or less double the length
of the suite.

We could resolve some kind of foreign key dependency tree, issue the DELETE
FROM's in the right order, and skip DISABLE TRIGGER etc., but haven't
invested the time to do this or figured out how to maintain it/update it as
we add new tables.

We could also run each test in a transaction and issue a ROLLBACK at the
end, unfortunately our ORM
<https://kev.inburke.com/kevin/dont-use-sails-or-waterline/> does not
support transactions.

We observed that when we pushed tests to a third-party CI service, maybe 1
in 100 test runs fails with a mysterious timeout (set to 18 seconds). It
doesn't seem to correlate with any one test.

I finally figured out how to enable the right logging and export our
database logs from the remote machine. We saw the following results:
https://gist.github.com/kevinburkeshyp/4f5eef58411b1c3c1160a6f6008135b8.
I've also pasted relevant parts of the Postgres configuration at this URL.

Specifically we kick off the ALTER TABLE ... DELETE, a second later (our
deadlock_timeout) a log message appears that the ALTER TABLE is waiting for
an AccessExclusiveLock held by process 16936... which is later revealed to
be an autovacuum process of the "events" table.

(The events table also gets cleared between every test. It has 5 columns
that store less-than-60-byte integers, and one created_at timestamp. There
are indexes on the primary key and on LOWER(object_id), another column. We
need to put indexes on the LOWER() fields because of our ORM
<https://kev.inburke.com/kevin/dont-use-sails-or-waterline/>.)

Finally, the test timeout causes the application layer to start closing
connections, at which point the ALTER TABLE gains the exclusive lock.

Here are my questions:

- An AccessExclusiveLock seems to indicate that the vacuum is running a
VACUUM FULL, if I am reading the documentation correctly. Yet the
documentation for autovacuum states that "The autovacuum daemon attempts to
work this way, and in fact will never issue VACUUM FULL." The statements
from the logs and from the documentation seem to be in conflict; am I
missing something?

- Why does vacuum take so long/hang? I would expect it to be a fast
operation.

- Should I just turn off autovacuum altogether? If not, how should I tune
it? Our test run is 10k queries, 6 minutes, each of the ~3k tests runs that
DELETE FROM query listed above, and maybe an average of 2 other queries.

- What other information can I/should I log here?

- Why don't we run into this more often?

--
kevin

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2016-04-06 23:01:04 Re: pg_upgrade error regarding hstore operator
Previous Message Adrian Klaver 2016-04-06 22:04:47 Re: PostgreSQL 9.5.1 -> PG_REWIND NOT FOUND