Fwd: BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack

From: Jeremy Schneider <schnjere(at)amazon(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: "Albin, Lloyd P" <lalbin(at)scharp(dot)org>
Subject: Fwd: BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack
Date: 2018-07-19 23:17:44
Message-ID: 9145334b-9583-124c-012e-7d58039ad417@amazon.com
Lists: pgsql-bugs pgsql-hackers

I'd like to bump this old bug that Lloyd filed for more discussion. It
seems serious enough to me that we should at least talk about it.

Anyone with simply the login privilege and the ability to run SQL can
instantly block all new incoming connections to a DB including new
superuser connections.

session 1:
select pg_sleep(9999999999) from pg_stat_activity;

session 2:
vacuum full pg_authid; -or- truncate table pg_authid;

(There are likely other SQL statements you could run in session 2 as well.)

-------- Forwarded Message --------
Subject: BUG #15182: Canceling authentication due to timeout aka Denial
of Service Attack
Date: Mon, 30 Apr 2018 20:41:11 +0000
From: PG Bug reporting form <noreply(at)postgresql(dot)org>
Reply-To: lalbin(at)scharp(dot)org, pgsql-bugs(at)lists(dot)postgresql(dot)org
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
CC: lalbin(at)scharp(dot)org

The following bug has been logged on the website:

Bug reference: 15182
Logged by: Lloyd Albin
Email address: lalbin(at)scharp(dot)org
PostgreSQL version: 10.3
Operating system: OpenSUSE
Description:
Over the last several weeks our developers accidentally caused a denial of
service against our own server. Looking at the log files, I noticed
authentication timeouts during these periods. Researching the problem, I
found that it is due to locks being held on shared system catalogs, that is,
catalogs shared between all databases on the same cluster/server. It can be
triggered by starting a long-running transaction that queries
pg_stat_activity, pg_roles, pg_database, etc., and then running REINDEX
DATABASE, REINDEX SYSTEM, or VACUUM FULL from another connection.

This issue is of particular importance to database resellers who use the
same cluster/server for multiple clients: two clients can cause it
inadvertently between them, and a single client can cause it either
maliciously or inadvertently. Note: the large cloud providers give each
client its own cluster/server, so one cloud client cannot affect another,
but an individual client can still be affected. Traditional hosting
companies, however, will have all clients from one or more web servers share
the same PostgreSQL cluster/server. This means that one or two clients could
inadvertently stop all other clients from connecting to their databases
until the first client issues a COMMIT or ROLLBACK, and that transaction
could be held open for hours, which is what happened to us internally.

In Connection 1 we need to BEGIN a transaction and then query a shared
system catalog (pg_authid, pg_database, etc.) or a view that depends on a
shared system catalog (pg_stat_activity, pg_roles, etc.). Our developers
were accessing pg_roles.

Connection 1 (Any database, Any User)
BEGIN;
SELECT * FROM pg_stat_activity;

Connection 2 (Any database will do as long as you are the database owner)
REINDEX DATABASE postgres;

Connection 3 (Any Database, Any User)
psql -h sqltest-alt -d sandbox

Every subsequent Connection 3 attempt will hang for as long as the
transaction in Connection 1 stays open. In our case this was hours, and it
denied everybody else the ability to log into the server until Connection 1
was committed. psql will just hang for hours, even overnight in my testing,
but our apps would get the "Canceling authentication due to timeout" error
after 1 minute.
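
[Editorial sketch, not part of the original report.] To see which sessions
hold or are waiting on locks on shared catalogs while this is happening, a
query along the following lines may help. It assumes PostgreSQL 9.6 or later
(for pg_blocking_pids) and must be run from a session that was already
connected, since new logins will hang. It deliberately reads only pg_locks
and avoids pg_stat_activity, which itself depends on the shared catalogs
involved.

```sql
-- Locks on shared relations are reported with database = 0 in pg_locks.
-- Rows with granted = false are waiters; blocked_by lists the PIDs they
-- are queued behind.
SELECT pid,
       relation::regclass    AS shared_relation,
       mode,
       granted,
       pg_blocking_pids(pid) AS blocked_by
FROM pg_locks
WHERE locktype = 'relation'
  AND database = 0
ORDER BY relation, granted DESC;
```

In the scenario above one would expect to see Connection 1 holding an
AccessShareLock, Connection 2 waiting for an AccessExclusiveLock, and the
hung login attempts waiting behind Connection 2.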

Connection 2 can also run any of these commands to cause the same issue:
REINDEX SYSTEM postgres;
VACUUM FULL pg_authid;
vacuumdb -f -h sqltest-alt -d lloyd -U lalbin

Even worse, VACUUM FULL pg_authid; can be started by an unprivileged user.
It waits for the AccessShareLock held by Connection 1 to be released before
returning the error that you don't have permission to perform this action,
so even an unprivileged user can cause this to happen. The privilege check
needs to happen before waiting for the AccessExclusiveLock.
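
[Editorial sketch, not part of the original report.] Until the ordering of
the privilege check and the lock wait is changed, the accidental form of
this problem can be limited with existing timeout settings. The role name
app_dev below is hypothetical and the timeout values are arbitrary. This
does not stop a malicious user, who can override these settings in their own
session, and the idle-in-transaction timeout does not cover a transaction
that is busy running a long statement such as pg_sleep.

```sql
-- Cap how long a session may sit idle inside an open transaction
-- (setting available since PostgreSQL 9.6). app_dev is a hypothetical role.
ALTER ROLE app_dev SET idle_in_transaction_session_timeout = '5min';

-- In the maintenance session, give up rather than queue indefinitely
-- for the AccessExclusiveLock; the statement fails once the timeout expires.
SET lock_timeout = '5s';
VACUUM FULL pg_authid;
```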

This bug report has been simplified and shortened drastically. To read the
full information about this issue please see my blog post:
http://lloyd.thealbins.com/Canceling%20authentication%20due%20to%20timeout

Lloyd Albin
Database Administrator
Statistical Center for HIV/AIDS Research and Prevention (SCHARP)
Fred Hutchinson Cancer Research Center

--
Jeremy Schneider
Database Engineer
Amazon Web Services
