Re: [PATCH] Exponential backoff for auth_delay

From: Michael Banck <mbanck(at)gmx(dot)net>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Abhijit Menon-Sen <ams(at)toroid(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org, 成之焕 <zhcheng(at)ceresdata(dot)com>
Subject: Re: [PATCH] Exponential backoff for auth_delay
Date: 2024-03-04 19:42:55
Message-ID: 65e62440.170a0220.f6a7c.6f7f@mx.google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Wed, Feb 21, 2024 at 10:26:11PM +0100, Tomas Vondra wrote:
> Thanks for the patch. I took a closer look at v3, so let me share some
> review comments. Please push back if you happen to disagree with some of
> it, some of this is going to be more a matter of personal preference.

Thanks! As my patch was based on a previous patch, some of decisions
were carry-overs I am not overly attached to.

> 1) I think it's a bit weird to have two options specifying amount of
> time, but one is in seconds and one in milliseconds. Won't that be
> unnecessarily confusing? Could we do both in milliseconds?

Alright, I changed that.

See below for a discussion about the GUCs in general.

> 2) The C code defines the GUC as auth_delay.exponential_backoff, but the
> SGML docs say <varname>auth_delay.exp_backoff</varname>.

Right, an oversight from the last version where the GUC name got changed
but I forgot to change the documentation, fixed.

> 3) Do we actually need the exponential_backoff GUC? Wouldn't it be
> sufficient to have the maximum value, and if it's -1 treat that as no
> backoff?

That is a good question, I guess that makes sense.

The next question then is: what should the default for (now)
auth_delay.max_milliseconds be in this case, -1? Or do we say that as
the default for auth_delay.milliseconds is 0 anyway, why would somebody
not want exponential backoff when they switch it on and keep it at the
current 10s/10000ms)?

I have not changed that for now, pending further input.

> 4) I think the SGML docs should more clearly explain that the delay is
> initialized to auth_delay.milliseconds, and reset to this value after
> successful authentication. The wording kinda implies that, but it's not
> quite clear I think.

Ok, I added some text to that end. I also added a not that
auth_delay.max_milliseconds will mean that the delay doubling never
stops.

> 4) I've been really confused by the change from
>
> if (status != STATUS_OK)
> to
> if (status == STATUS_ERROR)
>
> in auth_delay_checks, until I realized those two codes are not the only
> ones, and we intentionally ignore STATUS_EOF. I think it'd be good to
> mention that in a comment, it's not quite obvious (I only realized it
> because the e-mail mentioned it).

Yeah I agree, I tried to explain that now.

> 5) I kinda like the custom that functions in a contrib module start with
> a clear common prefix, say auth_delay_ in this case. Matter of personal
> preference, ofc.

Ok, I changed the functions to have an auth_delay_ prefix throughout..

> 6) I'm a bit skeptical about some acr_array details. Firstly, why would
> 50 entries be enough? Seems like a pretty low and arbitrary number ...
> Also, what happens if someone attempts to authenticate, fails to do so,
> and then disconnects and never tries again? Or just changes IP? Seems
> like the entry will remain in the array forever, no?

Yeah, that is how v3 of this patch worked. I have changed that now, see
below.

> Seems like a way to cause a "limited" DoS - do auth failure from 50
> different hosts, to fill the table, and now everyone has to wait the
> maximum amount of time (if they fail to authenticate).

Right, though the problem would only exist on authentication failures,
so it is really rather limited.

> I think it'd be good to consider:
>
> - Making the acr_array a hash table, and larger than 50 entries (I
> wonder if that should be dynamic / customizable by GUC?).

I would say a GUC should be overkill for this as this would mostly be an
implementation detail.

More generally, I think now that entries are expired (see below), this
should be less of a concern, so I have not changed this to a hash table
for now but doubled MAX_CONN_RECORDS to 100 entries.

> - Make sure the entries are eventually expired, based on time (for
> example after 2*max_delay?).

I went with 5*max_milliseconds - the AuthConnRecord struct now has a
last_failed_auth timestamp member; if we increase the delay for a host,
we check if any other host expired in the meantime and remove it if so.

> - It would be a good idea to log something when we get into the "full
> table" and start delaying everyone by max_delay_seconds. (How would
> anyone know that's what's happening, seems rather confusing.)

Right, I added a log line for that. However, I made it LOG instead of
WARNING as I don't think the client would ever see it, would he?

Attached is v4 with the above changes.

Cheers,

Michael

Attachment Content-Type Size
v4-0001-Add-optional-exponential-backoff-to-auth_delay-co.patch text/x-diff 10.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2024-03-04 19:47:09 Re: Add system identifier to backup manifest
Previous Message Robert Haas 2024-03-04 18:37:39 Re: incremental backup mishandles XLOG_DBASE_CREATE_FILE_COPY