Re: [PATCH] Exponential backoff for auth_delay

From: Michael Banck <mbanck(at)gmx(dot)net>
To: Abhijit Menon-Sen <ams(at)toroid(dot)org>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, 成之焕 <zhcheng(at)ceresdata(dot)com>
Subject: Re: [PATCH] Exponential backoff for auth_delay
Date: 2024-01-19 14:08:36
Message-ID: 65aa8265.170a0220.1ecef.91df@mx.google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

many thanks for the review!

I went through your comments (a lot of which pertained to the original
larger patch I took code from), attached is a reworked version 2.

Other changes:

1. Ignore STATUS_EOF, this led to auth_delay being applied twice (maybe
due to the gss/kerberos auth psql is trying by default? Is that legit
and should this change be reverted?) - i.e. handle STATUS_OK and
STATUS_ERROR explicitly.

2. Guard ShmemInitStruct() with LWLockAcquire(AddinShmemInitLock,
LW_EXCLUSIVE) / LWLockRelease(AddinShmemInitLock), as is done in
pg_prewarm and pg_stat_statements as well.

3. Added an additional paragraph discussing the value of
auth_delay.milliseconds when auth_delay.exponential_backoff is on, see
below.

I wonder whether maybe auth_delay.max_seconds should either be renamed
to auth_delay.exponential_backoff_max_seconds (but then it is rather
long) in order to make it clearer it only applies in that context or
alternatively just apply to auth_delay.milliseconds as well (though that
would be somewhat weird).

Further comments to your comments:

On Tue, Jan 16, 2024 at 12:00:49PM +0530, Abhijit Menon-Sen wrote:
> At 2024-01-04 08:30:36 +0100, mbanck(at)gmx(dot)net wrote:
> >
> > +typedef struct AuthConnRecord
> > +{
> > + char remote_host[NI_MAXHOST];
> > + bool used;
> > + double sleep_time; /* in milliseconds */
> > +} AuthConnRecord;
>
> Do we really need a "used" field here? You could avoid it by setting
> remote_host[0] = '\0' in cleanup_conn_record.

Ok, removed.

> > static void
> > auth_delay_checks(Port *port, int status)
> > {
> > + double delay;
>
> I would just initialise this to auth_delay_milliseconds here, instead of
> the harder-to-notice "else" below.

Done.

> > @@ -43,8 +69,150 @@ auth_delay_checks(Port *port, int status)
> > */
> > if (status != STATUS_OK)
> > {
> > - pg_usleep(1000L * auth_delay_milliseconds);
> > + if (auth_delay_exp_backoff)
> > + {
> > + /*
> > + * Exponential backoff per remote host.
> > + */
> > + delay = record_failed_conn_auth(port);
> > + if (auth_delay_max_seconds > 0)
> > + delay = Min(delay, 1000L * auth_delay_max_seconds);
> > + }
>
> I would make this comment more specific, maybe something like "Delay by
> 2^n seconds after each authentication failure from a particular host,
> where n is the number of consecutive authentication failures".

Done.

> It's slightly discordant to see the new delay being returned by a
> function named "record_<something>" (rather than "calculate_delay" or
> similar). Maybe a name like "backoff_after_failed_auth" would be better?
> Or "increase_delay_on_auth_fail"?

I called it increase_delay_after_failed_conn_auth() now.

> > +static double
> > +record_failed_conn_auth(Port *port)
> > +{
> > + AuthConnRecord *acr = NULL;
> > + int j = -1;
> > +
> > + acr = find_conn_record(port->remote_host, &j);
> > +
> > + if (!acr)
> > + {
> > + if (j == -1)
> > +
> > + /*
> > + * No free space, MAX_CONN_RECORDS reached. Wait as long as the
> > + * largest delay for any remote host.
> > + */
> > + return find_conn_max_delay();
>
> In this extraordinary situation (where there are lots of hosts with lots
> of authentication failures), why not delay by auth_delay_max_seconds
> straightaway, especially when the default is only 10s? I don't see much
> point in coordinating the delay between fifty known hosts and an unknown
> number of others.

I was a bit worried about legitimate users suffering here if (for some
reason) a lot of different hosts try to guess passwords, but only once
or twice or something. But I have changed it now as you suggested as
that makes it simpler and I guess the problem I mentioned above is
rather contrived.

> > + elog(DEBUG1, "new connection: %s, index: %d", acr->remote_host, j);
>
> I think this should be removed, but if you want to leave it in, the
> message should be more specific about what this is actually about, and
> where the message is from, so as to not confuse debug-log readers.

I left it in but mentioned auth_delay in it now. I wonder though whether
this might be a useful message to have at some more standard level like
INFO?

> (The same applies to the other debug messages.)

Those are all gone.

> > +static AuthConnRecord *
> > +find_conn_record(char *remote_host, int *free_index)
> > +{
> > + int i;
> > +
> > + *free_index = -1;
> > + for (i = 0; i < MAX_CONN_RECORDS; i++)
> > + {
> > + if (!acr_array[i].used)
> > + {
> > + if (*free_index == -1)
> > + /* record unused element */
> > + *free_index = i;
> > + continue;
> > + }
> > + if (strcmp(acr_array[i].remote_host, remote_host) == 0)
> > + return &acr_array[i];
> > + }
> > +
> > + return NULL;
> > +}
>
> It's not a big deal, but to be honest, I would much prefer to (get rid
> of used, as suggested earlier, and) have separate find_acr_for_host()
> and find_free_acr() functions.

Done.

> > +static void
> > +record_conn_failure(AuthConnRecord *acr)
> > +{
> > + if (acr->sleep_time == 0)
> > + acr->sleep_time = (double) auth_delay_milliseconds;
> > + else
> > + acr->sleep_time *= 2;
> > +}
>
> I would just roll this function into record_failed_conn_auth (or
> whatever it's named),

Done.

> In terms of the logic, it would have been slightly clearer to store the
> number of failures and calculate the delay, but it's easier to double
> the sleep time that way you've written it. I think it's fine.

I kept it as-is for now.

> It's worth noting that there is no time-based reset of the delay with
> this feature, which I think is something that people might expect to go
> hand-in-hand with exponential backoff. I think that's probably OK too.

You mean something like "after 5 minutes, reset the delay to 0 again"? I
agree that this would probably be useful, but would also make the change
more complex.

> > +static void
> > +auth_delay_shmem_startup(void)
> > +{
> > + Size required;
> > + bool found;
> > +
> > + if (shmem_startup_next)
> > + shmem_startup_next();
> > +
> > + required = sizeof(AuthConnRecord) * MAX_CONN_RECORDS;
> > + acr_array = ShmemInitStruct("Array of AuthConnRecord", required, &found);
> > + if (found)
> > + /* this should not happen ? */
> > + elog(DEBUG1, "variable acr_array already exists");
> > + /* all fileds are set to 0 */
> > + memset(acr_array, 0, required);
> > }
>
> I think you can remove the elog and just do the memset if (!found). Also
> if you're changing it anyway, I'd suggest something like "total_size"
> instead of "required".

Done.

> > + DefineCustomBoolVariable("auth_delay.exp_backoff",
> > + "Exponential backoff for failed connections, per remote host",
> > + NULL,
> > + &auth_delay_exp_backoff,
> > + false,
> > + PGC_SIGHUP,
> > + 0,
> > + NULL,
> > + NULL,
> > + NULL);
>
> Maybe "Double the delay after each authentication failure from a
> particular host". (Note: authentication failed, not connection.)

Done.

> I would also mildly prefer to spell out "exponential_backoff" (but leave
> auth_delay_exp_backoff as-is).

Done.

> > + DefineCustomIntVariable("auth_delay.max_seconds",
> > + "Maximum seconds to wait when login fails during exponential backoff",
> > + NULL,
> > + &auth_delay_max_seconds,
> > + 10,
> > + 0, INT_MAX,
> > + PGC_SIGHUP,
> > + GUC_UNIT_S,
> > + NULL, NULL, NULL);
> > +
>
> Maybe just "Maximum delay when exponential backoff is enabled".

Done.

> (Parameter indentation doesn't match the earlier block.)

I noticed that as well, but pgindent keeps changing it back to this, not
sure why...

> I'm not able to make up my mind if I think 10s is a good default or not.
> In practice, it means that after the first three consecutive failures,
> we'll delay by 10s for every subsequent failure. That sounds OK. But is
> is much more useful than, for example, delaying the first three failures
> by auth_delay_milliseconds and then jumping straight to max_seconds?

What I had in mind is that admins would lower auth_delay.milliseconds to
something like 100 or 125 when exponential_backoff is on, so that the
first few (possibley honest) auth failures do not get an annoying 1
seconds penalty, but later ones then wil. In that case, 10 seconds is
probably ok cause you'd need more than a handful of auth failures.

I added a paragraph to the documentation to this end.

> I can't really imagine wanting to increase max_seconds to, say, 128 and
> keep a bunch of backends sleeping while someone's trying to brute-force
> a password. And with a reasonably short max_seconds, I'm not sure if
> having the backoff be _exponential_ is particularly important.
>
> Or maybe because this is a contrib module, we don't have to think about
> it to that extent?

Well, not sure. I think something like 10 seconds should be fine for
most brute-force attacks in practise, and it is configurable (and turned
off by default).

> > diff --git a/doc/src/sgml/auth-delay.sgml b/doc/src/sgml/auth-delay.sgml
> > index 0571f2a99d..2ca9528011 100644
> > --- a/doc/src/sgml/auth-delay.sgml
> > +++ b/doc/src/sgml/auth-delay.sgml
> > @@ -16,6 +16,17 @@
> > connection slots.
> > </para>
> >
> > + <para>
> > + It is optionally possible to let <filename>auth_delay</filename> wait longer
> > + for each successive authentication failure from a particular remote host, if
> > + the configuration parameter <varname>auth_delay.exp_backoff</varname> is
> > + active. Once an authentication succeeded from a remote host, the
> > + authentication delay is reset to the value of
> > + <varname>auth_delay.milliseconds</varname> for this host. The parameter
> > + <varname>auth_delay.max_seconds</varname> sets an upper bound for the delay
> > + in this case.
> > + </para>
>
> How about something like this…
>
> If you enable exponential_backoff, auth_delay will double the delay
> after each consecutive authentication failure from a particular
> host, up to the given max_seconds (default: 10s). If the host
> authenticates successfully, the delay is reset.

Done, mostly.

> > + <varlistentry>
> > + <term>
> > + <varname>auth_delay.max_seconds</varname> (<type>integer</type>)
> > + <indexterm>
> > + <primary><varname>auth_delay.max_seconds</varname> configuration parameter</primary>
> > + </indexterm>
> > + </term>
> > + <listitem>
> > + <para>
> > + How many seconds to wait at most if exponential backoff is active.
> > + Setting this parameter to 0 disables it. The default is 10 seconds.
> > + </para>
> > + </listitem>
> > + </varlistentry>
>
> I suggest "The maximum delay, in seconds, when exponential backoff is
> enabled."

Done.

Michael

Attachment Content-Type Size
v2-0001-Add-optional-exponential-backoff-to-auth_delay-co.patch text/x-diff 9.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Banck 2024-01-19 14:16:24 Re: [PATCH] Exponential backoff for auth_delay
Previous Message Peter Eisentraut 2024-01-19 14:03:37 Re: generate syscache info automatically