Re: [PATCH] Exponential backoff for auth_delay

From: Abhijit Menon-Sen <ams(at)toroid(dot)org>
To: Michael Banck <mbanck(at)gmx(dot)net>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, 成之焕 <zhcheng(at)ceresdata(dot)com>
Subject: Re: [PATCH] Exponential backoff for auth_delay
Date: 2024-01-16 06:30:49
Message-ID: ZaYimdRlGS5tdzjy@toroid.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At 2024-01-04 08:30:36 +0100, mbanck(at)gmx(dot)net wrote:
>
> +typedef struct AuthConnRecord
> +{
> + char remote_host[NI_MAXHOST];
> + bool used;
> + double sleep_time; /* in milliseconds */
> +} AuthConnRecord;

Do we really need a "used" field here? You could avoid it by setting
remote_host[0] = '\0' in cleanup_conn_record.

> static void
> auth_delay_checks(Port *port, int status)
> {
> + double delay;

I would just initialise this to auth_delay_milliseconds here, instead of
the harder-to-notice "else" below.

> @@ -43,8 +69,150 @@ auth_delay_checks(Port *port, int status)
> */
> if (status != STATUS_OK)
> {
> - pg_usleep(1000L * auth_delay_milliseconds);
> + if (auth_delay_exp_backoff)
> + {
> + /*
> + * Exponential backoff per remote host.
> + */
> + delay = record_failed_conn_auth(port);
> + if (auth_delay_max_seconds > 0)
> + delay = Min(delay, 1000L * auth_delay_max_seconds);
> + }

I would make this comment more specific, maybe something like "Delay by
2^n seconds after each authentication failure from a particular host,
where n is the number of consecutive authentication failures".

It's slightly discordant to see the new delay being returned by a
function named "record_<something>" (rather than "calculate_delay" or
similar). Maybe a name like "backoff_after_failed_auth" would be better?
Or "increase_delay_on_auth_fail"?

> +static double
> +record_failed_conn_auth(Port *port)
> +{
> + AuthConnRecord *acr = NULL;
> + int j = -1;
> +
> + acr = find_conn_record(port->remote_host, &j);
> +
> + if (!acr)
> + {
> + if (j == -1)
> +
> + /*
> + * No free space, MAX_CONN_RECORDS reached. Wait as long as the
> + * largest delay for any remote host.
> + */
> + return find_conn_max_delay();

In this extraordinary situation (where there are lots of hosts with lots
of authentication failures), why not delay by auth_delay_max_seconds
straightaway, especially when the default is only 10s? I don't see much
point in coordinating the delay between fifty known hosts and an unknown
number of others.

> + elog(DEBUG1, "new connection: %s, index: %d", acr->remote_host, j);

I think this should be removed, but if you want to leave it in, the
message should be more specific about what this is actually about, and
where the message is from, so as to not confuse debug-log readers.

(The same applies to the other debug messages.)

> +static AuthConnRecord *
> +find_conn_record(char *remote_host, int *free_index)
> +{
> + int i;
> +
> + *free_index = -1;
> + for (i = 0; i < MAX_CONN_RECORDS; i++)
> + {
> + if (!acr_array[i].used)
> + {
> + if (*free_index == -1)
> + /* record unused element */
> + *free_index = i;
> + continue;
> + }
> + if (strcmp(acr_array[i].remote_host, remote_host) == 0)
> + return &acr_array[i];
> + }
> +
> + return NULL;
> +}

It's not a big deal, but to be honest, I would much prefer to (get rid
of used, as suggested earlier, and) have separate find_acr_for_host()
and find_free_acr() functions.

> +static void
> +record_conn_failure(AuthConnRecord *acr)
> +{
> + if (acr->sleep_time == 0)
> + acr->sleep_time = (double) auth_delay_milliseconds;
> + else
> + acr->sleep_time *= 2;
> +}

I would just roll this function into record_failed_conn_auth (or
whatever it's named), but if you want to keep it a separate function, it
should at least have a name that's sufficiently different from
record_failed_conn_auth.

In terms of the logic, it would have been slightly clearer to store the
number of failures and calculate the delay, but it's easier to double
the sleep time that way you've written it. I think it's fine.

It's worth noting that there is no time-based reset of the delay with
this feature, which I think is something that people might expect to go
hand-in-hand with exponential backoff. I think that's probably OK too.

> +static void
> +auth_delay_shmem_startup(void)
> +{
> + Size required;
> + bool found;
> +
> + if (shmem_startup_next)
> + shmem_startup_next();
> +
> + required = sizeof(AuthConnRecord) * MAX_CONN_RECORDS;
> + acr_array = ShmemInitStruct("Array of AuthConnRecord", required, &found);
> + if (found)
> + /* this should not happen ? */
> + elog(DEBUG1, "variable acr_array already exists");
> + /* all fileds are set to 0 */
> + memset(acr_array, 0, required);
> }

I think you can remove the elog and just do the memset if (!found). Also
if you're changing it anyway, I'd suggest something like "total_size"
instead of "required".

> + DefineCustomBoolVariable("auth_delay.exp_backoff",
> + "Exponential backoff for failed connections, per remote host",
> + NULL,
> + &auth_delay_exp_backoff,
> + false,
> + PGC_SIGHUP,
> + 0,
> + NULL,
> + NULL,
> + NULL);

Maybe "Double the delay after each authentication failure from a
particular host". (Note: authentication failed, not connection.)

I would also mildly prefer to spell out "exponential_backoff" (but leave
auth_delay_exp_backoff as-is).

> + DefineCustomIntVariable("auth_delay.max_seconds",
> + "Maximum seconds to wait when login fails during exponential backoff",
> + NULL,
> + &auth_delay_max_seconds,
> + 10,
> + 0, INT_MAX,
> + PGC_SIGHUP,
> + GUC_UNIT_S,
> + NULL, NULL, NULL);
> +

Maybe just "Maximum delay when exponential backoff is enabled".

(Parameter indentation doesn't match the earlier block.)

I'm not able to make up my mind if I think 10s is a good default or not.
In practice, it means that after the first three consecutive failures,
we'll delay by 10s for every subsequent failure. That sounds OK. But is
is much more useful than, for example, delaying the first three failures
by auth_delay_milliseconds and then jumping straight to max_seconds?

I can't really imagine wanting to increase max_seconds to, say, 128 and
keep a bunch of backends sleeping while someone's trying to brute-force
a password. And with a reasonably short max_seconds, I'm not sure if
having the backoff be _exponential_ is particularly important.

Or maybe because this is a contrib module, we don't have to think about
it to that extent?

> diff --git a/doc/src/sgml/auth-delay.sgml b/doc/src/sgml/auth-delay.sgml
> index 0571f2a99d..2ca9528011 100644
> --- a/doc/src/sgml/auth-delay.sgml
> +++ b/doc/src/sgml/auth-delay.sgml
> @@ -16,6 +16,17 @@
> connection slots.
> </para>
>
> + <para>
> + It is optionally possible to let <filename>auth_delay</filename> wait longer
> + for each successive authentication failure from a particular remote host, if
> + the configuration parameter <varname>auth_delay.exp_backoff</varname> is
> + active. Once an authentication succeeded from a remote host, the
> + authentication delay is reset to the value of
> + <varname>auth_delay.milliseconds</varname> for this host. The parameter
> + <varname>auth_delay.max_seconds</varname> sets an upper bound for the delay
> + in this case.
> + </para>

How about something like this…

If you enable exponential_backoff, auth_delay will double the delay
after each consecutive authentication failure from a particular
host, up to the given max_seconds (default: 10s). If the host
authenticates successfully, the delay is reset.

> + <varlistentry>
> + <term>
> + <varname>auth_delay.max_seconds</varname> (<type>integer</type>)
> + <indexterm>
> + <primary><varname>auth_delay.max_seconds</varname> configuration parameter</primary>
> + </indexterm>
> + </term>
> + <listitem>
> + <para>
> + How many seconds to wait at most if exponential backoff is active.
> + Setting this parameter to 0 disables it. The default is 10 seconds.
> + </para>
> + </listitem>
> + </varlistentry>

I suggest "The maximum delay, in seconds, when exponential backoff is
enabled."

-- Abhijit

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hayato Kuroda (Fujitsu) 2024-01-16 06:43:04 RE: A failure in t/038_save_logical_slots_shutdown.pl
Previous Message Shubham Khanna 2024-01-16 06:28:10 Re: speed up a logical replica setup