Re: POC: Extension for adding distributed tracing - pg_tracing

From: Aleksander Alekseev <aleksander(at)timescale(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, Anthonin Bonnefoy <anthonin(dot)bonnefoy(at)datadoghq(dot)com>, Nikita Malakhov <hukutoc(at)gmail(dot)com>
Subject: Re: POC: Extension for adding distributed tracing - pg_tracing
Date: 2023-11-17 14:34:37
Message-ID: CAJ7c6TMX+=YLBHOknpe7-08FQ1rExhDj1haEFvr8LNxmA1tUcA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Thanks for the updated patch!

> Some small changes, mostly around making tests less flaky:
> - Removed the top_span and nested_level in the span output, those were
> mostly used for debugging
> - More tests around Parse span in nested queries
> - Remove explicit queryId in the tests

```
+-- Worker can take some additional time to end and report their spans
+SELECT pg_sleep(0.2);
+ pg_sleep
+----------
+
+(1 row)
```

Pretty sure this will fail on buildfarm from time to time. Maybe this
is a case when it is worth using TAP tests? Especially considering the
support of pg_tracing.notify_channel.

```
+ Only superusers can change this setting.
+ This parameter can only be set at server start.
```

I find this confusing. If the parameter can only be set at server
start doesn't it mean that whoever has access to postgresql.conf can
change it?

```
+ <varname>pg_tracing.track_utility</varname> controls whether spans
+ should be generated for utility statements. Utility commands are
+ all those other than <command>SELECT</command>,
<command>INSERT</command>,
+ <command>UPDATE</command>, <command>DELETE</command>, and
<command>MERGE</command>.
+ The default value is <literal>on</literal>.
```

Perhaps TABLE command should be added to the list, since it's
basically a syntax sugar for SELECT:

```
=# create table t(x int);
=# insert into t values (1), (2), (3);
=# table t;
x
---
1
2
3
(3 rows)
```

This part worries me a bit:

```
+ <listitem>
+ <para>
+ <varname>pg_tracing.drop_on_full_buffer</varname> controls whether
+ span buffer should be emptied when <varname>pg_tracing.max_span</varname>
+ spans is reached. If <literal>off</literal>, new spans are dropped.
+ The default value is <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
```

Obviously pg_tracing.drop_on_full_buffer = on will not work properly
with pg_tracing.notify_threshold because there is a race condition
possible. By the time one receives and processes the notification the
buffer can be already cleaned. This makes me think that 1)
pg_tracing.drop_on_full_buffer = on could be not the best default
value 2) As a user I would like a support of ring buffer mode. Even if
implementing this mode is out of scope of this particular patch, I
would argue that instead of drop_on_full_buffer boolean parameter we
need something like:

pg_tracing.buffer_mode = keep_on_full | drop_on_full | ring_buffer |
perhaps_other_future_modes

```
+ <para>
+ The module requires additional shared memory proportional to
+ <varname>pg_tracing.max_span</varname>. Note that this
+ memory is consumed whenever the module is loaded, even if
+ no spans are generated.
+ </para>
```

Do you think it possible to document how much shared memory does the
extension consume depending on the parameters? Alternatively maybe
it's worth providing a SQL function that returns this value? As a user
I would like to know whether 10 MB, 500 MB or 1.5 GB of shared memory
are consumed.

```
+ The start time of a span has two parts: <literal>span_start</literal> and
+ <literal>span_start_ns</literal>. <literal>span_start</literal> is
a Timestamp
+ with time zone providing up a microsecond precision.
<literal>span_start_ns</literal>
+ provides the nanosecond part of the span start.
<literal>span_start + span_start_ns</literal>
+ can be used to get the start with nanosecond precision.
```

I would argue that ns ( 1 / 1_000_000_000 sec! ) precision is
redundant and only complicates the interface. This is way below the
noise floor formed from network delays, process switching by the OS,
noisy neighbours under the same supervisor, time slowdown by ntpd,
etc. I don't recall ever needing spans with better than 1 us
precision.

On top of that if you have `span_start_ns` you should probably rename
`duration` to `duration_ns` for consistency. I believe it would be
more convenient to have `duration` in microseconds. Nothing prevents
us from using fractions of microseconds (e.g. 12.345 us) if necessary.

```
+=# select trace_id, parent_id, span_id, span_start, duration, span_type, ...
+ trace_id | parent_id | span_id |
...
+----------------------+----------------------+----------------------+----
+ 479062278978111363 | 479062278978111363 | 479062278978111363 ...
+ 479062278978111363 | 479062278978111363 | 479062278978111363 ...
+ 479062278978111363 | 479062278978111363 | -7732425595282414585 ...
```

Negative span ids, for sure, are possible, but I don't recall seeing
many negative ids in my life. I suggest not to surprise the user
unless this is really necessary.

--
Best regards,
Aleksander Alekseev

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2023-11-17 14:36:25 long-standing data loss bug in initial sync of logical replication
Previous Message Alvaro Herrera 2023-11-17 13:58:32 Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock