Re: POC: Extension for adding distributed tracing - pg_tracing

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Anthonin Bonnefoy <anthonin(dot)bonnefoy(at)datadoghq(dot)com>
Cc: Aleksander Alekseev <aleksander(at)timescale(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, Nikita Malakhov <hukutoc(at)gmail(dot)com>
Subject: Re: POC: Extension for adding distributed tracing - pg_tracing
Date: 2024-01-06 15:30:03
Message-ID: CALDaNm13MCWCJCOy6J9-1ojYo3vWGt+14y05h7qKadvk_YAdiw@mail.gmail.com
Lists: pgsql-hackers

On Thu, 7 Dec 2023 at 20:06, Anthonin Bonnefoy
<anthonin(dot)bonnefoy(at)datadoghq(dot)com> wrote:
>
> Hi,
>
> Thanks for the review!
>
> > ```
> > +-- Worker can take some additional time to end and report their spans
> > +SELECT pg_sleep(0.2);
> > + pg_sleep
> > +----------
> > +
> > +(1 row)
> > ```
> >
> > Pretty sure this will fail on buildfarm from time to time. Maybe this
> > is a case when it is worth using TAP tests? Especially considering the
> > support of pg_tracing.notify_channel.
>
> The sleep was unnecessary as the leader waits for all workers to finish
> and report. All workers should have added their spans to the shared buffer
> before the leader does, so I removed the sleep from the test. I haven't
> seen this test fail since then.
>
> I did add a TAP test for the notify part since it seems to be the only way
> to test it.
>
> > ```
> > + Only superusers can change this setting.
> > + This parameter can only be set at server start.
> > ```
> >
> > I find this confusing. If the parameter can only be set at server
> > start doesn't it mean that whoever has access to postgresql.conf can
> > change it?
>
> The 'only superusers...' part was incorrect. This setting was needed to compute
> the size of the shmem request, so modifying it would require a server restart.
> For this parameter, though, I've realised that I could use the global
> max_parallel_workers instead, removing the need for a separate
> max_traced_parallel_workers parameter.
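
For reference, the existing worker limits that such a shared memory request
would be sized against can be inspected from SQL; this is just a generic
pg_settings query, not part of the patch:

```
-- Existing worker limits the extension can size its shmem request against.
SELECT name, setting, context
FROM pg_settings
WHERE name IN ('max_parallel_workers', 'max_worker_processes');
```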
>
> > Perhaps the TABLE command should be added to the list, since it's
> > basically syntax sugar for SELECT:
>
> Added.
>
> > Obviously pg_tracing.drop_on_full_buffer = on will not work properly
> > with pg_tracing.notify_threshold because a race condition is possible.
> > By the time one receives and processes the notification, the buffer may
> > already have been cleared.
>
> That's definitely a possibility. Lowering the notification threshold reduces
> the chance of the buffer being cleared before the notification is handled,
> but the race can still happen.
>
> For my use cases, spans should be sent as soon as possible and old spans
> are less valuable than new ones, but other usages may want the opposite,
> hence the parameters to control the behavior.
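
As a rough illustration of the consumer side (the channel name and threshold
value below are assumptions chosen for the example, not the extension's
defaults):

```
-- Sketch of a consumer reacting to the notification threshold: listen on the
-- configured channel and drain spans as soon as the notification arrives,
-- which narrows (but does not close) the race with a full buffer.
SET pg_tracing.notify_threshold = 100;   -- value chosen for illustration
LISTEN pg_tracing_spans;                 -- assumed channel name
-- ... upon receiving the notification:
SELECT count(*) FROM pg_tracing_consume_spans;
```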
>
> > This makes me think that 1) pg_tracing.drop_on_full_buffer = on might
> > not be the best default value
>
> I don't have a strong opinion on the default behaviour; I will switch to
> keeping the buffer on full.
>
> > 2) As a user I would like a support of ring buffer mode. Even if
> > implementing this mode is out of scope of this particular patch, I
> > would argue that instead of drop_on_full_buffer boolean parameter we
> > need something like:
> >
> > pg_tracing.buffer_mode = keep_on_full | drop_on_full | ring_buffer |
> > perhaps_other_future_modes
>
> A ring buffer was also my first approach, but it adds some complexity, so
> I've switched to a full drop plus query-file truncation for the first
> iteration.
> A ring buffer would definitely be a better experience; for the time being,
> I've switched the parameter to an enum with only keep_on_full and
> drop_on_full.
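
For illustration, usage with the enum form of the parameter would look like
this (whether it can be changed with SET or only in postgresql.conf depends
on the parameter's context):

```
-- Values discussed above; ring_buffer would be a possible future mode.
SET pg_tracing.buffer_mode = 'keep_on_full';  -- proposed new default
SET pg_tracing.buffer_mode = 'drop_on_full';  -- discard spans when the buffer is full
```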
>
> > Do you think it is possible to document how much shared memory the
> > extension consumes depending on the parameters? Alternatively, maybe
> > it's worth providing a SQL function that returns this value? As a user
> > I would like to know whether 10 MB, 500 MB or 1.5 GB of shared memory
> > is consumed.
>
> Good point. I've updated the doc to give some approximations, along with a
> query that details the memory usage.
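
Independently of the patch's own documentation, the shared memory actually
reserved can be checked with the pg_shmem_allocations view (the entry name
is an assumption here):

```
-- Shared memory reserved by the extension, assuming its allocations are
-- registered under a name containing 'pg_tracing'.
SELECT name, pg_size_pretty(allocated_size) AS allocated
FROM pg_shmem_allocations
WHERE name ILIKE '%pg_tracing%';
```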
>
> > I would argue that ns ( 1 / 1_000_000_000 sec! ) precision is
> > redundant and only complicates the interface. This is way below the
> > noise floor formed from network delays, process switching by the OS,
> > noisy neighbours under the same hypervisor, time slowdown by ntpd,
> > etc. I don't recall ever needing spans with better than 1 us
> > precision.
> >
> > On top of that if you have `span_start_ns` you should probably rename
> > `duration` to `duration_ns` for consistency. I believe it would be
> > more convenient to have `duration` in microseconds. Nothing prevents
> > us from using fractions of microseconds (e.g. 12.345 us) if necessary.
>
> I was definitely on the fence about this. I initially tried to match the
> precision provided by query instrumentation, but that introduced a lot of
> complexity. Also, while the trace standards specify that start and end
> should be in nanoseconds, it turns out that nanosecond precision is
> generally ignored.
>
> I've switched to a start+end in TimestampTz, dropping the ns precision.
> Tests had to be rewritten to have a stable order for spans that can have
> the same start/end.
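
With start and end stored as timestamptz, a microsecond duration can still be
derived on the fly; the column names below (span_start, span_end) are assumed
for illustration:

```
-- Derive a microsecond duration from timestamptz start/end columns.
SELECT span_id,
       EXTRACT(epoch FROM (span_end - span_start)) * 1e6 AS duration_us
FROM pg_tracing_consume_spans;
```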
>
> I've also removed some spans: ExecutorStart, ExecutorEnd, PostParse.
> Those spans generally had a sub-microsecond duration and didn't bring much
> value except for debugging pg_tracing itself. To avoid the confusion of
> having spans with a 0us duration, removing them altogether seemed better.
> ExecutorFinish also has a sub-microsecond duration unless it runs a nested
> query (triggers), so I'm only emitting it in that case.
>
> > Negative span ids, for sure, are possible, but I don't recall seeing
> > many negative ids in my life. I suggest not to surprise the user
> > unless this is really necessary.
>
> That one is going to be difficult. Trace ids and span ids are 8 bytes, so I'm
> exporting them as bigint. PostgreSQL doesn't support unsigned bigint, hence
> the negative span ids. An alternative would be to use bytea, but bigints are
> easier to read and interact with.
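
A small note on that point: to_hex() on a bigint yields the unsigned
hexadecimal form trace backends expect, even when the signed value is
negative, so the negative ids are mostly a display artefact:

```
-- Negative signed bigint ids map cleanly to their unsigned hex representation.
SELECT to_hex(-1::bigint);                 -- ffffffffffffffff
SELECT to_hex(span_id) AS span_id_hex
FROM pg_tracing_consume_spans;
```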
>
> The latest patch contains the changes around dropping nanosecond precision,
> test updates to make them more stable, and simplifications around span
> management.

One of the tests has failed in CFBOT [1] with:
-select count(span_id) from pg_tracing_consume_spans group by span_id;
- count
--------
- 1
- 1
- 1
- 1
- 1
-(5 rows)
-
--- Cleanup
-SET pg_tracing.sample_rate = 0.0;
+server closed the connection unexpectedly
+ This probably means the server terminated abnormally
+ before or while processing the request.
+connection to server was lost

More details are available at [2]

[1] - https://cirrus-ci.com/task/5876369167482880
[2] - https://api.cirrus-ci.com/v1/artifact/task/5876369167482880/testrun/build/testrun/pg_tracing/regress/regression.diffs
