Re: POC: Extension for adding distributed tracing - pg_tracing

From: Anthonin Bonnefoy <anthonin(dot)bonnefoy(at)datadoghq(dot)com>
To: Nikita Malakhov <hukutoc(at)gmail(dot)com>
Cc: Aleksander Alekseev <aleksander(at)timescale(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Extension for adding distributed tracing - pg_tracing
Date: 2023-09-06 13:43:36
Message-ID: CAO6_XqpbF76DXHU+iu2cU3QWv7EWQjFLyQyPnN6LT=FT6jpCtw@mail.gmail.com
Lists: pgsql-hackers

Hi!

I've been working on performance improvements and features.
Performance improvements:
- I was calling select_rtable_names_for_explain on every
ExplainTargetRel, which was rather costly and unnecessary. It's now
only done once.
- File writes are now batched, and a single file write is done at
the end of tracing.
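
As an illustration, here is a minimal sketch of the batching idea
(function names and the plain stdio usage are assumptions for the
example, not pg_tracing's actual code):

#include "postgres.h"
#include "lib/stringinfo.h"

#include <stdio.h>

static StringInfoData span_buffer;  /* set up once with initStringInfo() */

/* Append one serialized span to the in-memory buffer. */
static void
buffer_span(const char *span, int span_size)
{
    appendBinaryStringInfo(&span_buffer, span, span_size);
}

/* Write everything out in a single call when tracing ends. */
static void
flush_spans(const char *path)
{
    FILE       *f = fopen(path, "ab");

    if (f == NULL)
        return;
    fwrite(span_buffer.data, 1, span_buffer.len, f);
    fclose(f);
    resetStringInfo(&span_buffer);
}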

With those improvements, pgbench now yields the following results:
0% tracing:
pgbench -T60 "options=-cpg_tracing.sample_rate=0"
number of transactions actually processed: 57293
tps = 954.915403 (without initial connection time)

100% tracing:
pgbench -T60 "options=-cpg_tracing.sample_rate=1"
number of transactions actually processed: 45963
tps = 766.068564 (without initial connection time)

I've also added tracing of parallel workers. If a query needs parallel
workers, the trace context is pushed into a dedicated shared buffer so
that workers can pull it and generate spans with the correct traceId
and parentId. We can now see the start, run and finish of workers (see
the attached trace_worker.png).
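
As a rough sketch of the mechanism (struct layout and function names
are illustrative, and the actual patch presumably keys the slot per
leader instead of using a single global one):

#include "postgres.h"
#include "storage/lwlock.h"

/* Hypothetical trace context shared between leader and workers. */
typedef struct SharedTraceContext
{
    LWLock     *lock;
    uint64      trace_id_high;   /* 128-bit traceId, split in two */
    uint64      trace_id_low;
    uint64      parent_span_id;  /* leader's span, parent of worker spans */
    bool        sampled;         /* whether this query is traced */
} SharedTraceContext;

static SharedTraceContext *trace_ctx;  /* lives in the extension's shmem */

/* Leader: publish the context before workers are launched. */
static void
push_trace_context(uint64 id_high, uint64 id_low, uint64 parent_span_id)
{
    LWLockAcquire(trace_ctx->lock, LW_EXCLUSIVE);
    trace_ctx->trace_id_high = id_high;
    trace_ctx->trace_id_low = id_low;
    trace_ctx->parent_span_id = parent_span_id;
    trace_ctx->sampled = true;
    LWLockRelease(trace_ctx->lock);
}

/* Worker: pull the context so its spans carry the right ids. */
static void
pull_trace_context(SharedTraceContext *out)
{
    LWLockAcquire(trace_ctx->lock, LW_SHARED);
    *out = *trace_ctx;
    LWLockRelease(trace_ctx->lock);
}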
I'm currently focusing on adding more tests and comments, and on
fixing TODOs and known bugs.

Regards,
Anthonin

On Wed, Aug 30, 2023 at 11:15 AM Anthonin Bonnefoy
<anthonin(dot)bonnefoy(at)datadoghq(dot)com> wrote:
>
> And the latest version of the patch that I've forgotten in the previous message.
>
> On Wed, Aug 30, 2023 at 11:09 AM Anthonin Bonnefoy
> <anthonin(dot)bonnefoy(at)datadoghq(dot)com> wrote:
> >
> > Hi!
> >
> > > Storing spans is a tricky question. Monitoring systems often use a pull model with probes or an SQL
> > > API for pulling data, so from this point of view it is much more convenient to keep spans in a separate
> > > table. But then we come to another issue: how do we flush this data into the table? Automatic
> > > flushing on a full buffer would significantly affect query performance at random (from the user's point
> > > of view). I'd rather add a GUC, off by default, to store spans in a table instead of the buffer.
> >
> > My main issue with having the extension flush spans to a table is
> > that it will only work on primaries: replicas can't write to tables,
> > and a flush function that only works on primaries is likely to
> > introduce confusion.
> >
> > From my point of view, processing the spans should be done by an
> > external application. Similar to my initial example, which forwards
> > spans to a trace collector
> > (https://gist.github.com/bonnefoa/6ed24520bdac026d6a6a6992d308bd50#file-main-go),
> > this application could instead store spans in a dedicated table. This
> > way, the creation of the table is outside pg_tracing's scope and the
> > span_store application will be able to store spans on both replicas
> > and primaries.
> > How frequently this application should pull data to avoid a full
> > buffer and dropped spans is the tricky part. We have stats, so we can
> > monitor whether drops are happening and adjust the span buffer size or
> > increase the application's pull frequency. Another possibility (though
> > I'm not familiar with it) could be to use notifications: the
> > application would listen on a channel and pg_tracing would notify when
> > a configurable threshold is reached (e.g. send a notification when the
> > buffer reaches 50%).
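> >
> > To make the idea concrete, a hedged sketch (channel name and counters
> > are made up; note that Async_Notify queues the notification, which is
> > only delivered at transaction commit):
> >
> > #include "postgres.h"
> > #include "commands/async.h"
> >
> > static int  spans_used;             /* hypothetical usage counter */
> > static int  spans_max;              /* pg_tracing.max_span */
> > static bool threshold_notified = false;
> >
> > static void
> > maybe_notify_buffer_pressure(void)
> > {
> >     /* Fire once when the span buffer crosses 50% usage; a collector
> >      * application LISTENing on this channel pulls spans in response. */
> >     if (!threshold_notified && spans_used * 2 >= spans_max)
> >     {
> >         Async_Notify("pg_tracing_buffer", "50");
> >         threshold_notified = true;
> >     }
> > }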
> >
> > > I've checked this behavior before and haven't noticed the case you mentioned, but for
> > > loops like
> > > for i in 1..2 loop
> > >     StartTime := clock_timestamp();
> > >     insert into t2 values (i, a_data);
> > >     EndTime := clock_timestamp();
> > >     Delta := 1000 * ( extract(epoch from EndTime) - extract(epoch from StartTime) );
> > > end loop;
> >
> > Was this run through psql? In that case, you would be using the
> > simple query protocol, which always drops the portal at the end of
> > the statement and doesn't trigger the issue.
> >
> > With the extended protocol and a multi-statement transaction, I get
> > the following backtrace:
> > * frame #0: 0x0000000104581318
> > postgres`ExecutorEnd(queryDesc=0x00000001420083b0) at execMain.c:471:6
> > frame #1: 0x00000001044e1644
> > postgres`PortalCleanup(portal=0x0000000152031d00) at
> > portalcmds.c:299:4
> > frame #2: 0x0000000104b0e77c
> > postgres`PortalDrop(portal=0x0000000152031d00, isTopCommit=false) at
> > portalmem.c:503:3
> > frame #3: 0x0000000104b0e3e8 postgres`CreatePortal(name="",
> > allowDup=true, dupSilent=true) at portalmem.c:194:3
> > frame #4: 0x000000010487a308
> > postgres`exec_bind_message(input_message=0x000000016bb4b398) at
> > postgres.c:1745:12
> > frame #5: 0x000000010487846c
> > postgres`PostgresMain(dbname="postgres", username="postgres") at
> > postgres.c:4685:5
> > frame #6: 0x0000000104773144
> > postgres`BackendRun(port=0x0000000141704080) at postmaster.c:4433:2
> > frame #7: 0x000000010477044c
> > postgres`BackendStartup(port=0x0000000141704080) at
> > postmaster.c:4161:3
> > frame #8: 0x000000010476d6fc postgres`ServerLoop at postmaster.c:1778:6
> > frame #9: 0x000000010476c260 postgres`PostmasterMain(argc=3,
> > argv=0x0000600001cf72c0) at postmaster.c:1462:11
> > frame #10: 0x0000000104625ca8 postgres`main(argc=3,
> > argv=0x0000600001cf72c0) at main.c:198:3
> > frame #11: 0x00000001a61dbf28 dyld`start + 2236
> >
> > At this point, the new statement is already parsed; it's only during
> > the bind that the previous statement's portal is dropped and
> > ExecutorEnd is called. Thus, we have overlapping statements, which
> > are tricky to handle.
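> >
> > As a sketch, the workaround mentioned earlier in the thread (skipping
> > the post-parse hook while tracing is still in flight); the hook
> > signature is the one from parser/analyze.h, the flag name is made up:
> >
> > #include "postgres.h"
> > #include "parser/analyze.h"
> >
> > static post_parse_analyze_hook_type prev_post_parse_analyze_hook = NULL;
> > static bool tracing_in_progress = false;
> >
> > static void
> > pg_tracing_post_parse_analyze(ParseState *pstate, Query *query,
> >                               JumbleState *jstate)
> > {
> >     if (prev_post_parse_analyze_hook)
> >         prev_post_parse_analyze_hook(pstate, query, jstate);
> >
> >     /* The previous statement's portal may not be dropped yet, so its
> >      * ExecutorEnd (and end of tracing) hasn't run. Don't start a new
> >      * trace while one is still in flight. */
> >     if (tracing_in_progress)
> >         return;
> >
> >     /* ... start tracing this statement ... */
> > }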
> >
> > On my side, I've made the following changes:
> > - Skip nodes that are not executed.
> > - Fix query_id propagation; it should now be associated with all the
> > spans of a query when available.
> > - I had forgotten to add the spans buffer in the shmem hook, so it
> > would crash when pg_tracing.max_span was too high. Now it's fixed and
> > we can request more than 2000 spans. Currently, a Span is 320 bytes,
> > so 50000 spans will take ~15MB of memory (a sketch of the fix follows
> > this list).
> > - I've added the subplan/init plan spans and started experimenting
> > with deparsing the plan to add more details on the nodes. Instead of
> > 'NestedLoop', we will have something like 'NestedLoop|Join Filter:
> > (oid = relnamespace)'. One consequence is that this creates a high
> > number of distinct operations, which is something we want to avoid
> > with trace collectors. I'll probably move this outside the operation
> > name or find a way to make it more stable.
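> >
> > For reference, a minimal sketch of that shmem fix, modeled on what
> > pg_stat_statements does (sizes and names follow the figures above;
> > this is not the patch's actual code):
> >
> > #include "postgres.h"
> > #include "storage/ipc.h"
> > #include "storage/shmem.h"
> >
> > #define SPAN_SIZE 320               /* current size of a Span */
> > static int max_spans = 50000;       /* pg_tracing.max_span */
> >
> > static shmem_request_hook_type prev_shmem_request_hook = NULL;
> >
> > /* Registered in _PG_init():
> >  *   prev_shmem_request_hook = shmem_request_hook;
> >  *   shmem_request_hook = pg_tracing_shmem_request; */
> > static void
> > pg_tracing_shmem_request(void)
> > {
> >     if (prev_shmem_request_hook)
> >         prev_shmem_request_hook();
> >
> >     /* Reserve the span buffer up front; forgetting this is what made
> >      * high pg_tracing.max_span values crash. 50000 * 320 B ~= 15MB. */
> >     RequestAddinShmemSpace(mul_size(max_spans, SPAN_SIZE));
> > }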
> >
> > I've started initial benchmarks and profiling. On a default
> > installation and running a default `pgbench -T15`, I have the
> > following:
> > Without pg_tracing: tps = 953.118759 (without initial connection time)
> > With pg_tracing: tps = 652.953858 (without initial connection time)
> > Most of the time is spent in process_query_desc. I haven't tried to
> > optimize performance yet, so there's probably a lot of room for
> > improvement.
> >
> >
> > On Mon, Aug 14, 2023 at 11:17 AM Nikita Malakhov <hukutoc(at)gmail(dot)com> wrote:
> > >
> > > Hi!
> > >
> > > Storing spans is a tricky question. Monitoring systems often use a pull model with probes or an SQL
> > > API for pulling data, so from this point of view it is much more convenient to keep spans in a separate
> > > table. But then we come to another issue: how do we flush this data into the table? Automatic
> > > flushing on a full buffer would significantly affect query performance at random (from the user's point
> > > of view). I'd rather add a GUC, off by default, to store spans in a table instead of the buffer.
> > >
> > > >3) I'm testing more complex queries. Most of my previous tests were using the simple query protocol, but the extended protocol introduces
> > > >differences that break some assumptions I made. For example, with a multi-statement transaction like
> > > >BEGIN;
> > > >SELECT 1;
> > > >SELECT 2;
> > > >the parse of SELECT 2 will happen before the ExecutorEnd (and the end_tracing) of SELECT 1. For now, I'm skipping the post-parse
> > > >hook if we still have ongoing tracing.
> > >
> > > I've checked this behavior before and haven't noticed the case you mentioned, but for
> > > loops like
> > > for i in 1..2 loop
> > >     StartTime := clock_timestamp();
> > >     insert into t2 values (i, a_data);
> > >     EndTime := clock_timestamp();
> > >     Delta := 1000 * ( extract(epoch from EndTime) - extract(epoch from StartTime) );
> > > end loop;
> > >
> > > I've got the following call sequence:
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_post_parse_analyze hook <1>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_planner_hook <1>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_post_parse_analyze hook <2>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_planner_hook <2>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_post_parse_analyze hook <StartTime := clock_timestamp()>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_planner_hook <StartTime := clock_timestamp()>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_post_parse_analyze hook <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_planner_hook <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_ExecutorStart <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_ExecutorRun <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_ExecutorFinish <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_ExecutorEnd <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_post_parse_analyze hook <EndTime := clock_timestamp()>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_planner_hook <EndTime := clock_timestamp()>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_post_parse_analyze hook <Delta := 1000 * ( extract(epoch from EndTime) - extract(epoch from StartTime) )>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_planner_hook <Delta := 1000 * ( extract(epoch from EndTime) - extract(epoch from StartTime) )>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_planner_hook <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_ExecutorStart <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_ExecutorRun <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_ExecutorFinish <insert into t2 values (i, a_data)>
> > > psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_ExecutorEnd <insert into t2 values (i, a_data)>
> > >
> > > --
> > > Regards,
> > > Nikita Malakhov
> > > Postgres Professional
> > > The Russian Postgres Company
> > > https://postgrespro.ru/

Attachment Content-Type Size
pg-tracing-v5.patch application/octet-stream 149.7 KB
trace_worker.png image/png 147.3 KB
