User-perspective knowledge about wait events

From: Schneider <schneider(at)ardentperf(dot)com>
To: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, pgsql-www(at)postgresql(dot)org
Subject: User-perspective knowledge about wait events
Date: 2017-09-26 18:26:39
Message-ID: CA+fnDAZ_QLb2YdY3wcuH3EY9u392e+gvHrFr7fm8EQz6A3u9cQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-www

On Mon, Aug 14, 2017 at 11:58 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> As $subject has been touched on two threads recently
> (https://www.postgresql.org/message-id/CAB7nPqTbHLcHFn6m11tfpwAdgz8BmnBza2jjN9AK=SdX_kBJZQ@mail.gmail.com
> and https://www.postgresql.org/message-id/20170808213537.wkmmagf2a6i3hjyi@alvherre.pgsql),
> the list of wait event and their category are becoming harder to
> maintain because of their number and the fact that:
> 1) It is easy to add a wait event and...
> 2) It is easy to not update its documentation.
> 3) the documentation tables get easily broken.

The documentation provides nice "developer-perspective" descriptions
of the wait events. It's a great start. If you have some familiarity
with the code, or if you're willing to dig into it, then the
descriptions get you going.

However I think that it would be immensely helpful to start gathering
knowledge somewhere on wait events from the user perspective. Listing
a few of the wait events which users will probably see most often
along with practical suggestions of what users could further examine
or even what they could change to make improvements on their database
systems.

When Postgres 10 goes GA and users start adopting it, we will start
getting a real-world sense of which wait events are most commonly
observed. But I don't think it's too early to stub out a wiki page
for gathering user-level community knowledge on this. As collective
knowledge about wait event meaning and action matures, the best advice
might even be considered for eventual incorporation back into the
official documentation.

I do realize there are some big pitfalls in the area of documenting
performance optimization. Workloads vary significantly and it's all to
easy for sound bites to devolve into "golden rules" or "best
practices" and then become mantras and silver bullets.

Nonetheless I believe it is possible to make some basic, scientific
and well-qualified observations. Not only is this immensely useful but
in fact it's critical to users' ability to have a good experience when
they dramatically ramp up their usage of PostgreSQL.

One particular section of Oracle's perf tuning doc actually isn't a
bad example in giving a few qualified and actionable user-perspective
wait event descriptions:
http://docs.oracle.com/database/122/TGDBA/instance-tuning-using-performance-views.htm#TGDBA94472

I'm just making thus up off the top of my head, so I'm sure it will be
bad... but in the interest of starting a conversation here's a example
"stub/starter":

^-^-^-^-^-^-^-^-^-^-^-^
https://wiki.postgresql.org/wiki/Wait_Events

Lock:tuple - this means that the SQL is waiting because some
application has previously locked the tuple (row) with a different
database connection and has not yet committed or rolled back that
transaction. if you frequently see this wait event then you probably
need to look for application-level optimizations to decrease
insert/update/delete/lock contention on the same rows.

IO:DataFileRead - this means that the SQL is waiting for data to be
physically read into PostgreSQL's internal memory cache. if you
frequently see this wait event then you might want to examine the SQL
statements which cause the most I/O and look for optimizations. Can
you accomplish the same result while examining less data? If not, is
there an index that might make the processing more efficient? Can the
SQL be restructured to generate a more optimal plan? If no
application-level optimizations are possible, then can the I/O
subsystem be optimized somehow?

IO:XactSync and IO:WALWrite - [what's the exact difference between
these?] this probably means that the SQL is waiting for WAL data to
either be asynchronously written to disk, or for the final fsync to
complete before returning from a transactions's COMMIT call. if you
frequently see this wait event then you might want to review the
content and frequency of updates and inserts generated by your
application. If your application cannot be refactored to remove
unneeded DML then you might example the WAL I/O path for opportunities
to optimize or increase capacity.

Notes about scaling out/up hardware:

IO-related waits can often be somewhat remediated by scaling out or
scaling up your hardware. However it's generally worthwhile to first
review your application for optimization opportunities. A small
application-level change can make orders-of-magnitude greater
improvements over hardware scaling!

Note that locking and concurrency related waits can often become even
worse with scaling out/up. (Waiting for something on another machine
can be slower than waiting for something on another processor.) Also,
scaling the wrong hardware component can exacerbate IO-related waits
if you don't scale in a way that directly addresses the actual
bottleneck.

^-^-^-^-^-^-^-^-^-^-^-^

What do others think about this? Is the wiki.postgresql.org the right
place for something like this, or should it just start out as blog
posts or tweets or people's personal websites until the ideas are a
little more mature?

-Jeremy

P.S. wasn't sure if this discussion was most appropriate for hackers,
users, or www list as it seems to overlap with all of them :)

--
http://about.me/jeremy_schneider

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-09-26 18:37:46 Re: BUG #14825: enum type: unsafe use?
Previous Message Julien Rouhaud 2017-09-26 17:59:25 Re: [Proposal] Make the optimiser aware of partitions ordering

Browse pgsql-www by date

  From Date Subject
Next Message Michael Paquier 2017-09-26 23:28:27 Re: User-perspective knowledge about wait events
Previous Message David G. Johnston 2017-09-22 14:36:46 Re: RC1 Release Announcement Page Content Oddity