Re: Slowness of extended protocol

From: Shay Rojansky <roji(at)roji(dot)org>
To: Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Greg Stark <stark(at)mit(dot)edu>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Slowness of extended protocol
Date: 2016-08-13 17:33:04
Message-ID: CADT4RqA7Np5ZwcDKNzDjxN9PJYpdiPaXaSy_K=W9+1HLHgKREQ@mail.gmail.com
Lists: pgsql-hackers

Vladimir wrote:

> Shay>I don't know much about the Java world, but both pgbouncer and
> Shay>pgpool (the major pools?)
>
> In Java world, https://github.com/brettwooldridge/HikariCP is a very good
> connection pool.
> Neither pgbouncer nor pgpool is required.
> The architecture is: application <=> HikariCP <=> pgjdbc <=> PostgreSQL
>
> The idea is HikariCP pools pgjdbc connections, and server-prepared
> statements are per-pgjdbc connection, so there is no reason to "discard
> all" or "deallocate all" or whatever.
>
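
For concreteness, here's a minimal sketch of that architecture as I
understand it (assuming HikariCP's standard HikariConfig API and pgjdbc's
prepareThreshold URL parameter; the class, database and credential names
are made up):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolSketch {
    public static void main(String[] args) {
        HikariConfig config = new HikariConfig();
        // prepareThreshold=5 is pgjdbc's documented default, spelled out
        // here only for clarity
        config.setJdbcUrl("jdbc:postgresql://localhost/test?prepareThreshold=5");
        config.setUsername("app");
        config.setPassword("secret");
        config.setMaximumPoolSize(10);
        try (HikariDataSource ds = new HikariDataSource(config)) {
            // Server-prepared statements belong to the physical pgjdbc
            // connection, which the pool hands back intact, so there is
            // nothing to "discard all" between checkouts.
        }
    }
}
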
> Shay>send DISCARD ALL by default. That is a fact, and it has nothing to do
> with any bugs or issues pgbouncer may have.
>
> That is configurable. If pgbouncer/pgpool defaults to "wrong
> configuration", why should we (driver developers, backend developers) try
> to accommodate that?
>

In a nutshell, that sentence captures much of the problem I have with this
conversation: you consider anything that differs from your approach to be
simply "wrong".

> Shay>If you do a savepoint on every single command, that surely would
> Shay>impact performance even without extra roundtrips...?
>
> My voltmeter tells me that the overhead is just 3us for a simple "SELECT"
> statement (see
> https://github.com/pgjdbc/pgjdbc/pull/477#issuecomment-239579547 ).
> I think it is acceptable to have it enabled by default, however it would
> be nice if the database did not require a savepoint dance to heal its "not
> implemented" "cached plan must not change result type" error.
>

That's interesting (I don't mean that ironically). Aside from the raw
client-observed latency you're measuring, are you sure there aren't
"hidden" costs on the PostgreSQL side to generating so many implicit
savepoints? I don't know anything about PostgreSQL transaction/savepoint
internals, but adding a savepoint to each and every statement in a
transaction seems like it would have some impact. It would be great to
have confirmation from someone who really knows the internals.
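
To make the concern concrete, here is roughly the dance as I understand it,
expressed with plain JDBC savepoint calls (a sketch only; I assume the real
driver does the equivalent at the protocol level and may name or reuse
savepoints differently):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Savepoint;
import java.sql.Statement;

public class AutosaveSketch {
    public static void main(String[] args) throws SQLException {
        try (Connection con =
                 DriverManager.getConnection("jdbc:postgresql://localhost/test");
             Statement st = con.createStatement()) {
            con.setAutoCommit(false);
            Savepoint sp = con.setSavepoint(); // SAVEPOINT before the statement
            try {
                st.execute("SELECT 1");
                con.releaseSavepoint(sp);      // RELEASE SAVEPOINT on success
            } catch (SQLException e) {
                con.rollback(sp);              // ROLLBACK TO SAVEPOINT on failure,
                throw e;                       // where the driver would retry unprepared
            }
            con.commit();
        }
    }
}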

> Shay>I'm not aware of other drivers that implicitly prepare statements,
> Shay>and definitely of no drivers that internally create savepoints and
> Shay>roll them back without explicit user APIs
>
> Glad to hear that.
> Are you aware of other drivers that translate "insert into table(a,b,c)
> values(?,?,?)" =into=> "insert into table(a,b,c) values(?,?,?),
> (?,?,?),...,(?,?,?)" statement on the fly?
>
> That gives a 2-3x performance boost (that includes client result processing
> as well!) for batched inserts since "bind/exec" message processing is not
> that fast. That is why I say that investing into "bind/exec performance"
> makes more sense than investing into "improved execution of non-prepared
> statements".
>

I'm not aware of any, but I also don't consider that part of the driver's
job. There's nothing your driver is doing that the application developer
can't do themselves - so your driver isn't faster than other drivers. It's
faster only when used by lazy programmers.
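
To illustrate: a developer who wants the multi-row VALUES speedup can write
it themselves in a few lines of plain JDBC (a sketch; the table and column
names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ManualMultiValues {
    public static void main(String[] args) throws SQLException {
        try (Connection con =
                 DriverManager.getConnection("jdbc:postgresql://localhost/test");
             // Three rows per round trip, written out explicitly rather
             // than rewritten behind the developer's back.
             PreparedStatement ps = con.prepareStatement(
                 "insert into tab(a,b,c) values (?,?,?),(?,?,?),(?,?,?)")) {
            int p = 1;
            for (int row = 0; row < 3; row++) {
                ps.setInt(p++, row);
                ps.setInt(p++, row * 10);
                ps.setInt(p++, row * 100);
            }
            ps.executeUpdate();
        }
    }
}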

What you're doing is optimizing developer code, with the assumption that
developers can't be trusted to code efficiently - they're going to write
bad SQL and forget to prepare their statements. That's your approach, and
that's fine. The other approach, which is also fine, is that each software
component has its own role, and that clearly-defined boundaries of
responsibility should exist - that writing SQL is the developer's job, and
that delivering it to the database is the driver's job. To me this
separation of layers/roles is a key part of software engineering: it results
in simpler, leaner components which interact in predictable ways, and when
there's a problem it's easier to know exactly where it originated. To be
honest, the mere idea of having an SQL parser inside my driver makes me
shiver.

The HikariCP page you sent (thanks, I didn't know it) contains a great
Dijkstra quote that summarizes what I think about this - "Simplicity is
prerequisite for reliability." IMHO you're going the opposite way by
implicitly rewriting SQL, preparing statements and creating savepoints.

But at the end of the day it's two different philosophies. All I ask is
that you respect the one that isn't yours, which means accepting the
optimization of non-prepared statements. There's no mutual exclusion.

> 2) I don't say "it is the only true way". I would change my mind if
> someone would suggest better solution. Everybody makes mistakes, and I have
> no problem with admitting that "I made a mistake" and moving forward.
> They say: "Don't cling to a mistake just because you spent a lot of time
> making it"
>

:)

So your way isn't the only true way, but others are just clinging to their
mistakes... :)

> However, no one has suggested a case where an application issues lots of
> unique SQL statements.
>

We did suggest such cases - you just dismissed or ignored them.

3) For "performance related" issues, a business case and a voltmeter is
> required to prove there's an issue.
>
>
> Shay>But the second query, which should be invalidated, has already been
> Shay>sent to the server (because of batching), and boom
>

>
When doing batched SQL, some of the queries might fail with "duplicate
> key", or "constraint violation". There's already API that covers those kind
> of cases and reports which statements did succeed (if any).
> In the case as you described (one query in a batch somehow invalidates the
> subsequent one) it would just resort to generic error handling.
>

You claimed that this condition could be handled transparently between the
driver and the backend, without disrupting the application with an error.
You proposed some new invalidation protocol between the client and the
server, but for batches this fails and the application does get an error.
That's all I wanted to say - that transparent handling of this error is
difficult, probably impossible without serious changes to the PostgreSQL
protocol...

This is very different from failure because of "duplicate key" or
"constraint violation" - your error is a result of something the driver did
implicitly, after 5 executions, without the user's request or knowledge.
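
For concreteness, the failure mode I mean looks something like this when
the statements run one at a time (a sketch; I'm assuming pgjdbc's implicit
preparation kicking in after 5 executions, and the table name is made up).
In a batch, the second SELECT is already on the wire when the error for it
comes back, so the driver has no way to transparently retry it:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class InvalidationSketch {
    public static void main(String[] args) throws SQLException {
        try (Connection con =
                 DriverManager.getConnection("jdbc:postgresql://localhost/test")) {
            // Cross the implicit-preparation threshold (5 by default).
            for (int i = 0; i < 5; i++) {
                try (PreparedStatement ps = con.prepareStatement("select * from tab")) {
                    ps.executeQuery().close();
                }
            }
            try (Statement st = con.createStatement()) {
                st.execute("alter table tab add column extra int"); // result shape changes
            }
            try (PreparedStatement ps = con.prepareStatement("select * from tab")) {
                ps.executeQuery(); // ERROR: cached plan must not change result type
            }
        }
    }
}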

From my side, I think this conversation has definitely reached a dead end
(although I can continue responding if you think it would be of any value).
There are two software design philosophies here, and the discussion isn't
really even about databases or drivers. For the record, I do recognize that
your approach has some merit (even if in the grand scheme of things I think
it isn't the right way to design software components).

Also for the record, I do intend to implement implicit query preparation,
but as an opt-in mechanism. I'm going to do so not because I think
developers shouldn't be trusted to code efficiently, but because there are
situations outside the developer's control which could greatly benefit from
it (e.g. ORMs, or a large existing codebase that can't easily be changed to
use prepare). I'm going to recommend developers keep this feature off
unless they have no other choice.
