Re: PATCH: pgbench - merging transaction logs

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: pgbench - merging transaction logs
Date: 2015-03-21 01:39:49
Message-ID: 550CCBE5.8060009@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On 20.3.2015 13:43, Fabien COELHO wrote:
>
> Hello Robert,
>
>>> The fprintf we are talking about occurs at most once per pgbench
>>> transaction, possibly much less when aggregation is activated,
>>> and this transaction involves networks exchanges and possibly
>>> disk writes on the server.
>>
>> random() was occurring four times per transaction rather than
>> once, but OTOH I think fprintf() is probably a much heavier-weight
>> operation.
>
> Yes, sure.
>
> My point is that if there are many threads and tremendous TPS, the
> *detailed* per-transaction log (aka simple log) is probably a bad
> choice anyway, and the aggregated version is the way to go.

I disagree with this reasoning. Can you provide numbers supporting it?

I do agree that fprintf is not cheap (when profiling pgbench it's often
the #1 item), but the impact on the measurements is quite small. For
example, with a small database (scale 10) and read-only 30-second runs
(single client), I get this:

no logging: 18672 18792 18667 18518 18613 18547
with logging: 18170 18093 18162 18273 18307 18234

So on average, that's 18634 vs. 18206, i.e. less than 2.5% difference.
And with more expensive transactions (larger scale, writes, ...) the
difference will be much smaller.

It's true that this might produce large logs, especially when the runs
are long, but that has nothing to do with fprintf, and it can easily be
addressed by using a dedicated client machine or by sampling the
transaction log.

Introducing actual synchronization between the threads (by locking
inside fprintf) is, however, a completely different thing.

> Note that even without mutex fprintf may be considered a "heavy
> function" which is going to slow down the transaction rate
> significantly. That could be tested as well.
>
> It is possible to reduce the lock time by preparing the string
> (which would mean introducing buffers) and just do a "fputs" under
> mutex. That would not reduce the print time anyway, and that may add
> malloc/free operations, though.

I seriously doubt fprintf does the string formatting while holding a
lock on the file. So by doing this you'd only be simulating what
fprintf() already does (assuming it's thread-safe on your platform) and
gain nothing.

>
>> The way to know if there's a real problem here is to test it, but
>> I'd be pretty surprised if there isn't.
>
> Indeed, I think I can contrive a simple example where it is,
> basically a more or less empty or read only transaction (eg SELECT
> 1).

That would be nice, because my quick testing suggests it's not the case.

> My opinion is that there is a tradeoff between code simplicity and
> later maintenance vs feature benefit.
>
> If threads are assumed and fprintf is used, the feature is much
> simpler to implement, and the maintenance is lighter.

I think the "if threads are assumed" part makes this dead in the water,
unless someone wants to spend time on getting rid of the thread
emulation. Removing the code is quite simple, but researching whether
we can do that will be difficult IMHO - I have no idea which of the
supported platforms require the emulation etc. And I envision endless
discussions about this.

> The alternative implementation means reparsing the generated files
> over and over for merging their contents.

I agree that the current implementation is not particularly pretty, and
I plan to get rid of the copy&paste parts etc.

> Also, I do not think that the detailed log provides much benefit
> with very fast transactions, where probably the aggregate is a much
> better choice anyway. If the user persists, she may generate a
> per-thread log and merge it later, in which case a merge script is
> needed, but I do not think that would be a bad thing.

I disagree with this - I use transaction logs (either complete or
sampled) quite often. I also explained why I think a separate merge
script is awkward to use.

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
