Re: Speed dblink using alternate libpq tuple storage

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: markokr(at)gmail(dot)com, greg(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org, mmoncure(at)gmail(dot)com, shigeru(dot)hanada(at)gmail(dot)com
Subject: Re: Speed dblink using alternate libpq tuple storage
Date: 2012-03-23 09:08:04
Message-ID: 20120323.180804.06014971.horiguchi.kyotaro@oss.ntt.co.jp
Lists: pgsql-hackers

Thank you for picking this up.

> is considering three cases: it got a 2-byte integer (and can continue on),
> or there aren't yet 2 more bytes available in the buffer, in which case it
> should return EOF without doing anything, or pqGetInt detected a hard
> error and updated the connection error state accordingly, in which case
> again there is nothing to do except return EOF. In the patched code we
> have:
...
> which handles neither the second nor third case correctly: it thinks that
> "data not here yet" is a hard error, and then makes sure it is an error by
> destroying the parsing state :-(.

Marko and I think that, in protocol 3, all bytes of the incoming
message are guaranteed to have been loaded before entering
getAnotherTuple(). The following part in pqParseInput3() ensures
this:

| if (avail < msgLength)
| {
|     /*
|      * Before returning, enlarge the input buffer if needed to hold
|      * the whole message.
|      (snipped)..
|      */
|     if (pqCheckInBufferSpace(conn->inCursor + (size_t) msgLength,
|                              conn))
|     {
|         /*
|          * XXX add some better recovery code...
|          (snipped)..
|          */
|         handleSyncLoss(conn, id, msgLength);
|     }
|     return;

So, if the cursor state is broken just after exiting
getAnotherTuple(), it had already been broken BEFORE entering
getAnotherTuple(), according to the current design. That is what
the 'protocol error' there means: pqGetInt at that point should
detect no error other than a broken message.
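To illustrate, here is a minimal sketch (NOT the patch code itself;
the column-value parsing is elided and pqSaveErrorResult stands in
for the full error path) of what the error branch reduces to under
that guarantee:

  /*
   * Sketch only: pqParseInput3() has already made sure msgLength
   * bytes are in the buffer, so a nonzero return from pqGetInt()
   * here cannot mean "data not yet arrived"; it can only mean a
   * malformed message.
   */
  static int
  getAnotherTuple(PGconn *conn, int msgLength)
  {
      int         nfields;

      if (pqGetInt(&nfields, 2, conn))
      {
          /* broken message; record the error and give up */
          pqSaveErrorResult(conn);
          return EOF;
      }

      /* ... parse the nfields column values ... */
      return 0;
  }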

> error, that possibly-useful error message is overwritten with an entirely
> useless "protocol error" text.

Besides, the current pqGetInt seems to set its own error message
only for an invalid 'bytes' parameter.

On the other hand, in protocol 2 (to be removed?), the error
handling mechanism had to be touched, because a full load of the
message is not guaranteed there.

> I don't think the error return cases for the row processor have been
> thought out too well either. The row processor is not in charge of what
> happens to the PGresult,

The default row processor stuffs the PGresult with tuples; others
(say, dblink's) leave it empty. Each row processor manages the
PGresult to the extent of its own needs.

> and it certainly has no business telling libpq to just "exit
> immediately from the topmost libpq function". If we do that
> we'll probably lose sync with the data stream and be unable to
> recover use of the connection at all.

I don't think PGresult takes any part in the error handling
system in the current implementation. And nothing in the patch
should amount to 'exit immediately from the topmost libpq
function'.

The exit routes from the row processor are the following:

- Do a longjmp (or use the PG_TRY-CATCH mechanism) out of the row
processor.

- Have the row processor return 0 when entered from PQisBusy(),
which makes PQisBusy() return immediately.

Cursor consistency is kept in both cases: the cursor is already
positioned just past the last byte of the current message.
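For illustration, a hypothetical row processor using the second
route might look like this (the callback type name, the PGrowValue
layout, and the helper functions are assumptions here, not
necessarily the patch's exact API):

  static int
  my_row_processor(PGresult *res, const PGrowValue *columns, void *param)
  {
      struct my_state *state = param;

      if (buffer_is_full(state))      /* hypothetical application check */
          return 0;   /* suspend; PQisBusy() returns immediately */

      store_row(state, res, columns); /* hypothetical storage routine */
      return 1;       /* continue parsing the stream */
  }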

> Also, do we need to consider any error cases for the row
> processor other than out-of-memory? If so it might be a good
> idea for it to have some ability to store a custom error
> message into the PGconn, which it cannot do given the current
> function API.

For dblink or PQgetRow it does not seem necessary enough to
justify the additional complexity expected around it, so this
patch does not include it.

> In the same vein, I am fairly uncomfortable with the blithe assertion that
> a row processor can safely longjmp out of libpq. This was never foreseen
> in the original library coding and there are any number of places that
> that might break, now or in the future. Do we really need to allow it?

To keep the row processor from longjmp'ing out, in an early
version I enclosed the functions that could potentially throw an
exception in a PG_TRY-CATCH clause. This was totally safe, but the
penalty was not negligible because the TRY-CATCH was entered for
every row.
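For reference, the early version looked roughly like this on the
dblink side (simplified; the error-data handling is elided and the
-1 failure convention is an assumption here):

  static int
  storeHandler(PGresult *res, const PGrowValue *columns, void *param)
  {
      int     ret = 1;

      PG_TRY();
      {
          store_one_row(param, res, columns);   /* may elog(ERROR) */
      }
      PG_CATCH();
      {
          FlushErrorState();    /* simplified; real code keeps errdata */
          ret = -1;             /* report the failure to libpq instead */
      }
      PG_END_TRY();

      return ret;
  }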

> If we do, it would be a good idea to decorate the libpq
> functions that are now expected to possibly longjmp with
> comments saying so. Tracing all the potential call paths that
> might be aborted by a longjmp is an essential activity anyway.

Concerning the present, if not the future, I can show the trail
of the confirmation process:

- There is no difference, with or without the patch, at the level
of getAnotherTuple() from the viewpoint of consistency.

- Assume pqParseInput3 detects that the next message has not yet
arrived after getAnotherTuple returned. It exits immediately on
reading the length of the next message. This is the same
condition as after a longjmp.

> if (pqGetInt(&msgLength, 4, conn))
> return;

- parseInput passes this through and immediately exits in a
consistent state.

- The callers of PQgetResult, PQisBusy, PQskipResult, PQnotifies,
PQputCopyData, and pqHandleSendFailure finally regain control. I
am convinced that the async status at that time must be
PGASYNC_BUSY and the connection cursor in a consistent state.

So the documentation encourages the caller above the row
processor to call PQfinish, PQgetResult, or PQskipResult after
catching the longjmp. These functions resolve the intermediate
state described above, left by the longjmp, by restarting the
parsing of the stream afterwards.
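A sketch of that recovery on the dblink side (assuming the
PQskipResult() proposed by the patch; its exact signature here is
an assumption):

  PG_TRY();
  {
      /* PQgetResult() here runs the row processor, which may
       * elog(ERROR) and thereby longjmp over libpq */
      res = PQgetResult(conn);
  }
  PG_CATCH();
  {
      PGresult   *r;

      /* conn is PGASYNC_BUSY with a consistent cursor here */
      PQskipResult(conn);               /* discard the remaining rows */
      while ((r = PQgetResult(conn)) != NULL)
          PQclear(r);                   /* drain trailing results */
      PG_RE_THROW();
  }
  PG_END_TRY();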

And about the future: although it is a matter of course that any
change to the code can break anything, a longjmp stepping over
libpq-internal code seems complicated enough to cause trouble. I
will mark the relevant functions with 'This function may be
skipped over by a longjmp invoked in its descendants.' (Better
wordings are welcome.)

> Another design deficiency is PQskipResult(). This is badly
> designed for async operation because once you call it, it will
> absolutely not give back control until it's read the whole
> query result. (It'd be better to have it set some kind of flag
> that makes future library calls discard data until the query
> result end is reached.)

If this function is called just after a longjmp out of the row
processor, the async status of the connection at that time must
be PGASYNC_BUSY. So I think this function always returns, even if
the longjmp took place at the last row of a result: there must be
a following 'C' (CommandComplete) message unless the stream is
broken.

> Something else that's not very well-designed about it is that
> it might throw away more than just incoming tuples. As an
> example, suppose that the incoming data at the time you call it
> consists of half a dozen rows followed by an ErrorResponse.
> The ErrorResponse will be eaten and discarded, leaving the
> application no clue why its transaction has been aborted, or
> perhaps even the entire session cancelled.

If the caller needs to see the following ErrorResponses, I think
calling PQgetResult is enough.
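That is, the trailing ErrorResponse stays visible as long as the
application keeps fetching, e.g. (plain libpq API):

  PGresult   *r;

  while ((r = PQgetResult(conn)) != NULL)
  {
      if (PQresultStatus(r) == PGRES_FATAL_ERROR)
          fprintf(stderr, "%s", PQresultErrorMessage(r));
      PQclear(r);
  }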

> What we probably want here is just to transiently install a
> row processor that
> discards all incoming data, but the application is still
> expected to eventually fetch a PGresult that will tell it
> whether the server side of the query completed or not.

The all-discarding row processor should not be visible to users;
they can design their own row processor to behave that way if
they want. This is merely a shortcut, but it seems difficult to
provide the behavior without it.

> In the dblink patch, ... it's not clear what is the benefit of
> using malloc rather than palloc to manage the intermediate
> buffers in storeHandler --- what that seems to mostly
> accomplish is increase the risk of permanent memory leaks.

Hmm. I thought that palloc is heavier than malloc, and the
allocated blocks looked well enough controlled that they would be
unlikely to leak. I will change the code to use palloc, taking
into account the discussion about block granularity below.

> I don't see much value in per-column buffers either; it'd
> likely be cheaper to just palloc one workspace that's big
> enough for all the column strings together.

In the early version I hated having to count the total length
before copying the contents. In the latest version the total
memory requirement is easily obtained, so allocating it all
together seems good. I'll do so.
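A sketch of the planned change (the columns[].len/.value layout
and the cstrs[] output array are illustrative; NULL columns are
omitted for brevity):

  size_t      total = 0;
  char       *buf,
             *ptr;
  int         i;

  /* one palloc'd workspace sized for all column strings together */
  for (i = 0; i < nfields; i++)
      total += columns[i].len + 1;          /* value plus NUL */

  buf = palloc(total);
  ptr = buf;
  for (i = 0; i < nfields; i++)
  {
      cstrs[i] = ptr;
      memcpy(ptr, columns[i].value, columns[i].len);
      ptr[columns[i].len] = '\0';
      ptr += columns[i].len + 1;
  }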

> And speaking of leaks, doesn't storeHandler leak the
> constructed tuple on each call, not to mention whatever might
> be leaked by the called datatype input functions?

I copied the process of the original materializeResult() into
storeHandler(). tuplestore_donestoring was removed because it is
obsolete. So I think there is no problem there. Am I wrong?

> It also seems to me that the dblink patch breaks the case
> formerly handled in materializeResult() of a PGRES_COMMAND_OK
> rather than PGRES_TUPLES_OK result. The COMMAND case is
> supposed to be converted to a single-column text result, and I
> sure don't see where that's happening now.

I'm sorry. I will restore that route.
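For the record, this is roughly what the original
materializeResult() does and what will be restored:

  if (PQresultStatus(res) == PGRES_COMMAND_OK)
  {
      /* the command tag becomes a single-column text result */
      TupleDesc   tupdesc;
      char       *values[1];

      tupdesc = CreateTemplateTupleDesc(1, false);
      TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
                         TEXTOID, -1, 0);
      values[0] = PQcmdStatus(res);
      /* ... build the single tuple and put it into the tuplestore ... */
  }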

> BTW, am I right in thinking that some of the hunks in this
> patch are meant to fix a pre-existing bug of failing to report
> the correct connection name? If so, it'd likely be better to
> split those out and submit as a separate patch, instead of
> conflating them with a feature addition.

Ok. I will split it.

> Lastly, as to the pqgetrow patch, the lack of any demonstrated
> test case for these functions makes me uncomfortable as to
> whether they are well designed. Again, I'm unconvinced that
> the error handling is good or that they work sanely in async
> mode. I'm inclined to just drop these for this go-round, and
> to stick to providing the features that we can test via the
> dblink patch.

Testing PQgetRow via dblink?

Do you mean dropping PQgetRow by 'drop these'? If so, this part
can be dropped separately from the rest.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
