Hello. This message is a proposal of a pair of patches that enables the memory allocator for PGresult in libpq to be replaced. The comment at the the begging of pqexpbuffer.c says that libpq should not rely on palloc(). Besides, Tom Lane said that palloc should not be visible outside the backend(*1) and I agree with it. *1: http://archives.postgresql.org/pgsql-hackers/1999-02/msg00364.php On the other hand, in dblink, dblink-plus (our product!), and maybe FDW's connect to other PostgreSQL servers are seem to copy the result of the query contained in PGresult into tuple store. I guess that this is in order to avoid memory leakage on termination in halfway. But it is rather expensive to copy whole PGresult, and the significance grows as the data received gets larger. Furthermore, it requires about twice as much memory as the net size of the data. And it is fruitless to copy'n modify libpq or reinvent it from scratch. So we shall be happy to be able to use palloc's in libpq at least for PGresult for such case in spite of the policy. For these reasons, I propose to make allocators for PGresult replaceable. The modifications are made up into two patches. 1. dupEvents() and pqAddTuple() get new memory block by malloc currently, but the aquired memory block is linked into PGresult finally. So I think it is preferable to use pqResultAlloc() or its descendents in consistensy with the nature of the place to link. But there is not PQresultRealloc() and it will be costly, so pqAddTuple() is not modified in this patch. 2. Define three function pointers PQpgresult_(malloc|realloc|free) and replace the calls to malloc/realloc/free in the four functions below with these pointers. PQmakeEmptyPGresult() pqResultAlloc() PQclear() pqAddTuple() This patches make the tools run in backend process and use libpq possible to handle PGresult as it is with no copy, no more memory. (Of cource, someone wants to use his/her custom allocator for PGresult on standalone tools could do that using this feature.) Three files are attached to this message. First, the patch with respect to "1" above. Second, the patch with respect to "2" above. Third, a very simple sample program. I have built and briefly tested on CentOS6, with the sample program mentioned above and valgrind, but not on Windows. How do you think about this? Regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On 11.11.2011 11:18, Kyotaro HORIGUCHI wrote: > The comment at the the begging of pqexpbuffer.c says that libpq > should not rely on palloc(). Besides, Tom Lane said that palloc > should not be visible outside the backend(*1) and I agree with > it. > > *1: http://archives.postgresql.org/pgsql-hackers/1999-02/msg00364.php > > On the other hand, in dblink, dblink-plus (our product!), and > maybe FDW's connect to other PostgreSQL servers are seem to copy > the result of the query contained in PGresult into tuple store. I > guess that this is in order to avoid memory leakage on > termination in halfway. > > But it is rather expensive to copy whole PGresult, and the > significance grows as the data received gets larger. Furthermore, > it requires about twice as much memory as the net size of the > data. And it is fruitless to copy'n modify libpq or reinvent it > from scratch. So we shall be happy to be able to use palloc's in > libpq at least for PGresult for such case in spite of the policy. > > > For these reasons, I propose to make allocators for PGresult > replaceable. You could use the resource owner mechanism to track them. Register a callback function with RegisterResourceReleaseCallback(). Whenever a PGresult is returned from libpq, add it to e.g a linked list, kept in TopMemoryContext, and also store a reference to CurrentResourceOwner in the list element. In the callback function, scan through the list and free all the PGresults associated with the resource owner that's being released. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes: > On 11.11.2011 11:18, Kyotaro HORIGUCHI wrote: >> The comment at the the begging of pqexpbuffer.c says that libpq >> should not rely on palloc(). Besides, Tom Lane said that palloc >> should not be visible outside the backend(*1) and I agree with >> it. >> >> *1: http://archives.postgresql.org/pgsql-hackers/1999-02/msg00364.php >> >> On the other hand, in dblink, dblink-plus (our product!), and >> maybe FDW's connect to other PostgreSQL servers are seem to copy >> the result of the query contained in PGresult into tuple store. I >> guess that this is in order to avoid memory leakage on >> termination in halfway. >> >> But it is rather expensive to copy whole PGresult, and the >> significance grows as the data received gets larger. Furthermore, >> it requires about twice as much memory as the net size of the >> data. And it is fruitless to copy'n modify libpq or reinvent it >> from scratch. So we shall be happy to be able to use palloc's in >> libpq at least for PGresult for such case in spite of the policy. >> >> For these reasons, I propose to make allocators for PGresult >> replaceable. > You could use the resource owner mechanism to track them. Heikki's idea is probably superior so far as PG backend usage is concerned in isolation, but I wonder if there are scenarios where a client application would like to be able to manage libpq's allocations. If so, Kyotaro-san's approach would solve more problems than just dblink's. However, the bigger picture here is that I think Kyotaro-san's desire to not have dblink return a tuplestore may be misplaced. Tuplestores can spill to disk, while PGresults don't; so the larger the result, the more important it is to push it into a tuplestore and PQclear it as soon as possible. Despite that worry, it'd likely be a good idea to adopt one or the other of these solutions anyway, because I think there are corner cases where dblink.c can leak a PGresult --- for instance, what if dblink_res_error fails due to out-of-memory before reaching PQclear? And we could get rid of the awkward and none-too-cheap PG_TRY blocks that it uses to try to defend against such leaks in other places. So I'm in favor of making a change along that line, although I'd want to see more evidence before considering changing dblink to not return tuplestores. regards, tom lane
* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote: > Heikki's idea is probably superior so far as PG backend usage is > concerned in isolation, but I wonder if there are scenarios where a > client application would like to be able to manage libpq's allocations. The answer to that is certainly 'yes'. It was one of the first things that I complained about when moving from Oracle to PG. With OCI, you can bulk load results directly into application-allocated memory areas. Haven't been following the dblink discussion, so not going to comment about that piece. Thanks, Stephen
Stephen Frost <sfrost(at)snowman(dot)net> writes: > * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote: >> Heikki's idea is probably superior so far as PG backend usage is >> concerned in isolation, but I wonder if there are scenarios where a >> client application would like to be able to manage libpq's allocations. > The answer to that is certainly 'yes'. It was one of the first things > that I complained about when moving from Oracle to PG. With OCI, you > can bulk load results directly into application-allocated memory areas. Well, loading data in a form whereby the application can access it without going through the PGresult accessor functions would be an entirely different (and vastly larger) project. I'm not sure I want to open that can of worms --- it seems like you could write a huge amount of code trying to provide every format someone might want, and still find that there were impedance mismatches for many applications. AIUI Kyotaro-san is just suggesting that the app should be able to provide a substitute malloc function for use in allocating PGresult space (and not, I think, anything else that libpq allocates internally). Basically this would allow PGresults to be cleaned up with methods other than calling PQclear on each one. It wouldn't affect how you'd interact with one while you had it. That seems like pretty much exactly what we want for preventing memory leaks in the backend; but is it going to be useful for other apps? regards, tom lane
On Sat, Nov 12, 2011 at 12:48 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote: > AIUI Kyotaro-san is just suggesting that the app should be able to > provide a substitute malloc function for use in allocating PGresult > space (and not, I think, anything else that libpq allocates internally). > Basically this would allow PGresults to be cleaned up with methods other > than calling PQclear on each one. It wouldn't affect how you'd interact > with one while you had it. That seems like pretty much exactly what we > want for preventing memory leaks in the backend; but is it going to be > useful for other apps? I think it will. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp> writes: > Hello. This message is a proposal of a pair of patches that > enables the memory allocator for PGresult in libpq to be > replaced. Since there seems to be rough consensus that something like this would be a good idea, I looked more closely at the details of the patch. I think the design could use some adjustment. To start with, the patch proposes exposing some global variables that affect the behavior of libpq process-wide. This seems like a pretty bad design, because a single process could contain multiple usages of libpq with different requirements. As an example, if dblink.c were to set these variables inside a backend process, it would break usage of libpq from PL/Perl via DBI. Global variables also tend to be a bad idea whenever you think about multi-threaded applications --- they require locking facilities, which are not in this patch. I think it'd be better to consider the PGresult alloc/free functions to be a property of a PGconn, which you'd set with a function call along the lines of PQsetResultAllocator(conn, alloc_func, realloc_func, free_func) after having successfully opened a connection. Then we just have some more PGconn fields (and I guess PGresult will need a copy of the free_func pointer) and no new global variables. I am also feeling dubious about whether it's a good idea to expect the functions to have exactly the signature of malloc/free. They are essentially callbacks, and in most places where a library provides for callbacks, it's customary to include a "void *" passthrough argument in case the callback needs some context information. I am not sure that dblink.c would need such a thing, but if we're trying to design a general-purpose feature, then we probably should have it. The cost would be having shim functions inside libpq for the default case, but it doesn't seem likely that they'd cost enough to notice. The patch lacks any user documentation, which it surely must have if we are claiming this is a user-visible feature. And I think it could use some attention to updating code comments, notably the large block about PGresult space management near the top of fe-exec.c. Usually, when writing a feature of this sort, it's a good idea to implement a prototype use-case to make sure you've not overlooked anything. So I'd feel happier about the patch if it came along with a patch to make dblink.c use it to prevent memory leaks. regards, tom lane
Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes: > On 11.11.2011 11:18, Kyotaro HORIGUCHI wrote: >> For these reasons, I propose to make allocators for PGresult >> replaceable. > You could use the resource owner mechanism to track them. BTW, I just thought of a potentially fatal objection to making PGresult allocation depend on palloc: libpq is absolutely not prepared to handle losing control on out-of-memory. While I'm not certain that its behavior with malloc is entirely desirable either (it tends to loop in hopes of getting the memory next time), we cannot just plop in palloc in place of malloc and imagine that we're not breaking it. This makes me think that Heikki's approach is by far the more tenable one, so far as dblink is concerned. Perhaps the substitute-malloc idea is still useful for some other application, but I'm inclined to put that idea on the back burner until we have a concrete use case for it. regards, tom lane
On 12/11/2011 07:36, Robert Haas wrote: > On Sat, Nov 12, 2011 at 12:48 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote: >> AIUI Kyotaro-san is just suggesting that the app should be able to >> provide a substitute malloc function for use in allocating PGresult >> space (and not, I think, anything else that libpq allocates internally). >> Basically this would allow PGresults to be cleaned up with methods other >> than calling PQclear on each one. It wouldn't affect how you'd interact >> with one while you had it. That seems like pretty much exactly what we >> want for preventing memory leaks in the backend; but is it going to be >> useful for other apps? > > I think it will. Maybe I've just talking nonsense, I just have little experience hacking the pgsql and pdo-pgsql exstensions, but to me it would seem something that could easily avoid an extra duplication of the data returned by pqgetvalue. To me it seems a pretty nice win. Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote: > Well, loading data in a form whereby the application can access it > without going through the PGresult accessor functions would be an > entirely different (and vastly larger) project. Looking through the thread, I agree that it's a different thing than what's being discussed here. > I'm not sure I want > to open that can of worms --- it seems like you could write a huge > amount of code trying to provide every format someone might want, > and still find that there were impedance mismatches for many > applications. The OCI approach is actually very similar to how we handle our catalogs internally.. Imagine you define a C struct which matched your table structure, then you allocate 5000 (or however) of those, give the base pointer to the 'getResult' call and a integer array of offsets into that structure for each of the columns. There might have been a few other minor things (like some notion of how to handle NULLs), but it was pretty straight-forward from the C perspective, imv. Trying to provide alternative formats (I'm guessing you were referring to something like XML..? Or some complex structure?) would certainly be a whole different ballgame. Thanks, Stephen > AIUI Kyotaro-san is just suggesting that the app should be able to > provide a substitute malloc function for use in allocating PGresult > space (and not, I think, anything else that libpq allocates internally). > Basically this would allow PGresults to be cleaned up with methods other > than calling PQclear on each one. It wouldn't affect how you'd interact > with one while you had it. That seems like pretty much exactly what we > want for preventing memory leaks in the backend; but is it going to be > useful for other apps? > > regards, tom lane
Hello, At Fri, 11 Nov 2011 11:29:30 +0200, Heikki Linnakangas wrote > You could use the resource owner mechanism to track > them. Register a callback function with > RegisterResourceReleaseCallback(). Thank you for letting me know about it. I have dug up a message in pg-hackers refering to the mechanism on discussion about postgresql-fdw. I'll put further thought into dblink-plus taking it into account. By the way, thinking about memory management for the result in libpq is considerable as another issue. At Sat, 12 Nov 2011 12:29:50 -0500, Tom Lane wrote > To start with, the patch proposes exposing some global > variables that affect the behavior of libpq process-wide. This > seems like a pretty bad design, because a single process could > contain multiple usages of libpq You're right to say the design is bad. I've designed it to have minimal impact on libpq by limiting usage and imposing any reponsibility on the users, that is the developers of the modules using it. If there are any other applications that want to use their own allocators, there are some points to be considered. I think it is preferable consiering multi-threading to make libpq write PGresult into memory blocks passed from the application like OCI does, instead of letting libpq itself make request for them. This approach hands the responsibility of memory management to the user and gives them the capability to avoid memory exhaustion by their own measures. On the other hand, this way could produce the situation that libpq cannot write all of the data to receive from the server onto handed memory block. So, the API must be able to return the partial data to the caller. More advancing, if libpq could store the result directly into user-allocated memory space using tuplestore-like interface, it is better on performance if the final storage is a tuplestore itself. I will be happy with the part-by-part passing of result. So I will think about this as the next issue. > So I'd feel happier about the patch if it came along with a > patch to make dblink.c use it to prevent memory leaks. I take it is about my original patch. Mmm, I heard that dblink copies received data in PGResult to tuple store not only because of the memory leak, but less memory usage (after the copy is finished). I think I could show you the patch ignoring the latter, but it might take some time for me to start from understand dblink and tuplestore closely... If I find RegisterResourceReleaseCallback short for our requirement, I will show it. If not, I withdraw this patch for ongoing CF and propose another patch based on the discussion above at another time. Please let me have a little more time. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hello, me> I'll put further thought into dblink-plus taking it into me> account. .. me> Please let me have a little more time. I've inquired the developer of dblink-plus about RegisterResourceReleaseCallback(). He said that the function is in bad compatibility with current implementation. In addition to this, storing into tuplestore directly seems to me a good idea than palloc'ed PGresult. So I tried to make libpq/PGresult be able to handle alternative tuple store by hinting to PGconn, and modify dblink to use the mechanism as the first sample code. I will show it as a series of patches in next message. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hello, This is the next version of Allow substitute allocators
for PGresult.
Totally chaning the concept from the previous one, this patch
allows libpq to handle alternative tuple store for received
tuples.
Design guidelines are shown below.
- No need to modify existing client code of libpq.
- Existing libpq client runs with roughly same performance, and
dblink with modification runs faster to some extent and
requires less memory.
I have measured roughly of run time and memory requirement for
three configurations on CentOS6 on Vbox with 2GB mem 4 cores
running on Win7-Corei7, transferring (30 bytes * 2 cols) *
2000000 tuples (120MB net) within this virutal machine. The
results are below.
xfer time Peak RSS
Original : 6.02s 850MB
libpq patch + Original dblink : 6.11s 850MB
full patch : 4.44s 643MB
xfer time here is the mean of five 'real time's measured by
running sql script like this after warmup run.
=== test.sql
select dblink_connect('c', 'host=localhost port=5432 dbname=test');
select * from dblink('c', 'select a,c from foo limit 2000000') as (a text, b bytea) limit 1;
select dblink_disconnect('c');
===
$ for i in $(seq 1 10); do time psql test -f t.sql; done
===
Peak RSS is measured by picking up heap Rss in /proc/[pid]/smaps.
It seems somewhat slow using patched libpq and original dblink,
but it seems within error range too. If this amount of slowdown
is not permissible, it might be improved by restoring the static
call route before for extra redundancy of the code.
On the other hand, full patch version seems obviously fast and
requires less memory. Isn't it nice?
This patch consists of two sub patches.
The first is a patch for libpq to allow rewiring tuple storage
mechanism. But default behavior is not changed. Existing libpq
client should run with it.
The second is modify dblink to storing received tuples into
tuplestore directly using the mechanism above.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Ouch! I'm sorry for making a reverse patch for the first modification. This is an amendment of the message below. The body text is copied into this message. http://archives.postgresql.org/message-id/20111201(dot)192419(dot)103527179(dot)horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp ======= Hello, This is the next version of Allow substitute allocators for PGresult. Totally chaning the concept from the previous one, this patch allows libpq to handle alternative tuple store for received tuples. Design guidelines are shown below. - No need to modify existing client code of libpq. - Existing libpq client runs with roughly same performance, and dblink with modification runs faster to some extent and requires less memory. I have measured roughly of run time and memory requirement for three configurations on CentOS6 on Vbox with 2GB mem 4 cores running on Win7-Corei7, transferring (30 bytes * 2 cols) * 2000000 tuples (120MB net) within this virutal machine. The results are below. xfer time Peak RSS Original : 6.02s 850MB libpq patch + Original dblink : 6.11s 850MB full patch : 4.44s 643MB xfer time here is the mean of five 'real time's measured by running sql script like this after warmup run. === test.sql select dblink_connect('c', 'host=localhost port=5432 dbname=test'); select * from dblink('c', 'select a,c from foo limit 2000000') as (a text, b bytea) limit 1; select dblink_disconnect('c'); === $ for i in $(seq 1 10); do time psql test -f t.sql; done === Peak RSS is measured by picking up heap Rss in /proc/[pid]/smaps. It seems somewhat slow using patched libpq and original dblink, but it seems within error range too. If this amount of slowdown is not permissible, it might be improved by restoring the static call route before for extra redundancy of the code. On the other hand, full patch version seems obviously fast and requires less memory. Isn't it nice? This patch consists of two sub patches. The first is a patch for libpq to allow rewiring tuple storage mechanism. But default behavior is not changed. Existing libpq client should run with it. The second is modify dblink to storing received tuples into tuplestore directly using the mechanism above. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hello, The documentation had slipped my mind. This is the patch to add the documentation of PGresult custom storage. It shows in section '31.19. Alternative result storage'. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On 12/01/2011 05:48 AM, Kyotaro HORIGUCHI wrote: > xfer time Peak RSS > Original : 6.02s 850MB > libpq patch + Original dblink : 6.11s 850MB > full patch : 4.44s 643MB > These look like interesting results. Currently Tom is listed as the reviewer on this patch, based on comments made before the CF really started. And the patch has been incorrectly been sitting in "Waiting for author" for the last week; oops. I'm not sure what to do with this one now except raise a general call to see if anyone wants to take a look at it, now that it seems to be in good enough shape to deliver measurable results. -- Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
Greg Smith <greg(at)2ndQuadrant(dot)com> writes: > On 12/01/2011 05:48 AM, Kyotaro HORIGUCHI wrote: >> xfer time Peak RSS >> Original : 6.02s 850MB >> libpq patch + Original dblink : 6.11s 850MB >> full patch : 4.44s 643MB > These look like interesting results. Currently Tom is listed as the > reviewer on this patch, based on comments made before the CF really > started. And the patch has been incorrectly been sitting in "Waiting > for author" for the last week; oops. I'm not sure what to do with this > one now except raise a general call to see if anyone wants to take a > look at it, now that it seems to be in good enough shape to deliver > measurable results. I did list myself as reviewer some time ago, but if anyone else wants to take it I won't be offended ;-) regards, tom lane
On Thu, Dec 8, 2011 at 5:41 AM, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp> wrote: > This is the patch to add the documentation of PGresult custom > storage. It shows in section '31.19. Alternative result > storage'. It would be good to consolidate this into the main patch. I find the names of the functions added here to be quite confusing and would suggest renaming them. I expected PQgetAsCstring to do something similar to PQgetvalue, but the code is completely different, and even after reading the documentation I still don't understand what that function is supposed to be used for. Why "as cstring"? What would the other option be? I also don't think the "add tuple" terminology is particularly good. It's not obvious from the name that what you're doing is overriding the way memory is allocated and results are stored. Also, what about the problem Tom mentioned here? http://archives.postgresql.org/message-id/1042(dot)1321123761(at)sss(dot)pgh(dot)pa(dot)us -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello, thank you for taking the time for comment.
At Wed, 21 Dec 2011 11:09:59 -0500, Robert Haas <robertmhaas(at)gmail(dot)com> wrote
> I find the names of the functions added here to be quite
> confusing and would suggest renaming them. I expected
> PQgetAsCstring to do something similar to PQgetvalue, but the
> code is completely different,
To be honest, I've also felt that kind of perplexity. If the
problem is simply of the naming, I can propose the another name
"PQreadAttValue"... This is not so good too...
But...
> and even after reading the documentation I still don't
> understand what that function is supposed to be used for. Why
> "as cstring"? What would the other option be?
Is it a problem of the poor description? Or about the raison
d'être of the function?
The immediate cause of the existence of the function is that
getAnotherTuple internally stores the field values of the tuples
sent from the server, in the form of PGresAttValue, and I have
found only one route to store a tuple into TupleStore is
BuildeTupleFromCStrings() to tupelstore_puttuple() which is
dblink does in materializeResult(), and of cource C-string is the
most natural format in C program, and I have hesitated to modify
execTuples.c, and I wanted to hide the details of PGresAttValue.
Assuming that the values are passed as PGresAttValue* is given
(for the reasons of performance and the extent of the
modification), the "adding tuple" functions should get the value
from the struct. This can be done in two ways from the view of
authority (`encapsulation', in other words) and convenience, one
is with the knowledge of the structure, and the other is without
it. PQgetAsCstring is the latter approach. (And it is
inconsistent with the fact that the definition of PGresAttValue
is moved into lipq-fe.h from libpq-int.h. The details of the
structure should be hidden like PGresult in this approach).
But it is not obvious that the choice is better than the another
one. If we consider that PGresAttValue is too simple and stable
to hide the details, PQgetAsCString will be taken off and the
problem will go out. PGresAttValue needs documentation in this
case.
I prefer to handle PGresAttValue directly if no problem.
> I also don't think the "add tuple" terminology is particularly good.
> It's not obvious from the name that what you're doing is overriding
> the way memory is allocated and results are stored.
This phrase is taken from pqAddTuple() in fe-exec.c at first and
have not been changed after the function is integrated with other
functions.
I propose "tuple storage handler" for the alternative.
- typedef void *(*addTupleFunction)(...);
+ typedef void *(*tupleStorageHandler)(...);
- typedef enum { ADDTUP_*, } AddTupFunc;
+ typedef enum { TSHANDLER_*, } TSHandlerCommand;
- void *PQgetAddTupleParam(...);
+ void *PQgetTupleStrageHandlerContext(...);
- void PQregisterTupleAdder(...);
+ void PQregisterTupleStoreHandler(...);
- addTupleFunction PGresult.addTupleFunc;
+ tupleStorageHandler PGresult.tupleStorageHandlerFunc;
- void *PGresult.addTuleFuncParam;
+ void *PGresult.tupleStorageHandlerContext;
- char *PGresult.addTuleFuncErrMes;
+ void *PGresult.tupelStrageHandlerErrMes;
> Also, what about the problem Tom mentioned here?
>
> http://archives.postgresql.org/message-id/1042(dot)1321123761(at)sss(dot)pgh(dot)pa(dot)us
The plan that simply replace malloc's with something like
palloc's is abandoned for the narrow scope.
dblink-plus copies whole PGresult into TupleStore in order to
avoid making orphaned memory on SIGINT. The resource owner
mechanism is principally applicable to that but practically hard
for the reason that current implementation without radically
modification couldn't accept it.. In addition to that, dblink
also does same thing for maybe the same reason with dblink-plus
and another reason as far as I heard.
Whatever the reason is, both dblink and dblink-plus do the same
thing that could lower the performance than expected.
If TupleStore(TupleDesc) is preferable to PGresult for in-backend
use and oridinary(client-use) libpq users can handle only
PGresult, the mechanism like this patch would be reuired to
maintain the compatibility, I think. To the contrary, if there is
no significant reason to use TupleStore in backend use - it
implies that existing mechanisms like resource owner can save the
backend inexpensively from possible inconvenience caused by using
PGresult storage in backends - PGresult should be used as it is.
I think TupleStore prepared to be used in backend is preferable
for the usage and don't want to get data making detour via
PGresult.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
One patch that fell off the truck during a turn in the November CommitFest was "Allow substitute allocators for PGresult". Re-reading the whole thing again, it actually turned into a rather different submission in the middle, and I know I didn't follow that shift correctly then. I'm replying to its thread but have changed the subject to reflect that change. From a procedural point of view, I don't feel right kicking this back to its author on a Friday night when the deadline for resubmitting it would be Sunday. Instead I've refreshed the patch myself and am adding it to the January CommitFest. The new patch is a single file; it's easy enough to split out the dblink changes if someone wants to work with the pieces separately. After my meta-review I think we should get another reviewer familiar with using dblink to look at this next. This is fundamentally a performance patch now. Some results and benchmarking code were submitted along with it; the other issues are moot if those aren't reproducible. The secondary goal for a new review here is to provide another opinion on the naming issues and abstraction concerns raised so far. To clear out the original line of thinking, this is not a replacement low-level storage allocator anymore. The idea of using such a mechanism to help catch memory leaks has also been dropped. Instead this adds and documents a new path for libpq callers to more directly receive tuples, for both improved speed and lower memory usage. dblink has been modified to use this new mechanism. Benchmarking by the author suggests no significant change in libpq speed when only that change was made, while the modified dblink using the new mechanism was significantly faster. It jumped from 332K tuples/sec to 450K, a 35% gain, and had a lower memory footprint too. Test methodology and those results are at http://archives.postgresql.org/pgsql-hackers/2011-12/msg00008.php Robert Haas did a quick code review of this already, it along with author response mixed in are at http://archives.postgresql.org/pgsql-hackers/2011-12/msg01149.php I see two areas of contention there: -There are several naming bits no one is happy with yet. Robert didn't like some of them, but neither did Kyotaro. I don't have an opinion myself. Is it the case that some changes to the existing code's terminology are what's actually needed to make this all better? Or is this just fundamentally warty and there's nothing to be done about it. Dunno. -There is an abstraction wrapper vs. coding convenience trade-off centering around PGresAttValue. It sounded to me like it raised always fun questions like "where's the right place for the line between lipq-fe.h and libpq-int.h to be?" dblink is pretty popular, and this is a big performance win for it. If naming and API boundary issues are the worst problems here, this sounds like something well worth pursuing as part of 9.2's still advancing performance theme. -- Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Hello, This is revised and rebased version of the patch. a. Old term `Add Tuple Function' is changed to 'Store Handler'. The reason why not `storage' is simply length of the symbols. b. I couldn't find the place to settle PGgetAsCString() in. It is removed and storeHandler()@dblink.c touches PGresAttValue directly in this new patch. Definition of PGresAttValue stays in lipq-fe.h and provided with comment. c. Refine error handling of dblink.c. I think it preserves the previous behavior for column number mismatch and type conversion exception. d. Document is revised. > It jumped from 332K tuples/sec to 450K, a 35% gain, and had a > lower memory footprint too. Test methodology and those results > are at > http://archives.postgresql.org/pgsql-hackers/2011-12/msg00008.php It is a disappointment that I found that the gain had become lower than that according to the re-measuring. For CentOS6.2 and other conditions are the same to the previous testing, the overall performance became hihger and the loss of libpq patch was 1.8% and the gain of full patch had been fallen to 5.6%. But the reduction of the memory usage was not changed. Original : 3.96s 100.0% w/libpq patch : 4.03s 101.8% w/libpq+dblink patch : 3.74s 94.4% The attachments are listed below. libpq_altstore_20120117.patch - Allow alternative storage for libpql. dblink_perf_20120117.patch - Modify dblink to use alternative storage mechanism. libpq_altstore_doc_20120117.patch - Document for libpq_altstore. Shows in "31.19. Alternatie result storage" regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Jan 17, 2012 at 05:53:33PM +0900, Kyotaro HORIGUCHI wrote: > Hello, This is revised and rebased version of the patch. > > a. Old term `Add Tuple Function' is changed to 'Store > Handler'. The reason why not `storage' is simply length of the > symbols. > > b. I couldn't find the place to settle PGgetAsCString() in. It is > removed and storeHandler()@dblink.c touches PGresAttValue > directly in this new patch. Definition of PGresAttValue stays > in lipq-fe.h and provided with comment. > > c. Refine error handling of dblink.c. I think it preserves the > previous behavior for column number mismatch and type > conversion exception. > > d. Document is revised. First, my priority is one-the-fly result processing, not the allocation optimizing. And this patch seems to make it possible, I can process results row-by-row, without the need to buffer all of them in PQresult. Which is great! But the current API seems clumsy, I guess its because the patch grew from trying to replace the low-level allocator. I would like to propose better one-shot API with: void *(*RowStoreHandler)(PGresult *res, PGresAttValue *columns); where the PGresAttValue * is allocated once, inside PQresult. And the pointers inside point directly to network buffer. Ofcourse this requires replacing the current per-column malloc+copy pattern with per-row parse+handle pattern, but I think resulting API will be better: 1) Pass-through processing do not need to care about unnecessary per-row allocations. 2) Handlers that want to copy of the row (like regular libpq), can optimize allocations by having "global" view of the row. (Eg. One allocation for row header + data). This also optimizes call patterns - first libpq parses packet, then row handler processes row, no unnecessary back-and-forth. Summary - current API has various assumptions how the row is processed, let's remove those. -- marko
> > > > c. Refine error handling of dblink.c. I think it preserves the > > previous behavior for column number mismatch and type > > conversion exception. Hello, I don't know if this cover following issue. I just mention it for the case you didn't notice it and would like to handle this rather cosmetic issue as well. http://archives.postgresql.org/pgsql-bugs/2011-08/msg00113.php best regards, Marc Mamin
On Sat, Jan 21, 2012 at 1:52 PM, Marc Mamin <M(dot)Mamin(at)intershop(dot)de> wrote: >> > >> > c. Refine error handling of dblink.c. I think it preserves the >> > previous behavior for column number mismatch and type >> > conversion exception. > > Hello, > > I don't know if this cover following issue. > I just mention it for the case you didn't notice it and would like to > handle this rather cosmetic issue as well. > > http://archives.postgresql.org/pgsql-bugs/2011-08/msg00113.php It is not relevant to this thread, but seems good idea to implement indeed. It should be simple matter of creating handler that uses dblink_res_error() to report the notice. Perhaps you could create and submit the patch by yourself? For reference, here it the full flow in PL/Proxy: 1) PQsetNoticeReceiver: https://github.com/markokr/plproxy-dev/blob/master/src/execute.c#L422 2) handle_notice: https://github.com/markokr/plproxy-dev/blob/master/src/execute.c#L370 3) plproxy_remote_error: https://github.com/markokr/plproxy-dev/blob/master/src/main.c#L82 -- marko
Thank you for the comment, > First, my priority is one-the-fly result processing, > not the allocation optimizing. And this patch seems to make > it possible, I can process results row-by-row, without the > need to buffer all of them in PQresult. Which is great! > > But the current API seems clumsy, I guess its because the > patch grew from trying to replace the low-level allocator. Exactly. > I would like to propose better one-shot API with: > > void *(*RowStoreHandler)(PGresult *res, PGresAttValue *columns); > > where the PGresAttValue * is allocated once, inside PQresult. > And the pointers inside point directly to network buffer. Good catch, thank you. The patch is dragging too much from the old implementation. It is no need to copy the data inside getAnotherTuple to do it, as you say. > Ofcourse this requires replacing the current per-column malloc+copy > pattern with per-row parse+handle pattern, but I think resulting > API will be better: > > 1) Pass-through processing do not need to care about unnecessary > per-row allocations. > > 2) Handlers that want to copy of the row (like regular libpq), > can optimize allocations by having "global" view of the row. > (Eg. One allocation for row header + data). > > This also optimizes call patterns - first libpq parses packet, > then row handler processes row, no unnecessary back-and-forth. > > > Summary - current API has various assumptions how the row is > processed, let's remove those. Thank you, I rewrite the patch to make it realize. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hello, This is a new version of the patch formerly known as 'alternative storage for libpq'. - Changed the concept to 'Alternative Row Processor' from 'Storage handler'. Symbol names are also changed. - Callback function is modified following to the comment. - From the restriction of time, I did minimum check for this patch. The purpose of this patch is to show the new implement. - Proformance is not measured for this patch for the same reason. I will do that on next monday. - The meaning of PGresAttValue is changed. The field 'value' now contains a value withOUT terminating zero. This change seems to have no effect on any other portion within the whole source tree of postgresql from what I've seen. > > I would like to propose better one-shot API with: > > > > void *(*RowStoreHandler)(PGresult *res, PGresAttValue *columns); ... > > 1) Pass-through processing do not need to care about unnecessary > > per-row allocations. > > > > 2) Handlers that want to copy of the row (like regular libpq), > > can optimize allocations by having "global" view of the row. > > (Eg. One allocation for row header + data). I expect the new implementation is far more better than the orignal. regargs, -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, Jan 27, 2012 at 2:57 AM, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp> wrote: > Hello, This is a new version of the patch formerly known as > 'alternative storage for libpq'. I took a quick look at the patch and the docs. Looks good and agree with rationale and implementation. I see you covered the pqsetvalue case which is nice. I expect libpq C api clients coded for performance will immediately gravitate to this api. > - The meaning of PGresAttValue is changed. The field 'value' now > contains a value withOUT terminating zero. This change seems to > have no effect on any other portion within the whole source > tree of postgresql from what I've seen. This is a minor point of concern. This function was exposed to support libpqtypes (which your stuff compliments very nicely by the way) and I quickly confirmed removal of the null terminator didn't cause any problems there. I doubt anyone else is inspecting the structure directly (also searched the archives and didn't find anything). This needs to be advertised very loudly in the docs -- I understand why this was done but it's a pretty big change in the way the api works. merlin
On Fri, Jan 27, 2012 at 05:57:01PM +0900, Kyotaro HORIGUCHI wrote: > Hello, This is a new version of the patch formerly known as > 'alternative storage for libpq'. > > - Changed the concept to 'Alternative Row Processor' from > 'Storage handler'. Symbol names are also changed. > > - Callback function is modified following to the comment. > > - From the restriction of time, I did minimum check for this > patch. The purpose of this patch is to show the new implement. > > - Proformance is not measured for this patch for the same > reason. I will do that on next monday. > > - The meaning of PGresAttValue is changed. The field 'value' now > contains a value withOUT terminating zero. This change seems to > have no effect on any other portion within the whole source > tree of postgresql from what I've seen. I think we have general structure in place. Good. Minor notes: = rowhandler api = * It returns bool, so void* is wrong. Instead libpq style is to use int, with 1=OK, 0=Failure. Seems that was also old pqAddTuple() convention. * Drop PQgetRowProcessorParam(), instead give param as argument. * PQsetRowProcessorErrMes() should strdup() the message. That gets rid of allocator requirements in API. This also makes safe to pass static strings there. If strdup() fails, fall back to generic no-mem message. * Create new struct to replace PGresAttValue for rowhandler usage. RowHandler API is pretty unique and self-contained. It should have it's own struct. Main reason is that it allows to properly document it. Otherwise the minor details get lost as they are different from libpq-internal usage. Also this allows two structs to be improved separately. (PGresRawValue?) * Stop storing null_value into ->value. It's libpq internal detail. Instead the ->value should always point into buffer where the value info is located, even for NULL. This makes safe to simply subtract pointers to get row size estimate. Seems pqAddTuple() already does null_value logic, so no need to do it in rowhandler api. = libpq = Currently its confusing whether rowProcessor can be NULL, and what should be done if so. I think its better to fix usage so that it is always set. * PQregisterRowProcessor() should use default func if func==NULL. and set default handler if so. * Never set rowProcessor directly, always via PQregisterRowProcessor() * Drop all if(rowProcessor) checks. = dblink = * There are malloc failure checks missing in initStoreInfo() & storeHandler(). -- marko PS. You did not hear it from me, but most raw values are actually nul-terminated in protocol. Think big-endian. And those which are not, you can make so, as the data is not touched anymore. You cannot do it for last value, as next byte may not be allocated. But you could memmove() it lower address so you can null-terminate. I'm not suggesting it for official patch, but it would be fun to know if such hack is benchmarkable, and benchmarkable on realistic load.
On Fri, Jan 27, 2012 at 09:35:04AM -0600, Merlin Moncure wrote: > On Fri, Jan 27, 2012 at 2:57 AM, Kyotaro HORIGUCHI > > - The meaning of PGresAttValue is changed. The field 'value' now > > contains a value withOUT terminating zero. This change seems to > > have no effect on any other portion within the whole source > > tree of postgresql from what I've seen. > > This is a minor point of concern. This function was exposed to > support libpqtypes (which your stuff compliments very nicely by the > way) and I quickly confirmed removal of the null terminator didn't > cause any problems there. I doubt anyone else is inspecting the > structure directly (also searched the archives and didn't find > anything). > > This needs to be advertised very loudly in the docs -- I understand > why this was done but it's a pretty big change in the way the api > works. Note that the non-NUL-terminated PGresAttValue is only used for row handler. So no existing usage is affected. But I agree using same struct in different situations is confusing, thus the request for separate struct for row handler usage. -- marko
Thank you for comments, this is revised version of the patch.
The gain of performance is more than expected. Measure script now
does query via dblink ten times for stability of measuring, so
the figures become about ten times longer than the previous ones.
sec % to Original
Original : 31.5 100.0%
RowProcessor patch : 31.3 99.4%
dblink patch : 24.6 78.1%
RowProcessor patch alone makes no loss or very-little gain, and
full patch gives us 22% gain for the benchmark(*1).
The modifications are listed below.
- No more use of PGresAttValue for this mechanism, and added
PGrowValue instead. PGresAttValue has been put back to
libpq-int.h
- pqAddTuple() is restored as original and new function
paAddRow() to use as RowProcessor. (Previous pqAddTuple
implement had been buggily mixed the two usage of
PGresAttValue)
- PQgetRowProcessorParam has been dropped. Contextual parameter
is passed as one of the parameters of RowProcessor().
- RowProcessor() returns int (as bool, is that libpq convension?)
instead of void *. (Actually, void * had already become useless
as of previous patch)
- PQsetRowProcessorErrMes() is changed to do strdup internally.
- The callers of RowProcessor() no more set null_field to
PGrowValue.value. Plus, the PGrowValue[] which RowProcessor()
receives has nfields + 1 elements to be able to make rough
estimate by cols->value[nfields].value - cols->value[0].value -
something. The somthing here is 4 * nfields for protocol3 and
4 * (non-null fields) for protocol2. I fear that this applies
only for textual transfer usage...
- PQregisterRowProcessor() sets the default handler when given
NULL. (pg_conn|pg_result).rowProcessor cannot be NULL for its
lifetime.
- initStoreInfo() and storeHandler() has been provided with
malloc error handling.
And more..
- getAnotherTuple()@fe-protocol2.c is not tested utterly.
- The uniformity of the size of columns in the test data prevents
realloc from execution in dblink... More test should be done.
regards,
=====
(*1) The benchmark is done as follows,
==test.sql
select dblink_connect('c', 'host=localhost dbname=test');
select * from dblink('c', 'select a,c from foo limit 2000000') as (a text b bytea) limit 1;
...(repeat 9 times more)
select dblink_disconnect('c');
==
$ for i in $(seq 1 10); do time psql test -f t.sql; done
The environment is
CentOS 6.2 on VirtualBox on Core i7 965 3.2GHz
# of processor 1
Allocated mem 2GB
Test DB schema is
Column | Type | Modifiers
--------+-------+-----------
a | text |
b | text |
c | bytea |
Indexes:
"foo_a_bt" btree (a)
"foo_c_bt" btree (c)
test=# select count(*),
min(length(a)) as a_min, max(length(a)) as a_max,
min(length(c)) as c_min, max(length(c)) as c_max from foo;
count | a_min | a_max | c_min | c_max
---------+-------+-------+-------+-------
2000000 | 29 | 29 | 29 | 29
(1 row)
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Jan 30, 2012 at 06:06:57PM +0900, Kyotaro HORIGUCHI wrote: > The gain of performance is more than expected. Measure script now > does query via dblink ten times for stability of measuring, so > the figures become about ten times longer than the previous ones. > > sec % to Original > Original : 31.5 100.0% > RowProcessor patch : 31.3 99.4% > dblink patch : 24.6 78.1% > > RowProcessor patch alone makes no loss or very-little gain, and > full patch gives us 22% gain for the benchmark(*1). Excellent! > - The callers of RowProcessor() no more set null_field to > PGrowValue.value. Plus, the PGrowValue[] which RowProcessor() > receives has nfields + 1 elements to be able to make rough > estimate by cols->value[nfields].value - cols->value[0].value - > something. The somthing here is 4 * nfields for protocol3 and > 4 * (non-null fields) for protocol2. I fear that this applies > only for textual transfer usage... Excact estimate is not important here. And (nfields + 1) elem feels bit too much magic, considering that most users probably do not need it. Without it, the logic would be: total = last.value - first.value + ((last.len > 0) ? last.len : 0) which isn't too complex. So I think we can remove it. = Problems = * Remove the dubious memcpy() in pqAddRow() * I think the dynamic arrays in getAnotherTuple() are not portable enough, please do proper allocation for array. I guess in PQsetResultAttrs()? = Minor notes = These can be argued either way, if you don't like some suggestion, you can drop it. * Move PQregisterRowProcessor() into fe-exec.c, then we can make pqAddRow static. * Should PQclear() free RowProcessor error msg? It seems it should not get outside from getAnotherTuple(), but thats not certain. Perhaps it would be clearer to free it here too. * Remove the part of comment in getAnotherTuple(): * Buffer content may be shifted on reloading additional * data. So we must set all pointers on every scan. It's confusing why it needs to clarify that, as there is nobody expecting it. * PGrowValue documentation should mention that ->value pointer is always valid. * dblink: Perhaps some of those mallocs() could be replaced with pallocs() or even StringInfo, which already does the realloc dance? I'm not familiar with dblink, and various struct lifetimes there so I don't know it that actually makes sense or not. It seems this patch is getting ReadyForCommitter soon... -- marko