| From: | Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io> | 
|---|---|
| To: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org | 
| Cc: | Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tv(at)fuzzy(dot)cz>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, David Rowley <dgrowleyml(at)gmail(dot)com> | 
| Subject: | Re: Use generation context to speed up tuplesorts | 
| Date: | 2022-01-07 11:03:55 | 
| Message-ID: | 3082578.5fSG56mABF@aivenronan | 
| Lists: | pgsql-hackers | 
On Friday, 31 December 2021, 22:26:37 CET, David Rowley wrote:
> I've attached some benchmark results that I took recently.  The
> spreadsheet contains results from 3 versions. master, master + 0001 -
> 0002, then master + 0001 - 0003.  The 0003 patch makes the code a bit
> more conservative about the chunk sizes it allocates and also tries to
> allocate the tuple array according to the number of tuples we expect
> to be able to sort in a single batch for when the sort is not
> estimated to fit inside work_mem.
(Sorry for trying to merge back the discussion on the two sides of the thread)
In https://www.postgresql.org/message-id/4776839.iZASKD2KPV%40aivenronan, I 
expressed the idea of being able to tune glibc's malloc behaviour. 
I implemented that (patch 0001) by providing a new hook which is called on 
backend startup, and any time we set work_mem. This hook is #defined depending 
on the malloc implementation: currently a default, no-op implementation is 
provided, as well as an implementation for glibc's malloc.
The glibc implementation relies on a new GUC, 
glibc_malloc_max_trim_threshold. When set to its default value of -1, we 
don't tune malloc at all, exactly as in HEAD. If a different value is provided, 
we set M_MMAP_THRESHOLD to half this value, and M_TRIM_THRESHOLD to this value, 
capped by work_mem / 2 and work_mem respectively. 
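To make the capping rule concrete, here is a minimal sketch of the logic described above. It is not the code from patch 0001: the struct, the function names, and the choice of bytes as the unit are all assumptions made for illustration. Only mallopt(), M_MMAP_THRESHOLD and M_TRIM_THRESHOLD are real glibc names.

```c
#include <assert.h>
#include <limits.h>
#include <malloc.h>     /* glibc-specific: mallopt(), M_MMAP_THRESHOLD, M_TRIM_THRESHOLD */

/* Hypothetical container for the values we would hand to mallopt(). */
typedef struct MallocThresholds
{
    int tune;               /* 0 = leave malloc untouched (GUC set to -1) */
    int mmap_threshold;     /* bytes */
    int trim_threshold;     /* bytes */
} MallocThresholds;

static long
min_long(long a, long b)
{
    return a < b ? a : b;
}

/*
 * Hypothetical helper: derive both thresholds from the GUC value and
 * work_mem, both expressed in bytes here (an assumption).
 */
static MallocThresholds
compute_thresholds(long max_trim_bytes, long work_mem_bytes)
{
    MallocThresholds t = {0, 0, 0};

    assert(work_mem_bytes > 0);

    if (max_trim_bytes < 0)
        return t;           /* -1: keep glibc's default behaviour */

    t.tune = 1;
    /* trim threshold: the GUC value, capped by work_mem */
    t.trim_threshold = (int) min_long(min_long(max_trim_bytes, work_mem_bytes),
                                      INT_MAX);
    /* mmap threshold: half the GUC value, capped by work_mem / 2 */
    t.mmap_threshold = (int) min_long(min_long(max_trim_bytes / 2,
                                               work_mem_bytes / 2),
                                      INT_MAX);
    return t;
}

/*
 * Apply the thresholds. mallopt() returns 1 on success and 0 on error;
 * glibc rejects an M_MMAP_THRESHOLD above its built-in maximum.
 */
static int
apply_thresholds(MallocThresholds t)
{
    if (!t.tune)
        return 1;
    return mallopt(M_MMAP_THRESHOLD, t.mmap_threshold) == 1 &&
           mallopt(M_TRIM_THRESHOLD, t.trim_threshold) == 1;
}
```

With a 64MB GUC value and a 4MB work_mem, this sketch yields a 4MB trim threshold and a 2MB mmap threshold, since both caps apply.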
The net result is that we can then keep more unused memory at the top 
of the heap, and use mmap less frequently, if the DBA chooses to. Another 
possible use case would be, on the contrary, to limit the allocated 
memory in idle backends to a minimum. 
The reasoning behind this is that, by default, glibc's malloc adjusts those 
two thresholds dynamically, based on the size of the last freed mmapped block. 
I've run the same "up to 32 columns" benchmark as you did, with this new patch 
applied on top of both HEAD and your v2 patchset incorporating planner 
estimates for the block sizes. Those are called "aset" and "generation" in the 
attached spreadsheet. For each, I've run it with 
glibc_malloc_max_trim_threshold set to -1, 1MB, 4MB and 64MB. In each case 
I've measured two things:
 - query latency, as reported by pgbench
 - total memory allocated by malloc at backend exit after running each query 
three times. This represents the "idle" memory consumption, and thus what we 
waste in malloc instead of releasing it back to the system. This measurement has 
been performed using the very small module presented in patch 0002. Please 
note that I in no way propose that we include this module, it was just a 
convenient way for me to measure memory footprint.
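For readers who want to take a similar measurement, the following is a rough sketch of reading malloc's footprint through glibc's mallinfo interface. It is not the module from patch 0002, whose contents are not shown here: the struct and function names are hypothetical, and summing arena and hblkhd as "memory obtained from the system" is my assumption about what is worth reporting.

```c
#include <malloc.h>     /* glibc-specific: mallinfo(), mallinfo2() */
#include <stddef.h>

/* Hypothetical summary of what malloc currently holds, in bytes. */
typedef struct HeapFootprint
{
    size_t from_system;     /* heap (arena) plus mmapped regions */
    size_t in_use;          /* bytes in allocated heap chunks */
    size_t free_held;       /* free heap bytes not returned to the system */
} HeapFootprint;

static HeapFootprint
read_heap_footprint(void)
{
    HeapFootprint fp;

    /* mallinfo2() has size_t fields but only exists since glibc 2.33. */
#if __GLIBC_PREREQ(2, 33)
    struct mallinfo2 mi = mallinfo2();
#else
    struct mallinfo mi = mallinfo();
#endif

    fp.from_system = (size_t) mi.arena + (size_t) mi.hblkhd;
    fp.in_use = (size_t) mi.uordblks;
    fp.free_held = (size_t) mi.fordblks;
    return fp;
}
```

Calling such a function at backend exit and logging free_held would give an idea of the "idle" memory that malloc keeps instead of returning it to the system.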
My conclusion is that the impressive gains you see from using the generation 
context with bigger blocks mostly come from the fact that we allocate bigger 
blocks, and that this moves the mmap thresholds accordingly. I wonder how much 
of a difference it would make on other malloc implementations: I'm afraid the 
optimisation presented here would in fact be specific to glibc's malloc, since 
we have almost the same gains with both allocators when tuning malloc to keep 
more memory. I still think both approaches are useful, and would be necessary. 
Since this affects all memory allocations, I need to come up with other 
meaningful scenarios to benchmark.
-- 
Ronan Dunklau
| Attachment | Content-Type | Size | 
|---|---|---|
| bench_trims.ods | application/vnd.oasis.opendocument.spreadsheet | 151.1 KB | 
| v1-0001-Add-the-possibility-of-tuning-malloc-options.patch | text/x-patch | 9.4 KB | 
| v1-0002-Add-malloc_stats-extension-to-see-the-allocated-m.patch | text/x-patch | 2.0 KB | 