From: Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tv(at)fuzzy(dot)cz>, David Rowley <dgrowleyml(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject: Re: Use generation context to speed up tuplesorts
On Friday, January 7, 2022 at 13:03:28 CET, Tomas Vondra wrote:
> On 1/7/22 12:03, Ronan Dunklau wrote:
> >> On Friday, December 31, 2021 at 22:26:37 CET, David Rowley wrote:
> >> I've attached some benchmark results that I took recently. The
> >> spreadsheet contains results from 3 versions. master, master + 0001 -
> >> 0002, then master + 0001 - 0003. The 0003 patch makes the code a bit
> >> more conservative about the chunk sizes it allocates and also tries to
> >> allocate the tuple array according to the number of tuples we expect
> >> to be able to sort in a single batch for when the sort is not
> >> estimated to fit inside work_mem.
> > (Sorry for trying to merge back the discussion on the two sides of the
> > thread)
> > In https://www.postgresql.org/message-id/4776839.iZASKD2KPV%40aivenronan,
> > I expressed the idea of being able to tune glibc's malloc behaviour.
> > I implemented that (patch 0001) to provide a new hook which is called on
> > backend startup, and anytime we set work_mem. This hook is #defined
> > depending on the malloc implementation: currently a default, no-op
> > implementation is provided, as well as a glibc malloc implementation.
> Not sure I'd call this a hook - that usually means a way to plug-in
> custom code through a callback, and this is simply ifdefing a block of
> code to pick the right implementation. Which may be a good way to do
> that, just let's not call that a hook.
> There's a commented-out MallocTuneHook() call, probably not needed.
Ok, I'll clean that up if we decide to proceed with this.
> I wonder if #ifdefing is sufficient solution, because it happens at
> compile time, so what if someone overrides the allocator in LD_PRELOAD?
> That was a fairly common way to use a custom allocator in an existing
> application. But I don't know how many people do that with Postgres (I'm
> not aware of anyone doing that) or if we support that (it'd probably
> apply to other stuff too, not just malloc). So maybe it's OK, and I
> can't think of a better way anyway.
I couldn't think of a better way either. Maybe there is something to be done
by trying to dlsym something specific to glibc's malloc implementation?
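One conceivable (untested) variant of that idea: probe at runtime for a symbol that only glibc's malloc exports, such as malloc_trim. Note the caveat in the comment, which is exactly the LD_PRELOAD scenario discussed above.

```c
/* Sketch: detect glibc's malloc at runtime by probing for a
 * glibc-specific symbol instead of deciding at compile time.
 * Caveat: an LD_PRELOADed allocator usually interposes malloc/free
 * but not malloc_trim, so glibc's symbol would still be found and
 * this check would report a false positive.  It is only a heuristic.
 * (RTLD_DEFAULT needs _GNU_SOURCE; link with -ldl on glibc < 2.34.) */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdbool.h>
#include <stddef.h>

static bool
looks_like_glibc_malloc(void)
{
    return dlsym(RTLD_DEFAULT, "malloc_trim") != NULL;
}
```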
> > The glibc malloc implementation relies on a new GUC,
> > glibc_malloc_max_trim_threshold. When set to its default value of -1, we
> > don't tune malloc at all, exactly as in HEAD. If a different value is
> > provided, we set M_MMAP_THRESHOLD to half this value, and M_TRIM_THRESHOLD
> > to this value, capped by work_mem / 2 and work_mem respectively.
> > The net result is that we can then keep more unused memory at the top of
> > the heap, and use mmap less frequently, if the DBA chooses to.
> > Another possible use case would be, on the contrary, to limit the
> > memory retained by idle backends to a minimum.
> > The reasoning behind this is that glibc's malloc default way of handling
> > those two thresholds is to adapt to the size of the last freed mmaped
> > block.
> > I've run the same "up to 32 columns" benchmark as you did, with this new
> > patch applied on top of both HEAD and your v2 patchset incorporating
> > planner estimates for the block sizes. Those are called "aset" and
> > "generation" in the attached spreadsheet. For each, I've run it with
> > glibc_malloc_max_trim_threshold set to -1, 1MB, 4MB and 64MB. In each case
> > I've measured two things:
> > - query latency, as reported by pgbench
> > - total memory allocated by malloc at backend exit after running each
> > query
> > three times. This represents the "idle" memory consumption, and thus what
> > we waste in malloc instead of releasing back to the system. This
> > measurement has been performed using the very small module presented in
> > patch 0002. Please note that I in no way propose that we include this
> > module, it was just a convenient way for me to measure memory footprint.
> > My conclusion is that the impressive gains you see from using the
> > generation context with bigger blocks mostly comes from the fact that we
> > allocate bigger blocks, and that this moves the mmap thresholds
> > accordingly. I wonder how much of a difference it would make on other
> > malloc implementations: I'm afraid the optimisation presented here would
> > in fact be specific to glibc's malloc, since we have almost the same
> > gains with both allocators when tuning malloc to keep more memory. I
> > still think both approaches are useful, and would be necessary.
> Interesting measurements. It's intriguing that for generation contexts,
> the default "-1" often outperforms "1MB" (but not the other options),
> while for aset it's pretty much "the higher value the better".
For generation context with "big block sizes" this result is expected, as the
malloc dynamic tuning will adapt to the big block size. This can also be seen
on the "idle memory" measurement: the memory consumption is identical to the
64MB value when using -1, since that's what we converge to. This makes it
possible to configure postgres to be more conservative with memory: for
example, if we have long-lived backends where we sometimes temporarily set
work_mem to a high value, we may end up with a large memory footprint. The
implementation I provide also requests a malloc trim when we lower the
threshold, making it possible to release memory that would have otherwise been
kept around forever.
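To make the threshold arithmetic concrete, here is an illustrative sketch of the behaviour described above (the function name is mine, not the patch's): M_TRIM_THRESHOLD is set to the GUC value capped by work_mem, M_MMAP_THRESHOLD to half of that, and malloc_trim() is requested when the effective threshold is lowered.

```c
/* Illustrative sketch of the described tuning; glibc only.
 * A GUC value of -1 disables tuning entirely. */
#include <malloc.h>

static long current_trim = -1;

static long
apply_malloc_tuning(long max_trim_threshold, long work_mem_bytes)
{
    long trim;

    if (max_trim_threshold < 0)
        return -1;              /* behave exactly as in HEAD */

    /* trim threshold = GUC value capped by work_mem;
     * mmap threshold = half of that (i.e. capped by work_mem / 2) */
    trim = max_trim_threshold < work_mem_bytes
         ? max_trim_threshold : work_mem_bytes;
    mallopt(M_TRIM_THRESHOLD, (int) trim);
    mallopt(M_MMAP_THRESHOLD, (int) (trim / 2));

    /* when lowering the threshold, ask glibc to return memory that was
     * retained under the old, larger threshold back to the kernel */
    if (current_trim < 0 || trim < current_trim)
        malloc_trim(0);
    current_trim = trim;
    return trim;
}
```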
For aset, the memory allocation pattern is a bit more complicated, and we
don't end up with such a high value for mmap_threshold.
Also, one thing that I haven't explained yet is the weird outlier when there
is only one column.
> > Since this affects all memory allocations, I need to come up with other
> > meaningful scenarios to benchmarks.
> OK. Are you thinking about a different microbenchmark, or something
> closer to real workload?
Both. As for microbenchmarking, I'd like to test the following scenarios:
- set returning functions allocating a lot of memory
- maintenance operations: REINDEX TABLE and the like, where we may end up
with a large amount of memory used.
- operations involving large hash tables
For real workloads, if you have something specific in mind let me know.
One thing I didn't mention is that I set max_parallel_workers_per_gather to 0
in all tests.
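As an aside, for anyone wanting to approximate the idle-memory measurement without the module from patch 0002 (which is not shown here), glibc's mallinfo2() (glibc >= 2.33) gives a rough equivalent:

```c
/* Rough, standalone approximation of the footprint measurement:
 * report how much free-but-retained memory glibc's malloc is holding
 * in the heap (i.e. not returned to the kernel). */
#include <malloc.h>
#include <stdlib.h>

static size_t
retained_free_bytes(void)
{
    struct mallinfo2 mi = mallinfo2();
    return mi.fordblks;         /* free chunks kept inside the heap */
}
```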