Re: shared memory based stat collector (was: Sharing record typmods between backends)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: shared memory based stat collector (was: Sharing record typmods between backends)
Date: 2017-08-14 15:46:10
Message-ID: CA+TgmobQVbz4K_+RSmiM9HeRKpy3vS5xnbkL95gSEnWijzprKQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Aug 13, 2017 at 9:59 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
>> You both are obviously right. Another point of potential concern could
>> be that we'd pretyt much fully rely on dsm/dht's being available, for
>> the server to function correctly. Are we ok with that? Right now
>> postgres still works perfectly well, leaving parallelism aside, with
>> dynamic_shared_memory_type = none.
>
> In principle we could keep the existing mechanism as a fallback.
> Whether that's worth the trouble is debatable. The current code
> in initdb believes that every platform has some type of DSM support
> (see choose_dsm_implementation). Nobody's complained about that,
> and it certainly works on every buildfarm animal. So for all we know,
> dynamic_shared_memory_type = none is broken already.

Actually, now that you mention it, I think it *is* broken already, or
more to the point, if you configure that value, the autovacuum
launcher hangs up, because of the AutovacuumWorkItem stuff that Alvaro
added. When I just tested it, the AV launcher somehow ended up
waiting for AutovacuumLock and I had to SIGQUIT the server to shut it
down. That's actually not really entirely the fault of
dynamic_shared_memory_type = none, though, because the code in
autovacuum.c does this:

AutoVacuumDSA = dsa_create(AutovacuumLock->tranche);
/* make sure it doesn't go away even if we do */
dsa_pin(AutoVacuumDSA);
dsa_pin_mapping(AutoVacuumDSA);

Now, that's actually really broken because if dsa_create() throws an
error of any kind, you're going to have already assigned the value to
AutoVacuumDSA, but you will not have pinned the DSA or the DSA
mapping. There's evidently some additional bug here because I'd sorta
expect this code to just go into an infinite loop in this case,
failing over and over trying to reattach the segment, but evidently
something even worse happening - perhaps the ERROR isn't releasing
AutovacuumLock. Anyway, this code has multiple error handling defects
and IMHO it's pretty questionable whether DSA should have been used
here at all. Allowing the number of autovacuum work items to grow
without bound would not be a good design - and if we've got a bound
somewhere, why not just put this in the main shared memory segment?

Leaving all that aside, when the DSM facility was introduced in 9.4, I
was afraid that it would break things for people and that they'd
demand a way to turn it off, and I included "none" as an option for
that purpose. Well, it did break a few things initially but those
seem to have mostly been fixed during the 9.4 cycle itself. I'm not
aware of any subsequent problem reports caused by having DSM enabled
(pointers welcome!) and given that we now have parallel query relying
on DSM, people are much less likely to want to just give up on having
DSM available. If somebody has a system where no other form of shared
memory, works dynamic_shared_memory_type = mmap is still a thing, so
the use case for "none" seems very thin indeed. I'd vote for just
ripping it out in v11.

>> a) It'd be quite useful to avoid needing a whole cluster's stats in
>> memory. Even if $subject would save memory, I'm hesitant committing
>> to something requiring all stats to be in memory forever. As a first
>> step it seems reasonable to e.g. not require state for all databases
>> to be in memory. The first time per-database stats are required, it
>> could be "paged in". Could even be more aggressive and do that on a
>> per-table level and just have smaller placeholder entries for
>> non-accessed tables, but that seems more work.
>
> Huh? As long as we don't lock the shared memory into RAM, which on most
> platforms we couldn't do without being root anyway, ISTM the kernel should
> do just fine at paging out little-used areas of the stats. Let's not
> reinvent that wheel until there's demonstrable proof of need.

I think some systems aren't able to page out data stored in shared
memory segments, and it would probably be pretty awful for performance
if they did (there's a reason large sorts need to switch to using temp
files ... and the same rationale seems to apply here). That having
been said, I don't see a need to change everything at once. If we
moved the stats collector data from frequently-rewritten files to
shared memory, we'd already be saving a lot of I/O and possibly memory
utilization. If somebody then wanted to refine that further by
spilling rarely used data out to disk and reloading it on demand, that
could be done as a follow-on patch. But I think that would only be
needed for people with *really* large numbers of tables, and it would
only help them if most of those tables were barely ever touched.

>> b) I think our tendency to dump all stats whenever we crash isn't really
>> tenable, given how autovacuum etc are tied to them.
>
> Eh, maybe. I don't think crashes are really so common on production
> systems. As developers we have an inflated view of their frequency ;-)

Without taking a position on the point under debate, AFAIK it wouldn't
be technically complex either under our current architecture or the
proposed new one to dump out a new permanent stats file every 10
minutes or so. So if there is an issue here I think it might not be
very hard to fix, whatever else we do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-08-14 15:50:18 Re: Lazy hash table for XidInMVCCSnapshot (helps Zipfian a bit)
Previous Message Peter Eisentraut 2017-08-14 15:40:12 Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)