Re: shared memory based stat collector (was: Sharing record typmods between backends)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: shared memory based stat collector (was: Sharing record typmods between backends)
Date: 2017-08-14 16:00:25
Message-ID: 20170814160025.mam2ocgdwcls32yn@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2017-08-14 11:46:10 -0400, Robert Haas wrote:
> On Sun, Aug 13, 2017 at 9:59 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Andres Freund <andres(at)anarazel(dot)de> writes:
> >> You both are obviously right. Another point of potential concern could
> >> be that we'd pretyt much fully rely on dsm/dht's being available, for
> >> the server to function correctly. Are we ok with that? Right now
> >> postgres still works perfectly well, leaving parallelism aside, with
> >> dynamic_shared_memory_type = none.
> >
> > In principle we could keep the existing mechanism as a fallback.
> > Whether that's worth the trouble is debatable. The current code
> > in initdb believes that every platform has some type of DSM support
> > (see choose_dsm_implementation). Nobody's complained about that,
> > and it certainly works on every buildfarm animal. So for all we know,
> > dynamic_shared_memory_type = none is broken already.
>
> Actually, now that you mention it, I think it *is* broken already, or
> more to the point, if you configure that value, the autovacuum
> launcher hangs up, because of the AutovacuumWorkItem stuff that Alvaro
> added. When I just tested it, the AV launcher somehow ended up
> waiting for AutovacuumLock and I had to SIGQUIT the server to shut it
> down. That's actually not really entirely the fault of
> dynamic_shared_memory_type = none, though, because the code in
> autovacuum.c does this:
>
> AutoVacuumDSA = dsa_create(AutovacuumLock->tranche);
> /* make sure it doesn't go away even if we do */
> dsa_pin(AutoVacuumDSA);
> dsa_pin_mapping(AutoVacuumDSA);
>
> Now, that's actually really broken because if dsa_create() throws an
> error of any kind, you're going to have already assigned the value to
> AutoVacuumDSA, but you will not have pinned the DSA or the DSA
> mapping. There's evidently some additional bug here because I'd sorta
> expect this code to just go into an infinite loop in this case,
> failing over and over trying to reattach the segment, but evidently
> something even worse happening - perhaps the ERROR isn't releasing
> AutovacuumLock. Anyway, this code has multiple error handling defects
> and IMHO it's pretty questionable whether DSA should have been used
> here at all. Allowing the number of autovacuum work items to grow
> without bound would not be a good design - and if we've got a bound
> somewhere, why not just put this in the main shared memory segment?

CCing Alvaro. This seems like it should be an open item.

> Leaving all that aside, when the DSM facility was introduced in 9.4, I
> was afraid that it would break things for people and that they'd
> demand a way to turn it off, and I included "none" as an option for
> that purpose. Well, it did break a few things initially but those
> seem to have mostly been fixed during the 9.4 cycle itself. I'm not
> aware of any subsequent problem reports caused by having DSM enabled
> (pointers welcome!) and given that we now have parallel query relying
> on DSM, people are much less likely to want to just give up on having
> DSM available. If somebody has a system where no other form of shared
> memory, works dynamic_shared_memory_type = mmap is still a thing, so
> the use case for "none" seems very thin indeed. I'd vote for just
> ripping it out in v11.

It's possibly still useful for debugging - we'll continue not to be able
to rely on allocations to succeed...

> >> a) It'd be quite useful to avoid needing a whole cluster's stats in
> >> memory. Even if $subject would save memory, I'm hesitant committing
> >> to something requiring all stats to be in memory forever. As a first
> >> step it seems reasonable to e.g. not require state for all databases
> >> to be in memory. The first time per-database stats are required, it
> >> could be "paged in". Could even be more aggressive and do that on a
> >> per-table level and just have smaller placeholder entries for
> >> non-accessed tables, but that seems more work.
> >
> > Huh? As long as we don't lock the shared memory into RAM, which on most
> > platforms we couldn't do without being root anyway, ISTM the kernel should
> > do just fine at paging out little-used areas of the stats. Let's not
> > reinvent that wheel until there's demonstrable proof of need.
>
> I think some systems aren't able to page out data stored in shared
> memory segments, and it would probably be pretty awful for performance
> if they did (there's a reason large sorts need to switch to using temp
> files ... and the same rationale seems to apply here).

Right.

> That having been said, I don't see a need to change everything at
> once. If we moved the stats collector data from frequently-rewritten
> files to shared memory, we'd already be saving a lot of I/O and
> possibly memory utilization.

Right #7

> >> b) I think our tendency to dump all stats whenever we crash isn't really
> >> tenable, given how autovacuum etc are tied to them.
> >
> > Eh, maybe. I don't think crashes are really so common on production
> > systems. As developers we have an inflated view of their frequency ;-)

FWIW, in my experience crashes are far from rare. And most shutdowns /
restarts are immediate ones in my experience.

> Without taking a position on the point under debate, AFAIK it wouldn't
> be technically complex either under our current architecture or the
> proposed new one to dump out a new permanent stats file every 10
> minutes or so. So if there is an issue here I think it might not be
> very hard to fix, whatever else we do.

There's some issues around that, primarily because how dropped tables
are handled. It's not entirely trivial to avoid having old entries
around after a dropped tables, which then even can lead to conflicts if
oids are reused.

Greetings,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2017-08-14 16:09:05 Re: shared memory based stat collector (was: Sharing record typmods between backends)
Previous Message Robert Haas 2017-08-14 15:50:18 Re: Lazy hash table for XidInMVCCSnapshot (helps Zipfian a bit)