Re: dynamic shared memory

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-24 16:19:51
Message-ID: CA+TgmoYzo_3m8WQej3MAk7bKD3tw1GHNbFWga576aEGitwWTUQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > Hm, I guess you don't want to add it to global/ or so because of the
>> > mmap implementation where you presumably scan the directory?
>>
>> Yes, and also because I thought this way would make it easier to teach
>> things like pg_basebackup (or anybody's home-brew scripts) to just
>> skip that directory completely. Actually, I was wondering if we ought
>> to have a directory under pgdata whose explicit charter it was to
>> contain files that shouldn't be copied as part of a base backup.
>> pg_do_not_back_this_up.
>
> I wondered exactly about that as soon as you mentioned
> pg_basebackup. pg_local/?

That seems reasonable. It's not totally transparent what that's
supposed to mean, but it's fairly mnemonic once you know. Other
suggestions welcome, if anyone has ideas.

Are there any other likely candidates for inclusion in that directory
other than this stuff?

>> >> +	/* Create or truncate the file. */
>> >> +	statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);
>> >
>> > Doesn't this need a | PG_BINARY?
>>
>> It's a text file. Do we need PG_BINARY anyway?
>
> I'd say yes. Non binary mode stuff on windows does stuff like
> transforming LF <=> CRLF on reading/writing, which makes sizes not match
> up and similar ugliness.
> Imo there's little reason to use non-binary mode for anything written
> for postgres' own consumption.

Well, I'm happy to do whatever the consensus is. AFAICT you and Noah
are both for it and Amit's position is that it doesn't matter either
way, so I'll go ahead and change that unless further discussion sheds
a different light on things.

>> > Why are you using open() and not
>> > BasicOpenFile or even OpenTransientFile?
>>
>> Because those don't work in the postmaster.
>
> Oh, that's news to me. Seems strange, especially for BasicOpenFile.

Per its header comment, InitFileAccess is not called in the
postmaster, so there's no VFD cache. Thus, any attempt by
BasicOpenFile to call ReleaseLruFile would be pointless at best.

>> dsm_handle is an alias for uint32. Is that always exactly an unsigned
>> int or can it sometimes be an unsigned long? I thought the latter, so
>> couldn't figure out how to write this portably without casting to a
>> type that explicitly matched the format string.
>
> That should always be an unsigned int on platforms we support. Note that
> we've printed TransactionIds that way (i.e. %u) for a long time and they
> are a uint32 as well.

Fixed.

>> > Not sure whether it's sensible to only LOG in these cases. After all
>> > there's something unexpected happening. The robustness argument doesn't
>> > count since we're already shutting down.
>
>> I see no point in throwing an error. The fact that we're having
>> trouble cleaning up one dynamic shared memory segment doesn't mean we
>> shouldn't try to clean up others, or that any remaining postmaster
>> shutdown hooks shouldn't be executed.
>
> Well, it means we'll do a regular shutdown instead of PANICing
> and *not* writing a checkpoint.
> If something has corrupted our state to the point we cannot unregister
> shared memory we registered, something has gone terribly wrong. Quite
> possibly we've scribbled over our control structures or such. In that
> case it's not proper to do a normal shutdown, we're quite possibly
> writing bad data.

I have to admit I didn't consider the possibility of an
otherwise-clean shutdown that hit only this problem. I'm not sure how
seriously to take that case. I guess we could emit warnings for
individual failures and then throw an error at the end if there were
any, but that seems a little ugly. Or we could go whole hog and treat
any failure as a critical error. Anyone else have an opinion on what
to do here?

>> > Imo that's a PANIC or at the very least a FATAL.
>>
>> Sure, that's a tempting option, but it doesn't seem to serve any very
>> necessary point. There's no data corruption problem if we proceed
>> here. Most likely either (1) there's a bug in the code, which
>> panicking won't fix or (2) the DBA hand-edited the state file, in
>> which case maybe he shouldn't have done that, but if he thinks the
>> best way to recover from that is a cluster-wide restart, he can do
>> that himself.
>
> "There's no data corruption problem if we proceed" - but there likely
> has been one leading to the current state.

I doubt it. It's more likely that the file permissions got changed or
something.

>> > Do we rely on being run in an environment with proper setup for lwlock
>> > cleanup? I can imagine shared libraries doing this pretty early on...
>>
>> Yes, we rely on that. I don't really see that as a problem. You'd
>> better connect to the main shared memory segment before starting to
>> create your own.
>
> I am not talking about lwlocks itself being setup but an environment
> that has resource owners defined and catches errors. I am specifically
> asking because you're a) ereport()ing without releasing an LWLock b)
> unconditionally relying on the fact that there's a current resource
> owner.
> In shared_preload_libraries neither is the case, AFAIR?
>
> Now, you could very well argue that you don't need to use dsm for
> shared_preload_libraries but there are enough libraries that you can use
> per session or globally. Requiring them to use both implementation or
> register stuff later seems like it would complicate things.

Well, the cleanup logic gets a lot more complicated without a resource
owner. The first few drafts of this logic didn't involve any resource
owner integration and things got quite a bit simpler and nicer when I
added that. For example, in dsm_create(), we do
dsm_create_descriptor() and then loop until we find an unused segment
identifier. If dsm_impl_op() throws an ERROR, which it definitely
can, then the segment produced by dsm_create_descriptor() is still
lying around in dsm_segment_list, and without the resource owner
machinery, that's a permanent leak. Certainly, it's fixable. You can
put PG_TRY() blocks in everywhere and solve the problem that way. But
I'm not very keen on going that route; it looks like it will be
painful and messy.

I also do not think that allocating dynamic shared memory segments in
shared_preload_libraries is actually sensible. You're in the
postmaster at that point, and the main shared memory segment is not
set up. If you were to map a shared memory segment at that point, the
mapping would get inherited in EXEC_BACKEND environments but not
otherwise, so we'd need more infrastructure to handle that. And, of
course, we couldn't use LWLocks for synchronization. I think that we
couldn't use spinlocks either, even if it were otherwise acceptable,
since with --disable-spinlocks those are going to turn into semaphores
that I don't think are available at this point either.

I don't really feel like solving all of those problems and, TBH, I
don't see why it's particularly important. If somebody wants a
loadable module that can be loaded either from
shared_preload_libraries or on the fly, and they use dynamic shared
memory in the latter case, then they can use it in the former case as
well. If they've already got logic to create the DSM when it's first
needed, it doesn't cost extra to do it that way in both cases.

>> They'll continue to see the portion they have mapped, but must do
>> dsm_remap() if they want to see the whole thing.
>
> But resizing can shrink, can it not? And we do an ftruncate() in at
> least the posix shmem case. Which means the other backend will get a
> SIGSEGV accessing that memory IIRC.

Yep. Shrinking the shared memory segment will require special
caution. Caveat emptor.

>> > Shouldn't we error out if !dsm_impl_can_resize()?
>>
>> The implementation-specific code throws an error if it can't support
>> resize. Even if we put a secondary check here, I wouldn't want
>> dsm_impl_op to behave in an undefined manner when asked to resize
>> under an implementation that can't. And there doesn't seem to be much
>> point in having two checks.
>
> Well, you have the check in dsm_remap(), which seems strange to me.

Oh, fair point. Removed.

>> Now on the flip side we might not be aborting; maybe we're committing.
>> But we don't want to turn a commit into an abort just for this. If
>> resowner.c detects a buffer pin leak or a tuple descriptor leak, those
>> are "just" warning as well. They're serious warnings, of course, and
>> if they happen it means there's a bug in the code that needs to be
>> fixed. But the severity of an ereport() isn't based just on how
>> alarming the situation is; it's based on what you want to happen when
>> that situation comes up. And we've decided (correctly, I think) that
>> resource leaks are not grounds for aborting a transaction that
>> otherwise would have committed.
>
> We're not talking about a missed munmap() but about one that failed. If
> we unpin the leaked pins and notice that we haven't actually pinned it
> anymore we do error (well, Assert) out. Same for TupleDescs.
>
> If there were valid scenarios in which you could get into that
> situation, maybe. But which would that be? ISTM we can only get there if
> our internal state is messed up.

I don't know. I think that's part of why it's hard to decide what we
want to happen. But personally I think it's paranoid to say, well,
something happened that we weren't expecting, so that must mean
something totally horrible has happened and we'd better die in a fire.
I mean, the fact that the checks you are talking about are assertions
means that they are scenarios we expect never to happen, and therefore
we don't even check for them in a production build. I don't think you
can use that as a precedent to show that any failure here is an
automatic PANIC.

>> > Why isn't the port number part of the posix shmem identifiers? Sure, we
>> > retry, but using a logic similar to sysv_shmem.c seems like a good idea.
>>
>> According to the man page for shm_open on Solaris, "For maximum
>> portability, name should include no more than 14 characters, but this
>> limit is not enforced."
>
> What about "/pgsql.%u" or something similar? That should still fit.

Well, if you want both the port and the identifier in there, that
doesn't get you there.

>> >> +static int
>> >> +errcode_for_dynamic_shared_memory()
>> >> +{
>> >> +	if (errno == EFBIG || errno == ENOMEM)
>> >> +		return errcode(ERRCODE_OUT_OF_MEMORY);
>> >> +	else
>> >> +		return errcode_for_file_access();
>> >> +}
>> >
>> > Is EFBIG guaranteed to be defined?
>>
>> I dunno. We could put an #ifdef around that part. Should we do that
>> now or wait and see if it actually breaks anywhere?
>
> A bit of googling around seems to indicate it's likely to be
> available. Even on windows according to MSDN.

Cool.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
