Re: logical decoding : exceeded maxAllocatedDescs for .spill files

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera from 2ndQuadrant <alvherre(at)alvh(dot)no-ip(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: logical decoding : exceeded maxAllocatedDescs for .spill files
Date: 2019-11-19 11:28:18
Message-ID: CAJ3gD9ckx6ch34tGk6UWWqhWb6bnd9+8Ax8=wnoZgJ_J_YNUgg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 19 Nov 2019 at 14:07, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Nov 18, 2019 at 5:50 PM Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
> >
> > On Mon, 18 Nov 2019 at 17:20, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > I see that you have made changes in ReorderBufferRestoreChanges to use
> > > PathNameOpenFile, but not in ReorderBufferSerializeTXN. Is there a
> > > reason for the same? In my test environment, with the test provided
> > > by you, I got the error (reported in this thread) via
> > > ReorderBufferSerializeTXN.
> >
> > You didn't get this error with the patch applied, did you ?
> >
>
> No, I got this before applying the patch. However, after applying the
> patch, I got below error in the same test:
>
> postgres=# SELECT 1 from
> pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1;
> ERROR: could not read from reorderbuffer spill file: Invalid argument
>
> It seems to me that FileRead API used in the patch can return value <
> 0 on EOF. See the API usage in BufFileLoadBuffer. I got this error
> on a windows machine and in the server log the message was "LOG:
> unrecognized win32 error code: 38" which indicates "Reached the end of
> the file."

On Windows, it is documented that ReadFile() (which is called by
pg_pread) will return false on EOF but only when the file is open for
asynchronous reads/writes. But here we are just dealing with usual
synchronous reads. So pg_pread() code should indeed return 0 on EOF on
Windows. Not yet able to figure out how FileRead() managed to return
this error on Windows. But from your symptoms, it does look like
pg_pread()=>ReadFile() returned false (despite doing asynchronous
reads), and so _dosmaperr() gets called, and then it does not find the
eof error in doserrors[], so the "unrecognized win32 error code"
message is printed. May have to dig up more on this.

>
> > If you were debugging this without the patch applied, I suspect that
> > the reason why ReorderBufferSerializeTXN() => OpenTransientFile() is
> > generating this error is because the max limit must be already crossed
> > because of earlier calls to ReorderBufferRestoreChanges().
> >
> > Note that in ReorderBufferSerializeTXN(), OpenTransientFile() is
> > sufficient because the code in that function has made sure the fd gets
> > closed there itself.
> >
>
> Okay, then we might not need it there, but we should at least add a
> comment in ReorderBufferRestoreChanges to explain why we have used a
> different function to operate on the file at that place.

Yeah, that might make sense.

>
> >
> > For the API's that use VFDs (like PathNameOpenFile), the files opened
> > are always recorded in the VfdCache array. So it is not required to do
> > the cleanup at (sub)transaction end, because the kernel fds get closed
> > dynamically in ReleaseLruFiles() whenever they reach max_safe_fds
> > limit. So if a transaction aborts, the fds might remain open, but
> > those will get cleaned up whenever we require more fds, through
> > ReleaseLruFiles(). Whereas, for files opened through
> > OpenTransientFile(), VfdCache is not involved, so this needs
> > transaction end cleanup.
> >
>
> Have you tried by injecting some error? After getting the error
> mentioned above in email, when I retried the same query, I got the
> below message.
>
> postgres=# SELECT 1 from
> pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1;
> ERROR: could not remove file
> "pg_replslot/regression_slot/xid-1693-lsn-0-18000000.spill" during
> removal of pg_replslot/regression_slot/xid*: Permission denied
>
> And, then I tried to drop the replication slot and I got below error.
> postgres=# SELECT * FROM pg_drop_replication_slot('regression_slot');
> ERROR: could not rename file "pg_replslot/regression_slot" to
> "pg_replslot/regression_slot.tmp": Permission denied
>
> It might be something related to Windows

Oh ok, I missed the fact that on Windows we can't delete the files
that are already open, unlike Linux/Unix.
I guess, I may have to use FD_CLOSE_AT_EOXACT flags; or simply use
OpenTemporaryFile(). I wonder though if this same issue might come up
for the other use-case of PathNameOpenFile() :
logical_rewrite_log_mapping().

> but you can once try by
> injecting some error after reading a few files in the code path and
> see the behavior.
Yeah, will check the behaviour, although on Linux, I think I won't get
this error. But yes, like I mentioned above, I think we might have to
arrange for something.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2019-11-19 11:49:02 Re: logical decoding : exceeded maxAllocatedDescs for .spill files
Previous Message Kyotaro Horiguchi 2019-11-19 10:48:10 Re: Protect syscache from bloating with negative cache entries