|From:||David Fetter <david(at)fetter(dot)org>|
|To:||PostgreSQL Announce <pgsql-announce(at)postgresql(dot)org>|
|Subject:||== PostgreSQL Weekly News - August 9, 2020 ==|
|Views:||Raw Message | Whole Thread | Download mbox | Resend email|
== PostgreSQL Weekly News - August 9, 2020 ==
PGDay Ukraine has been cancelled due to COVID-19.
Person of the week: https://postgresql.life/post/tatsuo_ishii/
== PostgreSQL Jobs for August ==
== PostgreSQL Local ==
pgDay Israel 2020 will take place on September 10, 2020 in Tel Aviv.
== PostgreSQL in the News ==
Planet PostgreSQL: http://planet.postgresql.org/
PostgreSQL Weekly News is brought to you this week by David Fetter
Submit news and announcements by Sunday at 3:00pm PST8PDT to david(at)fetter(dot)org(dot)
== Applied Patches ==
Tom Lane pushed:
- Fix minor issues in psql's new \dAc and related commands. The type-name
pattern in \dAc and \dAf was matched only to the actual pg_type.typname
string, which is fairly user-unfriendly in cases where that is not what's
shown to the user by format_type (compare "_int4" and "integer"). Make this
code match what \dT does, i.e. match the pattern against either typname or
format_type() output. Also fix its broken handling of schema-name
restrictions. (IOW, make these processSQLNamePattern calls match \dT's.)
While here, adjust whitespace to make the query a little prettier in -E
output, too. Also improve some inaccuracies and shaky grammar in the related
documentation. Noted while working on a patch for intarray's opclasses; I
wondered why I couldn't get a match to "integer*" for the input type name.
- Fix behavior of ecpg's "EXEC SQL elif name". This ought to work much like C's
"#elif defined(name)"; but the code implemented it in a way equivalent to
endif followed by ifdef, so that it didn't matter whether any previous branch
of the IF construct had succeeded. Fix that; add some test cases covering
elif and nested IFs; and improve the documentation, which also seemed a bit
confused. AFAICS the code has been like this since the feature was added in
1999 (commit b57b0e044). So while it's surely wrong, there might be code out
there relying on the current behavior. Hence, don't back-patch into stable
branches. It seems all right to fix it in v13 though. Per report from
Ashutosh Sharma. Reviewed by Ashutosh Sharma and Michael Meskes. Discussion:
- Doc: fix obsolete info about allowed range of TZ offsets in timetz. We've
allowed UTC offsets up to +/- 15:59 since commit cd0ff9c0f, but that commit
forgot to fix the documentation about timetz. Per bug #16571 from osdba.
- Remove unnecessary "DISTINCT" in psql's queries for \dAc and \dAf. A moment's
examination of these queries is sufficient to see that they do not produce
duplicate rows, unless perhaps there's catalog corruption. Using DISTINCT
anyway is inefficient and confusing; moreover it sets a poor example for
anyone who refers to psql -E output to see how to query the catalogs.
- Increase hard-wired timeout values in ecpg regression tests. A couple of test
cases had connect_timeout=14, a value that seems to have been plucked from a
hat. While it's more than sufficient for normal cases, slow/overloaded
buildfarm machines can get a timeout failure here, as per recent report from
"sungazer". Increase to 180 seconds, which is in line with our typical
timeouts elsewhere in the regression tests. Back-patch to 9.6; the code looks
different in 9.5, and this doesn't seem to be quite worth the effort to adapt
to that. Report:
- Fix matching of sub-partitions when a partitioned plan is stale. Since we no
longer require AccessExclusiveLock to add a partition, the executor may see
that a partitioned table has more partitions than the planner saw.
ExecCreatePartitionPruneState's code for matching up the partition lists in
such cases was faulty, and would misbehave if the planner had successfully
pruned any partitions from the query. (Thus, trouble would occur only if a
partition addition happens concurrently with a query that uses both static and
dynamic partition pruning.) This led to an Assert failure in debug builds,
and probably to crashes or query misbehavior in production builds. To repair
the bug, just explicitly skip zeroes in the plan's relid_map list. I also
made some cosmetic changes to make the code more readable (IMO anyway). Also,
convert the cross-checking Assert to a regular test-and-elog, since it's now
apparent that this logic is more fragile than one would like. Currently,
there's no way to repeatably exercise this code, except with manual use of a
debugger to stop the backend between planning and execution. Hence, no test
case in this patch. We oughta do something about that testability gap, but
that's for another day. Amit Langote and Tom Lane, per report from Justin
Pryzby. Oversight in commit 898e5e329; backpatch to v12 where that appeared.
- Support testing of cases where table schemas change after planning. We have
various cases where we allow DDL on tables to be performed with less than full
AccessExclusiveLock. This requires concurrent queries to be able to cope with
the DDL change mid-flight, but up to now we had no repeatable way to test such
cases. To improve that, invent a test module that allows halting a backend
after planning and then resuming execution once we've done desired actions in
another session. (The same approach could be used to inject delays in other
places, if there's a suitable hook available.) This commit includes a single
test case, which is meant to exercise the previously-untestable
ExecCreatePartitionPruneState code repaired by commit 7a980dfc6. We'd
probably not bother with this if that were the only foreseen benefit, but I
expect additional test cases will use this infrastructure in the future. Test
module by Andy Fan, partition-addition test case by me. Discussion:
- Remove <@ from contrib/intarray's GiST operator classes. Since commit
efc77cf5f, an indexed query using <@ has required a full-index scan, so that
it actually performs worse than a plain seqscan would do. As I noted at the
time, we'd be better off to not treat <@ as being indexable by such indexes at
all; and that's what this patch does. It would have been difficult to remove
these opclass members without dropping the whole opclass before commit
9f9682783 fixed GiST opclass member dependency rules, but now it's quite
simple, so let's do it. I left the existing support code in place for the
time being, with comments noting it's now unreachable. At some point, perhaps
we should remove that code in favor of throwing an error telling people to
upgrade the extension version. Discussion:
- Remove useless Assert. Testing that an unsigned variable is >= 0 is pretty
pointless, as noted by Coverity and numerous buildfarm members. In passing,
add comment about new uses of "volatile" --- Coverity doesn't much like that
either, but it seems probably necessary.
- Check for fseeko() failure in pg_dump's _tarAddFile(). Coverity pointed out,
not unreasonably, that we checked fseeko's result at every other call site but
these. Failure to seek in the temp file (note this is NOT pg_dump's output
file) seems quite unlikely, and even if it did happen the file length
cross-check further down would probably detect the problem. Still, that's a
poor excuse for not checking the result of a system call.
Thomas Munro pushed:
- Correct comment in simplehash.h. Post-commit review for commit 84c0e4b9.
Author: David Rowley <dgrowleyml(at)gmail(dot)com> Discussion:
- Fix rare failure in LDAP tests. Instead of writing a query to psql's stdin,
use -c. This avoids a failure where psql exits before we write, seen a few
times on the build farm. Thanks to Tom Lane for the suggestion. Back-patch
to 11, where the LDAP tests arrived. Reviewed-by: Noah Misch
Michaël Paquier pushed:
- Add %P to log_line_prefix for parallel group leader. This is useful for
monitoring purposes with log parsing. Similarly to pg_stat_activity, the
leader's PID is shown only for active parallel workers, minimizing the log
footprint for the leaders as the equivalent shared memory field is set as long
as a backend is alive. Author: Justin Pryzby Reviewed-by: Álvaro Herrera,
Michael Paquier, Julien Rouhaud, Tom Lane Discussion:
- Make new SSL TAP test for channel_binding more robust. The test would fail in
an environment including a certificate file in ~/.postgresql/. bdd6e9b fixed
a similar failure, and d6e612f introduced the same problem again with a new
test. Author: Kyotaro Horiguchi Discussion:
Peter Geoghegan pushed:
- Add nbtree page deletion assertion. Add a documenting assertion that's similar
to the nearby assertion added by commit cd8c73a3. This conveys that the
entire call to _bt_pagedel() does no work if it isn't possible to get a
descent stack for the initial scanblkno page.
- Fix replica backward scan race condition. It was possible for the logic used
by backward scans (which must reason about concurrent page splits/deletions in
its own peculiar way) to become confused when running on a replica.
Concurrent replay of a WAL record that describes the second phase of page
deletion could cause _bt_walk_left() to get confused.
btree_xlog_unlink_page() simply failed to adhere to the same locking protocol
that we use on the primary, which is obviously wrong once you consider these
two disparate functions together. This bug is present in all stable branches.
More concretely, the problem was that nothing stopped _bt_walk_left() from
observing inconsistencies between the deletion's target page and its original
sibling pages when running on a replica. This is true even though the second
phase of page deletion is supposed to work as a single atomic action. Queries
running on replicas raised "could not find left sibling of block %u in index
%s" can't-happen errors when they went back to their scan's "original" page
and observed that the page has not been marked deleted (even though it really
was concurrently deleted). There is no evidence that this actually happened
in the real world. The issue came to light during unrelated feature
development work. Note that _bt_walk_left() is the only code that cares about
the difference between a half-dead page and a fully deleted page that isn't
also exclusively used by nbtree VACUUM (unless you include contrib/amcheck
code). It seems very likely that backward scans are the only thing that could
become confused by the inconsistency. Even amcheck's complex
bt_right_page_check_scankey() dance was unaffected. To fix, teach
btree_xlog_unlink_page() to lock the left sibling, target, and right sibling
pages in that order before releasing any locks (just like
_bt_unlink_halfdead_page()). This is the simplest possible approach. There
doesn't seem to be any opportunity to be more clever about lock acquisition in
the REDO routine, and it hardly seems worth the trouble in any case. This fix
might enable contrib/amcheck verification of leaf page sibling links with only
an AccessShareLock on the relation. An amcheck patch from Andrey Borodin was
rejected back in January because it clashed with btree_xlog_unlink_page()'s
lax approach to locking pages. It now seems likely that the real problem was
with btree_xlog_unlink_page(), not the patch. This is a low severity, low
likelihood bug, so no backpatch. Author: Michail Nikolaev Diagnosed-By:
Michail Nikolaev Discussion:
- amcheck: Sanitize metapage's allequalimage field. This will be helpful if it
ever proves necessary to revoke an opclass's support for deduplication.
Backpatch: 13-, where nbtree deduplication was introduced.
- Remove obsolete amcheck comment. Oversight in commit d114cc53871.
- Rename nbtree split REDO routine variables. Make the nbtree page split REDO
routine variable names consistent with _bt_split() (which handles the original
execution of page splits). These names make the code easier to follow by
making the distinction between the original page and the left half of the
split clear. (The left half of the split page is a temp page that REDO
creates to replace the origpage contents.) Also reduce the elevel used when
adding a new high key to the temp page from PANIC to ERROR to be consistent.
We already only raise an ERROR when data item PageAddItem() temp page calls
- Make nbtree split REDO locking match original execution. Make the nbtree page
split REDO routine consistent with original execution in its approach to
acquiring and releasing buffer locks (at least for pages on the tree level of
the page being split). This brings btree_xlog_split() in line with
btree_xlog_unlink_page(), which was taught to couple buffer locks by commit
9a9db08a. Note that the precise order in which we both acquire and release
sibling buffer locks in btree_xlog_split() now matches original execution
exactly (the precise order in which the locks are released probably doesn't
matter much, but we might as well be consistent about it). The rule for
nbtree REDO routines from here on is that same-level locks should be acquired
in an order that's consistent with original execution. It's not practical to
have a similar rule for cross-level page locks, since for the most part
original execution holds those locks for a period that spans multiple atomic
actions/WAL records. It's also not necessary, because clearly the cross-level
lock coupling is only truly needed during original execution because of the
presence of concurrent inserters. This is not a bug fix (unlike the similar
aforementioned commit, commit 9a9db08a). The immediate reason to tighten
things up in this area is to enable an upcoming enhancement to contrib/amcheck
that allows it to verify that sibling links are in agreement with only an
AccessShareLock (this check produced false positives when run on a replica
server on account of the inconsistency fixed by this commit). But that's not
the only reason to be stricter here. It is generally useful to make locking
on replicas be as close to what happens during original execution as
practically possible. It makes it less likely that hard to catch bugs will
slip in in the future. The previous state of affairs seems to be a holdover
from before the introduction of Hot Standby, when buffer lock acquisitions
during recovery were totally unnecessary. See also: commit 3bbf668d, which
tightened things up in this area a few years after the introduction of Hot
- Teach amcheck to verify sibling links in all cases. Teach contrib/amcheck's
bt_index_check() function to check agreement between siblings links. The left
sibling's right link should point to a right sibling page whose left link
points back to the same original left sibling. This extends a check that
bt_index_parent_check() always performed to bt_index_check(). This is the
first time amcheck has been taught to perform buffer lock coupling, which we
have explicitly avoided up until now. The sibling link check tends to catch a
lot of real world index corruption with little overhead, so it seems worth
accepting the complexity. Note that the new lock coupling logic would not
work correctly on replica servers without the changes made by commits 0a7d771f
and 9a9db08a (there could be false positives without those changes). Author:
Andrey Borodin, Peter Geoghegan Discussion:
Alexander Korotkov pushed:
- Remove btree page items after page unlink. Currently, page unlink leaves
remaining items "as is", but replay of corresponding WAL-record re-initializes
page leaving it with no items. For the sake of consistency, this commit makes
primary delete all the items during page unlink as well. Thanks to this
change, we now don't mask contents of deleted btree page for WAL consistency
Author: Alexander Korotkov Reviewed-by: Peter Geoghegan
Bruce Momjian pushed:
- doc: clarify "state" table reference in tutorial. Reported-by: Vyacheslav
Robert Haas pushed:
- Register llvm_shutdown using on_proc_exit, not before_shmem_exit. This seems
more correct, because other before_shmem_exit calls may expect the
infrastructure that is needed to run queries and access the database to be
working, and also because this cleanup has nothing to do with shared memory.
There are no known user-visible consequences to this, though, apart from what
was previous fixed by commit 303640199d0436c5e7acdf50b837a027b5726594 and
back-patched as commit bcbc27251d35336a6442761f59638138a772b839 and commit
f7013683d9bb663a6a917421b1374306a32f165b, so for now, no back-patch. Bharath
David Rowley pushed:
- Fix bogus EXPLAIN output for Hash Aggregate. 9bdb300de modified the EXPLAIN
output for Hash Aggregate to show details from parallel workers. However, it
neglected to consider that a given parallel worker may not have assisted with
the given Hash Aggregate. This can occur when workers fail to start or during
Parallel Append with enable_partitionwise_join enabled when only a single
worker is working on a non-parallel aware sub-plan. It could also happen if a
worker simply wasn't fast enough to get any work done before other processes
went and finished all the work. The bogus output came from the fact that
ExplainOpenWorker() skipped showing any details for non-initialized workers
but show_hashagg_info() did show details from the worker. This meant that the
worker properties that were shown were not properly attributed to the worker
that they belong to. In passing, we also now don't show Hash Aggregate
properties for the leader process when it did not contribute any work to the
Hash Aggregate. This can occur either during Parallel Append when only a
parallel worker worked on a given sub plan or with
parallel_leader_participation set to off. This aims to make the behavior of
Hash Aggregate's EXPLAIN output more similar to Sort's. Reported-by: Justin
Pryzby Discussion: https://postgr.es/m/20200805012105.GZ28072%40telsasoft.com
Backpatch-through: 13, where the original breakage was introduced
Etsuro Fujita pushed:
- Fix yet another issue with step generation in partition pruning. Commit
13838740f fixed some issues with step generation in partition pruning, but
there was yet another one: get_steps_using_prefix() assumes that clauses in
the passed-in prefix list are sorted in ascending order of their partition key
numbers, but the caller failed to ensure this for range partitioning, which
led to an assertion failure in debug builds. Adjust the caller function to
arrange the clauses in the prefix list in the required order for range
partitioning. Back-patch to v11, like the previous commit. Patch by me,
reviewed by Amit Langote. Discussion:
Álvaro Herrera pushed:
- Remove PROC_IN_ANALYZE and derived flags. These flags are unused and always
have been. Discussion:
- walsnd: Don't set waiting_for_ping_response spuriously. Ashutosh Bapat noticed
that when logical walsender needs to wait for WAL, and it realizes that it
must send a keepalive message to walreceiver to update the sent-LSN, which
*does not* request a reply from walreceiver, it wrongly sets the flag that
it's going to wait for that reply. That means that any future would-be sender
of feedback messages ends up not sending a feedback message, because they all
believe that a reply is expected. With built-in logical replication there's
not much harm in this, because WalReceiverMain will send a ping-back every
wal_receiver_timeout/2 anyway; but with other logical replication systems
(e.g. pglogical) it can cause significant pain. This problem was introduced
in commit 41d5f8ad734, where the request-reply flag was changed from true to
false to WalSndKeepalive, without at the same time removing the line that sets
waiting_for_ping_response. Just removing that line would be a sufficient fix,
but it seems better to shift the responsibility of setting the flag to
WalSndKeepalive itself instead of requiring caller to do it; this is clearly
less error-prone. Author: Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Reported-by: Ashutosh Bapat <ashutosh(dot)bapat(at)2ndquadrant(dot)com> Backpatch: 9.5
and up Discussion: https://postgr.es/m/20200806225558.GA22401@alvherre.pgsql
Amit Kapila pushed:
- Implement streaming mode in ReorderBuffer. Instead of serializing the
transaction to disk after reaching the logical_decoding_work_mem limit in
memory, we consume the changes we have in memory and invoke stream API methods
added by commit 45fdc9738b. However, sometimes if we have incomplete toast or
speculative insert we spill to the disk because we can't generate the complete
tuple and stream. And, as soon as we get the complete tuple we stream the
transaction including the serialized changes. We can do this incremental
processing thanks to having assignments (associating subxact with toplevel
xacts) in WAL right away, and thanks to logging the invalidation messages at
each command end. These features are added by commits 0bead9af48 and
c55040ccd0 respectively. Now that we can stream in-progress transactions, the
concurrent aborts may cause failures when the output plugin consults catalogs
(both system and user-defined). We handle such failures by returning
ERRCODE_TRANSACTION_ROLLBACK sqlerrcode from system table scan APIs to the
backend or WALSender decoding a specific uncommitted transaction. The decoding
logic on the receipt of such a sqlerrcode aborts the decoding of the current
transaction and continue with the decoding of other transactions. We have
ReorderBufferTXN pointer in each ReorderBufferChange by which we know which
xact it belongs to. The output plugin can use this to decide which changes to
discard in case of stream_abort_cb (e.g. when a subxact gets discarded). We
also provide a new option via SQL APIs to fetch the changes being streamed.
Author: Dilip Kumar, Tomas Vondra, Amit Kapila, Nikhil Sontakke Reviewed-by:
Amit Kapila, Kuntal Ghosh, Ajin Cherian Tested-by: Neha Sharma, Mahendra Singh
Thalor and Ajin Cherian Discussion:
- Fix the logical streaming test. Commit 7259736a6e added the capability to
stream changes in ReorderBuffer which has some tests to test the streaming
mode. It is quite possible that while this test is running a parallel
transaction could be logged by autovacuum. Such a transaction won't perform
any insert/update/delete to non-catalog tables so will be shown as an empty
transaction. Fix it by skipping the empty transactions during this test. Per
report by buildfarm.
Peter Eisentraut pushed:
- Add some const decorations.
== Pending Patches ==
Anastasia Lubennikova sent in another revision of a patch to set VM all_visible
and all_frozen bits when COPY FREEZE inserts tuples into empty page.
Justin Pryzby sent in another revision of a patch to create some secondary index
access optimizations, and avoid bitmap index scans inconsistent with partition
Justin Pryzby sent in a patch to elog.c to remove a special case which avoided
%*s format strings, which should no longer be needed since it was a performance
hack for specific platform snprintf, which are no longer used.
Andrew Dunstan sent in two more revisions of a patch to support libnss as a TLS
Bharath Rupireddy sent in another revision of a patch to parallelize COPY.
Amit Langote sent in another revision of a patch to use heap slot for
Peter Eisentraut sent in three revisions of a patch to replace the remaining
StrNCpy() calls with strlcpy() calls.
Takashi Menjo sent in another revision of a patch to make it possible to use
PMDK for WAL.
Kyotaro HORIGUCHI sent in a patch to allow using a directory name for the
ssl_crl_file GUC and the sslcrl connection option. X509_STORE_load_locations
accepts a directory, which leads to on-demand loading method with which method
only relevant CRLs are loaded.
Peter Eisentraut sent in a patch to change return type of EXTRACT to numeric.
Bertrand Drouvot sent in a patch to add information to rm_redo_error_callback().
David Cramer sent in a patch to throw an error and rollback on a failed
transaction instead of silently rolling back.
Andrey Borodin sent in another revision of a patch to add sort support for point
gist_point_sortsupport, and implement GiST index builds using same.
Álvaro Herrera sent in a patch to let ALTER TABLE exec routines deal with the
relation, and implement ALTER TABLE ... DETACH CONCURRENTLY.
Alexander Korotkov sent in another revision of a patch to remove btree page
items after page unlink, which removal makes it possible to stop masking the
contents of deleted btree pages for WAL consistency checking.
Bharath Rupireddy sent in another revision of a patch to implement background
worker shared memory access for EXEC_BACKEND cases.
David Rowley sent in another revision of a patch to allow estimate_num_groups()
to pass back further details about the estimation, allow users of simplehash.h
to perform direct deletions, and add a Result Cache executor node.
Mahendra Singh sent in another revision of a patch to add offset with block
number in vacuum errcontext.
Amit Langote, Tom Lane, and Andy Fan traded patches to fix an issue that
manifested as FailedAssertion("pd_idx == pinfo->nparts", File:
"execPartition.c", Line: 1689).
Amit Langote sent in two more revisions of a patch to revise how FDWs obtain
result relation information, avoid making "root" ResultRelInfo for insert
queries, revise the child-to-root tuple conversion map management, and
Initialize result relation information lazily.
Thomas Munro sent in another revision of a patch to ask the checkpointer to
flush SLRU files.
Jehan-Guillaume de Rorthais sent in another revision of a patch to implement
DEMOTE, a command that turns a writer node into a standby.
Dmitry Dolgov sent in another revision of a patch to implement generic type
David Rowley sent in a patch to avoid displaying hashagg properties for workers
that don't assist.
Asim Praveen sent in two more revisions of a patch to use a stringInfo instead
of a char for replace_string in pg_regress.
Ashutosh Sharma sent in two more revisions of a patch to add contrib/pg_surgery
to perform surgery on a damaged heap table.
Álvaro Herrera sent in a patch to flag CREATE INDEX CONCURRENTLY to avoid
Michaël Paquier sent in a patch to implement range checks of pg_test_fsync
--secs-per-test and pg_test_timing --duration.
Bertrand Drouvot sent in another revision of a patch to display individual
queries in pg_stat_activity.
Ryo Matsumura sent in another revision of a patch to ensure that "make
installcheck" works with PGXS.
Michaël Paquier sent in another revision of a patch to use multi-inserts for
pg_depend, and switch to multi-insert dependencies for many code paths.
Amit Langote sent in a patch to fix some inefficiencies in the patch to make
inserts to tables with foreign partitions faster.
Tomáš Vondra sent in another revision of a patch to implement BRIN multi-range
Robert Haas sent in another revision of a patch to fix a bug that manifested as
parallel worker hanging while handling errors.
Bharath Rupireddy sent in another revision of a patch to modify the
Pavel Borisov sent in a patch to implement covering GiST indexes.
Justin Pryzby sent in another revision of a patch to implement CREATE INDEX
CONCURRENTLY on partitioned tables.
Asim Praveen sent in another revision of a patch to start the WAL receiver
before the startup process replays existing WAL.
Michaël Paquier sent in a patch to support REINDEX in psql's tab completion.
Thomas Munro sent in a patch to optimize latches to send fewer signals, and
remove the self-pipe in WAIT_USE_EPOLL and WAIT_USE_KQUEUE builds.
Jeff Janes sent in a patch to implement tab completion for IMPORT FOREIGN SCHEMA
|Next Message||Jonathan S. Katz||2020-08-13 12:33:21||PostgreSQL 12.4, 11.9, 10.14, 9.6.19, 9.5.23, and 13 Beta 3 Released!|
|Previous Message||David Fetter||2020-08-02 19:50:50||== PostgreSQL Weekly News - August 2, 2020 ==|