Re: pgsql: Add parallel-aware hash joins.

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pgsql: Add parallel-aware hash joins.
Date: 2018-01-04 20:06:47
Message-ID: 20180104200647.wna6yoin7nav5qqo@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers pgsql-hackers

On 2018-01-04 11:20:33 -0800, Andres Freund wrote:
> On 2018-01-04 12:11:37 -0500, Tom Lane wrote:
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > > On Thu, Jan 4, 2018 at 11:00 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > >> Also, what the devil is happening on skink?
> >
> > > So, skink is apparently dying during shutdown of a user-connected
> > > backend, and specifically the one that executed the 'tablespace' test.
> >
> > Well, yeah, valgrind is burping: the postmaster log is full of
> >
> > ==10544== VALGRINDERROR-BEGIN
> > ==10544== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
> > ==10544== at 0x7011490: epoll_pwait (epoll_pwait.c:42)
> > ==10544== by 0x4BF40B: WaitEventSetWaitBlock (latch.c:1048)
> > ==10544== by 0x4BF40B: WaitEventSetWait (latch.c:1000)
> > ==10544== by 0x3C0B3B: secure_read (be-secure.c:166)
> > ==10544== by 0x3CCD9E: pq_recvbuf (pqcomm.c:963)
> > ==10544== by 0x3CDA07: pq_getbyte (pqcomm.c:1006)
> > ==10544== by 0x4E2A2D: SocketBackend (postgres.c:339)
> > ==10544== by 0x4E444E: ReadCommand (postgres.c:512)
> > ==10544== by 0x4E7588: PostgresMain (postgres.c:4085)
> > ==10544== by 0x4641D0: BackendRun (postmaster.c:4412)
> > ==10544== by 0x467308: BackendStartup (postmaster.c:4084)
> > ==10544== by 0x4675F7: ServerLoop (postmaster.c:1757)
> > ==10544== by 0x4689D4: PostmasterMain (postmaster.c:1365)
> > ==10544== Address 0x0 is not stack'd, malloc'd or (recently) free'd
> > ==10544==
> > ==10544== VALGRINDERROR-END
> >
> > But (a) this is happening in multiple branches, and (b) we've not
> > changed anything near that code in awhile. I think something broke
> > in valgrind itself.
>
> Some packages on skink have been upgraded. It appears that there either
> was a libc or valgrind change that made valgrind not recognize that a
> pointer of 0 might not point anywhere :(

It looks like recent glibc implements epoll_wait() as epoll_pwait() with
a NULL sigmask argument, and valgrind isn't happy with it.

> Let me check whether valgrind accept multiple suppression files, in
> which case I could add a suppression for this error to all
> branches.

Done that. I've added a valgrind-global.supp that has:

{
null-sigmask-to-pwait-means-no-sigmask
Memcheck:Param
epoll_pwait(sigmask)
fun:epoll_pwait
fun:WaitEventSetWaitBlock
}

> Will also check whether I can reproduce locally.

Locally the above suppression allowed me to continue a bit further. But
I promptly hit other new blurbs:

assert build:
2018-01-04 12:03:45.927 PST [5716][3/189] STATEMENT: select format('%1$s %4$s', 1, 2, 3);
==5718== Invalid write of size 8
==5718== at 0x40E3DD: ExecInterpExpr (execExprInterp.c:1117)
==5718== by 0x40F5E9: ExecInterpExprStillValid (execExprInterp.c:1788)
==5718== by 0xE6B0409: ExecEvalExpr (executor.h:282)
==5718== by 0xE6BA6EA: exec_eval_simple_expr (pl_exec.c:5602)
==5718== by 0xE6B9DBB: exec_eval_expr (pl_exec.c:5197)
==5718== by 0xE6B5A4B: exec_stmt_raise (pl_exec.c:3230)
==5718== by 0xE6B2C04: exec_stmt (pl_exec.c:1632)
==5718== by 0xE6B2956: exec_stmts (pl_exec.c:1535)
==5718== by 0xE6B27FC: exec_stmt_block (pl_exec.c:1473)
==5718== by 0xE6B083E: plpgsql_exec_function (pl_exec.c:474)
==5718== by 0xE6AB13C: plpgsql_inline_handler (pl_handler.c:330)
==5718== by 0x76C2B9: OidFunctionCall1Coll (fmgr.c:1327)
==5718== by 0x3A28CA: ExecuteDoStmt (functioncmds.c:2234)
==5718== by 0x5F92EB: standard_ProcessUtility (utility.c:532)
==5718== by 0x5F8E26: ProcessUtility (utility.c:357)
==5718== by 0x5F7D9F: PortalRunUtility (pquery.c:1178)
==5718== by 0x5F7FC0: PortalRunMulti (pquery.c:1324)
==5718== by 0x5F74DE: PortalRun (pquery.c:799)
==5718== by 0x5F10BE: exec_simple_query (postgres.c:1120)
==5718== by 0x5F5640: PostgresMain (postgres.c:4143)
==5718== Address 0x20 is not stack'd, malloc'd or (recently) free'd
==5718==
{
<insert_a_suppression_name_here>
Memcheck:Addr8
fun:ExecInterpExpr
fun:ExecInterpExprStillValid
fun:ExecEvalExpr
fun:exec_eval_simple_expr
fun:exec_eval_expr
fun:exec_stmt_raise
fun:exec_stmt
fun:exec_stmts
fun:exec_stmt_block
fun:plpgsql_exec_function
fun:plpgsql_inline_handler
fun:OidFunctionCall1Coll
fun:ExecuteDoStmt
fun:standard_ProcessUtility
fun:ProcessUtility
fun:PortalRunUtility
fun:PortalRunMulti
fun:PortalRun
fun:exec_simple_query
fun:PostgresMain
}
==5718==
==5718== Process terminating with default action of signal 11 (SIGSEGV)
==5718== Access not within mapped region at address 0x20
==5718== at 0x40E3DD: ExecInterpExpr (execExprInterp.c:1117)
==5718== by 0x40F5E9: ExecInterpExprStillValid (execExprInterp.c:1788)
==5718== by 0xE6B0409: ExecEvalExpr (executor.h:282)
==5718== by 0xE6BA6EA: exec_eval_simple_expr (pl_exec.c:5602)
==5718== by 0xE6B9DBB: exec_eval_expr (pl_exec.c:5197)
==5718== by 0xE6B5A4B: exec_stmt_raise (pl_exec.c:3230)
==5718== by 0xE6B2C04: exec_stmt (pl_exec.c:1632)
==5718== by 0xE6B2956: exec_stmts (pl_exec.c:1535)
==5718== by 0xE6B27FC: exec_stmt_block (pl_exec.c:1473)
==5718== by 0xE6B083E: plpgsql_exec_function (pl_exec.c:474)
==5718== by 0xE6AB13C: plpgsql_inline_handler (pl_handler.c:330)
==5718== by 0x76C2B9: OidFunctionCall1Coll (fmgr.c:1327)
==5718== by 0x3A28CA: ExecuteDoStmt (functioncmds.c:2234)
==5718== by 0x5F92EB: standard_ProcessUtility (utility.c:532)
==5718== by 0x5F8E26: ProcessUtility (utility.c:357)
==5718== by 0x5F7D9F: PortalRunUtility (pquery.c:1178)
==5718== by 0x5F7FC0: PortalRunMulti (pquery.c:1324)
==5718== by 0x5F74DE: PortalRun (pquery.c:799)
==5718== by 0x5F10BE: exec_simple_query (postgres.c:1120)
==5718== by 0x5F5640: PostgresMain (postgres.c:4143)
==5718== If you believe this happened as a result of a stack
==5718== overflow in your program's main thread (unlikely but
==5718== possible), you can try to increase the size of the
==5718== main thread stack using the --main-stacksize= flag.
==5718== The main thread stack size used in this run was 8388608.

This looks like it might be related to recent plpgsql changes. Looks
like some pointer in:
/* call input function (similar to InputFunctionCall) */
if (!op->d.iocoerce.finfo_in->fn_strict || str != NULL)
is NULL.

==4286== Conditional jump or move depends on uninitialised value(s)
==4286== at 0x6BA8A17: __wcsnlen_avx2 (strlen-avx2.S:261)
==4286== by 0x6AF2FF1: wcsrtombs (wcsrtombs.c:104)
==4286== by 0x6A88A40: wcstombs (wcstombs.c:34)
==4286== by 0x514772: lowerstr_with_len.part.1 (ts_locale.c:284)
==4286== by 0x523A63: readstoplist (ts_utils.c:119)
==4286== by 0xEAB2839: dsnowball_init (dict_snowball.c:201)
==4286== by 0x684975: OidFunctionCall1Coll (fmgr.c:1327)
==4286== by 0x679277: lookup_ts_dictionary_cache (ts_cache.c:337)
==4286== by 0x514E02: LexizeExec (ts_parse.c:203)
==4286== by 0x5153ED: parsetext (ts_parse.c:405)
==4286== by 0x521DD2: to_tsvector_byid (to_tsany.c:249)
==4286== by 0x380261: ExecInterpExpr (execExprInterp.c:664)
==4286== by 0x456069: ExecEvalExprSwitchContext (executor.h:301)
==4286== by 0x456069: evaluate_expr (clauses.c:4873)
==4286== by 0x4588B5: evaluate_function (clauses.c:4424)
==4286== by 0x4588B5: simplify_function (clauses.c:4064)
==4286== by 0x457519: eval_const_expressions_mutator (clauses.c:2676)
==4286== by 0x3D615A: expression_tree_mutator (nodeFuncs.c:2903)
==4286== by 0x458E43: simplify_function (clauses.c:4055)
==4286== by 0x457519: eval_const_expressions_mutator (clauses.c:2676)
==4286== by 0x3D5C02: expression_tree_mutator (nodeFuncs.c:2614)
==4286== by 0x4572C1: eval_const_expressions_mutator (clauses.c:3671)
==4286== Uninitialised value was created by a heap allocation
==4286== at 0x4C2BB76: malloc (vg_replace_malloc.c:299)
==4286== by 0x6A9667: AllocSetAlloc (aset.c:945)
==4286== by 0x6B5AC7: palloc (mcxt.c:848)
==4286== by 0x5238EA: get_tsearch_config_filename (ts_utils.c:55)
==4286== by 0x5239AF: readstoplist (ts_utils.c:75)
==4286== by 0xEAB2839: dsnowball_init (dict_snowball.c:201)
==4286== by 0x684975: OidFunctionCall1Coll (fmgr.c:1327)
==4286== by 0x679277: lookup_ts_dictionary_cache (ts_cache.c:337)
==4286== by 0x514E02: LexizeExec (ts_parse.c:203)
==4286== by 0x5153ED: parsetext (ts_parse.c:405)
==4286== by 0x521DD2: to_tsvector_byid (to_tsany.c:249)
==4286== by 0x380261: ExecInterpExpr (execExprInterp.c:664)
==4286== by 0x456069: ExecEvalExprSwitchContext (executor.h:301)
==4286== by 0x456069: evaluate_expr (clauses.c:4873)
==4286== by 0x4588B5: evaluate_function (clauses.c:4424)
==4286== by 0x4588B5: simplify_function (clauses.c:4064)
==4286== by 0x457519: eval_const_expressions_mutator (clauses.c:2676)
==4286== by 0x3D615A: expression_tree_mutator (nodeFuncs.c:2903)
==4286== by 0x458E43: simplify_function (clauses.c:4055)
==4286== by 0x457519: eval_const_expressions_mutator (clauses.c:2676)
==4286== by 0x3D5C02: expression_tree_mutator (nodeFuncs.c:2614)
==4286== by 0x4572C1: eval_const_expressions_mutator (clauses.c:3671)
==4286==
{
<insert_a_suppression_name_here>
Memcheck:Cond
fun:__wcsnlen_avx2
fun:wcsrtombs
fun:wcstombs
fun:lowerstr_with_len.part.1
fun:readstoplist
fun:dsnowball_init
fun:OidFunctionCall1Coll
fun:lookup_ts_dictionary_cache
fun:LexizeExec
fun:parsetext
fun:to_tsvector_byid
fun:ExecInterpExpr
fun:ExecEvalExprSwitchContext
fun:evaluate_expr
fun:evaluate_function
fun:simplify_function
fun:eval_const_expressions_mutator
fun:expression_tree_mutator
fun:simplify_function
fun:eval_const_expressions_mutator
fun:expression_tree_mutator
fun:eval_const_expressions_mutator
}

I don't recall any recent changes in the area, so this might be a glibc
internal issue.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Tom Lane 2018-01-04 20:16:15 Re: pgsql: Add parallel-aware hash joins.
Previous Message Tom Lane 2018-01-04 19:59:13 pgsql: Fix incorrect computations of length of null bitmap in pageinspe

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-01-04 20:16:15 Re: pgsql: Add parallel-aware hash joins.
Previous Message PG Bug reporting form 2018-01-04 20:06:33 BUG #14999: pg_rewind corrupts control file global/pg_control