open/lseek overhead with many partitions (including with "faster partitioning pruning")

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: open/lseek overhead with many partitions (including with "faster partitioning pruning")
Date: 2018-04-03 19:46:13
Message-ID: 20180403194613.GY28454@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

TLDR: even with "faster pruning" patch, pg11dev open()+lseek() every file
backing every table being queried, even for those partitions eventually
"excluded".

One of our customers (with the largest number of child tables, currently of
relkind='r') takes a long time to plan query, and I wondered whether we'd
benefit from ongoing work, if I should plan to convert to relkind='p', and if I
could start testing that for our use cases.

I found by stracing that the backend is open()ing + lseek(SEEK_END) each file
in each relation being UNIONed in this view, including relations which were
excluded by constraints. Their tables are partitioned by day (some are still
40+GB), so there's only 10-20 tables being included in the above query, total
size perhaps ~50GB and under 50 backing files, but accessing a LOT more files
than needed:

cold backend:
[pryzbyj(at)database ~]$ grep open /tmp/strace-pg12-0|cut -d/ -f1,2 |sort |uniq -c |sort -nr |head
26593 open("pg_tblspc/16400
6894 open("pg_tblspc/16401
482 open("base/16402

warm backend:
[pryzbyj(at)database ~]$ grep open /tmp/strace-pg12-1|cut -d/ -f1,2 |sort |uniq -c |sort -nr |head
6545 open("pg_tblspc/16401
1338 open("pg_tblspc/16400
254 open("base/16402

I was curious if that was improved in pg11dev, and if it applied to relkind='r'
or only to relkind='p'. I applied v47 patches for "path toward faster
partition pruning", and was able to quickly create synthetic "huge table" using
truncate (and verified that smgr logic actually opens all 1e5 files).

$ psql -d postgres -h /tmp --port 5678 -c "CREATE TABLE t(i int) PARTITION BY range(i)"
$ for a in `seq 1 999`; do psql -d postgres -h /tmp --port 5678 -c "CREATE TABLE t$a PARTITION OF t FOR VALUES FROM ($a)TO($((1+a)))"; done
$ time for a in ./test11dev/base/13236/1???[0-9]; do echo "$a..."; truncate -s 1G $a; done # make existing, empty files 1GB..
$ time for a in ./test11dev/base/13236/1???[0-9]; do echo "$a..."; for b in `seq 1 99`; do truncate -s 1G $a.$b; done; done # make new 1GB tables with .n extension

postgres=# explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;
Limit (cost=0.00..0.01 rows=1 width=8)
-> Aggregate (cost=0.00..0.01 rows=1 width=8)
-> Result (cost=0.00..0.00 rows=0 width=0)
One-Time Filter: false
Time: 63268.474 ms (01:03.268)

=> zfs/fuse is certainly exacerbating the issue here due to context switches; but..

[pryzbyj(at)database postgresql]$ time cut -d\( -f1 /tmp/strace-pg11dev-faster-15|sort |uniq -c |sort -nr |head
100899 open
99900 lseek
98918 close
118 brk

postgres is still open()+lseek() on 100k files, most of which are for relations
which were clearly excluded:

(gdb) bt
#0 0x00000032b2adb580 in __open_nocancel () from /lib64/libc.so.6
#1 0x0000000000711827 in BasicOpenFilePerm (fileName=0x1e2c9d0 "base/13236/16759.10", fileFlags=2, fileMode=384) at fd.c:965
#2 0x00000000007121ea in PathNameOpenFilePerm (fileName=0x1e2c9d0 "base/13236/16759.10", fileFlags=2, fileMode=384) at fd.c:1394
#3 0x0000000000733f6e in _mdfd_openseg (reln=0x1e14d98, forknum=MAIN_FORKNUM, segno=10, oflags=0) at md.c:1783
#4 0x0000000000734026 in mdnblocks (reln=0x1e14d98, forknum=MAIN_FORKNUM) at md.c:918
#5 0x00000000006b26bc in estimate_rel_size (rel=0x7f3f707f4b98, attr_widths=0x1e2cd24, pages=<value optimized out>, tuples=0x1e2cb98, allvisfrac=<value optimized out>) at plancat.c:946
#6 0x00000000006b3a47 in get_relation_info (root=0x1bd9030, relationObjectId=16759, inhparent=false, rel=0x1e2cae0) at plancat.c:144
#7 0x00000000006b78fa in build_simple_rel (root=0x1bd9030, relid=126, parent=0x1bda578) at relnode.c:185
#8 0x00000000006b7990 in build_simple_rel (root=0x1bd9030, relid=1, parent=0x0) at relnode.c:251
#9 0x0000000000691be3 in add_base_rels_to_query (root=0x1bd9030, jtnode=<value optimized out>) at initsplan.c:121
#10 0x00000000006924af in query_planner (root=<value optimized out>, tlist=0x1d34fd8, qp_callback=0x693940 <standard_qp_callback>, qp_extra=0x7ffe0a077290) at planmain.c:138
#11 0x000000000069777e in grouping_planner (root=<value optimized out>, inheritance_update=false, tuple_fraction=<value optimized out>) at planner.c:1892
#12 0x0000000000698ef7 in subquery_planner (glob=<value optimized out>, parse=<value optimized out>, parent_root=<value optimized out>, hasRecursion=<value optimized out>, tuple_fraction=0) at planner.c:966
#13 0x0000000000699d97 in standard_planner (parse=0x1bd9778, cursorOptions=256, boundParams=<value optimized out>) at planner.c:405
#14 0x0000000000739d2a in pg_plan_query (querytree=<value optimized out>, cursorOptions=<value optimized out>, boundParams=<value optimized out>) at postgres.c:808
#15 0x00000000005a0bd6 in ExplainOneQuery (query=0x1bd9778, cursorOptions=<value optimized out>, into=0x0, es=0x1bd8f58, queryString=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;", params=0x0, queryEnv=0x0)
at explain.c:365
#16 0x00000000005a0e5d in ExplainQuery (pstate=0x1bd8d80, stmt=0x1ae1fa0, queryString=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;", params=0x0, queryEnv=0x0, dest=0x1bd8e90) at explain.c:254
#17 0x00000000007408f2 in standard_ProcessUtility (pstmt=0x1ae2050, queryString=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x1bd8e90,
completionTag=0x7ffe0a077720 "") at utility.c:670
#18 0x000000000073c789 in PortalRunUtility (portal=0x1b46780, pstmt=0x1ae2050, isTopLevel=<value optimized out>, setHoldSnapshot=<value optimized out>, dest=0x1bd8e90, completionTag=<value optimized out>) at pquery.c:1178
#19 0x000000000073d9aa in FillPortalStore (portal=0x1b46780, isTopLevel=true) at pquery.c:1038
#20 0x000000000073de80 in PortalRun (portal=0x1b46780, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x1ae2618, altdest=0x1ae2618, completionTag=0x7ffe0a077920 "") at pquery.c:768
#21 0x000000000073a39d in exec_simple_query (query_string=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;") at postgres.c:1121
#22 0x000000000073b341 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x1b0ac80 "postgres", username=<value optimized out>) at postgres.c:4149
#23 0x00000000006cd4f3 in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4409
#24 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4081
#25 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1754
#26 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1362
#27 0x00000000006477d0 in main (argc=7, argv=0x1adba20) at main.c:228

I haven't been following the development patch, so don't know if that's
expected to be improved, or a possible future improvement, or ...

Also, I wondered if it'd be preferable for mdnblocks/estimate_rel_size to
l/stat() rather than open+lseek, which can require opening large number of
rarely-queried files and mess up fd.c LRU cache.

Justin

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2018-04-03 19:50:55 Re: WIP: BRIN multi-range indexes
Previous Message Alvaro Herrera 2018-04-03 19:11:35 Re: Foreign keys and partitioned tables