Quick Links

WIP: Collecting statistics on CSV file data

From:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
Subject:	WIP: Collecting statistics on CSV file data
Date:	2011-09-12 10:40:11
Message-ID:	4E6DE18B.4010505@lab.ntt.co.jp
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi there,

To enable file_fdw to estimate costs of scanning a CSV file more
accurately, I would like to propose a new FDW callback routine,
AnalyzeForeignTable, which allows to ANALYZE command to collect
statistics on a foreign table, and a corresponding file_fdw function,
fileAnalyzeForeignTable. Attached is my WIP patch.

Here's a summary of the implementation:

void AnalyzeForeignTable (Relation relation, VacuumStmt *vacstmt, int
elevel);
This is a new FDW callback routine to collect statistics on a foreign
table and store the results in the pg_class and pg_statistic system
catalogs. This is called when ANALYZE command is executed. (ANALYZE
command should be executed because autovacuum does not analyze foreign
tables.)

static void fileAnalyzeForeignTable(Relation relation, VacuumStmt
*vacstmt, int elevel);
This new file_fdw function collects and stores the same statistics on
CSV file data as collected on a local table except for index related
statistics by executing the sequential scan on the CSV file and
acquiring sample rows using Vitter's algorithm. (It is time-consuming
for a large file.)

estimate_costs() (more precisely, clauselist_selectivity() in
estimate_costs()) estimates baserel->rows using the statistics stored in
the pg_statistic system catalog. If there are no statistics,
estimate_costs() estimates it using the default statistics as in
PostgreSQL 9.1.

I am able to demonstrate the effectiveness of this patch. The following
run is performed on a single core of a 3.00GHz Intel Xeon CPU with 8GB
of RAM. Configuration settings are default except for work_mem = 256MB.
We can see from this result that the optimiser selects a good plan when
the foreign tables have been analyzed.

I appreciate your comments and suggestions.

[sample csv file data]
postgres=# COPY (SELECT s.a, repeat('a', 100) FROM generate_series(1,
5000000) AS s(a)) TO '/home/pgsql/sample_csv_data1.csv' (FORMAT csv,
DELIMITER ',');
COPY 5000000
postgres=# COPY (SELECT (random()*10000)::int, repeat('b', 100) FROM
generate_series(1, 5000000)) TO '/home/pgsql/sample_csv_data2.csv'
(FORMAT csv, DELIMITER ',');
COPY 5000000

[Unpatched]
postgres=# CREATE FOREIGN TABLE tab1 (aid INTEGER, msg text) SERVER
file_fs OPTIONS (filename '/home/pgsql/sample_csv_data1.csv', format
'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# CREATE FOREIGN TABLE tab2 (aid INTEGER, msg text) SERVER
file_fs OPTIONS (filename '/home/pgsql/sample_csv_data2.csv', format
'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# SELECT count(*) FROM tab1;
count
---------
5000000
(1 row)

postgres=# SELECT count(*) FROM tab2;
count
---------
5000000
(1 row)

postgres=# EXPLAIN ANALYZE SELECT count(*) FROM tab1, tab2 WHERE
tab1.aid >= 0 AND tab1.aid <= 10000 AND tab1.aid = tab2.aid;
QUERY
PLAN

---------------------------------------------------------------------------------------------------------------------------------------------

---
Aggregate (cost=128859182.29..128859182.30 rows=1 width=0) (actual
time=27321.304..27321.304 rows=1 loops=1)
-> Merge Join (cost=5787102.68..111283426.33 rows=7030302383
width=0) (actual time=22181.428..26736.194 rows=4999745 loops=1)
Merge Cond: (tab1.aid = tab2.aid)
-> Sort (cost=1857986.37..1858198.83 rows=84983 width=4)
(actual time=5964.282..5965.958 rows=10000 loops=1)
Sort Key: tab1.aid
Sort Method: quicksort Memory: 853kB
-> Foreign Scan on tab1 (cost=0.00..1851028.44
rows=84983 width=4) (actual time=0.071..5962.382 rows=10000 loops=1)
Filter: ((aid >= 0) AND (aid <= 10000))
Foreign File: /home/pgsql/sample_csv_data1.csv
Foreign File Size: 543888896
-> Materialize (cost=3929116.30..4011842.29 rows=16545197
width=4) (actual time=16216.953..19550.846 rows=5000000 loops=1)
-> Sort (cost=3929116.30..3970479.30 rows=16545197
width=4) (actual time=16216.947..18418.684 rows=5000000 loops=1)
Sort Key: tab2.aid
Sort Method: external merge Disk: 68424kB
-> Foreign Scan on tab2 (cost=0.00..1719149.70
rows=16545197 width=4) (actual time=0.081..6059.630 rows=5000000 loops=1)
Foreign File: /home/pgsql/sample_csv_data2.csv
Foreign File Size: 529446313
Total runtime: 27350.673 ms
(18 rows)

[Patched]
postgres=# CREATE FOREIGN TABLE tab1 (aid INTEGER, msg text) SERVER
file_fs OPTIONS (filename '/home/pgsql/sample_csv_data1.csv', format
'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# CREATE FOREIGN TABLE tab2 (aid INTEGER, msg text) SERVER
file_fs OPTIONS (filename '/home/pgsql/sample_csv_data2.csv', format
'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# ANALYZE VERBOSE tab1;
INFO: analyzing "public.tab1"
INFO: "tab1": scanned, containing 5000000 rows; 30000 rows in sample
ANALYZE
postgres=# ANALYZE VERBOSE tab2;
INFO: analyzing "public.tab2"
INFO: "tab2": scanned, containing 5000000 rows; 30000 rows in sample
ANALYZE
postgres=# EXPLAIN ANALYZE SELECT count(*) FROM tab1, tab2 WHERE
tab1.aid >= 0 AND tab1.aid <= 10000 AND tab1.aid = tab2.aid;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1282725.25..1282725.26 rows=1 width=0) (actual
time=15114.325..15114.325 rows=1 loops=1)
-> Hash Join (cost=591508.50..1271157.90 rows=4626940 width=0)
(actual time=5964.449..14526.822 rows=4999745 loops=1)
Hash Cond: (tab2.aid = tab1.aid)
-> Foreign Scan on tab2 (cost=0.00..564630.00 rows=5000000
width=4) (actual time=0.070..6253.257 rows=5000000 loops=1)
Foreign File: /home/pgsql/sample_csv_data2.csv
Foreign File Size: 529446313
-> Hash (cost=591393.00..591393.00 rows=9240 width=4) (actual
time=5964.346..5964.346 rows=10000 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 352kB
-> Foreign Scan on tab1 (cost=0.00..591393.00 rows=9240
width=4) (actual time=0.066..5962.222 rows=10000 loops=1)
Filter: ((aid >= 0) AND (aid <= 10000))
Foreign File: /home/pgsql/sample_csv_data1.csv
Foreign File Size: 543888896
Total runtime: 15114.480 ms
(13 rows)

Best regards,
Etsuro Fujita

Attachment	Content-Type	Size
postgresql-analyze-v1.patch	text/plain	31.3 KB

Responses

Re: WIP: Collecting statistics on CSV file data at 2011-09-20 09:49:53 from Shigeru Hanada
Re: WIP: Collecting statistics on CSV file data at 2011-09-20 14:12:48 from Marti Raudsepp

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Peter Eisentraut	2011-09-12 12:39:42	Re: psql additions
Previous Message	Stephen Frost	2011-09-12 09:55:42	Re: superusers are members of all roles?