Skip site navigation (1) Skip section navigation (2)

WIP: Collecting statistics on CSV file data

From: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: WIP: Collecting statistics on CSV file data
Date: 2011-09-12 10:40:11
Message-ID: 4E6DE18B.4010505@lab.ntt.co.jp (view raw or flat)
Thread:
Lists: pgsql-hackers
Hi there,

To enable file_fdw to estimate costs of scanning a CSV file more
accurately, I would like to propose a new FDW callback routine,
AnalyzeForeignTable, which allows to ANALYZE command to collect
statistics on a foreign table, and a corresponding file_fdw function,
fileAnalyzeForeignTable. Attached is my WIP patch.

Here's a summary of the implementation:

void AnalyzeForeignTable (Relation relation, VacuumStmt *vacstmt, int
elevel);
This is a new FDW callback routine to collect statistics on a foreign
table and store the results in the pg_class and pg_statistic system
catalogs. This is called when ANALYZE command is executed. (ANALYZE
command should be executed because autovacuum does not analyze foreign
tables.)

static void fileAnalyzeForeignTable(Relation relation, VacuumStmt
*vacstmt, int elevel);
This new file_fdw function collects and stores the same statistics on
CSV file data as collected on a local table except for index related
statistics by executing the sequential scan on the CSV file and
acquiring sample rows using Vitter's algorithm. (It is time-consuming
for a large file.)

estimate_costs() (more precisely, clauselist_selectivity() in
estimate_costs()) estimates baserel->rows using the statistics stored in
the pg_statistic system catalog. If there are no statistics,
estimate_costs() estimates it using the default statistics as in
PostgreSQL 9.1.

I am able to demonstrate the effectiveness of this patch. The following
run is performed on a single core of a 3.00GHz Intel Xeon CPU with 8GB
of RAM. Configuration settings are default except for work_mem = 256MB.
We can see from this result that the optimiser selects a good plan when
the foreign tables have been analyzed.

I appreciate your comments and suggestions.

[sample csv file data]
postgres=# COPY (SELECT s.a, repeat('a', 100) FROM generate_series(1,
5000000) AS s(a)) TO '/home/pgsql/sample_csv_data1.csv' (FORMAT csv,
DELIMITER ',');
COPY 5000000
postgres=# COPY (SELECT (random()*10000)::int, repeat('b', 100) FROM
generate_series(1, 5000000)) TO '/home/pgsql/sample_csv_data2.csv'
(FORMAT csv, DELIMITER ',');
COPY 5000000

[Unpatched]
postgres=# CREATE FOREIGN TABLE tab1 (aid INTEGER, msg text) SERVER
file_fs OPTIONS (filename '/home/pgsql/sample_csv_data1.csv', format
'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# CREATE FOREIGN TABLE tab2 (aid INTEGER, msg text) SERVER
file_fs OPTIONS (filename '/home/pgsql/sample_csv_data2.csv', format
'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# SELECT count(*) FROM tab1;
  count
---------
 5000000
(1 row)

postgres=# SELECT count(*) FROM tab2;
  count
---------
 5000000
(1 row)

postgres=# EXPLAIN ANALYZE SELECT count(*) FROM tab1, tab2 WHERE
tab1.aid >= 0 AND tab1.aid <= 10000 AND tab1.aid = tab2.aid;
                                                                   QUERY
PLAN

---------------------------------------------------------------------------------------------------------------------------------------------

---
 Aggregate  (cost=128859182.29..128859182.30 rows=1 width=0) (actual
time=27321.304..27321.304 rows=1 loops=1)
   ->  Merge Join  (cost=5787102.68..111283426.33 rows=7030302383
width=0) (actual time=22181.428..26736.194 rows=4999745 loops=1)
         Merge Cond: (tab1.aid = tab2.aid)
         ->  Sort  (cost=1857986.37..1858198.83 rows=84983 width=4)
(actual time=5964.282..5965.958 rows=10000 loops=1)
               Sort Key: tab1.aid
               Sort Method: quicksort  Memory: 853kB
               ->  Foreign Scan on tab1  (cost=0.00..1851028.44
rows=84983 width=4) (actual time=0.071..5962.382 rows=10000 loops=1)
                     Filter: ((aid >= 0) AND (aid <= 10000))
                     Foreign File: /home/pgsql/sample_csv_data1.csv
                     Foreign File Size: 543888896
         ->  Materialize  (cost=3929116.30..4011842.29 rows=16545197
width=4) (actual time=16216.953..19550.846 rows=5000000 loops=1)
               ->  Sort  (cost=3929116.30..3970479.30 rows=16545197
width=4) (actual time=16216.947..18418.684 rows=5000000 loops=1)
                     Sort Key: tab2.aid
                     Sort Method: external merge  Disk: 68424kB
                     ->  Foreign Scan on tab2  (cost=0.00..1719149.70
rows=16545197 width=4) (actual time=0.081..6059.630 rows=5000000 loops=1)
                           Foreign File: /home/pgsql/sample_csv_data2.csv
                           Foreign File Size: 529446313
 Total runtime: 27350.673 ms
(18 rows)

[Patched]
postgres=# CREATE FOREIGN TABLE tab1 (aid INTEGER, msg text) SERVER
file_fs OPTIONS (filename '/home/pgsql/sample_csv_data1.csv', format
'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# CREATE FOREIGN TABLE tab2 (aid INTEGER, msg text) SERVER
file_fs OPTIONS (filename '/home/pgsql/sample_csv_data2.csv', format
'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# ANALYZE VERBOSE tab1;
INFO:  analyzing "public.tab1"
INFO:  "tab1": scanned, containing 5000000 rows; 30000 rows in sample
ANALYZE
postgres=# ANALYZE VERBOSE tab2;
INFO:  analyzing "public.tab2"
INFO:  "tab2": scanned, containing 5000000 rows; 30000 rows in sample
ANALYZE
postgres=# EXPLAIN ANALYZE SELECT count(*) FROM tab1, tab2 WHERE
tab1.aid >= 0 AND tab1.aid <= 10000 AND tab1.aid = tab2.aid;
                                                            QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1282725.25..1282725.26 rows=1 width=0) (actual
time=15114.325..15114.325 rows=1 loops=1)
   ->  Hash Join  (cost=591508.50..1271157.90 rows=4626940 width=0)
(actual time=5964.449..14526.822 rows=4999745 loops=1)
         Hash Cond: (tab2.aid = tab1.aid)
         ->  Foreign Scan on tab2  (cost=0.00..564630.00 rows=5000000
width=4) (actual time=0.070..6253.257 rows=5000000 loops=1)
               Foreign File: /home/pgsql/sample_csv_data2.csv
               Foreign File Size: 529446313
         ->  Hash  (cost=591393.00..591393.00 rows=9240 width=4) (actual
time=5964.346..5964.346 rows=10000 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 352kB
               ->  Foreign Scan on tab1  (cost=0.00..591393.00 rows=9240
width=4) (actual time=0.066..5962.222 rows=10000 loops=1)
                     Filter: ((aid >= 0) AND (aid <= 10000))
                     Foreign File: /home/pgsql/sample_csv_data1.csv
                     Foreign File Size: 543888896
 Total runtime: 15114.480 ms
(13 rows)

Best regards,
Etsuro Fujita

Attachment: postgresql-analyze-v1.patch
Description: text/plain (31.3 KB)

Responses

pgsql-hackers by date

Next:From: Peter EisentrautDate: 2011-09-12 12:39:42
Subject: Re: psql additions
Previous:From: Stephen FrostDate: 2011-09-12 09:55:42
Subject: Re: superusers are members of all roles?

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group