Restoring large tables with COPY

From: Marko Kreen <marko(at)l-t(dot)ee>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Restoring large tables with COPY
Date: 2001-12-11 15:10:05
Message-ID: 20011211151005.GA28141@l-t.ee
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Maybe I am missing something obvious, but I am unable to load
larger tables (~300k rows) with COPY command that pg_dump by
default produces. Yes, dump as INSERTs works but is slow.

"Cant" as in "it does not work with the default setup I have
running on devel machine" - 128M mem, 128M swap, basically
default postgresql.conf:

1) Too few WAL files.
- well, increase the wal_files (eg to 32),

2) Machine runs out of swap, PostgreSQL seems to keep whole TX
in memory.
- So I must put 1G of swap? But what if I have 1G of rows?

Or shortly: during pg_restore the resource requirements are
order of magnitude higher than during pg_dump, which is
non-obvious and may be a bad surprise when in real trouble.

This is annoying, especially as dump as COPY's should be
preferred as it is faster and smaller. Ofcourse the
dump-as-INSERTs has also positive side - eg. ALTER TABLE DROP
COLUMN with sed...

Patch below implements '-m NUM' switch to pg_dump, which splits
each COPY command to chunks, each maximum NUM rows.

Comments? What am I missing?

--
marko

Index: doc/src/sgml/ref/pg_dump.sgml
===================================================================
RCS file: /opt/cvs/pgsql/pgsql/doc/src/sgml/ref/pg_dump.sgml,v
retrieving revision 1.41
diff -u -c -r1.41 pg_dump.sgml
*** doc/src/sgml/ref/pg_dump.sgml 8 Dec 2001 03:24:37 -0000 1.41
--- doc/src/sgml/ref/pg_dump.sgml 11 Dec 2001 03:58:30 -0000
***************
*** 35,40 ****
--- 35,41 ----
<arg>-f <replaceable>file</replaceable></arg>
<arg>-F <replaceable>format</replaceable></arg>
<arg>-i</arg>
+ <arg>-m <replaceable>num_rows</replaceable></arg>
<group> <arg>-n</arg> <arg>-N</arg> </group>
<arg>-o</arg>
<arg>-O</arg>
***************
*** 301,306 ****
--- 302,321 ----
if you need to override the version check (and if
<command>pg_dump</command> then fails, don't
say you weren't warned).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>-m <replaceable class="parameter">num_rows</replaceable></term>
+ <term>--maxrows=<replaceable class="parameter">num_rows</replaceable></term>
+ <listitem>
+ <para>
+ Set maximum number of rows to put into one COPY statement.
+ This starts new COPY command after every
+ <replaceable class="parameter">num_rows</replaceable>.
+ This is useful on large tables to avoid restoring whole table in
+ one transaction which may consume lot of resources.
</para>
</listitem>
</varlistentry>
Index: src/bin/pg_dump/pg_dump.c
===================================================================
RCS file: /opt/cvs/pgsql/pgsql/src/bin/pg_dump/pg_dump.c,v
retrieving revision 1.236
diff -u -c -r1.236 pg_dump.c
*** src/bin/pg_dump/pg_dump.c 28 Oct 2001 06:25:58 -0000 1.236
--- src/bin/pg_dump/pg_dump.c 11 Dec 2001 04:48:42 -0000
***************
*** 116,121 ****
--- 116,123 ----
bool dataOnly;
bool aclsSkip;

+ int g_max_copy_rows = 0;
+
char g_opaque_type[10]; /* name for the opaque type */

/* placeholders for the delimiters for comments */
***************
*** 151,156 ****
--- 153,159 ----
" -h, --host=HOSTNAME database server host name\n"
" -i, --ignore-version proceed even when server version mismatches\n"
" pg_dump version\n"
+ " ­m, --maxrows=NUM max rows in one COPY command\n"
" -n, --no-quotes suppress most quotes around identifiers\n"
" -N, --quotes enable most quotes around identifiers\n"
" -o, --oids include oids in dump\n"
***************
*** 187,192 ****
--- 190,196 ----
" pg_dump version\n"
" -n suppress most quotes around identifiers\n"
" -N enable most quotes around identifiers\n"
+ " ­m NUM max rows in one COPY command\n"
" -o include oids in dump\n"
" -O do not output \\connect commands in plain\n"
" text format\n"
***************
*** 244,249 ****
--- 248,255 ----
int ret;
bool copydone;
char copybuf[COPYBUFSIZ];
+ int cur_row;
+ int linestart;

if (g_verbose)
write_msg(NULL, "dumping out the contents of table %s\n", classname);
***************
*** 297,302 ****
--- 303,310 ----
else
{
copydone = false;
+ linestart = 1;
+ cur_row = 0;

while (!copydone)
{
***************
*** 310,316 ****
--- 318,338 ----
}
else
{
+ /*
+ * Avoid too large transactions by breaking them up.
+ */
+ if (g_max_copy_rows > 0 && linestart
+ && cur_row >= g_max_copy_rows)
+ {
+ cur_row = 0;
+ archputs("\\.\n", fout);
+ archprintf(fout, "COPY %s %sFROM stdin;\n",
+ fmtId(classname, force_quotes),
+ (oids && hasoids) ? "WITH OIDS " : "");
+ }
+
archputs(copybuf, fout);
+
switch (ret)
{
case EOF:
***************
*** 318,325 ****
--- 340,350 ----
/* FALLTHROUGH */
case 0:
archputc('\n', fout);
+ cur_row++;
+ linestart = 1;
break;
case 1:
+ linestart = 0;
break;
}
}
***************
*** 696,701 ****
--- 721,727 ----
{"compress", required_argument, NULL, 'Z'},
{"help", no_argument, NULL, '?'},
{"version", no_argument, NULL, 'V'},
+ {"maxrows", required_argument, NULL, 'm'},

/*
* the following options don't have an equivalent short option
***************
*** 748,756 ****
}

#ifdef HAVE_GETOPT_LONG
! while ((c = getopt_long(argc, argv, "abcCdDf:F:h:inNoOp:RsS:t:uU:vWxX:zZ:V?", long_options, &optindex)) != -1)
#else
! while ((c = getopt(argc, argv, "abcCdDf:F:h:inNoOp:RsS:t:uU:vWxX:zZ:V?-")) != -1)
#endif

{
--- 774,782 ----
}

#ifdef HAVE_GETOPT_LONG
! while ((c = getopt_long(argc, argv, "abcCdDf:F:h:im:nNoOp:RsS:t:uU:vWxX:zZ:V?", long_options, &optindex)) != -1)
#else
! while ((c = getopt(argc, argv, "abcCdDf:F:h:im:nNoOp:RsS:t:uU:vWxX:zZ:V?-")) != -1)
#endif

{
***************
*** 798,803 ****
--- 824,833 ----

case 'i': /* ignore database version mismatch */
ignore_version = true;
+ break;
+
+ case 'm':
+ g_max_copy_rows = atoi(optarg);
break;

case 'n': /* Do not force double-quotes on

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jan Wieck 2001-12-11 15:34:44 Re: pg_dump: Sorted output, referential integrity
Previous Message Thomas Lockhart 2001-12-11 15:05:12 Re: Explicit configuration file