| From: | Ron Johnson <ronljohnsonjr(at)gmail(dot)com> |
|---|---|
| To: | pgsql-general <pgsql-general(at)postgresql(dot)org> |
| Subject: | Re: scaling up from t1n to 60 million records |
| Date: | 2026-05-19 20:18:04 |
| Message-ID: | CANzqJaBAf+Kf80dDXv-VvKpq2r1Gnwoxq3pQP1Fi+y+PJCbKDQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-general |
Indices are your friend! (Except when loading data.)
Add them on any relevant column.
https://www.postgresql.org/docs/16/sql-createindex.html
The "(expression)" clause might be useful in your situation, since it can
exclude some words from an index, exclude empty cells, index upper-case
versions of the word, etc.
On Tue, May 19, 2026 at 2:53 PM Martin Mueller <
martinmueller(at)northwestern(dot)edu> wrote:
> Here is a more detailed version.
>
> I work on the curation of a corpus of some 65,000 Early Modern texts with
> 1.5 billion words. They exist as TEI-XML files and each word is wrapped
> in a <w> element. Here are the first and last two words in the corpus
>
> <w lemma="here" pos="av" xml:id="a73abc-001-b-0010">HEre</w>
>
>
> <w lemma="begin" pos="vvz" reg="beginneth" xml:id="a73abc-001-b-0020">begynneth</w>
>
> ...
>
> <w lemma="mercy" pos="n1" xml:id="e20ady-0008-3120">mercy</w>
>
> <pc unit="sentence" xml:id="e20ady-0008-3130">.</pc>
>
>
> The corpus has many corrupt spellings and errors in the linguistic
> annotation. Most of them are low-frequency phenomena and occur in no more
> than 64 documents. In nearly all cases you have enough evidence to correct
> a word or its annotation if you can see the word in the middle of a text
> string that includes
>
>
> 1. the spelling
> 2. the lemma
> 3. the part of speech tag
> 4. a standard spelling (e.g. 'loue' for 'love')
> 5. up to seven previous words
> 6. up to seven next words
> 7. the spelling and POS tag of the previous
> 8. the lemma and POS tag of the next word
> 9. the Xpath of the current work
>
>
> My goal is to involve users of the corpus in identifying and correcting
> corrupt readings. I call this a "philological shopping cart" since the
> offering of a correction can be thought of as a sale. Instead of buying
> something, with the machine registering the who, what, when, and where of
> the purchase, I offer an emendation, with the machine registering the who,
> what, when, and where of my emendation.
>
> My hunch is that it would not be particularly difficult to build such a
> philological shopping cart and that in terms of scale it would not be a big
> thing.
>
> I am trying to mirror that "shopping cart" on my Mac. There are about 60
> million word occurrences that occur in no more than 64 texts. The basic
> table has the columns described above, and half a dozen other columns for
> data entry and various counts. There are some helper tables. The most
> important of them is a simple case-insensitive list of spellings with their
> document frequencies. This is very useful for finding suspect spellings
> with queries like "show me all spellings in a low frequency range that
> contain 'tb' and look for words where replacing 'tb' with 'th' will find a
> word with a higher document frequency. That picks up spellings like 'tbe',
> 'tbat', 'autboritie', etc.
>
> I've worked with KWIC tables of this kind for several years. I have Aqua
> Data Studio as a front end for Postgres, currently version 17, running on a
> five-year old Mac with an Intel processor and 32 GB of memory. I know a
> lot less about the innards of a SQL database than I should.
>
> My largest kwic table has about 15 million rows with dozen columns for
> each row. Except for the left and right context, the columns consists of
> single words or numbers. The left and right context columns rarely add up
> to more than 35 characters each.. I have used plain indexes for some
> columns, with commands like "Create index on kwics16(keyword)", where
> 'kwics16' is the table name. My typical routine takes a single-user
> interactive form: ask a query, wait for the results (typically seconds,
> sometimes a minute or more), and do something with the results. I know
> next to nothing about the size of the database or tables, and it's not
> something Ihave needed to worry about. There are occasional memory bottle
> necks, because Aqua Data Studio isn't particularly good at release memory
> once it's no longer used. Closing and reopening the client fixes that.
>
> It takes an hour or so to upload a table of this kind into the database.
> Several tables of that size exist on my database and don't cause any
> trouble. I don't know at what point I would be running into constraints of
> an aging Mac with 32 GB of memory and a 2 TB hard drive.
>
> I could comfortably live with what I'm doing now, dividing the data into
> three or four frequency ranges.
>
> Given this information, should I try and create a single table or am I
> likely to run into serious constraints if I move beyond my current maximum
> table size of 15 million records.
>
> Perhaps there is no clear answer, and I should just experiment. But if
> any reader with more knowledge of Postgres thinks that in my environment I
> would be skating on thin ice if I move beyond current limits, I'd be
> grateful to be told so.
>
>
>
>
>
>
>
>
>
>
> Martin Mueller
> Professor emeritus of English and Classics
> Northwestern University
> *From: *Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
> *Date: *Tuesday, May 19, 2026 at 09:45
> *To: *Martin Mueller <martinmueller(at)northwestern(dot)edu>;
> pgsql-general(at)postgresql(dot)org <pgsql-general(at)postgresql(dot)org>
> *Subject: *Re: scaling up from t1n to 60 million records
>
> On 5/19/26 7:27 AM, Martin Mueller wrote:
> > I use Postgres with a GUI frontend (Aquafold) as a very large
> > spreadsheet on steroids that analyzes rare or defective spellings in a
> > corpus of 65,000 texts and1.5 billion words. I typically extract data
> > from the corpus with python scripts, turn them into tables and load them
> > into the database.
> >
> >
> > On my Mac with 32 GB of memory performance is OK with queries that
> > typically within seconds extract data rows from tables with up to ten
> > million rows. If the result set is large, I suspect that most of time
> > machine's time is spent displaying result sets. I have used indexing
> > sparingly. While it helps, the time savings often don't matter much.
>
> This is going to need more information:
>
> 1) Postgres version.
>
> 2) The table schema including indexes.
>
> 3) An example of the query.
>
> 4) Where you are measuring the time.
>
> 5) The client you are displaying the results in.
>
> >
> >
> > I am thinking about scaling up to table with about 60 million rows. Are
> > there things to do or watch out for? Or should I proceed on the
> > assumption that that 60 million records are within scope and that the
> > added timecost is roughly linear?
> >
> > Martin Mueller
> >
> > Professor emeritus of English and Classics
> >
> > Northwestern University
> >
>
>
> --
> Adrian Klaver
> adrian(dot)klaver(at)aklaver(dot)com
>
--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Ertan Küçükoglu | 2026-05-20 07:17:57 | Large backup size of pg_dump |
| Previous Message | Martin Mueller | 2026-05-19 18:52:57 | Re: scaling up from t1n to 60 million records |