Quick Links

Performance problem with semi-large tables

From:	"Ken Egervari" <ken(at)upfactor(dot)com>
To:	<pgsql-performance(at)postgresql(dot)org>
Subject:	Performance problem with semi-large tables
Date:	2005-01-27 02:17:23
Message-ID:	001601c50416$57090e10$cd422418@a96dfxb4kjzogw
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

Hi everyone.

I'm new to this forum and was wondering if anyone would be kind enough to help me out with a pretty severe performance issue. I believe the problem to be rather generic, so I'll put it in generic terms. Since I'm at home and not a work (but this is really bugging me), I can't post any specifics. However, I think my explaination will suffice.

I have a 2 tables that are are getting large and will only get larger with time (expoentially as more users sign on to the system). Right the now, a table called 'shipment' contains about 16,000 rows and 'shipment_status' contains about 32,500 rows. These aren't massive rows (I keep reading about tables with millions), but they will definately get into 6 digits by next year and query performance is quite poor.

Now, from what I can understand about tuning, you want to specify good filters, provide good indexes on the driving filter as well as any referencial keys that are used while joining. This has helped me solve performance problems many times in the past (for example, changing a query speed from 2 seconds to 21 milliseconds).

However, I am now tuning queries that operate on these two tables and the filters aren't very good (the best is a filter ratio of 0.125) and the number of rows returned is very large (not taking into consideration limits).

For example, consider something like this query that takes ~1 second to finish:

select s.*, ss.*
from shipment s, shipment_status ss, release_code r
where s.current_status_id = ss.id
and ss.release_code_id = r.id
and r.filtered_column = '5'
order by ss.date desc
limit 100;

Release code is just a very small table of 8 rows by looking at the production data, hence the 0.125 filter ratio. However, the data distribution is not normal since the filtered column actually pulls out about 54% of the rows in shipment_status when it joins. Postgres seems to be doing a sequencial scan to pull out all of these rows. Next, it joins approx 17550 rows to shipment. Since this query has a limit, it only returns the first 100, which seems like a waste.

Now, for this query, I know I can filter out the date instead to speed it up. For example, I can probably search for all the shipments in the last 3 days instead of limiting it to 100. But since this isn't a real production query, I only wanted to show it as an example since many times I cannot do a filter by the date (and the sort may be date or something else irrelavant).

I'm just stressed out how I can make queries like this more efficient since all I see is a bunch of hash joins and sequencial scans taking all kinds of time.

I guess here are my 2 questions:

1. Should I just change beg to change the requirements so that I can make more specific queries and more screens to access those?
2. Can you recommend ways so that postgres acts on big tables more efficiently? I'm not really interested in this specific case (I just made it up). I'm more interested in general solutions to this general problem of big table sizes with bad filters and where join orders don't seem to help much.

Thank you very much for your help.

Best Regards,
Ken Egervari

Responses

Re: Performance problem with semi-large tables at 2005-01-29 21:51:10 from Josh Berkus
Re: Performance problem with semi-large tables at 2005-01-29 22:08:58 from PFC

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Tom Lane	2005-01-27 02:31:20	Re: Should the optimizer see this?
Previous Message	Ron Mayer	2005-01-27 01:27:59	Should the optimizer see this?