Re: Postgres refusing to use >1 core

From: Aren Cambre <aren(at)arencambre(dot)com>
To: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Postgres refusing to use >1 core
Date: 2011-05-12 03:17:00
Message-ID: BANLkTim9Gg15bQsaNOpoUKZjmyH-+1vNYQ@mail.gmail.com
Lists: pgsql-performance

>
> I suspect your app is doing lots of tiny single-row queries instead of
> efficiently batching things. It'll be wasting huge amounts of time
> waiting for results. Even if every query is individually incredibly
> fast, with the number of them you seem to be doing you'll lose a LOT of
> time if you loop over lots of little SELECTs.
>

So here's what's going on.

I have a table of about 12,000,000 traffic tickets written by the Texas
Department of Public Safety (TxDPS). Each ticket has a route name and a
reference marker. On Interstate highways, reference marker = mile post. On
all other roads, from US highways down to Farm to Market roads, the
reference marker is based on a grid superimposed over the state. Basically
that reference marker increments as the road crosses a grid line, so unless
the road is perfectly N-S or E-W, these reference markers are more than a
mile apart.

I have a separate table with data from the Texas Department of
Transportation (TxDOT). It is a database of almost all the state's reference
markers, along with latitude and longitude for each.
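
In case the shapes matter, the two tables look roughly like this. This is a
simplified sketch, not the real DDL; apart from rte_nm, county_num, rm, and
gid, the column names here are just placeholders for this discussion.

-- Placeholder sketch of the TxDPS tickets table (~12,000,000 rows)
CREATE TABLE txdps_tickets (
    "HA_Arrest_Key" varchar(18) PRIMARY KEY,
    route_name      text,      -- e.g. '0071'
    route_type      smallint,  -- 1-5 per the TxDPS coding described below
    county          text,      -- e.g. '206'
    ref_marker      integer    -- reference marker written on the ticket
);

-- Placeholder sketch of the TxDOT reference marker table
CREATE TABLE txdot_ref_markers (
    gid        integer PRIMARY KEY,  -- unique id of the marker
    rte_nm     text,                 -- e.g. 'US71', 'SH71'
    county_num text,
    rm         integer,              -- reference marker number
    latitude   double precision,
    longitude  double precision
);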

I am trying to geolocate each ticket by correlating the ticket's
route/reference marker to the same in the TxDOT database. And it's not
straightforward for a few reasons:

*1. TxDPS and TxDOT formats are different.*

TxDPS uses 1-5 to denote route type: 1 = Interstate, 2 = US or any state
highway except Farm to Market, 3 = Farm to Market, 4 = county road, 5 =
local road. So if the route name is 0071 and route type is 2, it could mean
US 71 or TX 71, both of which really exist in Texas but in different parts
of the state.

I haven't proven it yet, but it is possible that no two routes of the same
number are in the same county. You wouldn't find both TX 71 and US 71 in the
same county.

For now, I am looking up routes in the TxDOT database by route type, name,
and county, and I may need to repeat the lookup until I get a match.

In the above example, if the ticket is written for route_name = 0071,
route_type = 2, and county = 206, then I need to do searches against the
TxDOT database for:

1. rte_nm = 'US71' AND county_num='206'
2. rte_nm = 'SH71' AND county_num='206'
3. rte_nm = 'UA71' AND county_num='206'
4. rte_nm = 'UP71' AND county_num='206'
5. ...
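
(Per the batching suggestion above, that whole candidate loop could probably
be one round trip. A rough sketch against the placeholder schema earlier;
the candidate list and its preference ordering are just illustrative:)

-- Try every candidate rte_nm at once; keep the first hit in preference order.
SELECT rte_nm
FROM   txdot_ref_markers
WHERE  county_num = '206'
  AND  rte_nm IN ('US71', 'SH71', 'UA71', 'UP71')
ORDER  BY CASE rte_nm
            WHEN 'US71' THEN 1
            WHEN 'SH71' THEN 2
            WHEN 'UA71' THEN 3
            WHEN 'UP71' THEN 4
          END
LIMIT  1;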

*2. Not all TxDPS reference markers correspond to TxDOT reference markers.*

Now, if I've matched a route, I have to find the reference marker.

The TxDOT database is pretty good but not 100% complete, so some TxDPS
tickets' reference markers may not exist in the TxDOT table. Plus, it's
possible that some TxDPS tickets have the wrong marker.

To compensate, I am looking for the closest reference marker along the route
that is not more than 50 marker units away, in either direction. I've again
implemented that with multiple queries, where I don't stop until I find a
match. Suppose I am searching for reference marker 256 on TX 71. The queries
will be like this:

1. rte_nm = 'SH71' AND rm = '256' (base marker)
2. rte_nm = 'SH71' AND rm = '257' (+1)
3. rte_nm = 'SH71' AND rm = '255' (-1)
4. rte_nm = 'SH71' AND rm = '258' (+2)
5. rte_nm = 'SH71' AND rm = '254' (-2)
6. ...
7. rte_nm = 'SH71' AND rm = '306' (+50)
8. rte_nm = 'SH71' AND rm = '206' (-50)

Assuming a matching route name was found in the prior step, the app will
issue anywhere from 1 to 101 of these queries for each ticket.
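
(Same idea here: the +/-50 walk can be a single ordered query instead of up
to 101 probes. A rough sketch, assuming rm is numeric or can be cast:)

-- Closest marker to 256 on SH71, no more than 50 marker units away.
SELECT gid, rm, abs(rm - 256) AS distance
FROM   txdot_ref_markers
WHERE  rte_nm = 'SH71'
  AND  rm BETWEEN 256 - 50 AND 256 + 50
ORDER  BY abs(rm - 256)
LIMIT  1;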

Assuming steps 1 and 2 above worked out, now I have a reference marker. So I
write to a third table that has four columns:

1. *HA_Arrest_Key* (varchar(18) that refers back to the TxDPS tickets
table)
2. *gid* (integer that refers to the unique identity of the reference
marker in the TxDOT table)
3. *distance* (integer that is the distance, in reference markers,
between that noted in the TxDPS ticket and the nearest marker found in the
TxDOT table)
4. *hasLatLong* (Boolean that is true if TxDPS also recorded latitude and
longitude for the ticket, presumably from an in-car device. These don't
appear to be that accurate, plus a substantial portion of tickets have no
lat/long.)
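
In DDL terms, roughly (the table name is a placeholder; types are as
described above):

CREATE TABLE ticket_markers (
    "HA_Arrest_Key" varchar(18),  -- refers back to the TxDPS ticket
    gid             integer,      -- refers to the TxDOT marker row
    distance        integer,      -- marker units between ticket and match
    "hasLatLong"    boolean       -- true if TxDPS recorded lat/long
);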

Right now, I am doing a separate INSERT for each of the 12,000,000 rows
inserted into this table.
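
A batched version would presumably look something like this (multi-row
VALUES, or COPY for the whole set); the key values below are made up for
illustration:

INSERT INTO ticket_markers ("HA_Arrest_Key", gid, distance, "hasLatLong")
VALUES
    ('AR1234567890123456', 101, 0, true),
    ('AR1234567890123457', 102, 3, false),
    ('AR1234567890123458', 103, 1, true);

-- Or stream everything in one shot:
-- COPY ticket_markers FROM STDIN WITH (FORMAT csv);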

I guess the app is chatty like you suggest? HOWEVER, if I am reading system
activity correctly, the master thread that is going through the 12,000,000
tickets appears to have its own Postgres process, and based on how quickly
RAM usage shoots up during the first ~60 seconds or so that the app runs, it
may be reading all these rows into memory. But I am consulting with the
Npgsql developers separately to make sure I am really understanding this
correctly. They suspect that the PLINQ stuff (basically "multithreading in a
can") may not be dispatching threads as expected because it may be
misreading the situation.

> By using a producer/consumer model like that you can ensure that thread
> 1 is always talking to the database, keeping Pg busy, and thread 2 is
> always working the CPUs.

Thanks for the example and illustration.

> ... or you can have each worker fetch its own chunks of rows (getting
> rid of the producer/consumer split) using its own connection and just
> have lots more workers to handle all the wasted idle time. A
> producer/consumer approach will probably be faster, though.
>

That's what PLINQ is *supposed* to do. In theory. :-) Working with Npgsql
folks to see if something is tripping it up.
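
(For the chunked-worker variant, my understanding in SQL terms is that each
worker pages through its own slice of the tickets table on its own
connection; a rough sketch with the placeholder columns from earlier, where
$1/$2 bound the worker's key range and $3 is the last key it processed:)

SELECT "HA_Arrest_Key", route_name, route_type, county, ref_marker
FROM   txdps_tickets
WHERE  "HA_Arrest_Key" >= $1
  AND  "HA_Arrest_Key" <  $2
  AND  "HA_Arrest_Key" >  $3
ORDER  BY "HA_Arrest_Key"
LIMIT  5000;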

Aren
