Overview:
I have three tables 1) subscribers, bios, and shirtsizes and i need to find the subscribers without a bio or shirtsizes
the tables are laid out such as
subscribers
| season_id | user_id |
bio
| bio_id | user_id |
shirt sizes
| bio_id | shirtsize |
And I need to find all users who do not have a bio or shirtsize, (if no bio; then no shirtsize via relation) for any given season.
I originally wrote a query like:
SELECT *
FROM subscribers s
LEFT JOIN bio b ON b.user_id = subscribers.user_id
LEFT JOIN shirtsizes ON shirtsize.bio_id = bio.bio_id
WHERE s.season_id = 185181 AND (bio.bio_id IS NULL OR shirtsize.size IS NULL);
but it is taking 10 seconds to complete now.
I am wondering how I can restructure the query (or possibly the problem) so that it will preform reasonably.
Here is the mysql explain: (ogu = subscribers, b = bio, tn = shirtshize)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+-------------+--------+-------------+
| 1 | SIMPLE | ogu | ref | PRIMARY | PRIMARY | 4 | const | 133 | Using where |
| 1 | SIMPLE | b | index | NULL | PRIMARY | 8 | NULL | 187644 | Using index |
| 1 | SIMPLE | tn | ref | nid | nid | 4 | waka2.b.nid | 1 | Using where |
The above is pretty sanitized, here's the realz info:
mysql> DESCRIBE subscribers
+-----------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------+------+-----+---------+-------+
| subscribers | int(11) | NO | PRI | | |
| uid | int(11) | NO | PRI | | |
mysql> DESCRIBE bio;
+-------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------------+------+-----+---------+-------+
| bio_id | int(10) unsigned | NO | PRI | 0 | |
| uid | int(10) unsigned | NO | PRI | 0 | |
mysql> DESCRIBE shirtsize;
+-------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------------+------+-----+---------+-------+
| bio_id | int(10) unsigned | NO | PRI | 0 | |
| shirtsize | int(10) unsigned | NO | PRI | 0 | |
and the real query looks like:
SELECT ogu.nid, ogu.is_active, ogu.uid, b.nid AS bio_node, tn.nid AS size
FROM og_uid ogu
LEFT JOIN bio b ON b.uid = ogu.uid
LEFT JOIN term_node tn ON tn.nid = b.nid
WHERE ogu.nid = 185033 AND ogu.is_admin = 0
AND (b.nid IS NULL OR tn.tid IS NULL)
nid is season_id or bio_id (with a type); term_node is going to be the shirtsize
-
The query should be OK. I would run it through a query analyzer and refine the indexes on the tables.
jskulski : I guess this is what I get for trusting contributed modules to have correct index. Thanks! -
If you define what you are looking for exactly rather than SELECT * it might speed it up a bit... also OR is not the fastest query to be doing, if you can re-write it without the OR it will be faster.
Also... you could try unions instead of left joins maybe?
SELECT s.user_id FROM subscribers s LEFT JOIN bio b ON b.user_id = s.user_id LEFT JOIN shirtsizes ON shirtsize.bio_id = bio.bio_id WHERE s.season_id = 185181 AND (bio.bio_id IS NULL OR shirtsize.size IS NULL);
would be something like:
(SELECT s.user_id FROM subscribers s WHERE s.season_id = 185181) UNION (SELECT b.user_id, b.bio_id FROM bio b WHERE bio.bio_id IS NULL) UNION (SELECT shirtsizes.bio_id FROM shirtsizes WHERE shirtsizes.size is NULL)
(to be honest that doesn't look right to me... but then I never use
joins orjoin syntax or unions...)I would do:
SELECT * FROM subscribers s, bio b, shirtsizes sh WHERE s.season_id = 185181 AND shirtsize.bio_id = bio.bio_id AND b.user_id = s.user_id AND (bio.bio_id IS NULL OR shirtsize.size IS NULL);HLGEM : "I never use joins or unions" Of course you do use joins. FROM subscribers s, bio b, shirtsizes sh WHERE s.season_id = 185181 AND shirtsize.bio_id = bio.bio_id AND b.user_id = s.user_id is a set of joins. It is is using the outdated style of join but it is still a join.SeanJA : I guess it would have been more correct to write 'I avoid using the join syntax and I never use unions' -
Joins are one of the most expensive operations that you can perform on an SQL query. While it should be able to automatically optimize your query somewhat, maybe try restructuring it. First of all, I would instead of SELECT *, be sure to specify which columns you need from which relations. This will speed things up quite a bit.
If you only need the user ID for example:
SELECT s.user_id FROM subscribers s LEFT JOIN bio b ON b.user_id = subscribers.user_id LEFT JOIN shirtsizes ON shirtsize.bio_id = bio.bio_id WHERE s.season_id = 185181 AND (bio.bio_id IS NULL OR shirtsize.size IS NULL);That will allow the SQL database to restructure your query a little more efficiently on its own.
Jonathan Leffler : Regular joins are moderately expensive; outer joins are more expensive; the typical correlated sub-query is diabolical.jskulski : I posted the real query and am only grabbing a few columns, but thanks. -
Is
bio_idthe primary key of bios? Is it really possible for there to be a bios row withb.user_id=subscribers.user_idbut withb.bio_idNULL?Are there shirtsize rows with
shirtsize.bio_idNULL? Do those rows ever have shirtsize.size not NULL? -
Obviously I haven't checked this but it seems to be that what you want is to select any subscriber where there there isn't a matching bio or the join between bios and shirtsizes fails. I would consider using NOT EXISTS for this condition. You'll probably want indexes on bio.user_id and shirtsizes.bio_id.
select * from subscribers where s.season_id = 185181 and not exists (select * from bio join shirtsizes on bio.bio_id = shirtsizes.bio_id where bio.user_id = subscribers.user_id)EDIT:
Based on your update, you may want to create separate keys on each column instead of/in addition to having compound primary keys. It's possible that the joins aren't able to take optimal advantage of the compound primary indexes and an index on the join columns themselves may speed things up.
Dems : Quite a few people have investigated this and it turns out that SELECT * is normally slightly faster than SELECT 1 / SELECT NULL / etc in an EXISTS query. But yes, using EXISTS can often be faster than join, but not always, it's a weapon to try out in each scenario...tvanfosson : Point taken. I've updated the answer. -
Would it be any quicker to do a difference between the list of subscribers for the relevant season and the list of subscribers for the season with bios and shirt sizes?
SELECT * FROM Subscribers WHERE season_id = 185181 AND user_id NOT IN (SELECT DISTINCT s.user_id FROM subscribers s JOIN bios b ON s.user_id = b.user_id JOIN shirtsizes z ON b.bio_id = z.bio_id WHERE s.season_id = 185181 )This avoids outer joins, which are not as fast as inner joins, and may therefore be quicker. On the other hand, it might be creating two large lists with very few differences between them. It is not clear whether the DISTINCT in the sub-query would improve or harm performance. It implies a sort operation (expensive) but paves the way for a merge-join if the MySQL optimizer supports such things.
There might be other notations available - MINUS or DIFFERENCE, for example.
Bill Karwin : There is no reason to use DISTINCT in the subquery. -
select * from subscribers where user_id not in ( select user_id from bio where bio_id not in ( select bio_id from shirt_sizes ) ) and season_id=185181 -
I presume that your "big table" is subscribers, and that season_id is probably neither selective nor indexed (indexing it is rather meaningless if it's not selective, anyway), which means that you'll have to fully scan subscribers, anyway. Parting, I would join (with an inner join) the two other tables - note that if there is no bio_id in shirt_size it's exactly the same for your query as if there were no bio. First bit:
select uid from bio inner join shirtsizes on shirtsizes.bio_id = bio.bio_idAt which point you want to check that shirtsizes is indexed on bio_id. Now you can left outer join this query to subscribers:
select * from subscribers s left outer join (select uid from bio inner join shirtsizes on shirtsizes.bio_id = bio.bio_id) x on x.uid = s.uid where s.season_id = 185181 and x.uid is nullwhich is likely to run reasonably fast if neither bio nor shirtsizes are gigantic ...
-
Your query, as it is written now, evaluates all
bio's andterm_node's if they exist, and then filters them out.But what you want is just find
og_uid's that don't haveterm_node's (not having abioalso implies not having aterm_node)So you just want to stop evaluating
bio's andterm_node's as soon as you find the first existingterm_node:SELECT * FROM ( SELECT ogu.nid, ogu.is_active, ogu.uid, ( SELECT 1 FROM bio b, term_node tn WHERE b.uid = ogu.uid AND tn.nid = b.nid LIMIT 1 ) AS ex FROM og_uid ogu WHERE ogu.nid = 185033 AND ogu.is_admin = 0 ) ogu1 WHERE ex IS NULLThis will evaluate at most one
bioand at most oneterm_nodefor eachog_uid, instead of evaluating all existing thousands and the filtering them out.Should work much faster.
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.