MySQL - Query Analysis

A year or so after a system has gone live, you generally start to hear users say that reports are taking longer and the system feels a little sluggish. In this section we will take a look at how to address these issues, which generally all relate to concurrency, that is, more users using the system. The usual causes are:

The query cache is not being utilized properly
The query contains subqueries or unoptimized subqueries
The table contains large amounts of unnecessary data
The table is fragmented
The schema was not designed for the queries being run
The queries being run do not take into consideration the schema design
Tables have no indexes that are appropriate for the query
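A couple of the causes above can be checked directly from the server. The statements below are a minimal sketch, assuming you are connected to the schema you want to inspect and that it contains the users table used later in this section. The data_free figure is only an approximation of unused space, and optimize table rebuilds the table, so it is best tried on a test copy first.

```sql
-- Rough size and fragmentation overview for the current schema
-- (data_free is an estimate reported by the storage engine)
SELECT table_name, engine, table_rows, data_length, index_length, data_free
FROM information_schema.tables
WHERE table_schema = DATABASE()
ORDER BY data_free DESC;

-- List the indexes that exist on a table (users is the table from the explain example)
SHOW INDEX FROM users;

-- Rebuild a fragmented table; this can take a while on large tables
OPTIMIZE TABLE users;
```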

Generally, systems are first set up at a very fast pace. Application developers, SQL developers and system administrators are all preoccupied with their own environments, and it is not until the system has gone live and multiple users start using it that concurrency problems appear. There are tools that can test concurrency on your system before go-live, but with targets and deadlines to meet they are rarely used to their full potential.

The other issue that most developers forget is the amount of data that will be generated. This is often underestimated at development time, and a large volume of data combined with a poorly chosen set of indexes can slow a system down dramatically.

I have a couple of other sections that also go into detail on tuning and system performance.

Explain

Explain is one of those tools that helps a SQL developer see inside the optimizer. It shows the method the optimizer chose to retrieve the data you requested. Equivalent tools are available in other databases, and it is the tool most widely used for tuning SQL statements. In MySQL, explain gives you the following details:

How many tables are involved
How the tables are joined
How the data is looked up
If there are subqueries or unions
If distinct is used
If a where clause is used
If a temporary table is used
Possible index usage
Actual index usage
Length of the indexes used
Approximate number of records returned
If sorting requires an extra pass through the data

To use explain, simply type explain at the beginning of your SQL statement

explain

mysql> explain select * from users\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: users
type: index
possible_keys: NULL
key: users_idx
key_len: 1534
ref: NULL
rows: 10249
Extra: Using index
1 row in set (0.00 sec)

When you run the explain you will get back a number of values

id	is the sequential identifier of the select that the row belongs to (rows for tables joined within the same select share an id); each row represents a physical table, subquery, temporary table or derived table.
select_type	is the type of select represented by the row; most often it will be simple, other values include union, union result and primary.
table	shows the table alias that the row refers to; if there is no alias then the name of the table is displayed.
type	the data access strategy is displayed in this field; all the possible values are below, starting with the slowest and going to the fastest
	all - full scan of the entire table
	index - full scan of the entire index
	range - partial scan of the index
	index_subquery - subquery using a nonunique index
	unique_subquery - subquery using a unique index
	index_merge - more than one index is used to perform multiple scans
	ref_or_null - more than one record may be looked up for each set of results from the previous rows in the explain plan, including records whose value is NULL
	fulltext - uses MySQL's fulltext search
	ref - more than one record may be looked up for each record being joined, or for each set of results from the previous rows in the explain plan
	eq_ref - fewer than two records are looked up for each record being joined from the previous rows in the explain plan
	const - fewer than two records are looked up from a nonsystem table
	system - fewer than two records are looked up from a system table
	NULL - the data is not looked up using a table
possible_keys	displays the indexes the optimizer considered using to satisfy the data filters, that is, the where clause and the join conditions; if this is NULL then there were no indexes that could have been used.
key	shows which index the query optimizer actually used to satisfy the query; sometimes several indexes are used. The optimizer may decide that a full table scan is quicker, in which case this field is NULL even though indexes are listed in the possible_keys field.
key_len	shows the length, in bytes, of the key that was used; queries that use indexes can sometimes be optimized further by making the indexed columns smaller.
ref	shows what is compared to the index; for a range of values or a full table scan ref is NULL.
rows	gives the approximate number of records examined for this row; the number is based on metadata, which may or may not be accurate depending on the storage engine.
extra	this last field is a catch-all that shows good, neutral and bad information about a query plan
	no tables used - no table, temporary table, view or derived table will be used
	impossible where noticed after reading const tables - there is no satisfactory value
	const row not found - there is no satisfactory value
	using where - there is a filter for comparison or joining
	using intersection - examines indexes in parallel in an index_merge data access strategy, then performs an intersection of the result sets
	using union - examines indexes in parallel in an index_merge data access strategy, then performs a union of the result sets
	using sort_union - examines indexes in parallel in an index_merge data access strategy by fetching all record IDs and sorting them, then performs a union of the result sets
	using index - only data from the index is needed, there is no need to retrieve a data record
	using index for group-by - only data from the index is needed to satisfy a group by or distinct
	using index condition - accesses an index value, testing the part of the filter that involves the index
	using MRR - uses the Multi-Range Read optimization
	using join buffer - table records are put into a join buffer, then the buffer is used for joining
	distinct - stops looking after the first matched record for this row
	not exists - used in outer joins where one lookup is sufficient for each record being joined
	range checked for each record - no good index could be found, but there might be one once the preceding rows (tables) have values
	select tables optimized away - metadata or an index can be used, so no tables are necessary; one record is returned
	using where with pushed condition - the cluster "pushes" the condition from the SQL nodes down to the data nodes
	using temporary - needs to use a temporary table for intermediate values
	using filesort - needs to pass through the result set an extra time for sorting
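To get a feel for the type field, it helps to run explain against a few different where clauses on the same table. The sketch below uses a hypothetical table, demo_customers, which is my own invention and not part of any real schema. The comments say which access type each query is a candidate for; the plan actually chosen depends on the data volume, the statistics and the MySQL version, so treat them as expectations to check rather than guaranteed output.

```sql
-- Hypothetical table for illustration only
CREATE TABLE IF NOT EXISTS demo_customers (
  id      INT          NOT NULL PRIMARY KEY,
  email   VARCHAR(100) NOT NULL,
  country CHAR(2)      NOT NULL,
  created DATE         NOT NULL,
  UNIQUE KEY demo_customers_email (email),
  KEY demo_customers_country (country)
);

-- Lookup of a single value on the primary key: candidate for const
EXPLAIN SELECT * FROM demo_customers WHERE id = 42;

-- Lookup on a nonunique index: candidate for ref
EXPLAIN SELECT * FROM demo_customers WHERE country = 'GB';

-- Bounded scan of an index: candidate for range
EXPLAIN SELECT * FROM demo_customers WHERE id BETWEEN 10 AND 20;

-- Filter on a column with no index: candidate for all (full table scan)
EXPLAIN SELECT * FROM demo_customers WHERE created > '2010-01-01';
```

Compare the possible_keys, key and key_len fields between the four plans as well as the type field.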

Explain handles subqueries differently from ordinary queries and its output highlights the differences; the biggest difference is the set of select_type values that are used to describe subqueries

primary	outermost query when using subquery
derived	select subquery in from clause
subquery	first select in a subquery
dependent subquery	first select in a dependent subquery
uncacheable subquery	subquery result cannot be cached, must be evaluated for every record
dependent union	second or later select statement in a union that is used in a dependent subquery
uncacheable union	second or later select statement in a union that is used in an uncacheable subquery; it cannot be cached and must be evaluated for every record
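The select_type values above are easier to recognise once you have seen them in a plan. The following is a small sketch against a hypothetical demo_orders table (the table and its columns are assumptions made for this example). Bear in mind that newer MySQL versions may rewrite or merge subqueries, in which case the select_type shown can differ from what the comments suggest.

```sql
-- Hypothetical table for illustration only
CREATE TABLE IF NOT EXISTS demo_orders (
  id          INT           NOT NULL PRIMARY KEY,
  customer_id INT           NOT NULL,
  total       DECIMAL(10,2) NOT NULL,
  KEY demo_orders_customer (customer_id)
);

-- Subquery in the from clause: the inner select is a candidate for derived,
-- the outer select for primary
EXPLAIN
SELECT t.customer_id, t.order_count
FROM (SELECT customer_id, COUNT(*) AS order_count
      FROM demo_orders
      GROUP BY customer_id) AS t
WHERE t.order_count > 5;

-- Correlated subquery (it refers to the outer row through o.customer_id):
-- the inner select is a candidate for dependent subquery
EXPLAIN
SELECT o.id, o.total
FROM demo_orders AS o
WHERE o.total = (SELECT MAX(o2.total)
                 FROM demo_orders AS o2
                 WHERE o2.customer_id = o.customer_id);
```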

Explain can be extended (explain extended) to provide two sets of additional information. The first is an additional field called filtered, which shows the approximate percentage of the rows examined that will remain after the table conditions have been applied. The second is obtained by running show warnings immediately after the explain extended; the result has a Message field which displays the query as rewritten by the optimizer.
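As a quick sketch of the two steps, using the users table from the earlier example: run the two statements one after the other in the same session, because show warnings only reports on the statement that was run immediately before it. Note that the extended keyword belongs to older releases; it was removed in MySQL 8.0, where a plain explain already includes the filtered field and can be followed by show warnings in the same way.

```sql
-- Older MySQL releases: ask for the extended plan, which adds the filtered field
EXPLAIN EXTENDED SELECT * FROM users\G

-- Immediately afterwards: the Message field holds the query as rewritten by the optimizer
SHOW WARNINGS\G
```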

Optimizing Queries

Now that you know how to look at your queries, it is time to optimize them. When running explain, the two fields of most importance are type and extra. What you are trying to achieve is the fastest type strategy you possibly can, and you do this by creating indexes and improving the SQL join statements.

So why don't we just create indexes for everything we need? Indexes can be a performance boost but they can also slow a system down: reads will generally improve, but writes will slow down because both the table and the index have to be written. The optimizer is also clever enough to know that sometimes a full table scan is better than going to the index. Remember that accessing data through an index takes two hops if the data is not all in the index: the first hop is the lookup in the index to obtain the row reference, the second hop is the table access using the information obtained from the index. As a rough guide, when the data is not in the index, a query has to touch less than about 20-30 percent of the table's rows before the optimizer will choose the index rather than a full table scan. This can be overridden by using optimizer hints, which are discussed next.
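The two hops are easiest to see by comparing plans before and after adding an index. The sketch below uses a hypothetical demo_logins table (names and columns are made up for illustration; note that it drops and recreates that table so the script can be rerun). When every column the query needs is in the index, extra is a candidate for using index, meaning the second hop to the table is skipped; selecting a column that is not in the index brings the second hop back.

```sql
-- Hypothetical table for illustration only
DROP TABLE IF EXISTS demo_logins;
CREATE TABLE demo_logins (
  id         INT         NOT NULL AUTO_INCREMENT PRIMARY KEY,
  username   VARCHAR(50) NOT NULL,
  login_time DATETIME    NOT NULL,
  ip         VARCHAR(45) NOT NULL
);

-- Before: no index on username, so this is a candidate for a full table scan
EXPLAIN SELECT username, login_time FROM demo_logins WHERE username = 'alice';

-- Add an index that contains both columns the query uses
CREATE INDEX demo_logins_user_time ON demo_logins (username, login_time);

-- After: both columns are in the index, so the table itself need not be read
EXPLAIN SELECT username, login_time FROM demo_logins WHERE username = 'alice';

-- ip is not in the index, so the row must be fetched from the table (second hop)
EXPLAIN SELECT username, login_time, ip FROM demo_logins WHERE username = 'alice';
```

The price of the index is paid on every insert, update and delete against demo_logins, which is why indexing everything is not the answer.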

Optimizer hints can include

specifying the join order of the tables with straight_join
specifying indexes to ignore using ignore index or ignore key
giving extra weight to indexes with use index or use key
specifying index to use with force index or force key
changing the value of optimizer_prune_level; a value of 1 limits the number of query plans examined based on the number of rows returned, a value of 0 does not limit the number of query plans examined
changing the value of optimizer_search_depth, which controls how many data access plans the optimizer considers; a lower number means the optimizer spends less time but may produce a suboptimal query plan, a higher number means that more data access plans are examined but this takes more time; the default is 62
changing the value of optimizer_use_mrr; the default is force, which means that the Multi-Range Read access method is used whenever possible
setting a low value for max_seeks_for_key; this value is the maximum number of seeks the query optimizer assumes an index search will need. The default is large, to allow the optimizer to use index cardinality statistics to estimate the number of seeks.
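The hints in the list above look like this in practice. This is a sketch only: demo_a and demo_b are hypothetical tables, and the values given to the session variables are arbitrary examples rather than recommendations. The set of optimizer variables differs between MySQL versions, so the last statement lists what your server actually supports before you rely on any of them.

```sql
-- Hypothetical tables for illustration only
CREATE TABLE IF NOT EXISTS demo_a (
  id   INT NOT NULL PRIMARY KEY,
  b_id INT NOT NULL,
  KEY demo_a_b (b_id)
);
CREATE TABLE IF NOT EXISTS demo_b (
  id   INT         NOT NULL PRIMARY KEY,
  name VARCHAR(50) NOT NULL,
  KEY demo_b_name (name)
);

-- Force the join order to follow the order the tables are written in
EXPLAIN SELECT STRAIGHT_JOIN a.id, b.name
FROM demo_a AS a JOIN demo_b AS b ON b.id = a.b_id;

-- Tell the optimizer not to consider an index
EXPLAIN SELECT * FROM demo_b IGNORE INDEX (demo_b_name) WHERE name = 'x';

-- Suggest an index (the optimizer may still prefer a table scan)
EXPLAIN SELECT * FROM demo_b USE INDEX (demo_b_name) WHERE name = 'x';

-- Insist on an index (a table scan is used only if the index cannot be used)
EXPLAIN SELECT * FROM demo_b FORCE INDEX (demo_b_name) WHERE name = 'x';

-- Session-level optimizer settings; the values are examples only
SET SESSION optimizer_prune_level = 1;
SET SESSION optimizer_search_depth = 4;
SET SESSION max_seeks_for_key = 100;

-- Check which optimizer variables exist on this server
SHOW VARIABLES LIKE 'optimizer%';
```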

You can use optimizer hints, but their downfall is that they are not maintained automatically: index and schema changes can make a hint invalid. Remember to document your hints and change them when required.

Let's discuss temporary tables. When you see using temporary in the extra field of explain, it means that a temporary table was used; depending on its size, the temporary table may have been created in memory or on disk. If possible, try to avoid temporary tables unless they are small. There are several ways to change your SQL code so that it does not use temporary tables

try to get rid of any order by or group by; this may be done by splitting the query into two queries, and it may be possible to combine the queries using union so that intermediate results do not need to be stored in a temporary table
distinct may also cause the creation of a temporary table; use the method above to overcome this problem
if the sql_calc_found_rows keyword is used, the number of rows is stored in a temporary table; it is better to store the count in a table and periodically read this table for the results
the sql_small_result keyword in a select statement tells the optimizer that the result set is small and thus to use a temporary table, so only use it when the result set really is small
when using order by or group by, try changing the ordering or, as stated above, try to get rid of it
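Here is a rough sketch of two of these ideas, using hypothetical tables demo_events and demo_event_counts (both invented for this example; demo_events is dropped and recreated so the script can be rerun). The first part checks whether a group by shows using temporary, then adds an index on the grouped column, which is a common technique that is not in the list above, and checks again. The second part follows the sql_calc_found_rows advice by keeping counts in a small table that is refreshed periodically.

```sql
-- Hypothetical table for illustration only
DROP TABLE IF EXISTS demo_events;
CREATE TABLE demo_events (
  id       INT         NOT NULL AUTO_INCREMENT PRIMARY KEY,
  category VARCHAR(20) NOT NULL,
  created  DATETIME    NOT NULL
);

-- Without an index on category this is a candidate for "Using temporary"
EXPLAIN SELECT category, COUNT(*) FROM demo_events GROUP BY category;

-- With an index the rows can be read already grouped; compare the extra field
CREATE INDEX demo_events_category ON demo_events (category);
EXPLAIN SELECT category, COUNT(*) FROM demo_events GROUP BY category;

-- Instead of SQL_CALC_FOUND_ROWS: keep the counts in a small summary table
CREATE TABLE IF NOT EXISTS demo_event_counts (
  category VARCHAR(20) NOT NULL PRIMARY KEY,
  total    INT         NOT NULL
);

-- Refresh periodically (for example from a scheduled job)
REPLACE INTO demo_event_counts (category, total)
SELECT category, COUNT(*) FROM demo_events GROUP BY category;

-- Reading the count is now a primary key lookup
SELECT total FROM demo_event_counts WHERE category = 'signup';
```

The counts are only as fresh as the last refresh, which is the trade-off for not calculating them on every request.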

Summary

Whole books have been written about tuning SQL; below is a small list summarizing what to look for when tuning SQL

adding indexes
changing the size and type of data fields
adding new data fields
moving indexed fields out of functions
limiting the use of temporary tables
batching expensive and/or frequent queries
periodically calculating frequent queries
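Moving indexed fields out of functions is probably the least obvious item in the list, so here is a minimal sketch using a hypothetical demo_invoices table (invented for this example). When the indexed column is wrapped in a function, the optimizer generally cannot use the index to narrow down the rows; rewriting the condition as a range on the bare column gives it the chance to do so.

```sql
-- Hypothetical table for illustration only
CREATE TABLE IF NOT EXISTS demo_invoices (
  id     INT           NOT NULL PRIMARY KEY,
  issued DATE          NOT NULL,
  amount DECIMAL(10,2) NOT NULL,
  KEY demo_invoices_issued (issued)
);

-- The indexed column is inside a function, so the index on issued
-- is unlikely to be used for the lookup
EXPLAIN SELECT id, amount FROM demo_invoices WHERE YEAR(issued) = 2010;

-- Same result set, but the column is left bare: a candidate for the range access type
EXPLAIN SELECT id, amount FROM demo_invoices
WHERE issued >= '2010-01-01' AND issued < '2011-01-01';
```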