Query Analysis and Index Tuning

After you have implemented your system, a year or so down the live you generally hear users stating that reports are taking longer and the system is a little sluggish, in this section we will take a look at how to address these issues which generally all relate to concurrency, more users using the system

Generally system are first setup at very fast pace, numbers of code developers, SQL developers, system admin's are all too preoccupied by there own environment, it's not until the system has gone live and multiple users start using the system that concurrency problems appear, there are tools out that can test concurrency on your system before go live, but with targets and deadlines to meet, these are not used to there full potential.

The other issue that most developers forget is the amount of data that is generated, at the time of development this could have been underestimated, and large data with the inefficient number of indexes can slow down a system dramatically.

I have a couple of section that also has detail on tuning and system performance

Explain

Explain is one of those tools that helps a SQL developer see inside the optimizer, is shows you what method the optimize used to determine how it retrieve the data you requested, this tool is also available to other databases and is the tool widely used to tuning SQL statements. Explain will give you the following detail in MySQL

To use Explain, you simply just type explain at the beginning of you SQL statement

explain mysql> explain select * from users\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: users
type: index
possible_keys: NULL
key: users_idx
key_len: 1534
ref: NULL
rows: 10249
Extra: Using index
1 row in set (0.00 sec)

When you run the explain you will get back a number of values

id is a sequential identifier, which is different for each row, each row represents a physical table, subquery, temporary table or derived table.
select_type is the type of select represented by the row
table

shows the table alias that the row refers to, if there is no alias then the name of the table is displayed, most often it will be simple, others are

  • simple
  • union
  • union_result
  • primary
type

data access strategy is displayed in this field, all the possible values are below, it starts with the slowest going to the fastest

  • all - full scan of the entire table
  • index - full scan of the entire index
  • range - partial scan of the index
  • index_subquery - subquery using a nonunique index
  • unique_subquery - subquery using a unique index
  • index_merge - more than one of the indexes are used to perform multiple scans
  • ref_or_null - more than one record may be looked up for each set of results from previous in the explain plan
  • fulltext - use MySQL's fulltext search
  • ref - more than one record may be looked up for each record being joined, or for each set of from previous in the explain plan
  • eq_ref - fewer than two records are looked up for each record being joined from previous in the explain plan
  • Const - fewer than two records are looked up from a nonsystem table
  • System - fewer than two records are looked up from a system table
  • NULL - the data is not looked up using a table
possible_keys this displays what the optimizer considers using to satisfy data filters that is the where clause and the join conditions, if this is null then there were no indexes that could have been used
key this shows which index the query optimizer used to satisfy the query, sometimes you may get several indexes being used. It is possible that the optimizer may decide that doing a full table scan is quicker, hence this field may be blank and indexes highlighted in the possible_keys field.
key_len shows the length of the key that was used (in bytes), queries that use indexes can be further optimized by making the length of the index smaller.
ref this shows what is compared to the index, for a range of values or a full table scan ref would be null.
rows this gives an approximate number of records examined for this row, the number is based on metadata and the metadata may or may not be accurate depending on the storage engine.
extra

this last field is a catch-all field that shows good, neutral and bad information about a query plan

  • no tables used - no table, temporary table, view or derived table will be used
  • impossible where noticed after reading const tables - there is no satisfactory value
  • const row not found - there is no satisfactory value
  • using where - there is a filter for comparison or joining
  • using intersection - examines indexes in parallel in an index_merge data access strategy, then performs an intersection of the result sets
  • using union - examines indexes in parallel in an index_merge data access strategy, then performs a union of the result sets
  • using sort_union - examines indexes in parallel in an index_merge data access strategy by fetching all record ID's sorting them , then performs a union of the result sets
  • using index - only data from the index is needed, there is no need to retrieve a data record
  • using index for group-by - only data from the index is needed to satisfy a group by or distinct
  • using index condition - access an index value, testing the part of the filter that involves the index
  • using MRR - uses the multi read range optimization
  • using join buffer - table records are put into a join buffer, then the buffer is used for joining
  • distinct - stops looking after the first matched record for this row
  • not exists - used in outer joins where one lookup is sufficient for each record being joined
  • ranged checked for each record - no index could be found but there might be a good one after some other rows (tables) have values
  • select tables optimized away - metadata or an index can be used, so no tables are necessary; one record is returned
  • using where with pushed condition - the cluster "pushes" the condition from the SQL nodes down to the data nodes
  • using temporary - needs to use a temporary table for intermediate values
  • using filtersort - needs to pass through the result set an extra time for sorting

Explain handles subqueries differently than it handles queries, explain will highlight this differences, the biggest difference is the select_type values that are used to describe subqueries

primary outermost query when using subquery
derived select subquery in from clause
subquery first select in a subquery
dependent subquery first select in a dependent subquery
uncacheable subquery subquery result cannot be cached, must be evaluated for every record
dependent union second or later select statements in a union and is used in a dependent subquery
uncacheable union second or later select statement in a union and is used in a dependent subquery, cannot be cached and must be evaluated for every record

Explain can be extended to provide two sets of additional information, the first is an additional field called filtered which shows an approximate percentage of how many rows examined will be returned after the table conditions have been applied. The second set of information is obtained by running show warnings after the explain extended was run, the show warning will have a message field which display the actual SQL run by the optimizer.

Optimizing Queries

Now that you know how to look at your queries, it time to optimize them, when running explain the two fields of importance are the type and the extra fields, what you are trying to archive is get the fastest type strategy you possibly can, and how do you do this by creating indexes and trying to improve the SQL join statements. So why don't we just create indexes for everything we need, indexes can be a performance boost by they can also slow a system down, reads will certainly improve but writes will slow down as it has to write to the table and the index, also the optimizer is clever enough to know that sometimes a full table scan is better than going to the index, remember when access data from the indexes it takes two hoops to get to the data if the data is not in the indexes, the first hoop in the lookup is in the index to obtain the row reference, then next hoop is the table access using the information obtained from the index, the percentage difference with data not in the index is about 20-30 percent before the optimizer will chose to use the index rather than a full table scan, this can be overridden by using optimizer hints which will be discussed next.

Optimizer hints can include

You can use optimizer hints but there downfall is that they are not maintained, what I mean is that indexes and schema changes make the hint invalid, remember to document your hints and change them if required to do so.

Lets discuss temporary tables, when using explain and you see in the extra field using temporary it means that a temporary table was used, now depending on the size of the temporary table this could have been created in memory or on disk. If possible you try and not use temporary unless they are small, there are several ways to optimize you SQL code not to use temporary tables

Summary

Whole books have been written regarding tuning SQL, below is a small list to summarize what to look for when tuning SQL