Query Analysis and Index Tuning
After you have implemented your system, a year or so down the live you generally hear users stating that reports are taking longer and the system is a little sluggish, in this section we will take a look at how to address these issues which generally all relate to concurrency, more users using the system
Generally system are first setup at very fast pace, numbers of code developers, SQL developers, system admin's are all too preoccupied by there own environment, it's not until the system has gone live and multiple users start using the system that concurrency problems appear, there are tools out that can test concurrency on your system before go live, but with targets and deadlines to meet, these are not used to there full potential.
The other issue that most developers forget is the amount of data that is generated, at the time of development this could have been underestimated, and large data with the inefficient number of indexes can slow down a system dramatically.
I have a couple of section that also has detail on tuning and system performance
Explain is one of those tools that helps a SQL developer see inside the optimizer, is shows you what method the optimize used to determine how it retrieve the data you requested, this tool is also available to other databases and is the tool widely used to tuning SQL statements. Explain will give you the following detail in MySQL
To use Explain, you simply just type explain at the beginning of you SQL statement
explain | mysql> explain select * from users\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: users type: index possible_keys: NULL key: users_idx key_len: 1534 ref: NULL rows: 10249 Extra: Using index 1 row in set (0.00 sec) |
When you run the explain you will get back a number of values
id | is a sequential identifier, which is different for each row, each row represents a physical table, subquery, temporary table or derived table. |
select_type | is the type of select represented by the row |
table | shows the table alias that the row refers to, if there is no alias then the name of the table is displayed, most often it will be simple, others are
|
type | data access strategy is displayed in this field, all the possible values are below, it starts with the slowest going to the fastest
|
possible_keys | this displays what the optimizer considers using to satisfy data filters that is the where clause and the join conditions, if this is null then there were no indexes that could have been used |
key | this shows which index the query optimizer used to satisfy the query, sometimes you may get several indexes being used. It is possible that the optimizer may decide that doing a full table scan is quicker, hence this field may be blank and indexes highlighted in the possible_keys field. |
key_len | shows the length of the key that was used (in bytes), queries that use indexes can be further optimized by making the length of the index smaller. |
ref | this shows what is compared to the index, for a range of values or a full table scan ref would be null. |
rows | this gives an approximate number of records examined for this row, the number is based on metadata and the metadata may or may not be accurate depending on the storage engine. |
extra | this last field is a catch-all field that shows good, neutral and bad information about a query plan
|
Explain handles subqueries differently than it handles queries, explain will highlight this differences, the biggest difference is the select_type values that are used to describe subqueries
primary | outermost query when using subquery |
derived | select subquery in from clause |
subquery | first select in a subquery |
dependent subquery | first select in a dependent subquery |
uncacheable subquery | subquery result cannot be cached, must be evaluated for every record |
dependent union | second or later select statements in a union and is used in a dependent subquery |
uncacheable union | second or later select statement in a union and is used in a dependent subquery, cannot be cached and must be evaluated for every record |
Explain can be extended to provide two sets of additional information, the first is an additional field called filtered which shows an approximate percentage of how many rows examined will be returned after the table conditions have been applied. The second set of information is obtained by running show warnings after the explain extended was run, the show warning will have a message field which display the actual SQL run by the optimizer.
Now that you know how to look at your queries, it time to optimize them, when running explain the two fields of importance are the type and the extra fields, what you are trying to archive is get the fastest type strategy you possibly can, and how do you do this by creating indexes and trying to improve the SQL join statements. So why don't we just create indexes for everything we need, indexes can be a performance boost by they can also slow a system down, reads will certainly improve but writes will slow down as it has to write to the table and the index, also the optimizer is clever enough to know that sometimes a full table scan is better than going to the index, remember when access data from the indexes it takes two hoops to get to the data if the data is not in the indexes, the first hoop in the lookup is in the index to obtain the row reference, then next hoop is the table access using the information obtained from the index, the percentage difference with data not in the index is about 20-30 percent before the optimizer will chose to use the index rather than a full table scan, this can be overridden by using optimizer hints which will be discussed next.
Optimizer hints can include
You can use optimizer hints but there downfall is that they are not maintained, what I mean is that indexes and schema changes make the hint invalid, remember to document your hints and change them if required to do so.
Lets discuss temporary tables, when using explain and you see in the extra field using temporary it means that a temporary table was used, now depending on the size of the temporary table this could have been created in memory or on disk. If possible you try and not use temporary unless they are small, there are several ways to optimize you SQL code not to use temporary tables
Whole books have been written regarding tuning SQL, below is a small list to summarize what to look for when tuning SQL