Friday, December 23, 2016

Improved Query Performance by Using Derived Table instead of IN-subquery

MySQL 5.6 introduced semi-join transformation for IN-subqueries. This opened up for several new ways to execute such subqueries; one was no longer tied to executing the subquery once for every row of the outer query. This dramatically improved the performance of many such queries. However, semi-join transformation does not apply to all types of queries. For example, if the subquery has a GROUP BY clause, semi-join does not apply. (For a full list of which types of queries can not use semi-join, see the Reference Manual.)
Fortunately, MySQL 5.6 also introduced subquery materialization that can be used when semi-join does not apply; as long as the subquery is not dependent on the outer query. (There should be no references to tables of the outer query within the subquery). In an earlier blog post, I showed how the query execution time for DBT-3 Query 18 improved from over a month to a few seconds due to subquery materialization. In this blog post, I will show how we can further improve the performance of Query 18 by rewriting it.
Let's first take a look at Query 18 in its original form:
 SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, SUM(l_quantity)  
 FROM customer, orders, lineitem  
 WHERE o_orderkey IN (  
    SELECT l_orderkey  
    FROM lineitem  
    GROUP BY l_orderkey  
    HAVING SUM(l_quantity) > 313  
  )  
  AND c_custkey = o_custkey  
  AND o_orderkey = l_orderkey  
 GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice  
 ORDER BY o_totalprice DESC, o_orderdate  
 LIMIT 100;  
This query is called Large Volume Customer Query since it finds large orders (of more than 313 items) and returns the top 100 orders; ordered by total price and date. In MySQL 5.5 and earlier, MySQL would execute the GROUP BY subquery once for every row of the outer query. In MySQL 5.6, subquery materialization made it possible to execute this subquery only once. This can be seen from the EXPLAIN output for this query (some of the columns have been removed to save space):
id select_type table type key rows Extra
1 PRIMARY orders ALL NULL 1500000 Using where; Using temporary; Using filesort
1 PRIMARY customer eq_ref PRIMARY 1 NULL
1 PRIMARY lineitem ref i_l_orderkey_quantity 4 Using index
2 SUBQUERY lineitem index i_l_orderkey_quantity 6001215 Using index

That select_type is SUBQUERY indicates that the subquery will be executed only once. (Otherwise, the select_type would be DEPENDENT SUBQUERY.) Here is the Visual EXPLAIN diagram for the same query plan:
A basic component of both semi-join execution strategies and subquery materialization is duplicate removal. Unlike a JOIN operation where the result will contain all combinations of rows that match the join criteria, an IN-expression should only produce one row regardless of the number of matches in the subquery. For semi-join there are multiple strategies for removing duplicates, and for subquery materialization, duplicates are removed from the temporary table.
Looking at QUERY 18, we realize that the subquery will never give any duplicates. The only column selected is the column that we are grouping on, l_orderkey, so all rows will be distinct. This means that we can replace the IN-expression with a join, and get this equivalent query:
SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, SUM(lineitem.l_quantity)
FROM customer, orders, lineitem,
     ( SELECT l_orderkey
       FROM lineitem
       GROUP BY l_orderkey
       HAVING SUM(l_quantity) > 313
     ) l2
WHERE o_orderkey = l2.l_orderkey
  AND c_custkey = o_custkey
  AND o_orderkey = lineitem.l_orderkey
GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
ORDER BY o_totalprice DESC, o_orderdate
LIMIT 100;

The subquery is now a derived table; a subquery in the FROM clause. Take a look at the query plan for the rewritten query:
Notice that the subquery is materialized as before, but this time the plan is to start with scanning the temporary table instead of the orders table as for the original query. This means that instead of accessing all orders, MySQL will now only access the biggest orders (with more than 313 items). Given that most orders are smaller than that, this should give improved performance. Indeed, with MySQL 5.7.14 and using a scale factor 1 database, the query execution time is slashed by one third by this rewrite:
In general, rewriting the query from an IN-expression to a JOIN is beneficial because the join optimizer may then be used to find the optimal order for accessing the tables. This is the idea behind the semi-join transformations that are done automatically since MySQL 5.6, and this is why our manual transformation gave a better query plan. Hopefully, some time in the future the MySQL Query Optimizer will be able to do such transformations automatically. Until then, watch out for opportunities to manually rewrite IN-subqueries where semi-join does not apply!

Friday, October 21, 2016

Presentations at OpenWorld and PerconaLive

During the last month, I presented on MySQL both at Oracle OpenWorld and Percona Live in Amsterdam.  The slides from the presentations have been uploaded to the conference web sites,  and you also find the slides at Slideshare:

Saturday, September 17, 2016

MySQL Optimizer Sessions at Oracle OpenWorld

Oracle OpenWorld starts soon and there will be a few sessions on the MySQL Optimizer.   On Monday, I will have my tutorial on how to analyze and tune MySQL queries.  Later in the week Manyi Lu, the MySQL optimizer team manager, will present what was new in the MySQL optimizer 5.7, and also give a sneak peek into the MySQL 8.0 release.  Both of us will together have a presentation  on Common Table Expressions; a new SQL feature in MySQL 8.0.  The details are as follows: 

Monday, Sep 19, 1:45 p.m. 
How to Analyze and Tune MySQL Queries for Better Performance [TUT3750]
Oystein Grovlen, Senior Principal Software Engineer, Oracle

Wednesday, Sep 21, 3:00 p.m.
MySQL Optimizer: What’s New in 5.7 and Sneak Peek at 8.0 [CON6112]
Manyi Lu, Director Software Engineering, Oracle

Thursday, Sep 22, 12:00 p.m.
MySQL 8.0: Common Table Expressions [CON7928]
Manyi Lu, Director Software Engineering, Oracle
Oystein Grovlen, Senior Principal Software Engineer, Oracle


All MySQL tutorials and conference sessions are this year at at The Park Central hotel.  For more information on MySQL sessions, see Focus on MySQL @ OpenWorld.  OpenWorld participants can build their own OpenWorld agenda and reserve seats for sessions by going to the OpenWorld home page and select Tools and Resources → My Schedule.

We will also like to invite people to the MySQL community reception at Jillian's on Tuesday, September 20 @ 7 pm.  This reception is not limited to OpenWorld registrants.  So if you are in the SF area, please, feel free to come and meet MySQL developers and users!

Friday, March 4, 2016

Oracle Virtual Technology Summit

In the coming weeks, Oracle Technology Network welcomes you to join the Virtual Technology Summit which is a half-day online conference with multiple tracks.  This time there is a MySQL track, and I will do a presentation on Analyze & Tune MySQL Queries for Better Performance.  The MySQL track also contains a presentation on MySQL 5.7, and a presentation on how to use Oracle Enterprise Manager to manage MySQL databases.

There will be three events with the same program, but at different times to best suit different parts of the world.   The first opportunity is the Americas event on March 8.  Later, there will be events for Asia/Pacific (March 15) and Europe/Middle East/Africa (April 5).  

To register, go to the pages for the Americas, Asia/Pacific, and Europe/Middle East/Africa events, respectively.  It will also be possible to listen to the webcast after the events, but then you will miss the opportunity to ask questions as we go along.

 

Wednesday, January 27, 2016

How to Get Optimizer Trace for a Query

While EXPLAIN shows the selected query plan for a query, optimizer trace will show you WHY the particular plan was selected. From the trace you will be able to see what alternative plans was considered, the estimated costs of different plans, and what decisions was made during query optimization.

To turn on recording of optimizer trace for the current session:
SET optimizer_trace='enabled=on';

When optimizer trace is enabled, the information schema table optimizer_trace will contain the trace for the latest query that was explained or executed. I usually dump the trace to a file using this SQL statement:
SELECT trace FROM information_schema.optimizer_trace 
    INTO OUTFILE '<filename>' LINES TERMINATED BY '';

One important thing to note is that there is a configurable maximum size for the memory buffer used to record the trace. The default is pretty low, and you will often have to increase the size to capture the entire trace. The missing_bytes_beyond_max_mem_size column of the optimizer_trace table shows how many bytes are missing from the trace. If this column is non-zero, you should increase the setting of the variable optimizer_trace_max_mem_size:
SET optimizer_trace_max_mem_size=1000000;

The above statement increases the trace buffer to about 1 MB. Unless your query joins very many tables, this should be sufficient.

To get more information about optimizer trace, check:

Optimizer tracing: Query Execution Plan descriptions beyond EXPLAIN
Optimizer tracing: how to configure it
MySQL Internals Manual on Optimizer Tracing

Tuesday, January 26, 2016

Improved Performance of Queries with Derived Tables

Last month/year I published a blog post on mysqlserverteam.com with an example of how MySQL 5.7 gives you improved performance for queries with derived tables.

The query example was part of my tutorial “How to Analyze and Tune MySQL Queries for Better Performance” at Oracle OpenWorld 2015. Slides for the entire presentation can be found here.