Commit 3956a81

Author: Ole John Aske (committed)
Bug#33317872 Incorrect Index selected leading to slower execution of queries
Root cause for this bug is two-fold:

1) The optimizer uses a mix of handler::read_cost() and Cost_model::page_read_cost()
   to estimate the cost of different access methods. As these two methods have not
   been aligned to return comparable costs, the proposed access plans cannot really
   be compared at all.

2) ha_ndbcluster does not implement read_time() (which is used by ::read_cost()).
   Thus the default implementation was used, which did not reflect that ndbcluster
   has a huge cost difference between fetching <n> rows using a fully specified
   unique key and fetching the same <n> rows with a range scan on an ordered index
   (needing to scan all the 128 fragments in this case).

Thus the patch is two-fold as well:

1) The usage of Cost_model::page_read_cost() directly from the optimizer was replaced
   by the new handler method handler::page_read_cost(). The default implementation of
   handler::page_read_cost() just calls Cost_model::page_read_cost() as before, such
   that SEs (InnoDB!) not overriding handler::page_read_cost() will calculate the same
   cost as before (and still be 'broken').

   A handler::worst_seek_times() was also introduced, with a default implementation
   still using Cost_model::page_read_cost(), effectively keeping the same worst_seeks
   calculation as before for those not overriding it.

   Note that the costs calculated by page_read_cost(), worst_seek_times() and
   handler::read_cost() are all compared against each other, so they all need to use
   the same underlying 'metric' to really be comparable. We previously didn't do that,
   and neither does the default implementation - the latter is intentional, to avoid
   changing too many InnoDB test/customer cases and possibly introducing regressions.

2) ha_ndbcluster implements ::read_time(), page_read_cost() and worst_seek_times().

   read_time() is implemented such that it differentiates between the 3 basic access
   cases:

   - PK access: uses the existing handler::read_time() calculation as a baseline,
     where cost = ranges + rows. The assumption seems to be that each 'range'
     (= a single row) has a cost of '1', as does each returned row. It could be
     discussed whether this formula is correct or not, but we intentionally didn't
     change it, to avoid changing too many test cases as well.

   - UQ access: requires a 'double' key lookup, first on the UQ, then on the PK.
     Thus we assume twice the cost for the 'range', which is the request part of
     the fetch.

   - Ordered-index range scan: needs to 'broadcast' the range scan to each fragment,
     so the cost scales with the number of fragments to scan (128 fragments in the
     SR related to the bug report!). We estimate the request part of the cost to
     scale with a factor of 0.5 for each fragment to scan.

   Note that the above is not intended to be a perfect read_time() calculation;
   there are still pending WLs for that. The main focus has been to not change
   existing test cases, while still reflecting that scanning many fragments has a
   high cost.

   We also override page_read_cost() and worst_seek_times() to be based on
   read_cost() instead, thus making the calculated cost comparable with the cost
   the optimizer computes in those places where it calls read_cost() directly
   itself. (Note that a 'page cost' is not relevant at all for SEs that are not
   disk/page-cache based.)

The optimizer patches: Reviewed By: Sergei Glukhov <sergey.glukhov@oracle.com>
The ha_ndbcluster patches: Reviewed By: Arnab Ray <arnab.r.ray@oracle.com>

Change-Id: Ie84b649c47eb1016ef27f5c9b027a74d93d1ad80
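As a rough illustration of the effect (illustrative numbers only, not taken from an
actual trace; assuming the 128 fragments from the bug report, 2 ranges returning 2 rows,
and ignoring any constant scaling read_cost() applies on top of read_time()):

  unique-key lookups : fanout = 2.0 * rows                  = 2.0 * 2      =   4
                       cost   = fanout + rows               = 4 + 2        =   6
  ordered-index scan : fanout = ranges * (1 + 0.5 * frags)  = 2 * (1 + 64) = 130
                       cost   = fanout + rows               = 130 + 2      = 132

With the new estimates the ordered-index scan comes out roughly 20x more expensive,
which is what lets the optimizer prefer the unique index for the query in the bug report.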
1 parent 9181d3f commit 3956a81

10 files changed (+324, -14 lines)

mysql-test/suite/ndb/r/ndb_index_unique.result

+33 -3

@@ -907,14 +907,14 @@ from
 t2 as t1, t2 as t2, t2 as t3, t2 as t4;
 explain
 SELECT STRAIGHT_JOIN count(*) FROM
-t1 JOIN t2 ON t2.a=t1.a where t2.uq IS NULL;
+t1 JOIN t2 FORCE INDEX FOR JOIN (ix) ON t2.a=t1.a where t2.uq IS NULL;
 id select_type table partitions type possible_keys key key_len ref rows filtered Extra
 1 SIMPLE t1 p0,p1,p2,p3,p4,p5,p6,p7 ALL NULL NULL NULL NULL 10000 100.00 Using where with pushed condition (`test`.`t1`.`a` is not null)
 1 SIMPLE t2 p0,p1,p2,p3,p4,p5,p6,p7 ref ix ix 10 const,test.t1.a 1 100.00 Using where with pushed condition isnull(`test`.`t2`.`uq`)
 Warnings:
-Note 1003 /* select#1 */ select straight_join count(0) AS `count(*)` from `test`.`t1` join `test`.`t2` where ((`test`.`t2`.`a` = `test`.`t1`.`a`) and isnull(`test`.`t2`.`uq`))
+Note 1003 /* select#1 */ select straight_join count(0) AS `count(*)` from `test`.`t1` join `test`.`t2` FORCE INDEX FOR JOIN (`ix`) where ((`test`.`t2`.`a` = `test`.`t1`.`a`) and isnull(`test`.`t2`.`uq`))
 SELECT STRAIGHT_JOIN count(*) FROM
-t1 JOIN t2 ON t2.a=t1.a where t2.uq IS NULL;
+t1 JOIN t2 FORCE INDEX FOR JOIN (ix) ON t2.a=t1.a where t2.uq IS NULL;
 count(*)
 0
 drop table t1,t2;
@@ -1099,3 +1099,33 @@ insert into t2 values (4);
 ERROR 23000: Cannot add or update a child row: a foreign key constraint fails (Unknown error code)
 #cleanup
 drop table t2, t1;
+#
+# Bug#33317872 Incorrect Index selected leading to slower execution of queries
+#
+# When choosing between a Multi-range-read on an unique index,
+# and a range scan on an ordered index, where few rows are expected
+# to be returned, prefer the unique index.
+# (An ordered index scan has higher cost, as all fragments are scanned)
+#
+CREATE TABLE t (
+delivery_id bigint unsigned NOT NULL AUTO_INCREMENT,
+msg_id int unsigned NOT NULL,
+auth_login_type char(1) DEFAULT NULL,
+auth_login_id varchar(64),
+PRIMARY KEY (delivery_id),
+UNIQUE KEY msg_id(msg_id,auth_login_type,auth_login_id),
+KEY auth_login_id(auth_login_id)
+) ENGINE=ndbcluster;
+Table Op Msg_type Msg_text
+test.t analyze status OK
+insert into t (msg_id, auth_login_type, auth_login_id) values
+(4866, '5', '81774133'),
+(4869, '5', '81774133');
+explain
+select * from t
+where auth_login_type='5' AND auth_login_id = '81774133' AND msg_id IN (4866,4869);
+id select_type table partitions type possible_keys key key_len ref rows filtered Extra
+1 SIMPLE t p0,p1,p2,p3,p4,p5,p6,p7 range msg_id,auth_login_id msg_id 73 NULL 2 100.00 Using where with pushed condition ((`test`.`t`.`auth_login_id` = '81774133') and (`test`.`t`.`auth_login_type` = '5') and (`test`.`t`.`msg_id` in (4866,4869))); Using MRR
+Warnings:
+Note 1003 /* select#1 */ select `test`.`t`.`delivery_id` AS `delivery_id`,`test`.`t`.`msg_id` AS `msg_id`,`test`.`t`.`auth_login_type` AS `auth_login_type`,`test`.`t`.`auth_login_id` AS `auth_login_id` from `test`.`t` where ((`test`.`t`.`auth_login_id` = '81774133') and (`test`.`t`.`auth_login_type` = '5') and (`test`.`t`.`msg_id` in (4866,4869)))
+drop table t;

mysql-test/suite/ndb/r/ndb_statistics1.result

+2 -2

@@ -82,8 +82,8 @@ EXPLAIN
 SELECT * FROM t10000 AS x JOIN t10000 AS y
 ON y.i=x.i AND y.j = x.i;
 id select_type table partitions type possible_keys key key_len ref rows filtered Extra
-1 SIMPLE x p0 ALL I NULL NULL NULL 10000 100.00 Parent of 2 pushed join@1; Using where with pushed condition (`test`.`x`.`I` is not null)
-1 SIMPLE y p0 ref J,I J 5 test.x.I 1 5.00 Child of 'x' in pushed join@1; Using where
+1 SIMPLE x p0 ALL I NULL NULL NULL 10000 100.00 Parent of 2 pushed join@1; Using where with pushed condition ((`test`.`x`.`I` is not null) and (`test`.`x`.`I` is not null))
+1 SIMPLE y p0 ref J,I I 10 test.x.I,test.x.I 1 100.00 Child of 'x' in pushed join@1
 Warnings:
 Note 1003 /* select#1 */ select `test`.`x`.`K` AS `K`,`test`.`x`.`I` AS `I`,`test`.`x`.`J` AS `J`,`test`.`y`.`K` AS `K`,`test`.`y`.`I` AS `I`,`test`.`y`.`J` AS `J` from `test`.`t10000` `x` join `test`.`t10000` `y` where ((`test`.`y`.`I` = `test`.`x`.`I`) and (`test`.`y`.`J` = `test`.`x`.`I`))
 EXPLAIN

mysql-test/suite/ndb/t/ndb_index_unique.test

+106 -2

@@ -553,9 +553,9 @@ from
 
 explain
 SELECT STRAIGHT_JOIN count(*) FROM
-t1 JOIN t2 ON t2.a=t1.a where t2.uq IS NULL;
+t1 JOIN t2 FORCE INDEX FOR JOIN (ix) ON t2.a=t1.a where t2.uq IS NULL;
 SELECT STRAIGHT_JOIN count(*) FROM
-t1 JOIN t2 ON t2.a=t1.a where t2.uq IS NULL;
+t1 JOIN t2 FORCE INDEX FOR JOIN (ix) ON t2.a=t1.a where t2.uq IS NULL;
 
 drop table t1,t2;
 
@@ -738,4 +738,108 @@ insert into t2 values (4);
 --echo #cleanup
 drop table t2, t1;
 
+
+--echo #
+--echo # Bug#33317872 Incorrect Index selected leading to slower execution of queries
+--echo #
+--echo # When choosing between a Multi-range-read on an unique index,
+--echo # and a range scan on an ordered index, where few rows are expected
+--echo # to be returned, prefer the unique index.
+--echo # (An ordered index scan has higher cost, as all fragments are scanned)
+--echo #
+
+CREATE TABLE t (
+delivery_id bigint unsigned NOT NULL AUTO_INCREMENT,
+msg_id int unsigned NOT NULL,
+auth_login_type char(1) DEFAULT NULL,
+auth_login_id varchar(64),
+PRIMARY KEY (delivery_id),
+UNIQUE KEY msg_id(msg_id,auth_login_type,auth_login_id),
+KEY auth_login_id(auth_login_id)
+) ENGINE=ndbcluster;
+
+disable_query_log;
+
+insert into t (msg_id, auth_login_type, auth_login_id) values
+(4866, '5', '91774132'), (4869, '5', '91884132');
+insert into t (msg_id, auth_login_type, auth_login_id) values
+(4866, '5', '91774134'), (4869, '5', '91884134');
+
+insert into t (msg_id, auth_login_type, auth_login_id) values
+(4866, '4', '91774133'), (4869, '4', '91884133');
+insert into t (msg_id, auth_login_type, auth_login_id) values
+(4866, '6', '91774133'), (4869, '6', '91884133');
+
+insert into t (msg_id, auth_login_type, auth_login_id) values
+(4865, '5', '91774133'), (4868, '5', '91774133');
+insert into t (msg_id, auth_login_type, auth_login_id) values
+(4867, '5', '91774133'), (4870, '5', '91774133');
+# ^^^ Add some more rows, none will match the query
+
+# Use these as a base for filling in more rows
+insert into t (msg_id, auth_login_type, auth_login_id) select
+msg_id - 4800, auth_login_type, auth_login_id+10 from t;
+
+insert into t (msg_id, auth_login_type, auth_login_id) select
+msg_id + 1000, auth_login_type, auth_login_id+20 from t where msg_id < 100;
+insert into t (msg_id, auth_login_type, auth_login_id) select
+msg_id + 2000, auth_login_type, auth_login_id+30 from t where msg_id < 100;
+insert into t (msg_id, auth_login_type, auth_login_id) select
+msg_id + 3000, auth_login_type, auth_login_id+40 from t where msg_id < 100;
+
+insert into t (msg_id, auth_login_type, auth_login_id) select
+msg_id + 5000, auth_login_type, auth_login_id+50 from t where msg_id < 100;
+insert into t (msg_id, auth_login_type, auth_login_id) select
+msg_id + 6000, auth_login_type, auth_login_id+60 from t where msg_id < 100;
+insert into t (msg_id, auth_login_type, auth_login_id) select
+msg_id + 7000, auth_login_type, auth_login_id+70 from t where msg_id < 100;
+
+
+##########################################
+# Statistics is asynchronous: We cannot predict when it
+# is available in the statistics cache, and thus which
+# statistics is being used for the query. Thus we
+# need to cheat:
+#
+# We 'analyze' before the rows to be selected are inserted.
+# -> statistics will predictably contain 0 rows to be returned,
+# which is 'rounded' up to 2 rows, the lowest number of
+# rows allowed to be estimated from non-unique indexes.
+#
+# Note that the estimated 2 rows is actually what _is_ in the table
+# as well, when we later insert these rows below!
+
+analyze table t;
+enable_query_log;
+
+##########################################
+# Then insert the rows we want returned, without them being part of the statistics
+insert into t (msg_id, auth_login_type, auth_login_id) values
+(4866, '5', '81774133'),
+(4869, '5', '81774133');
+# ^^^^ Rows we want from the query
+
+# Optimizer trace, for debugging or understanding of the test case + bug:
+#
+# If enabled, we will find that the optimizer first calculates the range-access
+# cost, using the handler ::cost methods, and correctly finds the msg_id index to
+# have the lowest cost.
+# Possible REF accesses are then investigated and estimated: However, these
+# cost estimates were 'pre-patch' based on an entirely different page/cache
+# cost metric, not comparable with the former. A lower cost
+# was found for using REF access on the auth_login_id index, incorrectly causing
+# this index to be used for the table access.
+#
+#SET optimizer_trace_max_mem_size=1048576; # 1MB
+#SET end_markers_in_json=on;
+#SET optimizer_trace="enabled=on,one_line=off";
+
+explain
+select * from t
+where auth_login_type='5' AND auth_login_id = '81774133' AND msg_id IN (4866,4869);
+
+#SELECT * FROM information_schema.optimizer_trace;
+
+drop table t;
+
 # end of tests
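The optimizer-trace statements are deliberately kept commented out in the test above.
To inspect the cost comparison manually, they can be enabled in an interactive session
along these lines (a sketch only; the statements are the commented-out ones from the
test, and the final 'enabled=off' cleanup is an assumption):

  SET optimizer_trace_max_mem_size=1048576;
  SET end_markers_in_json=on;
  SET optimizer_trace="enabled=on,one_line=off";
  explain select * from t
  where auth_login_type='5' AND auth_login_id = '81774133' AND msg_id IN (4866,4869);
  SELECT * FROM information_schema.optimizer_trace;
  SET optimizer_trace="enabled=off";

The trace then shows the range-access cost calculated for the msg_id index being compared
against the REF-access cost on auth_login_id, which pre-patch was based on a different
cost metric.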

sql/ha_ndbcluster.cc

+110

@@ -8322,6 +8322,116 @@ double ha_ndbcluster::scan_time()
   DBUG_RETURN(res);
 }
 
+/**
+  read_time() needs to differentiate between single-row type lookups,
+  and accesses where an ordered index needs to be scanned.
+  The latter will need to scan all fragments, which might be
+  significantly more expensive - imagine a deployment with hundreds
+  of partitions.
+*/
+double ha_ndbcluster::read_time(uint index, uint ranges, ha_rows rows) {
+  DBUG_ENTER("ha_ndbcluster::read_time()");
+  assert(rows > 0);
+  assert(ranges > 0);
+  assert(rows >= ranges);
+
+  const NDB_INDEX_TYPE index_type =
+      (index < MAX_KEY)    ? get_index_type(index)
+      : (index == MAX_KEY) ? PRIMARY_KEY_INDEX  // Hidden primary key
+                           : UNDEFINED_INDEX;   // -> worst index
+
+  // fanout_factor is intended to compensate for the number
+  // of roundtrips between API <-> data node and between the data nodes
+  // themselves for the different index types. As an initial guess
+  // we assume a single full roundtrip for each 'range'.
+  double fanout_factor;
+
+  /**
+   * Note that for now we use the default handler cost estimate
+   * 'rows2double(ranges + rows)' as the baseline - even if it
+   * might have some obvious flaws. For now it is more important
+   * to get the relative cost between PK/UQ and ordered index scan
+   * more correct. It is also a matter of not changing too many
+   * existing MTR tests. (and customer queries as well!)
+   *
+   * We also estimate the same cost for a request roundtrip as
+   * for returning a row. Thus the baseline cost 'ranges + rows'.
+   */
+  if (index_type == PRIMARY_KEY_INDEX) {
+    assert(index == table->s->primary_key);
+    // Need a full roundtrip for each row
+    fanout_factor = 1.0 * rows2double(rows);
+  } else if (index_type == UNIQUE_INDEX) {
+    // Need to lookup first on UQ, then on PK, + lock/unlock
+    fanout_factor = 2.0 * rows2double(rows);
+
+  } else if (rows > ranges ||
+             index_type == ORDERED_INDEX ||
+             index_type == UNDEFINED_INDEX) {
+    // Assume we need a range scan
+
+    // TODO: - Handler call needs a parameter specifying whether the
+    //         key was fully specified or not (-> scan or lookup)
+    //       - The range scan could be pruned -> lower cost, or
+    //       - The scan may need to be 'ordered' -> higher cost.
+    //       - Returning multiple rows per range has a lower
+    //         per-row cost?
+    const uint fragments_to_scan =
+        m_table->getFullyReplicated() ? 1 : m_table->getPartitionCount();
+
+    // The range scan does one API -> TC request, which scales out the
+    // requests to all fragments. Assume a somewhat (*0.5) lower cost
+    // for these requests, as they are not full roundtrips back to the API.
+    fanout_factor = (double)ranges * (1.0 + ((double)fragments_to_scan * 0.5));
+
+  } else {
+    assert(rows == ranges);
+
+    // Assume a set of PK/UQ single-row lookups.
+    // We assume the hash key is used for a direct lookup.
+    if (index_type == PRIMARY_KEY_ORDERED_INDEX) {
+      assert(index == table->s->primary_key);
+      fanout_factor = (double)ranges * 1.0;
+    } else {
+      assert(index_type == UNIQUE_ORDERED_INDEX);
+      // Unique key access has a higher cost than PK. Need to first
+      // lookup in the index, then use that to lookup the row + lock & unlock.
+      fanout_factor = (double)ranges * 2.0;  // Assume twice as many roundtrips
+    }
+  }
+  DBUG_RETURN(fanout_factor + rows2double(rows));
+}
+
+/**
+ * Estimate the cost for reading the specified number of rows,
+ * using 'index'. Note that there is no such thing as a 'page' read
+ * in ha_ndbcluster. Unfortunately, the optimizer makes some
+ * assumptions about an underlying page-based storage engine,
+ * which explains the name.
+ *
+ * In the NDB implementation we simply ignore the 'page', and
+ * calculate it as any other read_cost().
+ */
+double ha_ndbcluster::page_read_cost(uint index, double rows) {
+  DBUG_ENTER("ha_ndbcluster::page_read_cost()");
+  DBUG_RETURN(read_cost(index, 1, rows).total_cost());
+}
+
+/**
+ * Estimate the upper cost for reading rows in a seek-and-read fashion.
+ * The calculation is based on the worst index we can find for this table,
+ * such that any other better way of reading the rows will be preferred.
+ *
+ * Note that worst_seek will be compared against page_read_cost().
+ * Thus, it needs to calculate the cost using comparable 'metrics'.
+ */
+double ha_ndbcluster::worst_seek_times(double reads) {
+  // Specifying 'UNDEFINED_INDEX' is a special case in read_time(),
+  // where the cost for the most expensive/worst index will be calculated.
+  const uint undefined_index = MAX_KEY + 1;
+  return page_read_cost(undefined_index, std::max(reads, 1.0));
+}
+
 /*
   Convert MySQL table locks into locks supported by Ndb Cluster.
   Note that MySQL Cluster does currently not support distributed
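Under the same illustrative assumptions as in the commit message (128 fragments, and
ignoring any constant scaling the default read_cost() applies on top of read_time()),
the new worst-seek estimate can be traced as: worst_seek_times(reads) ->
page_read_cost(MAX_KEY+1, reads) -> read_cost(MAX_KEY+1, 1, reads) -> read_time() with
UNDEFINED_INDEX, which takes the range-scan branch with ranges = 1, giving roughly
1 * (1 + 0.5 * 128) + reads = 65 + reads. So the 'worst' index estimate for NDB now
also reflects the fragment count, keeping it comparable with the ordered-index scan
costs it is weighed against.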

sql/ha_ndbcluster.h

+3

@@ -308,6 +308,9 @@ class ha_ndbcluster: public handler, public Partition_handler
   const char* index_type(uint key_number);
 
   double scan_time();
+  double read_time(uint index, uint ranges, ha_rows rows);
+  double page_read_cost(uint index, double rows);
+  double worst_seek_times(double reads);
   ha_rows records_in_range(uint inx, key_range *min_key, key_range *max_key);
   void start_bulk_insert(ha_rows rows);
   int end_bulk_insert();

sql/handler.cc

+19

@@ -5947,6 +5947,25 @@ int ha_binlog_end(THD* thd)
   return 0;
 }
 
+double handler::page_read_cost(uint index,
+                               double reads) {
+  return table->cost_model()->page_read_cost(reads);
+
+  /////////////////
+  // Other, non-page-based storage engines may prefer to
+  // override this to:
+  //   return read_cost(index, 1, reads).total_cost();
+
+  // Longer term: We should avoid mixed usage of read_cost()
+  // and page_read_cost() from the optimizer. Use only
+  // one of these to get cost estimates comparable between different
+  // access methods and call paths.
+}
+
+double handler::worst_seek_times(double reads) {
+  return table->cost_model()->page_read_cost(reads);
+}
+
 /**
   Calculate cost of 'index only' scan for given index and number of records

sql/handler.h

+45

@@ -2582,6 +2582,51 @@ class handler :public Sql_alloc
 
   virtual Cost_estimate read_cost(uint index, double ranges, double rows);
 
+  /**
+    Cost estimate for doing a number of non-sequential accesses
+    against the storage engine. Such accesses can be either a number
+    of rows to read, or a number of disk pages to access.
+    Each handler implementation is free to interpret that as best
+    suited, depending on what is the dominating cost for that
+    storage engine.
+
+    This method is mainly provided as a temporary workaround for
+    bug#33317872, where we fix problems caused by calling
+    Cost_model::page_read_cost() directly from the optimizer.
+    That should be avoided, as it introduced an assumption about all
+    storage engines being disk-page based, and having a 'page' cost.
+    Furthermore, this page cost was even compared against read_cost(),
+    which was computed with an entirely different algorithm, and thus
+    could not be compared.
+
+    The default implementation still calls Cost_model::page_read_cost(),
+    thus behaving just as before. However, handler implementations may
+    override it to call handler::read_cost() instead, which probably
+    will be more correct. (If a page_read_cost should be included
+    in the cost estimate, that should preferably be done inside
+    each read_cost() implementation.)
+
+    Longer term we should consider removing all page_read_cost()
+    usage from the optimizer itself, making this method obsolete.
+
+    @param index the index number
+    @param reads the number of accesses being made
+
+    @returns the estimated cost
+  */
+  virtual double page_read_cost(uint index, double reads);
+
+  /**
+    Provide an upper cost-limit of doing a specified number of
+    seek-and-read key lookups. This needs to be comparable and
+    calculated with the same 'metric' as page_read_cost().
+
+    @param reads the number of rows read in the 'worst' case.
+
+    @returns the estimated cost
+  */
+  virtual double worst_seek_times(double reads);
+
   /**
     Return an estimate on the amount of memory the storage engine will
     use for caching data in memory. If this is unknown or the storage
