Commit 3956a81

Author: Ole John Aske (committed)
Bug#33317872 Incorrect Index selected leading to slower execution of queries
Root cause for this bug is two-fold:

1) The optimizer uses a mix of handler::read_cost() and Cost_model::page_read_cost()
   to estimate the cost of different access methods. As these two methods have not
   been aligned to return comparable costs, the proposed access plans cannot really
   be compared at all.

2) ha_ndbcluster does not implement read_time() (which is used by ::read_cost()).
   Thus the default implementation was used, which did not reflect that ndbcluster
   has a huge cost difference between fetching <n> rows using a fully specified
   unique key and fetching the same <n> rows with a range scan on an ordered index
   (needing to scan all the 128 fragments in this case).

Thus the patch is two-fold as well:

1) The usage of Cost_model::page_read_cost() directly from the optimizer was replaced
   by the new handler method handler::page_read_cost(). The default implementation of
   handler::page_read_cost() just calls Cost_model::page_read_cost() as before, such
   that SEs (InnoDB!) not overriding handler::page_read_cost() will calculate the same
   cost as before (and still be 'broken').

   A handler::worst_seek_times() was also introduced, with a default implementation
   still using Cost_model::page_read_cost(), effectively keeping the same worst_seeks
   calculation as before for those not overriding it.

   Note that the costs calculated by page_read_cost(), worst_seek_times() and
   handler::read_cost() are all compared against each other, so they all need to use
   the same underlying 'metric' to really be comparable. We previously didn't do that,
   and neither does the default implementation - the latter is intentional, to avoid
   changing too many InnoDB test/customer cases and possibly introducing regressions.

2) ha_ndbcluster implements ::read_time(), page_read_cost() and worst_seek_times().

   read_time() is implemented such that it differentiates between the 3 basic access
   cases:

   - PK access: uses the existing handler::read_time() calculation as a baseline,
     where cost = ranges + rows. The assumption seems to be that each 'range'
     (= a single row) has a cost of '1', as does each returned row. It could be
     discussed whether this formula is correct or not, but we intentionally didn't
     change it, to avoid changing too many test cases as well.

   - UQ access: requires a 'double' key lookup, first on the UQ, then on the PK.
     Thus we assume twice the cost for the 'range', which is the request part of
     the fetch.

   - Ordered-index range scan: needs to 'broadcast' the range scan to each fragment,
     so the cost scales with the number of fragments to scan (128 fragments in the
     SR related to the bug report!). We estimate the request part of the cost to
     scale with a factor of 0.5 for each fragment to scan.

   Note that the above is not intended to be a perfect read_time() calculation;
   there are still pending WLs for that. The main focus has been to not change
   existing test cases, while still reflecting that scanning many fragments has a
   high cost.

   We also override page_read_cost() and worst_seek_times() to be based on
   read_cost() instead, thus making the calculated cost comparable with the cost
   the optimizer computes in those places where it calls read_cost() directly
   itself. (Note that a 'page cost' is not relevant at all for SEs that are not
   disk/page-cache based.)

The optimizer patches: Reviewed By: Sergei Glukhov <sergey.glukhov@oracle.com>
The ha_ndbcluster patches: Reviewed By: Arnab Ray <arnab.r.ray@oracle.com>

Change-Id: Ie84b649c47eb1016ef27f5c9b027a74d93d1ad80
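As a rough illustration of the effect (illustrative numbers only, not taken from an
actual trace; assuming the 128 fragments from the bug report, 2 ranges returning 2 rows,
and ignoring any constant scaling read_cost() applies on top of read_time()):

  unique-key lookups : fanout = 2.0 * rows                  = 2.0 * 2      =   4
                       cost   = fanout + rows               = 4 + 2        =   6
  ordered-index scan : fanout = ranges * (1 + 0.5 * frags)  = 2 * (1 + 64) = 130
                       cost   = fanout + rows               = 130 + 2      = 132

With the new estimates the ordered-index scan comes out roughly 20x more expensive,
which is what lets the optimizer prefer the unique index for the query in the bug report.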
1 parent 9181d3f commit 3956a81

10 files changed (+324, -14 lines)

mysql-test/suite/ndb/r/ndb_index_unique.result

+33 -3

@@ -907,14 +907,14 @@ from
 t2 as t1, t2 as t2, t2 as t3, t2 as t4;
 explain
 SELECT STRAIGHT_JOIN count(*) FROM
-t1 JOIN t2 ON t2.a=t1.a where t2.uq IS NULL;
+t1 JOIN t2 FORCE INDEX FOR JOIN (ix) ON t2.a=t1.a where t2.uq IS NULL;
 id select_type table partitions type possible_keys key key_len ref rows filtered Extra
 1 SIMPLE t1 p0,p1,p2,p3,p4,p5,p6,p7 ALL NULL NULL NULL NULL 10000 100.00 Using where with pushed condition (`test`.`t1`.`a` is not null)
 1 SIMPLE t2 p0,p1,p2,p3,p4,p5,p6,p7 ref ix ix 10 const,test.t1.a 1 100.00 Using where with pushed condition isnull(`test`.`t2`.`uq`)
 Warnings:
-Note 1003 /* select#1 */ select straight_join count(0) AS `count(*)` from `test`.`t1` join `test`.`t2` where ((`test`.`t2`.`a` = `test`.`t1`.`a`) and isnull(`test`.`t2`.`uq`))
+Note 1003 /* select#1 */ select straight_join count(0) AS `count(*)` from `test`.`t1` join `test`.`t2` FORCE INDEX FOR JOIN (`ix`) where ((`test`.`t2`.`a` = `test`.`t1`.`a`) and isnull(`test`.`t2`.`uq`))
 SELECT STRAIGHT_JOIN count(*) FROM
-t1 JOIN t2 ON t2.a=t1.a where t2.uq IS NULL;
+t1 JOIN t2 FORCE INDEX FOR JOIN (ix) ON t2.a=t1.a where t2.uq IS NULL;
 count(*)
 0
 drop table t1,t2;
@@ -1099,3 +1099,33 @@ insert into t2 values (4);
 ERROR 23000: Cannot add or update a child row: a foreign key constraint fails (Unknown error code)
 #cleanup
 drop table t2, t1;
+#
+# Bug#33317872 Incorrect Index selected leading to slower execution of queries
+#
+# When choosing between a Multi-range-read on an unique index,
+# and a range scan on an ordered index, where few rows are expected
+# to be returned, prefer the unique index.
+# (An ordered index scan has higher cost, as all fragments are scanned)
+#
+CREATE TABLE t (
+delivery_id bigint unsigned NOT NULL AUTO_INCREMENT,
+msg_id int unsigned NOT NULL,
+auth_login_type char(1) DEFAULT NULL,
+auth_login_id varchar(64),
+PRIMARY KEY (delivery_id),
+UNIQUE KEY msg_id(msg_id,auth_login_type,auth_login_id),
+KEY auth_login_id(auth_login_id)
+) ENGINE=ndbcluster;
+Table Op Msg_type Msg_text
+test.t analyze status OK
+insert into t (msg_id, auth_login_type, auth_login_id) values
+(4866, '5', '81774133'),
+(4869, '5', '81774133');
+explain
+select * from t
+where auth_login_type='5' AND auth_login_id = '81774133' AND msg_id IN (4866,4869);
+id select_type table partitions type possible_keys key key_len ref rows filtered Extra
+1 SIMPLE t p0,p1,p2,p3,p4,p5,p6,p7 range msg_id,auth_login_id msg_id 73 NULL 2 100.00 Using where with pushed condition ((`test`.`t`.`auth_login_id` = '81774133') and (`test`.`t`.`auth_login_type` = '5') and (`test`.`t`.`msg_id` in (4866,4869))); Using MRR
+Warnings:
+Note 1003 /* select#1 */ select `test`.`t`.`delivery_id` AS `delivery_id`,`test`.`t`.`msg_id` AS `msg_id`,`test`.`t`.`auth_login_type` AS `auth_login_type`,`test`.`t`.`auth_login_id` AS `auth_login_id` from `test`.`t` where ((`test`.`t`.`auth_login_id` = '81774133') and (`test`.`t`.`auth_login_type` = '5') and (`test`.`t`.`msg_id` in (4866,4869)))
+drop table t;

mysql-test/suite/ndb/r/ndb_statistics1.result

+2 -2

@@ -82,8 +82,8 @@ EXPLAIN
 SELECT * FROM t10000 AS x JOIN t10000 AS y
 ON y.i=x.i AND y.j = x.i;
 id select_type table partitions type possible_keys key key_len ref rows filtered Extra
-1 SIMPLE x p0 ALL I NULL NULL NULL 10000 100.00 Parent of 2 pushed join@1; Using where with pushed condition (`test`.`x`.`I` is not null)
-1 SIMPLE y p0 ref J,I J 5 test.x.I 1 5.00 Child of 'x' in pushed join@1; Using where
+1 SIMPLE x p0 ALL I NULL NULL NULL 10000 100.00 Parent of 2 pushed join@1; Using where with pushed condition ((`test`.`x`.`I` is not null) and (`test`.`x`.`I` is not null))
+1 SIMPLE y p0 ref J,I I 10 test.x.I,test.x.I 1 100.00 Child of 'x' in pushed join@1
 Warnings:
 Note 1003 /* select#1 */ select `test`.`x`.`K` AS `K`,`test`.`x`.`I` AS `I`,`test`.`x`.`J` AS `J`,`test`.`y`.`K` AS `K`,`test`.`y`.`I` AS `I`,`test`.`y`.`J` AS `J` from `test`.`t10000` `x` join `test`.`t10000` `y` where ((`test`.`y`.`I` = `test`.`x`.`I`) and (`test`.`y`.`J` = `test`.`x`.`I`))
 EXPLAIN

mysql-test/suite/ndb/t/ndb_index_unique.test

+106 -2

@@ -553,9 +553,9 @@ from
 
 explain
 SELECT STRAIGHT_JOIN count(*) FROM
-t1 JOIN t2 ON t2.a=t1.a where t2.uq IS NULL;
+t1 JOIN t2 FORCE INDEX FOR JOIN (ix) ON t2.a=t1.a where t2.uq IS NULL;
 SELECT STRAIGHT_JOIN count(*) FROM
-t1 JOIN t2 ON t2.a=t1.a where t2.uq IS NULL;
+t1 JOIN t2 FORCE INDEX FOR JOIN (ix) ON t2.a=t1.a where t2.uq IS NULL;
 
 drop table t1,t2;
 
@@ -738,4 +738,108 @@ insert into t2 values (4);
 --echo #cleanup
 drop table t2, t1;
 
+
+--echo #
+--echo # Bug#33317872 Incorrect Index selected leading to slower execution of queries
+--echo #
+--echo # When choosing between a Multi-range-read on an unique index,
+--echo # and a range scan on an ordered index, where few rows are expected
+--echo # to be returned, prefer the unique index.
+--echo # (An ordered index scan has higher cost, as all fragments are scanned)
+--echo #
+
+CREATE TABLE t (
+delivery_id bigint unsigned NOT NULL AUTO_INCREMENT,
+msg_id int unsigned NOT NULL,
+auth_login_type char(1) DEFAULT NULL,
+auth_login_id varchar(64),
+PRIMARY KEY (delivery_id),
+UNIQUE KEY msg_id(msg_id,auth_login_type,auth_login_id),
+KEY auth_login_id(auth_login_id)
+) ENGINE=ndbcluster;
+
+disable_query_log;
+
+insert into t (msg_id, auth_login_type, auth_login_id) values
+(4866, '5', '91774132'), (4869, '5', '91884132');
+insert into t (msg_id, auth_login_type, auth_login_id) values
+(4866, '5', '91774134'), (4869, '5', '91884134');
+
+insert into t (msg_id, auth_login_type, auth_login_id) values
+(4866, '4', '91774133'), (4869, '4', '91884133');
+insert into t (msg_id, auth_login_type, auth_login_id) values
+(4866, '6', '91774133'), (4869, '6', '91884133');
+
+insert into t (msg_id, auth_login_type, auth_login_id) values
+(4865, '5', '91774133'), (4868, '5', '91774133');
+insert into t (msg_id, auth_login_type, auth_login_id) values
+(4867, '5', '91774133'), (4870, '5', '91774133');
+# ^^^ Add some more rows, none will match the query
+
+# Use these as a base for filling in more rows
+insert into t (msg_id, auth_login_type, auth_login_id) select
+msg_id - 4800, auth_login_type, auth_login_id+10 from t;
+
+insert into t (msg_id, auth_login_type, auth_login_id) select
+msg_id + 1000, auth_login_type, auth_login_id+20 from t where msg_id < 100;
+insert into t (msg_id, auth_login_type, auth_login_id) select
+msg_id + 2000, auth_login_type, auth_login_id+30 from t where msg_id < 100;
+insert into t (msg_id, auth_login_type, auth_login_id) select
+msg_id + 3000, auth_login_type, auth_login_id+40 from t where msg_id < 100;
+
+insert into t (msg_id, auth_login_type, auth_login_id) select
+msg_id + 5000, auth_login_type, auth_login_id+50 from t where msg_id < 100;
+insert into t (msg_id, auth_login_type, auth_login_id) select
+msg_id + 6000, auth_login_type, auth_login_id+60 from t where msg_id < 100;
+insert into t (msg_id, auth_login_type, auth_login_id) select
+msg_id + 7000, auth_login_type, auth_login_id+70 from t where msg_id < 100;
+
+
+##########################################
+# Statistics is asynchronous: We cannot predict when it
+# is available in the statistics cache, and thus which
+# statistics is being used for the query. Thus we
+# need to cheat:
+#
+# We 'analyze' before the rows to be selected are inserted.
+# -> statistics will predictably contain 0 rows to be returned,
+# which is 'rounded' up to 2 rows, the lowest number of
+# rows allowed to be estimated from non-unique indexes.
+#
+# Note that the estimated 2 rows is actually what _is_ in the table
+# as well, when we later insert these rows below!
+
+analyze table t;
+enable_query_log;
+
+##########################################
+# Then insert the rows we want returned, without them being part of the statistics
+insert into t (msg_id, auth_login_type, auth_login_id) values
+(4866, '5', '81774133'),
+(4869, '5', '81774133');
+# ^^^^ Rows we want from the query
+
+# Optimizer trace, for debugging or understanding of the test case + bug:
+#
+# If enabled, we will find that the optimizer first calculates the range-access
+# cost, using the handler ::cost methods, and correctly finds the msg_id index to
+# have the lowest cost.
+# Possible REF accesses are then investigated and estimated: However, these
+# cost estimates were 'pre-patch' based on an entirely different page/cache
+# cost metric, not comparable with the former. A lower cost
+# was found for using REF access on the auth_login_id index, incorrectly causing
+# this index to be used for the table access.
+#
+#SET optimizer_trace_max_mem_size=1048576; # 1MB
+#SET end_markers_in_json=on;
+#SET optimizer_trace="enabled=on,one_line=off";
+
+explain
+select * from t
+where auth_login_type='5' AND auth_login_id = '81774133' AND msg_id IN (4866,4869);
+
+#SELECT * FROM information_schema.optimizer_trace;
+
+drop table t;
+
 # end of tests
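The optimizer-trace statements are deliberately kept commented out in the test above.
To inspect the cost comparison manually, they can be enabled in an interactive session
along these lines (a sketch only; the statements are the commented-out ones from the
test, and the final 'enabled=off' cleanup is an assumption):

  SET optimizer_trace_max_mem_size=1048576;
  SET end_markers_in_json=on;
  SET optimizer_trace="enabled=on,one_line=off";
  explain select * from t
  where auth_login_type='5' AND auth_login_id = '81774133' AND msg_id IN (4866,4869);
  SELECT * FROM information_schema.optimizer_trace;
  SET optimizer_trace="enabled=off";

The trace then shows the range-access cost calculated for the msg_id index being compared
against the REF-access cost on auth_login_id, which pre-patch was based on a different
cost metric.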

sql/ha_ndbcluster.cc

+110

@@ -8322,6 +8322,116 @@ double ha_ndbcluster::scan_time()
   DBUG_RETURN(res);
 }
 
+/**
+  read_time() needs to differentiate between single-row type lookups,
+  and accesses where an ordered index needs to be scanned.
+  The latter will need to scan all fragments, which might be
+  significantly more expensive - imagine a deployment with hundreds
+  of partitions.
+*/
+double ha_ndbcluster::read_time(uint index, uint ranges, ha_rows rows) {
+  DBUG_ENTER("ha_ndbcluster::read_time()");
+  assert(rows > 0);
+  assert(ranges > 0);
+  assert(rows >= ranges);
+
+  const NDB_INDEX_TYPE index_type =
+      (index < MAX_KEY)    ? get_index_type(index)
+      : (index == MAX_KEY) ? PRIMARY_KEY_INDEX  // Hidden primary key
+                           : UNDEFINED_INDEX;   // -> worst index
+
+  // fanout_factor is intended to compensate for the number
+  // of roundtrips between API <-> data node and between the data nodes
+  // themselves for the different index types. As an initial guess
+  // we assume a single full roundtrip for each 'range'.
+  double fanout_factor;
+
+  /**
+   * Note that for now we use the default handler cost estimate
+   * 'rows2double(ranges + rows)' as the baseline - even if it
+   * might have some obvious flaws. For now it is more important
+   * to get the relative cost between PK/UQ and ordered index scan
+   * more correct. It is also a matter of not changing too many
+   * existing MTR tests. (and customer queries as well!)
+   *
+   * We also estimate the same cost for a request roundtrip as
+   * for returning a row. Thus the baseline cost 'ranges + rows'.
+   */
+  if (index_type == PRIMARY_KEY_INDEX) {
+    assert(index == table->s->primary_key);
+    // Need a full roundtrip for each row
+    fanout_factor = 1.0 * rows2double(rows);
+  } else if (index_type == UNIQUE_INDEX) {
+    // Need to lookup first on UQ, then on PK, + lock/unlock
+    fanout_factor = 2.0 * rows2double(rows);
+
+  } else if (rows > ranges ||
+             index_type == ORDERED_INDEX ||
+             index_type == UNDEFINED_INDEX) {
+    // Assume we need a range scan
+
+    // TODO: - Handler call needs a parameter specifying whether the
+    //         key was fully specified or not (-> scan or lookup)
+    //       - The range scan could be pruned -> lower cost, or
+    //       - The scan may need to be 'ordered' -> higher cost.
+    //       - Returning multiple rows per range has a lower
+    //         per-row cost?
+    const uint fragments_to_scan =
+        m_table->getFullyReplicated() ? 1 : m_table->getPartitionCount();
+
+    // The range scan does one API -> TC request, which scales out the
+    // requests to all fragments. Assume a somewhat (*0.5) lower cost
+    // for these requests, as they are not full roundtrips back to the API.
+    fanout_factor = (double)ranges * (1.0 + ((double)fragments_to_scan * 0.5));
+
+  } else {
+    assert(rows == ranges);
+
+    // Assume a set of PK/UQ single-row lookups.
+    // We assume the hash key is used for a direct lookup.
+    if (index_type == PRIMARY_KEY_ORDERED_INDEX) {
+      assert(index == table->s->primary_key);
+      fanout_factor = (double)ranges * 1.0;
+    } else {
+      assert(index_type == UNIQUE_ORDERED_INDEX);
+      // Unique key access has a higher cost than PK. Need to first
+      // lookup in the index, then use that to lookup the row + lock & unlock.
+      fanout_factor = (double)ranges * 2.0;  // Assume twice as many roundtrips
+    }
+  }
+  DBUG_RETURN(fanout_factor + rows2double(rows));
+}
+
+/**
+ * Estimate the cost for reading the specified number of rows,
+ * using 'index'. Note that there is no such thing as a 'page' read
+ * in ha_ndbcluster. Unfortunately, the optimizer makes some
+ * assumptions about an underlying page-based storage engine,
+ * which explains the name.
+ *
+ * In the NDB implementation we simply ignore the 'page', and
+ * calculate it as any other read_cost().
+ */
+double ha_ndbcluster::page_read_cost(uint index, double rows) {
+  DBUG_ENTER("ha_ndbcluster::page_read_cost()");
+  DBUG_RETURN(read_cost(index, 1, rows).total_cost());
+}
+
+/**
+ * Estimate the upper cost for reading rows in a seek-and-read fashion.
+ * The calculation is based on the worst index we can find for this table,
+ * such that any other better way of reading the rows will be preferred.
+ *
+ * Note that worst_seek will be compared against page_read_cost().
+ * Thus, it needs to calculate the cost using comparable 'metrics'.
+ */
+double ha_ndbcluster::worst_seek_times(double reads) {
+  // Specifying 'UNDEFINED_INDEX' is a special case in read_time(),
+  // where the cost for the most expensive/worst index will be calculated.
+  const uint undefined_index = MAX_KEY + 1;
+  return page_read_cost(undefined_index, std::max(reads, 1.0));
+}
+
 /*
   Convert MySQL table locks into locks supported by Ndb Cluster.
   Note that MySQL Cluster does currently not support distributed
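Under the same illustrative assumptions as in the commit message (128 fragments, and
ignoring any constant scaling the default read_cost() applies on top of read_time()),
the new worst-seek estimate can be traced as: worst_seek_times(reads) ->
page_read_cost(MAX_KEY+1, reads) -> read_cost(MAX_KEY+1, 1, reads) -> read_time() with
UNDEFINED_INDEX, which takes the range-scan branch with ranges = 1, giving roughly
1 * (1 + 0.5 * 128) + reads = 65 + reads. So the 'worst' index estimate for NDB now
also reflects the fragment count, keeping it comparable with the ordered-index scan
costs it is weighed against.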

sql/ha_ndbcluster.h

+3

@@ -308,6 +308,9 @@ class ha_ndbcluster: public handler, public Partition_handler
   const char* index_type(uint key_number);
 
   double scan_time();
+  double read_time(uint index, uint ranges, ha_rows rows);
+  double page_read_cost(uint index, double rows);
+  double worst_seek_times(double reads);
   ha_rows records_in_range(uint inx, key_range *min_key, key_range *max_key);
   void start_bulk_insert(ha_rows rows);
   int end_bulk_insert();

sql/handler.cc

+19

@@ -5947,6 +5947,25 @@ int ha_binlog_end(THD* thd)
   return 0;
 }
 
+double handler::page_read_cost(uint index,
+                               double reads) {
+  return table->cost_model()->page_read_cost(reads);
+
+  /////////////////
+  // Other, non-page-based storage engines may prefer to
+  // override this to:
+  //   return read_cost(index, 1, reads).total_cost();
+
+  // Longer term: We should avoid mixed usage of read_cost()
+  // and page_read_cost() from the optimizer. Use only
+  // one of these to get cost estimates comparable between different
+  // access methods and call paths.
+}
+
+double handler::worst_seek_times(double reads) {
+  return table->cost_model()->page_read_cost(reads);
+}
+
 /**
   Calculate cost of 'index only' scan for given index and number of records

sql/handler.h

+45

@@ -2582,6 +2582,51 @@ class handler :public Sql_alloc
 
   virtual Cost_estimate read_cost(uint index, double ranges, double rows);
 
+  /**
+    Cost estimate for doing a number of non-sequential accesses
+    against the storage engine. Such accesses can be either a number
+    of rows to read, or a number of disk pages to access.
+    Each handler implementation is free to interpret that as best
+    suited, depending on what is the dominating cost for that
+    storage engine.
+
+    This method is mainly provided as a temporary workaround for
+    bug#33317872, where we fix problems caused by calling
+    Cost_model::page_read_cost() directly from the optimizer.
+    That should be avoided, as it introduced an assumption about all
+    storage engines being disk-page based, and having a 'page' cost.
+    Furthermore, this page cost was even compared against read_cost(),
+    which was computed with an entirely different algorithm, and thus
+    could not be compared.
+
+    The default implementation still calls Cost_model::page_read_cost(),
+    thus behaving just as before. However, handler implementations may
+    override it to call handler::read_cost() instead, which probably
+    will be more correct. (If a page_read_cost should be included
+    in the cost estimate, that should preferably be done inside
+    each read_cost() implementation.)
+
+    Longer term we should consider removing all page_read_cost()
+    usage from the optimizer itself, making this method obsolete.
+
+    @param index the index number
+    @param reads the number of accesses being made
+
+    @returns the estimated cost
+  */
+  virtual double page_read_cost(uint index, double reads);
+
+  /**
+    Provide an upper cost-limit of doing a specified number of
+    seek-and-read key lookups. This needs to be comparable and
+    calculated with the same 'metric' as page_read_cost().
+
+    @param reads the number of rows read in the 'worst' case.
+
+    @returns the estimated cost
+  */
+  virtual double worst_seek_times(double reads);
+
   /**
     Return an estimate on the amount of memory the storage engine will
     use for caching data in memory. If this is unknown or the storage
