Skip to content

Commit 6de594a

Browse files
author
Joao Gramacho
committed
WL#10406: Improve usability when receiver thread is waiting
for disk space Step 1 ====== This patch replaced the requirement to access Master_info format description event from Master_info->data_lock to relay_log->LOCK_log. It also changed the locking order at queue_event() to relay-log->LOCK_log, Master_info->data_lock. It made the SQL thread not rely on the relay log LOCK_log to be stopped anymore. Step 2 ====== Truncating the relay log in correct event boundaries ---------------------------------------------------- This patch introduced the MYSQL_BIN_LOG::truncate_relaylog_file(). This function is called after errors writing events to the relay log, passing the relay log end pos (the end of the last known successfully written event) to minimize the possibility of the applier thread to read a partial (bad) event. Displaying "Waiting for disk space" on status --------------------------------------------- This patch introduced enter_stage_hook to make my_write() able to set the current thread stage as "Waiting for disk space" before calling wait_for_free_space() function and restoring the previous thread stage after the function call. This will make any thread waiting for disk space on my_write() to report this information not only in error logs but also in thread status interfaces (performance schema tables, SHOW SLAVE STATUS, etc.). WL related bug fixes ==================== BUG#26111422 ASSERTION `IS_OPEN()' FAILED AT MYSQL_BIN_LOG::TRUNCATE_RELAYLOG_FILE Problem ------- The replica server is trying to truncate a closed relay log file when an unrecoverable error occurred while rotating the relay log. Analysis -------- In the case of an unrecoverable error when rotating the relay log, the server will take the configured BINLOG_ERROR_ACTION. When BINLOG_ERROR_ACTION=ABORT_SERVER, the server will be shutdown. When BINLOG_ERROR_ACTION=IGNORE_ERROR, the server will close the relay log. The only way of recovering the closed relay log is to restart the whole server. The code at truncate_relaylog_file() is asserting that the relay log was opened when called to prevent trying to truncate a closed relay log. Fix --- Because of the possibility of calling the function after an error rotating the relay log, the truncate_relaylog_file() function should not assert that the relay log is open and also should take no action when the relay log was closed. BUG#26161405 EXECUTING STOP SLAVE WHEN IO_THREAD IS "WAITING FOR DISK SPACE" CAUSES PROBLEMS Problems ------- STOP SLAVE [IO_THREAD] will set mi->abort_slave flag and will wait until the I/O thread to be stopped. When the I/O thread is waiting for disk space, the mi->abort_slave signal will not be checked by the I/O thread until finishing queuing the current event. So, "STOP SLAVE" will be blocked (until STOP SLAVE timeout with an error). Also, any thread waiting for disk space at "my_write" (thread that used the MY_WAIT_IF_FULL flag) could report itself as "Waiting for disk space". Shutting down the server while having an I/O thread waiting for disk space would hang the server without accepting new connections until disk space be freed. Fixes ----- STOP SLAVE [IO_THREAD] throws a warning message into the server error log recommending either to free some disk space or to use 'KILL' to abort I/O thread operation. Only the relay log related operations will change the thread status to "Waiting for disk space". A new flag was used to signal the my_write function to change the thread status. Shutting down the server while having an I/O thread waiting for disk space will make the I/O thread to be killed, truncating the current relay log file if possible. Fixed a doxygen issue at MYSQL_BIN_LOG::truncate_relaylog_file(). Fixed an issue in "performance_schema.threads" that was not showing "Waiting for disk space" at PROCESSLIST_STATE field.
1 parent b01e3d9 commit 6de594a

29 files changed

+1268
-157
lines changed

include/my_sys.h

Lines changed: 13 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -112,7 +112,7 @@ C_MODE_START
112112
#define MY_WAIT_IF_FULL 32 /* Wait and try again if disk full error */
113113
#define MY_IGNORE_BADFD 32 /* my_sync: ignore 'bad descriptor' errors */
114114
#define MY_SYNC_DIR 8192 /* my_create/delete/rename: sync directory */
115-
#define MY_UNUSED 64 /* Unused (was support for RAID) */
115+
#define MY_REPORT_WAITING_IF_FULL 64 /* my_write: set status as waiting */
116116
#define MY_FULL_IO 512 /* For my_read - loop intil I/O is complete */
117117
#define MY_DONT_CHECK_FILESIZE 128 /* Option to init_io_cache() */
118118
#define MY_LINK_WARNING 32 /* my_redel() gives warning if links */
@@ -268,6 +268,18 @@ extern void (*exit_cond_hook)(void *opaque_thd,
268268
const char *src_file,
269269
int src_line);
270270

271+
extern void (*enter_stage_hook)(void *opaque_thd,
272+
const PSI_stage_info *new_stage,
273+
PSI_stage_info *old_stage,
274+
const char *src_function,
275+
const char *src_file,
276+
int src_line);
277+
278+
/*
279+
Hook for setting THD waiting_for_disk_space flag.
280+
*/
281+
extern void (*set_waiting_for_disk_space_hook)(void *opaque_thd,
282+
bool waiting);
271283
/*
272284
Hook for checking if the thread has been killed.
273285
*/
Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
CALL mtr.add_suppression('Disk is full writing');
2+
CALL mtr.add_suppression('Retry in 60 secs');
3+
CREATE TABLE t1 (c1 INT, c2 LONGTEXT) ENGINE=MyISAM;
4+
INSERT INTO t1 VALUES (1, REPEAT("a", 8192));
5+
INSERT INTO t1 VALUES (2, REPEAT("b", 8192));
6+
INSERT INTO t1 VALUES (3, REPEAT("c", 8192));
7+
# Adding debug point 'force_wait_for_disk_space' to @@GLOBAL.debug
8+
REPAIR TABLE t1;
9+
include/assert.inc [REPAIR TABLE PROCESSLIST_STATE must not be "Waiting for disk space"]
10+
# Removing debug point 'force_wait_for_disk_space' from @@GLOBAL.debug
11+
Table Op Msg_type Msg_text
12+
test.t1 repair status OK
13+
DROP TABLE t1;
Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
include/master-slave.inc
2+
Warnings:
3+
Note #### Sending passwords in plain text without SSL/TLS is extremely insecure.
4+
Note #### Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
5+
[connection master]
6+
CREATE TABLE t1 (c1 INT);
7+
INSERT INTO t1 (c1) VALUES (1);
8+
include/sync_slave_sql_with_master.inc
9+
CALL mtr.add_suppression('coordinator and worker threads are stopped');
10+
SET @@GLOBAL.debug="+d,pause_on_queue_event_after_write_buffer";
11+
[connection master]
12+
INSERT INTO t1 (c1) VALUES (2);
13+
[connection slave]
14+
SET DEBUG_SYNC="now WAIT_FOR receiver_reached_pause_on_queue_event";
15+
include/assert.inc [Consulting SHOW SLAVE STATUS was successful]
16+
include/stop_slave_sql.inc
17+
include/start_slave_sql.inc
18+
SET @@GLOBAL.debug="-d,pause_on_queue_event_after_write_buffer";
19+
SET DEBUG_SYNC="now SIGNAL receiver_continue_queuing_event";
20+
[connection master]
21+
DROP TABLE t1;
22+
include/rpl_end.inc
Lines changed: 58 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,58 @@
1+
include/master-slave.inc
2+
Warnings:
3+
Note #### Sending passwords in plain text without SSL/TLS is extremely insecure.
4+
Note #### Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
5+
[connection master]
6+
CREATE TABLE t1(a INT);
7+
INSERT INTO t1 VALUES(1);
8+
include/sync_slave_sql_with_master.inc
9+
CALL mtr.add_suppression("Disk is full writing");
10+
CALL mtr.add_suppression("Retry in 60 secs");
11+
CALL mtr.add_suppression("Relay log write failure");
12+
CALL mtr.add_suppression("Recovery from master pos");
13+
CALL mtr.add_suppression("Waiting until I/O thread .* finish writing to disk before stopping");
14+
#
15+
# Step 1 - Monitor and kill I/O thread while waiting for disk space
16+
#
17+
# Adding debug point 'simulate_io_thd_wait_for_disk_space' to @@GLOBAL.debug
18+
[connection master]
19+
INSERT INTO t1 VALUES(2);
20+
[connection slave]
21+
include/wait_for_slave_param.inc [Slave_IO_State]
22+
include/assert.inc [I/O thread PROCESSLIST_STATE should be Waiting for disk space at performance_schema.threads]
23+
include/assert.inc [I/O thread STATE should be Waiting for disk space at information_schema.processlist]
24+
[connection slave1]
25+
STOP SLAVE IO_THREAD;
26+
[connection slave]
27+
SET DEBUG_SYNC="now WAIT_FOR reached_stopping_io_thread";
28+
KILL IO_THREAD;
29+
[connection slave1]
30+
[connection slave]
31+
include/wait_for_slave_io_error.inc [errno=1595]
32+
# Removing debug point 'simulate_io_thd_wait_for_disk_space' from @@GLOBAL.debug
33+
include/assert_grep.inc [Found the disk full error message on the slave]
34+
include/assert_grep.inc [Found the relay log truncate message on the slave]
35+
include/start_slave_io.inc
36+
[connection master]
37+
include/sync_slave_sql_with_master.inc
38+
#
39+
# Step 2 - Restart the slave server while I/O thread is waiting for disk space
40+
#
41+
# Adding debug point 'simulate_io_thd_wait_for_disk_space' to @@GLOBAL.debug
42+
[connection master]
43+
INSERT INTO t1 VALUES(3);
44+
INSERT INTO t1 VALUES(4);
45+
INSERT INTO t1 VALUES(5);
46+
INSERT INTO t1 VALUES(6);
47+
INSERT INTO t1 VALUES(7);
48+
INSERT INTO t1 VALUES(8);
49+
INSERT INTO t1 VALUES(9);
50+
[connection slave]
51+
include/wait_for_slave_param.inc [Slave_IO_State]
52+
include/stop_slave_sql.inc
53+
include/rpl_restart_server.inc [server_number=2 parameters: --relay_log_recovery]
54+
include/start_slave.inc
55+
include/assert_grep.inc [Found no warning messages about I/O thread waiting before stopping]
56+
[connection master]
57+
DROP TABLE t1;
58+
include/rpl_end.inc
Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,30 @@
1+
include/master-slave.inc
2+
Warnings:
3+
Note #### Sending passwords in plain text without SSL/TLS is extremely insecure.
4+
Note #### Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
5+
[connection master]
6+
CREATE TABLE t1 (c1 INT, c2 TEXT);
7+
include/sync_slave_sql_with_master.inc
8+
CALL mtr.add_suppression("Disk is full writing");
9+
CALL mtr.add_suppression("Retry in 60 secs");
10+
CALL mtr.add_suppression("Relay log write failure");
11+
CALL mtr.add_suppression("Could not parse relay log event entry.");
12+
CALL mtr.add_suppression("Turning logging off for the whole duration of the MySQL server process");
13+
CALL mtr.add_suppression("Recovery from master pos");
14+
CALL mtr.add_suppression("Error reading relay log event for channel");
15+
CALL mtr.add_suppression("next log error");
16+
CALL mtr.add_suppression("The slave coordinator and worker threads are stopped");
17+
CALL mtr.add_suppression("Error writing master configuration");
18+
CALL mtr.add_suppression("Failed to flush master info file");
19+
# Adding debug point 'simulate_random_io_thd_wait_for_disk_space' to @@GLOBAL.debug
20+
# Generating events to be replicated to the slave
21+
[connection master]
22+
# Waiting until slave I/O thread synced with master
23+
[connection slave]
24+
# Removing debug point 'simulate_random_io_thd_wait_for_disk_space' from @@GLOBAL.debug
25+
[connection master]
26+
include/sync_slave_sql_with_master.inc
27+
include/diff_tables.inc [master:t1, slave:t1]
28+
[connection master]
29+
DROP TABLE t1;
30+
include/rpl_end.inc
Lines changed: 63 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,63 @@
1+
# ==== Purpose ====
2+
#
3+
# Check that the execution of SHOW SLAVE STATUS command is not blocked when IO
4+
# thread is blocked writing an event to the relay log.
5+
#
6+
# ==== Implementation ====
7+
#
8+
# Simulate a scenario where IO thread is waiting while writing into the relay
9+
# log, holding LOCK_log. Execute SHOW SLAVE STATUS command after IO thread is
10+
# blocked writing an event. The command should not be blocked.
11+
#
12+
# ==== References ====
13+
#
14+
# WL#10406: Improve usability when receiver thread is waiting for disk space
15+
# BUG#21753696: MAKE SHOW SLAVE STATUS NON BLOCKING IF IO THREAD WAITS FOR
16+
# DISK SPACE
17+
# BUG#25609448: STOP SLAVE IS ACQUIRING THE RELAY LOG->LOG_LOCK EVEN TO STOP
18+
# ONLY THE SQL THREAD
19+
#
20+
--source include/have_debug.inc
21+
--source include/have_debug_sync.inc
22+
--source include/master-slave.inc
23+
24+
# Generate events to be replicated to the slave
25+
CREATE TABLE t1 (c1 INT);
26+
INSERT INTO t1 (c1) VALUES (1);
27+
--let $master_file= query_get_value(SHOW MASTER STATUS, File, 1)
28+
--source include/sync_slave_sql_with_master.inc
29+
30+
# Suppression of error messages
31+
CALL mtr.add_suppression('coordinator and worker threads are stopped');
32+
33+
# Set the debug option that will simulate I/O thread blocked writing
34+
SET @@GLOBAL.debug="+d,pause_on_queue_event_after_write_buffer";
35+
36+
# Generate events to be replicated to the slave
37+
--source include/rpl_connection_master.inc
38+
INSERT INTO t1 (c1) VALUES (2);
39+
40+
--source include/rpl_connection_slave.inc
41+
# Wait until IO thread is queuing events from master
42+
SET DEBUG_SYNC="now WAIT_FOR receiver_reached_pause_on_queue_event";
43+
44+
# Notice that this is performed by querying SHOW SLAVE STATUS
45+
--let $master_log_file= query_get_value(SHOW SLAVE STATUS, Master_Log_File, 1)
46+
--let $assert_cond= "$master_log_file" = "$master_file"
47+
--let $assert_text= Consulting SHOW SLAVE STATUS was successful
48+
--source include/assert.inc
49+
50+
# We can even stop SQL thread
51+
--source include/stop_slave_sql.inc
52+
53+
# And we also can start SQL thread again
54+
--source include/start_slave_sql.inc
55+
56+
# Remove debug instrumentation and let the I/O thread to continue
57+
SET @@GLOBAL.debug="-d,pause_on_queue_event_after_write_buffer";
58+
SET DEBUG_SYNC="now SIGNAL receiver_continue_queuing_event";
59+
60+
# Cleanup
61+
--source include/rpl_connection_master.inc
62+
DROP TABLE t1;
63+
--source include/rpl_end.inc
Lines changed: 168 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,168 @@
1+
# ==== Purpose ====
2+
#
3+
# Check that the execution of SHOW SLAVE STATUS command is not blocked when IO
4+
# thread is blocked waiting for disk space.
5+
#
6+
# ==== Implementation ====
7+
#
8+
# There are two steps on this test case:
9+
#
10+
# 1st Step
11+
#
12+
# Simulate a scenario where IO thread is waiting for disk space while writing
13+
# into the relay log. Execute SHOW SLAVE STATUS command after IO thread is
14+
# blocked waiting for space. The command should not be blocked.
15+
#
16+
# 2nd Step
17+
#
18+
# The slave server should respond promptly when asked to shutdown. No warnings
19+
# about the "STOP SLAVE" waiting for disk space should be generated.
20+
#
21+
# ==== References ====
22+
#
23+
# WL#10406: Improve usability when receiver thread is waiting for disk space
24+
# Bug#21753696: MAKE SHOW SLAVE STATUS NON BLOCKING IF IO THREAD WAITS FOR
25+
# DISK SPACE
26+
#
27+
###############################################################################
28+
--source include/have_debug.inc
29+
--source include/master-slave.inc
30+
31+
# Generate events to be replicated to the slave
32+
CREATE TABLE t1(a INT);
33+
INSERT INTO t1 VALUES(1);
34+
--source include/sync_slave_sql_with_master.inc
35+
36+
# Those errors will only happen in the slave
37+
CALL mtr.add_suppression("Disk is full writing");
38+
CALL mtr.add_suppression("Retry in 60 secs");
39+
CALL mtr.add_suppression("Relay log write failure");
40+
CALL mtr.add_suppression("Recovery from master pos");
41+
CALL mtr.add_suppression("Waiting until I/O thread .* finish writing to disk before stopping");
42+
43+
--echo #
44+
--echo # Step 1 - Monitor and kill I/O thread while waiting for disk space
45+
--echo #
46+
47+
--let $io_id=`SELECT PROCESSLIST_ID FROM performance_schema.threads WHERE NAME = 'thread/sql/slave_io'`
48+
49+
# Set the debug option that will simulate disk full
50+
--let $debug_point=simulate_io_thd_wait_for_disk_space
51+
--source include/add_debug_point.inc
52+
53+
# Generate events to be replicated to the slave
54+
--source include/rpl_connection_master.inc
55+
INSERT INTO t1 VALUES(2);
56+
57+
--source include/rpl_connection_slave.inc
58+
# Wait until IO thread is waiting for disk space
59+
# Notice that this is performed by querying SHOW SLAVE STATUS
60+
--let $slave_param= Slave_IO_State
61+
--let $slave_param_value= Waiting for disk space
62+
--source include/wait_for_slave_param.inc
63+
64+
# performance_schema.threads and information_schema.processlist checks
65+
--let $io_state=`SELECT PROCESSLIST_STATE FROM performance_schema.threads WHERE NAME = 'thread/sql/slave_io'`
66+
--let $assert_text= I/O thread PROCESSLIST_STATE should be Waiting for disk space at performance_schema.threads
67+
--let $assert_cond= "$io_state" = "Waiting for disk space"
68+
--source include/assert.inc
69+
70+
--let $io_state=`SELECT STATE FROM information_schema.processlist WHERE ID = $io_id`
71+
--let $assert_text= I/O thread STATE should be Waiting for disk space at information_schema.processlist
72+
--let $assert_cond= "$io_state" = "Waiting for disk space"
73+
--source include/assert.inc
74+
75+
# Get the relay log file name, also using SHOW SLAVE STATUS
76+
--let $relay_log_file= query_get_value(SHOW SLAVE STATUS, Relay_Log_File, 1)
77+
78+
# STOP SLAVE IO_THREAD should be blocked
79+
--source include/rpl_connection_slave1.inc
80+
--send STOP SLAVE IO_THREAD
81+
82+
# But the KILL on it should unblock it
83+
--source include/rpl_connection_slave.inc
84+
SET DEBUG_SYNC="now WAIT_FOR reached_stopping_io_thread";
85+
--replace_result $io_id IO_THREAD
86+
--eval KILL $io_id
87+
88+
--source include/rpl_connection_slave1.inc
89+
--reap
90+
91+
--source include/rpl_connection_slave.inc
92+
--let $slave_io_errno= convert_error(ER_SLAVE_RELAY_LOG_WRITE_FAILURE)
93+
--source include/wait_for_slave_io_error.inc
94+
95+
# Restore the debug options to "simulate" freed space on disk
96+
--source include/remove_debug_point.inc
97+
98+
# There should be a message in the error log of the slave stating
99+
# that it was waiting for space to write on the relay log.
100+
--let $assert_file=$MYSQLTEST_VARDIR/log/mysqld.2.err
101+
# Grep only after the message that the I/O thread has started
102+
--let $assert_only_after= Slave I/O .* connected to master .*replication started in log .* at position
103+
--let $assert_count= 1
104+
--let $assert_select=Disk is full writing .*$relay_log_file.* No space left on device
105+
--let $assert_text= Found the disk full error message on the slave
106+
--source include/assert_grep.inc
107+
108+
# There should be a message in the error log of the slave stating
109+
# that it truncate the relay log file.
110+
--let $assert_file=$MYSQLTEST_VARDIR/log/mysqld.2.err
111+
# Grep only after the message that the I/O thread has started
112+
--let $assert_only_after= Slave I/O .* connected to master .*replication started in log .* at position
113+
--let $assert_count= 1
114+
--let $assert_select= Relaylog file .* size was .* but was truncated at
115+
--let $assert_text= Found the relay log truncate message on the slave
116+
--source include/assert_grep.inc
117+
118+
# Start the I/O thread to let the slave to sync
119+
--source include/start_slave_io.inc
120+
--source include/rpl_connection_master.inc
121+
--source include/sync_slave_sql_with_master.inc
122+
123+
--echo #
124+
--echo # Step 2 - Restart the slave server while I/O thread is waiting for disk space
125+
--echo #
126+
127+
# Set the debug option that will simulate disk full
128+
--let $debug_point=simulate_io_thd_wait_for_disk_space
129+
--source include/add_debug_point.inc
130+
131+
# Generate events to be replicated to the slave
132+
--source include/rpl_connection_master.inc
133+
INSERT INTO t1 VALUES(3);
134+
INSERT INTO t1 VALUES(4);
135+
INSERT INTO t1 VALUES(5);
136+
INSERT INTO t1 VALUES(6);
137+
INSERT INTO t1 VALUES(7);
138+
INSERT INTO t1 VALUES(8);
139+
INSERT INTO t1 VALUES(9);
140+
141+
--source include/rpl_connection_slave.inc
142+
# Wait until IO thread is waiting for disk space
143+
# Notice that this is performed by querying SHOW SLAVE STATUS
144+
--let $slave_param= Slave_IO_State
145+
--let $slave_param_value= Waiting for disk space
146+
--source include/wait_for_slave_param.inc
147+
148+
# Restart the slave with relay log recovery
149+
--source include/stop_slave_sql.inc
150+
--let $rpl_server_number= 2
151+
--let $rpl_server_parameters= --relay_log_recovery
152+
--source include/rpl_restart_server.inc
153+
--source include/start_slave.inc
154+
155+
# There should not be a message in the error log of the slave stating
156+
# that it was waiting for space to stop the I/O thread.
157+
--let $assert_file=$MYSQLTEST_VARDIR/log/mysqld.2.err
158+
# Grep only after the message that the I/O thread has started
159+
--let $assert_only_after= Slave I/O .* connected to master .*replication started in log .* at position
160+
--let $assert_count= 0
161+
--let $assert_select=Waiting until I/O thread for channel
162+
--let $assert_text= Found no warning messages about I/O thread waiting before stopping
163+
--source include/assert_grep.inc
164+
165+
# Cleanup
166+
--source include/rpl_connection_master.inc
167+
DROP TABLE t1;
168+
--source include/rpl_end.inc
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
--binlog_error_action=IGNORE_ERROR

0 commit comments

Comments
 (0)