Skip to content

WifiClientSecure hangs on write - socket error #3460

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Onigiri-Kun opened this issue Nov 7, 2019 · 13 comments · Fixed by #4424
Closed

WifiClientSecure hangs on write - socket error #3460

Onigiri-Kun opened this issue Nov 7, 2019 · 13 comments · Fixed by #4424
Labels
Status: Stale Issue is stale stage (outdated/stuck)

Comments

@Onigiri-Kun
Copy link

Onigiri-Kun commented Nov 7, 2019

Hardware:

Board: ESP32 Dev Module
Core Installation version: ESP32 Arduino 1.0.3 (IDF 3.2) Release
IDE name: Visual Studio with Visual Micro add on
Flash Frequency: 40Mhz - tried a few and problem persists.
PSRAM enabled: no
Upload Speed: 115200
Computer OS: Windows 10

Description:

I have an ESP32 publishing code to a Mosquitto MQTT server, using the PubSubClient library.
This works fine, and publishes many short messages (around 50 float values, to one per topic) at a 10 seconds interval (50 mqtt write every 10 seconds).
I had problems with the board hanging occasionally, and managed to isolate the problem to the WebClientSecure library.

The log files of the server would give "Socket Error on the client".
The ESP32 debug log would give "Writing HTTP request", before hanging.

The point where the code hangs is the following:

int send_ssl_data(sslclient_context *ssl_client, const uint8_t *data, uint16_t len)
{
    log_v("Writing HTTP request...");  //for low level debug
    int ret = -1;

    while ((ret = mbedtls_ssl_write(&ssl_client->ssl_ctx, data, len)) <= 0) {
        if (ret != MBEDTLS_ERR_SSL_WANT_READ && ret != MBEDTLS_ERR_SSL_WANT_WRITE) {
            return handle_error(ret);
        }
    }

    len = ret;
    //log_v("%d bytes written", len);  //for low level debug
    return ret;
}

The sketch hangs in the while loop, because mbedtls_ssl_write does not return an error code.
I attribute this to some occasional write failure/socket failure, which is not handled properly.

I've tried adding some timeout code here, which does catch the problem. However, I am not able to reset the code correctly and restart successfully transmission of data without a board reset.

My question is: Can you have code to handle a timeout of the mbedtls_ssl_write?
In such a way that, if some socket error occurs, the client can restart transmission quickly?

Debug Messages:

"Writing HTTP request.."
It then hangs
@lbernstone
Copy link
Contributor

Are you checking to make sure the connection is active before you send?

@Onigiri-Kun
Copy link
Author

Sorry for the late reply.
Yes, I do. I have:

if (WiFi.status() != WL_CONNECTED) {
	Serial.println(F("trying to publish, but wifi disconnected"));
	return;
}
if (mqttclient.connected() == false) {
	Serial.println(F("trying to publish, but mqtt disconnected"));
	return;
} 

So I only send if I have wifi and mqtt connection.

After that I have:

yield();
bool result = mqttclient.publish(topic, payload, retained);

I've check logs on the mqtt broker as well, and it also records "Socket Error on the client", so I guess it's a network error that happens during the mqtt write.

@stale
Copy link

stale bot commented Mar 1, 2020

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale Issue is stale stage (outdated/stuck) label Mar 1, 2020
@stale
Copy link

stale bot commented Mar 15, 2020

[STALE_DEL] This stale issue has been automatically closed. Thank you for your contributions.

@stale stale bot closed this as completed Mar 15, 2020
@felipeapp
Copy link

felipeapp commented Apr 1, 2020

I have a similar problem here. I am also using PubSubClient library with WiFiClientSecure to publish in a local Mosquitto server. Everything works great, but sometimes when the connection with the server goes down the while in the send_ssl_data function never finishes.

After some debugging and reading the mbedtls_ssl_write documentation, I realized that this function was returning MBEDTLS_ERR_SSL_WANT_WRITE. According to mbedtls_ssl_write documentation, it is an error code and call mbedtls_ssl_write again and again in the while will not solve this apparently. The loop does not finish even when the connection is up again. So, the only alternative is restart the ESP-32.

I am not sure, but the while and the line len = ret seem to be not necessary. Here, I solved the problem by reimplementing the function like that:

int send_ssl_data(sslclient_context *ssl_client, const uint8_t *data, uint16_t len)
{
    log_v("Writing HTTP request with %d bytes...", len); //for low level debug
    int ret = -1;

    if ((ret = mbedtls_ssl_write(&ssl_client->ssl_ctx, data, len)) <= 0)
        return handle_error(ret);

    log_v("Returning with %d bytes written", ret); //for low level debug
    return ret;
}

@Onigiri-Kun
Copy link
Author

Thanks!
As far as I understand, the while does the following:

  1. if there is an error, return it
  2. if the result is MBEDTLS_ERR_SSL_WANT_READ or MBEDTLS_ERR_SSL_WANT_WRITE, it means that the mbedtls_ssl_write needs to be called again, because it did a partial write and didn't complete yet

So I think removing it might be ok most of the times, but cause some problems in some cases.

But in any case, I will follow your suggestion and try it out! Hopefully, it solved the problem for me as well.

@felipeapp
Copy link

felipeapp commented Apr 2, 2020

Thanks!
As far as I understand, the while does the following:

  1. if there is an error, return it
  2. if the result is MBEDTLS_ERR_SSL_WANT_READ or MBEDTLS_ERR_SSL_WANT_WRITE, it means that the mbedtls_ssl_write needs to be called again, because it did a partial write and didn't complete yet

So I think removing it might be ok most of the times, but cause some problems in some cases.

But in any case, I will follow your suggestion and try it out! Hopefully, it solved the problem for me as well.

You are probably right. I think that keep the while and add an attempt counter could be a better solution.

From the mbedtls_ssl_write documentation: "When this function returns MBEDTLS_ERR_SSL_WANT_WRITE/READ, it must be called later with the same arguments, until it returns a value greater that or equal to 0. When the function returns MBEDTLS_ERR_SSL_WANT_WRITE there may be some partial data in the output buffer, however this is not yet sent.".

However, I get MBEDTLS_ERR_SSL_WANT_WRITE over and over what keeps the while in a loop. It could be an issue from mbedtls library and not from WifiClientSecure library.

@Onigiri-Kun
Copy link
Author

yeah, I think you're right. It probably doesn't handles well writing when the connection had problems.
One of the things I tried was to add a timeout to the while. The problem was what to do after that timeout is triggered. I tried disconnecting and reconnecting the wifi, but it got itself into a weird state where it would crash the esp and reboot it. I will explore that further once I have some time...

@igolubic
Copy link
Contributor

This is an issue that's causing a lot of problems on our application, and it was pretty complex to track. We are publishing multiple stuff using PubSubClient (we also tried some alternatives, but they are all using WiFiClientSecure ).

Hardware:
Board: Custom board with ESP32, 16MB module
Core Installation version: ESP32 Arduino 1.0.4 (IDF 3.2) Release
IDE name: Visual Studio, with PlatformIO plugin
Flash Frequency: 40Mhz
PSRAM enabled: no
Upload Speed: 115200
Computer OS: Windows 10

Description
We were publishing some long MQTT messages to our MQTT broker using PubSubClient library. The connection is done using WiFiClientSecure and our certificates.

During operation (with verbose log) and standard publishing, the message "Writing HTTP request" is visible, but sometimes the system just hangs and triggers WDT and ESP restarts.

After decoding Backtrace the following line looked suspicious:

0x40140919: send_ssl_data(sslclient_context*, unsigned char const*, unsigned short) at C:\Users\Ivan\.platformio\packages\framework-arduinoespressif32\libraries\WiFiClientSecure\src/ssl_client.cpp:282

After more digging and logging, we had several conclusions:

  • In most cases, it will work and data will be sent without no issues
  • Sometimes, if there is some unknown connection issue (or similar) function will return MBEDTLS_ERR_SSL_WANT_WRITE error code, and per documentation, it will retry to send data until the positive number or 0 is returned.
  • Error handler will not be triggered because of this line:
    if (ret != MBEDTLS_ERR_SSL_WANT_READ && ret != MBEDTLS_ERR_SSL_WANT_WRITE) { return handle_error(ret); }
    But, from the other side, it will be stuck in while loop.
  • Sometimes it will go to the while loop and stay there for a few seconds by constantly publishing the same data, and then it will continue to work normally (obviously it takes less time than needed to trigger WDT), but the real issue is when it needs more time to handle data and that triggers WDT to reset ESP32.

Fix by @felipeapp looks promising, but I don't know if this is WiFiClientSecure issue or mbedtls .

I don't know the logic behind:

if (ret != MBEDTLS_ERR_SSL_WANT_READ && ret != MBEDTLS_ERR_SSL_WANT_WRITE) { return handle_error(ret); }

Maybe somebody can check if it is safe to remove it and implement @felipeapp solution since it looks like it solves this problem from our side also?

@igolubic
Copy link
Contributor

@me-no-dev - can you please reopen this issue?

igolubic added a commit to igolubic/arduino-esp32 that referenced this issue Aug 28, 2020
@Onigiri-Kun
Copy link
Author

@igolubic Thanks. What you wrote matches more or less what I remembered. It is definitely caused by some occasional network issue (expected). Then, MBEDTLS_ERR_SSL_WANT_WRITE gets thrown and not handled properly. As you said, it's not clear if this is WifiClientSecure or mbedtls fault.

I have mostly resolved it, on my system, by using a rasperrypi on the local network. This has an mqtt broker, and node-red flows which receive the messages, and then re-transmit them to my original broker (not on the local network).

As the ESP32-Mqtt communication happens on local network, there are no observed network errors.
And the unreliable network from my local network to the internet broker now happens with QoS 2, on the rasperrypi.
This way, it seems I don't have any interruption for weeks/months.

For sure it's not ideal, as it basically necessitate a Rasperrypi "relay" to introduce QoS 2.

The commit you pushed: Does it seem to work for you? Has it improved things?
If so, I could give it a try to (without my mqtt relay stuff of course).

@igolubic
Copy link
Contributor

@Onigiri-Kun - just created the pull request #4424. It's been working in production for at least 1 month without any issues on the public network and not a single byte was dropped.

This solution is currently used in quite a few projects by custom reference in the platformio.ini file, hopefully, it will pass to production to avoid using custom referencing.

Custom referencing:
platform_packages = framework-arduinoespressif32@https://github.com/igolubic/arduino-esp32.git#43896eb

me-no-dev pushed a commit that referenced this issue Nov 2, 2020
Fixes: #3460

This code has been run in production for 1 month and it looks stable, no data dropped and it definitely fixes the issue described. I think that this can be merged to avoid using custom package referencing in PlatformIO that has been used in quite a few projects for now.

Co-authored-by: Ivan Golubic <ivan@mvt-solutions.com>
@bhupiister
Copy link

bhupiister commented Sep 30, 2022

//This solution will try multiple times before failing. Have tested this thoroughly. Alter the retry variable to suit your need.

int send_ssl_data(sslclient_context *ssl_client, const uint8_t *data, size_t len)
{
    log_v("Writing HTTP request with %d bytes...", len); // for low level debug
    int ret = -1;
    int retry = 0;
    while ((ret = mbedtls_ssl_write(&ssl_client->ssl_ctx, data, len)) <= 0)
    {
        if ((ret != MBEDTLS_ERR_SSL_WANT_READ && ret != MBEDTLS_ERR_SSL_WANT_WRITE && ret < 0) || retry >= 100)
        {
            log_v("Trial #  %d", retry);     // for low level debug // Show number of attempts taken 
            log_v("Handling error %d", ret); // for low level debug
            return handle_error(ret);
        }
        retry++;
        // wait for space to become available
        vTaskDelay(2);
    }
    log_v("Trial #  %d", retry); // for low level debug // Show number of attempts taken 
    return ret;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Stale Issue is stale stage (outdated/stuck)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants