Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wifi connect failures stuck in long timeout loop, autoreconnect on PMK failure is wrong #7968

Closed
tablatronix opened this issue Mar 16, 2023 · 21 comments
Labels
Area: WiFi Issue related to WiFi

Comments

@tablatronix
Copy link
Contributor

tablatronix commented Mar 16, 2023

Board

esp32*

Device Description

devkit

Hardware Configuration

Version

v2.0.7

IDE Name

deviot

Operating System

osx

Flash frequency

40

PSRAM enabled

yes

Upload speed

921600

Description

On my Unifi APs
If I use wrong password I get this failure, instead of auth failure.
Auto reconnect gets stuck here for a very long time this will retry many many times then abort. (60s)
Similar issue with some auth protections wifi:Association refused temporarily

[MEM] free: 253092 | max: 45044 | frag:  83% 
*wm:[3] . 
[  7038][V][WiFiGeneric.cpp:362] _arduino_event_cb(): STA Disconnected: SSID: leela, BSSID: f4:92:bf:7f:35:4b, Reason: 15
[  7040][D][WiFiGeneric.cpp:931] _eventCallback(): Arduino Event: 5 - STA_DISCONNECTED
[  7047][W][WiFiGeneric.cpp:955] _eventCallback(): Reason: 15 - 4WAY_HANDSHAKE_TIMEOUT
[  7054][D][WiFiGeneric.cpp:979] _eventCallback(): WiFi AutoReconnect Running

tzapu/WiFiManager#1563

Sketch

WiFi.begin();

Debug Message

*wm:[3] . 
[  7038][V][WiFiGeneric.cpp:362] _arduino_event_cb(): STA Disconnected: SSID:xxxxx, BSSID: f4:92:bf:7f:35:4b, Reason: 15
[  7040][D][WiFiGeneric.cpp:931] _eventCallback(): Arduino Event: 5 - STA_DISCONNECTED
[  7047][W][WiFiGeneric.cpp:955] _eventCallback(): Reason: 15 - 4WAY_HANDSHAKE_TIMEOUT
[  7054][D][WiFiGeneric.cpp:979] _eventCallback(): WiFi AutoReconnect Running
[  7063][V][WiFiGeneric.cpp:97] set_esp_interface_ip(): Configuring Station static IP: 0.0.0.0, MASK: 0.0.0.0, GW: 0.0.0.0
[MEM] free: 252800 | max: 45044 | frag:  83% 
*wm:[3] [EVENT] WIFI_REASON:  15


### Other Steps to Reproduce

2.0.6
2.0.7

### I have checked existing issues, online documentation and the Troubleshooting Guide

- [X] I confirm I have checked existing issues, online documentation and Troubleshooting guide.
@tablatronix tablatronix added the Status: Awaiting triage Issue is waiting for triage label Mar 16, 2023
@tablatronix
Copy link
Contributor Author

tablatronix commented Mar 16, 2023

tzapu/WiFiManager#1563 (comment)

I have not had time to packet sniff anything

@Jason2866
Copy link
Collaborator

@tablatronix Can you try this framework (easy with Platformio)? We (Tasmota) have timeout issues in lwip (DHCP) in some rare cases. This build uses a older lwip commit (IDF). Our issues are gone when using.

platform = https://github.com/tasmota/platform-espressif32/releases/download/2023.02.00/platform-espressif32.zip
platform_packages = framework-arduinoespressif32 @ https://github.com/Jason2866/esp32-arduino-lib-builder/releases/download/1254/framework-arduinoespressif32-lwip_timeout-ed6742e7f0.zip 

@tablatronix
Copy link
Contributor Author

tablatronix commented Mar 16, 2023

@Jason2866 I will, I saw that when investigating last night but it did not sound exactly similar, Ill try it.

@tablatronix
Copy link
Contributor Author

tablatronix commented Mar 17, 2023

I did not see any change with that repo, Do I need to wipe anything to force lwip in pio?
It looks like just an include not requiring an xtensa wipe

*wm:[2] [SYS] WM version:  v2.0.15-rc.1
*wm:[2] [SYS] Arduino version:  2.0.7
*wm:[2] [SYS] ESP SDK version:  4.4.4.20230310

@tablatronix
Copy link
Contributor Author

tablatronix commented Mar 17, 2023

unifi ap logs

This could be a host issue, it should not be handling wrong password like this.. wrong pass on my iphone does the same thing.

hostapd[11877]: ra0: STA ac:67:b2:df:85:7c WPA: invalid MIC in msg 2/4 of 4-Way Handshake

Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: : wevent[11872]: wevent.ubnt_custom_event(): EVENT_STA_LEAVE ra0: ac:67:b2:df:85:7c / 21
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c IEEE 802.11: authentication OK (open system)
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c MLME: MLME-AUTHENTICATE.indication(ac:67:b2:df:85:7c, OPEN_SYSTEM)
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c MLME: MLME-DELETEKEYS.request(ac:67:b2:df:85:7c)
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c IEEE 802.11: authenticated
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: kernel: [729515.241010] CFG80211_PKT: TX AUTH Frame
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: kernel: [729515.241062] ra0: AUTH - CFG80211_AuthRespHandler
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: kernel: [729515.244338] CFG80211_AssocReqHandler(): peer support rate mode = 0x7
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: kernel: [729515.244492] ap_cmm_peer_assoc_req_action(): peer support rate mode = 0x7
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c IEEE 802.11: associated
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c WPA: event 1 notification
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c WPA: start authentication
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c IEEE 802.1X: unauthorizing port
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c WPA: sending 1/4 msg of 4-Way Handshake
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: kernel: [729515.247405] ===[APAssocRespTxSHandler]: (RE=0, LE=0, ME=0), wlan_idx = 21
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: kernel: [729515.247453] ===[APAssocRespTxSHandler]: wlan_idx = 21,AuthAssocNotInProgressFlag = 1 pEntry->bssid f4:92:bf:7f:35:4b
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: kernel: [729515.249084] ra0: Send EAPOL-Key M1, DA=ac:67:b2:df:85:7c, SA=f4:92:bf:7f:35:4b, len=113
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: kernel: [729515.253984] ra0: Recv EAPOL-Key M2, DA=f4:92:bf:7f:35:4b, SA=ac:67:b2:df:85:7c, len=129
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: : wevent[11872]: wevent.ubnt_custom_event(): EVENT_STA_JOIN ra0: ac:67:b2:df:85:7c / 21
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c WPA: received EAPOL-Key frame (2/4 Pairwise)
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: hostapd[11877]: ra0: STA ac:67:b2:df:85:7c WPA: invalid MIC in msg 2/4 of 4-Way Handshake
Mar 16 22:35:43 UAPNanoHDA-MainHouse f492bf7f354a,UAP-nanoHD-6.5.36+14567: stahtd: stahtd[11874]: [STA-TRACKER].stahtd_dump_event(): {"message_type":"STA_ASSOC_TRACKER","mac":"18:74:2e:9c:1e:5e","vap":"ra0","event_type":"fixup","assoc_status":"0","event_id":"4","arp_reply_gw_seen":"yes","dns_resp_seen":"yes","avg_rssi":"-53"}```

@Jason2866
Copy link
Collaborator

That's my version ESP SDK version: 4.4.4.20230310 so the issue is not the same and not related to the lwip changes in IDF.

@tablatronix
Copy link
Contributor Author

tablatronix commented Mar 17, 2023

It seems this is a normal PSK PMK failure, But this handling is new, so either these events were missing in previous sdk and handled by auth_fail, not sure how you can infer the difference now..

EIther way a 60s autoreconnect is too damn high.. lol

@mrengineer7777
Copy link
Collaborator

@tablatronix Interesting. Are you saying the Autoreconnect is not working, or it's too aggressive and never fails? I agree that the UniFi AP may be sending back a wrong response on a bad password.

@tablatronix
Copy link
Contributor Author

tablatronix commented Mar 17, 2023

It looks like adding autoreconnect to reason 15 was a bad idea, as bad passwords (depending on wpa config) will throw this error. Also autoreconnecting for 60s is way too long but we can pass custom timeout so not the main issue.

@tablatronix
Copy link
Contributor Author

tablatronix commented Mar 17, 2023

Conclusion
waitForConnectResult does not check for WL_CONNECTION_LOST and gets stuck for timeout

WL_CONNECTION_LOST is < WL_DISCONNECTED

@tablatronix
Copy link
Contributor Author

oh hey @mrengineer7777 you are the person that refactored this, I wasn't pulling git, thanks this needed a major cleaning up.

A couple things I am looking at atm.

  • Password failure, EAP fail should not retry
  • Not sure we can determine 4-way-handshake timeout and a deauth protection. I am packet sniffing this atm to find a solution.
  • sidenote: I am not entirely sure about always retry on firstconnect, this takes 3 extra seconds to fail

@tablatronix
Copy link
Contributor Author

The good news is that my other deauth issue # is gone now, I can connect in 300ms..

@tablatronix
Copy link
Contributor Author

tablatronix commented Mar 18, 2023

My solution atm is to fail on this reason, and it is working ok.
Will have to re-read old issues that demonstrate other causes of this reason and breaking effects

  • exclude WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT from first_connect
  • exclude WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT from autoreconnect
  • explicit WL_CONNECT_FAILED for waitforconnectresult loop
       if(reason == WIFI_REASON_ASSOC_LEAVE) {                                     //Voluntarily disconnected. Don't reconnect!
        }
        if(reason == WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT) {                                     //Voluntarily disconnected. Don't reconnect!
            WiFiSTAClass::_setStatus(WL_CONNECT_FAILED);
        }
        else if(first_connect) {                                                    //Retry once for all failure reasons

and exclude it from autoreconnect

bool WiFiGenericClass::_isReconnectableReason(uint8_t reason) {
    switch(reason) {
        case WIFI_REASON_UNSPECIFIED:
        //Timeouts (retry)
        case WIFI_REASON_AUTH_EXPIRE:
        // case WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT:
        case WIFI_REASON_GROUP_KEY_UPDATE_TIMEOUT:

@mrengineer7777
Copy link
Collaborator

@tablatronix There was quite a bit of discussion on this. Please read here: #7210.

It's been a while, but I believe WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT was one of the common failures I would see on initial connection with a good password. Thus it needs to be in the first retry. I could be convinced that it shouldn't be included in additional retries.

Honestly the whole reconnect mechanism feels kludgy and it could use a re-write. However with the PR I submitted it works "well enough" and I really don't have time to write my own WiFi library from scratch.

I developed and tested _isReconnectableReason() under arduino-esp32 v2.0.3 and v2.0.5. It's possible the behavior is different under v2.0.7. I recently read that IDF 4.4.4 changed the LWIP version and may have introduced a DHCP timeout bug.

Probably unrelated:
An API change was also introduced here: https://github.com/espressif/arduino-esp32/pull/7577/files . I don't see how it impacts the disconnect reasons. It could cause disconnects if it introduces the DHCP timeout bug.

@tablatronix
Copy link
Contributor Author

I figured those errors was due to a bug in PMF, similar to the binary change that was needed here.

espressif/esp-idf#8192

I would be curious to see if your spurious 4way timeouts are still occurring

@mrengineer7777
Copy link
Collaborator

@tablatronix That was an interesting read. I'll have to turn off autoreconnect and see what kind of connection failures I am getting. I manage our UniFi AP at work, so I can also try turning on protected management frames (PMF) and see if that is an issue. Unfortunately I'm not setup for packet capture on wireless. Best I can give you is disconnect reason codes from the Arduino core.

According to IDF release notes, PMF fix may have been added in IDF 4.4.2. Arduino-esp32 v2.0.5 includes 4.4.2. It is very possible it was fixed since I did my initial testing on v2.0.3.

Anecdotally I sometimes still see WiFi disconnects shortly after a reboot, but the autoreconnect logic gets the device right back on the network.

I'll do some testing. Ping me in 2 days if I forget.

@tablatronix
Copy link
Contributor Author

tablatronix commented Mar 22, 2023

Yeah those are usually deauth protections and it makes you wait to prevent dos, usually there is a relevant packet descriptor, at least that is my assumption, never found a whitepaper or hostapd docs on that, and its probably not in spec.

Alot of this stuff probably gets copied from cisco

@VojtechBartoska VojtechBartoska added the Area: WiFi Issue related to WiFi label Mar 22, 2023
@tablatronix tablatronix changed the title wifi connect failures stuck in long timeout loop, odd failures wifi connect failures stuck in long timeout loop, autoreconnect on PMK failure is wrong Mar 23, 2023
@tablatronix
Copy link
Contributor Author

tablatronix commented Mar 23, 2023

I updated the title to clarify this is not a problem with trying to reconnect, its a problem with WPA2 connection fails, and they should not be retried, because this is the only event response to abort the connection. Hence we get stuck in a waitforconnectresult timeout.

If there are legitimate auth or connection issues with 4-way timeouts then they should be handled special, waitFor.. timeoutLength should only be there to timeout in cases of missing events

I would revert this fix and readdress the bug it was intended to fix.

In that case perhaps another single reconnect attempt limiter inside of firstconnect

Another thing to test is does the spec or router implementations send another event after this timeout to explicitly say auth failed..? Maybe we just need to wait for that, I will see what i can find. Need a wifi engineer here...

@Jason2866 Jason2866 added Status: Needs investigation We need to do some research before taking next steps on this issue and removed Status: Awaiting triage Issue is waiting for triage labels Apr 19, 2023
@Jason2866
Copy link
Collaborator

@tablatronix Released today a new core build based on latest IDF 4.4. A lot of changes are done since last release. Maybe it fixes the issue.
Only working with Platformio!

platform = https://github.com/tasmota/platform-espressif32/releases/download/2023.06.04/platform-espressif32.zip

@tablatronix
Copy link
Contributor Author

@Jason2866 thanks, will check it out

@Parsaabasi Parsaabasi removed the Status: Needs investigation We need to do some research before taking next steps on this issue label Jan 16, 2025
@Parsaabasi
Copy link

Hello,

Due to the overwhelming volume of issues currently being addressed, we have decided to close the previously received tickets. If you still require assistance or if the issue persists, please don't hesitate to reopen the ticket.

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: WiFi Issue related to WiFi
Projects
None yet
Development

No branches or pull requests

5 participants