-
Notifications
You must be signed in to change notification settings - Fork 7.6k
Fix semaphores in IDF & std::string assert #2728
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Fixes the problem of giving a mutex from a callback with the latest IDF. Also addresses an occasional assert that happens when the btc_task callback gives the semaphore and causes an assert due to both cores potentially writing m_owner concurrently.
libraries/BLE/src/FreeRTOS.cpp
Outdated
@@ -62,22 +62,21 @@ uint32_t FreeRTOS::getTimeSinceStart() { | |||
uint32_t FreeRTOS::Semaphore::wait(std::string owner) { | |||
log_v(">> wait: Semaphore waiting: %s for %s", toString().c_str(), owner.c_str()); | |||
|
|||
m_owner = owner; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting the owner before we take the semaphore seems like maybe it can lead to incorrect behaviour. Think about this sequence of events:
- Task A calls
s->wait("task_a")
. m_owner is set to "task_a", and then waiting for the semaphore succeeds because noone was holding it. - Task B calls
s->wait("task_b")
. m_owner is set to "task_b" but then Task B is blocked in xSemaphoreWait() because Task A is still holding the semaphore.
Now we have a situation where Task A is holding the semaphore, but the owner field indicates Task B is holding it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, hadn't really thought about it as the way the BLE library currently works (on the client side anyway) there is multiple semaphores used and only one task actually takes each of them. Also, as this was a mutex before, the semaphore is usually taken already before we call the wait function to block the task and wait for the callback to give the semaphore back, at which time it sets m_owner to "<N/A>" anyway.
The reason for this change though is I've encountered many instances of on assert error when giving the semaphore from the callback as both cores sometimes try to write to m_owner at the same time, one in the give() on core 0 and this one after we've unblocked in wait() on core 1.
Anyway, maybe a better solution would be to pin the callback task to core 1, or maybe just remove the setting of m_owner in wait() all together? I'm open to suggestions :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the point of this m_owner
anyway? Logging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes as far as I can tell it’s only used for logging, but I suppose someone could call the toString() method on the semaphore to check ownership.
I’m tempted just to remove it entirely for my use, but others might find it useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, sorry I should have read the full code - I missed that wait() isn't "take", it's "take then give". Not quite what I expected...
The reason for this change though is I've encountered many instances of on assert error when giving the semaphore from the callback as both cores sometimes try to write to m_owner at the same time, one in the give() on core 0 and this one after we've unblocked in wait() on core 1.
If this is the problem then I think the best solution is to change all the m_owner assignments so they only happen when holding the semaphore, so the semaphore protects against any race condition. I think this means:
- In
::give()
, move setting ofm_owner
up to before the semaphore is given so it's assigned while the caller still holds the semaphore. - In
::wait()
, don't set m_owner at all. As soon as the caller in wait() gets the semaphore it gives it back, and giving a semaphore never blocks, so if you did set m_owner then it gets immediately un-set afterwards which seems like a waste of CPU cycles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that’s along the line of my thinking as well. Sorry I changed the PR yesterday assuming you wanted me to revert the m_owner move and I didn’t see your reply here first. I’ll make the changes as you’ve mentioned and fix the PR again lol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the changes, tested with my test code that would cause the assert within seconds and has been running over an hour now. Logging works fine as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! :)
Update esp32-hal-rmt.c
Revert previous revert commit and move setting of m_owner in ::give to before giving the semaphore to prevent race condition possibility.
Fixes the problem of giving a mutex from a callback with the latest IDF. Also addresses an occasional assert that happens when the btc_task callback gives the semaphore and causes an assert due to both cores potentially writing m_owner concurrently.