Possible signal corruption from other code/modules + filter code for RX #213
Hello all,
I have been working on trying to stop the BLE on my IoT devices from crashing (#73). Originally I focused on catching the _recvIndex value so it didn't go too high (similar to #202 and possibly #187). However, as I mentioned in #73, I kept looking for specific causes of the runaway _recvIndex, and I found that the BLE was somehow receiving packets of data that would break the if statements meant to handle them.
The Filter Fix
Anyway, I set about writing some simple filters to catch these issues (again similar to #202). They are as follows:
In HCI.cpp I changed the while loop in poll() to this:
I read on https://microchipdeveloper.com/wireless:ble-link-layer-packet-types that advertisement packets should only be up to 37 bytes, but I managed to make a valid one with my phone that was bigger.
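As a rough illustration of the kind of RX filter described above (a sketch under assumptions, not the actual change to HCI.cpp): it assumes standard H4 (UART) framing, where an event packet is a type byte 0x04, an event code, and a one-byte parameter length, and an ACL packet is a type byte 0x02, a two-byte handle, and a two-byte little-endian data length. The names classifyRx, RxState, and kRecvBufferSize are hypothetical, and the buffer size is an assumed value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// H4 (UART) packet indicators for the two packet kinds a host receives.
constexpr uint8_t kAclDataPkt = 0x02;
constexpr uint8_t kEventPkt   = 0x04;

// Assumed buffer capacity for this sketch; the library's real size may differ.
constexpr std::size_t kRecvBufferSize = 258;

enum class RxState { NeedMore, Complete, Corrupt };

// Classify the first `len` bytes sitting in a receive buffer.
// Hypothetical helper: it only checks framing, not packet contents.
RxState classifyRx(const uint8_t* buf, std::size_t len) {
  if (len == 0) return RxState::NeedMore;
  if (len > kRecvBufferSize) return RxState::Corrupt;  // index ran past the buffer

  if (buf[0] == kEventPkt) {
    if (len < 3) return RxState::NeedMore;              // header not complete yet
    const std::size_t total = 3 + static_cast<std::size_t>(buf[2]);
    if (len > total) return RxState::Corrupt;           // read past the declared end
    return len == total ? RxState::Complete : RxState::NeedMore;
  }

  if (buf[0] == kAclDataPkt) {
    if (len < 5) return RxState::NeedMore;              // header not complete yet
    const std::size_t dataLen = static_cast<std::size_t>(buf[3]) |
                                (static_cast<std::size_t>(buf[4]) << 8);
    const std::size_t total = 5 + dataLen;
    if (total > kRecvBufferSize) return RxState::Corrupt;  // implausible length
    if (len > total) return RxState::Corrupt;
    return len == total ? RxState::Complete : RxState::NeedMore;
  }

  return RxState::Corrupt;  // unknown packet indicator: framing is lost
}
```

Called after each received byte, a Corrupt result would be the natural point to reset the receive index and resynchronise rather than let it keep growing.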
The filter catches a lot of errors. However, I noticed a similar issue to #202, where the scan would stop picking up new devices after a while. I managed to pin that down to GAP.cpp, in particular the check
if (_discoveredDevices.size() >= GAP_MAX_DISCOVERED_QUEUE_SIZE)
which was getting stuck and permanently remaining true (and therefore dropping new devices), so I did this:
This is a little crude but seems to work.
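One way to keep a full queue from blocking discovery forever is to evict the oldest entry when the limit is reached instead of rejecting the new device. The sketch below shows that idea only; it is not the change made in GAP.cpp, and DiscoveredQueue, Device, and the capacity value kMaxDiscovered are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Illustrative capacity standing in for GAP_MAX_DISCOVERED_QUEUE_SIZE.
constexpr std::size_t kMaxDiscovered = 5;

// Hypothetical stand-in for a discovered-device record.
struct Device {
  std::string address;
};

class DiscoveredQueue {
 public:
  // Add a device. When the queue is full, drop the oldest entry so the
  // new device is still accepted. Returns true if an entry was evicted.
  bool push(const Device& d) {
    bool evicted = false;
    if (devices_.size() >= kMaxDiscovered) {
      devices_.pop_front();
      evicted = true;
    }
    devices_.push_back(d);
    return evicted;
  }

  std::size_t size() const { return devices_.size(); }
  const Device& newest() const { return devices_.back(); }

 private:
  std::deque<Device> devices_;
};
```

The trade-off is that a device nobody has read out yet can be lost, but the scan can never end up permanently refusing new devices.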
Here are some logging extracts of these filters at work:
In this log, you can see instances where the RX data is mashed up together causing the code to think it is getting a really long RX or giving the wrong subevent type. Also, there is a case of the queue getting stuck. But you can also see that the code rectifies itself and is still able to detect new devices afterwards.
Although not the most elegant of solutions, these fixes seem to work, and I can leave the BLE running for a long time with lots of other stuff going on. My current project has NeoPixels, WiFi, servos, and motors, and it can still detect and connect to a new device.
However!
So this is where it gets weird: I can't replicate the errors without lots of other code on the go. In fact, I have run the following code with my adjusted HCI.cpp and GAP.cpp, calling poll at intervals of around 5-10 seconds, and it still gives clean RX data, without a single RX error so far, although it will eventually crash. A 1-second interval works fine. It does still crash with these long intervals, but this somewhat contradicts the general advice that it is the time between BLE.poll() calls that causes the issue (e.g., #130). Also, oddly, if you just call HCITransport.available() it never seems to crash. I feel like I am missing something here.
So the only conclusion I can draw is that some other piece of code is introducing some sort of noise or clash, meaning that the BLE is reading incorrectly or the signals are overlapping. I am aware of #96, but I don't think this is just slowness; the data seems corrupted somehow. Am I barking up the wrong tree here? Should I try adding other libraries until I get the corruption? Are the filters introducing the errors I am catching? I can propose this as a pull request, or remove it if it is too similar to the old stuff.
Matt