How do we make sure our system works with hundreds of devices in the same room? We test it with hundreds of devices in the same room.

Over the past year, I have been working with Blecon, a startup based in Cambridge, UK, founded by the team behind Arm Mbed OS, to make sure their new generation of Bluetooth beacons works reliably, even in the most challenging conditions.

Blecon is developing a new generation of Bluetooth beacons that both send and receive data and that can handle large data payloads. The beacons use nearby smartphones, in addition to dedicated gateways, to exchange data with the cloud. They are designed to work in environments with many devices, such as warehouses, hospitals, and smart buildings. And the system has to be reliable. Very reliable.

So we built a testbed with more than 100 devices to make sure that the system works as expected. Here is how we did it.

## Bluetooth Beacons That Talk Back

Blecon has created a new class of Bluetooth beacons that, unlike old-school Bluetooth beacons, can both send and receive data, and can handle much larger data payloads. When a Blecon beacon is in the vicinity of a smartphone, that smartphone can forward data to and from the beacon. The data is encrypted and anonymized, so the smartphone cannot read it. This means devices can communicate with the cloud even when there are no dedicated gateways nearby. The smartphone just needs to have a Blecon-enabled app. And that's it: once the app is installed, everything is handled automatically by the Blecon system.

This is how it looks:

The animation above shows how the system works: messages are sent to and from the beacons via the people carrying their smartphones. It is also possible to have dedicated gateway devices, called Hubs, in case there are no people around. (Feel free to click and drag the people and the devices in the animation.)
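To make the relay idea concrete, here is a purely illustrative sketch in Python. This is not Blecon's actual protocol or API: the key handling, message format, and relay function are all assumptions for illustration. The point is only that the beacon encrypts for the cloud, and the phone ferries opaque bytes it cannot read.

```python
# Illustrative model of the relay idea (NOT Blecon's actual protocol):
# the beacon encrypts a payload for the cloud; the phone only ferries bytes.
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # shared between beacon and cloud only
beacon_cipher = Fernet(key)

# Beacon side: encrypt a sensor reading before handing it to any phone.
message = beacon_cipher.encrypt(b'{"temperature": 21.5}')

def phone_relay(opaque_bytes: bytes) -> bytes:
    # Phone side: it has no key, so it can only forward the ciphertext
    # (to the cloud over the network, in the real system).
    return opaque_bytes

# Cloud side: decrypts with the key it shares with the beacon.
received = phone_relay(message)
print(Fernet(key).decrypt(received))  # b'{"temperature": 21.5}'
```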
## Why Bluetooth?

Because Bluetooth is one of the world's most widely deployed wireless technologies. Every smartphone has Bluetooth, and the number of Bluetooth devices is growing rapidly. Workforces are increasingly equipped with phones running apps that are easily Blecon-enabled. This opens up a wide range of new applications. But the system needs to be extremely reliable.

## Scaling is Hard, So We Made it Harder

So how do we ensure that a system like this works reliably, even when there are large numbers of devices in the same area? We can certainly use simulation to model the behavior, and this is an important part of the development process. But to understand the behavior of the system in the real world, you need to run the system with real hardware.

Scaling is hard. As we know from before, things are very different when you have only 1-2 devices compared to when you have 100+ devices. Robustness is the primary challenge. The system has to work, even when unexpected things happen. Given enough time, unexpected things will happen. Devices will be added and removed. Devices will run out of battery. Users will behave in unexpected ways. Interference will occur. And so on. Given enough devices, even things that happen very infrequently will happen regularly. To flush out these infrequent events, we need to push the system hard.

So how do we prepare for it? We set up a testbed, with actual hardware.

## We Chose $10 USB Sticks Over Something Fancier

The hardware is both the easy part and the tricky part. It is easy in the sense of being simple: you just need to buy a bunch of devices and set them up. It is tricky in the sense of being tedious: you need to set up, run, and maintain a large number of devices.

Fortunately, there is one off-the-shelf hardware device that is perfect for this: the Nordic Semiconductor nRF52840 Dongle. It is a small USB device with a powerful Arm Cortex-M4 CPU, Bluetooth 5, and a range of other features. It costs only around $10 per unit, and it has a very neat form factor: a thin USB stick.

The USB dongles can easily be plugged into USB hubs, to give them power and to make them physically manageable. We then placed the hubs on a wall, and gave them some lovely decoration in the form of IKEA shrubbery.

One drawback with these simple USB sticks is that they lack debugging ports. They are basically only equipped with a single UART serial port over the USB connection. The USB connection makes it possible, in theory, to read the debug messages that the dongles generate. But because USB controllers are typically not dimensioned for hundreds of devices at the same time, this becomes unreliable. So it is better to use the USB for power, and use other means to get debug data from the devices.

Could we have chosen a more elaborate setup, with a set of debuggable devboards instead? Sure, that could have given us better insight into the behavior of each individual device. But that is not what we are after here: we want to see the large-scale aggregate behavior.

## The Software That Herds 100+ Electronic Cats

The software is the heart of the system. It consists of two parts:

- The software under test: the beacon firmware, the smartphone apps, the cloud backend.
- The orchestration framework: the code that manages the testbed, deploys firmware updates, runs tests, and collects results.

The software under test exists already, but the orchestration framework will typically be custom-made, based on the specific needs of the system under test. The orchestration framework is responsible for:

- Monitoring the status of all devices.
- Deploying firmware updates to all devices.
- Running tests, both continuous and periodic.
- Collecting and analyzing performance data.
- Collecting crash reports.

The orchestration framework is implemented as a combination of backend scripts, cloud functions, and a web dashboard. The dashboard provides an overview of the status of the testbed, and allows the engineering team to continuously monitor the performance and drill down into specific issues.
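As a minimal sketch of what one such orchestration task might look like, here is a hypothetical device monitor in Python. The endpoint URL, the JSON fields (`id`, `last_heartbeat`), and the staleness threshold are illustrative assumptions, not Blecon's actual API:

```python
# Hypothetical monitor: flag devices that have stopped reporting heartbeats.
# The URL and JSON fields are illustrative assumptions, not a real Blecon API.
import time
import requests

STATUS_URL = "https://testbed.example.com/api/devices"  # hypothetical endpoint
STALE_AFTER_S = 300  # no heartbeat for five minutes -> flag the device

def check_devices() -> list[str]:
    """Return the IDs of devices that look unhealthy right now."""
    devices = requests.get(STATUS_URL, timeout=10).json()
    now = time.time()
    stale = [d["id"] for d in devices if now - d["last_heartbeat"] > STALE_AFTER_S]
    for device_id in stale:
        print(f"device {device_id} has gone silent")
    return stale

if __name__ == "__main__":
    while True:  # the dashboard and alerting consume whatever we flag here
        check_devices()
        time.sleep(60)
```

In the real testbed, checks of this kind run continuously and feed the dashboard described above.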
Now let's test it.

## Step 1: Can We Break Just One Device?

The first hurdle for the software is the single-device test setup: a single device, connected to a computer, in our case a Raspberry Pi. This is used for the first simple, automated tests, which run automatically on every change to the code. This ensures that the basic functionality of the system works, and that no obvious bugs are introduced.

But this is also where we need to run one important, more complex, test: can we do Over-the-Air (OTA) updates reliably? OTA updates are so important that we need to treat them as a fundamental part of the system. If we cannot update the devices reliably, the system is broken. And that is something that must never reach the field. In fact, we do not even want it to reach the testbed, because we rely on OTA to update the testbed itself.

To ensure that OTAs always work, for every code change, we do the following:

1. Build a new firmware image and install it on the device.
2. Build two instances of the same firmware image, but with incremented version numbers, and upload them to the OTA cloud.
3. Trigger an OTA update on the device.
4. Wait for the device to report back that the update was successful.
5. Trigger a second OTA update.
6. Wait for the device to report back that the second update was successful.

Why do two OTAs? Because there is a risk that we have introduced a change that will make subsequent OTAs fail. For example, if we have changed the way the firmware version is stored, the second OTA might fail if it cannot determine that the new version is indeed newer than the old version. By running two OTA updates, we can catch such issues early.

## Step 2: Now Break All 100+ at Once

Once the small-scale tests are working reliably, we can move on to the large-scale testbed. The large-scale testbed serves two main purposes:

- Long-running tests, to see if the system keeps working reliably, even after weeks or months of continuous operation.
- Testing new versions of the software, to see if they work reliably at scale and over time.

To install a new version of the software on the testbed, we use the same OTA mechanism as in the small-scale tests. This is important, as it ensures that OTAs are tested continuously, and that any issues with them are caught early. OTAs need to be reliable, so as a general rule, it is a good idea for the engineering team to use OTA as the default way to install new firmware on the devices during development.

The testbed system continuously monitors the status of all devices and collects performance data. This happens all the time, even during OTAs. Performance data is always collected and displayed on the web frontend. The same data is also posted to a database that allows us to compare performance over time.

But the most important part is not the performance data. It is that each device always works. To ensure this, we display the status of every device on a screen, visible to everyone. Each device is represented by a smiley face: a happy face if things are going well, a crying face if it is starting to have issues, and an angry face if it is not working at all. A quick glance is enough to get an idea of the status of the system.

## Breaking it on Purpose (Before Customers Do)

The first thing that happens when a system is exposed to large-scale testing is that the bugs start to show up. Bugs are inevitable, and the more complex the system, the more bugs there will be. And they will not show up until you run the system at scale, over time.

To be able to catch the bugs when they appear, it is important to have some form of crash reporting built into the system. In this project, we used the Memfault crash reporting backend (now owned by Nordic Semiconductor), which turned out to be a great tool. As embedded developers, we usually build some form of custom crash reporting into our systems, but this is cumbersome and takes away time that could be better spent on building the actual product. Memfault provided a great, ready-made alternative, with a simple SDK that was easy to integrate into the existing codebase.

Now that we have a large-scale testbed, and the baseline system works, we can begin to provoke issues: sending too much data, too quickly; reducing the sending rate to a slow trickle; sending data when there are no smartphones around; turning on the entire system at the same time. We can also, on purpose, introduce specific bugs into the system, to check that they are caught before they even make it into the testbed.

For example, we add a bug into the OTA code that flips one random bit as the device receives the new OTA update. Will the system catch it? Yes, it will, because the corrupted update won't even make it beyond the first one-device step.
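Here is a self-contained sketch of that idea in Python: a double-OTA check like the one in Step 1, plus the random bit-flip fault injection just described. The fake device and the checksum scheme are illustrative stand-ins; the real pipeline runs against actual firmware images, hardware, and the OTA cloud.

```python
# Sketch of the double-OTA check plus bit-flip fault injection.
# FakeDevice and the digest scheme are illustrative stand-ins for the
# real firmware image, device, and OTA cloud.
import hashlib
import random

def flip_random_bit(image: bytes) -> bytes:
    """Corrupt one random bit in transit, like the injected OTA bug."""
    data = bytearray(image)
    bit = random.randrange(len(data) * 8)
    data[bit // 8] ^= 1 << (bit % 8)
    return bytes(data)

class FakeDevice:
    """Accepts an update only if the image is intact and the version is newer."""
    def __init__(self) -> None:
        self.version = 0
    def apply_ota(self, image: bytes, version: int, digest: str) -> None:
        if hashlib.sha256(image).hexdigest() == digest and version > self.version:
            self.version = version

def double_ota_check(device: FakeDevice, image: bytes, corrupt: bool) -> bool:
    """Run two sequential OTA updates with incremented version numbers."""
    digest = hashlib.sha256(image).hexdigest()  # computed from the good build
    for version in (device.version + 1, device.version + 2):
        received = flip_random_bit(image) if corrupt else image
        device.apply_ota(received, version, digest)
        if device.version != version:
            return False  # the device never reported the new version
    return True

print(double_ota_check(FakeDevice(), bytes(1024), corrupt=False))  # True
print(double_ota_check(FakeDevice(), bytes(1024), corrupt=True))   # False
```

In the real pipeline, the same check runs on every code change, so a corrupted or regressed OTA never reaches the 100-device testbed.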
## The Testbed That Never Sleeps

Now that we have made the system work reliably, should we keep the testbed running? Yes, because the testbed now serves as a status indicator. If the testbed smileys are happy, the system is working, and we can sleep well at night, knowing that the testbed is still awake. If the smileys are sad, something is wrong. And if they are angry, devices are completely offline. Maybe a recent iOS update caused some unexpected behavior? Or maybe the cloud backend is having issues? The testbed makes sure we are the first to know, so the engineering team can jump in and investigate.

## Conclusion

To ensure that a complex IoT system works reliably, even in challenging environments, it is essential to test it with real hardware at scale. By setting up a testbed with hundreds of devices, we can catch issues early, ensure that critical features like OTA updates work reliably, and provide a continuous status indicator for the engineering team.

Also check out Blecon co-founder Donatien Garnier talking about the project at the Zephyr Developer Conference 2025:

For more insights on IoT scaling challenges, see the previous post on IoT development challenges.