Multi-Threaded Watchdog Class in C++

When dealing with critical systems such as user interfaces, headless servers, or embedded systems, you must define a maximum response time for your software. However, no system works as expected all the time, which is why you need a watchdog mechanism to take down and restart your program when necessary. In this post I will show how to build a multi-threaded watchdog class that tracks all of your threads. 👊

How Does a Watchdog Timer Work

A watchdog is a special timer that counts down from a predefined value. It has one purpose: to restart your system when it reaches zero. This very simple mechanism, which can be implemented in hardware or software, helps you keep your systems responsive.

When to Use a Watchdog

If you have multiple threads that use mutex locks for synchronization or resource sharing, you can run into deadlocks when threads wait for each other at the same time. Deadlock is not the only reason a program can get stuck, though. You can also hit an unlikely lock-up caused by another misbehaving component. A well-tested system makes this very unlikely, but the probability is never zero.

How to Use Watchdog

Restarting may seem destructive, but the watchdog is not meant to restart your system often, and ideally never! A bug-free program reloads the watchdog timer periodically, preventing it from reaching zero. Leave a safety margin: if you set the watchdog timer to 10 seconds, your program should reload it more frequently than every 5 seconds. Some systems have much tighter deadlines of only a few milliseconds.
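
To make the countdown-and-reload idea concrete, here is a minimal sketch of a software countdown timer. The class name CountdownTimer and its methods are hypothetical, chosen only for illustration; a real hardware watchdog is reloaded through a platform-specific interface instead. The current time is passed in explicitly so the behavior is easy to reason about.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical software stand-in for a watchdog timer (illustration only).
class CountdownTimer {
public:
    using Clock = std::chrono::steady_clock;

    CountdownTimer(Clock::duration timeout, Clock::time_point now)
        : timeout_(timeout), deadline_(now + timeout) {
        assert(timeout > Clock::duration::zero());
    }

    // Restart the countdown from the full timeout.
    void Reload(Clock::time_point now) { deadline_ = now + timeout_; }

    // True once the countdown has reached zero.
    bool Expired(Clock::time_point now) const { return now >= deadline_; }

private:
    Clock::duration timeout_;
    Clock::time_point deadline_;
};
```

With a 10 second timeout, a reload at the 5 second mark moves the deadline to 15 seconds, so a single delayed reload still leaves room before the timer fires.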

Checking Multiple Threads

The purpose of a watchdog is to track your program as a whole, not just part of it. That’s why you must ensure all your threads work as expected before sending the heartbeat signal. A practical way of doing this is to create a special class that gathers sanity signals from all of your threads. Keep reading for the implementation details.
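
The rule "send the heartbeat only if every thread is healthy" can be sketched in a few lines. The helper below is hypothetical and is not the design developed in the rest of this post: it assumes each thread exposes a health predicate, and that reload stands in for whatever call reloads the external watchdog on your platform.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical helper: forwards a heartbeat only when every check passes.
// `checks` holds one health predicate per thread (an assumption made for
// this sketch); `reload` stands in for the platform-specific reload call.
inline bool SendHeartbeatIfHealthy(
        const std::vector<std::function<bool()>>& checks,
        const std::function<void()>& reload) {
    assert(reload);
    for (const auto& is_alive : checks) {
        if (!is_alive()) {
            return false;  // one stuck thread is enough to withhold the heartbeat
        }
    }
    reload();
    return true;
}
```

Withholding the heartbeat is all it takes: the external timer then runs out on its own and restarts the program.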

Multi-Thread Watchdog Class

We need to talk about some theory before jumping to the implementation.

Working Principle

Every thread must maintain its own response time by calling the Kick(...) method of our watchdog collector class. To track these timings, we need a queue like the one in the illustration below.

Once we add all the threads to the queue in order of expiry time, we know that the item at the front of the queue will expire first. If the expiry time of the front element passes, we stop reloading the external watchdog timer, so it will soon restart the program or the whole system.

A properly functioning thread must kick itself, so that we remove it from the front of the queue and add it to the back with a new due time.

You can see the concept above. However, it is also very likely that one thread kicks more frequently than another, since there is no minimum interval between kicks. In this case the kicking thread is not at the front of the queue, and we need a hash map to find it in constant time.

Another case to consider is a thread that has been joined and no longer exists. We have another method, Done(...), to remove it from the queue completely. This is very useful if you have temporary worker threads that come and go.

But how can we represent a thread? We need a unique id or a handle. I prefer using the thread id for this purpose, since we don’t need to keep a handle around and it is unique among all running threads.
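
The pieces described so far (a queue ordered by expiry time, a hash map for constant-time look-up, and thread ids as keys) can be put together roughly as follows. This is only a sketch under assumptions: the class name WatchdogCollector, the IsExpired method, and the single shared timeout are illustrative choices, and only Kick(...) and Done(...) are named in the text above.

```cpp
#include <cassert>
#include <chrono>
#include <iterator>
#include <list>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <utility>

// Illustrative sketch; names other than Kick and Done are assumptions.
class WatchdogCollector {
public:
    using Clock = std::chrono::steady_clock;

    explicit WatchdogCollector(Clock::duration timeout) : timeout_(timeout) {
        assert(timeout > Clock::duration::zero());
    }

    // Register the thread or refresh its deadline, then move it to the back.
    void Kick(std::thread::id id = std::this_thread::get_id()) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto found = index_.find(id);
        if (found != index_.end()) {
            queue_.erase(found->second);  // O(1) thanks to the stored iterator
        }
        // One shared timeout plus a monotonic clock keeps the list sorted.
        queue_.emplace_back(id, Clock::now() + timeout_);
        index_[id] = std::prev(queue_.end());
    }

    // Stop tracking a thread that has finished its work.
    void Done(std::thread::id id = std::this_thread::get_id()) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto found = index_.find(id);
        if (found != index_.end()) {
            queue_.erase(found->second);
            index_.erase(found);
        }
    }

    // True if the entry at the front (earliest deadline) is overdue.
    bool IsExpired(Clock::time_point now = Clock::now()) const {
        std::lock_guard<std::mutex> lock(mutex_);
        return !queue_.empty() && now >= queue_.front().second;
    }

private:
    using Entry = std::pair<std::thread::id, Clock::time_point>;

    Clock::duration timeout_;
    mutable std::mutex mutex_;
    std::list<Entry> queue_;  // ordered by deadline, oldest at the front
    std::unordered_map<std::thread::id, std::list<Entry>::iterator> index_;
};
```

A monitoring loop would call IsExpired() periodically and reload the external watchdog only while it returns false. A std::list is used here because its iterators stay valid when other elements are erased, which is what makes storing them in the hash map safe.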

Implementation

From the implementation point of view, our class must be thread-safe, so that different threads can call it without race conditions or other concurrency issues. It uses a mutex to handle concurrent requests from multiple threads. Use the provided code as a guide and modify it according to your needs.

Usage of the Example Implementation

You can demonstrate the behavior with the following code snippet.

Compile it with g++ main.cpp -std=c++17 -lpthread to see the output below.

If you comment out line 17, you can observe an expiry event.

Conclusion

This is only a reference design to guide you. Adding more features, such as thread-specific timeouts or callbacks on expiration, can take your design one step further.