Multi-Threaded Watchdog Class in C++
When dealing with critical systems such as user interfaces, headless servers or embedded systems, you must define a maximum response time for your software. However, no system works as expected all the time, which is why you need a watchdog mechanism to take down and restart your program when necessary. In this post I will show how to build a multi-threaded watchdog class that tracks all of your threads. 👊
How Does a Watchdog Timer Work
A watchdog is a special timer that counts down from a predefined value. It has one purpose: to restart your system when it reaches zero. This very simple mechanism, which can be implemented in hardware or software, helps you keep systems responsive.
When to use Watchdog
If you have multiple threads that use mutex locks for synchronization or resource sharing, you can run into deadlocks when threads wait for each other at the same time. Deadlock is not the only reason for a program to get stuck, either. A lock-up can also be caused by another misbehaving component. In a well-tested system the probability is very small, but it is never zero.
How to Use Watchdog
Restarting may seem destructive, but it is not meant to happen often, ideally never! A bug-free program reloads the watchdog timer periodically, preventing it from reaching zero. If you set the watchdog timer to 10 seconds, your program should reload it more often than every 5 seconds, which leaves a comfortable safety margin. Some systems have much tighter deadlines of only a few milliseconds.
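The reload rule can be sketched as a tiny software countdown. This is only an illustration under my own assumptions: the names CountdownWatchdog, Reload and Expired are hypothetical, and the current time is passed in explicitly so the behavior is easy to follow and to test.

```cpp
#include <chrono>

// Hypothetical single-client software watchdog, for illustration only.
class CountdownWatchdog {
public:
    using Clock = std::chrono::steady_clock;

    CountdownWatchdog(Clock::duration timeout, Clock::time_point now)
        : m_timeout(timeout), m_deadline(now + timeout) {}

    // Restart the countdown; a healthy program calls this periodically.
    void Reload(Clock::time_point now) { m_deadline = now + m_timeout; }

    // True once the countdown has reached zero without a reload.
    bool Expired(Clock::time_point now) const { return now >= m_deadline; }

private:
    Clock::duration m_timeout;     // length of one countdown
    Clock::time_point m_deadline;  // moment the countdown reaches zero
};
```

In a real program you would pass Clock::now() to both methods and trigger the restart when Expired(...) returns true.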
Checking Multiple Threads
The purpose of a watchdog is to track your program as a whole, not just part of it. That’s why you must ensure that all your threads work as expected before sending the heartbeat signal. A straightforward way of doing this is to create a special class that gathers sanity signals from all of your threads. Keep reading for the implementation details.
Multi-Thread Watchdog Class
We need to talk about some theory before jumping to the implementation.
Working Principle
Every thread must maintain its own response time by calling the Kick(...) method of our watchdog collector class. To track these timings, we need a queue like the one in the illustration below.
Once we add all the threads to the queue in order of expire time, we can be certain that the item at the front of the queue is the one that will expire first. If the expire time of the front element passes, we stop reloading the external watchdog timer, so it will soon restart the program or the whole system.
A properly functioning thread must kick the watchdog in time, so that we remove it from the front of the queue and add it to the back with a new due time.
You can see the concept above. However, it is very likely that one thread kicks more frequently than another, since there is no minimum interval between kicks. In this case the kicking thread is not at the front of the queue, and we need a hash map to find it in constant look-up time.
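The queue-plus-hash-map idea can be sketched in isolation as follows. This is a simplified illustration, not the class from the Implementation section: DeadlineQueue and its methods are hypothetical names, clients are plain integers, and time is a plain tick counter passed in by the caller.

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <unordered_map>
#include <utility>

// Hypothetical deadline queue. Every client gets the same timeout, so
// appending at the back keeps the list ordered by expire time as long as
// `now` never goes backwards.
class DeadlineQueue {
public:
    explicit DeadlineQueue(long timeout) : m_timeout(timeout) {}

    // Add the client, or move it to the back with a fresh deadline.
    // The hash map finds the list node in constant time on average.
    void Kick(int id, long now) {
        auto found = m_index.find(id);
        if (found != m_index.end()) {
            found->second->second = now + m_timeout;
            // splice relinks the node; the stored iterator stays valid.
            m_items.splice(m_items.end(), m_items, found->second);
        } else {
            m_items.emplace_back(id, now + m_timeout);
            m_index[id] = std::prev(m_items.end());
        }
    }

    // True if the client at the front has missed its deadline.
    bool FrontExpired(long now) const {
        return !m_items.empty() && m_items.front().second < now;
    }

    // Precondition: the queue is not empty.
    int FrontId() const { return m_items.front().first; }
    std::size_t Size() const { return m_items.size(); }

private:
    using Items = std::list<std::pair<int, long>>;  // (id, deadline)
    long m_timeout;
    Items m_items;  // front = expires first
    std::unordered_map<int, Items::iterator> m_index;
};
```

A linked list is used because its iterators stay valid when other elements are inserted or moved, which is what makes it safe to store them in the hash map.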
Another case to think about is when a thread has been join(...)’ed and no longer exists. For this we have another method, Done(...), which removes the thread from the queue completely. This is very useful if you have temporary worker threads that come and go.
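For short-lived workers it is easy to forget the final Done(...) call, especially when the thread function has several exit paths. One possible remedy is a small RAII helper. ScopedWatchdogClient is a hypothetical name of my own, and the sketch assumes only that the watchdog type has Kick() and Done() methods that can be called without arguments.

```cpp
// Hypothetical RAII helper: registers the calling thread on construction
// and removes it on destruction, even if the worker leaves via an exception.
// Watchdog is any type whose Kick() and Done() can be called with no arguments.
template <class Watchdog>
class ScopedWatchdogClient {
public:
    explicit ScopedWatchdogClient(Watchdog &wdg) : m_wdg(wdg) { m_wdg.Kick(); }
    ~ScopedWatchdogClient() { m_wdg.Done(); }

    // Copying would lead to Done() being called twice.
    ScopedWatchdogClient(const ScopedWatchdogClient &) = delete;
    ScopedWatchdogClient &operator=(const ScopedWatchdogClient &) = delete;

    // Call from inside the worker loop to renew the deadline.
    void Kick() { m_wdg.Kick(); }

private:
    Watchdog &m_wdg;
};
```

A worker would create one instance at the top of its thread function and call Kick() inside its loop. The guard must be created and destroyed on the worker thread itself, because the default arguments of Kick() and Done() use the id of the calling thread.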
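But how can we represent a thread? We need a unique id or a handle. I prefer using the thread id for this purpose, since we don’t need to keep a handle around and it is unique among all running threads. Note that an id may be reused once its thread has finished, which is one more reason to call Done(...) before a thread exits.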
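To see that thread ids work as hash-map keys, here is a small sketch that collects the ids of several workers in a std::unordered_set. The function name CountDistinctThreadIds is hypothetical. The ids are guaranteed to be distinct here because none of the threads has been joined yet when the others register.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <unordered_set>
#include <vector>

// Starts numThreads workers, lets each record its own id, and returns
// how many distinct ids were seen.
std::size_t CountDistinctThreadIds(std::size_t numThreads) {
    std::unordered_set<std::thread::id> ids;  // std::hash<std::thread::id> is provided
    std::mutex mutex;
    std::vector<std::thread> workers;

    for (std::size_t i = 0; i < numThreads; ++i) {
        workers.emplace_back([&ids, &mutex] {
            std::lock_guard<std::mutex> lock(mutex);
            ids.insert(std::this_thread::get_id());
        });
    }
    for (auto &worker : workers) {
        worker.join();
    }
    return ids.size();
}
```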
Implementation
From the implementation point of view, our class must be thread-safe: its methods can be called from different threads without race conditions or other concurrency issues. It uses a mutex to serialize concurrent requests. Use the provided code as a guide and modify it according to your needs.
```cpp
/**
 * @file MultiThreadWatchdog.h
 * @author Atakan S.
 * @brief Generalized Watchdog Class for multi-thread programs.
 *
 * @copyright Copyright (c) 2020 Atakan SARIOGLU ~ www.atakansarioglu.com
 *
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * the rights to use, copy, modify, merge, publish, distribute, sublicense,
 * and/or sell copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
 * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
 * DEALINGS IN THE SOFTWARE.
 */

#ifndef MULTI_THREAD_WATCHDOG_H_
#define MULTI_THREAD_WATCHDOG_H_

#include <atomic>
#include <chrono>
#include <cstddef>
#include <exception>
#include <iostream>
#include <list>
#include <mutex>
#include <thread>
#include <unordered_map>

class MultiThreadWatchdog {
public:
    // A monotonic clock, so wall-clock adjustments cannot cause false expirations.
    using Clock_t = std::chrono::steady_clock;
    // Interval in seconds; fractional values are allowed.
    using Interval_t = std::chrono::duration<double>;
    using ThreadId_t = std::thread::id;
    using Time_t = Clock_t::time_point;

    MultiThreadWatchdog(double interval, std::size_t maxNumClients)
        : m_maxNumClients(maxNumClients), m_interval(interval) {
        // Start the monitor only after all members are initialized.
        m_bgThread = std::thread(&MultiThreadWatchdog::bgThread, this);
    }

    // Registers the thread or renews its deadline.
    // Returns false if the client limit has been reached.
    bool Kick(const ThreadId_t &id = std::this_thread::get_id()) {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto found = m_hashMap.find(id);
        if (found != m_hashMap.end()) {
            // Erase the current item.
            m_list.erase(found->second);
            m_hashMap.erase(found);
            --m_currentNumClients;
        } else if (m_currentNumClients == m_maxNumClients) {
            // Cannot add item.
            return false;
        }

        // Insert new item to front. The list keeps the newest deadline at
        // its front, so the item that expires first is always at the back.
        m_list.emplace_front(id, ExpireTimeFromNow());
        m_hashMap[id] = m_list.begin();
        ++m_currentNumClients;

        // Kicked.
        std::cout << "Kicked from: " << id << std::endl;
        return true;
    }

    // Removes the thread from tracking. Returns false if it was not registered.
    bool Done(const ThreadId_t &id = std::this_thread::get_id()) {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto found = m_hashMap.find(id);
        if (found != m_hashMap.end()) {
            // Erase current item.
            m_list.erase(found->second);
            m_hashMap.erase(found);
            --m_currentNumClients;

            // Removed.
            std::cout << "Thread: " << id << " is done!" << std::endl;
            return true;
        }

        // Not found.
        return false;
    }

    // Returns true and reports the id if the oldest item has missed its deadline.
    bool IsExpired(ThreadId_t &expiredId) {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (!m_list.empty() && m_list.back().second < Clock_t::now()) {
            expiredId = m_list.back().first;
            return true;
        }
        return false;
    }

    // Earliest pending deadline, or one full interval from now if nothing is tracked.
    Time_t GetNextExpireTime() {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_list.empty() ? ExpireTimeFromNow() : m_list.back().second;
    }

    ~MultiThreadWatchdog() {
        // The monitor thread notices the flag the next time it wakes up.
        m_bgThreadEnable = false;
        m_bgThread.join();
    }

private:
    std::size_t m_maxNumClients = 0;
    std::size_t m_currentNumClients = 0;
    std::list<std::pair<ThreadId_t, Time_t>> m_list;
    std::unordered_map<ThreadId_t, decltype(m_list)::iterator> m_hashMap;
    Interval_t m_interval;
    std::mutex m_mutex;
    std::thread m_bgThread;
    std::atomic<bool> m_bgThreadEnable{true};

    // Casting to the clock's own resolution keeps fractional seconds intact.
    Time_t ExpireTimeFromNow() const {
        return Clock_t::now() + std::chrono::duration_cast<Clock_t::duration>(m_interval);
    }

    void bgThread() {
        while (m_bgThreadEnable) {
            ThreadId_t expiredThread{};
            if (IsExpired(expiredThread)) {
                // Trigger watchdog reset.
                std::cout << "Expired thread: " << expiredThread << "!" << std::endl;
                // TODO: Replace with respect to your system's requirement.
                std::terminate();
            }

            // Kick the WatchDog.
            std::cout << "Kicked System!" << std::endl;

            // Nothing has expired.
            std::this_thread::sleep_until(GetNextExpireTime());
        }
    }
};

#endif
```
Usage of the Example Implementation
You can demonstrate the behavior with the following code snippet.
```cpp
#include <iostream>
#include <thread>
#include <chrono>
#include "MultiThreadWatchdog.h"

int main() {
    // Initialize with 3sec expire time and max 10 threads.
    MultiThreadWatchdog wdg(3, 10);

    // Kick 5 times and stop.
    int i = 5;
    while(true) {
        if(i > 0) {
            wdg.Kick();
            --i;
        } else {
            wdg.Done();
        }

        using namespace std::chrono_literals;
        std::this_thread::sleep_for(1s);
    }

    // Should never reach here.
    std::cout << "Bye" << std::endl;
    return 0;
}
```
Compile it with g++ main.cpp -std=c++17 -lpthread to see the output below.
```
Kicked from: 139648404215616
Kicked System!
Kicked from: 139648404215616
Kicked from: 139648404215616
Kicked System!
Kicked from: 139648404215616
Kicked from: 139648404215616
Kicked System!
Thread: 139648404215616 is done!
Kicked System!
Kicked System!
Kicked System!
Kicked System!
...
```
If you comment out line 17 (the wdg.Done() call), you can observe an expire event.
```
Kicked from: 140708338083648
Kicked System!
Kicked from: 140708338083648
Kicked from: 140708338083648
Kicked System!
Kicked from: 140708338083648
Kicked from: 140708338083648
Kicked System!
Expired thread: 140708338083648!
terminate called without an active exception
Aborted
```
Conclusion
This is only a reference design to guide you. Adding more features, such as thread-specific timings or callbacks on expiration, can take your design one level further.
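As a starting point for those two extensions, here is one possible sketch. Everything in it is my own assumption rather than part of the reference design: the name PerClientWatchdog, integer client ids, a tick counter instead of a real clock, and an explicit Check(...) call instead of a background thread. With per-client timeouts, appending to a list no longer keeps items ordered by deadline, so the sketch uses a std::multimap sorted by deadline instead.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <unordered_map>
#include <utility>

// Hypothetical watchdog with per-client timeouts and an expiration callback.
class PerClientWatchdog {
public:
    using Callback = std::function<void(int id)>;

    explicit PerClientWatchdog(Callback onExpire) : m_onExpire(std::move(onExpire)) {}

    // Register the client or renew its deadline with its own timeout.
    void Kick(int id, long now, long timeout) {
        Done(id);
        m_index[id] = m_byDeadline.emplace(now + timeout, id);
    }

    // Stop tracking the client. Returns false if it was not registered.
    bool Done(int id) {
        auto found = m_index.find(id);
        if (found == m_index.end()) {
            return false;
        }
        m_byDeadline.erase(found->second);
        m_index.erase(found);
        return true;
    }

    // Invoke the callback for every client whose deadline has passed and
    // drop it from tracking. Returns the number of expired clients.
    std::size_t Check(long now) {
        std::size_t expired = 0;
        while (!m_byDeadline.empty() && m_byDeadline.begin()->first < now) {
            const int id = m_byDeadline.begin()->second;
            Done(id);
            m_onExpire(id);
            ++expired;
        }
        return expired;
    }

private:
    Callback m_onExpire;
    std::multimap<long, int> m_byDeadline;  // deadline -> id, earliest first
    std::unordered_map<int, std::multimap<long, int>::iterator> m_index;
};
```

The trade-off is that Kick(...) now costs logarithmic time instead of constant time. A thread-safe version would also need a mutex around all three methods, as in the reference class.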