Multi-Threaded Watchdog Class in C++

By Atakan SARIOGLU in Software Design 04/01/2020

When dealing with critical systems such as user interfaces, headless servers or embedded systems, you must define a maximum response time for your software. However, no system works as expected all the time, which is why you need a watchdog mechanism to take down and restart your program when necessary. In this post I will show how to build a multi-threaded watchdog class that tracks all of your threads. 👊

1 How Does Watchdog Timer Work
2 Multi-Thread Watchdog Class
3 Conclusion

How Does Watchdog Timer Work

A watchdog is a special timer that counts down from a predefined value. It has one purpose: to restart your system when it reaches zero. This very simple mechanism, which can be implemented in hardware or in software, helps you keep your systems responsive.
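
The countdown idea can be modeled in a few lines. This is only a minimal sketch: the class name CountdownWatchdog and its methods are hypothetical, and a real watchdog is driven by a hardware timer or a clock rather than by manual Tick() calls.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical software model of a watchdog counter (illustration only).
class CountdownWatchdog {
public:
    explicit CountdownWatchdog(std::uint32_t reloadValue)
        : m_reloadValue(reloadValue), m_counter(reloadValue) {
        assert(reloadValue > 0 && "reload value must be positive");
    }

    // Called by healthy software: restart the countdown.
    void Reload() { m_counter = m_reloadValue; }

    // Called once per timer tick. Returns true when the counter has
    // reached zero, i.e. when the system should be restarted.
    bool Tick() {
        if (m_counter > 0) --m_counter;
        return m_counter == 0;
    }

private:
    std::uint32_t m_reloadValue;
    std::uint32_t m_counter;
};
```

As long as Reload() is called before the counter runs out, Tick() keeps returning false and nothing is restarted.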

When to Use Watchdog

If you have multiple threads that use mutex locks for synchronization or resource sharing, you can run into deadlocks when threads wait for each other at the same time. Furthermore, a deadlock is not the only reason for a program to get stuck. A thread can also lock up unexpectedly because of another misbehaving component. In a well-tested system the probability is very small, but it is never zero.

How to Use Watchdog

The action may seem destructive, but the watchdog is certainly not intended to restart your system many times; ideally it never does! A bug-free program reloads the watchdog timer periodically, preventing it from reaching zero. If you set the watchdog timer to 10 seconds, your program should reload the timer more often than every 5 seconds, so that an occasional delay does not trigger a reset. Some systems have much shorter deadlines of only a few milliseconds.

Checking Multiple Threads

The purpose of a watchdog is to monitor your program as a whole, not just a part of it. That’s why you must ensure that all your threads work as expected before sending the heartbeat signal. A practical way of doing this is to create a dedicated class that gathers sanity signals from all of your threads. Keep reading for its implementation details.
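
One simple way to picture "gathering sanity signals" is a set of per-thread flags that a supervisor checks before it sends the heartbeat. The sketch below is a simplified alternative to the class developed in this post, not its implementation; HeartbeatGate and its method names are hypothetical, and it assumes a fixed number of threads, each knowing its own slot index.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: one "alive" flag per monitored thread slot.
class HeartbeatGate {
public:
    explicit HeartbeatGate(std::size_t numThreads) : m_alive(numThreads) {
        for (auto &flag : m_alive) flag.store(false);
    }

    // Called by the worker that owns `slot` whenever it makes progress.
    void ReportAlive(std::size_t slot) {
        assert(slot < m_alive.size());
        m_alive[slot].store(true);
    }

    // Called periodically by the supervisor. Returns true only if every
    // thread reported since the previous call, then clears all flags.
    bool AllAliveAndReset() {
        bool all = true;
        for (auto &flag : m_alive) {
            // exchange() reads and clears the flag in one atomic step.
            if (!flag.exchange(false)) all = false;
        }
        return all;
    }

private:
    std::vector<std::atomic<bool>> m_alive;
};
```

The supervisor would send the heartbeat only when AllAliveAndReset() returns true. Unlike the queue-based design that follows, this variant cannot tell when each thread is due, only whether it reported within the last polling period.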

Multi-Thread Watchdog Class

We need to talk about some theory before jumping to the implementation.

Working Principle

Every thread must maintain its own response time by calling the Kick(...) method of our watchdog collector class. To track these timings, we need a queue of threads ordered by expire time.

Once we add all the threads to the queue in order of expire time, we know that the front of the queue holds the item that will expire first. If the expire time of the front element passes, we stop reloading the external watchdog timer, so it will soon restart the program or the whole system.

A properly functioning thread must kick itself, so that we remove it from the front of the queue and add it to the back with a new due time.

That is the basic concept. However, it is very likely that one thread kicks more frequently than another, since there is no lower limit on the kick period. In that case the kicking thread is not at the front of the queue, and we need a hash map to find it with constant look-up time.
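
The queue plus hash map combination can be sketched as follows. This is a single-threaded illustration under simplifying assumptions: thread ids are plain ints, time is an integer tick count passed in by the caller, and the names DeadlineQueue and FrontId are hypothetical. It follows the orientation described in the text, where the front of the queue expires first.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <unordered_map>
#include <utility>

// Hypothetical sketch of a deadline queue with O(1) lookup by id.
class DeadlineQueue {
public:
    explicit DeadlineQueue(int interval) : m_interval(interval) {}

    // Register `id` or renew its deadline; either way it ends up at the back.
    void Kick(int id, int now) {
        auto found = m_index.find(id);
        if (found != m_index.end()) {
            // Known id: update the deadline and relink the node to the back.
            // splice() within one list is O(1) and keeps the iterator valid.
            found->second->second = now + m_interval;
            m_queue.splice(m_queue.end(), m_queue, found->second);
        } else {
            m_queue.emplace_back(id, now + m_interval);
            m_index[id] = std::prev(m_queue.end());
        }
    }

    // True if the earliest deadline (always at the front) has passed.
    bool IsExpired(int now) const {
        return !m_queue.empty() && m_queue.front().second < now;
    }

    // Id of the entry that will expire first.
    int FrontId() const {
        assert(!m_queue.empty());
        return m_queue.front().first;
    }

private:
    int m_interval;
    std::list<std::pair<int, int>> m_queue;  // (id, deadline), earliest first
    std::unordered_map<int, std::list<std::pair<int, int>>::iterator> m_index;
};
```

Because every entry uses the same interval and time only moves forward, appending at the back is enough to keep the list sorted by deadline; no explicit sorting is needed.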

Another case to consider is a thread that has been join(...)’ed and does not exist anymore. For this we have another method, Done(...), which removes the thread from the queue completely. This is very useful if you have temporary worker threads that come and go.
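
For temporary workers it is easy to forget the final Done(...) call, for example on an early return or an exception. One option is a small scope guard. The sketch below is an assumption of mine rather than part of the class in this post: ScopedWatchdogClient is a hypothetical helper that works with any type offering Kick() and Done(), and CountingWatchdog is a stand-in used only for the demonstration.

```cpp
#include <cassert>

// Hypothetical RAII helper: registers the calling thread on construction
// and removes it again when the scope is left.
template <typename Watchdog>
class ScopedWatchdogClient {
public:
    explicit ScopedWatchdogClient(Watchdog &wdg) : m_wdg(wdg) { m_wdg.Kick(); }
    ~ScopedWatchdogClient() { m_wdg.Done(); }

    ScopedWatchdogClient(const ScopedWatchdogClient &) = delete;
    ScopedWatchdogClient &operator=(const ScopedWatchdogClient &) = delete;

    // Renew the deadline from inside the worker's loop.
    void Kick() { m_wdg.Kick(); }

private:
    Watchdog &m_wdg;
};

// Stand-in watchdog that only counts calls (demonstration only).
struct CountingWatchdog {
    int kicks = 0;
    int dones = 0;
    bool Kick() { ++kicks; return true; }
    bool Done() { ++dones; return true; }
};

// Small self-check showing the order of calls.
inline void DemoScopedClient() {
    CountingWatchdog wdg;
    {
        ScopedWatchdogClient<CountingWatchdog> client(wdg);  // first Kick()
        client.Kick();                                       // second Kick()
        assert(wdg.dones == 0);
    }  // Done() runs here, when the guard goes out of scope
    assert(wdg.kicks == 2 && wdg.dones == 1);
}
```

The guard has to be created inside the worker thread itself, because Kick() and Done() identify the caller by its own thread id by default.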

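But how can we represent a thread? We need a unique id or a handle. I prefer using the thread id for this purpose, since we don’t need to keep a handle around and it is unique among all running threads. Note that an id may be reused after its thread has finished, which is one more reason to call Done(...) before a thread exits.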
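
As a small illustration of why the thread id is convenient: it is available anywhere through std::this_thread::get_id(), and the standard library provides a hash for it, so it can be used directly as a key of std::unordered_map. The names KickCounts and RecordKick below are hypothetical.

```cpp
#include <cassert>
#include <thread>
#include <unordered_map>

// Hypothetical bookkeeping keyed by thread id.
using KickCounts = std::unordered_map<std::thread::id, int>;

// Count a kick for the calling thread; no handle has to be passed around.
inline int RecordKick(KickCounts &counts) {
    const std::thread::id self = std::this_thread::get_id();
    // A running thread never has the default-constructed "no thread" id.
    assert(self != std::thread::id{});
    return ++counts[self];
}
```

Calling RecordKick twice from the same thread updates a single map entry, because both calls see the same id.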

Implementation

From the implementation point of view, our class must be thread-safe, so that different threads can call it without race conditions or other concurrency issues. It uses a mutex to serialize concurrent requests from multiple threads. Use the provided code as a guide and modify it according to your needs.

/**
 * @file      MultiThreadWatchdog.h
 * @author    Atakan S.
 * @brief     Generalized Watchdog Class for multi-thread programs.
 *
 * @copyright Copyright (c) 2020 Atakan SARIOGLU ~ www.atakansarioglu.com
 *
 *  Permission is hereby granted, free of charge, to any person obtaining a
 *  copy of this software and associated documentation files (the "Software"),
 *  to deal in the Software without restriction, including without limitation
 *  the rights to use, copy, modify, merge, publish, distribute, sublicense,
 *  and/or sell copies of the Software, and to permit persons to whom the
 *  Software is furnished to do so, subject to the following conditions:
 *
 *  The above copyright notice and this permission notice shall be included in
 *  all copies or substantial portions of the Software.
 *
 *  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 *  IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 *  FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 *  AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 *  LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
 *  FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
 *  DEALINGS IN THE SOFTWARE.
 */

#ifndef MULTI_THREAD_WATCHDOG_H
#define MULTI_THREAD_WATCHDOG_H

#include <iostream>
#include <unordered_map>
#include <list>
#include <mutex>
#include <atomic>
#include <thread>
#include <exception>
#include <chrono>   // clocks, durations and time points
#include <cstddef>  // size_t
#include <utility>  // std::pair

class MultiThreadWatchdog
{
public:
    // Kick interval in (possibly fractional) seconds.
    using Interval_t = std::chrono::duration<double>;
    using ThreadId_t = std::thread::id;
    using Time_t = std::chrono::time_point<std::chrono::system_clock>;

    MultiThreadWatchdog(double interval, size_t maxNumClients)
        : m_maxNumClients(maxNumClients), m_interval(interval)
    {   
        m_bgThread = std::thread(&MultiThreadWatchdog::bgThread, this);
    }

    bool Kick(const ThreadId_t &id = std::this_thread::get_id())
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_hashMap.count(id))
        {
            // Erase the current item.
            m_list.erase(m_hashMap[id]);
            m_hashMap.erase(id);
            --m_currentNumClients;
        }
        else if (m_currentNumClients == m_maxNumClients)
        {
            // Cannot add item.
            return false;
        }

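        // Insert the new item at the front. The list is kept ordered from the
        // most recently kicked thread (front) to the next one to expire (back).
        auto expireTime = std::chrono::system_clock::now();
        // Cast to the clock's own resolution so fractional seconds are kept.
        expireTime += std::chrono::duration_cast<Time_t::duration>(m_interval);
        m_list.emplace_front(id, expireTime);
        m_hashMap[id] = m_list.begin();
        ++m_currentNumClients;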

        // Kicked.
        std::cout << "Kicked from: " << id << std::endl;
        return true;
    }

    bool Done(const ThreadId_t &id = std::this_thread::get_id())
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_hashMap.count(id))
        {
            // Erase current item.
            m_list.erase(m_hashMap[id]);
            m_hashMap.erase(id);
            --m_currentNumClients;

            // Removed.
            std::cout << "Thread: " << id << " is done!" << std::endl;
            return true;
        }

        // Not found.
        return false;
    }

    bool IsExpired(ThreadId_t &expiredId)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        // The back of the list holds the thread with the earliest expire time.
        if (!m_list.empty() && m_list.back().second < std::chrono::system_clock::now())
        {
            expiredId = m_list.back().first;
            return true;
        }
        return false;
    }

    Time_t GetNextExpireTime()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        // With no registered threads, check again after one full interval.
        auto expireTime = std::chrono::system_clock::now();
        expireTime += std::chrono::duration_cast<Time_t::duration>(m_interval);
        return m_list.empty() ? expireTime : m_list.back().second;
    }

    ~MultiThreadWatchdog()
    {
        m_bgThreadEnable = false;
        m_bgThread.join();
    }

private:
    size_t m_maxNumClients = 0;
    size_t m_currentNumClients = 0;
    std::list<std::pair<ThreadId_t, Time_t>> m_list;
    std::unordered_map<ThreadId_t, decltype(m_list)::iterator> m_hashMap;
    std::chrono::duration<double> m_interval;
    std::mutex m_mutex;
    std::thread m_bgThread;
    std::atomic<bool> m_bgThreadEnable{true};

    void bgThread()
    {
        while (m_bgThreadEnable)
        {
            ThreadId_t expiredThread{};
            if (IsExpired(expiredThread))
            {
                // Trigger watchdog reset.
                std::cout << "Expired thread: " << expiredThread << "!" << std::endl;

                // TODO: Replace with the reset action your system requires.
                std::terminate();
            }

            // Nothing has expired: kick the external (system) watchdog here.
            std::cout << "Kicked System!" << std::endl;

            // Sleep until the next registered thread is due to expire.
            std::this_thread::sleep_until(GetNextExpireTime());
        }
    }
};

#endif

Usage of the Example Implementation

You can demonstrate the behavior with the following code snippet.

#include <iostream>
#include <thread>
#include <chrono>
#include "MultiThreadWatchdog.h"

int main() {
    // Initialize with 3sec expire time and max 10 threads.
    MultiThreadWatchdog wdg(3, 10);

    // Kick 5 times and stop.
    int i = 5;
    while(true) {
        if(i > 0) {
            wdg.Kick();
            --i;
        } else {
            wdg.Done();
        }

        using namespace std::chrono_literals;
        std::this_thread::sleep_for(1s);
    }

    // Should never reach here.
    std::cout << "Bye" << std::endl;
    return 0;
}

Compile it with g++ main.cpp -std=c++17 -lpthread and run the program to see output like the one below.

Kicked from: 139648404215616
Kicked System!
Kicked from: 139648404215616
Kicked from: 139648404215616
Kicked System!
Kicked from: 139648404215616
Kicked from: 139648404215616
Kicked System!
Thread: 139648404215616 is done!
Kicked System!
Kicked System!
Kicked System!
Kicked System!
...

If you comment out the wdg.Done() call (line 17 of main.cpp), you can observe an expire event.

Kicked from: 140708338083648
Kicked System!
Kicked from: 140708338083648
Kicked from: 140708338083648
Kicked System!
Kicked from: 140708338083648
Kicked from: 140708338083648
Kicked System!
Expired thread: 140708338083648!
terminate called without an active exception
Aborted

Conclusion

This is only a reference design to guide you. Adding more features, such as thread-specific timings or support for callbacks on expiration, can take your design one step further.
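
To give an idea of what such an extension could look like, here is a rough, single-threaded sketch. Everything in it is an assumption for illustration: the name PerClientDeadlines, the int ids and the integer clock passed in by the caller. With thread-specific timeouts, appending to a list no longer keeps the entries sorted, so this sketch swaps the list for a std::multimap ordered by deadline. A real version would also need the mutex used in the class above.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <unordered_map>
#include <utility>

// Hypothetical sketch: per-client timeouts plus a callback on expiration.
class PerClientDeadlines {
public:
    using ExpireCallback = std::function<void(int id)>;

    explicit PerClientDeadlines(ExpireCallback onExpire)
        : m_onExpire(std::move(onExpire)) {}

    // Register or renew `id` with its own timeout.
    void Kick(int id, int now, int timeout) {
        assert(timeout > 0);
        Done(id);  // drop the old deadline, if any
        m_index[id] = m_byDeadline.emplace(now + timeout, id);
    }

    // Forget `id` completely; returns false if it was not registered.
    bool Done(int id) {
        auto found = m_index.find(id);
        if (found == m_index.end()) return false;
        m_byDeadline.erase(found->second);
        m_index.erase(found);
        return true;
    }

    // Remove every client whose deadline has passed and invoke the callback
    // for it; returns how many clients expired.
    int Poll(int now) {
        int expired = 0;
        while (!m_byDeadline.empty() && m_byDeadline.begin()->first < now) {
            const int id = m_byDeadline.begin()->second;
            m_index.erase(id);
            m_byDeadline.erase(m_byDeadline.begin());
            ++expired;
            if (m_onExpire) m_onExpire(id);
        }
        return expired;
    }

private:
    ExpireCallback m_onExpire;
    std::multimap<int, int> m_byDeadline;  // deadline -> id, earliest first
    std::unordered_map<int, std::multimap<int, int>::iterator> m_index;
};
```

The price of the ordered container is a logarithmic insert instead of the constant-time list operation used in the reference design.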

Tags: concurrent, cpp, hashmap, multi-thread, queue, watchdog