Health Monitoring

Last Updated on: 2024-01-18 07:23:53

This topic describes device state monitoring and regular health checks. The framework informs the application of any anomalies detected on the device, and the application then executes the designated actions. For example, it can send alerts to stakeholders or perform automatic repairs, such as a reset. Health monitoring is essential for ensuring the reliability and stability of your services.

Features

Types of health monitoring metrics

  • Query

    Regularly query key metrics such as memory usage and queue depth.

  • Event

    Monitor anomalies and errors, such as HTTP API call errors.

    Event-based monitoring subscribes to a public event ID through the framework’s event service. When an exception occurs, the service that detects it publishes this event. The input parameter of the event is the index of the monitoring metric.

    #define EVENT_HEALTH_ALERT      "health.alert"      // Public event ID for health monitoring 
    

System health monitoring metrics

The framework comes with metrics for system health monitoring to check if the framework runs properly. When you call the health monitoring initialization API, these metrics will be automatically added to the monitoring list.

| Macro definition | Metrics | Type | Exception count (threshold) | Monitoring period (seconds) | Exception | Exception handling |
| HEALTH_RULE_FREE_MEM_SIZE | Free memory | Query | 1 | 600 | Memory less than 8 KB | Restart the device |
| HEALTH_RULE_MAX_MEM_SIZE | The maximum memory that can be requested at a time | Query | 1 | 600 | / | / |
| HEALTH_RULE_ATOP_REFUSE | Access to cloud API denied | Event | 5 | / | / | / |
| HEALTH_RULE_ATOP_SIGN_FAILED | Sign error in access to cloud API | Event | 5 | / | / | / |
| HEALTH_RULE_WORKQ_DEPTH | Queue depth | Query | 1 | 600 | Queue depth exceeds 50 | Print system queue information for troubleshooting |
| HEALTH_RULE_MSGQ_NUM | Number of message queues | Query | 1 | 600 | Queue depth exceeds 50 | Print message queue information for troubleshooting |
| HEALTH_RULE_TIMER_NUM | Number of software timers | Query | 1 | 600 | The number exceeds 100 | / |
| HEALTH_RULE_FEED_WATCH_DOG | Feeding the watchdog | Query | 0 | 20 | / | / |
| HEALTH_RULE_RUNTIME_REPT | Report real-time device status to the cloud: timestamp, daylight saving time, free memory, and signal strength | Query | 0 | 3,600 | / | / |

How it works

Query

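Flowchart of query-based monitoring:

  • Start. When the monitoring period is reached, the query callback is invoked. Otherwise, the service waits for the next monitoring period.

  • If no anomaly is detected, the exception count is reset.

  • If an anomaly is detected, the exception count is incremented.

  • If the exceptions exceed the threshold, the notification callback is invoked. Otherwise, the service waits for the next monitoring period.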
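The query flow can be sketched as a small per-metric state machine. This is a minimal illustration, not the framework's implementation: `health_item_t`, `health_item_tick`, and the two sample callbacks are hypothetical names. The sketch assumes that the notification fires when the count is strictly greater than the threshold, and that the count restarts from zero after a notification.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one monitored metric; not the framework's real structure. */
typedef bool (*query_cb)(void);   /* returns true when an anomaly is detected */
typedef void (*notify_cb)(void);  /* runs once the exception count exceeds the threshold */

typedef struct {
    unsigned threshold;   /* exception count threshold */
    unsigned period_s;    /* monitoring period in seconds */
    unsigned elapsed_s;   /* seconds since the last query */
    unsigned exceptions;  /* anomalies counted since the last reset */
    query_cb query;
    notify_cb notify;
} health_item_t;

/* Advance one item by one second. Returns true if the threshold was exceeded in this tick. */
bool health_item_tick(health_item_t *item)
{
    if (item == NULL || item->query == NULL) {
        return false;
    }
    if (++item->elapsed_s < item->period_s) {
        return false;               /* monitoring period not reached yet */
    }
    item->elapsed_s = 0;

    if (!item->query()) {
        item->exceptions = 0;       /* no anomaly: reset the exception count */
        return false;
    }
    item->exceptions++;             /* anomaly: count it */
    if (item->exceptions <= item->threshold) {
        return false;               /* threshold not exceeded: wait for the next period */
    }
    if (item->notify != NULL) {
        item->notify();
    }
    item->exceptions = 0;           /* assumption: start counting afresh after notifying */
    return true;
}

/* Sample callbacks: a fake metric that is always unhealthy. */
unsigned g_notified;
bool always_unhealthy(void) { return true; }
void count_notification(void) { g_notified++; }
```

With a threshold of 1 and a period of 2 seconds, the sample metric triggers its notification on the fourth tick, that is, on the second consecutive anomaly.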

Event

Sequence of event-based monitoring (participants: Health Monitoring, Event Service, and Other Services):

  • Health Monitoring subscribes to the public event for health monitoring, EVENT_HEALTH_ALERT, through Event Service.

  • An exception occurs in Other Services, which publish the event to Event Service.

  • Event Service pushes the event to Health Monitoring.

  • Health Monitoring gets the monitoring metrics index from the input parameters, determines if the exception count exceeds the threshold, and takes action accordingly.
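The event flow can be illustrated with a toy publish/subscribe loop. Everything except `EVENT_HEALTH_ALERT` is hypothetical: `demo_subscribe` and `demo_publish` stand in for the framework's event service, and the sketch assumes that the input parameter is a pointer to an integer metric index and that the count restarts after an action is taken.

```c
#include <stddef.h>
#include <string.h>

#define EVENT_HEALTH_ALERT "health.alert"  /* public event ID from this topic */
#define DEMO_METRIC_COUNT  4               /* hypothetical number of metrics */

/* Hypothetical stand-in for the event service: a single subscriber slot. */
typedef void (*event_handler)(void *data);
static const char *g_event_name;
static event_handler g_handler;

void demo_subscribe(const char *name, event_handler handler)
{
    g_event_name = name;
    g_handler = handler;
}

/* Returns 0 if a subscriber received the event, -1 otherwise. */
int demo_publish(const char *name, void *data)
{
    if (g_handler == NULL || g_event_name == NULL || strcmp(g_event_name, name) != 0) {
        return -1;
    }
    g_handler(data);
    return 0;
}

/* Health monitoring side: per-metric exception counters and thresholds. */
unsigned g_exceptions[DEMO_METRIC_COUNT];
unsigned g_thresholds[DEMO_METRIC_COUNT] = { 1, 1, 5, 5 };
unsigned g_actions_taken;

void on_health_alert(void *data)
{
    if (data == NULL) {
        return;
    }
    /* Assumption: the input parameter points to the metric index. */
    int index = *(int *)data;
    if (index < 0 || index >= DEMO_METRIC_COUNT) {
        return;
    }
    g_exceptions[index]++;
    if (g_exceptions[index] > g_thresholds[index]) {
        g_actions_taken++;          /* placeholder for the real exception handling */
        g_exceptions[index] = 0;    /* assumption: restart counting after acting */
    }
}

/* Other services: report one exception against a metric. */
int report_exception(int metric_index)
{
    return demo_publish(EVENT_HEALTH_ALERT, &metric_index);
}
```

After `demo_subscribe(EVENT_HEALTH_ALERT, on_health_alert)`, six calls to `report_exception(2)` are needed before the action runs, because metric 2 has a threshold of 5 in this sketch.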

Development guide

How to use

  • The framework will call the API for health monitoring initialization during system service initialization. The query and notification callbacks will be automatically registered and handled accordingly. In other words, you only need to call the system service initialization API.

  • To manage custom metrics, call the respective API to add, delete, or update metrics.

API description

Initialize health monitoring service

Create a thread for health monitoring and register the health metrics. This API will be called during system service initialization.

/**
 * @brief devos health init function
 *
 * @return OPRT_OK on success. Others on error, please refer to tuya_error_code.h
 */
INT_T tuya_devos_health_init_and_start();

Add custom metrics

typedef VOID (*health_notify_cb)();
typedef BOOL_T(*health_query_cb)();

/**
 * @brief add health item
 *
 * @param[in] threshold: the threshold
 * @param[in] period: the period
 * @param[in] query: query cb
 * @param[in] notify: notify cb
 *
 * @return the type ID of the new item: greater than 0 on success, other values on failure
 */
INT_T tuya_devos_add_health_item(UINT_T threshold,UINT_T period,health_query_cb query,health_notify_cb notify);
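A custom metric might be registered along the following lines. This is a sketch under stated assumptions: the type definitions and the body of `tuya_devos_add_health_item` below are stand-ins so that the example compiles outside the SDK, and the `demo_*` names are hypothetical. It also assumes that the query callback returns TRUE to report an anomaly, which the declarations above do not state.

```c
#include <stddef.h>

/* Stand-ins so that the sketch compiles outside the SDK.
 * In a real project, include the SDK headers and delete this section. */
typedef int INT_T;
typedef unsigned int UINT_T;
typedef int BOOL_T;
#define VOID  void
#define TRUE  1
#define FALSE 0
typedef VOID (*health_notify_cb)();
typedef BOOL_T (*health_query_cb)();

/* Mock of the SDK API: hands out increasing type IDs starting at 1. */
INT_T tuya_devos_add_health_item(UINT_T threshold, UINT_T period,
                                 health_query_cb query, health_notify_cb notify)
{
    static INT_T next_type = 1;
    (void)threshold;
    (void)period;
    (void)query;
    (void)notify;
    return next_type++;
}

/* Application code starts here. */
#define DEMO_QUEUE_LIMIT 50u    /* hypothetical limit for a custom queue */
UINT_T g_demo_queue_depth;      /* hypothetical metric, updated elsewhere */
INT_T g_demo_alerts;

/* Query callback. Assumption: TRUE reports an anomaly for this period. */
BOOL_T demo_queue_query(VOID)
{
    return (g_demo_queue_depth > DEMO_QUEUE_LIMIT) ? TRUE : FALSE;
}

/* Notification callback: a real handler might log, send an alert, or reset. */
VOID demo_queue_notify(VOID)
{
    g_demo_alerts++;
}

/* Register the metric: threshold of 3 exceptions, queried every 60 seconds.
 * Returns the type ID (greater than 0) or -1 if registration failed. */
INT_T demo_register_queue_monitor(VOID)
{
    INT_T type = tuya_devos_add_health_item(3, 60, demo_queue_query, demo_queue_notify);
    if (type <= 0) {
        return -1;
    }
    return type;
}
```

Keep the returned type ID, because the delete and update APIs take it as the `type` argument.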

Delete metrics

/**
 * @brief delete health item
 *
 * @param[in] type: the type
 *
 */
VOID tuya_devos_delete_health_item(INT_T type);

Update monitoring period

/**
 * @brief update health item period
 *
 * @param[in] type: the type
 * @param[in] period: the period
 *
 */
VOID tuya_devos_update_health_item_period(INT_T type,UINT_T period);

Update exception count threshold

/**
 * @brief update health item threshold
 *
 * @param[in] type: the type
 * @param[in] threshold: the threshold
 *
 */
VOID tuya_devos_update_health_item_threshold(INT_T type,UINT_T threshold);
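The delete and update APIs could be combined as in the sketch below, which temporarily tightens a custom metric and later removes it. The SDK types and the four API bodies are mocked so that the example compiles on its own. The mock table and the `demo_*` names are hypothetical, and the mock says nothing about how the real service stores its items.

```c
#include <stddef.h>

/* Stand-ins so that the sketch compiles outside the SDK. */
typedef int INT_T;
typedef unsigned int UINT_T;
typedef int BOOL_T;
#define VOID  void
#define TRUE  1
#define FALSE 0
typedef VOID (*health_notify_cb)();
typedef BOOL_T (*health_query_cb)();

/* Mock item table indexed by type ID; slot 0 is unused so that IDs start at 1. */
#define MOCK_MAX_ITEMS 8
typedef struct { BOOL_T used; UINT_T threshold; UINT_T period; } mock_item_t;
mock_item_t g_mock_items[MOCK_MAX_ITEMS];

static mock_item_t *mock_find(INT_T type)
{
    if (type <= 0 || type >= MOCK_MAX_ITEMS || !g_mock_items[type].used) {
        return NULL;
    }
    return &g_mock_items[type];
}

INT_T tuya_devos_add_health_item(UINT_T threshold, UINT_T period,
                                 health_query_cb query, health_notify_cb notify)
{
    (void)query;
    (void)notify;
    for (INT_T type = 1; type < MOCK_MAX_ITEMS; type++) {
        if (!g_mock_items[type].used) {
            g_mock_items[type] = (mock_item_t){ TRUE, threshold, period };
            return type;
        }
    }
    return -1;  /* mock table is full */
}

VOID tuya_devos_delete_health_item(INT_T type)
{
    mock_item_t *item = mock_find(type);
    if (item != NULL) { item->used = FALSE; }
}

VOID tuya_devos_update_health_item_period(INT_T type, UINT_T period)
{
    mock_item_t *item = mock_find(type);
    if (item != NULL) { item->period = period; }
}

VOID tuya_devos_update_health_item_threshold(INT_T type, UINT_T threshold)
{
    mock_item_t *item = mock_find(type);
    if (item != NULL) { item->threshold = threshold; }
}

/* Application code starts here. */
BOOL_T demo_query(VOID) { return FALSE; }   /* hypothetical: never reports an anomaly */
VOID demo_notify(VOID) { }

/* Check more often and react to the first exception, for example during an upgrade. */
VOID demo_tighten(INT_T type)
{
    tuya_devos_update_health_item_period(type, 10);
    tuya_devos_update_health_item_threshold(type, 1);
}
```

A typical sequence is to add the item, call `demo_tighten(type)` while closer monitoring is needed, and call `tuya_devos_delete_health_item(type)` once the metric is no longer relevant.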

Example

The service_health_manager example in the TuyaOS example collection contains the complete example code.

FAQs

What is the priority of health monitoring?

The health monitoring thread runs at the highest priority, THREAD_PRIO_0.

What other features does health monitoring provide apart from anomaly detection?

It also schedules periodic tasks: watchdog feeding (HEALTH_RULE_FEED_WATCH_DOG) and runtime status reporting to the cloud (HEALTH_RULE_RUNTIME_REPT).