MapR 5.0 Documentation: Node Alarms

Node alarms indicate problems in individual nodes. The following tables describe the MapR node alarms.

CLDB Service Alarm

UI Column

CLDB Alarm

Logged As

NODE_ALARM_SERVICE_CLDB_DOWN

Meaning

The CLDB service on the node has stopped running.

Resolution

Go to the Manage Services pane of the Node Properties View to check whether the CLDB service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf. If the warden successfully restarts the CLDB service, the alarm is cleared. If the warden is unable to restart the CLDB service, it may be necessary to contact technical support.
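To make the retry interval concrete, the following is a minimal sketch of reading services.retryinterval.time.sec from the text of warden.conf. It assumes the file uses one key=value pair per line and that lines starting with '#' are comments. The function name read_retry_interval and the sample text are hypothetical, not part of MapR.

```python
DEFAULT_RETRY_INTERVAL_SEC = 30 * 60  # the 30-minute default described above


def read_retry_interval(conf_text: str) -> int:
    """Return services.retryinterval.time.sec from warden.conf-style text.

    Assumes one key=value pair per line and '#' comment lines. Falls back
    to the 30-minute default when the key is absent.
    """
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "services.retryinterval.time.sec":
            return int(value.strip())
    return DEFAULT_RETRY_INTERVAL_SEC


# Hypothetical excerpt; a real warden.conf contains many more properties.
sample = "# hypothetical excerpt\nservices.retryinterval.time.sec=600\n"
print(read_retry_interval(sample))  # 600
print(read_retry_interval(""))      # 1800
```

The same retry behavior and parameter apply to the other service alarms described in this document.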

Core Present Alarm

UI Column

Core Present

Logged As

NODE_ALARM_CORE_PRESENT

Meaning

A service on the node has crashed and created a core dump file. When all core files are removed, the alarm is cleared.

Resolution

Contact technical support.

Debug Logging Active

UI Column

Excess Logs Alarm

Logged As

NODE_ALARM_DEBUG_LOGGING

Meaning

Debug logging is enabled on the node.

Resolution

Debug logging generates enormous amounts of data and can fill up disk space. If debug logging is not absolutely necessary, turn it off using either the Manage Services pane in the Node Properties view or the setloglevel command. If it is absolutely necessary, make sure that the logs in /opt/mapr/logs are not in danger of filling the entire disk.
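One way to keep an eye on log growth is to compare the size of the log directory with the free space left on its filesystem. The sketch below uses only the Python standard library. The helper names are hypothetical, and the demonstration at the end runs only if /opt/mapr/logs exists on the machine.

```python
import os
import shutil


def directory_size_bytes(path: str) -> int:
    """Total size of regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def free_bytes(path: str) -> int:
    """Free space on the filesystem that holds path."""
    return shutil.disk_usage(path).free


log_dir = "/opt/mapr/logs"
if os.path.isdir(log_dir):
    print(f"{log_dir}: {directory_size_bytes(log_dir)} bytes of logs, "
          f"{free_bytes(log_dir)} bytes free on its filesystem")
```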

Disk Failure

UI Column

Disk Failure Alarm

Logged As

NODE_ALARM_DISK_FAILURE

Meaning

A disk has failed on the node.

Resolution

Check the disk health log (/opt/mapr/logs/faileddisk.log) to determine which disk failed and view any SMART data provided by the disk. See Handling Disk Failure.

Duplicate Host ID

UI Column

Duplicate Host Id

Logged As

NODE_ALARM_DUPLICATE_HOSTID

Meaning

Two or more nodes in the cluster have the same host ID.

Resolution

Multiple nodes with the same host ID are prevented from joining the cluster, in order to prevent addressing problems that can lead to data loss. To correct the problem and clear the alarm, make sure all host IDs are unique and use the maprcli node allow-into-cluster command to un-ban the affected host IDs.
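Before un-banning host IDs, it can help to confirm which nodes actually share an ID. The following sketch assumes you have already collected a mapping of node name to host ID by some means of your own. The function name and the sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def find_duplicate_host_ids(host_ids: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each host ID shared by two or more nodes to the sorted node names."""
    nodes_by_id: Dict[str, List[str]] = defaultdict(list)
    for node, host_id in host_ids.items():
        nodes_by_id[host_id].append(node)
    return {hid: sorted(nodes) for hid, nodes in nodes_by_id.items() if len(nodes) > 1}


# Hypothetical node names and host IDs.
collected = {"node1": "1a2b", "node2": "3c4d", "node3": "1a2b"}
print(find_duplicate_host_ids(collected))  # {'1a2b': ['node1', 'node3']}
```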

FileServer Service Alarm

UI Column

FileServer Alarm

Logged As

NODE_ALARM_SERVICE_FILESERVER_DOWN

Meaning

The FileServer service on the node has stopped running.

Resolution

Go to the Manage Services pane of the Node Properties View to check whether the FileServer service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf. If the warden successfully restarts the FileServer service, the alarm is cleared. If the warden is unable to restart the FileServer service, it may be necessary to contact technical support.

HBMaster Service Alarm

UI Column

HBase Master Alarm

Logged As

NODE_ALARM_SERVICE_HBMASTER_DOWN

Meaning

The HBMaster service on the node has stopped running.

Resolution

Go to the Manage Services pane of the Node Properties View to check whether the HBMaster service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf. If the warden successfully restarts the HBMaster service, the alarm is cleared. If the warden is unable to restart the HBMaster service, it may be necessary to contact technical support.

HBRegion Service Alarm

UI Column

HBase RegionServer Alarm

Logged As

NODE_ALARM_SERVICE_HBREGION_DOWN

Meaning

The HBRegion service on the node has stopped running.

Resolution

Go to the Manage Services pane of the Node Properties View to check whether the HBRegion service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf. If the warden successfully restarts the HBRegion service, the alarm is cleared. If the warden is unable to restart the HBRegion service, it may be necessary to contact technical support.

Heartbeat Processing Slow

UI Column

Heartbeat Processing Slow Alarm

Logged As

NODE_ALARM_HB_PROCESSING_SLOW

Meaning

The time that has elapsed since the CLDB processed the previous heartbeat from the MFS node has exceeded 5 seconds.

Resolution

When the CLDB processes a heartbeat from a node, it compares the current time to the time at which it processed the previous heartbeat from that node. If the elapsed time exceeds 5 seconds, this alarm is raised. If this alarm occurs frequently, investigate what might be causing the relevant node or nodes to be busy, or whether the CLDB nodes have enough resources to handle their load.
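The comparison described above can be modeled with a toy example. This is an illustration of the logic only, not the CLDB's actual code. The class name HeartbeatMonitor and the node name are hypothetical, and times are plain seconds supplied by the caller.

```python
from typing import Dict

SLOW_HEARTBEAT_THRESHOLD_SEC = 5.0  # threshold stated above


class HeartbeatMonitor:
    """Toy model of the elapsed-time check; not the CLDB implementation."""

    def __init__(self) -> None:
        self._last_processed: Dict[str, float] = {}

    def process(self, node: str, now: float) -> bool:
        """Record a heartbeat processed at time `now` (in seconds).

        Returns True when the gap since the node's previously processed
        heartbeat exceeds the threshold, that is, when the alarm would be raised.
        """
        previous = self._last_processed.get(node)
        self._last_processed[node] = now
        return previous is not None and (now - previous) > SLOW_HEARTBEAT_THRESHOLD_SEC


monitor = HeartbeatMonitor()
print(monitor.process("node-a", 0.0))  # False: first heartbeat, nothing to compare
print(monitor.process("node-a", 1.0))  # False: 1-second gap
print(monitor.process("node-a", 7.5))  # True: 6.5-second gap
```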

HiveMeta Alarm

UI Column

HiveMeta Alarm

Logged As

NODE_ALARM_SERVICE_HIVEMETA_DOWN

Meaning

The HiveMeta service on the node has stopped running.

Resolution

Go to the Manage Services pane of the Node Properties View to check whether Hive Metastore is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf. If the warden successfully restarts the service, the alarm is cleared. If the warden is unable to restart the service, it may be necessary to contact technical support.

HiveServer 2 Alarm

UI Column

HiveServer 2 Alarm

Logged As

NODE_ALARM_SERVICE_HS2_DOWN

Meaning

The HiveServer 2 service on the node has stopped running.

Resolution

Go to the Manage Services pane of the Node Properties View to check whether HiveServer 2 is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf. If the warden successfully restarts the service, the alarm is cleared. If the warden is unable to restart the service, it may be necessary to contact technical support.

Hoststats Alarm

UI Column

HostStats

Logged As

NODE_ALARM_SERVICE_HOSTSTATS_DOWN

Meaning

The Hoststats service on the node has stopped running.

Resolution

Go to the Manage Services pane of the Node Properties View to check whether the Hoststats service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf. If the warden successfully restarts the service, the alarm is cleared. If the warden is unable to restart the service, it may be necessary to contact technical support.

Incorrect Topology Alarm

UI Column

CLDB Alarm

Logged As

NODE_ALARM_INCORRECT_TOPOLOGY_ALARM

Meaning

The mapr.cldb.internal volume's topology (normally /cldb) must include all CLDB nodes. This alarm signifies that one or more CLDB nodes are outside the CLDB volume's topology.

Resolution

There are two ways to resolve this alarm:

  • Move any stray CLDB nodes into the topology in which mapr.cldb.internal resides. See Setting Up Topology for more information.
  • Change the volume topology of mapr.cldb.internal to include the stray CLDB nodes. See Managing Data with Volumes for more information.
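To find stray CLDB nodes, you can compare each CLDB node's topology path with the volume's topology. The sketch below assumes topologies are slash-separated paths and that a node is inside a topology when its path starts with the same path segments. The function names, node names, and paths are hypothetical.

```python
from typing import Dict, List


def is_within_topology(node_topology: str, volume_topology: str) -> bool:
    """True when node_topology lies at or below volume_topology.

    Compares whole path segments, so /cldb-old is not treated as part of /cldb.
    """
    node_parts = [p for p in node_topology.split("/") if p]
    volume_parts = [p for p in volume_topology.split("/") if p]
    return node_parts[:len(volume_parts)] == volume_parts


def stray_cldb_nodes(cldb_nodes: Dict[str, str], volume_topology: str = "/cldb") -> List[str]:
    """Names of CLDB nodes whose topology is outside volume_topology."""
    return sorted(
        name for name, topo in cldb_nodes.items()
        if not is_within_topology(topo, volume_topology)
    )


# Hypothetical CLDB nodes and their topology paths.
nodes = {
    "cldb1": "/cldb/rack1/cldb1",
    "cldb2": "/data/rack2/cldb2",
    "cldb3": "/cldb-old/rack1/cldb3",
}
print(stray_cldb_nodes(nodes))  # ['cldb2', 'cldb3']
```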

Installation Directory Full Alarm

UI Column

Installation Directory Full

Logged As

NODE_ALARM_OPT_MAPR_FULL

Meaning

The partition /opt/mapr on the node is running out of space (95% full).

Resolution

Free up some space in /opt/mapr on the node.
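A quick way to see how close a partition is to its alarm threshold is to compute its used percentage. The sketch below uses the thresholds quoted in this document (95% for /opt/mapr, and 99% for the root partition under the Root Partition Full alarm). How MapR itself measures fullness is not specified here, so used divided by total is an assumption, and the helper names are hypothetical.

```python
import os
import shutil

# Thresholds quoted in this document, in percent.
THRESHOLDS_PERCENT = {"/opt/mapr": 95.0, "/": 99.0}


def percent_full(used_bytes: int, total_bytes: int) -> float:
    """Used space as a percentage of total space."""
    return 100.0 * used_bytes / total_bytes


def is_full(path: str, threshold_percent: float) -> bool:
    """True when the filesystem holding path is at or above the threshold."""
    usage = shutil.disk_usage(path)
    return percent_full(usage.used, usage.total) >= threshold_percent


for path, threshold in THRESHOLDS_PERCENT.items():
    if os.path.exists(path):
        state = "at or above" if is_full(path, threshold) else "below"
        print(f"{path} is {state} {threshold:.0f}% full")
```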

JobTracker Service Alarm

UI Column

JobTracker Alarm

Logged As

NODE_ALARM_SERVICE_JT_DOWN

Meaning

The JobTracker service on the node has stopped running.

Resolution

Go to the Manage Services pane of the Node Properties View to check whether the JobTracker service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf. If the warden successfully restarts the JobTracker service, the alarm is cleared. If the warden is unable to restart the JobTracker service, it may be necessary to contact technical support.

MapR-FS High Memory Alarm

UI Column

High FileServer Memory Alarm

Logged As

NODE_ALARM_HIGH_MFS_MEMORY

Meaning

Memory consumed by the FileServer service on the node exceeds the allotted amount.

Resolution

Log on as root to the node for which the alarm is raised, and restart the Warden:

service mapr-warden restart

MapR User Mismatch

UI Column

MapR User Mismatch Alarm

Logged As

NODE_ALARM_MAPRUSER_MISMATCH

Meaning

The cluster nodes are not all set up to run MapR services as the same user (for example, some nodes are running MapR as root while others are running as mapr_user).

Resolution

For the nodes on which the User Mismatch alarm is raised, follow the steps in Changing the User for MapR Services.

Memory Allocation Alarm

UI Column

Memory Allocation Alarm

Logged As

NODE_ALARM_MEMORY_ALLOCATION_EXCEEDED

Meaning

The percentage of system memory required to run services on the node exceeds the set threshold and could potentially overload the node. If you install a service on the node that causes the total memory used by the services on the node to exceed the threshold, the system raises the alarm.

Resolution

To clear the alarm, you can add more memory to the node, stop a service from running on the node, or remove a service from the node. You can run the service list command to see the memory allocated to each service on the node. See service list for more information. 

The services.memoryallocation.alarm.threshold property in warden.conf defines the maximum amount of system memory that services running on the node can use before triggering the alarm. The default setting for this property is 95 percent:

services.memoryallocation.alarm.threshold=95 

The percentage of system memory that services can use on the node should not exceed 95. Restart the Warden service on the node after you edit the warden.conf file.
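The threshold check can be illustrated with simple arithmetic. The sketch below assumes the alarm compares the sum of per-service memory allocations against a percentage of total system memory. The function name, service names, and memory figures are hypothetical.

```python
from typing import Dict


def allocation_exceeds_threshold(
    service_memory_mb: Dict[str, int],
    system_memory_mb: int,
    threshold_percent: float = 95.0,  # default of services.memoryallocation.alarm.threshold
) -> bool:
    """True when the summed service allocations exceed the threshold percentage."""
    allocated_mb = sum(service_memory_mb.values())
    return 100.0 * allocated_mb / system_memory_mb > threshold_percent


# Hypothetical allocations on a node with 16000 MB of memory.
services = {"cldb": 4000, "fileserver": 8000, "nodemanager": 3500}
print(allocation_exceeds_threshold(services, 16000))  # True: 15500 MB is about 96.9%

del services["nodemanager"]
print(allocation_exceeds_threshold(services, 16000))  # False: 12000 MB is 75%
```

Removing a service in the example mirrors one of the resolutions listed above.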

Memory Usage Alarm

UI Column

Memory Usage Alarm

Logged As

NODE_ALARM_MEMORY_SWAPPING

Meaning

The HostStats service raises this alarm when the change in swapped-in memory and the change in swapped-out memory exceed the configured threshold over a specific time period.

Resolution

To clear the alarm, you can increase the physical memory or reduce the load running on the node. You can run the service list command to see the memory allocated to each service on the node. See service list for more information. 

The memory swapping alarm is controlled by the following properties in /opt/mapr/conf/hoststats.conf:

  • alarm.swapping.threshold
  • alarm.swapping.counter 

The memory threshold for swap in and swap out is defined by the alarm.swapping.threshold property, which is set to 100MB by default. The duration over which HostStats checks the delta of the memory is defined by the alarm.swapping.counter property, which is set to 100 seconds by default.
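The following sketch shows one possible reading of this check: within a single observation window, the alarm condition holds when both the swap-in delta and the swap-out delta exceed the threshold. That reading, the function name, and treating 100MB as 100 * 1024 * 1024 bytes are all assumptions. The actual HostStats logic may differ.

```python
# Assumed byte value for the 100MB default of alarm.swapping.threshold.
SWAP_THRESHOLD_BYTES = 100 * 1024 * 1024
# Default of alarm.swapping.counter: the window length in seconds.
SWAP_WINDOW_SEC = 100


def swapping_alarm(
    swap_in_delta_bytes: int,
    swap_out_delta_bytes: int,
    threshold_bytes: int = SWAP_THRESHOLD_BYTES,
) -> bool:
    """True when both deltas measured over one window exceed the threshold.

    The deltas are the growth in swapped-in and swapped-out memory between
    the start and the end of a SWAP_WINDOW_SEC window.
    """
    return swap_in_delta_bytes > threshold_bytes and swap_out_delta_bytes > threshold_bytes


MB = 1024 * 1024
print(swapping_alarm(150 * MB, 120 * MB))  # True: both deltas are above 100MB
print(swapping_alarm(150 * MB, 10 * MB))   # False: swap-out delta is below 100MB
```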

Metrics Write Problem Alarm

UI Column

Metrics write problem Alarm

Logged As

NODE_ALARM_METRICS_WRITE_PROBLEM

Meaning

The node is unable to write Metrics data to the database or to the MapR-FS local Metrics volume.

Resolution

This issue can have multiple causes. To clear the alarm, check the log file at /opt/mapr/logs/hoststats.log for the cause of the write failure. In the case of database access failure, restore write access to the MySQL database. For more information, consult the process outlined in Setting up the MapR Metrics Database.

NFS Gateway Alarm

UI Column

NFS Alarm

Logged As

NODE_ALARM_SERVICE_NFS_DOWN

Meaning

The NFS service on the node has stopped running.

Resolution

Go to the Manage Services pane of the Node Properties View to check whether the NFS service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf. If the warden successfully restarts the NFS service, the alarm is cleared. If the warden is unable to restart the NFS service, it may be necessary to contact technical support.

No Heartbeat Alarm

UI Column

No Heartbeat Alarm

Logged As

NODE_ALARM_NO_HEARTBEAT

Meaning

The node is not undergoing maintenance, and no heartbeat has been detected for more than 5 minutes.

Resolution

Check the node's status manually.

Node Too Many Containers

UI Column

Too Many Containers Alarm

Logged As

NODE_ALARM_TOO_MANY_CONTAINERS

Meaning

The number of containers on this node has reached the maximum limit.

Resolution

Decrease the number of containers. You can reset the maximum with:

maprcli node modify -nodes <host> -maxContainers <number>

Oozie Alarm

UI Column

Oozie Alarm

Logged As

NODE_ALARM_SERVICE_OOZIE_DOWN

Meaning

The Oozie service on the node has stopped running.

Resolution

Go to the Manage Services pane of the Node Properties View to check whether the Oozie service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf. If the warden successfully restarts the Oozie service, the alarm is cleared. If the warden is unable to restart the Oozie service, it may be necessary to contact technical support.

PAM Misconfigured Alarm

UI Column

Pam Misconfigured Alarm

Logged As

NODE_ALARM_PAM_MISCONFIGURED

Meaning

The PAM authentication on the node is configured incorrectly.

Resolution

See PAM Configuration.

Root Partition Full Alarm

UI Column

Root Partition Full

Logged As

NODE_ALARM_ROOT_PARTITION_FULL

Meaning

The root partition ('/') on the node is running out of space (99% full).

Resolution

Free up some space in the root partition of the node.

Spark Alarm

UI Column

Spark Alarm

Logged As

NODE_ALARM_SERVICE_SPARK_DOWN

Meaning

The Spark service on the node has stopped running.

Resolution

Go to the Manage Services pane of the Node Properties View to check whether Spark is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf. If the warden successfully restarts Spark, the alarm is cleared. If the warden is unable to restart Spark, it may be necessary to contact technical support.

Spark History Server Alarm

UI Column

Spark History Server Alarm

Logged As

NODE_ALARM_SERVICE_SPARK_HISTORY_SERVER_DOWN

Meaning

The Spark History Server on the node has stopped running.

Resolution

Go to the Manage Services pane of the Node Properties View to check whether Spark History Server is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf. If the warden successfully restarts Spark History Server, the alarm is cleared. If the warden is unable to restart Spark History Server, it may be necessary to contact technical support.

TaskTracker Service Alarm

UI Column

TaskTracker Alarm

Logged As

NODE_ALARM_SERVICE_TT_DOWN

Meaning

The TaskTracker service on the node has stopped running.

Resolution

Go to the Manage Services pane of the Node Properties View to check whether the TaskTracker service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf. If the warden successfully restarts the TaskTracker service, the alarm is cleared. If the warden is unable to restart the TaskTracker service, it may be necessary to contact technical support.

TaskTracker Local Directory Full Alarm

UI Column

TaskTracker Local Directory Full Alarm

Logged As

NODE_ALARM_TT_LOCALDIR_FULL

Meaning

The local directory used by the TaskTracker on the specified node(s) is full, and the TaskTracker cannot operate as a result.

Resolution

Delete or move data from the local disks, or add storage to the specified node(s), and try the jobs again.

Time Skew Alarm

UI Column

Time Skew Alarm

Logged As

NODE_ALARM_TIME_SKEW

Meaning

The clock on the node is out of sync with the master CLDB by more than 20 seconds.

Resolution

Use NTP to synchronize the time on all the nodes in the cluster.
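To see which nodes are affected before and after synchronizing, you can compare each node's clock reading with the master CLDB's. The sketch below assumes you have gathered the readings yourself as seconds on a common scale. The function name, node names, and readings are hypothetical.

```python
from typing import Dict, List

TIME_SKEW_THRESHOLD_SEC = 20.0  # threshold stated above


def skewed_nodes(
    node_clocks: Dict[str, float],
    cldb_clock: float,
    threshold_sec: float = TIME_SKEW_THRESHOLD_SEC,
) -> List[str]:
    """Names of nodes whose clock differs from the CLDB's by more than the threshold."""
    return sorted(
        name for name, reading in node_clocks.items()
        if abs(reading - cldb_clock) > threshold_sec
    )


# Hypothetical clock readings, in seconds.
readings = {"node-a": 1005.0, "node-b": 1025.0, "node-c": 975.0}
print(skewed_nodes(readings, cldb_clock=1000.0))  # ['node-b', 'node-c']
```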

Version Alarm

UI Column

Version Alarm

Logged As

NODE_ALARM_VERSION_MISMATCH

Meaning

One or more services on the node are running an unexpected version.

Resolution

Stop the node, restore the correct version of any services you have modified, and restart the node. See Managing Nodes.

WebServer Service Alarm

UI Column

Webserver Alarm

Logged As

NODE_ALARM_SERVICE_WEBSERVER_DOWN

Meaning

The WebServer service on the node has stopped running.

Resolution

Go to the Manage Services pane of the Node Properties View to check whether the WebServer service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in warden.conf. If the warden successfully restarts the WebServer service, the alarm is cleared. If the warden is unable to restart the WebServer service, it may be necessary to contact technical support.