Cluster Troubleshooting


About this Course

In this advanced Cluster Administration course, you will learn how to identify, diagnose, and fix errors and warnings before they become bigger problems. Topics include reviewing alarms and alarm settings, diagnosing CLDB service errors, troubleshooting data read and write problems, identifying cluster stability and slow-write issues, and troubleshooting node/topology problems and network failures.

Duration: 1.5 days

What’s Covered in the Course

Welcome to Class
  • Configure the Training Environment
  • Understand the Lab Environment
Lab Activities
    • Configure the Training Environment
    • Connect to Your Cluster
1: Introduction to Troubleshooting
  • Document a Cluster Problem Report
  • Classify the Problem
Lab Activities
    • Analyze Log Files
2: Service Availability Problems
  • Review Alarms and Alarm Settings
  • Identify Service Availability Problems
  • Diagnose and Fix Service Errors with CLDB
Lab Activities
    • Query the ZooKeeper Service
    • Troubleshoot CLDB Availability
3: Data Availability Problems
  • Identify Data Availability Problems
  • Troubleshoot Read and Write Data Problems
  • Resync Data Failures
Lab Activities
    • Troubleshoot Failing Writes
    • Troubleshoot Failing Reads
    • Resolve a Filename and Retrieve a FIDmap
    • Understand Container Resync Operations
4: Cluster Stability Problems
  • Identify Cluster Stability Problems
  • Manage the High Memory Alarm
  • Locate a Container Unable to Become Master
  • Identify Source of Slow Write Condition
Lab Activities
    • Monitor Memory Use
    • Use cldbguts
    • Diagnose Slow Writes
5: Hardware Problems
  • Identify and Troubleshoot Node/Topology Problems
  • Troubleshoot Disk Drive Problems
  • Manage Network and Multiple Failures
Lab Activities
    • Node Failure
    • Topology Changes
    • Handle Disk Failures
6: Get Help
  • Report Problems
  • Prevent Problems
  • Grow Your Cluster
Lab Activities
    • No labs