Class HeartbeatBasedSystemStoreHealthChecker

java.lang.Object
com.linkedin.venice.controller.systemstore.HeartbeatBasedSystemStoreHealthChecker
All Implemented Interfaces:
SystemStoreHealthChecker, AutoCloseable

public class HeartbeatBasedSystemStoreHealthChecker extends Object implements SystemStoreHealthChecker
Default SystemStoreHealthChecker implementation that uses a heartbeat write-and-read cycle to determine system store health. For each store, it sends a heartbeat timestamp to all child regions, then polls periodically until the heartbeat is read back or a timeout is reached. A store that reads back a fresh heartbeat is marked HEALTHY; a store whose heartbeat is still stale, or that is still unreachable, once the timeout elapses is marked UNHEALTHY. Stores that are never polled before the check aborts (for example, on leadership loss or shutdown) are omitted from the result and deferred to the next round, per the SystemStoreHealthChecker contract.
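The write-then-poll cycle described above can be illustrated with a minimal sketch. This is not the Venice implementation: `HeartbeatCycleSketch`, `HeartbeatClient`, its two methods, the `Result` enum, and the `-1` "unreachable" convention are all hypothetical stand-ins, and the sketch assumes that an abort omits every store that has not yet been decided.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class HeartbeatCycleSketch {
  /** Hypothetical stand-in for SystemStoreHealthChecker.HealthCheckResult. */
  enum Result { HEALTHY, UNHEALTHY }

  /** Hypothetical heartbeat transport; not the Venice admin API. */
  interface HeartbeatClient {
    void sendHeartbeat(String storeName, long timestampMs);

    /** Assumed convention: returns the last heartbeat read back, or -1 if unreachable. */
    long readHeartbeat(String storeName);
  }

  static Map<String, Result> check(HeartbeatClient client, Set<String> storeNames,
      long timeoutMs, long pollIntervalMs, AtomicBoolean isRunning) {
    Map<String, Result> results = new HashMap<>();

    // Write phase: stamp every store with the same timestamp.
    long sentAtMs = System.currentTimeMillis();
    for (String store : storeNames) {
      client.sendHeartbeat(store, sentAtMs);
    }

    // Read phase: poll until every store is fresh or the deadline passes.
    Set<String> pending = new HashSet<>(storeNames);
    long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (true) {
      Iterator<String> it = pending.iterator();
      while (it.hasNext()) {
        if (!isRunning.get()) {
          return results; // aborted: undecided stores are omitted (deferred)
        }
        String store = it.next();
        if (client.readHeartbeat(store) >= sentAtMs) {
          results.put(store, Result.HEALTHY);
          it.remove();
        }
      }
      if (pending.isEmpty() || System.nanoTime() - deadlineNanos >= 0) {
        break;
      }
      try {
        Thread.sleep(pollIntervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return results;
      }
    }

    // Timed out: everything still pending was polled at least once and stayed stale.
    for (String store : pending) {
      results.put(store, Result.UNHEALTHY);
    }
    return results;
  }
}
```

In this sketch a store is only ever reported UNHEALTHY after the full timeout, so a slow but working store is not penalized for a single stale read.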
  • Constructor Details

    • HeartbeatBasedSystemStoreHealthChecker

      public HeartbeatBasedSystemStoreHealthChecker(VeniceParentHelixAdmin parentAdmin, int heartbeatWaitTimeInSeconds, AtomicBoolean isRunning)
  • Method Details

    • checkHealth

      public Map<String,SystemStoreHealthChecker.HealthCheckResult> checkHealth(String clusterName, Set<String> systemStoreNames)
      Description copied from interface: SystemStoreHealthChecker
      Check the health of the given system stores in the specified cluster.
      Specified by:
      checkHealth in interface SystemStoreHealthChecker
      Parameters:
      clusterName - the Venice cluster name
      systemStoreNames - the set of system store names to check
      Returns:
      a map from system store name to its health check result. Implementations should return an entry for every store they were able to check. Missing entries (e.g., when the checker aborts early due to leadership change or shutdown) are treated by the caller as "deferred to next round": they are marked neither HEALTHY nor UNHEALTHY for this round, so a partial result will not inflate unhealthy counts. Implementations should therefore omit a store from the result map only when no decision was reached for it, and should return an explicit UNHEALTHY entry for stores that were checked and found to be unhealthy.
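As an illustration of the partial-result contract, the following sketch shows how a caller might separate deferred stores from unhealthy ones. The class `PartialResultSketch`, its `Result` enum, and both helper methods are hypothetical and are not part of the Venice API.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class PartialResultSketch {
  /** Hypothetical stand-in for SystemStoreHealthChecker.HealthCheckResult. */
  enum Result { HEALTHY, UNHEALTHY }

  /** Stores that were requested but have no entry: no decision was reached for them. */
  static Set<String> deferred(Set<String> requested, Map<String, Result> results) {
    Set<String> deferred = new TreeSet<>(requested);
    deferred.removeAll(results.keySet());
    return deferred;
  }

  /** Only explicit UNHEALTHY entries count; a missing entry is never treated as unhealthy. */
  static Set<String> unhealthy(Set<String> requested, Map<String, Result> results) {
    Set<String> unhealthy = new TreeSet<>();
    for (String store : requested) {
      if (results.get(store) == Result.UNHEALTHY) {
        unhealthy.add(store);
      }
    }
    return unhealthy;
  }
}
```

Under this reading, a check that aborts after deciding only some of the requested stores leaves the rest in the deferred set, to be retried in the next round.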

      This method is invoked on the repair service's single-threaded scheduler, so a call that blocks indefinitely will stall every subsequent repair round. Implementations must bound their own execution time and honor thread interruption (the service calls shutdownNow() on shutdown) rather than relying on the caller to time them out.
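One way to meet both requirements (a bounded wait and prompt reaction to interruption) is a deadline-driven polling helper like the sketch below. `BoundedWaitSketch`, `Outcome`, and `await` are hypothetical names chosen for illustration; only `Thread.sleep`, `System.nanoTime`, and the interrupt calls are standard Java.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

class BoundedWaitSketch {
  enum Outcome { MET, TIMED_OUT, INTERRUPTED }

  /** Polls the condition until it holds, the timeout elapses, or the thread is interrupted. */
  static Outcome await(BooleanSupplier condition, long timeoutMs, long pollIntervalMs) {
    long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (true) {
      if (condition.getAsBoolean()) {
        return Outcome.MET;
      }
      long remainingNanos = deadlineNanos - System.nanoTime();
      if (remainingNanos <= 0) {
        return Outcome.TIMED_OUT; // hard upper bound on how long the caller is blocked
      }
      try {
        // Never sleep past the deadline, and sleep at least 1 ms to avoid spinning.
        long remainingMs = Math.max(1, TimeUnit.NANOSECONDS.toMillis(remainingNanos));
        Thread.sleep(Math.min(pollIntervalMs, remainingMs));
      } catch (InterruptedException e) {
        // shutdownNow() interrupts the worker thread; restore the flag and stop waiting.
        Thread.currentThread().interrupt();
        return Outcome.INTERRUPTED;
      }
    }
  }
}
```

Restoring the interrupt flag rather than swallowing the exception lets code further up the call stack observe that shutdown was requested.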