Lifecycle interface

Core lifecycle for modular robotics systems

Configure · Activate · Run · Transition · Shutdown

Sprint 2 — Solid error handling

Objective. Define and enforce error propagation rules; guarantee no zombie states on partial failure.

Deliverable. “Error semantics are explicit and lifecycle is always coherent on failure.”

Content

Error propagation rules

Define and document:

  • What happens when _on_configure fails in one component? - Do siblings already configured stay in that state, or rollback? - What is the node’s final state? (CONFIGURED or UNCONFIGURED?)

  • What happens when _on_activate fails? - Do already-activated siblings stay active? - Can the node be partially active?

  • Error hooks: when is _on_error called vs _on_cleanup?

  • Can a component skip _on_cleanup if _on_configure failed?

Rollback semantics

  • On partial failure during configure or activate: - Option A: Rollback all — restore previously-configured siblings to UNCONFIGURED - Option B: All-or-nothing — if one component fails, entire node transition fails, no partial state - Option C: Partial allowed — siblings stay configured; node state reflects which succeeded

  • Decision: Pick one, document it, enforce it

Retry and idempotence

  • Is _on_configure safe to call multiple times?

  • Is _on_activate idempotent?

  • Document contract: “hooks are called exactly once per transition, not retried on transient failure”

Protected hook invocation

  • Wrap all _on_* hook calls in try/catch

  • On exception: - Log with component name, hook type, exception message - Return FAILURE (or raise if enforced) - Do not leak exception to application

  • State coherence: component must be in a valid state after hook failure (no half-allocated resources)

Resource cleanup on failure

  • If _on_configure fails partway through resource allocation, _release_resources must clean what was allocated

  • Document contract: “if _on_configure raises, _release_resources will be called to clean partial allocation”

Tests to write

Configure failure tests

  • [x] One component fails in _on_configure → node returns FAILURE

  • [x] Siblings already configured → state depends on rollback decision (A/B/C)

  • [x] After failure, node can retry configure (idempotence test)

  • [x] Resources allocated before failure → cleaned by _release_resources on cleanup

Activate failure tests

  • [x] One component fails in _on_activate → node returns FAILURE

  • [x] Already-activated siblings → state depends on rollback decision

  • [x] Node is not in ACTIVE state after failure

  • [x] Partial activation → can retry or deactivate+cleanup

Partial failure scenarios

  • [x] Three components: C1 OK, C2 fails, C3 never reached. Final state coherent.

  • [x] Resource cleanup: C2 fails after allocating some resources; _release_resources called

  • [x] Exception in _on_configure → caught, logged, not leaked to application

Error hook tests

  • [x] _on_error called when transition enters ERROR state (implementation-dependent)

  • [x] _on_cleanup called on shutdown even if _on_configure failed

  • [x] No double-cleanup: if _on_error already released resources, _on_cleanup should be safe (idempotent)

Logging tests

  • [x] Failure logs include: component name, hook type, error message

  • [x] Rollback (if implemented) logs component restoration

  • [x] Example log output in test docstring

Risks and mitigation

Risk 1: Zombie state after failure

  • Problem: If rollback is not enforced, partially-active components leak visible state.

  • Mitigation: - Enforce one of rollback options (A/B/C) in code, not just documentation - Test every scenario - Document the chosen contract clearly

Risk 2: Resource leak on failure

  • Problem: _on_configure fails after allocating resource X; _release_resources not called or incomplete.

  • Mitigation: - Library calls _release_resources on failure (before returning) - Test explicitly: allocate in _on_configure, raise exception, verify _release_resources called

Risk 3: Retry infinite loop

  • Problem: If a hook is not idempotent, retrying configure could cause issues.

  • Mitigation: - Document: “hooks are called exactly once per transition; application is responsible for making them idempotent if needed” - Do not auto-retry in the library

Risk 4: Exception leak to application

  • Problem: Unhandled exception in hook bubbles up to user code.

  • Mitigation: - All hook invocations wrapped in try/catch - Return FAILURE or raise a typed LifecycleHookError (library-controlled exception) - Test that user code never sees raw hook exception

Dependencies

  • Requires: LifecycleComponent base (shipped)

  • Requires: Error handling from Sprint 1 (service/client error responses)

  • Requires: Testing fixtures (Sprint 3) for easy failure scenario setup

Scope boundaries

In-scope:

  • Error propagation rules (pick A/B/C, document, enforce)

  • Rollback on partial failure (if chosen)

  • Protected hook invocation (try/catch)

  • Logging (component name, hook, error)

  • Resource cleanup guarantees

  • Idempotence contract

Out-of-scope:

  • User-defined error recovery policies (e.g., “on failure, do X”) — deferred to lifecycle policies (Sprint 6)

  • Automatic retry with backoff — application responsibility

  • Compensation transactions (complex orchestration) — out of scope

Success signal

  • [ ] Error propagation rules written and documented in docs/architecture.rst

  • [ ] Rollback option chosen and enforced in code

  • [ ] All error scenarios have tests (unit + integration)

  • [ ] Resource cleanup is guaranteed (test with Mock if needed)

  • [ ] Logs are actionable and include component context

  • [ ] Ruff, Pyright, Pytest all green

  • [ ] Design note: error handling contract (if future-proofing needed)