Sprint 2: Solid error handling
Objective. Define and enforce error propagation rules; guarantee no zombie states on partial failure.
Deliverable. “Error semantics are explicit and lifecycle is always coherent on failure.”
—
Content
Error propagation rules
Define and document:
- What happens when _on_configure fails in one component?
  - Do siblings that are already configured stay in that state, or are they rolled back?
  - What is the node’s final state (CONFIGURED or UNCONFIGURED)?
- What happens when _on_activate fails?
  - Do already-activated siblings stay active?
  - Can the node be partially active?
- Error hooks: when is _on_error called vs _on_cleanup?
- Can a component skip _on_cleanup if _on_configure failed?
Rollback semantics
On partial failure during configure or activate:
- Option A: Rollback all. Restore previously configured siblings to UNCONFIGURED.
- Option B: All-or-nothing. If one component fails, the entire node transition fails, with no partial state.
- Option C: Partial allowed. Siblings stay configured; node state reflects which components succeeded.
Decision: Pick one, document it, enforce it.
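To make the options concrete, here is a minimal sketch of Option A (rollback all). The Component class, the State enum, and configure_all are hypothetical stand-ins, not the library's API; the sketch also assumes a hook signals failure by raising.

```python
from enum import Enum


class State(Enum):
    UNCONFIGURED = "unconfigured"
    CONFIGURED = "configured"


class Component:
    """Hypothetical stand-in for a lifecycle component."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail
        self.state = State.UNCONFIGURED

    def configure(self):
        if self.fail:
            raise RuntimeError(f"{self.name}: configure failed")
        self.state = State.CONFIGURED

    def cleanup(self):
        self.state = State.UNCONFIGURED


def configure_all(components):
    """Option A: configure in order; on failure, undo siblings in reverse order."""
    done = []
    for comp in components:
        try:
            comp.configure()
        except Exception:
            # Roll back every sibling that was configured before the failure.
            for prev in reversed(done):
                prev.cleanup()
            return False
        done.append(comp)
    return True


c1, c2, c3 = Component("c1"), Component("c2", fail=True), Component("c3")
ok = configure_all([c1, c2, c3])
states = [c.state for c in (c1, c2, c3)]
```

After the failed transition every component is back in UNCONFIGURED, so no sibling is left in a state the node does not report.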
Retry and idempotence
- Is _on_configure safe to call multiple times?
- Is _on_activate idempotent?
- Document contract: “hooks are called exactly once per transition, not retried on transient failure”
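If the application decides to retry a transition itself, one way to keep a hook idempotent is to guard the allocation. The sketch below is an illustration only: CameraComponent, _open_device, and the open_count counter are invented names.

```python
class CameraComponent:
    """Hypothetical component; all names here are illustrative."""

    def __init__(self):
        self.handle = None
        self.open_count = 0  # counts real allocations, for demonstration

    def _open_device(self):
        self.open_count += 1
        return object()  # placeholder for a real device handle

    def _on_configure(self):
        # Guard: allocate only if nothing is allocated yet, so a second
        # call after a retry does not open the device twice.
        if self.handle is None:
            self.handle = self._open_device()


comp = CameraComponent()
comp._on_configure()
comp._on_configure()  # second call is a no-op
```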
Protected hook invocation
- Wrap all _on_* hook calls in try/except
- On exception:
  - Log with component name, hook type, exception message
  - Return FAILURE (or raise a library-controlled exception, if enforced)
  - Do not leak the raw exception to the application
- State coherence: component must be in a valid state after hook failure (no half-allocated resources)
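A protected invocation along these lines could look like the following sketch. The Result enum, the call_hook helper, the logger name, and the log format are assumptions made for illustration; the sketch also assumes hooks report failure by raising.

```python
import logging
from enum import Enum

logger = logging.getLogger("lifecycle")  # hypothetical logger name


class Result(Enum):
    SUCCESS = "success"
    FAILURE = "failure"


def call_hook(component, hook_name):
    """Call one _on_* hook; convert any exception into Result.FAILURE."""
    hook = getattr(component, hook_name)
    try:
        hook()
    except Exception as exc:
        # Log component name, hook type, and the exception message.
        logger.error(
            "component=%s hook=%s error=%s", component.name, hook_name, exc
        )
        return Result.FAILURE
    return Result.SUCCESS


class BrokenComponent:
    name = "broken"

    def _on_configure(self):
        raise RuntimeError("missing parameter")


result = call_hook(BrokenComponent(), "_on_configure")
```

The caller sees only Result.FAILURE; the RuntimeError stays inside the wrapper.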
Resource cleanup on failure
- If _on_configure fails partway through resource allocation, _release_resources must clean up what was allocated
- Document contract: “if _on_configure raises, _release_resources will be called to clean up the partial allocation”
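The contract implies that _release_resources has to tolerate a partial allocation. A minimal sketch follows, assuming a hypothetical TwoStageComponent and a simplified configure wrapper standing in for the library's transition code:

```python
class TwoStageComponent:
    """Hypothetical component that allocates two resources in sequence."""

    def __init__(self):
        self.resources = []

    def _on_configure(self):
        self.resources.append("socket")  # first allocation succeeds
        raise RuntimeError("buffer allocation failed")  # second step fails

    def _release_resources(self):
        # Release whatever exists; must work for zero, one, or all resources.
        while self.resources:
            self.resources.pop()


def configure(component):
    """Simplified stand-in for the library's configure transition."""
    try:
        component._on_configure()
    except Exception:
        component._release_resources()  # clean up the partial allocation
        return False
    return True


comp = TwoStageComponent()
ok = configure(comp)
```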
—
Tests to write
Configure failure tests
[x] One component fails in _on_configure → node returns FAILURE
[x] Siblings already configured → state depends on rollback decision (A/B/C)
[x] After failure, node can retry configure (idempotence test)
[x] Resources allocated before failure → cleaned by _release_resources on cleanup
Activate failure tests
[x] One component fails in _on_activate → node returns FAILURE
[x] Already-activated siblings → state depends on rollback decision
[x] Node is not in ACTIVE state after failure
[x] Partial activation → can retry or deactivate+cleanup
Partial failure scenarios
[x] Three components: C1 OK, C2 fails, C3 never reached. Final state coherent.
[x] Resource cleanup: C2 fails after allocating some resources; _release_resources called
[x] Exception in _on_configure → caught, logged, not leaked to application
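The three-component scenario can be sketched as a pytest-style test. FakeNode and RecordingComponent are hypothetical test doubles, not library classes, and the test asserts only what holds under any of the rollback options: the transition reports failure, C3 is never reached, and the exception does not escape.

```python
class RecordingComponent:
    """Hypothetical test double that records when its hook is called."""

    def __init__(self, name, calls, fail=False):
        self.name = name
        self.calls = calls
        self.fail = fail

    def _on_configure(self):
        self.calls.append(self.name)
        if self.fail:
            raise RuntimeError(f"{self.name} failed")


class FakeNode:
    """Hypothetical node: configures in order, stops at the first failure."""

    def __init__(self, components):
        self.components = components

    def configure(self):
        for comp in self.components:
            try:
                comp._on_configure()
            except Exception:
                return "FAILURE"
        return "SUCCESS"


def test_c2_fails_c3_never_reached():
    calls = []
    node = FakeNode([
        RecordingComponent("c1", calls),
        RecordingComponent("c2", calls, fail=True),
        RecordingComponent("c3", calls),
    ])
    # The exception from c2 must not escape configure().
    assert node.configure() == "FAILURE"
    # c3's hook was never invoked.
    assert calls == ["c1", "c2"]
```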
Error hook tests
[x] _on_error called when transition enters ERROR state (implementation-dependent)
[x] _on_cleanup called on shutdown even if _on_configure failed
[x] No double-cleanup: if _on_error already released resources, _on_cleanup should be safe (idempotent)
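One possible way to satisfy the no-double-cleanup test is a released flag inside _release_resources. SensorComponent and its release_count counter are invented for this sketch; whether _on_error releases resources at all is an implementation decision the document leaves open.

```python
class SensorComponent:
    """Hypothetical component with an idempotent release path."""

    def __init__(self):
        self.released = False
        self.release_count = 0  # counts real releases, for demonstration

    def _release_resources(self):
        if self.released:
            return  # already released: a second call is a no-op
        self.release_count += 1
        self.released = True

    def _on_error(self):
        self._release_resources()

    def _on_cleanup(self):
        self._release_resources()


comp = SensorComponent()
comp._on_error()
comp._on_cleanup()  # safe: resources were already released in _on_error
```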
Logging tests
[x] Failure logs include: component name, hook type, error message
[x] Rollback (if implemented) logs component restoration
[x] Example log output in test docstring
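A logging test with an example in its docstring might be sketched as follows, using only the standard logging module. The log format, the logger name, and the log_hook_failure helper are assumptions, not the library's actual output.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted log messages in a list for assertions."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def log_hook_failure(logger, component_name, hook_name, exc):
    """Hypothetical helper that writes one failure line."""
    logger.error(
        "component=%s hook=%s error=%s", component_name, hook_name, exc
    )


def test_failure_log_has_context():
    """Example log output (format is an assumption):

        component=camera hook=_on_configure error=device busy
    """
    logger = logging.getLogger("lifecycle.test")
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        log_hook_failure(
            logger, "camera", "_on_configure", RuntimeError("device busy")
        )
    finally:
        logger.removeHandler(handler)
    assert handler.messages == [
        "component=camera hook=_on_configure error=device busy"
    ]
```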
—
Risks and mitigation
Risk 1: Zombie state after failure
Problem: If rollback is not enforced, partially-active components leak visible state.
Mitigation:
- Enforce one of the rollback options (A/B/C) in code, not just documentation
- Test every scenario
- Document the chosen contract clearly
Risk 2: Resource leak on failure
Problem: _on_configure fails after allocating resource X; _release_resources is not called or is incomplete.
Mitigation:
- Library calls _release_resources on failure (before returning)
- Test explicitly: allocate in _on_configure, raise exception, verify _release_resources called
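The explicit test can be written with unittest.mock. In this sketch configure_with_cleanup is a hypothetical stand-in for the library's transition code; only the Mock calls are real standard-library API.

```python
from unittest.mock import Mock


def configure_with_cleanup(component):
    """Hypothetical transition: release resources if the hook raises."""
    try:
        component._on_configure()
    except Exception:
        component._release_resources()
        return False
    return True


def test_release_called_when_configure_raises():
    component = Mock()
    # Make the hook raise, as if allocation failed partway through.
    component._on_configure.side_effect = RuntimeError("allocation failed")

    assert configure_with_cleanup(component) is False
    # The library must have called the release hook exactly once.
    component._release_resources.assert_called_once_with()
```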
Risk 3: Retry infinite loop
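Problem: If a hook is not idempotent, retrying configure can repeat its side effects (for example, allocating the same resource twice), and automatic retries on a persistent failure may never terminate.
Mitigation:
- Document: “hooks are called exactly once per transition; application is responsible for making them idempotent if needed”
- Do not auto-retry in the library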
Risk 4: Exception leak to application
Problem: Unhandled exception in hook bubbles up to user code.
Mitigation:
- All hook invocations wrapped in try/except
- Return FAILURE or raise a typed LifecycleHookError (library-controlled exception)
- Test that user code never sees raw hook exception
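For the raising variant, a typed exception can wrap the original error so that user code only ever catches the library-controlled type. The name LifecycleHookError comes from the mitigation above; its constructor arguments and the invoke_hook_strict helper are assumptions for this sketch.

```python
class LifecycleHookError(Exception):
    """Library-controlled exception; the fields shown are assumptions."""

    def __init__(self, component_name, hook_name, cause):
        super().__init__(f"{component_name}.{hook_name} failed: {cause}")
        self.component_name = component_name
        self.hook_name = hook_name


def invoke_hook_strict(component_name, hook_name, hook):
    """Run a hook; re-raise any failure as LifecycleHookError."""
    try:
        hook()
    except Exception as exc:
        # "from exc" keeps the original error available as __cause__.
        raise LifecycleHookError(component_name, hook_name, exc) from exc


def bad_hook():
    raise ValueError("bad parameter")


caught = None
try:
    invoke_hook_strict("camera", "_on_configure", bad_hook)
except LifecycleHookError as err:
    caught = err
```

Chaining with "from exc" keeps the original traceback for debugging while the application only needs to handle one exception type.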
—
Dependencies
Requires: LifecycleComponent base (shipped)
Requires: Error handling from Sprint 1 (service/client error responses)
Requires: Testing fixtures (Sprint 3) for easy failure scenario setup
—
Scope boundaries
In-scope:
Error propagation rules (pick A/B/C, document, enforce)
Rollback on partial failure (if chosen)
Protected hook invocation (try/except)
Logging (component name, hook, error)
Resource cleanup guarantees
Idempotence contract
Out-of-scope:
User-defined error recovery policies (e.g., “on failure, do X”) — deferred to lifecycle policies (Sprint 6)
Automatic retry with backoff — application responsibility
Compensation transactions (complex orchestration) — out of scope
—
Success signal
[ ] Error propagation rules written and documented in docs/architecture.rst
[ ] Rollback option chosen and enforced in code
[ ] All error scenarios have tests (unit + integration)
[ ] Resource cleanup is guaranteed (test with Mock if needed)
[ ] Logs are actionable and include component context
[ ] Ruff, Pyright, Pytest all green
[ ] Design note: error handling contract (if future-proofing needed)
—