Dynamic Components — Design Note¶
Audience :class: note
Contributors and advanced readers evaluating future directions of
lifecore_ros2. This is a design note, not committed user
documentation. No code under src/lifecore_ros2/ exists for this feature
yet. remove_component already exists but is gated to the pre-transition
window only — see Current state — what already exists.
Status
Draft — gated on §4 (concurrency), §5 (strict lifecycle contract), and §6 (test coverage) being green. See Prerequisite gates.
Intent¶
Allow add_component and remove_component to be called at runtime,
i.e. after the first lifecycle transition has occurred, while keeping
the library’s core invariants intact:
native
rclpylifecycle semantics stay in control,no parallel hidden state machine,
no ghost or partially-registered entries.
Goal: make a node hot-reloadable in the small (replace one component, add a new pipeline stage, drop a deprecated subscriber) without restarting the process or destroying the node.
Current state — what already exists¶
Read this before proposing anything. The codebase already encodes a partial answer.
Registration gate¶
lifecore_ros2.LifecycleComponentNode keeps an
_registration_open flag that flips to False on the first lifecycle
transition (on_configure) and on on_shutdown. After closure,
add_component raises RegistrationClosedError. This is documented
in Architecture and validated by
tests/test_regression_add_component.py.
Rationale (recovered from the registration-gate ADR): rclpy does not
catch a late-added managed entity up to the node’s current lifecycle state.
A component added after on_configure would miss its own _on_configure
and enter activation in an inconsistent state.
remove_component — pre-transition only¶
A remove_component(name) method exists on
LifecycleComponentNode. It is currently restricted to the
pre-transition window (_registration_open is True); calling it later
raises RegistrationClosedError.
Implementation uses a _withdrawn flag on the component so that, even if
the component remains in rclpy’s managed-entity list, all subsequent
transition callbacks return SUCCESS as a silent no-op. This is the
Option B sketched in the Adoption & Hardening Roadmap (docs/planning/adoption_hardening.rst):
Option B (current): ``_withdrawn`` flag silences the component as a ghost managed entity — acceptable for pre-transition use, not clean post-transition.
The same TODO file documents that runtime removal is not safe under Option B because rclpy still owns the entity reference, and that Option C (library-driven propagation) is the path forward:
Option C (future): take over propagation from rclpy, iterate ``_components.values()`` directly in each ``on_*`` method. Enables true runtime removal.
This note specifies Option C and the runtime-add side that mirrors it.
Why Option B is insufficient post-transition¶
The withdrawn component stays in rclpy’s managed-entity list. Memory and reference are not actually released until the node is destroyed.
Any introspection that walks rclpy’s list (rather than
_components) still sees a ghost. This breaks the introspection note’s “no ghost entries” invariant.Resource release is never triggered for a withdrawn component if the node never reaches
on_cleanup/on_shutdownafter withdrawal — by Error Policy Rule D,_release_resourcesruns only inside those hooks. A component withdrawn while the node is active would leak its publishers/subscribers until the node terminates.The “non-atomic add” anti-pattern (orchestration room) extends symmetrically to remove: a partial removal that leaves the entity in rclpy but absent from
_componentsis a ghost in the other direction.
Proposed contract — Option C¶
The library takes over lifecycle propagation. rclpy’s native
managed-entity propagation is bypassed for components; transitions iterate
_components.values() directly. This unlocks both runtime removal
(release resources cleanly) and runtime add (replay missed transitions).
API surface¶
class LifecycleComponentNode:
def add_component(self, component: LifecycleComponent) -> None:
"""Register a component.
Pre-transition: as today. Post-transition: the new component is
caught up to the node's current state by replaying transitions
in order. Replay sequence depends on the node's current state
(see *Replay semantics for runtime add* below). Replay is
atomic: failure rolls back fully and raises, leaving no entry
in ``_components``.
"""
def remove_component(self, name: str) -> None:
"""Unregister a component.
Pre-transition: as today. Post-transition: the library drives
the component down to ``Unconfigured`` through the lifecycle
(``_on_deactivate`` if active, then ``_on_cleanup``), releasing
resources via the normal ``_release_resources`` path. The
node-level ``on_shutdown`` is *not* run — the node itself is
not shutting down. Then detach and drop from ``_components``.
"""
Both calls hold the existing threading.RLock for the entire mutate +
replay sequence. They are forbidden from inside a lifecycle hook (would
deadlock or reorder transitions) — see Open question 1.
Replay semantics for runtime add¶
Map of node state → replay sequence applied to the new component:
Unconfigured— none (matches current pre-transition behavior).Inactive—_on_configureonly.Active—_on_configurethen_on_activate.Finalized— refuse withRegistrationClosedError.During a transition — refuse with
ConcurrentTransitionError(existing exception).
Each replayed step uses the existing guard (_guarded_call) so
observability (Layer A logs and Layer B events from the observability note)
covers replays without special-casing.
Removal semantics for runtime remove¶
Symmetric tear-down driven by the node:
Unconfigured/ pre-transition — drop from_components, detach; no hooks run.Inactive— run_on_cleanup(with_release_resourcesper Rule D), then drop and detach.Active— run_on_deactivate, then_on_cleanup, then drop and detach._on_shutdownis not run because the node itself is not shutting down.Finalized— refuse withRegistrationClosedError(the node is done; nothing to remove).During a transition — refuse with
ConcurrentTransitionError.
Failure of any teardown hook follows Rule D: _release_resources still
runs; the worst result is logged. Even on failure, the component is removed
from _components and detached, because retaining it would create the
ghost case forbidden by the non-atomic-add anti-pattern.
Atomicity contract¶
Mirror of the existing add anti-pattern:
Add success — component is in
_componentsand attached and caught up to current state.Add failure — component is not in
_componentsand detached; any partially-allocated resources are released via the same path_on_cleanupwould use.Remove success — component is not in
_componentsand detached and its resources are released.Remove failure of a teardown hook — component is still removed from
_componentsand detached (forced); the failure is reported via the observer event (outcome=error). This is the only place the library proceeds past a hook failure to preserve the “no ghost” invariant.
Typed exceptions¶
RegistrationClosedError— already exists; reused forFinalizedstate.ConcurrentTransitionError— already exists; reused for mid-transition mutation.New:
ComponentNotFoundError(LifecoreError, KeyError)— replaces the current bareKeyErrorraised byremove_componentfor unknown names. Multi-inheritsKeyErrorper Error Policy Rule A pattern.
Invariants preserved¶
Native rclpy lifecycle stays in control of the node. Option C bypasses rclpy only for component-level propagation; the underlying
rclpy.lifecycle.LifecycleNodeitself still drives its own state machine. There is no parallel state machine on the node side.No ghost entries. Add is atomic; remove is forced past hook failure precisely to avoid ghosts.
Resource release goes through the lifecycle path. Runtime remove triggers the existing
_on_cleanup→_release_resourceschain; no new direct-call to_release_resourcesfrom outside a hook.Concurrency contract intact. All mutation is under the existing
RLock; reentrancy from hooks is forbidden via the existingConcurrentTransitionError.Error policy unchanged. Boundary errors (Rule A), hook exceptions (Rule B), activation gating (Rule C), and worst-of teardown propagation (Rule D) all apply unmodified to runtime add/remove.
Observer events. Layer B emits one event per replayed step on add and per teardown step on remove. No new emission site.
Public API stability.
ComponentNotFoundErroris additive; the semantic change toadd_component/remove_component(gate relaxation) is opt-in by virtue of being post-transition — pre-transition callers see no behavioral change.
Prerequisite gates¶
§4 — Architecture Concurrency Contract: Option C reuses the
RLockand the single-threaded executor ADR. Without §4 green, the reentrancy and lock-ordering questions are unanswerable.§5 — Architecture Strict direct-call contract: rejection paths (
Finalized, mid-transition) and rollback semantics depend on the strict typed-exception contract.§6 — Lifecycle test coverage: replay correctness can only be claimed if the existing full-cycle, double-activate, and concurrent-registration tests are stable as the regression baseline. The §6 deferred item (inter-component pub/sub interaction) does not gate this note.
Runtime Introspection — Design Note —
list_components/get_component_stateare the natural way to verify add/remove from outside; the introspection note must land first or in parallel.Observability — Design Note — replay and teardown events ride the existing Layer B channel; the observability note must land first or in parallel.
Open questions¶
To be answered in the implementation PR, not silently in code.
Reentrancy from hooks.
add_component/remove_componentfrom inside an_on_*hook: forbid (raiseConcurrentTransitionError) or allow (run after current transition completes)? Tentative: forbid. Allowing it implies a queue, which is the parallel state machine the project rejects.Replay granularity for runtime add in ``Active``. Run
_on_configurethen_on_activateas two separate guarded calls (two events, clear failure point) or one fused step? Tentative: two separate guarded calls.Failure of replay during runtime add. If
_on_configuresucceeds but_on_activatefails, do we leave the new componentInactive(asymmetric with siblings) or roll all the way back? Tentative: roll back — call_on_cleanupto release resources, detach, drop from_components, raise.Forced removal on teardown hook failure. Confirmed above as the chosen rule. Document as an explicit exception to “no proceed past a failed hook” or generalize? Tentative: explicit, narrow exception.
Interaction with rclpy’s managed-entity list. Option C bypasses rclpy propagation but rclpy still references the components it was given via
add_managed_entity. Do we (a) keep them in rclpy’s list inert, (b) remove them via a private rclpy API, or (c) never calladd_managed_entityin the first place? Tentative: (c) — Option
- C means the library owns propagation end to end; passing components
to rclpy is what created Option B’s ghost problem. Verify this is actually possible without breaking rclpy lifecycle node assumptions.
Concurrent runtime add and remove. Two threads call
add_componentandremove_componentsimultaneously. RLock serialises them, so order is non-deterministic but consistent. Document as “serialised, order is observation order” rather than inventing a priority.Naming. Should runtime-only behavior be exposed under
add_component/remove_component(overloaded semantics) or under explicitadd_component_runtime/remove_component_runtimevariants? Tentative: same name. Two names imply two contracts; the contract is uniform once Option C ships.Test surface. Concrete regression tests required: runtime-add in
Inactive, runtime-add inActive, runtime-remove inInactive, runtime-remove inActive, replay failure rollback, teardown failure forced removal, mid-transition rejection, concurrent add/remove. Match these against §6’s existing structure.MemPalace persistence. The “Option B vs Option C” deferred ADR is currently only in the Adoption & Hardening Roadmap. Once Option C is decided, persist via the MemPalace Knowledge Writer agent. Until then, this note is the canonical source.
Non-goals¶
No live reconfiguration of an existing component. Replace by
remove_component+add_component; do not introduce areplace_componentshortcut.No partial-state add. A new component cannot be inserted “directly in
Active” without going through_on_configure; replay is the only path.No proceed-past-failure on add. Asymmetric with the forced-removal rule, intentionally — adds that fail must roll back fully.
No new dependency. Option C is implementable with the existing rclpy + stdlib surface.
No promotion of ``_withdrawn`` to a public concept. Option B is internal scaffolding for the pre-transition window; once Option C ships,
_withdrawnis removed.No ROS-graph control surface. No service / topic to add or remove components remotely. Consumers wrap the Python API themselves.
No batch operations.
add_components([...])/remove_components([...])are out of scope; consumers loop.No re-entry for the §6 deferred item. Inter-component pub/sub interaction tests remain out of scope; this note does not pull them in.