Observability — Design Note
Audience
Contributors and advanced readers evaluating future directions of
lifecore_ros2. This is a design note, not committed user
documentation. No code under src/lifecore_ros2/ exists for this feature
yet.
Status
Draft — gated on §4 (concurrency) and §5 (strict lifecycle contract) being green. See Prerequisite gates.
Intent
Define a small, stable observability surface for lifecore_ros2:
Structured logging — make existing transition and error log lines parseable, with stable field names, so operators can filter and alert without scraping free-form messages.
Lifecycle tracing — expose a passive event stream describing transition starts, ends, failures, and rejections, so external code can record traces or feed dashboards without polling.
Goal: turn the library into a passive observer of native rclpy lifecycle behavior. Never own the state machine; never duplicate it.
Proposed contract
Two layers: Layer A is always on, Layer B is opt-in.
Layer A — structured logging (always on)
Existing log sites in _guarded_call, _release_resources, and the
direct-call rejection paths emit messages with a stable field set. The
field set is the contract; the human-readable template can evolve.
Mandatory fields (every transition-related log line):
component — registered name, or "<node>" for node-level lines.
transition — one of configure, activate, deactivate, cleanup, shutdown, error.
from_state / to_state — rclpy state labels (read at log time, not cached).
outcome — one of success, failure, error, rejected.
error_class — fully qualified exception class name when outcome is error or rejected; absent otherwise.
duration_ms — wall-clock duration of the hook for success / failure / error; absent for rejected.
Implementation detail (informative, not contractual): use rclpy’s structured
logger with key/value extras where supported, and fall back to a stable
key=value suffix in the message string. A single helper produces both.
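The dual-output helper described above could look roughly like the following sketch. The function names `lifecycle_fields` and `format_suffix` are hypothetical, and so is the choice to omit absent fields entirely and to render durations with three decimals; only the field names come from this note.

```python
from typing import Dict, Optional

# Order of fields in the key=value suffix (the mandatory fields listed above).
FIELD_ORDER = (
    "component", "transition", "from_state", "to_state",
    "outcome", "error_class", "duration_ms",
)


def lifecycle_fields(
    component: str,
    transition: str,
    from_state: str,
    to_state: str,
    outcome: str,
    error_class: Optional[str] = None,
    duration_ms: Optional[float] = None,
) -> Dict[str, object]:
    """Build the structured field dict, dropping fields that are absent."""
    fields = {
        "component": component,
        "transition": transition,
        "from_state": from_state,
        "to_state": to_state,
        "outcome": outcome,
        "error_class": error_class,
        "duration_ms": duration_ms,
    }
    return {key: value for key, value in fields.items() if value is not None}


def format_suffix(fields: Dict[str, object]) -> str:
    """Render the fields as an ordered key=value suffix for the message string."""
    parts = []
    for key in FIELD_ORDER:
        if key in fields:
            value = fields[key]
            if isinstance(value, float):
                value = f"{value:.3f}"  # assumed fixed precision for durations
            parts.append(f"{key}={value}")
    return " ".join(parts)


fields = lifecycle_fields(
    "camera", "configure", "unconfigured", "inactive", "success", duration_ms=12.5
)
suffix = format_suffix(fields)
```

The same dict can be passed as key/value extras where the logger supports them, while `suffix` is appended to the human-readable message otherwise.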
Layer B — lifecycle event stream (opt-in)
A read-only observer interface on lifecore_ros2.LifecycleComponentNode
that lets external code subscribe to lifecycle events as Python objects
(no ROS topics, no IPC).
from dataclasses import dataclass
from typing import Callable, Literal

Outcome = Literal["success", "failure", "error", "rejected"]


@dataclass(frozen=True)
class LifecycleEvent:
    component: str  # "<node>" for node-level events
    transition: str  # configure | activate | deactivate | cleanup | shutdown | error
    from_state: str
    to_state: str
    outcome: Outcome
    error_class: str | None
    duration_ms: float | None
    monotonic_ns: int  # event timestamp from time.monotonic_ns()


class LifecycleComponentNode:
    def add_lifecycle_observer(
        self, callback: Callable[[LifecycleEvent], None]
    ) -> Callable[[], None]:
        """Register a passive observer. Returns an unsubscribe callable.

        Observers are invoked synchronously from the lifecycle guard,
        after rclpy has produced the transition outcome. Observer
        exceptions are caught, logged at debug, and dropped.
        """
The observer is a notification channel. It does not influence transitions,
does not gate them, does not see rclpy internals.
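To make the subscribe/unsubscribe semantics concrete, here is a minimal sketch of the bookkeeping that could sit behind add_lifecycle_observer. The class `ObserverRegistry`, its `emit` method, and the token-based storage are hypothetical stand-ins, not the library's implementation; the event is left untyped so the sketch stays self-contained.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("lifecycle_observer_sketch")

Observer = Callable[[Any], None]


class ObserverRegistry:
    """Hypothetical stand-in for the node-side observer bookkeeping."""

    def __init__(self) -> None:
        # Keyed by a unique token so the same callback can be registered twice.
        self._observers: Dict[object, Observer] = {}

    def add_lifecycle_observer(self, callback: Observer) -> Callable[[], None]:
        token = object()
        self._observers[token] = callback

        def unsubscribe() -> None:
            # Idempotent: calling it a second time is a no-op.
            self._observers.pop(token, None)

        return unsubscribe

    def emit(self, event: Any) -> None:
        # Iterate over a copy so an observer may unsubscribe during delivery.
        for observer in list(self._observers.values()):
            try:
                observer(event)
            except Exception:
                # Observer failures are logged at debug and dropped.
                logger.debug("lifecycle observer raised; dropped", exc_info=True)


registry = ObserverRegistry()
seen = []
unsubscribe = registry.add_lifecycle_observer(seen.append)
registry.emit("event-1")  # delivered synchronously, in registration order
unsubscribe()
registry.emit("event-2")  # no longer delivered
```

A raising observer does not prevent delivery to the observers registered after it, which matches the "cannot crash a transition" intent.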
Emission sites
Single emission chokepoint: the existing lifecycle guard
(_guarded_call + _worst_of per Error Policy Rule D). Events are
emitted after _worst_of so that cleanup / shutdown / error
events reflect the merged result, not the raw hook return.
Direct-call rejection paths (InvalidLifecycleTransitionError,
ConcurrentTransitionError) emit one rejected event before raising.
No emission anywhere else. In particular: no emission inside _on_*
hooks, no emission inside publish, no emission inside subscription
callbacks.
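The single-chokepoint idea can be sketched as follows. This is not the existing guard: the name `run_guarded`, the plain-dict event, and the mapping "hook returns True means success, False means failure, a raised exception means error" are assumptions made for illustration, and the _worst_of merge step is deliberately left out.

```python
import time
from typing import Callable, Dict, Optional


def run_guarded(
    transition: str,
    hook: Callable[[], bool],
    emit: Callable[[Dict[str, object]], None],
) -> bool:
    """Run one hook, then emit exactly one event describing its outcome."""
    start_ns = time.monotonic_ns()
    error_class: Optional[str] = None
    try:
        ok = bool(hook())
        outcome = "success" if ok else "failure"
    except Exception as exc:
        # The exception is contained here; only its class name is reported.
        ok = False
        outcome = "error"
        cls = type(exc)
        error_class = f"{cls.__module__}.{cls.__qualname__}"
    end_ns = time.monotonic_ns()
    emit(
        {
            "transition": transition,
            "outcome": outcome,
            "error_class": error_class,
            "duration_ms": (end_ns - start_ns) / 1e6,
            "monotonic_ns": end_ns,
        }
    )
    return ok


events = []
run_guarded("configure", lambda: True, events.append)
```

Because emission happens once, after the outcome is known, the hook itself never needs to know that observers exist.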
Invariants preserved
Passive observer. Observability never drives transitions, never derives state, never queues work that could reorder the native lifecycle.
No parallel state machine. All fields (from_state, to_state, outcome) are read from rclpy / the guard’s already-computed result. No shadow FSM.
Single chokepoint. All transition events flow through the existing guard. This avoids the drift risk explicitly forbidden by the “ComposedLifecycleNode does NOT introduce custom transition logic” contract.
Hot-path logging policy unchanged. Per Error Policy Rule C, gated inbound drops stay at debug and caught on_message exceptions stay at error. No event is emitted on the message hot path.
Hooks never raise outward (Rule B). Observer callbacks are wrapped: exceptions caught, logged at debug, dropped. An observer cannot crash a transition.
Tolerant of skipped component shutdown. Per ros2 transverse note, managed-entity on_shutdown may be skipped from Unconfigured/Inactive. Consumers MUST treat the event stream as best-effort for terminal transitions.
Stable error vocabulary. error_class uses the existing lifecore_ros2.LifecoreError hierarchy and stdlib classes; no new exception types are introduced for observability.
Public API stability. LifecycleEvent and add_lifecycle_observer are additive in __all__.
No new runtime dependency. No OpenTelemetry, no tracetools, no third-party tracing library. Adapters to such backends are user code.
Prerequisite gates
§4 — Architecture Concurrency Contract: observer callbacks run under the same threading guarantees as the guard. The reused RLock bounds reentrancy; observer authors must respect “do not call back into the node from the callback”.
§5 — Architecture Strict direct-call contract: rejection events depend on the typed exceptions (InvalidLifecycleTransitionError, ConcurrentTransitionError) being the canonical signal.
§6 — Lifecycle test coverage: observer behavior reuses the existing walks (full cycle, double activate, deactivate-without-activate, configure failure rollback) as fixtures.
Open questions
These are explicitly unresolved. To be answered in the implementation PR.
Synchronous vs. queued delivery. Sync keeps ordering trivial and avoids a side queue (fits “passive observer”). Queued isolates slow observers but creates a parallel buffer. Tentative: synchronous, with a documented “do fast work or hand off” rule.
Observer reentrancy. Should an observer be allowed to call list_components (introspection note)? Reading is safe under the RLock; writing (add_component) is forbidden because the gate may already be closed.
Per-component vs. node-level subscription. Single add_lifecycle_observer on the node only, or also a per-component variant? Tentative: node-level only; consumers filter by component field.
Logger choice for Layer A. self.get_logger() (node logger) for every line, or a child logger self.get_logger().get_child("lifecycle") for transition events? Tentative: child logger, to allow operators to tune the lifecycle category independently.
Field stability commitment. Are the mandatory fields part of the stable public contract (semver-protected) or best-effort? Tentative: stable; additions are non-breaking, removals require a major bump.
Adapter examples. Should examples/ ship a tiny ros2_tracing adapter or an OpenTelemetry adapter? Tentative: no — keep adapters out-of-tree until at least one real consumer exists.
Time source. time.monotonic_ns() (process-local, monotonic) vs rclpy.Clock (sim-time aware). Tentative: monotonic_ns for duration_ms math; sim-time is irrelevant to wall-clock duration of a Python hook.
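If delivery stays synchronous, the “do fast work or hand off” rule from the first question pushes slow processing into consumer code. One possible consumer-side pattern is sketched below; `make_handoff_observer` and its `sink` parameter are hypothetical names, and the queue lives entirely in user code rather than in the library.

```python
import queue
import threading
from typing import Any, Callable, Tuple


def make_handoff_observer(
    sink: Callable[[Any], None],
) -> Tuple[Callable[[Any], None], Callable[[], None]]:
    """Return (observer, stop). The observer only enqueues; a worker calls sink."""
    pending: "queue.SimpleQueue[Any]" = queue.SimpleQueue()
    stop_marker = object()

    def worker() -> None:
        while True:
            item = pending.get()
            if item is stop_marker:
                return
            try:
                sink(item)  # slow work happens here, off the lifecycle thread
            except Exception:
                pass  # a failing sink must not kill the worker

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    def observer(event: Any) -> None:
        # Unbounded queue: put() returns immediately and never blocks the guard.
        pending.put(event)

    def stop() -> None:
        # Events enqueued before stop() are processed first (FIFO order).
        pending.put(stop_marker)
        thread.join()

    return observer, stop


received = []
observer, stop = make_handoff_observer(received.append)
observer("event-1")
observer("event-2")
stop()
```

The callable returned as `observer` is what a consumer would pass to add_lifecycle_observer; ordering is preserved because a single worker drains a FIFO queue.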
Non-goals
No new write API. Observers cannot influence, cancel, or reorder transitions.
No ROS topic / service for events. The event stream is a Python API on the node instance. ROS-graph publication is user adapter code.
No third-party tracing dependency. OpenTelemetry, tracetools, ros2_tracing integration — out of scope. Stay rclpy-only.
No metric counters. No library-side aggregation (counts, histograms). Observability stays event-shaped.
No on-message-hot-path events. Activation gating drops stay at debug; nothing is added on the per-message path.
No state cache for fast queries. Snapshots come from the introspection note’s read-through API; observability does not duplicate it.
No log message template freeze. Only the structured field set is contractual.