Interface for service job manager.
Methods
ensure_services
@abc.abstractmethod
ensure_services(
    pipeline_state: tfx.orchestration.experimental.core.pipeline_state.PipelineState
) -> Set[tfx.orchestration.experimental.core.task.NodeUid]
Ensures necessary service jobs are started and healthy for the pipeline.
Service jobs are long-running jobs associated with a node or the pipeline that persist across executions (e.g., worker pools, TensorBoard). Service jobs are started before the nodes that depend on them.
ensure_services will be called periodically in the orchestration loop and is expected to:
- Start any service jobs required by the pipeline nodes.
- Probe job health and handle failures. If a service job fails, the corresponding node uids should be returned.
- Optionally stop service jobs that are no longer needed. Whether a service job is still needed depends on context. For example, in a typical sync pipeline, one may want the TensorBoard job to keep running even after the corresponding trainer has completed, while others, such as worker pool services, may be shut down.
Args | |
---|---|
pipeline_state | A PipelineState object for an active pipeline. |
Returns | |
---|---|
Set of NodeUids of nodes whose service jobs are in a state of permanent failure. |
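The three responsibilities of ensure_services (start, probe, optionally stop) can be illustrated with a minimal in-memory sketch. It does not import TFX: `NodeUid`, `PipelineState`, and `ServiceJobManager` below are simplified stand-ins for the real classes, and the `service_nodes` field, the `InMemoryServiceJobManager` class, and the `mark_failed` helper are hypothetical names introduced only for this illustration. A real implementation would launch and monitor actual jobs.

```python
import abc
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass(frozen=True)
class NodeUid:
    """Simplified stand-in for tfx.orchestration.experimental.core.task.NodeUid."""
    pipeline_id: str
    node_id: str


@dataclass
class PipelineState:
    """Simplified stand-in for the real PipelineState."""
    pipeline_id: str
    # Hypothetical field: node id -> whether its service job is still needed.
    service_nodes: Dict[str, bool] = field(default_factory=dict)


class ServiceJobManager(abc.ABC):
    """Stand-in mirroring the two abstract methods documented on this page."""

    @abc.abstractmethod
    def ensure_services(self, pipeline_state: PipelineState) -> Set[NodeUid]:
        ...

    @abc.abstractmethod
    def stop_services(self, pipeline_state: PipelineState) -> None:
        ...


class InMemoryServiceJobManager(ServiceJobManager):
    """Hypothetical manager that tracks job status in a dict."""

    def __init__(self) -> None:
        self._jobs: Dict[NodeUid, str] = {}  # status is "running" or "failed"

    def ensure_services(self, pipeline_state: PipelineState) -> Set[NodeUid]:
        failed: Set[NodeUid] = set()
        for node_id, still_needed in pipeline_state.service_nodes.items():
            uid = NodeUid(pipeline_state.pipeline_id, node_id)
            if not still_needed:
                # Optionally stop jobs that are no longer needed.
                self._jobs.pop(uid, None)
                continue
            # Start the job if it is not already tracked.
            status = self._jobs.setdefault(uid, "running")
            # Probe health and report failed nodes.
            if status == "failed":
                failed.add(uid)
        return failed

    def stop_services(self, pipeline_state: PipelineState) -> None:
        # Drop every job that belongs to this pipeline.
        for uid in [u for u in self._jobs if u.pipeline_id == pipeline_state.pipeline_id]:
            del self._jobs[uid]

    def mark_failed(self, uid: NodeUid) -> None:
        """Hypothetical hook that simulates a job failure."""
        self._jobs[uid] = "failed"


state = PipelineState("p1", {"trainer": True, "tuner": True})
manager = InMemoryServiceJobManager()
first_result = manager.ensure_services(state)   # starts both jobs
manager.mark_failed(NodeUid("p1", "tuner"))
second_result = manager.ensure_services(state)  # reports the failed node
```

Returning a set rather than raising keeps the orchestration loop in control of how failed nodes are handled.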
stop_services
@abc.abstractmethod
stop_services(
    pipeline_state: tfx.orchestration.experimental.core.pipeline_state.PipelineState
) -> None
Stops all service jobs associated with the pipeline.
Args | |
---|---|
pipeline_state | A PipelineState object for an active pipeline. |
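As a rough sketch of how the two methods might fit together, the loop below calls ensure_services on every tick and calls stop_services once when the loop exits. This is an assumption about usage, not the actual TFX orchestrator: `RecordingServiceJobManager` and `run_pipeline` are hypothetical names, and a plain string stands in for the PipelineState object.

```python
from typing import List, Set


class RecordingServiceJobManager:
    """Hypothetical fake that records calls instead of managing real jobs."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def ensure_services(self, pipeline_state: str) -> Set[str]:
        self.calls.append("ensure_services")
        return set()  # this fake never reports a failed node

    def stop_services(self, pipeline_state: str) -> None:
        self.calls.append("stop_services")


def run_pipeline(manager: RecordingServiceJobManager,
                 pipeline_state: str, ticks: int) -> Set[str]:
    """Calls ensure_services once per tick, then always stops services."""
    failed: Set[str] = set()
    try:
        for _ in range(ticks):
            failed |= manager.ensure_services(pipeline_state)
            if failed:
                break  # stop iterating once a node has failed permanently
    finally:
        # Clean up service jobs even if the loop exits early or raises.
        manager.stop_services(pipeline_state)
    return failed


manager = RecordingServiceJobManager()
failed_nodes = run_pipeline(manager, "my_pipeline", ticks=3)
```

Placing stop_services in a `finally` block is one way to make sure long-running jobs do not outlive the pipeline.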