DiCE API
Node manager

This module represents the node manager, a service to control the formation of clusters of worker nodes based on their properties.

Classes

class mi::neuraylib::IWorker_node_descriptor
 This interface describes a worker node and its properties.

class mi::neuraylib::ICluster_descriptor
 This interface describes a cluster and its properties.

class mi::neuraylib::ICluster_property_callback
 Abstract interface for signaling changed cluster properties.

class mi::neuraylib::IWorker_node_property_callback
 Abstract interface for signaling changed worker node properties.

class mi::neuraylib::IClient_node_callback
 Abstract interface for signaling changed cluster members.

class mi::neuraylib::IWorker_node_callback
 Abstract interface for signaling changed cluster members.

class mi::neuraylib::IHead_node_callback
 Abstract interface for signaling a change of the cluster application head node.

class mi::neuraylib::IShutdown_node_managers_callback
 Abstract interface for signaling a request to shut down all clients and workers.

class mi::neuraylib::IShutdown_cluster_callback
 Abstract interface for signaling a request to shut down a cluster.

class mi::neuraylib::IWorker_process_started_callback
 Abstract interface for indicating that a worker process has been fully started.

class mi::neuraylib::INode_manager_cluster
 The interface to a cluster created and managed by the node manager.

class mi::neuraylib::ICluster_filter
 A filter used to decide whether a cluster is eligible to be joined.

class mi::neuraylib::IWorker_node_filter
 A filter used to decide whether a worker node is eligible to be included in a cluster.

class mi::neuraylib::INode_manager_client
 The node manager client allows starting or joining DiCE clusters built from worker nodes.

class mi::neuraylib::IChild_process_resolver
 A filter used to decide if a command string to start a child process is eligible for execution.

class mi::neuraylib::INode_manager_worker
 The node manager worker class allows setting properties and announcing them to other nodes.

class mi::neuraylib::INode_manager_factory
 Factory to create node manager client and worker instances.
 

Detailed Description

This module represents the node manager, a service to control the formation of clusters of worker nodes based on their properties.

The node manager is part of the DiCE library and can be used by any application integrating DiCE. In the following, a client is a DiCE-based application that wants to offload work to additional worker nodes. The node manager allows the client to allocate and manage those worker nodes.

To use the node manager, a node manager process must be running on each worker node that client applications want to delegate work to. This process can itself be built on the DiCE library, which offers an API for registering properties at runtime and for changing them dynamically. For example, the node manager process on a worker node can detect local capabilities, such as the number of available CPU cores, the number of GPUs, or the amount of physical memory present, and set them as properties of the worker node. DiCE announces these and any other arbitrarily chosen properties to the client nodes.
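The following is a minimal sketch of how a worker-side process might collect local capabilities as key/value properties. It does not use the DiCE API: `Worker_properties`, `set_property`, `detect_local_properties`, and the key `"cpu_cores"` are hypothetical names chosen for illustration only, and the announcement to client nodes is left out.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <thread>

// Hypothetical property container (not a DiCE type): property name -> value.
using Worker_properties = std::map<std::string, std::string>;

// Stores a property or overwrites an existing one, mirroring the idea that
// properties can be changed dynamically at runtime.
inline void set_property(
    Worker_properties& props, const std::string& key, const std::string& value)
{
    assert(!key.empty());
    props[key] = value;
}

// Detects one local capability and records it as a property.
// std::thread::hardware_concurrency() may return 0 if the value is unknown.
inline Worker_properties detect_local_properties()
{
    Worker_properties props;
    const unsigned cores = std::thread::hardware_concurrency();
    set_property(props, "cpu_cores", std::to_string(cores));
    return props;
}
```

Other capabilities, such as the GPU count or the amount of physical memory, would be added in the same way using platform-specific queries.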

On the client nodes, the node manager API can be used to control the formation and joining of clusters of worker nodes. This can happen before the DiCE library is started or later, in order to add a cluster of worker nodes to a running application or to join an already running cluster.

The application running on the client nodes has full control over which cluster to join and which worker nodes to select when forming a cluster. This is achieved by writing a custom filter class. DiCE passes each eligible cluster or worker node to the filter, along with the properties that have been set by the node manager process running on the worker nodes. The filter returns true if the cluster or worker node in question should be chosen, and false otherwise. In addition, a client application can specify the minimum and maximum number of worker nodes that must be in the cluster for the cluster creation to succeed.
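The selection logic described above can be sketched as follows. This is an illustration under stated assumptions, not the DiCE interface: `Worker_node`, `Worker_filter`, `select_workers`, `has_two_gpus`, and the property key `"gpu_count"` are hypothetical, and the real filter classes are `mi::neuraylib::IWorker_node_filter` and `mi::neuraylib::ICluster_filter`, whose signatures are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a worker node descriptor (not a DiCE type).
struct Worker_node
{
    std::string address;
    std::map<std::string, int> properties;
};

// A filter returns true if the node should be chosen, false otherwise.
using Worker_filter = std::function<bool(const Worker_node&)>;

// Collects up to max_nodes accepted candidates into 'selected'.
// Returns false and leaves 'selected' empty if fewer than min_nodes
// candidates pass the filter, i.e. cluster creation would fail.
inline bool select_workers(
    const std::vector<Worker_node>& candidates,
    const Worker_filter& filter,
    std::size_t min_nodes,
    std::size_t max_nodes,
    std::vector<Worker_node>& selected)
{
    assert(min_nodes <= max_nodes);
    selected.clear();
    for (const Worker_node& node : candidates) {
        if (selected.size() == max_nodes)
            break;
        if (filter(node))
            selected.push_back(node);
    }
    if (selected.size() < min_nodes) {
        selected.clear();
        return false;
    }
    return true;
}

// Example filter: accept only nodes that announce at least two GPUs.
inline bool has_two_gpus(const Worker_node& node)
{
    const auto it = node.properties.find("gpu_count");
    return it != node.properties.end() && it->second >= 2;
}
```

A caller would pass `has_two_gpus` (or any other predicate) together with the desired node count range and check the return value before using the selected nodes.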

Each cluster created using the node manager API is associated with an automatically chosen multicast address, which can be passed to DiCE for forming a DiCE cluster. The cluster is also associated with a command string that is used to start child processes on the worker nodes.

A cluster can be shut down automatically when no client is using it anymore. The shutdown can be delayed by a timeout that is set by the client application. It is also possible to shut down a cluster immediately, even if client nodes are still using it or the timeout has not elapsed.
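The three shutdown cases above (automatic, delayed by a timeout, and immediate) can be expressed as a single decision function. The sketch below only illustrates that logic. `Cluster_state` and `should_shut_down` are hypothetical names, and how DiCE tracks clients and timeouts internally is not specified in this section.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical snapshot of a cluster's usage state (not a DiCE type).
struct Cluster_state
{
    unsigned client_count;                  // clients currently using the cluster
    std::chrono::seconds idle_time;         // time since the last client left
    std::chrono::seconds shutdown_timeout;  // delay requested by the client
};

// Returns true if the cluster should be shut down now.
inline bool should_shut_down(const Cluster_state& state, bool immediate_request)
{
    assert(state.shutdown_timeout.count() >= 0);
    if (immediate_request)
        return true;   // overrides both active clients and the timeout
    if (state.client_count > 0)
        return false;  // still in use
    return state.idle_time >= state.shutdown_timeout;
}
```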

The node manager API allows a client node to form or join any number of clusters at the same time or at different times.

The node manager can be operated in two network modes: multicasting and TCP networking with a discovery host. Multicasting is the default. TCP networking is intended for network environments where switches or routers do not allow UDP multicasting, so that node manager instances cannot establish a connection to each other. With TCP networking, a head node is used to allow node manager instances to find each other. There can be only one head node, and it must be the first instance that is started. A node manager instance that is started in TCP mode and is given its own local IP address as the head node address becomes the head node. Other nodes specify the head node's IP address as well and obtain the list of known nodes from it.
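The head node rule for TCP mode can be sketched as a simple address comparison. This is an assumption-laden illustration: `Node_role` and `resolve_role` are hypothetical names, the addresses are compared as plain strings, and how the node manager actually enumerates local addresses is not described here.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical role of a node manager instance in TCP mode.
enum class Node_role { head, member };

// An instance becomes the head node if the configured head node address
// is one of its own local IP addresses; otherwise it contacts that address
// to obtain the list of known nodes.
inline Node_role resolve_role(
    const std::vector<std::string>& local_addresses,
    const std::string& head_address)
{
    assert(!head_address.empty());
    const bool is_local =
        std::find(local_addresses.begin(), local_addresses.end(), head_address)
        != local_addresses.end();
    return is_local ? Node_role::head : Node_role::member;
}
```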

Keepalive PDU for the child process watchdog

struct keepalive { int type; int sequence_number; };
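As a rough sketch of how such a PDU might be handled, the code below copies the struct to and from a byte buffer and checks whether a sequence number has advanced. Only the `keepalive` struct itself comes from the text above. The helpers `pack`, `unpack`, and `is_fresh`, the use of host byte order, and the rule that a fresh keepalive must carry a larger sequence number are all assumptions for illustration; the actual wire format and watchdog behavior are not specified here.

```cpp
#include <array>
#include <cassert>
#include <cstring>

// The PDU layout as given in the documentation.
struct keepalive { int type; int sequence_number; };

using Keepalive_bytes = std::array<unsigned char, sizeof(keepalive)>;

// Copies the PDU into a byte buffer (assumption: host byte order, no padding
// conversion; sender and receiver run on the same machine).
inline Keepalive_bytes pack(const keepalive& pdu)
{
    Keepalive_bytes bytes{};
    std::memcpy(bytes.data(), &pdu, sizeof(pdu));
    return bytes;
}

// Reconstructs the PDU from a byte buffer produced by pack().
inline keepalive unpack(const Keepalive_bytes& bytes)
{
    keepalive pdu{};
    std::memcpy(&pdu, bytes.data(), sizeof(pdu));
    return pdu;
}

// Hypothetical watchdog check: a keepalive counts as fresh only if its
// sequence number is greater than the last one seen.
inline bool is_fresh(const keepalive& pdu, int last_sequence_number)
{
    return pdu.sequence_number > last_sequence_number;
}
```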