
.. _rug:

Service VM Orchestration and Management
=======================================

Astara Orchestrator
-------------------

:program:`astara-orchestrator` is a multi-processed, multithreaded Python
process composed of three primary subsystems, each of which is spawned as a
subprocess of the main :py:mod:`astara-orchestrator` process:

L3 and DHCP Event Consumption
-----------------------------

:py:mod:`astara.notifications` uses `kombu <https://pypi.python.org/pypi/kombu>`_
and a Python :py:mod:`multiprocessing.Queue` to listen for specific Neutron
service events (e.g., ``router.interface.create``, ``subnet.create.end``,
``port.create.end``, ``port.delete.end``) and normalize them into one of
several event types:

* ``CREATE`` - a router creation was requested
* ``UPDATE`` - services on a router need to be reconfigured
* ``DELETE`` - a router was deleted
* ``POLL`` - used by the :ref:`health monitor<health>` for checking aliveness
  of a Service VM
* ``REBUILD`` - a Service VM should be destroyed and recreated
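
To make the normalization step concrete, here is a minimal sketch of the
idea. The mapping and field names are illustrative assumptions, not the
actual :py:mod:`astara.notifications` code; ``POLL`` and ``REBUILD`` events
typically originate from the health monitor and operator tools rather than
Neutron notifications:

.. code-block:: python

    from collections import namedtuple

    Event = namedtuple('Event', ['tenant_id', 'router_id', 'crud'])

    # Illustrative mapping from raw Neutron event types to normalized
    # event types; the real table is larger.
    _CRUD_MAP = {
        'router.create.end': 'CREATE',
        'router.delete.end': 'DELETE',
        'router.interface.create': 'UPDATE',
        'subnet.create.end': 'UPDATE',
        'port.create.end': 'UPDATE',
        'port.delete.end': 'UPDATE',
    }

    def normalize(payload):
        """Translate a raw notification into a normalized Event, or None."""
        crud = _CRUD_MAP.get(payload.get('event_type'))
        if crud is None:
            return None  # not an event the orchestrator cares about
        return Event(tenant_id=payload.get('tenant_id'),
                     router_id=payload.get('router_id'),
                     crud=crud)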

As events are normalized and shuttled onto the :py:mod:`multiprocessing.Queue`,
:py:mod:`astara.scheduler` shards them (by tenant ID, by default) and
distributes them amongst a pool of worker processes it manages.

This system also consumes and distributes special :py:mod:`astara.command`
events, which are published by the :program:`rug-ctl`
:ref:`operator tools<operator_tools>`.
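
Sharding by tenant ID guarantees that all events for a given tenant land on
the same worker. A minimal sketch of the idea (illustrative, not the actual
:py:mod:`astara.scheduler` implementation):

.. code-block:: python

    import hashlib
    import multiprocessing

    NUM_WORKERS = 4  # the real worker count is configurable
    worker_queues = [multiprocessing.Queue() for _ in range(NUM_WORKERS)]

    def dispatch(event):
        """Route a normalized event to a worker keyed on its tenant ID."""
        digest = hashlib.md5(event.tenant_id.encode('utf-8')).hexdigest()
        worker_queues[int(digest, 16) % NUM_WORKERS].put(event)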

State Machine Workers and Router Lifecycle
------------------------------------------

Each multithreaded worker process manages a pool of state machines (one
per virtual router), each of which represents the lifecycle of an individual
router. As the scheduler distributes events for a specific router, logic in
the worker (dependent on the router's current state) determines which action
to take next:

.. graphviz:: worker_diagram.dot
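
A minimal sketch of this dispatch pattern (illustrative only; the class and
method names below are assumptions, not the actual astara worker code):

.. code-block:: python

    class RouterStateMachine(object):
        """Hypothetical stand-in for astara's per-router state machine."""

        def __init__(self, router_id):
            self.router_id = router_id
            self.state = 'CalcAction'

        def update(self, event):
            # Real logic consults self.state and the event to choose the
            # next state (Alive, CreateVM, ConfigureVM, ...).
            pass

    def worker_loop(event_queue, machines):
        """Pop events and advance the matching router's state machine."""
        while True:
            event = event_queue.get()
            machine = machines.setdefault(
                event.router_id, RouterStateMachine(event.router_id))
            machine.update(event)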

For example, let's say a user created a new Neutron network, subnet, and
router. In this scenario, a ``router-interface-create`` event would be
handled by the appropriate worker (based on tenant ID), and a transition
through the state machine might look something like this:

.. graphviz:: sample_boot.dot

State Machine Flow
++++++++++++++++++

The supported states in the state machine are:

:CalcAction: The entry point of the state machine. Depending on the
   current status of the Service VM (e.g., ``ACTIVE``, ``BUILD``,
   ``SHUTDOWN``) and the current event, determine the first step in the
   state machine to transition to.

:Alive: Check aliveness of the Service VM by attempting to communicate with
   it via its REST HTTP API.

:CreateVM: Call ``nova boot`` to boot a new Service VM. This will attempt
   to boot a Service VM up to a (configurable) number of times before
   placing the router into ``ERROR`` state.

:CheckBoot: Check aliveness (up to a configurable number of seconds) of the
   router until the VM is responsive and ready for initial configuration.

:ConfigureVM: Configure the Service VM and its services. This is generally
   the final step in the process of booting and configuring a router. This
   step communicates with the Neutron API to generate a comprehensive network
   configuration for the router (which is pushed to the router via its REST
   API). On success, the state machine yields control back to the worker
   thread, and that thread handles the next event in its queue (likely for
   a different Service VM and its state machine).

:ReplugVM: Attempt to hot-plug/unplug a network from the router via ``nova
   interface-attach`` or ``nova interface-detach``.

:StopVM: Terminate a running Service VM. This is generally performed when
   a Neutron router is deleted or via explicit operator tools.

:ClearError: After a (configurable) number of ``nova boot`` failures, Neutron
   routers are automatically transitioned into a cooldown ``ERROR`` state
   (so that :py:mod:`astara` will not continue to boot them forever and
   place additional load on failing hypervisors). This state transition is
   used to add routers back into management after issues are resolved and
   to signal to :py:mod:`astara-orchestrator` that it should attempt to
   manage them again.

:STATS: Reads traffic data from the router.

:CONFIG: Configures the VM and its services.

:EXIT: Processing stops.

The ACT(ion) variables are:

:Create: Create router was requested.
:Read: Read router traffic stats.
:Update: Update router configuration.
:Delete: Delete router.
:Poll: Poll router alive status.
:rEbuild: Recreate a router from scratch.
VM Variables are:
|
|
|
|
:Down: VM is known to be down.
|
|
|
|
:Booting: VM is booting.
|
|
|
|
:Up: VM is known to be up (pingable).
|
|
|
|
:Configured: VM is known to be configured.
|
|
|
|
:Restart Needed: VM needs to be rebooted.
|
|
|
|
:Hotplug Needed: VM needs to be replugged.
|
|
|
|
:Gone: The router definition has been removed from neutron.
|
|
|
|
:Error: The router has been rebooted too many times, or has had some
|
|
other error.
|
|
|
|
.. graphviz:: state_machine.dot
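
To make the transition logic concrete, here is a minimal sketch of a
``CalcAction``-style decision function. It is a simplified illustration
using the state and action names above, not astara's actual transition
table:

.. code-block:: python

    def calc_action(vm_status, action):
        """Pick the next state from the VM status and requested action."""
        if action == 'Delete':
            return 'StopVM'
        if vm_status == 'Down':
            return 'CreateVM'
        if vm_status == 'Booting':
            return 'CheckBoot'
        if action == 'Poll':
            return 'Alive'
        if action in ('Create', 'Update'):
            return 'ConfigureVM'
        if action == 'Read':
            return 'STATS'
        return 'EXIT'  # nothing to do; yield back to the worker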

.. _health:

Health Monitoring
-----------------

:py:mod:`astara.health` is a subprocess which delivers ``POLL`` events to
every known virtual router at a configurable interval. This event
transitions the state machine into the ``Alive`` state, which (depending on
the availability of the router) may simply exit the state machine (because
the router's status API replies with an ``HTTP 200``) or transition to the
``CreateVM`` state (because the router is unresponsive and must be
recreated).
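
A minimal sketch of that polling loop (illustrative only; the real
:py:mod:`astara.health` subprocess differs in detail):

.. code-block:: python

    import time

    POLL_INTERVAL = 30  # seconds; the real interval is configurable

    def health_loop(list_router_ids, event_queue):
        """Periodically enqueue a POLL event for every known router."""
        while True:
            for router_id in list_router_ids():
                event_queue.put(('POLL', router_id))
            time.sleep(POLL_INTERVAL)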

High Availability
-----------------

Astara supports high availability (HA) on both the control plane and the
data plane.

The ``astara-orchestrator`` service may be deployed in a configuration that
allows multiple service processes to span nodes, providing load distribution
and HA. For more information on clustering, see the
:ref:`install docs<cluster_astara>`.

It also supports orchestrating pairs of virtual appliances to provide HA of
the data path, allowing pairs of virtual routers to be clustered among
themselves using VRRP and connection tracking. To enable this, create
Neutron routers with the ``ha=True`` parameter, or set this property on
existing routers and issue a rebuild command via ``astara-ctl`` for that
router.
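
A minimal sketch of setting that property with python-neutronclient
(assuming an authenticated Keystone ``session``; the variable names are
illustrative, and the follow-up rebuild is issued with the ``astara-ctl``
operator tool as described above):

.. code-block:: python

    from neutronclient.v2_0 import client as neutron_client

    neutron = neutron_client.Client(session=session)

    # Create a new router whose appliances are clustered from the start.
    neutron.create_router({'router': {'name': 'ha-router', 'ha': True}})

    # Or enable HA on an existing router, then rebuild it via astara-ctl.
    neutron.update_router(existing_router_id, {'router': {'ha': True}})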