diff --git a/source/management_and_operations/monitor_alert/forecast.rst b/source/management_and_operations/monitor_alert/forecast.rst index b3541e7810..465af812f4 100644 --- a/source/management_and_operations/monitor_alert/forecast.rst +++ b/source/management_and_operations/monitor_alert/forecast.rst @@ -4,18 +4,39 @@ Resource Forecast ================================================================================ +Overview +-------------------------------------------------------------------------------- + The OpenNebula Resource Forecast system provides short-term and long-term predictions for resource usage across hosts and virtual machines. By forecasting metric trends and performance related to CPU, memory, network, and disk usage, it enables administrators to proactively manage resources. This predictive capability helps optimize resource allocation, anticipate potential bottlenecks, and ensure the efficiency and stability of both infrastructure and virtual resources. +.. note:: + Resource forecasting works automatically once the hypervisor package is installed and requires no additional action from administrators to generate predictions. For configuration options, see the :ref:`Configuration and Optimization ` section. + +Key Benefits +-------------------------------------------------------------------------------- + +* **Proactive resource management** - Identify potential resource constraints before they impact performance +* **Improved capacity planning** - Make informed decisions about infrastructure expansion +* **Enhanced workload scheduling** - Optimize VM placement based on predicted resource utilization +* **Better user experience** - Maintain consistent performance by avoiding resource contention + By integrating forecasting with the Distributed Resource Scheduler (DRS), cloud administrators gain access to a proactive tool for intelligent scheduling and resource optimization. 
This integration enables the system to anticipate potential issues in virtual machine workloads, facilitating dynamic adjustments to prevent performance bottlenecks, enhance workload balancing, and ensure optimal utilization of available resources. As a result, overall cluster efficiency, reliability, and performance can be significantly improved. +Types of Forecasts +-------------------------------------------------------------------------------- + +OpenNebula provides two types of resource forecasts: + Long-term Forecast ================================================================================ +**Purpose**: Used for capacity planning, hardware provisioning, and resource allocation strategies. + Long-term forecast information is accessible through the CLI and Sunstone. Using Sunstone, when selecting a Virtual Machine or a Host, the "Monitoring" tab displays not only real-time resource usage data but also predicted long-term forecasts. These forecasts provide valuable insights into expected resource consumption trends, helping cloud administrators plan and allocate resources more effectively. [INSERT SUNSTONE IMAGE HERE FOR LONG-TERM FORECAST] -You can use also the CLI to access information about the long-term forecast. +You can also use the CLI to access information about the long-term forecast. [INSERT CLI INFORMATION HERE FOR LONG-TERM FORECAST] @@ -23,43 +44,68 @@ By default, long-term forecast is performed for the next 30 days. Short-term Forecast ================================================================================ -Short-term forecasts are utilized by the Predictive DRS to optimize cluster load distribution based on CPU, memory, disk, and network predictions. To enable predictive capabilities for DRS, refer to the section [DRS section reference]. This ensures that scheduling decisions are informed by accurate and data-driven insights, enhancing resource efficiency. -Information about short-term forecast can be accessed also using Sunstone. 
When selecting a Virtual Machine or a Host, the "Monitoring" tab displays not only real-time resource usage data but also predicted short-term forecasts. +**Purpose**: Used for immediate operational decisions and dynamic resource adjustments. + +Short-term forecasts are utilized by the Predictive DRS to optimize cluster load distribution based on CPU, memory, disk, and network predictions. To enable predictive capabilities for DRS, refer to the :ref:`Distributed Resource Scheduling section `. This ensures that scheduling decisions are informed by accurate and data-driven insights, enhancing resource efficiency. + +Information about short-term forecasts can also be accessed using Sunstone. When selecting a Virtual Machine or a Host, the "Monitoring" tab displays not only real-time resource usage data but also predicted short-term forecasts. [INSERT SUNSTONE IMAGE HERE FOR SHORT-TERM FORECAST] -You can use also the CLI to access information about the long-term forecast. +You can also use the CLI to access information about the short-term forecast. [INSERT CLI INFORMATION HERE FOR SHORT-TERM FORECAST] By default, short-term forecast is performed for the next 5 minutes. -Forecast generation +How Forecasting Works +-------------------------------------------------------------------------------- + +Forecast Generation ================================================================================ -Forecasts are generated by using the OpenNebula Built-in monitoring system (see :ref:`monitor_alert_monitor`). At defined intervals, a prediction probe is executed for both Hosts and Virtual Machines to analyze real-time resource usage metrics, including CPU, memory, disk, and network utilization. -Each host maintains a dedicated database ('/var/tmp/one_db/host.db') that is continuously updated during monitoring cycles. This database stores historical metrics and is used for time-series analysis and prediction generation. 
The forecast computation process is distributed across all hosts within a cluster to enhance scalability and efficiency. For each VM running on a host, an individual database is created and stored in '/var/tmp/one_db' as '.db'. +Forecasts are generated using the OpenNebula Built-in monitoring system (see :ref:`monitor_alert_monitor`). At defined intervals, a prediction probe is executed for both Hosts and Virtual Machines to analyze real-time resource usage metrics, including CPU, memory, disk, and network utilization. + +Each host maintains a dedicated database (``/var/tmp/one_db/host.db``) that is continuously updated during monitoring cycles. This database stores historical metrics and is used for time-series analysis and prediction generation. The forecast computation process is distributed across all hosts within a cluster to enhance scalability and efficiency. For each VM running on a host, an individual database is created and stored in ``/var/tmp/one_db`` as ``.db``. .. note:: If a VM is migrated, the related DB will be created from scratch in the new host where the VM will be allocated. This will impact the forecast of that VM until enough data is monitored. -The forecast computation process accesses the stored metrics and performs statistical analysis to identify trends, patterns, and seasonal variations. Using this analysis, the system computes predictive models that estimate future resource consumption. The generated predictions are then sent to the OpenNebula monitoring system, where they are integrated with existing monitoring data and made accessible via Sunstone, CLI or used by the Predictive DRS. +The forecast computation process: + +1. Accesses the stored metrics from the database +2. Performs statistical analysis to identify trends, patterns, and seasonal variations +3. Computes predictive models that estimate future resource consumption +4. Sends generated predictions to the OpenNebula monitoring system +5. 
Makes the predictions available through Sunstone and the CLI, and to the Predictive DRS -Forecast quality and accuracy ------------------------------ +Forecast Quality and Accuracy +================================================================================ + +The accuracy of forecasts depends on several factors: + +* **Historical data volume** - More data generally leads to better predictions +* **Data quality** - Consistent monitoring data without gaps improves accuracy +* **Workload predictability** - Regular patterns are easier to forecast than random spikes +* **Database retention period** - Longer retention captures more seasonal patterns The retention period for both Host and Virtual Machine databases is configurable, enabling administrators to manage storage utilization efficiently while maintaining prediction accuracy. Database retention can impact the accuracy of predictions, particularly for long-term forecasts. The forecast module analyzes all historical data in the database to decompose time series data for different metrics into trends and seasonality. Depending on the data's seasonality and the duration of the long-term forecast, the database retention period should be appropriately configured, considering both the required storage size and prediction accuracy. -.. note:: The actual prediction module is sensible to outliers. This means that presence of outliers can have a bad effect on the predictions. +.. warning:: The prediction module is sensitive to outliers. This means that the presence of outliers can have a negative effect on the predictions. Consider investigating unusual VM behavior if forecasts suddenly become less accurate. For further details on configuring forecast retention or optimizing prediction accuracy, refer to the next section. -Configuration +.. 
_forecast_configuration: + +Configuration and Optimization +-------------------------------------------------------------------------------- + +Configuration File ================================================================================ The configuration file for the Resource Forecast can be found in ``/var/lib/one/remotes/kvm-probes.d/forecast.conf``. -The default configuration is the following +The default configuration is the following: .. code:: yaml @@ -78,10 +124,53 @@ The default configuration is the following forecast_far_period: 48 # Number of hours -The configuration file consists of two sections, one related to the Host and the other to the Virtual Machine. +The configuration file consists of two sections: + +1. **Host section**: Controls forecast settings for physical hosts +2. **Virtual Machine section**: Controls forecast settings for VMs + +Default Configuration Values +================================================================================ + +**Host settings**: + +* DB retention: 4 weeks +* Short-term forecast: 5 minutes +* Long-term forecast: 720 hours (30 days) + +**Virtual Machine settings**: + +* DB retention: 2 weeks +* Short-term forecast: 5 minutes +* Long-term forecast: 48 hours (2 days) + +Storage Considerations +-------------------------------------------------------------------------------- + +The size of forecast databases depends on retention periods and monitoring frequency: + +* **Host database**: ~2.5 MB for 4 weeks of data (6 metrics, 2-minute interval) +* **VM database**: ~6.5 MB for 2 weeks of data (8 metrics, 30-second interval) + +You may need to adjust these values based on: + +* Available storage capacity on hosts +* Number of VMs per host +* Accuracy requirements for forecasts +* Historical data needs for your specific workloads + +After changing configuration values, monitoring will continue with the new settings without requiring a restart of OpenNebula services. 
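The storage estimates above can be turned into a quick sizing check. The following Python snippet is a minimal sketch, not part of OpenNebula. It assumes that each stored metric value costs roughly 20 bytes on disk, a figure back-derived from the ~2.5 MB (Host) and ~6.5 MB (VM) estimates and the default retention, interval, and metric counts listed above. The helper names ``estimate_db_size_mb`` and ``estimate_host_total_mb`` are hypothetical.

.. code:: python

    """Rough sizing sketch for the forecast databases kept in /var/tmp/one_db/.

    Assumption (not an OpenNebula API): each stored metric value costs about
    20 bytes on disk. The figure is back-derived from the documented estimates
    (~2.5 MB per Host DB, ~6.5 MB per VM DB) and is only an approximation.
    """

    BYTES_PER_VALUE = 20  # assumed average on-disk cost of one stored value


    def estimate_db_size_mb(retention_days: int, interval_s: int, metrics: int,
                            bytes_per_value: float = BYTES_PER_VALUE) -> float:
        """Estimate the size, in MB (10**6 bytes), of one forecast database."""
        samples = retention_days * 86_400 // interval_s  # monitoring cycles retained
        return samples * metrics * bytes_per_value / 1_000_000


    def estimate_host_total_mb(vms_per_host: int) -> float:
        """Host DB plus one DB per running VM, using the default settings."""
        host_db = estimate_db_size_mb(retention_days=28, interval_s=120, metrics=6)
        vm_db = estimate_db_size_mb(retention_days=14, interval_s=30, metrics=8)
        return host_db + vms_per_host * vm_db


    if __name__ == "__main__":
        print(f"Host DB: {estimate_db_size_mb(28, 120, 6):.1f} MB")  # Host DB: 2.4 MB
        print(f"VM DB: {estimate_db_size_mb(14, 30, 8):.1f} MB")  # VM DB: 6.5 MB
        print(f"Host + 20 VMs: {estimate_host_total_mb(20):.0f} MB")  # Host + 20 VMs: 131 MB

Under these assumptions, a host running 20 VMs with the default settings would need roughly 131 MB in ``/var/tmp/one_db/``. If the database sizes you observe on your hosts differ, adjust ``bytes_per_value`` accordingly.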
+ +Practical Usage Tips +-------------------------------------------------------------------------------- -By default, Host DB retention is set to 4 weeks, the short term forecast to 5 minutes and the long-term forecast to 720 hours (i.e., 30 days). +* **Start with defaults**: The default configuration works well for most environments +* **Increase retention gradually**: If you need more accurate long-term forecasts, increase retention periods incrementally +* **Monitor database sizes**: Check ``/var/tmp/one_db/`` periodically to ensure forecast DBs aren't consuming too much space +* **Consider workload patterns**: Adjust retention based on your workload cycles (daily, weekly, monthly) +* **Use short-term forecasts** for operational decisions and **long-term forecasts** for capacity planning -By default, Virtual Machine DB retention is set to 2 weeks, the short term forecast to 5 minutes and the long-term forecast to 48 hours (i.e., 2 days). +See Also +-------------------------------------------------------------------------------- -The estimated size of the Host database for 4 weeks of data across 6 metrics with a 2-minute monitoring interval is approximately 2.5 MB. Instead, the estimated size of the Virtual Machine database for 2 weeks of data across 8 metrics with a 30-second monitoring interval is around 6.5 MB. 
\ No newline at end of file +* :ref:`OpenNebula Monitoring System ` \ No newline at end of file diff --git a/source/management_and_operations/monitor_alert/monitor.rst b/source/management_and_operations/monitor_alert/monitor.rst index 96fdce9c75..9b145db798 100644 --- a/source/management_and_operations/monitor_alert/monitor.rst +++ b/source/management_and_operations/monitor_alert/monitor.rst @@ -5,50 +5,75 @@ OpenNebula Built-in Monitoring ================================================================================ Virtual Machine Monitoring -================================================================================ +-------------------------------------------------------------------------------- +The monitoring probes gather information attributes and insert them in the VM template. This information is mainly used for: + + * Monitoring the status of the VM. + * Gathering the resource usage data of the VM. + +In general, you can find the following monitoring information for a VM. Note that each hypervisor may include additional attributes: + ++---------------+-----------------------------------------------------------------------------------+ +| Key | Description | ++===============+===================================================================================+ +| ID | ID of the VM in OpenNebula. | ++---------------+-----------------------------------------------------------------------------------+ +| UUID | Unique ID, must be unique across all hosts. | ++---------------+-----------------------------------------------------------------------------------+ +| MONITOR | Base64 encoded monitoring information (see details below). 
| ++---------------+-----------------------------------------------------------------------------------+ + +The MONITOR information includes the following data: + ++---------------+-----------------------------------------------------------------------------------+ +| Key | Description | ++===============+===================================================================================+ +| TIMESTAMP | Timestamp of the measurement. | ++---------------+-----------------------------------------------------------------------------------+ +| CPU | Percentage of 1 CPU consumed (two fully consumed CPUs is 2.0). | ++---------------+-----------------------------------------------------------------------------------+ +| MEMORY | MEMORY consumption in kilobytes. | ++---------------+-----------------------------------------------------------------------------------+ +| DISKRDBYTES | Amount of bytes read from disk. | ++---------------+-----------------------------------------------------------------------------------+ +| DISKRDIOPS | Number of IO read operations. | ++---------------+-----------------------------------------------------------------------------------+ +| DISKWRBYTES | Amount of bytes written to disk. | ++---------------+-----------------------------------------------------------------------------------+ +| DISKWRIOPS | Number of IO write operations. | ++---------------+-----------------------------------------------------------------------------------+ +| NETRX | Received bytes from the network. | ++---------------+-----------------------------------------------------------------------------------+ +| NETTX | Sent bytes to the network. | ++---------------+-----------------------------------------------------------------------------------+ -The monitoring probes gathers information attributes and insert them in the VM template. This information is mainly used for: - - * Monitor the status of the VM. - * Gather the resource usage data of the VM. 
- -In general, you can find the following monitoring information for a VM, note that each hypervisor may include additional attributes: - -+---------------+----------------------------------------------------------------------------------------------+ -| Key | Description | -+===============+==============================================================================================+ -| ID | ID of the VM in OpenNebula. | -+---------------+----------------------------------------------------------------------------------------------+ -| UUID | Unique ID, must be unique across all hosts. | -+---------------+----------------------------------------------------------------------------------------------+ -| MONITOR | Base64 encoded monitoring information, the monitoring information includes following data: | -+---------------+----------------------------------------------------------------------------------------------+ -| TIMESTAMP | Timestamp of the measurement. | -+---------------+----------------------------------------------------------------------------------------------+ -| CPU | Percentage of 1 CPU consumed (two fully consumed cpu is 2.0). | -+---------------+----------------------------------------------------------------------------------------------+ -| MEMORY | MEMORY consumption in kilobytes. | -+---------------+----------------------------------------------------------------------------------------------+ -| DISKRDBYTES | Amount of bytes read from disk. | -+---------------+----------------------------------------------------------------------------------------------+ -| DISKRDIOPS | Number of IO read operations. | -+---------------+----------------------------------------------------------------------------------------------+ -| DISKWRBYTES | Amount of bytes written to disk. | -+---------------+----------------------------------------------------------------------------------------------+ -| DISKWRIOPS | Number of IO write operations. 
| -+---------------+----------------------------------------------------------------------------------------------+ -| NETRX | Received bytes from the network. | -+---------------+----------------------------------------------------------------------------------------------+ -| NETTX | Sent bytes to the network. | -+---------------+----------------------------------------------------------------------------------------------+ +The metrics above are directly read from and stored in the monitoring database. + +Additionally, the following derived metrics are calculated from the stored metrics and used for forecasting. These derived metrics are not stored in the database but are computed on-demand: + ++---------------+-----------------------------------------------------------------------------------+ +| Key | Description | ++===============+===================================================================================+ +| NETRX_BW | Network received bandwidth (rate of change of NETRX). | ++---------------+-----------------------------------------------------------------------------------+ +| NETTX_BW | Network transmitted bandwidth (rate of change of NETTX). | ++---------------+-----------------------------------------------------------------------------------+ +| DISKRD_BW | Disk read bandwidth (rate of change of DISKRDBYTES). | ++---------------+-----------------------------------------------------------------------------------+ +| DISKWR_BW | Disk write bandwidth (rate of change of DISKWRBYTES). | ++---------------+-----------------------------------------------------------------------------------+ +| DISKRDIOPS_BW | Disk read operations per second (rate of change of DISKRDIOPS). | ++---------------+-----------------------------------------------------------------------------------+ +| DISKWRIOPS_BW | Disk write operations per second (rate of change of DISKWRIOPS).
| ++---------------+-----------------------------------------------------------------------------------+ Host Monitoring -================================================================================ +-------------------------------------------------------------------------------- The monitoring probes gather information attributes and insert them in the Host template. This information is mainly used for: * Monitoring the status of the Host to detect any error condition. - * Gathering the configuration of the Host (e.g. capacity, PCI devices or NUMA nodes). This information is used to control VM resource assignments. + * Gathering the configuration of the Host (e.g., capacity, PCI devices, or NUMA nodes). This information is used to control VM resource assignments. * Creating placement constraints for allocation of VMs, :ref:`see more details here `. In general, you can find the following monitoring information in a Host. Note that each hypervisor may include additional attributes: @@ -58,61 +83,104 @@ In general, you can find the following monitoring information in a Host. Note th +============+====================================================================================================+ | HYPERVISOR | Name of the hypervisor of the Host, useful for selecting the Hosts with a specific technology. | +------------+----------------------------------------------------------------------------------------------------+ -| ARCH | Architecture of the Host CPUs, e.g. x86_64. | +| ARCH | Architecture of the Host CPUs, e.g., x86_64. | +------------+----------------------------------------------------------------------------------------------------+ -| MODELNAME | Model name of the Host CPU, e.g. Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz. | +| MODELNAME | Model name of the Host CPU, e.g., Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz. 
| +------------+----------------------------------------------------------------------------------------------------+ | CPUSPEED | Speed in MHz of the CPUs. | +------------+----------------------------------------------------------------------------------------------------+ | HOSTNAME | As returned by the ``hostname`` command. | +------------+----------------------------------------------------------------------------------------------------+ -| VERSION | This is the version of the monitoring probes. Used to control local changes and the update process | +| VERSION | This is the version of the monitoring probes. Used to control local changes and the update process.| +------------+----------------------------------------------------------------------------------------------------+ -| MAX_CPU | Number of CPUs multiplied by 100. For example, a 16 cores machine will have a value of 1600. | +| MAX_CPU | Number of CPUs multiplied by 100. For example, a 16-core machine will have a value of 1600. | | | The value of RESERVED_CPU will be subtracted from the information reported by the | -| | monitoring system. This value is displayed as ``TOTAL CPU`` by the | -| | ``onehost show`` command under ``HOST SHARE`` section. | +| | monitoring system. This value is displayed as ``TOTAL CPU`` by the | +| | ``onehost show`` command under the ``HOST SHARE`` section. | +------------+----------------------------------------------------------------------------------------------------+ | MAX_MEM | Maximum memory that can be used for VMs. It is advised to discount the memory | | | used by the hypervisor using RESERVED_MEM. This value is subtracted from the memory | | | amount reported. The value is displayed as ``TOTAL MEM`` by the ``onehost show`` | -| | command under ``HOST SHARE`` section. | +| | command under the ``HOST SHARE`` section. 
| +------------+----------------------------------------------------------------------------------------------------+ | MAX_DISK | Total space in megabytes in the DATASTORE LOCATION. | +------------+----------------------------------------------------------------------------------------------------+ | USED_CPU | Percentage of used CPU multiplied by the number of cores. This value is displayed | -| | as ``USED CPU (REAL)`` by the ``onehost show`` command under ``HOST SHARE`` section. | +| | as ``USED CPU (REAL)`` by the ``onehost show`` command under the ``HOST SHARE`` section. | +------------+----------------------------------------------------------------------------------------------------+ | USED_MEMORY| Memory used, in kilobytes. This value is displayed as ``USED MEMORY (REAL)`` | -| | by the ``onehost show`` command under ``HOST SHARE`` section. | +| | by the ``onehost show`` command under the ``HOST SHARE`` section. | +------------+----------------------------------------------------------------------------------------------------+ | USED_DISK | Used space in megabytes in the DATASTORE LOCATION. | +------------+----------------------------------------------------------------------------------------------------+ | FREE_CPU | Percentage of idling CPU multiplied by the number of cores. For example, | -| | if 50% of the CPU is idling in a 4 core machine the value will be 200. | +| | if 50% of the CPU is idling in a 4-core machine, the value will be 200. | +------------+----------------------------------------------------------------------------------------------------+ | FREE_MEMORY| Available memory for VMs at that moment, in kilobytes. | +------------+----------------------------------------------------------------------------------------------------+ -| FREE_DISK | Free space in megabytes in the DATASTORE LOCATION | +| FREE_DISK | Free space in megabytes in the DATASTORE LOCATION. 
| +------------+----------------------------------------------------------------------------------------------------+ | CPU_USAGE | Total CPU allocated to VMs running on the Host as requested in ``CPU`` | | | in each VM template. This value is displayed as ``USED CPU (ALLOCATED)`` | -| | by the ``onehost show`` command under ``HOST SHARE`` section. | +| | by the ``onehost show`` command under the ``HOST SHARE`` section. | +------------+----------------------------------------------------------------------------------------------------+ | MEM_USAGE | Total MEM allocated to VMs running on the Host as requested in ``MEMORY`` | | | in each VM template. This value is displayed as ``USED MEM (ALLOCATED)`` | -| | by the ``onehost show`` command under ``HOST SHARE`` section. | +| | by the ``onehost show`` command under the ``HOST SHARE`` section. | +------------+----------------------------------------------------------------------------------------------------+ | DISK_USAGE | Total size allocated to disk images of VMs running on the Host; computed | | | using the ``SIZE`` attribute of each image and considering the datastore characteristics. | +------------+----------------------------------------------------------------------------------------------------+ -| NETRX | Received bytes from the network | +| NETRX | Received bytes from the network. | +------------+----------------------------------------------------------------------------------------------------+ -| NETTX | Transferred bytes to the network | +| NETTX | Transferred bytes to the network. | +------------+----------------------------------------------------------------------------------------------------+ | WILD | Comma-separated list of VMs running in the Host that were not launched | -| | and are not currently controlled by OpenNebula | +| | and are not currently controlled by OpenNebula. 
| +------------+----------------------------------------------------------------------------------------------------+ | ZOMBIES | Comma-separated list of VMs running in the Host that were launched by | | | OpenNebula but are not currently controlled by it. | +------------+----------------------------------------------------------------------------------------------------+ + +The metrics above are directly read from and stored in the monitoring database. + +Additionally, the following derived metrics are calculated from the stored metrics and used for forecasting. These derived metrics are not stored in the database but are computed on-demand: + ++---------------+-----------------------------------------------------------------------------------+ +| Key | Description | ++===============+===================================================================================+ +| NETRX_BW | Network received bandwidth (rate of change of NETRX). | ++---------------+-----------------------------------------------------------------------------------+ +| NETTX_BW | Network transmitted bandwidth (rate of change of NETTX). | ++---------------+-----------------------------------------------------------------------------------+ + +Monitoring Database Structure +-------------------------------------------------------------------------------- + +OpenNebula uses a distributed database approach to store and process monitoring data, optimizing performance and scalability across your cloud infrastructure.
+ +Host Databases +================================================================================ + +Each physical host in your OpenNebula deployment maintains its own dedicated monitoring database: + +* **Location**: ``/var/tmp/one_db/host.db`` +* **Purpose**: Stores all historical monitoring metrics for the host +* **Updates**: Continuously updated during regular monitoring cycles +* **Processing**: The forecast computation occurs locally on each host, distributing the computational load across the cluster + +Virtual Machine Databases +================================================================================ + +Each VM has a dedicated database that tracks its specific metrics: + +* **Location**: ``/var/tmp/one_db/.db`` on the host where the VM is running +* **Purpose**: Stores all historical monitoring metrics for that specific VM +* **Updates**: Continuously updated during regular monitoring cycles with VM-specific data +* **Lifecycle**: If a VM is migrated to another host, a new database will be created from scratch on the destination host + +.. note:: + After VM migration, forecast accuracy may be temporarily reduced until sufficient monitoring data is collected on the new host. + +For more information about how these databases are used for resource forecasting, see the :ref:`Resource Forecast ` section.
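The ``*_BW`` metrics in the derived-metrics tables above are computed on demand as rates of change of cumulative counters (for example, ``NETRX_BW`` from ``NETRX``). The following Python snippet is a minimal sketch of that calculation, not OpenNebula's own code. It assumes that samples are ``(TIMESTAMP, value)`` pairs ordered by time, that ``TIMESTAMP`` is expressed in seconds, and that a decreasing counter indicates a reset, in which case no rate is reported for that interval. The function name ``rate_of_change`` is hypothetical.

.. code:: python

    """Sketch: deriving a *_BW metric from a cumulative monitoring counter.

    Illustrative only, not OpenNebula's implementation. Assumes samples are
    (TIMESTAMP, value) pairs ordered by time, with TIMESTAMP in seconds, and
    treats a decreasing counter as a reset.
    """
    from typing import List, Tuple

    Sample = Tuple[int, int]  # (TIMESTAMP in seconds, cumulative counter value)


    def rate_of_change(samples: List[Sample]) -> List[Tuple[int, float]]:
        """Return (TIMESTAMP, units per second) for each pair of consecutive samples."""
        rates: List[Tuple[int, float]] = []
        for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
            dt = t1 - t0
            if dt <= 0:
                continue  # duplicate or out-of-order timestamp: no usable interval
            delta = v1 - v0
            if delta < 0:
                continue  # counter went backwards (reset): skip this interval
            rates.append((t1, delta / dt))
        return rates


    if __name__ == "__main__":
        # Hypothetical NETRX samples (bytes) taken every 30 seconds; the last one follows a reset.
        netrx = [(0, 0), (30, 3_000_000), (60, 9_000_000), (90, 1_000)]
        print(rate_of_change(netrx))  # [(30, 100000.0), (60, 200000.0)]

Dividing by the actual time between samples, rather than by a fixed monitoring interval, keeps the rate correct when a monitoring cycle is delayed or skipped.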