Skip to content

Latest commit

 

History

History
546 lines (426 loc) · 12.4 KB

File metadata and controls

546 lines (426 loc) · 12.4 KB

NEXUS On-Call Management Documentation

Overview

The NEXUS On-Call Management system provides comprehensive incident management, intelligent scheduling, and multi-channel notification capabilities. All on-call components are operational and have been thoroughly tested. Status: OPERATIONAL - 100% Complete & Debugged

Components

User Management

Status: Operational

Features: User profile management with skills and preferences Availability tracking and scheduling Role-based access control Performance metrics and statistics Contact information managementAPI Endpoints: GET /api/oncall/users - List all users GET /api/oncall/user/:user

Id/stats - User performance stats POST /api/oncall/users - Create new user PUT /api/oncall/user/:user

Id - Update user DELETE /api/oncall/user/:user

Id - Delete user

Usage Examples: Get all users curl http://127.0.0.1:41663/api/oncall/users

Get user stats curl http://127.0.0.1:41663/api/oncall/user/user-1/stats

Create new user curl -X POST http://127.0.0.1:41663/api/oncall/users
-H "Content-Type: application/json"
-d '{ "name": "John Doe", "email": "john@nexus-support.com", "role": "senior_engineer", "skills": ["backend", "database"], "phone": "+1234567890" }'User Profile Structure: { "id": "user-1", "name": "Alice Johnson", "email": "alice@nexus-support.com", "role": "senior_engineer", "timezone": "America/New_York", "skills": ["backend", "database", "security"], "phone": "+1234567890", "preferences": { "preferred

Hours": { "start": "09:00", "end": "17:00" }, "avoid

Days": ["saturday", "sunday"] }, "current

Incidents": [], "total

Incidents": 0, "average

Resolution

Time": 0, "on

Call

Hours": 0 }Incident Management

Status: Operational

Features: Complete incident lifecycle management Incident assignment and tracking Escalation policy enforcement Resolution tracking and analysis Incident collaboration toolsAPI Endpoints: GET /api/oncall/incidents - List incidents POST /api/oncall/incidents - Create incident GET /api/oncall/incident/:incident

Id - Get incident details POST /api/oncall/incident/:incident

Id/assign - Assign incident POST /api/oncall/incident/:incident

Id/acknowledge - Acknowledge incident POST /api/oncall/incident/:incident

Id/resolve - Resolve incident POST /api/oncall/incident/:incident

Id/handoff - Handoff incident PUT /api/oncall/incident/:incident

Id - Update incident DELETE /api/oncall/incident/:incident

Id - Delete incident

Usage Examples: Create incident curl -X POST http://127.0.0.1:41663/api/oncall/incidents
-H "Content-Type: application/json"
-d '{ "title": "Database Connection Issue", "description": "Unable to connect to MongoDB", "severity": "high", "tags": ["database", "connectivity"] }'Assign incident curl -X POST http://127.0.0.1:41663/api/oncall/incident/incident-1/assign
-H "Content-Type: application/json"
-d '{"user

Id": "user-1"}'Resolve incident curl -X POST http://127.0.0.1:41663/api/oncall/incident/incident-1/resolve
-H "Content-Type: application/json"
-d '{ "user

Id": "user-1", "resolution

Details": "Fixed database connection string" }'Incident Structure: { "id": "incident-1", "title": "Database Connection Issue", "description": "Unable to connect to MongoDB", "severity": "high", "status": "open", "created

At": "2026-05-14T03:00:00.000Z", "updated

At": "2026-05-14T03:00:00.000Z", "assigned

To": "user-1", "acknowledged

By": null, "resolved

By": null, "resolution

Details": null, "tags": ["database", "connectivity"], "related

Alerts": ["alert-1", "alert-2"], "escalations": [], "handoffs": [] }Scheduling System

Status: Operational

Features: Intelligent on-call rotation scheduling Automatic shift generation Conflict resolution and optimization Schedule visibility and management Holiday and availability handlingAPI Endpoints: GET /api/oncall/schedule - Get schedule GET /api/oncall/schedule/:date - Get schedule for specific date POST /api/oncall/schedule - Generate new schedule PUT /api/oncall/schedule - Update schedule GET /api/oncall/current - Get current on-call person

Usage Examples: Get current schedule curl http://127.0.0.1:41663/api/oncall/schedule

Get schedule for specific date curl http://127.0.0.1:41663/api/oncall/schedule/2026-05-15Get current on-call person curl http://127.0.0.1:41663/api/oncall/current

Generate new schedule curl -X POST http://127.0.0.1:41663/api/oncall/schedule
-H "Content-Type: application/json"
-d '{ "start

Date": "2026-05-15", "end

Date": "2026-05-22", "shift

Type": "weekly" }'Schedule Structure: { "date": "2026-05-14", "shifts": [ { "id": "shift-1", "type": "primary", "start

Time": "09:00", "end

Time": "17:00", "assigned

To": "user-1", "backup": "user-2", "status": "active" }, { "id": "shift-2", "type": "secondary", "start

Time": "17:00", "end

Time": "09:00", "assigned

To": "user-2", "backup": "user-3", "status": "standby" } ] }Escalation Management

Status: Operational

Features: Automated escalation policies Multi-level escalation rules Timeout-based escalation Escalation tracking and reporting Custom escalation workflowsAPI Endpoints: GET /api/oncall/escalation-policies - List escalation policies POST /api/oncall/escalation-policies - Create escalation policy GET /api/oncall/escalation-policy/:id - Get escalation policy PUT /api/oncall/escalation-policy/:id - Update escalation policy DELETE /api/oncall/escalation-policy/:id - Delete escalation policy

Usage Examples: Get escalation policies curl http://127.0.0.1:41663/api/oncall/escalation-policies

Create escalation policy curl -X POST http://127.0.0.1:41663/api/oncall/escalation-policies
-H "Content-Type: application/json"
-d '{ "name": "Critical Incident Escalation", "rules": [ { "condition": "severity == critical", "timeout": 300, "action": "escalate_to_manager" } ] }'Notification System

Status: Operational

Features: Multi-channel notifications (Email, Slack, Pager

Duty, SMS) Notification templates and customization Notification history and tracking Delivery status monitoring Notification preferences managementAPI Endpoints: GET /api/oncall/notification-channels - List notification channels POST /api/oncall/notifications - Send notification GET /api/oncall/notifications/history - Notification history GET /api/oncall/notification-preferences - User notification preferences PUT /api/oncall/notification-preferences - Update notification preferences

Usage Examples: Get notification channels curl http://127.0.0.1:41663/api/oncall/notification-channels

Send notification curl -X POST http://127.0.0.1:41663/api/oncall/notifications
-H "Content-Type: application/json"
-d '{ "user

Id": "user-1", "type": "incident_assigned", "message": "New incident assigned: Database Connection Issue", "channels": ["email", "slack"] }'Configuration

Escalation Policies

Define escalation rules for different incident types:{ "name": "Critical Incident Escalation", "description": "Escalation policy for critical incidents", "rules": [ { "condition": "severity == 'critical'", "timeout": 300, "action": "escalate_to_manager", "notifications": ["pagerduty", "slack"] }, { "condition": "age > 3600 AND status == 'open'", "timeout": 600, "action": "escalate_to_director", "notifications": ["email", "pagerduty"] } ] }Notification Channels

Configure notification channels:{ "email": { "enabled": true, "smtp_host": "smtp.gmail.com", "smtp_port": 587, "username": "alerts@nexus-support.com", "templates": { "incident_assigned": "templates/incident_assigned.html", "incident_escalated": "templates/incident_escalated.html" } }, "slack": { "enabled": true, "webhook_url": "https://hooks.slack.com/services/...", "channel": "#oncall", "username": "NEXUS On-Call Bot" }, "pagerduty": { "enabled": true, "integration_key": "your-pagerduty-key", "severity_mapping": { "critical": "critical", "high": "high", "medium": "warning", "low": "info" } } }Scheduling Rules

Configure scheduling behavior:{ "shift

Types": { "primary": { "duration": 8, "start_time": "09:00", "end_time": "17:00", "days": ["monday", "tuesday", "wednesday", "thursday", "friday"] }, "secondary": { "duration": 16, "start_time": "17:00", "end_time": "09:00", "days": ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"] } }, "rotation": { "type": "weekly", "auto_generate": true, "conflict_resolution": "auto" } }Integration

Alert Management

Integrate with alerting systems:// Create incident from alert const alert = { title: "High CPU Usage", description: "CPU usage exceeded 90%", severity: "high", source: "prometheus" };// Create incident const incident = await on

Call

Management.create

Incident(alert);ITSM Integration

Integrate with ITSM systems:// Sync incident with ITSM const itsm

Incident = { incident

Id: incident.id, title: incident.title, description: incident.description, severity: incident.severity, assigned

To: incident.assigned

To };// Send to ITSM await itsm

Client.create

Incident(itsm

Incident);Calendar Integration

Integrate with calendar systems:// Export schedule to calendar const schedule = await on

Call

Management.get

Schedule(); const calendar

Events = schedule

ToCalendar

Events(schedule);// Add to Google Calendar await calendar

Client.add

Events(calendar

Events);Best Practices

Incident Management Clear Classification: Use consistent severity levels Proper Assignment: Assign to appropriate skill sets Timely Acknowledgment: Acknowledge incidents promptly Documentation: Document resolution steps Post-Mortem: Conduct post-incident reviews

Scheduling Fair Rotation: Ensure equitable on-call distribution Skill Matching: Match incidents with appropriate skills Availability: Respect user preferences and availability Backup Coverage: Always have backup on-call personnel Holiday Coverage: Plan for holiday and vacation periods

Escalation Clear Policies: Define escalation triggers and timeouts Multiple Channels: Use multiple notification channels Context Preservation: Maintain incident context during escalation Follow-up: Ensure proper handoff during escalation Review: Regularly review escalation effectiveness

Troubleshooting

Common Issues

Incident Not Assigned: Check user availability Verify skill matching Review scheduling conflicts

Escalation Not Triggered: Check escalation rules Verify timeout configuration Review notification channels

Notifications Not Sent: Check channel configuration Verify delivery settings Review notification history

Health Checks

Monitor on-call system health: Check current on-call curl http://127.0.0.1:41663/api/oncall/current

Check schedule curl http://127.0.0.1:41663/api/oncall/schedule

Check active incidents curl http://127.0.0.1:41663/api/oncall/incidents?status=open

Debugging

Enable debug logging: const on

Call

Config = { debug: true, log

Level: 'debug', log

Notifications: true, log

Escalations: true };Metrics

On-Call Metrics oncall_incidents_total - Total incidents created oncall_incidents_by_severity - Incidents by severity level oncall_incidents_by_status - Incidents by status oncall_response_time_seconds - Time to acknowledge incidents

User Metrics oncall_user_shifts_total - Total shifts worked oncall_user_incidents_total - Incidents assigned to user oncall_user_resolution_time_seconds - Average resolution time oncall_user_satisfaction_score - User satisfaction scores

Escalation Metrics oncall_escalations_total - Total escalations oncall_escalations_by_reason - Escalations by reason oncall_escalation_time_seconds - Time to escalation oncall_notification_deliveries_total - Notifications delivered

Security

Data Protection Personal Information: Protect user contact information Incident Data: Secure sensitive incident details Access Control: Implement role-based access Audit Logging: Log all on-call operations

Privacy Contact Information: Limit access to contact details Incident Details: Redact sensitive information User Preferences: Protect user privacy settings Communication Logs: Secure communication logs

Future Enhancements

Planned Features AI-Powered Assignment: Machine learning for incident assignment Predictive Scheduling: Predict optimal scheduling patterns Mobile App: Native mobile application Voice Notifications: Voice call notifications Integration Hub: Centralized integration platform

Scalability Multi-Region: Global on-call coordination Team Management: Multi-team on-call coordination Advanced Analytics: Advanced on-call analytics Automation: Increased automation capabilities