Service Level Objectives: Webdelo's Approach to Production Support

When Webdelo takes on a project for ongoing support, the first step is defining Service Level Objectives (SLOs) — measurable targets for service quality that we commit to maintaining. Specific SLO values are always determined after a project audit and formalized in the Statement of Work (SOW). The standard support package includes multi-layer monitoring, alerting, an S1–S4 incident response model, and regular reporting. We do not offer one-size-fits-all uptime numbers. Instead, we define realistic targets and take ownership of meeting them.


Terminology

| Term | Definition |
| --- | --- |
| SLI (Service Level Indicator) | A specific measurable metric of service quality: availability, latency, error rate |
| SLO (Service Level Objective) | A target value or range for an SLI that Webdelo commits to maintaining |
| SLA (Service Level Agreement) | The contractual agreement between Webdelo and the client formalizing SLOs and consequences of violations |
| SOW (Statement of Work) | The technical specification document where concrete SLO values are fixed after the project audit |
| Error budget | The allowable amount of service degradation within a measurement period; exhaustion signals a shift from feature delivery to reliability work |

Scope

In scope

Web applications and APIs with defined service tiers
Microservice architectures and distributed systems
High-load analytics platforms and data processing pipelines
Financial and trading systems, internal fintech/banking tooling
Marketing automation and ERP modules
Infrastructure layer: servers, containers, databases, network services

Out of scope

Third-party services and APIs outside the client's or Webdelo's control (payment gateways, external providers)
Planned maintenance windows agreed in advance
Degradation caused by client-side changes outside the agreed change management process
Force majeure events beyond reasonable technical control

What Drives SLO Parameters

Webdelo's approach recognizes that SLO targets are shaped by three factors.
Architectural complexity
A single-database monolith and a twenty-service distributed system require fundamentally different monitoring approaches. More complex architectures require more observation points and broader on-call coverage.
Service tier
Not all system components carry equal business weight. We classify services by tier based on their impact on the availability of the client's core product. A payment gateway and an internal report-generation service warrant very different response requirements. The tier determines monitoring priority, response times, and on-call team composition.
Support budget
Budget determines how many specialists can be allocated, their working mode (business hours, extended hours, or 24/7 coverage), and the depth of business-metric monitoring.

When SLOs Are Established

SLOs are defined at one of two stages.
New service builds
Monitoring and metrics are built into the architecture from the start. By launch time, full observability coverage is already in place — the ideal scenario.
Onboarding to an existing project
Webdelo conducts an infrastructure audit, establishes service tiers, identifies monitoring gaps, and builds observability incrementally. Concrete SLO values are fixed in the SOW following the audit.
In both cases, error budgets are defined for key metrics to guide prioritization decisions between feature development and stability work.

SLIs: What We Measure

| SLI Category | What Is Measured | Example Metrics |
| --- | --- | --- |
| Availability | Real-time service health | Healthcheck status, per-service uptime |
| Errors | Error volume and trends by service | HTTP 5xx rate, error breakdown by type and severity |
| Latency | Request response times | p50, p95, p99 latency on key endpoints |
| Resources | Server infrastructure load | CPU, RAM, disk I/O, network throughput |
| Databases | DB health and performance | Query latency, slow query log, CPU/memory load |
| Network endpoints | Availability of external and internal dependencies | Third-party API reachability, internal microservice availability |
| Business metrics | Execution time of business operations and workflows | Order status transition times, product-level SLA metrics |
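To make the arithmetic behind these SLIs concrete, here is a minimal Python sketch (illustrative only, not Webdelo's production tooling) that computes an availability ratio and a nearest-rank latency percentile from raw measurements:

```python
from math import ceil

def availability(successes: int, total: int) -> float:
    """Availability SLI: fraction of healthcheck probes (or requests) that succeeded."""
    return successes / total if total else 1.0

def percentile(latencies_ms: list[float], p: float) -> float:
    """Latency SLI via the nearest-rank method, e.g. p=95 for p95."""
    ordered = sorted(latencies_ms)
    rank = ceil(p / 100 * len(ordered))  # position of the p-th percentile
    return ordered[rank - 1]

# Ten sample request latencies in milliseconds (hypothetical data)
samples = [12, 15, 14, 200, 13, 16, 11, 18, 950, 17]
print(availability(successes=998, total=1000))  # 0.998 → 99.8% availability
print(percentile(samples, 95))                  # 950 (the single slow outlier)
print(percentile(samples, 50))                  # 15 (median)
```

Note how p95 surfaces the 950 ms outlier that a simple average would dilute, which is why the table above lists percentiles rather than mean latency.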

A real-world example

On one project, Webdelo configured monitoring for order status transition times. Metrics revealed that a subset of orders was getting stuck at specific processing stages. Investigation uncovered a compounded failure: misfires in the support confirmation flow, combined with latency spikes in the external delivery API, caused stalled orders to accumulate, which overloaded the database and degraded the whole system. Each component appeared healthy in isolation; the cascade was only visible through business-level metrics.

Standard SLO Package Components

Every Webdelo support engagement includes five core components.
1. Monitoring infrastructure
Primary stack: Grafana + Prometheus (or equivalent tools agreed with the client). Coverage spans all SLI categories: availability, errors, latency, resources, databases, network dependencies, and business processes.
2. Alerting system
Alerting is built on VictoriaMetrics (or equivalent). Every alert is tied to a specific metric threshold and severity level. Notifications are delivered to the team's messaging channel (Slack, Telegram, or as agreed). On-call engineers receive only actionable signals — low-priority noise is filtered out.
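As an illustration of how such an alert might be wired up, here is a hypothetical rule in the Prometheus/vmalert rule format that VictoriaMetrics supports. The metric name, threshold, and severity label are example values, not project-specific commitments:

```yaml
# Hypothetical alerting rule; thresholds and metric names are illustrative.
groups:
  - name: slo-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: S2
        annotations:
          summary: "5xx error rate above 1% for 5 minutes"
```

The `for: 5m` clause is what keeps transient spikes from paging anyone, which is how low-priority noise gets filtered before it reaches the on-call channel.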
3. Runbooks and escalation paths
For each service we maintain a responsibility map: who gets paged for which incident, escalation order, and initial response steps to execute before a domain specialist joins. This eliminates the "who owns this?" delay at the moment an incident fires.
4. Incident management team

The incident team operates across three levels:

  • Incident management — specialists with full project architecture knowledge who open incidents, create tickets in the project management system (Jira, ClickUp, or the client's preferred tool), and assign ownership.
  • On-call DevOps/SysOps — infrastructure engineers for diagnosing issues at the server, container, network, and orchestration level.
  • On-call developers — service-specific engineers engaged when an issue is localized to application code or business logic.
5. User-reported issue channel
We establish a channel for client-side employees to report observed anomalies. End users often detect issues before automated alerts fire. All reports are consolidated, triaged, and routed into the standard incident workflow.

Incident Severity Model

| Severity | Level | Description | Time to Acknowledge | Target Resolution Time |
| --- | --- | --- | --- | --- |
| S1 | Critical | Complete unavailability of a key service or product; direct financial impact | 15 minutes | 4 hours |
| S2 | High | Partial degradation of a key service; significant user experience impact | 30 minutes | 8 hours |
| S3 | Medium | Degradation of a non-critical service; core product remains operational | 4 hours | 24 hours |
| S4 | Low | Minor issues with no operational impact; monitoring improvement tasks | Next business day | As agreed |
Specific response times depend on service tier, on-call mode, and support budget. Final parameters are fixed in the SOW.
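A severity model like this is easy to encode so that tooling (paging, ticket creation) can look up response targets mechanically. The sketch below mirrors the example table; real values are fixed per project in the SOW and may differ by service tier:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityPolicy:
    acknowledge: str  # time to acknowledge the incident
    resolve: str      # target resolution time

# Illustrative defaults taken from the S1–S4 table; not contractual values.
SEVERITY_POLICIES = {
    "S1": SeverityPolicy("15 minutes", "4 hours"),
    "S2": SeverityPolicy("30 minutes", "8 hours"),
    "S3": SeverityPolicy("4 hours", "24 hours"),
    "S4": SeverityPolicy("next business day", "as agreed"),
}

def policy_for(severity: str) -> SeverityPolicy:
    """Look up response targets for a given severity level."""
    return SEVERITY_POLICIES[severity]

print(policy_for("S1").acknowledge)  # 15 minutes
```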

Error Budget Policy

The error budget is the allowable amount of service degradation within a measurement period (typically monthly or quarterly).
Budget healthy (remaining budget above warning threshold)
The team maintains its standard delivery pace — feature development and infrastructure changes proceed normally.
Budget approaching exhaustion (remaining budget below warning threshold)
Priorities shift toward stability. New feature work pauses until the budget recovers or the SLO is renegotiated with the client.
Budget exhausted
Feature work stops entirely. The team focuses exclusively on reliability restoration. A joint review is conducted with the client to analyze root causes and update the plan.
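The arithmetic behind these states is straightforward. A minimal sketch, assuming an availability SLO of 99.9% over a 30-day window and a 75% warning threshold (all example values, not a Webdelo commitment):

```python
def error_budget_minutes(slo: float, window_minutes: int) -> float:
    """Total allowed downtime in the window for a given availability SLO."""
    return (1.0 - slo) * window_minutes

def budget_state(downtime_minutes: float, slo: float,
                 window_minutes: int, warning_fraction: float = 0.75) -> str:
    """Classify the error budget into the three policy states above."""
    budget = error_budget_minutes(slo, window_minutes)
    if downtime_minutes >= budget:
        return "exhausted"   # feature work stops; reliability focus
    if downtime_minutes >= warning_fraction * budget:
        return "warning"     # priorities shift toward stability
    return "healthy"         # standard delivery pace

MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day window
print(error_budget_minutes(0.999, MONTH))  # ≈ 43.2 minutes of allowed downtime
print(budget_state(10, 0.999, MONTH))      # healthy
print(budget_state(40, 0.999, MONTH))      # warning
print(budget_state(45, 0.999, MONTH))      # exhausted
```

The takeaway: a 99.9% monthly SLO leaves roughly 43 minutes of tolerable downtime, so a single unresolved S1 incident can consume most of the budget for the period.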

Reporting and Service Reviews

Periodic report contents

| Section | Contents |
| --- | --- |
| SLO status | Actual SLI values for the period vs. SLO targets |
| Error budget | Remaining budget per service, consumption trend |
| Incidents | Incident list by S1–S4 severity, MTTR, root causes |
| Trends | Performance and reliability changes vs. prior period |
| Recommendations | Proposals for monitoring, architecture, or process improvements |

Post-incident process

Every incident generates a task documenting root causes, affected systems, and remediation steps taken. Related bug-fix and logic-correction tickets are linked to this task. The resulting task cluster forms the foundation for postmortem review and longitudinal team performance analysis.

Cadence

Standard reporting cadence is monthly. Weekly or quarterly formats are available by agreement.

What We Do Not Promise

Clear capability boundaries are part of how Webdelo builds trust with clients.
No universal SLO numbers
We do not publish or commit to "99.99% for everything" before an audit. Targets are always project-specific.
Third-party dependencies are outside our control
We monitor external APIs and services but cannot guarantee their availability.
SLO is not SLA
SLOs are operational targets, not legally binding commitments. Legal consequences for violations are defined in a separate SLA.
24/7 on-call is not the default
Round-the-clock coverage depends on budget and service tier. Standard packages may cover extended business hours only.
Zero incidents are not guaranteed
We work to prevent incidents and minimize their impact. We do not promise they will not occur.
Business-metric monitoring requires explicit agreement
Monitoring business processes requires a separate scope agreement covering which metrics to track and data access arrangements.

Glossary

* API (Application Programming Interface) — a set of rules that allows one software system to communicate with another. For example, connecting a payment system to a website or calling an external delivery provider happens through an API.
* Capability boundaries — an honest description of what a team can and cannot guarantee within a service-level agreement; defines the limits of responsibility.
* CPU (Central Processing Unit) — the primary computational component of a server. High CPU load is one of the key signals of performance degradation.
* DevOps (Development + Operations) — a practice and culture that unifies software development and IT operations. A DevOps engineer is responsible for CI/CD pipelines, infrastructure provisioning, containerization, and deployment automation.
* Disk I/O (Disk Input/Output) — read and write operations on a storage device. High disk I/O can be the root cause of database slowdowns or service-wide degradation.
* Endpoint — a specific URL or network address at which a service or API accepts requests. Monitoring endpoints allows tracking the availability of each individual interface.
* ERP (Enterprise Resource Planning) — integrated software that combines finance, warehouse, production, HR, and other business processes into a single platform.
* Error budget — the allowable amount of service degradation within a measurement period (typically monthly or quarterly). When the budget is exhausted, feature work stops and the team shifts focus to reliability restoration.
* Fintech (Financial Technology) — the sector that applies software solutions to deliver financial services: payments, lending, trading, and insurance.
* Grafana — an open-source platform for metrics visualization and monitoring dashboards. Typically used together with Prometheus or VictoriaMetrics to display system state in real time.
* Healthcheck — a periodic automated probe that checks whether a service is alive or unreachable. It is the baseline mechanism for detecting service failures.
* HTTP 5xx — the class of HTTP response codes (500–599) indicating server-side errors: internal server error (500), bad gateway (502), service unavailable (503), and others. A rising 5xx rate is a direct signal of service degradation.
* Jira / ClickUp — project and issue tracking systems used to create incident tickets, assign ownership, and manage remediation tasks during the incident response process.
* Latency — response delay; the time between a client sending a request and receiving a response from the server. Measured in milliseconds. High latency degrades user experience even when a service is technically operational.
* MTTR (Mean Time To Resolve) — the average time to resolve an incident from the moment it is detected to full service restoration. One of the key performance indicators for a support team.
* On-call — a duty rotation in which an engineer is designated as the primary responder during a given period and is expected to acknowledge and begin addressing incidents within the agreed response time.
* p50 / p95 / p99 (latency percentiles) — statistical measures of response-time distribution. p95 means 95% of requests are handled faster than the stated value; p99 covers 99% of requests. Percentiles reflect real user experience more accurately than averages.
* Postmortem — a structured review conducted after an incident is resolved: root cause analysis, event timeline, list of affected systems, and concrete measures to prevent recurrence. The foundation for systematic reliability improvement.
* Prometheus — a metrics collection and storage system with support for flexible PromQL queries and a built-in alerting mechanism. The de facto standard for monitoring cloud-native and microservice applications.
* RAM (Random Access Memory) — fast temporary storage for running processes and applications. Exhausting RAM leads to performance degradation or a hard service crash.
* Runbook — a documented set of step-by-step procedures that an on-call engineer follows when a specific alert fires. Runbooks ensure consistent, fast response without depending on individual knowledge.
* S1 / S2 / S3 / S4 (Severity 1–4) — incident severity levels ranging from critical (S1: complete unavailability, direct financial impact) to low (S4: minor defects with no operational effect). Severity determines response priority and escalation path.
* Scope — the defined boundaries within which agreements, obligations, or monitoring coverage apply. A clear scope prevents disputes about responsibility.
* Service review — a regular meeting between the support team and the client to analyze period metrics, review incidents, and agree on an improvement plan. Standard cadence is monthly.
* Severity — the level of criticality of an incident; determines how severely service availability is impacted and how quickly the team must respond.
* SLA (Service Level Agreement) — a contractual document formalizing SLOs and specifying the legal consequences of violations. SLA is the binding agreement; SLO is the operational target.
* SLI (Service Level Indicator) — a specific measurable metric of service quality: availability, latency, error rate. SLIs are the raw measurements against which SLOs are evaluated.
* SLO (Service Level Objective) — a target value or range for an SLI that the team commits to maintaining. SLOs are defined per service and formalized in the SOW after a project audit.
* Slow query log — a database log that automatically records queries whose execution time exceeds a defined threshold. A key diagnostic tool for identifying database performance issues.
* SOW (Statement of Work) — the technical specification document that fixes the scope, timelines, and concrete conditions of an engagement, including specific SLO values agreed after the project audit.
* SysOps (Systems Operations) — a specialist responsible for server infrastructure, networks, storage, and container orchestration. Engaged during an incident when the problem is localized to the infrastructure layer.
* Trading system — a software platform for automated execution of transactions on financial markets. Belongs to the class of high-criticality systems with strict latency and availability requirements.
* Uptime — the period of continuous service operation without failures. Usually expressed as a percentage over a measurement period: 99.9% uptime means no more than ~8.7 hours of downtime per year.
* VictoriaMetrics — a high-performance time-series database used for long-term metrics storage and alerting. Compatible with the Prometheus format and consumes fewer resources under high load.

Conclusion

Webdelo's SLO approach is an engineering practice, not a sales pitch. We define what to measure, configure the tooling, staff the team, and agree on realistic targets with each client. Concrete SLO values are always established in the SOW following a project audit. This lets us take clear ownership of outcomes — and lets clients know exactly what is being guaranteed.
To discuss production support for your project, contact us.