Service Level Objectives: Webdelo's Approach to Production Support

When Webdelo takes on a project for ongoing support, the first step is defining Service Level Objectives (SLOs) — measurable targets for service quality that we commit to maintaining. Specific SLO values are always determined after a project audit and formalized in the Statement of Work (SOW). The standard support package includes multi-layer monitoring, alerting, an S1–S4 incident response model, and regular reporting. We do not offer one-size-fits-all uptime numbers. Instead, we define realistic targets and take ownership of meeting them.


Terminology

| Term | Definition |
| --- | --- |
| SLI (Service Level Indicator) | A specific measurable metric of service quality: availability, latency, error rate |
| SLO (Service Level Objective) | A target value or range for an SLI that Webdelo commits to maintaining |
| SLA (Service Level Agreement) | The contractual agreement between Webdelo and the client formalizing SLOs and consequences of violations |
| SOW (Statement of Work) | The technical specification document where concrete SLO values are fixed after the project audit |
| Error budget | The allowable amount of service degradation within a measurement period; exhaustion signals a shift from feature delivery to reliability work |

Scope

In scope

Web applications and APIs with defined service tiers
Microservice architectures and distributed systems
High-load analytics platforms and data processing pipelines
Financial and trading systems, internal fintech/banking tooling
Marketing automation and ERP modules
Infrastructure layer: servers, containers, databases, network services

Out of scope

Third-party services and APIs outside the client's or Webdelo's control (payment gateways, external providers)
Planned maintenance windows agreed in advance
Degradation caused by client-side changes outside the agreed change management process
Force majeure events beyond reasonable technical control

What Drives SLO Parameters

Webdelo's approach recognizes that SLO targets are shaped by three factors.
Architectural complexity
A single-database monolith and a twenty-service distributed system require fundamentally different monitoring approaches. More complex architectures require more observation points and broader on-call coverage.
Service tier
Not all system components carry equal business weight. We classify services by tier based on their impact on the availability of the client's core product. A payment gateway and an internal report-generation service warrant very different response requirements. The tier determines monitoring priority, response times, and on-call team composition.
Support budget
Budget determines how many specialists can be allocated, their working mode (business hours, extended hours, or 24/7 coverage), and the depth of business-metric monitoring.

When SLOs Are Established

SLOs are defined at one of two stages.
New service builds
Monitoring and metrics are built into the architecture from the start. By launch time, full observability coverage is already in place — the ideal scenario.
Onboarding to an existing project
Webdelo conducts an infrastructure audit, establishes service tiers, identifies monitoring gaps, and builds observability incrementally. Concrete SLO values are fixed in the SOW following the audit.
In both cases, error budgets are defined for key metrics to guide prioritization decisions between feature development and stability work.

SLIs: What We Measure

| SLI Category | What Is Measured | Example Metrics |
| --- | --- | --- |
| Availability | Real-time service health | Healthcheck status, per-service uptime |
| Errors | Error volume and trends by service | HTTP 5xx rate, error breakdown by type and severity |
| Latency | Request response times | p50, p95, p99 latency on key endpoints |
| Resources | Server infrastructure load | CPU, RAM, disk I/O, network throughput |
| Databases | DB health and performance | Query latency, slow query log, CPU/memory load |
| Network endpoints | Availability of external and internal dependencies | Third-party API reachability, internal microservice availability |
| Business metrics | Execution time of business operations and workflows | Order status transition times, product-level SLA metrics |
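To make the arithmetic behind these SLIs concrete, here is a minimal Python sketch (illustrative only, not Webdelo's production tooling) that computes an availability ratio and a nearest-rank latency percentile from raw measurements:

```python
from math import ceil

def availability(successes: int, total: int) -> float:
    """Availability SLI: fraction of healthcheck probes (or requests) that succeeded."""
    return successes / total if total else 1.0

def percentile(latencies_ms: list[float], p: float) -> float:
    """Latency SLI via the nearest-rank method, e.g. p=95 for p95."""
    ordered = sorted(latencies_ms)
    rank = ceil(p / 100 * len(ordered))  # position of the p-th percentile
    return ordered[rank - 1]

# Ten sample request latencies in milliseconds (hypothetical data)
samples = [12, 15, 14, 200, 13, 16, 11, 18, 950, 17]
print(availability(successes=998, total=1000))  # 0.998 → 99.8% availability
print(percentile(samples, 95))                  # 950 (the single slow outlier)
print(percentile(samples, 50))                  # 15 (median)
```

Note how p95 surfaces the 950 ms outlier that a simple average would dilute, which is why the table above lists percentiles rather than mean latency.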

A real-world example

On one project, Webdelo configured monitoring for order status transition times. Metrics revealed that a subset of orders was getting stuck at specific processing stages. Investigation uncovered a compounded failure: misfires in the support confirmation flow, combined with latency spikes in the external delivery API, caused stalled orders to accumulate, which overloaded the database and degraded the whole system. Each component appeared healthy in isolation; the cascade was only visible through business-level metrics.

Standard SLO Package Components

Every Webdelo support engagement includes five core components.
1. Monitoring infrastructure
Primary stack: Grafana + Prometheus (or equivalent tools agreed with the client). Coverage spans all SLI categories: availability, errors, latency, resources, databases, network dependencies, and business processes.
2. Alerting system
Alerting is built on VictoriaMetrics (or equivalent). Every alert is tied to a specific metric threshold and severity level. Notifications are delivered to the team's messaging channel (Slack, Telegram, or as agreed). On-call engineers receive only actionable signals — low-priority noise is filtered out.
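As an illustration of how such an alert might be wired up, here is a hypothetical rule in the Prometheus/vmalert rule format that VictoriaMetrics supports. The metric name, threshold, and severity label are example values, not project-specific commitments:

```yaml
# Hypothetical alerting rule; thresholds and metric names are illustrative.
groups:
  - name: slo-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: S2
        annotations:
          summary: "5xx error rate above 1% for 5 minutes"
```

The `for: 5m` clause is what keeps transient spikes from paging anyone, which is how low-priority noise gets filtered before it reaches the on-call channel.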
3. Runbooks and escalation paths
For each service we maintain a responsibility map: who gets paged for which incident, escalation order, and initial response steps to execute before a domain specialist joins. This eliminates the "who owns this?" delay at the moment an incident fires.
4. Incident management team

The incident team operates across three levels:

  • Incident management — specialists with full project architecture knowledge who open incidents, create tickets in the project management system (Jira, ClickUp, or the client's preferred tool), and assign ownership.
  • On-call DevOps/SysOps — infrastructure engineers for diagnosing issues at the server, container, network, and orchestration level.
  • On-call developers — service-specific engineers engaged when an issue is localized to application code or business logic.
5. User-reported issue channel
We establish a channel for client-side employees to report observed anomalies. End users often detect issues before automated alerts fire. All reports are consolidated, triaged, and routed into the standard incident workflow.

Incident Severity Model

| Severity | Level | Description | Time to Acknowledge | Target Resolution Time |
| --- | --- | --- | --- | --- |
| S1 | Critical | Complete unavailability of a key service or product; direct financial impact | 15 minutes | 4 hours |
| S2 | High | Partial degradation of a key service; significant user experience impact | 30 minutes | 8 hours |
| S3 | Medium | Degradation of a non-critical service; core product remains operational | 4 hours | 24 hours |
| S4 | Low | Minor issues with no operational impact; monitoring improvement tasks | Next business day | As agreed |
Specific response times depend on service tier, on-call mode, and support budget. Final parameters are fixed in the SOW.
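A severity model like this is easy to encode so that tooling (paging, ticket creation) can look up response targets mechanically. The sketch below mirrors the example table; real values are fixed per project in the SOW and may differ by service tier:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityPolicy:
    acknowledge: str  # time to acknowledge the incident
    resolve: str      # target resolution time

# Illustrative defaults taken from the S1–S4 table; not contractual values.
SEVERITY_POLICIES = {
    "S1": SeverityPolicy("15 minutes", "4 hours"),
    "S2": SeverityPolicy("30 minutes", "8 hours"),
    "S3": SeverityPolicy("4 hours", "24 hours"),
    "S4": SeverityPolicy("next business day", "as agreed"),
}

def policy_for(severity: str) -> SeverityPolicy:
    """Look up response targets for a given severity level."""
    return SEVERITY_POLICIES[severity]

print(policy_for("S1").acknowledge)  # 15 minutes
```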

Error Budget Policy

The error budget is the allowable amount of service degradation within a measurement period (typically monthly or quarterly).
Budget healthy (remaining budget above warning threshold)
The team maintains its standard delivery pace — feature development and infrastructure changes proceed normally.
Budget approaching exhaustion (remaining budget below warning threshold)
Priorities shift toward stability. New feature work pauses until the budget recovers or the SLO is renegotiated with the client.
Budget exhausted
Feature work stops entirely. The team focuses exclusively on reliability restoration. A joint review is conducted with the client to analyze root causes and update the plan.
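The arithmetic behind these states is straightforward. A minimal sketch, assuming an availability SLO of 99.9% over a 30-day window and a 75% warning threshold (all example values, not a Webdelo commitment):

```python
def error_budget_minutes(slo: float, window_minutes: int) -> float:
    """Total allowed downtime in the window for a given availability SLO."""
    return (1.0 - slo) * window_minutes

def budget_state(downtime_minutes: float, slo: float,
                 window_minutes: int, warning_fraction: float = 0.75) -> str:
    """Classify the error budget into the three policy states above."""
    budget = error_budget_minutes(slo, window_minutes)
    if downtime_minutes >= budget:
        return "exhausted"   # feature work stops; reliability focus
    if downtime_minutes >= warning_fraction * budget:
        return "warning"     # priorities shift toward stability
    return "healthy"         # standard delivery pace

MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day window
print(error_budget_minutes(0.999, MONTH))  # ≈ 43.2 minutes of allowed downtime
print(budget_state(10, 0.999, MONTH))      # healthy
print(budget_state(40, 0.999, MONTH))      # warning
print(budget_state(45, 0.999, MONTH))      # exhausted
```

The takeaway: a 99.9% monthly SLO leaves roughly 43 minutes of tolerable downtime, so a single unresolved S1 incident can consume most of the budget for the period.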

Reporting and Service Reviews

Periodic report contents

| Section | Contents |
| --- | --- |
| SLO status | Actual SLI values for the period vs. SLO targets |
| Error budget | Remaining budget per service, consumption trend |
| Incidents | Incident list by S1–S4 severity, MTTR, root causes |
| Trends | Performance and reliability changes vs. prior period |
| Recommendations | Proposals for monitoring, architecture, or process improvements |

Post-incident process

Every incident generates a task documenting root causes, affected systems, and remediation steps taken. Related bug-fix and logic-correction tickets are linked to this task. The resulting task cluster forms the foundation for postmortem review and longitudinal team performance analysis.

Cadence

Standard reporting cadence is monthly. Weekly or quarterly formats are available by agreement.

What We Do Not Promise

Clear capability boundaries are part of how Webdelo builds trust with clients.
No universal SLO numbers
We do not publish or commit to "99.99% for everything" before an audit. Targets are always project-specific.
Third-party dependencies are outside our control
We monitor external APIs and services but cannot guarantee their availability.
SLO is not SLA
SLOs are operational targets, not legally binding commitments. Legal consequences for violations are defined in a separate SLA.
24/7 on-call is not the default
Round-the-clock coverage depends on budget and service tier. Standard packages may cover extended business hours only.
Zero incidents are not guaranteed
We work to prevent incidents and minimize their impact. We do not promise they will not occur.
Business-metric monitoring requires explicit agreement
Monitoring business processes requires a separate scope agreement covering which metrics to track and data access arrangements.

Glossary

* API (Application Programming Interface) — a set of rules that allows one software system to communicate with another. For example, connecting a payment system to a website or calling an external delivery provider happens through an API.
* Capability boundaries — an honest description of what a team can and cannot guarantee within a service-level agreement; defines the limits of responsibility.
* CPU (Central Processing Unit) — the primary computational component of a server. High CPU load is one of the key signals of performance degradation.
* DevOps (Development + Operations) — a practice and culture that unifies software development and IT operations. A DevOps engineer is responsible for CI/CD pipelines, infrastructure provisioning, containerization, and deployment automation.
* Disk I/O (Disk Input/Output) — read and write operations on a storage device. High disk I/O can be the root cause of database slowdowns or service-wide degradation.
* Endpoint — a specific URL or network address at which a service or API accepts requests. Monitoring endpoints allows tracking the availability of each individual interface.
* ERP (Enterprise Resource Planning) — integrated software that combines finance, warehouse, production, HR, and other business processes into a single platform.
* Error budget — the allowable amount of service degradation within a measurement period (typically monthly or quarterly). When the budget is exhausted, feature work stops and the team shifts focus to reliability restoration.
* Fintech (Financial Technology) — the sector that applies software solutions to deliver financial services: payments, lending, trading, and insurance.
* Grafana — an open-source platform for metrics visualization and monitoring dashboards. Typically used together with Prometheus or VictoriaMetrics to display system state in real time.
* Healthcheck — a periodic automated probe that checks whether a service is alive or unreachable. It is the baseline mechanism for detecting service failures.
* HTTP 5xx — the class of HTTP response codes (500–599) indicating server-side errors: internal server error (500), bad gateway (502), service unavailable (503), and others. A rising 5xx rate is a direct signal of service degradation.
* Jira / ClickUp — project and issue tracking systems used to create incident tickets, assign ownership, and manage remediation tasks during the incident response process.
* Latency — response delay; the time between a client sending a request and receiving a response from the server. Measured in milliseconds. High latency degrades user experience even when a service is technically operational.
* MTTR (Mean Time To Resolve) — the average time to resolve an incident from the moment it is detected to full service restoration. One of the key performance indicators for a support team.
* On-call — a duty rotation in which an engineer is designated as the primary responder during a given period and is expected to acknowledge and begin addressing incidents within the agreed response time.
* p50 / p95 / p99 (latency percentiles) — statistical measures of response-time distribution. p95 means 95% of requests are handled faster than the stated value; p99 covers 99% of requests. Percentiles reflect real user experience more accurately than averages.
* Postmortem — a structured review conducted after an incident is resolved: root cause analysis, event timeline, list of affected systems, and concrete measures to prevent recurrence. The foundation for systematic reliability improvement.
* Prometheus — a metrics collection and storage system with support for flexible PromQL queries and a built-in alerting mechanism. The de facto standard for monitoring cloud-native and microservice applications.
* RAM (Random Access Memory) — fast temporary storage for running processes and applications. Exhausting RAM leads to performance degradation or a hard service crash.
* Runbook — a documented set of step-by-step procedures that an on-call engineer follows when a specific alert fires. Runbooks ensure consistent, fast response without depending on individual knowledge.
* S1 / S2 / S3 / S4 (Severity 1–4) — incident severity levels ranging from critical (S1: complete unavailability, direct financial impact) to low (S4: minor defects with no operational effect). Severity determines response priority and escalation path.
* Scope — the defined boundaries within which agreements, obligations, or monitoring coverage apply. A clear scope prevents disputes about responsibility.
* Service review — a regular meeting between the support team and the client to analyze period metrics, review incidents, and agree on an improvement plan. Standard cadence is monthly.
* Severity — the level of criticality of an incident; determines how severely service availability is impacted and how quickly the team must respond.
* SLA (Service Level Agreement) — a contractual document formalizing SLOs and specifying the legal consequences of violations. SLA is the binding agreement; SLO is the operational target.
* SLI (Service Level Indicator) — a specific measurable metric of service quality: availability, latency, error rate. SLIs are the raw measurements against which SLOs are evaluated.
* SLO (Service Level Objective) — a target value or range for an SLI that the team commits to maintaining. SLOs are defined per service and formalized in the SOW after a project audit.
* Slow query log — a database log that automatically records queries whose execution time exceeds a defined threshold. A key diagnostic tool for identifying database performance issues.
* SOW (Statement of Work) — the technical specification document that fixes the scope, timelines, and concrete conditions of an engagement, including specific SLO values agreed after the project audit.
* SysOps (Systems Operations) — a specialist responsible for server infrastructure, networks, storage, and container orchestration. Engaged during an incident when the problem is localized to the infrastructure layer.
* Trading system — a software platform for automated execution of transactions on financial markets. Belongs to the class of high-criticality systems with strict latency and availability requirements.
* Uptime — the period of continuous service operation without failures. Usually expressed as a percentage over a measurement period: 99.9% uptime means no more than ~8.7 hours of downtime per year.
* VictoriaMetrics — a high-performance time-series database used for long-term metrics storage and alerting. Compatible with the Prometheus format and consumes fewer resources under high load.

Conclusion

Webdelo's SLO approach is an engineering practice, not a sales pitch. We define what to measure, configure the tooling, staff the team, and agree on realistic targets with each client. Concrete SLO values are always established in the SOW following a project audit. This lets us take clear ownership of outcomes — and lets clients know exactly what is being guaranteed.
To discuss production support for your project, contact us.