# Observability

| Key | Value |
|-----|-------|
| Status | Active |
| Owner | QA Automation |
| Updated | 2026-03-26 |
| Scope | OpenSearch, Grafana, Prometheus, and Alertmanager — the telemetry and dashboarding stack |

Observability is what turns raw test results into something you can investigate, trend, and act on. The stack here combines structured log storage, metrics collection, dashboards, and alert routing so that operators do not have to piece together context by hand.

## Stack Components

| Component | Port | Purpose |
|-----------|------|---------|
| OpenSearch | 9200 | structured log storage, full-text search, aggregation queries |
| OpenSearch Dashboards | 5601 | log browser and ad-hoc query UI |
| Grafana | 3000 | primary dashboards, trends, investigation panels |
| Prometheus | 9090 | time-series metrics from the metrics exporter |
| Alertmanager | 9093 | alert routing rules, Slack delivery |

## Data Flow

```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor': '#4a90d9', 'primaryTextColor': '#fff', 'primaryBorderColor': '#2c6fad', 'lineColor': '#555', 'fontFamily': 'sans-serif'}}}%%
flowchart LR
    TESTS["Playwright tests"] --> EL["EventLogger\nsrc/core/event-logger.ts"]
    EL --> JSONL["Local JSONL logs\ntest-results/logs/"]
    EL --> OS["OpenSearch\ncncqa_tests-* / cncqa_events-*"]
    OS --> GR["Grafana dashboards"]
    TESTS --> PROM["Prometheus metrics\nmetrics-exporter"]
    PROM --> GR
    GR --> AM["Alertmanager\nalert routing"]
    AM --> SLACK["Slack\nalert channels"]
```

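Tests produce structured records through the EventLogger, which writes them both to local JSONL files under `test-results/logs/` and to OpenSearch for long-term querying. Grafana reads from OpenSearch and Prometheus to build dashboards. Prometheus collects numeric metrics from the metrics exporter in parallel with the log path. Alertmanager routes threshold-based alerts to Slack.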
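
The fan-out from the EventLogger to its two sinks can be pictured roughly as follows. This is a minimal sketch and not the code in `src/core/event-logger.ts`: the `TestRecord` shape, the helper names `toJsonlLine` and `toBulkBody`, and any concrete index name passed in are assumptions for illustration. Only the `_bulk` body layout (an action line followed by a document line, each newline-terminated) is standard OpenSearch behavior.

```typescript
// Hypothetical record shape; the real event schema is described in the logging docs.
interface TestRecord {
  timestamp: string; // ISO 8601
  site: string;
  suite: string;
  outcome: "passed" | "failed" | "skipped";
}

// One JSON document per line, as it would be appended to a local JSONL file.
function toJsonlLine(record: TestRecord): string {
  return JSON.stringify(record) + "\n";
}

// Body for OpenSearch `POST /_bulk`: an action line, then the document,
// each terminated by a newline (the final newline is required by the API).
function toBulkBody(index: string, records: TestRecord[]): string {
  return records
    .map(
      (r) =>
        JSON.stringify({ index: { _index: index } }) + "\n" + JSON.stringify(r) + "\n"
    )
    .join("");
}
```

Keeping both serializers pure makes it easy to check what would be written without touching the file system or the cluster.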

## OpenSearch Indices And Users

### Indices

| Index Pattern | Contents | Audience |
|---------------|----------|----------|
| `cncqa_tests-*` | test summaries, pass/fail records, screenshots | Grafana dashboards, reports |
| `cncqa_events-*` | detailed event stream (per-action records) | AI workflows, deep debugging |

### Access Users

| User | Access Level | Use Case |
|------|-------------|---------|
| `cnc_writer` | write + search via Grafana proxy UID | reporter write path, shared across CNC projects |
| `v1admin` | cluster admin via Grafana proxy ID 38 | ISM policies, index templates, retention setup, index deletion |

Both users are accessed via the Grafana proxy using `GRAFANA_SERVICE_ACCOUNT_TOKEN`. Direct access to port 9200 is not the primary path in production.
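
As an illustration of the proxy path, the sketch below builds a request against Grafana's data source proxy endpoint (`/api/datasources/proxy/:id/*`) with a bearer token. It assumes the id-based route is the one in use; Grafana also offers a UID-based variant under `/api/datasources/proxy/uid/:uid/*`. The helper name `buildProxyRequest` and the `ProxyRequest` type are hypothetical.

```typescript
// Describes an HTTP request without sending it, so the sketch has no side effects.
interface ProxyRequest {
  url: string;
  headers: { [key: string]: string };
}

// Builds a request that reaches OpenSearch through Grafana's data source proxy.
// `datasourceId` is the Grafana data source id; `path` is the OpenSearch path.
function buildProxyRequest(
  grafanaUrl: string,
  datasourceId: number,
  path: string,
  token: string
): ProxyRequest {
  const base = grafanaUrl.replace(/\/+$/, ""); // drop trailing slashes
  const suffix = path.replace(/^\/+/, ""); // drop leading slashes
  return {
    url: base + "/api/datasources/proxy/" + datasourceId + "/" + suffix,
    headers: {
      Authorization: "Bearer " + token,
      "Content-Type": "application/json",
    },
  };
}
```

For example, `buildProxyRequest(grafanaUrl, 38, "_cat/indices/cncqa_*", token)` would target the admin data source listed above, with `token` read from `GRAFANA_SERVICE_ACCOUNT_TOKEN`.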

## Starting And Stopping The Stack

| Command | What It Does |
|---------|-------------|
| `npm run observability:start` | start OpenSearch, Grafana, Prometheus, Alertmanager |
| `npm run observability:stop` | stop the stack |
| `npm run observability:status` | check whether each component is running |
| `npm run observability:logs` | tail combined stack logs |

The stack runs under Docker Compose, with the Compose definition in `observability/`. Configuration files live under `observability/prometheus/`, `observability/grafana/provisioning/`, and `observability/alertmanager/`.
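
When the status command is not enough, each component can be probed by hand. The sketch below lists the standard health endpoints of each service on the ports from the Stack Components table. What `npm run observability:status` actually checks is not documented here, so treat this as an independent aid. Plain `http` on the local stack and the helper name `healthUrls` are assumptions.

```typescript
interface HealthTarget {
  component: string;
  port: number;
  path: string;
}

// Standard health endpoints exposed by each service.
const HEALTH_TARGETS: HealthTarget[] = [
  { component: "OpenSearch", port: 9200, path: "/_cluster/health" },
  { component: "OpenSearch Dashboards", port: 5601, path: "/api/status" },
  { component: "Grafana", port: 3000, path: "/api/health" },
  { component: "Prometheus", port: 9090, path: "/-/healthy" },
  { component: "Alertmanager", port: 9093, path: "/-/healthy" },
];

// Returns one URL per component for the given host (assumes plain HTTP locally).
function healthUrls(host: string): { component: string; url: string }[] {
  return HEALTH_TARGETS.map((t) => ({
    component: t.component,
    url: "http://" + host + ":" + t.port + t.path,
  }));
}
```

Each URL from `healthUrls("localhost")` can be requested with `curl` or any HTTP client; a non-2xx response or a refused connection points at the component that failed to start.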

## Grafana

### Access

Grafana is available at `https://grafana.measure.aws.cnci.tech` in production (CI and scheduled runs) and at `localhost:3000` when the local stack is running.

### Dashboard Areas

| Dashboard Area | What It Shows |
|----------------|---------------|
| Status | current health summary, recent run outcomes, per-site and per-suite pass rates |
| Investigate | failure drill-down, error categories, selector failures, site-specific breakdowns |
| Trends | 14-day pass rate history, recurrence patterns, flaky test candidates |

### Dashboard-As-Code

Grafana dashboards are not hand-authored JSON. They are generated from TypeScript definitions using the Grafana Foundation SDK.

| Path | Purpose |
|------|---------|
| `observability/grafana/suite/` | TypeScript dashboard definitions |
| `observability/grafana/generated/` | generated JSON dashboards (output) |
| `observability/grafana/deploy.ts` | deploy script |
| `observability/grafana/generate.ts` | generation pipeline |
| `observability/grafana/validate.ts` | query validation |
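
The deploy step can be thought of as wrapping each generated JSON file in the payload that Grafana's dashboard API (`POST /api/dashboards/db`) accepts. The sketch below is not the contents of `deploy.ts`: the `buildDeployPayload` helper, its validation rule, and the choice of `overwrite: true` are assumptions made for illustration.

```typescript
// Minimal view of a generated dashboard; real dashboards carry many more fields.
interface GeneratedDashboard {
  uid: string;
  title: string;
  [key: string]: unknown;
}

// Request body for Grafana's `POST /api/dashboards/db`.
interface DeployPayload {
  dashboard: GeneratedDashboard;
  overwrite: boolean;
  message: string;
  folderUid?: string;
}

// Parses one generated JSON document and wraps it for deployment.
function buildDeployPayload(
  dashboardJson: string,
  message: string,
  folderUid?: string
): DeployPayload {
  const dashboard = JSON.parse(dashboardJson) as GeneratedDashboard;
  if (!dashboard.uid || !dashboard.title) {
    throw new Error("generated dashboard must have a uid and a title");
  }
  // overwrite=true lets a redeploy replace the existing dashboard with the same uid.
  const payload: DeployPayload = { dashboard: dashboard, overwrite: true, message: message };
  if (folderUid !== undefined) {
    payload.folderUid = folderUid;
  }
  return payload;
}
```

Requiring a stable `uid` in every generated dashboard is what makes repeated deploys idempotent: the same definition always updates the same dashboard instead of creating a copy.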

### Dashboard Deployment Commands

| Command | What It Does |
|---------|-------------|
| `npm run grafana:deploy` | generate and deploy dashboards to Grafana |
| `npm run grafana:validate-data` | validate that dashboard queries return data |
| `npm run monitor:grafana` | browser-based visual panel health check |

The `monitor:grafana` command opens Grafana in a real browser and checks each dashboard panel for empty states or error conditions. This catches query regressions that static validation misses.

## Retention And Index Management

| Command | What It Does |
|---------|-------------|
| `npm run os:setup-retention` | create ISM policies and index templates (needs `OPENSEARCH_URL` + `GRAFANA_TOKEN`) |
| `npm run os:stats` | show OpenSearch index statistics |
| `npm run os:failed` | query recent failures from OpenSearch |

ISM (Index State Management) policies handle automatic rollover and deletion of old indices. Creating them requires `v1admin` access. Run `npm run os:setup-retention` once during initial setup and again whenever retention rules change.
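
For orientation, an ISM policy body that deletes indices after a fixed age looks roughly like the one built below; it would be sent with `PUT _plugins/_ism/policies/<policy_id>`. This is a sketch, not the policy the setup script creates: the `buildRetentionPolicy` helper, the state names, the priority, and any retention period passed in are assumptions. Rollover is left out because it depends on alias settings that are not described on this page.

```typescript
interface IsmPolicyBody {
  policy: {
    description: string;
    default_state: string;
    states: {
      name: string;
      actions: object[];
      transitions: { state_name: string; conditions: { min_index_age: string } }[];
    }[];
    ism_template: { index_patterns: string[]; priority: number }[];
  };
}

// Builds a two-state policy: indices stay "hot" until they reach the
// retention age, then move to "delete", where the delete action removes them.
function buildRetentionPolicy(indexPattern: string, retentionDays: number): IsmPolicyBody {
  if (retentionDays <= 0 || Math.floor(retentionDays) !== retentionDays) {
    throw new Error("retentionDays must be a positive integer");
  }
  return {
    policy: {
      description: "Delete " + indexPattern + " indices after " + retentionDays + " days",
      default_state: "hot",
      states: [
        {
          name: "hot",
          actions: [],
          transitions: [
            { state_name: "delete", conditions: { min_index_age: retentionDays + "d" } },
          ],
        },
        { name: "delete", actions: [{ delete: {} }], transitions: [] },
      ],
      // ism_template attaches the policy to new indices matching the pattern.
      ism_template: [{ index_patterns: [indexPattern], priority: 100 }],
    },
  };
}
```

Note that `ism_template` only applies to indices created after the policy exists; indices that already exist need the policy attached separately.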

## Key Configuration

| Variable | Purpose |
|----------|---------|
| `OPENSEARCH_URL` | write endpoint; set to Grafana proxy URL in CI |
| `GRAFANA_URL` | Grafana base URL for links and queries |
| `GRAFANA_SERVICE_ACCOUNT_TOKEN` | authenticates both OpenSearch users via Grafana proxy |
| `PROMETHEUS_PUSHGATEWAY_URL` | metrics push target for CI runs |
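
A runner can fail fast when these variables are missing rather than silently writing nowhere. The sketch below is one way to do that. Which variables are mandatory is an assumption (the Pushgateway URL is treated as optional here), and the `loadObservabilityConfig` helper and `ObservabilityConfig` type are hypothetical.

```typescript
interface ObservabilityConfig {
  opensearchUrl: string;
  grafanaUrl: string;
  grafanaToken: string;
  pushgatewayUrl: string | undefined; // assumed optional outside CI
}

type Env = { [key: string]: string | undefined };

// Reads the variables from the table above and reports all missing ones at once.
function loadObservabilityConfig(env: Env): ObservabilityConfig {
  const required = ["OPENSEARCH_URL", "GRAFANA_URL", "GRAFANA_SERVICE_ACCOUNT_TOKEN"];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error("Missing environment variables: " + missing.join(", "));
  }
  return {
    opensearchUrl: env["OPENSEARCH_URL"] as string,
    grafanaUrl: env["GRAFANA_URL"] as string,
    grafanaToken: env["GRAFANA_SERVICE_ACCOUNT_TOKEN"] as string,
    pushgatewayUrl: env["PROMETHEUS_PUSHGATEWAY_URL"],
  };
}
```

In a Node.js runner this would be called as `loadObservabilityConfig(process.env)` at startup, so a misconfigured environment surfaces before any test runs.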

## Common Issues

| Issue | What To Check |
|-------|---------------|
| OpenSearch not starting locally | port 9200 conflict; check `npm run observability:status` |
| Grafana showing no data | verify `OPENSEARCH_URL` points to correct proxy; check index pattern matches `cncqa_tests-*` |
| field type mismatch in queries | old indices (pre-2025) may lack `.keyword` sub-fields; aggregations that rely on them will fail on those older indices |
| V1 vs beta data mixing | V1 records use `blesk.cz` site names; beta records use `blesk`; queries may need to handle both |
| events index empty | `cncqa_events-*` is created but the EventLogger write path to OpenSearch may not be reaching it in all environments; check `OPENSEARCH_URL` in the runner environment |
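
For the V1 versus beta naming issue, one option is to expand a site name into all of its known spellings and filter with an OpenSearch `terms` query. The sketch below assumes a caller-supplied field name (for example a hypothetical `site.keyword`) and an alias table that covers only the `blesk` / `blesk.cz` pair mentioned above; both the table and the helper names are illustrative.

```typescript
// Known spellings per site; only the pair named in the table above is listed.
const SITE_ALIASES: { [canonical: string]: string[] } = {
  blesk: ["blesk", "blesk.cz"],
};

// Returns every known spelling of a site, or the input itself if it is unknown.
function siteVariants(site: string): string[] {
  const canonicalNames = Object.keys(SITE_ALIASES);
  for (let i = 0; i < canonicalNames.length; i++) {
    const variants = SITE_ALIASES[canonicalNames[i]];
    if (variants.indexOf(site) !== -1) {
      return variants.slice();
    }
  }
  return [site];
}

// Builds a `terms` clause that matches both V1 and beta records for a site.
function buildSiteFilter(
  field: string,
  site: string
): { terms: { [field: string]: string[] } } {
  const terms: { [field: string]: string[] } = {};
  terms[field] = siteVariants(site);
  return { terms: terms };
}
```

The resulting clause can be placed in the `filter` array of a `bool` query. On older indices without `.keyword` sub-fields, the field name passed in has to match whatever mapping those indices actually have.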

## Related Pages

| Need | Go To |
|------|-------|
| logging model and event schema | [Logging System](./logging-system.md) |
| Slack, GitLab, and OpenSearch credentials | [Integrations](./integrations.md) |
| configuration and env vars | [Configuration Guide](./configuration.md) |
| dashboard queries and patterns | `.claude/docs/grafana-patterns.md` in the repo |
