Recipe authoring¶
A recipe is the unit of safe, repeatable remediation in OpenRemedy: an Ansible playbook bundled with metadata (slug, risk level, incident type, parameter schema) that the agent can propose and execute on your behalf.
This page is for operators who want to write or modify recipes for their tenant. The shipped catalogue covers the common cases — start there before authoring custom ones.
Recipe shape¶
Every recipe is a row in the recipes table. The fields that matter
to authors:
| Field | Type | Meaning |
|---|---|---|
| slug | unique string | Machine name. The agent calls execute_recipe(slug=...) with this. Stable across versions. |
| name | string | Human-readable name shown in the UI ("Safe Systemd Service Restart"). |
| description | text | What the recipe does, when to use it, when not to. Operator-facing. |
| incident_type | string | The Incident.incident_type this recipe applies to (service_down, disk_full, port_unavailable, custom, …). The agent filters by this when listing candidates. |
| risk_level | enum | One of none / low / medium / high / critical. Drives the trust × risk gate. |
| playbook_content | text | Inline playbook YAML. The worker runs this directly; no file on disk required. Preferred for tenant-authored recipes. |
| playbook_path | string | Path to a .yml playbook on disk inside the backend image (recipes/<file>.yml). Used by the shipped catalogue; defaults to an empty string for new recipes that carry playbook_content. |
| variables | JSONB | Parameter schema: the keys the agent can pass via execute_recipe(variables=...). |
| pre_checks / post_checks | text | Human description of what the playbook validates before / after. Not executed by the platform; they're for the operator reviewing the recipe. |
| prerequisites | text | OS/software dependencies. Informational. |
| os_family | JSONB array | Allowed OS families (e.g. ["debian", "rhel"]). Used at execution time to filter incompatible servers. |
| tags | array | Free-form filters for search. |
| category | string | Loose grouping (diagnostic, remediation, …). |
| version | semver string | Operator-managed. Bump when you change the playbook. |
| is_parameterized | bool | True if the recipe accepts variables. |
| is_proactive | bool | True if the recipe is safe to schedule before an incident (e.g. log rotation). |
| tenant_id | UUID or null | NULL = global (superadmin-managed); set = tenant-owned. |
Inline content vs file-based playbooks¶
There are two ways to store the playbook:
Inline (playbook_content) — the YAML is stored as a text field
in the database and runs directly from there. This is how the
in-browser editor works and is the recommended approach for all
tenant-authored recipes. No backend image rebuild, no volume mount.
At execution time, the worker uses playbook_content when it is
present.
File-based (playbook_path) — the shipped catalogue stores
playbooks as .yml files inside the backend image under
recipes/<file>.yml. playbook_path holds the relative path.
Use this approach only when shipping a recipe as part of a custom
backend image build.
When a recipe has both fields set, playbook_content takes
precedence at execution time.
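
The precedence rule can be sketched in a few lines of Python. This is an illustration only: `Recipe` and `resolve_playbook_source` are hypothetical names, not OpenRemedy's worker code, and treating an empty or whitespace-only string as "not present" is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Recipe:
    """Hypothetical stand-in for a row in the recipes table."""
    slug: str
    playbook_content: str = ""
    playbook_path: str = ""


def resolve_playbook_source(recipe: Recipe) -> Tuple[str, str]:
    """Return ("inline", yaml_text) or ("file", relative_path).

    Inline content wins whenever it is present, mirroring the rule above.
    """
    if recipe.playbook_content.strip():
        return ("inline", recipe.playbook_content)
    if recipe.playbook_path:
        return ("file", recipe.playbook_path)
    raise ValueError(f"recipe {recipe.slug!r} has no playbook source")
```

A recipe that carries both fields resolves to its inline content; one that carries only `playbook_path` falls back to the file.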
Tenant recipes vs the global catalogue¶
The global catalogue (tenant_id IS NULL) ships with OpenRemedy and
is seeded at install time. It is read-only for all tenant roles —
only superadmin can modify it.
Tenant admins can create their own recipes (tenant_id = their
tenant's UUID). These are visible only to their tenant, and the slug
only needs to be unique within the tenant. To create or update a
tenant recipe, use the in-browser editor at /recipes/new or
/recipes/{slug}/edit, or the REST API.
# Create a tenant recipe
curl -X POST https://app.example.com/api/v1/recipes \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "slug": "redis-flush-cache",
    "name": "Redis FLUSHDB on a single key namespace",
    "description": "Clears redis keys matching a prefix without touching other DBs.",
    "incident_type": "custom",
    "risk_level": "medium",
    "playbook_content": "---\n- hosts: \"{{ target_host }}\"\n  tasks:\n    - name: Flush namespace\n      ...",
    "variables": {"namespace": "session:"},
    "tags": ["redis", "cache"],
    "category": "remediation",
    "version": "1.0.0",
    "is_parameterized": true
  }'

# Update: bump version explicitly when the playbook changes
curl -X PATCH https://app.example.com/api/v1/recipes/redis-flush-cache \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"version": "1.0.1", "playbook_content": "---\n..."}'

# Delete
curl -X DELETE https://app.example.com/api/v1/recipes/redis-flush-cache \
  -H "Authorization: Bearer $TOKEN"
Risk levels¶
The risk_level you pick is the single most consequential authoring
decision. It determines whether the agent can auto-execute the recipe
on its own or has to wait for a human.
| Level | Examples | Auto-executes for trust ∈ |
|---|---|---|
| none | system-info, disk-usage, log-read | autonomous, supervised, manual |
| low | systemd-restart, port-validation, config-validation | autonomous, supervised |
| medium | disk-cleanup, log-cleanup | (always requires approval) |
| high | Rolling reboots, schema migrations | (always requires approval) |
| critical | (reserved for the most destructive operations) | (always requires approval) |
Rule of thumb:
- none = pure read. No become: true. No file writes. No service changes. If your playbook only runs command: calls that gather data, it's none.
- low = idempotent writes with proper pre/post checks. Restart a service that's already-supposed-to-be-up. Validate config without reloading. Anything where retrying is safe and a partial run can't corrupt state.
- medium = destructive but recoverable. Clean a cache. Vacuum logs. Remove orphaned containers. Failure mode is "I deleted something I shouldn't have"; data loss is bounded but real.
- high = rare: rebuilds, rolling reboots, partition resizes, schema migrations on production. Anything where a wrong invocation is a Saturday-morning incident.
- critical = reserve for operations where even partial execution is catastrophic and recovery is uncertain.
When in doubt, pick the higher level. You can always loosen later.
Guardian, the destructive-operation safeguard, may raise a recipe's effective risk above what you set here before the approval gate evaluates it. See Guardian for details.
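
The risk table in this section can be read as a simple lookup. The sketch below is illustrative only: `AUTO_EXECUTE_TRUST` and `requires_approval` are hypothetical names, and Guardian's adjustment and recipe role overrides are deliberately left out.

```python
# Trust levels that may auto-execute each declared risk level, copied from the
# table in this section. An empty set means approval is always required.
AUTO_EXECUTE_TRUST = {
    "none": {"autonomous", "supervised", "manual"},
    "low": {"autonomous", "supervised"},
    "medium": set(),
    "high": set(),
    "critical": set(),
}


def requires_approval(risk_level: str, trust_level: str) -> bool:
    """Return True when a human has to approve before the recipe runs."""
    if risk_level not in AUTO_EXECUTE_TRUST:
        raise ValueError(f"unknown risk level: {risk_level!r}")
    return trust_level not in AUTO_EXECUTE_TRUST[risk_level]
```

Reading the table this way makes the cost of a wrong label concrete: moving a recipe from low to medium removes every auto-execute path.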
Variables and parameter substitution¶
If your recipe needs runtime parameters, declare them in the
variables JSONB field as a flat dict of defaults, for example
{"service_name": "nginx"}. The agent supplies values at call time
through execute_recipe(slug=..., variables={...}).
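
A minimal sketch of the declaration and the call together, assuming a hypothetical `execute_recipe` stub and illustrative variable names. Overlaying the supplied values on the declared defaults and rejecting undeclared keys are assumptions about the behaviour, not confirmed platform semantics.

```python
from typing import Optional

# Declared on the recipe's `variables` field: a flat dict of defaults.
# The keys and values here are illustrative, not taken from a shipped recipe.
RECIPE_VARIABLES = {"service_name": "nginx", "restart_timeout": 30}


def execute_recipe(slug: str, variables: Optional[dict] = None) -> dict:
    """Hypothetical stub of the agent tool.

    Returns the extravars that would be handed to Ansible. Overlaying supplied
    values on the defaults and rejecting undeclared keys are both assumptions.
    """
    supplied = variables or {}
    undeclared = sorted(set(supplied) - set(RECIPE_VARIABLES))
    if undeclared:
        raise ValueError(f"undeclared variables for {slug}: {undeclared}")
    return {**RECIPE_VARIABLES, **supplied}


# The agent's call: override one default, keep the other.
extravars = execute_recipe(
    slug="systemd-restart",
    variables={"service_name": "redis-server"},
)
```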
Variables are passed to Ansible as extravars. Inside the playbook,
reference them with "{{ service_name }}" Jinja syntax. Ansible
escapes them at template time, so an LLM-supplied parameter value
cannot break out of an argument slot.
One caveat: literal {{ ... }} in a recipe's command string
(Docker --format '{{.Names}}', kubectl -o jsonpath=..., Helm)
needs no special handling — the platform escapes the operator's
literal at dispatch time and Ansible renders the right thing on the
target. This was a real bug (#69)
fixed in v0.1.x.
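
How the platform performs this escaping is not described on this page. As one plausible approach, and not OpenRemedy's implementation, literal braces can be wrapped in Jinja's `{% raw %}` block while references to declared variables are left alone. `escape_literal_braces` is a hypothetical helper.

```python
import re

# Matches one Jinja expression such as "{{ service_name }}" or "{{.Names}}".
JINJA_EXPR = re.compile(r"\{\{(.*?)\}\}")
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def escape_literal_braces(command: str, declared: set) -> str:
    """Wrap every {{ ... }} that is not a declared variable in a raw block."""

    def replace(match):
        # Take the name before any filter, e.g. "path" in "{{ path | quote }}".
        name = match.group(1).split("|")[0].strip()
        if IDENTIFIER.match(name) and name in declared:
            return match.group(0)  # a real variable reference: leave it for Ansible
        return "{% raw %}" + match.group(0) + "{% endraw %}"

    return JINJA_EXPR.sub(replace, command)
```

With `declared={"service_name", "target_host"}`, a Docker `--format '{{.Names}}'` argument is wrapped in a raw block while `{{ service_name }}` is passed through unchanged. The sketch ignores expressions more complex than a name with optional filters.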
Authoring workflow: editor + validate + dry-run¶
The recommended workflow for new recipes uses the in-browser editor
at /recipes/new or /recipes/{slug}/edit.
1. Draft¶
Write or paste the playbook into the CodeMirror editor. Use
hosts: "{{ target_host }}" as the play target — OpenRemedy injects
target_host as an extravar at execution time.
The AI Assistant panel beside the editor can help draft or refine the playbook conversationally. Ask it questions, request specific changes, or describe what you want ("restart nginx if it's down and verify it's listening on port 80"). When the assistant proposes a concrete edit it shows a Proposed playbook block with Apply and Discard buttons. Nothing reaches the editor automatically — you click Apply explicitly.
2. Validate¶
As you type, the editor calls POST /recipes/validate 600 ms after
the last keystroke (debounced). Validation runs two checks:
- YAML parse: the content must be valid YAML.
- ansible-playbook --syntax-check: a temporary file is written and checked against the local Ansible installation. A target_host placeholder is injected so hosts: "{{ target_host }}" does not cause a spurious "undefined variable" error.
The toolbar shows Valid (green) or Invalid (red). Errors appear in a strip below the editor. You can also call the endpoint directly:
curl -X POST https://app.example.com/api/v1/recipes/validate \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"playbook_content": "---\n- hosts: ..."}'
# → {"ok": true, "errors": []}
# → {"ok": false, "errors": ["YAML parse error: ..."]}
Minimum role: viewer.
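
The syntax-check stage can be approximated locally. The sketch below is not the backend's code: `syntax_check_command` and `check_playbook` are hypothetical helpers, supplying the placeholder with `-e target_host=localhost` is an assumption about how the injection works, and the YAML parse stage is omitted. Running `check_playbook` requires `ansible-playbook` on the PATH.

```python
import os
import subprocess
import tempfile


def syntax_check_command(playbook_file: str) -> list:
    """Build the ansible-playbook invocation for a syntax-only check."""
    return [
        "ansible-playbook",
        "--syntax-check",
        # Assumption: the target_host placeholder is supplied as an extra var.
        "-e", "target_host=localhost",
        playbook_file,
    ]


def check_playbook(playbook_content: str) -> dict:
    """Write the content to a temporary file and run the syntax check."""
    with tempfile.NamedTemporaryFile("w", suffix=".yml", delete=False) as handle:
        handle.write(playbook_content)
    try:
        result = subprocess.run(
            syntax_check_command(handle.name),
            capture_output=True,
            text=True,
        )
    finally:
        os.unlink(handle.name)  # always remove the temporary file
    ok = result.returncode == 0
    # Same shape as the endpoint's response: {"ok": ..., "errors": [...]}
    return {"ok": ok, "errors": [] if ok else [result.stderr.strip()]}
```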
3. Dry-run Test¶
Before saving a recipe that makes changes, run it with --check
against a real server in your fleet. The Test (dry-run) button
opens a panel:
- Select a target server from your tenant's fleet.
- Click Run --check.
- The editor polls for the result. When complete, the Ansible stdout appears in a code block — green for a clean run, red for failures or predicted changes that would fail.
The test uses POST /recipes/test to enqueue an ARQ dry-run job and
GET /recipes/test/{job_id} to poll the result. The execution lands
in status preview_completed (distinct from success) — it is
tracked separately and does not appear in the incident execution log.
Minimum role: operator.
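
The enqueue-then-poll pattern can be sketched as a small client-side loop. Only the `preview_completed` status comes from this page; the `status` response key, the `failed` terminal state, and the `wait_for_dry_run` helper are assumptions made for illustration.

```python
import time
from typing import Callable

# Assumption: "preview_completed" is documented; "failed" is a guess at a
# terminal failure state.
TERMINAL_STATUSES = {"preview_completed", "failed"}


def wait_for_dry_run(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the dry-run job reaches a terminal status or time runs out.

    fetch_status(job_id) should perform GET /recipes/test/{job_id} and return
    the decoded JSON body.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_status(job_id)
        if result.get("status") in TERMINAL_STATUSES:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"dry-run job {job_id} did not finish within {timeout}s")
        sleep(interval)
```

Passing the HTTP call in as `fetch_status` keeps the loop independent of any particular client library and makes it easy to exercise with a fake.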
4. Save¶
The Save / Create recipe button is only enabled when
validation has passed (ok: true). Saving stores playbook_content
inline; no file on disk is required.
Lifecycle¶
sequenceDiagram
autonumber
participant Agent
participant Gate as guardrails
participant Op as Operator
participant Worker as ARQ worker
participant Host as Target server
Agent->>Gate: execute_recipe(slug, variables)
Gate->>Gate: trust × risk + safety classifier + role override
alt auto-execute (low risk + autonomous, or override)
Gate->>Worker: dispatch (status=approved)
else approval required
Gate->>Op: status=awaiting_approval
Op->>Gate: approve / reject
Gate->>Worker: dispatch
end
Worker->>Host: ansible-playbook --extra-vars '{...}'
Host-->>Worker: stdout / stderr / rc
Worker->>Worker: persist output to S3, mark execution
Worker-->>Agent: execution.completed
The full path through the code is documented in architecture flow E.
Recipe role overrides¶
The recipe_role_overrides table lets a tenant admin promote a
specific (recipe_slug, server_role) tuple out of the trust × risk
gate. The next time the agent proposes that recipe on a server with
that role, the gate is short-circuited and the call auto-executes
even though the agent's trust_level would normally require
approval.
Use this when:
- You've manually approved the same recipe on the same role enough times that the dashboard surfaces a "Promote?" suggestion.
- You're confident the recipe is safe on this role specifically and you want to stop being asked.
Don't use this when:
- The recipe is medium-or-higher risk and you haven't run it many times. The override skips the safety classifier too — there's no fallback layer.
Revoke an override from the Agents page in the dashboard; on the next agent run, the gate re-engages.
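
The short-circuit can be pictured as a set lookup that runs before the normal gate result is consulted. This is a sketch under assumptions: `gate_decision` is a hypothetical function, the overrides are modelled as an in-memory set rather than the recipe_role_overrides table, and the returned strings reuse the statuses shown in the lifecycle diagram.

```python
from typing import Set, Tuple


def gate_decision(
    recipe_slug: str,
    server_role: str,
    needs_approval: bool,
    overrides: Set[Tuple[str, str]],
) -> str:
    """Return "approved" (auto-execute) or "awaiting_approval".

    needs_approval is the outcome of the ordinary trust x risk evaluation.
    """
    # A matching (recipe_slug, server_role) override bypasses the gate entirely.
    if (recipe_slug, server_role) in overrides:
        return "approved"
    return "awaiting_approval" if needs_approval else "approved"
```

Because the override is keyed on the tuple, promoting systemd-restart for a web role says nothing about the same recipe on a database role.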
The shipped catalogue¶
OpenRemedy seeds these recipes at install time. Most operator needs are covered; check here before authoring:
| Slug | Risk | Type | What it does |
|---|---|---|---|
| systemd-restart | low | service_down | Restarts a systemd unit with pre/post checks. |
| service-restart | low | service_down | Generic service restart (init.d / systemd / OpenRC). |
| config-validation | low | service_down | Runs nginx -t / apachectl configtest / mysqld --validate-config before any restart-based recipe. |
| port-validation | low | port_unavailable | Confirms a port is listening and reports the owning process. |
| firewall-allow | medium | port_unavailable | UFW or firewalld rule to allow a TCP/UDP port. |
| log-cleanup | low | disk_full | Vacuum journald + rotate / delete old /var/log/*.log.*. |
| disk-cleanup | medium | disk_full | Aggressive: apt cache, dnf cache, /tmp, large *.log files. |
| systemd-override | medium | service_down | Edits a unit's [Service] block (e.g., Restart=always). Preventive; not a first-line fix. |
| system-info, disk-usage, log-read, log-search, service-status, … | none | (diagnostic) | Read-only fact gathering. |
Full list: seed.py:STARTER_RECIPES in the backend.
What not to do¶
Patterns that look harmless but break the security model:
- Shell module with un-quoted variables. shell: "rm {{ path }}" is shell-injection-prone if path is operator-controlled. Use command: rm "{{ path }}" (the command module never invokes a shell) or ansible.builtin.file: state=absent (ideal).
- Missing become: true on tasks that need root. The playbook silently runs as the daemon user and fails cryptically. Default to become: true and only drop it on diagnostic playbooks.
- Hardcoded paths and service layouts that vary by distro. /var/log/nginx/ exists on both Debian and RHEL, but on Alpine nginx is managed by an OpenRC init script at /etc/init.d/nginx rather than a systemd unit. Use when: ansible_os_family == "Debian" guards or set the recipe's os_family field to a single family.
- Unbounded command: loops. No timeout:, no register:, no failed_when:. The worker's overall timeout will eventually kill the playbook, but you'll lose the partial state.
- Recipes that mutate state without a corresponding rollback. If your recipe edits /etc/systemd/system/foo.service.d/override.conf, ship a sibling recipe (systemd-override-revert) that removes it. Operators will need to reverse course at some point.
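
The first pattern is easy to screen for before saving. The following is a rough lint sketch, not a platform feature: `find_shell_interpolation` is a hypothetical helper, and it only catches the single-line form where the shell task and the interpolation share a line.

```python
import re

# A task line that uses the shell module and interpolates a Jinja variable.
SHELL_WITH_VARIABLE = re.compile(
    r"^\s*(?:-\s+)?(?:ansible\.builtin\.)?shell:.*\{\{"
)


def find_shell_interpolation(playbook_content: str) -> list:
    """Return 1-based line numbers where a shell task interpolates a variable."""
    return [
        number
        for number, line in enumerate(playbook_content.splitlines(), start=1)
        if SHELL_WITH_VARIABLE.search(line)
    ]
```

A hit is a prompt to switch the task to the command module or a purpose-built module, not proof of an exploitable injection.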
See also¶
- Architecture: agent pipeline, where execute_recipe slots into the broader flow.
- Security model: the trust × risk matrix in full.
- Maintenance plan authoring: recipes are also invokable from a recipe step in maintenance plans.
- Dashboard: Recipes, covering the editor, validate, dry-run, and AI assist from the operator's perspective.