Commit 3f678bd

TomasTomecek and mfocko committed

Apply review changes from Mato

Co-authored-by: Matej Focko <mfocko@users.noreply.github.com>
Signed-off-by: Tomas Tomecek <ttomecek@redhat.com>
1 parent 2ace003 commit 3f678bd

1 file changed

Lines changed: 9 additions & 9 deletions

posts/lessons-learned-ai-automation-rhel/index.md

@@ -24,7 +24,7 @@ When we started this project, the number of AI frameworks available was both exc
 - **[LlamaStack](https://llamastack.github.io/latest/)** - It wasn't mature enough
 - **Custom MCP (Model Context Protocol)** integrations - For connecting to services like Jira and GitLab

-The reality? Every framework has trade-offs, and the "best" choice depends heavily on your specific use case. Goose was great for rapid prototyping and was an excellent starting point choice, but we quickly discovered we needed something more robust for production. With so many options available, we went with something we were able to easily start with and still achieve good results. BeeAI's structured approach with conditional requirements and many more possibilities of restrictions and control over agents proved more suitable for our complex, multi-step workflows.
+The reality? Every framework has trade-offs, and the “best” choice depends heavily on your specific use case. Goose was great for rapid prototyping and was an excellent starting point choice, but we quickly discovered we needed something more robust for production. With so many options available, we went with something we were able to easily start with and still achieve good results. BeeAI's structured approach with conditional requirements and many more possibilities of restrictions and control over agents proved more suitable for our complex, multi-step workflows.

 Since security is very important for the production-level system, and it's also easy to create your own MCP server, we've opted for that option so the code is 100% controlled by us. This way, we can define exactly which Jira fields can be touched, what values are allowed, and under what conditions. Agents get an exact list of functions they can use, and none of them have any credentials in their environment: everything that needs authentication goes through MCP.

@@ -54,7 +54,7 @@ You must decide between one of 5 actions. Follow these guidelines:
 """
 ```

-What looks like a simple "analyse this issue" task actually requires:
+What looks like a simple “analyse this issue” task actually requires:

 - Complex decision trees with multiple fallback paths
 - Comprehensive error handling for edge cases
@@ -97,8 +97,8 @@ This specialisation journey is still ongoing, and we can absolutely see opportun

 We've experimented with multiple models, providers and settings:

-- [Google's Gemini](https://ai.google.dev/gemini-api) (currently gemini-2.5-pro)
-- [Anthropic's Claude](https://claude.ai/) (claude-sonnet-4)
+- [Google's Gemini](https://ai.google.dev/gemini-api) (currently `gemini-2.5-pro`)
+- [Anthropic's Claude](https://claude.ai/) (`claude-sonnet-4`)
 - Custom temperature and parameter tuning

 Yes, newer models often perform better, and it's worth trying them. But here's what matters more:
@@ -108,7 +108,7 @@ Yes, newer models often perform better, and it's worth trying them. But here's w
 3. **Validation and error handling:** Every step has validation, and failures are handled gracefully with retry logic.
 4. **Explicit system prompt:** LLMs can be very creative in their solutions, make sure to precisely guide them via system prompts.

-For example, our backport agent doesn't just "apply a patch" - it has a sophisticated workflow and tools:
+For example, our backport agent doesn't just “apply a patch” - it has a sophisticated workflow and tools:

 ```python
 # Specialized tools for reliable patch handling
@@ -140,18 +140,18 @@ This might be the most important lesson: AI agents will never be 100% determinis
 - File system operations
 - Complex deterministic logic, like mapping RHEL versions to branches

-**Lesson learned:** Design your system to be "AI-assisted" rather than "AI-controlled." Use AI for the creative, analytical work, but rely on deterministic code for the critical operations.
+**Lesson learned:** Design your system to be “AI-assisted” rather than “AI-controlled”. Use AI for the creative, analytical work, but rely on deterministic code for the critical operations.

 ## From code to service

-Once you move from building a prototype to building a production system, you also need to think about the infrastructure and other components, not just agents and code around them. Here are just a few examples of what we plumbed that keeps AI agents running reliably:
+Once you move from building a prototype to building a production system, you also need to think about the infrastructure and other components, not just agents and code around them. Here are just a few examples of what we architected that keeps AI agents running reliably:

 - **Queue-based architecture:** We use valkey queues to orchestrate work between agents, a reliable pattern we already knew worked. Issues flow through different queues with automatic retry logic.
 - **Observability:** Every agent action flows through [Phoenix tracing](https://phoenix.arize.ai/), giving us complete visibility into what agents are thinking and doing. When something goes wrong, we can trace exactly which decisions led to the failure. This isn't just logging - it's understanding why your AI system made specific choices.
 - **Container deployment:** The entire system runs on OpenShift and each agent type has its own deployment with resource limits. This also makes the whole system easily scalable.
-- **Dry-run capabilities:** Every operation has dry-run support built in. This helps us to better test and debug the agent's behaviour. When an agent makes an unexpected decision, we can replay the scenario safely.
+- **Dry-run capabilities:** Every operation has a dry-run support built in. This helps us to better test and debug the agent's behaviour. When an agent makes an unexpected decision, we can replay the scenario safely.

-**Lesson learned:** Start thinking about operational capabilities on day one. Dry-run modes, proper logging, and failure recovery aren't "nice to haves" - they're what make AI systems actually usable. And since LLMs produce a ton of output, you'll need an efficient way to find out why the agents made such an odd decision.
+**Lesson learned:** Start thinking about operational capabilities on day one. Dry-run modes, proper logging, and failure recovery aren't “nice to haves” - they are what makes AI systems actually usable. And since LLMs produce a ton of output, you'll need an efficient way to find out why the agents made such an odd decision.

 ## Takeaways

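Note on the last hunk: the post recommends keeping "complex deterministic logic, like mapping RHEL versions to branches" in ordinary code rather than leaving it to the agent. A minimal sketch of what that could look like follows. Everything in it is an assumption made for illustration: the function name `branch_for_fix_version`, the `rhel-X.Y.Z` branch naming scheme, and the accepted input format are hypothetical and are not taken from the post or the project's code.

```python
import re

# Hypothetical input format ("rhel-9.4" or "rhel-9.4.0"); the real project's
# version and branch naming is not shown in this diff.
_FIX_VERSION_RE = re.compile(r"^rhel-(\d+)\.(\d+)(?:\.(\d+))?$")


def branch_for_fix_version(fix_version: str) -> str:
    """Map a fix version string to a branch name deterministically.

    The agent may propose a fix version, but the branch name itself is
    computed by plain code, and anything unrecognised is rejected rather
    than guessed.
    """
    match = _FIX_VERSION_RE.match(fix_version.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised fix version: {fix_version!r}")
    major, minor, micro = match.groups()
    # Assumed convention: a missing micro version defaults to 0.
    return f"rhel-{major}.{minor}.{micro or '0'}"
```

With this shape, an agent that returns a malformed or invented version triggers a `ValueError` that the surrounding workflow can catch and retry, instead of the mistake silently reaching a git operation.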