Hidden Features To 10x Your Hermes Agent Setup
Explains how Hermes uses config.yaml per profile and how to adjust max bytes to control how much tool output is read into the context window.
Unlock hidden Hermes settings to massively boost your agent workflow, cut costs, and read larger outputs without losing context.
Summary
AI LABS’s latest deep-dive into Hermes reveals practical tweaks that go beyond the basics. The team explains that the real power lies in using Hermes’ own config options to optimize context, output limits, and memory. They walk through their preferred max bytes, which expands how much tool output Hermes can ingest, and show how to adjust this per profile with either config.yaml edits or the Hermes config command. Large knowledge bases and long markdown lines are addressed by increasing line-length and chunk handling settings. The hosts also tune compression thresholds and the tail behavior (target ratio) to preserve useful context across conversations. Sub agents get a boost too: increasing max concurrent children and leveraging max spawn depth lets Hermes spawn more helpers and explore nested repos, with a caution about token costs. They explain how to optimize cost using auxiliary models, effort levels, and provider-specific quirks, such as switching to cheaper models for background tasks. The video also covers workflow enhancements like quick commands (exec and alias), checkpointing, background process notifications, ephemeral system prompts, and a YOLO mode for uninterrupted action. Throughout, they emphasize the importance of experimenting with settings to fit your model window size and available budget. The session wraps with a quick plug for their starter pack and a nod to audience support via the channel. If you’re optimizing Hermes for your team, this is a manifest of practical levers to pull.
Key Takeaways
- Increase max bytes in config.yaml (or via Hermes config) to pull more tool output into the context window, improving long-run issue detection.
- Raise the file read limit to 5,000 when working with documents longer than 2,000 lines, so chunked reads don't miss crucial details.
- Raise the compression threshold from the default 0.5 to 0.75 so a 200k-token context compresses at 150k tokens instead of 100k.
- Adjust target ratio (10%–80%) to control how much uncompressed tail remains when new conversations start.
- Raise max concurrent children from 3 to 5 to speed up sub-agent workflows, while monitoring token costs.
- Enable auto-approve for sub agents to remove permission-prompt friction in automated tasks.
- Use Hermes O to pull in cheaper auxiliary models from other providers for background tasks and compression, saving tokens on mundane tasks.
Who Is This For?
Teams using Hermes who want to squeeze more performance and lower costs from their agent setups, especially those juggling large files, long documents, or nested sub-agents.
Notable Quotes
"The first one we'll change is max bytes... it pulls 50,000 characters from any tool output into the context window at once"
—Demonstrates where the context window limit starts and why increasing it helps with long test runs.
"We set it to 5,000 and let the agent read more of the file at once"
—Addresses handling of large knowledge bases and documents.
"Compression threshold is set to 0.75... so we can at least use 75% of the context window before it hits compression"
—Shows rationale for adjusting compression behavior based on model window size.
"Sub agents handle simple tasks like web searches... but running them on that powerful model burns a lot of cost"
—Explains cost-aware use of auxiliary/sub-models.
"You can switch between the multiple personalities that come with it and have fun with the different voice styles"
—Highlights personality customization feature for Hermes.
Questions This Video Answers
- How do I increase Hermes' context window without losing performance?
- What are the best practices for configuring sub agents and max_spawn_depth in Hermes?
- Can I reduce Hermes costs by using auxiliary models for background tasks?
- What is the effect of changing the compression_threshold on long conversations in Hermes?
- How do I enable YOLO mode in Hermes and when should I use it?
Hermes, config.yaml, max_bytes, compression_threshold, target_ratio, sub_agents, max_concurrent_children, max_spawn_depth, auto_approve, auxiliary_models, YOLO mode, ephemeral_system_prompt
Full Transcript
Ever since we started using Hermes, we've set up a lot of our workflows on it. As we showed you in the previous videos, it's been monitoring our apps, coordinating the team on Slack, and more. But the more we used it, the more we ran into the same problems, and it started to feel like our setup wasn't enough. Like we always do, we started looking for ways to solve the issues. But that's when we realized we didn't need to add anything else because everything we needed was already in Hermes itself. We just weren't using it to its full potential.
Now, if you're new to the channel, then welcome. We're a software company and this is AI Labs where we show you how to optimize a business with AI using proven methods from our own team. And in this video, we're going through all the settings we change to improve our workflow. So the first category is all about context and output limits. Hermes uses the Hermes folder which holds all the configs and info that run the agent and all of that lives in one single file called config.yaml. It's a really long file and it contains every config tied to the agent setup.
So if you're managing multiple profiles like we are, each one gets its own separate folder and every profile has its own config.yaml file. So the first one we'll change is max bytes. By default, this is set to 50,000 which means it pulls 50,000 characters from any tool output into the context window at once and the rest get cut off. That became a problem when we were using it to monitor test runs because it wouldn't properly see the issues when they were long. So we needed more of that output in the context window. For that, you can either set max bytes directly in the config.yaml file or change it to the number you need using the Hermes config command.
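To make the effect of the 50,000-character cap concrete, here is a minimal Python sketch. It is not Hermes' implementation: the function name `truncate_tool_output` and the simulated test log are hypothetical, and it assumes the cap keeps the start of the output and drops the rest, as the transcript describes.

```python
def truncate_tool_output(output: str, max_chars: int = 50_000) -> tuple[str, int]:
    """Keep the first max_chars characters and report how many were dropped.

    Hypothetical helper: assumes the head of the output is kept.
    """
    kept = output[:max_chars]
    dropped = len(output) - len(kept)
    return kept, dropped


# Simulated test run: 3,000 passing lines, then the one failure at the very end.
passing_lines = "".join(f"test_case_{i:04d} ... ok\n" for i in range(3_000))
log = passing_lines + "FAILED test_checkout: AssertionError\n"

kept_default, dropped_default = truncate_tool_output(log)
kept_raised, dropped_raised = truncate_tool_output(log, max_chars=100_000)

print("failure visible with default cap:", "FAILED" in kept_default)
print("characters dropped with default cap:", dropped_default)
print("failure visible with raised cap:", "FAILED" in kept_raised)
```

With a log this long, the failure line sits past the default cap, which is the situation the team describes with long test runs.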
Once that's done, it pulls that many characters into the context window from all tool outputs. But you'll need to make sure the right profile is selected because the changes you make with the Hermes config command show up in your active profile's config file. Another problem shows up when the agent reads a file with a lot of lines. This happened to us when we connected Hermes to our company's knowledge base where we have these large policy documents that are easily more than 2,000 lines. So, when it pulled them in by breaking them into chunks, it kept missing important details.
So, we set it to 5,000 and let the agent read more of the file at once. There's another limit that becomes a problem when you have a lot of large markdown files. If your document has long paragraphs stored as a single long line and that line is more than 2,000 characters, it won't be fully read. So, if you want to increase that, you can change it with the Hermes config command and set the character count you need. That way, the agent can read more than 2,000 characters in a single line. The first three settings mostly matter if you work with large files, but this next one's important for everyone, and that's the compression threshold.
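Before raising the single-line limit, it can help to know whether any of your documents actually exceed it. The Python sketch below scans text for lines longer than 2,000 characters. The helper `find_long_lines` and the sample document are hypothetical, and the script runs independently of Hermes.

```python
def find_long_lines(text: str, limit: int = 2_000) -> list[tuple[int, int]]:
    """Return (line_number, length) for every line longer than limit characters."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


# Sample document: a heading, one paragraph stored as a single 3,500-character
# line, and a short closing line.
document = "# Leave policy\n" + "x" * 3_500 + "\nShort closing line.\n"

long_lines = find_long_lines(document)
for number, length in long_lines:
    print(f"line {number} has {length} characters")
```

The longest length reported gives a lower bound for the character count you would need to set.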
By default, the compression threshold is set to 50%, which means once 50% of the context window is filled, it compresses everything in there. But a lot of other agents like Codex and Claude Code have this set to around 75%. We ran into this ourselves while running Hermes. Since we'd set it up with a smaller model on 200,000 context, it compressed too early, which isn't ideal when you actually want to get things done. Now, models like Opus or the Gemini ones with a million token window would be fine here because compression only happens at 500,000 tokens for them.
But for models with 200,000 context, it happens at 100,000 tokens, which causes issues on a long run. So, we set the compression threshold to 0.75. That way, we can at least use 75% of the context window before it hits compression. Another setting is called target ratio, which is set to 20% by default. When Hermes hits compress, it doesn't compress the entire chat. Instead, it leaves 20% of the conversation uncompressed and starts the new conversation with that uncompressed part along with the summary. So that uncompressed 20% becomes your tail once the new compressed conversation starts. Now, how much is left uncompressed depends on how big your context window is.
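The compression-threshold figures quoted above are the context window multiplied by the threshold. A small Python sketch of that arithmetic follows; the function name `compression_trigger` is hypothetical and only the multiplication comes from the transcript.

```python
def compression_trigger(context_window: int, threshold: float) -> int:
    """Tokens that can accumulate before compression starts."""
    return round(context_window * threshold)


for window in (200_000, 1_000_000):
    for threshold in (0.50, 0.75):
        trigger = compression_trigger(window, threshold)
        print(f"{window:>9,} token window, threshold {threshold:.2f}: "
              f"compress at {trigger:,} tokens")
```

On a 200,000-token window, moving from 0.50 to 0.75 shifts the trigger from 100,000 to 150,000 tokens.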
For a 1 million token context window, 100,000 tokens get added. And for a 200,000 token context window, only 20,000 tokens get added. And this tail gives the agent more context on the previous conversation, so it can pick up easily. So 20% works for us on a 200,000 context window. But if you're on a larger model, you can use the config command to set it higher. The ideal range is between 10% and 80%. The higher the number, the more tokens stay in your context window, but you'll also have less free room to work with. As we talked about in the previous video, the memory.md and user.md files that Hermes keeps have a hard limit on how many characters they can hold.
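A note on the tail sizes quoted above: 100,000 and 20,000 tokens are 20% of the compression trigger point at the default 50% threshold, not 20% of the full window. The Python sketch below assumes that reading, so tail = window × threshold × target ratio. This is inferred from the numbers in the transcript rather than confirmed from Hermes itself, and the function name `tail_tokens` is hypothetical.

```python
def tail_tokens(context_window: int, threshold: float, target_ratio: float) -> int:
    """Uncompressed tail size, assuming the ratio applies to the trigger point."""
    trigger = context_window * threshold
    return round(trigger * target_ratio)


# Defaults quoted in the transcript: threshold 0.50, target ratio 0.20.
print(tail_tokens(1_000_000, 0.50, 0.20))  # the 1M-window case
print(tail_tokens(200_000, 0.50, 0.20))    # the 200k-window case

# Under the same assumption, raising the threshold to 0.75 also grows the tail.
print(tail_tokens(200_000, 0.75, 0.20))
```

If the assumption holds, changing the compression threshold changes the tail size even when the target ratio stays the same.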
After that, Hermes starts dropping information the agent thinks it no longer needs. You can change these limits too, either directly in your config.yaml file or through the Hermes desktop app from the settings pane. From there, you can also change most of the settings we just talked about. And if you're enjoying the video so far, subscribe to the channel and hit the like button. This small gesture of support goes a long way for us. The second category is sub agents. On Hermes, you're limited to spawning three sub agents at once. And when we were working on our projects, we hit this limit and things ended up taking longer than they needed to.
In the config, this limit comes from the max concurrent children value, which is set to three by default. Since we were running into issues, we used the config command and changed this value to five. From that point on, whenever it spins up sub agents, it can run up to five of them together. But this is token-heavy. So if you're working with a lot of sub agents, cost is something you need to watch out for. Now in Claude Code, each sub agent can create its own sub agents. And that's helpful when you're working with a large folder where one agent can branch out into more agents to explore nested repos.
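The concurrency cap described above behaves like a semaphore: extra work queues until a slot frees up. Here is a minimal Python sketch using `asyncio.Semaphore`. It illustrates the general pattern only, not how Hermes schedules sub agents, and `run_children` and the dummy workload are hypothetical.

```python
import asyncio


async def run_children(task_count: int, max_concurrent_children: int) -> int:
    """Run dummy sub-agent tasks and return the peak number running at once."""
    limiter = asyncio.Semaphore(max_concurrent_children)
    running = 0
    peak = 0

    async def child(task_id: int) -> None:
        nonlocal running, peak
        async with limiter:            # wait here until a slot is free
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for real sub-agent work
            running -= 1

    await asyncio.gather(*(child(i) for i in range(task_count)))
    return peak


peak_default = asyncio.run(run_children(task_count=12, max_concurrent_children=3))
peak_raised = asyncio.run(run_children(task_count=12, max_concurrent_children=5))
print("peak with limit 3:", peak_default)
print("peak with limit 5:", peak_raised)
```

The same twelve tasks finish in fewer rounds with the higher limit, and more of them are consuming tokens at any given moment.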
But Hermes blocks this with the max spawn depth flag, which is set to one by default, and that stops any sub agent from creating more. So you can push the max spawn depth above one. After that, your sub agents can create their own sub agents, too. There's another sub agent feature called auto approve which is set to false by default. This means the sub agents you spawn only inherit the parent's permissions and they might still get blocked by permission prompts. So if you want to change this, you can set it to true directly here. Once you've done that, your sub agents can run in auto approve mode and won't get blocked by any permission prompts.
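Raising the spawn depth multiplies the number of agents that can exist, which is where the token cost comes from. The Python sketch below counts the worst case, assuming every agent spawns the maximum number of children at every allowed level and that the concurrency limit applies per parent. Both assumptions are for illustration only, and `count_sub_agents` is a hypothetical helper.

```python
def count_sub_agents(children_per_agent: int, max_spawn_depth: int) -> int:
    """Worst-case number of sub agents below the main agent.

    Depth 1 means only the main agent may spawn; depth 2 lets those
    children spawn once more, and so on.
    """
    return sum(children_per_agent ** level
               for level in range(1, max_spawn_depth + 1))


for children, depth in [(3, 1), (5, 1), (3, 2), (5, 2), (5, 3)]:
    total = count_sub_agents(children, depth)
    print(f"{children} children per agent, depth {depth}: up to {total} sub agents")
```

Going from depth 1 to depth 2 with five children per agent raises the ceiling from 5 to 30 sub agents, so it is worth increasing one setting at a time.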
Sub agents handle simple tasks like web searches that don't need the heavy lifting of your main model. But running them on that powerful model burns a lot of cost for work like this. So you can change the model used for any sub agent and switch it to a smaller one which saves you tokens. And if that smaller model is from a different provider, you can add it using the Hermes O command which lets you pull in models from whichever provider you want. But before we move towards the settings that save us costs, let's have a word from our sponsor, Helix.
Every week there's a new AI tool that helps you build apps, websites, and products faster than ever. But nobody talks about what happens before you start building. Most people jump straight into coding with a half-baked idea and end up rebuilding the same thing three times. Helix is an AI guided product planning platform that takes a rough idea and turns it into a structured plan you can actually hand off to a developer or a stakeholder. You describe your idea in one sentence and five AI specialist agents go to work covering validation, market research, product development, business modeling, and growth strategy.
It pulls live market data in real time, connects to over 20 tools you already use like Notion, Jira, and Airtable. And the canvas adapts to your actual product needs instead of forcing you into a generic template. When you're done, you export an investor-ready PDF blueprint that's actually built on real research, not guesswork. Click the link in the description and try Helix for free. The third category is cost. These are basically the settings that save you tokens. When you first set up Hermes, you give it the models for different purposes, but you can set up auxiliary models as well.
Auxiliary models are basically the cheaper, faster ones that Hermes uses for background subtasks. That way, the expensive main model you've set up isn't wasted on small tasks that aren't that complicated. By default, when you leave the auxiliary models empty, Hermes falls back to the lowest cost model in your config. Since we were using OpenRouter, it was set to Gemini Flash. So these cheaper models could handle tasks behind the scenes. So if you want to save costs, you can set up cheaper models manually. They can save you a lot of money on tasks like web searches or compression.
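To see why routing background work to an auxiliary model matters, here is a rough Python cost comparison. The prices and token counts are hypothetical placeholders chosen only to make the arithmetic visible. Substitute your providers' real per-million-token prices before drawing conclusions.

```python
def run_cost(main_tokens: int, background_tokens: int,
             main_price: float, background_price: float) -> float:
    """Cost in dollars, with prices given per million tokens."""
    total = main_tokens * main_price + background_tokens * background_price
    return total / 1_000_000


# Hypothetical prices per million tokens, not real provider pricing.
MAIN_PRICE = 15.0
AUX_PRICE = 0.5

# Hypothetical workload: 2M tokens of main work, 3M tokens of background work
# such as web searches and compression.
everything_on_main = run_cost(2_000_000, 3_000_000, MAIN_PRICE, MAIN_PRICE)
background_on_aux = run_cost(2_000_000, 3_000_000, MAIN_PRICE, AUX_PRICE)

print(f"all work on the main model: ${everything_on_main:.2f}")
print(f"background work on an auxiliary model: ${background_on_aux:.2f}")
```

The saving grows with the share of tokens spent on background tasks, which is why compression and web searches are the usual candidates.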
If your main model is something like Opus, you probably don't want to waste it on trivial tasks when you're trying to save costs. Another thing you can configure is the effort level of the model you're using. Effort is basically how much thinking the model puts into a task. If the effort is higher, the output will be better, but the tokens consumed will also be higher. You can set it to low or minimum so the agent doesn't waste tokens. You can also turn off thinking completely if you don't want to use effort levels. The fourth category is workflow and it covers a bunch of other features that make Hermes so much better to use.
The first one is quick commands. If you've been using Claude Code, you might know slash commands, where you add custom reusable instructions. They do a similar job, but Hermes handles them differently because it doesn't use prompt instructions the way Claude Code and other agents do. Quick commands come in two types. The first one is exec, which runs a terminal command and drops its output into the context window. This is helpful for creating scripts that run a whole series of commands from just a single one. For example, for git operations, you can set up a custom exec command and run it whenever you want the agent to use those commands.
The other type is alias. This is less of a custom command and more of a way to rename existing ones. For example, if you want a quicker way to run compress, you can set an alias to just a single letter and run it fast. There's no direct way to set this up, so you actually have to do it in config.yaml, or you can just ask Claude Code or Hermes to do it for you, and it'll make the changes itself. Aside from that, Hermes has a checkpointing mechanism, too. A checkpoint is basically a saved state of your files at a certain point in time.
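The difference between the two quick-command types described above can be sketched in a few lines of Python. This is a conceptual model only: the `QUICK_COMMANDS` table, its field names, and the `resolve` function are hypothetical and do not reflect Hermes' actual config schema. Check the real keys in your config.yaml before copying anything.

```python
import subprocess
import sys

# Hypothetical quick-command table; keys and field names are illustrative only.
QUICK_COMMANDS = {
    "hello": {"type": "exec",
              "command": [sys.executable, "-c", "print('hello from exec')"]},
    "c": {"type": "alias", "target": "compress"},
}


def resolve(name: str) -> tuple[str, str]:
    """Return ("output", text) for exec commands or ("command", name) otherwise."""
    entry = QUICK_COMMANDS.get(name)
    if entry is None:
        return ("command", name)             # not a quick command: pass through
    if entry["type"] == "alias":
        return ("command", entry["target"])  # alias: just rename
    # exec: run the command and hand its output back for the context window
    result = subprocess.run(entry["command"], capture_output=True,
                            text=True, check=True)
    return ("output", result.stdout.strip())


print(resolve("c"))
print(resolve("hello"))
```

An alias only changes which existing command runs, while an exec command produces new text for the agent to read.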
You can roll back to it if an experiment breaks something. It's turned off by default, so you'll have to set it to true. Once checkpointing is on, you can use the rollback command to go back to a previous checkpoint. Another thing you can change is background process notifications. If you set this to all, you'll get a notification for everything Hermes is doing in the background. You can change it if you don't want those. There's also a flag called Hermes ephemeral system prompt, which lets you add content into the system prompt of the agent. This is an environment variable, and the instruction you add as its value becomes part of the system prompt.
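For readers unfamiliar with checkpointing, the idea described above (save the state of your files, then restore it if an experiment breaks something) can be sketched with the Python standard library. This is a conceptual illustration, not Hermes' mechanism; `create_checkpoint`, `rollback`, and the file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def create_checkpoint(workdir: Path, store: Path, name: str) -> Path:
    """Copy the working directory into the checkpoint store."""
    snapshot = store / name
    shutil.copytree(workdir, snapshot)
    return snapshot


def rollback(workdir: Path, snapshot: Path) -> None:
    """Replace the working directory with the saved snapshot."""
    shutil.rmtree(workdir)
    shutil.copytree(snapshot, workdir)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    workdir, store = root / "project", root / "checkpoints"
    workdir.mkdir()
    store.mkdir()

    (workdir / "app.py").write_text("print('working')\n")
    snapshot = create_checkpoint(workdir, store, "before-experiment")

    # The experiment breaks a file and leaves a stray one behind.
    (workdir / "app.py").write_text("broken(\n")
    (workdir / "scratch.txt").write_text("temp\n")

    rollback(workdir, snapshot)
    restored = (workdir / "app.py").read_text()
    leftover = (workdir / "scratch.txt").exists()

print("restored contents:", restored.strip())
print("stray file still present:", leftover)
```

A rollback restores edited files and also removes files created after the checkpoint, which is worth keeping in mind before using it.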
So you can add whatever instructions you want this way, but this prompt only applies to the session you open in that terminal and it doesn't stick around long term. So it's mainly useful for one-time use cases. You can also run Hermes in YOLO mode, which is the same as the dangerously skip permissions mode in Claude. This stops the agent from sitting there waiting for you to approve every action. You can turn it on with the YOLO command or by launching Hermes with the YOLO flag in the terminal. At one point, we ran into an error and weren't sure if it was coming from Hermes itself or from some config we'd set up.
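The session-only behavior of the ephemeral system prompt follows from how environment variables work: the value exists only in the shell that set it. Here is a Python sketch of the pattern. The variable spelling `HERMES_EPHEMERAL_SYSTEM_PROMPT` is an assumption based on the name spoken in the video, and the base prompt and `build_system_prompt` are hypothetical.

```python
import os

BASE_SYSTEM_PROMPT = "You are a helpful agent."  # placeholder, not Hermes' prompt
ENV_VAR = "HERMES_EPHEMERAL_SYSTEM_PROMPT"       # assumed spelling of the variable


def build_system_prompt(environ: dict[str, str]) -> str:
    """Append the ephemeral instruction, if any, to the base system prompt."""
    extra = environ.get(ENV_VAR, "").strip()
    if not extra:
        return BASE_SYSTEM_PROMPT
    return f"{BASE_SYSTEM_PROMPT}\n\n{extra}"


with_extra = build_system_prompt({ENV_VAR: "Reply in British English."})
without_extra = build_system_prompt({})
from_this_shell = build_system_prompt(dict(os.environ))

print(with_extra)
print("---")
print(without_extra)
```

A new terminal that has not set the variable gets the base prompt only, which matches the one-time use described in the video.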
That's when we came across the ignore user config mode. It strips the agent of all the configs in your Hermes folder and runs it in isolation so you can figure out what's actually causing the error and fix it. You can also switch between the multiple personalities that come with it and have fun with the different voice styles already in the configs using the personality command. Since a lot of people have been asking about it, we've put together a starter pack with all the guides and resources you'll need. It's available inside our community, AI LABS Pro. So, if you'd like to support the channel and get access to this resource pack, be sure to check it out.
The link is in the description. That brings us to the end of this video. If you'd like to support the channel and help us keep making videos like this, you can do so by using the super thanks button below. As always, thank you for watching and I'll see you in the next one.