08 Jun 2026

Agent engineering: smolagents


This is the sixth take on the same news reader. Previous versions used tool-calling agents. This time I used smolagents from Hugging Face, specifically its CodeAgent. Instead of calling predefined tools, the agent writes Python code, and the framework executes it in a sandbox.

How CodeAgent works

A CodeAgent follows a ReAct loop: think, act, observe, repeat. At each step, the LLM generates a Python snippet, the framework executes it, and stdout plus the return value feed back as the next observation. To finish, the agent calls final_answer(value).

from smolagents import CodeAgent, LiteLLMModel

agent = CodeAgent(
    tools=[],
    model=LiteLLMModel("anthropic/claude-haiku-4-5"),
    executor_type="blaxel",
    executor_kwargs={"sandbox_name": "news-reader"},
    additional_authorized_imports=["httpx", "bs4", "pydantic"],
    max_steps=15,
    verbosity_level=2,
)

result = agent.run(task_prompt)

smolagents has no native Anthropic client. LiteLLMModel wraps the LiteLLM Python library, which is better known as a proxy server but here works as a local library that sends requests directly to api.anthropic.com. The anthropic/ prefix in the model ID tells LiteLLM which provider to use.
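The think, act, observe loop described at the top of this section can be sketched in a few lines of plain Python. This is a toy illustration, not smolagents' implementation: `run_loop`, `scripted_model`, and the `FinalAnswer` exception are hypothetical names, the "LLM" is a hard-coded list of snippets, and the code runs through a bare `exec` with no sandboxing at all.

```python
import contextlib
import io


class FinalAnswer(Exception):
    """Raised by final_answer() to stop the loop and carry the result."""

    def __init__(self, value):
        super().__init__(value)
        self.value = value


def final_answer(value):
    raise FinalAnswer(value)


def run_loop(generate_code, max_steps=5):
    """Toy ReAct loop; generate_code is a stand-in for the LLM call."""
    namespace = {"final_answer": final_answer}  # state shared across steps
    observation = ""
    for step in range(max_steps):
        code = generate_code(step, observation)  # think
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, namespace)  # act (no isolation here)
        except FinalAnswer as done:
            return done.value
        except Exception as exc:
            # Errors are fed back so the next step can react to them
            buffer.write(f"Error: {exc!r}")
        observation = buffer.getvalue()  # observe
    return None


def scripted_model(step, observation):
    # Hypothetical stand-in for the model: two fixed snippets
    snippets = [
        "total = sum(range(5))\nprint(total)",
        "final_answer(total * 2)",
    ]
    return snippets[step]


result = run_loop(scripted_model)
print(result)
```

The second snippet reads `total`, which the first snippet defined. That is the same cross-step state a real executor has to preserve.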

Sandboxes: Jupyter repurposed

smolagents has several executors that determine where the generated code runs.

The default LocalPythonExecutor is a restricted AST interpreter running in your process. It walks the generated code’s syntax tree and evaluates it node by node, blocking modules like os and subprocess, and functions like eval and exec. Only a small whitelist of safe imports is available. You can expand it with additional_authorized_imports, but the local executor still can’t make HTTP requests or touch the filesystem unless you explicitly opt in.
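To make the allowlist idea concrete, here is a minimal sketch that walks a snippet's syntax tree with the standard `ast` module and reports unauthorized imports and blocked calls. It is a static check only, not the node-by-node evaluation that LocalPythonExecutor performs, and `check_code` and `BLOCKED_CALLS` are names made up for this example.

```python
import ast

# Illustrative blocklist, not the list smolagents uses
BLOCKED_CALLS = {"eval", "exec", "compile", "__import__"}


def check_code(source, authorized_imports):
    """Return a list of violations found by walking the syntax tree."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for module in modules:
            # Compare the top-level package only, so "os.path" counts as "os"
            if module.split(".")[0] not in authorized_imports:
                violations.append(f"import of '{module}' is not authorized")
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in BLOCKED_CALLS
        ):
            violations.append(f"call to '{node.func.id}' is blocked")
    return violations


allowed = {"math", "httpx"}
ok = check_code("import math\nprint(math.sqrt(4))", allowed)
bad = check_code("import os\nexec('print(1)')", allowed)
print(ok)
print(bad)
```

A check like this is easy to get around (aliasing `exec`, for instance), which is one reason to evaluate the tree node by node or to rely on VM-level isolation instead.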

The remote executors take a different approach. They spin up an isolated environment and run a Jupyter kernel inside it. smolagents connects to the kernel over WebSocket and sends code the same way a Jupyter notebook sends cells to its kernel. This is the Jupyter ecosystem repurposed. The kernel manages state between steps, so variables defined in one step are available in the next.

With executor_type="blaxel", the sandbox is a remote VM that boots in under 25ms from hibernation. The additional_authorized_imports parameter tells smolagents to pip install those packages into the VM before the first step. The isolation is at the VM level rather than the interpreter level.

What the agent actually generated

Here’s what the agent generated with verbosity_level=2.

Step 1 — imports, Pydantic models, and fetch both pages:

import httpx
from bs4 import BeautifulSoup
from pydantic import BaseModel, Field

class NewsItem(BaseModel):
    title: str
    url: str
    source: str
    tags: list[str]
    summary: str
    discussion_url: str | None = None

class ScraperResult(BaseModel):
    items: list[NewsItem]
    report: str

hn_response = httpx.get("https://news.ycombinator.com", timeout=10)
hn_html = hn_response.text
lobsters_response = httpx.get("https://lobste.rs", timeout=10)
lobsters_html = lobsters_response.text

Step 2 — parse Hacker News with keyword filtering:

relevant_keywords = ['python', 'ai', 'ml', 'machine learning', 'tool',
    'developer', 'architecture', 'software', 'programming', 'database',
    'framework', 'library', 'algorithm', 'performance', 'deployment',
    'devops', 'rust', 'golang', 'javascript', 'typescript', 'web',
    'backend', 'frontend', 'distributed', 'system']
skip_keywords = ['business', 'funding', 'vc', 'startup funding',
    'social media', 'twitter', 'elon', 'drama', 'crypto', 'bitcoin',
    'nft', 'politics', 'hiring', 'jobs']

hn_soup = BeautifulSoup(hn_html, 'html.parser')
hn_rows = hn_soup.find_all('tr', class_='athing')

for idx, row in enumerate(hn_rows[:30]):
    title_cell = row.find('span', class_='titleline')
    a_tag = title_cell.find('a')
    title = a_tag.get_text(strip=True)
    url = a_tag.get('href', '')

    title_lower = title.lower()
    is_relevant = any(kw in title_lower for kw in relevant_keywords)
    is_skip = any(kw in title_lower for kw in skip_keywords)
    if is_relevant and not is_skip:
        hn_items.append(...)

Output: “Found 7 relevant items on Hacker News”

Step 3 — parse Lobsters:

lobsters_soup = BeautifulSoup(lobsters_html, 'html.parser')
lobsters_rows = lobsters_soup.find_all('li', class_='story')

for idx, row in enumerate(lobsters_rows[:40]):
    title_elem = row.find('a', class_='u-url')
    title = title_elem.get_text(strip=True)
    url = title_elem.get('href', '')

    tags_container = row.find('ul', class_='tags')
    # ...
    comments_link = row.find('a', class_='comments_label')
    # ...

Output: “Found 6 relevant items on Lobsters”

The Lobsters selectors (li.story, a.u-url) were correct for titles and URLs, but the tag and discussion URL selectors returned nothing.

Steps 4 and 5 generated tags from a keyword map and validated the result with Pydantic:

validated = ScraperResult(**result)
final_answer(validated.model_dump())

Output: “Validation successful! 13 items validated”

Structured output through self-validation

Unlike Pydantic AI’s output_type=MyModel, smolagents has no built-in schema enforcement for the final answer. final_answer() is a built-in tool that smolagents injects into every agent. The system prompt tells the agent to call it when the job is done, and the framework stops the loop. The agent can pass any value to it.

I included Pydantic models in the prompt and told the agent to validate before returning. The agent could ignore the instruction, but in practice it doesn’t. If validation fails, the exception becomes the next observation and the agent has remaining steps to fix the data.

For a harder guarantee, smolagents has final_answer_checks: functions that run on the host before accepting the result. If a check returns False, the agent continues:

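def validate_result(answer, memory=None, **kwargs):
    # smolagents passes the agent's memory as a second argument;
    # accept and ignore it, along with any extra keyword arguments.
    try:
        ScraperResult(**answer)  # ScraperResult must be importable on the host
        return True
    except Exception:
        return False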

agent = CodeAgent(..., final_answer_checks=[validate_result])

This runs on your machine, outside the sandbox. The agent can’t bypass it.

The tradeoff: code vs. judgment

A tool-calling agent would decide “this article about perceptrons is relevant to AI and Python” because it reads the content. A CodeAgent freezes that judgment into keyword-matching code at generation time, and the code then runs without the model looking at each article.

In my run, “Trusted Computing Frequently Asked Questions” got tagged as ['ml', 'web', 'rust'] and “How to fix a laptop that reboots randomly” got tagged as ['ml', 'web'] because the keyword filter matched on substrings. The summaries were just the article titles repeated verbatim.
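The substring problem is easy to reproduce. The sketch below compares a plain `in` check with a whole-word regex on one of the titles above; the short keyword list and both helper names are chosen for this example and are not the agent's actual tag map.

```python
import re


def substring_match(title, keywords):
    """Return the keywords that appear anywhere inside the lowercased title."""
    lowered = title.lower()
    return [kw for kw in keywords if kw in lowered]


def whole_word_match(title, keywords):
    """Return only the keywords that appear as standalone words."""
    lowered = title.lower()
    return [kw for kw in keywords if re.search(rf"\b{re.escape(kw)}\b", lowered)]


keywords = ["ai", "ml", "rust", "web"]
title = "Trusted Computing Frequently Asked Questions"

loose = substring_match(title, keywords)
strict = whole_word_match(title, keywords)
print(loose)   # "rust" matches because it sits inside "Trusted"
print(strict)
```

Word boundaries remove this class of false positive, but they do not turn keyword matching into judgment: a title can mention Rust without the article being about Rust.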

The Perplexity team recently published research arguing that code-generating agents outperform tool-calling agents for search tasks. Their claim is that code expresses complex retrieval logic more naturally than a sequence of tool calls. The news reader task is too simple to test this, but the approach is gaining traction beyond Hugging Face.

What would make this better

I intentionally kept the implementation naive to see what a zero-tool CodeAgent produces out of the box.

The agent guessed page structure. The Hacker News selectors (tr.athing, span.titleline) were correct. The Lobsters selectors for titles (li.story, a.u-url) worked, but the selectors for tags and discussion URLs didn’t match anything. The agent has no way to verify its guesses against the actual HTML. A tool-calling agent with a web_fetch tool would have read the markup and adapted. For a CodeAgent, the fix is to give it deterministic parsing tools. You’d write parse_hn() and parse_lobsters() tools with tested selectors, and let the agent call them from its code. The smolagents docs recommend exactly this: “Whenever possible, logic should be based on deterministic functions rather than agentic decisions.”
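A rough sketch of what a deterministic parse_hn() could look like. It uses the standard library's `html.parser` instead of BeautifulSoup so it runs without extra packages, and it is exercised against a hand-written sample that mimics the span.titleline structure mentioned above, not against the live site. The `TitleLineParser` class and the sample markup are assumptions made for this example. To expose the function to a CodeAgent you would still need to register it as a tool.

```python
from html.parser import HTMLParser


class TitleLineParser(HTMLParser):
    """Collects title and URL from the first link in each span.titleline."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_titleline = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "titleline" in (attrs.get("class") or "").split():
            self._in_titleline = True
        elif tag == "a" and self._in_titleline and self._href is None:
            self._href = attrs.get("href") or ""

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.items.append({"title": title, "url": self._href})
            self._href, self._text = None, []
            self._in_titleline = False  # later links in the span are not the story


def parse_hn(html):
    """Return a list of {'title', 'url'} dicts found in the given HTML."""
    parser = TitleLineParser()
    parser.feed(html)
    return parser.items


# Hand-written sample markup, not a capture of the real page
SAMPLE = """
<tr class="athing"><td><span class="titleline">
  <a href="https://example.com/a">First story</a>
  <span class="sitebit">(<a href="from?site=example.com">example.com</a>)</span>
</span></td></tr>
<tr class="athing"><td><span class="titleline">
  <a href="https://example.com/b">Second story</a>
</span></td></tr>
"""

stories = parse_hn(SAMPLE)
print(stories)
```

Because the function is ordinary code, its selectors can be covered by tests with saved HTML fixtures, which is exactly what the agent's guessed selectors lacked.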

Keyword matching replaced LLM judgment. The agent wrote a keyword filter instead of evaluating each article. A better architecture would split scraping from judgment: one agent (or deterministic code) fetches and parses the raw data, and a second agent reads the titles and summaries to filter and tag them. smolagents has managed_agents for this. You pass one agent as a managed agent to another, and the manager calls it like a function from its generated code.

Pydantic validation felt bolted on. I told the agent to validate with Pydantic in the prompt, and it did, but defining models in generated code to validate generated data is circular. If the agent controls both the schema and the data, validation catches typos but not structural problems. A more natural approach for a CodeAgent would be to write results to a structured store (SQLite, for example) where the schema is enforced externally. The sandbox can run sqlite3 or any Python library. The agent writes INSERT statements, and the database rejects rows that violate the table’s constraints. Note that SQLite is loosely typed by default, so the rejection has to come from NOT NULL and CHECK constraints or from a STRICT table.
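A minimal sketch of that idea with the standard `sqlite3` module. The table name, columns, and constraints are assumptions loosely based on the NewsItem model shown earlier. The schema would be created by host-side code so that the agent only gets to insert rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema owned by the host; constraints are illustrative assumptions
conn.execute(
    """
    CREATE TABLE news_items (
        title   TEXT NOT NULL CHECK (length(title) > 0),
        url     TEXT NOT NULL CHECK (url LIKE 'http%'),
        source  TEXT NOT NULL CHECK (source IN ('hn', 'lobsters')),
        summary TEXT NOT NULL
    )
    """
)


def insert_item(item):
    """Insert one row; return None on success or the constraint error text."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO news_items (title, url, source, summary) "
                "VALUES (:title, :url, :source, :summary)",
                item,
            )
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)


good = insert_item(
    {"title": "A story", "url": "https://example.com", "source": "hn",
     "summary": "Short summary."}
)
bad = insert_item(
    {"title": "Unknown source", "url": "https://example.com", "source": "reddit",
     "summary": "Short summary."}
)
row_count = conn.execute("SELECT COUNT(*) FROM news_items").fetchone()[0]
print(good, bad, row_count)
```

The error string returned for the rejected row plays the same role as a Pydantic validation error: it can become the agent's next observation.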

Everything runs sequentially. The agent uses one kernel and parses one site after another. A more natural architecture would run two CodeAgents in parallel, one per site, each with its own sandbox. A third agent would collect their results, then filter and summarize them. managed_agents fits here too: the collecting agent calls the per-site agents as functions from its generated code.
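The fan-out and collect shape can be sketched with the standard library alone. Here `scrape_hn` and `scrape_lobsters` are hypothetical stand-ins that return canned data. In a real version each might wrap a call to a per-site agent's run() method, and whether those calls can safely share a process is something to verify.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_hn():
    # Stand-in for a per-site agent run; returns canned data
    return [{"title": "HN story", "source": "hn"}]


def scrape_lobsters():
    # Stand-in for a per-site agent run; returns canned data
    return [{"title": "Lobsters story", "source": "lobsters"}]


def collect(jobs):
    """Run zero-argument callables concurrently and merge the lists they return."""
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(job) for job in jobs]
        merged = []
        for future in futures:  # iterate in submission order
            merged.extend(future.result())
    return merged


items = collect([scrape_hn, scrape_lobsters])
print(items)
```

Threads are a reasonable fit because each job would spend most of its time waiting on network calls to the model and the sandbox.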

Comparing the approaches

Aspect                  Tool-calling agents                   CodeAgent
What the LLM produces   Tool name + arguments                 Python code
Execution               Framework calls the function          Jupyter kernel runs the code
Available capabilities  Only registered tools                 Anything Python can do
Safety model            Tool allowlist + argument validation  Sandbox isolation (VM/container)
Structured output       Schema validation with retry          Self-validation in generated code
Best for                Content requiring LLM judgment        Procedural tasks with clear logic

The full project is on GitHub.

Roman Imankulov