cap_llm_inspect — image analysis

Source: cap_llm_inspect.c · header: cap_llm_inspect.h

cap_llm_inspect is the LLM-interaction reference capability. It demonstrates a special pattern: while executing a tool for the LLM, it starts another LLM inference (nested call) to analyze a local image file.

Subtasks where such nested inference is useful include:

  • Image understanding (this module)
  • Text summarization
  • Code explanation

cap_llm_inspect registers a single Callable, inspect_image:

| Field | Value |
| --- | --- |
| Tool ID | `inspect_image` |
| Description | Analyze a local image from an absolute path. Confirm the path first, then provide a prompt describing what to inspect. |
| Input | `{ "path": "<absolute path>", "prompt": "<what to inspect>" }` |
| Output | Textual analysis from the LLM |

Typical flow:

(Diagram: the agent LLM calls inspect_image; the capability parses the input, runs a nested multimodal inference on the image, and returns the textual analysis.)

The implementation is small (~110 lines). Core flow:

static const char *CAP_LLM_INSPECT_SYSTEM_PROMPT =
    "You analyze local image files for the ESP-Claw. "
    "Describe visible content plainly and briefly. "
    "If the image is unclear, say what is uncertain instead of guessing.";

static esp_err_t cap_llm_inspect_execute(const char *input_json,
                                          const claw_cap_call_context_t *ctx,
                                          char *output, size_t output_size)
{
    // 1. Parse path and prompt from the input JSON
    cJSON *root = cJSON_Parse(input_json);
    if (root == NULL) {
        snprintf(output, output_size, "Error: invalid JSON input");
        return ESP_ERR_INVALID_ARG;
    }
    const cJSON *path_json   = cJSON_GetObjectItem(root, "path");
    const cJSON *prompt_json = cJSON_GetObjectItem(root, "prompt");
    if (!cJSON_IsString(path_json) || !cJSON_IsString(prompt_json)) {
        cJSON_Delete(root);
        snprintf(output, output_size,
                 "Error: \"path\" and \"prompt\" must be strings");
        return ESP_ERR_INVALID_ARG;
    }

    claw_media_asset_t asset = {
        .kind = CLAW_MEDIA_ASSET_KIND_LOCAL_PATH,
        .path = path_json->valuestring,
    };

    // 2. Build multimodal request (dedicated system prompt + user prompt + image)
    claw_llm_media_request_t request = {
        .system_prompt = CAP_LLM_INSPECT_SYSTEM_PROMPT,
        .user_prompt   = prompt_json->valuestring,
        .media         = &asset,
        .media_count   = 1,
    };

    // 3. Nested LLM call; write result to output
    //    (the error branch is shown further below)
    char *analysis = NULL;
    char *error_message = NULL;
    esp_err_t err = claw_core_llm_infer_media(&request, &analysis, &error_message);
    if (err == ESP_OK && analysis != NULL) {
        snprintf(output, output_size, "%s", analysis);
        free(analysis);
    }
    cJSON_Delete(root);
    return err;
}

claw_core_llm_infer_media is a standalone LLM invocation, isolated from the user-facing Agent session:

  • Uses its own system_prompt (image-focused, no chat history)
  • Does not consume the current session's token budget the way normal turns do
  • Does not go through cap_skill tool-visibility management
  • Requires multimodal support from the configured backend
On failure, the error string (including any message from the backend) is composed directly into output:

if (err != ESP_OK) {
    snprintf(output, output_size,
             "Error: image analysis failed (%s)%s%s",
             esp_err_to_name(err),
             error_message ? ": " : "",
             error_message ? error_message : "");
    free(error_message);
    return err;
}

If multimodal support is unavailable or the path is invalid, this error string is returned in output to the caller (the LLM or an automation).

cap_llm_inspect highlights several choices:

  1. Dedicated system prompt: image work should not be biased by conversational history; a fixed specialist prompt stabilizes tone.

  2. “Confirm the path first” in the description: the LLM should use cap_files list_dir before burning a nested call on a wrong path.

  3. Stateless: no retained state; each invocation is its own mini-task.

  4. Separation of concerns: download lives in cap_im_*, paths in cap_files, analysis here.

The same pattern could power:

  • inspect_audio (needs audio-capable models)
  • summarize_document (text inference entrypoint)
  • classify_image (different fixed system prompt)