class Riffer::Evals::Judge
Executes LLM-as-judge evaluations using the provider infrastructure.
The Judge class handles calling an LLM to evaluate agent outputs and parsing the structured response. It uses tool calling internally to get guaranteed structured output from the judge model.
judge = Riffer::Evals::Judge.new(model: "anthropic/claude-opus-4-5-20251101")
result = judge.evaluate(
  instructions: "Assess answer relevancy…",
  input: "What is Ruby?",
  output: "Ruby is a programming language."
)
result[:score]     # => 0.85
result[:reasoning] # => "The response is relevant…"
Attributes
model

The model string (provider/model format).
Public Class Methods
Source
# File lib/riffer/evals/judge.rb, line 44
def initialize(model:, provider_options: {})
  provider_name, model_name = model.split("/", 2)
  unless [provider_name, model_name].all? { |part| part.is_a?(String) && !part.strip.empty? }
    raise Riffer::ArgumentError, "Invalid model string: #{model}"
  end
  @model = model
  @provider_options = provider_options
end
Initializes a new judge.
: (model: String, ?provider_options: Hash[Symbol, untyped]) -> void
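Stripped of the class context, the model-string validation above can be sketched as a standalone helper (the helper name is illustrative, not part of the library; a plain ArgumentError stands in for Riffer::ArgumentError):

```ruby
# Sketch of the "provider/model" validation performed in initialize.
# split("/", 2) splits at the first slash only, so model names may
# themselves contain slashes-free segments like version suffixes.
def parse_model_string(model)
  provider_name, model_name = model.split("/", 2)
  unless [provider_name, model_name].all? { |part| part.is_a?(String) && !part.strip.empty? }
    raise ArgumentError, "Invalid model string: #{model}"
  end
  [provider_name, model_name]
end

parse_model_string("anthropic/claude-opus-4-5-20251101")
# => ["anthropic", "claude-opus-4-5-20251101"]
```

Strings missing either part ("claude-opus-4-5", "anthropic/") fail the check, since `split("/", 2)` yields a nil or empty segment.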
Public Instance Methods
Source
# File lib/riffer/evals/judge.rb, line 63
def evaluate(instructions:, input:, output:, ground_truth: nil)
  system_message = build_system_message(instructions)
  user_message = build_user_message(input: input, output: output, ground_truth: ground_truth)

  response = provider_instance.generate_text(
    system: system_message,
    prompt: user_message,
    model: model_name,
    tools: [EvaluationTool]
  )

  parse_tool_response(response)
end
Evaluates using the configured LLM.
Composes system and user messages from the semantic fields:

+instructions+ - evaluation criteria and scoring rubric.

+input+ - the original input/question.

+output+ - the agent's response to evaluate.

+ground_truth+ - optional reference answer for comparison.
: (instructions: String, input: String, output: String, ?ground_truth: String?) -> Hash[Symbol, untyped]
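As a rough sketch of how those semantic fields might be composed into a single user message (a hypothetical stand-in; the library's actual +build_user_message+ may format things differently):

```ruby
# Hypothetical composition of the judge's user message. The ground-truth
# section is appended only when a reference answer is provided.
def compose_user_message(input:, output:, ground_truth: nil)
  parts = ["Input:\n#{input}", "Output to evaluate:\n#{output}"]
  parts << "Ground truth:\n#{ground_truth}" if ground_truth
  parts.join("\n\n")
end

compose_user_message(
  input: "What is Ruby?",
  output: "Ruby is a programming language."
)
```

Keeping +ground_truth+ optional lets the same judge run both reference-free evaluations (e.g. relevancy) and reference-based ones (e.g. correctness against a known answer).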