
Evaluate outputs (teacher)

Use an LLM to compare outputs

TL;DR

  • This is a classic LLM-as-a-Judge pattern: have a judge model compare two outputs and give feedback the way a teacher grades homework.
  • It works well for evaluation: A/B testing prompts, comparing different models, or comparing different settings of the same model.
  • The risk is judge bias and instability; fix a rubric, require evidence (sentences quoted from the outputs), and cross-validate with multiple rounds and/or multiple judges.

Background

This prompt tests an LLM's ability to evaluate and compare outputs from two different models (or two different prompts), as if it were a teacher.

One workflow:

  1. Ask two models to write the dialogue with the same prompt
  2. Ask a judge model to compare the two outputs

Example generation prompt (for the two models):

Plato’s Gorgias is a critique of rhetoric and sophistic oratory, where he makes the point that not only is it not a proper form of art, but the use of rhetoric and oratory can often be harmful and malicious. Can you write a dialogue by Plato where instead he criticizes the use of autoregressive language models?
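
A minimal sketch of this two-model-plus-judge workflow using the OpenAI Python SDK (the model names are illustrative assumptions; substitute whichever two models you want to compare):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GENERATION_PROMPT = "..."  # paste the generation prompt above

def ask(model: str, prompt: str) -> str:
    """Send one prompt to one model and return its text output."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Step 1: the same generation prompt goes to two models (names are examples).
output_1 = ask("gpt-4o-mini", GENERATION_PROMPT)
output_2 = ask("gpt-4o", GENERATION_PROMPT)

# Step 2: a judge model compares the two outputs as a teacher would.
judge_prompt = (
    "Can you compare the two outputs below as if you were a teacher?\n\n"
    f"Output A:\n{output_1}\n\nOutput B:\n{output_2}"
)
print(ask("gpt-4o", judge_prompt))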

How to Apply

You can break this evaluation workflow into three steps:

  1. Generate: produce two outputs from the same generation prompt (different models, or different prompt versions)
  2. Judge: compare the two outputs with the judge prompt
  3. Decide: pick the better version against the rubric, or feed the feedback into the next iteration

In production, make the rubric concrete, for example (a prompt sketch follows the list):

  • coherence (logic and structure)
  • faithfulness (stays on the assigned topic)
  • style adherence (reads like a Plato dialogue)
  • clarity (readability and phrasing)
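
One way to pin the rubric down is to write it straight into the judge prompt. A minimal sketch (the wording and the 1-5 scale are assumptions to adapt, not a fixed standard):

# Illustrative rubric; tune the dimensions and scale to your task.
RUBRIC = (
    "Score each output from 1 to 5 on:\n"
    "- coherence: logic and structure\n"
    "- faithfulness: stays on the assigned topic\n"
    "- style adherence: reads like a Plato dialogue\n"
    "- clarity: readability and phrasing\n"
    "Quote at least one sentence from each output as evidence for each score."
)

def build_judge_prompt(output_a: str, output_b: str) -> str:
    """Assemble a rubric-anchored judge prompt for two candidate outputs."""
    return (
        "Compare the two outputs below as if you were a teacher.\n"
        f"{RUBRIC}\n\n"
        f"Output A:\n{output_a}\n\n"
        f"Output B:\n{output_b}"
    )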

How to Iterate

  1. Have the judge return structured results: Winner + Scores + Evidence + Actionable feedback
  2. Fix the comparison dimensions and the score range (e.g. 1-5) to reduce arbitrariness
  3. Add a "tie / unsure" option so the judge is never forced to pick a side
  4. Run multiple judge prompts or judge models and check agreement (majority vote); see the sketch after this list
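
A sketch combining points 1 and 4: ask the judge for JSON, then take a majority vote over several runs (the JSON schema, the judge model name, and the repeat count are all assumptions):

import json
from collections import Counter

from openai import OpenAI

client = OpenAI()

JUDGE_INSTRUCTIONS = (
    "Compare Output A and Output B as a teacher would. Respond in JSON with "
    'keys: "winner" ("A", "B", or "tie"), "scores", "evidence", "feedback".'
)

def judge_once(output_a: str, output_b: str) -> dict:
    """One judging pass that returns the structured verdict."""
    response = client.chat.completions.create(
        model="gpt-4o",  # judge model name is illustrative
        response_format={"type": "json_object"},  # JSON mode
        temperature=0,  # keep individual verdicts as repeatable as possible
        messages=[{
            "role": "user",
            "content": f"{JUDGE_INSTRUCTIONS}\n\nOutput A:\n{output_a}\n\nOutput B:\n{output_b}",
        }],
    )
    return json.loads(response.choices[0].message.content)

def majority_vote(output_a: str, output_b: str, runs: int = 5) -> str:
    """Aggregate several judging passes into one winner ("A", "B", or "tie")."""
    votes = Counter(judge_once(output_a, output_b)["winner"] for _ in range(runs))
    return votes.most_common(1)[0][0]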

Self-check rubric

  • Does the judge quote concrete snippets from the outputs (evidence)?
  • Do the scores map to the rubric dimensions, rather than being generic praise?
  • Does the feedback translate into actions (how to change the prompt next round)?
  • Are the verdicts stable across repeated runs (temperature control + multi-run consistency)? See the sketch below
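
One concrete stability check, a common LLM-as-a-Judge practice added here as a sketch: present the outputs in both orders and see whether the judge still picks the same one (this reuses the judge_once helper sketched above):

def stable_verdict(output_a: str, output_b: str) -> str:
    """Judge with both presentation orders to catch position bias."""
    forward = judge_once(output_a, output_b)["winner"]   # "A", "B", or "tie"
    backward = judge_once(output_b, output_a)["winner"]  # labels are swapped here
    # Map the swapped verdict back into the original labels.
    unswapped = {"A": "B", "B": "A", "tie": "tie"}[backward]
    return forward if forward == unswapped else "unsure"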

Practice

Exercise: run an A/B prompt test on a writing task you use often:

  • prompt A: shorter and more open-ended
  • prompt B: longer, with constraints and structured output

Then compare them with the judge prompt and turn the feedback into an improvement checklist for the next round of prompt iteration.
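
For concreteness, here is one hypothetical way the two variants might look (the task and wording are invented examples):

# Prompt A: short and open-ended.
PROMPT_A = "Write an announcement for our new command-line tool."

# Prompt B: longer, with constraints and a structured output.
PROMPT_B = (
    "Write an announcement for our new command-line tool.\n"
    "Constraints: under 200 words, no marketing cliches.\n"
    "Structure: a one-line hook, three bullet-point features, "
    "and a closing call to action."
)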

Prompt (evaluation)

Can you compare the two outputs below as if you were a teacher?

Output from ChatGPT: {output 1}

Output from GPT-4: {output 2}

Code / API

OpenAI (Python)

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            # Replace {output 1} and {output 2} with the actual model outputs.
            "content": "Can you compare the two outputs below as if you were a teacher?\n\nOutput from ChatGPT:\n{output 1}\n\nOutput from GPT-4:\n{output 2}",
        }
    ],
    temperature=1,  # consider lowering this for a more repeatable judge
    max_tokens=1500,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

print(response.choices[0].message.content)

Fireworks (Python)

import fireworks.client

fireworks.client.api_key = "<FIREWORKS_API_KEY>"

completion = fireworks.client.ChatCompletion.create(
    model="accounts/fireworks/models/mixtral-8x7b-instruct",
    messages=[
        {
            "role": "user",
            # Replace {output 1} and {output 2} with the actual model outputs.
            "content": "Can you compare the two outputs below as if you were a teacher?\n\nOutput from ChatGPT:\n{output 1}\n\nOutput from GPT-4:\n{output 2}",
        }
    ],
    stop=["<|im_start|>", "<|im_end|>", "<|endoftext|>"],
    stream=True,
    n=1,
    top_p=1,
    top_k=40,
    presence_penalty=0,
    frequency_penalty=0,
    prompt_truncate_len=1024,
    context_length_exceeded_behavior="truncate",
    temperature=0.9,  # consider lowering this for a more repeatable judge
    max_tokens=4000,
)

# stream=True returns a generator of chunks; print the text as it arrives.
for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")
