How to use prebuilt evaluators

LangSmith integrates with the open-source openevals package to provide a suite of prebuilt evaluators that you can use as starting points for evaluation.

This how-to guide shows how to set up and run one type of evaluator (LLM-as-a-judge). For the full list of prebuilt evaluators, with usage examples, see the openevals and agentevals repositories.

Setup

To use the prebuilt LLM-as-a-judge evaluator, install the openevals package:

pip install -U openevals

You also need to set your OpenAI API key as an environment variable (you can choose a different provider instead):

export OPENAI_API_KEY="your_openai_api_key"
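
If the key is missing, the failure may only surface once an evaluator tries to call the model. As a minimal sketch, a small pre-flight check in plain Python can report the problem earlier. The helper name `missing_env_vars` is hypothetical and is not part of openevals or LangSmith.

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`.

    `env` defaults to the real process environment; passing a dict
    makes the helper easy to exercise without touching os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Check the variable this guide relies on before running any evaluations.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Set these before running evaluations: " + ", ".join(missing))
```

If you use a different provider, list that provider's key name instead.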

To run the evaluations, we'll use LangSmith's pytest integration for Python and Vitest/Jest for TypeScript. openevals also integrates seamlessly with the evaluate method. See the corresponding guides for setup instructions.

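Running an evaluator

The general flow is simple: import the evaluator or factory function from openevals, then run it inside your test file with inputs, outputs, and reference outputs. LangSmith automatically logs the evaluator's results as feedback. Not every evaluator requires each parameter (the exact match evaluator, for example, only needs outputs and reference outputs). In addition, if your LLM-as-a-judge prompt requires extra variables, pass them as kwargs and they will be formatted into the prompt. Set up your test file like this: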

import pytest
from langsmith import testing as t
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

correctness_evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    feedback_key="correctness",
    model="openai:o3-mini",
)

# Mock stand-in for your application; it receives the question as a string
def my_llm_app(inputs: str) -> str:
    return "Doodads have increased in price by 10% in the past year."

@pytest.mark.langsmith
def test_correctness():
    inputs = "How much has the price of doodads changed in the past year?"
    reference_outputs = "The price of doodads has decreased by 50% in the past year."
    outputs = my_llm_app(inputs)

    t.log_inputs({"question": inputs})
    t.log_outputs({"answer": outputs})
    t.log_reference_outputs({"answer": reference_outputs})

    correctness_evaluator(
        inputs=inputs,
        outputs=outputs,
        reference_outputs=reference_outputs
    )
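
To make the kwargs behavior mentioned above more concrete, here is a minimal sketch in plain Python of how extra variables can be formatted into a prompt template. It only illustrates the idea and is not the openevals implementation: `JUDGE_PROMPT`, `render_prompt`, and the `context` variable are hypothetical names chosen for this example.

```python
# Hypothetical judge prompt; not one of the prompts shipped with openevals.
JUDGE_PROMPT = (
    "Question: {inputs}\n"
    "Answer: {outputs}\n"
    "Reference answer: {reference_outputs}\n"
    "Context: {context}\n"
    "Is the answer correct given the context?"
)


def render_prompt(template: str, **variables: str) -> str:
    """Fill every {placeholder} in the template from keyword arguments."""
    return template.format(**variables)


rendered = render_prompt(
    JUDGE_PROMPT,
    inputs="How much has the price of doodads changed in the past year?",
    outputs="Doodads have increased in price by 10% in the past year.",
    reference_outputs="The price of doodads has decreased by 50% in the past year.",
    # Extra variable beyond inputs, outputs, and reference outputs.
    context="Doodad prices are tracked in the annual pricing report.",
)
print(rendered)
```

With a real evaluator, the equivalent step would presumably be to pass the extra variable as an additional keyword argument when calling the evaluator, alongside inputs, outputs, and reference_outputs.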

The feedback_key parameter (feedbackKey in TypeScript) is used as the name of the feedback in your experiment. Running the evaluation from your terminal displays the results.

If you have already created a dataset in LangSmith, you can also pass prebuilt evaluators directly to the evaluate method. In Python, this requires langsmith>=0.3.11:

from langsmith import Client
from openevals.llm import create_llm_as_judge
from openevals.prompts import CONCISENESS_PROMPT

client = Client()
conciseness_evaluator = create_llm_as_judge(
    prompt=CONCISENESS_PROMPT,
    feedback_key="conciseness",
    model="openai:o3-mini",
)

experiment_results = client.evaluate(
    # This is a dummy target function, replace with your actual LLM-based system
    lambda inputs: "What color is the sky?",
    data="Sample dataset",
    evaluators=[
        conciseness_evaluator
    ]
)
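
The lambda above is only a placeholder. As a rough sketch of what a named target function might look like, the example below takes an example's inputs as a dict and returns the outputs as a dict. The `question` and `answer` keys are assumptions made for illustration and must match whatever your dataset actually uses.

```python
def target(inputs: dict) -> dict:
    """Hypothetical stand-in for an LLM-based application."""
    question = inputs["question"]  # assumed dataset input key
    # Replace this canned reply with a call into your real system.
    return {"answer": f"No pricing data is available to answer: {question}"}


print(target({"question": "How much has the price of doodads changed?"}))
```

A function like this could be passed to client.evaluate in place of the lambda.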

For the full list of available evaluators, see the openevals and agentevals repositories.
