langchain Runnable objects (chat models, retrievers, chains, etc.) can be passed directly to evaluate() / aevaluate().

Setup

Let's define a simple chain to evaluate. First, install all of the required packages:
pip install -U langsmith langchain[openai]
Now define the chain:
from langchain.chat_models import init_chat_model
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

instructions = (
    "Please review the user query below and determine if it contains any form "
    "of toxic behavior, such as insults, threats, or highly negative comments. "
    "Respond with 'Toxic' if it does, and 'Not toxic' if it doesn't."
)

prompt = ChatPromptTemplate(
    [("system", instructions), ("user", "{text}")],
)

model = init_chat_model("gpt-4o")
chain = prompt | model | StrOutputParser()
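
As a quick sanity check before running a full evaluation, you can invoke the chain directly. This is a minimal sketch; the sample query below is just an illustrative input, not part of any dataset:

# The chain expects a dict with a "text" key, matching the prompt variable.
print(chain.invoke({"text": "You are a wonderful person!"}))
# -> e.g. "Not toxic"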

Evaluate

To evaluate your chain, simply pass it directly to the evaluate() / aevaluate() methods. Note that the input variables of the chain must match the keys of the example inputs. In this case, the example inputs should have the form {"text": "..."}.
from langsmith import aevaluate, Client

client = Client()

# Clone a dataset of texts with toxicity labels.
# Each example input has a "text" key and each output has a "label" key.
dataset = client.clone_public_dataset(
    "https://smith.langchain.com/public/3d6831e6-1680-4c88-94df-618c8e01fc55/d"
)

def correct(outputs: dict, reference_outputs: dict) -> bool:
    # Since our chain outputs a string not a dict, this string
    # gets stored under the default "output" key in the outputs dict:
    actual = outputs["output"]
    expected = reference_outputs["label"]
    return actual == expected

results = await aevaluate(
    chain,
    data=dataset,
    evaluators=[correct],
    experiment_prefix="gpt-4o, baseline",
)
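
If you prefer to run the evaluation synchronously, the same arguments work with evaluate(). This is a minimal sketch assuming the same dataset and correct evaluator defined above; the experiment prefix is arbitrary:

from langsmith import evaluate

# Synchronous equivalent of the aevaluate() call above.
results = evaluate(
    chain,
    data=dataset,
    evaluators=[correct],
    experiment_prefix="gpt-4o, baseline (sync)",
)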
The Runnable itself will be traced appropriately for each output.

[Figure: Runnable Evaluation]


