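LlamaIndex (9): LlamaIndex Evaluation

1. Introduction

Evaluation and benchmarking are essential in LLM development. To improve the performance of an LLM application (for example RAG or an agent), you need a way to measure it. LlamaIndex provides key modules for measuring the quality of generated results, as well as key modules for measuring retrieval quality.

Response Evaluation: Does the response match the retrieved context? Does it also match the query? Does it match the reference answer or guidelines?
Retrieval Evaluation: Are the retrieved sources relevant to the query?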

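2. Response Evaluation

Evaluating generated results can be difficult because, unlike in traditional machine learning, the predicted result is not a single number, and it is hard to define quantitative metrics for the problem. LlamaIndex provides LLM-based evaluation modules to measure the quality of results. These use an LLM (for example GPT-4) to decide, in a variety of ways, whether the predicted answer is correct. Many of the current evaluation modules do not require ground-truth labels. Evaluation can be done with some combination of the query, the context, and the response, combined with LLM calls.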

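These evaluation modules come in the following forms:

Correctness: whether the generated answer matches the reference answer for the given query (requires labels).
Semantic Similarity: whether the predicted answer is semantically similar to the reference answer (requires labels).
Faithfulness: whether the answer is faithful to the retrieved contexts (in other words, whether there is hallucination).
Context Relevancy: whether the retrieved context is relevant to the query.
Answer Relevancy: whether the generated answer is relevant to the query.
Guideline Adherence: whether the predicted answer adheres to specific guidelines.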
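
To make the Semantic Similarity idea concrete, here is a minimal sketch that scores a predicted answer against a reference answer by the cosine similarity of their embeddings. It is not LlamaIndex's implementation: the three-dimensional vectors are made up for illustration, and the helper names and the 0.8 pass threshold are assumptions.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def passes_similarity(
    pred_vec: Sequence[float],
    ref_vec: Sequence[float],
    threshold: float = 0.8,  # assumed cut-off, tune for your embedding model
) -> Tuple[float, bool]:
    """Return (similarity score, whether it clears the threshold)."""
    score = cosine_similarity(pred_vec, ref_vec)
    return score, score >= threshold


# Hypothetical "embeddings"; a real pipeline would get these from a model.
reference = [1.0, 0.0, 1.0]
close_answer = [0.9, 0.1, 1.0]
unrelated_answer = [0.0, 1.0, 0.0]

print(passes_similarity(close_answer, reference))
print(passes_similarity(unrelated_answer, reference))
```

Because the check needs a reference vector, it illustrates why this kind of evaluator requires labels, while context-based checks such as Faithfulness do not.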

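Besides evaluating queries, LlamaIndex can also use your data to generate questions to evaluate on. This means you can automatically generate questions and then run an evaluation pipeline to test whether the LLM can accurately answer them using your data.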

2.1 BaseEvaluator

All evaluation modules in LlamaIndex implement the BaseEvaluator class, which has two main methods:

The evaluate method takes query, contexts, response, and additional keyword arguments.

def evaluate(
    self,
    query: Optional[str] = None,
    contexts: Optional[Sequence[str]] = None,
    response: Optional[str] = None,
    **kwargs: Any,
) -> EvaluationResult:

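The evaluate_response method provides an alternative interface that takes a LlamaIndex Response object (which contains the response string and the source nodes) instead of separate contexts and response.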

def evaluate_response(
    self,
    query: Optional[str] = None,
    response: Optional[Response] = None,
    **kwargs: Any,
) -> EvaluationResult:

It does the same thing as evaluate, but is more convenient when working with LlamaIndex objects.
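
The relationship between the two methods can be sketched with plain Python stand-ins. Everything here is hypothetical (ToyEvaluator, ToyResponse, ToyResult, and the substring rule used as the "judgement"); it only illustrates how a response-object interface can unpack its argument and delegate to the string-based one, and is not the library's actual code.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Sequence


@dataclass
class ToyResult:
    passing: Optional[bool] = None
    feedback: Optional[str] = None


@dataclass
class ToyResponse:
    """Stand-in for a response object: answer text plus source texts."""
    response: str
    source_texts: List[str] = field(default_factory=list)


class ToyEvaluator:
    def evaluate(
        self,
        query: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        response: Optional[str] = None,
        **kwargs: Any,
    ) -> ToyResult:
        if contexts is None or response is None:
            raise ValueError("contexts and response are required")
        # Toy rule: pass if the answer appears verbatim in some context.
        supported = any(response.lower() in c.lower() for c in contexts)
        return ToyResult(
            passing=supported,
            feedback="found in context" if supported else "not found",
        )

    def evaluate_response(
        self,
        query: Optional[str] = None,
        response: Optional[ToyResponse] = None,
        **kwargs: Any,
    ) -> ToyResult:
        if response is None:
            raise ValueError("response is required")
        # Unpack the object and reuse the string-based interface.
        return self.evaluate(
            query=query,
            contexts=response.source_texts,
            response=response.response,
            **kwargs,
        )


evaluator = ToyEvaluator()
resp = ToyResponse("Paris", ["The capital of France is Paris."])
print(evaluator.evaluate_response(response=resp))
```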

2.2 EvaluationResult

Each evaluator outputs an EvaluationResult when executed:

eval_result = evaluator.evaluate(query=..., contexts=..., response=...)
eval_result.passing  # binary pass/fail
eval_result.score  # numerical score
eval_result.feedback  # string feedback

Different evaluators may populate only a subset of the result fields.
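
Since some fields may be left empty, downstream code should not assume all three are set. A small sketch of defensive handling follows; the describe helper and the SimpleNamespace stand-ins are hypothetical, and it assumes unset fields are None.

```python
from types import SimpleNamespace


def describe(result) -> str:
    """Render only the fields that the evaluator actually filled in."""
    parts = []
    for name in ("passing", "score", "feedback"):
        value = getattr(result, name, None)
        if value is not None:
            parts.append(f"{name}={value!r}")
    return ", ".join(parts) if parts else "no fields populated"


# Hypothetical results standing in for EvaluationResult objects.
pass_fail_only = SimpleNamespace(passing=True, score=None, feedback=None)
scored = SimpleNamespace(passing=False, score=2.5, feedback="Partially correct.")

print(describe(pass_fail_only))
print(describe(scored))
```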

2.3 Evaluating Response Faithfulness

FaithfulnessEvaluator evaluates whether the answer is faithful to the retrieved contexts (in other words, whether there is hallucination).

from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import FaithfulnessEvaluator

# create llm
llm = OpenAI(model="gpt-4", temperature=0.0)

# build index
...

# define evaluator
evaluator = FaithfulnessEvaluator(llm=llm)

# query index
query_engine = vector_index.as_query_engine()
response = query_engine.query(
    "What battles took place in New York City in the American Revolution?"
)
eval_result = evaluator.evaluate_response(response=response)
print(str(eval_result.passing))

You can also choose to evaluate each source context individually:

from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import FaithfulnessEvaluator

# create llm
llm = OpenAI(model="gpt-4", temperature=0.0)

# build index
...

# define evaluator
evaluator = FaithfulnessEvaluator(llm=llm)

# query index
query_engine = vector_index.as_query_engine()
response = query_engine.query(
    "What battles took place in New York City in the American Revolution?"
)
response_str = response.response
for source_node in response.source_nodes:
    eval_result = evaluator.evaluate(
        response=response_str, contexts=[source_node.get_content()]
    )
    print(str(eval_result.passing))

This produces one result for each source node in response.source_nodes.
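
Once you have one pass/fail flag per source node, you will usually want to condense them. The sketch below shows one possible summary; summarize_node_results is a hypothetical helper, and which aggregate matters (any node supports the answer, all nodes do, or the ratio) is a design choice for your application rather than something LlamaIndex prescribes.

```python
from typing import Dict, List, Union


def summarize_node_results(passing_flags: List[bool]) -> Dict[str, Union[bool, float]]:
    """Condense per-source-node pass/fail flags into a few aggregates."""
    if not passing_flags:
        return {"supported_by_any": False, "supported_by_all": False, "support_ratio": 0.0}
    return {
        "supported_by_any": any(passing_flags),
        "supported_by_all": all(passing_flags),
        "support_ratio": sum(passing_flags) / len(passing_flags),
    }


# Hypothetical flags, e.g. collected from eval_result.passing in a loop.
flags = [True, False, False]
summary = summarize_node_results(flags)
print(summary)
```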

2.4 Evaluating Query + Response Relevancy

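RelevancyEvaluator evaluates whether the retrieved contexts and the answer are relevant to and consistent with the given query. In addition to the Response object, this evaluator requires the query to be passed in.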

from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import RelevancyEvaluator

# create llm
llm = OpenAI(model="gpt-4", temperature=0.0)

# build index
...

# define evaluator
evaluator = RelevancyEvaluator(llm=llm)

# query index
query_engine = vector_index.as_query_engine()
query = "What battles took place in New York City in the American Revolution?"
response = query_engine.query(query)
eval_result = evaluator.evaluate_response(query=query, response=response)
print(str(eval_result))

Likewise, you can evaluate on a specific source node.

from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import RelevancyEvaluator

# create llm
llm = OpenAI(model="gpt-4", temperature=0.0)

# build index
...

# define evaluator
evaluator = RelevancyEvaluator(llm=llm)

# query index
query_engine = vector_index.as_query_engine()
query = "What battles took place in New York City in the American Revolution?"
response = query_engine.query(query)
response_str = response.response
for source_node in response.source_nodes:
    eval_result = evaluator.evaluate(
        query=query,
        response=response_str,
        contexts=[source_node.get_content()],
    )
    print(str(eval_result.passing))

2.5 Question Generation

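LlamaIndex can also generate questions to answer from your data. Combined with the evaluators above, this lets you build a fully automated evaluation pipeline over your data.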

from llama_index.core import SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

# create llm
llm = OpenAI(model="gpt-4", temperature=0.0)

# build documents
documents = SimpleDirectoryReader("./data").load_data()

# define generator, generate questions
dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=llm,
    num_questions_per_chunk=10,  # number of questions to generate per node
)

rag_dataset = dataset_generator.generate_questions_from_nodes()
questions = [e.query for e in rag_dataset.examples]

2.6 Batch Evaluation

LlamaIndex also provides a batch evaluation runner for running a set of evaluators across many questions.

from llama_index.core.evaluation import BatchEvalRunner

runner = BatchEvalRunner(
    {"faithfulness": faithfulness_evaluator, "relevancy": relevancy_evaluator},
    workers=8,
)

eval_results = await runner.aevaluate_queries(
    vector_index.as_query_engine(), queries=questions
)
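
A batch run like the one above is typically reduced to a pass rate per evaluator. The following sketch assumes the results come back as a dict mapping each evaluator name to a list of result objects with a passing attribute; the pass_rates helper and the toy data are hypothetical.

```python
from types import SimpleNamespace
from typing import Dict, List


def pass_rates(eval_results: Dict[str, List]) -> Dict[str, float]:
    """Fraction of passing results for each evaluator name."""
    rates: Dict[str, float] = {}
    for name, results in eval_results.items():
        if not results:
            rates[name] = 0.0
            continue
        passed = sum(1 for r in results if r.passing)
        rates[name] = passed / len(results)
    return rates


def _result(passing: bool) -> SimpleNamespace:
    # Minimal stand-in for an evaluation result object.
    return SimpleNamespace(passing=passing)


toy_results = {
    "faithfulness": [_result(True), _result(True), _result(False), _result(True)],
    "relevancy": [_result(True), _result(False), _result(False), _result(True)],
}

rates = pass_rates(toy_results)
print(rates)
```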

2.7 Integrations

LlamaIndex also integrates with community evaluation tools.

2.7.1 DeepEval

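DeepEval offers 6 evaluators (including 3 RAG evaluators, covering both retriever and generator evaluation) powered by its proprietary evaluation metrics. Install deepeval:

pip install -U deepeval

You can then import and use the evaluators from deepeval.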

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from deepeval.integrations.llama_index import DeepEvalAnswerRelevancyEvaluator

documents = SimpleDirectoryReader("YOUR_DATA_DIRECTORY").load_data()
index = VectorStoreIndex.from_documents(documents)
rag_application = index.as_query_engine()

# An example input to your RAG application
user_input = "What is LlamaIndex?"

# LlamaIndex returns a response object that contains
# both the output string and retrieved nodes
response_object = rag_application.query(user_input)

evaluator = DeepEvalAnswerRelevancyEvaluator()
evaluation_result = evaluator.evaluate_response(
    query=user_input, response=response_object
)
print(evaluation_result)

Here is how to import all 6 evaluators from deepeval:

from deepeval.integrations.llama_index import (
    DeepEvalAnswerRelevancyEvaluator,
    DeepEvalFaithfulnessEvaluator,
    DeepEvalContextualRelevancyEvaluator,
    DeepEvalSummarizationEvaluator,
    DeepEvalBiasEvaluator,
    DeepEvalToxicityEvaluator,
)

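For more on using deepeval's evaluation metrics with LlamaIndex and taking advantage of its full LLM testing suite, see: https://docs.confident-ai.com/docs/integrations-llamaindex

3. Retrieval Evaluation

LlamaIndex also provides modules to help evaluate retrieval independently. The concept of retrieval evaluation is not new: given a dataset of questions and ground-truth rankings, a retriever can be evaluated with ranking metrics such as mean reciprocal rank (MRR), hit rate, precision, and so on.

The core retrieval evaluation steps revolve around the following:

Dataset generation: given an unstructured text corpus, synthetically generate (question, context) pairs.
Retrieval Evaluation: given a retriever and a set of questions, evaluate the retrieved results using ranking metrics.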
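
For readers who have not met these ranking metrics before, here is a minimal, library-free sketch of hit rate and MRR using their standard definitions. The node IDs and the runs list are made-up data, and the function names are illustrative rather than LlamaIndex APIs.

```python
from typing import List, Sequence, Tuple


def hit_rate(retrieved_ids: Sequence[str], expected_ids: Sequence[str]) -> float:
    """1.0 if any expected node was retrieved, otherwise 0.0."""
    expected = set(expected_ids)
    return 1.0 if any(node_id in expected for node_id in retrieved_ids) else 0.0


def reciprocal_rank(retrieved_ids: Sequence[str], expected_ids: Sequence[str]) -> float:
    """1 / rank of the first relevant node (ranks start at 1), or 0.0 if none."""
    expected = set(expected_ids)
    for rank, node_id in enumerate(retrieved_ids, start=1):
        if node_id in expected:
            return 1.0 / rank
    return 0.0


# (retrieved ranking, ground-truth IDs) for three hypothetical queries.
runs: List[Tuple[List[str], List[str]]] = [
    (["n3", "n1", "n7"], ["n1"]),  # first hit at rank 2
    (["n2", "n5"], ["n2"]),        # first hit at rank 1
    (["n4", "n6"], ["n9"]),        # no hit
]

mean_hit_rate = sum(hit_rate(r, e) for r, e in runs) / len(runs)
mrr = sum(reciprocal_rank(r, e) for r, e in runs) / len(runs)
print(mean_hit_rate, mrr)
```

Hit rate only asks whether a relevant node shows up at all, while MRR also rewards ranking it near the top, so the two are usually reported together.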

3.1 RetrieverEvaluator

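This runs evaluation on a single query plus a ground-truth document set, given a retriever. The standard practice is to specify a set of valid metrics with from_metric_names.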

from llama_index.core.evaluation import RetrieverEvaluator

# define retriever somewhere (e.g. from index)
# retriever = index.as_retriever(similarity_top_k=2)
retriever = ...

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

retriever_evaluator.evaluate(
    query="query", expected_ids=["node_id1", "node_id2"]
)

3.2 Building an Evaluation Dataset

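You can manually curate a retrieval evaluation dataset of questions plus node IDs. LlamaIndex also provides a generate_question_context_pairs function that generates a synthetic dataset over an existing text corpus: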

from llama_index.core.evaluation import generate_question_context_pairs

qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)

The returned result is an EmbeddingQAFinetuneDataset object (containing queries, relevant_docs, and corpus).
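
To show how those three fields fit together, the sketch below uses plain dictionaries as a stand-in. The field names come from the description above; the ID-keyed layout, the sample texts, and the iter_eval_pairs helper are assumptions made for illustration.

```python
from typing import Dict, Iterator, List, Tuple

# Toy stand-in for the dataset: every mapping is keyed by an ID.
queries: Dict[str, str] = {
    "q1": "Who wrote the essay?",
    "q2": "What language was the first program written in?",
}
corpus: Dict[str, str] = {
    "n1": "The essay was written by the author in 2021.",
    "n2": "The first program was written in Fortran.",
}
relevant_docs: Dict[str, List[str]] = {"q1": ["n1"], "q2": ["n2"]}


def iter_eval_pairs(
    queries: Dict[str, str], relevant_docs: Dict[str, List[str]]
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (query text, expected node IDs) for each query in the dataset."""
    for query_id, query_text in queries.items():
        yield query_text, relevant_docs.get(query_id, [])


pairs = list(iter_eval_pairs(queries, relevant_docs))
for query_text, expected_ids in pairs:
    print(query_text, "->", expected_ids, [corpus[i] for i in expected_ids])
```

Each pair has the same shape as the query and expected_ids arguments passed to evaluate in the previous section.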

LlamaIndex provides a convenience function for running a RetrieverEvaluator over a dataset in batch mode.

eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

This should run much faster than calling .evaluate on each query separately.
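
After a batch run, the per-query metrics are usually averaged into one number per metric. A minimal sketch, assuming each result can be reduced to a dict of metric name to value (the dicts below are made-up data and average_metrics is a hypothetical helper):

```python
from typing import Dict, List


def average_metrics(metric_dicts: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over all queries; assumes every dict has the same keys."""
    if not metric_dicts:
        return {}
    totals: Dict[str, float] = {}
    for metrics in metric_dicts:
        for name, value in metrics.items():
            totals[name] = totals.get(name, 0.0) + value
    return {name: total / len(metric_dicts) for name, total in totals.items()}


# Hypothetical per-query metric values.
per_query = [
    {"mrr": 1.0, "hit_rate": 1.0},
    {"mrr": 0.5, "hit_rate": 1.0},
    {"mrr": 0.0, "hit_rate": 0.0},
]

averages = average_metrics(per_query)
print(averages)
```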

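4. LlamaDatasets

4.1 Evaluating With LabelledRagDataset

The sections above covered the core abstractions in the evaluation module, which support various kinds of evaluation of LLM-based applications or systems, including RAG systems. Of course, to evaluate a system you need an evaluation method, the system itself, and evaluation datasets. Best practice is to test an LLM application on several distinct datasets from different sources and domains, which helps ensure the overall robustness of the system (that is, how well it works on new, unseen cases).

To this end, LlamaIndex includes the LabelledRagDataset abstraction in the library. Its core purpose is to facilitate evaluating systems on a variety of datasets by making those datasets easy to create, easy to use, and widely available. The dataset consists of examples, where each example carries a query, a reference_answer, and reference_contexts. The main reason to use a LabelledRagDataset is to test the performance of a RAG system by first predicting a response to the given query and then comparing the predicted (or generated) response with the reference_answer.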

from llama_index.core.llama_dataset import (
    LabelledRagDataset,
    CreatedBy,
    CreatedByType,
    LabelledRagDataExample,
)

example1 = LabelledRagDataExample(
    query="This is some user query.",
    query_by=CreatedBy(type=CreatedByType.HUMAN),
    reference_answer="This is a reference answer. Otherwise known as ground-truth answer.",
    reference_contexts=[
        "This is a list",
        "of contexts used to",
        "generate the reference_answer",
    ],
    reference_by=CreatedBy(type=CreatedByType.HUMAN),
)

# a sad dataset consisting of one measly example
rag_dataset = LabelledRagDataset(examples=[example1])

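4.1.1 Building a LabelledRagDataset

You can build a LabelledRagDataset manually by constructing LabelledRagDataExample objects one by one. This is somewhat tedious, however, and while human-annotated datasets are extremely valuable, datasets generated by a strong LLM are also very useful.
For this reason, the llama_dataset module comes with RagDatasetGenerator, which can generate a LabelledRagDataset from a set of source Documents.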

from llama_index.core.llama_dataset.generator import RagDatasetGenerator
from llama_index.llms.openai import OpenAI
import nest_asyncio

nest_asyncio.apply()

documents = ...  # a set of documents loaded by using for example a Reader

llm = OpenAI(model="gpt-4")

dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=llm,
    num_questions_per_chunk=10,  # number of questions to generate per node
)

rag_dataset = dataset_generator.generate_dataset_from_nodes()

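4.1.2 Using a LabelledRagDataset

A LabelledRagDataset is used to evaluate the performance of a RAG system built over the same source Documents. This takes two steps: (1) make predictions over the dataset (that is, generate a response to the query of each individual example), and (2) evaluate the predicted responses by comparing them with the reference answers. In step (2), the contexts retrieved by the RAG system are also evaluated and compared with the reference contexts, which gives an assessment of the retrieval component of the RAG system.
For convenience, LlamaIndex has a LlamaPack called RagEvaluatorPack that streamlines this evaluation process.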

from llama_index.core.llama_pack import download_llama_pack

RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./pack")

rag_evaluator = RagEvaluatorPack(
    query_engine=query_engine,  # built with the same source Documents as the rag_dataset
    rag_dataset=rag_dataset,
)
benchmark_df = await rag_evaluator.run()

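The benchmark_df above contains the mean scores for the evaluation measures introduced earlier: Correctness, Relevancy, Faithfulness, and Context Similarity. The last of these measures the semantic similarity between the reference contexts and the contexts the RAG system retrieved to generate the predicted response.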
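
The two-step flow (predict, then score against references) can be sketched without any LLM. Everything below is a toy stand-in: toy_query_engine returns canned answers, and token_overlap is a deliberately crude substitute for the LLM-based correctness judgement, chosen only so the example runs on its own.

```python
from statistics import mean
from typing import Dict, List


def token_overlap(prediction: str, reference: str) -> float:
    """Fraction of reference tokens that also occur in the prediction (toy metric)."""
    ref_tokens = set(reference.lower().split())
    if not ref_tokens:
        return 0.0
    pred_tokens = set(prediction.lower().split())
    return len(ref_tokens & pred_tokens) / len(ref_tokens)


def toy_query_engine(query: str) -> str:
    """Hypothetical RAG system: answers from a canned lookup table."""
    canned = {"capital of france?": "the capital is paris"}
    return canned.get(query, "i do not know")


# Hypothetical labelled examples (query plus reference answer).
examples: List[Dict[str, str]] = [
    {"query": "capital of france?", "reference_answer": "paris"},
    {"query": "largest planet?", "reference_answer": "jupiter"},
]

# Step 1: predict a response for every example.
predictions = [toy_query_engine(ex["query"]) for ex in examples]

# Step 2: score each prediction against its reference answer, then average.
scores = [
    token_overlap(pred, ex["reference_answer"])
    for pred, ex in zip(predictions, examples)
]
mean_score = mean(scores)
print(scores, mean_score)
```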

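4.1.3 Where to Find LabelledRagDatasets

All LabelledRagDatasets can be found on llamahub. You can browse each one and decide whether to use it to evaluate your RAG pipeline. If you do, there are two convenient ways to download the dataset together with its source documents: llamaindex-cli, or Python code using the download_llama_dataset utility function.

# using cli
llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data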

# using python
from llama_index.core.llama_dataset import download_llama_dataset

# a LabelledRagDataset and a list of source Document's
rag_dataset, documents = download_llama_dataset(
    "PaulGrahamEssayDataset", "./data"
)

4.1.4 Resources

4.2 Evaluating Evaluators with LabelledEvaluatorDataset

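The purpose of llama-datasets is to give builders a quick way to benchmark LLM systems or tasks. In that spirit, LabelledEvaluatorDataset is meant to make evaluating evaluators seamless and effortless. The dataset consists of examples that mainly carry the following attributes: query, answer, ground_truth_answer, reference_score, and reference_feedback, along with some other supplementary attributes. The user flow for evaluating with this dataset is to make predictions over the dataset with the supplied LLM evaluator, and then compute metrics that measure the quality of those evaluations by comparing them against the corresponding references.

The following snippet uses EvaluatorBenchmarkerPack to handle this flow conveniently.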

from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.gemini import Gemini

# download dataset
evaluator_dataset, _ = download_llama_dataset(
    "MiniMtBenchSingleGradingDataset", "./mini_mt_bench_data"
)

# define evaluator
gemini_pro_llm = Gemini(model="models/gemini-pro", temperature=0)
evaluator = CorrectnessEvaluator(llm=gemini_pro_llm)

# download EvaluatorBenchmarkerPack and define the benchmarker
EvaluatorBenchmarkerPack = download_llama_pack(
    "EvaluatorBenchmarkerPack", "./pack"
)
evaluator_benchmarker = EvaluatorBenchmarkerPack(
    evaluator=evaluator,  # the CorrectnessEvaluator defined above
    eval_dataset=evaluator_dataset,
    show_progress=True,
)

# produce the benchmark result
benchmark_df = await evaluator_benchmarker.arun(
    batch_size=5, sleep_time_in_seconds=0.5
)
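
What "comparing against the references" can mean numerically is sketched below. The text does not say which metrics the pack reports, so the three shown here (mean absolute error, exact agreement rate, Pearson correlation) are assumptions picked as common choices, and the scores are made-up data.

```python
import math
from typing import Sequence


def mean_abs_error(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Average absolute gap between evaluator scores and reference scores."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def exact_agreement(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Fraction of examples where the evaluator gave exactly the reference score."""
    return sum(1 for p, r in zip(pred, ref) if p == r) / len(ref)


def pearson(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    mean_p = sum(pred) / len(pred)
    mean_r = sum(ref) / len(ref)
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    if var_p == 0.0 or var_r == 0.0:
        return 0.0
    return cov / math.sqrt(var_p * var_r)


# Hypothetical scores on a 1-5 scale for four examples.
reference_scores = [5.0, 3.0, 4.0, 1.0]
evaluator_scores = [5.0, 2.0, 4.0, 2.0]

mae = mean_abs_error(evaluator_scores, reference_scores)
agreement = exact_agreement(evaluator_scores, reference_scores)
correlation = pearson(evaluator_scores, reference_scores)
print(mae, agreement, round(correlation, 3))
```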

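A related llama-dataset is LabelledPairwiseEvaluatorDataset, which is likewise intended for evaluating evaluators, but this time the evaluator's task is to compare a pair of LLM responses to a given query and determine which is better. The usage flow is exactly the same as for LabelledEvaluatorDataset, the only difference being that the LLM evaluator must be able to perform pairwise evaluation, that is, it should be a PairwiseComparisonEvaluator.

Further reading:

Official resources

LlamaIndex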

https://mztchaoqun.com.cn/posts/D22_LlamaIndex_Evaluating/

Author

mztchaoqun

Published

May 31, 2024
