AI agents are rapidly becoming an integral part of customer workflows across industries by automating complex tasks, enhancing decision-making, and streamlining operations. However, adopting AI agents in production systems requires scalable evaluation pipelines. Robust agent evaluation enables you to measure how well an agent performs certain actions and gain key insights into them, enhancing AI agent safety, control, trust, transparency, and performance optimization.

Amazon Bedrock Agents use the reasoning of foundation models (FMs) available on Amazon Bedrock, along with APIs and data, to break down user requests, gather relevant information, and efficiently complete tasks, freeing teams to focus on high-value work. You can enable generative AI applications to automate multistep tasks by seamlessly connecting with company systems, APIs, and data sources.

Ragas is an open source library for testing and evaluating large language model (LLM) applications across a variety of LLM use cases, including Retrieval Augmented Generation (RAG). The framework enables quantitative measurement of the effectiveness of a RAG implementation. In this post, we use the Ragas library to evaluate the RAG capability of Amazon Bedrock Agents.
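
As an illustration of the kind of measurement Ragas provides, the following is a minimal sketch of a standalone Ragas evaluation over a single record. It assumes a pre-0.2 Ragas API (evaluate() plus metric objects); note that the actual framework wires Ragas to Amazon Bedrock evaluation models rather than the library defaults:

# Minimal sketch of a standalone Ragas evaluation, assuming a pre-0.2 Ragas
# API. Column names vary slightly across Ragas versions, and the default
# judge/embedding models require an OPENAI_API_KEY; the framework instead
# wires Ragas to Amazon Bedrock models.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    answer_similarity,
)

# One record mirroring the RAG trajectory format shown later in this post.
records = {
    "question": ["Was Abraham Lincoln the sixteenth President of the United States?"],
    "answer": ["Yes, Abraham Lincoln was the sixteenth President of the United States."],
    "contexts": [["Abraham Lincoln was the 16th President of the United States."]],
    "ground_truth": ["yes"],
}

result = evaluate(
    Dataset.from_dict(records),
    metrics=[faithfulness, answer_relevancy, context_recall, answer_similarity],
)
print(result)  # per-metric scores, e.g. {'faithfulness': 1.0, ...}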

LLM-as-a-judge is an evaluation approach that uses an LLM to assess the quality of AI-generated outputs. This method employs an LLM as an impartial evaluator that analyzes and scores the outputs. In this post, we employ the LLM-as-a-judge technique to evaluate the text-to-SQL and chain-of-thought capabilities of Amazon Bedrock Agents.

Langfuse is an open source LLM engineering platform that provides features such as tracing, evaluations, prompt management, and metrics to debug and improve your LLM application.
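
To make that integration concrete, here is a minimal sketch of recording a trace and attaching an evaluation score with the Langfuse Python SDK, assuming the v2-style low-level client; the keys, host, and values are placeholders:

# Minimal sketch: record an agent invocation as a Langfuse trace and attach
# an evaluation score to it. Assumes the Langfuse Python SDK v2-style
# low-level client; keys, host, and values are placeholders.
from langfuse import Langfuse

langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://cloud.langfuse.com",  # or your self-hosted Langfuse URL
)

trace = langfuse.trace(
    name="bedrock-agent-evaluation",
    input="Was Abraham Lincoln the sixteenth President of the United States?",
    output="Yes, he was the sixteenth President.",
    metadata={"agent_type": "RAG", "question_id": 0},
)

# Attach an evaluation result (for example, a Ragas faithfulness score).
langfuse.score(trace_id=trace.id, name="faithfulness", value=1.0)

langfuse.flush()  # send buffered events before the evaluation job exits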

In the post Accelerate analysis and discovery of cancer biomarkers with Amazon Bedrock Agents, we showcased research agents for cancer biomarker discovery for pharmaceutical companies. In this post, we extend that work and showcase Open Source Bedrock Agent Evaluation with the following capabilities:

  • Evaluating Amazon Bedrock Agents on their capabilities (RAG, text-to-SQL, custom tool use) and overall chain-of-thought
  • Comprehensive evaluation results and trace data sent to Langfuse with built-in visual dashboards
  • Trace parsing and evaluations for various Amazon Bedrock Agent configuration options

First, we run evaluations on a variety of Amazon Bedrock agents. These include a sample RAG agent, a sample text-to-SQL agent, and pharmaceutical research agents that use multi-agent collaboration for cancer biomarker discovery. Then, for each agent, we show how to navigate the Langfuse dashboard to view traces and evaluation results.

Technical challenges

Today, AI agent developers generally face the following technical challenges:

  • End-to-end agent evaluation – Although Amazon Bedrock provides built-in evaluation capabilities for LLM models and RAG retrieval, it lacks metrics designed specifically for Amazon Bedrock Agents. There is a need to evaluate the overall agent goal as well as individual agent trace steps for specific tasks and tool invocations. There is also a need to support both single-agent and multi-agent setups, and both single-turn and multi-turn datasets.
  • Challenging experiment management – Amazon Bedrock Agents offer numerous configuration options, including LLM model selection, agent instructions, tool configurations, and multi-agent setups. However, rapid experimentation with these parameters is technically challenging because there is no systematic way to track, compare, and measure the impact of configuration changes across different agent versions. This makes it difficult to effectively optimize agent performance through iterative testing.

Solution overview

The following diagram illustrates how Open Source Bedrock Agent Evaluation works at a high level. The framework runs an evaluation job that invokes your agent in Amazon Bedrock and evaluates its response.

The workflow consists of the following steps:

  • The user specifies the agent ID, agent alias, evaluation model, and a dataset containing question and ground truth pairs.
  • The user runs the evaluation job, which invokes the specified Amazon Bedrock agent (a minimal invocation sketch follows this list).
  • The retrieved agent invocation traces are run through the custom parsing logic in the framework.
  • The framework conducts an evaluation based on the agent invocation results and the question type:
    • Chain-of-thought – LLM-as-a-judge with Amazon Bedrock LLM calls (conducted as part of every evaluation run, for the different question types)
    • RAG – Ragas evaluation library
    • Text-to-SQL – LLM-as-a-judge with Amazon Bedrock LLM calls
  • Evaluation results and parsed traces are gathered and sent to Langfuse for evaluation insights.
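
The following is a minimal sketch of the invocation and trace-collection steps above: calling an Amazon Bedrock agent with tracing enabled and gathering the raw trace events that the parsing logic then processes. The agent ID, alias ID, and Region are placeholders:

# Minimal sketch of invoking an Amazon Bedrock agent with tracing enabled
# and collecting the completion text plus raw trace events for parsing.
# Agent ID, alias ID, and Region are placeholders.
import uuid
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.invoke_agent(
    agentId="AGENT_ID",
    agentAliasId="AGENT_ALIAS_ID",
    sessionId=str(uuid.uuid4()),
    inputText="Was Abraham Lincoln the sixteenth President of the United States?",
    enableTrace=True,  # emit trace events alongside the completion stream
)

completion, trace_events = "", []
for event in response["completion"]:  # event stream of chunks and traces
    if "chunk" in event:
        completion += event["chunk"]["bytes"].decode("utf-8")
    elif "trace" in event:
        trace_events.append(event["trace"])

print(completion)
print(f"Collected {len(trace_events)} trace events for parsing")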

Prerequisites

To deploy the sample RAG and text-to-SQL agents and follow along with evaluating them using Open Source Bedrock Agent Evaluation, follow the instructions in Deploying Sample Agents for Evaluation.

To bring your own agent to evaluate with this framework, refer to the following README and follow the detailed instructions to deploy the Open Source Bedrock Agent Evaluation framework.

Overview of evaluation metrics and input data

First, we create sample Amazon Bedrock agents to demonstrate the capabilities of Open Source Bedrock Agent Evaluation. The text-to-SQL agent uses the BirdSQL Mini-Dev dataset, and the RAG agent uses the Hugging Face rag-mini-wikipedia dataset.

Evaluation metrics

The Open Source Bedrock Agent Evaluation framework conducts evaluations on two broad types of metrics:

  • Agent goal – Chain-of-thought (run on every question)
  • Task accuracy – RAG, text-to-SQL (run only when the specific tool is used to answer the question)

Agent goal metrics measure how well an agent identifies and achieves the goals of the user. There are two main types: reference-based evaluation and evaluation without reference. Examples can be found in Agent Goal Accuracy as defined by Ragas:

  • Reference-based evaluation – The user provides a reference that will be used as the ideal outcome. The metric is computed by comparing the reference with the goal achieved by the end of the workflow.
  • Evaluation without reference – The metric evaluates the performance of the LLM in identifying and achieving the goals of the user without reference.

We will showcase evaluation without reference using chain-of-thought evaluation. We conduct evaluations by comparing the agent’s reasoning and the agent’s instruction. For this evaluation, we use some metrics from the evaluator prompts for Amazon Bedrock LLM-as-a-judge. In this framework, the chain-of-thought evaluations are run on every question that the agent is evaluated against.
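
As an illustration of this pattern, the following is a minimal sketch of an LLM-as-a-judge call through the Amazon Bedrock Converse API. The judge prompt, model ID, and 0 to 1 scale are illustrative and are not the framework's exact evaluator prompts:

# Minimal sketch of the LLM-as-a-judge pattern for chain-of-thought
# evaluation via the Amazon Bedrock Converse API. The judge prompt, model ID,
# and scoring scale are illustrative, not the framework's evaluator prompts.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def judge_chain_of_thought(agent_instruction: str, agent_reasoning: str) -> dict:
    prompt = (
        "You are an impartial evaluator. Given the agent's instruction and its "
        "reasoning trace, rate instruction following on a scale from 0 to 1 and "
        "explain briefly. Respond as JSON with keys 'score' and 'explanation'.\n\n"
        f"Instruction:\n{agent_instruction}\n\nReasoning:\n{agent_reasoning}"
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0},
    )
    # Assumes the judge model returns valid JSON; a production parser would
    # validate and retry on malformed output.
    return json.loads(response["output"]["message"]["content"][0]["text"])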

Task accuracy metrics measure how well an agent calls the required tools to complete a given task. For the two task accuracy metrics, RAG and text-to-SQL, evaluations are conducted based on comparing the actual agent answer against the ground truth dataset that must be provided in the input dataset. The task accuracy metrics are only evaluated when the corresponding tool is used to answer the question.

The following is a breakdown of the key metrics used in each evaluation type included in the framework:

  • RAG:
    • Faithfulness – How factually consistent a response is with the retrieved context
    • Answer relevancy – How directly and appropriately the original question is addressed
    • Context recall – How many of the relevant pieces of information were successfully retrieved
    • Semantic similarity – The assessment of the semantic resemblance between the generated answer and the ground truth
  • Text-to-SQL:
    • Answer correctness – How closely the agent's generated answer matches the ground truth answer
    • SQL semantic equivalence – Whether the generated SQL query is semantically equivalent to the ground truth SQL query
  • Chain-of-thought:
    • Helpfulness – How well the agent satisfies explicit and implicit expectations
    • Faithfulness – How well the agent sticks to available information and context
    • Instruction following – How well the agent respects all explicit directions

User-agent trajectories

The input dataset is in the form of trajectories, where each trajectory consists of one or more questions to be answered by the agent. The trajectories are meant to simulate how a user might interact with the agent. Each trajectory consists of a unique question_id, question_type, question, and ground_truth information. The following are examples of actual trajectories used to evaluate each type of agent in this post.

For simpler agent setups like the RAG and text-to-SQL sample agents, we created trajectories consisting of a single question, as shown in the following examples.

The following is an example of a RAG sample agent trajectory:

{
	"Trajectory0": [
		{
			"question_id": 0,
			"question_type": "RAG",
			"question": "Was Abraham Lincoln the sixteenth President of the United States?",
			"ground_truth": "yes"
		}
	]
}

The following is an example of a text-to-SQL sample agent trajectory:

{
	"Trajectory1": [
		{
			"question_id": 1,
			"question": "What is the highest eligible free rate for K-12 students in the schools in Alameda County?",
			"question_type": "TEXT2SQL",
			"ground_truth": {
				"ground_truth_sql_query": "SELECT `Free Meal Count (K-12)` / `Enrollment (K-12)` 
				FROM frpm WHERE `County Name` = 'Alameda' ORDER BY (CAST
				(`Free Meal Count (K-12)` AS REAL) / `Enrollment (K-12)`) DESC LIMIT 1",
				"ground_truth_sql_context": "[{'table_name': 'frpm', 'columns': 
				[('cdscode', 'varchar'), ('academic year', 'varchar'), ...",
				"ground_truth_query_result": "1.0",
				"ground_truth_answer": "The highest eligible free rate for K-12 students 
				in schools in Alameda County is 1.0."
			}
		}
	]
}
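
As a usage illustration, a trajectory file like the examples above could be loaded and walked question by question to drive an evaluation run. The file path and the evaluate_question helper below are hypothetical stand-ins for the framework's own entry point:

# Illustrative only: load a trajectory file shaped like the examples above
# and walk it question by question. The path and evaluate_question() helper
# are hypothetical stand-ins for the framework's evaluation entry point.
import json

with open("data_files/trajectories.json") as f:  # hypothetical path
    trajectories = json.load(f)

for trajectory_name, questions in trajectories.items():
    for q in questions:
        print(f"{trajectory_name} / question {q['question_id']} ({q['question_type']})")
        # evaluate_question(q)  # hypothetical: invoke agent, parse trace, score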

Pharmaceutical research agent use case example

In this section, we demonstrate how you can use the Open Source Bedrock Agent Evaluation framework to evaluate the pharmaceutical research agents discussed in the post Accelerate analysis and discovery of cancer biomarkers with Amazon Bedrock Agents. It showcases a variety of specialized agents, including a biomarker database analyst, statistician, clinical evidence researcher, and medical imaging expert in collaboration with a supervisor agent.

The pharmaceutical research agent was built using the multi-agent collaboration feature of Amazon Bedrock. The following diagram shows the multi-agent setup that was evaluated using this framework.

As shown in the diagram, the RAG evaluations are conducted on the clinical evidence researcher subagent. Similarly, the text-to-SQL evaluations are run on the biomarker database analyst subagent. The chain-of-thought evaluation assesses the final answer of the supervisor agent to check whether it correctly orchestrated the subagents and answered the user's question.

Research agent trajectories

For a more complex setup like the pharmaceutical research agent, we used a set of industry-relevant, pre-generated test questions. By creating groups of questions based on topic, regardless of which subagents might be invoked to answer them, we created trajectories that include multiple questions spanning multiple types of tool use. Because the relevant questions had already been generated, integrating with the evaluation framework only required formatting the ground truth data properly into trajectories.

We will evaluate this agent against a trajectory containing a RAG question and a text-to-SQL question:

{
	"Trajectory1": [
		{
			"question_id": 3,
			"question_type": "RAG",
			"question": "According to the knowledge base, how did the EGF pathway associate 
			with CT imaging features?",
			"ground_truth": "The EGF pathway was significantly correlated with the presence of ground-glass 
			opacity and irregular nodules or nodules with poorly defined margins."
		},
		{
			"question_id": 4,
			"question_type": "TEXT2SQL",
			"question": "According to the database, What percentage of patients have EGFR mutations?",
			"ground_truth": {
				"ground_truth_sql_query": "SELECT (COUNT(CASE WHEN 
				EGFR_mutation_status = 'Mutant' THEN 1 END) * 100.0 / COUNT(*)) AS percentage 
				FROM clinical_genomic;",
				"ground_truth_sql_context": "Table clinical_genomic: - Case_ID: VARCHAR(50) - 
				EGFR_mutation_status: VARCHAR(50)",
				"ground_truth_query_result": "14.285714",
				"ground_truth_answer": "According to the query results, approximately 14.29% of 
				patients in the clinical_genomic table have EGFR mutations."
			}
		}
	]
}

Chain-of-thought evaluations are conducted for every question, regardless of tool use. This is illustrated through a set of screenshots of agent traces and evaluations on the Langfuse dashboard.

After running the agent against the trajectory, the results are sent to Langfuse to view the metrics. The following screenshot shows the trace of the RAG question (question ID 3) evaluation on Langfuse.

Langfuse RAG Trace

The screenshot shows the following information:

  • Trace information (input and output of the agent invocation)
  • Trace steps (agent generation and the corresponding substeps)
  • Trace metadata (input and output tokens, cost, model, agent type)
  • Evaluation metrics (RAG and chain-of-thought metrics)

The following screenshot shows the trace of the text-to-SQL question (question ID 4) evaluation on Langfuse, which evaluated the biomarker database analyst agent that generates SQL queries to run against an Amazon Redshift database containing biomarker information.

Langfuse text-to-SQL Trace

The screenshot shows the following information:

  • Trace information (input and output of the agent invocation)
  • Trace steps (agent generation and the corresponding substeps)
  • Trace metadata (input and output tokens, cost, model, agent type)
  • Evaluation metrics (text-to-SQL and chain-of-thought metrics)

Chain-of-thought evaluations are part of the evaluation traces for both questions. For both traces, LLM-as-a-judge is used to generate scores and explanations around the Amazon Bedrock agent's reasoning for the given question.

Overall, we ran 56 questions, organized into 21 trajectories, against the agent. The traces, model costs, and scores are shown in the following screenshot.

Langfuse Dashboard

The following table contains the average evaluation scores across 56 evaluation traces.

Metric Category | Metric Type | Metric Name                      | Number of Traces | Metric Avg. Value
Agent Goal      | COT         | Helpfulness                      | 50               | 0.77
Agent Goal      | COT         | Faithfulness                     | 50               | 0.87
Agent Goal      | COT         | Instruction following            | 50               | 0.69
Agent Goal      | COT         | Overall (average of all metrics) | 50               | 0.77
Task Accuracy   | TEXT2SQL    | Answer correctness               | 26               | 0.83
Task Accuracy   | TEXT2SQL    | SQL semantic equivalence         | 26               | 0.81
Task Accuracy   | RAG         | Semantic similarity              | 20               | 0.66
Task Accuracy   | RAG         | Faithfulness                     | 20               | 0.50
Task Accuracy   | RAG         | Answer relevancy                 | 20               | 0.68
Task Accuracy   | RAG         | Context recall                   | 20               | 0.53

Security considerations

Consider the following security measures:

  • Enable Amazon Bedrock agent logging – To follow security best practices for using Amazon Bedrock Agents, enable Amazon Bedrock model invocation logging to securely capture prompts and responses in your account.
  • Check compliance requirements – Before implementing Amazon Bedrock Agents in your production environment, make sure that the Amazon Bedrock compliance certifications and standards meet your regulatory requirements. Refer to Compliance validation for Amazon Bedrock for more information and resources on meeting compliance requirements.

Clean up

If you deployed the sample agents, run the following notebooks to delete the resources that were created.

If you chose the self-hosted Langfuse option, follow these steps to clean up your AWS self-hosted Langfuse setup.

Conclusion

In this post, we introduced the Open Source Bedrock Agent Evaluation framework, a Langfuse-integrated solution that streamlines the agent development process. The framework comes equipped with built-in evaluation logic for RAG, text-to-SQL, and chain-of-thought reasoning, and integrates with Langfuse for viewing evaluation metrics. With Open Source Bedrock Agent Evaluation, developers can quickly assess their agents and rapidly experiment with different configurations, accelerating the development cycle and improving agent performance.

We demonstrated how this evaluation framework can be integrated with pharmaceutical research agents. We used it to evaluate agent performance against biomarker questions and sent trace information to Langfuse to view evaluation metrics across question types.

The Open Source Bedrock Agent Evaluation framework enables you to accelerate the process of building generative AI applications with Amazon Bedrock Agents. To self-host Langfuse in your AWS account, see Hosting Langfuse on Amazon ECS with Fargate using CDK Python. To explore how you can streamline your Amazon Bedrock Agents evaluation process, get started with Open Source Bedrock Agent Evaluation.

Refer to Toward effective GenAI multi-agent collaboration: design and evaluation for enterprise applications from the Amazon Bedrock team to learn more about multi-agent collaboration and end-to-end agent evaluation.
