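Important
Some of the features described in this article might only be available in preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Azure AI Studio allows you to evaluate single-turn or complex, multi-turn conversations where you ground the generative AI model in your specific data (also known as Retrieval Augmented Generation or RAG). You can also evaluate general single-turn question answering scenarios, where no context is used to ground your generative AI model (non-RAG). Currently, we support built-in metrics for the following task types:
Question answering (single turn)
In this setup, users pose individual questions or prompts, and a generative AI model is employed to instantly generate responses.
The test set format will follow this data format: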
{"question":"Which tent is the most waterproof?","context":"From our product list, the Alpine Explorer tent is the most waterproof. The Adventure Dining Table has higher weight.","answer":"The Alpine Explorer Tent is the most waterproof.","ground_truth":"The Alpine Explorer Tent has the highest rainfly waterproof rating at 3000m"}
Note
The "context" and "ground_truth" fields are optional, and the supported metrics depend on the fields you provide.
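As a minimal sketch of how a test set line could be checked before evaluation, the snippet below parses one JSON line and reports which optional fields it carries. The function name `check_qa_record` is hypothetical, and treating "question" and "answer" as required is an assumption inferred from the note above, not a documented rule.

```python
import json

# Assumption: "question" and "answer" are required; the note above only
# states that "context" and "ground_truth" are optional.
REQUIRED_FIELDS = {"question", "answer"}
OPTIONAL_FIELDS = {"context", "ground_truth"}


def check_qa_record(line: str) -> dict:
    """Parse one JSON line and report which optional fields it provides."""
    record = json.loads(line)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return {name: name in record for name in sorted(OPTIONAL_FIELDS)}


line = json.dumps({
    "question": "Which tent is the most waterproof?",
    "answer": "The Alpine Explorer Tent is the most waterproof.",
})
print(check_qa_record(line))  # {'context': False, 'ground_truth': False}
```

A record like this one, with neither optional field, would only qualify for the metrics that need a question and a generated answer.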
Conversation (single turn and multi turn)
In this context, users engage in conversational interactions, either through a series of turns or in a single exchange. The generative AI model, equipped with retrieval mechanisms, generates responses and can access and incorporate information from external sources, such as documents. The Retrieval Augmented Generation (RAG) model enhances the quality and relevance of responses by using external documents and knowledge.
The test set format will follow this data format:
{"messages":[{"role":"user","content":"How can I check the status of my online order?"},{"content":"Hi Sarah Lee! To check the status of your online order for previous purchases such as the TrailMaster X4 Tent or the CozyNights Sleeping Bag, please refer to your email for order confirmation and tracking information. If you need further assistance, feel free to contact our customer support at support@contosotrek.com or give us a call at 1-800-555-1234.","role":"assistant","context":{"citations":[{"id":"cHJvZHVjdF9pbmZvXzYubWQz","title":"Information about product item_number: 6","content":"# Information about product item_number: 6\n\nIt's essential to check local regulations before using the EcoFire Camping Stove, as some areas may have restrictions on open fires or require a specific type of stove.\n\n30) How do I clean and maintain the EcoFire Camping Stove?\n To clean the EcoFire Camping Stove, allow it to cool completely, then wipe away any ash or debris with a brush or cloth. Store the stove in a dry place when not in use."}]}}]}
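To make the shape of the conversation format concrete, here is a small sketch that walks the "messages" array and counts the citations attached to each turn. The helper name `summarize_conversation` and the shortened sample record are illustrative assumptions; only the key names ("messages", "role", "content", "context", "citations") come from the example above.

```python
import json


def summarize_conversation(record: dict) -> list[tuple[str, int]]:
    """Return (role, number_of_citations) for each message, in order."""
    summary = []
    for message in record["messages"]:
        # Only assistant turns in the example carry a "context" object.
        citations = message.get("context", {}).get("citations", [])
        summary.append((message["role"], len(citations)))
    return summary


# Hypothetical, shortened record in the same layout as the example above.
sample = json.loads("""
{"messages": [
  {"role": "user", "content": "How can I check the status of my online order?"},
  {"role": "assistant", "content": "Please refer to your order confirmation email.",
   "context": {"citations": [{"id": "doc-1", "title": "Order FAQ", "content": "..."}]}}
]}
""")
print(summarize_conversation(sample))  # [('user', 0), ('assistant', 1)]
```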
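Supported metrics
As described in the methods for evaluating large language models, there are manual and automated approaches to measurement. Automated measurement is useful for measuring at scale with increased coverage to provide more comprehensive results. It's also helpful for ongoing measurement to monitor for any regressions as the system, usage, and mitigations evolve.
We support two main methods for automated measurement of generative AI applications:
- Traditional machine learning metrics
- AI-assisted metrics
AI-assisted metrics use language models like GPT-4 to assess AI-generated output, especially in situations where expected answers are unavailable because there's no defined ground truth. Traditional machine learning metrics, like F1 score, gauge the precision and recall between AI-generated responses and the anticipated answers.
Our AI-assisted metrics assess the safety and generation quality of generative AI applications. These metrics fall into two distinct categories: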
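- Risk and safety metrics:
  - These metrics focus on identifying potential content and security risks and ensuring the safety of the generated content.
  - They include:
    - Hateful and unfair content defect rate
    - Sexual content defect rate
    - Violent content defect rate
    - Self-harm-related content defect rate
    - Jailbreak defect rate
- Generation quality metrics:
  - These metrics evaluate the overall quality and coherence of the generated content.
  - They include:
    - Coherence
    - Fluency
    - Groundedness
    - Relevance
    - Retrieval score
    - Similarity
We support the following AI-assisted metrics for the above task types: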
Task type | Question and Generated Answers Only (No context or ground truth needed) | Question and Generated Answers + Context | Question and Generated Answers + Context + Ground Truth |
---|---|---|---|
Question Answering | Risk and safety metrics (all AI-assisted): hateful and unfair content defect rate, sexual content defect rate, violent content defect rate, self-harm-related content defect rate, and jailbreak defect rate. Generation quality metrics (all AI-assisted): Coherence, Fluency | Previous column metrics + Generation quality metrics (all AI-assisted): Groundedness, Relevance | Previous column metrics + Generation quality metrics: Similarity (AI-assisted), F1 Score (traditional ML metric) |
Conversation | Risk and safety metrics (all AI-assisted): hateful and unfair content defect rate, sexual content defect rate, violent content defect rate, self-harm-related content defect rate, and jailbreak defect rate. Generation quality metrics (all AI-assisted): Coherence, Fluency | Previous column metrics + Generation quality metrics (all AI-assisted): Groundedness, Retrieval Score | N/A |
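The table can also be read as a lookup from task type and available fields to metric names. The sketch below encodes only the three columns shown; the function name `supported_metrics`, the task identifiers, and the snake_case metric labels are hypothetical and are not identifiers used by Azure AI Studio.

```python
SAFETY_METRICS = [
    "hateful and unfair content defect rate",
    "sexual content defect rate",
    "violent content defect rate",
    "self-harm-related content defect rate",
    "jailbreak defect rate",
]


def supported_metrics(task: str, has_context: bool, has_ground_truth: bool) -> list[str]:
    """Return the built-in metrics implied by the table for the given inputs."""
    if task not in ("question_answering", "conversation"):
        raise ValueError(f"unknown task type: {task}")
    # First column: question and generated answer only.
    metrics = SAFETY_METRICS + ["coherence", "fluency"]
    if has_context:
        # Second column: adds the context-dependent metrics.
        if task == "question_answering":
            metrics += ["groundedness", "relevance"]
        else:
            metrics += ["groundedness", "retrieval_score"]
        # Third column: only question answering lists ground-truth metrics.
        if has_ground_truth and task == "question_answering":
            metrics += ["similarity", "f1_score"]
    return metrics


print(supported_metrics("conversation", has_context=True, has_ground_truth=False)[-2:])
# ['groundedness', 'retrieval_score']
```

The table doesn't describe the case of ground truth without context, so this sketch leaves that combination at the first-column metrics.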
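Note
While we provide a comprehensive set of built-in metrics that make it easy and efficient to evaluate the quality and safety of your generative AI application, it's best practice to adapt and customize them to your specific task types. Furthermore, we empower you to introduce entirely new metrics, so you can measure your applications from fresh angles and ensure alignment with your unique objectives.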
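Risk and safety metrics
The risk and safety metrics draw on insights gained from our previous large language model projects such as GitHub Copilot and Bing. This ensures a comprehensive approach to evaluating generated responses for risk and safety severity scores. These metrics are generated through our safety evaluation service, which employs a set of LLMs. Each model is tasked with assessing specific risks that could be present in the response (for example, sexual content, violent content, and so on). These models are provided with risk definitions and severity scales, and they annotate generated conversations accordingly. Currently, we calculate a "defect rate" for the risk and safety metrics below. For each of these metrics, the service measures whether these types of content were detected and at what severity level. Each of the four types has four severity levels (very low, low, medium, high). Users specify a threshold of tolerance, and the defect rates produced by our service correspond to the number of instances that were generated at and above each threshold level.
Types of content:
- Hateful and unfair content
- Sexual content
- Violent content
- Self-harm-related content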
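The threshold logic described above can be sketched as follows. This is an illustration only, not the service's implementation: the function name `defect_rate` is hypothetical, and dividing the count by the total number of instances to obtain a rate is an assumption.

```python
# Severity levels in increasing order, as listed above.
SEVERITY_ORDER = ["very low", "low", "medium", "high"]


def defect_rate(severities: list[str], threshold: str) -> float:
    """Fraction of instances annotated at or above the tolerance threshold."""
    if not severities:
        raise ValueError("no annotated instances")
    cutoff = SEVERITY_ORDER.index(threshold)
    defects = sum(1 for s in severities if SEVERITY_ORDER.index(s) >= cutoff)
    # Assumption: the rate is the defect count normalized by the dataset size.
    return defects / len(severities)


annotations = ["very low", "low", "medium", "high", "very low"]
print(defect_rate(annotations, "medium"))  # 2 of 5 instances: 0.4
```

Lowering the threshold to "low" in this toy example would count three of the five instances as defects.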
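In addition to the above types of content, we also support "jailbreak defect rate" in a comparative view across evaluations, a metric that measures the prevalence of jailbreaks in model responses. A jailbreak occurs when a model response bypasses the restrictions placed on it. A jailbreak also occurs when an LLM deviates from the intended task or topic.
You can measure these risk and safety metrics on your own data or on a test dataset. You can then evaluate on this simulated test dataset to output an annotated test dataset with content risk severity levels (very low, low, medium, or high) and view your results in Azure AI, which provides an overall defect rate across the whole test dataset and an instance view of each content risk label and its reasoning.
Unlike the other metrics in the table, jailbreak vulnerability can't be reliably measured with annotation by an LLM. However, jailbreak vulnerability can be measured by comparing two different automated datasets: (1) a content risk dataset and (2) a content risk dataset with jailbreak injections in the first turn. The user then evaluates jailbreak vulnerability by comparing the two datasets' content risk defect rates.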
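A minimal sketch of that comparison, assuming each dataset has already been reduced to per-instance defect flags. The source doesn't say how the two rates are compared, so taking their difference here is an assumption, and all names and values are hypothetical.

```python
def rate(defect_flags: list[bool]) -> float:
    """Share of instances flagged as defects."""
    if not defect_flags:
        raise ValueError("empty dataset")
    return sum(defect_flags) / len(defect_flags)


# Hypothetical annotations: True marks an instance at or above the threshold.
baseline_flags = [False, False, True, False]   # content risk dataset
jailbreak_flags = [True, False, True, False]   # same prompts with jailbreak injections

# Assumption: a higher rate under injection suggests jailbreak vulnerability.
delta = rate(jailbreak_flags) - rate(baseline_flags)
print(delta)  # 0.5 - 0.25 = 0.25
```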
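Note
AI-assisted risk and safety metrics are hosted by the Azure AI Studio safety evaluations back-end service and are only available in the following regions: East US 2, France Central, UK South, Sweden Central.
The available regions have the following capacity: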
Region | TPM |
---|---|
Sweden Central | 450k |
France Central | 380k |
UK South | 280k |
East US 2 | 80k |
Hateful and unfair content definition and severity scale
Warning
The content risk definitions and severity scales contain descriptions that may be disturbing to some users.
Sexual content definition and severity scale
Warning
The content risk definitions and severity scales contain descriptions that may be disturbing to some users.
Violent content definition and severity scale
Warning
The content risk definitions and severity scales contain descriptions that may be disturbing to some users.
Self-harm-related content definition and severity scale
Warning
The content risk definitions and severity scales contain descriptions that may be disturbing to some users.
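Generation quality metrics
Generation quality metrics are used to assess the overall quality of the content produced by generative AI applications. Here's a breakdown of what these metrics entail:
AI-assisted: Groundedness
For groundedness, we provide two versions:
- Groundedness Detection, which uses the Azure AI Content Safety Service (AACS) through integration into the Azure AI Studio safety evaluations. No deployment is required from the user, because a back-end service provides the models that output a score and reasoning. Currently supported in the following regions: East US 2 and Sweden Central.
- Prompt-only-based groundedness, which uses your own models to output only a score. Currently supported in all regions.
AACS-based groundedness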
Score characteristics | Score details |
---|---|
Score range | 1-5 where 1 is ungrounded and 5 is grounded |
What is this metric? | Measures how well the model's generated answers align with information from the source data (for example, retrieved documents in RAG Question and Answering or documents for summarization) and outputs reasonings for which specific generated sentences are ungrounded. |
How does it work? | Groundedness Detection leverages an Azure AI Content Safety Service custom language model fine-tuned to a natural language processing task called Natural Language Inference (NLI), which evaluates claims as being entailed or not entailed by a source document. |
When to use it? | Use the groundedness metric when you need to verify that AI-generated responses align with and are validated by the provided context. It's essential for applications where factual correctness and contextual accuracy are key, like information retrieval, question-answering, and content summarization. This metric ensures that the AI-generated answers are well-supported by the context. |
What does it need as input? | Question, Context, Generated Answer |
Prompt-only-based groundedness
Score characteristics | Score details |
---|---|
Score range | 1-5 where 1 is ungrounded and 5 is grounded |
What is this metric? | Measures how well the model's generated answers align with information from the source data (user-defined context). |
How does it work? | The groundedness measure assesses the correspondence between claims in an AI-generated answer and the source context, making sure that these claims are substantiated by the context. Even if the responses from LLM are factually correct, they'll be considered ungrounded if they can't be verified against the provided sources (such as your input source or your database). |
When to use it? | Use the groundedness metric when you need to verify that AI-generated responses align with and are validated by the provided context. It's essential for applications where factual correctness and contextual accuracy are key, like information retrieval, question-answering, and content summarization. This metric ensures that the AI-generated answers are well-supported by the context. |
What does it need as input? | Question, Context, Generated Answer |
Built-in prompt used by the Large Language Model judge to score this metric:
You will be presented with a CONTEXT and an ANSWER about that CONTEXT. You need to decide whether the ANSWER is entailed
by the CONTEXT by choosing one of the following rating:
1. 5: The ANSWER follows logically from the information contained in the CONTEXT.
2. 1: The ANSWER is logically false from the information contained in the CONTEXT.
3. an integer score between 1 and 5 and if such integer score does not exist,
use 1: It is not possible to determine whether the ANSWER is true or false without further information.
Read the passage of information thoroughly and select the correct answer from the three answer labels.
Read the CONTEXT thoroughly to ensure you know what the CONTEXT entails.
Note the ANSWER is generated by a computer system, it can contain certain symbols, which should not
be a negative factor in the evaluation.
AI-assisted: Relevance
Score characteristics | Score details |
---|---|
Score range | Integer [1-5]: where 1 is bad and 5 is good |
What is this metric? | Measures the extent to which the model's generated responses are pertinent and directly related to the given questions. |
How does it work? | The relevance measure assesses the ability of answers to capture the key points of the context. High relevance scores signify the AI system's understanding of the input and its capability to produce coherent and contextually appropriate outputs. Conversely, low relevance scores indicate that generated responses might be off-topic, lacking in context, or insufficient in addressing the user's intended queries. |
When to use it? | Use the relevance metric when evaluating the AI system's performance in understanding the input and generating contextually appropriate responses. |
What does it need as input? | Question, Context, Generated Answer |
Built-in prompt used by the Large Language Model judge to score this metric (For question answering data format):
Relevance measures how well the answer addresses the main aspects of the question, based on the context. Consider whether
all and only the important aspects are contained in the answer when evaluating relevance. Given the context and question,
score the relevance of the answer between one to five stars using the following rating scale:
One star: the answer completely lacks relevance
Two stars: the answer mostly lacks relevance
Three stars: the answer is partially relevant
Four stars: the answer is mostly relevant
Five stars: the answer has perfect relevance
This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.
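Because the prompt asks the judge for an integer between 1 and 5, a caller typically has to pull that integer out of free-form text. The sketch below shows one way to do so; `parse_rating` is a hypothetical helper, not part of any Azure SDK, and taking the first standalone digit from 1 to 5 is an assumption about how the judge formats its reply.

```python
import re


def parse_rating(judge_output: str) -> int:
    """Extract the first standalone integer from 1 to 5 in the judge's reply."""
    match = re.search(r"\b([1-5])\b", judge_output)
    if match is None:
        raise ValueError(f"no rating between 1 and 5 found in: {judge_output!r}")
    return int(match.group(1))


print(parse_rating("4"))                # 4
print(parse_rating("Rating: 5 stars"))  # 5
```

Raising an error on unparseable replies, rather than defaulting to a score, keeps malformed judge output from silently skewing an average.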
Built-in prompt used by the Large Language Model judge to score this metric (For conversation data format) (without Ground Truth available):
You will be provided a question, a conversation history, fetched documents related to the question and a response to the
question in the {DOMAIN} domain. Your task is to evaluate the quality of the provided response by following the steps below:
- Understand the context of the question based on the conversation history.
- Generate a reference answer that is only based on the conversation history, question, and fetched documents. Don't
generate the reference answer based on your own knowledge.
- You need to rate the provided response according to the reference answer if it's available on a scale of 1 (poor) to
5 (excellent), based on the below criteria:
5 - Ideal: The provided response includes all information necessary to answer the question based on the reference answer
and conversation history. Please be strict about giving a 5 score.
4 - Mostly Relevant: The provided response is mostly relevant, although it might be a little too narrow or too broad
based on the reference answer and conversation history.
3 - Somewhat Relevant: The provided response might be partly helpful but might be hard to read or contain other
irrelevant content based on the reference answer and conversation history.
2 - Barely Relevant: The provided response is barely relevant, perhaps shown as a last resort based on the reference
answer and conversation history.
1 - Completely Irrelevant: The provided response should never be used for answering this question based on the
reference answer and conversation history.
- You need to rate the provided response to be 5, if the reference answer can not be generated since no relevant
documents were retrieved.
- You need to first provide a scoring reason for the evaluation according to the above criteria, and then provide
a score for the quality of the provided response.
- You need to translate the provided response into English if it's in another language.
- Your final response must include both the reference answer and the evaluation result. The evaluation result
should be written in English.
Built-in prompt used by the Large Language Model judge to score this metric (For conversation data format) (with Ground Truth available):
Your task is to score the relevance between a generated answer and the question based on the ground truth answer in
the range between 1 and 5, and please also provide the scoring reason.
Your primary focus should be on determining whether the generated answer contains sufficient information to address
the given question according to the ground truth answer.
If the generated answer fails to provide enough relevant information or contains excessive extraneous information,
then you should reduce the score accordingly.
If the generated answer contradicts the ground truth answer, it will receive a low score of 1-2.
For example, for question "Is the sky blue?", the ground truth answer is "Yes, the sky is blue." and the generated
answer is "No, the sky is not blue.".
In this example, the generated answer contradicts the ground truth answer by stating that the sky is not blue, when
in fact it is blue.
This inconsistency would result in a low score of 1-2, and the reason for the low score would reflect the contradiction
between the generated answer and the ground truth answer.
Please provide a clear reason for the low score, explaining how the generated answer contradicts the ground truth answer.
Labeling standards are as following:
5 - ideal, should include all information to answer the question comparing to the ground truth answer, and the generated
answer is consistent with the ground truth answer
4 - mostly relevant, although it might be a little too narrow or too broad comparing to the ground truth answer, and the
generated answer is consistent with the ground truth answer
3 - somewhat relevant, might be partly helpful but might be hard to read or contain other irrelevant content comparing to
the ground truth answer, and the generated answer is consistent with the ground truth answer
2 - barely relevant, perhaps shown as a last resort comparing to the ground truth answer, and the generated answer
contradicts with the ground truth answer
1 - completely irrelevant, should never be used for answering this question comparing to the ground truth answer, and
the generated answer contradicts with the ground truth answer
AI-assisted: Coherence
Score characteristics | Score details |
---|---|
Score range | Integer [1-5]: where 1 is bad and 5 is good |
What is this metric? | Measures how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language. |
How does it work? | The coherence measure assesses the ability of the language model to generate text that reads naturally, flows smoothly, and resembles human-like language in its responses. |
When to use it? | Use it when assessing the readability and user-friendliness of your model's generated responses in real-world applications. |
What does it need as input? | Question, Generated Answer |
Built-in prompt used by the Large Language Model judge to score this metric:
Coherence of an answer is measured by how well all the sentences fit together and sound naturally as a whole. Consider
the overall quality of the answer when evaluating coherence. Given the question and answer, score the coherence of answer
between one to five stars using the following rating scale:
One star: the answer completely lacks coherence
Two stars: the answer mostly lacks coherence
Three stars: the answer is partially coherent
Four stars: the answer is mostly coherent
Five stars: the answer has perfect coherency
This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.
AI-assisted: Fluency
Score characteristics | Score details |
---|---|
Score range | Integer [1-5]: where 1 is bad and 5 is good |
What is this metric? | Measures the grammatical proficiency of a generative AI's predicted answer. |
How does it work? | The fluency measure assesses the extent to which the generated text conforms to grammatical rules, syntactic structures, and appropriate vocabulary usage, resulting in linguistically correct responses. |
When to use it? | Use it when evaluating the linguistic correctness of the AI-generated text, ensuring that it adheres to proper grammatical rules, syntactic structures, and vocabulary usage in the generated responses. |
What does it need as input? | Question, Generated Answer |
Built-in prompt used by the Large Language Model judge to score this metric:
Fluency measures the quality of individual sentences in the answer, and whether they are well-written and grammatically correct.
Consider the quality of individual sentences when evaluating fluency. Given the question and answer, score the fluency of the
answer between one to five stars using the following rating scale:
One star: the answer completely lacks fluency
Two stars: the answer mostly lacks fluency
Three stars: the answer is partially fluent
Four stars: the answer is mostly fluent
Five stars: the answer has perfect fluency
This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.
AI-assisted: Retrieval Score
Score characteristics | Score details |
---|---|
Score range | Float [1-5]: where 1 is bad and 5 is good |
What is this metric? | Measures the extent to which the model's retrieved documents are pertinent and directly related to the given questions. |
How does it work? | Retrieval score measures the quality and relevance of the retrieved document to the user's question (summarized within the whole conversation history). Steps: Step 1: Break down user query into intents, Extract the intents from user query like “How much is the Azure linux VM and Azure Windows VM?” -> Intent would be [“what’s the pricing of Azure Linux VM?”, “What’s the pricing of Azure Windows VM?”]. Step 2: For each intent of user query, ask the model to assess if the intent itself or the answer to the intent is present or can be inferred from retrieved documents. The answer can be “No”, or “Yes, documents [doc1], [doc2]…”. “Yes” means the retrieved documents relate to the intent or answer to the intent, and vice versa. Step 3: Calculate the fraction of the intents that have an answer starting with “Yes”. In this case, all intents have equal importance. Step 4: Finally, square the score to penalize the mistakes. |
When to use it? | Use the retrieval score when you want to guarantee that the documents retrieved are highly relevant for answering your users' questions. This score helps ensure the quality and appropriateness of the retrieved content. |
What does it need as input? | Question, Context, Generated Answer |
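Steps 3 and 4 of the retrieval score description can be sketched as below, taking the per-intent answers from step 2 as given. The function name is hypothetical, and the result lies in the range 0 to 1; how that value relates to the 1-5 score range listed in the table is not described in the source.

```python
def squared_yes_fraction(intent_answers: list[str]) -> float:
    """Fraction of intents answered with "Yes", squared to penalize misses."""
    if not intent_answers:
        raise ValueError("no intents extracted from the query")
    yes_count = sum(
        1 for answer in intent_answers if answer.strip().lower().startswith("yes")
    )
    fraction = yes_count / len(intent_answers)  # step 3: all intents weigh equally
    return fraction ** 2                        # step 4: square the score


# Two intents from the pricing example; only the first is covered by documents.
answers = ["Yes, documents [doc1]", "No"]
print(squared_yes_fraction(answers))  # (1/2) ** 2 = 0.25
```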
Built-in prompt used by the Large Language Model judge to score this metric:
A chat history between user and bot is shown below
A list of documents is shown below in json format, and each document has one unique id.
These listed documents are used as contex to answer the given question.
The task is to score the relevance between the documents and the potential answer to the given question in
the range of 1 to 5.
1 means none of the documents is relevant to the question at all. 5 means either one of the document or combination of
a few documents is ideal for answering the given question.
Think through step by step:
- Summarize each given document first
- Determine the underlying intent of the given question, when the question is ambiguous, refer to the given chat history
- Measure how suitable each document to the given question, list the document id and the corresponding relevance score.
- Summarize the overall relevance of given list of documents to the given question after # Overall Reason, note that
the answer to the question can soley from single document or a combination of multiple documents.
- Finally, output "# Result" followed by a score from 1 to 5.
# Question
{{ query }}
# Chat History
{{ history }}
# Documents
---BEGIN RETRIEVED DOCUMENTS---
{{ FullBody }}
---END RETRIEVED DOCUMENTS---
AI-assisted: GPT-Similarity
Score characteristics | Score details |
---|---|
Score range | Integer [1-5]: where 1 is bad and 5 is good |
What is this metric? | Measures the similarity between a source data (ground truth) sentence and the generated response by an AI model. |
How does it work? | The GPT-similarity measure evaluates the likeness between a ground truth sentence (or document) and the AI model's generated prediction. This calculation involves creating sentence-level embeddings for both the ground truth and the model's prediction, which are high-dimensional vector representations capturing the semantic meaning and context of the sentences. |
When to use it? | Use it when you want an objective evaluation of an AI model's performance, particularly in text generation tasks where you have access to ground truth responses. GPT-similarity enables you to assess the generated text's semantic alignment with the desired content, helping to gauge the model's quality and accuracy. |
What does it need as input? | Question, Ground Truth Answer, Generated Answer |
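The table mentions comparing sentence-level embeddings. As a rough illustration of what comparing two embedding vectors can look like, the sketch below computes cosine similarity on toy vectors. Cosine similarity is an assumption here (the source doesn't name the comparison function), and the vectors are made up rather than produced by an embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


# Toy two-dimensional "embeddings" standing in for real sentence vectors.
ground_truth_vec = [3.0, 4.0]
prediction_vec = [4.0, 3.0]
print(cosine_similarity(ground_truth_vec, prediction_vec))  # 24 / 25 = 0.96
```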
Built-in prompt used by the Large Language Model judge to score this metric:
GPT-Similarity, as a metric, measures the similarity between the predicted answer and the correct answer. If the information
and content in the predicted answer is similar or equivalent to the correct answer, then the value of the Equivalence metric
should be high, else it should be low. Given the question, correct answer, and predicted answer, determine the value of
Equivalence metric using the following rating scale:
One star: the predicted answer is not at all similar to the correct answer
Two stars: the predicted answer is mostly not similar to the correct answer
Three stars: the predicted answer is somewhat similar to the correct answer
Four stars: the predicted answer is mostly similar to the correct answer
Five stars: the predicted answer is completely similar to the correct answer
This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.
Traditional machine learning: F1 Score
Score characteristics | Score details |
---|---|
Score range | Float [0-1] |
What is this metric? | Measures the ratio of the number of shared words between the model generation and the ground truth answers. |
How does it work? | The F1-score computes the ratio of the number of shared words between the model generation and the ground truth. Ratio is computed over the individual words in the generated response against those in the ground truth answer. The number of shared words between the generation and the truth is the basis of the F1 score: precision is the ratio of the number of shared words to the total number of words in the generation, and recall is the ratio of the number of shared words to the total number of words in the ground truth. |
When to use it? | Use the F1 score when you want a single comprehensive metric that combines both recall and precision in your model's responses. It provides a balanced evaluation of your model's performance in terms of capturing accurate information in the response. |
What does it need as input? | Question, Ground Truth Answer, Generated Answer |
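The word-overlap computation described above can be sketched as follows. F1 is the harmonic mean of precision and recall; the lowercasing and whitespace tokenization used here are simplifying assumptions, and the built-in metric may normalize text differently.

```python
from collections import Counter


def f1_score(generated: str, ground_truth: str) -> float:
    """Token-overlap F1 between a generated answer and the ground truth."""
    gen_tokens = generated.lower().split()
    truth_tokens = ground_truth.lower().split()
    # Multiset intersection counts each shared word at most as often as it
    # appears in both texts.
    shared = sum((Counter(gen_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(gen_tokens)   # shared words / words in generation
    recall = shared / len(truth_tokens)    # shared words / words in ground truth
    return 2 * precision * recall / (precision + recall)


score = f1_score("the tent is waterproof", "the alpine tent is waterproof")
print(round(score, 3))  # precision 1.0, recall 0.8
```

In this example every generated word appears in the ground truth, so precision is 1.0, while the missing word "alpine" lowers recall to 0.8.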