Microsoft Vulnerability Severity Classification for AI Systems
Our commitment to protecting customers from vulnerabilities in our software, services, and devices includes providing security updates and guidance that address these vulnerabilities when they are reported to Microsoft. We want to be transparent with our customers and security researchers in our approach. The following table describes the Microsoft severity classification for common vulnerability types for systems involving Artificial Intelligence or Machine Learning (AI/ML). It is derived from the Microsoft Security Response Center (MSRC) advisory rating. MSRC uses this information as guidelines to triage bugs and determine severity. In addition, the ease of exploitation is also considered during severity assessment.
Inference Manipulation
- This category consists of vulnerabilities that could be exploited to manipulate the model’s response to individual inference requests, but do not modify the model itself.
- The severity of the vulnerability depends on the resulting security impact.
- Content-related issues are assessed separately based on Microsoft’s Responsible AI Principles and Approach.
Vulnerability | Description | Security Impact | Severity |
---|---|---|---|
Prompt Injection |
The ability to inject instructions that cause the model to generate unintended output resulting in a specific security impact.
|
Allows an attacker to exfiltrate another user’s data or perform privileged actions on behalf of another user, requiring no user interaction (e.g., zero click).
|
Critical
|
Example: In an instruction-tuned language model, a textual prompt from an untrusted source contradicts the system prompt and is incorrectly prioritized above the system prompt, causing the model to change its behavior.
|
Allows an attacker to exfiltrate another user’s data or perform privileged actions on behalf of another user, requiring some user interaction (e.g., one or more clicks).
|
Important
|
|
References: Greshake et al. 2023, Rehberger 2023
|
Allows an attacker to influence or manipulate the generated output.
|
Content-related issue
|
|
Input Perturbation |
The ability to perturb valid inputs such that the model produces incorrect outputs. Also known as model evasion or adversarial examples. |
Allows an attacker to exfiltrate another user’s data or perform privileged actions on behalf of another user, requiring no user interaction (e.g., zero click).
|
Critical
|
Example: In an image classification model, an attacker perturbs the input image such that it is misclassified by the model.
|
Allows an attacker to exfiltrate another user’s data or perform privileged actions on behalf of another user, requiring some user interaction (e.g., one or more clicks).
|
Important
|
|
References: Szegedy et al. 2013, Biggio & Roli 2018
|
Allows an attacker to influence or manipulate the generated output.
|
Content-related issue
|
Model Manipulation
- This category consists of vulnerabilities that could be exploited to manipulate a model during the training phase.
- The severity of the vulnerability depends on how the impacted model is used.
- Vulnerabilities that directly modify the data of the model (e.g., the model weights) after training are assessed using existing definitions (e.g., “Tampering”).
Vulnerability | Description | Use of impacted model | Severity |
---|---|---|---|
Model Poisoning or Data Poisoning |
The ability to poison the model by tampering with the model architecture, training code, hyperparameters, or training data.
|
Used to make decisions that affect other users or generate content that is directly shown to other users.
|
Critical
|
Example: An attacker adds poisoned data records to a dataset used to train or fine-tune a model, in order to introduce a backdoor (e.g., unintended model behavior that can be triggered by specific inputs). The trained model may be used by multiple users.
|
Used to make decisions that affect only the attacker or generate content that is shown only to the attacker.
|
Low
|
|
References: Carlini et al. 2023
|
Inferential Information Disclosure
- This category consists of vulnerabilities that could be exploited to infer information about the model’s training data, architecture and weights, or inference-time input data.
- Inferential information disclosure vulnerabilities specifically involve inferring information using the model itself (e.g., through the legitimate inference interface). Vulnerabilities that obtain information in other ways (e.g., storage account misconfiguration) are assessed using existing definitions (e.g., “Information Disclosure”).
- These vulnerabilities are evaluated in terms of the level of confidence/accuracy attainable by a potential attacker, and are only applicable if an attacker can obtain a sufficient level of confidence/accuracy.
- The severity depends on the classification of the impacted data, using the data classification definitions from the Microsoft Vulnerability Severity Classification for Online Services.
Targeting training data
- For vulnerabilities targeting the training data, the severity depends on the classification of this data.
Vulnerability | Description | Data classification of training data | Severity |
---|---|---|---|
Membership Inference |
The ability to infer whether specific data records, or groups of records, were part of the model’s training data.
|
Highly Confidential or Confidential
|
Moderate
|
Example: An attacker guesses potential data records and then uses the outputs of the model to infer whether these were part of the training dataset, thus confirming the attacker’s guess.
|
General or Public
|
Low
|
|
References: Carlini et al. 2022, Ye et al. 2022
|
|||
Attribute Inference |
The ability to infer sensitive attributes of one or more records that were part of the training data.
|
Highly Confidential or Confidential
|
Important
|
Example: An attacker knows part of a data record that was used for training and then uses the outputs of the model to infer the unknown attributes of that record.
|
General
|
Moderate
|
|
References: Fredrikson et al. 2014, Salem et al. 2023
|
Public
|
Low
|
|
Training Data Reconstruction |
The ability to reconstruct individual data records from the training dataset.
|
Highly Confidential or Confidential
|
Important
|
Example: An attacker can generate a sufficiently accurate copy of one or more records from the training data, which would not have been possible without access to the model.
|
General
|
Moderate
|
|
References: Fredrikson et al. 2015, Balle et al. 2022
|
Public
|
Low
|
|
Property Inference |
The ability to infer sensitive properties about the training dataset.
|
Highly Confidential or Confidential
|
Moderate
|
Example: An attacker can infer what proportion of data records in the training that belong to a sensitive class, which would not have been possible without access to the model.
|
General or Public
|
Low
|
|
References: Zhang et al. 2021, Chase et al. 2021
|
Targeting model architecture/weights
For vulnerabilities targeting the model itself, the severity depends on the classification of the model architecture/weights.
Vulnerability | Description | Data classification of model architecture/weights | Severity |
---|---|---|---|
Model Stealing |
The ability to infer/extract the architecture or weights of the trained model.
|
Highly Confidential or Confidential
|
Critical
|
Example: An attacker is able to create a functionally equivalent copy of the target model using only inference responses from this model.
|
General
|
Important
|
|
References: Jagielski et al. 2020, Zanella-Béguelin et al. 2021
|
Public
|
Low
|
Targeting prompt/inputs
- For vulnerabilities targeting the inference-time inputs (including the system prompt), the severity depends on the classification of these inputs.
Vulnerability | Description | Data classification of system prompts/user input | Severity |
---|---|---|---|
Prompt Extraction |
The ability to extract or reconstruct the system prompt provided to the model by interacting with the model.
|
Any
|
Not in scope
|
Example: In an instruction-tuned language model, an attacker uses a specially crafted input to cause the model to output (part of) its system prompt.
|
|||
References: Shen et al. 2023
|
|||
Input Extraction |
The ability to extract or reconstruct other users’ inputs to the model.
|
Highly Confidential or Confidential
|
Important
|
Example: In an instruction-tuned language model, an attacker uses a specially crafted input that causes the model to reveal (part of) another user’s input to the attacker.
|
General or Public
|
Low
|
Microsoft recognizes that this list may not incorporate all vulnerability types and that new vulnerabilities may be discovered at any time. We reserve the right to classify any vulnerabilities that are not covered by this document at our discretion, and we may modify these classifications at any time. Examples are given for reference only. Any penetration testing against Microsoft systems must follow the Microsoft Penetration Testing Rules of Engagement.