Adversarial Robustness 360 - Resources

Welcome to the Adversarial Robustness Toolbox

Security and privacy of our AI training data and models is one of the key pillars for building trust in AI. As machine learning and AI models grow in sophistication and accuracy, they are making their way into more and more applications and processes that govern our daily lives. Our increased reliance on these models, and the value they represent as an accumulation of confidential and proprietary knowledge, put them at increasing risk of attack. Further, these models pose unique security risks that must be accounted for and mitigated. The Adversarial Robustness Toolbox is here to help.

The Adversarial Robustness Toolbox (ART) is a Python library for machine learning security. ART provides tools that enable developers and researchers to evaluate and defend machine learning models and applications against the adversarial threats of evasion, poisoning, extraction, and inference. ART supports all popular machine learning frameworks (TensorFlow, Keras, PyTorch, MXNet, scikit-learn, XGBoost, LightGBM, CatBoost, GPy, etc.), all data types (images, tables, audio, video, etc.) and machine learning tasks (classification, object detection, speech recognition, generation, certification, etc.).

The pages here present an introduction to the threat landscape and how ART is working towards a more secure, private, and robust AI / ML ecosystem. We have provided interactive demonstrations to illustrate evasion (demo 1), (demo 2), and poisoning attacks. You will also find tutorials and other notebooks offering more comprehensive, technical, and hands-on demonstrations geared towards the machine learning practitioner. The API documentation is also available.


Adversarial attacks are studied using a variety of threat models. The two most common are the whitebox and blackbox threat models. In the whitebox threat model, an adversary has visibility into the model parameters including, but not limited to, the architecture, weights, and pre- and post-processing steps. The whitebox threat model is thought to represent the strongest attacker, as they have full knowledge of the system. Some works also make reference to an adaptive whitebox threat model, which allows the adversary to modify their attack based on the target model's defensive responses. In the blackbox threat model, the adversary only has query access to the model. That is to say, given an input from the adversary, the model provides either a soft output (i.e., prediction probabilities) or a hard output (i.e., top-1 or top-k output labels). Blackbox attacks are perceived as the "realistic" threat model when evaluating a system for deployment. Note that while the whitebox threat model is strictly stronger, some defenses that defeat whitebox attacks still fail against blackbox attacks.
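To make the difference between soft and hard outputs concrete, the following minimal sketch wraps a tiny linear model in a query-only interface. The class name `QueryOnlyModel`, its methods, and the weights are hypothetical and chosen purely for illustration; they are not part of ART.

```python
import math


def softmax(logits):
    """Convert raw scores into prediction probabilities."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class QueryOnlyModel:
    """Hypothetical wrapper that hides the weights behind a query interface."""

    def __init__(self, weights):
        self._weights = weights  # only a whitebox adversary would see these

    def _logits(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self._weights]

    def soft_output(self, x):
        """Soft output: the full vector of prediction probabilities."""
        return softmax(self._logits(x))

    def hard_output(self, x, k=1):
        """Hard output: indices of the top-k labels, most likely first."""
        probs = self.soft_output(x)
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        return ranked[:k]


# A three-class toy model queried with a single two-feature input.
model = QueryOnlyModel(weights=[[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
probs = model.soft_output([2.0, 0.5])
top1 = model.hard_output([2.0, 0.5])
top2 = model.hard_output([2.0, 0.5], k=2)
```

A blackbox adversary in this sketch can call only `soft_output` or `hard_output`, whereas a whitebox adversary would also read `_weights` directly.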

Evasion: Adversarial evasion is an inference time attack in which the adversary adds adversarial noise to an input to create an adversarial sample. These samples, when provided to a well-trained target model, cause predictable errors at the model's output. Evasion attacks can be targeted (i.e., the noise causes a specific error at the output) or untargeted (i.e., the noise causes an error at the output, but the type of error is not important to the adversary). The adversarial noise can be crafted using gradient-based and non-gradient-based methods, depending on the adversarial threat model. A gradient-based method requires whitebox knowledge of the target model: the adversary uses the gradient of the model with respect to the input in order to identify the optimal adversarial noise to add. Non-gradient-based methods (including approximated gradient approaches) often only require blackbox knowledge of the target model. Based on the soft or hard model outputs, a non-gradient-based method often relies on searching the noise region around an input. The amount of adversarial noise added to an input is usually constrained so as to keep the noise imperceptible or inconspicuous.
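The contrast between gradient-based and non-gradient-based evasion can be sketched on a toy logistic regression model. The first function below is a one-step signed-gradient attack in the spirit of the Fast Gradient Method and needs the weights; the second is a naive random search that sees only hard labels. The model, the input, the budget `eps`, and all function names are assumptions made for illustration, and neither function reflects how ART implements its attacks.

```python
import math
import random


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def fgsm_whitebox(w, b, x, y, eps):
    """Untargeted one-step attack: x + eps * sign(d loss / d x).

    For cross-entropy loss on logistic regression the input gradient
    is (p - y) * w, so the attack needs whitebox access to w and b.
    """
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


def random_search_blackbox(hard_label, x, y, eps, tries=200, seed=0):
    """Untargeted search that only observes hard labels from the model."""
    rng = random.Random(seed)
    for _ in range(tries):
        candidate = [xi + rng.uniform(-eps, eps) for xi in x]
        if hard_label(candidate) != y:
            return candidate  # first perturbation that changes the label
    return None  # nothing found within the query budget


# Toy model and an input whose true label is 1.
w, b = [2.0, -1.0], 0.0
x, y, eps = [0.3, 0.1], 1, 0.3


def hard_label(v):
    """Blackbox view of the model: a single 0/1 label per query."""
    return int(predict_proba(w, b, v) > 0.5)


x_fgsm = fgsm_whitebox(w, b, x, y, eps)
x_rand = random_search_blackbox(hard_label, x, y, eps)
```

Both attacks keep every coordinate within `eps` of the original input, which is one simple way to express the noise constraint mentioned above.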

Poisoning: Adversarial poisoning is a training time attack in which the adversary uses direct or indirect methods to corrupt the training data in order to achieve a specific goal. Poisoning is a major concern whenever the adversary has the ability to influence the training data, such as in online learning, in which live data is periodically used to retrain the model so as to remain robust to concept drift. Through adversarial poisoning, an adversary can degrade model performance and inject backdoors into the model so as to induce certain errors when triggered.
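One simple way to picture a backdoor is as a fixed trigger pattern stamped onto a fraction of the training samples, which are then relabelled to the adversary's target class. The sketch below shows only that data manipulation; the trigger layout, the function names, and the 20% poisoning rate are hypothetical choices for illustration and are not taken from ART.

```python
import random


def add_trigger(sample, trigger_value=1.0, trigger_size=2):
    """Overwrite the last `trigger_size` features with a fixed pattern."""
    poisoned = list(sample)
    poisoned[-trigger_size:] = [trigger_value] * trigger_size
    return poisoned


def poison_dataset(samples, labels, target_label, fraction, seed=0):
    """Stamp the trigger on a random fraction of samples and relabel them."""
    rng = random.Random(seed)
    n_poison = int(len(samples) * fraction)
    chosen = set(rng.sample(range(len(samples)), n_poison))
    new_samples, new_labels = [], []
    for i, (x, y) in enumerate(zip(samples, labels)):
        if i in chosen:
            new_samples.append(add_trigger(x))
            new_labels.append(target_label)  # label flipped to the target
        else:
            new_samples.append(list(x))      # clean samples stay unchanged
            new_labels.append(y)
    return new_samples, new_labels, chosen


# Ten clean samples of class 0; two of them become backdoored class-1 samples.
clean_x = [[0.0, 0.0, 0.0, 0.0] for _ in range(10)]
clean_y = [0] * 10
poison_x, poison_y, poisoned_idx = poison_dataset(
    clean_x, clean_y, target_label=1, fraction=0.2
)
```

A model trained on `poison_x` and `poison_y` may learn to associate the trigger with the target label, which is the error the adversary later induces by presenting the trigger at inference time.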

Inference and Inversion: Adversarial inference and inversion are inference time attacks in which the adversary uses API access to a target blackbox model in order to extract information about the training data. In a model inference attack, the adversary uses the API in order to learn the data distribution of the training data or determine if certain data points were used when training the target model. In an adversarial inversion attack, the adversary uses the API in an attempt to reconstruct the training data for later use. Adversarial inference is a major issue when the confidentiality of the data needs to be maintained for privacy or proprietary reasons.
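Determining whether a data point was used in training can be illustrated with a confidence-threshold rule: the adversary queries the API and guesses "member" when the returned confidence is unusually high. The sketch below uses a deliberately overfit toy target whose confidence peaks on its training points. The `OverfitModel` class, its confidence formula, and the 0.9 threshold are all hypothetical and stand in for a real model only for the purpose of illustration.

```python
import math


class OverfitModel:
    """Toy stand-in for an overfit target: confidence peaks on training points."""

    def __init__(self, train_points):
        self._train = list(train_points)  # hidden from the adversary

    def confidence(self, x):
        """API call: confidence decays with distance to the nearest training point."""
        nearest = min(abs(x - t) for t in self._train)
        return 0.5 + 0.5 * math.exp(-10.0 * nearest)


def infer_membership(api, candidates, threshold=0.9):
    """Guess 'member' whenever the returned confidence reaches the threshold."""
    return [api(x) >= threshold for x in candidates]


target = OverfitModel([0.1, 0.4, 0.8])
candidates = [0.1, 0.25, 0.8, 0.6]  # the first and third were training points
guesses = infer_membership(target.confidence, candidates)
```

In practice the gap between member and non-member confidence is far smaller than in this toy, so the threshold usually has to be calibrated rather than fixed by hand.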

Model Extraction: Adversarial model extraction is an inference time attack in which the adversary uses API access to the target model in order to learn the target model's parameters or create an approximation of the target model. By querying the model and using the outputs as labels, the adversary can train a new, substitute model whose performance is similar to that of the target model. Once the substitute is trained, the adversary can reuse it for their own purposes (theft) or craft evasion attacks against it, which can then be transferred to the target model with a high likelihood of success.
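The query-and-label loop described above can be sketched in a few lines. Here the target is a hidden linear classifier that returns only hard labels, and the substitute is a logistic regression trained with plain gradient descent. The target's weights, the query budget of 200, the training hyperparameters, and all function names are assumptions chosen for illustration, not ART's extraction attacks.

```python
import math
import random


def target_api(x):
    """Hypothetical blackbox: returns only a hard label (0 or 1)."""
    secret_w, secret_b = [1.5, -2.0], 0.25  # unknown to the adversary
    return int(sum(w * xi for w, xi in zip(secret_w, x)) + secret_b > 0)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_substitute(queries, labels, epochs=300, lr=0.5):
    """Fit a logistic regression to the (query, API label) pairs."""
    w, b = [0.0] * len(queries[0]), 0.0
    n = len(queries)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(queries, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def substitute_label(w, b, x):
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)


rng = random.Random(0)
queries = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
labels = [target_api(q) for q in queries]  # API outputs used as labels
w_sub, b_sub = train_substitute(queries, labels)

# Measure how often the substitute agrees with the target on fresh inputs.
fresh = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
agreement = sum(
    target_api(x) == substitute_label(w_sub, b_sub, x) for x in fresh
) / len(fresh)
```

Because the adversary holds `w_sub` and `b_sub`, the substitute can then be attacked with whitebox methods and the resulting samples tried against the target.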

ART Attacks

The attack descriptions include a link to the original publication and tags describing framework-support of implementations in ART:

  • all/Numpy: implementation based on Numpy to support all frameworks
  • TensorFlow: implementation optimised for TensorFlow
  • PyTorch: implementation optimised for PyTorch

Evasion Attacks

  • Auto-Attack (Croce and Hein, 2020)

    Auto-Attack runs one or more evasion attacks, either the defaults or attacks provided by the user, against a classification task. Auto-Attack optimises the attack strength by only attacking correctly classified samples and by first running the untargeted version of each attack followed by running the targeted version against each possible target label.


  • Auto Projected Gradient Descent (Auto-PGD) (Croce and Hein, 2020) all/Numpy

    Auto Projected Gradient Descent attacks classification models and optimises its attack strength by adapting the step size across iterations depending on the overall attack budget and the progress of the optimisation. After adapting its step size, Auto-PGD restarts from the best example found so far.

  • Shadow Attack (Ghiasi et al., 2020) TensorFlow, PyTorch

    Shadow Attack causes certifiably robust networks to misclassify an image and produce "spoofed" certificates of robustness by applying large but natural-looking perturbations.

  • Wasserstein Attack (Wong et al., 2020) all/Numpy

    Wasserstein Attack generates adversarial examples with minimised Wasserstein distances and perturbations according to the content of the original images.

  • Brendel & Bethge Attack (Brendel et al., 2019) all/Numpy

    Brendel & Bethge attack is a powerful gradient-based adversarial attack that follows the adversarial boundary (the boundary between the space of adversarial and non-adversarial images as defined by the adversarial criterion) to find the minimum distance to the clean image.

  • Targeted Universal Adversarial Perturbations (Hirano and Takemoto, 2019) all/Numpy

    This attack creates targeted universal adversarial perturbations combining iterative methods to generate untargeted examples and fast gradient sign method to create a targeted perturbation.

  • High Confidence Low Uncertainty (HCLU) Attack (Grosse et al., 2018) GPy

    The HCLU attack creates adversarial examples achieving high confidence and low uncertainty on a Gaussian process classifier.

  • Iterative Frame Saliency (Inkawhich et al., 2018)

    The Iterative Frame Saliency attack creates adversarial examples for optical flow-based image and video classification models.

  • DPatch (Liu et al., 2018) all/Numpy

    DPatch creates digital, rectangular patches that attack object detectors.

  • Robust DPatch (Liu et al., 2018; Lee and Kolter, 2019) all/Numpy

    A robust version of DPatch including sign gradients and expectations over transformations.

  • ShapeShifter (Chen et al., 2018)

  • Projected Gradient Descent (PGD) (Madry et al., 2017)

  • NewtonFool (Jang et al., 2017)

  • Elastic Net (Chen et al., 2017)

  • Adversarial Patch (Brown et al., 2017) all/Numpy, TensorFlow

    This attack generates adversarial patches that can be printed and applied in the physical world to attack image and video classification models.

  • Decision Tree Attack (Papernot et al., 2016) all/Numpy

    The Decision Tree Attack creates adversarial examples for decision tree classifiers by exploiting the structure of the tree and searching for leaves with different classes near the leaf corresponding to the prediction for the benign sample.

  • Carlini & Wagner (C&W) L_2 and L_inf attack (Carlini and Wagner, 2016) all/Numpy

    The Carlini & Wagner attacks in L2 and Linf norm are some of the strongest white-box attacks. A major difference with respect to the original implementation is that ART's implementation uses line search in the optimization of the attack objective.

  • Basic Iterative Method (BIM) (Kurakin et al., 2016) all/Numpy

  • Jacobian Saliency Map (Papernot et al., 2016)

  • Universal Perturbation (Moosavi-Dezfooli et al., 2016)

  • Feature Adversaries (Sabour et al., 2016) all/Numpy

    Feature Adversaries manipulates images as inputs to neural networks to mimic the intermediate representations/layers of the original images while changing their classification.

  • DeepFool (Moosavi-Dezfooli et al., 2015) all/Numpy

    DeepFool efficiently computes perturbations that fool deep networks, and thus reliably quantifies the robustness of these classifiers.

  • Virtual Adversarial Method (Miyato et al., 2015)

  • Fast Gradient Method (Goodfellow et al., 2014) all/Numpy
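Several entries in the list above, such as the Basic Iterative Method and Projected Gradient Descent, are commonly formulated as a repeated signed-gradient step followed by a projection back into the allowed noise region. The sketch below shows that loop for an L_inf budget on a toy logistic regression model. It is background illustration under stated assumptions (toy model, fixed step size, no random start) and not ART's implementation; all names are hypothetical.

```python
import math


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def pgd_linf(w, b, x, y, eps, step, iterations):
    """Iterated signed-gradient ascent on the loss with L_inf projection."""
    x_adv = list(x)
    for _ in range(iterations):
        p = predict_proba(w, b, x_adv)
        grad = [(p - y) * wi for wi in w]  # d(cross-entropy) / d(input)
        x_adv = [xa + step * ((g > 0) - (g < 0)) for xa, g in zip(x_adv, grad)]
        # Projection: clip each coordinate back into [x - eps, x + eps].
        x_adv = [min(max(xa, xo - eps), xo + eps) for xa, xo in zip(x_adv, x)]
    return x_adv


# Toy model and an input whose true label is 1.
w, b = [2.0, -1.0], 0.0
x, y = [0.3, 0.1], 1
x_adv = pgd_linf(w, b, x, y, eps=0.3, step=0.1, iterations=10)
```

Smaller steps with more iterations let the search follow the loss surface more closely than a single large step, at the cost of more gradient evaluations.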


Poisoning Attacks

Extraction Attacks

Inference Attacks

Attribute Inference

Membership Inference

Model Inversion

  • MIFace (Fredrikson et al., 2015)

    Inference attack exploiting adversarial access to a model to learn information about its training data using confidence values revealed in predictions.


  • Database Reconstruction

    Implementation of a database reconstruction attack inferring the missing row of a training dataset for a trained model.

ART Defenses

  • Basic detector based on inputs
  • Detector trained on the activations of a specific layer
  • Detector based on Fast Generalized Subset Scan (Speakman et al., 2018)


ART Metrics

Robustness Metrics



Developer tutorials

An overview of adversarial machine learning can be found in this slide presentation.

The following notebook-based tutorials provide different examples of using the ART toolbox.

Adversarial Attacks

Attack Defense ImageNet
This notebook shows a basic workflow with ART for evasion attacks and defences.

Adversarial Audio Examples
This notebook demonstrates how to use the ART library to create adversarial audio examples.

ASR DeepSpeech Example
This notebook demonstrates ART's DeepSpeech estimator and the Imperceptible ASR attack.

Privacy Attacks

Membership Inference Attacks
This notebook demonstrates how to run black-box membership attacks.

Attribute Inference Attacks
This notebook demonstrates how to run both black-box and white-box inference attacks.

Poison Attacks

Defend against Poisoning Attacks with Neural Cleanse
This notebook demonstrates how ART can defend against poison input.

Clean-Label Feature Collision Attacks
This notebook shows how to use ART to run a clean-label feature collision poisoning attack on a neural network trained with Keras.

Model Theft

Defending Against Theft with Reverse Sigmoid
This notebook demonstrates model stealing attacks and a reverse sigmoid defense against them.

Evasion Defenses

Adversarial Training
This notebook demonstrates adversarial training using ART on the MNIST dataset.

Adversarial Retraining
This notebook demonstrates how to load and evaluate the MNIST and CIFAR-10 models synthesized and trained.

Related Trusted AI Technologies


Machine learning models are increasingly used to inform high stakes decisions about people. Although machine learning, by its very nature, is always a form of statistical discrimination, the discrimination becomes objectionable when it places certain privileged groups at systematic advantage and certain unprivileged groups at systematic disadvantage. The AI Fairness 360 toolkit includes a comprehensive set of metrics for datasets and models to test for biases, explanations for these metrics, and algorithms to mitigate bias in datasets and models. The AI Fairness 360 interactive demo provides a gentle introduction to the concepts and capabilities of the toolkit. The package includes tutorials and notebooks for a deeper, data scientist-oriented introduction.

To learn more about this toolkit, visit IBM Research AI Fairness 360.


Many privacy regulations, including GDPR, mandate that organizations abide by certain privacy principles when processing personal information. This is also relevant for AI models trained using personal data, since it has been shown that trained ML models may leak sensitive information about their training sets. The AI Privacy 360 toolkit includes novel tools to support the assessment of privacy risks of AI-based solutions, and to enable them to adhere to such privacy requirements.

To learn more about this toolkit, visit IBM Research AI Privacy 360.


Black box machine learning models are achieving impressive accuracy on various tasks. However, as machine learning is increasingly used to inform high stakes decisions, explainability of the models is becoming essential. The AI Explainability 360 toolkit includes algorithms that span the different dimensions of explanation, along with proxy explainability metrics. The AI Explainability 360 interactive demo provides a gentle introduction to the concepts and capabilities of the toolkit. The package includes tutorials and notebooks for a deeper, data scientist-oriented introduction.

To learn more about this toolkit, visit IBM Research AI Explainability 360.

Uncertainty Quantification

Uncertainty quantification gives AI the ability to express that it is unsure, adding critical transparency for the safe deployment and use of AI. Uncertainty Quantification 360 is an extensible open-source toolkit with a Python package that provides data science practitioners and developers access to state-of-the-art algorithms, to streamline the process of estimating, evaluating, improving, and communicating uncertainty of AI and machine learning models.

To learn more about this toolkit, visit IBM Research Uncertainty Quantification 360.

Transparency and Governance

There is an increasing call for AI transparency and governance. The FactSheets project's goal is to foster trust in AI by increasing understanding and governance of how AI was created and deployed. The FactSheet 360 website includes many example FactSheets for publicly available models, a methodology for creating useful FactSheet templates, an illustration of how FactSheets can be used for AI governance, and various resources, such as over 24 hours of video lectures.

To learn more about this project, visit IBM Research AI FactSheets 360.