CARAML Lab: effiCient, fAir, Robust, and Active ML Lab

Subset Selection in Machine Learning: Hands-On Application with CORDS, DISTIL, SUBMODLIB, and TRUST

The 37th AAAI Conference on Artificial Intelligence, Washington DC, USA

February 23rd, 2023 08:30 AM EST - 12:30 PM EST

About Tutorial

Machine learning – specifically, deep learning – has transformed numerous application domains such as computer vision and video analytics, speech recognition, and natural language processing. As a result, a significant focus of researchers in the last decade has been on obtaining the most accurate models, often matching and sometimes surpassing human-level performance in these areas. However, deep learning is also unlike human learning in many ways. To achieve human-level performance, deep models require large amounts of labeled training data, several GPU instances to train, and massive model sizes (ranging from hundreds of millions to billions of parameters). In addition, they are often not robust to noise, imbalance, and out-of-distribution data, and can easily inherit the biases in the training data. Motivated by these desiderata and many more, we will present a rich framework of PyTorch toolkits for subset selection and coreset-based approaches that satisfy them. We will begin by providing a brief introduction to these desiderata and how they are handled by the methods implemented in our toolkits. Next, we will introduce each toolkit – CORDS, DISTIL, SUBMODLIB, and TRUST – by highlighting its field of application and by walking through enriching, real-scenario tutorials that showcase its ease of use and its capability for satisfying the above desiderata. In particular, we will provide hands-on experience with compute-efficient training through CORDS; label-efficient training through DISTIL; powerful submodular optimization through SUBMODLIB; and robust, fair, and personalized learning via TRUST. We will present these toolkits under the larger cooperative effort of DECILE, highlighting the rich community of researchers and practitioners supporting them. Our toolkits are available through the DECILE GitHub organization: https://github.com/decile-team

Lab Tutorial Goals:

The goal of this lab is to provide and highlight a toolkit framework for solving many real-world challenges in deep learning using subset selection and coreset-based approaches. Specifically, we believe that providing these toolkits and an enriching hands-on experience with their use will enable researchers and practitioners to think beyond just improving model accuracy and about broader yet important aspects like Green AI, fairness, robustness, personalization, data efficiency, and so on. Furthermore, the hands-on demonstrations will also be useful to students and researchers from industry to get oriented in and practically started with the subject matter of each toolkit and its related aspects. By introducing these toolkits, we also hope to build a larger community around their usage, which will help strengthen their applicability across deep learning and help connect the interests of like-minded researchers and practitioners.

Key Takeaways:

  1. How to train machine learning models in real-world settings, achieving near-optimal performance while also satisfying other desiderata like compute efficiency, data efficiency, robustness, fairness, and personalization.
  2. How are industry and academia tackling this challenge and doing Data-Efficient Learning?
  3. Hands-on session using PyTorch for Data-Efficient Learning with the following State-of-the-Art toolkits:
    1. CORDS for Compute Efficient Learning
    2. DISTIL for Active Learning
    3. SUBMODLIB for Submodular Optimization
    4. TRUST for Targeted Learning

Who should attend this Lab Tutorial Session?

The target audience of this lab is practitioners in deep learning and machine learning, as well as researchers working on more theoretical areas of optimization in machine learning. Further, we encourage participation from individuals in industry and academia, and from students with experience and/or interest in Machine Learning and Deep Learning and their efficient application to solving problems quickly. Finally, this session is highly relevant for academics who wish to learn how to design accurate, resilient, and computationally efficient models with limited GPU resources.

Schedule

  1. An Introduction to Subset Selection for Data-Efficient ML (8:30 AM - 8:50 AM)
  2. Submodlib and TRUST (8:50 AM - 10:00 AM)
    1. Introduction of Submodlib and TRUST
    2. Overview of Submodular functions and Submodular Information Measures
    3. Modeling Capabilities of Submodular functions using Submodlib
    4. Data Summarization using Submodlib
    5. Overview and Motivation of TRUST
    6. TRUST Demonstrations
  3. DISTIL (10:00 AM - 10:30 AM)
    1. Introduction of DISTIL
    2. Overview of Selection Strategies
  4. Break (10:30 AM - 11:00 AM)
  5. DISTIL (11:00 AM - 11:30 AM)
    1. DISTIL Demonstrations
  6. CORDS (11:30 AM - 12:30 PM)
    1. Introduction of CORDS
    2. Use Cases of CORDS
    3. Overview of Selection Strategies
    4. CORDS Demonstrations

An Introduction to Subset Selection for Data-Efficient ML (8:30 AM - 8:50 AM)

To contextualize the various subset selection problems and toolkits that will be presented in this lab, a brief overview of CARAML Lab is given, including an overview of the pertinent projects that have been and are being conducted.

SubModlib and TRUST (8:50 AM - 10:00 AM)

Introduction

Overview of data subset selection, SubModLib, and TRUST.

SubModLib is an easy-to-use, efficient, and scalable Python library for submodular optimization with a C++ optimization engine. SubModLib finds application in summarization, data subset selection, hyperparameter tuning, efficient training, etc. Through a rich API, it offers a great deal of flexibility in the way it can be used. The SubModLib toolkit is available at: https://github.com/decile-team/submodlib

TRUST is a toolkit that provides support for various targeted selection algorithms. Most real-world datasets have one or more characteristics that make the direct use of state-of-the-art subset selection algorithms difficult. Quite often, these characteristics are either known or can easily be discovered. For example, real-world data is imbalanced, redundant, and contains samples that are not of concern to the task at hand. Hence, there is a need to favor some samples while ignoring others. This is possible via the different Submodular Information Measures based algorithms implemented in TRUST. The TRUST toolkit is available at: https://github.com/decile-team/trust

Overview of Submodular functions and Submodular Information Measures

We briefly discuss submodularity and information theory, and show the formulations of some submodular functions and their information measures.
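The diminishing-returns property at the heart of submodularity can be sketched in a few lines of plain Python. The example below (an illustration only, not SubModLib code; the similarity matrix and function names are ours) evaluates the facility-location function on a toy similarity matrix and checks that the marginal gain of an element never increases as the set grows:

```python
# Sketch: the diminishing-returns property that defines submodularity,
# illustrated with the facility-location function
#   f(S) = sum_i max_{j in S} sim[i][j].
# Pure-Python illustration; not the SubModLib implementation.

def facility_location(S, sim):
    """f(S): each ground-set item is credited with its best representative in S."""
    if not S:
        return 0.0
    return sum(max(row[j] for j in S) for row in sim)

def gain(S, x, sim):
    """Marginal gain f(S + {x}) - f(S)."""
    return facility_location(S | {x}, sim) - facility_location(S, sim)

# A toy 4-point similarity matrix (symmetric, 1.0 on the diagonal).
sim = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]

A = {0}       # smaller set
B = {0, 2}    # superset of A
x = 1         # candidate element

# Submodularity: the gain of x w.r.t. the smaller set is at least as large.
g_A = gain(A, x, sim)   # 0.4
g_B = gain(B, x, sim)   # 0.2
print(g_A >= g_B)       # True
```

The same diminishing-returns check applies to the other functions discussed in this part, such as graph cut and log determinant; facility location is used here only because it is the simplest to write down.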

Modeling Capabilities of Submodular functions using Submodlib

We discuss the different submodular functions implemented in SubModLib and show how easy it is to obtain a subset with just a few lines of code. We also show the modeling capabilities, such as representation, diversity, and coverage, of different submodular functions, Submodular Mutual Information, Conditional Gain, and Conditional Mutual Information functions.

Data Summarization using Submodlib

We illustrate the usage of SubModLib for visual data summarization. In particular, we use submodular information measures for generic, query-focused, and privacy-preserving summarization.
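To give a flavor of how query-focused selection works, the pure-Python example below greedily builds a summary that covers a query set. The objective is a simplified stand-in for the facility-location-based submodular mutual information measures in SubModLib; the function names and toy data are illustrative only, not the library's API:

```python
# Sketch: query-focused selection in the spirit of submodular mutual information.
# We greedily pick a summary S that covers the query set Q: each query item is
# credited with its best match in S. Illustrative only, not SubModLib code.

def query_coverage(S, Q, sim):
    """Sum over queries of their best similarity to the chosen summary."""
    if not S:
        return 0.0
    return sum(max(sim[q][s] for s in S) for q in Q)

def greedy_select(candidates, Q, sim, budget):
    """Standard greedy maximization: add the item with the largest marginal gain."""
    S = set()
    for _ in range(budget):
        best, best_gain = None, 0.0
        for x in sorted(candidates - S):
            g = query_coverage(S | {x}, Q, sim) - query_coverage(S, Q, sim)
            if g > best_gain:
                best, best_gain = x, g
        if best is None:   # no positive gain left
            break
        S.add(best)
    return S

# Toy data: items 0-4; the "queries" are items 0 and 3.
sim = [
    [1.0, 0.9, 0.2, 0.1, 0.1],
    [0.9, 1.0, 0.3, 0.1, 0.2],
    [0.2, 0.3, 1.0, 0.4, 0.5],
    [0.1, 0.1, 0.4, 1.0, 0.9],
    [0.1, 0.2, 0.5, 0.9, 1.0],
]
Q = {0, 3}
summary = greedy_select({1, 2, 4}, Q, sim, budget=2)
print(sorted(summary))  # [1, 4] -- one item close to each query
```

Swapping in a different objective (e.g., one that penalizes similarity to a private set) turns the same greedy loop into the privacy-preserving variant discussed in the session.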

Overview and Motivation of TRUST

We discuss different applications of Targeted Subset Selection using TRUST and the utility of Submodular Information Measures for handling class imbalance, out-of-distribution (OOD) data, and redundancy. We illustrate real-world medical imaging and autonomous driving use cases that can be tackled using TRUST.

TRUST Demonstrations

We show the utility of TRUST via an Interactive Application. The TRUST Interactive Application is available at: https://github.com/decile-team/trust/tree/demo-app/demo

DISTIL (10:00 AM - 10:30 AM)

Introduction

DISTIL is a toolkit designed to ease the complexity of performing active learning in a number of settings. Here, a brief introduction of active learning is given along with an overview of DISTIL’s salient features.

Overview of Selection Strategies

To better understand the selection strategies implemented within DISTIL, brief overviews of each strategy are given, including how these strategies can be used via DISTIL.
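To give a flavor of what such strategies look like, here is a minimal pure-Python sketch of entropy-based uncertainty sampling, one of the classic active learning strategies of the kind DISTIL implements (this is an illustration with made-up function names, not DISTIL's actual API):

```python
# Sketch: entropy-based uncertainty sampling for active learning.
# Pick the unlabeled points whose predictive distribution has the highest
# Shannon entropy, i.e., the points the model is least sure about.
import math

def entropy(probs):
    """Shannon entropy of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(pred_probs, budget):
    """Return indices of the `budget` most uncertain unlabeled points."""
    ranked = sorted(range(len(pred_probs)),
                    key=lambda i: entropy(pred_probs[i]),
                    reverse=True)
    return ranked[:budget]

# Toy softmax outputs for four unlabeled points (3 classes each).
pred_probs = [
    [0.98, 0.01, 0.01],   # confident
    [0.34, 0.33, 0.33],   # very uncertain
    [0.70, 0.20, 0.10],
    [0.50, 0.45, 0.05],   # fairly uncertain
]
print(select_most_uncertain(pred_probs, budget=2))  # [1, 3]
```

Margin- and least-confidence-based variants follow the same pattern with a different scoring function, which is why libraries in this space typically expose selection strategies behind a common interface.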

Break (10:30 AM - 11:00 AM)

DISTIL (11:00 AM - 11:30 AM)

DISTIL Demonstrations

To see how DISTIL is used in a full implementation, a number of working demonstrations are presented, spanning from medical imaging tasks to sentiment analysis tasks.

CORDS (11:30 AM - 12:30 PM)

Introduction

CORDS is a toolkit designed for efficient learning of machine learning models using subset selection. Here, we give a brief introduction to the various features implemented in the CORDS library for ease of use, as well as the various subset selection strategies incorporated for efficient learning. The CORDS toolkit is available at: https://github.com/decile-team/cords

Use Cases of CORDS

We elaborate on use cases of CORDS for different applications like supervised learning, semi-supervised learning, and hyperparameter tuning. We further give a peek into future use cases that we plan to incorporate into CORDS shortly.

Overview of Selection Strategies

To better understand the selection strategies implemented within CORDS, brief overviews of each strategy are given, including how these strategies can be used via CORDS.
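As a rough illustration of the gradient-matching idea behind strategies such as CRAIG and GradMatch, the sketch below greedily picks samples whose summed gradients approximate the full-data gradient. The toy 2-D "per-sample gradients" and all function names here are ours, for illustration only; they are not the CORDS implementation:

```python
# Sketch: gradient-matching subset selection. At each step, pick the sample
# whose gradient is most aligned with the part of the full gradient that the
# chosen subset has not yet accounted for. Pure-Python illustration.

def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]

def vec_sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def greedy_grad_match(grads, budget):
    """Greedily pick `budget` samples whose gradient sum tracks the full gradient."""
    full = [sum(g[d] for g in grads) for d in range(len(grads[0]))]
    chosen, approx = [], [0.0] * len(full)
    for _ in range(budget):
        residual = vec_sub(full, approx)
        # The sample gradient most aligned with what is still missing.
        best = max((i for i in range(len(grads)) if i not in chosen),
                   key=lambda i: dot(grads[i], residual))
        chosen.append(best)
        approx = vec_add(approx, grads[best])
    return chosen

# Toy per-sample gradients: two clusters pulling in different directions.
grads = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
subset = greedy_grad_match(grads, budget=2)
print(sorted(subset))  # [0, 2] -- one sample from each "direction"
```

Training on such a subset (optionally with per-sample weights) is what makes this family of strategies compute-efficient: the subset's gradients stand in for the full dataset's.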

CORDS Demonstrations

To see how CORDS is used in a full implementation, a number of working demonstrations are presented for supervised learning, semi-supervised learning, and hyperparameter tuning across different datasets from vision, text, and tabular domains. The demonstration notebooks shown for the CORDS toolkit during the tutorial are available at: https://github.com/decile-team/cords/tree/main/tutorial