
Learnings from ICML 2023

I recently attended the International Conference on Machine Learning (ICML), which showcased the latest research in machine learning, along with several valuable talks and sessions. In this post, I’ll outline the key insights and takeaways from the conference across various topics.

Computer Vision

1) Scaling Vision Transformers to 22 Billion Parameters (paper)

Researchers have made significant progress with language models by increasing their size. However, similar models for images and videos (called Vision Transformers) haven’t been scaled up nearly as much: before this work, the largest had 4 billion parameters.

In this study, the researchers successfully trained a much larger Vision Transformer with 22 billion parameters. They tested it on various tasks and found that its performance improved with its increased size. Additionally, the model showed other benefits such as better fairness, alignment with human visual perception, and robustness.

This breakthrough suggests that it’s possible to achieve similar scaling benefits in vision models as in language models, paving the way for future advancements.

2) Raising the Cost of Malicious AI-Powered Image Editing (paper)

To prevent malicious editing of images using advanced AI models, the authors propose a way to “immunize” images. This involves adding tiny, invisible changes to the image that disrupt the AI model’s ability to edit it, causing it to produce unrealistic results.

By immunizing the original image before the adversary can access it, this approach disrupts their ability to perform such edits successfully. The authors develop two methods for crafting these perturbations, an encoder attack and a diffusion attack, and test the effectiveness of both.
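
To make the idea concrete, here is a minimal PGD-style sketch of an encoder attack, assuming a latent-diffusion-style encoder module; the function and parameter names are illustrative, not the authors’ code:

    import torch

    def immunize_encoder_attack(image, encoder, target_latent,
                                eps=8 / 255, step=1 / 255, iters=100):
        # Find a small L-inf-bounded perturbation that pushes the image's
        # latent toward a decoy latent, so diffusion-based edits misfire.
        delta = torch.zeros_like(image, requires_grad=True)
        for _ in range(iters):
            loss = torch.nn.functional.mse_loss(encoder(image + delta), target_latent)
            loss.backward()
            with torch.no_grad():
                delta -= step * delta.grad.sign()                 # descend on latent-matching loss
                delta.clamp_(-eps, eps)                           # stay within the L-inf budget
                delta.copy_((image + delta).clamp(0, 1) - image)  # keep pixels valid
            delta.grad.zero_()
        return (image + delta).detach()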

However, the authors also call out that for this approach to work in practice, the companies developing these AI models should implement and support the immunization process, rather than relying on individual users to do so.

3) Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles (paper)

Recent vision transformers have added many complex components to improve performance, but this has actually made them slower. The authors found that these extra components aren’t necessary.

By using a strong pre-training method, this paper simplifies the state-of-the-art vision transformer without losing accuracy. The proposed model, called Hiera, is not only more accurate but also significantly faster – when tested on various image and video recognition tasks, Hiera achieved impressive results.

To be more specific, modern hierarchical transformers end up slower due to the overhead of adding spatial bias through vision-specific modules like shifted windows or convolutions. In contrast, Hiera is designed to be as simple as possible: it consists entirely of standard ViT blocks. Instead of building spatial bias into the architecture, the authors teach it to the model using a strong pretext task such as MAE. For efficiency, they use local attention within “mask units” for the first two stages and global attention for the rest.
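
As a rough illustration of the mask-unit idea (my own simplification, not Hiera’s actual implementation), local attention can be computed independently inside fixed token groups by folding each group into the batch dimension:

    import torch

    def mask_unit_attention(x, attn, unit_size):
        # x: (batch, tokens, dim), with tokens ordered so that each
        # consecutive block of `unit_size` tokens forms one "mask unit".
        B, N, D = x.shape
        x = x.reshape(B * (N // unit_size), unit_size, D)  # one unit per batch row
        x = attn(x)                                        # any standard self-attention module
        return x.reshape(B, N, D)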

Language Models: Algorithms and Architecture

1) SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

LLMs like GPT-4 and LLaMA exhibit extraordinary capabilities, but they are very large and expensive to run. This paper sets out to make such LLMs more efficient by compressing them, specifically by pruning them, i.e., removing as many weights as possible without significantly hurting model performance.

The paper introduces SparseGPT, which for the first time enables pruning at the scale of 100-billion-parameter models. The authors show that large-scale GPT-family models can be pruned to at least 50% sparsity in one shot, without any retraining, at minimal loss of accuracy.

SparseGPT can induce uniform layerwise sparsity of up to 60% in, for example, the 175-billion-parameter variant of the OPT family with minor accuracy loss. By contrast, the only known one-shot baseline that easily extends to this scale, magnitude pruning, preserves accuracy only up to 10% sparsity and collapses completely beyond 30% sparsity.
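
For context, the magnitude-pruning baseline is simple enough to sketch in a few lines (SparseGPT itself instead solves, roughly speaking, a layer-wise weight-reconstruction problem, which is what lets it reach much higher sparsities):

    import torch

    def magnitude_prune_(weight, sparsity):
        # One-shot baseline: zero out the smallest-magnitude weights in place.
        k = int(weight.numel() * sparsity)  # number of weights to remove
        if k > 0:
            threshold = weight.abs().flatten().kthvalue(k).values
            weight[weight.abs() <= threshold] = 0.0
        return weight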

For reference, perplexity is a common metric for evaluating language models; the smaller the value, the better the performance. The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
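
A minimal sketch of how perplexity is typically computed with the Hugging Face transformers API (for short inputs, without the sliding window needed for long documents):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def perplexity(text, model_name="gpt2"):
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name).eval()
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss  # mean next-token cross-entropy
        return torch.exp(loss).item()           # lower is better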

The authors execute SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, in under 4.5 hours, and can reach 60% sparsity with minor accuracy loss. Remarkably, more than 100 billion weights from these models can be ignored at inference time.

The authors also find that larger models are more compressible – they drop significantly less accuracy at a fixed sparsity (relative to their smaller counterparts). One intuition behind this is that very large models are potentially over-parameterized.

2) Pretraining Language Models with Human Preferences (paper)

LLMs are trained to imitate internet text, but internet text contains offensive content, PII, and low-quality code. Imitating such content is not what we expect from AI assistants or reliable tools.

One existing mitigation is RLHF; however, LLMs are resistant to forgetting their pretraining data, so this is not the most efficient approach. Another is to filter undesirable content out of the pretraining data, but this can severely handicap an LLM’s capabilities, reduce data diversity, or amplify social biases. Can we instead learn from the entire pretraining dataset but somehow let the LM know what not to imitate?

This study explores different ways to pretrain LMs so they generate text that better aligns with human preferences. The researchers tested five methods of pretraining with human feedback and found that “Conditional Training” was the most effective. This method trains the model to generate text conditioned on signals derived from human preference scores, which significantly reduces undesirable content while maintaining performance on other tasks.
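
A toy sketch of the conditional-training idea, assuming a scalar reward per training segment; the control-token names are illustrative:

    def add_preference_token(text, reward, threshold=0.0,
                             good="<|good|>", bad="<|bad|>"):
        # Tag each pretraining segment with a control token derived from a
        # preference score; the model learns both distributions but can be
        # steered at inference time by conditioning on the "good" token.
        return (good if reward >= threshold else bad) + text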

The study suggests that incorporating human preferences from the beginning of training is better than trying to correct undesirable behavior later: in the paper’s comparison, the toxicity score is lowest for the pretraining-with-feedback approach.

3) Large Language Models Can Be Easily Distracted by Irrelevant Context (paper)

Large language models have performed well on various natural language processing tasks, but they are usually tested on benchmarks where all input information is relevant. This study examines how these models are affected by irrelevant context, introducing a dataset called Grade-School Math with Irrelevant Context (GSM-IC) to test this.

The results show that when irrelevant information is included, the models’ accuracy drops significantly. The study also suggests ways to reduce this issue, such as decoding with self-consistency (Wang et al., 2022c), presenting example problems with irrelevant context in the prompt, and adding instructions to ignore irrelevant information in the prompt.
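
An illustrative probe in the spirit of GSM-IC (wording paraphrased, not the dataset’s exact text), combining an irrelevant sentence with the mitigation instruction the authors found helpful:

    def build_prompt(problem, irrelevant_sentence, instructed=True):
        # Splice an irrelevant sentence into a word problem and optionally
        # prepend an instruction telling the model to ignore distractors.
        instruction = ("Feel free to ignore irrelevant information in the "
                       "problem description.\n") if instructed else ""
        return f"{instruction}Q: {problem} {irrelevant_sentence}\nA:"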

Language Models: Human Impact

1) DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature (paper)

As AI-generated text becomes more common, it’s increasingly important to be able to detect it. In this paper, the authors present a way to identify text generated by large language models (LLMs) by analyzing the model’s probability function.

The proposed method, called DetectGPT, looks for patterns in the text that are unique to LLM-generated content. It doesn’t require training a separate detector or collecting a dataset of real and fake text. Instead, it uses the LLM’s own probabilities and random variations of the text.

To classify a candidate passage x, DetectGPT first generates minor perturbations of the passage using a generic pre-trained model such as T5. It then compares the log probability of the original sample x under the source model with that of each perturbed sample. If the average log ratio is high, the sample is likely from the source model.
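
The core statistic reduces to a few lines; a sketch, assuming callables for the source model’s log probability and for a T5-style rewriter:

    def perturbation_discrepancy(x, log_p, perturb, k=20):
        # log p(x) minus the average log probability of k perturbed rewrites.
        # A large positive value means x sits near a local maximum of the
        # model's log probability, i.e. the source model likely wrote it.
        perturbed = [perturb(x) for _ in range(k)]
        return log_p(x) - sum(log_p(p) for p in perturbed) / k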

DetectGPT was found to be highly effective, outperforming existing methods for detecting AI-generated text. For example, it detected fake news articles generated by a powerful LLM with 95% accuracy, compared to 81% for the best existing method.

2) A Watermark for Large Language Models (paper)

The potential risks of large language models can be reduced by watermarking their output. This involves embedding signals in the generated text that are invisible to humans but detectable by algorithms. The authors propose a watermarking method for proprietary models that has little effect on text quality and can be detected using a simple open-source algorithm without needing access to the model itself.

The watermark works by promoting the use of certain “green” tokens during text generation. In one example from the paper, a watermarked passage would be expected to contain about 9 “green” tokens had it been written by a human, yet it contains 28; the probability of this happening by random chance is ≈ 6×10⁻¹⁴, leaving us extremely certain that the text is machine-generated. The authors also introduce a statistical test to detect the watermark, analyze its effectiveness, test the method on a large model, and discuss its robustness and security.
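
A minimal sketch of the soft green-list scheme, using the previous token id as a stand-in for the paper’s hash function; parameter names are illustrative:

    import torch

    def watermarked_logits(logits, prev_token, gamma=0.5, delta=2.0):
        # Seed an RNG from the previous token, mark a fraction gamma of the
        # vocabulary "green", and add delta to green-token logits before
        # sampling. A detector re-derives the green lists and counts hits.
        gen = torch.Generator().manual_seed(int(prev_token))
        vocab_size = logits.shape[-1]
        green = torch.randperm(vocab_size, generator=gen)[: int(gamma * vocab_size)]
        out = logits.clone()
        out[green] += delta
        return out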

3) Transformers learn in-context by gradient descent (paper)

In-context learning in Transformers refers to the ability of these models to learn patterns, relationships, or rules directly from the input data provided during inference, without any additional training. The way Transformers learn in-context is not well understood and is mostly based on intuition.

This paper suggests that training Transformers on next-token prediction resembles how models learn in gradient-based meta-learning. The authors show that a single linear self-attention layer in a Transformer and gradient descent (GD) on a regression task can produce strikingly similar results.
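
To see the correspondence, consider in-context least squares (a standard derivation following the paper’s setup; notation mine, in LaTeX). One gradient step on the context loss changes the prediction on a query x by a dot-product-weighted sum over the context examples, which is exactly the form a linear self-attention layer computes:

    L(W) = \frac{1}{2N} \sum_{i=1}^{N} \lVert W x_i - y_i \rVert^2,
    \qquad
    W' = W - \eta \nabla_W L(W) = W - \frac{\eta}{N} \sum_{i=1}^{N} (W x_i - y_i)\, x_i^{\top}

    \hat{y} = W' x = W x - \frac{\eta}{N} \sum_{i=1}^{N} (W x_i - y_i) \left( x_i^{\top} x \right)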

They found that Transformers trained on simple regression tasks often learn in a way that mimics gradient descent. This suggests that Transformers act as “mesa-optimizers,” learning models through gradient descent during their forward pass. This insight helps us better understand how Transformers learn in context, especially in regression tasks.

Additionally, the paper shows that Transformers can outperform plain gradient descent by learning to make corrections and handle complex tasks.

Multimodal

1) Cross-Modal Fine-Tuning: Align then Refine

Significant advancements have been made in modalities such as vision and NLP by fine-tuning large pre-trained models. However, other modalities have not seen the same level of progress due to a lack of suitable pre-trained models.

To address this, the authors introduce ORCA, a versatile cross-modal fine-tuning framework that enables a single large pre-trained model to be applied to various modalities.

ORCA uses a three-step process to adapt large pre-trained models to new tasks (a toy sketch follows the list):

  • It creates two adapters: one to match the input data to the pre-trained model, and another to convert the output to the desired format.
  • It trains the input adapter to make the target data look similar to the data the pre-trained model was trained on.
  • It fine-tunes all three components (input adapter, pre-trained model, and output adapter) together to minimize the error on the target task.
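
Here is that sketch: a runnable toy version under strong simplifying assumptions (tiny linear stand-ins for the adapters and the pretrained body, and crude moment matching in place of the paper’s optimal-transport-based alignment distance):

    import torch
    import torch.nn as nn

    embedder  = nn.Linear(16, 64)                            # input adapter
    body      = nn.Sequential(nn.Linear(64, 64), nn.ReLU())  # stand-in for the pretrained model
    predictor = nn.Linear(64, 3)                             # output adapter

    x_tgt     = torch.randn(128, 16)         # target-modality inputs
    y_tgt     = torch.randint(0, 3, (128,))  # target labels
    feats_src = torch.randn(128, 64)         # cached features of the source (pretraining) data

    # Align: train only the embedder so embedded target data matches the
    # source feature statistics (crude stand-in for the paper's OT distance).
    opt = torch.optim.Adam(embedder.parameters(), lr=1e-3)
    for _ in range(200):
        z = embedder(x_tgt)
        loss = ((z.mean(0) - feats_src.mean(0)) ** 2).sum() + \
               ((z.std(0) - feats_src.std(0)) ** 2).sum()
        opt.zero_grad(); loss.backward(); opt.step()

    # Refine: fine-tune embedder, body, and predictor end to end on the task.
    model = nn.Sequential(embedder, body, predictor)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(200):
        loss = nn.functional.cross_entropy(model(x_tgt), y_tgt)
        opt.zero_grad(); loss.backward(); opt.step()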

The experiments in the paper show that ORCA achieves state-of-the-art results on 60+ datasets from 12 different domains, outperforming a range of specialized and automated methods.

2) Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding (paper)

Language closely tied to visuals is common in various sources, like textbooks with diagrams, web pages with images and tables, and mobile apps with buttons and forms. Previously, research in this area was often limited to specific domains, with little sharing of data or methods.

The authors introduce Pix2Struct, a model designed to understand visual language by converting masked screenshots of web pages into simplified HTML. This approach leverages the rich visual elements found on the web, which provide a diverse source of training data.

Pix2Struct also includes features like variable-resolution input and flexible integration of text and image inputs, such as overlaying text prompts on images. The authors demonstrate that a single pretrained model can achieve top results in six out of nine tasks across four different areas: documents, illustrations, user interfaces, and natural images.

3) StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis (paper)

Text-to-image synthesis has improved with large pretrained models and new techniques like diffusion and autoregressive models. However, these models are slow, needing multiple evaluations per image. In contrast, Generative Adversarial Networks (GANs) are faster because they only require one pass, but they lag behind in quality.

This paper introduces StyleGAN-T, a model designed to be competitive by addressing issues like capacity, training stability, text alignment, and variation control. StyleGAN-T shows significant improvements over previous GANs and outperforms the fastest previous models in both quality and speed.

Tutorials

Here are the tutorials I attended and found valuable. I’ll summarize some of my key takeaways below, but I highly recommend watching the videos linked for more in-depth information.

1) Reinforcement Learning from Human Feedback (slides)

This tutorial begins with a history of preference models and provides a technical overview of Reinforcement Learning from Human Feedback (RLHF). It covers the three key phases: the base language model, preference collection and reward model training, and Reinforcement Learning (RL) optimization.

The session also discusses Proximal Policy Optimization (PPO) and fine-tuning with RL. The presenters then delve into human annotation for RLHF, including data labeling and Supervised Fine-Tuning.
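
In the RL phase, the policy is typically optimized against a KL-regularized objective of roughly the following form (standard in the RLHF literature; notation mine, in LaTeX), where r_φ is the learned reward model, π_ref the supervised fine-tuned reference model, and β controls how far the policy may drift:

    \max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ r_\phi(x, y) \right]
    \;-\; \beta \, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)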

2) Multimodal Machine Learning: Principles, Challenges, and Open Questions

This talk begins with an introduction to Multimodal Machine Learning and outlines its key challenges, including Representation, Alignment, Reasoning, Generation, Transference, and Quantification. It then explores various fusion concepts, such as Tensor Fusion, Gated Fusion, Modality Shifting Fusion, and Non-Linear Fusion.

Additionally, the talk covers the use of Multimodal Masked Autoencoders for Image Representation Learning, along with Contrastive Learning and Cross-Modal Attention. It concludes with a discussion on future directions in the field.

3) Graph Neural Networks in Tensorflow: a practical guide (recording)

Graph Neural Networks (GNNs) have garnered significant attention for their ability to leverage graph-structured data within neural network models. However, deploying and scaling GNNs on large, complex datasets remains a substantial challenge for machine learning platforms.

This tutorial offers a guide to using GNNs with TensorFlow, focusing on TF-GNN, a library designed for working with graph-structured data. For further details, refer to the associated paper.

4) A Practical Tutorial to Machine Learning with Differential Privacy

This Google talk begins with an introduction to Differential Privacy in Machine Learning, explaining how to integrate differential privacy into ML models, including the DP-SGD Algorithm. It concludes with practical aspects of Differential Privacy training. For a detailed exploration of these topics, refer to the associated paper and blog post.
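
The heart of DP-SGD fits in a few lines; a sketch, assuming per-example gradients are already available as arrays:

    import numpy as np

    def dp_sgd_step(params, per_example_grads, lr=0.1, clip=1.0,
                    noise_mult=1.0, rng=None):
        rng = rng or np.random.default_rng()
        # Clip each per-example gradient to L2 norm `clip`, average, then add
        # Gaussian noise calibrated to the clipping bound before the update.
        clipped = [g / max(1.0, np.linalg.norm(g) / clip) for g in per_example_grads]
        noisy_mean = np.mean(clipped, axis=0) + rng.normal(
            0.0, noise_mult * clip / len(per_example_grads), size=params.shape)
        return params - lr * noisy_mean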

5) Self-Supervised Learning in Vision (slides)

This talk begins with an introduction to Self-Supervised Learning, which enables large-scale representation learning, and discusses Masked Autoencoders (MAE) as a key advancement toward scalable vision models. It also touches on self-supervised learning techniques applied to masked video and audio.

The talk then delves into leveraging unlabeled data to develop an effective feature space. It concludes with best practices for applying Self-Supervised Learning. For more details, refer to A Cookbook of Self-Supervised Learning.
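
As a concrete illustration of the masked-autoencoder recipe mentioned above, here is a toy sketch of MAE-style random patch masking (my own simplification, not the talk’s code):

    import torch

    def mae_random_masking(patches, mask_ratio=0.75):
        # patches: (batch, num_patches, dim). Keep a random subset of patch
        # tokens for the encoder; the decoder reconstructs the masked rest.
        B, N, D = patches.shape
        keep = int(N * (1 - mask_ratio))
        keep_idx = torch.rand(B, N).argsort(dim=1)[:, :keep]  # random subset per image
        visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
        return visible, keep_idx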

