In this paper, a deep learning-based approach is developed to classify images of galaxies into three major categories, namely elliptical, spiral, and irregular. The classifier successfully classified the images with an accuracy of 97.3958%, outperforming conventional classifiers such as Support Vector Machines and Naive Bayes. The convolutional neural network architecture consists of an input convolution layer with 16 filters, followed by 4 hidden layers, a penultimate dense layer, and an output softmax layer. The model was trained on 4,614 images for 200 epochs on an NVIDIA DGX-1 Tesla V100 supercomputer and was subsequently tested on new images to evaluate its robustness and accuracy.

A significant challenge facing deep neural networks is catastrophic forgetting of previous tasks or knowledge. Several existing solutions take a task-aware approach, providing the task identity to the model at training and test time. In addition, it is not known how best to incorporate new distributions and modalities, or under what conditions expanding model capacity becomes necessary. In line with the latter, there has been renewed interest in the "Mixture-of-Experts" (MoE) paradigm, and research into scaling up transformer-based MoE architectures (vision and language) with large uni/multimodal pretraining datasets has gained momentum. The main bottleneck in training MoE architectures is obtaining a robust, well-trained gating function; in its absence, the model can undergo an uncontrolled increase in size and, depending on the expansion criteria used, suffer expert collapse. Although previous works have proposed several approaches to address MoE training issues, training such architectures continually remains a challenge that has not been sufficiently explored. At the same time, gating and specialization can help prevent forgetting of important features from previous tasks. We investigate the ability to learn sequences of tasks with a sparse MoE model. To that end, we introduce a new MoE loss, which we term Gating Distillation ($L_{GATE}$), and highlight the advantages and disadvantages of different gating approaches for continual learning of Mixture-of-Experts (MoEs). In addition, we investigate model expansion using a sparse MoE model in the task-agnostic setting.

Benefiting from their ability to build interdependencies among channels or spatial locations, attention mechanisms have recently been studied extensively and applied broadly in a variety of computer vision tasks. In this paper, we investigate lightweight but effective attention mechanisms and present triplet attention, a novel method for computing attention weights that captures cross-dimension interaction using a three-branch structure. For an input tensor, triplet attention builds inter-dimensional dependencies via rotation operations followed by residual transformations, and encodes inter-channel and spatial information with negligible computational overhead. Our method is simple as well as efficient, and can easily be plugged into classic backbone networks as an add-on module. We demonstrate the effectiveness of our method on various challenging tasks, including image classification on ImageNet-1k and object detection on the MS COCO and PASCAL VOC datasets. Furthermore, we provide extensive insight into the performance of triplet attention by visually inspecting GradCAM and GradCAM++ results. The empirical evaluation of our method supports our intuition about the importance of capturing dependencies across dimensions when computing attention weights.
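As a rough illustration of the galaxy classifier described at the start of this section, the sketch below builds a small CNN with a 16-filter input convolution, four hidden convolutional blocks, a penultimate dense layer, and a softmax output over the three galaxy classes. The specific filter progression, kernel sizes, and 128x128 input resolution are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GalaxyCNN(nn.Module):
    """Sketch of a CNN along the lines described above (assumed sizes)."""

    def __init__(self, num_classes: int = 3):
        super().__init__()
        channels = [16, 32, 64, 128, 256]  # assumed filter progression
        layers, in_ch = [], 3
        for out_ch in channels:
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 4 * 4, 128),  # penultimate dense layer
            nn.ReLU(inplace=True),
            nn.Linear(128, num_classes),  # softmax is applied inside CrossEntropyLoss
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Expects 3-channel 128x128 inputs: five 2x poolings give 4x4 maps.
        return self.classifier(self.features(x))
```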
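To make the sparse gating discussed in the MoE abstract above concrete, here is a minimal top-1 routed MoE layer. The exact form of the Gating Distillation loss $L_{GATE}$ is not given here; the version below is an assumed reading, a KL penalty keeping the current routing distribution close to that of a frozen gate from the previous task. All names and the KL formulation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Minimal sparse MoE layer with hard top-1 routing."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor):
        probs = F.softmax(self.gate(x), dim=-1)  # (batch, num_experts)
        top1 = probs.argmax(dim=-1)              # hard top-1 routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = top1 == i
            if sel.any():
                out[sel] = expert(x[sel])        # only the chosen expert runs
        return out, probs

def gating_distillation_loss(probs: torch.Tensor, old_probs: torch.Tensor):
    # Assumed form of L_GATE: KL(old routing || current routing), where
    # old_probs comes from a frozen copy of the gate from the previous task.
    return F.kl_div(probs.clamp_min(1e-8).log(), old_probs, reduction="batchmean")
```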
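The three-branch structure of triplet attention can likewise be sketched compactly: each branch rotates the input so a different pair of dimensions interacts, pools it, and produces attention weights through a small convolution. The 7x7 kernel and max/mean pooling follow the commonly described formulation; treat this as a sketch rather than the reference implementation.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    def forward(self, x):
        # Concatenate max- and mean-pooled features along the channel dim.
        return torch.cat([x.max(dim=1, keepdim=True).values,
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.cw = AttentionGate()  # branch capturing (C, W) interaction
        self.ch = AttentionGate()  # branch capturing (C, H) interaction
        self.hw = AttentionGate()  # plain spatial (H, W) attention

    def forward(self, x):
        # Rotate H into the channel position, attend, rotate back.
        x_cw = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        # Rotate W into the channel position, attend, rotate back.
        x_ch = self.ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        # Attend over the spatial dimensions directly.
        x_hw = self.hw(x)
        return (x_cw + x_ch + x_hw) / 3.0  # average the three branches
```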
Dense information retrieval yields strong in-domain performance but often struggles with out-of-domain generalization, lagging behind unsupervised methods. Retrieval tasks can vary across a number of dimensions, including domain, query intent, and language, and using a single dense retrieval model for all tasks often underperforms lexical methods such as BM25. For practical information retrieval systems, it is expensive to deploy a different model for each task. Our motivation is therefore to develop a cheap and effective information retrieval model that maintains strong performance across different domains while easily adapting to any new domain. Other approaches to domain transfer in information retrieval rely on large auxiliary language models or datasets and create a separate model for each task. In this work, we develop a method that uses prompt tuning to efficiently adapt dense retrievers with a minimal amount of additional computation. By combining models trained on a variety of different domains, we can effectively boost performance on a target task in a new domain. Specifically, we train dense retrieval models using prompt tuning on a large number of information retrieval tasks across diverse domains and types of query intent. To adapt to a new domain, we create new prompt embeddings by averaging the prompt embeddings from a set of source tasks selected in an unsupervised manner. We evaluate zero-shot transfer performance across a wide variety of information retrieval domains and show competitive performance while leveraging a minimal amount of compute. Notably, our SPIRIT method achieves this while being extremely lightweight and practical to deploy in production.

The size and prevalence of large language models (LLMs) make them an apt target for model compression. Most such LLMs consist of a Transformer encoder and decoder, each with 6 to 12 layers of multi-headed self-attention blocks along with fully connected layers. This results in a large number of parameters, making the models quite expensive to train and query. Our work focuses on finding techniques to prune CodeBERT, a specific LLM trained to work multimodally between text and code. We explore the effects of structured and unstructured magnitude pruning on the encoder layers of CodeBERT, evaluating on the task of generating natural-language comments from a piece of Ruby code.

Printable-electronics-based electromagnetic absorbers are receiving increasing attention from the electromagnetics community because of their unprecedented advantages. This paper presents the design of printable electromagnetic absorbers for the X band. The design of the absorber is optimized using a Genetic Algorithm (GA) to enhance both the absorptivity and the absorption bandwidth. It involves placing several square-shaped conductive-ink patches at optimal locations on a paper substrate such that the desired absorption characteristics are obtained. Simulations are carried out using the HFSS simulation software. The optimized structure offers an absorptivity of more than 90% across the X band, making it a viable solution for stealth applications.
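The prompt-combination step described in the retrieval abstract above reduces to a simple average over learned soft prompts. The sketch below shows that step; the tensor shapes and the way source tasks are chosen are illustrative assumptions, not the method's exact configuration.

```python
import torch

def combine_source_prompts(source_prompts: list[torch.Tensor]) -> torch.Tensor:
    """Average soft prompts, each a (prompt_len, hidden_dim) tensor of tokens."""
    return torch.stack(source_prompts, dim=0).mean(dim=0)

# Example: average prompts learned on three source retrieval tasks (assumed to
# have been selected by an unsupervised similarity heuristic between corpora).
prompt_len, hidden = 20, 768
source_prompts = [torch.randn(prompt_len, hidden) for _ in range(3)]
target_prompt = combine_source_prompts(source_prompts)  # shape (20, 768)
```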
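For the CodeBERT study, the two pruning variants mentioned above can be expressed with PyTorch's built-in pruning utilities. This is a minimal sketch, not the work's exact setup: the 30% sparsity level is arbitrary, and pruning is applied uniformly to every linear layer in the encoder.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import AutoModel

model = AutoModel.from_pretrained("microsoft/codebert-base")

for module in model.encoder.modules():
    if isinstance(module, nn.Linear):
        # Unstructured: zero the 30% smallest-magnitude weights in the layer.
        prune.l1_unstructured(module, name="weight", amount=0.3)
        # Structured alternative: remove whole output rows by L2 norm instead.
        # prune.ln_structured(module, name="weight", amount=0.3, n=2, dim=0)
        prune.remove(module, "weight")  # bake the mask into the weights
```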
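The GA-driven patch placement in the absorber paper can be caricatured as a binary optimization over a printing grid. In the sketch below, the fitness function is a runnable stand-in for the full-wave HFSS simulation, which cannot be reproduced here; the grid size, population size, and operators are all illustrative assumptions.

```python
import random

GRID, POP, GENS = 8, 30, 50  # 8x8 placement grid; illustrative sizes

def fitness(layout: list[int]) -> float:
    # Stand-in for a full-wave simulation (e.g., HFSS) that would return the
    # X-band absorptivity of this patch layout; here we simply reward patch
    # coverage so the example runs end to end.
    return sum(layout) / len(layout)

def evolve() -> list[int]:
    # Each individual is a bit grid: 1 = print a conductive-ink patch here.
    pop = [[random.randint(0, 1) for _ in range(GRID * GRID)] for _ in range(POP)]
    for _ in range(GENS):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: POP // 2]                  # elitist selection
        children = []
        while len(children) < POP - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, GRID * GRID)  # one-point crossover
            child = a[:cut] + b[cut:]
            i = random.randrange(GRID * GRID)       # single-bit mutation
            child[i] ^= 1
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```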
Standard gradient descent algorithms applied to sequences of tasks are known to produce catastrophic forgetting in deep neural networks. When trained on a new task in a sequence, the model updates its parameters on the current task and forgets past knowledge. This article explores scenarios where we scale the number of tasks in a finite environment: long sequences of tasks with reoccurring data. We show that in such a setting, stochastic gradient descent can learn, progress, and converge to a solution that, according to the existing literature, would require a continual learning algorithm. In other words, the model retains and accumulates knowledge without any specific memorization mechanism. We propose a new experimentation framework, SCoLe (Scaling Continual Learning), to study the knowledge retention and accumulation of algorithms in potentially infinite sequences of tasks. To explore this setting, we performed a large number of experiments on sequences of 1,000 tasks to better understand this new family of settings. We also propose a slight modification to vanilla stochastic gradient descent to facilitate continual learning in this setting. The SCoLe framework is a good simulation of practical training environments with reoccurring situations and allows the study of convergence behavior in long sequences. Our experiments show that previous results on short scenarios cannot always be extrapolated to longer ones.
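A SCoLe-style protocol can be sketched as a long stream of tasks, each a random subset of classes drawn from a fixed dataset, trained with plain SGD and no replay or regularization. The dataset, subset size, model, and schedule below are illustrative assumptions, not the framework's exact configuration.

```python
import random
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

data = datasets.MNIST("data", train=True, download=True,
                      transform=transforms.ToTensor())
model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(28 * 28, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()
labels = data.targets.tolist()

for task in range(1000):                      # long sequence of tasks
    classes = random.sample(range(10), 2)     # class subsets reoccur over time
    idx = [i for i, y in enumerate(labels) if y in classes]
    loader = DataLoader(Subset(data, idx), batch_size=64, shuffle=True)
    for x, y in loader:                       # one SGD pass per task
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
```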
In the past, I have been fortunate to work with the likes of Dr. Amrita Chaturvedi of the Indian Institute of Technology, Varanasi (IIT-BHU) in the field of biomedical data analysis, and Vijay Kumar Verma of the Indian Space Research Organization (ISRO) in the domain of Genetic Algorithms.

Language models demonstrate both quantitative improvements and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include the following: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; and social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Learning under constraints has been a fundamental avenue of research in deep learning since the advent of modern deep neural networks. In parallel with the upward trajectory of scaling neural networks, one practical constraint that has embodied efficient deep learning is sparsity. Unstructured weight sparsity has been the cornerstone of pioneering work on pruning and the lottery ticket hypothesis. In this paper, we propose \textbf{$\mathcal{D}^2$-Sparse}, a novel dual dynamic sparse learning system for the low-data learning regime. Our paper combines two popular constraints in deep learning, sparsity and low-data learning, which are often studied in disjoint paradigms, thus opening new directions of research in sparsity. $\mathcal{D}^2$-Sparse outperforms standard iterative pruning schemes when coupled with standard deep networks on computer vision tasks such as image classification and on natural language processing tasks such as code generation, with no extra inference overhead. Compared to iterative pruning on a $\frac{1}{8}$ total data budget, $\mathcal{D}^2$-Sparse achieves an approximately 4% top-1 accuracy boost for ResNet-18 on the CIFAR-100 classification task. Further, we demonstrate the effectiveness of the proposed method in anytime learning scenarios and provide extensive analysis of the evolution of sparse masks in $\mathcal{D}^2$-Sparse over the course of training. Code, dashboard, and model weights will be open-sourced for public access upon acceptance.
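The exact $\mathcal{D}^2$-Sparse update is not spelled out in the abstract above, so as a generic illustration of dynamic sparse learning, here is a RigL-style mask step: drop the smallest-magnitude active weights and regrow the same number of inactive weights where the gradient magnitude is largest. The drop fraction and all names are assumptions, and this should not be read as the paper's method.

```python
import torch

def update_mask(weight: torch.Tensor, grad: torch.Tensor,
                mask: torch.Tensor, drop_frac: float = 0.1) -> torch.Tensor:
    """One dynamic-sparsity step on a contiguous boolean mask (same shape as weight)."""
    k = int(drop_frac * mask.sum().item())
    if k == 0:
        return mask
    # Drop: deactivate the k active weights with the smallest magnitude.
    active = weight.abs().masked_fill(~mask, float("inf"))
    drop_idx = torch.topk(active.flatten(), k, largest=False).indices
    mask.view(-1)[drop_idx] = False
    # Regrow: activate the k inactive weights with the largest gradient.
    inactive = grad.abs().masked_fill(mask, float("-inf"))
    grow_idx = torch.topk(inactive.flatten(), k).indices
    mask.view(-1)[grow_idx] = True
    return mask
```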
Improving the performance of deep networks in data-limited regimes has warranted much attention. In this work, we empirically show that "winning tickets" (small subnetworks) obtained via magnitude pruning based on the lottery ticket hypothesis are not only sparse but also effective recognizers in data-limited regimes. Based on extensive experiments, we find that in low-data regimes (datasets of 50-100 examples per class), sparse winning tickets substantially outperform the original dense networks. This approach, when combined with augmentations or fine-tuning from a self-supervised backbone network, shows further improvements of as much as 16% (absolute) on low-sample datasets and long-tailed classification. Further, sparse winning tickets are more robust to synthetic noise and distribution shifts than their dense counterparts. Our analysis of winning tickets on small datasets indicates that, though sparse, the networks retain density in the initial layers and their representations are more generalizable.

[NeurIPS 2022] "Sparse Winning Tickets are Data-Efficient Image Recognizers" by Mukund Varma T, Xuxi Chen, Zhenyu Zhang, Tianlong Chen, Subhashini Venugopalan, Zhangyang Wang
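For readers unfamiliar with the lottery-ticket procedure behind winning tickets, the sketch below shows the standard loop: train, prune the smallest-magnitude weights globally, rewind the survivors to their initial values, and repeat. The 20% per-round rate is the usual convention and an assumption here, and `train_fn` is a stand-in for a full training loop rather than anything from the paper.

```python
import copy
import torch.nn as nn
import torch.nn.utils.prune as prune

def find_winning_ticket(model: nn.Module, train_fn, rounds: int = 3) -> nn.Module:
    init_state = copy.deepcopy(model.state_dict())  # saved initialization
    targets = [(m, "weight") for m in model.modules()
               if isinstance(m, (nn.Linear, nn.Conv2d))]
    for _ in range(rounds):
        train_fn(model)                             # user-supplied training loop
        prune.global_unstructured(targets,          # prune 20% of remaining weights
                                  pruning_method=prune.L1Unstructured,
                                  amount=0.2)
        # Rewind surviving weights to their initial values; the mask persists.
        for name, module in model.named_modules():
            if isinstance(module, (nn.Linear, nn.Conv2d)):
                module.weight_orig.data.copy_(init_state[name + ".weight"])
    return model
```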