While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech, which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches. Models and code are publicly available.

Figure 1: Illustration of how data2vec follows the same learning process for different modalities. The model first produces representations of the original input example (teacher mode), which are then regressed by the same model based on a masked version of the input (student mode). The teacher parameters are an exponentially moving average of the student weights. The student predicts the average of K network layers of the teacher (shaded in blue).

Research in self-supervised algorithms has focused on individual modalities, which results in specific designs and learning biases. For example, in speech processing there is no vocabulary of speech units over which we can define a self-supervised learning task such as words in NLP,¹ and therefore several prominent models are equipped with mechanisms to learn an inventory of speech units (Baevski et al., 2020b; Hsu et al., 2021). A similar problem exists for computer vision, where researchers either learn discrete visual tokens (Radford et al., 2021a; Bao et al., 2021), regress the input (He et al., 2021) or learn representations invariant to data augmentation (Chen et al., 2020; Grill et al., 2020; Caron et al., 2021). Our method combines masked prediction (Devlin et al., 2019; Baevski et al., 2020b; Bao et al., 2021) with the learning of latent target representations (Grill et al., 2020; Caron et al., 2021), but generalizes the latter by using multiple network layers as targets and shows that this approach works across several modalities.

¹ This is true for many languages but not for certain Asian languages.
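The learning process described above (teacher mode on the full input, student mode on a masked view, targets averaged over the top K layers, teacher tracked as an exponential moving average of the student) can be sketched in a few lines. This is a minimal toy illustration, not the actual implementation: a stack of tanh layers stands in for the shared Transformer encoder, a simple squared-error regression loss stands in for the paper's loss, and all names (`make_encoder`, `forward`, `ema_update`, `DIM`, `K`, `tau`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, DEPTH, K = 16, 4, 3  # toy sizes; the paper averages the top K teacher layers

def make_encoder():
    # toy stand-in for the shared Transformer encoder: a stack of tanh layers
    return [rng.standard_normal((DIM, DIM)) / np.sqrt(DIM) for _ in range(DEPTH)]

def forward(weights, x):
    """Return the hidden state of every layer for input x of shape (T, DIM)."""
    hiddens = []
    for W in weights:
        x = np.tanh(x @ W)
        hiddens.append(x)
    return hiddens

def ema_update(teacher, student, tau=0.999):
    # teacher parameters are an exponentially moving average of the student's
    return [tau * Wt + (1 - tau) * Ws for Wt, Ws in zip(teacher, student)]

# one pass of the process (the student's gradient step is omitted here)
student = make_encoder()
teacher = [W.copy() for W in student]

x = rng.standard_normal((10, DIM))      # 10 time steps of the raw input
mask = np.arange(10) % 2 == 0           # time steps hidden from the student

# teacher mode: contextualized targets = average of the top K layers,
# computed from the full, unmasked input
target = np.mean(forward(teacher, x)[-K:], axis=0)

# student mode: the same model regresses those targets from a masked view
x_masked = np.where(mask[:, None], 0.0, x)
pred = forward(student, x_masked)[-1]

# regression loss at the masked positions only
loss = np.mean((pred[mask] - target[mask]) ** 2)

# in full training, a gradient step on the student would precede this update
teacher = ema_update(teacher, student)
```

Because the targets come from intermediate layers of the full, unmasked input, they are contextualized: each target vector depends on the whole sequence, not just the local token being predicted.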