ECCV 2024

Self-Supervised Skeleton Representation Learning

S-JEPA: A Joint Embedding Predictive Architecture for Skeletal Action Recognition

École polytechnique fédérale de Lausanne (EPFL)

S-JEPA predicts latent joint representations instead of reconstructing raw coordinates. The result is stronger and more transferable skeleton features.

Predict semantic motion cues instead of raw coordinates

Comparison between previous masked prediction targets and S-JEPA latent representation targets

Instead of recovering noisy low-level joint coordinates, S-JEPA predicts abstract latent targets. That shift pushes the model toward context, structure, and action semantics.

A cleaner pretraining target for action understanding

Masked self-reconstruction is effective for skeletal action recognition, but raw coordinate targets can overemphasize noisy local details. S-JEPA replaces those targets with latent representations of missing joints from the same sequence.

This objective encourages the model to capture higher-level context and depth cues. A simple centering operation further stabilizes training.
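As a minimal sketch of the centering idea, assuming a DINO-style running-mean center maintained with an exponential moving average (the function and hyperparameter names here are hypothetical, not the paper's exact formulation):

```python
import numpy as np

def center_targets(targets, center, momentum=0.9):
    """Subtract a running mean from target features to discourage collapse.

    targets: (batch, dim) latent features from the target encoder.
    center:  (dim,) running mean, updated by EMA (hypothetical momentum).
    """
    centered = targets - center
    new_center = momentum * center + (1.0 - momentum) * targets.mean(axis=0)
    return centered, new_center

rng = np.random.default_rng(0)
feats = rng.normal(loc=2.0, size=(8, 4))  # latent targets with a shared offset
center = np.zeros(4)
for _ in range(100):  # repeated updates pull the center toward the batch mean
    centered, center = center_targets(feats, center)
```

After enough updates the centered targets have near-zero mean, which removes the trivial constant solution from the prediction objective.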

With a vanilla transformer backbone, S-JEPA reaches state-of-the-art performance on NTU60, NTU120, and PKU-MMD across linear probing, fine-tuning, semi-supervised learning, and transfer learning.

A simple pipeline with a more meaningful objective

Overview diagram of the S-JEPA architecture

S-JEPA uses motion-aware masking, a view encoder, an EMA target encoder, and latent-space cross-entropy to learn stronger skeleton representations.

The method keeps the architecture simple while moving the learning signal to a more useful target space.

Design choices that make the representation stronger

Motion-aware masking

High-motion, informative joints are masked more often, keeping the prediction task challenging and meaningful.

Latent target prediction

The predictor matches latent targets instead of regressing raw coordinates.

EMA target encoder

A momentum encoder provides stable targets and helps avoid collapse.
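The momentum update itself is a one-liner per parameter; a minimal sketch (the momentum value here is a common choice, not necessarily the paper's):

```python
import numpy as np

def ema_update(target_params, online_params, momentum=0.996):
    """Move target-encoder weights slowly toward the online encoder.

    A momentum close to 1 keeps the prediction targets stable
    from one training step to the next, which helps avoid collapse.
    """
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]

online = [np.ones((2, 2))]   # stand-in for the online encoder's weights
target = [np.zeros((2, 2))]  # target encoder starts elsewhere
for _ in range(1000):
    target = ema_update(target, online)
```

Over many steps the target weights track the online weights, but slowly enough that the latent targets change smoothly.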

Geometric views

Rotations, translations, and flips improve robustness without changing semantics.
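A sketch of such a geometric view, with hypothetical augmentation ranges (the paper's exact settings may differ; a faithful skeleton flip would also swap left/right joint indices, which is omitted here for brevity):

```python
import numpy as np

def random_view(seq, rng):
    """Apply a random rotation about the vertical axis, a global
    translation, and an optional mirror flip to a (frames, joints, 3)
    skeleton sequence. Ranges are illustrative assumptions.
    """
    theta = rng.uniform(-np.pi / 6, np.pi / 6)
    rot = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(theta), 0.0, np.cos(theta)]])
    out = seq @ rot.T
    out = out + rng.uniform(-0.1, 0.1, size=3)  # global translation
    if rng.random() < 0.5:
        out = out.copy()
        out[..., 0] *= -1.0                     # mirror the x axis
    return out

rng = np.random.default_rng(0)
seq = rng.normal(size=(16, 25, 3))
view = random_view(seq, rng)
```

All three transforms are rigid or reflective, so inter-joint distances, and hence the action semantics, are preserved while the absolute coordinates change.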

Centered targets

Centering and sharpening improve convergence and feature quality.

Simple backbone

A vanilla transformer is enough when the pretraining target is chosen well.

Strong transfer across evaluation protocols

Linear Eval

89.8%

NTU60 XView with frozen features.

Fine-Tuning

97.6%

Best NTU60 XView result reported in the paper.

Semi-Supervised

67.5%

NTU60 XSub with only 1% labels.

Transfer

74.2%

Best PKU-II transfer result when pretrained on NTU120.

Reference the paper

@inproceedings{abdelfattah2024s,
  title={{S-JEPA}: A Joint Embedding Predictive Architecture for Skeletal Action Recognition},
  author={Abdelfattah, Mohamed and Alahi, Alexandre},
  booktitle={European Conference on Computer Vision},
  pages={367--384},
  year={2024},
  organization={Springer}
}