Stars
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
PyTorch implementation of Wasserstein Adversarial Proximal Policy Optimization (WAPPO).
Code for Reinforcement Learning from Vision Language Foundation Model Feedback
Official Implementation of RoboCLIP (NeurIPS 2023)
Parameter-efficient Fine Tuning for Clinical LLMs
Note: DO NOT USE IT! THIS CODE IS PROVEN TO CONTAIN DATA LEAKAGE! Archive version of "Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval (CVPR 2024 Highlight)"
An official implementation for "X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval"
[CVPR 2024] TeachCLIP for Text-to-Video Retrieval
Eagle: Frontier Vision-Language Models with Data-Centric Strategies
Frontier Multimodal Foundation Models for Image and Video Understanding
openvla / openvla
OpenVLA: An open-source vision-language-action model for robotic manipulation (forked from TRI-ML/prismatic-vlms).
Qwen3-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud.
VideoLLM-online: Online Video Large Language Model for Streaming Video (CVPR 2024)
Sentence Embeddings using Siamese SKT KoBERT
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Democratization of RT-2 "RT-2: New model translates vision and language into action"
Mastering Diverse Domains through World Models
Codebase for Automated Creation of Digital Cousins for Robust Policy Learning
A generative and self-guided robotic agent that endlessly proposes and masters new skills.
Octo is a transformer-based robot policy trained on a diverse mix of 800k robot trajectories.