OpenMOSS presents a collection of our research on Large Language Models and Multimodal Foundation Models, supported by Shanghai Innovation Institute (SII), Fudan University, and MOSI.AI.
📌 This page is a curated overview. For the complete and most recent list of repositories, visit the OpenMOSS organization.
Last updated: 2026-05-27
Foundation language models and training infrastructure.
| Project | Description |
|---|---|
| MOSS | An open-source tool-augmented conversational language model from Fudan University — the founding project of the OpenMOSS series. |
| CoLLiE | A library for efficient collaborative training of large language models. |
Multimodal models for visual and video understanding.
| Project | Description |
|---|---|
| MOSS-VL | Core multimodal model series within the OpenMOSS ecosystem, dedicated to visual understanding. Includes the XRoPE architecture and a fully open training stack. |
| MOSS-Video-Preview | A real-time video understanding foundation model built on Llama-3.2-Vision, with comprehensively extended video processing and multimodal reasoning capabilities. |
End-to-end models for audio understanding and generation — speech, sound, music.
| Project | Description |
|---|---|
| MOSS-TTS-Nano | A 0.1B-parameter open-source multilingual TTS model that runs in real time on CPU alone, designed for local demos, web serving, and lightweight product integration. |
| MOSS-TTS | Open-source TTS family covering stable long-form speech, multi-speaker dialogue, voice/character design, environmental sound effects, and real-time streaming TTS. |
| MOSS-TTSD · 🤗 HF | Spoken dialogue generation model with expressive multi-speaker synthesis, long-context modeling, flexible speaker control, multilingual support, and zero-shot voice cloning. |
| MOSS-Audio | Open-source foundation model for unified audio understanding — speech, sound, music, captioning, QA, and reasoning in real-world scenarios. |
| MOSS-Audio-Tokenizer | Causal Transformer-based audio tokenizer built on the CAT architecture. Trained on 3M hours of audio, supports streaming and variable bitrates, delivers SOTA reconstruction. |
| MOSS-Speech | A true speech-to-speech large language model without text guidance. |
| MOSS-Music | Music understanding model for captioning, lyrics ASR, structural analysis, chord/key/tempo reasoning, and long-form musical QA. |
| SpeechGPT-2.0-preview | GPT-4o-level, real-time spoken dialogue system. |
Unified multimodal generation across modalities.
| Project | Description |
|---|---|
| AnyGPT | Unified multimodal LLM with discrete sequence modeling. |
| MOVA | Towards scalable and synchronized video–audio generation. |
Embodied AI: humanoid control, robotic manipulation, and embodied planning.
| Project | Description |
|---|---|
| FRoM-W1 | Towards general humanoid whole-body control with language instructions (arXiv 2026). Supports Unitree H1/G1 and FFTAI humanoid robots. |
| RoboOmni | Proactive robot manipulation in omni-modal context. |
| Embodied-Planner-R1 · arXiv | A reinforcement learning framework that enables LLMs to acquire embodied planning capabilities through autonomous exploration with sparse rewards. |
| RoboJuDo | Deployment framework for the FRoM-W1 humanoid project. |
| VehicleWorld | First comprehensive multi-device environment for intelligent vehicle interaction, modeling complex interconnected systems in modern cockpits. |
Mechanistic interpretability of large language models.
| Project | Description |
|---|---|
| Llamascopium · 🤗 HF · Neuronpedia | (formerly Language-Model-SAEs) A performant, fully-distributed framework for training, analyzing, and visualizing Sparse Autoencoders (SAEs) and frontier variants, empowering scalable and systematic mechanistic interpretability research. |
| Lorsa | Low-rank sparse attention for interpretability. |
The Embodied AI Team empowers large models to execute real-world tasks, aiming to automate tedious chores and unlock superhuman intelligence through environmental interaction. We believe true AI emerges from engaging with the physical world.
| Project | Venue | Description |
|---|---|---|
| VLABench · arXiv · GitHub | ICCV 2025 | The first large-scale robot manipulation benchmark designed to fairly evaluate the multi-dimensional ability of general-purpose Vision-Language-Action models. |
| D2PO · arXiv · GitHub | ACL 2025 | A unified learning framework that empowers embodied agents with stronger world modeling and embodied planning ability via dual preference optimization. |
| World-Aware-Planning · arXiv · GitHub | — | World-aware narrative enhancement bridging high-level task instructions and nuanced real-world environment details. |
| Embodied-Planner-R1 · arXiv · GitHub | — | RL framework enabling LLMs to acquire embodied planning capabilities through autonomous exploration with sparse rewards. |
| Awesome-WAM · GitHub | — | A curated, continuously updated collection of reading lists, paper blogs, and resources for World Action Models in embodied AI. |
The SII-OpenMOSS New Architecture Team explores new architectures and paradigms of LLMs, particularly from the perspective of long-context capability and efficiency.
| Project | Venue | Description |
|---|---|---|
| ReAttention · arXiv · GitHub | ICLR 2025 | Training-free approach that enables LLMs to support infinite context length extrapolation with finite attention scope. |
| LongLLaDA · arXiv · GitHub | AAAI 2026 | First systematic investigation comparing long-context performance of diffusion LLMs and traditional auto-regressive LLMs. |
| RoPE++ · GitHub | ICLR 2026 | Beyond Real: imaginary extension of Rotary Position Embeddings for long-context LLMs. |
| Sparse-dLLM · GitHub | — | Sparse diffusion-based large language models. |
| FourierAttention · arXiv | — | Training-free framework that exploits the heterogeneous roles of transformer head dimensions. |
| Thus Spake Long-Context LLM · arXiv · GitHub | — | A survey on the lifecycle of long-context LLMs from four perspectives: architecture, infrastructure, training, and evaluation. |
| Project | Venue | Description |
|---|---|---|
| GAOKAO-MM · GitHub | ACL 2024 Findings | A Chinese human-level benchmark for multimodal model evaluation. |
| Project | Venue | Description |
|---|---|---|
| HalluQA · GitHub | — | Dataset and evaluation script for evaluating hallucinations in Chinese large language models. |
| Say-I-Don't-Know · GitHub | ICML 2024 | Can AI assistants know what they don't know? |
| LongSafety · GitHub | — | Safety evaluation for long-context LLMs. |
| Project | Description |
|---|---|
| UnifiedToolHub · GitHub | A comprehensive project supporting LLM-based tool use — unifies dataset formats and provides training, annotation, and evaluation functionalities. |
| ABC-Bench · GitHub | A benchmark for agentic backend coding — evaluates whether code agents can explore repos, edit code, configure environments, deploy services, and pass external end-to-end API tests. |
| OurClaw · GitHub | An institutional OpenClaw solution: share one Claw with others. |
For collaborations, internships, or general inquiries: openmoss@sii.edu.cn