15B-parameter unified Transformer. Joint video & audio generation. 1080p in ~38s. Fully open source.
A unified architecture that changes what's possible in open-source AI video.
One inference pass generates video, dialogue, ambient sound, and Foley effects simultaneously; no post-production needed.
40-layer pure self-attention architecture. Text, video, and audio tokens denoised in a single unified sequence; no cross-attention.
Only 8 denoising steps required. Combined with MagiCompiler inference acceleration, it outpaces all comparable open-source models.
Supports English, Mandarin, Cantonese, Japanese, Korean, German, and French with industry-leading WER accuracy.
Complete release: base model, distilled model, super-resolution model, and inference code. Commercial use included.
Native 1080p output with support for 16:9, 9:16, 4:3, 21:9, and 1:1 aspect ratios straight from the model.
Artificial Analysis Arena · Last updated: Apr 8, 2026
| Rank | Model | T2V Elo | I2V Elo | Audio |
|---|---|---|---|---|
| 🥇 1 | HappyHorse-1.0 | 1333 | 1392 | ✓ |
| 🥈 2 | Seedance 2.0 | ~1273 | ~1355 | ✓ |
| 🥉 3 | Kling 3.0 Pro | ~1240 | ~1260 | – |
| 4 | SkyReels V4 | ~1210 | ~1230 | – |
| 5 | WAN 2.6 | 1189 | – | – |
Data sourced from Artificial Analysis Arena · View Full Leaderboard →
HappyHorse-1.0 is a groundbreaking open-source AI video generation model that stunned the research community in early April 2026 by claiming the top position on the Artificial Analysis Video Generation Arena, surpassing every commercial closed-source model in head-to-head blind evaluation. The HappyHorse-1.0 model achieved a Text-to-Video (T2V) Elo score of 1333 and an Image-to-Video (I2V) Elo score of 1392, beating established commercial systems from ByteDance, Kuaishou, and other major AI labs.
What makes HappyHorse-1.0 particularly remarkable is its open-source commitment. While competing models like Seedance 2.0 and Kling 3.0 Pro remain proprietary, HappyHorse-1.0 plans to release its full model weights, inference code, and training methodology (including a distilled version and super-resolution model) under a license that permits commercial use.
The name HappyHorse carries cultural significance: 2026 is the Year of the Horse in the Chinese lunar calendar, and the model's emergence as an unexpected champion from the open-source community, overtaking billion-dollar commercial labs, embodies the spirit of the underdog. In Mandarin AI circles, HappyHorse-1.0 has been dubbed "the dark horse that became the lead horse."
At the core of the HappyHorse-1.0 architecture is a 15-billion-parameter single-stream Transformer that processes text, video frames, and audio tokens as a single unified sequence. This approach fundamentally differs from most competing models, which use separate encoders and decoders for each modality. The HappyHorse single-stream design enables true joint denoising: text prompts, video latents, and audio spectrograms are denoised together in a single forward pass, producing synchronized audio-visual content without any post-processing step.
The HappyHorse-1.0 Transformer consists of 40 layers. The first four and last four layers use modality-specific projections to handle the different representations of text, video, and audio. The central 32 layers share parameters across all modalities, enabling efficient cross-modal learning. A gating mechanism in each attention head controls how strongly different modalities influence one another during training, a critical stabilization technique for joint audio-video generation.
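For readers who want a concrete picture of what such a single-stream stack might look like, here is a minimal PyTorch sketch. All module names and shapes are illustrative assumptions, and the modality-specific handling of the first and last four layers is folded into simple input/output projections; this is not the released HappyHorse-1.0 implementation.

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """Self-attention with a learnable gate per head on the attention output."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # One gate per attention head, controlling how strongly that head's
        # (potentially cross-modal) output feeds back into the residual stream.
        self.head_gate = nn.Parameter(torch.zeros(heads))
        self.heads = heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        gate = torch.sigmoid(self.head_gate).repeat_interleave(out.shape[-1] // self.heads)
        return gate * out

class Block(nn.Module):
    """Pre-norm Transformer block: gated self-attention followed by an MLP."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = GatedSelfAttention(dim, heads)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))

class SingleStreamTransformer(nn.Module):
    """40 self-attention layers over one concatenated text+video+audio sequence."""
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        # Modality-specific projections into and out of the shared token space
        # (a simplification of the modality-specific first/last four layers).
        self.proj_in = nn.ModuleDict({m: nn.Linear(dim, dim) for m in ("text", "video", "audio")})
        self.proj_out = nn.ModuleDict({m: nn.Linear(dim, dim) for m in ("video", "audio")})
        self.layers = nn.ModuleList(Block(dim, heads) for _ in range(40))

    def forward(self, text, video, audio):
        # Concatenate all modalities into one sequence and process them jointly;
        # there is no cross-attention, only self-attention over the mixed sequence.
        x = torch.cat([self.proj_in["text"](text),
                       self.proj_in["video"](video),
                       self.proj_in["audio"](audio)], dim=1)
        for layer in self.layers:
            x = layer(x)
        n_text, n_video = text.shape[1], video.shape[1]
        return (self.proj_out["video"](x[:, n_text:n_text + n_video]),
                self.proj_out["audio"](x[:, n_text + n_video:]))
```

Gating each head's contribution is one straightforward way to let the model learn how strongly the modalities should mix, which is one plausible reading of the stabilization mechanism described above.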
Speed is a defining feature of HappyHorse-1.0. Through DMD-2 (Distribution Matching Distillation v2), the model requires only 8 denoising steps (versus 50 or more in standard diffusion models) and does not require Classifier-Free Guidance (CFG). Combined with MagiCompiler full-graph compilation, which fuses operators across Transformer layers, HappyHorse-1.0 generates a 5-second 1080p clip in approximately 38 seconds on an NVIDIA H100 GPU, among the fastest in its class.
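As a rough illustration of why so few steps matter, here is what a guidance-free, fixed-schedule sampling loop looks like in outline. The `distilled_model` call signature and the Euler-style update are assumptions made for clarity, not the DMD-2 or MagiCompiler implementation.

```python
import torch

@torch.no_grad()
def sample(distilled_model, prompt_tokens, latent_shape, num_steps: int = 8):
    """Few-step sampling: one model call per step, no classifier-free guidance."""
    x = torch.randn(latent_shape)                      # start from pure noise
    sigmas = torch.linspace(1.0, 0.0, num_steps + 1)   # fixed 8-step schedule
    for i in range(num_steps):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        # A single forward pass per step. Without CFG there is no second,
        # unconditional pass, so the per-step cost is not doubled.
        velocity = distilled_model(x, sigma, prompt_tokens)
        x = x + (sigma_next - sigma) * velocity        # Euler-style update
    return x
```

At 8 steps and one forward pass per step, the sampler makes 8 network evaluations in total, versus 100 or more for a 50-step model with CFG, which is roughly where the wall-clock advantage comes from before compiler-level operator fusion is even considered.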
The Artificial Analysis Video Generation Arena uses human preference voting to rank AI video models through blind A/B comparisons: evaluators see two videos generated from the same prompt and vote for the better one without knowing which model produced it. HappyHorse-1.0 topped this leaderboard across both T2V (text-to-video) and I2V (image-to-video) categories on its first appearance, a result that sent shockwaves through the AI community.
In the T2V no-audio category, HappyHorse-1.0 achieved an Elo of 1333, approximately 60 points ahead of Seedance 2.0 (~1273) and 93 points ahead of Kling 3.0 Pro (~1240). In I2V, the gap was even larger: HappyHorse-1.0 scored 1392 versus Seedance 2.0's ~1355. These margins are statistically significant in the Elo rating system, indicating a clear and consistent preference advantage.
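To put those margins in concrete terms, the standard logistic Elo model maps a rating gap to an expected head-to-head win rate, so the gaps above can be read as preference probabilities:

```python
def elo_win_probability(rating_gap: float) -> float:
    """Expected win rate implied by an Elo gap (standard logistic model, scale 400)."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))

print(f"{elo_win_probability(1333 - 1273):.1%}")  # ~58.6% expected win rate vs Seedance 2.0 (T2V)
print(f"{elo_win_probability(1333 - 1240):.1%}")  # ~63.1% expected win rate vs Kling 3.0 Pro (T2V)
```

In other words, a 60-point gap corresponds to HappyHorse-1.0 being preferred in roughly 59% of blind pairwise votes against Seedance 2.0, and a 93-point gap to roughly 63% against Kling 3.0 Pro.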
The one area where HappyHorse-1.0 does not hold the top position is the combined audio evaluation category, where Seedance 2.0 has an edge, partly because Seedance uses a dedicated audio generation pipeline optimized separately from video. The HappyHorse-1.0 joint audio-video approach trades some audio specialization for the significant advantage of synchronized generation in a single pass.
For developers and researchers evaluating AI video tools, HappyHorse-1.0 represents a compelling option: it delivers best-in-class visual quality while remaining open-source and commercially usable, a combination no other top-tier model currently offers.
One of the standout features of HappyHorse-1.0 is its multilingual lip-sync capability. The model supports accurate lip synchronization for spoken dialogue in seven languages: English, Mandarin Chinese, Cantonese, Japanese, Korean, German, and French. Independent evaluations show HappyHorse-1.0 achieving industry-leading Word Error Rate (WER) scores in lip-sync accuracy across these languages.
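WER, for reference, is the word-level edit distance between a transcript recovered from the generated clip (via speech recognition or lip reading) and the intended dialogue, normalized by the dialogue length; lower is better. The evaluation pipeline behind the cited scores is not described here, but the metric itself is straightforward to compute:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25 (one substitution in four words)
```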
This makes HappyHorse-1.0 particularly valuable for global content creation: generating talking-head videos, dubbed animations, and multilingual marketing content without the need for separate post-production dubbing. The joint audio-video architecture of HappyHorse means the model generates speech-synchronized video natively, rather than applying audio as a post-processing step that can introduce timing mismatches.
As of April 2026, the HappyHorse-1.0 model weights have not been publicly released. The model appeared on the Artificial Analysis leaderboard through a submission from a pseudonymous team, and while the official HappyHorse-1.0 site promises a full open-source release (base model, distilled weights, super-resolution model, and inference code), no release date has been confirmed.
The open-source AI community is eagerly awaiting the HappyHorse-1.0 weight release. When available, the model is expected to require an NVIDIA H100 or A100 GPU with at least 48GB VRAM for full-precision inference. An FP8-quantized version of HappyHorse is expected to reduce memory requirements significantly, enabling deployment on 40GB A100 GPUs.
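The back-of-the-envelope arithmetic behind those figures is simple: 15 billion parameters at 2 bytes each (FP16/BF16) is roughly 28 GB of weights alone, and FP8 halves that to roughly 14 GB. The remaining headroom presumably goes to activations, attention caches, and additional stages such as super-resolution, which helps explain why the practical requirements quoted above exceed the raw weight sizes:

```python
PARAMS = 15e9  # 15B-parameter Transformer

for precision, bytes_per_param in [("FP16/BF16", 2), ("FP8", 1)]:
    weight_gb = PARAMS * bytes_per_param / 1024**3
    print(f"{precision}: ~{weight_gb:.0f} GB of weights")

# FP16/BF16: ~28 GB of weights -> needs the quoted 48 GB card once activations
#                                 and other stages are accounted for
# FP8:       ~14 GB of weights -> leaves room on a 40 GB A100
```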
Subscribe to our notification list above to be among the first to know when HappyHorse-1.0 weights become available. In the meantime, researchers interested in the underlying architecture can explore daVinci-MagiHuman, the open-source model from GAIR Lab and Sand.ai that is most closely linked to the HappyHorse-1.0 architecture.
Be the first to access the model when the weights and API are publicly released.
No spam. Unsubscribe anytime.