Alibaba Releases New AI Models, Text-to-Video Tech
Alibaba has announced the release of more than 100 new open-source artificial intelligence (AI) models, along with new text-to-video AI technology. The announcement builds on the Qwen 2.5 family, the latest generation of its foundational large language model, which launched in May.[1]
The newly released open-source Qwen 2.5 models range in size from 0.5 billion to 72 billion parameters, with the larger variants offering greater capability at higher computational cost. According to Alibaba, the models are proficient in mathematics and coding and support more than 29 languages. They are designed to serve applications across sectors including automotive, gaming, and scientific research.
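For readers who want to try the open-source checkpoints, a minimal sketch using the Hugging Face Transformers library follows. The repository id reflects Qwen's published naming on Hugging Face, but verify the exact identifier and license terms for the model size you need.

```python
# Minimal sketch: run one of the open-source Qwen 2.5 checkpoints with
# Hugging Face Transformers. "Qwen/Qwen2.5-0.5B-Instruct" is the smallest
# of the 0.5B-72B range; swap the size suffix for larger variants.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Chat-style prompt, formatted with the model's built-in chat template
messages = [{"role": "user", "content": "What is 37 * 43?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```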
Alibaba also unveiled a new text-to-video model as part of its Tongyi Wanxiang image-generation family, entering a market that several Chinese tech firms are racing into. The move pits Alibaba against global players such as OpenAI, which is developing text-to-video technology of its own. ByteDance, the owner of TikTok, recently launched its text-to-video app Jimeng AI on the App Store for Chinese users.
Chinese tech companies are investing heavily in generative AI, each racing to build out a distinct lineup of products around it. While competitors such as Baidu and OpenAI have kept their flagship models closed-source, Alibaba is taking a hybrid approach, developing proprietary and open-source models in parallel.
Earlier this month, China's State Administration for Market Regulation announced that Alibaba Group had completed a three-year rectification period following a $2.75 billion fine imposed in 2021 for monopolistic practices. The fine was issued after the e-commerce giant was found to have restricted merchants from working with rival platforms.
In August, it emerged that Alibaba Group Holding is developing Tora, a video-generation tool built on the same architectural approach as OpenAI's Sora. Detailed in a paper by five Alibaba researchers, Tora uses the Diffusion Transformer (DiT) architecture, the same framework underlying Sora, the text-to-video model OpenAI unveiled in February.
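For readers unfamiliar with the architecture, the sketch below shows the core idea of a DiT block: a standard transformer block whose normalization layers are modulated by the diffusion timestep embedding (the "adaLN" mechanism from the DiT paper). It is a simplified illustration, not Sora's or Tora's actual implementation, neither of which is fully public.

```python
# Simplified sketch of one Diffusion Transformer (DiT) block in PyTorch.
# Real implementations add conditioning on text, careful initialization,
# and many other details; this shows only the adaLN modulation idea.
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Transformer block whose layer norms are scaled/shifted/gated
    by terms predicted from the diffusion timestep embedding."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # adaLN: the timestep embedding predicts six modulation terms
        self.ada_ln = nn.Linear(dim, 6 * dim)

    def forward(self, x, t_emb):
        # x: (batch, tokens, dim) patch tokens; t_emb: (batch, dim)
        s1, b1, g1, s2, b2, g2 = self.ada_ln(t_emb).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        x = x + g2 * self.mlp(h)
        return x

# Quick shape check
block = DiTBlock(dim=384, num_heads=6)
out = block(torch.randn(2, 256, 384), torch.randn(2, 384))  # -> (2, 256, 384)
```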
According to the researchers, Tora is the first trajectory-oriented DiT framework for video generation, allowing generated content to move precisely along user-specified paths while mimicking real-world dynamics. To build training data, the team adapted OpenSora's processing workflow to turn raw videos into high-quality video-text pairs and used an optical flow estimator to extract motion trajectories.
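To make the trajectory-extraction step concrete, the sketch below tracks a single point through a clip with a dense optical flow estimator. It uses OpenCV's Farneback method purely as an illustrative stand-in; the paper's actual estimator, and the rest of its data pipeline, may differ.

```python
# Illustrative sketch: extract a motion trajectory from a video by following
# a point through a dense optical flow field (OpenCV Farneback method).
import cv2
import numpy as np

def extract_trajectory(video_path, point):
    """Track a single (x, y) point across frames via dense optical flow."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        raise IOError(f"Cannot read {video_path}")
    prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    x, y = point
    trajectory = [(x, y)]
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense flow field: flow[row, col] = (dx, dy) displacement per pixel
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0
        )
        dx, dy = flow[int(round(y)), int(round(x))]
        # Advance the point, clamped to the frame bounds
        h, w = gray.shape
        x = float(np.clip(x + dx, 0, w - 1))
        y = float(np.clip(y + dy, 0, h - 1))
        trajectory.append((x, y))
        prev_gray = gray
    cap.release()
    return np.array(trajectory)  # shape: (num_frames, 2)
```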