Grok Code, VibeVoice, GPT-realtime, Wan S2V, USO, OLMoASR, & more AI NEWS
Welcome to the AI Search newsletter. Here are the top highlights in AI this week.
Google revealed Gemini 2.5 Flash Image, a top-notch AI for creating and editing pictures from text or other images. It tops the charts on image editing leaderboards and lets you mix photos, keep characters looking the same in stories, or change parts with simple words. Plus, it uses Google's knowledge to make super realistic edits. Read more
USO is an open-source image model for style-driven and subject-driven image generation. It uses a disentangled learning scheme that separates content from style while aligning style features, so it can handle subject-driven, style-driven, and combined style-subject generation. Read more
VibeVoice is a new, open-source text-to-speech model that can generate expressive, long-form audio conversations, such as podcasts, from text. It uses a novel framework that can synthesize speech up to 90 minutes long with up to 4 distinct speakers, and can even handle tasks like spontaneous emotion, singing, and cross-lingual conversations. Read more
Do you prefer to watch instead of read? Check out this video covering all the highlights in AI this week
xAI launched Grok Code Fast 1, a fast and cheap AI model designed for coding tasks like building apps or fixing bugs automatically. It's built from the ground up to handle "agentic coding," meaning it can act like a smart assistant in coding environments. For the next week, you can try it for free on tools like Cursor and GitHub Copilot. Read more
Alibaba's Tongyi Lab dropped Wan2.2-S2V, an open-source AI that turns a single photo and sound into awesome videos. This 14-billion-parameter model shines in making movie-like scenes with real faces, body moves, and camera angles. It works for full or half-body characters and handles stuff like talking, singing, or acting professionally. Read more
Waver is a new AI video model by ByteDance that creates realistic videos from text or images. It supports multi-camera storytelling, different artistic styles, and even sports scenes, and it can generate high-quality videos at a range of resolutions and lengths. Read more
To celebrate reaching 500K subscribers on YouTube, we are giving away a DJI Mini 4 Pro! This small, versatile, powerful drone has a high-resolution 48MP sensor and can capture 4K/60fps HDR video. Enter for FREE!
OpenAI rolled out gpt-realtime, an upgraded AI that chats by voice more naturally and follows tricky commands better. It calls tools more accurately and makes speech sound more expressive, like a real person. The API now handles images, remote MCP servers, and even phone calls via SIP, and it's no longer in beta. Read more
Ai2 released OLMoASR, a new set of open automatic speech recognition (ASR) models trained entirely from scratch on a big, high-quality dataset. These models perform as well as or better than Whisper in understanding speech without needing extra training (zero-shot) across different model sizes. This means anyone can use strong speech-to-text tools without restrictions. Read more
Tencent Hunyuan introduced HunyuanVideo-Foley, a new AI model that creates high-quality audio from video and text descriptions. Trained on a huge 100,000-hour dataset, it generates realistic sounds that match the scenes, like nature sounds or animated effects. This model outperforms all other open-source audio generation systems in quality and timing. Read more
With Monica, you can use the top AI models, image generators, and video generators, all in one integrated platform. Use code AISEARCH10 to get 25% OFF 'Unlimited Annual Plan' within 24h of registration, or enjoy 10% OFF. Try it for free today!
Google’s NotebookLM now offers Video Overviews in 80 languages, expanding beyond its existing Audio Overviews. This means users can get quick video summaries in many languages, making information easier to digest globally. The Audio Overviews in these languages have also been improved to be more detailed. Read more
VoxHammer is a new AI tool that allows for precise and coherent editing of 3D models without needing any training data. It works by predicting the inversion trajectory of a 3D model and then replacing the features of preserved regions with cached latents and key-value tokens. This approach enables consistent reconstruction of preserved areas and coherent integration of edited parts. Read more