Multimodal AI Archives - Tech | Business | Economy

OpenAI Counters Google’s Gemini 3 Surge with New GPT-5.2

Joan Aimuengheuwa — Fri, 12 Dec 2025 11:48:11 +0000

OpenAI has launched its GPT-5.2 model, escalating a competition that has intensified since Google released Gemini 3 last month.

This follows reports that CEO Sam Altman declared a “code red” inside the company in early December, halting side projects and pulling teams into a faster development sprint.

The urgency was linked to Google’s latest innovations, which had placed Gemini 3 at the top of key performance rankings across reasoning, coding and multimodal tasks.

OpenAI says GPT-5.2 brings stronger general intelligence, better coding results, and far longer context handling. The company believes these improvements will help users complete more demanding work, particularly tasks that involve spreadsheets, complex documents, and project-heavy workflows.

Notably, the new model stretches to handle up to a million tokens of context, a sharp increase over the previous model.

Google has been keen to highlight what Gemini 3 is capable of across text, audio, images and video, and analysts say its tight integration with Workspace and Android gives it an advantage with corporate users.

Even so, Altman played down the internal panic when he spoke on CNBC, saying: “Gemini 3 has had less of an impact on our metrics than we feared.” Google has not responded to requests for comment.

OpenAI is rolling out GPT-5.2 in three versions: Instant for quick responses, Thinking for slower but more reasoned answers, and Pro for enterprise-level performance. Paid ChatGPT users will receive them first. The company also states it will continue to support GPT-5.1, GPT-5 and GPT-4.1 on its API, giving developers more flexibility.

Away from the technical competition, OpenAI is also moving into entertainment. Disney has confirmed a $1 billion investment in the company and will allow its Sora video generator to use characters and worlds from Star Wars, Pixar and Marvel.

This is one of the largest licensing deals yet between Hollywood and an AI firm, and it sets up OpenAI as a direct partner in digital content production. Microsoft, still OpenAI’s biggest backer with about $13 billion committed since 2019, continues to host the company’s models on Azure.

Industry forecasts show spending on cloud-based AI services is expected to rise sharply, with Gartner estimating it will exceed $723 billion next year. Many companies are already relying on GPT models for coding assistance, document processing and data insights. According to OpenAI, enterprise usage has climbed roughly 40% in the past year.

However, regulators in the US and Europe are examining safety standards, competition risks and copyright issues, with Disney’s licensing deal likely to draw even closer attention.

The post OpenAI Counters Google’s Gemini 3 Surge with New GPT-5.2 appeared first on Tech | Business | Economy.

Musk to Open Source Grok 2 Next Week, Extending His AI Transparency Push

Joan Aimuengheuwa — Wed, 06 Aug 2025 08:33:18 +0000

Elon Musk has announced that xAI, his artificial intelligence venture, will release the source code for its flagship chatbot, Grok 2, next week.

Grok 2, the successor to xAI’s Grok-1 language model, has been marketed as a less filtered and more “truth-seeking” alternative to tools like ChatGPT or Claude.

Unlike many rivals, it draws directly from live data on X (formerly Twitter), enabling it to react to breaking news and trending conversations in real time. It also offers multimodal features, producing text, images, and video, and is currently available to X Premium+ subscribers.

Open sourcing the system would give developers and researchers direct access to Grok 2’s underlying code and architecture, allowing them to audit, modify, and build upon the technology.

Musk framed the decision as part of a consistent release pattern, stating it was “high time” to share the model with the public. This aligns with a growing industry shift toward open-weight AI models, with Meta’s Llama, Mistral’s models, and OpenAI’s gpt-oss series following similar paths.

However, Grok’s looser content restrictions have attracted complaints, with past instances of misleading or offensive responses raising concern. Opening up its code could amplify risks, including the spread of misinformation or the misuse of the technology in sensitive fields such as medical diagnostics or autonomous systems.

Grok Imagine—its image and video generator—has already been caught in controversy over its potential to produce explicit content, prompting further debate on the balance between openness and safety.

xAI continues to present Grok as a counterweight to larger AI players like OpenAI, Google, and Anthropic, putting transparency and developer freedom at the forefront.

Analysts also note that this strategy may strengthen Musk’s business network, opening possibilities for integration across Tesla, SpaceX, Neuralink, and X.


Multimodal AI Faces New Threats | Report Reveals Safety Risks, CSEM Exposure

Joan Aimuengheuwa — Fri, 09 May 2025 10:36:25 +0000

As generative AI systems increasingly combine text and images, a new Multimodal Safety Report from Enkrypt AI exposes critical vulnerabilities that could compromise the safety, integrity, and responsible use of multimodal models.

Enkrypt AI’s red teaming exercise tested multiple multimodal models against a range of safety and harm categories outlined in the NIST AI Risk Management Framework.

The results show that new jailbreak techniques can exploit how these models interpret combined media, allowing harmful outputs to bypass safety filters, often without any visible warning in the user prompt.

“Multimodal AI promises incredible benefits, but it also expands the attack surface in unpredictable ways,” said Sahil Agarwal, CEO of Enkrypt AI. “This research is a wake-up call: the ability to embed harmful textual instructions within seemingly innocuous images has real implications for enterprise liability, public safety, and child protection.”

Key Findings: New Attack in Plain Sight

The research illustrates how multimodal models—designed to handle text and image inputs—can inadvertently expand the surface area for abuse when not sufficiently safeguarded.

Such risks can be found in any multimodal model, but the report focused on two popular ones developed by Mistral: Pixtral-Large (25.02) and Pixtral-12b.

According to Enkrypt AI’s findings, these two models are 60 times more prone to generate child sexual exploitation material (CSEM)-related textual responses than comparable models like OpenAI’s GPT-4o and Anthropic’s Claude 3.7 Sonnet.

Additionally, the tests revealed that the models were 18 to 40 times more likely to produce dangerous CBRN (Chemical, Biological, Radiological, and Nuclear) information when prompted with adversarial inputs. These risks threaten to undermine the intended use of generative AI and highlight the need for stronger safety alignment.

The harmful outputs were triggered not by malicious text inputs but by prompt injections buried within image files, a technique that could realistically be used to evade traditional safety filters.

Recommendations for Securing Multimodal Models

The report urges AI developers and enterprises to act swiftly to mitigate these emerging risks, outlining key best practices:

Integrate red teaming datasets into safety alignment processes
Conduct continuous automated stress testing
Deploy context-aware multimodal guardrails
Establish real-time monitoring and incident response
Create model risk cards to transparently communicate vulnerabilities

“These are not theoretical risks,” added Sahil Agarwal. “If we don’t take a safety-first approach to multimodal AI, we risk exposing users—and especially vulnerable populations—to significant harm.”


Google Launches Gemini 2.0, Multimodal AI Ushering in the ‘Agentic Era’

Joan Aimuengheuwa — Fri, 13 Dec 2024 10:16:07 +0000

Google has launched Gemini 2.0, described as its most capable artificial intelligence model yet.

Built for what the company refers to as the “agentic era,” Gemini 2.0 is designed to understand its environment, think ahead, and take action while keeping user oversight at its core.

Sundar Pichai, CEO of Google and Alphabet, commented on the possibilities of this new model. In his statement, he said: “Information is at the core of human progress. It’s why we’ve focused for more than 26 years on our mission to organize the world’s information and make it accessible and useful. And it’s why we continue to push the frontiers of AI to organize that information across every input and make it accessible via any output, so that it can be truly useful for you.”

Gemini 2.0 is a substantial step forward, adding multimodal output and native tool use. With this, Google moves closer to its vision of a universal assistant.

Pichai added: “If Gemini 1.0 was about organizing and understanding information, Gemini 2.0 is about making it much more useful. I can’t wait to see what this next era brings.”

Enhanced Multimodal Functions

A major feature of Gemini 2.0 is its advanced multimodal functions. Like Gemini 1.5, the new model accepts text, image, audio, and video inputs, but it adds native multimodal output, including image generation and multilingual text-to-speech. These additions allow richer and more interactive user experiences.

For developers, the experimental Gemini 2.0 Flash model is now available via Google AI Studio and Vertex AI. This version prioritizes low latency and high performance, ideal for dynamic applications.

In addition, the new Multimodal Live API allows real-time audio and video streaming, opening possibilities for more immersive use cases.

Expanding AI Integration Across Google’s Services

Starting today, Gemini 2.0’s functions will be accessible to users of the Gemini app, with broader integration into Google products like Search expected by early next year. A feature called Deep Research is also being introduced, which Pichai described as:

“A research assistant, exploring complex topics and compiling reports on your behalf.”

This feature uses advanced reasoning and long-context capabilities to provide deeper insights. It is available in Gemini Advanced starting today.

AI Overviews in Search will also receive enhancements powered by Gemini 2.0, allowing users to tackle complex queries involving multi-step questions, advanced maths, and multimodal inputs. Pichai noted:

“No product has been transformed more by AI than Search. Our AI Overviews now reach 1 billion people, enabling them to ask entirely new types of questions — quickly becoming one of our most popular Search features ever.”

New AI Agents on the Horizon

Google is leveraging Gemini 2.0 to pioneer a new class of AI agents designed to perform complex tasks, including:

Project Astra: A prototype universal assistant that combines tools like Google Search and Maps while enhancing memory and multilingual dialogue capabilities.
Project Mariner: An experimental browser assistant capable of interacting with web pages to complete tasks.
Jules: A coding assistant that integrates with GitHub, enabling developers to plan and execute coding tasks under supervision.

These projects are currently in testing with trusted users as Google fine-tunes their safety and functionality.

Responsible AI Development

Throughout the development of Gemini 2.0, Google has emphasised ethical practices, rigorous safety measures, and collaboration with internal and external experts.

The model is built using Trillium, Google’s sixth-generation Tensor Processing Units (TPUs), which powered 100% of Gemini 2.0’s training and inference.

Pichai highlighted the importance of these foundational technologies: “2.0’s advances are underpinned by decade-long investments in our differentiated full-stack approach to AI innovation. It’s built on custom hardware like Trillium, our sixth-generation TPUs. TPUs powered 100% of Gemini 2.0 training and inference, and today Trillium is generally available to customers so they can build with it too.”

Google positions Gemini 2.0 as a change in how users interact with AI, offering tools that are intuitive, versatile, and capable of enhancing everyday tasks. From simplifying research to assisting with coding, the company expects this next-generation model to be a cornerstone of its future AI products.

Sundar Pichai encapsulated the company’s vision: “Today we’re excited to launch our next era of models built for this new agentic era: introducing Gemini 2.0, our most capable model yet. With new advances in multimodality — like native image and audio output — and native tool use, it will enable us to build new AI agents that bring us closer to our vision of a universal assistant.”


OpenAI Expands Multimodal AI with Launch of Sora Text-to-Video Model

Joan Aimuengheuwa — Tue, 10 Dec 2024 07:39:06 +0000

OpenAI has launched Sora, its text-to-video model, making it available to ChatGPT Plus and Pro users.

The launch extends OpenAI’s multimodal AI offering and puts it in direct competition with Meta, Google, and Stability AI, which have similar tools in development.

Sora was first introduced in February 2024 as a research preview limited to safety testers. The model, now branded as Sora Turbo, is offered at no additional cost to subscribers, enabling them to create videos up to 20 seconds long in 1080p resolution.

Users can choose from widescreen, vertical, or square formats, and generate content either from scratch or by blending existing assets.

While accessible in most regions where ChatGPT operates, the rollout excludes the European Union, Switzerland, and the United Kingdom.

OpenAI has announced plans to introduce targeted pricing for different user categories early next year, reiterating its goal of making the technology widely accessible.

OpenAI has implemented measures to prevent misuse of the tool, specifically prohibiting the creation or upload of harmful content, including child exploitation materials and deepfake videos.

The company has limited the feature for uploading footage of real people and intends to expand access as its safeguards improve.

To promote transparency, all Sora-generated videos include metadata identifying them as AI-generated, along with visible watermarks by default. OpenAI is also developing internal tools to verify the origins of content created using the model.

With Sora now available to ChatGPT subscribers, OpenAI is extending its creative tools to video. The company aims to shape industry norms around responsible AI use while building tools that support content creation.
