Lesson 1: Introduction An overview of the Multimodal AI Applications course and what you will learn.
Lesson 2: Multimodal AI Fundamentals Discover multimodal AI fundamentals and technologies, including models and use cases that process and generate text, images, audio, and video for richer, real-world applications.
Lesson 3: Using Multimodal AI Technologies Explore practical applications of multimodal AI by using APIs and open-source models for image captioning and audio transcription, with hands-on exercises and secure credential handling.
Lesson 4: Transformers & Multimodal Processing Explore how transformers unify text, images, audio, and video through attention, embeddings, and fusion strategies, powering state-of-the-art multimodal understanding and generation.
Lesson 5: Multimodal AI Tooling Explore practical tools for building multimodal AI apps, compare commercial and open-source options, and use Pydantic AI to create reliable, structured, vendor-agnostic workflows.
Lesson 6: Introduction to Enterprise Visual Content Processing Explore enterprise visual content processing: core computer vision tasks, digital image representation, and real-world applications for efficiency, safety, and automation.
Lesson 7: Vision Pre-processing Pipelines with HuggingFace Explore vision data pipelines using HuggingFace, from dataset loading to resizing and normalisation, with demos and hands-on exercises for effective image pre-processing.
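The resizing-and-normalisation step described above can be sketched in a few lines. This is a minimal illustration, not the course's own code: the mean/std values are the widely used ImageNet statistics, which in practice would come from the model's processor configuration (e.g. a HuggingFace image processor).

```python
import numpy as np

# Per-channel statistics commonly used for ImageNet-pretrained vision models
# (an illustrative assumption; a real pipeline reads these from the model's
# processor config).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalise(image: np.ndarray) -> np.ndarray:
    """Scale uint8 pixels to [0, 1], then standardise per channel.

    `image` is an (H, W, 3) uint8 array.
    """
    scaled = image.astype(np.float32) / 255.0
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD

# A dummy 2x2 RGB "image" with all pixels at mid-grey.
img = np.full((2, 2, 3), 128, dtype=np.uint8)
out = normalise(img)
print(out.shape)  # (2, 2, 3)
```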
Lesson 8: Understanding Embeddings in Computer Vision Learn how embeddings convert images into compact vectors for efficient search, enable cross-modal tasks with models like CLIP, and power large-scale, robust computer vision systems.
Lesson 9: Image Search Using CLIP Embeddings Explore how to build text-to-image and image-to-image search using CLIP embeddings, combining theory, real-world demos, hands-on practice, and solution walkthroughs.
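The retrieval step behind CLIP-based image search reduces to cosine similarity between a query embedding and an index of image embeddings. The sketch below assumes the embeddings have already been produced by a CLIP model (e.g. via HuggingFace's CLIPModel); random vectors stand in for them here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for five pre-computed CLIP image embeddings (dim 512).
image_embeddings = rng.normal(size=(5, 512))

def search(query_embedding, index, top_k=3):
    """Rank indexed embeddings by cosine similarity to the query."""
    index_norm = index / np.linalg.norm(index, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = index_norm @ q
    ranked = np.argsort(scores)[::-1][:top_k]
    return ranked, scores[ranked]

# Querying with the embedding of image 2 itself should rank it first
# with a similarity of 1.0.
ids, scores = search(image_embeddings[2], image_embeddings)
print(ids[0])  # 2
```

Text-to-image search works the same way: the query embedding simply comes from CLIP's text encoder instead of its image encoder, since both map into the same space.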
Lesson 10: Using Multimodal Model APIs for Vision Explore multimodal vision APIs: prompt design, parameter tuning, structured outputs, cost control, integration, and best practices for robust, efficient image analysis.
Lesson 11: Gemini Vision API Basics Explore Gemini Vision API basics by practicing image moderation, learning to analyse images and implement moderation workflows using real-world examples and guided hands-on exercises.
Lesson 12: Vision Transformer Models & Architectures Explore Vision Transformer models: core architecture, image tokenisation, self- and cross-attention, and top models for segmentation, detection, and enterprise use.
Lesson 13: Using Vision Transformers Explore vision transformers with hands-on demos to extract image embeddings and perform object detection and segmentation using state-of-the-art models.
Lesson 14: Vision-Language Models Learn how vision-language models align images and text for tasks like search, captioning, and visual question answering, with a focus on enterprise deployment considerations.
Lesson 15: Multimodal Vision Applications with CLIP Explore zero-shot image classification and auto-labelling for driving scenes using CLIP, enabling efficient, scalable multimodal vision applications.
Lesson 16: Diffusion Models & Image Generation Explore how diffusion models generate images by reversing noise through iterative denoising, a key technique behind modern generative image models.
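The noising process that generation reverses can be shown in a toy form: an image is mixed with Gaussian noise according to a variance schedule, and in closed form any timestep can be sampled directly. The linear beta schedule below is an illustrative assumption, not any specific model's.

```python
import numpy as np

# Linear variance schedule over T steps (illustrative values).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0): signal shrinks, noise grows with t."""
    noise = rng.normal(size=x0.shape)
    return (np.sqrt(alphas_cumprod[t]) * x0
            + np.sqrt(1 - alphas_cumprod[t]) * noise)

rng = np.random.default_rng(0)
x0 = np.ones((8, 8))             # a trivial "image"
x_early = q_sample(x0, 10, rng)  # still close to the original
x_late = q_sample(x0, 999, rng)  # almost pure noise
```

Generation runs this in reverse: starting from pure noise, a trained network predicts and removes a little noise at each step until an image emerges.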
Lesson 17: Introduction to Enterprise Audio Processing Discover enterprise audio processing, including core speech tasks, use cases, and integration strategies for modern business environments.
Lesson 18: Audio Data Representation Explore how audio is digitised for AI, including sample rate, bit depth, channels, formats, and best practices for preprocessing and analysis.
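The three parameters named above (sample rate, bit depth, channels) determine the size of raw PCM audio directly, which is a useful back-of-the-envelope check when planning storage or API costs:

```python
# Raw (uncompressed PCM) audio data rate from its digitisation parameters.
def pcm_bytes_per_second(sample_rate: int, bit_depth: int, channels: int) -> int:
    return sample_rate * (bit_depth // 8) * channels

# CD-quality stereo: 44.1 kHz, 16-bit, 2 channels -> ~10 MB per minute.
print(pcm_bytes_per_second(44_100, 16, 2))  # 176400

# Typical speech-model input: 16 kHz, 16-bit, mono.
print(pcm_bytes_per_second(16_000, 16, 1))  # 32000
```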
Lesson 19: Audio Processing with librosa Explore audio processing with librosa to load, resample, convert, analyse and visualise audio data through hands-on exercises.
Lesson 20: Sound Retrieval and Classification Explore audio embeddings for efficient sound classification and retrieval using models like CLAP to enable semantic audio analysis at scale.
Lesson 21: Sound Retrieval and Classification with CLAP Apply CLAP for sound retrieval, similarity search, and zero-shot classification to detect fan on/off states in real audio data.
Lesson 22: Speech Processing Discover automatic speech recognition with Whisper, a robust multilingual model for transcription, translation, and real-world speech processing.
Lesson 23: Implementing Speech Processing with Whisper & Gemini Explore real-world speech transcription and translation using Whisper and Gemini, including multilingual support and alignment techniques.
Lesson 24: Audio Intelligence Explore advances in audio intelligence, including multimodal systems, speech recognition, text-to-speech, ethics, and enterprise controls.
Lesson 25: Audio Sentiment Analysis with Gemini Explore audio sentiment and command analysis using Pydantic AI and Gemini to extract emotions and recognise spoken commands.
Lesson 26: Audio Classification and Moderation Explore voice content moderation including compliance, privacy, layered detection and operational excellence.
Lesson 27: Building a Basic Voice Moderation System with Gemini Build a voice moderation system using Gemini to transcribe audio, detect personal data disclosures, and flag policy violations.
Lesson 28: Introduction to Enterprise Video Processing Discover how enterprise video AI addresses temporal complexity using efficient frame selection for understanding and moderation.
Lesson 29: AI Models for Video Understanding Explore AI models for real-time detection, motion tracking and temporal understanding to enable scalable video analytics.
Lesson 30: Implementing Object Recognition & Tracking Learn how to detect and track objects in videos, apply multi-object tracking, and count items in practical scenarios.
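Multi-object tracking typically associates detections with existing tracks frame-to-frame using intersection-over-union (IoU) between bounding boxes. A minimal, illustrative helper (boxes as `(x1, y1, x2, y2)` in pixels):

```python
# IoU: overlap area divided by union area of two axis-aligned boxes.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.1429
```

A tracker matches each new detection to the track whose last box gives the highest IoU above some threshold; unmatched detections start new tracks, which also makes counting straightforward.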
Lesson 31: Video Understanding & Search Explore methods for analysing and searching video using foundation models, balancing accuracy, cost and performance.
Lesson 32: Video Understanding & Search with Gemini & CLIP4Clip Explore automated video description, key moment detection, and natural language video search using AI models and structured outputs.
Lesson 33: Video Classification & Moderation Learn to classify and moderate video by modelling temporal patterns and combining automation with human oversight.
Lesson 34: Video Classification & Moderation with Gemini Build automated systems for video classification and moderation using Gemini and Pydantic AI in real-world scenarios.
Lesson 35: Video Generation Explore generative video AI tools and workflows that turn text, images or footage into dynamic video content.
Lesson 36: Video Generation with Veo 3 Generate marketing videos using Veo 3 with text-to-video and image-to-video workflows, understanding strengths and limitations.
Lesson 37: Multimodal AI Deployment Explore deployment strategies for multimodal AI systems via unified APIs and orchestration approaches.
Lesson 38: Implementation Tools and Serving Strategies Explore tools and strategies for implementing, serving and monitoring AI solutions from prototyping to production.
Lesson 39: Using Gradio and Pydantic AI Build multimodal chatbots and analysis apps using Gradio and Pydantic AI, covering async programming and interface customisation.
Lesson 40: Multimodal AI Performance Monitoring and Logging Learn to monitor and log multimodal AI systems, tracking performance, costs and failures across modalities.
Lesson 41: Logging and Performance Monitoring with Gradio and Arize Phoenix Implement logging and performance monitoring for multimodal AI chatbots to enable robust analytics and debugging.
Course Project: Evaluating Multimodal Applications Learn how to evaluate multimodal AI applications using user feedback, automated metrics and continuous monitoring.
Lesson 43: Testing Multimodal Apps with Pydantic AI Evals Build robust testing frameworks for multimodal AI apps using structured outputs and semantic evaluation techniques.
Lesson 44: Scaling Multimodal AI Architecture Learn strategies to scale multimodal AI systems, focusing on performance, reliability, cost and architectural trade-offs.