The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision), by Zhengyuan Yang and 6 other authors

Abstract: Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest model, GPT-4V(ision), to deepen the understanding of LMMs. The analysis focuses on the intriguing tasks that GPT-4V can perform, containing test samples to probe the quality and genericity of GPT-4V's capabilities, its supported inputs and working modes, and the effective ways to prompt the model. In our approach to exploring GPT-4V, we curate and organize a collection of carefully designed qualitative samples spanning a variety of domains and tasks. Observations from these samples demonstrate that GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make GPT-4V a powerful multimodal generalist system. Furthermore, GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods such as visual referring prompting. We conclude the report with in-depth discussions on the emerging application scenarios and the future research directions for GPT-4V-based systems.