Apple has released a new open source AI model called MGIE that can edit images based on natural language instructions. From the report: MGIE stands for MLLM-Guided Image Editing; it uses multimodal large language models (MLLMs) to interpret user commands and perform pixel-level edits. The model can handle a range of editing tasks, such as Photoshop-style modifications, global photo optimization, and local editing. MGIE is the result of a collaboration between Apple and researchers at the University of California, Santa Barbara. The model was presented in a paper accepted at the International Conference on Learning Representations (ICLR) 2024, one of the top venues for AI research. The paper demonstrates MGIE's effectiveness in improving automatic metrics and human evaluation while maintaining competitive inference efficiency.
MGIE is based on the idea of using MLLMs, powerful AI models that can process both text and images, to enhance instruction-based image editing. Although MLLMs have shown remarkable capabilities in cross-modal understanding and visually aware response generation, they have not been widely applied to image editing tasks. MGIE integrates MLLMs into the image editing process in two ways. The first is to use the MLLM to derive expressive instructions from user input. These instructions are concise and clear, and provide explicit guidance for the editing process. For example, given the input “make the sky more blue,” MGIE can generate the instruction “increase the saturation of the sky region by 20%.” The second is to use the MLLM to generate a visual imagination, a latent representation of the desired edit, which guides the model performing the pixel-level changes.
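To make the instruction-derivation step concrete, here is a minimal Python sketch of that first mechanism. It is not Apple's implementation: the prompt template and the names `DERIVE_PROMPT`, `derive_expressive_instruction`, and `stub_mllm` are hypothetical, the model call is stubbed so the snippet runs without any ML dependencies, and MGIE additionally conditions on the input image, which is omitted here for brevity.

```python
from typing import Callable

# Hypothetical prompt template for the instruction-derivation step.
# MGIE's actual prompting and model wiring differ; this only illustrates
# the idea of expanding a terse user command into an expressive one.
DERIVE_PROMPT = (
    "You are an image-editing assistant. Rewrite the user's request as a "
    "single concise, explicit editing instruction.\n"
    "User request: {request}\n"
    "Expressive instruction:"
)


def derive_expressive_instruction(
    mllm: Callable[[str], str], user_request: str
) -> str:
    """Expand a terse edit request into explicit guidance via an MLLM.

    `mllm` stands in for any text-generating multimodal model; the real
    MGIE pipeline also feeds the input image to the model.
    """
    return mllm(DERIVE_PROMPT.format(request=user_request)).strip()


if __name__ == "__main__":
    # Stub model so the sketch runs standalone; a real pipeline would
    # call an actual MLLM here.
    def stub_mllm(prompt: str) -> str:
        return "increase the saturation of the sky region by 20%"

    print(derive_expressive_instruction(stub_mllm, "make the sky more blue"))
    # -> increase the saturation of the sky region by 20%
```

Taking the model as a plain callable keeps the sketch decoupled from any specific MLLM; the point is simply that a vague request goes in and an explicit, actionable instruction comes out, which is what gives the downstream editing step clear guidance.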