Google Brain has published its latest research on AI image generation, proposing a system called Imagen that creates realistic images from a user's text input. In comparisons with other current state-of-the-art image generation algorithms such as VQ-GAN, LDM, and DALL-E 2, human raters tended to find the images generated by Imagen more realistic and more consistent with the input text description.
Imagen is a text-to-image diffusion model that pairs a deep understanding of text with photorealistic image output. It is built on a large Transformer language model, which gives it strong text understanding, and relies on a diffusion model to generate high-fidelity images.
The researchers found that general-purpose large language models such as T5, pre-trained on text-only corpora, are very effective at encoding text for image synthesis. Increasing the size of the language model in Imagen improves both the realism of the samples and the consistency between image and text description, and does so more effectively than increasing the size of the diffusion model.
Although Imagen was never trained on the COCO (Microsoft Common Objects in Context) dataset, it achieves an FID score of 7.27 on it, the lowest reported so far (lower is better). Human evaluators also found Imagen's samples to be comparable to the COCO data itself in terms of image-text consistency.
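To make the FID figure more concrete: the Fréchet Inception Distance compares the statistics of feature vectors extracted from real and generated images, and a smaller distance means the two sets look more alike. The sketch below is a simplified illustration, not the evaluation code used for Imagen. It assumes the feature vectors have already been extracted, and it assumes a diagonal covariance (each feature dimension treated independently), whereas the standard metric uses full covariance matrices and a matrix square root. The function names are made up for this example.

```python
import math


def mean_and_variance(features):
    """Per-dimension mean and population variance of a list of feature vectors."""
    n = len(features)
    dims = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    variances = [
        sum((f[d] - means[d]) ** 2 for f in features) / n for d in range(dims)
    ]
    return means, variances


def frechet_distance_diagonal(real_features, generated_features):
    """Frechet distance between two Gaussians fitted to the feature sets.

    ASSUMPTION: covariances are diagonal, so the trace term reduces to
    sum(var_r + var_g - 2 * sqrt(var_r * var_g)). The full FID uses
    complete covariance matrices instead.
    """
    mu_r, var_r = mean_and_variance(real_features)
    mu_g, var_g = mean_and_variance(generated_features)
    # Squared distance between the two mean vectors.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # Mismatch between the two (diagonal) covariances.
    cov_term = sum(
        vr + vg - 2.0 * math.sqrt(vr * vg) for vr, vg in zip(var_r, var_g)
    )
    return mean_term + cov_term


if __name__ == "__main__":
    # Toy 2-dimensional "features", for illustration only.
    real = [[0.0, 0.0], [2.0, 2.0]]
    fake = [[1.0, 1.0], [3.0, 3.0]]
    print(frechet_distance_diagonal(real, fake))
```

Identical feature sets give a distance of zero, and the value grows as the means or the spreads of the two sets drift apart, which is why a lower score indicates generated images that are statistically closer to real ones.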
Google also used DrawBench, a more comprehensive and challenging benchmark for the text-to-image field, to further evaluate Imagen's ability to generate images from text. Imagen was compared with the VQ-GAN, LDM, and DALL-E 2 algorithms in systematic tests covering spatial relationships, long-form text, and rare words, with human raters judging each algorithm on image-text consistency and on the realism of the images.
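A human evaluation of this kind can be tallied in a few lines. The sketch below is only an illustration of the bookkeeping, under the assumption that each rater judgment is stored as a (prompt category, preferred model) pair. That record format, the function name, and the placeholder model names are all hypothetical, and the votes in the example are made up rather than taken from the study.

```python
from collections import defaultdict


def preference_rates(votes):
    """Return, per prompt category, the share of votes each model received.

    votes: iterable of (category, preferred_model) pairs (assumed format).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for category, model in votes:
        counts[category][model] += 1

    rates = {}
    for category, per_model in counts.items():
        total = sum(per_model.values())
        rates[category] = {m: c / total for m, c in per_model.items()}
    return rates


if __name__ == "__main__":
    # Made-up votes with placeholder model names, for illustration only.
    example_votes = [
        ("spatial relationships", "model_a"),
        ("spatial relationships", "model_a"),
        ("spatial relationships", "model_b"),
        ("rare words", "model_b"),
    ]
    for category, shares in preference_rates(example_votes).items():
        print(category, shares)
```

Keeping the tallies separate per category makes it possible to see whether a model's advantage holds across prompt types or only on some of them.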
As can be seen from the figure below, human raters generally judged Imagen to perform better than VQ-GAN, LDM, and DALL-E 2, both in image-text consistency (Alignment) and in image realism (Fidelity).
Google is not releasing Imagen publicly for the time being, and its future work will focus on addressing the challenges and limitations of a release. The researchers noted that although they filtered part of the training data to remove undesirable content, they also used the LAION-400M dataset, which is known to contain inappropriate content. In addition, because Imagen relies on a text encoder trained on unfiltered web data, it may still have inherited harmful stereotypes.
In addition, while image-to-text and image labeling models have been audited extensively for social bias, there has been relatively little work on assessing social bias in text-to-image models. Google researchers found a number of social and cultural biases in Imagen, such as an overall tendency to generate people with lighter skin tones and to depict occupations in ways that align with Western gender stereotypes.
Therefore, despite Imagen's powerful capabilities, Google does not plan to open source its code or provide a public demo for now. The downstream applications of text-to-image models are very diverse and may affect society in complex ways. Considering the potential risks, Google will not release Imagen until a framework for responsible externalization has been established that balances those risks against the benefits of open access.