The dalle-flow project can generate images from text.

Project Address

The following demonstrates the effect of the project and describes the algorithm used.

1. Effect Demo

Let's start with a simple example.

For example, we want to generate an image for the text "a teddy bear on a skateboard in Times Square".

After typing it into dalle-flow, we get the following image:


Isn't it amazing!

I'll show you how to use this project with a few lines of Python code.

First, install docarray and jina:

pip install "docarray[common]>=0.13.5" jina

Next, define a server_url variable to store the address of the dalle-flow service:

server_url = 'grpc://'

Here, server_url points to the officially hosted service; we can also follow the documentation and deploy the model on our own server (GPU required).

Submit the text to the server and get the candidate images:

from docarray import Document

prompt = 'a teddy bear on a skateboard in Times Square'
da = Document(text=prompt).post(server_url, parameters={'num_images': 2}).matches

After the text is submitted, the server calls the DALL-E Mega model to generate candidate images, then calls CLIP-as-service to rank the candidates by how well they match the text.

We set num_images to 2, but 4 images are eventually returned: 2 from the DALL-E Mega model and 2 from the GLID-3 XL model. Since the server behind server_url is hosted overseas, the request may take a while, so be patient while it runs.
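The ranking step can be pictured as computing one embedding for the text and one per candidate image, then sorting candidates by cosine similarity. This is only a sketch of the idea, not the CLIP-as-service API; the embeddings below are made up for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical embeddings: one for the prompt, one per candidate image.
text_emb = [0.9, 0.1, 0.3]
candidates = {
    'image_0': [0.8, 0.2, 0.4],
    'image_1': [0.1, 0.9, 0.2],
    'image_2': [0.7, 0.0, 0.5],
}

# Rank candidates by similarity to the text, best match first.
ranked = sorted(candidates, key=lambda k: cosine(text_emb, candidates[k]), reverse=True)
print(ranked)  # → ['image_0', 'image_2', 'image_1']
```

The candidates whose embeddings point in nearly the same direction as the text embedding end up at the front of the list.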

After the program finishes, we can display these 4 images:

da.plot_image_sprites(fig_size=(10,10), show_index=True)


We can select one of them and continue to submit it to the server for diffusion.

Each image has a number in its upper-left corner; here I have selected the image numbered 2:

fav_id = 2
fav = da[fav_id]

diffused = fav.post(f'{server_url}', parameters={'skip_rate': 0.5, 'num_images': 36}, target_executor='diffusion').matches

The diffusion step feeds the selected image into the GLID-3 XL model to enrich its texture and background.

The returned results are as follows.


We can then choose a satisfactory image from among them as the final result.

fav = diffused[6]

2. The Algorithm Behind dalle-flow

The dalle-flow project is simple to use, but the DALL-E algorithm behind it is complex, so it is only briefly described here.

The goal of DALL-E is to treat text tokens and image tokens as a single sequence and model them autoregressively with a Transformer.


This process is somewhat similar to machine translation: machine translation turns English text into Chinese text, while DALL-E turns English text into an image. A token in the text is a word (or sub-word), while a token in the image represents a small image patch (DALL-E first compresses the image into a grid of discrete tokens with a discrete VAE, rather than working on raw pixels).
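The idea of "one sequence, predicted token by token" can be sketched without a real Transformer. In the toy below a simple bigram counter stands in for the Transformer, and all token names and training sequences are made up: text tokens form the prefix, and image tokens are generated one at a time, each conditioned on what came before.

```python
from collections import defaultdict

# Made-up vocabulary: text tokens followed by image tokens, one sequence.
training_sequences = [
    ['<teddy>', '<bear>', 'img_7', 'img_3', 'img_3', 'img_9'],
    ['<teddy>', '<bear>', 'img_7', 'img_3', 'img_9', 'img_9'],
]

# "Train": count next-token frequencies (a bigram stand-in for a Transformer).
counts = defaultdict(lambda: defaultdict(int))
for s in training_sequences:
    for prev, nxt in zip(s, s[1:]):
        counts[prev][nxt] += 1

def predict_next(token):
    # One autoregressive step: most likely next token given the current one.
    return max(counts[token], key=counts[token].get)

# Generate image tokens autoregressively after the text prefix.
seq = ['<teddy>', '<bear>']
for _ in range(4):
    seq.append(predict_next(seq[-1]))
print(seq)  # → ['<teddy>', '<bear>', 'img_7', 'img_3', 'img_9', 'img_9']
```

The real model conditions each prediction on the entire prefix rather than just the previous token, but the generation loop has the same shape.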

Those who are interested in the dalle-flow project can run the code above and try deploying the model themselves.
