type of project: AI Assisted Theatrical Design

published: 2021

by: Elena Tilli & Samuel Chan




AI Assisted Theatrical Design

Chapter 0: Understanding the Latent Space

From Text to music, from picture to video, different tools of Artificial Intelligence have been used in art making. Could AI also be applied to theatre making, particularly theatrical design, which is a more complex “problem” for the machine?

Coming from a theatrical design background, the research team wanted to explore whether these AI tools could be of any use in “assisting” theatrical design.

In 2014, Ian Goodfellow and his team published their groundbreaking idea of Generative Adversarial Networks (GANs), and since then many exciting works have been built on it.

A typical GAN consists of a Generator and a Discriminator. The Generator produces “fake” data while the Discriminator tries to distinguish between “real” and “fake” data; we could compare them to an art forger and an art historian/expert. A GAN is usually constructed so that, as training proceeds, both the Generator and the Discriminator get better at their jobs. Computer scientists have developed different metrics to measure the performance of such networks. Once the trained network reaches an optimal point, the training is stopped, and the trained Generator can then be used to generate realistic-looking data.
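The adversarial setup above can be sketched as a toy example (a minimal numpy stand-in of our own making, not the actual Stylegan code; the linear generator and logistic discriminator here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, w):
    # Maps random latent vectors z to "fake" samples (a linear toy stand-in
    # for the deep convolutional generator of a real GAN).
    return z @ w

def discriminator(x, v):
    # Outputs a probability that each sample is "real" (toy logistic model).
    return 1.0 / (1.0 + np.exp(-(x @ v)))

# Toy "real" data: samples drawn from a shifted Gaussian.
real = rng.normal(loc=2.0, scale=0.5, size=(64, 1))
z = rng.normal(size=(64, 4))           # random latent inputs
w = rng.normal(size=(4, 1)) * 0.1      # generator weights
v = rng.normal(size=(1, 1)) * 0.1      # discriminator weights

fake = generator(z, w)
# Discriminator loss: classify real as 1 and fake as 0 (binary cross-entropy).
d_loss = -np.mean(np.log(discriminator(real, v) + 1e-8)) \
         - np.mean(np.log(1.0 - discriminator(fake, v) + 1e-8))
# Generator loss: fool the discriminator into scoring fakes as real.
g_loss = -np.mean(np.log(discriminator(fake, v) + 1e-8))
```

In a real GAN these two losses are minimized in alternation, which is what makes both players improve together.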

"This X Does Not Exist" has listed a lot of different projects that people have been exploring with GAN. The myth and the promise to create something that doesn't exist before is very intriguing, that's how we decided to start our exploration with the GAN network. And we decided to start with a very popular network called the “Stylegan”.

"Stylegan" is first proposed and published by a team of researchers working in Nvidia in 2018. Since then the team continues to make improvement and created stylegan2, whereas there are official implementation created by Nvidia to support training in Pytorch, which is a popular AI software library researchers use these days. The latest version is Stylegan3. For all these years Stylegan remains to be the state of the art network for generating realistic looking images.

Since GANs are said to be able to create things that don't exist, we wondered what would happen if we trained the network on theatrical images. The idea was to start with some time-lapse pictures of a single production and see if it could generate a “look” that never happened in that production.

So we had the idea, but how should we implement it? How do we use these networks? What kind of hardware is needed? Do we train the network locally or online? These are all questions that took us some time to figure out.

Stylegan3 was not yet published when we started the fellowship in September 2021, so we actually began with Stylegan2. If we look at the requirements on GitHub, they say: “1–8 high-end NVIDIA GPUs with at least 12 GB of memory. We have done all testing and development using NVIDIA DGX-1 with 8 Tesla V100 GPUs”. While this sounds pretty normal for a big tech company like Nvidia, for other institutions such hardware might not be easy to access. The expected training time is also rather long: for example, training 1024 x 1024 pictures up to 25000 kimg takes 46 days with a single V100 GPU. Usually the training happens locally on a powerful computer and graphics card, but before committing to such local training and requesting the related resources, we explored some training online.

The service we used is called Runway ML.

Below are the training results using Runway ML. The picture datasets are the time-lapse photos from 3 different productions, trained with different settings and with the transfer learning provided by Runway ML.

Twelfth Night / 15000 steps / pretrained “landscape” model
Death of Yazdgerd / 10000 steps / pretrained “illustration” model
The Seagull / 10000 steps / pretrained “illustration” model

From a distance, these generated videos look pretty “real”. But when we look closer, they are of course missing a lot of the details of the original “real” pictures. We discovered that no matter what size of pictures we upload, Runway ML resizes them to 320 x 320. This left us no choice but to explore the option of local training, so that we would have more control over the training and could see whether that would benefit the network we wished to train.

We started our local training on a Windows laptop with an external GPU (Nvidia RTX 2080 Ti), which only has 11 GB of video memory. At that time Stylegan3, which gives you more options to control how much video memory you use, was not available.

We managed to install the software needed on the laptop and start the training. However, the training would stop randomly whenever the computer ran out of video memory. So for a while we had to check the status of the training every day, and when it stopped we would simply restart it. We discovered that this phenomenon was mostly related to the metrics-calculation step, which needs more memory. Once we disabled that, the training became more stable, but it would still stop from time to time. This is why we also started to explore other online options for more stable training.
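For reference, our training invocations looked roughly like the following (a sketch assuming the stylegan2-ada-pytorch implementation; the paths are placeholders, not our actual directories):

```shell
# Pack the photos into the dataset format the repository expects.
python dataset_tool.py --source=./photos --dest=./datasets/production.zip

# Launch training; --metrics=none disables the memory-hungry metrics pass
# that kept crashing the training on our 11 GB GPU.
python train.py --outdir=./training-runs --data=./datasets/production.zip \
    --gpus=1 --metrics=none
```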

After some research on the internet, we discovered that many people run their experiments on Google Colab.

Colab is an online Linux- and Python-based programming platform meant to be used interactively, which means it was not originally designed for long training runs. Compared to running things locally, it saves you the time (and perhaps hard drive space) of creating many virtual environments, and you can test your code quickly in a Chrome browser. Of course there are limitations to such a service. With the free version you can only run code while your browser is active, and even with the paid Pro+ version your code will only run for 24 hours from the first execution.

The benefit of training on Colab, however, is that you get a GPU with more than 12 GB of memory, which means you can run the training without any memory issues. There are many Colab notebooks that researchers share online, and it's quite straightforward to use them once you copy them into your own Google Drive. An example is the notebook made by Derrick Schultz, who has also modified the code for his own interests. We did some training on various datasets with the notebook; below are the results for the “Twelfth Night” dataset.

Fake Pictures Generated:

Real Pictures:

Since we were now running the software on an appropriate GPU and at 1024 x 1024 resolution, we got better results. But if we look closer, a lot of details are still lost, probably because there are too few pixels for the network to fully understand the features of the performer/costume. One thing to notice is that this is almost a single-set production, and the ceiling is always there. Apparently the network catches the features of the ceiling and the other fixed elements pretty well.

The software package also allows us to create interpolation video:
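An interpolation video is made by walking between points in the latent space and rendering a frame at each step. A minimal sketch of the walk itself (plain numpy, our own illustration; the 512-dimensional latent size matches Stylegan's Z space, but the call to the trained generator is omitted):

```python
import numpy as np

def interpolate_latents(z0, z1, steps):
    # Linear interpolation between two latent vectors; each intermediate
    # vector would be fed to the trained generator to render one video frame.
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z0[None, :] + t * z1[None, :]

z0 = np.zeros(512)   # two latent points; in practice these are random draws
z1 = np.ones(512)
frames = interpolate_latents(z0, z1, steps=5)
```

Smoother paths (e.g. spherical interpolation, or looping through several points) are common variations on the same idea.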

The drawback of training online is that we had to restart the training every day. We also subscribed to Google Colab Pro+, which gives access to better graphics cards. Colab's services changed a little during the period of the fellowship: since November 2021 a Google Colab Pro+ user gets an Nvidia V100 16 GB GPU at minimum and sometimes an Nvidia A100 40 GB GPU, which are much faster and more stable compared to what was available before.

Colab will continue to be a great resource for independent researchers who might not have the means to build their own deep learning machine, especially at a time when both the materials for making computer chips and graphics cards themselves are in shortage.

Since training a new network takes a long time, we tried to utilize every resource we had, even where some computers required us to reduce the resolution to 512 x 512. Below are the results of the Stylegan2 training on the Theater Dortmund Archive, which contains over 21,000 pictures of Theater Dortmund's productions from the past 10 years:

This training was done on a local Linux desktop computer with 2 x Nvidia RTX 2080 Super GPUs; we had to set the resolution to 512 x 512 so that the training would work with the 8 GB of memory we have. Since we were able to use both GPUs at the same time, the local training was faster, but it only worked at 512 x 512 resolution.

By working in parallel we were able to explore different datasets simultaneously, although it still took a long training time and needed daily attention.

On October 28, 2021, Nvidia shared their official implementation of Stylegan3 on GitHub. This development allowed us to train more efficiently both offline and online.

Because the Stylegan3 network uses a different architecture to understand the datasets, we had to retrain all our networks to see the latest results. The software package still lets you train in Stylegan2, and there are two options for Stylegan3: Stylegan3-R and Stylegan3-T. However, these training configurations are not interchangeable, which means that once you have started training you cannot change the configuration it uses.

We trained a new dataset, which contains around 24,000 pictures from 15 productions, on a local Linux desktop with 2 x RTX 2080 Ti GPUs using the Stylegan3-T configuration, and the results are below:

We now had an AI that helps us generate 2D images, but our target was something in 3D; that's why we looked into the 3D reconstruction tools that are available.

We found MiDaS, a pretrained depth-estimation network that works with any image, without requiring the image to have been part of the network's training data. This network allowed us to complete the pipeline from 2D image generation to depth image generation.

With the 2D image and the depth image, we could then move on to 3D reconstruction. We used the instancing technique in TouchDesigner, but other software that can use a color and depth map for 3D creation should be able to produce similar results.
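The reconstruction idea can also be sketched outside TouchDesigner: back-project each pixel of the depth map into a 3D point, one instance per pixel (a simplified orthographic sketch of our own; a real camera model would add perspective division):

```python
import numpy as np

def depth_to_points(depth, fov_scale=1.0):
    # Back-project a depth map into a grid of 3D points, one per pixel --
    # the same idea as instancing one point/plane per pixel in TouchDesigner.
    h, w = depth.shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w),
                         indexing="ij")
    pts = np.stack([xs * fov_scale, ys * fov_scale, depth], axis=-1)
    return pts.reshape(-1, 3)

depth = np.full((4, 4), 0.5)   # a flat dummy depth map for illustration
points = depth_to_points(depth)
```

The matching color image then supplies one RGB value per instance, which is how the color and depth maps combine into the reconstructions shown below.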

3D Reconstruction View 1
3D Reconstruction View 2
3D Reconstruction View 3

Throughout the fellowship we almost never stopped the Stylegan3 training, both online and offline. Below are the results combined with 3D reconstruction:

Time-lapse photos of 15 productions on the left; around 1,100 pictures of theatrical design in the middle; and the Theater Dortmund Archive on the right.

What we discovered through all these trainings is that the network actually recreates the dataset pretty well. In the case of the 15-productions dataset, it is almost always able to generate pictures that look close to the original productions.

The pictures generated from the trained network, however, don't really create a “theatrical design” that doesn't exist. We think there are a few reasons why that's the case:

1. The first reason is related to the limitations of resolution and hardware. The Stylegan network excels at learning and generating human and animal faces, which are usually close-up photos, or generated landscapes, which are quite abstract. The theatrical design images we fed the network are mostly wide shots that cover the whole stage and usually contain a lot of detail. While the network is able to learn the overall arrangement of the “stage”, it fails to recreate the costumes and the faces accurately. We mentioned that we trained the network with 1024 x 1024 images; one could expect that if we trained at an even higher resolution, say 2048 x 2048 or even 4096 x 4096, we would get better results. However, equipment that could train such a network might be difficult to access at this point for researchers outside the big tech companies. Still, we can expect 4K or higher image training to become available soon, and it will then be worth training the network again.

2. The second reason is related to our perception and understanding of theatre. The features of a real human or animal are definite: you have brown eyes, black hair and a tall nose, and these features probably won't change much. So when researchers claim that the network created a person that doesn't exist, what they mean is that it would be hard to find a person with the exact same features in real life. In the case of theatrical design or theatrical images, however, we have more tolerance. Small changes in the position of a performer don't really create a “new” design. Theatre is imaginary in nature, and theatrical design requires a degree of specificity that Stylegan is not designed for.

3. The third reason is related to the handling of the dataset. All the images are currently treated as one big pool of data, which means the network has to learn common features from some very different pictures, and that is beyond the design of the algorithm. We might be able to get better results by dividing the pictures into different categories. This is called conditional training, but it requires a larger dataset. We did try conditional training with the 15-productions dataset, but the network failed to converge to usable results.

Given all these limitations we found in applying Stylegan3 to theatrical design, we started to explore other ways to use the network. One of the directions is an exploration of using the tools on the subject of “Lilith”.

While investigating the possibilities that AI and Deep Learning could offer to the technicalities of the design process in theater, we also decided to see if in some ways it could provide research material for the creation of content during the design process.

Could we use the outcome of the learning, based on the specific subjects or the main theme of the play, to gather ideas about color patterns, textures, and shapes to use as a resource during the design process? Could the outcome of deep learning inspire the design process and provide new ideas, offering a new point of view on the storytelling?

We chose a subject that could be of interest in theatrical terms.

Lilith is a research work in search of the lost myth of devouring femininity. Lilith is the first wife of Adam who, refusing to submit in the sexual relationship with the Edenic father, transforms herself into a demon of the night; because of that she of course leaves paradise and is erased from the pages of the Bible. Lilith did not like living as the subservient wife of Adam and hence went to the deserts, desiring solitude and freedom. It is the story of a woman who threatens to shake the very foundations of a male-centered faith and discover a new self. Her name means 'night monster' and is derived from the name of the Babylonian female demon Lilit, whose male counterpart is Lilu.

From there on lies a mixture of legends, tales, magic texts, and cabalistic evocations. When three angels tried to force her return, everything was in vain. She is also described as a vampiric child-killer, like her classical counterpart Lamia. The evil she threatened against children was said to be counteracted by wearing an amulet bearing the names of the three angels. Really, Lilith is the last link between Jewish culture and the ancient God, the giver of life and death, of which the first biblical woman inherits only the negative aspect, the giver of death.

Lilith may have been expelled from the Bible, but that didn't keep her from being at the center of conversation and artistic production. From Michelangelo to Primo Levi, Marina Abramović and Monica Grycho, she keeps inspiring artistic work throughout history on the relationship between the feminine and the sacred, the role of mother and woman, and so on.


In the beginning, we used VQGAN + CLIP to generate images starting from a caption or prompt. Google Colab is the online programming environment we used to run the algorithm. In short: VQGAN is a generative adversarial neural network that is good at generating images that look similar to others (but not from a prompt), and CLIP is another neural network that can determine how well a caption (or prompt) matches an image.
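The loop at the heart of VQGAN + CLIP, nudging a latent code so the decoded image better matches the prompt, can be illustrated with stand-in functions (everything here, `decode()`, `clip_score()`, the 16-dimensional latent, is a toy assumption of our own; the real system backpropagates through the actual VQGAN decoder and CLIP encoders):

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=16)     # stands in for CLIP's embedding of the prompt

def decode(z):
    # Stand-in for the VQGAN decoder: latent vector -> "image".
    return np.tanh(z)

def clip_score(image):
    # Stand-in for CLIP similarity: higher means a better prompt match.
    return -np.sum((image - target) ** 2)

z = np.zeros(16)
s0 = clip_score(decode(z))       # score before optimization
for _ in range(200):
    # Gradient ascent on the score with respect to the latent z, estimated
    # by finite differences; the real pipeline uses autograd instead.
    grad = np.zeros_like(z)
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = 1e-4
        grad[i] = (clip_score(decode(z + dz)) - clip_score(decode(z - dz))) / 2e-4
    z += 0.1 * grad
```

After the loop, the decoded image scores higher against the prompt than the starting point did, which is exactly the dynamic that shapes the images over the iterations.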

Testing with the Text “Lilith”

Even though we tried running the algorithm with several different combinations of words in the caption, the presence of the name Lilith drove the outcome to a very similar color pattern [dark red was the predominant color], texture, and final image. We can identify the presence of female sexual organs and a cartoonish texture in the image.

Many deep learning networks are able to learn features from a lot of images and recognize similarities within large datasets. Just as our understanding is shaped by the world and the context that surrounds us, the same happens for these deep learning networks.

The dataset is the AI's world: from it, the network finds common features that are collected and reshaped at each step of the learning process. It must therefore be carefully created.

Below are two videos. On the left is a quick carousel of the images chosen to train the network — the database, the input to the learning process. On the right are some of the outcomes during the learning process: the images shaped by the algorithm in color, texture, and shape.

Alongside the Lilith project, we continued to explore the more technical aspects of what the network can do. From the training we did with production time lapses, we found that the MiDaS network sometimes mistakes a light beam for a solid object. Light is abstract even for most human beings. To verify what the network sees, we created some synthesized light pictures, trained them in Stylegan3, and then used MiDaS to create the depth pictures. Below are the results for color, depth, and 3D reconstruction:

We know that just giving the network 2D images won't let it understand what a light actually does. In the modern stage lighting design process, computer-aided design and drafting play a big part, and this enables us to export the data from CAD drawing files for AI development. Below are the results of a GAN network we created based on a small dataset of lights (only one production):

We chose only 6 parameters per light for the moment: X position, Y position, Z position, Z rotation, tilt angle of the light, and spread (field) angle of the light.
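For concreteness, the six parameters can be packed into a vector and normalized before being fed to a GAN (a sketch of our own; the parameter ranges below are illustrative assumptions in metres and degrees, not values from our dataset):

```python
import numpy as np

# The six parameters exported per light; the names and ranges here are
# illustrative assumptions.
PARAMS = ["x", "y", "z", "z_rotation", "tilt", "spread"]
RANGES = np.array([[-10, 10], [-10, 10], [0, 12],
                   [-180, 180], [-90, 90], [5, 70]], dtype=float)

def normalize(light):
    # Map each parameter into [-1, 1], the range a generator with a tanh
    # output layer would produce.
    lo, hi = RANGES[:, 0], RANGES[:, 1]
    return 2.0 * (light - lo) / (hi - lo) - 1.0

def denormalize(code):
    # Invert the mapping to recover physical light parameters.
    lo, hi = RANGES[:, 0], RANGES[:, 1]
    return lo + (code + 1.0) / 2.0 * (hi - lo)

light = np.array([0.0, 0.0, 6.0, 0.0, 0.0, 37.5])
code = normalize(light)
```

Each production's rig then becomes a small set of such vectors, which is the tabular data the GAN is asked to model.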

With the data we have in hand, it is impossible to evaluate the design of the network. The results presented above are a proof of concept of what is possible if we have more data. From the results we can see that currently, whether because of the design of the network or the data, the network is not able to understand the data fully, and therefore the generated lights don't resemble what's in the real data.

Moving forward there are many other aspects to take care of, including how to take scenery into account. In any case, the training will require a network designed not only for 2D image training, but also for other aspects like 3D data and lighting data.

Learning the features of the data and simplifying its representation in order to find patterns is what, in deep learning, causes the dimensionality of the image to first be reduced before it is ultimately increased again.

After this lossy compression, the model is required to store all the relevant information in order to be able to reconstruct the compressed data.

This ‘compressed state’ is the Latent Space Representation of our data: information hidden from us, a representation as an n-dimensional point we are unable to picture because we are limited to 4 physical dimensions, 3 of space and 1 of time.
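The idea of a compressed latent representation can be illustrated with the simplest possible encoder/decoder pair, PCA (a linear stand-in of our own making, not the Stylegan machinery):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images": 100 samples in a 64-dim pixel space that actually lie on a
# 3-dim subspace, mimicking how real data concentrates in a latent space.
basis = rng.normal(size=(3, 64))
data = rng.normal(size=(100, 3)) @ basis

# PCA via SVD: a linear encoder/decoder pair.
mean = data.mean(axis=0)
_, _, vt = np.linalg.svd(data - mean, full_matrices=False)

def encode(x):
    # 64-dim "pixels" -> 3-dim latent point (the compressed state).
    return (x - mean) @ vt[:3].T

def decode(z):
    # Reconstruct the 64-dim sample from its latent point.
    return z @ vt[:3] + mean

latent = encode(data)
recon = decode(latent)
```

Because the toy data really is 3-dimensional, the reconstruction is essentially exact; deep networks perform a far more expressive, nonlinear version of this same compress-and-reconstruct.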

This hidden space is what we wanted to represent by giving a physical disposition to the pictures created by the algorithm. People would walk through this multidimensional space, created by displaying many of the iterations printed on transparent film, entering with their bodies a space their minds cannot yet conceive.

We believe it was a first attempt to bring people closer to something not yet easily accessible: an insight into the algorithm, so present in our everyday lives, yet still so obscure to our consciousness.

This 5-month adventure with AI and theatrical design has been a challenging and fun one. The application of AI to different aspects of our lives is just the beginning, and the related technology will continue to evolve. Like it or not, there isn't much we can do to stop this trend. To move beyond the almost clichéd discussion of ethics in AI, as artists and researchers we need more understanding of what these tools can do, and we also need to figure out a way for the public to understand them. The installation we created is one such attempt. We will continue to explore the possibilities of working with these tools and different archives.

Elena is a multidisciplinary artist.

She creates experiences for the audience, inspired by humanity and the human body.

Graduated in mechanical engineering (MA), Elena completed her education at Yale School of Drama studying Technical Design and Production and Design for Theater. She is currently in the second year of the program held by Claudia Castellucci at Scuola Conia Raffaello Sanzio, focused on the Theory of Scenic Representation.

She was also the recipient of the residency offered by Troikatronix-Isadora and LakeStudio Berlin in 2019. During this time she created the short piece Ex Matrice Corpora.

Elena is the co-founder of the Cistifellea Collective together with the musician Nicolas Cristancho, in which she practices her love for directing and media design. She is also a video designer in the theater, and her collaborations have been seen at The Kitchen Theater in New York, at the Bayerische Opera in Munich, and in public theaters in Barcelona. With Cistifellea she directed the performances Asuntos Humanos (2021), Todo Bien Todo Tranqui (2021), and El Silencio no Existe (2020).

Samuel received his MFA from Yale School of Drama, where he majored in Design. His work on Twelfth Night was nominated for Outstanding Lighting Design at the Connecticut Critics Circle Awards. His media art work Aesthetics and Freedom of Human Beings was one of the finalists of IFVA (Incubator for Film and Visual media in Asia) in 2020.

  • Last modified: 28.02.2022 12:21
  • by Elena Tilli