type of project: fellow research project
published: 2020
by: Marco Donnarumma and Andrea Familari
website(s): https://marcodonnarumma.com, http://famifax.com, https://frontevacuo.com
license(s): CC Attribution-Noncommercial-Alike 4.0 International
maintainer(s)/contact:
repository:
Humane Methods
Humane Methods (2019-present) is an ongoing project by Fronte Vacuo (Marco Donnarumma, Margherita Pevere, Andrea Familari) that reflects on the new forms of violence emerging in algorithmic societies. The project consists of happenings and stage productions where dancetheatre, bio-art, interactive music, adaptive lighting, and AI technology integrate into living ecosystems. Each work in the series acts as a branch of a rhizome: each dives into different facets of the same theme and world. The works created so far – ΔNFANG (2019), ℧R (2020) and the forthcoming ΣXHALE (2021-) – combine performing arts with technology, and the human with the non-human, in different ways to reveal algorithmic violence through sensory, physical and ahuman strategies. The project is an aesthetic research into how a synthesis of symbols, movement, music and AI can enable posthuman forms of empathy.
The technological research for ℧R and ΣXHALE has been conducted at the Academy for Theater and Digitality, and this page describes our research questions, workflow and findings in relation to these projects. Please note that this page does not discuss the theatre pieces we created in their entirety, but focuses only on the technological research that underpins them. For further information about the aesthetic, conceptual and dramaturgical research behind these works, please visit our website.
℧R
℧R is a solo performance for a body, AI algorithms, lights and the vibrational force of sound. Embodied in Donnarumma’s musical instrument XTH Sense, an interactive machine learning algorithm transforms the inner sound of muscles, blood and bones into a real-time choreography of movement, sounds and light. Drawing on Donnarumma’s more recent movement research, the piece combines multi-sensorial stimulation with human-computer interaction and somatic practices to test the seeming impossibilities of bodily articulation and repetition. ℧R is a walk in the dark. A thin body walks while a dim light shines. A path covered in soil is between them. Gazeless, the figure walks and prays, walks and prays, walks and, entranced, pries into the darkness. Its red cloth sanctifies it. Its missing face alienates it. The walk is a loop, a prayer, an aggression, an annihilation and a transformation. An AI computer vision system watches over the body, attempting to distinguish the body from the ‘noise’ surrounding it. The AI wants to represent what it sees, but the system works against itself in a paradoxical loop; the more the algorithm learns to see, the brighter the light becomes. In the eyes of the algorithm, the body is eaten by the storm of light.
ΣXHALE
The latest chapter in the Humane Methods series is ΣXHALE (2021-). A poetic reflection on ethics in algorithmic societies, ΣXHALE is a procedural and participatory piece lasting several days. It develops in loops, performed simultaneously by 6 performers* and an AI computer-vision system. As it watches the performers repeat a proto-prayer ritual, the AI learns their movements in the space and then begins searching for anomalies, something that deviates from what it expects to see. The learning process is faithfully transformed in real time into music and lighting, thus influencing the events in the form of audiovisual stimuli. With each repetition, each variation and each anomaly, the loops mutate into stories of violence and empathy, and the audience is asked to choose how they want to participate in these stories. The interaction between audience, performers and AI gives life to an endless sequence of micro-narratives.
AI in theater and performance: Strategy & Questions
With ℧R and ΣXHALE we aim to renew the way in which people experience the socio-cultural implications of AI, and the way these are publicly discussed nationwide, through two strategies. First, by leveraging our previous research into the synthesis of movement, sound and human-machine interaction, the pieces symbolically reveal algorithmic violence to the audience through experiences that are primarily corporeal and visceral. Second, by combining dance theatre, procedural dramaturgy, audience participation, real-time music and AI, we strive to create performances as social experiments, where human and non-human actors have to negotiate the collective ethics of their community.
A crucial research question for us has been how to create an algorithm, an AI, that can perform as an actor. That is, how to create an AI that does not dictate instructions to human actors, or generate textual, sonic or video material, but rather one that interacts with human and non-human actors and audiences, and one that can manipulate time and space within a performance. This is important because we want to avoid any stereotypical characterisation of AI - be it utopian or dystopian - and instead exploit and make manifest the nature of algorithmic intelligence as it is, rather than as it is claimed to be in the media and on mass-communication platforms.
This led us to another, related question: how can we show an audience the agency of an AI algorithm, its capacity to tangibly affect the course of a performance, without falling into didactic technical exercises? How do audiences understand an AI - what do they think it is and what do they think it looks like? How is it possible to play with those expectations to create an aesthetic and poetic language of the machine? And in which ways can we show that any AI is embodied - that it has a body and therefore does not exist in a digital void, but cannot escape interacting with the world and its inhabitants?
Technical infrastructure
During our research, together with computer scientist Baptiste Caramiaux and AI engineer Meredith Thomas, we conceived and implemented a technical infrastructure using custom-made software and hardware. This infrastructure is not merely a technological addendum to the scenography, but rather is the scenography within which the events unfold. Circuit boards, electrical, video and audio cables, and all the technical equipment are an integral part of the set design.
The infrastructure relies on multiple systems that are interconnected with one another:
- a physical network of 10 computational units, each consisting of an embedded computing board (RaspberryPi) that captures real-time video input from a camera. The physical network of computational units is distributed across the theater space, watching over all the events and the actions of the human performers, thus becoming a network of “sensing organs” for the AI.
- a digital network of 10 AI algorithms running on a tower PC offstage. Technically, the type of algorithm we use is called an autoencoder. The algorithms are managed through a visual interface, where data can be visually examined and learning parameters can be changed on the fly.
- a rig of 12 loudspeakers and 4 subwoofers for sound playback. 8 of these speakers are positioned in an overhead circle above the stage and are used to create spatialised music using ambisonic techniques (virtual speakers are also used, to compensate for the difficulty of placing the needed speakers at stage level or below it). The remaining 4 speakers are located at different, asymmetric spots on the stage floor in order to give more depth and capacity of movement to the sound.
- a light rig with 62 analog fixtures (PAR, Profiler, Fluorescent Tube and Domino) and 24 LED Fixtures (Wash and Bar)
The 10 video outputs are then streamed to a tower PC, where 10 AI algorithms analyse the data, learn the patterns that characterise it and identify possible anomalies that appear in it.
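As an illustration of this streaming layer, here is a minimal sketch (not our production code) of what a single computational unit could run: it captures frames from the camera, reduces them to the 128×128 black-and-white format the autoencoders expect, and pushes them to the tower PC over UDP. The address, port and use of OpenCV are assumptions made for the example.

```python
# Minimal sketch of one "sensing organ": capture frames on a RaspberryPi and
# stream them as JPEG-compressed UDP packets to the tower PC. The address,
# port, resolution and use of OpenCV are assumptions, not our production code.
import socket
import cv2

TOWER_PC = ("192.168.1.10", 5005)        # hypothetical address of the tower PC
FRAME_SIZE = (128, 128)                  # matches the autoencoder input resolution

capture = cv2.VideoCapture(0)            # the RPi camera module
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

while True:
    ok, frame = capture.read()
    if not ok:
        continue
    # The autoencoders work on 128x128 black-and-white input
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, FRAME_SIZE)
    # JPEG compression keeps each frame small enough for a single UDP datagram
    ok, jpeg = cv2.imencode(".jpg", small)
    if ok:
        sock.sendto(jpeg.tobytes(), TOWER_PC)
```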
Software
Our software system includes several types of programs:
- a custom network of 10 convolutional autoencoders, coded in Python and C++. An autoencoder is a type of machine learning algorithm. It learns how to represent data using a method called dimensionality reduction, which essentially consists of reducing a dataset to its most salient features; a sort of compression, so to speak. Autoencoders are widely used today, from extracting semantic meaning from text to facial recognition, medical imaging and photo manipulation. Read more about it here.
- a visual interface to manage the AI learning process, coded in OpenFrameworks. The interface collects and visualises all video inputs and outputs, as well as the loss value and the latent space (see the section The autoencoder below for an explanation of these two aspects of the algorithm). Through the interface it is also possible to set the learning rate, that is - roughly speaking - the speed at which the AI learns (see the sketch after this list).
- an AI sonification software generating 8-channel ambisonic output, coded in Pure Data, Renoise, Reaper and IEM Plugin Suite
- an AI to lighting software generating multiple DMX output, coded in Touch Designer
- an interactive timeline software directing cues and parametric changes for the AI, sound, light and dramaturgy, done in Chataigne
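As a small illustration of the kind of on-the-fly control the interface allows, the sketch below shows how an incoming OSC message could change the learning rate of a running optimiser, assuming a PyTorch implementation. The OSC address, port and the stand-in model are hypothetical; the real control surface is the OpenFrameworks interface described above.

```python
# Sketch: changing an autoencoder's learning rate on the fly via OSC,
# assuming a PyTorch implementation. The OSC address, port and the
# stand-in model are hypothetical; the real control surface is the
# OpenFrameworks interface described above.
import threading
import torch
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

model = torch.nn.Linear(8, 8)                      # stand-in for the autoencoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def set_learning_rate(address, new_lr):
    # Called whenever a message arrives on /autoencoder/1/learning_rate
    for group in optimizer.param_groups:
        group["lr"] = float(new_lr)

dispatcher = Dispatcher()
dispatcher.map("/autoencoder/1/learning_rate", set_learning_rate)

# Listen for control messages in the background while training continues
server = BlockingOSCUDPServer(("0.0.0.0", 9000), dispatcher)
threading.Thread(target=server.serve_forever, daemon=True).start()
```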
Hardware
- 3 computers: the main tower computer that serves as video streaming hub and as a host for the autoencoders; one for AI-to-sound conversion; one for AI-to-light conversion.
- 3 sound cards, one for each computer
- 10 RaspberryPi 4Gb
- 10 RaspberryPi Screen Monitors in multiple dimensions: 5 inches, 7 inches and 10 inches
- Audio mixer, at least 12 channels and 8 subgroups
- at least 8 loudspeakers and 4 subwoofers to set up an ambisonic sound system, ideally 12 loudspeakers
- a lighting rig with various fixtures ( Analog and LED)
The autoencoder
After a few years of research experimenting with reinforcement learning through sound and biodata (see for instance Eingeweide and Humane Methods [ΔNFANG]), in 2021 we opted to work with an autoencoder in order to experiment with the capabilities of AI-based computer vision. At the time, this was a decision based on our will to experiment with the medium on a theater stage; only later did we realise that a computer-vision system offers greater opportunities to establish an interaction between audience, performers and AI that is more easily graspable by a general public. The reason is not only that the algorithm uses video material, but that the learning process of the algorithm can - after some particular hacks - be literally visualised in an intelligible manner.
The photo below on the right may help visualise the concept. The photo shows a test where the algorithm learns the features of a leaf. The leftmost square is the video input. The middle square is the video output, i.e., the input as it is being reconstructed by the algorithm from scratch in real time. The rightmost square shows the loss, which is a variable that the algorithm monitors to understand how close or far the reconstructed video is from the original input video. The loss tells the algorithm how to optimize its output.
The structure of our autoencoders is as follows (illustrated in the block diagram below). Each autoencoder receives as input a matrix of black-and-white, 128×128 pixels. Using a convolutional method, each algorithm learns the feature map of the input, that is, the relationship between neighboring pixels. The feature map is eventually represented by a vector of 7 variables, which in jargon is called the “latent space” or “bottleneck”. At this point, the algorithm begins the reconstruction of the input data. Using deconvolution, the algorithm uses the latent space to visually reconstruct the input into a new output made entirely from scratch, but very closely resembling the input.
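For readers who want something more concrete, the sketch below shows a simplified convolutional autoencoder with the same overall shape, assuming a PyTorch implementation: a 128×128 black-and-white input, a convolutional encoder, a 7-value latent space and a deconvolutional decoder. The exact layer and channel sizes are illustrative, not those of our production code.

```python
# Simplified sketch of a convolutional autoencoder with the same overall shape
# as described above: 128x128 black-and-white input, 7-value latent space,
# deconvolutional reconstruction. Layer and channel sizes are illustrative.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self, latent_dim=7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=4, stride=2, padding=1),   # 128 -> 64
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1),  # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),                    # the "bottleneck"
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16),
            nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),  # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1),  # 32 -> 64
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1),   # 64 -> 128
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)              # latent space: 7 values per frame
        return self.decoder(z), z

model = ConvAutoencoder()
frame = torch.rand(1, 1, 128, 128)                    # one black-and-white camera frame
reconstruction, latent = model(frame)
loss = nn.functional.mse_loss(reconstruction, frame)  # the "loss" shown in the interface
```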
In theory, what the algorithm discards is only the “noise” contained in the original image. But, conceptually speaking, who defines what the noise is? This was an interesting question that accompanied us throughout our research. The answer is not simple: nobody directly chooses what counts as noise; rather, the way in which the algorithm is coded implicitly defines it.
The network of autoencoders does not learn from an existent dataset, but learns only from the events of the performance in real time. The network, however, has a memory. At the end of each performance, the autoencoders save the models they created so as to use them at the start of the next performance. In this way, the network of algorithms constructs, day by day, its own “memory” of the events taking place during the performance, and therefore, develops a continuous “understanding” of what is noise, what is an anomaly or what is important information. This continuous and endless process of learning is the performance of the AI.
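A minimal sketch of this “memory” mechanism, again assuming a PyTorch implementation and a hypothetical file path: the model weights are saved at the end of a performance and reloaded at the start of the next one.

```python
# Sketch of the "memory" mechanism, assuming a PyTorch implementation: each
# autoencoder saves its weights at the end of a performance and reloads them
# at the start of the next one. The file path is hypothetical.
import os
import torch

CHECKPOINT = "checkpoints/autoencoder_01.pt"     # one file per computational unit

def load_memory(model):
    # Resume from the previous performance, if a saved model exists
    if os.path.exists(CHECKPOINT):
        model.load_state_dict(torch.load(CHECKPOINT))

def save_memory(model):
    # Keep what was learned tonight for the next performance
    os.makedirs(os.path.dirname(CHECKPOINT), exist_ok=True)
    torch.save(model.state_dict(), CHECKPOINT)
```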
In the next section, we describe how the AI learning process is manifested during the performance through sound and light, and how this mechanism creates a feedback system between the analog world of the performance and the digital world of the AI.
A performing machine
Our first step towards manifesting the performance of the AI algorithms was to keep the algorithm from completing its learning process. Autoencoders are quick and efficient; they typically learn very quickly, i.e., within milliseconds to a few seconds, depending on the complexity of the input. The gif below (taken from Rkkuang's blog) shows an autoencoder learning digits at a very low learning rate, which results in a learning process of a few seconds.
This, for a theater performance, is quite an important issue. How can one manifest the workings of a process that lasts, at best, a few seconds? To prevent the algorithm from learning fully, we created a feedback system between the digital algorithm and the analog events of the performance. In a typical computing implementation of an autoencoder (like in the gif above: recognising digits), once the loss approaches 0, the algorithm stops, because it has learned the problem at hand. Also, in many implementations the input is static (this may not be the case with some types of facial recognition systems). In our own implementation, because the latent space drives the brightness and pulsation of the light rig, it is impossible for the autoencoder to fully learn its input: as the algorithm learns, it automatically changes the lights and, therefore, its own input.
Technically, the algorithms do learn and do manage to produce a convincing output, but through their own learning they affect what they see, thus entering an ever-changing feedback loop (one could take this as a metaphor for neoliberal, algorithmic societies). In order to lengthen the learning process even further, the learning rate of the autoencoder is set to be relatively slow and is changed throughout the piece in relation to the dramaturgy.
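The sketch below condenses this feedback loop, reusing the simplified ConvAutoencoder from the earlier sketch; the light-control address, the learning rate and the grab_frame() helper are illustrative placeholders.

```python
# Sketch of the feedback loop: the latent space drives the lights, the lights
# change what the cameras see, so the autoencoder can never settle on a
# "solved" input. Address, learning rate and grab_frame() are placeholders.
import torch
from pythonosc.udp_client import SimpleUDPClient

model = ConvAutoencoder()                                    # see earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)    # deliberately slow
lights = SimpleUDPClient("192.168.1.20", 7000)               # hypothetical light-control host

def grab_frame():
    # Placeholder: in the real system, the latest 128x128 frame from a camera unit
    return torch.rand(1, 1, 128, 128)

while True:
    frame = grab_frame()
    reconstruction, latent = model(frame)
    loss = torch.nn.functional.mse_loss(reconstruction, frame)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The learning itself alters the stage: latent values become brightness and
    # pulsation, which in turn alter the next frame the cameras see.
    lights.send_message("/lights/latent", latent.detach().squeeze().tolist())
```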
Another, more subtle way of linking the temporality of the algorithm to that of the performers on stage is through sound. Every half minute or so, a program takes a snapshot of the latent space and uses those 7 variables to generate the musical patterns of 7 digital synthesisers. In practice, the music changes constantly - and quite drastically too, at times - creating a synaesthetic flow of sound and light. The human performers are trained to listen to the sound and to adjust - in real time - their movement qualities to the timbre and rhythm changes in the sound.
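In the piece this mapping is done in Pure Data, Renoise and Reaper; the Python sketch below only illustrates the principle of periodically scaling the 7 latent values to MIDI controls for 7 synthesisers. The MIDI port name, controller number, assumed value range and the get_latent_snapshot() helper are hypothetical.

```python
# Illustration only: in the piece this mapping happens in Pure Data, Renoise
# and Reaper. Every ~30 seconds the 7 latent values are scaled to MIDI and
# sent to 7 synthesisers. Port name, controller number, value range and
# get_latent_snapshot() are hypothetical.
import time
import mido

port = mido.open_output("Synths")                  # hypothetical MIDI output port

def get_latent_snapshot():
    # Placeholder: the real values come from one autoencoder's latent space
    return [0.0] * 7

def sonify(latent_snapshot):
    for synth_index, value in enumerate(latent_snapshot):       # 7 values, 7 synths
        cc_value = max(0, min(127, int((value + 1.0) * 63.5)))  # assume values roughly in [-1, 1]
        port.send(mido.Message("control_change",
                               channel=synth_index,             # one MIDI channel per synth
                               control=1,
                               value=cc_value))

while True:
    sonify(get_latent_snapshot())
    time.sleep(30)                                  # every half minute or so
```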
How to
The core of the system runs on three different machines:
- Tower desktop PC (PC) - runs the AI and Chataigne, a timeline software which controls both automated and state-machine-based cues.
- Laptop 1 (L1) - controls audio
- Laptop 2 (L2) - controls video and light
As described above, our technological infrastructure relies on 10 RaspberryPi 4 units, each coupled with a camera module. Each RPI sends the video signal from its camera to our PC through a dedicated wireless network.
For each video stream, an autoencoder outputs a total of eight OSC messages: a vector of seven values (the latent space) and one loss value. At the same time, each autoencoder creates a visual reconstruction of the camera input, and the OpenFrameworks software visualizes each autoencoder's loss and latent space.
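The exact OSC namespace is specific to our production; the sketch below only illustrates the format described above - eight messages per stream, seven latent values plus one loss value - with hypothetical addresses and ports.

```python
# Sketch of the OSC output from the tower PC: for each autoencoder, seven
# latent values plus one loss value (eight messages per stream) are sent to
# L1. Addresses and ports are illustrative, not the production namespace.
from pythonosc.udp_client import SimpleUDPClient

l1 = SimpleUDPClient("192.168.1.11", 8000)           # hypothetical address of Laptop 1

def send_autoencoder_state(index, latent, loss):
    for i, value in enumerate(latent):                # seven latent values
        l1.send_message(f"/autoencoder/{index}/latent/{i}", float(value))
    l1.send_message(f"/autoencoder/{index}/loss", float(loss))  # plus the loss

send_autoencoder_state(1, [0.2, -0.4, 0.1, 0.9, -0.3, 0.05, 0.7], 0.031)
```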
L1 receives the OSC messages, which are conditioned and processed by a Pure Data patch. Pure Data then transforms the data into MIDI messages that generate musical patterns for 7 digital instruments (running in Renoise). From Renoise, the resulting music is routed to Reaper where, using the IEM Plugin Suite, the music is encoded and decoded as an ambisonic audio stream. The loss value is used to drive the movement of different sound sources in space, and also serves as a threshold-based trigger for a variety of audio effects.
L2 runs a TouchDesigner patch that controls the light system and acts as a media server redistributing the video content (camera inputs and autoencoder resyntheses) to the monitors located in the audience areas. The DMX values controlling the light fixtures are generated through a combination of two values: a processed sample of the latent space captured every second by the software, and a value created from audio analysis of the summed left and right sound channels from L1.
As for the video, L2 grabs the camera inputs and the autoencoders' resyntheses of those inputs in TouchDesigner, where the video signal is further processed and redistributed across ten different outputs. According to the dramaturgy, during the performance the final video outputs are mixed or switched with the real feed from the cameras and distributed to the ten monitors located in the audience.
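As an illustration of how the two control signals could be combined, here is a small Python sketch; in the production this calculation happens inside the TouchDesigner patch, and the formula, assumed value ranges and channel layout below are assumptions.

```python
# Illustration of combining the two control signals into DMX channel values.
# In the production this happens inside the TouchDesigner patch; the formula,
# the assumed value ranges and the channel layout are hypothetical.
def to_dmx(latent_sample, audio_level):
    """latent_sample: 7 floats, assumed roughly in [-1, 1], sampled once per second.
    audio_level: 0..1 level from analysing the summed L+R channels coming from L1."""
    channels = []
    for value in latent_sample:
        normalised = (value + 1.0) / 2.0                             # map latent value to 0..1
        combined = min(1.0, normalised * (0.5 + 0.5 * audio_level))  # audio modulates intensity
        channels.append(int(combined * 255))                         # DMX channels take 0..255
    return channels

print(to_dmx([0.2, -0.4, 0.1, 0.9, -0.3, 0.05, 0.7], audio_level=0.6))
```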
Basic example patches
You can download some example patches/files from this link.
In the folder:
- Chataigne project
- Touchdesigner project for video and light
- Pure Data patch for audio