I made a game. It’s all in ASCII. I wondered if it would be possible to turn it into full motion graphics. In real time. With AI. Let me share how I did it.
Let’s start with the game. Lately, I’ve been exploring just how far I can push old-school ASCII RPG style game frameworks. My latest one is called “Thunder Lizard,” which procedurally generates a prehistoric island populated with dinosaurs fighting for dominance as an active volcano threatens the whole island. You can go play it if you’d like.
To render it in AI, the basic plan was to grab a frame from the game, run it through an image generation model, and replace the displayed frame with the resulting image, for every frame. This presented a number of challenges and requirements that led me on a deep dive into the wide range of cutting-edge image generation models available today. But first, let me show how it turned out.
The need for speed
The main constraint for real-time AI rendering is latency. Most games run at 30 frames per second (FPS) or higher, which gives you only about 33 milliseconds per frame to do the following:
Connect (and authenticate) with an inference provider
Transmit the prompt (including source image data)
Wait for generation to complete
Receive the new image data and display it
Considering that the normal latency just to load an image can be a couple hundred milliseconds, this constraint seems impossible. However, fal.ai specializes in offering “lightning-fast inference capabilities” for generative media, including a few Latent Consistency Models (LCM) that approach 100ms generation times. To further minimize latency, fal.ai also offers a WebSocket connection, which removes the connect and authenticate steps from subsequent requests. Finally, they offer the option to stream images back as Base64-encoded data for immediate, direct access.
By taking advantage of all of these optimizations, and by using 512×512px images, I was able to run at 10 FPS with around one second of latency. This is in fact the same frame rate the original game targets, because it felt best with the ASCII visuals. I could likely go faster, but doing so would risk rendering images out of order if newer requests finished before older ones. The latency was the larger issue: it shows up as a noticeable “lag” between the source game (and player inputs) and the AI-rendered frames, which you can see in the video above.
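To make that setup concrete, here is a minimal sketch of how the connection might be opened with fal.ai's JavaScript client and its realtime (WebSocket) API. The `FAL_KEY` placeholder, the `drawGeneratedFrame` helper (shown later), and the exact result shape are assumptions for illustration, not code lifted from the game.

```js
// Minimal sketch of the connection setup, assuming the fal.ai JavaScript
// client and its realtime (WebSocket) API. FAL_KEY and drawGeneratedFrame
// are placeholders, and the exact result shape may differ by endpoint.
import { fal } from "@fal-ai/client";

fal.config({ credentials: FAL_KEY }); // or route requests through a server-side proxy

const connection = fal.realtime.connect(
  "fal-ai/fast-lcm-diffusion/image-to-image",
  {
    onResult: (result) => {
      // With sync_mode enabled, the image comes back as a Base64 data URL,
      // so it can be drawn to a canvas without another network round trip.
      drawGeneratedFrame(result.images[0].url);
    },
    onError: (err) => console.error("generation failed", err),
  }
);
```

Keeping this single connection open for the life of the game session is what removes the connect and authenticate overhead from every frame after the first.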
Choosing a model
Fal.ai currently offers over 300 image-generation-related models! I found this quite overwhelming when I started. I spent quite a bit of time learning about the different model providers and their specialties and capabilities. I generated hundreds of images in their online playground to understand how each one worked and which would suit my needs.
I narrowed down the list based on these hard requirements:
Very fast generation times
Source image adherence
Decent look and feel
Only a handful of models were fast enough, so that part was easy. The challenge was getting good enough output. Fast models are by nature distilled or compressed in some way, making them less powerful than larger, slower models. Not only does this affect image quality, but it also limits generation options, like using LoRA or ControlNet.
One of the big challenges was finding the appropriate way to make sure the output image “lined up” with the actual game frame to preserve the solidity of the game world. I explored a number of approaches, including image-to-image (i2i) models, ControlNet, and image editing models.
Layout consistency via ControlNet
I originally thought that ControlNet would be the way to go. ControlNet is a way to provide extra “guidance” to a model to control the layout of the generated image. It comes in many flavors, such as poses, outlines and depth. I was particularly focused on “segmentation,” which is a way to visually indicate “zones” for content to align with. Since I controlled the rendering engine of my game, I was able to output a version of the frame that seemed like a good segmentation ControlNet reference.
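For illustration, here is a hypothetical sketch of what that reference render might look like: each tile type in the game's map becomes a flat block of color that a segmentation ControlNet could, in theory, latch onto. The tile names, colors, and helper names are placeholders, not the game's actual code.

```js
// Hypothetical sketch of producing a segmentation-style reference frame
// from the game's tile map: each tile type becomes a flat color block.
// The tile names and colors are illustrative, not the game's actual values.
const ZONE_COLORS = {
  water: "#2244ff", // blue zones for water
  lava: "#ff2200",  // red zones for lava
  grass: "#22aa22", // green zones for grass and trees
};

// tiles is a 2D array of tile-type strings; ctx is a 2D canvas context.
function renderSegmentationFrame(tiles, ctx, tileSize) {
  for (let y = 0; y < tiles.length; y++) {
    for (let x = 0; x < tiles[y].length; x++) {
      ctx.fillStyle = ZONE_COLORS[tiles[y][x]] ?? "#000000";
      ctx.fillRect(x * tileSize, y * tileSize, tileSize, tileSize);
    }
  }
}
```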
I had hoped to be able to use a text-to-image model with a prompt describing a generic top-down prehistoric terrain, and let the ControlNet guide the zones, like blue for water and red for lava. Not all models offered ControlNet, but a few did.
I was surprised to find out that this approach didn’t work very well at all. I think this was partially because I didn’t figure out how to correctly label the segmentation zones, and partially because the blocky terrain layout doesn’t carry enough meaningful visual semantics for the model. I experimented with other ControlNet types, like outlines, and had slightly better results, but ultimately gave up on this approach.
Layout consistency via image-to-image models
The other option was to use the game image as a source for a model that could do image-to-image transformations. Some models like FLUX Kontext are specialized for image editing, while others use the provided image as a prompt. I tried both approaches, with prompts like “turn this blocky image into realistic aerial terrain” or “aerial terrain of a prehistoric island.”
The resulting images followed the layout of the game frame much better with this approach. However, it became harder to control the look and feel of the generated image. Too much image “strength” and the output would be blocky just like the input. Too little and it would look better, but not line up. The included text prompt had surprisingly little effect. Some models offered additional options such as IP adapters, style transfers, and LoRAs, which I experimented with as well.
The right look
I tried endless combinations of settings and techniques to get a look I liked. Here are just a few examples.
Ultimately, I settled for the best result I could get from the models that were fast enough for my needs. I used the fast-lcm-diffusion/image-to-image endpoint with the Stable Diffusion 1.5 model and the prompt “top down, rpg terrain map for a game set in prehistoric times. red is lava, blue is water, green is grass / trees”. I learned that the blocky version of the game frame that I intended to use as a segmentation map produced a better i2i result than the ASCII equivalent.
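Wrapped up as a function, a single frame request with those settings looks roughly like this. It sends over the WebSocket connection from the earlier sketch; the parameter names follow fal.ai's image-to-image conventions, but the numeric values are illustrative rather than my exact tuned settings.

```js
// Rough shape of a single frame request with the settings described above,
// sent over the connection from the earlier sketch. Parameter names follow
// fal.ai's image-to-image conventions; the numeric values are illustrative.
function requestFrame(blockFrameDataUrl) {
  connection.send({
    prompt:
      "top down, rpg terrain map for a game set in prehistoric times. " +
      "red is lava, blue is water, green is grass / trees",
    image_url: blockFrameDataUrl, // the 512×512 blocky render, not the ASCII one
    strength: 0.65,               // enough to restyle without losing the layout
    num_inference_steps: 4,       // LCM models need only a few steps
    seed: 42,                     // a fixed seed helps frame-to-frame stability
    sync_mode: true,              // stream the result back as Base64
  });
}
```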
I wasn’t very satisfied with the look, but it was the best I could do. I also found it disappointing that the models were never able to render the dinosaurs as sprites within the terrain, despite my best prompting efforts.
Before moving on, I did spend some extra time exploring ways to get a better look with a better model, even though I knew it wouldn't work for real-time rendering, just to see how far I could get.
LoRA
To get a better look, I tried a few “style transfer” models, but ultimately made my own “style” by training custom LoRA weights. LoRA is a technique to fine-tune a model in a lightweight and portable way. Here was my process:
I used the general FLUX model with the prompt “Aerial view of terrain. Realistic that has water, lava, trees, and grass” to find a style I liked.
I used the same model with variations of the prompt “top down rpg game map of prehistoric landscape with water and dense trees and a medium dinosaur and some lava on the edge,” plus the original image as a “reference image” parameter to take advantage of FLUX’s built-in generic style reference, to generate about ten sample images.
I used the FLUX LoRA Fast Training model with my sample images to train directly on fal.ai, which only took a couple of minutes and cost less than a dollar.
I used the resulting LoRA weights with FLUX LoRA Image to Image, the prompt “tl_rpg terrain - blue is water, green is grass and trees, red is lava, crosses are dinosaurs” (where “tl_rpg” is the trigger word from my training), and the image frame from my game to get style-consistent, source-adhering, high-quality generated images (see the sketch after this list).
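As a rough sketch, that last step could be requested as a one-off like this, since this path was far too slow for real time anyway. The endpoint id and the `loras` parameter shape follow fal.ai's FLUX LoRA endpoints as I understand them; the weights URL and `asciiFrameDataUrl` are placeholders.

```js
// Rough sketch of one LoRA-styled frame, requested as a one-off since this
// path was far too slow for real time. The endpoint id and the `loras`
// parameter shape follow fal.ai's FLUX LoRA endpoints as I understand them;
// the weights URL is a placeholder for the file the training job produced.
const result = await fal.subscribe("fal-ai/flux-lora/image-to-image", {
  input: {
    prompt:
      "tl_rpg terrain - blue is water, green is grass and trees, " +
      "red is lava, crosses are dinosaurs",
    image_url: asciiFrameDataUrl, // the ASCII frame worked best at this quality level
    loras: [{ path: "https://example.com/tl_rpg_lora.safetensors", scale: 1 }],
    strength: 0.8,
  },
});
```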
Here are the results. Interestingly, the ASCII frames worked better than the segmentation versions, the opposite of what I found with the smaller models. These still are not as nice as generated images that aren’t bound to the source game frames, but I was unable to get any closer than this.
For what it is worth, I did try using this version for game rendering, though at a four-second latency it was unplayable, which is a shame because it looks much better.
Putting it all together
With everything in place, it was time to integrate image generation with the game. I used the JavaScript SDK that fal.ai provides to connect via WebSockets. On every render call in my game loop, I actually drew to two separate canvases: one for the original output and a second, hidden one that rendered the blocky version at 512×512 pixels. I then captured the image data from the second canvas as a Base64 data URL and sent it, along with the rest of the prompt and settings, as a message over the WebSocket connection. When the async response arrived, I rendered the resulting image data into a third canvas reserved for that purpose. You can see the actual code if interested, and the image below shows the three canvases side by side to demonstrate the three different renderings.
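Here is a sketch of that per-frame hook, building on the earlier snippets. The `renderAscii` helper, `frame.tiles`, and the three canvas variables stand in for the game's actual rendering code, so treat this as the shape of the integration rather than the code linked above.

```js
// Sketch of the per-frame hook, building on the earlier snippets.
// renderAscii, frame.tiles, and the three canvas variables stand in for
// the game's actual rendering code.
function onGameRender(frame) {
  // 1. Normal ASCII render to the visible game canvas (unchanged).
  renderAscii(frame, gameCanvas.getContext("2d"));

  // 2. Hidden 512×512 blocky render used as the generation source.
  renderSegmentationFrame(
    frame.tiles,
    blockCanvas.getContext("2d"),
    512 / frame.tiles.length
  );

  // 3. Capture it as a Base64 data URL and send it over the WebSocket
  //    using the request payload shown earlier.
  requestFrame(blockCanvas.toDataURL("image/png"));
}

// 4. When a generated frame arrives asynchronously, draw it to the third canvas.
function drawGeneratedFrame(dataUrl) {
  const img = new Image();
  img.onload = () => aiCanvas.getContext("2d").drawImage(img, 0, 0, 512, 512);
  img.src = dataUrl;
}
```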
I would call this experiment a success. It only costs a few cents per minute with the fast model at the low frame size and frame rate. I accidentally ran the LoRA model above at 10 FPS and blew through $10 in a few seconds, so model selection matters!
One remaining issue is frame-to-frame consistency. While the models are actually quite stable, especially with a set seed and a high image-strength setting, you can still see “jumpy” differences where trees turn into rocks or grass with each new frame, which is disorienting. I thought of a few techniques that could help with this. One is to use LoRA, though even if the image style is consistent, the exact placement of terrain features still jumps from frame to frame. A more advanced technique that could work in this use case is “outpainting”: provide the previously generated frame shifted by the movement delta, the current source frame, and a mask over the new “empty” pixels. This might make for a really impressive and smooth visual experience, though it currently wouldn’t be possible at the required latency with the models available today.
All in all, it is amazing to see AI image generation happen in real time. I am excited to see where this will go in the future. I also really like the idea of using AI to enhance or alter an existing source that controls the consistency of the underlying world. This project was a fun exercise to push the envelope, but I could imagine this being a powerful technique to quickly and easily experiment with multiple different styles from a single low-res prototype. Thanks for following along!