In the Summer of 2022, I used Midjourney to generate my first artificial image. It was a clockwork hand. I did, without irony, choose the very thing that Generative AI was terrible at. It returned some images that vaguely resembled hands. These weren't just any hands, of course. They were abominations in the eyes of God. It was accidental Salvador Dalí, a warped thing with wobbly appendages. Vague suggestions of numberless clock faces emerged from this “art.” Despite the crude response, I was fascinated and immediately set about spending way too much money on hundreds of pictures that were completely unusable for any purpose other than to say, “Hey, isn't this weird and cool?” Take a look at the monstrosity in Figure 1.

This was just with Midjourney 2.0. Other iterations followed, and they came quickly. Through the updates, the hand started to take shape. It took a few years to get the appropriate number of fingers worked out, but all along, I watched the technology change in real time. With every upgrade, my clockwork hand looked more and more like the one I'd imagined. I'm a little slow on the uptake sometimes, but around Midjourney 3.0, the thing that had been nagging at me since the beginning finally revealed itself. If you could make an image of something, you could get Generative AI to make lots of those images in a row, and if you wanted to make lots of images in a row, you could pack them in tightly—24 images every second, maybe.
The Myth of Filmmaking
When I was growing up, filmmaking was some mystical art. It was an occulted thing, a special club for people with exciting lives, like astronauts and karate champions. Only rich people and wizards got to make movies. Then film gave way to digital. We started to have affordable, non-linear editing systems in our homes and the “prosumer” camera market exploded. In 2001, Danny Boyle directed the zombie flick “28 Days Later” on one such camera, the Canon XL1. I bought my own right after that for around $1500. As you can see in Figure 2, it was portable enough to carry in one hand and it used tape storage instead of film.

Before long, tapes were swapped out for digital storage. The plunging cost of compute put complex FX systems into the hands of hobbyists. Blender, an open-source 3D creation suite you can see in Figure 3, supports everything from modeling and animation to compositing and video editing. And it's free. The idea of filmmaking as an exclusive club started to erode. In 2015, filmmaker Sean Baker shot the award-winning “Tangerine” entirely on an iPhone 5S. Just a few years later, Steven Soderbergh used an iPhone 7 Plus to make the thriller “Unsane.” Movies are shot on phones all the time now, with built-in cameras that outperform the ones used on many older Star Wars movies. Filmmaking has been democratized, in many ways. Aspiring filmmakers have near limitless tools available.

A New Toy in the Toybox
In February of 2024, the public's access to AI art tools was expanding like a fractal. Text-to-image design was everywhere, with countless competitors entering the market. After much speculation, OpenAI, the creators of ChatGPT, announced Sora, a text-to-video model. Early clips were astonishing. It represented a massive leap, letting users craft realistic videos using a text prompt. Nerds like me checked the feeds every day, waiting for some sort of sign that we'd soon be able to play with it. And we kept waiting.
During that wait, the other tools didn't stop coming. Google teased Lumiere and all sorts of other research. We got our hands on novel ways to provide simple and crude animations to still images. Runway ML and Kling both made big splashes in the market, and the anticipation of OpenAI's little miracle began to dwindle. Skeptics called it vaporware, just some Silicon Valley sideshow to bilk investors out of billions. Less than a year after it was announced, after much refining and safety testing, Sora was released to the public in December of 2024. Right now, you can go to https://sora.com to sign up and try it out. With a ChatGPT Plus subscription at $20 per month, you can get just enough of a taste to accomplish absolutely nothing. More on that later.
How Does It Work?
To generate small clips of video, Sora takes input from text, images, or even other videos. It builds on OpenAI's DALL-E image-generation models and GPT language technologies, using diffusion models to create its visuals. Diffusion models generate data such as images, audio, or text through a two-step process. In the forward process, the model progressively adds noise to the image. It does this over and over until the data becomes pure noise. This intentional muddying of the information teaches the model how a clean piece of data transitions into noisy data. Then comes the reverse process, where the model learns, step by step, to strip the noise from the noisy data. It does this in iterations until it rebuilds the original data or, in this case, generates something entirely new from the noise. Through extensive and repetitive training, the model learns the underlying distribution of the data. That's what lets it generate new images from scratch by reversing the noise process.
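To make the two-step idea concrete, here's a toy sketch of a diffusion loop in Python. It assumes a standard linear noise schedule, and the stand-in noise predictor is mine; none of this is OpenAI's actual code, just the shape of the process described above.

```python
# Toy sketch of the diffusion idea described above (not Sora's actual code).
# Forward process: keep mixing a clean sample with Gaussian noise until it is
# essentially pure noise. Reverse process: step backward, removing the noise
# a model predicts at each step. The "noise predictor" here is a stand-in.
import numpy as np

T = 1000                                  # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)            # cumulative product used in closed form

def forward_noise(x0, t, rng):
    """Jump straight to step t of the forward (noising) process."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps                        # a real model trains to predict eps from xt

def reverse_step(xt, t, predicted_eps, rng):
    """One denoising step: remove the predicted noise, then add a little fresh noise."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
    x_prev = (xt - coef * predicted_eps) / np.sqrt(alphas[t])
    if t > 0:
        x_prev += np.sqrt(betas[t]) * rng.standard_normal(xt.shape)
    return x_prev

rng = np.random.default_rng(0)
x0 = rng.random((8, 8))                   # pretend this is a tiny image
xt, eps = forward_noise(x0, T - 1, rng)   # by the last step, xt is mostly noise

# In a real system a trained network predicts eps; here we cheat and reuse it.
x_prev = reverse_step(xt, T - 1, eps, rng)
print(xt.std(), x_prev.std())
```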
Hands On
The first thing I notice when I visit https://sora.com is its simplicity. As you see in Figure 4, the presentation is elegant, like an Apple product. That's both good and bad, for reasons that will become obvious after a bit of exploration. Once I create an account, I can choose between the two main tiers. The $20/month tier that I already have with my ChatGPT Plus subscription gives me access to up to 50 “priority” videos per month. That equates to 1,000 credits to spend on five-second videos at 720p resolution.

Once I jump into the tool itself, I find something shockingly simple. After playing with countless other tools in the race to capture the market, I'm struck by how spare Sora is. It's immediately clear that this isn't the full suite of tools you get from something like Runway ML or even Canva. It's crafted with ease of use in mind, to lure in as many subscribers as possible. You'll find a deeper set of options in TikTok or Instagram, as you'll see in Figure 5.

The primary interface at the bottom of the screen puts all the focus on the prompt. Considering that ChatGPT is one of the juggernauts of the Large Language Model sprint, this makes sense. If I want to supplement the process with my own image, I can do that by selecting the plus sign. Along with the aspect ratio, there are limited options to adjust the resolution, length, and number of variations produced by the prompt. The presets setting is where I suspect the tool will really shine. That's where the customization lives. Up front, Sora offers several presets labeled things like Stop Motion or Film Noir. Digging a little deeper reveals that these are built simply by appending a pre-baked description to the prompt.
Balloon World, for instance, adds the following to whatever you enter in the prompt:
- Theme: Everything is made of inflated balloons
- Color: Glossy, bright colors—reds, yellows, and blues
- Film Stock: Clean digital with exaggerated reflections on shiny surfaces
- Lighting: High-key lighting with glossy highlights to mimic rubbery textures
- Content Transformation: All characters, objects, and environments are made of inflated balloons, with visible seams and a bouncy quality
- Vibe: Fun
Customizing my own preset through trial-and-error is simple enough. Once I refine a style I like, I can add my preferred specifications to the list of custom presets.
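From what the interface reveals, a preset looks like little more than string concatenation. Here's a minimal sketch of that idea, reusing the Balloon World descriptors above; the names and the function are mine, not part of Sora's product.

```python
# A hypothetical illustration of how a preset seems to work: the preset is just
# a canned block of descriptors appended to whatever you type. The names here
# (BALLOON_WORLD, build_prompt) are mine, not part of Sora's interface.
BALLOON_WORLD = (
    "Theme: Everything is made of inflated balloons. "
    "Color: Glossy, bright colors: reds, yellows, and blues. "
    "Film Stock: Clean digital with exaggerated reflections on shiny surfaces. "
    "Lighting: High-key lighting with glossy highlights to mimic rubbery textures. "
    "Content Transformation: All characters, objects, and environments are made "
    "of inflated balloons, with visible seams and a bouncy quality. "
    "Vibe: Fun."
)

def build_prompt(user_prompt: str, preset: str) -> str:
    """Append the preset's pre-baked description to the user's prompt."""
    return f"{user_prompt.strip()} {preset}"

print(build_prompt("A clockwork hand flexing its fingers", BALLOON_WORLD))
```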
I choose a 16:9 aspect ratio, like any decent human, but the maximum resolution I'm offered is a pitiable and primitive 720p. That's a shameful level of fidelity, like trying to watch the Phantom Menace trailer on RealPlayer using a dial-up connection. The length available with my subscription level is five seconds, and I also need to select how many variations are generated by one prompt. Sora will generate up to four options at once. I stick with two videos per prompt. Four will certainly burn up my credits faster.
On the far right end of the task bar is the Storyboard tool. This feature allows users to break video projects into a timeline. Rather than use the first prompt for the entire five seconds, Storyboard enables users to add up to two prompts for each second for more precise control. Rather than getting that granular on the first attempt, I'd like to see what it can do with one well-crafted prompt.
Crafting an effective prompt involves clear and detailed descriptions to guide the AI. Start by knowing what your goal is. If you want to generate a new video from scratch, say so. If you want to create an animation or to edit existing footage, be precise in your instructions so that Sora understands the task. Once that's established, add the specifics. Include any context and all the major details like tone, emotion, characters, and what drives the scene. It's entered in the basic prompt window, as seen in Figure 6.

After much consideration, I craft the following:
A close-up of a man examining his clockwork mechanical hand. The hand is built with finely detailed, spinning cogs, gears, and delicate wheels that glint under the dim light. The man appraises the hand as he looks down at it. He flexes his fingers, making a fist and releasing it slowly. The mechanisms within the hand respond and begin to turn and click rhythmically. The scene is dimly lit with a warm, industrial ambiance, featuring subtle hints of a clockpunk workshop in the background.
The First Attempt
Within seconds, Sora gives me two options, as seen in Figure 7. They're both highly detailed, but off in that specific way only Generative AI can misfire. Both have that slow, floating camera that offers the illusion of movement. Once again, the hand is something evocative of a hand, rather than an accurate representation. The fingers don't flex. The cogs and wheels barely spin, and the wielder of the groovy new appendage doesn't seem particularly interested in the steampunk miracle at the end of his arm. In fact, if I look closer, I see that he's holding the hand. It's not even a part of him. What are we even doing if we can't create clockwork cyborgs?

With image prompting, the first pass is usually a question of how close it is to what I want. Is this something I want to try to refine? Or should I start over? Because I'm not that far into the process, I'll refine the prompt and try again. Can I craft the perfect query, coaxing the model to do exactly what I'm envisioning? Yes, but it will take some effort, time, and money. That's the nature of prompt engineering, refining to find just the right words that allow you to communicate your message perfectly to the Large Language Model. That amount of trial and error can get expensive, but as a brilliant scientist once said, “We do what we must because we can.”
A close-up of a man with a clockwork mechanical hand. The hand flexes, fingers opening and closing. The hand is built with finely detailed, spinning cogs, gears, and delicate wheels that glint under the dim light. The man appraises the hand as he looks down at it. The mechanisms within the hand begin to turn and click rhythmically. The scene is dimly lit with a warm, industrial ambiance, featuring subtle hints of a clockpunk workshop in the background.
The core of this prompt is still solid, but to get the fingers to flex, I list that first, hoping that Sora will give that action a higher priority. Little else in the prompt is altered. As shown in Figure 8, the result is similar. It's closer to what I want in some ways, but farther off in others. The fingers still don't really close. I want a mechanical flex, like Schwarzenegger repairing his hand in The Terminator.

With a third and final effort to generate my base video, I go with the following refinement on the prompt.
A clockwork hand, opening and closing its mechanical fingers. The hand is made from finely detailed, spinning cogs, gears, and tiny wheels. A man looks down at the hand as the fingers open and close. The scene is dimly lit with a warm, industrial ambiance, featuring subtle hints of a clockpunk workshop in the background.
Here's where things go horribly awry.

The results of the third attempt are awful, as seen in Figure 9. The two variations are much farther from the intent than I predicted. It's time to see if I can give the prompt a bit of outside guidance. Rather than relying fully on the power of Sora, I take my original prompt back over to Midjourney.
A clockwork hand is the only prompt I offer, the same as my original image. This time, the response represents a hand! It's identifiable as something humans would call a hand. It's also clockwork, as seen in Figure 10.

This is closer to the style I had in mind and it's clearly at the end of someone's arm. I upload this one to Sora to help influence the results. Sora provides a warning that my account doesn't currently support uploading media that contains images of people, as seen in Figure 11. These guardrails are likely in place to prevent nefarious uses of the technology, like making videos of Gwyneth Paltrow crouched, eating a dead deer on the side of the road. Who would do such a thing?

I pair the image with my second prompt, because that one got me closest to the goal.
Once again:
A close-up of a man with a clockwork mechanical hand. His hand flexes, fingers opening and closing. The hand is built with finely detailed, spinning cogs, gears, and delicate wheels that glint under the dim light. The man appraises the hand as he looks down at it. The mechanisms within the hand begin to turn and click rhythmically. The scene is dimly lit with a warm, industrial ambiance, featuring subtle hints of a clockpunk workshop in the background.
The results shown in Figure 12 are a marked improvement, but the telltale signs are still there. The man doesn't look at his hand. The hand doesn't really do much. The wheels don't spin, and the digits don't flex. It's just a few drowsy movements that fit the spirit of the prompt, but not the letter. This is my “fine, it's close enough” moment.

Basic Editing Tools
It's all too easy to get caught in the trap of generating and regenerating. At the end of the day, I'll end up with 100 five-second reels that are all so similar I'll get lost in the details. Because option two in Figure 12 is the closest, I want to see what some of the editing tools can do for me. Once I select it, Sora provides a new and simple toolbar, as seen in Figure 13.

Again, OpenAI's efforts to appeal to the mass market come into clearer focus. This menu would fit in perfectly with a TikTok or Instagram toolset. Editing the prompt simply generates a new video, as I've been doing, but the View story option provides a different approach, as seen in Figure 14.

I'm presented with the storyboard, a single editing track where I can get a bit more granular with the instructions for the five-second clip. The first image is the clockwork hand from Midjourney. This acts as the launchpad, but from there, I'm surprised. The next panel, which shapes the video beginning at the two second mark, is a prompt I didn't write.
The camera slowly pans across an intricately designed steampunk mechanical hand. The hand, crafted from brass, exhibits an array of visible gears, pistons, and cogs, meticulously rotating and moving within its open palm. The metallic fingers are articulated with jointed segments, showcasing fine engineering. A cuff-like structure at the wrist blends into the shadowy, industrial background, enhancing the hand's function and aesthetic as a plausible, mechanical creation. The scene evokes a sense of curiosity and wonder, highlighting the intricate craftsmanship of the steampunk universe, with its blend of Victorian-era design and advanced technology.
This is the prompt Sora generated based on the picture I provided. “The camera slowly pans…” seems to be Sora's whole schtick. The third panel returns to the prompt I initially uploaded with the picture. Looking at the process like this gives me a better idea of how it works. With that in mind, I make a slight adjustment to my prompt.
A man flexes the fingers of his clockwork hand opening and closing into a fist. The hand is built with finely detailed, spinning cogs, gears, and delicate wheels that glint under the dim light. The man appraises the hand as he looks down at it. The mechanisms within the hand begin to turn and click rhythmically. The scene is dimly lit with a warm, industrial ambiance, featuring subtle hints of a clockpunk workshop in the background.
It's easy enough to manipulate the timeline seen in Figure 15. At each half-second mark, I can provide more information, like a prompt for that specific slice of the clip, or even an image or video. In this case, I simply move the third panel, the flexing of my clockwork hand, back one second to give it more time to animate.
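For the curious, the Storyboard behaves like an ordered list of panels pinned to timestamps. The little sketch below mimics the edit I just made, nudging the third panel back one second; the data structure and the timestamps are my guesses at the concept, not Sora's internals.

```python
# A guess at what a Storyboard timeline amounts to conceptually: panels pinned
# to timestamps inside a five-second clip. Not Sora's actual data model, and
# the start times below are assumed for illustration.
from dataclasses import dataclass

@dataclass
class Panel:
    start: float          # seconds into the clip where this panel takes over
    content: str          # a prompt, or a reference to an image/video

storyboard = [
    Panel(0.0, "image: clockwork hand from Midjourney"),
    Panel(2.0, "prompt: the camera slowly pans across the mechanical hand"),
    Panel(4.0, "prompt: the fingers flex, opening and closing into a fist"),
]

# Move the third panel back one second so the flex has more time to play out.
storyboard[2].start -= 1.0

for panel in sorted(storyboard, key=lambda p: p.start):
    print(f"{panel.start:>4.1f}s  {panel.content}")
```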

Roughly ninety seconds later, Sora generates the two responses, as seen in Figure 16. The scene is improving by degrees. The two results are a bit more cinematic. The clockwork hand now seems to belong to the man in frame. However, Sora is still ignoring my efforts to get the hand to close. Most of its efforts focus on dramatic camera movement.

I go with the second option. The inner workings of the hand aren't clicking and whirring at all in the video, but the motion of this one seems more natural than the other. The Re-cut option allows me to trim a little fat from the beginning of my five-second clip. With that edit, I select the Remix option and am given another prompt input box. The placeholder text suggests that I “Describe what to add, remove, or replace…” After writing paragraphs of the same thing over and over, I'm going to try the more direct approach.
The fingers of the clockwork hand close and open as the man looks at them.
When using Remix, I can adjust its creativity levels from Subtle to Strong. The Mild setting promises noticeable changes to the initial video, which is what I'm gunning for. After the briefest wait, I get disappointing results. The video moves the same. The man holds up his clockwork hand in both videos. The only thing this Mild Remix gives me is slightly different styles of the hand. Again and again, all of them end on some slight variation of what you see in Figure 17.

Balloons and Other Silliness
The Blend feature allows me to upload another video to append to my first. There's more to this than it first seems. To test it, I'll need a second video. I stick with the steampunk/clockwork theme and use the following prompt:
The camera pulls back from above a steampunk work bench in a Victorian lab. A man in Victorian-style vest and shirt is at the bench. The camera rises above him to reveal the large laboratory full of steampunk cogs, wheels, gears, and various mechanical experiments. Lightning flashes, casting flickering light through the skylight, making his work appear sinister and villainous. The background is dark, with industrial elements like pipes and exposed bulbs, enhancing the steampunk ambiance.
The results are admittedly impressive. This time, Sora generates a dynamic scene that cuts between a steampunk scientist working at his bench and a wide shot of his cathedral-like laboratory. The scientist oddly looks almost exactly like he did in the previous videos. Those first attempts with the clockwork hand should have no bearing or influence on this second video. It was a fresh prompt. I suspect Sora has a default idea of a steampunk scientist, so that's what it's giving me.
I select the first of the two variations and choose my favorite output of the clockwork hand as my second video. As you can see in Figure 18, Sora pairs the two videos and provides a simple transition curve to determine where one video will begin to shift into the next.

The curve itself is an easy-to-manipulate slider, allowing me to control when video one shifts into video two. Once I let go of my dream of a clockwork fist, the results are remarkable. The new four-second clip begins with the scientist working in his lab. There's a quick cut to show a sweeping shot of the glass ceiling, more fidgeting with cogs and tools, and then the shot of him appraising his clockwork hand. It works far better than I anticipated, creating a micro-story in a tight, streamlined edit.
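Conceptually, that transition curve is a per-frame weighting between the two clips: early frames lean on clip one, later frames on clip two, and the slider decides where the handoff happens. Here's a rough illustration of the idea with made-up frame data; it's a guess at the concept, not how Sora blends anything internally.

```python
# Conceptual crossfade: weight each frame between clip A and clip B, with a
# slider controlling where the handoff happens. Frame data here is random
# stand-in pixels; this illustrates the idea, not Sora's blending method.
import numpy as np

frames, height, width = 120, 72, 128          # a few seconds of tiny frames
rng = np.random.default_rng(1)
clip_a = rng.random((frames, height, width))  # stand-in for the lab scene
clip_b = rng.random((frames, height, width))  # stand-in for the clockwork hand

def transition_weights(n_frames: int, crossover: float, steepness: float = 10.0):
    """Smooth 0-to-1 curve; `crossover` (0..1) is the slider position."""
    t = np.linspace(0.0, 1.0, n_frames)
    return 1.0 / (1.0 + np.exp(-steepness * (t - crossover)))

w = transition_weights(frames, crossover=0.6)          # favor clip A a bit longer
blended = (1.0 - w)[:, None, None] * clip_a + w[:, None, None] * clip_b

print(blended.shape, round(w[0], 3), round(w[-1], 3))  # weights run ~0 -> ~1
```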
Here's where Blend gets interesting. The above output is what Sora calls a Transition blend. Sora also offers a Sample blend, to influence a primary clip with another, and a Mix blend, to merge the two clips together. These settings also allow you to adjust the sliders to fine tune the output. Sample doesn't return a smooth edit like the last time. There are no cuts. The scene is still intact in spirit and style, but the results start to get weird, straying into the frequent noise of AI video generation. In Figure 19, you can see how Sora tries to reimagine some of the elements, with questionable success.

Mix blend is similar to Sample, but with the influence sliders skewed to give a different type of reimagining. It provides more of the same, an interesting montage of rapid cuts of my steampunk scientist working in his lab with his clockwork hand. After a few attempts, I'm still having trouble figuring out the utility of Mix versus Sample.
The Fine Print
With that effort, I'm out of the credits provided by my ChatGPT Plus subscription. For more generations and longer videos, I can upgrade to the Pro plan for $200/month. Even with more time and iterations, I'm not confident I can get Sora to accomplish the initial goal of the clenched fist. My stubbornness has its limits, it seems. It doesn't help that I can watch my credits ticking away with each botched attempt.
There could be any number of reasons why it didn't work, but there's the rub with prompt engineering. It often seems like trying to figure out just the right way to phrase your wish to the genie. One false syllable, and you'll suddenly find yourself with a horrible curse.
Pricing Guide
The one thousand credits provided to a ChatGPT Plus subscriber go quickly. With the trial-and-error nature of text-to-video, this clearly establishes the Plus tier as the public demo for the technology. Sora's billing credits FAQ breaks down what each action costs. It's just byzantine enough to make it difficult for penny-pinchers to watch every expenditure. Video generation can cost anywhere from 20 to 2,000 credits, as you can see in Figure 20. If you're not paying attention, you'll be asked to upgrade to the $200/month tier in no time. The Pro tier offers up to 1080p resolution, twenty-second videos, and unlimited relaxed generations, an option unavailable to the Plus tier.
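The arithmetic is worth spelling out. At the 20-credit baseline implied by 1,000 credits for 50 five-second 720p videos, the budget sounds generous; at the top of the published 20-to-2,000 range, it evaporates in a single generation. The quick calculation below uses that baseline plus two illustrative costs of my own; they aren't Sora's published prices.

```python
# Rough credit budgeting for the Plus tier. The 20-credit baseline follows from
# the article's 1,000 credits / 50 videos; the other costs are illustrative
# placeholders inside the 20-2,000 range, not Sora's actual price list.
MONTHLY_CREDITS = 1_000

assumed_costs = {
    "5s, 720p, one variation (baseline)": 20,
    "hypothetical mid-range generation": 200,
    "hypothetical maxed-out generation": 2_000,
}

for label, cost in assumed_costs.items():
    print(f"{label}: {MONTHLY_CREDITS // cost} generations per month")
```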

Assessment
In no way is this a professional tool. Sora isn't going to take anyone's job just yet. My time would have been better spent just learning a 3D modeling program like Blender. With little consistency, strange aberrations, and very limited tools, Sora is a novelty.
That said, it would be naïve not to see it as a powerful proof of concept. The clockwork hand I generated a few years ago is now moving. I saw the cogs spin. I saw the scientist examine his work. In the days since I started writing this article, more video tools have been released and some of them are said to easily outperform Sora. That's what the race looks like now.
Over the next twelve months, we'll see this technology explode. This is as bad as it will ever be. Five-second clips will become five-minute short films. Those short films will grow into prompt-driven movies. Not long after that, we'll be able to tell our favorite movie-generating service that we're looking for a murder mystery set on Venus, starring Jean-Claude Van Damme and Betty White, directed by David Fincher, and it's an unofficial sequel to “Madam Web.” It will be packaged for streamers like Netflix, who will be able to look at your preferences and manifest the perfect film. Ready or not, we'll all have movies made by clockwork hands.