Designing for Apple Vision Pro: Lessons Learned from Puzzling Places

Shahriar Shahrabi
Realities.io
Feb 5, 2024


The Apple Vision Pro presents new design challenges to consider. Here are some of the lessons learned from redesigning Puzzling Places from the ground up for the Apple Vision Pro.

The Vision Pro image used is from Apple’s website and press kit; all rights belong to Apple.

I will mostly cover general lessons we learned along the way, applicable to anything you might want to design for the device, accompanied by a few lessons specific to the game loop of Puzzling Places. In addition to the design challenges, I will briefly cover some of the technical aspects as well.

A short disclaimer: Whatever you read here reflects subjective opinions and does not represent Apple’s official stance. For context, here is a short trailer of the game loop.

A New Platform

Apple’s entry into the AR market has been highly anticipated, which is why we decided to port the game to the new headset. Usually, porting between VR headsets means adjusting the tech backend so that it works on the new platform while mostly leaving the design untouched. However, the more we found out about the headset, the more we thought this might really be a new platform in the truest sense. Officially, Apple doesn’t talk about AR or VR but about spatial computing. While some of that is marketing, there are real differences between how this headset is designed compared to something like the Quest 3. So before you even start designing for this platform, it is worth thinking about what that even means.

At the moment, I would say there are three types of apps you can make for the Apple Vision Pro: windowed, bounded, and unbounded. Bounded and unbounded are Unity terminology; technically, this three-way split doesn’t exist on Apple’s side. There, you have windows and volumes, which can spawn in a Shared or Full Space, in Passthrough or fully immersive. An app can combine them all in various ways. For example, you could have a bunch of windows combined with a bunch of volumes in a Full or Shared Space, transition between them depending on what the user needs, and get really creative. So if you read the page I referenced, you will see that the actual split is between how content is represented (in a 2D window or a 3D volume), which space it is in (shared with other apps, or having everything to itself), and whether it is shown in Passthrough or a blended VR environment.
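To make this split concrete, here is a minimal sketch, assuming a native SwiftUI app on visionOS, of how a 2D window, a bounded volume, and an unbounded Full Space can be declared. The identifiers and placeholder views are made up, and Puzzling Places itself was built in Unity rather than this way.

```swift
import SwiftUI

@main
struct ExampleApp: App {
    // Matches the pattern in Apple's docs: the current immersion style
    // is kept in state so it can be switched at runtime.
    @State private var immersionStyle: ImmersionStyle = .mixed

    var body: some Scene {
        // A flat 2D window that lives in the Shared Space.
        WindowGroup(id: "mainWindow") {
            Text("A regular 2D window")
        }

        // A bounded 3D volume that sits next to other apps in the Shared Space.
        WindowGroup(id: "puzzleVolume") {
            Text("Placeholder for 3D content")
        }
        .windowStyle(.volumetric)

        // An unbounded Full Space the app has to itself,
        // blended with Passthrough (.mixed) or fully immersive (.full).
        ImmersiveSpace(id: "fullSpace") {
            Text("Placeholder for fully immersive content")
        }
        .immersionStyle(selection: $immersionStyle, in: .mixed, .full)
    }
}
```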

Practically for us, though, the question was between 2D (windowed), bounded (one volume in a Shared Space), and unbounded (one volume in a Full Space). These were our actual options mostly because of limited time and budget, as well as the technical limitations of using the Unity engine; more on that later.

If you want to port your design from VR to the AVP, the simplest solution is to use a Full Space. Your app has the headset to itself, and you can create an experience in Passthrough or in a fully virtual environment. The only thing you need to do is design for hand and eye tracking instead of controller-driven interaction, and you are good to go.

While the unbounded Full Space has many advantages, it has one main disadvantage: the user cannot open other applications side by side with yours.

One of the most highly requested features for Puzzling Places on the Quest 3 is the ability to listen to music, watch a YouTube video, or listen to an audiobook. These features would be extremely expensive for us to implement, but they just work on the Apple Vision Pro. If your app is bounded and in the Shared Space, the user can do whatever they want while puzzling. The ability to jump into a FaceTime meeting while working through a longer puzzle feels so seamless that it is borderline magical. When Apple says spatial computing, that is probably what they mean. The AVP is not a VR headset, not because of its hardware, but because of its ecosystem. It seems to me that Apple sees this as a personal computer on your face, in which you can do a whole lot.

As I mentioned before, ideally your app supports all the various ways a user could interact with it, be it in a Shared or Full Space. But since we realistically only had time to focus on one of these, we decided to put our chips on Apple’s vision for the headset (no pun intended), design for what makes the headset special, and learn something new about our game loop, instead of copying over something we already knew worked.

Choosing which Space the app lives in, however, was just the beginning. Having already decided to align with Apple’s own vision for the device, we had an easier time designing the control paradigm for the game.

Direct and Indirect Controls

One of the things that has surprised me in the past few years is how accessible VR games can be to non-gamers. One of the reasons for that is probably that the main control paradigm in VR is Direct control. This is a fancy way of saying that you play the game through direct embodiment. This form of control is extremely intuitive, because everyone knows how to use their own body.

The control paradigm of the AVP’s operating system and its Shared Space is almost the exact opposite of that. You look at the thing you want to interact with and then you pinch. You can think of your eyes as the mouse cursor and your pinch as the left click. This is what Apple calls Indirect control. If Direct control is intuitive, Indirect control needs to be learned. Not that it feels unnatural to interface with the AVP, but nothing about pinching to select or the placement of your hands is immediately obvious.

Depending on your game, you can support Direct and Indirect control equally well. But chances are you will have to choose one as your primary mode of interaction and put more of your budget into polishing it. So which one do you choose?

It is true that the AVP’s Indirect control needs to be learned. But that is not necessarily a bad thing. When analogue sticks were first introduced to video games, some journalists claimed they were too convoluted to learn and would never catch on. Yet nowadays, most games are played with controllers. Even a mouse cursor and its interaction with the operating system have a learning curve.

What makes these interaction systems so widespread? I believe the two main reasons are that they are very versatile and that they enable laziness. The AVP’s Indirect control ticks both of those boxes for me. You can do a whole lot of things while barely moving an inch in the real world. While Direct control has a transparent physical interface, Indirect control has a transparent conceptual interface!

If you settle on a bounded volume in the Shared Space like we did, you don’t have much choice but to build good Indirect control. The reason is that the user is not supposed to be inside the volume but facing it. By default, the volume spawns a meter or so away from you, which is out of arm’s reach. Given the distance, the user can’t manipulate the game world through an input method that directly maps their movement to gameplay. In the bounded Shared Space, you could think of the volume as a 3D spatial monitor instead of a physical space you bodily interact with.
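To illustrate what gaze-and-pinch selection looks like on the native side, here is a rough sketch of the usual visionOS pattern for making a RealityKit entity targetable for Indirect input. The view and entity names are placeholders, and this is not Puzzling Places’ actual code.

```swift
import SwiftUI
import RealityKit

struct PieceSelectionView: View {
    var body: some View {
        RealityView { content in
            // An entity only receives gaze-and-pinch input if it has
            // both a collision shape and an input-target component.
            let piece = ModelEntity(mesh: .generateBox(size: 0.1),
                                    materials: [SimpleMaterial()])
            piece.name = "examplePiece"
            piece.components.set(InputTargetComponent())
            piece.components.set(CollisionComponent(shapes: [.generateBox(size: [0.1, 0.1, 0.1])]))
            content.add(piece)
        }
        // Fires when the user looks at the entity and pinches.
        .gesture(
            SpatialTapGesture()
                .targetedToAnyEntity()
                .onEnded { value in
                    print("Selected \(value.entity.name)")
                }
        )
    }
}
```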

For Puzzling Places, that meant a change in how you play the game. In the VR version, you either walk to a piece or pull a piece towards you. You physically rotate the piece into the correct orientation and physically place it on the right spot. Needless to say, if the playing field is out of arm’s reach, you can’t play the game like that. Hence, we changed the game loop.

There is a center piece that acts as an anchor. This anchor is placed in a volume in the Shared Space, a meter or so away from you. You are presented with a series of pieces that can connect to the center piece. You look at the piece you want to connect, pinch, and move it to the spot where you think it should go, as if you were dragging it with a mouse. While your movements stay on a 2D plane like a mouse, the game logic figures out the depth and correct orientation for you!

This lets you play a 200-piece puzzle while sitting at your desk, moving your hands only about as much as you would move a mouse.

In summary, designing for Indirect control means going back to designing the way you would for a mouse or a controller. Instead of mapping the player’s physical movement one to one into the virtual world, you remap it so that a small amount of movement opens up a vast possibility space in the game world.
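Here is a hedged sketch of that remapping idea, again on the native SwiftUI/RealityKit side: a small pinch-drag moves a piece, and a placeholder solver (standing in for the real, game-specific logic) decides depth and orientation.

```swift
import SwiftUI
import RealityKit

struct PieceDragView: View {
    var body: some View {
        RealityView { content in
            // Pieces with InputTarget and Collision components would be
            // added here, as in the selection sketch above.
        }
        .gesture(
            DragGesture()
                .targetedToAnyEntity()
                .onChanged { value in
                    guard let parent = value.entity.parent else { return }
                    // The small hand movement is converted into the
                    // volume's coordinate space...
                    let dragged = value.convert(value.location3D,
                                                from: .local,
                                                to: parent)
                    // ...and the game logic, not the hand, decides the
                    // final depth and orientation.
                    value.entity.position = snapToPuzzle(dragged)
                }
        )
    }

    // Placeholder for the solver that would project a drag point onto
    // the nearest valid slot; the real logic is game specific.
    private func snapToPuzzle(_ point: SIMD3<Float>) -> SIMD3<Float> {
        point
    }
}
```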

This control scheme relies heavily on eye tracking, and we realized some interesting things about eye tracking while building it!

Problems with Eye Tracking

The AVP’s eye tracking is actually very solid. While technically it never gets wrong where my eyes are looking, the paradigm still has some funny problems.

The most prominent problem with eye tracking is something I would like to call the sequential intent problem. The fact that you have to look at something to select it means that you can only do certain things sequentially. This might not sound like a big deal, but I was surprised by just how often I do several things at the same time when using a computer, such as looking at one place while clicking somewhere else. This is probably where I felt the most friction on the AVP and where it takes a while to get used to. It doesn’t mean you can’t multitask with the AVP, just that you can only communicate your intent to the device sequentially. For example, you can use your eyes to select a puzzle piece with your right hand and immediately afterwards do the same to select a different thing with your left hand. Now you can multitask using the pieces you are holding in your two hands. But anything that uses the eyes as an input method forces the interaction through a bandwidth-limited interface.

There are further problems with assuming that where our eyes are looking aligns with our intention. Saccadic masking and visual distractions were the two I noticed the most.

Our eyes typically move saccadically, meaning they move in rapid, discrete jumps. Obviously, that is not how we perceive the world; our vision feels smooth and continuous. This is due to something called saccadic masking, which, as a sort of post-process, not only warps visual data to create a smooth transition but also retroactively changes or erases our memories to mask any evidence of the saccadic movement. This is terrible news for eye tracking, because where we think we were looking was not necessarily where our eyes were actually pointed! You notice this as you pinch to trigger an input and realize, jarringly, that your brain is lying to you about your past or present. That is a dramatic way of putting it, but it is a feeling I never quite got used to.

The second problem is that our eyes are still a sensory organ that at times reacts outside of our conscious control to ensure our survival. Rapid movements, catchy visuals, or high-level semantics like signs or text would pull my eyes away from the gameplay without me being able to do anything about it. This wouldn’t be a problem if not for how my brain seems to queue up actions in parallel with, or separately from, what the eyes are actually doing. For example, I might decide I want to grab a piece, so I send a pinch command to my fingers and a “look at that piece” command to my eyes. While this is going on, for some unknown reason, my eyes decide to jump to a button on the right to read what is written on it. The button says “Restart”. As I am looking at it, the pinch command is executed by my fingers, picked up by the AVP, and fed into the game loop. I have just restarted my progress. I laughed every time it happened!

On its own, it is actually fascinating that computing and latency have gotten to the point where we have problems like these! What does this mean for you design-wise? The above problems share the same solutions, which come down to: 1) the tempo of interaction with your game, 2) the distance between visual elements, and 3) the cost of a false positive.

To the first point: the faster the reaction time you demand of the player with the eye-tracking/pinch duo, the higher the chance that something goes wrong. I am not saying you can’t have fast-paced games with eye tracking, just that if gameplay-relevant reaction times overlap badly with the other things your eyes and brain are doing, you will have problems. Secondly, the further interactables are from each other in eye space, the lower the chance that these problems pop up. Interestingly, just as we build “dead zones” into analogue stick inputs to account for various unknown factors, you can enforce a minimum distance between interactable objects in your scene to account for the weird behavior of the eyes.
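As a sketch of that “dead zone” idea, here is a simple greedy layout pass that keeps interactables a minimum angular distance apart as seen from the viewer. The function names and the three-degree threshold are made-up placeholders to tune per app.

```swift
import Foundation
import simd

/// Angular separation in degrees between two targets as seen from the
/// viewer, i.e. their distance "in eye space".
func angularSeparation(viewer: SIMD3<Float>,
                       _ a: SIMD3<Float>,
                       _ b: SIMD3<Float>) -> Float {
    let dirA = simd_normalize(a - viewer)
    let dirB = simd_normalize(b - viewer)
    let cosine = max(-1, min(1, simd_dot(dirA, dirB)))
    return acos(cosine) * 180 / .pi
}

/// Greedily drop candidate positions that would sit closer than
/// `minDegrees` to an already accepted interactable.
func gazeFriendlyLayout(candidates: [SIMD3<Float>],
                        viewer: SIMD3<Float>,
                        minDegrees: Float = 3) -> [SIMD3<Float>] {
    var accepted: [SIMD3<Float>] = []
    for candidate in candidates {
        let crowded = accepted.contains {
            angularSeparation(viewer: viewer, $0, candidate) < minDegrees
        }
        if !crowded { accepted.append(candidate) }
    }
    return accepted
}
```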

Lastly, you can evaluate how large the cost of a false positive is and whether you can reduce it. Accidentally pressing the Restart button is terrible! But if the user has to confirm an extra prompt after pressing it, it becomes just an annoyance. An example from our game was how we switched between unpuzzled pieces on the shelf. To align with the OS design, we first used a swipe motion. But since the shelf background and the pieces were so close to each other, we kept accidentally selecting a piece while swiping, which caused a jarring animation as the piece was thrown to the left or right. Instead, we switched to a button. Now, if the user wants to press the button and accidentally selects a piece, all that happens is a wrong sound playing and the pieces not switching.
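A cheap way to lower the cost of a destructive false positive is the extra confirmation prompt mentioned above. Here is a minimal SwiftUI sketch with a hypothetical restart handler; it is just an illustration of the pattern, not how Puzzling Places handles it.

```swift
import SwiftUI

struct RestartButton: View {
    @State private var askForConfirmation = false
    // Hypothetical callback that actually resets the puzzle.
    var restartPuzzle: () -> Void = {}

    var body: some View {
        Button("Restart") {
            // An accidental gaze-and-pinch only opens the prompt;
            // progress is not lost yet.
            askForConfirmation = true
        }
        .confirmationDialog("Restart the puzzle?",
                            isPresented: $askForConfirmation) {
            Button("Restart", role: .destructive) { restartPuzzle() }
            Button("Cancel", role: .cancel) {}
        }
    }
}
```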

Shared Space means Shared Everything

One of the challenges of designing for the Shared Space is working out what it means for your app to run seamlessly next to other apps. The most obvious implication is that your computing resources are shared, so you shouldn’t assume the device’s entire processing power will be dedicated to your app.

But there are further implications, such as cognitive load. If you are designing for the Shared Space, you intend people to use your app next to other apps; if not, why not just go for a Full Space and spare yourself the extra work? If you do want that, you need to make sure the cognitive load of your game loop leaves some mental processing power for the user to do other things, such as attending a FaceTime meeting or thinking about a work problem. This was one of the reasons we decided to simplify the loop of Puzzling Places for the AVP.

Leaving breathing room for other apps is a pattern that runs through every area of your game. As you design your soundscape, keep in mind that not only will other apps produce sounds, but so might the Apple environments the user decides to use.

Speaking of sound, the AVP’s Passthrough is good enough that I started to find it weird when the reverb profile of a soundscape didn’t match the room I was looking at. This doesn’t happen to me with the Quest 3’s Passthrough, for example.

Technical Limitations

At the time of writing this blog post, the first decision you need to make is what you want to build your game with. You can choose between the native Swift/RealityKit combo and the Unity engine.

For us, native Swift was not really a viable option since we had no experience developing for Apple. Given the very tight development time, it made a lot more sense to stick with an ecosystem whose risks we could at least calculate. But if you have Swift experience, you will have quite a lot of advantages developing your game natively.

The biggest advantage is access to features. Unity had some severe limitations for the AVP: no spatial audio, only one volume at a time, no access to the default SwiftUI functionality, and so on. Some of these limitations were on Apple’s side, some on Unity’s, and some stem from how the whole rendering architecture of the AVP works. Either way, whenever you decide to jump in, third-party libraries are usually a bit behind the latest native capabilities.

If you build your game in Unity, it splits the game into two parts. Your game logic is mostly compiled into a C++ library, where your main game loop lives, orchestrated by Swift code that initializes the program. Your scene is converted into a format the Apple backend understands, with the relevant components mapped to various Apple components. Since Apple has a whole different way of doing things, your Unity components of course don’t match one-to-one with what ends up happening. This is annoying at best.

Rendering-wise, some materials are converted into MaterialX. These can utilize the various PBR capabilities of the shaders Apple provides, along with a lot of information that isn’t necessarily available to you in Unity. On the other hand, from my tests, these shaders are a lot more expensive than custom shaders you could compile in Metal. Speaking of performance and rendering, how powerful is the AVP? I have no idea.

Hardware-wise, the AVP is obviously the most powerful commercial headset out there. But it also has a very high resolution and framerate and low latency, so fill rate still seems to be a problem. Because rendering happens in an Apple process while the game logic runs in Unity’s C++ code, you always need to figure out which of the two is actually causing a lag. To make matters worse, while you can profile Unity code as you are used to, that doesn’t give you any information about what the Apple side is doing. There is also an Xcode profiler, which has quite limited capabilities at the moment. I would have liked more information about how expensive rendering is on the Apple side.

Speaking of limitations, there are quite a lot of restrictions on what data you can access. These are especially severe in the Shared Space and when the user is outside of your dedicated volume. While I can understand from a data-protection point of view why an app shouldn’t know where the user’s eyes are looking, information like the camera position is very relevant to game development. For example, I do hope that in the future there will be ways to art-direct the highlighting behavior of elements the eyes hover over.

The Apple Vision Pro comes with a lot of ideas about how people will use it. Designing for it, you need to decide how far you will align yourself with those ideas. Time will tell how many of them will stick and how many will be forgotten.

Thanks for reading. As usual, you can follow me on the various socials listed here: https://ircss.github.io/
