Diagnosing performance issues in Unity
Performance is one of the most important parts of delivering a successful product. In this post, I am going to share how I go about investigating performance issues in Unity.
How long your Unity application needs to finish a frame depends on many different things. That is why optimizing your application loop takes a lot of knowledge. Yet all the general knowledge about the underlying hardware, how the different systems work and so on only helps you form potentially correct hypotheses about why your loop is slow and how to fix it. There is no one-size-fits-all solution for performance problems. Each system is unique and each optimization brings in a potentially massive change. Because of the chain of complex dependencies, the only way to know for sure which component needs to be trimmed, and where the main problems are, is to test, document and repeat the benchmarking under different conditions.
In the further reading section, I link to some useful posts on performance optimization. Go and read them when you can, because I won’t repeat what they already cover. They have done a fantastic job of explaining what they wish to convey.
One main thing to keep in mind when thinking about performance: the majority of the time, the source of the problem is neither the computer, the APIs, nor the engine. It is your code: things being rendered twice, inefficient lookups and nested loops, bad memory management and bad architecture.
Scientific Method is your friend
Remember those science papers you had to write in high school for your natural science classes? Where you had to come up with a hypothesis, a clear definition of all variables, documentation of the investigation method, the raw results, analysis and presentation of the results, an evaluation and a conclusion? For me, that turned out to be the perfect method for investigating performance issues.
Because of the complex nature of performance issues, how little we know at the beginning about the actual cause-and-effect relationships between different factors, and the deterministic nature of code and computers (same input, same output), using the scientific method as the workflow to diagnose performance issues makes total sense. This has a series of advantages which I will go over shortly, but first let me give a concrete example.
Let’s say you have a scene with performance problems. You guess that fragment overdraw might be a reason for it, and would like to investigate how much of a problem overdraw is for your system. You plan an experiment where you keep all other influencing factors constant (vertex count, CPU calculations, GPU state changes and so on) and only change the amount of fragment overdraw.
An example of a setup like this would be a bunch of planes facing the camera, lined up along the camera’s forward axis. The camera should be in orthographic mode and the planes should cover 100 percent of the screen. The planes should write to the Z-buffer, but their ZTest should be off, so that every plane is fully shaded. Now everything else stays the same, while you change the number of overlapping planes and record the GPU time.
You can show the results on a graph: the x axis is the number of overlapping planes, the y axis the GPU render time. You would be surprised how often seemingly noisy profiler data gives you a very clean correlation between related factors. You can analyze this data however you want (calculate the gradient, project y for a certain x value and so on).
The next part is to reflect on it. For example, you are investigating fragment overdraw here in isolation. There is no pressure on bandwidth, core resources and so on. What would be different in your actual application? What could change in the actual application to affect this? Analyzing this experiment will soon leave you with a bunch of other things you need to analyze. Maybe you should redo the experiment in your actual scene, or in another scene that has similar bandwidth pressure. What is the effect of fragment shader complexity? How about dependent texture reads and latency?
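As a rough sketch of the kind of analysis meant here, the snippet below fits a straight line through recorded measurements and projects the GPU time for a plane count that was not measured. The numbers are made up for illustration, and `fit_line` is a hypothetical helper, not part of Unity or its profiler. It assumes you have copied the GPU times out of the profiler by hand or from your own logging.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical measurements: number of overlapping planes -> GPU time in ms.
layers = [1, 2, 4, 8, 16]
gpu_ms = [0.6, 0.9, 1.5, 2.7, 5.1]

slope, intercept = fit_line(layers, gpu_ms)

# The gradient is the cost of one extra full-screen layer of overdraw.
print(f"cost per extra layer: {slope:.2f} ms, fixed cost: {intercept:.2f} ms")

# Project y for an x value that was not measured.
predicted_ms_at_32 = slope * 32 + intercept
print(f"projected GPU time at 32 layers: {predicted_ms_at_32:.2f} ms")
```

A linear fit is only an assumption. If the measured points clearly bend away from a line, that is a result worth writing down in the reflection part.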
It is important to document this cleanly and save the scene, scripts or the project you used somewhere accessible to all your team members. What is the advantage of that?
First of all, you won’t be the only performance person on the team. If you are sick for a week or have to change scope, people can easily read through your well-reasoned and documented performance profiling and continue where you left off.
Second, it creates a standardized system of performance diagnosis which you can reuse. Unless your CEO’s dad is paying for the development of the project, I would assume you are planning to publish it for different environments (hardware, platforms, operating systems etc.). Each environment has a potential effect on your performance. Having documented your tests and saved the project files you used for them, you can simply redo the tests on a different platform. This quickly gives you in-depth knowledge about what to optimize for each platform.
The system also helps you spot mistakes more easily. You can look back through the hypothesis and reflection sections and tell why you thought a certain way was best. You or a teammate might spot errors in reasoning there later on, and change the optimization plan accordingly.
By standardizing the tests, you are already halfway to setting up automated performance tests. This might not be relevant for smaller teams that just finish projects and move on to the next one, but for those with a more involved and systematic asset creation pipeline, this is a crucial setup.
Last but not least, by going through the entire application and analyzing the different components, you develop an understanding of how your architecture works.
Know your architecture
You should map out your render loop well. You should have a plan for it, and you should have it early in development. I am going to state two things that sound controversial but are actually the reasons why you should know your architecture early on. First, don’t build first and optimize later (my twist on a popular saying among developers), and second, early optimization is the spawn of the devil and a waste of everybody’s time.
I am going to start off with the second point. There is a series of reasons why early optimization is bad. For example, game development is a very iterative process. Rapid prototyping is a valuable skill to have: before spending the time to optimize the code to death, it is more important to figure out whether it even makes sense for the goals and feel of the game. Another example: optimizing where it is not needed is wasted engineering time. The goal is not to write the most optimized code, the goal is to make the best product (which requires it to run at an interactive frame rate). You also need to first figure out which hill you are going to die on before you start optimizing specific instructions. You are not helping anyone by optimizing one or two cache misses or special instructions here and there in the frame when your draw calls are taking 10 ms of your frame time. You should tackle that first. Knowing your architecture makes it clear where you should put your optimization effort. To get there, you need general knowledge and a lot of testing.
As for the first point: the above paragraph is the reason why “make first, optimize later” is a saying among developers. However, I don’t like that sentence, because it implies you leave any thought of performance until your game is done or you have hit a performance wall. While you shouldn’t be optimizing your sines or cosines for no reason, you should start building a map of your loop’s architecture as soon as you can. If not, once you are done developing the game on your 2080 Ti, you realize that without optimization your potential market reach is 2 percent of what it could be. If you are unlucky, you realize some of the things you rely on are irredeemable. You will never get your interactive fluid simulation to run on older GPUs. You have to throw away a lot of things, design new solutions for problems you have already solved and pretty much redo a lot. If you have done this a few times and are more experienced, you can assign frame time budgets to different things right from the beginning of the project.
To understand your architecture, it is important to build a solid understanding of what runs in parallel and what does not. If everything runs on one thread, then optimization is pretty much trimming where you can. If you make process A run 1 ms faster, frame time improves by 1 ms. However, most systems are not like that. A game can have a layout like this: the CPU runs the game logic on one thread and prepares what it can regarding culling and sorting, the render thread sends the API calls, pipelined, in parallel on a different thread, and the GPU executes those calls, also pipelined and in parallel. Some games use more CPU cores to execute either pipelined code or independent tasks that can run in parallel. Knowing this is crucial, because now there is a bottleneck: the part of the system which takes the longest. Optimizing the rest of the system doesn’t make any difference to the frame time unless you reduce the bottleneck time. Bottlenecks move. At a certain threshold, whatever you are optimizing stops being the bottleneck, and you need to move on to the next one. What makes this complicated is that you are usually adding calculation time somewhere else in the pipeline in order to trim calculation time on your bottleneck. There will be a point during optimization where you realize you can’t trim any more. Then stop!
A map of your architecture is different for different projects. How you group the components of the loop together, and how much time you assign to each, depends on what the main components of your loop are. The map could be something like this. For the sake of simplicity, let’s first assume everything is running on one thread, and then map out a pipelined version. I have presented two iterations, which are revisions of the map as your development progresses.
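To make the difference between the two layouts concrete, here is a deliberately simplified model. It assumes that in a fully pipelined setup every stage works on a different frame at the same time, so the steady-state frame time is set by the slowest stage, while on a single thread the stage times simply add up. The stage names and timings are hypothetical, and real pipelines have stalls and sync points that this ignores.

```python
def frame_time_single_thread(stage_ms):
    """Everything runs back to back on one thread: the times add up."""
    return sum(stage_ms.values())


def frame_time_pipelined(stage_ms):
    """Simplified steady state of a pipeline: the slowest stage sets the pace."""
    return max(stage_ms.values())


def bottleneck(stage_ms):
    """Name of the stage that currently takes the longest."""
    return max(stage_ms, key=stage_ms.get)


# Hypothetical per-frame timings in ms.
stages = {"game_logic": 4.0, "render_thread": 6.0, "gpu": 9.0}

# Trimming 1 ms off a stage that is not the bottleneck.
faster_logic = {**stages, "game_logic": 3.0}
# Trimming the bottleneck itself.
faster_gpu = {**stages, "gpu": 5.0}

for label, s in [("baseline", stages),
                 ("faster game logic", faster_logic),
                 ("faster GPU", faster_gpu)]:
    print(f"{label}: single thread {frame_time_single_thread(s):.1f} ms, "
          f"pipelined {frame_time_pipelined(s):.1f} ms, "
          f"bottleneck: {bottleneck(s)}")
```

In this model the faster game logic only helps the single-threaded case, and once the GPU drops below the render thread the bottleneck moves, which is the point where you would switch your attention to the next stage.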
The boundaries of the map will change as time goes on, and the sections will be subdivided and defined ever more finely. For example, in a later iteration you divide the GPU side into different passes for the shadow map, different types of geometry, screen space effects and so on. Later in development, as you continue to understand your loop better, you further subdivide those into their geometry part (vertex, tessellation, geometry shader), rasterization part, fragment shader part, and blending. You can subdivide different parts to different extents. It is good not to shoot in the dark. You will save weeks of wasted work if you start with this early. This has no direct consequences for 90 percent of the code you write, but it helps you know what you can afford. If your game has water in it, and you would like to have a cool simulation you saw online, go and check how much frame time it takes. If it takes 5 ms and you can only give the effect 1 ms, maybe you should go for a simpler artistic representation. If you were planning 2 ms for game logic code on the CPU and suddenly you are seeing 4–5 milliseconds, someone probably wrote stupid code. Have whoever decided to search the entire Assets resources in a nested for loop every frame go and correct it.
This also makes it easier to work with other people. You give someone a task: implement the water in this scene, but make sure it doesn’t take longer than 1 ms. The person knows what to do, they do it, and the code doesn’t need to be touched later. Or they hit a problem, the water needs 1.5 ms, everyone knows early enough, and a design meeting is held to consider whether the effect is worth the production cost of optimizing it. Note that without the 1 ms limit, which came from knowing your architecture, you would not have realized this is a problem until much later, when your other features are implemented and you start testing on less powerful machines.
This is a really powerful team aspect. If you know early enough where your performance challenges are, you can include your design team in this meeting. This makes a big difference, because suddenly it is not just a bunch of programmers deciding how to fit a million skinned animated bears into the scene, but also your designers, who might decide they don’t need a million bears. You can design around performance challenges and work with the constraints. For that, you need to know what those constraints are.
Knowing your architecture saves your neck from bad decisions and wasted time. An example of a bad decision is throwing more parallel processing power at a bad algorithm. Multithreading is not always faster: synchronizing worker threads with your main thread can lead to more overhead if the algorithm is not designed for it. If you know your architecture, you will spot that pretty much immediately.
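One way to keep such a map honest is to write the budget down and compare it against measured timings. The sketch below is a minimal version of that idea. The component names, budgets and measurements are all hypothetical, and `over_budget` is an illustrative helper rather than anything Unity provides.

```python
def over_budget(budget_ms, measured_ms):
    """Return {component: overshoot in ms} for every component above its budget."""
    return {
        name: round(measured_ms[name] - limit, 3)
        for name, limit in budget_ms.items()
        if measured_ms.get(name, 0.0) > limit
    }


# Hypothetical frame time budget per component, in ms.
budget_ms = {"game_logic": 2.0, "water": 1.0, "shadows": 2.5, "post_effects": 1.5}

# Hypothetical timings read off the profiler for the same components.
measured_ms = {"game_logic": 4.5, "water": 1.5, "shadows": 2.1, "post_effects": 1.4}

for name, overshoot in over_budget(budget_ms, measured_ms).items():
    print(f"{name} is {overshoot} ms over its {budget_ms[name]} ms budget")
```

A check like this is cheap to rerun on every target platform, so an overshoot shows up while it is still a design question rather than a rewrite.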
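As a toy illustration of why more threads can make things slower, the model below splits a job evenly across workers and charges a fixed synchronization cost per worker. Both the model and the numbers are assumptions made up for this sketch. Real synchronization costs depend on the job system, the data layout and the hardware, so they have to be measured.

```python
def single_thread_ms(work_ms):
    """Cost of running the whole job on the main thread."""
    return work_ms


def threaded_ms(work_ms, workers, sync_ms_per_worker):
    """Toy model: work splits evenly, each worker adds a fixed sync cost."""
    return work_ms / workers + sync_ms_per_worker * workers


SYNC_MS = 0.3  # hypothetical cost of scheduling and joining one worker

for job_ms in (1.0, 8.0):  # a small and a large job, hypothetical sizes
    print(f"job of {job_ms} ms on one thread: {single_thread_ms(job_ms):.2f} ms")
    for workers in (2, 4, 8):
        cost = threaded_ms(job_ms, workers, SYNC_MS)
        print(f"  {workers} workers: {cost:.2f} ms")
```

Under these assumptions the small job is slower on any number of workers than on the main thread, and even the large job gets worse again when going from 4 to 8 workers.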
Know your Tools
I like the Unity profiler. It is a very powerful tool, and you should learn how to use it. The area could use more tutorials though.
The profiler helps you map out your architecture. I regularly use the CPU, memory and GPU profilers. When using the profiler, there are some things you need to keep in mind.
First of all, profiling has an overhead cost of its own, so turn off the profilers you don’t need. Second, the Unity editor has a cost of its own. Test in a build as much as you can. Third, don’t measure things in FPS! The total frame time tells you very little: it is noisy data and dependent on too many things. You should be trying to map out how long specific components of the loop take, and measure those separately in ms. Fourth, the profiler doesn’t tell you everything that is happening on the CPU or GPU, only the things it knows about and has collected data on. This has gotten me a few times. In one Unity version, Windows Defender stalled my render thread calls. This of course wasn’t shown in the profiler. The profiler showed waiting for GPU, while the GPU also showed waiting for GPU. For scenarios like this, to know where spikes are really located (CPU or GPU), it is good to have an external profiling application like the frame counter from Steam. Your operating system, the memory management done by the heap manager and the like won’t be visible in the profiler. Last, keep an eye on the garbage collector and memory. They are usually the source of spikes.
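One more reason to think in milliseconds rather than FPS is that FPS is not linear in the work you remove. The short sketch below, with made-up frame times, shows the same 2 ms saving producing very different FPS changes depending on where you start.

```python
def ms_to_fps(frame_ms):
    """Frames per second for a given frame time in milliseconds."""
    return 1000.0 / frame_ms


def fps_gain_from_saving(frame_ms, saved_ms):
    """How many FPS you gain by shaving saved_ms off a frame of frame_ms."""
    return ms_to_fps(frame_ms - saved_ms) - ms_to_fps(frame_ms)


SAVED_MS = 2.0  # the same optimization in both cases

for frame_ms in (10.0, 40.0):  # hypothetical starting frame times
    gain = fps_gain_from_saving(frame_ms, SAVED_MS)
    print(f"{frame_ms:.0f} ms frame: {ms_to_fps(frame_ms):.1f} FPS, "
          f"saving {SAVED_MS:.0f} ms gains {gain:.1f} FPS")
```

The cost of a component in ms stays comparable across scenes and machines, while the FPS number it translates to depends on everything else in the frame.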
The tools I use for performance investigations are these:
- Unity profiler, to figure out how long each component is taking. I use this to keep an eye on the main, render and worker threads, application memory and GPU.
- An external profiler like the Steam FPS counter, to know for sure whether spikes are on the CPU or GPU
- RenderDoc / Frame Debugger, to walk through the draw call commands and make sure things are working as intended
And of course whatever timer I build in the code to measure specific function calls.
Conclusion
The key to optimization is knowing where to optimize. To diagnose the sources of performance issues, you need A) a solid understanding of how the underlying systems you rely on work, and B) systematic testing of the different components of your application, documenting it well and reflecting on how the different elements are interconnected. For that you need a good set of tools, and you should know how to use them.
I hope you enjoyed reading. You can follow me on Twitter (IRCSS), and if you find factually incorrect information, please let me know.
Further Reading
- Official Unity resource on performance and optimization (also has info on garbage collection). A bit long, but a must-read in my opinion if you don’t already know the information in the post: https://unity3d.com/learn/tutorials/temas/performance-optimization/optimizing-graphics-rendering-unity-games?playlist=44069
- Official Unity resource on the profiler (of course some of this information is also in the docs). Read this to get to know the tool you will have to use: https://learn.unity.com/tutorial/diagnosing-performance-problems#5c7f8528edbc2a002053b598
- Official Unity resource on the Frame Debugger. Unfortunately it is behind a premium paywall (which I am not a fan of). You can also refer to the official docs on this: https://learn.unity.com/tutorial/working-with-the-frame-debugger#
- Video on Frame debugger: https://youtu.be/4N8GxCeolzM
- And my favorite resource of all time. If you can get your hands on Real Time Rendering, it has an extensive chapter on performance. It goes through the basics, lays out systematic workflows for determining where the bottleneck is, and gives a solution for pretty much every possible bottleneck: Real Time Rendering, Chapter 18 Pipeline Optimization, Pages 783–805.