When I started my gamedev adventure I wanted to be a rendering/3D programmer. This was where I felt most confident. Just like 90% of demoscene coders, I didn’t have much experience with AI, physics or gameplay; it wasn’t really needed for demos. Later it turned out that other areas are very interesting as well. I ended up doing them all at one point, and nowadays I don’t really code much 3D stuff. I had a brief period at CDPR when I was doing it professionally, but other than that I mostly deal with low-level system/gameplay stuff. I still try to follow the latest trends and techniques, but I wouldn’t consider myself a 3D programmer anymore, especially compared to guys who do it every day. However, from time to time some idea sparks my interest strongly enough to give it a try. This was the case with software occlusion culling.
I first read about it in Johan Andersson’s (DICE Rendering Architect) presentation from Graphics Hardware 2008, “The Intersection of Game Engines and GPUs: Current & Future”. Well, “first” may not be the best term. It’s an old technique; I used something similar in my software 3D engine some 10 years ago (it came almost for free, as I was maintaining a software depth buffer anyway). Then the first GPUs arrived and took care of a big chunk of the rendering pipeline. Recently CPUs became powerful enough to give this technique a try again.
The basic idea is very simple: you render the depth of potential occluders to a special buffer, then test the bounding volumes of objects against the values in that depth buffer. Rendering of occluders has to be done in software. For the purpose of these experiments I created a simple scene with some buildings aka cubes (with subdivision, random number of faces), ground and 40 skinned characters wandering around the “city”. Without any kind of occlusion culling, just VFC (view frustum culling), in a typical situation we have around 200k triangles to render. It’s not a terribly realistic scenario, but should be OK for our experiments.
First, we need to render selected occluders. An ideal occluder would be something big, close to the camera and with a small number of faces. The perfect test would be to project its bounding box to the screen and measure the area, but that could prove rather expensive, so I opted for a simpler solution. Occluder rating in my case is based on distance from the camera and box size. If we don’t apply any crazy transformations, this should give us a rough but nice estimate of the object’s final size. I discard objects that appear too small (no sense in wasting time trying to rasterize them). Note: in a real-world scenario, you’ll probably want a flag saying “I’ve been a bad bad occluder” in the editor that can be enabled for fences and such. I sort occluders by their rating (from “best” to “worst”) and interrupt rasterization when I detect that a big portion of the screen may be occluded already (dirty checks, but they help to avoid peaks when rendering huge triangles and a zillion small ones which are hidden behind them anyway). Important note here: I render the original geometry. It may be a good idea to provide simplified versions just for occlusion purposes. I wanted to avoid this for my test, as one of the things I like most about this approach is that it requires virtually no pre-processing; it just works, out of the box. I did, however, experiment with automatic simplification using Stan Melax’s proven algorithm and it works like a charm. Figure 2 shows an example depth buffer with rendered occluders.
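To make the selection step a bit more concrete, here is a minimal sketch of rating, filtering and sorting occluders. The exact formula is not spelled out above, so the "box size divided by distance" rating is an assumption, and `Occluder`, `RateOccluder`, `SelectOccluders` and `minRating` are hypothetical names made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical occluder candidate (names are illustrative only).
struct Occluder {
    int   id;
    float cx, cy, cz;  // world-space center of the bounding box
    float radius;      // half of the bounding box diagonal
};

struct RatedOccluder {
    float rating;
    int   id;
};

// Assumed rating: box size divided by distance to the camera.
// Bigger and closer objects get a higher rating.
float RateOccluder(const Occluder& o, float camX, float camY, float camZ) {
    assert(o.radius >= 0.0f);
    const float dx = o.cx - camX, dy = o.cy - camY, dz = o.cz - camZ;
    const float dist = std::sqrt(dx * dx + dy * dy + dz * dz);
    return o.radius / std::max(dist, 1e-4f);  // clamp avoids division by zero
}

// Returns ids of occluders worth rasterizing, sorted from best to worst.
// Candidates rated below minRating (too small on screen) are discarded.
std::vector<int> SelectOccluders(const std::vector<Occluder>& candidates,
                                 float camX, float camY, float camZ,
                                 float minRating) {
    std::vector<RatedOccluder> rated;
    for (const Occluder& o : candidates) {
        const float rating = RateOccluder(o, camX, camY, camZ);
        if (rating >= minRating) rated.push_back({rating, o.id});
    }
    std::sort(rated.begin(), rated.end(),
              [](const RatedOccluder& a, const RatedOccluder& b) {
                  return a.rating > b.rating;
              });
    std::vector<int> ids;
    for (const RatedOccluder& r : rated) ids.push_back(r.id);
    return ids;
}
```

The rasterization loop would then walk the returned ids in order and bail out early once the "big portion of the screen is already covered" check fires. That check is not shown here.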
The resolution of the occlusion buffer determines precision and efficiency. The bigger the buffer, the more costly the rasterization, but the more precise the tests. I went for 256×192 and it worked rather OK for my purposes; DICE uses 256×114. What’s interesting, in my case it usually cost more to transform all the vertices and clip polygons than to actually rasterize the triangles. Just for kicks I optimized the transformation pipeline for SSE and it helped significantly. For rasterization I didn’t use anything fancy, just a good old scanline algorithm with constant deltas per triangle (so no divisions and other evil operations per scanline). If I had more time I’d certainly give the half-space approach a try; it looks like it could work better on modern computers. One very simple trick I applied to speed rasterization up was to parallelize it (partially). The only point where two threads rendering different triangles may cause a problem is access to the occlusion buffer. How can we ensure they never access the same address? See Figure 3:
We divide the framebuffer (the occlusion buffer in this case) into two halves. All triangles are classified into 3 sets: they can be completely in the left half, completely in the right half, or partially in both. Now we can easily rasterize the blue and red triangles in parallel, because there’s no risk they’d need to access the same address in the framebuffer. After this is done, I render the remaining green triangles, the ones that straddle both halves. Performance gains vary, but on average it lets us draw triangles about 30% quicker than in single-threaded mode. One could expect that splitting the framebuffer along the Y coordinate (into top and bottom halves) would give better results, because it’d reduce false sharing, but in practice most games use screen width more extensively than height (because of the skybox and such), so it’s not that obvious (in my test, it was actually slower).
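The classification step can be sketched roughly like this. It is only an illustration under assumptions: `ScreenTri`, `TriBins`, `ClassifyTriangles` and `RasterizeInTwoHalves` are hypothetical names, triangles are assumed to be already transformed to screen space, and the actual rasterizer is passed in as a callback that must only write pixels covered by the triangle it is given.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical screen-space triangle: three (x, y) vertices in pixels.
struct ScreenTri {
    float x[3];
    float y[3];
};

// Indices of triangles falling into each of the three sets.
struct TriBins {
    std::vector<int> left, right, both;
};

// Classify triangles by their horizontal extent relative to the split column.
TriBins ClassifyTriangles(const std::vector<ScreenTri>& tris, float splitX) {
    TriBins bins;
    for (int i = 0; i < static_cast<int>(tris.size()); ++i) {
        const ScreenTri& t = tris[i];
        const float minX = std::min({t.x[0], t.x[1], t.x[2]});
        const float maxX = std::max({t.x[0], t.x[1], t.x[2]});
        if (maxX < splitX)       bins.left.push_back(i);   // entirely in the left half
        else if (minX >= splitX) bins.right.push_back(i);  // entirely in the right half
        else                     bins.both.push_back(i);   // straddles the split
    }
    return bins;
}

// Rasterize the two independent sets in parallel, then the straddling set.
// rasterize(i) is a caller-supplied function drawing triangle i.
template <typename RasterFn>
void RasterizeInTwoHalves(const TriBins& bins, RasterFn rasterize) {
    std::thread leftWorker([&] {
        for (int i : bins.left) rasterize(i);
    });
    for (int i : bins.right) rasterize(i);  // right half on the calling thread
    leftWorker.join();
    for (int i : bins.both) rasterize(i);   // single-threaded, no overlap risk
}
```

Spawning a thread per frame is only for brevity; a real implementation would more likely hand the left set to an existing worker or job system.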
Transformation/clipping can be easily parallelized as well. I went the easy way, and even with display list access guarded by mutexes it is still a fair bit quicker than doing this in a single thread.
Results of our occlusion culling in action can be seen in Figures 4 & 5.
Of course, this is a very favorable situation, because we’re standing right in front of a huge box, but gains are noticeable in most other cases as well, although they may not be as significant.
Now, before you rush to code it yourself, a word of caution. One problem I noticed shows up when very “long” boxes are tested against an occlusion buffer with narrow gaps. See Figure 6:
Blue boxes are the contents of the occlusion buffer; the red box is the bounding box of the tested object. If we only tested corners, the object would not be rendered (the corners are occluded), even though it is visible through the gap. To avoid this, I also test more points on the edges connecting the “nearest to camera” corners (black dots). This solves 90% of situations (at a slight performance penalty; you can trade speed for precision here by tweaking the number of tested points), but there may still be places where objects are rejected even though they shouldn’t be (very narrow gaps between occluders). For this I couldn’t find any better way than finding such places and marking those meshes so they’re not used as occluders. This is the only place in the system where I couldn’t easily make it work without an artist’s intervention.
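The edge sampling described above could look something like the sketch below. Everything here is an assumption made for illustration: the `DepthBuffer` layout, the "smaller depth is closer" convention, treating off-buffer samples as visible, and the names `PointVisible` and `EdgeVisible`. Depth is interpolated linearly along the edge, which assumes post-projection depth values.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical software occlusion buffer.
struct DepthBuffer {
    int width, height;
    std::vector<float> depth;  // smaller = closer; cleared to 1.0f (far)
    float At(int x, int y) const { return depth[y * width + x]; }
};

// Screen-space point: pixel coordinates plus depth.
struct ScreenPoint {
    float x, y, z;
};

// A point counts as visible if it is not behind the stored occluder depth.
// Samples outside the buffer are treated as visible (conservative choice).
bool PointVisible(const DepthBuffer& buf, const ScreenPoint& p) {
    const int px = static_cast<int>(std::floor(p.x));
    const int py = static_cast<int>(std::floor(p.y));
    if (px < 0 || py < 0 || px >= buf.width || py >= buf.height) return true;
    return p.z <= buf.At(px, py);
}

// Tests both corners of an edge plus `extra` evenly spaced points between them.
// More points catch narrower gaps, at a higher cost per tested object.
bool EdgeVisible(const DepthBuffer& buf, const ScreenPoint& a,
                 const ScreenPoint& b, int extra) {
    assert(extra >= 0);
    const int steps = extra + 1;
    for (int i = 0; i <= steps; ++i) {
        const float t = static_cast<float>(i) / static_cast<float>(steps);
        const ScreenPoint p = {a.x + (b.x - a.x) * t,
                               a.y + (b.y - a.y) * t,
                               a.z + (b.z - a.z) * t};
        if (PointVisible(buf, p)) return true;
    }
    return false;
}
```

With `extra` set to 0 this degenerates to the corners-only test that misses the gap. Note that evenly spaced samples can still step over a gap narrower than their spacing, which is exactly the remaining failure case described above.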
Let’s wrap up. Advantages of software occlusion culling:
- very small amount of artist work required. No manual occluder placing needed.
- predictable/stable/small overhead. DICE simply limits their occluding geometry to 10k triangles per frame. Determine your own limit, measure the overhead and you won’t go over it (especially if you implement the “big occluder” test I described before).
- easily parallelizable. You can either run the whole process in a different thread, or parallelize stages like I did here.
- works in screen space/with model geometry. Most occlusion culling algorithms perform geometry tests on bounding volumes and thus are less exact (conservative, but still).
- lends itself nicely to high-level optimizations. I didn’t try it, but all the usual tricks that apply to hardware occlusion culling should work here as well. One thing I’d definitely try if occlusion tests proved costly is Jiri Bittner’s stuff.
- [forgot about this one originally] unlike HW occlusion culling (using queries) it runs 100% on the CPU, so there are no GPU<->CPU synchronization problems and it parallelizes nicely. Should work great on systems with relatively powerful CPUs and slower GPUs (…)
The one disadvantage I can think of right now is the narrow-gap problem described before the list. I guess more could turn up in a full-blown game. It is still worth a try, though, mainly because it’s really simple. The whole system can be easily implemented in a day or two. If it works, great: spend some more days on optimizations and you’re set. I’d say it certainly looks easier than the alternatives.
Finally, in the typical fashion of including totally unrelated links, an interesting talk I found today: YouTube Scalability by Cuong Do. It doesn’t have anything to do with gamedev, but it’s good to widen your horizons from time to time. I don’t know crap about the subject he’s talking about, but it was still pretty interesting. The crazy thing is, they maintained YouTube with like 10 people for a long time.