Software Occlusion Culling

When I started my gamedev adventure I wanted to be a rendering/3D programmer. This was where I felt most confident. Just like 90% of demoscene coders, I didn’t have much experience with AI, physics or gameplay, it wasn’t really needed for demos. Later it turned out that other areas are very interesting as well, I ended up doing them all at one point and nowadays I don’t really code much 3D stuff. I had brief period at CDPR when I was doing it professionally, but other than that I mostly deal with low-level system/gameplay stuff. I still try to follow latest trends/techniques, but I wouldn’t consider myself 3D programmer anymore, especially comparing to guys who do it every day. However, from time to time some idea may spark my interest strong enough to give it a try. This was the case with software occlusion culling.

I’ve first read about it in Johan Andersson’s (DICE Rendering Architect) presentation from Graphics Hardware 2008 - ‘The Intersection of Game Engines and GPUs - Current & Future’. Well, ‘first’ may be not the best term, it’s an old technique, I’ve been using something similar in my software 3D engine some 10 years ago (it came almost for free, as I’ve been maintaining sofware depth buffer anyway). Then first GPUs arrived and took care of a big chunk of rendering pipeline. Recently CPUs became powerful enough to give this technique a try again.

Basic idea is very simple, you render depth of potential occluders to special buffer, then test bounding volumes of objects against values in the depth buffer. Rendering of occluders has to be done in software. For the purpose of this experiments I created simple scene with some buildings aka cubes (with subdivision, random number of faces), ground and 40 skinned characters wandering around the ‘city’. Without any kind of occlusion culling, just VFC, in typical situation we have around 200k triangles to render. It’s not a terribly realistic scenario, but should be OK for our experiments.

Figure 1: Original scene

First, we need to render selected occluders. Ideal occluder would be something big, close to the camera and with small amount of faces. Perfect test would be to project it’s bounding box to screen and measure the area, but it could also prove rather expensive, so I opted for simpler solution. Occluder rating in my case is based on distance from the camera and box size. If we don’t apply any crazy transformations, this should give us a rough & nice estimation of final object’s size. I discard objects that appear too small (no sense in wasting time trying to rasterize them). Note: in a real-world scenario, you’ll probably want a flag saying “I’ve been a bad bad occluder” in editor, that can be enabled for fences and such. I sort occluders by their rating (from ‘best’ to ‘worst’) and interrupt rasterization when I detect that big portion of the screen may be occluded already (dirty checks, but they help to avoid peaks when rendering huge triangles and zillion of small ones which are hidden behind them anyway). Important note here - I render original geometry, it may be a good idea to provide simplified versions just for occlusion purposes. I wanted to avoid this for my test, as one of the things I like the most with this approach is that it requires virtually no pre-processing, it just works, out of the box. I did, however, experiment with automatic simplification using proven Stan Melax’ algorithm and it works like a charm. Figure 2 shows example depth buffer with rendered occluders.

Figure 2: Occlusion buffer

Resolution of occlusion buffer determines precision and efficiency. The bigger buffer - the more costly rasterization, but the more precise tests. I went for 256x192 and it worked rather OK for my purposes, DICE uses 256x114. What’s interesting, in my case it usually costed more to transform all the vertices/clip polygons, than to actually rasterize triangles. Just for kicks I optimized the transformation pipeline for SSE and it helped significantly. For rasterization I didn’t use anything fancy, just good old scanline algorithm, with constant delta per triangle (so no divisions and other evil operations per scanline). If I had more time I’d certainly give half-space approach a try, it looks like it could work better on modern computers. One very simple trick I applied to speed rasterization up, was to parallelize it (partially). The only point where two threads rendering different triangles may cause a problem is accessing the occlusion buffer. How can we ensure they’d never access same address? See Figure 3:

Figure 3: Triangle rasterization using 2 threads.

We divide the framebuffer (occlusion buffer in this case) into two halves. All triangles are classified into 3 sets: they can either be completely in the left half, completely in the right half or in both halves partially. Now, we can easily rasterize blue and red triangles in parallel, because there’s no risk they’d need to access same address in framebuffer. After this is done, I render green triangles, which are left. Performance gains vary, but on average it lets us to draw triangles about 30% quicker than in single threaded mode. One could expect that halving the frame buffer in respect to Y coordinate could give better results, because it’d reduce false sharing, but in practice most games use screen width more extensively than height (because of the skybox and such), so it’s not that obvious (in my test, it was actually slower).

Transformation/clipping can be easily parallelized as well. I went easy way and even with display list access guarded with mutexes it still is fair bit quicker than doing this in a single thread.

Results of our occlusion culling in action can be observed at Figures 4 & 5.

Figure 4: Occlusion culling disabled.

Figure 5: Occlusion culling enabled.

Of course, this is a very nice situation, because we’re standing right in front of a huge box, but gains are noticeable in most cases as well, albeit they may not be so significant.

Now, before you rush to code it yourself - a word of caution. One problem that I noticed is when we consider very ‘long’ boxes tested against occlusion buffer with narrow gaps. See Figure 6:

Figure 6: Possible occlusion buffer problem.

Blue boxes are contents of occlusion buffer, red box is the bounding box of tested object. If we only test corners, it’d result in object not being rendered (corners are occluded). In order to avoid this, I test more points on edges connecting ‘nearest to camera’ corners as well (black dots). This solves 90% of situations (at slight performance penalty, you can trade off speed with precision here, by tweaking number of tested points), but there may still be places where objects are rejected even if they shouldn’t (very narrow gaps between occluders). For this I couldn’t find any better way than finding such places and marking those meshes, so they’re not used as occluders. This is the only place in the system where I couldn’t easily make it to work without artist’s intervention.

Let’s wrap up. Advantages of software occlusion culling:

  • very small amount of artist work required. No manual occluder placing needed.

  • predictable/stable/small overhead. DICE simply limits their occluding geomtry to 10k triangles per frame. Determine your limit by yourself, measure the overhead and you won’t go over that (especially if you implement ‘big occluder’ test I described before).

  • easily parallelizable. You can either run whole process in different thread, or parallelize stages like I did here.

  • works in screen space/with model geometry. Most occlusion culling algorithms perform geometry tests on bounding volumes and thus are less exact (conservative, but still).

  • lends to high level optimizations nicely. I didn’t try it, but all the usual tricks that apply to hardware occlusion culling, should work here as well. One thing that I’d definitelly try if occlusion tests would prove costly is Jiri Bittner’s stuff.

  • [forgot about this one orignally] unlike HW occlusion culling (using queries) runs 100% at CPU, so no GPU<->CPU synchronization problems and it can parallelize nicely. Should work great on systems with relatively powerful CPUs and slower GPUs (…)

One disadvantage I can think of right now is the problem mentioned in previous paragraph. I guess there could be more when implemented in a full-blown game situation. It still is worth a try, though, mainly because it’s really simple. Whole system can be easily implemented in a day or two. If it works - great, spend some more days on optimizations and you’re set. I’d say it certainly looks easier than alternatives.

Finally, in a typical fashion of including totally unrelated links, interesting talk I found today: YouTube Scalability by Cuong Do. It doesn’t have anything to do with gamedev, but it’s good to widen your horizons from time to time. I dont know crap about the subject he’s talking about, but it still was pretty interesting. Crazy thing is, they maintained YouTube with like 10 people for a long time.

Old comments

ed 2008-10-26 20:25:37

Wouldn’t be better use software clipping in 2d ? Instead of rendering occluders to buffer, you would just project occluders into 2d space and make a list of 2d convex (or non-convex) polygons. Then you can clip-intersect objects with this list. This would solve fig. 6. Additionally since modern processors have good paralel computing capabilities but still struggle with memory bandwidth; having few more computations can be better than memory accesses (especially that you need to sample more points to avoid fig.6).

Arseny Kapoulkine 2008-10-26 20:29:30

Interesting, I wanted to implement it both for PC and for SPU but did not have the time to try. Perhaps I’ll find some after your post :)
What are the performance numbers?

admin 2008-10-27 07:12:15

ed: interesting idea, didnt think about it. You couldn’t clip/intersect in full 2D, though, as depth is important here as well. Getting the “shape” of occluder may not be trivial, also.
Arseny: I imagine it should work very nicely with SPU especially. About performance numbers, I didnt post them, because the scenario isnt terribly realistic – rendering is rather cheap (just skinning+texture+directional light) and on the other hand there are many occluders. Anyway, normally I render about 150-170k faces at 17ms per frame without occluding and 70-90k at 10ms with occlusion culling (that’s normal situation, nothing extreme, like standing in front of a huge cube etc). Tested on my crappy laptop (GF8600).

Arseny Kapoulkine 2008-11-02 18:44:29

I meant the occlusion performance actually - i.e. for some of that screenshots, how much time does it take to draw how many occluders in occlusion depth buffer and how much time does it take to qualify how many objects (OBBs?)

admin 2008-11-02 19:44:03

I rasterize about 1200 faces per frame in 0.8-1.2ms (boxes consists of random number of small polys, just to increase complexity and make it more realistic). Transformation seems to be a bottleneck and takes ~3.5-4.5ms. Occlusion test of AABBs - up to 1.5ms (I test 8 points). I expect it can be speeded up noticeably, I didnt spend that much time on optimizations. - the blog » Catching up (part 2) 2009-06-08 08:52:05

[…] Sinilo wrote a good post on software occlusion culling. I know of several games (from different companies) that have used software-rasterized occlusion […]

convex graph 2010-03-29 01:46:20

[…] … the slope of my utility function strictly convex. Trust me ladies, with me there is no …Software occlusion culling | .mischief.mayhem.soap.Old occlusion culling method becoming cool again! … Instead of rendering occluders to buffer, you […]

ben 2010-04-24 00:14:49

Where can i find the source code to your ingenious solution?

Fix your game DICE - Page 2 - Electronic Arts UK Community 2012-05-21 16:06:46

[…] GPU Gems - Chapter 29. Efficient Occlusion Culling Hierarchical-Z map based occlusion culling Software occlusion culling | .mischief.mayhem.soap. I could post what Kalm's thinks it is but it won't really matter. They don't consider it a bug or […]