This Week In Veloren 178
This week, we have a great writeup from @Sharp about optimizations being made to the site2 system.
- AngelOnFira, TWiV Editor
Thanks to this week's contributors, @juliancoffee, @xMAC94x, @runrobdog, @BrandonDyer, @imbris, and @Sam!
Site2 optimizations by @Sharp
I've been working on speeding up site2 rendering time by roughly an order of magnitude across the board. Site2 greatly simplifies things for developers by giving them normal geometric primitives to work with (rather than forcing them to write everything in a sampling-based way), and allowing direct draws of primitives to occur in a specific order with different fills.
(Un)fortunately, site2's power and ease of use has led to a lot of rather expensive patterns and primitives being used to emulate things like archways, hollowed-out rooms, or sprites on the ground. This, combined with the actual rendering still being done in a mostly sampling-based way (albeit with a bit of cleverness around bounding boxes), has resulted in large regressions in overall render time for chunks, putting us far out of reach of our end goal if we want to support higher view distances or larger numbers of players.
Chunk generation times are highly variable and depend on many factors (player count, player distribution, and view distance) that we don't have much control over, and as maps get bigger and servers get fuller, these factors are likely to grow.
While it's true that chunks can mostly render in parallel, many players play single player or use servers with only a handful of cores; on such systems, driving generation into the background isn't as helpful as it seems. Even on beefy servers, any CPU cycles spent on chunk generation are not being spent on more useful work (plus, it's lame to restrict players to such a small view distance!).
Even if most cores are ideal, many systems (and players themselves!) can't run or perceive anything until a chunk is loaded, so slow chunk generation time has a disproportionate impact on perceived latency. This is especially true in the context of very fast transport systems, such as high speed trains, which require loading new chunks very rapidly in order to keep up with the player.
We've recognized that this is a problem for a while, so @aweinstock had been working on a pretty interesting idea--JIT compiling expensive nodes in the site2 render tree, then evaluating it! While he got impressive results from doing this, even taking the JIT compilation overhead into account, the total time still seemed much too high to me. Since even if you have a JIT compiler, having a faster interpreter is still always beneficial (since you can compile in the background, or use the interpreter for all but the most expensive nodes), I decided to see what improvements we could get by improving the interpreter instead.
Fortunately, because site2 is so high-level, there is a lot of room to optimize the interpreter! I quickly realized that there was a lot of low-hanging fruit in that department, and the improvements quickly grew to the point where the interpreter was finishing well before JIT compilation had a chance to finish. Some techniques applied so far include:
- Added UnionAll and IntersectAll primitives, which reduce the amount of time spent recursing through the tree and result in easier and more frequent chances for optimization.
- Added a proper cost model to decide join ordering on unions, intersections, and differences.
- Reducing the number and capabilities of primitives, allowing us to focus more on optimizing the remaining ones (for example, we have removed arbitrary fills, and now insist that all lazy primitives be convex).
- Changing from a sampling-based approach that walks the tree per-voxel, to one that walks the tree to find bounding volumes; we now maintain a context with a mask, transformation matrix, and stack of partially applied intersections, which allows us to delay evaluation efficiently.
- Pushing down unions and intersections as far as possible, only dropping to point-based approaches when we hit a primitive (so we iterate over small, mostly-disjoint AABBs).
- Some tentative initial applications of explicit SIMD (to parts of the segment prism algorithm). Over time, we will probably convert the entire render loop to SIMD, treating the mechanics of rendering more like we would a GPU.
- A lot of heroic optimizations for difficult primitives and special cases (for example, superquadrics are about 4x as fast as they were with a more naive approach, due to exploiting symmetry, inline caching, and the fact that they have explicit solutions in restricted cases).
- Avoiding per-voxel dynamic dispatch on fills (which turns out to have a huge impact in some cases, e.g. cliff towns).
- Restricting differences statically to ensure that the domain is strict (inspired by polarized type theory).
These, plus a number of other miscellaneous optimizations, have resulted in very significant wins! Caveat: some features still need to be enabled, and segments are broken in ways that may affect benchmarks, so some of these may regress a bit--but they are very impressive...
- Gnarling fortresses are about 15x faster than they were on master, and render in about 2.1 ms / chunk. While they are segment-heavy and therefore might regress, they also have the most complex geometry by far of any site2 to date, and don't use any custom fills, so they have much more room for improvement by optimizing geometry than any of the other sites do. Therefore, I'm fairly confident that we will end up at least 15x faster after my changes are complete.
- Dungeons are about 7x faster than they were on master, and render in about 3.3 ms / chunk (they lag other results because they are almost entirely bottlenecked on their complex custom fill, which I have not spent much time optimizing because dungeons are due for a rewrite anyway). This improvement also seems to have some impact on server startup time!
- Cliff towns are 7x faster than they were after some other fairly significant performance improvements, so they are likely at least 10x faster than master (I'd guess a lot more, since they heavily use large intersected superquadrics).
- Cities use a lot of primitives that haven't been optimized, so it's hard to say how they would compare to master (we didn't have an old benchmark for them), but they render in under 2ms / chunk.
- Giant trees were not benchmarked on master, but they now actually render at less than 125 μs per chunk (which was my original super stretch goal time for chunk rendering--I believe this is the first time we have hit that with a realistic structure!). Avoiding dynamic dispatch on fills, alone, made giant trees over 4x faster than they were after other fairly significant performance improvements compared to master, so it is likely that they also saw an order of magnitude speedup overall. However, they are very segment-heavy, so it's difficult to say where they'll end up after segments are fixed.
Finally, to validate that the existing optimizations are indeed useful, I ported citadel code from @zesterer's branch (which I'd never looked at before), fixed it to be compatible with my current code (which required only a handful of changes, since it was already correctly polarized), and then benchmarked it.
As it turns out, it was very superquadric, union, and intersection-with-AABB heavy, so it benefited from many of the optimizations mentioned above! Citadels now render in only about 650 μs/chunk--making them more efficient than any site except giant trees. I am extremely confident that this represents more than a 10x improvement over the old branch (I included some pictures of it to break up the text!).
Going forward, there are still several more optimizations to implement before I submit my changes (this is by no means the end of possible optimizations--I have many more ideas!--but it's a good stopping point assuming that we can maintain our order of magnitude improvement after segments are fixed and differences are added back):
- Draw immediately when fill is called, rather than allocating into a primitive array first and then dispatching later.
- Add efficient internal algorithms for rendering the remaining primitives, in both the intersection and masked cases (and consolidate any we no longer need).
- Make sharing of primitives explicit and visible to the user, so we avoid recomputing the same primitive too many times (at least in some cases) and instead cache enough information to efficiently re-render the volume. This should be especially useful for hard "problem" cases, such as the various hacks we use to ensure sprites in caves lie on the floor.
- Reimplement differences, using the fact that we have an explicit primitive to efficiently mask it out by inverting it.
- Also use the above technique when distributing intersections over unions (or at least cache some information, such as data allocated in the arena, to avoid exponential explosions in memory use).
As well as extra cleanup work:
- Finish up filling in cost model details that have been omitted so far (since it turned out that improving render performance was far more important than getting the cost model correct given where we currently are).
- Finish up work needed to make sure all values are validated on construction and that we don't have more than a handful of ways of directly constructing the primitive.
- Finish up making sure that correctly polarized combinations of unions and intersections all actually work (this part is quite unfinished, since validating the performance improvements only required dealing with the patterns that we actually use so far).
- Do a better job of avoiding recursion in common cases (this one may be hard).
- Figure out a proper structure for the various kinds of heroic optimizations we have, categorize them, and put them in their own files, so we can focus on them individually rather than having gigantic, messy functions. Keep the core loop of the interpreter small.
Overall, I am really excited about these (and future!) improvements, especially in tandem with improvements elsewhere (such as the preliminary design of a site2 visual editor, hierarchical plots, and discussions about how better to handle common cases like putting sprites on the floor and walls). Hopefully, these improvements are able to make it into the game in some capacity within the next month or two!
I seem to have gotten another order of magnitude performance win with the first future work suggestion in the post ("draw immediately").
New numbers (faster across the board, generally much faster, with roughly equivalent geometry). The "render" is the relevant one here, the
generate_ are testing something that only happens at server startup (AFAIK). Every site type tested now renders in under 1ms / chunk, and many of them run at around double that speed (500 microseconds / chunk). Giant trees, the surprise champions from earlier, render in a ridiculous 40 microseconds / chunk! These numbers will regress with proper segments, but not by nearly enough to lose our order of magnitude improvements.
Below are some of the results of the benchmarks, but you can see the rest of them here.
generate_city time: [19.520 ms 19.538 ms 19.559 ms] change: [-0.3957% -0.1977% +0.0013%] (p = 0.05 > 0.05) No change in performance detected. Found 9 outliers among 100 measurements (9.00%) 1 (1.00%) low severe 2 (2.00%) low mild 2 (2.00%) high mild 4 (4.00%) high severe render_city/identity_allocator time: [3.4023 ms 3.4114 ms 3.4207 ms] thrpt: [1.7540 Kelem/s 1.7588 Kelem/s 1.7635 Kelem/s] change: time: [-2.3020% -1.9525% -1.6172%] (p = 0.00 < 0.05) thrpt: [+1.6438% +1.9913% +2.3562%] Performance has improved. generate_cliff_town time: [6.9444 ms 6.9518 ms 6.9609 ms] change: [-0.2329% -0.0320% +0.1523%] (p = 0.75 > 0.05) No change in performance detected. Found 13 outliers among 100 measurements (13.00%) 6 (6.00%) high mild 7 (7.00%) high severe render_cliff_town/identity_allocator time: [2.0587 ms 2.0692 ms 2.0800 ms] thrpt: [1.9230 Kelem/s 1.9331 Kelem/s 1.9430 Kelem/s] change: time: [-4.5641% -3.8519% -3.1480%] (p = 0.00 < 0.05) thrpt: [+3.2503% +4.0062% +4.7823%] Performance has improved. generate_dungeon time: [4.2436 ms 4.2503 ms 4.2572 ms] change: [+0.3002% +0.5213% +0.7344%] (p = 0.00 < 0.05) Change within noise threshold. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) high mild 1 (1.00%) high severe
I just got a 5x improvement to forested chunks by (basically) rendering them with site2 after my changes, even with the old segments algorithm! Since tree chunks are very common they tend to in practice be a huge bottleneck, so this is a very welcome development. I still need to add back a bit of the old functionality and figure out the differences between old and new tree rendering to make this fully apples to apples, but I think that improvement will remain 🙂
That's all of the updates on site2 performance, but @Sharp will be back in later posts with more updates. Stay tuned!