Overview

The goal of this project is to address the perception that Aqsis is too slow to be useful. Many previous attempts have been made to address this, all failing to provide any worthwhile improvement. This project is an umbrella project for all work to improve speed and memory use of Aqsis.

This project will be run in a different way to most development branches. Instead of a single identifiable target, with 1 or 2 developers working on it, this is intended to be a much larger undertaking. The advantages of bringing all optimisation work under a single containing project is that there is less chance of concurrent optimisation efforts cancelling each other out.

Project Goals

The main goal of the project is to improve the speed and memory use of Aqsis during production style renders. While we will no doubt compare with other RenderMan renderers during this work, the primary concern is not to be able to compare well against other renderers, but rather to provide a solution that performs well in production. In addition to this condition, any such cross renderer comparisons must be made in the context of quality, our goal is speed, but not at the cost of quality.

Project Protocols

As this project is open to all developers, a more rigid structure must be in place to ensure that the work that is done is worthwhile. As such, a sub-project authorisation scheme will be used. All sub-projects will be discussed and defined here on this wiki page. Each sub-project will require some empirical evidence of the advantages of the work. When a satisfactory design has been completed, and the data provided to prove the worth, one of the primary developers [list] will authorise the project, and work can begin. The primary developers will watch the SVN repository of this branch closely, and roll back unauthorised work, so please take the time to ensure that the work is authorised through the proper channels before committing. Similarly, the primary developers will monitor closely all authorised commits for quality, and will perform strict code reviews on all commits.

Each sub-project will go through a number of stages, the current status of the sub-project should be indicated as part of the sub-project section. The status names are…

  1. Design
  2. Approval
  3. Implementation
  4. Testing
  5. Validation

Subversion

The development of proposals should occur on the Beaker branch in SVN, however, we have to ensure that the codebase is always in a known state for testing. Due to the nature of some of the proposals, and the time it will take to implement them, there are likely to be circumstances where a developer needs to commit multiple times during the development of a proposal. In this case, it's important that the work is done on a new branch, branched from the Beaker branch, and then merged into the Beaker branch when complete.

During such development, the developer should follow normal SVN practice of keeping their branch synced to ongoing changes in the Beaker branch.

Progress & Results

Sub Project Status Overview

Project Status Notes
Bucket Overlap Caching Implementation 2008/01/30: Implemented replacement occlusion code.
Sample Position Caching Design Concerns have been raised that this proposal will introduce unacceptable artifacts - if these can't be addressed this proposal may have to be scrapped.
ShaderVM Inner Loop Redesign Design Need to demonstrate techniques for reducing shaderVM code duplication.
Removal of IqBound Testing Implemented at #1841
CqMatrix Optimizations Testing Implemented at #1854
Reducing shader loads for triangle rendering Design
Geometric optimizations for quadrics Design Flesh out what to do about DicePoint().

Results Submission Protocol

The owners of the test machines should provide full test results at a predetermined SVN revision on request. The process to produce a full results set is as follows.

  1. Ensure that you have the exact requested revision, and make a clean build.
  2. Ensure that you have the latest/specified revision of the test suite.
  3. Ensure that the RTS will be finding the proper Beaker executable.
  4. Ensure that the pdiff tool can be found.
  5. From within the testing/performance folder, execute the RTS tool ../regression/rendertester.py beaker.
  6. Archive up the resulting html folder and send it to me.

Results Table

Test Content ID
SVN Revision capsules makehuman 250mm 25mm 0mm killeroo smoky petunia Performance Metric
1 khepri
#1810 3m55s 8m53s 3m3s 44s 10s 24s 2m34s 2m57s (1360s)
#1841 3m48s 8m45s 2m50s 43s 10s 23s 2m32s 2m45s 3.24%
#1854 3m44s 8m41s 2m48s 41s 8.8s 22s 2m28s 2m44s 4.65%
2 trinity
#1810 16m19s 23m28s 6m18s 1m22s 21s 53s 4m16s 5m22s (3499s)
#1841 16m7s 23m22s 5m36s 1m12s 21s 52s 4m14s 4m49s 3.03%
#1854 16m8s 23m14s 5m37s 1m12s 21s 52s 4m14s 4m48s 3.23%
3 osiris
#1810 4m20s 13m46s 4m46s 58s 13s 36s 3m14s 3m51s (1904s)
#1841 4m13s 13m48s 4m36s 52s 13s 35s 3m12s 3m35s 2.11%
#1854 4m7s 13m37s 4m29s 53s 13s 35s 3m15s 3m32s 3.31%
4 render0
#1810 4m52s 14m38s 6m33s 1m16s 17s 39s 3m51s 4m39s (2206s)
#1841 4m51s 14m56s 6m5s 1m9s 16s 38s 3m52s 4m23s 1.64%
5 niobe
#1810 2m14s 8m28s 1m54s 30s 6.2s 22s 1m47s 1m51s (1032.2s)
#1841 2m10s 8m24s 1m41s 27s 5.9s 21s 1m47s 1m42s 3.23%
#1854 2m8s 8m20s 1m40s 27s 5.9s 21s 1m47s 1m42s 4.01%
  • Green Text: Progression in render-time
  • Red Text: Regression in render-time
  • P: Result is perceptually different

Milestone Progress

Milestone Revision No. Description Overall Performance Metric
#1841 Removal of IqBound 2.65%
#1854 CqMatrix Optimizations 3.8%
  • Green Text: Progression in render-time
  • Red Text: Regression in render-time
  • Blue Text: Incomplete test results, awaiting final data

Sub Projects

Base Point

Before any meaningful optimisation can take place we need a base point for comparison. To be able to do this, we need two things, a suite of test content and a set of test machine configurations.

The test content suite should be defined to cover a wide range of features, and test performance specifically, so a cube in an empty world isn't going to cut it. We need a small, but broad set of test content, initially I'd suggest around 5 or 6 pieces. Once defined, the content can be used within the context of the regression test suite framework. The speed test content should be stored in a separate folder, and not included as part of the standard regression test configuration. A new configuration file should be provided to run the speed test easily.

The test machine configuration should cover at least Linux, Windows and MacOSX. Again, we don't need a huge number of machines, just a handful on which to test the content. The machines chosen will have to guarantee not to change setup dramatically during the lifetime of the project.

Once these two have been defined, a matrix will be created here in which we can keep track of the performance figures as we progress.

Test Content

ID Title Stressed Features Notes
1 capsules Occlusion, RIB Parsing, Scene Setup This file comes from the jrMan distribution.
2 makehuman Texturing This file comes from the MakeHuman project.
3 250mm Depth of Field Sampling
4 25mm Depth of Field Sampling
5 0mm Depth of Field Sampling
6 killeroo Subdivision Surfaces Model, textures and shaders courtesy of headus
7 smoky Shading
8 petunia Motion Blur Model courtesy of Macouno http://www.alienhelpdesk.com, motion capture courtesy of http://www.mocapdata.com, exported from Blender using Mosaic.

Test Configurations

To gather this information, use the following methods.

  • Windows - Run systeminfo from the command prompt.
  • Linux - Run cat /proc/cpuinfo, cat /proc/meminfo and cat /proc/version from the console.
  • MacOSX - Run system_profiler | grep CPU, system_profiler | grep Memory and system_profiler | grep Version from the console.
Hardware Software
ID Name Owner Model Processor No. Processors Memory OS (version) Compiler (version)
1 Khepri Paul Gregory Dell Precision Workstation 670 x86 Family 15 Model 4 Stepping 1, GenuineIntel ~3192Mhz 2 (HT) 1,022MB Microsoft Windows XP Professional (5.1.2600 Service Pack 2 Build 2600) Microsoft Visual Studio 2005 [Express Edition] (Version 8.0.50727.762 (SP.050727-7600))
2 trinity Leon Tony Atkinson Mac mini PowerPC G4 (1.2), 1.42GHz 1 256MB Mac OS X 10.4.11 (8S165) GCC Version 4.0.0 (Apple Computer, Inc. build 5026)
3 osiris Paul Gregory Dell Precision M50 Mobile Intel(R) Pentium(R) 4 - M CPU 2.00GHz (512KB cache) 1 768MB Linux version 2.6.23.1-49.fc8 gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
4 render0 Paul Gregory ???? AMD Athlon™ MP 1194.720MHz (256KB cache) 2 1GB Linux version 2.6.21-1.3194.fc7 gcc (GCC) 4.1.2 20070502 (Red Hat 4.1.2-12)
5 niobe Leon Tony Atkinson Mac mini Intel Core 2 Duo (1.8), 1.83GHz 1 (Dual Core) 1GB Mac OS X 10.5.1 (9B18) GCC Version 4.0.1

Bucket Overlap Caching

Overview

When rendering a bucket, a filter is applied to the samples to produce the final colour of each pixel. This filter has a user definable width, which means that it may, and indeed usually does, gather sample information from outside the normal bounds of the bucket. That is, if a bucket of size 8×8, at the top left corner of the image is being filtered with a width greater than 1.0, the filter will request sample data from negative screen coordinates in both X and Y, as well as from screen coordinates between 7 and 8.

In order to accommodate this, Aqsis currently artificially extends the bucket size by the filter width divided by 2 in each direction. This means that each bucket is actually sampling data outside of the final pixel values that will be presented to the display device. Worse, each bucket extends in the same way, so neighbouring buckets will re-sample sections of the image that have already been sampled. This is illustrated in the following diagram.

Bucker Overlaps

The image shows a segment of an image covering 3×3 buckets, each bucket at 8×8. The filter width of the image is set to 2×2, meaning that the bucket size being sampled is 9×9, as can be seen by the dotted green outline on the top left bucket. The darker areas of the image show where data is re-sampled, the very dark squares show the worst case scenario, where the samples in that area are actually sampled 4 times, once for each bucket overlapping that area.

The closeup shows the sample points in a particularly costly section of the image. The coloured dots represent the jittered sample points. Green samples are only tested once, yellow are tested twice due to overlap, and red are tested four times due to overlap.

The cost of this can easily be measured, for an image of 320×240 pixels, using a bucket size of 8×8, a filter width of 2×2 and 2×2 samples per pixel. In the ideal case, we'd test 320*240*2*2 = 307200 sample points, that's 2×2 per pixel. However, because of overlap, we end up doing a lot more. The bucket size is artificially expanded to 9×9, and we render (320/8)*(240/8) = 1200 buckets. So that's 1200*9*9*2*2 = 388800 which is 21% more sample tests than are necessary.

Expand this to an example that is more representative of a production scene, 1920×1080 image resolution, 16×16 bucket size, filter width of 4×4 with 8×8 pixel samples per pixel. Again, ideal case is 1920x1080x8x8 = 132710400 samples. However, the bucket is expanded to (16+4/2) = 18 in each direction, and there are (1920/16)*(1080/16) = 8100 buckets. So we sample 8100*18*18*8*8 = 167961600 which is still 21%, but more importantly, a total of 35251200 more samples than necessary. Bearing in mind that this is the number of samples that may be considered, and that each sample is likely to be tested against more than one micro-polygon, and you soon see the cost of this naive expansion method.

Proposal

In order to address this weakness, the proposal presents a different method of ensuring that the required extra samples are available without the cost of re-sampling them. This involves caching the overlap data for previous buckets and making it available to subsequent buckets during filtering. While doing so, the actual rendering and sampling proportions of buckets is adjusted to avoid re-sampling the data that is available in the cache. The diagram below shows how the buckets for the same image section are laid out.

Bucket Overlap Cache Proposal

The red bucket in the top left of the image shows the only bucket that needs to be extended to the same degree as those in the current system. After that, subsequent buckets in the first row and first column are extended only in the Y and X direction respectively, these are shown in yellow in the diagram. Buckets in the interior and on the bottom row or right column are actually sampled at the same size as the unextended bucket size, i.e. 8×8 in the example, these are shown in green.

The red dotted outline shows the size of the bucket that is passed onto the sampler, you can see that the yellow buckets are sampled at 8×9 or 9×8, and the green buckets are sampled at 8×8. However, the darkened areas show the parts of the buckets that are stored in the cache for use during filtering subsequent buckets. For example, the column of samples in the region 7.5,-0.5–>8.5,8.5 is cached from bucket 1,1 and will then be used for filtering by bucket 2,1. Similarly, the row of samples in the region 8.5,15.5–>16.5,16.5 is cached from bucket 2,2 for use during filtering of 2,3.

Implementation Detail

Refactor Sampling Classes

The current bucket code is complicated by a need to expand the bucket size artificially to encapsulate the filtering requirements. I propose we simplify this by instead having the bucket work with it's real proportions throughout. The passing of data to the display will then extract a portion of the final bucket pixels, and the processing would extract the overlap data and cache it for future buckets to use. This way, the complicated code that expands bounds throughout the code to compensate for filter requirements will be eliminated. The checks are reduced to simple containment checks. The main problem with this approach is that the internal (green) buckets will not actually be big enough to contain directly all the pixel data that they need. To address this, I propose that we separate the concept of pixels and samples. At the moment, a CqBucket object contains an array of CqImagePixel objects, which in turn contain an array of SqSampleData data structures. I propose we change this so that a CqBucket directly contains an array of references to SqSampleData structures, and works through most of the sampling stage with just that. The SqSampleData structures are referenced using a reference counting mechanism, allowing them to be moved between buckets as described below, and they will be discarded when all buckets are finished with them, automatically.

After completing the MP sampling, the processing will then combine/flatten all the samples, and then filtering will produce an array of final pixel colors that are presented to the display device. This way, the pixel information isn't required until late in the process, and can be constructed to cover the required area, including the extra that is produces purely from cached data, as can be seen in the diagram above. To illustrate the advantage of this, a list of functions that currently have to modify bounds to account for filtering…

  1. CqImageBuffer::CullSurface
  2. CqImageBuffer::AddMPG
  3. CqImageBuffer::PushMPGForward
  4. CqImageBuffer::PushMPGDown

The last three on that list are per MP operations, so are quite expensive, the first is a per GPrim, so again, isn't cheap. All would benefit from the additional simplification. The CqBucket class is overloaded with additional information to support this hidden expansion.

  • CqBucket
    • m_RealWidth
    • m_RealHeight
    • m_DiscreteShiftX
    • m_DiscreteShiftY

This further complicates the process, simplifying would benefit the clarity of the code too.

While the bucket proportions will be modified to represent the red dotted regions in the second diagram above, the array of SqSampleData structure references will cover the full size of the bucket including the filter overlap. This allows us to use the bucket itself as the cache storage. When a bucket is complete, and the samples have been flattened, the sample references in the overlap area will be inserted into the bucket that they overlap, ready for use during filtering. Because the bucket bound will not include this area, no MP sampling will further affect these samples, as they have already been completed.

Refactor the Sampling/Occlusion Code

The current sampling and occlusion code are combined, making for a complicated and costly sampling operation. In addition, the extra load on the combined tree makes it costly to construct and maintain. I propose a simplification of the system, returning to a scan based approach to MP sampling, and a separate tree for occlusion culling. The proposal involves disconnecting the occlusion hierarchy from the pixels and samples completely. Instead of organising the sample points into a tree structure, the space covered by the bucket, irrespective of the sample layout, is organised into a binary tree. Each sample is then associated with a leaf node in the tree, so that as samples are made, the tree is updated. The following diagram illustrates the principle. The bucket, shown as the grid, is 11×11 pixels, so the simple binary tree doesn't match the pixel boundaries. Note that the diagram only shows part of the tree, each node is split in alternating directions, but only the second child node is then subsequently split, allowing the diagram to show clearly the arrangement of nodes.

It can be seen that the occlusion hierarchy will successfully cover the entire bucket, allowing secure occlusion queries.

There is likely to be cases where terminal nodes in the tree contain more than one sample point, and also where terminal nodes contain none, this is addressed by applying the following logic.

for each node as n
    n.depth = FLT_MAX
endfor

for each samplepoint as s
    s.terminalnode = get_terminal_node(s.position)
endfor
  
while bucket not finished
  while micropolygons waiting
      for each sample hit as s against micropolygon as m
          if m.depth < s.max_opaque_depth
              s.max_opaque_depth = m.depth
          endif
      endfor
  endwhile

  for each terminal node as n
      n.depth = 0
  endfor

  for each samplepoint as s
    if s.max_opaque_depth > s.terminalnode.depth
      s.terminalnode.depth = s.max_opaque_depth
    endif
  endfor
  
  getdepth(node as n is root)
    if terminal(n)
      n.depth
    else
      n.depth = max(getdepth(child1(n)), getdepth(child2(n)))
endwhile

Using this method, the nodes that have no sample points will retain the zero initial value, and nodes that contain more than one sample will store the furthest opaque depth from those samples.

Filtering changes

The filtering code doesn't need significant changes, as the array of SqSampleData references on the bucket will contain references to samples to cover the full region including the filter width expansion. However, a refactor of the filtering code will be beneficial at this point, so the proposal is to use an accumulator approach.

result = 0
Accumulator accum(result)
bucket.applyFilter(accum)

where applyFilter is implemented as something similar to

for (x,y) in filter support
    if bucket contains sample position
        accum.accumulate(x, y, sample(x,y))
    endif
endfor
Sample cache

The sample cache is integrated into the bucket class. By representing the SqSampleData as shared references, the actual samples can be referred to from multiple places, with minimum cost. So when a bucket is complete, the samples in the region that overlaps another bucket are copied into that bucket. When all buckets have finished processing the data in that sample, it will simply be removed automatically thanks to the reference counting system.

By deferring the allocation of the reference array until a sample needs to be inserted into a bucket, the cost of the reference array is reduced to the approximately the buckets of a single row. For example, a 2K (1920×1080) image, rendered with bucket size of 16×16 will have 120 buckets per row. Rendering with 8×8 samples per pixel, and 4×4 filter width results in a reference array of 8x19x8x19, or 23104 samples, over 120 buckets, this is 2772480 samples, with each reference taking 4 bytes on a 32bit architecture this is an overhead of 11089920 bytes, 10.8Mb, or for 64 bit architectures 21.6Mb.

The proposal is to use the boost::intrusive_ptr as the mechanism for the SqSampleData reference counting.

Status

Current Status: Approval


Sample Position Caching

Overview

Currently, sample positions are calculated for each bucket independently. This ensures that the sample positions in adjacent buckets aren't correlated in any way, but such a requirement is probably not necessary. One major problem with this approach is that the sampling and occlusion trees have to be rebuilt, or at least adjusted, each time the sample positions change.

In the current system, it's somewhat costly to recalculate sample positions (the code may be found in CqImagePixel::JitterSamples()). However, the main concern related to the sample positions is not the repositioning of the samples, it's the recalculation of the tree structure, i.e. the splitting of the KD-Tree levels, and calculating the bound and samples contained therein.

Ideally, this reconstruction of the tree would be done once per bucket, but due to the associated cost, we currently use a different system: for each new bucket we jitter the sample positions within each conceptual pixel, making sure that they remain within the pixel. The result is that means that most of the tree structure will not change too much. We take advantage of this by simply readjusting the bounds of each KD-Tree level, and leaving the rest of the structure as it is. This means we may end up with some node overlap, but it's a small thing.

A problem with this approach (as opposed to rebuilding the tree for each bucket) is that we can't use the 'other' dimensions of the sample point - DoF index & shutter time - as splitting criteria in the tree. When the sample points are shuffled (in x and y) the resulting DoF indexes are completely changed for each pixel. This loses us a potentially beneficial optimisation for motion blurred and/or depth blurred scenes.

Proposal

The idea is that we calculate the sample positions once only, so that adjacent buckets use the same set of sample point positions, apart from being displaced by the spatial offset to a corner of the bucket. For a single bucket size this suggests a particularly simple algorithm:

  • At the start of the render, create a set of sample positions and build associated tree structures; save these newly initialized structures.
  • When rendering a bucket, take a copy of the cached samples and trees (only needed to the extent that they'll be modified by the process of rendering the bucket).
  • When finished with the bucket, discard the copy of the associated data structures.

To deal with multiple bucket sizes, we can create samples and sampling trees for a chosen given bucket size on demand, and store the newly initialized trees in a cache for later retrieval.

The outcome of applying the above changes will be that any data structure which depends only on the sample positions will be initialized once, and once only. If we do this there will be very little bucket setup cost, particularly if all trees are stored in arrays.

A simple way to get an initial feel for how this will perform may be to disable the jittering of samples, except when rendering the first bucket.

Status

Current Status: Design. Concerns have been raised that this proposal will introduce unacceptable artifacts - if these can't be addressed this proposal may have to be scrapped.


ShaderVM Inner Loop Redesign

Overview

There's a number of issues which seem non-optimal about how inner loops are currently implemented in the shaderVM. As an example, here's the code for the sin shadeop:

void CqShaderExecEnv::SO_sin( IqShaderData* a, IqShaderData* Result, IqShader* pShader )
{
    bool __fVarying;
    TqUint __iGrid;
 
    __fVarying=(a)->Class()==class_varying;
    __fVarying=(Result)->Class()==class_varying||__fVarying;
 
    __iGrid = 0;
    CqBitVector& RS = RunningState();
    do
    {
        if(!__fVarying || RS.Value( __iGrid ) )
        {
            TqFloat _aq_a;
            (a)->GetFloat(_aq_a,__iGrid);
            (Result)->SetFloat(static_cast<TqFloat>( sin( _aq_a ) ),__iGrid);
        }
    }
    while( ( ++__iGrid < shadingPointCount() ) && __fVarying);
}

Two issues stand out in this code:

  • The conditional inside the inner loop, if(!fVarying || RS.Value( iGrid )) is unnecessary - it can be avoided using indexing vectors.
  • GetFloat and SetFloat are virtual: even though they have trivial implementations, they're not inline-able by the compiler.

Both these issues appear to exist in every shadop function defined inside aqsis/shadercompiler/shaderexecenv. The second issue is not a problem for shadeops defined inside aqsis/shadercompiler/shadervm.

Proposal

There are two major aspects to this proposal.

Firstly, IqShaderData interface needs a major redesign to make it usable without virtual function overhead. In particular, all functions Get* and Set* which access a particular element of the shader data need to be removed. Instead, a lightweight vector class holding the actual data should be accessible; this vector class should be a concrete type, allowing full inlining inside any looping constructs which involve it. In addition to removing the per-element access functions, the interface redesign should address the concern that IqShaderData is a fat interface with the resulting lack of a proper kind-of relationship to the child classes.

Secondly, the machinery dealing with inner shader loops needs a careful redesign. It is possible to construct a representation of the shader running state which is considerably more efficient than that which is currently used. This representation may be based on arrays which index the currently running elements, rather than a boolean on/off for each element. This allows inner SIMD shader loops to run with many fewer conditionals.

Implementation Detail

Removing virtual function calls - interface redesign

The functions GetFloat and SetFloat are part of the IqShaderData interface, and have trivial implementations. However, since these are virtual and are accessed through the base class, they can't be inlined by the compiler. Having this behaviour in inner loops is generally a bad thing.

It's possible to remove this problem by dealing directly with pointers to the underlying arrays using GetFloatPtr etc, outside the loop. However, this is ignoring a deeper flaw in the underlying design of the IqShaderData interface.

IqShaderData is an example of a “fat” interface; it's the union of all methods for the child classes. Worse, most of these methods don't make sense in the context of any given child implementation class. In this case we say that the children of IqShaderData (CqShaderVariableVaryingFloat for instance) are not really a “kind-of” the base class. For more details related to the dilemma which leads to fat interfaces see http://www.parashift.com/c++-faq-lite/proper-inheritance.html.

The fix for the situations is simple: make sure the IqShaderData interface contains only those methods which make sense for all child classes. Use casts when the child-specific functionality is required (one example where casting is the best of a bad set of choices). In this case, dynamic_cast is appropriate, but for performance reasons it's probably preferable to design a restricted version via a specialized cast() member function or functions, rather than use the general RTTI mechanism. One way to do this may be found in a post by Walter Bright in the comp.lang.c++.moderated thread “RTTI Performance”. There are other ways to do the same thing, which may be necessary because the code will be heavily templatized.

Removing extra conditionals - shader running indices

Some preliminary shaderVM optimization was started a while ago in the runningstate_refactor branch in svn. The idea is to improve the way that we deal with shader running states: store an indexing vector to the currently running SIMD elements rather than an on/off bit for each one. There are two possible approaches to this. However, both can be encapsulated by an iterator interface, so the resulting shaderVM code will appear identical in either case. Explicit details are presented below, followed by a brief discussion detailing the iterator interface.

Running state indexing

The first approach is the simplest: for a running state bit vector, RS, the corresponding indexing vector contains only those indices, i for which RS[i] evaluates to true. This is the simplest way to eliminate the conditional from the inner loop.

The following figure shows a 6×6 grid which is partially running - the blue parts indicate running elements, while the white parts indicate stopped elements. 6x6 grid, partially running

The running state indexing vector for this grid reads

rsIndex = {6,7,8,9,12,13,14,15,18,19,20,21,24,25,26,27}.

Using this approach, the inner loop reduces to something like

const CqIndexVector& runningIndex = runningState();
for(TqInt i = 0; i < runningIndex.size(); ++i)
{
    TqInt currIndex = runningIndex[i];
    TqFloat _aq_a;
    a->GetFloat(_aq_a, currIndex);
    Result->SetFloat(static_cast<TqFloat>(sin(_aq_a)), currIndex);
}

Data from the runningstate_refactor branch has shown more than 10% speedup for a sky shader mapped onto a patch, which is very much a shader-dominated scene. This included a lot of the old running state code still in place, so it should be possible to do rather better - perhaps 20% speedup.

Running state ranges

There is an alternative to the direct indexing vector formulation given above which might be desirable in our case. Motivated by the guess that in many situations a grid contains mainly running or stopped elements, we see that the indexing approach above wastes both memory and computation time, since it has to store and retrieve an index for every single running state. It might be better to instead store a set of “start” and “stop” indices which indicate running state ranges.

In the “running state range” approach, we have a rsStart and rsStop arrays. Elements in the start array indicate the start of a running sub-range, while corresponding elements in the stop array indicates the ends of a sub-range. For the diagram shown above, the start and stop arrays look like:

rsStart = [6,12,18,24]
rsStop = [10,16,22,28]

With this alternative approach we are likely to get the best possible performance, at the small price of extra complexity in the shadop inner loops. The inner loop implementation for this case looks like:

const std::vector<int>& rsStart = runningState().startVec();
const std::vector<int>& rsStop = runningState().stopVec();
for(TqInt i = 0; i < rsStart.size(); ++i)
{
    for(TqInt j = rsStart[i]; j < rsStop[i]; ++j)
    {
        TqFloat _aq_a;
        a->GetFloat(_aq_a, j);
        Result->SetFloat(static_cast<TqFloat>(sin(_aq_a)), j);
    }
}

Iterator interface It should be possible to wrap either indexing scheme into an efficient iterator; the resulting code is independent of the scheme chosen:

const CqIndexVector& runningIndices = runningState();
for(CqIndexVector::const_iterator i = runningIndices.begin(), end =
        runningIndices.end(); i != end; ++i)
{
    TqInt currIndex = *i;
    TqFloat _aq_a;
    a->GetFloat(_aq_a, currIndex);
    Result->SetFloat(static_cast<TqFloat>(sin(_aq_a)), currIndex);
}
Reducing shader VM code duplication

The shader VM currently contains a very large amount of duplicated code. To eliminate both of the issues which have been discussed above in a way which isn't extremely manual and error-prone, we need to reduce the amount of code duplication. Apart from making the refactorings possible, this will make the code much more maintainable in the future.

This is almost certainly possible by considered use of templates and inline functions - further investigation is required.

Status

Current Status: Design. Need to demonstrate techniques for reducing shaderVM code duplication.


Removal of IqBound

Overview

IqBound is an unnecessary abstract base class wrapper around the single concrete child implementation, CqBound. IqBound is trivially removable, resulting in fully inlinable member functions, and any other optimizations which go with concrete classes.

Implementation

The design/implementation is trivial: remove the file ibound.h, replacing CqBound with IqBound throughout the core renderer.

This optimization is without any foreseeable downsides: it removes code, while leaving all necessary abstractions intact - in fact, it's even a good step for improving the general code quality. Triviality of implementation is also very attractive :-D

Preliminary Results

Running the speed test suite on a trial implementation shows speedups of about 1 to 2% in all examples apart from the makehuman one which showed a slowdown of 1%.

This is only a small speedup and possibly within the noise of the measurements which makes it difficult to measure. To be certain, capsules.rib was rendered multiple times by hand. A consistent speedup of about 1% was observed, with variation of about 0.2% between runs.

Status

Current Status: Testing

Summary

This change was approved during the dev meeting, 20080106. The necessary changes were implemented in the beaker branch as svn revision 1841.


CqMatrix optimizations

Proposal

Any speedups on basic types like matrices or vectors should result in general speedups across the board. There are a few minor ways in which CqMatrix can be made faster, including inlining and minor optimizations.

Implementation

Most of the functions of CqMatrix are not inlined, rather they're defined in the source file. Instead of this, all but the most complex CqMatrix functions should probably be taken into the header to allow inlining.

The function Vector3D CqMatrix::operator*(const CqMatrix3D&) const is used very extensively as indicated by a valgrind profile of capsules.rib, which shows 2.7% of the aqsis runtime dedicated to it. There is a small optimization which can be made to operator* to avoid unneeded floating point division by the parameter h.

Preliminary Results

Initial testing of capsules.rib with fully inlined functions shows a 1.1% speedup. An additional ~0.3% can be obtained by optimizing operator* for a total of approximately 1.4%.

Status

Current Status: Testing

Summary

This proposal was accepted for implementation at the 13th Jan developer meeting and implemented in the beaker branch as revision 1854.


Reducing shader loads for triangle rendering

Overview

Triangle rendering in aqsis works by turning the triangle into a quadrilateral by creating a ghost vertex. Ghost vertices propagate during splitting, such that the resulting grids may contain some “ghost micropolygons” which are eventually culled.

Currently, shaders run over the entire grids which result from this procedure, including the ghost micropolygons - this inefficiency can be avoided.

Proposal

Since some of the micropolygons are ghosts and will eventually be culled, shading never needs to be performed over this part of the grid. This proposal involves setting the running state of the newly created grid to false over the area containing ghost micropolygons. This would effectively turn off the shader in those regions.

Shading performance gains of up to 50% are expected with this approach, for dense meshes in which a single triangle corresponds to a single grid.

Status

Current Status: Design


Geometric optimizations for quadrics

Overview

Profiling of aqsis using valgrind shows that when rendering a large number of quadrics - such as in capsules.rib - the trig functions std::sin() and std::cos() are hit quite hard: together they make up 5.5% of the runtime.

Examining the code shows that the dicing and grid size estimation code is far from optimal, involving calling the virtual function DicePoint() inside inner loops. Further research also showed that there are various efficient ways to calculate values for sin and cos, particularly on a regular grid. These can easily be used inside the dicing and grid size estimation loops.

Implementation

Dicing

Currently, our code evaluates grids over quadric surfaces by directly calling std::sin() and std::cos() inside the inner loop, also involving a call to the virtual function DicePoint().

However, it's possible to evaluate the sequence

c_n = cos(a + b*n),
s_n = sin(a + b*n)

for integers n = 0 to N-1 for very low computational cost: We only need to calculate cos(a), cos(b), sin(a), sin(b) once, plus four multiplies and two additions for each n. In particular, using the two angle formulas, (or considering a 2D rotation matrix), we may derive:

c_n = cos(b) * c_{n-1} - sin(b) * s_{n-1}
s_n = sin(b) * c_{n-1} + cos(b) * s_{n-1}

where we initialize c_0 = cos(a), s_0 = sin(a).

This procedure can easily be utilized within the dicing loop to make the computations *much* more efficient: test code suggests a speedup of at least 20x over the naive algorithm. However, the current architecture relies on DicePoint() which will need to be removed.

Grid size estimation

The current method of estimating the quadric grid size coarsely dices the primitive at a constant resolution of 8×8, and looks for the largest resulting micropoly side lengths. This involves the same inefficient use of sin() and cos() which is used in the dicing code. A simple way to speed up the grid size estimation is to use the efficient iterative calculation outlined above.

In addition, it should be simple to modify the fixed 8×8 resolution to reduce it in many cases, particularly when the quadric has already been split into a relatively small part of the original.

Removal of DicePoint()

DicePoint() needs to be removed to effectively implement the iterative scheme above. A strategy invovling templates and a quadric-dependent function for each will probably be necessary.

Estimated value

In any scene like capsules.rib which uses quadrics extensively, elimination of nearly all sin() and cos() calls should result in a speedup of at least 5%, according to the valgrind profile. Additional speedups should be realized if we put extra effort into improving the grid size estimation.

Status

Current Status: Design. Flesh out what to do about DicePoint().



Personal Tools