Abstract
With the advent of multi-core computers even for basic desktops, parallelizing an application which needs huge amounts of CPU power can bring big performance benefits. Using threads in applications to listen for incoming data (e.g. via sockets) has the benefit of not blocking the rest of the application and of not having to poll in a loop; but with N CPUs/cores, the critical parts of a CPU-bound application can run up to N times faster (taking a fraction 1/N of the time of the single-threaded implementation), provided that they are highly parallelizable, of course.
One of the big problems of using threads is that their implementation differs between platforms. However, this can be circumvented with the use of Boost.Threads (or maybe alternatives such as OpenThreads).
This project aims to implement core parts (especially the rendering stages) in a multithreaded way, so the application can benefit from powerful hardware in a platform-independent and reliable way.
Initial Plan
- Weeks 1-2: Learning and Specification. Get acquainted with Aqsis development, choose a thread package, study the parts which can be made multi-threaded (measuring when necessary) and get a Requirements document.
- Week 3: Design. Specify which parts would need to be modified, and how.
- Weeks 4-8: Implementation. Implement those changes.
- Week 9: Testing. Test that everything works fine and fix possible problems.
- Week 10: Measuring and Tuning. Measure the performance gains compared with the original implementation, and do the necessary tuning to optimize the performance.
- Weeks 11-12: Prepare Documentation and Final Bits. Prepare the necessary documentation and a summary of the work accomplished, and leave some time for unfinished bits.
Requirements
This is an informal [mini] document listing the Requirements for the job. It's a bit more technical than the usual Requirements Document for a software project, since it contains the opinion of developers about which libraries to use and so on. It contains a rough idea of what should be implemented.
About the library to use:
- Introducing new dependencies for the project is not desired, since this tends to mess things up and makes it more difficult to get the software compiled, to build packages for GNU/Linux distributions, etc.
- Boost.Threads seems to be the most standard thread library around, and Aqsis is already using Boost libraries (so we avoid a new dependency), so this is the way to go unless proven not to work as needed.
- If possible, wrapper classes should be created in order to encapsulate the library used underneath, so that if the library is changed later the impact is limited to this wrapper (hopefully). This would also help to simplify things and expose only what we need as an interface in the wrapper, instead of spreading the whole functionality of the thread library throughout the code.
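The wrapper idea from the last point could look roughly like the following. This is only a sketch: the class name `Thread` and its interface are hypothetical, and `std::thread` is used as the underlying library so that the example is self-contained, whereas the project would wrap Boost.Threads.

```cpp
#include <functional>
#include <thread>
#include <utility>

// Hypothetical wrapper: the rest of the code base only sees this interface,
// so swapping the underlying thread library only touches this class.
class Thread {
public:
    // Starts running `job` in a new thread immediately.
    explicit Thread(std::function<void()> job) : m_impl(std::move(job)) {}

    // Joins on destruction so a forgotten join() does not terminate the program.
    ~Thread() { join(); }

    Thread(const Thread&) = delete;
    Thread& operator=(const Thread&) = delete;

    // Waits for the job to finish; safe to call more than once.
    void join() {
        if (m_impl.joinable())
            m_impl.join();
    }

private:
    std::thread m_impl; // assumption: std::thread stands in for Boost.Threads
};
```

Only the small part of the thread API that is actually needed (start and join) is exposed, which is the simplification the requirement asks for.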
About the parts to be made parallel, measuring each part:
- Initially RenderImage() in imagebuffer.cpp: the renderer spends almost 100% of the time there, as the output of this modified binary shows:
$ time aqsis -progress "vase.rib" -res 512 394
RenderImage: 1 seconds, 800 calls to RenderSurface
RenderImage: 1 seconds, 800 calls to RenderSurface
RenderImage: 13 seconds, 800 calls to RenderSurface
real 0m15.220s
user 0m14.269s
sys 0m0.072s
- Measuring RenderSurfaces(), which is “inside” RenderImage(), it uses about 80% of the time:
$ time aqsis vase.rib -res 512 394
Timer: 0 sec and 530416 usec
Timer: 0 sec and 527628 usec
Timer: 9 sec and 962242 usec
real 0m14.345s
user 0m13.805s
sys 0m0.044s
- Maybe look at the Display part later, as suggested by tcolgate. It really seems to spend very little time in this line (“inside” RenderSurfaces()):
QGetRenderContext()->pDDmanager()->DisplayBucket( &CurrentBucket() );
These are the measurements:
$ time aqsis vase.rib -res 512 394
Timer: 0 sec and 12829 usec
Timer: 0 sec and 14945 usec
Timer: 0 sec and 70676 usec
real 0m14.642s
user 0m13.529s
sys 0m0.072s
- The first RenderMPGs() call “inside” RenderSurfaces() represents 21% of the total time:
$ time aqsis vase.rib -res 512 394
Timer: 0 sec and 214225 usec
Timer: 0 sec and 195586 usec
Timer: 2 sec and 524435 usec
real 0m14.621s
user 0m13.573s
sys 0m0.068s
- Measuring the second RenderMPGs(), 10% of the total time:
$ time aqsis vase.rib -res 512 394
Timer: 0 sec and 79738 usec
Timer: 0 sec and 121420 usec
Timer: 1 sec and 258639 usec
real 0m18.084s
user 0m13.905s
sys 0m0.080s
- Measuring the third RenderMPGs() yields negligible values:
$ time aqsis vase.rib -res 512 394
Timer: 0 sec and 242 usec
Timer: 0 sec and 249 usec
Timer: 0 sec and 1248 usec
real 0m14.346s
user 0m13.861s
sys 0m0.040s
- So the combined RenderMPGs() calls represent 31% of the total execution time for our reference file “vase.rib”. The non-RenderMPGs() part of RenderSurfaces() then represents 80% - 31% ~= 50% of the total execution time.
- The Filter part inside RenderSurfaces() takes a negligible amount of time:
$ time aqsis vase.rib -res 512 394
Timer: 0 sec and 198 usec
Timer: 0 sec and 217 usec
Timer: 0 sec and 34042 usec
real 0m13.941s
user 0m13.453s
sys 0m0.056s
- The Filter part inside RenderSurfaces() takes another good share of time (21%):
$ time aqsis vase.rib -res 512 394
Timer: 0 sec and 53656 usec
Timer: 0 sec and 54180 usec
Timer: 2 sec and 851902 usec
real 0m15.058s
user 0m13.885s
sys 0m0.064s
- The Combine part inside RenderSurfaces() takes about 2.6%:
$ time aqsis vase.rib -res 512 394
Timer: 0 sec and 61014 usec
Timer: 0 sec and 66017 usec
Timer: 0 sec and 235253 usec
real 0m13.962s
user 0m13.477s
sys 0m0.068s
- The Dice part (pGrid = pSurface->Dice()) inside RenderSurfaces(), 2%:
$ time aqsis vase.rib -res 512 394
Timer: 0 sec and 45516 usec
Timer: 0 sec and 43385 usec
Timer: 0 sec and 201715 usec
real 0m14.441s
user 0m13.873s
sys 0m0.064s
- The Shade part (pGrid->Shade()) inside RenderSurfaces(), 4.6%:
$ time aqsis vase.rib -res 512 394
Timer: 0 sec and 29643 usec
Timer: 0 sec and 30208 usec
Timer: 3 sec and 541939 usec
real 0m14.393s
user 0m13.893s
sys 0m0.052s
- The Shade part inside RenderSurfaces(), 4.5%:
$ time aqsis vase.rib -res 512 394
Timer: 0 sec and 12735 usec
Timer: 0 sec and 13184 usec
Timer: 0 sec and 60838 usec
real 0m14.529s
user 0m13.953s
sys 0m0.052s
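The page does not say how the percentages in the list above were derived. Several of them (RenderSurfaces() at about 80%, the first RenderMPGs() at 21%, the second at 10%, Filter at 21%, Combine at 2.6%, Dice at 2%) are reproduced by dividing the sum of the three Timer lines by the "user" time reported by `time`, so the sketch below assumes that formula. The names `TimerLine` and `shareOfUserTime` are made up for the illustration.

```cpp
#include <vector>

// One "Timer: S sec and U usec" line printed by the instrumented binary.
struct TimerLine {
    long sec;
    long usec;
};

// Percentage of the run spent in the measured section, ASSUMING the share is
// (sum of all Timer lines) / (user time reported by `time`).
double shareOfUserTime(const std::vector<TimerLine>& lines, double userSeconds) {
    double measured = 0.0;
    for (const TimerLine& line : lines)
        measured += line.sec + line.usec / 1e6;
    return 100.0 * measured / userSeconds;
}
```

For the RenderSurfaces() measurement, `shareOfUserTime({{0, 530416}, {0, 527628}, {9, 962242}}, 13.805)` gives a value just under 80, in line with the "about 80%" quoted above.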
Measurements
Most of the measurements were made with the following code:
- Global variable:
#include <sys/time.h>
static struct timeval timeCounter;
- Initializing the counter and printing the result at some place that encloses all the code to be measured:
timeCounter.tv_sec = 0L;
timeCounter.tv_usec = 0L;
[...]
printf("Timer: %li sec and %li usec\n", timeCounter.tv_sec, timeCounter.tv_usec);
- Wrapping the specific part to be measured with this code:
// before
struct timeval base;
struct timeval interval;
gettimeofday(&base, 0);
[...]
// after
gettimeofday(&interval, 0);
// Elapsed microseconds. The whole-second difference is folded in, so
// intervals spanning more than one second are counted correctly.
long diff_usec = (interval.tv_sec - base.tv_sec) * 1000000L
    + (interval.tv_usec - base.tv_usec);
timeCounter.tv_usec += diff_usec;
// Normalize the accumulator: carry whole seconds into tv_sec.
while (timeCounter.tv_usec >= 1000000) {
    timeCounter.tv_usec -= 1000000;
    timeCounter.tv_sec += 1;
}
Design
In order to process things in parallel (in particular Buckets, as tcolgate and pgregory suggested), we have to avoid global calls that modify the internal state of the buckets (e.g. RenderMPGs() clearing MPGs of the buckets, thus making the later checks of the new Bucket.IsEmpty() wrong).
I think that the only feasible solution in order to achieve the parallelization that's asked for is to convert all the processing inside the scope of RenderImage() into internal calculations of the buckets (the most obvious solution is to move all this into the Bucket class). That is, RenderImage() would divide the image into Buckets, and then it would invoke a method on them to do all the processing. Of course, inside the Bucket class there should only be external calls that obtain information from the renderer, not calls that modify things.
This approach would then require changing the very core of the rendering process, including RenderImage(), RenderSurfaces(), RenderMPGs(), RenderMicroPoly() and everything they contain and invoke, as well as making sure that all these and the other internal parts of the CqBucket class that are already being used in the process don't access shared resources. It also includes removing most of the static parts of the Bucket class, and preparing it to have several instances living at the same time.
So this is a tricky job that will change fundamental parts of Aqsis.
Leading on from this, I would suggest the following approach…
There will be three main objects/classes in use during this part of processing.
1. An object representing the bucket; one of these will be allocated for every bucket at the start of rendering. Consider an HD image at 1920×1080: at a bucket size of 16×16, that's about 8100 buckets.
2. An object representing the processing data for a bucket that is being rendered. This mainly includes the sample buffer, with sample positions, time and DoF data, and the sample arrays resulting from actually doing the MP sampling.
3. An object representing the processor. This is the object that ties a bucket to its processing data, and calls the processing functions on the bucket. It will either be a single object that processes the 'current' bucket, or a thread-based object (that overrides operator() for use with Boost.Threads), one for each thread.
In addition to these main classes, we need a scheduler for determining which bucket is next. This functionality is currently handled by the NextBucket function on CqImageBuffer; I see no reason for moving it at the moment.
The tasks to achieve this system include:
1. Modify CqBucket to contain only the storage containers for primitives, grids and micropolygons:
- m_ampgWaiting (should be renamed to m_micropolygons to match new naming guidelines)
- m_agridWaiting (should be renamed to m_grids)
- m_aGPrims (should be renamed to m_gPrims)
2. All data members currently in CqBucket are static because they are only needed for the bucket that is being processed, and in the current system there is only ever one bucket being processed. In the proposed system, multiple buckets can potentially be processed concurrently. This data can no longer be static on CqBucket, but we don't want it on every bucket either, as there will be lots of them and it's heavy data. This is the data that should go into the processing data object mentioned above; I suggest something like CqBucketData. At the moment there are some processes in place to avoid recalculating all the sample data for each bucket; instead we just shuffle the data around for each bucket. This may or may not be possible in the new system. I'd suggest taking the simple approach to begin with, i.e. recalculating it each time a CqBucketData object is attached to a bucket.
3. Create a new processor class, I suggest something like CqBucketProcessor, that manages the connection between a CqBucket and a CqBucketData. A CqBucketProcessor contains an instance of the CqBucketData class. Each time the CqBucketProcessor is given a new bucket to process, it stores a pointer to the CqBucket (boost::shared_ptr), and passes a pointer to its CqBucketData into the bucket. Then the CqBucket knows about the CqBucketData object it is meant to store all its data into. Every processing function in CqBucket that currently uses the static data in the CqBucket class should be modified to refer to the data on the CqBucketData object that it now holds a pointer to.
The CqBucketProcessor repeatedly calls the RenderSurfaces function on its bucket, until the bucket is finished. The scheduler is able to determine when this is by knowing which buckets have been processed. I would suggest that perhaps someone other than Manuel might write the scheduler functionality to reduce the workload on him.
So basically, each time the CqBucketProcessor calls the RenderSurfaces function on the CqBucket, the function returns when it has nothing more to process. Then the CqBucketProcessor checks with the scheduler to see if this bucket could possibly get any more data; if not, it finalises the bucket (combine, filter, display) and relinquishes control of it. The scheduler then assigns a new bucket to the processor, and it begins again.
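The relationship between the three classes described above can be sketched as follows. This is a heavily simplified illustration, not Aqsis code: the names `Bucket`, `BucketData` and `BucketProcessor` are stand-ins for the proposed CqBucket, CqBucketData and CqBucketProcessor, the "work" is a counter, the scheduler check is left out, and `std::shared_ptr` is used where the text mentions boost::shared_ptr.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Heavy scratch data (sample buffer etc.): one per processor, not per bucket.
struct BucketData {
    std::vector<float> samples;
    void reset(std::size_t sampleCount) { samples.assign(sampleCount, 0.0f); }
};

// Light per-bucket storage: only the work queued for this image tile.
class Bucket {
public:
    explicit Bucket(int pendingItems) : m_pending(pendingItems) {}

    // The processor tells the bucket where to store its working data.
    void attach(BucketData* data) { m_data = data; }

    // Processes everything queued and returns when nothing is left.
    void renderSurfaces() {
        while (m_pending > 0) {
            m_data->samples[0] += 1.0f; // stand-in for real sampling work
            --m_pending;
        }
    }

    bool isEmpty() const { return m_pending == 0; }

private:
    int m_pending;
    BucketData* m_data = nullptr;
};

// Ties one bucket at a time to the processor's own BucketData.
class BucketProcessor {
public:
    void setBucket(const std::shared_ptr<Bucket>& bucket) {
        m_bucket = bucket;
        m_data.reset(16 * 16); // simple approach: recalculate for every bucket
        m_bucket->attach(&m_data);
    }

    void process() { m_bucket->renderSurfaces(); }

    // Finalise (combine, filter and display would go here) and let go of
    // the bucket so that the processor can be given another one.
    float finish() {
        float result = m_data.samples[0];
        m_bucket->attach(nullptr);
        m_bucket.reset();
        return result;
    }

private:
    std::shared_ptr<Bucket> m_bucket;
    BucketData m_data;
};
```

The point of the split is visible in the sizes: `Bucket` only carries its queue, while the 16×16 sample buffer exists once per `BucketProcessor` and is reused for every bucket that processor is given.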
Update 20070719
What has been done so far (roughly in chronological order):
- Following the advice of the mentors, I focused on the separation of Buckets; this is where they think that parallelizing will bring the best benefits.
- I started by making some changes, moving code to the CqBucket side, to try to isolate the calculations related to it and so make the parallelization possible. This included:
- Moving much of the code of Render*() in imagebuffer.cpp to the Bucket class, so it depends less on global variables and calls.
- There's a border where it makes more sense to maintain code in imagebuffer.cpp than in bucket.cpp or others. This is because, for example, a micropolygon that is being added can end up in different buckets, so it makes more sense to keep this code at a higher level rather than at bucket level. All this is caused by a mix of “put things into buckets” and “render bucket contents” instructions (Aqsis mixes them so that the things to be rendered stay in memory for as short a time as possible). The border was between RenderImage()/RenderSurface() in ImageBuffer and RenderWaitingMPs() in Bucket.
- Converting RenderSurfaces (plural) to RenderSurface (singular), and moving the loop to RenderImages. Moving RenderGrids inside RenderSurface.
- Then I tried to make the static data in buckets non-static, which led me to a dead end, because we cannot afford to have one instance of that data for every bucket (it uses too much memory). So instead the mentors introduced the idea of using a BucketProcessor class holding the important BucketData, and running one Bucket at a time.
- After that I eliminated the calls PushMPG{Down,Forward} in ImageBuffer from Bucket, so after the bucket is filled it can proceed with its processing alone, and doesn’t interact with other buckets.
- Next I moved the Occlusion Box+Tree to the BucketData class, and made a couple of wrappers in it and in BucketProcessor; this, along with other modifications, means that the accesses from ImageBuffer and Bucket are now dynamic. As part of this I converted the static accesses to CqBucket in Occlusion into dynamic accesses, using pointers back to the calling bucket all over the place.
- Then I converted remaining static accesses from imagepixel.cpp to CqBucket to dynamic ones, mostly by passing as parameters of the method either a pointer to the bucket or to the direct data (usually they’re being called from the buckets themselves). The same with micropolygon.cpp and imagepixel.cpp.
- Passing hitTestCache as parameter to the methods using it (Occlusion::SampleMPG() and below), instead of storing a pointer inside the MP class, so several instances can run safely at the same time operating with different data.
- Creating the BoundList of the micropolygon as soon as possible, instead of leaving this job for the time when it’s needed by the buckets: in this way we avoid conflicts when different buckets could possibly call the same MP and perform this operation at the same time.
- Performing other tasks while at it, such as: massive const conversion of things to ensure that they're thread-safe, eliminating some leaks, removing dead code, updating names and code formatting as required by the coding standards…
There are still many areas that will most certainly cause threading problems; the dangerous points are:
- The reference counted MPs and the shared access to them in the buckets: in RenderWaitingMPs(), the concurrent use of RELEASEREF can cause the refcount of the MP to be set wrongly. Also, operating with the same MP at the same time can cause other problems down the line, as we'll see. Maybe it would be possible to create different objects for them (hopefully the count of shared MPs is not very big, so the memory footprint of having several copies in different buckets is not too bad).
- In OcclusionTree::SampleMP() there are several methods that operate with the nodes of the tree: SetMaxOpaqueZ(), PropagateChanges(). This should not cause problems since there is a separate OcclusionTree per BucketData (thread).
- In OcclusionTree::SampleMP() again, there are accesses to ImageVal.Data() and sample.m_Data.push_back( ImageVal ) which are modifying data of the MPs, so probably it's operating over the same data, so it would be another source for threading issues.
- pMP->MarkHit() should not cause threading problems, since marking it hit more than once leads to the same result.
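The first point in the list above, the reference count being updated from several bucket threads, can be illustrated with a minimal sketch. `RefCounted` is a hypothetical stand-in for a reference-counted micropolygon, not the actual Aqsis class, and `std::atomic` is just one way of making the counter safe.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in for a reference-counted micropolygon.
class RefCounted {
public:
    void addRef() { m_refs.fetch_add(1, std::memory_order_relaxed); }

    // Returns true when the caller has just released the last reference.
    bool release() {
        return m_refs.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }

    int refCount() const { return m_refs.load(); }

private:
    // With a plain int here, concurrent ++/-- from several bucket threads
    // would be a data race and updates could be lost.
    std::atomic<int> m_refs{0};
};

// Each "bucket" thread takes and drops a reference many times; the owner's
// single reference must be the only one left afterwards.
int hammerRefCount(int threads, int iterations) {
    RefCounted mp;
    mp.addRef(); // reference held by the owner
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&mp, iterations] {
            for (int i = 0; i < iterations; ++i) {
                mp.addRef();
                mp.release();
            }
        });
    for (std::thread& th : pool)
        th.join();
    return mp.refCount();
}
```

An atomic counter only protects the count itself; it does nothing for the other concern raised above, namely several buckets modifying the data of the same MP at once.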
Conclusion of the Project
Today is the official “pencils down” date for the Summer of Code program, so this is a kind of review of what's been done so far during the program, and what can be done later to continue improving the code in the way that this project intended.
The ideal implementation would be a scheduler that would take buckets ready to be processed (after having processed all the surfaces, so having micropolygons ready to be rendered), and process them in the available threads at full capacity (the number of threads being chosen by the user via command line options, or similar).
However, there are limitations in the way that Aqsis is supposed to work, so this straightforward, clear and performant implementation is not possible given how the code works. A description of current problems, status and future directions follows.
preProcess() and postProcess() phases
Some images fail when the postProcess() part of the bucketProcessors is threaded, and there may be similar problems with a threaded preProcess() (not even tested yet). The preProcess() and postProcess() phases could probably be threaded safely despite the errors observed; I can't think of why this shouldn't work if they operate on separate data. Since I focused on the rendering of MPs during the project, following Paul's advice, there may be things like static accesses somewhere in the execution paths of postProcess(), but they can probably be fixed, and the same goes for possible similar preProcess() issues.
Display
There are problems with concurrent write access when displaying the buckets after they have been processed, but these are supposed to be easily fixable with a lock.
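A lock of the kind mentioned here could look like the following sketch. The `Display` class and its methods are invented for the illustration and do not correspond to the Aqsis display manager; `std::mutex` stands in for whatever lock type the thread library provides.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared display: the mutex serialises concurrent writers.
class Display {
public:
    void displayBucket(int bucketId) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_shown.push_back(bucketId); // stand-in for sending the pixels
    }

    std::size_t shownCount() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_shown.size();
    }

private:
    mutable std::mutex m_mutex;
    std::vector<int> m_shown;
};

// Several threads display their finished buckets concurrently.
std::size_t displayFromThreads(int threads, int bucketsPerThread) {
    Display display;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&display, t, bucketsPerThread] {
            for (int b = 0; b < bucketsPerThread; ++b)
                display.displayBucket(t * bucketsPerThread + b);
        });
    for (std::thread& th : pool)
        th.join();
    return display.shownCount();
}
```

Since displaying a bucket is a tiny fraction of the total time according to the measurements in the Requirements section, serialising it this way should cost little.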
RenderSurfaces()
The RenderSurface() phase is more difficult:
- It starts by calling OcclusionCullSurface, which uses the bucketProcessor (and its occlusion tree) to check whether the surface can be culled. It then adds the surface to neighbouring buckets if the surface touches them, in a way similar to the pushing forward/down of the MPs. Since (I think) it needs the *same* occlusion tree of the bucket, which is in the bucketData of the bucketProcessor processing it, it gets tricky.
- RenderSurface() itself can Split() a grid, which in turn calls back to the ImageBuffer to AddMPs to itself and maybe to other buckets; PostSurface() does similar things but with surfaces instead of MPs. So this should be in the main thread, in ImageBuffer.
- There is a lot of code in those paths (and thus possible issues in every corner), but I think that the above are the main characteristics of how that part of the code works.
So the problem is that we need the bucket to be associated with the bucketProcessor for the Occlusion test, which in turn is needed for RenderSurface(), which should ideally run before the bucket association (so the bucket could be assigned later to any available thread to be processed).
The bottom line is that I can't think of a way to break this interdependency. For example, it might pay off to have one global occlusion tree (just for the RenderSurface phase) and one separate tree per bucketData (for Sampling and the rest), or some other wacky workaround, but at this point I don't have ideas about how to fix this.
Current Status
The current implementation does roughly this:
- bucketProcessor.setBucket(nextBucket);
- bucketProcessor.preProcess(boundaries, etc);
- bucketProcessor.process();
- bucketProcessor.postProcess(imager, etc);
- display(bucketProcessor.getBucket());
The preProcess() calls bucket->PrepareBucket() and bucketData->occlusion->SetupHierarchy(bucket).
Later, it uses the occlusion object in bucketData for the checks in OcclusionCullSurface in RenderSurfaces, so this phase requires that the same bucket↔bucketProcessor association is still in place. Additionally, RenderSurfaces() has to be done in ImageBuffer because it's adding primitives to several buckets and so on, so this step should be in the main thread, not in separate threads.
The key point is that, currently, you have to do the RenderSurface() step *after* associating the bucket with the bucket processor, setting up the hierarchy in the occlusion tree; they can only get “ready” after being associated with the bucketProcessor.
So at this point, after RenderSurfaces() happens, I think that it doesn't make sense to detach the bucket from the bucketProcessor only to attach them later in a different order; in addition, as I understand it, the bucket has to be processed in the same bucket processor as before, because the occlusion tree contained in the bucketProcessor is holding references to the bucket, set up before RenderSurfaces() happens.
So that's why the current scheme is like it is: creating N bucketProcessors equal to the number of threads asked, then preProcessing in the single thread (sequentially), then processing in threads, if possible the same for postProcessing and display (or otherwise fall back to the main thread for those phases); and only when the set of bucket processors is done with the current buckets, start a new cycle.
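The cycle described above can be sketched as follows. It is a toy model under several assumptions: `Processor` is a stand-in for the bucket processor with trivial arithmetic in place of rendering, buckets are plain integers, postProcess() and display are kept in the main thread (the fallback mentioned above), and `std::thread` replaces Boost.Threads.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Toy stand-in for a bucket processor; the "image data" is a single int.
struct Processor {
    int bucket = -1;
    int result = 0;
    void setBucket(int b) { bucket = b; }
    void preProcess() { result = 0; }       // main thread only
    void process() { result = bucket * 2; } // safe to run in its own thread
    void postProcess() { result += 1; }
};

// Renders buckets [0, bucketCount) in cycles of at most threadCount buckets
// and returns the results in display order.
std::vector<int> renderImage(int bucketCount, int threadCount) {
    std::vector<Processor> processors(threadCount);
    std::vector<int> displayed;
    for (int first = 0; first < bucketCount; first += threadCount) {
        int active = std::min(threadCount, bucketCount - first);

        // 1. Attach and pre-process sequentially in the main thread.
        for (int i = 0; i < active; ++i) {
            processors[i].setBucket(first + i);
            processors[i].preProcess();
        }

        // 2. Process in parallel, one thread per bucket processor.
        std::vector<std::thread> threads;
        for (int i = 0; i < active; ++i)
            threads.emplace_back(&Processor::process, &processors[i]);
        for (std::thread& t : threads)
            t.join();

        // 3. Post-process and display back in the main thread; only then
        //    does the next cycle start.
        for (int i = 0; i < active; ++i) {
            processors[i].postProcess();
            displayed.push_back(processors[i].result);
        }
    }
    return displayed;
}
```

The join at the end of step 2 is where this scheme loses time: every cycle waits for its slowest bucket before any processor can pick up new work.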
In the ideal approach explained at the beginning, which relies on a list of “ready” buckets at any given time, the buckets would only become ready in chunks, after being preProcessed and having passed the RenderSurface() step. If we want to produce more buckets in the ready state, we would have to use the available bucketProcessors in the main thread to preProcess more buckets; if we want to consume the ready buckets, there's no better way than threading them right away, instead of re-scheduling them later (taking into account that they can only end up in the same bucketProcessor where they were preProcessed, so detaching them only to attach them again to the same bucketProcessor doesn't make sense).
So, in the end, we need the same bucket↔bucketProcessor association all through the process. We cannot make the buckets get ready (all surfaces processed and MPs in place) and “consume” them while other buckets are getting ready, because the bucketProcessor is needed both for getting them ready and for rendering them.
The only thing that I think that could break the dependency is to somehow render the surfaces and create the micropolygons in the bucket before the bucket↔bucketProcessor association takes place; so that in effect would make the buckets “ready” first and then they'll be consumed by the bucketProcessors at full speed.
It's worth noting that there's an overall constraint which affects all this: we cannot create BucketData objects for every bucket, since the memory consumed would be too high (and rendering would probably take much longer, since the data structures are reused from bucket to bucket, which saves a lot of setup time). That's why we have to introduce the concept of BucketProcessor, which holds an instance of BucketData and OcclusionTree; we associate buckets with them, and the buckets contain the primitives to be rendered in each portion of the image, along with the algorithms to apply to them and to the BucketData. That's the reason why we can't get the buckets ready before processing the micropolygons in different threads: we need BucketProcessors with BucketData and OcclusionTree to be associated with the bucket both for getting the micropolygons in the buckets ready and for rendering them.
Future Directions
The main problems explained so far need to be addressed if we want to get closer to the ideal approach; but especially in the case of RenderSurfaces this means refactoring a lot of code and dramatically changing core parts of the current code base.
It's as yet unclear whether this is possible at all and, if it is, whether there are other tradeoffs that we have to consider.
Log
August 20:
- “Pencils down” timeline, officially the coding period finished today. I made a last commit for this period, but nothing important.
- Paperwork (filling surveys, etc).
- Added the Conclusion of the Project section.
August 17:
- Continuing with the work of the last few days, but without new conclusions.
August 16:
- Investigating RenderSurfaces() and the related parts of the code. This function needs the bucket to be associated with the bucketProcessor (and thus its data, especially the occlusion box) in order to perform occlusion calculations. So the bucket doesn't get “ready” without being associated with the bucketProcessor and performing RenderSurfaces() in the single thread, and it can't be consumed if it's not in the same bucketProcessor, with the occlusion tree where it has reserved space to sample later. So it seems that Tristan's approach isn't achievable as is after all (at least with the knowledge that I have about how things are done).
August 15:
- Fixing most (all?) of the outstanding problems of the ThreadScheduler, so now it doesn't seem to block or segfault inexplicably as it was doing before.
- However, tcolgate suggested a new approach that might work better than this “launch N threads for N bucket processors, wait for them to finish before continuing” scheme, using another strategy: instead of sending a “bucketProcessor.process()” as the job unit, the job unit would contain a bucketProcessor and a loop continuously asking for new Buckets until all of them are processed. There is a notable problem with this, since RenderSurfaces() needs to be done in ImageBuffer, and it's as yet unclear whether this can be done up front (before the bucket processors start asking for buckets) or not.
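The job-unit idea suggested by tcolgate can be sketched like this. It deliberately ignores the RenderSurfaces() ordering problem raised in the same entry and assumes that every bucket is ready before the workers start; `BucketQueue` and `renderAll` are hypothetical names, buckets are plain indices, and `std::thread` stands in for Boost.Threads.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hands out bucket indices to whoever asks; returns -1 when none are left.
class BucketQueue {
public:
    explicit BucketQueue(int count) : m_count(count) {}

    int next() {
        int index = m_next.fetch_add(1);
        return index < m_count ? index : -1;
    }

private:
    std::atomic<int> m_next{0};
    int m_count;
};

// Each worker plays the role of one bucket processor with its own loop.
std::vector<int> renderAll(int bucketCount, int threadCount) {
    BucketQueue queue(bucketCount);
    std::vector<int> results(bucketCount, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t)
        workers.emplace_back([&queue, &results] {
            // The job unit is the whole loop, not a single process() call:
            // keep asking for buckets until the queue runs dry.
            for (int b = queue.next(); b != -1; b = queue.next())
                results[b] = b * b; // stand-in for processing bucket b
        });
    for (std::thread& w : workers)
        w.join();
    return results;
}
```

Compared with launching N threads per cycle and joining them, a worker that finishes early immediately takes the next bucket, so no thread idles while a slow bucket is still being processed.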
August 14:
- The postProcess() operation of bucket processors doesn't work well with all the images (unfortunately it seemed promising with “vase.rib”…); it's failing in a few of the regression tests (about 3, and probably more among the ones that already fail for other reasons). So at this point it won't be included.
August 13:
- Starting to implement a thread scheduler, in order to abstract a bit more the use of threads in the rest of the code.
- Checking whether the postProcess() operation of BucketProcessor (which means CombineElements, Filter and Expose operations of the bucket) can be also included in the code to run in separate threads. It's working fine so far, with the vase.rib that I take as reference; but it has to be confirmed with the regression tests. If this is correct, it means that after the preprocessing and until the display of the buckets, all can be threaded; and probably the display too using locks for the different buckets; so it seems quite encouraging.
August 10:
- Using real (boost::)threads when rendering the micropolygons of the buckets. The code will need to be enhanced with options to use no threads, and how to select the number of threads, etc.
August 9:
- Quite complete infrastructure for using several bucket processors at once. At the moment the number is defined with a variable in the code, but it might as well be provided via a runtime option or whatever (yet to be decided).
- First time that it runs with threads (and apparently successfully!!), but not committed yet.
August 8:
- Commit after a few days of inactivity due to network outages. The several changes introduced are at the moment equivalent to the code before (i.e., the regression tests succeed and fail in the same places), but are better suited to using a variable number of bucket processors at once.
August 6/7:
- Moving to a new location and having lots of troubles with network connectivity; so I spent the time trying to get the network to work and moving the development environment to a new computer.
August 3:
- Continuing with the tests, this is painfully slow because each round takes 1h to run or so. Unfortunately it seems that there are some other hidden corners messing up with things internally, because there are more tests failing than before (5 or 10% depending on the combinations of processing phases). The problem now is that I have no clue about where the problem is, there are no parts crashing or overwriting memory (using valgrind), so that means that it's somewhere in the rendering phases that I don't understand.
- Apart from that, replacing the memory pool with the usual allocation methods seems to fix all the previous problems, so I committed that (with the calls to the memory pooler commented out, in case they need to be restored quickly).
August 2:
- Checking whether the solution found yesterday passes the regression tests “successfully” (failing only in the same ones as before), with the different combinations.
August 1:
- Trying to fix a problem with the MPDump class, which seems to be enabled by default on most installations, and fails to compile due to some of my changes during the last few months. In my case it is not enabled for some reason, and I have not found any way to enable it (the switches to tell SCons to enable it are not working), so after spending a few hours trying to fix the problem with the help of the remote auto-builder, I'll wait for a better occasion.
- Found more information about the problem that I was investigating yesterday: it seems to be something in the memory pooler CqSampleDataPool, because when replacing the use of the allocator with memory allocated as usual (new/delete) the crash is gone. Probably the crash was due to the allocator having a stack of addresses to “lend”, and if the addresses are not given back in LIFO order the pooler allocates or deallocates in the wrong positions. As I understand it, it works like this example: bucket1 uses up to 1000, bucket2 from 1000 to 2000, and if at that point bucket1 gives back its slots, the pooler stays with 0 on top, which is wrong (1000 to 2000 are still allocated).
- When fixing the problem above with the usual new/delete allocation, it seems to fix also the other problems; so now any combination of preProcess/process/postProcess/Display from two buckets that was giving problems before, seems gone (at least I haven't found problems so far in the combinations tested).
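The suspected failure mode of the pooler can be reproduced with a toy stack-style pool. This illustrates the hypothesis described in the entry above only; `StackPool` is invented for the purpose and is not how CqSampleDataPool is actually implemented.

```cpp
#include <cstddef>

// Hypothetical stack-style pool: it only remembers the current top.
class StackPool {
public:
    // Reserves `count` slots and returns the index of the first one.
    std::size_t allocate(std::size_t count) {
        std::size_t first = m_top;
        m_top += count;
        return first;
    }

    // Gives a block back by rewinding the top to the block's first slot.
    // This is only correct when blocks are released in reverse (LIFO)
    // order of allocation.
    void release(std::size_t first) { m_top = first; }

    std::size_t top() const { return m_top; }

private:
    std::size_t m_top = 0;
};
```

With two buckets alive at once, bucket1 gets slots 0 to 999 and bucket2 gets 1000 to 1999. If bucket1 releases first, the top drops to 0 although bucket2 still uses its slots, and the next allocation is handed slot 0 upwards and can grow into bucket2's range. With a single bucket in flight, releases are trivially LIFO, which would explain why the problem only appeared once several buckets were processed together.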
July 31:
- Tracking one of the problems mentioned yesterday, “With some of the combinations changed, like doing all (including preprocessing) for bucket 1 and then all for bucket 2, it crashes in CqImagePixel in the Combine method.”.
July 30:
- Trying to run several buckets at once in the main ImageBuffer::RenderImage() loop. It works under some circumstances, when processing+postProcessing+Display is done sequentially for each bucket (so processing+postProcessing+Display for bucket 1, then processing+postProcessing+Display for bucket 2). When the Display is interleaved (processing+postProcessing for bucket 1, processing+postProcessing for bucket 2, Display for bucket 1, Display for bucket 2), the image has every column of buckets repeated. When the processing is done first (processing for bucket1, processing for bucket2, postProcessing+Display for bucket1, postProcessing+Display for bucket2), every column is repeated and the pixels of the repeated column is kind of shuffled. With some of the combinations changed, like doing all (including preprocessing) for bucket 1 and then all for bucket 2, it crashes in CqImagePixel in the Combine method. All this behaviour is unexpected, so deep investigation is needed in the following days…
July 27:
- Processing the buckets in a slightly different way: first add all surfaces, then render all the available MPs.
- Removing the accesses to CurrentBucketCol()/CurrentBucketRow() in ImageBuffer::PostSurface() – if we're manipulating several buckets at once in that part, these functions shouldn't assume that only one bucket is being processed at a time.
- Adding the row/column of the bucket being processed to the bucket processor, and methods to retrieve them. Removing the accesses to CurrentBucketCol()/CurrentBucketRow() in ImageBuffer::OcclusionCullSurface() for the same reason. Instead, we use the newly stored row/column values that we pass around with the BucketProcessor.
July 26:
- Moving parameters from the bucket processing methods to the bucket data (set in the bucket pre-processing phase), so we avoid passing those parameters many times in many methods later. This also makes the main rendering loops slightly cleaner.
- Making a few methods of CqBoundList and MicroPolygon const – they are in the code to be threaded, and this makes it clearer that they won't cause problems.
- Creating methods to set/reset/check the flags of SqImageSample instead of setting the flags directly.
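Accessor methods of this kind can be sketched as follows. The struct name, the flag names and the method names are all hypothetical stand-ins, not the actual SqImageSample interface; the point is only that callers no longer touch the flag word directly.

```cpp
#include <cstdint>

// Hypothetical stand-in for SqImageSample; all names are illustrative.
struct ImageSampleSketch
{
    enum Flag : std::uint32_t
    {
        Flag_Occludes = 1u << 0,
        Flag_Matte    = 1u << 1,
        Flag_Valid    = 1u << 2
    };

    // Turn a flag on without disturbing the others.
    void setFlag(Flag f) { m_flags |= f; }

    // Turn a flag off without disturbing the others.
    void resetFlag(Flag f) { m_flags &= ~static_cast<std::uint32_t>(f); }

    // Query a single flag; const, so it is safe on shared read-only samples.
    bool isFlagSet(Flag f) const { return (m_flags & f) != 0; }

private:
    std::uint32_t m_flags = 0; // no longer modified directly by callers
};
```

Keeping the flag word private means every write goes through one place, which makes it easier to audit (or later protect) when the code is threaded.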
July 25:
- Spending some time merging (big Piqsl merge back in the trunk) and running the regression tests.
- Revisiting the code in the critical paths to see if there are more parts of the code that are not thread safe. The code paths are intricate, with code starting from Bucket calling back to Bucket to manipulate data, etc.; but it seems to be thread safe.
July 24:
- Finishing work on yesterday's problem. The final solution was proposed by Paul, using boost::shared_ptr, whose reference counting is thread-safe. There's a small penalty (200ms in a run of just under 4s with my reference vase.rib), but it's not supposed to be important because that part is due to be reworked later into a scheme where MPs are copied to each bucket that needs them (instead of shared, as they are now).
- There seems to have been something wrong with the reference count-related calls, because now all of the big leaks that I introduced previously in the branch are gone.
July 23:
- Starting to work on the problem of the MP refcount being corrupted by different bucket threads (listed as one of the remaining problems in the “Update 20070719” of the Design section).
July 20:
- Creating a diagram of the interrelationships between the parts of the code that remain to be threaded.
July 19:
- Evaluating the job done so far, updating the Design section to reflect it.
- Adding a few const attributes where possible in the path of buckets rendering MPs to ensure that things are thread safe (and 'neater' anyway); and also some pending naming/formatting changes, hoping to make things clearer.
July 18:
- Passing hitTestCache as a parameter to the methods, instead of storing a pointer inside the MP class. In this way, different parallel buckets can call the same MP at the same time without threading issues. It passed the regression tests and it doesn't take longer to execute than before (unlike some of my initial modifications).
- Fixing a leak introduced previously in my branch, with important memory consequences; there are still a few leaks though.
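The idea of passing the hit-test cache as a parameter, rather than keeping a pointer to it inside the MP, can be sketched like this. The types below are hypothetical and much simpler than the real CqHitTestCache and micropolygon classes; they only illustrate that each caller owns its own cache and the shared MP is never written to.

```cpp
// Hypothetical simplified types, NOT the real Aqsis classes.
struct HitTestCache
{
    float minX = 0.0f;
    float maxX = 0.0f;
    bool  valid = false;
};

class MicroPolygonSketch
{
public:
    MicroPolygonSketch(float minX, float maxX) : m_minX(minX), m_maxX(maxX) {}

    // Fill a caller-owned cache. The method is const: the MP is not modified,
    // so two buckets can call it on the same MP with different caches.
    void cacheHitTestValues(HitTestCache& cache) const
    {
        cache.minX = m_minX;
        cache.maxX = m_maxX;
        cache.valid = true;
    }

    // Test a sample position using only the cache supplied by the caller.
    bool sample(const HitTestCache& cache, float x) const
    {
        return cache.valid && x >= cache.minX && x <= cache.maxX;
    }

private:
    float m_minX;
    float m_maxX;
};
```

With a cache pointer stored inside the MP, two buckets preparing the same MP would overwrite each other's pointer; here each bucket would keep its own `HitTestCache` instance.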
July 17:
- Looking into how to make several execution paths of buckets rendering the MPs thread safe. One of the interesting spots is CqHitTestCache; still working on it.
July 16:
- Creating the bound list of the micropolygon as soon as possible, instead of leaving this job for the time when it's needed by the buckets: in this way we avoid conflicts when different buckets could possibly call the same MP and perform this operation at the same time.
- Looking for more dangerous spots in the path that is going to be multi-threaded.
July 13:
- Abstracting some of the Bucket operations in ImageBuffer through BucketProcessor.
- Making many static variables const all through the code, so it's clear that they don't have threading issues, and it's a cleaner approach anyway. In this way we are able to filter out the static consts and focus on the rest, which are the dangerous ones from the point of view of this project.
July 12:
- Continuing with the conversion of static accesses from imagepixel.cpp to CqBucket.
- Eliminating the static methods and attributes of CqBucket.
July 11:
- Converting static accesses from imagepixel.cpp to CqBucket to dynamic ones, mostly by passing as parameters of the method either a pointer to the bucket or to the direct data (usually they're being called from the buckets themselves).
- The same with micropolygon.cpp.
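The conversion from static to dynamic access described for July 11 and 12 follows a simple pattern, sketched below. `BucketSketch` and `pixelCount` are hypothetical names used only for illustration; the real CqBucket has far more state.

```cpp
// Hypothetical sketch; the real CqBucket has many more members.
class BucketSketch
{
public:
    BucketSketch(int width, int height) : m_width(width), m_height(height) {}

    int width() const  { return m_width; }
    int height() const { return m_height; }

private:
    // As static members these would be one shared value for all buckets;
    // as ordinary members every bucket instance carries its own.
    int m_width;
    int m_height;
};

// Before: a helper like this would call a static bucket accessor.
// After: the calling bucket passes a pointer to itself, so each call
// sees the data of the bucket that is actually being processed.
int pixelCount(const BucketSketch* bucket)
{
    return bucket->width() * bucket->height();
}
```

Because nothing is read from shared static state, two buckets of different sizes can be processed at once without interfering.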
July 10:
- Moving the Occlusion Box+Tree to the BucketData class, and making a couple of wrappers in it and in BucketProcessor; together with other modifications, this means that the accesses from ImageBuffer and Bucket are now dynamic.
- Converting the static accesses to CqBucket in Occlusion into dynamic accesses, using pointers back to the calling bucket all over the place.
- Quite a productive day, since those two pieces were major blockers to get parallelization in place, and it was easier than I expected. On the other hand, this doesn't mean that the job is near completion: a lot of other blockers have to be removed; and further modifications are needed before starting to introduce real thread code.
July 9:
- Evaluating next parts to be modified to allow for parallelization. Currently the focus is on Occlusion, because it's a part that the methods of Bucket rendering MPs use internally, and they're using static access back to the Bucket and so on.
- Finished investigating the problem of why an algorithm in ImageBuffer wants to add MPs to buckets already processed. We made progress searching for the source of the problem, but since it appears to be a bug deep in the code and present in the trunk, Paul told me to leave it. The same goes for the memory leaks in my branch at the moment, so I can concentrate on the core of my project.
July 6:
- Investigating the problem of why an algorithm in ImageBuffer wants to add MPs to buckets already processed (which happens in the trunk of the code too). I added quite a lot of debugging info and some infrastructure to help discover what the problem is, but it's still a WIP.
- Investigating the problem of having big leaks (2.5MB in the RIB that I'm using as reference, vase.rib, which uses about 25MB at max). I haven't come to any conclusion yet, but I fixed some leaks and made a bit of cleanup while at it.
July 5:
- Eliminating the need for ImageBuffer::PushMPGForward and PushMPGDown, an important leap forward in the direction of parallelization: so that when processing MPs inside the buckets we don't need to add MPs to other buckets.
- I ran the regression tests after a while without doing so and things don't look good: there are a lot of failures because I have to abort Aqsis for many of the tests. The reason is that the Aqsis process grows to 800MB of memory, at which point my machine starts thrashing. So although my changes seem to be mostly working, there are probably some cases where it uses much more memory than is reasonable, which forces me to kill the process before it finishes; maybe those tests would eventually finish and render the image properly. Anyway, this is a big problem, so I'd better fix it before adding more changes…
July 4:
- Continuing to work towards the elimination of the references from the Bucket to ImageBuffer::PushMPGForward and PushMPGDown, which is proving to be tough.
- Removing references to ImageBuffer from Bucket::RenderWaitingMPs() by passing parameters down through the functions from the top; the overall effect is that the Bucket is more self-contained.
July 3:
- Starting to eliminate the calls to PushMPG{Down,Forward} in ImageBuffer from Bucket, so that once the bucket is filled it can proceed with its processing alone and doesn't interact with other buckets. It's proving tougher than expected, though…
July 2:
- I've spent all my time tracking down a bug introduced in the last commits and I haven't found the problem yet; I hope that Paul can help me…
June 28 and 29:
- Moving MP processing (RenderWaitingMPs) from ImageBuffer to CqBucket
- Converting RenderSurfaces (plural) to RenderSurface (singular), by moving the loop to RenderImages.
- Moving RenderGrids inside RenderSurface.
- Removing now unused code from CqBucket, such as the code related to RenderGrids, and the access to the internal m_micropolygons.
- Moving also other parts of the former RenderSurfaces (the ones related to Combining and so on, after dicing/splitting) to RenderImage.
June 27:
- Reviewing code of the methods processing data that we'll have to change or move around.
- Mostly waiting for input from the mentors.
June 26:
- Adding CqBucketProcessor, a kind of wrapper binding together CqBucket and CqBucketData.
- Moving code around, making methods not static and so on.
- Fixing a bug left from yesterday, with the help of pgregory.
June 25:
- Implementing and committing CqBucketData and the way that CqBucket uses it, and fixing some bugs that appeared in the meantime.
June 22:
- Starting to modify code according to Paul's proposal, CqBucketData only at the moment.
June 21:
- Reading and studying how to modify code according to Paul's proposal.
June 20:
- Studying how to process several buckets at once. Static data and methods in the Bucket class will make the modifications painful. I already tried one approach which led me to a dead end, because we can't afford to have one instance for every bucket (it uses too much memory).
- Making the grid Split() function (into MPs) add the produced MPs to a vector instead of invoking ImageBuffer directly, which allows for a cleaner addition to buckets later (e.g. locking the whole operation instead of doing it for every MP).
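The benefit of collecting the MPs in a vector first can be sketched as below: the split step touches no shared structure, and the whole batch is then added under one lock. All types here (`Mp`, `Grid`, `SharedBucket`) are hypothetical; standard `std::mutex` is used only to keep the sketch self-contained, and a Boost.Thread mutex would play the same role.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical types, NOT the real Aqsis grid, MP or bucket classes.
struct Mp { int id; };

struct Grid
{
    int firstId;
    int count;

    // Split the grid into MPs and append them to `out`. Nothing shared is
    // touched here, so this part needs no locking at all.
    void split(std::vector<Mp>& out) const
    {
        for (int i = 0; i < count; ++i)
            out.push_back(Mp{firstId + i});
    }
};

class SharedBucket
{
public:
    // Add a whole batch under a single lock instead of locking once per MP.
    void addBatch(const std::vector<Mp>& batch)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_mps.insert(m_mps.end(), batch.begin(), batch.end());
        ++m_lockCount;
    }

    // Inspection helpers, meant to be called when no other thread is adding.
    std::size_t size() const { return m_mps.size(); }
    int lockCount() const { return m_lockCount; }

private:
    std::mutex m_mutex;
    std::vector<Mp> m_mps;
    int m_lockCount = 0; // how many times the lock was taken
};
```

For a grid that splits into four MPs, `addBatch` takes the lock once, where a per-MP call into a shared structure would take it four times.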
June 19:
- Continuing yesterday's activities, but today I'm running into difficulties because I'm near the border where it makes more sense to keep code in imagebuffer.cpp than in bucket.cpp or elsewhere. This is because, for example, a micropolygon being added can end up in different buckets, so it makes more sense to handle it at a higher level rather than at the bucket level. All this is caused by a mix of instructions “put things into buckets” and “render bucket contents” (Aqsis mixes them so the things to be rendered stay in memory for as short a time as possible), a characteristic part of how Aqsis works, which seems incompatible with any sort of parallelization. We'll have to resolve these issues soon, because parallelization needs a different approach: first “put things into buckets”, then “render bucket contents” in parallel.
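The two-phase approach mentioned above can be sketched as follows. Everything here is hypothetical (`Mp`, `Bucket`, `distribute`, `renderBucket`, and "rendering" reduced to a sum); it only shows the separation of a serial distribution phase from a per-bucket phase that touches nothing but its own bucket.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only.
struct Mp
{
    int bucketIndex; // which bucket this MP falls into
    int value;       // stand-in for the MP's contribution to the image
};

struct Bucket
{
    std::vector<Mp> mps;
    int result = 0;
};

// Phase 1: "put things into buckets". Runs serially and may touch any bucket.
void distribute(const std::vector<Mp>& mps, std::vector<Bucket>& buckets)
{
    for (const Mp& mp : mps)
    {
        std::size_t index = static_cast<std::size_t>(mp.bucketIndex);
        assert(index < buckets.size());
        buckets[index].mps.push_back(mp);
    }
}

// Phase 2: "render bucket contents". Reads and writes only its own bucket,
// so calls for different buckets could run on different threads.
void renderBucket(Bucket& bucket)
{
    bucket.result = 0;
    for (const Mp& mp : bucket.mps)
        bucket.result += mp.value;
}

// Serial driver; this loop is the candidate for parallelization.
void renderAll(std::vector<Bucket>& buckets)
{
    for (Bucket& bucket : buckets)
        renderBucket(bucket);
}
```

The cost of this ordering is the one the entry points out: everything waits in memory until phase 1 has finished, instead of being rendered and discarded as early as possible.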
June 18:
- Moving much of the code of Render*() in imagebuffer.cpp to the Bucket class, so it depends less on global variables and calls. This might be a good idea for the trunk even if the parallelization doesn't take place in the end. Since quite a lot of code was moved, regression tests are needed before committing to the branch, so I'll probably do it tomorrow.
June 15:
- Writing the Design “document”, above.
- Focusing now on the separation of Buckets; this is where they think that parallelizing will bring the best benefits. I started to make some changes, moving code to the CqBucket side, to try to isolate the calculations related to it and so make the parallelization possible.
June 14:
- Starting to modify the application to try to isolate parts that can be parallelized (MPG->Shade()), although the code is not committed yet (it's segfaulting and anyway needs revision to ensure that I'm doing the right thing).
- Discussing with tcolgate how to parallelize the application (mostly referring to the calls to MPG->Shade() at the moment).
- Reading about REYES architecture, this is one interesting article: http://www.plastickitten.net/wordpress/essays/reyes-primitives-some-philosophy/
June 12:
- Advice from tcolgate to read this article: http://www.ddj.com/article/printableArticle.jhtml?articleID=184401518&dept_url=/dept/cpp/
[16:25:32] <pgregory> i.e. parse-->distribute-->split-->dice-->shade-->sample-->combine-->filter-->display
[16:25:38] <tcolgate> It really should be top down. Ask yourself "What in REYEs can be paralellised".
[16:25:49] <pgregory> see how those parts interperate, how they can be separated.
- Measuring the parts where the program spends most of its time.
June 11:
- Investigating more about parts to be modified (Display).
- Merging changes from the last few days.
June 4-8:
- Busy with exams, no work for Aqsis during this week.
June 1:
- All the time devoted to diving into Aqsis code. Produced patches for trunk removing unused variables.
May 31:
- Practicing with Boost.Thread library.
- Reading http://vergil.chemistry.gatech.edu/resources/programming/threads.html (about threads in general, and POSIX threads as a possible alternative to Boost.Thread).
- Reading http://yolinux.com/TUTORIALS/LinuxTutorialPosixThreads.html (the Thread Pitfalls section about thread-safe operations is important).
May 30:
- Starting to write the Requirements Document (the part about the thread library at the moment), to be approved by the community of developers before starting to code.
- Starting to play with the Boost.Thread library, learning to use it.
- Diving more into the code, trying to identify which main parts could be made parallel.
May 29:
- Really diving into the code for the first time, at the moment the parts processing the Buckets, to get an idea about what can be done.
- Learning about modifying parameters of the images to be rendered.
- Learning a bit about the Callgrind tool from Valgrind, and KCacheGrind to view the results (profiling).
May 28:
- Learning about regression testing scripts.
- Compiling and running tests on two different machines (amd64 and Pentium4 with hyperthreading).
- Fixing a problem with SVN branch (it was imported when the trunk was broken).
- SoC program starts :)