Honours Project Development Blog – s2 – 11

My first task this week was to fix the multiple player spawning problem. My solution was a simple bool in the client script indicating whether a player had successfully spawned; as soon as this had been recorded, the bootstrapper would no longer send out any requests. This was needed because the bootstrapper relied on coroutines which waited for a period of time and then called a function to request the player creation again. If a player was created during this wait period, after the coroutine had already been started, the bootstrapper would end up spawning multiple players.
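As a rough illustration of the guard described above, here is a minimal Python sketch. It is not the Unity code from the project: the `Bootstrapper` class, `send_request` callback, `on_player_spawned` method and `max_attempts` value are all hypothetical names, and a generator stands in for the coroutine, with each `yield` representing one wait period.

```python
class Bootstrapper:
    """Hypothetical stand-in for the client bootstrap script."""

    def __init__(self, send_request, max_attempts=5):
        self.send_request = send_request    # callable that asks for a player to be created
        self.max_attempts = max_attempts    # assumed retry limit, for illustration only
        self.player_spawned = False         # the simple bool guard
        self.requests_sent = 0

    def on_player_spawned(self):
        """Called when the client learns its player entity exists."""
        self.player_spawned = True

    def retry_loop(self):
        """Generator standing in for the coroutine; each yield is one wait period."""
        for _ in range(self.max_attempts):
            if self.player_spawned:
                return                      # a player already exists, so stop requesting
            self.requests_sent += 1
            self.send_request()
            yield                           # the spawn confirmation may arrive during this wait


# Usage: the confirmation arrives during the first wait, so no second request goes out.
boot = Bootstrapper(send_request=lambda: None)
loop = boot.retry_loop()
next(loop)
boot.on_player_spawned()
for _ in loop:
    pass
```

The important detail is that the flag is checked after every wait rather than only once before the loop starts, since the player can appear while a wait is already in progress.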

An additional fix here was to increase the timeout value to 60.0 seconds so that even in edge cases of extreme latency, the player should still spawn correctly. During this first task I happened across a workflow optimisation. Whenever I was making changes to test on the cloud I would rebuild the entire project, redeploy it to the cloud and relaunch the client, which takes a while (although I am getting faster at this process as I do it so regularly). I realised that if I was making a client-specific change, there was no need to redeploy; I could just connect to the existing deployment. This more than halved the number of redeployments I needed to do during this and the following tasks this week, and I’m annoyed I didn’t realise it sooner in the project.

Now that I had a functioning application, I looked at the metrics to decide how best to start capturing them, and noticed a memory leak:

270318_memoryLeak

By observing the console in my client, I noticed I was getting fairly frequent warnings about unhandled query responses. These seemed to be tied to the heartbeat functionality built into the player object of the base project; I decided I didn’t need it, so I cut it out.

This did appear to make a small improvement but I still definitely had a memory leak:

270318_improvedMemoryLeak

Digging around online, I found that the number of component updates I was making was perhaps too high for the workers to handle. I changed the updates from a reliable delivery method to an unreliable one. This meant that if the network was under heavy load, updates could be dropped.

Ideally, I would not have implemented this, as I lose some degree of accuracy in the simulation; however, some concessions are necessary and I felt this was one of them. By reducing the load of updates on the workers I was able to significantly stem the flow of the memory leak:

270318_memoryLeakSignificantImprovemtnPlusClients

I increased the number of entities for this build, which is why the percentage is higher, but the rate of increase appears to be slower. The jumps here are when a client connects to the deployment. Even though the client later disconnects, there is no indication of this in the memory usage. This may need investigating as an optimisation, but for now I think my simulation is ready to be used to gather baseline performance information.

rd_cloud_21x21_20dt.gif

Honours Project Development Blog – s2 – 10

I had an assessment date this week for handing in a draft of my methodology. I spent some time this week trimming it a little and adding figures to it before handing it to my supervisor. I didn’t find much time to dedicate to my project development because I was working on my personal project, OIL, having found out that I am releasing it on the 4th April (which is scarily soon). I spent a lot of time with my development partner working on OIL to get it ready for release.

Honours Project Development Blog – s2 – 9

After completing an initial draft of my methodology, I was keen to get back into development. I began by trialling my locally functioning deployment in the cloud. As far as the Improbable inspector was concerned, the simulation ran fine. However, whenever I tried to connect a client it would connect for a few seconds before going blank and losing its connection to SpatialOS. I found that the request to spawn a player was being returned unsuccessful, and after a certain number of failed attempts the client was simply being disconnected. To solve this issue, I had to delve into the client connection lifecycle of a SpatialOS application, which until this point I had not needed to understand.

The offending code was in the bootstrap script which handles the initial connection to the deployment. A query was being sent without setting a timeout value. The timeout parameter defaults to ‘timeout = null’, which I initially assumed meant that the query would not time out and would wait indefinitely for a response. This assumption was incorrect. By setting the timeout to an arbitrarily high value (20.0 seconds), I could make the query wait out the latency and return correctly, successfully spawning a player.

I realised that this would affect all queries, so my UI query/command system would not work reliably in its current form. I decided instead to attach a ‘CellSync’ MonoBehaviour to each cell which was visible to the client. Now, when the reset button is pressed, the client finds all game objects with a CellSync component and sets them to be reset. This script uses a regular component update to set the cells to be reset. The cells wait in their reset state until the start button on the UI is pressed, which uses a similar process to start the cells performing their calculations again.

ui0

Now that I was no longer using the query/command system, and due to an issue I noticed with player spawning, I decided to move the UI off the player entity and instead have it as a part of the client. I noticed that sometimes the application would continue to spawn players even though only one client had connected, and this would need a fix.

Honours Project Development Blog – s2 – 8

As I had a locally running simulation, I felt like it was now important to try and make a start on my dissertation and so this week was dedicated to writing my methodology section. My supervisor felt like this would be the best section to attempt first as this could help inform the continuing direction of the project.

I found that ordering my ideas and findings from the development of my application helped massively in organising my thoughts about how best to proceed next. Through my explanation of how deployment configurations work, I could see that I had misunderstood the use cases of the different load balancing configurations. I realised that for a fixed number of entities in a fixed space, a static configuration using a known number of workers was the best approach, especially when performing experiments, as a rigorous experiment needs as few changeable independent variables as possible. I also realised that for the fixed configuration, the total bounds of the world must be covered by the domains of each worker (which are unchanging), therefore the bounds should not be any bigger than the size of the space the simulation is being performed in.

I rewrote how the cells were spawned in so that they would always fit exactly within a world of bounds 100m x 100m in an even distribution.
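A minimal Python sketch of one way to lay out such a grid, under assumptions that are not stated above: the world is centred on the origin, the grid is square, and each cell sits at the centre of its own grid square. The function name `cell_positions` and its parameters are hypothetical.

```python
def cell_positions(cells_per_side, world_size=100.0):
    """Return (x, z) positions for a square grid that exactly fills the world bounds."""
    spacing = world_size / cells_per_side   # width of one grid square
    half = world_size / 2.0                 # world assumed to span [-half, +half] on each axis
    return [
        (-half + (col + 0.5) * spacing, -half + (row + 0.5) * spacing)
        for row in range(cells_per_side)
        for col in range(cells_per_side)
    ]


# Usage: a 21 x 21 grid gives 441 cells, all inside the 100m x 100m bounds.
positions = cell_positions(21)
```

Deriving the spacing from the world size, rather than fixing the spacing and letting the grid grow, keeps every cell inside the worker domains regardless of how many cells are requested.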

Honours Project Development Blog – s2 – 7

After being disheartened at the end of last week, facing a seemingly unfixable problem, I had a reasonably positive start, uninstalling, deleting and reinstalling things left, right and centre until finally I had it functional again. Unfortunately, the next day I ran into the same issue and could not work out what it was that I had done that had fixed it. The first thing I tried was the last thing I had done before, which involved reinstalling the JDK, but that proved fruitless. I happened across the solution whilst procrastinating on Twitter:

cliCorruptionTweet

It was a totally random tweet from the lead technical writer at Improbable, using the internal bugfix note for my bug as an example of something else entirely. This gave me what I needed to track down the cause of the bug and its fix (the bug was technically Improbable’s fault and not mine, which made me feel a little relieved).

Now that I could continue working properly on the project, I linked the UI to the player entity which is spawned in when a client connects to the deployment. Values on the UI can be changed by the user, and when the reset button is pressed the simulation is restarted using those values. Pressing the reset button causes the player to send out a global SpatialOS query which returns all entities with an ‘Initialiser’ SpatialOS component. This map of entities is then used to send a command to each entity with the new simulation properties to be used. When the entities receive this command, their properties are updated according to the received values, thereby restarting the simulation. This has been tested locally and works as expected.

Honours Project Development Blog – s2 – 6

I began working on the client-side interaction of the simulation to aid with my debugging. Being able to reset the concentration values of the cells whilst the simulation is running will help me visually identify how the simulation is behaving. I created the UI for this interaction system and began working on the framework.

I was making good progress in this area until late in the week, whilst going through one of the many stages of the build pipeline, SpatialOS decided it would stop doing the codegen of its bespoke Schemalang language, saying the compiler was missing (which appeared to be true, even though I’d definitely never even been to its directory, let alone deleted it from that location). I was unable to repair spatial so had to uninstall it and reinstall. This then turned up more errors which are completely inexplicable – I can’t see any reference to the errors online. Currently I am completely unable to use any of the spatial commands through the CLI and the SDK is completely non-functional in any Unity project.

Honours Project Development Blog – s2 – 5

Having already created a simulation and converted it into a SpatialOS deployment, the process of doing the same for this simulation was relatively painless. I was very quickly able to run a local deployment of the simulation, and was also able to run a cloud deployment using a single virtual machine without too many issues. I did notice, however, that the performance was lower than that of the implementation in a regular Unity application.

I began expanding the simulation to include multiple workers, initially using a local deployment; however, this was really quite difficult. The performance overhead per local worker meant that even a tiny simulation would quickly max out my laptop’s CPU usage, making debugging of the simulation slow.

The solution for this was to upload the deployment to the cloud. Unfortunately, building, uploading and processing every deployment so that it was ready to run on the cloud was a lengthy process, which made progress extremely slow and frustrating.

The main issue I was having was with the spatial layout of the workers and how their domains are calculated and assigned. This is managed through the SpatialOS deployment configs, specifically the load balancing configs within them. What these configurations essentially control is how many workers are active, what size their domains are, where in space their domains are and how they are arranged relative to one another. These configs can’t really be tested without deploying, and some of the parameters can have unexpected effects. This meant that multiple uploads were required to get the load balancing configurations right.

The best layout discovered for this implementation was, counter-intuitively, a random placement with a small maximum domain size. Worker death was still very high due to overloaded workers, but the simulation did run. The worker overhead and the amount of network communication occurring for the updates of the bodies meant that this deployment was still slow in comparison to the standard Unity implementation.

Honours Project Development Blog – s2 – 4

I began this week by analysing the source code provided with GPU Gems 2 to try and understand the GPU implementation of the reaction diffusion algorithm. Unfortunately I found this task to be gargantuan: a large percentage of the considerable amount of source code was more related to the operation of the underlying framework, and the diffusion algorithm seemed to be unhelpfully split across many long and opaque functions. After a few hours trying to understand the operation of the algorithm as described here, I decided to look elsewhere.

I realised upon looking further afield that nearly all implementations of this have been created for GPUs. It makes sense as this algorithm is perfect for performing in parallel on a texture, which makes a CPU implementation not particularly worthwhile. However, my aim is not to improve on the simulation using SpatialOS, my aim is to use a simulation to identify performance bottlenecks of SpatialOS and how these can be designed around.

I decided to write my own implementation of the algorithm using the equations and descriptions provided here. As I would not be relying on any kind of global manager and have all bodies self-managing, I would need to create a neighbour searching algorithm which would identify the physical neighbours of the cells. This initially began as 8 raycasts aligned to the compass points to detect neighbours. This was an extremely expensive operation to perform each frame and so I redesigned the neighbour searching to use a sphere cast. This improved the performance considerably, although it is still by far the least performant part of the simulation.
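To illustrate the idea behind replacing eight directional checks with a single radius-based check, here is a minimal Python sketch. It is a plain distance test standing in for Unity's physics query, not the project's code; the function name `find_neighbours`, the 2D positions and the radius value are all assumptions for illustration.

```python
import math


def find_neighbours(index, positions, radius):
    """Return the indices of all cells within `radius` of the cell at `index`."""
    cx, cy = positions[index]
    return [
        i
        for i, (x, y) in enumerate(positions)
        if i != index and math.hypot(x - cx, y - cy) <= radius
    ]


# Usage: a 3 x 3 grid with unit spacing. A radius of 1.5 covers the diagonal
# neighbours (distance about 1.41) but excludes cells two steps away (distance 2).
grid = [(col, row) for row in range(3) for col in range(3)]
centre_neighbours = find_neighbours(4, grid, 1.5)
corner_neighbours = find_neighbours(0, grid, 1.5)
```

This brute-force version compares against every cell, so it only shows which cells count as neighbours; a physics engine's spatial partitioning is what makes the equivalent query cheap in practice.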

In general, the algorithm was relatively straightforward to implement. I created this in a blank Unity project, as I wanted to be sure the simulation behaved as expected without the complication of SpatialOS. The result is very slow, however, and unable to compute many cells, especially compared to the possibility of using a texture on a GPU to perform the same function. The biggest bottleneck, as described above, is the neighbour searching.

localDiffusionFunction

Above I have my algorithm where da is the diffusion rate of chemical a and db is the diffusion rate of chemical b. Additionally k is the kill rate and f is the feed rate.
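For reference, a minimal Python sketch of the standard Gray-Scott update for a single cell, using the same symbols (da, db, f, k). This is the textbook form of the equations rather than a copy of my Unity function, and several things are assumptions made for illustration: the `laplacian` helper approximates diffusion as the mean of the neighbours' values minus the cell's own value, `dt` is a hypothetical time step, and the parameter values in the usage lines are arbitrary.

```python
def laplacian(own, neighbour_values):
    """Assumed approximation: mean of the neighbours' values minus the cell's own value."""
    if not neighbour_values:
        return 0.0
    return sum(neighbour_values) / len(neighbour_values) - own


def gray_scott_step(a, b, lap_a, lap_b, da, db, f, k, dt=1.0):
    """Advance one cell's concentrations of chemicals a and b by one time step."""
    reaction = a * b * b                                        # a is converted into b
    new_a = a + (da * lap_a - reaction + f * (1.0 - a)) * dt    # diffuse, react, feed
    new_b = b + (db * lap_b + reaction - (k + f) * b) * dt      # diffuse, react, kill
    return new_a, new_b


# Usage: one step for a cell whose neighbours hold the same values (no diffusion).
lap_a = laplacian(0.5, [0.5, 0.5, 0.5, 0.5])
lap_b = laplacian(0.5, [0.5, 0.5, 0.5, 0.5])
next_a, next_b = gray_scott_step(0.5, 0.5, lap_a, lap_b, da=1.0, db=0.5, f=0.055, k=0.062)
```

Because each cell only needs its own values and those of its immediate neighbours, the update can run independently per cell without any global manager.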

Hopefully by putting this algorithm onto the cloud, with multiple workers managing the simulation, the performance can be significantly improved.

Honours Project Development Blog – s2 – 3

I continued where I left off, trying to come up with a new simulation model that would take advantage of SpatialOS’s design. The frustrating thing is that I can easily come up with solutions that would work great: city simulation, ant colony simulation, crowd simulation, etc. But these are huge undertakings in terms of developing the behaviour of the agents themselves, let alone the performance analysis I wish to conduct on the system. The limitation of having to have agents which are entirely self-sufficient, or at most only depend on their immediate vicinity, means that simulations of this type would usually have complex behaviours of the agents, which will take far too long to implement. This was why the n-body simulation was initially chosen, for its simplicity to implement. Unfortunately, the issues encountered were hard to predict.

I spoke with my supervisor, Ruth, about my findings and problems. She agreed that it would be a good idea to look for a different simulation model. It was suggested to investigate the diffusion-reaction simulation, specifically the Gray-Scott algorithm. This simulation, usually performed on a GPU, uses texture data where each pixel looks at the neighbouring pixels. This is perfect for the SpatialOS platform as there is no reliance on data from objects in distant parts of the world. There is no need for a manager if each cell is responsible for its own neighbour searching. It was suggested to look at an existing implementation which is given in GPU Gems 2