By martinmine on Jul 4, 2022 at 7:10 PM
  1. martinmine

    martinmine Server Engineer
    Staff Member Administrator

    Joined:
    Aug 6, 2019
    Messages:
    378
    Likes Received:
    266
    Trophy Points:
    63
    On the night to 30th of June, the random number generator on the Celeste server broke. Players received only vanity gear from quest rewards, and Moe’s gave out only chests that had legendary gear. This is the story of what happened behind the scenes that night, solar flares, and how two things happening at the same time can seriously break stuff.

    Behind the Scenes is a technical blog series that goes into what happens "behind the scenes" in project Celeste.

    Six Hours of Vanity Traits and Legendary Gear
    The first symptom we saw was that the player ByFstugan messaged PF2K that something was awfully wrong with Moe’s. He then posted a message at 03:22 on Discord to me and Kevsoft:

    [​IMG]

    This was also the message I woke up to on my machine as I was getting ready to head out to work (this was on a Thursday). However, I was not able to do much about the situation other than suggesting switching off Moe’s for now. As I got to work, PF2K made an executive decision to shut down the server to prevent any other damage happen to the game until we could figure out what was going on. Kevsoft also suggested that we could run a loot table simulation to understand what was happening with the loot tables after having restarted the server.

    In the game, players can get randomized items from various locations such as loot chests and quest rewards. The chances for each item and how they should drop is expressed in form of a loot table. The loot table simulation would test out the randomization so that we can confirm that an item would have the correct drop rate.

    PF2K turned the server back online again, but this time with maintenance mode enabled which prevents players from entering the game except a specific set of players. As this was happening, I was trying to reason about what had gone wrong with the game while having a conversation with my co-workers at the coffee machine. Neither of these was an easy task this early in the morning. We decided to establish a war-room channel on Discord to have all information gathered in one place.

    After doing some more testing in Moe’s and other loot mechanisms, PF2K could see that everything was working as intended. Moe’s gave out correct gear, and quest rewards was working as intended.

    [​IMG]
    After PF2K had done some more testing by speed-running some quests, we decided to restore the server to normal operation at 09:21 while closely monitoring if this would suddenly start happening again.

    The Investigation
    After getting back home from work, I started the work on analyzing what had gone wrong. The first goal was to get together all the facts we knew (the symptoms) and establish a timeline of what had happened. This makes it easier to get a clear overview of the incident. From the player reports we knew the following:
    • Moe’s gave out a specific set of rewards; The Lucky Chest gave out 1C coin sacks, the lvl 40 treasure chest gave out vanity gear, while the rest consistently gave out legendary items.
    • All quest rewards gave out vanity gear
    The first step was to check the server logs if anything was screaming in the error logs or if anything else was obviously off. But there were no errors or anything that screamed “this is broken” in the logs. I could see in the chatlogs that players were confused they only got vanity gear from their quest rewards. Looking at another log of what each player had received from a quest, I could very clearly see when this had started happening. Using this information, I could establish when this incident started happening and establish a full timeline:

    02:52 – First vanity gear reward given out to player
    03:22 - First report of broken drop rate (DM to PF2K)
    08:32 - Server shutdown
    08:40 - Starting server
    08:44 - Server online
    09:21 – Decided to restore the server to normal operations

    At this time, I could start creating theories about what had gone wrong and look for evidence to back up these theories. Some of the theories that we had briefly discussed were:
    • We had run out of item IDs which lead to consistent loot – This seemed unlikely as the players would experience this differently and probably not receive an item.
    • Broken loot tables because of an incorrect server configuration or something was accidentally pushed to git. – Unlikely as the server does not automatically pick up loot table changes unless it is restarted.
    • A bug with the random number generator – Seems highly unlikely as there is no good reason why it would suddenly stop working.
    None of the theories seemed to have a good fit, so I decided to dig more into the server code to try and reproduce the conditions of what had happened and look for patterns.

    I started investigating the quest drops a bit closer. The way the loot tables are implemented is that vanity gear is given out if the loot table does not contain any item to give out. This was an indication to me that the loot tables were configured incorrectly. However, the loot tables were working fine up until 02:52. Was there anything special happening at this specific time? There had been no changes to the loot tables themselves – the “last modified” attribute on the loot table configuration file at the server could confirm this. Could there be a background job that have broken and made the loot tables corrupt? We don’t have any background jobs that modifies the loot tables, so this seemed unlikely. Not being able to see anything that would modify the loot tables started to frustrate me and I started getting concerned that we might not be able to find out what had happened.

    If nothing inside the server itself, can something outside of it have modified the memory? There is a chance that cosmic radiation can alter memory. This had an effect on the election results in Belgium where cosmic radiation altered the counter that held the vote count. Out of curiosity, I checked out how the space weather had been over the last period, where I could see that there had been some solar flare an hour before we started getting issues:

    [​IMG]

    This wasn’t a big solar flare, nor was there anything in the news about other people experiencing issues because of increased cosmic activity. While there is a certain chance for this to happen, it is extremely unlikely as there are several layers of protection and mechanisms to mitigate this problem on a hardware level. Besides, if it had happened it would be very strange if it had hit the loot-table mechanism in our server and nothing else. I put this very complex and unlikely theory aside and instead start to look at Moe’s.

    After playing a bit with how Moe’s is implemented, I quickly noticed that the items that were reported by the players were the first configured item in the loot tables. By selecting the first item in the loot table, I could exactly reproduce the conditions that ByFstugan reported regarding Moe’s. Then the question arose: Why would only the first item be selected? The random generator then got into spotlight. I decided to try and rig the number generator (only on my test server of course) to see if that could give out any clues. This was it and made it possible to reproduce the exact bug we experience with the loot table. This supported the theory that the random number generator had broken down – it was only outputting 0 instead of a random number. But why?

    The Problem and the Solution
    The server uses a module in .NET called System.Random. A quick Google search revealed that it is entirely possible to make this module get stuck and output only 0 if two numbers are generated at the same time. For us, this means that if two players buy a loot chest and the loot is calculated at the exact same time, there is a small chance that this can break the generator, make it get stuck and always return 0. From Microsoft’s documentation on this matter (Random Class (System) | Microsoft Docs):

    If you don't ensure that the Random object is accessed in a thread-safe way, calls to methods that return random numbers return 0.

    Thread-safe in this case means the ability to generate two random numbers at the same time. On your computer, each application consists of multiple threads where whatever code each program needs to run is performed on a thread. As processors have multiple cores that each can execute threads in parallel, multiple things can happen at the same time. The solution to this problem was to have one random number generator per thread.

    The fact that the random number generator is not thread-safe is something I already partially knew about, however I was not aware that it can get stuck if a generator was creating two random numbers at the same time. Allowing support for thread-safe code often comes at the cost of performance. My initially assumption was that the generator was not thread-safe in that case it could create two numbers that are identical (and not random) if it was asked to make two numbers at the same time. The chances for this to happen in the first place is not very high – and this has been entirely possible to happen for over a year, but here we are thanks to Murphy’s law.

    Final Words
    It is funny to think about how complex computer systems can become. In cases like these, it can be frustrating when something important breaks and we do not fully understand why. Thanks to community interaction, we could much easier pinpoint what went wrong (thanks a lot to ByFstugan!) without having to scratch ourselves too much in the head. I hope it has been interesting to see a bit how we work behind the scenes and get a peek into how we handle incidents at Celeste.

    What do you think about the blog? Let us know in the comments below!

    ========

    All current and future content will stay 100% free and accessible to everyone.

    ========
    Thank you so much for reading, and we will see you all next time!

    Project Celeste Development Team
    ========
    The Romans are here! Watch the Overview Trailer.

    Project Celeste is completely free and always will be. However, we gladly accept donations for our overhead costs, which are larger than we have budgeted. If you want to support us, you can do so HERE.

    Read every Romans related blog HERE.
    Read every Indians related blog HERE.

    Tell your friends! Join our Discord HERE.
    ========
     
    Taffetatree01, PF2K, Aryzel and 7 others like this.
  2. Loading...


Comments

Discussion in 'News' started by martinmine, Jul 4, 2022.

  1. This site uses cookies to help personalise content, tailor your experience and to keep you logged in if you register.
    By continuing to use this site, you are consenting to our use of cookies.
    Dismiss Notice