Welcome to Jester's Trek.
I'm your host, Jester. I've been an EVE Online player for about six years. One of my four mains is Ripard Teg, pictured at left. Sadly, I've succumbed to "bittervet" disease, but I'm wandering the New Eden landscape (and from time to time, the MMO landscape) in search of a cure.
You can follow along, if you want...

Friday, July 5, 2013

Embarrassing hand-me-downs

Yesterday was an EVE Online-free day for me, the first one that I can recall in a good long while.(1)  As a result, I missed all of the excitement in Z9PP-H.  Poetic Stanziel does a great job covering the basic facts and the aftermath.  EVE News 24 syndicated that piece, and added the really amusing video below and some brief commentary.  TMC has no fewer than four articles about it, of which I think this little opinion piece is the most interesting.  And finally CCP Falcon explains what happened from a straight factual perspective.  But I thought it'd be fun to summarize the facts, then meditate on them a bit.

Semi-short version?  "Asakai 2.0" broke out in Z9PP-H, an unintended and quickly escalating battle between TEST and Goon capital ships that committed more than 2200 players to a single unreinforced system.  Meanwhile, the fight was also being streamed on twitch.tv, with something near 5000 people watching the mayhem.  Normally this kind of fight would be given its own reinforced node.  But neither side intended in advance for the battle to get this big.  The miracle of 10% time dilation, though, means that once a major engagement starts, more and more reinforcements can be brought into a fight without the fight itself advancing too much.  In the 20 minutes that it takes to get a cyno ship or a gang into a system from 15 jumps away, only two minutes will have passed in the system itself.
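The arithmetic behind that last point is simple enough to sketch. This is a minimal illustration only: the function name and the `tidi_factor` parameter are made up for this example, and the only number taken from the post is the 10% floor.

```python
def in_system_seconds(wall_clock_seconds: float, tidi_factor: float = 0.10) -> float:
    """Simulated time that passes inside a dilated system.

    tidi_factor is the fraction of normal speed the system runs at
    (0.10 means 10% TiDi). Hypothetical helper, not CCP's code.
    """
    if not 0.0 < tidi_factor <= 1.0:
        raise ValueError("tidi_factor must be in (0, 1]")
    return wall_clock_seconds * tidi_factor


travel_minutes = 20  # time a reinforcement gang spends getting to the fight
elapsed_minutes = in_system_seconds(travel_minutes * 60) / 60
print(f"{travel_minutes} min of travel is about {elapsed_minutes:.0f} min inside the system")
```

So a gang that needs 20 minutes to arrive has missed only about two minutes of the fight, which is why escalation is so easy under heavy TiDi.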

Therefore, this kind of escalating fight lends itself well to even more escalation since you don't have to miss the fight and the action will still be going on when you get in there.  But it's not something that can be predicted and it doesn't lend itself well to the "fleet fight notification" form that CCP has for reinforcing nodes under these conditions.  The easy solution would be to simply put every system on the Fountain border on its own node, but I get the impression CCP doesn't have near the server hardware to do this.  Instead, each node hosts a number of systems, only sometimes systems that are associated with each other "geographically."

Caught by surprise -- and probably at least slightly motivated by the large number of twitch.tv viewers -- CCP decided to remap every system on the node Z9PP-H was using, except Z9PP-H.  That would give the system its own node and relieve some of the pressure.  Only due to a miscommunication or maybe a typo (or maybe both), every system on that node was remapped, including Z9PP-H.  When that happens, the players involved are logged off the server and have to be relogged in... without the aggression timer that normally occurs when you log out of the game while in combat.

Z9PP-H was remapped first and quickly brought back on-line.  Only the side that was losing -- TEST in this case -- simply ordered their pilots not to log back in.  At the time, TEST was facing the destruction of some 100-odd carriers in the system and the Goons had a strong upper hand.  The remap allowed TEST to make the strategic decision to use :ccp: to safely extract their assets from this situation.  Without their normal PvP timers, there was little motivation for them to come back in.  As you might expect, the Goons are not pleased.  An overwhelming victory in that battle might very well have put TEST on the run, at least for a while.  And the twitch.tv stream of course stopped dead in its tracks, which was kind of embarrassing for CCP.

Separate from the Skype channels that we use to talk to CCP, CSM8 has a private Skype channel that we use for informal internal discussion.  A week or so back, after another 10% TiDi fight, that channel erupted into a kind of one-off discussion: has EVE outgrown its architecture?  After all, the original game was never intended to support thousands of people in a single system, and hundreds of people in a single fight.  And yet, that's happening routinely in Fountain right now and every time it does, strange things are happening.  I've started joking that when a thousand Goons flap their wings in Fountain, you get 10% TiDi in Syndicate, but that's not really a joke.  It does happen.

EVE is now ten years old.  But the comical mental image that I can't shake is a short fat 10-year-old kid trying to fit into the hand-me-down clothes of a tall skinny 15-year-old brother.  And the clothes just... don't... fit.  Seams are pulling and zippers are breaking and buttons are straining all over the place... and every once in a while a button pops explosively free and hits someone in the eye.  But these are the colossal fights that are EVE's differentiating factor in the MMO market and so are what CCP uses to market the game much of the time.

I hear all kinds of rumors about EVE's architecture (usually reinforced by a couple of leaks of EVE's server source code over the years) that make me wonder: what would EVE 2.0 look like?  CCP is always quick to say that they don't need an EVE 2.0 because they can "refit the ship at full speed", as it were.  But wouldn't it be kind of amazing for CCP to take :18months:, rebuild EVE's underlying architecture, and then overlay on top of it all of the game mechanics, art, and graphical effects we've come to know and love?  I doubt it'll ever happen.  But it's fun to think about.

In between dodging flying buttons.

(1) This wasn't entirely voluntary.  I had both a work issue and a personal issue come up... resolved now, though!


  1. Do you realize the scale of what you are saying, to let go of the Python code? Because that's one of the things that would need to go to lift the current restrictions.

  2. If nothing else, it would seem to be a good idea to develop some way to freeze-frame a system for a small period of time so it can be moved to a hypothetical super node and resumed without the typical DT-like disconnect.

    1. It should be doable, without TOO much technical hassle. VMware can duplicate an entire running system, move it halfway across the world, and shut down the original VM, without skipping a beat. Players are routed internally through the EVE cluster to the server the system they are in is hosted on, meaning it should be possible to "suspend" their connection while the system is being remapped, then resume and reroute it to the new node, without disconnecting anyone. Ideally, the remapping could be done in the background and transparently, à la VMware, but unlike VMware this is not CCP's core proficiency, so understandably they might not be able to do it.

      If they let go of the Not Invented Here mentality, they might even license some technologies from people who do it best.

      This is not merely a theoretical discussion. Everyone, CCP and players alike, wants to see EVE grow. But if the promise of a single universe is to be maintained, CCP will need to face these issues head on. 1% TiDi is not the answer, and is just a time delay tactic anyways.

  3. Hehe, nice to see Ripard using my 10 minute Photoshop in an article.

    On topic, IRL, I am a fledgling student of computer science, and over the summer I've been taking courses explaining the theory of OOP languages. One of the major benefits of an OOP language like Java, C++, or Python is that it is incredibly modular and easy to expand. When we make a program, we define interfaces to manipulate objects, we define implementations that make the interfaces work, and methods to do the nuts and bolts.

    At any time, some part of the program can be swapped out for another as long as they serve the same purpose. Swapping out specific methods can greatly speed up a program even if the structure is not changed.

    I make no pretenses that EVE is as simple as our learning programs, and fiddling with the supports may very well bring the house down. But, if the underlying code of EVE is well compartmentalized and annotated, improving performance could be as easy as testing implementations and using the best one. It's a lot of work, but it's work that has clear progression and can be handled piecemeal, making it possible for a group like Team BFF to make several optimizations per patch without seriously impacting the core features that CCP wants to add on.
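As a rough illustration of the "test the implementations and use the best one" idea in the comment above, here is a minimal Python sketch. Everything in it is hypothetical: `Summer` stands in for whatever interface a real component would expose, and the two summing functions are toy implementations, not anything from EVE's code base.

```python
import timeit
from typing import Callable, List, Sequence

# The "interface": any callable that takes a sequence of ints and returns their sum.
Summer = Callable[[Sequence[int]], int]


def sum_loop(values: Sequence[int]) -> int:
    """Straightforward implementation: explicit loop."""
    total = 0
    for value in values:
        total += value
    return total


def sum_builtin(values: Sequence[int]) -> int:
    """Alternative implementation with the same contract."""
    return sum(values)


def pick_fastest(candidates: List[Summer], sample: Sequence[int], repeats: int = 5) -> Summer:
    """Check that all candidates agree on the sample, then return the quickest one."""
    expected = candidates[0](sample)
    for candidate in candidates[1:]:
        if candidate(sample) != expected:
            raise ValueError(f"{candidate.__name__} disagrees with {candidates[0].__name__}")
    return min(
        candidates,
        key=lambda c: min(timeit.repeat(lambda: c(sample), number=20, repeat=repeats)),
    )


best = pick_fastest([sum_loop, sum_builtin], list(range(10_000)))
print("using", best.__name__)
```

The important part is the agreement check before the timing: a swap is only safe if the replacement honours the same contract, which is exactly the caveat the comment raises about "fiddling with the supports."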

  4. They spent the time and money on Dust, they have been spending money on World of Darkness, all of which they just HAD to write their own engines for. So it makes me wonder how doing the same for Eve would be any different. All the art and so on should be portable, the fundamental system modules should also be something that can be put in and taken out as time passes. Heck, they re-wrote crimewatch and wanted to re-write POS coding, but hey, let's make excuses and just not commit to the long haul. I just can't believe that a serious top down re-write of code to better use current computing power would not be helpful and make everything modular to allow expansion in the future.

  5. Conventia Underking, July 5, 2013 at 10:55 PM

    Two things that I'd like to say about this, both technical:

    1. This is a technical issue and should have a technical solution. a) There should be a check that refuses to remap nodes that have more than X characters in them (for configurable values of X and with an appropriate override). b) As for the typo: a few weeks ago, a coworker typo'd a command used to build his private changes, which resulted in us effectively wasting days of testing. The day after we found this out, a fix was made to the tool which makes this much harder. I would suggest having any tool that deals with inputting system names actually validate that the system exists and is currently in a valid state for that operation (basically, if you want to remap all nodes but Jita from a server, but Jita isn't on that server, the command fails). I hope that you can use these as points to bring up with the devs so this doesn't happen again.

    2. On the topic of EVE 2.0, I don't think it would work. The process that was used with CrimeWatch seems ideal, given the nature of software development, which is that iterative changes are significantly better in the long run than a complete rewrite, if the system is large enough. CrimeWatch in itself was a rewrite, but it's small enough that rewriting works. While I have never worked on a project that does this, I've heard of the approach of breaking a system into relatively small components, so that rewriting each one is fairly quick, and then planning to have each component rewritten every 18 months. That makes sense to me in theory, but I've never seen it in practice. I only hope that CCP continues to invest in revamping and rewriting crufty old code. (Note: I don't think this remapping issue is a case of 'crufty code', it's just a case of tools being written without the appropriate checks that tools like this should have, even if they are found 'in production'.)
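The two checks proposed in the comment above (refuse to remap a busy system without an override, and fail if the "keep" system is not actually on the node) can be sketched in a few lines. This is a hypothetical illustration: the `Node` structure, the `plan_remap` function, the `max_pilots` threshold and the placeholder system names are all assumptions, not CCP's real tooling.

```python
from dataclasses import dataclass
from typing import Dict, List


class RemapError(Exception):
    """Raised when a remap request fails validation."""


@dataclass
class Node:
    name: str
    systems: Dict[str, int]  # system name -> pilots currently in that system


def plan_remap(node: Node, keep: str, max_pilots: int = 100, force: bool = False) -> List[str]:
    """Return the systems to move off `node`, leaving only `keep` behind."""
    # Check 1: a typo in the system name must fail loudly, not remap everything.
    if keep not in node.systems:
        raise RemapError(f"{keep!r} is not hosted on node {node.name!r}")
    to_move = sorted(name for name in node.systems if name != keep)
    # Check 2: moving a system disconnects its pilots, so refuse busy ones
    # unless the operator explicitly overrides.
    busy = [name for name in to_move if node.systems[name] > max_pilots]
    if busy and not force:
        raise RemapError(f"refusing to move busy systems {busy}; pass force=True to override")
    return to_move


node = Node("node-42", {"Z9PP-H": 2200, "QUIET-1": 3, "QUIET-2": 0})
print(plan_remap(node, keep="Z9PP-H"))  # only the two quiet systems are moved
```

With these checks, a mistyped system name raises an error before anything is moved, instead of silently turning "everything except Z9PP-H" into "everything."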

  6. Sounds like someone took a "mental health" day. Good for you.

    Great work keeping us informed with CSM work. Must be quite taxing...greatly appreciated, though.

  7. "But wouldn't it be kind of amazing for CCP to take :18months:, rebuild EVE's underlying architecture, and then overlay on top of it all of the game mechanics, art, and graphical effects we've come to know and love?"


  8. To say 18 months for server rebuild from (near) scratch is optimistic is... understating, heavily. Maybe I'm wrong and it's a relatively simple matter of doing it right, but when you have a game this big, it would probably take weeks if not months just to migrate everything (never mind write the code itself).

  9. I would forgive CCP on a lot of this stuff, but they market 1000 v 1000 dude fights. Shit happens.

  10. I see this as an issue the CSM can take up. It is rather apparent CCP needs tools and procedures for handling this. As an experienced admin, I know not to make a big all inclusive change, from which there is no going back.

    Over the years I have learned that small changes, even in live, are the way to go until you are confident in your syntax. GUI tools and visual confirmations of the changes that you are making can give you confidence in what you are doing.

    I don't know how many systems are on a node in null, but had CCP decided to remap the systems at a slower pace, things would have been better. Sure, the effects would have been slower, but the battle would have raged on.

  11. "battle between TEST and Goon capital ships"

    Just going to say that no CFC caps were committed during the fight.

  12. That would take a hell of a lot longer than :18months:

  13. It's an interesting theory BUT

    Why would you do this for the same old -lame- mechanics?

    Realer physics (not sub battles in space), shooting (ie EVR stuff) won't support the battle sizes you are seeing today.

    Orbital mechanics? Hiding behind asteroids or other bodies in space?

    I question whether players would be able to adapt.

    For example - Local is the biggest joke in Eve.

    Pvp game where you have perfect knowledge of who could attack you at any time...

    Yet a huge chunk of the population can't even conceive of living without it.

    Can they even imagine an engine or game change where they didn't get a complete target list whenever entering the local grid?

    How do you expect Eve to have a new engine and new mechanics with the current crop of players?

  14. In a day and age where the only single core processors available are microcontrollers and any server grade hardware is parallelized from the ground up, it is obvious that CCP is on the back foot in that their server side code is still strictly single threaded.

    TiDi is a great improvement, and so are their efforts to relocate solar systems to "reinforced nodes" (which are basically just servers with a higher-clocked CPU so that they can thrash that single core a bit harder).

    But the fact remains that this architecture does not scale much further and we can really only hope that CCP has realized this and has a super-secret internal program running in high gear to get the "solar system kernels" out of the stone age of single-threading and into the 21st century.

    1. In a system where ALL of your workload, except for housekeeping, must be executed in strict order, there are very few opportunities for parallelization. We don't know the extent of the single-threadedness to begin with. Is it node-wide? Star system-wide? Grid-wide? I don't see a single thread queue being able to be broken down any lower than grid level. Events on grid must be executed in order. Outside the grid, doesn't really matter.

      Regardless, TiDi needs to affect the entire star system equally, I think, otherwise some really weird stuff would begin to happen, so parallelizing on a smaller than star system-scope doesn't buy you very much. All the threads in the star system would be tied to the same clock.
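The ordering constraint described in this reply can be sketched as one ordered queue per grid, with different grids handled independently inside the same tick. This is a toy model under stated assumptions: the event shape `(grid_id, ship, hp_delta)`, the 0 to 100 hit-point rule and the function names are invented for illustration. Note also that in CPython a thread pool like this does not give real CPU parallelism for Python code because of the global interpreter lock; the sketch only shows where the ordering boundary sits.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Event = Tuple[str, str, int]  # (grid_id, ship, hp_delta), in arrival order


def process_grid(events: List[Tuple[str, int]]) -> Dict[str, int]:
    """Apply one grid's events strictly in order.

    Order matters: a ship that reaches 0 HP is destroyed and ignores
    any repair that arrives afterwards.
    """
    hp: Dict[str, int] = {}
    for ship, delta in events:
        current = hp.get(ship, 100)
        if current == 0:
            continue  # already destroyed
        hp[ship] = max(0, min(100, current + delta))
    return hp


def run_tick(events: List[Event]) -> Dict[str, Dict[str, int]]:
    """Split one tick's events by grid, then process the grids independently."""
    per_grid: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for grid_id, ship, delta in events:
        per_grid[grid_id].append((ship, delta))  # preserves per-grid order
    with ThreadPoolExecutor() as pool:
        futures = {g: pool.submit(process_grid, evs) for g, evs in per_grid.items()}
        # Every grid finishes before the tick ends: one shared clock per system.
        return {g: f.result() for g, f in futures.items()}


tick = [("grid-1", "carrier", -100), ("grid-2", "scout", -30), ("grid-1", "carrier", 50)]
print(run_tick(tick))
```

Waiting for every grid before returning is the "same clock" point from the reply: the slowest grid still sets the pace for the whole system, so splitting below star-system scope buys less than it first appears.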

  15. Summary of events from a PL perspective:


    TL;DR of the above in gif form via Reddit:


  16. Well.. it wasn't a failure of architecture here, but of CCP systems allowing a simple typo to go through unchecked. That type of QC can floor a 10 week old system, let alone a 10 year old one.

    Pretty sure their actual servers are a wee bit newer than 10 years old, too ;-)

    I would agree, however, that CCP clearly have a massive aging code issue, but this rears its head in terms of expanding on features - I think the code WRT server architecture is likely industry leading.

  17. I think architectural changes are really hard for them. I'd assume a rewrite would by far be the easiest/cheapest solution, but that comes with the cost of breaking "perfectly" working stuff and with the difficulty of developing the new architecture while also incorporating the ongoing changes of requirements not only into the current system but also into the new one. So incremental changes while maintaining a working code base would be preferable. But I think it is quite possible, depending on the state of the current code base, that it's just not feasible with acceptable costs.
    Otherwise they would probably have done it already, I'd assume. Because if I take a look at the tasks the system has to support, a lot of them could be either perfectly parallelized or at least made to scale near-linearly with some tricks. So support of much larger battles should be a non-issue while probably less server hardware would be needed at the same time, because of better utilization/load balancing.

    Do you know if they have a team working on improving the core engine/architecture? I really would love to work on that code myself :D.

  18. Oh, maybe CCP will look into it when all the old cruft has been reworked, but at the pace they're doing it now, it will take many years.

    I expect most of the old code is so intertwined and dependent on the components that must be replaced that there is no simple way to just transplant a new client-server layer that would actually fix things - they did a rework a number of years ago (don't remember the name), but that was just the low-level sockets fluff.

    Plus... well, the server runs mostly in Python, with the global interpreter lock forcing them to run it single-threaded (yes, bits are in C/C++, but the GIL still ruins everything). Actually "fixing" the performance issue would either require rewriting *all* the server components into something amenable to multithreading (thus disconnecting the server code completely from the client code), or segregating even more functions into separate server processes. None of the options are very convenient, so here we are.

    It *does* surprise me that CCP isn't trying somewhat more novel (... twenty years ago) solutions, such as building lists of bounding spheres and offloading that to separate threads/a GPU for collision detection, instead of grinding over an O(n^2) problem space, where n is getting depressingly big. And I'm sure they could split the pew pew part into threads too, but since the ship simulation is in Python, the GIL bites them in the ass again. I'm sure whoever picked Python back in the early 2000s is feeling REALLY proud of himself now.

    1. As painful as it might sound, they might want to look into finding guys familiar with Erlang for this. It's really good at doing massive threading tasks and clustering across multiple nodes. No matter what their solution though, a fully automated live migration system really needs to be in place soon or the huge scale wars that get attention on the game are going to become the reason everybody avoids the game.

    2. Yes Cor, just moving to one grid per server thread would do a LOT of good. They started a few years ago moving market to separate nodes, but it's not enough.

      Grids need to be separated from the Sol code + they need to support live migration.

      How effing RETARDED is it that CCP, company doing this for 10 years now cant even do live server migration ??!?!
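To make the O(n^2) collision-detection point from comment 18 concrete, here is a small sketch comparing the all-pairs check with a uniform-grid broad phase over bounding spheres. The uniform grid is a swapped-in, commonly used technique chosen for brevity; it is not what the commenter literally proposed (threads or a GPU) and nothing here is based on how EVE's server actually works. The `cell_size` parameter and function names are assumptions.

```python
from collections import defaultdict
from itertools import combinations, product
from math import floor
from typing import List, Set, Tuple

Sphere = Tuple[float, float, float, float]  # (x, y, z, radius)


def overlaps(a: Sphere, b: Sphere) -> bool:
    """True if two bounding spheres touch or intersect."""
    dx, dy, dz = a[0] - b[0], a[1] - b[1], a[2] - b[2]
    reach = a[3] + b[3]
    return dx * dx + dy * dy + dz * dz <= reach * reach


def naive_pairs(spheres: List[Sphere]) -> Set[Tuple[int, int]]:
    """Baseline: test every pair, O(n^2) overlap checks."""
    return {
        (i, j)
        for i, j in combinations(range(len(spheres)), 2)
        if overlaps(spheres[i], spheres[j])
    }


def grid_pairs(spheres: List[Sphere], cell_size: float) -> Set[Tuple[int, int]]:
    """Broad phase: only test spheres that share at least one grid cell."""
    cells = defaultdict(list)
    for index, (x, y, z, r) in enumerate(spheres):
        lows = [floor((c - r) / cell_size) for c in (x, y, z)]
        highs = [floor((c + r) / cell_size) for c in (x, y, z)]
        # Register the sphere in every cell its bounding box touches.
        for cell in product(*(range(lo, hi + 1) for lo, hi in zip(lows, highs))):
            cells[cell].append(index)
    found: Set[Tuple[int, int]] = set()
    for members in cells.values():
        for i, j in combinations(members, 2):  # members are in ascending index order
            if overlaps(spheres[i], spheres[j]):
                found.add((i, j))
    return found


ships = [(0.0, 0.0, 0.0, 1.0), (1.5, 0.0, 0.0, 1.0), (50.0, 50.0, 50.0, 2.0)]
print(grid_pairs(ships, cell_size=5.0))  # same answer as naive_pairs(ships)
```

When ships are spread out, most cells hold only a few spheres and the number of overlap tests drops well below n^2; when everyone piles onto one spot, as in a big fleet fight, the grid degrades back toward the all-pairs case, so this is a mitigation rather than a cure.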

  19. We need to give CCP the breathing space to fail like this more often while they sort out more reliable architecture and process migration strategies. In CCP Fozzie's thread I suggested that aggression timers should last over DT, since the purpose of those timers is to prevent people using the Logoffski defence. In the meantime, switching sol sims between nodes should be accompanied by "ship in a box" and "brain in a box" proxies that allow the client processing to be paused, the sol to be shut down and moved, then repopulated with the ships on grid, targets re-established, and client processing resumed.

    A possible bridge strategy would be to add notification in the client that sol remapping is occurring on the node which this client's character is using. At least this way people can be forewarned that a hazard to shipping and navigation is imminent :)

  20. lol the real question Jester is do they even know how the old code works? can they rebuild it? the way they talk about sky net I get the impression they don't want to touch it at all lol

