• 0 Posts
  • 22 Comments
Joined 1 year ago
Cake day: June 9th, 2023




  • It’s not the company it once was, but there are also a lot of outrage-bait headlines about it that don’t hold up well to scrutiny.

    For instance, there have been a lot of Lemmy posts about Chrome supposedly removing the APIs used by adblockers. I figured I’d validate that on my own by switching to the version of uBlock that is based on the new API. Well… As it turns out, it works fine. It’s also faster.

    Mind you, figuring out the actual facts behind each post gets exhausting, and people just shutting down and avoiding the problem space entirely makes some sort of sense. That, and it is healthy for an ecosystem to have alternatives, so I’d keep encouraging usage of Firefox and such if only on that basis.


  • Balinares@pawb.social to Asklemmy@lemmy.ml: Crowdstrike Cockup
    2 months ago

    This is actually an excellent question.

    And for all the discussions on the topic in the last 24h, the answer is: until a postmortem is published, we don’t actually know.

    There are a lot of possible explanations for the observed events. Of course, one simple and very easy to believe explanation would be that the software quality processes and reliability engineering at CrowdStrike are simply below industry standards – if we’re going to be speculating for entertainment purposes, you can in fact imagine them to be as comically bad as you please, no one can stop you.

    But as a general rule of thumb, I’d be leery of simple and easy to believe explanations. Of all the (non-CrowdStrike!) headline-making Internet infrastructure outages I’ve been personally privy to, and that were speculated about on such places as Reddit or Lemmy, not one of the commenter speculations came close to the actual, and often fantastically complex, chain of events involved in the outage. (Which, for mysterious reasons, did not seem to keep the commenters from speaking with unwavering confidence.)

    Regarding testing: testing buys you a certain necessary degree of confidence in the robustness of the software. But this degree of confidence will never be 100%, because in all sufficiently complex systems there will be unknown unknowns. Even if your test coverage is 100% – every single instruction of the code is exercised by at least one test – you can’t be certain that every test accurately models the production environments that the software will be encountering. Furthermore, even exercising every single instruction is not sufficient protection on its own: the code might for instance fail in rare circumstances not covered by the test’s inputs.
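To make the coverage point concrete, here is a minimal sketch in Python. The `average` function and its test are made up for illustration and have nothing to do with any real product. The single test executes every line of the function, so a coverage tool would report 100%, and yet an input the test never modeled still crashes it.

```python
def average(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def test_average():
    # This one test runs every line of average(): 100% line coverage.
    assert average([2, 4, 6]) == 4


test_average()

# An input the test never modeled still breaks the fully "covered" function.
try:
    average([])
    crashed = False
except ZeroDivisionError:
    crashed = True
```

Here `crashed` ends up `True`: coverage tells you which code ran under test, not which inputs and environments it was tested against.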

    For these reasons, one common best practice is to assume that the software will sooner or later ship with an undetected fault, and to therefore only deploy updates – both of software and of configuration data – in a staggered manner. The process looks something like this: a small subset of endpoints is selected for the update, the update is left to run on these endpoints for a certain amount of time, and the selected endpoints’ metrics are then assessed for unexpected behavior. Then you repeat this process for a larger subset of endpoints, and so on until the update has been deployed globally. The early subsets are sometimes called “canary”, as in the expression “canary in a coal mine”.
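The staggered process described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not how any particular vendor does it: `staged_rollout`, the `deploy` and `is_healthy` callbacks, and the 1% / 10% / 50% / 100% stage sizes are all hypothetical, and a real system would wait for a bake period and read actual metrics between stages.

```python
def staged_rollout(endpoints, deploy, is_healthy, stages=(1, 10, 50, 100)):
    """Deploy to growing subsets of endpoints, halting at the first bad stage.

    `stages` holds cumulative percentages of the fleet. Returns a tuple
    (updated_endpoints, completed); completed is True only if every stage
    passed its health check.
    """
    remaining = list(endpoints)
    total = len(remaining)
    updated = []
    for percent in stages:
        target = total * percent // 100  # cumulative fleet size for this stage
        batch_size = max(0, target - len(updated))
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for endpoint in batch:
            deploy(endpoint)
        updated.extend(batch)
        # Stand-in for "let it run for a while, then assess the metrics".
        if not all(is_healthy(endpoint) for endpoint in updated):
            return updated, False  # stop before the fault spreads further
    return updated, True


fleet = [f"host-{i}" for i in range(1000)]
deployed = []
# Simulate a bad update: every endpoint that receives it becomes unhealthy.
touched, completed = staged_rollout(fleet, deployed.append, lambda e: False)
```

With a fleet of 1000 and a fault that shows up immediately, the rollout stops after the first stage: 10 endpoints are affected instead of all 1000.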

    Why such a staggered deployment did not appear to occur in the CrowdStrike outage is the unanswered question I’m most curious about. But, to give you an idea of the sort of stuff that may happen in general, here is a selection of plausible scenarios, some of which have been known to occur in the wild in some shape or form:

    • The update is considered low-risk (for instance, it’s a minor configuration change without any code change) and there’s a compelling reason to expedite the deployment, for instance if it addresses a zero-day vulnerability under active exploitation by adversaries.
    • The update activates a feature that an important customer wants now, the customer phoned a VP to express such, and the VP then asks the engineers, arbitrarily loudly, to expedite the deployment.
    • The staggered deployment did in fact occur, but the issue takes the form of what is colloquially called a time bomb, where it is only triggered later on by a change in the state of production environments, such as, typically, the passage of time. Time bomb issues are the nightmare of reliability engineers, and difficult to defend against. They are also, thankfully, fairly rare.
    • A chain of events resulting in a misconfiguration where all the endpoints, instead of only those selected as canaries, pull the update.
    • Reliability engineering not being up to industry standards.

    Of course, not all of the above fit the currently known (or, really, believed-known) details of the CrowdStrike outage. It is, in fact, unlikely that the chain of events that resulted in the CrowdStrike outage will be found in a random comment on Reddit or Lemmy. But hopefully this sheds a small amount of light on your excellent question.




  • Mélenchon is… frustrating.

    He’s the main contender on the limited field of the actual left in France. He’s got a lot of proposals that are actually good and desirable.

    He’s also a narcissist and a populist whose stated approach to achieving his proposals is to denounce treaties he doesn’t like and somehow force other countries to replace clauses with whatever it is he wants.

    He’s also incapable of compromise, and right now busily torpedoing the left wing alliance that won the election because his own party didn’t win enough seats to take charge of the alliance.

    What I don’t know is, how much of the populist/anti-system talk is just talk for political reasons, and whether he would in fact be capable of the nuance required to govern. He might. He might not. He’s clearly smart and charismatic. But he’s also the type to huff his own farts hard enough to mistake the visions for the truth of the world. So… In that respect, pretty much just like Macron.

    France has a big, big problem with overemphasizing individual politicians over policies.



  • Astounding, isn’t it? That’s publicly traded companies for you. The company’s objective is to keep its stock up and up and up. That means shareholders must want to keep buying the stock, which in turn means that the company must demonstrate that its value will keep growing, so that by buying the stock today the shareholders will get a positive return tomorrow.

    Of course, the universe is finite and no growth is forever. The end state for such companies is not bankruptcy, at least not in the short term, but, more or less, the IBM fate: a previously uber-dominant mastodon whose market capitalization is now maybe one tenth that of its modern competitors. The fact that it’s still turning a profit is only secondary: none of the big tech shops want to be the next IBM. Their executives are, after all, mostly paid in stock.

    And that’s how you end up with companies that are making amounts of revenue you and I can’t even comprehend flail in a panic like they’re on the edge of the precipice whenever the technological landscape shifts.

    It’s both fascinating and remarkably dumb.






  • It’s difficult to answer without a better understanding of your customers’ workloads and how those trigger your outages. There’s a bunch of valid angles from which to look at this.

    If your product consistently buckles under customer workloads that they paid to be able to run, it sounds like you have either an underprovisioning or an overcommitment problem.

    If you accept customer workload spikes that you don’t have the resources to serve but would be able to process if they were more spread over time, it sounds like you have an admission control problem.
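For what admission control can look like in its simplest form, here is a minimal sketch. The `AdmissionController` class and its `max_pending` budget are hypothetical names for illustration; real systems usually also account for per-customer quotas and job cost. The idea is just to refuse work beyond what you can hold, so a spike gets spread over time instead of taking the service down.

```python
from collections import deque


class AdmissionController:
    """Admit work only while the backlog stays within a fixed budget."""

    def __init__(self, max_pending):
        self.max_pending = max_pending
        self.pending = deque()

    def submit(self, job):
        """Return True if the job was queued, False if it was rejected."""
        if len(self.pending) >= self.max_pending:
            return False  # shed load: the caller is expected to retry later
        self.pending.append(job)
        return True

    def take(self):
        """Hand the oldest queued job to a worker, or None if idle."""
        return self.pending.popleft() if self.pending else None


controller = AdmissionController(max_pending=2)
results = [controller.submit(job) for job in ("a", "b", "c")]
```

The third submission is rejected while the backlog is full, and gets accepted on a retry once a worker has taken a job off the queue.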

    If it’s a matter of adding resources to respond to customer activity spikes and you just have to do it manually, it sounds like you have an automation problem.

    If your pager load is becoming such that you can’t do project work to address whichever ones of the above are relevant to you, it’s time to hand the pager back to devs. If you don’t have the institutional authority to hand back the pager to devs, it sounds like you have a management problem.




  • My friend, I don’t need to go read the video game history about Daggerfall: I wrote some of it. :)

    And I stand by my statement. That game was the height of storytelling that came out of Bethesda in a bunch of small but important ways, although Morrowind is not far behind, in a somewhat different fashion. And there is a definite shift in the series from the moment Ted Peterson left the team. Patently, not a shift I am personally very fond of, but to each her own.


  • Balinares@pawb.social to Memes@lemmy.ml: Another Starfield Post
    1 year ago

    Well, I’d argue that Daggerfall was their best game, story-wise, but Daggerfall is even older. And that’s the point, isn’t it? More time passed between Skyrim and Starfield than between Daggerfall and Oblivion. A lot can change in so many years, and I do believe that hoping for something new was not entirely unreasonable.

    Then again, the keyword there is “entirely”, isn’t it? I personally didn’t expect very much from Starfield, and, also personally, I can’t say I fully understand the amount of hype surrounding it.