As I’m sure everyone noticed, the server died hard last night. Apparently, even though OVH advised me to disable proactive interventions, I learned this morning that “the feature is not yet implemented,” and they have been pressing the reset button on the machine every time their shitty monitoring detects the slightest ping loss. Last night, this finally made the server mad enough not to come back up.

Luckily, I had a backup from about 2 hours before the final outage. After a slow migration to the new DC, we are up and running on the new hardware. I’m still finalizing some configuration changes and need to do performance tuning, but once that’s done, our outage issues should be fully resolved.


Issues-

[Fixed] Pict-rs was missing some images. This was caused by an incomplete OVA export; all older images were recovered from a slightly older backup.

[Fixed?] DB or federation issues: we saw some slowness and occasional errors/crashes due to the DB timing out. This appears to have resolved itself overnight; we were about 16 hours out of sync with the rest of the federation when I posted this.


Improvements-

  • VM migrated to a new location in Dallas, far away from OVH. The number of allocated CPU cores was doubled during the move.

  • We are now in a VMware cluster with the ability to hot-migrate to other nodes in the event of any actual hardware issues.

  • Basic monitoring has been deployed; we are still working to stand up full-stack monitoring.

jon@lemmy.tf (OP):
I think we’re set; we definitely won’t have any datacenter engineers randomly resetting the server anymore. The slowness I noticed last night also seems to have cleared up, so we should be fully synced back to the fediverse as well.