Unplanned Outage (Resolved)


#1

We started something that was supposed to run slowly in the background on WK, but it decided to tie up all the resources on our database server, instead. We’re actively working to get it remedied, and will post updates every 15 minutes or so until it is. Our apologies for the inconvenience!


#2

Oh, no worries! It’s not like I have a lot of 皿洗い to do or anything…


#3

Alright, 15 minutes down. We’ve got a fix running, but it’s going to take a few to fix things. I’ll post again when it’s done or in another 15.

Rest assured, all your data is safe and sound. It’s just on a server that’s working really hard right now.


#4

And we’re back! As soon as Viet and I catch our breath, we’ll post a summary of what happened and what we’ll do to prevent similar outages in the future. Once again, sorry for the unplanned error messages and inconvenience!!! :sweat:


#5

Still sluggish (and it went down for a minute or so), but it’s working for the most part.


#6

As seanblue said, there definitely seem to be some lingering issues.

Loading reviews was quite slow. Once I finished, it just sat there for a bit, and when it did finish it popped up saying one more review. I refreshed and then that review was gone.


#7

It looks like it crashed again while in the midst of doing reviews. I hope that I won’t have to do those reviews again…


#8

I also lost connection half a minute ago, but reviews were saved


#9

For me, it’s still crashing while doing reviews.


#10

Alright, the monitoring continues. Things were good for about 30 minutes there, then the response times spiked again. We tweaked some resource settings, and things are getting a little speedier, at least according to our reporting tools.

We’ll continue to watch for the next while.


#11

No canceling the background process then?


#12

We turned it waaaay down. We’re seeing response times that are in the happy, back-to-normal range for everything on the site, so we’re going to let it cruise.


#13

Phew. Breath caught. Performance restored (we hope).

Here’s the breakdown:

We were pushing some phased changes to the database. We planned to slowly populate some new fields in the background with a zero-downtime deployment today, then push out code that used those new fields tomorrow. (/me rolls eyes at the “zero” downtime today) Part of the phased approach was updating some indexes on the tables. We always update our indexes concurrently (non-locking, so the system can keep running) and today’s update was set to do the same.

As soon as we started deploying the changes, things broke down. Being all tidy, we dropped an old index before creating the new one. Both Viet and I missed how dropping that index first would affect the performance of the existing code: it made the queries to the database so long-running that they’d never finish in time to load the page for users.
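
For anyone curious, the ordering problem is easy to sketch. This is only an illustration, not our actual migration code: the table and index names are made up, and the SQL assumes PostgreSQL-style `CREATE INDEX CONCURRENTLY` / `DROP INDEX CONCURRENTLY` (we haven’t said which database we run, so treat that as an assumption too). The idea is to build the replacement index first and only then drop the old one, plus a small check that flags a plan that drops before it creates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexSwap:
    """One index replacement. Every name used with this class is hypothetical."""
    table: str
    old_index: str
    new_index: str
    new_columns: tuple


def safe_swap_statements(swap):
    """Return SQL that builds the new index before dropping the old one.

    Assumes PostgreSQL-style concurrent (non-locking) index operations.
    """
    columns = ", ".join(swap.new_columns)
    return [
        f"CREATE INDEX CONCURRENTLY {swap.new_index} ON {swap.table} ({columns});",
        f"DROP INDEX CONCURRENTLY {swap.old_index};",
    ]


def drops_before_creates(statements):
    """Return True if any DROP INDEX appears before the first CREATE INDEX."""
    seen_create = False
    for statement in statements:
        verb = statement.strip().upper()
        if verb.startswith("CREATE INDEX"):
            seen_create = True
        elif verb.startswith("DROP INDEX") and not seen_create:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical table and index names, for illustration only.
    swap = IndexSwap(
        table="reviews",
        old_index="idx_reviews_old",
        new_index="idx_reviews_new",
        new_columns=("user_id", "available_at"),
    )
    plan = safe_swap_statements(swap)
    for statement in plan:
        print(statement)
    print("risky ordering:", drops_before_creates(plan))
```

A check like this is deliberately crude (it only looks at statement order, not at which queries rely on which index), but it would flag the "drop first, create second" ordering that caused this outage.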

It took us the first 30 minutes or so of the outage to figure out what was happening and what we needed to do. Once we figured out we needed that old index back, it took about 20 minutes to rebuild it. There was a lot of compulsive refreshing and finger tapping for those 20 minutes…

So, how do we avoid this in the future? Viet and I already do code reviews for all the code we push up to the site, and we have a ton of automated testing in place to catch code errors. That happens asynchronously, though, and that creates an opportunity for us to miss what might happen when we go to production. For future changes to the database that involve new fields, altered indexes, or touching certain tables, we’re going to sit down together and talk through the code and the changes, just to be sure we can see how it’ll perform with gigabytes of data involved.


#14

I had a few moments of connection errors but it seems to have corrected itself now. A little bit slow to load but nothing unusual I can see…

Edit: Good to hear all is well. Hope it stays that way and you can continue the updates (albeit a little slower).


#15

I just did a 40+ review session and had no idea there’d been a problem.

(Missed two burns though, boo.)


#16

Yep, just got done with 50 reviews and didn’t run into any issues this time. Thanks for the updates :smiley:


#17

@viet @oldbonsai Site appears to be down again.

We’re sorry, but something went wrong.
We’ve been notified about this issue and we’ll take a look at it shortly.


#18

Middle of new lessons…just leveled up to 19! Whoot! Let’s get the gator going!


#19

yup, same here, right when the radicals were coming up


#20

Woo! Now I have an excuse for not doing my reviews.