IC Electrical Testing
Electrical maintenance at IC would have resulted in a total outage (of either network or power) for all machines there.
Given that it’s not 2011 any more, we felt that this was not acceptable. Many thanks to Bytemark Hosting, who stepped into the breach and allowed us to move some servers there. With a primary database server hosted at Bytemark, and the replica at IC, we now have multi-site redundancy, although it is still a complicated matter to switch database primaries.
Thanks, also, to OSMF-US, who allowed us to use some of their AWS credits to fund a third replica in the cloud, just in case anything went horribly wrong with the move.
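For the curious, a bare-bones sketch of what a switchover involves is below, assuming PostgreSQL streaming replication; the host names, data directory and threshold are placeholders rather than our real configuration, and the real procedure has rather more checks in it (hence “complicated”).

```python
"""Rough outline of a primary switchover with PostgreSQL streaming
replication. All names and paths here are hypothetical placeholders."""
import subprocess
import time

import psycopg2

REPLICA_DSN = "host=replica.example.org dbname=openstreetmap"  # placeholder
PG_DATA = "/var/lib/postgresql/9.5/main"                       # placeholder

def replica_lag_seconds(dsn):
    """How far behind the replica is, judged by the last replayed transaction."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
        )
        return cur.fetchone()[0] or 0.0

# 1. Stop writes on the old primary (put the site into read-only mode).
# 2. Wait for the replica to replay everything it has already received.
while replica_lag_seconds(REPLICA_DSN) > 1.0:
    time.sleep(1)

# 3. Promote the replica (run on the replica host as the postgres user);
#    it leaves recovery and starts accepting writes.
subprocess.run(["pg_ctl", "promote", "-D", PG_DATA], check=True)

# 4. Repoint the web/API front ends at the new primary, then re-seed the
#    old primary as a replica (e.g. with pg_basebackup) before re-enabling writes.
```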
New server arrives
The new database server arrived, and was racked up and installed. It is now being tested while we make sure it is configured for best performance.
Electrical test aftermath
Ramoth index corruption
There was a partial API/website outage from around 18:00 UTC to 19:40 UTC on 2016-05-18, with some queries finishing successfully but many failing. It seems that, following the electrical outage, an index on the replica database server had become corrupt and was causing queries to hang. Rebuilding this index slowed the replica down to the point where it was no longer in sync with the master database and was rejecting queries.
All traffic was moved to the master database until the replica had finished rebuilding and synchronised with the master again.
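As an illustration of the kind of health check that decides whether read traffic can go to the replica or should fall back to the master, here is a minimal sketch, again assuming PostgreSQL streaming replication; the connection strings and lag threshold are made up for the example, not our actual setup.

```python
"""Minimal sketch: prefer the replica for reads, but fall back to the
master when the replica is lagging, hanging or unreachable."""
import psycopg2

MASTER_DSN = "host=master.example.org dbname=openstreetmap"    # placeholder
REPLICA_DSN = "host=replica.example.org dbname=openstreetmap"  # placeholder
MAX_LAG_SECONDS = 30  # beyond this, treat the replica as too stale

def replica_is_healthy(dsn, max_lag=MAX_LAG_SECONDS):
    """True if the replica answers quickly and has replayed recent WAL."""
    try:
        with psycopg2.connect(dsn, connect_timeout=2) as conn, conn.cursor() as cur:
            cur.execute("SET statement_timeout = '2s'")
            cur.execute(
                "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
            )
            lag = cur.fetchone()[0]
            return lag is not None and lag < max_lag
    except psycopg2.Error:
        # Hanging or failing queries (e.g. from a corrupt index) count as unhealthy.
        return False

def dsn_for_reads():
    """Route read queries to the replica only while it looks healthy."""
    return REPLICA_DSN if replica_is_healthy(REPLICA_DSN) else MASTER_DSN
```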
Poldi failures
Following the power outage, one of the Nominatim machines failed to boot. The issue turned out to be disk-related (again), and it was decided that now would be a good time to take it away and rebuild it.
Checking moved disks are okay
The disks in the main database server survived a road trip up to York (thanks, Bytemark!) without any apparent damage. To be more thorough, the disks are being examined one by one with a more in-depth analysis.
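As a sketch of what that one-by-one checking can look like, something along these lines kicks off each drive’s own extended self-test with smartmontools and then collects the SMART report; the device list is a placeholder and would need filling in (and root privileges) on the real machine.

```python
"""Sketch: run a long SMART self-test on each disk in turn and print the
full report (health status, reallocated/pending sectors, self-test log)."""
import subprocess
import time

DISKS = ["/dev/sda", "/dev/sdb"]  # placeholder device names

for disk in DISKS:
    # Start the drive's own long (extended) self-test.
    subprocess.run(["smartctl", "-t", "long", disk], check=True)

for disk in DISKS:
    # Poll until the self-test has finished, then dump the SMART report.
    while True:
        out = subprocess.run(["smartctl", "-a", disk],
                             capture_output=True, text=True).stdout
        if "Self-test routine in progress" not in out:
            break
        time.sleep(600)
    print(out)
```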
Upgrading machines to Ubuntu 16.04 LTS
Ubuntu Xenial Xerus, the latest long-term support version, has been released. Work is underway to upgrade all the machines, checking that everything works on the new version.
Failed disk in aerial imagery machine
A disk failed in the aerial imagery machine, generously hosted by Exonetric. The disk appears to be under the supplier’s warranty, and it has been replaced.
More UCL access issues
A failed disk needs to be replaced in one of the tile server machines, but lack of access is hampering efforts to fix it. The access issues are being worked on and should be resolved soon.
Outage 2016-05-25
There was a full API/website outage from 00:56 UTC to 01:30 UTC on 2016-05-25. The cause of this is not entirely clear at present. It may be related to a combination of a failed server, highly variable load, and a feature in Apache for handling graceful shutdowns. The effect of the highly variable load might have been to leave many worker processes in a “graceful shutting down” state, waiting up to an hour for requests to finish while still holding resources. If that’s the case, then over time these resources would be exhausted, leaving none free for new API/website requests.
It would be possible to forcibly free these resources after a timeout, but given the rarity of the event and the negative impact that could have on everyday changeset uploads, it seems that another mitigation will have to be found.
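In the meantime, a simple watch on the mod_status scoreboard would at least make the suspected failure mode visible before the worker slots run out. The sketch below counts workers stuck “gracefully finishing”; the URL and threshold are illustrative only, not our actual monitoring configuration.

```python
"""Sketch: count Apache workers in the "gracefully finishing" state via
mod_status, and warn if they are eating up too many slots."""
from urllib.request import urlopen

STATUS_URL = "http://localhost/server-status?auto"  # placeholder endpoint
MAX_GRACEFUL = 50  # alert threshold, made up for illustration

def graceful_worker_count(url=STATUS_URL):
    """Parse the machine-readable scoreboard; 'G' marks a worker that is
    gracefully finishing its current request while still holding its slot."""
    with urlopen(url, timeout=5) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    for line in body.splitlines():
        if line.startswith("Scoreboard:"):
            return line.split(":", 1)[1].count("G")
    return 0

if __name__ == "__main__":
    stuck = graceful_worker_count()
    if stuck > MAX_GRACEFUL:
        print(f"WARNING: {stuck} workers gracefully finishing; "
              "slots may be exhausting ahead of an outage")
```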