Starting at around 10:00 am last Tuesday, August 12, 2014, customers reported that they were unable to access our hosted services. However, our server and service monitoring systems did not detect any outages, and our external monitoring service did not raise an alert either. Customers also reported on social media that our services were unavailable only when accessed through certain providers.
Based on these facts, we suspected that the situation was due to an external routing problem. We immediately began investigating and contacted our IP transit partners. Tracking down problems of this kind is usually very complicated because they involve multiple networks and several parties. After various additional tests, we determined that the cause was not to be found with our transit partners.
Our next step was therefore to concentrate on the routers that connect our infrastructure to the outside world. These routers use routing tables to determine which of the networks connected to us reaches a given destination most quickly. As soon as the problems began, we checked whether any entries in the routing table were faulty or perhaps even missing. This was not the case. Even on closer examination, everything looked fine and the routers should have been behaving correctly.
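As a rough illustration of what checking a routing table for a destination involves, the sketch below performs a longest-prefix-match lookup, the standard rule by which IP routers pick the most specific matching entry. The prefixes, the next-hop names and the `lookup` function are hypothetical examples and are not taken from our routers.

```python
import ipaddress


def lookup(routes, address):
    """Return (prefix, next_hop) of the most specific matching route, or None.

    `routes` maps prefix strings to next-hop labels (hypothetical data).
    """
    addr = ipaddress.ip_address(address)
    best = None
    for prefix, next_hop in routes.items():
        net = ipaddress.ip_network(prefix)
        # Keep the matching entry with the longest prefix length.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    if best is None:
        return None
    return str(best[0]), best[1]


# Hypothetical table: a default route plus two more specific entries.
example_routes = {
    "0.0.0.0/0": "transit-a",
    "203.0.0.0/16": "peer-c",
    "203.0.113.0/24": "transit-b",
}

print(lookup(example_routes, "203.0.113.7"))
print(lookup(example_routes, "198.51.100.1"))
```

A missing entry shows up as a fall-through to a less specific route, or as `None` when no default route exists.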
We then found the first indications that other large internet providers around the world were experiencing similar problems at the same time. These appeared to have been triggered by a sudden, sharp increase in the number of entries in the internet routing table. We examined this as well, although our BGP sessions could not have been affected by the increase because our routers are equipped with more than enough RAM. However, the routers are configured with an artificial limit of 512,000 entries for hardware-based routing (independent of the BGP sessions). Although this limit was no longer being exceeded at that point, we decided to investigate more closely. We spot-checked affected IP addresses and found that they were present in the hardware tables (TCAM). Following this finding, we restarted the routers, even though we normally reboot critical components during the day only in exceptional cases.
After further investigation, we discovered that, despite the results of our random checks, the overflow of the hardware tables was indeed responsible for the problems. After the reboot, everything worked fine again. We ultimately concluded that, because the limit had been exceeded, the routes that were actually critical did not show up in our random checks. Restarting the BGP sessions had not restored these routes either, even though the limit was no longer exceeded by then.
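To make this failure mode easier to picture, here is a toy model of a fixed-capacity hardware table. It is purely illustrative: the `HardwareTable` class, the capacity of 5 and the prefixes are assumptions made for this sketch, and the rule that rejected routes are never retried is one possible mechanism, not a confirmed description of how our routers behave.

```python
class HardwareTable:
    """Toy fixed-capacity forwarding table (hypothetical, not a router API)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.installed = set()
        self.rejected = set()

    def install(self, prefix):
        """Install a route; routes arriving while the table is full are rejected."""
        if prefix in self.installed:
            return True
        if len(self.installed) >= self.capacity:
            # Assumption of this sketch: rejected routes are not retried later.
            self.rejected.add(prefix)
            return False
        self.installed.add(prefix)
        return True

    def withdraw(self, prefix):
        self.installed.discard(prefix)
        self.rejected.discard(prefix)

    def reboot(self, current_routes):
        """Rebuild the table from scratch, as a restart would."""
        self.installed.clear()
        self.rejected.clear()
        for prefix in current_routes:
            self.install(prefix)


def simulate():
    table = HardwareTable(capacity=5)
    base = [f"10.0.{i}.0/24" for i in range(5)]
    burst = ["192.0.2.0/24", "198.51.100.0/24"]

    # The table fills up, then a burst of extra routes overflows it.
    for prefix in base + burst:
        table.install(prefix)

    # Some routes are withdrawn, so the count drops below the limit again.
    for prefix in base[:2]:
        table.withdraw(prefix)

    below_limit = len(table.installed) < table.capacity
    missing_before = sorted(table.rejected)

    # A reboot reinstalls every current route into the empty table.
    table.reboot(base[2:] + burst)
    missing_after = sorted(table.rejected)
    return below_limit, missing_before, missing_after


print(simulate())
```

In this model the table is back below its limit while the overflowed routes are still absent, and a random check of installed entries would find nothing wrong. Only the rebuild from scratch brings the missing routes back.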
We have taken several technical measures as a result of these events.
- The limits for hardware-based routing were increased to 800,000 entries. This considerably reduces the likelihood that this situation will occur again.
- We will expand the scope of our router monitoring so that these types of anomalies can be identified more quickly in the future.
- We are also trying to determine why the faulty behavior could be resolved simply by rebooting, even though the number of entries had already dropped below the limit again.
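The kind of check we have in mind for the extended monitoring can be sketched as follows. This is a minimal illustration, not our actual monitoring code: the function name `check_table_utilization` and the 90% warning threshold are assumptions, and how the current entry count is read from a router is deliberately left out.

```python
def check_table_utilization(entries, limit, warn_ratio=0.9):
    """Classify hardware table usage as 'ok', 'warning' or 'critical'.

    `entries` is the current number of hardware routes, `limit` the
    configured maximum. `warn_ratio` is an assumed threshold.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    if entries >= limit:
        return "critical"
    if entries / limit >= warn_ratio:
        return "warning"
    return "ok"


# 500,000 entries against the old and the new limit.
print(check_table_utilization(500_000, 512_000))
print(check_table_utilization(500_000, 800_000))
```

With the old limit of 512,000, a table of 500,000 entries would already trigger a warning. With the new limit of 800,000, the same table is well within bounds.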
We formally apologize for any inconvenience this may have caused. We take this incident very seriously and assure you that we are doing everything possible to continue providing our services with the high level of quality our customers are accustomed to and should continue to expect in the future.