This post is constantly updated with whatever information I can get my hands on. Refresh for updates.
World wide panic!!! Google is down! Grab your routers and run!!!11
Well, not really, but internet services were massively affected by Cyclone Vardah near Chennai.
To my knowledge three major things happened.
- Google (AS15169) had huge problems, and traffic was shifted to POPs other than Mumbai.
- Vodafone’s Chennai location had an outage. Traffic was being moved to Mumbai. Enterprise and retail customers had issues.
- There was huge packet loss on most international websites, possibly due to capacity issues.
I am going to talk about the first one.
Google down you say?
Well, I was alerted by this post from a friend, and by unusually slow internet.
Seems like traffic for Google routing to Chennai (instead of Mumbai) for few ISPs in India. Impact of #vardahcyclone ?
— Anurag Bhatia (@anurag_bhatia) December 12, 2016
So let’s explore what went wrong. In spite of very high packet loss while accessing RIPE’s website, I managed to get one traceroute through the RIPE Atlas project.
As you can see, traffic destined for Chennai is terminated at Singapore, possibly carried there by Google itself, cold-potato style. Likewise, I noticed that I was hitting Delhi rather than Mumbai. So traffic was simply moved away from Mumbai to other locations. The likely reason is connectivity issues between Mumbai and Chennai, possibly due to fiber cuts.
I continued to monitor the situation some time later. Link to test: https://atlas.ripe.net/measurements/6959487/#!probes
As you can clearly see, even probes close to Mumbai were hitting Chennai instead of the POP closest to their region. Let’s test the aspmx.l.google.com domain used for Google Apps, which isn’t terminated in India. Link to test: https://atlas.ripe.net/measurements/6960037/#!probes
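If you would rather poke at the raw numbers than the map, here is a rough Python sketch for summarizing a measurement. It assumes the measurement is public and of the ping type, so that each per-probe result carries `avg`, `sent` and `rcvd` fields (with `avg` set to -1 when nothing came back). The helper names `fetch_results` and `summarize_ping` are my own, not part of any RIPE tooling.

```python
import json
import statistics
import urllib.request

# Results endpoint of the RIPE Atlas REST API (v2).
ATLAS_RESULTS_URL = "https://atlas.ripe.net/api/v2/measurements/{msm_id}/results/?format=json"


def fetch_results(msm_id):
    """Download all results for a public measurement (needs network access)."""
    url = ATLAS_RESULTS_URL.format(msm_id=msm_id)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def summarize_ping(results):
    """Return (median of per-probe average RTT in ms, overall loss fraction).

    Assumes ping-style results; probes that got no reply report avg == -1
    and are left out of the median but still count towards loss.
    """
    rtts = [r["avg"] for r in results if r.get("avg", -1) > 0]
    sent = sum(r.get("sent", 0) for r in results)
    rcvd = sum(r.get("rcvd", 0) for r in results)
    median = statistics.median(rtts) if rtts else None
    loss = 1 - rcvd / sent if sent else None
    return median, loss


# Example (network required): summarize_ping(fetch_results(6960037))
```

A high median combined with a non-trivial loss fraction would match what the maps show. If the measurement turns out to be a traceroute, the per-hop data sits in a nested `result` list instead and this summary would need adapting.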
As you can see, the latencies are not good. My friend also alerted me to a few strange traces like this.
What is happening here is that Google was rerouting traffic constantly and BGP routers were taking a while to respond.
So what exactly happened?
There is not much Google content actually hosted in India right now. Most of it is cached by Google itself, with SSL terminated on one of their Indian nodes: Mumbai, Chennai or Delhi. Eyeball networks/ISPs then either peer directly with Google or their upstreams do. Most YouTube content is just cached by GGC nodes inside ISP networks.
Google has limited capacity to Europe through Mumbai, and most Indian traffic is actually served from Taiwan via Singapore through a fat pipe in Chennai.
Moving traffic between sites is normally not a problem, but this time it caused three major problems:
- ISPs were quite used to undersizing their IP transit capacity and relying on peering with Google. Google accounts for around 50% of their capacity, so this does make sense.
- Google’s routes were constantly flapping and routers were rather slow to react to that. Cheaper routers like Mikrotik would be most susceptible to this.
- Larger transit players and telcos likely had issues of their own.
So what happened was that Google wasn’t actually down. Connectivity from ISPs to Google was hampered because they didn’t have enough transit capacity to fall back on once the plug was pulled on their peering capacity.
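To put rough numbers on that, here is a small back-of-the-envelope sketch. All figures are made up for illustration (a hypothetical ISP with 10 Gbps of peak demand, half of it normally delivered over Google peering, and 6 Gbps of transit), and it assumes demand stays the same when peering goes away.

```python
def transit_load_after_peering_loss(peak_demand_gbps, peering_share, transit_capacity_gbps):
    """Compare transit utilization before and after losing all peering.

    peering_share is the fraction of peak demand normally served over
    peering (0.5 for the "Google is around 50%" case).
    """
    normal_transit = peak_demand_gbps * (1 - peering_share)
    failover_transit = peak_demand_gbps  # everything now has to fit on transit
    return {
        "normal_utilization": normal_transit / transit_capacity_gbps,
        "failover_utilization": failover_transit / transit_capacity_gbps,
        "unserved_gbps": max(0.0, failover_transit - transit_capacity_gbps),
    }


# Hypothetical ISP: 10 Gbps peak, 50% via peering, 6 Gbps of transit.
report = transit_load_after_peering_loss(10, 0.5, 6)
```

With these illustrative numbers, transit runs at about 83% on a normal day but would need to carry about 167% of its capacity after failover, leaving 4 Gbps of demand with nowhere to go.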
Long term Solutions
Likely ways to avoid this in the future would be:
- Eyeball networks should always try to overbuild transit capacity. So in case connectivity to the Google PNI is lost, traffic can always be served from a different location over transit. If they are large and diverse enough, maybe they can peer at multiple locations. This would also save them from carrying that traffic over their own backbone.
- Better routers for BGP, or using something like BIRD, would help. Mikrotik routers, which are very popular with small to medium networks in India, can run BGP on only one core, which can hamper performance.
- Smaller ISPs should have redundant upstreams. Vodafone was badly affected, while networks like Airtel and Tata were less so. Having redundancy helps.
- More content networks like Google should host more content inside the country. Google has plans to set up a datacenter in Mumbai. More info here.
- Redundant connectivity between sites with different fiber paths would be nice for large networks.
- Outside connectivity with landing stations at multiple locations for transit and larger players would be nice for internet in India in general.
Change your DNS!
Also, what I did was move from Google’s public DNS resolvers to OpenDNS, because Google makes its queries from outside India and only caches them at its POPs here, so non-cached queries could have issues. On top of that, they use anycast, so queries could land on an affected node like Mumbai. Non-Google services were much better for me after I made the switch.
As I finish this post, I have noticed that connectivity is much better, at least with my ISP (AS58405), so time to go watch some YouTube streams! 😀
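If you want to check whether switching resolvers actually helps on your own connection, here is a rough standard-library sketch. It hand-builds a minimal DNS A query and times one UDP round trip to each resolver. The function names are mine, the resolver addresses are the well-known public ones (8.8.8.8 for Google, 208.67.222.222 for OpenDNS), and a single query is only a crude indicator.

```python
import socket
import struct
import time


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record with recursion desired."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def time_lookup(resolver_ip, name, timeout=2.0):
    """Return the round-trip time in ms for one query, or None on failure."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        try:
            sock.sendto(query, (resolver_ip, 53))
            sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable networks
            return None
        return (time.perf_counter() - start) * 1000


def compare_resolvers(name, resolvers):
    """Time the same lookup against each resolver in a {label: ip} dict."""
    return {label: time_lookup(ip, name) for label, ip in resolvers.items()}


# Example (network required):
# compare_resolvers("example.com", {"google": "8.8.8.8", "opendns": "208.67.222.222"})
```

Try it with a name that is unlikely to be cached, since cached answers will come back quickly from either resolver.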
Update 13/12/2016 3 PM:
I have some information that Google had interconnects from Powergrid (AS132215) that had issues. While connectivity via Powergrid was restored by morning, the i2i cable leased by Google possibly has some issues as well, and I am seeing a lot of Google traffic taking an international route after my ISP’s upstream hands it off outside India. ping google.com is hitting South Carolina after Airtel (AS9498) hands it off in London.
Update 13/12/2016 5:30 PM:
A friend confirmed that Airtel’s i2i cable indeed has issues.
Dear Customer, the cyclone in Chennai has impacted one of our undersea network cables which may affect your internet speeds. We deeply regret the inconvenience. Our engineers are working to resolve the issue and services will be normalized shortly. (Airtel)
Trace to Singapore was also going through LINX, London for me.
arhue@xubuntu-desktop:~/Videos/servo$ traceroute www.oneasiahost.com
traceroute to www.oneasiahost.com (188.8.131.52), 30 hops max, 60 byte packets
1 gateway (192.168.10.1) 0.198 ms 0.255 ms 0.259 ms
2 10.102.254.1 (10.102.254.1) 26.036 ms 28.425 ms 29.244 ms
3 10.240.2.10 (10.240.2.10) 29.980 ms 30.215 ms 30.248 ms
4 184.108.40.206 (220.127.116.11) 32.776 ms 33.312 ms 33.446 ms
5 10.139.103.1 (10.139.103.1) 33.539 ms 33.543 ms 33.557 ms
6 aes-static-025.51.246.61.airtel.in (18.104.22.168) 47.429 ms 22.547 ms 58.581 ms
7 22.214.171.124 (126.96.36.199) 182.880 ms 188.8.131.52 (184.108.40.206) 186.652 ms 220.127.116.11 (18.104.22.168) 190.037 ms
8 linx-juniper.sg.gs (22.214.171.124) 246.037 ms 246.187 ms 246.149 ms
9 switch-1.v3.uk1.sg.gs (126.96.36.199) 252.934 ms 248.928 ms 263.849 ms
10 ns1.oneasiahost.com (188.8.131.52) 260.528 ms 259.311 ms 261.143 ms
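A quick way to spot where a trace like this leaves the country is to look for the biggest latency jump between consecutive hops. Here is a small sketch that does this on pasted traceroute output. It assumes the standard Linux `traceroute` text format shown above and uses the lowest RTT per hop to smooth out jitter; the helper names are my own.

```python
import re

HOP_RE = re.compile(r"^\s*(\d+)\s")          # hop number at the start of a line
RTT_RE = re.compile(r"(\d+(?:\.\d+)?) ms")   # every "<number> ms" on the line


def parse_hops(text):
    """Return a list of (hop_number, min_rtt_ms) for hops that answered."""
    hops = []
    for line in text.splitlines():
        match = HOP_RE.match(line)
        rtts = [float(x) for x in RTT_RE.findall(line)]
        if match and rtts:  # skips the header line and "* * *" hops
            hops.append((int(match.group(1)), min(rtts)))
    return hops


def largest_jump(hops):
    """Return (from_hop, to_hop, delta_ms) for the biggest RTT increase."""
    best = None
    for (h1, r1), (h2, r2) in zip(hops, hops[1:]):
        delta = r2 - r1
        if best is None or delta > best[2]:
            best = (h1, h2, delta)
    return best
```

For the trace above, the biggest jump is between hops 6 and 7 (about 160 ms), right after the Airtel hop and just before the LINX hop, which fits traffic being hauled to London.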
I ran another RIPE Atlas test. Link here: https://atlas.ripe.net/measurements/6960503/#!probes
As you can see, the map shows a very bleak picture, with most of the traffic going outside the country. Let’s look at OpenIPMap to see where most of the traces are ending up.
As you can see, very little traffic is going to Singapore directly via Chennai. This possibly means there is a problem with the i2i cable connecting India to Singapore. Google likely carries whatever traffic ends up in Europe to other places, since it’s a cold-potato network. It prefers not to carry it from India itself, since the Europe-India pipe that they have doesn’t have enough capacity to carry all that traffic.
Update 13/12/2016 9:30 PM:
Traffic is back to hitting the Indian nodes. RIPE Atlas test here: https://atlas.ripe.net/measurements/6960735/#probes
As you can see traffic is hitting even the Mumbai node. I don’t know how these nodes are fed with content as it looks like the i2i cable system still has issues.
Router: SG1 (Epsilon) - Edge-2 (Customer, Standard Route)
Command: traceroute inet 184.108.40.206 as-number-lookup source 220.127.116.11
1 edge-1.sg2.sg.gs (18.104.22.168) 0.827 ms 3.495 ms 0.837 ms
2 22.214.171.124 (126.96.36.199) [AS 2914] 3.106 ms 1.203 ms 1.743 ms
3 ae-4.r21.sngpsi05.sg.bb.gin.ntt.net (188.8.131.52) [AS 2914] 1.886 ms 2.267 ms 2.102 ms
MPLS Label=440911 CoS= TTL=1 S=1
4 ae-8.r24.londen12.uk.bb.gin.ntt.net (184.108.40.206) [AS 2914] 191.297 ms 188.173 ms 186.029 ms
MPLS Label=617522 CoS= TTL=1 S=1
5 ae-8.r02.londen03.uk.bb.gin.ntt.net (220.127.116.11) [AS 2914] 184.094 ms 189.205 ms 189.433 ms
6 flagtelecom-0.r02.londen03.uk.bb.gin.ntt.net (18.104.22.168) [AS 2914] 209.137 ms 200.179 ms 205.002 ms
7 22.214.171.124 (126.96.36.199) [AS 15412] 208.916 ms xe---1.0-pjr04.mmb004.flagtel.com (188.8.131.52) [AS 15412] 226.170 ms 248.547 ms
8 * * *
9 * * *
10 184.108.40.206 (220.127.116.11) [AS 18101] 215.753 ms 199.318 ms 199.310 ms
11 10.139.103.1 (10.139.103.1) 215.223 ms 215.225 ms 211.806 ms
12 18.104.22.168 (22.214.171.124) [AS 58405] 212.304 ms 214.853 ms 215.016 ms
13 126.96.36.199 (188.8.131.52) [AS 58405] 210.071 ms 212.743 ms 200.098 ms
As you can see, NTT (AS2914) is carrying the traffic to the UK before handing it off to Flagtel (AS15412). Most probably Indian telcos have stopped announcing Indian routes on their Singapore routers and are preferring routes learnt from their European routers.
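When a trace includes AS lookups like the one above, the AS-level path can be pulled out mechanically. Here is a minimal sketch that assumes the `[AS 1234]` annotation format from this looking glass output. The name table only covers networks mentioned in this post, and the function names are mine.

```python
import re

AS_RE = re.compile(r"\[AS\s*(\d+)\]")

# Names as used in this post; anything else is reported as "unknown".
AS_NAMES = {
    2914: "NTT",
    15412: "Flagtel",
    18101: "Reliance Communications",
    58405: "my ISP",
}


def as_path(trace_text):
    """Return the AS numbers seen in the trace, collapsing consecutive repeats."""
    path = []
    for asn in AS_RE.findall(trace_text):
        asn = int(asn)
        if not path or path[-1] != asn:
            path.append(asn)
    return path


def describe(path):
    """Render an AS path as a readable string."""
    return " -> ".join(f"{AS_NAMES.get(asn, 'unknown')} (AS{asn})" for asn in path)
```

Hops without an AS annotation (private addresses, timeouts) are simply skipped, so the result is only as complete as the lookup data in the trace.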
Google services should be better but we’ll have to wait and watch about how the i2i problem plays out. Networks not using the i2i cable should be much better off, as far as connectivity to Singapore is concerned.
Update 13/01/2017 6:30 PM:
The i2i cable issues are still not fixed. Airtel is using SE-ME-WE4 to get to Singapore instead of i2i, but the Singapore to India capacity is probably not as much as required, so they are sharing fewer Indian routes in Singapore. You can observe a lot of traffic going via Europe to India. The forward path from India to Singapore is fine, probably because there is very little traffic from India to Singapore, so there is little choking in general. My ISP uses Reliance Communications (AS18101) for return traffic (it heavily prepends its announcements to Airtel), so I am much better off.
Tata Communications Ltd. has issues with the TIC cable as well. This is what they sent to their customers.
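For anyone unfamiliar with prepending, here is a toy sketch of why it steers return traffic. It is a deliberate simplification: real BGP compares local preference before AS path length and has several tie-breakers after it, and the paths below are illustrative rather than taken from a routing table.

```python
def prepend(path, times):
    """Return the path with the origin AS repeated `times` extra times."""
    origin = path[-1]
    return path + [origin] * times


def best_path(candidates):
    """Pick the candidate with the shortest AS path (first one wins ties)."""
    return min(candidates, key=len)


# Illustrative only: the same prefix announced via two upstreams.
via_airtel = prepend([9498, 58405], 3)   # heavily prepended towards Airtel
via_reliance = [18101, 58405]            # announced as-is towards Reliance
chosen = best_path([via_airtel, via_reliance])
```

Here the prepended announcement is five ASes long against two via Reliance, so a remote network comparing only path length would send return traffic through Reliance.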
You may observe additional latency of 30 -230ms for US West Coast and APAC prefix.
Please be informed that, this is due to under-sea cable fault on TIC submarine cable system. The fault is located in the wet segment in between repeater R58 & R59, this is near Singapore CLS, around 85-95 KMs from CLS.
As per latest update, the TIC cable fault repair is tentatively scheduled from 8th February 2017 and permit process is in progress. Additionally during repair operation Repeater R52 replacement will be carried out. The tentative ETR for the repair completion is 19th February 2017 depending on weather conditions. (Tata Communications Ltd.)
Fixing the i2i and TIC cables could take a few more months, and until then latency to Singapore could be affected. Fixing submarine cables is a long process: it needs good planning, permits, allocating resources, sending diving teams and so on. So we’ll have to be patient.