Link-state tracking, VMware ESX and You

Posted in Uncategorized by shaw38 on January 22, 2010

This post could also be titled “How to build a healthy, long-lasting relationship with your system administration team”. One of the most important (and overlooked) pieces of deploying VMware ESX in a network is handling an upstream network failure. Because larger organizations have segregated network and system administration teams, the switchport tends to be the demarcation of responsibility. Where this division particularly fails is in reacting to a network component failure, be it an upstream switch or router.

With the increased push towards server consolidation and deployment of VMware, the “routed is better” mantra has become muted by the layer 2 requirements of virtual machine mobility. A virtualized server can also present cable-density issues, with each server possibly needing 6 NICs (2 x Production, 2 x VMkernel, 1 x Backup, 1 x iLO). From a network design perspective, a VMware deployment screams for a top of rack switching model. Top of rack switching and VMware ESX physical NIC (pNIC) failure detection methods can present some interesting challenges.

VMware ESX allows for two options to detect an upstream network failure: beacon probing and link status. Here is a summary of both methods:

Basically, beacon probing is pretty awful if you’re a network admin. It will send broadcasts out each physical interface of the ESX server for EACH VLAN configured (if using dot1q tagging, which you should be). So that is:

p physical servers x n pNICs per server x v VLANs = broadcast storm
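As a back-of-the-envelope sketch of the scale (the numbers here are hypothetical, not from the post):

```python
# Hypothetical top-of-rack deployment: every VLAN on the dot1q trunk
# is beaconed out of every pNIC of every ESX host.
servers = 40          # p: physical ESX servers
pnics_per_server = 2  # n: pNICs per server
vlans = 20            # v: VLANs carried on each trunk

beacon_sources = servers * pnics_per_server * vlans
print(beacon_sources)  # 1600 broadcast beacons per probe interval
```

Even a modest rack multiplies into a surprising amount of background broadcast traffic.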

Link status is the preferred failure detection method but it will only track the state of the local link (between the ESX server and the switch). This tells the ESX server nothing about the switch’s ability to forward frames. This is where link-state tracking comes in. Link-state tracking will convey the switch’s upstream link-state to the local link of the ESX server by creating a logic gate between upstream and downstream links.

Suppose you have the following loop-free network topology deployed in your data center:

The network failure detection method configured on the ESX server is link status. Most likely your ESX server is sending frames out both interfaces due to the particular load balancing configuration, but in this case we are only interested in frames sent to the switch on the left. In the event the left switch’s uplink fails, we will experience a black hole situation for some of our traffic leaving the ESX server:

By utilizing link status as our ESX failure detection method, the ESX server merely tracks physical link state at layer 1 and the ability of the upstream switch to forward frames is not taken into account:

Link-state tracking configured on the switch will convey this uplink failure to the link directly connected to the ESX server. Let’s get our switch configured correctly (which is stupidly simple):

First, define your link state group globally:

Switch(config)#link state track 1

Then define your upstream links within the link state group:

interface GigabitEthernet1/0/1

link state group 1 upstream

Lastly, define your downstream links:

interface GigabitEthernet1/0/2

link state group 1 downstream

Now the upstream link state will be conveyed to the downstream links, which will cause the link to the ESX server to be shut down in the event the upstream switch link goes down. Interfaces are coupled in a link state group:
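You can verify the coupling on switches in the Catalyst 3750 family (the exact command output varies by platform, so treat this as a sketch):

Switch#show link state group detail

The output lists each group’s upstream and downstream interfaces along with the group state (Up or Down).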

Once the upstream link failure occurs and the interface is marked as down, the resulting action created by link state tracking is to bring down all downstream interfaces:

By bringing down the physical state of the interfaces to the ESX servers, the action by ESX link status tracking will be to initiate a pNIC failover event:

This will in turn create a long and happy relationship between network and system administrators and eliminate another instance of finger-pointing when redundancy fails to function correctly.

Optimized Edge Routing Overview

Posted in Networking by shaw38 on January 8, 2010

If you can tell me of a more understated topic on the CCIE Routing and Switching v4.0 lab blueprint than Optimized Edge Routing (OER), I’ll buy you a beer. This was quietly snuck into the blueprint in between policy-based routing and redistribution, both fairly straightforward topics. Should be no big deal right? False.

OER removes the rigidity of standard IP routing, where typical routing metrics are derived from physical-layer measurements and, in turn, dictate a generic routing policy for all traffic. OER does this by gathering higher-level performance metrics through IP SLA and NetFlow information and uses this to determine the optimal exit point for certain destination prefixes or traffic classes. Once the ideal exit point has been decided, routing policy is dynamically updated to influence the specific traffic class.

Navigating through the configuration guide for OER can be daunting but configuration can be broken down into 5 steps:

1. Profile

  • Select a subset of traffic for performance optimization
  • Learn the flows passing through the router with the highest delay or throughput, or
  • Statically configure a class of traffic to performance-route

2. Measure

Once traffic has been profiled, metrics need to be generated against it. This is done through:

  • Passive monitoring – measuring performance of a traffic flow as the flow is traversing the data path
  • Active monitoring – generating/measuring synthetic traffic to emulate the traffic class being monitored
  • Both can be deployed: passive monitoring can determine whether a flow conforms to an OER policy, and active monitoring can find the most optimized alternate path

3. Apply Policy

  • Performance metrics are compared to a set of low and high thresholds and a determination is made if the metrics are out of policy
  • Traffic class policies – defined for prefixes or for applications
  • Link policies – defined for entrance or exit links at the edge

4. Control

  • Traffic flow is modified to enhance network performance
  • Methods for modifying routing policy:
    • For traffic classes defined using a specific prefix, traditional routing information can be modified using BGP or an IGP to add/remove a route
    • For traffic classes defined by application (prefix + upper layer protocol), there are two methods:
      • Device specific: Policy-based routing
      • Network specific:
        • Overlay performance – MPLS or mGRE to reach any other device at the network edge
        • Context-Enhanced Protocols – BGP/OSPF/EIGRP are enhanced to communicate upper-layer information with a prefix

5. Verify

  • Once traffic is flowing through the preferred exit point, the traffic class is verified again against the traffic policy
  • If it is determined that the traffic is still out of profile, the controls put in place are reverted and the measurement phase restarts
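As a minimal configuration sketch of a master controller and border router pair (IP addresses, interface names, and the key chain are hypothetical, and OER later became Performance Routing/PfR, so command availability depends on the IOS release):

key chain OER-KEY
 key 1
  key-string cisco
!
oer master
 border 10.1.1.2 key-chain OER-KEY
  interface Serial0/0 external
  interface Serial0/1 external
  interface FastEthernet0/0 internal
!
oer border
 local FastEthernet0/0
 master 10.1.1.1 key-chain OER-KEY

With two external interfaces defined, the master controller can begin learning traffic classes and steering them between exits.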

Route Redistribution (Over?)Simplified…

Posted in Uncategorized by shaw38 on January 7, 2010

Here’s the best document on route redistribution I’ve read so far:

And it all can be summed up in this statement:

“Avoiding these types of problems is really quite simple: never announce the information originally received from routing process X back into routing process X.”

And it truly is that simple. Always mark/tag/color routes based on their source routing domain and, when redistributing, select which routes to redistribute. After all, routes are merely destination information. It’s all about who needs to know and from whom they need to know it.
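That principle can be sketched in IOS with route tags (process numbers and tag values here are hypothetical): tag each route with its source domain on the way in, and deny that tag on the way back out.

route-map OSPF-TO-EIGRP deny 10
 match tag 90
route-map OSPF-TO-EIGRP permit 20
 set tag 110
!
route-map EIGRP-TO-OSPF deny 10
 match tag 110
route-map EIGRP-TO-OSPF permit 20
 set tag 90
!
router eigrp 100
 redistribute ospf 1 metric 10000 100 255 1 1500 route-map OSPF-TO-EIGRP
!
router ospf 1
 redistribute eigrp 100 subnets route-map EIGRP-TO-OSPF

Routes that originated in OSPF carry tag 110 into EIGRP and are denied if a second redistribution point tries to hand them back to OSPF, and vice versa.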

Generating Pseudo-Random IPv6 Global IDs for Unique Local Unicast Addresses

Posted in Networking by shaw38 on January 4, 2010

I’ve been delving into the multiple RFCs associated with the creation of IPv6 this weekend and came across an interesting section in RFC 4193 – Unique Local IPv6 Unicast Addresses. First, a little background:

IPv6 unique local unicast addresses are the equivalent of IP version 4 RFC 1918 space in most ways and are formatted in the following fashion:

  • 7-bit Prefix – FC00::/7
  • 1-bit Local bit (position 8) – Always set to “1”…for now
  • 40-bit “kinda-almost-unique” Global ID
  • 16-bit Subnet-ID
  • 64-bit Interface ID

The intention and scope of these addresses is unicast-based intra/inter-site communication. The definition of a “site” within the plethora of IPv6 RFCs is slightly ambiguous, but in the case of RFC 4193, the demarcation of a “site” is between ISP and customer. According to the RFC, unique local unicast addresses are permitted to be used between “sites” (e.g. customer-to-customer VPN communication), but the FC00::/7 prefix is to be filtered by default at any site-border router. This space is not intended to be advertised to any portion of the internet.
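A site-border filter along those lines might look like this in IOS (the ACL name and interface are hypothetical, and this is a sketch rather than a complete border policy):

ipv6 access-list NO-ULA-OUT
 deny ipv6 FC00::/7 any
 permit ipv6 any any
!
interface Serial0/0
 description Link to ISP
 ipv6 traffic-filter NO-ULA-OUT out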

Now the interesting portion of this RFC is the recommended algorithm for generating a realistically unique yet theoretically common 40-bit Global ID for your local unicast addresses. Section 3.2.2 recommends the following:

  1. Obtain the current time of day in 64-bit NTP format
    • i.e.  reference time is C029789C.45564D4E
  2. Obtain an EUI-64 identifier from the system running this algorithm
    • i.e. bia of C201.0DC8.0000
  3. Concatenate the time of day with the system-specific identifier in order to create a key
    • i.e C029789C45564D4E.C2010DC80000
  4. Compute an SHA-1 digest on the key; the resulting value is 160 bits
    • Here’s a handy web-based calculator to generate multiple different message digest flavors:
    • Our resulting SHA-1 hash of C029789C45564D4E.C2010DC80000 is 4D958078E1C1C2f3DEBA10C1DC7899E6A21D2B9F
  5. Use the least significant 40 bits as the Global ID
    • Our 40 least-significant bits results in E6A21D2B9F (hint: just count over 10 hexadecimal digits from the right)
  6. Concatenate FC00::/7, the L bit set to 1, and the 40-bit Global ID to create a Local IPv6 address prefix
    • FC00::/7 with the L-bit (8th bit) set to 1 = FD00::/8
    • A concatenation of the prefix plus our generated Global ID and 16-bit Subnet ID = FDE6:A21D:2B9F::/64
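The six steps above can be sketched in a few lines of Python (hashing the raw key bytes is an assumption on my part; the web calculator in the example likely hashed the ASCII string instead, so the resulting Global ID will differ from the one shown above):

```python
import hashlib

def ula_prefix(ntp_time_hex: str, hw_id_hex: str) -> str:
    """RFC 4193 Section 3.2.2 sketch: SHA-1 over the concatenated key,
    keep the least-significant 40 bits as the Global ID, and prepend
    FD00::/8 (FC00::/7 with the L bit set to 1)."""
    key = bytes.fromhex(ntp_time_hex) + bytes.fromhex(hw_id_hex)
    digest = hashlib.sha1(key).digest()
    global_id = digest[-5:]                     # least-significant 40 bits
    p = (bytes([0xFD]) + global_id).hex()       # "fd" + 10 hex digits
    return f"{p[0:4]}:{p[4:8]}:{p[8:12]}::/48"  # site prefix; append a
                                                # 16-bit Subnet ID per /64

# Values from the example above (time C029789C.45564D4E, bia C201.0DC8.0000)
print(ula_prefix("C029789C45564D4E", "C2010DC80000"))
```

The function returns the /48 site prefix; carving out individual /64s is just a matter of filling in the 16-bit Subnet ID field.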

Also included in the RFC are sample probabilities of IPv6 address prefix uniqueness depending on the number of peer connections to a site. It’s safe to say that if you experience an overlap using this method to assign Global IDs, play the damn lottery. While this method almost eliminates any overlap possibility between sites, the Global IDs it generates are hardly “pretty” numbers, and there will undoubtedly be folks assigning Global IDs of ::1/40. If you have ever gone through a merger/acquisition with IPv4, do yourself a favor and follow the RFC’s algorithm when assigning your Global IDs.
