Linux内核分析 - 网络[四]：路由表

路由表 在内核中存在路由表fib_table_hash和路由缓存表rt_hash_table。路由缓存表主要是为了加速路由的查找，每次路由查询都会先查找路由缓存，再查找路由表。这和cache是一个道理，缓存存储最近使用过的路由项，容量小，查找快速；路由表存储所有路由项，容量大，查找慢。首先，应该先了解路由表的意义，下面是route命令查看到的路由表：

Destination	Netmask	Gateway	Flags	Interface	Metric
169.254.0.0	255.255.0.0	*	U	eth0	1
192.168.123.0	255.255.255.0	*	U	eth0	1
default	0.0.0.0	192.168.123.254	UG	eth0	1

一条路由其实就是告知主机要到达一个目的地址，下一跳应该走哪里。比如发往192.168.22.3报文通过查路由表，会得到下一跳为192.168.123.254，再将其发送出去。在路由表项中，还有一个很重要的属性-scope，它代表了到目的网络的距离。路由scope可取值：RT_SCOPE_UNIVERSE, RT_SCOPE_LINK, RT_SCOPE_HOST 在报文的转发过程中，显然是每次转发都要使到达目的网络的距离要越来越小或不变，否则根本到达不了目的网络。上面提到的scope很好的实现这个功能，在查找路由表中，表项的scope一定是更小或相等的scope（比如RT_SCOPE_LINK，则表项scope只能为RT_SCOPE_LINK或RT_SCOPE_HOST）。 路由缓存 路由缓存用于加速路由的查找，当收到报文或发送报文时，首先会查询路由缓存，在内核中被组织成hash表，就是rt_hash_table。static struct rt_hash_bucket *rt_hash_table __read_mostly; [netipv4 oute.c]通过ip_route_input（）进行查询，首先是缓存操作时，通过[src_ip, dst_ip, iif,rt_genid]计算出hash值hash = rt_hash（daddr, saddr, iif, rt_genid（net））;此时rt_hash_table[hash].chain就是要操作的缓存表项的链表，比如遍历该链表for （rth = rt_hash_table[hash].chain; rth; rth = rth->u.dst.rt_next）因此，在缓存中查找一个表项，首先计算出hash值，取出这组表项，然后遍历链表，找出指定的表项，这里需要完全匹配[src_ip, dst_ip, iif, tos, mark, net]，实际上struct rtable中有专门的属性用于缓存的查找键值 – struct flowi。/* Cache lookup keys */struct flowi fl;当找到表项后会更新表项的最后访问时间，并取出dstdst_use（&rth->u.dst, jiffies）;skb_dst_set（skb, &rth->u.dst）; 路由缓存的创建inet_init（） -> ip_init（） -> ip_rt_init（）rt_hash_table = （struct rt_hash_bucket *） alloc_large_system_hash（"IP route cache", sizeof（struct rt_hash_bucket）, rhash_entries, （totalram_pages >= 128 * 1024）？ 15 : 17, 0, &rt_hash_log, &rt_hash_mask, rhash_entries ？ 0 : 512 * 1024）;其中rt_hash_mask表示表的大小，rt_hash_log = log（rt_hash_mask），创建后的结构如图所示： 路由缓存插入条目函数rt_intern_hash（）要插入的条目是rt，相应散列值是hash，首先通过hash值找到对应的bucketrthp = &rt_hash_table[hash].chain;然后对bucket进行一遍查询，这次查询的目的有两个：如果是超时的条目，则直接删除；如果是与rt相同键值的条目，则删除并将rt插入头部返回。while （（rth = *rthp）！= NULL） { if （rt_is_expired（rth）） { // 超时的条目 *rthp = rth->u.dst.rt_next; rt_free（rth）; continue; } if （compare_keys（&rth->fl, &rt->fl） && compare_netns（rth, rt）） { //重复的条目 *rthp = rth->u.dst.rt_next; rcu_assign_pointer（rth->u.dst.rt_next, rt_hash_table[hash].chain）; rcu_assign_pointer（rt_hash_table[hash].chain, rth）; …… } …… rthp = &rth->u.dst.rt_next;}在扫描一遍后，如rt还未存在，则将其插入头部rt->u.dst.rt_next = rt_hash_table[hash].chain;rcu_assign_pointer（rt_hash_table[hash].chain, rt）;如果新插入rt满足一定条件，还要与ARP邻居表进行绑定Hint：缓存的每个bucket是没有头结点的，单向链表，它所使用的插入和删除操作是值得学习的，简单实用。 路由缓存删除条目rt_del（）要删除的条目是rt，相应散列值是hash，首先通过hash值找到对应的bucket，然后遍历，如果条目超时，或找到rt，则删除它。rthp = &rt_hash_table[hash].chain;spin_lock_bh（rt_hash_lock_addr（hash））;ip_rt_put（rt）;while （（aux = *rthp）！= NULL） { if （aux == rt || rt_is_expired（aux）） { *rthp = aux->u.dst.rt_next; rt_free（aux）; continue; } rthp = &aux->u.dst.rt_next;}spin_unlock_bh（rt_hash_lock_addr（hash））;