January 23, 2020
By Tim Siglin Contributing Editor

The Algorithm Series: The Math Behind the CDN

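For the delivery portion of the overall streaming equation, CDNs use granular content caching and content replication; optimized network paths, including ingress, egress, and midgress data transport; and strategic server placement at the core (such as an origin server) or the edge (often referred to as caching content at a point of presence). Underlying these key CDN components are a few fundamental algorithms that serve to balance the strategic core and edge architectural requirements.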

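This article, the first in our new Algorithm Series, delves into the math behind the magic of streaming delivery, highlighting the key mathematical concepts (and even a few equations) that power the infrastructure delivering live and on-demand streams around the globe.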

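If you read my November/December 2019 Streams of Thought column, you'll recall that this series of articles arose from a discussion I had at IBC in Amsterdam with several Ph.D. streaming solutions architects. During that conversation, two of the three of us with math degrees (Michelle Fore and Yuriy Reznik, head of research and a fellow at Brightcove) began delving into the math at the intersection of media player performance and multi-codec manifest files.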

That conversation produced an initial idea to cover four key areas: delivery performance (CDN), player performance (OTT or OVP bitrate and rendition ladder optimization), live event scaling (including authentication and other potential bottlenecks), and DRM. I elaborated on this idea in an interview with Reznik at Streaming Media West 2019 in Los Angeles.

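Along the way, I've been introduced to a number of people in the industry whom I either hadn't met or had only limited interaction with over the past 20 years, but who have contributed to the four areas that are critical to the road map of how we got to where we are today, as well as to the future of media delivery.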

CDN Math

What exactly is CDN math? The most common content delivery math, at least from the paying customer's perspective, is billing algorithms. These are by no means new, having been around during the era of telco-based data networks (think dial-up, ISDN, or even landline long-distance service).

Beyond these and other customer-focused algorithms, such as 95/5 (also known as 95th percentile), one key area that all CDNs measure and optimize is caching server utilization, including ways to prevent overloading a single server's capacity, often referred to as "swamping" the server.
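The article does not spell out how 95/5 billing is computed, so the following is only a minimal sketch of one common interpretation: bandwidth is sampled at regular intervals, the highest 5% of samples are discarded, and the customer is billed at the highest remaining sample. The function name, the sample values, and the rounding rule are all assumptions made for illustration; real providers differ in sampling interval and rounding.

```python
def percentile_95(samples_mbps):
    """Return the billable rate under a simple 95th-percentile rule.

    Assumption: the top 5% of samples (the bursts) are dropped and the
    highest remaining sample is the billable rate.
    """
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    # Number of samples kept, rounded up, using integer math to avoid
    # floating-point surprises: ceil(0.95 * n).
    keep = (95 * len(ordered) + 99) // 100
    return ordered[keep - 1]


# 100 hypothetical samples: steady 10 Mbps traffic plus five short bursts.
samples = [10] * 95 + [500] * 5
billable = percentile_95(samples)
print(billable)
```

In this made-up data set the five bursts fall entirely within the discarded top 5%, so they do not affect the billable rate.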

Server Utilization for Proper Caching (aka Consistent Hashing)

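There's a lot of math in modern CDNs, but one of the fundamental algorithms for both web acceleration and streaming media can be found in a patent filed long ago, in 1997. U.S. Patent No. 8458259 is titled Method and Apparatus for Distributing Requests Among a Plurality of Resources and is, itself, a continuation of several earlier patents stemming from a March 13, 1998 patent application that became U.S. Patent No. 6430618 in 2002. The original patent, granted to MIT, is based on research presented by members of its Laboratory for Computer Science at the 29th ACM Symposium on Theory of Computing (STOC '97), held in May 1997. Their paper is titled "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web" and can be found on pages 654-663 of the symposium proceedings.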

The paper's authors (David Karger, Eric Lehman, Tom Leighton, Matthew Levine, Daniel Lewin, and Rina Panigrahy) are now well-known in content delivery circles. Leighton, for instance, who has retained roles in both MIT's Laboratory for Computer Science (now known as the Computer Science & Artificial Intelligence Laboratory) and in MIT's department of mathematics, went on to co-found Akamai the following year with the late Lewin.

So what is this idea of "consistent hashing" that the paper's authors developed and presented? The money quotes from the patent spell it out:

Two causes of delay are heavy communication loading a part of a network and a large number of requests loading a particular server. When part of the network becomes congested with too much traffic, communication over that part of the network becomes unreliable and slow. Similarly, when too many requests are directed at a single server, the server becomes overloaded, or 'swamped.'

To address both the network congestion and the swamping of the original server, commonly referred to as the origin server, the patent raises the concept of a caching server:

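Caching can reduce network traffic if the cached copy is close, in the network topology sense, to the requester because fewer network links and resources are used to retrieve the information. Caching can also relieve an overloaded server because some of the requests that would normally be routed to the original site can be served by a cache server, and so the number of requests made of the original site is reduced.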

But how many caching servers are needed, and in how many locations? More importantly, from a CDN design standpoint, are there other issues that could still cause congestion, even if there are many caching servers? It turns out the answer is that just throwing caching servers at the problem is not an effective solution. And that's what consistent hashing attempts to solve.

To understand the basics of consistent hashing, we first need to understand hashing. And to do that, we also need to understand modulo mathematics.

Modulo Mathematics

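Modulo is a mathematical operation that yields the whole number (any natural number, plus zero) left over after division. If you remember your high school math, the remainder is what is left at the end of long division; the related shortcut for dividing polynomials is called synthetic division, or Horner's method.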

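Still don't remember how it's done? OK, here's how modulo works. For instance, the answer to 7 modulo 2 (often shortened to 7 mod 2) will be either zero or some natural number (any positive whole number greater than zero). While we would typically write 7/2 as 3.5 when using the decimal to represent half of the divisor, for modulo mathematics, the answer would be 1 (essentially 2 multiplied by 3, with 1 remaining).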

Because the result of a modulo operation is a whole number, there are cases in which the result of the modulus would be no remainder. For instance, if the formula were 6 mod 2, the answer would be 0.
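The two worked examples above can be checked in a few lines of Python. This is just an illustration using the language's built-in `divmod` function and `%` operator; the variable names are arbitrary.

```python
# 7 divided by 2: the quotient is 3 and the remainder (7 mod 2) is 1.
quotient, remainder = divmod(7, 2)
print(quotient, remainder)

# The original number is recovered as divisor * quotient + remainder.
assert 2 * quotient + remainder == 7

# The % operator returns only the remainder.
seven_mod_two = 7 % 2  # 1
six_mod_two = 6 % 2    # 0, because 2 divides 6 with nothing left over
print(seven_mod_two, six_mod_two)
```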

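The importance of modulo to hashing is that the remainder helps determine which server a given piece of data can be assigned to and retrieved from. More on that later.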
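As a rough sketch of that idea, the snippet below turns a content key into an integer and uses the remainder after dividing by the number of servers to choose one. The server names, the `stable_hash` and `pick_server` helpers, and the choice of SHA-256 are all assumptions for illustration, not something the article or the patent prescribes.

```python
import hashlib

# Hypothetical cache servers in a single cluster.
SERVERS = ["cache-0", "cache-1", "cache-2", "cache-3"]


def stable_hash(key: str) -> int:
    """Map a string to a 64-bit integer that is the same on every run."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_server(key: str, servers=SERVERS) -> str:
    """Use the remainder (hash mod server count) as an index into the list."""
    return servers[stable_hash(key) % len(servers)]


# The same key always lands on the same server, for storage and retrieval.
print(pick_server("/video/intro.m3u8"))
```

Python's built-in `hash()` is avoided here because its output for strings changes between interpreter runs, which would send the same key to different servers after a restart.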

Hashing It Out

The simplest definition of hashing is to chop or divide, but for our purposes, hashing is "a function that maps one piece of data (typically describing some kind of object), often of arbitrary size, to another piece of data, typically an integer." Put a slightly different way, according to the Wolfram MathWorld site, "A hash function (H) projects a value from a set with many (or even an infinite number of) members to a value from a set with a fixed number of (fewer) members." In other words, it's a way to notate an infinite number of values via a fixed, more limited number of values. In practical terms, for content, this also allows variable-length content to be represented by a fixed-length representation.
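To make the "variable length in, fixed length out" property concrete, here is a small sketch that uses SHA-256 from Python's standard library. The choice of SHA-256 and the sample strings are assumptions for illustration only; the article does not name a specific hash function.

```python
import hashlib


def fixed_length_hash(text: str) -> str:
    """Return a 64-character hexadecimal digest, whatever the input length."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Inputs of very different lengths...
inputs = ["a", "a short title", "a much longer description of some content " * 20]

# ...all map to a representation of the same fixed length.
digests = [fixed_length_hash(text) for text in inputs]
for text, digest in zip(inputs, digests):
    print(len(text), len(digest))

# The hexadecimal digest can also be read as a (large) integer.
as_integer = int(digests[0], 16)
```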

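Let's use Social Security numbers as a form of hashing variable data to a fixed-length number. If the dashes are removed, a Social Security number is a unique, fixed-length, 9-digit natural number (a positive integer above zero). Ignoring the original limitations of the Social Security numbering scheme (3-digit area number, 2-digit group number, 4-digit serial number), and assuming that Social Security numbers start at 100,000,000 in order to fill all nine slots, the total number of possible fixed 9-digit values would be 900,000,000 (every value from 100,000,000 through 999,999,999). While the number is fixed-length, the name attached to each Social Security number is a variable-length piece of content. It could be Mary J. Blige or Franklin Delano Roosevelt or even John Fitzgerald Kennedy.

In our example, the name itself uses both variable length and multiple character types (in database terms, a "varchar"), meaning a global name search requires more computing power than a search across all 900,000,000 integer permutations in our Social Security number example.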

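A final benefit of hashing is that it enables more efficient searches and storage in modern relational databases. Most databases use a key structure, in which a value that is unique within a single field of a database table serves as the primary key. Content can not only be searched on that key but also joined from multiple tables (relations) into a single content query across multiple database tables.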

What we've described in our hashing example, the combination of a fixed-length value (the Social Security number) and a variable-length value (any given name), is, at its most basic, a traditional database construct called a key-value store, in which the key (hash) is associated with the value (a string of content, such as any given name). In cases where the hash key is unique, it can potentially act as the primary key in any given database table.

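Databases also perform very well with indexed content, which is essentially a road map to a group of key-value stores. The index effectively reverse-engineers an array of content stored in a database, so that the index, rather than the entire database, is what gets searched. For a grouping of hashes, a hash table acts as the index and significantly reduces the search time for the content string associated with a particular hash.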
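The key-value store and index described above can be sketched with a plain Python dictionary, which is itself a hash table. The 9-digit keys below are made up for illustration (they are not real Social Security numbers), and the `scan_lookup` and `indexed_lookup` helpers are hypothetical names.

```python
# Hypothetical key-value pairs: a fixed-length key and a variable-length value.
records = [
    (100000001, "Mary J. Blige"),
    (100000002, "Franklin Delano Roosevelt"),
    (100000003, "John Fitzgerald Kennedy"),
]


def scan_lookup(key):
    """Search without an index: walk every record until the key matches."""
    for record_key, name in records:
        if record_key == key:
            return name
    return None


# Build the index once; a dict hashes each key to locate its value directly.
index = dict(records)


def indexed_lookup(key):
    """Search with the index: no walk through the full record list."""
    return index.get(key)


print(scan_lookup(100000003))
print(indexed_lookup(100000003))
```

Both lookups return the same value; the difference is that the scan's cost grows with the number of records, while the dictionary lookup stays roughly constant on average.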

Problems With Hashing

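The main potential problem with hashing is what's called a collision, meaning two variable-length values share the same defining parameters (e.g., there are two people named John Reginald Smith Jr., both born in the same area of the U.S.), which could result in the same fixed-length value (the Social Security number in our example) representing both.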

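A basic hashing approach that generates hash collisions, because two content strings with the same parameters share the same hash key value, could lead to unintentional swamping. In practice, it could also result in one server holding too much content, with other servers in the cluster holding too little.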
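A deliberately poor hash function makes both effects easy to see. In the sketch below, the hash is simply the length of the key, an assumption chosen only because it collides so readily; the file names and server count are also hypothetical.

```python
from collections import Counter

NUM_SERVERS = 4  # hypothetical cluster size


def weak_hash(key: str) -> int:
    """A deliberately bad hash: only the length of the key is used."""
    return len(key)


# Ten different segment names that all happen to be the same length.
keys = [f"/video/clip{i}.ts" for i in range(10)]

# Every key collides on the same hash value...
distinct_hashes = {weak_hash(key) for key in keys}

# ...so every key is assigned to the same server, while the others sit idle.
load = Counter(weak_hash(key) % NUM_SERVERS for key in keys)
print(distinct_hashes, load)
```

Here all ten keys are 15 characters long, so they share one hash value and pile onto a single server out of four.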

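In addition, hash collisions may add significant delay in a browser receiving content. In the worst case, the content can't be served at all, or the wrong content could be served from the wrong caching server to the wrong media player at the wrong time. In our Social Security example, it would be as if the Social Security Administration sent a check to the wrong John Reginald Smith Jr., who would then be legally free to cash it.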

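One way to address potential collision errors in a CDN architecture is to add more servers, replicating the same hash table and key-value store combinations to multiple servers. This approach works if the content itself has a limited collision likelihood; however, if one server in a group of servers fails, the hash table becomes outdated, since some of the failed server's content needs to be remapped to all the other servers. The end result is a significant hit on the origin server whenever any one caching server, even one in a cluster, fails.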
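The scale of that remapping can be estimated with a small simulation. This is a simplified model under stated assumptions: keys are assigned by hash mod server count, a cluster of four servers shrinks to three, and the key names and helper functions are hypothetical. Under this naive scheme, even keys that were not on the failed server can change location.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Map a string to a 64-bit integer that is the same on every run."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign(key: str, num_servers: int) -> int:
    """Naive placement: server index is the hash modulo the server count."""
    return stable_hash(key) % num_servers


keys = [f"/video/segment{i}.ts" for i in range(10_000)]

before = {key: assign(key, 4) for key in keys}  # healthy four-server cluster
after = {key: assign(key, 3) for key in keys}   # one server has failed

# A key stays put only if its hash has the same remainder mod 4 and mod 3.
moved = sum(1 for key in keys if before[key] != after[key])
moved_fraction = moved / len(keys)
print(f"{moved_fraction:.0%} of keys changed servers")
```

For evenly spread hash values, only 3 of every 12 consecutive values have the same remainder mod 4 and mod 3, so a large majority of keys move. Each moved key is a potential cache miss that falls back to the origin server.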