数据密集型应用系统设计
Designing Data-Intensive Applications
The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
可靠、可扩展与可维护系统背后的重大思想
Martin Kleppmann
马丁·克莱普曼
Designing Data-Intensive Applications
by Martin Kleppmann
马丁·克莱普曼(Martin Kleppmann)
Copyright © 2017 Martin Kleppmann. All rights reserved.
版权所有 © 2017 Martin Kleppmann。保留所有权利。
Printed in the United States of America.
于美国印刷。
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
由 O’Reilly Media, Inc. 出版,地址:1005 Gravenstein Highway North, Sebastopol, CA 95472。
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles ( http://oreilly.com/safari ). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected] .
O’Reilly图书可用于教育、商业或促销用途。大多数书籍也提供在线版本(http://oreilly.com/safari)。如需更多信息,请联系我们的公司/机构销售部门:800-998-9938或[email protected]。
Editors: Ann Spencer and Marie Beaugureau 编辑:Ann Spencer 和 Marie Beaugureau |
Indexer: Ellen Troutman-Zaig 索引编制:Ellen Troutman-Zaig |
Production Editor: Kristen Brown 生产编辑:Kristen Brown |
Interior Designer: David Futato 内文设计:David Futato |
Copyeditor: Rachel Head 文字编辑:Rachel Head |
Cover Designer: Karen Montgomery 封面设计:Karen Montgomery |
Proofreader: Amanda Kersey 校对:Amanda Kersey |
Illustrator: Rebecca Demarest 插画:Rebecca Demarest |
- March 2017: First Edition
Revision History for the First Edition
- 2017-03-01: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781449373320 for release details.
请参见 http://oreilly.com/catalog/errata.csp?isbn=9781449373320 以获取发布细节。
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Designing Data-Intensive Applications , the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
O’Reilly徽标是O’Reilly Media, Inc.的注册商标。《Designing Data-Intensive Applications》、封面图像及相关商业外观是O’Reilly Media, Inc.的商标。
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
尽管出版商和作者已经诚信努力确保本作品中包含的信息和指令准确无误,但出版商和作者不对错误或遗漏负责,包括但不限于因使用或依赖于本作品所导致的损害。使用本作品中包含的信息和指令需自行承担风险。如果本作品中包含或描述的任何代码示例或其他技术受开放源代码许可证或他人的知识产权的限制,则您有责任确保遵守此类许可证和/或权利。
978-1-449-37332-0
[LSI]
Dedication
Technology is a powerful force in our society. Data, software, and communication can be used for bad: to entrench unfair power structures, to undermine human rights, and to protect vested interests. But they can also be used for good: to make underrepresented people’s voices heard, to create opportunities for everyone, and to avert disasters. This book is dedicated to everyone working toward the good.
技术是我们社会中的一股强大力量。数据、软件和通信可以被用来作恶:巩固不公的权力结构、侵蚀人权、维护既得利益。但它们也可以被用来行善:让弱势群体的声音被听见、为每个人创造机会、避免灾难。谨以本书献给所有为善而努力的人。
Computing is pop culture. […] Pop culture holds a disdain for history. Pop culture is all about identity and feeling like you’re participating. It has nothing to do with cooperation, the past or the future—it’s living in the present. I think the same is true of most people who write code for money. They have no idea where [their culture came from].
计算机是流行文化。[…] 流行文化对历史不屑一顾。流行文化关注的是身份认同和参与感,与合作、过去或未来毫无关系——它只活在当下。我认为,大多数为钱编写代码的人也是如此,他们不知道[自己的文化从何而来]。
Alan Kay , in interview with Dr Dobb’s Journal (2012)
艾伦·凯(Alan Kay),摘自《Dr Dobb’s Journal》访谈(2012)
Preface
If you have worked in software engineering in recent years, especially in server-side and backend systems, you have probably been bombarded with a plethora of buzzwords relating to storage and processing of data. NoSQL! Big Data! Web-scale! Sharding! Eventual consistency! ACID! CAP theorem! Cloud services! MapReduce! Real-time!
如果您最近在软件工程领域工作,尤其是在服务器端和后端系统方面,您可能已经遭受了大量与数据存储和处理有关的信息轰炸。NoSQL!大数据! Web规模!分片!最终一致性! ACID! CAP定理!云服务! MapReduce!实时!
In the last decade we have seen many interesting developments in databases, in distributed systems, and in the ways we build applications on top of them. There are various driving forces for these developments:
在过去的十年中,数据库、分布式系统以及在其之上构建应用的方式都出现了许多有趣的发展。这些发展背后有多种驱动力:
-
Internet companies such as Google, Yahoo!, Amazon, Facebook, LinkedIn, Microsoft, and Twitter are handling huge volumes of data and traffic, forcing them to create new tools that enable them to efficiently handle such scale.
像谷歌、雅虎、亚马逊、Facebook、领英、微软和Twitter这样的互联网公司正在处理大量的数据和流量,这迫使它们创建新的工具,使它们能够高效地处理这样的规模。
-
Businesses need to be agile, test hypotheses cheaply, and respond quickly to new market insights by keeping development cycles short and data models flexible.
企业需要保持敏捷,以低成本检验假设,并通过缩短开发周期、保持数据模型的灵活性来快速响应新的市场洞察。
-
Free and open source software has become very successful and is now preferred to commercial or bespoke in-house software in many environments.
免费开源软件已经变得非常成功,并且在许多环境中比商业或定制内部软件更受欢迎。
-
CPU clock speeds are barely increasing, but multi-core processors are standard, and networks are getting faster. This means parallelism is only going to increase.
CPU时钟速度几乎没有增加,但多核处理器已成为标准,网络也越来越快。这意味着并行化只会越来越普遍。
-
Even if you work on a small team, you can now build systems that are distributed across many machines and even multiple geographic regions, thanks to infrastructure as a service (IaaS) such as Amazon Web Services.
即使您在小团队工作,也可以借助基础设施即服务(IaaS)(如Amazon Web Services)构建跨多台机器甚至多个地理区域的系统。
-
Many services are now expected to be highly available; extended downtime due to outages or maintenance is becoming increasingly unacceptable.
许多服务现在都被期望具有高度可用性;由于故障或维护而导致的延长停机时间越来越不可接受。
Data-intensive applications are pushing the boundaries of what is possible by making use of these technological developments. We call an application data-intensive if data is its primary challenge—the quantity of data, the complexity of data, or the speed at which it is changing—as opposed to compute-intensive , where CPU cycles are the bottleneck.
数据密集型应用正在利用这些技术发展来突破可能性的边界。如果数据是一个应用的主要挑战——数据的数量、数据的复杂性或数据变化的速度——我们就称其为数据密集型应用;与之相对的是计算密集型应用,其瓶颈在于CPU周期。
The tools and technologies that help data-intensive applications store and process data have been rapidly adapting to these changes. New types of database systems (“NoSQL”) have been getting lots of attention, but message queues, caches, search indexes, frameworks for batch and stream processing, and related technologies are very important too. Many applications use some combination of these.
帮助数据密集型应用程序存储和处理数据的工具和技术已经迅速适应这些变化。新型的数据库系统("NoSQL")受到了很多关注,但是消息队列、缓存、搜索索引、批处理和流处理的框架以及相关技术也非常重要。许多应用程序使用这些技术的组合。
The buzzwords that fill this space are a sign of enthusiasm for the new possibilities, which is a great thing. However, as software engineers and architects, we also need to have a technically accurate and precise understanding of the various technologies and their trade-offs if we want to build good applications. For that understanding, we have to dig deeper than buzzwords.
这个领域充斥的各种流行语体现了人们对新的可能性的热情,这是一件好事。然而,作为软件工程师和架构师,如果我们想构建出优秀的应用,就还需要对各种技术及其权衡有技术上准确而精确的理解。要获得这种理解,我们必须深入到流行语的背后。
Fortunately, behind the rapid changes in technology, there are enduring principles that remain true, no matter which version of a particular tool you are using. If you understand those principles, you’re in a position to see where each tool fits in, how to make good use of it, and how to avoid its pitfalls. That’s where this book comes in.
幸运的是,在技术的迅速变化背后,存在一些不变的原则,无论你使用哪个版本的特定工具都是如此。如果你理解这些原则,就能够看到每个工具的适用范围,如何有效地利用它以及如何避免它的陷阱。这就是本书的作用。
The goal of this book is to help you navigate the diverse and fast-changing landscape of technologies for processing and storing data. This book is not a tutorial for one particular tool, nor is it a textbook full of dry theory. Instead, we will look at examples of successful data systems: technologies that form the foundation of many popular applications and that have to meet scalability, performance, and reliability requirements in production every day.
本书的目的是帮助您在多样而快速变化的数据处理与存储技术领域中找到方向。本书不是针对某个特定工具的教程,也不是一本充满枯燥理论的教科书。相反,我们将考察一些成功的数据系统实例:这些技术构成了许多流行应用的基础,并且每天都要在生产环境中满足可扩展性、性能和可靠性的要求。
We will dig into the internals of those systems, tease apart their key algorithms, discuss their principles and the trade-offs they have to make. On this journey, we will try to find useful ways of thinking about data systems—not just how they work, but also why they work that way, and what questions we need to ask.
我们将深入研究这些系统的内部结构,剖析它们的关键算法,讨论它们的原理以及它们必须做出的权衡。在这个过程中,我们将尝试找到思考数据系统的有用方式——不仅要了解它们如何工作,还要了解它们为什么这样工作,以及我们需要提出哪些问题。
After reading this book, you will be in a great position to decide which kind of technology is appropriate for which purpose, and understand how tools can be combined to form the foundation of a good application architecture. You won’t be ready to build your own database storage engine from scratch, but fortunately that is rarely necessary. You will, however, develop a good intuition for what your systems are doing under the hood so that you can reason about their behavior, make good design decisions, and track down any problems that may arise.
阅读完本书后,您将能够准确判断哪种技术适用于哪种目的,并了解如何将不同工具结合起来构建一个良好的应用架构基础。虽然您可能没有能力从零开始构建自己的数据库存储引擎,但幸运的是,这很少有必要。但是,您将能够对系统运行情况有良好的直觉,从而能够推理其行为,做出良好的设计决策,并找出可能出现的任何问题。
Who Should Read This Book?
If you develop applications that have some kind of server/backend for storing or processing data, and your applications use the internet (e.g., web applications, mobile apps, or internet-connected sensors), then this book is for you.
如果您开发的应用程序有一些服务器/后端来存储或处理数据,并且您的应用程序使用互联网(例如网络应用程序、移动应用程序或连接到互联网的传感器),那么这本书适合您。
This book is for software engineers, software architects, and technical managers who love to code. It is especially relevant if you need to make decisions about the architecture of the systems you work on—for example, if you need to choose tools for solving a given problem and figure out how best to apply them. But even if you have no choice over your tools, this book will help you better understand their strengths and weaknesses.
这本书适合热爱编程的软件工程师、软件架构师和技术经理。如果您需要对所负责系统的架构做出决策——例如,需要为给定问题选择工具,并弄清楚如何最好地运用它们——那么这本书尤为相关。但即使您对工具没有选择权,这本书也能帮助您更好地了解这些工具的优势和劣势。
You should have some experience building web-based applications or network services, and you should be familiar with relational databases and SQL. Any non-relational databases and other data-related tools you know are a bonus, but not required. A general understanding of common network protocols like TCP and HTTP is helpful. Your choice of programming language or framework makes no difference for this book.
你应该具备一定的构建基于Web的应用程序或网络服务的经验,并且应该熟悉关系型数据库和SQL。了解任何非关系型数据库和其他数据相关工具是加分项,但不是必需的。对TCP和HTTP等常见网络协议有大致了解会有所帮助。本书与你选择的编程语言或框架无关。
If any of the following are true for you, you’ll find this book valuable:
如果以下情况适用于您,您会发现这本书非常有价值:
-
You want to learn how to make data systems scalable, for example, to support web or mobile apps with millions of users.
你想学习如何使数据系统可扩展,例如,支持拥有数百万用户的网站或移动应用程序。
-
You need to make applications highly available (minimizing downtime) and operationally robust.
您需要使应用程序高度可用(最大限度地减少停机时间),并在运维上保持稳健。
-
You are looking for ways of making systems easier to maintain in the long run, even as they grow and as requirements and technologies change.
您正在寻找让系统在长期内更易于维护的方法,即使系统在不断增长、需求和技术在不断变化。
-
You have a natural curiosity for the way things work and want to know what goes on inside major websites and online services. This book breaks down the internals of various databases and data processing systems, and it’s great fun to explore the bright thinking that went into their design.
你对事物的运作方式有天生的好奇心,想要了解主要网站和在线服务内部的工作原理。这本书剖析了各种数据库和数据处理系统的内部结构,探索其设计背后的精妙思考过程,非常有趣。
Sometimes, when discussing scalable data systems, people make comments along the lines of, “You’re not Google or Amazon. Stop worrying about scale and just use a relational database.” There is truth in that statement: building for scale that you don’t need is wasted effort and may lock you into an inflexible design. In effect, it is a form of premature optimization. However, it’s also important to choose the right tool for the job, and different technologies each have their own strengths and weaknesses. As we shall see, relational databases are important but not the final word on dealing with data.
有时候,在讨论可扩展的数据系统时,人们会发表类似这样的评论:“你又不是谷歌或亚马逊。别操心规模了,用关系数据库就行。”这种说法有一定道理:为你并不需要的规模而构建是浪费精力,还可能把你锁定在不灵活的设计中。实际上,这是一种过早优化。然而,为任务选择合适的工具同样重要,而不同的技术各有其优缺点。正如我们将要看到的,关系数据库很重要,但并不是处理数据的最终定论。
Scope of This Book
This book does not attempt to give detailed instructions on how to install or use specific software packages or APIs, since there is already plenty of documentation for those things. Instead we discuss the various principles and trade-offs that are fundamental to data systems, and we explore the different design decisions taken by different products.
本书无意提供如何安装或使用特定软件包或API的详细说明,因为这些内容已经有了大量的文档。相反,我们讨论了数据系统中基本的原则和权衡,探讨了不同产品所采取的不同设计决策。
In the ebook editions we have included links to the full text of online resources. All links were verified at the time of publication, but unfortunately links tend to break frequently due to the nature of the web. If you come across a broken link, or if you are reading a print copy of this book, you can look up references using a search engine. For academic papers, you can search for the title in Google Scholar to find open-access PDF files. Alternatively, you can find all of the references at https://github.com/ept/ddia-references , where we maintain up-to-date links.
在电子书版本中,我们提供了指向在线资源全文的链接。所有链接在出版时都经过验证,但遗憾的是,由于网络的特性,链接往往容易失效。如果您遇到失效的链接,或者正在阅读本书的印刷版,可以使用搜索引擎查找参考资料。对于学术论文,您可以在Google Scholar中搜索论文标题,以找到开放获取的PDF文件。或者,您也可以在 https://github.com/ept/ddia-references 找到所有参考资料,我们在那里维护最新的链接。
We look primarily at the architecture of data systems and the ways they are integrated into data-intensive applications. This book doesn’t have space to cover deployment, operations, security, management, and other areas—those are complex and important topics, and we wouldn’t do them justice by making them superficial side notes in this book. They deserve books of their own.
我们主要关注数据系统的架构,以及它们被集成到数据密集型应用中的方式。本书没有篇幅涵盖部署、运维、安全、管理等领域——这些都是复杂而重要的主题,如果只在本书中作为肤浅的旁注来处理,是对它们的不公。它们值得各自专门的著作。
Many of the technologies described in this book fall within the realm of the Big Data buzzword. However, the term “Big Data” is so overused and underdefined that it is not useful in a serious engineering discussion. This book uses less ambiguous terms, such as single-node versus distributed systems, or online/interactive versus offline/batch processing systems.
本书描述的许多技术都属于“大数据”(Big Data)这个流行语的范畴。然而,“大数据”一词被过度滥用且定义不清,在严肃的工程讨论中并无用处。本书使用歧义较少的术语,如单节点系统与分布式系统,或在线/交互式系统与离线/批处理系统。
This book has a bias toward free and open source software (FOSS), because reading, modifying, and executing source code is a great way to understand how something works in detail. Open platforms also reduce the risk of vendor lock-in. However, where appropriate, we also discuss proprietary software (closed-source software, software as a service, or companies’ in-house software that is only described in literature but not released publicly).
本书偏向于自由和开源软件(FOSS),因为阅读、修改和执行源代码是详细了解某样东西如何工作的好方法。开放的平台也能降低厂商锁定的风险。不过,在适当的情况下,我们也会讨论专有软件(闭源软件、软件即服务,或仅在文献中有所描述但未公开发布的公司内部软件)。
Outline of This Book
This book is arranged into three parts:
这本书分为三个部分:
-
In Part I , we discuss the fundamental ideas that underpin the design of data-intensive applications. We start in Chapter 1 by discussing what we’re actually trying to achieve: reliability, scalability, and maintainability; how we need to think about them; and how we can achieve them. In Chapter 2 we compare several different data models and query languages, and see how they are appropriate to different situations. In Chapter 3 we talk about storage engines: how databases arrange data on disk so that we can find it again efficiently. Chapter 4 turns to formats for data encoding (serialization) and evolution of schemas over time.
在第一部分中,我们讨论支撑数据密集型应用设计的基本思想。第1章首先讨论我们实际要实现的目标:可靠性、可扩展性和可维护性;我们应如何思考它们,以及如何实现它们。在第2章中,我们比较几种不同的数据模型和查询语言,并了解它们各自适用于哪些情况。在第3章中,我们讨论存储引擎:数据库如何在磁盘上组织数据,以便我们能高效地再次找到它。第4章转向数据编码(序列化)的格式,以及模式(schema)随时间的演化。
-
In Part II , we move from data stored on one machine to data that is distributed across multiple machines. This is often necessary for scalability, but brings with it a variety of unique challenges. We first discuss replication ( Chapter 5 ), partitioning/sharding ( Chapter 6 ), and transactions ( Chapter 7 ). We then go into more detail on the problems with distributed systems ( Chapter 8 ) and what it means to achieve consistency and consensus in a distributed system ( Chapter 9 ).
在第二部分中,我们从存储在单台机器上的数据转向分布在多台机器上的数据。这通常是出于可扩展性的需要,但也带来了各种独特的挑战。我们首先讨论复制(第5章)、分区/分片(第6章)和事务(第7章)。然后,我们更详细地讨论分布式系统中的问题(第8章),以及在分布式系统中实现一致性与共识意味着什么(第9章)。
-
In Part III , we discuss systems that derive some datasets from other datasets. Derived data often occurs in heterogeneous systems: when there is no one database that can do everything well, applications need to integrate several different databases, caches, indexes, and so on. In Chapter 10 we start with a batch processing approach to derived data, and we build upon it with stream processing in Chapter 11 . Finally, in Chapter 12 we put everything together and discuss approaches for building reliable, scalable, and maintainable applications in the future.
在第三部分中,我们讨论从其他数据集派生出某些数据集的系统。派生数据常见于异构系统中:当没有一个数据库能把所有事情都做好时,应用就需要集成多个不同的数据库、缓存、索引等。在第10章中,我们从处理派生数据的批处理方法讲起,并在第11章中在其基础上讨论流处理。最后,在第12章中,我们把所有内容汇总起来,讨论未来构建可靠、可扩展和可维护应用的方法。
References and Further Reading
Most of what we discuss in this book has already been said elsewhere in some form or another—in conference presentations, research papers, blog posts, code, bug trackers, mailing lists, and engineering folklore. This book summarizes the most important ideas from many different sources, and it includes pointers to the original literature throughout the text. The references at the end of each chapter are a great resource if you want to explore an area in more depth, and most of them are freely available online.
本书讨论的大部分内容已经在其他地方以某种形式提出过——会议演讲、研究论文、博客文章、代码、缺陷跟踪系统、邮件列表以及工程圈的口口相传。本书总结了来自许多不同来源的最重要的思想,并在全文中附有指向原始文献的引用。如果你想深入探索某一领域,每章末尾的参考文献是极佳的资源,其中大多数都可以在网上免费获取。
O’Reilly Safari
Note
Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.
Safari(原名Safari Books Online)是一个面向企业、政府、教育机构和个人的基于会员制的培训和参考平台。
Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.
会员可访问来自250多个出版商的数千本图书、培训视频、学习路径、交互式教程和策划播放列表,包括O'Reilly Media、哈佛商业评论、Prentice Hall Professional、Addison-Wesley Professional、Microsoft Press、Sams、Que、Peachpit Press、Adobe、Focal Press、Cisco Press、John Wiley & Sons、Syngress、Morgan Kaufmann、IBM Redbooks、Packt、Adobe Press、FT Press、Apress、Manning、New Riders、McGraw-Hill、Jones&Bartlett以及Course Technology等。
For more information, please visit http://oreilly.com/safari .
请访问http://oreilly.com/safari获取更多信息。
How to Contact Us
Please address comments and questions concerning this book to the publisher:
请将有关此书的评论和问题发送给出版社:
- O’Reilly Media, Inc.
- 1005 Gravenstein Highway North
- Sebastopol, CA 95472
- 800-998-9938 (in the United States or Canada)
- 707-829-0515 (international or local)
- 707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/designing-data-intensive-apps .
本书有一个专门的网页,上面列出了勘误、示例和其他附加信息。您可以访问 http://bit.ly/designing-data-intensive-apps 查看此页。
To comment or ask technical questions about this book, send email to [email protected] .
如需对本书进行评论或提出技术问题,请发送电子邮件至[email protected]。
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com .
有关我们的图书、课程、会议和新闻的更多信息,请访问我们的网站http://www.oreilly.com。
Find us on Facebook: http://facebook.com/oreilly
在Facebook上关注我们:http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
关注我们的Twitter账户:http://twitter.com/oreillymedia。
Watch us on YouTube: http://www.youtube.com/oreillymedia
在YouTube上观看我们:http://www.youtube.com/oreillymedia。
Acknowledgments
This book is an amalgamation and systematization of a large number of other people’s ideas and knowledge, combining experience from both academic research and industrial practice. In computing we tend to be attracted to things that are new and shiny, but I think we have a huge amount to learn from things that have been done before. This book has over 800 references to articles, blog posts, talks, documentation, and more, and they have been an invaluable learning resource for me. I am very grateful to the authors of this material for sharing their knowledge.
这本书汇集并系统化了大量其他人的思想和知识,融合了学术研究和工业实践两方面的经验。在计算领域,我们往往被新奇闪亮的东西吸引,但我认为,我们可以从前人的工作中学到大量东西。本书引用了800多篇文章、博客、演讲、文档等参考资料,它们对我来说是宝贵的学习资源。我非常感谢这些材料的作者分享他们的知识。
I have also learned a lot from personal conversations, thanks to a large number of people who have taken the time to discuss ideas or patiently explain things to me. In particular, I would like to thank Joe Adler, Ross Anderson, Peter Bailis, Márton Balassi, Alastair Beresford, Mark Callaghan, Mat Clayton, Patrick Collison, Sean Cribbs, Shirshanka Das, Niklas Ekström, Stephan Ewen, Alan Fekete, Gyula Fóra, Camille Fournier, Andres Freund, John Garbutt, Seth Gilbert, Tom Haggett, Pat Helland, Joe Hellerstein, Jakob Homan, Heidi Howard, John Hugg, Julian Hyde, Conrad Irwin, Evan Jones, Flavio Junqueira, Jessica Kerr, Kyle Kingsbury, Jay Kreps, Carl Lerche, Nicolas Liochon, Steve Loughran, Lee Mallabone, Nathan Marz, Caitie McCaffrey, Josie McLellan, Christopher Meiklejohn, Ian Meyers, Neha Narkhede, Neha Narula, Cathy O’Neil, Onora O’Neill, Ludovic Orban, Zoran Perkov, Julia Powles, Chris Riccomini, Henry Robinson, David Rosenthal, Jennifer Rullmann, Matthew Sackman, Martin Scholl, Amit Sela, Gwen Shapira, Greg Spurrier, Sam Stokes, Ben Stopford, Tom Stuart, Diana Vasile, Rahul Vohra, Pete Warden, and Brett Wooldridge.
因为许多人抽出时间与我探讨问题或耐心向我解释事物,我也从个人交流中学到了很多。特别感谢以下人士: Joe Adler, Ross Anderson, Peter Bailis, Márton Balassi, Alastair Beresford, Mark Callaghan, Mat Clayton, Patrick Collison, Sean Cribbs, Shirshanka Das, Niklas Ekström, Stephan Ewen, Alan Fekete, Gyula Fóra, Camille Fournier, Andres Freund, John Garbutt, Seth Gilbert, Tom Haggett, Pat Helland, Joe Hellerstein, Jakob Homan, Heidi Howard, John Hugg, Julian Hyde, Conrad Irwin, Evan Jones, Flavio Junqueira, Jessica Kerr, Kyle Kingsbury, Jay Kreps, Carl Lerche, Nicolas Liochon, Steve Loughran, Lee Mallabone, Nathan Marz, Caitie McCaffrey, Josie McLellan, Christopher Meiklejohn, Ian Meyers, Neha Narkhede, Neha Narula, Cathy O’Neil, Onora O’Neill, Ludovic Orban, Zoran Perkov, Julia Powles, Chris Riccomini, Henry Robinson, David Rosenthal, Jennifer Rullmann, Matthew Sackman, Martin Scholl, Amit Sela, Gwen Shapira, Greg Spurrier, Sam Stokes, Ben Stopford, Tom Stuart, Diana Vasile, Rahul Vohra, Pete Warden, 和 Brett Wooldridge.
Several more people have been invaluable to the writing of this book by reviewing drafts and providing feedback. For these contributions I am particularly indebted to Raul Agepati, Tyler Akidau, Mattias Andersson, Sasha Baranov, Veena Basavaraj, David Beyer, Jim Brikman, Paul Carey, Raul Castro Fernandez, Joseph Chow, Derek Elkins, Sam Elliott, Alexander Gallego, Mark Grover, Stu Halloway, Heidi Howard, Nicola Kleppmann, Stefan Kruppa, Bjorn Madsen, Sander Mak, Stefan Podkowinski, Phil Potter, Hamid Ramazani, Sam Stokes, and Ben Summers. Of course, I take all responsibility for any remaining errors or unpalatable opinions in this book.
还有几位人士审阅书稿并提供反馈,对本书的写作贡献良多。我尤其要感谢:Raul Agepati、Tyler Akidau、Mattias Andersson、Sasha Baranov、Veena Basavaraj、David Beyer、Jim Brikman、Paul Carey、Raul Castro Fernandez、Joseph Chow、Derek Elkins、Sam Elliott、Alexander Gallego、Mark Grover、Stu Halloway、Heidi Howard、Nicola Kleppmann、Stefan Kruppa、Bjorn Madsen、Sander Mak、Stefan Podkowinski、Phil Potter、Hamid Ramazani、Sam Stokes 和 Ben Summers。当然,书中遗留的任何错误或不合口味的观点,责任都在我本人。
For helping this book become real, and for their patience with my slow writing and unusual requests, I am grateful to my editors Marie Beaugureau, Mike Loukides, Ann Spencer, and all the team at O’Reilly. For helping find the right words, I thank Rachel Head. For giving me the time and freedom to write in spite of other work commitments, I thank Alastair Beresford, Susan Goodhue, Neha Narkhede, and Kevin Scott.
感谢我的编辑 Marie Beaugureau、Mike Loukides、Ann Spencer 以及 O’Reilly 的整个团队,他们帮助这本书成为现实,并耐心包容我缓慢的写作和不寻常的请求。感谢 Rachel Head 帮助我找到合适的措辞。感谢 Alastair Beresford、Susan Goodhue、Neha Narkhede 和 Kevin Scott,尽管我有其他工作任务在身,他们仍给予我写作的时间和自由。
Very special thanks are due to Shabbir Diwan and Edie Freedman, who illustrated with great care the maps that accompany the chapters. It’s wonderful that they took on the unconventional idea of creating maps, and made them so beautiful and compelling.
特别感谢 Shabbir Diwan 和 Edie Freedman,他们精心绘制了各章随附的地图。他们接受了绘制地图这个不同寻常的想法,并把地图做得如此美观而引人入胜,实在难得。
Finally, my love goes to my family and friends, without whom I would not have been able to get through this writing process that has taken almost four years. You’re the best.
最后,我的爱送给我的家人和朋友们,没有你们的支持,我无法度过这长达四年的写作过程。你们是最棒的。
Part I. Foundations of Data Systems
The first four chapters go through the fundamental ideas that apply to all data systems, whether running on a single machine or distributed across a cluster of machines:
前四章介绍了适用于所有数据系统的基本思想,无论数据系统是运行在单台机器上还是分布在机器集群中:
-
Chapter 1 introduces the terminology and approach that we’re going to use throughout this book. It examines what we actually mean by words like reliability , scalability , and maintainability , and how we can try to achieve these goals.
第1章介绍了我们在本书中将要使用的术语和方法。它探讨了可靠性、可扩展性和可维护性这些词的实际含义,以及我们可以如何努力实现这些目标。
-
Chapter 2 compares several different data models and query languages—the most visible distinguishing factor between databases from a developer’s point of view. We will see how different models are appropriate to different situations.
第二章比较了几种不同的数据模型和查询语言——这是从开发人员的角度来看,数据库之间最显著的区别因素。我们将看到不同的模型适用于不同的情况。
-
Chapter 3 turns to the internals of storage engines and looks at how databases lay out data on disk. Different storage engines are optimized for different workloads, and choosing the right one can have a huge effect on performance.
第3章介绍了存储引擎的内部机制,并探讨了数据库在磁盘上如何布置数据。不同的存储引擎针对不同的工作负载进行了优化,选择正确的引擎可以对性能产生巨大影响。
-
Chapter 4 compares various formats for data encoding (serialization) and especially examines how they fare in an environment where application requirements change and schemas need to adapt over time.
第4章比较了不同的数据编码格式(序列化),特别是在应用需求变化和模式需要随时间调整的环境中它们的表现。
Later, Part II will turn to the particular issues of distributed data systems.
接下来,第二部分将转向分布式数据系统的特定问题。
Chapter 1. Reliable, Scalable, and Maintainable Applications
The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free?
互联网做得如此出色,以至于大多数人把它看作像太平洋一样的自然资源,而不是人造之物。上一次出现规模如此之大而又如此无差错的技术,是什么时候的事了?
Alan Kay , in interview with Dr Dobb’s Journal (2012)
艾伦·凯(Alan Kay),摘自《Dr Dobb’s Journal》访谈(2012)
Many applications today are data-intensive , as opposed to compute-intensive . Raw CPU power is rarely a limiting factor for these applications—bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.
今天,许多应用程序都是数据密集型的,而非计算密集型的。对这些应用而言,原始的CPU处理能力很少成为限制因素——更大的问题通常是数据的数量、数据的复杂性以及数据变化的速度。
A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:
一个数据密集型应用通常是由提供常见功能的标准构建块构建而成。例如,许多应用程序需要:
-
Store data so that they, or another application, can find it again later ( databases )
存储数据,以便自己或其他应用程序以后能再次找到它(数据库)
-
Remember the result of an expensive operation, to speed up reads ( caches )
记住昂贵操作的结果以加快读取速度(缓存)。
-
Allow users to search data by keyword or filter it in various ways ( search indexes )
允许用户按关键字搜索数据,或以各种方式对其进行过滤(搜索索引)
-
Send a message to another process, to be handled asynchronously ( stream processing )
发送消息到另一个进程,进行异步处理(流处理)。
-
Periodically crunch a large amount of accumulated data ( batch processing )
定期处理大量累积的数据(批处理)
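As an illustration only (these classes and names are not any particular product's API), the five building blocks listed above can be sketched as minimal in-memory stand-ins; real systems of course add durability, concurrency control, and distribution:

仅作示意(以下类与名称并非任何具体产品的API),上述五种构建块可以用极简的内存版替身来勾勒;真实系统当然还需要持久性、并发控制和分布式能力:

```python
from collections import defaultdict, deque

class Database:                       # store data for later retrieval
    def __init__(self):
        self.records = {}
    def put(self, key, value):
        self.records[key] = value
    def get(self, key):
        return self.records.get(key)

class Cache:                          # remember the result of an expensive operation
    def __init__(self):
        self.entries = {}
    def get_or_compute(self, key, compute):
        if key not in self.entries:   # only compute on a cache miss
            self.entries[key] = compute()
        return self.entries[key]

class SearchIndex:                    # find records by keyword
    def __init__(self):
        self.postings = defaultdict(set)
    def index(self, key, text):
        for word in text.lower().split():
            self.postings[word].add(key)
    def search(self, word):
        return self.postings[word.lower()]

class MessageQueue:                   # hand a message to another process
    def __init__(self):
        self.messages = deque()
    def send(self, msg):
        self.messages.append(msg)
    def receive(self):
        return self.messages.popleft()

def batch_count_words(db):            # periodically crunch all accumulated data
    counts = defaultdict(int)
    for text in db.records.values():
        for word in text.lower().split():
            counts[word] += 1
    return dict(counts)
```

The point of the sketch is the division of labor, not the implementations: each stand-in answers one of the five needs named above. 这个草图的重点在于分工而非实现:每个替身各自对应上面列出的一种需求。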
If that sounds painfully obvious, that’s just because these data systems are such a successful abstraction: we use them all the time without thinking too much. When building an application, most engineers wouldn’t dream of writing a new data storage engine from scratch, because databases are a perfectly good tool for the job.
如果这些听起来再明显不过,那只是因为这些数据系统是非常成功的抽象:我们一直在使用它们,却不用想太多。在构建应用时,大多数工程师不会想到从头编写一个新的数据存储引擎,因为数据库就是完成这项工作的绝佳工具。
But reality is not that simple. There are many database systems with different characteristics, because different applications have different requirements. There are various approaches to caching, several ways of building search indexes, and so on. When building an application, we still need to figure out which tools and which approaches are the most appropriate for the task at hand. And it can be hard to combine tools when you need to do something that a single tool cannot do alone.
然而,现实并没有这么简单。不同的应用有不同的需求,因此存在许多特性各异的数据库系统;缓存有多种方法,建立搜索索引也有好几种方式,等等。在构建应用时,我们仍然需要弄清楚哪些工具、哪些方法最适合手头的任务。而当你需要完成单个工具无法独立完成的任务时,组合使用多种工具可能会很困难。
This book is a journey through both the principles and the practicalities of data systems, and how you can use them to build data-intensive applications. We will explore what different tools have in common, what distinguishes them, and how they achieve their characteristics.
这本书带你穿越数据系统的原则和实践,以及如何使用它们构建数据密集型应用。我们将探讨不同工具的共同点、区别以及它们如何实现其特性。
In this chapter, we will start by exploring the fundamentals of what we are trying to achieve: reliable, scalable, and maintainable data systems. We’ll clarify what those things mean, outline some ways of thinking about them, and go over the basics that we will need for later chapters. In the following chapters we will continue layer by layer, looking at different design decisions that need to be considered when working on a data-intensive application.
在本章中,我们将首先探讨我们所致力于实现的基本原理:可靠、可扩展和可维护的数据系统。我们将澄清这些事物的含义,概述一些思考方式,并介绍我们后续章节所需要的基础知识。在接下来的章节中,我们将逐层深入,探讨处理数据密集型应用时需要考虑的不同设计决策。
Thinking About Data Systems
We typically think of databases, queues, caches, etc. as being very different categories of tools. Although a database and a message queue have some superficial similarity—both store data for some time—they have very different access patterns, which means different performance characteristics, and thus very different implementations.
我们通常认为数据库、队列、缓存等是非常不同的工具类别。 尽管数据库和消息队列有一些表面的相似之处——都可以存储数据一段时间——但它们具有非常不同的访问模式,这意味着不同的性能特征,因此实现也非常不同。
So why should we lump them all together under an umbrella term like data systems ?
那么我们为什么要把它们都归为一个数据系统的总称呢?
Many new tools for data storage and processing have emerged in recent years. They are optimized for a variety of different use cases, and they no longer neatly fit into traditional categories [ 1 ]. For example, there are datastores that are also used as message queues (Redis), and there are message queues with database-like durability guarantees (Apache Kafka). The boundaries between the categories are becoming blurred.
近年来涌现了许多新的数据存储和处理工具。它们针对不同的使用情境进行了优化,并不再简单地符合传统分类[1]。例如,有些数据存储库也用作消息队列(Redis),有些消息队列具备类似数据库的耐久性保证(Apache Kafka)。各种类别的边界变得模糊不清。
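The blurring can be seen in miniature in a toy append-only log (a sketch loosely inspired by the log abstraction behind systems like Kafka, not their actual APIs): read sequentially by offset, it behaves like a message queue; looked up by key with last-write-wins, it behaves like a key-value store.

这种边界的模糊可以用一个玩具级的追加式日志来缩影(这只是受Kafka等系统背后的日志抽象启发的草图,并非其实际API):按偏移量顺序读取时,它表现得像消息队列;按键查找、以最后写入为准时,它又表现得像键值存储。

```python
# A single append-only log serving two roles at once.
class Log:
    def __init__(self):
        self.entries = []             # append-only list of (key, value) pairs

    def append(self, key, value):
        self.entries.append((key, value))
        return len(self.entries) - 1  # offset of the new entry

    def read_from(self, offset):      # queue-style: sequential consumption by offset
        return self.entries[offset:]

    def latest(self, key):            # store-style: last write for a key wins
        for k, v in reversed(self.entries):
            if k == key:
                return v
        return None
```

Whether this one structure is "a queue" or "a store" depends only on how it is accessed, which is exactly why the traditional categories no longer fit neatly. 同一个结构究竟是“队列”还是“存储”,完全取决于访问方式,这正是传统分类不再适用的原因。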
Secondly, increasingly many applications now have such demanding or wide-ranging requirements that a single tool can no longer meet all of its data processing and storage needs. Instead, the work is broken down into tasks that can be performed efficiently on a single tool, and those different tools are stitched together using application code.
其次,越来越多的应用程序现在拥有如此苛刻或广泛的要求,以至于单个工具不能再满足其所有数据处理和存储需求。相反,工作被分解为可以在单个工具上高效执行的任务,并使用应用代码将这些不同的工具组合在一起。
For example, if you have an application-managed caching layer (using Memcached or similar), or a full-text search server (such as Elasticsearch or Solr) separate from your main database, it is normally the application code’s responsibility to keep those caches and indexes in sync with the main database. Figure 1-1 gives a glimpse of what this may look like (we will go into detail in later chapters).
例如,如果您拥有一个应用程序管理的缓存层(使用Memcached或类似工具)或一个与您的主数据库分开的全文搜索服务器(例如Elasticsearch或Solr),通常应用程序代码需要负责保持这些缓存和索引与主数据库同步。图1-1展示了这可能是什么样子(我们会在后面的章节详细说明)。
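As a hedged sketch (the class and function names here are hypothetical, not from any real library), the application-managed synchronization described above might look like this in outline: the application writes to the system of record first, then invalidates the cache and reindexes for search.
作为一个示意性的草图(这里的类名和函数名都是假设的,并非来自任何真实的库),上文所述的由应用程序代码负责的同步逻辑大致如下:应用先写入作为事实来源的主数据库,再使缓存失效并更新搜索索引。

```python
class InMemoryStore:
    """Toy stand-in for a database table, a cache, or a search index."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def delete(self, key):
        self.data.pop(key, None)


def update_user(db, cache, index, user_id, fields):
    """Application code keeps the cache and search index in sync with
    the main database; the datastores do not do this for us."""
    db.put(user_id, fields)      # 1. write to the system of record first
    cache.delete(user_id)        # 2. invalidate the cached copy (refilled on next read)
    index.put(user_id, fields)   # 3. reindex the document for full-text search
```

Invalidation (rather than updating the cache in place) is a common choice here, since it avoids some races between concurrent writers; a real implementation would also have to consider what happens if step 2 or 3 fails partway.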
When you combine several tools in order to provide a service, the service’s interface or application programming interface (API) usually hides those implementation details from clients. Now you have essentially created a new, special-purpose data system from smaller, general-purpose components. Your composite data system may provide certain guarantees: e.g., that the cache will be correctly invalidated or updated on writes so that outside clients see consistent results. You are now not only an application developer, but also a data system designer.
当你将几个工具组合在一起以提供服务时,服务的接口或应用程序编程接口(API)通常会对客户端隐藏这些实现细节。现在,你基本上已经用较小的通用组件创建了一个新的、专用的数据系统。你的组合数据系统可能提供某些保证:例如,缓存会在写入时被正确地失效或更新,使外部客户端看到一致的结果。现在,你不仅是应用开发者,还是数据系统设计者。
If you are designing a data system or service, a lot of tricky questions arise. How do you ensure that the data remains correct and complete, even when things go wrong internally? How do you provide consistently good performance to clients, even when parts of your system are degraded? How do you scale to handle an increase in load? What does a good API for the service look like?
如果你正在设计数据系统或服务,会出现很多棘手的问题。即使内部出了问题,如何确保数据仍然正确和完整?即使系统的某些部分性能退化,如何为客户端提供始终如一的良好性能?如何扩展以应对负载的增加?服务的良好 API 应该是什么样的?
There are many factors that may influence the design of a data system, including the skills and experience of the people involved, legacy system dependencies, the timescale for delivery, your organization’s tolerance of different kinds of risk, regulatory constraints, etc. Those factors depend very much on the situation.
有许多因素可能影响数据系统的设计,包括涉及的人员的技能和经验,遗留系统的依赖性,交付时间表,组织对不同风险容忍度,监管限制等。这些因素在很大程度上取决于具体情况。
In this book, we focus on three concerns that are important in most software systems:
在这本书中,我们关注的是大多数软件系统中重要的三个问题:
- Reliability
-
The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error). See “Reliability” .
系统应该在逆境中继续正常工作(保持所需的性能水平执行正确的功能),甚至在硬件或软件故障和人为错误的情况下也是如此。请参见“可靠性”。
- Scalability
-
As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth. See “Scalability” .
随着系统规模的增长(数据量、流量或复杂性),应有合理的处理方式。参见“可扩展性”。
- Maintainability
-
Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively . See “Maintainability” .
随着时间的推移,许多不同的人将在系统上工作(包括工程和运营,维护当前行为并适应新的用例),他们都应该能够以高效的方式工作。请参见:“可维护性”。
These words are often cast around without a clear understanding of what they mean. In the interest of thoughtful engineering, we will spend the rest of this chapter exploring ways of thinking about reliability, scalability, and maintainability. Then, in the following chapters, we will look at various techniques, architectures, and algorithms that are used in order to achieve those goals.
这些词经常被随意使用,而人们对其含义却缺乏清晰的理解。本着严谨工程的精神,我们将在本章余下的部分探讨思考可靠性、可扩展性和可维护性的方式。然后,在随后的章节中,我们将考察为实现这些目标而使用的各种技术、架构和算法。
Reliability
Everybody has an intuitive idea of what it means for something to be reliable or unreliable. For software, typical expectations include:
每个人都对什么是可靠或不可靠的东西有一个直观的想法。对于软件来说,典型的期望包括:
-
The application performs the function that the user expected.
应用程序执行用户期望的功能。
-
It can tolerate the user making mistakes or using the software in unexpected ways.
它可以容忍用户犯错误或以意想不到的方式使用软件。
-
Its performance is good enough for the required use case, under the expected load and data volume.
在预期的负载和数据量下,它的性能足以满足所需的使用场景。
-
The system prevents any unauthorized access and abuse.
该系统防止未经授权的访问和滥用。
If all those things together mean “working correctly,” then we can understand reliability as meaning, roughly, “continuing to work correctly, even when things go wrong.”
如果所有这些事情一起意味着“正常工作”,那么我们可以将可靠性大致理解为“即使在出现故障时仍能正确工作”。
The things that can go wrong are called faults , and systems that anticipate faults and can cope with them are called fault-tolerant or resilient . The former term is slightly misleading: it suggests that we could make a system tolerant of every possible kind of fault, which in reality is not feasible. If the entire planet Earth (and all servers on it) were swallowed by a black hole, tolerance of that fault would require web hosting in space—good luck getting that budget item approved. So it only makes sense to talk about tolerating certain types of faults.
能够出错的事情被称为故障,而能够预见并应对故障的系统被称为容错或弹性系统。前者的术语略有误导:它暗示我们可以使系统容忍每种可能的故障,但实际上这是不可行的。如果整个地球(和所有服务器)都被黑洞吞噬,要容忍那种故障就需要在太空中进行网络托管——但很难拿到预算。因此,只有谈论容忍某些类型的故障才有意义。
Note that a fault is not the same as a failure [ 2 ]. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures. In this book we cover several techniques for building reliable systems from unreliable parts.
请注意,故障(fault)与失效(failure)并不相同[2]。故障通常被定义为系统中的一个组件偏离其规范,而失效是指整个系统停止向用户提供所需的服务。将故障的概率降至零是不可能的,因此通常最好设计能够防止故障导致失效的容错机制。在本书中,我们将介绍几种利用不可靠的部件构建可靠系统的技术。
Counterintuitively, in such fault-tolerant systems, it can make sense to increase the rate of faults by triggering them deliberately—for example, by randomly killing individual processes without warning. Many critical bugs are actually due to poor error handling [ 3 ]; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally. The Netflix Chaos Monkey [ 4 ] is an example of this approach.
与直觉相反,在这样的容错系统中,通过故意触发故障来提高故障率反而可能是有意义的——例如,在没有任何警告的情况下随机杀死单个进程。许多关键性的 bug 实际上源于糟糕的错误处理[3];通过有意引入故障,可以确保容错机制持续得到演练和测试,从而增强你对故障自然发生时能被正确处理的信心。Netflix 的 Chaos Monkey[4] 就是这种方法的一个例子。
Although we generally prefer tolerating faults over preventing faults, there are cases where prevention is better than cure (e.g., because no cure exists). This is the case with security matters, for example: if an attacker has compromised a system and gained access to sensitive data, that event cannot be undone. However, this book mostly deals with the kinds of faults that can be cured, as described in the following sections.
虽然我们通常更喜欢容忍错误而不是预防错误,但在有些情况下,预防优于治疗(例如,因为没有治愈方法)。安全问题就是这种情况的例子:如果攻击者侵犯了某个系统并获得了敏感数据的访问权限,那么该事件就无法撤消。然而,本书大多数讨论的是可以治愈的故障类型,如下面的章节所描述的那样。
Hardware Faults
When we think of causes of system failure, hardware faults quickly come to mind. Hard disks crash, RAM becomes faulty, the power grid has a blackout, someone unplugs the wrong network cable. Anyone who has worked with large datacenters can tell you that these things happen all the time when you have a lot of machines.
当我们考虑系统故障的原因时,硬件故障很快就会跳入脑海。硬盘崩溃,内存损坏,电网停电,有人拔了错误的网络电缆。任何曾在大型数据中心工作过的人都可以告诉你,当你有很多机器时,这些事情经常发生。
Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years [ 5 , 6 ]. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
据报告,硬盘的平均故障时间(MTTF)约为10到50年[5, 6]。因此,在一个拥有10,000块磁盘的存储集群上,我们应当预期平均每天有一块磁盘损坏。
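The arithmetic behind that estimate is a simple back-of-the-envelope calculation (taking 30 years as a mid-range MTTF, an assumption within the 10–50 year range quoted above):
该估算背后的算术只是一个简单的粗略计算(这里假设取 30 年作为上文引用的 10~50 年范围的中间值):

```python
# Back-of-the-envelope check of the failure-rate estimate above,
# assuming 30 years as a mid-range MTTF for a single disk.
disks = 10_000
mttf_years = 30
failures_per_year = disks / mttf_years      # ~333 disk failures per year
failures_per_day = failures_per_year / 365  # ~0.9, i.e. roughly one per day
```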
Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system. Disks may be set up in a RAID configuration, servers may have dual power supplies and hot-swappable CPUs, and datacenters may have batteries and diesel generators for backup power. When one component dies, the redundant component can take its place while the broken component is replaced. This approach cannot completely prevent hardware problems from causing failures, but it is well understood and can often keep a machine running uninterrupted for years.
我们通常的第一反应是将冗余添加到个别硬件组件中,以降低系统的故障率。可以将磁盘设置为RAID配置,服务器可以拥有双重电源和可热插拔的CPU,数据中心可以拥有备用电源的电池和柴油发电机。当一个组件出现故障时,冗余组件可以代替它的位置,而破损的组件则需要被更换。这种方法不能完全防止硬件问题导致故障,但是它是众所周知的,并且通常可以使一台机器连续运行多年。
Until recently, redundancy of hardware components was sufficient for most applications, since it makes total failure of a single machine fairly rare. As long as you can restore a backup onto a new machine fairly quickly, the downtime in case of failure is not catastrophic in most applications. Thus, multi-machine redundancy was only required by a small number of applications for which high availability was absolutely essential.
直到最近,硬件组件的冗余对大多数应用来说已经足够,因为它使得单台机器的完全失效变得相当罕见。只要能较快地把备份恢复到一台新机器上,故障造成的停机时间对大多数应用来说并不是灾难性的。因此,只有少数对高可用性有绝对要求的应用才需要多机冗余。
However, as data volumes and applications’ computing demands have increased, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults. Moreover, in some cloud platforms such as Amazon Web Services (AWS) it is fairly common for virtual machine instances to become unavailable without warning [ 7 ], as the platforms are designed to prioritize flexibility and elasticity over single-machine reliability.
然而,随着数据量和应用计算需求的增加,更多的应用开始使用数量更多的机器,硬件故障率也随之成比例上升。此外,在诸如亚马逊网络服务(AWS)之类的云平台上,虚拟机实例在没有任何警告的情况下变得不可用是相当常见的[7],因为这些平台的设计将灵活性和弹性置于单机可靠性之上。
Hence there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy. Such systems also have operational advantages: a single-server system requires planned downtime if you need to reboot the machine (to apply operating system security patches, for example), whereas a system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system (a rolling upgrade ; see Chapter 4 ).
因此,系统正朝着能够容忍整台机器失效的方向发展:优先使用软件容错技术,或将其作为硬件冗余的补充。这样的系统还具有运维上的优势:单服务器系统在需要重启机器时(例如为了应用操作系统安全补丁)必须计划停机,而能够容忍机器失效的系统则可以一次修补一个节点,无需让整个系统停机(滚动升级;参见第四章)。
Software Errors
We usually think of hardware faults as being random and independent from each other: one machine’s disk failing does not imply that another machine’s disk is going to fail. There may be weak correlations (for example due to a common cause, such as the temperature in the server rack), but otherwise it is unlikely that a large number of hardware components will fail at the same time.
通常我们认为硬件故障是随机和相互独立的:一台机器的硬盘故障并不意味着另一台机器的硬盘会发生故障。可能存在一些弱相关性(例如由于共同的原因,例如服务器机架中的温度),但除此之外,大量硬件组件同时发生故障的可能性很小。
Another class of fault is a systematic error within the system [ 8 ]. Such faults are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults [ 5 ]. Examples include:
另一类故障是系统内的系统性错误[8]。 这种故障很难预测,因为它们在节点之间存在相关性,因此它们往往比不相关的硬件故障导致更多的系统故障[5]。 例如:
-
A software bug that causes every instance of an application server to crash when given a particular bad input. For example, consider the leap second on June 30, 2012, that caused many applications to hang simultaneously due to a bug in the Linux kernel [ 9 ].
一个软件错误导致每个应用服务器在接收到特定的错误输入后都会崩溃。例如,考虑2012年6月30日的闰秒导致许多应用程序同时挂起,原因是Linux内核的错误[9]。
-
A runaway process that uses up some shared resource—CPU time, memory, disk space, or network bandwidth.
一个耗尽了一些共享资源的失控进程——CPU时间、内存、磁盘空间或网络带宽。
-
A service that the system depends on that slows down, becomes unresponsive, or starts returning corrupted responses.
系统依赖的服务变慢、无响应或返回错误响应。
-
Cascading failures, where a small fault in one component triggers a fault in another component, which in turn triggers further faults [ 10 ].
级联故障,一个组件的小故障会引发另一个组件的故障,接着又会触发更多的故障[10]。
The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software is making some kind of assumption about its environment—and while that assumption is usually true, it eventually stops being true for some reason [ 11 ].
导致这些软件故障的错误通常会在很长一段时间内处于休眠状态,直到它们被一组不寻常的情况激活。在那种情况下,软件会揭示出它对环境做出某种假设——虽然这个假设通常是正确的,但它最终由于某种原因停止成立[11]。
There is no quick solution to the problem of systematic faults in software. Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; measuring, monitoring, and analyzing system behavior in production. If a system is expected to provide some guarantee (for example, in a message queue, that the number of incoming messages equals the number of outgoing messages), it can constantly check itself while it is running and raise an alert if a discrepancy is found [ 12 ].
对于软件中的系统性故障,没有快速的解决办法。许多小办法都能有所帮助:仔细考虑系统中的假设和交互;进行彻底的测试;进程隔离;允许进程崩溃并重启;在生产环境中测量、监控和分析系统行为。如果系统需要提供某种保证(例如在消息队列中,传入消息的数量等于传出消息的数量),它可以在运行时不断进行自检,并在发现差异时发出警报[12]。
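The self-checking idea can be sketched in a few lines (a toy in-memory queue, not any real message broker): the queue counts messages in and out, and an audit method verifies the invariant that every incoming message is either still pending or has gone out.
这种自检思路可以用几行代码勾勒出来(一个玩具式的内存队列,并非任何真实的消息代理):队列统计进出的消息数,审计方法校验"每条传入的消息要么仍在等待,要么已经传出"这一不变式。

```python
import collections

class SelfAuditingQueue:
    """Sketch of a queue that continually checks its own guarantee:
    incoming messages == outgoing messages + messages still pending."""
    def __init__(self):
        self.pending = collections.deque()
        self.enqueued = 0
        self.dequeued = 0

    def put(self, message):
        self.pending.append(message)
        self.enqueued += 1

    def get(self):
        message = self.pending.popleft()
        self.dequeued += 1
        return message

    def audit(self):
        # In a real system, a violation would raise an alert to operators
        # rather than merely return False.
        return self.enqueued == self.dequeued + len(self.pending)
```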
Human Errors
Humans design and build software systems, and the operators who keep the systems running are also human. Even when they have the best intentions, humans are known to be unreliable. For example, one study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages [ 13 ].
人类设计和构建软件系统,维持系统运行的运维人员也是人。即使怀着最好的意图,人也是出了名的不可靠。例如,一项针对大型互联网服务的研究发现,运维人员的配置错误是导致服务中断的首要原因,而硬件故障(服务器或网络)仅在10%~25%的中断中起作用[13]。
How do we make our systems reliable, in spite of unreliable humans? The best systems combine several approaches:
我们如何在人类不可靠的情况下使我们的系统可靠?最好的系统结合了几种方法:
-
Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.” However, if the interfaces are too restrictive people will work around them, negating their benefit, so this is a tricky balance to get right.
以最小化错误机会的方式设计系统。例如,设计良好的抽象、API和管理界面可以使“正确的事情”变得容易,并避免“错误的事情”。然而,如果接口太严格,人们会绕过它们,抵消其好处,因此这是一个棘手的平衡问题。
-
Decouple the places where people make the most mistakes from the places where they can cause failures. In particular, provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.
将人们经常犯错的地方与可能导致失败的地方分离开来。特别是,提供完整功能的非生产沙盒环境,人们可以在其中安全地探索和实验,使用真实数据,而不会影响真实用户。
-
Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests [ 3 ]. Automated testing is widely used, well understood, and especially valuable for covering corner cases that rarely arise in normal operation.
彻底测试各个层面,从单元测试到整个系统集成测试和手动测试[3]。自动化测试被广泛使用,被很好地理解,并且在覆盖在正常操作中很少出现的边缘情况时尤为有价值。
-
Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure. For example, make it fast to roll back configuration changes, roll out new code gradually (so that any unexpected bugs affect only a small subset of users), and provide tools to recompute data (in case it turns out that the old computation was incorrect).
允许快速轻松地从人为错误中恢复,以最小化故障的影响。例如,快速回滚配置更改,逐步推出新代码(以便任何意外错误只影响一小部分用户),并提供重新计算数据的工具(以防旧计算结果不正确)。
-
Set up detailed and clear monitoring, such as performance metrics and error rates. In other engineering disciplines this is referred to as telemetry . (Once a rocket has left the ground, telemetry is essential for tracking what is happening, and for understanding failures [ 14 ].) Monitoring can show us early warning signals and allow us to check whether any assumptions or constraints are being violated. When a problem occurs, metrics can be invaluable in diagnosing the issue.
建立详细和清晰的监控,例如性能指标和错误率。在其他工程学科中,这被称为遥测。 (一旦火箭离开地面,遥测对于跟踪发生的情况和理解故障至关重要[14]。)监测可以向我们显示早期警告信号,并允许我们检查是否违反任何假设或约束。当问题发生时,指标可以非常有价值地诊断问题。
-
Implement good management practices and training—a complex and important aspect, and beyond the scope of this book.
实施良好的管理实践和培训——这是一个复杂而重要的方面,超出本书的范围。
How Important Is Reliability?
Reliability is not just for nuclear power stations and air traffic control software—more mundane applications are also expected to work reliably. Bugs in business applications cause lost productivity (and legal risks if figures are reported incorrectly), and outages of ecommerce sites can have huge costs in terms of lost revenue and damage to reputation.
可靠性不仅针对核电站和空中交通控制软件——更普通的应用程序同样需要保证可靠运行。商业应用程序中的漏洞会导致生产力下降(如果数字报告不正确,还会面临法律风险),电子商务网站的停机则会造成巨额损失,且声誉受损。
Even in “noncritical” applications we have a responsibility to our users. Consider a parent who stores all their pictures and videos of their children in your photo application [ 15 ]. How would they feel if that database was suddenly corrupted? Would they know how to restore it from a backup?
即使在“非关键”应用程序中,我们也有责任对我们的用户负责。考虑一个将他们所有的孩子的照片和视频存储在你的照片应用程序[15]中的父母。如果该数据库突然损坏,他们会感觉如何?他们知道如何从备份中恢复它吗?
There are situations in which we may choose to sacrifice reliability in order to reduce development cost (e.g., when developing a prototype product for an unproven market) or operational cost (e.g., for a service with a very narrow profit margin)—but we should be very conscious of when we are cutting corners.
在某些情况下,我们可能会选择牺牲可靠性来降低开发成本(例如为一个未经验证的市场开发原型产品)或运营成本(例如利润率极低的服务)——但我们应当非常清楚自己是在何时偷工减料。
Scalability
Even if a system is working reliably today, that doesn’t mean it will necessarily work reliably in the future. One common reason for degradation is increased load: perhaps the system has grown from 10,000 concurrent users to 100,000 concurrent users, or from 1 million to 10 million. Perhaps it is processing much larger volumes of data than it did before.
即使系统今天能够可靠地运行,也不意味着它未来也能保持可靠性。一个常见的退化原因是负载增加:可能系统从10,000个并发用户增长到100,000个并发用户,或者从1百万增长到1千万。也许它正在处理比以前更大量的数据。
Scalability is the term we use to describe a system’s ability to cope with increased load. Note, however, that it is not a one-dimensional label that we can attach to a system: it is meaningless to say “X is scalable” or “Y doesn’t scale.” Rather, discussing scalability means considering questions like “If the system grows in a particular way, what are our options for coping with the growth?” and “How can we add computing resources to handle the additional load?”
可扩展性是我们用来描述系统处理增加负载的能力的术语。然而,需要注意的是,它不是我们可以给系统附加的一维标签:说“X是可扩展的”或“Y不可扩展”是没有意义的。相反,讨论可扩展性意味着考虑问题,例如“如果系统以特定方式增长,我们处理增长的选择是什么?”和“我们如何添加计算资源来处理额外的负载?”
Describing Load
First, we need to succinctly describe the current load on the system; only then can we discuss growth questions (what happens if our load doubles?). Load can be described with a few numbers which we call load parameters . The best choice of parameters depends on the architecture of your system: it may be requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else. Perhaps the average case is what matters for you, or perhaps your bottleneck is dominated by a small number of extreme cases.
首先,我们需要简洁地描述系统上的当前负载;只有这样我们才能讨论增长问题(如果我们的负载翻倍会发生什么?)。负载可以用一些数字来描述,我们称之为负载参数。参数的最佳选择取决于您系统的架构:可能是每秒钟对Web服务器的请求,数据库中读写比例,聊天室中的同时活跃用户数量,缓存命中率或其他内容。也许对您来说平均情况最重要,或者您的瓶颈由少数几个极端情况主导。
To make this idea more concrete, let’s consider Twitter as an example, using data published in November 2012 [ 16 ]. Two of Twitter’s main operations are:
为了使这个想法更具体化,让我们以Twitter为例,使用2012年11月发表的数据[16]。 Twitter的两个主要操作是:
- Post tweet
-
A user can publish a new message to their followers (4.6k requests/sec on average, over 12k requests/sec at peak).
一个用户可以向他们的关注者发布一条新信息(平均每秒4.6k个请求,峰值时每秒超过12k个请求)。
- Home timeline
-
A user can view tweets posted by the people they follow (300k requests/sec).
一个用户可以查看其关注者发布的推文(每秒300k次请求)。
Simply handling 12,000 writes per second (the peak rate for posting tweets) would be fairly easy. However, Twitter’s scaling challenge is not primarily due to tweet volume, but due to fan-out—each user follows many people, and each user is followed by many people. There are broadly two ways of implementing these two operations:
处理每秒12,000次写入(发布推文的峰值速率)本身相当容易。然而,Twitter 的扩展性挑战主要不在于推文量,而在于扇出(fan-out)——每个用户关注许多人,每个用户也被许多人关注。实现这两种操作大致有两种方式:
-
Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users, and merge them (sorted by time). In a relational database like in Figure 1-2 , you could write a query such as:
发布一条推文时,只需将新推文插入到全局的推文集合中。当用户请求其主页时间线时,先查出他们关注的所有人,找到这些用户的所有推文,再将它们合并(按时间排序)。在如图1-2所示的关系型数据库中,你可以编写这样的查询:
SELECT tweets.*, users.*
  FROM tweets
  JOIN users   ON tweets.sender_id    = users.id
  JOIN follows ON follows.followee_id = users.id
  WHERE follows.follower_id = current_user
-
Maintain a cache for each user’s home timeline—like a mailbox of tweets for each recipient user (see Figure 1-3 ). When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. The request to read the home timeline is then cheap, because its result has been computed ahead of time.
为每个用户的主页时间线维护一个缓存——就像为每个接收者用户准备的推文邮箱(参见图1-3)。当用户发布一条推文时,查找所有关注该用户的人,并将新推文插入到他们每个人的主页时间线缓存中。读取主页时间线的请求因此非常廉价,因为其结果已被提前计算好。
The first version of Twitter used approach 1, but the systems struggled to keep up with the load of home timeline queries, so the company switched to approach 2. This works better because the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, and so in this case it’s preferable to do more work at write time and less at read time.
Twitter的第一版使用方法1,但是那时系统很难跟上主页时间轴查询的负载,所以公司转而使用方法2。这个方法效果更好,因为发布推文的平均速率比主页时间轴读取的速率低了近两个数量级,所以在这种情况下,在写入时做更多的工作,而在读取时做更少的工作更好。
However, the downside of approach 2 is that posting a tweet now requires a lot of extra work. On average, a tweet is delivered to about 75 followers, so 4.6k tweets per second become 345k writes per second to the home timeline caches. But this average hides the fact that the number of followers per user varies wildly, and some users have over 30 million followers. This means that a single tweet may result in over 30 million writes to home timelines! Doing this in a timely manner—Twitter tries to deliver tweets to followers within five seconds—is a significant challenge.
然而,方法2的缺点是发布一条推文现在需要大量的额外工作。平均而言,一条推文会被送达约75个关注者,因此每秒4.6k条推文变成了对主页时间线缓存每秒345k次的写入。但这个平均数掩盖了一个事实:每个用户的关注者数量差异极大,有些用户拥有超过3000万关注者。这意味着一条推文可能导致超过3000万次对主页时间线的写入!而要及时完成这项工作——Twitter 试图在五秒内把推文送达关注者——是一个巨大的挑战。
In the example of Twitter, the distribution of followers per user (maybe weighted by how often those users tweet) is a key load parameter for discussing scalability, since it determines the fan-out load. Your application may have very different characteristics, but you can apply similar principles to reasoning about its load.
以 Twitter 为例,用户的关注者分布(可能根据其推文频率加权)是讨论可扩展性的关键负载参数,因为它决定了扇出负载。你的应用程序可能具有非常不同的特点,但你可以应用类似的原则来推理它的负载。
The final twist of the Twitter anecdote: now that approach 2 is robustly implemented, Twitter is moving to a hybrid of both approaches. Most users’ tweets continue to be fanned out to home timelines at the time when they are posted, but a small number of users with a very large number of followers (i.e., celebrities) are excepted from this fan-out. Tweets from any celebrities that a user may follow are fetched separately and merged with that user’s home timeline when it is read, like in approach 1. This hybrid approach is able to deliver consistently good performance. We will revisit this example in Chapter 12 after we have covered some more technical ground.
Twitter 轶事的最后一个转折:既然方法2已经稳健地实现了,Twitter 正在转向两种方法的混合。大多数用户的推文在发布时仍然被扇出到各个主页时间线,但少数拥有大量粉丝的用户(即名人)被排除在这种扇出之外。用户所关注的名人的推文会被单独获取,并在读取时与该用户的主页时间线合并,就像方法1那样。这种混合方法能够持续提供良好的性能。在介绍了更多技术内容之后,我们将在第12章重新回顾这个例子。
Describing Performance
Once you have described the load on your system, you can investigate what happens when the load increases. You can look at it in two ways:
一旦你描述了系统的负载,你就可以调查当负载增加时会发生什么。你可以从两个方面来看待它:
-
When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?
当你增加一个负载参数并保持系统资源(CPU、内存、网络带宽等)不变时,系统的性能会受到怎样的影响?
-
When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?
当你增加负载参数时,如果想保持性能不变,需要增加多少资源?
Both questions require performance numbers, so let’s look briefly at describing the performance of a system.
两个问题都需要性能数字,因此让我们简要地描述一下系统的性能。
In a batch processing system such as Hadoop, we usually care about throughput —the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size. In online systems, what’s usually more important is the service’s response time —that is, the time between a client sending a request and receiving a response.
在像Hadoop这样的批处理系统中,我们通常关心吞吐量——每秒可以处理的记录数量,或者在特定大小的数据集上运行作业所需的总时间。在在线系统中,更重要的是服务的响应时间——即客户端发送请求和接收响应之间的时间。
Latency and response time
Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request (the service time ), it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled—during which it is latent , awaiting service [ 17 ].
延迟和响应时间经常被当作同义词使用,但它们并不相同。响应时间是客户端所看到的:除了实际处理请求的时间(服务时间)之外,它还包括网络延迟和排队延迟。延迟则是请求等待被处理的时长——在此期间它处于潜伏(latent)状态,等待服务[17]。
Even if you only make the same request over and over again, you’ll get a slightly different response time on every try. In practice, in a system handling a variety of requests, the response time can vary a lot. We therefore need to think of response time not as a single number, but as a distribution of values that you can measure.
即使你一遍又一遍地做出相同的请求,每次尝试得到的响应时间也会稍有不同。在实践中,对于处理各种请求的系统,响应时间可能会有很大的差异。因此,我们需要将响应时间看作一组可以测量的值的分布,而不是一个单一的数字。
In Figure 1-4 , each gray bar represents a request to a service, and its height shows how long that request took. Most requests are reasonably fast, but there are occasional outliers that take much longer. Perhaps the slow requests are intrinsically more expensive, e.g., because they process more data. But even in a scenario where you’d think all requests should take the same time, you get variation: random additional latency could be introduced by a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk, mechanical vibrations in the server rack [ 18 ], or many other causes.
在图1-4中,每个灰色条代表一次对服务的请求,其高度表示该请求花费的时间。大多数请求相当快,但偶尔会有耗时长得多的离群值。也许慢请求本质上就更昂贵,例如因为它们要处理更多数据。但即使在你认为所有请求都应该花费相同时间的场景下,也会出现波动:随机的额外延迟可能由上下文切换到后台进程、网络数据包丢失和TCP重传、垃圾回收暂停、缺页中断迫使从磁盘读取、服务器机架中的机械振动[18]或许多其他原因引入。
It’s common to see the average response time of a service reported. (Strictly speaking, the term “average” doesn’t refer to any particular formula, but in practice it is usually understood as the arithmetic mean : given n values, add up all the values, and divide by n .) However, the mean is not a very good metric if you want to know your “typical” response time, because it doesn’t tell you how many users actually experienced that delay.
通常会报告服务的平均响应时间。(严格来说,“平均值”一词并不指任何特定的公式,但在实践中通常被理解为算术平均值:给定 n 个值,把所有值相加,再除以 n。)然而,如果你想知道“典型的”响应时间,平均值并不是一个很好的指标,因为它无法告诉你有多少用户实际经历了那样的延迟。
Usually it is better to use percentiles . If you take your list of response times and sort it from fastest to slowest, then the median is the halfway point: for example, if your median response time is 200 ms, that means half your requests return in less than 200 ms, and half your requests take longer than that.
通常最好使用百分位数。如果您将响应时间列表按从最快到最慢进行排序,则中位数是中间点:例如,如果中位数响应时间为200毫秒,则表示您一半的请求在200毫秒以下返回,一半的请求需要更长时间。
This makes the median a good metric if you want to know how long users typically have to wait: half of user requests are served in less than the median response time, and the other half take longer than the median. The median is also known as the 50th percentile , and sometimes abbreviated as p50 . Note that the median refers to a single request; if the user makes several requests (over the course of a session, or because several resources are included in a single page), the probability that at least one of them is slower than the median is much greater than 50%.
这使得中位数成为一种良好的度量方法,如果您想知道用户通常需要等待多长时间:一半的用户请求在低于中位响应时间的情况下得到服务,另一半需要比中位数更长的时间。中位数也被称为第50个百分位数,并有时缩写为p50。请注意,中位数是指单个请求;如果用户发出多个请求(在会话期间或因为单个页面包括多个资源),则至少有一个请求慢于中位数的概率远大于50%。
In order to figure out how bad your outliers are, you can look at higher percentiles: the 95th , 99th , and 99.9th percentiles are common (abbreviated p95 , p99 , and p999 ). They are the response time thresholds at which 95%, 99%, or 99.9% of requests are faster than that particular threshold. For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more. This is illustrated in Figure 1-4 .
为了弄清异常数据有多严重,你可以查看更高的百分位数:95%、99%和99.9%百分位数是常见的(简写为p95、p99和p999)。它们是响应时间阈值,其中95%、99%或99.9%的请求速度比特定阈值更快。举例来说,如果95%分位数的响应时间为1.5秒,就意味着100个请求中有95个会在1.5秒内完成,而有5个需要1.5秒或更长时间。这在图1-4中有所说明。
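These percentile definitions can be made concrete with a few lines of code (a minimal sketch using the nearest-rank method; real monitoring systems typically use streaming approximations instead of sorting every measurement):
这些百分位数的定义可以用几行代码来具体化(一个使用最近秩方法的最小草图;真实的监控系统通常使用流式近似算法,而不是对每次测量结果排序):

```python
import math

def percentile(sorted_times, p):
    """Nearest-rank percentile: the response time below which
    p% of requests fall, given an already sorted list of times."""
    k = max(0, math.ceil(p / 100 * len(sorted_times)) - 1)
    return sorted_times[k]
```

For 100 response times of 1..100 ms, p50 is 50 ms, p95 is 95 ms, and p99 is 99 ms, matching the informal descriptions above.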
High percentiles of response times, also known as tail latencies , are important because they directly affect users’ experience of the service. For example, Amazon describes response time requirements for internal services in terms of the 99.9th percentile, even though it only affects 1 in 1,000 requests. This is because the customers with the slowest requests are often those who have the most data on their accounts because they have made many purchases—that is, they’re the most valuable customers [ 19 ]. It’s important to keep those customers happy by ensuring the website is fast for them: Amazon has also observed that a 100 ms increase in response time reduces sales by 1% [ 20 ], and others report that a 1-second slowdown reduces a customer satisfaction metric by 16% [ 21 , 22 ].
响应时间的高百分位数(也称尾部延迟)非常重要,因为它们直接影响用户对服务的体验。例如,亚马逊用第99.9百分位数来描述内部服务的响应时间要求,尽管它只影响千分之一的请求。这是因为请求最慢的客户往往是账户上数据最多的客户,因为他们进行过多次购买——也就是说,他们是最有价值的客户[19]。通过确保网站对他们来说足够快来让这些客户保持满意非常重要:亚马逊还观察到,响应时间每增加100毫秒,销售额就会下降1%[20];也有报告称,1秒的延迟会使客户满意度指标下降16%[21, 22]。
On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed too expensive and to not yield enough benefit for Amazon’s purposes. Reducing response times at very high percentiles is difficult because they are easily affected by random events outside of your control, and the benefits are diminishing.
另一方面,优化第99.99个百分位数(最慢的10000个请求中的1个)被认为对于亚马逊的目的来说太昂贵且利益不足。在非常高的百分位数上降低响应时间很困难,因为它们很容易受到无法控制的随机事件的影响,而且效益逐渐变小。
For example, percentiles are often used in service level objectives (SLOs) and service level agreements (SLAs), contracts that define the expected performance and availability of a service. An SLA may state that the service is considered to be up if it has a median response time of less than 200 ms and a 99th percentile under 1 s (if the response time is longer, it might as well be down), and the service may be required to be up at least 99.9% of the time. These metrics set expectations for clients of the service and allow customers to demand a refund if the SLA is not met.
例如,百分位数通常用于服务水平目标(SLO)和服务水平协议(SLA),这些协议定义了服务的预期性能和可用性。SLA 可以说明,如果服务的中位响应时间小于 200 毫秒且 99 百分位数小于 1 秒,则认为该服务处于上线状态(如果响应时间更长,则可以认为已经下线),并且该服务可能需要保持至少 99.9% 的可用性。这些指标为服务的客户设定了期望,并允许客户在未达到 SLA 的情况下要求退款。
Queueing delays often account for a large part of the response time at high percentiles. As a server can only process a small number of things in parallel (limited, for example, by its number of CPU cores), it only takes a small number of slow requests to hold up the processing of subsequent requests—an effect sometimes known as head-of-line blocking . Even if those subsequent requests are fast to process on the server, the client will see a slow overall response time due to the time waiting for the prior request to complete. Due to this effect, it is important to measure response times on the client side.
排队延迟往往是高百分位响应时间的重要组成部分。由于服务器只能并行处理少量任务(例如受CPU核心数限制),只需少量缓慢的请求就足以阻塞后续请求的处理——这种效应有时被称为队头阻塞(head-of-line blocking)。即使后续请求在服务器上处理得很快,由于需要等待先前请求完成,客户端看到的总体响应时间仍然很慢。正因为这种效应,在客户端测量响应时间非常重要。
When generating load artificially in order to test the scalability of a system, the load-generating client needs to keep sending requests independently of the response time. If the client waits for the previous request to complete before sending the next one, that behavior has the effect of artificially keeping the queues shorter in the test than they would be in reality, which skews the measurements [ 23 ].
当人工产生负载以测试系统的可扩展性时,产生负载的客户端需要独立于响应时间继续发送请求。如果客户端在发送下一个请求之前等待前一个请求完成,这种行为会在测试中人为地使队列比实际上要短,从而扭曲测量结果[23]。
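To make that skew concrete, here is a minimal single-server queue simulation (an illustrative sketch, not from the book): an open-loop client sends requests on a fixed schedule regardless of responses and sees queueing delay build up, while a closed-loop client that waits for each response before sending the next measures a flat response time and hides the backlog:

为了更直观地说明这种偏差,下面是一个极简的单服务器排队模拟(示意性草图,非本书内容):开环客户端按固定节奏发送请求而不理会响应,能观察到排队延迟不断累积;而闭环客户端在发送下一个请求前等待上一个响应完成,测得的响应时间是平坦的,掩盖了积压:

```python
def simulate(arrival_times, service_time):
    """Single FIFO server: return response time (queueing wait + service) per request."""
    response_times = []
    server_free_at = 0.0
    for t in arrival_times:
        start = max(t, server_free_at)      # wait while the server is busy
        server_free_at = start + service_time
        response_times.append(server_free_at - t)
    return response_times

SERVICE = 1.0  # seconds per request; the server handles at most 1 req/s

# Open-loop: 2 requests/sec on a fixed schedule, independent of responses.
open_rts = simulate([i * 0.5 for i in range(10)], SERVICE)

# Closed-loop: the next request is sent only after the previous one completes.
closed_rts = simulate([i * SERVICE for i in range(10)], SERVICE)

# The open-loop client sees response times climb (queue growing);
# the closed-loop client sees a constant 1.0 s and never observes the queue.
```

In this toy run the open-loop response times climb from 1.0 s to 5.5 s, whereas every closed-loop measurement is exactly 1.0 s, which is the measurement skew reference [23] warns about.

在这个小实验中,开环响应时间从1.0秒攀升到5.5秒,而闭环测量值全部恰好为1.0秒——这正是文献[23]所警告的测量偏差。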
Approaches for Coping with Load
Now that we have discussed the parameters for describing load and metrics for measuring performance, we can start discussing scalability in earnest: how do we maintain good performance even when our load parameters increase by some amount?
现在我们已经讨论了描述负载的参数和测量性能的指标,我们可以认真讨论可扩展性了:即使我们的负载参数增加了一定量,我们如何保持良好的性能?
An architecture that is appropriate for one level of load is unlikely to cope with 10 times that load. If you are working on a fast-growing service, it is therefore likely that you will need to rethink your architecture on every order of magnitude load increase—or perhaps even more often than that.
如果一种架构适合某个负载水平,那么它很可能无法应对10倍于此的负载。如果你在开发一个快速增长的服务,那么在每一次负载增加一个数量级时,你很可能需要重新思考你的架构,甚至需要更频繁地重构。
People often talk of a dichotomy between scaling up ( vertical scaling , moving to a more powerful machine) and scaling out ( horizontal scaling , distributing the load across multiple smaller machines). Distributing load across multiple machines is also known as a shared-nothing architecture. A system that can run on a single machine is often simpler, but high-end machines can become very expensive, so very intensive workloads often can’t avoid scaling out. In reality, good architectures usually involve a pragmatic mixture of approaches: for example, using several fairly powerful machines can still be simpler and cheaper than a large number of small virtual machines.
人们经常谈论纵向扩展(scaling up,即垂直扩展,换用更强大的机器)和横向扩展(scaling out,即水平扩展,将负载分布到多台较小的机器上)之间的二分法。跨多台机器分配负载也被称为无共享(shared-nothing)架构。能在单台机器上运行的系统通常更简单,但高端机器可能非常昂贵,因此非常密集的工作负载往往无法避免横向扩展。实际上,好的架构通常包含务实的混合方案:例如,使用几台相当强大的机器可能仍然比大量小型虚拟机更简单、更便宜。
Some systems are elastic , meaning that they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually (a human analyzes the capacity and decides to add more machines to the system). An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises (see “Rebalancing Partitions” ).
一些系统是弹性的,意味着它们在检测到负载增加时可以自动添加计算资源,而其他系统则需要手动扩容(由人员分析容量并决定是否需要添加更多机器)。当负载高度不可预测时,弹性系统可能非常有用,但手动扩容的系统更简单,可能会有更少的操作意外(请参见“重新平衡分区”)。
While distributing stateless services across multiple machines is fairly straightforward, taking stateful data systems from a single node to a distributed setup can introduce a lot of additional complexity. For this reason, common wisdom until recently was to keep your database on a single node (scale up) until scaling cost or high-availability requirements forced you to make it distributed.
将无状态服务分发到多台机器上相对来说相当简单,但将有状态数据系统从单节点迁移至分布式环境则可能引入大量额外复杂性。因此,直到不久之前,通常被认为明智之举是将数据库保持在单一节点上(纵向扩展)直至扩展成本或高可用性需求迫使你将其变为分布式环境。
As the tools and abstractions for distributed systems get better, this common wisdom may change, at least for some kinds of applications. It is conceivable that distributed data systems will become the default in the future, even for use cases that don’t handle large volumes of data or traffic. Over the course of the rest of this book we will cover many kinds of distributed data systems, and discuss how they fare not just in terms of scalability, but also ease of use and maintainability.
随着分布式系统的工具和抽象层的不断完善,这些普遍的认识可能会改变,至少对某些应用而言是如此。未来,分布式数据系统有可能成为默认选择,即使是用于不处理大量数据或流量的案例。在本书的其余部分,我们将涵盖许多种分布式数据系统,并讨论它们在可扩展性、易用性和可维护性方面的表现。
The architecture of systems that operate at large scale is usually highly specific to the application—there is no such thing as a generic, one-size-fits-all scalable architecture (informally known as magic scaling sauce ). The problem may be the volume of reads, the volume of writes, the volume of data to store, the complexity of the data, the response time requirements, the access patterns, or (usually) some mixture of all of these plus many more issues.
大规模运行的系统,其架构通常高度特定于应用——不存在通用的、一刀切的可扩展架构(非正式地称为魔法伸缩酱,magic scaling sauce)。问题可能是读取量、写入量、待存储的数据量、数据的复杂度、响应时间要求、访问模式,或者(通常是)所有这些因素加上更多问题的混合。
For example, a system that is designed to handle 100,000 requests per second, each 1 kB in size, looks very different from a system that is designed for 3 requests per minute, each 2 GB in size—even though the two systems have the same data throughput.
例如,一个被设计为每秒处理100,000个1kB请求的系统,看起来与一个被设计为每分钟处理3个2GB请求的系统非常不同,即使这两个系统具有相同的数据吞吐量。
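A quick back-of-the-envelope check (using decimal units, 1 kB = 1,000 bytes and 1 GB = 10⁹ bytes) confirms that the two hypothetical systems move data at the same rate:

一个粗略的估算(采用十进制单位,1 kB = 1,000字节,1 GB = 10⁹字节)可以验证这两个假想系统的数据吞吐量确实相同:

```python
# 100,000 requests/sec × 1 kB each
small_requests_bps = 100_000 * 1_000

# 3 requests/min × 2 GB each, converted to bytes per second
large_requests_bps = 3 * 2 * 10**9 // 60

# Both come to 100 MB/s of data throughput, yet the architectures
# needed to serve them would look completely different.
```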
An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare—the load parameters. If those assumptions turn out to be wrong, the engineering effort for scaling is at best wasted, and at worst counterproductive. In an early-stage startup or an unproven product it’s usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load.
一个能针对特定应用良好扩展的架构,是围绕着"哪些操作常见、哪些操作罕见"的假设(即负载参数)构建的。如果这些假设最终被证明是错误的,那么为扩展所做的工程投入轻则白费,重则适得其反。在早期创业公司或尚未验证的产品中,能够快速迭代产品功能,通常比针对某种假想的未来负载进行扩展更为重要。
Even though they are specific to a particular application, scalable architectures are nevertheless usually built from general-purpose building blocks, arranged in familiar patterns. In this book we discuss those building blocks and patterns.
尽管可扩展架构是针对特定应用而设计的,但通常是由常见的通用构建块组成的,排列成熟悉的模式。在本书中,我们将讨论这些构建块和模式。
Maintainability
It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features.
众所周知,软件的大部分成本并不在最初的开发阶段,而在于持续的维护——修复缺陷、保持系统正常运行、排查故障、适配新平台、为新用例而修改、偿还技术债务以及添加新功能。
Yet, unfortunately, many people working on software systems dislike maintenance of so-called legacy systems—perhaps it involves fixing other people’s mistakes, or working with platforms that are now outdated, or systems that were forced to do things they were never intended for. Every legacy system is unpleasant in its own way, and so it is difficult to give general recommendations for dealing with them.
然而,不幸的是,许多从事软件系统工作的人不喜欢所谓的遗留系统维护——或许这涉及修复他人的错误,或者使用已经过时的平台,或者系统被迫执行其从未打算过的功能。每个遗留系统都有其自己的不愉快之处,因此很难对处理它们给出一般性建议。
However, we can and should design software in such a way that it will hopefully minimize pain during maintenance, and thus avoid creating legacy software ourselves. To this end, we will pay particular attention to three design principles for software systems:
然而,我们可以并且应该设计软件,使其在维护过程中尽可能减少痛苦,从而避免自己创建遗留软件。 为此,我们将特别注意软件系统的三个设计原则:
- Operability
-
Make it easy for operations teams to keep the system running smoothly.
让运维团队更轻松地保持系统运行流畅。
- Simplicity
-
Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system. (Note this is not the same as simplicity of the user interface.)
让新工程师更容易理解系统,尽量从系统中消除复杂性。(注意这与用户界面的简单性不同。)
- Evolvability
-
Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility , modifiability , or plasticity .
使工程师在未来能够轻松地修改系统,并随着需求的变化使其适应未预料到的用例。这也被称为可扩展性(extensibility)、可修改性(modifiability)或可塑性(plasticity)。
As previously with reliability and scalability, there are no easy solutions for achieving these goals. Rather, we will try to think about systems with operability, simplicity, and evolvability in mind.
与之前讨论的可靠性和可扩展性一样,实现这些目标并没有简单的解决方案。不过,我们会尝试在设计系统时始终把可操作性、简单性和可演化性放在心上。
Operability: Making Life Easy for Operations
It has been suggested that “good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations” [ 12 ]. While some aspects of operations can and should be automated, it is still up to humans to set up that automation in the first place and to make sure it’s working correctly.
有人指出,"良好的运维往往可以绕开劣质(或不完整)软件的局限性,而良好的软件在糟糕的运维下也无法可靠运行"[12]。尽管运维的某些方面可以而且应该自动化,但首先搭建好这些自动化机制并确保其正常运行,仍然要靠人来完成。
Operations teams are vital to keeping a software system running smoothly. A good operations team typically is responsible for the following, and more [ 29 ]:
运维团队对于保持软件系统平稳运行非常重要。一个优秀的运维团队通常负责以下工作,以及更多其他事项 [29]:
-
Monitoring the health of the system and quickly restoring service if it goes into a bad state
监控系统的健康状况,如果系统状态不佳,迅速恢复服务。
-
Tracking down the cause of problems, such as system failures or degraded performance
追踪问题的原因,例如系统故障或性能降低。
-
Keeping software and platforms up to date, including security patches
保持软件和平台更新,包括安全补丁。
-
Keeping tabs on how different systems affect each other, so that a problematic change can be avoided before it causes damage
监控不同系统之间的相互影响,以避免出现可能导致破坏的问题性变化。
-
Anticipating future problems and solving them before they occur (e.g., capacity planning)
预料未来的问题并在它们发生之前解决它们(例如,容量规划)。
-
Establishing good practices and tools for deployment, configuration management, and more
建立好的实践和工具,用于部署、配置管理等方面。
-
Performing complex maintenance tasks, such as moving an application from one platform to another
执行复杂的维护任务,例如将一个应用程序从一个平台移动到另一个平台。
-
Maintaining the security of the system as configuration changes are made
随着配置变化的发生,维护系统的安全性。
-
Defining processes that make operations predictable and help keep the production environment stable
定义使操作可预测并帮助保持生产环境稳定的流程。
-
Preserving the organization’s knowledge about the system, even as individual people come and go
即使有人员来去,也保留组织中关于系统的知识。
Good operability means making routine tasks easy, allowing the operations team to focus their efforts on high-value activities. Data systems can do various things to make routine tasks easy, including:
良好的可操作性意味着使日常任务变得更加容易,使运营团队可以将精力集中在高价值的活动上。数据系统可以采取各种方式来简化日常任务,其中包括:
-
Providing visibility into the runtime behavior and internals of the system, with good monitoring
通过良好的监控,提供对系统运行时行为和内部状况的可见性。
-
Providing good support for automation and integration with standard tools
为自动化提供良好支持并与标准工具进行集成。
-
Avoiding dependency on individual machines (allowing machines to be taken down for maintenance while the system as a whole continues running uninterrupted)
避免依赖特定的机器(使机器可以在维护期间关闭,而整个系统仍能无间断地运行)。
-
Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”)
提供良好的文档和易于理解的操作模型(“如果我做 X,就会发生 Y”)。
-
Providing good default behavior, but also giving administrators the freedom to override defaults when needed
提供良好的默认行为,但也给管理员在需要时覆盖默认值的自由。
-
Self-healing where appropriate, but also giving administrators manual control over the system state when needed
在必要时进行自愈,但也在需要时赋予管理员手动控制系统状态的能力。
-
Exhibiting predictable behavior, minimizing surprises
展示可预测的行为,最小化惊喜。
Simplicity: Managing Complexity
Small software projects can have delightfully simple and expressive code, but as projects get larger, they often become very complex and difficult to understand. This complexity slows down everyone who needs to work on the system, further increasing the cost of maintenance. A software project mired in complexity is sometimes described as a big ball of mud [ 30 ].
小型软件项目可以拥有简单而富有表现力的代码,但随着项目越来越大,它们往往会变得非常复杂、难以理解。这种复杂性会拖慢所有需要在该系统上工作的人,进一步增加维护成本。一个深陷复杂泥潭的软件项目有时被称为大泥球(big ball of mud)[30]。
There are various possible symptoms of complexity: explosion of the state space, tight coupling of modules, tangled dependencies, inconsistent naming and terminology, hacks aimed at solving performance problems, special-casing to work around issues elsewhere, and many more. Much has been said on this topic already [ 31 , 32 , 33 ].
复杂性有各种可能的症状:状态空间爆炸、模块间紧耦合、纠结的依赖关系、不一致的命名和术语、为解决性能问题而打的各种补丁(hack)、为绕开其他地方的问题而做的特殊处理,等等。关于这个话题,已有许多讨论[31,32,33]。
When complexity makes maintenance hard, budgets and schedules are often overrun. In complex software, there is also a greater risk of introducing bugs when making a change: when the system is harder for developers to understand and reason about, hidden assumptions, unintended consequences, and unexpected interactions are more easily overlooked. Conversely, reducing complexity greatly improves the maintainability of software, and thus simplicity should be a key goal for the systems we build.
当复杂性使得维护困难时,预算和计划常常会超支。在复杂的软件中,更容易引入错误风险:当开发人员难以理解和推理系统时,隐藏的假设、意外的后果和意外的交互就很容易被忽视。相反,减少复杂性极大地提高了软件的可维护性,因此简单性应成为我们构建系统的关键目标。
Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity. Moseley and Marks [ 32 ] define complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.
简化系统不一定意味着减少其功能;这也可能意味着消除偶然复杂性。Moseley和Marks [32]将复杂性定义为偶然的,如果它不是软件解决的问题本身固有的(如用户所看到的),而仅仅是由实现引起的。
One of the best tools we have for removing accidental complexity is abstraction . A good abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand façade. A good abstraction can also be used for a wide range of different applications. Not only is this reuse more efficient than reimplementing a similar thing multiple times, but it also leads to higher-quality software, as quality improvements in the abstracted component benefit all applications that use it.
消除偶然复杂性的最好工具之一是抽象(abstraction)。一个好的抽象可以将大量实现细节隐藏在干净、易于理解的外观之下。一个好的抽象也可以用于各种不同的应用。这种复用不仅比多次重新实现类似的东西更有效率,而且还能产出更高质量的软件,因为对被抽象组件的质量改进会使所有使用它的应用受益。
For example, high-level programming languages are abstractions that hide machine code, CPU registers, and syscalls. SQL is an abstraction that hides complex on-disk and in-memory data structures, concurrent requests from other clients, and inconsistencies after crashes. Of course, when programming in a high-level language, we are still using machine code; we are just not using it directly , because the programming language abstraction saves us from having to think about it.
例如,高级编程语言是抽象层,它们隐藏了机器码、CPU寄存器和系统调用。SQL是一个抽象层,它隐藏了复杂的磁盘和内存数据结构,其他客户端的并发请求,以及崩溃后的不一致性。当然,当使用高级语言编程时,我们仍然使用机器码;我们只是不直接使用它,因为编程语言的抽象层帮助我们不必考虑它。
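As a small illustration (not from the book, using Python's bundled sqlite3 module): the query below states *what* result we want, while index traversal, on-disk page layout, and crash recovery remain hidden behind the SQL abstraction.

作为一个小示例(非本书内容,使用Python自带的sqlite3模块):下面的查询只声明我们想要*什么*结果,而索引遍历、磁盘页布局和崩溃恢复都隐藏在SQL抽象层之后。

```python
import sqlite3

# An in-memory database; the same declarative query would work unchanged
# regardless of what storage structures the engine uses underneath.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, amount_usd REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
)

# We state what we want, not how to scan rows or handle concurrent
# writers -- that complexity is hidden behind the abstraction.
rows = conn.execute(
    "SELECT customer, SUM(amount_usd) FROM purchases"
    " GROUP BY customer ORDER BY customer"
).fetchall()
# rows == [('alice', 37.5), ('bob', 12.5)]
```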
However, finding good abstractions is very hard. In the field of distributed systems, although there are many good algorithms, it is much less clear how we should be packaging them into abstractions that help us keep the complexity of the system at a manageable level.
然而,寻找好的抽象概念是非常困难的。在分布式系统领域,虽然有许多好的算法,但我们如何将它们打包成有助于我们将系统的复杂性保持在可控范围内的抽象概念,却不太清楚。
Throughout this book, we will keep our eyes open for good abstractions that allow us to extract parts of a large system into well-defined, reusable components.
在本书中,我们将密切关注好的抽象,以便将大系统的部分提取到明确定义的可重用组件中。
Evolvability: Making Change Easy
It’s extremely unlikely that your system’s requirements will remain unchanged forever. They are much more likely to be in constant flux: you learn new facts, previously unanticipated use cases emerge, business priorities change, users request new features, new platforms replace old platforms, legal or regulatory requirements change, growth of the system forces architectural changes, etc.
你的系统需求永远保持不变,这种可能性微乎其微。它们更可能处于持续变化之中:你了解到新的事实、出现了之前未预料到的用例、业务优先级发生变化、用户要求新功能、新平台取代旧平台、法律或监管要求发生变化、系统的增长迫使架构发生改变,等等。
In terms of organizational processes, Agile working patterns provide a framework for adapting to change. The Agile community has also developed technical tools and patterns that are helpful when developing software in a frequently changing environment, such as test-driven development (TDD) and refactoring.
在组织流程方面,敏捷工作模式提供了一个适应变化的框架。敏捷社区还开发了技术工具和模式,在频繁变化的环境中开发软件非常有帮助,例如测试驱动开发(TDD)和重构。
Most discussions of these Agile techniques focus on a fairly small, local scale (a couple of source code files within the same application). In this book, we search for ways of increasing agility on the level of a larger data system, perhaps consisting of several different applications or services with different characteristics. For example, how would you “refactor” Twitter’s architecture for assembling home timelines ( “Describing Load” ) from approach 1 to approach 2?
大多数关于这些敏捷技术的讨论都聚焦于相当小的、局部的范围(同一应用程序中的几个源代码文件)。在本书中,我们寻求在更大的数据系统层面上提高敏捷性的方法,这些系统可能由多个具有不同特点的应用程序或服务组成。例如,你会如何把Twitter组装首页时间线的架构("描述负载")从方法1"重构"为方法2?
The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions: simple and easy-to-understand systems are usually easier to modify than complex ones. But since this is such an important idea, we will use a different word to refer to agility on a data system level: evolvability [ 34 ].
修改数据系统并使其适应不断变化需求的难易程度,与系统的简单性和抽象性密切相关:简单、易于理解的系统通常比复杂的系统更容易修改。但由于这是一个非常重要的概念,我们将用一个不同的词来指代数据系统层面的敏捷性:可演化性(evolvability)[34]。
Summary
In this chapter, we have explored some fundamental ways of thinking about data-intensive applications. These principles will guide us through the rest of the book, where we dive into deep technical detail.
在这章中,我们探讨了一些关于数据密集型应用的基本思考方式。这些原则将指导我们进入本书的其余部分,深入了解技术细节。
An application has to meet various requirements in order to be useful. There are functional requirements (what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways), and some nonfunctional requirements (general properties like security, reliability, compliance, scalability, compatibility, and maintainability). In this chapter we discussed reliability, scalability, and maintainability in detail.
一个应用程序必须满足各种要求才能发挥作用。这些要求包括功能需求(它应该做什么,例如允许以不同方式存储、检索、搜索和处理数据),以及一些非功能性需求(例如安全性、可靠性、合规性、可扩展性、兼容性和可维护性等一般属性)。在本章中,我们详细讨论了可靠性、可扩展性和可维护性。
Reliability means making systems work correctly, even when faults occur. Faults can be in hardware (typically random and uncorrelated), software (bugs are typically systematic and hard to deal with), and humans (who inevitably make mistakes from time to time). Fault-tolerance techniques can hide certain types of faults from the end user.
可靠性意味着即使出现故障,系统也能正常工作。故障可能是硬件问题(通常是随机且不相关的),软件问题(错误通常是系统性的,并且很难处理)以及人为问题(人们不可避免地偶尔会犯错误)。容错技术可以隐藏某些类型的故障,使最终用户不受其影响。
Scalability means having strategies for keeping performance good, even when load increases. In order to discuss scalability, we first need ways of describing load and performance quantitatively. We briefly looked at Twitter’s home timelines as an example of describing load, and response time percentiles as a way of measuring performance. In a scalable system, you can add processing capacity in order to remain reliable under high load.
可扩展性意味着拥有保持良好性能的策略,即使负载增加。为了讨论可扩展性,我们首先需要用数量化的方式描述负载和性能。我们简要介绍了Twitter的主页时间线作为描述负载的示例,以及响应时间百分位作为衡量性能的方式。在可扩展系统中,您可以添加处理能力以保持高负载下的可靠性。
Maintainability has many facets, but in essence it’s about making life better for the engineering and operations teams who need to work with the system. Good abstractions can help reduce complexity and make the system easier to modify and adapt for new use cases. Good operability means having good visibility into the system’s health, and having effective ways of managing it.
可维护性有许多方面,但本质上就是为工程和操作团队提供更好的工作体验。良好的抽象可以帮助降低复杂性,使系统更易于修改和适应新的用例。良好的可操作性意味着对系统的健康状况有良好的可见性,并且有有效的管理方式。
There is unfortunately no easy fix for making applications reliable, scalable, or maintainable. However, there are certain patterns and techniques that keep reappearing in different kinds of applications. In the next few chapters we will take a look at some examples of data systems and analyze how they work toward those goals.
很遗憾,没有简单的解决方案使应用程序可靠、可扩展或易于维护。 然而,某些模式和技术在不同类型的应用程序中不断出现。在接下来的几章中,我们将看一些数据系统的示例,并分析它们如何朝着这些目标努力。
Later in the book, in Part III , we will look at patterns for systems that consist of several components working together, such as the one in Figure 1-1 .
书的后面,在第三部分,我们将研究由几个组件共同工作的系统模式,例如图1-1所示的模式。
Footnotes
i Defined in “Approaches for Coping with Load” .
i 定义见"应对负载的方法"。
ii A term borrowed from electronic engineering, where it describes the number of logic gate inputs that are attached to another gate’s output. The output needs to supply enough current to drive all the attached inputs. In transaction processing systems, we use it to describe the number of requests to other services that we need to make in order to serve one incoming request.
"ii"这个术语源自电子工程,用于描述连接到另一个门输出的逻辑门输入数量。输出需要提供足够的电流来驱动所有连接的输入。在交易处理系统中,我们用它来描述为了处理一个传入请求,需要向其他服务发出的请求数量。
iii In an ideal world, the running time of a batch job is the size of the dataset divided by the throughput. In practice, the running time is often longer, due to skew (data not being spread evenly across worker processes) and needing to wait for the slowest task to complete.
iii 在理想情况下,批处理作业的运行时间是数据集大小除以吞吐量。而实际上,由于数据倾斜(数据没有均匀分布在各工作进程中)以及需要等待最慢的任务完成,运行时间往往更长。
References
[ 1 ] Michael Stonebraker and Uğur Çetintemel: “ ‘One Size Fits All’: An Idea Whose Time Has Come and Gone ,” at 21st International Conference on Data Engineering (ICDE), April 2005.
[1] Michael Stonebraker和Uğur Çetintemel:""一刀切":一个时代已然过去的想法",发表于2005年4月第21届国际数据工程会议(ICDE)。
[ 2 ] Walter L. Heimerdinger and Charles B. Weinstock: “ A Conceptual Framework for System Fault Tolerance ,” Technical Report CMU/SEI-92-TR-033, Software Engineering Institute, Carnegie Mellon University, October 1992.
[2] Walter L. Heimerdinger和Charles B. Weinstock:“系统容错性的概念框架”,技术报告CMU/SEI-92-TR-033,软件工程研究所,卡内基梅隆大学,1992年10月。
[ 3 ] Ding Yuan, Yu Luo, Xin Zhuang, et al.: “ Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems ,” at 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2014.
"简单的测试可以避免大部分关键故障:对分布式数据密集型系统生产故障的分析",丁原,骆煜,庄欣等,发表于2014年10月第11届USENIX操作系统设计与实现研讨会(OSDI)。"
[ 4 ] Yury Izrailevsky and Ariel Tseitlin: “ The Netflix Simian Army ,” techblog.netflix.com , July 19, 2011.
[4] Yury Izrailevsky和Ariel Tseitlin:"Netflix的猴子军团",techblog.netflix.com,2011年7月19日。
[ 5 ] Daniel Ford, François Labelle, Florentina I. Popovici, et al.: “ Availability in Globally Distributed Storage Systems ,” at 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2010.
[5] Daniel Ford, François Labelle, Florentina I. Popovici等:“全球分布式存储系统中的可用性”,发表于2010年10月第9届USENIX操作系统设计与实现研讨会(OSDI)。
[ 6 ] Brian Beach: “ Hard Drive Reliability Update – Sep 2014 ,” backblaze.com , September 23, 2014.
[6] Brian Beach:“硬盘可靠性更新 - 2014年9月”,backblaze.com,2014年9月23日。
[ 7 ] Laurie Voss: “ AWS: The Good, the Bad and the Ugly ,” blog.awe.sm , December 18, 2012.
[7] Laurie Voss:“AWS:优点、缺点和丑陋”,blog.awe.sm, 2012年12月18日。
[ 8 ] Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, et al.: “ What Bugs Live in the Cloud? ,” at 5th ACM Symposium on Cloud Computing (SoCC), November 2014. doi:10.1145/2670979.2670986
[8] Haryadi S. Gunawi、Mingzhe Hao、Tanakorn Leesatapornwongsa等:"云中有哪些缺陷?",发表于2014年11月第5届ACM云计算研讨会(SoCC)。doi:10.1145/2670979.2670986
[ 9 ] Nelson Minar: “ Leap Second Crashes Half the Internet ,” somebits.com , July 3, 2012.
[9] Nelson Minar:"闰秒使半个互联网崩溃",somebits.com,2012年7月3日。
[ 10 ] Amazon Web Services: “ Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region ,” aws.amazon.com , April 29, 2011.
[10] 亚马逊网络服务:"美国东部地区Amazon EC2和Amazon RDS服务中断的总结",aws.amazon.com,2011年4月29日。
[ 11 ] Richard I. Cook: “ How Complex Systems Fail ,” Cognitive Technologies Laboratory, April 2000.
[11] Richard I. Cook: “复杂系统的失效”,认知技术实验室,2000年4月。
[ 12 ] Jay Kreps: “ Getting Real About Distributed System Reliability ,” blog.empathybox.com , March 19, 2012.
[12] Jay Kreps:「认真对待分布式系统可靠性」,blog.empathybox.com,2012年3月19日。
[ 13 ] David Oppenheimer, Archana Ganapathi, and David A. Patterson: “ Why Do Internet Services Fail, and What Can Be Done About It? ,” at 4th USENIX Symposium on Internet Technologies and Systems (USITS), March 2003.
[13] David Oppenheimer,Archana Ganapathi和David A. Patterson:「为什么互联网服务会出现故障,以及如何解决?」,发表于第四届USENIX互联网技术与系统研讨会(USITS),2003年3月。
[ 14 ] Nathan Marz: “ Principles of Software Engineering, Part 1 ,” nathanmarz.com , April 2, 2013.
[14] Nathan Marz:“软件工程原则,第一部分”,nathanmarz.com,2013年4月2日。
[ 15 ] Michael Jurewitz: “ The Human Impact of Bugs ,” jury.me , March 15, 2013.
[15] Michael Jurewitz:"程序错误对人的影响",jury.me,2013年3月15日。
[ 16 ] Raffi Krikorian: “ Timelines at Scale ,” at QCon San Francisco , November 2012.
[16] Raffi Krikorian:"大规模时间线",发表于2012年11月的QCon San Francisco。
[ 17 ] Martin Fowler: Patterns of Enterprise Application Architecture . Addison Wesley, 2002. ISBN: 978-0-321-12742-6
[17] Martin Fowler:《企业应用架构模式》,Addison-Wesley,2002年。ISBN:978-0-321-12742-6
[ 18 ] Kelly Sommers: “ After all that run around, what caused 500ms disk latency even when we replaced physical server? ” twitter.com , November 13, 2014.
[18] Kelly Sommers:"折腾了这么久,为什么更换了物理服务器后磁盘延迟仍有500毫秒?",twitter.com,2014年11月13日。
[ 19 ] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al.: “ Dynamo: Amazon’s Highly Available Key-Value Store ,” at 21st ACM Symposium on Operating Systems Principles (SOSP), October 2007.
[19] Giuseppe DeCandia、Deniz Hastorun、Madan Jampani等:"Dynamo:亚马逊的高可用键值存储",发表于2007年10月第21届ACM操作系统原理研讨会(SOSP)。
[ 20 ] Greg Linden: “ Make Data Useful ,” slides from presentation at Stanford University Data Mining class (CS345), December 2006.
[20] Greg Linden:"让数据有用",2006年12月斯坦福大学数据挖掘课程(CS345)演讲幻灯片。
[ 21 ] Tammy Everts: “ The Real Cost of Slow Time vs Downtime ,” webperformancetoday.com , November 12, 2014.
[21] Tammy Everts: "慢响应时间与停机时间的实际代价," webperformancetoday.com, 2014年11月12日。
[ 22 ] Jake Brutlag: “ Speed Matters for Google Web Search ,” googleresearch.blogspot.co.uk , June 22, 2009.
[22] Jake Brutlag:"速度对Google网络搜索至关重要",googleresearch.blogspot.co.uk,2009年6月22日。
[ 23 ] Tyler Treat: “ Everything You Know About Latency Is Wrong ,” bravenewgeek.com , December 12, 2015.
[23] Tyler Treat:"你对延迟的所有认识都是错的",bravenewgeek.com,2015年12月12日。
[ 24 ] Jeffrey Dean and Luiz André Barroso: “ The Tail at Scale ,” Communications of the ACM , volume 56, number 2, pages 74–80, February 2013. doi:10.1145/2408776.2408794
"大规模下的尾部效应" Jeffrey Dean 和 Luiz André Barroso: 通信ACM,第56卷,第2期,第74-80页,2013年2月 DOI: 10.1145/2408776.2408794
[ 25 ] Graham Cormode, Vladislav Shkapenyuk, Divesh Srivastava, and Bojian Xu: “ Forward Decay: A Practical Time Decay Model for Streaming Systems ,” at 25th IEEE International Conference on Data Engineering (ICDE), March 2009.
[25] Graham Cormode、Vladislav Shkapenyuk、Divesh Srivastava和Bojian Xu:"前向衰减:流式系统的实用时间衰减模型",发表于2009年3月第25届IEEE国际数据工程会议(ICDE)。
[ 26 ] Ted Dunning and Otmar Ertl: “ Computing Extremely Accurate Quantiles Using t-Digests ,” github.com , March 2014.
[26] Ted Dunning和Otmar Ertl: “使用t-Digest计算极其准确的分位数”,github.com,2014年3月。
[ 27 ] Gil Tene: “ HdrHistogram ,” hdrhistogram.org .
"[27] Gil Tene: “HdrHistogram,” hdrhistogram.org." "[27] Gil Tene: “HdrHistogram”,hdrhistogram.org。"
[ 28 ] Baron Schwartz: “ Why Percentiles Don’t Work the Way You Think ,” vividcortex.com , December 7, 2015.
[28] Baron Schwartz:“为什么百分位数不像你想象的那样工作”,vividcortex.com,2015年12月7日。
[ 29 ] James Hamilton: “ On Designing and Deploying Internet-Scale Services ,” at 21st Large Installation System Administration Conference (LISA), November 2007.
[29] 詹姆斯·汉密尔顿: “关于设计和部署互联网规模服务”,于第21届大型安装系统管理会议(LISA),2007年11月。
[ 30 ] Brian Foote and Joseph Yoder: “ Big Ball of Mud ,” at 4th Conference on Pattern Languages of Programs (PLoP), September 1997.
"[30] Brian Foote和Joseph Yoder: “Big Ball of Mud”,于 1997年9月的第四届程序模式语言会议(PLoP)上。"
[ 31 ] Frederick P Brooks: “No Silver Bullet – Essence and Accident in Software Engineering,” in The Mythical Man-Month , Anniversary edition, Addison-Wesley, 1995. ISBN: 978-0-201-83595-3
[31] Frederick P. Brooks:"没有银弹:软件工程的本质与偶然",收录于《人月神话》周年纪念版,Addison-Wesley,1995年。ISBN:978-0-201-83595-3
[ 32 ] Ben Moseley and Peter Marks: “ Out of the Tar Pit ,” at BCS Software Practice Advancement (SPA), 2006.
[32] 本·莫斯利和彼得·马克斯:“走出沥青坑”,在BCS软件实践推进(SPA),2006年。
[ 33 ] Rich Hickey: “ Simple Made Easy ,” at Strange Loop , September 2011.
[33] Rich Hickey:“简单易用”,Strange Loop,2011年9月。
[ 34 ] Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson: “ Analyzing Software Evolvability ,” at 32nd Annual IEEE International Computer Software and Applications Conference (COMPSAC), July 2008. doi:10.1109/COMPSAC.2008.50
[34] Hongyu Pei Breivold、Ivica Crnkovic和Peter J. Eriksson:"分析软件的可演化性",发表于2008年7月第32届IEEE国际计算机软件与应用年会(COMPSAC)。doi:10.1109/COMPSAC.2008.50
Chapter 2. Data Models and Query Languages
The limits of my language mean the limits of my world.
我语言的界限即为我世界的界限。
Ludwig Wittgenstein, Tractatus Logico-Philosophicus (1922)
路德维希·维特根斯坦,《逻辑哲学论》(1922年)。
Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we think about the problem that we are solving.
数据模型可能是开发软件最重要的部分,因为它们具有如此深远的影响:不仅影响软件的编写方式,还影响我们思考解决问题的方式。
Most applications are built by layering one data model on top of another. For each layer, the key question is: how is it represented in terms of the next-lower layer? For example:
大多数应用程序都是通过在另一个数据模型之上进行分层构建的。对于每个层次,关键问题是:在下一层次中如何表示它?例如:
-
As an application developer, you look at the real world (in which there are people, organizations, goods, actions, money flows, sensors, etc.) and model it in terms of objects or data structures, and APIs that manipulate those data structures. Those structures are often specific to your application.
作为应用程序开发人员,您观察现实世界(其中包括人员、组织、商品、行动、货币流动、传感器等)并将其建模为对象或数据结构和操作这些数据结构的API。这些结构通常是特定于您的应用程序的。
-
When you want to store those data structures, you express them in terms of a general-purpose data model, such as JSON or XML documents, tables in a relational database, or a graph model.
当您想要存储那些数据结构时,您将它们表示为通用数据模型,例如JSON或XML文档,关系型数据库中的表,或者图形模型。
-
The engineers who built your database software decided on a way of representing that JSON/XML/relational/graph data in terms of bytes in memory, on disk, or on a network. The representation may allow the data to be queried, searched, manipulated, and processed in various ways.
构建您的数据库软件的工程师们决定了用字节在内存、磁盘或网络上表示JSON/XML/关系型/图形数据的方式。这种表示方式可以让数据以不同的方式被查询、搜索、操作和处理。
-
On yet lower levels, hardware engineers have figured out how to represent bytes in terms of electrical currents, pulses of light, magnetic fields, and more.
在更低的层面上,硬件工程师已经想出了如何利用电流、光脉冲、磁场等来表示字节。
In a complex application there may be more intermediary levels, such as APIs built upon APIs, but the basic idea is still the same: each layer hides the complexity of the layers below it by providing a clean data model. These abstractions allow different groups of people—for example, the engineers at the database vendor and the application developers using their database—to work together effectively.
在复杂的应用程序中,可能会存在更多的中间层,例如建立在API之上的API,但基本思想仍然相同:每个层次通过提供清晰的数据模型来隐藏其下面的层次的复杂性。这些抽象允许不同的人群 - 例如,数据库供应商的工程师和使用他们的数据库的应用程序开发人员 - 有效地合作。
There are many different kinds of data models, and every data model embodies assumptions about how it is going to be used. Some kinds of usage are easy and some are not supported; some operations are fast and some perform badly; some data transformations feel natural and some are awkward.
有许多不同种类的数据模型,每个数据模型都必然包含着使用的假设。有一些使用方式比较容易,有些则不被支持;有些操作运行速度较快,而有些则表现较差;有些数据转换方式感觉自然,而有些则不太合适。
It can take a lot of effort to master just one data model (think how many books there are on relational data modeling). Building software is hard enough, even when working with just one data model and without worrying about its inner workings. But since the data model has such a profound effect on what the software above it can and can’t do, it’s important to choose one that is appropriate to the application.
In this chapter we will look at a range of general-purpose data models for data storage and querying (point 2 in the preceding list). In particular, we will compare the relational model, the document model, and a few graph-based data models. We will also look at various query languages and compare their use cases. In Chapter 3 we will discuss how storage engines work; that is, how these data models are actually implemented (point 3 in the list).
Relational Model Versus Document Model
The best-known data model today is probably that of SQL, based on the relational model proposed by Edgar Codd in 1970 [ 1 ]: data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples ( rows in SQL).
The relational model was a theoretical proposal, and many people at the time doubted whether it could be implemented efficiently. However, by the mid-1980s, relational database management systems (RDBMSes) and SQL had become the tools of choice for most people who needed to store and query data with some kind of regular structure. The dominance of relational databases has lasted around 25‒30 years—an eternity in computing history.
The roots of relational databases lie in business data processing , which was performed on mainframe computers in the 1960s and ’70s. The use cases appear mundane from today’s perspective: typically transaction processing (entering sales or banking transactions, airline reservations, stock-keeping in warehouses) and batch processing (customer invoicing, payroll, reporting).
Other databases at that time forced application developers to think a lot about the internal representation of the data in the database. The goal of the relational model was to hide that implementation detail behind a cleaner interface.
Over the years, there have been many competing approaches to data storage and querying. In the 1970s and early 1980s, the network model and the hierarchical model were the main alternatives, but the relational model came to dominate them. Object databases came and went again in the late 1980s and early 1990s. XML databases appeared in the early 2000s, but have only seen niche adoption. Each competitor to the relational model generated a lot of hype in its time, but it never lasted [ 2 ].
As computers became vastly more powerful and networked, they started being used for increasingly diverse purposes. And remarkably, relational databases turned out to generalize very well, beyond their original scope of business data processing, to a broad variety of use cases. Much of what you see on the web today is still powered by relational databases, be it online publishing, discussion, social networking, ecommerce, games, software-as-a-service productivity applications, or much more.
The Birth of NoSQL
Now, in the 2010s, NoSQL is the latest attempt to overthrow the relational model’s dominance. The name “NoSQL” is unfortunate, since it doesn’t actually refer to any particular technology—it was originally intended simply as a catchy Twitter hashtag for a meetup on open source, distributed, nonrelational databases in 2009 [ 3 ]. Nevertheless, the term struck a nerve and quickly spread through the web startup community and beyond. A number of interesting database systems are now associated with the #NoSQL hashtag, and it has been retroactively reinterpreted as Not Only SQL [ 4 ].
There are several driving forces behind the adoption of NoSQL databases, including:
- A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput
- A widespread preference for free and open source software over commercial database products
- Specialized query operations that are not well supported by the relational model
- Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model [5]
Different applications have different requirements, and the best choice of technology for one use case may well be different from the best choice for another use case. It therefore seems likely that in the foreseeable future, relational databases will continue to be used alongside a broad variety of nonrelational datastores—an idea that is sometimes called polyglot persistence [ 3 ].
The Object-Relational Mismatch
Most application development today is done in object-oriented programming languages, which leads to a common criticism of the SQL data model: if data is stored in relational tables, an awkward translation layer is required between the objects in the application code and the database model of tables, rows, and columns. The disconnect between the models is sometimes called an impedance mismatch.
Object-relational mapping (ORM) frameworks like ActiveRecord and Hibernate reduce the amount of boilerplate code required for this translation layer, but they can’t completely hide the differences between the two models.
For example, Figure 2-1 illustrates how a résumé (a LinkedIn profile) could be expressed in a relational schema. The profile as a whole can be identified by a unique identifier, user_id. Fields like first_name and last_name appear exactly once per user, so they can be modeled as columns on the users table. However, most people have had more than one job in their career (positions), and people may have varying numbers of periods of education and any number of pieces of contact information. There is a one-to-many relationship from the user to these items, which can be represented in various ways:
- In the traditional SQL model (prior to SQL:1999), the most common normalized representation is to put positions, education, and contact information in separate tables, with a foreign key reference to the users table, as in Figure 2-1.
- Later versions of the SQL standard added support for structured datatypes and XML data; this allowed multi-valued data to be stored within a single row, with support for querying and indexing inside those documents. These features are supported to varying degrees by Oracle, IBM DB2, MS SQL Server, and PostgreSQL [6, 7]. A JSON datatype is also supported by several databases, including IBM DB2, MySQL, and PostgreSQL [8].
- A third option is to encode jobs, education, and contact info as a JSON or XML document, store it on a text column in the database, and let the application interpret its structure and content. In this setup, you typically cannot use the database to query for values inside that encoded column.
For a data structure like a résumé, which is mostly a self-contained document , a JSON representation can be quite appropriate: see Example 2-1 . JSON has the appeal of being much simpler than XML. Document-oriented databases like MongoDB [ 9 ], RethinkDB [ 10 ], CouchDB [ 11 ], and Espresso [ 12 ] support this data model.
Example 2-1. Representing a LinkedIn profile as a JSON document
{
  "user_id":     251,
  "first_name":  "Bill",
  "last_name":   "Gates",
  "summary":     "Co-chair of the Bill & Melinda Gates... Active blogger.",
  "region_id":   "us:91",
  "industry_id": 131,
  "photo_url":   "/p/7/000/253/05b/308dd6e.jpg",
  "positions": [
    {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
    {"job_title": "Co-founder, Chairman", "organization": "Microsoft"}
  ],
  "education": [
    {"school_name": "Harvard University", "start": 1973, "end": 1975},
    {"school_name": "Lakeside School, Seattle", "start": null, "end": null}
  ],
  "contact_info": {
    "blog":    "http://thegatesnotes.com",
    "twitter": "http://twitter.com/BillGates"
  }
}
Some developers feel that the JSON model reduces the impedance mismatch between the application code and the storage layer. However, as we shall see in Chapter 4 , there are also problems with JSON as a data encoding format. The lack of a schema is often cited as an advantage; we will discuss this in “Schema flexibility in the document model” .
The JSON representation has better locality than the multi-table schema in Figure 2-1. If you want to fetch a profile in the relational example, you need to either perform multiple queries (query each table by user_id) or perform a messy multi-way join between the users table and its subordinate tables. In the JSON representation, all the relevant information is in one place, and one query is sufficient.
The one-to-many relationships from the user profile to the user’s positions, educational history, and contact information imply a tree structure in the data, and the JSON representation makes this tree structure explicit (see Figure 2-2 ).
Many-to-One and Many-to-Many Relationships
In Example 2-1 in the preceding section, region_id and industry_id are given as IDs, not as plain-text strings "Greater Seattle Area" and "Philanthropy". Why?
If the user interface has free-text fields for entering the region and the industry, it makes sense to store them as plain-text strings. But there are advantages to having standardized lists of geographic regions and industries, and letting users choose from a drop-down list or autocompleter:
- Consistent style and spelling across profiles
- Avoiding ambiguity (e.g., if there are several cities with the same name)
- Ease of updating—the name is stored in only one place, so it is easy to update across the board if it ever needs to be changed (e.g., change of a city name due to political events)
- Localization support—when the site is translated into other languages, the standardized lists can be localized, so the region and industry can be displayed in the viewer’s language
- Better search—e.g., a search for philanthropists in the state of Washington can match this profile, because the list of regions can encode the fact that Seattle is in Washington (which is not apparent from the string "Greater Seattle Area")
Whether you store an ID or a text string is a question of duplication. When you use an ID, the information that is meaningful to humans (such as the word Philanthropy ) is stored in only one place, and everything that refers to it uses an ID (which only has meaning within the database). When you store the text directly, you are duplicating the human-meaningful information in every record that uses it.
The advantage of using an ID is that because it has no meaning to humans, it never needs to change: the ID can remain the same, even if the information it identifies changes. Anything that is meaningful to humans may need to change sometime in the future—and if that information is duplicated, all the redundant copies need to be updated. That incurs write overheads, and risks inconsistencies (where some copies of the information are updated but others aren’t). Removing such duplication is the key idea behind normalization in databases.
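To make the trade-off concrete, here is a minimal sketch (hypothetical data, not taken from the book's schema) of how a normalized lookup table confines a rename to a single record:

```javascript
// Normalized: the human-meaningful name lives in exactly one place,
// and profiles refer to it by an ID that never needs to change.
const regions = new Map([
  ["us:91", "Greater Seattle Area"],
]);

const profiles = [
  { user_id: 251, region_id: "us:91" },
  { user_id: 252, region_id: "us:91" },
];

// A rename touches exactly one record...
regions.set("us:91", "Seattle Metropolitan Area");

// ...and every profile sees the new name at read time.
const names = profiles.map(p => regions.get(p.region_id));
// names is ["Seattle Metropolitan Area", "Seattle Metropolitan Area"]
```

Had each profile stored the string directly, the rename would have to touch every copy, with exactly the write overhead and inconsistency risk described above.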
Note
Database administrators and developers love to argue about normalization and denormalization, but we will suspend judgment for now. In Part III of this book we will return to this topic and explore systematic ways of dealing with caching, denormalization, and derived data.
Unfortunately, normalizing this data requires many-to-one relationships (many people live in one particular region, many people work in one particular industry), which don’t fit nicely into the document model. In relational databases, it’s normal to refer to rows in other tables by ID, because joins are easy. In document databases, joins are not needed for one-to-many tree structures, and support for joins is often weak.
If the database itself does not support joins, you have to emulate a join in application code by making multiple queries to the database. (In this case, the lists of regions and industries are probably small and slow-changing enough that the application can simply keep them in memory. But nevertheless, the work of making the join is shifted from the database to the application code.)
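A minimal sketch of such an emulated join, assuming the small region and industry lists are cached in application memory (the names here are illustrative, not any database driver's API):

```javascript
// Hypothetical in-memory caches of the small, slow-changing lists.
const regions    = new Map([["us:91", "Greater Seattle Area"]]);
const industries = new Map([[131, "Philanthropy"]]);

// The "join" happens in application code: take the document you
// fetched, then resolve each ID with a follow-up lookup.
function resolveProfile(profile) {
  return {
    ...profile,
    region:   regions.get(profile.region_id),
    industry: industries.get(profile.industry_id),
  };
}

const resolved = resolveProfile({ user_id: 251, region_id: "us:91", industry_id: 131 });
// resolved.region is "Greater Seattle Area", resolved.industry is "Philanthropy"
```

Against a real database each `Map` lookup would be another query, which is precisely the work a relational database would do for you inside a join.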
Moreover, even if the initial version of an application fits well in a join-free document model, data has a tendency of becoming more interconnected as features are added to applications. For example, consider some changes we could make to the résumé example:
- Organizations and schools as entities
- In the previous description, organization (the company where the user worked) and school_name (where they studied) are just strings. Perhaps they should be references to entities instead? Then each organization, school, or university could have its own web page (with logo, news feed, etc.); each résumé could link to the organizations and schools that it mentions, and include their logos and other information (see Figure 2-3 for an example from LinkedIn).
- Recommendations
- Say you want to add a new feature: one user can write a recommendation for another user. The recommendation is shown on the résumé of the user who was recommended, together with the name and photo of the user making the recommendation. If the recommender updates their photo, any recommendations they have written need to reflect the new photo. Therefore, the recommendation should have a reference to the author’s profile.
Figure 2-4 illustrates how these new features require many-to-many relationships. The data within each dotted rectangle can be grouped into one document, but the references to organizations, schools, and other users need to be represented as references, and require joins when queried.
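As a sketch of why a reference beats a copy here (hypothetical documents and field names, not LinkedIn's actual schema):

```javascript
// A recommendation stores a *reference* to its author, not a copy
// of the author's name and photo.
const users = new Map([
  [600, { name: "Ada", photo_url: "/p/ada-v1.jpg" }],
]);

const recommendation = {
  recommended_user: 251,
  author_id: 600,              // reference, resolved by a join or follow-up query
  text: "Highly recommended!",
};

// Rendering resolves the reference at read time...
const render = rec => ({ ...rec, author: users.get(rec.author_id) });

// ...so an updated photo appears in every recommendation automatically.
users.get(600).photo_url = "/p/ada-v2.jpg";
// render(recommendation).author.photo_url is "/p/ada-v2.jpg"
```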
Are Document Databases Repeating History?
While many-to-many relationships and joins are routinely used in relational databases, document databases and NoSQL reopened the debate on how best to represent such relationships in a database. This debate is much older than NoSQL—in fact, it goes back to the very earliest computerized database systems.
The most popular database for business data processing in the 1970s was IBM’s Information Management System (IMS), originally developed for stock-keeping in the Apollo space program and first commercially released in 1968 [ 13 ]. It is still in use and maintained today, running on OS/390 on IBM mainframes [ 14 ].
The design of IMS used a fairly simple data model called the hierarchical model , which has some remarkable similarities to the JSON model used by document databases [ 2 ]. It represented all data as a tree of records nested within records, much like the JSON structure of Figure 2-2 .
Like document databases, IMS worked well for one-to-many relationships, but it made many-to-many relationships difficult, and it didn’t support joins. Developers had to decide whether to duplicate (denormalize) data or to manually resolve references from one record to another. These problems of the 1960s and ’70s were very much like the problems that developers are running into with document databases today [ 15 ].
Various solutions were proposed to solve the limitations of the hierarchical model. The two most prominent were the relational model (which became SQL, and took over the world) and the network model (which initially had a large following but eventually faded into obscurity). The “great debate” between these two camps lasted for much of the 1970s [ 2 ].
Since the problem that the two models were solving is still so relevant today, it’s worth briefly revisiting this debate in today’s light.
The network model
The network model was standardized by a committee called the Conference on Data Systems Languages (CODASYL) and implemented by several different database vendors; it is also known as the CODASYL model [ 16 ].
The CODASYL model was a generalization of the hierarchical model. In the tree structure of the hierarchical model, every record has exactly one parent; in the network model, a record could have multiple parents. For example, there could be one record for the "Greater Seattle Area" region, and every user who lived in that region could be linked to it. This allowed many-to-one and many-to-many relationships to be modeled.
The links between records in the network model were not foreign keys, but more like pointers in a programming language (while still being stored on disk). The only way of accessing a record was to follow a path from a root record along these chains of links. This was called an access path .
In the simplest case, an access path could be like the traversal of a linked list: start at the head of the list, and look at one record at a time until you find the one you want. But in a world of many-to-many relationships, several different paths can lead to the same record, and a programmer working with the network model had to keep track of these different access paths in their head.
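The linked-list case can be sketched in a few lines (a toy model using in-memory objects; CODASYL systems followed on-disk pointers, but the shape of the traversal is the same):

```javascript
// A toy "access path": records chained by pointers, as in the
// network model, rather than addressed by value as in SQL.
const rec3 = { name: "Carol", next: null };
const rec2 = { name: "Bob",   next: rec3 };
const head = { name: "Alice", next: rec2 };

// The only way to reach a record is to walk the chain from the root.
function findByName(root, name) {
  for (let cur = root; cur !== null; cur = cur.next) {
    if (cur.name === name) return cur;
  }
  return null;
}

// findByName(head, "Carol") walks Alice -> Bob -> Carol
```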
A query in CODASYL was performed by moving a cursor through the database by iterating over lists of records and following access paths. If a record had multiple parents (i.e., multiple incoming pointers from other records), the application code had to keep track of all the various relationships. Even CODASYL committee members admitted that this was like navigating around an n -dimensional data space [ 17 ].
Although manual access path selection was able to make the most efficient use of the very limited hardware capabilities in the 1970s (such as tape drives, whose seeks are extremely slow), the problem was that they made the code for querying and updating the database complicated and inflexible. With both the hierarchical and the network model, if you didn’t have a path to the data you wanted, you were in a difficult situation. You could change the access paths, but then you had to go through a lot of handwritten database query code and rewrite it to handle the new access paths. It was difficult to make changes to an application’s data model.
The relational model
What the relational model did, by contrast, was to lay out all the data in the open: a relation (table) is simply a collection of tuples (rows), and that’s it. There are no labyrinthine nested structures, no complicated access paths to follow if you want to look at the data. You can read any or all of the rows in a table, selecting those that match an arbitrary condition. You can read a particular row by designating some columns as a key and matching on those. You can insert a new row into any table without worrying about foreign key relationships to and from other tables.
In a relational database, the query optimizer automatically decides which parts of the query to execute in which order, and which indexes to use. Those choices are effectively the “access path,” but the big difference is that they are made automatically by the query optimizer, not by the application developer, so we rarely need to think about them.
If you want to query your data in new ways, you can just declare a new index, and queries will automatically use whichever indexes are most appropriate. You don’t need to change your queries to take advantage of a new index. (See also “Query Languages for Data” .) The relational model thus made it much easier to add new features to applications.
Query optimizers for relational databases are complicated beasts, and they have consumed many years of research and development effort [ 18 ]. But a key insight of the relational model was this: you only need to build a query optimizer once, and then all applications that use the database can benefit from it. If you don’t have a query optimizer, it’s easier to handcode the access paths for a particular query than to write a general-purpose optimizer—but the general-purpose solution wins in the long run.
Comparison to document databases
Document databases reverted back to the hierarchical model in one aspect: storing nested records (one-to-many relationships, like positions, education, and contact_info in Figure 2-1) within their parent record rather than in a separate table.
However, when it comes to representing many-to-one and many-to-many relationships, relational and document databases are not fundamentally different: in both cases, the related item is referenced by a unique identifier, which is called a foreign key in the relational model and a document reference in the document model [ 9 ]. That identifier is resolved at read time by using a join or follow-up queries. To date, document databases have not followed the path of CODASYL.
Relational Versus Document Databases Today
There are many differences to consider when comparing relational databases to document databases, including their fault-tolerance properties (see Chapter 5 ) and handling of concurrency (see Chapter 7 ). In this chapter, we will concentrate only on the differences in the data model.
The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the data structures used by the application. The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships.
Which data model leads to simpler application code?
If the data in your application has a document-like structure (i.e., a tree of one-to-many relationships, where typically the entire tree is loaded at once), then it’s probably a good idea to use a document model. The relational technique of shredding—splitting a document-like structure into multiple tables (like positions, education, and contact_info in Figure 2-1)—can lead to cumbersome schemas and unnecessarily complicated application code.
The document model has limitations: for example, you cannot refer directly to a nested item within a document, but instead you need to say something like “the second item in the list of positions for user 251” (much like an access path in the hierarchical model). However, as long as documents are not too deeply nested, that is not usually a problem.
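For instance, using the document structure of Example 2-1, that access-path-like reference is just an index into a nested array:

```javascript
// Addressing a nested item by position, much like a hierarchical
// access path: there is no stable identifier for the nested record.
const profile = {
  user_id: 251,
  positions: [
    { job_title: "Co-chair",             organization: "Bill & Melinda Gates Foundation" },
    { job_title: "Co-founder, Chairman", organization: "Microsoft" },
  ],
};

// "The second item in the list of positions for user 251":
const second = profile.positions[1];
// second.organization is "Microsoft"
```

If the list is ever reordered, such positional references silently point at a different record, which is one reason deep nesting gets awkward.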
The poor support for joins in document databases may or may not be a problem, depending on the application. For example, many-to-many relationships may never be needed in an analytics application that uses a document database to record which events occurred at which time [ 19 ].
However, if your application does use many-to-many relationships, the document model becomes less appealing. It’s possible to reduce the need for joins by denormalizing, but then the application code needs to do additional work to keep the denormalized data consistent. Joins can be emulated in application code by making multiple requests to the database, but that also moves complexity into the application and is usually slower than a join performed by specialized code inside the database. In such cases, using a document model can lead to significantly more complex application code and worse performance [ 15 ].
It’s not possible to say in general which data model leads to simpler application code; it depends on the kinds of relationships that exist between data items. For highly interconnected data, the document model is awkward, the relational model is acceptable, and graph models (see “Graph-Like Data Models” ) are the most natural.
Schema flexibility in the document model
Most document databases, and the JSON support in relational databases, do not enforce any schema on the data in documents. XML support in relational databases usually comes with optional schema validation. No schema means that arbitrary keys and values can be added to a document, and when reading, clients have no guarantees as to what fields the documents may contain.
Document databases are sometimes called schemaless , but that’s misleading, as the code that reads the data usually assumes some kind of structure—i.e., there is an implicit schema, but it is not enforced by the database [ 20 ]. A more accurate term is schema-on-read (the structure of the data is implicit, and only interpreted when the data is read), in contrast with schema-on-write (the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it) [ 21 ].
Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas schema-on-write is similar to static (compile-time) type checking. Just as the advocates of static and dynamic type checking have big debates about their relative merits [22], enforcement of schemas in databases is a contentious topic, and in general there’s no right or wrong answer.
The difference between the approaches is particularly noticeable in situations where an application wants to change the format of its data. For example, say you are currently storing each user’s full name in one field, and you instead want to store the first name and last name separately [ 23 ]. In a document database, you would just start writing new documents with the new fields and have code in the application that handles the case when old documents are read. For example:
if (user && user.name && !user.first_name) {
    // Documents written before Dec 8, 2013 don't have first_name
    user.first_name = user.name.split(" ")[0];
}
On the other hand, in a “statically typed” database schema, you would typically perform a migration along the lines of:
另一方面,在“静态类型”数据库模式中,您通常会执行以下类型的迁移:
ALTER TABLE users ADD COLUMN first_name text;
UPDATE users SET first_name = split_part(name, ' ', 1);      -- PostgreSQL
UPDATE users SET first_name = substring_index(name, ' ', 1); -- MySQL
Schema changes have a bad reputation of being slow and requiring downtime. This reputation is not entirely deserved: most relational database systems execute the ALTER TABLE statement in a few milliseconds. MySQL is a notable exception—it copies the entire table on ALTER TABLE, which can mean minutes or even hours of downtime when altering a large table—although various tools exist to work around this limitation [24, 25, 26].
模式更改有一个慢且需要停机的坏名声。这种名声并非完全属实:大多数关系型数据库系统能在几毫秒内执行ALTER TABLE语句。MySQL是一个值得注意的例外:它在ALTER TABLE时会复制整个表,这意味着更改大型表时可能需要几分钟甚至几小时的停机时间,尽管存在各种工具来绕过这个限制[24,25,26]。
Running the UPDATE statement on a large table is likely to be slow on any database, since every row needs to be rewritten. If that is not acceptable, the application can leave first_name set to its default of NULL and fill it in at read time, like it would with a document database.
在任何数据库上,对大型表运行UPDATE语句都可能很慢,因为每一行都需要重写。如果这不可接受,应用程序可以将first_name保留为其默认值NULL,并在读取时填充,就像使用文档数据库那样。
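The read-time fill-in approach can be sketched in application code. This is a minimal sketch; the `normalizeUser` helper and the sample documents are hypothetical, not part of any particular database driver:

```javascript
// Hypothetical helper that normalizes user records on read:
// old records have only "name", newer ones also carry "first_name".
function normalizeUser(user) {
    if (user && user.name && !user.first_name) {
        // Fill in the missing field at read time instead of migrating every row.
        user.first_name = user.name.split(" ")[0];
    }
    return user;
}

var oldDoc = { name: "Mary Poppins" };                    // written before the schema change
var newDoc = { name: "Bert Alfred", first_name: "Bert" }; // written after

console.log(normalizeUser(oldDoc).first_name); // "Mary"
console.log(normalizeUser(newDoc).first_name); // "Bert"
```

Every code path that reads users must go through such a helper, which is the hidden cost of schema-on-read: the implicit schema lives in the application instead of the database.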
The schema-on-read approach is advantageous if the items in the collection don’t all have the same structure for some reason (i.e., the data is heterogeneous)—for example, because:
如果集合中的项目由于某种原因并不都具有相同的结构(即数据是异构的),读时模式的方法就很有优势,例如因为:
-
There are many different types of objects, and it is not practical to put each type of object in its own table.
有许多不同类型的对象,把每种类型的对象都放在自己的表格中是不现实的。
-
The structure of the data is determined by external systems over which you have no control and which may change at any time.
数据的结构由你无法控制并且随时可能发生改变的外部系统决定。
In situations like these, a schema may hurt more than it helps, and schemaless documents can be a much more natural data model. But in cases where all records are expected to have the same structure, schemas are a useful mechanism for documenting and enforcing that structure. We will discuss schemas and schema evolution in more detail in Chapter 4 .
在这种情况下,模式可能会带来更多的麻烦,而无模式文档可以成为一种更自然的数据模型。但是,如果预计所有记录都具有相同的结构,则模式是记录和强制执行结构的有用机制。我们将在第4章中更详细地讨论模式和模式演化。
Data locality for queries
A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant thereof (such as MongoDB’s BSON). If your application often needs to access the entire document (for example, to render it on a web page), there is a performance advantage to this storage locality . If data is split across multiple tables, like in Figure 2-1 , multiple index lookups are required to retrieve it all, which may require more disk seeks and take more time.
一个文档通常被存储为一个连续的字符串,编码为JSON、XML或其二进制变体(如MongoDB的BSON)。如果你的应用程序经常需要访问整个文档(例如,在网页上呈现它),那么这种存储本地性具有性能优势。如果数据分布在多个表中,如图2-1所示,需要多次索引查找才能检索所有数据,这可能需要更多的磁盘寻道并花费更长时间。
The locality advantage only applies if you need large parts of the document at the same time. The database typically needs to load the entire document, even if you access only a small portion of it, which can be wasteful on large documents. On updates to a document, the entire document usually needs to be rewritten—only modifications that don’t change the encoded size of a document can easily be performed in place [ 19 ]. For these reasons, it is generally recommended that you keep documents fairly small and avoid writes that increase the size of a document [ 9 ]. These performance limitations significantly reduce the set of situations in which document databases are useful.
只有当你同时需要文档的大部分内容时,局部性优势才适用。数据库通常需要加载整个文档,即使你只访问其中一小部分,这对大文档来说可能很浪费。更新文档时,通常需要重写整个文档——只有不改变文档编码大小的修改才容易就地执行[19]。出于这些原因,通常建议保持文档相当小,并避免会增加文档大小的写入[9]。这些性能限制大大减少了文档数据库适用的场景。
It’s worth pointing out that the idea of grouping related data together for locality is not limited to the document model. For example, Google’s Spanner database offers the same locality properties in a relational data model, by allowing the schema to declare that a table’s rows should be interleaved (nested) within a parent table [ 27 ]. Oracle allows the same, using a feature called multi-table index cluster tables [ 28 ]. The column-family concept in the Bigtable data model (used in Cassandra and HBase) has a similar purpose of managing locality [ 29 ].
值得指出的是,将相关数据分组以实现本地化的理念并不仅限于文档模型。例如,谷歌的Spanner数据库在关系数据模型中也提供了相同的本地化特性,通过允许架构声明表的行应嵌套在父表中进行交错排列[27]。Oracle也可以使用称为多表索引群集表的功能实现相同的功能[28]。Bigtable数据模型(用于Cassandra和HBase)中的列族概念具有管理本地性的类似目的[29]。
We will also see more on locality in Chapter 3 .
我们还将在第3章中看到更多关于局部性的内容。
Convergence of document and relational databases
Most relational database systems (other than MySQL) have supported XML since the mid-2000s. This includes functions to make local modifications to XML documents and the ability to index and query inside XML documents, which allows applications to use data models very similar to what they would do when using a document database.
大多数关系型数据库系统(除了MySQL)自2000年代中期以来就支持XML。这包括对XML文档进行本地修改的功能以及索引和查询XML文档的能力,这使得应用程序可以使用非常类似于使用文档数据库时的数据模型。
PostgreSQL since version 9.3 [ 8 ], MySQL since version 5.7, and IBM DB2 since version 10.5 [ 30 ] also have a similar level of support for JSON documents. Given the popularity of JSON for web APIs, it is likely that other relational databases will follow in their footsteps and add JSON support.
PostgreSQL自9.3版本起[8]、MySQL自5.7版本起、IBM DB2自10.5版本起[30],也对JSON文档提供了类似级别的支持。鉴于JSON在Web API中的流行,其他关系型数据库很可能也会效仿并增加JSON支持。
On the document database side, RethinkDB supports relational-like joins in its query language, and some MongoDB drivers automatically resolve database references (effectively performing a client-side join, although this is likely to be slower than a join performed in the database since it requires additional network round-trips and is less optimized).
在文档数据库方面,RethinkDB的查询语言支持类似关系型数据库的联接,而一些MongoDB驱动程序会自动解析数据库引用(实际上是执行客户端联接,但这可能比在数据库中执行联接要慢,因为它需要额外的网络往返并且不太优化)。
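What such a client-side join amounts to can be sketched as follows. This is an illustrative sketch with hypothetical in-memory collections; a real driver would issue an additional network round-trip for the second lookup:

```javascript
// Hypothetical in-memory stand-ins for two collections.
var users = { u1: { id: "u1", name: "Lucy" } };
var posts = [{ id: "p1", author_ref: "u1", title: "Hello" }];

// Resolving the reference requires a second lookup -- over the network,
// this is the extra round-trip that makes client-side joins slower than
// a join performed inside the database.
function resolveAuthor(post) {
    post.author = users[post.author_ref];
    return post;
}

console.log(resolveAuthor(posts[0]).author.name); // "Lucy"
```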
It seems that relational and document databases are becoming more similar over time, and that is a good thing: the data models complement each other. v If a database is able to handle document-like data and also perform relational queries on it, applications can use the combination of features that best fits their needs.
看起来关系型和文档型数据库随着时间推移越来越相似,这是一件好事:这两种数据模型相互补充。如果一个数据库既能处理类似文档的数据,又能对其执行关系查询,应用程序就可以组合使用最符合其需求的功能。
A hybrid of the relational and document models is a good route for databases to take in the future.
关系模型和文档模型的混合是数据库未来发展的良好选择。
Query Languages for Data
When the relational model was introduced, it included a new way of querying data: SQL is a declarative query language, whereas IMS and CODASYL queried the database using imperative code. What does that mean?
当关系模型被引入时,它包括了一种新的查询数据的方式:SQL是一种声明式查询语言,而IMS和CODASYL使用命令式代码来查询数据库。这是什么意思?
Many commonly used programming languages are imperative. For example, if you have a list of animal species, you might write something like this to return only the sharks in the list:
许多常用编程语言是命令式的。例如,如果你有一个动物种类的列表,你可能会写出这样的代码仅返回列表中的鲨鱼:
function getSharks() {
    var sharks = [];
    for (var i = 0; i < animals.length; i++) {
        if (animals[i].family === "Sharks") {
            sharks.push(animals[i]);
        }
    }
    return sharks;
}
In the relational algebra, you would instead write:
在关系代数中,你会这样写:
sharks = σ family = “Sharks” (animals)
where σ (the Greek letter sigma) is the selection operator, returning only those animals that match the condition family = “Sharks” .
其中σ(希腊字母西格马)是选择运算符,仅返回符合family = “Sharks”条件的那些动物。
When SQL was defined, it followed the structure of the relational algebra fairly closely:
当SQL被定义时,它相当紧密地遵循了关系代数的结构:
SELECT * FROM animals WHERE family = 'Sharks';
An imperative language tells the computer to perform certain operations in a certain order. You can imagine stepping through the code line by line, evaluating conditions, updating variables, and deciding whether to go around the loop one more time.
一种命令式编程语言会告诉计算机按照一定的顺序执行某些操作。你可以想象逐行地浏览代码,评估条件,更新变量,然后决定是否再次执行循环。
In a declarative query language, like SQL or relational algebra, you just specify the pattern of the data you want—what conditions the results must meet, and how you want the data to be transformed (e.g., sorted, grouped, and aggregated)—but not how to achieve that goal. It is up to the database system’s query optimizer to decide which indexes and which join methods to use, and in which order to execute various parts of the query.
在声明式查询语言(如SQL或关系代数)中,您只需指定所需的数据模式 - 结果必须满足哪些条件以及如何转换数据(例如排序、分组和聚合) - 而不是如何实现该目标。由数据库系统的查询优化器决定使用哪些索引和连接方法,以及以哪种顺序执行查询的各个部分。
A declarative query language is attractive because it is typically more concise and easier to work with than an imperative API. But more importantly, it also hides implementation details of the database engine, which makes it possible for the database system to introduce performance improvements without requiring any changes to queries.
声明性查询语言有吸引力,因为通常比命令式API更简洁易用。但更重要的是,它也隐藏了数据库引擎的实现细节,这使得数据库系统能够引入性能改进而无需对查询进行任何更改。
For example, in the imperative code shown at the beginning of this section, the list of animals appears in a particular order. If the database wants to reclaim unused disk space behind the scenes, it might need to move records around, changing the order in which the animals appear. Can the database do that safely, without breaking queries?
在本节开头所示的命令式代码中,动物列表以特定的顺序出现。如果数据库想要在后台回收未使用的磁盘空间,则可能需要移动记录并更改动物出现的顺序。数据库能安全地这样做,而不会破坏查询吗?
The SQL example doesn’t guarantee any particular ordering, and so it doesn’t mind if the order changes. But if the query is written as imperative code, the database can never be sure whether the code is relying on the ordering or not. The fact that SQL is more limited in functionality gives the database much more room for automatic optimizations.
这个SQL示例并不保证特定的排序,所以它不在意顺序的改变。但是如果查询被写成命令式代码,数据库就不能确定代码是否依赖于排序。事实上,SQL的功能更加有限,这给数据库留下了更多自动优化的空间。
Finally, declarative languages often lend themselves to parallel execution. Today, CPUs are getting faster by adding more cores, not by running at significantly higher clock speeds than before [ 31 ]. Imperative code is very hard to parallelize across multiple cores and multiple machines, because it specifies instructions that must be performed in a particular order. Declarative languages have a better chance of getting faster in parallel execution because they specify only the pattern of the results, not the algorithm that is used to determine the results. The database is free to use a parallel implementation of the query language, if appropriate [ 32 ].
最终,声明性语言通常适合并行执行。如今,CPU 通过添加更多核心来加快速度,而不是以比以前显著更高的时钟速度运行[31]。 命令式代码很难跨多个核心和多个机器并行化,因为它指定必须按特定顺序执行的指令。声明性语言在并行执行方面有更大的机会变得更快,因为它们仅指定结果的模式,而不是用于确定结果的算法。如果合适,数据库可以自由地使用查询语言的并行实现[32]。
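The contrast with the earlier getSharks loop can be sketched in the same language using a higher-order function (a sketch assuming an in-memory animals array): the caller supplies only the condition, and the implementation of filter is free to iterate, reorder, or parallelize however it likes.

```javascript
var animals = [
    { name: "Great White Shark", family: "Sharks" },
    { name: "Blue Whale",        family: "Whales" },
    { name: "Tiger Shark",       family: "Sharks" }
];

// Only the pattern of the desired results is specified: no loop,
// no index variable, no prescribed evaluation order.
var sharks = animals.filter(function (animal) {
    return animal.family === "Sharks";
});

console.log(sharks.length); // 2
```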
Declarative Queries on the Web
The advantages of declarative query languages are not limited to just databases. To illustrate the point, let’s compare declarative and imperative approaches in a completely different environment: a web browser.
声明式查询语言的优点不仅限于数据库。为了说明这一点,让我们比较声明式和命令式方法在一个完全不同的环境下:Web浏览器。
Say you have a website about animals in the ocean. The user is currently viewing the page on sharks, so you mark the navigation item “Sharks” as currently selected, like this:
假设你有一个关于海洋动物的网站。用户目前正在查看关于鲨鱼的页面,那么你可以将导航条目“鲨鱼”标记为当前选中状态,就像这样:
<ul>
    <li class="selected">
        <p>Sharks</p>
        <ul>
            <li>Great White Shark</li>
            <li>Tiger Shark</li>
            <li>Hammerhead Shark</li>
        </ul>
    </li>
    <li>
        <p>Whales</p>
        <ul>
            <li>Blue Whale</li>
            <li>Humpback Whale</li>
            <li>Fin Whale</li>
        </ul>
    </li>
</ul>
-
The selected item is marked with the CSS class "selected".
所选择的项目带有CSS类"selected"。
-
<p>Sharks</p> is the title of the currently selected page.
<p>Sharks</p>是当前选定页面的标题。
Now say you want the title of the currently selected page to have a blue background, so that it is visually highlighted. This is easy, using CSS:
现在假设您想要当前选定页面的标题具有蓝色背景,以便在视觉上突出显示。使用 CSS 很容易实现:
li.selected > p {
    background-color: blue;
}
Here the CSS selector li.selected > p declares the pattern of elements to which we want to apply the blue style: namely, all <p> elements whose direct parent is an <li> element with a CSS class of selected. The element <p>Sharks</p> in the example matches this pattern, but <p>Whales</p> does not match, because its <li> parent lacks class="selected".
此处CSS选择器li.selected > p声明了我们要应用蓝色样式的元素模式:即所有直接父元素是带有CSS类selected的<li>元素的<p>元素。示例中的元素<p>Sharks</p>符合此模式,而<p>Whales</p>不符合,因为它的<li>父元素缺少class="selected"。
If you were using XSL instead of CSS, you could do something similar:
如果你使用XSL而不是CSS,你可以做类似的事情:
<xsl:template match="li[@class='selected']/p">
    <fo:block background-color="blue">
        <xsl:apply-templates/>
    </fo:block>
</xsl:template>
Here, the XPath expression li[@class='selected']/p is equivalent to the CSS selector li.selected > p in the previous example. What CSS and XSL have in common is that they are both declarative languages for specifying the styling of a document.
这里,XPath表达式li[@class='selected']/p等价于前面例子中的CSS选择器li.selected > p。CSS和XSL的共同点在于它们都是用于指定文档样式的声明式语言。
Imagine what life would be like if you had to use an imperative approach. In JavaScript, using the core Document Object Model (DOM) API, the result might look something like this:
想象一下,如果必须使用命令式方法,那会是什么样子。在JavaScript中,使用核心文档对象模型(DOM)API,结果可能看起来像这样:
var liElements = document.getElementsByTagName("li");
for (var i = 0; i < liElements.length; i++) {
    if (liElements[i].className === "selected") {
        var children = liElements[i].childNodes;
        for (var j = 0; j < children.length; j++) {
            var child = children[j];
            if (child.nodeType === Node.ELEMENT_NODE && child.tagName === "P") {
                child.setAttribute("style", "background-color: blue");
            }
        }
    }
}
This JavaScript imperatively sets the element <p>Sharks</p> to have a blue background, but the code is awful. Not only is it much longer and harder to understand than the CSS and XSL equivalents, but it also has some serious problems:
这段JavaScript命令式地将元素<p>Sharks</p>设置为蓝色背景,但这段代码很糟糕。它不仅比CSS和XSL的等价物更长、更难理解,而且还有一些严重的问题:
-
If the selected class is removed (e.g., because the user clicks a different page), the blue color won’t be removed, even if the code is rerun—and so the item will remain highlighted until the entire page is reloaded. With CSS, the browser automatically detects when the li.selected > p rule no longer applies and removes the blue background as soon as the selected class is removed.
如果selected类被移除(例如,因为用户点击了不同的页面),即使重新运行代码,蓝色也不会被移除,因此该项将保持突出显示,直到重新加载整个页面。使用CSS,浏览器会自动检测li.selected > p规则何时不再适用,并在selected类被移除后立即移除蓝色背景。
-
If you want to take advantage of a new API, such as document.getElementsByClassName("selected") or even document.evaluate()—which may improve performance—you have to rewrite the code. On the other hand, browser vendors can improve the performance of CSS and XPath without breaking compatibility.
如果你想利用新的API,例如document.getElementsByClassName("selected")甚至document.evaluate()——它们可能会提高性能——你就必须重写代码。另一方面,浏览器厂商可以提高CSS和XPath的性能,而不破坏兼容性。
In a web browser, using declarative CSS styling is much better than manipulating styles imperatively in JavaScript. Similarly, in databases, declarative query languages like SQL turned out to be much better than imperative query APIs. vi
在Web浏览器中,使用声明性CSS样式比在JavaScript中使用命令式样式要好得多。同样,在数据库中,声明性查询语言例如SQL被证明比命令式查询API更好用。
MapReduce Querying
MapReduce is a programming model for processing large amounts of data in bulk across many machines, popularized by Google [ 33 ]. A limited form of MapReduce is supported by some NoSQL datastores, including MongoDB and CouchDB, as a mechanism for performing read-only queries across many documents.
MapReduce是一种在多台机器上批量处理大量数据的编程模型,由Google推广普及[33]。一些NoSQL数据存储(包括MongoDB和CouchDB)支持有限形式的MapReduce,作为在多个文档上执行只读查询的机制。
MapReduce in general is described in more detail in Chapter 10 . For now, we’ll just briefly discuss MongoDB’s use of the model.
总体上说,MapReduce在第10章中有更详细的描述。现在,我们只是简要讨论MongoDB使用该模型的情况。
MapReduce is neither a declarative query language nor a fully imperative query API, but somewhere in between: the logic of the query is expressed with snippets of code, which are called repeatedly by the processing framework. It is based on the map (also known as collect) and reduce (also known as fold or inject) functions that exist in many functional programming languages.
MapReduce既不是声明式查询语言,也不是完全命令式的查询API,而是介于两者之间:查询逻辑用代码片段表达,这些代码片段被处理框架重复调用。它基于许多函数式编程语言中都存在的map(也称为collect)和reduce(也称为fold或inject)函数。
To give an example, imagine you are a marine biologist, and you add an observation record to your database every time you see animals in the ocean. Now you want to generate a report saying how many sharks you have sighted per month.
举个例子,比如你是一位海洋生物学家,每当你在海洋中看到动物时,你都会添加一条观察记录到你的数据库中。现在你想生成一份报告,说明每个月你观察到了多少只鲨鱼。
In PostgreSQL you might express that query like this:
在PostgreSQL中,您可以像这样表达该查询:
SELECT date_trunc('month', observation_timestamp) AS observation_month,
       sum(num_animals) AS total_animals
FROM observations
WHERE family = 'Sharks'
GROUP BY observation_month;
-
The date_trunc('month', timestamp) function determines the calendar month containing timestamp, and returns another timestamp representing the beginning of that month. In other words, it rounds a timestamp down to the nearest month.
date_trunc('month', timestamp)函数确定包含timestamp的日历月份,并返回表示该月开始的另一个时间戳。换句话说,它将时间戳向下舍入到最近的月份。
This query first filters the observations to only show species in the Sharks family, then groups the observations by the calendar month in which they occurred, and finally adds up the number of animals seen in all observations in that month.
该查询首先筛选出鲨鱼(Sharks)科物种的观测记录,然后按观测发生的日历月份分组,最后将该月份所有观测中看到的动物数量相加。
The same can be expressed with MongoDB’s MapReduce feature as follows:
可以使用MongoDB的MapReduce功能以如下方式表达相同的内容:
db.observations.mapReduce(
    function map() {
        var year  = this.observationTimestamp.getFullYear();
        var month = this.observationTimestamp.getMonth() + 1;
        emit(year + "-" + month, this.numAnimals);
    },
    function reduce(key, values) {
        return Array.sum(values);
    },
    {
        query: { family: "Sharks" },
        out: "monthlySharkReport"
    }
);
-
The filter to consider only shark species can be specified declaratively (this is a MongoDB-specific extension to MapReduce).
可以声明式地指定只考虑鲨鱼物种的过滤器(这是MongoDB特有的MapReduce扩展)。
-
The JavaScript function map is called once for every document that matches query, with this set to the document object.
JavaScript函数map会针对每个与query匹配的文档调用一次,并将this设置为该文档对象。
-
The map function emits a key (a string consisting of year and month, such as "2013-12" or "2014-1") and a value (the number of animals in that observation).
map函数发出一个键(由年份和月份组成的字符串,如"2013-12"或"2014-1")和一个值(该次观测中动物的数量)。
-
The key-value pairs emitted by map are grouped by key. For all key-value pairs with the same key (i.e., the same month and year), the reduce function is called once.
map发出的键值对按键分组。对于所有具有相同键(即相同的月份和年份)的键值对,reduce函数会被调用一次。
-
The reduce function adds up the number of animals from all observations in a particular month.
reduce函数将特定月份内所有观测的动物数量相加。
-
The final output is written to the collection monthlySharkReport.
最终输出被写入集合monthlySharkReport。
For example, say the observations collection contains these two documents:
例如,假设observations集合包含以下两个文档:
{
    observationTimestamp: Date.parse("Mon, 25 Dec 1995 12:34:56 GMT"),
    family:     "Sharks",
    species:    "Carcharodon carcharias",
    numAnimals: 3
}
{
    observationTimestamp: Date.parse("Tue, 12 Dec 1995 16:17:18 GMT"),
    family:     "Sharks",
    species:    "Carcharias taurus",
    numAnimals: 4
}
The map function would be called once for each document, resulting in emit("1995-12", 3) and emit("1995-12", 4). Subsequently, the reduce function would be called with reduce("1995-12", [3, 4]), returning 7.
map函数会针对每个文档调用一次,产生emit("1995-12", 3)和emit("1995-12", 4)。随后,reduce函数会以reduce("1995-12", [3, 4])被调用,返回7。
The map and reduce functions are somewhat restricted in what they are allowed to do. They must be pure functions, which means they only use the data that is passed to them as input, they cannot perform additional database queries, and they must not have any side effects. These restrictions allow the database to run the functions anywhere, in any order, and rerun them on failure. However, they are nevertheless powerful: they can parse strings, call library functions, perform calculations, and more.
map和reduce函数在允许执行的操作上有一定限制。它们必须是纯函数,这意味着它们只使用作为输入传递给它们的数据,不能执行额外的数据库查询,也不能有任何副作用。这些限制使得数据库可以在任何地方、以任何顺序运行这些函数,并在失败时重新运行它们。尽管如此,它们仍然很强大:它们可以解析字符串、调用库函数、执行计算等等。
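The flow of keys and values described above can be made concrete with a toy single-machine simulation. This is illustrative only, not MongoDB's actual implementation; one deliberate difference is that emit is passed to map as a parameter here, rather than being provided as a global:

```javascript
function mapReduce(docs, map, reduce) {
    var groups = {};
    var emit = function (key, value) {
        (groups[key] = groups[key] || []).push(value);
    };
    // Call map once per document, with `this` bound to the document.
    docs.forEach(function (doc) { map.call(doc, emit); });
    // Call reduce once per distinct key, over all values emitted for it.
    var out = {};
    Object.keys(groups).forEach(function (key) {
        out[key] = reduce(key, groups[key]);
    });
    return out;
}

var observations = [
    { observationTimestamp: Date.parse("Mon, 25 Dec 1995 12:34:56 GMT"),
      family: "Sharks", numAnimals: 3 },
    { observationTimestamp: Date.parse("Tue, 12 Dec 1995 16:17:18 GMT"),
      family: "Sharks", numAnimals: 4 }
];

var report = mapReduce(observations,
    function (emit) {
        var d = new Date(this.observationTimestamp);
        emit(d.getUTCFullYear() + "-" + (d.getUTCMonth() + 1), this.numAnimals);
    },
    function (key, values) {
        return values.reduce(function (a, b) { return a + b; }, 0);
    });

console.log(report["1995-12"]); // 7
```

Because map and reduce are pure, the framework could just as well run them on different machines and merge the per-key groups afterward.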
MapReduce is a fairly low-level programming model for distributed execution on a cluster of machines. Higher-level query languages like SQL can be implemented as a pipeline of MapReduce operations (see Chapter 10 ), but there are also many distributed implementations of SQL that don’t use MapReduce. Note there is nothing in SQL that constrains it to running on a single machine, and MapReduce doesn’t have a monopoly on distributed query execution.
MapReduce是一个基于集群机器分布式执行的相对较低级的编程模型。高级查询语言如SQL可以作为MapReduce操作的流水线来实现(见第10章),但也有许多分布式的SQL实现不使用MapReduce。需要注意的是,SQL并不局限于单机运行,而MapReduce也不是唯一的分布式查询执行方案。
Being able to use JavaScript code in the middle of a query is a great feature for advanced queries, but it’s not limited to MapReduce—some SQL databases can be extended with JavaScript functions too [ 34 ].
能够在查询中间使用JavaScript代码,对高级查询来说是一个很棒的功能,但这并不限于MapReduce——一些SQL数据库也可以用JavaScript函数进行扩展[34]。
A usability problem with MapReduce is that you have to write two carefully coordinated JavaScript functions, which is often harder than writing a single query. Moreover, a declarative query language offers more opportunities for a query optimizer to improve the performance of a query. For these reasons, MongoDB 2.2 added support for a declarative query language called the aggregation pipeline [ 9 ]. In this language, the same shark-counting query looks like this:
MapReduce存在的一个可用性问题是你需要编写两个精心协调的JavaScript函数,通常比编写单个查询更困难。此外,声明性查询语言提供了更多的机会让查询优化器提高查询性能。出于这些原因,MongoDB 2.2添加了支持称为聚合管道的声明性查询语言[9]。在这种语言中,同样的鲨鱼计数查询看起来像这样:
db.observations.aggregate([
    { $match: { family: "Sharks" } },
    { $group: {
        _id: {
            year:  { $year:  "$observationTimestamp" },
            month: { $month: "$observationTimestamp" }
        },
        totalAnimals: { $sum: "$numAnimals" }
    } }
]);
The aggregation pipeline language is similar in expressiveness to a subset of SQL, but it uses a JSON-based syntax rather than SQL’s English-sentence-style syntax; the difference is perhaps a matter of taste. The moral of the story is that a NoSQL system may find itself accidentally reinventing SQL, albeit in disguise.
聚合管道语言的表达能力与SQL的一个子集类似,但它使用基于JSON的语法,而不是SQL的英语句子式语法;这种差异也许只是口味问题。这个故事的寓意是:NoSQL系统可能会发现自己在不经意间重新发明了SQL,尽管是以伪装的形式。
Graph-Like Data Models
We saw earlier that many-to-many relationships are an important distinguishing feature between different data models. If your application has mostly one-to-many relationships (tree-structured data) or no relationships between records, the document model is appropriate.
我们之前看到,多对多关系是不同数据模型之间的一个重要区别特征。如果你的应用程序主要有一对多关系(树形结构数据)或记录之间没有关系,那么文档模型是适合的。
But what if many-to-many relationships are very common in your data? The relational model can handle simple cases of many-to-many relationships, but as the connections within your data become more complex, it becomes more natural to start modeling your data as a graph.
如果你的数据中有很多多对多的关系会怎样?关系型模型可以处理简单的多对多关系,但是随着你的数据内部连接变得更加复杂,建模你的数据成为一个图形就更加自然了。
A graph consists of two kinds of objects: vertices (also known as nodes or entities ) and edges (also known as relationships or arcs ). Many kinds of data can be modeled as a graph. Typical examples include:
一个图表由两种对象组成:顶点(也称为节点或实体)和边(也称为关系或弧)。许多种数据可以建模为图。典型的例子包括:
- Social graphs
-
Vertices are people, and edges indicate which people know each other.
顶点是人,边表示哪些人彼此认识。
- The web graph
-
Vertices are web pages, and edges indicate HTML links to other pages.
顶点是网页,而边表示 HTML 链接到其他页面。
- Road or rail networks
-
Vertices are junctions, and edges represent the roads or railway lines between them.
顶点是交汇点,边代表它们之间的公路或铁路线。
Well-known algorithms can operate on these graphs: for example, car navigation systems search for the shortest path between two points in a road network, and PageRank can be used on the web graph to determine the popularity of a web page and thus its ranking in search results.
著名算法可以在这些图上运作:例如,汽车导航系统在道路网络中寻找两个点之间的最短路径,而PageRank可以在Web图上用于确定网页的受欢迎程度,从而确定其在搜索结果中的排名。
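The shortest-path idea can be sketched with a breadth-first search on a toy unweighted road network. Real navigation systems use weighted algorithms such as Dijkstra's or A*; the graph below is a made-up example:

```javascript
function shortestPath(graph, start, goal) {
    var queue = [[start]]; // queue of paths, shortest first
    var seen = {};
    seen[start] = true;
    while (queue.length > 0) {
        var path = queue.shift();
        var node = path[path.length - 1];
        if (node === goal) return path;
        (graph[node] || []).forEach(function (next) {
            if (!seen[next]) {
                seen[next] = true;
                queue.push(path.concat([next]));
            }
        });
    }
    return null; // goal unreachable
}

// Adjacency list for a tiny road network of four junctions.
var roads = { A: ["B", "C"], B: ["D"], C: ["D"], D: [] };
console.log(shortestPath(roads, "A", "D")); // [ 'A', 'B', 'D' ]
```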
In the examples just given, all the vertices in a graph represent the same kind of thing (people, web pages, or road junctions, respectively). However, graphs are not limited to such homogeneous data: an equally powerful use of graphs is to provide a consistent way of storing completely different types of objects in a single datastore. For example, Facebook maintains a single graph with many different types of vertices and edges: vertices represent people, locations, events, checkins, and comments made by users; edges indicate which people are friends with each other, which checkin happened in which location, who commented on which post, who attended which event, and so on [ 35 ].
在刚才的例子中,图中的所有顶点代表的是同一种类型的事物(人、网页或者路口)。然而,图并不限于这样的同质数据:使用图的同样强大的方式是在一个单一的数据存储中提供一种一致的方式来存储完全不同类型的对象。例如,Facebook维护一个包含多种不同类型的顶点和边的单个图:顶点代表人、位置、事件、签到和用户做出的评论;边表示哪些人是彼此的朋友,哪个位置发生了哪个签到,谁评论了哪篇帖子,谁参加了哪个活动等等[35]。
In this section we will use the example shown in Figure 2-5 . It could be taken from a social network or a genealogical database: it shows two people, Lucy from Idaho and Alain from Beaune, France. They are married and living in London.
在这个部分,我们将使用图2-5中展示的例子。它可以来自社交网络或家谱数据库:它展示了两个人,来自爱达荷州的露西和来自法国博纳的阿兰。他们结婚并住在伦敦。
There are several different, but related, ways of structuring and querying data in graphs. In this section we will discuss the property graph model (implemented by Neo4j, Titan, and InfiniteGraph) and the triple-store model (implemented by Datomic, AllegroGraph, and others). We will look at three declarative query languages for graphs: Cypher, SPARQL, and Datalog. Besides these, there are also imperative graph query languages such as Gremlin [ 36 ] and graph processing frameworks like Pregel (see Chapter 10 ).
有几种不同但相关的方式来在图中对数据进行结构化和查询。在本节中,我们将讨论属性图模型(由Neo4j、Titan和InfiniteGraph实现)和三元组存储模型(由Datomic、AllegroGraph等实现)。我们将看一下三种图形的声明性查询语言:Cypher、SPARQL和Datalog。除此之外,还有类似Gremlin的命令式图形查询语言以及Pregel等图形处理框架(请参见第10章)。
Property Graphs
In the property graph model, each vertex consists of:
在属性图模型中,每个顶点由以下内容组成:
-
A unique identifier
一个唯一标识符
-
A set of outgoing edges
出边集合
-
A set of incoming edges
一组入边
-
A collection of properties (key-value pairs)
一组属性(键值对)
Each edge consists of:
每条边由以下内容组成:
-
A unique identifier
一个唯一标识符
-
The vertex at which the edge starts (the tail vertex )
边起始的顶点(尾部顶点)
-
The vertex at which the edge ends (the head vertex )
边终止的顶点(头部顶点)
-
A label to describe the kind of relationship between the two vertices
用于描述两个顶点之间关系类型的标签
-
A collection of properties (key-value pairs)
一组属性(键值对)
You can think of a graph store as consisting of two relational tables, one for vertices and one for edges, as shown in Example 2-2 (this schema uses the PostgreSQL json datatype to store the properties of each vertex or edge). The head and tail vertex are stored for each edge; if you want the set of incoming or outgoing edges for a vertex, you can query the edges table by head_vertex or tail_vertex, respectively.
你可以将图存储看作由两个关系表组成,一个存储顶点,一个存储边,如示例2-2所示(此模式使用PostgreSQL的json数据类型来存储每个顶点或边的属性)。每条边都存储了头部顶点和尾部顶点;如果你想要某个顶点的入边或出边集合,可以分别通过head_vertex或tail_vertex查询edges表。
Example 2-2. Representing a property graph using a relational schema
CREATE TABLE vertices (
    vertex_id  integer PRIMARY KEY,
    properties json
);

CREATE TABLE edges (
    edge_id     integer PRIMARY KEY,
    tail_vertex integer REFERENCES vertices (vertex_id),
    head_vertex integer REFERENCES vertices (vertex_id),
    label       text,
    properties  json
);

CREATE INDEX edges_tails ON edges (tail_vertex);
CREATE INDEX edges_heads ON edges (head_vertex);
Some important aspects of this model are:
这个模型的一些重要方面是:
-
Any vertex can have an edge connecting it with any other vertex. There is no schema that restricts which kinds of things can or cannot be associated.
任何顶点都可以与任何其他顶点连接一条边。没有任何模式限制哪些物品可以或不能关联。
-
Given any vertex, you can efficiently find both its incoming and its outgoing edges, and thus traverse the graph—i.e., follow a path through a chain of vertices—both forward and backward. (That’s why Example 2-2 has indexes on both the tail_vertex and head_vertex columns.)
给定任何一个顶点,你都可以高效地找到它的入边和出边,从而在图中遍历,即沿着一条顶点链向前和向后移动。(这就是示例2-2在tail_vertex和head_vertex列上都建有索引的原因。)
-
By using different labels for different kinds of relationships, you can store several different kinds of information in a single graph, while still maintaining a clean data model.
通过为不同类型的关系使用不同的标签,您可以在单个图中存储多种不同类型的信息,同时仍然保持清晰的数据模型。
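The structure just listed can be sketched as a tiny in-memory property graph. This is a toy illustration only; a real graph database indexes these lookups rather than scanning a flat edge list:

```javascript
var vertices = {};
var edges = [];

function addVertex(id, properties) {
    vertices[id] = { id: id, properties: properties };
}
function addEdge(tailVertex, headVertex, label, properties) {
    edges.push({ tail: tailVertex, head: headVertex,
                 label: label, properties: properties || {} });
}
// Outgoing edges are those whose tail is the given vertex;
// incoming edges are those whose head is the given vertex.
function outgoing(id) { return edges.filter(function (e) { return e.tail === id; }); }
function incoming(id) { return edges.filter(function (e) { return e.head === id; }); }

addVertex("idaho", { name: "Idaho", type: "state" });
addVertex("usa",   { name: "United States", type: "country" });
addEdge("idaho", "usa", "WITHIN");

console.log(outgoing("idaho")[0].label); // "WITHIN"
console.log(incoming("usa").length);     // 1
```

Traversal in either direction is just a lookup by tail or head, which is exactly what the two indexes in Example 2-2 make efficient.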
Those features give graphs a great deal of flexibility for data modeling, as illustrated in Figure 2-5 . The figure shows a few things that would be difficult to express in a traditional relational schema, such as different kinds of regional structures in different countries (France has départements and régions , whereas the US has counties and states ), quirks of history such as a country within a country (ignoring for now the intricacies of sovereign states and nations), and varying granularity of data (Lucy’s current residence is specified as a city, whereas her place of birth is specified only at the level of a state).
这些特性赋予图形数据建模以极大的灵活性,如图2-5所示。该图显示了一些传统关系模式难以表达的东西,例如不同国家的不同地区结构(法国有départements和régions,而美国有郡和州),历史上的奇怪现象,如国家内的国家(暂时忽略主权国家和民族的复杂性),以及数据的变化粒度(Lucy的目前居住地指定为城市,而她的出生地仅在州一级指定)。
You could imagine extending the graph to also include many other facts about Lucy and Alain, or other people. For instance, you could use it to indicate any food allergies they have (by introducing a vertex for each allergen, and an edge between a person and an allergen to indicate an allergy), and link the allergens with a set of vertices that show which foods contain which substances. Then you could write a query to find out what is safe for each person to eat. Graphs are good for evolvability: as you add features to your application, a graph can easily be extended to accommodate changes in your application’s data structures.
你可以想象将该图扩展以包括有关Lucy和Alain或其他人的许多其他事实。例如,您可以使用它来指示他们有哪些食物过敏症(通过为每种过敏原引入一个顶点,并在人和过敏原之间添加一条边来表示过敏症),并使用一组显示哪些食物含有哪些物质的顶点将过敏原连接起来。然后,您可以编写查询以查找每个人可以安全食用的食物。 图表很适合扩展性:随着您向应用程序添加功能,图表可以轻松扩展以适应应用程序数据结构的变化。
The Cypher Query Language
Cypher is a declarative query language for property graphs, created for the Neo4j graph database [ 37 ]. (It is named after a character in the movie The Matrix and is not related to ciphers in cryptography [ 38 ].)
Cypher是一种面向属性图的声明式查询语言,为Neo4j图数据库创建[37]。 (它的名字来源于电影《黑客帝国》中的一个角色,并与密码学中的加密术语[38]无关。)
Example 2-3 shows the Cypher query to insert the lefthand portion of Figure 2-5 into a graph database. The rest of the graph can be added similarly and is omitted for readability. Each vertex is given a symbolic name like USA or Idaho, and other parts of the query can use those names to create edges between the vertices, using an arrow notation: (Idaho) -[:WITHIN]-> (USA) creates an edge labeled WITHIN, with Idaho as the tail node and USA as the head node.
示例2-3展示了将图2-5左侧部分插入图数据库的Cypher查询。图的其余部分可以类似地添加,为了便于阅读在此省略。每个顶点被赋予一个符号名称,如USA或Idaho,查询的其他部分可以使用这些名称在顶点之间创建边,使用箭头表示法:(Idaho) -[:WITHIN]-> (USA)创建一条标记为WITHIN的边,其中Idaho作为尾节点,USA作为头节点。
Example 2-3. A subset of the data in Figure 2-5 , represented as a Cypher query
CREATE
    (NAmerica:Location {name:'North America', type:'continent'}),
    (USA:Location      {name:'United States', type:'country'}),
    (Idaho:Location    {name:'Idaho',         type:'state'}),
    (Lucy:Person       {name:'Lucy'}),
    (Idaho) -[:WITHIN]->  (USA) -[:WITHIN]-> (NAmerica),
    (Lucy)  -[:BORN_IN]-> (Idaho)
When all the vertices and edges of Figure 2-5 are added to the database, we can start asking interesting questions: for example, find the names of all the people who emigrated from the United States to Europe. To be more precise, here we want to find all the vertices that have a BORN_IN edge to a location within the US, and also a LIVING_IN edge to a location within Europe, and return the name property of each of those vertices.
当图2-5的所有顶点和边都添加到数据库后,我们就可以开始提出有趣的问题了:例如,查找所有从美国移民到欧洲的人的名字。更准确地说,这里我们要找到所有既有一条BORN_IN边指向美国境内某位置,又有一条LIVING_IN边指向欧洲境内某位置的顶点,并返回每个这样的顶点的name属性。
Example 2-4 shows how to express that query in Cypher. The same arrow notation is used in a MATCH clause to find patterns in the graph: (person) -[:BORN_IN]-> () matches any two vertices that are related by an edge labeled BORN_IN. The tail vertex of that edge is bound to the variable person, and the head vertex is left unnamed.
示例2-4展示了如何用Cypher表达该查询。在MATCH子句中使用相同的箭头记法来查找图中的模式:(person) -[:BORN_IN]-> () 匹配由一条标记为BORN_IN的边关联起来的任意两个顶点。该边的尾顶点绑定到变量person,头顶点则不命名。
Example 2-4. Cypher query to find people who emigrated from the US to Europe
MATCH
  (person) -[:BORN_IN]->  () -[:WITHIN*0..]-> (us:Location {name:'United States'}),
  (person) -[:LIVES_IN]-> () -[:WITHIN*0..]-> (eu:Location {name:'Europe'})
RETURN person.name
The query can be read as follows:
该查询可以理解为:
Find any vertex (call it person) that meets both of the following conditions:
寻找满足以下两个条件的任意顶点(称之为person):

- person has an outgoing BORN_IN edge to some vertex. From that vertex, you can follow a chain of outgoing WITHIN edges until eventually you reach a vertex of type Location, whose name property is equal to "United States".
  person有一条出向的BORN_IN边指向某个顶点。从那个顶点出发,可以沿着一连串出向的WITHIN边,最终到达一个类型为Location、name属性等于"United States"的顶点。

- That same person vertex also has an outgoing LIVES_IN edge. Following that edge, and then a chain of outgoing WITHIN edges, you eventually reach a vertex of type Location, whose name property is equal to "Europe".
  同一个person顶点还有一条出向的LIVES_IN边。沿着这条边,再沿着一连串出向的WITHIN边,最终会到达一个类型为Location、name属性等于"Europe"的顶点。

For each such person vertex, return the name property.
对每个这样的person顶点,返回其name属性。
There are several possible ways of executing the query. The description given here suggests that you start by scanning all the people in the database, examine each person’s birthplace and residence, and return only those people who meet the criteria.
有几种可能的执行查询的方式。这里给出的描述建议您首先扫描数据库中的所有人员,检查每个人的出生地和居住地,只返回符合条件的人员。
But equivalently, you could start with the two Location vertices and work backward. If there is an index on the name property, you can probably efficiently find the two vertices representing the US and Europe. Then you can proceed to find all locations (states, regions, cities, etc.) in the US and Europe respectively by following all incoming WITHIN edges. Finally, you can look for people who can be found through an incoming BORN_IN or LIVES_IN edge at one of the location vertices.
但同样地,你也可以从两个Location顶点开始反向查找。如果name属性上有索引,你大概可以高效地找到代表美国和欧洲的两个顶点。然后,你可以沿着所有入向的WITHIN边,分别找到美国和欧洲的所有位置(州、地区、城市等)。最后,你可以在这些位置顶点中,通过入向的BORN_IN或LIVES_IN边找到相应的人。
As is typical for a declarative query language, you don’t need to specify such execution details when writing the query: the query optimizer automatically chooses the strategy that is predicted to be the most efficient, so you can get on with writing the rest of your application.
作为声明式查询语言的典型特点,写查询时无需指定执行细节:查询优化器会自动选择预测为最有效的策略,因此您可以继续编写应用程序的其他部分。
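The backward execution strategy just described can be sketched against a toy in-memory graph. This is only an illustration: the adjacency layout, vertex names, and helper function below are assumptions of this sketch, not how Neo4j actually stores or executes anything.
上面描述的反向执行策略可以用一个小型内存图来勾勒。以下只是示意:邻接结构、顶点名称和辅助函数都是本示意的假设,并非Neo4j的实际存储或执行方式。

```python
# Toy property graph: a dict of vertices and a list of (tail, label, head) edges.
vertices = {
    "namerica": {"type": "Location", "name": "North America"},
    "usa":      {"type": "Location", "name": "United States"},
    "idaho":    {"type": "Location", "name": "Idaho"},
    "europe":   {"type": "Location", "name": "Europe"},
    "england":  {"type": "Location", "name": "England"},
    "london":   {"type": "Location", "name": "London"},
    "lucy":     {"type": "Person",   "name": "Lucy"},
}
edges = [
    ("idaho", "WITHIN", "usa"), ("usa", "WITHIN", "namerica"),
    ("london", "WITHIN", "england"), ("england", "WITHIN", "europe"),
    ("lucy", "BORN_IN", "idaho"), ("lucy", "LIVES_IN", "london"),
]

def locations_within(name):
    """Work backward: start at the named vertex and repeatedly follow
    incoming WITHIN edges, collecting every contained location."""
    start = next(v for v, props in vertices.items() if props["name"] == name)
    result = {start}            # *0.. means zero or more hops, so include start
    frontier = [start]
    while frontier:
        here = frontier.pop()
        for tail, label, head in edges:
            if label == "WITHIN" and head == here and tail not in result:
                result.add(tail)
                frontier.append(tail)
    return result

in_usa = locations_within("United States")
in_europe = locations_within("Europe")

# Finally, keep only people with a BORN_IN edge into in_usa
# and a LIVES_IN edge into in_europe.
emigrants = [
    vertices[p]["name"]
    for p in vertices
    if any(t == p and l == "BORN_IN" and h in in_usa for t, l, h in edges)
    and any(t == p and l == "LIVES_IN" and h in in_europe for t, l, h in edges)
]
print(emigrants)  # ['Lucy']
```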
Graph Queries in SQL
Example 2-2 suggested that graph data can be represented in a relational database. But if we put graph data in a relational structure, can we also query it using SQL?
示例2-2指出图形数据可以在关系型数据库中表示。但是如果将图形数据放入关系结构中,我们是否也可以使用SQL查询它?
The answer is yes, but with some difficulty. In a relational database, you usually know in advance which joins you need in your query. In a graph query, you may need to traverse a variable number of edges before you find the vertex you’re looking for—that is, the number of joins is not fixed in advance.
答案是肯定的,但有一些困难。在关系数据库中,你通常预先知道查询需要哪些联接。而在图查询中,你可能需要遍历可变数量的边,才能找到你要查找的顶点——也就是说,联接的数量不是事先固定的。
In our example, that happens in the () -[:WITHIN*0..]-> () rule in the Cypher query. A person’s LIVES_IN edge may point at any kind of location: a street, a city, a district, a region, a state, etc. A city may be WITHIN a region, a region WITHIN a state, a state WITHIN a country, etc. The LIVES_IN edge may point directly at the location vertex you’re looking for, or it may be several levels removed in the location hierarchy.
在我们的示例中,这发生在Cypher查询的 () -[:WITHIN*0..]-> () 规则中。一个人的LIVES_IN边可能指向任何类型的位置:街道、城市、区、地区、州等。一个城市可能位于(WITHIN)一个地区之内,一个地区位于一个州之内,一个州位于一个国家之内,等等。LIVES_IN边可能直接指向你要查找的位置顶点,也可能在位置层级中相隔好几层。
In Cypher, :WITHIN*0.. expresses that fact very concisely: it means “follow a WITHIN edge, zero or more times.” It is like the * operator in a regular expression.
在Cypher中,:WITHIN*0.. 非常简洁地表达了这一事实:它的意思是"沿着WITHIN边走零次或多次"。它就像正则表达式中的*运算符。
Since SQL:1999, this idea of variable-length traversal paths in a query can be expressed using something called recursive common table expressions (the WITH RECURSIVE syntax). Example 2-5 shows the same query—finding the names of people who emigrated from the US to Europe—expressed in SQL using this technique (supported in PostgreSQL, IBM DB2, Oracle, and SQL Server). However, the syntax is very clumsy in comparison to Cypher.
自SQL:1999起,查询中可变长度遍历路径的这种思想可以用所谓的递归公用表表达式(即WITH RECURSIVE语法)来表达。示例2-5展示了用这种技术以SQL表达的同一个查询——找出从美国移民到欧洲的人的名字(PostgreSQL、IBM DB2、Oracle和SQL Server都支持该语法)。不过,与Cypher相比,这种语法非常笨拙。
Example 2-5. The same query as Example 2-4 , expressed in SQL using recursive common table expressions
WITH RECURSIVE

  -- in_usa is the set of vertex IDs of all locations within the United States
  in_usa(vertex_id) AS (
      SELECT vertex_id FROM vertices WHERE properties->>'name' = 'United States'
    UNION
      SELECT edges.tail_vertex FROM edges
        JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
        WHERE edges.label = 'within'
  ),

  -- in_europe is the set of vertex IDs of all locations within Europe
  in_europe(vertex_id) AS (
      SELECT vertex_id FROM vertices WHERE properties->>'name' = 'Europe'
    UNION
      SELECT edges.tail_vertex FROM edges
        JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
        WHERE edges.label = 'within'
  ),

  -- born_in_usa is the set of vertex IDs of all people born in the US
  born_in_usa(vertex_id) AS (
    SELECT edges.tail_vertex FROM edges
      JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
      WHERE edges.label = 'born_in'
  ),

  -- lives_in_europe is the set of vertex IDs of all people living in Europe
  lives_in_europe(vertex_id) AS (
    SELECT edges.tail_vertex FROM edges
      JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
      WHERE edges.label = 'lives_in'
  )

SELECT vertices.properties->>'name'
FROM vertices
-- join to find those people who were both born in the US *and* live in Europe
JOIN born_in_usa     ON vertices.vertex_id = born_in_usa.vertex_id
JOIN lives_in_europe ON vertices.vertex_id = lives_in_europe.vertex_id;
- First find the vertex whose name property has the value "United States", and make it the first element of the set of vertices in_usa.
  首先找到name属性值为"United States"的顶点,并将它作为顶点集合in_usa的第一个元素。

- Follow all incoming within edges from vertices in the set in_usa, and add them to the same set, until all incoming within edges have been visited.
  沿着集合in_usa中顶点的所有入向within边,将其加入同一集合,直到所有入向within边都被访问过。

- Do the same starting with the vertex whose name property has the value "Europe", and build up the set of vertices in_europe.
  从name属性值为"Europe"的顶点开始执行同样的操作,构建顶点集合in_europe。

- For each of the vertices in the set in_usa, follow incoming born_in edges to find people who were born in some place within the United States.
  对集合in_usa中的每个顶点,沿着入向的born_in边,找到出生在美国境内某地的人。

- Similarly, for each of the vertices in the set in_europe, follow incoming lives_in edges to find people who live in Europe.
  类似地,对集合in_europe中的每个顶点,沿着入向的lives_in边,找到住在欧洲的人。

- Finally, intersect the set of people born in the USA with the set of people living in Europe, by joining them.
  最后,通过联接,将在美国出生的人的集合与住在欧洲的人的集合求交集。
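The steps above can be mirrored by a short fixpoint loop over rows shaped like the vertices and edges tables of Example 2-2. The concrete rows are invented for illustration; the loop stops when the UNION contributes no new rows, just as WITH RECURSIVE does.
上面的步骤可以用一个简短的不动点循环来模拟,数据行的形状仿照示例2-2中的vertices和edges表。具体数据行是为说明而虚构的;当UNION不再产生新行时循环停止,与WITH RECURSIVE的行为一致。

```python
# Rows shaped like the relational representation: vertices keyed by ID,
# and edges as (tail_vertex, head_vertex, label) rows.
vertices = {1: "North America", 2: "United States", 3: "Idaho",
            4: "Europe", 5: "England", 6: "London", 7: "Lucy"}
edges = [
    (3, 2, "within"), (2, 1, "within"),
    (6, 5, "within"), (5, 4, "within"),
    (7, 3, "born_in"), (7, 6, "lives_in"),
]

def recursive_cte(seed_name, label="within"):
    """Base case: the named vertex. Recursive case: add the tail of every
    matching edge whose head is already in the set, repeating until the
    UNION adds no new rows (a fixpoint, like WITH RECURSIVE)."""
    result = {vid for vid, name in vertices.items() if name == seed_name}
    while True:
        new = {tail for tail, head, lbl in edges
               if lbl == label and head in result} - result
        if not new:
            return result
        result |= new

in_usa = recursive_cte("United States")
in_europe = recursive_cte("Europe")
born_in_usa = {t for t, h, l in edges if l == "born_in" and h in in_usa}
lives_in_europe = {t for t, h, l in edges if l == "lives_in" and h in in_europe}

# The final join is a set intersection of the two sets of people.
result = [vertices[v] for v in born_in_usa & lives_in_europe]
print(result)  # ['Lucy']
```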
If the same query can be written in 4 lines in one query language but requires 29 lines in another, that just shows that different data models are designed to satisfy different use cases. It’s important to pick a data model that is suitable for your application.
如果同样的查询在一种查询语言中可以写成4行,而在另一种语言中需要29行,那只说明不同的数据模型是为满足不同的用例而设计的。选择适合您应用的数据模型非常重要。
Triple-Stores and SPARQL
The triple-store model is mostly equivalent to the property graph model, using different words to describe the same ideas. It is nevertheless worth discussing, because there are various tools and languages for triple-stores that can be valuable additions to your toolbox for building applications.
三元组存储模型与属性图模型大体上是等价的,只是用不同的词语来描述相同的思想。不过它仍然值得讨论,因为针对三元组存储有各种工具和语言,它们可以成为你构建应用程序的工具箱中有价值的补充。
In a triple-store, all information is stored in the form of very simple three-part statements: ( subject , predicate , object ). For example, in the triple ( Jim , likes , bananas ), Jim is the subject, likes is the predicate (verb), and bananas is the object.
在三元组存储中,所有信息都以非常简单的三部分语句(主语,谓语,宾语)的形式存储。例如,在三元组(Jim,likes,bananas)中,Jim是主语,likes是谓语(动词),bananas是宾语。
The subject of a triple is equivalent to a vertex in a graph. The object is one of two things:
一个三元组的主语相当于图中的一个顶点。宾语则是两种情况之一:
- A value in a primitive datatype, such as a string or a number. In that case, the predicate and object of the triple are equivalent to the key and value of a property on the subject vertex. For example, (lucy, age, 33) is like a vertex lucy with properties {"age":33}.
  一个原始数据类型的值,例如字符串或数字。在这种情况下,三元组的谓语和宾语相当于主语顶点上一个属性的键和值。例如,(lucy, age, 33) 就像一个具有属性 {"age":33} 的顶点lucy。
-
Another vertex in the graph. In that case, the predicate is an edge in the graph, the subject is the tail vertex, and the object is the head vertex. For example, in ( lucy , marriedTo , alain ) the subject and object lucy and alain are both vertices, and the predicate marriedTo is the label of the edge that connects them.
图形中的另一个顶点。在这种情况下,谓词是图形中的一条边,主语是尾顶点,宾语是头顶点。例如,在(lucy,marriedTo,alain)中,主语和宾语lucy和alain都是顶点,谓词marriedTo是连接它们的边的标签。
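These two cases can be sketched with plain tuples. The "_:" prefix marking vertices is borrowed from Turtle's blank-node notation used below; the sample triples themselves are hypothetical.
这两种情况可以用简单的元组来勾勒。标记顶点的"_:"前缀借用自下文Turtle的空白节点记法;示例三元组本身是假设的。

```python
# Every fact is a (subject, predicate, object) tuple. Whether a predicate
# behaves as a property or as an edge depends only on what the object is.
triples = [
    ("_:lucy", "age", 33),               # property: object is a primitive value
    ("_:lucy", "marriedTo", "_:alain"),  # edge: object is another vertex
    ("_:lucy", "bornIn", "_:idaho"),
    ("_:idaho", "name", "Idaho"),
]

def is_vertex(obj):
    # Convention for this sketch: strings prefixed "_:" denote vertices.
    return isinstance(obj, str) and obj.startswith("_:")

def properties(subject):
    """Predicate/object pairs whose object is a primitive value behave
    like the key-value properties of the subject vertex."""
    return {p: o for s, p, o in triples if s == subject and not is_vertex(o)}

def out_edges(subject):
    """Predicates whose object is another vertex behave like labeled edges,
    with the subject as tail and the object as head."""
    return [(p, o) for s, p, o in triples if s == subject and is_vertex(o)]

print(properties("_:lucy"))  # {'age': 33}
print(out_edges("_:lucy"))   # [('marriedTo', '_:alain'), ('bornIn', '_:idaho')]
```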
Example 2-6 shows the same data as in Example 2-3 , written as triples in a format called Turtle , a subset of Notation3 ( N3 ) [ 39 ].
示例2-6展示了与示例2-3相同的数据,以称为Turtle的格式编写的三元组的形式表示,它是Notation3(N3)的子集。[39]。
Example 2-6. A subset of the data in Figure 2-5 , represented as Turtle triples
@prefix : <urn:example:>.
_:lucy     a       :Person.
_:lucy     :name   "Lucy".
_:lucy     :bornIn _:idaho.
_:idaho    a       :Location.
_:idaho    :name   "Idaho".
_:idaho    :type   "state".
_:idaho    :within _:usa.
_:usa      a       :Location.
_:usa      :name   "United States".
_:usa      :type   "country".
_:usa      :within _:namerica.
_:namerica a       :Location.
_:namerica :name   "North America".
_:namerica :type   "continent".
In this example, vertices of the graph are written as _:someName. The name doesn’t mean anything outside of this file; it exists only because we otherwise wouldn’t know which triples refer to the same vertex. When the predicate represents an edge, the object is a vertex, as in _:idaho :within _:usa. When the predicate is a property, the object is a string literal, as in _:usa :name "United States".
在这个例子中,图的顶点被写成 _:someName。这个名字在文件之外没有任何意义;它之所以存在,只是因为否则我们就无法知道哪些三元组指的是同一个顶点。当谓语表示边时,宾语是一个顶点,例如 _:idaho :within _:usa。当谓语是属性时,宾语是一个字符串字面量,例如 _:usa :name "United States"。
It’s quite repetitive to repeat the same subject over and over again, but fortunately you can use semicolons to say multiple things about the same subject. This makes the Turtle format quite nice and readable: see Example 2-7 .
一遍又一遍地重复同一个主语相当繁琐,但幸运的是,你可以用分号来对同一个主语陈述多件事情。这使得Turtle格式相当紧凑易读:见示例2-7。
Example 2-7. A more concise way of writing the data in Example 2-6
@prefix : <urn:example:>.
_:lucy     a :Person;   :name "Lucy";          :bornIn _:idaho.
_:idaho    a :Location; :name "Idaho";         :type "state";   :within _:usa.
_:usa      a :Location; :name "United States"; :type "country"; :within _:namerica.
_:namerica a :Location; :name "North America"; :type "continent".
The semantic web
If you read more about triple-stores, you may get sucked into a maelstrom of articles written about the semantic web . The triple-store data model is completely independent of the semantic web—for example, Datomic [ 40 ] is a triple-store that does not claim to have anything to do with it. vii But since the two are so closely linked in many people’s minds, we should discuss them briefly.
如果你更多地了解三元存储,你可能会陷入一片关于语义网的文章漩涡中。三元存储数据模型完全独立于语义网,例如,Datomic 是一个三元存储,它并不声称与语义网有任何关系。但由于许多人将两者联系得如此之紧密,我们应该简要讨论一下它们。
The semantic web is fundamentally a simple and reasonable idea: websites already publish information as text and pictures for humans to read, so why don’t they also publish information as machine-readable data for computers to read? The Resource Description Framework (RDF) [ 41 ] was intended as a mechanism for different websites to publish data in a consistent format, allowing data from different websites to be automatically combined into a web of data —a kind of internet-wide “database of everything.”
语义网基本上是一个简单而合理的想法:网站已经发布了文本和图片供人类阅读,那么为什么它们不也将信息发布为机器可读的数据供计算机阅读?资源描述框架(RDF)旨在成为不同网站以一致的格式发布数据的机制,允许来自不同网站的数据自动组合成数据网络 - 种互联网范围内的“万物数据库”。
Unfortunately, the semantic web was overhyped in the early 2000s but so far hasn’t shown any sign of being realized in practice, which has made many people cynical about it. It has also suffered from a dizzying plethora of acronyms, overly complex standards proposals, and hubris.
不幸的是,语义网在21世纪初被过度炒作,但至今尚无在实践中落地的迹象,这使许多人对它持怀疑态度。它还饱受令人眼花缭乱的缩写词、过于复杂的标准提案以及傲慢自大之苦。
However, if you look past those failings, there is also a lot of good work that has come out of the semantic web project. Triples can be a good internal data model for applications, even if you have no interest in publishing RDF data on the semantic web.
然而,如果你能越过这些失败之处来看,语义网项目也产出了很多出色的工作。即使你无意在语义网上发布RDF数据,三元组也可以成为应用程序的良好内部数据模型。
The RDF data model
The Turtle language we used in Example 2-7 is a human-readable format for RDF data. Sometimes RDF is also written in an XML format, which does the same thing much more verbosely—see Example 2-8 . Turtle/N3 is preferable as it is much easier on the eyes, and tools like Apache Jena [ 42 ] can automatically convert between different RDF formats if necessary.
我们在示例2-7中使用的Turtle语言是一种人类可读的RDF数据格式。有时RDF也以XML格式书写,后者表达同样的内容要冗长得多——见示例2-8。Turtle/N3更可取,因为它看起来更舒服,而且如有必要,像Apache Jena [42]这样的工具可以在不同的RDF格式之间自动转换。
Example 2-8. The data of Example 2-7 , expressed using RDF/XML syntax
<rdf:RDF xmlns="urn:example:"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

  <Location rdf:nodeID="idaho">
    <name>Idaho</name>
    <type>state</type>
    <within>
      <Location rdf:nodeID="usa">
        <name>United States</name>
        <type>country</type>
        <within>
          <Location rdf:nodeID="namerica">
            <name>North America</name>
            <type>continent</type>
          </Location>
        </within>
      </Location>
    </within>
  </Location>

  <Person rdf:nodeID="lucy">
    <name>Lucy</name>
    <bornIn rdf:nodeID="idaho"/>
  </Person>
</rdf:RDF>
RDF has a few quirks due to the fact that it is designed for internet-wide data exchange. The subject, predicate, and object of a triple are often URIs. For example, a predicate might be a URI such as <http://my-company.com/namespace#within> or <http://my-company.com/namespace#lives_in>, rather than just WITHIN or LIVES_IN. The reasoning behind this design is that you should be able to combine your data with someone else’s data, and if they attach a different meaning to the word within or lives_in, you won’t get a conflict, because their predicates are actually <http://other.org/foo#within> and <http://other.org/foo#lives_in>.
由于RDF是为互联网范围的数据交换而设计的,它有一些怪癖。三元组的主语、谓语和宾语通常是URI。例如,谓语可能是 <http://my-company.com/namespace#within> 或 <http://my-company.com/namespace#lives_in> 这样的URI,而不仅仅是WITHIN或LIVES_IN。这种设计背后的理由是:你应该能够把自己的数据与别人的数据结合起来,即使对方给within或lives_in这个词赋予了不同的含义,也不会产生冲突,因为他们的谓语实际上是 <http://other.org/foo#within> 和 <http://other.org/foo#lives_in>。
The URL <http://my-company.com/namespace> doesn’t necessarily need to resolve to anything—from RDF’s point of view, it is simply a namespace. To avoid potential confusion with http:// URLs, the examples in this section use non-resolvable URIs such as urn:example:within. Fortunately, you can just specify this prefix once at the top of the file, and then forget about it.
从RDF的角度来看,URL <http://my-company.com/namespace> 不一定需要解析为任何内容,它只是一个命名空间。为了避免与http:// URL潜在的混淆,本节中的示例使用不可解析的URI,例如 urn:example:within。幸运的是,你只需在文件顶部指定一次这个前缀,之后就可以忘掉它。
The SPARQL query language
SPARQL is a query language for triple-stores using the RDF data model [ 43 ]. (It is an acronym for SPARQL Protocol and RDF Query Language , pronounced “sparkle.”) It predates Cypher, and since Cypher’s pattern matching is borrowed from SPARQL, they look quite similar [ 37 ].
SPARQL是使用RDF数据模型的三元存储的查询语言[43]。(它是SPARQL协议和RDF查询语言的缩写,发音为“sparkle”)。它早于Cypher,并且由于Cypher的模式匹配是从SPARQL借鉴的,它们看起来非常相似[37]。
The same query as before—finding people who have moved from the US to Europe—is even more concise in SPARQL than it is in Cypher (see Example 2-9 ).
在SPARQL中,与之前相同的查询——查找已经从美国移居到欧洲的人——甚至比Cypher更加简洁(参见示例2-9)。
Example 2-9. The same query as Example 2-4 , expressed in SPARQL
PREFIX : <urn:example:>

SELECT ?personName WHERE {
  ?person :name ?personName.
  ?person :bornIn  / :within* / :name "United States".
  ?person :livesIn / :within* / :name "Europe".
}
The structure is very similar. The following two expressions are equivalent (variables start with a question mark in SPARQL):
结构非常相似。以下两个表达式是等效的(在SPARQL中,变量以问号开头):
(person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (location)   # Cypher

?person :bornIn / :within* ?location.                   # SPARQL
Because RDF doesn’t distinguish between properties and edges but just uses predicates for both, you can use the same syntax for matching properties. In the following expression, the variable usa is bound to any vertex that has a name property whose value is the string "United States":
因为RDF不区分属性和边,而是对两者都使用谓语,所以你可以用相同的语法来匹配属性。在下面的表达式中,变量usa绑定到任何具有name属性、且其值为字符串"United States"的顶点:
(usa {name:'United States'})   # Cypher

?usa :name "United States".    # SPARQL
SPARQL is a nice query language—even if the semantic web never happens, it can be a powerful tool for applications to use internally.
SPARQL是一种很好的查询语言,即使语义网从未实现,它也可以成为应用内部使用的强大工具。
The Foundation: Datalog
Datalog is a much older language than SPARQL or Cypher, having been studied extensively by academics in the 1980s [ 44 , 45 , 46 ]. It is less well known among software engineers, but it is nevertheless important, because it provides the foundation that later query languages build upon.
Datalog比SPARQL或Cypher语言更为古老,已经在20世纪80年代被学者广泛研究过。虽然在软件工程师中不是很知名,但它仍然非常重要,因为后来的查询语言都建立在它的基础上。
In practice, Datalog is used in a few data systems: for example, it is the query language of Datomic [ 40 ], and Cascalog [ 47 ] is a Datalog implementation for querying large datasets in Hadoop. viii
在实践中,Datalog 在一些数据系统中使用:例如,它是 Datomic 的查询语言 [40],Cascalog是一个用于在 Hadoop 中查询大数据集的 Datalog 实现 [47]。
Datalog’s data model is similar to the triple-store model, generalized a bit. Instead of writing a triple as ( subject , predicate , object ), we write it as predicate ( subject , object ). Example 2-10 shows how to write the data from our example in Datalog.
Datalog的数据模型类似于三元组存储模型,但稍微推广一下。我们将三元组的写法 (主语,谓语,宾语) 转换成谓语(主语,宾语)的形式。例如,示例 2-10 展示了如何在 Datalog 中写入我们示例中的数据。
Example 2-10. A subset of the data in Figure 2-5 , represented as Datalog facts
name(namerica, 'North America').
type(namerica, continent).

name(usa, 'United States').
type(usa, country).
within(usa, namerica).

name(idaho, 'Idaho').
type(idaho, state).
within(idaho, usa).

name(lucy, 'Lucy').
born_in(lucy, idaho).
Now that we have defined the data, we can write the same query as before, as shown in Example 2-11 . It looks a bit different from the equivalent in Cypher or SPARQL, but don’t let that put you off. Datalog is a subset of Prolog, which you might have seen before if you’ve studied computer science.
现在我们已经定义了数据,我们可以像示例2-11那样编写与之前相同的查询。它看起来与Cypher或SPARQL中的等效语句有些不同,但不要让它吓到你。Datalog是Prolog的一个子集,如果你学过计算机科学,可能已经见过它了。
Example 2-11. The same query as Example 2-4 , expressed in Datalog
within_recursive(Location, Name) :- name(Location, Name).     /* Rule 1 */

within_recursive(Location, Name) :- within(Location, Via),    /* Rule 2 */
                                    within_recursive(Via, Name).

migrated(Name, BornIn, LivingIn) :- name(Person, Name),       /* Rule 3 */
                                    born_in(Person, BornLoc),
                                    within_recursive(BornLoc, BornIn),
                                    lives_in(Person, LivingLoc),
                                    within_recursive(LivingLoc, LivingIn).

?- migrated(Who, 'United States', 'Europe').
/* Who = 'Lucy'. */
Cypher and SPARQL jump in right away with SELECT, but Datalog takes a small step at a time. We define rules that tell the database about new predicates: here, we define two new predicates, within_recursive and migrated. These predicates aren’t triples stored in the database, but instead they are derived from data or from other rules. Rules can refer to other rules, just like functions can call other functions or recursively call themselves. Like this, complex queries can be built up a small piece at a time.
Cypher和SPARQL一上来就直奔SELECT,而Datalog则一次只迈一小步。我们定义规则来告诉数据库新的谓语:这里我们定义了两个新谓语,within_recursive和migrated。这些谓语不是存储在数据库中的三元组,而是从数据或其他规则派生出来的。规则可以引用其他规则,就像函数可以调用其他函数或递归调用自身一样。通过这种方式,复杂的查询可以一小块一小块地构建起来。
In rules, words that start with an uppercase letter are variables, and predicates are matched like in Cypher and SPARQL. For example, name(Location, Name) matches the triple name(namerica, 'North America') with variable bindings Location = namerica and Name = 'North America'.
在规则中,以大写字母开头的单词是变量,谓语的匹配方式与Cypher和SPARQL类似。例如,name(Location, Name) 可以匹配三元组 name(namerica, 'North America'),变量绑定为 Location = namerica 和 Name = 'North America'。
A rule applies if the system can find a match for all predicates on the righthand side of the :- operator. When the rule applies, it’s as though the lefthand side of the :- was added to the database (with variables replaced by the values they matched).
如果系统能为 :- 运算符右侧的所有谓语找到匹配,则该规则适用。当规则适用时,就好像把 :- 的左侧添加到了数据库中(变量被替换为它们所匹配的值)。
One possible way of applying the rules is thus:
一种可能的规则应用方式如下:
- name(namerica, 'North America') exists in the database, so rule 1 applies. It generates within_recursive(namerica, 'North America').
  数据库中存在 name(namerica, 'North America'),因此规则1适用。它生成 within_recursive(namerica, 'North America')。

- within(usa, namerica) exists in the database and the previous step generated within_recursive(namerica, 'North America'), so rule 2 applies. It generates within_recursive(usa, 'North America').
  数据库中存在 within(usa, namerica),并且上一步生成了 within_recursive(namerica, 'North America'),因此规则2适用。它生成 within_recursive(usa, 'North America')。

- within(idaho, usa) exists in the database and the previous step generated within_recursive(usa, 'North America'), so rule 2 applies. It generates within_recursive(idaho, 'North America').
  数据库中存在 within(idaho, usa),并且上一步生成了 within_recursive(usa, 'North America'),因此规则2适用。它生成 within_recursive(idaho, 'North America')。
By repeated application of rules 1 and 2, the within_recursive predicate can tell us all the locations in North America (or any other location name) contained in our database. This process is illustrated in Figure 2-6.
通过反复应用规则1和规则2,within_recursive谓语可以告诉我们数据库中包含的所有位于北美(或任何其他位置名称)之内的位置。这一过程在图2-6中进行了说明。
Now rule 3 can find people who were born in some location BornIn and live in some location LivingIn. By querying with BornIn = 'United States' and LivingIn = 'Europe', and leaving the person as a variable Who, we ask the Datalog system to find out which values can appear for the variable Who. So, finally we get the same answer as in the earlier Cypher and SPARQL queries.
现在,规则3可以找到出生在某个位置BornIn并住在某个位置LivingIn的人。通过以 BornIn = 'United States' 和 LivingIn = 'Europe' 进行查询,并把人留作变量Who,我们让Datalog系统找出变量Who可以取哪些值。这样,我们最终得到了与前面Cypher和SPARQL查询相同的答案。
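This derivation can be sketched as a naive bottom-up evaluator: keep applying the rules until no new facts appear. Note that the within and lives_in facts for London, England, and Europe come from Figure 2-5 and are not part of Example 2-10; they are added here so that the query has an answer.
上述推导可以用一个朴素的自底向上求值器来勾勒:不断应用规则,直到不再产生新事实。注意,关于London、England和Europe的within和lives_in事实来自图2-5,并不在示例2-10中;这里添加它们是为了让查询有答案。

```python
# Facts as (predicate, subject, object) tuples, mirroring Example 2-10
# plus the Figure 2-5 facts needed for lives_in.
facts = {
    ("within", "usa", "namerica"),
    ("within", "idaho", "usa"),
    ("within", "london", "england"),   # from Figure 2-5,
    ("within", "england", "europe"),   # not in Example 2-10
    ("name", "namerica", "North America"),
    ("name", "usa", "United States"),
    ("name", "idaho", "Idaho"),
    ("name", "europe", "Europe"),
    ("name", "england", "England"),
    ("name", "london", "London"),
    ("name", "lucy", "Lucy"),
    ("born_in", "lucy", "idaho"),
    ("lives_in", "lucy", "london"),
}

def get(pred):
    return [(a, b) for p, a, b in facts if p == pred]

def within_recursive():
    """Rules 1 and 2: a vertex is 'within' every name reachable by
    following zero or more within edges. Iterate to a fixpoint."""
    result = set(get("name"))                      # Rule 1
    while True:
        new = {(loc, name) for loc, via in get("within")
               for via2, name in result if via2 == via} - result
        if not new:
            return result
        result |= new                              # Rule 2, repeated

wr = within_recursive()

def migrated(born_in, living_in):
    """Rule 3: join born_in and lives_in facts through within_recursive."""
    names = dict(get("name"))
    return sorted(
        names[person]
        for person, born_loc in get("born_in")
        for person2, living_loc in get("lives_in")
        if person == person2
        and (born_loc, born_in) in wr
        and (living_loc, living_in) in wr
    )

print(migrated("United States", "Europe"))  # ['Lucy']
```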
The Datalog approach requires a different kind of thinking to the other query languages discussed in this chapter, but it’s a very powerful approach, because rules can be combined and reused in different queries. It’s less convenient for simple one-off queries, but it can cope better if your data is complex.
Datalog方法需要与本章中讨论的其他查询语言不同的思维方式,但它是一种非常强大的方法,因为规则可以在不同的查询中组合和重复使用。对于简单的一次性查询来说可能不太方便,但如果你的数据较为复杂,它处理起来更加从容。
Summary
Data models are a huge subject, and in this chapter we have taken a quick look at a broad variety of different models. We didn’t have space to go into all the details of each model, but hopefully the overview has been enough to whet your appetite to find out more about the model that best fits your application’s requirements.
数据模型是一个庞大的主题,在本章节中,我们简要介绍了各种不同的模型。我们没有足够的空间深入了解每个模型的所有细节,但是希望这个概述已经足够激发你对最符合你应用需求的模型的进一步了解的兴趣。
Historically, data started out being represented as one big tree (the hierarchical model), but that wasn’t good for representing many-to-many relationships, so the relational model was invented to solve that problem. More recently, developers found that some applications don’t fit well in the relational model either. New nonrelational “NoSQL” datastores have diverged in two main directions:
历史上,数据最初被表示为一棵大树(层次模型),但这不利于表示多对多关系,因此发明了关系模型来解决这个问题。最近,开发人员发现一些应用程序也不太适合关系模型。新的非关系型"NoSQL"数据存储在两个主要方向上产生了分化:
-
Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare.
文档数据库专注于自包含文档的数据使用案例,文档与文档之间的关系较少。
-
Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything.
图形数据库朝着相反的方向发展,针对潜在的任何东西都可能相互关联的使用案例。
All three models (document, relational, and graph) are widely used today, and each is good in its respective domain. One model can be emulated in terms of another model—for example, graph data can be represented in a relational database—but the result is often awkward. That’s why we have different systems for different purposes, not a single one-size-fits-all solution.
如今,这三种模型(文档、关系和图)都被广泛使用,并且各自在其适用的领域都表现良好。一种模型可以用另一种模型来模拟——例如,图数据可以在关系数据库中表示——但结果往往很别扭。这就是为什么我们针对不同的目的使用不同的系统,而不是一个放之四海而皆准的单一解决方案。
One thing that document and graph databases have in common is that they typically don’t enforce a schema for the data they store, which can make it easier to adapt applications to changing requirements. However, your application most likely still assumes that data has a certain structure; it’s just a question of whether the schema is explicit (enforced on write) or implicit (handled on read).
文档数据库和图数据库的一个共同点是,它们通常不对所存储的数据强制实施模式,这可以使应用程序更容易适应不断变化的需求。然而,你的应用程序很可能仍然假定数据具有某种结构;问题只在于模式是显式的(写入时强制执行)还是隐式的(读取时处理)。
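The schema-on-read idea can be illustrated in a few lines; the record shapes and field names here are invented.
读时模式的思想可以用几行代码来说明;这里的记录形状和字段名是虚构的。

```python
# With no schema enforced on write, the database stores whatever documents
# it is given, and the application interprets divergent shapes on read.
old_record = {"name": "Lucy", "bornIn": "Idaho"}
new_record = {"name": "Alain", "birth": {"place": "Beaune", "year": 1984}}

def birthplace(doc):
    """Handle both shapes at read time, as application code must
    when no schema was enforced on write."""
    if "bornIn" in doc:                # old, flat shape
        return doc["bornIn"]
    if "birth" in doc:                 # newer, nested shape
        return doc["birth"]["place"]
    return None                        # shape we don't recognize

places = [birthplace(d) for d in (old_record, new_record)]
print(places)  # ['Idaho', 'Beaune']
```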
Each data model comes with its own query language or framework, and we discussed several examples: SQL, MapReduce, MongoDB’s aggregation pipeline, Cypher, SPARQL, and Datalog. We also touched on CSS and XSL/XPath, which aren’t database query languages but have interesting parallels.
每个数据模型都配有其自己的查询语言或框架,我们讨论了几个例子:SQL、MapReduce、MongoDB的聚合管道、Cypher、SPARQL和Datalog。我们也提到了CSS和XSL/XPath,它们不是数据库查询语言,但有一些有趣的相似之处。
Although we have covered a lot of ground, there are still many data models left unmentioned. To give just a few brief examples:
虽然我们已经覆盖了很多领域,但还有许多数据模型未提及。只举几个简要的例子:
-
Researchers working with genome data often need to perform sequence-similarity searches , which means taking one very long string (representing a DNA molecule) and matching it against a large database of strings that are similar, but not identical. None of the databases described here can handle this kind of usage, which is why researchers have written specialized genome database software like GenBank [ 48 ].
研究基因组数据的人常常需要进行序列相似性搜索,这意味着将一个非常长的字符串(代表DNA分子)与一个大型数据库中类似但不完全相同的字符串进行匹配。这里描述的任何一个数据库都无法处理这种类型的使用,这就是为什么研究者编写了专门的基因组数据库软件,如GenBank [48]。
-
Particle physicists have been doing Big Data–style large-scale data analysis for decades, and projects like the Large Hadron Collider (LHC) now work with hundreds of petabytes! At such a scale custom solutions are required to stop the hardware cost from spiraling out of control [ 49 ].
粒子物理学家们已经进行了大规模数据分析多年,类似于大型强子对撞机等项目现在已经使用数百个拍字节!在这样的规模下,需要定制解决方案来防止硬件成本失控[49]。
-
Full-text search is arguably a kind of data model that is frequently used alongside databases. Information retrieval is a large specialist subject that we won’t cover in great detail in this book, but we’ll touch on search indexes in Chapter 3 and Part III .
全文搜索可以说是经常与数据库一起使用的一种数据模型。信息检索是一个大的专业课题,在本书中我们不会详细介绍,但我们会在第三章和第三部分中涉及搜索索引。
We have to leave it there for now. In the next chapter we will discuss some of the trade-offs that come into play when implementing the data models described in this chapter.
我们暂且就谈到这里。在下一章中,我们将讨论在实现本章所描述的数据模型时会涉及的一些权衡。
Footnotes
i A term borrowed from electronics. Every electric circuit has a certain impedance (resistance to alternating current) on its inputs and outputs. When you connect one circuit’s output to another one’s input, the power transfer across the connection is maximized if the output and input impedances of the two circuits match. An impedance mismatch can lead to signal reflections and other troubles.
电子领域常用术语。每个电路的输入和输出都有一定的阻抗(交流电阻)。当你连接一个电路的输出到另一个电路的输入时,如果两个电路的输出和输入阻抗相匹配,连接的功率传输将最大化。阻抗不匹配可能会导致信号反射和其他问题。
ii Literature on the relational model distinguishes several different normal forms, but the distinctions are of little practical interest. As a rule of thumb, if you’re duplicating values that could be stored in just one place, the schema is not normalized.
关于关系模型的文献区分了几种不同的正规形式,但这些区别在实际中没有太多的实用价值。作为一个经验法则,如果您正在复制可以只存储在一个位置的值,则模式未经归一化。
iii At the time of writing, joins are supported in RethinkDB, not supported in MongoDB, and only supported in predeclared views in CouchDB.
在撰写本文时,RethinkDB支持连接操作,MongoDB不支持,而CouchDB仅支持在预定义视图中进行连接操作。
iv Foreign key constraints allow you to restrict modifications, but such constraints are not required by the relational model. Even with constraints, joins on foreign keys are performed at query time, whereas in CODASYL, the join was effectively done at insert time.
外键约束可以限制修改,但这种约束并非关系模型所必需。即使有约束,在外键上的联接也是在查询时执行的,而在CODASYL中,联接实际上是在插入时执行的。
v Codd’s original description of the relational model [ 1 ] actually allowed something quite similar to JSON documents within a relational schema. He called it nonsimple domains . The idea was that a value in a row doesn’t have to just be a primitive datatype like a number or a string, but could also be a nested relation (table)—so you can have an arbitrarily nested tree structure as a value, much like the JSON or XML support that was added to SQL over 30 years later.
Codd最初对关系模型的描述实际上允许在关系模式中使用类似于JSON文档的东西。他称其为非简单域。其思想是行中的值不一定只是原始数据类型,比如数字或字符串,而可以是嵌套关系(表)——因此可以具有任意嵌套的树形结构作为值,就像30年后添加到SQL中的JSON或XML支持一样。
vi IMS and CODASYL both used imperative query APIs. Applications typically used COBOL code to iterate over records in the database, one record at a time [ 2 , 16 ].
VI IMS和CODASYL都使用命令式查询API。应用程序通常使用COBOL代码在数据库中迭代记录,一次处理一条记录[2,16]。
vii Technically, Datomic uses 5-tuples rather than triples; the two additional fields are metadata for versioning.
技术上来说,Datomic使用的是五元组而不是三元组;另外两个字段是用于版本控制的元数据。
viii Datomic and Cascalog use a Clojure S-expression syntax for Datalog. In the following examples we use a Prolog syntax, which is a little easier to read, but this makes no functional difference.
Datomic 和 Cascalog 使用Clojure的S表达式语法来处理Datalog。在以下示例中,我们使用Prolog语法,使其更易于阅读,但实际上功能并没有差异。
References
[ 1 ] Edgar F. Codd: “ A Relational Model of Data for Large Shared Data Banks ,” Communications of the ACM , volume 13, number 6, pages 377–387, June 1970. doi:10.1145/362384.362685
[1] Edgar F. Codd:“大型共享数据银行的数据关系模型”,ACM通讯,第13卷,第6期,1970年6月,页码377-387。DOI: 10.1145/362384.362685。
[ 2 ] Michael Stonebraker and Joseph M. Hellerstein: “ What Goes Around Comes Around ,” in Readings in Database Systems , 4th edition, MIT Press, pages 2–41, 2005. ISBN: 978-0-262-69314-1
[2] Michael Stonebraker和Joseph M. Hellerstein:"东西转了一圈又回来了",收录于《数据库系统读本》第四版,麻省理工学院出版社,2005年,第2-41页。ISBN: 978-0-262-69314-1。
[ 3 ] Pramod J. Sadalage and Martin Fowler: NoSQL Distilled . Addison-Wesley, August 2012. ISBN: 978-0-321-82662-6
[3] Pramod J. Sadalage 和 Martin Fowler:《NoSQL精粹》。Addison-Wesley出版社,2012年8月。ISBN:978-0-321-82662-6。
[ 4 ] Eric Evans: “ NoSQL: What’s in a Name? ,” blog.sym-link.com , October 30, 2009.
[4] Eric Evans:“NoSQL:名称中有什么?”,blog.sym-link.com,2009年10月30日。
[ 5 ] James Phillips: “ Surprises in Our NoSQL Adoption Survey ,” blog.couchbase.com , February 8, 2012.
[5] 詹姆斯·菲利普斯:“我们的NoSQL采用调查中的惊喜”,blog.couchbase.com,2012年2月8日。
[ 6 ] Michael Wagner: SQL/XML:2006 – Evaluierung der Standardkonformität ausgewählter Datenbanksysteme . Diplomica Verlag, Hamburg, 2010. ISBN: 978-3-836-64609-3
[6] 迈克尔·瓦格纳:SQL / XML:2006年 - 评估所选数据库系统的标准符合性。Diplomica出版社,汉堡,2010年。ISBN:978-3-836-64609-3。
[ 7 ] “ XML Data in SQL Server ,” SQL Server 2012 documentation, technet.microsoft.com , 2013.
[7] “SQL Server 中的 XML 数据”,SQL Server 2012 文档,technet.microsoft.com,2013年。
[ 8 ] “ PostgreSQL 9.3.1 Documentation ,” The PostgreSQL Global Development Group, 2013.
[8] “PostgreSQL 9.3.1 文档”, PostgreSQL全球开发小组,2013年。
[ 9 ] “ The MongoDB 2.4 Manual ,” MongoDB, Inc., 2013.
[9] “MongoDB 2.4手册”,MongoDB,Inc.,2013年。
[ 10 ] “ RethinkDB 1.11 Documentation ,” rethinkdb.com , 2013.
[10] “RethinkDB 1.11 文档”,rethinkdb.com,2013年。
[ 11 ] “ Apache CouchDB 1.6 Documentation ,” docs.couchdb.org , 2014.
[11] “Apache CouchDB 1.6 文档”,docs.couchdb.org,2014年。
[ 12 ] Lin Qiao, Kapil Surlaker, Shirshanka Das, et al.: “ On Brewing Fresh Espresso: LinkedIn’s Distributed Data Serving Platform ,” at ACM International Conference on Management of Data (SIGMOD), June 2013.
[12] Lin Qiao、Kapil Surlaker、Shirshanka Das等:“酿造新鲜的Espresso:LinkedIn的分布式数据服务平台”,ACM数据管理国际会议(SIGMOD),2013年6月。
[ 13 ] Rick Long, Mark Harrington, Robert Hain, and Geoff Nicholls: IMS Primer . IBM Redbook SG24-5352-00, IBM International Technical Support Organization, January 2000.
[13] Rick Long、Mark Harrington、Robert Hain和Geoff Nicholls:《IMS入门》,IBM红皮书SG24-5352-00,IBM国际技术支持组织,2000年1月。
[ 14 ] Stephen D. Bartlett: “ IBM’s IMS—Myths, Realities, and Opportunities ,” The Clipper Group Navigator, TCG2013015LI, July 2013.
[14] Stephen D. Bartlett: “IBM的IMS——神话、现实和机遇”,The Clipper Group Navigator,TCG2013015LI,2013年7月。
[ 15 ] Sarah Mei: “ Why You Should Never Use MongoDB ,” sarahmei.com , November 11, 2013.
[15] Sarah Mei:“为什么你永远不应该使用MongoDB”,sarahmei.com,2013年11月11日。
[ 16 ] J. S. Knowles and D. M. R. Bell: “The CODASYL Model,” in Databases—Role and Structure: An Advanced Course , edited by P. M. Stocker, P. M. D. Gray, and M. P. Atkinson, pages 19–56, Cambridge University Press, 1984. ISBN: 978-0-521-25430-4
[16] J. S. Knowles 和 D. M. R. Bell: "CODASYL 模型",收录于数据库 - 角色与结构: 高级课程,由 P. M. Stocker、P. M. D. Gray 和 M. P. Atkinson 编辑,第 19 至 56 页,1984 年出版,剑桥大学出版社。ISBN: 978-0-521-25430-4。
[ 17 ] Charles W. Bachman: “ The Programmer as Navigator ,” Communications of the ACM , volume 16, number 11, pages 653–658, November 1973. doi:10.1145/355611.362534
[17] 查尔斯·W·巴赫曼: “程序员作为导航员,” ACM通讯杂志,第16卷,第11号,653-658页,1973年11月。 doi:10.1145/355611.362534
[ 18 ] Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton: “ Architecture of a Database System ,” Foundations and Trends in Databases , volume 1, number 2, pages 141–259, November 2007. doi:10.1561/1900000002
[18] Joseph M. Hellerstein、Michael Stonebraker和James Hamilton:“数据库系统的架构”,《数据库基础与趋势》第1卷第2期,第141-259页,2007年11月。doi:10.1561/1900000002
[ 19 ] Sandeep Parikh and Kelly Stirman: “ Schema Design for Time Series Data in MongoDB ,” blog.mongodb.org , October 30, 2013.
[19] Sandeep Parikh和Kelly Stirman:“MongoDB中的时间序列数据模式设计”,blog.mongodb.org,2013年10月30日。
[ 20 ] Martin Fowler: “ Schemaless Data Structures ,” martinfowler.com , January 7, 2013.
[20] Martin Fowler:“无模式数据结构”,martinfowler.com,2013年1月7日。
[ 21 ] Amr Awadallah: “ Schema-on-Read vs. Schema-on-Write ,” at Berkeley EECS RAD Lab Retreat , Santa Cruz, CA, May 2009.
[21] Amr Awadallah:“读时模式与写时模式”,伯克利EECS RAD实验室研讨会,加利福尼亚州圣克鲁斯,2009年5月。
[ 22 ] Martin Odersky: “ The Trouble with Types ,” at Strange Loop , September 2013.
[22] Martin Odersky:“类型的麻烦”,Strange Loop大会,2013年9月。
[ 23 ] Conrad Irwin: “ MongoDB—Confessions of a PostgreSQL Lover ,” at HTML5DevConf , October 2013.
[23] Conrad Irwin:“MongoDB——一个PostgreSQL爱好者的忏悔”,HTML5DevConf,2013年10月。
[ 24 ] “ Percona Toolkit Documentation: pt-online-schema-change ,” Percona Ireland Ltd., 2013.
[24] “Percona Toolkit 文档:pt-online-schema-change”,Percona Ireland Ltd.,2013年。
[ 25 ] Rany Keddo, Tobias Bielohlawek, and Tobias Schmidt: “ Large Hadron Migrator ,” SoundCloud, 2013.
[25] Rany Keddo、Tobias Bielohlawek 和 Tobias Schmidt: “大型强子迁移器”,SoundCloud,2013年。
[ 26 ] Shlomi Noach: “ gh-ost: GitHub’s Online Schema Migration Tool for MySQL ,” githubengineering.com , August 1, 2016.
[26] Shlomi Noach:「gh-ost:GitHub 的 MySQL 在线模式迁移工具」,githubengineering.com,2016 年 8 月 1 日。
[ 27 ] James C. Corbett, Jeffrey Dean, Michael Epstein, et al.: “ Spanner: Google’s Globally-Distributed Database ,” at 10th USENIX Symposium on Operating System Design and Implementation (OSDI), October 2012.
[27] James C. Corbett,Jeffrey Dean,Michael Epstein等人: “Spanner: Google的全球分布式数据库,” 于2012年10月在第10届USENIX操作系统设计和实现研讨会(OSDI)上。
[ 28 ] Donald K. Burleson: “ Reduce I/O with Oracle Cluster Tables ,” dba-oracle.com .
[28] Donald K. Burleson: "使用Oracle Cluster表降低I/O",dba-oracle.com.
[ 29 ] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al.: “ Bigtable: A Distributed Storage System for Structured Data ,” at 7th USENIX Symposium on Operating System Design and Implementation (OSDI), November 2006.
[29] Fay Chang, Jeffrey Dean, Sanjay Ghemawat等人:「Bigtable:一种用于结构化数据的分布式存储系统」,发表于第七届USENIX操作系统设计与实现研讨会(OSDI),2006年11月。
[ 30 ] Bobbie J. Cochrane and Kathy A. McKnight: “ DB2 JSON Capabilities, Part 1: Introduction to DB2 JSON ,” IBM developerWorks, June 20, 2013.
[30] Bobbie J. Cochrane和Kathy A. McKnight:“DB2 JSON功能,第1部分:DB2 JSON简介”,IBM developerWorks,2013年6月20日。
[ 31 ] Herb Sutter: “ The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software ,” Dr. Dobb’s Journal , volume 30, number 3, pages 202-210, March 2005.
[31] Herb Sutter:“免费午餐结束了:软件向并发的根本性转变”,《Dr. Dobb's Journal》第30卷第3期,第202-210页,2005年3月。
[ 32 ] Joseph M. Hellerstein: “ The Declarative Imperative: Experiences and Conjectures in Distributed Logic ,” Electrical Engineering and Computer Sciences, University of California at Berkeley, Tech report UCB/EECS-2010-90, June 2010.
[32] 约瑟夫·M·黑勒斯坦: “陈述式命令: 分布式逻辑中的经验与推测”,加州大学伯克利分校电气工程与计算机科学系,技术报告UCB/EECS-2010-90,2010年6月。
[ 33 ] Jeffrey Dean and Sanjay Ghemawat: “ MapReduce: Simplified Data Processing on Large Clusters ,” at 6th USENIX Symposium on Operating System Design and Implementation (OSDI), December 2004.
[33] 杰弗里·迪安和桑杰·基马瓦特: “MapReduce:在大型集群上简化数据处理”,于2004年12月在第六届USENIX操作系统设计和实现研讨会(OSDI)上发表。
[ 34 ] Craig Kerstiens: “ JavaScript in Your Postgres ,” blog.heroku.com , June 5, 2013.
[34] Craig Kerstiens:「在Postgres中使用JavaScript」,blog.heroku.com,2013年6月5日。
[ 35 ] Nathan Bronson, Zach Amsden, George Cabrera, et al.: “ TAO: Facebook’s Distributed Data Store for the Social Graph ,” at USENIX Annual Technical Conference (USENIX ATC), June 2013.
[35] Nathan Bronson、Zach Amsden、George Cabrera等:“TAO:Facebook面向社交图谱的分布式数据存储”,USENIX年度技术会议(USENIX ATC),2013年6月。
[ 36 ] “ Apache TinkerPop3.2.3 Documentation ,” tinkerpop.apache.org , October 2016.
[36] “Apache TinkerPop 3.2.3 文档”,tinkerpop.apache.org,2016年10月。
[ 37 ] “ The Neo4j Manual v2.0.0 ,” Neo Technology, 2013.
[37] “Neo4j手册v2.0.0”,Neo Technology,2013年。
[ 38 ] Emil Eifrem: Twitter correspondence , January 3, 2014.
[38] Emil Eifrem:Twitter通信,2014年1月3日。
[ 39 ] David Beckett and Tim Berners-Lee: “ Turtle – Terse RDF Triple Language ,” W3C Team Submission, March 28, 2011.
[39] David Beckett和Tim Berners-Lee:“Turtle——简洁的RDF三元组语言”,W3C团队提交,2011年3月28日。
[ 40 ] “ Datomic Development Resources ,” Metadata Partners, LLC, 2013.
[40] “Datomic开发资源,” Metadata Partners, LLC, 2013.
[ 41 ] W3C RDF Working Group: “ Resource Description Framework (RDF) ,” w3.org , 10 February 2004.
[41] W3C RDF 工作组: “资源描述框架(RDF),” w3.org,2004年2月10日。
[ 42 ] “ Apache Jena ,” Apache Software Foundation.
[42] “Apache Jena”,Apache软件基金会。
[ 43 ] Steve Harris, Andy Seaborne, and Eric Prud’hommeaux: “ SPARQL 1.1 Query Language ,” W3C Recommendation, March 2013.
[43] Steve Harris、Andy Seaborne和Eric Prud'hommeaux:“SPARQL 1.1查询语言”,W3C推荐标准,2013年3月。
[ 44 ] Todd J. Green, Shan Shan Huang, Boon Thau Loo, and Wenchao Zhou: “ Datalog and Recursive Query Processing ,” Foundations and Trends in Databases , volume 5, number 2, pages 105–195, November 2013. doi:10.1561/1900000017
[44] Todd J. Green, Shan Shan Huang, Boon Thau Loo和Wenchao Zhou: “Datalog和递归查询处理,”数据库的基础和趋势, 第5卷第2期,第105-195页,2013年11月。 doi:10.1561 / 1900000017
[ 45 ] Stefano Ceri, Georg Gottlob, and Letizia Tanca: “ What You Always Wanted to Know About Datalog (And Never Dared to Ask) ,” IEEE Transactions on Knowledge and Data Engineering , volume 1, number 1, pages 146–166, March 1989. doi:10.1109/69.43410
[45] Stefano Ceri、Georg Gottlob和Letizia Tanca:“关于Datalog,你一直想知道却不敢问的一切”,IEEE知识与数据工程汇刊,第1卷第1期,第146-166页,1989年3月。doi:10.1109/69.43410
[ 46 ] Serge Abiteboul, Richard Hull, and Victor Vianu: Foundations of Databases . Addison-Wesley, 1995. ISBN: 978-0-201-53771-0, available online at webdam.inria.fr/Alice
[46] Serge Abiteboul,Richard Hull和Victor Vianu:数据库的基础。Addison-Wesley,1995年。 ISBN:978-0-201-53771-0,在webdam.inria.fr/Alice上可在线获取。
[ 47 ] Nathan Marz: “ Cascalog ,” cascalog.org .
[47] Nathan Marz:“Cascalog”,cascalog.org。
[ 48 ] Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, et al.: “ GenBank ,” Nucleic Acids Research , volume 36, Database issue, pages D25–D30, December 2007. doi:10.1093/nar/gkm929
[48] Dennis A. Benson、Ilene Karsch-Mizrachi、David J. Lipman等人:“GenBank”,《核酸研究》第36卷数据库专题,2007年12月,页码D25-D30。DOI:10.1093/nar/gkm929。
[ 49 ] Fons Rademakers: “ ROOT for Big Data Analysis ,” at Workshop on the Future of Big Data Management , London, UK, June 2013.
[49] Fons Rademakers:“ROOT用于大数据分析”,大数据管理的未来研讨会,英国伦敦,2013年6月。
Chapter 3. Storage and Retrieval
Wer Ordnung hält, ist nur zu faul zum Suchen.
谁有秩序,只是懒得寻找。
(If you keep things tidily ordered, you’re just too lazy to go searching.)
如果你把东西整理得井井有条,那就是因为懒得到处找。
German proverb
德国谚语。
On the most fundamental level, a database needs to do two things: when you give it some data, it should store the data, and when you ask it again later, it should give the data back to you.
在最基本的层级上,一个数据库需要做两件事情:当你提供一些数据时,它应该存储数据,当你以后再次询问时,它应该将数据返回给你。
In Chapter 2 we discussed data models and query languages—i.e., the format in which you (the application developer) give the database your data, and the mechanism by which you can ask for it again later. In this chapter we discuss the same from the database’s point of view: how we can store the data that we’re given, and how we can find it again when we’re asked for it.
在第2章中,我们讨论了数据模型和查询语言——也就是你(应用程序开发者)向数据库提供数据的格式以及以后可以再次请求它的机制。在本章中,我们从数据库的角度讨论相同的问题:如何存储我们收到的数据以及在被要求时如何找到它。
Why should you, as an application developer, care how the database handles storage and retrieval internally? You’re probably not going to implement your own storage engine from scratch, but you do need to select a storage engine that is appropriate for your application, from the many that are available. In order to tune a storage engine to perform well on your kind of workload, you need to have a rough idea of what the storage engine is doing under the hood.
作为应用程序开发人员,为什么要关心数据库如何处理内部的存储和检索?您可能不会从头开始实现自己的存储引擎,但是您需要选择适合您的应用程序的存储引擎,从众多可用的引擎中进行选择。为了调整存储引擎以在您的工作负载类型上表现良好,您需要对存储引擎在幕后执行的操作有一个大致的了解。
In particular, there is a big difference between storage engines that are optimized for transactional workloads and those that are optimized for analytics. We will explore that distinction later in “Transaction Processing or Analytics?” , and in “Column-Oriented Storage” we’ll discuss a family of storage engines that is optimized for analytics.
特别是,那些优化事务工作负载的存储引擎和那些优化分析的存储引擎之间有很大的差异。我们将在“事务处理还是分析?”中探讨这一区别,并在“基于列的存储”中讨论一系列优化分析的存储引擎。
However, first we’ll start this chapter by talking about storage engines that are used in the kinds of databases that you’re probably familiar with: traditional relational databases, and also most so-called NoSQL databases. We will examine two families of storage engines: log-structured storage engines, and page-oriented storage engines such as B-trees.
然而,我们将首先开始本章,讨论在您可能熟悉的数据库类型中使用的存储引擎:传统关系型数据库,以及大多数所谓的NoSQL数据库。我们将研究两个存储引擎系列:日志结构存储引擎和基于页的存储引擎,如B树。
Data Structures That Power Your Database
Consider the world’s simplest database, implemented as two Bash functions:
考虑作为两个Bash函数实现的世界上最简单的数据库:
#!/bin/bash

db_set () {
    echo "$1,$2" >> database
}

db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
These two functions implement a key-value store. You can call db_set key value, which will store key and value in the database. The key and value can be (almost) anything you like—for example, the value could be a JSON document. You can then call db_get key, which looks up the most recent value associated with that particular key and returns it.
这两个函数实现了键值存储。您可以调用db_set键值对,它会将键和值存储在数据库中。键和值可以是(几乎)任何您喜欢的东西 - 例如,值可以是JSON文档。然后,您可以调用db_get键,查找与该特定键相关的最新值并返回它。
And it works:
它起作用了:
$ db_set 123456 '{"name":"London","attractions":["Big Ben","London Eye"]}'
$ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}'
$ db_get 42
{"name":"San Francisco","attractions":["Golden Gate Bridge"]}
The underlying storage format is very simple: a text file where each line contains a key-value pair, separated by a comma (roughly like a CSV file, ignoring escaping issues). Every call to db_set appends to the end of the file, so if you update a key several times, the old versions of the value are not overwritten—you need to look at the last occurrence of a key in a file to find the latest value (hence the tail -n 1 in db_get):
底层存储格式非常简单:文本文件中每一行都包含一个键值对,通过逗号分隔(类似 CSV 文件,忽略转义问题)。每次调用 db_set 都会将数据追加到文件末尾,所以如果多次更新同一个键,旧版本的值是不会被覆盖的 —— 你需要在文件中查找键的最后一次出现来获取最新的值(因此在 db_get 中使用了 tail -n 1 命令)。
$ db_set 42 '{"name":"San Francisco","attractions":["Exploratorium"]}'
$ db_get 42
{"name":"San Francisco","attractions":["Exploratorium"]}
$ cat database
123456,{"name":"London","attractions":["Big Ben","London Eye"]}
42,{"name":"San Francisco","attractions":["Golden Gate Bridge"]}
42,{"name":"San Francisco","attractions":["Exploratorium"]}
Our db_set function actually has pretty good performance for something that is so simple, because appending to a file is generally very efficient. Similarly to what db_set does, many databases internally use a log, which is an append-only data file. Real databases have more issues to deal with (such as concurrency control, reclaiming disk space so that the log doesn't grow forever, and handling errors and partially written records), but the basic principle is the same. Logs are incredibly useful, and we will encounter them several times in the rest of this book.
我们的db_set函数在这么简单的情况下实际上有相当好的性能,因为向文件添加数据通常非常高效。与db_set类似的是,许多数据库内部使用日志,即追加模式的数据文件。真正的数据库会遇到更多问题(如并发控制,回收磁盘空间,以防日志无限增长,处理错误和部分写入记录等),但基本原理是相同的。日志非常有用,在本书的其余部分中我们将遇到它们多次。
Note
The word log is often used to refer to application logs, where an application outputs text that describes what’s happening. In this book, log is used in the more general sense: an append-only sequence of records. It doesn’t have to be human-readable; it might be binary and intended only for other programs to read.
“日志”一词通常指应用程序日志,即应用程序输出的描述正在发生的事情的文本。在本书中,日志是在更一般的意义上使用的:一个只能追加的记录序列。它不必是人类可读的;它可能是二进制的,只供其他程序读取。
On the other hand, our db_get function has terrible performance if you have a large number of records in your database. Every time you want to look up a key, db_get has to scan the entire database file from beginning to end, looking for occurrences of the key. In algorithmic terms, the cost of a lookup is O(n): if you double the number of records n in your database, a lookup takes twice as long. That's not good.
另一方面,如果你的数据库中有大量记录,我们的db_get函数性能非常差。每当你想查找一个键时,db_get必须从头到尾扫描整个数据库文件,寻找键的出现。在算法术语中,查找的成本是O(n):如果你将数据库中的记录数量n翻倍,查找需要的时间就会增加一倍。这不好。
In order to efficiently find the value for a particular key in the database, we need a different data structure: an index . In this chapter we will look at a range of indexing structures and see how they compare; the general idea behind them is to keep some additional metadata on the side, which acts as a signpost and helps you to locate the data you want. If you want to search the same data in several different ways, you may need several different indexes on different parts of the data.
为了有效地在数据库中找到特定键的值,我们需要一个不同的数据结构:索引。在本章中,我们将研究一系列索引结构并看看它们的比较;它们背后的一般思想是在一侧保存一些额外的元数据,它作为标志牌,帮助您定位所需的数据。如果您想以几种不同的方式搜索相同的数据,则可能需要在数据的不同部分上具有几个不同的索引。
An index is an additional structure that is derived from the primary data. Many databases allow you to add and remove indexes, and this doesn’t affect the contents of the database; it only affects the performance of queries. Maintaining additional structures incurs overhead, especially on writes. For writes, it’s hard to beat the performance of simply appending to a file, because that’s the simplest possible write operation. Any kind of index usually slows down writes, because the index also needs to be updated every time data is written.
索引是从主要数据派生出来的额外结构。许多数据库允许您添加和删除索引,这不会影响数据库的内容,只会影响查询的性能。维护额外的结构会产生额外的负担,尤其是对写入操作。对于写入操作,很难击败简单地追加到文件的性能,因为那是最简单的写入操作。任何类型的索引通常会减慢写入速度,因为每次写入数据时也需要更新索引。
This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but every index slows down writes. For this reason, databases don’t usually index everything by default, but require you—the application developer or database administrator—to choose indexes manually, using your knowledge of the application’s typical query patterns. You can then choose the indexes that give your application the greatest benefit, without introducing more overhead than necessary.
在存储系统中,这是一个重要的权衡:精心选择索引可以加速读取查询,但是每个索引都会减缓写入速度。因此,数据库通常不会默认索引所有内容,而需要你——应用程序开发人员或数据库管理员——根据应用程序的典型查询模式手动选择索引。你可以选择对你的应用程序带来最大收益的索引,而不会引入更多不必要的开销。
Hash Indexes
Let’s start with indexes for key-value data. This is not the only kind of data you can index, but it’s very common, and it’s a useful building block for more complex indexes.
让我们从键值数据的索引开始。这不是你可以索引的唯一类型,但是它非常常见,是更复杂索引的有用基石。
Key-value stores are quite similar to the dictionary type that you can find in most programming languages, and which is usually implemented as a hash map (hash table). Hash maps are described in many algorithms textbooks [ 1 , 2 ], so we won’t go into detail of how they work here. Since we already have hash maps for our in-memory data structures, why not use them to index our data on disk?
键值存储与大多数编程语言中的字典类型非常相似,后者通常实现为哈希映射(哈希表)。哈希映射在许多算法教科书中都有描述[1,2],因此这里不再详述其工作原理。既然我们已经为内存数据结构准备了哈希映射,为什么不用它们来索引磁盘上的数据呢?
Let’s say our data storage consists only of appending to a file, as in the preceding example. Then the simplest possible indexing strategy is this: keep an in-memory hash map where every key is mapped to a byte offset in the data file—the location at which the value can be found, as illustrated in Figure 3-1 . Whenever you append a new key-value pair to the file, you also update the hash map to reflect the offset of the data you just wrote (this works both for inserting new keys and for updating existing keys). When you want to look up a value, use the hash map to find the offset in the data file, seek to that location, and read the value.
假设我们的数据存储仅仅是在一个文件中添加内容,和前面的例子一样。那么最简单的索引策略是:在内存中保持一个哈希映射,其中每个键都被映射到数据文件中的一个字节偏移量,也就是值所在的位置,如图3-1所示。每当你把一个新的键值对添加到文件中,你也要更新哈希映射,以反映出你刚刚写入的数据的偏移量(这对于插入新键和更新现有键都适用)。当你想要查找一个值时,使用哈希映射来找到数据文件中的偏移量,把指针定位到该位置,然后读取值。
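The offset-map idea above can be sketched in a few lines of Python (used here instead of Bash purely for illustration; the class name and the one-record-per-line layout are assumptions, not any real engine's on-disk format):

```python
class HashIndexedLog:
    """Append-only log file plus an in-memory hash index (key -> byte offset)."""

    def __init__(self, path="database"):
        self.path = path
        self.index = {}                      # key -> offset of the latest record
        open(self.path, "ab").close()        # make sure the log file exists

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.tell()                # byte position where this record starts
            f.write(record)
        self.index[key] = offset             # latest write wins

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                   # one seek, one read: no full scan
            line = f.readline().decode("utf-8")
        return line.rstrip("\n").split(",", 1)[1]
```

Unlike the O(n) db_get above, each lookup here is a dictionary probe plus a single disk seek.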
This may sound simplistic, but it is a viable approach. In fact, this is essentially what Bitcask (the default storage engine in Riak) does [ 3 ]. Bitcask offers high-performance reads and writes, subject to the requirement that all the keys fit in the available RAM, since the hash map is kept completely in memory. The values can use more space than there is available memory, since they can be loaded from disk with just one disk seek. If that part of the data file is already in the filesystem cache, a read doesn’t require any disk I/O at all.
这听起来很简单,但这是一种可行的方法。实际上,这就是Bitcask(Riak的默认存储引擎)的基本操作[3]。 Bitcask提供高性能的读写功能,要求所有键都适合可用的RAM,因为哈希映射被完全保存在内存中。值可以使用比可用内存更多的空间,因为它们可以通过只进行一次磁盘搜索,从磁盘加载。如果该数据文件的部分已经在文件系统高速缓存中,那么读取就不需要任何磁盘I/O。
A storage engine like Bitcask is well suited to situations where the value for each key is updated frequently. For example, the key might be the URL of a cat video, and the value might be the number of times it has been played (incremented every time someone hits the play button). In this kind of workload, there are a lot of writes, but there are not too many distinct keys—you have a large number of writes per key, but it’s feasible to keep all keys in memory.
像Bitcask这样的存储引擎非常适用于每个键的值经常更新的情况。例如,键可以是猫视频的URL,而值可以是播放次数(每次有人点击播放按钮就会增加)。在这种负载下,有很多写入操作,但是不会有太多不同的键 - 每个键有大量的写入操作,但是在内存中保留所有键是可行的。
As described so far, we only ever append to a file—so how do we avoid eventually running out of disk space? A good solution is to break the log into segments of a certain size by closing a segment file when it reaches a certain size, and making subsequent writes to a new segment file. We can then perform compaction on these segments, as illustrated in Figure 3-2 . Compaction means throwing away duplicate keys in the log, and keeping only the most recent update for each key.
到目前为止,我们只添加到文件,那么我们如何避免最终用尽磁盘空间?一个好的解决方案是将日志分成一定大小的段,当一个段文件达到一定大小时关闭它,并在新的段文件中进行后续写操作。然后我们可以对这些段执行压缩,如图3-2所示。压缩是指在日志中丢弃重复键,并仅保留每个键的最新更新。
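Compaction itself is a small routine: scan a closed segment in write order, remember only the latest value per key, and write out a new, smaller segment. A sketch assuming the same comma-separated line format as the Bash example (the function name is illustrative):

```python
def compact_segment(in_path, out_path):
    """Rewrite a segment, keeping only the most recent value for each key."""
    latest = {}
    with open(in_path, "r", encoding="utf-8") as f:
        for line in f:
            key, _, value = line.rstrip("\n").partition(",")
            latest[key] = value              # later entries overwrite earlier ones
    with open(out_path, "w", encoding="utf-8") as f:
        for key, value in latest.items():
            f.write(f"{key},{value}\n")
```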
Moreover, since compaction often makes segments much smaller (assuming that a key is overwritten several times on average within one segment), we can also merge several segments together at the same time as performing the compaction, as shown in Figure 3-3 . Segments are never modified after they have been written, so the merged segment is written to a new file. The merging and compaction of frozen segments can be done in a background thread, and while it is going on, we can still continue to serve read and write requests as normal, using the old segment files. After the merging process is complete, we switch read requests to using the new merged segment instead of the old segments—and then the old segment files can simply be deleted.
此外,由于压缩通常会使段变得更小(假设一个键在一个段内平均被覆盖多次),我们也可以在执行压缩的同时将多个段合并在一起,如图3-3所示。段在写入后永远不会被修改,因此合并后的段会被写入一个新文件。冻结段的合并和压缩可以在后台线程中完成,在此期间,我们仍然可以像往常一样使用旧的段文件继续处理读写请求。合并过程完成后,我们将读取请求切换到使用新合并的段而不是旧段,然后旧的段文件就可以直接删除了。
Each segment now has its own in-memory hash table, mapping keys to file offsets. In order to find the value for a key, we first check the most recent segment’s hash map; if the key is not present we check the second-most-recent segment, and so on. The merging process keeps the number of segments small, so lookups don’t need to check many hash maps.
现在每个分段都有自己的内存哈希表,将键映射到文件偏移量。为了查找键的值,我们首先检查最近的分段哈希映射;如果键不存在,则检查第二个最近的分段,依此类推。合并过程使分段数保持较小,因此查找不需要检查许多哈希映射。
Lots of detail goes into making this simple idea work in practice. Briefly, some of the issues that are important in a real implementation are:
制作这个简单的想法需要考虑许多细节。简要来说,一些实际实现中重要的问题包括:
- File format
-
CSV is not the best format for a log. It’s faster and simpler to use a binary format that first encodes the length of a string in bytes, followed by the raw string (without need for escaping).
CSV不是日志的最佳格式。使用一种二进制格式更快,更简单,首先会对字符串的长度进行编码(以字节为单位),然后是原始字符串(无需转义)。
- Deleting records
-
If you want to delete a key and its associated value, you have to append a special deletion record to the data file (sometimes called a tombstone ). When log segments are merged, the tombstone tells the merging process to discard any previous values for the deleted key.
如果您想删除一个键和它相关联的值,必须向数据文件附加一个特殊的删除记录(有时称为墓碑)。当日志段合并时,墓碑会告诉合并过程丢弃已删除键的任何先前值。
- Crash recovery
-
If the database is restarted, the in-memory hash maps are lost. In principle, you can restore each segment’s hash map by reading the entire segment file from beginning to end and noting the offset of the most recent value for every key as you go along. However, that might take a long time if the segment files are large, which would make server restarts painful. Bitcask speeds up recovery by storing a snapshot of each segment’s hash map on disk, which can be loaded into memory more quickly.
如果数据库重新启动,内存中的哈希映射会丢失。原则上,可以通过从头到尾读取每个段文件并记录每个键的最新值的偏移量来恢复每个段的哈希映射。但是,如果段文件很大,这样做可能需要很长时间,从而使服务器重新启动非常痛苦。Bitcask通过在磁盘上存储每个段的哈希映射快照来加速恢复,可以更快地将其加载到内存中。
- Partially written records
-
The database may crash at any time, including halfway through appending a record to the log. Bitcask files include checksums, allowing such corrupted parts of the log to be detected and ignored.
数据库可能会在任何时候崩溃,包括在将记录附加到日志的一半之处。 Bitcask 文件包括校验和,使可以检测和忽略日志中的这些损坏的部分。
- Concurrency control
-
As writes are appended to the log in a strictly sequential order, a common implementation choice is to have only one writer thread. Data file segments are append-only and otherwise immutable, so they can be read concurrently by multiple threads.
由于写入操作必须严格按顺序附加到日志中,因此常见的实现选择是仅使用一个写入线程。数据文件段是只追加且不可变的,因此它们可以被多个线程同时读取。
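The file-format and partially-written-records points combine naturally: a binary record with a length prefix and a checksum lets a reader detect a torn write at the tail of the log and ignore it. A sketch of one possible layout (invented for illustration; Bitcask's real on-disk format differs in detail):

```python
import struct
import zlib

def encode_record(key: bytes, value: bytes) -> bytes:
    # payload = 4-byte key length | key bytes | value bytes (no escaping needed)
    payload = struct.pack(">I", len(key)) + key + value
    # record = 4-byte CRC-32 | 4-byte payload length | payload
    return struct.pack(">II", zlib.crc32(payload), len(payload)) + payload

def decode_records(data: bytes):
    """Yield (key, value) pairs, stopping at the first truncated or corrupt record."""
    pos = 0
    while pos + 8 <= len(data):
        crc, length = struct.unpack_from(">II", data, pos)
        payload = data[pos + 8 : pos + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break                            # torn write from a crash: ignore the tail
        key_len = struct.unpack_from(">I", payload)[0]
        yield payload[4 : 4 + key_len], payload[4 + key_len :]
        pos += 8 + length
```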
An append-only log seems wasteful at first glance: why don’t you update the file in place, overwriting the old value with the new value? But an append-only design turns out to be good for several reasons:
乍一看,仅追加日志似乎很浪费:为什么不在原地更新文件,用新值覆盖旧值?但事实证明,仅追加的设计有几个方面的好处:
-
Appending and segment merging are sequential write operations, which are generally much faster than random writes, especially on magnetic spinning-disk hard drives. To some extent sequential writes are also preferable on flash-based solid state drives (SSDs) [ 4 ]. We will discuss this issue further in “Comparing B-Trees and LSM-Trees” .
添加与段合并是连续写入操作,通常比随机写入快得多,尤其是在磁盘硬盘上。在基于闪存的固态硬盘(SSD)上,一定程度上也更喜欢连续写入[4]。我们将在“比较B-树和LSM-树”中进一步讨论此问题。
-
Concurrency and crash recovery are much simpler if segment files are append-only or immutable. For example, you don’t have to worry about the case where a crash happened while a value was being overwritten, leaving you with a file containing part of the old and part of the new value spliced together.
如果分段文件是只追加或不可变的,那么并发和崩溃恢复就要简单得多。例如,你不必担心在值被覆盖时发生崩溃的情况,使得你只得到一个包含新值和旧值的部分拼接在一起的文件。
-
Merging old segments avoids the problem of data files getting fragmented over time.
合并旧段可以避免数据文件随时间推移而碎片化。
However, the hash table index also has limitations:
然而,哈希表索引也有一些限制:
-
The hash table must fit in memory, so if you have a very large number of keys, you’re out of luck. In principle, you could maintain a hash map on disk, but unfortunately it is difficult to make an on-disk hash map perform well. It requires a lot of random access I/O, it is expensive to grow when it becomes full, and hash collisions require fiddly logic [ 5 ].
哈希表必须适合内存,所以如果您有大量的键,您就没那么幸运了。原则上,您可以在磁盘上维护哈希映射,但不幸的是很难使磁盘上的哈希映射表现良好。它需要大量的随机访问I/O,当它变满时扩展它是昂贵的,并且哈希冲突需要费力的逻辑 [5]。
-
Range queries are not efficient. For example, you cannot easily scan over all keys between kitty00000 and kitty99999—you'd have to look up each key individually in the hash maps.
范围查询效率不高。例如,你无法轻松扫描kitty00000和kitty99999之间的所有键——你必须在哈希映射中逐个查找每个键。
In the next section we will look at an indexing structure that doesn’t have those limitations.
在下一部分中,我们将看到一种没有这些限制的索引结构。
SSTables and LSM-Trees
In Figure 3-3 , each log-structured storage segment is a sequence of key-value pairs. These pairs appear in the order that they were written, and values later in the log take precedence over values for the same key earlier in the log. Apart from that, the order of key-value pairs in the file does not matter.
在图3-3中,每个日志结构化存储段都是键值对的序列。这些对按照它们被写入的顺序出现,日志中后面的值优先于日志中先前相同键的值。除此之外,文件中键值对的顺序并不重要。
Now we can make a simple change to the format of our segment files: we require that the sequence of key-value pairs is sorted by key . At first glance, that requirement seems to break our ability to use sequential writes, but we’ll get to that in a moment.
现在我们可以对我们的段文件格式进行简单更改:要求键值对的顺序按照键来排序。乍一看,这个要求似乎可能会破坏我们使用顺序写的能力,但是我们马上会解决这个问题。
We call this format Sorted String Table , or SSTable for short. We also require that each key only appears once within each merged segment file (the compaction process already ensures that). SSTables have several big advantages over log segments with hash indexes:
我们称之为排序字符串表格式,简称SSTable。我们还要求每个键在合并的段文件中只出现一次(合并过程已经确保了这一点)。SSTable比具有哈希索引的日志段有几个重要优点:
-
Merging segments is simple and efficient, even if the files are bigger than the available memory. The approach is like the one used in the mergesort algorithm and is illustrated in Figure 3-4 : you start reading the input files side by side, look at the first key in each file, copy the lowest key (according to the sort order) to the output file, and repeat. This produces a new merged segment file, also sorted by key.
合并段落是简单高效的,即使文件大小超过了可用内存。这种方法类似于归并排序算法中使用的方法,在图3-4中有图解:你可以一边读取输入文件,一边对比每个文件的第一个键值,复制最小的键值(根据排序顺序)到输出文件中,然后重复此操作。这将产生一个新的合并段落文件,也是按键值排序的。
What if the same key appears in several input segments? Remember that each segment contains all the values written to the database during some period of time. This means that all the values in one input segment must be more recent than all the values in the other segment (assuming that we always merge adjacent segments). When multiple segments contain the same key, we can keep the value from the most recent segment and discard the values in older segments.
如果同一个键出现在多个输入段中怎么办?请记住,每个段都包含某一段时间内写入数据库的所有值。这意味着一个输入段中的所有值必然比另一个段中的所有值更新(假设我们总是合并相邻的段)。当多个段包含相同的键时,我们可以保留最新段中的值,丢弃旧段中的值。
-
In order to find a particular key in the file, you no longer need to keep an index of all the keys in memory. See Figure 3-5 for an example: say you're looking for the key handiwork, but you don't know the exact offset of that key in the segment file. However, you do know the offsets for the keys handbag and handsome, and because of the sorting you know that handiwork must appear between those two. This means you can jump to the offset for handbag and scan from there until you find handiwork (or not, if the key is not present in the file).
为了在文件中查找特定的键,您不再需要在内存中保留所有键的索引。以图3-5为例:假设您正在查找键handiwork,但您不知道该键在段文件中的确切偏移量。然而,您知道handbag和handsome这两个键的偏移量,并且由于排序,您知道handiwork必定出现在这两者之间。这意味着您可以跳转到handbag的偏移量处,从那里开始扫描,直到找到handiwork(如果该键不存在于文件中,则找不到)。
You still need an in-memory index to tell you the offsets for some of the keys, but it can be sparse: one key for every few kilobytes of segment file is sufficient, because a few kilobytes can be scanned very quickly. i
你仍需一个内存索引来告诉你一些键的偏移量,但它可以是稀疏的:每几千字节的段文件只需一个键就足够了,因为几千字节可以很快地扫描。
-
Since read requests need to scan over several key-value pairs in the requested range anyway, it is possible to group those records into a block and compress it before writing it to disk (indicated by the shaded area in Figure 3-5 ). Each entry of the sparse in-memory index then points at the start of a compressed block. Besides saving disk space, compression also reduces the I/O bandwidth use.
由于读请求需要在请求范围内扫描多个键值对,因此可以将这些记录分组并在写入磁盘之前对其进行压缩(如图3-5中的阴影区域所示)。 稀疏的内存索引的每个条目都指向一个压缩块的开头。 除了节省磁盘空间外,压缩还可以减少I / O带宽使用。
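The mergesort-style merge of Figure 3-4 can be sketched with Python's heapq.merge, assuming each segment is an in-memory, key-sorted list of pairs and segments are passed newest first (a real engine streams from files, but the duplicate-key rule is the same):

```python
import heapq

def merge_segments(segments):
    """Merge key-sorted segments, newest first, keeping the newest value per key."""
    # Tag each entry with its segment's age (0 = newest) so that, among
    # duplicates of the same key, the newest segment's entry sorts first.
    tagged = [
        [(key, age, value) for key, value in seg]
        for age, seg in enumerate(segments)
    ]
    merged = []
    last_key = object()                      # sentinel that matches no real key
    for key, _age, value in heapq.merge(*tagged):
        if key != last_key:                  # first occurrence = newest value
            merged.append((key, value))
            last_key = key
    return merged
```

The output is itself sorted by key, so it can be written out directly as a new merged segment.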
Constructing and maintaining SSTables
Fine so far—but how do you get your data to be sorted by key in the first place? Our incoming writes can occur in any order.
到目前为止很好,但是您如何才能使您的数据首先按键排序?我们的传入写入可以以任何顺序发生。
Maintaining a sorted structure on disk is possible (see “B-Trees” ), but maintaining it in memory is much easier. There are plenty of well-known tree data structures that you can use, such as red-black trees or AVL trees [ 2 ]. With these data structures, you can insert keys in any order and read them back in sorted order.
在磁盘中维护排序结构是可能的(参见“B-树”),但在内存中维护它要容易得多。有很多众所周知的树形数据结构可以使用,例如红黑树或AVL树[2]。使用这些数据结构,您可以以任何顺序插入键,并以排序顺序读取它们。
We can now make our storage engine work as follows:
我们现在可以让我们的存储引擎按照以下方式工作:
-
When a write comes in, add it to an in-memory balanced tree data structure (for example, a red-black tree). This in-memory tree is sometimes called a memtable .
当有写入请求到来时,将其添加到内存中的平衡树数据结构(例如红黑树)中。这个内存中的树有时被称为内存表(memtable)。
-
When the memtable gets bigger than some threshold—typically a few megabytes—write it out to disk as an SSTable file. This can be done efficiently because the tree already maintains the key-value pairs sorted by key. The new SSTable file becomes the most recent segment of the database. While the SSTable is being written out to disk, writes can continue to a new memtable instance.
当内存表大于某个阈值(通常是几兆字节)时,将其作为SSTable文件写入磁盘。这可以高效地完成,因为树中已经维护着按键排序的键值对。新的SSTable文件成为数据库最新的段。在SSTable写入磁盘的同时,写入可以继续进行到一个新的内存表实例。
-
In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.
为了处理读取请求,首先尝试在内存表中查找键,然后查找最新的磁盘段,再查找次新的段,依此类推。
-
From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.
不时地在后台运行合并和压实进程,以合并段文件,并丢弃已被覆盖或删除的值。
This scheme works very well. It only suffers from one problem: if the database crashes, the most recent writes (which are in the memtable but not yet written out to disk) are lost. In order to avoid that problem, we can keep a separate log on disk to which every write is immediately appended, just like in the previous section. That log is not in sorted order, but that doesn’t matter, because its only purpose is to restore the memtable after a crash. Every time the memtable is written out to an SSTable, the corresponding log can be discarded.
这个方案运作得很好,唯一的问题是如果数据库崩溃,最近的写入(在内存表中但尚未写入磁盘)会丢失。为了避免这个问题,我们可以在磁盘上保留一个单独的日志,每次写入都立即附加到其中,就像在前一节中一样。该日志没有排序,但这并不重要,因为它的唯一目的是在崩溃后恢复内存表。每次内存表写入SSTable时,相应的日志可以被丢弃。
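The four steps above, together with the crash-recovery log, can be sketched as a toy in-memory model. All names are hypothetical; a dict plus a sort at flush time stands in for the balanced tree, and Python lists stand in for files on disk.

```python
# Toy LSM storage engine: writes go to an in-memory table (standing in
# for a red-black tree) and to an append-only log for crash recovery;
# when the memtable grows past a threshold it is flushed to a new sorted
# segment, and reads check the memtable first, then segments newest to
# oldest. Merging/compaction of segments is omitted here.

class ToyLSM:
    def __init__(self, threshold=3):
        self.memtable = {}
        self.log = []          # stand-in for the on-disk crash-recovery log
        self.segments = []     # each segment: list of (key, value), sorted
        self.threshold = threshold

    def put(self, key, value):
        self.log.append((key, value))             # append to the log first
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            self._flush()

    def _flush(self):
        sstable = sorted(self.memtable.items())   # already sorted in a real tree
        self.segments.append(sstable)
        self.memtable = {}
        self.log = []          # log can be discarded once the memtable is persisted

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.segments):   # newest segment first
            for k, v in sstable:
                if k == key:
                    return v
        return None

db = ToyLSM(threshold=3)
for k, v in [("b", 1), ("a", 2), ("c", 3), ("a", 9)]:
    db.put(k, v)
print(db.get("a"))   # 9: the memtable shadows the older flushed segment
print(db.get("c"))   # 3: found in the flushed segment
```

After a simulated crash, replaying `db.log` into a fresh memtable would restore the writes that had not yet been flushed, which is exactly the role of the separate log described above.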
Making an LSM-tree out of SSTables
The algorithm described here is essentially what is used in LevelDB [ 6 ] and RocksDB [ 7 ], key-value storage engine libraries that are designed to be embedded into other applications. Among other things, LevelDB can be used in Riak as an alternative to Bitcask. Similar storage engines are used in Cassandra and HBase [ 8 ], both of which were inspired by Google’s Bigtable paper [ 9 ] (which introduced the terms SSTable and memtable ).
这里描述的算法本质上就是LevelDB [6] 和RocksDB [7] 所使用的算法,它们是旨在嵌入其他应用程序的键值存储引擎库。除此之外,LevelDB还可以在Riak中用作Bitcask的替代品。类似的存储引擎也用于Cassandra和HBase [8],两者都受到了Google的Bigtable论文 [9] 的启发(该论文引入了SSTable和memtable这两个术语)。
Originally this indexing structure was described by Patrick O’Neil et al. under the name Log-Structured Merge-Tree (or LSM-Tree) [ 10 ], building on earlier work on log-structured filesystems [ 11 ]. Storage engines that are based on this principle of merging and compacting sorted files are often called LSM storage engines.
最初,这种索引结构由Patrick O’Neil等人以日志结构合并树(Log-Structured Merge-Tree,即LSM-Tree)之名提出 [10],它建立在更早的日志结构文件系统 [11] 的工作基础上。基于这种合并和压实排序文件原理的存储引擎通常被称为LSM存储引擎。
Lucene, an indexing engine for full-text search used by Elasticsearch and Solr, uses a similar method for storing its term dictionary [ 12 , 13 ]. A full-text index is much more complex than a key-value index but is based on a similar idea: given a word in a search query, find all the documents (web pages, product descriptions, etc.) that mention the word. This is implemented with a key-value structure where the key is a word (a term ) and the value is the list of IDs of all the documents that contain the word (the postings list ). In Lucene, this mapping from term to postings list is kept in SSTable-like sorted files, which are merged in the background as needed [ 14 ].
Lucene是Elasticsearch和Solr所使用的全文搜索索引引擎,它使用类似的方法来存储其词典 [12, 13]。全文索引比键值索引复杂得多,但基于类似的思想:给定搜索查询中的一个单词,找到提及该单词的所有文档(网页、产品描述等)。这是通过一个键值结构实现的,其中键是单词(词项,term),值是包含该单词的所有文档的ID列表(倒排列表,postings list)。在Lucene中,从词项到倒排列表的映射保存在类似SSTable的排序文件中,并根据需要在后台合并 [14]。
Performance optimizations
As always, a lot of detail goes into making a storage engine perform well in practice. For example, the LSM-tree algorithm can be slow when looking up keys that do not exist in the database: you have to check the memtable, then the segments all the way back to the oldest (possibly having to read from disk for each one) before you can be sure that the key does not exist. In order to optimize this kind of access, storage engines often use additional Bloom filters [ 15 ]. (A Bloom filter is a memory-efficient data structure for approximating the contents of a set. It can tell you if a key does not appear in the database, and thus saves many unnecessary disk reads for nonexistent keys.)
与往常一样,要让存储引擎在实践中表现良好,需要考虑大量细节。例如,当查找数据库中不存在的键时,LSM树算法可能会很慢:你必须先检查内存表,然后一路检查到最旧的段(可能每个段都要读一次磁盘),才能确定这个键不存在。为了优化这种访问,存储引擎通常使用额外的布隆过滤器 [15]。(布隆过滤器是一种内存效率很高的数据结构,用于近似表示集合的内容。它可以告诉你某个键肯定不存在于数据库中,从而为不存在的键省去许多不必要的磁盘读取。)
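A minimal Bloom filter can be sketched in a few lines. This is a generic illustration, not the implementation any particular engine uses; the bit-array size and hash count here are arbitrary.

```python
import hashlib

# Minimal Bloom filter: k hash functions set k bits per added key. If a
# membership test finds any of those bits unset, the key was definitely
# never added, so the storage engine can skip reading segment files for
# it. False positives are possible; false negatives are not.

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive k positions by hashing the key with k different prefixes.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("apple")
print(bf.might_contain("apple"))   # True, always, for added keys
print(bf.might_contain("pear"))    # almost certainly False with 1024 bits
```

In an LSM engine, each SSTable would carry its own filter, consulted before the segment file is read at all.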
There are also different strategies to determine the order and timing of how SSTables are compacted and merged. The most common options are size-tiered and leveled compaction. LevelDB and RocksDB use leveled compaction (hence the name of LevelDB), HBase uses size-tiered, and Cassandra supports both [ 16 ]. In size-tiered compaction, newer and smaller SSTables are successively merged into older and larger SSTables. In leveled compaction, the key range is split up into smaller SSTables and older data is moved into separate “levels,” which allows the compaction to proceed more incrementally and use less disk space.
还有不同的策略来决定SSTables被压实和合并的顺序与时机。最常见的选择是大小分层(size-tiered)压实和分层(leveled)压实。LevelDB和RocksDB使用分层压实(LevelDB因此得名),HBase使用大小分层压实,而Cassandra两者都支持 [16]。在大小分层压实中,较新和较小的SSTables会陆续合并到较旧和较大的SSTables中。在分层压实中,键范围被拆分成较小的SSTables,较旧的数据被移到单独的“层级(level)”中,这使得压实可以更加增量地进行,并且使用更少的磁盘空间。
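Whatever the strategy, the core of compaction is a k-way merge of sorted segments in which, for duplicate keys, the entry from the newest segment wins and overwritten values are discarded. A sketch (function names are illustrative):

```python
import heapq

# Merge several SSTables into one. Segments are passed newest first;
# heapq.merge does a streaming k-way merge of the sorted inputs, and
# tagging each entry with its segment's age makes newer entries for the
# same key sort first, so the first occurrence of a key is kept.

def merge_sstables(segments_newest_first):
    merged = []
    seen = set()
    tagged = [[(k, age, v) for k, v in seg]
              for age, seg in enumerate(segments_newest_first)]
    for k, _age, v in heapq.merge(*tagged):
        if k not in seen:          # first occurrence = newest value
            seen.add(k)
            merged.append((k, v))
    return merged

newer = [("a", "a2"), ("c", "c1")]
older = [("a", "a1"), ("b", "b1"), ("c", "c0")]
print(merge_sstables([newer, older]))
# [('a', 'a2'), ('b', 'b1'), ('c', 'c1')]
```

Because the inputs are already sorted, the merge streams through them sequentially, which is why compaction can process segments much larger than memory.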
Even though there are many subtleties, the basic idea of LSM-trees—keeping a cascade of SSTables that are merged in the background—is simple and effective. Even when the dataset is much bigger than the available memory it continues to work well. Since data is stored in sorted order, you can efficiently perform range queries (scanning all keys above some minimum and up to some maximum), and because the disk writes are sequential the LSM-tree can support remarkably high write throughput.
尽管有许多微妙之处,LSM树的基本思想(保存一系列在后台合并的SSTables)简单而有效。即使数据集远大于可用内存,它也能继续正常工作。由于数据按排序顺序存储,因此可以高效地执行范围查询(扫描从某个最小值到某个最大值之间的所有键);并且由于磁盘写入是顺序的,LSM树可以支持非常高的写入吞吐量。
B-Trees
The log-structured indexes we have discussed so far are gaining acceptance, but they are not the most common type of index. The most widely used indexing structure is quite different: the B-tree .
到目前为止,我们讨论的日志结构化索引已经得到了认可,但它们并不是最常见的索引类型。最广泛使用的索引结构是完全不同的:B树。
Introduced in 1970 [ 17 ] and called “ubiquitous” less than 10 years later [ 18 ], B-trees have stood the test of time very well. They remain the standard index implementation in almost all relational databases, and many nonrelational databases use them too.
自1970年被引入[17]并在不到10年的时间内被称为“无处不在”, B-树经过了时间的考验。 它们仍然是几乎所有关系型数据库中的标准索引实现,许多非关系型数据库也使用它们。
Like SSTables, B-trees keep key-value pairs sorted by key, which allows efficient key-value lookups and range queries. But that’s where the similarity ends: B-trees have a very different design philosophy.
与SSTables一样,B树保存按键排序的键值对,这使得高效的键值查找和范围查询成为可能。但相似之处仅此而已:B树有着非常不同的设计哲学。
The log-structured indexes we saw earlier break the database down into variable-size segments , typically several megabytes or more in size, and always write a segment sequentially. By contrast, B-trees break the database down into fixed-size blocks or pages , traditionally 4 KB in size (sometimes bigger), and read or write one page at a time. This design corresponds more closely to the underlying hardware, as disks are also arranged in fixed-size blocks.
早些时候我们看到的基于日志的索引将数据库拆分为可变大小的段,通常是几兆字节甚至更大,并且总是按顺序写入段。相比之下,B-树将数据库拆分为固定大小的块或页面,传统上为4 KB(有时更大),并一次读取或写入一个页面。这种设计更贴近底层硬件,因为磁盘也是按固定大小的块排列。
Each page can be identified using an address or location, which allows one page to refer to another—similar to a pointer, but on disk instead of in memory. We can use these page references to construct a tree of pages, as illustrated in Figure 3-6 .
每个页面都可以通过地址或位置进行识别,这使得一个页面可以引用另一个页面,类似于指针,但是在磁盘上而不是在内存中。我们可以使用这些页面引用来构建页面树,如图3-6所示。
One page is designated as the root of the B-tree; whenever you want to look up a key in the index, you start here. The page contains several keys and references to child pages. Each child is responsible for a continuous range of keys, and the keys between the references indicate where the boundaries between those ranges lie.
一个页面被指定为B-tree的根;每当你想在索引中查找一个键时,你从这里开始。这个页面包含几个键和指向子页面的引用。每个子页面负责一段连续的键范围,引用之间的键指示这些范围之间的边界在哪里。
In the example in Figure 3-6 , we are looking for the key 251, so we know that we need to follow the page reference between the boundaries 200 and 300. That takes us to a similar-looking page that further breaks down the 200–300 range into subranges. Eventually we get down to a page containing individual keys (a leaf page ), which either contains the value for each key inline or contains references to the pages where the values can be found.
在图3-6的例子中,我们要查找键251,因此我们知道需要跟随边界200和300之间的页面引用。这会把我们带到一个外观类似的页面,它进一步将200到300的范围分解为子范围。最终我们到达一个包含单个键的页面(叶页),该页面要么直接包含每个键对应的值,要么包含指向可以找到这些值的页面的引用。
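The lookup walk just described can be sketched as follows. The `Page` layout here is a toy in-memory stand-in for fixed-size on-disk pages, and the key values mirror the Figure 3-6 example.

```python
import bisect

# Interior pages hold boundary keys and child references; descend into
# the child whose key range covers the target key until reaching a leaf.

class Page:
    def __init__(self, boundaries=None, children=None, leaf_entries=None):
        self.boundaries = boundaries or []   # keys separating child ranges
        self.children = children or []       # len(children) == len(boundaries) + 1
        self.leaf_entries = leaf_entries     # key -> value dict for leaf pages

def btree_lookup(page, key):
    while page.leaf_entries is None:
        # bisect_right picks the child whose range contains `key`
        page = page.children[bisect.bisect_right(page.boundaries, key)]
    return page.leaf_entries.get(key)

leaf1 = Page(leaf_entries={100: "v100", 151: "v151"})
leaf2 = Page(leaf_entries={210: "v210", 251: "v251", 290: "v290"})
leaf3 = Page(leaf_entries={310: "v310"})
root = Page(boundaries=[200, 300], children=[leaf1, leaf2, leaf3])
print(btree_lookup(root, 251))   # "v251": follows the 200-300 reference
```

A real implementation reads each `Page` from a fixed-size disk block identified by its on-disk address, but the descent logic is the same.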
The number of references to child pages in one page of the B-tree is called the branching factor . For example, in Figure 3-6 the branching factor is six. In practice, the branching factor depends on the amount of space required to store the page references and the range boundaries, but typically it is several hundred.
B树中一个页面所包含的对子页面引用的数量称为分支因子。例如,在图3-6中,分支因子是6。实际上,分支因子取决于存储页面引用和范围边界所需的空间,但通常是几百。
If you want to update the value for an existing key in a B-tree, you search for the leaf page containing that key, change the value in that page, and write the page back to disk (any references to that page remain valid). If you want to add a new key, you need to find the page whose range encompasses the new key and add it to that page. If there isn’t enough free space in the page to accommodate the new key, it is split into two half-full pages, and the parent page is updated to account for the new subdivision of key ranges—see Figure 3-7 . ii
如果你想更新B树中某个现有键的值,则需要搜索包含该键的叶页,更改该页中的值,并将该页写回磁盘(对该页的任何引用仍然有效)。如果你想添加一个新键,则需要找到其范围包含新键的页面,并将其添加到该页面。如果页面中没有足够的空闲空间来容纳新键,则该页面会被拆分为两个半满的页面,并更新父页面以反映键范围的新划分,参见图3-7。ii
This algorithm ensures that the tree remains balanced : a B-tree with n keys always has a depth of O (log n ). Most databases can fit into a B-tree that is three or four levels deep, so you don’t need to follow many page references to find the page you are looking for. (A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.)
这种算法确保树保持平衡:具有n个键的B树的深度始终为O(log n)。大多数数据库可以放进一棵三到四层深的B树,因此你不需要跟随很多页面引用就能找到所要查找的页面。(分支因子为500的4 KB页面的四层树可以存储多达256 TB的数据。)
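The capacity and depth claims are easy to check by arithmetic. The sketch below follows the numbers in the text and assumes leaf pages sit at the bottom of four levels of branching.

```python
import math

# Capacity: branching factor 500, four levels, 4 KB pages.
branching = 500
levels = 4
page_size = 4 * 1024                           # 4 KB

capacity = branching ** levels * page_size     # bytes addressable by the tree
print(capacity // 10**12, "TB")                # 256 TB

# Depth grows only logarithmically with the number of keys:
n_keys = 10**9
print(math.ceil(math.log(n_keys, branching)))  # 4 levels suffice for a billion keys
```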
Making B-trees reliable
The basic underlying write operation of a B-tree is to overwrite a page on disk with new data. It is assumed that the overwrite does not change the location of the page; i.e., all references to that page remain intact when the page is overwritten. This is in stark contrast to log-structured indexes such as LSM-trees, which only append to files (and eventually delete obsolete files) but never modify files in place.
B-树的基本底层写操作是使用新数据覆盖磁盘上的页面。假设覆盖不改变页面的位置;即,在覆盖页面时,所有对该页面的引用仍保持不变。这与日志结构化索引(如 LSM-树)形成了鲜明的对比,后者仅追加到文件中(并最终删除过时的文件),但从不直接修改文件。
You can think of overwriting a page on disk as an actual hardware operation. On a magnetic hard drive, this means moving the disk head to the right place, waiting for the right position on the spinning platter to come around, and then overwriting the appropriate sector with new data. On SSDs, what happens is somewhat more complicated, due to the fact that an SSD must erase and rewrite fairly large blocks of a storage chip at a time [ 19 ].
你可以把在磁盘上覆盖一个页面看作一次实际的硬件操作。在磁性硬盘上,这意味着把磁头移动到正确的位置,等待旋转盘片上的正确位置转过来,然后用新数据覆盖相应的扇区。在固态硬盘(SSD)上,发生的事情要复杂一些,因为SSD必须一次擦除并重写存储芯片上相当大的块 [19]。
Moreover, some operations require several different pages to be overwritten. For example, if you split a page because an insertion caused it to be overfull, you need to write the two pages that were split, and also overwrite their parent page to update the references to the two child pages. This is a dangerous operation, because if the database crashes after only some of the pages have been written, you end up with a corrupted index (e.g., there may be an orphan page that is not a child of any parent).
此外,某些操作需要覆盖多个不同的页面。例如,如果因为插入导致页面过满而需要拆分页面,则需要写入拆分后的两个页面,并覆盖其父页面以更新对这两个子页面的引用。这是一个危险的操作,因为如果数据库在只写入了部分页面后崩溃,最终会得到一个损坏的索引(例如,可能出现不属于任何父页面的孤儿页面)。
In order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log ). This is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself. When the database comes back up after a crash, this log is used to restore the B-tree back to a consistent state [ 5 , 20 ].
为了使数据库具有抗崩溃能力,常见的B树实现方式是在磁盘上包含一个额外的数据结构:写前日志(WAL,也称重做日志)。这是一个只追加的文件,每次对B树的修改必须先被写入该文件,然后才能应用于树本身的页面。当数据库在崩溃后重新启动时,使用该日志将B树恢复到一致状态[5,20]。
An additional complication of updating pages in place is that careful concurrency control is required if multiple threads are going to access the B-tree at the same time—otherwise a thread may see the tree in an inconsistent state. This is typically done by protecting the tree’s data structures with latches (lightweight locks). Log-structured approaches are simpler in this regard, because they do all the merging in the background without interfering with incoming queries and atomically swap old segments for new segments from time to time.
在原地更新页面的一个额外复杂性是,如果多个线程同时访问B树,则需要小心的并发控制,否则线程可能会看到树处于不一致状态。这通常通过使用闩(轻量级锁)保护树的数据结构来实现。相比之下,基于日志的方法在这方面更简单,因为它们在后台执行所有合并操作,而不会干扰传入的查询,定期以原子方式交换旧段和新段。
B-tree optimizations
As B-trees have been around for so long, it’s not surprising that many optimizations have been developed over the years. To mention just a few:
由于B树已经存在了很长时间,多年来已开发出许多优化,仅举几例:
-
Instead of overwriting pages and maintaining a WAL for crash recovery, some databases (like LMDB) use a copy-on-write scheme [ 21 ]. A modified page is written to a different location, and a new version of the parent pages in the tree is created, pointing at the new location. This approach is also useful for concurrency control, as we shall see in “Snapshot Isolation and Repeatable Read” .
一些数据库(如LMDB)不覆盖页面,也不维护用于崩溃恢复的WAL,而是使用写时复制方案 [21]:修改后的页面被写入不同的位置,同时在树中创建父页面的新版本,指向新的位置。正如我们将在“快照隔离和可重复读”中看到的,这种方法对于并发控制也很有用。
-
We can save space in pages by not storing the entire key, but abbreviating it. Especially in pages on the interior of the tree, keys only need to provide enough information to act as boundaries between key ranges. Packing more keys into a page allows the tree to have a higher branching factor, and thus fewer levels. iii
我们可以不存储完整的键,而是通过缩写键来节省页面空间。特别是在树内部的页面上,键只需要提供足够的信息来充当键范围之间的边界。把更多的键装进一个页面可以让树有更高的分支因子,从而减少层数。iii
-
In general, pages can be positioned anywhere on disk; there is nothing requiring pages with nearby key ranges to be nearby on disk. If a query needs to scan over a large part of the key range in sorted order, that page-by-page layout can be inefficient, because a disk seek may be required for every page that is read. Many B-tree implementations therefore try to lay out the tree so that leaf pages appear in sequential order on disk. However, it’s difficult to maintain that order as the tree grows. By contrast, since LSM-trees rewrite large segments of the storage in one go during merging, it’s easier for them to keep sequential keys close to each other on disk.
通常,页面可以位于磁盘上的任何位置;没有什么要求键范围相邻的页面在磁盘上也相邻。如果查询需要按排序顺序扫描键范围的大部分,这种逐页的布局可能效率低下,因为每读取一个页面都可能需要一次磁盘寻道。因此,许多B树实现尝试对树进行布局,使叶页面按顺序出现在磁盘上。然而,随着树的增长,维持这种顺序是很困难的。相比之下,由于LSM树在合并期间会一次性重写存储中的大段内容,因此它们更容易让顺序相邻的键在磁盘上也彼此靠近。
-
Additional pointers have been added to the tree. For example, each leaf page may have references to its sibling pages to the left and right, which allows scanning keys in order without jumping back to parent pages.
树上已添加了额外的指针。例如,每个叶页可能都有指向其左侧和右侧的兄弟页的引用,这样可以按顺序扫描键而不跳回父页。
-
B-tree variants such as fractal trees [ 22 ] borrow some log-structured ideas to reduce disk seeks (and they have nothing to do with fractals).
B树的变体,如分形树 [22],借用了一些日志结构的思想来减少磁盘寻道(它们与分形无关)。
Comparing B-Trees and LSM-Trees
Even though B-tree implementations are generally more mature than LSM-tree implementations, LSM-trees are also interesting due to their performance characteristics. As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads [ 23 ]. Reads are typically slower on LSM-trees because they have to check several different data structures and SSTables at different stages of compaction.
尽管B树实现通常比LSM树实现更成熟,但LSM树因其性能特征也很值得关注。按照经验法则,LSM树的写入通常更快,而B树的读取被认为更快 [23]。LSM树上的读取通常较慢,因为它们必须检查几个不同的数据结构,以及处于不同压实阶段的SSTables。
However, benchmarks are often inconclusive and sensitive to details of the workload. You need to test systems with your particular workload in order to make a valid comparison. In this section we will briefly discuss a few things that are worth considering when measuring the performance of a storage engine.
然而,基准测试的结果往往没有定论,并且对工作负载的细节很敏感。你需要用你特定的工作负载来测试系统,才能进行有效的比较。本节我们将简要讨论一些在测量存储引擎性能时值得考虑的事情。
Advantages of LSM-trees
A B-tree index must write every piece of data at least twice: once to the write-ahead log, and once to the tree page itself (and perhaps again as pages are split). There is also overhead from having to write an entire page at a time, even if only a few bytes in that page changed. Some storage engines even overwrite the same page twice in order to avoid ending up with a partially updated page in the event of a power failure [ 24 , 25 ].
B树索引必须把每条数据至少写两次:一次写入写前日志(WAL),一次写入树的页面本身(页面拆分时可能还要再写)。即使页面中只有几个字节发生了变化,也必须一次写入整个页面,这也是一种开销。有些存储引擎甚至会把同一个页面覆盖写两次,以免在断电时留下只更新了一部分的页面 [24, 25]。
Log-structured indexes also rewrite data multiple times due to repeated compaction and merging of SSTables. This effect—one write to the database resulting in multiple writes to the disk over the course of the database’s lifetime—is known as write amplification . It is of particular concern on SSDs, which can only overwrite blocks a limited number of times before wearing out.
由于反复的压实和SSTable合并,日志结构索引也会多次重写数据。这种效应(数据库的一次写入导致在数据库生命周期内对磁盘的多次写入)被称为写放大(write amplification)。在SSD上这一点尤其值得关注,因为SSD的块在磨损报废之前只能被覆盖写有限的次数。
In write-heavy applications, the performance bottleneck might be the rate at which the database can write to disk. In this case, write amplification has a direct performance cost: the more that a storage engine writes to disk, the fewer writes per second it can handle within the available disk bandwidth.
在写入密集的应用中,性能瓶颈可能是数据库写入磁盘的速率。在这种情况下,写放大会带来直接的性能代价:存储引擎写入磁盘的数据越多,它在可用磁盘带宽内每秒能处理的写入就越少。
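The cost is simple arithmetic: the sustainable application-level write rate is the disk's write bandwidth divided by the amplification factor. The numbers below are made up purely for illustration.

```python
# With a fixed disk write bandwidth, every extra rewrite of the same data
# (WAL, page writes, compaction passes) eats into the budget available
# for new application writes.

disk_bandwidth_mb_s = 500          # hypothetical sustained disk write bandwidth
for write_amp in (2, 10, 30):
    app_writes_mb_s = disk_bandwidth_mb_s / write_amp
    print(f"write amplification {write_amp:2d}x -> "
          f"{app_writes_mb_s:.0f} MB/s of application writes")
```

This is why a storage engine with write amplification of 10 can only sustain a tenth of the application throughput that the raw disk bandwidth might suggest.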
Moreover, LSM-trees are typically able to sustain higher write throughput than B-trees, partly because they sometimes have lower write amplification (although this depends on the storage engine configuration and workload), and partly because they sequentially write compact SSTable files rather than having to overwrite several pages in the tree [ 26 ]. This difference is particularly important on magnetic hard drives, where sequential writes are much faster than random writes.
此外,LSM树通常能够支持比B树更高的写入吞吐量,部分原因是它们有时具有较低的写放大(尽管这取决于存储引擎配置和工作负载),部分原因是它们按顺序写入紧凑的SSTable文件,而不必覆盖树中的多个页面 [26]。在磁性硬盘上,这种差异特别重要,因为顺序写入比随机写入快得多。
LSM-trees can be compressed better, and thus often produce smaller files on disk than B-trees. B-tree storage engines leave some disk space unused due to fragmentation: when a page is split or when a row cannot fit into an existing page, some space in a page remains unused. Since LSM-trees are not page-oriented and periodically rewrite SSTables to remove fragmentation, they have lower storage overheads, especially when using leveled compaction [ 27 ].
LSM树可以被更好地压缩,因此通常在磁盘上产生比B树更小的文件。B树存储引擎会因碎片化而留下一些未使用的磁盘空间:当页面被拆分,或者某行无法放入现有页面时,页面中的一些空间就会闲置。由于LSM树不是面向页面的,并且会定期重写SSTables以消除碎片,因此它们的存储开销更低,尤其是在使用分层压实时 [27]。
On many SSDs, the firmware internally uses a log-structured algorithm to turn random writes into sequential writes on the underlying storage chips, so the impact of the storage engine’s write pattern is less pronounced [ 19 ]. However, lower write amplification and reduced fragmentation are still advantageous on SSDs: representing data more compactly allows more read and write requests within the available I/O bandwidth.
在许多SSD上,固件在内部使用日志结构算法,把随机写入转换为对底层存储芯片的顺序写入,因此存储引擎写入模式的影响不太明显 [19]。然而,更低的写放大和更少的碎片在SSD上仍然有利:更紧凑地表示数据,可以在可用的I/O带宽内处理更多的读写请求。
Downsides of LSM-trees
A downside of log-structured storage is that the compaction process can sometimes interfere with the performance of ongoing reads and writes. Even though storage engines try to perform compaction incrementally and without affecting concurrent access, disks have limited resources, so it can easily happen that a request needs to wait while the disk finishes an expensive compaction operation. The impact on throughput and average response time is usually small, but at higher percentiles (see “Describing Performance” ) the response time of queries to log-structured storage engines can sometimes be quite high, and B-trees can be more predictable [ 28 ].
日志结构存储的一个缺点是,压实过程有时会干扰正在进行的读写操作的性能。尽管存储引擎会尝试增量地执行压实,并尽量不影响并发访问,但磁盘资源有限,因此很容易发生请求需要等待磁盘完成一次代价高昂的压实操作的情况。对吞吐量和平均响应时间的影响通常很小,但在较高的百分位(参见“描述性能”)上,对日志结构存储引擎的查询响应时间有时会相当高,而B树的表现则可以更加可预测 [28]。
Another issue with compaction arises at high write throughput: the disk’s finite write bandwidth needs to be shared between the initial write (logging and flushing a memtable to disk) and the compaction threads running in the background. When writing to an empty database, the full disk bandwidth can be used for the initial write, but the bigger the database gets, the more disk bandwidth is required for compaction.
高写入吞吐量会导致压缩的另一个问题:磁盘有限的写入带宽需要在初始写入(将 memtable 记录和刷新到磁盘)和后台运行的压缩线程之间共享。在写入空数据库时,完整的磁盘带宽可以用于初始写入,但是数据库越大,压缩所需的磁盘带宽就越多。
If write throughput is high and compaction is not configured carefully, it can happen that compaction cannot keep up with the rate of incoming writes. In this case, the number of unmerged segments on disk keeps growing until you run out of disk space, and reads also slow down because they need to check more segment files. Typically, SSTable-based storage engines do not throttle the rate of incoming writes, even if compaction cannot keep up, so you need explicit monitoring to detect this situation [ 29 , 30 ].
如果写入吞吐量很高,而压实没有得到仔细的配置,就可能出现压实跟不上写入速率的情况。在这种情况下,磁盘上未合并的段的数量会不断增长,直到磁盘空间耗尽;读取也会变慢,因为需要检查更多的段文件。通常,基于SSTable的存储引擎即使在压实跟不上时也不会限制传入写入的速率,因此需要明确的监控来检测这种情况 [29, 30]。
An advantage of B-trees is that each key exists in exactly one place in the index, whereas a log-structured storage engine may have multiple copies of the same key in different segments. This aspect makes B-trees attractive in databases that want to offer strong transactional semantics: in many relational databases, transaction isolation is implemented using locks on ranges of keys, and in a B-tree index, those locks can be directly attached to the tree [ 5 ]. In Chapter 7 we will discuss this point in more detail.
B树的一个优点是每个键在索引中仅存在一个位置,而日志结构化存储引擎可能会在不同的段中有多个相同的键的副本。这一方面使得B树在那些想要提供强事务语义的数据库中显得有吸引力:在许多关系型数据库中,事务隔离是通过对键范围上的锁进行实现的,在B树索引中,这些锁可以直接附加到树上[5]。在第7章中,我们将更详细地讨论这一点。
B-trees are very ingrained in the architecture of databases and provide consistently good performance for many workloads, so it’s unlikely that they will go away anytime soon. In new datastores, log-structured indexes are becoming increasingly popular. There is no quick and easy rule for determining which type of storage engine is better for your use case, so it is worth testing empirically.
B树已经深深地融入了数据库的架构中,并为许多工作负载提供了持续良好的性能,因此它们不太可能很快被淘汰。在新的数据存储中,基于日志结构的索引(LSM)变得越来越受欢迎。没有一个快速而简单的规则来确定哪种存储引擎更适合您的用例,因此进行经验性测试是值得的。
Other Indexing Structures
So far we have only discussed key-value indexes, which are like a primary key index in the relational model. A primary key uniquely identifies one row in a relational table, or one document in a document database, or one vertex in a graph database. Other records in the database can refer to that row/document/vertex by its primary key (or ID), and the index is used to resolve such references.
到目前为止,我们只讨论了键值索引,这类似于关系模型中的主键索引。主键唯一标识关系表中的一行,文档数据库中的一个文档,或图形数据库中的一个顶点。数据库中的其他记录可以通过它的主键(或ID)引用该行/文档/顶点,索引用于解决这些引用。
It is also very common to have secondary indexes . In relational databases, you can create several secondary indexes on the same table using the CREATE INDEX command, and they are often crucial for performing joins efficiently. For example, in Figure 2-1 in Chapter 2 you would most likely have a secondary index on the user_id columns so that you can find all the rows belonging to the same user in each of the tables.
二级索引(secondary index)也非常常见。在关系型数据库中,你可以使用CREATE INDEX命令在同一个表上创建多个二级索引,它们通常对于高效地执行连接(join)至关重要。例如,对于第2章中的图2-1,你很可能会在user_id列上建立二级索引,以便在每个表中找到属于同一用户的所有行。
A secondary index can easily be constructed from a key-value index. The main difference is that keys are not unique; i.e., there might be many rows (documents, vertices) with the same key. This can be solved in two ways: either by making each value in the index a list of matching row identifiers (like a postings list in a full-text index) or by making each key unique by appending a row identifier to it. Either way, both B-trees and log-structured indexes can be used as secondary indexes.
二级索引可以很容易地从键值索引构建出来。主要区别在于键不是唯一的;也就是说,可能有许多行(文档、顶点)具有相同的键。这可以通过两种方式解决:要么让索引中的每个值成为匹配行标识符的列表(就像全文索引中的倒排列表),要么通过给每个键追加一个行标识符来使其唯一。无论哪种方式,B树和日志结构索引都可以用作二级索引。
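Both options for handling non-unique keys can be shown side by side. The tiny `rows` table and names are illustrative.

```python
# Two ways to index a non-unique column, as described above:
# (a) each index value is a postings list of matching row IDs, or
# (b) each key is made unique by appending the row ID.

rows = [(1, "alice"), (2, "bob"), (3, "alice")]   # (row_id, user_name)

# (a) postings-list style index on user_name
postings = {}
for row_id, name in rows:
    postings.setdefault(name, []).append(row_id)
print(postings["alice"])            # [1, 3]

# (b) unique composite keys; a range scan over the ("alice", *) prefix
# finds the same rows
unique_keys = sorted((name, row_id) for row_id, name in rows)
print([rid for name, rid in unique_keys if name == "alice"])   # [1, 3]
```

Option (b) is attractive for B-trees and SSTables precisely because it reduces the problem back to sorted unique keys, where prefix range scans are already efficient.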
Storing values within the index
The key in an index is the thing that queries search for, but the value can be one of two things: it could be the actual row (document, vertex) in question, or it could be a reference to the row stored elsewhere. In the latter case, the place where rows are stored is known as a heap file , and it stores data in no particular order (it may be append-only, or it may keep track of deleted rows in order to overwrite them with new data later). The heap file approach is common because it avoids duplicating data when multiple secondary indexes are present: each index just references a location in the heap file, and the actual data is kept in one place.
索引中的关键字是查询搜索的内容,但值可以是以下两种之一:它可能是实际的行(文档、顶点),或者是对其他地方存储的行的引用。在后一种情况下,行存储的位置被称为堆文件,它以任意顺序存储数据(可能只是追加,或者可能跟踪已删除的行以便稍后用新数据覆盖它们)。堆文件方法很常见,因为它避免了在存在多个辅助索引时复制数据:每个索引只引用堆文件中的一个位置,而实际数据只保存在一个地方。
When updating a value without changing the key, the heap file approach can be quite efficient: the record can be overwritten in place, provided that the new value is not larger than the old value. The situation is more complicated if the new value is larger, as it probably needs to be moved to a new location in the heap where there is enough space. In that case, either all indexes need to be updated to point at the new heap location of the record, or a forwarding pointer is left behind in the old heap location [ 5 ].
更新值时,如果键不变,堆文件方法可以相当有效:只要新值不比旧值大,记录就可以被原地覆盖。如果新值较大,则情况就更加复杂了,因为它可能需要移动到堆中有足够空间的新位置。在这种情况下,要么更新所有索引以指向记录的新堆位置,要么在旧的堆位置留下转发指针。
In some situations, the extra hop from the index to the heap file is too much of a performance penalty for reads, so it can be desirable to store the indexed row directly within an index. This is known as a clustered index . For example, in MySQL’s InnoDB storage engine, the primary key of a table is always a clustered index, and secondary indexes refer to the primary key (rather than a heap file location) [ 31 ]. In SQL Server, you can specify one clustered index per table [ 32 ].
在某些情况下,从索引到堆文件的额外跳转对读取来说性能损失太大,因此可能希望把被索引的行直接存储在索引中。这被称为聚簇索引(clustered index)。例如,在MySQL的InnoDB存储引擎中,表的主键总是一个聚簇索引,二级索引则引用主键(而不是堆文件中的位置)[31]。在SQL Server中,你可以为每个表指定一个聚簇索引 [32]。
A compromise between a clustered index (storing all row data within the index) and a nonclustered index (storing only references to the data within the index) is known as a covering index or index with included columns , which stores some of a table’s columns within the index [ 33 ]. This allows some queries to be answered by using the index alone (in which case, the index is said to cover the query) [ 32 ].
覆盖索引(covering index)或包含列的索引(index with included columns)是聚簇索引(在索引中存储所有行数据)与非聚簇索引(在索引中只存储数据的引用)之间的一种折中,它在索引中存储表的部分列 [33]。这使得一些查询只靠索引就能得到回答(这种情况下称索引覆盖了该查询)[32]。
As with any kind of duplication of data, clustered and covering indexes can speed up reads, but they require additional storage and can add overhead on writes. Databases also need to go to additional effort to enforce transactional guarantees, because applications should not see inconsistencies due to the duplication.
与任何形式的数据复制一样,聚簇索引和覆盖索引可以加快读取速度,但它们需要额外的存储空间,并且会增加写入的开销。数据库还需要付出额外的努力来保证事务性,因为应用程序不应看到由于复制而导致的不一致。
Multi-column indexes
The indexes discussed so far only map a single key to a value. That is not sufficient if we need to query multiple columns of a table (or multiple fields in a document) simultaneously.
到目前为止,讨论的索引只能将单个键映射到一个值。如果我们需要同时查询表格中的多个列(或文档中的多个字段),这是不够的。
The most common type of multi-column index is called a concatenated index , which simply combines several fields into one key by appending one column to another (the index definition specifies in which order the fields are concatenated). This is like an old-fashioned paper phone book, which provides an index from ( lastname , firstname ) to phone number. Due to the sort order, the index can be used to find all the people with a particular last name, or all the people with a particular lastname-firstname combination. However, the index is useless if you want to find all the people with a particular first name.
最常见的多列索引类型是称为连接索引的索引,它简单地通过将一个列附加到另一个列来将几个字段组合成一个键(索引定义指定了字段连接的顺序)。这就像一个老式的电话簿,它提供了一个从(姓氏,名字)到电话号码的索引。由于排序顺序,索引可用于查找所有具有特定姓氏的人,或所有具有特定姓氏-名字组合的人。然而,如果您要查找所有具有特定名字的人,则索引无用。
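The phone-book analogy translates directly into code: a concatenated index on (lastname, firstname) is just a sort on the combined key, so a given last name maps to a contiguous range, while queries on the first name alone get no help. The entries here are made up.

```python
import bisect

# A concatenated (lastname, firstname) index: sorting on the tuple key
# makes all entries for one last name contiguous, so a prefix query is a
# binary search for the range boundaries.

entries = sorted([
    ("Smith", "Alice", "555-0001"),
    ("Jones", "Carol", "555-0002"),
    ("Smith", "Bob",   "555-0003"),
])

def by_lastname(lastname):
    keys = [(ln, fn) for ln, fn, _ in entries]
    lo = bisect.bisect_left(keys, (lastname, ""))          # start of the prefix range
    hi = bisect.bisect_right(keys, (lastname, "\uffff"))   # end of the prefix range
    return entries[lo:hi]

print(by_lastname("Smith"))
# [('Smith', 'Alice', '555-0001'), ('Smith', 'Bob', '555-0003')]
```

Finding everyone named "Bob" regardless of last name, by contrast, would require scanning all entries, which is exactly the limitation the text describes.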
Multi-dimensional indexes are a more general way of querying several columns at once, which is particularly important for geospatial data. For example, a restaurant-search website may have a database containing the latitude and longitude of each restaurant. When a user is looking at the restaurants on a map, the website needs to search for all the restaurants within the rectangular map area that the user is currently viewing. This requires a two-dimensional range query like the following:
多维索引是一种更通用的同时查询多列的方式,这对于地理空间数据尤其重要。例如,一个餐厅搜索网站可能有一个包含每个餐厅纬度和经度的数据库。当用户在地图上查看餐厅时,网站需要搜索用户当前查看的矩形地图区域内的所有餐厅。这需要一个二维范围查询,如下所示:
SELECT * FROM restaurants WHERE latitude  > 51.4946 AND latitude  < 51.5079
                            AND longitude > -0.1162 AND longitude < -0.1004;
A standard B-tree or LSM-tree index is not able to answer that kind of query efficiently: it can give you either all the restaurants in a range of latitudes (but at any longitude), or all the restaurants in a range of longitudes (but anywhere between the North and South poles), but not both simultaneously.
标准的B树或LSM树索引无法有效地回答这种查询:它可以给出纬度范围内的所有餐厅(但经度可以任意),或者经度范围内的所有餐厅(但在南北极之间的任何地方),但不能同时给出。
One option is to translate a two-dimensional location into a single number using a space-filling curve, and then to use a regular B-tree index [ 34 ]. More commonly, specialized spatial indexes such as R-trees are used. For example, PostGIS implements geospatial indexes as R-trees using PostgreSQL’s Generalized Search Tree indexing facility [ 35 ]. We don’t have space to describe R-trees in detail here, but there is plenty of literature on them.
一种选择是使用空间填充曲线(space-filling curve)将二维位置转换为单个数字,然后使用常规的B树索引 [34]。更常见的是使用专门的空间索引,例如R树。例如,PostGIS使用PostgreSQL的广义搜索树(Generalized Search Tree)索引功能将地理空间索引实现为R树 [35]。我们没有篇幅在这里详细描述R树,但相关文献非常多。
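One common space-filling curve is the Z-order (Morton) curve: interleave the bits of the two coordinates so that points close in 2D tend to get numerically close 1D keys, which an ordinary B-tree can then index. This is a generic sketch of the idea, not how any particular database encodes coordinates.

```python
# Morton code: bit i of x goes to bit 2i of the key, bit i of y to bit
# 2i+1. Coordinates are assumed to be pre-quantized to small non-negative
# grid cell numbers.

def interleave(x, y, bits=16):
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# Nearby grid cells get numerically nearby keys:
print(interleave(0, 0), interleave(1, 0), interleave(0, 1), interleave(1, 1))
# 0 1 2 3
```

A rectangular query then becomes a small number of 1D key ranges on the Morton codes, with a filtering pass to discard the false matches that fall inside the ranges but outside the rectangle.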
An interesting idea is that multi-dimensional indexes are not just for geographic locations. For example, on an ecommerce website you could use a three-dimensional index on the dimensions ( red , green , blue ) to search for products in a certain range of colors, or in a database of weather observations you could have a two-dimensional index on ( date , temperature ) in order to efficiently search for all the observations during the year 2013 where the temperature was between 25 and 30℃. With a one-dimensional index, you would have to either scan over all the records from 2013 (regardless of temperature) and then filter them by temperature, or vice versa. A 2D index could narrow down by timestamp and temperature simultaneously. This technique is used by HyperDex [ 36 ].
一个有趣的想法是,多维索引不仅仅用于地理位置。例如,在电子商务网站上,你可以用一个基于(红, 绿, 蓝)三个维度的三维索引来搜索特定颜色范围内的产品;在天气观测数据库中,你可以用一个基于(日期, 温度)的二维索引来高效地搜索2013年中温度介于25℃到30℃之间的所有观测记录。使用一维索引,你要么必须扫描2013年的所有记录(不管温度如何)再按温度过滤,要么反过来。而二维索引可以同时按时间戳和温度缩小范围。HyperDex使用了这种技术 [36]。
Full-text search and fuzzy indexes
All the indexes discussed so far assume that you have exact data and allow you to query for exact values of a key, or a range of values of a key with a sort order. What they don’t allow you to do is search for similar keys, such as misspelled words. Such fuzzy querying requires different techniques.
到目前为止讨论的所有索引都假定您具有精确数据,并允许您查询某个关键字的精确值或具有排序顺序的关键字值范围。但它们无法让你搜索相似的关键字,比如拼写错误的单词。这种模糊查询需要不同的技术。
For example, full-text search engines commonly allow a search for one word to be expanded to include synonyms of the word, to ignore grammatical variations of words, and to search for occurrences of words near each other in the same document, and support various other features that depend on linguistic analysis of the text. To cope with typos in documents or queries, Lucene is able to search text for words within a certain edit distance (an edit distance of 1 means that one letter has been added, removed, or replaced) [ 37 ].
例如,全文搜索引擎通常允许将一个单词的搜索扩展到包括该词的同义词,忽略单词的语法变化,并在同一文档中搜索单词出现在彼此附近的情况,并支持其他各种依赖于文本语言分析的功能。为了处理文档或查询中的打字错误,Lucene能够在一定的编辑距离内搜索文字(编辑距离为1表示添加、删除或替换一个字母)[37]。
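Lucene's actual implementation uses a Levenshtein automaton rather than dynamic programming, but the edit-distance metric itself can be sketched with the classic DP algorithm (a hypothetical helper, not Lucene code):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute (or match)
        prev = cur
    return prev[-1]
```

A fuzzy query with max edit distance 1 would then match every indexed term t with edit_distance(query, t) <= 1.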
As mentioned in “Making an LSM-tree out of SSTables” , Lucene uses a SSTable-like structure for its term dictionary. This structure requires a small in-memory index that tells queries at which offset in the sorted file they need to look for a key. In LevelDB, this in-memory index is a sparse collection of some of the keys, but in Lucene, the in-memory index is a finite state automaton over the characters in the keys, similar to a trie [ 38 ]. This automaton can be transformed into a Levenshtein automaton , which supports efficient search for words within a given edit distance [ 39 ].
在“用SSTable构建LSM树”中提到,Lucene为其词项字典使用了类似SSTable的结构。该结构需要一个小的内存索引,告诉查询需要在已排序文件中的哪个偏移量处查找某个键。在LevelDB中,这个内存索引是部分键的稀疏集合;而在Lucene中,内存索引是键中字符上的有限状态自动机,类似于trie[38]。这个自动机可以转换为Levenshtein自动机,支持在给定编辑距离内高效地搜索单词[39]。
Other fuzzy search techniques go in the direction of document classification and machine learning. See an information retrieval textbook for more detail [e.g., 40 ].
其他模糊搜索技术则朝着文档分类和机器学习的方向发展。有关更多细节,请参阅信息检索教材[例如,40]。
Keeping everything in memory
The data structures discussed so far in this chapter have all been answers to the limitations of disks. Compared to main memory, disks are awkward to deal with. With both magnetic disks and SSDs, data on disk needs to be laid out carefully if you want good performance on reads and writes. However, we tolerate this awkwardness because disks have two significant advantages: they are durable (their contents are not lost if the power is turned off), and they have a lower cost per gigabyte than RAM.
本章到目前为止讨论的数据结构都是为应对磁盘的限制而设计的。与主存相比,磁盘处理起来比较麻烦。无论是磁盘还是SSD,如果想要良好的读写性能,都需要仔细安排数据在磁盘上的布局。但我们容忍这种麻烦,是因为磁盘有两个重要的优点:它们是持久的(断电后内容不会丢失),而且每GB的成本比RAM低。
As RAM becomes cheaper, the cost-per-gigabyte argument is eroded. Many datasets are simply not that big, so it’s quite feasible to keep them entirely in memory, potentially distributed across several machines. This has led to the development of in-memory databases .
随着RAM变得越来越便宜,每GB成本的论据被逐渐削弱。许多数据集其实没有那么大,把它们完全保存在内存中是完全可行的,还可以分布在多台机器上。这推动了内存数据库的发展。
Some in-memory key-value stores, such as Memcached, are intended for caching use only, where it’s acceptable for data to be lost if a machine is restarted. But other in-memory databases aim for durability, which can be achieved with special hardware (such as battery-powered RAM), by writing a log of changes to disk, by writing periodic snapshots to disk, or by replicating the in-memory state to other machines.
一些内存键值存储,如Memcached,仅用于缓存,如果机器重新启动,数据丢失是可以接受的。但是,其他内存数据库的目标是实现持久性,可以通过特殊硬件(例如电池供电的RAM),将更改日志写入磁盘,定期将快照写入磁盘或将内存状态复制到其他机器来实现。
When an in-memory database is restarted, it needs to reload its state, either from disk or over the network from a replica (unless special hardware is used). Despite writing to disk, it’s still an in-memory database, because the disk is merely used as an append-only log for durability, and reads are served entirely from memory. Writing to disk also has operational advantages: files on disk can easily be backed up, inspected, and analyzed by external utilities.
当内存数据库重新启动时,它需要从磁盘或通过网络从副本重新加载其状态(除非使用特殊硬件)。尽管会写入磁盘,它仍然是内存数据库,因为磁盘仅用作持久性的追加日志,而读取完全由内存提供。写入磁盘还有运维上的优势:磁盘上的文件可以方便地由外部工具进行备份、检查和分析。
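A toy sketch of this idea in Python (class and file names are invented for illustration): every write goes to an in-memory dict and is also appended to a log file, so a restarted instance can rebuild its state by replaying the log, while reads never touch the disk.

```python
import json
import os

class DurableDict:
    """Toy in-memory key-value store with an append-only log for durability.
    Reads are served entirely from memory; on startup, the log is replayed
    to reconstruct the in-memory state (illustrative sketch only)."""

    def __init__(self, log_path: str):
        self.log_path = log_path
        self.data = {}
        if os.path.exists(log_path):            # restart: replay the log
            with open(log_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.data[key] = value

    def set(self, key, value):
        with open(self.log_path, "a") as f:     # durability: append to log first
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.data[key] = value                  # then update the in-memory state

    def get(self, key):
        return self.data.get(key)               # reads never touch the disk
```

Periodic snapshots (not shown) would let the log be truncated so replay stays fast.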
Products such as VoltDB, MemSQL, and Oracle TimesTen are in-memory databases with a relational model, and the vendors claim that they can offer big performance improvements by removing all the overheads associated with managing on-disk data structures [ 41 , 42 ]. RAMCloud is an open source, in-memory key-value store with durability (using a log-structured approach for the data in memory as well as the data on disk) [ 43 ]. Redis and Couchbase provide weak durability by writing to disk asynchronously.
像VoltDB、MemSQL和Oracle TimesTen这样的产品是具有关系模型的内存数据库,供应商声称,通过消除与管理磁盘数据结构相关的所有开销,它们可以提供巨大的性能提升[41,42]。RAMCloud是一个开源的内存键值存储,具有持久性(对内存中的数据以及磁盘上的数据都采用日志结构方法)[43]。Redis和Couchbase通过异步写入磁盘提供较弱的持久性。
Counterintuitively, the performance advantage of in-memory databases is not due to the fact that they don’t need to read from disk. Even a disk-based storage engine may never need to read from disk if you have enough memory, because the operating system caches recently used disk blocks in memory anyway. Rather, they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk [ 44 ].
与直觉相反,内存数据库的性能优势并不在于它们不需要从磁盘读取。即使是基于磁盘的存储引擎,只要有足够的内存,也可能永远不需要从磁盘读取,因为操作系统无论如何都会在内存中缓存最近使用过的磁盘块。相反,它们可以更快,是因为它们避免了将内存数据结构编码为可写入磁盘的格式的开销[44]。
Besides performance, another interesting area for in-memory databases is providing data models that are difficult to implement with disk-based indexes. For example, Redis offers a database-like interface to various data structures such as priority queues and sets. Because it keeps all data in memory, its implementation is comparatively simple.
除了性能之外,内存数据库的另一个有趣领域是提供难以用基于磁盘的索引实现的数据模型。例如,Redis为各种数据结构(如优先级队列和集合)提供了类似数据库的接口。由于它将所有数据保存在内存中,其实现相对简单。
Recent research indicates that an in-memory database architecture could be extended to support datasets larger than the available memory, without bringing back the overheads of a disk-centric architecture [ 45 ]. The so-called anti-caching approach works by evicting the least recently used data from memory to disk when there is not enough memory, and loading it back into memory when it is accessed again in the future. This is similar to what operating systems do with virtual memory and swap files, but the database can manage memory more efficiently than the OS, as it can work at the granularity of individual records rather than entire memory pages. This approach still requires indexes to fit entirely in memory, though (like the Bitcask example at the beginning of the chapter).
最近的研究表明,内存数据库架构可以扩展以支持比可用内存更大的数据集,而不会重新引入以磁盘为中心的架构的开销[45]。所谓的反缓存(anti-caching)方法的工作原理是:当内存不足时,将最近最少使用的数据从内存驱逐到磁盘,并在将来再次被访问时重新加载到内存中。这类似于操作系统使用虚拟内存和交换文件的方式,但数据库可以比操作系统更有效地管理内存,因为它可以在单条记录的粒度上工作,而不是整个内存页。不过,这种方法仍然要求索引完全放得进内存(就像本章开头的Bitcask示例一样)。
Further changes to storage engine design will probably be needed if non-volatile memory (NVM) technologies become more widely adopted [ 46 ]. At present, this is a new area of research, but it is worth keeping an eye on in the future.
如果非易失性存储器(NVM)技术得到更广泛的应用,存储引擎设计可能需要进一步改变。目前,这是一个新的研究领域,但未来值得注意。
Transaction Processing or Analytics?
In the early days of business data processing, a write to the database typically corresponded to a commercial transaction taking place: making a sale, placing an order with a supplier, paying an employee’s salary, etc. As databases expanded into areas that didn’t involve money changing hands, the term transaction nevertheless stuck, referring to a group of reads and writes that form a logical unit.
在商业数据处理的早期,对数据库的写入通常对应于商业交易的发生:销售商品,向供应商下订单,支付员工工资等。随着数据库扩展到不涉及资金流动的领域,事务这个术语仍然保持着,指的是形成逻辑单元的一组读写操作。
Note
A transaction needn’t necessarily have ACID (atomicity, consistency, isolation, and durability) properties. Transaction processing just means allowing clients to make low-latency reads and writes—as opposed to batch processing jobs, which only run periodically (for example, once per day). We discuss the ACID properties in Chapter 7 and batch processing in Chapter 10 .
事务不一定需要具有ACID(原子性、一致性、隔离性、持久性)属性。事务处理只是意味着允许客户端进行低延迟的读取和写入,而不同于只能周期性运行的批处理作业(例如,每天一次)。我们在第7章中讨论ACID属性,在第10章中讨论批处理。
Even though databases started being used for many different kinds of data—comments on blog posts, actions in a game, contacts in an address book, etc.—the basic access pattern remained similar to processing business transactions. An application typically looks up a small number of records by some key, using an index. Records are inserted or updated based on the user’s input. Because these applications are interactive, the access pattern became known as online transaction processing (OLTP).
尽管数据库开始用于许多不同类型的数据-博客评论、游戏中的活动、地址簿中的联系人等-但基本的访问模式与处理业务交易类似。应用程序通常通过索引按某个键查找少量记录。根据用户的输入插入或更新记录。由于这些应用程序是交互式的,因此访问模式被称为在线事务处理(OLTP)。
However, databases also started being increasingly used for data analytics , which has very different access patterns. Usually an analytic query needs to scan over a huge number of records, only reading a few columns per record, and calculates aggregate statistics (such as count, sum, or average) rather than returning the raw data to the user. For example, if your data is a table of sales transactions, then analytic queries might be:
然而,数据库也开始越来越多地用于数据分析,其访问模式也有很大差别。通常情况下,分析查询需要扫描大量的记录,每个记录只阅读少数列,并计算聚合统计信息(如计数、求和或平均数),而不是将原始数据返回给用户。例如,如果您的数据是销售交易表,则分析查询可能包括:
-
What was the total revenue of each of our stores in January?
我们每家店在一月份的总收入是多少?
-
How many more bananas than usual did we sell during our latest promotion?
我们最近促销期间卖了比平常多多少香蕉?
-
Which brand of baby food is most often purchased together with brand X diapers?
哪个品牌的婴儿食品最常与品牌 X 的尿布一起购买?
These queries are often written by business analysts, and feed into reports that help the management of a company make better decisions ( business intelligence ). In order to differentiate this pattern of using databases from transaction processing, it has been called online analytic processing (OLAP) [ 47 ]. The difference between OLTP and OLAP is not always clear-cut, but some typical characteristics are listed in Table 3-1 .
这些查询通常由业务分析师编写,并用于生成帮助公司管理层做出更好决策的报告(商业智能)。为了将这种数据库使用模式与事务处理区分开来,它被称为在线分析处理(OLAP)[47]。OLTP和OLAP之间的区别并不总是泾渭分明,但一些典型特征列在表3-1中。
Property 属性 | Transaction processing systems (OLTP) 事务处理系统(OLTP) | Analytic systems (OLAP) 分析系统(OLAP)
---|---|---
Main read pattern 主要读取模式 | Small number of records per query, fetched by key 每次查询按键获取少量记录 | Aggregate over large number of records 对大量记录进行聚合
Main write pattern 主要写入模式 | Random-access, low-latency writes from user input 来自用户输入的随机访问、低延迟写入 | Bulk import (ETL) or event stream 批量导入(ETL)或事件流
Primarily used by 主要使用者 | End user/customer, via web application 终端用户/客户,通过Web应用程序 | Internal analyst, for decision support 内部分析师,用于决策支持
What data represents 数据代表什么 | Latest state of data (current point in time) 数据的最新状态(当前时间点) | History of events that happened over time 随时间发生的事件历史
Dataset size 数据集大小 | Gigabytes to terabytes 千兆字节到太字节 | Terabytes to petabytes 太字节到拍字节
At first, the same databases were used for both transaction processing and analytic queries. SQL turned out to be quite flexible in this regard: it works well for OLTP-type queries as well as OLAP-type queries. Nevertheless, in the late 1980s and early 1990s, there was a trend for companies to stop using their OLTP systems for analytics purposes, and to run the analytics on a separate database instead. This separate database was called a data warehouse .
最初,事务处理和分析查询使用相同的数据库。SQL在这方面相当灵活:它既适用于OLTP类型的查询,也适用于OLAP类型的查询。然而,在20世纪80年代末和90年代初,出现了一种趋势:公司停止使用其OLTP系统进行分析,转而在单独的数据库上运行分析。这个单独的数据库被称为数据仓库(data warehouse)。
Data Warehousing
An enterprise may have dozens of different transaction processing systems: systems powering the customer-facing website, controlling point of sale (checkout) systems in physical stores, tracking inventory in warehouses, planning routes for vehicles, managing suppliers, administering employees, etc. Each of these systems is complex and needs a team of people to maintain it, so the systems end up operating mostly autonomously from each other.
企业可能有几十个不同的事务处理系统:支撑面向客户网站的系统、控制实体店销售点(结账)系统的系统、跟踪仓库库存、规划车辆路线、管理供应商、管理员工的系统等等。每个系统都很复杂,需要一个团队来维护,所以这些系统最终大多彼此独立运行。
These OLTP systems are usually expected to be highly available and to process transactions with low latency, since they are often critical to the operation of the business. Database administrators therefore closely guard their OLTP databases. They are usually reluctant to let business analysts run ad hoc analytic queries on an OLTP database, since those queries are often expensive, scanning large parts of the dataset, which can harm the performance of concurrently executing transactions.
这些OLTP系统通常被期望具有高可用性,并以低延迟方式处理事务,因为它们通常对业务操作至关重要。因此,数据库管理员严密保管他们的OLTP数据库。他们通常不愿意让业务分析师在OLTP数据库上运行即席分析查询,因为这些查询通常很昂贵,需要扫描数据集的大部分内容,这可能会损害同时执行的事务的性能。
A data warehouse , by contrast, is a separate database that analysts can query to their hearts’ content, without affecting OLTP operations [ 48 ]. The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company. Data is extracted from OLTP databases (using either a periodic data dump or a continuous stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into the data warehouse. This process of getting data into the warehouse is known as Extract–Transform–Load (ETL) and is illustrated in Figure 3-8 .
相比之下,数据仓库是一个单独的数据库,分析师可以随心所欲地查询,而不会影响OLTP操作[48]。数据仓库包含公司所有各种OLTP系统中数据的只读副本。数据从OLTP数据库中提取(使用定期数据转储或持续的更新流),转换为便于分析的模式,清理后加载到数据仓库中。这个将数据导入仓库的过程称为抽取-转换-加载(Extract–Transform–Load, ETL),如图3-8所示。
Data warehouses now exist in almost all large enterprises, but in small companies they are almost unheard of. This is probably because most small companies don’t have so many different OLTP systems, and most small companies have a small amount of data—small enough that it can be queried in a conventional SQL database, or even analyzed in a spreadsheet. In a large company, a lot of heavy lifting is required to do something that is simple in a small company.
数据仓库现在几乎在所有大型企业中存在,但在小型公司中,它们几乎是不为人知的。这可能是因为大多数小型公司没有太多不同的OLTP系统,并且大多数小型公司具有较少的数据量-足够小,可以在传统的SQL数据库中查询,甚至可以在电子表格中进行分析。在大型公司中,需要进行大量的工作才能完成小型公司中的简单任务。
A big advantage of using a separate data warehouse, rather than querying OLTP systems directly for analytics, is that the data warehouse can be optimized for analytic access patterns. It turns out that the indexing algorithms discussed in the first half of this chapter work well for OLTP, but are not very good at answering analytic queries. In the rest of this chapter we will look at storage engines that are optimized for analytics instead.
使用单独的数据仓库进行分析的一个巨大优势是,数据仓库可以为分析访问模式进行优化。事实证明,在本章的前半部分讨论的索引算法适用于 OLTP,但不适合回答分析查询。在本章的其余部分中,我们将研究专为分析而优化的存储引擎。
The divergence between OLTP databases and data warehouses
The data model of a data warehouse is most commonly relational, because SQL is generally a good fit for analytic queries. There are many graphical data analysis tools that generate SQL queries, visualize the results, and allow analysts to explore the data (through operations such as drill-down and slicing and dicing ).
数据仓库的数据模型通常是关系型的,因为SQL通常非常适合分析查询。有许多图形化数据分析工具可以生成SQL查询、可视化结果,并允许分析师探索数据(通过下钻、切片和切块等操作)。
On the surface, a data warehouse and a relational OLTP database look similar, because they both have a SQL query interface. However, the internals of the systems can look quite different, because they are optimized for very different query patterns. Many database vendors now focus on supporting either transaction processing or analytics workloads, but not both.
表面上,数据仓库和关系型OLTP数据库看起来很相似,因为它们都有SQL查询界面。然而,系统内部可能会看起来非常不同,因为它们针对非常不同的查询模式进行了优化。许多数据库供应商现在专注于支持事务处理或分析工作负载,但不支持两者兼备。
Some databases, such as Microsoft SQL Server and SAP HANA, have support for transaction processing and data warehousing in the same product. However, they are increasingly becoming two separate storage and query engines, which happen to be accessible through a common SQL interface [ 49 , 50 , 51 ].
一些数据库,例如Microsoft SQL Server和SAP HANA,支持在同一产品中进行事务处理和数据仓库。然而,它们越来越成为两个独立的存储和查询引擎,通过共同的SQL接口访问[49,50,51]。
Data warehouse vendors such as Teradata, Vertica, SAP HANA, and ParAccel typically sell their systems under expensive commercial licenses. Amazon RedShift is a hosted version of ParAccel. More recently, a plethora of open source SQL-on-Hadoop projects have emerged; they are young but aiming to compete with commercial data warehouse systems. These include Apache Hive, Spark SQL, Cloudera Impala, Facebook Presto, Apache Tajo, and Apache Drill [ 52 , 53 ]. Some of them are based on ideas from Google’s Dremel [ 54 ].
数据仓库供应商,如Teradata、Vertica、SAP HANA和ParAccel通常以昂贵的商业许可证销售其系统。Amazon RedShift是ParAccel的托管版本。最近,涌现了许多开源的基于Hadoop的SQL项目,它们年轻但旨在与商业数据仓库系统竞争。其中包括Apache Hive、Spark SQL、Cloudera Impala、Facebook Presto、Apache Tajo和Apache Drill [52,53]。其中一些基于Google的Dremel的想法[54]。
Stars and Snowflakes: Schemas for Analytics
As explored in Chapter 2 , a wide range of different data models are used in the realm of transaction processing, depending on the needs of the application. On the other hand, in analytics, there is much less diversity of data models. Many data warehouses are used in a fairly formulaic style, known as a star schema (also known as dimensional modeling [ 55 ]).
正如第2章所探讨的,事务处理领域中使用了各种不同的数据模型,具体取决于应用程序的需求。另一方面,在分析领域中,数据模型的多样性要少得多。许多数据仓库采用了一种相当公式化的样式,称为星型模式(也称为维度建模)。
The example schema in Figure 3-9 shows a data warehouse that might be found at a grocery retailer. At the center of the schema is a so-called fact table (in this example, it is called fact_sales ). Each row of the fact table represents an event that occurred at a particular time (here, each row represents a customer’s purchase of a product). If we were analyzing website traffic rather than retail sales, each row might represent a page view or a click by a user.
图3-9中的示例模式显示了一个可能在杂货零售商处找到的数据仓库。模式的中心是所谓的事实表(在本例中,它被称为fact_sales)。事实表的每一行表示发生在特定时间的事件(在这里,每一行表示客户购买产品)。如果我们分析的是网站流量而不是零售销售,每行可能表示用户的页面视图或点击。
Usually, facts are captured as individual events, because this allows maximum flexibility of analysis later. However, this means that the fact table can become extremely large. A big enterprise like Apple, Walmart, or eBay may have tens of petabytes of transaction history in its data warehouse, most of which is in fact tables [ 56 ].
通常,事实被捕捉为单独的事件,因为这允许以后进行最大程度的灵活分析。然而,这意味着事实表可能变得非常庞大。像苹果、沃尔玛或eBay这样的大企业,其数据仓库中可能有数十PB的交易历史,其中大部分在事实表中[56]。
Some of the columns in the fact table are attributes, such as the price at which the product was sold and the cost of buying it from the supplier (allowing the profit margin to be calculated). Other columns in the fact table are foreign key references to other tables, called dimension tables . As each row in the fact table represents an event, the dimensions represent the who , what , where , when , how , and why of the event.
事实表中的某些列是属性,例如产品销售的价格以及从供应商购买产品的成本(可计算利润率)。事实表中的其他列是指向其他表的外键引用,称为维度表。由于事实表中的每行代表一个事件,因此维度代表事件的谁,何物,何处,何时,如何和为什么。
For example, in Figure 3-9 , one of the dimensions is the product that was sold. Each row in the dim_product table represents one type of product that is for sale, including its stock-keeping unit (SKU), description, brand name, category, fat content, package size, etc. Each row in the fact_sales table uses a foreign key to indicate which product was sold in that particular transaction. (For simplicity, if the customer buys several different products at once, they are represented as separate rows in the fact table.)
例如,在图3-9中,一个维度是已售出的产品。dim_product表中的每一行代表一种销售产品,包括它的库存保持单位(SKU),描述,品牌名称,类别,脂肪含量,包装大小等。fact_sales表中的每一行使用外键表示在该特定交易中售出的产品。(为简单起见,如果客户同时购买多种不同产品,则在事实表中表示为单独的行。)
Even date and time are often represented using dimension tables, because this allows additional information about dates (such as public holidays) to be encoded, allowing queries to differentiate between sales on holidays and non-holidays.
日期和时间经常使用维度表来代表,因为这可以将有关日期的其他信息(例如公共假期)进行编码,从而使查询能够区别节假日和非节假日的销售。
The name “star schema” comes from the fact that when the table relationships are visualized, the fact table is in the middle, surrounded by its dimension tables; the connections to these tables are like the rays of a star.
“星型模式”这个名字来源于:当表关系可视化时,事实表位于中间,被维度表包围;与这些表的连接就像星星的光芒。
A variation of this template is known as the snowflake schema , where dimensions are further broken down into subdimensions. For example, there could be separate tables for brands and product categories, and each row in the dim_product table could reference the brand and category as foreign keys, rather than storing them as strings in the dim_product table. Snowflake schemas are more normalized than star schemas, but star schemas are often preferred because they are simpler for analysts to work with [ 55 ].
这个模板的一个变体称为雪花模式,其中维度被进一步分解为子维度。例如,品牌和产品类别可以有单独的表,并且dim_product表中的每一行都可以引用品牌和类别作为外键,而不是在dim_product表中将它们存储为字符串。雪花模式比星形模式更规范化,但星形模式通常更受分析师的欢迎,因为它们更简单[55]。
In a typical data warehouse, tables are often very wide: fact tables often have over 100 columns, sometimes several hundred [ 51 ]. Dimension tables can also be very wide, as they include all the metadata that may be relevant for analysis—for example, the dim_store table may include details of which services are offered at each store, whether it has an in-store bakery, the square footage, the date when the store was first opened, when it was last remodeled, how far it is from the nearest highway, etc.
在典型的数据仓库中,表通常非常宽:事实表通常拥有超过100个列,有时甚至几百个 [51]。维度表也可以非常宽,因为它们包括可能与分析相关的所有元数据 - 例如,dim_store表可能包括每个店铺提供的服务细节,是否有店内面包房,面积大小,店铺开业日期,上次翻新日期,距离最近的高速公路有多远等。
Column-Oriented Storage
If you have trillions of rows and petabytes of data in your fact tables, storing and querying them efficiently becomes a challenging problem. Dimension tables are usually much smaller (millions of rows), so in this section we will concentrate primarily on storage of facts.
如果您的事实表中有数万亿行和数PB的数据,高效地存储和查询它们成为一个具有挑战性的问题。维度表通常要小得多(数百万行),因此在本节中,我们将主要关注事实的存储。
Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4 or 5 of them at one time ( "SELECT *" queries are rarely needed for analytics) [ 51 ]. Take the query in Example 3-1 : it accesses a large number of rows (every occurrence of someone buying fruit or candy during the 2013 calendar year), but it only needs to access three columns of the fact_sales table: date_key , product_sk , and quantity . The query ignores all other columns.
虽然事实表通常有超过100列,但典型的数据仓库查询一次只访问4或5列(很少需要“SELECT *”查询进行分析)。在示例3-1中的查询中,虽然它访问了大量的行(每个人在2013年购买水果或糖果的每个事件),但它仅需要访问fact_sales表的三个列:date_key、product_sk和quantity。该查询忽略了所有其他列。
Example 3-1. Analyzing whether people are more inclined to buy fresh fruit or candy, depending on the day of the week
SELECT
  dim_date.weekday,
  dim_product.category,
  SUM(fact_sales.quantity) AS quantity_sold
FROM fact_sales
  JOIN dim_date    ON fact_sales.date_key   = dim_date.date_key
  JOIN dim_product ON fact_sales.product_sk = dim_product.product_sk
WHERE
  dim_date.year = 2013
  AND dim_product.category IN ('Fresh fruit', 'Candy')
GROUP BY
  dim_date.weekday,
  dim_product.category;
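To make the example concrete, here is a sketch that runs essentially this query end-to-end using Python's built-in SQLite (table and column names follow Example 3-1; the handful of rows are invented for illustration):

```python
import sqlite3

# Build a tiny star schema in an in-memory SQLite database:
# one fact table surrounded by two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, weekday TEXT, year INTEGER);
  CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, category TEXT);
  CREATE TABLE fact_sales  (date_key INTEGER, product_sk INTEGER, quantity INTEGER);
  INSERT INTO dim_date    VALUES (1, 'Mon', 2013), (2, 'Sat', 2013);
  INSERT INTO dim_product VALUES (30, 'Fresh fruit'), (68, 'Candy'), (69, 'Candy');
  INSERT INTO fact_sales  VALUES (1, 30, 5), (2, 68, 2), (2, 69, 3), (2, 30, 1);
""")

# The query from Example 3-1: join the fact table to both dimensions,
# filter on dimension attributes, and aggregate the quantity.
rows = conn.execute("""
  SELECT dim_date.weekday, dim_product.category,
         SUM(fact_sales.quantity) AS quantity_sold
  FROM fact_sales
    JOIN dim_date    ON fact_sales.date_key   = dim_date.date_key
    JOIN dim_product ON fact_sales.product_sk = dim_product.product_sk
  WHERE dim_date.year = 2013
    AND dim_product.category IN ('Fresh fruit', 'Candy')
  GROUP BY dim_date.weekday, dim_product.category
""").fetchall()
```

SQLite is a row-oriented OLTP engine, so this shows the query's semantics; the sections that follow discuss how an analytics-oriented storage engine would execute it efficiently.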
How can we execute this query efficiently?
如何有效地执行这个查询?
In most OLTP databases, storage is laid out in a row-oriented fashion: all the values from one row of a table are stored next to each other. Document databases are similar: an entire document is typically stored as one contiguous sequence of bytes. You can see this in the CSV example of Figure 3-1 .
在大多数OLTP数据库中,存储是以面向行的方式布局的:表中一行的所有值都彼此相邻存储。文档数据库与之类似:整个文档通常存储为一个连续的字节序列。您可以在图3-1的CSV示例中看到这一点。
In order to process a query like Example 3-1 , you may have indexes on fact_sales.date_key and/or fact_sales.product_sk that tell the storage engine where to find all the sales for a particular date or for a particular product. But then, a row-oriented storage engine still needs to load all of those rows (each consisting of over 100 attributes) from disk into memory, parse them, and filter out those that don’t meet the required conditions. That can take a long time.
为了处理类似示例3-1的查询,您可能在 fact_sales.date_key 和/或 fact_sales.product_sk 上建有索引,告诉存储引擎在哪里找到特定日期或特定产品的所有销售记录。但是,面向行的存储引擎仍然需要将所有这些行(每行包含100多个属性)从磁盘加载到内存中,解析它们,并过滤掉不满足条件的行。这可能需要很长时间。
The idea behind column-oriented storage is simple: don’t store all the values from one row together, but store all the values from each column together instead. If each column is stored in a separate file, a query only needs to read and parse those columns that are used in that query, which can save a lot of work. This principle is illustrated in Figure 3-10 .
列式存储的想法很简单:不要把一行中的所有值都存放在一起,而是把每一列的值都存放在一起。如果每一列都存储在不同的文件中,查询只需要读取和解析其中用到的列,这样可以节省很多工作量。这个原则在图3-10中有所体现。
Note
Column storage is easiest to understand in a relational data model, but it applies equally to nonrelational data. For example, Parquet [ 57 ] is a columnar storage format that supports a document data model, based on Google’s Dremel [ 54 ].
列存储在关系数据模型中最容易理解,但它同样适用于非关系数据。例如,Parquet[57]是一种支持文档数据模型的列式存储格式,基于Google的Dremel[54]。
The column-oriented storage layout relies on each column file containing the rows in the same order. Thus, if you need to reassemble an entire row, you can take the 23rd entry from each of the individual column files and put them together to form the 23rd row of the table.
列式存储布局依赖于每个列文件中行的顺序相同。因此,如果您需要重新组合整个行,您可以从每个单独的列文件中取出第23个条目并将它们放在一起形成表的第23行。
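A minimal sketch of that layout (plain Python lists stand in for the column files; the values are invented): each column is stored separately, all in the same row order, so row n can be reassembled by taking the n-th entry of every column.

```python
# Column-oriented layout: one sequence per column, sharing a single row order.
columns = {
    "date_key":   [140101, 140101, 140102, 140102],
    "product_sk": [69, 30, 30, 31],
    "quantity":   [1, 3, 2, 8],
}

def read_row(columns, n):
    """Reassemble row n by taking the n-th entry of every column file."""
    return {name: values[n] for name, values in columns.items()}
```

A query touching only quantity reads just that one list and never parses the other columns.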
Column Compression
Besides only loading those columns from disk that are required for a query, we can further reduce the demands on disk throughput by compressing data. Fortunately, column-oriented storage often lends itself very well to compression.
除了只从磁盘加载查询所需的列之外,我们可以通过压缩数据进一步减少对磁盘吞吐量的需求。幸运的是,列式存储通常非常适合压缩。
Take a look at the sequences of values for each column in Figure 3-10 : they often look quite repetitive, which is a good sign for compression. Depending on the data in the column, different compression techniques can be used. One technique that is particularly effective in data warehouses is bitmap encoding , illustrated in Figure 3-11 .
看一下图3-10中每列的值序列:它们常常看起来非常重复,这对于压缩来说是一个好的迹象。根据列中的数据,可以使用不同的压缩技术。在数据仓库中特别有效的一种技术是位图编码,如图3-11所示。
Often, the number of distinct values in a column is small compared to the number of rows (for example, a retailer may have billions of sales transactions, but only 100,000 distinct products). We can now take a column with n distinct values and turn it into n separate bitmaps: one bitmap for each distinct value, with one bit for each row. The bit is 1 if the row has that value, and 0 if not.
通常,列中有不同值的数量相对于行数来说很小(例如,零售商可能有数十亿的销售交易,但只有10万个不同的产品)。我们现在可以将具有n个不同值的列转换为n个单独的位图:每个不同值都有一个位图,每行一个位,如果该行具有该值,则位为1,否则为0。
If n is very small (for example, a country column may have approximately 200 distinct values), those bitmaps can be stored with one bit per row. But if n is bigger, there will be a lot of zeros in most of the bitmaps (we say that they are sparse ). In that case, the bitmaps can additionally be run-length encoded, as shown at the bottom of Figure 3-11 . This can make the encoding of a column remarkably compact.
如果n很小(例如,一个国家列大约有200个不同的值),那么这些位图可以以每行一个比特的方式存储。但是如果n更大,大部分位图中会有很多零(我们称之为稀疏)。在这种情况下,位图可以额外进行运行长度编码,如图3-11底部所示。这可以使列的编码非常紧凑。
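Both steps can be sketched in a few lines of Python (the helper names are my own, and the run-length scheme shown—alternating counts of 0s and 1s, starting with the zeros—is just one of several possible encodings):

```python
def bitmap_encode(column):
    """One bitmap per distinct value; bit n is 1 iff row n holds that value."""
    bitmaps = {v: [0] * len(column) for v in set(column)}
    for n, value in enumerate(column):
        bitmaps[value][n] = 1
    return bitmaps

def run_length_encode(bits):
    """Encode a bitmap as alternating run lengths of 0s and 1s,
    starting with the number of leading zeros (possibly zero)."""
    runs, current, length = [], 0, 0
    for bit in bits:
        if bit == current:
            length += 1
        else:
            runs.append(length)
            current, length = bit, 1
    runs.append(length)
    return runs
```

For a sparse bitmap, the list of run lengths is far shorter than the bitmap itself.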
Bitmap indexes such as these are very well suited for the kinds of queries that are common in a data warehouse. For example:
这样的位图索引非常适合数据仓库中常见的查询类型。例如:
- WHERE product_sk IN (30, 68, 69):
  Load the three bitmaps for product_sk = 30 , product_sk = 68 , and product_sk = 69 , and calculate the bitwise OR of the three bitmaps, which can be done very efficiently.
  加载 product_sk = 30、product_sk = 68 和 product_sk = 69 的三个位图,并计算这三个位图的按位或(OR),这可以非常高效地完成。
- WHERE product_sk = 31 AND store_sk = 3:
  Load the bitmaps for product_sk = 31 and store_sk = 3 , and calculate the bitwise AND. This works because the columns contain the rows in the same order, so the k th bit in one column’s bitmap corresponds to the same row as the k th bit in another column’s bitmap.
  加载 product_sk = 31 和 store_sk = 3 的位图,并计算按位与(AND)。这是可行的,因为各列以相同的顺序包含这些行,因此一列位图中的第k位与另一列位图中的第k位对应同一行。
There are also various other compression schemes for different kinds of data, but we won’t go into them in detail—see [ 58 ] for an overview.
还有许多不同种类数据的压缩方案,但我们不会详细介绍-请参阅[58]以获取概述。
Column-oriented storage and column families
Cassandra and HBase have a concept of column families , which they inherited from Bigtable [ 9 ]. However, it is very misleading to call them column-oriented: within each column family, they store all columns from a row together, along with a row key, and they do not use column compression. Thus, the Bigtable model is still mostly row-oriented.
Cassandra和HBase有一个列族(column families)的概念,这是它们从Bigtable继承来的[9]。然而,把它们称为面向列的是非常有误导性的:在每个列族中,它们将一行的所有列与行键一起存储,并且不使用列压缩。因此,Bigtable模型仍然主要是面向行的。
Memory bandwidth and vectorized processing
For data warehouse queries that need to scan over millions of rows, a big bottleneck is the bandwidth for getting data from disk into memory. However, that is not the only bottleneck. Developers of analytical databases also worry about efficiently using the bandwidth from main memory into the CPU cache, avoiding branch mispredictions and bubbles in the CPU instruction processing pipeline, and making use of single-instruction-multi-data (SIMD) instructions in modern CPUs [ 59 , 60 ].
数据仓库查询需要扫描数百万行数据,最大瓶颈是从磁盘读取数据到内存的带宽。不过,这并不是唯一的瓶颈。分析型数据库的开发人员还关注如何有效利用从主存储器到CPU高速缓存的带宽,避免分支预测失误和CPU指令处理流水线上的气泡,以及利用现代CPU的单指令多数据(SIMD)指令[59、60]。
Besides reducing the volume of data that needs to be loaded from disk, column-oriented storage layouts are also good for making efficient use of CPU cycles. For example, the query engine can take a chunk of compressed column data that fits comfortably in the CPU’s L1 cache and iterate through it in a tight loop (that is, with no function calls). A CPU can execute such a loop much faster than code that requires a lot of function calls and conditions for each record that is processed. Column compression allows more rows from a column to fit in the same amount of L1 cache. Operators, such as the bitwise AND and OR described previously, can be designed to operate on such chunks of compressed column data directly. This technique is known as vectorized processing [ 58 , 49 ].
除了减少需要从磁盘加载的数据量之外,列式存储布局还有利于高效利用CPU周期。例如,查询引擎可以取一块恰好能放进CPU L1缓存的压缩列数据,在一个紧凑的循环中(即没有函数调用)对其进行迭代。CPU执行这样的循环,比执行对每条处理的记录都需要大量函数调用和条件判断的代码要快得多。列压缩使一列中的更多行能够装入同样大小的L1缓存。诸如前面描述的按位AND和OR之类的运算符,可以被设计为直接对这样的压缩列数据块进行操作。这种技术被称为向量化处理[58,49]。
Sort Order in Column Storage
In a column store, it doesn’t necessarily matter in which order the rows are stored. It’s easiest to store them in the order in which they were inserted, since then inserting a new row just means appending to each of the column files. However, we can choose to impose an order, like we did with SSTables previously, and use that as an indexing mechanism.
在列存储中,行的存储顺序不一定重要。最简单的做法是按插入顺序存储它们,因为这样插入新行只意味着向每个列文件追加内容。不过,我们也可以选择施加某种顺序,就像之前对SSTable所做的那样,并将其用作索引机制。
Note that it wouldn’t make sense to sort each column independently, because then we would no longer know which items in the columns belong to the same row. We can only reconstruct a row because we know that the k th item in one column belongs to the same row as the k th item in another column.
请注意,独立对每一列进行排序是没有意义的,因为这样我们将不再知道每一列中的项目属于哪一行。我们只能重构一行,因为我们知道一个列中的第k个项目属于另一列中的第k个项目所在的同一行。
Rather, the data needs to be sorted an entire row at a time, even though it is stored by column. The administrator of the database can choose the columns by which the table should be sorted, using their knowledge of common queries. For example, if queries often target date ranges, such as the last month, it might make sense to make date_key the first sort key. Then the query optimizer can scan only the rows from the last month, which will be much faster than scanning all rows.
相反,数据需要整行排序,即使它是按列存储的。 数据库管理员可以选择按哪些列对表进行排序,利用常见查询的知识。例如,如果查询经常会针对日期范围,比如最近一个月,那么将日期键作为第一排序键可能是有意义的。然后查询优化器可以只扫描最近一个月的行,这比扫描所有行要快得多。
A second column can determine the sort order of any rows that have the same value in the first column. For example, if date_key is the first sort key in Figure 3-10 , it might make sense for product_sk to be the second sort key so that all sales for the same product on the same day are grouped together in storage. That will help queries that need to group or filter sales by product within a certain date range.
第二列可以确定在第一列具有相同值的任何行的排序顺序。例如,在图3-10中,如果date_key是第一个排序键,那么product_sk可能成为第二个排序键,以便将同一天相同产品的所有销售分组在一起存储。这将有助于需要在特定日期范围内按产品对销售进行分组或筛选的查询。
Another advantage of sorted order is that it can help with compression of columns. If the primary sort column does not have many distinct values, then after sorting, it will have long sequences where the same value is repeated many times in a row. A simple run-length encoding, like we used for the bitmaps in Figure 3-11 , could compress that column down to a few kilobytes—even if the table has billions of rows.
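A minimal sketch of run-length encoding shows why a sorted, low-cardinality column compresses so well (the column values here are invented; this is not the book's implementation):

```python
def run_length_encode(column):
    """Encode a column as (value, run_length) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, n) for v, n in runs]

# A sorted date_key column with few distinct values collapses to a handful
# of pairs, no matter how many rows the table has:
column = [20170101] * 1_000_000 + [20170102] * 2_000_000
encoded = run_length_encode(column)
print(encoded)  # [(20170101, 1000000), (20170102, 2000000)]
```

Three million entries become two pairs; on an unsorted column the same encoder would produce millions of short runs and save little.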
That compression effect is strongest on the first sort key. The second and third sort keys will be more jumbled up, and thus not have such long runs of repeated values. Columns further down the sorting priority appear in essentially random order, so they probably won’t compress as well. But having the first few columns sorted is still a win overall.
Several different sort orders
A clever extension of this idea was introduced in C-Store and adopted in the commercial data warehouse Vertica [ 61 , 62 ]. Different queries benefit from different sort orders, so why not store the same data sorted in several different ways? Data needs to be replicated to multiple machines anyway, so that you don’t lose data if one machine fails. You might as well store that redundant data sorted in different ways so that when you’re processing a query, you can use the version that best fits the query pattern.
Having multiple sort orders in a column-oriented store is a bit similar to having multiple secondary indexes in a row-oriented store. But the big difference is that the row-oriented store keeps every row in one place (in the heap file or a clustered index), and secondary indexes just contain pointers to the matching rows. In a column store, there normally aren’t any pointers to data elsewhere, only columns containing values.
Writing to Column-Oriented Storage
These optimizations make sense in data warehouses, because most of the load consists of large read-only queries run by analysts. Column-oriented storage, compression, and sorting all help to make those read queries faster. However, they have the downside of making writes more difficult.
An update-in-place approach, like B-trees use, is not possible with compressed columns. If you wanted to insert a row in the middle of a sorted table, you would most likely have to rewrite all the column files. As rows are identified by their position within a column, the insertion has to update all columns consistently.
Fortunately, we have already seen a good solution earlier in this chapter: LSM-trees. All writes first go to an in-memory store, where they are added to a sorted structure and prepared for writing to disk. It doesn’t matter whether the in-memory store is row-oriented or column-oriented. When enough writes have accumulated, they are merged with the column files on disk and written to new files in bulk. This is essentially what Vertica does [ 62 ].
Queries need to examine both the column data on disk and the recent writes in memory, and combine the two. However, the query optimizer hides this distinction from the user. From an analyst’s point of view, data that has been modified with inserts, updates, or deletes is immediately reflected in subsequent queries.
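The write path just described can be sketched as a toy in-memory model (class and method names are invented; real systems like Vertica write actual column files and compress them): writes accumulate in a sorted in-memory structure, are flushed to "disk" in bulk, and queries merge both.

```python
import bisect

class TinyColumnStore:
    """Toy LSM-style write path: sorted memtable plus bulk-written runs."""

    def __init__(self, flush_threshold=4):
        self.memtable = []      # sorted list of (sort_key, row) tuples
        self.disk_runs = []     # each run stands in for a set of column files
        self.flush_threshold = flush_threshold

    def insert(self, key, row):
        bisect.insort(self.memtable, (key, row))
        if len(self.memtable) >= self.flush_threshold:
            # "When enough writes have accumulated... written to new files in bulk"
            self.disk_runs.append(self.memtable)
            self.memtable = []

    def scan(self):
        # A query must combine recent in-memory writes with data on disk.
        merged = self.memtable + [kv for run in self.disk_runs for kv in run]
        return sorted(merged)
```

A real engine would merge sorted runs lazily and hide this distinction inside the query optimizer, as the text notes; the sketch only shows where reads have to look.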
Aggregation: Data Cubes and Materialized Views
Not every data warehouse is necessarily a column store: traditional row-oriented databases and a few other architectures are also used. However, columnar storage can be significantly faster for ad hoc analytical queries, so it is rapidly gaining popularity [ 51 , 63 ].
Another aspect of data warehouses that is worth mentioning briefly is materialized aggregates. As discussed earlier, data warehouse queries often involve an aggregate function, such as COUNT, SUM, AVG, MIN, or MAX in SQL. If the same aggregates are used by many different queries, it can be wasteful to crunch through the raw data every time. Why not cache some of the counts or sums that queries use most often?
One way of creating such a cache is a materialized view . In a relational data model, it is often defined like a standard (virtual) view: a table-like object whose contents are the results of some query. The difference is that a materialized view is an actual copy of the query results, written to disk, whereas a virtual view is just a shortcut for writing queries. When you read from a virtual view, the SQL engine expands it into the view’s underlying query on the fly and then processes the expanded query.
When the underlying data changes, a materialized view needs to be updated, because it is a denormalized copy of the data. The database can do that automatically, but such updates make writes more expensive, which is why materialized views are not often used in OLTP databases. In read-heavy data warehouses they can make more sense (whether or not they actually improve read performance depends on the individual case).
A common special case of a materialized view is known as a data cube or OLAP cube [ 64 ]. It is a grid of aggregates grouped by different dimensions. Figure 3-12 shows an example.
Imagine for now that each fact has foreign keys to only two dimension tables—in Figure 3-12, these are date and product. You can now draw a two-dimensional table, with dates along one axis and products along the other. Each cell contains the aggregate (e.g., SUM) of an attribute (e.g., net_price) of all facts with that date-product combination. Then you can apply the same aggregate along each row or column and get a summary that has been reduced by one dimension (the sales by product regardless of date, or the sales by date regardless of product).
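The two-dimensional cube just described can be sketched in a few lines (the fact tuples and field names are invented for illustration, not taken from the book's example data):

```python
from collections import defaultdict

# Illustrative facts: (date, product, net_price)
facts = [
    ("2024-01-01", "apple", 3.0),
    ("2024-01-01", "pear",  2.0),
    ("2024-01-02", "apple", 5.0),
]

# Materialize the cube: SUM(net_price) per (date, product) cell.
cube = defaultdict(float)
for date, product, price in facts:
    cube[(date, product)] += price

# Apply the same aggregate along each axis to reduce by one dimension.
by_date = defaultdict(float)     # sales by date, regardless of product
by_product = defaultdict(float)  # sales by product, regardless of date
for (date, product), total in cube.items():
    by_date[date] += total
    by_product[product] += total

print(by_date["2024-01-01"])   # 5.0
print(by_product["apple"])     # 8.0
```

Answering "total sales per date" from by_date is a single lookup, whereas answering it from the raw facts would require scanning every row, which is exactly the precomputation advantage discussed below.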
In general, facts often have more than two dimensions. In Figure 3-9 there are five dimensions: date, product, store, promotion, and customer. It’s a lot harder to imagine what a five-dimensional hypercube would look like, but the principle remains the same: each cell contains the sales for a particular date-product-store-promotion-customer combination. These values can then repeatedly be summarized along each of the dimensions.
The advantage of a materialized data cube is that certain queries become very fast because they have effectively been precomputed. For example, if you want to know the total sales per store yesterday, you just need to look at the totals along the appropriate dimension—no need to scan millions of rows.
The disadvantage is that a data cube doesn’t have the same flexibility as querying the raw data. For example, there is no way of calculating which proportion of sales comes from items that cost more than $100, because the price isn’t one of the dimensions. Most data warehouses therefore try to keep as much raw data as possible, and use aggregates such as data cubes only as a performance boost for certain queries.
Summary
In this chapter we tried to get to the bottom of how databases handle storage and retrieval. What happens when you store data in a database, and what does the database do when you query for the data again later?
On a high level, we saw that storage engines fall into two broad categories: those optimized for transaction processing (OLTP), and those optimized for analytics (OLAP). There are big differences between the access patterns in those use cases:
-
OLTP systems are typically user-facing, which means that they may see a huge volume of requests. In order to handle the load, applications usually only touch a small number of records in each query. The application requests records using some kind of key, and the storage engine uses an index to find the data for the requested key. Disk seek time is often the bottleneck here.
-
Data warehouses and similar analytic systems are less well known, because they are primarily used by business analysts, not by end users. They handle a much lower volume of queries than OLTP systems, but each query is typically very demanding, requiring many millions of records to be scanned in a short time. Disk bandwidth (not seek time) is often the bottleneck here, and column-oriented storage is an increasingly popular solution for this kind of workload.
On the OLTP side, we saw storage engines from two main schools of thought:
-
The log-structured school, which only permits appending to files and deleting obsolete files, but never updates a file that has been written. Bitcask, SSTables, LSM-trees, LevelDB, Cassandra, HBase, Lucene, and others belong to this group.
-
The update-in-place school, which treats the disk as a set of fixed-size pages that can be overwritten. B-trees are the biggest example of this philosophy, being used in all major relational databases and also many nonrelational ones.
Log-structured storage engines are a comparatively recent development. Their key idea is that they systematically turn random-access writes into sequential writes on disk, which enables higher write throughput due to the performance characteristics of hard drives and SSDs.
Finishing off the OLTP side, we did a brief tour through some more complicated indexing structures, and databases that are optimized for keeping all data in memory.
We then took a detour from the internals of storage engines to look at the high-level architecture of a typical data warehouse. This background illustrated why analytic workloads are so different from OLTP: when your queries require sequentially scanning across a large number of rows, indexes are much less relevant. Instead it becomes important to encode data very compactly, to minimize the amount of data that the query needs to read from disk. We discussed how column-oriented storage helps achieve this goal.
As an application developer, if you’re armed with this knowledge about the internals of storage engines, you are in a much better position to know which tool is best suited for your particular application. If you need to adjust a database’s tuning parameters, this understanding allows you to imagine what effect a higher or a lower value may have.
Although this chapter couldn’t make you an expert in tuning any one particular storage engine, it has hopefully equipped you with enough vocabulary and ideas that you can make sense of the documentation for the database of your choice.
Footnotes
i If all keys and values had a fixed size, you could use binary search on a segment file and avoid the in-memory index entirely. However, they are usually variable-length in practice, which makes it difficult to tell where one record ends and the next one starts if you don’t have an index.
ii Inserting a new key into a B-tree is reasonably intuitive, but deleting one (while keeping the tree balanced) is somewhat more involved [2].
iii This variant is sometimes known as a B+ tree, although the optimization is so common that it often isn’t distinguished from other B-tree variants.
iv The meaning of online in OLAP is unclear; it probably refers to the fact that queries are not just for predefined reports, but that analysts use the OLAP system interactively for explorative queries.
References
[1] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman: Data Structures and Algorithms. Addison-Wesley, 1983. ISBN: 978-0-201-00023-8
[2] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein: Introduction to Algorithms, 3rd edition. MIT Press, 2009. ISBN: 978-0-262-53305-8
[3] Justin Sheehy and David Smith: “Bitcask: A Log-Structured Hash Table for Fast Key/Value Data,” Basho Technologies, April 2010.
[4] Yinan Li, Bingsheng He, Robin Jun Yang, et al.: “Tree Indexing on Solid State Drives,” Proceedings of the VLDB Endowment, volume 3, number 1, pages 1195–1206, September 2010.
[5] Goetz Graefe: “Modern B-Tree Techniques,” Foundations and Trends in Databases, volume 3, number 4, pages 203–402, August 2011. doi:10.1561/1900000028
[6] Jeffrey Dean and Sanjay Ghemawat: “LevelDB Implementation Notes,” leveldb.googlecode.com.
[7] Dhruba Borthakur: “The History of RocksDB,” rocksdb.blogspot.com, November 24, 2013.
[8] Matteo Bertozzi: “Apache HBase I/O – HFile,” blog.cloudera.com, June 29, 2012.
[9] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al.: “Bigtable: A Distributed Storage System for Structured Data,” at 7th USENIX Symposium on Operating System Design and Implementation (OSDI), November 2006.
[10] Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil: “The Log-Structured Merge-Tree (LSM-Tree),” Acta Informatica, volume 33, number 4, pages 351–385, June 1996. doi:10.1007/s002360050048
[11] Mendel Rosenblum and John K. Ousterhout: “The Design and Implementation of a Log-Structured File System,” ACM Transactions on Computer Systems, volume 10, number 1, pages 26–52, February 1992. doi:10.1145/146941.146943
[12] Adrien Grand: “What Is in a Lucene Index?,” at Lucene/Solr Revolution, November 14, 2013.
[13] Deepak Kandepet: “Hacking Lucene—The Index Format,” hackerlabs.org, October 1, 2011.
[14] Michael McCandless: “Visualizing Lucene’s Segment Merges,” blog.mikemccandless.com, February 11, 2011.
[15] Burton H. Bloom: “Space/Time Trade-offs in Hash Coding with Allowable Errors,” Communications of the ACM, volume 13, number 7, pages 422–426, July 1970. doi:10.1145/362686.362692
[16] “Operating Cassandra: Compaction,” Apache Cassandra Documentation v4.0, 2016.
[17] Rudolf Bayer and Edward M. McCreight: “Organization and Maintenance of Large Ordered Indices,” Boeing Scientific Research Laboratories, Mathematical and Information Sciences Laboratory, report no. 20, July 1970.
[18] Douglas Comer: “The Ubiquitous B-Tree,” ACM Computing Surveys, volume 11, number 2, pages 121–137, June 1979. doi:10.1145/356770.356776
[19] Emmanuel Goossaert: “Coding for SSDs,” codecapsule.com, February 12, 2014.
[20] C. Mohan and Frank Levine: “ARIES/IM: An Efficient and High Concurrency Index Management Method Using Write-Ahead Logging,” at ACM International Conference on Management of Data (SIGMOD), June 1992. doi:10.1145/130283.130338
[21] Howard Chu: “LDAP at Lightning Speed,” at Build Stuff ’14, November 2014.
[22] Bradley C. Kuszmaul: “A Comparison of Fractal Trees to Log-Structured Merge (LSM) Trees,” tokutek.com, April 22, 2014.
[23] Manos Athanassoulis, Michael S. Kester, Lukas M. Maas, et al.: “Designing Access Methods: The RUM Conjecture,” at 19th International Conference on Extending Database Technology (EDBT), March 2016. doi:10.5441/002/edbt.2016.42
[24] Peter Zaitsev: “Innodb Double Write,” percona.com, August 4, 2006.
[25] Tomas Vondra: “On the Impact of Full-Page Writes,” blog.2ndquadrant.com, November 23, 2016.
[26] Mark Callaghan: “The Advantages of an LSM vs a B-Tree,” smalldatum.blogspot.co.uk, January 19, 2016.
[27] Mark Callaghan: “Choosing Between Efficiency and Performance with RocksDB,” at Code Mesh, November 4, 2016.
[28] Michi Mutsuzaki: “MySQL vs. LevelDB,” github.com, August 2011.
[29] Benjamin Coverston, Jonathan Ellis, et al.: “CASSANDRA-1608: Redesigned Compaction,” issues.apache.org, July 2011.
[30] Igor Canadi, Siying Dong, and Mark Callaghan: “RocksDB Tuning Guide,” github.com, 2016.
[31] MySQL 5.7 Reference Manual. Oracle, 2014.
[32] Books Online for SQL Server 2012. Microsoft, 2012.
[33] Joe Webb: “Using Covering Indexes to Improve Query Performance,” simple-talk.com, September 29, 2008.
[34] Frank Ramsak, Volker Markl, Robert Fenk, et al.: “Integrating the UB-Tree into a Database System Kernel,” at 26th International Conference on Very Large Data Bases (VLDB), September 2000.
[35] The PostGIS Development Group: “PostGIS 2.1.2dev Manual,” postgis.net, 2014.
[36] Robert Escriva, Bernard Wong, and Emin Gün Sirer: “HyperDex: A Distributed, Searchable Key-Value Store,” at ACM SIGCOMM Conference, August 2012. doi:10.1145/2377677.2377681
[37] Michael McCandless: “Lucene’s FuzzyQuery Is 100 Times Faster in 4.0,” blog.mikemccandless.com, March 24, 2011.
[38] Steffen Heinz, Justin Zobel, and Hugh E. Williams: “Burst Tries: A Fast, Efficient Data Structure for String Keys,” ACM Transactions on Information Systems, volume 20, number 2, pages 192–223, April 2002. doi:10.1145/506309.506312
[39] Klaus U. Schulz and Stoyan Mihov: “Fast String Correction with Levenshtein Automata,” International Journal on Document Analysis and Recognition, volume 5, number 1, pages 67–85, November 2002. doi:10.1007/s10032-002-0082-8
[40] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at nlp.stanford.edu/IR-book
[41] Michael Stonebraker, Samuel Madden, Daniel J. Abadi, et al.: “The End of an Architectural Era (It’s Time for a Complete Rewrite),” at 33rd International Conference on Very Large Data Bases (VLDB), September 2007.
[42] “VoltDB Technical Overview White Paper,” VoltDB, 2014.
[43] Stephen M. Rumble, Ankita Kejriwal, and John K. Ousterhout: “Log-Structured Memory for DRAM-Based Storage,” at 12th USENIX Conference on File and Storage Technologies (FAST), February 2014.
[44] Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker: “OLTP Through the Looking Glass, and What We Found There,” at ACM International Conference on Management of Data (SIGMOD), June 2008. doi:10.1145/1376616.1376713
[45] Justin DeBrabant, Andrew Pavlo, Stephen Tu, et al.: “Anti-Caching: A New Approach to Database Management System Architecture,” Proceedings of the VLDB Endowment, volume 6, number 14, pages 1942–1953, September 2013.
[46] Joy Arulraj, Andrew Pavlo, and Subramanya R. Dulloor: “Let’s Talk About Storage & Recovery Methods for Non-Volatile Memory Database Systems,” at ACM International Conference on Management of Data (SIGMOD), June 2015. doi:10.1145/2723372.2749441
[47] Edgar F. Codd, S. B. Codd, and C. T. Salley: “Providing OLAP to User-Analysts: An IT Mandate,” E. F. Codd Associates, 1993.
[48] Surajit Chaudhuri and Umeshwar Dayal: “An Overview of Data Warehousing and OLAP Technology,” ACM SIGMOD Record, volume 26, number 1, pages 65–74, March 1997. doi:10.1145/248603.248616
[49] Per-Åke Larson, Cipri Clinciu, Campbell Fraser, et al.: “Enhancements to SQL Server Column Stores,” at ACM International Conference on Management of Data (SIGMOD), June 2013.
[50] Franz Färber, Norman May, Wolfgang Lehner, et al.: “The SAP HANA Database – An Architecture Overview,” IEEE Data Engineering Bulletin, volume 35, number 1, pages 28–33, March 2012.
[51] Michael Stonebraker: “The Traditional RDBMS Wisdom Is (Almost Certainly) All Wrong,” presentation at EPFL, May 2013.
[52] Daniel J. Abadi: “Classifying the SQL-on-Hadoop Solutions,” hadapt.com, October 2, 2013.
[53] Marcel Kornacker, Alexander Behm, Victor Bittorf, et al.: “Impala: A Modern, Open-Source SQL Engine for Hadoop,” at 7th Biennial Conference on Innovative Data Systems Research (CIDR), January 2015.
[54] Sergey Melnik, Andrey Gubarev, Jing Jing Long, et al.: “Dremel: Interactive Analysis of Web-Scale Datasets,” at 36th International Conference on Very Large Data Bases (VLDB), pages 330–339, September 2010.
[55] Ralph Kimball and Margy Ross: The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd edition. John Wiley & Sons, July 2013. ISBN: 978-1-118-53080-1
[56] Derrick Harris: “Why Apple, eBay, and Walmart Have Some of the Biggest Data Warehouses You’ve Ever Seen,” gigaom.com, March 27, 2013.
[57] Julien Le Dem: “Dremel Made Simple with Parquet,” blog.twitter.com, September 11, 2013.
[58] Daniel J. Abadi, Peter Boncz, Stavros Harizopoulos, et al.: “The Design and Implementation of Modern Column-Oriented Database Systems,” Foundations and Trends in Databases, volume 5, number 3, pages 197–280, December 2013. doi:10.1561/1900000024
[59] Peter Boncz, Marcin Zukowski, and Niels Nes: “MonetDB/X100: Hyper-Pipelining Query Execution,” at 2nd Biennial Conference on Innovative Data Systems Research (CIDR), January 2005.
[60] Jingren Zhou and Kenneth A. Ross: “Implementing Database Operations Using SIMD Instructions,” at ACM International Conference on Management of Data (SIGMOD), pages 145–156, June 2002. doi:10.1145/564691.564709
[61] Michael Stonebraker, Daniel J. Abadi, Adam Batkin, et al.: “C-Store: A Column-oriented DBMS,” at 31st International Conference on Very Large Data Bases (VLDB), pages 553–564, September 2005.
[62] Andrew Lamb, Matt Fuller, Ramakrishna Varadarajan, et al.: “The Vertica Analytic Database: C-Store 7 Years Later,” Proceedings of the VLDB Endowment, volume 5, number 12, pages 1790–1801, August 2012.
[63] Julien Le Dem and Nong Li: “Efficient Data Storage for Analytics with Apache Parquet 2.0,” at Hadoop Summit, San Jose, June 2014.
[64] Jim Gray, Surajit Chaudhuri, Adam Bosworth, et al.: “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals,” Data Mining and Knowledge Discovery, volume 1, number 1, pages 29–53, March 1997. doi:10.1023/A:1009726021843
Chapter 4. Encoding and Evolution
Everything changes and nothing stands still.
Heraclitus of Ephesus, as quoted by Plato in Cratylus (360 BCE)
Applications inevitably change over time. Features are added or modified as new products are launched, user requirements become better understood, or business circumstances change. In Chapter 1 we introduced the idea of evolvability : we should aim to build systems that make it easy to adapt to change (see “Evolvability: Making Change Easy” ).
In most cases, a change to an application’s features also requires a change to data that it stores: perhaps a new field or record type needs to be captured, or perhaps existing data needs to be presented in a new way.
The data models we discussed in Chapter 2 have different ways of coping with such change. Relational databases generally assume that all data in the database conforms to one schema: although that schema can be changed (through schema migrations; i.e., ALTER statements), there is exactly one schema in force at any one point in time. By contrast, schema-on-read (“schemaless”) databases don’t enforce a schema, so the database can contain a mixture of older and newer data formats written at different times (see “Schema flexibility in the document model”).
When a data format or schema changes, a corresponding change to application code often needs to happen (for example, you add a new field to a record, and the application code starts reading and writing that field). However, in a large application, code changes often cannot happen instantaneously:
当数据格式或模式发生更改时,通常需要相应地更改应用程序代码(例如,您将新字段添加到记录中,应用程序代码开始读取和写入该字段)。但是,在大型应用程序中,代码更改通常无法即时发生。
-
With server-side applications you may want to perform a rolling upgrade (also known as a staged rollout ), deploying the new version to a few nodes at a time, checking whether the new version is running smoothly, and gradually working your way through all the nodes. This allows new versions to be deployed without service downtime, and thus encourages more frequent releases and better evolvability.
对于服务器端应用程序,您可能希望执行滚动升级(也称为分阶段发布):每次将新版本部署到少数几个节点,检查新版本是否运行顺畅,然后逐步推广到所有节点。这样可以在不中断服务的情况下部署新版本,从而鼓励更频繁的发布和更好的可演化性。
-
With client-side applications you’re at the mercy of the user, who may not install the update for some time.
对于客户端应用程序,您只能受制于用户:用户可能在相当长的时间内都不安装更新。
This means that old and new versions of the code, and old and new data formats, may potentially all coexist in the system at the same time. In order for the system to continue running smoothly, we need to maintain compatibility in both directions:
这意味着旧版本和新版本的代码,以及旧格式和新格式的数据,都有可能在系统中同时存在。为了系统继续平稳运行,我们需要在两个方向上保持兼容性。
- Backward compatibility
-
Newer code can read data that was written by older code.
新的代码可以读取由旧代码写入的数据。
- Forward compatibility
-
Older code can read data that was written by newer code.
旧代码可以读取由新代码编写的数据。
Backward compatibility is normally not hard to achieve: as author of the newer code, you know the format of data written by older code, and so you can explicitly handle it (if necessary by simply keeping the old code to read the old data). Forward compatibility can be trickier, because it requires older code to ignore additions made by a newer version of the code.
向后兼容通常不难实现:作为新代码的作者,您知道旧代码写入的数据格式,因此您可以明确地处理它(如果必要,可以通过保留旧代码来读取旧数据)。向前兼容可能更加棘手,因为它需要旧代码忽略新版本代码中添加的内容。
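Both directions can be illustrated with a small, purely hypothetical sketch using plain JSON records (the field names here are invented for illustration): an imagined v2 of an application adds an "email" field, while v1 code is still running.

向前和向后兼容可以用下面这个纯属假设的示例来说明(其中的字段名是为演示而虚构的):v2版本的应用添加了一个 "email" 字段,而v1的代码仍在运行。

```python
import json

# Hypothetical records: v1 of an application wrote only "name";
# v2 added an "email" field. Both versions coexist during a rollout.
record_v1 = json.dumps({"name": "Alice"})
record_v2 = json.dumps({"name": "Alice", "email": "alice@example.com"})

def read_v1(data: str) -> dict:
    """Old code: knows only "name"; unknown fields are simply ignored
    (forward compatibility)."""
    doc = json.loads(data)
    return {"name": doc["name"]}

def read_v2(data: str) -> dict:
    """New code: explicitly handles records that predate "email"
    (backward compatibility)."""
    doc = json.loads(data)
    return {"name": doc["name"], "email": doc.get("email")}

assert read_v1(record_v2) == {"name": "Alice"}
assert read_v2(record_v1) == {"name": "Alice", "email": None}
```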
In this chapter we will look at several formats for encoding data, including JSON, XML, Protocol Buffers, Thrift, and Avro. In particular, we will look at how they handle schema changes and how they support systems where old and new data and code need to coexist. We will then discuss how those formats are used for data storage and for communication: in web services, Representational State Transfer (REST), and remote procedure calls (RPC), as well as message-passing systems such as actors and message queues.
在本章中,我们将介绍几种编码数据的格式,包括JSON、XML、Protocol Buffers、Thrift和Avro。我们将特别关注它们如何处理模式变更,以及它们如何支持新旧数据与新旧代码共存的系统。然后,我们将讨论这些格式如何用于数据存储和通信:包括Web服务、表述性状态转移(REST)和远程过程调用(RPC),以及actor和消息队列等消息传递系统。
Formats for Encoding Data
Programs usually work with data in (at least) two different representations:
程序通常使用至少两种不同的表示方式来处理数据:
-
In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These data structures are optimized for efficient access and manipulation by the CPU (typically using pointers).
在内存中,数据被存储为对象、结构体、列表、数组、哈希表、树等形式。这些数据结构被优化为通过CPU(通常使用指针)进行高效访问和操作。
-
When you want to write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes (for example, a JSON document). Since a pointer wouldn’t make sense to any other process, this sequence-of-bytes representation looks quite different from the data structures that are normally used in memory. i
当您想要将数据写入文件或通过网络发送时,您必须将其编码为某种自包含的字节序列(例如JSON文档)。由于指针对任何其他进程都没有意义,因此这个字节序列表示看起来与通常在内存中使用的数据结构非常不同。
Thus, we need some kind of translation between the two representations. The translation from the in-memory representation to a byte sequence is called encoding (also known as serialization or marshalling ), and the reverse is called decoding ( parsing , deserialization , unmarshalling ). ii
因此,我们需要在这两种表示之间进行某种形式的转换。从内存表示到字节序列的转换称为编码(也称为序列化或编组),反之则称为解码(解析、反序列化、反编组)。
Terminology clash
Serialization is unfortunately also used in the context of transactions (see Chapter 7 ), with a completely different meaning. To avoid overloading the word we’ll stick with encoding in this book, even though serialization is perhaps a more common term.
不巧的是,序列化一词也用在事务的上下文中(参见第7章),而且含义完全不同。为了避免该词语义过载,本书将坚持使用“编码”,尽管“序列化”也许是更常见的术语。
As this is such a common problem, there are a myriad different libraries and encoding formats to choose from. Let’s do a brief overview.
由于这是一个常见的问题,所以有无数不同的库和编码格式可供选择。让我们做一个简要的概述。
Language-Specific Formats
Many programming languages come with built-in support for encoding in-memory objects into byte sequences. For example, Java has java.io.Serializable [1], Ruby has Marshal [2], Python has pickle [3], and so on. Many third-party libraries also exist, such as Kryo for Java [4].
许多编程语言都内置了将内存对象编码为字节序列的支持。例如,Java有java.io.Serializable [1],Ruby有Marshal [2],Python有pickle [3],等等。还有许多第三方库,比如Java的Kryo [4]。
These encoding libraries are very convenient, because they allow in-memory objects to be saved and restored with minimal additional code. However, they also have a number of deep problems:
这些编码库非常方便,因为它们允许内存对象以最少的额外代码进行保存和恢复。然而,它们也存在许多深层问题:
-
The encoding is often tied to a particular programming language, and reading the data in another language is very difficult. If you store or transmit data in such an encoding, you are committing yourself to your current programming language for potentially a very long time, and precluding integrating your systems with those of other organizations (which may use different languages).
编码通常与特定的编程语言相关联,使用另一种语言读取数据非常困难。如果您以这种编码方式存储或传输数据,可能会长期限制自己使用当前的编程语言,并阻碍与其他组织(可能使用不同语言)集成系统的能力。
-
In order to restore data in the same object types, the decoding process needs to be able to instantiate arbitrary classes. This is frequently a source of security problems [ 5 ]: if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate arbitrary classes, which in turn often allows them to do terrible things such as remotely executing arbitrary code [ 6 , 7 ].
为了在相同的对象类型中恢复数据,解码过程需要能够实例化任意类。这经常是安全问题的源泉:如果攻击者能够让您的应用程序解码任意字节序列,他们可以实例化任意类,这通常允许他们做可怕的事情,比如远程执行任意代码。
-
Versioning data is often an afterthought in these libraries: as they are intended for quick and easy encoding of data, they often neglect the inconvenient problems of forward and backward compatibility.
在这些库中,数据版本化往往是事后才考虑的问题:由于它们的设计目标是快速简便地编码数据,因此往往忽略了向前和向后兼容这些麻烦的问题。
-
Efficiency (CPU time taken to encode or decode, and the size of the encoded structure) is also often an afterthought. For example, Java’s built-in serialization is notorious for its bad performance and bloated encoding [ 8 ].
效率(编码或解码所需的CPU时间,以及编码后结构的大小)往往也是事后才考虑的问题。例如,Java内置的序列化就因其糟糕的性能和臃肿的编码而臭名昭著 [8]。
For these reasons it’s generally a bad idea to use your language’s built-in encoding for anything other than very transient purposes.
出于这些原因,除非是非常短暂的用途,通常不建议使用您的语言内置的编码方式。
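To make the point concrete, here is a small sketch using the pickle module mentioned above: the bytes round-trip within Python, but the stream is a Python-specific opcode program, which is exactly why decoding untrusted input with such libraries is dangerous.

为了把这一点说得更具体,下面用上文提到的pickle模块做一个小示例:这些字节在Python内部可以往返转换,但字节流本身是Python特有的操作码程序,这正是用此类库解码不可信输入很危险的原因。

```python
import pickle

record = {"userName": "Martin", "favoriteNumber": 1337}
blob = pickle.dumps(record)

# The encoding round-trips within Python...
assert pickle.loads(blob) == record

# ...but the byte stream is a Python-specific opcode program, not a neutral
# format. pickle.loads will execute class-constructing opcodes, which is why
# decoding untrusted pickles is a well-known security hazard.
assert blob[:1] == b"\x80"   # PROTO opcode: meaningful only to Python's pickle
```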
JSON, XML, and Binary Variants
Moving to standardized encodings that can be written and read by many programming languages, JSON and XML are the obvious contenders. They are widely known, widely supported, and almost as widely disliked. XML is often criticized for being too verbose and unnecessarily complicated [ 9 ]. JSON’s popularity is mainly due to its built-in support in web browsers (by virtue of being a subset of JavaScript) and simplicity relative to XML. CSV is another popular language-independent format, albeit less powerful.
采用标准化编码,可以被多种编程语言编写和读取,JSON和XML是显而易见的竞争者。它们被广泛知晓,广泛支持,但几乎同样被人们所厌恶。XML经常因为太冗长和过于复杂而受到批评[9]。JSON之所以受欢迎,主要是因为它在网络浏览器中内置支持(由于是JavaScript的子集)并且相对于XML而言更为简单。CSV是另一种流行的独立于语言的格式,但它的功能比较有限。
JSON, XML, and CSV are textual formats, and thus somewhat human-readable (although the syntax is a popular topic of debate). Besides the superficial syntactic issues, they also have some subtle problems:
JSON、XML和CSV都是文本格式,因此具有一定的人类可读性(尽管它们的语法是一个热门的争论话题)。除了表面的语法问题之外,它们还有一些微妙的问题:
-
There is a lot of ambiguity around the encoding of numbers. In XML and CSV, you cannot distinguish between a number and a string that happens to consist of digits (except by referring to an external schema). JSON distinguishes strings and numbers, but it doesn’t distinguish integers and floating-point numbers, and it doesn’t specify a precision.
数字编码存在很多歧义。在XML和CSV中,你无法区分数字和由数字组成的字符串(除非参考外部架构)。JSON区分了字符串和数字,但它不区分整数和浮点数,并且它不指定精度。
This is a problem when dealing with large numbers; for example, integers greater than 2^53 cannot be exactly represented in an IEEE 754 double-precision floating-point number, so such numbers become inaccurate when parsed in a language that uses floating-point numbers (such as JavaScript). An example of numbers larger than 2^53 occurs on Twitter, which uses a 64-bit number to identify each tweet. The JSON returned by Twitter’s API includes tweet IDs twice, once as a JSON number and once as a decimal string, to work around the fact that the numbers are not correctly parsed by JavaScript applications [10].
处理大数字时会出现问题:例如,大于2^53的整数无法用IEEE 754双精度浮点数精确表示,因此在使用浮点数的语言(如JavaScript)中解析这些数字时会变得不准确。Twitter上就有大于2^53的数字:该网站使用64位数字来标识每条推文。Twitter API返回的JSON中包含了两份推文ID,一份是JSON数字,一份是十进制字符串,以此规避JavaScript应用程序无法正确解析这些数字的问题 [10]。
-
JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they don’t support binary strings (sequences of bytes without a character encoding). Binary strings are a useful feature, so people get around this limitation by encoding the binary data as text using Base64. The schema is then used to indicate that the value should be interpreted as Base64-encoded. This works, but it’s somewhat hacky and increases the data size by 33%.
JSON和XML对Unicode字符字符串(即人类可读的文本)有很好的支持,但它们不支持二进制字符串(没有字符编码的字节序列)。二进制字符串是一个有用的功能,因此人们通过使用Base64将二进制数据编码为文本来解决这个限制。然后使用模式来指示该值应解释为Base64编码。这可行,但它有点hacky,并且会将数据大小增加33%。
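The one-third size increase is easy to verify: Base64 maps every 3 bytes of input to 4 ASCII characters. A minimal sketch:

三分之一的体积膨胀很容易验证:Base64将每3个输入字节映射为4个ASCII字符。一个最小示例:

```python
import base64
import json

binary = bytes(range(255))                 # 255 arbitrary bytes (a multiple of 3)
encoded = base64.b64encode(binary).decode("ascii")

# Base64 maps every 3 bytes to 4 ASCII characters, so the payload grows by
# one third (plus '=' padding when the length isn't a multiple of 3).
assert len(encoded) == len(binary) * 4 // 3

doc = json.dumps({"payload": encoded})     # now safe to embed in JSON
assert json.loads(doc)["payload"] == encoded
```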
-
There is optional schema support for both XML [ 11 ] and JSON [ 12 ]. These schema languages are quite powerful, and thus quite complicated to learn and implement. Use of XML schemas is fairly widespread, but many JSON-based tools don’t bother using schemas. Since the correct interpretation of data (such as numbers and binary strings) depends on information in the schema, applications that don’t use XML/JSON schemas need to potentially hardcode the appropriate encoding/decoding logic instead.
XML [11] 和JSON [12] 都有可选的模式支持。这些模式语言相当强大,因此学习和实现起来也相当复杂。XML模式的使用相当广泛,但许多基于JSON的工具则懒得使用模式。由于数据(如数字和二进制字符串)的正确解释依赖于模式中的信息,不使用XML/JSON模式的应用程序可能需要将相应的编码/解码逻辑硬编码到程序中。
-
CSV does not have any schema, so it is up to the application to define the meaning of each row and column. If an application change adds a new row or column, you have to handle that change manually. CSV is also a quite vague format (what happens if a value contains a comma or a newline character?). Although its escaping rules have been formally specified [ 13 ], not all parsers implement them correctly.
CSV 没有任何模式, 因此应用程序需要定义每行和每列的含义。如果应用程序的更改添加了一个新行或新列,则必须手动处理。 CSV 也是一种相当模糊的格式(如果一个值包含逗号或换行符会发生什么?)。 尽管其转义规则已被正式指定,但并非所有解析器都正确实现了它们。
Despite these flaws, JSON, XML, and CSV are good enough for many purposes. It’s likely that they will remain popular, especially as data interchange formats (i.e., for sending data from one organization to another). In these situations, as long as people agree on what the format is, it often doesn’t matter how pretty or efficient the format is. The difficulty of getting different organizations to agree on anything outweighs most other concerns.
尽管存在这些缺陷,JSON、XML和CSV对于许多目的来说已经足够好了。它们可能会继续流行,特别是作为数据交换格式(即用于从一个组织发送数据到另一个组织)。在这些情况下,只要人们对格式达成一致,格式的美观或效率通常并不重要。使不同组织就任何事情达成一致的难度超过了其他大多数考虑因素。
Binary encoding
For data that is used only internally within your organization, there is less pressure to use a lowest-common-denominator encoding format. For example, you could choose a format that is more compact or faster to parse. For a small dataset, the gains are negligible, but once you get into the terabytes, the choice of data format can have a big impact.
对于仅在组织内部使用的数据,选择“最小公分母”式编码格式的压力较小。例如,你可以选择更紧凑或解析更快的格式。对于小数据集,收益微不足道;但一旦达到TB级,数据格式的选择就会产生巨大的影响。
JSON is less verbose than XML, but both still use a lot of space compared to binary formats. This observation led to the development of a profusion of binary encodings for JSON (MessagePack, BSON, BJSON, UBJSON, BISON, and Smile, to name a few) and for XML (WBXML and Fast Infoset, for example). These formats have been adopted in various niches, but none of them are as widely adopted as the textual versions of JSON and XML.
JSON比XML更简洁,但与二进制格式相比,两者仍然使用很多空间。这个观察结果导致了大量针对JSON(如MessagePack、BSON、BJSON、UBJSON、BISON和Smile)和XML(例如WBXML和Fast Infoset)的二进制编码的开发。这些格式已被用于各种领域,但它们中没有一个像JSON和XML的文本版本那样被广泛采用。
Some of these formats extend the set of datatypes (e.g., distinguishing integers and floating-point numbers, or adding support for binary strings), but otherwise they keep the JSON/XML data model unchanged. In particular, since they don’t prescribe a schema, they need to include all the object field names within the encoded data. That is, in a binary encoding of the JSON document in Example 4-1, they will need to include the strings userName, favoriteNumber, and interests somewhere.
其中一些格式扩展了数据类型集合(例如,区分整数和浮点数,或添加对二进制字符串的支持),但在其他方面保持了JSON/XML的数据模型不变。特别地,由于它们不规定模式,因此需要在编码数据中包含所有对象的字段名。也就是说,在对示例4-1中的JSON文档进行二进制编码时,它们需要在某个地方包含字符串userName、favoriteNumber和interests。
Example 4-1. Example record which we will encode in several binary formats in this chapter
```json
{
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}
```
Let’s look at an example of MessagePack, a binary encoding for JSON. Figure 4-1 shows the byte sequence that you get if you encode the JSON document in Example 4-1 with MessagePack [ 14 ]. The first few bytes are as follows:
让我们来看一个例子,MessagePack是一种用于JSON的二进制编码。图4-1显示了在MessagePack [14]中对Example 4-1中的JSON文档进行编码时所获得的字节序列。前几个字节如下:
- The first byte, 0x83, indicates that what follows is an object (top four bits = 0x80) with three fields (bottom four bits = 0x03). (In case you’re wondering what happens if an object has more than 15 fields, so that the number of fields doesn’t fit in four bits, it then gets a different type indicator, and the number of fields is encoded in two or four bytes.)
第一个字节0x83表示接下来是一个对象(高四位=0x80),共有三个字段(低四位=0x03)。(如果你好奇当对象的字段超过15个、字段数无法用四位表示时会发生什么:这时会使用另一个类型指示符,并用两个或四个字节来编码字段数。)
- The second byte, 0xa8, indicates that what follows is a string (top four bits = 0xa0) that is eight bytes long (bottom four bits = 0x08).
第二个字节0xa8表示接下来是一个字符串(高四位=0xa0),长度为8个字节(低四位=0x08)。
- The next eight bytes are the field name userName in ASCII. Since the length was indicated previously, there’s no need for any marker to tell us where the string ends (or any escaping).
接下来的八个字节是ASCII编码的字段名userName。由于长度已在前面指明,因此不需要任何标记来告诉我们字符串在哪里结束(也不需要任何转义)。
- The next seven bytes encode the six-letter string value Martin with a prefix 0xa6, and so on.
接下来的七个字节以前缀0xa6编码了六个字母的字符串值Martin,以此类推。
The binary encoding is 66 bytes long, which is only a little less than the 81 bytes taken by the textual JSON encoding (with whitespace removed). All the binary encodings of JSON are similar in this regard. It’s not clear whether such a small space reduction (and perhaps a speedup in parsing) is worth the loss of human-readability.
二进制编码长66字节,比文本JSON编码(无空格)占用的81字节略少。所有JSON的二进制编码在这方面都相似。尚不清楚这样的空间减少(可能会提高解析速度)是否值得失去人类可读性。
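The type-byte rules just described can be sketched in a few lines. This follows the simplified four-bit description above, not the full MessagePack specification (which has many more type codes):

上面描述的类型字节规则可以用几行代码勾勒出来。下面的示例只遵循上文的简化四位描述,而不是完整的MessagePack规范(后者有更多的类型码):

```python
# A sketch of the simplified four-bit rules described above; the real
# MessagePack spec defines many more type codes than these two.
def describe(byte: int) -> str:
    top, bottom = byte & 0xF0, byte & 0x0F
    if top == 0x80:
        return f"object with {bottom} fields"
    if top == 0xA0:
        return f"string of {bottom} bytes"
    return "other type"

assert describe(0x83) == "object with 3 fields"   # first byte of Example 4-1
assert describe(0xA8) == "string of 8 bytes"      # length of "userName"
assert describe(0xA6) == "string of 6 bytes"      # length of "Martin"
```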
In the following sections we will see how we can do much better, and encode the same record in just 32 bytes.
在接下来的章节中,我们将看到我们如何可以做得更好,仅使用32个字节来编码相同的记录。
Thrift and Protocol Buffers
Apache Thrift [ 15 ] and Protocol Buffers (protobuf) [ 16 ] are binary encoding libraries that are based on the same principle. Protocol Buffers was originally developed at Google, Thrift was originally developed at Facebook, and both were made open source in 2007–08 [ 17 ].
Apache Thrift和Protocol Buffers(protobuf)是基于相同原则的二进制编码库。Protocol Buffers最初在Google开发,Thrift最初在Facebook开发,并在2007-08年开源 [17]。
Both Thrift and Protocol Buffers require a schema for any data that is encoded. To encode the data in Example 4-1 in Thrift, you would describe the schema in the Thrift interface definition language (IDL) like this:
Thrift和Protocol Buffers都需要为编码的任何数据提供模式。为了在Thrift中编码Example 4-1中的数据,您需要使用Thrift接口定义语言(IDL)描述模式,如下所示:
```
struct Person {
  1: required string       userName,
  2: optional i64          favoriteNumber,
  3: optional list<string> interests
}
```
The equivalent schema definition for Protocol Buffers looks very similar:
Protocol Buffers的等效模式定义看起来非常相似:
```
message Person {
    required string user_name       = 1;
    optional int64  favorite_number = 2;
    repeated string interests       = 3;
}
```
Thrift and Protocol Buffers each come with a code generation tool that takes a schema definition like the ones shown here, and produces classes that implement the schema in various programming languages [ 18 ]. Your application code can call this generated code to encode or decode records of the schema.
Thrift和Protocol Buffers都带有一个代码生成工具,它接受类似这里所示的模式定义,并生成以各种编程语言实现该模式的类 [18]。您的应用程序代码可以调用生成的代码来编码或解码该模式的记录。
What does data encoded with this schema look like? Confusingly, Thrift has two different binary encoding formats, iii called BinaryProtocol and CompactProtocol , respectively. Let’s look at BinaryProtocol first. Encoding Example 4-1 in that format takes 59 bytes, as shown in Figure 4-2 [ 19 ].
用这个模式编码的数据是什么样子的?令人困惑的是,Thrift有两种不同的二进制编码格式,分别称为BinaryProtocol和CompactProtocol。我们先来看BinaryProtocol:用这种格式编码示例4-1需要59个字节,如图4-2所示 [19]。
Similarly to Figure 4-1 , each field has a type annotation (to indicate whether it is a string, integer, list, etc.) and, where required, a length indication (length of a string, number of items in a list). The strings that appear in the data (“Martin”, “daydreaming”, “hacking”) are also encoded as ASCII (or rather, UTF-8), similar to before.
与图4-1类似,每个字段都有类型注释(用于表示它是字符串、整数、列表等),并且在必要时有长度指示(字符串的长度,列表中的项数)。出现在数据中的字符串(“Martin”、“daydreaming”、“hacking”)也被编码为ASCII(或更准确地说,是UTF-8),与以前类似。
The big difference compared to Figure 4-1 is that there are no field names (userName, favoriteNumber, interests). Instead, the encoded data contains field tags, which are numbers (1, 2, and 3). Those are the numbers that appear in the schema definition. Field tags are like aliases for fields—they are a compact way of saying what field we’re talking about, without having to spell out the field name.
与图4-1相比,最大的区别是没有字段名(userName、favoriteNumber、interests)。相反,编码数据中包含字段标签(field tag),即数字(1、2和3)。这些数字就出现在模式定义中。字段标签就像字段的别名:它们是一种紧凑的方式来指明我们谈论的是哪个字段,而无需拼写出字段名。
The Thrift CompactProtocol encoding is semantically equivalent to BinaryProtocol, but as you can see in Figure 4-3 , it packs the same information into only 34 bytes. It does this by packing the field type and tag number into a single byte, and by using variable-length integers. Rather than using a full eight bytes for the number 1337, it is encoded in two bytes, with the top bit of each byte used to indicate whether there are still more bytes to come. This means numbers between –64 and 63 are encoded in one byte, numbers between –8192 and 8191 are encoded in two bytes, etc. Bigger numbers use more bytes.
Thrift CompactProtocol 编码与 BinaryProtocol 在语义上是等价的,但正如图 4-3 所示,它将相同的信息压缩到仅 34 字节内。它通过将字段类型和标记号码打包成单个字节,并使用可变长度整数来实现。它不是使用完整的八个字节来表示数字 1337,而是使用两个字节进行编码,每个字节的最高位用于指示是否还有更多的字节。这意味着在 -64 和 63 之间的数字在一个字节中编码,在 -8192 和 8191 之间的数字在两个字节中编码,以此类推。更大的数字使用更多的字节。
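The variable-length integer scheme just described can be sketched as follows, assuming values in the 64-bit range (the zigzag step maps the sign so that small negative numbers also get short encodings):

上面描述的变长整数方案可以这样勾勒(假设数值在64位范围内;zigzag步骤对符号进行映射,使得绝对值小的负数也能得到较短的编码):

```python
def encode_varint(n: int) -> bytes:
    """Variable-length integer in the style of Thrift's CompactProtocol:
    zigzag-map the sign, then emit 7 bits per byte, top bit = "more follows"."""
    z = (n << 1) ^ (n >> 63)           # zigzag: 0,-1,1,-2,... -> 0,1,2,3,...
    out = bytearray()
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)  # top bit set: more bytes follow
        z >>= 7
    out.append(z)                      # top bit clear: last byte
    return bytes(out)

assert len(encode_varint(1337)) == 2                         # not a full 8 bytes
assert len(encode_varint(63)) == 1 and len(encode_varint(-64)) == 1
assert len(encode_varint(8191)) == 2 and len(encode_varint(-8192)) == 2
assert len(encode_varint(64)) == 2                           # just past the boundary
```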
Finally, Protocol Buffers (which has only one binary encoding format) encodes the same data as shown in Figure 4-4 . It does the bit packing slightly differently, but is otherwise very similar to Thrift’s CompactProtocol. Protocol Buffers fits the same record in 33 bytes.
最后,Protocol Buffers(它只有一种二进制编码格式)对相同的数据进行编码,如图4-4所示。它的位打包方式略有不同,但在其他方面与Thrift的CompactProtocol非常相似。Protocol Buffers用33个字节就装下了同一条记录。
One detail to note: in the schemas shown earlier, each field was marked either required or optional, but this makes no difference to how the field is encoded (nothing in the binary data indicates whether a field was required). The difference is simply that required enables a runtime check that fails if the field is not set, which can be useful for catching bugs.
需要注意的一点是:在之前展示的模式中,每个字段都被标记为required或optional,但这对字段的编码方式没有任何影响(二进制数据中没有任何信息指明字段是否必需)。区别仅在于,required会启用一个运行时检查,如果字段未设置,该检查就会失败,这对于捕捉错误非常有用。
Field tags and schema evolution
We said previously that schemas inevitably need to change over time. We call this schema evolution . How do Thrift and Protocol Buffers handle schema changes while keeping backward and forward compatibility?
我们之前说过,模式不可避免地会随时间变化。我们称之为模式演化。那么,Thrift和Protocol Buffers是如何在保持向后和向前兼容的同时处理模式变更的呢?
As you can see from the examples, an encoded record is just the concatenation of its encoded fields. Each field is identified by its tag number (the numbers 1, 2, 3 in the sample schemas) and annotated with a datatype (e.g., string or integer). If a field value is not set, it is simply omitted from the encoded record. From this you can see that field tags are critical to the meaning of the encoded data. You can change the name of a field in the schema, since the encoded data never refers to field names, but you cannot change a field’s tag, since that would make all existing encoded data invalid.
从这些示例中可以看出,编码后的记录只是其编码字段的串联。每个字段由其标签号(示例模式中的数字1、2、3)标识,并带有数据类型(例如字符串或整数)的注释。如果字段值未设置,它就会直接从编码记录中省略。由此可以看出,字段标签对于编码数据的含义至关重要。你可以更改模式中字段的名称,因为编码数据从不引用字段名;但你无法更改字段的标签,因为那会使所有现有的编码数据失效。
You can add new fields to the schema, provided that you give each field a new tag number. If old code (which doesn’t know about the new tag numbers you added) tries to read data written by new code, including a new field with a tag number it doesn’t recognize, it can simply ignore that field. The datatype annotation allows the parser to determine how many bytes it needs to skip. This maintains forward compatibility: old code can read records that were written by new code.
如果您给每个字段分配新标签号,那么可以向模式添加新字段。如果旧代码(不知道您添加的新标签号)尝试读取由新代码编写的数据,包括使用无法识别的标签号的新字段,则可以简单地忽略该字段。数据类型注释允许解析器确定需要跳过多少字节。 这保持向前兼容性:旧代码可以读取由新代码编写的记录。
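The skip-unknown-fields mechanism can be illustrated with a toy tag/type/value format (this is NOT the real Thrift or Protocol Buffers wire format, just a sketch of the principle): the type annotation tells old code how many bytes to skip for a tag it doesn't recognize.

跳过未知字段的机制可以用一个玩具式的“标签/类型/值”格式来说明(这并不是真实的Thrift或Protocol Buffers线上格式,只是对原理的勾勒):类型注释告诉旧代码,对于无法识别的标签需要跳过多少字节。

```python
# Toy format: each field is [tag byte][type byte][payload].
# Type 1 = 4-byte big-endian integer; type 2 = string with 1-byte length prefix.
def decode(buf: bytes, known_tags: set) -> dict:
    fields, i = {}, 0
    while i < len(buf):
        tag, typ = buf[i], buf[i + 1]
        i += 2
        if typ == 1:
            value, size = int.from_bytes(buf[i:i + 4], "big"), 4
        else:
            size = 1 + buf[i]
            value = buf[i + 1:i + size].decode("utf-8")
        if tag in known_tags:
            fields[tag] = value        # unknown tags are skipped: `size` told
        i += size                      # us how many bytes to jump over
    return fields

data = bytes([1, 2, 6]) + b"Martin" + bytes([2, 1]) + (1337).to_bytes(4, "big")
assert decode(data, {1, 2}) == {1: "Martin", 2: 1337}   # new code sees both fields
assert decode(data, {1}) == {1: "Martin"}               # old code ignores tag 2
```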
What about backward compatibility? As long as each field has a unique tag number, new code can always read old data, because the tag numbers still have the same meaning. The only detail is that if you add a new field, you cannot make it required. If you were to add a field and make it required, that check would fail if new code read data written by old code, because the old code will not have written the new field that you added. Therefore, to maintain backward compatibility, every field you add after the initial deployment of the schema must be optional or have a default value.
对于向后兼容性的处理,只要每个字段都有唯一的标签号码,新代码就能够读取旧数据,因为标签号码的含义没有改变。唯一的问题是,如果你添加了一个新字段,就不能将其设为必填字段。如果你添加了一个必填字段,新代码读取旧代码写入的数据时,该检查将失败,因为旧代码将无法写入你添加的新字段。因此,为了保持向后兼容性,在架构的初始部署之后添加的每个字段都必须是可选的或者有一个默认值。
Removing a field is just like adding a field, with backward and forward compatibility concerns reversed. That means you can only remove a field that is optional (a required field can never be removed), and you can never use the same tag number again (because you may still have data written somewhere that includes the old tag number, and that field must be ignored by new code).
删除一个字段就像添加一个字段一样,需要考虑向前和向后兼容性问题的反转。这意味着你只能删除一个可选的字段(必需字段永远不能被删除),而且你永远不能再次使用相同的标签号码(因为你可能仍然有数据写在某个地方,包括旧的标签号码,而且新代码必须忽略该字段)。
Datatypes and schema evolution
What about changing the datatype of a field? That may be possible—check the documentation for details—but there is a risk that values will lose precision or get truncated. For example, say you change a 32-bit integer into a 64-bit integer. New code can easily read data written by old code, because the parser can fill in any missing bits with zeros. However, if old code reads data written by new code, the old code is still using a 32-bit variable to hold the value. If the decoded 64-bit value won’t fit in 32 bits, it will be truncated.
更改字段的数据类型怎么样?这可能是可能的-请检查文档以获取详细信息-但存在值将失去精度或被截断的风险。例如,假设您将32位整数更改为64位整数。新代码可以轻松读取旧代码编写的数据,因为解析器可以用零填充任何缺失的位。但是,如果旧代码读取新代码编写的数据,则旧代码仍在使用32位变量来保存该值。如果解码的64位值无法放入32位中,则将被截断。
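The truncation hazard is easy to demonstrate by simulating what a 32-bit variable does to a 64-bit value:

截断的风险很容易演示:模拟一下把64位值放进32位变量时会发生什么。

```python
def to_int32(v: int) -> int:
    """Keep only the low 32 bits and reinterpret them as a signed 32-bit
    integer, the way a 32-bit variable in old code would."""
    v &= 0xFFFFFFFF
    return v - (1 << 32) if v >= (1 << 31) else v

assert to_int32(1337) == 1337              # small values survive unchanged
assert to_int32(2**32 + 5) == 5            # high bits are silently dropped
assert to_int32(2**31) == -(2**31)         # truncation can even flip the sign
```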
A curious detail of Protocol Buffers is that it does not have a list or array datatype, but instead has a repeated marker for fields (which is a third option alongside required and optional). As you can see in Figure 4-4, the encoding of a repeated field is just what it says on the tin: the same field tag simply appears multiple times in the record. This has the nice effect that it’s okay to change an optional (single-valued) field into a repeated (multi-valued) field. New code reading old data sees a list with zero or one elements (depending on whether the field was present); old code reading new data sees only the last element of the list.
Protocol Buffers有一个奇怪的细节:它没有列表或数组数据类型,而是为字段提供了repeated标记(这是required和optional之外的第三个选项)。如图4-4所示,repeated字段的编码正如其名:同一个字段标签只是在记录中出现多次。这带来一个不错的效果:可以将optional(单值)字段改为repeated(多值)字段。读取旧数据的新代码会看到一个包含零个或一个元素的列表(取决于该字段是否存在);读取新数据的旧代码则只会看到列表的最后一个元素。
Thrift has a dedicated list datatype, which is parameterized with the datatype of the list elements. This does not allow the same evolution from single-valued to multi-valued as Protocol Buffers does, but it has the advantage of supporting nested lists.
Thrift有一个专用的列表数据类型,它用列表元素的数据类型进行参数化。这不支持Protocol Buffers那样从单值到多值的演变,但它的优点是支持嵌套列表。
Avro
Apache Avro [ 20 ] is another binary encoding format that is interestingly different from Protocol Buffers and Thrift. It was started in 2009 as a subproject of Hadoop, as a result of Thrift not being a good fit for Hadoop’s use cases [ 21 ].
Apache Avro是另一种有趣而与Protocol Buffers和Thrift不同的二进制编码格式。它始于2009年,作为Hadoop的一个子项目,由于Thrift不适合Hadoop的用例而产生。
Avro also uses a schema to specify the structure of the data being encoded. It has two schema languages: one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily machine-readable.
Avro还使用模式来指定被编码的数据的结构。它有两种模式语言:一种(Avro IDL)用于人类编辑,另一种(基于JSON的)更容易被机器读取。
Our example schema, written in Avro IDL, might look like this:
我们的示例模式,使用Avro IDL编写,可能如下所示:
```
record Person {
    string               userName;
    union { null, long } favoriteNumber = null;
    array<string>        interests;
}
```
The equivalent JSON representation of that schema is as follows:
该模式的等效JSON表示形式如下:
```json
{
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
        {"name": "interests", "type": {"type": "array", "items": "string"}}
    ]
}
```
First of all, notice that there are no tag numbers in the schema. If we encode our example record ( Example 4-1 ) using this schema, the Avro binary encoding is just 32 bytes long—the most compact of all the encodings we have seen. The breakdown of the encoded byte sequence is shown in Figure 4-5 .
首先,请注意模式中没有标签号。如果使用此模式对我们的示例记录(示例4-1)进行编码,则Avro二进制编码仅为32个字节,是我们所见过的所有编码中最紧凑的。编码字节序列的分解如图4-5所示。
If you examine the byte sequence, you can see that there is nothing to identify fields or their datatypes. The encoding simply consists of values concatenated together. A string is just a length prefix followed by UTF-8 bytes, but there’s nothing in the encoded data that tells you that it is a string. It could just as well be an integer, or something else entirely. An integer is encoded using a variable-length encoding (the same as Thrift’s CompactProtocol).
如果您检查字节序列,您会发现没有东西可以标识字段或它们的数据类型。编码只是将值串联在一起。字符串只是长度前缀后跟UTF-8字节,但编码数据中没有任何东西告诉您它是字符串。它也可以是整数或其他任何东西。整数使用可变长度编码(与Thrift的CompactProtocol相同)。
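The value-concatenation encoding can be reproduced by hand, following the rules described here (length-prefixed UTF-8 strings, zigzag varints, a union branch index, and array blocks with a zero terminator). This sketch reconstructs the 32-byte figure mentioned above:

这种“值直接串联”的编码可以按照这里描述的规则手工复现(带长度前缀的UTF-8字符串、zigzag变长整数、联合类型的分支索引,以及以零计数结尾的数组块)。下面的示例重现了上文提到的32字节长度:

```python
def zigzag_varint(n: int) -> bytes:
    """Avro's variable-length zigzag integer (same scheme as CompactProtocol)."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def avro_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data      # length prefix, then UTF-8 bytes

encoded = (
    avro_string("Martin")          # userName
    + zigzag_varint(1)             # union branch 1 = long (branch 0 would be null)
    + zigzag_varint(1337)          # favoriteNumber
    + zigzag_varint(2)             # interests: a block of 2 items...
    + avro_string("daydreaming")
    + avro_string("hacking")
    + zigzag_varint(0)             # ...followed by a zero count ending the array
)
assert len(encoded) == 32          # the 32 bytes mentioned in the text
```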
To parse the binary data, you go through the fields in the order that they appear in the schema and use the schema to tell you the datatype of each field. This means that the binary data can only be decoded correctly if the code reading the data is using the exact same schema as the code that wrote the data. Any mismatch in the schema between the reader and the writer would mean incorrectly decoded data.
解析二进制数据时,按照模式中的顺序遍历各个字段,使用模式告诉你每个字段的数据类型。这意味着只有在读取数据的代码与写入数据的代码使用完全相同的模式时,二进制数据才能被正确解码。读者和写者之间模式的不匹配会导致数据解码错误。
So, how does Avro support schema evolution?
那么,Avro如何支持模式演化?
The writer’s schema and the reader’s schema
With Avro, when an application wants to encode some data (to write it to a file or database, to send it over the network, etc.), it encodes the data using whatever version of the schema it knows about—for example, that schema may be compiled into the application. This is known as the writer’s schema .
当一个应用程序想要对一些数据进行编码(写入文件或数据库,通过网络发送等),使用Avro,它会使用任何它知道的模式版本来对数据进行编码 - 例如,该模式可能已编译到应用程序中。这被称为编写者的模式。
When an application wants to decode some data (read it from a file or database, receive it from the network, etc.), it is expecting the data to be in some schema, which is known as the reader’s schema . That is the schema the application code is relying on—code may have been generated from that schema during the application’s build process.
当一个应用程序想要解码一些数据(从文件或数据库中读取,从网络中接收等等),它期望数据以某种模式存在,这被称为读取者模式。这是应用程序代码所依赖的模式,代码可能在应用程序构建过程中从该模式中生成。
The key idea with Avro is that the writer’s schema and the reader’s schema don’t have to be the same —they only need to be compatible. When data is decoded (read), the Avro library resolves the differences by looking at the writer’s schema and the reader’s schema side by side and translating the data from the writer’s schema into the reader’s schema. The Avro specification [ 20 ] defines exactly how this resolution works, and it is illustrated in Figure 4-6 .
Avro的关键思想是,编写者的模式和读者的模式不必相同-它们只需要兼容。当数据被解码(读取)时,Avro库通过并排查看编写者的模式和读者的模式并将数据从编写者的模式转换为读者的模式来解决差异。 Avro规范明确定义了这种解析方式,并在图4-6中进行了说明。
For example, it’s no problem if the writer’s schema and the reader’s schema have their fields in a different order, because the schema resolution matches up the fields by field name. If the code reading the data encounters a field that appears in the writer’s schema but not in the reader’s schema, it is ignored. If the code reading the data expects some field, but the writer’s schema does not contain a field of that name, it is filled in with a default value declared in the reader’s schema.
例如,如果写入者的模式和读者的模式具有不同顺序的字段,也不会有问题,因为模式分辨率通过字段名称将字段匹配起来。如果读取数据的代码遇到出现在写入者模式中但不在读者模式中的字段,则会被忽略。如果读取数据的代码期望某个字段,但写入者的模式中不包含该名称的字段,则会使用读者模式中声明的默认值填充。
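These resolution rules can be sketched with plain dictionaries (this is illustrative only, not the real Avro library API): fields are matched by name, unknown writer fields are ignored, and missing fields are filled in from the reader's declared defaults.

这些解析规则可以用普通的字典来勾勒(仅作示意,并非真实的Avro库API):字段按名称匹配,写入者模式中多出的字段被忽略,缺失的字段则用读取者模式声明的默认值填充。

```python
# An illustrative sketch of Avro-style schema resolution.
def resolve(record: dict, reader_fields: list) -> dict:
    out = {}
    for field in reader_fields:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError("no value and no default for " + field["name"])
    return out

reader = [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
]
# Written with a schema that had no favoriteNumber but an extra field:
old_record = {"userName": "Martin", "nickname": "M"}
assert resolve(old_record, reader) == {"userName": "Martin", "favoriteNumber": None}
```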
Schema evolution rules
With Avro, forward compatibility means that you can have a new version of the schema as writer and an old version of the schema as reader. Conversely, backward compatibility means that you can have a new version of the schema as reader and an old version as writer.
使用 Avro,向前兼容性意味着可以将新版本的模式作为写入者,并将旧版本的模式作为读取者。反之,向后兼容性意味着可以将新版本的模式作为读取者,并将旧版本的模式作为写入者。
To maintain compatibility, you may only add or remove a field that has a default value. (The field favoriteNumber in our Avro schema has a default value of null.) For example, say you add a field with a default value, so this new field exists in the new schema but not the old one. When a reader using the new schema reads a record written with the old schema, the default value is filled in for the missing field.
为了保持兼容性,你只能添加或删除具有默认值的字段。(在我们的Avro模式中,favoriteNumber字段的默认值为null。)例如,假设你添加了一个带默认值的字段,那么这个新字段存在于新模式中但不存在于旧模式中。当使用新模式的读取者读取用旧模式写入的记录时,缺失字段会被填入默认值。
If you were to add a field that has no default value, new readers wouldn’t be able to read data written by old writers, so you would break backward compatibility. If you were to remove a field that has no default value, old readers wouldn’t be able to read data written by new writers, so you would break forward compatibility.
如果你添加的字段没有默认值,新的读取者将无法读取旧的写入者写入的数据,这会破坏向后兼容性。如果你删除的字段没有默认值,旧的读取者将无法读取新的写入者写入的数据,这会破坏向前兼容性。
In some programming languages, null is an acceptable default for any variable, but this is not the case in Avro: if you want to allow a field to be null, you have to use a union type. For example, union { null, long, string } field; indicates that field can be a number, or a string, or null. You can only use null as a default value if it is one of the branches of the union. This is a little more verbose than having everything nullable by default, but it helps prevent bugs by being explicit about what can and cannot be null [22].
在一些编程语言中,null是任何变量都可接受的默认值,但在Avro中并非如此:如果想允许某个字段为null,必须使用联合类型。例如,union { null, long, string } field; 表示field可以是数字、字符串或null。只有当null是联合类型的一个分支时,才能将它用作默认值。这比默认一切皆可为空要冗长一些,但通过明确哪些可以为空、哪些不可以为空,有助于预防错误 [22]。
Consequently, Avro doesn’t have optional and required markers in the same way as Protocol Buffers and Thrift do (it has union types and default values instead).
因此,Avro没有像Protocol Buffers和Thrift那样的可选和必需标记(它有联合类型和默认值)。
Changing the datatype of a field is possible, provided that Avro can convert the type. Changing the name of a field is possible but a little tricky: the reader’s schema can contain aliases for field names, so it can match an old writer’s schema field names against the aliases. This means that changing a field name is backward compatible but not forward compatible. Similarly, adding a branch to a union type is backward compatible but not forward compatible.
改变字段的数据类型是可能的,前提是Avro可以转换类型。改变字段名称是可能的,但有点棘手:读取器模式可以包含字段名称的别名,因此它可以将旧写入器模式字段名称与别名进行匹配。这意味着更改字段名称是向后兼容但不是向前兼容。同样,向联合类型添加分支是向后兼容但不是向前兼容。
But what is the writer’s schema?
There is an important question that we’ve glossed over so far: how does the reader know the writer’s schema with which a particular piece of data was encoded? We can’t just include the entire schema with every record, because the schema would likely be much bigger than the encoded data, making all the space savings from the binary encoding futile.
到目前为止,我们忽略了一个重要问题:读取者如何知道某条数据是用哪个写入者模式编码的?我们不能在每条记录里都包含整个模式,因为模式很可能比编码后的数据大得多,那样二进制编码节省的所有空间就白费了。
The answer depends on the context in which Avro is being used. To give a few examples:
答案取决于Avro使用的上下文。举几个例子:
- Large file with lots of records
-
A common use for Avro—especially in the context of Hadoop—is for storing a large file containing millions of records, all encoded with the same schema. (We will discuss this kind of situation in Chapter 10 .) In this case, the writer of that file can just include the writer’s schema once at the beginning of the file. Avro specifies a file format (object container files) to do this.
Avro的一种常见用途(尤其是在Hadoop的背景下)是存储包含数百万条记录的大文件,所有记录都使用相同的模式编码。(我们将在第10章讨论这种情况。)在这种情况下,该文件的写入者只需在文件开头包含一次写入者模式即可。Avro规定了一种文件格式(对象容器文件)来做这件事。
- Database with individually written records
-
In a database, different records may be written at different points in time using different writer’s schemas—you cannot assume that all the records will have the same schema. The simplest solution is to include a version number at the beginning of every encoded record, and to keep a list of schema versions in your database. A reader can fetch a record, extract the version number, and then fetch the writer’s schema for that version number from the database. Using that writer’s schema, it can decode the rest of the record. (Espresso [ 23 ] works this way, for example.)
在数据库中,不同的记录可能在不同的时间点使用不同的写入者模式写入 - 你不能假设所有记录都具有相同的模式。最简单的解决方案是在每条编码记录的开头包含一个版本号,并在数据库中保留模式版本列表。读取者可以获取记录,提取版本号,然后从数据库中获取该版本号对应的写入者模式。使用该写入者模式,它就能解码记录的其余部分。(例如 Espresso [23] 就是这样工作的。)
- Sending records over a network connection
-
When two processes are communicating over a bidirectional network connection, they can negotiate the schema version on connection setup and then use that schema for the lifetime of the connection. The Avro RPC protocol (see “Dataflow Through Services: REST and RPC” ) works like this.
当两个进程通过双向网络连接通信时,它们可以在连接设置期间协商模式版本,然后在连接的整个生命周期内使用该模式。Avro RPC协议(请参见“服务的数据流:REST和RPC”)是这样工作的。
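The per-record versioning scheme described for databases above can be sketched as follows. This is only an illustration of the idea, with JSON standing in for Avro's binary encoding and an in-memory dict standing in for the schema registry (all names here are invented):

```python
import json
import struct

# Sketch of per-record schema versioning: each encoded record is prefixed
# with a 4-byte schema version number, and a registry maps version numbers
# to writer's schemas. JSON stands in for the real Avro binary encoding.

registry = {1: ["userName"], 2: ["userName", "favoriteNumber"]}  # version -> schema

def encode(version, record):
    body = json.dumps(record).encode()        # stand-in for Avro binary encoding
    return struct.pack(">I", version) + body  # 4-byte big-endian version prefix

def decode(data):
    version = struct.unpack(">I", data[:4])[0]
    writer_schema = registry[version]          # fetch writer's schema by version
    record = json.loads(data[4:])
    return writer_schema, record

blob = encode(2, {"userName": "Martin", "favoriteNumber": 1337})
schema, rec = decode(blob)  # reader recovers both the schema and the record
```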
A database of schema versions is a useful thing to have in any case, since it acts as documentation and gives you a chance to check schema compatibility [ 24 ]. As the version number, you could use a simple incrementing integer, or you could use a hash of the schema.
无论如何,模式版本数据库都是很有用的东西,因为它可以充当文档,并让你有机会检查模式的兼容性[24]。版本号可以使用简单的递增整数,也可以使用模式的哈希值。
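Using a hash of the schema as its version identifier might look like this. Note that the real Avro specification defines its own "Parsing Canonical Form" and fingerprint algorithm; this sketch only shows the general idea using plain SHA-256 over a deterministically serialized schema:

```python
import hashlib
import json

# Sketch: derive a schema version identifier from the schema itself.
# Serializing with sorted keys makes the fingerprint deterministic.
# (Avro's spec defines its own canonical form and fingerprint; this is
# just the general idea, using plain SHA-256.)

def fingerprint(schema):
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

schema = {"type": "record", "name": "Person",
          "fields": [{"name": "userName", "type": "string"}]}
print(fingerprint(schema))  # the same schema always yields the same ID
```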
Dynamically generated schemas
One advantage of Avro’s approach, compared to Protocol Buffers and Thrift, is that the schema doesn’t contain any tag numbers. But why is this important? What’s the problem with keeping a couple of numbers in the schema?
Avro的做法相对于Protocol Buffers和Thrift的一个优势就是模式中不包含任何标签号。但这为什么重要呢?在模式里保留几个数字又有什么问题呢?
The difference is that Avro is friendlier to dynamically generated schemas. For example, say you have a relational database whose contents you want to dump to a file, and you want to use a binary format to avoid the aforementioned problems with textual formats (JSON, CSV, SQL). If you use Avro, you can fairly easily generate an Avro schema (in the JSON representation we saw earlier) from the relational schema and encode the database contents using that schema, dumping it all to an Avro object container file [ 25 ]. You generate a record schema for each database table, and each column becomes a field in that record. The column name in the database maps to the field name in Avro.
不同之处在于Avro更友好地支持动态生成模式。例如,假设您有一个关系型数据库,希望将其内容转储到文件中,并且想要使用二进制格式避免文本格式(JSON、CSV、SQL)的问题。如果使用Avro,您可以相对容易地从关系模式生成一个Avro模式(使用我们之前看到的JSON表示),并使用该模式对数据库内容进行编码,将其全部转储到Avro对象容器文件[25]中。您为每个数据库表生成一个记录模式,每个列成为该记录中的一个字段。数据库中的列名映射到Avro中的字段名。
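Dynamically generating an Avro record schema from a relational table definition can be sketched as below. The table name, column types, and type mapping are invented for illustration; a real exporter would read the columns from the database's catalog:

```python
import json

# Hedged sketch: generate an Avro record schema (JSON representation) from
# a relational table definition. Table, columns, and the SQL-type mapping
# are invented examples; a real tool would query the database catalog.

SQL_TO_AVRO = {"varchar": "string", "bigint": "long", "boolean": "boolean"}

def table_to_avro(table_name, columns):
    return {
        "type": "record",
        "name": table_name,
        "fields": [
            # Make every field nullable with a null default, so columns can
            # come and go between dumps without breaking compatibility.
            {"name": col, "type": ["null", SQL_TO_AVRO[sql_type]],
             "default": None}
            for col, sql_type in columns
        ],
    }

schema = table_to_avro("users", [("user_name", "varchar"), ("user_id", "bigint")])
print(json.dumps(schema, indent=2))
```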
Now, if the database schema changes (for example, a table has one column added and one column removed), you can just generate a new Avro schema from the updated database schema and export data in the new Avro schema. The data export process does not need to pay any attention to the schema change—it can simply do the schema conversion every time it runs. Anyone who reads the new data files will see that the fields of the record have changed, but since the fields are identified by name, the updated writer’s schema can still be matched up with the old reader’s schema.
现在,如果数据库模式发生变化(例如,添加了一列和删除了一列),您只需从更新的数据库模式生成一个新的Avro模式,并按新的Avro模式导出数据。数据导出过程不需要关注模式变化 - 它可以每次运行时进行模式转换。读取新数据文件的任何人都会看到记录的字段已更改,但由于字段是通过名称标识的,因此更新的写入器模式仍然可以与旧的读者模式匹配。
By contrast, if you were using Thrift or Protocol Buffers for this purpose, the field tags would likely have to be assigned by hand: every time the database schema changes, an administrator would have to manually update the mapping from database column names to field tags. (It might be possible to automate this, but the schema generator would have to be very careful to not assign previously used field tags.) This kind of dynamically generated schema simply wasn’t a design goal of Thrift or Protocol Buffers, whereas it was for Avro.
相比之下,如果您使用Thrift或Protocol Buffers来实现此目的,字段标签很可能需要手动分配:每当数据库模式更改时,管理员都必须手动更新从数据库列名称到字段标签的映射。(这可能是可以自动化的,但模式生成器必须非常小心,以不分配先前使用过的字段标签。)这种动态生成的模式根本不是Thrift或Protocol Buffers的设计目标,但却是Avro的设计目标。
Code generation and dynamically typed languages
Thrift and Protocol Buffers rely on code generation: after a schema has been defined, you can generate code that implements this schema in a programming language of your choice. This is useful in statically typed languages such as Java, C++, or C#, because it allows efficient in-memory structures to be used for decoded data, and it allows type checking and autocompletion in IDEs when writing programs that access the data structures.
Thrift和Protocol Buffers依赖代码生成:定义模式之后,你可以用所选的编程语言生成实现该模式的代码。这在Java、C++或C#等静态类型语言中非常有用,因为它允许为解码后的数据使用高效的内存结构,并且在编写访问这些数据结构的程序时支持IDE中的类型检查和自动补全。
In dynamically typed programming languages such as JavaScript, Ruby, or Python, there is not much point in generating code, since there is no compile-time type checker to satisfy. Code generation is often frowned upon in these languages, since they otherwise avoid an explicit compilation step. Moreover, in the case of a dynamically generated schema (such as an Avro schema generated from a database table), code generation is an unnecessary obstacle to getting to the data.
在动态类型的编程语言中,例如JavaScript,Ruby或Python,生成代码几乎没有意义,因为没有编译时类型检查器需要满足。在这些语言中,代码生成通常是不被赞成的,因为它们可以避免显式编译步骤。此外,在动态生成模式的情况下(例如从数据库表生成的Avro模式),代码生成是一个不必要的障碍,阻碍获取数据。
Avro provides optional code generation for statically typed programming languages, but it can be used just as well without any code generation. If you have an object container file (which embeds the writer’s schema), you can simply open it using the Avro library and look at the data in the same way as you could look at a JSON file. The file is self-describing since it includes all the necessary metadata.
Avro为静态类型编程语言提供了可选的代码生成,但即使完全不生成代码也可以很好地使用它。如果你有一个对象容器文件(其中嵌入了写入者模式),你可以直接用Avro库打开它,并像查看JSON文件一样查看其中的数据。该文件是自描述的,因为它包含了所有必要的元数据。
This property is especially useful in conjunction with dynamically typed data processing languages like Apache Pig [ 26 ]. In Pig, you can just open some Avro files, start analyzing them, and write derived datasets to output files in Avro format without even thinking about schemas.
这个属性在动态类型的数据处理语言,比如Apache Pig[26]方面非常有用。在Pig中,你可以直接打开一些Avro文件,开始分析它们,并把得出的数据集以Avro格式写入输出文件,而无需考虑模式。
The Merits of Schemas
As we saw, Protocol Buffers, Thrift, and Avro all use a schema to describe a binary encoding format. Their schema languages are much simpler than XML Schema or JSON Schema, which support much more detailed validation rules (e.g., “the string value of this field must match this regular expression” or “the integer value of this field must be between 0 and 100”). As Protocol Buffers, Thrift, and Avro are simpler to implement and simpler to use, they have grown to support a fairly wide range of programming languages.
正如我们所看到的,Protocol Buffers、Thrift和Avro都使用模式来描述二进制编码格式。它们的模式语言比XML Schema或JSON Schema简单得多;后者支持更详细的验证规则(例如,“此字段的字符串值必须匹配此正则表达式”或“此字段的整数值必须在0到100之间”)。由于Protocol Buffers、Thrift和Avro实现和使用起来更简单,它们已发展到支持相当广泛的编程语言。
The ideas on which these encodings are based are by no means new. For example, they have a lot in common with ASN.1, a schema definition language that was first standardized in 1984 [ 27 ]. It was used to define various network protocols, and its binary encoding (DER) is still used to encode SSL certificates (X.509), for example [ 28 ]. ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers and Thrift [ 29 ]. However, it’s also very complex and badly documented, so ASN.1 is probably not a good choice for new applications.
这些编码所基于的想法并不是新概念。例如,它们与 ASN.1 有很多共同点,这是一种模式定义语言,于 1984 年首次标准化[27]。它被用于定义各种网络协议,其二进制编码(DER)仍然用于编码 SSL 证书(X.509)等[28]。ASN.1 支持使用标签号进行模式演化,类似于 Protocol Buffers 和 Thrift[29]。但是,ASN.1 也非常复杂,文档不完整,因此 ASN.1 对于新应用程序可能不是一个好的选择。
Many data systems also implement some kind of proprietary binary encoding for their data. For example, most relational databases have a network protocol over which you can send queries to the database and get back responses. Those protocols are generally specific to a particular database, and the database vendor provides a driver (e.g., using the ODBC or JDBC APIs) that decodes responses from the database’s network protocol into in-memory data structures.
许多数据系统还实现了某种专有的二进制编码来处理其数据。例如,大多数关系型数据库都有一种网络协议,您可以通过该协议发送查询到数据库并获取响应。这些协议通常是特定于特定数据库的,并且数据库供应商提供了一个驱动程序(例如使用ODBC或JDBC API),该驱动程序将数据库的网络协议解码为内存中的数据结构。
So, we can see that although textual data formats such as JSON, XML, and CSV are widespread, binary encodings based on schemas are also a viable option. They have a number of nice properties:
因此,我们可以看到,尽管JSON、XML和CSV等文本数据格式很常见,但基于模式的二进制编码也是一个可行的选择。它们具有许多优点:
-
They can be much more compact than the various “binary JSON” variants, since they can omit field names from the encoded data.
它们可以比各种“二进制JSON”变体更紧凑,因为它们可以在编码数据中省略字段名称。
-
The schema is a valuable form of documentation, and because the schema is required for decoding, you can be sure that it is up to date (whereas manually maintained documentation may easily diverge from reality).
模式是一种有价值的文档形式,并且由于解码必须依赖模式,你可以确信它是最新的(而手动维护的文档很容易与现实脱节)。
-
Keeping a database of schemas allows you to check forward and backward compatibility of schema changes, before anything is deployed.
保留模式数据库可以让您在部署任何更改之前对模式进行前向和后向兼容性检查。
-
For users of statically typed programming languages, the ability to generate code from the schema is useful, since it enables type checking at compile time.
对于静态类型编程语言的用户而言,从模式中生成代码的能力是有用的,因为它可以使编译时进行类型检查。
In summary, schema evolution allows the same kind of flexibility as schemaless/schema-on-read JSON databases provide (see “Schema flexibility in the document model” ), while also providing better guarantees about your data and better tooling.
简而言之,模式演化提供了与无模式/读时模式的JSON数据库相同的灵活性(参见“文档模型中的模式灵活性”),同时还为你的数据提供了更好的保证和更好的工具。
Modes of Dataflow
At the beginning of this chapter we said that whenever you want to send some data to another process with which you don’t share memory—for example, whenever you want to send data over the network or write it to a file—you need to encode it as a sequence of bytes. We then discussed a variety of different encodings for doing this.
在本章的开头,我们说过,当你想要将数据发送给另一个进程,而该进程与你不共享内存(例如,当你想要通过网络发送数据或将其写入文件时),你需要将其编码为一系列字节。我们随后讨论了许多不同的编码方式来实现这一点。
We talked about forward and backward compatibility, which are important for evolvability (making change easy by allowing you to upgrade different parts of your system independently, and not having to change everything at once). Compatibility is a relationship between one process that encodes the data, and another process that decodes it.
我们谈到了向前和向后兼容性,这对于可发展性非常重要(通过允许您独立升级系统的不同部分,而不必同时更改所有内容,从而使更改变得容易)。兼容性是编码数据的一个进程与解码其的另一个进程之间的关系。
That’s a fairly abstract idea—there are many ways data can flow from one process to another. Who encodes the data, and who decodes it? In the rest of this chapter we will explore some of the most common ways how data flows between processes:
这是一个相当抽象的概念——数据可以以许多方式从一个进程流向另一个进程。是谁编码数据,谁解码数据?在本章的其余部分,我们将探索数据在进程之间流动的一些最常见的方式:
-
Via databases (see “Dataflow Through Databases” )
通过数据库(参见“数据库中的数据流”)
-
Via service calls (see “Dataflow Through Services: REST and RPC” )
通过服务调用(见“通过服务进行数据流:REST和RPC”)
-
Via asynchronous message passing (see “Message-Passing Dataflow” )
通过异步消息传递(参见“消息传递数据流”)
Dataflow Through Databases
In a database, the process that writes to the database encodes the data, and the process that reads from the database decodes it. There may just be a single process accessing the database, in which case the reader is simply a later version of the same process—in that case you can think of storing something in the database as sending a message to your future self .
在数据库中,写入数据库的进程会对数据进行编码,而读取数据库的进程则会进行解码。有可能只有一个进程访问数据库,在这种情况下,读取进程只是同一进程的一个较新版本。因此,可以将将数据存储在数据库中看作是给自己未来发送一条信息。
Backward compatibility is clearly necessary here; otherwise your future self won’t be able to decode what you previously wrote.
向后兼容性在这里是明显必要的;否则,您未来的自己将无法解码您之前所写的内容。
In general, it’s common for several different processes to be accessing a database at the same time. Those processes might be several different applications or services, or they may simply be several instances of the same service (running in parallel for scalability or fault tolerance). Either way, in an environment where the application is changing, it is likely that some processes accessing the database will be running newer code and some will be running older code—for example because a new version is currently being deployed in a rolling upgrade, so some instances have been updated while others haven’t yet.
一般而言,多个不同的进程同时访问数据库是很常见的。这些进程可能是多个不同的应用程序或服务,也可能只是同一服务的多个实例(为了可扩展性或容错而并行运行)。无论哪种方式,在应用程序不断变化的环境中,访问数据库的一些进程很可能运行较新的代码,而另一些进程运行较旧的代码,例如由于新版本正在进行滚动升级部署,因此一些实例已更新,而另一些尚未更新。
This means that a value in the database may be written by a newer version of the code, and subsequently read by an older version of the code that is still running. Thus, forward compatibility is also often required for databases.
这意味着数据库中的值可能被新版本的代码写入,而后被仍在运行的旧版本代码读取。因此,对于数据库也经常需要前向兼容性。
However, there is an additional snag. Say you add a field to a record schema, and the newer code writes a value for that new field to the database. Subsequently, an older version of the code (which doesn’t yet know about the new field) reads the record, updates it, and writes it back. In this situation, the desirable behavior is usually for the old code to keep the new field intact, even though it couldn’t be interpreted.
然而,还有一个额外的问题。假设您向记录模式添加一个字段,而新代码向数据库写入该新字段的值。随后,一个不知道新字段的旧版本代码读取记录,更新它,并将其写回。在这种情况下,通常希望旧代码保留新字段的完整性,尽管旧代码无法解释新字段。
The encoding formats discussed previously support such preservation of unknown fields, but sometimes you need to take care at an application level, as illustrated in Figure 4-7 . For example, if you decode a database value into model objects in the application, and later reencode those model objects, the unknown field might be lost in that translation process. Solving this is not a hard problem; you just need to be aware of it.
之前讨论的编码格式支持未知字段的保留,但有时需要在应用程序级别上进行注意,如图4-7所示。例如,如果您在应用程序中将数据库值解码为模型对象,然后稍后重新对这些模型对象进行编码,则该未知字段可能会在翻译过程中丢失。解决这个问题并不困难,你只需要意识到这个问题即可。
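The application-level pitfall of Figure 4-7 can be sketched as follows: a model class that keeps only the fields it knows about silently drops newer fields on a read-modify-write cycle, unless it stashes the unknown fields and writes them back. The class and field names here are invented for illustration:

```python
# Sketch of the application-level pitfall from Figure 4-7: a model class
# that only keeps known fields would drop newer fields on read-modify-write.
# The fix shown here stashes unknown fields so they survive re-encoding.
# (UserModel and its field names are invented for illustration.)

class UserModel:
    KNOWN_FIELDS = {"user_name", "email"}

    def __init__(self, record):
        # Fields this version of the code knows about:
        self.fields = {k: v for k, v in record.items() if k in self.KNOWN_FIELDS}
        # Everything else is kept aside rather than discarded:
        self.unknown = {k: v for k, v in record.items() if k not in self.KNOWN_FIELDS}

    def to_record(self):
        return {**self.fields, **self.unknown}  # write unknown fields back intact

db_record = {"user_name": "Martin", "email": "[email protected]", "photo_url": "..."}
m = UserModel(db_record)
m.fields["email"] = "[email protected]"  # old code updates a field it knows about
# photo_url (added by newer code) is preserved on write-back
```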
Different values written at different times
A database generally allows any value to be updated at any time. This means that within a single database you may have some values that were written five milliseconds ago, and some values that were written five years ago.
一个数据库通常允许任何值在任何时间进行更新。这意味着在单个数据库中,你可能会有一些值是五毫秒前写入的,也可能会有一些值是五年前写入的。
When you deploy a new version of your application (of a server-side application, at least), you may entirely replace the old version with the new version within a few minutes. The same is not true of database contents: the five-year-old data will still be there, in the original encoding, unless you have explicitly rewritten it since then. This observation is sometimes summed up as data outlives code .
当您部署新版本的应用程序(至少是服务器端应用程序)时,您可以在几分钟内完全使用新版本替换旧版本。但是,数据库内容则不同:5年前的数据仍然以原始编码存在,除非自那时以来您已明确重写它。这种观察有时被总结为数据超越代码。
Rewriting (migrating) data into a new schema is certainly possible, but it’s an expensive thing to do on a large dataset, so most databases avoid it if possible. Most relational databases allow simple schema changes, such as adding a new column with a null default value, without rewriting existing data.v When an old row is read, the database fills in nulls for any columns that are missing from the encoded data on disk. LinkedIn’s document database Espresso uses Avro for storage, allowing it to use Avro’s schema evolution rules [23].
将数据重写(迁移)到新模式中肯定是可能的,但对于大型数据集而言是一项昂贵的工作,因此大多数数据库在可能的情况下避免这种情况。大多数关系型数据库允许简单的模式更改,例如添加一个默认值为 null 的新列,而无需重写现有数据。当读取旧行时,数据库会自动为任何缺失的列填充 null。 LinkedIn 的文档数据库 Espresso 使用 Avro 进行存储,从而可以使用 Avro 的模式演化规则[23]。
Schema evolution thus allows the entire database to appear as if it was encoded with a single schema, even though the underlying storage may contain records encoded with various historical versions of the schema.
模式演化因此允许整个数据库看起来像是使用单一模式编码,尽管底层存储可能包含使用各种历史版本模式编码的记录。
Archival storage
Perhaps you take a snapshot of your database from time to time, say for backup purposes or for loading into a data warehouse (see “Data Warehousing” ). In this case, the data dump will typically be encoded using the latest schema, even if the original encoding in the source database contained a mixture of schema versions from different eras. Since you’re copying the data anyway, you might as well encode the copy of the data consistently.
也许你会不时对数据库做快照,比如用于备份或加载到数据仓库(参见“数据仓库”)。在这种情况下,数据转储通常会使用最新的模式进行编码,即使源数据库中的原始编码混合了不同时期的模式版本。既然反正要复制数据,不如对数据副本进行一致的编码。
As the data dump is written in one go and is thereafter immutable, formats like Avro object container files are a good fit. This is also a good opportunity to encode the data in an analytics-friendly column-oriented format such as Parquet (see “Column Compression” ).
由于数据转储是一次性写入的,并且之后是不可变的,因此诸如Avro对象容器文件之类的格式非常适合。这也是将数据编码为适合分析的列定向格式(例如Parquet)的绝佳机会(请参见“列压缩”)。
In Chapter 10 we will talk more about using data in archival storage.
在第10章中,我们将更多地谈论如何在档案存储中使用数据。
Dataflow Through Services: REST and RPC
When you have processes that need to communicate over a network, there are a few different ways of arranging that communication. The most common arrangement is to have two roles: clients and servers . The servers expose an API over the network, and the clients can connect to the servers to make requests to that API. The API exposed by the server is known as a service .
当您需要通过网络进行通信时,有几种不同的方式可以安排通信。最常见的方式是有两个角色:客户端和服务器。服务器在网络上公开API,客户端可以连接到服务器以向该API发出请求。服务器公开的API称为服务。
The web works this way: clients (web browsers) make requests to web servers, making GET requests to download HTML, CSS, JavaScript, images, etc., and making POST requests to submit data to the server. The API consists of a standardized set of protocols and data formats (HTTP, URLs, SSL/TLS, HTML, etc.). Because web browsers, web servers, and website authors mostly agree on these standards, you can use any web browser to access any website (at least in theory!).
网络就是这样工作的:客户端(Web浏览器)向Web服务器发出请求,通过GET请求下载HTML、CSS、JavaScript、图像等,并通过POST请求向服务器提交数据。API由一组标准化的协议和数据格式(HTTP、URL、SSL/TLS、HTML等)组成。因为Web浏览器、Web服务器和网站作者大多遵循这些标准,所以你可以使用任何Web浏览器访问任何网站(至少理论上如此!)。
Web browsers are not the only type of client. For example, a native app running on a mobile device or a desktop computer can also make network requests to a server, and a client-side JavaScript application running inside a web browser can use XMLHttpRequest to become an HTTP client (this technique is known as Ajax [ 30 ]). In this case, the server’s response is typically not HTML for displaying to a human, but rather data in an encoding that is convenient for further processing by the client-side application code (such as JSON). Although HTTP may be used as the transport protocol, the API implemented on top is application-specific, and the client and server need to agree on the details of that API.
网络浏览器并非唯一的客户端类型。例如,运行在移动设备或台式计算机上的本地应用程序也可以向服务器发送网络请求,而在 Web 浏览器内运行的客户端 JavaScript 应用程序可以使用 XMLHttpRequest 变成 HTTP 客户端(这种技术称为 Ajax [30])。 在这种情况下,服务器的响应通常不是用于向人显示的 HTML,而是以一种对客户端端应用程序代码进一步处理方便的编码形式呈现的数据(例如 JSON)。虽然 HTTP 可以用作传输协议,但在其上实现的 API 是应用程序特定的,客户端和服务器需要就该 API 的详细信息达成一致。
Moreover, a server can itself be a client to another service (for example, a typical web app server acts as client to a database). This approach is often used to decompose a large application into smaller services by area of functionality, such that one service makes a request to another when it requires some functionality or data from that other service. This way of building applications has traditionally been called a service-oriented architecture (SOA), more recently refined and rebranded as microservices architecture [ 31 , 32 ].
此外,一个服务器本身也可以作为另一个服务的客户端(例如,一个典型的Web应用服务器作为数据库的客户端)。这种方法通常用于按功能区域将大型应用程序分解为较小的服务,其中一个服务在需要另一个服务的功能或数据时向另一个服务发出请求。构建应用程序的这种方式传统上被称为面向服务的架构(SOA),最近被改进和重新品牌为微服务架构[31,32]。
In some ways, services are similar to databases: they typically allow clients to submit and query data. However, while databases allow arbitrary queries using the query languages we discussed in Chapter 2 , services expose an application-specific API that only allows inputs and outputs that are predetermined by the business logic (application code) of the service [ 33 ]. This restriction provides a degree of encapsulation: services can impose fine-grained restrictions on what clients can and cannot do.
有些方面,服务类似于数据库:它们通常允许客户端提交和查询数据。然而,尽管数据库允许使用我们在第2章中讨论的查询语言进行任意查询,但服务公开了一个特定于应用程序的API,只允许输入和输出的数据由业务逻辑(应用程序代码)预先确定[33]。这种限制提供了一定的封装:服务可以对客户端可以和不可以做的事情施加细粒度的限制。
A key design goal of a service-oriented/microservices architecture is to make the application easier to change and maintain by making services independently deployable and evolvable. For example, each service should be owned by one team, and that team should be able to release new versions of the service frequently, without having to coordinate with other teams. In other words, we should expect old and new versions of servers and clients to be running at the same time, and so the data encoding used by servers and clients must be compatible across versions of the service API—precisely what we’ve been talking about in this chapter.
服务导向/微服务架构的一个主要设计目标是通过使服务独立部署和演化来使应用程序更易于修改和维护。例如,每个服务应由一个团队拥有,并且该团队应能够频繁发布该服务的新版本,而无需与其他团队协调。换句话说,我们应该期望旧版和新版服务器和客户端同时运行,因此服务器和客户端使用的数据编码必须在服务API的各个版本中兼容 - 这正是我们在本章中谈论的内容。
Web services
When HTTP is used as the underlying protocol for talking to the service, it is called a web service . This is perhaps a slight misnomer, because web services are not only used on the web, but in several different contexts. For example:
当HTTP被用作与服务通信的基础协议时,就被称为web服务。这可能略微有误,因为web服务不仅在网络上使用,而且在几种不同的上下文中使用。例如:
-
A client application running on a user’s device (e.g., a native app on a mobile device, or JavaScript web app using Ajax) making requests to a service over HTTP. These requests typically go over the public internet.
用户设备上运行的客户端应用程序(例如移动设备上的本机应用程序或使用Ajax的JavaScript Web应用程序),通过HTTP向服务发出请求。这些请求通常通过公共互联网进行。
-
One service making requests to another service owned by the same organization, often located within the same datacenter, as part of a service-oriented/microservices architecture. (Software that supports this kind of use case is sometimes called middleware .)
一项服务向同一组织拥有的另一个服务发出请求,通常位于相同的数据中心,作为面向服务/微服务体系结构的一部分。(支持此类用例的软件有时称为中间件。)
-
One service making requests to a service owned by a different organization, usually via the internet. This is used for data exchange between different organizations’ backend systems. This category includes public APIs provided by online services, such as credit card processing systems, or OAuth for shared access to user data.
一种服务向不同组织拥有的服务发出请求,通常通过互联网进行。这用于不同组织后端系统之间的数据交换。这个类别包括在线服务提供的公共API,例如信用卡处理系统,或用于共享访问用户数据的OAuth。
There are two popular approaches to web services: REST and SOAP . They are almost diametrically opposed in terms of philosophy, and often the subject of heated debate among their respective proponents. vi
有两种流行的 Web 服务方法:REST 和 SOAP。它们在哲学上几乎是完全相反的,常常是各自支持者之间的激烈辩论的主题。
REST is not a protocol, but rather a design philosophy that builds upon the principles of HTTP [ 34 , 35 ]. It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for cache control, authentication, and content type negotiation. REST has been gaining popularity compared to SOAP, at least in the context of cross-organizational service integration [ 36 ], and is often associated with microservices [ 31 ]. An API designed according to the principles of REST is called RESTful .
REST不是一种协议,而是一种基于HTTP原则的设计哲学。 它强调使用简单的数据格式,使用URL来识别资源,并利用HTTP的缓存控制、认证和内容类型协商等特性。与SOAP相比,REST在跨组织服务集成的上下文中越来越受欢迎,并经常与微服务相关联。 根据REST原则设计的API称为RESTful。
By contrast, SOAP is an XML-based protocol for making network API requests. vii Although it is most commonly used over HTTP, it aims to be independent from HTTP and avoids using most HTTP features. Instead, it comes with a sprawling and complex multitude of related standards (the web service framework , known as WS-* ) that add various features [ 37 ].
相比之下,SOAP是一种基于XML的协议,用于进行网络API请求。尽管它最常用于HTTP上,但它旨在独立于HTTP,并避免使用大多数HTTP功能。相反,它带有庞大而复杂的相关标准(称为WS-*的Web服务框架),添加了各种功能。
The API of a SOAP web service is described using an XML-based language called the Web Services Description Language, or WSDL. WSDL enables code generation so that a client can access a remote service using local classes and method calls (which are encoded to XML messages and decoded again by the framework). This is useful in statically typed programming languages, but less so in dynamically typed ones (see “Code generation and dynamically typed languages” ).
SOAP网络服务的API是使用名为Web Services Description Language或WSDL的基于XML的语言来描述的。 WSDL使代码生成成为可能,以便客户端可以使用本地类和方法调用访问远程 服务(这些调用被编码为XML消息,并由框架进行解码)。 这对于静态类型的编程语言很有用,但在动态 类型的编程语言中不太有用(请参见“代码生成和动态类型的语言”)。
As WSDL is not designed to be human-readable, and as SOAP messages are often too complex to construct manually, users of SOAP rely heavily on tool support, code generation, and IDEs [ 38 ]. For users of programming languages that are not supported by SOAP vendors, integration with SOAP services is difficult.
由于 WSDL 不是以人类可读的方式设计的,而且 SOAP 消息经常太复杂而无法手工构造,因此使用 SOAP 的用户严重依赖工具支持、代码生成和 IDE [38]。对于不受 SOAP 供应商支持的编程语言用户来说,与 SOAP 服务集成是困难的。
Even though SOAP and its various extensions are ostensibly standardized, interoperability between different vendors’ implementations often causes problems [ 39 ]. For all of these reasons, although SOAP is still used in many large enterprises, it has fallen out of favor in most smaller companies.
尽管SOAP及其各种扩展显然是标准化的,但不同厂商实现间的互操作性经常会引起问题[39]。出于所有这些原因,虽然SOAP在许多大型企业中仍在使用,但在大多数较小的公司中已不再受欢迎。
RESTful APIs tend to favor simpler approaches, typically involving less code generation and automated tooling. A definition format such as OpenAPI, also known as Swagger [ 40 ], can be used to describe RESTful APIs and produce documentation.
RESTful APIs倾向于采用简单的方法,通常涉及较少的代码生成和自动化工具。定义格式,例如OpenAPI,也称为Swagger [40],可用于描述RESTful APIs并生成文档。
The problems with remote procedure calls (RPCs)
Web services are merely the latest incarnation of a long line of technologies for making API requests over a network, many of which received a lot of hype but have serious problems. Enterprise JavaBeans (EJB) and Java’s Remote Method Invocation (RMI) are limited to Java. The Distributed Component Object Model (DCOM) is limited to Microsoft platforms. The Common Object Request Broker Architecture (CORBA) is excessively complex, and does not provide backward or forward compatibility [ 41 ].
网络服务仅是一系列通过网络进行API请求的技术中最新的一种,其中许多技术受到过大量宣传,但存在严重问题。企业JavaBeans(EJB) 和Java的远程方法调用(RMI)仅限于Java。分布式组件对象模型(DCOM)仅限于微软平台。通用对象请求代理架构(CORBA)过于复杂,并且不提供向前或向后兼容性[41]。
All of these are based on the idea of a remote procedure call (RPC), which has been around since the 1970s [ 42 ]. The RPC model tries to make a request to a remote network service look the same as calling a function or method in your programming language, within the same process (this abstraction is called location transparency ). Although RPC seems convenient at first, the approach is fundamentally flawed [ 43 , 44 ]. A network request is very different from a local function call:
所有这些都是基于远程过程调用(RPC)的概念,自1970年代以来一直存在[42]。 RPC模型试图使对远程网络服务的请求看起来与在同一进程中调用函数或方法相同(这个抽象称为位置透明度)。尽管RPC乍一看似乎很方便,但这种方法在根本上存在缺陷[43,44]。 网络请求与本地函数调用非常不同:
-
A local function call is predictable and either succeeds or fails, depending only on parameters that are under your control. A network request is unpredictable: the request or response may be lost due to a network problem, or the remote machine may be slow or unavailable, and such problems are entirely outside of your control. Network problems are common, so you have to anticipate them, for example by retrying a failed request.
本地函数调用是可预测的,仅取决于您控制的参数的成功或失败。网络请求不可预测:请求或响应可能由于网络问题丢失,远程计算机可能缓慢或不可用,并且此类问题完全超出您的控制范围。网络问题很常见,因此您必须预先考虑它们,例如通过重试失败的请求。
-
A local function call either returns a result, or throws an exception, or never returns (because it goes into an infinite loop or the process crashes). A network request has another possible outcome: it may return without a result, due to a timeout . In that case, you simply don’t know what happened: if you don’t get a response from the remote service, you have no way of knowing whether the request got through or not. (We discuss this issue in more detail in Chapter 8 .)
本地函数调用可能会返回结果,也可能会抛出异常,或者永远不会返回(因为进入无限循环或进程崩溃)。网络请求还有另一种可能的结果:由于超时而返回而没有结果。在这种情况下,你只是不知道发生了什么:如果你没有从远程服务收到响应,你就没有办法知道请求是否已经通过了。 (我们将在第8章中更详细地讨论这个问题。)
-
If you retry a failed network request, it could happen that the requests are actually getting through, and only the responses are getting lost. In that case, retrying will cause the action to be performed multiple times, unless you build a mechanism for deduplication ( idempotence ) into the protocol. Local function calls don’t have this problem. (We discuss idempotence in more detail in Chapter 11 .)
如果重新尝试一个失败的网络请求,可能会发生请求实际上能够通过,只是响应丢失的情况。在这种情况下,重新尝试会导致动作被执行多次,除非您在协议中构建重复消除(幂等性)机制。本地函数调用不会有这个问题。(我们将在第11章中更详细地讨论幂等性。)
-
Every time you call a local function, it normally takes about the same time to execute. A network request is much slower than a function call, and its latency is also wildly variable: at good times it may complete in less than a millisecond, but when the network is congested or the remote service is overloaded it may take many seconds to do exactly the same thing.
每次调用本地函数,它通常需要大致相同的时间来执行。网络请求比函数调用慢得多,而且延迟也极其不稳定:在良好的情况下,它可能在不到一毫秒的时间内完成,但当网络拥塞或远程服务过载时,完成相同的操作可能需要数秒钟的时间。
-
When you call a local function, you can efficiently pass it references (pointers) to objects in local memory. When you make a network request, all those parameters need to be encoded into a sequence of bytes that can be sent over the network. That’s okay if the parameters are primitives like numbers or strings, but quickly becomes problematic with larger objects.
当您调用本地函数时,您可以有效地将本地内存中的对象的引用(指针)传递给它。当您发出网络请求时,所有这些参数都需要被编码为可以发送到网络的字节序列。如果参数是原始类型如数字或字符串,那是没问题的,但是如果是大型对象,这会很快变得棘手。
-
The client and the service may be implemented in different programming languages, so the RPC framework must translate datatypes from one language into another. This can end up ugly, since not all languages have the same types—recall JavaScript’s problems with numbers greater than 2^53, for example (see “JSON, XML, and Binary Variants” ). This problem doesn’t exist in a single process written in a single language.
客户端和服务可能是用不同的编程语言实现的,因此RPC框架必须将数据类型从一种语言转换为另一种语言。这可能会非常麻烦,因为并非所有语言都拥有相同的类型,例如JavaScript对大于2^53的数字的处理问题(请参见“JSON、XML和二进制变体”)。在用单一语言编写的单个进程中不存在这个问题。
All of these factors mean that there’s no point trying to make a remote service look too much like a local object in your programming language, because it’s a fundamentally different thing. Part of the appeal of REST is that it doesn’t try to hide the fact that it’s a network protocol (although this doesn’t seem to stop people from building RPC libraries on top of REST).
所有这些因素意味着,在编程语言中没有必要试图让远程服务看起来过于像本地对象,因为它是完全不同的东西。 REST 的吸引力之一在于它不试图隐藏它是一个网络协议的事实(尽管这似乎并不能阻止人们在 REST 之上构建 RPC 库)。
Current directions for RPC
Despite all these problems, RPC isn’t going away. Various RPC frameworks have been built on top of all the encodings mentioned in this chapter: for example, Thrift and Avro come with RPC support included, gRPC is an RPC implementation using Protocol Buffers, Finagle also uses Thrift, and Rest.li uses JSON over HTTP.
尽管存在这些问题,RPC并不会消失。各种RPC框架已经基于本章提到的所有编码方式构建:例如,Thrift和Avro自带RPC支持,gRPC是使用Protocol Buffers的RPC实现,Finagle也使用Thrift,而Rest.li使用JSON over HTTP。
This new generation of RPC frameworks is more explicit about the fact that a remote request is different from a local function call. For example, Finagle and Rest.li use futures ( promises ) to encapsulate asynchronous actions that may fail. Futures also simplify situations where you need to make requests to multiple services in parallel, and combine their results [ 45 ]. gRPC supports streams , where a call consists of not just one request and one response, but a series of requests and responses over time [ 46 ].
这一新一代的RPC框架更明确地表明远程请求与本地函数调用是不同的。例如,Finagle和Rest.li使用futures(承诺)来封装可能失败的异步操作。Futures还简化了需要并行向多个服务发出请求并合并其结果的情况[45]。 gRPC支持流式传输,在这种情况下,一个调用不仅仅包括一个请求和一个响应,还包括随时间的一系列请求和响应[46]。
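As a rough illustration of how futures simplify parallel requests to multiple services, the following Python asyncio sketch fans out two calls and combines their results. `fetch_user` and `fetch_orders` are hypothetical stand-ins for remote services, not the API of Finagle, Rest.li, or gRPC:

```python
import asyncio

# Stand-ins for two remote services; in a real RPC framework each call
# would go over the network and could fail or time out.
async def fetch_user(user_id):
    await asyncio.sleep(0.01)          # simulated network latency
    return {"id": user_id, "name": "alice"}

async def fetch_orders(user_id):
    await asyncio.sleep(0.01)
    return [{"order": 1}, {"order": 2}]

async def profile_page(user_id):
    # Issue both requests in parallel; gather() waits until both
    # futures have resolved, then we combine the results.
    user, orders = await asyncio.gather(fetch_user(user_id),
                                        fetch_orders(user_id))
    return {"name": user["name"], "order_count": len(orders)}

result = asyncio.run(profile_page(42))
```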
Some of these frameworks also provide service discovery —that is, allowing a client to find out at which IP address and port number it can find a particular service. We will return to this topic in “Request Routing” .
其中一些框架还提供服务发现功能,即允许客户端查找特定服务的 IP 地址和端口号。我们在“请求路由”中会再次探讨这个话题。
Custom RPC protocols with a binary encoding format can achieve better performance than something generic like JSON over REST. However, a RESTful API has other significant advantages: it is good for experimentation and debugging (you can simply make requests to it using a web browser or the command-line tool curl, without any code generation or software installation), it is supported by all mainstream programming languages and platforms, and there is a vast ecosystem of tools available (servers, caches, load balancers, proxies, firewalls, monitoring, debugging tools, testing tools, etc.).
采用二进制编码格式的自定义RPC协议可以比通用的JSON over REST实现更好的性能。然而,RESTful API有其他重要的优势:它适用于实验和调试(您可以使用Web浏览器或命令行工具curl轻松地对其进行请求,无需任何代码生成或软件安装),它被所有主流编程语言和平台支持,并且有大量的工具生态系统可用(服务器、缓存、负载平衡器、代理、防火墙、监控、调试工具、测试工具等等)。
For these reasons, REST seems to be the predominant style for public APIs. The main focus of RPC frameworks is on requests between services owned by the same organization, typically within the same datacenter.
因此,REST似乎是公共API的主要风格。 RPC框架的主要重点是同一组织拥有的服务之间的请求,通常位于相同的数据中心。
Data encoding and evolution for RPC
For evolvability, it is important that RPC clients and servers can be changed and deployed independently. Compared to data flowing through databases (as described in the last section), we can make a simplifying assumption in the case of dataflow through services: it is reasonable to assume that all the servers will be updated first, and all the clients second. Thus, you only need backward compatibility on requests, and forward compatibility on responses.
对于可演化性而言,重要的是RPC客户端和服务器可以独立地更改和部署。与通过数据库流动的数据相比(如上一节所述),对于通过服务的数据流,我们可以做一个简化的假设:合理地假设所有服务器会先更新,然后才更新所有客户端。因此,您只需要请求的向后兼容性和响应的向前兼容性。
The backward and forward compatibility properties of an RPC scheme are inherited from whatever encoding it uses:
RPC方案的向后和向前兼容性属性是从其使用的编码继承来的。
-
Thrift, gRPC (Protocol Buffers), and Avro RPC can be evolved according to the compatibility rules of the respective encoding format.
Thrift、gRPC(Protocol Buffers)和Avro RPC都可以根据各自编码格式的兼容性规则进行升级。
-
In SOAP, requests and responses are specified with XML schemas. These can be evolved, but there are some subtle pitfalls [ 47 ].
在SOAP中,请求和响应是使用XML模式指定的。它们可以演变,但有一些微妙的陷阱[47]。
-
RESTful APIs most commonly use JSON (without a formally specified schema) for responses, and JSON or URI-encoded/form-encoded request parameters for requests. Adding optional request parameters and adding new fields to response objects are usually considered changes that maintain compatibility.
RESTful API 最常用 JSON(没有正式指定的模式)作为响应,JSON 或 URI 编码/表单编码请求参数作为请求。添加可选请求参数和向响应对象添加新字段通常被认为是保持兼容性的变化。
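A minimal sketch of why adding a response field usually preserves compatibility, assuming a hypothetical old client that simply extracts the fields it knows and ignores the rest:

```python
import json

# A response from a NEW server version, which has added a "nickname" field:
new_response = json.loads('{"id": 123, "name": "alice", "nickname": "al"}')

# An OLD client knows only "id" and "name". As long as it picks out the
# fields it understands and ignores anything unrecognized, the new
# response remains forward compatible with old client code.
def old_client_parse(doc):
    return {"id": doc["id"], "name": doc["name"]}

parsed = old_client_parse(new_response)
```

The same reasoning applies in reverse for optional request parameters: a server that treats a missing parameter as a default stays backward compatible with old clients.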
Service compatibility is made harder by the fact that RPC is often used for communication across organizational boundaries, so the provider of a service often has no control over its clients and cannot force them to upgrade. Thus, compatibility needs to be maintained for a long time, perhaps indefinitely. If a compatibility-breaking change is required, the service provider often ends up maintaining multiple versions of the service API side by side.
服务兼容性之所以更加困难,是因为RPC经常用于跨组织边界的通信,服务提供者通常无法控制其客户端,也无法强迫它们升级。因此,兼容性需要长期保持,甚至可能无限期保持。如果需要进行破坏兼容性的更改,服务提供者通常最终会并排维护多个版本的服务API。
There is no agreement on how API versioning should work (i.e., how a client can indicate which version of the API it wants to use [48]). For RESTful APIs, common approaches are to use a version number in the URL or in the HTTP Accept header. For services that use API keys to identify a particular client, another option is to store a client’s requested API version on the server and to allow this version selection to be updated through a separate administrative interface [49].
关于API版本控制应该如何工作(即客户端如何指明它想使用哪个版本的API[48]),目前没有一致的做法。对于RESTful API,常见的方法是在URL或HTTP Accept头中使用版本号。对于使用API密钥标识特定客户端的服务,另一种选择是将客户端请求的API版本存储在服务器上,并允许通过单独的管理接口更新这一版本选择[49]。
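One common convention (not mandated by any standard) is to embed the version in a vendor media type in the Accept header. The sketch below assumes a hypothetical `vnd.example` media type and a simple fallback to a default version:

```python
import re

def negotiate_version(headers, default=1):
    """Extract an API version from a vendor media type in the Accept
    header, e.g. 'application/vnd.example.v2+json'.
    Falls back to a default version if no version is specified."""
    accept = headers.get("Accept", "")
    match = re.search(r"vnd\.example\.v(\d+)\+json", accept)
    return int(match.group(1)) if match else default
```

A URL-based scheme (e.g. a `/v2/` path prefix) works the same way in spirit: the server routes the request based on an explicit version marker supplied by the client.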
Message-Passing Dataflow
We have been looking at the different ways encoded data flows from one process to another. So far, we’ve discussed REST and RPC (where one process sends a request over the network to another process and expects a response as quickly as possible), and databases (where one process writes encoded data, and another process reads it again sometime in the future).
我们一直在研究编码数据从一个进程流向另一个进程的不同方式。到目前为止,我们已经讨论了REST和RPC(其中一个进程通过网络发送请求到另一个进程,并期望尽快得到响应),以及数据库(其中一个进程编写编码数据,而另一个进程在将来的某个时候再次读取它)。
In this final section, we will briefly look at asynchronous message-passing systems, which are somewhere between RPC and databases. They are similar to RPC in that a client’s request (usually called a message ) is delivered to another process with low latency. They are similar to databases in that the message is not sent via a direct network connection, but goes via an intermediary called a message broker (also called a message queue or message-oriented middleware ), which stores the message temporarily.
在这最后一节中,我们将简要介绍异步消息传递系统,它们介于RPC和数据库之间。它们与RPC相似之处在于,客户端的请求(通常称为消息)以低延迟传递给另一个进程。它们与数据库相似之处在于,消息不是通过直接的网络连接发送,而是经由一个称为消息代理(也称为消息队列或面向消息的中间件)的中介,该中介会临时存储消息。
Using a message broker has several advantages compared to direct RPC:
使用消息代理与直接RPC相比具有以下几个优点:
-
It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system reliability.
如果接收者不可用或超负荷,它可以起到缓冲作用,从而提高系统的可靠性。
-
It can automatically redeliver messages to a process that has crashed, and thus prevent messages from being lost.
它可以自动将消息重新发送给已崩溃的进程,从而防止消息丢失。
-
It avoids the sender needing to know the IP address and port number of the recipient (which is particularly useful in a cloud deployment where virtual machines often come and go).
这避免了发送方需要知道接收方的IP地址和端口号(这在云部署中尤其有用,因为虚拟机经常会出现变动)。
-
It allows one message to be sent to several recipients.
它可以将一条消息发送给多个收件人。
-
It logically decouples the sender from the recipient (the sender just publishes messages and doesn’t care who consumes them).
它逻辑上将发送者与接收者分离开来(发送者只发布消息,不关心谁消费它们)。
However, a difference compared to RPC is that message-passing communication is usually one-way: a sender normally doesn’t expect to receive a reply to its messages. It is possible for a process to send a response, but this would usually be done on a separate channel. This communication pattern is asynchronous : the sender doesn’t wait for the message to be delivered, but simply sends it and then forgets about it.
然而,与RPC相比的一个区别是,消息传递通信通常是单向的:发送方通常不希望接收其消息的回复。进程可能会发送响应,但通常会在单独的通道上进行。这种通信模式是异步的:发送方不等待消息被传送,而只是发送它并忘记它。
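The fire-and-forget pattern can be sketched with an in-process queue standing in for the broker. The sender publishes and moves on without waiting for a reply; the `None` sentinel is just a local shutdown convention for this sketch, not a broker feature:

```python
import queue
import threading

topic = queue.Queue()   # stands in for a message broker's queue or topic
delivered = []

def consumer():
    # A subscriber that processes messages as the broker delivers them.
    while True:
        message = topic.get()
        if message is None:      # sentinel used here to shut down cleanly
            break
        delivered.append(message)

worker = threading.Thread(target=consumer)
worker.start()

# The sender just publishes and moves on; it neither waits for the
# message to be processed nor expects a reply on this channel.
topic.put({"event": "signup", "user": "alice"})
topic.put(None)
worker.join()
```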
Message brokers
In the past, the landscape of message brokers was dominated by commercial enterprise software from companies such as TIBCO, IBM WebSphere, and webMethods. More recently, open source implementations such as RabbitMQ, ActiveMQ, HornetQ, NATS, and Apache Kafka have become popular. We will compare them in more detail in Chapter 11 .
过去,消息代理的市场被TIBCO、IBM WebSphere和webMethods等公司的商业企业软件所主导。近年来,像RabbitMQ、ActiveMQ、HornetQ、NATS和Apache Kafka这样的开源实现变得流行起来。我们将在第11章中对它们进行更详细的比较。
The detailed delivery semantics vary by implementation and configuration, but in general, message brokers are used as follows: one process sends a message to a named queue or topic , and the broker ensures that the message is delivered to one or more consumers of or subscribers to that queue or topic. There can be many producers and many consumers on the same topic.
具体的传递语义因实现和配置而异,但通常情况下,消息代理如下使用:一个进程向命名队列或主题发送一条消息,代理确保该消息传递给一个或多个订阅该队列或主题的消费者。同一个主题上可以有多个生产者和多个消费者。
A topic provides only one-way dataflow. However, a consumer may itself publish messages to another topic (so you can chain them together, as we shall see in Chapter 11 ), or to a reply queue that is consumed by the sender of the original message (allowing a request/response dataflow, similar to RPC).
一个主题只提供单向数据流。然而,消费者本身可以发布消息到另一个主题(因此可以将它们链接在一起,如第11章所示),或者发布到由原始消息发送者消费的回复队列(允许请求/响应数据流,类似于RPC)。
Message brokers typically don’t enforce any particular data model—a message is just a sequence of bytes with some metadata, so you can use any encoding format. If the encoding is backward and forward compatible, you have the greatest flexibility to change publishers and consumers independently and deploy them in any order.
消息代理通常不强制任何特定的数据模型:消息只是带有一些元数据的字节序列,因此您可以使用任何编码格式。如果编码是向后和向前兼容的,您就拥有最大的灵活性,可以独立地更改发布者和消费者,并以任意顺序部署它们。
If a consumer republishes messages to another topic, you may need to be careful to preserve unknown fields, to prevent the issue described previously in the context of databases ( Figure 4-7 ).
如果消费者将消息重新发布到另一个主题,则需要小心保留未知字段,以防止在数据库上下文中描述的问题(图4-7)。
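A consumer that preserves unknown fields might look like the following sketch, where the incoming JSON document and its `priority` field are hypothetical. The trick is to re-encode the whole decoded document, not just the fields this consumer knows about:

```python
import json

def mark_done(raw_message):
    # Decode, modify only the field this consumer understands, and
    # re-encode the WHOLE document, so that fields added by newer
    # producers survive the round trip through this consumer.
    doc = json.loads(raw_message)
    doc["status"] = "done"
    return json.dumps(doc)

incoming = '{"id": 1, "status": "new", "priority": "high"}'  # "priority" is unknown here
outgoing = json.loads(mark_done(incoming))
```

A consumer that instead constructed a fresh object with only the fields it knows would silently drop `priority`, which is exactly the failure mode described for databases in Figure 4-7.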
Distributed actor frameworks
The actor model is a programming model for concurrency in a single process. Rather than dealing directly with threads (and the associated problems of race conditions, locking, and deadlock), logic is encapsulated in actors . Each actor typically represents one client or entity, it may have some local state (which is not shared with any other actor), and it communicates with other actors by sending and receiving asynchronous messages. Message delivery is not guaranteed: in certain error scenarios, messages will be lost. Since each actor processes only one message at a time, it doesn’t need to worry about threads, and each actor can be scheduled independently by the framework.
Actor模型是用于单个进程内并发的编程模型。逻辑被封装在actor中,而不是直接处理线程(以及竞态条件、锁和死锁等相关问题)。每个actor通常代表一个客户端或实体,它可能有一些本地状态(不与任何其他actor共享),并通过发送和接收异步消息与其他actor通信。消息传递不保证送达:在某些错误情况下,消息会丢失。由于每个actor一次只处理一条消息,因此不需要担心线程问题,每个actor都可以由框架独立调度。
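The single-process actor model can be sketched in a few lines of Python, with a mailbox queue and one thread per actor; this is an illustration of the idea, not how Akka, Orleans, or Erlang implement it:

```python
import queue
import threading

class CounterActor:
    """Minimal actor: private local state, a mailbox, and a loop that
    processes one message at a time, so no locks are needed."""
    def __init__(self):
        self.count = 0                 # local state, never shared directly
        self.mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        while True:
            msg = self.mailbox.get()   # one message at a time
            if msg == "stop":
                break
            if msg == "increment":
                self.count += 1        # safe: only this thread touches count

    def send(self, msg):
        self.mailbox.put(msg)          # asynchronous: the sender does not wait

    def join(self):
        self._thread.join()

actor = CounterActor()
for _ in range(3):
    actor.send("increment")
actor.send("stop")
actor.join()
```

Because messages are queued and processed sequentially, the counter needs no lock even though `send` is called from a different thread.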
In distributed actor frameworks , this programming model is used to scale an application across multiple nodes. The same message-passing mechanism is used, no matter whether the sender and recipient are on the same node or different nodes. If they are on different nodes, the message is transparently encoded into a byte sequence, sent over the network, and decoded on the other side.
在分布式 actor 框架中,该编程模型被用于跨多个节点扩展应用程序。无论发送方和接收方是在同一节点还是不同节点,都使用相同的消息传递机制。如果它们在不同的节点上,消息将被透明地编码为字节序列,通过网络发送,并在另一侧进行解码。
Location transparency works better in the actor model than in RPC, because the actor model already assumes that messages may be lost, even within a single process. Although latency over the network is likely higher than within the same process, there is less of a fundamental mismatch between local and remote communication when using the actor model.
在Actor模型中,位置透明度比RPC更好,因为Actor模型已经假定消息可能会丢失,即使在单个进程中也是如此。虽然网络延迟可能比同一进程内的延迟要高,但使用Actor模型时,本地和远程通信之间的基本不匹配要少得多。
A distributed actor framework essentially integrates a message broker and the actor programming model into a single framework. However, if you want to perform rolling upgrades of your actor-based application, you still have to worry about forward and backward compatibility, as messages may be sent from a node running the new version to a node running the old version, and vice versa.
分布式演员框架基本上将消息代理和演员编程模型集成到单个框架中。然而,如果您想要执行基于演员的应用程序的滚动升级,则仍然需要考虑前向和后向兼容性,因为消息可能会从运行新版本的节点发送到运行旧版本的节点,反之亦然。
Three popular distributed actor frameworks handle message encoding as follows:
三个流行的分布式 actor 框架的信息编码方式如下:
-
Akka uses Java’s built-in serialization by default, which does not provide forward or backward compatibility. However, you can replace it with something like Protocol Buffers, and thus gain the ability to do rolling upgrades [ 50 ].
Akka默认使用Java内置的序列化,但是它没有提供向前或向后兼容性。不过,您可以使用类似Protocol Buffers的东西替换它,从而获得滚动升级的能力[50].
-
Orleans by default uses a custom data encoding format that does not support rolling upgrade deployments; to deploy a new version of your application, you need to set up a new cluster, move traffic from the old cluster to the new one, and shut down the old one [ 51 , 52 ]. Like with Akka, custom serialization plug-ins can be used.
Orleans默认使用自定义数据编码格式,不支持滚动升级部署;要部署应用程序的新版本,需要设置一个新的集群,将流量从旧集群转移到新集群,然后关闭旧集群[51, 52]。与Akka一样,可以使用自定义序列化插件。
-
In Erlang OTP it is surprisingly hard to make changes to record schemas (despite the system having many features designed for high availability); rolling upgrades are possible but need to be planned carefully [53]. An experimental new maps datatype (a JSON-like structure, introduced in Erlang R17 in 2014) may make this easier in the future [54].
在Erlang OTP中,尽管该系统有许多为高可用性而设计的功能,但更改记录模式仍然出人意料地困难;滚动升级是可能的,但需要仔细规划[53]。一个实验性的新maps数据类型(类似JSON的结构,于2014年在Erlang R17中引入)可能会在未来使这一过程更容易[54]。
Summary
In this chapter we looked at several ways of turning data structures into bytes on the network or bytes on disk. We saw how the details of these encodings affect not only their efficiency, but more importantly also the architecture of applications and your options for deploying them.
在本章中,我们探讨了将数据结构转换为网络上的字节或磁盘上的字节的几种方法。我们看到了这些编码的细节不仅影响它们的效率,而且更重要的是对应用程序的架构和部署选项产生影响。
In particular, many services need to support rolling upgrades, where a new version of a service is gradually deployed to a few nodes at a time, rather than deploying to all nodes simultaneously. Rolling upgrades allow new versions of a service to be released without downtime (thus encouraging frequent small releases over rare big releases) and make deployments less risky (allowing faulty releases to be detected and rolled back before they affect a large number of users). These properties are hugely beneficial for evolvability , the ease of making changes to an application.
特别是,许多服务需要支持滚动升级,即每次将新版本的服务逐步部署到少数节点,而不是同时部署到所有节点。滚动升级允许在不停机的情况下发布服务的新版本(从而鼓励频繁的小版本发布,而非罕见的大版本发布),并降低部署风险(允许在有问题的版本影响大量用户之前检测到并回滚)。这些特性对可演化性(即轻松修改应用程序的能力)大有裨益。
During rolling upgrades, or for various other reasons, we must assume that different nodes are running different versions of our application’s code. Thus, it is important that all data flowing around the system is encoded in a way that provides backward compatibility (new code can read old data) and forward compatibility (old code can read new data).
在滚动升级期间或出于其他原因,我们必须假设不同的节点运行着我们应用程序代码的不同版本。因此,重要的是所有在系统中流动的数据都以一种提供向后兼容性(新代码可以读取旧数据)和向前兼容性(旧代码可以读取新数据)的方式进行编码。
We discussed several data encoding formats and their compatibility properties:
我们讨论了几种数据编码格式及其兼容性属性:
-
Programming language–specific encodings are restricted to a single programming language and often fail to provide forward and backward compatibility.
编程语言特定编码只限于一个编程语言,经常不能提供向前和向后的兼容性。
-
Textual formats like JSON, XML, and CSV are widespread, and their compatibility depends on how you use them. They have optional schema languages, which are sometimes helpful and sometimes a hindrance. These formats are somewhat vague about datatypes, so you have to be careful with things like numbers and binary strings.
文本格式,如JSON、XML和CSV,应用广泛,其兼容性取决于你如何使用它们。它们有可选的模式语言,这些语言有时有帮助,有时是一种阻碍。这些格式对数据类型的规定有些模糊,因此你必须小心处理数字和二进制字符串之类的东西。
-
Binary schema–driven formats like Thrift, Protocol Buffers, and Avro allow compact, efficient encoding with clearly defined forward and backward compatibility semantics. The schemas can be useful for documentation and code generation in statically typed languages. However, they have the downside that data needs to be decoded before it is human-readable.
二进制schema驱动格式(例如Thrift,Protocol Buffers和Avro)允许紧凑、高效的编码,并具有明确定义的向前和向后兼容性语义。在静态类型语言中,这些模式可以用于文档和代码生成。但是,它们的缺点是需要在数据变得可读之前进行解码。
We also discussed several modes of dataflow, illustrating different scenarios in which data encodings are important:
我们还讨论了几种数据流模式,并说明了数据编码在不同场景下的重要性。
-
Databases, where the process writing to the database encodes the data and the process reading from the database decodes it
数据库,写入数据库的过程对数据进行编码,读取数据库的过程对其进行解码。
-
RPC and REST APIs, where the client encodes a request, the server decodes the request and encodes a response, and the client finally decodes the response
RPC和REST API,客户端编码一个请求,服务器解码请求并编码响应,客户端最后解码响应。
-
Asynchronous message passing (using message brokers or actors), where nodes communicate by sending each other messages that are encoded by the sender and decoded by the recipient
异步消息传递(使用消息代理或actor),其中节点之间通过互相发送消息进行通信,消息由发送者编码,由接收者解码。
We can conclude that with a bit of care, backward/forward compatibility and rolling upgrades are quite achievable. May your application’s evolution be rapid and your deployments be frequent.
我们可以得出结论,借助一些小心谨慎,实现向前/向后兼容和滚动升级是完全可行的。愿你的应用能够快速迭代,部署频繁。
Footnotes
i With the exception of some special cases, such as certain memory-mapped files or when operating directly on compressed data (as described in “Column Compression” ).
除了某些特殊情况,例如某些内存映射文件或在直接操作压缩数据时(如“列压缩”中所述)。
ii Note that encoding has nothing to do with encryption . We don’t discuss encryption in this book.
请注意编码与加密无关。本书不讨论加密。
iii Actually, it has three—BinaryProtocol, CompactProtocol, and DenseProtocol—although DenseProtocol is only supported by the C++ implementation, so it doesn’t count as cross-language [ 18 ]. Besides those, it also has two different JSON-based encoding formats [ 19 ]. What fun!
其实,Thrift协议有三种--BinaryProtocol、CompactProtocol和DenseProtocol,尽管DenseProtocol只被C++实现所支持,因此不能算作跨语言。此外,它还有两种不同的基于JSON的编码格式。多么有趣!
iv To be precise, the default value must be of the type of the first branch of the union, although this is a specific limitation of Avro, not a general feature of union types.
具体来说,默认值必须是联合类型中第一个分支的类型,虽然这是Avro的一个具体限制,而不是联合类型的一般特征。
v Except for MySQL, which often rewrites an entire table even though it is not strictly necessary, as mentioned in “Schema flexibility in the document model” .
除了MySQL,如“文档模型中的架构灵活性”中提到的那样,即使没有严格必要,它经常重写整个表。
vi Even within each camp there are plenty of arguments. For example, HATEOAS ( hypermedia as the engine of application state ), often provokes discussions [ 35 ].
甚至在每个阵营内部也有很多争论。例如,HATEOAS(将超媒体作为应用程序状态的引擎)经常引发讨论。
vii Despite the similarity of acronyms, SOAP is not a requirement for SOA. SOAP is a particular technology, whereas SOA is a general approach to building systems.
虽然这些首字母缩写很相似,但SOAP不是SOA的要求。SOAP只是一种特定技术,而SOA是一种通用的系统构建方法。
References
[ 1 ] “ Java Object Serialization Specification ,” docs.oracle.com , 2010.
[1] “Java对象序列化规范”,docs.oracle.com,2010。
[ 2 ] “ Ruby 2.2.0 API Documentation ,” ruby-doc.org , Dec 2014.
[2] “Ruby 2.2.0 API文档”,ruby-doc.org,2014年12月。
[ 3 ] “ The Python 3.4.3 Standard Library Reference Manual ,” docs.python.org , February 2015.
[3] “Python 3.4.3标准库参考手册”,docs.python.org,2015年2月。
[ 4 ] “ EsotericSoftware/kryo ,” github.com , October 2014.
[4] “EsotericSoftware/kryo”,github.com,2014年10月。
[ 5 ] “ CWE-502: Deserialization of Untrusted Data ,” Common Weakness Enumeration, cwe.mitre.org , July 30, 2014.
[5] “CWE-502:不受信任数据的反序列化”,Common Weakness Enumeration,cwe.mitre.org,2014年7月30日。
[ 6 ] Steve Breen: “ What Do WebLogic, WebSphere, JBoss, Jenkins, OpenNMS, and Your Application Have in Common? This Vulnerability ,” foxglovesecurity.com , November 6, 2015.
[6] Steve Breen:“WebLogic、WebSphere、JBoss、Jenkins、OpenNMS和你的应用程序有什么共同点?这个漏洞”,foxglovesecurity.com,2015年11月6日。
[ 7 ] Patrick McKenzie: “ What the Rails Security Issue Means for Your Startup ,” kalzumeus.com , January 31, 2013.
[7] Patrick McKenzie:“Rails 安全问题对你的创业公司意味着什么”,kalzumeus.com,2013 年 1 月 31 日。
[ 8 ] Eishay Smith: “ jvm-serializers wiki ,” github.com , November 2014.
[8] Eishay Smith:“jvm-serializers维基”,github.com,2014年11月。
[ 9 ] “ XML Is a Poor Copy of S-Expressions ,” c2.com wiki.
[9] “XML是S-表达式的劣质复制品”,c2.com维基。
[ 10 ] Matt Harris: “ Snowflake: An Update and Some Very Important Information ,” email to Twitter Development Talk mailing list, October 19, 2010.
[10] 马特·哈里斯: “雪花: 更新和一些非常重要的信息”,电子邮件发送给Twitter开发者讨论组,2010年10月19日。
[ 11 ] Shudi (Sandy) Gao, C. M. Sperberg-McQueen, and Henry S. Thompson: “ XML Schema 1.1 ,” W3C Recommendation, May 2001.
[11] 高舒迪(桑迪)Gao、C.M.斯珀伯格-麦奎因和Henry S.汤普森:“XML Schema 1.1”,W3C建议,2001年5月。
[ 12 ] Francis Galiegue, Kris Zyp, and Gary Court: “ JSON Schema ,” IETF Internet-Draft, February 2013.
[12] Francis Galiegue,Kris Zyp和Gary Court:“JSON模式”,IETF网络草案,2013年2月。
[ 13 ] Yakov Shafranovich: “ RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files ,” October 2005.
[13] Yakov Shafranovich:“RFC 4180:逗号分隔值(CSV)文件的通用格式和MIME类型”,2005年10月。
[ 14 ] “ MessagePack Specification ,” msgpack.org .
[14] “MessagePack规范”,msgpack.org。
[ 15 ] Mark Slee, Aditya Agarwal, and Marc Kwiatkowski: “ Thrift: Scalable Cross-Language Services Implementation ,” Facebook technical report, April 2007.
[15] Mark Slee, Aditya Agarwal和Marc Kwiatkowski:“Thrift: 可扩展的跨语言服务实现”,Facebook技术报告,2007年4月。
[ 16 ] “ Protocol Buffers Developer Guide ,” Google, Inc., developers.google.com .
[16] “Protocol Buffers开发者指南”,Google公司,developers.google.com。
[ 17 ] Igor Anishchenko: “ Thrift vs Protocol Buffers vs Avro - Biased Comparison ,” slideshare.net , September 17, 2012.
[17] Igor Anishchenko:“Thrift vs Protocol Buffers vs Avro - 有偏见的比较”,slideshare.net,2012年9月17日。
[ 18 ] “ A Matrix of the Features Each Individual Language Library Supports ,” wiki.apache.org .
[18] “各个语言库所支持特性的矩阵”,wiki.apache.org。
[ 19 ] Martin Kleppmann: “ Schema Evolution in Avro, Protocol Buffers and Thrift ,” martin.kleppmann.com , December 5, 2012.
"[19] Martin Kleppmann: “Avro、Protocol Buffers 和 Thrift中的模式演化”,martin.kleppmann.com,2012年12月5日。"
[ 20 ] “ Apache Avro 1.7.7 Documentation ,” avro.apache.org , July 2014.
[20] “Apache Avro 1.7.7文档”,avro.apache.org,2014年7月。
[ 21 ] Doug Cutting, Chad Walters, Jim Kellerman, et al.: “ [PROPOSAL] New Subproject: Avro ,” email thread on hadoop-general mailing list, mail-archives.apache.org , April 2009.
[21] Doug Cutting、Chad Walters、Jim Kellerman等:“[提案]新子项目:Avro”,hadoop-general邮件列表上的电子邮件线程,mail-archives.apache.org,2009年4月。
[ 22 ] Tony Hoare: “ Null References: The Billion Dollar Mistake ,” at QCon London , March 2009.
[22] 托尼·霍尔:“空引用:十亿美元的错误”,于2009年3月在伦敦QCon上发表。
[ 23 ] Aditya Auradkar and Tom Quiggle: “ Introducing Espresso—LinkedIn’s Hot New Distributed Document Store ,” engineering.linkedin.com , January 21, 2015.
[23] Aditya Auradkar 和 Tom Quiggle: “介绍 Espresso - LinkedIn 最新的分布式文档数据库”,engineering.linkedin.com,2015年1月21日。
[ 24 ] Jay Kreps: “ Putting Apache Kafka to Use: A Practical Guide to Building a Stream Data Platform (Part 2) ,” blog.confluent.io , February 25, 2015.
[24] Jay Kreps:“使用Apache Kafka:构建流数据平台实践指南(第2部分)”,blog.confluent.io,2015年2月25日。
[ 25 ] Gwen Shapira: “ The Problem of Managing Schemas ,” radar.oreilly.com , November 4, 2014.
[25] Gwen Shapira: “管理模式的问题”,radar.oreilly.com,2014年11月4日。
[ 26 ] “ Apache Pig 0.14.0 Documentation ,” pig.apache.org , November 2014.
[26] “Apache Pig 0.14.0 文档”,pig.apache.org,2014年11月。
[ 27 ] John Larmouth: ASN.1 Complete . Morgan Kaufmann, 1999. ISBN: 978-0-122-33435-1
[27] 约翰·拉默斯: ASN.1 完整版。摩根·考夫曼,1999年。ISBN:978-0-122-33435-1。
[ 28 ] Russell Housley, Warwick Ford, Tim Polk, and David Solo: “ RFC 2459: Internet X.509 Public Key Infrastructure: Certificate and CRL Profile ,” IETF Network Working Group, Standards Track, January 1999.
[28] Russell Housley,Warwick Ford,Tim Polk和David Solo: “RFC 2459:Internet X.509公钥基础设施:证书和CRL配置文件”,IETF网络工作组,标准跟踪,1999年1月。
[ 29 ] Lev Walkin: “ Question: Extensibility and Dropping Fields ,” lionet.info , September 21, 2010.
[29] 列夫·沃尔金: “问题:可扩展性和删除字段,” lionet.info,2010年9月21日。
[ 30 ] Jesse James Garrett: “ Ajax: A New Approach to Web Applications ,” adaptivepath.com , February 18, 2005.
[30] Jesse James Garrett:“Ajax:一种新的Web应用程序方法”,adaptivepath.com,2005年2月18日。
[ 31 ] Sam Newman: Building Microservices . O’Reilly Media, 2015. ISBN: 978-1-491-95035-7
[31] Sam Newman: 构建微服务。O’Reilly Media,2015年。ISBN:978-1-491-95035-7。
[ 32 ] Chris Richardson: “ Microservices: Decomposing Applications for Deployability and Scalability ,” infoq.com , May 25, 2014.
[32] 克里斯·理查德森:「微服务:为了可部署性和可扩展性分解应用程序」,infoq.com,2014年5月25日。
[ 33 ] Pat Helland: “ Data on the Outside Versus Data on the Inside ,” at 2nd Biennial Conference on Innovative Data Systems Research (CIDR), January 2005.
[33] Pat Helland:“外部数据与内部数据”,第二届创新数据系统研究双年会(CIDR),2005年1月。
[ 34 ] Roy Thomas Fielding: “ Architectural Styles and the Design of Network-Based Software Architectures ,” PhD Thesis, University of California, Irvine, 2000.
[34] 罗伊·托马斯·菲尔丁: “体系结构风格与基于网络的软件架构的设计”,加州大学欧文分校博士论文,2000年。
[ 35 ] Roy Thomas Fielding: “ REST APIs Must Be Hypertext-Driven ,” roy.gbiv.com , October 20 2008.
[35] Roy Thomas Fielding:“REST API必须由超文本驱动”,roy.gbiv.com,2008年10月20日。
[ 36 ] “ REST in Peace, SOAP ,” royal.pingdom.com , October 15, 2010.
[36] “安息吧,SOAP”,royal.pingdom.com,2010年10月15日。
[ 37 ] “ Web Services Standards as of Q1 2007 ,” innoq.com , February 2007.
[37] “2007年第一季度的Web服务标准”,innoq.com,2007年2月。
[ 38 ] Pete Lacey: “ The S Stands for Simple ,” harmful.cat-v.org , November 15, 2006.
[38] Pete Lacey:“S代表简单”,harmful.cat-v.org,2006年11月15日。
[ 39 ] Stefan Tilkov: “ Interview: Pete Lacey Criticizes Web Services ,” infoq.com , December 12, 2006.
[39] Stefan Tilkov:“访谈:Pete Lacey 批评 Web Services”,infoq.com,2006 年 12 月 12 日。
[ 40 ] “ OpenAPI Specification (fka Swagger RESTful API Documentation Specification) Version 2.0 ,” swagger.io , September 8, 2014.
[40] “OpenAPI规范(原名Swagger RESTful API文档规范)2.0版本”,swagger.io,2014年9月8日。
[ 41 ] Michi Henning: “ The Rise and Fall of CORBA ,” ACM Queue , volume 4, number 5, pages 28–34, June 2006. doi:10.1145/1142031.1142044
[41] 米奇·亨宁:“CORBA 的兴起与衰落”,ACM Queue,第4卷,第5期,28-34页,2006年6月。doi:10.1145/1142031.1142044。
[ 42 ] Andrew D. Birrell and Bruce Jay Nelson: “ Implementing Remote Procedure Calls ,” ACM Transactions on Computer Systems (TOCS), volume 2, number 1, pages 39–59, February 1984. doi:10.1145/2080.357392
[42] 安德鲁·D·比雷尔和布鲁斯·杰伊·尼尔森:“实现远程过程调用”,ACM计算机系统交易(TOCS),第2卷,第1号,页39-59,1984年2月。doi:10.1145/2080.357392。
[ 43 ] Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall: “ A Note on Distributed Computing ,” Sun Microsystems Laboratories, Inc., Technical Report TR-94-29, November 1994.
[43] Jim Waldo、Geoff Wyant、Ann Wollrath和Sam Kendall:“关于分布式计算的注释”,Sun Microsystems实验室,技术报告TR-94-29,1994年11月。
[ 44 ] Steve Vinoski: “ Convenience over Correctness ,” IEEE Internet Computing , volume 12, number 4, pages 89–92, July 2008. doi:10.1109/MIC.2008.75
[44] Steve Vinoski:“便利性胜于正确性”,《IEEE互联网计算》,第12卷,第4期,89–92页,2008年7月。doi:10.1109/MIC.2008.75
[ 45 ] Marius Eriksen: “ Your Server as a Function ,” at 7th Workshop on Programming Languages and Operating Systems (PLOS), November 2013. doi:10.1145/2525528.2525538
[45] Marius Eriksen:“作为函数的服务器”,于2013年11月第7届编程语言和操作系统研讨会(PLOS)上发表,doi:10.1145/2525528.2525538。
[ 46 ] “ grpc-common Documentation ,” Google, Inc., github.com , February 2015.
“grpc-common文档”,Google,Inc.,github.com,2015年2月。
[ 47 ] Aditya Narayan and Irina Singh: “ Designing and Versioning Compatible Web Services ,” ibm.com , March 28, 2007.
[47] Aditya Narayan和Irina Singh:“设计和版本化兼容的Web服务”,ibm.com,2007年3月28日。
[ 48 ] Troy Hunt: “ Your API Versioning Is Wrong, Which Is Why I Decided to Do It 3 Different Wrong Ways ,” troyhunt.com , February 10, 2014.
[48] Troy Hunt:“你的API版本控制是错误的,所以我决定用三种不同的错误方式来做”,troyhunt.com,2014年2月10日。
[ 49 ] “ API Upgrades ,” Stripe, Inc., April 2015.
[49] “API升级”,Stripe公司,2015年4月。
[ 50 ] Jonas Bonér: “ Upgrade in an Akka Cluster ,” email to akka-user mailing list, grokbase.com , August 28, 2013.
[50] Jonas Bonér:“在 Akka 集群中升级”,电子邮件发送至 akka-user 邮件列表,grokbase.com,2013 年 8 月 28 日。
[ 51 ] Philip A. Bernstein, Sergey Bykov, Alan Geller, et al.: “ Orleans: Distributed Virtual Actors for Programmability and Scalability ,” Microsoft Research Technical Report MSR-TR-2014-41, March 2014.
[51] Philip A. Bernstein、Sergey Bykov、Alan Geller等:“Orleans:用于可编程性和可扩展性的分布式虚拟Actor”,微软研究院技术报告MSR-TR-2014-41,2014年3月。
[ 52 ] “ Microsoft Project Orleans Documentation ,” Microsoft Research, dotnet.github.io , 2015.
[52] “Microsoft Project Orleans文档”,微软研究院,dotnet.github.io,2015年。
[ 53 ] David Mercer, Sean Hinde, Yinso Chen, and Richard A O’Keefe: “ beginner: Updating Data Structures ,” email thread on erlang-questions mailing list, erlang.com , October 29, 2007.
[53] 大卫·默瑟、肖恩·欣德、陈银锁和理查德·A·奥基夫:“初学者:更新数据结构”,电子邮件线程在erlang-questions邮件列表上,erlang.com,2007年10月29日。
[ 54 ] Fred Hebert: “ Postscript: Maps ,” learnyousomeerlang.com , April 9, 2014.
[54] Fred Hebert:“后记:Maps”,learnyousomeerlang.com,2014年4月9日。
Part II. Distributed Data
For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.
对于一项成功的技术,现实必须优先于公关,因为自然是无法欺骗的。
Richard Feynman, Rogers Commission Report (1986)
理查德·费曼,罗杰斯委员会报告(1986年)
In Part I of this book, we discussed aspects of data systems that apply when data is stored on a single machine. Now, in Part II , we move up a level and ask: what happens if multiple machines are involved in storage and retrieval of data?
在本书的第一部分中,我们讨论了数据存储在单台机器上时适用的数据系统的各个方面。现在,在第二部分中,我们更上一层楼,提出问题:如果多台机器参与数据的存储和检索,会发生什么?
There are various reasons why you might want to distribute a database across multiple machines:
将数据库分布在多台计算机上有多种原因:
- Scalability
-
If your data volume, read load, or write load grows bigger than a single machine can handle, you can potentially spread the load across multiple machines.
如果您的数据量、读取负载或写入负载增长到超过单台机器可以处理的范围,您可以在多台机器之间分担负载。
- Fault tolerance/high availability
-
If your application needs to continue working even if one machine (or several machines, or the network, or an entire datacenter) goes down, you can use multiple machines to give you redundancy. When one fails, another one can take over.
如果你的应用程序需要在一台机器(或几台机器、网络,甚至整个数据中心)宕机时仍能继续工作,你可以使用多台机器来提供冗余。当一台机器失效时,另一台可以接管。
- Latency
-
If you have users around the world, you might want to have servers at various locations worldwide so that each user can be served from a datacenter that is geographically close to them. That avoids the users having to wait for network packets to travel halfway around the world.
如果你有遍布全球的用户,你可能希望在世界各地设立服务器,使每个用户都能从地理位置上靠近他们的数据中心获得服务。这可以避免用户等待网络数据包跨越半个地球。
Scaling to Higher Load
If all you need is to scale to higher load, the simplest approach is to buy a more powerful machine (sometimes called vertical scaling or scaling up ). Many CPUs, many RAM chips, and many disks can be joined together under one operating system, and a fast interconnect allows any CPU to access any part of the memory or disk. In this kind of shared-memory architecture , all the components can be treated as a single machine [ 1 ]. i
如果你只是需要扩展到更高的负载,最简单的方法是购买一台更强大的机器(有时称为垂直扩展或向上扩展)。许多CPU、许多内存芯片和许多磁盘可以在一个操作系统下连接在一起,快速的互连允许任何CPU访问内存或磁盘的任何部分。在这种共享内存架构中,所有组件都可以被视为一台机器[1]。
The problem with a shared-memory approach is that the cost grows faster than linearly: a machine with twice as many CPUs, twice as much RAM, and twice as much disk capacity as another typically costs significantly more than twice as much. And due to bottlenecks, a machine twice the size cannot necessarily handle twice the load.
共享内存方法的问题在于成本增长快于线性:一台CPU数量、内存和磁盘容量都翻倍的机器,价格通常远不止翻倍。而且由于瓶颈的存在,规模翻倍的机器也不一定能处理两倍的负载。
A shared-memory architecture may offer limited fault tolerance—high-end machines have hot-swappable components (you can replace disks, memory modules, and even CPUs without shutting down the machines)—but it is definitely limited to a single geographic location.
共享内存架构可能具有有限的容错能力-高端机器具有热插拔组件(可以更换磁盘、内存模块,甚至CPU而不需要关闭机器)-但它肯定限于一个地理位置。
Another approach is the shared-disk architecture , which uses several machines with independent CPUs and RAM, but stores data on an array of disks that is shared between the machines, which are connected via a fast network. ii This architecture is used for some data warehousing workloads, but contention and the overhead of locking limit the scalability of the shared-disk approach [ 2 ].
另一种方法是共享磁盘架构,它使用多台具有独立CPU和RAM的机器,但将数据存储在由多台机器共享的磁盘阵列上,这些机器通过快速网络连接。ii这种架构用于一些数据仓库工作负载,但竞争和锁定的开销限制了共享磁盘方法的可扩展性[2]。
Shared-Nothing Architectures
By contrast, shared-nothing architectures [ 3 ] (sometimes called horizontal scaling or scaling out ) have gained a lot of popularity. In this approach, each machine or virtual machine running the database software is called a node . Each node uses its CPUs, RAM, and disks independently. Any coordination between nodes is done at the software level, using a conventional network.
相比之下,无共享架构[3](有时称为水平扩展或横向扩展)获得了很大的流行。在这种方法中,运行数据库软件的每台机器或虚拟机被称为节点。每个节点独立使用自己的CPU、RAM和磁盘。节点之间的任何协调都在软件层面、通过传统网络完成。
No special hardware is required by a shared-nothing system, so you can use whatever machines have the best price/performance ratio. You can potentially distribute data across multiple geographic regions, and thus reduce latency for users and potentially be able to survive the loss of an entire datacenter. With cloud deployments of virtual machines, you don’t need to be operating at Google scale: even for small companies, a multi-region distributed architecture is now feasible.
无共享系统不需要特殊硬件,因此你可以使用性价比最高的任何机器。你还可以将数据分布到多个地理区域,从而减少用户的延迟,并有可能在整个数据中心损失后幸存下来。通过虚拟机的云部署,你不需要达到谷歌的规模:即使对于小公司,多区域分布式架构现在也是可行的。
In this part of the book, we focus on shared-nothing architectures—not because they are necessarily the best choice for every use case, but rather because they require the most caution from you, the application developer. If your data is distributed across multiple nodes, you need to be aware of the constraints and trade-offs that occur in such a distributed system—the database cannot magically hide these from you.
在本书的这一部分,我们重点关注无共享架构——并不是因为它们对每种用例都一定是最佳选择,而是因为它们最需要你(应用程序开发人员)的谨慎。如果你的数据分布在多个节点上,你需要了解这样的分布式系统中存在的约束和权衡——数据库无法神奇地替你隐藏这些。
While a distributed shared-nothing architecture has many advantages, it usually also incurs additional complexity for applications and sometimes limits the expressiveness of the data models you can use. In some cases, a simple single-threaded program can perform significantly better than a cluster with over 100 CPU cores [ 4 ]. On the other hand, shared-nothing systems can be very powerful. The next few chapters go into details on the issues that arise when data is distributed.
虽然分布式无共享架构有许多优点,但它通常也会给应用程序带来额外的复杂性,有时还会限制你可以使用的数据模型的表现力。在某些情况下,一个简单的单线程程序可以比一个拥有超过100个CPU核心的集群表现得显著更好[4]。另一方面,无共享系统可以非常强大。接下来的几章将详细探讨数据分布时出现的问题。
Replication Versus Partitioning
There are two common ways data is distributed across multiple nodes:
数据分布在多个节点中的常见方式有两种:
- Replication
-
Keeping a copy of the same data on several different nodes, potentially in different locations. Replication provides redundancy: if some nodes are unavailable, the data can still be served from the remaining nodes. Replication can also help improve performance. We discuss replication in Chapter 5 .
将相同的数据副本存储在多个不同的节点上,可能位于不同的位置。复制提供了冗余:如果某些节点不可用,则仍然可以从剩余节点提供数据。复制还可以帮助提高性能。我们在第5章中讨论复制。
- Partitioning
-
Splitting a big database into smaller subsets called partitions so that different partitions can be assigned to different nodes (also known as sharding ). We discuss partitioning in Chapter 6 .
将一个大型数据库分割成较小的子集,称为分区,以便将不同的分区分配给不同的节点(也称为分片)。我们在第6章讨论分区。
These are separate mechanisms, but they often go hand in hand, as illustrated in Figure II-1 .
这些是两种独立的机制,但它们经常同时使用,如图II-1所示。
With an understanding of those concepts, we can discuss the difficult trade-offs that you need to make in a distributed system. We’ll discuss transactions in Chapter 7 , as that will help you understand all the many things that can go wrong in a data system, and what you can do about them. We’ll conclude this part of the book by discussing the fundamental limitations of distributed systems in Chapters 8 and 9 .
有了这些概念的理解,我们才能讨论分布式系统中需要做出的各种艰难抉择。我们将在第七章讨论事务,因为这将帮助您了解数据系统中可能出现的所有问题以及您可以采取的措施。我们将在第八章和第九章中讨论分布式系统的基本限制,以此结束本书的这一部分。
Later, in Part III of this book, we will discuss how you can take several (potentially distributed) datastores and integrate them into a larger system, satisfying the needs of a complex application. But first, let’s talk about distributed data.
接下来,在本书的第三部分,我们将讨论如何将多个(可能分布式的)数据存储集成到一个更大的系统中,以满足复杂应用的需要。但首先,让我们先谈谈分布式数据。
Footnotes
i In a large machine, although any CPU can access any part of memory, some banks of memory are closer to one CPU than to others (this is called nonuniform memory access , or NUMA [ 1 ]). To make efficient use of this architecture, the processing needs to be broken down so that each CPU mostly accesses memory that is nearby—which means that partitioning is still required, even when ostensibly running on one machine.
在一台大型机器中,尽管任何CPU都可以访问内存的任何部分,但某些内存区域与某个CPU的距离比其他CPU更近(这称为非均匀内存访问,即NUMA[1])。为了有效利用这种架构,处理过程需要被分解,使每个CPU主要访问附近的内存——这意味着即使表面上运行在一台机器上,分区仍然是必需的。
ii Network Attached Storage (NAS) or Storage Area Network (SAN).
网络附加存储(NAS)或存储区域网络(SAN)。
References
[ 1 ] Ulrich Drepper: “ What Every Programmer Should Know About Memory ,” akkadia.org , November 21, 2007.
[1] Ulrich Drepper: “程序员应该了解的有关内存的知识”,akkadia.org,2007年11月21日。
[ 2 ] Ben Stopford: “ Shared Nothing vs. Shared Disk Architectures: An Independent View ,” benstopford.com , November 24, 2009.
[2] Ben Stopford:“无共享架构与共享磁盘架构:一个独立的视角”,benstopford.com,2009年11月24日。
[ 3 ] Michael Stonebraker: “ The Case for Shared Nothing ,” IEEE Database Engineering Bulletin , volume 9, number 1, pages 4–9, March 1986.
[3] Michael Stonebraker:“无共享架构的理由”,《IEEE数据库工程通报》,第9卷,第1期,第4-9页,1986年3月。
[ 4 ] Frank McSherry, Michael Isard, and Derek G. Murray: “ Scalability! But at What COST? ,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.
[4] Frank McSherry、 Michael Isard、 Derek G. Murray: “可伸缩性!但是代价是什么?” 于2015年5月在第15届USENIX操作系统热门话题研讨会(HotOS)上发表。
Chapter 5. Replication
The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair.
可能出错的事情和不可能出错的事情的主要区别在于,当不可能出错的事情出错时,通常会发现难以修复。
Douglas Adams, Mostly Harmless (1992)
道格拉斯·亚当斯,《几乎无害》(1992年)
Replication means keeping a copy of the same data on multiple machines that are connected via a network. As discussed in the introduction to Part II , there are several reasons why you might want to replicate data:
复制意味着在通过网络连接的多台机器上保存相同数据的副本。正如第二部分引言中所讨论的,你可能出于以下几个原因想要复制数据:
-
To keep data geographically close to your users (and thus reduce latency)
使数据在地理位置上靠近用户(从而减少延迟)
-
To allow the system to continue working even if some of its parts have failed (and thus increase availability)
允许系统即使一些部件出现故障也能正常运行(从而提高可用性)。
-
To scale out the number of machines that can serve read queries (and thus increase read throughput)
扩展可处理读查询的机器数量(从而增加读取吞吐量)。
In this chapter we will assume that your dataset is so small that each machine can hold a copy of the entire dataset. In Chapter 6 we will relax that assumption and discuss partitioning ( sharding ) of datasets that are too big for a single machine. In later chapters we will discuss various kinds of faults that can occur in a replicated data system, and how to deal with them.
在本章中,我们假设您的数据集很小,以至于每台机器都可以容纳整个数据集的副本。在第6章中,我们将放宽这个假设,并讨论数据集的分区(分片)——这些数据集太大而无法放入一台机器中。在后面的章节中,我们将讨论在复制数据系统中可能发生的各种故障,以及如何处理它们。
If the data that you’re replicating does not change over time, then replication is easy: you just need to copy the data to every node once, and you’re done. All of the difficulty in replication lies in handling changes to replicated data, and that’s what this chapter is about. We will discuss three popular algorithms for replicating changes between nodes: single-leader , multi-leader , and leaderless replication. Almost all distributed databases use one of these three approaches. They all have various pros and cons, which we will examine in detail.
如果你要复制的数据不会随时间变化,那么复制就很容易:你只需要将数据复制到每个节点一次,就完成了。所有复制中的困难都在于处理复制数据的变化,这就是本章讨论的内容。我们将讨论三种流行的节点间复制变化的算法:单主节点、多主节点和无主节点复制。几乎所有分布式数据库都使用这三种方法中的一种。它们都有不同的优缺点,我们将详细研究。
There are many trade-offs to consider with replication: for example, whether to use synchronous or asynchronous replication, and how to handle failed replicas. Those are often configuration options in databases, and although the details vary by database, the general principles are similar across many different implementations. We will discuss the consequences of such choices in this chapter.
使用数据复制需要考虑很多折衷方案,例如同步或异步复制和如何处理失败的副本。这些通常是数据库中的配置选项,尽管实现细节因数据库而异,但一般原则在许多不同实现中都是类似的。本章将讨论这些选择的后果。
Replication of databases is an old topic—the principles haven’t changed much since they were studied in the 1970s [ 1 ], because the fundamental constraints of networks have remained the same. However, outside of research, many developers continued to assume for a long time that a database consisted of just one node. Mainstream use of distributed databases is more recent. Since many application developers are new to this area, there has been a lot of misunderstanding around issues such as eventual consistency . In “Problems with Replication Lag” we will get more precise about eventual consistency and discuss things like the read-your-writes and monotonic reads guarantees.
数据库复制是一个古老的话题——自1970年代被研究以来,其原理基本没有改变[1],因为网络的基本约束仍然相同。然而,在研究领域之外,许多开发人员长期以来仍然假定数据库只由一个节点组成。分布式数据库的主流使用是较为晚近的事情。由于许多应用程序开发人员对这个领域还不熟悉,围绕最终一致性等问题存在许多误解。在“复制滞后的问题”一节中,我们将更精确地讨论最终一致性,并讨论诸如读己之写和单调读等保证。
Leaders and Followers
Each node that stores a copy of the database is called a replica . With multiple replicas, a question inevitably arises: how do we ensure that all the data ends up on all the replicas?
每个存储数据库副本的节点被称为副本。有了多个副本,不可避免地会出现一个问题:我们如何确保所有的数据都最终存储在所有的副本上?
Every write to the database needs to be processed by every replica; otherwise, the replicas would no longer contain the same data. The most common solution for this is called leader-based replication (also known as active/passive or master–slave replication ) and is illustrated in Figure 5-1 . It works as follows:
每次写入数据库都需要由每个副本进行处理,否则副本将不再包含相同的数据。最常见的解决方案称为基于主节点的复制(也称为主动/被动或主从复制),如图5-1所示。其工作如下:
-
One of the replicas is designated the leader (also known as master or primary ). When clients want to write to the database, they must send their requests to the leader, which first writes the new data to its local storage.
其中一个副本被指定为领导者(也称为主库或主节点)。当客户端想要向数据库写入时,必须将请求发送给领导者,领导者首先将新数据写入其本地存储。
-
The other replicas are known as followers ( read replicas , slaves , secondaries , or hot standbys ). i Whenever the leader writes new data to its local storage, it also sends the data change to all of its followers as part of a replication log or change stream . Each follower takes the log from the leader and updates its local copy of the database accordingly, by applying all writes in the same order as they were processed on the leader.
其他副本被称为跟随者(也称为只读副本、从库、备库或热备)。每当领导者将新数据写入其本地存储时,它也会将数据更改作为复制日志或更改流的一部分发送给所有跟随者。每个跟随者从领导者获取日志,并按照与领导者处理写入相同的顺序应用所有写入,从而相应地更新其本地数据库副本。
-
When a client wants to read from the database, it can query either the leader or any of the followers. However, writes are only accepted on the leader (the followers are read-only from the client’s point of view).
当客户端想要从数据库中读取数据时,它可以查询领导者或任何一个追随者。但是,只有领导者接受写入(从客户端的角度来看,追随者只读)。
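The three steps above can be sketched as a toy in-memory model (a hypothetical illustration only, not any real database’s implementation; all class and method names here are invented):

上述三个步骤可以用一个玩具式的内存模型来示意(仅为假设性说明,并非任何真实数据库的实现;这里的所有类名和方法名均为虚构):

```python
# A minimal sketch of leader-based replication: the leader appends every
# write to a replication log; followers apply entries in the same order.

class Follower:
    def __init__(self):
        self.data = {}         # local copy of the database
        self.log_position = 0  # how far into the leader's log we have applied

    def apply(self, entries):
        # Apply writes in exactly the order the leader processed them.
        for key, value in entries:
            self.data[key] = value
            self.log_position += 1

class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []          # append-only replication log of (key, value)
        self.followers = followers

    def write(self, key, value):
        # Writes are only accepted on the leader.
        self.data[key] = value
        self.log.append((key, value))
        for f in self.followers:
            # Ship each follower the log entries it has not yet applied.
            f.apply(self.log[f.log_position:])

    def read(self, key):
        return self.data.get(key)

f1, f2 = Follower(), Follower()
leader = Leader([f1, f2])
leader.write("user:1", "new_profile.jpg")
# Reads may go to the leader or to any follower.
assert f1.data["user:1"] == "new_profile.jpg"
```

真实系统还要处理网络丢包、跟随者宕机和并发写入,这些正是本章其余部分讨论的内容。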
This mode of replication is a built-in feature of many relational databases, such as PostgreSQL (since version 9.0), MySQL, Oracle Data Guard [ 2 ], and SQL Server’s AlwaysOn Availability Groups [ 3 ]. It is also used in some nonrelational databases, including MongoDB, RethinkDB, and Espresso [ 4 ]. Finally, leader-based replication is not restricted to only databases: distributed message brokers such as Kafka [ 5 ] and RabbitMQ highly available queues [ 6 ] also use it. Some network filesystems and replicated block devices such as DRBD are similar.
这种复制模式是许多关系型数据库内置的功能,如PostgreSQL(9.0版本以后)、MySQL、Oracle Data Guard[2]和SQL Server的AlwaysOn可用性组[3]。它也用于一些非关系型数据库,包括MongoDB、RethinkDB和Espresso[4]。最后,基于领导者的复制并不仅限于数据库:诸如Kafka[5]和RabbitMQ高可用队列[6]之类的分布式消息代理也使用它。一些网络文件系统和诸如DRBD之类的复制块设备也与之类似。
Synchronous Versus Asynchronous Replication
An important detail of a replicated system is whether the replication happens synchronously or asynchronously . (In relational databases, this is often a configurable option; other systems are often hardcoded to be either one or the other.)
复制系统的一个重要细节是复制是同步还是异步发生的。(在关系型数据库中,这常常是一个可配置的选项;其他系统常常是硬编码为其中的一个。)
Think about what happens in Figure 5-1 , where the user of a website updates their profile image. At some point in time, the client sends the update request to the leader; shortly afterward, it is received by the leader. At some point, the leader forwards the data change to the followers. Eventually, the leader notifies the client that the update was successful.
想想在图5-1中发生了什么,其中一个网站用户更新了其个人资料图片。在某个时间点,客户端将更新请求发送给领导者;不久之后,领导者接收到该请求。然后,领导者将数据更改转发给跟随者。最终,领导者会通知客户端更新成功。
Figure 5-2 shows the communication between various components of the system: the user’s client, the leader, and two followers. Time flows from left to right. A request or response message is shown as a thick arrow.
图5-2显示了系统各个组件之间的通信:用户客户端、领导者和两个跟随者。时间从左到右流动。请求或响应消息显示为粗箭头。
In the example of Figure 5-2 , the replication to follower 1 is synchronous : the leader waits until follower 1 has confirmed that it received the write before reporting success to the user, and before making the write visible to other clients. The replication to follower 2 is asynchronous : the leader sends the message, but doesn’t wait for a response from the follower.
在图5-2的例子中,向追随者1的复制是同步的:领导者等待追随者1确认接收到写操作,然后才向用户报告成功并使写操作对其他客户端可见。向追随者2的复制是异步的:领导者发送信息,但不等待追随者的响应。
The diagram shows that there is a substantial delay before follower 2 processes the message. Normally, replication is quite fast: most database systems apply changes to followers in less than a second. However, there is no guarantee of how long it might take. There are circumstances when followers might fall behind the leader by several minutes or more; for example, if a follower is recovering from a failure, if the system is operating near maximum capacity, or if there are network problems between the nodes.
这张图表明在跟随者2处理信息之前会有相当大的延迟。通常情况下,复制是相当快的:大多数数据库系统在不到一秒的时间内将更改应用到跟随者上。然而,无法保证需要多长时间。有些情况下,跟随者可能会落后于领导者几分钟或更长时间;例如,如果跟随者正在从故障中恢复、系统接近最大容量运行或节点之间存在网络问题。
The advantage of synchronous replication is that the follower is guaranteed to have an up-to-date copy of the data that is consistent with the leader. If the leader suddenly fails, we can be sure that the data is still available on the follower. The disadvantage is that if the synchronous follower doesn’t respond (because it has crashed, or there is a network fault, or for any other reason), the write cannot be processed. The leader must block all writes and wait until the synchronous replica is available again.
同步复制的优点是,跟随者保证拥有与领导者一致的最新数据副本。如果领导者突然失效,我们可以确信这些数据在跟随者上仍然可用。缺点是,如果同步的跟随者没有响应(因为它已崩溃、出现网络故障或其他原因),写入就无法被处理。领导者必须阻塞所有写入,并等待同步副本再次可用。
For that reason, it is impractical for all followers to be synchronous: any one node outage would cause the whole system to grind to a halt. In practice, if you enable synchronous replication on a database, it usually means that one of the followers is synchronous, and the others are asynchronous. If the synchronous follower becomes unavailable or slow, one of the asynchronous followers is made synchronous. This guarantees that you have an up-to-date copy of the data on at least two nodes: the leader and one synchronous follower. This configuration is sometimes also called semi-synchronous [ 7 ].
因此,让所有跟随者都保持同步是不切实际的:任何一个节点的中断都会导致整个系统停摆。实际上,如果你在数据库上启用同步复制,通常意味着其中一个跟随者是同步的,其他的是异步的。如果同步跟随者变得不可用或缓慢,则将一个异步跟随者改为同步。这保证了至少在两个节点上拥有最新的数据副本:领导者和一个同步跟随者。这种配置有时也被称为半同步[7]。
Often, leader-based replication is configured to be completely asynchronous. In this case, if the leader fails and is not recoverable, any writes that have not yet been replicated to followers are lost. This means that a write is not guaranteed to be durable, even if it has been confirmed to the client. However, a fully asynchronous configuration has the advantage that the leader can continue processing writes, even if all of its followers have fallen behind.
通常情况下,基于领导者的复制被配置为完全异步。在这种情况下,如果领导者失败并且无法恢复,则尚未复制到追随者的任何写入都将丢失。这意味着,即使已经确认给客户端,写入也不能保证是持久的。但是,完全异步配置的优点是,即使其所有追随者已经落后,领导者也可以继续处理写入。
Weakening durability may sound like a bad trade-off, but asynchronous replication is nevertheless widely used, especially if there are many followers or if they are geographically distributed. We will return to this issue in “Problems with Replication Lag” .
耐久性的减弱可能听起来像是一个不好的权衡,但是异步复制仍然被广泛使用,特别是如果有许多追随者或者它们在地理上分布。我们将在“复制滞后问题”中回到这个问题。
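The semi-synchronous configuration described above can be sketched as follows (a hypothetical, heavily simplified model; real systems send messages over a network rather than calling methods, and the names here are invented):

上述半同步配置可以示意如下(假设性的、高度简化的模型;真实系统通过网络发送消息而非调用方法,这里的名称均为虚构):

```python
# Sketch of semi-synchronous replication: the leader blocks until one
# designated synchronous follower acknowledges the write; the remaining
# followers receive the write asynchronously and may lag behind.

class Follower:
    def __init__(self, online=True):
        self.data = {}
        self.online = online

    def replicate(self, key, value):
        # Returns True (an "ack") only if the follower is reachable.
        if self.online:
            self.data[key] = value
            return True
        return False

class Leader:
    def __init__(self, sync_follower, async_followers):
        self.data = {}
        self.sync_follower = sync_follower
        self.async_followers = async_followers

    def write(self, key, value):
        self.data[key] = value
        # Synchronous: wait for the sync follower's ack before reporting success.
        if not self.sync_follower.replicate(key, value):
            raise RuntimeError("write blocked: synchronous follower unavailable")
        # Asynchronous: send and don't wait; a lost update is tolerated for now.
        for f in self.async_followers:
            f.replicate(key, value)
        return "ok"  # only reported to the client after the synchronous ack

sync_f = Follower()
lagging = Follower(online=False)   # simulates a follower that has fallen behind
leader = Leader(sync_f, [lagging])
assert leader.write("k", "v") == "ok"
assert sync_f.data["k"] == "v"      # guaranteed up to date
assert "k" not in lagging.data      # async follower may lag
```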
Setting Up New Followers
From time to time, you need to set up new followers—perhaps to increase the number of replicas, or to replace failed nodes. How do you ensure that the new follower has an accurate copy of the leader’s data?
你会不时需要设置新的跟随者——也许是为了增加副本数量,或替换失效的节点。如何确保新的跟随者拥有领导者数据的准确副本?
Simply copying data files from one node to another is typically not sufficient: clients are constantly writing to the database, and the data is always in flux, so a standard file copy would see different parts of the database at different points in time. The result might not make any sense.
仅仅从一个节点复制数据文件到另一个节点通常是不足够的:客户端不断地向数据库写入数据,数据始终在变化,因此标准的文件复制在不同时间可能看到不同部分的数据库。结果可能毫无意义。
You could make the files on disk consistent by locking the database (making it unavailable for writes), but that would go against our goal of high availability. Fortunately, setting up a follower can usually be done without downtime. Conceptually, the process looks like this:
通过锁定数据库(使其无法写入),您可以使磁盘上的文件一致,但这将违反我们高可用性的目标。幸运的是,通常可以在无需停机的情况下设置跟随者。从概念上讲,该过程看起来像这样:
-
Take a consistent snapshot of the leader’s database at some point in time—if possible, without taking a lock on the entire database. Most databases have this feature, as it is also required for backups. In some cases, third-party tools are needed, such as innobackupex for MySQL [ 12 ].
在某个时间点拍摄领导数据库的一致性快照——如果可能的话,不要锁定整个数据库。大多数数据库都有这个功能,因为备份也需要这个功能。在某些情况下,需要使用第三方工具,例如 MySQL 的 innobackupex 。[12]。
-
Copy the snapshot to the new follower node.
将快照复制到新的跟随者节点。
-
The follower connects to the leader and requests all the data changes that have happened since the snapshot was taken. This requires that the snapshot is associated with an exact position in the leader’s replication log. That position has various names: for example, PostgreSQL calls it the log sequence number , and MySQL calls it the binlog coordinates .
跟随者连接到领导者,并请求自快照拍摄以来发生的所有数据更改。这需要将快照与领导者的复制日志中的确切位置相关联。该位置有各种名称:例如,PostgreSQL称其为日志序列号,而MySQL称其为binlog坐标。
-
When the follower has processed the backlog of data changes since the snapshot, we say it has caught up . It can now continue to process data changes from the leader as they happen.
当追随者处理完自快照之后的数据更改积压时,我们称它已经追上了。现在它可以继续处理来自领导者的实时数据更改。
The practical steps of setting up a follower vary significantly by database. In some systems the process is fully automated, whereas in others it can be a somewhat arcane multi-step workflow that needs to be manually performed by an administrator.
设置跟随者的实际步骤因数据库而有很大差异。在某些系统中,这个过程是完全自动化的,而在另一些系统中,它可能是需要管理员手动执行的、有些晦涩的多步骤工作流程。
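The four conceptual steps above can be sketched as follows (a hypothetical toy model; the pairing of a snapshot with a log position mirrors what PostgreSQL calls a log sequence number and MySQL calls binlog coordinates, but the code itself is invented):

上述四个概念步骤可以示意如下(假设性的玩具模型;快照与日志位置配对的做法对应PostgreSQL的日志序列号和MySQL的binlog坐标,但代码本身是虚构的):

```python
# Sketch of setting up a new follower without downtime:
# 1) take a consistent snapshot tied to an exact log position,
# 2) copy the snapshot to the new node while writes continue,
# 3) replay the backlog of changes since that position (catch-up).

class Leader:
    def __init__(self):
        self.data = {}
        self.log = []  # append-only list of (key, value) writes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        # The snapshot is associated with an exact replication-log position.
        return dict(self.data), len(self.log)

    def changes_since(self, position):
        return self.log[position:]

leader = Leader()
leader.write("a", 1)
snap, pos = leader.snapshot()                 # step 1: consistent snapshot
leader.write("b", 2)                          # writes keep arriving meanwhile
new_follower = dict(snap)                     # step 2: copy snapshot over
for key, value in leader.changes_since(pos):  # step 3: request the backlog
    new_follower[key] = value
assert new_follower == leader.data            # step 4: the follower has caught up
```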
Handling Node Outages
Any node in the system can go down, perhaps unexpectedly due to a fault, but just as likely due to planned maintenance (for example, rebooting a machine to install a kernel security patch). Being able to reboot individual nodes without downtime is a big advantage for operations and maintenance. Thus, our goal is to keep the system as a whole running despite individual node failures, and to keep the impact of a node outage as small as possible.
系统中的任何节点都可能宕机,也许是由于故障而意外宕机,但同样可能是由于计划内的维护(例如,重启机器以安装内核安全补丁)。能够在不停机的情况下重启单个节点,对运维来说是一个巨大的优势。因此,我们的目标是在个别节点发生故障时保持系统整体运行,并将节点中断的影响尽可能降到最低。
How do you achieve high availability with leader-based replication?
如何通过基于领导者的复制实现高可用性?
Follower failure: Catch-up recovery
On its local disk, each follower keeps a log of the data changes it has received from the leader. If a follower crashes and is restarted, or if the network between the leader and the follower is temporarily interrupted, the follower can recover quite easily: from its log, it knows the last transaction that was processed before the fault occurred. Thus, the follower can connect to the leader and request all the data changes that occurred during the time when the follower was disconnected. When it has applied these changes, it has caught up to the leader and can continue receiving a stream of data changes as before.
每个跟随者在本地磁盘上保存着从领导者接收到的数据更改日志。如果跟随者崩溃并重启,或者领导者与跟随者之间的网络暂时中断,跟随者可以很容易地恢复:从日志中,它知道故障发生前处理的最后一个事务。因此,跟随者可以连接到领导者,并请求在其断开连接期间发生的所有数据更改。应用完这些更改后,它就追上了领导者,可以像以前一样继续接收数据更改流。
Leader failure: Failover
Handling a failure of the leader is trickier: one of the followers needs to be promoted to be the new leader, clients need to be reconfigured to send their writes to the new leader, and the other followers need to start consuming data changes from the new leader. This process is called failover .
处理领导者的失效则更为棘手:需要将其中一个跟随者提升为新的领导者,客户端需要重新配置以将写入发送给新的领导者,其他跟随者需要开始消费来自新领导者的数据更改。这个过程称为故障切换。
Failover can happen manually (an administrator is notified that the leader has failed and takes the necessary steps to make a new leader) or automatically. An automatic failover process usually consists of the following steps:
故障切换可以手动进行(管理员收到领导者失效的通知,并采取必要步骤指定新的领导者),也可以自动进行。自动故障切换过程通常包括以下步骤:
-
Determining that the leader has failed. There are many things that could potentially go wrong: crashes, power outages, network issues, and more. There is no foolproof way of detecting what has gone wrong, so most systems simply use a timeout: nodes frequently bounce messages back and forth between each other, and if a node doesn’t respond for some period of time—say, 30 seconds—it is assumed to be dead. (If the leader is deliberately taken down for planned maintenance, this doesn’t apply.)
确定领导者已失效。可能出问题的地方很多:崩溃、停电、网络问题等等。没有万无一失的方法能检测出到底哪里出了问题,因此大多数系统简单地使用超时:节点之间频繁地互相发送消息,如果某个节点在一段时间内(比如30秒)没有响应,就认为它已经失效。(如果领导者是因计划维护而被故意关闭,则不适用此方法。)
-
Choosing a new leader. This could be done through an election process (where the leader is chosen by a majority of the remaining replicas), or a new leader could be appointed by a previously elected controller node . The best candidate for leadership is usually the replica with the most up-to-date data changes from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader is a consensus problem, discussed in detail in Chapter 9 .
选择一个新的领导者。这可以通过选举过程完成(由剩余副本的多数选出领导者),也可以由先前选定的控制器节点任命新的领导者。领导者的最佳候选者通常是拥有旧领导者最新数据更改的那个副本(以尽量减少数据丢失)。让所有节点就新领导者达成一致是一个共识问题,将在第9章详细讨论。
-
Reconfiguring the system to use the new leader. Clients now need to send their write requests to the new leader (we discuss this in “Request Routing” ). If the old leader comes back, it might still believe that it is the leader, not realizing that the other replicas have forced it to step down. The system needs to ensure that the old leader becomes a follower and recognizes the new leader.
重新配置系统以使用新的领导者。客户端现在需要将写请求发送给新的领导者(我们将在“请求路由”中讨论)。如果旧的领导者恢复上线,它可能仍然认为自己是领导者,而没有意识到其他副本已经迫使它下台。系统需要确保旧的领导者变成跟随者,并承认新的领导者。
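The first two failover steps can be sketched as follows (a hypothetical, heavily simplified illustration; real systems need a consensus protocol for step 2, as Chapter 9 discusses, and all names here are invented):

前两个故障切换步骤可以示意如下(假设性的、高度简化的说明;真实系统在第2步需要共识协议,见第9章的讨论,这里的名称均为虚构):

```python
# Sketch of automatic failover:
# step 1 - declare the leader dead if it misses a heartbeat timeout;
# step 2 - promote the replica with the most up-to-date replication log,
#          to minimize data loss.

def detect_failure(last_heartbeat, now, timeout=30.0):
    # No foolproof detection exists, so use a timeout on heartbeats.
    return now - last_heartbeat > timeout

def choose_new_leader(replicas):
    # Prefer the replica that has applied the most of the old leader's log.
    # (Getting all nodes to agree on this choice is a consensus problem.)
    return max(replicas, key=lambda r: r["log_position"])

replicas = [
    {"name": "follower-1", "log_position": 1042},
    {"name": "follower-2", "log_position": 1037},
]
assert detect_failure(last_heartbeat=0.0, now=31.0)
assert choose_new_leader(replicas)["name"] == "follower-1"
```

请注意,超时阈值本身就是一个权衡:正如下文所讨论的,太短会导致不必要的故障切换,太长则延长恢复时间。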
Failover is fraught with things that can go wrong:
故障切换充满了可能会出错的事情:
-
If asynchronous replication is used, the new leader may not have received all the writes from the old leader before it failed. If the former leader rejoins the cluster after a new leader has been chosen, what should happen to those writes? The new leader may have received conflicting writes in the meantime. The most common solution is for the old leader’s unreplicated writes to simply be discarded, which may violate clients’ durability expectations.
如果使用异步复制,新领导者在旧领导者失效之前可能没有收到其所有写入。如果前领导者在新领导者选出后重新加入集群,这些写入应该如何处理?在此期间,新领导者可能已经接收到了相互冲突的写入。最常见的解决方案是简单地丢弃旧领导者未复制的写入,但这可能违反客户端对持久性的期望。
-
Discarding writes is especially dangerous if other storage systems outside of the database need to be coordinated with the database contents. For example, in one incident at GitHub [ 13 ], an out-of-date MySQL follower was promoted to leader. The database used an autoincrementing counter to assign primary keys to new rows, but because the new leader’s counter lagged behind the old leader’s, it reused some primary keys that were previously assigned by the old leader. These primary keys were also used in a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis, which caused some private data to be disclosed to the wrong users.
如果数据库之外的其他存储系统需要与数据库内容保持协调,丢弃写入尤其危险。例如,在GitHub的一次事故中[13],一个数据过时的MySQL跟随者被提升为领导者。数据库使用自增计数器为新行分配主键,但由于新领导者的计数器落后于旧领导者,它重用了一些旧领导者已经分配过的主键。这些主键同时也用于一个Redis存储,因此主键的重用导致MySQL和Redis之间不一致,使得一些私有数据被泄露给错误的用户。
-
In certain fault scenarios (see Chapter 8 ), it could happen that two nodes both believe that they are the leader. This situation is called split brain , and it is dangerous: if both leaders accept writes, and there is no process for resolving conflicts (see “Multi-Leader Replication” ), data is likely to be lost or corrupted. As a safety catch, some systems have a mechanism to shut down one node if two leaders are detected. ii However, if this mechanism is not carefully designed, you can end up with both nodes being shut down [ 14 ].
在某些故障情况(见第8章)下,可能会出现两个节点都认为自己是领导者的情况。此情况称为“脑裂”,它是危险的:如果两个领导者都接受写入,并且没有解决冲突的过程(见“多领导者复制”),数据很可能会丢失或损坏。为了安全起见,有些系统有机制在检测到两个领导者时关闭其中一个节点。然而,如果这个机制设计不当,你可能会导致两个节点都关闭[14]。
-
What is the right timeout before the leader is declared dead? A longer timeout means a longer time to recovery in the case where the leader fails. However, if the timeout is too short, there could be unnecessary failovers. For example, a temporary load spike could cause a node’s response time to increase above the timeout, or a network glitch could cause delayed packets. If the system is already struggling with high load or network problems, an unnecessary failover is likely to make the situation worse, not better.
正确的超时时间是多少才能宣布领袖死亡?更长的超时时间意味着在领袖失效的情况下需要更长的恢复时间。然而,如果超时时间太短,可能会出现不必要的故障转移。例如,暂时的负载峰值可能会导致节点的响应时间超过超时时间,或者网络故障可能会导致数据包延迟。如果系统已经面临高负载或网络问题,不必要的故障转移很可能会使情况更糟,而不是更好。
There are no easy solutions to these problems. For this reason, some operations teams prefer to perform failovers manually, even if the software supports automatic failover.
这些问题没有简单的解决方案。因此,一些运维团队更喜欢手动执行故障转移,即使软件支持自动故障转移。
These issues—node failures; unreliable networks; and trade-offs around replica consistency, durability, availability, and latency—are in fact fundamental problems in distributed systems. In Chapter 8 and Chapter 9 we will discuss them in greater depth.
这些问题——节点故障、不可靠的网络,以及围绕副本一致性、持久性、可用性和延迟的权衡——实际上是分布式系统中的基本问题。在第8章和第9章中,我们将更深入地讨论它们。
Implementation of Replication Logs
How does leader-based replication work under the hood? Several different replication methods are used in practice, so let’s look at each one briefly.
基于领导者的复制如何在底层工作?实践中使用了几种不同的复制方法,让我们简要地逐一了解。
Statement-based replication
In the simplest case, the leader logs every write request ( statement ) that it executes and sends that statement log to its followers. For a relational database, this means that every INSERT , UPDATE , or DELETE statement is forwarded to followers, and each follower parses and executes that SQL statement as if it had been received from a client.
在最简单的情况下,领导者记录它执行的每个写入请求(语句),并将该语句日志发送给其跟随者。对于关系型数据库,这意味着每条INSERT、UPDATE或DELETE语句都会被转发给跟随者,每个跟随者解析并执行该SQL语句,就像它是从客户端接收到的一样。
Although this may sound reasonable, there are various ways in which this approach to replication can break down:
虽然这听起来很合理,但这种复制方法可能会出现各种问题:
-
Any statement that calls a nondeterministic function, such as NOW() to get the current date and time or RAND() to get a random number, is likely to generate a different value on each replica.
任何调用非确定性函数的语句,例如获取当前日期和时间的NOW()或获取随机数的RAND(),很可能在每个副本上生成不同的值。
-
If statements use an autoincrementing column, or if they depend on the existing data in the database (e.g., UPDATE … WHERE <some condition> ), they must be executed in exactly the same order on each replica, or else they may have a different effect. This can be limiting when there are multiple concurrently executing transactions.
如果语句使用自增列,或者依赖于数据库中的现有数据(例如,UPDATE … WHERE <某些条件>),它们必须在每个副本上以完全相同的顺序执行,否则可能产生不同的效果。当有多个并发执行的事务时,这可能成为限制。
-
Statements that have side effects (e.g., triggers, stored procedures, user-defined functions) may result in different side effects occurring on each replica, unless the side effects are absolutely deterministic.
具有副作用的语句(例如触发器、存储过程、用户定义函数)可能导致不同的副作用在每个副本上发生,除非这些副作用是绝对确定的。
It is possible to work around those issues—for example, the leader can replace any nondeterministic function calls with a fixed return value when the statement is logged so that the followers all get the same value. However, because there are so many edge cases, other replication methods are now generally preferred.
可以采用解决这些问题的方法——例如,领导可以在记录语句时将任何非确定性函数调用替换为固定的返回值,以便跟随者都获得相同的值。但由于存在很多特殊情况,现在通常更喜欢使用其他复制方法。
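The workaround just described can be sketched as follows (a hypothetical illustration; the function name and the rewriting rules are invented, and a real implementation would work on a parsed statement rather than on raw text):

刚才描述的变通方法可以示意如下(假设性的说明;函数名和重写规则均为虚构,真实实现会在解析后的语句上操作,而不是在原始文本上):

```python
# Sketch: before logging a statement, the leader replaces nondeterministic
# function calls with concrete values, so that every follower executes
# exactly the same statement and produces exactly the same result.

import random
import re
from datetime import datetime, timezone

def make_deterministic(statement):
    # Freeze NOW() to the leader's current clock reading (fixed here for
    # illustration), logged as a literal timestamp.
    fixed_now = datetime(2017, 3, 1, tzinfo=timezone.utc).isoformat()
    statement = statement.replace("NOW()", f"'{fixed_now}'")
    # Evaluate RAND() once on the leader and log the resulting literal.
    statement = re.sub(r"RAND\(\)", lambda m: str(random.random()), statement)
    return statement

logged = make_deterministic("UPDATE users SET last_seen = NOW() WHERE id = 1")
assert "NOW()" not in logged   # followers see a literal timestamp instead
assert "2017-03-01" in logged
```

正如正文所说,边缘情况太多(自增列、触发器、依赖现有数据的语句),这正是其他复制方法如今通常更受青睐的原因。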
Statement-based replication was used in MySQL before version 5.1. It is still sometimes used today, as it is quite compact, but by default MySQL now switches to row-based replication (discussed shortly) if there is any nondeterminism in a statement. VoltDB uses statement-based replication, and makes it safe by requiring transactions to be deterministic [ 15 ].
MySQL在5.1版本之前使用基于语句的复制。如今有时仍在使用,因为它相当紧凑,但默认情况下,如果语句中存在任何非确定性,MySQL现在会切换到基于行的复制(稍后讨论)。VoltDB使用基于语句的复制,并通过要求事务必须是确定性的来保证其安全[15]。
Write-ahead log (WAL) shipping
In Chapter 3 we discussed how storage engines represent data on disk, and we found that usually every write is appended to a log:
在第三章中,我们讨论了存储引擎如何在磁盘上表示数据,我们发现通常每次写入都会附加到日志中。
-
In the case of a log-structured storage engine (see “SSTables and LSM-Trees” ), this log is the main place for storage. Log segments are compacted and garbage-collected in the background.
在采用日志结构化存储引擎的情况下(见“SSTables和LSM树”),这个日志是主要的存储地点。日志段在后台进行压缩和垃圾回收。
-
In the case of a B-tree (see “B-Trees” ), which overwrites individual disk blocks, every modification is first written to a write-ahead log so that the index can be restored to a consistent state after a crash.
在B树(参见“B树”)的情况下,会覆盖单个磁盘块,因此每次修改都会首先写入写前日志,以便在崩溃后恢复索引到一致状态。
In either case, the log is an append-only sequence of bytes containing all writes to the database. We can use the exact same log to build a replica on another node: besides writing the log to disk, the leader also sends it across the network to its followers. When the follower processes this log, it builds a copy of the exact same data structures as found on the leader.
在这两种情况下,日志都是一个仅追加的字节序列,包含对数据库的所有写入。我们可以使用完全相同的日志在另一个节点上构建副本:除了将日志写入磁盘之外,领导者还通过网络将其发送给追随者。当追随者处理此日志时,它会构建出与领导者上完全相同的数据结构的副本。
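The idea that replaying the same log yields the same state can be sketched in a few lines. This toy uses key=value records for readability; a real WAL describes byte-level changes to disk blocks:

```python
class Node:
    """A node that rebuilds its key-value state by replaying an
    append-only log of 'key=value' records."""
    def __init__(self):
        self.log, self.store = [], {}

    def apply(self, record: str):
        self.log.append(record)
        key, _, value = record.partition("=")
        self.store[key] = value

leader, follower = Node(), Node()
for write in ["x=1", "y=2", "x=3"]:
    leader.apply(write)      # write locally...
    follower.apply(write)    # ...and ship the identical record to the follower

# Same log, same state: the follower's data structures match the leader's.
```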
This method of replication is used in PostgreSQL and Oracle, among others [ 16 ]. The main disadvantage is that the log describes the data on a very low level: a WAL contains details of which bytes were changed in which disk blocks. This makes replication closely coupled to the storage engine. If the database changes its storage format from one version to another, it is typically not possible to run different versions of the database software on the leader and the followers.
PostgreSQL和Oracle等数据库使用这种复制方法[16]。其主要缺点是日志描述数据的层次非常低:WAL包含哪些磁盘块中的哪些字节被更改的细节。这使得复制与存储引擎紧密耦合。如果数据库的存储格式从一个版本变更到另一个版本,通常就不可能在领导者和追随者上运行不同版本的数据库软件。
That may seem like a minor implementation detail, but it can have a big operational impact. If the replication protocol allows the follower to use a newer software version than the leader, you can perform a zero-downtime upgrade of the database software by first upgrading the followers and then performing a failover to make one of the upgraded nodes the new leader. If the replication protocol does not allow this version mismatch, as is often the case with WAL shipping, such upgrades require downtime.
这可能看起来像一个微小的实现细节,但它可能产生重大的运维影响。如果复制协议允许追随者使用比领导者更新的软件版本,那么您可以先升级追随者,然后执行故障转移,使某个已升级的节点成为新的领导者,从而实现数据库软件的零停机升级。如果复制协议不允许这种版本不匹配(WAL传送通常如此),则此类升级需要停机。
Logical (row-based) log replication
An alternative is to use different log formats for replication and for the storage engine, which allows the replication log to be decoupled from the storage engine internals. This kind of replication log is called a logical log , to distinguish it from the storage engine’s ( physical ) data representation.
一种替代方案是为复制和存储引擎使用不同的日志格式,这样可以将复制日志与存储引擎的内部实现解耦。这种复制日志被称为逻辑日志,以区别于存储引擎的(物理)数据表示。
A logical log for a relational database is usually a sequence of records describing writes to database tables at the granularity of a row:
关系数据库的逻辑日志通常是描述对数据库表进行行级写操作的记录序列:
-
For an inserted row, the log contains the new values of all columns.
对于插入的行,日志包含所有列的新值。
-
For a deleted row, the log contains enough information to uniquely identify the row that was deleted. Typically this would be the primary key, but if there is no primary key on the table, the old values of all columns need to be logged.
对于已删除的行,日志包含足够的信息来唯一标识被删除的行。通常这将是主键,但如果表没有主键,则需要记录所有列的旧值。
-
For an updated row, the log contains enough information to uniquely identify the updated row, and the new values of all columns (or at least the new values of all columns that changed).
更新的行,在日志中包含足够的信息来唯一识别更新的行,并且包含所有列的新值(或者至少包含所有修改的列的新值)。
A transaction that modifies several rows generates several such log records, followed by a record indicating that the transaction was committed. MySQL’s binlog (when configured to use row-based replication) uses this approach [ 17 ].
一个修改多行的事务会生成多条这样的日志记录,随后是一条表明该事务已提交的记录。MySQL的binlog(配置为基于行的复制时)使用这种方法[17]。
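A logical log for the row operations listed above might look like the following. The record layout here is invented for illustration; MySQL's actual binlog is a binary format:

```python
# Each record identifies the row (here by primary key) and carries new
# column values; a commit record closes the transaction.
logical_log = [
    {"op": "insert", "table": "pages", "row": {"id": 1, "title": "A"}},
    {"op": "update", "table": "pages", "key": {"id": 1}, "new": {"title": "B"}},
    {"op": "delete", "table": "pages", "key": {"id": 1}},
    {"op": "commit", "txid": 42},
]

def apply(log, tables):
    """Replay a logical log against an in-memory copy of the tables."""
    for rec in log:
        if rec["op"] == "insert":
            tables.setdefault(rec["table"], {})[rec["row"]["id"]] = dict(rec["row"])
        elif rec["op"] == "update":
            tables[rec["table"]][rec["key"]["id"]].update(rec["new"])
        elif rec["op"] == "delete":
            del tables[rec["table"]][rec["key"]["id"]]
    return tables

# Insert, update, and delete cancel out, leaving an empty table:
tables = apply(logical_log, {})
```

Because these records say nothing about disk blocks, a follower running a different storage engine (or software version) can replay them just as well.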
Since a logical log is decoupled from the storage engine internals, it can more easily be kept backward compatible, allowing the leader and the follower to run different versions of the database software, or even different storage engines.
由于逻辑日志与存储引擎内部解耦,因此可以更轻松地保持向后兼容性,允许领导者和追随者运行不同版本的数据库软件,甚至是不同的存储引擎。
A logical log format is also easier for external applications to parse. This aspect is useful if you want to send the contents of a database to an external system, such as a data warehouse for offline analysis, or for building custom indexes and caches [ 18 ]. This technique is called change data capture , and we will return to it in Chapter 11 .
逻辑日志格式也更容易被外部应用程序解析。如果您想将数据库的内容发送到外部系统,例如用于离线分析的数据仓库,或用于构建自定义索引和缓存[18],这一点非常有用。这种技术称为变更数据捕获,我们将在第11章中回到这个话题。
Trigger-based replication
The replication approaches described so far are implemented by the database system, without involving any application code. In many cases, that’s what you want—but there are some circumstances where more flexibility is needed. For example, if you want to only replicate a subset of the data, or want to replicate from one kind of database to another, or if you need conflict resolution logic (see “Handling Write Conflicts” ), then you may need to move replication up to the application layer.
迄今为止所描述的复制方法是由数据库系统实现的,不涉及任何应用程序代码。在许多情况下,这是您想要的,但还有一些情况需要更多的灵活性。例如,如果您只想复制数据子集,或者想要从一种数据库复制到另一种数据库,或者如果您需要冲突解决逻辑(请参见“处理写入冲突”),那么您可能需要将复制向应用程序层移动。
Some tools, such as Oracle GoldenGate [ 19 ], can make data changes available to an application by reading the database log. An alternative is to use features that are available in many relational databases: triggers and stored procedures .
一些工具,例如Oracle GoldenGate,可以通过读取数据库日志将数据更改提供给应用程序。另一种选择是使用许多关系数据库可用的功能:触发器和存储过程。
A trigger lets you register custom application code that is automatically executed when a data change (write transaction) occurs in a database system. The trigger has the opportunity to log this change into a separate table, from which it can be read by an external process. That external process can then apply any necessary application logic and replicate the data change to another system. Databus for Oracle [ 20 ] and Bucardo for Postgres [ 21 ] work like this, for example.
触发器允许您注册自定义应用程序代码,当数据库系统中发生数据更改(写入事务)时自动执行。触发器有机会将此更改记录到一个单独的表中,外部进程可以从该表中读取。然后,该外部进程可以应用任何必要的应用程序逻辑,并将数据更改复制到另一个系统。例如,Oracle的Databus[20]和Postgres的Bucardo[21]就是这样工作的。
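The trigger-plus-change-table pattern can be demonstrated with SQLite, which supports triggers; the table and trigger names here are made up for illustration:

```python
import sqlite3

# A trigger copies every update into a `changes` table, which an external
# replication process can poll and ship to another system.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
    CREATE TABLE changes (seq INTEGER PRIMARY KEY AUTOINCREMENT,
                          account_id INTEGER, new_balance INTEGER);
    CREATE TRIGGER log_update AFTER UPDATE ON accounts
    BEGIN
        INSERT INTO changes (account_id, new_balance)
        VALUES (NEW.id, NEW.balance);
    END;
""")
db.execute("INSERT INTO accounts VALUES (1, 100)")
db.execute("UPDATE accounts SET balance = 150 WHERE id = 1")

# The "external process" reads the captured change:
rows = db.execute("SELECT account_id, new_balance FROM changes").fetchall()
```

Note the overhead the text mentions: every write now performs an extra insert, and the external process adds its own latency to replication.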
Trigger-based replication typically has greater overheads than other replication methods, and is more prone to bugs and limitations than the database’s built-in replication. However, it can nevertheless be useful due to its flexibility.
基于触发器的复制通常比其他复制方法有更高的开销,且比数据库内置的复制更容易出现错误和限制。 然而,由于其灵活性,它仍然可以是有用的。
Problems with Replication Lag
Being able to tolerate node failures is just one reason for wanting replication. As mentioned in the introduction to Part II , other reasons are scalability (processing more requests than a single machine can handle) and latency (placing replicas geographically closer to users).
实现容忍节点故障只是需要复制的原因之一。如Part II 的介绍中提到的,其它原因包括可扩展性(能够处理超出单一机器所能负担的更多请求)和延迟(把副本放置在更接近用户的地理位置)。
Leader-based replication requires all writes to go through a single node, but read-only queries can go to any replica. For workloads that consist of mostly reads and only a small percentage of writes (a common pattern on the web), there is an attractive option: create many followers, and distribute the read requests across those followers. This removes load from the leader and allows read requests to be served by nearby replicas.
基于领导者的复制要求所有写入都通过单个节点,但只读查询可以发往任何副本。对于主要由读取组成、只有一小部分写入的工作负载(网络上的常见模式),有一个有吸引力的选择:创建许多追随者,并将读取请求分布到这些追随者上。这样可以减轻领导者的负担,并允许由附近的副本来处理读取请求。
In this read-scaling architecture, you can increase the capacity for serving read-only requests simply by adding more followers. However, this approach only realistically works with asynchronous replication—if you tried to synchronously replicate to all followers, a single node failure or network outage would make the entire system unavailable for writing. And the more nodes you have, the likelier it is that one will be down, so a fully synchronous configuration would be very unreliable.
在这种读取扩展架构中,您只需添加更多的追随者即可增加处理只读请求的能力。但是,这种方法实际上只适用于异步复制——如果您尝试同步复制到所有追随者,单个节点故障或网络中断将导致整个系统无法写入。而且,节点越多,其中一个出现故障的可能性就越大,因此完全同步的配置会非常不可靠。
Unfortunately, if an application reads from an asynchronous follower, it may see outdated information if the follower has fallen behind. This leads to apparent inconsistencies in the database: if you run the same query on the leader and a follower at the same time, you may get different results, because not all writes have been reflected in the follower. This inconsistency is just a temporary state—if you stop writing to the database and wait a while, the followers will eventually catch up and become consistent with the leader. For that reason, this effect is known as eventual consistency [ 22 , 23 ]. iii
不幸的是,如果应用程序从异步追随者读取,如果追随者落后,可能会看到过时的信息。这导致数据库中的明显不一致性:如果您同时在领导者和追随者上运行相同的查询,您可能会得到不同的结果,因为并非所有写入都已在追随者中反映。这种不一致性只是暂时的状态 - 如果您停止向数据库写入并等待一段时间,追随者最终将追上并与领导者保持一致。因此,这种效果被称为最终一致性[22,23]。
The term “eventually” is deliberately vague: in general, there is no limit to how far a replica can fall behind. In normal operation, the delay between a write happening on the leader and being reflected on a follower—the replication lag —may be only a fraction of a second, and not noticeable in practice. However, if the system is operating near capacity or if there is a problem in the network, the lag can easily increase to several seconds or even minutes.
“Eventually”这个术语是有意模糊的:通常情况下,副本落后的限制是没有的。在正常运行中,领导者执行写操作到某个跟随者反映这个写操作的延迟,也就是复制延迟,可能仅为几分之一秒,在实践中并不会被感知。然而,如果系统接近容量极限或网络存在问题,延迟很容易增加到几秒甚至几分钟。
When the lag is so large, the inconsistencies it introduces are not just a theoretical issue but a real problem for applications. In this section we will highlight three examples of problems that are likely to occur when there is replication lag and outline some approaches to solving them.
当延迟非常大时,它引入的不一致性不仅是一个理论问题,而且对应用程序来说是一个真正的问题。在本节中,我们将强调三个问题的例子,这些问题在存在复制延迟时很有可能发生,并概述解决它们的方法。
Reading Your Own Writes
Many applications let the user submit some data and then view what they have submitted. This might be a record in a customer database, or a comment on a discussion thread, or something else of that sort. When new data is submitted, it must be sent to the leader, but when the user views the data, it can be read from a follower. This is especially appropriate if data is frequently viewed but only occasionally written.
许多应用程序允许用户提交数据,然后查看他们提交的内容。这可能是客户数据库中的记录,或讨论主题上的评论,或其他类似的东西。当新数据被提交时,它必须发送到领导者,但当用户查看数据时,可以从追随者中读取。如果数据经常被查看但只偶尔被写入,这是特别适当的。
With asynchronous replication, there is a problem, illustrated in Figure 5-3 : if the user views the data shortly after making a write, the new data may not yet have reached the replica. To the user, it looks as though the data they submitted was lost, so they will be understandably unhappy.
使用异步复制,存在一个问题,如图5-3所示:如果用户在写入数据后不久即查看数据,则新数据可能尚未到达副本。对于用户来说,看起来好像他们提交的数据丢失了,所以他们会感到不满意。
In this situation, we need read-after-write consistency , also known as read-your-writes consistency [ 24 ]. This is a guarantee that if the user reloads the page, they will always see any updates they submitted themselves. It makes no promises about other users: other users’ updates may not be visible until some later time. However, it reassures the user that their own input has been saved correctly.
在这种情况下,我们需要读写一致性,也称为读取自己写入的一致性[24]。这保证了用户重新加载页面时,他们始终会看到他们自己提交的更新。它不保证其他用户:其他用户的更新可能直到某个时间才能看到。然而,它向用户保证他们自己的输入已经正确保存。
How can we implement read-after-write consistency in a system with leader-based replication? There are various possible techniques. To mention a few:
我们如何在基于领导者复制的系统中实现读写一致性?有多种可能的技术。举几个例子:
-
When reading something that the user may have modified, read it from the leader; otherwise, read it from a follower. This requires that you have some way of knowing whether something might have been modified, without actually querying it. For example, user profile information on a social network is normally only editable by the owner of the profile, not by anybody else. Thus, a simple rule is: always read the user’s own profile from the leader, and any other users’ profiles from a follower.
如果用户可能修改了某些内容,请从领导者处读取;否则,请从追随者处读取。这要求您以某种方式知道某些内容可能已经被修改,而不必实际查询它。例如,在社交网络上,用户资料信息通常只能由个人资料的所有者进行编辑,而不能由其他任何人编辑。因此,一个简单的规则是:总是从领导者读取用户自己的资料,并从追随者读取其他用户的资料。
-
If most things in the application are potentially editable by the user, that approach won’t be effective, as most things would have to be read from the leader (negating the benefit of read scaling). In that case, other criteria may be used to decide whether to read from the leader. For example, you could track the time of the last update and, for one minute after the last update, make all reads from the leader. You could also monitor the replication lag on followers and prevent queries on any follower that is more than one minute behind the leader.
如果应用程序中大多数内容都可能被用户编辑,这种方法就不会有效,因为大多数内容都必须从领导者读取(抵消了读取扩展的好处)。在这种情况下,可以使用其他标准来决定是否从领导者读取。例如,您可以跟踪上次更新的时间,并在上次更新后的一分钟内使所有读取都来自领导者。您还可以监控追随者的复制延迟,并阻止在落后领导者超过一分钟的任何追随者上执行查询。
-
The client can remember the timestamp of its most recent write—then the system can ensure that the replica serving any reads for that user reflects updates at least until that timestamp. If a replica is not sufficiently up to date, either the read can be handled by another replica or the query can wait until the replica has caught up. The timestamp could be a logical timestamp (something that indicates ordering of writes, such as the log sequence number) or the actual system clock (in which case clock synchronization becomes critical; see “Unreliable Clocks” ).
客户端可以记住它最近写入的时间戳 - 然后系统可以确保为该用户提供的任何读取副本都至少反映该时间戳之前的更新。如果一个副本不够更新,那么读取可以由另一个副本处理,或者查询可以等待直到该副本赶上。时间戳可以是逻辑时间戳(指示写入排序的某些内容,例如日志序列号)或实际的系统时钟(在这种情况下,时钟同步变得至关重要;请参见“不可靠的时钟”)。
-
If your replicas are distributed across multiple datacenters (for geographical proximity to users or for availability), there is additional complexity. Any request that needs to be served by the leader must be routed to the datacenter that contains the leader.
如果您的副本分布在多个数据中心(以方便用户的地理位置或可用性),那么就会有额外的复杂性。任何需要由领导者提供服务的请求都必须路由到包含领导者的数据中心。
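The timestamp-based technique in the list above might be sketched like this. The class and the one-minute window are illustrative, and replication itself is not modeled (the follower simply stays stale):

```python
import time

class ReadYourWritesRouter:
    """Remember each user's last-write time and route their reads to the
    leader for a window afterwards, so they always see their own writes."""
    def __init__(self, leader, follower, window=60.0):
        self.leader, self.follower, self.window = leader, follower, window
        self.last_write = {}  # user_id -> timestamp of most recent write

    def write(self, user_id, key, value):
        self.leader[key] = value
        self.last_write[user_id] = time.monotonic()

    def read(self, user_id, key):
        since_write = time.monotonic() - self.last_write.get(user_id, float("-inf"))
        replica = self.leader if since_write < self.window else self.follower
        return replica.get(key)

leader, follower = {}, {}  # follower is stale: replication not modeled here
router = ReadYourWritesRouter(leader, follower)
router.write("alice", "profile", "new bio")
# Alice reads her own write from the leader; Bob, who hasn't written
# recently, may be served stale data from the follower.
```

As the text notes, this per-user state is exactly what becomes hard in the cross-device case: the timestamps would have to be centralized rather than kept in one client.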
Another complication arises when the same user is accessing your service from multiple devices, for example a desktop web browser and a mobile app. In this case you may want to provide cross-device read-after-write consistency: if the user enters some information on one device and then views it on another device, they should see the information they just entered.
当同一用户从多个设备访问您的服务时,另一个复杂性问题出现,例如桌面 web 浏览器和移动应用程序。在这种情况下,您可能需要提供跨设备的“先写后读”一致性:如果用户在一个设备上输入一些信息,然后在另一个设备上查看它,他们应该看到刚刚输入的信息。
In this case, there are some additional issues to consider:
在这种情况下,还有一些额外的问题需要考虑:
-
Approaches that require remembering the timestamp of the user’s last update become more difficult, because the code running on one device doesn’t know what updates have happened on the other device. This metadata will need to be centralized.
需要记住用户上次更新的时间戳的方法变得更加困难,因为在一个设备上运行的代码不知道另一个设备上发生了哪些更新。这些元数据需要集中处理。
-
If your replicas are distributed across different datacenters, there is no guarantee that connections from different devices will be routed to the same datacenter. (For example, if the user’s desktop computer uses the home broadband connection and their mobile device uses the cellular data network, the devices’ network routes may be completely different.) If your approach requires reading from the leader, you may first need to route requests from all of a user’s devices to the same datacenter.
如果您的副本分布在不同的数据中心中,则不能保证来自不同设备的连接将路由到同一数据中心。 (例如,如果用户的桌面计算机使用家庭宽带连接,而其移动设备使用蜂窝数据网络,则设备的网络路由可能完全不同。)如果您的方法需要从领导者读取,您可能需要首先将来自用户所有设备的请求路由到同一个数据中心。
Monotonic Reads
Our second example of an anomaly that can occur when reading from asynchronous followers is that it’s possible for a user to see things moving backward in time .
异步跟随者读取时可能出现的异常情况的第二个例子是用户可能会看到时间倒流。
This can happen if a user makes several reads from different replicas. For example, Figure 5-4 shows user 2345 making the same query twice, first to a follower with little lag, then to a follower with greater lag. (This scenario is quite likely if the user refreshes a web page, and each request is routed to a random server.) The first query returns a comment that was recently added by user 1234, but the second query doesn’t return anything because the lagging follower has not yet picked up that write. In effect, the second query is observing the system at an earlier point in time than the first query. This wouldn’t be so bad if the first query hadn’t returned anything, because user 2345 probably wouldn’t know that user 1234 had recently added a comment. However, it’s very confusing for user 2345 if they first see user 1234’s comment appear, and then see it disappear again.
这种情况可能发生在用户从不同副本进行多次读取时。例如,图5-4显示用户2345进行了两次相同的查询,第一次发给一个滞后很小的追随者,第二次发给一个滞后较大的追随者。(如果用户刷新网页,而每个请求都被路由到随机服务器,就很可能出现这种场景。)第一个查询返回了用户1234最近添加的评论,但第二个查询没有返回任何内容,因为滞后的追随者尚未接收到该写入。实际上,第二个查询观察到的是系统比第一个查询更早的时间点。如果第一个查询没有返回任何内容,情况还不算太糟,因为用户2345可能根本不知道用户1234最近添加了评论。但是,如果用户2345先看到用户1234的评论出现,然后又看到它消失,这就非常令人困惑了。
Monotonic reads [ 23 ] is a guarantee that this kind of anomaly does not happen. It’s a lesser guarantee than strong consistency, but a stronger guarantee than eventual consistency. When you read data, you may see an old value; monotonic reads only means that if one user makes several reads in sequence, they will not see time go backward—i.e., they will not read older data after having previously read newer data.
单调读[23]是一种保证这类异常不会发生的机制。它是比强一致性弱、但比最终一致性强的保证。当您读取数据时,可能会看到旧值;单调读只意味着,如果一个用户按顺序进行多次读取,他们不会看到时间倒流——也就是说,不会在之前读到较新的数据后又读到较旧的数据。
One way of achieving monotonic reads is to make sure that each user always makes their reads from the same replica (different users can read from different replicas). For example, the replica can be chosen based on a hash of the user ID, rather than randomly. However, if that replica fails, the user’s queries will need to be rerouted to another replica.
实现单调读的一种方法是确保每个用户始终从相同的副本中读取(不同的用户可以从不同的副本中读取)。例如,可以根据用户ID的哈希值选择副本,而不是随机选择。但是,如果该副本失败,用户的查询将需要重新路由到另一个副本。
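Choosing a replica by hashing the user ID might look like the following sketch (the replica names are hypothetical):

```python
import hashlib

replicas = ["replica-0", "replica-1", "replica-2"]

def replica_for(user_id: str, live=replicas) -> str:
    """Pick a stable replica for a user by hashing the user ID, so the
    same user's reads always go to the same place."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return live[int.from_bytes(digest[:8], "big") % len(live)]

pinned = replica_for("user-2345")
# If the pinned replica fails, rehash over the survivors — at which point
# the monotonic-reads guarantee can momentarily be violated:
survivors = [r for r in replicas if r != pinned]
fallback = replica_for("user-2345", survivors)
```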
Consistent Prefix Reads
Our third example of replication lag anomalies concerns violation of causality. Imagine the following short dialog between Mr. Poons and Mrs. Cake:
我们第三个复制滞后异常的例子涉及违反因果关系。想象一下普恩斯先生和蛋糕夫人之间的简短对话。
- Mr. Poons
-
How far into the future can you see, Mrs. Cake?
你能看到未来多远,蛋糕夫人?
- Mrs. Cake
-
About ten seconds usually, Mr. Poons.
通常大约十秒,普恩斯先生。
There is a causal dependency between those two sentences: Mrs. Cake heard Mr. Poons’s question and answered it.
这两个句子之间存在因果依赖关系:Cake 夫人听到了 Poons 先生的问题并回答了它。
Now, imagine a third person is listening to this conversation through followers. The things said by Mrs. Cake go through a follower with little lag, but the things said by Mr. Poons have a longer replication lag (see Figure 5-5 ). This observer would hear the following:
现在,想象一位第三方通过追随者听到这次谈话。蛋糕夫人说的话通过追随者传递的时间较短,而普恩斯先生说的话则有较长的复制延迟(见图5-5)。这位观察者会听到以下内容:
- Mrs. Cake
-
About ten seconds usually, Mr. Poons.
通常大约十秒,普恩斯先生。
- Mr. Poons
-
How far into the future can you see, Mrs. Cake?
你能看到未来多远,蛋糕夫人?
To the observer it looks as though Mrs. Cake is answering the question before Mr. Poons has even asked it. Such psychic powers are impressive, but very confusing [ 25 ].
观察者会觉得蛋糕夫人在普恩斯先生还没问出问题之前就已经回答了问题。这种超自然力量令人印象深刻,但也非常令人困惑[25]。
Preventing this kind of anomaly requires another type of guarantee: consistent prefix reads [ 23 ]. This guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.
预防这种异常需要另一种类型的保证:一致性前缀读取[23]。该保证表示,如果一系列写入按照某个顺序发生,则任何读取这些写入的人都会以相同的顺序看到它们出现。
This is a particular problem in partitioned (sharded) databases, which we will discuss in Chapter 6 . If the database always applies writes in the same order, reads always see a consistent prefix, so this anomaly cannot happen. However, in many distributed databases, different partitions operate independently, so there is no global ordering of writes: when a user reads from the database, they may see some parts of the database in an older state and some in a newer state.
这是分区数据库中的特殊问题,我们将在第6章中讨论。如果数据库总是按相同顺序应用写操作,读操作就能始终看到一致的前缀,因此这种异常不会发生。然而,在许多分布式数据库中,不同的分区操作是独立的,因此没有全局的写入顺序:当用户从数据库读取时,他们可能会看到一些旧状态和一些新状态的数据库部分。
One solution is to make sure that any writes that are causally related to each other are written to the same partition—but in some applications that cannot be done efficiently. There are also algorithms that explicitly keep track of causal dependencies, a topic that we will return to in “The “happens-before” relationship and concurrency” .
一种解决方案是确保任何因果相关的写操作都写入同一分区 - 但在某些应用中,这可能无法高效完成。还有一些算法明确跟踪因果依赖关系,这是我们将在““发生在之前”关系和并发性”中重新回到的主题。
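Routing causally related writes to the same partition can be done by hashing a shared "causality key" — here, a conversation ID — so that a question and its answer are applied in order on every replica of that partition. The key choice and partition count are illustrative:

```python
import hashlib

N_PARTITIONS = 4

def partition_for(causality_key: str) -> int:
    """Map a causality key (e.g. a conversation ID) to a partition, so all
    writes sharing that key land on the same partition."""
    digest = hashlib.md5(causality_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % N_PARTITIONS

question = ("conversation-17", "How far into the future can you see?")
answer   = ("conversation-17", "About ten seconds usually.")
# Both messages share a causality key, so they hit the same partition and
# no reader can see the answer without the question.
```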
Solutions for Replication Lag
When working with an eventually consistent system, it is worth thinking about how the application behaves if the replication lag increases to several minutes or even hours. If the answer is “no problem,” that’s great. However, if the result is a bad experience for users, it’s important to design the system to provide a stronger guarantee, such as read-after-write. Pretending that replication is synchronous when in fact it is asynchronous is a recipe for problems down the line.
当使用最终一致性系统时,值得考虑一下:如果复制延迟增加到几分钟甚至几小时,应用程序会如何表现?如果答案是"没有问题",那很好。然而,如果结果是给用户带来糟糕的体验,那么将系统设计为提供更强的保证(例如写后读)就非常重要。在复制实际上是异步的情况下却假装它是同步的,是日后出问题的根源。
As discussed earlier, there are ways in which an application can provide a stronger guarantee than the underlying database—for example, by performing certain kinds of reads on the leader. However, dealing with these issues in application code is complex and easy to get wrong.
正如之前所讨论的那样,应用程序可以提供比底层数据库更强的保证方式——例如,在领导者上进行某些类型的读取。然而,处理这些问题在应用程序代码中是复杂的,容易出错。
It would be better if application developers didn’t have to worry about subtle replication issues and could just trust their databases to “do the right thing.” This is why transactions exist: they are a way for a database to provide stronger guarantees so that the application can be simpler.
应用程序开发人员不必担心微妙的复制问题,只需信任它们的数据库“做正确的事情”,这样会更好。这就是事务存在的原因:它们是数据库提供更强保证的一种方式,因此应用程序可以更简单。
Single-node transactions have existed for a long time. However, in the move to distributed (replicated and partitioned) databases, many systems have abandoned them, claiming that transactions are too expensive in terms of performance and availability, and asserting that eventual consistency is inevitable in a scalable system. There is some truth in that statement, but it is overly simplistic, and we will develop a more nuanced view over the course of the rest of this book. We will return to the topic of transactions in Chapters 7 and 9 , and we will discuss some alternative mechanisms in Part III .
单节点事务已经存在了很长时间。然而,在转向分布式(复制和分区)数据库的过程中,许多系统放弃了事务,声称事务在性能和可用性方面代价太高,并断言在可扩展的系统中最终一致性是不可避免的。这种说法有一定道理,但过于简单化,在本书的其余部分中,我们将逐步形成更细致的观点。我们将在第7章和第9章回到事务的话题,并在第III部分讨论一些替代机制。
Multi-Leader Replication
So far in this chapter we have only considered replication architectures using a single leader. Although that is a common approach, there are interesting alternatives.
到目前为止,在这一章中,我们只考虑使用单个领导者的复制架构。虽然这是常见的方法,但也有一些有趣的替代方案。
Leader-based replication has one major downside: there is only one leader, and all writes must go through it. iv If you can’t connect to the leader for any reason, for example due to a network interruption between you and the leader, you can’t write to the database.
领导者复制有一个主要的缺点:只有一个领导者,所有的写操作都必须通过它进行。如果由于任何原因无法连接到领导者,例如由于您和领导者之间的网络中断,您将无法对数据库进行写操作。
A natural extension of the leader-based replication model is to allow more than one node to accept writes. Replication still happens in the same way: each node that processes a write must forward that data change to all the other nodes. We call this a multi-leader configuration (also known as master–master or active/active replication ). In this setup, each leader simultaneously acts as a follower to the other leaders.
基于领导者的复制模型的一个自然扩展是允许多个节点接受写入。复制仍然以同样的方式进行:每个处理写入的节点都必须将该数据更改转发给所有其他节点。我们称之为多领导者配置(也称为主-主复制或活动/活动复制)。在这种设置中,每个领导者同时充当其他领导者的追随者。
Use Cases for Multi-Leader Replication
It rarely makes sense to use a multi-leader setup within a single datacenter, because the benefits rarely outweigh the added complexity. However, there are some situations in which this configuration is reasonable.
在单个数据中心中很少有必要使用多领导者架构, 因为收益很少能够超过额外的复杂度。然而, 在某些情况下这种配置是合理的。
Multi-datacenter operation
Imagine you have a database with replicas in several different datacenters (perhaps so that you can tolerate failure of an entire datacenter, or perhaps in order to be closer to your users). With a normal leader-based replication setup, the leader has to be in one of the datacenters, and all writes must go through that datacenter.
想象一下,你有一个数据库,它在几个不同的数据中心中有副本(可能是为了能够容忍整个数据中心的故障,或者可能是为了更加接近你的用户)。在普通的基于领导者的复制设置中,领导者必须在其中一个数据中心中,而所有写操作必须经过该数据中心。
In a multi-leader configuration, you can have a leader in each datacenter. Figure 5-6 shows what this architecture might look like. Within each datacenter, regular leader–follower replication is used; between datacenters, each datacenter’s leader replicates its changes to the leaders in other datacenters.
在多领导者配置中,您可以在每个数据中心都拥有一位领导者。图5-6显示了这种架构可能的外观。在每个数据中心内,使用常规的领导者-跟随者复制;在数据中心之间,每个数据中心的领导者将其更改复制到其他数据中心的领导者。
Let’s compare how the single-leader and multi-leader configurations fare in a multi-datacenter deployment:
让我们来比较单领导者和多领导者配置在多数据中心部署中的表现:
- Performance
-
In a single-leader configuration, every write must go over the internet to the datacenter with the leader. This can add significant latency to writes and might contravene the purpose of having multiple datacenters in the first place. In a multi-leader configuration, every write can be processed in the local datacenter and is replicated asynchronously to the other datacenters. Thus, the inter-datacenter network delay is hidden from users, which means the perceived performance may be better.
在单个领导者配置中,每个写操作都必须通过互联网传输到具有领导者的数据中心。这可能会增加写入的延迟,并可能违反首先拥有多个数据中心的目的。在多领导者配置中,每个写操作可以在本地数据中心中处理,并异步复制到其他数据中心。因此,数据中心之间的网络延迟对用户来说是隐藏的,这意味着感知性能可能更好。
- Tolerance of datacenter outages
-
In a single-leader configuration, if the datacenter with the leader fails, failover can promote a follower in another datacenter to be leader. In a multi-leader configuration, each datacenter can continue operating independently of the others, and replication catches up when the failed datacenter comes back online.
在单个领导者配置中,如果具有领导者的数据中心失败,故障转移可以将另一个数据中心中的追随者晋升为领导者。在多个领导者配置中,每个数据中心都可以独立运行,而且当失败的数据中心恢复线上时,复制会追赶上来。
- Tolerance of network problems
-
Traffic between datacenters usually goes over the public internet, which may be less reliable than the local network within a datacenter. A single-leader configuration is very sensitive to problems in this inter-datacenter link, because writes are made synchronously over this link. A multi-leader configuration with asynchronous replication can usually tolerate network problems better: a temporary network interruption does not prevent writes being processed.
数据中心之间的流量通常经过公共互联网,其可靠性可能不如数据中心内部的本地网络。单领导者配置对这条数据中心间链路的问题非常敏感,因为写入是通过该链路同步进行的。采用异步复制的多领导者配置通常能更好地容忍网络问题:暂时的网络中断不会阻止写入被处理。
Some databases support multi-leader configurations by default, but it is also often implemented with external tools, such as Tungsten Replicator for MySQL [ 26 ], BDR for PostgreSQL [ 27 ], and GoldenGate for Oracle [ 19 ].
一些数据库默认支持多主配置,但也经常通过外部工具实现,例如 MySQL 的 Tungsten Replicator [26],PostgreSQL 的 BDR [27],以及 Oracle 的 GoldenGate [19]。
Although multi-leader replication has advantages, it also has a big downside: the same data may be concurrently modified in two different datacenters, and those write conflicts must be resolved (indicated as “conflict resolution” in Figure 5-6 ). We will discuss this issue in “Handling Write Conflicts” .
虽然多领导者复制具有优点,但也有一个很大的缺点:同样的数据可能会在两个不同的数据中心中被同时修改,这些写冲突必须被解决(在图5-6中被表示为“冲突解决”)。我们将在“处理写冲突”中讨论这个问题。
As multi-leader replication is a somewhat retrofitted feature in many databases, there are often subtle configuration pitfalls and surprising interactions with other database features. For example, autoincrementing keys, triggers, and integrity constraints can be problematic. For this reason, multi-leader replication is often considered dangerous territory that should be avoided if possible [ 28 ].
由于多领导者复制在许多数据库中是一项后期加装的功能,因此往往存在微妙的配置陷阱,并与其他数据库功能产生意想不到的相互作用。例如,自增键、触发器和完整性约束都可能出现问题。因此,多领导者复制通常被视为应尽量避免的危险领域[28]。
Clients with offline operation
Another situation in which multi-leader replication is appropriate is if you have an application that needs to continue to work while it is disconnected from the internet.
另一个适用于多主复制的情况是,如果您有一个应用程序需要在断开与互联网的连接时仍然继续工作。
For example, consider the calendar apps on your mobile phone, your laptop, and other devices. You need to be able to see your meetings (make read requests) and enter new meetings (make write requests) at any time, regardless of whether your device currently has an internet connection. If you make any changes while you are offline, they need to be synced with a server and your other devices when the device is next online.
例如,考虑一下您手机、笔记本电脑和其他设备上的日历应用程序。无论您的设备当前是否有互联网连接,您都需要能够随时查看您的会议(进行读取请求)并输入新的会议(进行写入请求)。如果您在离线状态下进行任何更改,它们需要在设备下次联机时与服务器和您的其他设备同步。
In this case, every device has a local database that acts as a leader (it accepts write requests), and there is an asynchronous multi-leader replication process (sync) between the replicas of your calendar on all of your devices. The replication lag may be hours or even days, depending on when you have internet access available.
在这种情况下,每个设备都有一个本地数据库,充当领导者(它接受写入请求), 并且在您所有设备上的日历副本之间存在异步的多领导者复制过程(同步)。 复制滞后可能是几个小时甚至几天,这取决于您何时可以访问互联网。
From an architectural point of view, this setup is essentially the same as multi-leader replication between datacenters, taken to the extreme: each device is a “datacenter,” and the network connection between them is extremely unreliable. As the rich history of broken calendar sync implementations demonstrates, multi-leader replication is a tricky thing to get right.
从架构的角度来看,这种设置本质上与数据中心之间的多领导者复制相同,只是被推向了极端:每个设备都是一个"数据中心",而它们之间的网络连接极不可靠。正如大量存在缺陷的日历同步实现所证明的那样,多领导者复制是很难做对的事情。
There are tools that aim to make this kind of multi-leader configuration easier. For example, CouchDB is designed for this mode of operation [ 29 ].
有些工具旨在使这种多领导者配置更加容易。例如,CouchDB是为此模式设计的[29]。
Collaborative editing
Real-time collaborative editing applications allow several people to edit a document simultaneously. For example, Etherpad [ 30 ] and Google Docs [ 31 ] allow multiple people to concurrently edit a text document or spreadsheet (the algorithm is briefly discussed in “Automatic Conflict Resolution” ).
实时协作编辑应用程序可以让多人同时编辑一个文档。例如,Etherpad [30] 和 Google Docs [31] 允许多人同时编辑文本文档或电子表格(算法已在“自动冲突解决”中简要讨论)。
We don’t usually think of collaborative editing as a database replication problem, but it has a lot in common with the previously mentioned offline editing use case. When one user edits a document, the changes are instantly applied to their local replica (the state of the document in their web browser or client application) and asynchronously replicated to the server and any other users who are editing the same document.
我们通常不将协作编辑视为数据库复制问题,但它与先前提到的离线编辑用例有很多共同点。当一个用户编辑文档时,更改会立即应用于他们的本地副本(即在其Web浏览器或客户端应用程序中的文档状态),并异步地复制到服务器和任何其他正在编辑同一文档的用户。
If you want to guarantee that there will be no editing conflicts, the application must obtain a lock on the document before a user can edit it. If another user wants to edit the same document, they first have to wait until the first user has committed their changes and released the lock. This collaboration model is equivalent to single-leader replication with transactions on the leader.
如果您想保证没有编辑冲突,应用程序必须在用户进行编辑之前获取文档的锁定。如果另一个用户想要编辑同一份文档,他们必须等待第一个用户提交更改并释放锁定。这种协作模型相当于在领导者上具有事务的单一领导者复制。
However, for faster collaboration, you may want to make the unit of change very small (e.g., a single keystroke) and avoid locking. This approach allows multiple users to edit simultaneously, but it also brings all the challenges of multi-leader replication, including requiring conflict resolution [ 32 ].
然而,为了实现更快速的协作,您可能希望将更改的单位做得非常小(例如,单次按键),并避免锁定。这种方法允许多个用户同时编辑,但也带来了多领导者复制的所有挑战,包括需要冲突解决[32]。
Handling Write Conflicts
The biggest problem with multi-leader replication is that write conflicts can occur, which means that conflict resolution is required.
多领导者复制的最大问题是可能会发生写冲突,这意味着需要进行冲突解决。
For example, consider a wiki page that is simultaneously being edited by two users, as shown in Figure 5-7 . User 1 changes the title of the page from A to B, and user 2 changes the title from A to C at the same time. Each user’s change is successfully applied to their local leader. However, when the changes are asynchronously replicated, a conflict is detected [ 33 ]. This problem does not occur in a single-leader database.
例如,考虑一个由两个用户同时编辑的维基页面,如图5-7所示。用户1将页面标题从A更改为B,同时用户2将标题从A更改为C。每个用户的更改都成功应用到了其本地领导者。然而,当这些更改被异步复制时,会检测到冲突[33]。这个问题在单领导者数据库中不会发生。
Synchronous versus asynchronous conflict detection
In a single-leader database, the second writer will either block and wait for the first write to complete, or abort the second write transaction, forcing the user to retry the write. On the other hand, in a multi-leader setup, both writes are successful, and the conflict is only detected asynchronously at some later point in time. At that time, it may be too late to ask the user to resolve the conflict.
在单个领导者数据库中,第二个写入者将要么阻塞并等待第一个写入完成,要么中止第二个写入事务,强制用户重试写入。另一方面,在多个领导者设置中,两个写入都成功,冲突仅在稍后异步检测到。这时,可能已经太晚要求用户解决冲突。
In principle, you could make the conflict detection synchronous—i.e., wait for the write to be replicated to all replicas before telling the user that the write was successful. However, by doing so, you would lose the main advantage of multi-leader replication: allowing each replica to accept writes independently. If you want synchronous conflict detection, you might as well just use single-leader replication.
原则上,您可以使冲突检测变成同步的——即等待写入被复制到所有副本之后,再告诉用户写入成功。然而,这样做会使您丧失多领导者复制的主要优势:允许每个副本独立接受写入。如果您想要同步的冲突检测,那还不如直接使用单领导者复制。
Conflict avoidance
The simplest strategy for dealing with conflicts is to avoid them: if the application can ensure that all writes for a particular record go through the same leader, then conflicts cannot occur. Since many implementations of multi-leader replication handle conflicts quite poorly, avoiding conflicts is a frequently recommended approach [ 34 ].
处理冲突的最简单策略就是避免冲突:如果应用程序能够确保特定记录的所有写入都通过同一个领导者,那么冲突就不会发生。由于许多多领导者复制的实现对冲突处理得相当糟糕,避免冲突是一种经常被推荐的方法[34]。
For example, in an application where a user can edit their own data, you can ensure that requests from a particular user are always routed to the same datacenter and use the leader in that datacenter for reading and writing. Different users may have different “home” datacenters (perhaps picked based on geographic proximity to the user), but from any one user’s point of view the configuration is essentially single-leader.
例如,在一个用户可以编辑自己数据的应用程序中,您可以确保来自特定用户的请求始终路由到同一数据中心,并使用该数据中心的领导者进行读写操作。不同的用户可能有不同的“家庭”数据中心(可能是基于地理距离选择的),但从任何一个用户的角度来看,配置基本上是单主导的。
However, sometimes you might want to change the designated leader for a record—perhaps because one datacenter has failed and you need to reroute traffic to another datacenter, or perhaps because a user has moved to a different location and is now closer to a different datacenter. In this situation, conflict avoidance breaks down, and you have to deal with the possibility of concurrent writes on different leaders.
然而,有时你可能想要更改某条记录的指定领导者——可能是因为一个数据中心发生故障,你需要将流量重新路由到另一个数据中心;也可能是因为用户已经搬到了另一个位置,现在更靠近另一个数据中心。在这种情况下,冲突避免就会失效,你必须处理不同领导者上并发写入的可能性。
Converging toward a consistent state
A single-leader database applies writes in a sequential order: if there are several updates to the same field, the last write determines the final value of the field.
单主数据库按顺序应用写操作:如果对同一字段进行了多个更新,则最后一次写入确定字段的最终值。
In a multi-leader configuration, there is no defined ordering of writes, so it’s not clear what the final value should be. In Figure 5-7 , at leader 1 the title is first updated to B and then to C; at leader 2 it is first updated to C and then to B. Neither order is “more correct” than the other.
在多领导者配置中,写入的顺序没有定义,因此不清楚最终的值应该是什么。在图5-7中,领导者1首先将标题更新为B,然后更新为C;在领导者2中,它首先更新为C,然后更新为B。没有一种顺序比另一种“更正确”。
If each replica simply applied writes in the order that it saw the writes, the database would end up in an inconsistent state: the final value would be C at leader 1 and B at leader 2. That is not acceptable—every replication scheme must ensure that the data is eventually the same in all replicas. Thus, the database must resolve the conflict in a convergent way, which means that all replicas must arrive at the same final value when all changes have been replicated.
如果每个副本只是按照它看到写入的顺序应用写入,数据库最终会处于不一致的状态:领导者1的最终值将是C,而领导者2的最终值将是B。这是不可接受的——每种复制方案都必须确保所有副本中的数据最终相同。因此,数据库必须以收敛(convergent)的方式解决冲突,也就是说,当所有更改都被复制完成后,所有副本必须达到相同的最终值。
There are various ways of achieving convergent conflict resolution:
实现收敛的冲突解决有多种方法:
-
Give each write a unique ID (e.g., a timestamp, a long random number, a UUID, or a hash of the key and value), pick the write with the highest ID as the winner , and throw away the other writes. If a timestamp is used, this technique is known as last write wins (LWW). Although this approach is popular, it is dangerously prone to data loss [ 35 ]. We will discuss LWW in more detail at the end of this chapter ( “Detecting Concurrent Writes” ).
给每个写入一个唯一的ID(例如,时间戳,长随机数,UUID或键值的哈希),选择具有最高ID的写入为赢家,并且丢弃其他写入。如果使用时间戳,这种技术被称为最后写入赢(LWW)。尽管此方法很流行,但极易丢失数据[35]。我们将在本章末尾“检测并发写入”中更详细地讨论LWW。
-
Give each replica a unique ID, and let writes that originated at a higher-numbered replica always take precedence over writes that originated at a lower-numbered replica. This approach also implies data loss.
给每个副本分配唯一的ID,并允许来自较高编号的副本发起的写操作始终优先于来自较低编号的副本的写操作。这种方法也意味着数据丢失。
-
Somehow merge the values together—e.g., order them alphabetically and then concatenate them (in Figure 5-7 , the merged title might be something like “B/C”).
以某种方式将这些值合并在一起——例如,按字母顺序排序后再连接起来(在图5-7中,合并后的标题可能类似于“B/C”)。
-
Record the conflict in an explicit data structure that preserves all information, and write application code that resolves the conflict at some later time (perhaps by prompting the user).
将冲突记录在显式数据结构中,以保留所有信息,并编写应用程序代码以在以后的某个时间解决冲突(可能通过提示用户)。
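The first strategy above, last write wins (LWW), can be sketched in a few lines. The following is a hypothetical Python illustration (the write/record structure is invented for this example); it shows both why LWW converges and why it silently discards data.
上面第一种策略(最后写入胜利,LWW)可以用几行代码勾勒出来。下面是一个假设性的Python示意(写入记录的结构是为本例虚构的);它既展示了LWW为何收敛,也说明了它为何会悄悄丢弃数据。

```python
# Hypothetical sketch of last-write-wins (LWW): each write carries a unique
# ID (here a timestamp plus a node ID as tiebreaker); the write with the
# highest ID wins and all other conflicting writes are discarded.

def lww_resolve(conflicting_writes):
    """Pick the winner among conflicting writes by highest (timestamp, node_id)."""
    # Taking max() over the same ID on every replica makes all replicas
    # converge on the same value, but every losing write is silently lost.
    return max(conflicting_writes, key=lambda w: (w["timestamp"], w["node_id"]))

# Two concurrent writes to the same wiki-page title, as in Figure 5-7:
writes = [
    {"timestamp": 1000, "node_id": 1, "value": "B"},
    {"timestamp": 1000, "node_id": 2, "value": "C"},
]
winner = lww_resolve(writes)
```

Because the tiebreaker is deterministic, every replica picks the same winner regardless of the order in which it received the writes (convergence); the losing write is lost, which is the data-loss hazard noted above.
由于决胜规则是确定性的,无论以何种顺序收到这些写入,每个副本都会选出同一个赢家(收敛);而落败的写入则会丢失,这正是上文提到的数据丢失风险。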
Custom conflict resolution logic
As the most appropriate way of resolving a conflict may depend on the application, most multi-leader replication tools let you write conflict resolution logic using application code. That code may be executed on write or on read:
由于解决冲突的最恰当方式可能取决于应用程序,大多数多领导者复制工具都允许您用应用程序代码编写冲突解决逻辑。该代码可以在写入时或读取时执行:
- On write
-
As soon as the database system detects a conflict in the log of replicated changes, it calls the conflict handler. For example, Bucardo allows you to write a snippet of Perl for this purpose. This handler typically cannot prompt a user—it runs in a background process and it must execute quickly.
当数据库系统检测到复制更改日志中的冲突时,它会调用冲突处理程序。例如,Bucardo允许您编写一个Perl片段来处理此类冲突。此处理程序通常无法提示用户-它在后台进程中运行,并且必须快速执行。
- On read
-
When a conflict is detected, all the conflicting writes are stored. The next time the data is read, these multiple versions of the data are returned to the application. The application may prompt the user or automatically resolve the conflict, and write the result back to the database. CouchDB works this way, for example.
当检测到冲突时,所有冲突的写操作都被存储。下一次数据被读取时,这些数据的多个版本将返回到应用程序中。应用程序可以提示用户或自动解决冲突,并将结果写回到数据库。例如,CouchDB就是这样工作的。
Note that conflict resolution usually applies at the level of an individual row or document, not for an entire transaction [ 36 ]. Thus, if you have a transaction that atomically makes several different writes (see Chapter 7 ), each write is still considered separately for the purposes of conflict resolution.
需要注意的是,冲突解决通常适用于单个行或文档级别,而不是整个事务[36]。因此,如果您有一个原子地进行多个不同写入的事务(见第7章),则每个写入仍然被单独考虑用于冲突解决的目的。
What is a conflict?
Some kinds of conflict are obvious. In the example in Figure 5-7 , two writes concurrently modified the same field in the same record, setting it to two different values. There is little doubt that this is a conflict.
一些冲突是显而易见的。在图5-7的例子中,两个写操作同时修改了同一条记录中的同一字段,将其设置为两个不同的值。毫无疑问,这是一种冲突。
Other kinds of conflict can be more subtle to detect. For example, consider a meeting room booking system: it tracks which room is booked by which group of people at which time. This application needs to ensure that each room is only booked by one group of people at any one time (i.e., there must not be any overlapping bookings for the same room). In this case, a conflict may arise if two different bookings are created for the same room at the same time. Even if the application checks availability before allowing a user to make a booking, there can be a conflict if the two bookings are made on two different leaders.
其他类型的冲突可能更加难以检测。例如,考虑一个会议室预订系统:它跟踪哪个房间在什么时间被哪个团队预订。这个应用程序需要确保每个房间在任何时候只被一个团队预订(即,不能有同一房间的重叠预订)。在这种情况下,如果同一时间为同一房间创建了两个不同的预订,则可能会发生冲突。即使应用程序在允许用户预订之前检查可用性,如果两个预订是在两个不同的领导者上进行的,仍可能存在冲突。
There isn’t a quick ready-made answer, but in the following chapters we will trace a path toward a good understanding of this problem. We will see some more examples of conflicts in Chapter 7 , and in Chapter 12 we will discuss scalable approaches for detecting and resolving conflicts in a replicated system.
这个问题没有快速现成的答案,但是在接下来的章节中,我们会追溯一条路径,以便更好地理解这个问题。在第七章中,我们将看到更多的冲突示例,在第十二章中,我们将讨论检测和解决复制系统中冲突的可扩展方法。
Multi-Leader Replication Topologies
A replication topology describes the communication paths along which writes are propagated from one node to another. If you have two leaders, like in Figure 5-7 , there is only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With more than two leaders, various different topologies are possible. Some examples are illustrated in Figure 5-8 .
复制拓扑描述了写入从一个节点到另一个节点传播的通信路径。如果您有两个领导者,如图5-7所示,则只有一种合理的拓扑:领导者1必须将其所有写入发送到领导者2,反之亦然。具有两个以上领导者时,可能存在各种不同的拓扑。图5-8中给出了一些示例。
The most general topology is all-to-all ( Figure 5-8 [c]), in which every leader sends its writes to every other leader. However, more restricted topologies are also used: for example, MySQL by default supports only a circular topology [ 34 ], in which each node receives writes from one node and forwards those writes (plus any writes of its own) to one other node. Another popular topology has the shape of a star: one designated root node forwards writes to all of the other nodes. The star topology can be generalized to a tree.
最通用的拓扑结构是全部到全部(all-to-all)(图5-8 [c]),其中每个领导者都将其写入发送给其他每一个领导者。但是,也会使用限制更多的拓扑结构:例如,MySQL默认只支持环形拓扑[34],其中每个节点接收来自一个节点的写入,并将这些写入(加上自己的任何写入)转发给另一个节点。另一种流行的拓扑结构是星形:一个指定的根节点将写入转发给所有其他节点。星形拓扑可以推广为树形结构。
In circular and star topologies, a write may need to pass through several nodes before it reaches all replicas. Therefore, nodes need to forward data changes they receive from other nodes. To prevent infinite replication loops, each node is given a unique identifier, and in the replication log, each write is tagged with the identifiers of all the nodes it has passed through [ 43 ]. When a node receives a data change that is tagged with its own identifier, that data change is ignored, because the node knows that it has already been processed.
在环形和星形拓扑中,一个写入可能需要经过多个节点才能到达所有副本。因此,节点需要转发从其他节点收到的数据变更。为了防止无限复制循环,每个节点都被赋予一个唯一的标识符,并且在复制日志中,每个写入都带有它所经过的所有节点的标识符[43]。当一个节点收到带有自己标识符的数据变更时,该数据变更会被忽略,因为该节点知道它已经被处理过了。
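The tagging scheme described above can be sketched as follows. This is a hypothetical Python illustration (the change structure and node IDs are invented); it shows how a change travels around a ring and is dropped once it returns to a node that has already seen it.
上述的标记方案可以用如下代码勾勒。这是一个假设性的Python示意(变更结构和节点ID是虚构的);它展示了一个变更如何沿环传播,并在回到已经见过它的节点时被丢弃。

```python
# Hypothetical sketch of loop prevention: each replicated change carries the
# set of node IDs it has already passed through; a node that finds its own ID
# in the tag drops the change instead of forwarding it again.

def handle_incoming(node_id, change):
    """Return True if this node should apply and forward the change."""
    if node_id in change["seen_by"]:
        return False              # already processed here: break the loop
    change["seen_by"].add(node_id)
    return True

# A write originating at node 1 travels around a ring 1 -> 2 -> 3 -> 1:
change = {"key": "title", "value": "B", "seen_by": set()}
applied_at = []
for node_id in [1, 2, 3, 1]:      # the final 1 is the change coming full circle
    if handle_incoming(node_id, change):
        applied_at.append(node_id)   # apply locally, then forward onward
# applied_at is [1, 2, 3]: the change stops when it returns to its origin
```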
A problem with circular and star topologies is that if just one node fails, it can interrupt the flow of replication messages between other nodes, causing them to be unable to communicate until the node is fixed. The topology could be reconfigured to work around the failed node, but in most deployments such reconfiguration would have to be done manually. The fault tolerance of a more densely connected topology (such as all-to-all) is better because it allows messages to travel along different paths, avoiding a single point of failure.
环形和星形拓扑的问题在于,只要有一个节点发生故障,就可能中断其他节点之间复制消息的流动,导致它们在该节点修复之前无法通信。拓扑结构可以重新配置以绕过故障节点,但在大多数部署中,这种重新配置必须手动完成。连接更密集的拓扑(例如全部到全部)的容错性更好,因为它允许消息沿不同的路径传播,避免了单点故障。
On the other hand, all-to-all topologies can have issues too. In particular, some network links may be faster than others (e.g., due to network congestion), with the result that some replication messages may “overtake” others, as illustrated in Figure 5-9 .
但是,全互连拓扑结构也可能存在问题。特别是,一些网络连接可能比其他连接更快(例如,由于网络拥塞),导致一些复制消息可能会“超越”其他消息,如图5-9所示。
In Figure 5-9 , client A inserts a row into a table on leader 1, and client B updates that row on leader 3. However, leader 2 may receive the writes in a different order: it may first receive the update (which, from its point of view, is an update to a row that does not exist in the database) and only later receive the corresponding insert (which should have preceded the update).
在图5-9中,客户端A在领袖1上向表中插入一行,而客户端B在领袖3上更新该行。但领袖2可能以不同的顺序接收写入操作:它可能首先接收更新操作(从它的角度来看,这是对不存在于数据库中的行进行的更新),然后才接收相应的插入操作(本应该先于更新操作)。
This is a problem of causality, similar to the one we saw in “Consistent Prefix Reads” : the update depends on the prior insert, so we need to make sure that all nodes process the insert first, and then the update. Simply attaching a timestamp to every write is not sufficient, because clocks cannot be trusted to be sufficiently in sync to correctly order these events at leader 2 (see Chapter 8 ).
这是一个因果关系问题,类似于我们在“一致前缀读”中看到的问题:更新依赖于先前的插入,所以我们需要确保所有节点都先处理插入,再处理更新。仅仅给每个写入附加一个时间戳是不够的,因为无法信任时钟同步得足够精确,从而在领导者2上正确地对这些事件排序(见第8章)。
To order these events correctly, a technique called version vectors can be used, which we will discuss later in this chapter (see “Detecting Concurrent Writes” ). However, conflict detection techniques are poorly implemented in many multi-leader replication systems. For example, at the time of writing, PostgreSQL BDR does not provide causal ordering of writes [ 27 ], and Tungsten Replicator for MySQL doesn’t even try to detect conflicts [ 34 ].
为了正确地排序这些事件,可以使用一种称为版本向量(version vectors)的技术,我们将在本章稍后讨论(参见“检测并发写入”)。然而,许多多领导者复制系统中的冲突检测技术实现得很差。例如,在撰写本文时,PostgreSQL BDR不提供写入的因果排序[27],而MySQL的Tungsten Replicator甚至不尝试检测冲突[34]。
If you are using a system with multi-leader replication, it is worth being aware of these issues, carefully reading the documentation, and thoroughly testing your database to ensure that it really does provide the guarantees you believe it to have.
如果您正在使用具有多主复制系统的系统,则值得了解这些问题,仔细阅读文档,并彻底测试您的数据库以确保它确实提供您所相信的保证。
Leaderless Replication
The replication approaches we have discussed so far in this chapter—single-leader and multi-leader replication—are based on the idea that a client sends a write request to one node (the leader), and the database system takes care of copying that write to the other replicas. A leader determines the order in which writes should be processed, and followers apply the leader’s writes in the same order.
目前在本章中讨论的复制方法——单领导者和多领导者复制,基于一个客户端向一个节点(领导者)发送写请求的想法,数据库系统会处理将该写入复制到其他副本的工作。领导者确定写入的处理顺序,跟随者按照相同的顺序应用领导者的写入。
Some data storage systems take a different approach, abandoning the concept of a leader and allowing any replica to directly accept writes from clients. Some of the earliest replicated data systems were leaderless [ 1 , 44 ], but the idea was mostly forgotten during the era of dominance of relational databases. It once again became a fashionable architecture for databases after Amazon used it for its in-house Dynamo system [ 37 ]. Riak, Cassandra, and Voldemort are open source datastores with leaderless replication models inspired by Dynamo, so this kind of database is also known as Dynamo-style .
一些数据存储系统采取了不同的方法,放弃领导者的概念,允许任何副本直接接受来自客户端的写入。一些最早的复制数据系统就是无领导者的[1, 44],但在关系数据库占主导地位的时代,这个想法几乎被遗忘了。在亚马逊将其用于内部的Dynamo系统之后[37],它再次成为一种流行的数据库架构。Riak、Cassandra和Voldemort是受Dynamo启发、采用无领导者复制模型的开源数据存储,因此这类数据库也被称为Dynamo风格(Dynamo-style)。
In some leaderless implementations, the client directly sends its writes to several replicas, while in others, a coordinator node does this on behalf of the client. However, unlike a leader database, that coordinator does not enforce a particular ordering of writes. As we shall see, this difference in design has profound consequences for the way the database is used.
在一些无领导实现中,客户端直接向多个副本发送其写入内容,而在另一些实现中,一个协调节点代表客户端执行此操作。然而,与领导数据库不同,该协调器不会强制执行特定的写入顺序。正如我们将看到的那样,这种设计上的差异对数据库的使用方式有深远的影响。
Writing to the Database When a Node Is Down
Imagine you have a database with three replicas, and one of the replicas is currently unavailable—perhaps it is being rebooted to install a system update. In a leader-based configuration, if you want to continue processing writes, you may need to perform a failover (see “Handling Node Outages” ).
假设您拥有一个包含三个副本的数据库,并且其中一个副本当前不可用 —— 可能正在重新启动以安装系统更新。在基于领导者的配置中,如果您想要继续处理写入操作,您可能需要执行故障转移(请参阅“处理节点故障”)。
On the other hand, in a leaderless configuration, failover does not exist. Figure 5-10 shows what happens: the client (user 1234) sends the write to all three replicas in parallel, and the two available replicas accept the write but the unavailable replica misses it. Let’s say that it’s sufficient for two out of three replicas to acknowledge the write: after user 1234 has received two ok responses, we consider the write to be successful. The client simply ignores the fact that one of the replicas missed the write.
另一方面,在没有领导的配置中,不存在故障转移。如图5-10所示:客户端(用户1234)并行发送写请求到所有三个副本,两个可用的副本接受写请求,但不可用的副本未接受此请求。假设只需要三个副本中的两个副本确认写操作即可,当用户1234接收到两个确认响应之后,我们认为写操作已成功。客户端简单地忽略了一个副本未接受写请求的情况。
Now imagine that the unavailable node comes back online, and clients start reading from it. Any writes that happened while the node was down are missing from that node. Thus, if you read from that node, you may get stale (outdated) values as responses.
现在想象一下,不可用的节点重新联机,客户端开始从它那里读取。在该节点离线期间发生的任何写操作都将缺失于该节点。因此,如果您从该节点读取,您可能会收到过时的响应值。
To solve that problem, when a client reads from the database, it doesn’t just send its request to one replica: read requests are also sent to several nodes in parallel . The client may get different responses from different nodes; i.e., the up-to-date value from one node and a stale value from another. Version numbers are used to determine which value is newer (see “Detecting Concurrent Writes” ).
为了解决这个问题,当客户端从数据库中读取数据时,它不仅会将请求发送到一个复制节点,读取请求还会同时发送到多个节点。客户端可能会从不同的节点获取不同的响应;即从一个节点获取最新值,而从另一个节点获取旧值。版本号用于确定哪个值是更新的(请参阅“检测并发写入”)。
Read repair and anti-entropy
The replication scheme should ensure that eventually all the data is copied to every replica. After an unavailable node comes back online, how does it catch up on the writes that it missed?
复制方案应确保最终所有数据都被复制到每个副本。当一个不可用的节点重新联机后,它如何追赶它错过的写入呢?
Two mechanisms are often used in Dynamo-style datastores:
通常在 Dynamo 风格的数据存储中使用了两种机制:
- Read repair
-
When a client makes a read from several nodes in parallel, it can detect any stale responses. For example, in Figure 5-10 , user 2345 gets a version 6 value from replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale value and writes the newer value back to that replica. This approach works well for values that are frequently read.
当客户端并行地从多个节点读取时,它可以检测到任何过时的响应。例如,在图5-10中,用户2345从副本3获取到版本6的值,而从副本1和副本2获取到版本7的值。客户端发现副本3的值已经过时,于是将较新的值写回到该副本。这种方法对于被频繁读取的值效果很好。
- Anti-entropy process
-
In addition, some datastores have a background process that constantly looks for differences in the data between replicas and copies any missing data from one replica to another. Unlike the replication log in leader-based replication, this anti-entropy process does not copy writes in any particular order, and there may be a significant delay before data is copied.
此外,一些数据存储具有后台进程,不断查找副本之间数据的差异,并将任何缺失的数据从一个副本复制到另一个副本。与基于领导者的复制中的复制日志不同,此反熵过程不按任何特定顺序复制写入,并且数据复制可能会有显着延迟。
Not all systems implement both of these; for example, Voldemort currently does not have an anti-entropy process. Note that without an anti-entropy process, values that are rarely read may be missing from some replicas and thus have reduced durability, because read repair is only performed when a value is read by the application.
并非所有系统都实现了这两种机制;例如,Voldemort目前就没有反熵过程。请注意,如果没有反熵过程,很少被读取的值可能会在某些副本中缺失,从而降低了持久性,因为读修复只有在应用程序读取某个值时才会执行。
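The read repair mechanism can be sketched as follows. This is a hypothetical Python illustration (replicas are modeled as in-memory dicts, and the version/value structure is invented); it compares version numbers across replicas and writes the newest value back to any stale one.
读修复机制可以用如下代码勾勒。这是一个假设性的Python示意(副本用内存中的字典来建模,版本/值的结构是虚构的);它比较各副本的版本号,并把最新的值写回任何过时的副本。

```python
# Hypothetical sketch of read repair: read a key from several replicas,
# keep the response with the highest version number, and write that value
# back to any replica that returned a stale version.

def read_with_repair(replicas, key):
    """Read `key` from all replicas, repairing any that are stale."""
    responses = {rid: store[key] for rid, store in replicas.items()}
    latest = max(responses.values(), key=lambda v: v["version"])
    for rid, value in responses.items():
        if value["version"] < latest["version"]:
            replicas[rid][key] = latest      # repair the stale replica
    return latest

# As in Figure 5-10: replica 3 missed the update from version 6 to 7.
replicas = {
    1: {"k": {"version": 7, "value": "new"}},
    2: {"k": {"version": 7, "value": "new"}},
    3: {"k": {"version": 6, "value": "old"}},
}
result = read_with_repair(replicas, "k")
# result is the version-7 value, and replica 3 now holds version 7 as well
```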
Quorums for reading and writing
In the example of Figure 5-10 , we considered the write to be successful even though it was only processed on two out of three replicas. What if only one out of three replicas accepted the write? How far can we push this?
在图5-10的示例中,即使仅在三个副本中处理了两个,我们仍考虑写操作是成功的。如果只有三个副本中的一个接受了写入,我们可以推到多远?
If we know that every successful write is guaranteed to be present on at least two out of three replicas, that means at most one replica can be stale. Thus, if we read from at least two replicas, we can be sure that at least one of the two is up to date. If the third replica is down or slow to respond, reads can nevertheless continue returning an up-to-date value.
如果我们知道每个成功的写入都保证在三个副本中至少有两个出现,这意味着最多只能有一个副本过期。因此,如果我们从至少两个副本中读取,我们可以确信其中至少有一个是最新的。如果第三个副本停机或响应缓慢,读取仍然可以返回最新的值。
More generally, if there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. (In our example, n = 3, w = 2, r = 2.) As long as w + r > n , we expect to get an up-to-date value when reading, because at least one of the r nodes we’re reading from must be up to date. Reads and writes that obey these r and w values are called quorum reads and writes [ 44 ]. You can think of r and w as the minimum number of votes required for the read or write to be valid.
更一般地说,如果有n个副本,每个写入必须得到w个节点的确认才被视为成功,并且每次读取必须至少查询r个节点。(在我们的例子中,n = 3,w = 2,r = 2。)只要w + r > n,我们在读取时就能预期得到最新的值,因为我们读取的r个节点中至少有一个必须是最新的。遵循这些r值和w值的读写被称为法定人数(quorum)读写[44]。你可以把r和w看作读取或写入有效所需的最低“票数”。
In Dynamo-style databases, the parameters n , w , and r are typically configurable. A common choice is to make n an odd number (typically 3 or 5) and to set w = r = ( n + 1) / 2 (rounded up). However, you can vary the numbers as you see fit. For example, a workload with few writes and many reads may benefit from setting w = n and r = 1. This makes reads faster, but has the disadvantage that just one failed node causes all database writes to fail.
在 Dynamo 风格的数据库中,参数 n、w 和 r 通常是可配置的。一个普遍的选择是将 n 设为奇数(通常是 3 或 5),并将 w = r = (n + 1) / 2(向上取整)。然而,你可以根据需要变化这些数字。例如,一个写入较少但读取较多的工作负载可能会从设置 w = n 和 r = 1 中获益。这会加快读取速度,但缺点是只要一个节点失败,所有数据库的写入就会失败。
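The quorum arithmetic above can be captured in a small sketch (hypothetical Python; the function names are invented for illustration):
上面的法定人数算术可以用一段小代码来表达(假设性的Python示意;函数名是为说明而虚构的):

```python
# Sketch of the quorum condition: with n replicas, w write acknowledgments,
# and r read responses, w + r > n guarantees that the set of nodes written
# to and the set of nodes read from overlap in at least one node.

def quorum_overlap_guaranteed(n, w, r):
    """True if every read set must intersect every write set."""
    return w + r > n

def writes_tolerate_down_nodes(n, w):
    """How many nodes may be unavailable while writes still succeed."""
    return n - w

# The common choice: n odd, w = r = (n + 1) // 2 (majority quorums).
n = 5
w = r = (n + 1) // 2   # 3 of 5
```

With n = 5, w = r = 3, the system tolerates two unavailable nodes for both reads and writes, matching the case illustrated in Figure 5-11; setting w = n and r = 1 makes reads fast but means a single failed node blocks all writes.
当n = 5、w = r = 3时,系统在读和写两方面都能容忍两个不可用节点,与图5-11所示的情况一致;而设置w = n、r = 1会让读取很快,但意味着只要一个节点故障,所有写入都会失败。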
Note
There may be more than n nodes in the cluster, but any given value is stored only on n nodes. This allows the dataset to be partitioned, supporting datasets that are larger than you can fit on one node. We will return to partitioning in Chapter 6 .
集群中的节点数可能多于n,但任何给定的值只存储在n个节点上。这使得数据集可以被分区,从而支持比单个节点所能容纳的更大的数据集。我们将在第6章回到分区问题。
The quorum condition, w + r > n , allows the system to tolerate unavailable nodes as follows:
法定人数条件w+r>n允许系统在以下情况下容忍不可用节点:
-
If w < n , we can still process writes if a node is unavailable.
如果w < n,即使节点不可用,我们仍然可以处理写操作。
-
If r < n , we can still process reads if a node is unavailable.
如果 r < n,即使节点不可用,我们仍然可以处理读取。
-
With n = 3, w = 2, r = 2 we can tolerate one unavailable node.
当n = 3,w = 2,r = 2时,我们可以容忍一个不可用节点。
-
With n = 5, w = 3, r = 3 we can tolerate two unavailable nodes. This case is illustrated in Figure 5-11 .
当n = 5,w = 3,r = 3时,我们可以容忍两个不可用的节点。此情况在图5-11中说明。
-
Normally, reads and writes are always sent to all n replicas in parallel. The parameters w and r determine how many nodes we wait for—i.e., how many of the n nodes need to report success before we consider the read or write to be successful.
通常情况下,读写操作总是并行发送到所有的n个副本。参数w和r决定我们等待多少个节点——即多少个n个节点需要报告成功,我们才认为读写操作成功。
If fewer than the required w or r nodes are available, writes or reads return an error. A node could be unavailable for many reasons: because the node is down (crashed, powered down), due to an error executing the operation (can’t write because the disk is full), due to a network interruption between the client and the node, or for any number of other reasons. We only care whether the node returned a successful response and don’t need to distinguish between different kinds of fault.
如果可用的w或r节点少于所需数量,写入或读取将返回错误。节点可能无法使用的原因很多:因为节点宕机(崩溃,断电),由于执行操作时出现错误(因为磁盘已满而无法写入),由于客户端和节点之间的网络中断,或由于任何其他原因。我们只关心节点是否返回成功响应,无需区分不同类型的故障。
Limitations of Quorum Consistency
If you have n replicas, and you choose w and r such that w + r > n , you can generally expect every read to return the most recent value written for a key. This is the case because the set of nodes to which you’ve written and the set of nodes from which you’ve read must overlap. That is, among the nodes you read there must be at least one node with the latest value (illustrated in Figure 5-11 ).
如果您有n个副本,并且选择w和r,使得w + r> n,则通常可以期望每次读取都返回针对键写入的最新值。这是因为您写入的节点集和您读取的节点集必须重叠。也就是说,您读取的节点集中必须至少有一个节点具有最新值(如图5-11所示)。
Often, r and w are chosen to be a majority (more than n /2) of nodes, because that ensures w + r > n while still tolerating up to n /2 node failures. But quorums are not necessarily majorities—it only matters that the sets of nodes used by the read and write operations overlap in at least one node. Other quorum assignments are possible, which allows some flexibility in the design of distributed algorithms [ 45 ].
通常,r和w被选为多数(超过n/2的)节点,因为这样既能确保w + r > n,又能容忍最多n/2个节点故障。但法定人数不一定要是多数——重要的只是读操作和写操作所使用的节点集合至少在一个节点上有重叠。其他的法定人数分配方式也是可能的,这为分布式算法的设计提供了一定的灵活性[45]。
You may also set w and r to smaller numbers, so that w + r ≤ n (i.e., the quorum condition is not satisfied). In this case, reads and writes will still be sent to n nodes, but a smaller number of successful responses is required for the operation to succeed.
你还可以将w和r设置为较小的数字,使得w + r ≤ n(即未满足法定人数条件)。在这种情况下,读取和写入仍将发送到n个节点,但操作成功所需的成功响应数量将减少。
With a smaller w and r you are more likely to read stale values, because it’s more likely that your read didn’t include the node with the latest value. On the upside, this configuration allows lower latency and higher availability: if there is a network interruption and many replicas become unreachable, there’s a higher chance that you can continue processing reads and writes. Only after the number of reachable replicas falls below w or r does the database become unavailable for writing or reading, respectively.
使用较小的w和r值,你更有可能读到过时的值,因为你的读取更有可能没有包含具有最新值的节点。好的一面是,这种配置可以实现更低的延迟和更高的可用性:如果出现网络中断,许多副本变得不可达,你仍能继续处理读写的可能性更高。只有当可达的副本数分别低于w或r时,数据库才会变得无法写入或读取。
However, even with w + r > n , there are likely to be edge cases where stale values are returned. These depend on the implementation, but possible scenarios include:
然而,即使w + r > n,仍然有可能出现返回过期值的边际情况。这些情况取决于实现,但可能的场景包括:
-
If a sloppy quorum is used (see “Sloppy Quorums and Hinted Handoff” ), the w writes may end up on different nodes than the r reads, so there is no longer a guaranteed overlap between the r nodes and the w nodes [ 46 ].
如果使用了松散的法定人数(参见“松散的法定人数与提示移交”),那么w个写入可能最终落在与r个读取不同的节点上,因此r个读取节点和w个写入节点之间不再有保证的重叠[46]。
-
If two writes occur concurrently, it is not clear which one happened first. In this case, the only safe solution is to merge the concurrent writes (see “Handling Write Conflicts” ). If a winner is picked based on a timestamp (last write wins), writes can be lost due to clock skew [ 35 ]. We will return to this topic in “Detecting Concurrent Writes” .
如果出现两个并发写入,不清楚哪一个先发生。在这种情况下,唯一安全的解决方案是合并并发的写入(参见“处理写入冲突”)。如果根据时间戳选择获胜者(最后一次写入获胜),由于时钟偏差可能会丢失写入[35]。我们将在“检测并发写入”中回到这个话题。
-
If a write happens concurrently with a read, the write may be reflected on only some of the replicas. In this case, it’s undetermined whether the read returns the old or the new value.
如果在读取发生并发写入时,写入可能只会反映在一些副本上。在这种情况下,无法确定读取是否会返回旧值还是新值。
-
If a write succeeded on some replicas but failed on others (for example because the disks on some nodes are full), and overall succeeded on fewer than w replicas, it is not rolled back on the replicas where it succeeded. This means that if a write was reported as failed, subsequent reads may or may not return the value from that write [ 47 ].
如果写入在某些副本上成功、在其他副本上失败(例如因为某些节点的磁盘已满),并且总体上成功的副本少于w个,它在成功的那些副本上并不会被回滚。这意味着即使一个写入被报告为失败,随后的读取也可能会、也可能不会返回该次写入的值[47]。
-
If a node carrying a new value fails, and its data is restored from a replica carrying an old value, the number of replicas storing the new value may fall below w , breaking the quorum condition.
如果一个携带新值的节点发生故障,并且它的数据从一个携带旧值的副本恢复,那么存储新值的副本数可能会低于w,从而打破法定人数条件。
-
Even if everything is working correctly, there are edge cases in which you can get unlucky with the timing, as we shall see in “Linearizability and quorums” .
即使一切都运行得正确,也有极端情况,在“线性化和仲裁”的内容中我们会看到可能会出现时机不利的情况。
Thus, although quorums appear to guarantee that a read returns the latest written value, in practice it is not so simple. Dynamo-style databases are generally optimized for use cases that can tolerate eventual consistency. The parameters w and r allow you to adjust the probability of stale values being read, but it’s wise to not take them as absolute guarantees.
因此,尽管法定人数似乎保证读取最新写入的值,但实际上并不那么简单。 Dynamo样式的数据库通常针对可以容忍最终一致性的用例进行优化。参数w和r允许您调整读取过期值的概率,但明智的做法是不将它们视为绝对保证。
In particular, you usually do not get the guarantees discussed in “Problems with Replication Lag” (reading your writes, monotonic reads, or consistent prefix reads), so the previously mentioned anomalies can occur in applications. Stronger guarantees generally require transactions or consensus. We will return to these topics in Chapter 7 and Chapter 9 .
特别是,在“复制延迟问题”的讨论中通常不会得到“读取您的写入操作”、“单调读取”或“一致前缀读取”等保证,因此在应用程序中可能会出现前面提到的异常情况。更强的保证通常需要事务或共识。我们将在第7章和第9章返回这些主题。
Monitoring staleness
From an operational perspective, it’s important to monitor whether your databases are returning up-to-date results. Even if your application can tolerate stale reads, you need to be aware of the health of your replication. If it falls behind significantly, it should alert you so that you can investigate the cause (for example, a problem in the network or an overloaded node).
从操作角度看,监视数据库是否返回最新结果非常重要。即使您的应用程序可以容忍陈旧的读取,您也需要了解复制的健康状况。如果它落后很多,它应该向您发出警报,以便您可以调查原因(例如,在网络中出现问题或节点超载)。
For leader-based replication, the database typically exposes metrics for the replication lag, which you can feed into a monitoring system. This is possible because writes are applied to the leader and to followers in the same order, and each node has a position in the replication log (the number of writes it has applied locally). By subtracting a follower’s current position from the leader’s current position, you can measure the amount of replication lag.
对于基于领导者的复制,数据库通常会公开复制滞后的度量指标,你可以将其输入监控系统。这之所以可行,是因为写入以相同的顺序应用于领导者和跟随者,并且每个节点在复制日志中都有一个位置(即它已在本地应用的写入数量)。用领导者的当前位置减去跟随者的当前位置,你就可以测量出复制滞后的量。
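Measuring lag this way is just a subtraction of log positions. A hypothetical Python sketch (real databases expose these positions through their own metrics interfaces):
用这种方式测量滞后,其实就是对日志位置做一次减法。一个假设性的Python示意(真实数据库通过各自的度量接口暴露这些位置):

```python
# Sketch of replication-lag measurement for leader-based replication:
# each node's position is the number of writes it has applied locally,
# so a follower's lag is the leader's position minus the follower's.

def replication_lag(leader_position, follower_position):
    """Number of writes the follower is behind the leader."""
    return leader_position - follower_position

# e.g. the leader has applied 10_000 writes, a follower only 9_700:
lag = replication_lag(10_000, 9_700)   # 300 writes behind
```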
However, in systems with leaderless replication, there is no fixed order in which writes are applied, which makes monitoring more difficult. Moreover, if the database only uses read repair (no anti-entropy), there is no limit to how old a value might be—if a value is only infrequently read, the value returned by a stale replica may be ancient.
然而,在无领导者复制的系统中,写入被应用的顺序并不固定,这使得监控更加困难。此外,如果数据库只使用读修复(而没有反熵过程),那么值的陈旧程度就没有上限——如果一个值很少被读取,某个过时副本返回的值可能已经非常古老。
There has been some research on measuring replica staleness in databases with leaderless replication and predicting the expected percentage of stale reads depending on the parameters n , w , and r [ 48 ]. This is unfortunately not yet common practice, but it would be good to include staleness measurements in the standard set of metrics for databases. Eventual consistency is a deliberately vague guarantee, but for operability it’s important to be able to quantify “eventual.”
已经进行了一些研究,用于在无领导副本复制的数据库中测量副本陈旧度,并根据参数n、w和r [48] 预测期望的陈旧读取百分比。不幸的是,这还不是普遍的做法,但将陈旧度测量包括在数据库的标准指标集中将是很好的。最终一致性是一种故意模糊的保证,但为了操作能力,量化“最终”非常重要。
Sloppy Quorums and Hinted Handoff
Databases with appropriately configured quorums can tolerate the failure of individual nodes without the need for failover. They can also tolerate individual nodes going slow, because requests don’t have to wait for all n nodes to respond—they can return when w or r nodes have responded. These characteristics make databases with leaderless replication appealing for use cases that require high availability and low latency, and that can tolerate occasional stale reads.
配置适当仲裁的数据库可以容忍个别节点的失败,而无需进行故障转移。它们也可以容忍节点的缓慢,因为请求不必等待所有的n个节点响应,它们可以在w或r个节点响应时返回。这些特性使得具有无领导复制的数据库在需要高可用性和低延迟、可以容忍偶尔的陈旧读取的用例中非常有吸引力。
However, quorums (as described so far) are not as fault-tolerant as they could be. A network interruption can easily cut off a client from a large number of database nodes. Although those nodes are alive, and other clients may be able to connect to them, to a client that is cut off from the database nodes, they might as well be dead. In this situation, it’s likely that fewer than w or r reachable nodes remain, so the client can no longer reach a quorum.
然而,(到目前为止所描述的)法定人数并没有达到它们本可以达到的容错程度。一次网络中断就可以轻易地把一个客户端与大量数据库节点切断。尽管那些节点还活着,其他客户端也许还能连上它们,但对于被切断的客户端来说,它们与死掉没什么两样。在这种情况下,可达的节点很可能少于w或r个,因此客户端无法再凑齐法定人数。
In a large cluster (with significantly more than n nodes) it’s likely that the client can connect to some database nodes during the network interruption, just not to the nodes that it needs to assemble a quorum for a particular value. In that case, database designers face a trade-off:
在一个大型集群(节点数显著多于n)中,客户端在网络中断期间很可能仍能连接到某些数据库节点,只是连不上为某个特定值凑齐法定人数所需的那些节点。在这种情况下,数据库设计者面临一个权衡:
-
Is it better to return errors to all requests for which we cannot reach a quorum of w or r nodes?
对于所有无法达到w或r个节点法定人数的请求,直接返回错误是否更好?
-
Or should we accept writes anyway, and write them to some nodes that are reachable but aren’t among the n nodes on which the value usually lives?
还是应该无论如何都接受写入,把它们写到一些可达、但不在该值通常所在的n个节点之列的节点上?
The latter is known as a sloppy quorum [ 37 ]: writes and reads still require w and r successful responses, but those may include nodes that are not among the designated n “home” nodes for a value. By analogy, if you lock yourself out of your house, you may knock on the neighbor’s door and ask whether you may stay on their couch temporarily.
后者被称为松散的法定人数(sloppy quorum)[37]:写入和读取仍然需要w和r个成功响应,但这些响应可以来自不在某个值指定的n个"归属"节点之列的节点。打个比方,如果你把自己锁在了家门外,你可能会去敲邻居的门,问能否暂时借住他们的沙发。
Once the network interruption is fixed, any writes that one node temporarily accepted on behalf of another node are sent to the appropriate “home” nodes. This is called hinted handoff . (Once you find the keys to your house again, your neighbor politely asks you to get off their couch and go home.)
一旦网络中断得到修复,任何由一个节点代表另一个节点临时接受的写入,都会被发送到相应的"归属"节点。这称为提示移交(hinted handoff)。(一旦你重新找到了家门钥匙,邻居就会礼貌地请你离开沙发回家。)
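The sloppy-quorum/hinted-handoff mechanics can be sketched in a few lines. Everything here (class names, the hint format) is invented for illustration and is not how Riak, Cassandra, or Voldemort actually implement it:

松散法定人数与提示移交的机制可以用几行代码勾勒出来。这里的一切(类名、提示的格式)都是为说明而虚构的,并非Riak、Cassandra或Voldemort的实际实现方式:

```python
class Node:
    """A toy storage node for illustrating sloppy quorums and hinted
    handoff; structure and names are made up for this sketch."""
    def __init__(self, name):
        self.name = name
        self.data = {}      # key -> value for keys this node "homes"
        self.hints = []     # (home_node_name, key, value) held for others

    def write(self, key, value, home=None):
        if home is None or home == self.name:
            self.data[key] = value
        else:
            # Sloppy quorum: accept a write on behalf of an unreachable
            # home node, remembering where it ultimately belongs.
            self.hints.append((home, key, value))

    def handoff(self, cluster):
        # Once the partition heals, forward hinted writes to their
        # home nodes and discard the local hints.
        for home, key, value in self.hints:
            cluster[home].write(key, value)
        self.hints.clear()

cluster = {n: Node(n) for n in ("a", "b", "c")}
cluster["c"].write("x", 1, home="a")   # "a" unreachable; "c" stands in
cluster["c"].handoff(cluster)          # network repaired
assert cluster["a"].data["x"] == 1     # the write reached its home node
```

Until `handoff` runs, a read quorum that contacts only the designated home nodes cannot see the value, which is exactly the durability-without-readability caveat described below.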
Sloppy quorums are particularly useful for increasing write availability: as long as any w nodes are available, the database can accept writes. However, this means that even when w + r > n , you cannot be sure to read the latest value for a key, because the latest value may have been temporarily written to some nodes outside of n [ 47 ].
松散的法定人数对于提高写入可用性特别有用:只要有任意w个节点可用,数据库就可以接受写入。然而,这意味着即使满足w + r > n,你也不能确保读到某个键的最新值,因为最新值可能被临时写到了n个节点之外的某些节点上[47]。
Thus, a sloppy quorum actually isn’t a quorum at all in the traditional sense. It’s only an assurance of durability, namely that the data is stored on w nodes somewhere. There is no guarantee that a read of r nodes will see it until the hinted handoff has completed.
因此,松散的法定人数在传统意义上其实根本算不上法定人数。它只是一种持久性保证,即数据被存储在了某处的w个节点上。在提示移交完成之前,无法保证读取r个节点就能看到它。
Sloppy quorums are optional in all common Dynamo implementations. In Riak they are enabled by default, and in Cassandra and Voldemort they are disabled by default [ 46 , 49 , 50 ].
在所有常见的 Dynamo 实现中,松散的法定人数都是可选的。在 Riak 中它默认启用,而在 Cassandra 和 Voldemort 中默认禁用[46, 49, 50]。
Multi-datacenter operation
We previously discussed cross-datacenter replication as a use case for multi-leader replication (see “Multi-Leader Replication” ). Leaderless replication is also suitable for multi-datacenter operation, since it is designed to tolerate conflicting concurrent writes, network interruptions, and latency spikes.
我们之前讨论了跨数据中心复制作为多领袖复制的用例(参见“多首领复制”)。 无领导复制也适用于多数据中心操作,因为它被设计为容忍冲突的并发写入,网络中断和延迟波动。
Cassandra and Voldemort implement their multi-datacenter support within the normal leaderless model: the number of replicas n includes nodes in all datacenters, and in the configuration you can specify how many of the n replicas you want to have in each datacenter. Each write from a client is sent to all replicas, regardless of datacenter, but the client usually only waits for acknowledgment from a quorum of nodes within its local datacenter so that it is unaffected by delays and interruptions on the cross-datacenter link. The higher-latency writes to other datacenters are often configured to happen asynchronously, although there is some flexibility in the configuration [ 50 , 51 ].
Cassandra和Voldemort在常规的无领导模型内实现它们的多数据中心支持:副本数n包括所有数据中心的节点,你可以在配置中指定这n个副本中有多少个位于每个数据中心。客户端的每次写入都会发送到所有副本,不论其属于哪个数据中心,但客户端通常只等待本地数据中心内法定数量节点的确认,因而不受跨数据中心链路上延迟和中断的影响。对其他数据中心的高延迟写入通常被配置为异步执行,不过配置上也有一定的灵活性[50, 51]。
Riak keeps all communication between clients and database nodes local to one datacenter, so n describes the number of replicas within one datacenter. Cross-datacenter replication between database clusters happens asynchronously in the background, in a style that is similar to multi-leader replication [ 52 ].
Riak将客户端与数据库节点之间的所有通信都保持在本地的一个数据中心内,因此n描述的是一个数据中心内的副本数量。数据库集群之间的跨数据中心复制在后台异步进行,其风格类似于多领导者复制[52]。
Detecting Concurrent Writes
Dynamo-style databases allow several clients to concurrently write to the same key, which means that conflicts will occur even if strict quorums are used. The situation is similar to multi-leader replication (see “Handling Write Conflicts” ), although in Dynamo-style databases conflicts can also arise during read repair or hinted handoff.
Dynamo式数据库允许多个客户端同时写入同一关键字,这意味着即使使用严格的仲裁,冲突仍将发生。这种情况类似于多领导者复制(请参阅“处理写冲突”),尽管在Dynamo式数据库中,冲突也可能在读取修复或提示式移交中出现。
The problem is that events may arrive in a different order at different nodes, due to variable network delays and partial failures. For example, Figure 5-12 shows two clients, A and B, simultaneously writing to a key X in a three-node datastore:
问题在于由于不同节点之间的网络延迟和部分故障,事件可能以不同的顺序到达不同的节点。例如,图5-12显示了两个客户端A和B同时向三节点数据存储中的键X写入的情况。
-
Node 1 receives the write from A, but never receives the write from B due to a transient outage.
节点1接收到来自A的写入,但由于短暂的服务中断,始终没有收到来自B的写入。
-
Node 2 first receives the write from A, then the write from B.
节点2首先收到来自A的写入,然后收到来自B的写入。
-
Node 3 first receives the write from B, then the write from A.
节点3先接收到来自B的写入,然后再接收来自A的写入。
If each node simply overwrote the value for a key whenever it received a write request from a client, the nodes would become permanently inconsistent, as shown by the final get request in Figure 5-12 : node 2 thinks that the final value of X is B, whereas the other nodes think that the value is A.
如果每个节点仅仅在收到来自客户端的写请求时覆盖键的值,节点将变得永久不一致,如图5-12中的最后一个get请求所示:节点2认为X的最终值是B,而其他节点则认为值为A。
In order to become eventually consistent, the replicas should converge toward the same value. How do they do that? One might hope that replicated databases would handle this automatically, but unfortunately most implementations are quite poor: if you want to avoid losing data, you—the application developer—need to know a lot about the internals of your database’s conflict handling.
为了实现最终一致性,副本应该收敛到同一值。如何实现这一点?人们可能希望复制的数据库能够自动处理这一点,但不幸的是,大多数实现都相当糟糕:如果您想避免丢失数据,您——应用程序开发者——需要了解很多有关数据库冲突处理的内部知识。
We briefly touched on some techniques for conflict resolution in “Handling Write Conflicts” . Before we wrap up this chapter, let’s explore the issue in a bit more detail.
在“处理写冲突”一章中,我们简要地介绍了一些冲突解决的技巧。在我们结束本章之前,让我们更详细地探讨这个问题。
Last write wins (discarding concurrent writes)
One approach for achieving eventual convergence is to declare that each replica need only store the most “recent” value and allow “older” values to be overwritten and discarded. Then, as long as we have some way of unambiguously determining which write is more “recent,” and every write is eventually copied to every replica, the replicas will eventually converge to the same value.
达到最终收敛的方法之一是声明每个副本只需存储最“新”的值,允许“旧”的值被覆盖和丢弃。然后,只要我们有一种明确确定哪个写入更“新”的方法,并且每个写入最终都复制到每个副本,副本最终将收敛到相同的值。
As indicated by the quotes around “recent,” this idea is actually quite misleading. In the example of Figure 5-12 , neither client knew about the other one when it sent its write requests to the database nodes, so it’s not clear which one happened first. In fact, it doesn’t really make sense to say that either happened “first”: we say the writes are concurrent , so their order is undefined.
正如"最近"一词的引号所暗示的,这个概念其实很有误导性。在图5-12的例子中,两个客户端向数据库节点发送写请求时都不知道对方的存在,因此并不清楚哪一个先发生。事实上,说任何一个"先"发生都没有意义:我们说这些写入是并发的,因此它们的顺序是未定义的。
Even though the writes don’t have a natural ordering, we can force an arbitrary order on them. For example, we can attach a timestamp to each write, pick the biggest timestamp as the most “recent,” and discard any writes with an earlier timestamp. This conflict resolution algorithm, called last write wins (LWW), is the only supported conflict resolution method in Cassandra [ 53 ], and an optional feature in Riak [ 35 ].
尽管这些写入没有自然的顺序,我们仍可以给它们强加一个任意的顺序。例如,我们可以为每个写入附加一个时间戳,把时间戳最大的当作最"近"的,并丢弃时间戳更早的任何写入。这种冲突解决算法被称为最后写入获胜(last write wins, LWW),它是Cassandra中唯一受支持的冲突解决方法[53],也是Riak中的一个可选特性[35]。
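The core of LWW fits in one line: among the (timestamp, value) pairs held by different replicas for a key, keep only the pair with the largest timestamp. A minimal sketch (timestamps here are just integers supplied by the caller):

LWW的核心逻辑一行就能写完:在各副本为某个键持有的(时间戳, 值)对中,只保留时间戳最大的那一对。下面是一个最小化的草图(这里的时间戳只是由调用者提供的整数):

```python
def lww_merge(replicas):
    """Last write wins: given each replica's (timestamp, value) pair for
    a key, keep only the value with the largest timestamp. Concurrent
    writes with smaller timestamps are silently discarded, which is
    exactly the durability cost described in the text."""
    return max(replicas)[1]  # tuples compare by timestamp first

# Two concurrent writes, both acknowledged as successful to their clients;
# only one survives the merge:
assert lww_merge([(1700000002, "B"), (1700000001, "A")]) == "B"
```

Note that the timestamps impose an arbitrary order on writes that were actually concurrent; the loser is dropped without any error being reported.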
LWW achieves the goal of eventual convergence, but at the cost of durability: if there are several concurrent writes to the same key, even if they were all reported as successful to the client (because they were written to w replicas), only one of the writes will survive and the others will be silently discarded. Moreover, LWW may even drop writes that are not concurrent, as we shall discuss in “Timestamps for ordering events” .
LWW实现了最终收敛的目标,但以牺牲持久性为代价:如果同一个键有多个并发写入,即使它们都已向客户端报告成功(因为已写入w个副本),也只有其中一个写入会存活下来,其余的将被静默丢弃。此外,LWW甚至可能丢弃并非并发的写入,我们将在"用于排序事件的时间戳"中讨论这一点。
There are some situations, such as caching, in which lost writes are perhaps acceptable. If losing data is not acceptable, LWW is a poor choice for conflict resolution.
在某些场景(例如缓存)中,丢失写入或许是可以接受的。如果丢失数据不可接受,那么LWW就不是冲突解决的好选择。
The only safe way of using a database with LWW is to ensure that a key is only written once and thereafter treated as immutable, thus avoiding any concurrent updates to the same key. For example, a recommended way of using Cassandra is to use a UUID as the key, thus giving each write operation a unique key [ 53 ].
以LWW方式安全使用数据库的唯一办法,是确保每个键只被写入一次,此后将其视为不可变,从而避免对同一键的任何并发更新。例如,Cassandra推荐的一种使用方式是用UUID作为键,从而让每个写操作都有唯一的键[53]。
The “happens-before” relationship and concurrency
How do we decide whether two operations are concurrent or not? To develop an intuition, let’s look at some examples:
我们如何确定两个操作是否并发?为了发展直觉,让我们来看几个例子:
-
In Figure 5-9 , the two writes are not concurrent: A’s insert happens before B’s increment, because the value incremented by B is the value inserted by A. In other words, B’s operation builds upon A’s operation, so B’s operation must have happened later. We also say that B is causally dependent on A.
在图5-9中,两个写操作不是并发的:A的插入操作发生在B的递增操作之前,因为B递增的值是A插入的值。换句话说,B的操作建立在A的操作上,因此B的操作必须是后发生的。我们还可以说B在因果上依赖于A。
-
On the other hand, the two writes in Figure 5-12 are concurrent: when each client starts the operation, it does not know that another client is also performing an operation on the same key. Thus, there is no causal dependency between the operations.
另一方面,图5-12中的两个写操作是并发的:当每个客户端开始操作时,它并不知道另一个客户端也正在对同一键执行操作。因此,这些操作之间没有因果依赖关系。
An operation A happens before another operation B if B knows about A, or depends on A, or builds upon A in some way. Whether one operation happens before another operation is the key to defining what concurrency means. In fact, we can simply say that two operations are concurrent if neither happens before the other (i.e., neither knows about the other) [ 54 ].
如果操作B知道操作A,或者依赖于A,或者以某种方式建立在A之上,那么A就发生于B之前。一个操作是否发生在另一个操作之前,是定义并发含义的关键。实际上,我们可以简单地说:如果两个操作中没有任何一个发生在另一个之前(即双方都不知道对方),那么它们就是并发的[54]。
Thus, whenever you have two operations A and B, there are three possibilities: either A happened before B, or B happened before A, or A and B are concurrent. What we need is an algorithm to tell us whether two operations are concurrent or not. If one operation happened before another, the later operation should overwrite the earlier operation, but if the operations are concurrent, we have a conflict that needs to be resolved.
因此,每当您有两个操作A和B时,有三种可能性:要么A发生在B之前,要么B发生在A之前,要么A和B是并发的。我们需要的是一种算法来告诉我们两个操作是否是并发的。如果一个操作在另一个操作之前发生,后面的操作应覆盖先前的操作,但如果操作是并发的,我们就有一个需要解决的冲突。
Capturing the happens-before relationship
Let’s look at an algorithm that determines whether two operations are concurrent, or whether one happened before another. To keep things simple, let’s start with a database that has only one replica. Once we have worked out how to do this on a single replica, we can generalize the approach to a leaderless database with multiple replicas.
让我们来看一个算法,确定两个操作是并发的还是一个发生在另一个之前。为了保持简单,让我们从只有一个副本的数据库开始。一旦我们弄清楚了如何在单个副本上执行此操作,我们就可以将其推广到具有多个副本的无领导数据库。
Figure 5-13 shows two clients concurrently adding items to the same shopping cart. (If that example strikes you as too inane, imagine instead two air traffic controllers concurrently adding aircraft to the sector they are tracking.) Initially, the cart is empty. Between them, the clients make five writes to the database:
图5-13展示了两个客户端同时向同一个购物车中添加物品的情境。(如果这个例子听起来太无聊了,那么你可以想象一下两个空管员同时添加他们正在追踪的飞机到区域中。)初始时,购物车是空的。两个客户端总共在数据库上进行了五次写操作。
-
Client 1 adds
milk
to the cart. This is the first write to that key, so the server successfully stores it and assigns it version 1. The server also echoes the value back to the client, along with the version number.客户端1将牛奶添加到购物车中。这是对该键的第一次写入,因此服务器成功存储并分配版本1。服务器还将值回显到客户端,同时附带版本号。
-
Client 2 adds
eggs
to the cart, not knowing that client 1 concurrently addedmilk
(client 2 thought that itseggs
were the only item in the cart). The server assigns version 2 to this write, and storeseggs
andmilk
as two separate values. It then returns both values to the client, along with the version number of 2.客户端2将鸡蛋添加到购物车中,不知道客户端1同时添加了牛奶(客户端2认为它的鸡蛋是购物车中唯一的物品)。服务器将版本2分配给此写入,并将鸡蛋和牛奶存储为两个单独的值。然后将这两个值与版本号2一起返回给客户端。
-
Client 1, oblivious to client 2’s write, wants to add
flour
to the cart, so it thinks the current cart contents should be[milk, flour]
. It sends this value to the server, along with the version number 1 that the server gave client 1 previously. The server can tell from the version number that the write of[milk, flour]
supersedes the prior value of[milk]
but that it is concurrent with[eggs]
. Thus, the server assigns version 3 to[milk, flour]
, overwrites the version 1 value[milk]
, but keeps the version 2 value[eggs]
and returns both remaining values to the client.客户端1不知道客户端2的写入操作,想向购物车添加面粉,因此它认为当前购物车内容应该是[牛奶,面粉]。它将这个值与服务器之前给客户端1的版本号1一起发送到服务器。从版本号可以看出,[牛奶,面粉]的写入操作取代了以前的值[牛奶]但与[鸡蛋]并发。因此,服务器将版本3分配给[牛奶,面粉],覆盖版本1的值[牛奶],但保留版本2的值[鸡蛋]并将两个剩余的值返回给客户端。
-
Meanwhile, client 2 wants to add
ham
to the cart, unaware that client 1 just addedflour
. Client 2 received the two values[milk]
and[eggs]
from the server in the last response, so the client now merges those values and addsham
to form a new value,[eggs, milk, ham]
. It sends that value to the server, along with the previous version number 2. The server detects that version 2 overwrites[eggs]
but is concurrent with[milk, flour]
, so the two remaining values are[milk, flour]
with version 3, and[eggs, milk, ham]
with version 4.与此同时,客户2想要将火腿添加到购物车中,不知道客户1只是添加了面粉。客户2在上次响应中从服务器接收到了两个值[milk]和[eggs],所以客户现在合并这些值并添加火腿来形成一个新值[eggs,milk,ham]。它将该值发送到服务器,以及先前的版本号2。服务器检测到版本2覆盖了[eggs],但与[milk,flour]并发,因此剩下的两个值是[milk,flour]版本3和[eggs,milk,ham]版本4。
-
Finally, client 1 wants to add
bacon
. It previously received[milk, flour]
and[eggs]
from the server at version 3, so it merges those, addsbacon
, and sends the final value[milk, flour, eggs, bacon]
to the server, along with the version number 3. This overwrites[milk, flour]
(note that[eggs]
was already overwritten in the last step) but is concurrent with[eggs, milk, ham]
, so the server keeps those two concurrent values.最终,客户端1想要加入培根。它之前从版本3的服务器上收到了[牛奶,面粉]和[鸡蛋],于是它合并了这些值并添加了培根,最终发送值[milk,flour,eggs,bacon]以及版本号3到服务器。这将覆盖[milk,flour](请注意,[eggs]已经在上一步被覆盖了),但与[eggs,milk,ham]并发,因此服务器保留这两个并发的值。
The dataflow between the operations in Figure 5-13 is illustrated graphically in Figure 5-14 . The arrows indicate which operation happened before which other operation, in the sense that the later operation knew about or depended on the earlier one. In this example, the clients are never fully up to date with the data on the server, since there is always another operation going on concurrently. But old versions of the value do get overwritten eventually, and no writes are lost.
图5-13中操作之间的数据流在图5-14中以图形化的方式呈现。箭头表示哪一个操作发生在另一个操作之前,即后面的操作知道或依赖于前面的操作。在这个例子中,由于总是有另一个操作在同时进行,所以客户端从未完全更新服务器上的数据。但是旧版的值最终会被覆盖,且没有写操作会丢失。
Note that the server can determine whether two operations are concurrent by looking at the version numbers—it does not need to interpret the value itself (so the value could be any data structure). The algorithm works as follows:
请注意,服务器可以通过查看版本号来确定两个操作是否并发,它不需要解释值本身(因此值可以是任何数据结构)。该算法的工作原理如下:
-
The server maintains a version number for every key, increments the version number every time that key is written, and stores the new version number along with the value written.
服务器为每个键维护一个版本号,在每次写入该键时递增版本号,并将新版本号与写入的值一起存储。
-
When a client reads a key, the server returns all values that have not been overwritten, as well as the latest version number. A client must read a key before writing.
当客户端读取一个键时,服务器返回所有未被覆盖的值,以及最新的版本号。客户端必须在写入之前先读取一个键。
-
When a client writes a key, it must include the version number from the prior read, and it must merge together all values that it received in the prior read. (The response from a write request can be like a read, returning all current values, which allows us to chain several writes like in the shopping cart example.)
当客户端写入一个键时,它必须包含先前读取的版本号,并将先前读取的所有值合并在一起。(写请求的响应可以像读取一样返回所有当前值,这使我们可以像购物车示例中那样链接几个写操作。)
-
When the server receives a write with a particular version number, it can overwrite all values with that version number or below (since it knows that they have been merged into the new value), but it must keep all values with a higher version number (because those values are concurrent with the incoming write).
当服务器接收到特定版本号的写操作时,它可以覆盖所有该版本号及以下的值(因为它知道它们已被合并到新值中),但必须保留所有高版本号的值(因为这些值与传入的写操作并发)。
When a write includes the version number from a prior read, that tells us which previous state the write is based on. If you make a write without including a version number, it is concurrent with all other writes, so it will not overwrite anything—it will just be returned as one of the values on subsequent reads.
当写操作包括先前读取的版本号时,这告诉我们写操作是基于哪个先前的状态。如果您进行写操作而不包括版本号,则它与所有其他写操作是并发的,因此它不会覆盖任何内容,它只会在后续读取中作为一个值返回。
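The four rules above can be sketched as a toy single-replica server. The class and method names are invented for this sketch; replaying the first three writes of the shopping-cart example reproduces the siblings described in the text:

上述四条规则可以勾勒为一个玩具式的单副本服务器。类名和方法名都是为本示例虚构的;重放购物车示例的前三次写入,可以重现正文中描述的兄弟值:

```python
class VersionedKey:
    """Sketch of the single-replica causality-tracking algorithm from
    the text: one version counter per key; reads return all sibling
    values plus the latest version; a write supersedes everything at
    or below the version it read, and coexists with anything newer."""
    def __init__(self):
        self.version = 0
        self.values = {}  # version -> value (the current siblings)

    def read(self):
        return self.version, list(self.values.values())

    def write(self, value, based_on=0):
        # based_on is the version number the client read before writing
        # (0 means it read nothing, so it is concurrent with everything).
        self.version += 1
        self.values = {v: val for v, val in self.values.items()
                       if v > based_on}       # keep concurrent siblings
        self.values[self.version] = value     # store the new value
        return self.version, list(self.values.values())

# Replaying the first three writes of the Figure 5-13 example:
key = VersionedKey()
v1, _ = key.write(["milk"])                          # client 1, version 1
v2, _ = key.write(["eggs"])                          # client 2, version 2
v3, sib = key.write(["milk", "flour"], based_on=v1)  # supersedes [milk]
# [milk] is overwritten; [eggs] survives as a concurrent sibling:
assert sorted(map(tuple, sib)) == [("eggs",), ("milk", "flour")]
```

The server never inspects the values themselves; the version numbers alone determine what gets overwritten and what is kept as a sibling.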
Merging concurrently written values
This algorithm ensures that no data is silently dropped, but it unfortunately requires that the clients do some extra work: if several operations happen concurrently, clients have to clean up afterward by merging the concurrently written values. Riak calls these concurrent values siblings .
这个算法确保不会有数据被静默丢弃,但不幸的是,它要求客户端做一些额外工作:如果多个操作并发发生,客户端必须在事后通过合并这些并发写入的值来清理。Riak将这些并发值称为兄弟(siblings)。
Merging sibling values is essentially the same problem as conflict resolution in multi-leader replication, which we discussed previously (see “Handling Write Conflicts” ). A simple approach is to just pick one of the values based on a version number or timestamp (last write wins), but that implies losing data. So, you may need to do something more intelligent in application code.
合并兄弟节点的值本质上与多个领导者复制中的冲突解决问题相同,我们之前讨论过(请参见“处理写入冲突”)。一个简单的方法是基于版本号或时间戳选择一个值(最后一个写入者获胜),但这意味着会丢失数据。因此,在应用程序代码中,您可能需要做一些更智能的处理。
With the example of a shopping cart, a reasonable approach to merging siblings is to just take the
union. In
Figure 5-14
, the two final siblings are
[milk, flour, eggs, bacon]
and
[eggs, milk, ham]
; note that
milk
and
eggs
appear in both, even though they were each only
written once. The merged value might be something like
[milk, flour, eggs, bacon, ham]
, without
duplicates.
在购物车的示例中,合并兄弟节点的合理方法是取并集。在图5-14中,最终的两个兄弟节点是[milk,flour,eggs,bacon]和[eggs,milk,ham];请注意,牛奶和鸡蛋出现在两者中,即使它们仅被写入一次。合并后的值可能类似于[milk,flour,eggs,bacon,ham],不包括重复项。
However, if you want to allow people to also remove things from their carts, and not just add things, then taking the union of siblings may not yield the right result: if you merge two sibling carts and an item has been removed in only one of them, then the removed item will reappear in the union of the siblings [ 37 ]. To prevent this problem, an item cannot simply be deleted from the database when it is removed; instead, the system must leave a marker with an appropriate version number to indicate that the item has been removed when merging siblings. Such a deletion marker is known as a tombstone . (We previously saw tombstones in the context of log compaction in “Hash Indexes” .)
然而,如果你希望允许人们从购物车中删除物品,而不仅仅是添加物品,那么对兄弟取并集可能不会得到正确的结果:如果合并两个兄弟购物车,而某件物品只在其中一个中被删除,那么被删除的物品会重新出现在并集中[37]。为了防止这个问题,物品被移除时不能简单地从数据库中删除;系统必须留下一个带有适当版本号的标记,以便在合并兄弟时表明该物品已被移除。这种删除标记被称为墓碑(tombstone)。(我们之前在"哈希索引"的日志压缩部分见过墓碑。)
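A hedged sketch of union-merge with tombstones: removals are recorded as markers (here simply a "-" prefix, a made-up convention; real systems attach version numbers to tombstones) so that merging siblings does not resurrect deleted items:

下面是带墓碑的并集合并的一个示意性草图:删除被记录为标记(这里简单地用"-"前缀表示,这是虚构的约定;真实系统会给墓碑附上版本号),这样合并兄弟时就不会让已删除的物品复活:

```python
def merge_siblings(siblings):
    """Merge concurrently written cart values by set union, but honour
    deletion markers: an item removed in any sibling stays removed.
    A simplified sketch; real systems version each tombstone so that
    a later re-add can win over an earlier removal."""
    items, tombstones = set(), set()
    for cart in siblings:
        for entry in cart:
            if entry.startswith("-"):        # "-ham" marks a removal
                tombstones.add(entry[1:])
            else:
                items.add(entry)
    return sorted(items - tombstones)

siblings = [["milk", "flour", "eggs", "bacon"],
            ["eggs", "milk", "-ham"]]        # ham was added, then removed
assert merge_siblings(siblings) == ["bacon", "eggs", "flour", "milk"]
```

A plain union of the same siblings would have resurrected `ham`; the tombstone keeps it out of the merged cart.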
As merging siblings in application code is complex and error-prone, there are some efforts to design data structures that can perform this merging automatically, as discussed in “Automatic Conflict Resolution” . For example, Riak’s datatype support uses a family of data structures called CRDTs [ 38 , 39 , 55 ] that can automatically merge siblings in sensible ways, including preserving deletions.
由于在应用程序代码中合并兄弟节点很复杂且容易出错,因此有一些努力设计数据结构,可以自动执行此合并,正如“自动冲突解决”所讨论的那样。例如,Riak 的数据类型支持使用称为 CRDTs 的一系列数据结构 [38,39,55],可以以明智的方式自动合并兄弟节点,包括保留删除操作。
Version vectors
The example in Figure 5-13 used only a single replica. How does the algorithm change when there are multiple replicas, but no leader?
图5-13中的示例只使用了单个副本。当存在多个副本、但没有领导者时,算法要如何改变?
Figure 5-13 uses a single version number to capture dependencies between operations, but that is not sufficient when there are multiple replicas accepting writes concurrently. Instead, we need to use a version number per replica as well as per key. Each replica increments its own version number when processing a write, and also keeps track of the version numbers it has seen from each of the other replicas. This information indicates which values to overwrite and which values to keep as siblings.
图5-13使用一个版本号来捕捉操作之间的依赖关系,但当有多个副本同时接受写入时,这是不够的。相反,我们需要为每个副本和每个键使用一个版本号。每个副本在处理写入时都会增加自己的版本号,并且还会跟踪它从其他副本中看到的版本号。这些信息指示了要覆盖哪些值以及哪些值作为兄弟保留。
The collection of version numbers from all the replicas is called a version vector [ 56 ]. A few variants of this idea are in use, but the most interesting is probably the dotted version vector [ 57 ], which is used in Riak 2.0 [ 58 , 59 ]. We won’t go into the details, but the way it works is quite similar to what we saw in our cart example.
来自所有副本的版本号集合称为版本向量(version vector)[56]。这个思想有几种变体在使用中,其中最有趣的可能是点分版本向量(dotted version vector)[57],它被用于Riak 2.0[58, 59]。我们不会深入细节,但其工作方式与我们在购物车示例中看到的非常相似。
Like the version numbers in Figure 5-13 , version vectors are sent from the database replicas to clients when values are read, and need to be sent back to the database when a value is subsequently written. (Riak encodes the version vector as a string that it calls causal context .) The version vector allows the database to distinguish between overwrites and concurrent writes.
像图5-13中的版本号一样,当值被读取时,版本向量从数据库副本发送到客户端,并在随后写入值时需要发送回数据库。(Riak将版本向量编码为称为因果上下文的字符串。)版本向量允许数据库区分覆盖写和并发写。
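The comparison that lets a database distinguish overwrites from concurrent writes can be sketched as follows (representing a version vector as a plain dict, which is an illustrative choice, not Riak's encoding):

让数据库得以区分覆盖与并发写入的比较操作可以如下勾勒(这里把版本向量表示为普通的字典,这只是示意性的选择,并非Riak的编码方式):

```python
def compare(vv_a, vv_b):
    """Compare two version vectors (dicts mapping replica id -> counter;
    missing entries count as 0). Returns "a<b" if a happened before b,
    "b<a" for the reverse, "equal", or "concurrent"; the last case is
    what forces the database to keep both values as siblings."""
    replicas = set(vv_a) | set(vv_b)
    a_le_b = all(vv_a.get(r, 0) <= vv_b.get(r, 0) for r in replicas)
    b_le_a = all(vv_b.get(r, 0) <= vv_a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a<b"       # b saw everything a saw: b may overwrite a
    if b_le_a:
        return "b<a"
    return "concurrent"    # neither saw the other: keep both as siblings

assert compare({"r1": 2, "r2": 0}, {"r1": 2, "r2": 1}) == "a<b"
assert compare({"r1": 3}, {"r2": 1}) == "concurrent"
```

This is a partial order: unlike the single counter of Figure 5-13, two vectors can be incomparable, which is precisely how concurrency across replicas is detected.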
Also, like in the single-replica example, the application may need to merge siblings. The version vector structure ensures that it is safe to read from one replica and subsequently write back to another replica. Doing so may result in siblings being created, but no data is lost as long as siblings are merged correctly.
同样地,就像在单副本示例中一样,应用程序可能需要合并兄弟节点。版本向量结构确保从一个副本中读取并随后写回另一个副本是安全的。这样做可能会导致创建兄弟节点,但只要正确合并兄弟节点,就不会丢失任何数据。
Version vectors and vector clocks
A version vector is sometimes also called a vector clock , even though they are not quite the same. The difference is subtle—please see the references for details [ 57 , 60 , 61 ]. In brief, when comparing the state of replicas, version vectors are the right data structure to use.
版本向量有时也被称为向量时钟(vector clock),尽管两者并不完全相同。其中的区别很微妙,详情请参阅参考文献[57, 60, 61]。简而言之,在比较副本的状态时,版本向量才是正确的数据结构。
Summary
In this chapter we looked at the issue of replication. Replication can serve several purposes:
在本章中,我们讨论了复制的问题。复制可以有几个目的:
- High availability
-
Keeping the system running, even when one machine (or several machines, or an entire datacenter) goes down
保持系统运行,即使一个机器(或多个机器,或整个数据中心)挂掉。
- Disconnected operation
-
Allowing an application to continue working when there is a network interruption
当网络中断时,允许应用程序继续工作。
- Latency
-
Placing data geographically close to users, so that users can interact with it faster
将数据放置在用户附近,以便用户能够更快地与之交互。
- Scalability
-
Being able to handle a higher volume of reads than a single machine could handle, by performing reads on replicas
通过在副本上执行读取操作,能够处理比单台机器更高量的读取。
Despite being a simple goal—keeping a copy of the same data on several machines—replication turns out to be a remarkably tricky problem. It requires carefully thinking about concurrency and about all the things that can go wrong, and dealing with the consequences of those faults. At a minimum, we need to deal with unavailable nodes and network interruptions (and that’s not even considering the more insidious kinds of fault, such as silent data corruption due to software bugs).
尽管复制的目标很简单,即在几台机器上保存相同数据的副本,它却被证明是一个极其棘手的问题。它需要仔细考虑并发以及所有可能出错的环节,并处理这些故障的后果。至少,我们需要应对不可用的节点和网络中断(这还没有考虑更隐蔽的故障类型,例如软件缺陷导致的静默数据损坏)。
We discussed three main approaches to replication:
我们讨论了三种主要的复制方法:
- Single-leader replication
-
Clients send all writes to a single node (the leader), which sends a stream of data change events to the other replicas (followers). Reads can be performed on any replica, but reads from followers might be stale.
客户端将所有写入发送到单个节点(领导者),领导者再把数据变更事件流发送给其他副本(跟随者)。读取可以在任何副本上执行,但从跟随者读取的数据可能是陈旧的。
- Multi-leader replication
-
Clients send each write to one of several leader nodes, any of which can accept writes. The leaders send streams of data change events to each other and to any follower nodes.
客户端将每个写入发送到几个领导者节点之一,其中任何一个领导者都可以接受写入。领导者之间相互发送数据变更事件流,并发送给任何跟随者节点。
- Leaderless replication
-
Clients send each write to several nodes, and read from several nodes in parallel in order to detect and correct nodes with stale data.
客户端将每个写操作发送到多个节点,并从多个节点并行读取,以便检测和纠正具有过时数据的节点。
Each approach has advantages and disadvantages. Single-leader replication is popular because it is fairly easy to understand and there is no conflict resolution to worry about. Multi-leader and leaderless replication can be more robust in the presence of faulty nodes, network interruptions, and latency spikes—at the cost of being harder to reason about and providing only very weak consistency guarantees.
每种方法各有优缺点。单领导者复制很流行,因为它相当容易理解,而且不需要担心冲突解决。在出现故障节点、网络中断和延迟尖峰时,多领导者和无领导者复制可以更加健壮,其代价是更难推理,并且只能提供非常弱的一致性保证。
Replication can be synchronous or asynchronous, which has a profound effect on the system behavior when there is a fault. Although asynchronous replication can be fast when the system is running smoothly, it’s important to figure out what happens when replication lag increases and servers fail. If a leader fails and you promote an asynchronously updated follower to be the new leader, recently committed data may be lost.
复制可以是同步或异步的,这对系统的行为有深远影响,特别是在出现故障时。尽管当系统运行顺畅时,异步复制可以很快,但重要的是要弄清楚当复制延迟增加和服务器故障时会发生什么。如果领导者失败并且您提升一个异步更新的追随者成为新领导者,则可能会丢失最近提交的数据。
We looked at some strange effects that can be caused by replication lag, and we discussed a few consistency models which are helpful for deciding how an application should behave under replication lag:
我们考察了复制延迟可能引起的一些奇怪效应,并讨论了几种一致性模型,它们有助于决定应用在复制延迟下应如何表现:
- Read-after-write consistency
-
Users should always see data that they submitted themselves.
用户应始终看到他们自己提交的数据。
- Monotonic reads
-
After users have seen the data at one point in time, they shouldn’t later see the data from some earlier point in time.
用户在某个时间点看到了数据后,不应该再看到一些早期时间点的数据。
- Consistent prefix reads
-
Users should see the data in a state that makes causal sense: for example, seeing a question and its reply in the correct order.
用户看到的数据应处于具有因果意义的状态:例如,按正确的顺序看到一个问题及其回答。
Finally, we discussed the concurrency issues that are inherent in multi-leader and leaderless replication approaches: because they allow multiple writes to happen concurrently, conflicts may occur. We examined an algorithm that a database might use to determine whether one operation happened before another, or whether they happened concurrently. We also touched on methods for resolving conflicts by merging together concurrent updates.
最后,我们讨论了多领导者和无领导者复制方法固有的并发问题:因为它们允许多个写入同时发生,可能会发生冲突。我们研究了一个数据库可能使用的算法,以确定一项操作是在另一项操作之前发生,还是它们同时发生。我们还提及了通过合并同时更新来解决冲突的方法。
In the next chapter we will continue looking at data that is distributed across multiple machines, through the counterpart of replication: splitting a large dataset into partitions .
在下一章中,我们将通过复制的对应物,继续研究分布在多台机器上的数据:把大型数据集拆分为多个分区(partition)。
Footnotes
i Different people have different definitions for hot , warm , and cold standby servers. In PostgreSQL, for example, hot standby is used to refer to a replica that accepts reads from clients, whereas a warm standby processes changes from the leader but doesn’t process any queries from clients. For purposes of this book, the difference isn’t important.
不同的人对热备、温备和冷备服务器有不同的定义。例如在PostgreSQL中,热备指的是接受客户端读取的副本,而温备只处理来自领导者的变更,但不处理客户端的任何查询。就本书而言,这一区别并不重要。
ii This approach is known as fencing or, more emphatically, Shoot The Other Node In The Head (STONITH). We will discuss fencing in more detail in “The leader and the lock” .
这种方法被称为围栏(fencing),或者更强硬的说法是"爆另一个节点的头"(Shoot The Other Node In The Head, STONITH)。我们将在"领导者与锁"中更详细地讨论围栏。
iii The term eventual consistency was coined by Douglas Terry et al. [ 24 ], popularized by Werner Vogels [ 22 ], and became the battle cry of many NoSQL projects. However, not only NoSQL databases are eventually consistent: followers in an asynchronously replicated relational database have the same characteristics.
"最终一致性"这一术语由Douglas Terry等人首创[24],经Werner Vogels推广[22],并成为许多NoSQL项目的战斗口号。然而,并非只有NoSQL数据库是最终一致的:异步复制的关系型数据库中的跟随者也具有相同的特性。
iv If the database is partitioned (see Chapter 6 ), each partition has one leader. Different partitions may have their leaders on different nodes, but each partition must nevertheless have one leader node.
如果数据库被分区(见第6章),每个分区都有一个领导者。不同分区的领导者可能位于不同节点上,但每个分区仍然必须有一个领导者节点。
v Not to be confused with a star schema (see “Stars and Snowflakes: Schemas for Analytics” ), which describes the structure of a data model, not the communication topology between nodes.
不要与星型模式混淆(参见“星型和雪花型:分析模式”),星型模式描述的是数据模型的结构,而不是节点之间的通信拓扑结构。
vi Dynamo is not available to users outside of Amazon. Confusingly, AWS offers a hosted database product called DynamoDB , which uses a completely different architecture: it is based on single-leader replication.
Dynamo不对亚马逊以外的用户开放。令人困惑的是,AWS提供了一个名为DynamoDB的托管数据库产品,它采用完全不同的架构:基于单领导者复制。
vii Sometimes this kind of quorum is called a strict quorum , to contrast with sloppy quorums (discussed in “Sloppy Quorums and Hinted Handoff” ).
有时这种法定人数被称为严格的法定人数(strict quorum),以与松散的法定人数相对照(在"松散的法定人数与提示移交"中讨论)。
References
[ 1 ] Bruce G. Lindsay, Patricia Griffiths Selinger, C. Galtieri, et al.: “ Notes on Distributed Databases ,” IBM Research, Research Report RJ2571(33471), July 1979.
[1] 布鲁斯·林赛,帕特里夏·格里菲斯·塞林格,C. 卡尔蒂埃里等人: “分布式数据库注记”,IBM 研究,研究报告 RJ2571(33471),1979年7月。
[ 2 ] “ Oracle Active Data Guard Real-Time Data Protection and Availability ,” Oracle White Paper, June 2013.
[2] “Oracle Active Data Guard实时数据保护和可用性”,Oracle白皮书,2013年6月。
[ 3 ] “ AlwaysOn Availability Groups ,” in SQL Server Books Online , Microsoft, 2012.
[3] "AlwaysOn可用性组",载于SQL Server Books Online,Microsoft,2012年。
[ 4 ] Lin Qiao, Kapil Surlaker, Shirshanka Das, et al.: “ On Brewing Fresh Espresso: LinkedIn’s Distributed Data Serving Platform ,” at ACM International Conference on Management of Data (SIGMOD), June 2013.
[4] Lin Qiao、Kapil Surlaker、Shirshanka Das等人: “关于制作新鲜的浓缩咖啡:LinkedIn的分布式数据服务平台”,于2013年6月ACM国际数据管理会议(SIGMOD)发表。
[ 5 ] Jun Rao: “ Intra-Cluster Replication for Apache Kafka ,” at ApacheCon North America , February 2013.
[5] 饶俊: “Apache Kafka的集群内复制”,在2013年2月的ApacheCon北美大会上。
[ 6 ] “ Highly Available Queues ,” in RabbitMQ Server Documentation , Pivotal Software, Inc., 2014.
[6] “高度可用的队列”,RabbitMQ服务器文档,Pivotal Software, Inc., 2014。
[ 7 ] Yoshinori Matsunobu: “ Semi-Synchronous Replication at Facebook ,” yoshinorimatsunobu.blogspot.co.uk , April 1, 2014.
[7] Yoshinori Matsunobu: “Facebook的半同步复制”,yoshinorimatsunobu.blogspot.co.uk,2014年4月1日。
[ 8 ] Robbert van Renesse and Fred B. Schneider: “ Chain Replication for Supporting High Throughput and Availability ,” at 6th USENIX Symposium on Operating System Design and Implementation (OSDI), December 2004.
[8] Robbert van Renesse 和 Fred B. Schneider: “链式复制以支持高吞吐量和可用性”,于第六届USENIX操作系统设计与实现研讨会(OSDI)上,于2004年12月发布。
[ 9 ] Jeff Terrace and Michael J. Freedman: “ Object Storage on CRAQ: High-Throughput Chain Replication for Read-Mostly Workloads ,” at USENIX Annual Technical Conference (ATC), June 2009.
[9] Jeff Terrace和Michael J. Freedman:“CRAQ上的对象存储:面向以读为主工作负载的高吞吐量链式复制”,发表于2009年6月USENIX年度技术会议(ATC)。
[ 10 ] Brad Calder, Ju Wang, Aaron Ogus, et al.: “ Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency ,” at 23rd ACM Symposium on Operating Systems Principles (SOSP), October 2011.
[10] Brad Calder、Ju Wang、Aaron Ogus等人:“Windows Azure存储:具有强一致性的高可用云存储服务”,发表于2011年10月第23届ACM操作系统原理研讨会(SOSP)。
[ 11 ] Andrew Wang: “ Windows Azure Storage ,” umbrant.com , February 4, 2016.
[11] Andrew Wang:“Windows Azure Storage”,umbrant.com,2016年2月4日。
[ 12 ] “ Percona Xtrabackup - Documentation ,” Percona LLC, 2014.
[12] "Percona Xtrabackup - 文档," Percona LLC,2014年。
[ 13 ] Jesse Newland: “ GitHub Availability This Week ,” github.com , September 14, 2012.
[13] Jesse Newland:“GitHub本周可用性”,github.com,2012年9月14日。
[ 14 ] Mark Imbriaco: “ Downtime Last Saturday ,” github.com , December 26, 2012.
[14] Mark Imbriaco:“上周六的停机时间”,github.com,2012年12月26日。
[ 15 ] John Hugg: “ ‘All in’ with Determinism for Performance and Testing in Distributed Systems ,” at Strange Loop , September 2015.
[15] 约翰·哈格(John Hugg):“在分布式系统的性能与测试中实现确定性”,于 2015 年 9 月在 Strange Loop 上演讲。
[ 16 ] Amit Kapila: “ WAL Internals of PostgreSQL ,” at PostgreSQL Conference (PGCon), May 2012.
[16] Amit Kapila:2012年5月在PostgreSQL Conference(PGCon)上发表的“PostgreSQL WAL的内部机制”。
[ 17 ] MySQL Internals Manual . Oracle, 2014.
[17] MySQL 内部手册。Oracle,2014年。
[ 18 ] Yogeshwer Sharma, Philippe Ajoux, Petchean Ang, et al.: “ Wormhole: Reliable Pub-Sub to Support Geo-Replicated Internet Services ,” at 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI), May 2015.
[18] Yogeshwer Sharma、Philippe Ajoux、Petchean Ang等人:“Wormhole:支持地理复制互联网服务的可靠发布-订阅系统”,发表于2015年5月第12届USENIX网络系统设计与实现研讨会(NSDI)。
[ 19 ] “ Oracle GoldenGate 12c: Real-Time Access to Real-Time Information ,” Oracle White Paper, October 2013.
[19] “Oracle GoldenGate 12c:实时获取实时信息,”Oracle白皮书,2013年10月。
[ 20 ] Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “ All Aboard the Databus! ,” at ACM Symposium on Cloud Computing (SoCC), October 2012.
"[20] Shirshanka Das, Chavdar Botev, Kapil Surlaker等人: “全员上车数据总线!”,于2012年10月在ACM云计算研讨会(SoCC)上发表。"
[ 21 ] Greg Sabino Mullane: “ Version 5 of Bucardo Database Replication System ,” blog.endpoint.com , June 23, 2014.
[21] Greg Sabino Mullane: “Bucardo数据库复制系统的版本5”,blog.endpoint.com,2014年6月23日。
[ 22 ] Werner Vogels: “ Eventually Consistent ,” ACM Queue , volume 6, number 6, pages 14–19, October 2008. doi:10.1145/1466443.1466448
[22] Werner Vogels:“最终一致性”,ACM Queue,第6卷,第6期,第14-19页,2008年10月。doi:10.1145/1466443.1466448
[ 23 ] Douglas B. Terry: “ Replicated Data Consistency Explained Through Baseball ,” Microsoft Research, Technical Report MSR-TR-2011-137, October 2011.
[23] Douglas B. Terry:“通过棒球解释复制数据一致性”,微软研究,技术报告MSR-TR-2011-137,2011年10月。
[ 24 ] Douglas B. Terry, Alan J. Demers, Karin Petersen, et al.: “ Session Guarantees for Weakly Consistent Replicated Data ,” at 3rd International Conference on Parallel and Distributed Information Systems (PDIS), September 1994. doi:10.1109/PDIS.1994.331722
[24] Douglas B. Terry, Alan J. Demers, Karin Petersen等人: “弱一致性复制数据的会话保证”,发表于1994年9月第3届国际并行与分布式信息系统会议(PDIS)。 doi:10.1109/PDIS.1994.331722。
[ 25 ] Terry Pratchett: Reaper Man: A Discworld Novel . Victor Gollancz, 1991. ISBN: 978-0-575-04979-6
[25] 特里·普拉切特(Terry Pratchett):Reaper Man(碟形世界系列小说)。Victor Gollancz,1991年。ISBN:978-0-575-04979-6。
[ 26 ] “ Tungsten Replicator ,” Continuent, Inc., 2014.
[26] “钨复制器”,Continuent公司,2014年。
[ 27 ] “ BDR 0.10.0 Documentation ,” The PostgreSQL Global Development Group, bdr-project.org , 2015.
[27] “BDR 0.10.0文档”,The PostgreSQL Global Development Group,bdr-project.org,2015年。
[ 28 ] Robert Hodges: “ If You *Must* Deploy Multi-Master Replication, Read This First ,” scale-out-blog.blogspot.co.uk , March 30, 2012.
[28] 罗伯特·霍奇斯: “如果您必须部署多主复制,请先阅读本文”,scale-out-blog.blogspot.co.uk,2012年3月30日。
[ 29 ] J. Chris Anderson, Jan Lehnardt, and Noah Slater: CouchDB: The Definitive Guide . O’Reilly Media, 2010. ISBN: 978-0-596-15589-6
[29] J. Chris Anderson,Jan Lehnardt和Noah Slater:CouchDB:权威指南。O'Reilly Media,2010年。 ISBN:978-0-596-15589-6。
[ 30 ] AppJet, Inc.: “ Etherpad and EasySync Technical Manual ,” github.com , March 26, 2011.
[30] AppJet, Inc.:“Etherpad和EasySync技术手册”,github.com,2011年3月26日。
[ 31 ] John Day-Richter: “ What’s Different About the New Google Docs: Making Collaboration Fast ,” googledrive.blogspot.com , 23 September 2010.
[31] 约翰·戴-里希特: “新版Google文档有何不同:加快协作速度”,googledrive.blogspot.com,2010年9月23日。
[ 32 ] Martin Kleppmann and Alastair R. Beresford: “ A Conflict-Free Replicated JSON Datatype ,” arXiv:1608.03960, August 13, 2016.
[32] Martin Kleppmann和Alastair R. Beresford: “无冲突复制JSON数据类型,” arXiv:1608.03960,2016年8月13日。
[ 33 ] Frazer Clement: “ Eventual Consistency – Detecting Conflicts ,” messagepassing.blogspot.co.uk , October 20, 2011.
"最终一致性–检测冲突",Frazer Clement,messagepassing.blogspot.co.uk,2011年10月20日。"
[ 34 ] Robert Hodges: “ State of the Art for MySQL Multi-Master Replication ,” at Percona Live: MySQL Conference & Expo , April 2013.
[34] Robert Hodges: “MySQL 多主复制的现状”,于 Percona Live:MySQL 会议和展览会,2013 年 4 月。
[ 35 ] John Daily: “ Clocks Are Bad, or, Welcome to the Wonderful World of Distributed Systems ,” basho.com , November 12, 2013.
[35] 约翰·戴利: “时钟是不好的,或者说, 欢迎来到分布式系统的美好世界,” basho.com,2013年11月12日。
[ 36 ] Riley Berton: “ Is Bi-Directional Replication (BDR) in Postgres Transactional? ,” sdf.org , January 4, 2016.
[36] Riley Berton:“Postgres的双向复制(BDR)是事务性的吗?”,sdf.org,2016年1月4日。
[ 37 ] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al.: “ Dynamo: Amazon’s Highly Available Key-Value Store ,” at 21st ACM Symposium on Operating Systems Principles (SOSP), October 2007.
[37] Giuseppe DeCandia、Deniz Hastorun、Madan Jampani等人:“Dynamo:亚马逊的高可用键值存储”,发表于2007年10月第21届ACM操作系统原理研讨会(SOSP)。
[ 38 ] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski: “ A Comprehensive Study of Convergent and Commutative Replicated Data Types ,” INRIA Research Report no. 7506, January 2011.
[38] Marc Shapiro、Nuno Preguiça、Carlos Baquero和Marek Zawirski:“《收敛和交换复制数据类型的全面研究》”,INRIA研究报告7506,2011年1月。
[ 39 ] Sam Elliott: “ CRDTs: An UPDATE (or Maybe Just a PUT) ,” at RICON West , October 2013.
[39] 萨姆·艾略特: “CRDTs: 更新(或 或许只是PUT)”,于2013年10月RICON West演讲。
[ 40 ] Russell Brown: “ A Bluffers Guide to CRDTs in Riak ,” gist.github.com , October 28, 2013.
[40] Russell Brown:“Riak中CRDT的唬人指南(A Bluffers Guide)”,gist.github.com,2013年10月28日。
[ 41 ] Benjamin Farinier, Thomas Gazagnaire, and Anil Madhavapeddy: “ Mergeable Persistent Data Structures ,” at 26es Journées Francophones des Langages Applicatifs (JFLA), January 2015.
[41] Benjamin Farinier, Thomas Gazagnaire, and Anil Madhavapeddy:“可合并的持久化数据结构”,发表于2015年1月的第26届法语应用语言研讨会(JFLA)。
[ 42 ] Chengzheng Sun and Clarence Ellis: “ Operational Transformation in Real-Time Group Editors: Issues, Algorithms, and Achievements ,” at ACM Conference on Computer Supported Cooperative Work (CSCW), November 1998.
[42] 孙成政和克拉伦斯·艾利斯:“实时群组编辑器中的操作转换:问题、算法和成就”,1998年11月,ACM计算机支持的合作工作会议(CSCW)。
[ 43 ] Lars Hofhansl: “ HBASE-7709: Infinite Loop Possible in Master/Master Replication ,” issues.apache.org , January 29, 2013.
[43] Lars Hofhansl:“HBASE-7709:主/主复制中可能存在无限循环”,issues.apache.org,2013年1月29日。
[ 44 ] David K. Gifford: “ Weighted Voting for Replicated Data ,” at 7th ACM Symposium on Operating Systems Principles (SOSP), December 1979. doi:10.1145/800215.806583
[44] David K. Gifford: “复制数据的加权投票”, 于1979年12月第7届ACM操作系统原理研讨会(SOSP)上发表。 doi:10.1145/800215.806583。
[ 45 ] Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman: “ Flexible Paxos: Quorum Intersection Revisited ,” arXiv:1608.06696 , August 24, 2016.
[45] Heidi Howard,Dahlia Malkhi和Alexander Spiegelman:“灵活的Paxos: 重访仲裁交集”,arXiv:1608.06696,2016年8月24日。
[ 46 ] Joseph Blomstedt: “ Re: Absolute Consistency ,” email to riak-users mailing list, lists.basho.com , January 11, 2012.
[46] Joseph Blomstedt: “关于:绝对一致性”,发件人为riak-users邮件列表的电子邮件,lists.basho.com,2012年1月11日。
[ 47 ] Joseph Blomstedt: “ Bringing Consistency to Riak ,” at RICON West , October 2012.
[47] Joseph Blomstedt:2012年10月,在RICON West上,“将一致性带给Riak”。
[ 48 ] Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, et al.: “ Quantifying Eventual Consistency with PBS ,” Communications of the ACM , volume 57, number 8, pages 93–102, August 2014. doi:10.1145/2632792
[48] Peter Bailis、Shivaram Venkataraman、Michael J. Franklin等人:“用PBS量化最终一致性”,Communications of the ACM,第57卷,第8期,第93-102页,2014年8月。doi:10.1145/2632792
[ 49 ] Jonathan Ellis: “ Modern Hinted Handoff ,” datastax.com , December 11, 2012.
[49] Jonathan Ellis:“现代提示移交(Modern Hinted Handoff)”,datastax.com,2012年12月11日。
[ 50 ] “ Project Voldemort Wiki ,” github.com , 2013.
[50] “Project Voldemort Wiki”,github.com,2013年。
[ 51 ] “ Apache Cassandra 2.0 Documentation ,” DataStax, Inc., 2014.
[51] “Apache Cassandra 2.0文档”,DataStax, Inc.,2014年。
[ 52 ] “ Riak Enterprise: Multi-Datacenter Replication .” Technical whitepaper, Basho Technologies, Inc., September 2014.
[52] “Riak企业版:多数据中心复制。” 技术白皮书,Basho Technologies, Inc.,2014年9月。
[ 53 ] Jonathan Ellis: “ Why Cassandra Doesn’t Need Vector Clocks ,” datastax.com , September 2, 2013.
[53] Jonathan Ellis: "为什么Cassandra不需要向量时钟",datastax.com,2013年9月2日。
[ 54 ] Leslie Lamport: “ Time, Clocks, and the Ordering of Events in a Distributed System ,” Communications of the ACM , volume 21, number 7, pages 558–565, July 1978. doi:10.1145/359545.359563
[54] Leslie Lamport:“时间、时钟与分布式系统中事件的排序”,Communications of the ACM,第21卷,第7期,第558-565页,1978年7月。doi:10.1145/359545.359563
[ 55 ] Joel Jacobson: “ Riak 2.0: Data Types ,” blog.joeljacobson.com , March 23, 2014.
[55] Joel Jacobson:《Riak 2.0: 数据类型》,blog.joeljacobson.com,2014年3月23日。
[ 56 ] D. Stott Parker Jr., Gerald J. Popek, Gerard Rudisin, et al.: “ Detection of Mutual Inconsistency in Distributed Systems ,” IEEE Transactions on Software Engineering , volume 9, number 3, pages 240–247, May 1983. doi:10.1109/TSE.1983.236733
[56] D. Stott Parker Jr.、Gerald J. Popek、Gerard Rudisin等人:“分布式系统中相互不一致性的检测”,《IEEE软件工程汇刊》,第9卷,第3期,第240-247页,1983年5月。doi:10.1109/TSE.1983.236733
[ 57 ] Nuno Preguiça, Carlos Baquero, Paulo Sérgio Almeida, et al.: “ Dotted Version Vectors: Logical Clocks for Optimistic Replication ,” arXiv:1011.5808, November 26, 2010.
[57] Nuno Preguiça、Carlos Baquero、Paulo Sérgio Almeida等人:“点状版本向量:乐观复制的逻辑时钟”,arXiv:1011.5808,2010年11月26日。
[ 58 ] Sean Cribbs: “ A Brief History of Time in Riak ,” at RICON , October 2014.
[58] Sean Cribbs:“Riak中的时间简史”,于2014年10月在RICON上发表。
[ 59 ] Russell Brown: “ Vector Clocks Revisited Part 2: Dotted Version Vectors ,” basho.com , November 10, 2015.
[59] Russell Brown:“向量时钟再探(第二部分):带点版本向量”,basho.com,2015年11月10日。
[ 60 ] Carlos Baquero: “ Version Vectors Are Not Vector Clocks ,” haslab.wordpress.com , July 8, 2011.
[60] Carlos Baquero:“版本向量不是向量时钟”,haslab.wordpress.com,2011年7月8日。
[ 61 ] Reinhard Schwarz and Friedemann Mattern: “ Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail ,” Distributed Computing , volume 7, number 3, pages 149–174, March 1994. doi:10.1007/BF02277859
[61] Reinhard Schwarz 和 Friedemann Mattern: “在分布式计算中检测因果关系:寻找圣杯”,《分布式计算》杂志,1994 年 3 月,卷 7,号 3,第 149-174 页。doi:10.1007/BF02277859
Chapter 6. Partitioning
Clearly, we must break away from the sequential and not limit the computers. We must state definitions and provide for priorities and descriptions of data. We must state relationships, not procedures.
显然,我们必须摆脱顺序式思维,而不应限制计算机。我们必须给出定义,规定数据的优先级和描述。我们必须陈述关系,而不是过程。
Grace Murray Hopper, Management and the Computer of the Future (1962)
格雷斯·默里·霍珀,《管理和未来的计算机》(1962年)。
In Chapter 5 we discussed replication—that is, having multiple copies of the same data on different nodes. For very large datasets, or very high query throughput, that is not sufficient: we need to break the data up into partitions , also known as sharding . i
第5章讨论了复制——也就是在不同节点上有相同数据的多个副本。对于非常大的数据集,或非常高的查询吞吐量来说,这还不足以满足需求:我们需要将数据分成分区,也被称为分片。
Terminological confusion
What we call a partition here is called a shard in MongoDB, Elasticsearch, and SolrCloud; it’s known as a region in HBase, a tablet in Bigtable, a vnode in Cassandra and Riak, and a vBucket in Couchbase. However, partitioning is the most established term, so we’ll stick with that.
在这里我们所称的“分区”(partition),在MongoDB、Elasticsearch和SolrCloud中被称为分片(shard);在HBase中被称为区域(region),在Bigtable中被称为tablet,在Cassandra和Riak中被称为虚拟节点(vnode),在Couchbase中被称为vBucket。不过,“分区”是最为确立的术语,所以我们将沿用它。
Normally, partitions are defined in such a way that each piece of data (each record, row, or document) belongs to exactly one partition. There are various ways of achieving this, which we discuss in depth in this chapter. In effect, each partition is a small database of its own, although the database may support operations that touch multiple partitions at the same time.
通常情况下,分区是按照每个数据片段(每个记录,行或文档)恰好属于一个分区的方式来定义。有多种方法可以实现这一点,本章节我们将深入探讨。实际上,每个分区本身都是一个小型数据库,尽管数据库可能支持同时涉及多个分区的操作。
The main reason for wanting to partition data is scalability . Different partitions can be placed on different nodes in a shared-nothing cluster (see the introduction to Part II for a definition of shared nothing ). Thus, a large dataset can be distributed across many disks, and the query load can be distributed across many processors.
对数据进行分区的主要原因是可扩展性。不同的分区可以放在无共享(shared-nothing)集群的不同节点上(无共享的定义参见第二部分的介绍)。这样,大型数据集可以分布在许多磁盘上,查询负载也可以分布到许多处理器上。
For queries that operate on a single partition, each node can independently execute the queries for its own partition, so query throughput can be scaled by adding more nodes. Large, complex queries can potentially be parallelized across many nodes, although this gets significantly harder.
针对仅操作单个分区的查询,每个节点可以独立执行其自己分区的查询,因此可以通过添加更多节点来扩展查询吞吐量。尽管这变得更加困难,但大型复杂查询可能会在许多节点上并行化。
Partitioned databases were pioneered in the 1980s by products such as Teradata and Tandem NonStop SQL [ 1 ], and more recently rediscovered by NoSQL databases and Hadoop-based data warehouses. Some systems are designed for transactional workloads, and others for analytics (see “Transaction Processing or Analytics?” ): this difference affects how the system is tuned, but the fundamentals of partitioning apply to both kinds of workloads.
分区数据库最早在20世纪80年代由Teradata和Tandem NonStop SQL等产品开创[1],最近又被NoSQL数据库和基于Hadoop的数据仓库重新发现。一些系统适用于事务工作负载,而另一些则适用于分析(参见“事务处理还是分析?”):这种差异会影响系统的调整,但分区的基础知识适用于两种类型的工作负载。
In this chapter we will first look at different approaches for partitioning large datasets and observe how the indexing of data interacts with partitioning. We’ll then talk about rebalancing, which is necessary if you want to add or remove nodes in your cluster. Finally, we’ll get an overview of how databases route requests to the right partitions and execute queries.
在本章节中,我们首先将探讨对于大型数据集的不同分区方法,并观察数据索引与分区之间的交互作用。然后我们将谈论重新平衡,如果您想在群集中添加或删除节点,则必须进行重新平衡。最后,我们将概述数据库如何将请求路由到正确的分区并执行查询。
Partitioning and Replication
Partitioning is usually combined with replication so that copies of each partition are stored on multiple nodes. This means that, even though each record belongs to exactly one partition, it may still be stored on several different nodes for fault tolerance.
通常,分区会与复制相结合,以便将每个分区的副本存储在多个节点上。这意味着,即使每个记录仅属于一个分区,它仍可能存储在多个不同的节点上以实现容错。
A node may store more than one partition. If a leader–follower replication model is used, the combination of partitioning and replication can look like Figure 6-1 . Each partition’s leader is assigned to one node, and its followers are assigned to other nodes. Each node may be the leader for some partitions and a follower for other partitions.
一个节点可以存储多个分区。如果使用领导者-追随者复制模型,分区和复制的组合可能会看起来像图6-1所示的那样。每个分区的领导者分配给一个节点,其追随者分配给其他节点。每个节点可以是某些分区的领导者,也可以是其他分区的追随者。
Everything we discussed in Chapter 5 about replication of databases applies equally to replication of partitions. The choice of partitioning scheme is mostly independent of the choice of replication scheme, so we will keep things simple and ignore replication in this chapter.
我们在第5章中讨论的有关数据库复制的所有内容同样适用于分区复制。分区方案的选择大多与复制方案无关,因此我们将保持简单,忽略本章中的复制。
Partitioning of Key-Value Data
Say you have a large amount of data, and you want to partition it. How do you decide which records to store on which nodes?
如果您有大量的数据需要进行分区,您将如何决定将哪些记录存储在哪些节点上?
Our goal with partitioning is to spread the data and the query load evenly across nodes. If every node takes a fair share, then—in theory—10 nodes should be able to handle 10 times as much data and 10 times the read and write throughput of a single node (ignoring replication for now).
我们分区的目标是将数据和查询负载均匀地分散到各个节点上。如果每个节点都承担公平的份额,那么理论上,10个节点应该能够处理单个节点的10倍数据量和10倍读写吞吐量(暂不考虑复制)。
If the partitioning is unfair, so that some partitions have more data or queries than others, we call it skewed . The presence of skew makes partitioning much less effective. In an extreme case, all the load could end up on one partition, so 9 out of 10 nodes are idle and your bottleneck is the single busy node. A partition with disproportionately high load is called a hot spot .
如果分区不公平,某些分区的数据或查询比其他分区多,我们称之为倾斜。存在倾斜会使分区变得不太有效。在极端情况下,所有负载可能都集中在一个分区上,因此有9个节点空闲,而您的瓶颈是单个繁忙节点。负载不成比例的分区称为热点。
The simplest approach for avoiding hot spots would be to assign records to nodes randomly. That would distribute the data quite evenly across the nodes, but it has a big disadvantage: when you’re trying to read a particular item, you have no way of knowing which node it is on, so you have to query all nodes in parallel.
避免热点的最简单方法是将记录随机分配到节点上。这将在节点之间相当均匀地分布数据,但它有一个很大的缺点:当您尝试读取特定项目时,您无法知道它在哪个节点上,因此必须并行查询所有节点。
We can do better. Let’s assume for now that you have a simple key-value data model, in which you always access a record by its primary key. For example, in an old-fashioned paper encyclopedia, you look up an entry by its title; since all the entries are alphabetically sorted by title, you can quickly find the one you’re looking for.
我们可以做得更好。现在假设你有一个简单的键值数据模型,其中你总是根据主键访问记录。例如,在旧式纸质百科全书中,你通过标题查找条目;由于所有条目都按标题字母顺序排序,因此你可以迅速找到需要的条目。
Partitioning by Key Range
One way of partitioning is to assign a continuous range of keys (from some minimum to some maximum) to each partition, like the volumes of a paper encyclopedia ( Figure 6-2 ). If you know the boundaries between the ranges, you can easily determine which partition contains a given key. If you also know which partition is assigned to which node, then you can make your request directly to the appropriate node (or, in the case of the encyclopedia, pick the correct book off the shelf).
一种分区方式是为每个分区指定一段连续的键范围(从某个最小值到某个最大值),就像纸质百科全书的分卷一样(图6-2)。如果知道各范围之间的边界,就可以轻松确定哪个分区包含给定的键。如果你还知道哪个分区分配给了哪个节点,就可以直接向相应的节点发出请求(或者,就百科全书而言,直接从书架上取下正确的那一卷)。
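The boundary lookup described above can be sketched in a few lines of Python (the boundary values here are hypothetical, not taken from the book's figure): given the sorted lower-bound key of each partition, a binary search finds the partition responsible for any key.

```python
import bisect

# Hypothetical lower-bound keys: partition i holds keys in
# [boundaries[i], boundaries[i+1]), with the last partition open-ended.
boundaries = ["a", "c", "f", "m", "t"]

def partition_for_key(key: str) -> int:
    """Return the index of the partition whose key range contains `key`."""
    idx = bisect.bisect_right(boundaries, key) - 1
    if idx < 0:
        raise KeyError(f"key {key!r} sorts before the first partition boundary")
    return idx

print(partition_for_key("banana"))  # partition 0 covers ["a", "c")
print(partition_for_key("orange"))  # partition 3 covers ["m", "t")
```

With such a boundary table, routing a read or write costs only a logarithmic-time search, which is why range-partitioned systems can afford many partitions per node.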
The ranges of keys are not necessarily evenly spaced, because your data may not be evenly distributed. For example, in Figure 6-2 , volume 1 contains words starting with A and B, but volume 12 contains words starting with T, U, V, X, Y, and Z. Simply having one volume per two letters of the alphabet would lead to some volumes being much bigger than others. In order to distribute the data evenly, the partition boundaries need to adapt to the data.
键的范围不一定是均匀分布的,因为你的数据本身可能就分布不均。例如,在图6-2中,第1卷包含以A和B开头的单词,而第12卷包含以T、U、V、X、Y和Z开头的单词。如果简单地让字母表中每两个字母对应一卷,会导致某些卷比其他卷大得多。为了均匀分布数据,分区边界需要适应数据本身的分布。
The partition boundaries might be chosen manually by an administrator, or the database can choose them automatically (we will discuss choices of partition boundaries in more detail in “Rebalancing Partitions” ). This partitioning strategy is used by Bigtable, its open source equivalent HBase [ 2 , 3 ], RethinkDB, and MongoDB before version 2.4 [ 4 ].
分区边界可以由管理员手动选择,也可以被数据库自动选择(我们将在“重新平衡分区”的更多细节中讨论分区边界的选择)。这种分区策略被Bigtable、它的开源版本HBase [2, 3]、RethinkDB和MongoDB 2.4版本之前使用[4]。
Within each partition, we can keep keys in sorted order (see “SSTables and LSM-Trees” ). This has the advantage that range scans are easy, and you can treat the key as a concatenated index in order to fetch several related records in one query (see “Multi-column indexes” ). For example, consider an application that stores data from a network of sensors, where the key is the timestamp of the measurement ( year-month-day-hour-minute-second ). Range scans are very useful in this case, because they let you easily fetch, say, all the readings from a particular month.
在每个分区内,我们可以按排序顺序保留键(参见“SSTables 和 LSM-Trees”)。这样可以轻松进行范围扫描,并且您可以将键视为连接的索引,以便在一个查询中获取几个相关记录(参见“多列索引”)。例如,考虑一个存储来自传感器网络数据的应用程序,其中键是测量的时间戳(年-月-日-小时-分钟-秒)。在这种情况下,范围扫描非常有用,因为它们可以让您轻松获取一个特定月份的所有读数。
However, the downside of key range partitioning is that certain access patterns can lead to hot spots. If the key is a timestamp, then the partitions correspond to ranges of time—e.g., one partition per day. Unfortunately, because we write data from the sensors to the database as the measurements happen, all the writes end up going to the same partition (the one for today), so that partition can be overloaded with writes while others sit idle [ 5 ].
然而,键范围分区的缺点在于某些访问模式可能会导致热点。如果键是时间戳,则分区对应于时间范围,例如每天一个分区。不幸的是,由于我们在测量发生时就把传感器数据写入数据库,所有写入最终都会进入同一个分区(即今天的分区),因此该分区可能因写入而过载,而其他分区则处于空闲状态 [5]。
To avoid this problem in the sensor database, you need to use something other than the timestamp as the first element of the key. For example, you could prefix each timestamp with the sensor name so that the partitioning is first by sensor name and then by time. Assuming you have many sensors active at the same time, the write load will end up more evenly spread across the partitions. Now, when you want to fetch the values of multiple sensors within a time range, you need to perform a separate range query for each sensor name.
为了避免传感器数据库中出现这个问题,你需要使用不同于时间戳作为键的第一个元素的东西。例如,你可以给每个时间戳添加传感器名称前缀,以便按传感器名称和时间进行分区。假设你有许多传感器同时活动,写入负载将更均匀地分布在分区中。现在,当你想要获取某个时间范围内多个传感器的值时,你需要为每个传感器名称执行一个单独的范围查询。
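As a sketch of this idea (all class and sensor names below are invented for illustration), the toy store partitions by a stable hash of the sensor name alone, so each sensor's readings land on one partition and stay sorted by timestamp there:

```python
import zlib

class SensorStore:
    """Toy store keyed by (sensor_name, timestamp) compound keys."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions = [{} for _ in range(num_partitions)]
        self.num_partitions = num_partitions

    def _partition(self, sensor: str) -> int:
        # Only the sensor name determines placement; crc32 is stable
        # across processes, unlike Python's built-in hash() for strings.
        return zlib.crc32(sensor.encode("utf-8")) % self.num_partitions

    def write(self, sensor: str, ts: int, value: float) -> None:
        self.partitions[self._partition(sensor)][(sensor, ts)] = value

    def range_query(self, sensor: str, start: int, end: int):
        """All readings for one sensor in [start, end), sorted by time."""
        part = self.partitions[self._partition(sensor)]
        return sorted((ts, v) for (s, ts), v in part.items()
                      if s == sensor and start <= ts < end)

store = SensorStore(num_partitions=4)
store.write("sensor-a", 100, 21.5)
store.write("sensor-a", 101, 21.7)
store.write("sensor-b", 100, 19.0)
print(store.range_query("sensor-a", 100, 102))  # [(100, 21.5), (101, 21.7)]
```

Note the trade-off the text describes: a query spanning many sensors would now have to issue one such range query per sensor name.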
Partitioning by Hash of Key
Because of this risk of skew and hot spots, many distributed datastores use a hash function to determine the partition for a given key.
由于存在偏斜和热点的风险,许多分布式数据存储系统使用散列函数来确定给定键的分区。
A good hash function takes skewed data and makes it uniformly distributed. Say you have a 32-bit hash function that takes a string. Whenever you give it a new string, it returns a seemingly random number between 0 and 2^32 − 1. Even if the input strings are very similar, their hashes are evenly distributed across that range of numbers.
一个好的哈希函数可以将倾斜的数据变得均匀分布。假设你有一个接收字符串的32位哈希函数。每当给它一个新的字符串,它都会返回一个介于 0 和 2^32 − 1 之间、看似随机的数字。即使输入的字符串非常相似,它们的哈希值也会均匀分布在这个数字范围内。
For partitioning purposes, the hash function need not be cryptographically strong: for example, Cassandra and MongoDB use MD5, and Voldemort uses the Fowler–Noll–Vo function. Many programming languages have simple hash functions built in (as they are used for hash tables), but they may not be suitable for partitioning: for example, in Java’s Object.hashCode() and Ruby’s Object#hash, the same key may have a different hash value in different processes [6].
为了分区目的,哈希函数不需要具有密码学强度:例如,Cassandra和MongoDB使用MD5,Voldemort使用Fowler-Noll-Vo函数。许多编程语言都内置了简单的哈希函数(因为它们用于哈希表),但它们可能不适合分区:例如,在Java的Object.hashCode()和Ruby的Object#hash中,同一个键在不同的进程中可能具有不同的哈希值 [6]。
Once you have a suitable hash function for keys, you can assign each partition a range of hashes (rather than a range of keys), and every key whose hash falls within a partition’s range will be stored in that partition. This is illustrated in Figure 6-3 .
一旦你为键准备了一个合适的哈希函数,就可以将每个分区分配一个哈希范围(而不是一个键的范围),每个哈希落在分区范围内的键都将存储在该分区中。如图6-3所示。
This technique is good at distributing keys fairly among the partitions. The partition boundaries can be evenly spaced, or they can be chosen pseudorandomly (in which case the technique is sometimes known as consistent hashing ).
该技术能够在分区间公平分配密钥。分区边界可以均匀间隔,也可以伪随机选择(此时该技术有时被称为一致性哈希)。
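A minimal sketch of hash-range assignment (the partition count and the choice of MD5 are illustrative): the key is hashed with a function that is stable across processes, and the hash value is matched against evenly spaced boundaries in the hash space.

```python
import bisect
import hashlib

NUM_PARTITIONS = 4
HASH_SPACE = 2**32

# Evenly spaced boundaries: partition i covers hashes in
# [i * HASH_SPACE // NUM_PARTITIONS, (i + 1) * HASH_SPACE // NUM_PARTITIONS).
boundaries = [i * HASH_SPACE // NUM_PARTITIONS for i in range(1, NUM_PARTITIONS)]

def key_hash(key: str) -> int:
    # MD5 truncated to 32 bits: cryptographic strength is unnecessary here,
    # but stability across processes (unlike Java's Object.hashCode()) is.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")

def partition_for(key: str) -> int:
    """Find which hash range the key's hash falls into."""
    return bisect.bisect_right(boundaries, key_hash(key))

# The same key always lands on the same partition, in any process:
for key in ["apple", "apricot", "avocado"]:
    print(key, "->", partition_for(key))
```

Even though "apple", "apricot", and "avocado" are adjacent in key order, their hashes scatter them across partitions, which is exactly the loss of range-query locality the next paragraph discusses.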
Unfortunately however, by using the hash of the key for partitioning we lose a nice property of key-range partitioning: the ability to do efficient range queries. Keys that were once adjacent are now scattered across all the partitions, so their sort order is lost. In MongoDB, if you have enabled hash-based sharding mode, any range query has to be sent to all partitions [ 4 ]. Range queries on the primary key are not supported by Riak [ 9 ], Couchbase [ 10 ], or Voldemort.
但是,通过使用密钥的哈希进行分区,我们失去了一种良好的键范围分区特性:进行高效的范围查询。曾经相邻的键现在分散在所有分区中,因此它们的排序顺序被打乱了。在MongoDB中,如果启用了基于哈希的分片模式,则任何范围查询都必须发送到所有分区。Riak [9],Couchbase [10]或Voldemort不支持基于主键的范围查询。
Cassandra achieves a compromise between the two partitioning strategies [ 11 , 12 , 13 ]. A table in Cassandra can be declared with a compound primary key consisting of several columns. Only the first part of that key is hashed to determine the partition, but the other columns are used as a concatenated index for sorting the data in Cassandra’s SSTables. A query therefore cannot search for a range of values within the first column of a compound key, but if it specifies a fixed value for the first column, it can perform an efficient range scan over the other columns of the key.
Cassandra在这两种分区策略之间达成了折衷 [11, 12, 13]。Cassandra中的表可以声明为由多个列组成的复合主键。只有该键的第一部分会被哈希以确定分区,其他列则用作连接索引,用于在Cassandra的SSTables中对数据排序。因此,查询不能在复合键的第一列上搜索某个取值范围,但如果它为第一列指定了固定值,就可以对键的其他列执行高效的范围扫描。
The concatenated index approach enables an elegant data model for one-to-many relationships. For example, on a social media site, one user may post many updates. If the primary key for updates is chosen to be (user_id, update_timestamp), then you can efficiently retrieve all updates made by a particular user within some time interval, sorted by timestamp. Different users may be stored on different partitions, but within each user, the updates are stored ordered by timestamp on a single partition.
连接索引方法可以为一对多的关系提供优雅的数据模型。例如,在社交媒体站点上,一个用户可以发布许多更新。如果更新的主键选择为(用户ID、更新时间戳),则可以有效地检索某个时间间隔内某个特定用户所作的所有更新,按时间戳排序。不同的用户可以存储在不同的分区中,但在每个用户内,更新以时间戳在单个分区中排序存储。
Skewed Workloads and Relieving Hot Spots
As discussed, hashing a key to determine its partition can help reduce hot spots. However, it can’t avoid them entirely: in the extreme case where all reads and writes are for the same key, you still end up with all requests being routed to the same partition.
如上所述,通过对键进行哈希以确定其分区可以帮助减少热点。但是,它无法完全避免热点问题:在所有读写都针对同一个密钥的极端情况下,仍然需要将所有请求路由到同一个分区。
This kind of workload is perhaps unusual, but not unheard of: for example, on a social media site, a celebrity user with millions of followers may cause a storm of activity when they do something [ 14 ]. This event can result in a large volume of writes to the same key (where the key is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on). Hashing the key doesn’t help, as the hash of two identical IDs is still the same.
这种工作负载也许不太常见,但并非闻所未闻:例如,在社交媒体网站上,一个拥有数百万粉丝的名人用户在做某件事时可能会引发一场活动风暴 [14]。这一事件可能导致对同一个键的大量写入(这个键可能是名人的用户ID,或者人们正在评论的那个动作的ID)。对键进行哈希并没有帮助,因为两个相同ID的哈希值仍然相同。
Today, most data systems are not able to automatically compensate for such a highly skewed workload, so it’s the responsibility of the application to reduce the skew. For example, if one key is known to be very hot, a simple technique is to add a random number to the beginning or end of the key. Just a two-digit decimal random number would split the writes to the key evenly across 100 different keys, allowing those keys to be distributed to different partitions.
今天,大多数数据系统不能自动补偿如此高度偏斜的工作量,因此应用程序需要减少偏斜的责任。例如,如果一个键被认为是非常“热门”的,则一个简单的技巧是在键的开头或结尾添加一个随机数。只需一个两位数的十进制随机数,就可以将写入键的操作均匀分布到100个不同的键中,使这些键分布到不同的分区。
However, having split the writes across different keys, any reads now have to do additional work, as they have to read the data from all 100 keys and combine it. This technique also requires additional bookkeeping: it only makes sense to append the random number for the small number of hot keys; for the vast majority of keys with low write throughput this would be unnecessary overhead. Thus, you also need some way of keeping track of which keys are being split.
然而,将写操作分散到不同的键中后,任何读取操作现在都需要进行额外的工作,因为它们必须从所有100个键中读取数据并合并它们。这种技术还需要额外的记账:仅对小量的热键添加随机数才有意义;而对于大多数写吞吐量较低的键来说,这将是不必要的开销。因此,您还需要一些方式来跟踪哪些键正在被拆分。
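A sketch of the random-suffix technique just described (the key names are invented): writes to a known-hot key are spread across 100 sub-keys, at the cost of reads having to fetch and merge all of them.

```python
import random

NUM_SPLITS = 100  # a two-digit decimal suffix, as described above

def write_key(hot_key: str) -> str:
    """Pick one of 100 sub-keys at random for each write."""
    return f"{hot_key}:{random.randrange(NUM_SPLITS):02d}"

def read_keys(hot_key: str) -> list:
    """A read must fetch every sub-key and combine the results."""
    return [f"{hot_key}:{i:02d}" for i in range(NUM_SPLITS)]

print(write_key("celebrity-7"))       # e.g. "celebrity-7:38" (random suffix)
print(len(read_keys("celebrity-7")))  # 100 keys to gather on every read
```

This is where the bookkeeping the text mentions comes in: a real system would apply `write_key` only to keys on a tracked hot list, and `read_keys` only when the requested key is on that list.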
Perhaps in the future, data systems will be able to automatically detect and compensate for skewed workloads; but for now, you need to think through the trade-offs for your own application.
也许在未来,数据系统能够自动检测并补偿不平衡的工作负载;但现在,你需要为你自己的应用程序权衡利弊。
Partitioning and Secondary Indexes
The partitioning schemes we have discussed so far rely on a key-value data model. If records are only ever accessed via their primary key, we can determine the partition from that key and use it to route read and write requests to the partition responsible for that key.
到目前为止,我们讨论的分区方案都依赖于键值数据模型。如果记录只通过其主键访问,我们可以从该键确定分区,并使用它将读写请求路由到负责该键的分区。
The situation becomes more complicated if secondary indexes are involved (see also “Other Indexing Structures”). A secondary index usually doesn’t identify a record uniquely but rather is a way of searching for occurrences of a particular value: find all actions by user 123, find all articles containing the word hogwash, find all cars whose color is red, and so on.
如果涉及到次要索引(另请参见“其他索引结构”),情况会变得更加复杂。次要索引通常不能唯一地标识记录,而是一种搜索特定值出现的方式:查找用户123执行的所有操作、查找包含“胡言乱语”一词的所有文章、查找颜色为红色的所有汽车等等。
Secondary indexes are the bread and butter of relational databases, and they are common in document databases too. Many key-value stores (such as HBase and Voldemort) have avoided secondary indexes because of their added implementation complexity, but some (such as Riak) have started adding them because they are so useful for data modeling. And finally, secondary indexes are the raison d’être of search servers such as Solr and Elasticsearch.
次要索引是关系型数据库的基础,也在文档数据库中很常见。许多键值存储(例如HBase和Voldemort)由于增加了实现的复杂性而避免了二级索引,但一些(例如Riak)已经开始添加它们,因为它们对于数据建模非常有用。最后,次要索引是搜索服务器(例如Solr和Elasticsearch)存在的原因。
The problem with secondary indexes is that they don’t map neatly to partitions. There are two main approaches to partitioning a database with secondary indexes: document-based partitioning and term-based partitioning.
次要索引的问题在于它们无法很好地映射到分区。为对具备次要索引的数据库进行分区,有两种主要方法:基于文档的分区和基于词项的分区。
Partitioning Secondary Indexes by Document
For example, imagine you are operating a website for selling used cars (illustrated in Figure 6-4 ). Each listing has a unique ID—call it the document ID —and you partition the database by the document ID (for example, IDs 0 to 499 in partition 0, IDs 500 to 999 in partition 1, etc.).
例如,想象你经营一个出售二手车的网站(如图6-4所示)。每个车辆列表都有一个唯一的ID,称为文档ID,你按照文档ID将数据库进行分区(例如,0到499的ID在分区0中,500到999的ID在分区1中等等)。
You want to let users search for cars, allowing them to filter by color and by make, so you need a secondary index on color and make (in a document database these would be fields; in a relational database they would be columns). If you have declared the index, the database can perform the indexing automatically. ii For example, whenever a red car is added to the database, the database partition automatically adds it to the list of document IDs for the index entry color:red.
你想让用户搜索汽车,并允许他们按颜色和制造商进行筛选,因此你需要在颜色(color)和制造商(make)上建立二级索引(在文档数据库中这些是字段;在关系数据库中这些是列)。如果已经声明了索引,数据库就可以自动进行索引维护。例如,每当一辆红色汽车被添加到数据库时,数据库分区会自动将它添加到索引条目 color:red 的文档ID列表中。
In this indexing approach, each partition is completely separate: each partition maintains its own secondary indexes, covering only the documents in that partition. It doesn’t care what data is stored in other partitions. Whenever you need to write to the database—to add, remove, or update a document—you only need to deal with the partition that contains the document ID that you are writing. For that reason, a document-partitioned index is also known as a local index (as opposed to a global index , described in the next section).
在这种索引方法中,每个分区都是完全独立的:每个分区都维护着自己的二级索引,仅覆盖该分区中的文档。它不关心其他分区中存储的数据。每当您需要写入数据库 - 添加、删除或更新文档 - 您只需要处理包含要写入的文档ID的分区。因此,基于文档的索引也被称为本地索引(与下一节中描述的全局索引相对)。
However, reading from a document-partitioned index requires care: unless you have done something special with the document IDs, there is no reason why all the cars with a particular color or a particular make would be in the same partition. In Figure 6-4 , red cars appear in both partition 0 and partition 1. Thus, if you want to search for red cars, you need to send the query to all partitions, and combine all the results you get back.
然而,从文档分区索引中读取需要谨慎:除非您对文档ID进行了特殊处理,否则没有理由让所有特定颜色或制造商的汽车都在同一个分区中。在图6-4中,红色汽车出现在分区0和分区1中。因此,如果您想搜索红色汽车,您需要将查询发送到所有分区,并组合所有收到的结果。
This approach to querying a partitioned database is sometimes known as scatter/gather , and it can make read queries on secondary indexes quite expensive. Even if you query the partitions in parallel, scatter/gather is prone to tail latency amplification (see “Percentiles in Practice” ). Nevertheless, it is widely used: MongoDB, Riak [ 15 ], Cassandra [ 16 ], Elasticsearch [ 17 ], SolrCloud [ 18 ], and VoltDB [ 19 ] all use document-partitioned secondary indexes. Most database vendors recommend that you structure your partitioning scheme so that secondary index queries can be served from a single partition, but that is not always possible, especially when you’re using multiple secondary indexes in a single query (such as filtering cars by color and by make at the same time).
这种查询分区数据库的方法有时被称为scatter/gather,可以使二级索引的读取查询变得非常昂贵。即使你并行查询分区,scatter/gather也容易造成尾部延迟的加剧(见“实践中的百分位数”)。尽管如此,它被广泛使用:MongoDB、Riak [15]、Cassandra [16]、Elasticsearch [17]、SolrCloud[18]和VoltDB [19]都使用了基于文档分区的二级索引。大多数数据库供应商建议你构建分区方案,以便辅助索引查询可以从单个分区中提供服务,但这并不总是可能的,特别是当你在单个查询中使用多个辅助索引(例如同时按颜色和制造商筛选汽车)时。
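The read side of a document-partitioned index, scatter/gather, can be sketched as follows. The local indexes here are hand-written stand-ins (with invented document IDs) for the structures a real database would maintain, and the partitions are queried sequentially for simplicity, whereas a real system would query them in parallel:

```python
# Scatter/gather over document-partitioned (local) secondary indexes:
# each partition only knows its own documents, so a secondary-index
# query must visit every partition and combine the answers.

local_indexes = [
    {"color:red": [191, 306], "make:Ford": [306]},     # partition 0
    {"color:red": [768], "color:silver": [515]},       # partition 1
]

def search(entry):
    results = []
    for index in local_indexes:          # scatter: ask every partition
        results.extend(index.get(entry, []))
    return sorted(results)               # gather: combine the results
```

Even though partition 1 contributes only a single result for color:red, the query still has to be sent to it, which is why the slowest partition dominates the response time (tail latency amplification).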
Partitioning Secondary Indexes by Term
Rather than each partition having its own secondary index (a local index ), we can construct a global index that covers data in all partitions. However, we can’t just store that index on one node, since it would likely become a bottleneck and defeat the purpose of partitioning. A global index must also be partitioned, but it can be partitioned differently from the primary key index.
我们不应该让每个分区都拥有自己的二级索引,而是可以构建一个覆盖所有分区数据的全局索引。然而,我们不能把该索引仅存储在一个节点上,因为这样很可能会成为瓶颈,从而使分区失去意义。全局索引必须被分区,但是它的分区方式可以与主键索引不同。
Figure 6-5 illustrates what this could look like: red cars from all partitions appear under color:red in the index, but the index is partitioned so that colors starting with the letters a to r appear in partition 0 and colors starting with s to z appear in partition 1. The index on the make of car is partitioned similarly (with the partition boundary being between f and h).
图6-5说明了这个可能的样子:来自所有分区的红色汽车在索引中出现在颜色:红色下,但索引被分区,使得以字母a到r开头的颜色出现在分区0中,以s到z开头的颜色出现在分区1中。汽车制造商的索引也以类似的方式分区(分区边界在f和h之间)。
We call this kind of index term-partitioned, because the term we’re looking for determines the partition of the index. Here, a term would be color:red, for example. The name term comes from full-text indexes (a particular kind of secondary index), where the terms are all the words that occur in a document.
我们将这种索引称为术语分区(term-partitioned)索引,因为我们要查找的术语决定了索引的分区。例如,这里的一个术语可以是 color:red。术语(term)这个名称来自全文索引(一种特殊的二级索引),其中术语是文档中出现的所有单词。
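A rough sketch of the term-partitioned write path. It simplifies Figure 6-5 by using a single alphabetic boundary (at the letter r) for both fields; the routing function and the data are illustrative, not any database's actual scheme:

```python
# Term partitioning: the term itself (not the document ID) determines
# which index partition holds the entry.

def index_partition_for(term):
    value = term.split(":", 1)[1]        # e.g. "color:red" -> "red"
    return 0 if value[0] <= "r" else 1   # a-r -> partition 0, s-z -> partition 1

global_index = [{}, {}]                  # one dict per index partition

def index_document(doc_id, doc):
    # A write to ONE document may touch SEVERAL index partitions.
    for field in ("color", "make"):
        term = f"{field}:{doc[field]}"
        p = index_partition_for(term)
        global_index[p].setdefault(term, []).append(doc_id)

index_document(191, {"color": "red", "make": "honda"})
index_document(768, {"color": "silver", "make": "volvo"})
```

Note that indexing document 768 touches index partition 1 for both terms, while document 191 touches partition 0; in general the index partitions affected by one document write may live on different nodes, which is exactly why writes to a global index are slower and more complicated.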
As before, we can partition the index by the term itself, or using a hash of the term. Partitioning by the term itself can be useful for range scans (e.g., on a numeric property, such as the asking price of the car), whereas partitioning on a hash of the term gives a more even distribution of load.
和之前一样,我们可以按术语本身或术语的哈希值来对索引分区。按术语本身分区对范围扫描很有用(例如针对数值属性,如汽车的要价),而按术语的哈希值分区则能更均匀地分配负载。
The advantage of a global (term-partitioned) index over a document-partitioned index is that it can make reads more efficient: rather than doing scatter/gather over all partitions, a client only needs to make a request to the partition containing the term that it wants. However, the downside of a global index is that writes are slower and more complicated, because a write to a single document may now affect multiple partitions of the index (every term in the document might be on a different partition, on a different node).
全局(术语分区)索引相对于文档分区索引的优点是,它可以使读取更高效:客户端只需向包含所需术语的分区发出请求,而不必对所有分区进行 scatter/gather。然而,全局索引的缺点是写入更慢也更复杂,因为对单个文档的写入现在可能影响索引的多个分区(文档中的每个术语可能位于不同节点上的不同分区)。
In an ideal world, the index would always be up to date, and every document written to the database would immediately be reflected in the index. However, in a term-partitioned index, that would require a distributed transaction across all partitions affected by a write, which is not supported in all databases (see Chapter 7 and Chapter 9 ).
在理想的世界中,索引应始终保持最新状态,每个写入数据库的文档都应立即反映在索引中。然而,在术语分区索引中,这将需要跨所有受写入影响的分区执行分布式事务,而这在所有数据库中都不受支持(请参阅第7章和第9章)。
In practice, updates to global secondary indexes are often asynchronous (that is, if you read the index shortly after a write, the change you just made may not yet be reflected in the index). For example, Amazon DynamoDB states that its global secondary indexes are updated within a fraction of a second in normal circumstances, but may experience longer propagation delays in cases of faults in the infrastructure [ 20 ].
实际上,全局二级索引的更新通常是异步的(也就是说,如果你在写操作之后立刻读取索引,你刚刚做出的更改可能还没有在索引中反映出来)。例如,亚马逊DynamoDB指出,其全局二级索引在正常情况下在几分之一秒内进行更新,但在基础设施故障的情况下可能会出现较长的传播延迟[20]。
Other uses of global term-partitioned indexes include Riak’s search feature [ 21 ] and the Oracle data warehouse, which lets you choose between local and global indexing [ 22 ]. We will return to the topic of implementing term-partitioned secondary indexes in Chapter 12 .
全局术语分区索引的其他用途包括Riak的搜索功能[21]以及Oracle数据仓库,它允许您在本地和全局索引之间进行选择[22]。我们将在第12章中回到实现术语分区二级索引的主题上。
Rebalancing Partitions
Over time, things change in a database:
随着时间的推移,数据库会发生各种变化:
-
The query throughput increases, so you want to add more CPUs to handle the load.
查询吞吐量增加,因此您想要增加更多的CPU来处理负载。
-
The dataset size increases, so you want to add more disks and RAM to store it.
数据集大小增加,所以你想要添加更多的硬盘和内存来存储它。
-
A machine fails, and other machines need to take over the failed machine’s responsibilities.
一台机器出现故障,其他机器需要接手故障机器的职责。
All of these changes call for data and requests to be moved from one node to another. The process of moving load from one node in the cluster to another is called rebalancing .
所有这些变化都需要将数据和请求从一个节点转移到另一个节点。将集群中的负载从一个节点移动到另一个节点的过程称为重新平衡。
No matter which partitioning scheme is used, rebalancing is usually expected to meet some minimum requirements:
无论使用哪种分区方案,重新平衡通常应满足一些最低要求:
-
After rebalancing, the load (data storage, read and write requests) should be shared fairly between the nodes in the cluster.
重新平衡后,集群中的节点应该公平共享负载(数据存储、读写请求)。
-
While rebalancing is happening, the database should continue accepting reads and writes.
在重新平衡过程中,数据库应该继续接受读和写操作。
-
No more data than necessary should be moved between nodes, to make rebalancing fast and to minimize the network and disk I/O load.
应该只传输必要的数据来完成重新平衡,以使重平衡快速,并尽量减少网络和磁盘I/O负载。
Strategies for Rebalancing
There are a few different ways of assigning partitions to nodes [ 23 ]. Let’s briefly discuss each in turn.
分配分区给节点有几种不同的方法[23]。让我们按顺序简要讨论每种方法。
How not to do it: hash mod N
When partitioning by the hash of a key, we said earlier ( Figure 6-3 ) that it’s best to divide the possible hashes into ranges and assign each range to a partition (e.g., assign key to partition 0 if 0 ≤ hash ( key ) < b 0 , to partition 1 if b 0 ≤ hash ( key ) < b 1 , etc.).
当按键的哈希值进行分区时,我们先前提到(如 Figure 6-3)最好将可行的哈希值分为范围,并将每个范围分配给一个分区(例如,如果 0 ≤ hash(key) < b0,则将键分配给分区 0,如果 b0≤hash(key)<b1,则将键分配给分区 1,以此类推)。
Perhaps you wondered why we don’t just use mod (the % operator in many programming languages). For example, hash(key) mod 10 would return a number between 0 and 9 (if we write the hash as a decimal number, the hash mod 10 would be the last digit). If we have 10 nodes, numbered 0 to 9, that seems like an easy way of assigning each key to a node.
也许你会想知道为什么我们不直接使用 mod(在许多编程语言中的 % 操作符)。例如,hash(key)mod 10 将返回 0 到 9 之间的数字(如果我们将哈希写为十进制数字,则哈希 mod 10 将是最后一位数字)。如果我们有 10 个节点,编号从 0 到 9,那么这似乎是一种将每个键分配到节点的简单方法。
The problem with the mod N approach is that if the number of nodes N changes, most of the keys will need to be moved from one node to another. For example, say hash ( key ) = 123456. If you initially have 10 nodes, that key starts out on node 6 (because 123456 mod 10 = 6). When you grow to 11 nodes, the key needs to move to node 3 (123456 mod 11 = 3), and when you grow to 12 nodes, it needs to move to node 0 (123456 mod 12 = 0). Such frequent moves make rebalancing excessively expensive.
mod N的问题在于,如果节点数N发生变化,大多数键将需要从一个节点移动到另一个节点。例如,假设哈希(key) = 123456。如果最初有10个节点,该键从节点6开始(因为123456 mod 10 = 6)。当您增长到11个节点时,该键需要移动到节点3(123456 mod 11 = 3),当您增长到12个节点时,它需要移动到节点0(123456 mod 12 = 0)。这样频繁的移动使得重新平衡极其昂贵。
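The cost of mod N rebalancing is easy to measure empirically. The sketch below counts how many of 10,000 synthetic keys would change nodes when the cluster grows from 10 to 11 nodes (MD5 is used here purely as a stand-in for any stable hash function):

```python
# Demonstrate the "hash mod N" problem: when N changes, most keys move.

import hashlib

def node_for(key, n_nodes):
    # A stable hash of the key, taken mod the number of nodes.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n_nodes

keys = [f"key-{i}" for i in range(10_000)]
moved = sum(1 for k in keys if node_for(k, 10) != node_for(k, 11))
fraction = moved / len(keys)
```

The fraction comes out around 0.91: roughly 10/11 of all keys have to move, whereas an ideal rebalancing scheme would move only about 1/11 of them (just enough to populate the new node).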
We need an approach that doesn’t move data around more than necessary.
我们需要一种方法,不会将数据移动得过多。
Fixed number of partitions
Fortunately, there is a fairly simple solution: create many more partitions than there are nodes, and assign several partitions to each node. For example, a database running on a cluster of 10 nodes may be split into 1,000 partitions from the outset so that approximately 100 partitions are assigned to each node.
幸运的是,有一个相当简单的解决方案:创建比节点数量更多的分区,并将多个分区分配给每个节点。例如,运行在10个节点集群上的数据库可以从一开始就划分为1,000个分区,这样每个节点大约分配100个分区。
Now, if a node is added to the cluster, the new node can steal a few partitions from every existing node until partitions are fairly distributed once again. This process is illustrated in Figure 6-6 . If a node is removed from the cluster, the same happens in reverse.
现在,如果向群集添加节点,则新节点可以从每个现有节点中窃取一些分区,直到再次公平分配分区。这个过程如图6-6所示。如果从群集中删除节点,则反向发生相同的情况。
Only entire partitions are moved between nodes. The number of partitions does not change, nor does the assignment of keys to partitions. The only thing that changes is the assignment of partitions to nodes. This change of assignment is not immediate—it takes some time to transfer a large amount of data over the network—so the old assignment of partitions is used for any reads and writes that happen while the transfer is in progress.
仅整个分区在节点之间移动。分区数量不会改变,键到分区的分配也不会改变。唯一变化的是分配给节点的分区。该分配的更改不会立即发生,需要一些时间将大量数据通过网络传输,因此在传输过程中发生的任何读写操作都将使用旧的分区分配。
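A minimal sketch of the fixed-partitions scheme: 1,000 partitions, four nodes, and a fifth node joining. The numbers are invented; the point is that keys never change partition, only partitions change node:

```python
# Fixed number of partitions: the key -> partition mapping is permanent;
# rebalancing only changes the partition -> node assignment.

N_PARTITIONS = 1_000

def partition_of(key):
    # Fixed for the life of the database. (Python's built-in hash of
    # strings is randomized per process; a real system would use a
    # stable hash such as MD5.)
    return hash(key) % N_PARTITIONS

# Initial assignment: 1,000 partitions spread round-robin over 4 nodes.
assignment = {p: p % 4 for p in range(N_PARTITIONS)}

def add_node(assignment, old_nodes, new_node):
    # The new node steals a fair share of whole partitions from the
    # existing nodes; keys never move between partitions.
    target = N_PARTITIONS // (old_nodes + 1)
    for p in list(assignment)[:target]:
        assignment[p] = new_node

add_node(assignment, old_nodes=4, new_node=4)
```

After the join, every node (including the new one) owns 200 partitions. A real system would also copy each stolen partition's data over the network before switching over, serving reads and writes from the old assignment in the meantime.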
In principle, you can even account for mismatched hardware in your cluster: by assigning more partitions to nodes that are more powerful, you can force those nodes to take a greater share of the load.
原则上,您甚至可以考虑集群中不匹配的硬件:通过将更多的分区分配给更强大的节点,您可以让这些节点承担更大的工作负荷份额。
This approach to rebalancing is used in Riak [ 15 ], Elasticsearch [ 24 ], Couchbase [ 10 ], and Voldemort [ 25 ].
这种重新平衡的方法被用于Riak [15]、Elasticsearch [24]、Couchbase [10]和Voldemort [25]中。
In this configuration, the number of partitions is usually fixed when the database is first set up and not changed afterward. Although in principle it’s possible to split and merge partitions (see the next section), a fixed number of partitions is operationally simpler, and so many fixed-partition databases choose not to implement partition splitting. Thus, the number of partitions configured at the outset is the maximum number of nodes you can have, so you need to choose it high enough to accommodate future growth. However, each partition also has management overhead, so it’s counterproductive to choose too high a number.
在这种配置中,分区数量通常在数据库首次设置时固定不变。虽然原则上可以分裂和合并分区(见下一节),但固定分区数量在操作上更简单,因此许多固定分区数据库选择不实现分区分割。因此,最初配置的分区数是您可以拥有的最大节点数,因此您需要选择足够高的数字来容纳未来增长。然而,每个分区也有管理开销,因此选择过高的数字是不利的。
Choosing the right number of partitions is difficult if the total size of the dataset is highly variable (for example, if it starts small but may grow much larger over time). Since each partition contains a fixed fraction of the total data, the size of each partition grows proportionally to the total amount of data in the cluster. If partitions are very large, rebalancing and recovery from node failures become expensive. But if partitions are too small, they incur too much overhead. The best performance is achieved when the size of partitions is “just right,” neither too big nor too small, which can be hard to achieve if the number of partitions is fixed but the dataset size varies.
如果数据集的总大小非常不确定(例如,它可能从小变得非常大),那么选择正确的分区数量就很困难。由于每个分区包含总数据的固定部分,因此每个分区的大小与群集中的数据总量成比例增长。如果分区非常大,则重新平衡和从节点故障中恢复会变得昂贵。但是,如果分区太小,则会产生过多的开销。当分区大小“刚刚好”而不是太大或太小时,可以实现最佳性能,但如果分区数量固定但数据集大小变化,则可能难以实现。
Dynamic partitioning
For databases that use key range partitioning (see “Partitioning by Key Range” ), a fixed number of partitions with fixed boundaries would be very inconvenient: if you got the boundaries wrong, you could end up with all of the data in one partition and all of the other partitions empty. Reconfiguring the partition boundaries manually would be very tedious.
对于使用键范围分区的数据库(参见"按键范围分区"),具有固定边界的固定数量分区将非常不方便:如果边界设置错误,可能会导致所有数据都集中在一个分区中,而其他分区全部为空。手动重新配置分区边界将非常繁琐。
For that reason, key range–partitioned databases such as HBase and RethinkDB create partitions dynamically. When a partition grows to exceed a configured size (on HBase, the default is 10 GB), it is split into two partitions so that approximately half of the data ends up on each side of the split [ 26 ]. Conversely, if lots of data is deleted and a partition shrinks below some threshold, it can be merged with an adjacent partition. This process is similar to what happens at the top level of a B-tree (see “B-Trees” ).
因此,像HBase和RethinkDB这样的键范围分区数据库会动态地创建分区。当分区增长超过配置的大小时(在HBase上,默认为10 GB),它会被分成两个分区,以便将数据的约一半分布在拆分的每一侧[26]。反之,如果删除了大量数据并且分区缩小到低于某个阈值,它可以与相邻的分区合并。 这个过程类似于B树的顶层发生的情况(参见“B树”)。
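Dynamic key-range partitioning can be sketched like this. The split threshold here is a small stand-in for a byte-size limit such as HBase's 10 GB default, and the keys and ranges are invented:

```python
# Dynamic (key-range) partitioning: when a partition exceeds a size
# threshold, split it at its median key so each half gets ~50% of the data.

MAX_KEYS = 4   # toy stand-in for a size threshold such as 10 GB

def insert(partitions, key, value):
    # Find the key-range partition responsible for this key.
    part = next(p for p in partitions if p["lo"] <= key < p["hi"])
    part["data"][key] = value
    if len(part["data"]) > MAX_KEYS:
        split(partitions, part)

def split(partitions, part):
    keys = sorted(part["data"])
    mid = keys[len(keys) // 2]   # median key becomes the new boundary
    left = {"lo": part["lo"], "hi": mid,
            "data": {k: v for k, v in part["data"].items() if k < mid}}
    right = {"lo": mid, "hi": part["hi"],
             "data": {k: v for k, v in part["data"].items() if k >= mid}}
    partitions[partitions.index(part)] = left
    partitions.append(right)

parts = [{"lo": "", "hi": "\uffff", "data": {}}]   # empty database: 1 partition
for k in ["ant", "bee", "cat", "dog", "elk"]:
    insert(parts, k, None)
```

After the fifth insert, the single partition splits at its median key ("cat"), leaving one partition covering keys below "cat" and one covering "cat" and above; either half could then be transferred to another node to balance the load.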
Each partition is assigned to one node, and each node can handle multiple partitions, like in the case of a fixed number of partitions. After a large partition has been split, one of its two halves can be transferred to another node in order to balance the load. In the case of HBase, the transfer of partition files happens through HDFS, the underlying distributed filesystem [ 3 ].
每个分区被分配给一个节点,每个节点可以处理多个分区,就像固定分区数量的情况一样。在一个大分区被拆分之后,可以将其两半之一转移到另一个节点,以平衡负载。在HBase中,分区文件的转移通过底层的分布式文件系统HDFS进行[3]。
An advantage of dynamic partitioning is that the number of partitions adapts to the total data volume. If there is only a small amount of data, a small number of partitions is sufficient, so overheads are small; if there is a huge amount of data, the size of each individual partition is limited to a configurable maximum [ 23 ].
动态分区的优点在于分区数量能够自适应数据量大小。如果数据量很小,只需要少量分区,从而减小开销;如果数据量很大,每个分区的大小会被限制在可配置的最大值[23]。
However, a caveat is that an empty database starts off with a single partition, since there is no a priori information about where to draw the partition boundaries. While the dataset is small—until it hits the point at which the first partition is split—all writes have to be processed by a single node while the other nodes sit idle. To mitigate this issue, HBase and MongoDB allow an initial set of partitions to be configured on an empty database (this is called pre-splitting ). In the case of key-range partitioning, pre-splitting requires that you already know what the key distribution is going to look like [ 4 , 26 ].
然而,有一个警告是空数据库开始时仅有一个分区,因为没有先验信息关于在哪里绘制分区边界。当数据集很小-直到第一个分区被拆分时,所有写入都必须由单个节点处理,而其他节点则闲置。为了缓解这个问题,HBase和MongoDB允许在空数据库上配置一组初始分区(这称为预分割)。在键范围分区的情况下,预分割要求你已经知道键分布会是什么样子 [4,26]。
Dynamic partitioning is not only suitable for key range–partitioned data, but can equally well be used with hash-partitioned data. MongoDB since version 2.4 supports both key-range and hash partitioning, and it splits partitions dynamically in either case.
动态分区不仅适用于键范围分区数据,同样也可用于哈希分区数据。自2.4版本以来,MongoDB支持键范围和哈希分区,并在任一情况下动态分割分区。
Partitioning proportionally to nodes
With dynamic partitioning, the number of partitions is proportional to the size of the dataset, since the splitting and merging processes keep the size of each partition between some fixed minimum and maximum. On the other hand, with a fixed number of partitions, the size of each partition is proportional to the size of the dataset. In both of these cases, the number of partitions is independent of the number of nodes.
使用动态分区,分区数量与数据集大小成正比,因为拆分和合并过程保持每个分区的大小介于某个固定的最小值和最大值之间。另一方面,如果分区数量固定,则每个分区的大小与数据集的大小成正比。在这两种情况下,分区数与节点数无关。
A third option, used by Cassandra and Ketama, is to make the number of partitions proportional to the number of nodes—in other words, to have a fixed number of partitions per node [ 23 , 27 , 28 ]. In this case, the size of each partition grows proportionally to the dataset size while the number of nodes remains unchanged, but when you increase the number of nodes, the partitions become smaller again. Since a larger data volume generally requires a larger number of nodes to store, this approach also keeps the size of each partition fairly stable.
第三种选项是由Cassandra和Ketama使用的,就是按节点数比例设置分区数量-换句话说,每个节点有一个固定数量的分区[23,27,28]。 在这种情况下,每个分区的大小与数据集大小成比例增长,而节点数保持不变,但是当您增加节点数时,分区大小又会变小。 由于更大的数据量一般需要更多的节点来存储,因此这种方法也可以保持每个分区的大小相对稳定。
When a new node joins the cluster, it randomly chooses a fixed number of existing partitions to split, and then takes ownership of one half of each of those split partitions while leaving the other half of each partition in place. The randomization can produce unfair splits, but when averaged over a larger number of partitions (in Cassandra, 256 partitions per node by default), the new node ends up taking a fair share of the load from the existing nodes. Cassandra 3.0 introduced an alternative rebalancing algorithm that avoids unfair splits [ 29 ].
当一个新节点加入集群时,它会随机选择一定数量的现有分区进行拆分,然后接管这些拆分分区的一半,同时留下每个分区的另一半。随机化可能会产生不公平的拆分,但当它在大量分区上进行平均(在Cassandra中,默认每个节点有256个分区),新节点最终会从现有节点中获得公平的负载份额。Cassandra 3.0引入了一种避免不公平拆分的替代平衡算法[29]。
Picking partition boundaries randomly requires that hash-based partitioning is used (so the boundaries can be picked from the range of numbers produced by the hash function). Indeed, this approach corresponds most closely to the original definition of consistent hashing [ 7 ] (see “Consistent Hashing” ). Newer hash functions can achieve a similar effect with lower metadata overhead [ 8 ].
随机选择分区边界需要使用基于哈希的分区(因此可以从哈希函数产生的数字范围中选择边界)。实际上,这种方法最接近一致性哈希的原始定义[7](参见“一致性哈希”)。新的哈希函数可以以较低的元数据开销实现类似的效果[8]。
Operations: Automatic or Manual Rebalancing
There is one important question with regard to rebalancing that we have glossed over: does the rebalancing happen automatically or manually?
关于重新平衡,还有一个我们一直略过的重要问题:重新平衡是自动发生还是手动发生?
There is a gradient between fully automatic rebalancing (the system decides automatically when to move partitions from one node to another, without any administrator interaction) and fully manual (the assignment of partitions to nodes is explicitly configured by an administrator, and only changes when the administrator explicitly reconfigures it). For example, Couchbase, Riak, and Voldemort generate a suggested partition assignment automatically, but require an administrator to commit it before it takes effect.
在完全自动重新平衡(系统自动决定何时将分区从一个节点移动到另一个节点,无需管理员干预)和完全手动(分区到节点的分配由管理员显式配置,仅在管理员显式重新配置时才改变)之间存在一个渐变的范围。例如,Couchbase、Riak 和 Voldemort 会自动生成建议的分区分配,但需要管理员确认后才能生效。
Fully automated rebalancing can be convenient, because there is less operational work to do for normal maintenance. However, it can be unpredictable. Rebalancing is an expensive operation, because it requires rerouting requests and moving a large amount of data from one node to another. If it is not done carefully, this process can overload the network or the nodes and harm the performance of other requests while the rebalancing is in progress.
全自动重新平衡可以很方便,因为正常维护时需要做较少的操作工作。但它也可能是不可预测的。重新平衡是一项昂贵的操作,因为它需要重新路由请求并移动大量数据从一个节点到另一个节点。如果不小心进行,则此过程可能会过载网络或节点,并在重新平衡期间影响其他请求的性能。
Such automation can be dangerous in combination with automatic failure detection. For example, say one node is overloaded and is temporarily slow to respond to requests. The other nodes conclude that the overloaded node is dead, and automatically rebalance the cluster to move load away from it. This puts additional load on the overloaded node, other nodes, and the network—making the situation worse and potentially causing a cascading failure.
这种自动化在与自动故障检测相结合时可能会很危险。例如,假设一个节点过载并暂时无法快速响应请求。其他节点会得出结论这个超载节点已经死亡,并自动重新平衡集群以移除负载。这会增加超载节点、其他节点和网络的负载,使情况恶化并潜在地导致级联故障。
For that reason, it can be a good thing to have a human in the loop for rebalancing. It’s slower than a fully automatic process, but it can help prevent operational surprises.
因此,让人类介入重新平衡可能是一件好事。虽然速度比完全自动化流程慢,但它有助于防止操作上的意外。
Request Routing
We have now partitioned our dataset across multiple nodes running on multiple machines. But there remains an open question: when a client wants to make a request, how does it know which node to connect to? As partitions are rebalanced, the assignment of partitions to nodes changes. Somebody needs to stay on top of those changes in order to answer the question: if I want to read or write the key “foo”, which IP address and port number do I need to connect to?
我们现在已经将数据集分区到多个运行在多台机器上的节点上了。但是还有一个未解决的问题:当客户端想要发起请求时,它如何知道要连接哪个节点?当分区重新平衡时,分配给节点的分区会发生变化。有人需要跟上这些变化,以回答这个问题:如果我想读取或写入键“foo”,我需要连接哪个IP地址和端口号?
This is an instance of a more general problem called service discovery , which isn’t limited to just databases. Any piece of software that is accessible over a network has this problem, especially if it is aiming for high availability (running in a redundant configuration on multiple machines). Many companies have written their own in-house service discovery tools, and many of these have been released as open source [ 30 ].
这是一个更一般的问题,被称为服务发现,它不仅仅局限于数据库。任何在网络上可访问的软件都会面临这个问题,特别是如果它想要实现高可用性(在多台机器上的冗余配置运行)。许多公司编写了自己的内部服务发现工具,其中许多已被发布为开源软件[30]。
On a high level, there are a few different approaches to this problem (illustrated in Figure 6-7 ):
从高层次来讲,对于这个问题有几种不同的方法(如图6-7所示):
-
Allow clients to contact any node (e.g., via a round-robin load balancer). If that node coincidentally owns the partition to which the request applies, it can handle the request directly; otherwise, it forwards the request to the appropriate node, receives the reply, and passes the reply along to the client.
允许客户端联系任何节点(例如,通过轮询负载均衡器)。如果该节点恰好拥有请求所属的分区,它可以直接处理该请求;否则,它会将请求转发到适当的节点,接收回复,并将回复传递给客户端。
-
Send all requests from clients to a routing tier first, which determines the node that should handle each request and forwards it accordingly. This routing tier does not itself handle any requests; it only acts as a partition-aware load balancer.
将所有客户端请求先发送到路由层,该层确定应该处理每个请求的节点,并相应地转发它们。这个路由层本身不处理任何请求,它只作为一个分区感知的负载均衡器。
-
Require that clients be aware of the partitioning and the assignment of partitions to nodes. In this case, a client can connect directly to the appropriate node, without any intermediary.
要求客户端了解分区和分区分配给节点的情况。在这种情况下,客户端可以直接连接到适当的节点,无需任何中间人。
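The third approach (a partitioning-aware client) can be sketched as a routing table cached on the client side. The addresses, the toy hash function, and the partition count are all invented for illustration; in practice this metadata would come from a coordination service or a gossip protocol, as discussed below:

```python
# A partitioning-aware client: given a key, look up the partition and
# connect directly to the node that owns it, with no intermediary.

N_PARTITIONS = 8

# Hypothetical partition -> "ip:port" assignment, cached by the client.
routing_table = {p: f"10.0.0.{p % 4 + 1}:5432" for p in range(N_PARTITIONS)}

def node_for_key(key):
    partition = sum(key.encode()) % N_PARTITIONS   # toy hash function
    return routing_table[partition]
```

The hard part, of course, is not this lookup but keeping the cached routing table consistent with the cluster's actual partition assignment as rebalancing moves partitions around.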
In all cases, the key problem is: how does the component making the routing decision (which may be one of the nodes, or the routing tier, or the client) learn about changes in the assignment of partitions to nodes?
在所有情况下,关键问题是:决定路由的组件(可能是节点之一、路由层或客户端)如何了解将分区分配给节点的更改?
This is a challenging problem, because it is important that all participants agree—otherwise requests would be sent to the wrong nodes and not handled correctly. There are protocols for achieving consensus in a distributed system, but they are hard to implement correctly (see Chapter 9 ).
这是一个具有挑战性的问题,因为重要的是所有参与者都同意,否则请求将被发送到错误的节点,并且不能正确处理。有协议可用于在分布式系统中实现共识,但是正确地实现它们很难(请参阅第9章)。
Many distributed data systems rely on a separate coordination service such as ZooKeeper to keep track of this cluster metadata, as illustrated in Figure 6-8 . Each node registers itself in ZooKeeper, and ZooKeeper maintains the authoritative mapping of partitions to nodes. Other actors, such as the routing tier or the partitioning-aware client, can subscribe to this information in ZooKeeper. Whenever a partition changes ownership, or a node is added or removed, ZooKeeper notifies the routing tier so that it can keep its routing information up to date.
许多分布式数据系统都依赖于一个单独的协调服务(例如ZooKeeper)来跟踪集群元数据,如图6-8所示。每个节点在ZooKeeper中注册自己,并且ZooKeeper维护分区到节点的授权映射。其他参与者(如路由层或分区感知客户端)可以在ZooKeeper中订阅这些信息。每当分区所有权发生变化或节点添加或删除时,ZooKeeper会通知路由层以便它可以保持其路由信息最新。
For example, LinkedIn’s Espresso uses Helix [ 31 ] for cluster management (which in turn relies on ZooKeeper), implementing a routing tier as shown in Figure 6-8 . HBase, SolrCloud, and Kafka also use ZooKeeper to track partition assignment. MongoDB has a similar architecture, but it relies on its own config server implementation and mongos daemons as the routing tier.
例如,LinkedIn的Espresso使用Helix进行集群管理(其又依赖于ZooKeeper),并实现了如图6-8所示的路由层。HBase、SolrCloud和Kafka也使用ZooKeeper来跟踪分区分配。MongoDB具有类似的架构,但它依赖于自己的配置服务器实现和mongos守护程序作为路由层。
Cassandra and Riak take a different approach: they use a gossip protocol among the nodes to disseminate any changes in cluster state. Requests can be sent to any node, and that node forwards them to the appropriate node for the requested partition (approach 1 in Figure 6-7 ). This model puts more complexity in the database nodes but avoids the dependency on an external coordination service such as ZooKeeper.
Cassandra和Riak采用不同的方法:它们使用节点之间的八卦协议来传播集群状态的任何更改。请求可以发送到任何节点,该节点将其转发到所请求分区的适当节点(图6-7中的第一种方法)。该模型将更多的复杂性放在数据库节点上,但避免了对外部协调服务(如ZooKeeper)的依赖。
Couchbase does not rebalance automatically, which simplifies the design. Normally it is configured with a routing tier called moxi , which learns about routing changes from the cluster nodes [ 32 ].
Couchbase不会自动重新平衡,这简化了设计。通常情况下,它会配置一个叫做moxi的路由层,从集群节点了解路由变化。
When using a routing tier or when sending requests to a random node, clients still need to find the IP addresses to connect to. These are not as fast-changing as the assignment of partitions to nodes, so it is often sufficient to use DNS for this purpose.
使用路由层或向随机节点发送请求时,客户端仍然需要找到要连接的 IP 地址。这些地址不像将分区分配给节点一样经常发生变化,因此通常可以使用 DNS 来实现这一目的。
Parallel Query Execution
So far we have focused on very simple queries that read or write a single key (plus scatter/gather queries in the case of document-partitioned secondary indexes). This is about the level of access supported by most NoSQL distributed datastores.
到目前为止,我们专注于非常简单的查询,仅读取或写入单个键(在文档分区二级索引的情况下还包括scatter/gather查询)。这是大多数NoSQL分布式数据存储支持的访问级别。
However, massively parallel processing (MPP) relational database products, often used for analytics, are much more sophisticated in the types of queries they support. A typical data warehouse query contains several join, filtering, grouping, and aggregation operations. The MPP query optimizer breaks this complex query into a number of execution stages and partitions, many of which can be executed in parallel on different nodes of the database cluster. Queries that involve scanning over large parts of the dataset particularly benefit from such parallel execution.
然而,通常用于分析的大规模并行处理(MPP)关系型数据库产品在其支持的查询类型方面要复杂得多。典型的数据仓库查询包含多个连接、筛选、分组和聚合操作。MPP查询优化器将这个复杂的查询分成多个执行阶段和分区,其中许多可以在数据库群集的不同节点上并行执行。涉及扫描大量数据集的查询尤其受益于这种并行执行。
Fast parallel execution of data warehouse queries is a specialized topic, and given the business importance of analytics, it receives a lot of commercial interest. We will discuss some techniques for parallel query execution in Chapter 10 . For a more detailed overview of techniques used in parallel databases, please see the references [ 1 , 33 ].
数据仓库查询的快速并行执行是一项专业话题,由于分析的商业重要性,它受到了很多商业上的关注。我们将在第10章中讨论一些并行查询执行技术。有关并行数据库中使用的技术的更详细概述,请参见参考文献[1,33]。
Summary
In this chapter we explored different ways of partitioning a large dataset into smaller subsets. Partitioning is necessary when you have so much data that storing and processing it on a single machine is no longer feasible.
在本章中,我们探讨了将大型数据集分割为较小子集的不同方法。当您的数据量如此大,以至于在单台计算机上存储和处理不再可行时,分区是必要的。
The goal of partitioning is to spread the data and query load evenly across multiple machines, avoiding hot spots (nodes with disproportionately high load). This requires choosing a partitioning scheme that is appropriate to your data, and rebalancing the partitions when nodes are added to or removed from the cluster.
分区的目标是在多台机器间均匀分配数据和查询负载,避免热点节点(负载过高的节点)。这需要选择适合数据的分区方案,并在集群中添加或删除节点时重新平衡分区。
We discussed two main approaches to partitioning:
我们讨论了两种主要的分区方法:
-
Key range partitioning , where keys are sorted, and a partition owns all the keys from some minimum up to some maximum. Sorting has the advantage that efficient range queries are possible, but there is a risk of hot spots if the application often accesses keys that are close together in the sorted order.
键范围分区,其中键已排序,并且一个分区拥有从某个最小值到某个最大值的所有键。排序具有高效的范围查询优势,但如果应用程序经常访问在排序顺序中彼此靠近的键,则存在热点的风险。
In this approach, partitions are typically rebalanced dynamically by splitting the range into two subranges when a partition gets too big.
在这种方法中,当一个分区变得太大时,通常会通过将范围分成两个子范围来动态地重新平衡分区。
-
Hash partitioning , where a hash function is applied to each key, and a partition owns a range of hashes. This method destroys the ordering of keys, making range queries inefficient, but may distribute load more evenly.
哈希分区,是对每个关键字应用哈希函数,并对一个分区拥有一定范围的哈希值。这种方法破坏了关键字的排序,导致区间查询效率低下,但可以更均匀地分配负载。
When partitioning by hash, it is common to create a fixed number of partitions in advance, to assign several partitions to each node, and to move entire partitions from one node to another when nodes are added or removed. Dynamic partitioning can also be used.
当使用哈希分区时,通常会提前创建固定数量的分区,分配多个分区到每个节点,当节点添加或删除时,会将整个分区从一个节点移动到另一个节点。也可以使用动态分区。
Hybrid approaches are also possible, for example with a compound key: using one part of the key to identify the partition and another part for the sort order.
混合方法也是可能的,例如使用复合键:使用键的一部分来标识分区,另一部分用于排序顺序。
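The compound-key idea can be sketched as follows, loosely following Cassandra's model: a hashed partition key (here a user ID) chooses the partition, and a second key component (here a timestamp) keeps rows sorted within it. All names and the partition count are illustrative:

```python
# Compound key: hash one part of the key to pick the partition, use the
# other part as the sort order within the partition.

import hashlib

N_PARTITIONS = 4

def partition_for(user_id):
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return h % N_PARTITIONS

def store(partitions, user_id, timestamp, update):
    part = partitions[partition_for(user_id)]
    # Within a partition, one user's rows stay sorted by timestamp, so
    # "all updates by user X in a time interval" is an efficient scan.
    part.setdefault(user_id, []).append((timestamp, update))
    part[user_id].sort()

parts = [{} for _ in range(N_PARTITIONS)]
store(parts, "alice", 2, "second post")
store(parts, "alice", 1, "first post")
```

All of one user's updates land in a single partition (good for per-user range scans), while different users hash to different partitions (good for load distribution).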
We also discussed the interaction between partitioning and secondary indexes. A secondary index also needs to be partitioned, and there are two methods:
我们还讨论了分区和二级索引之间的交互。二级索引也需要进行分区,有两种方法:
-
Document-partitioned indexes (local indexes), where the secondary indexes are stored in the same partition as the primary key and value. This means that only a single partition needs to be updated on write, but a read of the secondary index requires a scatter/gather across all partitions.
文档分区索引(本地索引),其中二级索引存储在与主键和值相同的分区中。这意味着在写入时只需要更新单个分区,但读取辅助索引需要在所有分区间散布/收集。
-
Term-partitioned indexes (global indexes), where the secondary indexes are partitioned separately, using the indexed values. An entry in the secondary index may include records from all partitions of the primary key. When a document is written, several partitions of the secondary index need to be updated; however, a read can be served from a single partition.
使用按术语分区的索引(全局索引),其中二级索引单独分区,使用索引值。二级索引中的条目可能包括主键所有分区的记录。当编写文档时,需要更新二级索引的多个分区;但是,读取可以从单个分区服务。
Finally, we discussed techniques for routing queries to the appropriate partition, which range from simple partition-aware load balancing to sophisticated parallel query execution engines.
最后,我们讨论了将查询路由到适当分区的技术,这些技术从简单的分区感知负载均衡到复杂的并行查询执行引擎不等。
By design, every partition operates mostly independently—that’s what allows a partitioned database to scale to multiple machines. However, operations that need to write to several partitions can be difficult to reason about: for example, what happens if the write to one partition succeeds, but another fails? We will address that question in the following chapters.
按照设计,每个分区基本上是独立运作的,这使得分区数据库可以扩展到多台机器。然而,需要写入多个分区的操作可能很难理解:例如,如果对一个分区的写入成功了,但另一个分区失败了怎么办?我们将在接下来的章节中解决这个问题。
Footnotes
i Partitioning, as discussed in this chapter, is a way of intentionally breaking a large database down into smaller ones. It has nothing to do with network partitions (netsplits), a type of fault in the network between nodes. We will discuss such faults in Chapter 8 .
本章所讨论的分区是一种有意将大型数据库分解成更小模块的方法,与网络分区(net split)这种在节点之间网络连接上发生的故障无关。我们将在第八章讨论这类故障。
ii If your database only supports a key-value model, you might be tempted to implement a secondary index yourself by creating a mapping from values to document IDs in application code. If you go down this route, you need to take great care to ensure your indexes remain consistent with the underlying data. Race conditions and intermittent write failures (where some changes were saved but others weren’t) can very easily cause the data to go out of sync—see “The need for multi-object transactions” .
如果你的数据库仅支持键值模型,你可能会想通过在应用程序代码中创建从值到文档ID的映射来自己实现二级索引。如果你选择这条路,你需要格外小心地确保你的索引与底层数据保持一致。竞争条件和间歇性的写入失败(其中一些更改已保存,而另一些没有)很容易导致数据失去同步,参见"对多对象事务的需求"。
References
[ 1 ] David J. DeWitt and Jim N. Gray: “ Parallel Database Systems: The Future of High Performance Database Systems ,” Communications of the ACM , volume 35, number 6, pages 85–98, June 1992. doi:10.1145/129888.129894
[1] David J. DeWitt和Jim N. Gray:"并行数据库系统:高性能数据库系统的未来",ACM通讯,第35卷,第6期,1992年6月,85-98页。doi:10.1145/129888.129894。
[ 2 ] Lars George: “ HBase vs. BigTable Comparison ,” larsgeorge.com , November 2009.
[2] Lars George: “HBase vs. BigTable 对比,” larsgeorge.com,2009年11月。
[ 3 ] “ The Apache HBase Reference Guide ,” Apache Software Foundation, hbase.apache.org , 2014.
[3] “Apache HBase参考指南”,Apache软件基金会,hbase.apache.org,2014年。
[ 4 ] MongoDB, Inc.: “ New Hash-Based Sharding Feature in MongoDB 2.4 ,” blog.mongodb.org , April 10, 2013.
[4] MongoDB公司:“MongoDB 2.4中新增基于哈希的分片特性”,blog.mongodb.org,2013年4月10日。
[ 5 ] Ikai Lan: “ App Engine Datastore Tip: Monotonically Increasing Values Are Bad ,” ikaisays.com , January 25, 2011.
[5] Ikai Lan:"App Engine数据存储技巧:单调递增的值是不好的",ikaisays.com,2011年1月25日。
[ 6 ] Martin Kleppmann: “ Java’s hashCode Is Not Safe for Distributed Systems ,” martin.kleppmann.com , June 18, 2012.
[6] Martin Kleppmann:"Java的hashCode对分布式系统不安全",martin.kleppmann.com,2012年6月18日。
[ 7 ] David Karger, Eric Lehman, Tom Leighton, et al.: “ Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web ,” at 29th Annual ACM Symposium on Theory of Computing (STOC), pages 654–663, 1997. doi:10.1145/258533.258660
[7] David Karger, Eric Lehman, Tom Leighton等:“一致性哈希和随机树:分布式缓存协议以缓解全球网络上的热点”,第29届ACM计算理论研讨会(STOC),1997年,第654-663页。doi:10.1145/258533.258660。
[ 8 ] John Lamping and Eric Veach: “ A Fast, Minimal Memory, Consistent Hash Algorithm ,” arxiv.org , June 2014.
[8] John Lamping和Eric Veach:"一种快速、最小内存占用的一致性哈希算法",arxiv.org,2014年6月。
[ 9 ] Eric Redmond: “ A Little Riak Book ,” Version 1.4.0, Basho Technologies, September 2013.
[9] Eric Redmond: 《Riak简介》第1.4.0版, Basho Technologies,2013年9月。
[ 10 ] “ Couchbase 2.5 Administrator Guide ,” Couchbase, Inc., 2014.
[10] “Couchbase 2.5管理员指南”,Couchbase公司,2014年。
[ 11 ] Avinash Lakshman and Prashant Malik: “ Cassandra – A Decentralized Structured Storage System ,” at 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS), October 2009.
[11] Avinash Lakshman和Prashant Malik:"Cassandra:一个去中心化的结构化存储系统",发表于第三届ACM SIGOPS大规模分布式系统与中间件国际研讨会(LADIS),2009年10月。
[ 12 ] Jonathan Ellis: “ Facebook’s Cassandra Paper, Annotated and Compared to Apache Cassandra 2.0 ,” datastax.com , September 12, 2013.
[12] Jonathan Ellis:"Facebook的Cassandra论文:加注并与Apache Cassandra 2.0对比",datastax.com,2013年9月12日。
[ 13 ] “ Introduction to Cassandra Query Language ,” DataStax, Inc., 2014.
[13] 《Cassandra查询语言简介》,DataStax, Inc.,2014年。
[ 14 ] Samuel Axon: “ 3% of Twitter’s Servers Dedicated to Justin Bieber ,” mashable.com , September 7, 2010.
[14] Samuel Axon:"Twitter的服务器中有3%专门用于贾斯汀·比伯",mashable.com,2010年9月7日。
[ 15 ] “ Riak 1.4.8 Docs ,” Basho Technologies, Inc., 2014.
[15] 《Riak 1.4.8文档》,Basho Technologies, Inc.,2014年。
[ 16 ] Richard Low: “ The Sweet Spot for Cassandra Secondary Indexing ,” wentnet.com , October 21, 2013.
[16] Richard Low:“Cassandra二级索引的最佳情况”,wentnet.com,2013年10月21日。
[ 17 ] Zachary Tong: “ Customizing Your Document Routing ,” elasticsearch.org , June 3, 2013.
[17] Zachary Tong:“自定义文档路由”,elasticsearch.org,2013年6月3日。
[ 18 ] “ Apache Solr Reference Guide ,” Apache Software Foundation, 2014.
[18] 《Apache Solr参考指南》,Apache Software Foundation,2014年。
[ 19 ] Andrew Pavlo: “ H-Store Frequently Asked Questions ,” hstore.cs.brown.edu , October 2013.
[19] Andrew Pavlo: “H-Store常见问题,” hstore.cs.brown.edu, 2013年10月。
[ 20 ] “ Amazon DynamoDB Developer Guide ,” Amazon Web Services, Inc., 2014.
[20] 《Amazon DynamoDB开发者指南》,Amazon Web Services, Inc.,2014年。
[ 21 ] Rusty Klophaus: “ Difference Between 2I and Search ,” email to riak-users mailing list, lists.basho.com , October 25, 2011.
[21] Rusty Klophaus:"2I与Search的差异",发送至riak-users邮件列表的邮件,lists.basho.com,2011年10月25日。
[ 22 ] Donald K. Burleson: “ Object Partitioning in Oracle ,” dba-oracle.com , November 8, 2000.
[22] Donald K. Burleson:"Oracle中的对象分区",dba-oracle.com,2000年11月8日。
[ 23 ] Eric Evans: “ Rethinking Topology in Cassandra ,” at ApacheCon Europe , November 2012.
[23] Eric Evans:在2012年11月的ApacheCon Europe上发表了题为“重新审视Cassandra中的拓扑结构”的演讲。
[ 24 ] Rafał Kuć: “ Reroute API Explained ,” elasticsearchserverbook.com , September 30, 2013.
[24] Rafał Kuć:"Reroute API详解",elasticsearchserverbook.com,2013年9月30日。
[ 25 ] “ Project Voldemort Documentation ,” project-voldemort.com .
[25] 《Voldemort项目文档》,project-voldemort.com。
[ 26 ] Enis Soztutar: “ Apache HBase Region Splitting and Merging ,” hortonworks.com , February 1, 2013.
[26] Enis Soztutar:"Apache HBase区域的分割与合并",hortonworks.com,2013年2月1日。
[ 27 ] Brandon Williams: “ Virtual Nodes in Cassandra 1.2 ,” datastax.com , December 4, 2012.
[27] Brandon Williams:"Cassandra 1.2中的虚拟节点",datastax.com,2012年12月4日。
[ 28 ] Richard Jones: “ libketama: Consistent Hashing Library for Memcached Clients ,” metabrew.com , April 10, 2007.
[28] Richard Jones: “libketama:用于Memcached客户端的一致性哈希库”,metabrew.com,2007年4月10日。
[ 29 ] Branimir Lambov: “ New Token Allocation Algorithm in Cassandra 3.0 ,” datastax.com , January 28, 2016.
[29] 布拉尼米尔·兰博夫: “Cassandra 3.0中的新令牌分配算法”,datastax.com,2016年1月28日。
[ 30 ] Jason Wilder: “ Open-Source Service Discovery ,” jasonwilder.com , February 2014.
[30] Jason Wilder:"开源服务发现",jasonwilder.com,2014年2月。
[ 31 ] Kishore Gopalakrishna, Shi Lu, Zhen Zhang, et al.: “ Untangling Cluster Management with Helix ,” at ACM Symposium on Cloud Computing (SoCC), October 2012. doi:10.1145/2391229.2391248
[31] Kishore Gopalakrishna, Shi Lu, Zhen Zhang等人:“通过Helix解开集群管理”,发表于2012年ACM云计算研讨会(SoCC),2012年10月。 doi:10.1145/2391229.2391248
[ 32 ] “ Moxi 1.8 Manual ,” Couchbase, Inc., 2014.
[32] "Moxi 1.8 手册," Couchbase,Inc.,2014年。
[ 33 ] Shivnath Babu and Herodotos Herodotou: “ Massively Parallel Databases and MapReduce Systems ,” Foundations and Trends in Databases , volume 5, number 1, pages 1–104, November 2013. doi:10.1561/1900000036
[33] Shivnath Babu和Herodotos Herodotou:"大规模并行数据库与MapReduce系统",《数据库基础与趋势》(Foundations and Trends in Databases),第5卷,第1期,第1-104页,2013年11月。doi:10.1561/1900000036
Chapter 7. Transactions
Some authors have claimed that general two-phase commit is too expensive to support, because of the performance or availability problems that it brings. We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.
一些作者声称,一般的两阶段提交由于带来的性能或可用性问题而过于昂贵。我们认为,最好让应用程序员在出现瓶颈时处理由于过度使用事务而导致的性能问题,而不是总是编码以避免缺少事务。
James Corbett et al., Spanner: Google’s Globally-Distributed Database (2012)
詹姆斯·科贝特(James Corbett)等,《Spanner:谷歌的全球分布式数据库》(2012年)
In the harsh reality of data systems, many things can go wrong:
在数据系统的严酷现实中,很多事情都可能出错:
-
The database software or hardware may fail at any time (including in the middle of a write operation).
数据库软件或硬件随时可能出现故障(包括在写入操作中间)。
-
The application may crash at any time (including halfway through a series of operations).
应用程序在任何时候都可能崩溃(包括在一系列操作中途崩溃)。
-
Interruptions in the network can unexpectedly cut off the application from the database, or one database node from another.
网络中断可能会意外地切断应用程序与数据库的连接,或切断数据库节点之间的连接。
-
Several clients may write to the database at the same time, overwriting each other’s changes.
若干个客户端可能同时对数据库进行写操作,从而覆盖彼此所做的更改。
-
A client may read data that doesn’t make sense because it has only partially been updated.
客户端可能读到没有意义的数据,因为数据只被部分更新了。
-
Race conditions between clients can cause surprising bugs.
客户端之间的竞争条件可能会导致令人惊讶的错误。
In order to be reliable, a system has to deal with these faults and ensure that they don’t cause catastrophic failure of the entire system. However, implementing fault-tolerance mechanisms is a lot of work. It requires a lot of careful thinking about all the things that can go wrong, and a lot of testing to ensure that the solution actually works.
为了保证可靠性,系统必须处理这些故障并确保它们不会导致整个系统的灾难性故障。然而,实现容错机制是一项艰巨的工作,需要对可能出现的所有问题进行仔细思考,并进行大量的测试以确保解决方案实际有效。
For decades, transactions have been the mechanism of choice for simplifying these issues. A transaction is a way for an application to group several reads and writes together into a logical unit. Conceptually, all the reads and writes in a transaction are executed as one operation: either the entire transaction succeeds ( commit ) or it fails ( abort , rollback ). If it fails, the application can safely retry. With transactions, error handling becomes much simpler for an application, because it doesn’t need to worry about partial failure—i.e., the case where some operations succeed and some fail (for whatever reason).
几十年来,事务一直是简化这些问题的首选机制。事务是应用程序将多个读取和写入组合成一个逻辑单元的方式。从概念上讲,事务中的所有读取和写入都作为一个操作执行:整个事务成功(提交)或失败(中止,回滚)。如果失败,应用程序可以安全地重试。使用事务,应用程序的错误处理变得更加简单,因为它不需要担心部分失败,即某些操作成功而某些操作失败(无论出于什么原因)。
If you have spent years working with transactions, they may seem obvious, but we shouldn’t take them for granted. Transactions are not a law of nature; they were created with a purpose, namely to simplify the programming model for applications accessing a database. By using transactions, the application is free to ignore certain potential error scenarios and concurrency issues, because the database takes care of them instead (we call these safety guarantees ).
如果您花了多年时间处理事务,它们可能看起来很明显,但我们不应该把它们视为理所当然。事务并不是自然法则;它们是为了简化访问数据库的应用程序的编程模型而创建的。通过使用事务,应用程序可以忽略某些潜在的错误场景和并发问题,因为数据库会代替它们处理(我们称这些为安全保障)。
Not every application needs transactions, and sometimes there are advantages to weakening transactional guarantees or abandoning them entirely (for example, to achieve higher performance or higher availability). Some safety properties can be achieved without transactions.
并非每个应用程序都需要事务,有时放弃或减弱事务保证可能会带来优势(例如,为了实现更高的性能或可用性)。有些安全属性可以在不使用事务的情况下实现。
How do you figure out whether you need transactions? In order to answer that question, we first need to understand exactly what safety guarantees transactions can provide, and what costs are associated with them. Although transactions seem straightforward at first glance, there are actually many subtle but important details that come into play.
你如何确定是否需要事务处理?为了回答这个问题,我们首先需要了解事务处理能提供什么安全保障以及与之相关的成本。虽然事务处理乍一看似乎很简单,但实际上有许多微妙但重要的细节需要考虑。
In this chapter, we will examine many examples of things that can go wrong, and explore the algorithms that databases use to guard against those issues. We will go especially deep in the area of concurrency control, discussing various kinds of race conditions that can occur and how databases implement isolation levels such as read committed , snapshot isolation , and serializability .
在本章,我们将研究许多可能出错的示例,并探讨数据库用来防范这些问题的算法。我们将特别深入探讨并发控制领域,讨论可能发生的各种竞争条件,以及数据库如何实现读取提交、快照隔离和串行化等隔离级别。
This chapter applies to both single-node and distributed databases; in Chapter 8 we will focus the discussion on the particular challenges that arise only in distributed systems.
本章适用于单节点和分布式数据库;第8章我们将重点讨论仅在分布式系统中出现的特定挑战。
The Slippery Concept of a Transaction
Almost all relational databases today, and some nonrelational databases, support transactions. Most of them follow the style that was introduced in 1975 by IBM System R, the first SQL database [ 1 , 2 , 3 ]. Although some implementation details have changed, the general idea has remained virtually the same for 40 years: the transaction support in MySQL, PostgreSQL, Oracle, SQL Server, etc., is uncannily similar to that of System R.
几乎所有的关系型数据库、以及一些非关系型数据库,都支持事务。它们中的大部分都遵循1975年IBM System R所引入的样式,这是第一个SQL数据库[1,2,3]。虽然一些实现细节发生了改变,但是基本思想已经几乎保持了40年不变:MySQL、PostgreSQL、Oracle、SQL Server等的事务支持都和System R非常相似。
In the late 2000s, nonrelational (NoSQL) databases started gaining popularity. They aimed to improve upon the relational status quo by offering a choice of new data models (see Chapter 2 ), and by including replication ( Chapter 5 ) and partitioning ( Chapter 6 ) by default. Transactions were the main casualty of this movement: many of this new generation of databases abandoned transactions entirely, or redefined the word to describe a much weaker set of guarantees than had previously been understood [ 4 ].
在2000年代后期,非关系型(NoSQL)数据库开始流行。它们旨在通过提供新的数据模型选择(见第2章),以及默认包含复制(第5章)和分区(第6章),来改善关系型数据库的现状。事务是这场运动的主要牺牲品:许多新一代数据库完全放弃了事务,或者重新定义了这个词,用它来描述比以前的理解弱得多的一组保证[4]。
With the hype around this new crop of distributed databases, there emerged a popular belief that transactions were the antithesis of scalability, and that any large-scale system would have to abandon transactions in order to maintain good performance and high availability [ 5 , 6 ]. On the other hand, transactional guarantees are sometimes presented by database vendors as an essential requirement for “serious applications” with “valuable data.” Both viewpoints are pure hyperbole.
随着这批新的分布式数据库受到大肆宣传,出现了一种流行观念:事务是可扩展性的对立面,任何大型系统都必须放弃事务,才能保持良好的性能和高可用性[5,6]。另一方面,数据库供应商有时又把事务保证宣传为拥有"有价值数据"的"重要应用"的基本要求。这两种观点都是纯粹的夸大其词。
The truth is not that simple: like every other technical design choice, transactions have advantages and limitations. In order to understand those trade-offs, let’s go into the details of the guarantees that transactions can provide—both in normal operation and in various extreme (but realistic) circumstances.
事实并非如此简单:与其他每一种技术设计选择一样,事务有其优势,也有其局限。为了理解这些权衡,让我们深入了解事务在正常运行以及各种极端(但现实存在)情况下所能提供的保证的细节。
The Meaning of ACID
The safety guarantees provided by transactions are often described by the well-known acronym ACID , which stands for Atomicity , Consistency , Isolation , and Durability . It was coined in 1983 by Theo Härder and Andreas Reuter [ 7 ] in an effort to establish precise terminology for fault-tolerance mechanisms in databases.
事务所提供的安全保证通常用著名的缩写ACID来描述,它代表原子性(Atomicity)、一致性(Consistency)、隔离性(Isolation)和持久性(Durability)。该术语由Theo Härder和Andreas Reuter于1983年创造[7],旨在为数据库中的容错机制建立精确的术语。
However, in practice, one database’s implementation of ACID does not equal another’s implementation. For example, as we shall see, there is a lot of ambiguity around the meaning of isolation [ 8 ]. The high-level idea is sound, but the devil is in the details. Today, when a system claims to be “ACID compliant,” it’s unclear what guarantees you can actually expect. ACID has unfortunately become mostly a marketing term.
然而,在实践中,一个数据库的ACID实现并不等同于另一个数据库的实现。例如,正如我们将看到的,关于隔离性的含义就存在许多歧义[8]。高层次的想法是合理的,但细节决定成败。今天,当一个系统声称自己"兼容ACID"时,你实际上能得到什么保证并不清楚。不幸的是,ACID如今已经基本沦为一个营销术语。
(Systems that do not meet the ACID criteria are sometimes called BASE , which stands for Basically Available , Soft state , and Eventual consistency [ 9 ]. This is even more vague than the definition of ACID. It seems that the only sensible definition of BASE is “not ACID”; i.e., it can mean almost anything you want.)
(不符合ACID标准的系统有时被称为BASE,即基本可用(Basically Available)、软状态(Soft state)和最终一致性(Eventual consistency)[9]。这比ACID的定义还要模糊。BASE唯一合理的定义似乎就是"非ACID";也就是说,它几乎可以意味着任何你想要的东西。)
Let’s dig into the definitions of atomicity, consistency, isolation, and durability, as this will let us refine our idea of transactions.
让我们深入探讨原子性、一致性、隔离性和持久性的定义,这将让我们完善对事务的概念。
Atomicity
In general, atomic refers to something that cannot be broken down into smaller parts. The word means similar but subtly different things in different branches of computing. For example, in multi-threaded programming, if one thread executes an atomic operation, that means there is no way that another thread could see the half-finished result of the operation. The system can only be in the state it was before the operation or after the operation, not something in between.
通常,原子指的是不能被分解成更小部分的东西。在计算的不同分支中,这个词的含义相似但又有微妙的差别。例如,在多线程编程中,如果一个线程执行了一个原子操作,就意味着另一个线程不可能看到该操作的半完成结果。系统只能处于操作之前或操作之后的状态,而不可能处于二者之间的状态。
By contrast, in the context of ACID, atomicity is not about concurrency. It does not describe what happens if several processes try to access the same data at the same time, because that is covered under the letter I , for isolation (see “Isolation” ).
相比之下,在ACID的背景下,原子性不涉及并发。如果多个进程尝试同时访问相同的数据,它并不描述会发生什么,因为这在I,即隔离(请参见“隔离”)下涵盖。
Rather, ACID atomicity describes what happens if a client wants to make several writes, but a fault occurs after some of the writes have been processed—for example, a process crashes, a network connection is interrupted, a disk becomes full, or some integrity constraint is violated. If the writes are grouped together into an atomic transaction, and the transaction cannot be completed ( committed ) due to a fault, then the transaction is aborted and the database must discard or undo any writes it has made so far in that transaction.
ACID的原子性描述的是当客户端要进行多写操作,但是在部分写操作被处理后发生故障的情况,例如进程崩溃、网络连接中断、磁盘空间满或某些完整性约束被违反。 如果将写操作组合成原子事务,并且由于故障无法完成(提交)事务,则事务将被中止,数据库必须放弃或撤销在该事务中迄今为止所做的任何写操作。
Without atomicity, if an error occurs partway through making multiple changes, it’s difficult to know which changes have taken effect and which haven’t. The application could try again, but that risks making the same change twice, leading to duplicate or incorrect data. Atomicity simplifies this problem: if a transaction was aborted, the application can be sure that it didn’t change anything, so it can safely be retried.
如果没有原子性,则在进行多个更改的过程中发生错误时,很难知道哪些更改已生效,哪些没有生效。应用程序可以再次尝试,但这会冒着使相同的更改两次的风险,导致重复或不正确的数据。原子性简化了这个问题:如果事务被中止,应用程序可以确定它没有更改任何内容,因此可以安全地重试。
The ability to abort a transaction on error and have all writes from that transaction discarded is the defining feature of ACID atomicity. Perhaps abortability would have been a better term than atomicity , but we will stick with atomicity since that’s the usual word.
事务发生错误时可以终止事务并且回滚所有写入操作,这是ACID原子性的定义特征。或许"可中止性"是个更好的术语,但由于"原子性"是常用词汇,我们将继续使用它。
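To make the all-or-nothing behavior concrete, here is a minimal sketch using Python's built-in sqlite3 module (the accounts table and the amounts are invented for illustration, not from the book): when a later write in the transaction fails, the earlier write is undone as well, so a retry starts from a clean state.

为了更直观地说明全有或全无的行为,下面用Python内置的sqlite3模块给出一个极简的示意(accounts表和金额都是为演示而虚构的,并非来自原书):当事务中靠后的写入失败时,靠前的写入也会被撤销,因此重试可以从干净的状态开始。

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

try:
    # Transfer 200 from account 1 to account 2: more than account 1 holds.
    conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 2")
    conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 1")  # violates CHECK
    conn.commit()
except sqlite3.IntegrityError:
    conn.rollback()  # abort: the first UPDATE is undone as well

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 100, 2: 50} -- neither write took effect
```

Because the abort discarded both writes, the application could simply re-run the whole transfer without risking a duplicated credit.

由于中止丢弃了两次写入,应用程序可以直接重新执行整个转账,而不必担心重复入账。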
Consistency
The word consistency is terribly overloaded:
一致性这个词承载了太多的含义:
-
In Chapter 5 we discussed replica consistency and the issue of eventual consistency that arises in asynchronously replicated systems (see “Problems with Replication Lag” ).
在第五章中,我们讨论了复制一致性以及异步复制系统中可能出现的最终一致性问题(请参阅“复制滞后问题”)。
-
Consistent hashing is an approach to partitioning that some systems use for rebalancing (see “Consistent Hashing” ).
"一致性哈希是一些系统用于重新平衡分区的一种方法(请参见“一致性哈希”)。"
-
In the CAP theorem (see Chapter 9 ), the word consistency is used to mean linearizability (see “Linearizability” ).
在CAP定理(参见第9章)中,“一致性”一词是用来指代可线性化性(参见“线性化”)。
-
In the context of ACID, consistency refers to an application-specific notion of the database being in a “good state.”
在ACID的背景下,一致性指的是数据库处于“良好状态”的应用程序特定概念。
It’s unfortunate that the same word is used with at least four different meanings.
同一个词至少有四种不同的含义,这实在令人遗憾。
The idea of ACID consistency is that you have certain statements about your data ( invariants ) that must always be true—for example, in an accounting system, credits and debits across all accounts must always be balanced. If a transaction starts with a database that is valid according to these invariants, and any writes during the transaction preserve the validity, then you can be sure that the invariants are always satisfied.
ACID一致性的理念在于,你对数据有某些必须始终为真的陈述(不变量)。例如,在一个会计系统中,所有账户的借方与贷方必须始终保持平衡。如果一个事务开始于一个根据这些不变量有效的数据库,并且事务期间的任何写入都保持了这种有效性,那么你可以确信这些不变量总是被满足的。
However, this idea of consistency depends on the application’s notion of invariants, and it’s the application’s responsibility to define its transactions correctly so that they preserve consistency. This is not something that the database can guarantee: if you write bad data that violates your invariants, the database can’t stop you. (Some specific kinds of invariants can be checked by the database, for example using foreign key constraints or uniqueness constraints. However, in general, the application defines what data is valid or invalid—the database only stores it.)
然而,这种一致性概念取决于应用程序对不变量的定义,正确地定义事务以保持一致性是应用程序的责任。这不是数据库能够保证的事情:如果你写入了违反不变量的坏数据,数据库无法阻止你。(某些特定类型的不变量可以由数据库检查,例如使用外键约束或唯一性约束。但一般来说,是应用程序来定义什么样的数据有效、什么样的数据无效,数据库只负责存储。)
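The split between database-checked and application-defined invariants can be sketched as follows (a toy schema invented for illustration, using Python's sqlite3; note that SQLite only enforces foreign keys when the pragma is enabled): the database rejects referential and uniqueness violations, but an application-level rule is invisible to it.

数据库可检查的不变量与应用程序定义的不变量之间的区别可以用下面的示意说明(使用Python的sqlite3,示例schema为演示而虚构;注意SQLite只有在开启相应pragma后才强制执行外键):数据库会拒绝违反引用完整性和唯一性的写入,但应用层面的规则对它是不可见的。

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, email TEXT UNIQUE);
    CREATE TABLE posts (id INTEGER PRIMARY KEY,
                        author_id INTEGER REFERENCES authors(id));
""")
conn.execute("INSERT INTO authors VALUES (1, '[email protected]')")

# Invariants the database itself can check: referential integrity...
try:
    conn.execute("INSERT INTO posts VALUES (10, 999)")  # no author with id 999
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

# ...and uniqueness.
try:
    conn.execute("INSERT INTO authors VALUES (2, '[email protected]')")  # duplicate email
    unique_rejected = False
except sqlite3.IntegrityError:
    unique_rejected = True

print(fk_rejected, unique_rejected)  # True True
# An application-level invariant, e.g. "credits and debits must balance",
# is invisible to the database: only correctly written transactions preserve it.
```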
Atomicity, isolation, and durability are properties of the database, whereas consistency (in the ACID sense) is a property of the application. The application may rely on the database’s atomicity and isolation properties in order to achieve consistency, but it’s not up to the database alone. Thus, the letter C doesn’t really belong in ACID. i
原子性、隔离性和持久性是数据库的属性,而一致性(在ACID的意义上)是应用程序的属性。 应用程序可以依赖于数据库的原子性和隔离性属性以实现一致性,但它不只是数据库的责任。 因此,字母C实际上不属于ACID。
Isolation
Most databases are accessed by several clients at the same time. That is no problem if they are reading and writing different parts of the database, but if they are accessing the same database records, you can run into concurrency problems (race conditions).
大多数数据库同时被多个客户端访问。如果它们读取和写入数据库的不同部分,那么就没有问题。但是,如果它们同时访问同一数据库记录,就可能出现并发问题(竞态条件)。
Figure 7-1 is a simple example of this kind of problem. Say you have two clients simultaneously incrementing a counter that is stored in a database. Each client needs to read the current value, add 1, and write the new value back (assuming there is no increment operation built into the database). In Figure 7-1 the counter should have increased from 42 to 44, because two increments happened, but it actually only went to 43 because of the race condition.
图7-1是这种问题的一个简单示例。假设您有两个客户端同时对存储在数据库中的计数器进行增量。每个客户端都需要读取当前值,加1,并将新值写回(假设数据库中没有增量操作)。在图7-1中,计数器应该从42增加到44,因为发生了两次增量,但由于竞争条件,它实际上只增加到了43。
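The interleaving of Figure 7-1 can be simulated on a single connection (a sketch using Python's sqlite3; this reproduces the unlucky ordering deterministically rather than using real concurrent clients): both "clients" read 42 before either writes, so one increment is lost, whereas an increment performed inside the database as a single operation is not.

图7-1的交错执行可以在单个连接上进行模拟(使用Python的sqlite3的示意;这里以确定性的方式重现那种不走运的执行顺序,而不是使用真正并发的客户端):两个"客户端"都在对方写入之前读到了42,于是其中一次增量丢失了;而作为单个操作在数据库内部完成的增量则不会丢失。

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (id INTEGER PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES (1, 42)")
conn.commit()

def read_value():
    return conn.execute("SELECT value FROM counters WHERE id = 1").fetchone()[0]

# Unsafe read-modify-write: both "clients" read before either one writes.
v1 = read_value()  # client 1 reads 42
v2 = read_value()  # client 2 reads 42
conn.execute("UPDATE counters SET value = ? WHERE id = 1", (v1 + 1,))  # client 1 writes 43
conn.execute("UPDATE counters SET value = ? WHERE id = 1", (v2 + 1,))  # client 2 also writes 43
after_race = read_value()    # 43: one of the two increments was lost

# An increment performed inside the database avoids the race entirely.
conn.execute("UPDATE counters SET value = value + 1 WHERE id = 1")
after_atomic = read_value()  # 44
print(after_race, after_atomic)  # 43 44
```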
Isolation in the sense of ACID means that concurrently executing transactions are isolated from each other: they cannot step on each other’s toes. The classic database textbooks formalize isolation as serializability , which means that each transaction can pretend that it is the only transaction running on the entire database. The database ensures that when the transactions have committed, the result is the same as if they had run serially (one after another), even though in reality they may have run concurrently [ 10 ].
ACID中的隔离是指并发执行的事务相互隔离,它们不能相互影响。经典的数据库教材将隔离形式化为串行化,这意味着每个事务都可以假装它是整个数据库上运行的唯一事务。数据库确保在事务提交时,结果与它们串行运行(一个接一个地)时的结果相同,即使事实上它们可能已经并发运行[10]。
However, in practice, serializable isolation is rarely used, because it carries a performance penalty. Some popular databases, such as Oracle 11g, don’t even implement it. In Oracle there is an isolation level called “serializable,” but it actually implements something called snapshot isolation , which is a weaker guarantee than serializability [ 8 , 11 ]. We will explore snapshot isolation and other forms of isolation in “Weak Isolation Levels” .
然而,在实践中,可序列化隔离很少被使用,因为它会带来性能损失。一些流行的数据库,如 Oracle 11g,甚至没有实现它。在 Oracle 中有一种叫做“可序列化”的隔离级别,但实际上它实现的是一种叫做快照隔离的东西,这种保证比可序列化要弱[8, 11]。我们将在“弱隔离级别”中探讨快照隔离和其他形式的隔离。
Durability
The purpose of a database system is to provide a safe place where data can be stored without fear of losing it. Durability is the promise that once a transaction has committed successfully, any data it has written will not be forgotten, even if there is a hardware fault or the database crashes.
数据库系统的目的是提供一个安全的存储数据的地方,使用户不必担心数据丢失。持久性是保证一旦事务成功提交,它写入的任何数据都不会被遗忘的承诺,即使存在硬件故障或数据库崩溃。
In a single-node database, durability typically means that the data has been written to nonvolatile storage such as a hard drive or SSD. It usually also involves a write-ahead log or similar (see “Making B-trees reliable” ), which allows recovery in the event that the data structures on disk are corrupted. In a replicated database, durability may mean that the data has been successfully copied to some number of nodes. In order to provide a durability guarantee, a database must wait until these writes or replications are complete before reporting a transaction as successfully committed.
在单节点数据库中,持久性通常意味着数据已被写入非易失性存储,例如硬盘或固态硬盘。通常它也包括一个预写日志或类似物(请参见"使B树可靠"),它允许在磁盘上的数据结构损坏时进行恢复。在复制的数据库中,持久性可能意味着数据已成功复制到某些节点。为了提供持久性保证,数据库必须等待这些写入或复制完成才能报告一个事务成功提交。
As discussed in “Reliability” , perfect durability does not exist: if all your hard disks and all your backups are destroyed at the same time, there’s obviously nothing your database can do to save you.
正如在“可靠性”中所讨论的那样,没有完美的耐久性:如果您的所有硬盘和所有备份同时被销毁,那么您的数据库显然无法挽救您。
Single-Object and Multi-Object Operations
To recap, in ACID, atomicity and isolation describe what the database should do if a client makes several writes within the same transaction:
回顾一下,在ACID中,原子性和隔离性描述了如果客户端在同一事务中进行了多次写操作,数据库应该采取的措施:
- Atomicity
-
If an error occurs halfway through a sequence of writes, the transaction should be aborted, and the writes made up to that point should be discarded. In other words, the database saves you from having to worry about partial failure, by giving an all-or-nothing guarantee.
如果在一系列写操作执行到中途时发生错误,事务应当中止,并丢弃到该点为止已完成的写入。换句话说,数据库通过提供全有或全无的保证,使你免于担心部分失败。
- Isolation
-
Concurrently running transactions shouldn’t interfere with each other. For example, if one transaction makes several writes, then another transaction should see either all or none of those writes, but not some subset.
并发运行的事务不应互相干扰。例如,如果一个事务进行了多次写入,那么另一个事务应当要么看到其全部写入,要么什么都看不到,而不是其中的某个子集。
These definitions assume that you want to modify several objects (rows, documents, records) at once. Such multi-object transactions are often needed if several pieces of data need to be kept in sync. Figure 7-2 shows an example from an email application. To display the number of unread messages for a user, you could query something like:
这些定义假定你想要一次修改多个对象(行、文档、记录)。如果有多条数据需要保持同步,通常就需要这种多对象事务。图7-2展示了一个电子邮件应用的例子。要显示用户的未读消息数,你可以执行类似下面的查询:
SELECT COUNT(*) FROM emails
  WHERE recipient_id = 2 AND unread_flag = true
However, you might find this query to be too slow if there are many emails, and decide to store the number of unread messages in a separate field (a kind of denormalization). Now, whenever a new message comes in, you have to increment the unread counter as well, and whenever a message is marked as read, you also have to decrement the unread counter.
然而,如果有许多电子邮件,您可能会发现此查询速度太慢,决定在单独的字段中存储未读邮件的数量(一种非规范化)。现在,每当有新消息进来时,您还必须增加未读计数器,每当将消息标记为已读时,您也必须减少未读计数器。
In Figure 7-2 , user 2 experiences an anomaly: the mailbox listing shows an unread message, but the counter shows zero unread messages because the counter increment has not yet happened. ii Isolation would have prevented this issue by ensuring that user 2 sees either both the inserted email and the updated counter, or neither, but not an inconsistent halfway point.
在图7-2中,用户2遇到了异常情况:邮箱列表显示未读邮件,但计数器显示为零未读邮件,因为计数器增量尚未发生。隔离将通过确保用户2同时看到插入的电子邮件和更新的计数器或两者都不看到而避免此问题,而不是出现不一致的中间状态。
Figure 7-3 illustrates the need for atomicity: if an error occurs somewhere over the course of the transaction, the contents of the mailbox and the unread counter might become out of sync. In an atomic transaction, if the update to the counter fails, the transaction is aborted and the inserted email is rolled back.
图7-3展示了原子性的必要性:如果事务的过程中发生了错误,邮箱的内容和未读邮件计数器可能不同步。在一个原子事务中,如果计数器更新失败,事务将被中止并且插入的邮件将被回滚。
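The email scenario of Figures 7-2 and 7-3 can be sketched as follows (using Python's sqlite3; the schema and the simulated crash are invented for illustration): the insert and the counter update are grouped into one transaction, so when a fault strikes between them, both are rolled back and the denormalized counter never drifts out of sync.

图7-2和图7-3的电子邮件场景可以用下面的示意来表达(使用Python的sqlite3;schema和模拟的崩溃都是为演示而虚构的):插入邮件和更新计数器被组合到同一个事务中,因此当故障发生在二者之间时,两个写入会一起被回滚,非规范化的计数器永远不会失去同步。

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emails (recipient_id INTEGER, body TEXT, unread_flag BOOLEAN);
    CREATE TABLE mailboxes (recipient_id INTEGER PRIMARY KEY, unread INTEGER NOT NULL);
    INSERT INTO mailboxes VALUES (2, 0);
""")

def deliver(recipient_id, body, fail=False):
    """Insert the email and bump the denormalized counter atomically."""
    try:
        conn.execute("INSERT INTO emails VALUES (?, ?, 1)", (recipient_id, body))
        if fail:
            raise RuntimeError("simulated crash before the counter update")
        conn.execute("UPDATE mailboxes SET unread = unread + 1 WHERE recipient_id = ?",
                     (recipient_id,))
        conn.commit()
    except Exception:
        conn.rollback()  # the inserted email is rolled back too

deliver(2, "Hello")             # succeeds: one email, counter = 1
deliver(2, "World", fail=True)  # aborted: email and counter stay in sync

emails = conn.execute(
    "SELECT COUNT(*) FROM emails WHERE recipient_id = 2 AND unread_flag = 1"
).fetchone()[0]
unread = conn.execute(
    "SELECT unread FROM mailboxes WHERE recipient_id = 2"
).fetchone()[0]
print(emails, unread)  # 1 1 -- the listing and the counter agree
```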
Multi-object transactions require some way of determining which read and write operations belong to the same transaction. In relational databases, that is typically done based on the client’s TCP connection to the database server: on any particular connection, everything between a BEGIN TRANSACTION and a COMMIT statement is considered to be part of the same transaction. iii
多对象事务需要某种方式来确定哪些读写操作属于同一个事务。在关系型数据库中,这通常基于客户端与数据库服务器之间的TCP连接来确定:在任何特定的连接上,BEGIN TRANSACTION语句和COMMIT语句之间的所有内容都被认为属于同一个事务。iii
On the other hand, many nonrelational databases don’t have such a way of grouping operations together. Even if there is a multi-object API (for example, a key-value store may have a multi-put operation that updates several keys in one operation), that doesn’t necessarily mean it has transaction semantics: the command may succeed for some keys and fail for others, leaving the database in a partially updated state.
然而,许多非关系型数据库没有这种将操作分组在一起的方式。即使存在多对象API(例如,某个键值存储可能提供在一次操作中更新多个键的multi-put操作),也不一定意味着它具有事务语义:该命令可能对某些键成功,而对其他键失败,使数据库处于部分更新的状态。
Single-object writes
Atomicity and isolation also apply when a single object is being changed. For example, imagine you are writing a 20 KB JSON document to a database:
原子性和隔离性也适用于单个对象的更改。例如,想象一下,您要将一个20 KB的JSON文档写入数据库:
-
If the network connection is interrupted after the first 10 KB have been sent, does the database store that unparseable 10 KB fragment of JSON?
如果在第一个10KB被发送后,网络连接中断,数据库是否会存储无法解析的10KB JSON片段?
-
If the power fails while the database is in the middle of overwriting the previous value on disk, do you end up with the old and new values spliced together?
如果在数据库正在覆盖磁盘上的先前值时出现电力故障,您最终会得到旧值和新值拼接在一起吗?
-
If another client reads that document while the write is in progress, will it see a partially updated value?
如果在写入过程中另一个客户端读取该文档,它会看到部分更新的值吗?
Those issues would be incredibly confusing, so storage engines almost universally aim to provide atomicity and isolation on the level of a single object (such as a key-value pair) on one node. Atomicity can be implemented using a log for crash recovery (see “Making B-trees reliable” ), and isolation can be implemented using a lock on each object (allowing only one thread to access an object at any one time).
这些问题将极其混乱,因此存储引擎几乎普遍旨在为单个对象(例如键值对)的某个节点提供原子性和隔离性。可以使用日志实现原子性以进行崩溃恢复(请参阅“使B树可靠”),并且可以使用每个对象上的锁实现隔离(仅允许一个线程在任何时间访问对象)。
Some databases also provide more complex atomic operations, iv such as an increment operation, which removes the need for a read-modify-write cycle like that in Figure 7-1 . Similarly popular is a compare-and-set operation, which allows a write to happen only if the value has not been concurrently changed by someone else (see “Compare-and-set” ).
一些数据库也提供更复杂的原子操作,例如增量操作,它消除了像图7-1中的读取-修改-写入循环的需要。同样受欢迎的是比较和设置操作,它仅允许在值未被他人同时更改时进行写入(请参阅“比较和设置”)。
These single-object operations are useful, as they can prevent lost updates when several clients try to write to the same object concurrently (see “Preventing Lost Updates” ). However, they are not transactions in the usual sense of the word. Compare-and-set and other single-object operations have been dubbed “lightweight transactions” or even “ACID” for marketing purposes [ 20 , 21 , 22 ], but that terminology is misleading. A transaction is usually understood as a mechanism for grouping multiple operations on multiple objects into one unit of execution.
这些单对象操作非常有用,因为当多个客户端同时尝试写入同一个对象时,它们可以防止丢失更新(参见"防止丢失更新")。然而,它们并不是通常意义上的事务。出于营销目的,比较并设置以及其他单对象操作有时被称为"轻量级事务",甚至被称为"ACID"[20,21,22],但这种术语是误导性的。事务通常被理解为一种将针对多个对象的多个操作组合成一个执行单元的机制。
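A compare-and-set operation can be sketched in plain SQL (using Python's sqlite3; the pages table is invented for illustration): the condition is folded into the UPDATE's WHERE clause, and a row count of zero tells the client that someone else changed the value first.

比较并设置操作可以用普通SQL来示意(使用Python的sqlite3;pages表为演示而虚构):把条件折叠进UPDATE的WHERE子句,受影响行数为零即告诉客户端有人抢先修改了该值。

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO pages VALUES (1, 'draft v1')")
conn.commit()

def compare_and_set(page_id, expected, new):
    """Write only if nobody has changed the value since we read it."""
    cur = conn.execute(
        "UPDATE pages SET content = ? WHERE id = ? AND content = ?",
        (new, page_id, expected),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated: someone else got there first

ok = compare_and_set(1, "draft v1", "draft v2")     # succeeds
stale = compare_and_set(1, "draft v1", "draft v3")  # fails: value is now "draft v2"
print(ok, stale)  # True False
```

Note that this protects a single object only; it says nothing about keeping several objects in sync, which is exactly the gap the next section turns to.

注意,这只保护单个对象;它对保持多个对象之间的同步无能为力,而这正是下一节要讨论的问题。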
The need for multi-object transactions
Many distributed datastores have abandoned multi-object transactions because they are difficult to implement across partitions, and they can get in the way in some scenarios where very high availability or performance is required. However, there is nothing that fundamentally prevents transactions in a distributed database, and we will discuss implementations of distributed transactions in Chapter 9 .
许多分布式数据存储已经放弃了多对象事务,因为跨分区实现起来很困难,并且在需要非常高的可用性或性能的某些情况下可能会妨碍。然而,在分布式数据库中实现事务没有什么根本上的阻碍,我们将在第9章中讨论分布式事务的实现。
But do we need multi-object transactions at all? Would it be possible to implement any application with only a key-value data model and single-object operations?
我们真的需要多对象事务吗?是否可能只通过键值数据模型和单对象操作来实现任何应用程序?
There are some use cases in which single-object inserts, updates, and deletes are sufficient. However, in many other cases writes to several different objects need to be coordinated:
有些使用情况下,单个对象的插入、更新和删除就足够了。但是,在许多其他情况下,需要协调对多个不同对象的写入。
-
In a relational data model, a row in one table often has a foreign key reference to a row in another table. (Similarly, in a graph-like data model, a vertex has edges to other vertices.) Multi-object transactions allow you to ensure that these references remain valid: when inserting several records that refer to one another, the foreign keys have to be correct and up to date, or the data becomes nonsensical.
在关系数据模型中,一个表中的一行通常具有到另一个表中一行的外键引用。(同样,在类似图形的数据模型中,一个顶点有到其他顶点的边缘。)多个对象的事务允许您确保这些引用保持有效:在插入互相引用的多条记录时,外键必须正确和最新,否则数据就变得没有意义。
-
In a document data model, the fields that need to be updated together are often within the same document, which is treated as a single object—no multi-object transactions are needed when updating a single document. However, document databases lacking join functionality also encourage denormalization (see “Relational Versus Document Databases Today” ). When denormalized information needs to be updated, like in the example of Figure 7-2 , you need to update several documents in one go. Transactions are very useful in this situation to prevent denormalized data from going out of sync.
在文档数据模型中,需要同时更新的字段通常在同一文档中,该文档被视为单个对象 - 更新单个文档时不需要多个对象事务。然而,缺乏连接功能的文档数据库也鼓励去规范化(见“关系和文档数据库的比较”)。当需要更新非规范化信息时,例如图7-2中的示例,您需要一次更新多个文档。在这种情况下,事务非常有用,可以防止去规范化数据失去同步。
-
In databases with secondary indexes (almost everything except pure key-value stores), the indexes also need to be updated every time you change a value. These indexes are different database objects from a transaction point of view: for example, without transaction isolation, it’s possible for a record to appear in one index but not another, because the update to the second index hasn’t happened yet.
Such applications can still be implemented without transactions. However, error handling becomes much more complicated without atomicity, and the lack of isolation can cause concurrency problems. We will discuss those in “Weak Isolation Levels” , and explore alternative approaches in Chapter 12 .
Handling errors and aborts
A key feature of a transaction is that it can be aborted and safely retried if an error occurs. ACID databases are based on this philosophy: if the database is in danger of violating its guarantee of atomicity, isolation, or durability, it would rather abandon the transaction entirely than allow it to remain half-finished.
Not all systems follow that philosophy, though. In particular, datastores with leaderless replication (see “Leaderless Replication” ) work much more on a “best effort” basis, which could be summarized as “the database will do as much as it can, and if it runs into an error, it won’t undo something it has already done”—so it’s the application’s responsibility to recover from errors.
Errors will inevitably happen, but many software developers prefer to think only about the happy path rather than the intricacies of error handling. For example, popular object-relational mapping (ORM) frameworks such as Rails’s ActiveRecord and Django don’t retry aborted transactions—the error usually results in an exception bubbling up the stack, so any user input is thrown away and the user gets an error message. This is a shame, because the whole point of aborts is to enable safe retries.
Although retrying an aborted transaction is a simple and effective error handling mechanism, it isn’t perfect:
- If the transaction actually succeeded, but the network failed while the server tried to acknowledge the successful commit to the client (so the client thinks it failed), then retrying the transaction causes it to be performed twice—unless you have an additional application-level deduplication mechanism in place.
- If the error is due to overload, retrying the transaction will make the problem worse, not better. To avoid such feedback cycles, you can limit the number of retries, use exponential backoff, and handle overload-related errors differently from other errors (if possible).
- It is only worth retrying after transient errors (for example due to deadlock, isolation violation, temporary network interruptions, and failover); after a permanent error (e.g., constraint violation) a retry would be pointless.
- If the transaction also has side effects outside of the database, those side effects may happen even if the transaction is aborted. For example, if you’re sending an email, you wouldn’t want to send the email again every time you retry the transaction. If you want to make sure that several different systems either commit or abort together, two-phase commit can help (we will discuss this in “Atomic Commit and Two-Phase Commit (2PC)”).
- If the client process fails while retrying, any data it was trying to write to the database is lost.
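The retry guidance above—retry only transient errors, cap the number of attempts, and back off exponentially with jitter—can be sketched as follows. The exception classes and the `run_with_retries` helper are hypothetical names standing in for whatever your database driver actually raises:

```python
import random
import time

# Hypothetical error classes standing in for driver-specific exceptions.
class TransientError(Exception):
    """Deadlock, failover, temporary network interruption, overload."""

class PermanentError(Exception):
    """E.g., a constraint violation: retrying would be pointless."""

def run_with_retries(transaction_fn, max_retries=5, base_delay=0.05):
    """Run transaction_fn, retrying only transient errors with
    exponential backoff plus jitter; permanent errors propagate."""
    for attempt in range(max_retries + 1):
        try:
            return transaction_fn()
        except PermanentError:
            raise  # no point retrying a constraint violation
        except TransientError:
            if attempt == max_retries:
                raise  # give up; surface the error to the caller
            # Exponential backoff with jitter to avoid retry stampedes.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Note that this sketch does nothing about the first caveat in the list: if the commit succeeded but the acknowledgment was lost, the retry executes the transaction twice unless the application deduplicates.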
Weak Isolation Levels
If two transactions don’t touch the same data, they can safely be run in parallel, because neither depends on the other. Concurrency issues (race conditions) only come into play when one transaction reads data that is concurrently modified by another transaction, or when two transactions try to simultaneously modify the same data.
Concurrency bugs are hard to find by testing, because such bugs are only triggered when you get unlucky with the timing. Such timing issues might occur very rarely, and are usually difficult to reproduce. Concurrency is also very difficult to reason about, especially in a large application where you don’t necessarily know which other pieces of code are accessing the database. Application development is difficult enough if you just have one user at a time; having many concurrent users makes it much harder still, because any piece of data could unexpectedly change at any time.
For that reason, databases have long tried to hide concurrency issues from application developers by providing transaction isolation . In theory, isolation should make your life easier by letting you pretend that no concurrency is happening: serializable isolation means that the database guarantees that transactions have the same effect as if they ran serially (i.e., one at a time, without any concurrency).
In practice, isolation is unfortunately not that simple. Serializable isolation has a performance cost, and many databases don’t want to pay that price [ 8 ]. It’s therefore common for systems to use weaker levels of isolation, which protect against some concurrency issues, but not all. Those levels of isolation are much harder to understand, and they can lead to subtle bugs, but they are nevertheless used in practice [ 23 ].
Concurrency bugs caused by weak transaction isolation are not just a theoretical problem. They have caused substantial loss of money [ 24 , 25 ], led to investigation by financial auditors [ 26 ], and caused customer data to be corrupted [ 27 ]. A popular comment on revelations of such problems is “Use an ACID database if you’re handling financial data!”—but that misses the point. Even many popular relational database systems (which are usually considered “ACID”) use weak isolation, so they wouldn’t necessarily have prevented these bugs from occurring.
Rather than blindly relying on tools, we need to develop a good understanding of the kinds of concurrency problems that exist, and how to prevent them. Then we can build applications that are reliable and correct, using the tools at our disposal.
In this section we will look at several weak (nonserializable) isolation levels that are used in practice, and discuss in detail what kinds of race conditions can and cannot occur, so that you can decide what level is appropriate to your application. Once we’ve done that, we will discuss serializability in detail (see “Serializability” ). Our discussion of isolation levels will be informal, using examples. If you want rigorous definitions and analyses of their properties, you can find them in the academic literature [ 28 , 29 , 30 ].
Read Committed
The most basic level of transaction isolation is read committed. It makes two guarantees:
- When reading from the database, you will only see data that has been committed (no dirty reads).
- When writing to the database, you will only overwrite data that has been committed (no dirty writes).
Let’s discuss these two guarantees in more detail.
No dirty reads
Imagine a transaction has written some data to the database, but the transaction has not yet committed or aborted. Can another transaction see that uncommitted data? If yes, that is called a dirty read [ 2 ].
Transactions running at the read committed isolation level must prevent dirty reads. This means that any writes by a transaction only become visible to others when that transaction commits (and then all of its writes become visible at once). This is illustrated in Figure 7-4 , where user 1 has set x = 3, but user 2’s get x still returns the old value, 2, while user 1 has not yet committed.
There are a few reasons why it’s useful to prevent dirty reads:
- If a transaction needs to update several objects, a dirty read means that another transaction may see some of the updates but not others. For example, in Figure 7-2, the user sees the new unread email but not the updated counter. This is a dirty read of the email. Seeing the database in a partially updated state is confusing to users and may cause other transactions to take incorrect decisions.
- If a transaction aborts, any writes it has made need to be rolled back (like in Figure 7-3). If the database allows dirty reads, that means a transaction may see data that is later rolled back—i.e., which is never actually committed to the database. Reasoning about the consequences quickly becomes mind-bending.
No dirty writes
What happens if two transactions concurrently try to update the same object in a database? We don’t know in which order the writes will happen, but we normally assume that the later write overwrites the earlier write.
However, what happens if the earlier write is part of a transaction that has not yet committed, so the later write overwrites an uncommitted value? This is called a dirty write [ 28 ]. Transactions running at the read committed isolation level must prevent dirty writes, usually by delaying the second write until the first write’s transaction has committed or aborted.
By preventing dirty writes, this isolation level avoids some kinds of concurrency problems:
- If transactions update multiple objects, dirty writes can lead to a bad outcome. For example, consider Figure 7-5, which illustrates a used car sales website on which two people, Alice and Bob, are simultaneously trying to buy the same car. Buying a car requires two database writes: the listing on the website needs to be updated to reflect the buyer, and the sales invoice needs to be sent to the buyer. In the case of Figure 7-5, the sale is awarded to Bob (because he performs the winning update to the `listings` table), but the invoice is sent to Alice (because she performs the winning update to the `invoices` table). Read committed prevents such mishaps.
- However, read committed does not prevent the race condition between two counter increments in Figure 7-1. In this case, the second write happens after the first transaction has committed, so it’s not a dirty write. It’s still incorrect, but for a different reason—in “Preventing Lost Updates” we will discuss how to make such counter increments safe.
Implementing read committed
Read committed is a very popular isolation level. It is the default setting in Oracle 11g, PostgreSQL, SQL Server 2012, MemSQL, and many other databases [ 8 ].
Most commonly, databases prevent dirty writes by using row-level locks: when a transaction wants to modify a particular object (row or document), it must first acquire a lock on that object. It must then hold that lock until the transaction is committed or aborted. Only one transaction can hold the lock for any given object; if another transaction wants to write to the same object, it must wait until the first transaction is committed or aborted before it can acquire the lock and continue. This locking is done automatically by databases in read committed mode (or stronger isolation levels).
How do we prevent dirty reads? One option would be to use the same lock, and to require any transaction that wants to read an object to briefly acquire the lock and then release it again immediately after reading. This would ensure that a read couldn’t happen while an object has a dirty, uncommitted value (because during that time the lock would be held by the transaction that has made the write).
However, the approach of requiring read locks does not work well in practice, because one long-running write transaction can force many read-only transactions to wait until the long-running transaction has completed. This harms the response time of read-only transactions and is bad for operability: a slowdown in one part of an application can have a knock-on effect in a completely different part of the application, due to waiting for locks.
For that reason, most databases prevent dirty reads using the approach illustrated in Figure 7-4: for every object that is written, the database remembers both the old committed value and the new value set by the transaction that currently holds the write lock. While the transaction is ongoing, any other transactions that read the object are simply given the old value. Only when the new value is committed do transactions switch over to reading the new value.
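A minimal sketch of this scheme (hypothetical class and method names, not any real database's internals): each object keeps its last committed value plus at most one uncommitted value written by the transaction holding the write lock, and readers are always served the committed one:

```python
class ReadCommittedStore:
    """Toy model of the Figure 7-4 scheme: per key, one committed value
    plus at most one uncommitted value from the lock-holding transaction."""

    def __init__(self):
        self.committed = {}    # key -> last committed value
        self.uncommitted = {}  # key -> (txid, value) while a tx is in flight

    def write(self, txid, key, value):
        # The write lock: only one transaction may have an uncommitted
        # value per key; a second writer would have to wait (here: error).
        holder = self.uncommitted.get(key)
        if holder is not None and holder[0] != txid:
            raise RuntimeError("write lock held by another transaction")
        self.uncommitted[key] = (txid, value)

    def read(self, key):
        # Readers never see uncommitted data: always the old value.
        return self.committed.get(key)

    def commit(self, txid):
        # All of the transaction's writes become visible at once.
        for key, (tid, value) in list(self.uncommitted.items()):
            if tid == txid:
                self.committed[key] = value
                del self.uncommitted[key]
```

This reproduces the Figure 7-4 behavior: while transaction 1 has set x = 3 but not yet committed, any reader still gets the old committed value 2.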
Snapshot Isolation and Repeatable Read
If you look superficially at read committed isolation, you could be forgiven for thinking that it does everything that a transaction needs to do: it allows aborts (required for atomicity), it prevents reading the incomplete results of transactions, and it prevents concurrent writes from getting intermingled. Indeed, those are useful features, and much stronger guarantees than you can get from a system that has no transactions.
However, there are still plenty of ways in which you can have concurrency bugs when using this isolation level. For example, Figure 7-6 illustrates a problem that can occur with read committed.
Say Alice has $1,000 of savings at a bank, split across two accounts with $500 each. Now a transaction transfers $100 from one of her accounts to the other. If she is unlucky enough to look at her list of account balances in the same moment as that transaction is being processed, she may see one account balance at a time before the incoming payment has arrived (with a balance of $500), and the other account after the outgoing transfer has been made (the new balance being $400). To Alice it now appears as though she only has a total of $900 in her accounts—it seems that $100 has vanished into thin air.
This anomaly is called a nonrepeatable read or read skew : if Alice were to read the balance of account 1 again at the end of the transaction, she would see a different value ($600) than she saw in her previous query. Read skew is considered acceptable under read committed isolation: the account balances that Alice saw were indeed committed at the time when she read them.
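Alice's anomaly can be reproduced with a deterministic interleaving, modeling read committed as "each read sees the latest committed value at the moment it runs":

```python
# Latest committed balances; under read committed, each of Alice's
# reads sees whatever is committed at the moment that read executes.
accounts = {1: 500, 2: 500}

seen = []
seen.append(accounts[1])   # Alice reads account 1 before the transfer: $500

# The transfer transaction commits between her two reads.
accounts[1] += 100         # incoming $100 (account 1 is now $600)
accounts[2] -= 100         # outgoing $100 (account 2 is now $400)

seen.append(accounts[2])   # Alice reads account 2 after the transfer: $400
```

Both values Alice saw were committed at the time she read them, yet `sum(seen)` is $900 rather than $1,000: exactly the read skew described above.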
Note
The term skew is unfortunately overloaded: we previously used it in the sense of an unbalanced workload with hot spots (see “Skewed Workloads and Relieving Hot Spots” ), whereas here it means timing anomaly .
In Alice’s case, this is not a lasting problem, because she will most likely see consistent account balances if she reloads the online banking website a few seconds later. However, some situations cannot tolerate such temporary inconsistency:
- Backups
  Taking a backup requires making a copy of the entire database, which may take hours on a large database. During the time that the backup process is running, writes will continue to be made to the database. Thus, you could end up with some parts of the backup containing an older version of the data, and other parts containing a newer version. If you need to restore from such a backup, the inconsistencies (such as disappearing money) become permanent.
- Analytic queries and integrity checks
  Sometimes, you may want to run a query that scans over large parts of the database. Such queries are common in analytics (see “Transaction Processing or Analytics?”), or may be part of a periodic integrity check that everything is in order (monitoring for data corruption). These queries are likely to return nonsensical results if they observe parts of the database at different points in time.
Snapshot isolation [ 28 ] is the most common solution to this problem. The idea is that each transaction reads from a consistent snapshot of the database—that is, the transaction sees all the data that was committed in the database at the start of the transaction. Even if the data is subsequently changed by another transaction, each transaction sees only the old data from that particular point in time.
Snapshot isolation is a boon for long-running, read-only queries such as backups and analytics. It is very hard to reason about the meaning of a query if the data on which it operates is changing at the same time as the query is executing. When a transaction can see a consistent snapshot of the database, frozen at a particular point in time, it is much easier to understand.
Snapshot isolation is a popular feature: it is supported by PostgreSQL, MySQL with the InnoDB storage engine, Oracle, SQL Server, and others [ 23 , 31 , 32 ].
Implementing snapshot isolation
Like read committed isolation, implementations of snapshot isolation typically use write locks to prevent dirty writes (see “Implementing read committed” ), which means that a transaction that makes a write can block the progress of another transaction that writes to the same object. However, reads do not require any locks. From a performance point of view, a key principle of snapshot isolation is readers never block writers, and writers never block readers . This allows a database to handle long-running read queries on a consistent snapshot at the same time as processing writes normally, without any lock contention between the two.
To implement snapshot isolation, databases use a generalization of the mechanism we saw for preventing dirty reads in Figure 7-4 . The database must potentially keep several different committed versions of an object, because various in-progress transactions may need to see the state of the database at different points in time. Because it maintains several versions of an object side by side, this technique is known as multi-version concurrency control (MVCC).
If a database only needed to provide read committed isolation, but not snapshot isolation, it would be sufficient to keep two versions of an object: the committed version and the overwritten-but-not-yet-committed version. However, storage engines that support snapshot isolation typically use MVCC for their read committed isolation level as well. A typical approach is that read committed uses a separate snapshot for each query, while snapshot isolation uses the same snapshot for an entire transaction.
Figure 7-7 illustrates how MVCC-based snapshot isolation is implemented in PostgreSQL [31] (other implementations are similar). When a transaction is started, it is given a unique, always-increasing transaction ID (`txid`). Whenever a transaction writes anything to the database, the data it writes is tagged with the transaction ID of the writer.
Each row in a table has a `created_by` field, containing the ID of the transaction that inserted this row into the table. Moreover, each row has a `deleted_by` field, which is initially empty. If a transaction deletes a row, the row isn’t actually deleted from the database, but it is marked for deletion by setting the `deleted_by` field to the ID of the transaction that requested the deletion. At some later time, when it is certain that no transaction can any longer access the deleted data, a garbage collection process in the database removes any rows marked for deletion and frees their space.
An update is internally translated into a delete and a create. For example, in Figure 7-7, transaction 13 deducts $100 from account 2, changing the balance from $500 to $400. The `accounts` table now actually contains two rows for account 2: a row with a balance of $500 which was marked as deleted by transaction 13, and a row with a balance of $400 which was created by transaction 13.
Visibility rules for observing a consistent snapshot
When a transaction reads from the database, transaction IDs are used to decide which objects it can see and which are invisible. By carefully defining visibility rules, the database can present a consistent snapshot of the database to the application. This works as follows:
1. At the start of each transaction, the database makes a list of all the other transactions that are in progress (not yet committed or aborted) at that time. Any writes that those transactions have made are ignored, even if the transactions subsequently commit.
2. Any writes made by aborted transactions are ignored.
3. Any writes made by transactions with a later transaction ID (i.e., which started after the current transaction started) are ignored, regardless of whether those transactions have committed.
4. All other writes are visible to the application’s queries.
These rules apply to both creation and deletion of objects. In Figure 7-7 , when transaction 12 reads from account 2, it sees a balance of $500 because the deletion of the $500 balance was made by transaction 13 (according to rule 3, transaction 12 cannot see a deletion made by transaction 13), and the creation of the $400 balance is not yet visible (by the same rule).
Put another way, an object is visible if both of the following conditions are true:
1. At the time when the reader’s transaction started, the transaction that created the object had already committed.
2. The object is not marked for deletion, or if it is, the transaction that requested deletion had not yet committed at the time when the reader’s transaction started.
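These visibility rules can be condensed into a single predicate. The sketch below uses the `created_by`/`deleted_by` fields described above; for simplicity it assumes writes by aborted transactions have already been removed, and that transaction IDs are assigned in increasing order at transaction start:

```python
def is_visible(row, reader_txid, in_progress_at_start):
    """Decide whether `row` belongs to the reader's consistent snapshot.

    row: dict with "created_by" and "deleted_by" txids (deleted_by is
         None while the row is live).
    in_progress_at_start: set of txids that were neither committed nor
         aborted when the reader's transaction started.
    """
    creator = row["created_by"]
    # The creating transaction must have committed before the reader
    # started: not in progress at that moment, and not started later.
    if creator in in_progress_at_start or creator > reader_txid:
        return False
    deleter = row["deleted_by"]
    if deleter is None:
        return True
    # A deletion only hides the row if the deleting transaction had
    # already committed when the reader started; otherwise it is ignored.
    return deleter in in_progress_at_start or deleter > reader_txid
```

Applied to Figure 7-7's account 2: for reader transaction 12, the $500 row (created by an old transaction, deleted by transaction 13) is still visible, while the $400 row created by transaction 13 is not.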
A long-running transaction may continue using a snapshot for a long time, continuing to read values that (from other transactions’ point of view) have long been overwritten or deleted. By never updating values in place but instead creating a new version every time a value is changed, the database can provide a consistent snapshot while incurring only a small overhead.
Indexes and snapshot isolation
How do indexes work in a multi-version database? One option is to have the index simply point to all versions of an object and require an index query to filter out any object versions that are not visible to the current transaction. When garbage collection removes old object versions that are no longer visible to any transaction, the corresponding index entries can also be removed.
In practice, many implementation details determine the performance of multi-version concurrency control. For example, PostgreSQL has optimizations for avoiding index updates if different versions of the same object can fit on the same page [ 31 ].
Another approach is used in CouchDB, Datomic, and LMDB. Although they also use B-trees (see “B-Trees” ), they use an append-only/copy-on-write variant that does not overwrite pages of the tree when they are updated, but instead creates a new copy of each modified page. Parent pages, up to the root of the tree, are copied and updated to point to the new versions of their child pages. Any pages that are not affected by a write do not need to be copied, and remain immutable [ 33 , 34 , 35 ].
With append-only B-trees, every write transaction (or batch of transactions) creates a new B-tree root, and a particular root is a consistent snapshot of the database at the point in time when it was created. There is no need to filter out objects based on transaction IDs because subsequent writes cannot modify an existing B-tree; they can only create new tree roots. However, this approach also requires a background process for compaction and garbage collection.
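The path-copying idea can be illustrated with an immutable binary search tree rather than a full B-tree: an insert copies only the nodes on the path from the root, so the old root remains a complete, consistent snapshot (a toy model under that simplification, not CouchDB's or LMDB's actual structure):

```python
class Node:
    """Immutable node of a copy-on-write binary search tree."""
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value = key, value
        self.left, self.right = left, right

def insert(root, key, value):
    """Return a NEW root; nodes off the modified path are shared,
    and the old root still describes the previous snapshot."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)  # overwrite in the copy

def get(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None
```

Holding on to an old root is all it takes to read a consistent snapshot, which is why no transaction-ID filtering is needed; the price is that unreferenced old roots must eventually be garbage-collected, as the text notes.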
Repeatable read and naming confusion
Snapshot isolation is a useful isolation level, especially for read-only transactions. However, many databases that implement it call it by different names. In Oracle it is called serializable , and in PostgreSQL and MySQL it is called repeatable read [ 23 ].
The reason for this naming confusion is that the SQL standard doesn’t have the concept of snapshot isolation, because the standard is based on System R’s 1975 definition of isolation levels [ 2 ] and snapshot isolation hadn’t yet been invented then. Instead, it defines repeatable read, which looks superficially similar to snapshot isolation. PostgreSQL and MySQL call their snapshot isolation level repeatable read because it meets the requirements of the standard, and so they can claim standards compliance.
Unfortunately, the SQL standard’s definition of isolation levels is flawed—it is ambiguous, imprecise, and not as implementation-independent as a standard should be [ 28 ]. Even though several databases implement repeatable read, there are big differences in the guarantees they actually provide, despite being ostensibly standardized [ 23 ]. There has been a formal definition of repeatable read in the research literature [ 29 , 30 ], but most implementations don’t satisfy that formal definition. And to top it off, IBM DB2 uses “repeatable read” to refer to serializability [ 8 ].
As a result, nobody really knows what repeatable read means.
Preventing Lost Updates
The read committed and snapshot isolation levels we’ve discussed so far have been primarily about the guarantees of what a read-only transaction can see in the presence of concurrent writes. We have mostly ignored the issue of two transactions writing concurrently—we have only discussed dirty writes (see “No dirty writes” ), one particular type of write-write conflict that can occur.
There are several other interesting kinds of conflicts that can occur between concurrently writing transactions. The best known of these is the lost update problem, illustrated in Figure 7-1 with the example of two concurrent counter increments.
The lost update problem can occur if an application reads some value from the database, modifies it, and writes back the modified value (a read-modify-write cycle ). If two transactions do this concurrently, one of the modifications can be lost, because the second write does not include the first modification. (We sometimes say that the later write clobbers the earlier write.) This pattern occurs in various different scenarios:
如果应用程序从数据库中读取某个值,对其进行修改,然后写回修改后的值(读取-修改-写入周期),则可能会发生丢失更新问题。如果两个事务同时执行此操作,则其中一个修改可能会丢失,因为第二次写入不包括第一次修改。(我们有时会说后来的写入覆盖了先前的写入。)这种模式在各种不同的情况下发生:
-
Incrementing a counter or updating an account balance (requires reading the current value, calculating the new value, and writing back the updated value)
增加计数器或更新账户余额(需要读取当前值、计算新值,然后写回更新后的值)。
-
Making a local change to a complex value, e.g., adding an element to a list within a JSON document (requires parsing the document, making the change, and writing back the modified document)
进行本地更改复杂值,例如在JSON文档中向列表添加元素(需要解析文档、进行更改并写回修改后的文档)。
-
Two users editing a wiki page at the same time, where each user saves their changes by sending the entire page contents to the server, overwriting whatever is currently in the database
两个用户同时编辑一个维基页面,在保存更改时,它们将整个页面内容发送到服务器,覆盖当前数据库中的内容。
Because this is such a common problem, a variety of solutions have been developed.
因为这是一个很常见的问题,所以已经开发出了各种解决方案。
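The read-modify-write race can be sketched in a few lines. In this illustrative Python snippet (the counter and its starting value are made up), two interleaved transactions each base their write on a stale read, so one increment disappears:

读取-修改-写入的竞态可以用几行代码概括。在下面这个仅作说明的Python示例中(计数器及其初始值是虚构的),两个交错的事务各自基于过期的读取进行写入,因此丢失了一次递增:

```python
# Illustration only: the counter and its starting value are made up.
def lost_update_demo():
    counter = {"value": 42}

    # Both transactions read before either one writes (a concurrent interleaving).
    read_by_tx1 = counter["value"]  # tx1 reads 42
    read_by_tx2 = counter["value"]  # tx2 also reads 42

    counter["value"] = read_by_tx1 + 1  # tx1 writes 43
    counter["value"] = read_by_tx2 + 1  # tx2 clobbers it with 43, not 44

    return counter["value"]

result = lost_update_demo()  # 43: one of the two increments is lost
```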
Atomic write operations
Many databases provide atomic update operations, which remove the need to implement read-modify-write cycles in application code. They are usually the best solution if your code can be expressed in terms of those operations. For example, the following instruction is concurrency-safe in most relational databases:
许多数据库提供原子更新操作,这消除了在应用程序代码中实现读取-修改-写入循环的需要。如果你的代码可以用这些操作来表达,它们通常是最好的解决方案。例如,以下指令在大多数关系数据库中是并发安全的:
UPDATE counters SET value = value + 1 WHERE key = 'foo';
Similarly, document databases such as MongoDB provide atomic operations for making local modifications to a part of a JSON document, and Redis provides atomic operations for modifying data structures such as priority queues. Not all writes can easily be expressed in terms of atomic operations—for example, updates to a wiki page involve arbitrary text editing viii —but in situations where atomic operations can be used, they are usually the best choice.
类似地,文档数据库(如MongoDB)提供原子操作,用于对JSON文档的部分进行本地修改,而Redis提供原子操作以修改数据结构(如优先队列)。并非所有的写操作都可以很容易地表达为原子操作。例如,对维基页面的更新涉及任意的文本编辑。但在可以使用原子操作的情况下,它们通常是最好的选择。
Atomic operations are usually implemented by taking an exclusive lock on the object when it is read so that no other transaction can read it until the update has been applied. This technique is sometimes known as cursor stability [ 36 , 37 ]. Another option is to simply force all atomic operations to be executed on a single thread.
原子操作通常通过在读取对象时获取独占锁来实现,这样在更新应用之前,其他事务都无法读取该对象。这种技术有时被称为游标稳定性(cursor stability)[36,37]。另一种选择是简单地强制所有原子操作在单个线程上执行。
Unfortunately, object-relational mapping frameworks make it easy to accidentally write code that performs unsafe read-modify-write cycles instead of using atomic operations provided by the database [ 38 ]. That’s not a problem if you know what you are doing, but it is potentially a source of subtle bugs that are difficult to find by testing.
很遗憾,对象关系映射框架很容易让人无意中编写执行不安全的读取-修改-写入循环的代码,而不是使用数据库提供的原子操作。如果你知道自己在做什么,那没有问题,但这可能会成为难以通过测试发现的微妙错误的源头。
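As a rough sketch of the difference (an in-memory SQLite database stands in for any relational database; the table mirrors the counters example above):

下面粗略地展示了两者的区别(这里用内存中的SQLite代表任意关系数据库;表结构沿用上文的counters示例):

```python
import sqlite3

# Illustration only: the same counters table as in the text, in an in-memory
# SQLite database. increment_atomic pushes the read-modify-write into the
# database; increment_unsafe shows the pattern an ORM can accidentally hide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (key TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('foo', 0)")

def increment_atomic(conn):
    # Safe: the database computes value + 1 atomically, never from a stale read.
    conn.execute("UPDATE counters SET value = value + 1 WHERE key = 'foo'")

def increment_unsafe(conn, stale_value):
    # Unsafe: writes back a value computed in application code from an
    # earlier read, which may already be out of date.
    conn.execute("UPDATE counters SET value = ? WHERE key = 'foo'",
                 (stale_value + 1,))

for _ in range(3):
    increment_atomic(conn)
value = conn.execute(
    "SELECT value FROM counters WHERE key = 'foo'").fetchone()[0]  # 3
```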
Explicit locking
Another option for preventing lost updates, if the database’s built-in atomic operations don’t provide the necessary functionality, is for the application to explicitly lock objects that are going to be updated. Then the application can perform a read-modify-write cycle, and if any other transaction tries to concurrently read the same object, it is forced to wait until the first read-modify-write cycle has completed.
当数据库内置的原子操作无法提供所需功能时,防止丢失更新的另一种选择是应用程序显式锁定要更新的对象。然后,应用程序可以执行读取-修改-写入周期,如果任何其他事务尝试同时读取相同的对象,则被迫等待第一个读取-修改-写入周期完成。
For example, consider a multiplayer game in which several players can move the same figure concurrently. In this case, an atomic operation may not be sufficient, because the application also needs to ensure that a player’s move abides by the rules of the game, which involves some logic that you cannot sensibly implement as a database query. Instead, you may use a lock to prevent two players from concurrently moving the same piece, as illustrated in Example 7-1 .
例如,考虑一个多人游戏,多个玩家可以同时移动同一角色。在这种情况下,原子操作可能不足够,因为应用还需要确保玩家的移动符合游戏规则,这涉及到一些逻辑,无法合理地实现为数据库查询。相反,您可以使用锁来防止两个玩家同时移动同一棋子,如示例7-1所示。
Example 7-1. Explicitly locking rows to prevent lost updates
BEGIN TRANSACTION;

SELECT * FROM figures
  WHERE name = 'robot' AND game_id = 222
  FOR UPDATE;

-- Check whether move is valid, then update the position
-- of the piece that was returned by the previous SELECT.

UPDATE figures SET position = 'c4' WHERE id = 1234;

COMMIT;
This works, but to get it right, you need to carefully think about your application logic. It’s easy to forget to add a necessary lock somewhere in the code, and thus introduce a race condition.
这个方法是可行的,但要正确实现,需要仔细考虑应用程序的逻辑。很容易忘记在代码的某处加上必要的锁,从而引入竞态条件。
Automatically detecting lost updates
Atomic operations and locks are ways of preventing lost updates by forcing the read-modify-write cycles to happen sequentially. An alternative is to allow them to execute in parallel and, if the transaction manager detects a lost update, abort the transaction and force it to retry its read-modify-write cycle.
原子操作和锁是防止丢失更新的方法,通过强制读取-修改-写入周期按顺序执行。另一种方法是允许它们并行执行,如果事务管理器检测到丢失更新,则中止事务并强制其重试读取-修改-写入周期。
An advantage of this approach is that databases can perform this check efficiently in conjunction with snapshot isolation. Indeed, PostgreSQL’s repeatable read, Oracle’s serializable, and SQL Server’s snapshot isolation levels automatically detect when a lost update has occurred and abort the offending transaction. However, MySQL/InnoDB’s repeatable read does not detect lost updates [ 23 ]. Some authors [ 28 , 30 ] argue that a database must prevent lost updates in order to qualify as providing snapshot isolation, so MySQL does not provide snapshot isolation under this definition.
这种方法的优点是数据库能够高效地与快照隔离一起执行此检查。实际上,PostgreSQL的可重复读、Oracle的串行化和SQL Server的快照隔离级别自动检测到丢失的更新,并中止有问题的事务。然而,MySQL/InnoDB的可重复读无法检测到丢失的更新[23]。一些作者[28,30]认为,数据库必须防止丢失的更新,才能被认为是提供快照隔离的资格,因此MySQL在此定义下不提供快照隔离。
Lost update detection is a great feature, because it doesn’t require application code to use any special database features—you may forget to use a lock or an atomic operation and thus introduce a bug, but lost update detection happens automatically and is thus less error-prone.
丢失更新检测是一个很好的功能,因为它不需要应用程序代码使用任何特殊的数据库功能。你可能会忘记使用锁或原子操作从而引入错误,但丢失更新检测会自动进行,因此更不容易出错。
Compare-and-set
In databases that don’t provide transactions, you sometimes find an atomic compare-and-set operation (previously mentioned in “Single-object writes” ). The purpose of this operation is to avoid lost updates by allowing an update to happen only if the value has not changed since you last read it. If the current value does not match what you previously read, the update has no effect, and the read-modify-write cycle must be retried.
在不提供事务处理的数据库中,有时可以找到原子比较和设置操作(在“单对象写入”中提到过)。此操作的目的是通过仅在上次读取后的值未更改时允许更新来避免丢失更新。如果当前值与您以前读取的值不匹配,则更新无效,并且必须重试读取-修改-写入循环。
For example, to prevent two users concurrently updating the same wiki page, you might try something like this, expecting the update to occur only if the content of the page hasn’t changed since the user started editing it:
例如,为了防止两个用户同时更新同一个Wiki页面,您可以尝试类似以下的方法,只有在页面内容自用户开始编辑时没有更改时,才能期望更新成功:
-- This may or may not be safe, depending on the database implementation
UPDATE wiki_pages SET content = 'new content'
  WHERE id = 1234 AND content = 'old content';
If the content has changed and no longer matches 'old content', this update will have no effect, so you need to check whether the update took effect and retry if necessary. However, if the database allows the WHERE clause to read from an old snapshot, this statement may not prevent lost updates, because the condition may be true even though another concurrent write is occurring. Check whether your database's compare-and-set operation is safe before relying on it.
如果内容已更改且不再与旧内容匹配,则此更新将不起作用,因此需要检查更新是否生效,如果必要,则重试。但是,如果数据库允许WHERE子句从旧快照中读取,则此语句可能无法防止丢失更新,因为条件可能为true,即使另一个并发写操作正在进行中。在依赖它之前,请检查您的数据库的比较和设置操作是否安全。
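A hedged sketch of compare-and-set in application code (SQLite stands in for any database; checking the statement's affected-row count is one common way to see whether the conditional update took effect):

下面是应用代码中比较并设置的一个示意(SQLite仅作演示;检查语句影响的行数是判断条件更新是否生效的一种常见方法):

```python
import sqlite3

# SQLite stands in for any database here; the schema mirrors the wiki_pages
# example. cursor.rowcount tells us whether the conditional update matched a
# row -- zero means someone else changed the content first.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wiki_pages (id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO wiki_pages VALUES (1234, 'old content')")

def compare_and_set(conn, page_id, expected, new_content):
    cur = conn.execute(
        "UPDATE wiki_pages SET content = ? WHERE id = ? AND content = ?",
        (new_content, page_id, expected))
    return cur.rowcount == 1  # False means: re-read the page and retry

ok = compare_and_set(conn, 1234, 'old content', 'new content')       # True
stale = compare_and_set(conn, 1234, 'old content', 'other content')  # False
```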
Conflict resolution and replication
In replicated databases (see Chapter 5 ), preventing lost updates takes on another dimension: since they have copies of the data on multiple nodes, and the data can potentially be modified concurrently on different nodes, some additional steps need to be taken to prevent lost updates.
在复制的数据库中(见第五章),防止丢失更新会有另一个层面:由于它们在多个节点上拥有数据的副本,并且数据可能会在不同节点上并发修改,因此需要采取一些额外的步骤来防止丢失更新。
Locks and compare-and-set operations assume that there is a single up-to-date copy of the data. However, databases with multi-leader or leaderless replication usually allow several writes to happen concurrently and replicate them asynchronously, so they cannot guarantee that there is a single up-to-date copy of the data. Thus, techniques based on locks or compare-and-set do not apply in this context. (We will revisit this issue in more detail in “Linearizability” .)
锁定和比较并设置操作假定数据有一个最新的副本。然而,具有多个领导者或无领导者复制的数据库通常允许同时进行多个写操作,并异步复制它们,因此无法保证数据有一个最新的副本。因此,基于锁定或比较并设置的技术在这种情况下不适用。(我们将在“线性化”中更详细地讨论这个问题。)
Instead, as discussed in “Detecting Concurrent Writes” , a common approach in such replicated databases is to allow concurrent writes to create several conflicting versions of a value (also known as siblings ), and to use application code or special data structures to resolve and merge these versions after the fact.
相反,在“检测并发写入”中讨论的方式,这种复制数据库中常见的方法是允许并发写入创建多个冲突版本的值(也称为兄弟姐妹),并使用应用程序代码或特殊数据结构在事后解决和合并这些版本。
Atomic operations can work well in a replicated context, especially if they are commutative (i.e., you can apply them in a different order on different replicas, and still get the same result). For example, incrementing a counter or adding an element to a set are commutative operations. That is the idea behind Riak 2.0 datatypes, which prevent lost updates across replicas. When a value is concurrently updated by different clients, Riak automatically merges together the updates in such a way that no updates are lost [ 39 ].
原子操作在复制环境中表现良好,尤其是在可交换的情况下(即在不同的副本上可以以不同的顺序应用它们,仍然可以得到相同的结果)。例如,对计数器进行递增或将元素添加到集合中都是可交换的操作。这就是Riak 2.0数据类型的想法,它可以防止副本之间出现丢失更新。当不同客户端同时更新一个值时,Riak会自动合并这些更新,以便不会丢失任何更新。[39]。
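To illustrate why commutativity helps, here is a toy grow-only counter in the spirit of (but far simpler than) Riak's datatypes; the replica names and structure are invented for illustration:

为了说明可交换性为何有帮助,下面是一个仿照Riak数据类型思想(但简化得多)的玩具级只增计数器;副本名称和结构是为演示虚构的:

```python
# A toy grow-only counter (much simpler than Riak 2.0 datatypes; replica
# names invented). Each replica counts its own increments, and merging takes
# the per-replica maximum, so no increment is ever lost and the merge is
# commutative: applying it in any order gives the same result.

def merge(a, b):
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

def value(counter):
    return sum(counter.values())

replica1 = {"r1": 2}  # two increments accepted by replica 1
replica2 = {"r2": 3}  # three increments accepted concurrently by replica 2

merged = merge(replica1, replica2)
assert value(merged) == 5                                      # no update lost
assert merge(replica1, replica2) == merge(replica2, replica1)  # order-independent
```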
On the other hand, the last write wins (LWW) conflict resolution method is prone to lost updates, as discussed in “Last write wins (discarding concurrent writes)” . Unfortunately, LWW is the default in many replicated databases.
另一方面,最后写入胜出(LWW)冲突解决方法容易丢失更新,就像在“最后写入获胜(丢弃并行写入)”中讨论的那样。不幸的是,在许多复制数据库中,LWW是默认设置。
Write Skew and Phantoms
In the previous sections we saw dirty writes and lost updates , two kinds of race conditions that can occur when different transactions concurrently try to write to the same objects. In order to avoid data corruption, those race conditions need to be prevented—either automatically by the database, or by manual safeguards such as using locks or atomic write operations.
在前面的章节中,我们看到了脏写和丢失更新这两种竞态条件,当不同的事务同时尝试写入相同的对象时,可能会发生。为了避免数据损坏,需要防止这些竞争条件--可以通过数据库自动实现,也可以通过使用锁或原子写操作等手动保障实现。
However, that is not the end of the list of potential race conditions that can occur between concurrent writes. In this section we will see some subtler examples of conflicts.
然而,这并不是并发写入之间可能发生的潜在竞态条件列表的终点。在本节中,我们将看到一些更微妙的冲突示例。
To begin, imagine this example: you are writing an application for doctors to manage their on-call shifts at a hospital. The hospital usually tries to have several doctors on call at any one time, but it absolutely must have at least one doctor on call. Doctors can give up their shifts (e.g., if they are sick themselves), provided that at least one colleague remains on call in that shift [ 40 , 41 ].
首先,想象一个例子:你正在为医生编写一个应用程序,以管理他们在医院的轮班。医院通常会尽量安排几名医生轮班,但绝对必须有至少一名医生轮班。医生可以放弃他们的轮班(例如,如果他们自己生病了),前提是在该轮班中至少有一名同事仍然轮班。
Now imagine that Alice and Bob are the two on-call doctors for a particular shift. Both are feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button to go off call at approximately the same time. What happens next is illustrated in Figure 7-8 .
现在想象一下,艾丽斯和鲍勃是特定班次的两名值班医生。他们俩都不太舒服,于是都决定请假。不幸的是,他们恰巧在大约同一时间点击了下班按钮。接下来发生的事情如图7-8所示。
In each transaction, your application first checks that two or more doctors are currently on call; if yes, it assumes it's safe for one doctor to go off call. Since the database is using snapshot isolation, both checks return 2, so both transactions proceed to the next stage. Alice updates her own record to take herself off call, and Bob updates his own record likewise. Both transactions commit, and now no doctor is on call. Your requirement of having at least one doctor on call has been violated.
在每一次交易中,你的应用程序首先检查当前是否有两个或更多医生在接听电话;如果是的话,它就默认有一个医生可以不用接听电话。由于数据库使用了快照隔离,两个检查都返回2,所以两个交易都进入下一个阶段。艾丽斯更新她自己的记录以使自己不再接听电话,鲍勃也做了类似的更新。两个交易都提交,现在没有医生在接听电话了。你的要求至少有一个医生接听电话已经被违反了。
Characterizing write skew
This anomaly is called write skew [ 28 ]. It is neither a dirty write nor a lost update, because the two transactions are updating two different objects (Alice’s and Bob’s on-call records, respectively). It is less obvious that a conflict occurred here, but it’s definitely a race condition: if the two transactions had run one after another, the second doctor would have been prevented from going off call. The anomalous behavior was only possible because the transactions ran concurrently.
这种异常被称为写入偏斜[28]。它既不是脏写也不是丢失更新,因为这两个交易正在更新两个不同的对象(分别是Alice和Bob的呼叫记录)。这里发生了冲突并不是很明显,但这绝对是一种竞争条件:如果这两个事务运行在彼此之后,第二个医生将被阻止下班。这种异常行为只有在事务并发运行时才有可能出现。
You can think of write skew as a generalization of the lost update problem. Write skew can occur if two transactions read the same objects, and then update some of those objects (different transactions may update different objects). In the special case where different transactions update the same object, you get a dirty write or lost update anomaly (depending on the timing).
你可以将写偏斜视为丢失更新问题的泛化。如果两个事务读取相同的对象,然后更新其中一些对象(不同的事务可能更新不同的对象),就可能发生写偏斜。在不同事务更新同一对象的特殊情况下,你会得到脏写或丢失更新异常(取决于时机)。
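The scenario of Figure 7-8 can be simulated in a few lines (names and data structures are illustrative only):

图7-8的场景可以用几行代码模拟(名称和数据结构仅作说明):

```python
# Illustration only: names and structure invented. Both transactions run
# their check against the same consistent snapshot, both see two doctors on
# call, and each updates a *different* row -- leaving nobody on call.

snapshot = {"alice": True, "bob": True}  # what both transactions read
current = dict(snapshot)                 # the actual database state

def doctors_on_call(db):
    return sum(1 for on_call in db.values() if on_call)

alice_ok = doctors_on_call(snapshot) >= 2  # True: the check passes
bob_ok = doctors_on_call(snapshot) >= 2    # True: the check passes too

if alice_ok:
    current["alice"] = False  # Alice's transaction updates only her row
if bob_ok:
    current["bob"] = False    # Bob's transaction updates only his row

remaining = doctors_on_call(current)  # 0 -- the invariant is violated
```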
We saw that there are various different ways of preventing lost updates. With write skew, our options are more restricted:
我们看到了许多不同的方法来防止丢失的更新。但在写入偏差的情况下,我们的选择比较有限:
-
Atomic single-object operations don’t help, as multiple objects are involved.
原子单对象操作无济于事,因为涉及多个对象。
-
The automatic detection of lost updates that you find in some implementations of snapshot isolation unfortunately doesn’t help either: write skew is not automatically detected in PostgreSQL’s repeatable read, MySQL/InnoDB’s repeatable read, Oracle’s serializable, or SQL Server’s snapshot isolation level [ 23 ]. Automatically preventing write skew requires true serializable isolation (see “Serializability” ).
某些快照隔离实现中发现的丢失更新的自动检测不幸的是没有帮助:在PostgreSQL的可重复读、MySQL/InnoDB的可重复读、Oracle的可串行化或SQL Server的快照隔离级别[ 23 ]中无法自动检测到写入偏斜。自动防止写入偏斜需要真正的串行化隔离(见“串行化”)。
-
Some databases allow you to configure constraints, which are then enforced by the database (e.g., uniqueness, foreign key constraints, or restrictions on a particular value). However, in order to specify that at least one doctor must be on call, you would need a constraint that involves multiple objects. Most databases do not have built-in support for such constraints, but you may be able to implement them with triggers or materialized views, depending on the database [ 42 ].
一些数据库允许您配置限制条件,然后由数据库执行(例如:独特性、外部键限制或特定值的限制)。然而,为了指定至少有一位医生需要值班,您需要涉及多个对象的限制条件。大多数数据库没有内置的支持这样的限制条件,但您可以根据数据库的情况使用触发器或材料化视图来实现它们[42]。
-
If you can’t use a serializable isolation level, the second-best option in this case is probably to explicitly lock the rows that the transaction depends on. In the doctors example, you could write something like the following:
如果您无法使用可序列化隔离级别,在这种情况下,第二好的选择可能是显式锁定事务所依赖的行。在医生示例中,您可以编写以下内容:
BEGIN TRANSACTION;

SELECT * FROM doctors
  WHERE on_call = true
  AND shift_id = 1234
  FOR UPDATE;

UPDATE doctors
  SET on_call = false
  WHERE name = 'Alice'
  AND shift_id = 1234;

COMMIT;
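As a hedged sketch of the same idea in application code: SQLite has no SELECT ... FOR UPDATE, so this example takes a write lock up front with BEGIN IMMEDIATE, which serializes the check-then-update in a similar spirit (it is not a drop-in equivalent of row-level locks):

下面用应用代码示意同样的思路:SQLite没有SELECT ... FOR UPDATE,因此本例用BEGIN IMMEDIATE预先获取写锁,以类似的方式串行化“检查然后更新”(它并不等价于行级锁):

```python
import sqlite3

# Illustrative stand-in: SQLite lacks SELECT ... FOR UPDATE, so we take a
# database-wide write lock up front with BEGIN IMMEDIATE. The check-then-
# update below then cannot interleave with another writer. Schema and values
# mirror the doctors example in the text.
conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
conn.execute("CREATE TABLE doctors (name TEXT, shift_id INTEGER, on_call INTEGER)")
conn.execute("INSERT INTO doctors VALUES ('Alice', 1234, 1), ('Bob', 1234, 1)")

conn.execute("BEGIN IMMEDIATE")  # acquire the write lock before reading
on_call = conn.execute(
    "SELECT COUNT(*) FROM doctors WHERE on_call = 1 AND shift_id = 1234"
).fetchone()[0]
if on_call >= 2:  # the invariant check is race-free while the lock is held
    conn.execute(
        "UPDATE doctors SET on_call = 0 WHERE name = 'Alice' AND shift_id = 1234")
conn.execute("COMMIT")
```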
More examples of write skew
Write skew may seem like an esoteric issue at first, but once you’re aware of it, you may notice more situations in which it can occur. Here are some more examples:
写偏斜一开始可能看起来像是一个深奥的问题,但是一旦你意识到它,你可能会注意到更多可能发生这种现象的情况。以下是一些例子:
- Meeting room booking system
-
Say you want to enforce that there cannot be two bookings for the same meeting room at the same time [ 43 ]. When someone wants to make a booking, you first check for any conflicting bookings (i.e., bookings for the same room with an overlapping time range), and if none are found, you create the meeting (see Example 7-2 ). ix
假设您想强制要求同一时间不能有两个预订同一个会议室的情况[43]。当有人想要进行预订时,您首先要检查是否存在任何冲突的预订(即按时间重叠的同一房间预订),如果不存在,您就可以创建会议(参见示例7-2)。
Example 7-2. A meeting room booking system tries to avoid double-booking (not safe under snapshot isolation)
BEGIN TRANSACTION;

-- Check for any existing bookings that overlap with the period of noon-1pm
SELECT COUNT(*) FROM bookings
  WHERE room_id = 123 AND
    end_time > '2015-01-01 12:00' AND start_time < '2015-01-01 13:00';

-- If the previous query returned zero:
INSERT INTO bookings (room_id, start_time, end_time, user_id)
  VALUES (123, '2015-01-01 12:00', '2015-01-01 13:00', 666);

COMMIT;
Unfortunately, snapshot isolation does not prevent another user from concurrently inserting a conflicting meeting. In order to guarantee you won’t get scheduling conflicts, you once again need serializable isolation.
遗憾的是,快照隔离并不能防止其他用户同时插入冲突会议。为了确保您不会遇到调度冲突,您再次需要可序列化隔离。
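The overlap condition in the WHERE clause is worth spelling out: two time ranges conflict exactly when each starts before the other ends. A small illustrative check, using the times from the example:

WHERE子句中的重叠条件值得单独说明:两个时间段冲突,当且仅当每一个都在另一个结束之前开始。下面用示例中的时间做一个简单演示:

```python
from datetime import datetime

# The predicate behind the WHERE clause in Example 7-2: two bookings
# conflict exactly when each one starts before the other ends. (Only the
# predicate is shown; running it and then inserting is still the
# check-then-act pattern that is unsafe under snapshot isolation.)

def overlaps(start_a, end_a, start_b, end_b):
    return start_a < end_b and start_b < end_a

noon = datetime(2015, 1, 1, 12, 0)
one_pm = datetime(2015, 1, 1, 13, 0)

# 12:30-13:30 conflicts with noon-1pm; a booking starting exactly at 1pm does not.
assert overlaps(noon, one_pm,
                datetime(2015, 1, 1, 12, 30), datetime(2015, 1, 1, 13, 30))
assert not overlaps(noon, one_pm,
                    one_pm, datetime(2015, 1, 1, 14, 0))
```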
- Multiplayer game
-
In Example 7-1 , we used a lock to prevent lost updates (that is, making sure that two players can’t move the same figure at the same time). However, the lock doesn’t prevent players from moving two different figures to the same position on the board or potentially making some other move that violates the rules of the game. Depending on the kind of rule you are enforcing, you might be able to use a unique constraint, but otherwise you’re vulnerable to write skew.
在示例7-1中,我们使用了锁来防止丢失更新(即确保两个玩家不能同时移动同一个图案)。然而,该锁并不能防止玩家将两个不同的图案移动到棋盘上的同一位置,或者可能进行其他违反游戏规则的动作。根据您要执行的规则类型,您可能可以使用唯一约束,但否则您容易受到写入扭曲的影响。
- Claiming a username
-
On a website where each user has a unique username, two users may try to create accounts with the same username at the same time. You may use a transaction to check whether a name is taken and, if not, create an account with that name. However, like in the previous examples, that is not safe under snapshot isolation. Fortunately, a unique constraint is a simple solution here (the second transaction that tries to register the username will be aborted due to violating the constraint).
在一个每个用户都有唯一用户名的网站上,可能会有两个用户同时尝试创建一个同名的帐户。您可以使用事务来检查用户名是否已被使用,如果没有,则使用该名称创建帐户。但是,像之前的例子一样,在快照隔离下不安全。幸运的是,唯一约束是一个简单的解决方案(尝试注册用户名的第二个事务将因违反约束而被中止)。
- Preventing double-spending
-
A service that allows users to spend money or points needs to check that a user doesn’t spend more than they have. You might implement this by inserting a tentative spending item into a user’s account, listing all the items in the account, and checking that the sum is positive [ 44 ]. With write skew, it could happen that two spending items are inserted concurrently that together cause the balance to go negative, but that neither transaction notices the other.
一个允许用户花费货币或积分的服务需要检查用户不会超支。可以通过将暂定的消费项目插入用户的账户,列出账户中的所有项目,并检查总和是否为正来实现这一点。但是,如果存在写入偏斜,可能会同时插入两个消费项目,这两个项目联合起来会导致余额变为负数,但是两个事务都没有注意到另一个事务。
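The tentative-insert-then-check pattern can be sketched as follows (SQLite for illustration; schema and amounts are invented). The check is only safe here because the sketch is single-threaded; under snapshot isolation two concurrent spends could both pass it:

“先暂定插入再检查”的模式可以如下示意(SQLite仅作演示;表结构和金额是虚构的)。这里的检查之所以安全,只是因为示例是单线程的;在快照隔离下,两笔并发消费可能都通过检查:

```python
import sqlite3

# Illustration only (schema and amounts invented): insert a tentative
# spending item, sum the account, and roll back if the balance would go
# negative. Single-threaded this works; under write skew two concurrent
# spends could each see a positive sum and both commit.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account_items (user_id INTEGER, amount INTEGER)")
conn.execute("INSERT INTO account_items VALUES (1, 10)")  # a 10-point credit
conn.commit()

def try_spend(conn, user_id, amount):
    conn.execute("INSERT INTO account_items VALUES (?, ?)", (user_id, -amount))
    balance = conn.execute(
        "SELECT SUM(amount) FROM account_items WHERE user_id = ?",
        (user_id,)).fetchone()[0]
    if balance < 0:
        conn.rollback()  # would overdraw: undo the tentative item
        return False
    conn.commit()
    return True

first = try_spend(conn, 1, 7)   # True: balance is now 3
second = try_spend(conn, 1, 5)  # False: would be -2, so it is rolled back
```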
Phantoms causing write skew
All of these examples follow a similar pattern:
所有这些例子都遵循相似的模式:
-
A SELECT query checks whether some requirement is satisfied by searching for rows that match some search condition (there are at least two doctors on call, there are no existing bookings for that room at that time, the position on the board doesn't already have another figure on it, the username isn't already taken, there is still money in the account).
一个SELECT查询通过搜索符合某些条件的行来检查是否满足要求(当时有至少两位医生值班,该房间此时没有已有的预订,该棋盘位置上没有其他棋子,用户名没有被占用,该账户还有余额)。
-
Depending on the result of the first query, the application code decides how to continue (perhaps to go ahead with the operation, or perhaps to report an error to the user and abort).
根据第一个查询的结果,应用程序代码决定如何继续(可能继续执行操作,或者向用户报告错误并中止操作)。
-
If the application decides to go ahead, it makes a write (INSERT, UPDATE, or DELETE) to the database and commits the transaction.
如果应用程序决定继续,它将向数据库进行写入(INSERT、UPDATE或DELETE),并提交事务。
The effect of this write changes the precondition of the decision of step 2. In other words, if you were to repeat the SELECT query from step 1 after committing the write, you would get a different result, because the write changed the set of rows matching the search condition (there is now one fewer doctor on call, the meeting room is now booked for that time, the position on the board is now taken by the figure that was moved, the username is now taken, there is now less money in the account).
这个写操作会改变第二步决策的前提条件。换句话说,如果你在提交写操作之后重复执行第一步的SELECT查询,会得到不同的结果,因为写操作改变了符合搜索条件的行集合(现在少了一名值班医生,会议室在该时段已被预订,棋盘上的位置已被移动的棋子占据,用户名已被占用,账户中的金额减少了)。
The steps may occur in a different order. For example, you could first make the write, then the SELECT query, and finally decide whether to abort or commit based on the result of the query.
步骤的顺序可能会不同。例如,您可以先写入,然后进行SELECT查询,最后根据查询结果决定是否中止或提交。
In the case of the doctor on call example, the row being modified in step 3 was one of the rows returned in step 1, so we could make the transaction safe and avoid write skew by locking the rows in step 1 (SELECT FOR UPDATE). However, the other four examples are different: they check for the absence of rows matching some search condition, and the write adds a row matching the same condition. If the query in step 1 doesn't return any rows, SELECT FOR UPDATE can't attach locks to anything.
对于值班医生的示例,第三步中要修改的行是第一步返回的行之一,因此我们可以通过在第一步中锁定这些行(SELECT FOR UPDATE)使事务安全,避免写偏斜。然而,其他四个示例不同:它们检查的是不存在匹配某些搜索条件的行,而写入则添加了一行匹配相同条件的行。如果第一步的查询没有返回任何行,SELECT FOR UPDATE就无法锁定任何东西。
This effect, where a write in one transaction changes the result of a search query in another transaction, is called a phantom [ 3 ]. Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the examples we discussed, phantoms can lead to particularly tricky cases of write skew.
这种效应被称为幻象(phantom)[3]:一个事务中的写操作改变了另一个事务中搜索查询的结果。快照隔离避免了只读查询中的幻象,但在像我们讨论的这类读写事务中,幻象会导致特别棘手的写偏斜情况。
Materializing conflicts
If the problem of phantoms is that there is no object to which we can attach the locks, perhaps we can artificially introduce a lock object into the database?
如果幻象问题在于没有可以加锁的对象,也许我们可以人为地在数据库中引入一个锁对象?
For example, in the meeting room booking case you could imagine creating a table of time slots and rooms. Each row in this table corresponds to a particular room for a particular time period (say, 15 minutes). You create rows for all possible combinations of rooms and time periods ahead of time, e.g. for the next six months.
例如,在预订会议室的情况下,您可以想象创建一个时间槽和房间的表格。该表中的每一行对应于特定时间段(例如,15分钟)内的特定房间。您提前为所有可能的房间和时间段组合创建行,例如未来六个月的时间。
Now a transaction that wants to create a booking can lock (SELECT FOR UPDATE) the rows in the table that correspond to the desired room and time period. After it has acquired the locks, it can check for overlapping bookings and insert a new booking as before. Note that the additional table isn't used to store information about the booking—it's purely a collection of locks which is used to prevent bookings on the same room and time range from being modified concurrently.
现在,想要创建预订的事务可以锁定(SELECT FOR UPDATE)表中与所需房间和时间段相对应的行。在获取锁之后,它可以像以前一样检查重叠的预订并插入新的预订。请注意,这个额外的表并不用于存储预订信息,它纯粹是一组锁,用于防止同一房间和时间范围的预订被并发修改。
This approach is called materializing conflicts , because it takes a phantom and turns it into a lock conflict on a concrete set of rows that exist in the database [ 11 ]. Unfortunately, it can be hard and error-prone to figure out how to materialize conflicts, and it’s ugly to let a concurrency control mechanism leak into the application data model. For those reasons, materializing conflicts should be considered a last resort if no alternative is possible. A serializable isolation level is much preferable in most cases.
这种方法被称为“实化冲突”,因为它将一个幻象转化为存在于数据库中的一组具体行的锁冲突[11]。不幸的是,找出如何实现冲突可能会很困难和容易出错,并且让并发控制机制渗入应用数据模型是很丑陋的。出于这些原因,如果没有其他替代方案,应将实化冲突视为最后的选择。在大多数情况下,可串行化隔离级别更可取。
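A sketch of generating the pre-created lock rows (the 15-minute slot length and table layout are assumptions for illustration; the locking step itself is database-specific and omitted):

下面示意如何生成预创建的锁行(15分钟的槽长和表布局是为演示所做的假设;加锁步骤本身依赖于具体数据库,此处省略):

```python
from datetime import datetime, timedelta

# Assumptions for illustration: 15-minute slots, one lock row per
# (room, slot start). A booking transaction would then lock (SELECT ... FOR
# UPDATE) the slot rows it covers before checking for overlaps; that
# locking step is database-specific and omitted here.

def slot_rows(room_ids, day_start, hours=24, slot_minutes=15):
    rows = []
    for room in room_ids:
        t = day_start
        end = day_start + timedelta(hours=hours)
        while t < end:
            rows.append((room, t))
            t += timedelta(minutes=slot_minutes)
    return rows

rows = slot_rows([123], datetime(2015, 1, 1))
# One day for one room: 24 hours x 4 slots/hour = 96 lock rows
```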
Serializability
In this chapter we have seen several examples of transactions that are prone to race conditions. Some race conditions are prevented by the read committed and snapshot isolation levels, but others are not. We encountered some particularly tricky examples with write skew and phantoms. It’s a sad situation:
在本章中,我们看到了几个容易出现竞争条件的交易示例。有些竞争条件可以通过读提交和快照隔离级别来防止,但其他一些无法防止。我们遇到了一些非常棘手的写偏差和幻象问题。这是一个令人沮丧的局面:
-
Isolation levels are hard to understand, and inconsistently implemented in different databases (e.g., the meaning of “repeatable read” varies significantly).
隔离级别很难理解,在不同的数据库中实现也不一致 (例如,“可重复读”的意义存在显著的差异)。
-
If you look at your application code, it’s difficult to tell whether it is safe to run at a particular isolation level—especially in a large application, where you might not be aware of all the things that may be happening concurrently.
如果你看你的应用程序代码,很难判断它是否可以在特定的隔离级别下运行,特别是在大型应用程序中,在那里你可能不知道所有可能同时发生的事情。
-
There are no good tools to help us detect race conditions. In principle, static analysis may help [ 26 ], but research techniques have not yet found their way into practical use. Testing for concurrency issues is hard, because they are usually nondeterministic—problems only occur if you get unlucky with the timing.
目前没有好的工具来帮助我们检测竞态条件。从原理上讲,静态分析可能有所帮助[26],但研究技术尚未进入实际应用。测试并发问题很困难,因为它们通常是不确定的,只有在时间不利的情况下才会出现问题。
This is not a new problem—it has been like this since the 1970s, when weak isolation levels were first introduced [ 2 ]. All along, the answer from researchers has been simple: use serializable isolation!
这不是一个新问题 - 自从1970年代引入了弱隔离级别以来,一直存在[2]。一直以来,研究者的答案很简单:使用可串行化隔离!
Serializable isolation is usually regarded as the strongest isolation level. It guarantees that even though transactions may execute in parallel, the end result is the same as if they had executed one at a time, serially , without any concurrency. Thus, the database guarantees that if the transactions behave correctly when run individually, they continue to be correct when run concurrently—in other words, the database prevents all possible race conditions.
可序列化隔离通常被认为是最强的隔离级别。它保证即使事务并行执行,最终结果与逐个按顺序串行执行的结果相同,没有任何并发。因此,数据库保证如果事务在单独运行时正确,它们在并发运行时仍然正确,换句话说,数据库排除了所有可能的竞态条件。
But if serializable isolation is so much better than the mess of weak isolation levels, then why isn’t everyone using it? To answer this question, we need to look at the options for implementing serializability, and how they perform. Most databases that provide serializability today use one of three techniques, which we will explore in the rest of this chapter:
但如果可序列化隔离比弱隔离级别混乱的方式好得多,为什么不是每个人都在使用它?要回答这个问题,我们需要看一下实现可串行性的选项以及它们的性能。今天提供可串行性的大多数数据库使用三种技术之一,我们将在本章的其余部分探讨这些技术:
-
Literally executing transactions in a serial order (see “Actual Serial Execution” )
字面意义上按照顺序依次执行交易(参见“实际串行执行”)。
-
Two-phase locking (see “Two-Phase Locking (2PL)” ), which for several decades was the only viable option
两阶段锁(请参见“两阶段锁定(2PL)”),数十年来一直是唯一可行的选择。
-
Optimistic concurrency control techniques such as serializable snapshot isolation (see “Serializable Snapshot Isolation (SSI)” )
乐观并发控制技术,例如可串行化快照隔离(参见“可串行化快照隔离(SSI)”)。
For now, we will discuss these techniques primarily in the context of single-node databases; in Chapter 9 we will examine how they can be generalized to transactions that involve multiple nodes in a distributed system.
目前,我们将主要在单节点数据库的背景下讨论这些技术;在第9章中,我们将研究如何将它们推广到涉及分布式系统中多个节点的事务中。
Actual Serial Execution
The simplest way of avoiding concurrency problems is to remove the concurrency entirely: to execute only one transaction at a time, in serial order, on a single thread. By doing so, we completely sidestep the problem of detecting and preventing conflicts between transactions: the resulting isolation is by definition serializable.
避免并发问题的最简单方法是完全消除并发:在单个线程上按序执行一次只有一个事务。这样做,我们完全回避了检测和防止事务之间冲突的问题:由此产生的隔离是可串行化的。
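A toy version of such a single-threaded transaction loop (not any real database's API): whole transactions are submitted as functions and executed strictly one at a time, so a thousand increments lose nothing without any locking:

下面是这种单线程事务循环的玩具版本(并非任何真实数据库的API):整个事务作为函数提交,严格地逐个执行,因此一千次递增在没有任何锁的情况下也不会丢失:

```python
import queue
import threading

# A toy single-threaded transaction loop (invented API, for illustration):
# whole transactions are functions, executed strictly one at a time by a
# single worker thread, so the result is serializable by construction.

store = {"counter": 0}
tx_queue = queue.Queue()

def worker():
    while True:
        tx = tx_queue.get()
        if tx is None:  # sentinel: shut down
            break
        tx(store)       # each transaction runs to completion before the next

t = threading.Thread(target=worker)
t.start()

def increment(db):  # a whole transaction, submitted as one unit
    db["counter"] += 1

for _ in range(1000):
    tx_queue.put(increment)
tx_queue.put(None)
t.join()
# store["counter"] is 1000: no increments lost, with no locks at all
```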
Even though this seems like an obvious idea, database designers only fairly recently—around 2007—decided that a single-threaded loop for executing transactions was feasible [ 45 ]. If multi-threaded concurrency was considered essential for getting good performance during the previous 30 years, what changed to make single-threaded execution possible?
即使这似乎是个显而易见的想法,但数据库设计师直到近年来(大约在2007年)才决定使用单线程循环来执行事务 [45]。如果在之前的30年中,多线程并发被认为是获得良好性能的必要条件,那么什么改变了,使得单线程执行成为可能?
Two developments caused this rethink:
两个发展导致了这种重新思考。
-
RAM became cheap enough that for many use cases is now feasible to keep the entire active dataset in memory (see “Keeping everything in memory” ). When all data that a transaction needs to access is in memory, transactions can execute much faster than if they have to wait for data to be loaded from disk.
RAM的价格越来越便宜,现在在许多情况下,将整个活动数据集保留在内存中变得可行(请参阅“全部保留在内存中”)。当事务需要访问的所有数据都在内存中时,与必须等待从磁盘加载数据相比,事务可以执行得更快。
-
Database designers realized that OLTP transactions are usually short and only make a small number of reads and writes (see “Transaction Processing or Analytics?” ). By contrast, long-running analytic queries are typically read-only, so they can be run on a consistent snapshot (using snapshot isolation) outside of the serial execution loop.
数据库设计师发现OLTP事务通常很短,只进行少量读写操作(参见“事务处理还是分析?”)。相比之下,长时间运行的分析查询通常是只读的,因此它们可以在一致的快照上运行(使用快照隔离)而不需要在串行执行循环内运行。
The approach of executing transactions serially is implemented in VoltDB/H-Store, Redis, and Datomic [ 46 , 47 , 48 ]. A system designed for single-threaded execution can sometimes perform better than a system that supports concurrency, because it can avoid the coordination overhead of locking. However, its throughput is limited to that of a single CPU core. In order to make the most of that single thread, transactions need to be structured differently from their traditional form.
串行执行事务的方法在VoltDB/H-Store、Redis和Datomic中得到了实现[46,47,48]。为单线程执行而设计的系统有时比支持并发的系统性能更好,因为它避免了锁的协调开销。然而,它的吞吐量上限是单个CPU核心的吞吐量。为了充分利用这个单线程,事务需要以不同于传统形式的方式来组织。
Encapsulating transactions in stored procedures
In the early days of databases, the intention was that a database transaction could encompass an entire flow of user activity. For example, booking an airline ticket is a multi-stage process (searching for routes, fares, and available seats; deciding on an itinerary; booking seats on each of the flights of the itinerary; entering passenger details; making payment). Database designers thought that it would be neat if that entire process was one transaction so that it could be committed atomically.
在数据库的早期阶段,其目的是使一个数据库事务涵盖一个完整的用户活动流程。例如,预订航空票是一个多阶段的过程(搜索路线、票价和可用座位;决定行程;预订行程中每个航班的座位;输入旅客详情;进行支付)。数据库设计师认为,如果整个过程都是一个事务,那将是一件很好的事情,以便可以原子地提交。
Unfortunately, humans are very slow to make up their minds and respond. If a database transaction needs to wait for input from a user, the database needs to support a potentially huge number of concurrent transactions, most of them idle. Most databases cannot do that efficiently, and so almost all OLTP applications keep transactions short by avoiding interactively waiting for a user within a transaction. On the web, this means that a transaction is committed within the same HTTP request—a transaction does not span multiple requests. A new HTTP request starts a new transaction.
不幸的是,人类在做决定和响应方面非常缓慢。如果数据库事务需要等待用户的输入,数据库需要支持潜在的大量并发事务,其中大部分处于空闲状态。大多数数据库不能高效地完成这项工作,因此几乎所有OLTP应用程序通过避免在事务中与用户交互等待而使事务短暂。在web上,这意味着一个事务在同一个HTTP请求中提交-一个事务不跨越多个请求。一个新的HTTP请求开始一个新的事务。
Even though the human has been taken out of the critical path, transactions have continued to be executed in an interactive client/server style, one statement at a time. An application makes a query, reads the result, perhaps makes another query depending on the result of the first query, and so on. The queries and results are sent back and forth between the application code (running on one machine) and the database server (on another machine).
即使人已经不在关键路径上,事务仍然以交互式的客户端/服务器方式执行,一次一条语句。应用程序发出一个查询,读取结果,也许根据第一个查询的结果再发出另一个查询,依此类推。查询和结果在应用程序代码(运行在一台机器上)和数据库服务器(在另一台机器上)之间来回传送。
In this interactive style of transaction, a lot of time is spent in network communication between the application and the database. If you were to disallow concurrency in the database and only process one transaction at a time, the throughput would be dreadful because the database would spend most of its time waiting for the application to issue the next query for the current transaction. In this kind of database, it’s necessary to process multiple transactions concurrently in order to get reasonable performance.
在这种交互式的事务模式中,应用程序和数据库之间需要花费大量时间进行网络通信。如果禁止数据库并发处理,只能一次处理一个事务,那么吞吐量会非常糟糕,因为数据库会花费大部分时间等待应用程序发出当前事务的下一个查询。在这种类型的数据库中,必须同时处理多个事务,才能获得合理的性能。
For this reason, systems with single-threaded serial transaction processing don’t allow interactive multi-statement transactions. Instead, the application must submit the entire transaction code to the database ahead of time, as a stored procedure . The differences between these approaches is illustrated in Figure 7-9 . Provided that all data required by a transaction is in memory, the stored procedure can execute very fast, without waiting for any network or disk I/O.
因此,采用单线程串行事务处理的系统不允许交互式的多语句事务。相反,应用程序必须提前将整个事务代码作为存储过程提交给数据库。这些方法之间的差异如图7-9所示。只要事务所需的所有数据都在内存中,存储过程就可以执行得非常快,不需要等待任何网络或磁盘I/O。
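The contrast in Figure 7-9 can be sketched in a few lines of Python (a toy model, not any real database client): the interactive style pays one network round trip per statement, while the stored-procedure style ships the whole transaction in a single round trip.
图7-9中的对比可以用几行Python来示意(这只是一个玩具模型,不对应任何真实的数据库客户端):交互式风格每条语句都要付出一次网络往返,而存储过程风格只需一次往返就能提交整个事务。

```python
# Sketch: count network round trips for an interactive transaction
# vs. a stored procedure. All names here are illustrative.

class FakeServer:
    def __init__(self):
        self.round_trips = 0

    def execute(self, statements):
        """One network round trip executes a batch of statements."""
        self.round_trips += 1
        return [f"result of {s}" for s in statements]

def interactive_transaction(server, statements):
    # Interactive style: each statement is its own round trip, because the
    # application inspects each result before issuing the next query.
    results = []
    for stmt in statements:
        results.extend(server.execute([stmt]))
    return results

def stored_procedure(server, statements):
    # Stored-procedure style: the whole transaction is shipped at once.
    return server.execute(statements)

stmts = ["SELECT balance", "UPDATE balance", "INSERT audit_row"]

s1 = FakeServer()
interactive_transaction(s1, stmts)
s2 = FakeServer()
stored_procedure(s2, stmts)
assert s1.round_trips == 3   # one round trip per statement
assert s2.round_trips == 1   # one round trip for the whole transaction
```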
Pros and cons of stored procedures
Stored procedures have existed for some time in relational databases, and they have been part of the SQL standard (SQL/PSM) since 1999. They have gained a somewhat bad reputation, for various reasons:
存储过程在关系型数据库中已经存在了一段时间,自1999年以来它们就是SQL标准(SQL/PSM)的一部分。由于各种原因,它们的声誉不太好:
-
Each database vendor has its own language for stored procedures (Oracle has PL/SQL, SQL Server has T-SQL, PostgreSQL has PL/pgSQL, etc.). These languages haven’t kept up with developments in general-purpose programming languages, so they look quite ugly and archaic from today’s point of view, and they lack the ecosystem of libraries that you find with most programming languages.
每个数据库供应商都有自己的存储过程语言(Oracle有PL/SQL,SQL Server有T-SQL,PostgreSQL有PL/pgSQL等)。这些语言没有跟上通用编程语言的发展,因此从今天的角度来看它们看起来非常丑陋和过时,缺乏大多数编程语言所具有的库生态系统。
-
Code running in a database is difficult to manage: compared to an application server, it’s harder to debug, more awkward to keep in version control and deploy, trickier to test, and difficult to integrate with a metrics collection system for monitoring.
在数据库中运行的代码难以管理:与应用服务器相比,更难调试,版本控制和部署更麻烦,测试更棘手,以及难以集成指标收集系统进行监控。
-
A database is often much more performance-sensitive than an application server, because a single database instance is often shared by many application servers. A badly written stored procedure (e.g., using a lot of memory or CPU time) in a database can cause much more trouble than equivalent badly written code in an application server.
一个数据库通常比应用服务器更注重性能,因为单个数据库实例通常由许多应用服务器共享。一个糟糕编写的存储过程(例如,使用大量内存或CPU时间)可能比应用服务器中相当糟糕的代码引起更多的麻烦。
However, those issues can be overcome. Modern implementations of stored procedures have abandoned PL/SQL and use existing general-purpose programming languages instead: VoltDB uses Java or Groovy, Datomic uses Java or Clojure, and Redis uses Lua.
然而,这些问题都可以克服。现代的存储过程实现已经放弃了PL/SQL,转而使用现有的通用编程语言:VoltDB使用Java或Groovy,Datomic使用Java或Clojure,Redis使用Lua。
With stored procedures and in-memory data, executing all transactions on a single thread becomes feasible. As they don’t need to wait for I/O and they avoid the overhead of other concurrency control mechanisms, they can achieve quite good throughput on a single thread.
通过存储过程和内存数据,使用单线程执行所有事务变得可行。因为它们不需要等待I/O,并且避免了其他并发控制机制的开销,它们可在单线程上实现相当好的吞吐量。
VoltDB also uses stored procedures for replication: instead of copying a transaction’s writes from one node to another, it executes the same stored procedure on each replica. VoltDB therefore requires that stored procedures are deterministic (when run on different nodes, they must produce the same result). If a transaction needs to use the current date and time, for example, it must do so through special deterministic APIs.
VoltDB也使用存储过程进行复制:它不是将事务的写入从一个节点复制到另一个节点,而是在每个副本上执行相同的存储过程。因此,VoltDB要求存储过程是确定性的(在不同的节点上运行时,它们必须产生相同的结果)。例如,如果一个事务需要使用当前日期和时间,它必须通过特殊的确定性API来获取。
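A minimal sketch of this determinism requirement (illustrative only; it does not use VoltDB's actual API): the coordinator picks the timestamp once and passes it as an argument, so every replica computes exactly the same state.
下面是对这一确定性要求的最小示意(仅作说明,并非VoltDB的真实API):协调者只选取一次时间戳并将其作为参数传入,因此每个副本都会计算出完全相同的状态。

```python
# Sketch: a deterministic "stored procedure" takes the current time as an
# argument instead of reading the system clock, so replaying it on every
# replica yields identical state.

def book_flight(db, passenger, booked_at):
    # Deterministic: all inputs, including the "current time", are arguments.
    db.append((passenger, booked_at))
    return db

coordinator_time = "2017-03-01T12:00:00Z"   # chosen once, at the coordinator
replica_a, replica_b = [], []
book_flight(replica_a, "alice", coordinator_time)
book_flight(replica_b, "alice", coordinator_time)
assert replica_a == replica_b   # same inputs -> same state on every replica
```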
Partitioning
Executing all transactions serially makes concurrency control much simpler, but limits the transaction throughput of the database to the speed of a single CPU core on a single machine. Read-only transactions may execute elsewhere, using snapshot isolation, but for applications with high write throughput, the single-threaded transaction processor can become a serious bottleneck.
串行执行所有事务使并发控制简单得多,但将数据库的事务吞吐量限制为单台机器上单个CPU核心的速度。只读事务可以使用快照隔离在其他地方执行,但对于写入吞吐量较高的应用程序,单线程事务处理器可能成为严重的瓶颈。
In order to scale to multiple CPU cores, and multiple nodes, you can potentially partition your data (see Chapter 6 ), which is supported in VoltDB. If you can find a way of partitioning your dataset so that each transaction only needs to read and write data within a single partition, then each partition can have its own transaction processing thread running independently from the others. In this case, you can give each CPU core its own partition, which allows your transaction throughput to scale linearly with the number of CPU cores [ 47 ].
为了扩展到多个CPU核心和多个节点,你可以对数据进行分区(参见第6章),VoltDB支持这样做。如果你能找到一种对数据集进行分区的方法,使每个事务只需要读写单个分区内的数据,那么每个分区都可以拥有自己独立运行的事务处理线程。在这种情况下,你可以为每个CPU核心分配一个分区,这使你的事务吞吐量可以随CPU核心数线性扩展[47]。
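A toy sketch of this idea in Python (partition count and hash scheme are illustrative): each key maps to one partition, and a single-partition transaction touches only that partition's state, so in a real system it could run on that partition's dedicated thread without coordination.
下面用Python对这一思路做一个玩具式的示意(分区数和哈希方案仅作说明):每个键映射到一个分区,单分区事务只触及该分区的状态,因此在真实系统中它可以在该分区专属的线程上运行而无需协调。

```python
# Sketch of single-partition serial execution: keys are hashed to partitions,
# and each partition applies its transactions serially, independently of the
# others. (Partition count and hashing scheme are illustrative.)

N_PARTITIONS = 4
partitions = [dict() for _ in range(N_PARTITIONS)]

def partition_of(key):
    return hash(key) % N_PARTITIONS

def single_partition_txn(key, update_fn):
    # Touches exactly one partition, so it needs no cross-partition
    # coordination; in a real system this would run on that partition's
    # dedicated thread.
    p = partitions[partition_of(key)]
    p[key] = update_fn(p.get(key, 0))

single_partition_txn("alice", lambda v: v + 100)
single_partition_txn("alice", lambda v: v + 50)
assert partitions[partition_of("alice")]["alice"] == 150
```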
However, for any transaction that needs to access multiple partitions, the database must coordinate the transaction across all the partitions that it touches. The stored procedure needs to be performed in lock-step across all partitions to ensure serializability across the whole system.
然而,对于任何需要访问多个分区的事务,数据库必须在其触及的所有分区之间协调该事务。存储过程需要在所有分区上以锁步(lock-step)方式执行,以确保整个系统的可串行化。
Since cross-partition transactions have additional coordination overhead, they are vastly slower than single-partition transactions. VoltDB reports a throughput of about 1,000 cross-partition writes per second, which is orders of magnitude below its single-partition throughput and cannot be increased by adding more machines [ 49 ].
由于跨分区事务有额外的协调开销,它们比单分区事务慢得多。VoltDB报告的跨分区写入吞吐量约为每秒1,000次,比其单分区吞吐量低几个数量级,并且无法通过增加更多机器来提高[49]。
Whether transactions can be single-partition depends very much on the structure of the data used by the application. Simple key-value data can often be partitioned very easily, but data with multiple secondary indexes is likely to require a lot of cross-partition coordination (see “Partitioning and Secondary Indexes” ).
事务能否只涉及单个分区,在很大程度上取决于应用程序所使用数据的结构。简单的键值数据通常可以很容易地进行分区,但具有多个辅助索引的数据很可能需要大量的跨分区协调(请参见"分区和辅助索引")。
Summary of serial execution
Serial execution of transactions has become a viable way of achieving serializable isolation within certain constraints:
在某些约束条件下,串行执行事务已成为实现可串行化隔离的一种可行方式:
-
Every transaction must be small and fast, because it takes only one slow transaction to stall all transaction processing.
每个事务都必须小而快,因为只要有一个缓慢的事务,就会拖慢所有事务的处理。
-
It is limited to use cases where the active dataset can fit in memory. Rarely accessed data could potentially be moved to disk, but if it needed to be accessed in a single-threaded transaction, the system would get very slow. x
仅限于主数据集可以适配内存的情况下使用。很少访问的数据可能会被移动到磁盘,但如果它需要在单线程事务中被访问,系统会变得非常慢。
-
Write throughput must be low enough to be handled on a single CPU core, or else transactions need to be partitioned without requiring cross-partition coordination.
写入吞吐量必须低到单个CPU核心能够处理,否则就需要对事务进行分区,并且不需要跨分区协调。
-
Cross-partition transactions are possible, but there is a hard limit to the extent to which they can be used.
跨分区事务是可能的,但其可用的程度存在硬性限制。
Two-Phase Locking (2PL)
For around 30 years, there was only one widely used algorithm for serializability in databases: two-phase locking (2PL). xi
大约30年来,数据库中实现可串行化只有一种被广泛使用的算法:两阶段锁定(2PL)。
2PL is not 2PC
Note that while two-phase locking (2PL) sounds very similar to two-phase commit (2PC), they are completely different things. We will discuss 2PC in Chapter 9 .
请注意,虽然两阶段锁定(2PL)听起来与两阶段提交(2PC)非常相似,但它们是完全不同的事情。我们将在第9章讨论2PC。
We saw previously that locks are often used to prevent dirty writes (see “No dirty writes” ): if two transactions concurrently try to write to the same object, the lock ensures that the second writer must wait until the first one has finished its transaction (aborted or committed) before it may continue.
我们之前看到过,锁经常被用来防止脏写(参见“无脏写”):如果两个事务同时尝试写入同一对象,锁确保第二个写入者必须等待第一个完成其事务(中止或提交)才能继续。
Two-phase locking is similar, but makes the lock requirements much stronger. Several transactions are allowed to concurrently read the same object as long as nobody is writing to it. But as soon as anyone wants to write (modify or delete) an object, exclusive access is required:
两阶段锁定与之类似,但锁的要求强得多。只要没有人在写入,多个事务可以并发读取同一个对象。但只要有人想要写入(修改或删除)一个对象,就需要独占访问:
-
If transaction A has read an object and transaction B wants to write to that object, B must wait until A commits or aborts before it can continue. (This ensures that B can’t change the object unexpectedly behind A’s back.)
如果事务A已经读取了一个对象,而事务B想要写入该对象,B必须等待A提交或中止后才能继续执行。这确保了B不能在A背后意外地更改对象。
-
If transaction A has written an object and transaction B wants to read that object, B must wait until A commits or aborts before it can continue. (Reading an old version of the object, like in Figure 7-1 , is not acceptable under 2PL.)
如果事务A已经写入了一个对象,事务B想要读取该对象,则B必须等待A提交或中止后才能继续。(像图7-1中读取旧版本的对象不符合2PL条件。)
In 2PL, writers don’t just block other writers; they also block readers and vice versa. Snapshot isolation has the mantra readers never block writers, and writers never block readers (see “Implementing snapshot isolation” ), which captures this key difference between snapshot isolation and two-phase locking. On the other hand, because 2PL provides serializability, it protects against all the race conditions discussed earlier, including lost updates and write skew.
在2PL中,写入者不仅会阻塞其他写入者,还会阻塞读者,反之亦然。快照隔离的口号是"读不阻塞写,写不阻塞读"(见"实现快照隔离"),这正抓住了快照隔离与两阶段锁定之间的关键差异。另一方面,由于2PL提供了可串行化,它可以防止前面讨论的所有竞争条件,包括丢失更新和写偏斜。
Implementation of two-phase locking
2PL is used by the serializable isolation level in MySQL (InnoDB) and SQL Server, and the repeatable read isolation level in DB2 [ 23 , 36 ].
2PL被MySQL(InnoDB)和SQL Server中的可串行化隔离级别,以及DB2中的可重复读隔离级别所使用。
The blocking of readers and writers is implemented by having a lock on each object in the database. The lock can either be in shared mode or in exclusive mode . The lock is used as follows:
通过在数据库中的每个对象上设置锁实现读取器和写入器的阻塞。该锁可以处于共享模式或独占模式。该锁的使用如下:
-
If a transaction wants to read an object, it must first acquire the lock in shared mode. Several transactions are allowed to hold the lock in shared mode simultaneously, but if another transaction already has an exclusive lock on the object, these transactions must wait.
如果一个事务想要读取一个对象,它必须首先以共享模式获取锁。多个事务可以同时持有共享锁,但如果另一个事务已经以排他锁的方式锁定了该对象,这些事务必须等待。
-
If a transaction wants to write to an object, it must first acquire the lock in exclusive mode. No other transaction may hold the lock at the same time (either in shared or in exclusive mode), so if there is any existing lock on the object, the transaction must wait.
如果事务想要写入一个对象,它必须先以独占模式获得锁。在同一时间内,没有其他事务可以以共享或独占模式持有该锁,因此如果对象已经存在锁,则该事务必须等待。
-
If a transaction first reads and then writes an object, it may upgrade its shared lock to an exclusive lock. The upgrade works the same as getting an exclusive lock directly.
如果一个事务首先读取,然后写入一个对象,它可以将其共享锁升级为独占锁。升级的过程与直接获取独占锁相同。
-
After a transaction has acquired the lock, it must continue to hold the lock until the end of the transaction (commit or abort). This is where the name “two-phase” comes from: the first phase (while the transaction is executing) is when the locks are acquired, and the second phase (at the end of the transaction) is when all the locks are released.
一个事务获得锁之后,必须一直持有该锁直到事务结束(提交或中止)。这就是"两阶段"这个名字的由来:第一阶段(事务执行期间)获取锁,第二阶段(事务结束时)释放所有锁。
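The four rules above can be modeled in a few lines of Python (a single-threaded toy lock table; a real database would block the caller and detect deadlocks instead of returning False):
上面的四条规则可以用几行Python来建模(这是一个单线程的玩具锁表;真实数据库会让调用者阻塞并检测死锁,而不是返回False):

```python
# Minimal single-threaded model of a 2PL lock table (no blocking or deadlock
# detection): acquire returns False where a real database would make the
# transaction wait.

class LockTable:
    def __init__(self):
        self.shared = {}     # obj -> set of txids holding a shared lock
        self.exclusive = {}  # obj -> txid holding the exclusive lock

    def acquire_shared(self, tx, obj):
        owner = self.exclusive.get(obj)
        if owner is not None and owner != tx:
            return False                      # writer present: reader waits
        self.shared.setdefault(obj, set()).add(tx)
        return True

    def acquire_exclusive(self, tx, obj):
        others = self.shared.get(obj, set()) - {tx}
        if others or self.exclusive.get(obj, tx) != tx:
            return False                      # any other holder: writer waits
        self.exclusive[obj] = tx              # also covers shared->exclusive upgrade
        return True

    def release_all(self, tx):
        # Phase two: all locks are released together at commit/abort.
        for holders in self.shared.values():
            holders.discard(tx)
        self.exclusive = {k: v for k, v in self.exclusive.items() if v != tx}

locks = LockTable()
assert locks.acquire_shared("A", "x")
assert locks.acquire_shared("B", "x")          # shared locks coexist
assert not locks.acquire_exclusive("B", "x")   # A still reads x: B must wait
locks.release_all("A")
assert locks.acquire_exclusive("B", "x")       # upgrade succeeds once A is done
```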
Since so many locks are in use, it can happen quite easily that transaction A is stuck waiting for transaction B to release its lock, and vice versa. This situation is called deadlock . The database automatically detects deadlocks between transactions and aborts one of them so that the others can make progress. The aborted transaction needs to be retried by the application.
由于有许多锁被使用,很容易发生事务A被卡在等待事务B释放锁的状态,反之亦然。这种情况称为死锁。数据库会自动检测事务之间的死锁并中止其中一个,以便其他事务可以进行。被中止的事务需要由应用程序重试。
Performance of two-phase locking
The big downside of two-phase locking, and the reason why it hasn’t been used by everybody since the 1970s, is performance: transaction throughput and response times of queries are significantly worse under two-phase locking than under weak isolation.
两阶段锁定的一个巨大缺点,也是自20世纪70年代以来它并没有被所有人广泛使用的原因,就是性能问题:与弱隔离相比,两阶段锁定的交易吞吐量和查询响应时间都明显较差。
This is partly due to the overhead of acquiring and releasing all those locks, but more importantly due to reduced concurrency. By design, if two concurrent transactions try to do anything that may in any way result in a race condition, one has to wait for the other to complete.
这部分是由于获取与释放所有锁的开销,但更重要的是由于并发性降低。按设计,如果两个并发的事务尝试做可能以任何方式导致竞争条件的任何事情,其中一个必须等待另一个完成。
Traditional relational databases don’t limit the duration of a transaction, because they are designed for interactive applications that wait for human input. Consequently, when one transaction has to wait on another, there is no limit on how long it may have to wait. Even if you make sure that you keep all your transactions short, a queue may form if several transactions want to access the same object, so a transaction may have to wait for several others to complete before it can do anything.
传统的关系型数据库不会限制事务的持续时间,因为它们是为等待人类输入的交互式应用程序设计的。因此,当一个事务必须等待另一个事务时,它需要等待多长时间是没有限制的。即使你确保所有事务都很短,当多个事务想要访问同一个对象时也可能形成队列,因此一个事务可能需要等待其他几个事务完成之后才能做任何事情。
For this reason, databases running 2PL can have quite unstable latencies, and they can be very slow at high percentiles (see “Describing Performance” ) if there is contention in the workload. It may take just one slow transaction, or one transaction that accesses a lot of data and acquires many locks, to cause the rest of the system to grind to a halt. This instability is problematic when robust operation is required.
因此,运行2PL的数据库可能具有相当不稳定的延迟,如果工作负载中存在争用,它们在高百分位上可能非常慢(请参见"描述性能")。可能只需要一个缓慢的事务,或者一个访问大量数据并获取许多锁的事务,就能使系统的其余部分陷入停顿。当需要稳健的运行时,这种不稳定性就成了问题。
Although deadlocks can happen with the lock-based read committed isolation level, they occur much more frequently under 2PL serializable isolation (depending on the access patterns of your transaction). This can be an additional performance problem: when a transaction is aborted due to deadlock and is retried, it needs to do its work all over again. If deadlocks are frequent, this can mean significant wasted effort.
虽然基于锁的"读已提交"隔离级别也可能发生死锁,但在2PL可串行化隔离级别下(取决于事务的访问模式),死锁发生的频率要高得多。这可能是一个额外的性能问题:当一个事务因死锁而被中止并重试时,它需要重新做一遍所有的工作。如果死锁频繁发生,这可能意味着大量白费的工作。
Predicate locks
In the preceding description of locks, we glossed over a subtle but important detail. In “Phantoms causing write skew” we discussed the problem of phantoms —that is, one transaction changing the results of another transaction’s search query. A database with serializable isolation must prevent phantoms.
在前面对锁的描述中,我们忽略了一个微妙但重要的细节。在"导致写偏斜的幻读"中,我们讨论了幻读问题,即一个事务改变了另一个事务搜索查询的结果。具有可串行化隔离的数据库必须防止幻读。
In the meeting room booking example this means that if one transaction has searched for existing bookings for a room within a certain time window (see Example 7-2 ), another transaction is not allowed to concurrently insert or update another booking for the same room and time range. (It’s okay to concurrently insert bookings for other rooms, or for the same room at a different time that doesn’t affect the proposed booking.)
在会议室预订的例子中,这意味着如果一个事务在某个时间窗口内搜索了某个房间的现有预订(参见示例7-2),则不允许另一个事务并发地插入或更新同一房间和时间范围的另一个预订。(并发插入其他房间的预订,或者为同一房间插入不影响拟议预订的其他时间的预订,都是可以的。)
How do we implement this? Conceptually, we need a predicate lock [ 3 ]. It works similarly to the shared/exclusive lock described earlier, but rather than belonging to a particular object (e.g., one row in a table), it belongs to all objects that match some search condition, such as:
我们如何实现这个?从概念上讲,我们需要一个谓词锁[3]。它的工作方式类似于之前描述的共享/独占锁,但它不属于特定对象(例如,在表中的一行),而是属于所有符合某些搜索条件的对象,例如:
SELECT * FROM bookings
WHERE room_id = 123 AND
      end_time > '2018-01-01 12:00' AND
      start_time < '2018-01-01 13:00';
A predicate lock restricts access as follows:
谓词锁限制访问如下:
-
If transaction A wants to read objects matching some condition, like in that SELECT query, it must acquire a shared-mode predicate lock on the conditions of the query. If another transaction B currently has an exclusive lock on any object matching those conditions, A must wait until B releases its lock before it is allowed to make its query.
如果事务A想要读取符合某些条件的对象(例如上面那个SELECT查询),它必须获取查询条件上的共享模式谓词锁。如果另一个事务B当前在任何符合这些条件的对象上持有排他锁,A必须等到B释放其锁之后,才被允许执行查询。
-
If transaction A wants to insert, update, or delete any object, it must first check whether either the old or the new value matches any existing predicate lock. If there is a matching predicate lock held by transaction B, then A must wait until B has committed or aborted before it can continue.
如果事务A想要插入、更新或删除任何对象,它必须首先检查旧值或新值是否与任何现有的谓词锁相匹配。如果事务B持有匹配的谓词锁,那么A必须等到B提交或中止之后才能继续。
The key idea here is that a predicate lock applies even to objects that do not yet exist in the database, but which might be added in the future (phantoms). If two-phase locking includes predicate locks, the database prevents all forms of write skew and other race conditions, and so its isolation becomes serializable.
这里的关键思想是,谓词锁适用于数据库中尚不存在但将来可能添加的对象(幻象)。如果两阶段锁定包括谓词锁,则数据库可以防止所有形式的写偏和其他竞争条件,因此它的隔离性变得可串行化。
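A sketch of a predicate lock as a plain Python function over row values (field names are taken from the booking example; this is conceptual, not how any database stores predicates): note that the check against new_row is what catches phantoms, because it applies to rows that do not exist yet.
下面把谓词锁示意为一个作用在行值上的普通Python函数(字段名取自预订的例子;这只是概念演示,并非任何数据库存储谓词的方式):注意对new_row的检查正是捕获幻读的关键,因为它适用于尚不存在的行。

```python
# Sketch of a predicate lock: instead of locking one row, a reader registers
# the search condition itself, and a writer checks its old and new row values
# against every registered predicate.

predicate_locks = []   # list of (txid, condition) held by active readers

def overlaps_room_123_noon(booking):
    return (booking["room_id"] == 123
            and booking["end_time"] > "12:00"
            and booking["start_time"] < "13:00")

def can_write(tx, old_row, new_row):
    # The write conflicts if either the old or the new value matches a
    # predicate held by a different transaction; checking new_row is what
    # catches phantoms, since that row may not exist yet.
    for holder, pred in predicate_locks:
        if holder != tx:
            for row in (old_row, new_row):
                if row is not None and pred(row):
                    return False
    return True

predicate_locks.append(("A", overlaps_room_123_noon))
insert = {"room_id": 123, "start_time": "12:30", "end_time": "13:30"}
other = {"room_id": 456, "start_time": "12:30", "end_time": "13:30"}
assert not can_write("B", None, insert)   # phantom blocked: matches A's predicate
assert can_write("B", None, other)        # different room: no conflict
```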
Index-range locks
Unfortunately, predicate locks do not perform well: if there are many locks by active transactions, checking for matching locks becomes time-consuming. For that reason, most databases with 2PL actually implement index-range locking (also known as next-key locking ), which is a simplified approximation of predicate locking [ 41 , 50 ].
可惜的是,谓词锁定的性能并不好:如果存在许多由活动事务所持有的锁定,查找匹配的锁定就会变得耗时。因此,大多数实现了二阶段锁定的数据库实际上采用索引范围锁定(也称为next-key锁定),这是谓词锁定的简化近似[41,50]。
It’s safe to simplify a predicate by making it match a greater set of objects. For example, if you have a predicate lock for bookings of room 123 between noon and 1 p.m., you can approximate it by locking bookings for room 123 at any time, or you can approximate it by locking all rooms (not just room 123) between noon and 1 p.m. This is safe, because any write that matches the original predicate will definitely also match the approximations.
将谓词简化,使其与更大的对象集匹配是安全的。例如,如果你有一个谓词锁定了房间123在中午至下午1点之间的预订,你可以通过锁定任何时间的房间123的预订或锁定在中午至下午1点之间的所有房间(不仅仅是房间123)来近似它。这是安全的,因为与原始谓词匹配的任何写入肯定也与近似匹配。
In the room bookings database you would probably have an index on the room_id column, and/or indexes on start_time and end_time (otherwise the preceding query would be very slow on a large database):
在房间预订数据库中,你可能会在room_id列上建立索引,和/或在start_time和end_time上建立索引(否则在大型数据库上,前面的查询会非常慢):
-
Say your index is on room_id, and the database uses this index to find existing bookings for room 123. Now the database can simply attach a shared lock to this index entry, indicating that a transaction has searched for bookings of room 123.
假设你的索引建立在room_id上,数据库使用这个索引来查找房间123的现有预订。现在数据库可以简单地在该索引条目上附加一个共享锁,表示有一个事务搜索过房间123的预订。
-
Alternatively, if the database uses a time-based index to find existing bookings, it can attach a shared lock to a range of values in that index, indicating that a transaction has searched for bookings that overlap with the time period of noon to 1 p.m. on January 1, 2018.
或者,如果数据库使用基于时间的索引来查找现有预订,则可以将共享锁附加到该索引中的一系列值上,指示事务已搜索与2018年1月1日中午到下午1点的时间段重叠的预订。
Either way, an approximation of the search condition is attached to one of the indexes. Now, if another transaction wants to insert, update, or delete a booking for the same room and/or an overlapping time period, it will have to update the same part of the index. In the process of doing so, it will encounter the shared lock, and it will be forced to wait until the lock is released.
无论哪种方式,搜索条件的近似值都会被附加到其中一个索引上。现在,如果另一个事务想要插入、更新或删除同一房间和/或重叠时间段的预订,它将不得不更新索引的同一部分。在这个过程中,它将遇到共享锁,并被迫等待直到锁被释放。
This provides effective protection against phantoms and write skew. Index-range locks are not as precise as predicate locks would be (they may lock a bigger range of objects than is strictly necessary to maintain serializability), but since they have much lower overheads, they are a good compromise.
这提供了有效的保护,防止幽灵和写入倾斜。索引范围锁定不如谓词锁定精确(它们可能锁定比维护序列化所需的对象范围更大的范围),但由于它们的开销要小得多,所以它们是一个不错的折衷方案。
If there is no suitable index where a range lock can be attached, the database can fall back to a shared lock on the entire table. This will not be good for performance, since it will stop all other transactions writing to the table, but it’s a safe fallback position.
如果没有适合的索引来附加一个范围锁,则数据库可以回退到对整个表进行共享锁。这对性能来说不是很好,因为它会停止所有其他正在对表进行写入的事务,但这是一种安全的备份方案。
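An index-range lock can be sketched as an interval attached to a time index (times are simplified to integer hours; purely illustrative): any write falling inside a range locked by another transaction must wait, even though the range is coarser than the original predicate.
索引范围锁可以示意为附加在时间索引上的一个区间(时间简化为整数小时,仅作说明):任何落在其他事务所锁定区间内的写入都必须等待,即使这个区间比原始谓词更粗。

```python
# Sketch of index-range locking on a time index: a reader locks a coarser
# range than its exact predicate, and writers check for overlap with locked
# ranges. (Times simplified to integer hours for clarity.)

range_locks = []   # (txid, lo, hi) shared locks attached to the time index

def lock_range(tx, lo, hi):
    range_locks.append((tx, lo, hi))

def write_allowed(tx, time_point):
    # A write must wait if it falls inside a range locked by another txn.
    return all(not (lo <= time_point <= hi) or holder == tx
               for holder, lo, hi in range_locks)

lock_range("A", 12, 13)        # A searched bookings between noon and 1 p.m.
assert not write_allowed("B", 12)   # B's conflicting insert must wait
assert write_allowed("B", 15)       # 3 p.m. is outside every locked range
```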
Serializable Snapshot Isolation (SSI)
This chapter has painted a bleak picture of concurrency control in databases. On the one hand, we have implementations of serializability that don’t perform well (two-phase locking) or don’t scale well (serial execution). On the other hand, we have weak isolation levels that have good performance, but are prone to various race conditions (lost updates, write skew, phantoms, etc.). Are serializable isolation and good performance fundamentally at odds with each other?
这一章描绘了数据库并发控制的一幅黯淡画面。一方面,我们拥有串行化的实现,但它们表现不佳(如两阶段锁定)或不能良好扩展(如串行执行)。另一方面,我们有弱隔离级别,其表现良好,但容易出现各种竞态条件(丢失更新,写偏斜,幻像等)。可串行隔离和良好性能在根本上是互相对立的吗?
Perhaps not: an algorithm called serializable snapshot isolation (SSI) is very promising. It provides full serializability, but has only a small performance penalty compared to snapshot isolation. SSI is fairly new: it was first described in 2008 [ 40 ] and is the subject of Michael Cahill’s PhD thesis [ 51 ].
也许不会: 一个名为“可串行化快照隔离”(SSI)的算法非常有前景。它提供了完全的串行性,但相比于快照隔离,只有很小的性能损失。SSI是相当新的: 它首次在2008年被描述[40],并成为Michael Cahill的博士论文[51]的主题。
Today SSI is used both in single-node databases (the serializable isolation level in PostgreSQL since version 9.1 [ 41 ]) and distributed databases (FoundationDB uses a similar algorithm). As SSI is so young compared to other concurrency control mechanisms, it is still proving its performance in practice, but it has the possibility of being fast enough to become the new default in the future.
今天,SSI在单节点数据库(PostgreSQL自9.1版本以来的串行化隔离级别[41])和分布式数据库(FoundationDB使用类似的算法)中都得到了应用。由于SSI与其他并发控制机制相比年轻,因此它仍在实践中证明其性能,但它有可能足够快,成为未来的新默认设置。
Pessimistic versus optimistic concurrency control
Two-phase locking is a so-called pessimistic concurrency control mechanism: it is based on the principle that if anything might possibly go wrong (as indicated by a lock held by another transaction), it’s better to wait until the situation is safe again before doing anything. It is like mutual exclusion , which is used to protect data structures in multi-threaded programming.
两阶段锁定是一种所谓的悲观并发控制机制:它基于这样的原则,即如果有任何事情可能出错(正如另一个事务所持有的锁所表明的那样),最好等到情况再次安全时再做任何事情。这就像多线程编程中用来保护数据结构的互斥(mutual exclusion)一样。
Serial execution is, in a sense, pessimistic to the extreme: it is essentially equivalent to each transaction having an exclusive lock on the entire database (or one partition of the database) for the duration of the transaction. We compensate for the pessimism by making each transaction very fast to execute, so it only needs to hold the “lock” for a short time.
串行执行可以说是极度悲观的,本质上相当于每个事务在整个数据库(或数据库的一个分区)上都具有独占锁定,持续时间为整个事务期间。我们通过使每个事务执行非常快来弥补这种悲观情绪,因此它只需要短时间内持有“锁定”即可。
By contrast, serializable snapshot isolation is an optimistic concurrency control technique. Optimistic in this context means that instead of blocking if something potentially dangerous happens, transactions continue anyway, in the hope that everything will turn out all right. When a transaction wants to commit, the database checks whether anything bad happened (i.e., whether isolation was violated); if so, the transaction is aborted and has to be retried. Only transactions that executed serializably are allowed to commit.
相比之下,可串行化快照隔离是一种乐观的并发控制技术。在这里,乐观意味着:当可能发生危险时,事务不会阻塞,而是继续执行,希望一切最终都会好起来。当一个事务想要提交时,数据库检查是否发生了什么不好的事情(即隔离是否被违反);如果是,事务将被中止并且必须重试。只有以可串行化方式执行的事务才被允许提交。
Optimistic concurrency control is an old idea [ 52 ], and its advantages and disadvantages have been debated for a long time [ 53 ]. It performs badly if there is high contention (many transactions trying to access the same objects), as this leads to a high proportion of transactions needing to abort. If the system is already close to its maximum throughput, the additional transaction load from retried transactions can make performance worse.
乐观并发控制是一个旧思想[52],它的优缺点已经被长时间讨论[53]。如果存在高争用(很多事务试图访问相同的对象),它的表现会很差,因为这会导致许多事务需要中止。如果系统已经接近最大吞吐量,重试事务的额外负载可能会使性能更差。
However, if there is enough spare capacity, and if contention between transactions is not too high, optimistic concurrency control techniques tend to perform better than pessimistic ones. Contention can be reduced with commutative atomic operations: for example, if several transactions concurrently want to increment a counter, it doesn’t matter in which order the increments are applied (as long as the counter isn’t read in the same transaction), so the concurrent increments can all be applied without conflicting.
然而,如果有足够的备用容量,而且事务之间的争用不是太高,乐观并发控制技术的性能往往优于悲观的技术。通过交换原子操作可以减少争用:例如,如果几个事务同时想要递增一个计数器,递增的顺序并不重要(只要在同一事务中没有读取计数器),因此,可以应用所有并发的递增而不会产生冲突。
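The increment example can be checked directly in Python: because addition commutes, every interleaving of the increments produces the same final counter value, so none of the orderings needs to conflict.
递增的例子可以直接用Python验证:因为加法满足交换律,递增操作的每一种交错顺序都会得到相同的最终计数值,因此这些顺序之间不需要发生冲突。

```python
# All interleavings of a set of commutative increments yield the same result,
# so concurrent increment-only transactions cannot conflict with each other.
import itertools

def apply_increments(initial, increments):
    value = initial
    for inc in increments:
        value += inc
    return value

incs = [1, 5, 10]
results = {apply_increments(0, order) for order in itertools.permutations(incs)}
assert results == {16}   # every ordering produces the same final value
```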
As the name suggests, SSI is based on snapshot isolation—that is, all reads within a transaction are made from a consistent snapshot of the database (see “Snapshot Isolation and Repeatable Read” ). This is the main difference compared to earlier optimistic concurrency control techniques. On top of snapshot isolation, SSI adds an algorithm for detecting serialization conflicts among writes and determining which transactions to abort.
正如其名称所示,SSI基于快照隔离,即事务中的所有读取均从数据库的一致快照中进行(请参见“快照隔离和可重复读”)。这是与早期乐观并发控制技术相比的主要差异。在快照隔离的基础上,SSI添加了一种算法,用于检测写入之间的序列化冲突并确定要中止哪些事务。
Decisions based on an outdated premise
When we previously discussed write skew in snapshot isolation (see “Write Skew and Phantoms” ), we observed a recurring pattern: a transaction reads some data from the database, examines the result of the query, and decides to take some action (write to the database) based on the result that it saw. However, under snapshot isolation, the result from the original query may no longer be up-to-date by the time the transaction commits, because the data may have been modified in the meantime.
在之前讨论的快照隔离中的写倾斜(见“写倾斜和幻读”)中,我们观察到一个经常出现的模式:一个事务从数据库读取一些数据,检查查询结果,并根据所看到的结果来采取某些操作(写入数据库)。但是,在快照隔离下,原始查询的结果在事务提交时可能已经不再是最新的,因为数据可能已在此期间被修改。
Put another way, the transaction is taking an action based on a premise (a fact that was true at the beginning of the transaction, e.g., “There are currently two doctors on call”). Later, when the transaction wants to commit, the original data may have changed—the premise may no longer be true.
换句话说,事务基于一个前提(premise)采取行动,即在事务开始时为真的事实,例如"目前有两名医生在值班"。稍后,当事务想要提交时,原始数据可能已经改变,前提可能不再成立。
When the application makes a query (e.g., “How many doctors are currently on call?”), the database doesn’t know how the application logic uses the result of that query. To be safe, the database needs to assume that any change in the query result (the premise) means that writes in that transaction may be invalid. In other words, there may be a causal dependency between the queries and the writes in the transaction. In order to provide serializable isolation, the database must detect situations in which a transaction may have acted on an outdated premise and abort the transaction in that case.
当应用程序发出查询(例如"目前有多少医生在值班?")时,数据库并不知道应用程序逻辑如何使用该查询的结果。为了安全起见,数据库需要假设查询结果(前提)的任何变化都意味着该事务中的写入可能无效。换句话说,事务中的查询与写入之间可能存在因果依赖。为了提供可串行化的隔离,数据库必须检测事务可能基于过时前提执行操作的情况,并在这种情况下中止该事务。
How does the database know if a query result might have changed? There are two cases to consider:
数据库如何判断查询结果是否可能已经改变?这里需要考虑两种情况:
-
Detecting reads of a stale MVCC object version (uncommitted write occurred before the read)
检测到读取了过期的MVCC对象版本(未提交写入在读取之前发生)。
-
Detecting writes that affect prior reads (the write occurs after the read)
检测到对先前读取的写入(写入发生在读取之后)。
Detecting stale MVCC reads
Recall that snapshot isolation is usually implemented by multi-version concurrency control (MVCC; see Figure 7-10 ). When a transaction reads from a consistent snapshot in an MVCC database, it ignores writes that were made by any other transactions that hadn’t yet committed at the time when the snapshot was taken. In Figure 7-10 , transaction 43 sees Alice as having on_call = true , because transaction 42 (which modified Alice’s on-call status) is uncommitted. However, by the time transaction 43 wants to commit, transaction 42 has already committed. This means that the write that was ignored when reading from the consistent snapshot has now taken effect, and transaction 43’s premise is no longer true.
回想一下,快照隔离通常是通过多版本并发控制(MVCC,见图7-10)实现的。当事务从MVCC数据库中的一致快照读取时,它会忽略在快照建立时尚未提交的其他事务所做的写入。在图7-10中,事务43看到Alice的on_call = true,因为事务42(它修改了Alice的值班状态)尚未提交。然而,当事务43想要提交时,事务42已经提交了。这意味着在从一致快照读取时被忽略的写入现在已经生效,事务43的前提不再成立。
In order to prevent this anomaly, the database needs to track when a transaction ignores another transaction’s writes due to MVCC visibility rules. When the transaction wants to commit, the database checks whether any of the ignored writes have now been committed. If so, the transaction must be aborted.
为了防止这种异常情况,数据库需要跟踪何时事务由于MVCC可见性规则而忽略了另一个事务的写操作。当事务想要提交时,数据库会检查是否有任何被忽略的写操作现在已经被提交。如果是,则必须中止该事务。
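This bookkeeping can be sketched as follows (a toy model of SSI's stale-read tracking, not PostgreSQL's implementation): each transaction remembers the writers whose uncommitted writes it ignored, and at commit time it aborts if any of them have committed since.
这种记录可以示意如下(这是SSI陈旧读取跟踪的玩具模型,并非PostgreSQL的实现):每个事务记住它忽略了哪些未提交写入者的写入,提交时如果其中任何一个已经提交,就中止。

```python
# Sketch of SSI's stale-MVCC-read check: each transaction remembers which
# uncommitted writers it ignored when reading from its snapshot, and at
# commit time it aborts if any of them committed in the meantime.

committed = set()

class Txn:
    def __init__(self, txid):
        self.txid = txid
        self.ignored_writers = set()   # writers invisible in our snapshot

    def read(self, obj, uncommitted_writer):
        # Snapshot read: we saw the old version and ignored this writer.
        if uncommitted_writer is not None:
            self.ignored_writers.add(uncommitted_writer)

    def commit(self):
        # Abort if an ignored write has since taken effect: our premise
        # (the snapshot we read from) is outdated.
        if self.ignored_writers & committed:
            return "abort"
        committed.add(self.txid)
        return "commit"

t43 = Txn(43)
t43.read("alice_on_call", uncommitted_writer=42)  # txn 42's write was ignored
committed.add(42)                                 # ...but 42 then committed
assert t43.commit() == "abort"
```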
Why wait until committing? Why not abort transaction 43 immediately when the stale read is detected? Well, if transaction 43 was a read-only transaction, it wouldn’t need to be aborted, because there is no risk of write skew. At the time when transaction 43 makes its read, the database doesn’t yet know whether that transaction is going to later perform a write. Moreover, transaction 42 may yet abort or may still be uncommitted at the time when transaction 43 is committed, and so the read may turn out not to have been stale after all. By avoiding unnecessary aborts, SSI preserves snapshot isolation’s support for long-running reads from a consistent snapshot.
为什么要等到提交时才检查?为什么不在检测到陈旧读取时就立即中止事务43?如果事务43是只读事务,就不需要中止,因为不存在写偏斜的风险。在事务43进行读取时,数据库尚不知道该事务稍后是否会执行写入。此外,事务42可能会中止,或者在事务43提交时仍未提交,因此该读取最终可能根本不算陈旧。通过避免不必要的中止,SSI保留了快照隔离对从一致快照进行长时间读取的支持。
Detecting writes that affect prior reads
The second case to consider is when another transaction modifies data after it has been read. This case is illustrated in Figure 7-11 .
需要考虑的第二种情况是,另一个事务在数据被读取之后修改了数据。这种情况如图7-11所示。
In the context of two-phase locking we discussed index-range locks (see “Index-range locks” ), which allow the database to lock access to all rows matching some search query, such as WHERE shift_id = 1234 . We can use a similar technique here, except that SSI locks don’t block other transactions.
在两阶段锁定的背景下,我们讨论了索引范围锁(请参阅"索引范围锁"),它允许数据库锁定对与某个搜索查询(例如WHERE shift_id = 1234)相匹配的所有行的访问。我们可以在这里使用类似的技术,只不过SSI的锁不会阻塞其他事务。
In Figure 7-11 , transactions 42 and 43 both search for on-call doctors during shift 1234. If there is an index on shift_id , the database can use the index entry 1234 to record the fact that transactions 42 and 43 read this data. (If there is no index, this information can be tracked at the table level.) This information only needs to be kept for a while: after a transaction has finished (committed or aborted), and all concurrent transactions have finished, the database can forget what data it read.
在图7-11中,事务42和43都在查找班次1234期间值班的医生。如果shift_id上有索引,数据库可以使用索引条目1234来记录事务42和43读取了这些数据的事实。(如果没有索引,这些信息可以在表级别上跟踪。)这些信息只需要保留一段时间:在一个事务结束(提交或中止)、并且所有并发事务也都结束之后,数据库就可以忘记它读取过哪些数据。
When a transaction writes to the database, it must look in the indexes for any other transactions that have recently read the affected data. This process is similar to acquiring a write lock on the affected key range, but rather than blocking until the readers have committed, the lock acts as a tripwire: it simply notifies the transactions that the data they read may no longer be up to date.
当一个事务写入数据库时,它必须在索引中查找最近读取过受影响数据的任何其他事务。这个过程类似于在受影响的键范围上获取写锁,但该锁不会阻塞到读者提交为止,而是像一个绊网(tripwire):它只是通知那些事务,它们所读取的数据可能不再是最新的。
In Figure 7-11 , transaction 43 notifies transaction 42 that its prior read is outdated, and vice versa. Transaction 42 is first to commit, and it is successful: although transaction 43’s write affected 42, 43 hasn’t yet committed, so the write has not yet taken effect. However, when transaction 43 wants to commit, the conflicting write from 42 has already been committed, so 43 must abort.
在图7-11中,事务43通知事务42其先前的读取已过时,反之亦然。事务42首先提交,并且成功:尽管事务43的写入会影响42,但43尚未提交,因此写入尚未生效。然而,当事务43想要提交时,42的冲突写入已经提交,因此43必须中止。
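The commit-time rule from Figure 7-11 can be sketched in a few lines of Python. This is a toy in-memory model, not any real database's API (class and method names here are invented): each transaction records which keys it read, a writer "trips the wire" for concurrent readers of those keys, and a flagged transaction must abort only if a conflicting writer has already committed.

```python
class Database:
    """Toy in-memory model of SSI's commit-time conflict check.

    A real engine tracks reads per index entry (or per table) and cleans
    this bookkeeping up once all concurrent transactions have finished.
    """
    def __init__(self):
        self.data = {}
        self.readers = {}            # key -> set of transactions that read it

    def begin(self):
        return Transaction(self)

class Transaction:
    def __init__(self, db):
        self.db = db
        self.writes = {}
        self.invalidated_by = set()  # writers that tripped our wire
        self.committed = False

    def read(self, key):
        self.db.readers.setdefault(key, set()).add(self)
        return self.db.data.get(key)

    def write(self, key, value):
        # Tentative write; notify concurrent readers (the "tripwire").
        self.writes[key] = value
        for reader in self.db.readers.get(key, set()):
            if reader is not self:
                reader.invalidated_by.add(self)

    def commit(self):
        # Abort only if a conflicting writer has already committed.
        if any(w.committed for w in self.invalidated_by):
            return False
        self.db.data.update(self.writes)
        self.committed = True
        return True

# Replaying Figure 7-11: both transactions read shift 1234, both write.
db = Database()
db.data["shift:1234"] = {"alice": True, "bob": True}
t42, t43 = db.begin(), db.begin()
t42.read("shift:1234"); t43.read("shift:1234")
t42.write("shift:1234", {"alice": True})   # Alice stays on call
t43.write("shift:1234", {"bob": True})     # Bob stays on call
ok42 = t42.commit()   # first to commit: succeeds despite being flagged
ok43 = t43.commit()   # conflicting write already committed: must abort
```

Note that t42 commits successfully even though t43's write flagged it, because t43 had not committed at that point; this mirrors the optimistic behavior described above.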
Performance of serializable snapshot isolation
As always, many engineering details affect how well an algorithm works in practice. For example, one trade-off is the granularity at which transactions’ reads and writes are tracked. If the database keeps track of each transaction’s activity in great detail, it can be precise about which transactions need to abort, but the bookkeeping overhead can become significant. Less detailed tracking is faster, but may lead to more transactions being aborted than strictly necessary.
一如既往,许多工程细节会影响算法在实践中的表现。例如,一个权衡是跟踪事务读写的粒度。如果数据库非常详细地跟踪每个事务的活动,它可以精确地确定哪些事务需要中止,但簿记开销可能会变得很大。跟踪得粗略一些会更快,但可能导致比严格必要更多的事务被中止。
In some cases, it’s okay for a transaction to read information that was overwritten by another transaction: depending on what else happened, it’s sometimes possible to prove that the result of the execution is nevertheless serializable. PostgreSQL uses this theory to reduce the number of unnecessary aborts [ 11 , 41 ].
在某些情况下,事务读取被另一个事务覆盖的信息也是可以的:取决于其他发生的情况,有时可以证明执行结果仍然是可串行化的。PostgreSQL使用这一理论来减少不必要的中止次数 [11, 41]。
Compared to two-phase locking, the big advantage of serializable snapshot isolation is that one transaction doesn’t need to block waiting for locks held by another transaction. Like under snapshot isolation, writers don’t block readers, and vice versa. This design principle makes query latency much more predictable and less variable. In particular, read-only queries can run on a consistent snapshot without requiring any locks, which is very appealing for read-heavy workloads.
与两阶段锁定相比,可序列化快照隔离的最大优势在于一个事务不需要阻塞等待另一个事务持有的锁。和快照隔离下一样,写者不阻塞读者,反之亦然。这一设计原则使查询延迟更可预测、波动更小。特别是,只读查询可以在一致快照上运行而无需任何锁,这对读密集型工作负载非常有吸引力。
Compared to serial execution, serializable snapshot isolation is not limited to the throughput of a single CPU core: FoundationDB distributes the detection of serialization conflicts across multiple machines, allowing it to scale to very high throughput. Even though data may be partitioned across multiple machines, transactions can read and write data in multiple partitions while ensuring serializable isolation [ 54 ].
与串行执行相比,可序列化快照隔离并不限于单个 CPU 核的吞吐量:FoundationDB将序列化冲突检测分布在多台机器上,使其能够扩展到非常高的吞吐量。即使数据可能分布在多台机器上,事务也可以在多个分区中读取和写入数据,同时确保可序列化隔离 [54]。
The rate of aborts significantly affects the overall performance of SSI. For example, a transaction that reads and writes data over a long period of time is likely to run into conflicts and abort, so SSI requires that read-write transactions be fairly short (long-running read-only transactions may be okay). However, SSI is probably less sensitive to slow transactions than two-phase locking or serial execution.
中止率显著影响SSI的整体性能。例如,长时间读写数据的事务很可能遇到冲突并中止,因此SSI要求读写事务相当短(长时间运行的只读事务可能没有问题)。不过,SSI对慢事务的敏感程度可能低于两阶段锁定或串行执行。
Summary
Transactions are an abstraction layer that allows an application to pretend that certain concurrency problems and certain kinds of hardware and software faults don’t exist. A large class of errors is reduced down to a simple transaction abort , and the application just needs to try again.
事务是一个抽象层,允许应用程序假装某些并发问题以及某些硬件和软件故障不存在。一大类错误被简化为简单的事务中止,应用程序只需要重试即可。
In this chapter we saw many examples of problems that transactions help prevent. Not all applications are susceptible to all those problems: an application with very simple access patterns, such as reading and writing only a single record, can probably manage without transactions. However, for more complex access patterns, transactions can hugely reduce the number of potential error cases you need to think about.
在本章中,我们看到了许多事务可以帮助防止的问题的例子。并非所有应用程序都容易受到所有这些问题的影响:访问模式非常简单的应用程序(例如只读写单条记录)可能无需事务也能应付。但是,对于更复杂的访问模式,事务可以极大地减少需要考虑的潜在错误情况的数量。
Without transactions, various error scenarios (processes crashing, network interruptions, power outages, disk full, unexpected concurrency, etc.) mean that data can become inconsistent in various ways. For example, denormalized data can easily go out of sync with the source data. Without transactions, it becomes very difficult to reason about the effects that complex interacting accesses can have on the database.
没有事务,各种错误场景(进程崩溃、网络中断、停电、磁盘写满、意外的并发等)意味着数据可能以各种方式变得不一致。例如,非规范化的数据很容易与源数据失去同步。没有事务,就很难推断复杂的交互访问会对数据库产生什么影响。
In this chapter, we went particularly deep into the topic of concurrency control. We discussed several widely used isolation levels, in particular read committed , snapshot isolation (sometimes called repeatable read ), and serializable . We characterized those isolation levels by discussing various examples of race conditions:
在本章中,我们特别深入地探讨了并发控制这一主题。我们讨论了几种广泛使用的隔离级别,特别是读已提交、快照隔离(有时称为可重复读)和可串行化。我们通过讨论各种竞争条件的例子来刻画这些隔离级别:
- Dirty reads
-
One client reads another client’s writes before they have been committed. The read committed isolation level and stronger levels prevent dirty reads.
一个客户端在另一个客户端的写入被提交之前就读到了它。读已提交及更强的隔离级别可以防止脏读。
- Dirty writes
-
One client overwrites data that another client has written, but not yet committed. Almost all transaction implementations prevent dirty writes.
一个客户端覆盖了另一个客户端已写入但尚未提交的数据。几乎所有的事务实现都可以防止脏写。
- Read skew (nonrepeatable reads)
-
A client sees different parts of the database at different points in time. This issue is most commonly prevented with snapshot isolation, which allows a transaction to read from a consistent snapshot at one point in time. It is usually implemented with multi-version concurrency control (MVCC).
客户端在不同时间点看到的数据库部分不同。这个问题最常见地通过快照隔离来解决,它允许事务在某个时间点从一致的快照中读取。通常使用多版本并发控制(MVCC)实现。
- Lost updates
-
Two clients concurrently perform a read-modify-write cycle. One overwrites the other’s write without incorporating its changes, so data is lost. Some implementations of snapshot isolation prevent this anomaly automatically, while others require a manual lock (
SELECT FOR UPDATE
).
两个客户端同时执行读取-修改-写入循环。其中一个覆盖了另一个的写入,而没有合并其更改,从而导致数据丢失。某些快照隔离的实现会自动防止这种异常,而另一些则需要手动加锁(SELECT FOR UPDATE)。
- Write skew
-
A transaction reads something, makes a decision based on the value it saw, and writes the decision to the database. However, by the time the write is made, the premise of the decision is no longer true. Only serializable isolation prevents this anomaly.
一个事务读取某些数据,根据所读到的值做出决定,并将该决定写入数据库。然而,在执行写入时,决定所依据的前提已不再成立。只有可串行化隔离才能防止这种异常。
- Phantom reads
-
A transaction reads objects that match some search condition. Another client makes a write that affects the results of that search. Snapshot isolation prevents straightforward phantom reads, but phantoms in the context of write skew require special treatment, such as index-range locks.
一个事务读取符合某些搜索条件的对象。另一个客户端进行了写入,影响了该搜索的结果。快照隔离可以防止直接的幻读,但写偏差上下文中的幻读需要特殊处理,例如索引区间锁。
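The lost-update entry in the list above is easy to demonstrate concretely. The following toy in-memory sketch (plain Python dictionaries, not any database's API) writes out the interleaving by hand so the anomaly is deterministic, and then shows one common remedy, an atomic compare-and-set with a retry:

```python
# Two clients each perform a read-modify-write cycle on a shared counter.
counter = {"likes": 10}

a = counter["likes"]           # client A reads 10
b = counter["likes"]           # client B reads 10
counter["likes"] = a + 1       # A writes 11
counter["likes"] = b + 1       # B writes 11: A's increment is lost

lost_update_result = counter["likes"]   # 11, not the expected 12

# One remedy is an atomic compare-and-set: the write only succeeds if the
# value is still what the client originally read; otherwise it must retry.
def compare_and_set(store, key, expected, new):
    if store[key] == expected:
        store[key] = new
        return True
    return False

counter = {"likes": 10}
a = counter["likes"]
b = counter["likes"]
assert compare_and_set(counter, "likes", a, a + 1)              # A succeeds
retry_needed = not compare_and_set(counter, "likes", b, b + 1)  # B's premise changed
if retry_needed:
    b = counter["likes"]                                        # B re-reads
    compare_and_set(counter, "likes", b, b + 1)                 # B succeeds
```

In a real database the same effect is achieved with an atomic `UPDATE ... SET likes = likes + 1`, `SELECT FOR UPDATE`, or automatic lost-update detection, as discussed earlier in the chapter.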
Weak isolation levels protect against some of those anomalies but leave you, the application developer, to handle others manually (e.g., using explicit locking). Only serializable isolation protects against all of these issues. We discussed three different approaches to implementing serializable transactions:
弱隔离级别可以防止其中一些异常,但其余的需要你(应用开发者)手动处理(例如使用显式加锁)。只有可串行化隔离才能防止所有这些问题。我们讨论了实现可串行化事务的三种不同方法:
- Literally executing transactions in a serial order
-
If you can make each transaction very fast to execute, and the transaction throughput is low enough to process on a single CPU core, this is a simple and effective option.
如果你能让每个事务执行得非常快,并且事务吞吐量低到可以在单个CPU核上处理,那么这是一种简单而有效的选择。
- Two-phase locking
-
For decades this has been the standard way of implementing serializability, but many applications avoid using it because of its performance characteristics.
几十年来,这一直是实现可串行化的标准方式,但许多应用程序由于其性能特性而避免使用它。
- Serializable snapshot isolation (SSI)
-
A fairly new algorithm that avoids most of the downsides of the previous approaches. It uses an optimistic approach, allowing transactions to proceed without blocking. When a transaction wants to commit, it is checked, and it is aborted if the execution was not serializable.
一种相对较新的算法避免了之前方法的大部分缺点。该算法采用乐观的方式,允许事务在不阻塞的情况下继续进行。当一个事务想要提交时,将进行检查,并且如果执行不可串行化,事务将被中止。
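The first approach in the list above, literal serial execution, can be sketched as a single thread draining a queue of whole transactions, in the style of stored procedures. This is an illustrative toy (the function names are invented, and real systems like VoltDB add durability, replication, and partitioning), but it shows why no locks are needed:

```python
from queue import Queue

def serial_executor(db, tx_queue):
    """Run whole transactions one at a time on a single thread.

    Each queued item is a function acting as a stored procedure: because
    nothing executes concurrently, no locks are needed and every outcome
    is trivially serializable.
    """
    while not tx_queue.empty():
        transaction = tx_queue.get()
        transaction(db)

def withdraw(amount):
    def tx(db):
        if db["balance"] >= amount:   # the whole check-then-act is atomic,
            db["balance"] -= amount   # since nothing else runs meanwhile
    return tx

db = {"balance": 100}
q = Queue()
q.put(withdraw(30))
q.put(withdraw(90))   # would overdraw; the in-transaction check makes it a no-op
serial_executor(db, q)
```

The price of this simplicity, as noted above, is that throughput is capped by a single CPU core and every transaction must be short.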
The examples in this chapter used a relational data model. However, as discussed in “The need for multi-object transactions” , transactions are a valuable database feature, no matter which data model is used.
本章的例子使用了一个关系数据模型。然而,正如“需要多对象事务”中所讨论的那样,无论使用哪种数据模型,事务仍然是一个有价值的数据库特性。
In this chapter, we explored ideas and algorithms mostly in the context of a database running on a single machine. Transactions in distributed databases open a new set of difficult challenges, which we’ll discuss in the next two chapters.
在本章中,我们主要探讨了在单台计算机上运行的数据库的思想和算法。分布式数据库中的事务则带来了一系列新的挑战,这将在接下来的两章中进行讨论。
Footnotes
i Joe Hellerstein has remarked that the C in ACID was “tossed in to make the acronym work” in Härder and Reuter’s paper [ 7 ], and that it wasn’t considered important at the time.
乔·海勒斯坦指出,在哈德尔和鲁特尔的论文中,ACID中的C是"随意加进去的,只是为了让首字母缩写起作用",当时并不被认为是很重要的。
ii Arguably, an incorrect counter in an email application is not a particularly critical problem. Alternatively, think of a customer account balance instead of an unread counter, and a payment transaction instead of an email.
在电子邮件应用程序中,计数器错误可能并不是特别重要的问题。另一方面,想象一下客户账户余额而不是未读计数器,以及付款交易而不是电子邮件。
iii This is not ideal. If the TCP connection is interrupted, the transaction must be aborted. If the interruption happens after the client has requested a commit but before the server acknowledges that the commit happened, the client doesn’t know whether the transaction was committed or not. To solve this issue, a transaction manager can group operations by a unique transaction identifier that is not bound to a particular TCP connection. We will return to this topic in “The End-to-End Argument for Databases” .
这不是理想的情况。如果TCP连接中断,事务必须被中止。如果中断发生在客户端请求提交但服务器尚未确认提交时,客户端不知道事务是否已提交。为解决此问题,事务管理器可以通过唯一的事务标识符将操作分组,该标识符不绑定到特定的TCP连接。我们将在“数据库的端到端论证”中回到这个话题。
iv Strictly speaking, the term atomic increment uses the word atomic in the sense of multi-threaded programming. In the context of ACID, it should actually be called isolated or serializable increment. But that’s getting nitpicky.
从严格意义上讲,原子递增一词使用了多线程编程中的原子意义。在ACID的上下文中,实际应该称为隔离或可序列化递增。但这有点吹毛求疵。
v Some databases support an even weaker isolation level called read uncommitted . It prevents dirty writes, but does not prevent dirty reads.
有些数据库支持更弱的隔离级别,称为“读未提交”。它可以防止脏写,但不能防止脏读。
vi At the time of writing, the only mainstream databases that use locks for read committed isolation are IBM DB2 and Microsoft SQL Server in the read_committed_snapshot=off configuration [23, 36].
在撰写本文时,唯一仍使用锁来实现读已提交隔离的主流数据库是IBM DB2,以及处于read_committed_snapshot=off配置下的Microsoft SQL Server [23, 36]。
vii To be precise, transaction IDs are 32-bit integers, so they overflow after approximately 4 billion transactions. PostgreSQL’s vacuum process performs cleanup which ensures that overflow does not affect the data.
vii 确切地说,事务ID是32位整数,因此在大约40亿个事务之后会溢出。PostgreSQL的清理(vacuum)过程会执行清理工作,确保溢出不会影响数据。
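The idea behind surviving such a wrap-around is circular comparison: an ID counts as "older" if the forward distance to the other ID, modulo the ID space, is less than half that space. The sketch below illustrates the arithmetic only; PostgreSQL's actual rules also involve reserved special XIDs and freezing of old tuples, which are omitted here:

```python
def xid_precedes(a, b, bits=32):
    """Circular comparison of wrap-around transaction IDs.

    ID a is considered older than b if the forward distance from a to b,
    modulo 2**bits, is positive and less than half the ID space.
    (Simplified sketch of the idea, not PostgreSQL's exact algorithm.)
    """
    half = 1 << (bits - 1)
    return 0 < (b - a) % (1 << bits) < half

assert xid_precedes(3, 5)
assert xid_precedes(2**32 - 1, 1)   # still correct across the wrap-around
assert not xid_precedes(5, 3)
```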
viii It is possible, albeit fairly complicated, to express the editing of a text document as a stream of atomic mutations. See “Automatic Conflict Resolution” for some pointers.
可以将文本文档的编辑表示为一系列原子突变,尽管有些复杂。请参阅“自动冲突解决”获取一些指南。
ix In PostgreSQL you can do this more elegantly using range types, but they are not widely supported in other databases.
在PostgreSQL中,您可以使用范围类型更加优雅地完成此操作,但它们在其他数据库中的支持并不广泛。
x If a transaction needs to access data that’s not in memory, the best solution may be to abort the transaction, asynchronously fetch the data into memory while continuing to process other transactions, and then restart the transaction when the data has been loaded. This approach is known as anti-caching , as previously mentioned in “Keeping everything in memory” .
如果一个事务需要访问不在内存中的数据,最好的解决方案可能是中止该事务,在继续处理其他事务的同时异步地将数据取到内存中,然后在数据加载完成后重新启动该事务。这种方法被称为反缓存(anti-caching),如之前在“保持所有内容在内存中”中提到的那样。
xi Sometimes called strong strict two-phase locking (SS2PL) to distinguish it from other variants of 2PL.
有时称为强严格两阶段锁定(SS2PL),以区别于2PL的其他变体。
References
[ 1 ] Donald D. Chamberlin, Morton M. Astrahan, Michael W. Blasgen, et al.: “ A History and Evaluation of System R ,” Communications of the ACM , volume 24, number 10, pages 632–646, October 1981. doi:10.1145/358769.358784
[1] 唐纳德·D·钱伯林、莫顿·M·阿斯特罕、迈克尔·W·布拉斯根等人: “System R的历史与评价”,ACM通讯杂志,第24卷,第10期,页码632-646,1981年10月。 doi:10.1145/358769.358784
[ 2 ] Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu, and Irving L. Traiger: “ Granularity of Locks and Degrees of Consistency in a Shared Data Base ,” in Modelling in Data Base Management Systems: Proceedings of the IFIP Working Conference on Modelling in Data Base Management Systems , edited by G. M. Nijssen, pages 364–394, Elsevier/North Holland Publishing, 1976. Also in Readings in Database Systems , 4th edition, edited by Joseph M. Hellerstein and Michael Stonebraker, MIT Press, 2005. ISBN: 978-0-262-69314-1
[2] Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu 和 Irving L. Traiger: “共享数据库中的锁的颗粒度和一致性程度”,收录于 G.M. Nijssen 编辑的《数据库管理系统建模》会议论文集,第 364-394 页,Elsevier/North Holland Publishing,1976。同时还收录于 MIT Press 出版的第四版《数据库系统阅读》(Joseph M. Hellerstein 和 Michael Stonebraker 编辑),ISBN: 978-0-262-69314-1。
[ 3 ] Kapali P. Eswaran, Jim N. Gray, Raymond A. Lorie, and Irving L. Traiger: “ The Notions of Consistency and Predicate Locks in a Database System ,” Communications of the ACM , volume 19, number 11, pages 624–633, November 1976.
[3] Kapali P. Eswaran、Jim N. Gray、Raymond A. Lorie和Irving L. Traiger:“数据库系统中的一致性概念与谓词锁”,ACM通讯,第19卷,第11期,第624-633页,1976年11月。
[ 4 ] “ ACID Transactions Are Incredibly Helpful ,” FoundationDB, LLC, 2013.
[4] “ACID事务非常有用”,FoundationDB, LLC,2013年。
[ 5 ] John D. Cook: “ ACID Versus BASE for Database Transactions ,” johndcook.com , July 6, 2009.
[5] 约翰·D·库克(John D. Cook):“数据库事务中的ACID与BASE”,johndcook.com,2009年7月6日。
[ 6 ] Gavin Clarke: “ NoSQL’s CAP Theorem Busters: We Don’t Drop ACID ,” theregister.co.uk , November 22, 2012.
[6] Gavin Clarke:“NoSQL的CAP定理克星:我们不放弃ACID”,theregister.co.uk,2012年11月22日。
[ 7 ] Theo Härder and Andreas Reuter: “ Principles of Transaction-Oriented Database Recovery ,” ACM Computing Surveys , volume 15, number 4, pages 287–317, December 1983. doi:10.1145/289.291
[7] Theo Härder 和 Andreas Reuter: “基于事务的数据库恢复原则”,ACM Computing Surveys,第15卷,第4期,页码287-317,1983年12月。 doi:10.1145/289.291
[ 8 ] Peter Bailis, Alan Fekete, Ali Ghodsi, et al.: “ HAT, not CAP: Towards Highly Available Transactions ,” at 14th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2013.
[8] Peter Bailis、Alan Fekete、Ali Ghodsi 等人: “HAT,而非CAP:走向高度可用交易,” 于第14届USENIX操作系统热门话题研讨会(HotOS)中,2013年5月。
[ 9 ] Armando Fox, Steven D. Gribble, Yatin Chawathe, et al.: “ Cluster-Based Scalable Network Services ,” at 16th ACM Symposium on Operating Systems Principles (SOSP), October 1997.
[9] Armando Fox, Steven D. Gribble, Yatin Chawathe等人:“基于集群的可伸缩网络服务”,发表于1997年10月的第16届ACM操作系统原则研讨会(SOSP)。
[ 10 ] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman: Concurrency Control and Recovery in Database Systems . Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at research.microsoft.com .
【10】Philip A. Bernstein、Vassos Hadzilacos 和 Nathan Goodman:《数据库系统的并发控制与恢复》。Addison-Wesley,1987年。ISBN:978-0-201-10715-9,可在research.microsoft.com网站上在线获取。
[ 11 ] Alan Fekete, Dimitrios Liarokapis, Elizabeth O’Neil, et al.: “ Making Snapshot Isolation Serializable ,” ACM Transactions on Database Systems , volume 30, number 2, pages 492–528, June 2005. doi:10.1145/1071610.1071615
[11] Alan Fekete, Dimitrios Liarokapis, Elizabeth O’Neil等人:《使快照隔离可串行化》,《ACM数据系统交易》,第30卷,第2期,2005年6月,页码492-528。doi:10.1145 / 1071610.1071615
[ 12 ] Mai Zheng, Joseph Tucek, Feng Qin, and Mark Lillibridge: “ Understanding the Robustness of SSDs Under Power Fault ,” at 11th USENIX Conference on File and Storage Technologies (FAST), February 2013.
[12] Mai Zheng、Joseph Tucek、Feng Qin和Mark Lillibridge:“理解SSD在电源故障下的健壮性”,发表于2013年2月第11届USENIX文件与存储技术会议(FAST)。
[ 13 ] Laurie Denness: “ SSDs: A Gift and a Curse ,” laur.ie , June 2, 2015.
[13] Laurie Denness:“SSD:既是礼物也是诅咒”,laur.ie,2015年6月2日。
[ 14 ] Adam Surak: “ When Solid State Drives Are Not That Solid ,” blog.algolia.com , June 15, 2015.
[14] 亚当·苏拉克: “当固态硬盘并不稳定”,博客.algolia.com,2015年6月15日。
[ 15 ] Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, et al.: “ All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications ,” at 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2014.
【15】Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan等人:《并非所有文件系统都是平等的:关于开发崩溃一致应用程序的复杂性》,2014年10月在第11届USENIX操作系统设计与实现研讨会(OSDI)上发表。
[ 16 ] Chris Siebenmann: “ Unix’s File Durability Problem ,” utcc.utoronto.ca , April 14, 2016.
[16] Chris Siebenmann:“Unix的文件耐久性问题”,utcc.utoronto.ca,2016年4月14日。
[ 17 ] Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, et al.: “ An Analysis of Data Corruption in the Storage Stack ,” at 6th USENIX Conference on File and Storage Technologies (FAST), February 2008.
[17] Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder等: “存储栈中数据损坏的分析”,发表于2008年2月的第6届USENIX文件和存储技术会议(FAST)。
[ 18 ] Bianca Schroeder, Raghav Lagisetty, and Arif Merchant: “ Flash Reliability in Production: The Expected and the Unexpected ,” at 14th USENIX Conference on File and Storage Technologies (FAST), February 2016.
[18] Bianca Schroeder、Raghav Lagisetty和Arif Merchant:“生产环境中的闪存可靠性:预期之中与意料之外”,发表于第14届USENIX文件与存储技术会议(FAST),2016年2月。
[ 19 ] Don Allison: “ SSD Storage – Ignorance of Technology Is No Excuse ,” blog.korelogic.com , March 24, 2015.
[19] Don Allison:“SSD存储——对技术的无知不是借口”,blog.korelogic.com,2015年3月24日。
[ 20 ] Dave Scherer: “ Those Are Not Transactions (Cassandra 2.0) ,” blog.foundationdb.com , September 6, 2013.
[20] Dave Scherer:“那不是事务(Cassandra 2.0)”,blog.foundationdb.com,2013年9月6日。
[ 21 ] Kyle Kingsbury: “ Call Me Maybe: Cassandra ,” aphyr.com , September 24, 2013.
[21] Kyle Kingsbury:“叫我也许:Cassandra”,aphyr.com,2013年9月24日。
[ 22 ] “ ACID Support in Aerospike ,” Aerospike, Inc., June 2014.
[22] “Aerospike中的ACID支持”,Aerospike, Inc.,2014年6月。
[ 23 ] Martin Kleppmann: “ Hermitage: Testing the ‘I’ in ACID ,” martin.kleppmann.com , November 25, 2014.
[23] Martin Kleppmann:“Hermitage:测试ACID中的‘I’”,martin.kleppmann.com,2014年11月25日。
[ 24 ] Tristan D’Agosta: “ BTC Stolen from Poloniex ,” bitcointalk.org , March 4, 2014.
[24] 特里斯坦·达戈斯塔: “Poloniex被盗的比特币。” bitcointalk.org,2014年3月4日。
[ 25 ] bitcointhief2: “ How I Stole Roughly 100 BTC from an Exchange and How I Could Have Stolen More! ,” reddit.com , February 2, 2014.
[25] bitcointhief2:“我是如何从一个交易所偷走大约100个比特币的,以及我本可以偷得更多!”,reddit.com,2014年2月2日。
[ 26 ] Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan: “ Automating the Detection of Snapshot Isolation Anomalies ,” at 33rd International Conference on Very Large Data Bases (VLDB), September 2007.
[26] Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan: “自动检测快照隔离异常”,发表于2007年9月的第33届国际大型数据 库会议(VLDB)。
[ 27 ] Michael Melanson: “ Transactions: The Limits of Isolation ,” michaelmelanson.net , March 20, 2014.
[27] 迈克尔·梅兰森: “交易: 隔离的限制”,michaelmelanson.net,2014年3月20日。
[ 28 ] Hal Berenson, Philip A. Bernstein, Jim N. Gray, et al.: “ A Critique of ANSI SQL Isolation Levels ,” at ACM International Conference on Management of Data (SIGMOD), May 1995.
[28] Hal Berenson, Philip A. Bernstein, Jim N. Gray等人: “对ANSI SQL隔离级别的批评”, 于ACM数据管理国际会议(SIGMOD)于1995年5月举行。
[ 29 ] Atul Adya: “ Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions ,” PhD Thesis, Massachusetts Institute of Technology, March 1999.
[29] Atul Adya:“弱一致性:分布式事务的广义理论和乐观实现”,麻省理工学院博士论文,1999年3月。
[ 30 ] Peter Bailis, Aaron Davidson, Alan Fekete, et al.: “ Highly Available Transactions: Virtues and Limitations (Extended Version) ,” at 40th International Conference on Very Large Data Bases (VLDB), September 2014.
[30] Peter Bailis, Aaron Davidson, Alan Fekete等: “高可用事务:优点和限制(扩展版本)”,于2014年9月在40届国际大数据管理系统大会(VLDB)上发布。
[ 31 ] Bruce Momjian: “ MVCC Unmasked ,” momjian.us , July 2014.
[31] Bruce Momjian: "MVCC Unmasked", momjian.us, 2014年7月。
[ 32 ] Annamalai Gurusami: “ Repeatable Read Isolation Level in InnoDB – How Consistent Read View Works ,” blogs.oracle.com , January 15, 2013.
“在InnoDB中可重复读隔离级别 - 一致性读取视图的工作原理”,Annamalai Gurusami,blogs.oracle.com,2013年1月15日。
[ 33 ] Nikita Prokopov: “ Unofficial Guide to Datomic Internals ,” tonsky.me , May 6, 2014.
[33] 尼基塔·普罗科波夫: "Datomic 内部非官方指南",tonsky.me,2014年5月6日。
[ 34 ] Baron Schwartz: “ Immutability, MVCC, and Garbage Collection ,” xaprb.com , December 28, 2013.
巴伦·施瓦茨:“不变性、MVCC和垃圾回收”,xaprb.com,2013年12月28日。
[ 35 ] J. Chris Anderson, Jan Lehnardt, and Noah Slater: CouchDB: The Definitive Guide . O’Reilly Media, 2010. ISBN: 978-0-596-15589-6
[35] J. Chris Anderson,Jan Lehnardt,和Noah Slater:CouchDB:权威指南。O'Reilly Media,2010年。 ISBN:978-0-596-15589-6
[ 36 ] Rikdeb Mukherjee: “ Isolation in DB2 (Repeatable Read, Read Stability, Cursor Stability, Uncommitted Read) with Examples ,” mframes.blogspot.co.uk , July 4, 2013.
[36] Rikdeb Mukherjee:“通过示例讲解DB2中的隔离(可重复读、读稳定性、游标稳定性、未提交读)”,mframes.blogspot.co.uk,2013年7月4日。
[ 37 ] Steve Hilker: “ Cursor Stability (CS) – IBM DB2 Community ,” toadworld.com , March 14, 2013.
[37] Steve Hilker:IBM DB2社区,“游标稳定性 (CS)”,toadworld.com,2013年3月14日。
[ 38 ] Nate Wiger: “ An Atomic Rant ,” nateware.com , February 18, 2010.
[38] 纳特·维格: “一个原子抱怨”,nateware.com, 2010年2月18日。
[ 39 ] Joel Jacobson: “ Riak 2.0: Data Types ,” blog.joeljacobson.com , March 23, 2014.
[39] Joel Jacobson: “Riak 2.0: 数据类型,” blog.joeljacobson.com, 2014年3月23日.
[ 40 ] Michael J. Cahill, Uwe Röhm, and Alan Fekete: “ Serializable Isolation for Snapshot Databases ,” at ACM International Conference on Management of Data (SIGMOD), June 2008. doi:10.1145/1376616.1376690
[40] Michael J. Cahill、Uwe Röhm和Alan Fekete:“快照数据库的可串行化隔离”,发表于ACM数据管理国际会议(SIGMOD),2008年6月。doi:10.1145/1376616.1376690
[ 41 ] Dan R. K. Ports and Kevin Grittner: “ Serializable Snapshot Isolation in PostgreSQL ,” at 38th International Conference on Very Large Databases (VLDB), August 2012.
[41] Dan R. K. Ports 和 Kevin Grittner: “在 PostgreSQL 中实现串行化快照隔离”, 发表于第 38 届国际大型数据库会议(VLDB),2012年8月。
[ 42 ] Tony Andrews: “ Enforcing Complex Constraints in Oracle ,” tonyandrews.blogspot.co.uk , October 15, 2004.
[42] Tony Andrews:“在Oracle中实施复杂的约束,” tonyandrews.blogspot.co.uk,2004年10月15日。
[ 43 ] Douglas B. Terry, Marvin M. Theimer, Karin Petersen, et al.: “ Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System ,” at 15th ACM Symposium on Operating Systems Principles (SOSP), December 1995. doi:10.1145/224056.224070
【43】Douglas B. Terry, Marvin M. Theimer, Karin Petersen等: “在Bayou中管理更新冲突:一种弱连接的复制存储系统”,发表于 1995年12月第15届ACM操作系统原则研讨会(SOSP)。doi:10.1145/224056.224070
[ 44 ] Gary Fredericks: “ Postgres Serializability Bug ,” github.com , September 2015.
[44] Gary Fredericks:“Postgres可串行化漏洞”,github.com,2015年9月。
[ 45 ] Michael Stonebraker, Samuel Madden, Daniel J. Abadi, et al.: “ The End of an Architectural Era (It’s Time for a Complete Rewrite) ,” at 33rd International Conference on Very Large Data Bases (VLDB), September 2007.
[45] Michael Stonebraker, Samuel Madden, Daniel J. Abadi等人: “一个架构时代的结束(是时候进行完全重写了)”,发表于2007年9月的第33届超大型数据管理会议(VLDB)。
[ 46 ] John Hugg: “ H-Store/VoltDB Architecture vs. CEP Systems and Newer Streaming Architectures ,” at Data @Scale Boston , November 2014.
[46] John Hugg:“H-Store/VoltDB架构与CEP系统及较新的流处理架构的对比”,发表于Data @Scale Boston,2014年11月。
[ 47 ] Robert Kallman, Hideaki Kimura, Jonathan Natkins, et al.: “ H-Store: A High-Performance, Distributed Main Memory Transaction Processing System ,” Proceedings of the VLDB Endowment , volume 1, number 2, pages 1496–1499, August 2008.
[47] Robert Kallman、Hideaki Kimura、Jonathan Natkins等人:“H-Store:一个高性能的分布式主内存事务处理系统”,Proceedings of the VLDB Endowment,第1卷,第2期,第1496-1499页,2008年8月。
[ 48 ] Rich Hickey: “ The Architecture of Datomic ,” infoq.com , November 2, 2012.
[48] Rich Hickey: “Datomic 的架构”,infoq.com,2012年11月2日。
[ 49 ] John Hugg: “ Debunking Myths About the VoltDB In-Memory Database ,” voltdb.com , May 12, 2014.
[49] John Hugg:“破除关于VoltDB内存数据库的迷思”,voltdb.com,2014年5月12日。
[ 50 ] Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton: “ Architecture of a Database System ,” Foundations and Trends in Databases , volume 1, number 2, pages 141–259, November 2007. doi:10.1561/1900000002
【50】Joseph M. Hellerstein、Michael Stonebraker和James Hamilton: “数据库系统的架构,” 《数据库基础与趋势》,卷1,号2,页141-259,2007年11月。 doi:10.1561/1900000002
[ 51 ] Michael J. Cahill: “ Serializable Isolation for Snapshot Databases ,” PhD Thesis, University of Sydney, July 2009.
[51] Michael J. Cahill:“快照数据库的可串行化隔离”,博士论文,悉尼大学,2009年7月。
[ 52 ] D. Z. Badal: “ Correctness of Concurrency Control and Implications in Distributed Databases ,” at 3rd International IEEE Computer Software and Applications Conference (COMPSAC), November 1979.
[52] D. Z. 巴达尔: “并发控制的正确性及其在分布式数据库中的意义”,出版于1979年11月第3届国际IEEE计算机软件和应用大会(COMPSAC)上。
[ 53 ] Rakesh Agrawal, Michael J. Carey, and Miron Livny: “ Concurrency Control Performance Modeling: Alternatives and Implications ,” ACM Transactions on Database Systems (TODS), volume 12, number 4, pages 609–654, December 1987. doi:10.1145/32204.32220
[53] Rakesh Agrawal,Michael J. Carey和Miron Livny:“并发控制性能建模:替代方案和影响”,ACM Transactions on Database Systems(TODS),第12卷,第4期,1987年12月,页609-654。 doi:10.1145/32204.32220
[ 54 ] Dave Rosenthal: “ Databases at 14.4MHz ,” blog.foundationdb.com , December 10, 2014.
[54] 戴夫·罗森塔尔:“14.4MHz下的数据库”,blog.foundationdb.com,2014年12月10日。
Chapter 8. The Trouble with Distributed Systems
Hey I just met you
The network’s laggy
But here’s my data
So store it maybe
嗨 我刚遇见你
网络有点慢
但这是我的数据
那就存一下吧,也许
Kyle Kingsbury, Carly Rae Jepsen and the Perils of Network Partitions (2013)
Kyle Kingsbury,Carly Rae Jepsen和网络分区的危险(2013)
A recurring theme in the last few chapters has been how systems handle things going wrong. For example, we discussed replica failover ( “Handling Node Outages” ), replication lag ( “Problems with Replication Lag” ), and concurrency control for transactions ( “Weak Isolation Levels” ). As we come to understand various edge cases that can occur in real systems, we get better at handling them.
最近几章中反复出现的一个主题是系统如何处理出错的情况。例如,我们讨论了副本故障切换(“处理节点宕机”)、复制延迟(“复制延迟问题”)以及事务的并发控制(“弱隔离级别”)。随着我们逐渐理解真实系统中可能出现的各种边缘情况,我们会越来越擅长处理它们。
However, even though we have talked a lot about faults, the last few chapters have still been too optimistic. The reality is even darker. We will now turn our pessimism to the maximum and assume that anything that can go wrong will go wrong. i (Experienced systems operators will tell you that is a reasonable assumption. If you ask nicely, they might tell you some frightening stories while nursing their scars of past battles.)
然而,即使我们谈论了很多故障,最近的几章仍然过于乐观。现实情况更加黑暗。我们现在将把悲观主义发挥到最大,假设任何可能出错的地方都会出错。(有经验的系统操作员会告诉你这是一个合理的假设。如果你很客气地问,他们可能会在抚摸着过去战斗中留下的伤疤时告诉你一些可怕的故事。)
Working with distributed systems is fundamentally different from writing software on a single computer—and the main difference is that there are lots of new and exciting ways for things to go wrong [ 1 , 2 ]. In this chapter, we will get a taste of the problems that arise in practice, and an understanding of the things we can and cannot rely on.
使用分布式系统与在单台计算机上编写软件有着根本的不同,主要区别在于,有许多新奇而刺激的出错方式 [1, 2]。在本章中,我们将体会实践中出现的各种问题,并理解我们能依赖什么、不能依赖什么。
In the end, our task as engineers is to build systems that do their job (i.e., meet the guarantees that users are expecting), in spite of everything going wrong. In Chapter 9 , we will look at some examples of algorithms that can provide such guarantees in a distributed system. But first, in this chapter, we must understand what challenges we are up against.
最终,作为工程师,我们的任务是构建能够完成其工作(即满足用户期望的保证)的系统,即使一切都出了差错。在第9章中,我们将看一些能在分布式系统中提供此类保证的算法示例。但首先,在本章中,我们必须了解我们面临的挑战是什么。
This chapter is a thoroughly pessimistic and depressing overview of things that may go wrong in a distributed system. We will look into problems with networks ( “Unreliable Networks” ); clocks and timing issues ( “Unreliable Clocks” ); and we’ll discuss to what degree they are avoidable. The consequences of all these issues are disorienting, so we’ll explore how to think about the state of a distributed system and how to reason about things that have happened ( “Knowledge, Truth, and Lies” ).
这一章是一个非常悲观和沮丧的概述,介绍分布式系统可能出现的问题。我们将观察网络问题(“不可靠的网络”),时钟和时间问题(“不可靠的时钟”),并探讨它们在多大程度上是可以避免的。所有这些问题的后果都是令人迷惑的,因此我们将探索如何思考分布式系统的状态以及如何推理发生的事情(“知识、真相和谎言”)。
Faults and Partial Failures
When you are writing a program on a single computer, it normally behaves in a fairly predictable way: either it works or it doesn’t. Buggy software may give the appearance that the computer is sometimes “having a bad day” (a problem that is often fixed by a reboot), but that is mostly just a consequence of badly written software.
当您在单台计算机上编写程序时,它通常表现出相当可预测的方式:要么它有效,要么它无效。有错误的软件可能会让计算机看起来有时“度过糟糕的一天”(这个问题通常可以通过重新启动解决),但这主要是由于编写不良的软件造成的。
There is no fundamental reason why software on a single computer should be flaky: when the hardware is working correctly, the same operation always produces the same result (it is deterministic ). If there is a hardware problem (e.g., memory corruption or a loose connector), the consequence is usually a total system failure (e.g., kernel panic, “blue screen of death,” failure to start up). An individual computer with good software is usually either fully functional or entirely broken, but not something in between.
单台计算机上的软件没有根本性的理由会时好时坏:当硬件正常工作时,相同的操作总是产生相同的结果(它是确定性的)。如果存在硬件问题(例如内存损坏或接插件松动),后果通常是整个系统故障(例如内核恐慌、“蓝屏死机”、无法启动)。装有良好软件的单台计算机通常要么完全正常,要么完全损坏,而不会处于两者之间的状态。
This is a deliberate choice in the design of computers: if an internal fault occurs, we prefer a computer to crash completely rather than returning a wrong result, because wrong results are difficult and confusing to deal with. Thus, computers hide the fuzzy physical reality on which they are implemented and present an idealized system model that operates with mathematical perfection. A CPU instruction always does the same thing; if you write some data to memory or disk, that data remains intact and doesn’t get randomly corrupted. This design goal of always-correct computation goes all the way back to the very first digital computer [ 3 ].
这是计算机设计上的有意选择:如果出现内部故障,我们宁愿让计算机完全崩溃,也不愿意返回错误结果,因为处理错误结果都很困难和混乱。因此,计算机隐藏了它们实现的模糊物理现实,并呈现出理想化的系统模型,以数学完美的方式运作。CPU 指令总是执行同样的操作;如果你将某些数据写入内存或磁盘,那么数据会保持完好,不会随机损坏。这种始终正确计算的设计目标一直延续到最早的数字计算机 [3]。
When you are writing software that runs on several computers, connected by a network, the situation is fundamentally different. In distributed systems, we are no longer operating in an idealized system model—we have no choice but to confront the messy reality of the physical world. And in the physical world, a remarkably wide range of things can go wrong, as illustrated by this anecdote [ 4 ]:
当你编写运行在多台通过网络连接的计算机上的软件时,情况就根本不同了。在分布式系统中,我们不再处于理想化的系统模型中,我们别无选择,只能直面物理世界的混乱现实。而在物理世界中,可能出错的事情多得惊人,正如这则轶事所示 [4]:
In my limited experience I’ve dealt with long-lived network partitions in a single data center (DC), PDU [power distribution unit] failures, switch failures, accidental power cycles of whole racks, whole-DC backbone failures, whole-DC power failures, and a hypoglycemic driver smashing his Ford pickup truck into a DC’s HVAC [heating, ventilation, and air conditioning] system. And I’m not even an ops guy.
在我的有限经验中,我处理过单个数据中心(DC)内长时间存在的网络分区、PDU [电源分配单元] 故障、交换机故障、整个机架的意外断电、整个DC骨干网故障、整个DC电力故障,以及一位低血糖的驾驶员将他的福特皮卡撞进 DC 的 HVAC [供暖、通风和空调] 系统。我甚至不是一名运维人员。
Coda Hale
Coda Hale - 科达·哈尔
In a distributed system, there may well be some parts of the system that are broken in some unpredictable way, even though other parts of the system are working fine. This is known as a partial failure . The difficulty is that partial failures are nondeterministic : if you try to do anything involving multiple nodes and the network, it may sometimes work and sometimes unpredictably fail. As we shall see, you may not even know whether something succeeded or not, as the time it takes for a message to travel across a network is also nondeterministic!
在分布式系统中,即使系统的其他部分运行良好,仍然可能存在某些组件以不可预测的方式损坏。这被称为部分故障。困难在于,部分故障是不确定的:如果您尝试执行涉及多个节点和网络的任何操作,有时可能会成功,有时则会出现不可预测的故障。正如我们将看到的,您甚至可能不知道某些操作是否成功,因为消息在网络上传输的时间也是不确定的!
This nondeterminism and possibility of partial failures is what makes distributed systems hard to work with [ 5 ].
这种不确定性和部分故障的可能性使得分布式系统难以处理 [5]。
Cloud Computing and Supercomputing
There is a spectrum of philosophies on how to build large-scale computing systems:
如何构建大规模计算系统有各种不同的哲学观点:
-
At one end of the scale is the field of high-performance computing (HPC). Supercomputers with thousands of CPUs are typically used for computationally intensive scientific computing tasks, such as weather forecasting or molecular dynamics (simulating the movement of atoms and molecules).
处于这一谱系一端的是高性能计算(HPC)领域。拥有数千个 CPU 的超级计算机通常用于计算密集型的科学计算任务,例如天气预报或分子动力学(模拟原子和分子的运动)。
-
At the other extreme is cloud computing , which is not very well defined [ 6 ] but is often associated with multi-tenant datacenters, commodity computers connected with an IP network (often Ethernet), elastic/on-demand resource allocation, and metered billing.
处于另一个极端的是云计算,它没有非常明确的定义 [6],但通常与多租户数据中心、通过 IP 网络(通常是以太网)连接的商用计算机、弹性/按需的资源分配以及按用量计费相关联。
-
Traditional enterprise datacenters lie somewhere between these extremes.
传统企业数据中心处于这两个极端之间。
With these philosophies come very different approaches to handling faults. In a supercomputer, a job typically checkpoints the state of its computation to durable storage from time to time. If one node fails, a common solution is to simply stop the entire cluster workload. After the faulty node is repaired, the computation is restarted from the last checkpoint [ 7 , 8 ]. Thus, a supercomputer is more like a single-node computer than a distributed system: it deals with partial failure by letting it escalate into total failure—if any part of the system fails, just let everything crash (like a kernel panic on a single machine).
这些哲学观念带来了处理故障的非常不同的方法。在超级计算机中,作业通常会定期将计算状态检查点存储到持久性存储器中。如果一个节点出现故障,一种常见的解决方案是简单地停止整个群集的工作负载。在修复有故障的节点后,计算将从上一个检查点重新启动[7,8]。因此,超级计算机更像单节点计算机,而不是分布式系统:它通过让局部故障升级为总体故障来处理部分故障——如果系统的任何部分出现故障,就让一切崩溃(就像单个机器上的内核恐慌)。
In this book we focus on systems for implementing internet services, which usually look very different from supercomputers:
在本书中,我们专注于实现互联网服务的系统,这些系统通常与超级计算机大不相同:
-
Many internet-related applications are online , in the sense that they need to be able to serve users with low latency at any time. Making the service unavailable—for example, stopping the cluster for repair—is not acceptable. In contrast, offline (batch) jobs like weather simulations can be stopped and restarted with fairly low impact.
许多与互联网相关的应用程序是在线的,这意味着它们需要能够随时以低延迟为用户提供服务。使服务不可用,例如停止集群进行维修,是不可接受的。相反,像天气模拟这样的离线(批处理)作业可以停止和重新启动,影响相对较小。
-
Supercomputers are typically built from specialized hardware, where each node is quite reliable, and nodes communicate through shared memory and remote direct memory access (RDMA). On the other hand, nodes in cloud services are built from commodity machines, which can provide equivalent performance at lower cost due to economies of scale, but also have higher failure rates.
超级计算机通常采用专用硬件构建,每个节点都相当可靠,节点之间通过共享内存和远程直接内存访问(RDMA)进行通信。另一方面,云服务中的节点由商用机器构建而成,得益于规模经济,它们能以更低的成本提供相当的性能,但故障率也更高。
-
Large datacenter networks are often based on IP and Ethernet, arranged in Clos topologies to provide high bisection bandwidth [ 9 ]. Supercomputers often use specialized network topologies, such as multi-dimensional meshes and toruses [ 10 ], which yield better performance for HPC workloads with known communication patterns.
大型数据中心网络通常基于 IP 和以太网,按 Clos 拓扑排列,以提供高对分带宽(bisection bandwidth)[9]。超级计算机通常使用专用的网络拓扑,例如多维网格和环面(torus)[10],对于通信模式已知的 HPC 工作负载,这类拓扑能提供更好的性能。
-
The bigger a system gets, the more likely it is that one of its components is broken. Over time, broken things get fixed and new things break, but in a system with thousands of nodes, it is reasonable to assume that something is always broken [ 7 ]. When the error handling strategy consists of simply giving up, a large system can end up spending a lot of its time recovering from faults rather than doing useful work [ 8 ].
系统越大,其中某个组件发生故障的可能性就越大。随着时间的推移,坏的东西会被修复,新的东西又会出问题,但在一个拥有成千上万个节点的系统中,可以合理地假设总有东西是坏的 [7]。如果错误处理策略只是简单放弃,大型系统最终可能会把大量时间花在从故障中恢复上,而不是执行有用的工作 [8]。
-
If the system can tolerate failed nodes and still keep working as a whole, that is a very useful feature for operations and maintenance: for example, you can perform a rolling upgrade (see Chapter 4 ), restarting one node at a time, while the service continues serving users without interruption. In cloud environments, if one virtual machine is not performing well, you can just kill it and request a new one (hoping that the new one will be faster).
如果系统能够容忍故障节点,且整体仍能正常运行,那么这将是运营和维护非常有用的特性。例如,你可以执行滚动升级(见第四章),一次重启一个节点,而服务仍能继续为用户提供服务而不会中断。在云环境中,如果一个虚拟机的表现不佳,你可以将其 “kill”,并请求一个新的虚拟机(希望新的虚拟机速度更快些)。
-
In a geographically distributed deployment (keeping data geographically close to your users to reduce access latency), communication most likely goes over the internet, which is slow and unreliable compared to local networks. Supercomputers generally assume that all of their nodes are close together.
在地理分布的部署中(使数据在地理上接近用户以减少访问延迟),通信最可能通过互联网进行,与本地网络相比,互联网速度慢且不可靠。超级计算机通常假定它们的所有节点都彼此靠近。
If we want to make distributed systems work, we must accept the possibility of partial failure and build fault-tolerance mechanisms into the software. In other words, we need to build a reliable system from unreliable components. (As discussed in “Reliability” , there is no such thing as perfect reliability, so we’ll need to understand the limits of what we can realistically promise.)
如果我们想要使分布式系统正常运作,我们必须接受部分失败的可能性,并将容错机制纳入软件中。换句话说,我们需要从不可靠的组件中构建可靠的系统。(正如在“可靠性”中讨论的那样,不存在完美的可靠性,因此我们需要了解我们可以现实承诺的限制。)
Even in smaller systems consisting of only a few nodes, it’s important to think about partial failure. In a small system, it’s quite likely that most of the components are working correctly most of the time. However, sooner or later, some part of the system will become faulty, and the software will have to somehow handle it. The fault handling must be part of the software design, and you (as operator of the software) need to know what behavior to expect from the software in the case of a fault.
即使是由只有几个节点组成的较小系统,考虑到部分故障也是很重要的。在小型系统中,大多数组件大部分时间运行正常是相当可能的。然而,迟早会有一些组件出现故障,软件必须以某种方式进行处理。故障处理必须是软件设计的一部分,您(作为软件的操作者)需要知道在发生故障的情况下可以从软件期望什么样的行为。
It would be unwise to assume that faults are rare and simply hope for the best. It is important to consider a wide range of possible faults—even fairly unlikely ones—and to artificially create such situations in your testing environment to see what happens. In distributed systems, suspicion, pessimism, and paranoia pay off.
认为故障很少并且只是希望一切顺利是不明智的。重要的是要考虑到各种可能的故障,即使是相对不太可能的故障,也要在测试环境中人为创建这种情况来观察发生了什么。在分布式系统中,怀疑,悲观和偏执是有回报的。
Unreliable Networks
As discussed in the introduction to Part II , the distributed systems we focus on in this book are shared-nothing systems : i.e., a bunch of machines connected by a network. The network is the only way those machines can communicate—we assume that each machine has its own memory and disk, and one machine cannot access another machine’s memory or disk (except by making requests to a service over the network).
正如第二部分的介绍所讨论的,本书关注的分布式系统是无共享系统(shared-nothing systems):即由网络连接起来的一组机器。网络是这些机器通信的唯一途径——我们假设每台机器都有自己的内存和磁盘,一台机器不能访问另一台机器的内存或磁盘(除非通过网络向某个服务发出请求)。
Shared-nothing is not the only way of building systems, but it has become the dominant approach for building internet services, for several reasons: it’s comparatively cheap because it requires no special hardware, it can make use of commoditized cloud computing services, and it can achieve high reliability through redundancy across multiple geographically distributed datacenters.
无共享并不是构建系统的唯一方式,但它已成为构建互联网服务的主流方法,原因有几个:它相对便宜,因为不需要特殊硬件;它可以利用商品化的云计算服务;并且可以通过跨多个地理分布数据中心的冗余实现高可靠性。
The internet and most internal networks in datacenters (often Ethernet) are asynchronous packet networks . In this kind of network, one node can send a message (a packet) to another node, but the network gives no guarantees as to when it will arrive, or whether it will arrive at all. If you send a request and expect a response, many things could go wrong (some of which are illustrated in Figure 8-1 ):
互联网和大多数数据中心内部网络(通常使用以太网)是异步分组网络。在这种网络中,一个节点可以向另一个节点发送消息(分组),但是网络不能保证它何时到达或是否到达。如果您发送请求并期望得到响应,很多事情可能会出错(其中一些在图8-1中说明):
-
Your request may have been lost (perhaps someone unplugged a network cable).
您的请求可能已经丢失(也许有人拔了一根网络电缆)。
-
Your request may be waiting in a queue and will be delivered later (perhaps the network or the recipient is overloaded).
您的请求可能正在等待队列中,并且稍后会被交付(也许是因为网络或接收方过载)。
-
The remote node may have failed (perhaps it crashed or it was powered down).
远程节点可能已经失效(可能是崩溃或关机)。
-
The remote node may have temporarily stopped responding (perhaps it is experiencing a long garbage collection pause; see “Process Pauses” ), but it will start responding again later.
远程节点可能暂时停止响应(可能正在经历长时间的垃圾回收暂停;请参阅“进程暂停”),但稍后它将重新开始响应。
-
The remote node may have processed your request, but the response has been lost on the network (perhaps a network switch has been misconfigured).
远程节点可能已经处理了您的请求,但响应却在网络中丢失了(可能是由于网络交换机的错误配置)。
-
The remote node may have processed your request, but the response has been delayed and will be delivered later (perhaps the network or your own machine is overloaded).
远程节点可能已经处理了您的请求,但是响应被延迟了,稍后会送达(可能是由于网络负载或您自己的计算机过载)。
The sender can’t even tell whether the packet was delivered: the only option is for the recipient to send a response message, which may in turn be lost or delayed. These issues are indistinguishable in an asynchronous network: the only information you have is that you haven’t received a response yet. If you send a request to another node and don’t receive a response, it is impossible to tell why.
发送者甚至无法知道数据包是否已经被传递:唯一的选择是由收件人发送响应信息,而这可能会被丢失或延迟。在异步网络中,这些问题是无法区分的:你唯一能得到的信息是你还没有收到响应。如果你向另一个节点发送请求但没有收到响应,那么你无法确定原因。
The usual way of handling this issue is a timeout : after some time you give up waiting and assume that the response is not going to arrive. However, when a timeout occurs, you still don’t know whether the remote node got your request or not (and if the request is still queued somewhere, it may still be delivered to the recipient, even if the sender has given up on it).
通常处理此问题的方法是超时:在一段时间后,您放弃等待并假设响应不会到达。然而,当超时发生时,您仍然不知道远程节点是否收到了您的请求(如果请求仍在某个位置排队,则可能仍会传递给收件人,即使发送者已经放弃它)。
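The ambiguity described above can be made concrete with a tiny sketch (purely illustrative; the function and fault names are invented, and no real networking is involved): four quite different faults all look identical from the sender's point of view.
上述的不确定性可以用一个小草图来直观说明(纯属示意;其中的函数名和故障名均为虚构,也不涉及真实网络):四种截然不同的故障在发送方看来完全一样。

```python
def unreliable_call(fault):
    """Simulate one request across a network; `fault` decides what goes wrong."""
    if fault == "request_lost":      # request never arrived; node did NOT process it
        return None
    if fault == "node_crashed":      # node died; it did NOT process the request
        return None
    if fault == "response_lost":     # node DID process the request, response vanished
        return None
    if fault == "response_delayed":  # node DID process it, response is still in flight
        return None
    return "ok"

faults = ["request_lost", "node_crashed", "response_lost", "response_delayed"]
outcomes = {f: unreliable_call(f) for f in faults}

# All four faults are indistinguishable to the sender: no response arrives,
# even though in two of the cases the remote node actually did the work.
assert all(v is None for v in outcomes.values())
```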
Network Faults in Practice
We have been building computer networks for decades—one might hope that by now we would have figured out how to make them reliable. However, it seems that we have not yet succeeded.
数十年来,我们一直在建立计算机网络——人们或许希望我们早已学会如何使它们可靠。然而,现在看来我们还没有成功。
There are some systematic studies, and plenty of anecdotal evidence, showing that network problems can be surprisingly common, even in controlled environments like a datacenter operated by one company [ 14 ]. One study in a medium-sized datacenter found about 12 network faults per month, of which half disconnected a single machine, and half disconnected an entire rack [ 15 ]. Another study measured the failure rates of components like top-of-rack switches, aggregation switches, and load balancers [ 16 ]. It found that adding redundant networking gear doesn’t reduce faults as much as you might hope, since it doesn’t guard against human error (e.g., misconfigured switches), which is a major cause of outages.
一些系统性的研究和大量的轶事证据表明,即使在由一家公司运营的数据中心这样的受控环境中,网络问题也可能出奇地常见 [14]。一项针对中型数据中心的研究发现,每月约有 12 次网络故障,其中一半是单台机器断开连接,另一半是整个机架断开连接 [15]。另一项研究测量了架顶交换机、汇聚交换机和负载均衡器等组件的故障率 [16]。研究发现,增加冗余网络设备并不能像您希望的那样减少故障,因为它无法防范人为错误(例如配置错误的交换机),而人为错误是造成中断的主要原因。
Public cloud services such as EC2 are notorious for having frequent transient network glitches [ 14 ], and well-managed private datacenter networks can be stabler environments. Nevertheless, nobody is immune from network problems: for example, a problem during a software upgrade for a switch could trigger a network topology reconfiguration, during which network packets could be delayed for more than a minute [ 17 ]. Sharks might bite undersea cables and damage them [ 18 ]. Other surprising faults include a network interface that sometimes drops all inbound packets but sends outbound packets successfully [ 19 ]: just because a network link works in one direction doesn’t guarantee it’s also working in the opposite direction.
公共云服务,比如EC2,以频繁的短暂网络故障著称[14],而良好管理的私有数据中心网络可能会更加稳定。然而,无论如何,任何人都不能免于网络问题:例如,交换机软件升级期间出现问题可能会触发网络拓扑重构,在此期间,网络数据包可能会延迟超过一分钟[17]。鲨鱼可能会咬断海底电缆并损坏它们[18]。其他令人惊讶的故障包括某些时候会丢失所有入站数据包但能成功发送出站数据包的网络接口[19]:仅因为一个网络连接在一个方向工作并不能保证它在相反方向也能正常工作。
Network partitions
When one part of the network is cut off from the rest due to a network fault, that is sometimes called a network partition or netsplit . In this book we’ll generally stick with the more general term network fault , to avoid confusion with partitions (shards) of a storage system, as discussed in Chapter 6 .
当网络的一部分由于网络故障而与其他部分断开连接时,有时称为网络分区或netsplit。在本书中,我们通常会坚持使用更一般的术语网络故障,以避免与存储系统的分区(分片)混淆,如第6章讨论的那样。
Even if network faults are rare in your environment, the fact that faults can occur means that your software needs to be able to handle them. Whenever any communication happens over a network, it may fail—there is no way around it.
即使在您的环境中网络故障很少见,但故障可能发生的事实意味着您的软件需要能够处理它们。无论何时在网络上进行任何通信,都可能会失败-这是无法避免的。
If the error handling of network faults is not defined and tested, arbitrarily bad things could happen: for example, the cluster could become deadlocked and permanently unable to serve requests, even when the network recovers [ 20 ], or it could even delete all of your data [ 21 ]. If software is put in an unanticipated situation, it may do arbitrary unexpected things.
如果网络故障的错误处理未定义和测试,可能会发生任意糟糕的事情:例如,集群可能会陷入死锁状态,永久无法提供请求服务,即使网络恢复[20],或者甚至可能删除所有数据[21]。如果软件遇到未预料的情况,它可能会做出任意意想不到的事情。
Handling network faults doesn’t necessarily mean tolerating them: if your network is normally fairly reliable, a valid approach may be to simply show an error message to users while your network is experiencing problems. However, you do need to know how your software reacts to network problems and ensure that the system can recover from them. It may make sense to deliberately trigger network problems and test the system’s response (this is the idea behind Chaos Monkey; see “Reliability” ).
处理网络故障并不一定意味着容忍它们:如果您的网络通常相当可靠,一种有效的方法可能是在网络遇到问题时向用户显示错误消息。然而,您需要了解您的软件如何应对网络问题,并确保系统可以从中恢复。 有意诱发网络问题并测试系统的响应可能是有意义的(这是混沌猴子背后的想法;请参见“可靠性”)。
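One simple way to apply this idea in tests is to wrap network calls and inject failures at random, so the error-handling paths get exercised instead of only the happy path. The sketch below is hypothetical (the wrapper is invented, not a real library):
在测试中实践这一想法的一种简单方式,是包装网络调用并随机注入故障,让错误处理路径也得到演练,而不是只走正常路径。下面的草图是假设性的(包装函数是虚构的,并非真实库):

```python
import random

def with_fault_injection(call, failure_rate, rng):
    """Wrap `call` so that a fraction of invocations raise an injected fault."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise TimeoutError("injected network fault")
        return call(*args, **kwargs)
    return wrapped

rng = random.Random(42)
flaky_fetch = with_fault_injection(lambda: "response", failure_rate=0.3, rng=rng)

successes = failures = 0
for _ in range(1000):
    try:
        flaky_fetch()
        successes += 1
    except TimeoutError:
        failures += 1

# Both the success path and the error-handling path were exercised.
assert successes > 0 and failures > 0
```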
Detecting Faults
Many systems need to automatically detect faulty nodes. For example:
许多系统需要自动检测故障节点。例如:
-
A load balancer needs to stop sending requests to a node that is dead (i.e., take it out of rotation ).
负载均衡器需要停止向已经挂掉的节点发送请求(即将其从轮循列表中移除)。
-
In a distributed database with single-leader replication, if the leader fails, one of the followers needs to be promoted to be the new leader (see “Handling Node Outages” ).
在具有单领导复制的分布式数据库中,如果领导者失败,则需要将其中一个跟随者提升为新领导者(请参见“处理节点故障”)。
Unfortunately, the uncertainty about the network makes it difficult to tell whether a node is working or not. In some specific circumstances you might get some feedback to explicitly tell you that something is not working:
不幸的是,由于网络的不确定性,很难确定节点是否正常工作。在某些特定情况下,您可能会得到一些反馈,明确告诉您某些东西没有正常工作:
-
If you can reach the machine on which the node should be running, but no process is listening on the destination port (e.g., because the process crashed), the operating system will helpfully close or refuse TCP connections by sending a RST or FIN packet in reply. However, if the node crashed while it was handling your request, you have no way of knowing how much data was actually processed by the remote node [ 22 ].
如果您可以连接到节点应该运行的机器,但目的端口上没有进程正在监听(例如,因为进程崩溃了),操作系统会发送 RST 或 FIN 数据包来关闭或拒绝 TCP 连接。然而,如果节点在处理您的请求时崩溃了,您就无法知道远程节点实际处理了多少数据 [22]。
-
If a node process crashed (or was killed by an administrator) but the node’s operating system is still running, a script can notify other nodes about the crash so that another node can take over quickly without having to wait for a timeout to expire. For example, HBase does this [ 23 ].
如果节点进程崩溃(或被管理员终止),但节点的操作系统仍在运行,脚本可以通知其他节点关于崩溃的情况,以便另一个节点可以快速接管而不必等待超时过期。例如,HBase 就是这样做的 [23]。
-
If you have access to the management interface of the network switches in your datacenter, you can query them to detect link failures at a hardware level (e.g., if the remote machine is powered down). This option is ruled out if you’re connecting via the internet, or if you’re in a shared datacenter with no access to the switches themselves, or if you can’t reach the management interface due to a network problem.
如果您可以访问数据中心网络交换机的管理界面,您可以查询它们以检测硬件级别的链路故障(例如,如果远程计算机关闭电源)。如果您通过互联网连接,或者在一个共享数据中心中,无法访问交换机自身,或者由于网络问题无法到达管理界面,则此选项被排除。
-
If a router is sure that the IP address you’re trying to connect to is unreachable, it may reply to you with an ICMP Destination Unreachable packet. However, the router doesn’t have a magic failure detection capability either—it is subject to the same limitations as other participants of the network.
如果路由器确定你要连接的IP地址无法到达,它可能会向你发送一个ICMP目的地不可达的数据包。但是,路由器也没有奇迹般的故障检测能力,它受到网络其他参与者同样的限制。
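The first kind of feedback above (a refused connection, because no process is listening) can be observed directly. The sketch below uses only Python's standard-library sockets; `probe` is a made-up helper, and the timeout case deliberately stays ambiguous, for the reasons the text explains:
上面第一种反馈(因为没有进程在监听而被拒绝的连接)可以直接观察到。下面的草图只使用 Python 标准库的套接字;`probe` 是一个虚构的辅助函数,而超时的情形被有意保留为模糊不清,原因正如正文所解释的:

```python
import socket

def probe(host, port, timeout=1.0):
    """Try to open a TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "listening"      # some process accepted the connection
    except ConnectionRefusedError:
        return "refused"            # host is up, but no listener (RST in reply)
    except OSError:
        return "no-answer"          # host down, firewall, or network fault: can't tell

# Find a local port with no listener by binding a socket and releasing it.
s = socket.socket()
s.bind(("127.0.0.1", 0))
free_port = s.getsockname()[1]
s.close()

# Nothing listens on that port any more, so the OS replies with RST.
assert probe("127.0.0.1", free_port) == "refused"
```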
Rapid feedback about a remote node being down is useful, but you can’t count on it. Even if TCP acknowledges that a packet was delivered, the application may have crashed before handling it. If you want to be sure that a request was successful, you need a positive response from the application itself [ 24 ].
远程节点故障的快速反馈很有用,但不能依赖它。即使TCP确认消息已传递,应用程序在处理消息之前可能已经崩溃。如果您想确保请求成功,您需要从应用程序本身获得积极的响应[24]。
Conversely, if something has gone wrong, you may get an error response at some level of the stack, but in general you have to assume that you will get no response at all. You can retry a few times (TCP retries transparently, but you may also retry at the application level), wait for a timeout to elapse, and eventually declare the node dead if you don’t hear back within the timeout.
相反地,如果出现了问题,可能会在堆栈的某个层次上收到错误响应,但通常你必须假设你将根本不会收到任何响应。你可以尝试几次(TCP 会透明地重试,但您也可以在应用层重试),等待超时时间到期,如果在超时时间内没有收到回复,最终将节点设置为离线。
Timeouts and Unbounded Delays
If a timeout is the only sure way of detecting a fault, then how long should the timeout be? There is unfortunately no simple answer.
如果一个超时是唯一能够确定故障的方法,那么超时时间应该是多少呢?不幸的是,没有简单的答案。
A long timeout means a long wait until a node is declared dead (and during this time, users may have to wait or see error messages). A short timeout detects faults faster, but carries a higher risk of incorrectly declaring a node dead when in fact it has only suffered a temporary slowdown (e.g., due to a load spike on the node or the network).
长时间超时意味着节点被宣告为死亡需要等待很长时间(在此期间,用户可能需要等待或看到错误消息)。短时间超时可以更快地检测故障,但可能会错误地宣告节点死亡,而实际上它只是暂时减速(例如,由于节点或网络的负载峰值)的风险更高。
Prematurely declaring a node dead is problematic: if the node is actually alive and in the middle of performing some action (for example, sending an email), and another node takes over, the action may end up being performed twice. We will discuss this issue in more detail in “Knowledge, Truth, and Lies” , and in Chapters 9 and 11 .
过早宣布节点死亡是有问题的:如果节点实际上仍然存活并正在执行某些操作(例如发送电子邮件),而另一个节点接管了,那么该操作可能会被执行两次。我们将在“知识,真相和谎言”以及第9章和第11章中更详细地讨论这个问题。
When a node is declared dead, its responsibilities need to be transferred to other nodes, which places additional load on other nodes and the network. If the system is already struggling with high load, declaring nodes dead prematurely can make the problem worse. In particular, it could happen that the node actually wasn’t dead but only slow to respond due to overload; transferring its load to other nodes can cause a cascading failure (in the extreme case, all nodes declare each other dead, and everything stops working).
当一个节点被宣布死亡时,它的责任需要转移到其他节点,这会给其他节点和网络增加额外的负载。如果系统已经处于高负载状态,过早地宣布节点死亡可能会使问题恶化。特别的,可能会发生节点实际上并没有死亡,只是因为负载过重而反应较慢的情况;将它的负载转移到其他节点可能会导致级联故障(在极端情况下,所有节点互相宣布死亡,一切都停止工作)。
Imagine a fictitious system with a network that guaranteed a maximum delay for packets—every packet is either delivered within some time d , or it is lost, but delivery never takes longer than d . Furthermore, assume that you can guarantee that a non-failed node always handles a request within some time r . In this case, you could guarantee that every successful request receives a response within time 2 d + r —and if you don’t receive a response within that time, you know that either the network or the remote node is not working. If this was true, 2 d + r would be a reasonable timeout to use.
想象一个虚构的系统,其中网络保证了分组的最大延迟——每个分组都在某个时间d内被传送,否则就会丢失,但传送时间永远不会超过d。 此外,请假设您可以保证非故障节点始终在某个时间r内处理请求。在这种情况下,您可以保证每个成功的请求都会在2d + r时间内接收到响应——如果您没有在该时间内收到响应,那么您就知道网络或远程节点不起作用了。如果这是真的,2d + r将是一个合理的超时时间。
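The arithmetic behind this hypothetical guarantee is simple; as a sketch (the numeric values are invented for illustration):
这个假想保证背后的算术很简单;作为草图(数值纯属举例):

```python
# Hypothetical bounded-delay network (values invented for illustration):
d_ms = 10   # maximum one-way network delay d, in milliseconds
r_ms = 50   # maximum request-handling time r on a non-failed node

# Request travels to the node (<= d), is processed (<= r),
# and the response travels back (<= d).
timeout_ms = 2 * d_ms + r_ms

# Under these (unrealistic) guarantees, 70 ms would be a reasonable timeout:
# no response within 70 ms proves the network or the node has failed.
assert timeout_ms == 70
```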
Unfortunately, most systems we work with have neither of those guarantees: asynchronous networks have unbounded delays (that is, they try to deliver packets as quickly as possible, but there is no upper limit on the time it may take for a packet to arrive), and most server implementations cannot guarantee that they can handle requests within some maximum time (see “Response time guarantees” ). For failure detection, it’s not sufficient for the system to be fast most of the time: if your timeout is low, it only takes a transient spike in round-trip times to throw the system off-balance.
很不幸,我们所涉及的大多数系统都没有这些保证:异步网络有无限延迟(也就是说,它们会尽可能快地传输数据包,但数据包到达所需的时间没有上限),并且大多数服务器实现不能保证在某个最大时间内处理请求(请参阅“响应时间保证”)。对于故障检测来说,系统大部分时间表现良好是不够的:如果您的超时时间很短,只需一次往返延迟的瞬时波动就足以使系统失衡。
Network congestion and queueing
When driving a car, travel times on road networks often vary most due to traffic congestion. Similarly, the variability of packet delays on computer networks is most often due to queueing [ 25 ]:
在驾驶汽车时,道路网络上的行车时间往往因交通拥堵而变化最大。同样,计算机网络中数据包延迟的可变性最常见的原因是排队 [25]:
-
If several different nodes simultaneously try to send packets to the same destination, the network switch must queue them up and feed them into the destination network link one by one (as illustrated in Figure 8-2 ). On a busy network link, a packet may have to wait a while until it can get a slot (this is called network congestion ). If there is so much incoming data that the switch queue fills up, the packet is dropped, so it needs to be resent—even though the network is functioning fine.
如果有多个不同的节点同时尝试发送数据包到相同的目的地,网络交换机必须将它们排队并逐一输入目的地网络链接(如图8-2所示)。在繁忙的网络链接上,数据包可能需要等待一段时间才能获得槽位(这被称为网络拥塞)。如果有太多的传入数据,交换机队列就会填满,数据包就会被丢弃,因此需要重新发送,即使网络正常运行。
-
When a packet reaches the destination machine, if all CPU cores are currently busy, the incoming request from the network is queued by the operating system until the application is ready to handle it. Depending on the load on the machine, this may take an arbitrary length of time.
当数据包到达目标机器时,如果所有CPU核心都在忙,那么操作系统会将来自网络的请求排队,直到应用程序准备好处理它。根据机器的负载情况,这可能需要任意长度的时间。
-
In virtualized environments, a running operating system is often paused for tens of milliseconds while another virtual machine uses a CPU core. During this time, the VM cannot consume any data from the network, so the incoming data is queued (buffered) by the virtual machine monitor [ 26 ], further increasing the variability of network delays.
在虚拟化环境中,运行中的操作系统常常会因为另一个虚拟机在使用CPU核心而暂停数十毫秒。在此期间,虚拟机无法从网络中消耗任何数据,因此进入的数据将被虚拟机监视器排队缓冲,进一步增加了网络延迟的不确定性。
-
TCP performs flow control (also known as congestion avoidance or backpressure ), in which a node limits its own rate of sending in order to avoid overloading a network link or the receiving node [ 27 ]. This means additional queueing at the sender before the data even enters the network.
TCP执行流量控制(也称拥塞避免或反压),其中节点限制自己的发送速率,以避免超载网络链接或接收节点[27]。这意味着在数据甚至进入网络之前,在发送方进行额外的排队。
Moreover, TCP considers a packet to be lost if it is not acknowledged within some timeout (which is calculated from observed round-trip times), and lost packets are automatically retransmitted. Although the application does not see the packet loss and retransmission, it does see the resulting delay (waiting for the timeout to expire, and then waiting for the retransmitted packet to be acknowledged).
此外,TCP 认为如果一份数据包在规定的超时时间内(这个超时时间是从已观察到的往返时间计算出来的)没有得到确认,那么这份数据包就被视为丢失了。并且,已经丢失的数据包会自动地被重新发送。虽然应用程序并没有意识到数据包的丢失和重新发送,但是它会感知到由此带来的延迟(因为需要等待超时时间的到来,然后再等待重新发送的数据包被确认)。
All of these factors contribute to the variability of network delays. Queueing delays have an especially wide range when a system is close to its maximum capacity: a system with plenty of spare capacity can easily drain queues, whereas in a highly utilized system, long queues can build up very quickly.
所有这些因素都会导致网络延迟的变化。排队延迟在系统接近最大容量时具有特别广泛的范围:有足够多备用容量的系统可以轻松地排空队列,而在高度利用的系统中,队列很快就会积累起来。
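The effect of utilization on queueing delay can be illustrated with the textbook M/M/1 queue, whose mean time in the system is 1/(μ − λ). This is a simplified model for illustration, not a claim about any particular network:
利用率对排队延迟的影响可以用教科书里的 M/M/1 队列来说明,其平均逗留时间为 1/(μ − λ)。这只是一个用于说明的简化模型,并非对任何具体网络的断言:

```python
def mean_wait(service_rate, arrival_rate):
    """Mean time in an M/M/1 queue: 1 / (mu - lambda). Requires lambda < mu."""
    assert arrival_rate < service_rate
    return 1.0 / (service_rate - arrival_rate)

mu = 1000.0                     # packets/s the link can serve
w_half = mean_wait(mu, 500.0)   # link 50% utilized
w_busy = mean_wait(mu, 990.0)   # link 99% utilized

# Near saturation, delay explodes: here 0.1 s vs 0.002 s, a 50x difference.
assert w_busy > 40 * w_half
```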
In public clouds and multi-tenant datacenters, resources are shared among many customers: the network links and switches, and even each machine’s network interface and CPUs (when running on virtual machines), are shared. Batch workloads such as MapReduce (see Chapter 10 ) can easily saturate network links. As you have no control over or insight into other customers’ usage of the shared resources, network delays can be highly variable if someone near you (a noisy neighbor ) is using a lot of resources [ 28 , 29 ].
在公共云和多租户数据中心中,资源被许多客户共享:网络链接和交换机,甚至每台机器的网络接口和 CPU(在虚拟机上运行时)都是共享的。 批处理工作负载,例如MapReduce(见第十章),可以轻松饱和网络链接。 由于您无法控制或了解其他客户对共享资源的使用,如果某个邻近的用户(嘈杂的邻居)正在使用大量资源,则网络延迟可能会变化很大 [28,29]。
In such environments, you can only choose timeouts experimentally: measure the distribution of network round-trip times over an extended period, and over many machines, to determine the expected variability of delays. Then, taking into account your application’s characteristics, you can determine an appropriate trade-off between failure detection delay and risk of premature timeouts.
在这种环境下,您只能通过实验选择适当的超时时间:长时间内测量网络往返延迟时间分布,并在多台设备上测量,从而确定延迟期望的可变性。然后,考虑您的应用特点,您可以确定适当的故障检测延迟和过早超时的风险之间的权衡。
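As a hedged sketch of this experimental approach (the round-trip times below are simulated; in practice you would record real measurements over many machines and an extended period):
作为这种实验方法的一个示意性草图(下面的往返时间是模拟的;实践中应当在多台机器上、长时间地记录真实测量值):

```python
import random

# Simulated round-trip-time measurements, in milliseconds.
random.seed(1)
rtts_ms = [random.lognormvariate(3.0, 0.5) for _ in range(10_000)]

def percentile(data, p):
    """Return the p-th percentile of `data` (simple nearest-rank method)."""
    data = sorted(data)
    k = min(len(data) - 1, int(p / 100 * len(data)))
    return data[k]

# Pick a timeout from the tail of the observed distribution, with headroom.
p999 = percentile(rtts_ms, 99.9)
timeout_ms = 2 * p999   # safety factor over the observed 99.9th percentile

# The resulting timeout sits far above typical round trips.
assert timeout_ms > percentile(rtts_ms, 50)
```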
Even better, rather than using configured constant timeouts, systems can continually measure response times and their variability ( jitter ), and automatically adjust timeouts according to the observed response time distribution. This can be done with a Phi Accrual failure detector [ 30 ], which is used for example in Akka and Cassandra [ 31 ]. TCP retransmission timeouts also work similarly [ 27 ].
更好的做法是,系统可以不使用配置好的恒定超时时间,而是持续测量响应时间及其变异性(抖动),并根据观察到的响应时间分布自动调整超时时间。这可以通过 Phi Accrual 失效检测器 [30] 来实现,例如 Akka 和 Cassandra [31] 中就使用了它。TCP 的重传超时也以类似方式工作 [27]。
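A minimal sketch of the phi accrual idea, under the simplifying assumption that heartbeat inter-arrival times are roughly normally distributed (real implementations, such as Akka's, are more elaborate):
Phi Accrual 思想的一个极简草图,基于"心跳到达间隔大致服从正态分布"这一简化假设(真实实现,例如 Akka 中的实现,要复杂得多):

```python
import math
import statistics

class PhiAccrualDetector:
    """Suspicion level phi = -log10(P(heartbeat still arrives this late))."""

    def __init__(self):
        self.intervals = []
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        mean = statistics.mean(self.intervals)
        std = statistics.pstdev(self.intervals) or mean * 0.1  # avoid std == 0
        elapsed = now - self.last_heartbeat
        # P(next inter-arrival time >= elapsed), via the normal tail function.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # avoid log(0) after underflow
        return -math.log10(p_later)

d = PhiAccrualDetector()
for t in range(10):          # heartbeats arriving every 1.0 s
    d.heartbeat(float(t))

# The longer the silence, the higher the suspicion that the node is dead.
assert d.phi(10.0) < d.phi(15.0)
```

Instead of a hard dead/alive verdict at a fixed timeout, callers compare phi against a threshold of their choosing, trading detection speed against the risk of false positives.
调用方不再在固定超时处做出"死/活"的硬性裁决,而是将 phi 与自选的阈值比较,在检测速度与误报风险之间权衡。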
Synchronous Versus Asynchronous Networks
Distributed systems would be a lot simpler if we could rely on the network to deliver packets with some fixed maximum delay, and not to drop packets. Why can’t we solve this at the hardware level and make the network reliable so that the software doesn’t need to worry about it?
如果我们可以依赖网络以一定的最长延迟传递数据包,而不是丢弃数据包,分布式系统将会更简单。为什么我们不能在硬件层面解决这个问题,使网络变得可靠,从而软件不需要担心这些问题呢?
To answer this question, it’s interesting to compare datacenter networks to the traditional fixed-line telephone network (non-cellular, non-VoIP), which is extremely reliable: delayed audio frames and dropped calls are very rare. A phone call requires a constantly low end-to-end latency and enough bandwidth to transfer the audio samples of your voice. Wouldn’t it be nice to have similar reliability and predictability in computer networks?
要回答这个问题,不妨将数据中心网络与传统的固定电话网络(非蜂窝、非 VoIP)作个比较,后者极其可靠:音频帧延迟和掉线都非常罕见。电话呼叫需要持续的低端到端延迟,以及足以传输语音音频采样的带宽。如果计算机网络也能有类似的可靠性和可预测性,岂不是很好?
When you make a call over the telephone network, it establishes a circuit : a fixed, guaranteed amount of bandwidth is allocated for the call, along the entire route between the two callers. This circuit remains in place until the call ends [ 32 ]. For example, an ISDN network runs at a fixed rate of 4,000 frames per second. When a call is established, it is allocated 16 bits of space within each frame (in each direction). Thus, for the duration of the call, each side is guaranteed to be able to send exactly 16 bits of audio data every 250 microseconds [ 33 , 34 ].
当你通过电话网络拨打电话时,它会建立一条电路:在两个通话者之间的整条线路上,分配一个固定的、有保证的带宽量用于该次通话。这条电路会一直保持到通话结束 [32]。例如,ISDN 网络以每秒 4,000 帧的固定速率运行。通话建立后,它会在每一帧中(每个方向)分配 16 位的空间。因此,在通话期间,通话双方各自都能保证每 250 微秒恰好发送 16 位的音频数据 [33,34]。
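The numbers in this example fit together: 4,000 frames per second at 16 bits per frame is the classic 64 kbit/s telephone voice channel. As a quick check:
这个例子中的数字是自洽的:每秒 4,000 帧、每帧 16 位,正是经典的 64 kbit/s 电话语音信道。简单验算一下:

```python
# ISDN channel arithmetic from the text.
frames_per_second = 4000
bits_per_frame = 16          # per direction, reserved for the duration of the call

bitrate = frames_per_second * bits_per_frame
assert bitrate == 64_000     # the classic 64 kbit/s voice channel

# Equivalently: one 16-bit slot every 250 microseconds.
frame_interval_us = 1_000_000 / frames_per_second
assert frame_interval_us == 250.0
```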
This kind of network is synchronous : even as data passes through several routers, it does not suffer from queueing, because the 16 bits of space for the call have already been reserved in the next hop of the network. And because there is no queueing, the maximum end-to-end latency of the network is fixed. We call this a bounded delay .
这种网络是同步的:即使数据经过多个路由器,也不会发生排队,因为该呼叫所需的 16 位空间已经在网络的下一跳中预留出来了。而且由于没有排队,网络的最大端到端延迟是固定的。我们称之为有界延迟。
Can we not simply make network delays predictable?
Note that a circuit in a telephone network is very different from a TCP connection: a circuit is a fixed amount of reserved bandwidth which nobody else can use while the circuit is established, whereas the packets of a TCP connection opportunistically use whatever network bandwidth is available. You can give TCP a variable-sized block of data (e.g., an email or a web page), and it will try to transfer it in the shortest time possible. While a TCP connection is idle, it doesn’t use any bandwidth. ii
请注意,电话网络中的电路与TCP连接非常不同:电路是一定量的预留带宽,当电路建立时,没有其他人可以使用该带宽,而TCP连接的数据包是机会主义地利用可用的网络带宽。您可以给TCP一个可变大小的数据块(例如,电子邮件或网页),它将尝试以最短的时间传输。当TCP连接空闲时,它不会使用任何带宽。
If datacenter networks and the internet were circuit-switched networks, it would be possible to establish a guaranteed maximum round-trip time when a circuit was set up. However, they are not: Ethernet and IP are packet-switched protocols, which suffer from queueing and thus unbounded delays in the network. These protocols do not have the concept of a circuit.
如果数据中心网络和互联网是电路交换网络,那么在建立电路时就可以确保最大往返时间。但实际上它们不是:以太网和 IP 是分组交换协议,会受到排队的影响,因此网络中的延迟没有上界。这些协议没有电路的概念。
Why do datacenter networks and the internet use packet switching? The answer is that they are optimized for bursty traffic . A circuit is good for an audio or video call, which needs to transfer a fairly constant number of bits per second for the duration of the call. On the other hand, requesting a web page, sending an email, or transferring a file doesn’t have any particular bandwidth requirement—we just want it to complete as quickly as possible.
数据中心网络和互联网为什么要使用分组交换?答案是它们被优化用于突发性的流量。电路对于音频或视频呼叫来说很好,这需要在呼叫期间传输相对恒定的比特数。另一方面,请求网页、发送电子邮件或传输文件并没有特定的带宽要求——我们只想尽快完成它。
If you wanted to transfer a file over a circuit, you would have to guess a bandwidth allocation. If you guess too low, the transfer is unnecessarily slow, leaving network capacity unused. If you guess too high, the circuit cannot be set up (because the network cannot allow a circuit to be created if its bandwidth allocation cannot be guaranteed). Thus, using circuits for bursty data transfers wastes network capacity and makes transfers unnecessarily slow. By contrast, TCP dynamically adapts the rate of data transfer to the available network capacity.
如果你想通过电路传输文件,就必须猜测一个带宽分配量。如果猜得太低,传输就会不必要地慢,网络容量被白白闲置;如果猜得太高,电路就无法建立(因为如果无法保证带宽分配,网络就不允许建立电路)。因此,将电路用于突发数据传输既浪费网络容量,又使传输不必要地慢。相比之下,TCP 会根据可用的网络容量动态调整数据传输速率。
There have been some attempts to build hybrid networks that support both circuit switching and packet switching, such as ATM. iii InfiniBand has some similarities [ 35 ]: it implements end-to-end flow control at the link layer, which reduces the need for queueing in the network, although it can still suffer from delays due to link congestion [ 36 ]. With careful use of quality of service (QoS, prioritization and scheduling of packets) and admission control (rate-limiting senders), it is possible to emulate circuit switching on packet networks, or provide statistically bounded delay [ 25 , 32 ].
有一些构建同时支持电路交换和分组交换的混合网络的尝试,例如 ATM。InfiniBand 与之有一些相似之处 [35]:它在链路层实现端到端的流量控制,从而减少了网络中排队的需要,尽管它仍可能因链路拥塞而出现延迟 [36]。通过谨慎地使用服务质量(QoS,数据包的优先级与调度)和准入控制(对发送方限速),可以在分组网络上模拟电路交换,或提供统计意义上有界的延迟 [25,32]。
However, such quality of service is currently not enabled in multi-tenant datacenters and public clouds, or when communicating via the internet. iv Currently deployed technology does not allow us to make any guarantees about delays or reliability of the network: we have to assume that network congestion, queueing, and unbounded delays will happen. Consequently, there’s no “correct” value for timeouts—they need to be determined experimentally.
然而,这样的服务质量目前在多租户数据中心、公共云或通过互联网通信时尚未启用。目前部署的技术不允许我们对网络的延迟或可靠性做出任何保证:我们必须假定网络拥塞、排队和无限延迟将会发生。因此,超时的“正确”值没有固定的标准,需要通过实验进行确定。
Unreliable Clocks
Clocks and time are important. Applications depend on clocks in various ways to answer questions like the following:
-
Has this request timed out yet?
-
What’s the 99th percentile response time of this service?
-
How many queries per second did this service handle on average in the last five minutes?
-
How long did the user spend on our site?
-
When was this article published?
-
At what date and time should the reminder email be sent?
-
When does this cache entry expire?
-
What is the timestamp on this error message in the log file?
Examples 1–4 measure durations (e.g., the time interval between a request being sent and a response being received), whereas examples 5–8 describe points in time (events that occur on a particular date, at a particular time).
In a distributed system, time is a tricky business, because communication is not instantaneous: it takes time for a message to travel across the network from one machine to another. The time when a message is received is always later than the time when it is sent, but due to variable delays in the network, we don’t know how much later. This fact sometimes makes it difficult to determine the order in which things happened when multiple machines are involved.
Moreover, each machine on the network has its own clock, which is an actual hardware device: usually a quartz crystal oscillator. These devices are not perfectly accurate, so each machine has its own notion of time, which may be slightly faster or slower than on other machines. It is possible to synchronize clocks to some degree: the most commonly used mechanism is the Network Time Protocol (NTP), which allows the computer clock to be adjusted according to the time reported by a group of servers [ 37 ]. The servers in turn get their time from a more accurate time source, such as a GPS receiver.
Monotonic Versus Time-of-Day Clocks
Modern computers have at least two different kinds of clocks: a time-of-day clock and a monotonic clock . Although they both measure time, it is important to distinguish the two, since they serve different purposes.
Time-of-day clocks
A time-of-day clock does what you intuitively expect of a clock: it returns the current date and time according to some calendar (also known as wall-clock time). For example, clock_gettime(CLOCK_REALTIME) on Linux v and System.currentTimeMillis() in Java return the number of seconds (or milliseconds) since the epoch: midnight UTC on January 1, 1970, according to the Gregorian calendar, not counting leap seconds. Some systems use other dates as their reference point.
Time-of-day clocks are usually synchronized with NTP, which means that a timestamp from one machine (ideally) means the same as a timestamp on another machine. However, time-of-day clocks also have various oddities, as described in the next section. In particular, if the local clock is too far ahead of the NTP server, it may be forcibly reset and appear to jump back to a previous point in time. These jumps, as well as the fact that they often ignore leap seconds, make time-of-day clocks unsuitable for measuring elapsed time [ 38 ].
Time-of-day clocks have also historically had quite a coarse-grained resolution, e.g., moving forward in steps of 10 ms on older Windows systems [ 39 ]. On recent systems, this is less of a problem.
Monotonic clocks
A monotonic clock is suitable for measuring a duration (time interval), such as a timeout or a service’s response time: clock_gettime(CLOCK_MONOTONIC) on Linux and System.nanoTime() in Java are monotonic clocks, for example. The name comes from the fact that they are guaranteed to always move forward (whereas a time-of-day clock may jump back in time).
You can check the value of the monotonic clock at one point in time, do something, and then check the clock again at a later time. The difference between the two values tells you how much time elapsed between the two checks. However, the absolute value of the clock is meaningless: it might be the number of nanoseconds since the computer was started, or something similarly arbitrary. In particular, it makes no sense to compare monotonic clock values from two different computers, because they don’t mean the same thing.
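The pattern described above can be sketched in Java as follows (a minimal illustration; the helper class and the sleep duration are assumptions, not from the book). Using System.nanoTime() rather than System.currentTimeMillis() means the measurement is unaffected by NTP stepping the time-of-day clock:

```java
public class ElapsedTime {
    // Measures how long a task takes, in milliseconds, using the monotonic
    // clock. Only the difference between two readings is meaningful; the
    // absolute nanoTime() value has an arbitrary origin.
    public static long measureMillis(Runnable task) {
        long start = System.nanoTime(); // monotonic reading before the task
        task.run();
        long end = System.nanoTime();   // monotonic reading after the task
        return (end - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long elapsed = measureMillis(() -> {
            try { Thread.sleep(50); } catch (InterruptedException e) { }
        });
        System.out.println("elapsed ~ " + elapsed + " ms");
    }
}
```

Note that comparing the raw start value against a nanoTime() reading taken on a different machine would be meaningless, for the reason given above.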
On a server with multiple CPU sockets, there may be a separate timer per CPU, which is not necessarily synchronized with other CPUs. Operating systems compensate for any discrepancy and try to present a monotonic view of the clock to application threads, even as they are scheduled across different CPUs. However, it is wise to take this guarantee of monotonicity with a pinch of salt [ 40 ].
NTP may adjust the frequency at which the monotonic clock moves forward (this is known as slewing the clock) if it detects that the computer’s local quartz is moving faster or slower than the NTP server. By default, NTP allows the clock rate to be speeded up or slowed down by up to 0.05%, but NTP cannot cause the monotonic clock to jump forward or backward. The resolution of monotonic clocks is usually quite good: on most systems they can measure time intervals in microseconds or less.
In a distributed system, using a monotonic clock for measuring elapsed time (e.g., timeouts) is usually fine, because it doesn’t assume any synchronization between different nodes’ clocks and is not sensitive to slight inaccuracies of measurement.
Clock Synchronization and Accuracy
Monotonic clocks don’t need synchronization, but time-of-day clocks need to be set according to an NTP server or other external time source in order to be useful. Unfortunately, our methods for getting a clock to tell the correct time aren’t nearly as reliable or accurate as you might hope—hardware clocks and NTP can be fickle beasts. To give just a few examples:
-
The quartz clock in a computer is not very accurate: it drifts (runs faster or slower than it should). Clock drift varies depending on the temperature of the machine. Google assumes a clock drift of 200 ppm (parts per million) for its servers [ 41 ], which is equivalent to 6 ms drift for a clock that is resynchronized with a server every 30 seconds, or 17 seconds drift for a clock that is resynchronized once a day. This drift limits the best possible accuracy you can achieve, even if everything is working correctly.
-
If a computer’s clock differs too much from an NTP server, it may refuse to synchronize, or the local clock will be forcibly reset [ 37 ]. Any applications observing the time before and after this reset may see time go backward or suddenly jump forward.
-
If a node is accidentally firewalled off from NTP servers, the misconfiguration may go unnoticed for some time. Anecdotal evidence suggests that this does happen in practice.
-
NTP synchronization can only be as good as the network delay, so there is a limit to its accuracy when you’re on a congested network with variable packet delays. One experiment showed that a minimum error of 35 ms is achievable when synchronizing over the internet [ 42 ], though occasional spikes in network delay lead to errors of around a second. Depending on the configuration, large network delays can cause the NTP client to give up entirely.
-
Some NTP servers are wrong or misconfigured, reporting time that is off by hours [ 43 , 44 ]. NTP clients are quite robust, because they query several servers and ignore outliers. Nevertheless, it’s somewhat worrying to bet the correctness of your systems on the time that you were told by a stranger on the internet.
-
Leap seconds result in a minute that is 59 seconds or 61 seconds long, which messes up timing assumptions in systems that are not designed with leap seconds in mind [ 45 ]. The fact that leap seconds have crashed many large systems [ 38 , 46 ] shows how easy it is for incorrect assumptions about clocks to sneak into a system. The best way of handling leap seconds may be to make NTP servers “lie,” by performing the leap second adjustment gradually over the course of a day (this is known as smearing ) [ 47 , 48 ], although actual NTP server behavior varies in practice [ 49 ].
-
In virtual machines, the hardware clock is virtualized, which raises additional challenges for applications that need accurate timekeeping [ 50 ]. When a CPU core is shared between virtual machines, each VM is paused for tens of milliseconds while another VM is running. From an application’s point of view, this pause manifests itself as the clock suddenly jumping forward [ 26 ].
-
If you run software on devices that you don’t fully control (e.g., mobile or embedded devices), you probably cannot trust the device’s hardware clock at all. Some users deliberately set their hardware clock to an incorrect date and time, for example to circumvent timing limitations in games. As a result, the clock might be set to a time wildly in the past or the future.
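The drift figures in the first item above follow from simple arithmetic: 200 ppm means 200 µs of error per second of elapsed time, which gives 6 ms over a 30-second sync interval and about 17 seconds over a day. A quick sketch of the calculation (the class and method names are illustrative):

```java
public class DriftMath {
    static final double DRIFT_RATE = 200e-6; // 200 ppm, the figure Google assumes

    // Worst-case accumulated drift, in milliseconds, after running
    // unsynchronized for the given number of seconds.
    public static double worstCaseDriftMillis(double secondsSinceSync) {
        return DRIFT_RATE * secondsSinceSync * 1000;
    }

    public static void main(String[] args) {
        System.out.println(worstCaseDriftMillis(30));    // 6.0 ms (resync every 30 s)
        System.out.println(worstCaseDriftMillis(86400)); // 17280.0 ms, i.e. ~17 s (daily resync)
    }
}
```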
It is possible to achieve very good clock accuracy if you care about it sufficiently to invest significant resources. For example, the MiFID II draft European regulation for financial institutions requires all high-frequency trading funds to synchronize their clocks to within 100 microseconds of UTC, in order to help debug market anomalies such as “flash crashes” and to help detect market manipulation [ 51 ].
Such accuracy can be achieved using GPS receivers, the Precision Time Protocol (PTP) [ 52 ], and careful deployment and monitoring. However, it requires significant effort and expertise, and there are plenty of ways clock synchronization can go wrong. If your NTP daemon is misconfigured, or a firewall is blocking NTP traffic, the clock error due to drift can quickly become large.
Relying on Synchronized Clocks
The problem with clocks is that while they seem simple and easy to use, they have a surprising number of pitfalls: a day may not have exactly 86,400 seconds, time-of-day clocks may move backward in time, and the time on one node may be quite different from the time on another node.
Earlier in this chapter we discussed networks dropping and arbitrarily delaying packets. Even though networks are well behaved most of the time, software must be designed on the assumption that the network will occasionally be faulty, and the software must handle such faults gracefully. The same is true with clocks: although they work quite well most of the time, robust software needs to be prepared to deal with incorrect clocks.
Part of the problem is that incorrect clocks easily go unnoticed. If a machine’s CPU is defective or its network is misconfigured, it most likely won’t work at all, so it will quickly be noticed and fixed. On the other hand, if its quartz clock is defective or its NTP client is misconfigured, most things will seem to work fine, even though its clock gradually drifts further and further away from reality. If some piece of software is relying on an accurately synchronized clock, the result is more likely to be silent and subtle data loss than a dramatic crash [ 53 , 54 ].
Thus, if you use software that requires synchronized clocks, it is essential that you also carefully monitor the clock offsets between all the machines. Any node whose clock drifts too far from the others should be declared dead and removed from the cluster. Such monitoring ensures that you notice the broken clocks before they can cause too much damage.
Timestamps for ordering events
Let’s consider one particular situation in which it is tempting, but dangerous, to rely on clocks: ordering of events across multiple nodes. For example, if two clients write to a distributed database, who got there first? Which write is the more recent one?
Figure 8-3 illustrates a dangerous use of time-of-day clocks in a database with multi-leader replication (the example is similar to Figure 5-9 ). Client A writes x = 1 on node 1; the write is replicated to node 3; client B increments x on node 3 (we now have x = 2); and finally, both writes are replicated to node 2.
In Figure 8-3 , when a write is replicated to other nodes, it is tagged with a timestamp according to the time-of-day clock on the node where the write originated. The clock synchronization is very good in this example: the skew between node 1 and node 3 is less than 3 ms, which is probably better than you can expect in practice.
Nevertheless, the timestamps in Figure 8-3 fail to order the events correctly: the write x = 1 has a timestamp of 42.004 seconds, but the write x = 2 has a timestamp of 42.003 seconds, even though x = 2 occurred unambiguously later. When node 2 receives these two events, it will incorrectly conclude that x = 1 is the more recent value and drop the write x = 2. In effect, client B’s increment operation will be lost.
This conflict resolution strategy is called last write wins (LWW), and it is widely used in both multi-leader replication and leaderless databases such as Cassandra [ 53 ] and Riak [ 54 ] (see “Last write wins (discarding concurrent writes)” ). Some implementations generate timestamps on the client rather than the server, but this doesn’t change the fundamental problems with LWW:
-
Database writes can mysteriously disappear: a node with a lagging clock is unable to overwrite values previously written by a node with a fast clock until the clock skew between the nodes has elapsed [ 54 , 55 ]. This scenario can cause arbitrary amounts of data to be silently dropped without any error being reported to the application.
-
LWW cannot distinguish between writes that occurred sequentially in quick succession (in Figure 8-3 , client B’s increment definitely occurs after client A’s write) and writes that were truly concurrent (neither writer was aware of the other). Additional causality tracking mechanisms, such as version vectors, are needed in order to prevent violations of causality (see “Detecting Concurrent Writes” ).
-
It is possible for two nodes to independently generate writes with the same timestamp, especially when the clock only has millisecond resolution. An additional tiebreaker value (which can simply be a large random number) is required to resolve such conflicts, but this approach can also lead to violations of causality [ 53 ].
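The lost update of Figure 8-3 can be reproduced in a few lines (a sketch for illustration only; the Write class and resolve method are hypothetical, with the timestamps taken from the figure):

```java
public class LwwDemo {
    static class Write {
        final String value;
        final double timestamp; // seconds, from the originating node's time-of-day clock

        Write(String value, double timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Last write wins: keep the write with the higher timestamp,
    // regardless of the true causal order.
    static Write resolve(Write a, Write b) {
        return a.timestamp >= b.timestamp ? a : b;
    }

    public static void main(String[] args) {
        Write first  = new Write("x = 1", 42.004); // node 1's clock is slightly fast
        Write second = new Write("x = 2", 42.003); // causally later, but lower timestamp
        // LWW keeps x = 1 and silently drops client B's increment:
        System.out.println(resolve(first, second).value); // prints "x = 1"
    }
}
```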
Thus, even though it is tempting to resolve conflicts by keeping the most “recent” value and discarding others, it’s important to be aware that the definition of “recent” depends on a local time-of-day clock, which may well be incorrect. Even with tightly NTP-synchronized clocks, you could send a packet at timestamp 100 ms (according to the sender’s clock) and have it arrive at timestamp 99 ms (according to the recipient’s clock)—so it appears as though the packet arrived before it was sent, which is impossible.
Could NTP synchronization be made accurate enough that such incorrect orderings cannot occur? Probably not, because NTP’s synchronization accuracy is itself limited by the network round-trip time, in addition to other sources of error such as quartz drift. For correct ordering, you would need the clock source to be significantly more accurate than the thing you are measuring (namely network delay).
So-called logical clocks [ 56 , 57 ], which are based on incrementing counters rather than an oscillating quartz crystal, are a safer alternative for ordering events (see “Detecting Concurrent Writes” ). Logical clocks do not measure the time of day or the number of seconds elapsed, only the relative ordering of events (whether one event happened before or after another). In contrast, time-of-day and monotonic clocks, which measure actual elapsed time, are also known as physical clocks . We’ll look at ordering a bit more in “Ordering Guarantees” .
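A minimal logical clock in the style of Lamport [56] can be sketched as follows (a simplified illustration of the idea, not any particular database's implementation):

```java
public class LamportClock {
    private long counter = 0;

    // A local event or a message send: increment and return the new timestamp.
    public synchronized long tick() {
        return ++counter;
    }

    // A message receive: advance past the sender's timestamp, so that the
    // receive event is ordered after the corresponding send event.
    public synchronized long receive(long senderTimestamp) {
        counter = Math.max(counter, senderTimestamp) + 1;
        return counter;
    }

    public static void main(String[] args) {
        LamportClock nodeA = new LamportClock();
        LamportClock nodeB = new LamportClock();
        long send = nodeA.tick();          // event on node A
        long recv = nodeB.receive(send);   // node B orders itself after A
        System.out.println(send < recv);   // true: the causal order is preserved
    }
}
```

Note that these timestamps say nothing about elapsed wall-clock time; they only guarantee that causally related events are numbered in the right order.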
Clock readings have a confidence interval
You may be able to read a machine’s time-of-day clock with microsecond or even nanosecond resolution. But even if you can get such a fine-grained measurement, that doesn’t mean the value is actually accurate to such precision. In fact, it most likely is not—as mentioned previously, the drift in an imprecise quartz clock can easily be several milliseconds, even if you synchronize with an NTP server on the local network every minute. With an NTP server on the public internet, the best possible accuracy is probably to the tens of milliseconds, and the error may easily spike to over 100 ms when there is network congestion [ 57 ].
Thus, it doesn’t make sense to think of a clock reading as a point in time—it is more like a range of times, within a confidence interval: for example, a system may be 95% confident that the time now is between 10.3 and 10.5 seconds past the minute, but it doesn’t know any more precisely than that [ 58 ]. If we only know the time +/– 100 ms, the microsecond digits in the timestamp are essentially meaningless.
The uncertainty bound can be calculated based on your time source. If you have a GPS receiver or atomic (caesium) clock directly attached to your computer, the expected error range is reported by the manufacturer. If you’re getting the time from a server, the uncertainty is based on the expected quartz drift since your last sync with the server, plus the NTP server’s uncertainty, plus the network round-trip time to the server (to a first approximation, and assuming you trust the server).
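To a first approximation, that calculation might be sketched like this (the 200 ppm drift rate and the example numbers are assumptions for illustration, not values from any real NTP client):

```java
public class ClockUncertainty {
    static final double DRIFT_RATE = 200e-6; // assumed worst-case quartz drift (200 ppm)

    // Rough uncertainty (ms) of a time-of-day clock reading: expected drift
    // since the last sync, plus the server's own uncertainty, plus the
    // network round-trip time of the sync request (first approximation,
    // assuming the server is trusted).
    public static double uncertaintyMillis(double secondsSinceSync,
                                           double serverUncertaintyMillis,
                                           double roundTripMillis) {
        return DRIFT_RATE * secondsSinceSync * 1000
                + serverUncertaintyMillis
                + roundTripMillis;
    }

    public static void main(String[] args) {
        // 60 s since the last sync, 1 ms server uncertainty, 20 ms RTT:
        System.out.println(uncertaintyMillis(60, 1, 20)); // 33.0
    }
}
```

With an uncertainty of tens of milliseconds, the microsecond digits of the reading are, as noted above, essentially meaningless.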
Unfortunately, most systems don’t expose this uncertainty: for example, when you call clock_gettime(), the return value doesn’t tell you the expected error of the timestamp, so you don’t know if its confidence interval is five milliseconds or five years.
An interesting exception is Google’s TrueTime API in Spanner [ 41 ], which explicitly reports the confidence interval on the local clock. When you ask it for the current time, you get back two values: [earliest, latest], which are the earliest possible and the latest possible timestamp. Based on its uncertainty calculations, the clock knows that the actual current time is somewhere within that interval. The width of the interval depends, among other things, on how long it has been since the local quartz clock was last synchronized with a more accurate clock source.
Synchronized clocks for global snapshots
In “Snapshot Isolation and Repeatable Read” we discussed snapshot isolation , which is a very useful feature in databases that need to support both small, fast read-write transactions and large, long-running read-only transactions (e.g., for backups or analytics). It allows read-only transactions to see the database in a consistent state at a particular point in time, without locking and interfering with read-write transactions.
The most common implementation of snapshot isolation requires a monotonically increasing transaction ID. If a write happened later than the snapshot (i.e., the write has a greater transaction ID than the snapshot), that write is invisible to the snapshot transaction. On a single-node database, a simple counter is sufficient for generating transaction IDs.
However, when a database is distributed across many machines, potentially in multiple datacenters, a global, monotonically increasing transaction ID (across all partitions) is difficult to generate, because it requires coordination. The transaction ID must reflect causality: if transaction B reads a value that was written by transaction A, then B must have a higher transaction ID than A—otherwise, the snapshot would not be consistent. With lots of small, rapid transactions, creating transaction IDs in a distributed system becomes an untenable bottleneck. vi
Can we use the timestamps from synchronized time-of-day clocks as transaction IDs? If we could get the synchronization good enough, they would have the right properties: later transactions have a higher timestamp. The problem, of course, is the uncertainty about clock accuracy.
Spanner implements snapshot isolation across datacenters in this way [ 59 , 60 ]. It uses the clock’s confidence interval as reported by the TrueTime API, and is based on the following observation: if you have two confidence intervals, each consisting of an earliest and latest possible timestamp (A = [A_earliest, A_latest] and B = [B_earliest, B_latest]), and those two intervals do not overlap (i.e., A_earliest < A_latest < B_earliest < B_latest), then B definitely happened after A—there can be no doubt. Only if the intervals overlap are we unsure in which order A and B happened.
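The non-overlap check can be sketched as follows (the interval values are hypothetical; the real TrueTime API is internal to Google's infrastructure):

```java
public class TimeInterval {
    final long earliest, latest; // TrueTime-style [earliest, latest] bounds (ms)

    TimeInterval(long earliest, long latest) {
        this.earliest = earliest;
        this.latest = latest;
    }

    // True only if this interval ends before the other begins; then the two
    // events are unambiguously ordered, with no clock uncertainty in the way.
    boolean definitelyBefore(TimeInterval other) {
        return this.latest < other.earliest;
    }

    public static void main(String[] args) {
        TimeInterval a = new TimeInterval(100, 107); // e.g. +/- 3.5 ms uncertainty
        TimeInterval b = new TimeInterval(110, 117);
        TimeInterval c = new TimeInterval(105, 112); // overlaps a
        System.out.println(a.definitelyBefore(b)); // true: no overlap
        System.out.println(a.definitelyBefore(c)); // false: order is ambiguous
    }
}
```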
In order to ensure that transaction timestamps reflect causality, Spanner deliberately waits for the length of the confidence interval before committing a read-write transaction. By doing so, it ensures that any transaction that may read the data is at a sufficiently later time, so their confidence intervals do not overlap. In order to keep the wait time as short as possible, Spanner needs to keep the clock uncertainty as small as possible; for this purpose, Google deploys a GPS receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms [ 41 ].
Using clock synchronization for distributed transaction semantics is an area of active research [ 57 , 61 , 62 ]. These ideas are interesting, but they have not yet been implemented in mainstream databases outside of Google.
Process Pauses
Let’s consider another example of dangerous clock use in a distributed system. Say you have a database with a single leader per partition. Only the leader is allowed to accept writes. How does a node know that it is still leader (that it hasn’t been declared dead by the others), and that it may safely accept writes?
One option is for the leader to obtain a lease from the other nodes, which is similar to a lock with a timeout [ 63 ]. Only one node can hold the lease at any one time—thus, when a node obtains a lease, it knows that it is the leader for some amount of time, until the lease expires. In order to remain leader, the node must periodically renew the lease before it expires. If the node fails, it stops renewing the lease, so another node can take over when it expires.
You can imagine the request-handling loop looking something like this:
while (true) {
    request = getIncomingRequest();

    // Ensure that the lease always has at least 10 seconds remaining
    if (lease.expiryTimeMillis - System.currentTimeMillis() < 10000) {
        lease = lease.renew();
    }

    if (lease.isValid()) {
        process(request);
    }
}
What’s wrong with this code? Firstly, it’s relying on synchronized clocks: the expiry time on the lease is set by a different machine (where the expiry may be calculated as the current time plus 30 seconds, for example), and it’s being compared to the local system clock. If the clocks are out of sync by more than a few seconds, this code will start doing strange things.
Secondly, even if we change the protocol to only use the local monotonic clock, there is another problem: the code assumes that very little time passes between the point that it checks the time (System.currentTimeMillis()) and the time when the request is processed (process(request)). Normally this code runs very quickly, so the 10 second buffer is more than enough to ensure that the lease doesn’t expire in the middle of processing a request.
However, what if there is an unexpected pause in the execution of the program? For example, imagine the thread stops for 15 seconds around the line lease.isValid() before finally continuing. In that case, it’s likely that the lease will have expired by the time the request is processed, and another node has already taken over as leader. However, there is nothing to tell this thread that it was paused for so long, so this code won’t notice that the lease has expired until the next iteration of the loop—by which time it may have already done something unsafe by processing the request.
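The first of the two problems can at least be removed by tracking the lease with the local monotonic clock. The sketch below assumes (hypothetically) that the lease server grants a validity duration ("valid for 30 s") rather than an absolute expiry time, so no cross-machine clock comparison is needed; it does nothing about the second problem, since a long pause between the validity check and processing can still invalidate the lease:

```java
import java.util.concurrent.TimeUnit;

// Sketch of a lease tracked with the local monotonic clock. All arithmetic
// uses differences of System.nanoTime() readings taken on this machine, so
// clock skew between machines does not matter. A GC or VM pause after
// isValid-style checks can still let the lease expire before the request
// is processed -- that fundamental problem remains.
public class MonotonicLease {
    private final long acquiredAtNanos; // System.nanoTime() at (re)acquisition
    private final long durationNanos;   // validity duration granted by the server

    public MonotonicLease(long acquiredAtNanos, long durationNanos) {
        this.acquiredAtNanos = acquiredAtNanos;
        this.durationNanos = durationNanos;
    }

    public long remainingNanos(long nowNanos) {
        return durationNanos - (nowNanos - acquiredAtNanos);
    }

    public boolean needsRenewal(long nowNanos) {
        // Renew while at least 10 seconds remain, as in the loop above.
        return remainingNanos(nowNanos) < TimeUnit.SECONDS.toNanos(10);
    }

    public static void main(String[] args) {
        MonotonicLease lease =
            new MonotonicLease(System.nanoTime(), TimeUnit.SECONDS.toNanos(30));
        System.out.println(lease.needsRenewal(System.nanoTime())); // ~30 s left
    }
}
```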
Is it crazy to assume that a thread might be paused for so long? Unfortunately not. There are various reasons why this could happen:
-
Many programming language runtimes (such as the Java Virtual Machine) have a garbage collector (GC) that occasionally needs to stop all running threads. These “stop-the-world” GC pauses have sometimes been known to last for several minutes [ 64 ]! Even so-called “concurrent” garbage collectors like the HotSpot JVM’s CMS cannot fully run in parallel with the application code—even they need to stop the world from time to time [ 65 ]. Although the pauses can often be reduced by changing allocation patterns or tuning GC settings [ 66 ], we must assume the worst if we want to offer robust guarantees.
-
In virtualized environments, a virtual machine can be suspended (pausing the execution of all processes and saving the contents of memory to disk) and resumed (restoring the contents of memory and continuing execution). This pause can occur at any time in a process’s execution and can last for an arbitrary length of time. This feature is sometimes used for live migration of virtual machines from one host to another without a reboot, in which case the length of the pause depends on the rate at which processes are writing to memory [ 67 ].
在虚拟化环境中,虚拟机可以被挂起(暂停所有进程的执行并将内存内容保存到磁盘)和恢复(恢复内存内容并继续执行)。这种暂停可能发生在进程执行的任何时刻,并且可以持续任意长的时间。此功能有时用于将虚拟机从一台主机实时迁移到另一台主机而无需重启,这种情况下暂停的长度取决于进程写入内存的速率[67]。
-
On end-user devices such as laptops, execution may also be suspended and resumed arbitrarily, e.g., when the user closes the lid of their laptop.
在终端用户设备上,例如笔记本电脑,执行也可以任意暂停和恢复,例如当用户关闭笔记本电脑盖子时。
-
When the operating system context-switches to another thread, or when the hypervisor switches to a different virtual machine (when running in a virtual machine), the currently running thread can be paused at any arbitrary point in the code. In the case of a virtual machine, the CPU time spent in other virtual machines is known as steal time . If the machine is under heavy load—i.e., if there is a long queue of threads waiting to run—it may take some time before the paused thread gets to run again.
当操作系统上下文切换到另一个线程时,或者当hypervisor切换到另一个虚拟机时(在虚拟机中运行时),当前正在运行的线程可能在代码的任意位置被暂停。在虚拟机的情况下,花费在其他虚拟机上的CPU时间称为窃取时间(steal time)。如果机器负载很重——即有很长的线程队列在等待运行——被暂停的线程可能需要一段时间才能再次运行。
-
If the application performs synchronous disk access, a thread may be paused waiting for a slow disk I/O operation to complete [ 68 ]. In many languages, disk access can happen surprisingly, even if the code doesn’t explicitly mention file access—for example, the Java classloader lazily loads class files when they are first used, which could happen at any time in the program execution. I/O pauses and GC pauses may even conspire to combine their delays [ 69 ]. If the disk is actually a network filesystem or network block device (such as Amazon’s EBS), the I/O latency is further subject to the variability of network delays [ 29 ].
如果应用程序执行同步的磁盘访问,线程可能会暂停,等待缓慢的磁盘I/O操作完成[68]。在许多语言中,即使代码没有明确提及文件访问,磁盘访问也可能出人意料地发生——例如,Java类加载器会在类文件第一次被使用时惰性加载它们,而这可能发生在程序执行的任何时刻。I/O暂停和GC暂停甚至可能合谋,叠加它们的延迟[69]。如果磁盘实际上是网络文件系统或网络块设备(例如Amazon的EBS),I/O延迟还会受到网络延迟变化的影响[29]。
-
If the operating system is configured to allow swapping to disk ( paging ), a simple memory access may result in a page fault that requires a page from disk to be loaded into memory. The thread is paused while this slow I/O operation takes place. If memory pressure is high, this may in turn require a different page to be swapped out to disk. In extreme circumstances, the operating system may spend most of its time swapping pages in and out of memory and getting little actual work done (this is known as thrashing ). To avoid this problem, paging is often disabled on server machines (if you would rather kill a process to free up memory than risk thrashing).
如果操作系统配置为允许交换到磁盘(分页),一次简单的内存访问就可能导致缺页错误,需要将一个页面从磁盘加载到内存。在这个缓慢的I/O操作进行期间,线程会被暂停。如果内存压力很大,这可能又需要将另一个页面换出到磁盘。在极端情况下,操作系统可能会把大部分时间花在页面的换入换出上,而几乎没有完成实际工作(这称为抖动)。为了避免这个问题,服务器机器上通常禁用分页(如果你宁愿杀死一个进程来释放内存,也不愿冒抖动的风险)。
-
A Unix process can be paused by sending it the SIGSTOP signal, for example by pressing Ctrl-Z in a shell. This signal immediately stops the process from getting any more CPU cycles until it is resumed with SIGCONT, at which point it continues running where it left off. Even if your environment does not normally use SIGSTOP, it might be sent accidentally by an operations engineer.
Unix进程可以通过向它发送SIGSTOP信号来暂停,例如在shell中按下Ctrl-Z。该信号会立即阻止进程获得更多的CPU周期,直到通过SIGCONT恢复为止,此时进程将从中断处继续运行。即使您的环境通常不使用SIGSTOP,运维工程师也可能意外地发送该信号。
All of these occurrences can preempt the running thread at any point and resume it at some later time, without the thread even noticing. The problem is similar to making multi-threaded code on a single machine thread-safe: you can’t assume anything about timing, because arbitrary context switches and parallelism may occur.
所有这些情况都可能在任何时候抢占正在运行的线程,并在稍后的某个时间恢复它,而线程本身对此毫无察觉。这个问题类似于使单台机器上的多线程代码线程安全:你不能对时序做任何假设,因为任意的上下文切换和并行都可能发生。
When writing multi-threaded code on a single machine, we have fairly good tools for making it thread-safe: mutexes, semaphores, atomic counters, lock-free data structures, blocking queues, and so on. Unfortunately, these tools don’t directly translate to distributed systems, because a distributed system has no shared memory—only messages sent over an unreliable network.
在单机上编写多线程代码时,我们有相当不错的工具可以使其线程安全:互斥锁、信号量、原子计数器、无锁数据结构、阻塞队列等。不幸的是,这些工具无法直接转换为分布式系统,因为分布式系统没有共享内存,只有通过不可靠网络发送的消息。
A node in a distributed system must assume that its execution can be paused for a significant length of time at any point, even in the middle of a function. During the pause, the rest of the world keeps moving and may even declare the paused node dead because it’s not responding. Eventually, the paused node may continue running, without even noticing that it was asleep until it checks its clock sometime later.
分布式系统中的节点必须假定其执行可能在任何时刻被暂停相当长的时间,甚至在一个函数的中间。在暂停期间,世界的其余部分在继续运转,甚至可能因为该节点没有响应而宣告它死亡。最终,被暂停的节点可能会继续运行,甚至不会注意到自己曾经睡着,直到稍后检查时钟时才会发现。
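A common defensive pattern implied by this discussion is to re-check a lease with a safety margin immediately before acting on it. The sketch below is illustrative only — the Lease class and margin value are hypothetical, and as the text explains, a pause can still strike between the check and the action, so a margin narrows the danger window but cannot close it:

```java
// Illustrative sketch only: the Lease class and the margin are hypothetical.
// Re-checking expiry with a safety margin immediately before acting narrows
// the window in which a pause invalidates the lease -- but cannot close it,
// since the process may still be paused between the check and the action.
public class LeaseGuard {
    static final long SAFETY_MARGIN_MS = 10_000;

    static class Lease {
        final long expiryTimeMs;
        Lease(long expiryTimeMs) { this.expiryTimeMs = expiryTimeMs; }

        // Treat the lease as valid only if a comfortable margin remains.
        boolean isValidWithMargin(long nowMs) {
            return nowMs + SAFETY_MARGIN_MS < expiryTimeMs;
        }
    }
}
```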
Response time guarantees
In many programming languages and operating systems, threads and processes may pause for an unbounded amount of time, as discussed. Those reasons for pausing can be eliminated if you try hard enough.
在许多编程语言和操作系统中,线程和进程可能会无限期地暂停,如上文所述。如果你努力尝试,这些暂停的原因可以被消除。
Some software runs in environments where a failure to respond within a specified time can cause serious damage: computers that control aircraft, rockets, robots, cars, and other physical objects must respond quickly and predictably to their sensor inputs. In these systems, there is a specified deadline by which the software must respond; if it doesn’t meet the deadline, that may cause a failure of the entire system. These are so-called hard real-time systems.
一些软件运行的环境要求它必须在指定时间内响应,否则会造成严重损害:控制飞机、火箭、机器人、汽车和其他物理对象的计算机必须快速而可预测地响应传感器输入。在这些系统中,软件必须在一个指定的截止时间之前做出响应;如果不能满足截止时间,可能会导致整个系统的失效。这就是所谓的硬实时系统。
Is real-time really real?
In embedded systems, real-time means that a system is carefully designed and tested to meet specified timing guarantees in all circumstances. This meaning is in contrast to the more vague use of the term real-time on the web, where it describes servers pushing data to clients and stream processing without hard response time constraints (see Chapter 11 ).
在嵌入式系统中,实时意味着系统经过精心设计和测试,以满足所有情况下的指定时间保证。这个意义与网络上更模糊的实时术语有所区别,在这里它描述的是服务器向客户端推送数据和流处理而没有严格的响应时间约束(见第11章)。
For example, if your car’s onboard sensors detect that you are currently experiencing a crash, you wouldn’t want the release of the airbag to be delayed due to an inopportune GC pause in the airbag release system.
例如,如果车载传感器检测到你正在经历碰撞,你肯定不希望安全气囊的弹出由于气囊释放系统中不合时宜的GC暂停而延迟。
Providing real-time guarantees in a system requires support from all levels of the software stack: a real-time operating system (RTOS) that allows processes to be scheduled with a guaranteed allocation of CPU time in specified intervals is needed; library functions must document their worst-case execution times; dynamic memory allocation may be restricted or disallowed entirely (real-time garbage collectors exist, but the application must still ensure that it doesn’t give the GC too much work to do); and an enormous amount of testing and measurement must be done to ensure that guarantees are being met.
在系统中提供实时保证需要软件栈各个层面的支持:需要一个实时操作系统(RTOS),允许进程在指定的时间间隔内获得有保证的CPU时间分配;库函数必须记录其最坏情况执行时间;动态内存分配可能受到限制或完全被禁止(实时垃圾收集器是存在的,但应用程序仍必须确保不给GC太多工作);还必须进行大量的测试和测量,以确保这些保证得到满足。
All of this requires a large amount of additional work and severely restricts the range of programming languages, libraries, and tools that can be used (since most languages and tools do not provide real-time guarantees). For these reasons, developing real-time systems is very expensive, and they are most commonly used in safety-critical embedded devices. Moreover, “real-time” is not the same as “high-performance”—in fact, real-time systems may have lower throughput, since they have to prioritize timely responses above all else (see also “Latency and Resource Utilization” ).
所有的这些都需要大量的额外工作并严重限制可以使用的编程语言、库和工具的范围(因为大多数语言和工具不提供实时保证)。由于这些原因,开发实时系统非常昂贵,并且它们在安全关键的嵌入式设备中最常用。此外,“实时”不同于“高性能”——实际上,实时系统的吞吐量可能更低,因为它们必须把及时响应放在首位(还请参阅 “延迟和资源利用”)。
For most server-side data processing systems, real-time guarantees are simply not economical or appropriate. Consequently, these systems must suffer the pauses and clock instability that come from operating in a non-real-time environment.
对于大多数服务器端数据处理系统而言,实时保障并不经济或合适。因此,这些系统必须忍受在非实时环境下操作所带来的暂停和时钟不稳定性。
Limiting the impact of garbage collection
The negative effects of process pauses can be mitigated without resorting to expensive real-time scheduling guarantees. Language runtimes have some flexibility around when they schedule garbage collections, because they can track the rate of object allocation and the remaining free memory over time.
无需诉诸昂贵的实时调度保证,也可以减轻进程暂停的负面影响。语言运行时在调度垃圾收集的时机上有一定的灵活性,因为它们可以跟踪对象分配的速率以及随时间变化的剩余空闲内存。
An emerging idea is to treat GC pauses like brief planned outages of a node, and to let other nodes handle requests from clients while one node is collecting its garbage. If the runtime can warn the application that a node soon requires a GC pause, the application can stop sending new requests to that node, wait for it to finish processing outstanding requests, and then perform the GC while no requests are in progress. This trick hides GC pauses from clients and reduces the high percentiles of response time [ 70 , 71 ]. Some latency-sensitive financial trading systems [ 72 ] use this approach.
一个新兴的想法是将GC暂停视为节点的短暂计划停机,在某个节点收集垃圾时,让其他节点处理来自客户端的请求。如果运行时可以提前警告应用程序某个节点很快需要GC暂停,应用程序就可以停止向该节点发送新请求,等待它处理完未完成的请求,然后在没有请求进行时执行GC。这个技巧向客户端隐藏了GC暂停,并降低了响应时间的高百分位数[70, 71]。一些对延迟敏感的金融交易系统[72]使用了这种方法。
A variant of this idea is to use the garbage collector only for short-lived objects (which are fast to collect) and to restart processes periodically, before they accumulate enough long-lived objects to require a full GC of long-lived objects [ 65 , 73 ]. One node can be restarted at a time, and traffic can be shifted away from the node before the planned restart, like in a rolling upgrade (see Chapter 4 ).
这个想法的一个变体是只将垃圾收集器用于短寿命对象(它们收集起来很快),并在进程积累到足够多的长寿命对象、需要对长寿命对象进行完整GC之前,定期重启进程[65, 73]。可以每次重启一个节点,并像滚动升级(见第4章)一样,在计划重启之前将流量从该节点转移走。
These measures cannot fully prevent garbage collection pauses, but they can usefully reduce their impact on the application.
这些措施不能完全预防垃圾收集暂停,但它们可以有用地减少其对应用程序的影响。
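The "planned outage" idea above could be sketched roughly as follows. Everything here is hypothetical: a real implementation would hook into runtime GC notifications and an actual load balancer rather than this toy routing table.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Toy sketch of treating GC pauses as planned outages: once a node announces
// an upcoming GC, the balancer stops routing new requests to it, and resumes
// after the collection finishes. Names (announceGc, pickNode) are invented.
public class GcAwareBalancer {
    enum NodeState { SERVING, DRAINING }
    private final Map<String, NodeState> nodes = new LinkedHashMap<>();

    void add(String node) { nodes.put(node, NodeState.SERVING); }

    // The runtime warns of an upcoming GC: stop routing new requests there.
    void announceGc(String node) { nodes.put(node, NodeState.DRAINING); }
    void gcFinished(String node) { nodes.put(node, NodeState.SERVING); }

    // Route a request to any node that is not about to collect garbage.
    Optional<String> pickNode() {
        for (Map.Entry<String, NodeState> e : nodes.entrySet()) {
            if (e.getValue() == NodeState.SERVING) {
                return Optional.of(e.getKey());
            }
        }
        return Optional.empty();
    }
}
```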
Knowledge, Truth, and Lies
So far in this chapter we have explored the ways in which distributed systems are different from programs running on a single computer: there is no shared memory, only message passing via an unreliable network with variable delays, and the systems may suffer from partial failures, unreliable clocks, and processing pauses.
到目前为止,在这一章中,我们已经探讨了分布式系统与在单台计算机上运行的程序的不同之处:没有共享内存,只有通过不可靠的具有可变延迟的网络进行消息传递,而且系统可能会受到部分故障、不可靠的时钟和处理暂停的影响。
The consequences of these issues are profoundly disorienting if you’re not used to distributed systems. A node in the network cannot know anything for sure—it can only make guesses based on the messages it receives (or doesn’t receive) via the network. A node can only find out what state another node is in (what data it has stored, whether it is correctly functioning, etc.) by exchanging messages with it. If a remote node doesn’t respond, there is no way of knowing what state it is in, because problems in the network cannot reliably be distinguished from problems at a node.
如果你不习惯分布式系统,这些问题的后果会给你带来深深的迷茫。网络中的节点无法确定任何事情 - 它只能根据接收到的消息(或未收到的消息)来做猜测。节点只能通过与其他节点交换信息才能了解另一个节点的状态(它存储的数据,是否正常运行等)。如果远程节点没有响应,就无法知道它处于什么状态,因为网络中的问题不能可靠地区分是节点的问题还是网络的问题。
Discussions of these systems border on the philosophical: What do we know to be true or false in our system? How sure can we be of that knowledge, if the mechanisms for perception and measurement are unreliable? Should software systems obey the laws that we expect of the physical world, such as cause and effect?
这些系统的讨论涉及哲学问题:我们在系统中知道什么是真的或假的?如果感知和测量机制不可靠,我们对这种知识有多少把握?软件系统应该遵守我们期望物理世界的法则,比如因果关系吗?
Fortunately, we don’t need to go as far as figuring out the meaning of life. In a distributed system, we can state the assumptions we are making about the behavior (the system model ) and design the actual system in such a way that it meets those assumptions. Algorithms can be proved to function correctly within a certain system model. This means that reliable behavior is achievable, even if the underlying system model provides very few guarantees.
幸运的是,我们不需要深究生命的意义。在分布式系统中,我们可以陈述我们对行为的假设(系统模型),并设计实际系统以满足这些假设。算法可以被证明在某个系统模型内正确运行。这意味着可靠的行为是可以实现的,即使底层的系统模型提供的保证非常少。
However, although it is possible to make software well behaved in an unreliable system model, it is not straightforward to do so. In the rest of this chapter we will further explore the notions of knowledge and truth in distributed systems, which will help us think about the kinds of assumptions we can make and the guarantees we may want to provide. In Chapter 9 we will proceed to look at some examples of distributed systems, algorithms that provide particular guarantees under particular assumptions.
然而,尽管在不可靠的系统模型中使软件行为良好是可能的,但这并不是一件简单的事情。在本章的其余部分中,我们将进一步探讨分布式系统中知识和真相的概念,这将帮助我们考虑我们可以做出什么样的假设以及我们希望提供什么样的保证。在第9章中,我们将继续研究一些分布式系统的示例和算法,它们可以在特定假设下提供特定的保证。
The Truth Is Defined by the Majority
Imagine a network with an asymmetric fault: a node is able to receive all messages sent to it, but any outgoing messages from that node are dropped or delayed [ 19 ]. Even though that node is working perfectly well, and is receiving requests from other nodes, the other nodes cannot hear its responses. After some timeout, the other nodes declare it dead, because they haven’t heard from the node. The situation unfolds like a nightmare: the semi-disconnected node is dragged to the graveyard, kicking and screaming “I’m not dead!”—but since nobody can hear its screaming, the funeral procession continues with stoic determination.
想象一个出现不对称故障的网络:某个节点能够接收所有发送给它的消息,但该节点发出的任何消息都被丢弃或延迟[19]。即使该节点运行得非常好,并且正在接收来自其他节点的请求,其他节点也听不到它的响应。经过某个超时之后,其他节点因为一直没有收到该节点的消息而宣告它死亡。这种情况就像一场噩梦:这个半失联的节点被拖往墓地,一边挣扎一边尖叫“我还没死!”——但由于没有人听得到它的尖叫,送葬队伍继续坚定地前行。
In a slightly less nightmarish scenario, the semi-disconnected node may notice that the messages it is sending are not being acknowledged by other nodes, and so realize that there must be a fault in the network. Nevertheless, the node is wrongly declared dead by the other nodes, and the semi-disconnected node cannot do anything about it.
在一个稍微不那么可怕的情况下,半断开的节点可能会注意到它发送的消息未被其他节点确认,因此意识到网络中一定存在故障。然而,其他节点错误地宣布该节点死亡,而半断开的节点无法对此做出任何反应。
As a third scenario, imagine a node that experiences a long stop-the-world garbage collection pause. All of the node’s threads are preempted by the GC and paused for one minute, and consequently, no requests are processed and no responses are sent. The other nodes wait, retry, grow impatient, and eventually declare the node dead and load it onto the hearse. Finally, the GC finishes and the node’s threads continue as if nothing had happened. The other nodes are surprised as the supposedly dead node suddenly raises its head out of the coffin, in full health, and starts cheerfully chatting with bystanders. At first, the GCing node doesn’t even realize that an entire minute has passed and that it was declared dead—from its perspective, hardly any time has passed since it was last talking to the other nodes.
作为第三种情景,想象一个节点经历了长时间的停止世界垃圾收集暂停。该节点的所有线程都被GC抢占并暂停了一分钟,因此没有请求被处理,也没有响应被发送。其他节点等待、重试、逐渐失去耐心,最终宣告该节点死亡,并把它装上灵车。终于,GC结束了,节点的线程继续运行,就好像什么都没有发生过。其他节点大吃一惊,因为这个据称已死的节点突然从棺材里抬起头来,活蹦乱跳,开始愉快地和旁观者聊天。起初,这个正在GC的节点甚至没有意识到已经过去了整整一分钟、自己已被宣告死亡——从它的角度来看,自上次与其他节点交谈以来,几乎没有过去任何时间。
The moral of these stories is that a node cannot necessarily trust its own judgment of a situation. A distributed system cannot exclusively rely on a single node, because a node may fail at any time, potentially leaving the system stuck and unable to recover. Instead, many distributed algorithms rely on a quorum , that is, voting among the nodes (see “Quorums for reading and writing” ): decisions require some minimum number of votes from several nodes in order to reduce the dependence on any one particular node.
这些故事的寓意是,一个节点不能仅依靠自己对情况的判断。分布式系统不能只依赖于单个节点,因为节点可能随时出现故障,导致系统出现故障并且无法恢复。相反,许多分布式算法依赖于法定人数,即节点之间的投票(见“用于读写的法定人数”):决策需要来自多个节点的最小投票数,以减少对任何特定节点的依赖。
That includes decisions about declaring nodes dead. If a quorum of nodes declares another node dead, then it must be considered dead, even if that node still very much feels alive. The individual node must abide by the quorum decision and step down.
这也包括关于宣告节点死亡的决定。如果法定数量的节点宣告另一个节点死亡,那么即使那个节点自我感觉还活得好好的,它也必须被视为死亡。个体节点必须服从法定人数的决定并下台。
Most commonly, the quorum is an absolute majority of more than half the nodes (although other kinds of quorums are possible). A majority quorum allows the system to continue working if individual nodes have failed (with three nodes, one failure can be tolerated; with five nodes, two failures can be tolerated). However, it is still safe, because there can only be one majority in the system—there cannot be two majorities with conflicting decisions at the same time. We will discuss the use of quorums in more detail when we get to consensus algorithms in Chapter 9 .
通常情况下,法定人数是超过一半以上节点的绝对多数(虽然其他种类的法定人数也是可能的)。多数派法定人数可以使系统在个别节点失败时继续工作(对于三个节点,可以容忍一个故障;对于五个节点,可以容忍两个故障)。但是,它仍然是安全的,因为系统只能有一个多数派—同时不能有两个有冲突决定的多数派。当我们在第9章讨论共识算法时,我们将更详细地讨论法定人数的使用。
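The majority-quorum arithmetic above can be made concrete with a small sketch; the formulas follow directly from the definition of a majority:

```java
// With n nodes, a majority quorum needs floor(n/2) + 1 votes, so the system
// tolerates the failure of the remaining n - quorumSize(n) nodes:
// 3 nodes -> quorum 2, tolerates 1 failure; 5 nodes -> quorum 3, tolerates 2.
public class MajorityQuorum {
    static int quorumSize(int n) { return n / 2 + 1; }
    static int tolerableFailures(int n) { return n - quorumSize(n); }
}
```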
The leader and the lock
Frequently, a system requires there to be only one of some thing. For example:
经常,系统要求有一些东西只能有一个。例如:
-
Only one node is allowed to be the leader for a database partition, to avoid split brain (see “Handling Node Outages” ).
一个数据库分区只允许存在一个领导者节点,以避免出现分裂脑的情况(参见“处理节点故障”)。
-
Only one transaction or client is allowed to hold the lock for a particular resource or object, to prevent concurrently writing to it and corrupting it.
只允许一个事务或客户端持有锁定特定资源或对象的锁,以防止同时写入并损坏它。
-
Only one user is allowed to register a particular username, because a username must uniquely identify a user.
一个用户名只能注册给一位用户,因为用户名必须唯一标识一个用户。
Implementing this in a distributed system requires care: even if a node believes that it is “the chosen one” (the leader of the partition, the holder of the lock, the request handler of the user who successfully grabbed the username), that doesn’t necessarily mean a quorum of nodes agrees! A node may have formerly been the leader, but if the other nodes declared it dead in the meantime (e.g., due to a network interruption or GC pause), it may have been demoted and another leader may have already been elected.
在分布式系统中实现这一点需要小心:即使一个节点认为自己是“天选之人”(分区的领导者、锁的持有者、成功抢到用户名的那个请求的处理者),也不一定意味着有法定数量的节点同意!一个节点可能以前是领导者,但如果其他节点在此期间宣告它死亡(例如,由于网络中断或GC暂停),它可能已经被降级,而另一个领导者可能已经当选。
If a node continues acting as the chosen one, even though the majority of nodes have declared it dead, it could cause problems in a system that is not carefully designed. Such a node could send messages to other nodes in its self-appointed capacity, and if other nodes believe it, the system as a whole may do something incorrect.
如果一个节点在大多数节点已经宣告它死亡之后仍继续充当天选之人,就可能在设计不够谨慎的系统中导致问题。这样的节点可能以其自封的身份向其他节点发送消息,如果其他节点相信了它,整个系统可能会做出错误的事情。
For example, Figure 8-4 shows a data corruption bug due to an incorrect implementation of locking. (The bug is not theoretical: HBase used to have this problem [ 74 , 75 ].) Say you want to ensure that a file in a storage service can only be accessed by one client at a time, because if multiple clients tried to write to it, the file would become corrupted. You try to implement this by requiring a client to obtain a lease from a lock service before accessing the file.
例如,图8-4展示了由于锁定实现不正确而导致的数据损坏错误。(这个错误不是理论上的:HBase曾经有过这个问题[74, 75]。)假设你想确保存储服务中的文件一次只能被一个客户端访问,因为如果多个客户端尝试写入它,文件会变得损坏。你尝试通过要求客户端在访问文件之前从锁定服务中获取租约来实现这一点。
The problem is an example of what we discussed in “Process Pauses” : if the client holding the lease is paused for too long, its lease expires. Another client can obtain a lease for the same file, and start writing to the file. When the paused client comes back, it believes (incorrectly) that it still has a valid lease and proceeds to also write to the file. As a result, the clients’ writes clash and corrupt the file.
这个问题是我们在“进程暂停”中讨论过的情况的一个例子:如果持有租约的客户端暂停太久,它的租约就会过期。另一个客户端可以获得同一文件的租约,并开始写入文件。当被暂停的客户端回来时,它(错误地)认为自己仍然持有有效的租约,于是也开始写入文件。结果,两个客户端的写入发生冲突,破坏了文件。
Fencing tokens
When using a lock or lease to protect access to some resource, such as the file storage in Figure 8-4 , we need to ensure that a node that is under a false belief of being “the chosen one” cannot disrupt the rest of the system. A fairly simple technique that achieves this goal is called fencing , and is illustrated in Figure 8-5 .
当使用锁或租约来保护对某些资源(例如图8-4中的文件存储)的访问时,我们需要确保一个误以为自己是“天选之人”的节点不能干扰系统的其余部分。实现这一目标的一种相当简单的技术称为栅栏(fencing),如图8-5所示。
Let’s assume that every time the lock server grants a lock or lease, it also returns a fencing token , which is a number that increases every time a lock is granted (e.g., incremented by the lock service). We can then require that every time a client sends a write request to the storage service, it must include its current fencing token.
我们假设,锁服务每次授予锁或租约时,还会返回一个栅栏令牌(fencing token),这是一个每次授予锁时都会增加的数字(例如,由锁服务递增)。然后我们可以要求,客户端每次向存储服务发送写请求时,都必须包含其当前的栅栏令牌。
In Figure 8-5 , client 1 acquires the lease with a token of 33, but then it goes into a long pause and the lease expires. Client 2 acquires the lease with a token of 34 (the number always increases) and then sends its write request to the storage service, including the token of 34. Later, client 1 comes back to life and sends its write to the storage service, including its token value 33. However, the storage server remembers that it has already processed a write with a higher token number (34), and so it rejects the request with token 33.
在图8-5中,客户端1获得了租约,令牌为33,但随后它进入了长时间的暂停,租约到期了。客户端2获得了租约,令牌为34(这个数字总是增加的),然后将带有令牌34的写请求发送给存储服务。随后,客户端1恢复过来,将带有令牌33的写请求发送给存储服务。然而,存储服务器记得它已经处理过一个令牌号更高(34)的写入,因此拒绝了这个带有令牌33的请求。
If ZooKeeper is used as lock service, the transaction ID zxid or the node version cversion can be used as fencing token. Since they are guaranteed to be monotonically increasing, they have the required properties [ 74 ].
如果使用ZooKeeper作为锁服务,事务ID zxid 或节点版本 cversion 可以用作栅栏令牌。由于它们保证单调递增,因此具有所需的属性[74]。
Note that this mechanism requires the resource itself to take an active role in checking tokens by rejecting any writes with an older token than one that has already been processed—it is not sufficient to rely on clients checking their lock status themselves. For resources that do not explicitly support fencing tokens, you might still be able to work around the limitation (for example, in the case of a file storage service you could include the fencing token in the filename). However, some kind of check is necessary to avoid processing requests outside of the lock’s protection.
请注意,这种机制要求资源本身在检查令牌方面扮演主动的角色,拒绝令牌比已处理过的令牌更旧的任何写入——仅靠客户端自行检查自己的锁状态是不够的。对于不明确支持栅栏令牌的资源,你或许仍然可以绕过这个限制(例如,对于文件存储服务,可以把栅栏令牌包含在文件名中)。但无论如何,某种形式的检查是必要的,以避免在锁的保护之外处理请求。
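A minimal sketch of such a server-side check, assuming a hypothetical storage API: the service tracks the highest fencing token it has processed and rejects anything older, reproducing the Figure 8-5 scenario.

```java
// Sketch of server-side fencing, with an invented storage API: the service
// remembers the highest fencing token it has processed and rejects any write
// carrying an older one, as in Figure 8-5.
public class FencedStorage {
    private long highestTokenSeen = -1;

    // Returns true if the write is accepted, false if fenced off.
    synchronized boolean write(long fencingToken, String data) {
        if (fencingToken < highestTokenSeen) {
            return false; // stale client: its lease must have expired
        }
        highestTokenSeen = fencingToken;
        // ... apply the write to durable storage here ...
        return true;
    }
}
```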
Checking a token on the server side may seem like a downside, but it is arguably a good thing: it is unwise for a service to assume that its clients will always be well behaved, because the clients are often run by people whose priorities are very different from the priorities of the people running the service [ 76 ]. Thus, it is a good idea for any service to protect itself from accidentally abusive clients.
在服务器端检查令牌看起来像是一个缺点,但这可以说是一件好事:服务假设其客户端总是表现良好是不明智的,因为运行客户端的人与运行服务的人的优先级往往大不相同[76]。因此,任何服务保护自己免受意外滥用的客户端的影响都是一个好主意。
Byzantine Faults
Fencing tokens can detect and block a node that is inadvertently acting in error (e.g., because it hasn’t yet found out that its lease has expired). However, if the node deliberately wanted to subvert the system’s guarantees, it could easily do so by sending messages with a fake fencing token.
栅栏令牌可以检测并阻止一个无意中出错的节点(例如,因为它还没有发现自己的租约已经过期)。然而,如果节点有意想要破坏系统的保证,它可以很容易地通过发送带有伪造栅栏令牌的消息来做到这一点。
In this book we assume that nodes are unreliable but honest: they may be slow or never respond (due to a fault), and their state may be outdated (due to a GC pause or network delays), but we assume that if a node does respond, it is telling the “truth”: to the best of its knowledge, it is playing by the rules of the protocol.
在这本书中,我们假设节点是不可靠但诚实的:它们可能很慢或从不响应(由于故障),它们的状态可能过时(由于GC暂停或网络延迟),但我们假设如果节点确实响应,它正在“真实”地讲述事实:据它所知,它正在按照协议的规则进行操作。
Distributed systems problems become much harder if there is a risk that nodes may “lie” (send arbitrary faulty or corrupted responses)—for example, if a node may claim to have received a particular message when in fact it didn’t. Such behavior is known as a Byzantine fault , and the problem of reaching consensus in this untrusting environment is known as the Byzantine Generals Problem [ 77 ].
如果存在节点可能“说谎”(发送任意错误或损坏的响应)的风险,分布式系统问题会变得更加困难 - 例如,如果一个节点可能声称已收到某个消息,而实际上没有。这种行为被称为拜占庭故障,而在这种不信任环境中达成共识的问题被称为拜占庭将军问题[77]。
A system is Byzantine fault-tolerant if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network. This concern is relevant in certain specific circumstances. For example:
如果一个系统在部分节点发生故障、不遵守协议,甚至恶意攻击者干扰网络的情况下仍能继续正确运行,我们就称它是拜占庭容错的。这种担忧在某些特定情况下是有意义的。例如:
-
In aerospace environments, the data in a computer’s memory or CPU register could become corrupted by radiation, leading it to respond to other nodes in arbitrarily unpredictable ways. Since a system failure would be very expensive (e.g., an aircraft crashing and killing everyone on board, or a rocket colliding with the International Space Station), flight control systems must tolerate Byzantine faults [ 81 , 82 ].
在航空航天环境中,计算机的内存或CPU寄存器中的数据可能会因为辐射而损坏,导致其对其他节点以任意不可预测的方式做出响应。由于系统故障会非常昂贵(例如,飞机坠毁并导致所有人死亡,或火箭与国际空间站相撞),所以飞行控制系统必须容忍拜占庭故障[81, 82]。
-
In a system with multiple participating organizations, some participants may attempt to cheat or defraud others. In such circumstances, it is not safe for a node to simply trust another node’s messages, since they may be sent with malicious intent. For example, peer-to-peer networks like Bitcoin and other blockchains can be considered to be a way of getting mutually untrusting parties to agree whether a transaction happened or not, without relying on a central authority [ 83 ].
在有多个参与组织的系统中,某些参与方可能会试图欺骗或欺诈其他人。在这种情况下,节点不能简单地相信另一个节点的消息,因为这些消息可能带着恶意被发送。例如,比特币等区块链这样的对等网络,可以被视为一种让互不信任的各方在不依赖中央机构的情况下,就某个交易是否发生达成一致的方式[83]。
However, in the kinds of systems we discuss in this book, we can usually safely assume that there are no Byzantine faults. In your datacenter, all the nodes are controlled by your organization (so they can hopefully be trusted) and radiation levels are low enough that memory corruption is not a major problem. Protocols for making systems Byzantine fault-tolerant are quite complicated [ 84 ], and fault-tolerant embedded systems rely on support from the hardware level [ 81 ]. In most server-side data systems, the cost of deploying Byzantine fault-tolerant solutions makes them impractical.
然而,在本书讨论的系统中,我们通常可以安全地假设没有拜占庭故障。在您的数据中心中,所有节点都受到您组织的控制(因此希望它们是可信的),并且辐射水平低到足以使内存损坏不成为主要问题。使系统具有拜占庭容错性的协议非常复杂[84],并且容错嵌入式系统依赖于硬件级别的支持[81]。在大多数服务器端数据系统中,部署拜占庭容错解决方案的成本使它们变得不切实际。
Web applications do need to expect arbitrary and malicious behavior of clients that are under end-user control, such as web browsers. This is why input validation, sanitization, and output escaping are so important: to prevent SQL injection and cross-site scripting, for example. However, we typically don’t use Byzantine fault-tolerant protocols here, but simply make the server the authority on deciding what client behavior is and isn’t allowed. In peer-to-peer networks, where there is no such central authority, Byzantine fault tolerance is more relevant.
网络应用需要预料到受最终用户控制的客户端(如Web浏览器)的任意和恶意行为。这就是为什么输入验证、净化和输出转义非常重要的原因:例如,防止SQL注入和跨站点脚本。然而,在这里我们通常不使用拜占庭容错协议,只需将服务器作为决定允许哪些客户端行为的权威。在点对点网络中,没有这样的中央权威,因此拜占庭容错更加相关。
A bug in the software could be regarded as a Byzantine fault, but if you deploy the same software to all nodes, then a Byzantine fault-tolerant algorithm cannot save you. Most Byzantine fault-tolerant algorithms require a supermajority of more than two-thirds of the nodes to be functioning correctly (i.e., if you have four nodes, at most one may malfunction). To use this approach against bugs, you would have to have four independent implementations of the same software and hope that a bug only appears in one of the four implementations.
软件中的漏洞可能被视为拜占庭故障,但如果你将同一软件部署到所有节点上,那么拜占庭容错算法将无法拯救你。大多数拜占庭容错算法需要超过三分之二的节点正常运行(例如,如果你有四个节点,则最多只能有一个节点出现故障)。如果想要使用这种方法来解决漏洞,你需要有四个独立的软件实现,并希望漏洞只出现在其中一个实现中。
Similarly, it would be appealing if a protocol could protect us from vulnerabilities, security compromises, and malicious attacks. Unfortunately, this is not realistic either: in most systems, if an attacker can compromise one node, they can probably compromise all of them, because they are probably running the same software. Thus, traditional mechanisms (authentication, access control, encryption, firewalls, and so on) continue to be the main protection against attackers.
同样,如果协议可以保护我们免受漏洞、安全妥协和恶意攻击的侵害,那将会很吸引人。不幸的是,这也是不现实的:在大多数系统中,如果攻击者可以入侵一个节点,他们可能可以入侵所有节点,因为它们可能运行相同的软件。因此,传统机制(身份验证、访问控制、加密、防火墙等)继续成为对抗攻击者的主要保护。
Weak forms of lying
Although we assume that nodes are generally honest, it can be worth adding mechanisms to software that guard against weak forms of “lying”—for example, invalid messages due to hardware issues, software bugs, and misconfiguration. Such protection mechanisms are not full-blown Byzantine fault tolerance, as they would not withstand a determined adversary, but they are nevertheless simple and pragmatic steps toward better reliability. For example:
尽管我们假设节点一般是诚实的,但添加机制以防止“说谎”的弱形式可能是值得的,例如由于硬件问题、软件漏洞和错误配置导致的无效消息。这些保护机制不是完整的拜占庭容错,因为它们不能抵御有决心的对手,但它们仍然是朝着更好的可靠性迈进的简单而实用的步骤。例如:
-
Network packets do sometimes get corrupted due to hardware issues or bugs in operating systems, drivers, routers, etc. Usually, corrupted packets are caught by the checksums built into TCP and UDP, but sometimes they evade detection [ 85 , 86 , 87 ]. Simple measures are usually sufficient protection against such corruption, such as checksums in the application-level protocol.
网络数据包有时确实会因为硬件问题,或操作系统、驱动程序、路由器等中的错误而损坏。通常,损坏的数据包会被TCP和UDP内置的校验和捕获,但有时也会逃过检测[85, 86, 87]。一些简单的措施,例如在应用层协议中使用校验和,通常就足以防范这种损坏。
-
A publicly accessible application must carefully sanitize any inputs from users, for example checking that a value is within a reasonable range and limiting the size of strings to prevent denial of service through large memory allocations. An internal service behind a firewall may be able to get away with less strict checks on inputs, but some basic sanity-checking of values (e.g., in protocol parsing [ 85 ]) is a good idea.
公共可访问的应用程序必须仔细清洗来自用户的任何输入,例如检查一个值是否在合理范围内并限制字符串的大小,以防止通过大内存分配导致拒绝服务攻击。位于防火墙后面的内部服务可能可以对输入进行较少的严格检查,但对值进行一些基本的合理性检查(例如,在协议解析中[85])是一个好主意。
-
NTP clients can be configured with multiple server addresses. When synchronizing, the client contacts all of them, estimates their errors, and checks that a majority of servers agree on some time range. As long as most of the servers are okay, a misconfigured NTP server that is reporting an incorrect time is detected as an outlier and is excluded from synchronization [ 37 ]. The use of multiple servers makes NTP more robust than if it only uses a single server.
NTP客户端可以配置多个服务器地址。同步时,客户端会联系所有这些服务器,估计它们的误差,并检查是否有大多数服务器对某个时间范围达成一致。只要大多数服务器没有问题,一个报告错误时间的配置不当的NTP服务器就会被检测为离群值,并被排除在同步之外[37]。使用多个服务器使NTP比只使用单个服务器更健壮。
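The first point above — an application-level checksum — might be sketched like this, using the standard library's CRC32 class; the message framing around it is hypothetical:

```java
import java.util.zip.CRC32;

// Sketch: an application-level checksum attached to each message to catch
// corruption that slips past the TCP/UDP checksums. CRC32 is from the Java
// standard library; the framing functions here are invented for illustration.
public class ChecksummedMessage {
    static long checksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return crc.getValue();
    }

    // The receiver recomputes the checksum and compares it to the one sent.
    static boolean verify(byte[] payload, long expected) {
        return checksum(payload) == expected;
    }
}
```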
System Model and Reality
Many algorithms have been designed to solve distributed systems problems—for example, we will examine solutions for the consensus problem in Chapter 9 . In order to be useful, these algorithms need to tolerate the various faults of distributed systems that we discussed in this chapter.
很多算法都被设计用来解决分布式系统问题——例如,在第9章中,我们将讨论共识问题的解决方案。为了有用,这些算法需要容忍我们在本章中讨论过的分布式系统的各种故障。
Algorithms need to be written in a way that does not depend too heavily on the details of the hardware and software configuration on which they are run. This in turn requires that we somehow formalize the kinds of faults that we expect to happen in a system. We do this by defining a system model , which is an abstraction that describes what things an algorithm may assume.
算法的编写方式不应过于依赖其运行所在的硬件和软件配置的细节。这反过来要求我们以某种方式将预期在系统中发生的故障种类形式化。我们通过定义系统模型来做到这一点,系统模型是一种抽象,描述了算法可以做出哪些假设。
With regard to timing assumptions, three system models are in common use:
关于时间假设,常用的有三种系统模型:
- Synchronous model
-
The synchronous model assumes bounded network delay, bounded process pauses, and bounded clock error. This does not imply exactly synchronized clocks or zero network delay; it just means you know that network delay, pauses, and clock drift will never exceed some fixed upper bound [ 88 ]. The synchronous model is not a realistic model of most practical systems, because (as discussed in this chapter) unbounded delays and pauses do occur.
同步模型假设网络延迟、进程暂停和时钟误差都是有界的。这并不意味着时钟完全同步或网络延迟为零;它只是意味着你知道网络延迟、暂停和时钟漂移永远不会超过某个固定的上限[88]。同步模型并不是大多数实际系统的现实模型,因为(正如本章所讨论的)无界的延迟和暂停确实会发生。
- Partially synchronous model
-
Partial synchrony means that a system behaves like a synchronous system most of the time , but it sometimes exceeds the bounds for network delay, process pauses, and clock drift [ 88 ]. This is a realistic model of many systems: most of the time, networks and processes are quite well behaved—otherwise we would never be able to get anything done—but we have to reckon with the fact that any timing assumptions may be shattered occasionally. When this happens, network delay, pauses, and clock error may become arbitrarily large.
部分同步意味着系统在大多数时间表现得像同步系统,但有时会超出网络延迟、进程暂停和时钟漂移的界限[88]。这是许多系统的现实模型:大多数时候,网络和进程表现良好——否则我们将永远无法完成任何事情——但我们必须考虑到,任何时序假设都可能偶尔被打破。当这种情况发生时,网络延迟、暂停和时钟误差可能变得任意大。
- Asynchronous model
-
In this model, an algorithm is not allowed to make any timing assumptions—in fact, it does not even have a clock (so it cannot use timeouts). Some algorithms can be designed for the asynchronous model, but it is very restrictive.
在这个模型中,算法不允许做出任何时序假设——事实上,它甚至没有时钟(因此无法使用超时)。有些算法可以在异步模型中设计,但这个模型的限制性很强。
Moreover, besides timing issues, we have to consider node failures. The three most common system models for nodes are:
此外,除了时间问题外,我们还必须考虑节点故障。节点的三种最常见的系统模型是:
- Crash-stop faults
-
In the crash-stop model, an algorithm may assume that a node can fail in only one way, namely by crashing. This means that the node may suddenly stop responding at any moment, and thereafter that node is gone forever—it never comes back.
在崩溃停止模型中,算法可以假设节点只会以一种方式失效,即崩溃。这意味着节点可能在任何时刻突然停止响应,并且此后该节点永远消失——它再也不会回来。
- Crash-recovery faults
-
We assume that nodes may crash at any moment, and perhaps start responding again after some unknown time. In the crash-recovery model, nodes are assumed to have stable storage (i.e., nonvolatile disk storage) that is preserved across crashes, while the in-memory state is assumed to be lost.
我们假设节点随时可能崩溃,并可能在一段未知的时间后重新开始响应。在崩溃恢复模型中,假定节点具有稳定存储(即非易失性磁盘存储),它在崩溃后得以保留,而内存中的状态则被认为会丢失。
- Byzantine (arbitrary) faults
-
Nodes may do absolutely anything, including trying to trick and deceive other nodes, as described in the last section.
节点可以做任何事情,包括试图欺骗和蒙骗其他节点,正如上一节所述。
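The crash-recovery model above can be sketched in a few lines. This toy model (a dict standing in for disk, names chosen for illustration) captures the key assumption: stable storage survives a crash, in-memory state does not:

```python
class Node:
    """A node under the crash-recovery fault model: a crash wipes the
    in-memory state, but whatever was written to stable storage
    (modeled here as a dict standing in for disk) survives and is
    visible again after recovery."""
    def __init__(self):
        self.stable = {}   # nonvolatile: survives crashes
        self.memory = {}   # volatile: lost on crash

    def write(self, key, value):
        self.memory[key] = value
        self.stable[key] = value   # persisted before acknowledging

    def crash_and_recover(self):
        # The crash discards RAM; recovery rebuilds it from disk.
        self.memory = dict(self.stable)

node = Node()
node.write("token", 42)
node.memory["scratch"] = "uncommitted"  # never written to stable storage
node.crash_and_recover()                # "scratch" is gone, "token" survives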
For modeling real systems, the partially synchronous model with crash-recovery faults is generally the most useful model. But how do distributed algorithms cope with that model?
对于建模实际系统来说,具有崩溃恢复错误的部分同步模型通常是最有用的模型。但是分布式算法如何应对这个模型呢?
Correctness of an algorithm
To define what it means for an algorithm to be correct , we can describe its properties . For example, the output of a sorting algorithm has the property that for any two distinct elements of the output list, the element further to the left is smaller than the element further to the right. That is simply a formal way of defining what it means for a list to be sorted.
为了定义算法的正确性,我们可以描述它的属性。例如,排序算法的输出具有这样的属性:对于输出列表中的任意两个不同元素,靠左的元素比靠右的元素小。这只是对列表已排序这一含义的形式化定义。
Similarly, we can write down the properties we want of a distributed algorithm to define what it means to be correct. For example, if we are generating fencing tokens for a lock (see “Fencing tokens” ), we may require the algorithm to have the following properties:
同样地,我们可以列出分布式算法需要具备的特性来定义正确性。例如,如果我们正在为一个锁生成围栏令牌(参见“围栏令牌”),我们可能需要算法具备以下特性:
- Uniqueness
-
No two requests for a fencing token return the same value.
没有两个请求返回相同的围栏令牌值。
- Monotonic sequence
-
If request x returned token t x , and request y returned token t y , and x completed before y began, then t x < t y .
如果请求x返回令牌tx,请求y返回令牌ty,并且x在y开始之前完成,则tx < ty。
- Availability
-
A node that requests a fencing token and does not crash eventually receives a response.
请求围栏令牌且没有崩溃的节点,最终会收到响应。
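The first two properties can be checked mechanically on any finite execution trace. Here is a small sketch of such a checker (the event-tuple format is an assumption for illustration); availability is deliberately omitted, since no finite trace can refute an "eventually" claim:

```python
def check_fencing_trace(events):
    """Check the uniqueness and monotonic-sequence properties on a
    finite trace of fencing-token requests. Each event is a tuple
    (request_id, start, end, token), with start/end as logical times."""
    tokens = [token for _, _, _, token in events]
    unique = len(tokens) == len(set(tokens))
    monotonic = all(
        tx < ty
        for (_, _, end_x, tx) in events
        for (_, start_y, _, ty) in events
        if end_x < start_y          # request x completed before y began
    )
    return unique, monotonic

# x finished at t=2 before y started at t=3, so y's token must be larger:
good = check_fencing_trace([("x", 0, 2, 33), ("y", 3, 5, 34)])
bad = check_fencing_trace([("x", 0, 2, 33), ("y", 3, 5, 33)])
```

Trace checkers of this kind are the basis of tools like Jepsen, which test real systems against such property definitions.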
An algorithm is correct in some system model if it always satisfies its properties in all situations that we assume may occur in that system model. But how does this make sense? If all nodes crash, or all network delays suddenly become infinitely long, then no algorithm will be able to get anything done.
如果一个算法在我们假定某个系统模型中可能出现的所有情况下都始终满足其属性,那么它在这个系统模型中就是正确的。但这如何说得通呢?如果所有节点都崩溃了,或者所有网络延迟突然变得无限长,那么任何算法都无法完成任何事情。
Safety and liveness
To clarify the situation, it is worth distinguishing between two different kinds of properties: safety and liveness properties. In the example just given, uniqueness and monotonic sequence are safety properties, but availability is a liveness property.
为了弄清楚情况,值得区分两种不同的属性:安全属性和活性属性。在刚才给出的例子中,唯一性和单调序列是安全属性,而可用性是活性属性。
What distinguishes the two kinds of properties? A giveaway is that liveness properties often include the word “eventually” in their definition. (And yes, you guessed it— eventual consistency is a liveness property [ 89 ].)
两种属性的区别在哪里呢?一个很明显的区分是活性属性的定义通常包含“最终”一词。(是的,你猜对了——最终一致性是一种活性属性 [89]。)
Safety is often informally defined as nothing bad happens , and liveness as something good eventually happens . However, it’s best to not read too much into those informal definitions, because the meaning of good and bad is subjective. The actual definitions of safety and liveness are precise and mathematical [ 90 ]:
安全性通常被非正式地定义为没有坏事发生,而活性则是好事最终会发生。但是,最好不要过多解读这些非正式定义,因为好与坏的含义是主观的。安全性和活性的实际定义是精确且数学化的[90]:
-
If a safety property is violated, we can point at a particular point in time at which it was broken (for example, if the uniqueness property was violated, we can identify the particular operation in which a duplicate fencing token was returned). After a safety property has been violated, the violation cannot be undone—the damage is already done.
如果安全属性被违反,我们可以指出违反发生的特定时间点(例如,如果唯一性属性被违反,则我们可以确定返回了重复围栏标记的特定操作)。安全属性一旦被违反,就无法撤消违规行为——损害已经发生。
-
A liveness property works the other way round: it may not hold at some point in time (for example, a node may have sent a request but not yet received a response), but there is always hope that it may be satisfied in the future (namely by receiving a response).
活性属性则反过来:它在某个时间点可能不成立(例如,一个节点可能已经发送了请求但尚未收到响应),但总是存在它在未来得到满足的希望(即通过接收到响应)。
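This asymmetry can be made concrete in code. A safety violation has an identifiable point in time; a liveness property can at best be reported as "satisfied so far" on a finite trace (the function names and trace format here are illustrative assumptions):

```python
def first_safety_violation(trace):
    """A safety property ("no token value is ever returned twice") is
    broken at an identifiable point in time: return the index of the
    first event that violates it, or None. Once broken, no later
    events can undo the violation."""
    seen = set()
    for i, token in enumerate(trace):
        if token in seen:
            return i  # the damage is done here
        seen.add(token)
    return None

def liveness_satisfied_so_far(requests, responses):
    """A liveness property ("every request eventually gets a response")
    can never be definitively violated by a finite trace: an
    outstanding request might still be answered later. We can only say
    whether it is satisfied so far."""
    return requests <= responses  # has every request been answered yet?

# The duplicate token at index 3 is a safety violation we can point at:
violation_at = first_safety_violation([33, 34, 35, 34])
# An unanswered request is not (yet) a liveness violation:
pending = liveness_satisfied_so_far({"x", "y"}, {"x"})
```

Formally, safety properties are exactly those whose violations are witnessed by a finite prefix of an execution, which is why the first checker can return a definite index and the second cannot.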
An advantage of distinguishing between safety and liveness properties is that it helps us deal with difficult system models. For distributed algorithms, it is common to require that safety properties always hold, in all possible situations of a system model [ 88 ]. That is, even if all nodes crash, or the entire network fails, the algorithm must nevertheless ensure that it does not return a wrong result (i.e., that the safety properties remain satisfied).
区分安全性和活性属性的优点是帮助我们处理复杂的系统模型。对于分布式算法,通常要求安全属性在系统模型的所有可能情况下始终保持[88]。也就是说,即使所有节点崩溃或整个网络失败,该算法仍必须确保不返回错误结果(即,安全属性仍保持满足)。
However, with liveness properties we are allowed to make caveats: for example, we could say that a request needs to receive a response only if a majority of nodes have not crashed, and only if the network eventually recovers from an outage. The definition of the partially synchronous model requires that eventually the system returns to a synchronous state—that is, any period of network interruption lasts only for a finite duration and is then repaired.
然而,对于活性属性,我们可以加上附加条件:例如,我们可以说,只有在大多数节点没有崩溃、并且网络最终从中断中恢复的情况下,请求才需要收到响应。部分同步模型的定义要求系统最终返回到同步状态——也就是说,任何网络中断都只会持续有限的时间,然后得到修复。
Mapping system models to the real world
Safety and liveness properties and system models are very useful for reasoning about the correctness of a distributed algorithm. However, when implementing an algorithm in practice, the messy facts of reality come back to bite you again, and it becomes clear that the system model is a simplified abstraction of reality.
安全性和活性性质以及系统模型对于推理分布式算法的正确性非常有用。然而,在实践中实现算法时,现实的复杂事实再次出现,系统模型就成为现实的简化抽象。
For example, algorithms in the crash-recovery model generally assume that data in stable storage survives crashes. However, what happens if the data on disk is corrupted, or the data is wiped out due to hardware error or misconfiguration [ 91 ]? What happens if a server has a firmware bug and fails to recognize its hard drives on reboot, even though the drives are correctly attached to the server [ 92 ]?
例如,崩溃恢复模型中的算法通常假定稳定存储中的数据能在崩溃后幸存。但是,如果磁盘上的数据损坏了,或者由于硬件错误或配置错误数据被抹掉了,会怎样呢[91]?如果服务器存在固件缺陷,在重启时无法识别硬盘——即使硬盘已正确连接到服务器,又会怎样呢[92]?
Quorum algorithms (see “Quorums for reading and writing” ) rely on a node remembering the data that it claims to have stored. If a node may suffer from amnesia and forget previously stored data, that breaks the quorum condition, and thus breaks the correctness of the algorithm. Perhaps a new system model is needed, in which we assume that stable storage mostly survives crashes, but may sometimes be lost. But that model then becomes harder to reason about.
Quorum算法(参见“读写的法定人数”)依赖于节点记住它声称已存储的数据。如果节点可能失忆并忘记以前存储的数据,那就破坏了法定人数(quorum)条件,从而破坏了算法的正确性。也许需要一个新的系统模型,在该模型中我们假设稳定存储大多能在崩溃后幸存,但有时也可能丢失。但这样的模型就更难推理了。
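A small sketch makes the failure concrete. With n = 3 replicas, w = 2, r = 2, every read quorum must intersect every write quorum by counting alone; but if one acknowledging node forgets its data, a quorum read can miss the write even though w + r > n still holds (the helper below is illustrative, not a real quorum protocol):

```python
from itertools import combinations

def all_read_quorums_see_write(replicas, r, version):
    """Check that every possible read quorum of size r contains at
    least one replica that holds `version`. With w + r > n this holds
    by a counting argument—unless a node "forgets" data it previously
    acknowledged (amnesia), which silently breaks the assumption."""
    return all(
        any(replicas[node] == version for node in quorum)
        for quorum in combinations(replicas, r)
    )

n, w, r = 3, 2, 2                        # w + r > n: quorums must overlap
replicas = {"a": 1, "b": 1, "c": None}   # write of version 1 reached a and b
ok = all_read_quorums_see_write(replicas, r, 1)

replicas["b"] = None                     # b loses its stable storage on crash
broken = all_read_quorums_see_write(replicas, r, 1)  # quorum {b, c} misses it
```

The arithmetic of the quorum condition is intact in both cases; what breaks is the hidden premise that an acknowledgment implies durable memory.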
The theoretical description of an algorithm can declare that certain things are simply assumed not to happen—and in non-Byzantine systems, we do have to make some assumptions about faults that can and cannot happen. However, a real implementation may still have to include code to handle the case where something happens that was assumed to be impossible, even if that handling boils down to printf("Sucks to be you") and exit(666)—i.e., letting a human operator clean up the mess [93]. (This is arguably the difference between computer science and software engineering.)
算法的理论描述可以声明某些事情是根本不可能发生的——在非拜占庭系统中,我们确实需要做出一些关于可能发生故障的假设和不可能发生故障的假设。然而,一个真实的实现仍可能需要包含处理假定不可能发生的情况的代码,即使该处理归结为printf("Sucks to be you")和exit(666)——即让一个人工操作员清理混乱[93]。(这可以说是计算机科学和软件工程之间的区别。)
That is not to say that theoretical, abstract system models are worthless—quite the opposite. They are incredibly helpful for distilling down the complexity of real systems to a manageable set of faults that we can reason about, so that we can understand the problem and try to solve it systematically. We can prove algorithms correct by showing that their properties always hold in some system model.
这并不是说理论上的、抽象的系统模型毫无价值——恰恰相反。它们非常有助于把真实系统的复杂性提炼为一组我们可以推理的、可管理的故障,从而让我们理解问题并尝试系统地解决它。我们可以通过证明算法的属性在某个系统模型中始终成立,来证明算法是正确的。
Proving an algorithm correct does not mean its implementation on a real system will necessarily always behave correctly. But it’s a very good first step, because the theoretical analysis can uncover problems in an algorithm that might remain hidden for a long time in a real system, and that only come to bite you when your assumptions (e.g., about timing) are defeated due to unusual circumstances. Theoretical analysis and empirical testing are equally important.
证明算法正确并不意味着它在真实系统上的实现就必然总是行为正确。但这是非常好的第一步,因为理论分析可以发现算法中的问题——这些问题可能在真实系统中长期隐藏,直到你的假设(例如关于时序的假设)由于异常情况而失效时才会暴露出来。理论分析与经验测试同等重要。
Summary
In this chapter we have discussed a wide range of problems that can occur in distributed systems, including:
在本章中,我们讨论了分布式系统中可能出现的各种问题,包括:
-
Whenever you try to send a packet over the network, it may be lost or arbitrarily delayed. Likewise, the reply may be lost or delayed, so if you don’t get a reply, you have no idea whether the message got through.
每当您尝试通过网络发送数据包时,它都有可能会丢失或者任意延迟。同样,回复也有可能会丢失或者延迟,所以如果没有收到回复,您就不知道消息是否已经传递。
-
A node’s clock may be significantly out of sync with other nodes (despite your best efforts to set up NTP), it may suddenly jump forward or back in time, and relying on it is dangerous because you most likely don’t have a good measure of your clock’s error interval.
一个节点的时钟可能会与其他节点明显不同步(即使您尽力设置网络时钟协议(NTP)),它可能会突然向前或向后跳跃时间,依赖它是危险的,因为您很可能没有好的测量您时钟误差间隔的方法。
-
A process may pause for a substantial amount of time at any point in its execution (perhaps due to a stop-the-world garbage collector), be declared dead by other nodes, and then come back to life again without realizing that it was paused.
一个进程可能在执行的任何时刻暂停相当长的时间(可能是因为全停垃圾收集器),被其他节点声明为死亡,然后再次复活,而不知道它曾经暂停过。
The fact that such partial failures can occur is the defining characteristic of distributed systems. Whenever software tries to do anything involving other nodes, there is the possibility that it may occasionally fail, or randomly go slow, or not respond at all (and eventually time out). In distributed systems, we try to build tolerance of partial failures into software, so that the system as a whole may continue functioning even when some of its constituent parts are broken.
这种部分失效可能发生,正是分布式系统的决定性特征。每当软件尝试做任何涉及其他节点的事情时,都有可能偶尔失败、随机变慢,或者根本不响应(并最终超时)。在分布式系统中,我们试图在软件中建立对部分失效的容忍,使得整个系统在其某些组成部分出现故障时仍能继续运转。
To tolerate faults, the first step is to detect them, but even that is hard. Most systems don’t have an accurate mechanism of detecting whether a node has failed, so most distributed algorithms rely on timeouts to determine whether a remote node is still available. However, timeouts can’t distinguish between network and node failures, and variable network delay sometimes causes a node to be falsely suspected of crashing. Moreover, sometimes a node can be in a degraded state: for example, a Gigabit network interface could suddenly drop to 1 Kb/s throughput due to a driver bug [ 94 ]. Such a node that is “limping” but not dead can be even more difficult to deal with than a cleanly failed node.
容忍故障的第一步是检测它们,但即使这也很难。大多数系统没有准确的机制来检测节点是否失败,因此大多数分布式算法依赖于超时来确定远程节点是否仍然可用。然而,超时无法区分网络和节点故障,并且可变的网络延迟有时会导致节点被错误地怀疑崩溃。此外,有时节点可能处于退化状态:例如,由于驱动程序错误,千兆网络接口可能突然降至1Kb/s的吞吐量[94]。这样一个“跛足”但并未死亡的节点比一台干净失败的节点更难处理。
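A minimal sketch of such a timeout-based detector follows (class and method names are illustrative; real systems often use adaptive detectors such as the ϕ accrual detector [30] instead of a fixed timeout). The explicit `now` parameter keeps the example deterministic:

```python
import time

class TimeoutDetector:
    """A timeout-based failure detector: suspect a node once `timeout`
    seconds pass without a heartbeat from it. It cannot tell a crashed
    node from a slow network or a paused process, so suspicions may be
    false—algorithms built on it must tolerate that."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now=None):
        self.last_seen[node] = time.monotonic() if now is None else now

    def suspected(self, node, now=None):
        now = time.monotonic() if now is None else now
        seen = self.last_seen.get(node)
        return seen is None or now - seen > self.timeout

detector = TimeoutDetector(timeout=1.0)
detector.heartbeat("n1", now=10.0)
fresh = detector.suspected("n1", now=10.5)  # recent heartbeat: not suspected
stale = detector.suspected("n1", now=12.0)  # silent too long: suspected (perhaps wrongly)
```

Note what the detector cannot express: a "limping" node that still sends heartbeats but processes requests at a crawl looks perfectly healthy to it.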
Once a fault is detected, making a system tolerate it is not easy either: there is no global variable, no shared memory, no common knowledge or any other kind of shared state between the machines. Nodes can’t even agree on what time it is, let alone on anything more profound. The only way information can flow from one node to another is by sending it over the unreliable network. Major decisions cannot be safely made by a single node, so we require protocols that enlist help from other nodes and try to get a quorum to agree.
一旦检测到故障,让系统容忍它也不容易:机器之间没有全局变量、没有共享内存、没有公共知识,也没有任何其他形式的共享状态。节点之间甚至无法就当前时间达成一致,更不用说更深刻的事情了。信息从一个节点流向另一个节点的唯一方式是通过不可靠的网络发送。重大决策不能由单个节点安全地做出,因此我们需要能从其他节点获取帮助、并试图让法定人数(quorum)达成一致的协议。
If you’re used to writing software in the idealized mathematical perfection of a single computer, where the same operation always deterministically returns the same result, then moving to the messy physical reality of distributed systems can be a bit of a shock. Conversely, distributed systems engineers will often regard a problem as trivial if it can be solved on a single computer [ 5 ], and indeed a single computer can do a lot nowadays [ 95 ]. If you can avoid opening Pandora’s box and simply keep things on a single machine, it is generally worth doing so.
如果你习惯了在单台计算机的理想化数学完美中编写软件——在那里,相同的操作总是确定性地返回相同的结果——那么转向分布式系统混乱的物理现实可能会带来一点冲击。反过来,分布式系统工程师通常会认为,能在单台计算机上解决的问题都是微不足道的[5],而如今单台计算机确实能做很多事情[95]。如果可以避免打开潘多拉的盒子、把事情简单地保留在单台机器上,那通常是值得的。
However, as discussed in the introduction to Part II , scalability is not the only reason for wanting to use a distributed system. Fault tolerance and low latency (by placing data geographically close to users) are equally important goals, and those things cannot be achieved with a single node.
然而,正如第二部分介绍中所讨论的,可扩展性并不是想要使用分布式系统的唯一原因。容错性和低延迟(通过将数据放置在靠近用户的地方)同样重要,而这些目标无法通过单个节点实现。
In this chapter we also went on some tangents to explore whether the unreliability of networks, clocks, and processes is an inevitable law of nature. We saw that it isn’t: it is possible to give hard real-time response guarantees and bounded delays in networks, but doing so is very expensive and results in lower utilization of hardware resources. Most non-safety-critical systems choose cheap and unreliable over expensive and reliable.
在本章中,我们也进行了一些离题探讨,探讨网络、时钟和进程的不可靠性是否是自然法则。我们发现这不是自然法则:可以在网络中提供硬实时响应保证和有界延迟,但这样做非常昂贵,会导致硬件资源利用率降低。大多数非安全关键系统选择廉价而不可靠的解决方案,而非昂贵而可靠的解决方案。
We also touched on supercomputers, which assume reliable components and thus have to be stopped and restarted entirely when a component does fail. By contrast, distributed systems can run forever without being interrupted at the service level, because all faults and maintenance can be handled at the node level—at least in theory. (In practice, if a bad configuration change is rolled out to all nodes, that will still bring a distributed system to its knees.)
我们还谈到了超级计算机,它们假设组件是可靠的,因此当某个组件发生故障时必须完全停止并重新启动。相比之下,分布式系统可以在服务层面不间断地永远运行,因为所有故障和维护都可以在节点层面处理——至少理论上是如此。(实践中,如果一个错误的配置变更被推送到所有节点,仍然会使分布式系统瘫痪。)
This chapter has been all about problems, and has given us a bleak outlook. In the next chapter we will move on to solutions, and discuss some algorithms that have been designed to cope with all the problems in distributed systems.
这一章节全部关于问题,并给我们带来了灰暗的前景。在下一章中,我们将继续探讨解决方案,并讨论一些旨在应对分布式系统中所有问题的算法。
Footnotes
i With one exception: we will assume that faults are non-Byzantine (see “Byzantine Faults” ).
i 有一个例外:我们将假定故障是非拜占庭式的(参见“拜占庭故障”)。
ii Except perhaps for an occasional keepalive packet, if TCP keepalive is enabled.
ii 一个例外是偶尔的保活(keepalive)数据包,如果启用了TCP keepalive的话。
iii Asynchronous Transfer Mode (ATM) was a competitor to Ethernet in the 1980s [ 32 ], but it didn’t gain much adoption outside of telephone network core switches. It has nothing to do with automatic teller machines (also known as cash machines), despite sharing an acronym. Perhaps, in some parallel universe, the internet is based on something like ATM—in that universe, internet video calls are probably a lot more reliable than they are in ours, because they don’t suffer from dropped and delayed packets.
iii 异步传输模式(ATM)在1980年代是以太网的竞争对手[32],但它并没有在电话网络核心交换机以外得到广泛采用。尽管使用了同一个缩写,它与自动取款机没有任何关系。或许在某个平行宇宙里,互联网基于类似ATM的东西-在那个宇宙里,互联网视频通话可能比我们的更可靠,因为它们不会遭受丢失和延迟的数据包。
iv Peering agreements between internet service providers and the establishment of routes through the Border Gateway Protocol (BGP), bear closer resemblance to circuit switching than IP itself. At this level, it is possible to buy dedicated bandwidth. However, internet routing operates at the level of networks, not individual connections between hosts, and at a much longer timescale.
iv 互联网服务提供商之间的对等互联(peering)协议,以及通过边界网关协议(BGP)建立路由,与电路交换的相似性比IP本身更高。在这个层面上,可以购买专用带宽。然而,互联网路由运行在网络层面,而不是主机之间的单个连接层面,而且时间尺度也长得多。
v Although the clock is called real-time , it has nothing to do with real-time operating systems, as discussed in “Response time guarantees” .
v 尽管这个时钟被称为实时时钟,但它与实时操作系统无关,如“响应时间保证”中所讨论的。
vi There are distributed sequence number generators, such as Twitter’s Snowflake, that generate approximately monotonically increasing unique IDs in a scalable way (e.g., by allocating blocks of the ID space to different nodes). However, they typically cannot guarantee an ordering that is consistent with causality, because the timescale at which blocks of IDs are assigned is longer than the timescale of database reads and writes. See also “Ordering Guarantees” .
vi 存在一些分布式序列号生成器,例如Twitter的Snowflake,它们以可扩展的方式(例如,把ID空间的块分配给不同的节点)生成近似单调递增的唯一ID。但是,它们通常无法保证与因果关系一致的排序,因为分配ID块的时间尺度比数据库读写的时间尺度更长。另请参见“顺序保证”。
References
[ 1 ] Mark Cavage: “ There’s Just No Getting Around It: You’re Building a Distributed System ,” ACM Queue , volume 11, number 4, pages 80-89, April 2013. doi:10.1145/2466486.2482856
[1] Mark Cavage:“没有办法绕开它:你正在构建一个分布式系统”,ACM Queue,第11卷,第4期,第80-89页,2013年4月。doi:10.1145/2466486.2482856
[ 2 ] Jay Kreps: “ Getting Real About Distributed System Reliability ,” blog.empathybox.com , March 19, 2012.
[2] Jay Kreps:“认真对待分布式系统的可靠性”,blog.empathybox.com,2012年3月19日。
[ 3 ] Sydney Padua: The Thrilling Adventures of Lovelace and Babbage: The (Mostly) True Story of the First Computer . Particular Books, April 2015. ISBN: 978-0-141-98151-2
[3] 悉尼·帕杜亚: 洛韦莱斯和巴贝奇的惊险历险:第一台计算机的“大部分”真实故事。宝特书籍,2015年4月。ISBN: 978-0-141-98151-2。
[ 4 ] Coda Hale: “ You Can’t Sacrifice Partition Tolerance ,” codahale.com , October 7, 2010.
[4] Coda Hale: “牺牲分区容错性是不可取的,” codahale.com,2010年10月7日。
[ 5 ] Jeff Hodges: “ Notes on Distributed Systems for Young Bloods ,” somethingsimilar.com , January 14, 2013.
[5] Jeff Hodges: "分布式系统简明指南", somethingsimilar.com, 2013年1月14日。
[ 6 ] Antonio Regalado: “ Who Coined ‘Cloud Computing’? ,” technologyreview.com , October 31, 2011.
[6] Antonio Regalado:“谁创造了‘云计算’一词?”,technologyreview.com,2011年10月31日。
[ 7 ] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle: “ The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition ,” Synthesis Lectures on Computer Architecture , volume 8, number 3, Morgan & Claypool Publishers, July 2013. doi:10.2200/S00516ED2V01Y201306CAC024 , ISBN: 978-1-627-05010-4
[7] Luiz André Barroso、Jimmy Clidaras和Urs Hölzle:“数据中心作为一台计算机:仓库级机器设计导论(第二版)”,计算机体系结构综合讲座,第8卷,第3期,Morgan & Claypool出版社,2013年7月。doi:10.2200/S00516ED2V01Y201306CAC024,ISBN:978-1-627-05010-4
[ 8 ] David Fiala, Frank Mueller, Christian Engelmann, et al.: “ Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing ,” at International Conference for High Performance Computing, Networking, Storage and Analysis (SC12), November 2012.
[8] David Fiala、Frank Mueller、Christian Engelmann等人:“大规模高性能计算中静默数据损坏的检测与纠正”,发表于高性能计算、网络、存储和分析国际会议(SC12),2012年11月。
[ 9 ] Arjun Singh, Joon Ong, Amit Agarwal, et al.: “ Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network ,” at Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM), August 2015. doi:10.1145/2785956.2787508
[9] Arjun Singh、Joon Ong、Amit Agarwal等人:“木星升起:谷歌数据中心网络中十年的Clos拓扑与集中控制”,发表于ACM数据通信特别兴趣小组年会(SIGCOMM),2015年8月。doi:10.1145/2785956.2787508
[ 10 ] Glenn K. Lockwood: “ Hadoop’s Uncomfortable Fit in HPC ,” glennklockwood.blogspot.co.uk , May 16, 2014.
[10] Glenn K. Lockwood:“Hadoop在高性能计算中的不适合”,glennklockwood.blogspot.co.uk,2014年5月16日。
[ 11 ] John von Neumann: “ Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components ,” in Automata Studies (AM-34) , edited by Claude E. Shannon and John McCarthy, Princeton University Press, 1956. ISBN: 978-0-691-07916-5
[11] 约翰·冯·诺伊曼: “概率逻辑和可靠生物体从不可靠组件合成”,收录于《自动机研究》(AM-34), 克洛德·E·香农和约翰·麦卡锡编辑,普林斯顿大学出版社,1956年。 ISBN: 978-0-691-07916-5。
[ 12 ] Richard W. Hamming: The Art of Doing Science and Engineering . Taylor & Francis, 1997. ISBN: 978-9-056-99500-3
[12] 理查德·W·汉明: 做科学和工程的艺术。泰勒和弗朗西斯,1997年。 ISBN:978-9-056-99500-3。
[ 13 ] Claude E. Shannon: “ A Mathematical Theory of Communication ,” The Bell System Technical Journal , volume 27, number 3, pages 379–423 and 623–656, July 1948.
[13] 克劳德·E·香农:《通信的数学理论》,贝尔系统技术杂志,第27卷,第3期,1948年7月,379-423和623-656页。
[ 14 ] Peter Bailis and Kyle Kingsbury: “ The Network Is Reliable ,” ACM Queue , volume 12, number 7, pages 48-55, July 2014. doi:10.1145/2639988.2639988
[14] Peter Bailis 和 Kyle Kingsbury: “网络是可靠的”,ACM Queue,卷 12,号码 7,页码 48-55,2014 年 7 月。doi:10.1145/2639988.2639988。
[ 15 ] Joshua B. Leners, Trinabh Gupta, Marcos K. Aguilera, and Michael Walfish: “ Taming Uncertainty in Distributed Systems with Help from the Network ,” at 10th European Conference on Computer Systems (EuroSys), April 2015. doi:10.1145/2741948.2741976
[15] Joshua B. Leners、Trinabh Gupta、Marcos K. Aguilera和Michael Walfish:“借助网络驯服分布式系统中的不确定性”,发表于第10届欧洲计算机系统会议(EuroSys),2015年4月。doi:10.1145/2741948.2741976
[ 16 ] Phillipa Gill, Navendu Jain, and Nachiappan Nagappan: “ Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications ,” at ACM SIGCOMM Conference , August 2011. doi:10.1145/2018436.2018477
[16] Phillipa Gill, Navendu Jain和Nachiappan Nagappan:“理解数据中心网络故障:测量、分析和影响”,发表于ACM SIGCOMM会议,2011年8月。 doi:10.1145/2018436.2018477
[ 17 ] Mark Imbriaco: “ Downtime Last Saturday ,” github.com , December 26, 2012.
[17] Mark Imbriaco: "上周六的停机时间",github.com,2012年12月26日。
[ 18 ] Will Oremus: “ The Global Internet Is Being Attacked by Sharks, Google Confirms ,” slate.com , August 15, 2014.
[18] Will Oremus:“谷歌证实:全球互联网正遭受鲨鱼袭击”,slate.com,2014年8月15日。
[ 19 ] Marc A. Donges: “ Re: bnx2 cards Intermittantly Going Offline ,” Message to Linux netdev mailing list, spinics.net , September 13, 2012.
[19] Marc A. Donges:“Re: bnx2 cards Intermittantly Going Offline”,发往Linux netdev邮件列表的消息,spinics.net,2012年9月13日。
[ 20 ] Kyle Kingsbury: “ Call Me Maybe: Elasticsearch ,” aphyr.com , June 15, 2014.
[20] 凯尔·金斯伯利:“Call Me Maybe:Elasticsearch”,aphyr.com,2014年6月15日。
[ 21 ] Salvatore Sanfilippo: “ A Few Arguments About Redis Sentinel Properties and Fail Scenarios ,” antirez.com , October 21, 2014.
[21] Salvatore Sanfilippo:“Redis Sentinel 属性和故障场景的几个论点”,antirez.com,2014年10月21日。
[ 22 ] Bert Hubert: “ The Ultimate SO_LINGER Page, or: Why Is My TCP Not Reliable ,” blog.netherlabs.nl , January 18, 2009.
[22] Bert Hubert:“超级SO_LINGER页面,或:为什么我的TCP不可靠”,blog.netherlabs.nl,2009年1月18日。
[ 23 ] Nicolas Liochon: “ CAP: If All You Have Is a Timeout, Everything Looks Like a Partition ,” blog.thislongrun.com , May 25, 2015.
[23] Nicolas Liochon:“CAP:如果你只有超时,一切看起来都像分区”,blog.thislongrun.com,2015年5月25日。
[ 24 ] Jerome H. Saltzer, David P. Reed, and David D. Clark: “ End-To-End Arguments in System Design ,” ACM Transactions on Computer Systems , volume 2, number 4, pages 277–288, November 1984. doi:10.1145/357401.357402
[24] Jerome H. Saltzer, David P. Reed, and David D. Clark:“系统设计中的端到端论证”,ACM 计算机系统交易,第 2 卷,第 4 期,页面 277-288,1984 年 11 月。doi:10.1145/357401.357402。
[ 25 ] Matthew P. Grosvenor, Malte Schwarzkopf, Ionel Gog, et al.: “ Queues Don’t Matter When You Can JUMP Them! ,” at 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI), May 2015.
"[25] Matthew P. Grosvenor, Malte Schwarzkopf, Ionel Gog等人: “Queues Don’t Matter When You Can JUMP Them!,” 发表于第12届USENIX网络系统设计与实现研讨会(NSDI),2015年5月。"
[ 26 ] Guohui Wang and T. S. Eugene Ng: “ The Impact of Virtualization on Network Performance of Amazon EC2 Data Center ,” at 29th IEEE International Conference on Computer Communications (INFOCOM), March 2010. doi:10.1109/INFCOM.2010.5461931
[26] 王国辉和T.S. Eugene Ng:“虚拟化对亚马逊EC2数据中心网络性能的影响”,发表于第29届IEEE国际计算机通信会议(INFOCOM),2010年3月。doi:10.1109/INFCOM.2010.5461931。
[ 27 ] Van Jacobson: “ Congestion Avoidance and Control ,” at ACM Symposium on Communications Architectures and Protocols (SIGCOMM), August 1988. doi:10.1145/52324.52356
[27]范·雅各布森: “拥塞避免和控制”,发表于1988年8月的ACM通信体系结构和协议(SIGCOMM)研讨会。 doi:10.1145/52324.52356
[ 28 ] Brandon Philips: “ etcd: Distributed Locking and Service Discovery ,” at Strange Loop , September 2014.
[28] Brandon Philips:“etcd:分布式锁与服务发现”,在Strange Loop大会上,2014年9月。
[ 29 ] Steve Newman: “ A Systematic Look at EC2 I/O ,” blog.scalyr.com , October 16, 2012.
[29] Steve Newman: "系统地观察EC2 I/O",blog.scalyr.com,2012年10月16日。
[ 30 ] Naohiro Hayashibara, Xavier Défago, Rami Yared, and Takuya Katayama: “ The ϕ Accrual Failure Detector ,” Japan Advanced Institute of Science and Technology, School of Information Science, Technical Report IS-RR-2004-010, May 2004.
[30] Naohiro Hayashibara、Xavier Défago、Rami Yared和Takuya Katayama:“ϕ累积故障检测器”,日本北陆先端科学技术大学院大学信息科学学院,技术报告IS-RR-2004-010,2004年5月。
[ 31 ] Jeffrey Wang: “ Phi Accrual Failure Detector ,” ternarysearch.blogspot.co.uk , August 11, 2013.
[31] Jeffrey Wang:“Phi Accrual故障探测器”,ternarysearch.blogspot.co.uk,2013年8月11日。
[ 32 ] Srinivasan Keshav: An Engineering Approach to Computer Networking: ATM Networks, the Internet, and the Telephone Network . Addison-Wesley Professional, May 1997. ISBN: 978-0-201-63442-6
[32] Srinivasan Keshav:《计算机网络的工程方法:ATM网络、互联网和电话网络》Addison-Wesley Professional,1997年5月出版。ISBN: 978-0-201-63442-6。
[ 33 ] Cisco, “ Integrated Services Digital Network ,” docwiki.cisco.com .
[33] Cisco,“集成服务数字网”,docwiki.cisco.com。
[ 34 ] Othmar Kyas: ATM Networks . International Thomson Publishing, 1995. ISBN: 978-1-850-32128-6
[34] Othmar Kyas: ATM网络。国际汤森出版,1995年。ISBN:978-1-850-32128-6。
[ 35 ] “ InfiniBand FAQ ,” Mellanox Technologies, December 22, 2014.
[35] “InfiniBand常见问题解答”,Mellanox Technologies,2014年12月22日。
[ 36 ] Jose Renato Santos, Yoshio Turner, and G. (John) Janakiraman: “ End-to-End Congestion Control for InfiniBand ,” at 22nd Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM), April 2003. Also published by HP Laboratories Palo Alto, Tech Report HPL-2002-359. doi:10.1109/INFCOM.2003.1208949
[36] Jose Renato Santos、Yoshio Turner和G. (John) Janakiraman:“InfiniBand的端到端拥塞控制”,发表于第22届IEEE计算机与通信学会联合年会(INFOCOM),2003年4月。也由惠普帕洛阿尔托实验室发表,技术报告HPL-2002-359。doi:10.1109/INFCOM.2003.1208949
[ 37 ] Ulrich Windl, David Dalton, Marc Martinec, and Dale R. Worley: “ The NTP FAQ and HOWTO ,” ntp.org , November 2006.
[37] Ulrich Windl,David Dalton,Marc Martinec和Dale R. Worley: “NTP FAQ和HOWTO”,ntp.org,2006年11月。
[ 38 ] John Graham-Cumming: “ How and why the leap second affected Cloudflare DNS ,” blog.cloudflare.com , January 1, 2017.
[38] 约翰·格雷厄姆-卡明 (John Graham-Cumming): “‘闰秒’ 如何影响 Cloudflare DNS 以及其原因,” 博客地址为 blog.cloudflare.com,发表日期为 2017 年 1 月 1 日。
[ 39 ] David Holmes: “ Inside the Hotspot VM: Clocks, Timers and Scheduling Events – Part I – Windows ,” blogs.oracle.com , October 2, 2006.
[39] David Holmes: "Hotspot VM 内部: 时钟,定时器和调度事件 - 第一部分 - Windows," blogs.oracle.com, 2006 年 10 月 2 日。
[ 40 ] Steve Loughran: “ Time on Multi-Core, Multi-Socket Servers ,” steveloughran.blogspot.co.uk , September 17, 2015.
[40] Steve Loughran: “在多核、多插槽服务器上的时间管理”,steveloughran.blogspot.co.uk,2015年9月17日。
[ 41 ] James C. Corbett, Jeffrey Dean, Michael Epstein, et al.: “ Spanner: Google’s Globally-Distributed Database ,” at 10th USENIX Symposium on Operating System Design and Implementation (OSDI), October 2012.
[41] James C. Corbett,Jeffrey Dean,Michael Epstein等人: “Spanner:Google 全球分布式数据库”,在第十届 USENIX 操作系统设计和实现研讨会(OSDI)上,2012年10月。
[ 42 ] M. Caporaloni and R. Ambrosini: “ How Closely Can a Personal Computer Clock Track the UTC Timescale Via the Internet? ,” European Journal of Physics , volume 23, number 4, pages L17–L21, June 2012. doi:10.1088/0143-0807/23/4/103
[42] M. Caporaloni和R. Ambrosini:“个人电脑通过互联网可以以多大精度跟踪UTC时间标准?”,欧洲物理学杂志,第23卷,第4期,页码L17–L21,2012年6月。doi:10.1088/0143-0807/23/4/103。
[ 43 ] Nelson Minar: “ A Survey of the NTP Network ,” alumni.media.mit.edu , December 1999.
[43] 尼尔森·米纳尔:“NTP网络调查”,alumni.media.mit.edu,1999年12月。
[ 44 ] Viliam Holub: “ Synchronizing Clocks in a Cassandra Cluster Pt. 1 – The Problem ,” blog.logentries.com , March 14, 2014.
[44] 维利亚姆·霍鲁布: “在卡桑德拉集群中同步时钟问题,第一部分 - 问题”,blog.logentries.com,2014年3月14日。
[ 45 ] Poul-Henning Kamp: “ The One-Second War (What Time Will You Die?) ,” ACM Queue , volume 9, number 4, pages 44–48, April 2011. doi:10.1145/1966989.1967009
[45] Poul-Henning Kamp:“一秒战争(你将死于何时?)”,《ACM Queue》杂志,2011年4月第9卷第4期,44-48页。doi:10.1145 / 1966989.1967009。
[ 46 ] Nelson Minar: “ Leap Second Crashes Half the Internet ,” somebits.com , July 3, 2012.
[46] Nelson Minar:“闰秒导致一半的互联网崩溃”,somebits.com,2012年7月3日。
[ 47 ] Christopher Pascoe: “ Time, Technology and Leaping Seconds ,” googleblog.blogspot.co.uk , September 15, 2011.
[47] Christopher Pascoe:“时间、技术与闰秒”,googleblog.blogspot.co.uk,2011年9月15日。
[ 48 ] Mingxue Zhao and Jeff Barr: “ Look Before You Leap – The Coming Leap Second and AWS ,” aws.amazon.com , May 18, 2015.
[48] Mingxue Zhao和Jeff Barr:“Look Before You Leap——即将到来的闰秒与AWS”,aws.amazon.com,2015年5月18日。
[ 49 ] Darryl Veitch and Kanthaiah Vijayalayan: “ Network Timing and the 2015 Leap Second ,” at 17th International Conference on Passive and Active Measurement (PAM), April 2016. doi:10.1007/978-3-319-30505-9_29
[49] Darryl Veitch和Kanthaiah Vijayalayan: “网络时序和2015年闰秒”,收录于第17届被动和主动测量国际会议(PAM),于2016年4月。 doi:10.1007/978-3-319-30505-9_29
[ 50 ] “ Timekeeping in VMware Virtual Machines ,” Information Guide, VMware, Inc., December 2011.
[50] “VMware虚拟机中的计时”,信息指南,VMware公司,2011年12月。
[ 51 ] “ MiFID II / MiFIR: Regulatory Technical and Implementing Standards – Annex I (Draft) ,” European Securities and Markets Authority, Report ESMA/2015/1464, September 2015.
[51] “MiFID II / MiFIR:监管技术与实施标准——附件一(草案)”,欧洲证券和市场管理局,报告ESMA/2015/1464,2015年9月。
[ 52 ] Luke Bigum: “ Solving MiFID II Clock Synchronisation With Minimum Spend (Part 1) ,” lmax.com , November 27, 2015.
[52]卢克·比格姆:“以最少开支解决MiFID II时钟同步问题(第1部分)”, lmax.com,2015年11月27日。
[ 53 ] Kyle Kingsbury: “ Call Me Maybe: Cassandra ,” aphyr.com , September 24, 2013.
[53] Kyle Kingsbury:“Call Me Maybe:Cassandra”,aphyr.com,2013年9月24日。
[ 54 ] John Daily: “ Clocks Are Bad, or, Welcome to the Wonderful World of Distributed Systems ,” basho.com , November 12, 2013.
[54] John Daily:“时钟是坏的,或者说,欢迎来到分布式系统的奇妙世界”,basho.com,2013年11月12日。
[ 55 ] Kyle Kingsbury: “ The Trouble with Timestamps ,” aphyr.com , October 12, 2013.
[55] Kyle Kingsbury:“时间戳的难题”,aphyr.com,2013年10月12日。
[ 56 ] Leslie Lamport: “ Time, Clocks, and the Ordering of Events in a Distributed System ,” Communications of the ACM , volume 21, number 7, pages 558–565, July 1978. doi:10.1145/359545.359563
[56] Leslie Lamport:“时间、时钟和分布式系统中事件的排序”,《ACM通讯》,第21卷,第7期,第558–565页,1978年7月。doi:10.1145/359545.359563
[ 57 ] Sandeep Kulkarni, Murat Demirbas, Deepak Madeppa, et al.: “ Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases ,” State University of New York at Buffalo, Computer Science and Engineering Technical Report 2014-04, May 2014.
[57] Sandeep Kulkarni,Murat Demirbas,Deepak Madeppa等人:“全球分布式数据库中的逻辑物理时钟和一致快照”,纽约州立大学布法罗分校,计算机科学和工程技术报告2014-04,2014年5月。
[ 58 ] Justin Sheehy: “ There Is No Now: Problems With Simultaneity in Distributed Systems ,” ACM Queue , volume 13, number 3, pages 36–41, March 2015. doi:10.1145/2733108
[58] Justin Sheehy:“‘现在’并不存在:分布式系统中的同时性问题”,《ACM Queue》,第13卷,第3期,第36–41页,2015年3月。doi:10.1145/2733108
[ 59 ] Murat Demirbas: “ Spanner: Google’s Globally-Distributed Database ,” muratbuffalo.blogspot.co.uk , July 4, 2013.
[59] Murat Demirbas:“Spanner:Google的全球分布式数据库”,muratbuffalo.blogspot.co.uk,2013年7月4日。
[ 60 ] Dahlia Malkhi and Jean-Philippe Martin: “ Spanner’s Concurrency Control ,” ACM SIGACT News , volume 44, number 3, pages 73–77, September 2013. doi:10.1145/2527748.2527767
[60] Dahlia Malkhi和Jean-Philippe Martin:“Spanner的并发控制”,《ACM SIGACT News》,第44卷,第3期,第73–77页,2013年9月。doi:10.1145/2527748.2527767
[ 61 ] Manuel Bravo, Nuno Diegues, Jingna Zeng, et al.: “ On the Use of Clocks to Enforce Consistency in the Cloud ,” IEEE Data Engineering Bulletin , volume 38, number 1, pages 18–31, March 2015.
[61] Manuel Bravo, Nuno Diegues, Jingna Zeng等人:《在云计算中使用时钟以实现一致性》,IEEE数据工程通报,第38卷,第1期,2015年3月,18-31页。
[ 62 ] Spencer Kimball: “ Living Without Atomic Clocks ,” cockroachlabs.com , February 17, 2016.
[62] Spencer Kimball:“没有原子钟的生活”,cockroachlabs.com,2016年2月17日。
[ 63 ] Cary G. Gray and David R. Cheriton: “ Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency ,” at 12th ACM Symposium on Operating Systems Principles (SOSP), December 1989. doi:10.1145/74850.74870
[63] Cary G. Gray和David R. Cheriton:“租约:分布式文件缓存一致性的高效容错机制”,发表于1989年12月的第12届ACM操作系统原理研讨会(SOSP)。doi:10.1145/74850.74870。
[ 64 ] Todd Lipcon: “ Avoiding Full GCs in Apache HBase with MemStore-Local Allocation Buffers: Part 1 ,” blog.cloudera.com , February 24, 2011.
[64] Todd Lipcon: "使用MemStore-Local分配缓冲区来避免Apache HBase的Full GC:第一部分",blog.cloudera.com,2011年2月24日。
[ 65 ] Martin Thompson: “ Java Garbage Collection Distilled ,” mechanical-sympathy.blogspot.co.uk , July 16, 2013.
[65] Martin Thompson:“Java垃圾回收精要”,mechanical-sympathy.blogspot.co.uk,2013年7月16日。
[ 66 ] Alexey Ragozin: “ How to Tame Java GC Pauses? Surviving 16GiB Heap and Greater ,” java.dzone.com , June 28, 2011.
[66] Alexey Ragozin:“如何驯服Java GC暂停?在16GiB及更大的堆上生存”,java.dzone.com,2011年6月28日。
[ 67 ] Christopher Clark, Keir Fraser, Steven Hand, et al.: “ Live Migration of Virtual Machines ,” at 2nd USENIX Symposium on Symposium on Networked Systems Design & Implementation (NSDI), May 2005.
[67] 克里斯托弗·克拉克(Christopher Clark),凯尔·弗雷泽(Keir Fraser),史蒂文·汉德(Steven Hand)等:“虚拟机的实时迁移”,发表于第二届USENIX协会网络系统设计与实现研讨会(NSDI),2005年5月。
[ 68 ] Mike Shaver: “ fsyncers and Curveballs ,” shaver.off.net , May 25, 2008.
[68] 迈克·沙弗(Mike Shaver):“fsyncers和Curveballs”,shaver.off.net,2008年5月25日。
[ 69 ] Zhenyun Zhuang and Cuong Tran: “ Eliminating Large JVM GC Pauses Caused by Background IO Traffic ,” engineering.linkedin.com , February 10, 2016.
[69] Zhenyun Zhuang和Cuong Tran:“消除后台IO流量导致的JVM长GC暂停”,engineering.linkedin.com,2016年2月10日。
[ 70 ] David Terei and Amit Levy: “ Blade: A Data Center Garbage Collector ,” arXiv:1504.02578, April 13, 2015.
[70] David Terei 和 Amit Levy: “Blade: 数据中心垃圾回收器,” arXiv:1504.02578, 2015 年 4 月 13 日。
[ 71 ] Martin Maas, Tim Harris, Krste Asanović, and John Kubiatowicz: “ Trash Day: Coordinating Garbage Collection in Distributed Systems ,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.
[71] Martin Maas、Tim Harris、Krste Asanović和John Kubiatowicz:“Trash Day:协调分布式系统中的垃圾回收”,发表于第15届USENIX操作系统热点研讨会(HotOS),2015年5月。
[ 72 ] “ Predictable Low Latency ,” Cinnober Financial Technology AB, cinnober.com , November 24, 2013.
[72] “可预测的低延迟”,Cinnober金融科技有限公司,cinnober.com,2013年11月24日。
[ 73 ] Martin Fowler: “ The LMAX Architecture ,” martinfowler.com , July 12, 2011.
[73] 马丁·福勒: “LMAX 架构”, martinfowler.com, 2011 年 7 月 12 日。
[ 74 ] Flavio P. Junqueira and Benjamin Reed: ZooKeeper: Distributed Process Coordination . O’Reilly Media, 2013. ISBN: 978-1-449-36130-3
[74] Flavio P. Junqueira和Benjamin Reed:《ZooKeeper:分布式进程协调》。O'Reilly Media,2013年。ISBN:978-1-449-36130-3
[ 75 ] Enis Söztutar: “ HBase and HDFS: Understanding Filesystem Usage in HBase ,” at HBaseCon , June 2013.
[75] Enis Söztutar:“HBase和HDFS:理解HBase中的文件系统使用”,发表于HBaseCon,2013年6月。
[ 76 ] Caitie McCaffrey: “ Clients Are Jerks: AKA How Halo 4 DoSed the Services at Launch & How We Survived ,” caitiem.com , June 23, 2015.
[76] Caitie McCaffrey:“客户端都是混蛋:又名《光环4》上线时如何DoS了我们的服务,以及我们如何幸存”,caitiem.com,2015年6月23日。
[ 77 ] Leslie Lamport, Robert Shostak, and Marshall Pease: “ The Byzantine Generals Problem ,” ACM Transactions on Programming Languages and Systems (TOPLAS), volume 4, number 3, pages 382–401, July 1982. doi:10.1145/357172.357176
[77] Leslie Lamport、Robert Shostak和Marshall Pease:“拜占庭将军问题”,《ACM Transactions on Programming Languages and Systems》(TOPLAS),第4卷,第3期,第382–401页,1982年7月。doi:10.1145/357172.357176
[ 78 ] Jim N. Gray: “ Notes on Data Base Operating Systems ,” in Operating Systems: An Advanced Course , Lecture Notes in Computer Science, volume 60, edited by R. Bayer, R. M. Graham, and G. Seegmüller, pages 393–481, Springer-Verlag, 1978. ISBN: 978-3-540-08755-7
[78] 吉姆·格雷:《数据库操作系统笔记》,收录于《操作系统:高阶课程》,计算机科学讲义,卷 60,由 R. Bayer、R. M. Graham 和 G. Seegmüller 编辑,Springer-Verlag 出版,1978 年,页码 393-481。ISBN: 978-3-540-08755-7。
[ 79 ] Brian Palmer: “ How Complicated Was the Byzantine Empire? ,” slate.com , October 20, 2011.
[79] 布莱恩·帕尔默:“拜占庭帝国有多复杂?” slate.com,2011年10月20日。
[ 80 ] Leslie Lamport: “ My Writings ,” research.microsoft.com , December 16, 2014. This page can be found by searching the web for the 23-character string obtained by removing the hyphens from the string allla-mport-spubso-ntheweb .
[80] Leslie Lamport:“我的著作”,research.microsoft.com,2014年12月16日。可以通过在网络上搜索由字符串allla-mport-spubso-ntheweb去掉连字符后得到的23个字符的字符串来找到此页面。
[ 81 ] John Rushby: “ Bus Architectures for Safety-Critical Embedded Systems ,” at 1st International Workshop on Embedded Software (EMSOFT), October 2001.
[81] 约翰·拉什比:「安全关键嵌入式系统的总线架构」,于2001年10月举办的第一届嵌入式软件研讨会(EMSOFT)上发表。
[ 82 ] Jake Edge: “ ELC: SpaceX Lessons Learned ,” lwn.net , March 6, 2013.
[82] Jake Edge:“ELC:SpaceX的经验教训”,lwn.net,2013年3月6日。
[ 83 ] Andrew Miller and Joseph J. LaViola, Jr.: “ Anonymous Byzantine Consensus from Moderately-Hard Puzzles: A Model for Bitcoin ,” University of Central Florida, Technical Report CS-TR-14-01, April 2014.
[83] Andrew Miller和Joseph J. LaViola, Jr.:“来自中等难度谜题的匿名拜占庭共识:一个比特币模型”,中佛罗里达大学,技术报告CS-TR-14-01,2014年4月。
[ 84 ] James Mickens: “ The Saddest Moment ,” USENIX ;login: logout , May 2013.
[84] James Mickens:“最悲哀的时刻”,USENIX ;login: logout,2013年5月。
[ 85 ] Evan Gilman: “ The Discovery of Apache ZooKeeper’s Poison Packet ,” pagerduty.com , May 7, 2015.
[85] Evan Gilman:“Apache ZooKeeper的毒包的发现”,pagerduty.com,2015年5月7日。
[ 86 ] Jonathan Stone and Craig Partridge: “ When the CRC and TCP Checksum Disagree ,” at ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM), August 2000. doi:10.1145/347059.347561
[86] Jonathan Stone和Craig Partridge: “当CRC和TCP校验和不一致时”,《计算机通信应用、技术、架构与协议ACM会议》(SIGCOMM),2000年8月。 doi:10.1145/347059.347561
[ 87 ] Evan Jones: “ How Both TCP and Ethernet Checksums Fail ,” evanjones.ca , October 5, 2015.
[87] Evan Jones:“TCP和以太网校验和是如何失败的”,evanjones.ca,2015年10月5日。
[ 88 ] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer: “ Consensus in the Presence of Partial Synchrony ,” Journal of the ACM , volume 35, number 2, pages 288–323, April 1988. doi:10.1145/42282.42283
[88] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer: “在部分同步状态下的共识”,ACM杂志,第35卷,第2期,页码288-323,1988年4月发表。doi:10.1145/42282.42283
[ 89 ] Peter Bailis and Ali Ghodsi: “ Eventual Consistency Today: Limitations, Extensions, and Beyond ,” ACM Queue , volume 11, number 3, pages 55-63, March 2013. doi:10.1145/2460276.2462076
[89] Peter Bailis和Ali Ghodsi:“今日的最终一致性:局限、扩展及未来”,《ACM Queue》,第11卷,第3期,第55–63页,2013年3月。doi:10.1145/2460276.2462076
[ 90 ] Bowen Alpern and Fred B. Schneider: “ Defining Liveness ,” Information Processing Letters , volume 21, number 4, pages 181–185, October 1985. doi:10.1016/0020-0190(85)90056-0
[90] Bowen Alpern和Fred B. Schneider:“定义活性”,《信息处理快报》,第21卷,第4期,第181–185页,1985年10月。doi:10.1016/0020-0190(85)90056-0
[ 91 ] Flavio P. Junqueira: “ Dude, Where’s My Metadata? ,” fpj.me , May 28, 2015.
[91] Flavio P. Junqueira:“老兄,我的元数据呢?”,fpj.me,2015年5月28日。
[ 92 ] Scott Sanders: “ January 28th Incident Report ,” github.com , February 3, 2016.
[92] Scott Sanders:“1月28日事件报告”,github.com,2016年2月3日。
[ 93 ] Jay Kreps: “ A Few Notes on Kafka and Jepsen ,” blog.empathybox.com , September 25, 2013.
[93] 杰伊·克雷普斯:《有关Kafka和Jepsen的几点注意事项》,blog.empathybox.com,2013年9月25日。
[ 94 ] Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa, et al.: “ Limplock: Understanding the Impact of Limpware on Scale-out Cloud Systems ,” at 4th ACM Symposium on Cloud Computing (SoCC), October 2013. doi:10.1145/2523616.2523627
[94] Thanh Do、Mingzhe Hao、Tanakorn Leesatapornwongsa等:“Limplock:理解Limpware对横向扩展云系统的影响”,发表于第四届ACM云计算研讨会(SoCC),2013年10月。doi:10.1145/2523616.2523627
[ 95 ] Frank McSherry, Michael Isard, and Derek G. Murray: “ Scalability! But at What COST? ,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.
[95] Frank McSherry, Michael Isard和Derek G. Murray:"可伸缩性!但代价是什么?", 在第15届USENIX操作系统热点工作坊(HotOS),2015年5月。
Chapter 9. Consistency and Consensus
Is it better to be alive and wrong or right and dead?
是活着但出错好,还是正确但死掉好?
Jay Kreps, A Few Notes on Kafka and Jepsen (2013)
Jay Kreps, 关于Kafka和Jepsen的几点说明(2013年)
Lots of things can go wrong in distributed systems, as discussed in Chapter 8 . The simplest way of handling such faults is to simply let the entire service fail, and show the user an error message. If that solution is unacceptable, we need to find ways of tolerating faults—that is, of keeping the service functioning correctly, even if some internal component is faulty.
分布式系统中可能出现很多问题,如第8章所讨论的那样。处理此类故障的最简单方法是直接让整个服务失败,并向用户显示错误消息。如果该解决方案不可接受,则需要找到容错的方法——即使某些内部组件出现故障,也能正确地保持服务运行。
In this chapter, we will talk about some examples of algorithms and protocols for building fault-tolerant distributed systems. We will assume that all the problems from Chapter 8 can occur: packets can be lost, reordered, duplicated, or arbitrarily delayed in the network; clocks are approximate at best; and nodes can pause (e.g., due to garbage collection) or crash at any time.
在本章中,我们将谈论一些算法和协议的示例,用于构建容错的分布式系统。我们将假设来自第8章的所有问题都可能发生:数据包可能在网络中丢失、重排序、重复或任意延迟;时钟充其量只是近似的;节点可以随时暂停(例如,由于垃圾收集)或崩溃。
The best way of building fault-tolerant systems is to find some general-purpose abstractions with useful guarantees, implement them once, and then let applications rely on those guarantees. This is the same approach as we used with transactions in Chapter 7 : by using a transaction, the application can pretend that there are no crashes (atomicity), that nobody else is concurrently accessing the database (isolation), and that storage devices are perfectly reliable (durability). Even though crashes, race conditions, and disk failures do occur, the transaction abstraction hides those problems so that the application doesn’t need to worry about them.
构建容错系统的最佳方法是找到一些具有有用保证的通用抽象,将它们实现一次,然后让应用程序依赖这些保证。这与第7章中处理事务的方法相同:通过使用事务,应用程序可以假装不存在崩溃(原子性),没有其他人并发访问数据库(隔离性),并且存储设备是完全可靠的(持久性)。尽管崩溃、竞态条件和磁盘故障确实会发生,事务抽象隐藏了这些问题,使应用程序无需担心它们。
We will now continue along the same lines, and seek abstractions that can allow an application to ignore some of the problems with distributed systems. For example, one of the most important abstractions for distributed systems is consensus : that is, getting all of the nodes to agree on something. As we shall see in this chapter, reliably reaching consensus in spite of network faults and process failures is a surprisingly tricky problem.
现在我们将沿用相同的思路,寻求可以让应用程序忽略分布式系统中某些问题的抽象。例如,分布式系统最重要的抽象之一就是共识:即让所有节点就某件事达成一致。正如我们将在本章中看到的,在出现网络故障和进程失效的情况下可靠地达成共识,是一个出人意料的棘手问题。
Once you have an implementation of consensus, applications can use it for various purposes. For example, say you have a database with single-leader replication. If the leader dies and you need to fail over to another node, the remaining database nodes can use consensus to elect a new leader. As discussed in “Handling Node Outages” , it’s important that there is only one leader, and that all nodes agree who the leader is. If two nodes both believe that they are the leader, that situation is called split brain , and it often leads to data loss. Correct implementations of consensus help avoid such problems.
一旦你有了共识的实现,应用程序就可以将其用于各种目的。例如,假设你有一个使用单主复制的数据库。如果领导者失效,你需要故障转移到另一个节点,剩余的数据库节点可以使用共识来选举新领导者。如“处理节点宕机”中所讨论的那样,重要的是只有一个领导者,并且所有节点都同意谁是领导者。如果两个节点都认为自己是领导者,这种情况被称为脑裂,它通常会导致数据丢失。共识的正确实现有助于避免这类问题。
Later in this chapter, in “Distributed Transactions and Consensus” , we will look into algorithms to solve consensus and related problems. But first we first need to explore the range of guarantees and abstractions that can be provided in a distributed system.
在本章稍后的“分布式事务与共识”部分,我们将研究解决共识及相关问题的算法。但首先,我们需要探索分布式系统中可以提供的保证和抽象的范围。
We need to understand the scope of what can and cannot be done: in some situations, it’s possible for the system to tolerate faults and continue working; in other situations, that is not possible. The limits of what is and isn’t possible have been explored in depth, both in theoretical proofs and in practical implementations. We will get an overview of those fundamental limits in this chapter.
我们需要理解可行和不可行的范畴:在某些情况下,系统可以容忍故障并继续工作;在其他情况下,这是不可能的。这些可能与不可能的限制已经在理论证明和实际应用中深入探讨过。本章中将简要介绍这些基本限制。
Researchers in the field of distributed systems have been studying these topics for decades, so there is a lot of material—we’ll only be able to scratch the surface. In this book we don’t have space to go into details of the formal models and proofs, so we will stick with informal intuitions. The literature references offer plenty of additional depth if you’re interested.
分布式系统领域的研究人员数十年来一直在研究这些主题,因此有大量的材料,我们只能触及皮毛。本书没有篇幅详细介绍形式化模型和证明,因此我们将坚持非形式化的直觉。如果您有兴趣,参考文献提供了大量深入的材料。
Consistency Guarantees
In “Problems with Replication Lag” we looked at some timing issues that occur in a replicated database. If you look at two database nodes at the same moment in time, you’re likely to see different data on the two nodes, because write requests arrive on different nodes at different times. These inconsistencies occur no matter what replication method the database uses (single-leader, multi-leader, or leaderless replication).
在“复制延迟问题”中,我们研究了复制数据库中出现的一些时序问题。如果您在同一时刻查看两个数据库节点,可能会在两个节点上看到不同的数据,因为写请求在不同的时间到达不同的节点。无论数据库使用何种复制方法(单主、多主或无主复制),这些不一致都会发生。
Most replicated databases provide at least eventual consistency , which means that if you stop writing to the database and wait for some unspecified length of time, then eventually all read requests will return the same value [ 1 ]. In other words, the inconsistency is temporary, and it eventually resolves itself (assuming that any faults in the network are also eventually repaired). A better name for eventual consistency may be convergence , as we expect all replicas to eventually converge to the same value [ 2 ].
大多数复制数据库至少提供最终一致性,这意味着如果您停止写入数据库并等待一段未指定的时间,那么最终所有读取请求都将返回相同的值[1]。换句话说,不一致是暂时的,最终会自行解决(假设网络中的任何故障最终也会被修复)。最终一致性的一个更好的名称可能是收敛,因为我们期望所有副本最终会收敛到相同的值[2]。
However, this is a very weak guarantee—it doesn’t say anything about when the replicas will converge. Until the time of convergence, reads could return anything or nothing [ 1 ]. For example, if you write a value and then immediately read it again, there is no guarantee that you will see the value you just wrote, because the read may be routed to a different replica (see “Reading Your Own Writes” ).
然而,这是一个非常弱的保证:它并没有说明副本何时会收敛。在收敛之前,读取可能返回任何值,或者什么都不返回[1]。例如,如果您写入一个值然后立即再次读取它,并不能保证您会看到刚写入的值,因为读取可能被路由到另一个副本(请参见“读取您自己的写入”)。
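The stale read described above can be made concrete with a toy model of asynchronous replication. This is a minimal sketch, not any real database's replication protocol: a single leader applies writes immediately, and a follower only sees them once a (here, manually triggered) replication step runs, so a read routed to the follower in between returns stale data.

```python
class Replica:
    """One copy of the data (a simple key-value store)."""
    def __init__(self):
        self.data = {}

leader, follower = Replica(), Replica()
replication_log = []  # writes waiting to be shipped to the follower

def write(key, value):
    # The leader applies the write immediately; the follower
    # only learns about it asynchronously, via the log.
    leader.data[key] = value
    replication_log.append((key, value))

def read_from(replica, key):
    return replica.data.get(key)

def apply_replication():
    # In a real system this happens continuously in the background;
    # here we trigger it by hand to expose the replication lag.
    while replication_log:
        k, v = replication_log.pop(0)
        follower.data[k] = v

write("x", 1)
# A read routed to the lagging follower does not see the write yet:
assert read_from(follower, "x") is None
apply_replication()
# Once replication catches up, the replicas converge (eventual consistency):
assert read_from(follower, "x") == 1
```

The gap between the two assertions is exactly the window in which "read your own writes" can be violated.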
Eventual consistency is hard for application developers because it is so different from the behavior of variables in a normal single-threaded program. If you assign a value to a variable and then read it shortly afterward, you don’t expect to read back the old value, or for the read to fail. A database looks superficially like a variable that you can read and write, but in fact it has much more complicated semantics [ 3 ].
最终一致性对应用程序开发者来说很难,因为它与普通单线程程序中变量的行为完全不同。如果你给一个变量赋值,然后很快就读取它,你不会期望读回旧的值,或者读取失败。数据库表面上看起来像一个可以读写的变量,但实际上它具有更复杂的语义。
When working with a database that provides only weak guarantees, you need to be constantly aware of its limitations and not accidentally assume too much. Bugs are often subtle and hard to find by testing, because the application may work well most of the time. The edge cases of eventual consistency only become apparent when there is a fault in the system (e.g., a network interruption) or at high concurrency.
当与提供弱保证的数据库一起工作时,您需要时刻注意其限制,不要意外地假设太多。由于应用程序大多数时间表现良好,因此错误通常很微妙,难以通过测试找到。在系统出现故障(例如,网络中断)或高并发时,最终一致性的边界情况才会变得明显。
In this chapter we will explore stronger consistency models that data systems may choose to provide. They don’t come for free: systems with stronger guarantees may have worse performance or be less fault-tolerant than systems with weaker guarantees. Nevertheless, stronger guarantees can be appealing because they are easier to use correctly. Once you have seen a few different consistency models, you’ll be in a better position to decide which one best fits your needs.
在本章中,我们将探讨数据系统可能选择提供的更强一致性模型。它们并不是免费的:具有更强的保证的系统可能比具有更弱保证的系统性能更差或更不容错。尽管如此,更强的保证可以吸引人,因为它们更容易正确使用。一旦您看过几种不同的一致性模型,您将更好地决定哪种最适合您的需求。
There is some similarity between distributed consistency models and the hierarchy of transaction isolation levels we discussed previously [ 4 , 5 ] (see “Weak Isolation Levels” ). But while there is some overlap, they are mostly independent concerns: transaction isolation is primarily about avoiding race conditions due to concurrently executing transactions, whereas distributed consistency is mostly about coordinating the state of replicas in the face of delays and faults.
分布式一致性模型与我们之前讨论过的事务隔离级别的层次结构之间存在一些相似之处[4,5](参见“弱隔离级别”)。但尽管有一些重叠,它们在很大程度上是相互独立的问题:事务隔离主要是为了避免由并发执行事务引起的竞态条件,而分布式一致性主要是为了在面对延迟和故障时协调副本的状态。
This chapter covers a broad range of topics, but as we shall see, these areas are in fact deeply linked:
本章涵盖了广泛的话题,但正如我们将看到的,这些领域实际上是紧密关联的:
-
We will start by looking at one of the strongest consistency models in common use, linearizability , and examine its pros and cons.
我们将从最常用的一种最强一致性模型——线性一致性开始,探讨它的优缺点。
-
We’ll then examine the issue of ordering events in a distributed system ( “Ordering Guarantees” ), particularly around causality and total ordering.
然后我们将研究分布式系统中事件排序的问题(“排序保证”),特别是关于因果关系和全序的问题。
-
In the third section ( “Distributed Transactions and Consensus” ) we will explore how to atomically commit a distributed transaction, which will finally lead us toward solutions for the consensus problem.
在第三部分(“分布式事务和共识”)中,我们将探讨如何原子地提交分布式事务,最终将带领我们走向解决共识问题的解决方案。
Linearizability
In an eventually consistent database, if you ask two different replicas the same question at the same time, you may get two different answers. That’s confusing. Wouldn’t it be a lot simpler if the database could give the illusion that there is only one replica (i.e., only one copy of the data)? Then every client would have the same view of the data, and you wouldn’t have to worry about replication lag.
在最终一致性的数据库中,如果您在同一时刻向两个不同的副本询问同一个问题,可能会得到两个不同的答案,这很令人困惑。如果数据库能够制造出只有一个副本(即只有一份数据拷贝)的假象,岂不是简单得多?那样每个客户端都将看到相同的数据视图,您也就不必担心复制延迟了。
This is the idea behind linearizability [ 6 ] (also known as atomic consistency [ 7 ], strong consistency , immediate consistency , or external consistency [ 8 ]). The exact definition of linearizability is quite subtle, and we will explore it in the rest of this section. But the basic idea is to make a system appear as if there were only one copy of the data, and all operations on it are atomic. With this guarantee, even though there may be multiple replicas in reality, the application does not need to worry about them.
这就是线性一致性[6]背后的理念(也称为原子一致性[7]、强一致性、即时一致性或外部一致性[8])。线性一致性的确切定义相当微妙,我们将在本节的其余部分中探讨它。但基本思想是使系统看起来好像只有一份数据副本,并且对它的所有操作都是原子的。有了这个保证,即使现实中可能有多个副本,应用程序也不必担心它们。
In a linearizable system, as soon as one client successfully completes a write, all clients reading from the database must be able to see the value just written. Maintaining the illusion of a single copy of the data means guaranteeing that the value read is the most recent, up-to-date value, and doesn’t come from a stale cache or replica. In other words, linearizability is a recency guarantee . To clarify this idea, let’s look at an example of a system that is not linearizable.
在一个可线性化的系统中,只要一个客户端成功地完成了写操作,所有从数据库中读取的客户端都必须能够看到刚刚写入的值。保持数据单一副本的幻觉意味着保证读取的值是最新的、最新的,而不是来自过时的高速缓存或副本。换句话说,可线性化是一个最近性保证。为了澄清这个想法,让我们看一个不可线性化的系统的例子。
Figure 9-1 shows an example of a nonlinearizable sports website [ 9 ]. Alice and Bob are sitting in the same room, both checking their phones to see the outcome of the 2014 FIFA World Cup final. Just after the final score is announced, Alice refreshes the page, sees the winner announced, and excitedly tells Bob about it. Bob incredulously hits reload on his own phone, but his request goes to a database replica that is lagging, and so his phone shows that the game is still ongoing.
图9-1展示了一个无法线性化的体育网站示例[9]。爱丽丝和鲍勃坐在同一个房间里,都在检查手机来查看2014年世界杯足球比赛的结果。在最终比分宣布之后,爱丽丝刷新页面,看到了获胜者的宣布,并激动地告诉鲍勃。鲍勃不可思议地在自己的手机上点击刷新,但他的请求发到了一个滞后的数据库副本,因此他的手机显示比赛仍在进行中。
If Alice and Bob had hit reload at the same time, it would have been less surprising if they had gotten two different query results, because they wouldn’t know at exactly what time their respective requests were processed by the server. However, Bob knows that he hit the reload button (initiated his query) after he heard Alice exclaim the final score, and therefore he expects his query result to be at least as recent as Alice’s. The fact that his query returned a stale result is a violation of linearizability.
如果Alice和Bob同时点击刷新,得到两个不同的查询结果并不会太令人惊讶,因为他们不知道服务器确切在何时处理了各自的请求。然而,Bob知道他是在听到Alice喊出最终比分之后才点击刷新按钮(发起查询)的,因此他期望自己的查询结果至少和Alice的一样新。他的查询返回了过时的结果,这违反了线性一致性。
What Makes a System Linearizable?
The basic idea behind linearizability is simple: to make a system appear as if there is only a single copy of the data. However, nailing down precisely what that means actually requires some care. In order to understand linearizability better, let’s look at some more examples.
线性化的基本思想很简单:让系统看起来只有一个数据副本。不过,确切地理解这意味着仍然需要一些谨慎。为了更好地理解线性化,让我们看一些更多的例子。
Figure 9-2 shows three clients concurrently reading and writing the same key x in a linearizable database. In the distributed systems literature, x is called a register —in practice, it could be one key in a key-value store, one row in a relational database, or one document in a document database, for example.
图9-2展示了三个客户端同时在一个可线性化的数据库中读取和写入同一个键x。在分布式系统文献中,x被称为寄存器,在实践中,它可能是键值存储中的一个键,关系数据库中的一行或文档数据库中的一个文档,等等。
For simplicity, Figure 9-2 shows only the requests from the clients’ point of view, not the internals of the database. Each bar is a request made by a client, where the start of a bar is the time when the request was sent, and the end of a bar is when the response was received by the client. Due to variable network delays, a client doesn’t know exactly when the database processed its request—it only knows that it must have happened sometime between the client sending the request and receiving the response. i
简单起见,图9-2仅显示来自客户端的请求,而不显示数据库的内部情况。每个条形图表示客户端发出的一个请求,其中条形图的起始点是请求发送的时间,终点是客户端接收响应的时间。由于网络延迟的变化,客户机不知道数据库何时处理了请求,它只知道这必须发生在客户端发送请求和接收响应之间的某个时间。
In this example, the register has two types of operations:
在这个例子中,该寄存器有两种类型的操作:
-
read ( x ) ⇒ v means the client requested to read the value of register x , and the database returned the value v .
read(x) ⇒ v 表示客户端请求读取寄存器x的值,数据库返回值v。
-
write ( x , v ) ⇒ r means the client requested to set the register x to value v , and the database returned response r (which could be ok or error ).
write(x, v) ⇒ r 表示客户端请求将寄存器x设置为值v,数据库返回响应r(可能是ok或error)。
In Figure 9-2 , the value of x is initially 0, and client C performs a write request to set it to 1. While this is happening, clients A and B are repeatedly polling the database to read the latest value. What are the possible responses that A and B might get for their read requests?
在图9-2中,x的值最初为0,客户端C执行写请求将其设置为1。在此期间,客户端A和B正在反复轮询数据库以读取最新值。A和B可能会得到什么样的读取响应?
-
The first read operation by client A completes before the write begins, so it must definitely return the old value 0.
客户端A的第一次读操作在写入开始前完成,因此它一定会返回旧值0。
-
The last read by client A begins after the write has completed, so it must definitely return the new value 1 if the database is linearizable: we know that the write must have been processed sometime between the start and end of the write operation, and the read must have been processed sometime between the start and end of the read operation. If the read started after the write ended, then the read must have been processed after the write, and therefore it must see the new value that was written.
客户端A的最后一次读取是在写入完成之后开始的,因此如果数据库是线性一致的,它一定会返回新值1:我们知道写入必定是在写操作开始和结束之间的某个时刻被处理的,读取也必定是在读操作开始和结束之间的某个时刻被处理的。如果读取在写入结束之后才开始,那么读取一定是在写入之后被处理的,因此它必须看到写入的新值。
-
Any read operations that overlap in time with the write operation might return either 0 or 1, because we don’t know whether or not the write has taken effect at the time when the read operation is processed. These operations are concurrent with the write.
任何与写操作重叠的读取操作可能会返回0或1,因为我们不知道读操作在处理时写操作是否已经生效。这些操作与写操作并发进行。
However, that is not yet sufficient to fully describe linearizability: if reads that are concurrent with a write can return either the old or the new value, then readers could see a value flip back and forth between the old and the new value several times while a write is going on. That is not what we expect of a system that emulates a “single copy of the data.” ii
然而,这还不足以完整描述线性一致性:如果与写入并发的读取可以返回旧值或新值,那么在写入进行期间,读者可能看到值在旧值和新值之间多次来回翻转。这不符合我们对模拟“单一数据副本”的系统的预期。
To make the system linearizable, we need to add another constraint, illustrated in Figure 9-3 .
为使该系统可线性化,我们需要添加另一个约束条件,如图9-3所示。
In a linearizable system we imagine that there must be some point in time (between the start and end of the write operation) at which the value of x atomically flips from 0 to 1. Thus, if one client’s read returns the new value 1, all subsequent reads must also return the new value, even if the write operation has not yet completed.
在线性一致的系统中,我们设想在写操作的开始和结束之间必定存在某个时间点,在该点上x的值原子地从0翻转到1。因此,如果一个客户端的读取返回了新值1,那么所有后续的读取也必须返回新值,即使写操作尚未完成。
This timing dependency is illustrated with an arrow in Figure 9-3 . Client A is the first to read the new value, 1. Just after A’s read returns, B begins a new read. Since B’s read occurs strictly after A’s read, it must also return 1, even though the write by C is still ongoing. (It’s the same situation as with Alice and Bob in Figure 9-1 : after Alice has read the new value, Bob also expects to read the new value.)
在图9-3中,这种时序依赖性用箭头表示。 客户端A首先读取了新值1。在A的读取返回后,B开始了新的读取。 由于B的读取严格发生在A的读取之后,即使C的写入仍在进行中,它也必须返回1。(这与图9-1中的Alice和Bob的情况相同:在Alice读取新值后,Bob也希望读取新值。)
We can further refine this timing diagram to visualize each operation taking effect atomically at some point in time. A more complex example is shown in Figure 9-4 [ 10 ].
我们可以进一步细化这个时序图,将每个操作在某个时间点原子生效的过程可视化。图9-4展示了一个更复杂的例子[10]。
In Figure 9-4 we add a third type of operation besides read and write :
在图9-4中,除了读和写之外,我们添加了第三种操作:
-
cas(x, v_old, v_new) ⇒ r means the client requested an atomic compare-and-set operation (see “Compare-and-set” ). If the current value of the register x equals v_old, it should be atomically set to v_new. If x ≠ v_old then the operation should leave the register unchanged and return an error. r is the database’s response ( ok or error ).
cas(x, v_old, v_new) ⇒ r 表示客户端请求一次原子的比较并设置操作(参见“比较并设置”)。如果寄存器x的当前值等于v_old,则应将其原子地设置为v_new。如果x ≠ v_old,则该操作应保持寄存器不变并返回错误。r是数据库的响应(ok或error)。
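The three register operations above can be sketched as a toy single-copy register. This is only an illustration of the interface, under the assumption of a single in-process copy guarded by a mutex; a real linearizable register would be replicated and fault-tolerant:

```python
import threading

class Register:
    """A single-copy register with the three operations described
    above: read, write, and compare-and-set (cas). A mutex makes
    each operation atomic within one process; this sketch says
    nothing about replication or fault tolerance."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def write(self, v):
        with self._lock:
            self._value = v
            return "ok"

    def cas(self, v_old, v_new):
        # Atomically: set to v_new only if the current value is v_old.
        with self._lock:
            if self._value == v_old:
                self._value = v_new
                return "ok"
            return "error"

r = Register(0)
assert r.write(1) == "ok"
assert r.cas(1, 2) == "ok"     # succeeds: current value matched v_old
assert r.cas(1, 3) == "error"  # fails: value is now 2, not 1
assert r.read() == 2
```

The failing cas call mirrors client D in Figure 9-4: by the time the database processes the request, the register no longer holds the expected old value.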
Each operation in Figure 9-4 is marked with a vertical line (inside the bar for each operation) at the time when we think the operation was executed. Those markers are joined up in a sequential order, and the result must be a valid sequence of reads and writes for a register (every read must return the value set by the most recent write).
图9-4中的每个操作都在我们认为该操作被执行的时刻标有一条竖线(位于每个操作的条形内)。这些标记按先后顺序连接起来,其结果必须是一个有效的寄存器读写序列(每次读取都必须返回最近一次写入所设置的值)。
The requirement of linearizability is that the lines joining up the operation markers always move forward in time (from left to right), never backward. This requirement ensures the recency guarantee we discussed earlier: once a new value has been written or read, all subsequent reads see the value that was written, until it is overwritten again.
线性一致性的要求是,连接操作标记的线总是随时间向前移动(从左到右),而永不后退。这一要求确保了我们之前讨论的最新性保证:一旦新值被写入或读取,所有后续的读取都会看到该值,直到它再次被覆盖。
There are a few interesting details to point out in Figure 9-4 :
图9-4中有一些有趣的细节需要指出:
-
First client B sent a request to read x , then client D sent a request to set x to 0, and then client A sent a request to set x to 1. Nevertheless, the value returned to B’s read is 1 (the value written by A). This is okay: it means that the database first processed D’s write, then A’s write, and finally B’s read. Although this is not the order in which the requests were sent, it’s an acceptable order, because the three requests are concurrent. Perhaps B’s read request was slightly delayed in the network, so it only reached the database after the two writes.
首先客户端B发送了读取x的请求,然后客户端D发送了将x设置为0的请求,接着客户端A发送了将x设置为1的请求。然而,B的读取返回的值是1(由A写入的值)。这没问题:这意味着数据库首先处理了D的写入,然后是A的写入,最后是B的读取。尽管这不是请求发送的顺序,但这是一个可接受的顺序,因为这三个请求是并发的。也许B的读取请求在网络中略有延迟,因此它在两个写入之后才到达数据库。
-
Client B’s read returned 1 before client A received its response from the database, saying that the write of the value 1 was successful. This is also okay: it doesn’t mean the value was read before it was written, it just means the ok response from the database to client A was slightly delayed in the network.
客户端B的读取在客户端A收到数据库的响应(告知值1写入成功)之前就返回了1。这也没有问题:这并不意味着值在写入之前就被读取了,只是意味着数据库发给客户端A的ok响应在网络中稍有延迟。
-
This model doesn’t assume any transaction isolation: another client may change a value at any time. For example, C first reads 1 and then reads 2, because the value was changed by B between the two reads. An atomic compare-and-set ( cas ) operation can be used to check the value hasn’t been concurrently changed by another client: B and C’s cas requests succeed, but D’s cas request fails (by the time the database processes it, the value of x is no longer 0).
这种模型不假设事务隔离:另一个客户端可以随时更改一个值。例如,C首先读取1,然后读取2,因为该值在两次读取之间被B更改。原子比较和交换(cas)操作可用于检查该值是否已被另一个客户端并发更改:B和C的cas请求成功,但D的cas请求失败(当数据库处理它时,x的值不再是0)。
-
The final read by client B (in a shaded bar) is not linearizable. The operation is concurrent with C’s cas write, which updates x from 2 to 4. In the absence of other requests, it would be okay for B’s read to return 2. However, client A has already read the new value 4 before B’s read started, so B is not allowed to read an older value than A. Again, it’s the same situation as with Alice and Bob in Figure 9-1 .
客户端B的最后一次读取(在阴影条中)不是线性一致的。该操作与C的cas写入并发,后者将x从2更新为4。在没有其他请求的情况下,B的读取返回2是可以的。然而,客户端A在B的读取开始之前已经读到了新值4,因此不允许B读到比A更旧的值。这又是与图9-1中Alice和Bob相同的情况。
That is the intuition behind linearizability; the formal definition [ 6 ] describes it more precisely. It is possible (though computationally expensive) to test whether a system’s behavior is linearizable by recording the timings of all requests and responses, and checking whether they can be arranged into a valid sequential order [ 11 ].
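That brute-force test can be sketched directly: enumerate every total order of the recorded operations, and accept if some order both respects real-time precedence and satisfies sequential register semantics. A toy Python version (the operation format and timings here are invented for illustration):

```python
from itertools import permutations

# Each operation is (start, end, kind, value); kind is 'write' or 'read'.

def respects_realtime(order, history):
    # If one operation completed before another began, it must come first.
    for i in range(len(order)):
        for j in range(i + 1, len(order)):
            earlier, later = history[order[i]], history[order[j]]
            if later[1] < earlier[0]:   # `later` finished before `earlier` began
                return False
    return True

def matches_register(order, history):
    value = None
    for idx in order:
        _, _, kind, v = history[idx]
        if kind == 'write':
            value = v
        elif value != v:                # a read must return the latest write
            return False
    return True

def is_linearizable(history):
    return any(respects_realtime(o, history) and matches_register(o, history)
               for o in permutations(range(len(history))))

history = [
    (0, 0, 'write', 0),   # initial value x = 0
    (1, 10, 'write', 1),  # a slow write of x = 1, overlapping both reads
    (2, 4, 'read', 1),    # client A reads the new value
    (5, 7, 'read', 0),    # client B starts after A finished, yet reads 0
]
assert not is_linearizable(history)   # B read older data than A: not allowed
```

The enumeration is factorial in the number of operations, which is why such checking is computationally expensive in practice.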
Relying on Linearizability
In what circumstances is linearizability useful? Viewing the final score of a sporting match is perhaps a frivolous example: a result that is outdated by a few seconds is unlikely to cause any real harm in this situation. However, there are a few areas in which linearizability is an important requirement for making a system work correctly.
Locking and leader election
A system that uses single-leader replication needs to ensure that there is indeed only one leader, not several (split brain). One way of electing a leader is to use a lock: every node that starts up tries to acquire the lock, and the one that succeeds becomes the leader [ 14 ]. No matter how this lock is implemented, it must be linearizable: all nodes must agree which node owns the lock; otherwise it is useless.
Coordination services like Apache ZooKeeper [15] and etcd [16] are often used to implement distributed locks and leader election. They use consensus algorithms to implement linearizable operations in a fault-tolerant way (we discuss such algorithms later in this chapter, in “Fault-Tolerant Consensus”). There are still many subtle details to implementing locks and leader election correctly (see for example the fencing issue in “The leader and the lock”), and libraries like Apache Curator [17] help by providing higher-level recipes on top of ZooKeeper. However, a linearizable storage service is the basic foundation for these coordination tasks.
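As a sketch of the idea (not the real ZooKeeper or etcd API), leader election reduces to an atomic compare-and-set on a linearizable store: the lock is free only if the key is unset, and because the store is linearizable, all nodes agree on who won.

```python
class LinearizableStore:
    """Stand-in for a linearizable coordination service (hypothetical API)."""

    def __init__(self):
        self._data = {}

    def cas(self, key, expected, new):
        # In ZooKeeper or etcd this would be a consensus-backed operation;
        # a plain dict here illustrates only the semantics.
        if self._data.get(key) != expected:
            return False
        self._data[key] = new
        return True

def try_become_leader(store, node_id):
    # Acquire the lock by setting it only if nobody holds it yet.
    return store.cas('leader-lock', None, node_id)

store = LinearizableStore()
assert try_become_leader(store, 'node-1') is True   # node-1 wins the election
assert try_become_leader(store, 'node-2') is False  # node-2 sees the lock is taken
```

If the store were not linearizable, two nodes could both see the lock as free and both believe themselves leader, which is exactly the split-brain scenario above.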
Distributed locking is also used at a much more granular level in some distributed databases, such as Oracle Real Application Clusters (RAC) [ 18 ]. RAC uses a lock per disk page, with multiple nodes sharing access to the same disk storage system. Since these linearizable locks are on the critical path of transaction execution, RAC deployments usually have a dedicated cluster interconnect network for communication between database nodes.
Constraints and uniqueness guarantees
Uniqueness constraints are common in databases: for example, a username or email address must uniquely identify one user, and in a file storage service there cannot be two files with the same path and filename. If you want to enforce this constraint as the data is written (such that if two people try to concurrently create a user or a file with the same name, one of them will be returned an error), you need linearizability.
This situation is actually similar to a lock: when a user registers for your service, you can think of them acquiring a “lock” on their chosen username. The operation is also very similar to an atomic compare-and-set, setting the username to the ID of the user who claimed it, provided that the username is not already taken.
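A minimal sketch of that idea (an in-memory dict stands in for linearizable storage; the names are hypothetical):

```python
usernames = {}   # username -> user ID; must be linearizable in a real system

def claim_username(username, user_id):
    """Atomically take the 'lock' on a username: set it to the claiming
    user's ID only if the name is not already taken (a compare-and-set)."""
    if username in usernames:
        return False
    usernames[username] = user_id
    return True

assert claim_username('martin', 'user-1') is True    # first claim succeeds
assert claim_username('martin', 'user-2') is False   # duplicate is rejected
```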
Similar issues arise if you want to ensure that a bank account balance never goes negative, or that you don’t sell more items than you have in stock in the warehouse, or that two people don’t concurrently book the same seat on a flight or in a theater. These constraints all require there to be a single up-to-date value (the account balance, the stock level, the seat occupancy) that all nodes agree on.
In real applications, it is sometimes acceptable to treat such constraints loosely (for example, if a flight is overbooked, you can move customers to a different flight and offer them compensation for the inconvenience). In such cases, linearizability may not be needed, and we will discuss such loosely interpreted constraints in “Timeliness and Integrity” .
However, a hard uniqueness constraint, such as the one you typically find in relational databases, requires linearizability. Other kinds of constraints, such as foreign key or attribute constraints, can be implemented without requiring linearizability [ 19 ].
Cross-channel timing dependencies
Notice a detail in Figure 9-1 : if Alice hadn’t exclaimed the score, Bob wouldn’t have known that the result of his query was stale. He would have just refreshed the page again a few seconds later, and eventually seen the final score. The linearizability violation was only noticed because there was an additional communication channel in the system (Alice’s voice to Bob’s ears).
Similar situations can arise in computer systems. For example, say you have a website where users can upload a photo, and a background process resizes the photos to lower resolution for faster download (thumbnails). The architecture and dataflow of this system is illustrated in Figure 9-5 .
The image resizer needs to be explicitly instructed to perform a resizing job, and this instruction is sent from the web server to the resizer via a message queue (see Chapter 11 ). The web server doesn’t place the entire photo on the queue, since most message brokers are designed for small messages, and a photo may be several megabytes in size. Instead, the photo is first written to a file storage service, and once the write is complete, the instruction to the resizer is placed on the queue.
If the file storage service is linearizable, then this system should work fine. If it is not linearizable, there is the risk of a race condition: the message queue (steps 3 and 4 in Figure 9-5 ) might be faster than the internal replication inside the storage service. In this case, when the resizer fetches the image (step 5), it might see an old version of the image, or nothing at all. If it processes an old version of the image, the full-size and resized images in the file storage become permanently inconsistent.
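The race can be sketched as follows: if the storage service's read path lags behind the message queue, the resizer dequeues the job before the photo is visible. The file names and the manual replication step are illustrative, not any particular service's behavior:

```python
import queue

class LaggyStorage:
    """Nonlinearizable file store: writes land on the primary, and
    replication to the read replica happens asynchronously (modeled
    here as an explicit replicate() step)."""

    def __init__(self):
        self.primary = {}
        self.read_replica = {}

    def write(self, path, data):
        self.primary[path] = data          # not yet visible to readers

    def replicate(self):
        self.read_replica.update(self.primary)

    def read(self, path):
        return self.read_replica.get(path)

storage = LaggyStorage()
jobs = queue.Queue()

# Web server: store the photo, then enqueue the resize instruction.
storage.write('photos/42.jpg', b'full-size bytes')
jobs.put('photos/42.jpg')

# Resizer: the queue outruns internal replication, so the fetch finds nothing.
path = jobs.get()
assert storage.read(path) is None          # race: stale (empty) view

storage.replicate()                        # replication completes too late
assert storage.read(path) == b'full-size bytes'
```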
This problem arises because there are two different communication channels between the web server and the resizer: the file storage and the message queue. Without the recency guarantee of linearizability, race conditions between these two channels are possible. This situation is analogous to Figure 9-1 , where there was also a race condition between two communication channels: the database replication and the real-life audio channel between Alice’s mouth and Bob’s ears.
Linearizability is not the only way of avoiding this race condition, but it’s the simplest to understand. If you control the additional communication channel (like in the case of the message queue, but not in the case of Alice and Bob), you can use alternative approaches similar to what we discussed in “Reading Your Own Writes” , at the cost of additional complexity.
Implementing Linearizable Systems
Now that we’ve looked at a few examples in which linearizability is useful, let’s think about how we might implement a system that offers linearizable semantics.
Since linearizability essentially means “behave as though there is only a single copy of the data, and all operations on it are atomic,” the simplest answer would be to really only use a single copy of the data. However, that approach would not be able to tolerate faults: if the node holding that one copy failed, the data would be lost, or at least inaccessible until the node was brought up again.
The most common approach to making a system fault-tolerant is to use replication. Let’s revisit the replication methods from Chapter 5 , and compare whether they can be made linearizable:
- Single-leader replication (potentially linearizable)
In a system with single-leader replication (see “Leaders and Followers”), the leader has the primary copy of the data that is used for writes, and the followers maintain backup copies of the data on other nodes. If you make reads from the leader, or from synchronously updated followers, they have the potential to be linearizable. However, not every single-leader database is actually linearizable, either by design (e.g., because it uses snapshot isolation) or due to concurrency bugs [10].
Using the leader for reads relies on the assumption that you know for sure who the leader is. As discussed in “The Truth Is Defined by the Majority” , it is quite possible for a node to think that it is the leader, when in fact it is not—and if the delusional leader continues to serve requests, it is likely to violate linearizability [ 20 ]. With asynchronous replication, failover may even lose committed writes (see “Handling Node Outages” ), which violates both durability and linearizability.
- Consensus algorithms (linearizable)
Some consensus algorithms, which we will discuss later in this chapter, bear a resemblance to single-leader replication. However, consensus protocols contain measures to prevent split brain and stale replicas. Thanks to these details, consensus algorithms can implement linearizable storage safely. This is how ZooKeeper [ 21 ] and etcd [ 22 ] work, for example.
- Multi-leader replication (not linearizable)
Systems with multi-leader replication are generally not linearizable, because they concurrently process writes on multiple nodes and asynchronously replicate them to other nodes. For this reason, they can produce conflicting writes that require resolution (see “Handling Write Conflicts” ). Such conflicts are an artifact of the lack of a single copy of the data.
- Leaderless replication (probably not linearizable)
For systems with leaderless replication (Dynamo-style; see “Leaderless Replication” ), people sometimes claim that you can obtain “strong consistency” by requiring quorum reads and writes ( w + r > n ). Depending on the exact configuration of the quorums, and depending on how you define strong consistency, this is not quite true.
“Last write wins” conflict resolution methods based on time-of-day clocks (e.g., in Cassandra; see “Relying on Synchronized Clocks” ) are almost certainly nonlinearizable, because clock timestamps cannot be guaranteed to be consistent with actual event ordering due to clock skew. Sloppy quorums ( “Sloppy Quorums and Hinted Handoff” ) also ruin any chance of linearizability. Even with strict quorums, nonlinearizable behavior is possible, as demonstrated in the next section.
Linearizability and quorums
Intuitively, it seems as though strict quorum reads and writes should be linearizable in a Dynamo-style model. However, when we have variable network delays, it is possible to have race conditions, as demonstrated in Figure 9-6 .
In Figure 9-6 , the initial value of x is 0, and a writer client is updating x to 1 by sending the write to all three replicas ( n = 3, w = 3). Concurrently, client A reads from a quorum of two nodes ( r = 2) and sees the new value 1 on one of the nodes. Also concurrently with the write, client B reads from a different quorum of two nodes, and gets back the old value 0 from both.
The quorum condition is met ( w + r > n ), but this execution is nevertheless not linearizable: B’s request begins after A’s request completes, but B returns the old value while A returns the new value. (It’s once again the Alice and Bob situation from Figure 9-1 .)
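The execution in Figure 9-6 can be reproduced with a toy model: the write of x = 1 has reached only one of the three replicas when the two quorum reads happen. The "newest value wins" rule is simplified here to taking the maximum, which works because only the values 0 and 1 are in play:

```python
replicas = [{'x': 0}, {'x': 0}, {'x': 0}]   # n = 3, w = 3, r = 2

def quorum_read(ids):
    # Read from a quorum of r = 2 replicas and keep the newest value
    # (max() picks the newer of 0 and 1 in this toy model).
    return max(replicas[i]['x'] for i in ids)

replicas[0]['x'] = 1     # the write of x = 1 has reached only replica 0 so far

a = quorum_read([0, 1])  # client A's quorum includes the updated replica
b = quorum_read([1, 2])  # client B's quorum sees only stale replicas

assert (a, b) == (1, 0)  # w + r > n held, yet B read an older value than A
```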
Interestingly, it is possible to make Dynamo-style quorums linearizable at the cost of reduced performance: a reader must perform read repair (see “Read repair and anti-entropy” ) synchronously, before returning results to the application [ 23 ], and a writer must read the latest state of a quorum of nodes before sending its writes [ 24 , 25 ]. However, Riak does not perform synchronous read repair due to the performance penalty [ 26 ]. Cassandra does wait for read repair to complete on quorum reads [ 27 ], but it loses linearizability if there are multiple concurrent writes to the same key, due to its use of last-write-wins conflict resolution.
Moreover, only linearizable read and write operations can be implemented in this way; a linearizable compare-and-set operation cannot, because it requires a consensus algorithm [ 28 ].
In summary, it is safest to assume that a leaderless system with Dynamo-style replication does not provide linearizability.
The Cost of Linearizability
As some replication methods can provide linearizability and others cannot, it is interesting to explore the pros and cons of linearizability in more depth.
We already discussed some use cases for different replication methods in Chapter 5 ; for example, we saw that multi-leader replication is often a good choice for multi-datacenter replication (see “Multi-datacenter operation” ). An example of such a deployment is illustrated in Figure 9-7 .
Consider what happens if there is a network interruption between the two datacenters. Let’s assume that the network within each datacenter is working, and clients can reach the datacenters, but the datacenters cannot connect to each other.
With a multi-leader database, each datacenter can continue operating normally: since writes from one datacenter are asynchronously replicated to the other, the writes are simply queued up and exchanged when network connectivity is restored.
On the other hand, if single-leader replication is used, then the leader must be in one of the datacenters. Any writes and any linearizable reads must be sent to the leader—thus, for any clients connected to a follower datacenter, those read and write requests must be sent synchronously over the network to the leader datacenter.
If the network between datacenters is interrupted in a single-leader setup, clients connected to follower datacenters cannot contact the leader, so they cannot make any writes to the database, nor any linearizable reads. They can still make reads from the follower, but they might be stale (nonlinearizable). If the application requires linearizable reads and writes, the network interruption causes the application to become unavailable in the datacenters that cannot contact the leader.
If clients can connect directly to the leader datacenter, this is not a problem, since the application continues to work normally there. But clients that can only reach a follower datacenter will experience an outage until the network link is repaired.
The CAP theorem
This issue is not just a consequence of single-leader and multi-leader replication: any linearizable database has this problem, no matter how it is implemented. The issue also isn’t specific to multi-datacenter deployments, but can occur on any unreliable network, even within one datacenter. The trade-off is as follows:
If your application requires linearizability, and some replicas are disconnected from the other replicas due to a network problem, then some replicas cannot process requests while they are disconnected: they must either wait until the network problem is fixed, or return an error (either way, they become unavailable ).
If your application does not require linearizability, then it can be written in a way that each replica can process requests independently, even if it is disconnected from other replicas (e.g., multi-leader). In this case, the application can remain available in the face of a network problem, but its behavior is not linearizable.
Thus, applications that don’t require linearizability can be more tolerant of network problems. This insight is popularly known as the CAP theorem [ 29 , 30 , 31 , 32 ], named by Eric Brewer in 2000, although the trade-off has been known to designers of distributed databases since the 1970s [ 33 , 34 , 35 , 36 ].
CAP was originally proposed as a rule of thumb, without precise definitions, with the goal of starting a discussion about trade-offs in databases. At the time, many distributed databases focused on providing linearizable semantics on a cluster of machines with shared storage [ 18 ], and CAP encouraged database engineers to explore a wider design space of distributed shared-nothing systems, which were more suitable for implementing large-scale web services [ 37 ]. CAP deserves credit for this culture shift—witness the explosion of new database technologies since the mid-2000s (known as NoSQL).
The CAP theorem as formally defined [30] is of very narrow scope: it only considers one consistency model (namely linearizability) and one kind of fault (network partitions, i.e., nodes that are alive but disconnected from each other). It doesn’t say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP has been historically influential, it has little practical value for designing systems [9, 40].
There are many more interesting impossibility results in distributed systems [ 41 ], and CAP has now been superseded by more precise results [ 2 , 42 ], so it is of mostly historical interest today.
Linearizability and network delays
Although linearizability is a useful guarantee, surprisingly few systems are actually linearizable in practice. For example, even RAM on a modern multi-core CPU is not linearizable [ 43 ]: if a thread running on one CPU core writes to a memory address, and a thread on another CPU core reads the same address shortly afterward, it is not guaranteed to read the value written by the first thread (unless a memory barrier or fence [ 44 ] is used).
The reason for this behavior is that every CPU core has its own memory cache and store buffer. Memory access first goes to the cache by default, and any changes are asynchronously written out to main memory. Since accessing data in the cache is much faster than going to main memory [ 45 ], this feature is essential for good performance on modern CPUs. However, there are now several copies of the data (one in main memory, and perhaps several more in various caches), and these copies are asynchronously updated, so linearizability is lost.
Why make this trade-off? It makes no sense to use the CAP theorem to justify the multi-core memory consistency model: within one computer we usually assume reliable communication, and we don’t expect one CPU core to be able to continue operating normally if it is disconnected from the rest of the computer. The reason for dropping linearizability is performance , not fault tolerance.
The same is true of many distributed databases that choose not to provide linearizable guarantees: they do so primarily to increase performance, not so much for fault tolerance [ 46 ]. Linearizability is slow—and this is true all the time, not only during a network fault.
Can’t we maybe find a more efficient implementation of linearizable storage? It seems the answer is no: Attiya and Welch [ 47 ] prove that if you want linearizability, the response time of read and write requests is at least proportional to the uncertainty of delays in the network. In a network with highly variable delays, like most computer networks (see “Timeouts and Unbounded Delays” ), the response time of linearizable reads and writes is inevitably going to be high. A faster algorithm for linearizability does not exist, but weaker consistency models can be much faster, so this trade-off is important for latency-sensitive systems. In Chapter 12 we will discuss some approaches for avoiding linearizability without sacrificing correctness.
Ordering Guarantees
We said previously that a linearizable register behaves as if there is only a single copy of the data, and that every operation appears to take effect atomically at one point in time. This definition implies that operations are executed in some well-defined order. We illustrated the ordering in Figure 9-4 by joining up the operations in the order in which they seem to have executed.
Ordering has been a recurring theme in this book, which suggests that it might be an important fundamental idea. Let’s briefly recap some of the other contexts in which we have discussed ordering:
In Chapter 5 we saw that the main purpose of the leader in single-leader replication is to determine the order of writes in the replication log—that is, the order in which followers apply those writes. If there is no single leader, conflicts can occur due to concurrent operations (see “Handling Write Conflicts” ).
Serializability, which we discussed in Chapter 7 , is about ensuring that transactions behave as if they were executed in some sequential order . It can be achieved by literally executing transactions in that serial order, or by allowing concurrent execution while preventing serialization conflicts (by locking or aborting).
The use of timestamps and clocks in distributed systems that we discussed in Chapter 8 (see “Relying on Synchronized Clocks” ) is another attempt to introduce order into a disorderly world, for example to determine which one of two writes happened later.
It turns out that there are deep connections between ordering, linearizability, and consensus. Although this notion is a bit more theoretical and abstract than the rest of this book, it is very helpful for clarifying our understanding of what systems can and cannot do. We will explore this topic in the next few sections.
Ordering and Causality
There are several reasons why ordering keeps coming up, and one of the reasons is that it helps preserve causality . We have already seen several examples over the course of this book where causality has been important:
In “Consistent Prefix Reads” ( Figure 5-5 ) we saw an example where the observer of a conversation saw first the answer to a question, and then the question being answered. This is confusing because it violates our intuition of cause and effect: if a question is answered, then clearly the question had to be there first, because the person giving the answer must have seen the question (assuming they are not psychic and cannot see into the future). We say that there is a causal dependency between the question and the answer.
A similar pattern appeared in Figure 5-9 , where we looked at the replication between three leaders and noticed that some writes could “overtake” others due to network delays. From the perspective of one of the replicas it would look as though there was an update to a row that did not exist. Causality here means that a row must first be created before it can be updated.
In “Detecting Concurrent Writes” we observed that if you have two operations A and B, there are three possibilities: either A happened before B, or B happened before A, or A and B are concurrent. This happened before relationship is another expression of causality: if A happened before B, that means B might have known about A, or built upon A, or depended on A. If A and B are concurrent, there is no causal link between them; in other words, we are sure that neither knew about the other.
In the context of snapshot isolation for transactions ( “Snapshot Isolation and Repeatable Read” ), we said that a transaction reads from a consistent snapshot. But what does “consistent” mean in this context? It means consistent with causality : if the snapshot contains an answer, it must also contain the question being answered [ 48 ]. Observing the entire database at a single point in time makes it consistent with causality: the effects of all operations that happened causally before that point in time are visible, but no operations that happened causally afterward can be seen. Read skew (non-repeatable reads, as illustrated in Figure 7-6 ) means reading data in a state that violates causality.
Our examples of write skew between transactions (see “Write Skew and Phantoms” ) also demonstrated causal dependencies: in Figure 7-8 , Alice was allowed to go off call because the transaction thought that Bob was still on call, and vice versa. In this case, the action of going off call is causally dependent on the observation of who is currently on call. Serializable snapshot isolation (see “Serializable Snapshot Isolation (SSI)” ) detects write skew by tracking the causal dependencies between transactions.
In the example of Alice and Bob watching football ( Figure 9-1 ), the fact that Bob got a stale result from the server after hearing Alice exclaim the result is a causality violation: Alice’s exclamation is causally dependent on the announcement of the score, so Bob should also be able to see the score after hearing Alice. The same pattern appeared again in “Cross-channel timing dependencies” in the guise of an image resizing service.
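The happened-before relationship described above can be checked mechanically, for instance with version vectors; a minimal sketch (the dictionary representation of the clocks is an illustrative choice):

```python
def happens_before(vc_a, vc_b):
    """True if the event with version vector vc_a causally precedes vc_b."""
    keys = set(vc_a) | set(vc_b)
    return vc_a != vc_b and all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys)

def concurrent(vc_a, vc_b):
    # Neither event knew about the other: no causal link in either direction.
    return not happens_before(vc_a, vc_b) and not happens_before(vc_b, vc_a)

a = {'node1': 1}
b = {'node1': 1, 'node2': 1}   # b saw a and built upon it
c = {'node1': 2}               # c also built upon a, independently of b

assert happens_before(a, b)
assert happens_before(a, c)
assert concurrent(b, c)        # neither b nor c knew about the other
```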
Causality imposes an ordering on events: cause comes before effect; a message is sent before that message is received; the question comes before the answer. And, like in real life, one thing leads to another: one node reads some data and then writes something as a result, another node reads the thing that was written and writes something else in turn, and so on. These chains of causally dependent operations define the causal order in the system—i.e., what happened before what.
If a system obeys the ordering imposed by causality, we say that it is causally consistent . For example, snapshot isolation provides causal consistency: when you read from the database, and you see some piece of data, then you must also be able to see any data that causally precedes it (assuming it has not been deleted in the meantime).
如果一个系统遵守因果关系所强加的顺序,我们称它是因果一致的。例如,快照隔离提供了因果一致性:当你从数据库读取时,如果你看到了某条数据,那么你也必须能看到在因果上先于它的任何数据(假设其间该数据没有被删除)。
The causal order is not a total order
A total order allows any two elements to be compared, so if you have two elements, you can always say which one is greater and which one is smaller. For example, natural numbers are totally ordered: if I give you any two numbers, say 5 and 13, you can tell me that 13 is greater than 5.
全序允许比较任意两个元素,所以如果你有两个元素,总能判断哪个更大哪个更小。例如,自然数是全序的:如果我给你任意两个数字,比如5和13,你可以告诉我13比5大。
However, mathematical sets are not totally ordered: is { a , b } greater than { b , c }? Well, you can’t really compare them, because neither is a subset of the other. We say they are incomparable , and therefore mathematical sets are partially ordered : in some cases one set is greater than another (if one set contains all the elements of another), but in other cases they are incomparable.
数学集合并不完全有序:{a,b}是否比{b,c}大?由于它们互不包含,因此实际上无法进行比较。我们称它们是不可比较的,因此数学集合是部分有序的:在某些情况下,一个集合比另一个集合更大(如果一个集合包含另一个集合的所有元素),但在其他情况下,它们是不可比较的。
The difference between a total order and a partial order is reflected in different database consistency models:
全序和偏序之间的区别反映在不同的数据库一致性模型中:
- Linearizability
-
In a linearizable system, we have a total order of operations: if the system behaves as if there is only a single copy of the data, and every operation is atomic, this means that for any two operations we can always say which one happened first. This total ordering is illustrated as a timeline in Figure 9-4 .
在可线性化的系统中,操作存在全序:如果系统的行为就好像只有一个数据副本,并且每个操作都是原子的,这意味着对任意两个操作,我们总能说出哪个先发生。这种全序在图9-4中以时间轴的形式展示。
- Causality
-
We said that two operations are concurrent if neither happened before the other (see “The “happens-before” relationship and concurrency” ). Put another way, two events are ordered if they are causally related (one happened before the other), but they are incomparable if they are concurrent. This means that causality defines a partial order , not a total order: some operations are ordered with respect to each other, but some are incomparable.
我们说过,如果两个操作谁也没有先于对方发生,那么它们就是并发的(参见“‘先发生’关系与并发”)。换句话说,如果两个事件是因果相关的(一个先于另一个发生),它们就是有序的;但如果它们是并发的,就无法比较。这意味着因果关系定义的是偏序而非全序:某些操作相对彼此有序,而另一些则无法比较。
Therefore, according to this definition, there are no concurrent operations in a linearizable datastore: there must be a single timeline along which all operations are totally ordered. There might be several requests waiting to be handled, but the datastore ensures that every request is handled atomically at a single point in time, acting on a single copy of the data, along a single timeline, without any concurrency.
因此,根据这个定义,在一个可线性化的数据存储中不存在并发操作:必须存在一个时间轴,沿该时间轴所有操作都被完全排序。可能有几个请求正在等待处理,但数据存储确保每个请求在单一时间点以原子方式处理,作用于单一数据副本,沿单一时间轴,不包含任何并发。
Concurrency would mean that the timeline branches and merges again—and in this case, operations on different branches are incomparable (i.e., concurrent). We saw this phenomenon in Chapter 5 : for example, Figure 5-14 is not a straight-line total order, but rather a jumble of different operations going on concurrently. The arrows in the diagram indicate causal dependencies—the partial ordering of operations.
并发意味着时间线会分叉然后再次合并,此时不同分支上的操作是无法比较的(即并发的)。我们在第5章见过这种现象:例如,图5-14并不是一条直线式的全序,而是许多并发进行的操作相互交织。图中的箭头表示因果依赖,也就是操作的偏序。
If you are familiar with distributed version control systems such as Git, their version histories are very much like the graph of causal dependencies. Often one commit happens after another, in a straight line, but sometimes you get branches (when several people concurrently work on a project), and merges are created when those concurrently created commits are combined.
如果您熟悉Git等分布式版本控制系统,它们的版本历史就非常像因果依赖关系图。通常一个提交直线式地接在另一个提交之后,但有时会出现分支(当几个人并发地在同一个项目上工作时),而当这些并发创建的提交被组合时,就会产生合并(merge)。
Linearizability is stronger than causal consistency
So what is the relationship between the causal order and linearizability? The answer is that linearizability implies causality: any system that is linearizable will preserve causality correctly [ 7 ]. In particular, if there are multiple communication channels in a system (such as the message queue and the file storage service in Figure 9-5 ), linearizability ensures that causality is automatically preserved without the system having to do anything special (such as passing around timestamps between different components).
那么因果顺序和可线性化之间有什么关系?答案是可线性化蕴含因果性:任何可线性化的系统都会正确地保持因果关系[7]。特别是,如果系统中存在多个通信通道(例如图9-5中的消息队列和文件存储服务),可线性化可以确保因果关系被自动保持,系统无需做任何特殊处理(例如在不同组件之间传递时间戳)。
The fact that linearizability ensures causality is what makes linearizable systems simple to understand and appealing. However, as discussed in “The Cost of Linearizability” , making a system linearizable can harm its performance and availability, especially if the system has significant network delays (for example, if it’s geographically distributed). For this reason, some distributed data systems have abandoned linearizability, which allows them to achieve better performance but can make them difficult to work with.
可线性化确保因果关系这一事实,正是使可线性化系统易于理解、富有吸引力的原因。然而,正如“可线性化的代价”中所讨论的,使系统可线性化可能损害其性能和可用性,尤其是当系统存在显著的网络延迟时(例如地理上分布的系统)。因此,一些分布式数据系统放弃了可线性化,从而获得更好的性能,但代价是可能更难以使用。
The good news is that a middle ground is possible. Linearizability is not the only way of preserving causality—there are other ways too. A system can be causally consistent without incurring the performance hit of making it linearizable (in particular, the CAP theorem does not apply). In fact, causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures [ 2 , 42 ].
好消息是存在一种折中方案。可线性化并不是保持因果关系的唯一方式,还有其他方法。一个系统可以是因果一致的,而无需承担使其可线性化的性能代价(特别是,CAP定理在此并不适用)。事实上,因果一致性是不会因网络延迟而变慢、并且在网络故障时仍然保持可用的最强的一致性模型[2,42]。
In many cases, systems that appear to require linearizability in fact only really require causal consistency, which can be implemented more efficiently. Based on this observation, researchers are exploring new kinds of databases that preserve causality, with performance and availability characteristics that are similar to those of eventually consistent systems [ 49 , 50 , 51 ].
在许多情况下,表面上需要可线性化的系统,实际上只需要因果一致性,而后者可以更高效地实现。基于这一观察,研究人员正在探索保持因果关系的新型数据库,其性能和可用性特性与最终一致性系统相近[49,50,51]。
As this research is quite recent, not much of it has yet made its way into production systems, and there are still challenges to be overcome [ 52 , 53 ]. However, it is a promising direction for future systems.
由于这项研究相对较新,尚未在生产系统中得到广泛应用,仍然存在一些挑战 [52, 53]。然而,这是未来系统发展的一个有前途的方向。
Capturing causal dependencies
We won’t go into all the nitty-gritty details of how nonlinearizable systems can maintain causal consistency here, but just briefly explore some of the key ideas.
我们不会在这里详细讨论非线性系统如何维持因果一致性的所有细节,但只是简要探讨一些关键的思想。
In order to maintain causality, you need to know which operation happened before which other operation. This is a partial order: concurrent operations may be processed in any order, but if one operation happened before another, then they must be processed in that order on every replica. Thus, when a replica processes an operation, it must ensure that all causally preceding operations (all operations that happened before) have already been processed; if some preceding operation is missing, the later operation must wait until the preceding operation has been processed.
为了保持因果性,你需要知道哪个操作先于哪个操作发生。这是一个偏序:并发操作可以以任意顺序处理,但如果一个操作先于另一个发生,那么它们必须在每个副本上以相同的顺序处理。因此,当副本处理一个操作时,必须确保所有在因果上先于它的操作(所有先发生的操作)都已经被处理;如果某个先行操作缺失,后面的操作就必须等待,直到先行操作被处理完毕。
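The waiting logic described above can be sketched as follows. This is a minimal, single-process illustration with assumed names; each operation here carries an explicit set of IDs of the operations it causally depends on, which is a simplification (real systems typically derive dependencies from version vectors or similar metadata):

上述等待逻辑可以用如下草图说明。这是一个最小化的单进程示例,名称均为假设;这里每个操作显式携带其因果依赖的操作ID集合,这是一种简化(真实系统通常从版本向量等元数据中推导依赖):

```python
class Replica:
    """Sketch: apply an operation only after all of its causal predecessors."""

    def __init__(self):
        self.applied_ids = set()
        self.state = []      # the replica's data: applied payloads, in order
        self.pending = []    # operations still waiting for causal predecessors

    def receive(self, op_id, deps, payload):
        # Buffer the operation, then apply everything whose dependencies are met.
        self.pending.append((op_id, frozenset(deps), payload))
        progress = True
        while progress:
            progress = False
            for op in list(self.pending):
                oid, odeps, opayload = op
                if odeps <= self.applied_ids:   # all causal predecessors applied
                    self.state.append(opayload)
                    self.applied_ids.add(oid)
                    self.pending.remove(op)
                    progress = True

# Operations may arrive out of causal order; they are applied in causal order.
r = Replica()
r.receive("B", deps={"A"}, payload="answer")    # buffered: depends on "A"
r.receive("A", deps=set(), payload="question")  # applies "A", then "B"
```

After both calls, `r.state` is `["question", "answer"]`: the answer is never applied before the question, no matter the arrival order.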
In order to determine causal dependencies, we need some way of describing the “knowledge” of a node in the system. If a node had already seen the value X when it issued the write Y, then X and Y may be causally related. The analysis uses the kinds of questions you would expect in a criminal investigation of fraud charges: did the CEO know about X at the time when they made decision Y?
为了确定因果依赖,我们需要某种方式来描述系统中一个节点的“知识”。如果一个节点在发出写入Y时已经看到了值X,那么X和Y可能存在因果关系。这种分析会用到欺诈指控刑事调查中常见的那类问题:CEO在做出决策Y的时候,是否知道X?
The techniques for determining which operation happened before which other operation are similar to what we discussed in “Detecting Concurrent Writes” . That section discussed causality in a leaderless datastore, where we need to detect concurrent writes to the same key in order to prevent lost updates. Causal consistency goes further: it needs to track causal dependencies across the entire database, not just for a single key. Version vectors can be generalized to do this [ 54 ].
确定哪个操作先于哪个操作发生的技术,与我们在“检测并发写入”中讨论的类似。那一节讨论了无领导者数据存储中的因果性:为了防止丢失更新,我们需要检测对同一个键的并发写入。因果一致性则更进一步:它需要跟踪整个数据库范围内的因果依赖,而不仅仅是单个键。版本向量可以推广用于此目的[54]。
In order to determine the causal ordering, the database needs to know which version of the data was read by the application. This is why, in Figure 5-13 , the version number from the prior operation is passed back to the database on a write. A similar idea appears in the conflict detection of SSI, as discussed in “Serializable Snapshot Isolation (SSI)” : when a transaction wants to commit, the database checks whether the version of the data that it read is still up to date. To this end, the database keeps track of which data has been read by which transaction.
为了确定因果关系的顺序,数据库需要知道应用程序读取了哪个数据版本。因此,在图5-13中,先前操作的版本号在写入时传递回数据库。类似的想法出现在SSI的冲突检测中,如“可序列化快照隔离(SSI)”所讨论的那样:当一个事务想要提交时,数据库会检查它所读取的数据版本是否仍然是最新的。为此,数据库跟踪哪些事务已读取了哪些数据。
Sequence Number Ordering
Although causality is an important theoretical concept, actually keeping track of all causal dependencies can become impractical. In many applications, clients read lots of data before writing something, and then it is not clear whether the write is causally dependent on all or only some of those prior reads. Explicitly tracking all the data that has been read would mean a large overhead.
尽管因果关系是一个重要的理论概念,但实际上跟踪所有的因果依赖关系可能变得不切实际。在许多应用中,客户端在写入某些内容之前读取了大量数据,然后就不清楚写入是否对所有或仅某些先前读取的数据有因果依赖关系。显式跟踪所有已读取的数据将意味着巨大的开销。
However, there is a better way: we can use sequence numbers or timestamps to order events. A timestamp need not come from a time-of-day clock (or physical clock, which have many problems, as discussed in “Unreliable Clocks” ). It can instead come from a logical clock , which is an algorithm to generate a sequence of numbers to identify operations, typically using counters that are incremented for every operation.
然而,有一种更好的方法:我们可以使用序列号或时间戳来为事件排序。时间戳不一定来自日历时钟(或物理时钟,它们有许多问题,如“不可靠的时钟”中所述)。它可以来自逻辑时钟,即一种生成数字序列以标识操作的算法,通常使用每次操作都递增的计数器。
Such sequence numbers or timestamps are compact (only a few bytes in size), and they provide a total order : that is, every operation has a unique sequence number, and you can always compare two sequence numbers to determine which is greater (i.e., which operation happened later).
这样的序列号或时间戳非常紧凑(只有几个字节大小),并且提供了全序:也就是说,每个操作都有唯一的序列号,你总是可以比较两个序列号以确定哪个更大(即哪个操作发生得更晚)。
In particular, we can create sequence numbers in a total order that is consistent with causality : vii we promise that if operation A causally happened before B, then A occurs before B in the total order (A has a lower sequence number than B). Concurrent operations may be ordered arbitrarily. Such a total order captures all the causality information, but also imposes more ordering than strictly required by causality.
特别地,我们可以按与因果关系一致的全序来创建序列号:我们保证,如果操作A在因果上先于B,那么A在全序中也先于B(A的序列号小于B)。并发操作之间可以任意排序。这样的全序捕获了所有的因果信息,但也强加了比因果关系严格要求更多的顺序。
In a database with single-leader replication (see “Leaders and Followers” ), the replication log defines a total order of write operations that is consistent with causality. The leader can simply increment a counter for each operation, and thus assign a monotonically increasing sequence number to each operation in the replication log. If a follower applies the writes in the order they appear in the replication log, the state of the follower is always causally consistent (even if it is lagging behind the leader).
在具有单领导者复制的数据库中(请参见“领导者和追随者”),复制日志定义了与因果一致的写操作总序。领导者可以简单地为每个操作增加一个计数器,因此为复制日志中的每个操作分配一个单调递增的序列号。如果追随者按照复制日志中出现的顺序应用写入,则追随者的状态始终是因果一致的(即使滞后于领导者)。
Noncausal sequence number generators
If there is not a single leader (perhaps because you are using a multi-leader or leaderless database, or because the database is partitioned), it is less clear how to generate sequence numbers for operations. Various methods are used in practice:
如果没有单个领导者(可能是因为使用了多个领导者或无领导者数据库,或者因为数据库被分区),则如何生成操作的序列号就不太清楚了。实践中使用各种方法:
-
Each node can generate its own independent set of sequence numbers. For example, if you have two nodes, one node can generate only odd numbers and the other only even numbers. In general, you could reserve some bits in the binary representation of the sequence number to contain a unique node identifier, and this would ensure that two different nodes can never generate the same sequence number.
每个节点可以生成自己独立的序列号集合。例如,如果有两个节点,则一个节点可以仅生成奇数,另一个节点仅生成偶数。通常,您可以保留序列号的二进制表示中的一些位来包含唯一的节点标识符,这将确保两个不同的节点永远不会生成相同的序列号。
-
You can attach a timestamp from a time-of-day clock (physical clock) to each operation [ 55 ]. Such timestamps are not sequential, but if they have sufficiently high resolution, they might be sufficient to totally order operations. This fact is used in the last write wins conflict resolution method (see “Timestamps for ordering events” ).
可以为每个操作附加一个来自日历时钟(物理时钟)的时间戳[55]。这样的时间戳并不连续,但如果分辨率足够高,可能足以构成操作的全序。这一事实被用于“最后写入胜利”的冲突解决方法中(参见“用时间戳排序事件”)。
-
You can preallocate blocks of sequence numbers. For example, node A might claim the block of sequence numbers from 1 to 1,000, and node B might claim the block from 1,001 to 2,000. Then each node can independently assign sequence numbers from its block, and allocate a new block when its supply of sequence numbers begins to run low.
你可以预分配一段序列号的块。例如,节点A可以声称从1到1,000的序列号块,而节点B可以声称从1,001到2,000的块。然后,每个节点可以独立地从其块中分配序列号,并在其序列号供应开始变得不足时分配一个新块。
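These three approaches can be sketched as follows. This is an illustrative, single-process sketch with assumed names (`node_counter`, `BlockAllocator`); a real block allocator would be a shared service rather than a local object:

这三种方法可以用如下草图说明。这是一个示意性的单进程草图,名称均为假设(`node_counter`、`BlockAllocator`);真实的块分配器会是一个共享服务,而不是本地对象:

```python
import itertools
import time

# 1. Interleaved per-node counters: with two nodes, one emits odd numbers
#    and the other even numbers (node_id, num_nodes are assumed parameters).
def node_counter(node_id, num_nodes):
    return itertools.count(start=node_id, step=num_nodes)

odd = node_counter(1, 2)    # node A: 1, 3, 5, ...
even = node_counter(2, 2)   # node B: 2, 4, 6, ...

# 2. Physical-clock timestamps: unique enough if resolution is high,
#    but subject to clock skew across nodes.
def wall_clock_seq():
    return time.time_ns()

# 3. Preallocated blocks: a (hypothetical) allocator hands out ranges;
#    each node then assigns numbers from its own block independently.
class BlockAllocator:
    def __init__(self, block_size=1000):
        self.block_size = block_size
        self.next_start = 1

    def claim_block(self):
        start = self.next_start
        self.next_start += self.block_size
        return range(start, start + self.block_size)

alloc = BlockAllocator()
block_a = alloc.claim_block()   # numbers 1..1000 for node A
block_b = alloc.claim_block()   # numbers 1001..2000 for node B
```

All three generate unique, roughly increasing numbers without coordination per operation, which is exactly why none of them can be consistent with causality, as the text goes on to explain.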
These three options all perform better and are more scalable than pushing all operations through a single leader that increments a counter. They generate a unique, approximately increasing sequence number for each operation. However, they all have a problem: the sequence numbers they generate are not consistent with causality .
这三种选择都比让所有操作经过单一领导者递增计数器的方式性能更好,也更具可扩展性。它们为每个操作生成唯一的、近似递增的序列号。然而,它们都有一个问题:生成的序列号与因果关系不一致。
The causality problems occur because these sequence number generators do not correctly capture the ordering of operations across different nodes:
因果性问题之所以出现,是因为这些序列号生成器不能正确捕捉跨节点的操作顺序:
-
Each node may process a different number of operations per second. Thus, if one node generates even numbers and the other generates odd numbers, the counter for even numbers may lag behind the counter for odd numbers, or vice versa. If you have an odd-numbered operation and an even-numbered operation, you cannot accurately tell which one causally happened first.
每个节点每秒处理的操作数量可能不同。因此,如果一个节点生成偶数而另一个生成奇数,偶数计数器可能落后于奇数计数器,反之亦然。如果你拿到一个奇数编号的操作和一个偶数编号的操作,就无法准确判断哪一个在因果上先发生。
-
Timestamps from physical clocks are subject to clock skew, which can make them inconsistent with causality. For example, see Figure 8-3 , which shows a scenario in which an operation that happened causally later was actually assigned a lower timestamp. viii
物理时钟的时间戳受到时钟偏差的影响,可能与因果关系不一致。例如,参见图8-3,它展示了一个场景,在该场景中,一个因果上更晚发生的操作实际上被分配了一个较低的时间戳。
-
In the case of the block allocator, one operation may be given a sequence number in the range from 1,001 to 2,000, and a causally later operation may be given a number in the range from 1 to 1,000. Here, again, the sequence number is inconsistent with causality.
在块分配器的情况下,一个操作可能被分配1,001到2,000范围内的序列号,而一个因果上更晚的操作却可能被分配1到1,000范围内的序列号。这里,序列号同样与因果关系不一致。
Lamport timestamps
Although the three sequence number generators just described are inconsistent with causality, there is actually a simple method for generating sequence numbers that is consistent with causality. It is called a Lamport timestamp , proposed in 1978 by Leslie Lamport [ 56 ], in what is now one of the most-cited papers in the field of distributed systems.
尽管刚才描述的三种序列号生成器与因果关系不一致,实际上存在一种简单的、与因果关系一致的序列号生成方法。它被称为兰伯特时间戳(Lamport timestamp),由Leslie Lamport于1978年提出[56],该论文现已成为分布式系统领域被引用最多的论文之一。
The use of Lamport timestamps is illustrated in Figure 9-8 . Each node has a unique identifier, and each node keeps a counter of the number of operations it has processed. The Lamport timestamp is then simply a pair of ( counter , node ID ). Two nodes may sometimes have the same counter value, but by including the node ID in the timestamp, each timestamp is made unique.
Lamport时间戳的使用如图9-8所示。每个节点都有一个唯一的标识符,并且每个节点都保留着它已处理的操作数的计数器。 Lamport时间戳只是一个(计数器,节点ID)的对。两个节点有时可能具有相同的计数器值,但通过在时间戳中包含节点ID,每个时间戳都变得唯一。
A Lamport timestamp bears no relationship to a physical time-of-day clock, but it provides total ordering: if you have two timestamps, the one with a greater counter value is the greater timestamp; if the counter values are the same, the one with the greater node ID is the greater timestamp.
Lamport时间戳与物理的日历时钟没有任何关系,但它提供了全序:如果有两个时间戳,计数器值更大的那个是更大的时间戳;如果计数器值相同,则节点ID更大的那个是更大的时间戳。
So far this description is essentially the same as the even/odd counters described in the last section. The key idea about Lamport timestamps, which makes them consistent with causality, is the following: every node and every client keeps track of the maximum counter value it has seen so far, and includes that maximum on every request. When a node receives a request or response with a maximum counter value greater than its own counter value, it immediately increases its own counter to that maximum.
到目前为止,这个描述与上一节的奇偶计数器基本相同。使Lamport时间戳与因果关系一致的关键思想如下:每个节点和每个客户端都跟踪迄今为止所见到的最大计数器值,并在每个请求中附带该最大值。当一个节点收到的请求或响应中的最大计数器值大于自身的计数器值时,它立刻把自己的计数器增加到那个最大值。
This is shown in Figure 9-8 , where client A receives a counter value of 5 from node 2, and then sends that maximum of 5 to node 1. At that time, node 1’s counter was only 1, but it was immediately moved forward to 5, so the next operation had an incremented counter value of 6.
这在图9-8中所示,客户端A从节点2接收了计数器值5,然后将最大值5发送到节点1。当时,节点1的计数器仅为1,但它立即向前移动到5,因此下一个操作的计数器值为6。
As long as the maximum counter value is carried along with every operation, this scheme ensures that the ordering from the Lamport timestamps is consistent with causality, because every causal dependency results in an increased timestamp.
只要每次操作都携带最大计数器值,此方案将确保Lamport时间戳的顺序符合因果关系,因为每个因果依赖都会导致时间戳增加。
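The algorithm can be sketched as follows. This is a minimal illustration with assumed names, mirroring the Figure 9-8 scenario in which node 1's counter jumps forward after seeing node 2's counter value of 5:

该算法可以用如下草图说明。这是一个名称均为假设的最小示例,对应图9-8中节点1在看到节点2的计数器值5之后向前跳跃的场景:

```python
class LamportClock:
    """Minimal sketch of a Lamport clock; names are illustrative."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        # A local event: increment the counter and stamp the operation.
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, counter_seen):
        # On any request/response, fast-forward to the maximum counter seen,
        # then increment for this event.
        self.counter = max(self.counter, counter_seen)
        return self.tick()

# Timestamps (counter, node_id) compare lexicographically: a total order.
node1, node2 = LamportClock(1), LamportClock(2)

for _ in range(5):
    t2 = node2.tick()        # node 2 reaches timestamp (5, 2)

t1 = node1.receive(t2[0])    # node 1 jumps from 0 straight to (6, 1)
assert t1 > t2               # causally later event gets a greater timestamp
```

Because the maximum counter value is carried on every message, any operation that causally depends on another is guaranteed a strictly greater timestamp, which is exactly the consistency-with-causality property described above.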
Lamport timestamps are sometimes confused with version vectors, which we saw in “Detecting Concurrent Writes” . Although there are some similarities, they have a different purpose: version vectors can distinguish whether two operations are concurrent or whether one is causally dependent on the other, whereas Lamport timestamps always enforce a total ordering. From the total ordering of Lamport timestamps, you cannot tell whether two operations are concurrent or whether they are causally dependent. The advantage of Lamport timestamps over version vectors is that they are more compact.
Lamport时间戳有时会与版本向量混淆,我们在“检测并发写入”中已经看到了。虽然有一些相似之处,但它们有不同的目的:版本向量可以区分两个操作是否并发或是否彼此因果相关,而Lamport时间戳始终强制执行完全排序。从Lamport时间戳的完全排序中,您无法确定两个操作是否并发或是否彼此因果相关。与版本向量相比,Lamport时间戳的优点在于它们更紧凑。
Timestamp ordering is not sufficient
Although Lamport timestamps define a total order of operations that is consistent with causality, they are not quite sufficient to solve many common problems in distributed systems.
虽然Lamport时间戳定义了与因果一致的操作总顺序,但它们并不足以解决许多分布式系统中的常见问题。
For example, consider a system that needs to ensure that a username uniquely identifies a user account. If two users concurrently try to create an account with the same username, one of the two should succeed and the other should fail. (We touched on this problem previously in “The leader and the lock” .)
例如,考虑一个需要确保用户名唯一标识用户账户的系统。如果两个用户同时尝试使用相同的用户名创建账户,其中一个应该成功,另一个应该失败。(我们之前在“领导者和锁”中提到过这个问题。)
At first glance, it seems as though a total ordering of operations (e.g., using Lamport timestamps) should be sufficient to solve this problem: if two accounts with the same username are created, pick the one with the lower timestamp as the winner (the one who grabbed the username first), and let the one with the greater timestamp fail. Since timestamps are totally ordered, this comparison is always valid.
乍一看,操作的全序(例如使用Lamport时间戳)似乎足以解决这个问题:如果创建了两个同名账户,选择时间戳较小的那个作为胜者(先抢到用户名的人),并让时间戳较大的那个失败。由于时间戳是全序的,这种比较总是有效的。
This approach works for determining the winner after the fact: once you have collected all the username creation operations in the system, you can compare their timestamps. However, it is not sufficient when a node has just received a request from a user to create a username, and needs to decide right now whether the request should succeed or fail. At that moment, the node does not know whether another node is concurrently in the process of creating an account with the same username, and what timestamp that other node may assign to the operation.
这种方法适用于事后确定胜者:一旦您收集了系统中所有的用户名创建操作,就可以比较它们的时间戳。但是,当一个节点刚刚收到用户创建用户名的请求,并且需要立即决定请求是成功还是失败时,这种方法是不够的。在那一刻,该节点不知道另一个节点是否正在同时创建具有相同用户名的帐户,以及另一个节点可能为该操作分配的时间戳。
In order to be sure that no other node is in the process of concurrently creating an account with the same username and a lower timestamp, you would have to check with every other node to see what it is doing [ 56 ]. If one of the other nodes has failed or cannot be reached due to a network problem, this system would grind to a halt. This is not the kind of fault-tolerant system that we need.
为确保没有其他节点同时使用相同的用户名和较低的时间戳创建帐户,您需要与每个其他节点检查其正在做什么[56]。 如果其中一个其他节点失败或由于网络问题无法访问,则该系统将停止工作。这不是我们所需的容错系统。
The problem here is that the total order of operations only emerges after you have collected all of the operations. If another node has generated some operations, but you don’t yet know what they are, you cannot construct the final ordering of operations: the unknown operations from the other node may need to be inserted at various positions in the total order.
问题在于只有在收集了所有操作后,操作的总顺序才会出现。如果另一个节点已经生成了一些操作,但您还不知道它们是什么,您不能构建操作的最终顺序:来自其他节点的未知操作可能需要被插入到总顺序的各个位置。
To conclude: in order to implement something like a uniqueness constraint for usernames, it’s not sufficient to have a total ordering of operations—you also need to know when that order is finalized. If you have an operation to create a username, and you are sure that no other node can insert a claim for the same username ahead of your operation in the total order, then you can safely declare the operation successful.
总结一下:要实现诸如用户名唯一性约束这样的功能,仅有操作的全序是不够的,你还需要知道这个全序何时最终确定。如果你有一个创建用户名的操作,并且确信没有其他节点能够在全序中把对同一用户名的声明插入到你的操作之前,那么你就可以安全地宣告操作成功。
This idea of knowing when your total order is finalized is captured in the topic of total order broadcast .
“知道全序何时最终确定”的这一思想,体现在全序广播(total order broadcast)这一主题中。
Total Order Broadcast
If your program runs only on a single CPU core, it is easy to define a total ordering of operations: it is simply the order in which they were executed by the CPU. However, in a distributed system, getting all nodes to agree on the same total ordering of operations is tricky. In the last section we discussed ordering by timestamps or sequence numbers, but found that it is not as powerful as single-leader replication (if you use timestamp ordering to implement a uniqueness constraint, you cannot tolerate any faults).
如果你的程序只在单个CPU核心上运行,那么定义操作的全序很容易:就是CPU执行它们的顺序。然而在分布式系统中,让所有节点就同一个操作全序达成一致是很棘手的。在上一节中,我们讨论了按时间戳或序列号排序,但发现它不如单领导者复制强大(如果你用时间戳排序来实现唯一性约束,就无法容忍任何故障)。
As discussed, single-leader replication determines a total order of operations by choosing one node as the leader and sequencing all operations on a single CPU core on the leader. The challenge then is how to scale the system if the throughput is greater than a single leader can handle, and also how to handle failover if the leader fails (see “Handling Node Outages” ). In the distributed systems literature, this problem is known as total order broadcast or atomic broadcast [ 25 , 57 , 58 ]. ix
如讨论过的,单领导者复制通过选择一个节点作为领导者,并在领导者的单个 CPU 核心上排序所有操作,从而确定操作的总顺序。然后,如果吞吐量超过单个领导者可以处理的范围,如何扩展系统,以及如何处理领导者失败(请参见“处理节点故障”)则是挑战所在。在分布式系统文献中,这个问题被称为总序广播或原子广播 [25,57,58]。
Scope of ordering guarantee
Partitioned databases with a single leader per partition often maintain ordering only per partition, which means they cannot offer consistency guarantees (e.g., consistent snapshots, foreign key references) across partitions. Total ordering across all partitions is possible, but requires additional coordination [ 59 ].
每个分区各有一个领导者的分区数据库,通常只维护分区内部的排序,这意味着它们无法提供跨分区的一致性保证(例如一致的快照、外键引用)。跨所有分区的全序是可能的,但需要额外的协调[59]。
Total order broadcast is usually described as a protocol for exchanging messages between nodes. Informally, it requires that two safety properties always be satisfied:
全序广播通常被描述为节点间交换消息的协议。非正式地说,它要求始终满足两个安全属性:
- Reliable delivery
-
No messages are lost: if a message is delivered to one node, it is delivered to all nodes.
没有消息丢失:如果一条消息被传递到某个节点,它就会被传递到所有节点。
- Totally ordered delivery
-
Messages are delivered to every node in the same order.
消息被按照相同的顺序传递到每个节点。
A correct algorithm for total order broadcast must ensure that the reliability and ordering properties are always satisfied, even if a node or the network is faulty. Of course, messages will not be delivered while the network is interrupted, but an algorithm can keep retrying so that the messages get through when the network is eventually repaired (and then they must still be delivered in the correct order).
正确的全序广播算法必须确保,即使节点或网络出现故障,可靠性和有序性属性也始终得到满足。当然,网络中断时消息无法传递,但算法可以不断重试,使消息在网络最终修复后送达(并且那时仍然必须以正确的顺序传递)。
Using total order broadcast
Consensus services such as ZooKeeper and etcd actually implement total order broadcast. This fact is a hint that there is a strong connection between total order broadcast and consensus, which we will explore later in this chapter.
ZooKeeper和etcd等共识服务实际上实现了全序广播。这一事实暗示全序广播与共识之间存在密切联系,我们将在本章后面探讨这一点。
Total order broadcast is exactly what you need for database replication: if every message represents a write to the database, and every replica processes the same writes in the same order, then the replicas will remain consistent with each other (aside from any temporary replication lag). This principle is known as state machine replication [ 60 ], and we will return to it in Chapter 11 .
全序广播正是数据库复制所需要的:如果每条消息代表一次数据库写入,并且每个副本以相同的顺序处理相同的写入,那么各副本将保持相互一致(除了暂时的复制滞后)。这一原则被称为状态机复制[60],我们将在第11章回到这个话题。
Similarly, total order broadcast can be used to implement serializable transactions: as discussed in “Actual Serial Execution” , if every message represents a deterministic transaction to be executed as a stored procedure, and if every node processes those messages in the same order, then the partitions and replicas of the database are kept consistent with each other [ 61 ].
类似地,全序广播可以用于实现可序列化事务:正如“真的串行执行”中所讨论的,如果每条消息代表一个以存储过程形式执行的确定性事务,并且每个节点以相同的顺序处理这些消息,那么数据库的分区和副本就能保持相互一致[61]。
An important aspect of total order broadcast is that the order is fixed at the time the messages are delivered: a node is not allowed to retroactively insert a message into an earlier position in the order if subsequent messages have already been delivered. This fact makes total order broadcast stronger than timestamp ordering.
全序广播的一个重要特点是,顺序在消息传递时就已固定:如果后续消息已经被传递,节点就不允许追溯地把一条消息插入到顺序中较早的位置。这一事实使得全序广播比时间戳排序更强。
Another way of looking at total order broadcast is that it is a way of creating a log (as in a replication log, transaction log, or write-ahead log): delivering a message is like appending to the log. Since all nodes must deliver the same messages in the same order, all nodes can read the log and see the same sequence of messages.
另一种理解全序广播的方式是将其视为创建日志的一种方式(例如复制日志、事务日志或预写式日志):传递消息就像附加到日志中一样。由于所有节点必须按照相同的顺序传递相同的消息,因此所有节点可以读取日志并查看相同的消息序列。
Total order broadcast is also useful for implementing a lock service that provides fencing tokens (see “Fencing tokens” ). Every request to acquire the lock is appended as a message to the log, and all messages are sequentially numbered in the order they appear in the log. The sequence number can then serve as a fencing token, because it is monotonically increasing. In ZooKeeper, this sequence number is called zxid [ 15 ].
全序广播还可用于实现提供围栏令牌的锁定服务(请见“围栏令牌”)。每个请求获取锁都将作为一个消息附加到日志中,并按照它们在日志中出现的顺序进行顺序编号。这个序列号可以作为围栏令牌,因为它是单调递增的。在ZooKeeper中,这个序列号称为zxid [15]。
Implementing linearizable storage using total order broadcast
As illustrated in Figure 9-4 , in a linearizable system there is a total order of operations. Does that mean linearizability is the same as total order broadcast? Not quite, but there are close links between the two. x
如图9-4所示,在可线性化的系统中存在操作的总序。这是否意味着线性化与总序广播相同?并非完全如此,但两者之间存在密切联系。
Total order broadcast is asynchronous: messages are guaranteed to be delivered reliably in a fixed order, but there is no guarantee about when a message will be delivered (so one recipient may lag behind the others). By contrast, linearizability is a recency guarantee: a read is guaranteed to see the latest value written.
全序广播是异步的:消息保证以固定的顺序可靠传递,但无法保证消息何时被传递(因此某个接收者可能落后于其他接收者)。相比之下,可线性化是一种最新性保证:读取保证能看到最新写入的值。
However, if you have total order broadcast, you can build linearizable storage on top of it. For example, you can ensure that usernames uniquely identify user accounts.
然而,如果你拥有全序广播,就可以在它之上构建可线性化的存储。例如,你可以确保用户名唯一地标识用户账户。
Imagine that for every possible username, you can have a linearizable register with an atomic compare-and-set operation. Every register initially has the value null (indicating that the username is not taken). When a user wants to create a username, you execute a compare-and-set operation on the register for that username, setting it to the user account ID, under the condition that the previous register value is null . If multiple users try to concurrently grab the same username, only one of the compare-and-set operations will succeed, because the others will see a value other than null (due to linearizability).
假设对于每个可能的用户名,您可以拥有一个具有原子比较和设置操作的可线性化寄存器。每个寄存器最初值为 null(表示该用户名尚未被占用)。当用户想要创建用户名时,您执行对该用户名的寄存器的比较和设置操作,在先前的寄存器值为 null 的条件下将其设置为用户帐户 ID。如果多个用户尝试并发获取相同的用户名,则只有一个比较和设置操作将成功,因为其他人将看到一个不为 null 的值(由于可线性化性)。
You can implement such a linearizable compare-and-set operation as follows by using total order broadcast as an append-only log [ 62 , 63 ]:
你可以把全序广播当作追加式日志(append-only log),按如下方式实现这样一个可线性化的比较并设置操作[62,63]:
-
Append a message to the log, tentatively indicating the username you want to claim.
向日志中追加一条消息,暂时指示您想要声明的用户名。
-
Read the log, and wait for the message you appended to be delivered back to you. xi
读取日志,等待你所追加的消息被传回给你。
-
Check for any messages claiming the username that you want. If the first message for your desired username is your own message, then you are successful: you can commit the username claim (perhaps by appending another message to the log) and acknowledge it to the client. If the first message for your desired username is from another user, you abort the operation.
检查是否有任何声称使用您所需用户名的消息。如果您所需用户名的第一条消息是您自己的消息,则您已经成功:您可以提交用户名声明(可能通过将另一条消息附加到日志中)并向客户确认。如果您所需用户名的第一条消息来自另一个用户,则应中止操作。
Because log entries are delivered to all nodes in the same order, if there are several concurrent writes, all nodes will agree on which one came first. Choosing the first of the conflicting writes as the winner and aborting later ones ensures that all nodes agree on whether a write was committed or aborted. A similar approach can be used to implement serializable multi-object transactions on top of a log [ 62 ].
由于日志条目以相同的顺序传递到所有节点,如果存在多个并发写入,所有节点将就哪个写入在先达成一致。选择冲突写入中的第一个作为胜者并中止后来者,可以确保所有节点就某个写入是提交还是中止达成一致。类似的方法还可以用来在日志之上实现可序列化的多对象事务[62]。
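The three steps above can be sketched as follows, using an in-memory list as a toy stand-in for the totally ordered log (all names are illustrative; in reality the log would be provided by a total order broadcast protocol spanning many nodes):

上面的三个步骤可以用如下草图说明,其中用内存中的列表作为全序日志的玩具替身(名称均为示意;现实中日志由跨多节点的全序广播协议提供):

```python
class TotalOrderLog:
    """Toy single-process stand-in for a totally ordered, append-only log."""

    def __init__(self):
        self.entries = []

    def append(self, msg):
        self.entries.append(msg)
        return len(self.entries) - 1   # position at which the message was delivered

    def up_to(self, pos):
        return self.entries[:pos + 1]

def claim_username(log, username, account_id):
    # 1. Tentatively append a message claiming the username.
    my_pos = log.append({"username": username, "account": account_id})
    # 2. Read the log until our own message has been delivered back to us.
    delivered = log.up_to(my_pos)
    # 3. If the first claim for this username is our own, the claim succeeds;
    #    otherwise we abort.
    first = next(m for m in delivered if m["username"] == username)
    return first["account"] == account_id

log = TotalOrderLog()
assert claim_username(log, "alice", account_id=1)        # first claim wins
assert not claim_username(log, "alice", account_id=2)    # later claim aborts
```

Because every node sees the claims in the same log order, every node independently reaches the same verdict about which claim won.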
While this procedure ensures linearizable writes, it doesn’t guarantee linearizable reads—if you read from a store that is asynchronously updated from the log, it may be stale. (To be precise, the procedure described here provides sequential consistency [ 47 , 64 ], sometimes also known as timeline consistency [ 65 , 66 ], a slightly weaker guarantee than linearizability.) To make reads linearizable, there are a few options:
尽管这一过程保证了可线性化的写入,但它并不保证可线性化的读取:如果你从一个异步跟随日志更新的存储中读取,读到的数据可能是陈旧的。(准确地说,这里描述的过程提供的是顺序一致性[47,64],有时也称为时间线一致性[65,66],这是比可线性化稍弱的保证。)要使读取也可线性化,有几种选择:
-
You can sequence reads through the log by appending a message, reading the log, and performing the actual read when the message is delivered back to you. The message’s position in the log thus defines the point in time at which the read happens. (Quorum reads in etcd work somewhat like this [ 16 ].)
通过附加消息、读取日志并在消息被发送回您时执行实际读取操作,您可以通过日志对读取进行排序。因此,消息在日志中的位置定义了读取发生的时间点。(etcd 中的 Quorum 读取有些类似这种方式 [16])。
-
If the log allows you to fetch the position of the latest log message in a linearizable way, you can query that position, wait for all entries up to that position to be delivered to you, and then perform the read. (This is the idea behind ZooKeeper’s
sync()
operation [ 15 ].)如果日志允许你以可线性化的方式获取最新日志消息的位置,你可以查询该位置,等待直到该位置为止的所有条目都传递给你,然后执行读取。(这就是 ZooKeeper 的 sync() 操作背后的思想 [15]。)
-
You can make your read from a replica that is synchronously updated on writes, and is thus sure to be up to date. (This technique is used in chain replication [ 63 ]; see also “Research on Replication” .)
你可以从一个在写入时同步更新的副本中读取,因而它肯定是最新的。(这种技术用于链式复制 [63];另请参阅“复制研究”。)
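The first option (sequencing reads through the log) might be sketched as follows; the marker message and the key-value store here are illustrative assumptions, not the book's code:

第一种选择(通过日志对读取排序)可以大致勾勒如下;这里的标记消息和键值存储只是示意性的假设,并非本书的代码:

```python
# Toy sketch: a linearizable read by appending a marker to the totally
# ordered log and applying every entry that precedes the marker before
# reading. The shared list stands in for the log.

class LogBackedStore:
    def __init__(self, log):
        self.log = log      # shared, totally ordered log
        self.state = {}     # local replica, updated by applying the log
        self.applied = 0    # index of the next log entry to apply

    def apply_up_to(self, position):
        while self.applied < position:
            entry = self.log[self.applied]
            if entry[0] == "write":
                _, key, value = entry
                self.state[key] = value
            self.applied += 1

    def write(self, key, value):
        self.log.append(("write", key, value))

    def linearizable_read(self, key):
        self.log.append(("read-marker",))  # append a marker message
        position = len(self.log)           # "wait" until it is delivered back
        self.apply_up_to(position)         # apply everything ordered before it
        return self.state.get(key)

log = []
store = LogBackedStore(log)
store.write("x", 1)
store.write("x", 2)
print(store.linearizable_read("x"))  # 2: the read reflects all earlier writes
```

The marker's position in the log defines the point in time at which the read takes effect, which is the essential idea behind quorum reads in etcd.
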
Implementing total order broadcast using linearizable storage
The last section showed how to build a linearizable compare-and-set operation from total order broadcast. We can also turn it around, assume that we have linearizable storage, and show how to build total order broadcast from it.
上一节展示了如何从全序广播构建可线性化的比较和交换操作。我们也可以反过来,假设我们拥有可线性化存储,并展示如何从中构建全序广播。
The easiest way is to assume you have a linearizable register that stores an integer and that has an atomic increment-and-get operation [ 28 ]. Alternatively, an atomic compare-and-set operation would also do the job.
最简单的方式就是假设有一个可以存储整数的可线性化寄存器,并且拥有原子的增加和获取操作[28]。 另一种选择是使用原子的比较并设置操作。
The algorithm is simple: for every message you want to send through total order broadcast, you increment-and-get the linearizable integer, and then attach the value you got from the register as a sequence number to the message. You can then send the message to all nodes (resending any lost messages), and the recipients will deliver the messages consecutively by sequence number.
算法很简单:对于想要通过全序广播发送的每条消息,先对可线性化整数执行增量并返回操作,然后将从寄存器获得的值作为序列号附加到消息上。接着将消息发送到所有节点(重发任何丢失的消息),接收者将按照序列号依次传递消息。
Note that unlike Lamport timestamps, the numbers you get from incrementing the linearizable register form a sequence with no gaps. Thus, if a node has delivered message 4 and receives an incoming message with a sequence number of 6, it knows that it must wait for message 5 before it can deliver message 6. The same is not the case with Lamport timestamps—in fact, this is the key difference between total order broadcast and timestamp ordering.
请注意,与 Lamport 时间戳不同,从递增可线性化寄存器获得的数字形成一个没有间隙的序列。因此,如果某个节点已经传递了消息 4,并接收到序列号为 6 的消息,它就知道在传递消息 6 之前必须等待消息 5。Lamport 时间戳则不是这样——事实上,这正是全序广播和时间戳排序之间的关键区别。
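The algorithm might be sketched as follows, with an in-process counter standing in for the linearizable increment-and-get register (illustrative only; a real register would be a fault-tolerant service):

这个算法可以大致勾勒如下,用一个进程内计数器代替可线性化的增量并返回寄存器(仅作示意;真实的寄存器会是一个容错的服务):

```python
import itertools

# Toy sketch: total order broadcast from an increment-and-get counter.
# Each message gets the next sequence number; recipients deliver strictly
# in sequence order, buffering any message that arrives early.

class Recipient:
    def __init__(self):
        self.next_seq = 1   # next sequence number we may deliver
        self.buffer = {}    # out-of-order messages waiting for a gap to fill
        self.delivered = []

    def receive(self, seq, message):
        self.buffer[seq] = message
        # Deliver consecutively: message n+1 only after message n.
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1

counter = itertools.count(1)  # stands in for the linearizable register

def broadcast(recipients, message):
    seq = next(counter)       # atomic increment-and-get (toy version)
    for r in recipients:
        r.receive(seq, message)
    return seq

alice = Recipient()
broadcast([alice], "m1")
broadcast([alice], "m2")
print(alice.delivered)   # ['m1', 'm2']

bob = Recipient()
bob.receive(2, "m2")     # arrives early: buffered, not delivered
bob.receive(1, "m1")     # gap filled: both delivered, in order
print(bob.delivered)     # ['m1', 'm2']
```

The buffering in `receive` is exactly what the gap-free sequence makes possible: on seeing message 6, a node knows message 5 must exist and waits for it.
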
How hard could it be to make a linearizable integer with an atomic increment-and-get operation? As usual, if things never failed, it would be easy: you could just keep it in a variable on one node. The problem lies in handling the situation when network connections to that node are interrupted, and restoring the value when that node fails [ 59 ]. In general, if you think hard enough about linearizable sequence number generators, you inevitably end up with a consensus algorithm.
创建一个带有原子增量并返回操作的可线性化整数会有多难呢?像往常一样,如果永远不发生故障,那就很简单:把它保存在某个节点的一个变量中即可。问题在于处理与该节点的网络连接中断的情况,以及在该节点发生故障时恢复该值 [59]。一般来说,如果你对可线性化的序列号生成器思考得足够深入,最终不可避免地会得到一个共识算法。
This is no coincidence: it can be proved that a linearizable compare-and-set (or increment-and-get) register and total order broadcast are both equivalent to consensus [ 28 , 67 ]. That is, if you can solve one of these problems, you can transform it into a solution for the others. This is quite a profound and surprising insight!
这不是巧合:可以证明,可线性化的比较并设置(或增量并获取)寄存器和全序广播都等价于共识[28,67]。也就是说,如果你可以解决其中一个问题,你可以将其转化为其他问题的解决方案。这是一个相当深刻和令人惊讶的见解!
It is time to finally tackle the consensus problem head-on, which we will do in the rest of this chapter.
到了终于直面共识问题的时候了,我们将在本章的其余部分来解决它。
Distributed Transactions and Consensus
Consensus is one of the most important and fundamental problems in distributed computing. On the surface, it seems simple: informally, the goal is simply to get several nodes to agree on something . You might think that this shouldn’t be too hard. Unfortunately, many broken systems have been built in the mistaken belief that this problem is easy to solve.
共识是分布式计算中最重要也是最基本的问题之一。表面上看,它似乎很简单:通俗地讲,目标就是让几个节点就某件事达成一致。你可能会认为这不会太难。不幸的是,许多系统正是在误以为这个问题很容易解决的信念下构建的,结果成了有缺陷的系统。
Although consensus is very important, the section about it appears late in this book because the topic is quite subtle, and appreciating the subtleties requires some prerequisite knowledge. Even in the academic research community, the understanding of consensus only gradually crystallized over the course of decades, with many misunderstandings along the way. Now that we have discussed replication ( Chapter 5 ), transactions ( Chapter 7 ), system models ( Chapter 8 ), linearizability, and total order broadcast (this chapter), we are finally ready to tackle the consensus problem.
尽管共识非常重要,但这本书中关于共识的部分出现得较晚,因为这个主题非常微妙,需要一些先决知识才能欣赏其微妙之处。即使在学术研究界,对共识的理解也是在数十年的过程中逐渐凝聚起来的,并伴随着许多误解。现在,我们已经讨论了复制(第5章)、事务(第7章)、系统模型(第8章)、线性化和全序广播(本章),终于可以着手解决共识问题。
There are a number of situations in which it is important for nodes to agree. For example:
有很多情况需要节点达成共识。例如:
- Leader election
-
In a database with single-leader replication, all nodes need to agree on which node is the leader. The leadership position might become contested if some nodes can’t communicate with others due to a network fault. In this case, consensus is important to avoid a bad failover, resulting in a split brain situation in which two nodes both believe themselves to be the leader (see “Handling Node Outages” ). If there were two leaders, they would both accept writes and their data would diverge, leading to inconsistency and data loss.
在使用单主复制的数据库中,所有节点都需要就哪个节点是领导者达成一致。如果由于网络故障,某些节点无法与其他节点通信,领导权可能会出现争议。在这种情况下,共识对于避免错误的故障切换非常重要,否则会导致脑裂:两个节点都认为自己是领导者(请参阅“处理节点故障”)。如果有两个领导者,它们都会接受写入,数据就会发生分歧,导致不一致和数据丢失。
- Atomic commit
-
In a database that supports transactions spanning several nodes or partitions, we have the problem that a transaction may fail on some nodes but succeed on others. If we want to maintain transaction atomicity (in the sense of ACID; see “Atomicity” ), we have to get all nodes to agree on the outcome of the transaction: either they all abort/roll back (if anything goes wrong) or they all commit (if nothing goes wrong). This instance of consensus is known as the atomic commit problem. xii
在支持跨多个节点或分区的事务的数据库中,我们面临的问题是该事务可能在某些节点上失败,但在其他节点上成功。如果我们想要维护事务的原子性(在ACID意义上;参见“原子性”),我们必须让所有节点就事务的结果达成一致:如果出现任何问题,则它们全部中止/回滚,或者如果没有任何问题,则它们全部提交。这种共识的实例称为原子提交问题。
In this section we will first examine the atomic commit problem in more detail. In particular, we will discuss the two-phase commit (2PC) algorithm, which is the most common way of solving atomic commit and which is implemented in various databases, messaging systems, and application servers. It turns out that 2PC is a kind of consensus algorithm—but not a very good one [ 70 , 71 ].
在本节中,我们将首先更详细地研究原子提交问题。特别地,我们将讨论两阶段提交(2PC)算法,这是解决原子提交问题的最常见方式,并在各种数据库、消息系统和应用服务器中实现。事实证明,2PC是一种共识算法,但不是很好的算法[70,71]。
By learning from 2PC we will then work our way toward better consensus algorithms, such as those used in ZooKeeper (Zab) and etcd (Raft).
通过学习2PC,我们将逐步学习更好的共识算法,例如在ZooKeeper中使用的Zab和在etcd中使用的Raft。
Atomic Commit and Two-Phase Commit (2PC)
In Chapter 7 we learned that the purpose of transaction atomicity is to provide simple semantics in the case where something goes wrong in the middle of making several writes. The outcome of a transaction is either a successful commit , in which case all of the transaction’s writes are made durable, or an abort , in which case all of the transaction’s writes are rolled back (i.e., undone or discarded).
在第 7 章中,我们了解到事务原子性的目的是:在执行多个写操作的过程中出现错误时,提供简单的语义。事务的结果要么是成功提交,此时事务的所有写入都被持久化;要么是中止,此时事务的所有写入都被回滚(即撤销或丢弃)。
Atomicity prevents failed transactions from littering the database with half-finished results and half-updated state. This is especially important for multi-object transactions (see “Single-Object and Multi-Object Operations” ) and databases that maintain secondary indexes. Each secondary index is a separate data structure from the primary data—thus, if you modify some data, the corresponding change needs to also be made in the secondary index. Atomicity ensures that the secondary index stays consistent with the primary data (if the index became inconsistent with the primary data, it would not be very useful).
原子性防止失败的事务在数据库中留下未完成的结果和未更新的状态。这对于多对象事务(请参见“单对象和多对象操作”)和维护辅助索引的数据库尤其重要。每个辅助索引都是与主数据结构分开的单独数据结构,因此,如果您修改了某些数据,则相应的更改也需要在辅助索引中进行。原子性确保辅助索引与主数据保持一致(如果索引与主数据不一致,它将没有用处)。
From single-node to distributed atomic commit
For transactions that execute at a single database node, atomicity is commonly implemented by the storage engine. When the client asks the database node to commit the transaction, the database makes the transaction’s writes durable (typically in a write-ahead log; see “Making B-trees reliable” ) and then appends a commit record to the log on disk. If the database crashes in the middle of this process, the transaction is recovered from the log when the node restarts: if the commit record was successfully written to disk before the crash, the transaction is considered committed; if not, any writes from that transaction are rolled back.
对于在单个数据库节点上执行的事务,原子性通常由存储引擎实现。当客户端请求数据库节点提交事务时,数据库使事务的写入持久化(通常写入预写式日志;参见“使 B 树可靠”),然后将一条提交记录追加到磁盘上的日志中。如果数据库在此过程中崩溃,节点重启时会从日志中恢复事务:如果提交记录在崩溃之前已成功写入磁盘,则认为事务已提交;否则,该事务的所有写入都会被回滚。
Thus, on a single node, transaction commitment crucially depends on the order in which data is durably written to disk: first the data, then the commit record [ 72 ]. The key deciding moment for whether the transaction commits or aborts is the moment at which the disk finishes writing the commit record: before that moment, it is still possible to abort (due to a crash), but after that moment, the transaction is committed (even if the database crashes). Thus, it is a single device (the controller of one particular disk drive, attached to one particular node) that makes the commit atomic.
因此,在单个节点上,事务的提交关键取决于数据持久写入磁盘的顺序:首先是数据,然后是提交记录 [72]。决定事务提交还是中止的关键时刻,是磁盘完成写入提交记录的时刻:在此之前,仍有可能中止(由于崩溃);但在此之后,即使数据库崩溃,事务也已提交。因此,是单个设备(连接到某个特定节点的某个特定磁盘驱动器的控制器)使得提交具有原子性。
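The recovery rule—only transactions whose commit record reached the log count as committed—can be sketched as follows (a toy redo pass over an in-memory log; names are illustrative):

这条恢复规则——只有提交记录写到了日志里的事务才算已提交——可以大致勾勒如下(对内存中日志的一次玩具式重做扫描;名称仅为示例):

```python
# Toy sketch: single-node atomic commit via log ordering. Data records are
# written before the commit record; on recovery, a transaction counts as
# committed only if its commit record is present in the log.

def recover(log):
    committed = {txid for (kind, txid, *_) in log if kind == "commit"}
    state = {}
    for (kind, txid, *payload) in log:
        if kind == "write" and txid in committed:
            key, value = payload
            state[key] = value  # redo writes of committed transactions
    return state                # uncommitted writes are discarded (rolled back)

# t1 committed; t2 crashed before its commit record was written.
log = [
    ("write", "t1", "x", 1),
    ("commit", "t1"),
    ("write", "t2", "x", 2),
    # crash here: no ("commit", "t2") record ever reached the log
]
print(recover(log))  # {'x': 1}: t2's write is rolled back
```

The single atomic event is the append of the commit record: everything before it can be undone, everything after it must be redone.
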
However, what if multiple nodes are involved in a transaction? For example, perhaps you have a multi-object transaction in a partitioned database, or a term-partitioned secondary index (in which the index entry may be on a different node from the primary data; see “Partitioning and Secondary Indexes” ). Most “NoSQL” distributed datastores do not support such distributed transactions, but various clustered relational systems do (see “Distributed Transactions in Practice” ).
然而,如果一个事务涉及多个节点怎么办?例如,你可能在分区数据库中有一个多对象事务,或者有一个按词语分区的二级索引(索引条目可能与主数据位于不同的节点上;请参阅“分区与二级索引”)。大多数“NoSQL”分布式数据存储不支持这样的分布式事务,但各种集群化的关系型系统支持(请参阅“实践中的分布式事务”)。
In these cases, it is not sufficient to simply send a commit request to all of the nodes and independently commit the transaction on each one. In doing so, it could easily happen that the commit succeeds on some nodes and fails on other nodes, which would violate the atomicity guarantee:
在这些情况下,仅仅向所有节点发送提交请求并在每个节点上独立提交事务是不够的。这样做很容易发生提交在某些节点上成功、而在另一些节点上失败的情况,这将违反原子性保证:
-
Some nodes may detect a constraint violation or conflict, making an abort necessary, while other nodes are successfully able to commit.
一些节点可能会检测到约束违规或冲突,需要进行中止,而其他节点能够成功提交。
-
Some of the commit requests might be lost in the network, eventually aborting due to a timeout, while other commit requests get through.
有些提交请求可能会在网络中丢失,最终由于超时而中止,而其他提交请求则得以通过。
-
Some nodes may crash before the commit record is fully written and roll back on recovery, while others successfully commit.
有些节点在提交记录完全写入之前可能会崩溃并在恢复时回滚,而其他节点则可顺利提交。
If some nodes commit the transaction but others abort it, the nodes become inconsistent with each other (like in Figure 7-3 ). And once a transaction has been committed on one node, it cannot be retracted again if it later turns out that it was aborted on another node. For this reason, a node must only commit once it is certain that all other nodes in the transaction are also going to commit.
如果某些节点提交事务但其他节点中止它,则节点彼此不一致(如图7-3所示)。一旦事务在一个节点上提交,如果后来发现它在另一个节点上被中止,它就不能被撤回。因此,节点必须在确信所有其他节点也将提交后才能提交。
A transaction commit must be irrevocable—you are not allowed to change your mind and retroactively abort a transaction after it has been committed. The reason for this rule is that once data has been committed, it becomes visible to other transactions, and thus other clients may start relying on that data; this principle forms the basis of read committed isolation, discussed in “Read Committed” . If a transaction was allowed to abort after committing, any transactions that read the committed data would be based on data that was retroactively declared not to have existed—so they would have to be reverted as well.
事务提交必须是不可撤销的——你不能改变主意,在事务提交之后再追溯地中止它。这条规则的原因是:一旦数据被提交,它就对其他事务可见,因而其他客户端可能开始依赖这些数据;这一原则构成了读已提交隔离的基础,见“读已提交”。如果允许事务在提交后中止,所有读取了已提交数据的事务都将基于被追溯宣告不存在的数据——因此它们也必须被回滚。
(It is possible for the effects of a committed transaction to later be undone by another, compensating transaction [ 73 , 74 ]. However, from the database’s point of view this is a separate transaction, and thus any cross-transaction correctness requirements are the application’s problem.)
(已提交事务的影响之后可能会被另一个补偿性事务撤销 [73, 74]。然而,从数据库的角度来看,这是一个单独的事务,因此任何跨事务的正确性要求都是应用程序的问题。)
Introduction to two-phase commit
Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes—i.e., to ensure that either all nodes commit or all nodes abort. It is a classic algorithm in distributed databases [ 13 , 35 , 75 ]. 2PC is used internally in some databases and also made available to applications in the form of XA transactions [ 76 , 77 ] (which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP web services [ 78 , 79 ].
两阶段提交是一种在多个节点之间实现原子事务提交的算法——即确保所有节点要么都提交,要么都中止。它是分布式数据库中的经典算法 [13, 35, 75]。2PC 在某些数据库内部使用,也以 XA 事务的形式 [76, 77](例如 Java Transaction API 所支持的)或通过用于 SOAP Web 服务的 WS-AtomicTransaction [78, 79] 提供给应用程序。
The basic flow of 2PC is illustrated in Figure 9-9 . Instead of a single commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two phases (hence the name).
2PC 的基本流程如图 9-9 所示。与单节点事务的单个提交请求不同,2PC 中的提交/中止过程分为两个阶段(因此得名)。
Don’t confuse 2PC and 2PL
Two-phase commit (2PC) and two-phase locking (see “Two-Phase Locking (2PL)” ) are two very different things. 2PC provides atomic commit in a distributed database, whereas 2PL provides serializable isolation. To avoid confusion, it’s best to think of them as entirely separate concepts and to ignore the unfortunate similarity in the names.
两阶段提交(2PC)和两阶段锁定(参见“两阶段锁定(2PL)”)是两个非常不同的概念。2PC在分布式数据库中提供原子提交,而2PL提供可串行化隔离。为了避免混淆,最好将它们视为完全独立的概念,并忽略名称上的不幸相似之处。
2PC uses a new component that does not normally appear in single-node transactions: a coordinator (also known as transaction manager ). The coordinator is often implemented as a library within the same application process that is requesting the transaction (e.g., embedded in a Java EE container), but it can also be a separate process or service. Examples of such coordinators include Narayana, JOTM, BTM, or MSDTC.
2PC 使用了一个在单节点事务中通常不会出现的新组件:协调者(也称为事务管理器)。协调者通常作为一个库实现,运行在发起事务的同一应用进程中(例如嵌入在 Java EE 容器中),但它也可以是一个单独的进程或服务。这类协调者的例子包括 Narayana、JOTM、BTM 或 MSDTC。
A 2PC transaction begins with the application reading and writing data on multiple database nodes, as normal. We call these database nodes participants in the transaction. When the application is ready to commit, the coordinator begins phase 1: it sends a prepare request to each of the nodes, asking them whether they are able to commit. The coordinator then tracks the responses from the participants:
2PC事务始于应用读取和写入多个数据库节点上的数据,就像平常一样。我们将这些数据库节点称为事务的参与者。当应用准备提交时,协调者开始第一阶段:向每个节点发送一个准备请求,询问它们是否能够提交。然后,协调者跟踪参与者的响应。
-
If all participants reply “yes,” indicating they are ready to commit, then the coordinator sends out a commit request in phase 2, and the commit actually takes place.
如果所有参与者都回复“是”,表示它们已准备好提交,那么协调者会在第二阶段发出提交请求,提交才实际发生。
-
If any of the participants replies “no,” the coordinator sends an abort request to all nodes in phase 2.
如果任何参与者回复“否”,协调者会在第二阶段向所有节点发送中止请求。
This process is somewhat like the traditional marriage ceremony in Western cultures: the minister asks the bride and groom individually whether each wants to marry the other, and typically receives the answer “I do” from both. After receiving both acknowledgments, the minister pronounces the couple husband and wife: the transaction is committed, and the happy fact is broadcast to all attendees. If either bride or groom does not say “yes,” the ceremony is aborted [ 73 ].
这个过程有点像西方文化中的传统婚礼仪式:牧师分别询问新娘和新郎是否愿意与对方结婚,通常会从双方那里得到“我愿意”的回答。在收到双方的确认后,牧师宣布这对新人结为夫妻:事务提交了,这个喜讯会向所有来宾广播。如果新娘或新郎任何一方没有说“是”,仪式就会中止 [73]。
A system of promises
From this short description it might not be clear why two-phase commit ensures atomicity, while one-phase commit across several nodes does not. Surely the prepare and commit requests can just as easily be lost in the two-phase case. What makes 2PC different?
从这个简短的描述中,可能还不清楚为什么两阶段提交能确保原子性,而跨多个节点的单阶段提交却不能。在两阶段的情况下,准备和提交请求当然同样可能丢失。是什么让 2PC 与众不同?
To understand why it works, we have to break down the process in a bit more detail:
要理解为什么它起作用,我们必须更详细地分解该过程:
-
When the application wants to begin a distributed transaction, it requests a transaction ID from the coordinator. This transaction ID is globally unique.
当应用程序想要开始一个分布式事务时,它会向协调者请求一个事务ID。这个事务ID是全局唯一的。
-
The application begins a single-node transaction on each of the participants, and attaches the globally unique transaction ID to the single-node transaction. All reads and writes are done in one of these single-node transactions. If anything goes wrong at this stage (for example, a node crashes or a request times out), the coordinator or any of the participants can abort.
应用程序在每个参与者上开始一个单节点事务,并将全局唯一的事务ID附加到单节点事务。所有读写操作都在这些单节点事务中完成。如果在此阶段出现任何问题(例如,节点崩溃或请求超时),协调者或任何参与者都可以中止。
-
When the application is ready to commit, the coordinator sends a prepare request to all participants, tagged with the global transaction ID. If any of these requests fails or times out, the coordinator sends an abort request for that transaction ID to all participants.
当应用准备提交时,协调者向所有参与者发送一个带有全局事务 ID 的准备请求。如果其中任何一个请求失败或超时,协调者就会为该事务 ID 向所有参与者发送一个中止请求。
-
When a participant receives the prepare request, it makes sure that it can definitely commit the transaction under all circumstances. This includes writing all transaction data to disk (a crash, a power failure, or running out of disk space is not an acceptable excuse for refusing to commit later), and checking for any conflicts or constraint violations. By replying “yes” to the coordinator, the node promises to commit the transaction without error if requested. In other words, the participant surrenders the right to abort the transaction, but without actually committing it.
当参与者接收到准备请求时,它确保在所有情况下都能够确定地提交该事务。这包括将所有事务数据写入磁盘(崩溃、停电或磁盘空间不足不是拒绝稍后提交的合理理由),并检查是否存在任何冲突或约束违规。通过回复“是”给协调者,节点承诺在请求时能够提交事务而不出现错误。换句话说,参与者放弃了终止事务的权利,但并没有实际提交它。
-
When the coordinator has received responses to all prepare requests, it makes a definitive decision on whether to commit or abort the transaction (committing only if all participants voted “yes”). The coordinator must write that decision to its transaction log on disk so that it knows which way it decided in case it subsequently crashes. This is called the commit point .
当协调员接收到所有准备请求的回复后,它会对事务作出最终的决定,即是提交还是中止事务(只有所有参与者都投票“是”才能提交)。协调员必须将该决定写入其磁盘上的事务日志,以便在随后发生崩溃时知道其决定的方向。这就是所谓的提交点。
-
Once the coordinator’s decision has been written to disk, the commit or abort request is sent to all participants. If this request fails or times out, the coordinator must retry forever until it succeeds. There is no more going back: if the decision was to commit, that decision must be enforced, no matter how many retries it takes. If a participant has crashed in the meantime, the transaction will be committed when it recovers—since the participant voted “yes,” it cannot refuse to commit when it recovers.
一旦协调者的决定被写入磁盘,提交或中止请求就会发送给所有参与者。如果该请求失败或超时,协调者必须永远重试,直到成功为止。没有回头路:如果决定是提交,那么无论需要重试多少次,该决定都必须得到执行。如果参与者在此期间崩溃,事务将在其恢复后提交——由于参与者投了“是”票,它在恢复后不能拒绝提交。
Thus, the protocol contains two crucial “points of no return”: when a participant votes “yes,” it promises that it will definitely be able to commit later (although the coordinator may still choose to abort); and once the coordinator decides, that decision is irrevocable. Those promises ensure the atomicity of 2PC. (Single-node atomic commit lumps these two events into one: writing the commit record to the transaction log.)
因此,该协议包含两个关键的“不可回头点”:当参与者投票“是”时,它承诺之后一定能够提交(尽管协调者仍可能选择中止);而协调者一旦做出决定,该决定就不可撤销。这些承诺确保了 2PC 的原子性。(单节点原子提交将这两个事件合二为一:将提交记录写入事务日志。)
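The two phases described above might be sketched as follows (a single-process toy; a real coordinator must durably log the decision and keep retrying phase 2 indefinitely, and real participants must durably stage their writes before voting "yes"):

上面描述的两个阶段可以大致勾勒如下(单进程的玩具模型;真实的协调者必须持久地记录决定并无限重试第二阶段,真实的参与者在投“是”票之前必须持久地暂存其写入):

```python
# Toy sketch of the 2PC flow: phase 1 collects votes, the decision is
# logged (the commit point), phase 2 enforces the decision.

class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self, txid):
        if self.can_commit:
            self.state = "prepared"  # promise: no unilateral abort from here on
            return "yes"
        return "no"                  # e.g. constraint violation detected

    def finish(self, txid, decision):
        self.state = decision        # "commit" or "abort"

def two_phase_commit(txid, participants, coordinator_log):
    # Phase 1: send prepare requests and collect votes.
    votes = [p.prepare(txid) for p in participants]
    decision = "commit" if all(v == "yes" for v in votes) else "abort"
    # Commit point: durably record the decision before telling anyone.
    coordinator_log.append((txid, decision))
    # Phase 2: enforce the decision (in reality, retried until it succeeds).
    for p in participants:
        p.finish(txid, decision)
    return decision

log = []
ok = two_phase_commit("t1", [Participant(), Participant()], log)
bad = two_phase_commit("t2", [Participant(), Participant(can_commit=False)], log)
print(ok, bad)  # commit abort
```

Note where the two points of no return fall: a participant's "yes" in `prepare`, and the append to `coordinator_log` before phase 2 begins.
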
Returning to the marriage analogy, before saying “I do,” you and your bride/groom have the freedom to abort the transaction by saying “No way!” (or something to that effect). However, after saying “I do,” you cannot retract that statement. If you faint after saying “I do” and you don’t hear the minister speak the words “You are now husband and wife,” that doesn’t change the fact that the transaction was committed. When you recover consciousness later, you can find out whether you are married or not by querying the minister for the status of your global transaction ID, or you can wait for the minister’s next retry of the commit request (since the retries will have continued throughout your period of unconsciousness).
回到婚姻的类比:在说“我愿意”之前,你和你的新娘/新郎可以自由地通过说“没门!”(或类似的话)来中止这个事务。但是,一旦说了“我愿意”,你就不能收回这句话。如果你在说完“我愿意”之后晕倒了,没有听到牧师说“你们现在结为夫妻”,这并不会改变事务已经提交的事实。当你稍后恢复意识时,你可以通过向牧师查询你的全局事务 ID 的状态来得知自己是否已婚,或者等待牧师下一次重试提交请求(因为在你昏迷期间,重试会一直持续)。
Coordinator failure
We have discussed what happens if one of the participants or the network fails during 2PC: if any of the prepare requests fail or time out, the coordinator aborts the transaction; if any of the commit or abort requests fail, the coordinator retries them indefinitely. However, it is less clear what happens if the coordinator crashes.
如果在两阶段提交期间,参与者或网络出现故障,我们已经讨论了出现的情况:如果任何一方的准备请求失败或超时,协调者将中止事务;如果提交或中止请求失败,协调者将无限重试。然而,如果协调者崩溃,会发生什么就不那么清楚了。
If the coordinator fails before sending the prepare requests, a participant can safely abort the transaction. But once the participant has received a prepare request and voted “yes,” it can no longer abort unilaterally—it must wait to hear back from the coordinator whether the transaction was committed or aborted. If the coordinator crashes or the network fails at this point, the participant can do nothing but wait. A participant’s transaction in this state is called in doubt or uncertain .
如果协调者在发送准备请求之前失败,参与者可以安全地中止事务。但是,一旦参与者收到准备请求并投票“是”,它就不能再单方面中止——它必须等待协调者的答复,以得知事务是已提交还是已中止。如果此时协调者崩溃或网络故障,参与者只能等待。参与者处于这种状态的事务被称为存疑的或不确定的。
The situation is illustrated in Figure 9-10 . In this particular example, the coordinator actually decided to commit, and database 2 received the commit request. However, the coordinator crashed before it could send the commit request to database 1, and so database 1 does not know whether to commit or abort. Even a timeout does not help here: if database 1 unilaterally aborts after a timeout, it will end up inconsistent with database 2, which has committed. Similarly, it is not safe to unilaterally commit, because another participant may have aborted.
情况如图 9-10 所示。在这个特定的例子中,协调者实际上决定提交,数据库 2 也收到了提交请求。然而,协调者在将提交请求发送给数据库 1 之前崩溃了,因此数据库 1 不知道应该提交还是中止。即使超时也无济于事:如果数据库 1 在超时后单方面中止,它就会与已经提交的数据库 2 不一致。同样,单方面提交也不安全,因为另一个参与者可能已经中止。
Without hearing from the coordinator, the participant has no way of knowing whether to commit or abort. In principle, the participants could communicate among themselves to find out how each participant voted and come to some agreement, but that is not part of the 2PC protocol.
在没有协调者消息的情况下,参与者无法知道是该提交还是中止。原则上,参与者之间可以互相通信,了解每个参与者的投票情况并达成某种一致,但这不是 2PC 协议的一部分。
The only way 2PC can complete is by waiting for the coordinator to recover. This is why the coordinator must write its commit or abort decision to a transaction log on disk before sending commit or abort requests to participants: when the coordinator recovers, it determines the status of all in-doubt transactions by reading its transaction log. Any transactions that don’t have a commit record in the coordinator’s log are aborted. Thus, the commit point of 2PC comes down to a regular single-node atomic commit on the coordinator.
2PC只能等待协调者恢复才能完成。这就是为什么在向参与者发送提交或中止请求之前,协调者必须将其提交或中止决策写入磁盘上的事务日志中的原因:当协调者恢复时,它通过读取其事务日志确定所有存疑事务的状态。在协调者的日志中没有提交记录的任何事务都会被中止。因此,2PC的提交点归结为协调者上的常规单节点原子提交。
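Recovery can be sketched as a lookup in the coordinator's log, with any transaction missing a commit record treated as aborted (illustrative names, continuing the toy model above):

恢复过程可以勾勒为在协调者日志中的一次查找:任何没有提交记录的事务都被视为已中止(名称为示例,延续上面的玩具模型):

```python
# Toy sketch of coordinator recovery: in-doubt participants are answered
# from the coordinator's durable log of decisions. A transaction with no
# commit record in that log is aborted.

def resolve_in_doubt(coordinator_log, in_doubt_txids):
    committed = {txid for (txid, decision) in coordinator_log
                 if decision == "commit"}
    return {txid: ("commit" if txid in committed else "abort")
            for txid in in_doubt_txids}

# The decision for t2 never reached the coordinator's disk before the crash.
coordinator_log = [("t1", "commit")]
print(resolve_in_doubt(coordinator_log, ["t1", "t2"]))
# {'t1': 'commit', 't2': 'abort'}
```

This is why writing the decision to the coordinator's log before phase 2 is the real commit point: it is the only record that survives a coordinator crash.
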
Three-phase commit
Two-phase commit is called a blocking atomic commit protocol due to the fact that 2PC can become stuck waiting for the coordinator to recover. In theory, it is possible to make an atomic commit protocol nonblocking , so that it does not get stuck if a node fails. However, making this work in practice is not so straightforward.
两阶段提交被称为阻塞原子提交协议,因为2PC可能会被卡住,等待协调者恢复。理论上,可能会使原子提交协议非阻塞,这样即使节点失败也不会被卡住。然而,在实践中使其正常工作并不那么简单。
As an alternative to 2PC, an algorithm called three-phase commit (3PC) has been proposed [ 13 , 80 ]. However, 3PC assumes a network with bounded delay and nodes with bounded response times; in most practical systems with unbounded network delay and process pauses (see Chapter 8 ), it cannot guarantee atomicity.
作为 2PC 的替代方案,提出了一种称为三阶段提交(3PC)的算法[13, 80]。然而,3PC 假设网络具有有限延迟和节点具有有限响应时间;在大多数具有无界网络延迟和进程暂停的实际系统中(请参见第 8 章),无法保证原子性。
In general, nonblocking atomic commit requires a perfect failure detector [ 67 , 71 ]—i.e., a reliable mechanism for telling whether a node has crashed or not. In a network with unbounded delay a timeout is not a reliable failure detector, because a request may time out due to a network problem even if no node has crashed. For this reason, 2PC continues to be used, despite the known problem with coordinator failure.
一般而言,非阻塞原子提交需要完美的失效检测器[67,71],即可靠的机制来判断节点是否已经崩溃。在延迟无限制的网络中,超时并不是可靠的故障检测器,因为由于网络问题,请求可能超时,即使没有节点崩溃。因此,尽管协调员故障是已知的问题,但2PC仍然被使用。
Distributed Transactions in Practice
Distributed transactions, especially those implemented with two-phase commit, have a mixed reputation. On the one hand, they are seen as providing an important safety guarantee that would be hard to achieve otherwise; on the other hand, they are criticized for causing operational problems, killing performance, and promising more than they can deliver [ 81 , 82 , 83 , 84 ]. Many cloud services choose not to implement distributed transactions due to the operational problems they engender [ 85 , 86 ].
分布式事务,尤其是用两阶段提交实现的分布式事务,声誉毁誉参半。一方面,它们被视为提供了一个难以通过其他方式实现的重要安全保证;另一方面,它们被批评会导致运维问题、损害性能,并且承诺的比它们能兑现的更多 [81, 82, 83, 84]。许多云服务由于分布式事务带来的运维问题而选择不实现它 [85, 86]。
Some implementations of distributed transactions carry a heavy performance penalty—for example, distributed transactions in MySQL are reported to be over 10 times slower than single-node transactions [ 87 ], so it is not surprising when people advise against using them. Much of the performance cost inherent in two-phase commit is due to the additional disk forcing ( fsync ) that is required for crash recovery [ 88 ], and the additional network round-trips.
一些分布式事务的实现会带来严重的性能损失——例如,据报告,MySQL 中的分布式事务比单节点事务慢 10 倍以上 [87],因此人们建议不要使用它们也就不足为奇了。两阶段提交固有的大部分性能开销,来自崩溃恢复所需的额外磁盘强制写入(fsync)[88],以及额外的网络往返。
However, rather than dismissing distributed transactions outright, we should examine them in some more detail, because there are important lessons to be learned from them. To begin, we should be precise about what we mean by “distributed transactions.” Two quite different types of distributed transactions are often conflated:
然而,我们不应该轻易地排斥分布式事务,而应该更加深入地研究它们,因为从中可以学到重要的教训。首先,我们应该明确我们所说的“分布式事务”是什么。通常会混淆两种非常不同的分布式事务类型:
- Database-internal distributed transactions
-
Some distributed databases (i.e., databases that use replication and partitioning in their standard configuration) support internal transactions among the nodes of that database. For example, VoltDB and MySQL Cluster’s NDB storage engine have such internal transaction support. In this case, all the nodes participating in the transaction are running the same database software.
一些分布式数据库(即在其标准配置中使用复制和分区的数据库)支持该数据库节点间的内部事务。例如,VoltDB和MySQL Cluster的NDB存储引擎都具有此类内部事务支持。在这种情况下,参与事务的所有节点都运行相同的数据库软件。
- Heterogeneous distributed transactions
-
In a heterogeneous transaction, the participants are two or more different technologies: for example, two databases from different vendors, or even non-database systems such as message brokers. A distributed transaction across these systems must ensure atomic commit, even though the systems may be entirely different under the hood.
在异构事务中,参与者是两个或更多不同的技术:例如,来自不同供应商的两个数据库,甚至非数据库系统,如消息代理。在这些系统之间的分布式事务必须确保原子提交,即使这些系统在内部完全不同。
Database-internal transactions do not have to be compatible with any other system, so they can use any protocol and apply optimizations specific to that particular technology. For that reason, database-internal distributed transactions can often work quite well. On the other hand, transactions spanning heterogeneous technologies are a lot more challenging.
数据库内部事务不必与任何其他系统兼容,因此可以使用任何协议并应用特定于该特定技术的优化。因此,数据库内部分布式事务通常可以非常好地工作。另一方面,跨异构技术的事务要具有更大的挑战性。
Exactly-once message processing
Heterogeneous distributed transactions allow diverse systems to be integrated in powerful ways. For example, a message from a message queue can be acknowledged as processed if and only if the database transaction for processing the message was successfully committed. This is implemented by atomically committing the message acknowledgment and the database writes in a single transaction. With distributed transaction support, this is possible, even if the message broker and the database are two unrelated technologies running on different machines.
异构分布式事务使得不同的系统可以强大地集成在一起。例如,只有在成功提交处理消息的数据库事务后,消息队列中的消息才能被确认已处理。这是通过原子地在单个事务中提交消息确认和数据库写入来实现的。即使消息代理和数据库是运行在不同机器上的两种不相关技术,分布式事务支持也可以实现这一点。
If either the message delivery or the database transaction fails, both are aborted, and so the message broker may safely redeliver the message later. Thus, by atomically committing the message and the side effects of its processing, we can ensure that the message is effectively processed exactly once, even if it required a few retries before it succeeded. The abort discards any side effects of the partially completed transaction.
如果消息传递或数据库事务两者之一失败,则两者都会中止,因此消息代理可以安全地在之后重新传递该消息。这样,通过原子地提交消息及其处理的副作用,我们可以确保消息被有效地恰好处理一次,即使在成功之前需要几次重试。中止会丢弃部分完成的事务的所有副作用。
Such a distributed transaction is only possible if all systems affected by the transaction are able to use the same atomic commit protocol, however. For example, say a side effect of processing a message is to send an email, and the email server does not support two-phase commit: it could happen that the email is sent two or more times if message processing fails and is retried. But if all side effects of processing a message are rolled back on transaction abort, then the processing step can safely be retried as if nothing had happened.
如果事务受影响的所有系统都能够使用相同的原子提交协议,那么这样的分布式事务才有可能实现。例如,假设处理消息的一个副作用是发送电子邮件,而邮件服务器不支持两阶段提交:如果消息处理失败并重试,则可能发送两次或更多次电子邮件。但如果在事务中断时回滚处理消息的所有副作用,就可以安全地重试处理步骤,就像什么也没发生一样。
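The all-or-nothing effect might be sketched as follows, with a dict and a set standing in for the database and the broker's acknowledgment state (illustrative only; a real system would use an atomic commit protocol such as 2PC across the two systems):

这种要么全有、要么全无的效果可以大致勾勒如下,用一个字典和一个集合分别代替数据库和消息代理的确认状态(仅作示意;真实系统会在两个系统之间使用诸如 2PC 之类的原子提交协议):

```python
# Toy sketch: processing a message so that the database write and the
# message acknowledgment commit atomically. On abort, neither takes
# effect, so the broker safely redelivers and the retry starts fresh.

def process(message, database, acked, fail=False):
    staged = dict(database)                 # tentative writes, not yet visible
    staged[message["id"]] = message["body"]
    if fail:
        return "abort"                      # side effects discarded entirely
    # Atomic commit: apply the writes and the acknowledgment together.
    database.update(staged)
    acked.add(message["id"])
    return "commit"

db, acked = {}, set()
msg = {"id": "m1", "body": "hello"}
print(process(msg, db, acked, fail=True))   # abort: db and ack unchanged
print(process(msg, db, acked))              # commit: the retry succeeds once
print(db, acked)
```

The key property is that the acknowledgment never happens without the database write, and vice versa—so a crash between them cannot produce a lost or duplicated message effect.
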
We will return to the topic of exactly-once message processing in Chapter 11 . Let’s look first at the atomic commit protocol that allows such heterogeneous distributed transactions.
我们将在第 11 章回到恰好一次消息处理的话题。首先让我们看看使这种异构分布式事务成为可能的原子提交协议。
XA transactions
X/Open XA (short for eXtended Architecture ) is a standard for implementing two-phase commit across heterogeneous technologies [ 76 , 77 ]. It was introduced in 1991 and has been widely implemented: XA is supported by many traditional relational databases (including PostgreSQL, MySQL, DB2, SQL Server, and Oracle) and message brokers (including ActiveMQ, HornetQ, MSMQ, and IBM MQ).
X/Open XA(扩展架构)是一种跨异构技术实现两阶段提交的标准 [76, 77]。它于1991年引入,并得到广泛应用:XA受到许多传统关系数据库(包括PostgreSQL、MySQL、DB2、SQL Server和Oracle)和消息代理(包括ActiveMQ、HornetQ、MSMQ和IBM MQ)的支持。
XA is not a network protocol—it is merely a C API for interfacing with a transaction coordinator. Bindings for this API exist in other languages; for example, in the world of Java EE applications, XA transactions are implemented using the Java Transaction API (JTA), which in turn is supported by many drivers for databases using Java Database Connectivity (JDBC) and drivers for message brokers using the Java Message Service (JMS) APIs.
XA不是一个网络协议,它只是一个用于与事务协调器进行接口交互的C API。其他语言中也存在这个API的绑定;例如,在Java EE应用程序世界中,XA事务是使用Java事务API(JTA)实现的,而Java数据库连接(JDBC)和Java消息服务(JMS)API的许多数据库驱动程序和消息代理驱动程序都支持JTA。
XA assumes that your application uses a network driver or client library to communicate with the participant databases or messaging services. If the driver supports XA, that means it calls the XA API to find out whether an operation should be part of a distributed transaction—and if so, it sends the necessary information to the database server. The driver also exposes callbacks through which the coordinator can ask the participant to prepare, commit, or abort.
XA假定你的应用程序使用网络驱动程序或客户端库来与参与者数据库或消息服务进行通信。如果驱动程序支持XA,就意味着它会调用XA API来确定某个操作是否应该成为分布式事务的一部分——如果是,它会向数据库服务器发送必要的信息。驱动程序还暴露了一些回调,协调器可以通过这些回调要求参与者准备、提交或中止。
The transaction coordinator implements the XA API. The standard does not specify how it should be implemented, but in practice the coordinator is often simply a library that is loaded into the same process as the application issuing the transaction (not a separate service). It keeps track of the participants in a transaction, collects participants’ responses after asking them to prepare (via a callback into the driver), and uses a log on the local disk to keep track of the commit/abort decision for each transaction.
事务协调器实现XA API。标准并未规定它应如何实现,但实际上,协调器通常只是一个库,加载到发出事务的应用程序的相同进程中(而不是一个单独的服务)。它跟踪事务中的参与者,在请求它们准备(通过驱动程序回调)后收集参与者的响应,并使用本地磁盘上的日志来跟踪每个事务的提交/中止决策。
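The coordinator's role described above can be sketched as follows (the class and method names here are hypothetical; a real XA coordinator speaks the XA C API and appends its commit/abort decisions to a durable log on local disk, not an in-memory list):

```python
# A sketch of two-phase commit from the coordinator's point of view:
# phase 1 collects prepare votes from all participants; the decision is
# logged (the commit point) before phase 2 tells everyone the outcome.
class Participant:
    def __init__(self, name, will_prepare=True):
        self.name, self.will_prepare = name, will_prepare
        self.state = "active"

    def prepare(self):  # phase 1 callback: promise to commit if asked
        self.state = "prepared" if self.will_prepare else "aborted"
        return self.will_prepare

    def finish(self, decision):  # phase 2 callback
        self.state = "committed" if decision == "commit" else "aborted"

def two_phase_commit(participants, log):
    votes = [p.prepare() for p in participants]  # ask everyone to prepare
    decision = "commit" if all(votes) else "abort"
    log.append(decision)  # the decision point: made durable before phase 2
    for p in participants:
        p.finish(decision)
    return decision

log = []
ok = two_phase_commit([Participant("db"), Participant("queue")], log)
bad = two_phase_commit([Participant("db"), Participant("queue", False)], log)
print(ok, bad)  # commit abort
```

The next paragraphs explain why the log write matters: if the coordinator crashes between the two phases, the logged decision is the only record of each transaction's outcome.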
If the application process crashes, or the machine on which the application is running dies, the coordinator goes with it. Any participants with prepared but uncommitted transactions are then stuck in doubt. Since the coordinator’s log is on the application server’s local disk, that server must be restarted, and the coordinator library must read the log to recover the commit/abort outcome of each transaction. Only then can the coordinator use the database driver’s XA callbacks to ask participants to commit or abort, as appropriate. The database server cannot contact the coordinator directly, since all communication must go via its client library.
如果应用进程崩溃,或者运行应用的机器宕机,协调器也随之失效。任何已准备但未提交事务的参与者都会停留在存疑状态。由于协调器的日志存储在应用服务器的本地磁盘上,因此必须重启该服务器,且协调器库必须读取日志以恢复每个事务的提交/中止结果。只有这样,协调器才能使用数据库驱动程序的XA回调来要求参与者酌情提交或中止。数据库服务器无法直接联系协调器,因为所有通信都必须经由其客户端库。
Holding locks while in doubt
Why do we care so much about a transaction being stuck in doubt? Can’t the rest of the system just get on with its work, and ignore the in-doubt transaction that will be cleaned up eventually?
为什么我们如此关心事务停留在存疑状态?系统的其余部分难道不能继续工作,而忽略这个最终总会被清理掉的存疑事务吗?
The problem is with locking . As discussed in “Read Committed” , database transactions usually take a row-level exclusive lock on any rows they modify, to prevent dirty writes. In addition, if you want serializable isolation, a database using two-phase locking would also have to take a shared lock on any rows read by the transaction (see “Two-Phase Locking (2PL)” ).
问题出在锁定上。如在“读取已提交”中所讨论,数据库事务通常会对其修改的任何行进行行级独占锁定以防止脏写操作。此外,如果您想要可串行化的隔离级别,则使用两阶段锁定的数据库还必须对事务读取的任何行进行共享锁定(请参见“两阶段锁定(2PL)”)。
The database cannot release those locks until the transaction commits or aborts (illustrated as a shaded area in Figure 9-9 ). Therefore, when using two-phase commit, a transaction must hold onto the locks throughout the time it is in doubt. If the coordinator has crashed and takes 20 minutes to start up again, those locks will be held for 20 minutes. If the coordinator’s log is entirely lost for some reason, those locks will be held forever—or at least until the situation is manually resolved by an administrator.
在事务提交或中止之前,数据库无法释放这些锁(如图9-9中的阴影区域所示)。因此,使用两阶段提交时,事务在整个存疑期间必须一直持有这些锁。如果协调器崩溃后需要20分钟才能重新启动,这些锁就会被持有20分钟。如果协调器的日志由于某种原因完全丢失,这些锁将被永远持有,或者至少要等到管理员手动解决为止。
While those locks are held, no other transaction can modify those rows. Depending on the database, other transactions may even be blocked from reading those rows. Thus, other transactions cannot simply continue with their business—if they want to access that same data, they will be blocked. This can cause large parts of your application to become unavailable until the in-doubt transaction is resolved.
当这些锁被持有时,其他事务不能修改这些行。取决于数据库的实现,其他事务甚至可能被阻止读取这些行。因此,其他事务无法简单地继续自己的工作——如果它们想访问同样的数据,就会被阻塞。这可能导致应用程序的大部分功能不可用,直到存疑事务被解决为止。
Recovering from coordinator failure
In theory, if the coordinator crashes and is restarted, it should cleanly recover its state from the log and resolve any in-doubt transactions. However, in practice, orphaned in-doubt transactions do occur [ 89 , 90 ]—that is, transactions for which the coordinator cannot decide the outcome for whatever reason (e.g., because the transaction log has been lost or corrupted due to a software bug). These transactions cannot be resolved automatically, so they sit forever in the database, holding locks and blocking other transactions.
从理论上讲,如果协调者崩溃并重新启动,它应该从日志中清晰地恢复自己的状态并解决任何不确定的事务。然而,在实践中,孤立的不确定事务确实会发生[89, 90]--即无论出于什么原因(例如,由于软件漏洞而导致事务日志丢失或损坏),协调者无法决定结果的事务。这些事务无法自动解决,因此它们会永远停留在数据库中,持有锁并阻塞其他事务。
Even rebooting your database servers will not fix this problem, since a correct implementation of 2PC must preserve the locks of an in-doubt transaction even across restarts (otherwise it would risk violating the atomicity guarantee). It’s a sticky situation.
即使重启数据库服务器也无法解决这个问题,因为2PC的正确实现必须在重启之后仍保留存疑事务的锁(否则就会冒违反原子性保证的风险)。这是一个棘手的局面。
The only way out is for an administrator to manually decide whether to commit or roll back the transactions. The administrator must examine the participants of each in-doubt transaction, determine whether any participant has committed or aborted already, and then apply the same outcome to the other participants. Resolving the problem potentially requires a lot of manual effort, and most likely needs to be done under high stress and time pressure during a serious production outage (otherwise, why would the coordinator be in such a bad state?).
唯一的出路是让管理员手动决定提交还是回滚这些事务。管理员必须检查每个存疑事务的参与者,确定是否已有参与者提交或中止,然后把同样的结果应用到其他参与者上。解决这个问题可能需要大量的手动工作,而且很可能要在严重的生产中断期间、在高度紧张和时间压力下完成(否则,协调器为什么会处于这么糟糕的状态呢?)。
Many XA implementations have an emergency escape hatch called heuristic decisions : allowing a participant to unilaterally decide to abort or commit an in-doubt transaction without a definitive decision from the coordinator [ 76 , 77 , 91 ]. To be clear, heuristic here is a euphemism for probably breaking atomicity , since it violates the system of promises in two-phase commit. Thus, heuristic decisions are intended only for getting out of catastrophic situations, and not for regular use.
许多XA实现都有一个称为启发式决策(heuristic decisions)的紧急逃生舱口:允许参与者在没有协调器明确决策的情况下,单方面决定中止或提交一个存疑事务[76, 77, 91]。需要明确的是,这里的“启发式”是“可能破坏原子性”的委婉说法,因为它违背了两阶段提交的承诺体系。因此,启发式决策只是为了摆脱灾难性的局面,而不是为了常规使用。
Limitations of distributed transactions
XA transactions solve the real and important problem of keeping several participant data systems consistent with each other, but as we have seen, they also introduce major operational problems. In particular, the key realization is that the transaction coordinator is itself a kind of database (in which transaction outcomes are stored), and so it needs to be approached with the same care as any other important database:
XA事务解决了使多个参与者数据系统相互保持一致这一真实而重要的问题,但正如我们所见,它们也引入了重大的运维问题。特别是,关键的认识在于:事务协调器本身就是一种数据库(其中存储着事务的结果),因此需要像对待任何其他重要数据库一样谨慎地对待它:
-
If the coordinator is not replicated but runs only on a single machine, it is a single point of failure for the entire system (since its failure causes other application servers to block on locks held by in-doubt transactions). Surprisingly, many coordinator implementations are not highly available by default, or have only rudimentary replication support.
如果协调器没有被复制,而只运行在单台机器上,它就是整个系统的单点故障(因为它的故障会导致其他应用服务器阻塞在存疑事务所持有的锁上)。令人惊讶的是,许多协调器实现默认并不是高可用的,或者只有基本的复制支持。
-
Many server-side applications are developed in a stateless model (as favored by HTTP), with all persistent state stored in a database, which has the advantage that application servers can be added and removed at will. However, when the coordinator is part of the application server, it changes the nature of the deployment. Suddenly, the coordinator’s logs become a crucial part of the durable system state—as important as the databases themselves, since the coordinator logs are required in order to recover in-doubt transactions after a crash. Such application servers are no longer stateless.
许多服务器端应用程序采用无状态模型(正如HTTP所支持的),所有持久状态都存储在数据库中,这样的好处是应用程序服务器可以随意添加和删除。但是当协调器成为应用服务器的一部分时,它改变了部署的性质。突然间,协调器的日志成为了持久系统状态的关键部分,和数据库本身一样重要,因为在崩溃后需要使用协调器的日志来恢复未确定的事务。这样的应用程序服务器不再是无状态的。
-
Since XA needs to be compatible with a wide range of data systems, it is necessarily a lowest common denominator. For example, it cannot detect deadlocks across different systems (since that would require a standardized protocol for systems to exchange information on the locks that each transaction is waiting for), and it does not work with SSI (see “Serializable Snapshot Isolation (SSI)” ), since that would require a protocol for identifying conflicts across different systems.
由于XA需要与各种数据系统兼容,因此它必然是最低公共分母。例如,它无法在不同系统之间检测死锁(因为这需要标准化协议来交换有关每个事务正在等待的锁的信息),并且它不能与SSI(参见“可串行快照隔离(SSI)”)一起工作,因为这需要一个用于识别不同系统之间冲突的协议。
-
For database-internal distributed transactions (not XA), the limitations are not so great—for example, a distributed version of SSI is possible. However, there remains the problem that for 2PC to successfully commit a transaction, all participants must respond. Consequently, if any part of the system is broken, the transaction also fails. Distributed transactions thus have a tendency of amplifying failures , which runs counter to our goal of building fault-tolerant systems.
对于数据库内部分布式事务(非XA事务),限制并不是很大,例如可以实现分布式的SSI版本。然而,仍然存在一个问题,就是为了使2PC成功提交一个事务,所有参与者都必须做出响应。因此,如果系统的任何部分出现故障,事务也会失败。分布式事务因此具有放大故障的趋势,这与我们构建容错系统的目标相悖。
Do these facts mean we should give up all hope of keeping several systems consistent with each other? Not quite—there are alternative methods that allow us to achieve the same thing without the pain of heterogeneous distributed transactions. We will return to these in Chapters 11 and 12 . But first, we should wrap up the topic of consensus.
这些事实是否意味着我们应该放弃让多个系统彼此保持一致的所有希望?并非如此——有一些替代方法可以让我们实现同样的目标,而无需承受异构分布式事务的痛苦。我们将在第11章和第12章回到这些方法。但首先,我们应该把共识这个话题收尾。
Fault-Tolerant Consensus
Informally, consensus means getting several nodes to agree on something. For example, if several people concurrently try to book the last seat on an airplane, or the same seat in a theater, or try to register an account with the same username, then a consensus algorithm could be used to determine which one of these mutually incompatible operations should be the winner.
非正式地说,共识意味着让几个节点就某件事达成一致。例如,如果几个人同时试图预订飞机上的最后一个座位、剧院里的同一个座位,或者试图用同一个用户名注册账户,就可以用共识算法来确定这些互不相容的操作中哪一个应该胜出。
The consensus problem is normally formalized as follows: one or more nodes may propose values, and the consensus algorithm decides on one of those values. In the seat-booking example, when several customers are concurrently trying to buy the last seat, each node handling a customer request may propose the ID of the customer it is serving, and the decision indicates which one of those customers got the seat.
共识问题通常被形式化如下:一个或多个节点可能会提出值,而共识算法则决定其中的一个值。在预订座位的例子中,当几个客户同时尝试购买最后一个座位时,每个处理客户请求的节点可能会提出服务的客户的 ID,决策指示哪个客户得到了座位。
In this formalism, a consensus algorithm must satisfy the following properties [ 25 ]: xiii
在这个形式化表述中,共识算法必须满足以下性质[25]:xiii
- Uniform agreement
-
No two nodes decide differently.
没有两个节点会做出不同的决定。
- Integrity
-
No node decides twice.
没有节点会做出两次决定。
- Validity
-
If a node decides value v , then v was proposed by some node.
如果一个节点决定了值 v,那么 v 是由某个节点提出的。
- Termination
-
Every node that does not crash eventually decides some value.
每个不崩溃的节点最终都会决定某个值。
The uniform agreement and integrity properties define the core idea of consensus: everyone decides on the same outcome, and once you have decided, you cannot change your mind. The validity property exists mostly to rule out trivial solutions: for example, you could have an algorithm that always decides null, no matter what was proposed; this algorithm would satisfy the agreement and integrity properties, but not the validity property.
一致同意和完整性属性定义了共识的核心思想:每个人都决定相同的结果,而且一旦做出决定,就不能改变主意。有效性属性主要是为了排除平凡的解决方案:例如,你可以有一个无论提议什么都决定为null的算法;这个算法满足一致同意和完整性属性,但不满足有效性属性。
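The safety properties in the formalism above can be made concrete with a small checker (a sketch: `decisions` maps each node to the ordered list of values it decided, and `proposed` is the set of values actually proposed; termination, being a liveness property, cannot be checked from a finite trace like this):

```python
# A sketch that checks the three safety properties of consensus against
# a recorded trace of node decisions.
def check_safety(decisions, proposed):
    decided = [v for log in decisions.values() for v in log]
    agreement = len(set(decided)) <= 1                            # no two nodes decide differently
    integrity = all(len(log) <= 1 for log in decisions.values())  # no node decides twice
    validity = all(v in proposed for v in decided)                # only proposed values are decided
    return agreement and integrity and validity

ok = check_safety({"n1": ["a"], "n2": ["a"], "n3": []}, {"a", "b"})
bad = check_safety({"n1": ["a"], "n2": ["b"]}, {"a", "b"})  # violates agreement
print(ok, bad)  # True False
```

Note that the trivial always-decide-null algorithm from the text would pass the agreement and integrity checks here but fail the validity check, since null is never in `proposed`.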
If you don’t care about fault tolerance, then satisfying the first three properties is easy: you can just hardcode one node to be the “dictator,” and let that node make all of the decisions. However, if that one node fails, then the system can no longer make any decisions. This is, in fact, what we saw in the case of two-phase commit: if the coordinator fails, in-doubt participants cannot decide whether to commit or abort.
如果您不关心容错性,那么满足前三个属性是很容易的:您只需将一个节点硬编码为“独裁者”,并让该节点做出所有决策。然而,如果该节点失败,那么系统将无法做出任何决策。事实上,这就是我们在两阶段提交的情况下看到的情况:如果协调员失败,则存在疑问的参与者无法决定是提交还是中止。
The termination property formalizes the idea of fault tolerance. It essentially says that a consensus algorithm cannot simply sit around and do nothing forever—in other words, it must make progress. Even if some nodes fail, the other nodes must still reach a decision. (Termination is a liveness property, whereas the other three are safety properties—see “Safety and liveness” .)
终止属性将容错的概念形式化。它基本上表明共识算法不能无休止地闲置下去,换句话说,它必须取得进展。即使某些节点失败,其他节点仍必须达成决定。(终止是活跃性质,而其他三种是安全性质——请参见“安全性和活跃性”。)
The system model of consensus assumes that when a node “crashes,” it suddenly disappears and never comes back. (Instead of a software crash, imagine that there is an earthquake, and the datacenter containing your node is destroyed by a landslide. You must assume that your node is buried under 30 feet of mud and is never going to come back online.) In this system model, any algorithm that has to wait for a node to recover is not going to be able to satisfy the termination property. In particular, 2PC does not meet the requirements for termination.
共识的系统模型假设,当一个节点“崩溃”时,它会突然消失并且永远不会回来。(想象的不是软件崩溃,而是一场地震:包含你的节点的数据中心被山体滑坡摧毁。你必须假设你的节点被埋在30英尺深的泥土下,永远不会重新上线。)在这个系统模型中,任何需要等待节点恢复的算法都无法满足终止属性。特别是,两阶段提交不满足终止性的要求。
Of course, if all nodes crash and none of them are running, then it is not possible for any algorithm to decide anything. There is a limit to the number of failures that an algorithm can tolerate: in fact, it can be proved that any consensus algorithm requires at least a majority of nodes to be functioning correctly in order to assure termination [ 67 ]. That majority can safely form a quorum (see “Quorums for reading and writing” ).
当然,如果所有节点都崩溃,没有一个在运行,那么任何算法都不可能决定任何事情。算法能够容忍的故障数量是有限的:事实上,可以证明任何共识算法都需要至少过半数的节点正确运行才能确保终止[67]。这个多数可以安全地组成法定人数(quorum)(参见“读写的法定人数”)。
Thus, the termination property is subject to the assumption that fewer than half of the nodes are crashed or unreachable. However, most implementations of consensus ensure that the safety properties—agreement, integrity, and validity—are always met, even if a majority of nodes fail or there is a severe network problem [ 92 ]. Thus, a large-scale outage can stop the system from being able to process requests, but it cannot corrupt the consensus system by causing it to make invalid decisions.
因此,终止属性建立在“崩溃或不可达的节点少于一半”这一假设之上。然而,大多数共识实现都能确保安全属性——一致同意、完整性和有效性——始终得到满足,即使大多数节点失效或出现严重的网络问题[92]。因此,大规模的故障可能使系统无法处理请求,但不会因为迫使系统做出无效决定而破坏共识系统。
Most consensus algorithms assume that there are no Byzantine faults, as discussed in “Byzantine Faults” . That is, if a node does not correctly follow the protocol (for example, if it sends contradictory messages to different nodes), it may break the safety properties of the protocol. It is possible to make consensus robust against Byzantine faults as long as fewer than one-third of the nodes are Byzantine-faulty [ 25 , 93 ], but we don’t have space to discuss those algorithms in detail in this book.
大多数共识算法假定没有拜占庭故障,正如“拜占庭故障”中所讨论的那样。也就是说,如果一个节点没有正确遵循协议(例如,向不同节点发送相互矛盾的消息),它可能会破坏协议的安全属性。只要发生拜占庭故障的节点少于三分之一,就有可能使共识对拜占庭故障保持鲁棒[25, 93],但我们没有足够的篇幅在本书中详细讨论这些算法。
Consensus algorithms and total order broadcast
The best-known fault-tolerant consensus algorithms are Viewstamped Replication (VSR) [ 94 , 95 ], Paxos [ 96 , 97 , 98 , 99 ], Raft [ 22 , 100 , 101 ], and Zab [ 15 , 21 , 102 ]. There are quite a few similarities between these algorithms, but they are not the same [ 103 ]. In this book we won’t go into full details of the different algorithms: it’s sufficient to be aware of some of the high-level ideas that they have in common, unless you’re implementing a consensus system yourself (which is probably not advisable—it’s hard [ 98 , 104 ]).
最为知名的容错共识算法包括Viewstamped Replication (VSR)[94, 95]、Paxos[96, 97, 98, 99]、Raft[22, 100, 101]和Zab[15, 21, 102]。这些算法有许多相似之处,但并非完全相同[103]。在本书中,我们不会详细介绍不同算法的细节:只需要了解它们在高级别想法方面的某些共性即可,除非你正在实现一个共识系统(这可能并不明智——很难[98, 104])。
Most of these algorithms actually don’t directly use the formal model described here (proposing and deciding on a single value, while satisfying the agreement, integrity, validity, and termination properties). Instead, they decide on a sequence of values, which makes them total order broadcast algorithms, as discussed previously in this chapter (see “Total Order Broadcast” ).
大多数这些算法实际上并没有直接使用描述在这里的形式化模型(提出并决定单个值,同时满足协议、完整性、有效性和终止性属性)。相反,它们决定一个值的序列,使它们成为总排序广播算法,就像在本章之前讨论的那样(请参见“总排序广播”)。
Remember that total order broadcast requires messages to be delivered exactly once, in the same order, to all nodes. If you think about it, this is equivalent to performing several rounds of consensus: in each round, nodes propose the message that they want to send next, and then decide on the next message to be delivered in the total order [ 67 ].
请记住,全序广播要求消息恰好一次地、以相同的顺序投递到所有节点。仔细想想,这相当于进行多轮共识:在每一轮中,节点提出它们接下来想要发送的消息,然后决定全序中下一条要投递的消息[67]。
So, total order broadcast is equivalent to repeated rounds of consensus (each consensus decision corresponding to one message delivery):
因此,全序广播相当于重复多轮的共识(每个共识决定对应一次消息投递):
-
Due to the agreement property of consensus, all nodes decide to deliver the same messages in the same order.
由于共识的一致同意属性,所有节点决定以相同的顺序投递相同的消息。
-
Due to the integrity property, messages are not duplicated.
由于完整性属性,消息不会重复。
-
Due to the validity property, messages are not corrupted and not fabricated out of thin air.
由于有效性特性,消息不会被损坏或虚构出来。
-
Due to the termination property, messages are not lost.
由于终止属性,消息不会丢失。
Viewstamped Replication, Raft, and Zab implement total order broadcast directly, because that is more efficient than doing repeated rounds of one-value-at-a-time consensus. In the case of Paxos, this optimization is known as Multi-Paxos.
Viewstamped Replication、Raft和Zab直接实现了全序广播,因为这比重复进行一次一值的共识更高效。对于Paxos,这种优化被称为Multi-Paxos。
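The equivalence above can be simulated by running one consensus round per delivery slot (a sketch: `min` stands in for a real consensus decision rule; the point is only that every node applies the same deterministic rule and therefore delivers the same messages in the same order):

```python
# A sketch of total order broadcast built from repeated consensus:
# each round decides the next message to deliver in the total order.
def total_order_broadcast(rounds):
    delivered = []
    for proposals in rounds:      # one consensus round per delivery slot
        winner = min(proposals)   # every node "decides" the same value
        delivered.append(winner)
    return delivered

# three rounds of proposals; a message not chosen in one round ("m3")
# is proposed again in the next
rounds = [{"m1", "m3"}, {"m3"}, {"m2"}]
print(total_order_broadcast(rounds))  # ['m1', 'm3', 'm2']
```

Real systems avoid exactly this round-per-message overhead, which is why Raft, Zab, and Multi-Paxos implement the sequence directly under a stable leader.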
Single-leader replication and consensus
In Chapter 5 we discussed single-leader replication (see “Leaders and Followers” ), which takes all the writes to the leader and applies them to the followers in the same order, thus keeping replicas up to date. Isn’t this essentially total order broadcast? How come we didn’t have to worry about consensus in Chapter 5 ?
在第5章,我们讨论了单领导副本复制(参见“领导者和追随者”),它将所有写入操作发送到领导者并按照相同的顺序应用到追随者上,因此保持副本最新。这不是本质上的总序广播吗?为什么我们在第5章不必担心共识问题?
The answer comes down to how the leader is chosen. If the leader is manually chosen and configured by the humans in your operations team, you essentially have a “consensus algorithm” of the dictatorial variety: only one node is allowed to accept writes (i.e., make decisions about the order of writes in the replication log), and if that node goes down, the system becomes unavailable for writes until the operators manually configure a different node to be the leader. Such a system can work well in practice, but it does not satisfy the termination property of consensus because it requires human intervention in order to make progress.
答案取决于如何选出领导者。如果领导者是由运维团队的人工选择和配置的,你实际上拥有的是一种“独裁式”的共识算法:只允许一个节点接受写入(即对复制日志中写入的顺序做出决定),如果该节点宕机,系统就无法写入,直到运维人员手动配置另一个节点作为领导者。这样的系统在实践中可以很好地工作,但它不满足共识的终止属性,因为它需要人为干预才能取得进展。
Some databases perform automatic leader election and failover, promoting a follower to be the new leader if the old leader fails (see “Handling Node Outages” ). This brings us closer to fault-tolerant total order broadcast, and thus to solving consensus.
一些数据库执行自动领导选举和故障转移,如果旧领导者失败,则晋升一个跟随者成为新领导者(参见“处理节点故障”)。这使我们更接近容错的完全排序广播,从而解决共识问题。
However, there is a problem. We previously discussed the problem of split brain, and said that all nodes need to agree who the leader is—otherwise two different nodes could each believe themselves to be the leader, and consequently get the database into an inconsistent state. Thus, we need consensus in order to elect a leader. But if the consensus algorithms described here are actually total order broadcast algorithms, and total order broadcast is like single-leader replication, and single-leader replication requires a leader, then…
然而,存在一个问题。我们之前讨论过脑裂的问题,并且说过所有节点都需要就谁是领导者达成一致——否则两个不同的节点可能各自认为自己是领导者,从而使数据库进入不一致的状态。因此,我们需要共识来选出领导者。但如果这里描述的共识算法实际上是全序广播算法,而全序广播又像单领导者复制,单领导者复制又需要一个领导者,那么……
It seems that in order to elect a leader, we first need a leader. In order to solve consensus, we must first solve consensus. How do we break out of this conundrum?
似乎为了选举一位领袖,我们需要首先有一位领袖。为了解决共识问题,我们必须首先解决共识问题。我们如何打破这个困境?
Epoch numbering and quorums
All of the consensus protocols discussed so far internally use a leader in some form or another, but they don’t guarantee that the leader is unique. Instead, they can make a weaker guarantee: the protocols define an epoch number (called the ballot number in Paxos, view number in Viewstamped Replication, and term number in Raft) and guarantee that within each epoch, the leader is unique.
迄今为止讨论的所有共识协议在内部都以某种形式使用一个领导者,但它们并不保证领导者是唯一的。相反,它们可以做出较弱的保证:协议定义了一个纪元编号(epoch number,在Paxos中称为投票编号(ballot number),在Viewstamped Replication中称为视图编号(view number),在Raft中称为任期编号(term number)),并保证在每个纪元内,领导者是唯一的。
Every time the current leader is thought to be dead, a vote is started among the nodes to elect a new leader. This election is given an incremented epoch number, and thus epoch numbers are totally ordered and monotonically increasing. If there is a conflict between two different leaders in two different epochs (perhaps because the previous leader actually wasn’t dead after all), then the leader with the higher epoch number prevails.
每当现任领导者被认为挂掉时,节点之间就会开始一次投票,以选举新的领导者。这次选举被赋予一个递增的纪元编号,因此纪元编号是全序且单调递增的。如果两个不同纪元中的两个不同领导者之间出现冲突(也许是因为前任领导者其实并没有死),那么纪元编号更高的领导者获胜。
Before a leader is allowed to decide anything, it must first check that there isn’t some other leader with a higher epoch number which might take a conflicting decision. How does a leader know that it hasn’t been ousted by another node? Recall “The Truth Is Defined by the Majority” : a node cannot necessarily trust its own judgment—just because a node thinks that it is the leader, that does not necessarily mean the other nodes accept it as their leader.
在领导者被允许做出任何决定之前,它必须首先检查是否存在纪元编号更高的其他领导者可能做出相互冲突的决定。领导者如何知道自己没有被另一个节点罢免?回想一下“真理由多数定义”:节点不一定能信任自己的判断——仅仅因为一个节点认为自己是领导者,并不意味着其他节点都接受它作为领导者。
Instead, it must collect votes from a quorum of nodes (see “Quorums for reading and writing” ). For every decision that a leader wants to make, it must send the proposed value to the other nodes and wait for a quorum of nodes to respond in favor of the proposal. The quorum typically, but not always, consists of a majority of nodes [ 105 ]. A node votes in favor of a proposal only if it is not aware of any other leader with a higher epoch.
相反,它必须从法定人数(quorum)的节点那里收集投票(参见“读写的法定人数”)。对于领导者想要做出的每个决定,它都必须将提议的值发送给其他节点,并等待法定人数的节点响应并赞成该提议。法定人数通常(但不总是)由过半数的节点组成[105]。只有在不知道任何纪元编号更高的其他领导者时,节点才会投票赞成一个提议。
Thus, we have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s proposal. The key insight is that the quorums for those two votes must overlap: if a vote on a proposal succeeds, at least one of the nodes that voted for it must have also participated in the most recent leader election [ 105 ]. Thus, if the vote on a proposal does not reveal any higher-numbered epoch, the current leader can conclude that no leader election with a higher epoch number has happened, and therefore be sure that it still holds the leadership. It can then safely decide the proposed value.
因此,我们有两轮投票:第一轮选出一位领导者,第二轮对领导者的提议进行投票。关键的洞察在于,这两轮投票的法定人数必须重叠:如果一个提议的投票成功,那么投票支持它的节点中至少有一个也必须参与过最近一次的领导者选举[105]。因此,如果对提议的投票没有揭示任何更高编号的纪元,现任领导者就可以得出结论:没有发生过纪元编号更高的领导者选举,因此可以确信自己仍然掌握领导权。然后它就可以安全地对提议的值做出决定。
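The overlap argument can be checked exhaustively for a small cluster (a sketch: it enumerates every majority quorum of a 5-node cluster and confirms that any two of them share at least one node, so a proposal vote always reaches someone who participated in the most recent leader election):

```python
from itertools import combinations

# A sketch of the quorum-overlap property: any two majority quorums of
# an n-node cluster must intersect, because together they would contain
# more than n nodes if they were disjoint.
n = 5
nodes = range(n)
quorums = [set(q) for q in combinations(nodes, n // 2 + 1)]  # all majorities

overlap = all(a & b for a in quorums for b in quorums)
print(overlap)  # True
```

By contrast, two quorums of size `n // 2` (not a majority) can be disjoint, which is exactly why consensus requires strict majorities.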
This voting process looks superficially similar to two-phase commit. The biggest differences are that in 2PC the coordinator is not elected, and that fault-tolerant consensus algorithms only require votes from a majority of nodes, whereas 2PC requires a “yes” vote from every participant. Moreover, consensus algorithms define a recovery process by which nodes can get into a consistent state after a new leader is elected, ensuring that the safety properties are always met. These differences are key to the correctness and fault tolerance of a consensus algorithm.
这个投票过程表面上看起来类似于两阶段提交。最大的区别在于:在2PC中,协调器不是由选举产生的;而且容错共识算法只需要多数节点的投票,而2PC则要求每个参与者都投“是”。此外,共识算法定义了一个恢复过程,节点可以通过它在选出新领导者之后进入一致的状态,从而确保始终满足安全属性。这些差异是共识算法正确性和容错性的关键。
Limitations of consensus
Consensus algorithms are a huge breakthrough for distributed systems: they bring concrete safety properties (agreement, integrity, and validity) to systems where everything else is uncertain, and they nevertheless remain fault-tolerant (able to make progress as long as a majority of nodes are working and reachable). They provide total order broadcast, and therefore they can also implement linearizable atomic operations in a fault-tolerant way (see “Implementing linearizable storage using total order broadcast” ).
共识算法是分布式系统的重大突破:它们为其他一切都不确定的系统带来了具体的安全属性(一致同意、完整性和有效性),并且它们仍然是容错的(只要多数节点正常工作且可达,就能取得进展)。它们提供了全序广播,因此也能以容错的方式实现可线性化的原子操作(参见“使用全序广播实现可线性化存储”)。
Nevertheless, they are not used everywhere, because the benefits come at a cost.
然而,它们并非到处都被使用,因为好处需要付出代价。
The process by which nodes vote on proposals before they are decided is a kind of synchronous replication. As discussed in “Synchronous Versus Asynchronous Replication” , databases are often configured to use asynchronous replication. In this configuration, some committed data can potentially be lost on failover—but many people choose to accept this risk for the sake of better performance.
节点在决策之前对提案进行投票的过程属于同步复制的一种。如“同步与异步复制”所述,数据库通常被配置为使用异步复制。在此配置中,一些已提交的数据可能会在故障转移时丢失 - 但许多人选择为了更好的性能而接受这种风险。
Consensus systems always require a strict majority to operate. This means you need a minimum of three nodes in order to tolerate one failure (the remaining two out of three form a majority), or a minimum of five nodes to tolerate two failures (the remaining three out of five form a majority). If a network failure cuts off some nodes from the rest, only the majority portion of the network can make progress, and the rest is blocked (see also “The Cost of Linearizability” ).
共识系统总是需要严格的多数来运作。这意味着你需要至少三个节点才能容忍一个故障(三个中的两个形成多数),或者至少五个节点来容忍两个故障(五个中的三个形成多数)。如果网络故障将一些节点从其余部分中隔离开来,则只有网络的多数部分才能取得进展,其余部分将被阻塞(也请参见“线性一致性成本”)。
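The sizing rule above is just arithmetic: tolerating f failures requires 2f + 1 nodes, so the surviving f + 1 nodes still outnumber half the cluster (a minimal sketch):

```python
# The majority arithmetic from the text: to tolerate f crashed nodes you
# need 2f + 1 nodes in total, since the remaining f + 1 must still form
# a strict majority.
def nodes_needed(f):
    return 2 * f + 1

for f in (1, 2, 3):
    n = nodes_needed(f)
    survivors = n - f
    assert survivors > n // 2  # the survivors form a strict majority

print(nodes_needed(1), nodes_needed(2))  # 3 5
```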
Most consensus algorithms assume a fixed set of nodes that participate in voting, which means that you can’t just add or remove nodes in the cluster. Dynamic membership extensions to consensus algorithms allow the set of nodes in the cluster to change over time, but they are much less well understood than static membership algorithms.
大多数共识算法假定参与投票的节点集是固定的,这意味着你不能随意在集群中添加或删除节点。共识算法的动态成员扩展允许集群中的节点集随时间变化,但人们对它们的理解远不如静态成员算法。
Consensus systems generally rely on timeouts to detect failed nodes. In environments with highly variable network delays, especially geographically distributed systems, it often happens that a node falsely believes the leader to have failed due to a transient network issue. Although this error does not harm the safety properties, frequent leader elections result in terrible performance because the system can end up spending more time choosing a leader than doing any useful work.
共识系统通常依靠超时来检测失效的节点。在网络延迟高度可变的环境中,特别是在地理上分散的系统中,经常会发生这样的情况:一个节点由于暂时的网络问题而错误地认为领导者已经失效。虽然这种错误不会损害安全属性,但频繁的领导者选举会导致糟糕的性能,因为系统最终可能花在选择领导者上的时间比做任何有用工作的时间还要多。
Sometimes, consensus algorithms are particularly sensitive to network problems. For example, Raft has been shown to have unpleasant edge cases [ 106 ]: if the entire network is working correctly except for one particular network link that is consistently unreliable, Raft can get into situations where leadership continually bounces between two nodes, or the current leader is continually forced to resign, so the system effectively never makes progress. Other consensus algorithms have similar problems, and designing algorithms that are more robust to unreliable networks is still an open research problem.
有时,共识算法对网络问题特别敏感。例如,Raft已被证明存在令人不快的边界情况[106]:如果除了某一条持续不可靠的网络链路之外整个网络都工作正常,Raft可能会陷入领导权不断在两个节点之间来回跳动、或者现任领导者不断被迫辞职的局面,使系统实际上永远无法取得进展。其他共识算法也存在类似的问题,设计对不可靠网络更鲁棒的算法仍是一个开放的研究问题。
Membership and Coordination Services
Projects like ZooKeeper or etcd are often described as “distributed key-value stores” or “coordination and configuration services.” The API of such a service looks pretty much like that of a database: you can read and write the value for a given key, and iterate over keys. So if they’re basically databases, why do they go to all the effort of implementing a consensus algorithm? What makes them different from any other kind of database?
像ZooKeeper或etcd这样的项目通常被称为“分布式键值存储”或“协调和配置服务”。这样的服务的API看起来非常像数据库:您可以读取和写入给定键的值,并迭代键。那么,如果它们基本上是数据库,为什么它们要付出所有实现共识算法的努力呢?它们与任何其他类型的数据库有何不同之处?
To understand this, it is helpful to briefly explore how a service like ZooKeeper is used. As an application developer, you will rarely need to use ZooKeeper directly, because it is actually not well suited as a general-purpose database. It is more likely that you will end up relying on it indirectly via some other project: for example, HBase, Hadoop YARN, OpenStack Nova, and Kafka all rely on ZooKeeper running in the background. What is it that these projects get from it?
为了理解这一点,简要了解一下ZooKeeper这样的服务是如何被使用的会很有帮助。作为应用开发者,你很少需要直接使用ZooKeeper,因为它实际上并不适合作为通用数据库。更有可能的情况是,你会通过某个其他项目间接地依赖它:例如,HBase、Hadoop YARN、OpenStack Nova和Kafka都依赖于在后台运行的ZooKeeper。这些项目从它那里得到了什么呢?
ZooKeeper and etcd are designed to hold small amounts of data that can fit entirely in memory (although they still write to disk for durability)—so you wouldn’t want to store all of your application’s data here. That small amount of data is replicated across all the nodes using a fault-tolerant total order broadcast algorithm. As discussed previously, total order broadcast is just what you need for database replication: if each message represents a write to the database, applying the same writes in the same order keeps replicas consistent with each other.
ZooKeeper和etcd被设计用来保存少量可以完全放入内存的数据(尽管为了持久性,它们仍会写入磁盘)——所以你不会想在这里存储应用的所有数据。这少量的数据通过容错的全序广播算法复制到所有节点。如前所述,全序广播正是数据库复制所需要的:如果每条消息代表对数据库的一次写入,以相同的顺序应用相同的写入就能使副本彼此保持一致。
ZooKeeper is modeled after Google’s Chubby lock service [ 14 , 98 ], implementing not only total order broadcast (and hence consensus), but also an interesting set of other features that turn out to be particularly useful when building distributed systems:
ZooKeeper模仿Google的Chubby锁服务[14, 98],不仅实现了全序广播(因此也实现了共识),还实现了一组有趣的其他特性,这些特性在构建分布式系统时被证明特别有用:
- Linearizable atomic operations
-
Using an atomic compare-and-set operation, you can implement a lock: if several nodes concurrently try to perform the same operation, only one of them will succeed. The consensus protocol guarantees that the operation will be atomic and linearizable, even if a node fails or the network is interrupted at any point. A distributed lock is usually implemented as a lease , which has an expiry time so that it is eventually released in case the client fails (see “Process Pauses” ).
使用原子比较和设置操作,您可以实现一个锁:如果多个节点同时尝试执行相同的操作,只有其中一个会成功。共识协议保证操作是原子和可线性化的,即使节点失败或网络在任何时候中断。分布式锁通常作为租约实现,该租约具有过期时间,以便在客户端失败时最终释放(请参见“进程暂停”)。
- Total ordering of operations
-
As discussed in “The leader and the lock” , when some resource is protected by a lock or lease, you need a fencing token to prevent clients from conflicting with each other in the case of a process pause. The fencing token is some number that monotonically increases every time the lock is acquired. ZooKeeper provides this by totally ordering all operations and giving each operation a monotonically increasing transaction ID (zxid) and version number (cversion) [ 15 ].
如“领导者与锁”中所讨论的,当某个资源受到锁或租约保护时,你需要一个防护令牌(fencing token)来防止客户端在进程暂停的情况下相互冲突。防护令牌是一个每次获取锁时都单调递增的数字。ZooKeeper通过对所有操作进行全序排序,并为每个操作赋予单调递增的事务ID(zxid)和版本号(cversion)[15]来提供这一特性。
- Failure detection
Clients maintain a long-lived session on ZooKeeper servers, and the client and server periodically exchange heartbeats to check that the other node is still alive. Even if the connection is temporarily interrupted, or a ZooKeeper node fails, the session remains active. However, if the heartbeats cease for a duration that is longer than the session timeout, ZooKeeper declares the session to be dead. Any locks held by a session can be configured to be automatically released when the session times out (ZooKeeper calls these ephemeral nodes ).
- Change notifications
Not only can one client read locks and values that were created by another client, but it can also watch them for changes. Thus, a client can find out when another client joins the cluster (based on the value it writes to ZooKeeper), or if another client fails (because its session times out and its ephemeral nodes disappear). By subscribing to notifications, a client avoids having to frequently poll to find out about changes.
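The change-notification feature can be illustrated with a toy in-process store (a sketch with a made-up API, not ZooKeeper's actual client interface); as in ZooKeeper, each watch fires once and must be re-registered:

```python
# Toy sketch of change notifications: clients register watches on a key
# and are called back when it changes, instead of polling. As in
# ZooKeeper, a watch is one-shot: it is removed once it has fired.
class WatchableStore:
    def __init__(self):
        self.data = {}
        self.watchers = {}   # key -> list of callbacks

    def watch(self, key, callback):
        self.watchers.setdefault(key, []).append(callback)

    def set(self, key, value):
        self.data[key] = value
        # pop() removes the watchers so that each fires only once
        for cb in self.watchers.pop(key, []):
            cb(key, value)

events = []
store = WatchableStore()
store.watch("/members/node-2", lambda k, v: events.append((k, v)))
store.set("/members/node-2", "alive")
assert events == [("/members/node-2", "alive")]
```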
Of these features, only the linearizable atomic operations really require consensus. However, it is the combination of these features that makes systems like ZooKeeper so useful for distributed coordination.
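The first two features in the list above, linearizable compare-and-set and fencing tokens, can be sketched together. The classes and token values below are illustrative only; in a real system, the compare-and-set step is what the consensus protocol makes atomic and linearizable:

```python
# Toy in-process sketch (not ZooKeeper's API): a lease-based lock whose
# acquisitions hand out monotonically increasing fencing tokens, plus a
# storage service that rejects writes made under a stale token.
class LeaseLock:
    def __init__(self):
        self.holder = None   # (owner, lease expiry time) or None
        self.token = 0       # increases on every successful acquisition

    def try_acquire(self, owner, ttl, now):
        # Compare-and-set: the lock must be free, or its lease expired.
        if self.holder is None or self.holder[1] <= now:
            self.holder = (owner, now + ttl)
            self.token += 1
            return self.token   # fencing token
        return None

class FencedStorage:
    def __init__(self):
        self.max_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.max_token:
            raise PermissionError(f"stale fencing token {token}")
        self.max_token = token
        self.data[key] = value

lock, storage = LeaseLock(), FencedStorage()
t1 = lock.try_acquire("client-1", ttl=10, now=0.0)     # token 1
assert lock.try_acquire("client-2", ttl=10, now=5.0) is None
t2 = lock.try_acquire("client-2", ttl=10, now=11.0)    # lease expired; token 2
storage.write(t2, "file", "from client 2")
try:
    storage.write(t1, "file", "from client 1")         # paused client resumes
except PermissionError:
    pass                                               # stale token rejected
assert storage.data["file"] == "from client 2"
```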
Allocating work to nodes
One example in which the ZooKeeper/Chubby model works well is if you have several instances of a process or service, and one of them needs to be chosen as leader or primary. If the leader fails, one of the other nodes should take over. This is of course useful for single-leader databases, but it’s also useful for job schedulers and similar stateful systems.
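The standard ZooKeeper recipe for this is that each candidate creates an ephemeral sequential node, and the candidate whose node has the lowest sequence number is the leader. The toy registry below models only the sequence numbering, not real sessions or networking:

```python
# Sketch of ZooKeeper-style leader election via sequence numbers: each
# candidate is assigned the next number on joining, and the candidate
# with the lowest number is the leader. leave() stands in for session
# expiry, which removes a failed candidate's ephemeral node.
class ElectionRegistry:
    def __init__(self):
        self.counter = 0
        self.candidates = {}   # node name -> sequence number

    def enter(self, name):
        self.candidates[name] = self.counter
        self.counter += 1

    def leave(self, name):
        del self.candidates[name]

    def leader(self):
        return min(self.candidates, key=self.candidates.get)

reg = ElectionRegistry()
reg.enter("node-a"); reg.enter("node-b"); reg.enter("node-c")
assert reg.leader() == "node-a"
reg.leave("node-a")               # leader fails; next in line takes over
assert reg.leader() == "node-b"
```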
Another example arises when you have some partitioned resource (database, message streams, file storage, distributed actor system, etc.) and need to decide which partition to assign to which node. As new nodes join the cluster, some of the partitions need to be moved from existing nodes to the new nodes in order to rebalance the load (see “Rebalancing Partitions” ). As nodes are removed or fail, other nodes need to take over the failed nodes’ work.
These kinds of tasks can be achieved by judicious use of atomic operations, ephemeral nodes, and notifications in ZooKeeper. If done correctly, this approach allows the application to automatically recover from faults without human intervention. This is not easy, even with libraries such as Apache Curator [ 17 ] that have sprung up to provide higher-level tools on top of the ZooKeeper client API, but it is still much better than attempting to implement the necessary consensus algorithms from scratch, which has a poor success record [ 107 ].
An application may initially run only on a single node, but eventually may grow to thousands of nodes. Trying to perform majority votes over so many nodes would be terribly inefficient. Instead, ZooKeeper runs on a fixed number of nodes (usually three or five) and performs its majority votes among those nodes while supporting a potentially large number of clients. Thus, ZooKeeper provides a way of “outsourcing” some of the work of coordinating nodes (consensus, operation ordering, and failure detection) to an external service.
Normally, the kind of data managed by ZooKeeper is quite slow-changing: it represents information like “the node running on 10.1.1.23 is the leader for partition 7,” which may change on a timescale of minutes or hours. It is not intended for storing the runtime state of the application, which may change thousands or even millions of times per second. If application state needs to be replicated from one node to another, other tools (such as Apache BookKeeper [ 108 ]) can be used.
Service discovery
ZooKeeper, etcd, and Consul are also often used for service discovery —that is, to find out which IP address you need to connect to in order to reach a particular service. In cloud datacenter environments, where it is common for virtual machines to continually come and go, you often don’t know the IP addresses of your services ahead of time. Instead, you can configure your services such that when they start up they register their network endpoints in a service registry, where they can then be found by other services.
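A service registry of this kind can be reduced to a very small interface. The service names and endpoints below are made up for illustration:

```python
# Minimal sketch of a service registry: services register their network
# endpoints on startup, and clients look them up by service name.
class ServiceRegistry:
    def __init__(self):
        self.endpoints = {}   # service name -> set of "host:port" strings

    def register(self, service, endpoint):
        self.endpoints.setdefault(service, set()).add(endpoint)

    def deregister(self, service, endpoint):
        self.endpoints.get(service, set()).discard(endpoint)

    def lookup(self, service):
        return sorted(self.endpoints.get(service, ()))

reg = ServiceRegistry()
reg.register("orders", "10.1.1.23:8080")
reg.register("orders", "10.1.1.24:8080")
assert reg.lookup("orders") == ["10.1.1.23:8080", "10.1.1.24:8080"]
reg.deregister("orders", "10.1.1.23:8080")
assert reg.lookup("orders") == ["10.1.1.24:8080"]
```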
However, it is less clear whether service discovery actually requires consensus. DNS is the traditional way of looking up the IP address for a service name, and it uses multiple layers of caching to achieve good performance and availability. Reads from DNS are absolutely not linearizable, and it is usually not considered problematic if the results from a DNS query are a little stale [ 109 ]. It is more important that DNS is reliably available and robust to network interruptions.
Although service discovery does not require consensus, leader election does. Thus, if your consensus system already knows who the leader is, then it can make sense to also use that information to help other services discover who the leader is. For this purpose, some consensus systems support read-only caching replicas. These replicas asynchronously receive the log of all decisions of the consensus algorithm, but do not actively participate in voting. They are therefore able to serve read requests that do not need to be linearizable.
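A read-only caching replica can be sketched as a state machine that applies whatever prefix of the decision log has reached it (a simplified illustration, not a real consensus implementation):

```python
# Sketch of a read-only caching replica: it applies the consensus log
# asynchronously (here, whatever prefix has been delivered to it) and
# serves reads that are allowed to be slightly stale. It never votes.
class CachingReplica:
    def __init__(self):
        self.state = {}
        self.applied = 0   # log position applied so far

    def catch_up(self, log):
        for key, value in log[self.applied:]:
            self.state[key] = value
        self.applied = len(log)

log = [("leader", "10.1.1.23")]
replica = CachingReplica()
replica.catch_up(log)
assert replica.state["leader"] == "10.1.1.23"

log.append(("leader", "10.1.1.99"))              # new decision, not yet delivered
assert replica.state["leader"] == "10.1.1.23"    # stale but acceptable read
replica.catch_up(log)
assert replica.state["leader"] == "10.1.1.99"
```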
Membership services
ZooKeeper and friends can be seen as part of a long history of research into membership services , which goes back to the 1980s and has been important for building highly reliable systems, e.g., for air traffic control [ 110 ].
A membership service determines which nodes are currently active and live members of a cluster. As we saw throughout Chapter 8 , due to unbounded network delays it’s not possible to reliably detect whether another node has failed. However, if you couple failure detection with consensus, nodes can come to an agreement about which nodes should be considered alive or not.
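As a loose illustration (not a real consensus algorithm), imagine each node votes with the set of peers it currently believes to be alive, and the agreed membership is whatever a majority of voters include:

```python
from collections import Counter

# Toy illustration only: nodes' local failure detectors may disagree,
# but by counting votes and keeping peers named by a majority, all
# nodes can adopt the same membership view.
def decide_membership(votes):
    # votes: one set per voting node, of peers it believes alive
    counts = Counter(peer for view in votes for peer in view)
    quorum = len(votes) // 2 + 1
    return {peer for peer, c in counts.items() if c >= quorum}

votes = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}]
# "c" is suspected by one detector but named by 2 of 3 voters, so it
# remains a member in the agreed view.
assert decide_membership(votes) == {"a", "b", "c"}
```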
It could still happen that a node is incorrectly declared dead by consensus, even though it is actually alive. But it is nevertheless very useful for a system to have agreement on which nodes constitute the current membership. For example, choosing a leader could mean simply choosing the lowest-numbered among the current members, but this approach would not work if different nodes have divergent opinions on who the current members are.
Summary
In this chapter we examined the topics of consistency and consensus from several different angles. We looked in depth at linearizability, a popular consistency model: its goal is to make replicated data appear as though there were only a single copy, and to make all operations act on it atomically. Although linearizability is appealing because it is easy to understand—it makes a database behave like a variable in a single-threaded program—it has the downside of being slow, especially in environments with large network delays.
We also explored causality, which imposes an ordering on events in a system (what happened before what, based on cause and effect). Unlike linearizability, which puts all operations in a single, totally ordered timeline, causality provides us with a weaker consistency model: some things can be concurrent, so the version history is like a timeline with branching and merging. Causal consistency does not have the coordination overhead of linearizability and is much less sensitive to network problems.
However, even if we capture the causal ordering (for example using Lamport timestamps), we saw that some things cannot be implemented this way: in “Timestamp ordering is not sufficient” we considered the example of ensuring that a username is unique and rejecting concurrent registrations for the same username. If one node is going to accept a registration, it needs to somehow know that another node isn’t concurrently in the process of registering the same name. This problem led us toward consensus .
We saw that achieving consensus means deciding something in such a way that all nodes agree on what was decided, and such that the decision is irrevocable. With some digging, it turns out that a wide range of problems are actually reducible to consensus and are equivalent to each other (in the sense that if you have a solution for one of them, you can easily transform it into a solution for one of the others). Such equivalent problems include:
- Linearizable compare-and-set registers
The register needs to atomically decide whether to set its value, based on whether its current value equals the parameter given in the operation.
- Atomic transaction commit
A database must decide whether to commit or abort a distributed transaction.
- Total order broadcast
The messaging system must decide on the order in which to deliver messages.
- Locks and leases
When several clients are racing to grab a lock or lease, the lock decides which one successfully acquired it.
- Membership/coordination service
Given a failure detector (e.g., timeouts), the system must decide which nodes are alive, and which should be considered dead because their sessions timed out.
- Uniqueness constraint
When several transactions concurrently try to create conflicting records with the same key, the constraint must decide which one to allow and which should fail with a constraint violation.
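For example, the uniqueness constraint reduces to a linearizable compare-and-set register per username (a toy in-process sketch; in a distributed setting, the atomic compare-and-set step is exactly where consensus is needed):

```python
# Toy illustration of reducing a uniqueness constraint to a
# compare-and-set register: each username maps to a register that is
# atomically set from None to an owner, so only one registration wins.
class CasRegister:
    def __init__(self, value=None):
        self.value = value

    def compare_and_set(self, expected, new):
        # In a real system, this atomic step requires consensus.
        if self.value == expected:
            self.value = new
            return True
        return False

usernames = {}

def register(name, owner):
    reg = usernames.setdefault(name, CasRegister())
    return reg.compare_and_set(None, owner)

assert register("alice", "client-1")       # first registration succeeds
assert not register("alice", "client-2")   # concurrent duplicate is rejected
```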
All of these are straightforward if you only have a single node, or if you are willing to assign the decision-making capability to a single node. This is what happens in a single-leader database: all the power to make decisions is vested in the leader, which is why such databases are able to provide linearizable operations, uniqueness constraints, a totally ordered replication log, and more.
However, if that single leader fails, or if a network interruption makes the leader unreachable, such a system becomes unable to make any progress. There are three ways of handling that situation:
- Wait for the leader to recover, and accept that the system will be blocked in the meantime. Many XA/JTA transaction coordinators choose this option. This approach does not fully solve consensus because it does not satisfy the termination property: if the leader does not recover, the system can be blocked forever.
- Manually fail over by getting humans to choose a new leader node and reconfigure the system to use it. Many relational databases take this approach. It is a kind of consensus by “act of God”: the human operator, outside of the computer system, makes the decision. The speed of failover is limited by the speed at which humans can act, which is generally slower than computers.
- Use an algorithm to automatically choose a new leader. This approach requires a consensus algorithm, and it is advisable to use a proven algorithm that correctly handles adverse network conditions [ 107 ].
Although a single-leader database can provide linearizability without executing a consensus algorithm on every write, it still requires consensus to maintain its leadership and for leadership changes. Thus, in some sense, having a leader only “kicks the can down the road”: consensus is still required, only in a different place, and less frequently. The good news is that fault-tolerant algorithms and systems for consensus exist, and we briefly discussed them in this chapter.
Tools like ZooKeeper play an important role in providing an “outsourced” consensus, failure detection, and membership service that applications can use. It’s not easy to use, but it is much better than trying to develop your own algorithms that can withstand all the problems discussed in Chapter 8 . If you find yourself wanting to do one of those things that is reducible to consensus, and you want it to be fault-tolerant, then it is advisable to use something like ZooKeeper.
Nevertheless, not every system necessarily requires consensus: for example, leaderless and multi-leader replication systems typically do not use global consensus. The conflicts that occur in these systems (see “Handling Write Conflicts” ) are a consequence of not having consensus across different leaders, but maybe that’s okay: maybe we simply need to cope without linearizability and learn to work better with data that has branching and merging version histories.
This chapter referenced a large body of research on the theory of distributed systems. Although the theoretical papers and proofs are not always easy to understand, and sometimes make unrealistic assumptions, they are incredibly valuable for informing practical work in this field: they help us reason about what can and cannot be done, and help us find the counterintuitive ways in which distributed systems are often flawed. If you have the time, the references are well worth exploring.
This brings us to the end of Part II of this book, in which we covered replication ( Chapter 5 ), partitioning ( Chapter 6 ), transactions ( Chapter 7 ), distributed system failure models ( Chapter 8 ), and finally consistency and consensus ( Chapter 9 ). Now that we have laid a firm foundation of theory, in Part III we will turn once again to more practical systems, and discuss how to build powerful applications from heterogeneous building blocks.
Footnotes
i A subtle detail of this diagram is that it assumes the existence of a global clock, represented by the horizontal axis. Even though real systems typically don’t have accurate clocks (see “Unreliable Clocks” ), this assumption is okay: for the purposes of analyzing a distributed algorithm, we may pretend that an accurate global clock exists, as long as the algorithm doesn’t have access to it [ 47 ]. Instead, the algorithm can only see a mangled approximation of real time, as produced by a quartz oscillator and NTP.
ii A register in which reads may return either the old or the new value if they are concurrent with a write is known as a regular register [ 7 , 25 ].
iii Strictly speaking, ZooKeeper and etcd provide linearizable writes, but reads may be stale, since by default they can be served by any one of the replicas. You can optionally request a linearizable read: etcd calls this a quorum read [ 16 ], and in ZooKeeper you need to call sync() before the read [ 15 ]; see “Implementing linearizable storage using total order broadcast” .
iv Partitioning (sharding) a single-leader database, so that there is a separate leader per partition, does not affect linearizability, since it is only a single-object guarantee. Cross-partition transactions are a different matter (see “Distributed Transactions and Consensus” ).
v These two choices are sometimes known as CP (consistent but not available under network partitions) and AP (available but not consistent under network partitions), respectively. However, this classification scheme has several flaws [ 9 ], so it is best avoided.
vi As discussed in “Network Faults in Practice” , this book uses partitioning to refer to deliberately breaking down a large dataset into smaller ones ( sharding ; see Chapter 6 ). By contrast, a network partition is a particular type of network fault, which we normally don’t consider separately from other kinds of faults. However, since it’s the P in CAP, we can’t avoid the confusion in this case.
vii A total order that is inconsistent with causality is easy to create, but not very useful. For example, you can generate a random UUID for each operation, and compare UUIDs lexicographically to define the total ordering of operations. This is a valid total order, but the random UUIDs tell you nothing about which operation actually happened first, or whether the operations were concurrent.
viii It is possible to make physical clock timestamps consistent with causality: in “Synchronized clocks for global snapshots” we discussed Google’s Spanner, which estimates the expected clock skew and waits out the uncertainty interval before committing a write. This method ensures that a causally later transaction is given a greater timestamp. However, most clocks cannot provide the required uncertainty metric.
ix The term atomic broadcast is traditional, but it is very confusing as it’s inconsistent with other uses of the word atomic : it has nothing to do with atomicity in ACID transactions and is only indirectly related to atomic operations (in the sense of multi-threaded programming) or atomic registers (linearizable storage). The term total order multicast is another synonym.
x In a formal sense, a linearizable read-write register is an “easier” problem. Total order broadcast is equivalent to consensus [ 67 ], which has no deterministic solution in the asynchronous crash-stop model [ 68 ], whereas a linearizable read-write register can be implemented in the same system model [ 23 , 24 , 25 ]. However, supporting atomic operations such as compare-and-set or increment-and-get in a register makes it equivalent to consensus [ 28 ]. Thus, the problems of consensus and a linearizable register are closely related.
xi If you don’t wait, but acknowledge the write immediately after it has been enqueued, you get something similar to the memory consistency model of multi-core x86 processors [ 43 ]. That model is neither linearizable nor sequentially consistent.
xii Atomic commit is formalized slightly differently from consensus: an atomic transaction can commit only if all participants vote to commit, and must abort if any participant needs to abort. Consensus is allowed to decide on any value that is proposed by one of the participants. However, atomic commit and consensus are reducible to each other [ 70 , 71 ]. Nonblocking atomic commit is harder than consensus—see “Three-phase commit” .
xiii This particular variant of consensus is called uniform consensus , which is equivalent to regular consensus in asynchronous systems with unreliable failure detectors [ 71 ]. The academic literature usually refers to processes rather than nodes , but we use nodes here for consistency with the rest of this book.
References
[ 1 ] Peter Bailis and Ali Ghodsi: “ Eventual Consistency Today: Limitations, Extensions, and Beyond ,” ACM Queue , volume 11, number 3, pages 55-63, March 2013. doi:10.1145/2460276.2462076
[ 2 ] Prince Mahajan, Lorenzo Alvisi, and Mike Dahlin: “ Consistency, Availability, and Convergence ,” University of Texas at Austin, Department of Computer Science, Tech Report UTCS TR-11-22, May 2011.
[ 3 ] Alex Scotti: “ Adventures in Building Your Own Database ,” at All Your Base , November 2015.
[ 4 ] Peter Bailis, Aaron Davidson, Alan Fekete, et al.: “ Highly Available Transactions: Virtues and Limitations ,” at 40th International Conference on Very Large Data Bases (VLDB), September 2014. Extended version published as pre-print arXiv:1302.0309 [cs.DB].
[ 5 ] Paolo Viotti and Marko Vukolić: “ Consistency in Non-Transactional Distributed Storage Systems ,” arXiv:1512.00168, 12 April 2016.
[ 6 ] Maurice P. Herlihy and Jeannette M. Wing: “ Linearizability: A Correctness Condition for Concurrent Objects ,” ACM Transactions on Programming Languages and Systems (TOPLAS), volume 12, number 3, pages 463–492, July 1990. doi:10.1145/78969.78972
[ 7 ] Leslie Lamport: “ On interprocess communication ,” Distributed Computing , volume 1, number 2, pages 77–101, June 1986. doi:10.1007/BF01786228
[ 8 ] David K. Gifford: “ Information Storage in a Decentralized Computer System ,” Xerox Palo Alto Research Centers, CSL-81-8, June 1981.
[ 9 ] Martin Kleppmann: “ Please Stop Calling Databases CP or AP ,” martin.kleppmann.com , May 11, 2015.
[ 10 ] Kyle Kingsbury: “ Call Me Maybe: MongoDB Stale Reads ,” aphyr.com , April 20, 2015.
[ 11 ] Kyle Kingsbury: “ Computational Techniques in Knossos ,” aphyr.com , May 17, 2014.
[ 12 ] Peter Bailis: “ Linearizability Versus Serializability ,” bailis.org , September 24, 2014.
[ 13 ] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman: Concurrency Control and Recovery in Database Systems . Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at research.microsoft.com .
[ 14 ] Mike Burrows: “ The Chubby Lock Service for Loosely-Coupled Distributed Systems ,” at 7th USENIX Symposium on Operating System Design and Implementation (OSDI), November 2006.
[ 15 ] Flavio P. Junqueira and Benjamin Reed: ZooKeeper: Distributed Process Coordination . O’Reilly Media, 2013. ISBN: 978-1-449-36130-3
[ 16 ] “ etcd 2.0.12 Documentation ,” CoreOS, Inc., 2015.
[ 17 ] “ Apache Curator ,” Apache Software Foundation, curator.apache.org , 2015.
[ 18 ] Murali Vallath: Oracle 10g RAC Grid, Services & Clustering . Elsevier Digital Press, 2006. ISBN: 978-1-555-58321-7
[ 19 ] Peter Bailis, Alan Fekete, Michael J Franklin, et al.: “ Coordination-Avoiding Database Systems ,” Proceedings of the VLDB Endowment , volume 8, number 3, pages 185–196, November 2014.
[ 20 ] Kyle Kingsbury: “ Call Me Maybe: etcd and Consul ,” aphyr.com , June 9, 2014.
[ 21 ] Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini: “ Zab: High-Performance Broadcast for Primary-Backup Systems ,” at 41st IEEE International Conference on Dependable Systems and Networks (DSN), June 2011. doi:10.1109/DSN.2011.5958223
[ 22 ] Diego Ongaro and John K. Ousterhout: “ In Search of an Understandable Consensus Algorithm (Extended Version) ,” at USENIX Annual Technical Conference (ATC), June 2014.
[ 23 ] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev: “ Sharing Memory Robustly in Message-Passing Systems ,” Journal of the ACM , volume 42, number 1, pages 124–142, January 1995. doi:10.1145/200836.200869
[ 24 ] Nancy Lynch and Alex Shvartsman: “ Robust Emulation of Shared Memory Using Dynamic Quorum-Acknowledged Broadcasts ,” at 27th Annual International Symposium on Fault-Tolerant Computing (FTCS), June 1997. doi:10.1109/FTCS.1997.614100
[ 25 ] Christian Cachin, Rachid Guerraoui, and Luís Rodrigues: Introduction to Reliable and Secure Distributed Programming , 2nd edition. Springer, 2011. ISBN: 978-3-642-15259-7, doi:10.1007/978-3-642-15260-3
[ 26 ] Sam Elliott, Mark Allen, and Martin Kleppmann: personal communication , thread on twitter.com , October 15, 2015.
[ 27 ] Niklas Ekström, Mikhail Panchenko, and Jonathan Ellis: “ Possible Issue with Read Repair? ,” email thread on cassandra-dev mailing list, October 2012.
[ 28 ] Maurice P. Herlihy: “ Wait-Free Synchronization ,” ACM Transactions on Programming Languages and Systems (TOPLAS), volume 13, number 1, pages 124–149, January 1991. doi:10.1145/114005.102808
[ 29 ] Armando Fox and Eric A. Brewer: “ Harvest, Yield, and Scalable Tolerant Systems ,” at 7th Workshop on Hot Topics in Operating Systems (HotOS), March 1999. doi:10.1109/HOTOS.1999.798396
[ 30 ] Seth Gilbert and Nancy Lynch: “ Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services ,” ACM SIGACT News , volume 33, number 2, pages 51–59, June 2002. doi:10.1145/564585.564601
[ 31 ] Seth Gilbert and Nancy Lynch: “ Perspectives on the CAP Theorem ,” IEEE Computer Magazine , volume 45, number 2, pages 30–36, February 2012. doi:10.1109/MC.2011.389
[ 32 ] Eric A. Brewer: “ CAP Twelve Years Later: How the ‘Rules’ Have Changed ,” IEEE Computer Magazine , volume 45, number 2, pages 23–29, February 2012. doi:10.1109/MC.2012.37
[ 33 ] Susan B. Davidson, Hector Garcia-Molina, and Dale Skeen: “ Consistency in Partitioned Networks ,” ACM Computing Surveys , volume 17, number 3, pages 341–370, September 1985. doi:10.1145/5505.5508
[33] Susan B. Davidson、Hector Garcia-Molina和Dale Skeen:“分区网络中的一致性”,ACM Computing Surveys,第17卷,第3期,页341-370,1985年9月。doi:10.1145/5505.5508。
[ 34 ] Paul R. Johnson and Robert H. Thomas: “ RFC 677: The Maintenance of Duplicate Databases ,” Network Working Group, January 27, 1975.
[34] 保罗·R·约翰逊和罗伯特·H·托马斯: “RFC 677:重复数据库的维护,”网络工作组,1975年1月27日。
[ 35 ] Bruce G. Lindsay, Patricia Griffiths Selinger, C. Galtieri, et al.: “ Notes on Distributed Databases ,” IBM Research, Research Report RJ2571(33471), July 1979.
[35] Bruce G. Lindsay、Patricia Griffiths Selinger、C. Galtieri等:“分布式数据库笔记”,IBM研究院,研究报告RJ2571(33471),1979年7月。
[ 36 ] Michael J. Fischer and Alan Michael: “ Sacrificing Serializability to Attain High Availability of Data in an Unreliable Network ,” at 1st ACM Symposium on Principles of Database Systems (PODS), March 1982. doi:10.1145/588111.588124
[36] Michael J. Fischer和Alan Michael:“牺牲可串行化以在不可靠网络中获得数据的高可用性”,发表于1982年3月第一届ACM数据库系统原理研讨会(PODS)。doi:10.1145/588111.588124。
[ 37 ] Eric A. Brewer: “ NoSQL: Past, Present, Future ,” at QCon San Francisco , November 2012.
[37] Eric A. Brewer:“NoSQL:过去、现在与未来”,2012年11月发表于旧金山QCon大会。
[ 38 ] Henry Robinson: “ CAP Confusion: Problems with ‘Partition Tolerance,’ ” blog.cloudera.com , April 26, 2010.
[38] 亨利·罗宾逊: “CAP 混淆: ‘分区容错性’ 的问题”,blog.cloudera.com,2010年4月26日。
[ 39 ] Adrian Cockcroft: “ Migrating to Microservices ,” at QCon London , March 2014.
[39] Adrian Cockcroft:《迁移到微服务》,2014年3月在伦敦QCon上。
[ 40 ] Martin Kleppmann: “ A Critique of the CAP Theorem ,” arXiv:1509.05393, September 17, 2015.
[40] Martin Kleppmann:“CAP定理批判”,arXiv:1509.05393,2015年9月17日。
[ 41 ] Nancy A. Lynch: “ A Hundred Impossibility Proofs for Distributed Computing ,” at 8th ACM Symposium on Principles of Distributed Computing (PODC), August 1989. doi:10.1145/72981.72982
[41] Nancy A. Lynch:“分布式计算的一百个不可能性证明”,发表于1989年8月第8届ACM分布式计算原理研讨会(PODC)。doi:10.1145/72981.72982。
[ 42 ] Hagit Attiya, Faith Ellen, and Adam Morrison: “ Limitations of Highly-Available Eventually-Consistent Data Stores ,” at ACM Symposium on Principles of Distributed Computing (PODC), July 2015. doi:10.1145/2767386.2767419
[42] Hagit Attiya、Faith Ellen和Adam Morrison:“高可用最终一致数据存储的局限性”,发表于2015年7月ACM分布式计算原理研讨会(PODC)。doi:10.1145/2767386.2767419。
[ 43 ] Peter Sewell, Susmit Sarkar, Scott Owens, et al.: “ x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors ,” Communications of the ACM , volume 53, number 7, pages 89–97, July 2010. doi:10.1145/1785414.1785443
[43] Peter Sewell、Susmit Sarkar、Scott Owens等:“x86-TSO:严谨且可用的x86多处理器程序员模型”,《ACM通讯》,第53卷,第7期,第89-97页,2010年7月。doi:10.1145/1785414.1785443。
[ 44 ] Martin Thompson: “ Memory Barriers/Fences ,” mechanical-sympathy.blogspot.co.uk , July 24, 2011.
[44] Martin Thompson:“内存屏障/栅栏”,mechanical-sympathy.blogspot.co.uk,2011年7月24日。
[ 45 ] Ulrich Drepper: “ What Every Programmer Should Know About Memory ,” akkadia.org , November 21, 2007.
[45] Ulrich Drepper:“每个程序员都应该了解的内存知识”,akkadia.org,2007年11月21日。
[ 46 ] Daniel J. Abadi: “ Consistency Tradeoffs in Modern Distributed Database System Design ,” IEEE Computer Magazine , volume 45, number 2, pages 37–42, February 2012. doi:10.1109/MC.2012.33
[46] Daniel J. Abadi: "现代分布式数据库系统设计中的一致性权衡",IEEE 计算机杂志, 第 45 卷,第 2 期,页码为 37–42,2012 年 2 月。 doi:10.1109/MC.2012.33
[ 47 ] Hagit Attiya and Jennifer L. Welch: “ Sequential Consistency Versus Linearizability ,” ACM Transactions on Computer Systems (TOCS), volume 12, number 2, pages 91–122, May 1994. doi:10.1145/176575.176576
[47] Hagit Attiya和Jennifer L. Welch:“顺序一致性与线性一致性”,ACM计算机系统汇刊(TOCS),第12卷,第2期,第91-122页,1994年5月。doi:10.1145/176575.176576。
[ 48 ] Mustaque Ahamad, Gil Neiger, James E. Burns, et al.: “ Causal Memory: Definitions, Implementation, and Programming ,” Distributed Computing , volume 9, number 1, pages 37–49, March 1995. doi:10.1007/BF01784241
[48] Mustaque Ahamad, Gil Neiger, James E. Burns等人:“因果存储器:定义、实现和编程”,《分布式计算》杂志,1995年3月,第9卷第1期,页码37-49。doi:10.1007/BF01784241。
[ 49 ] Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, and David G. Andersen: “ Stronger Semantics for Low-Latency Geo-Replicated Storage ,” at 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI), April 2013.
[49] Wyatt Lloyd、Michael J. Freedman、Michael Kaminsky和David G. Andersen:“低延迟地理复制存储的更强语义”,发表于2013年4月第10届USENIX网络系统设计与实现研讨会(NSDI)。
[ 50 ] Marek Zawirski, Annette Bieniusa, Valter Balegas, et al.: “ SwiftCloud: Fault-Tolerant Geo-Replication Integrated All the Way to the Client Machine ,” INRIA Research Report 8347, August 2013.
[50] Marek Zawirski、Annette Bieniusa、Valter Balegas等:“SwiftCloud:一直集成到客户端机器的容错地理复制”,INRIA研究报告8347,2013年8月。
[ 51 ] Peter Bailis, Ali Ghodsi, Joseph M Hellerstein, and Ion Stoica: “ Bolt-on Causal Consistency ,” at ACM International Conference on Management of Data (SIGMOD), June 2013.
[51] Peter Bailis,Ali Ghodsi,Joseph M Hellerstein和Ion Stoica:“附加因果一致性”,在ACM数据管理国际会议(SIGMOD)上,2013年6月。
[ 52 ] Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “ Challenges to Adopting Stronger Consistency at Scale ,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.
[52] Philippe Ajoux、Nathan Bronson、Sanjeev Kumar等:“大规模采用更强一致性面临的挑战”,发表于2015年5月第15届USENIX操作系统热点话题研讨会(HotOS)。
[ 53 ] Peter Bailis: “ Causality Is Expensive (and What to Do About It) ,” bailis.org , February 5, 2014.
[53] Peter Bailis:“因果关系很昂贵(及如何应对)”,bailis.org,2014年2月5日。
[ 54 ] Ricardo Gonçalves, Paulo Sérgio Almeida, Carlos Baquero, and Victor Fonte: “ Concise Server-Wide Causality Management for Eventually Consistent Data Stores ,” at 15th IFIP International Conference on Distributed Applications and Interoperable Systems (DAIS), June 2015. doi:10.1007/978-3-319-19129-4_6
[54] Ricardo Gonçalves、Paulo Sérgio Almeida、Carlos Baquero和Victor Fonte:“面向最终一致数据存储的简洁全服务器因果关系管理”,发表于2015年6月第15届IFIP分布式应用与可互操作系统国际会议(DAIS)。doi:10.1007/978-3-319-19129-4_6。
[ 55 ] Rob Conery: “ A Better ID Generator for PostgreSQL ,” rob.conery.io , May 29, 2014.
[55] Rob Conery:“一个更好的PostgreSQL ID生成器”,rob.conery.io,2014年5月29日。
[ 56 ] Leslie Lamport: “ Time, Clocks, and the Ordering of Events in a Distributed System ,” Communications of the ACM , volume 21, number 7, pages 558–565, July 1978. doi:10.1145/359545.359563
[56] Leslie Lamport:“时间、时钟与分布式系统中的事件排序”,《ACM通讯》,第21卷,第7期,第558-565页,1978年7月。doi:10.1145/359545.359563。
[ 57 ] Xavier Défago, André Schiper, and Péter Urbán: “ Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey ,” ACM Computing Surveys , volume 36, number 4, pages 372–421, December 2004. doi:10.1145/1041680.1041682
[57] Xavier Défago、André Schiper和Péter Urbán:“全序广播与组播算法:分类与综述”,ACM Computing Surveys,第36卷,第4期,第372-421页,2004年12月。doi:10.1145/1041680.1041682。
[ 58 ] Hagit Attiya and Jennifer Welch: Distributed Computing: Fundamentals, Simulations and Advanced Topics , 2nd edition. John Wiley & Sons, 2004. ISBN: 978-0-471-45324-6, doi:10.1002/0471478210
[58] Hagit Attiya和Jennifer Welch:分布式计算:基础,仿真和高级主题,第2版。 John Wiley&Sons,2004年。 ISBN:978-0-471-45324-6,doi:10.1002/0471478210
[ 59 ] Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, et al.: “ CORFU: A Shared Log Design for Flash Clusters ,” at 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI), April 2012.
[59] Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran等: “CORFU:闪存集群的共享日志设计”,发表于2012年4月第9届USENIX 网络系统设计和实现研讨会(NSDI).
[ 60 ] Fred B. Schneider: “ Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial ,” ACM Computing Surveys , volume 22, number 4, pages 299–319, December 1990.
[60] Fred B. Schneider:“使用状态机方法实现容错服务:教程”,ACM Computing Surveys,第22卷,第4期,第299-319页,1990年12月。
[ 61 ] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, et al.: “ Calvin: Fast Distributed Transactions for Partitioned Database Systems ,” at ACM International Conference on Management of Data (SIGMOD), May 2012.
[61] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng等人: “Calvin:用于分区数据库系统的快速分布式事务”,发表于ACM数据管理国际会议(SIGMOD),2012年5月。
[ 62 ] Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, et al.: “ Tango: Distributed Data Structures over a Shared Log ,” at 24th ACM Symposium on Operating Systems Principles (SOSP), November 2013. doi:10.1145/2517349.2522732
[62] Mahesh Balakrishnan、Dahlia Malkhi、Ted Wobber等:“Tango:基于共享日志的分布式数据结构”,发表于2013年11月第24届ACM操作系统原理研讨会(SOSP)。doi:10.1145/2517349.2522732。
[ 63 ] Robbert van Renesse and Fred B. Schneider: “ Chain Replication for Supporting High Throughput and Availability ,” at 6th USENIX Symposium on Operating System Design and Implementation (OSDI), December 2004.
[63] Robbert van Renesse和Fred B. Schneider: “链式复制以支持高吞吐量和可用性”,发表于2004年12月的第六届USENIX操作系统设计和实现研讨会(OSDI)。
[ 64 ] Leslie Lamport: “ How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs ,” IEEE Transactions on Computers , volume 28, number 9, pages 690–691, September 1979. doi:10.1109/TC.1979.1675439
[64] 莱斯利·兰波特: “如何制造一个能正确执行多进程程序的多处理器计算机”,IEEE 计算机学报,第 28 卷,第 9 期,第 690-691 页,1979 年 9 月。 doi:10.1109/TC.1979.1675439。
[ 65 ] Enis Söztutar, Devaraj Das, and Carter Shanklin: “ Apache HBase High Availability at the Next Level ,” hortonworks.com , January 22, 2015.
[65] Enis Söztutar、Devaraj Das和Carter Shanklin:“Apache HBase高可用性更上一层楼”,hortonworks.com,2015年1月22日。
[ 66 ] Brian F Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, et al.: “ PNUTS: Yahoo!’s Hosted Data Serving Platform ,” at 34th International Conference on Very Large Data Bases (VLDB), August 2008. doi:10.14778/1454159.1454167
[66] Brian F Cooper、Raghu Ramakrishnan、Utkarsh Srivastava等:“PNUTS:Yahoo!的托管数据服务平台”,发表于2008年8月第34届超大规模数据库国际会议(VLDB)。doi:10.14778/1454159.1454167。
[ 67 ] Tushar Deepak Chandra and Sam Toueg: “ Unreliable Failure Detectors for Reliable Distributed Systems ,” Journal of the ACM , volume 43, number 2, pages 225–267, March 1996. doi:10.1145/226643.226647
[67] Tushar Deepak Chandra 和 Sam Toueg: “可靠分布式系统的不可靠故障检测器”,ACM杂志,第43卷,第2期,页码225-267,1996年3月。 doi:10.1145/226643.226647
[ 68 ] Michael J. Fischer, Nancy Lynch, and Michael S. Paterson: “ Impossibility of Distributed Consensus with One Faulty Process ,” Journal of the ACM , volume 32, number 2, pages 374–382, April 1985. doi:10.1145/3149.214121
[68] 迈克尔·J·费舍尔,南希·林奇和迈克尔·S·帕特森:《在一个故障进程的情况下无法实现分布式共识》,ACM期刊,第32卷,第2期,1985年4月,374-382页。doi:10.1145/3149.214121。
[ 69 ] Michael Ben-Or: “Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols,” at 2nd ACM Symposium on Principles of Distributed Computing (PODC), August 1983. doi:10.1145/800221.806707
[69] Michael Ben-Or:“自由选择的另一个优势:完全异步的协商协议”,发表于1983年8月第二届ACM分布式计算原理研讨会(PODC)。doi:10.1145/800221.806707。
[ 70 ] Jim N. Gray and Leslie Lamport: “ Consensus on Transaction Commit ,” ACM Transactions on Database Systems (TODS), volume 31, number 1, pages 133–160, March 2006. doi:10.1145/1132863.1132867
[70] Jim N. Gray和Leslie Lamport:“事务提交上的共识”,ACM数据库系统汇刊(TODS),第31卷,第1期,第133-160页,2006年3月。doi:10.1145/1132863.1132867。
[ 71 ] Rachid Guerraoui: “ Revisiting the Relationship Between Non-Blocking Atomic Commitment and Consensus ,” at 9th International Workshop on Distributed Algorithms (WDAG), September 1995. doi:10.1007/BFb0022140
[71] Rachid Guerraoui:“重新审视非阻塞原子提交与共识之间的关系”,发表于1995年9月第9届国际分布式算法研讨会(WDAG)。doi:10.1007/BFb0022140。
[ 72 ] Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, et al.: “ All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications ,” at 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2014.
[72] Thanumalayan Sankaranarayana Pillai、Vijay Chidambaram、Ramnatthan Alagappan等:“并非所有文件系统生而平等:论构建崩溃一致性应用程序的复杂性”,发表于2014年10月第11届USENIX操作系统设计与实现研讨会(OSDI)。
[ 73 ] Jim Gray: “ The Transaction Concept: Virtues and Limitations ,” at 7th International Conference on Very Large Data Bases (VLDB), September 1981.
[73] Jim Gray:“事务概念:优点与局限”,发表于1981年9月第七届超大规模数据库国际会议(VLDB)。
[ 74 ] Hector Garcia-Molina and Kenneth Salem: “ Sagas ,” at ACM International Conference on Management of Data (SIGMOD), May 1987. doi:10.1145/38713.38742
[74] Hector Garcia-Molina和Kenneth Salem: “Sagas”,于1987年5月在ACM国际数据管理会议(SIGMOD)上发表。 doi:10.1145/38713.38742
[ 75 ] C. Mohan, Bruce G. Lindsay, and Ron Obermarck: “ Transaction Management in the R* Distributed Database Management System ,” ACM Transactions on Database Systems , volume 11, number 4, pages 378–396, December 1986. doi:10.1145/7239.7266
[75] C. Mohan,Bruce G. Lindsay和Ron Obermarck:“R*分布式数据库管理系统中的事务管理”,ACM Transactions on Database Systems,卷11,号4,页378-396,1986年12月。 doi:10.1145 / 7239.7266
[ 76 ] “ Distributed Transaction Processing: The XA Specification ,” X/Open Company Ltd., Technical Standard XO/CAE/91/300, December 1991. ISBN: 978-1-872-63024-3
[76] “分布式事务处理:XA规范”,X/Open Company Ltd.,技术标准XO/CAE/91/300,1991年12月。ISBN:978-1-872-63024-3。
[ 77 ] Mike Spille: “ XA Exposed, Part II ,” jroller.com , April 3, 2004.
[77] Mike Spille:“XA揭秘,第二部分”,jroller.com,2004年4月3日。
[ 78 ] Ivan Silva Neto and Francisco Reverbel: “ Lessons Learned from Implementing WS-Coordination and WS-AtomicTransaction ,” at 7th IEEE/ACIS International Conference on Computer and Information Science (ICIS), May 2008. doi:10.1109/ICIS.2008.75
[78] Ivan Silva Neto和Francisco Reverbel:“实现WS-Coordination与WS-AtomicTransaction的经验教训”,发表于2008年5月第七届IEEE/ACIS计算机与信息科学国际会议(ICIS)。doi:10.1109/ICIS.2008.75。
[ 79 ] James E. Johnson, David E. Langworthy, Leslie Lamport, and Friedrich H. Vogt: “ Formal Specification of a Web Services Protocol ,” at 1st International Workshop on Web Services and Formal Methods (WS-FM), February 2004. doi:10.1016/j.entcs.2004.02.022
[79] James E. Johnson、David E. Langworthy、Leslie Lamport和Friedrich H. Vogt:"Web服务协议的形式化规范",发表于第一届Web服务和形式化方法研讨会(WS-FM),2004年2月。doi:10.1016/j.entcs.2004.02.022。
[ 80 ] Dale Skeen: “ Nonblocking Commit Protocols ,” at ACM International Conference on Management of Data (SIGMOD), April 1981. doi:10.1145/582318.582339
[80] Dale Skeen: “非阻塞提交协议”,发表于1981年4月ACM国际数据管理会议(SIGMOD),doi:10.1145/582318.582339。
[ 81 ] Gregor Hohpe: “ Your Coffee Shop Doesn’t Use Two-Phase Commit ,” IEEE Software , volume 22, number 2, pages 64–66, March 2005. doi:10.1109/MS.2005.52
[81] Gregor Hohpe:“你的咖啡店不使用两阶段提交”,IEEE Software,第22卷,第2期,第64-66页,2005年3月。doi:10.1109/MS.2005.52。
[ 82 ] Pat Helland: “ Life Beyond Distributed Transactions: An Apostate’s Opinion ,” at 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 2007.
[82] Pat Helland:“超越分布式事务:一个叛道者的观点”,发表于2007年1月第三届创新数据系统研究双年会(CIDR)。
[ 83 ] Jonathan Oliver: “ My Beef with MSDTC and Two-Phase Commits ,” blog.jonathanoliver.com , April 4, 2011.
[83] Jonathan Oliver:“我对MSDTC和两阶段提交的不满”,blog.jonathanoliver.com,2011年4月4日。
[ 84 ] Oren Eini (Ahende Rahien): “ The Fallacy of Distributed Transactions ,” ayende.com , July 17, 2014.
[84] Oren Eini (Ahende Rahien): “分布式事务的谬论”,ayende.com,2014年7月17日。
[ 85 ] Clemens Vasters: “ Transactions in Windows Azure (with Service Bus) – An Email Discussion ,” vasters.com , July 30, 2012.
[85] Clemens Vasters: “在Windows Azure中进行事务处理(使用Service Bus)-电子邮件讨论”,vasters.com,2012年7月30日。
[ 86 ] “ Understanding Transactionality in Azure ,” NServiceBus Documentation, Particular Software, 2015.
[86] “理解Azure中的事务性”,NServiceBus文档,Particular Software,2015年。
[ 87 ] Randy Wigginton, Ryan Lowe, Marcos Albe, and Fernando Ipar: “ Distributed Transactions in MySQL ,” at MySQL Conference and Expo , April 2013.
[87] Randy Wigginton, Ryan Lowe, Marcos Albe和Fernando Ipar:“MySQL中的分布式事务”,2013年4月在MySQL会议和博览会上。
[ 88 ] Mike Spille: “ XA Exposed, Part I ,” jroller.com , April 3, 2004.
[88] Mike Spille:“XA揭秘,第一部分”,jroller.com,2004年4月3日。
[ 89 ] Ajmer Dhariwal: “ Orphaned MSDTC Transactions (-2 spids) ,” eraofdata.com , December 12, 2008.
[89] Ajmer Dhariwal:“孤立的MSDTC事务(-2 spids)”,eraofdata.com,2008年12月12日。
[ 90 ] Paul Randal: “ Real World Story of DBCC PAGE Saving the Day ,” sqlskills.com , June 19, 2013.
[90] Paul Randal:“DBCC PAGE力挽狂澜的真实故事”,sqlskills.com,2013年6月19日。
[ 91 ] “ in-doubt xact resolution Server Configuration Option ,” SQL Server 2016 documentation, Microsoft, Inc., 2016.
[91] “in-doubt xact resolution服务器配置选项”,SQL Server 2016文档,Microsoft, Inc.,2016年。
[ 92 ] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer: “ Consensus in the Presence of Partial Synchrony ,” Journal of the ACM , volume 35, number 2, pages 288–323, April 1988. doi:10.1145/42282.42283
[92] Cynthia Dwork、Nancy Lynch和Larry Stockmeyer:“部分同步下的共识”,ACM期刊,第35卷,第2期,第288-323页,1988年4月。doi:10.1145/42282.42283。
[ 93 ] Miguel Castro and Barbara H. Liskov: “ Practical Byzantine Fault Tolerance and Proactive Recovery ,” ACM Transactions on Computer Systems , volume 20, number 4, pages 396–461, November 2002. doi:10.1145/571637.571640
[93] Miguel Castro和Barbara H. Liskov:“实用拜占庭容错与主动恢复”,ACM计算机系统汇刊,第20卷,第4期,第396-461页,2002年11月。doi:10.1145/571637.571640。
[ 94 ] Brian M. Oki and Barbara H. Liskov: “ Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems ,” at 7th ACM Symposium on Principles of Distributed Computing (PODC), August 1988. doi:10.1145/62546.62549
[94] Brian M. Oki和Barbara H. Liskov:“Viewstamped Replication:一种新的主副本方法,支持高可用性分布式系统”,收录于1988年8月第7届ACM分布式计算原理研讨会(PODC)。doi:10.1145/62546.62549。
[ 95 ] Barbara H. Liskov and James Cowling: “ Viewstamped Replication Revisited ,” Massachusetts Institute of Technology, Tech Report MIT-CSAIL-TR-2012-021, July 2012.
[95] Barbara H. Liskov和James Cowling:“再探Viewstamped Replication”,麻省理工学院,技术报告MIT-CSAIL-TR-2012-021,2012年7月。
[ 96 ] Leslie Lamport: “ The Part-Time Parliament ,” ACM Transactions on Computer Systems , volume 16, number 2, pages 133–169, May 1998. doi:10.1145/279227.279229
[96] Leslie Lamport:“兼职议会”,ACM计算机系统汇刊,第16卷,第2期,第133-169页,1998年5月。doi:10.1145/279227.279229。
[ 97 ] Leslie Lamport: “ Paxos Made Simple ,” ACM SIGACT News , volume 32, number 4, pages 51–58, December 2001.
[97] Leslie Lamport:“Paxos Made Simple”,ACM SIGACT News,第32卷,第4期,第51-58页,2001年12月。
[ 98 ] Tushar Deepak Chandra, Robert Griesemer, and Joshua Redstone: “ Paxos Made Live – An Engineering Perspective ,” at 26th ACM Symposium on Principles of Distributed Computing (PODC), June 2007.
[98] Tushar Deepak Chandra、Robert Griesemer和Joshua Redstone:“Paxos实战:工程视角”,发表于2007年6月第26届ACM分布式计算原理研讨会(PODC)。
[ 99 ] Robbert van Renesse: “ Paxos Made Moderately Complex ,” cs.cornell.edu , March 2011.
[99] Robbert van Renesse:“Paxos Made Moderately Complex”,cs.cornell.edu,2011年3月。
[ 100 ] Diego Ongaro: “ Consensus: Bridging Theory and Practice ,” PhD Thesis, Stanford University, August 2014.
[100]迭戈·翁加罗:《共识:理论与实践的桥梁》博士论文,斯坦福大学,2014年8月。
[ 101 ] Heidi Howard, Malte Schwarzkopf, Anil Madhavapeddy, and Jon Crowcroft: “ Raft Refloated: Do We Have Consensus? ,” ACM SIGOPS Operating Systems Review , volume 49, number 1, pages 12–21, January 2015. doi:10.1145/2723872.2723876
[101] Heidi Howard, Malte Schwarzkopf, Anil Madhavapeddy,和Jon Crowcroft:“Raft再浮现: 我们有共识吗?”,ACM SIGOPS操作系统评论,第49卷,第1期,页12-21,2015年1月。 doi:10.1145/2723872.2723876。
[ 102 ] André Medeiros: “ ZooKeeper’s Atomic Broadcast Protocol: Theory and Practice ,” Aalto University School of Science, March 20, 2012.
[102] 安德烈·梅德罗斯: “ZooKeeper 的原子广播协议:理论与实践”,阿尔托大学科学学院,2012年3月20日。
[ 103 ] Robbert van Renesse, Nicolas Schiper, and Fred B. Schneider: “ Vive La Différence: Paxos vs. Viewstamped Replication vs. Zab ,” IEEE Transactions on Dependable and Secure Computing , volume 12, number 4, pages 472–484, September 2014. doi:10.1109/TDSC.2014.2355848
[103] Robbert van Renesse、Nicolas Schiper和Fred B. Schneider:“Vive La Différence:Paxos vs. Viewstamped Replication vs. Zab”,IEEE Transactions on Dependable and Secure Computing,第12卷,第4期,第472-484页,2014年9月。doi:10.1109/TDSC.2014.2355848。
[ 104 ] Will Portnoy: “ Lessons Learned from Implementing Paxos ,” blog.willportnoy.com , June 14, 2012.
[104] Will Portnoy:“实现Paxos的经验教训”,blog.willportnoy.com,2012年6月14日。
[ 105 ] Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman: “ Flexible Paxos: Quorum Intersection Revisited ,” arXiv:1608.06696 , August 24, 2016.
[105] Heidi Howard、Dahlia Malkhi和Alexander Spiegelman: “灵活的Paxos:再探Quorum 交集”, arXiv:1608.06696,2016年8月24日。
[ 106 ] Heidi Howard and Jon Crowcroft: “ Coracle: Evaluating Consensus at the Internet Edge ,” at Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM), August 2015. doi:10.1145/2829988.2790010
[106] Heidi Howard and Jon Crowcroft: "Coracle: 在互联网边缘评估共识",ACM数据通信专业兴趣小组(SIGCOMM)年会,2015年8月。doi:10.1145/2829988.2790010。
[ 107 ] Kyle Kingsbury: “ Call Me Maybe: Elasticsearch 1.5.0 ,” aphyr.com , April 27, 2015.
[107] Kyle Kingsbury:“Call Me Maybe:Elasticsearch 1.5.0”,aphyr.com,2015年4月27日。
[ 108 ] Ivan Kelly: “ BookKeeper Tutorial ,” github.com , October 2014.
[108] Ivan Kelly:“BookKeeper教程”,github.com,2014年10月。
[ 109 ] Camille Fournier: “ Consensus Systems for the Skeptical Architect ,” at Craft Conference , Budapest, Hungary, April 2015.
[109] Camille Fournier:“写给怀疑派架构师的共识系统”,发表于2015年4月匈牙利布达佩斯Craft Conference。
[ 110 ] Kenneth P. Birman: “ A History of the Virtual Synchrony Replication Model ,” in Replication: Theory and Practice , Springer LNCS volume 5959, chapter 6, pages 91–120, 2010. ISBN: 978-3-642-11293-5, doi:10.1007/978-3-642-11294-2_6
[110] Kenneth P. Birman:“虚拟同步复制模型的历史”,载于《复制:理论与实践》,Springer LNCS 5959 卷,第6章,第91-120页,2010年。ISBN: 978-3-642-11293-5,doi:10.1007/978-3-642-11294-2_6。
Part III. Derived Data
In Parts I and II of this book, we assembled from the ground up all the major considerations that go into a distributed database, from the layout of data on disk all the way to the limits of distributed consistency in the presence of faults. However, this discussion assumed that there was only one database in the application.
在本书的第一部分和第二部分中,我们从最基础的开始组织了分布式数据库中所涉及的所有主要考虑因素,从磁盘上的数据布局一直到存在故障时的分布式一致性限制。但是,这种讨论假设应用程序中只有一个数据库。
In reality, data systems are often more complex. In a large application you often need to be able to access and process data in many different ways, and there is no one database that can satisfy all those different needs simultaneously. Applications thus commonly use a combination of several different datastores, indexes, caches, analytics systems, etc. and implement mechanisms for moving data from one store to another.
现实情况下,数据系统往往更加复杂。在大型应用中,您通常需要能够以许多不同的方式访问和处理数据,并且没有一个数据库可以同时满足所有这些不同的需求。因此,应用程序通常使用多个不同的数据存储、索引、缓存、分析系统等,并实现将数据从一个存储库移动到另一个存储库的机制。
In this final part of the book, we will examine the issues around integrating multiple different data systems, potentially with different data models and optimized for different access patterns, into one coherent application architecture. This aspect of system-building is often overlooked by vendors who claim that their product can satisfy all your needs. In reality, integrating disparate systems is one of the most important things that needs to be done in a nontrivial application.
在本书的最后部分,我们将研究如何把多个不同的数据系统(它们可能具有不同的数据模型,并针对不同的访问模式做了优化)集成到一个连贯的应用架构中。声称自家产品能满足你所有需求的供应商,往往忽视了系统构建的这一方面。实际上,在任何稍具规模的应用中,集成互异的系统都是最重要的工作之一。
Systems of Record and Derived Data
On a high level, systems that store and process data can be grouped into two broad categories:
从高层次来看,存储和处理数据的系统可以分为两个广泛的类别:
- Systems of record
-
A system of record, also known as source of truth , holds the authoritative version of your data. When new data comes in, e.g., as user input, it is first written here. Each fact is represented exactly once (the representation is typically normalized ). If there is any discrepancy between another system and the system of record, then the value in the system of record is (by definition) the correct one.
记录系统(system of record),也称为事实来源(source of truth),保存着你的数据的权威版本。当新数据进入(例如作为用户输入)时,首先写入这里。每个事实恰好表示一次(其表示通常是规范化的)。如果其他系统与记录系统之间存在任何差异,那么(根据定义)以记录系统中的值为准。
- Derived data systems
-
Data in a derived system is the result of taking some existing data from another system and transforming or processing it in some way. If you lose derived data, you can recreate it from the original source. A classic example is a cache: data can be served from the cache if present, but if the cache doesn’t contain what you need, you can fall back to the underlying database. Denormalized values, indexes, and materialized views also fall into this category. In recommendation systems, predictive summary data is often derived from usage logs.
派生系统中的数据是从另一个系统中获取一些现有数据并以某种方式进行转换或处理的结果。如果您丢失了派生数据,则可以从原始来源重新创建它。一个经典的例子是缓存:如果缓存存在数据,可以从缓存中提供数据,但如果缓存中不存在所需数据,则可以回退到基础数据库。非规范化的值、索引和物化视图也属于此类。在推荐系统中,预测性摘要数据通常是从使用日志中派生的。
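The cache-fallback pattern described above can be sketched in a few lines. This is a hypothetical, minimal in-memory example: the two dictionaries standing in for a real cache and a real database are assumptions for illustration only.

上面描述的缓存回退模式可以用几行代码来勾勒。这是一个假设性的、最小化的内存示例:代表真实缓存和真实数据库的两个字典仅为示意而设。

```python
db = {"user:1": {"name": "Alice"}}  # stand-in for the system of record
cache = {}                          # derived data: safe to lose

def get(key):
    if key in cache:                # serve from the derived copy if present
        return cache[key]
    value = db[key]                 # fall back to the authoritative store
    cache[key] = value              # repopulate the derived copy
    return value

print(get("user:1"))  # first call reads the database and fills the cache
print(get("user:1"))  # second call is served entirely from the cache
```

If the cache is wiped, nothing is lost: every entry can be recomputed from the system of record, which is exactly what makes it derived data.

如果缓存被清空,不会丢失任何东西:每个条目都可以从记录系统重新计算出来,这正是它属于派生数据的原因。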
Technically speaking, derived data is redundant , in the sense that it duplicates existing information. However, it is often essential for getting good performance on read queries. It is commonly denormalized . You can derive several different datasets from a single source, enabling you to look at the data from different “points of view.”
从技术上讲,派生数据是冗余的,因为它重复了已有的信息。然而,它通常对于读取查询获得良好性能至关重要,而且常常是非规范化的。你可以从同一个来源派生出多个不同的数据集,从而能够从不同的“视角”观察数据。
Not all systems make a clear distinction between systems of record and derived data in their architecture, but it’s a very helpful distinction to make, because it clarifies the dataflow through your system: it makes explicit which parts of the system have which inputs and which outputs, and how they depend on each other.
并非所有系统在其体系结构中明确区分记录系统和衍生数据系统,但这是一个非常有帮助的区分,因为它可以澄清系统中的数据流:它可以明确哪些部分具有哪些输入和输出,以及它们如何相互依赖。
Most databases, storage engines, and query languages are not inherently either a system of record or a derived system. A database is just a tool: how you use it is up to you. The distinction between system of record and derived data system depends not on the tool, but on how you use it in your application.
大多数数据库、存储引擎和查询语言并非本质上是记录系统或派生系统。数据库只是一种工具:你如何使用它取决于你。系统记录和派生数据系统之间的区别取决于你在应用程序中如何使用这个工具,而不是取决于这个工具本身。
By being clear about which data is derived from which other data, you can bring clarity to an otherwise confusing system architecture. This point will be a running theme throughout this part of the book.
通过清楚地表明哪些数据是从哪些其他数据导出的,您可以为本来混淆的系统架构带来明晰。这一点将贯穿本书的整个部分。
Overview of Chapters
We will start in Chapter 10 by examining batch-oriented dataflow systems such as MapReduce, and see how they give us good tools and principles for building large-scale data systems. In Chapter 11 we will take those ideas and apply them to data streams, which allow us to do the same kinds of things with lower delays. Chapter 12 concludes the book by exploring ideas about how we might use these tools to build reliable, scalable, and maintainable applications in the future.
我们将从第10章开始,研究MapReduce等面向批处理的数据流系统,并了解它们为构建大规模数据系统提供的优秀工具和原则。在第11章中,我们将把这些思想应用于数据流,从而以更低的延迟完成同类任务。第12章以探讨性的想法结束本书:未来我们如何利用这些工具来构建可靠、可扩展和可维护的应用。
Chapter 10. Batch Processing
A system cannot be successful if it is too strongly influenced by a single person. Once the initial design is complete and fairly robust, the real test begins as people with many different viewpoints undertake their own experiments.
一个系统如果受到单个人的过多影响,就不可能成功。一旦初步设计完成并相当健壮,真正的考验就开始了,因为拥有许多不同观点的人会进行自己的实验。
Donald Knuth
高德纳
In the first two parts of this book we talked a lot about requests and queries , and the corresponding responses or results . This style of data processing is assumed in many modern data systems: you ask for something, or you send an instruction, and some time later the system (hopefully) gives you an answer. Databases, caches, search indexes, web servers, and many other systems work this way.
在本书的前两个部分,我们广泛讨论了请求和查询,以及相应的回答或结果。许多现代数据系统都使用这种数据处理方式:您请求某些内容或发送指令,一段时间后系统(希望)给您答案。数据库,缓存,搜索索引,Web服务器和许多其他系统都是这样工作的。
In such online systems, whether it’s a web browser requesting a page or a service calling a remote API, we generally assume that the request is triggered by a human user, and that the user is waiting for the response. They shouldn’t have to wait too long, so we pay a lot of attention to the response time of these systems (see “Describing Performance” ).
在这样的在线系统中,无论是Web浏览器请求页面还是服务调用远程API,我们通常假设请求是由人类用户触发的,并且用户正在等待响应。他们不应该等待太久,因此我们非常注重这些系统的响应时间(参见“描述性能”)。
The web, and increasing numbers of HTTP/REST-based APIs, has made the request/response style of interaction so common that it’s easy to take it for granted. But we should remember that it’s not the only way of building systems, and that other approaches have their merits too. Let’s distinguish three different types of systems:
网络和HTTP / REST API数量的不断增加,使请求/响应式的交互方式变得如此普遍,以至于很容易把它视为理所当然的事情。但我们应该记住,这不是构建系统的唯一方式,其他方法也有他们的优点。让我们区分三种不同类型的系统:
- Services (online systems)
-
A service waits for a request or instruction from a client to arrive. When one is received, the service tries to handle it as quickly as possible and sends a response back. Response time is usually the primary measure of performance of a service, and availability is often very important (if the client can’t reach the service, the user will probably get an error message).
一个服务等待客户端的请求或指令,一旦收到,服务会尽快处理并发送响应。响应时间通常是衡量服务性能的主要指标,可用性经常非常重要(如果客户端无法连接服务,则用户可能会收到错误消息)。
- Batch processing systems (offline systems)
-
A batch processing system takes a large amount of input data, runs a job to process it, and produces some output data. Jobs often take a while (from a few minutes to several days), so there normally isn’t a user waiting for the job to finish. Instead, batch jobs are often scheduled to run periodically (for example, once a day). The primary performance measure of a batch job is usually throughput (the time it takes to crunch through an input dataset of a certain size). We discuss batch processing in this chapter.
批处理系统接收大量输入数据,运行一个作业来处理它,并产生一些输出数据。作业通常需要运行一段时间(从几分钟到数天),因此通常没有用户在等待作业完成。相反,批处理作业通常被安排定期运行(例如,每天一次)。批处理作业的主要性能指标通常是吞吐量(处理完一定大小的输入数据集所需的时间)。我们在本章中讨论批处理。
- Stream processing systems (near-real-time systems)
-
Stream processing is somewhere between online and offline/batch processing (so it is sometimes called near-real-time or nearline processing). Like a batch processing system, a stream processor consumes inputs and produces outputs (rather than responding to requests). However, a stream job operates on events shortly after they happen, whereas a batch job operates on a fixed set of input data. This difference allows stream processing systems to have lower latency than the equivalent batch systems. As stream processing builds upon batch processing, we discuss it in Chapter 11 .
流处理位于在线和离线/批处理之间(因此有时称为近实时或近线处理)。与批处理系统一样,流处理器消耗输入并产生输出(而不是响应请求)。但是,流作业在事件发生后不久运行,而批作业在固定的输入数据集上运行。这种差异使流处理系统具有比等效批处理系统更低的延迟。由于流处理建立在批处理之上,因此我们在第11章中讨论它。
As we shall see in this chapter, batch processing is an important building block in our quest to build reliable, scalable, and maintainable applications. For example, MapReduce, a batch processing algorithm published in 2004 [ 1 ], was (perhaps over-enthusiastically) called “the algorithm that makes Google so massively scalable” [ 2 ]. It was subsequently implemented in various open source data systems, including Hadoop, CouchDB, and MongoDB.
正如本章所述,批处理是我们构建可靠、可扩展和易于维护应用程序的重要基石。例如,MapReduce 是一种批处理算法,于 2004 年发表[1],被誉为“使谷歌规模如此巨大的算法”[2](可能过分热情)。它随后被实现在各种开源数据系统中,包括 Hadoop、CouchDB 和 MongoDB。
MapReduce is a fairly low-level programming model compared to the parallel processing systems that were developed for data warehouses many years previously [ 3 , 4 ], but it was a major step forward in terms of the scale of processing that could be achieved on commodity hardware. Although the importance of MapReduce is now declining [ 5 ], it is still worth understanding, because it provides a clear picture of why and how batch processing is useful.
与多年前为数据仓库开发的并行处理系统[3, 4]相比,MapReduce是一种相当低层的编程模型,但就商用硬件所能达到的处理规模而言,它是一大进步。虽然MapReduce的重要性如今正在下降[5],但它仍然值得理解,因为它清晰地展示了批处理为何有用以及如何发挥作用。
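The programming model in question (a map function that emits key-value pairs, a shuffle that groups them by key, and a reduce function that aggregates each group) can be illustrated with a minimal single-machine word-count sketch. This is an illustration only: real MapReduce distributes each of these steps across many machines.

这里所说的编程模型(map函数输出键值对,shuffle按键分组,reduce函数对每组进行聚合)可以用一个最简的单机词频统计草图来说明。这仅是示意:真正的MapReduce会把这些步骤分布到许多机器上执行。

```python
from collections import defaultdict

def map_fn(document):
    for word in document.split():
        yield (word, 1)               # emit one key-value pair per word

def reduce_fn(word, counts):
    return (word, sum(counts))        # aggregate all values for one key

def mapreduce(documents):
    groups = defaultdict(list)        # the "shuffle": group values by key
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(mapreduce(["to be or not to be"]))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```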
In fact, batch processing is a very old form of computing. Long before programmable digital computers were invented, punch card tabulating machines—such as the Hollerith machines used in the 1890 US Census [ 6 ]—implemented a semi-mechanized form of batch processing to compute aggregate statistics from large inputs. And MapReduce bears an uncanny resemblance to the electromechanical IBM card-sorting machines that were widely used for business data processing in the 1940s and 1950s [ 7 ]. As usual, history has a tendency of repeating itself.
事实上,批处理是一种非常古老的计算形式。早在可编程数字计算机发明之前,打孔卡制表机(例如1890年美国人口普查中使用的Hollerith机器[6])就实现了一种半机械化的批处理形式,用于从大量输入中计算聚合统计。MapReduce与1940和1950年代广泛用于商业数据处理的机电式IBM卡片分类机有着惊人的相似之处[7]。一如既往,历史总是在重演。
In this chapter, we will look at MapReduce and several other batch processing algorithms and frameworks, and explore how they are used in modern data systems. But first, to get started, we will look at data processing using standard Unix tools. Even if you are already familiar with them, a reminder about the Unix philosophy is worthwhile because the ideas and lessons from Unix carry over to large-scale, heterogeneous distributed data systems.
在本章中,我们将研究MapReduce和其他几种批处理算法和框架,并探讨它们在现代数据系统中的应用。但首先,为了开始,我们将研究使用标准Unix工具的数据处理。即使您已经熟悉它们,Unix哲学的提醒也是值得的,因为Unix的思想和教训可以应用于大规模、异构分布式数据系统。
Batch Processing with Unix Tools
Let’s start with a simple example. Say you have a web server that appends a line to a log file every time it serves a request. For example, using the nginx default access log format, one line of the log might look like this:
让我们从一个简单的例子开始。假设您有一个Web服务器,每次提供请求时都向日志文件追加一行。例如,使用nginx默认访问日志格式,日志的一行可能如下所示:
216.58.210.78 - - [27/Feb/2015:17:55:11 +0000] "GET /css/typography.css HTTP/1.1" 200 3377 "http://martin.kleppmann.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36"
(That is actually one line; it’s only broken onto multiple lines here for readability.) There’s a lot of information in that line. In order to interpret it, you need to look at the definition of the log format, which is as follows:
这实际上是一行;为了易读性,它被分成了多行。这一行包含很多信息。为了解释它,你需要查看日志格式的定义,如下:
$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"
So, this one line of the log indicates that on February 27, 2015, at 17:55:11 UTC, the server received a request for the file /css/typography.css from the client IP address 216.58.210.78. The user was not authenticated, so $remote_user is set to a hyphen (-). The response status was 200 (i.e., the request was successful), and the response was 3,377 bytes in size. The web browser was Chrome 40, and it loaded the file because it was referenced in the page at the URL http://martin.kleppmann.com/.
因此,日志中的这一行表示在2015年2月27日17:55:11 UTC时,服务器从客户端IP地址216.58.210.78收到了对文件/css/typography.css的请求。用户未经过身份验证,因此$remote_user设置为连字符(-)。响应状态为200(即请求成功),响应大小为3,377字节。网页浏览器为Chrome 40,并且加载该文件是因为它在URL http://martin.kleppmann.com/ 的页面中被引用。
Simple Log Analysis
Various tools can take these log files and produce pretty reports about your website traffic, but for the sake of exercise, let’s build our own, using basic Unix tools. For example, say you want to find the five most popular pages on your website. You can do this in a Unix shell as follows:
使用各种工具可以获取这些日志文件并生成有关网站流量的漂亮报告,但是为了练习起见,让我们使用基本的Unix工具构建自己的工具。例如,假设您想找到网站上最受欢迎的五个页面。您可以在Unix shell中按如下方式执行:
cat /var/log/nginx/access.log |
  awk '{print $7}' |
  sort |
  uniq -c |
  sort -r -n |
  head -n 5
- Read the log file.
阅读日志文件。
- Split each line into fields by whitespace, and output only the seventh such field from each line, which happens to be the requested URL. In our example line, this request URL is /css/typography.css .
将每行按空格分成多个字段,并只输出每行中的第七个字段,即所请求的URL。在我们的例子中,这个请求URL是/css/typography.css。
- Alphabetically sort the list of requested URLs. If some URL has been requested n times, then after sorting, the file contains the same URL repeated n times in a row.
按字母顺序排序所请求的URL列表。如果某个URL被请求了n次,则在排序后,文件中包含相同的URL连续出现n次。
- The uniq command filters out repeated lines in its input by checking whether two adjacent lines are the same. The -c option tells it to also output a counter: for every distinct URL, it reports how many times that URL appeared in the input.
uniq命令通过检查相邻两行是否相同来过滤输入中的重复行。-c选项告诉它同时输出一个计数器:对于每个不同的URL,它报告该URL在输入中出现了多少次。
- The second sort sorts by the number (-n) at the start of each line, which is the number of times the URL was requested. It then returns the results in reverse (-r) order, i.e., with the largest number first.
第二个sort按每行开头的数字(-n)排序,这个数字表示URL被请求的次数。然后按逆序(-r)返回结果,也就是先返回请求次数最多的URL。
- Finally, head outputs just the first five lines (-n 5) of input, and discards the rest.
最后,head只输出输入的前五行(-n 5),并丢弃其余部分。
The output of that series of commands looks something like this:
那串命令的输出大概是这个样子的:
4189 /favicon.ico 3631 /2013/05/24/improving-security-of-ssh-private-keys.html 2124 /2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html 1369 / 915 /css/typography.css
Although the preceding command line likely looks a bit obscure if you’re unfamiliar with Unix tools, it is incredibly powerful. It will process gigabytes of log files in a matter of seconds, and you can easily modify the analysis to suit your needs. For example, if you want to omit CSS files from the report, change the awk argument to '$7 !~ /\.css$/ {print $7}'. If you want to count top client IP addresses instead of top pages, change the awk argument to '{print $1}'. And so on.
尽管如果您不熟悉Unix工具,前面的命令行可能看起来有些晦涩,但它非常强大。它可以在几秒钟内处理数千兆字节的日志文件,并且您可以轻松修改分析以适应您的需要。例如,如果您想从报告中省略CSS文件,将awk参数更改为'$7 !~ /\.css$/ {print $7}'。如果您想统计访问量最高的客户端IP地址而不是最热门的页面,则将awk参数更改为'{print $1}'。等等。
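As a concrete illustration, the two awk variations above can be tried on a tiny handmade log file (the file path and log entries below are fabricated for the demo):

```shell
# Create a tiny sample log in nginx access-log format (fabricated entries).
cat > /tmp/access_sample.log <<'EOF'
216.58.210.78 - - [27/Feb/2015:17:55:11 +0000] "GET /index.html HTTP/1.1" 200 3377 "-" "-"
10.0.0.5 - - [27/Feb/2015:17:55:12 +0000] "GET /css/typography.css HTTP/1.1" 200 1024 "-" "-"
216.58.210.78 - - [27/Feb/2015:17:55:13 +0000] "GET /about.html HTTP/1.1" 200 512 "-" "-"
EOF

# Top pages, excluding CSS files:
awk '$7 !~ /\.css$/ {print $7}' /tmp/access_sample.log | sort | uniq -c | sort -r -n | head -n 5

# Top client IP addresses:
awk '{print $1}' /tmp/access_sample.log | sort | uniq -c | sort -r -n | head -n 5
```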
We don’t have space in this book to explore Unix tools in detail, but they are very much worth learning about. Surprisingly many data analyses can be done in a few minutes using some combination of awk, sed, grep, sort, uniq, and xargs, and they perform surprisingly well [8].
我们在这本书中没有足够的空间详细探讨Unix工具,但它们非常值得学习。令人惊讶的是,使用awk、sed、grep、sort、uniq和xargs的一些组合,可以在几分钟内完成许多数据分析,并且它们的表现出奇的好 [8]。
Chain of commands versus custom program
Instead of the chain of Unix commands, you could write a simple program to do the same thing. For example, in Ruby, it might look something like this:
不用Unix命令的链条,你可以编写一个简单的程序来做同样的事情。例如,在Ruby中,它可能看起来像这样:
counts = Hash.new(0)

File.open('/var/log/nginx/access.log') do |file|
  file.each do |line|
    url = line.split[6]
    counts[url] += 1
  end
end

top5 = counts.map { |url, count| [count, url] }.sort.reverse[0...5]
top5.each { |count, url| puts "#{count} #{url}" }
- counts is a hash table that keeps a counter for the number of times we’ve seen each URL. A counter is zero by default.
counts是一个哈希表,用于记录我们看到每个URL的次数。计数器默认为零。
-
From each line of the log, we take the URL to be the seventh whitespace-separated field (the array index here is 6 because Ruby’s arrays are zero-indexed).
从日志的每一行中,我们将URL取为第七个用空格分隔的字段(在此,数组索引为6,因为Ruby的数组是从零开始计数的)。
-
Increment the counter for the URL in the current line of the log.
增加日志当前行中URL的计数器。
-
Sort the hash table contents by counter value (descending), and take the top five entries.
将哈希表内容按计数器值排序(降序),并取前五条条目。
-
Print out those top five entries.
打印出这前五项。
This program is not as concise as the chain of Unix pipes, but it’s fairly readable, and which of the two you prefer is partly a matter of taste. However, besides the superficial syntactic differences between the two, there is a big difference in the execution flow, which becomes apparent if you run this analysis on a large file.
这个程序不如Unix管道链那样简洁,但它相当易读,你喜欢哪一个主要取决于个人口味。然而,除了表面上的语法差异之外,这两者之间存在着执行流程的巨大差异,如果你在大文件上运行这个分析,这一点就会显而易见。
Sorting versus in-memory aggregation
The Ruby script keeps an in-memory hash table of URLs, where each URL is mapped to the number of times it has been seen. The Unix pipeline example does not have such a hash table, but instead relies on sorting a list of URLs in which multiple occurrences of the same URL are simply repeated.
Ruby脚本在内存中维护一个URL的哈希表,其中每个URL被映射到它出现的次数。Unix管道示例没有这样的哈希表,而是依靠对URL列表进行排序,其中同一URL的多次出现只是简单地重复。
Which approach is better? It depends how many different URLs you have. For most small to mid-sized websites, you can probably fit all distinct URLs, and a counter for each URL, in (say) 1 GB of memory. In this example, the working set of the job (the amount of memory to which the job needs random access) depends only on the number of distinct URLs: if there are a million log entries for a single URL, the space required in the hash table is still just one URL plus the size of the counter. If this working set is small enough, an in-memory hash table works fine—even on a laptop.
哪种方法更好?这取决于您有多少不同的URL。对于大多数小型到中型网站,您可以将所有独立的URL以及每个URL的计数器拟合在(比如说)1GB的内存中。在这个例子中,作业的工作集(作业需要随机访问的内存量)仅取决于独立URL的数量:如果有一百万条针对单个URL的日志条目,哈希表中所需的空间仍然只是一个URL加上计数器的大小。如果此工作集足够小,则内存中的哈希表可以正常工作,即使在笔记本电脑上也可以。
On the other hand, if the job’s working set is larger than the available memory, the sorting approach has the advantage that it can make efficient use of disks. It’s the same principle as we discussed in “SSTables and LSM-Trees” : chunks of data can be sorted in memory and written out to disk as segment files, and then multiple sorted segments can be merged into a larger sorted file. Mergesort has sequential access patterns that perform well on disks. (Remember that optimizing for sequential I/O was a recurring theme in Chapter 3 . The same pattern reappears here.)
另一方面,如果作业的工作集大于可用内存,排序方法的优势在于它可以高效地利用磁盘。这与我们在"SSTables和LSM树"中讨论的原理相同:数据块可以在内存中排序,并作为段文件写入磁盘,然后将多个已排序的段合并为一个更大的排序文件。归并排序的顺序访问模式在磁盘上表现良好。(请记住,针对顺序I/O的优化是第3章中反复出现的主题。相同的模式在这里再次出现。)
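The merge step can be observed directly with the `sort -m` option, which merges files that are already sorted without re-sorting them (the file names below are hypothetical):

```shell
# Sort two chunks independently, as if they were segment files spilled
# to disk during an external mergesort...
printf 'banana\napple\n'   | sort > /tmp/run1.txt
printf 'cherry\napricot\n' | sort > /tmp/run2.txt

# ...then merge the sorted runs sequentially -- the core of mergesort's
# disk-friendly access pattern.
sort -m /tmp/run1.txt /tmp/run2.txt
```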
The sort utility in GNU Coreutils (Linux) automatically handles larger-than-memory datasets by spilling to disk, and automatically parallelizes sorting across multiple CPU cores [9]. This means that the simple chain of Unix commands we saw earlier easily scales to large datasets, without running out of memory. The bottleneck is likely to be the rate at which the input file can be read from disk.
GNU Coreutils(Linux)中的sort实用程序通过溢出到磁盘来自动处理大于内存的数据集,并自动在多个CPU核心上并行排序[9]。这意味着我们之前看到的简单Unix命令链可以轻松扩展到大型数据集,而不会耗尽内存。瓶颈很可能是从磁盘读取输入文件的速率。
The Unix Philosophy
It’s no coincidence that we were able to analyze a log file quite easily, using a chain of commands like in the previous example: this was in fact one of the key design ideas of Unix, and it remains astonishingly relevant today. Let’s look at it in some more depth so that we can borrow some ideas from Unix [ 10 ].
我们能够轻松地使用一系列命令,像之前的例子一样分析日志文件,这并非巧合:这实际上是Unix的关键设计思想之一,而且如今仍然非常相关。让我们深入了解一下,这样我们就可以从Unix中借鉴一些思想。
Doug McIlroy, the inventor of Unix pipes, first described them like this in 1964 [ 11 ]: “We should have some ways of connecting programs like [a] garden hose—screw in another segment when it becomes necessary to massage data in another way. This is the way of I/O also.” The plumbing analogy stuck, and the idea of connecting programs with pipes became part of what is now known as the Unix philosophy —a set of design principles that became popular among the developers and users of Unix. The philosophy was described in 1978 as follows [ 12 , 13 ]:
道格·麦克罗伊(Doug McIlroy)是Unix管道的发明者,他在1964年首次这样描述它们:“我们应该有一些连接程序的方法,就像[花园水管]一样--当需要以另一种方式处理数据时,拧紧另一段。这也是I/O的方式。”这个管道的类比很受欢迎,将程序用管道连接起来的想法成为了Unix哲学的一部分——一套设计原则,成为Unix开发者和用户中广受欢迎的哲学。这个哲学在1978年被描述为:
Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new “features”.
每个程序都要做好一件事情。要完成新的工作,就要重新开始,而不是通过添加新的“功能”来使旧程序变得复杂。
Expect the output of every program to become the input to another, as yet unknown, program. Don’t clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don’t insist on interactive input.
期望每个程序的输出成为另一个尚未知晓的程序的输入。不要用杂乱无用的信息混淆输出。避免使用严格的列或二进制输入格式。不要坚持交互式输入。
Design and build software, even operating systems, to be tried early, ideally within weeks. Don’t hesitate to throw away the clumsy parts and rebuild them.
设计并构建软件,甚至操作系统,在几周内尝试。不要犹豫地丢弃笨拙的部分并重新构建。
Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you’ve finished using them.
使用工具来轻松完成编程任务,而不是使用不熟练的帮助,即使你需要绕道建立工具并且在使用完后可能需要放弃一些。
This approach—automation, rapid prototyping, incremental iteration, being friendly to experimentation, and breaking down large projects into manageable chunks—sounds remarkably like the Agile and DevOps movements of today. Surprisingly little has changed in four decades.
这种方法——自动化、快速原型、增量迭代、对实验友好,以及将大型项目分解为可管理的块——听起来非常像今天的敏捷和DevOps运动。四十年来,变化竟如此之少。
The sort tool is a great example of a program that does one thing well. It is arguably a better sorting implementation than most programming languages have in their standard libraries (which do not spill to disk and do not use multiple threads, even when that would be beneficial). And yet, sort is barely useful in isolation. It only becomes powerful in combination with the other Unix tools, such as uniq.
sort工具是"只做好一件事"的绝佳范例。它的排序实现可以说比大多数编程语言标准库中的实现更好(后者不会溢出到磁盘,也不会使用多线程,即使这样做有好处)。然而,sort单独使用时几乎没有用处。只有与uniq等其他Unix工具结合使用,它才变得强大。
A Unix shell like bash lets us easily compose these small programs into surprisingly powerful data processing jobs. Even though many of these programs are written by different groups of people, they can be joined together in flexible ways. What does Unix do to enable this composability?
像bash这样的Unix shell让我们轻松地将这些小程序组合成令人惊讶的强大数据处理作业。尽管许多这些程序由不同的人编写,但它们可以以灵活的方式组合在一起。Unix是如何实现这种可组合性的呢?
A uniform interface
If you expect the output of one program to become the input to another program, that means those programs must use the same data format—in other words, a compatible interface. If you want to be able to connect any program’s output to any program’s input, that means that all programs must use the same input/output interface.
如果您希望一个程序的输出成为另一个程序的输入,那么这些程序必须使用相同的数据格式,换句话说,必须具有兼容的接口。如果您想能够将任何程序的输出连接到任何程序的输入,那么所有程序都必须使用相同的输入/ 输出接口。
In Unix, that interface is a file (or, more precisely, a file descriptor). A file is just an ordered sequence of bytes. Because that is such a simple interface, many different things can be represented using the same interface: an actual file on the filesystem, a communication channel to another process (Unix socket, stdin, stdout), a device driver (say /dev/audio or /dev/lp0), a socket representing a TCP connection, and so on. It’s easy to take this for granted, but it’s actually quite remarkable that these very different things can share a uniform interface, so they can easily be plugged together.
在Unix中,该接口是一个文件(或更精确地说,是一个文件描述符)。文件只是一个有序的字节序列。因为这是一个如此简单的接口,许多不同的事物都可以使用相同的接口表示:文件系统上的实际文件,到另一个进程的通信通道(Unix套接字,stdin,stdout),设备驱动程序(例如/dev/audio或/dev/lp0),表示TCP连接的套接字等。很容易认为这是理所当然的,但实际上,这些非常不同的东西可以共享统一的接口,因此可以轻松地将它们组合在一起。
By convention, many (but not all) Unix programs treat this sequence of bytes as ASCII text. Our log analysis example used this fact: awk, sort, uniq, and head all treat their input file as a list of records separated by the \n (newline, ASCII 0x0A) character. The choice of \n is arbitrary—arguably, the ASCII record separator 0x1E would have been a better choice, since it’s intended for this purpose [14]—but in any case, the fact that all these programs have standardized on using the same record separator allows them to interoperate.
按照惯例,许多(但不是所有)Unix程序将这个字节序列视为ASCII文本。我们的日志分析示例利用了这一事实:awk、sort、uniq和head都将它们的输入文件视为由\n(换行符,ASCII 0x0A)字符分隔的记录列表。选择\n是任意的 - 可以说,ASCII记录分隔符0x1E可能是更好的选择,因为它是为此目的而设计的[14] - 但无论如何,所有这些程序标准化使用相同的记录分隔符,使它们可以互操作。
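The record separator really is just a convention; as a small illustrative sketch, awk can be told to treat the ASCII record separator (0x1E, octal 036) as the delimiter instead of \n:

```shell
# Three records separated by ASCII 0x1E (octal \036) instead of newlines;
# awk's RS variable switches the record separator it splits on.
printf 'one\036two\036three' | awk 'BEGIN { RS = "\036" } { print NR, $0 }'
```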
The parsing of each record (i.e., a line of input) is more vague. Unix tools commonly split a line into fields by whitespace or tab characters, but CSV (comma-separated), pipe-separated, and other encodings are also used. Even a fairly simple tool like xargs has half a dozen command-line options for specifying how its input should be parsed.
每条记录的解析(即输入的一行)更加模糊。Unix工具通常通过空格或制表符将一行分割为字段,但CSV(逗号分隔)、管道分隔和其他编码也被使用。即使像xargs这样相当简单的工具也有半打命令行选项来指定其输入应该如何解析。
The uniform interface of ASCII text mostly works, but it’s not exactly beautiful: our log analysis example used {print $7} to extract the URL, which is not very readable. In an ideal world this could have perhaps been {print $request_url} or something of that sort. We will return to this idea later.
ASCII文本的统一接口大部分都能够正常运作,但它并不是非常美观:我们的日志分析示例使用{print $7}来提取URL,这并不是非常易读。在理想的情况下,可能会使用{print $request_url}或类似的方式。稍后我们会回到这个想法。
Although it’s not perfect, even decades later, the uniform interface of Unix is still something remarkable. Not many pieces of software interoperate and compose as well as Unix tools do: you can’t easily pipe the contents of your email account and your online shopping history through a custom analysis tool into a spreadsheet and post the results to a social network or a wiki. Today it’s an exception, not the norm, to have programs that work together as smoothly as Unix tools do.
尽管并不完美,但即使在几十年后,Unix的统一接口仍然是非凡的。没有多少软件能像Unix工具那样良好地互操作和组合:你无法轻易地将你的电子邮件账户的内容和你的在线购物历史通过自定义分析工具导入电子表格,并将结果发布到社交网络或维基上。如今,能像Unix工具那样顺畅协作的程序是例外,而不是常态。
Even databases with the same data model often don’t make it easy to get data out of one and into the other. This lack of integration leads to Balkanization of data.
即使数据模型相同的数据库,通常也不容易将数据从一个数据库传输到另一个数据库。这种缺乏集成会导致数据的巴尔干化。
Separation of logic and wiring
Another characteristic feature of Unix tools is their use of standard input (stdin) and standard output (stdout). If you run a program and don’t specify anything else, stdin comes from the keyboard and stdout goes to the screen. However, you can also take input from a file and/or redirect output to a file. Pipes let you attach the stdout of one process to the stdin of another process (with a small in-memory buffer, and without writing the entire intermediate data stream to disk).
Unix工具的另一个特点是它们使用标准输入(stdin)和标准输出(stdout)。如果您运行一个程序而没有指定任何其他内容,stdin就来自键盘,stdout就会显示在屏幕上。但是,您也可以从文件中获取输入,并将输出重定向到文件。使用管道(pipe)可以将一个进程的stdout连接到另一个进程的stdin(通过一个小的内存缓冲区,而无需将整个中间数据流写入磁盘)。
A program can still read and write files directly if it needs to, but the Unix approach works best if a program doesn’t worry about particular file paths and simply uses stdin and stdout. This allows a shell user to wire up the input and output in whatever way they want; the program doesn’t know or care where the input is coming from and where the output is going to. (One could say this is a form of loose coupling, late binding [15], or inversion of control [16].) Separating the input/output wiring from the program logic makes it easier to compose small tools into bigger systems.
如果需要的话,程序仍然可以直接读写文件,但是Unix的方法最好是如果程序不用担心特定的文件路径,而只是使用stdin和stdout。这使得shell用户可以以任何他们想要的方式连接输入和输出;程序不知道也不关心输入来自哪里,输出去哪里。 (可以说这是一种松散耦合,晚绑定 [15]或控制反转[16]的形式。)将输入/输出布线与程序逻辑分离使得更容易将小工具组合成更大的系统。
You can even write your own programs and combine them with the tools provided by the operating system. Your program just needs to read input from stdin and write output to stdout, and it can participate in data processing pipelines. In the log analysis example, you could write a tool that translates user-agent strings into more sensible browser identifiers, or a tool that translates IP addresses into country codes, and simply plug it into the pipeline. The sort program doesn’t care whether it’s communicating with another part of the operating system or with a program written by you.
你甚至可以编写自己的程序并将它们与操作系统提供的工具结合使用。你的程序只需要从stdin读取输入并将输出写入stdout,就可以参与数据处理管道。在日志分析示例中,你可以编写一个工具,将用户代理字符串转换为更合理的浏览器标识符,或者将IP地址转换为国家代码,并将其简单地插入到管道中。排序程序无论是与操作系统的另一个部分还是与你编写的程序通信,都不会在意。
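A minimal stand-in for such an IP-to-country tool might look like this (the prefix table is fabricated for illustration; a real tool would consult a GeoIP database):

```shell
# A tiny filter: reads lines from stdin, writes annotated lines to stdout.
# It knows nothing about where its input comes from or where its output goes.
cat > /tmp/ip2country <<'EOF'
#!/bin/sh
# Toy prefix-to-country mapping (fabricated); a real tool would use GeoIP data.
awk '{ if ($1 ~ /^216\.58\./) print $1, "US"; else print $1, "??" }'
EOF
chmod +x /tmp/ip2country

# Plug it into a pipeline like any built-in Unix tool:
printf '216.58.210.78\n10.0.0.5\n' | /tmp/ip2country
```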
However, there are limits to what you can do with stdin and stdout. Programs that need multiple inputs or outputs are possible but tricky. You can’t pipe a program’s output into a network connection [17, 18]. If a program directly opens files for reading and writing, or starts another program as a subprocess, or opens a network connection, then that I/O is wired up by the program itself. It can still be configurable (through command-line options, for example), but the flexibility of wiring up inputs and outputs in a shell is reduced.
然而,使用 stdin 和 stdout 存在一定的限制。需要多个输入或输出的程序也可以编写,但比较棘手。不能将程序的输出导入网络连接[17,18]. 如果程序直接打开文件进行读写,或启动另一个程序作为子进程,或打开网络连接,则程序自身将进行IO布线。它仍然可以通过命令行选项进行配置,但在 shell 中布线输入和输出的灵活性会降低。
Transparency and experimentation
Part of what makes Unix tools so successful is that they make it quite easy to see what is going on:
Unix工具之所以如此成功的一部分原因是它们很容易让人看到正在发生的事情。
-
The input files to Unix commands are normally treated as immutable. This means you can run the commands as often as you want, trying various command-line options, without damaging the input files.
Unix命令的输入文件通常被视为不可变的。这意味着您可以多次运行命令,尝试不同的命令行选项,而不会破坏输入文件。
-
You can end the pipeline at any point, pipe the output into less, and look at it to see if it has the expected form. This ability to inspect is great for debugging.
您可以在任何时候结束管道,将输出导入less查看,以检查它是否具有预期的形式。这种检查能力对于调试非常有用。
-
You can write the output of one pipeline stage to a file and use that file as input to the next stage. This allows you to restart the later stage without rerunning the entire pipeline.
你可以将一个流水线阶段的输出写入文件,并将该文件作为下一个阶段的输入。这使得你可以在不重新运行整个管道的情况下重新启动后面的阶段。
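The third point can be sketched like this, with hypothetical intermediate files:

```shell
# Materialize the output of an expensive early stage once...
printf '/a\n/b\n/a\n' > /tmp/urls.txt            # stand-in for awk '{print $7}' output
sort /tmp/urls.txt | uniq -c > /tmp/counts.txt   # run the costly stage once

# ...then iterate freely on the cheap later stages without redoing it.
sort -r -n /tmp/counts.txt | head -n 5
```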
Thus, even though Unix tools are quite blunt, simple tools compared to a query optimizer of a relational database, they remain amazingly useful, especially for experimentation.
因此,尽管与关系数据库的查询优化器相比,Unix工具非常简单,但它们仍然非常有用,特别是对于实验。
However, the biggest limitation of Unix tools is that they run only on a single machine—and that’s where tools like Hadoop come in.
然而,Unix工具的最大限制在于它们只能运行在单个机器上,这就是Hadoop等工具出现的原因。
MapReduce and Distributed Filesystems
MapReduce is a bit like Unix tools, but distributed across potentially thousands of machines. Like Unix tools, it is a fairly blunt, brute-force, but surprisingly effective tool. A single MapReduce job is comparable to a single Unix process: it takes one or more inputs and produces one or more outputs.
MapReduce有点像Unix工具,但分布在可能多达数千台的机器上。像Unix工具一样,它是一个相当简单粗暴、却出奇有效的工具。单个MapReduce作业可与单个Unix进程相类比:它接受一个或多个输入,并产生一个或多个输出。
As with most Unix tools, running a MapReduce job normally does not modify the input and does not have any side effects other than producing the output. The output files are written once, in a sequential fashion (not modifying any existing part of a file once it has been written).
与大多数Unix工具一样,运行MapReduce作业通常不修改输入,并且除了生成输出之外没有任何副作用。输出文件按顺序一次性写入(一旦写入过某个文件的任何部分,就不会修改该部分)。
While Unix tools use stdin and stdout as input and output, MapReduce jobs read and write files on a distributed filesystem. In Hadoop’s implementation of MapReduce, that filesystem is called HDFS (Hadoop Distributed File System), an open source reimplementation of the Google File System (GFS) [19].
而Unix工具使用标准输入和输出作为输入和输出,MapReduce作业则读取和写入分布式文件系统上的文件。在Hadoop的MapReduce实现中,该文件系统称为HDFS(Hadoop分布式文件系统),是Google文件系统(GFS)的开源重新实现[19]。
Various other distributed filesystems besides HDFS exist, such as GlusterFS and the Quantcast File System (QFS) [20]. Object storage services such as Amazon S3, Azure Blob Storage, and OpenStack Swift [21] are similar in many ways. In this chapter we will mostly use HDFS as a running example, but the principles apply to any distributed filesystem.
除HDFS以外,还存在其他分布式文件系统,如GlusterFS和Quantcast文件系统(QFS)[20]。对象存储服务,如Amazon S3,Azure Blob存储和OpenStack Swift [21],在许多方面类似。在本章中,我们将主要使用HDFS作为运行示例,但这些原则适用于任何分布式文件系统。
HDFS is based on the shared-nothing principle (see the introduction to Part II ), in contrast to the shared-disk approach of Network Attached Storage (NAS) and Storage Area Network (SAN) architectures. Shared-disk storage is implemented by a centralized storage appliance, often using custom hardware and special network infrastructure such as Fibre Channel. On the other hand, the shared-nothing approach requires no special hardware, only computers connected by a conventional datacenter network.
HDFS基于共享无关原则(参见第II部分的介绍),与网络附加存储(NAS)和存储区域网络(SAN)架构的共享磁盘方法相反。共享磁盘存储由集中式存储设备实现,通常使用定制硬件和特殊网络基础设施,如光纤通道。另一方面,共享无关方法不需要特殊硬件,只需要通过传统的数据中心网络连接计算机。
HDFS consists of a daemon process running on each machine, exposing a network service that allows other nodes to access files stored on that machine (assuming that every general-purpose machine in a datacenter has some disks attached to it). A central server called the NameNode keeps track of which file blocks are stored on which machine. Thus, HDFS conceptually creates one big filesystem that can use the space on the disks of all machines running the daemon.
HDFS由在每台计算机上运行的守护程序进程组成,它公开了一项网络服务,允许其他节点访问存储在该计算机上的文件(假设数据中心中的每台通用计算机都连接了一些磁盘)。一个名为NameNode的中央服务器跟踪存储在哪台计算机上的文件块。因此,HDFS在概念上创建了一个大型的文件系统,可以使用运行守护程序的所有计算机上的磁盘空间。
In order to tolerate machine and disk failures, file blocks are replicated on multiple machines. Replication may mean simply several copies of the same data on multiple machines, as in Chapter 5 , or an erasure coding scheme such as Reed–Solomon codes, which allows lost data to be recovered with lower storage overhead than full replication [ 20 , 22 ]. The techniques are similar to RAID, which provides redundancy across several disks attached to the same machine; the difference is that in a distributed filesystem, file access and replication are done over a conventional datacenter network without special hardware.
为了容忍机器和磁盘故障,文件块会在多台机器上复制。复制可能是指将同一数据简单地在多台机器上复制几份,就像第5章中那样,也可能是纠删码方案,如Reed-Solomon码,这使得可以用比全复制更低的存储开销来恢复遗失的数据[20,22]。 这些技术类似于RAID,它提供对连接到同一台机器上的多个磁盘的冗余;不同之处在于,在分布式文件系统中,文件访问和复制是通过普通的数据中心网络完成的,没有特殊的硬件。
HDFS has scaled well: at the time of writing, the biggest HDFS deployments run on tens of thousands of machines, with combined storage capacity of hundreds of petabytes [ 23 ]. Such large scale has become viable because the cost of data storage and access on HDFS, using commodity hardware and open source software, is much lower than that of the equivalent capacity on a dedicated storage appliance [ 24 ].
HDFS已经成功地扩展了:在撰写本文的时候,最大的HDFS部署运行在数万台机器上,存储能力达到数百PB[23]。如此大规模的应用之所以可行,是因为使用普通硬件和开源软件在HDFS上进行数据存储和访问的成本远远低于专用存储设备[24]。
MapReduce Job Execution
MapReduce is a programming framework with which you can write code to process large datasets in a distributed filesystem like HDFS. The easiest way of understanding it is by referring back to the web server log analysis example in “Simple Log Analysis” . The pattern of data processing in MapReduce is very similar to this example:
MapReduce是一个编程框架,您可以使用它编写代码来处理像HDFS这样的分布式文件系统中的大型数据集。最简单的理解方法是参考“简单日志分析”中的Web服务器日志分析示例。MapReduce中的数据处理模式非常类似于此示例。
1. Read a set of input files, and break it up into records. In the web server log example, each record is one line in the log (that is, \n is the record separator).
读取一组输入文件,并将其分解为记录。在Web服务器日志示例中,每条记录都是日志中的一行(即,\n是记录分隔符)。
2. Call the mapper function to extract a key and value from each input record. In the preceding example, the mapper function is awk '{print $7}': it extracts the URL ($7) as the key, and leaves the value empty.
调用mapper函数从每条输入记录中提取一个键和一个值。在前面的例子中,mapper函数是awk '{print $7}':它将URL($7)提取为键,并将值留空。
3. Sort all of the key-value pairs by key. In the log example, this is done by the first sort command.
将所有键值对按键排序。在日志示例中,这由第一个sort命令完成。
4. Call the reducer function to iterate over the sorted key-value pairs. If there are multiple occurrences of the same key, the sorting has made them adjacent in the list, so it is easy to combine those values without having to keep a lot of state in memory. In the preceding example, the reducer is implemented by the command uniq -c, which counts the number of adjacent records with the same key.
调用reducer函数遍历排序后的键值对。如果同一个键出现多次,排序已使它们在列表中相邻,因此很容易组合这些值,而无需在内存中保存大量状态。在前面的例子中,reducer由uniq -c命令实现,它计算具有相同键的相邻记录数。
Those four steps can be performed by one MapReduce job. Steps 2 (map) and 4 (reduce) are where you write your custom data processing code. Step 1 (breaking files into records) is handled by the input format parser. Step 3, the sort step, is implicit in MapReduce—you don’t have to write it, because the output from the mapper is always sorted before it is given to the reducer.
这四个步骤可以通过一个MapReduce作业完成。第二步(映射)和第四步(减少)是您编写自定义数据处理代码的地方。第一步(将文件拆分为记录)由输入格式解析器处理。第三步,排序步骤,在MapReduce中是隐含的 - 您无需编写它,因为从mapper的输出始终在提交给reducer之前进行排序。
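The correspondence can be spelled out on a toy input (the three request lines below are fabricated):

```shell
printf 'GET /a\nGET /b\nGET /a\n' |  # 1. break the input into records
  awk '{print $2}' |                 # 2. mapper: emit a key per record
  sort |                             # 3. the implicit sort by key
  uniq -c                            # 4. reducer: aggregate adjacent keys
```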
To create a MapReduce job, you need to implement two callback functions, the mapper and reducer, which behave as follows (see also “MapReduce Querying” ):
创建一个MapReduce作业,需要实现两个回调函数,即mapper和reducer函数,它们的行为如下(也可参见“MapReduce查询”):
- Mapper
The mapper is called once for every input record, and its job is to extract the key and value from the input record. For each input, it may generate any number of key-value pairs (including none). It does not keep any state from one input record to the next, so each record is handled independently.
mapper对每条输入记录调用一次,其工作是从输入记录中提取键和值。对于每条输入,它可以生成任意数量的键值对(包括零个)。它不会从一条输入记录到下一条保留任何状态,因此每条记录都被独立处理。
- Reducer
The MapReduce framework takes the key-value pairs produced by the mappers, collects all the values belonging to the same key, and calls the reducer with an iterator over that collection of values. The reducer can produce output records (such as the number of occurrences of the same URL).
MapReduce框架获取由mapper产生的键值对,收集属于同一个键的所有值,并以遍历该值集合的迭代器调用reducer。reducer可以产生输出记录(例如同一URL的出现次数)。
In the web server log example, we had a second sort command in step 5, which ranked URLs by number of requests. In MapReduce, if you need a second sorting stage, you can implement it by writing a second MapReduce job and using the output of the first job as input to the second job. Viewed like this, the role of the mapper is to prepare the data by putting it into a form that is suitable for sorting, and the role of the reducer is to process the data that has been sorted.
在Web服务器日志示例中,我们在第五步中使用了第二个排序命令,这将URL按请求次数进行排名。在MapReduce中,如果需要第二个排序阶段,可以通过编写第二个MapReduce作业,将第一个作业的输出作为第二个作业的输入,来实现它。从这个角度看,mapper的作用是通过将数据放入适合排序的形式来准备数据,而reducer的作用是处理已经排序的数据。
Distributed execution of MapReduce
The main difference from pipelines of Unix commands is that MapReduce can parallelize a computation across many machines, without you having to write code to explicitly handle the parallelism. The mapper and reducer only operate on one record at a time; they don’t need to know where their input is coming from or their output is going to, so the framework can handle the complexities of moving data between machines.
MapReduce与Unix命令管道的主要区别在于,MapReduce可以在许多计算机上并行化计算,而无需编写代码来显式处理并行性。映射器和规约器一次只处理一条记录,它们不需要知道其输入来自何处或其输出要去何处,因此框架可以处理在计算机之间移动数据的复杂性。
It is possible to use standard Unix tools as mappers and reducers in a distributed computation [ 25 ], but more commonly they are implemented as functions in a conventional programming language. In Hadoop MapReduce, the mapper and reducer are each a Java class that implements a particular interface. In MongoDB and CouchDB, mappers and reducers are JavaScript functions (see “MapReduce Querying” ).
可以使用标准的Unix工具作为分布式计算中的映射器和减少器[25],但更常见的是将它们实现为常规编程语言中的函数。在Hadoop MapReduce中,映射器和减少器是实现特定接口的Java类。在MongoDB和CouchDB中,映射器和减少器是JavaScript函数(参见“MapReduce查询”)。
Figure 10-1 shows the dataflow in a Hadoop MapReduce job. Its parallelization is based on partitioning (see Chapter 6 ): the input to a job is typically a directory in HDFS, and each file or file block within the input directory is considered to be a separate partition that can be processed by a separate map task (marked by m 1 , m 2 , and m 3 in Figure 10-1 ).
图 10-1 显示了 Hadoop MapReduce 作业的数据流。它的并行化基于分区(请参见第 6 章):作业的输入通常是 HDFS 中的一个目录,输入目录中的每个文件或文件块被视为可由单独的映射任务(在图 10-1 中标记为 m 1、m 2 和 m 3)处理的单独分区。
Each input file is typically hundreds of megabytes in size. The MapReduce scheduler (not shown in the diagram) tries to run each mapper on one of the machines that stores a replica of the input file, provided that machine has enough spare RAM and CPU resources to run the map task [ 26 ]. This principle is known as putting the computation near the data [ 27 ]: it saves copying the input file over the network, reducing network load and increasing locality.
每个输入文件通常都有数百兆字节的大小。MapReduce调度程序(在图表中未显示)试图在一台存储输入文件副本的机器上运行每个映射器,前提是该机器具有足够的空闲RAM和CPU资源来运行映射任务[26]。 这个原则被称为将计算放在数据附近[27]:它可以避免在网络上传输输入文件,减少网络负荷,提高局部性。
In most cases, the application code that should run in the map task is not yet present on the machine that is assigned the task of running it, so the MapReduce framework first copies the code (e.g., JAR files in the case of a Java program) to the appropriate machines. It then starts the map task and begins reading the input file, passing one record at a time to the mapper callback. The output of the mapper consists of key-value pairs.
在大多数情况下,应该在映射任务中运行的应用程序代码尚未存在于被分配运行任务的机器上,因此MapReduce框架首先将代码(例如,在Java程序的情况下为JAR文件)复制到相应的机器上。然后启动映射任务,并开始读取输入文件,逐个将记录传递给映射器回调函数。映射器的输出由键值对组成。
The reduce side of the computation is also partitioned. While the number of map tasks is determined by the number of input file blocks, the number of reduce tasks is configured by the job author (it can be different from the number of map tasks). To ensure that all key-value pairs with the same key end up at the same reducer, the framework uses a hash of the key to determine which reduce task should receive a particular key-value pair (see “Partitioning by Hash of Key” ).
计算的归约端也被分区。虽然映射任务的数量由输入文件块的数量决定,但归约任务的数量由作业作者配置(它可以与映射任务的数量不同)。为了确保具有相同键的所有键值对都到达同一个归约器,框架使用键的哈希值来确定哪个归约任务应接收特定的键值对(请参阅“按键的哈希分区”)。
The key-value pairs must be sorted, but the dataset is likely too large to be sorted with a conventional sorting algorithm on a single machine. Instead, the sorting is performed in stages. First, each map task partitions its output by reducer, based on the hash of the key. Each of these partitions is written to a sorted file on the mapper’s local disk, using a technique similar to what we discussed in “SSTables and LSM-Trees” .
键值对必须排序,但数据集可能太大,无法在单台机器上用传统排序算法排序。相反,排序是分阶段进行的。首先,每个映射任务根据键的哈希将其输出按归约器分区。这些分区中的每一个都被写入映射器本地磁盘上的一个已排序文件,使用的技术类似于我们在“SSTables和LSM树”中讨论的技术。
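The map-side half of this staged sort can be sketched as follows. This is a toy simulation under stated assumptions: a stable hash is used (Python's built-in `hash()` is randomized per process, unlike Hadoop's deterministic partitioner), and the per-reducer sorted lists stand in for sorted run files on the mapper's local disk:

```python
import hashlib

NUM_REDUCERS = 3

def reducer_for(key):
    # Deterministic hash, so every mapper routes a given key to the same reducer
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_REDUCERS

# One map task's output: partitioned by reducer, then sorted within each partition
mapper_output = [("cherry", 1), ("apple", 1), ("banana", 1), ("apple", 1)]
partitions = {r: [] for r in range(NUM_REDUCERS)}
for key, value in mapper_output:
    partitions[reducer_for(key)].append((key, value))
for r in partitions:
    partitions[r].sort()   # a sorted run, ready for the reducers to fetch
```

All pairs with the same key land in the same partition, and each partition is sorted, so reducers can later merge these runs without re-sorting.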
Whenever a mapper finishes reading its input file and writing its sorted output files, the MapReduce scheduler notifies the reducers that they can start fetching the output files from that mapper. The reducers connect to each of the mappers and download the files of sorted key-value pairs for their partition. The process of partitioning by reducer, sorting, and copying data partitions from mappers to reducers is known as the shuffle [ 26 ] (a confusing term—unlike shuffling a deck of cards, there is no randomness in MapReduce).
每当一个映射器完成读取其输入文件并写出其排序后的输出文件时,MapReduce调度程序就会通知归约器可以开始从该映射器获取输出文件。归约器连接到每个映射器,并下载属于其分区的已排序键值对文件。这个按归约器分区、排序、并将数据分区从映射器复制到归约器的过程被称为混洗(shuffle)[26](这是一个容易引起困惑的术语:与洗牌不同,MapReduce中没有随机性)。
The reduce task takes the files from the mappers and merges them together, preserving the sort order. Thus, if different mappers produced records with the same key, they will be adjacent in the merged reducer input.
归约任务从映射器取得这些文件并将它们合并在一起,同时保留排序顺序。因此,如果不同的映射器生成了具有相同键的记录,这些记录在合并后的归约器输入中将彼此相邻。
The reducer is called with a key and an iterator that incrementally scans over all records with the same key (which may in some cases not all fit in memory). The reducer can use arbitrary logic to process these records, and can generate any number of output records. These output records are written to a file on the distributed filesystem (usually, one copy on the local disk of the machine running the reducer, with replicas on other machines).
reducer被调用时,会传入一个键和一个迭代器,该迭代器会逐步扫描所有具有相同键的记录(有些情况下可能无法全部放入内存)。reducer可以使用任意逻辑来处理这些记录,并且可以生成任意数量的输出记录。这些输出记录将被写入分布式文件系统上的文件中(通常情况下,一份副本存储在运行reducer的机器的本地磁盘上,其他机器上也会有其副本)。
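The merge-then-iterate pattern on the reduce side can be sketched with Python's standard library. The two sorted runs here are hypothetical mapper outputs; `heapq.merge` plays the role of the reducer's merging step, and `groupby` yields exactly the per-key iterator described above:

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Sorted runs of key-value pairs fetched from two different (hypothetical) mappers
run_from_mapper_1 = [("a", 1), ("c", 1)]
run_from_mapper_2 = [("a", 1), ("b", 1)]

# Merging preserves the sort order, so pairs with equal keys become adjacent...
merged = heapq.merge(run_from_mapper_1, run_from_mapper_2)

# ...and the reducer scans each key's records through an iterator, one key at a time
totals = {key: sum(value for _, value in group)
          for key, group in groupby(merged, key=itemgetter(0))}
# totals == {"a": 2, "b": 1, "c": 1}
```

Because `heapq.merge` and `groupby` both work incrementally, the records for a key never need to be held in memory all at once (only streamed past the aggregation), mirroring the reducer's iterator interface.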
MapReduce workflows
The range of problems you can solve with a single MapReduce job is limited. Referring back to the log analysis example, a single MapReduce job could determine the number of page views per URL, but not the most popular URLs, since that requires a second round of sorting.
使用单个MapReduce作业可以解决的问题范围有限。回到日志分析的例子,单个MapReduce作业可以确定每个URL的页面浏览次数,但无法确定最受欢迎的URL,因为这需要进行第二轮排序。
Thus, it is very common for MapReduce jobs to be chained together into workflows , such that the output of one job becomes the input to the next job. The Hadoop MapReduce framework does not have any particular support for workflows, so this chaining is done implicitly by directory name: the first job must be configured to write its output to a designated directory in HDFS, and the second job must be configured to read that same directory name as its input. From the MapReduce framework’s point of view, they are two independent jobs.
因此,将MapReduce作业链接成工作流非常常见,以使一个作业的输出成为下一个作业的输入。 Hadoop MapReduce框架没有对工作流程提供任何特殊支持,因此通过目录名称隐式连接这些操作:第一个作业必须配置为将其输出写入HDFS中的指定目录,第二个作业必须配置为读取该目录名称作为其输入。从MapReduce框架的角度来看,它们是两个独立的作业。
Chained MapReduce jobs are therefore less like pipelines of Unix commands (which pass the output of one process as input to another process directly, using only a small in-memory buffer) and more like a sequence of commands where each command’s output is written to a temporary file, and the next command reads from the temporary file. This design has advantages and disadvantages, which we will discuss in “Materialization of Intermediate State” .
串联的MapReduce作业因此不像Unix命令的管道(它直接将一个进程的输出作为另一个进程的输入传递,只使用一个小的内存缓冲区),而更像一个命令序列,每个命令的输出都被写入临时文件,下一个命令从临时文件中读取。这种设计有优点和缺点,我们将在“中间状态的实现”中讨论。
A batch job’s output is only considered valid when the job has completed successfully (MapReduce discards the partial output of a failed job). Therefore, one job in a workflow can only start when the prior jobs—that is, the jobs that produce its input directories—have completed successfully. To handle these dependencies between job executions, various workflow schedulers for Hadoop have been developed, including Oozie, Azkaban, Luigi, Airflow, and Pinball [ 28 ].
只有批处理作业成功完成(MapReduce会丢弃失败作业的部分输出),作业的输出才被视为有效。因此,在工作流程中,一个作业只有在其前置作业(即生成其输入目录的作业)成功完成后才能开始。为了处理作业执行之间的这些依赖关系,开发了各种Hadoop工作流调度程序,包括Oozie、Azkaban、Luigi、Airflow和Pinball[28]。
These schedulers also have management features that are useful when maintaining a large collection of batch jobs. Workflows consisting of 50 to 100 MapReduce jobs are common when building recommendation systems [ 29 ], and in a large organization, many different teams may be running different jobs that read each other’s output. Tool support is important for managing such complex dataflows.
这些调度程序还具有管理特性,可在维护大量批处理作业时非常有用。在构建推荐系统时,包含50到100个MapReduce作业的工作流程很常见[29],在大型组织中,许多不同的团队可能在运行读取彼此输出的不同作业。工具支持对于管理如此复杂的数据流至关重要。
Various higher-level tools for Hadoop, such as Pig [ 30 ], Hive [ 31 ], Cascading [ 32 ], Crunch [ 33 ], and FlumeJava [ 34 ], also set up workflows of multiple MapReduce stages that are automatically wired together appropriately.
Hadoop的各种高级工具,如Pig [30]、Hive [31]、Cascading [32]、Crunch [33]和FlumeJava [34],也会建立由多个MapReduce阶段组成的工作流,并自动地将这些阶段适当连接起来。
Reduce-Side Joins and Grouping
We discussed joins in Chapter 2 in the context of data models and query languages, but we have not delved into how joins are actually implemented. It is time that we pick up that thread again.
我们在第2章中讨论了连接,涉及数据模型和查询语言,但我们还没有深入探讨连接是如何实现的。现在是我们重新开始探讨这个话题的时候了。
In many datasets it is common for one record to have an association with another record: a foreign key in a relational model, a document reference in a document model, or an edge in a graph model. A join is necessary whenever you have some code that needs to access records on both sides of that association (both the record that holds the reference and the record being referenced). As discussed in Chapter 2 , denormalization can reduce the need for joins but generally not remove it entirely.
在许多数据集中,一条记录与另一条记录存在关联是很常见的:在关系模型中是外键,在文档模型中是文档引用,在图模型中是边。每当您的某些代码需要访问该关联两边的记录(既包括持有引用的记录,也包括被引用的记录)时,就需要进行连接。正如第2章所讨论的那样,反规范化可以减少对连接的需求,但通常不能完全消除它。
In a database, if you execute a query that involves only a small number of records, the database will typically use an index to quickly locate the records of interest (see Chapter 3 ). If the query involves joins, it may require multiple index lookups. However, MapReduce has no concept of indexes—at least not in the usual sense.
在数据库中,如果您执行仅涉及少量记录的查询,数据库通常会使用索引快速定位感兴趣的记录(请参见第3章)。如果查询涉及连接,则可能需要进行多个索引查找。然而,MapReduce没有索引的概念-至少不是通常意义上的。
When a MapReduce job is given a set of files as input, it reads the entire content of all of those files; a database would call this operation a full table scan . If you only want to read a small number of records, a full table scan is outrageously expensive compared to an index lookup. However, in analytic queries (see “Transaction Processing or Analytics?” ) it is common to want to calculate aggregates over a large number of records. In this case, scanning the entire input might be quite a reasonable thing to do, especially if you can parallelize the processing across multiple machines.
当MapReduce作业接收一组文件作为输入时,它会读取所有这些文件的全部内容;数据库将这种操作称为全表扫描。如果您只想读取少量记录,与索引查找相比,全表扫描的代价非常昂贵。但是,在分析查询中(请参见“事务处理还是分析?”),通常需要计算大量记录的聚合值。在这种情况下,扫描整个输入可能是相当合理的做法,特别是如果您可以在多台机器上并行处理。
When we talk about joins in the context of batch processing, we mean resolving all occurrences of some association within a dataset. For example, we assume that a job is processing the data for all users simultaneously, not merely looking up the data for one particular user (which would be done far more efficiently with an index).
在批量处理的上下文中,当我们谈论连接时,我们指的是解析数据集中某个关联的所有出现。例如,我们假设作业正在同时处理所有用户的数据,而不仅仅是查找某个特定用户的数据(如果使用索引将更高效)。
Example: analysis of user activity events
A typical example of a join in a batch job is illustrated in Figure 10-2 . On the left is a log of events describing the things that logged-in users did on a website (known as activity events or clickstream data ), and on the right is a database of users. You can think of this example as being part of a star schema (see “Stars and Snowflakes: Schemas for Analytics” ): the log of events is the fact table, and the user database is one of the dimensions.
批处理作业中连接的一个典型示例如图10-2所示。左边是描述登录用户在网站上所做事情的事件日志(称为活动事件或点击流数据),右边是用户数据库。您可以将这个例子视为星型模式的一部分(参见“星型和雪花:用于分析的模式”):事件日志是事实表,用户数据库是其中一个维度。
An analytics task may need to correlate user activity with user profile information: for example, if the profile contains the user’s age or date of birth, the system could determine which pages are most popular with which age groups. However, the activity events contain only the user ID, not the full user profile information. Embedding that profile information in every single activity event would most likely be too wasteful. Therefore, the activity events need to be joined with the user profile database.
一个分析任务可能需要将用户活动与用户个人资料信息相关联:例如,如果个人资料包含用户的年龄或出生日期,则系统可以确定哪些页面最受哪个年龄组的欢迎。然而,活动事件仅包含用户ID,而不包含完整的用户个人资料信息。将该资料信息嵌入每一个活动事件中很可能会太浪费资源。因此,需要将活动事件与用户个人资料数据库连接。
The simplest implementation of this join would go over the activity events one by one and query the user database (on a remote server) for every user ID it encounters. This is possible, but it would most likely suffer from very poor performance: the processing throughput would be limited by the round-trip time to the database server, the effectiveness of a local cache would depend very much on the distribution of data, and running a large number of queries in parallel could easily overwhelm the database [ 35 ].
这种连接的最简单实现方式是逐个遍历活动事件,并为遇到的每个用户ID查询远程服务器上的用户数据库。虽然这是可能的,但很可能会有非常差的性能:处理吞吐量将受到到数据库服务器往返时间的限制,本地缓存的有效性将在很大程度上取决于数据的分布,而同时运行大量查询可能很容易压垮数据库[35]。
In order to achieve good throughput in a batch process, the computation must be (as much as possible) local to one machine. Making random-access requests over the network for every record you want to process is too slow. Moreover, querying a remote database would mean that the batch job becomes nondeterministic, because the data in the remote database might change.
为了在批处理过程中获得良好的吞吐量,计算必须尽可能地局限在一台机器上。对于每个要处理的记录进行随机访问网络请求太慢。此外,查询远程数据库将意味着批作业变得非确定性,因为远程数据库中的数据可能会更改。
Thus, a better approach would be to take a copy of the user database (for example, extracted from a database backup using an ETL process—see “Data Warehousing” ) and to put it in the same distributed filesystem as the log of user activity events. You would then have the user database in one set of files in HDFS and the user activity records in another set of files, and could use MapReduce to bring together all of the relevant records in the same place and process them efficiently.
因此,更好的方法是将用户数据库的副本(例如,使用ETL流程从数据库备份中提取 - 见“数据仓库”)放在与用户活动事件日志相同的分布式文件系统中。你可以将用户数据库存在HDFS中的一个文件集合中,将用户活动记录存在另一个文件集合中,并使用MapReduce将所有相关记录汇集在同一位置并有效地处理它们。
Sort-merge joins
Recall that the purpose of the mapper is to extract a key and value from each input record. In the case of Figure 10-2 , this key would be the user ID: one set of mappers would go over the activity events (extracting the user ID as the key and the activity event as the value), while another set of mappers would go over the user database (extracting the user ID as the key and the user’s date of birth as the value). This process is illustrated in Figure 10-3 .
重申一下,映射器的目的是从每个输入记录中提取一个键和一个值。在图10-2中,这个键将是用户ID:一组映射器将处理活动事件(提取用户ID作为键,活动事件作为值),而另一组映射器将处理用户数据库(提取用户ID作为键,用户的出生日期作为值)。这个过程在图10-3中说明。
When the MapReduce framework partitions the mapper output by key and then sorts the key-value pairs, the effect is that all the activity events and the user record with the same user ID become adjacent to each other in the reducer input. The MapReduce job can even arrange the records to be sorted such that the reducer always sees the record from the user database first, followed by the activity events in timestamp order—this technique is known as a secondary sort [ 26 ].
当MapReduce框架按键对映射器输出进行分区,然后对键值对进行排序时,其效果是具有相同用户ID的所有活动事件和用户记录在归约器输入中彼此相邻。MapReduce作业甚至可以对记录的排序做出安排,使归约器总是先看到来自用户数据库的记录,然后按时间戳顺序看到活动事件:这种技术被称为二次排序(secondary sort)[26]。
The reducer can then perform the actual join logic easily: the reducer function is called once for every user ID, and thanks to the secondary sort, the first value is expected to be the date-of-birth record from the user database. The reducer stores the date of birth in a local variable and then iterates over the activity events with the same user ID, outputting pairs of viewed-url and viewer-age-in-years . Subsequent MapReduce jobs could then calculate the distribution of viewer ages for each URL, and cluster by age group.
Reducer能够轻松执行实际的连接逻辑:Reducer函数针对每个用户ID只被调用一次,得益于次要排序,期望第一个值是来自用户数据库的出生日期记录。Reducer在本地变量中存储出生日期,然后迭代相同用户ID的活动事件,输出查看的URL和观看者年龄。随后的MapReduce作业可以计算每个URL的观众年龄分布,并按年龄分组。
Since the reducer processes all of the records for a particular user ID in one go, it only needs to keep one user record in memory at any one time, and it never needs to make any requests over the network. This algorithm is known as a sort-merge join , since mapper output is sorted by key, and the reducers then merge together the sorted lists of records from both sides of the join.
由于Reducer一次处理特定用户ID的所有记录,所以它每次只需要在内存中保持一个用户记录,而且不需要通过网络进行任何请求。这个算法被称为排序合并连接,因为mapper输出按键排序,然后reducer将连接的两侧的排序记录列表合并在一起。
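A single-process sketch of this sort-merge join follows. The data is illustrative; a numeric tag (0 for the user record, 1 for activity events) stands in for Hadoop's secondary sort, and for simplicity the join emits the raw date of birth rather than computing the viewer's age in years:

```python
from itertools import groupby
from operator import itemgetter

# Mapper output from both join inputs, keyed by user ID. The tag makes the
# user-database record sort before that user's activity events.
user_db  = [("u1", (0, "1990-01-01")), ("u2", (0, "1985-06-15"))]
activity = [("u2", (1, "/page/a")), ("u1", (1, "/page/b")), ("u1", (1, "/page/c"))]

shuffled = sorted(user_db + activity)   # the framework's sort by (user_id, tag)

def join_reducer(records):
    for user_id, group in groupby(records, key=itemgetter(0)):
        group = iter(group)
        _, (tag, date_of_birth) = next(group)   # secondary sort guarantees the
        assert tag == 0                         # user record arrives first
        for _, (_, url) in group:
            yield url, date_of_birth            # one output pair per activity event

joined = list(join_reducer(shuffled))
# joined == [("/page/b", "1990-01-01"), ("/page/c", "1990-01-01"),
#            ("/page/a", "1985-06-15")]
```

Only one user's record is held in a local variable at any moment, and no network requests are made inside the reducer, which is the source of the sort-merge join's high throughput.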
Bringing related data together in the same place
In a sort-merge join, the mappers and the sorting process make sure that all the necessary data to perform the join operation for a particular user ID is brought together in the same place: a single call to the reducer. Having lined up all the required data in advance, the reducer can be a fairly simple, single-threaded piece of code that can churn through records with high throughput and low memory overhead.
在排序合并连接中,映射器和排序过程确保了执行特定用户ID的连接操作所需的所有数据都被汇集到同一个地方:对归约器的一次调用。由于所需的数据已事先排列好,归约器可以是一段相当简单的单线程代码,能够以高吞吐量和低内存开销处理记录。
One way of looking at this architecture is that mappers “send messages” to the reducers. When a mapper emits a key-value pair, the key acts like the destination address to which the value should be delivered. Even though the key is just an arbitrary string (not an actual network address like an IP address and port number), it behaves like an address: all key-value pairs with the same key will be delivered to the same destination (a call to the reducer).
看待这种架构的一种方式是,映射器向归约器“发送消息”。当映射器发出一个键值对时,键的作用就像值应被投递到的目标地址。尽管键只是一个任意字符串(而不是像IP地址和端口号那样的实际网络地址),它的行为却像一个地址:所有具有相同键的键值对都将被投递到同一个目的地(对归约器的一次调用)。
Using the MapReduce programming model has separated the physical network communication aspects of the computation (getting the data to the right machine) from the application logic (processing the data once you have it). This separation contrasts with the typical use of databases, where a request to fetch data from a database often occurs somewhere deep inside a piece of application code [ 36 ]. Since MapReduce handles all network communication, it also shields the application code from having to worry about partial failures, such as the crash of another node: MapReduce transparently retries failed tasks without affecting the application logic.
使用MapReduce编程模型将计算的物理网络通信方面(将数据传送到正确的计算机)与应用逻辑(一旦获得数据则处理数据)分开。这种分离与典型的数据库使用不同,数据库中从数据库提取数据的请求通常发生在应用程序代码的深处[36]。由于MapReduce处理所有网络通信,因此它还保护应用程序代码免受部分故障的影响,例如其他节点的崩溃:MapReduce在不影响应用程序逻辑的情况下自动重试失败的任务。
GROUP BY
Besides joins, another common use of the “bringing related data to the same place” pattern is grouping records by some key (as in the GROUP BY clause in SQL). All records with the same key form a group, and the next step is often to perform some kind of aggregation within each group—for example:
除了连接之外,“将相关数据放置在同一位置”模式的另一个常见用途是按某个键对记录进行分组(如SQL中的GROUP BY子句)。具有相同键的所有记录形成一个组,下一步通常是在每个组内执行某种聚合,例如:
- Counting the number of records in each group (like in our example of counting page views, which you would express as a COUNT(*) aggregation in SQL)
  计算每个组中的记录数(例如在统计页面浏览次数的示例中,您可以将其表示为SQL中的COUNT(*)聚合)
- Adding up the values in one particular field (SUM(fieldname)) in SQL
  在SQL中对某个特定字段的值求和(SUM(fieldname))
- Picking the top k records according to some ranking function
  根据某个排名函数选择前k条记录
The simplest way of implementing such a grouping operation with MapReduce is to set up the mappers so that the key-value pairs they produce use the desired grouping key. The partitioning and sorting process then brings together all the records with the same key in the same reducer. Thus, grouping and joining look quite similar when implemented on top of MapReduce.
使用MapReduce实现此类分组操作的最简单方法是设置映射器,使其生成的键值对使用所需的分组键。然后,分区和排序过程将所有具有相同键的记录汇集到同一个减速器中。因此,当在MapReduce的基础上实现分组和连接时,它们看起来非常相似。
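Implemented this way, a GROUP BY aggregation might look like the following single-process sketch (illustrative data; the `sorted` call stands in for the framework's partitioning and sorting):

```python
from itertools import groupby
from operator import itemgetter

# Mapper output: one (url, viewer_age) pair per page view
mapper_output = [("/home", 34), ("/about", 52), ("/home", 29), ("/home", 41)]

# Partitioning and sorting bring all records with the same key together...
shuffled = sorted(mapper_output, key=itemgetter(0))

# ...so the reducer can compute per-group aggregates, like a SQL GROUP BY
stats = {}
for url, group in groupby(shuffled, key=itemgetter(0)):
    ages = [age for _, age in group]
    stats[url] = {"count": len(ages), "avg_age": sum(ages) / len(ages)}
```

Structurally this is the same dataflow as the join examples: only the reducer's per-group logic differs (aggregation instead of matching records from two inputs).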
Another common use for grouping is collating all the activity events for a particular user session, in order to find out the sequence of actions that the user took—a process called sessionization [ 37 ]. For example, such analysis could be used to work out whether users who were shown a new version of your website are more likely to make a purchase than those who were shown the old version (A/B testing), or to calculate whether some marketing activity is worthwhile.
分组的另一个常见用途是将特定用户会话的所有活动事件整理到一起,以找出用户所采取的操作序列,这个过程被称为会话化(sessionization)[37]。例如,这种分析可以用于确定被展示新版网站的用户是否比被展示旧版的用户更有可能购买(A/B测试),或者计算某项营销活动是否值得。
If you have multiple web servers handling user requests, the activity events for a particular user are most likely scattered across various different servers’ log files. You can implement sessionization by using a session cookie, user ID, or similar identifier as the grouping key and bringing all the activity events for a particular user together in one place, while distributing different users’ events across different partitions.
如果您有多个Web服务器处理用户请求,则特定用户的活动事件很可能散布在不同的服务器日志文件中。您可以通过使用会话Cookie、用户ID或类似的标识符作为分组键,将特定用户的所有活动事件集中在一个地方,同时将不同用户的事件分布在不同的分区中来实现会话化。
Handling skew
The pattern of “bringing all records with the same key to the same place” breaks down if there is a very large amount of data related to a single key. For example, in a social network, most users might be connected to a few hundred people, but a small number of celebrities may have many millions of followers. Such disproportionately active database records are known as linchpin objects [ 38 ] or hot keys .
如果与单个键相关的数据量非常大,“将所有具有相同键的记录放在同一个位置”的模式就会失效。例如,在社交网络中,大多数用户可能只与几百人相连,但少数名人可能拥有数百万粉丝。这种异常活跃的数据库记录被称为枢纽对象(linchpin objects)[38]或热键(hot keys)。
Collecting all activity related to a celebrity (e.g., replies to something they posted) in a single reducer can lead to significant skew (also known as hot spots )—that is, one reducer that must process significantly more records than the others (see “Skewed Workloads and Relieving Hot Spots” ). Since a MapReduce job is only complete when all of its mappers and reducers have completed, any subsequent jobs must wait for the slowest reducer to complete before they can start.
将与某个名人相关的所有活动(例如对其发布内容的回复)收集到单个归约器中可能会导致显著的偏斜(也称为热点),即某一个归约器必须处理比其他归约器多得多的记录(请参见“偏斜工作负载和缓解热点”)。由于MapReduce作业只有在其所有映射器和归约器都完成后才算完成,因此任何后续作业都必须等待最慢的归约器完成后才能开始。
If a join input has hot keys, there are a few algorithms you can use to compensate. For example, the skewed join method in Pig first runs a sampling job to determine which keys are hot [ 39 ]. When performing the actual join, the mappers send any records relating to a hot key to one of several reducers, chosen at random (in contrast to conventional MapReduce, which chooses a reducer deterministically based on a hash of the key). For the other input to the join, records relating to the hot key need to be replicated to all reducers handling that key [ 40 ].
如果连接的输入存在热键,可以使用一些算法进行补偿。例如,Pig中的偏斜连接(skewed join)方法会首先运行一个采样作业来确定哪些键是热键[39]。在执行实际连接时,映射器将与热键相关的任何记录发送到随机选择的若干归约器之一(与之相对,传统的MapReduce基于键的哈希值确定性地选择归约器)。对于连接的另一个输入,与热键相关的记录则需要被复制到处理该键的所有归约器上[40]。
This technique spreads the work of handling the hot key over several reducers, which allows it to be parallelized better, at the cost of having to replicate the other join input to multiple reducers. The sharded join method in Crunch is similar, but requires the hot keys to be specified explicitly rather than using a sampling job. This technique is also very similar to one we discussed in “Skewed Workloads and Relieving Hot Spots” , using randomization to alleviate hot spots in a partitioned database.
这种技术将处理热键的工作分散到多个归约器上,使其可以更好地并行化,代价是必须将连接的另一个输入复制到多个归约器。Crunch中的分片连接(sharded join)方法与之类似,但需要显式指定热键,而不是使用采样作业。这种技术也与我们在“偏斜工作负载和缓解热点”中讨论的技术非常相似,即使用随机化来缓解分区数据库中的热点。
Hive’s skewed join optimization takes an alternative approach. It requires hot keys to be specified explicitly in the table metadata, and it stores records related to those keys in separate files from the rest. When performing a join on that table, it uses a map-side join (see the next section) for the hot keys.
Hive 的倾斜连接优化采取了一种替代方法。它要求在表元数据中明确指定热点键,并将与这些键相关的记录存储在与其余记录分开的文件中。在对该表执行连接时,它使用 map-side join(参见下一节)来处理热点键。
When grouping records by a hot key and aggregating them, you can perform the grouping in two stages. The first MapReduce stage sends records to a random reducer, so that each reducer performs the grouping on a subset of records for the hot key and outputs a more compact aggregated value per key. The second MapReduce job then combines the values from all of the first-stage reducers into a single value per key.
当按热键对记录进行分组和聚合时,可以分两个阶段执行分组。 第一个MapReduce阶段将记录发送到随机归约器,以便每个归约器对热键的记录子集执行分组,并为每个键输出更紧凑的聚合值。 第二个MapReduce作业然后将所有第一阶段归约器的值合并为每个键的单个值。
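The two-stage aggregation can be sketched as follows (toy data; `random.seed` makes the example deterministic, and `Counter` stands in for each stage's per-key aggregation):

```python
import random
from collections import Counter

random.seed(0)
NUM_REDUCERS = 4

events = [("celebrity", 1)] * 1000 + [("normal_user", 1)] * 3

# Stage 1: scatter records across *random* reducers; each reducer pre-aggregates
# its subset, turning 1000 hot-key records into at most NUM_REDUCERS partial sums
stage1_partials = [Counter() for _ in range(NUM_REDUCERS)]
for key, value in events:
    stage1_partials[random.randrange(NUM_REDUCERS)][key] += value

# Stage 2: combine the compact partial aggregates into one value per key
final = Counter()
for partial in stage1_partials:
    final.update(partial)
# final["celebrity"] == 1000, final["normal_user"] == 3
```

The trick is that the stage-1 output is tiny regardless of how hot the key is, so the stage-2 reducer that handles the hot key no longer receives a disproportionate amount of data.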
Map-Side Joins
The join algorithms described in the last section perform the actual join logic in the reducers, and are hence known as reduce-side joins . The mappers take the role of preparing the input data: extracting the key and value from each input record, assigning the key-value pairs to a reducer partition, and sorting by key.
上一节描述的连接算法在归约器中执行实际的连接逻辑,因此被称为Reduce端连接(reduce-side join)。映射器承担准备输入数据的角色:从每条输入记录中提取键和值,将键值对分配给某个归约器分区,并按键排序。
The reduce-side approach has the advantage that you do not need to make any assumptions about the input data: whatever its properties and structure, the mappers can prepare the data to be ready for joining. However, the downside is that all that sorting, copying to reducers, and merging of reducer inputs can be quite expensive. Depending on the available memory buffers, data may be written to disk several times as it passes through the stages of MapReduce [ 37 ].
Reduce端方法的优点是您无需对输入数据做出任何假设:无论其属性和结构如何,映射器都可以准备好数据以供连接。然而,缺点是所有这些排序、复制到归约器以及合并归约器输入的操作可能相当昂贵。取决于可用的内存缓冲区,数据在经过MapReduce各个阶段时可能会被多次写入磁盘[37]。
On the other hand, if you can make certain assumptions about your input data, it is possible to make joins faster by using a so-called map-side join . This approach uses a cut-down MapReduce job in which there are no reducers and no sorting. Instead, each mapper simply reads one input file block from the distributed filesystem and writes one output file to the filesystem—that is all.
另一方面,如果您能对输入数据做出某些假设,就可以使用所谓的Map端连接(map-side join)来加快连接速度。这种方法使用一个简化的MapReduce作业,其中没有归约器,也没有排序。相反,每个映射器只是从分布式文件系统中读取一个输入文件块,并向文件系统写出一个输出文件,仅此而已。
Broadcast hash joins
The simplest way of performing a map-side join applies in the case where a large dataset is joined with a small dataset. In particular, the small dataset needs to be small enough that it can be loaded entirely into memory in each of the mappers.
最简单的地图端连接方式适用于将大型数据集与小型数据集连接的情况。特别是,小数据集需要足够小,以便可以在每个映射器中完全加载到内存中。
For example, imagine in the case of Figure 10-2 that the user database is small enough to fit in memory. In this case, when a mapper starts up, it can first read the user database from the distributed filesystem into an in-memory hash table. Once this is done, the mapper can scan over the user activity events and simply look up the user ID for each event in the hash table.
例如,假设在图10-2的情况下,用户数据库足够小可以放入内存中。这种情况下,当一个Mapper启动时,它可以先将用户数据库从分布式文件系统读取到内存哈希表中。一旦完成这个操作,Mapper可以扫描用户活动事件并在哈希表中查找每个事件的用户ID。
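The per-mapper logic reduces to a hash-table probe, as in this sketch (names and data are illustrative; the dict stands in for the hash table built when the mapper starts up):

```python
# Small join input: the user database, loaded entirely into an in-memory
# hash table when the mapper starts up
users_by_id = {"u1": "1990-01-01", "u2": "1985-06-15"}

# Large join input: one file block of activity events, streamed one at a time
activity_block = [("u1", "/page/a"), ("u2", "/page/b"), ("u1", "/page/c")]

# The mapper just probes the hash table per event: no sorting, no reducers
joined = [(url, users_by_id[user_id]) for user_id, url in activity_block]
# joined == [("/page/a", "1990-01-01"), ("/page/b", "1985-06-15"),
#            ("/page/c", "1990-01-01")]
```

Compared with the reduce-side sort-merge join, the entire shuffle stage disappears; the cost is that every mapper must hold a full copy of the small input in memory.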
There can still be several map tasks: one for each file block of the large input to the join (in the example of Figure 10-2 , the activity events are the large input). Each of these mappers loads the small input entirely into memory.
仍然可以有多个映射任务:连接的大输入的每个文件块对应一个映射任务(在图10-2的示例中,活动事件是大输入)。这些映射器中的每一个都将小输入完整地加载到内存中。
This simple but effective algorithm is called a broadcast hash join : the word broadcast reflects the fact that each mapper for a partition of the large input reads the entirety of the small input (so the small input is effectively “broadcast” to all partitions of the large input), and the word hash reflects its use of a hash table. This join method is supported by Pig (under the name “replicated join”), Hive (“MapJoin”), Cascading, and Crunch. It is also used in data warehouse query engines such as Impala [ 41 ].
这个简单而有效的算法被称为广播哈希连接(broadcast hash join):“广播”一词反映了这样一个事实,即大输入的每个分区的映射器都会读取小输入的全部内容(因此小输入实际上被“广播”到大输入的所有分区),“哈希”一词则反映了它对哈希表的使用。这种连接方法由Pig(名为“replicated join”)、Hive(“MapJoin”)、Cascading和Crunch支持。它也被用于Impala等数据仓库查询引擎[41]。
Instead of loading the small join input into an in-memory hash table, an alternative is to store the small join input in a read-only index on the local disk [ 42 ]. The frequently used parts of this index will remain in the operating system’s page cache, so this approach can provide random-access lookups almost as fast as an in-memory hash table, but without actually requiring the dataset to fit in memory.
除了将小的连接输入加载到内存哈希表之外,另一种方法是将小的连接输入存储在本地磁盘上的只读索引中[42]。该索引中经常使用的部分会保留在操作系统的页面缓存中,因此这种方法可以提供几乎与内存哈希表一样快的随机访问查找,却并不要求数据集能放入内存。
Partitioned hash joins
If the inputs to the map-side join are partitioned in the same way, then the hash join approach can be applied to each partition independently. In the case of Figure 10-2 , you might arrange for the activity events and the user database to each be partitioned based on the last decimal digit of the user ID (so there are 10 partitions on either side). For example, mapper 3 first loads all users with an ID ending in 3 into a hash table, and then scans over all the activity events for each user whose ID ends in 3.
如果Map端连接的输入以相同的方式分区,那么哈希连接方法可以独立地应用于每个分区。在图10-2的情况下,您可以安排活动事件和用户数据库都基于用户ID的最后一位十进制数字进行分区(因此两边各有10个分区)。例如,3号映射器首先将所有ID以3结尾的用户加载到哈希表中,然后扫描ID以3结尾的每个用户的所有活动事件。
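A sketch of the last-decimal-digit scheme (illustrative user IDs; in practice the partition function would be the same hash used by the jobs that produced the inputs):

```python
def partition_of(user_id):
    # Both inputs are partitioned by the last decimal digit of the user ID
    return int(user_id[-1])

users    = [("user3", "1990-01-01"), ("user13", "1988-03-02"), ("user7", "1979-12-30")]
activity = [("user3", "/a"), ("user13", "/b"), ("user7", "/c")]

def join_partition(p):
    # Mapper p loads only its partition of the small input into a hash table...
    table = {uid: dob for uid, dob in users if partition_of(uid) == p}
    # ...and scans only the corresponding partition of the large input
    return [(url, table[uid]) for uid, url in activity if partition_of(uid) == p]

joined = join_partition(3)   # handles every user ID ending in 3
# joined == [("/a", "1990-01-01"), ("/b", "1988-03-02")]
```

Each mapper's hash table holds roughly one-tenth of the small input rather than all of it, which is the whole point of the partitioned variant.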
If the partitioning is done correctly, you can be sure that all the records you might want to join are located in the same numbered partition, and so it is sufficient for each mapper to only read one partition from each of the input datasets. This has the advantage that each mapper can load a smaller amount of data into its hash table.
如果分区正确执行,则可以确保您想要加入的所有记录位于相同编号的分区中,因此对于每个映射器,从每个输入数据集中仅读取一个分区就足够了。这有一个优点,即每个映射器可以将较少的数据加载到其哈希表中。
This approach only works if both of the join’s inputs have the same number of partitions, with records assigned to partitions based on the same key and the same hash function. If the inputs are generated by prior MapReduce jobs that already perform this grouping, then this can be a reasonable assumption to make.
这种方法只有在连接的两个输入具有相同数量的分区、并且记录是根据相同的键和相同的哈希函数分配到分区时才有效。如果输入是由先前已执行这种分组的MapReduce作业生成的,那么这就是一个合理的假设。
Partitioned hash joins are known as bucketed map joins in Hive [ 37 ].
分区哈希连接在Hive中被称为分桶Map连接(bucketed map join)[37]。
Map-side merge joins
Another variant of a map-side join applies if the input datasets are not only partitioned in the same way, but also sorted based on the same key. In this case, it does not matter whether the inputs are small enough to fit in memory, because a mapper can perform the same merging operation that would normally be done by a reducer: reading both input files incrementally, in order of ascending key, and matching records with the same key.
Map端连接的另一种变体适用于输入数据集不仅以相同方式分区、而且还基于相同的键排序的情况。在这种情况下,输入是否小到足以放入内存并不重要,因为映射器可以执行通常由归约器完成的同样的合并操作:按键的升序增量地读取两个输入文件,并匹配具有相同键的记录。
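The incremental merge over two sorted inputs can be sketched as a generator (illustrative data; real inputs would be file streams read block by block, and a full implementation would also handle multiple user records per key):

```python
def merge_join(sorted_users, sorted_events):
    # Both inputs are sorted by user ID; read them incrementally in lockstep
    users = iter(sorted_users)
    uid, dob = next(users, (None, None))
    for event_uid, url in sorted_events:
        while uid is not None and uid < event_uid:
            uid, dob = next(users, (None, None))   # advance the user side
        if uid == event_uid:
            yield url, dob                         # keys match: join the records

users_sorted  = [("u1", "1990-01-01"), ("u2", "1985-06-15"), ("u4", "1970-07-07")]
events_sorted = [("u1", "/a"), ("u2", "/b"), ("u2", "/c"), ("u3", "/d")]
joined = list(merge_join(users_sorted, events_sorted))
# joined == [("/a", "1990-01-01"), ("/b", "1985-06-15"), ("/c", "1985-06-15")]
```

Only the current record from each input needs to be in memory at any time, which is why this variant works even when neither input fits in memory.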
If a map-side merge join is possible, it probably means that prior MapReduce jobs brought the input datasets into this partitioned and sorted form in the first place. In principle, this join could have been performed in the reduce stage of the prior job. However, it may still be appropriate to perform the merge join in a separate map-only job, for example if the partitioned and sorted datasets are also needed for other purposes besides this particular join.
如果能够进行Map端合并连接,这通常意味着先前的MapReduce作业已经把输入数据集整理成了这种分区且排序的形式。原则上,这个连接本可以在先前作业的Reduce阶段执行。但是,在一个单独的仅Map作业中执行合并连接可能仍然是合适的,例如,当这些分区且排序的数据集除了这次特定的连接之外还有其他用途时。
MapReduce workflows with map-side joins
When the output of a MapReduce join is consumed by downstream jobs, the choice of map-side or reduce-side join affects the structure of the output. The output of a reduce-side join is partitioned and sorted by the join key, whereas the output of a map-side join is partitioned and sorted in the same way as the large input (since one map task is started for each file block of the join’s large input, regardless of whether a partitioned or broadcast join is used).
当MapReduce连接的输出被下游作业消费时,选择Map端连接还是Reduce端连接会影响输出的结构。Reduce端连接的输出按连接键分区和排序,而Map端连接的输出则按与大输入相同的方式分区和排序(因为无论使用分区连接还是广播连接,连接的大输入的每个文件块都会启动一个Map任务)。
As discussed, map-side joins also make more assumptions about the size, sorting, and partitioning of their input datasets. Knowing about the physical layout of datasets in the distributed filesystem becomes important when optimizing join strategies: it is not sufficient to just know the encoding format and the name of the directory in which the data is stored; you must also know the number of partitions and the keys by which the data is partitioned and sorted.
根据讨论,由于映射端关联操作需要考虑输入数据集的大小、排序和分片情况,因此对于分布式文件系统中数据集的物理布局,我们需要了解得更加清楚,以便优化联接策略。知道数据集的编码格式和存储目录名称是不够的,您还必须知道分区数量和用于分区和排序的键。
In the Hadoop ecosystem, this kind of metadata about the partitioning of datasets is often maintained in HCatalog and the Hive metastore [ 37 ].
在Hadoop生态系统中,有关数据集分区的此类元数据通常存储在HCatalog和Hive元数据存储中。[37]。
The Output of Batch Workflows
We have talked a lot about the various algorithms for implementing workflows of MapReduce jobs, but we neglected an important question: what is the result of all of that processing, once it is done? Why are we running all these jobs in the first place?
我们已经讨论了许多有关实施MapReduce工作流的各种算法,但我们忽略了一个重要问题:一旦完成所有这些处理,结果是什么?我们为什么首先要运行所有这些作业?
In the case of database queries, we distinguished transaction processing (OLTP) purposes from analytic purposes (see “Transaction Processing or Analytics?” ). We saw that OLTP queries generally look up a small number of records by key, using indexes, in order to present them to a user (for example, on a web page). On the other hand, analytic queries often scan over a large number of records, performing groupings and aggregations, and the output often has the form of a report: a graph showing the change in a metric over time, or the top 10 items according to some ranking, or a breakdown of some quantity into subcategories. The consumer of such a report is often an analyst or a manager who needs to make business decisions.
在数据库查询方面,我们将事务处理(OLTP)目的与分析目的(参见“事务处理还是分析?”)区分开来。我们发现,OLTP查询通常通过索引按键查找少量记录,并将其呈现给用户(例如,在网页上)。另一方面,分析查询通常扫描大量记录,执行分组和聚合,并且输出通常采用报告形式:显示指标随时间变化的图形,或根据某个排名的前10个项目,或将某个数量拆分为子类别。此类报告的消费者通常是分析师或经理,他们需要做出业务决策。
Where does batch processing fit in? It is not transaction processing, nor is it analytics. It is closer to analytics, in that a batch process typically scans over large portions of an input dataset. However, a workflow of MapReduce jobs is not the same as a SQL query used for analytic purposes (see “Comparing Hadoop to Distributed Databases” ). The output of a batch process is often not a report, but some other kind of structure.
批处理处于什么位置?它既不是事务处理,也不是分析。它更接近于分析,因为批处理通常会扫描输入数据集的大部分内容。然而,MapReduce作业的工作流与用于分析目的的SQL查询并不相同(参见“将Hadoop与分布式数据库进行比较”)。批处理的输出通常不是报告,而是某种其他形式的结构。
Building search indexes
Google’s original use of MapReduce was to build indexes for its search engine, which was implemented as a workflow of 5 to 10 MapReduce jobs [ 1 ]. Although Google later moved away from using MapReduce for this purpose [ 43 ], it helps to understand MapReduce if you look at it through the lens of building a search index. (Even today, Hadoop MapReduce remains a good way of building indexes for Lucene/Solr [ 44 ].)
Google最初使用MapReduce是用来建立其搜索引擎的索引,它是由5至10个MapReduce作业构成的工作流程[1]。尽管Google后来不再将MapReduce用于此目的[43],但如果你从构建搜索引擎索引的角度来看MapReduce,它仍然有助于理解。 (即使到今天,Hadoop MapReduce仍然是构建Lucene / Solr索引的好方法[44]。)
We saw briefly in “Full-text search and fuzzy indexes” how a full-text search index such as Lucene works: it is a file (the term dictionary) in which you can efficiently look up a particular keyword and find the list of all the document IDs containing that keyword (the postings list). This is a very simplified view of a search index—in reality it requires various additional data, in order to rank search results by relevance, correct misspellings, resolve synonyms, and so on—but the principle holds.
我们在“全文搜索和模糊索引”一节中简要地了解了像Lucene这样的全文搜索索引的工作方式:它是一个文件(词典),您可以在其中高效地查找特定的关键词,并找到包含该关键词的所有文档ID的列表(倒排列表,postings list)。这是对搜索索引的一个非常简化的视图,实际上还需要各种额外的数据,以便按相关性对搜索结果进行排名、纠正拼写错误、解析同义词等等,但其基本原理是成立的。
If you need to perform a full-text search over a fixed set of documents, then a batch process is a very effective way of building the indexes: the mappers partition the set of documents as needed, each reducer builds the index for its partition, and the index files are written to the distributed filesystem. Building such document-partitioned indexes (see “Partitioning and Secondary Indexes” ) parallelizes very well. Since querying a search index by keyword is a read-only operation, these index files are immutable once they have been created.
如果您需要对一组固定的文档执行全文搜索,那么批处理是构建索引的一种非常高效的方式:Mapper按需要对文档集进行分区,每个Reducer为其分区构建索引,最后将索引文件写入分布式文件系统。构建这样的按文档分区的索引(参见“分区与二级索引”)并行化效果非常好。由于按关键字查询搜索索引是只读操作,这些索引文件一旦创建就是不可变的。
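The workflow just described can be sketched in miniature. The following is a simplified, in-memory illustration of the mapper/shuffle/reducer roles in index building, not Lucene's actual implementation:
上述工作流程可以用一个微型示例来勾勒。下面是对索引构建中Mapper/Shuffle/Reducer各自角色的简化内存内示意,并非Lucene的实际实现:

```python
from collections import defaultdict

def mapper(doc_id, text):
    # The "mapper" emits (term, doc_id) pairs for each document.
    for term in set(text.lower().split()):
        yield term, doc_id

def shuffle(pairs):
    # The framework's shuffle groups the emitted pairs by term.
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped

def reducer(term, doc_ids):
    # Each "reducer" produces the postings list for a term in its
    # partition, sorted so it can be scanned and merged efficiently.
    return term, sorted(doc_ids)

docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}
pairs = [kv for doc_id, text in docs.items() for kv in mapper(doc_id, text)]
index = dict(reducer(t, ids) for t, ids in shuffle(pairs).items())
```

Once written out as files, such an index is immutable, which is what makes the bulk-build approach safe.
这样的索引一旦写成文件便是不可变的,这正是批量构建方式安全的原因。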
If the indexed set of documents changes, one option is to periodically rerun the entire indexing workflow for the entire set of documents, and replace the previous index files wholesale with the new index files when it is done. This approach can be computationally expensive if only a small number of documents have changed, but it has the advantage that the indexing process is very easy to reason about: documents in, indexes out.
如果被索引的文档集合发生变化,一种选择是定期对整个文档集重新运行整个索引工作流,并在完成后用新的索引文件整体替换旧的索引文件。如果只有少量文档发生了更改,这种方法的计算代价可能很高,但它的优点是索引过程非常容易理解:文档进,索引出。
Alternatively, it is possible to build indexes incrementally. As discussed in Chapter 3 , if you want to add, remove, or update documents in an index, Lucene writes out new segment files and asynchronously merges and compacts segment files in the background. We will see more on such incremental processing in Chapter 11 .
另外,也可以逐步建立索引。正如第3章所讨论的,如果您想要添加、删除或更新索引中的文档,Lucene会写入新的段文件,并在后台异步合并和压缩段文件。我们将在第11章中进一步了解这种递增式处理的方法。
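At a very high level, incremental indexing with segment files can be illustrated as follows (a loose sketch of segment merging, not Lucene's actual on-disk format):
从很高的层面看,使用段文件的增量式索引可以如下示意(这只是对段合并的粗略勾勒,并非Lucene实际的磁盘格式):

```python
import heapq

# Each "segment" is a small, sorted list of (term, postings) entries;
# newly indexed documents go into fresh segments.
segment1 = [("dog", [2]), ("quick", [1])]
segment2 = [("dog", [3]), ("lazy", [4])]

def merge_segments(a, b):
    # A background compaction merges segments into one, combining the
    # postings for each term. heapq.merge streams the two sorted
    # segments in term order.
    merged = {}
    for term, postings in heapq.merge(a, b):
        merged.setdefault(term, []).extend(postings)
    return sorted((t, sorted(p)) for t, p in merged.items())

combined = merge_segments(segment1, segment2)
```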
Key-value stores as batch process output
Search indexes are just one example of the possible outputs of a batch processing workflow. Another common use for batch processing is to build machine learning systems such as classifiers (e.g., spam filters, anomaly detection, image recognition) and recommendation systems (e.g., people you may know, products you may be interested in, or related searches [ 29 ]).
搜索索引只是批处理工作流的可能输出之一。批处理的另一个常见用途是构建机器学习系统,例如分类器(如垃圾邮件过滤器、异常检测、图像识别)和推荐系统(如可能认识的人、可能感兴趣的产品或相关搜索[29])。
The output of those batch jobs is often some kind of database: for example, a database that can be queried by user ID to obtain suggested friends for that user, or a database that can be queried by product ID to get a list of related products [ 45 ].
这些批处理作业的输出通常是某种数据库:例如,一个可以按用户ID查询、以获取该用户的好友推荐的数据库,或者一个可以按产品ID查询、以获取相关产品列表的数据库[45]。
These databases need to be queried from the web application that handles user requests, which is usually separate from the Hadoop infrastructure. So how does the output from the batch process get back into a database where the web application can query it?
这些数据库需要由处理用户请求的Web应用程序来查询,而Web应用程序通常与Hadoop基础设施是分开的。那么,批处理的输出如何送回到Web应用程序可以查询的数据库中呢?
The most obvious choice might be to use the client library for your favorite database directly within a mapper or reducer, and to write from the batch job directly to the database server, one record at a time. This will work (assuming your firewall rules allow direct access from your Hadoop environment to your production databases), but it is a bad idea for several reasons:
最明显的选择可能是直接在Mapper或Reducer中使用您喜欢的数据库的客户端库,从批处理作业中一次一条记录地直接写入数据库服务器。这是可行的(假设您的防火墙规则允许从Hadoop环境直接访问生产数据库),但出于以下几个原因,这是一个坏主意:
-
As discussed previously in the context of joins, making a network request for every single record is orders of magnitude slower than the normal throughput of a batch task. Even if the client library supports batching, performance is likely to be poor.
如前面在讨论连接时所述,为每条记录发起一次网络请求,要比批处理任务的正常吞吐量慢几个数量级。即使客户端库支持批量写入,性能也很可能不佳。
-
MapReduce jobs often run many tasks in parallel. If all the mappers or reducers concurrently write to the same output database, with a rate expected of a batch process, that database can easily be overwhelmed, and its performance for queries is likely to suffer. This can in turn cause operational problems in other parts of the system [ 35 ].
MapReduce作业通常并行运行许多任务。如果所有Mapper或Reducer都以批处理预期的速率并发写入同一个输出数据库,该数据库很容易不堪重负,其查询性能可能会受到影响。这反过来又会在系统的其他部分造成运维问题[35]。
-
Normally, MapReduce provides a clean all-or-nothing guarantee for job output: if a job succeeds, the result is the output of running every task exactly once, even if some tasks failed and had to be retried along the way; if the entire job fails, no output is produced. However, writing to an external system from inside a job produces externally visible side effects that cannot be hidden in this way. Thus, you have to worry about the results from partially completed jobs being visible to other systems, and the complexities of Hadoop task attempts and speculative execution.
通常,MapReduce为作业输出提供了一个干净的“全有或全无”保证:如果作业成功,其结果就相当于每个任务恰好执行一次的输出,即使某些任务失败并在中途重试过;如果整个作业失败,则不产生任何输出。然而,从作业内部写入外部系统会产生无法以这种方式隐藏的、外部可见的副作用。因此,您必须考虑部分完成的作业的结果对其他系统可见的问题,以及Hadoop任务尝试与推测执行(speculative execution)的复杂性。
A much better solution is to build a brand-new database inside the batch job and write it as files to the job’s output directory in the distributed filesystem, just like the search indexes in the last section. Those data files are then immutable once written, and can be loaded in bulk into servers that handle read-only queries. Various key-value stores support building database files in MapReduce jobs, including Voldemort [ 46 ], Terrapin [ 47 ], ElephantDB [ 48 ], and HBase bulk loading [ 49 ].
一个好得多的解决方案是在批处理作业内部构建一个全新的数据库,并将其作为文件写入分布式文件系统中该作业的输出目录,就像上一节中的搜索索引一样。这些数据文件一旦写入就不可变,可以批量加载到处理只读查询的服务器中。多种键值存储支持在MapReduce作业中构建数据库文件,包括Voldemort [46]、Terrapin [47]、ElephantDB [48]和HBase批量加载 [49]。
Building these database files is a good use of MapReduce: using a mapper to extract a key and then sorting by that key is already a lot of the work required to build an index. Since most of these key-value stores are read-only (the files can only be written once by a batch job and are then immutable), the data structures are quite simple. For example, they do not require a WAL (see “Making B-trees reliable” ).
构建这些数据库文件是对MapReduce的良好运用:使用Mapper提取键并按该键排序,已经完成了构建索引所需的大部分工作。由于这些键值存储大多是只读的(文件只会被批处理作业写入一次,此后不可变),其数据结构相当简单。例如,它们不需要WAL(参见“让B树更可靠”)。
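To see why such read-only files can be so simple, here is a sketch of a sorted, immutable key-value file with binary-search lookups (an illustration only, not the actual file format of any of the stores mentioned above):
为了说明这类只读文件为何能如此简单,下面勾勒了一个支持二分查找的有序不可变键值文件(仅为示意,并非上述任何存储的实际文件格式):

```python
import bisect
import json
import os
import tempfile

def write_kv_file(path, records):
    # The batch job writes records in sorted key order; the file is
    # never modified afterwards, so no write-ahead log is needed.
    with open(path, "w") as f:
        for key, value in sorted(records.items()):
            f.write(json.dumps([key, value]) + "\n")

class ReadOnlyStore:
    # A read-only server loads the sorted keys and answers point
    # lookups by binary search over the key list.
    def __init__(self, path):
        with open(path) as f:
            rows = [json.loads(line) for line in f]
        self.keys = [k for k, _ in rows]
        self.values = [v for _, v in rows]

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

path = os.path.join(tempfile.mkdtemp(), "part-00000")
write_kv_file(path, {"user:2": ["b", "c"], "user:1": ["a"]})
store = ReadOnlyStore(path)
```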
When loading data into Voldemort, the server continues serving requests to the old data files while the new data files are copied from the distributed filesystem to the server’s local disk. Once the copying is complete, the server atomically switches over to querying the new files. If anything goes wrong in this process, it can easily switch back to the old files again, since they are still there and immutable [ 46 ].
在将数据加载到 Voldemort 时,服务器会在从分布式文件系统将新数据文件复制到服务器本地磁盘的过程中,继续为旧数据文件提供请求服务。一旦复制完成,服务器就会原子地切换到查询新文件。如果在此过程中出现任何问题,可以轻松地切换回旧文件,因为它们仍然存在且不可变[46]。
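The switchover can be sketched with versioned directories and an atomically replaced symlink (an illustrative sketch of the general technique, not Voldemort's actual mechanism; assumes a POSIX filesystem):
这种切换可以用带版本的目录和原子替换的符号链接来勾勒(这只是对一般技术的示意,并非Voldemort的实际机制;假定在POSIX文件系统上):

```python
import os
import tempfile

def publish(base, version, files):
    # Write the new data files into a fresh, versioned directory.
    vdir = os.path.join(base, version)
    os.makedirs(vdir)
    for name, data in files.items():
        with open(os.path.join(vdir, name), "w") as f:
            f.write(data)
    # Then flip a "current" symlink to it. os.replace is an atomic
    # rename on POSIX, so readers see either the old version or the new.
    tmp = os.path.join(base, "current.tmp")
    os.symlink(vdir, tmp)
    os.replace(tmp, os.path.join(base, "current"))

base = tempfile.mkdtemp()
publish(base, "v1", {"data": "old"})
publish(base, "v2", {"data": "new"})  # v1's files remain untouched
with open(os.path.join(base, "current", "data")) as f:
    served = f.read()
```

Rolling back is just pointing the symlink at the previous version, which is still there and immutable.
回滚只需把符号链接重新指向前一个版本即可,它仍然原封不动地存在,且不可变。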
Philosophy of batch process outputs
The Unix philosophy that we discussed earlier in this chapter ( “The Unix Philosophy” ) encourages experimentation by being very explicit about dataflow: a program reads its input and writes its output. In the process, the input is left unchanged, any previous output is completely replaced with the new output, and there are no other side effects. This means that you can rerun a command as often as you like, tweaking or debugging it, without messing up the state of your system.
我们在本章前面讨论过的Unix哲学(“Unix哲学”)鼓励实验,因为它对数据流的处理非常明确:程序读取输入并写出输出。在此过程中,输入保持不变,任何先前的输出都被新输出完全替换,并且没有其他副作用。这意味着您可以随意多次重新运行一条命令、调整或调试它,而不会搞乱系统的状态。
The handling of output from MapReduce jobs follows the same philosophy. By treating inputs as immutable and avoiding side effects (such as writing to external databases), batch jobs not only achieve good performance but also become much easier to maintain:
MapReduce作业的输出处理遵循同样的哲学。通过将输入视为不可变并避免副作用(例如写入外部数据库),批处理作业不仅获得了良好的性能,也变得更容易维护:
-
If you introduce a bug into the code and the output is wrong or corrupted, you can simply roll back to a previous version of the code and rerun the job, and the output will be correct again. Or, even simpler, you can keep the old output in a different directory and simply switch back to it. Databases with read-write transactions do not have this property: if you deploy buggy code that writes bad data to the database, then rolling back the code will do nothing to fix the data in the database. (The idea of being able to recover from buggy code has been called human fault tolerance [ 50 ].)
如果您在代码中引入了一个bug,导致输出错误或损坏,您可以简单地回滚到代码的先前版本并重新运行作业,输出就会重新变得正确。或者,更简单的办法是把旧的输出保留在另一个目录中,然后直接切换回去。具有读写事务的数据库没有这个特性:如果您部署了有bug的代码,向数据库中写入了坏数据,那么回滚代码并不能修复数据库中的数据。(能够从有bug的代码中恢复的想法被称为人为容错,human fault tolerance [50]。)
-
As a consequence of this ease of rolling back, feature development can proceed more quickly than in an environment where mistakes could mean irreversible damage. This principle of minimizing irreversibility is beneficial for Agile software development [ 51 ].
由于能够轻松回滚,特性开发可以比在错误可能造成不可逆损害的环境中推进得更快。这种最小化不可逆性的原则有利于敏捷软件开发[51]。
-
If a map or reduce task fails, the MapReduce framework automatically re-schedules it and runs it again on the same input. If the failure is due to a bug in the code, it will keep crashing and eventually cause the job to fail after a few attempts; but if the failure is due to a transient issue, the fault is tolerated. This automatic retry is only safe because inputs are immutable and outputs from failed tasks are discarded by the MapReduce framework.
如果Map或Reduce任务失败,MapReduce框架会自动重新调度它,并在同样的输入上再次运行。如果失败是由代码中的bug引起的,它会不断崩溃,并在几次尝试后最终导致作业失败;但如果失败是由瞬时问题引起的,该故障就能被容忍。这种自动重试之所以安全,只是因为输入是不可变的,并且失败任务的输出会被MapReduce框架丢弃。
-
The same set of files can be used as input for various different jobs, including monitoring jobs that calculate metrics and evaluate whether a job’s output has the expected characteristics (for example, by comparing it to the output from the previous run and measuring discrepancies).
同一组文件可以用作多个不同作业的输入,其中包括计算指标的监控作业,以及评估作业输出是否具有预期特性的作业(例如,将其与上一次运行的输出进行比较并测量差异)。
-
Like Unix tools, MapReduce jobs separate logic from wiring (configuring the input and output directories), which provides a separation of concerns and enables potential reuse of code: one team can focus on implementing a job that does one thing well, while other teams can decide where and when to run that job.
像Unix工具一样,MapReduce作业将逻辑与布线(配置输入和输出目录)分离,这实现了关注点分离,并使代码有可能被复用:一个团队可以专注于实现一个把一件事做好的作业,而其他团队可以决定何时何地运行该作业。
In these areas, the design principles that worked well for Unix also seem to be working well for Hadoop—but Unix and Hadoop also differ in some ways. For example, because most Unix tools assume untyped text files, they have to do a lot of input parsing (our log analysis example at the beginning of the chapter used {print $7} to extract the URL). On Hadoop, some of those low-value syntactic conversions are eliminated by using more structured file formats: Avro (see “Avro”) and Parquet (see “Column-Oriented Storage”) are often used, as they provide efficient schema-based encoding and allow evolution of their schemas over time (see Chapter 4).
在这些领域,对Unix行之有效的设计原则似乎同样适用于Hadoop,但两者在某些方面也存在差异。例如,由于大多数Unix工具假定输入是无类型的文本文件,它们必须做大量的输入解析工作(本章开头的日志分析示例就使用了{print $7}来提取URL)。在Hadoop上,通过使用更结构化的文件格式,其中一些低价值的语法转换得以省去:Avro(参见“Avro”)和Parquet(参见“列式存储”)经常被使用,因为它们提供了高效的基于模式的编码,并允许模式随时间演进(参见第4章)。
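The contrast can be illustrated like this (the log line follows the format of the chapter's earlier example; the structured record and its field names are invented here for illustration):
这一对比可以如下示意(日志行沿用本章前面示例的格式;结构化记录及其字段名是此处为说明而虚构的):

```python
from collections import namedtuple

log_line = ('216.58.210.78 - - [27/Feb/2015:17:55:11 +0000] '
            '"GET /css/typography.css HTTP/1.1" 200 3377')

# Untyped text: every consumer re-parses the line and picks out the URL
# by position, the equivalent of awk's {print $7}.
url_by_position = log_line.split()[6]

# Structured record: with a schema-based format such as Avro or Parquet,
# the field is stored with a name and a type and accessed directly.
LogRecord = namedtuple(
    "LogRecord", ["client_ip", "timestamp", "method", "url", "status", "size"])
record = LogRecord("216.58.210.78", "27/Feb/2015:17:55:11 +0000",
                   "GET", "/css/typography.css", 200, 3377)
url_by_name = record.url
```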
Comparing Hadoop to Distributed Databases
As we have seen, Hadoop is somewhat like a distributed version of Unix, where HDFS is the filesystem and MapReduce is a quirky implementation of a Unix process (which happens to always run the sort utility between the map phase and the reduce phase). We saw how you can implement various join and grouping operations on top of these primitives.
正如我们所看到的,Hadoop有点像Unix的分布式版本,其中HDFS是文件系统,而MapReduce是Unix进程的一种奇特实现(恰好总是在Map阶段和Reduce阶段之间运行sort工具)。我们看到了如何在这些原语之上实现各种连接和分组操作。
When the MapReduce paper [ 1 ] was published, it was—in some sense—not at all new. All of the processing and parallel join algorithms that we discussed in the last few sections had already been implemented in so-called massively parallel processing (MPP) databases more than a decade previously [ 3 , 40 ]. For example, the Gamma database machine, Teradata, and Tandem NonStop SQL were pioneers in this area [ 52 ].
当MapReduce论文[1]发表时,它在某种意义上根本不算新鲜事物。我们在前几节中讨论的所有处理和并行连接算法,早在十多年前就已经在所谓的大规模并行处理(MPP)数据库中实现了[3, 40]。例如,Gamma数据库机、Teradata和Tandem NonStop SQL都是该领域的先驱[52]。
The biggest difference is that MPP databases focus on parallel execution of analytic SQL queries on a cluster of machines, while the combination of MapReduce and a distributed filesystem [ 19 ] provides something much more like a general-purpose operating system that can run arbitrary programs.
最大的区别在于:MPP数据库专注于在一个机器集群上并行执行分析型SQL查询,而MapReduce与分布式文件系统[19]的组合则提供了更像通用操作系统的东西,可以运行任意程序。
Diversity of storage
Databases require you to structure data according to a particular model (e.g., relational or documents), whereas files in a distributed filesystem are just byte sequences, which can be written using any data model and encoding. They might be collections of database records, but they can equally well be text, images, videos, sensor readings, sparse matrices, feature vectors, genome sequences, or any other kind of data.
数据库要求您根据特定模型(例如,关系型或文档型)来构建数据,而在分布式文件系统中的文件只是字节序列,可以使用任何数据模型和编码进行编写。它们可能是数据库记录的集合,但同样可以是文本、图像、视频、传感器读数、稀疏矩阵、特征向量、基因组序列或任何其他类型的数据。
To put it bluntly, Hadoop opened up the possibility of indiscriminately dumping data into HDFS, and only later figuring out how to process it further [ 53 ]. By contrast, MPP databases typically require careful up-front modeling of the data and query patterns before importing the data into the database’s proprietary storage format.
坦率地说,Hadoop开启了将数据不加区分地转储到HDFS、之后再想办法进一步处理的可能性[53]。相比之下,MPP数据库通常要求在将数据导入数据库的专有存储格式之前,预先对数据和查询模式进行仔细的建模。
From a purist’s point of view, it may seem that this careful modeling and import is desirable, because it means users of the database have better-quality data to work with. However, in practice, it appears that simply making data available quickly—even if it is in a quirky, difficult-to-use, raw format—is often more valuable than trying to decide on the ideal data model up front [ 54 ].
从纯粹主义者的角度来看,这种仔细的建模和导入似乎是可取的,因为这意味着数据库的用户可以使用质量更高的数据。然而在实践中,即使数据是以一种古怪、难用的原始格式存在,快速地让数据可用往往也比试图预先确定理想的数据模型更有价值[54]。
The idea is similar to a data warehouse (see “Data Warehousing” ): simply bringing data from various parts of a large organization together in one place is valuable, because it enables joins across datasets that were previously disparate. The careful schema design required by an MPP database slows down that centralized data collection; collecting data in its raw form, and worrying about schema design later, allows the data collection to be speeded up (a concept sometimes known as a “data lake” or “enterprise data hub” [ 55 ]).
这个想法与数据仓库类似(参见“数据仓库”):仅仅是将来自大型组织各个部分的数据汇集到一处就很有价值,因为这使得以前彼此分散的数据集之间可以进行连接。MPP数据库所要求的仔细模式设计拖慢了这种集中式的数据收集;以原始形式收集数据、稍后再考虑模式设计,可以加快数据收集的速度(这一概念有时被称为“数据湖”或“企业数据中枢”[55])。
Indiscriminate data dumping shifts the burden of interpreting the data: instead of forcing the producer of a dataset to bring it into a standardized format, the interpretation of the data becomes the consumer’s problem (the schema-on-read approach [ 56 ]; see “Schema flexibility in the document model” ). This can be an advantage if the producer and consumers are different teams with different priorities. There may not even be one ideal data model, but rather different views onto the data that are suitable for different purposes. Simply dumping data in its raw form allows for several such transformations. This approach has been dubbed the sushi principle : “raw data is better” [ 57 ].
不加区分的数据转储转移了解释数据的负担:数据集的生产者无需再将其整理成标准化格式,解释数据成了消费者的问题(即读时模式方法[56];参见“文档模型中的模式灵活性”)。如果生产者和消费者是优先级各不相同的不同团队,这可能是一种优势。甚至可能不存在一个理想的数据模型,而是存在适用于不同目的的、对数据的不同视图。简单地以原始形式转储数据,为多种这样的转换留出了空间。这种方法被称为寿司原则:“原始数据更好”[57]。
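A sketch of the schema-on-read idea (the events and field names below are invented for illustration): the raw dump is just lines of JSON, and each consumer applies its own interpretation when reading:
读时模式思想的一个示意(下面的事件和字段名是为说明而虚构的):原始转储只是一行行JSON,每个消费者在读取时套用自己的解释:

```python
import json

raw_dump = [
    '{"user_id": 1, "event": "click", "url": "/home", "ts": 1000}',
    '{"user_id": 2, "event": "purchase", "amount": 9.99, "ts": 1001}',
]

# Consumer A: traffic analysis cares only about the URLs of click events.
clicks = [rec["url"]
          for rec in map(json.loads, raw_dump)
          if rec["event"] == "click"]

# Consumer B: revenue reporting projects a completely different view
# onto the very same raw files.
revenue = sum(rec.get("amount", 0.0) for rec in map(json.loads, raw_dump))
```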
Thus, Hadoop has often been used for implementing ETL processes (see “Data Warehousing” ): data from transaction processing systems is dumped into the distributed filesystem in some raw form, and then MapReduce jobs are written to clean up that data, transform it into a relational form, and import it into an MPP data warehouse for analytic purposes. Data modeling still happens, but it is in a separate step, decoupled from the data collection. This decoupling is possible because a distributed filesystem supports data encoded in any format.
因此,Hadoop通常用于实现ETL过程(请参见“数据仓库”):从事务处理系统中获得的数据以某种原始形式转储到分布式文件系统中,然后编写MapReduce作业清理该数据,将其转换为关系形式,并将其导入MPP数据仓库以用于分析目的。数据建模仍然存在,但它是一个独立的步骤,与数据收集解耦。这种解耦是可能的,因为分布式文件系统支持以任何格式编码的数据。
Diversity of processing models
MPP databases are monolithic, tightly integrated pieces of software that take care of storage layout on disk, query planning, scheduling, and execution. Since these components can all be tuned and optimized for the specific needs of the database, the system as a whole can achieve very good performance on the types of queries for which it is designed. Moreover, the SQL query language allows expressive queries and elegant semantics without the need to write code, making it accessible to graphical tools used by business analysts (such as Tableau).
MPP数据库是单体的、紧密集成的软件,负责磁盘上的存储布局、查询规划、调度和执行。由于这些组件都可以针对数据库的特定需求进行调优,整个系统在其设计所针对的查询类型上可以取得非常好的性能。此外,SQL查询语言允许以优雅的语义表达丰富的查询,而无需编写代码,这使得业务分析师使用的图形化工具(例如Tableau)也能够访问它。
On the other hand, not all kinds of processing can be sensibly expressed as SQL queries. For example, if you are building machine learning and recommendation systems, or full-text search indexes with relevance ranking models, or performing image analysis, you most likely need a more general model of data processing. These kinds of processing are often very specific to a particular application (e.g., feature engineering for machine learning, natural language models for machine translation, risk estimation functions for fraud prediction), so they inevitably require writing code, not just queries.
另一方面,并非所有类型的处理都可以明智地表示为SQL查询。例如,如果你正在建立机器学习和推荐系统、具有相关性排名模型的全文搜索索引或进行图像分析,那么你很可能需要更通用的数据处理模型。这些处理通常非常特定于特定应用程序(例如,用于机器学习的特征工程、用于机器翻译的自然语言模型、用于欺诈预测的风险估计函数),因此它们不可避免地需要编写代码,而不仅仅是查询。
MapReduce gave engineers the ability to easily run their own code over large datasets. If you have HDFS and MapReduce, you can build a SQL query execution engine on top of it, and indeed this is what the Hive project did [ 31 ]. However, you can also write many other forms of batch processes that do not lend themselves to being expressed as a SQL query.
MapReduce使工程师能够轻松地在大型数据集上运行自己的代码。如果您拥有HDFS和MapReduce,可以在其之上构建SQL查询执行引擎,Hive项目正是这样做的[31]。不过,您也可以编写许多其他形式的、不适合表达为SQL查询的批处理过程。
Subsequently, people found that MapReduce was too limiting and performed too badly for some types of processing, so various other processing models were developed on top of Hadoop (we will see some of them in “Beyond MapReduce” ). Having two processing models, SQL and MapReduce, was not enough: even more different models were needed! And due to the openness of the Hadoop platform, it was feasible to implement a whole range of approaches, which would not have been possible within the confines of a monolithic MPP database [ 58 ].
随后,人们发现MapReduce对某些类型的处理而言限制太多、性能太差,于是在Hadoop之上又开发出了各种其他的处理模型(我们将在“超越MapReduce”中看到其中一些)。只有SQL和MapReduce这两种处理模型还不够:还需要更多不同的模型!而由于Hadoop平台的开放性,实现一整套方法是可行的,这在单体MPP数据库的限制之内是不可能做到的[58]。
Crucially, those various processing models can all be run on a single shared-use cluster of machines, all accessing the same files on the distributed filesystem. In the Hadoop approach, there is no need to import the data into several different specialized systems for different kinds of processing: the system is flexible enough to support a diverse set of workloads within the same cluster. Not having to move data around makes it a lot easier to derive value from the data, and a lot easier to experiment with new processing models.
至关重要的是,这些不同的处理模型都可以在一个共享使用的机器集群上运行,并且都访问分布式文件系统上的同一批文件。在Hadoop的方法中,无需为了不同类型的处理而把数据导入到多个不同的专用系统中:这个系统足够灵活,可以在同一个集群内支持多样化的工作负载。不必到处移动数据,使得从数据中获取价值容易得多,也更容易试验新的处理模型。
The Hadoop ecosystem includes both random-access OLTP databases such as HBase (see “SSTables and LSM-Trees” ) and MPP-style analytic databases such as Impala [ 41 ]. Neither HBase nor Impala uses MapReduce, but both use HDFS for storage. They are very different approaches to accessing and processing data, but they can nevertheless coexist and be integrated in the same system.
Hadoop生态系统包括随机访问的OLTP数据库,例如HBase(参见“SSTables和LSM树”)和类似MPP的分析数据库,例如Impala [41]。 HBase和Impala均不使用MapReduce,但都使用HDFS进行存储。 它们是访问和处理数据的非常不同的方法,但它们仍然可以共存并集成到同一系统中。
Designing for frequent faults
When comparing MapReduce to MPP databases, two more differences in design approach stand out: the handling of faults and the use of memory and disk. Batch processes are less sensitive to faults than online systems, because they do not immediately affect users if they fail and they can always be run again.
在将MapReduce与MPP数据库进行比较时,设计方法上还有两个显著差异:对故障的处理,以及内存和磁盘的使用。批处理对故障不像在线系统那样敏感,因为批处理作业失败并不会立即影响用户,而且总是可以重新运行。
If a node crashes while a query is executing, most MPP databases abort the entire query, and either let the user resubmit the query or automatically run it again [ 3 ]. As queries normally run for a few seconds or a few minutes at most, this way of handling errors is acceptable, since the cost of retrying is not too great. MPP databases also prefer to keep as much data as possible in memory (e.g., using hash joins) to avoid the cost of reading from disk.
如果某个节点在查询执行期间崩溃,大多数MPP数据库会中止整个查询,然后由用户重新提交查询,或者自动重新运行它[3]。由于查询通常最多运行几秒钟或几分钟,这种处理错误的方式是可以接受的,因为重试的代价不算太大。MPP数据库还倾向于在内存中保留尽可能多的数据(例如使用哈希连接),以避免从磁盘读取的开销。
On the other hand, MapReduce can tolerate the failure of a map or reduce task without it affecting the job as a whole by retrying work at the granularity of an individual task. It is also very eager to write data to disk, partly for fault tolerance, and partly on the assumption that the dataset will be too big to fit in memory anyway.
另一方面,MapReduce 可以容忍 map 或 reduce 任务的失败,而不会影响整个任务,因为它可以在单个任务的粒度上重新尝试工作。此外,它也非常渴望将数据写入磁盘,部分是为了容错,部分是基于数据集太大无法放入内存的假设。
The MapReduce approach is more appropriate for larger jobs: jobs that process so much data and run for such a long time that they are likely to experience at least one task failure along the way. In that case, rerunning the entire job due to a single task failure would be wasteful. Even if recovery at the granularity of an individual task introduces overheads that make fault-free processing slower, it can still be a reasonable trade-off if the rate of task failures is high enough.
MapReduce方法更适合处理更大的任务: 即处理大量数据并且运行时间如此之长以至于它们有可能在执行过程中至少遇到一个任务故障。那种情况下,由于一个任务故障而重新运行整个任务是浪费的。即使以单个任务为粒度的恢复引入了一些开销,使得无故障处理变慢,如果任务故障率足够高,这仍然可以是一个合理的权衡。
But how realistic are these assumptions? In most clusters, machine failures do occur, but they are not very frequent—probably rare enough that most jobs will not experience a machine failure. Is it really worth incurring significant overheads for the sake of fault tolerance?
但这些假设有多现实呢?在大多数集群中,机器故障确实会发生,但并不十分频繁:故障率可能低到大多数作业都不会遇到机器故障。为了容错,真的值得付出如此显著的开销吗?
To understand the reasons for MapReduce’s sparing use of memory and task-level recovery, it is helpful to look at the environment for which MapReduce was originally designed. Google has mixed-use datacenters, in which online production services and offline batch jobs run on the same machines. Every task has a resource allocation (CPU cores, RAM, disk space, etc.) that is enforced using containers. Every task also has a priority, and if a higher-priority task needs more resources, lower-priority tasks on the same machine can be terminated (preempted) in order to free up resources. Priority also determines pricing of the computing resources: teams must pay for the resources they use, and higher-priority processes cost more [ 59 ].
要理解MapReduce为何节约使用内存并采用任务级恢复,有必要看看MapReduce最初是为怎样的环境设计的。Google拥有混合用途的数据中心,在线生产服务和离线批处理作业运行在同一批机器上。每个任务都有一份通过容器强制执行的资源配额(CPU核、RAM、磁盘空间等)。每个任务还有一个优先级,如果优先级更高的任务需要更多资源,同一台机器上优先级较低的任务可以被终止(抢占),以释放资源。优先级还决定了计算资源的定价:团队必须为其使用的资源付费,优先级更高的进程花费更多[59]。
This architecture allows non-production (low-priority) computing resources to be overcommitted, because the system knows that it can reclaim the resources if necessary. Overcommitting resources in turn allows better utilization of machines and greater efficiency compared to systems that segregate production and non-production tasks. However, as MapReduce jobs run at low priority, they run the risk of being preempted at any time because a higher-priority process requires their resources. Batch jobs effectively “pick up the scraps under the table,” using any computing resources that remain after the high-priority processes have taken what they need.
这种架构允许非生产(低优先级)计算资源被超额投入,因为系统知道如果需要的话可以收回资源。相比分离生产和非生产任务的系统,超额投入资源可以更好地利用机器并提高效率。然而,因为MapReduce作业以低优先级运行,它们随时面临被抢占的风险,因为高优先级进程需要它们的资源。批处理作业有效地“捡拾桌子下的残羹剩饭”,使用高优先级进程取走所需资源后,剩余的计算资源。
At Google, a MapReduce task that runs for an hour has an approximately 5% risk of being terminated to make space for a higher-priority process. This rate is more than an order of magnitude higher than the rate of failures due to hardware issues, machine reboot, or other reasons [ 59 ]. At this rate of preemptions, if a job has 100 tasks that each run for 10 minutes, there is a risk greater than 50% that at least one task will be terminated before it is finished.
在Google,一个运行一小时的MapReduce任务,大约有5%的风险被终止,以便为更高优先级的进程腾出空间。这一比例比由硬件问题、机器重启或其他原因导致的故障率高出一个数量级以上[59]。按照这样的抢占率,如果一个作业有100个任务、每个任务运行10分钟,那么至少有一个任务在完成前被终止的风险大于50%。
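The arithmetic behind that last claim can be checked directly (assuming, as a simplification, that the 5%-per-hour preemption risk accrues independently across task-hours):
上面最后一个论断背后的算术可以直接验证(作为简化,假设每任务小时5%的抢占风险在各任务小时之间独立累积):

```python
# 100 tasks x 10 minutes = about 16.7 task-hours of exposure in total.
p_per_task_hour = 0.05
task_hours = 100 * (10 / 60)

# Probability that at least one task is preempted before finishing.
p_at_least_one = 1 - (1 - p_per_task_hour) ** task_hours
# This comes out to roughly 0.57, i.e. greater than 50%.
```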
And this is why MapReduce is designed to tolerate frequent unexpected task termination: it’s not because the hardware is particularly unreliable, it’s because the freedom to arbitrarily terminate processes enables better resource utilization in a computing cluster.
这也是为什么MapReduce被设计成能够容忍频繁意外任务终止的原因:不是因为硬件特别不可靠,而是因为可以任意终止进程的自由能够更好地利用计算集群的资源。
Among open source cluster schedulers, preemption is less widely used. YARN’s CapacityScheduler supports preemption for balancing the resource allocation of different queues [ 58 ], but general priority preemption is not supported in YARN, Mesos, or Kubernetes at the time of writing [ 60 ]. In an environment where tasks are not so often terminated, the design decisions of MapReduce make less sense. In the next section, we will look at some alternatives to MapReduce that make different design decisions.
在开源的集群调度器中,抢占的使用没有那么广泛。YARN的CapacityScheduler支持通过抢占来平衡不同队列的资源分配[58],但在撰写本文时,YARN、Mesos和Kubernetes都不支持通用的优先级抢占[60]。在任务不那么频繁被终止的环境中,MapReduce的这些设计决策就没有那么大的意义了。在下一节中,我们将介绍MapReduce的一些替代方案,它们做出了不同的设计决策。
Beyond MapReduce
Although MapReduce became very popular and received a lot of hype in the late 2000s, it is just one among many possible programming models for distributed systems. Depending on the volume of data, the structure of the data, and the type of processing being done with it, other tools may be more appropriate for expressing a computation.
尽管MapReduce在2000年代末非常流行并受到很多宣传,但它只是分布式系统中许多可能的编程模型之一。根据数据的数量、数据的结构和使用它进行的处理类型,可能有其他更适合表达计算的工具。
We nevertheless spent a lot of time in this chapter discussing MapReduce because it is a useful learning tool, as it is a fairly clear and simple abstraction on top of a distributed filesystem. That is, simple in the sense of being able to understand what it is doing, not in the sense of being easy to use. Quite the opposite: implementing a complex processing job using the raw MapReduce APIs is actually quite hard and laborious—for instance, you would need to implement any join algorithms from scratch [ 37 ].
尽管如此,我们在本章中仍然花了大量篇幅讨论MapReduce,因为它是一个有用的学习工具:它是建立在分布式文件系统之上的一个相当清晰、简单的抽象。这里的“简单”是指能够理解它在做什么,而不是指易于使用。恰恰相反:使用原始的MapReduce API实现复杂的处理作业实际上相当困难和费力,例如,您需要从头开始实现所有的连接算法[37]。
In response to the difficulty of using MapReduce directly, various higher-level programming models (Pig, Hive, Cascading, Crunch) were created as abstractions on top of MapReduce. If you understand how MapReduce works, they are fairly easy to learn, and their higher-level constructs make many common batch processing tasks significantly easier to implement.
针对直接使用MapReduce的困难,各种更高层次的编程模型(Pig、Hive、Cascading、Crunch)被创建为对MapReduce的抽象。如果您理解MapReduce的工作原理,学习它们是相当容易的,它们的高级构造使得许多常见的批处理任务更容易实现。
However, there are also problems with the MapReduce execution model itself, which are not fixed by adding another level of abstraction and which manifest themselves as poor performance for some kinds of processing. On the one hand, MapReduce is very robust: you can use it to process almost arbitrarily large quantities of data on an unreliable multi-tenant system with frequent task terminations, and it will still get the job done (albeit slowly). On the other hand, other tools are sometimes orders of magnitude faster for some kinds of processing.
然而,MapReduce执行模型本身也存在问题,即使增加另一级抽象也无法修复,表现为某些处理的性能不佳。一方面,MapReduce非常健壮:您可以在频繁任务终止的不可靠多租户系统上使用它来处理几乎任意数量的数据,并且它仍然能够完成工作(虽然缓慢)。另一方面,其他工具对于某些类型的处理有时会快上几个数量级。
In the rest of this chapter, we will look at some of those alternatives for batch processing. In Chapter 11 we will move to stream processing, which can be regarded as another way of speeding up batch processing.
在本章的其余部分,我们将探讨一些批处理的替代方法。在第11章中,我们将转向流处理,这可以被视为加速批处理的另一种方式。
Materialization of Intermediate State
As discussed previously, every MapReduce job is independent from every other job. The main contact points of a job with the rest of the world are its input and output directories on the distributed filesystem. If you want the output of one job to become the input to a second job, you need to configure the second job’s input directory to be the same as the first job’s output directory, and an external workflow scheduler must start the second job only once the first job has completed.
如前所述,每个MapReduce作业都与其他作业无关。作业与世界的主要联系点是分布式文件系统上的输入和输出目录。如果您希望一个作业的输出成为第二个作业的输入,则需要将第二个作业的输入目录配置为与第一个作业的输出目录相同,并且外部工作流调度程序必须在第一个作业完成后才能启动第二个作业。
This setup is reasonable if the output from the first job is a dataset that you want to publish widely within your organization. In that case, you need to be able to refer to it by name and reuse it as input to several different jobs (including jobs developed by other teams). Publishing data to a well-known location in the distributed filesystem allows loose coupling so that jobs don’t need to know who is producing their input or consuming their output (see “Separation of logic and wiring” ).
如果第一个作业的输出是您想在组织内广泛发布的数据集,那么这种设置是合理的。在这种情况下,您需要能够通过名称引用它,并将其作为多个不同作业(包括其他团队开发的作业)的输入进行重用。将数据发布到分布式文件系统中众所周知的位置可以实现松耦合,使作业不需要知道是谁在生成其输入或消费其输出(参见“逻辑与布线的分离”)。
However, in many cases, you know that the output of one job is only ever used as input to one other job, which is maintained by the same team. In this case, the files on the distributed filesystem are simply intermediate state : a means of passing data from one job to the next. In the complex workflows used to build recommendation systems consisting of 50 or 100 MapReduce jobs [ 29 ], there is a lot of such intermediate state.
然而,在许多情况下,您知道一个作业的输出只会作为输入传递给同一团队维护的另一个作业。在这种情况下,分布式文件系统上的文件只是中间状态:将数据从一个作业传递到下一个的手段。在用于构建由50或100个MapReduce作业组成的推荐系统的复杂工作流程中,有许多这样的中间状态。
The process of writing out this intermediate state to files is called materialization . (We came across the term previously in the context of materialized views, in “Aggregation: Data Cubes and Materialized Views” . It means to eagerly compute the result of some operation and write it out, rather than computing it on demand when requested.)
将这个中间状态写入文件的过程称为物化。我们之前在“聚合:数据立方体和物化视图”中遇到过这个术语。它意味着急切地计算某个操作的结果并写出来,而不是在请求时再计算。
By contrast, the log analysis example at the beginning of the chapter used Unix pipes to connect the output of one command with the input of another. Pipes do not fully materialize the intermediate state, but instead stream the output to the input incrementally, using only a small in-memory buffer.
与此相反,本章开头的日志分析示例使用Unix管道将一个命令的输出与另一个命令的输入连接起来。管道不会完全实现中间状态,而是以增量方式将输出流式传输到输入,仅使用小的内存缓冲区。
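The streaming behavior of pipes can be mimicked with Python generators (a toy sketch, not actual Unix plumbing): the downstream operator starts consuming records as soon as the upstream produces them, holding only one record at a time instead of materializing the whole intermediate dataset:

管道的流式行为可以用Python生成器来模拟(这只是一个玩具示意,并非真正的Unix管道):下游算子在上游产生记录后立即开始消费,每次只保留一条记录,而不是物化整个中间数据集:

```python
def read_log(lines):
    # Upstream operator: yields one record at a time, like a process
    # writing to a pipe; nothing is fully materialized.
    for line in lines:
        yield line.strip()

def grep(records, needle):
    # Downstream operator: consumes each record as soon as it is
    # produced, holding only one record in memory at a time.
    for record in records:
        if needle in record:
            yield record

logs = ["GET /home", "POST /login", "GET /about"]
hits = list(grep(read_log(logs), "GET"))
```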
MapReduce’s approach of fully materializing intermediate state has downsides compared to Unix pipes:
与Unix管道相比,MapReduce完全物化中间状态的方法存在以下缺点:
-
A MapReduce job can only start when all tasks in the preceding jobs (that generate its inputs) have completed, whereas processes connected by a Unix pipe are started at the same time, with output being consumed as soon as it is produced. Skew or varying load on different machines means that a job often has a few straggler tasks that take much longer to complete than the others. Having to wait until all of the preceding job’s tasks have completed slows down the execution of the workflow as a whole.
MapReduce作业只有在生成其输入的所有前置作业中的所有任务都完成时才能开始,而通过Unix管道连接的进程则同时启动,输出在产生时立即被消耗。不同机器上的偏差或负载变化意味着作业经常有一些拖延任务要比其他任务完成时间更长。必须等待所有前置作业的任务完成会减慢整个工作流的执行速度。
-
Mappers are often redundant: they just read back the same file that was just written by a reducer, and prepare it for the next stage of partitioning and sorting. In many cases, the mapper code could be part of the previous reducer: if the reducer output was partitioned and sorted in the same way as mapper output, then reducers could be chained together directly, without interleaving with mapper stages.
Mapper通常是冗余的:它们只是读回刚刚由Reducer写入的同一个文件,并为下一阶段的分区和排序做准备。在许多情况下,Mapper代码可以作为前一个Reducer的一部分:如果Reducer的输出按照与Mapper输出相同的方式进行分区和排序,那么Reducer就可以直接串联在一起,而无需与Mapper阶段交错。
-
Storing intermediate state in a distributed filesystem means those files are replicated across several nodes, which is often overkill for such temporary data.
在分布式文件系统中存储中间状态意味着这些文件被复制到多个节点,这对于这种临时数据来说通常是过度的。
Dataflow engines
In order to fix these problems with MapReduce, several new execution engines for distributed batch computations were developed, the most well known of which are Spark [ 61 , 62 ], Tez [ 63 , 64 ], and Flink [ 65 , 66 ]. There are various differences in the way they are designed, but they have one thing in common: they handle an entire workflow as one job, rather than breaking it up into independent subjobs.
为了解决MapReduce存在的问题,出现了几个新的分布式批量计算执行引擎,其中最著名的是Spark[61,62]、Tez[63,64]和Flink[65,66]。它们的设计存在一定差异,但有一个共同点:将整个工作流程作为一个作业来处理,而不是将其分解为独立的子作业。
Since they explicitly model the flow of data through several processing stages, these systems are known as dataflow engines . Like MapReduce, they work by repeatedly calling a user-defined function to process one record at a time on a single thread. They parallelize work by partitioning inputs, and they copy the output of one function over the network to become the input to another function.
由于这些系统明确地对数据流经多个处理阶段的过程建模,它们被称为数据流引擎。与MapReduce一样,它们通过反复调用用户定义的函数,在单个线程上一次处理一条记录。它们通过对输入进行分区来并行化工作,并通过网络将一个函数的输出复制过去,作为另一个函数的输入。
Unlike in MapReduce, these functions need not take the strict roles of alternating map and reduce, but instead can be assembled in more flexible ways. We call these functions operators , and the dataflow engine provides several different options for connecting one operator’s output to another’s input:
与MapReduce不同,这些函数不需要扮演交替的Map和Reduce这种严格的角色,而是可以以更灵活的方式组装。我们将这些函数称为运算符(operator),数据流引擎提供了几种不同的选项,用于将一个运算符的输出连接到另一个运算符的输入:
-
One option is to repartition and sort records by key, like in the shuffle stage of MapReduce (see “Distributed execution of MapReduce” ). This feature enables sort-merge joins and grouping in the same way as in MapReduce.
一种选项是按键对记录重新分区并排序,就像MapReduce的shuffle阶段一样(请参见“MapReduce的分布式执行”)。这一功能以与MapReduce相同的方式支持排序合并连接和分组。
-
Another possibility is to take several inputs and to partition them in the same way, but skip the sorting. This saves effort on partitioned hash joins, where the partitioning of records is important but the order is irrelevant because building the hash table randomizes the order anyway.
另一种可能性是接受多个输入,并以相同的方式对它们进行分区,但跳过排序。这可以节省分区哈希连接的工作量:在分区哈希连接中,记录的分区很重要,但顺序无关紧要,因为构建哈希表时无论如何都会打乱顺序。
-
For broadcast hash joins, the same output from one operator can be sent to all partitions of the join operator.
在广播哈希连接中,可以将来自一个运算符的相同输出发送到连接运算符的所有分区。
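To make the second and third options concrete, here is a hedged Python sketch of a partitioned hash join (the function names are invented for illustration): both inputs are partitioned by the same hash function, and each partition pair is joined through an in-memory hash table, with no sorting anywhere. In a broadcast hash join, the entire build side would instead be sent to every partition of the probe side.

为了让第二和第三种选项更具体,下面是分区哈希连接的一个Python示意(函数名为说明而虚构):两个输入按相同的哈希函数分区,每对分区通过内存中的哈希表进行连接,全程不需要排序。在广播哈希连接中,则会把整个构建侧发送到探测侧的每个分区。

```python
def partition_by_key(records, num_partitions):
    # Assign each (key, value) record to a partition by hashing the key;
    # the order of records within a partition is irrelevant.
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

def hash_join(build_side, probe_side):
    # Build a hash table from one input, then probe it with the other.
    table = {}
    for key, value in build_side:
        table.setdefault(key, []).append(value)
    return [(key, bv, pv) for key, pv in probe_side for bv in table.get(key, [])]

users = [(1, "alice"), (2, "bob")]
events = [(1, "click"), (1, "view"), (2, "click")]

# Partitioned hash join: co-partition both inputs, then join partition-wise.
n = 4
joined = []
for u_part, e_part in zip(partition_by_key(users, n), partition_by_key(events, n)):
    joined.extend(hash_join(u_part, e_part))
```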
This style of processing engine is based on research systems like Dryad [ 67 ] and Nephele [ 68 ], and it offers several advantages compared to the MapReduce model:
这种处理引擎的风格是基于Dryad [67]和Nephele [68]等研究系统,相比于MapReduce模型,它提供了几个优点:
-
Expensive work such as sorting need only be performed in places where it is actually required, rather than always happening by default between every map and reduce stage.
昂贵的工作(如排序)只需在实际需要的地方执行,而不是默认地在每个Map和Reduce阶段之间都发生。
-
There are no unnecessary map tasks, since the work done by a mapper can often be incorporated into the preceding reduce operator (because a mapper does not change the partitioning of a dataset).
不存在不必要的Map任务,因为Mapper所做的工作通常可以合并到前面的Reduce运算符中(因为Mapper不会改变数据集的分区方式)。
-
Because all joins and data dependencies in a workflow are explicitly declared, the scheduler has an overview of what data is required where, so it can make locality optimizations. For example, it can try to place the task that consumes some data on the same machine as the task that produces it, so that the data can be exchanged through a shared memory buffer rather than having to copy it over the network.
由于工作流程中的所有连接和数据依赖项都是明确声明的,因此调度程序可以了解需要在哪里使用数据,因此它可以进行本地化优化。例如,它可以尝试将消耗一些数据的任务放置在与生成它的任务相同的机器上,以便可以通过共享内存缓冲区交换数据,而无需复制它。
-
It is usually sufficient for intermediate state between operators to be kept in memory or written to local disk, which requires less I/O than writing it to HDFS (where it must be replicated to several machines and written to disk on each replica). MapReduce already uses this optimization for mapper output, but dataflow engines generalize the idea to all intermediate state.
通常情况下,运算符之间的中间状态保存在内存中或写入本地磁盘就足够了,这比写入HDFS需要更少的I/O(在HDFS中,中间状态必须复制到多台机器,并在每个副本上写入磁盘)。MapReduce已经对Mapper的输出使用了这种优化,而数据流引擎将这一想法推广到了所有中间状态。
-
Operators can start executing as soon as their input is ready; there is no need to wait for the entire preceding stage to finish before the next one starts.
运算符可以在其输入准备就绪后立即开始执行;无需等待前一个阶段全部完成,下一个阶段就可以开始。
-
Existing Java Virtual Machine (JVM) processes can be reused to run new operators, reducing startup overheads compared to MapReduce (which launches a new JVM for each task).
现有的Java虚拟机(JVM)进程可以重用,以运行新的运算符,与MapReduce进程相比,不需要为每个任务启动新的JVM,从而减少启动开销。
You can use dataflow engines to implement the same computations as MapReduce workflows, and they usually execute significantly faster due to the optimizations described here. Since operators are a generalization of map and reduce, the same processing code can run on either execution engine: workflows implemented in Pig, Hive, or Cascading can be switched from MapReduce to Tez or Spark with a simple configuration change, without modifying code [ 64 ].
你可以使用数据流引擎来实现与MapReduce工作流相同的计算,并且它们通常由于此处描述的优化而执行速度显着更快。由于操作符是map和reduce的泛化,因此相同的处理代码可以在任何执行引擎上运行:在Pig、Hive或Cascading中实现的工作流可以通过简单的配置更改从MapReduce转换到Tez或Spark,而不需要修改代码[64]。
Tez is a fairly thin library that relies on the YARN shuffle service for the actual copying of data between nodes [ 58 ], whereas Spark and Flink are big frameworks that include their own network communication layer, scheduler, and user-facing APIs. We will discuss those high-level APIs shortly.
Tez是一个相对轻量级的库,它依赖于YARN shuffle服务来实际复制节点之间的数据[58],而Spark和Flink是大型框架,包括自己的网络通信层、调度器和面向用户的API。我们稍后将讨论这些高级API。
Fault tolerance
An advantage of fully materializing intermediate state to a distributed filesystem is that it is durable, which makes fault tolerance fairly easy in MapReduce: if a task fails, it can just be restarted on another machine and read the same input again from the filesystem.
将中间状态完全物化到分布式文件系统的一个优点是它是持久的,这使得MapReduce中的容错相当容易:如果一个任务失败,它可以在另一台机器上重新启动,并从文件系统中再次读取相同的输入。
Spark, Flink, and Tez avoid writing intermediate state to HDFS, so they take a different approach to tolerating faults: if a machine fails and the intermediate state on that machine is lost, it is recomputed from other data that is still available (a prior intermediary stage if possible, or otherwise the original input data, which is normally on HDFS).
Spark、Flink和Tez不会将中间状态写入HDFS,因此它们采取不同的方法来容错:如果某台机器故障并且该机器上的中间状态丢失,则会从其他仍可用的数据重新计算该状态(尽可能是前一个中间阶段,否则就是通常在HDFS上的原始输入数据)。
To enable this recomputation, the framework must keep track of how a given piece of data was computed—which input partitions it used, and which operators were applied to it. Spark uses the resilient distributed dataset (RDD) abstraction for tracking the ancestry of data [ 61 ], while Flink checkpoints operator state, allowing it to resume running an operator that ran into a fault during its execution [ 66 ].
为了实现这种重新计算,框架必须跟踪给定的数据是如何计算的——它使用了哪些输入分区,以及对它应用了哪些运算符。Spark使用弹性分布式数据集(RDD)抽象来跟踪数据的血缘[61],而Flink对运算符状态进行检查点,从而能够恢复运行在执行过程中遇到故障的运算符[66]。
When recomputing data, it is important to know whether the computation is deterministic : that is, given the same input data, do the operators always produce the same output? This question matters if some of the lost data has already been sent to downstream operators. If the operator is restarted and the recomputed data is not the same as the original lost data, it becomes very hard for downstream operators to resolve the contradictions between the old and new data. The solution in the case of nondeterministic operators is normally to kill the downstream operators as well, and run them again on the new data.
当重新计算数据时,了解计算是否具有确定性非常重要:也就是说,给定相同的输入数据,算子是否总是产生相同的输出?如果一些已丢失的数据已经发送到下游算子,则这个问题非常重要。如果算子被重新启动,并且重新计算的数据与原始丢失的数据不同,则下游算子将很难解决旧数据和新数据之间的矛盾。在非确定性算子的情况下,解决方案通常是将下游算子也杀死,并在新数据上再次运行它们。
In order to avoid such cascading faults, it is better to make operators deterministic. Note however that it is easy for nondeterministic behavior to accidentally creep in: for example, many programming languages do not guarantee any particular order when iterating over elements of a hash table, many probabilistic and statistical algorithms explicitly rely on using random numbers, and any use of the system clock or external data sources is nondeterministic. Such causes of nondeterminism need to be removed in order to reliably recover from faults, for example by generating pseudorandom numbers using a fixed seed.
为了避免这种级联故障,最好使运算符具有确定性。但请注意,非确定性行为很容易不经意地混入:例如,许多编程语言在迭代哈希表的元素时不保证任何特定的顺序,许多概率和统计算法明确依赖于使用随机数,而任何对系统时钟或外部数据源的使用都是非确定性的。为了可靠地从故障中恢复,必须消除这些非确定性的来源,例如使用固定的种子生成伪随机数。
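For example, a sampling operator that draws random numbers can be made deterministic by fixing the seed, so that recomputation after a fault reproduces exactly the output that was lost (a minimal sketch; `sample_operator` is a hypothetical name):

例如,一个抽取随机数的采样运算符可以通过固定种子变得具有确定性,这样故障后的重新计算就能精确复现丢失的输出(一个最小示意;`sample_operator`是虚构的名字):

```python
import random

def sample_operator(records, seed):
    # A hypothetical sampling operator made deterministic by a fixed
    # seed: recomputing it after a fault reproduces exactly the same
    # output as the lost original.
    rng = random.Random(seed)
    return [r for r in records if rng.random() < 0.5]

records = list(range(100))
first_run = sample_operator(records, seed=42)
recomputed = sample_operator(records, seed=42)  # e.g., after a fault
```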
Recovering from faults by recomputing data is not always the right answer: if the intermediate data is much smaller than the source data, or if the computation is very CPU-intensive, it is probably cheaper to materialize the intermediate data to files than to recompute it.
通过重新计算数据来从故障中恢复并不总是正确的答案:如果中间数据比源数据小得多,或者计算非常消耗CPU,那么将中间数据物化到文件中可能比重新计算更便宜。
Discussion of materialization
Returning to the Unix analogy, we saw that MapReduce is like writing the output of each command to a temporary file, whereas dataflow engines look much more like Unix pipes. Flink especially is built around the idea of pipelined execution: that is, incrementally passing the output of an operator to other operators, and not waiting for the input to be complete before starting to process it.
回到Unix的比喻,我们看到MapReduce就像是将每个命令的输出写入临时文件,而数据流引擎则更像Unix管道。Flink尤其是围绕流水线执行的思想构建的:即增量地将一个运算符的输出传递给其他运算符,而不等待输入完成后再开始处理。
A sorting operation inevitably needs to consume its entire input before it can produce any output, because it’s possible that the very last input record is the one with the lowest key and thus needs to be the very first output record. Any operator that requires sorting will thus need to accumulate state, at least temporarily. But many other parts of a workflow can be executed in a pipelined manner.
排序操作无论如何都需要在产生任何输出之前消耗其完整的输入,因为最后一个输入的记录可能是具有最低键值的记录,并且需要成为第一个输出记录。因此,任何需要排序的操作都需要累积状态,至少是暂时的。但是,工作流程的许多其他部分可以以流水线方式执行。
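The difference can be seen with Python generators (a toy sketch): a filtering operator can emit its first output before it has read the rest of its input, whereas sorting must consume everything first:

这一差异可以用Python生成器来展示(一个玩具示意):过滤运算符在读完其余输入之前就可以发出第一条输出,而排序必须先消费全部输入:

```python
def numbers():
    # Upstream operator producing records one at a time.
    yield from [5, 1, 4, 2, 3]

def filter_odd(records):
    # A pipelined operator: each matching record is emitted immediately,
    # before the rest of the input has been read.
    for r in records:
        if r % 2 == 1:
            yield r

pipeline = filter_odd(numbers())
first = next(pipeline)          # available before the input is exhausted

# Sorting, by contrast, must consume its entire input before it can
# emit even the first output record.
sorted_all = sorted(numbers())
```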
When the job completes, its output needs to go somewhere durable so that users can find it and use it—most likely, it is written to the distributed filesystem again. Thus, when using a dataflow engine, materialized datasets on HDFS are still usually the inputs and the final outputs of a job. Like with MapReduce, the inputs are immutable and the output is completely replaced. The improvement over MapReduce is that you save yourself writing all the intermediate state to the filesystem as well.
任务完成后,其输出需要被存储到某个可靠的地方,以便用户可以找到并使用它——很可能,它会再次写入分布式文件系统。因此,在使用数据流引擎时,HDFS上的物化数据集通常仍然是作业的输入和最终输出。与MapReduce一样,输入是不可变的,输出完全被替换。相对于MapReduce的改进在于,您无需将所有中间状态都写入文件系统。
Graphs and Iterative Processing
In “Graph-Like Data Models” we discussed using graphs for modeling data, and using graph query languages to traverse the edges and vertices in a graph. The discussion in Chapter 2 was focused around OLTP-style use: quickly executing queries to find a small number of vertices matching certain criteria.
在“类图数据模型”中,我们讨论了使用图对数据建模,以及使用图查询语言遍历图中的边和顶点。第2章的讨论集中在OLTP风格的使用上:快速执行查询,找到符合特定条件的少量顶点。
It is also interesting to look at graphs in a batch processing context, where the goal is to perform some kind of offline processing or analysis on an entire graph. This need often arises in machine learning applications such as recommendation engines, or in ranking systems. For example, one of the most famous graph analysis algorithms is PageRank [ 69 ], which tries to estimate the popularity of a web page based on what other web pages link to it. It is used as part of the formula that determines the order in which web search engines present their results.
在批处理的上下文中查看图也很有趣,其目标是对整个图执行某种离线处理或分析。这种需求经常出现在机器学习应用(如推荐引擎)或排名系统中。例如,最著名的图分析算法之一是PageRank[69],它试图根据链接到某个网页的其他网页来估计该网页的流行程度。它被用作决定网络搜索引擎呈现结果顺序的公式的一部分。
Note
Dataflow engines like Spark, Flink, and Tez (see “Materialization of Intermediate State” ) typically arrange the operators in a job as a directed acyclic graph (DAG). This is not the same as graph processing: in dataflow engines, the flow of data from one operator to another is structured as a graph, while the data itself typically consists of relational-style tuples. In graph processing, the data itself has the form of a graph. Another unfortunate naming confusion!
像Spark、 Flink和Tez这样的数据流引擎(参见“中间状态的实体化”)通常将作业中的运算符排列成有向无环图(DAG)。这与图处理不同:在数据流引擎中,从一个运算符到另一个运算符的数据流被组织成图形,而数据本身通常包含关系型元组。在图形处理中,数据本身的形式就是图形。另一个不幸的命名混淆!
Many graph algorithms are expressed by traversing one edge at a time, joining one vertex with an adjacent vertex in order to propagate some information, and repeating until some condition is met—for example, until there are no more edges to follow, or until some metric converges. We saw an example in Figure 2-6 , which made a list of all the locations in North America contained in a database by repeatedly following edges indicating which location is within which other location (this kind of algorithm is called a transitive closure ).
许多图算法通过每次遍历一条边来表达,将一个顶点连接到相邻的顶点以传播一些信息,并重复这个过程直到满足某些条件——例如,直到没有更多的边可跟随,或者直到某些度量收敛。我们在图2-6中看到了一个例子,它通过反复跟随指示哪个位置在哪个其他位置中的边来列出包含在数据库中的所有北美位置的列表(这种算法称为传递闭包)。
It is possible to store a graph in a distributed filesystem (in files containing lists of vertices and edges), but this idea of “repeating until done” cannot be expressed in plain MapReduce, since it only performs a single pass over the data. This kind of algorithm is thus often implemented in an iterative style:
可以在分布式文件系统中存储图(在包含顶点和边列表的文件中),但这种“反复执行直到完成”的想法无法用普通的MapReduce来表达,因为它只对数据执行单遍处理。因此,这种算法通常以迭代的风格实现:
-
An external scheduler runs a batch process to calculate one step of the algorithm.
一个外部调度程序运行批处理过程来计算算法的一步。
-
When the batch process completes, the scheduler checks whether it has finished (based on the completion condition—e.g., there are no more edges to follow, or the change compared to the last iteration is below some threshold).
当批处理过程完成后,调度程序会检查它是否已完成(基于完成条件,例如没有更多的边可跟随,或与上一次迭代相比的变化低于某个阈值)。
-
If it has not yet finished, the scheduler goes back to step 1 and runs another round of the batch process.
如果还没有完成,调度程序将返回第一步并运行另一轮批处理过程。
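The three steps above can be sketched as follows, using the transitive closure example (the function names are hypothetical; in a real deployment each call to `one_step` would be a full batch job, and the loop would live in an external workflow scheduler):

上述三个步骤可以用传递闭包的例子示意如下(函数名为虚构;在真实部署中,每次`one_step`调用都会是一个完整的批处理作业,而循环则位于外部工作流调度程序中):

```python
def one_step(reachable, edges):
    # One batch round: follow every edge once from each reachable vertex.
    return reachable | {dst for src, dst in edges if src in reachable}

def iterate_until_done(start, edges):
    # The external "scheduler": rerun the batch step until the completion
    # condition holds (no change compared to the last iteration).
    reachable = {start}
    while True:
        updated = one_step(reachable, edges)
        if updated == reachable:   # completion condition: a fixed point
            return reachable
        reachable = updated

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]
closure = iterate_until_done("a", edges)
```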
This approach works, but implementing it with MapReduce is often very inefficient, because MapReduce does not account for the iterative nature of the algorithm: it will always read the entire input dataset and produce a completely new output dataset, even if only a small part of the graph has changed compared to the last iteration.
这种方法可以奏效,但通过MapReduce实现通常非常低效,因为MapReduce并不考虑迭代算法的本质:即使只有图的一小部分与上一次迭代相比发生了变化,它也会读取整个输入数据集并生成完全新的输出数据集。
The Pregel processing model
As an optimization for batch processing graphs, the bulk synchronous parallel (BSP) model of computation [ 70 ] has become popular. Among others, it is implemented by Apache Giraph [ 37 ], Spark’s GraphX API, and Flink’s Gelly API [ 71 ]. It is also known as the Pregel model, as Google’s Pregel paper popularized this approach for processing graphs [ 72 ].
作为优化批处理图形的一种模型,批量同步并行(BSP)计算模型 [70] 变得流行。其中,它由 Apache Giraph [37]、Spark的GraphX API 和 Flink的Gelly API [71] 实现。它也被称为Pregel模型,因为谷歌的Pregel论文推广了这种处理图形的方法 [72]。
Recall that in MapReduce, mappers conceptually “send a message” to a particular call of the reducer because the framework collects together all the mapper outputs with the same key. A similar idea is behind Pregel: one vertex can “send a message” to another vertex, and typically those messages are sent along the edges in a graph.
回想一下,在MapReduce中,Mapper在概念上是向Reducer的某次特定调用“发送消息”,因为框架会把具有相同键的所有Mapper输出收集到一起。Pregel背后也有类似的想法:一个顶点可以向另一个顶点“发送消息”,通常这些消息沿着图中的边发送。
In each iteration, a function is called for each vertex, passing it all the messages that were sent to it—much like a call to the reducer. The difference from MapReduce is that in the Pregel model, a vertex remembers its state in memory from one iteration to the next, so the function only needs to process new incoming messages. If no messages are being sent in some part of the graph, no work needs to be done.
在每次迭代中,对于每个顶点都会调用一个函数,并将所有发送给它的消息传递给该函数——这很像对Reducer的调用。与MapReduce的不同之处在于,在Pregel模型中,顶点在内存中记住从一次迭代到下一次迭代的状态,因此该函数只需要处理新传入的消息。如果图的某个部分没有消息发送,则无需做任何工作。
It’s a bit similar to the actor model (see “Distributed actor frameworks” ), if you think of each vertex as an actor, except that vertex state and messages between vertices are fault-tolerant and durable, and communication proceeds in fixed rounds: at every iteration, the framework delivers all messages sent in the previous iteration. Actors normally have no such timing guarantee.
如果把每个顶点看作一个Actor,这有点类似于Actor模型(参见“分布式Actor框架”),不同之处在于顶点状态和顶点之间的消息是容错且持久的,并且通信以固定的轮次进行:在每次迭代中,框架会传递上一次迭代中发送的所有消息。Actor通常没有这样的时序保证。
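A minimal single-machine sketch of the Pregel model (a toy, ignoring partitioning, durability, and fault tolerance): each vertex keeps its state in memory across iterations, messages travel along edges in fixed rounds, and a vertex with an empty inbox does no work. Here each vertex propagates the smallest label it has seen, a toy connected-components algorithm:

下面是Pregel模型的一个最小单机示意(玩具代码,忽略分区、持久性和容错):每个顶点在迭代之间把状态保存在内存中,消息以固定轮次沿边传递,收件箱为空的顶点不做任何工作。这里每个顶点传播它见过的最小标签,即一个玩具版的连通分量算法:

```python
def pregel_min_label(vertices, edges):
    # Build an undirected adjacency list.
    neighbors = {v: [] for v in vertices}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Vertex state survives from one iteration to the next.
    state = {v: v for v in vertices}

    # Superstep 0: every vertex sends its own label along its edges.
    messages = {v: [] for v in vertices}
    for v in vertices:
        for n in neighbors[v]:
            messages[n].append(state[v])

    # Fixed rounds: all messages sent in one iteration are delivered
    # at the start of the next one.
    while any(messages.values()):
        outbox = {v: [] for v in vertices}
        for v, inbox in messages.items():
            if not inbox:
                continue  # no incoming messages: no work for this vertex
            best = min(inbox)
            if best < state[v]:
                state[v] = best
                for n in neighbors[v]:
                    outbox[n].append(best)
        messages = outbox
    return state
```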
Fault tolerance
The fact that vertices can only communicate by message passing (not by querying each other directly) helps improve the performance of Pregel jobs, since messages can be batched and there is less waiting for communication. The only waiting is between iterations: since the Pregel model guarantees that all messages sent in one iteration are delivered in the next iteration, the prior iteration must completely finish, and all of its messages must be copied over the network, before the next one can start.
顶点只能通过消息传递进行通信(而不能直接相互查询),这有助于提高Pregel作业的性能,因为消息可以批量处理,等待通信的时间也更少。唯一的等待发生在迭代之间:由于Pregel模型保证在一次迭代中发送的所有消息都会在下一次迭代中送达,因此前一次迭代必须完全结束,并且所有消息都必须通过网络复制完毕,下一次迭代才能开始。
Even though the underlying network may drop, duplicate, or arbitrarily delay messages (see “Unreliable Networks” ), Pregel implementations guarantee that messages are processed exactly once at their destination vertex in the following iteration. Like MapReduce, the framework transparently recovers from faults in order to simplify the programming model for algorithms on top of Pregel.
尽管底层网络可能会丢失、复制或任意延迟消息(参见“不可靠网络”),但 Pregel 的实现保证消息在下一次迭代时在其目标顶点上仅被处理一次。与 MapReduce 类似,该框架能够透明地从故障中恢复,以简化基于 Pregel 的算法的编程模型。
This fault tolerance is achieved by periodically checkpointing the state of all vertices at the end of an iteration—i.e., writing their full state to durable storage. If a node fails and its in-memory state is lost, the simplest solution is to roll back the entire graph computation to the last checkpoint and restart the computation. If the algorithm is deterministic and messages are logged, it is also possible to selectively recover only the partition that was lost (like we previously discussed for dataflow engines) [ 72 ].
这种容错性是通过在迭代结束时定期对所有顶点的状态做检查点来实现的,即将它们的完整状态写入持久存储。如果一个节点发生故障并且其内存中的状态丢失,最简单的解决方案是将整个图计算回滚到上一个检查点,然后重新开始计算。如果算法是确定性的,并且消息有日志记录,那么也可以选择性地只恢复丢失的那部分分区(就像我们之前讨论的数据流引擎那样)[72]。
Parallel execution
A vertex does not need to know on which physical machine it is executing; when it sends messages to other vertices, it simply sends them to a vertex ID. It is up to the framework to partition the graph—i.e., to decide which vertex runs on which machine, and how to route messages over the network so that they end up in the right place.
一个顶点不需要知道它在哪台物理机器上执行;当它发送消息到其他顶点时,它只是将它们发送给顶点ID。框架负责对图进行分区——即决定哪个顶点在哪个机器上运行,并通过网络路由消息,使它们到达正确的位置。
Because the programming model deals with just one vertex at a time (sometimes called “thinking like a vertex”), the framework may partition the graph in arbitrary ways. Ideally it would be partitioned such that vertices are colocated on the same machine if they need to communicate a lot. However, finding such an optimized partitioning is hard—in practice, the graph is often simply partitioned by an arbitrarily assigned vertex ID, making no attempt to group related vertices together.
由于编程模型一次只处理一个顶点(有时称为“像顶点一样思考”),框架可以以任意方式对图进行分区。理想情况下,如果一些顶点之间需要大量通信,它们最好被放在同一台机器上。然而,找到这样一种优化的分区方式很难——在实践中,图通常只是按任意分配的顶点ID进行分区,而不会尝试将相关的顶点分组在一起。
As a result, graph algorithms often have a lot of cross-machine communication overhead, and the intermediate state (messages sent between nodes) is often bigger than the original graph. The overhead of sending messages over the network can significantly slow down distributed graph algorithms.
因此,图形算法通常有很多机器间通信开销,中间状态(节点之间发送的消息)通常比原始图形更大。在网络上发送消息的开销会严重减慢分布式图形算法的速度。
For this reason, if your graph can fit in memory on a single computer, it’s quite likely that a single-machine (maybe even single-threaded) algorithm will outperform a distributed batch process [ 73 , 74 ]. Even if the graph is bigger than memory, as long as it can fit on the disks of a single computer, single-machine processing using a framework such as GraphChi is a viable option [ 75 ]. If the graph is too big to fit on a single machine, a distributed approach such as Pregel is unavoidable; efficiently parallelizing graph algorithms is an area of ongoing research [ 76 ].
因此,如果你的图可以放入单台计算机的内存中,那么单机(甚至单线程)算法很可能会胜过分布式批处理[73,74]。即使图比内存大,只要它能放入单台计算机的磁盘,使用GraphChi之类的框架进行单机处理也是一个可行的选择[75]。如果图太大而无法放入单台机器,那么像Pregel这样的分布式方法是不可避免的;高效地并行化图算法是一个正在进行研究的领域[76]。
High-Level APIs and Languages
Over the years since MapReduce first became popular, the execution engines for distributed batch processing have matured. By now, the infrastructure has become robust enough to store and process many petabytes of data on clusters of over 10,000 machines. As the problem of physically operating batch processes at such scale has been considered more or less solved, attention has turned to other areas: improving the programming model, improving the efficiency of processing, and broadening the set of problems that these technologies can solve.
自从MapReduce首次流行以来的这些年里,分布式批处理的执行引擎已经日趋成熟。到现在,基础设施已经足够健壮,可以在超过10,000台机器的集群上存储和处理许多PB的数据。由于在这种规模下物理运行批处理的问题已被认为或多或少得到解决,注意力已转向其他领域:改进编程模型、提高处理效率,以及扩大这些技术可以解决的问题范围。
As discussed previously, higher-level languages and APIs such as Hive, Pig, Cascading, and Crunch became popular because programming MapReduce jobs by hand is quite laborious. As Tez emerged, these high-level languages had the additional benefit of being able to move to the new dataflow execution engine without the need to rewrite job code. Spark and Flink also include their own high-level dataflow APIs, often taking inspiration from FlumeJava [ 34 ].
如前所述,较高级别的语言和API(如Hive、Pig、Cascading和Crunch)变得流行是因为手动编写MapReduce作业非常费力。随着Tez的出现,这些高级语言还具有无需重新编写作业代码即可移动到新的数据流执行引擎的附加好处。Spark和Flink还包括自己的高级数据流API,通常受到FlumeJava[34]的启发。
These dataflow APIs generally use relational-style building blocks to express a computation: joining datasets on the value of some field; grouping tuples by key; filtering by some condition; and aggregating tuples by counting, summing, or other functions. Internally, these operations are implemented using the various join and grouping algorithms that we discussed earlier in this chapter.
这些数据流API通常使用关系式构建块来表达计算:在某个字段的值上连接数据集;按键分组元组;按条件过滤;通过计数、求和或其他函数聚合元组。在内部,这些操作使用我们在本章早些时候讨论过的各种连接和分组算法实现。
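In plain Python, those building blocks might look like this (a sketch of the semantics only, not of any particular dataflow API):

用纯Python表达,这些构建块大致如下(仅示意其语义,并非任何特定数据流API):

```python
from itertools import groupby

events = [
    {"user": "alice", "action": "click"},
    {"user": "bob", "action": "view"},
    {"user": "alice", "action": "view"},
]

# Filter by a condition...
views = [e for e in events if e["action"] == "view"]

# ...group tuples by key (groupby needs its input sorted by that key)...
views.sort(key=lambda e: e["user"])

# ...and aggregate by counting.
counts = {user: sum(1 for _ in group)
          for user, group in groupby(views, key=lambda e: e["user"])}
```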
Besides the obvious advantage of requiring less code, these high-level interfaces also allow interactive use, in which you write analysis code incrementally in a shell and run it frequently to observe what it is doing. This style of development is very helpful when exploring a dataset and experimenting with approaches for processing it. It is also reminiscent of the Unix philosophy, which we discussed in “The Unix Philosophy” .
除了需要更少的代码之外,这些高级接口还允许交互式使用,在其中您可以在shell中逐步编写分析代码,并经常运行它以观察它的执行情况。这种开发风格在探索数据集和尝试处理方法时非常有帮助。这也让人想起了Unix哲学,我们在“Unix哲学”一文中已经讨论过。
Moreover, these high-level interfaces not only make the humans using the system more productive, but they also improve the job execution efficiency at a machine level.
此外,这些高级接口不仅使使用系统的人更加高效,而且也提高了机器级别的工作执行效率。
The move toward declarative query languages
An advantage of specifying joins as relational operators, compared to spelling out the code that performs the join, is that the framework can analyze the properties of the join inputs and automatically decide which of the aforementioned join algorithms would be most suitable for the task at hand. Hive, Spark, and Flink have cost-based query optimizers that can do this, and even change the order of joins so that the amount of intermediate state is minimized [ 66 , 77 , 78 , 79 ].
与写出执行连接的代码相比,将连接指定为关系运算符的一个优点是,框架可以分析连接输入的属性,并自动决定前面提到的哪种连接算法最适合手头的任务。Hive、Spark和Flink都有基于成本的查询优化器可以做到这一点,甚至可以改变连接的顺序,使中间状态的数量最小化[66,77,78,79]。
The choice of join algorithm can make a big difference to the performance of a batch job, and it is nice not to have to understand and remember all the various join algorithms we discussed in this chapter. This is possible if joins are specified in a declarative way: the application simply states which joins are required, and the query optimizer decides how they can best be executed. We previously came across this idea in “Query Languages for Data” .
连接算法的选择对批处理作业的性能有很大影响,而且不必理解并记住本章讨论过的所有各种连接算法也是一件好事。如果以声明的方式指定连接,这就成为可能:应用程序只需说明需要哪些连接,由查询优化器决定如何最好地执行它们。我们之前在“数据查询语言”中遇到过这个想法。
However, in other ways, MapReduce and its dataflow successors are very different from the fully declarative query model of SQL. MapReduce was built around the idea of function callbacks: for each record or group of records, a user-defined function (the mapper or reducer) is called, and that function is free to call arbitrary code in order to decide what to output. This approach has the advantage that you can draw upon a large ecosystem of existing libraries to do things like parsing, natural language analysis, image analysis, and running numerical or statistical algorithms.
然而,在其他方面,MapReduce及其数据流后继者与SQL的完全声明式查询模型非常不同。MapReduce是围绕函数回调的思想构建的:对于每条记录或每组记录,都会调用一个用户定义的函数(Mapper或Reducer),该函数可以自由调用任意代码来决定输出什么。这种方法的优点是,您可以利用大量现有库的生态系统来完成诸如解析、自然语言分析、图像分析以及运行数值或统计算法之类的任务。
The freedom to easily run arbitrary code is what has long distinguished batch processing systems of MapReduce heritage from MPP databases (see “Comparing Hadoop to Distributed Databases” ); although databases have facilities for writing user-defined functions, they are often cumbersome to use and not well integrated with the package managers and dependency management systems that are widely used in most programming languages (such as Maven for Java, npm for JavaScript, and Rubygems for Ruby).
长期以来,能够轻松运行任意代码正是MapReduce一脉的批处理系统区别于MPP数据库的地方(参见“Hadoop与分布式数据库的对比”);尽管数据库也有编写用户定义函数的功能,但它们往往使用起来很麻烦,而且与大多数编程语言中广泛使用的包管理器和依赖管理系统(例如Java的Maven、JavaScript的npm和Ruby的Rubygems)集成得不好。
However, dataflow engines have found that there are also advantages to incorporating more declarative features in areas besides joins. For example, if a callback function contains only a simple filtering condition, or it just selects some fields from a record, then there is significant CPU overhead in calling the function on every record. If such simple filtering and mapping operations are expressed in a declarative way, the query optimizer can take advantage of column-oriented storage layouts (see “Column-Oriented Storage” ) and read only the required columns from disk. Hive, Spark DataFrames, and Impala also use vectorized execution (see “Memory bandwidth and vectorized processing” ): iterating over data in a tight inner loop that is friendly to CPU caches, and avoiding function calls. Spark generates JVM bytecode [ 79 ] and Impala uses LLVM to generate native code for these inner loops [ 41 ].
然而,数据流引擎发现,除连接外,在其他领域引入更多声明性功能也有优势。例如,如果回调函数只包含简单的过滤条件,或仅从记录中选择某些字段,则在每个记录上调用函数存在显着的CPU开销。如果这些简单的过滤和映射操作以声明性的方式表示,则查询优化器可以利用面向列的存储布局(请参见“面向列的存储”),并从磁盘中仅读取所需的列。Hive、Spark DataFrames和Impala也使用矢量化执行(请参见“内存带宽和矢量化处理”):在一个紧密的内部循环中迭代数据,这对CPU缓存友好,并避免函数调用。Spark生成JVM字节码[79],而Impala使用LLVM为这些内部循环生成本地代码[41]。
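The following sketch contrasts the two styles on invented data (a toy illustration, not how any engine is actually implemented): calling a predicate function per row, versus one tight loop over a single column that touches the other column only for matching positions:

下面的示意用虚构的数据对比这两种风格(仅为玩具说明,并非任何引擎的真实实现):逐行调用谓词函数,与在单个列上的紧凑循环(只在匹配位置才访问另一列):

```python
# Row-at-a-time: a function call per record, even for a trivial predicate.
rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]

def predicate(row):
    return row[0] % 2 == 0

selected_rows = [row for row in rows if predicate(row)]

# Column-oriented: the same filter as one tight loop over a single
# column; the other column is touched only for matching positions.
ids = [1, 2, 3, 4]
names = ["a", "b", "c", "d"]
matches = [i for i, v in enumerate(ids) if v % 2 == 0]
selected_names = [names[i] for i in matches]
```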
By incorporating declarative aspects in their high-level APIs, and having query optimizers that can take advantage of them during execution, batch processing frameworks begin to look more like MPP databases (and can achieve comparable performance). At the same time, by having the extensibility of being able to run arbitrary code and read data in arbitrary formats, they retain their flexibility advantage.
通过在高级API中添加声明性方面,并拥有可利用它们的查询优化器,批处理框架开始看起来更像MPP数据库(并可实现可比较的性能)。同时,通过具有运行任意代码和以任意格式读取数据的可扩展性,它们保留了灵活性的优势。
Specialization for different domains
While the extensibility of being able to run arbitrary code is useful, there are also many common cases where standard processing patterns keep reoccurring, and so it is worth having reusable implementations of the common building blocks. Traditionally, MPP databases have served the needs of business intelligence analysts and business reporting, but that is just one among many domains in which batch processing is used.
尽管能够运行任意代码的可扩展性很有用,但也有许多常见情况下采用标准处理模式重复出现,因此值得拥有可重复使用的常见构建块的实现。传统上,MPP数据库已经满足了商业智能分析师和商业报告的需求,但这仅仅是批处理被用于许多领域中之一。
Another domain of increasing importance is statistical and numerical algorithms, which are needed for machine learning applications such as classification and recommendation systems. Reusable implementations are emerging: for example, Mahout implements various algorithms for machine learning on top of MapReduce, Spark, and Flink, while MADlib implements similar functionality inside a relational MPP database (Apache HAWQ) [ 54 ].
另一个越来越重要的领域是统计和数值算法,它们是机器学习应用(如分类和推荐系统)所需要的。可重用的实现正在出现:例如,Mahout在MapReduce、Spark和Flink之上实现了各种机器学习算法,而MADlib在关系型MPP数据库(Apache HAWQ)内实现了类似的功能[54]。
Also useful are spatial algorithms such as k-nearest neighbors [ 80 ], which searches for items that are close to a given item in some multi-dimensional space—a kind of similarity search. Approximate search is also important for genome analysis algorithms, which need to find strings that are similar but not identical [ 81 ].
同样有用的还有空间算法,例如k-近邻[80],它在多维空间中搜索与给定项邻近的项,这是一种相似性搜索。近似搜索对于基因组分析算法也很重要,这类算法需要找到相似但不完全相同的字符串[81]。
Batch processing engines are being used for distributed execution of algorithms from an increasingly wide range of domains. As batch processing systems gain built-in functionality and high-level declarative operators, and as MPP databases become more programmable and flexible, the two are beginning to look more alike: in the end, they are all just systems for storing and processing data.
批量处理引擎被用于从越来越广泛的领域分布式执行算法。随着批量处理系统获得内置功能和高级声明操作符,以及MPP数据库变得更加可编程和灵活,两者开始看起来更相似:最终,它们都只是存储和处理数据的系统。
Summary
In this chapter we explored the topic of batch processing. We started by looking at Unix tools such as awk, grep, and sort, and we saw how the design philosophy of those tools is carried forward into MapReduce and more recent dataflow engines. Some of those design principles are that inputs are immutable, outputs are intended to become the input to another (as yet unknown) program, and complex problems are solved by composing small tools that “do one thing well.”
在本章中,我们探讨了批处理这个主题。我们首先考察了awk、grep和sort等Unix工具,并看到这些工具的设计理念是如何被延续到MapReduce和更新的数据流引擎中的。其中一些设计原则是:输入是不可变的,输出旨在成为另一个(尚未知晓的)程序的输入,而复杂的问题通过组合“做好一件事”的小工具来解决。
In the Unix world, the uniform interface that allows one program to be composed with another is files and pipes; in MapReduce, that interface is a distributed filesystem. We saw that dataflow engines add their own pipe-like data transport mechanisms to avoid materializing intermediate state to the distributed filesystem, but the initial input and final output of a job is still usually HDFS.
在Unix世界中,允许一个程序与另一个程序组合的统一接口是文件和管道;在MapReduce中,该接口是分布式文件系统。我们看到,数据流引擎添加了自己的类似管道的数据传输机制,以避免将中间状态物化到分布式文件系统中,但作业的初始输入和最终输出通常仍是HDFS。
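The uniform interface of files and pipes can be sketched in a few lines: a program that consumes lines from stdin and writes lines to stdout composes with any other tool, e.g. `cat access.log | python top_urls.py` (the file and script names here are made up for illustration).

文件与管道这一统一接口可以用几行代码来演示:一个从stdin读取行、向stdout写出行的程序可以与任何其他工具通过管道组合,例如 `cat access.log | python top_urls.py`(文件名与脚本名均为示意)。

```python
# A small stdin-to-stdout filter, composable with other tools via pipes.
import sys
from collections import Counter

def top_n(lines, n=5):
    """Count identical lines and return the n most frequent (line, count)
    pairs, mirroring the pipeline `sort | uniq -c | sort -rn | head -n 5`."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common(n)

def main(stdin=sys.stdin, stdout=sys.stdout):
    # reads from the pipe, writes to the pipe: the uniform interface
    for value, count in top_n(stdin):
        stdout.write(f"{count} {value}\n")
```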
The two main problems that distributed batch processing frameworks need to solve are:
分布式批处理框架需要解决的两个主要问题是:
- Partitioning
-
In MapReduce, mappers are partitioned according to input file blocks. The output of mappers is repartitioned, sorted, and merged into a configurable number of reducer partitions. The purpose of this process is to bring all the related data—e.g., all the records with the same key—together in the same place.
在MapReduce中,mappers按照输入文件块进行分区。 Mapper的输出被重新分区,排序并合并为可配置数量的reducer分区。此过程的目的是将所有相关数据(例如具有相同键的所有记录)聚集在同一位置。
Post-MapReduce dataflow engines try to avoid sorting unless it is required, but they otherwise take a broadly similar approach to partitioning.
除非必要,后MapReduce时代的数据流引擎会尽量避免排序,但它们在分区方面采取了大体相似的方法。
- Fault tolerance
-
MapReduce frequently writes to disk, which makes it easy to recover from an individual failed task without restarting the entire job but slows down execution in the failure-free case. Dataflow engines perform less materialization of intermediate state and keep more in memory, which means that they need to recompute more data if a node fails. Deterministic operators reduce the amount of data that needs to be recomputed.
MapReduce 经常写入磁盘,这使得在单个任务失败时很容易恢复,而无需重新启动整个作业,但在无故障的情况下会降低执行速度。数据流引擎对中间状态的物化较少,而将更多状态保留在内存中,这意味着如果节点失败,它们需要重新计算更多的数据。确定性的操作符可以减少需要重新计算的数据量。
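The partitioning step above can be sketched in a few lines of Python (a single-process stand-in for the distributed shuffle; the choice of hash function is illustrative): records are routed to a reducer partition by a stable hash of their key, and each partition is sorted so that a reducer sees all records for one key together.

上述分区步骤可以用几行Python来勾勒(以单进程代替分布式shuffle,哈希函数的选择仅作示意):记录按键的稳定哈希被路由到某个reducer分区,每个分区再按键排序,使reducer能连续看到同一个键的所有记录。

```python
from hashlib import md5

def partition_for(key, num_reducers):
    # a stable hash (Python's built-in hash() is randomized per process)
    return int(md5(key.encode()).hexdigest(), 16) % num_reducers

def shuffle(mapper_output, num_reducers):
    """Route (key, value) pairs to partitions, then sort each partition
    by key so a reducer sees runs of records with equal keys."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in mapper_output:
        partitions[partition_for(key, num_reducers)].append((key, value))
    for p in partitions:
        p.sort(key=lambda kv: kv[0])
    return partitions

records = [("apple", 1), ("banana", 1), ("apple", 2), ("cherry", 1)]
```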
We discussed several join algorithms for MapReduce, most of which are also internally used in MPP databases and dataflow engines. They also provide a good illustration of how partitioned algorithms work:
我们讨论了几种 MapReduce 的连接算法,其中大部分也在 MPP 数据库和数据流引擎中内部使用。它们也很好地说明了分区算法的工作原理。
- Sort-merge joins
-
Each of the inputs being joined goes through a mapper that extracts the join key. By partitioning, sorting, and merging, all the records with the same key end up going to the same call of the reducer. This function can then output the joined records.
每个待连接的输入都经过一个提取连接键的mapper。通过分区、排序和合并,所有具有相同键的记录最终进入同一次reducer调用。然后这个函数就可以输出连接后的记录。
- Broadcast hash joins
-
One of the two join inputs is small, so it is not partitioned and it can be entirely loaded into a hash table. Thus, you can start a mapper for each partition of the large join input, load the hash table for the small input into each mapper, and then scan over the large input one record at a time, querying the hash table for each record.
两个连接输入中的一个较小,因此它不会被分区,可以完全加载到哈希表中。因此,您可以为大型连接输入的每个分区启动一个映射器,将小输入的哈希表加载到每个映射器中,然后逐个扫描大型输入记录,查询每个记录的哈希表。
- Partitioned hash joins
-
If the two join inputs are partitioned in the same way (using the same key, same hash function, and same number of partitions), then the hash table approach can be used independently for each partition.
如果两个连接输入以相同的方式进行分区(使用相同的键,相同的哈希函数和相同数量的分区),则哈希表方法可以独立地用于每个分区。
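Two of these strategies can be sketched as single-process Python functions (in a real framework the inputs would be partitioned across machines; these are illustrative stand-ins, not any engine's API):

其中两种策略可以写成单进程的Python函数草图(在真实框架中,输入会在多台机器间分区;以下仅为示意,并非任何引擎的API):

```python
from itertools import groupby
from operator import itemgetter

def sort_merge_join(left, right):
    """Reduce-side sort-merge join: tag each record with its source,
    sort by key, and emit the cross product within each key group."""
    tagged = ([(k, "L", v) for k, v in left] +
              [(k, "R", v) for k, v in right])
    tagged.sort(key=itemgetter(0))
    out = []
    for key, group in groupby(tagged, key=itemgetter(0)):
        group = list(group)
        lefts = [v for _, side, v in group if side == "L"]
        rights = [v for _, side, v in group if side == "R"]
        out.extend((key, lv, rv) for lv in lefts for rv in rights)
    return out

def broadcast_hash_join(large, small):
    """Load the small input into an in-memory hash table, then stream
    the large input one record at a time, probing the table."""
    table = {}
    for k, v in small:                       # small side fits in memory
        table.setdefault(k, []).append(v)    # allow several entries per key
    return [(k, lv, sv) for k, lv in large for sv in table.get(k, [])]
```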
Distributed batch processing engines have a deliberately restricted programming model: callback functions (such as mappers and reducers) are assumed to be stateless and to have no externally visible side effects besides their designated output. This restriction allows the framework to hide some of the hard distributed systems problems behind its abstraction: in the face of crashes and network issues, tasks can be retried safely, and the output from any failed tasks is discarded. If several tasks for a partition succeed, only one of them actually makes its output visible.
分布式批处理引擎采用有意限制的编程模型: 回调函数 (例如 mappers 和 reducers) 被假定为无状态的,并且除了它们指定的输出之外,没有任何外部可见的副作用。 这种限制使框架能够将一些艰难的分布式系统问题隐藏在其抽象后面: 在遇到崩溃和网络问题时,任务可以安全地重试,并且任何失败任务的输出都会被丢弃。 如果一个分区的多个任务成功,只有其中一个实际上会使其输出可见。
Thanks to the framework, your code in a batch processing job does not need to worry about implementing fault-tolerance mechanisms: the framework can guarantee that the final output of a job is the same as if no faults had occurred, even though in reality various tasks perhaps had to be retried. These reliable semantics are much stronger than what you usually have in online services that handle user requests and that write to databases as a side effect of processing a request.
得益于框架,批处理作业中的代码无需操心容错机制的实现:即使实际上可能有各种任务被重试过,框架也能保证作业的最终输出与没有发生任何故障时相同。这种可靠的语义比在线服务中通常具备的语义要强得多,后者在处理用户请求的同时,还会把写入数据库作为处理请求的副作用。
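The "only one attempt makes its output visible" rule can be sketched with an atomic filesystem operation (a single-machine, illustrative stand-in for a real output committer; the path naming scheme is made up): each attempt writes to its own temporary file, the first attempt to link its file into the final name wins, and later attempts discard their output.

“只有一次尝试的输出可见”这条规则可以用一个原子的文件系统操作来勾勒(以单机草图示意真实的output committer,路径命名方式为虚构):每次尝试写入自己的临时文件,第一个把文件链接到最终名字的尝试获胜,之后的尝试则丢弃自己的输出。

```python
import os

def run_task_attempt(task_id, attempt, compute, out_dir):
    """Run one attempt of a task; returns True if this attempt's output
    became the visible one."""
    final_path = os.path.join(out_dir, f"part-{task_id}")
    tmp_path = os.path.join(out_dir, f"_tmp-{task_id}-attempt-{attempt}")
    with open(tmp_path, "w") as f:
        f.write(compute())            # side effects stay in the attempt's file
    try:
        os.link(tmp_path, final_path)  # atomic: fails if already committed
        committed = True
    except FileExistsError:
        committed = False              # a sibling attempt won; discard ours
    os.remove(tmp_path)
    return committed
```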
The distinguishing feature of a batch processing job is that it reads some input data and produces some output data, without modifying the input—in other words, the output is derived from the input. Crucially, the input data is bounded : it has a known, fixed size (for example, it consists of a set of log files at some point in time, or a snapshot of a database’s contents). Because it is bounded, a job knows when it has finished reading the entire input, and so a job eventually completes when it is done.
批处理作业的显著特点是:它读取一些输入数据并产生一些输出数据,而不修改输入。换句话说,输出是从输入导出的。关键的是,输入数据是有界的:它具有已知的、固定的大小(例如,它由某个时间点的一组日志文件组成,或者是数据库内容的一个快照)。由于输入是有界的,作业知道自己何时读完了全部输入,因此作业最终会在处理完成时结束。
In the next chapter, we will turn to stream processing, in which the input is unbounded —that is, you still have a job, but its inputs are never-ending streams of data. In this case, a job is never complete, because at any time there may still be more work coming in. We shall see that stream and batch processing are similar in some respects, but the assumption of unbounded streams also changes a lot about how we build systems.
在下一章中,我们将转向流处理,其中的输入是无界的。也就是说,你仍然有一个作业,但它的输入是永无止境的数据流。在这种情况下,作业永远不会完成,因为任何时候都可能有更多的工作到来。我们将看到,流处理和批处理在某些方面很相似,但无界流的假设也极大地改变了我们构建系统的方式。
Footnotes
i Some people love to point out that cat is unnecessary here, as the input file could be given directly as an argument to awk. However, the linear pipeline is more apparent when written like this.
有些人喜欢指出,这里的cat是不必要的,因为输入文件可以直接作为参数传给awk。然而,这样写能让线性管道更加一目了然。
ii Another example of a uniform interface is URLs and HTTP, the foundations of the web. A URL identifies a particular thing (resource) on a website, and you can link to any URL from any other website. A user with a web browser can thus seamlessly jump between websites by following links, even though the servers may be operated by entirely unrelated organizations. This principle seems obvious today, but it was a key insight in making the web the success that it is today. Prior systems were not so uniform: for example, in the era of bulletin board systems (BBSs), each system had its own phone number and baud rate configuration. A reference from one BBS to another would have to be in the form of a phone number and modem settings; the user would have to hang up, dial the other BBS, and then manually find the information they were looking for. It wasn’t possible to link directly to some piece of content inside another BBS.
另一个统一接口的例子是URL和HTTP,它们是Web的基础。URL标识网站上的某个特定事物(资源),你可以从任何其他网站链接到任何URL。因此,使用Web浏览器的用户可以通过点击链接在网站之间无缝跳转,即使这些服务器可能由完全不相关的组织运营。这个原则在今天看来显而易见,但它是Web取得如今成功的一个关键洞察。此前的系统并不这么统一:例如,在公告板系统(BBS)的时代,每个系统都有自己的电话号码和波特率配置。从一个BBS对另一个BBS的引用必须采取电话号码和调制解调器设置的形式;用户必须挂断电话,拨号连接另一个BBS,然后手动查找所需的信息。直接链接到另一个BBS内部的某条内容是不可能的。
iii Except by using a separate tool, such as netcat or curl. Unix started out trying to represent everything as files, but the BSD sockets API deviated from that convention [17]. The research operating systems Plan 9 and Inferno are more consistent in their use of files: they represent a TCP connection as a file in /net/tcp [18].
除非使用其他工具,如netcat或curl,否则无法实现。Unix最初试图将所有内容表示为文件,但BSD套接字API偏离了这个约定[17]。研究操作系统Plan 9和Inferno在使用文件上更加一致:它们将TCP连接表示为 /net/tcp中的文件[18]。
iv One difference is that with HDFS, computing tasks can be scheduled to run on the machine that stores a copy of a particular file, whereas object stores usually keep storage and computation separate. Reading from a local disk has a performance advantage if network bandwidth is a bottleneck. Note however that if erasure coding is used, the locality advantage is lost, because the data from several machines must be combined in order to reconstitute the original file [ 20 ].
一个区别是,使用HDFS可以把计算任务调度到存有特定文件副本的机器上运行,而对象存储通常将存储和计算分开。如果网络带宽是瓶颈,则从本地磁盘读取具有性能优势。但请注意,如果使用纠删码,局部性优势就会丧失,因为必须把来自多台机器的数据组合在一起才能重建原始文件[20]。
v The joins we talk about in this book are generally equi-joins , the most common type of join, in which a record is associated with other records that have an identical value in a particular field (such as an ID). Some databases support more general types of joins, for example using a less-than operator instead of an equality operator, but we do not have space to cover them here.
本书中讨论的连接通常是等值连接,这是最常见的连接类型,其中一条记录与在特定字段(如ID)中具有相同值的其他记录相关联。某些数据库支持更一般的连接类型,例如使用小于运算符而不是等于运算符,但我们没有空间在此处涵盖它们。
vi This example assumes that there is exactly one entry for each key in the hash table, which is probably true with a user database (a user ID uniquely identifies a user). In general, the hash table may need to contain several entries with the same key, and the join operator will output all matches for a key.
此示例假定哈希表中每个键都有唯一一个条目,这在用户数据库中可能是正确的(用户ID唯一标识用户)。一般来说,哈希表可能需要包含多个具有相同键的条目,连接操作符将输出键的所有匹配项。
References
[ 1 ] Jeffrey Dean and Sanjay Ghemawat: “ MapReduce: Simplified Data Processing on Large Clusters ,” at 6th USENIX Symposium on Operating System Design and Implementation (OSDI), December 2004.
[1] Jeffrey Dean 和 Sanjay Ghemawat:“MapReduce:大型集群上的简化数据处理”,于2004年12月的第六届USENIX操作系统设计和实现研讨会(OSDI)上。
[ 2 ] Joel Spolsky: “ The Perils of JavaSchools ,” joelonsoftware.com , December 25, 2005.
[2] Joel Spolsky: “Java 学校的危险”,来源于 joelonsoftware.com,2005 年 12 月 25 日。
[ 3 ] Shivnath Babu and Herodotos Herodotou: “ Massively Parallel Databases and MapReduce Systems ,” Foundations and Trends in Databases , volume 5, number 1, pages 1–104, November 2013. doi:10.1561/1900000036
[3] Shivnath Babu和Herodotos Herodotou:“大规模并行数据库与MapReduce系统”,《Foundations and Trends in Databases》,第5卷,第1期,第1-104页,2013年11月。doi:10.1561/1900000036
[ 4 ] David J. DeWitt and Michael Stonebraker: “ MapReduce: A Major Step Backwards ,” originally published at databasecolumn.vertica.com , January 17, 2008.
[4] David J. DeWitt和Michael Stonebraker:“MapReduce:一个重大的倒退”,最初发表于databasecolumn.vertica.com,2008年1月17日。
[ 5 ] Henry Robinson: “ The Elephant Was a Trojan Horse: On the Death of Map-Reduce at Google ,” the-paper-trail.org , June 25, 2014.
[5] 亨利·罗宾逊:“大象是特洛伊木马:论谷歌Map-Reduce之死”,the-paper-trail.org,2014年6月25日。
[ 6 ] “ The Hollerith Machine ,” United States Census Bureau, census.gov .
[6] “洪勒里特机器”,美国人口普查局,census.gov。
[ 7 ] “ IBM 82, 83, and 84 Sorters Reference Manual ,” Edition A24-1034-1, International Business Machines Corporation, July 1962.
[7] “IBM 82、83和84分类机参考手册”,A24-1034-1版,国际商业机器公司,1962年7月。
[ 8 ] Adam Drake: “ Command-Line Tools Can Be 235x Faster than Your Hadoop Cluster ,” aadrake.com , January 25, 2014.
[8] Adam Drake:“命令行工具可以比你的Hadoop集群快235倍”,aadrake.com,2014年1月25日。
[ 9 ] “ GNU Coreutils 8.23 Documentation ,” Free Software Foundation, Inc., 2014.
[9] “GNU Coreutils 8.23文档”,自由软件基金会,2014年。
[ 10 ] Martin Kleppmann: “ Kafka, Samza, and the Unix Philosophy of Distributed Data ,” martin.kleppmann.com , August 5, 2015.
[10] Martin Kleppmann:“Kafka、Samza和分布式数据的Unix哲学”, martin.kleppmann.com, 2015年8月5日。
[ 11 ] Doug McIlroy: Internal Bell Labs memo , October 1964. Cited in: Dennis M. Ritchie: “ Advice from Doug McIlroy ,” cm.bell-labs.com .
[11] 道格·麦克罗伊(Doug McIlroy):贝尔实验室内部备忘录,1964年10月。引用自:丹尼斯·里奇(Dennis M. Ritchie):“道格·麦克罗伊的建议”,cm.bell-labs.com。
[ 12 ] M. D. McIlroy, E. N. Pinson, and B. A. Tague: “ UNIX Time-Sharing System: Foreword ,” The Bell System Technical Journal , volume 57, number 6, pages 1899–1904, July 1978.
[12] M. D. McIlroy、E. N. Pinson和B. A. Tague:“UNIX分时系统:前言”,《贝尔系统技术杂志》,第57卷,第6期,第1899-1904页,1978年7月。
[ 13 ] Eric S. Raymond: The Art of UNIX Programming . Addison-Wesley, 2003. ISBN: 978-0-13-142901-7
[13] Eric S. Raymond: UNIX编程艺术。Addison-Wesley出版社,2003年。ISBN: 978-0-13-142901-7。
[ 14 ] Ronald Duncan: “ Text File Formats – ASCII Delimited Text – Not CSV or TAB Delimited Text ,” ronaldduncan.wordpress.com , October 31, 2009.
[14]罗纳德·邓肯: “文本文件格式-ASCII分隔文本-不是CSV或Tab分隔文本,” ronaldduncan.wordpress.com,2009年10月31日。
[ 15 ] Alan Kay: “ Is ‘Software Engineering’ an Oxymoron? ,” tinlizzie.org .
[15] 艾伦·凯(Alan Kay):“‘软件工程’是一个矛盾修辞吗?”,tinlizzie.org。
[ 16 ] Martin Fowler: “ InversionOfControl ,” martinfowler.com , June 26, 2005.
[16] Martin Fowler: “控制反转,” martinfowler.com,2005年6月26日。
[ 17 ] Daniel J. Bernstein: “ Two File Descriptors for Sockets ,” cr.yp.to .
[17] 丹尼尔·J·伯恩斯坦(Daniel J. Bernstein):“套接字的两个文件描述符”,cr.yp.to。
[ 18 ] Rob Pike and Dennis M. Ritchie: “ The Styx Architecture for Distributed Systems ,” Bell Labs Technical Journal , volume 4, number 2, pages 146–152, April 1999.
[18] Rob Pike和Dennis M. Ritchie:“分布式系统的Styx架构”,《贝尔实验室技术杂志》,第4卷,第2期,第146-152页,1999年4月。
[ 19 ] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: “ The Google File System ,” at 19th ACM Symposium on Operating Systems Principles (SOSP), October 2003. doi:10.1145/945445.945450
[19] Sanjay Ghemawat, Howard Gobioff, 和Shun-Tak Leung:“谷歌文件系统”,发表于2003年10月的第19届ACM操作系统原理研讨会(SOSP)。doi:10.1145/945445.945450。
[ 20 ] Michael Ovsiannikov, Silvius Rus, Damian Reeves, et al.: “ The Quantcast File System ,” Proceedings of the VLDB Endowment , volume 6, number 11, pages 1092–1101, August 2013. doi:10.14778/2536222.2536234
[20] Michael Ovsiannikov、Silvius Rus、Damian Reeves等:“Quantcast文件系统”,《Proceedings of the VLDB Endowment》,第6卷,第11期,第1092-1101页,2013年8月。doi:10.14778/2536222.2536234
[ 21 ] “ OpenStack Swift 2.6.1 Developer Documentation ,” OpenStack Foundation, docs.openstack.org , March 2016.
[21] “OpenStack Swift 2.6.1开发者文档”,OpenStack基金会,docs.openstack.org,2016年3月。
[ 22 ] Zhe Zhang, Andrew Wang, Kai Zheng, et al.: “ Introduction to HDFS Erasure Coding in Apache Hadoop ,” blog.cloudera.com , September 23, 2015.
[22] 张哲(Zhe Zhang)、Andrew Wang、郑凯(Kai Zheng)等:“Apache Hadoop中的HDFS纠删码简介”,blog.cloudera.com,2015年9月23日。
[ 23 ] Peter Cnudde: “ Hadoop Turns 10 ,” yahoohadoop.tumblr.com , February 5, 2016.
[23] Peter Cnudde:“Hadoop十周年”,yahoohadoop.tumblr.com,2016年2月5日。
[ 24 ] Eric Baldeschwieler: “ Thinking About the HDFS vs. Other Storage Technologies ,” hortonworks.com , July 25, 2012.
"[24] Eric Baldeschwieler: “关于HDFS与其他存储技术的思考”,hortonworks.com,2012年7月25日。"
[ 25 ] Brendan Gregg: “ Manta: Unix Meets Map Reduce ,” dtrace.org , June 25, 2013.
[25] Brendan Gregg:“Manta:Unix遇上MapReduce”,dtrace.org,2013年6月25日。
[ 26 ] Tom White: Hadoop: The Definitive Guide , 4th edition. O’Reilly Media, 2015. ISBN: 978-1-491-90163-2
[26] Tom White: Hadoop: 完全指南, 第四版. O'Reilly 媒体, 2015. ISBN: 978-1-491-90163-2
[ 27 ] Jim N. Gray: “ Distributed Computing Economics ,” Microsoft Research Tech Report MSR-TR-2003-24, March 2003.
[27]吉姆·格雷: 《分布式计算经济学》,微软研究技术报告MSR-TR-2003-24,2003年3月。
[ 28 ] Márton Trencséni: “ Luigi vs Airflow vs Pinball ,” bytepawn.com , February 6, 2016.
[28] Márton Trencséni:“Luigi vs Airflow vs Pinball”,bytepawn.com,2016年2月6日。
[ 29 ] Roshan Sumbaly, Jay Kreps, and Sam Shah: “ The ‘Big Data’ Ecosystem at LinkedIn ,” at ACM International Conference on Management of Data (SIGMOD), July 2013. doi:10.1145/2463676.2463707
[29] Roshan Sumbaly、Jay Kreps和Sam Shah:“LinkedIn的‘大数据’生态系统”,发表于2013年7月ACM数据管理国际会议(SIGMOD)。doi:10.1145/2463676.2463707
[ 30 ] Alan F. Gates, Olga Natkovich, Shubham Chopra, et al.: “ Building a High-Level Dataflow System on Top of Map-Reduce: The Pig Experience ,” at 35th International Conference on Very Large Data Bases (VLDB), August 2009.
[30] Alan F. Gates, Olga Natkovich, Shubham Chopra等: “在Map-Reduce之上构建一个高级数据流系统:Pig体验”,刊于2009年8月的第35届国际超大型数据库会议(VLDB)。
[ 31 ] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, et al.: “ Hive – A Petabyte Scale Data Warehouse Using Hadoop ,” at 26th IEEE International Conference on Data Engineering (ICDE), March 2010. doi:10.1109/ICDE.2010.5447738
[31] Ashish Thusoo、Joydeep Sen Sarma、Namit Jain 等人: “Hive – 使用 Hadoop 的 PB 级数据仓库”,收录于 26 届 IEEE 国际数据工程会议 (ICDE),2010年3月。 doi:10.1109/ICDE.2010.5447738。
[ 32 ] “ Cascading 3.0 User Guide ,” Concurrent, Inc., docs.cascading.org , January 2016.
[32] “Cascading 3.0用户指南”,Concurrent, Inc.,docs.cascading.org,2016年1月。
[ 33 ] “ Apache Crunch User Guide ,” Apache Software Foundation, crunch.apache.org .
[33] “Apache Crunch 用户指南”,Apache软件基金会,crunch.apache.org。
[ 34 ] Craig Chambers, Ashish Raniwala, Frances Perry, et al.: “ FlumeJava: Easy, Efficient Data-Parallel Pipelines ,” at 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2010. doi:10.1145/1806596.1806638
[34] Craig Chambers(克雷格·钱伯斯),Ashish Raniwala(阿希斯·拉尼瓦拉),Frances Perry(弗朗西斯·佩里)等人:“FlumeJava:易于使用,高效的数据并行管道”,出版于2010年6月的第31届ACM SIGPLAN编程语言设计与实现会议(PLDI)。doi:10.1145/1806596.1806638。
[ 35 ] Jay Kreps: “ Why Local State is a Fundamental Primitive in Stream Processing ,” oreilly.com , July 31, 2014.
Jay Kreps:「为何本地状态是流处理中的基本原语」,oreilly.com,2014 年 7 月 31 日。
[ 36 ] Martin Kleppmann: “ Rethinking Caching in Web Apps ,” martin.kleppmann.com , October 1, 2012.
[36] Martin Kleppmann:“重新思考Web应用程序中的缓存”,martin.kleppmann.com,2012年10月1日。
[ 37 ] Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira: Hadoop Application Architectures . O’Reilly Media, 2015. ISBN: 978-1-491-90004-8
马克·格罗弗(Mark Grover)、泰德·马拉斯卡(Ted Malaska)、乔纳森·塞德曼(Jonathan Seidman)和格温·夏皮拉(Gwen Shapira):《Hadoop 应用架构》,O’Reilly Media,2015年,ISBN: 978-1-491-90004-8
[ 38 ] Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “ Challenges to Adopting Stronger Consistency at Scale ,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.
[38] Philippe Ajoux, Nathan Bronson, Sanjeev Kumar等: "在规模上采用更强的一致性所面临的挑战",于2015年5月在第15届USENIX操作系统热门主题研讨会(HotOS)上发表。
[ 39 ] Sriranjan Manjunath: “ Skewed Join ,” wiki.apache.org , 2009.
[39] Sriranjan Manjunath:“倾斜连接(Skewed Join)”,wiki.apache.org,2009年。
[ 40 ] David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider, and S. Seshadri: “ Practical Skew Handling in Parallel Joins ,” at 18th International Conference on Very Large Data Bases (VLDB), August 1992.
[40] David J. DeWitt、Jeffrey F. Naughton、Donovan A. Schneider和S. Seshadri:“并行连接中实用的倾斜处理”,发表于1992年8月第18届超大型数据库国际会议(VLDB)。
[ 41 ] Marcel Kornacker, Alexander Behm, Victor Bittorf, et al.: “ Impala: A Modern, Open-Source SQL Engine for Hadoop ,” at 7th Biennial Conference on Innovative Data Systems Research (CIDR), January 2015.
[41] Marcel Kornacker,Alexander Behm,Victor Bittorf等人:「Impala:Hadoop 的现代开源 SQL 引擎」,发表于第七届创新数据系统研究双年会(CIDR),2015年1月。
[ 42 ] Matthieu Monsch: “ Open-Sourcing PalDB, a Lightweight Companion for Storing Side Data ,” engineering.linkedin.com , October 26, 2015.
[42] Matthieu Monsch:“开源 PalDB,一个轻量级的存储辅助工具”,工程.linkedin.com,2015年10月26日。
[ 43 ] Daniel Peng and Frank Dabek: “ Large-Scale Incremental Processing Using Distributed Transactions and Notifications ,” at 9th USENIX conference on Operating Systems Design and Implementation (OSDI), October 2010.
[43] 丹尼尔·彭和弗兰克·达贝克:「使用分布式事务和通知进行大规模增量处理」,发表于2010年10月第9届USENIX操作系统设计和实现会议(OSDI)。
[ 44 ] “Cloudera Search User Guide,” Cloudera, Inc., September 2015.
[44] “Cloudera搜索用户指南”,Cloudera,Inc.,2015年9月。
[ 45 ] Lili Wu, Sam Shah, Sean Choi, et al.: “ The Browsemaps: Collaborative Filtering at LinkedIn ,” at 6th Workshop on Recommender Systems and the Social Web (RSWeb), October 2014.
[45] Lili Wu、Sam Shah、Sean Choi等:“Browsemaps:LinkedIn上的协同过滤”,发表于2014年10月第6届推荐系统与社交网络研讨会(RSWeb)。
[ 46 ] Roshan Sumbaly, Jay Kreps, Lei Gao, et al.: “ Serving Large-Scale Batch Computed Data with Project Voldemort ,” at 10th USENIX Conference on File and Storage Technologies (FAST), February 2012.
[46] Roshan Sumbaly、Jay Kreps、Lei Gao等:“使用Project Voldemort为大规模批处理计算数据提供服务”,发表于2012年2月第10届USENIX文件与存储技术会议(FAST)。
[ 47 ] Varun Sharma: “ Open-Sourcing Terrapin: A Serving System for Batch Generated Data ,” engineering.pinterest.com , September 14, 2015.
[47] Varun Sharma:“开源Terrapin:一个面向批量生成数据的服务系统”,engineering.pinterest.com,2015年9月14日。
[ 48 ] Nathan Marz: “ ElephantDB ,” slideshare.net , May 30, 2011.
[48] Nathan Marz:“ElephantDB”,slideshare.net,2011年5月30日。
[ 49 ] Jean-Daniel (JD) Cryans: “ How-to: Use HBase Bulk Loading, and Why ,” blog.cloudera.com , September 27, 2013.
[49] Jean-Daniel (JD) Cryans:“如何使用HBase批量加载,以及为什么”,blog.cloudera.com,2013年9月27日。
[ 50 ] Nathan Marz: “ How to Beat the CAP Theorem ,” nathanmarz.com , October 13, 2011.
[50] Nathan Marz:“如何打败CAP定理”,nathanmarz.com,2011年10月13日。
[ 51 ] Molly Bartlett Dishman and Martin Fowler: “ Agile Architecture ,” at O’Reilly Software Architecture Conference , March 2015.
[51] 莫利·巴特利特·迪什曼和马丁·福勒: “敏捷架构”,于2015年3月在O'Reilly软件架构会议上。
[ 52 ] David J. DeWitt and Jim N. Gray: “ Parallel Database Systems: The Future of High Performance Database Systems ,” Communications of the ACM , volume 35, number 6, pages 85–98, June 1992. doi:10.1145/129888.129894
[52] David J. DeWitt 和 Jim N. Gray: “并行数据库系统: 高性能数据库系统的未来,”ACM 通讯, 卷 35, 第 6 期, 页码 85–98,1992年 6 月. doi:10.1145/129888.129894.
[ 53 ] Jay Kreps: “ But the multi-tenancy thing is actually really really hard ,” tweetstorm, twitter.com , October 31, 2014.
[53] Jay Kreps:“但是多租户的事实际上非常非常难。”(tweetstorm,twitter.com,2014年10月31日。)
[ 54 ] Jeffrey Cohen, Brian Dolan, Mark Dunlap, et al.: “ MAD Skills: New Analysis Practices for Big Data ,” Proceedings of the VLDB Endowment , volume 2, number 2, pages 1481–1492, August 2009. doi:10.14778/1687553.1687576
[54] Jeffrey Cohen、Brian Dolan、Mark Dunlap等:“MAD技能:大数据的新分析实践”,《Proceedings of the VLDB Endowment》,第2卷,第2期,第1481-1492页,2009年8月。doi:10.14778/1687553.1687576
[ 55 ] Ignacio Terrizzano, Peter Schwarz, Mary Roth, and John E. Colino: “ Data Wrangling: The Challenging Journey from the Wild to the Lake ,” at 7th Biennial Conference on Innovative Data Systems Research (CIDR), January 2015.
[55] Ignacio Terrizzano, Peter Schwarz, Mary Roth, and John E. Colino: “数据整理:从荒野到湖泊的挑战之旅”,于2015年1月在第七届创新数据系统研究(CIDR)双年会上发表。
[ 56 ] Paige Roberts: “ To Schema on Read or to Schema on Write, That Is the Hadoop Data Lake Question ,” adaptivesystemsinc.com , July 2, 2015.
[56] Paige Roberts:“读时模式还是写时模式,这是Hadoop数据湖的问题”,adaptivesystemsinc.com,2015年7月2日。
[ 57 ] Bobby Johnson and Joseph Adler: “ The Sushi Principle: Raw Data Is Better ,” at Strata+Hadoop World , February 2015.
[57] 罗伯特·约翰逊和约瑟夫·阿德勒: “寿司原则:原始数据更好” ,于2015年2月在Strata+Hadoop World发表。
[ 58 ] Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, et al.: “ Apache Hadoop YARN: Yet Another Resource Negotiator ,” at 4th ACM Symposium on Cloud Computing (SoCC), October 2013. doi:10.1145/2523616.2523633
[58] Vinod Kumar Vavilapalli、Arun C. Murthy、Chris Douglas等:“Apache Hadoop YARN:又一个资源协商者”,发表于2013年10月第4届ACM云计算研讨会(SoCC)。doi:10.1145/2523616.2523633
[ 59 ] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, et al.: “ Large-Scale Cluster Management at Google with Borg ,” at 10th European Conference on Computer Systems (EuroSys), April 2015. doi:10.1145/2741948.2741964
[59] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu等: “在Google中的大规模集群管理 使用Borg” ,出自于第10届欧洲计算机系统会议(EuroSys),2015年4月。 doi:10.1145/2741948.2741964
[ 60 ] Malte Schwarzkopf: “ The Evolution of Cluster Scheduler Architectures ,” firmament.io , March 9, 2016.
[60] Malte Schwarzkopf:“集群调度器架构的演变”,firmament.io,2016年3月9日。
[ 61 ] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, et al.: “ Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing ,” at 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI), April 2012.
[61] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das 等人: “弹性分布式数据集:一种容错的内存集群计算抽象”,于2012年4月在第9届USENIX网络系统设计和实现研讨会(NSDI)上发表。
[ 62 ] Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia: Learning Spark . O’Reilly Media, 2015. ISBN: 978-1-449-35904-1
[62] Holden Karau,Andy Konwinski,Patrick Wendell和Matei Zaharia:《学习Spark》。O'Reilly Media,2015. ISBN:978-1-449-35904-1。
[ 63 ] Bikas Saha and Hitesh Shah: “ Apache Tez: Accelerating Hadoop Query Processing ,” at Hadoop Summit , June 2014.
[63] Bikas Saha和Hitesh Shah: “Apache Tez:加速Hadoop查询处理”,于2014年6月的Hadoop峰会上。
[ 64 ] Bikas Saha, Hitesh Shah, Siddharth Seth, et al.: “ Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications ,” at ACM International Conference on Management of Data (SIGMOD), June 2015. doi:10.1145/2723372.2742790
[64] Bikas Saha、Hitesh Shah、Siddharth Seth等:“Apache Tez:一个用于建模和构建数据处理应用的统一框架”,发表于2015年6月ACM数据管理国际会议(SIGMOD)。doi:10.1145/2723372.2742790
[ 65 ] Kostas Tzoumas: “ Apache Flink: API, Runtime, and Project Roadmap ,” slideshare.net , January 14, 2015.
[65] Kostas Tzoumas: “Apache Flink:API、Runtime 和项目路线图”,slideshare.net,2015 年 1 月 14 日。
[ 66 ] Alexander Alexandrov, Rico Bergmann, Stephan Ewen, et al.: “ The Stratosphere Platform for Big Data Analytics ,” The VLDB Journal , volume 23, number 6, pages 939–964, May 2014. doi:10.1007/s00778-014-0357-y
[66] 亚历山大·亚历山德罗夫、里科·柏格曼、斯蒂芬·伊文等人:《大数据分析的平流层平台》,《VLDB Journal》,第23卷,第6期,2014年5月,第939-964页。doi:10.1007/s00778-014-0357-y。
[ 67 ] Michael Isard, Mihai Budiu, Yuan Yu, et al.: “ Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks ,” at European Conference on Computer Systems (EuroSys), March 2007. doi:10.1145/1272996.1273005
[67] Michael Isard、Mihai Budiu、Yuan Yu等:“Dryad:用顺序构建块构造分布式数据并行程序”,发表于2007年3月欧洲计算机系统会议(EuroSys)。doi:10.1145/1272996.1273005
[ 68 ] Daniel Warneke and Odej Kao: “ Nephele: Efficient Parallel Data Processing in the Cloud ,” at 2nd Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS), November 2009. doi:10.1145/1646468.1646476
[68] Daniel Warneke和Odej Kao:“Nephele: 云中高效的并行数据处理”,发表于第二届网格和超级计算的许多任务计算研讨会议(MTAGS),2009年11月。doi:10.1145/1646468.1646476。
[ 69 ] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd: “ The PageRank Citation Ranking: Bringing Order to the Web ,” Stanford InfoLab Technical Report 422, 1999.
[69] 劳伦斯·佩奇(Lawrence Page),谢尔盖·布林(Sergey Brin),拉吉夫·莫特瓦尼(Rajeev Motwani)和特里·温格勒德(Terry Winograd):“PageRank引用排名:为网络带来秩序”,斯坦福信息实验室技术报告422,1999年。
[ 70 ] Leslie G. Valiant: “ A Bridging Model for Parallel Computation ,” Communications of the ACM , volume 33, number 8, pages 103–111, August 1990. doi:10.1145/79173.79181
[70] 莱斯利·瓦利安特: "并行计算的桥接模型", ACM通讯,卷33,号8,页103-111,1990年8月。 doi:10.1145/79173.79181。
[ 71 ] Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl: “ Spinning Fast Iterative Data Flows ,” Proceedings of the VLDB Endowment , volume 5, number 11, pages 1268-1279, July 2012. doi:10.14778/2350229.2350245
[71] Stephan Ewen、Kostas Tzoumas、Moritz Kaufmann和Volker Markl:“Spinning Fast Iterative Data Flows”,《Proceedings of the VLDB Endowment》,第5卷,第11期,第1268-1279页,2012年7月。doi:10.14778/2350229.2350245
[ 72 ] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, et al.: “ Pregel: A System for Large-Scale Graph Processing ,” at ACM International Conference on Management of Data (SIGMOD), June 2010. doi:10.1145/1807167.1807184
[72] Grzegorz Malewicz、Matthew H. Austern、Aart J. C. Bik等:“Pregel:一个用于大规模图处理的系统”,发表于2010年6月ACM数据管理国际会议(SIGMOD)。doi:10.1145/1807167.1807184
[ 73 ] Frank McSherry, Michael Isard, and Derek G. Murray: “ Scalability! But at What COST? ,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.
[73] Frank McSherry, Michael Isard, 和 Derek G. Murray: “可扩展性!但代价是什么?”,于 2015 年 5 月在第 15 届 USENIX 操作系统热点问题研讨会上发表。
[ 74 ] Ionel Gog, Malte Schwarzkopf, Natacha Crooks, et al.: “ Musketeer: All for One, One for All in Data Processing Systems ,” at 10th European Conference on Computer Systems (EuroSys), April 2015. doi:10.1145/2741948.2741968
[74] Ionel Gog、Malte Schwarzkopf、Natacha Crooks等:“Musketeer:数据处理系统中的All for One, One for All”,发表于2015年4月第10届欧洲计算机系统会议(EuroSys)。doi:10.1145/2741948.2741968
[ 75 ] Aapo Kyrola, Guy Blelloch, and Carlos Guestrin: “ GraphChi: Large-Scale Graph Computation on Just a PC ,” at 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2012.
[75] Aapo Kyrola、Guy Blelloch和Carlos Guestrin:“GraphChi:在一台PC上进行大规模图计算”,发表于2012年10月第10届USENIX操作系统设计与实现研讨会(OSDI)。
[ 76 ] Andrew Lenharth, Donald Nguyen, and Keshav Pingali: “ Parallel Graph Analytics ,” Communications of the ACM , volume 59, number 5, pages 78–87, May 2016. doi:10.1145/2901919
[76] Andrew Lenharth、Donald Nguyen和Keshav Pingali:“并行图分析”,《ACM通讯》,第59卷,第5期,第78-87页,2016年5月。doi:10.1145/2901919
[ 77 ] Fabian Hüske: “ Peeking into Apache Flink’s Engine Room ,” flink.apache.org , March 13, 2015.
[77] Fabian Hüske:“深入了解Apache Flink引擎”,flink.apache.org,2015年3月13日。
[ 78 ] Mostafa Mokhtar: “ Hive 0.14 Cost Based Optimizer (CBO) Technical Overview ,” hortonworks.com , March 2, 2015.
[78] Mostafa Mokhtar:“Hive 0.14基于成本的优化器(CBO)技术概述”,hortonworks.com,2015年3月2日。
[ 79 ] Michael Armbrust, Reynold S Xin, Cheng Lian, et al.: “ Spark SQL: Relational Data Processing in Spark ,” at ACM International Conference on Management of Data (SIGMOD), June 2015. doi:10.1145/2723372.2742797
[79] Michael Armbrust、Reynold S. Xin、Cheng Lian等:“Spark SQL:Spark中的关系数据处理”,发表于2015年6月ACM数据管理国际会议(SIGMOD)。doi:10.1145/2723372.2742797
[ 80 ] Daniel Blazevski: “ Planting Quadtrees for Apache Flink ,” insightdataengineering.com , March 25, 2016.
[80] Daniel Blazevski:“为Apache Flink种植四叉树”,insightdataengineering.com,2016年3月25日。
[ 81 ] Tom White: “ Genome Analysis Toolkit: Now Using Apache Spark for Data Processing ,” blog.cloudera.com , April 6, 2016.
[81] Tom White:“基因组分析工具包:现在使用Apache Spark进行数据处理”,blog.cloudera.com,2016年4月6日。
Chapter 11. Stream Processing
A complex system that works is invariably found to have evolved from a simple system that works. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work.
一个能正常运作的复杂系统,无一例外是从一个能正常运作的简单系统演化而来的。反过来的命题似乎也成立:从零开始设计的复杂系统从来无法正常运作,也无法通过修补使其运作。
John Gall, Systemantics (1975)
约翰·高尔,《系统论》(1975年)
In Chapter 10 we discussed batch processing—techniques that read a set of files as input and produce a new set of output files. The output is a form of derived data ; that is, a dataset that can be recreated by running the batch process again if necessary. We saw how this simple but powerful idea can be used to create search indexes, recommendation systems, analytics, and more.
在第10章中,我们讨论了批处理——一种读取一组文件作为输入并产生新的输出文件的技术。输出是一种导出数据的形式;也就是说,如果必要,可以通过再次运行批处理来重新创建该数据集。我们看到了这个简单而强大的想法如何被用于创建搜索索引、推荐系统、分析等等。
However, one big assumption remained throughout Chapter 10 : namely, that the input is bounded—i.e., of a known and finite size—so the batch process knows when it has finished reading its input. For example, the sorting operation that is central to MapReduce must read its entire input before it can start producing output: it could happen that the very last input record is the one with the lowest key, and thus needs to be the very first output record, so starting the output early is not an option.
然而,贯穿第10章始终有一个重要的假设:即输入是有界的,也就是大小已知且有限,因此批处理程序知道它何时读完了自己的输入。例如,MapReduce核心的排序操作必须读取其全部输入之后才能开始产生输出:可能出现的情况是,最后一条输入记录恰好是键最小的那条,因而必须成为第一条输出记录,所以提前开始输出是不可行的。
In reality, a lot of data is unbounded because it arrives gradually over time: your users produced data yesterday and today, and they will continue to produce more data tomorrow. Unless you go out of business, this process never ends, and so the dataset is never “complete” in any meaningful way [ 1 ]. Thus, batch processors must artificially divide the data into chunks of fixed duration: for example, processing a day’s worth of data at the end of every day, or processing an hour’s worth of data at the end of every hour.
实际上,很多数据是无界的,因为它随着时间的推移逐渐到达:你的用户在昨天和今天产生了数据,明天他们还会继续产生更多数据。除非你停业,否则这个过程永远不会结束,因此数据集在任何有意义的层面上都永远不会"完整"[1]。因此,批处理程序必须人为地把数据划分成固定时长的块:例如,在每天结束时处理一天的数据,或者在每小时结束时处理一小时的数据。
The problem with daily batch processes is that changes in the input are only reflected in the output a day later, which is too slow for many impatient users. To reduce the delay, we can run the processing more frequently—say, processing a second’s worth of data at the end of every second—or even continuously, abandoning the fixed time slices entirely and simply processing every event as it happens. That is the idea behind stream processing .
每日批处理的问题在于,输入的变更要过一天才会反映在输出中,这对许多没有耐心的用户来说太慢了。为了降低延迟,我们可以更频繁地运行处理——比如在每秒结束时处理一秒的数据——甚至完全放弃固定的时间片,在每个事件发生时就立即处理它。这就是流处理背后的想法。
In general, a “stream” refers to data that is incrementally made available over time. The concept appears in many places: in the stdin and stdout of Unix, programming languages (lazy lists) [2], filesystem APIs (such as Java’s FileInputStream), TCP connections, delivering audio and video over the internet, and so on.
一般来说,"流"指的是随着时间推移逐渐可用的数据。这个概念出现在很多地方:Unix的stdin和stdout、编程语言(惰性列表)[2]、文件系统API(如Java的FileInputStream)、TCP连接、通过互联网传输的音频和视频,等等。
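As a tiny illustration, a Python generator behaves like such a stream: values become available one at a time rather than all at once (a sketch, not tied to any particular system).
作为一个小小的示意,Python生成器的行为就像这样的流:值是逐个变得可用的,而不是一次性全部就绪(仅为草图,不对应任何具体系统)。

```python
def number_stream(limit):
    """Yield values one at a time; the full sequence never exists in memory."""
    for i in range(limit):
        yield i  # each value becomes available incrementally, like a stream

# A consumer pulls values as it needs them, much like reading from stdin.
stream = number_stream(5)
first_two = [next(stream), next(stream)]
rest = list(stream)  # the remaining values, read incrementally
```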
In this chapter we will look at event streams as a data management mechanism: the unbounded, incrementally processed counterpart to the batch data we saw in the last chapter . We will first discuss how streams are represented, stored, and transmitted over a network. In “Databases and Streams” we will investigate the relationship between streams and databases. And finally, in “Processing Streams” we will explore approaches and tools for processing those streams continually , and ways that they can be used to build applications.
在本章中,我们将把事件流视为一种数据管理机制:它是我们在上一章看到的批量数据的无界的、增量处理的对应物。我们将首先讨论流如何被表示、存储以及通过网络传输。在"数据库和流"中,我们将研究流和数据库之间的关系。最后,在"处理流"中,我们将探索持续处理这些流的方法和工具,以及利用它们构建应用程序的方式。
Transmitting Event Streams
In the batch processing world, the inputs and outputs of a job are files (perhaps on a distributed filesystem). What does the streaming equivalent look like?
在批处理世界中,作业的输入和输出是文件(可能位于分布式文件系统上)。流处理的等效物是什么样子?
When the input is a file (a sequence of bytes), the first processing step is usually to parse it into a sequence of records. In a stream processing context, a record is more commonly known as an event , but it is essentially the same thing: a small, self-contained, immutable object containing the details of something that happened at some point in time. An event usually contains a timestamp indicating when it happened according to a time-of-day clock (see “Monotonic Versus Time-of-Day Clocks” ).
当输入是一个文件(一个字节序列)时,第一个处理步骤通常是将其解析为一系列记录。在流处理的上下文中,记录通常被称为事件(event),但本质上是一样的:一个小的、自包含的、不可变的对象,包含某个时间点发生的某件事情的细节。事件通常包含一个时间戳,指明事件按照日历时钟(time-of-day clock)发生的时间(参见"单调钟与日历时钟")。
For example, the thing that happened might be an action that a user took, such as viewing a page or making a purchase. It might also originate from a machine, such as a periodic measurement from a temperature sensor, or a CPU utilization metric. In the example of “Batch Processing with Unix Tools” , each line of the web server log is an event.
例如,发生的事情可能是用户采取的行动,比如查看页面或进行购买。它也可能来自机器,比如从温度传感器的定期测量或CPU利用率指标。在“使用Unix工具进行批处理”的例子中,Web服务器日志的每一行都是一个事件。
An event may be encoded as a text string, or JSON, or perhaps in some binary form, as discussed in Chapter 4 . This encoding allows you to store an event, for example by appending it to a file, inserting it into a relational table, or writing it to a document database. It also allows you to send the event over the network to another node in order to process it.
一个事件可以编码为文本字符串,JSON,或者可能以某种二进制形式,如第四章讨论的方式。这种编码允许你存储一个事件,例如将它附加到一个文件中,插入到关系表中,或将其写入文档数据库中。它还允许你将事件发送到网络中的另一个节点以便处理它。
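For instance, an event might be encoded as one JSON object per line and appended to a log file (the field names below are made up purely for illustration).
例如,可以把一个事件编码为一行JSON对象并追加到日志文件中(下面的字段名纯属为说明而虚构)。

```python
import json
import os
import tempfile

def append_event(path, event):
    """Encode an event as one JSON line and append it to a log file."""
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

log_path = os.path.join(tempfile.mkdtemp(), "events.log")
append_event(log_path, {"type": "page_view", "url": "/home", "ts": 1000})
append_event(log_path, {"type": "purchase", "sku": "abc", "ts": 1005})

# The same encoding lets another process read the events back later.
with open(log_path) as f:
    events = [json.loads(line) for line in f]
```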
In batch processing, a file is written once and then potentially read by multiple jobs. Analogously, in streaming terminology, an event is generated once by a producer (also known as a publisher or sender ), and then potentially processed by multiple consumers ( subscribers or recipients ) [ 3 ]. In a filesystem, a filename identifies a set of related records; in a streaming system, related events are usually grouped together into a topic or stream .
在批处理中,文件被写入一次,然后可能被多个作业读取。与之类似,在流处理术语中,事件由生产者(producer,也称发布者或发送者)生成一次,然后可能由多个消费者(consumer,即订阅者或接收者)处理[3]。在文件系统中,文件名标识一组相关记录;在流式系统中,相关的事件通常被归入同一个主题(topic)或流(stream)中。
In principle, a file or database is sufficient to connect producers and consumers: a producer writes every event that it generates to the datastore, and each consumer periodically polls the datastore to check for events that have appeared since it last ran. This is essentially what a batch process does when it processes a day’s worth of data at the end of every day.
原则上,一个文件或数据库足够连接生产者和消费者:生产者将其生成的每个事件写入数据存储库,每个消费者定期轮询数据存储库以检查自上次运行以来出现的事件。这基本上就是批处理在每天结束时处理一天数据的方式。
However, when moving toward continual processing with low delays, polling becomes expensive if the datastore is not designed for this kind of usage. The more often you poll, the lower the percentage of requests that return new events, and thus the higher the overheads become. Instead, it is better for consumers to be notified when new events appear.
然而,当向着低延迟的持续处理方向移动时,如果数据存储区没有设计为此类使用方式,轮询变得非常昂贵。轮询的次数越多,返回新事件的请求所占比例就越低,因此开销就越高。相反,最好在出现新事件时通知消费者。
Databases have traditionally not supported this kind of notification mechanism very well: relational databases commonly have triggers , which can react to a change (e.g., a row being inserted into a table), but they are very limited in what they can do and have been somewhat of an afterthought in database design [ 4 , 5 ]. Instead, specialized tools have been developed for the purpose of delivering event notifications.
传统上,数据库对这种通知机制的支持并不好:关系型数据库通常有触发器,可以对变更(例如向表中插入一行)作出反应,但触发器的功能非常有限,而且在数据库设计中多少是事后才补充的功能[4,5]。相反,人们开发了专门的工具来传递事件通知。
Messaging Systems
A common approach for notifying consumers about new events is to use a messaging system : a producer sends a message containing the event, which is then pushed to consumers. We touched on these systems previously in “Message-Passing Dataflow” , but we will now go into more detail.
通知消费者新事件的常见方法是使用消息系统:生产者发送包含事件的消息,然后将其推送到消费者。我们之前在“消息传递数据流”中涉及过这些系统,但现在我们将进一步详细介绍。
A direct communication channel like a Unix pipe or TCP connection between producer and consumer would be a simple way of implementing a messaging system. However, most messaging systems expand on this basic model. In particular, Unix pipes and TCP connect exactly one sender with one recipient, whereas a messaging system allows multiple producer nodes to send messages to the same topic and allows multiple consumer nodes to receive messages in a topic.
直接通信渠道,比如Unix管道或TCP连接,可作为实现消息传递系统的简单方式。然而,大多数消息传递系统都建立在这一基础模型之上。特别是,Unix管道和TCP只能将一个发送方与一个接收方连接起来,而消息传递系统允许多个生产节点向同一主题发送消息,并允许多个使用节点在主题中接收消息。
Within this publish/subscribe model, different systems take a wide range of approaches, and there is no one right answer for all purposes. To differentiate the systems, it is particularly helpful to ask the following two questions:
在这个发布/订阅模型中,不同的系统采取了各种各样的方法,并没有一个通用的答案适用于所有情况。为了区分这些系统,有两个问题特别有帮助:
-
What happens if the producers send messages faster than the consumers can process them? Broadly speaking, there are three options: the system can drop messages, buffer messages in a queue, or apply backpressure (also known as flow control ; i.e., blocking the producer from sending more messages). For example, Unix pipes and TCP use backpressure: they have a small fixed-size buffer, and if it fills up, the sender is blocked until the recipient takes data out of the buffer (see “Network congestion and queueing” ).
如果生产者发送消息的速度比消费者处理它们的速度更快,会发生什么?大致上,有三个选择:系统可以丢弃消息,在队列中缓冲消息,或者应用反压(也称为流量控制;即,阻止生产者继续发送更多消息)。例如,Unix管道和TCP使用反压:它们有一个小的固定大小的缓冲区,如果它被填满,发送者将被阻止,直到接收方将数据从缓冲区中取出为止(请参见“网络拥塞和排队”)。
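The backpressure option can be sketched with a small bounded queue: once the buffer is full, the producer's put() blocks until the consumer catches up (a simplified single-process sketch, not how a real pipe is implemented).
反压这一选项可以用一个小的有界队列来示意:一旦缓冲区满了,生产者的put()就会阻塞,直到消费者赶上来(一个简化的单进程草图,并非真实管道的实现方式)。

```python
import queue
import threading
import time

buf = queue.Queue(maxsize=2)  # a small fixed-size buffer, like a Unix pipe
received = []

def slow_consumer():
    while True:
        item = buf.get()
        if item is None:      # sentinel: no more messages
            break
        time.sleep(0.01)      # the consumer is slower than the producer
        received.append(item)

t = threading.Thread(target=slow_consumer)
t.start()

# put() blocks whenever the buffer is full, so the producer is slowed
# down to the consumer's pace (backpressure / flow control).
for i in range(5):
    buf.put(i)
buf.put(None)
t.join()
```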
If messages are buffered in a queue, it is important to understand what happens as that queue grows. Does the system crash if the queue no longer fits in memory, or does it write messages to disk? If so, how does the disk access affect the performance of the messaging system [ 6 ]?
如果消息在队列中缓冲,重要的是要了解队列增长时会发生什么。如果队列无法再适配内存,系统是否会崩溃,或者它是否会将消息写入磁盘?如果是这样,磁盘访问如何影响消息系统的性能?[6]
-
What happens if nodes crash or temporarily go offline—are any messages lost? As with databases, durability may require some combination of writing to disk and/or replication (see the sidebar “Replication and Durability” ), which has a cost. If you can afford to sometimes lose messages, you can probably get higher throughput and lower latency on the same hardware.
如果节点崩溃或暂时离线,会发生什么——是否会有消息丢失?与数据库一样,持久性可能需要写入磁盘和/或复制的某种组合(参见侧边栏"复制与持久性"),而这是有代价的。如果您可以接受偶尔丢失消息,那么在同样的硬件上或许可以获得更高的吞吐量和更低的延迟。
Whether message loss is acceptable depends very much on the application. For example, with sensor readings and metrics that are transmitted periodically, an occasional missing data point is perhaps not important, since an updated value will be sent a short time later anyway. However, beware that if a large number of messages are dropped, it may not be immediately apparent that the metrics are incorrect [ 7 ]. If you are counting events, it is more important that they are delivered reliably, since every lost message means incorrect counters.
信息丢失是否可接受非常取决于应用程序。例如,对于定期传输的传感器读数和指标,偶尔丢失的数据点可能不重要,因为稍后会发送更新的值。然而,请注意,如果丢失了大量的消息,则可能不会立即发现指标不正确[7]。如果您正在计算事件,则更重要的是它们可靠地传递,因为每个丢失的消息都意味着计数器不正确。
A nice property of the batch processing systems we explored in Chapter 10 is that they provide a strong reliability guarantee: failed tasks are automatically retried, and partial output from failed tasks is automatically discarded. This means the output is the same as if no failures had occurred, which helps simplify the programming model. Later in this chapter we will examine how we can provide similar guarantees in a streaming context.
我们在第10章探讨的批处理系统的一个很好的特点是它们提供了强大的可靠性保证:失败的任务会自动重试,并且失败任务的部分输出会自动丢弃。这意味着输出与未发生故障时的输出相同,有助于简化编程模型。本章稍后我们将探讨如何在流式上下文中提供类似的保证。
Direct messaging from producers to consumers
A number of messaging systems use direct network communication between producers and consumers without going via intermediary nodes:
许多消息系统使用生产者和消费者之间的直接网络通信,而不经过中介节点。
-
UDP multicast is widely used in the financial industry for streams such as stock market feeds, where low latency is important [ 8 ]. Although UDP itself is unreliable, application-level protocols can recover lost packets (the producer must remember packets it has sent so that it can retransmit them on demand).
UDP组播在金融行业中被广泛用于股票市场行情等低延迟至关重要的数据流[8]。尽管UDP本身是不可靠的,但应用层协议可以恢复丢失的数据包(生产者必须记住它已经发送的数据包,以便按需重传)。
-
Brokerless messaging libraries such as ZeroMQ [ 9 ] and nanomsg take a similar approach, implementing publish/subscribe messaging over TCP or IP multicast.
无代理(brokerless)的消息库,如ZeroMQ[9]和nanomsg,采用类似的方法,通过TCP或IP组播实现发布/订阅消息传递。
-
StatsD [ 10 ] and Brubeck [ 7 ] use unreliable UDP messaging for collecting metrics from all machines on the network and monitoring them. (In the StatsD protocol, counter metrics are only correct if all messages are received; using UDP makes the metrics at best approximate [ 11 ]. See also “TCP Versus UDP” .)
StatsD[10]和Brubeck[7]使用不可靠的UDP消息来从网络上的所有机器收集指标并对其进行监控。(在StatsD协议中,只有收到了所有消息,计数器指标才是正确的;使用UDP使得这些指标充其量只是近似值[11]。另请参阅"TCP与UDP"。)
-
If the consumer exposes a service on the network, producers can make a direct HTTP or RPC request (see “Dataflow Through Services: REST and RPC” ) to push messages to the consumer. This is the idea behind webhooks [ 12 ], a pattern in which a callback URL of one service is registered with another service, and it makes a request to that URL whenever an event occurs.
如果消费者在网络上提供了服务,那么生产者可以直接通过HTTP或RPC请求(请参阅“服务数据流:REST和RPC”)向消费者推送消息。这就是Webhook[12]背后的思想,这是一种模式,其中一个服务的回调URL被注册到另一个服务中,并在发生事件时向该URL发出请求。
Although these direct messaging systems work well in the situations for which they are designed, they generally require the application code to be aware of the possibility of message loss. The faults they can tolerate are quite limited: even if the protocols detect and retransmit packets that are lost in the network, they generally assume that producers and consumers are constantly online.
尽管这些直接信息传递系统在它们被设计的情况下工作得很好,但它们通常需要应用程序代码意识到消息丢失的可能性。它们可以容忍的错误相当有限:即使协议检测到并重新传输在网络中丢失的数据包,它们通常假定生产者和消费者始终在线。
If a consumer is offline, it may miss messages that were sent while it is unreachable. Some protocols allow the producer to retry failed message deliveries, but this approach may break down if the producer crashes, losing the buffer of messages that it was supposed to retry.
如果消费者离线,则可能错过在其不可达时发送的消息。一些协议允许生产者重试失败的消息传递,但如果生产者崩溃,失去了应该重试的消息缓冲区,这种方法可能会崩溃。
Message brokers
A widely used alternative is to send messages via a message broker (also known as a message queue ), which is essentially a kind of database that is optimized for handling message streams [ 13 ]. It runs as a server, with producers and consumers connecting to it as clients. Producers write messages to the broker, and consumers receive them by reading them from the broker.
一种广泛使用的替代方式是通过消息代理(也称为消息队列)发送消息,它实际上是一种针对处理消息流进行了优化的数据库[13]。它作为服务器运行,生产者和消费者作为客户端连接到它上面。生产者将消息写入代理,消费者通过从代理中读取消息来接收它们。
By centralizing the data in the broker, these systems can more easily tolerate clients that come and go (connect, disconnect, and crash), and the question of durability is moved to the broker instead. Some message brokers only keep messages in memory, while others (depending on configuration) write them to disk so that they are not lost in case of a broker crash. Faced with slow consumers, they generally allow unbounded queueing (as opposed to dropping messages or backpressure), although this choice may also depend on the configuration.
通过将数据集中在代理中,这些系统可以更容易地容忍来来去去的客户端(连接、断开和崩溃),而持久性的问题则转移给了代理。有些消息代理只把消息保存在内存中,而另一些(取决于配置)会把消息写入磁盘,以便在代理崩溃时消息不会丢失。面对缓慢的消费者,它们通常允许队列无限增长(而不是丢弃消息或施加反压),尽管这种选择也可能取决于配置。
A consequence of queueing is also that consumers are generally asynchronous : when a producer sends a message, it normally only waits for the broker to confirm that it has buffered the message and does not wait for the message to be processed by consumers. The delivery to consumers will happen at some undetermined future point in time—often within a fraction of a second, but sometimes significantly later if there is a queue backlog.
排队的一个后果是消费者通常是异步的:当生产者发送一条消息时,它通常只等待代理确认已缓存该消息,而不等待消息被消费者处理。向消费者的交付将在某个不确定的未来时间发生 - 通常在几分之一秒内,但如果存在队列积压,则有时会更晚。
Message brokers compared to databases
Some message brokers can even participate in two-phase commit protocols using XA or JTA (see “Distributed Transactions in Practice” ). This feature makes them quite similar in nature to databases, although there are still important practical differences between message brokers and databases:
一些消息代理甚至可以使用XA或JTA参与两阶段提交协议(参见“实践中的分布式事务”)。这个功能使它们与数据库非常相似,尽管消息代理和数据库之间仍然存在重要的实际差异:
-
Databases usually keep data until it is explicitly deleted, whereas most message brokers automatically delete a message when it has been successfully delivered to its consumers. Such message brokers are not suitable for long-term data storage.
数据库通常会保留数据直到显式删除,而大多数消息代理在将消息成功交付给其使用者后会自动删除该消息。这样的消息代理不适合长期数据存储。
-
Since they quickly delete messages, most message brokers assume that their working set is fairly small—i.e., the queues are short. If the broker needs to buffer a lot of messages because the consumers are slow (perhaps spilling messages to disk if they no longer fit in memory), each individual message takes longer to process, and the overall throughput may degrade [ 6 ].
由于它们会快速删除消息,大多数消息代理假设其工作集相当小,即队列较短。如果因为消费者速度较慢,代理需要缓冲大量消息(当消息在内存中放不下时,可能会溢出到磁盘),那么每条消息的处理时间都会变长,整体吞吐量可能会下降[6]。
-
Databases often support secondary indexes and various ways of searching for data, while message brokers often support some way of subscribing to a subset of topics matching some pattern. The mechanisms are different, but both are essentially ways for a client to select the portion of the data that it wants to know about.
数据库通常支持辅助索引和各种搜索数据的方法,而消息代理通常支持订阅与某些模式匹配的主题子集的一些方式。这些机制不同,但两者本质上都是客户端选择要了解的数据部分的方法。
-
When querying a database, the result is typically based on a point-in-time snapshot of the data; if another client subsequently writes something to the database that changes the query result, the first client does not find out that its prior result is now outdated (unless it repeats the query, or polls for changes). By contrast, message brokers do not support arbitrary queries, but they do notify clients when data changes (i.e., when new messages become available).
查询数据库时,结果通常基于数据的某个时间点的快照;如果另一个客户端稍后向数据库写入了一些更改查询结果的内容,则第一个客户端无法发现其先前的结果已过时(除非它重复查询或轮询更改)。 相比之下,消息代理不支持任意查询,但它们在数据更改时通知客户端(即在新消息变得可用时通知)。
This is the traditional view of message brokers, which is encapsulated in standards like JMS [ 14 ] and AMQP [ 15 ] and implemented in software like RabbitMQ, ActiveMQ, HornetQ, Qpid, TIBCO Enterprise Message Service, IBM MQ, Azure Service Bus, and Google Cloud Pub/Sub [ 16 ].
这是消息代理的传统视角,它体现在JMS[14]和AMQP[15]等标准中,并由RabbitMQ、ActiveMQ、HornetQ、Qpid、TIBCO Enterprise Message Service、IBM MQ、Azure Service Bus和Google Cloud Pub/Sub等软件实现[16]。
Multiple consumers
When multiple consumers read messages in the same topic, two main patterns of messaging are used, as illustrated in Figure 11-1 :
当多个消费者在相同的主题中读取消息时,使用了两种主要的消息模式,如图11-1所示:
- Load balancing
-
Each message is delivered to one of the consumers, so the consumers can share the work of processing the messages in the topic. The broker may assign messages to consumers arbitrarily. This pattern is useful when the messages are expensive to process, and so you want to be able to add consumers to parallelize the processing. (In AMQP, you can implement load balancing by having multiple clients consuming from the same queue, and in JMS it is called a shared subscription .)
每条消息都会被发送到其中一个消费者,这样消费者可以共同处理主题中的消息。代理可以任意地将消息分配给消费者。当消息处理成本很高时,这种模式非常有用,因此您希望能够添加消费者来并行处理。 (在AMQP中,您可以通过让多个客户端从同一队列消费来实现负载平衡,在JMS中则称为共享订阅。)
- Fan-out
-
Each message is delivered to all of the consumers. Fan-out allows several independent consumers to each “tune in” to the same broadcast of messages, without affecting each other—the streaming equivalent of having several different batch jobs that read the same input file. (This feature is provided by topic subscriptions in JMS, and exchange bindings in AMQP.)
每条消息都被传递给所有消费者。扇出允许几个独立的消费者各自"收听"同一个消息广播而互不影响——这是流处理中与多个不同的批处理作业读取同一输入文件相对应的做法。(JMS中的主题订阅和AMQP中的交换绑定提供了这一特性。)
The two patterns can be combined: for example, two separate groups of consumers may each subscribe to a topic, such that each group collectively receives all messages, but within each group only one of the nodes receives each message.
这两个模式可以结合使用:例如,两个独立的消费者群体可以各自订阅一个主题,以便每个群体共同接收所有消息,但在每个群体中只有一个节点接收每条消息。
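The combination can be sketched as follows: every group receives every message (fan-out), while round-robin dispatch shares messages among a group's consumers (load balancing). Group and consumer names here are hypothetical.
这种组合可以如下示意:每个消费者群组都会收到每条消息(扇出),而轮询分发则让组内的消费者分摊消息(负载均衡)。这里的组名和消费者名均为假设。

```python
from collections import defaultdict
from itertools import cycle

def deliver(messages, groups):
    """Fan-out across groups; load balancing within each group."""
    delivered = defaultdict(list)
    pickers = {name: cycle(members) for name, members in groups.items()}
    for msg in messages:
        for name in groups:                 # every group sees every message
            consumer = next(pickers[name])  # round-robin within the group
            delivered[consumer].append(msg)
    return delivered

groups = {"analytics": ["a1", "a2"], "billing": ["b1"]}
out = deliver(["m1", "m2", "m3", "m4"], groups)
```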
Acknowledgments and redelivery
Consumers may crash at any time, so it could happen that a broker delivers a message to a consumer but the consumer never processes it, or only partially processes it before crashing. In order to ensure that the message is not lost, message brokers use acknowledgments : a client must explicitly tell the broker when it has finished processing a message so that the broker can remove it from the queue.
消费者随时可能崩溃,因此可能出现这样的情况:代理把消息传递给了消费者,但消费者从未处理它,或者只处理了一部分就崩溃了。为了确保消息不丢失,消息代理使用确认(acknowledgment)机制:客户端必须显式地告诉代理它何时处理完一条消息,这样代理才能把这条消息从队列中移除。
If the connection to a client is closed or times out without the broker receiving an acknowledgment, it assumes that the message was not processed, and therefore it delivers the message again to another consumer. (Note that it could happen that the message actually was fully processed, but the acknowledgment was lost in the network. Handling this case requires an atomic commit protocol, as discussed in “Distributed Transactions in Practice” .)
如果与客户端的连接关闭或超时,而代理没有收到确认,它就会认为该消息未被处理,因此会把消息再次传递给另一个消费者。(注意,可能发生的情况是消息实际上已被完全处理,只是确认在网络中丢失了。处理这种情况需要原子提交协议,正如"实践中的分布式事务"中所讨论的。)
When combined with load balancing, this redelivery behavior has an interesting effect on the ordering of messages. In Figure 11-2 , the consumers generally process messages in the order they were sent by producers. However, consumer 2 crashes while processing message m3 , at the same time as consumer 1 is processing message m4 . The unacknowledged message m3 is subsequently redelivered to consumer 1, with the result that consumer 1 processes messages in the order m4 , m3 , m5 . Thus, m3 and m4 are not delivered in the same order as they were sent by producer 1.
当与负载均衡结合时,这种重传行为会对消息的顺序产生有趣的影响。在图11-2中,消费者通常按照生产者发送的顺序处理消息。然而,消费者2在处理消息m3时崩溃,与此同时消费者1正在处理消息m4。未确认的消息m3随后被重新传递给消费者1,结果消费者1按照m4、m3、m5的顺序处理消息。因此,m3和m4的交付顺序与生产者1发送它们的顺序不同。
Even if the message broker otherwise tries to preserve the order of messages (as required by both the JMS and AMQP standards), the combination of load balancing with redelivery inevitably leads to messages being reordered. To avoid this issue, you can use a separate queue per consumer (i.e., not use the load balancing feature). Message reordering is not a problem if messages are completely independent of each other, but it can be important if there are causal dependencies between messages, as we shall see later in the chapter.
即使消息代理器尽可能保持消息顺序(符合JMS和AMQP标准的要求),负载均衡与重传结合必然会导致消息被重新排序。为了避免这个问题,可以为每个消费者使用单独的队列(不使用负载均衡功能)。如果消息完全独立,那么消息重新排序不是问题,但是如果消息之间存在因果依赖关系,则可能很重要,这将在本章后面讨论。
Partitioned Logs
Sending a packet over a network or making a request to a network service is normally a transient operation that leaves no permanent trace. Although it is possible to record it permanently (using packet capture and logging), we normally don’t think of it that way. Even message brokers that durably write messages to disk quickly delete them again after they have been delivered to consumers, because they are built around a transient messaging mindset.
通过网络发送数据包或向网络服务发送请求通常是一个短暂的操作,不会留下永久的痕迹。尽管可以通过数据包捕获和记录来永久记录它,但我们通常不会这样考虑。即使消息代理将消息可靠地写入磁盘,它们也会在传递给消费者后快速将其删除,因为它们是围绕着短暂消息传递模式构建的。
Databases and filesystems take the opposite approach: everything that is written to a database or file is normally expected to be permanently recorded, at least until someone explicitly chooses to delete it again.
数据库和文件系统采取相反的方式:通常期望写入到数据库或文件的所有内容都能够永久记录,至少在有人明确选择再次删除之前是这样的。
This difference in mindset has a big impact on how derived data is created. A key feature of batch processes, as discussed in Chapter 10 , is that you can run them repeatedly, experimenting with the processing steps, without risk of damaging the input (since the input is read-only). This is not the case with AMQP/JMS-style messaging: receiving a message is destructive if the acknowledgment causes it to be deleted from the broker, so you cannot run the same consumer again and expect to get the same result.
这种思维方式上的差异对如何创建衍生数据有巨大影响。如第10章所讨论的,批处理的一个关键特性是,你可以反复运行它们、试验各个处理步骤,而不会有损坏输入的风险(因为输入是只读的)。而AMQP/JMS风格的消息传递并非如此:如果确认导致消息从代理中被删除,那么接收消息就是破坏性的,因此你不能再次运行同一个消费者并期望得到同样的结果。
If you add a new consumer to a messaging system, it typically only starts receiving messages sent after the time it was registered; any prior messages are already gone and cannot be recovered. Contrast this with files and databases, where you can add a new client at any time, and it can read data written arbitrarily far in the past (as long as it has not been explicitly overwritten or deleted by the application).
如果向消息系统添加新的消费者,通常只会开始接收在它注册之后发送的消息;之前的消息已经消失,无法恢复。相比之下,文件和数据库可以在任何时间添加新的客户端,并且可以读取写入任意远的数据(只要它没有被应用程序明确覆盖或删除)。
Why can we not have a hybrid, combining the durable storage approach of databases with the low-latency notification facilities of messaging? This is the idea behind log-based message brokers .
为什么我们不能拥有一种混合方法,将数据库的持久存储方法与消息传递的低延迟通知功能相结合呢?这就是基于日志的消息代理的核心思想。
Using logs for message storage
A log is simply an append-only sequence of records on disk. We previously discussed logs in the context of log-structured storage engines and write-ahead logs in Chapter 3 , and in the context of replication in Chapter 5 .
一个日志是仅仅在磁盘上追加记录的序列。我们之前在第三章中讨论了日志结构化存储引擎和预写日志,在第五章中讨论了复制的情况下的日志。
The same structure can be used to implement a message broker: a producer sends a message by appending it to the end of the log, and a consumer receives messages by reading the log sequentially. If a consumer reaches the end of the log, it waits for a notification that a new message has been appended. The Unix tool tail -f, which watches a file for data being appended, essentially works like this.
同样的结构可以用来实现一个消息代理:生产者通过将消息附加到日志末尾来发送消息,而消费者通过按顺序读取日志来接收消息。如果消费者到达日志的末尾,它会等待通知新消息已被附加。Unix工具tail -f,它观察文件中是否有数据被附加,基本上就是这样工作的。
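A minimal in-memory sketch of this append-only structure might look as follows (the class and method names are invented for illustration, not any real broker's API).
这种追加式结构的一个最简内存版草图可能如下所示(类名和方法名均为说明而虚构,并非任何真实代理的API)。

```python
class Log:
    """A toy in-memory append-only log; a real broker appends to disk."""

    def __init__(self):
        self.entries = []

    def append(self, message):
        offset = len(self.entries)
        self.entries.append(message)
        return offset  # monotonically increasing position of the message

    def read_from(self, offset):
        # A consumer reads sequentially; reading does not delete anything,
        # so several consumers can read independently.
        return self.entries[offset:]

log = Log()
log.append("m1")
log.append("m2")
off = log.append("m3")
```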
In order to scale to higher throughput than a single disk can offer, the log can be partitioned (in the sense of Chapter 6 ). Different partitions can then be hosted on different machines, making each partition a separate log that can be read and written independently from other partitions. A topic can then be defined as a group of partitions that all carry messages of the same type. This approach is illustrated in Figure 11-3 .
为了扩大吞吐量,超过单个磁盘的性能,可以对日志进行分区(如第6章所述)。不同的分区可以托管在不同的机器上,使每个分区成为一个独立的日志,可以独立于其他分区进行读写。然后可以将一个主题定义为一组携带相同类型信息的分区。这种方法在图11-3中说明了。
Within each partition, the broker assigns a monotonically increasing sequence number, or offset , to every message (in Figure 11-3 , the numbers in boxes are message offsets). Such a sequence number makes sense because a partition is append-only, so the messages within a partition are totally ordered. There is no ordering guarantee across different partitions.
在每个分区中,代理人为每个消息分配一个单调递增的序列号或偏移量(在图11-3中,方框中的数字是消息偏移量)。这样的序列号是有意义的,因为分区是仅追加的,所以分区内的消息是完全有序的。跨不同分区没有排序保证。
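Key-based routing into partitions, each with its own monotonically increasing offsets, can be sketched like this (again toy code, not a real client library; the hash-modulo routing is one common scheme).
按键路由到分区、且每个分区拥有自己单调递增偏移量的做法,可以这样示意(同样是玩具代码,不是真实的客户端库;按哈希取模路由是一种常见方案)。

```python
class PartitionedTopic:
    """Toy partitioned topic: messages with the same key go to the same
    partition, and each partition assigns its own offsets."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, message):
        p = hash(key) % len(self.partitions)  # deterministic routing by key
        self.partitions[p].append(message)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

topic = PartitionedTopic(num_partitions=3)
p1, o1 = topic.send("user-42", "login")
p2, o2 = topic.send("user-42", "logout")
```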
Apache Kafka [ 17 , 18 ], Amazon Kinesis Streams [ 19 ], and Twitter’s DistributedLog [ 20 , 21 ] are log-based message brokers that work like this. Google Cloud Pub/Sub is architecturally similar but exposes a JMS-style API rather than a log abstraction [ 16 ]. Even though these message brokers write all messages to disk, they are able to achieve throughput of millions of messages per second by partitioning across multiple machines, and fault tolerance by replicating messages [ 22 , 23 ].
Apache Kafka[17,18]、Amazon Kinesis Streams[19]和Twitter的DistributedLog[20,21]都是以这种方式工作的基于日志的消息代理。Google Cloud Pub/Sub在架构上与之类似,但对外暴露的是JMS风格的API而不是日志抽象[16]。尽管这些消息代理把所有消息都写入磁盘,但通过跨多台机器分区,它们能够实现每秒数百万条消息的吞吐量,并通过复制消息实现容错[22,23]。
Logs compared to traditional messaging
The log-based approach trivially supports fan-out messaging, because several consumers can independently read the log without affecting each other—reading a message does not delete it from the log. To achieve load balancing across a group of consumers, instead of assigning individual messages to consumer clients, the broker can assign entire partitions to nodes in the consumer group.
日志基于的方法很容易支持扇出式消息传递,因为多个消费者可以独立地读取日志,而不会相互影响——读取一条消息并不会从日志中删除它。为了在多个消费者节点之间实现负载均衡,代理可以将整个分区分配给该消费者组中的节点,而不是将单个消息分配给消费者客户端。
Each client then consumes all the messages in the partitions it has been assigned. Typically, when a consumer has been assigned a log partition, it reads the messages in the partition sequentially, in a straightforward single-threaded manner. This coarse-grained load balancing approach has some downsides:
每个客户端都会消费分配给它的分区中的所有消息。通常,当消费者分配到一个日志分区后,会按顺序逐个读取分区中的消息,采用简单的单线程方式。这种粗粒度的负载均衡方法存在一些缺点:
-
The number of nodes sharing the work of consuming a topic can be at most the number of log partitions in that topic, because messages within the same partition are delivered to the same node. i
消费一个主题的节点数量最多只能是该主题的日志分区数,因为同一分区内的消息会被传递到相同的节点。
-
If a single message is slow to process, it holds up the processing of subsequent messages in that partition (a form of head-of-line blocking; see “Describing Performance” ).
如果某条消息处理得很慢,它会阻塞该分区中后续消息的处理(一种队头阻塞;参见"描述性能")。
Thus, in situations where messages may be expensive to process and you want to parallelize processing on a message-by-message basis, and where message ordering is not so important, the JMS/AMQP style of message broker is preferable. On the other hand, in situations with high message throughput, where each message is fast to process and where message ordering is important, the log-based approach works very well.
因此,在处理信息可能很昂贵且您想要按消息进行并行处理的情况下,而且消息排序并不那么重要时,首选JMS/AMQP样式的消息代理。另一方面,在高消息吞吐量的情况下,每个消息都快速处理且消息排序很重要时,基于日志的方法非常有效。
Consumer offsets
Consuming a partition sequentially makes it easy to tell which messages have been processed: all messages with an offset less than a consumer’s current offset have already been processed, and all messages with a greater offset have not yet been seen. Thus, the broker does not need to track acknowledgments for every single message—it only needs to periodically record the consumer offsets. The reduced bookkeeping overhead and the opportunities for batching and pipelining in this approach help increase the throughput of log-based systems.
按顺序消耗分区可以轻松确定已处理过的消息:具有比消费者当前偏移量小的偏移量的所有消息已经处理完毕,具有更大偏移量的所有消息尚未被查看。因此,代理不需要跟踪每个单个消息的确认 - 它只需要定期记录消费者偏移量。这种方法中减少的簿记开销以及管道化和批处理的机会有助于提高基于日志的系统的吞吐量。
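The idea of recording a single offset instead of per-message acknowledgments can be sketched as follows (a toy consumer over an in-memory list, not a real client).
用记录单个偏移量来取代逐条消息确认的思路,可以如下示意(一个基于内存列表的玩具消费者,并非真实客户端)。

```python
class Consumer:
    """A toy consumer: a single offset records all progress in a partition."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # everything below this offset has been processed

    def poll(self, max_messages):
        batch = self.log[self.offset:self.offset + max_messages]
        self.offset += len(batch)  # one number instead of per-message acks
        return batch

log = ["m1", "m2", "m3", "m4"]
c = Consumer(log)
first = c.poll(3)
checkpoint = c.offset  # periodically recorded by the broker

# After a crash, a replacement consumer resumes from the recorded offset.
c2 = Consumer(log)
c2.offset = checkpoint
rest = c2.poll(10)
```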
This offset is in fact very similar to the log sequence number that is commonly found in single-leader database replication, and which we discussed in “Setting Up New Followers” . In database replication, the log sequence number allows a follower to reconnect to a leader after it has become disconnected, and resume replication without skipping any writes. Exactly the same principle is used here: the message broker behaves like a leader database, and the consumer like a follower.
这个偏移实际上非常类似于单主数据库复制中常见的日志序列号,我们在“设置新的跟随者”中讨论过。在数据库复制中,日志序列号允许跟随者在与主服务器断开连接后重新连接到主服务器,并在不跳过任何写入的情况下恢复复制。这里使用完全相同的原则:消息代理行为就像一个主数据库,而消费者则像一个跟随者。
If a consumer node fails, another node in the consumer group is assigned the failed consumer’s partitions, and it starts consuming messages at the last recorded offset. If the consumer had processed subsequent messages but not yet recorded their offset, those messages will be processed a second time upon restart. We will discuss ways of dealing with this issue later in the chapter.
如果一个消费者节点失败,消费者组中的另一个节点将被分配到失败的消费者分区,并从最后记录的偏移量开始消费消息。如果消费者已经处理了后续消息但尚未记录它们的偏移量,则在重新启动后这些消息将被再次处理。我们将在本章后面讨论处理这个问题的方法。
Disk space usage
If you only ever append to the log, you will eventually run out of disk space. To reclaim disk space, the log is actually divided into segments, and from time to time old segments are deleted or moved to archive storage. (We’ll discuss a more sophisticated way of freeing disk space later.)
如果你只是向日志追加数据,最终你会耗尽磁盘空间。为了回收磁盘空间,日志实际上被分成了段,而旧的段会定期删除或移动到归档存储中。(我们稍后将讨论一种更复杂的释放磁盘空间的方法。)
This means that if a slow consumer cannot keep up with the rate of messages, and it falls so far behind that its consumer offset points to a deleted segment, it will miss some of the messages. Effectively, the log implements a bounded-size buffer that discards old messages when it gets full, also known as a circular buffer or ring buffer . However, since that buffer is on disk, it can be quite large.
这意味着,如果一个慢消费者跟不上消息的速度,落后得太远,以至于它的消费者偏移量指向了已删除的段,它就会错过一些消息。实际上,日志实现了一个有界大小的缓冲区,当缓冲区满时会丢弃旧消息,这也被称为循环缓冲区(circular buffer)或环形缓冲区(ring buffer)。不过,由于这个缓冲区在磁盘上,它可以相当大。
Let’s do a back-of-the-envelope calculation. At the time of writing, a typical large hard drive has a capacity of 6 TB and a sequential write throughput of 150 MB/s. If you are writing messages at the fastest possible rate, it takes about 11 hours to fill the drive. Thus, the disk can buffer 11 hours’ worth of messages, after which it will start overwriting old messages. This ratio remains the same, even if you use many hard drives and machines. In practice, deployments rarely use the full write bandwidth of the disk, so the log can typically keep a buffer of several days’ or even weeks’ worth of messages.
让我们做一个粗略的估算。在撰写本文时,一块典型的大容量硬盘容量为6 TB,顺序写入吞吐量为150 MB/s。如果以最快的速度写入消息,大约11小时就会写满磁盘。因此,磁盘可以缓冲11小时的消息,之后它将开始覆盖旧消息。即使使用多块硬盘和多台机器,这个比率也保持不变。在实践中,部署很少会用满磁盘的写入带宽,因此日志通常可以保留数天甚至数周的消息缓冲。
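The back-of-the-envelope calculation in the paragraph above works out as follows (using decimal units, as drive manufacturers do):

```python
# 6 TB drive, filled at a sequential write rate of 150 MB/s:
capacity_bytes = 6 * 10**12           # 6 TB
write_rate = 150 * 10**6              # 150 MB/s

seconds_to_fill = capacity_bytes / write_rate
hours_to_fill = seconds_to_fill / 3600

assert round(hours_to_fill, 1) == 11.1   # about 11 hours of buffered messages
```

Adding drives scales capacity and throughput by the same factor, which is why the ratio stays roughly constant.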
Regardless of how long you retain messages, the throughput of a log remains more or less constant, since every message is written to disk anyway [ 18 ]. This behavior is in contrast to messaging systems that keep messages in memory by default and only write them to disk if the queue grows too large: such systems are fast when queues are short and become much slower when they start writing to disk, so the throughput depends on the amount of history retained.
日志的吞吐量始终保持相对恒定,无论消息保留多长时间,因为无论如何每个消息都会被写入磁盘。这种行为与默认将消息保留在内存中,仅在队列过大时才将其写入磁盘的消息系统形成对比。这些系统在队列较短时速度很快,当它们开始向磁盘写入时会变得慢得多,因此吞吐量取决于保留的历史数量。
When consumers cannot keep up with producers
At the beginning of “Messaging Systems” we discussed three choices of what to do if a consumer cannot keep up with the rate at which producers are sending messages: dropping messages, buffering, or applying backpressure. In this taxonomy, the log-based approach is a form of buffering with a large but fixed-size buffer (limited by the available disk space).
在“消息系统”的开头,我们讨论了三种选择,如果消费者无法跟上生产者发送消息的速度:丢弃消息、缓冲或施加反压力。在这种分类中,基于日志的方法是一种具有大但固定大小缓冲区(由可用磁盘空间限制)的缓冲。
If a consumer falls so far behind that the messages it requires are older than what is retained on disk, it will not be able to read those messages—so the broker effectively drops old messages that go back further than the size of the buffer can accommodate. You can monitor how far a consumer is behind the head of the log, and raise an alert if it falls behind significantly. As the buffer is large, there is enough time for a human operator to fix the slow consumer and allow it to catch up before it starts missing messages.
如果消费者落后太多,以至于它所需的消息比磁盘上保留的消息还要旧,它将无法读取这些消息。也就是说,代理(broker)实际上丢弃了那些超出缓冲区容量的旧消息。你可以监控消费者落后于日志头部的程度,并在其显著落后时发出警报。由于缓冲区很大,人类运维人员有足够的时间修复缓慢的消费者,使其在开始丢失消息之前赶上进度。
Even if a consumer does fall too far behind and starts missing messages, only that consumer is affected; it does not disrupt the service for other consumers. This fact is a big operational advantage: you can experimentally consume a production log for development, testing, or debugging purposes, without having to worry much about disrupting production services. When a consumer is shut down or crashes, it stops consuming resources—the only thing that remains is its consumer offset.
即使消费者落后太多开始错过信息,也只影响到该消费者本身,不会影响其他消费者的服务。这个事实是一个极大的运营优势:你可以实验性地消费生产日志,用于开发、测试或调试目的,而不必担心会干扰生产服务。当消费者关闭或崩溃时,它停止消耗资源,唯一留下的是消费者偏移量。
This behavior also contrasts with traditional message brokers, where you need to be careful to delete any queues whose consumers have been shut down—otherwise they continue unnecessarily accumulating messages and taking away memory from consumers that are still active.
此行为与传统的消息代理相比也存在明显的对比,传统的消息代理需要谨慎删除已经关闭的消费者所使用的队列,否则这些队列会持续积累消息,占用消费者还需使用的内存空间。
Replaying old messages
We noted previously that with AMQP- and JMS-style message brokers, processing and acknowledging messages is a destructive operation, since it causes the messages to be deleted on the broker. On the other hand, in a log-based message broker, consuming messages is more like reading from a file: it is a read-only operation that does not change the log.
之前我们注意到,使用AMQP和JMS风格的消息代理时,处理和确认消息是一个破坏性的操作,因为它会导致消息在代理中被删除。另一方面,在基于日志的消息代理中,消费消息更像是从文件中读取:它是一个只读操作,不会改变日志。
The only side effect of processing, besides any output of the consumer, is that the consumer offset moves forward. But the offset is under the consumer’s control, so it can easily be manipulated if necessary: for example, you can start a copy of a consumer with yesterday’s offsets and write the output to a different location, in order to reprocess the last day’s worth of messages. You can repeat this any number of times, varying the processing code.
除了消费者的任何输出之外,处理的唯一副作用就是消费者偏移量的前移。而偏移量处于消费者的控制之下,因此必要时可以轻松操纵:例如,你可以用昨天的偏移量启动一个消费者副本,并将输出写到不同的位置,以便重新处理过去一天的消息。你可以变换处理代码,将这个过程重复任意多次。
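Because the offset is just an integer under the consumer's control, reprocessing is simply a matter of starting another consumer at an earlier offset with its output directed elsewhere. A minimal sketch (the processing logic and offsets are purely illustrative):

```python
# Replaying old messages: a second consumer starts from an earlier offset
# and writes to a different sink, without disturbing the live consumer.

log = [("user_click", i) for i in range(100)]   # illustrative event log

def run_consumer(start_offset, sink):
    """Deterministically process every message from start_offset onward."""
    for offset in range(start_offset, len(log)):
        _event_type, n = log[offset]
        sink.append(n * 2)               # stand-in for real processing logic

live_output, reprocessed_output = [], []
run_consumer(0, live_output)             # the normal consumer
run_consumer(90, reprocessed_output)     # a copy started at "yesterday's" offset

# Deterministic processing means the replay reproduces the same results:
assert reprocessed_output == live_output[90:]
```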
This aspect makes log-based messaging more like the batch processes of the last chapter, where derived data is clearly separated from input data through a repeatable transformation process. It allows more experimentation and easier recovery from errors and bugs, making it a good tool for integrating dataflows within an organization [ 24 ].
这一方面使得基于日志的消息传递更像上一章的批处理:通过可重复的转换过程,派生数据与输入数据被清晰地分离开来。它允许进行更多实验,更容易从错误和缺陷中恢复,因而是在组织内集成数据流的好工具[24]。
Databases and Streams
We have drawn some comparisons between message brokers and databases. Even though they have traditionally been considered separate categories of tools, we saw that log-based message brokers have been successful in taking ideas from databases and applying them to messaging. We can also go in reverse: take ideas from messaging and streams, and apply them to databases.
我们已经对消息代理和数据库进行了一些比较。尽管它们传统上被认为是不同类别的工具,但我们发现基于日志的消息代理已成功地将数据库的思想应用于消息传递。我们也可以反向操作:从消息和数据流中获取想法,并将其应用于数据库。
We said previously that an event is a record of something that happened at some point in time. The thing that happened may be a user action (e.g., typing a search query), or a sensor reading, but it may also be a write to a database . The fact that something was written to a database is an event that can be captured, stored, and processed. This observation suggests that the connection between databases and streams runs deeper than just the physical storage of logs on disk—it is quite fundamental.
我们之前提到过,事件是发生在某个时间点的事情的记录。发生的事情可能是用户操作(例如,输入搜索查询),或传感器读数,但它也可能是对数据库的写入。将某些内容写入数据库是一个可以捕捉、存储和处理的事件。这个观察结果表明,数据库与流之间的联系比仅仅在磁盘上存储日志的物理存储更加深刻,它是相当基本的。
In fact, a replication log (see “Implementation of Replication Logs” ) is a stream of database write events, produced by the leader as it processes transactions. The followers apply that stream of writes to their own copy of the database and thus end up with an accurate copy of the same data. The events in the replication log describe the data changes that occurred.
实际上,复制日志(参见“复制日志的实现”)是领导者处理事务时生成的数据库写事件流。跟随者将该写流应用于其自己的数据库副本,因此最终得到了同样数据的准确副本。复制日志中的事件描述了发生的数据更改。
We also came across the state machine replication principle in “Total Order Broadcast” , which states: if every event represents a write to the database, and every replica processes the same events in the same order, then the replicas will all end up in the same final state. (Processing an event is assumed to be a deterministic operation.) It’s just another case of event streams!
我们在“全序广播”中也遇到了状态机复制原理,该原理指出:如果每个事件都代表对数据库的写操作,并且每个副本以相同顺序处理相同的事件,则所有副本最终都将处于相同的状态。 (假设处理事件是确定性操作。)这只是事件流的另一个案例!
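The state machine replication principle is easy to demonstrate concretely. In this sketch (event types and keys are made up for illustration), two replicas independently apply the same deterministic events in the same order and necessarily end in the same state:

```python
# State machine replication: same events, same order, deterministic
# processing => identical final state on every replica.

events = [("set", "x", 1), ("set", "y", 2), ("incr", "x", 5), ("del", "y", None)]

def apply_event(state, event):
    op, key, value = event
    if op == "set":
        state[key] = value
    elif op == "incr":
        state[key] = state.get(key, 0) + value
    elif op == "del":
        state.pop(key, None)

replica_a, replica_b = {}, {}
for e in events:
    apply_event(replica_a, e)
for e in events:                  # a second replica processes the same stream
    apply_event(replica_b, e)

assert replica_a == replica_b == {"x": 6}
```

If the replicas saw the events in different orders, or if `apply_event` were nondeterministic, the states could diverge, which is exactly the problem with dual writes discussed below.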
In this section we will first look at a problem that arises in heterogeneous data systems, and then explore how we can solve it by bringing ideas from event streams to databases.
在本节中,我们首先会探讨异构数据系统中出现的问题,然后通过引入事件流的思想来解决它。
Keeping Systems in Sync
As we have seen throughout this book, there is no single system that can satisfy all data storage, querying, and processing needs. In practice, most nontrivial applications need to combine several different technologies in order to satisfy their requirements: for example, using an OLTP database to serve user requests, a cache to speed up common requests, a full-text index to handle search queries, and a data warehouse for analytics. Each of these has its own copy of the data, stored in its own representation that is optimized for its own purposes.
在本书中,我们已经看到,没有单一的系统能够满足所有数据存储、查询和处理的需求。实际上,大多数非平凡的应用需要组合几种不同的技术来满足它们的需求:例如,使用 OLTP 数据库来服务用户请求,使用缓存来加速常见请求,使用全文索引来处理搜索查询,以及使用数据仓库进行分析。每一种技术都有自己的数据副本,以针对自身用途优化的表示形式存储。
As the same or related data appears in several different places, they need to be kept in sync with one another: if an item is updated in the database, it also needs to be updated in the cache, search indexes, and data warehouse. With data warehouses this synchronization is usually performed by ETL processes (see “Data Warehousing” ), often by taking a full copy of a database, transforming it, and bulk-loading it into the data warehouse—in other words, a batch process. Similarly, we saw in “The Output of Batch Workflows” how search indexes, recommendation systems, and other derived data systems might be created using batch processes.
由于同一或相关数据出现在多个不同的地方,它们需要保持同步:如果在数据库中更新一个项目,则还需要在缓存、搜索索引和数据仓库中进行更新。对于数据仓库,通常通过ETL过程(请参见“数据仓库”)执行此同步,这通常是通过对数据库进行全量复制、转换并批量加载到数据仓库中即批处理的方式。同样,我们在“批处理工作流的输出”中看到,可以使用批处理过程创建搜索索引、推荐系统和其他派生数据系统。
If periodic full database dumps are too slow, an alternative that is sometimes used is dual writes , in which the application code explicitly writes to each of the systems when data changes: for example, first writing to the database, then updating the search index, then invalidating the cache entries (or even performing those writes concurrently).
如果定期进行完整的数据库转储速度太慢,有时会使用另一种替代方法——双写,应用程序代码会在数据发生更改时明确地向每个系统写入:例如,先写入数据库,然后更新搜索索引,然后使缓存条目无效(甚至同时执行这些写操作) 。
However, dual writes have some serious problems, one of which is a race condition illustrated in Figure 11-4 . In this example, two clients concurrently want to update an item X: client 1 wants to set the value to A, and client 2 wants to set it to B. Both clients first write the new value to the database, then write it to the search index. Due to unlucky timing, the requests are interleaved: the database first sees the write from client 1 setting the value to A, then the write from client 2 setting the value to B, so the final value in the database is B. The search index first sees the write from client 2, then client 1, so the final value in the search index is A. The two systems are now permanently inconsistent with each other, even though no error occurred.
然而,双重写入存在一些严重问题,其中之一是在图11-4中说明的竞态条件。在这个例子中,两个客户端同时想要更新项目X:客户端1想要将值设置为A,而客户端2想要将其设置为B。两个客户端先将新值写入数据库,然后将其写入搜索索引。由于不幸的时机,请求发生了交错:数据库首先看到了来自客户端1的写入,将值设置为A,然后看到客户端2设置值为B的写入,因此数据库中的最终值为B。搜索索引首先看到客户端2的写入,然后是客户端1,因此搜索索引中的最终值为A。即使没有发生错误,这两个系统现在永久不一致。
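The race condition of Figure 11-4 can be reproduced directly by writing out the unlucky interleaving. Nothing here fails or raises an error, yet the two systems end up permanently disagreeing:

```python
# Simulating the dual-write race condition: two clients write to both a
# database and a search index, but their writes interleave differently
# at the two systems.

database = {}
search_index = {}

# Client 1 wants X=A, client 2 wants X=B.
database["X"] = "A"       # client 1's write reaches the database first
database["X"] = "B"       # then client 2's write overwrites it
search_index["X"] = "B"   # client 2's write reaches the index first
search_index["X"] = "A"   # then client 1's write overwrites it

# No error occurred, yet the two systems are now inconsistent:
assert database["X"] == "B"
assert search_index["X"] == "A"
assert database["X"] != search_index["X"]
```

With a single leader deciding the order of writes, both systems would have applied the writes in the same order and converged, which motivates the change data capture approach below.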
Unless you have some additional concurrency detection mechanism, such as the version vectors we discussed in “Detecting Concurrent Writes” , you will not even notice that concurrent writes occurred—one value will simply silently overwrite another value.
除非您拥有其他并发检测机制,比如我们在“检测并发写入”中讨论过的版本向量,否则您甚至不会注意到发生了并发写入——一个值将会默默地覆盖另一个值。
Another problem with dual writes is that one of the writes may fail while the other succeeds. This is a fault-tolerance problem rather than a concurrency problem, but it also has the effect of the two systems becoming inconsistent with each other. Ensuring that they either both succeed or both fail is a case of the atomic commit problem, which is expensive to solve (see “Atomic Commit and Two-Phase Commit (2PC)” ).
双写的另一个问题是,其中一个写入可能失败,而另一个则成功。这是一种容错问题,而不是并发问题,但它也会导致两个系统之间不一致。确保它们要么都成功要么都失败是原子提交问题的一个实例,其解决方法代价昂贵(参见“原子提交和两阶段提交(2PC)”)。
If you only have one replicated database with a single leader, then that leader determines the order of writes, so the state machine replication approach works among replicas of the database. However, in Figure 11-4 there isn’t a single leader: the database may have a leader and the search index may have a leader, but neither follows the other, and so conflicts can occur (see “Multi-Leader Replication” ).
如果您只有一个带有单个领导者的复制数据库,那么该领导者确定写入顺序,因此状态机复制方法在数据库的副本之间起作用。但是,在图11-4中没有单个领导者:数据库可能有一个领导者,搜索索引可能有一个领导者,但是两者都不遵循另一个,因此可能发生冲突(参见“多领导者复制”)。
The situation would be better if there really was only one leader—for example, the database—and if we could make the search index a follower of the database. But is this possible in practice?
如果确实只有一个领导者(例如数据库),并且我们能够使搜索索引成为数据库的追随者,那么情况会更好。但这在实践中可能吗?
Change Data Capture
The problem with most databases’ replication logs is that they have long been considered to be an internal implementation detail of the database, not a public API. Clients are supposed to query the database through its data model and query language, not parse the replication logs and try to extract data from them.
大多数数据库的复制日志问题在于,它们长期以来被认为是数据库的内部实现细节,而不是公共 API。客户端应该通过其数据模型和查询语言查询数据库,而不是解析复制日志并尝试从中提取数据。
For decades, many databases simply did not have a documented way of getting the log of changes written to them. For this reason it was difficult to take all the changes made in a database and replicate them to a different storage technology such as a search index, cache, or data warehouse.
几十年来,许多数据库根本没有提供一种文档化的方式来获取写入其中的变更日志。因此,要把数据库中发生的所有变更复制到另一种存储技术(例如搜索索引、缓存或数据仓库)中是很困难的。
More recently, there has been growing interest in change data capture (CDC), which is the process of observing all data changes written to a database and extracting them in a form in which they can be replicated to other systems. CDC is especially interesting if changes are made available as a stream, immediately as they are written.
最近,越来越多的人对变更数据捕获(CDC)产生兴趣,这是观察写入数据库的所有数据更改并以可复制到其他系统的形式提取它们的过程。如果更改以流的形式立即提供,CDC尤其有趣。
For example, you can capture the changes in a database and continually apply the same changes to a search index. If the log of changes is applied in the same order, you can expect the data in the search index to match the data in the database. The search index and any other derived data systems are just consumers of the change stream, as illustrated in Figure 11-5 .
例如,您可以捕获数据库中的更改,并持续将相同的更改应用于搜索索引。如果按相同顺序应用更改日志,则可以期望搜索索引中的数据与数据库中的数据匹配。搜索索引和任何其他派生数据系统仅是更改流的使用者,如图 11-5所示。
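The idea can be sketched as follows: the same ordered change log, applied by the system of record and by a derived consumer, yields matching contents. This is a simplification (a real CDC pipeline would decode a replication log and ship events through a broker); the operation names are invented for illustration:

```python
# CDC sketch: applying the same ordered change log to the system of record
# and to a derived search index keeps them consistent.

change_log = [
    ("upsert", "doc1", "hello world"),
    ("upsert", "doc2", "stream processing"),
    ("upsert", "doc1", "hello kafka"),   # doc1 is updated in place
    ("delete", "doc2", None),
]

def apply_changes(store, changes):
    for op, key, value in changes:
        if op == "upsert":
            store[key] = value
        elif op == "delete":
            store.pop(key, None)

database = {}                            # the system of record
apply_changes(database, change_log)

search_index = {}                        # a consumer of the same change stream
apply_changes(search_index, change_log)

assert search_index == database == {"doc1": "hello kafka"}
```

Because every consumer sees the writes in the same order, the dual-write race condition cannot occur.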
Implementing change data capture
We can call the log consumers derived data systems , as discussed in the introduction to Part III : the data stored in the search index and the data warehouse is just another view onto the data in the system of record. Change data capture is a mechanism for ensuring that all changes made to the system of record are also reflected in the derived data systems so that the derived systems have an accurate copy of the data.
如第三部分引言中所讨论的,我们可以把日志的消费者称为派生数据系统:存储在搜索索引和数据仓库中的数据,只不过是记录系统中数据的另一种视图。变更数据捕获是一种机制,可确保对记录系统所做的所有更改都反映在派生数据系统中,从而使派生系统拥有数据的准确副本。
Essentially, change data capture makes one database the leader (the one from which the changes are captured), and turns the others into followers. A log-based message broker is well suited for transporting the change events from the source database, since it preserves the ordering of messages (avoiding the reordering issue of Figure 11-2 ).
本质上,更改数据捕获将一个数据库设为领导者(从中捕获更改),并将其他数据库转变为追随者。基于日志的消息代理非常适合传输来自源数据库的更改事件,因为它保留了消息的顺序(避免了图11-2中的重新排序问题)。
Database triggers can be used to implement change data capture (see “Trigger-based replication” ) by registering triggers that observe all changes to data tables and add corresponding entries to a changelog table. However, they tend to be fragile and have significant performance overheads. Parsing the replication log can be a more robust approach, although it also comes with challenges, such as handling schema changes.
数据库触发器可以用来实现变更数据捕获(参见“基于触发器的复制”):注册触发器来观察数据表的所有变更,并将相应条目添加到变更日志表中。然而,触发器往往比较脆弱,并且有显著的性能开销。解析复制日志是一种更稳健的方法,尽管它也有挑战,例如如何处理模式变更。
LinkedIn’s Databus [ 25 ], Facebook’s Wormhole [ 26 ], and Yahoo!’s Sherpa [ 27 ] use this idea at large scale. Bottled Water implements CDC for PostgreSQL using an API that decodes the write-ahead log [ 28 ], Maxwell and Debezium do something similar for MySQL by parsing the binlog [ 29 , 30 , 31 ], Mongoriver reads the MongoDB oplog [ 32 , 33 ], and GoldenGate provides similar facilities for Oracle [ 34 , 35 ].
LinkedIn的Databus[25]、Facebook的Wormhole[26]和Yahoo!的Sherpa[27]大规模地使用了这个想法。Bottled Water使用API解码预写日志[28]为PostgreSQL实现CDC,Maxwell和Debezium通过解析binlog为MySQL做了类似的事情[29,30,31],Mongoriver读取MongoDB oplog[32,33],GoldenGate为Oracle提供类似的设施[34,35]。
Like message brokers, change data capture is usually asynchronous: the system of record database does not wait for the change to be applied to consumers before committing it. This design has the operational advantage that adding a slow consumer does not affect the system of record too much, but it has the downside that all the issues of replication lag apply (see “Problems with Replication Lag” ).
像消息代理一样,更改数据捕获通常是异步的:记录数据库的系统在提交更改之前不会等待消费者应用更改。该设计具有运营优势,即添加一个缓慢的消费者不会对系统的记录产生太大影响,但它的副作用是所有复制延迟的问题都适用(请参见“复制延迟的问题”)。
Initial snapshot
If you have the log of all changes that were ever made to a database, you can reconstruct the entire state of the database by replaying the log. However, in many cases, keeping all changes forever would require too much disk space, and replaying it would take too long, so the log needs to be truncated.
如果您拥有数据库中所有更改的日志记录,可以通过重播日志来重建数据库的整个状态。然而,在许多情况下,永久保留所有更改将需要太多的磁盘空间,并且重放将花费太长时间,因此需要截断日志。
Building a new full-text index, for example, requires a full copy of the entire database—it is not sufficient to only apply a log of recent changes, since it would be missing items that were not recently updated. Thus, if you don’t have the entire log history, you need to start with a consistent snapshot, as previously discussed in “Setting Up New Followers” .
例如,建立新的全文索引需要整个数据库的完整副本——仅仅应用最近变更的日志是不够的,因为那样会缺失最近未被更新的项目。因此,如果你没有完整的日志历史,就需要从一个一致的快照开始,如先前在“设置新的追随者”中所讨论的。
The snapshot of the database must correspond to a known position or offset in the change log, so that you know at which point to start applying changes after the snapshot has been processed. Some CDC tools integrate this snapshot facility, while others leave it as a manual operation.
数据库的快照必须对应于更改日志中的已知位置或偏移量,这样在快照处理后,您就知道从哪个点开始应用更改。一些CDC工具会集成这种快照功能,而其他工具则将其留作手动操作。
Log compaction
If you can only keep a limited amount of log history, you need to go through the snapshot process every time you want to add a new derived data system. However, log compaction provides a good alternative.
如果你只能保留有限数量的日志历史记录,每次想添加新的派生数据系统时就需要进行快照过程。然而,日志压缩提供了一个很好的替代方案。
We discussed log compaction previously in “Hash Indexes” , in the context of log-structured storage engines (see Figure 3-2 for an example). The principle is simple: the storage engine periodically looks for log records with the same key, throws away any duplicates, and keeps only the most recent update for each key. This compaction and merging process runs in the background.
我们先前在“哈希索引”中,在日志结构存储引擎的上下文中讨论过日志压缩(示例见图3-2)。原理很简单:存储引擎周期性地查找具有相同键的日志记录,丢弃重复项,并且只保留每个键的最新更新。这个压缩与合并的过程在后台运行。
In a log-structured storage engine, an update with a special null value (a tombstone ) indicates that a key was deleted, and causes it to be removed during log compaction. But as long as a key is not overwritten or deleted, it stays in the log forever. The disk space required for such a compacted log depends only on the current contents of the database, not the number of writes that have ever occurred in the database. If the same key is frequently overwritten, previous values will eventually be garbage-collected, and only the latest value will be retained.
在日志结构存储引擎中,带有特殊空值(墓碑)的更新表示某个键被删除,并使其在日志压缩时被移除。但只要键没有被覆盖或删除,它就会永远留在日志中。这样一个压缩日志所需的磁盘空间仅取决于数据库的当前内容,而与数据库中曾经发生过的写入次数无关。如果同一个键被频繁覆盖,先前的值最终将被垃圾回收,只保留最新的值。
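A minimal sketch of the compaction rule described above: keep only the most recent entry per key, and drop a key entirely when its latest entry is a tombstone (represented here as `None`). The keys and values are invented for illustration:

```python
# Log compaction: later entries for a key supersede earlier ones, and a
# tombstone (None) removes the key. Space after compaction depends on the
# number of live keys, not on how many writes ever occurred.

log = [
    ("user:1", "alice"),
    ("user:2", "bob"),
    ("user:1", "alicia"),   # overwrites the earlier value for user:1
    ("user:2", None),       # tombstone: user:2 was deleted
]

def compact(log):
    latest = {}
    for key, value in log:              # later entries win
        latest[key] = value
    return [(k, v) for k, v in latest.items() if v is not None]

assert compact(log) == [("user:1", "alicia")]
```

A consumer that scans this compacted log from offset 0 is guaranteed to see the latest value of every live key, which is what makes it usable as a full copy of the database contents.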
The same idea works in the context of log-based message brokers and change data capture. If the CDC system is set up such that every change has a primary key, and every update for a key replaces the previous value for that key, then it’s sufficient to keep just the most recent write for a particular key.
同样的思想也适用于基于日志的消息代理和变更数据捕获的场景。如果CDC系统设置为每个变更都有一个主键,并且对某个键的每次更新都替换该键的先前值,那么只需保留特定键的最近一次写入就足够了。
Now, whenever you want to rebuild a derived data system such as a search index, you can start a new consumer from offset 0 of the log-compacted topic, and sequentially scan over all messages in the log. The log is guaranteed to contain the most recent value for every key in the database (and maybe some older values)—in other words, you can use it to obtain a full copy of the database contents without having to take another snapshot of the CDC source database.
现在,每当您想重新构建派生数据系统,例如搜索索引,您可以从日志压缩主题的偏移0开始启动新的消费者,并按顺序扫描日志中的所有消息。日志保证包含数据库中每个键的最新值(可能还有一些旧值),换句话说,您可以使用它获取数据库内容的完整副本,而无需对CDC源数据库进行另一个快照。
This log compaction feature is supported by Apache Kafka. As we shall see later in this chapter, it allows the message broker to be used for durable storage, not just for transient messaging.
这个日志压缩功能由Apache Kafka支持。正如我们在本章后面所将看到的,这使得消息代理可以被用作持久化存储,而不仅仅是短暂的消息传递。
API support for change streams
Increasingly, databases are beginning to support change streams as a first-class interface, rather than the typical retrofitted and reverse-engineered CDC efforts. For example, RethinkDB allows queries to subscribe to notifications when the results of a query change [ 36 ], Firebase [ 37 ] and CouchDB [ 38 ] provide data synchronization based on a change feed that is also made available to applications, and Meteor uses the MongoDB oplog to subscribe to data changes and update the user interface [ 39 ].
越来越多的数据库开始将变更流作为一等接口来支持,而不是典型的经过改装和逆向工程的CDC工作。例如,RethinkDB允许查询订阅其结果变更的通知[36];Firebase[37]和CouchDB[38]基于同样可供应用程序使用的变更流(change feed)提供数据同步;Meteor使用MongoDB oplog订阅数据变更并更新用户界面[39]。
VoltDB allows transactions to continuously export data from a database in the form of a stream [ 40 ]. The database represents an output stream in the relational data model as a table into which transactions can insert tuples, but which cannot be queried. The stream then consists of the log of tuples that committed transactions have written to this special table, in the order they were committed. External consumers can asynchronously consume this log and use it to update derived data systems.
VoltDB允许事务以流的形式从数据库中持续导出数据[40]。该数据库在关系数据模型中将输出流表示为一张表:事务可以向其中插入元组,但不能对其进行查询。于是,该流由已提交事务写入这张特殊表的元组日志组成,并按照事务提交的顺序排列。外部消费者可以异步地消费这个日志,并用它来更新派生数据系统。
Kafka Connect [ 41 ] is an effort to integrate change data capture tools for a wide range of database systems with Kafka. Once the stream of change events is in Kafka, it can be used to update derived data systems such as search indexes, and also feed into stream processing systems as discussed later in this chapter.
Kafka Connect是将变更数据捕捉工具整合到各种数据库系统和Kafka的一种努力。一旦变更事件的流在Kafka中,它就可以用来更新派生数据系统,例如搜索索引,同时也可以馈送到流处理系统,正如本章后面所讨论的那样。
Event Sourcing
There are some parallels between the ideas we’ve discussed here and event sourcing , a technique that was developed in the domain-driven design (DDD) community [ 42 , 43 , 44 ]. We will discuss event sourcing briefly, because it incorporates some useful and relevant ideas for streaming systems.
这里讨论的思想与事件溯源存在一些相似之处,事件溯源是在面向领域的设计(DDD)社区中开发的一种技术 [42, 43, 44]。我们会简短地讨论事件溯源,因为它融合了一些有用和相关的流式系统思想。
Similarly to change data capture, event sourcing involves storing all changes to the application state as a log of change events. The biggest difference is that event sourcing applies the idea at a different level of abstraction:
与变更数据捕获类似,事件溯源将应用程序状态的所有变更存储为变更事件日志。最大的区别在于,事件溯源在不同的抽象层次上应用了这一思想:
-
In change data capture, the application uses the database in a mutable way, updating and deleting records at will. The log of changes is extracted from the database at a low level (e.g., by parsing the replication log), which ensures that the order of writes extracted from the database matches the order in which they were actually written, avoiding the race condition in Figure 11-4 . The application writing to the database does not need to be aware that CDC is occurring.
在变化数据捕获中,应用程序会以可变方式使用数据库,随意更新和删除记录。变更日志将从数据库中以低层级方式提取出来(例如,通过解析复制日志),这可以确保从数据库中提取的写入顺序与实际写入顺序相匹配,避免了图11-4中的竞争条件。写入数据库的应用程序不需要知道CDC正在发生。
-
In event sourcing, the application logic is explicitly built on the basis of immutable events that are written to an event log. In this case, the event store is append-only, and updates or deletes are discouraged or prohibited. Events are designed to reflect things that happened at the application level, rather than low-level state changes.
在事件溯源中,应用程序逻辑是明确基于写入事件日志的不可变事件构建的。在这种情况下,事件存储是追加模式的,更新或删除是不鼓励或禁止的。事件旨在反映应用程序级别发生的事情,而非低级状态更改。
Event sourcing is a powerful technique for data modeling: from an application point of view it is more meaningful to record the user’s actions as immutable events, rather than recording the effect of those actions on a mutable database. Event sourcing makes it easier to evolve applications over time, helps with debugging by making it easier to understand after the fact why something happened, and guards against application bugs (see “Advantages of immutable events” ).
事件溯源是一种强大的数据建模技术:从应用程序的角度来看,记录用户的行为作为不可变事件是更有意义的,而不是记录这些行为对可变数据库的影响。事件溯源使得随着时间的推移更容易演进应用程序,有助于调试,因为它更容易理解事后发生了什么,同时可以防止应用程序错误发生(详见“不可变事件的优点”)。
For example, storing the event “student cancelled their course enrollment” clearly expresses the intent of a single action in a neutral fashion, whereas the side effects “one entry was deleted from the enrollments table, and one cancellation reason was added to the student feedback table” embed a lot of assumptions about the way the data is later going to be used. If a new application feature is introduced—for example, “the place is offered to the next person on the waiting list”—the event sourcing approach allows that new side effect to easily be chained off the existing event.
例如,存储“学生取消了课程报名”这一事件,以中立的方式清晰表达了单个行为的意图;而副作用“从报名表中删除了一个条目,并向学生反馈表添加了一条取消原因”则嵌入了许多关于数据后续使用方式的假设。如果引入新的应用功能——例如“将这个名额提供给候补名单上的下一个人”——事件溯源方法可以让新的副作用轻松地从现有事件中衍生出来。
Event sourcing is similar to the chronicle data model [ 45 ], and there are also similarities between an event log and the fact table that you find in a star schema (see “Stars and Snowflakes: Schemas for Analytics” ).
事件溯源类似于编年史数据模型[45],事件日志也与星型模式中的事实表存在相似之处(参见“星型和雪花型:用于分析的模式”)。
Specialized databases such as Event Store [ 46 ] have been developed to support applications using event sourcing, but in general the approach is independent of any particular tool. A conventional database or a log-based message broker can also be used to build applications in this style.
专门的数据库,如事件存储库,已经被开发出来以支持使用事件溯源的应用程序,但一般来说,这种方法不依赖于任何特定的工具。传统的数据库或基于日志的消息代理也可以用来构建这种风格的应用程序。
Deriving current state from the event log
An event log by itself is not very useful, because users generally expect to see the current state of a system, not the history of modifications. For example, on a shopping website, users expect to be able to see the current contents of their cart, not an append-only list of all the changes they have ever made to their cart.
事件日志本身并不是很有用,因为用户通常希望看到系统的当前状态,而不是修改历史。例如,在购物网站上,用户希望能够看到其购物车的当前内容,而不是一个仅包含其购物车所有更改的追加列表。
Thus, applications that use event sourcing need to take the log of events (representing the data written to the system) and transform it into application state that is suitable for showing to a user (the way in which data is read from the system [ 47 ]). This transformation can use arbitrary logic, but it should be deterministic so that you can run it again and derive the same application state from the event log.
因此,使用事件溯源的应用程序需要获取事件日志(代表写入系统的数据),并将其转换为适合展示给用户的应用程序状态(从系统读取数据的方式[47])。这种转换可以使用任意逻辑,但它应该是确定性的,以便可以再次运行,并从事件日志中派生出相同的应用程序状态。
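Using the shopping-cart example from above, the transformation from event log to user-visible state can be sketched as a deterministic fold over the events (event names are invented for illustration):

```python
# Deriving current state from an event log: a deterministic fold over
# immutable cart events yields the state shown to the user. Replaying the
# same log always reproduces the same state.

events = [
    {"type": "item_added", "item": "book"},
    {"type": "item_added", "item": "pen"},
    {"type": "item_removed", "item": "book"},
    {"type": "item_added", "item": "lamp"},
]

def derive_cart(event_log):
    cart = []
    for e in event_log:
        if e["type"] == "item_added":
            cart.append(e["item"])
        elif e["type"] == "item_removed":
            cart.remove(e["item"])
    return cart

assert derive_cart(events) == ["pen", "lamp"]          # current state
assert derive_cart(events) == derive_cart(events)      # deterministic replay
```

Note that no single event fully determines the final cart, which is why log compaction by key does not apply here: the full history is needed.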
Like with change data capture, replaying the event log allows you to reconstruct the current state of the system. However, log compaction needs to be handled differently:
像更改数据捕获一样,重放事件日志允许您重建系统的当前状态。然而,需要以不同的方式处理日志压缩:
-
A CDC event for the update of a record typically contains the entire new version of the record, so the current value for a primary key is entirely determined by the most recent event for that primary key, and log compaction can discard previous events for the same key.
记录更新的CDC事件通常包含该记录的完整新版本,因此某个主键的当前值完全由该主键的最近一次事件决定,日志压缩可以丢弃同一键的先前事件。
-
On the other hand, with event sourcing, events are modeled at a higher level: an event typically expresses the intent of a user action, not the mechanics of the state update that occurred as a result of the action. In this case, later events typically do not override prior events, and so you need the full history of events to reconstruct the final state. Log compaction is not possible in the same way.
另一方面,事件溯源模型更高层次地建模事件:事件通常表达用户操作的意图,而不是由于该操作导致的状态更新的机制。在这种情况下,后续事件通常不会覆盖先前事件,因此您需要完整的事件历史记录来重建最终状态。日志压缩无法以相同的方式进行。
Applications that use event sourcing typically have some mechanism for storing snapshots of the current state that is derived from the log of events, so they don’t need to repeatedly reprocess the full log. However, this is only a performance optimization to speed up reads and recovery from crashes; the intention is that the system is able to store all raw events forever and reprocess the full event log whenever required. We discuss this assumption in “Limitations of immutability” .
使用事件溯源的应用程序通常有某种机制,用于存储从事件日志派生的当前状态的快照,这样就不必重复地重新处理完整的日志。不过,这只是一种加快读取速度和崩溃恢复的性能优化;其本意是系统能够永久存储所有原始事件,并在需要时重新处理完整的事件日志。我们将在“不可变性的局限”中讨论这个假设。
Commands and events
The event sourcing philosophy is careful to distinguish between events and commands [ 48 ]. When a request from a user first arrives, it is initially a command: at this point it may still fail, for example because some integrity condition is violated. The application must first validate that it can execute the command. If the validation is successful and the command is accepted, it becomes an event, which is durable and immutable.
事件溯源哲学特别注重区分事件和命令[48]。当用户的请求首次到达时,它最初是一个命令:此时它仍可能失败,例如因为某些完整性条件被破坏。应用程序必须先验证它是否能执行此命令。如果验证成功且命令被接受,则成为一个持久且不可变的事件。
For example, if a user tries to register a particular username, or reserve a seat on an airplane or in a theater, then the application needs to check that the username or seat is not already taken. (We previously discussed this example in “Fault-Tolerant Consensus” .) When that check has succeeded, the application can generate an event to indicate that a particular username was registered by a particular user ID, or that a particular seat has been reserved for a particular customer.
例如,如果用户尝试注册特定的用户名,或者预留飞机或剧院的座位,那么应用程序需要检查该用户名或座位是否已经被占用。(我们之前在“容错共识”中讨论过这个例子。)当检查成功时,应用程序可以生成一个事件,指示特定的用户名已经被特定的用户ID注册,或者特定的座位已经被特定的客户预订。
At the point when the event is generated, it becomes a fact . Even if the customer later decides to change or cancel the reservation, the fact remains true that they formerly held a reservation for a particular seat, and the change or cancellation is a separate event that is added later.
在生成事件的那一刻,它就成为了事实。即使顾客后来决定更改或取消预订,他们先前曾预订过某个特定座位这一事实仍然成立,而更改或取消是之后添加的一个单独的事件。
A consumer of the event stream is not allowed to reject an event: by the time the consumer sees the event, it is already an immutable part of the log, and it may have already been seen by other consumers. Thus, any validation of a command needs to happen synchronously, before it becomes an event—for example, by using a serializable transaction that atomically validates the command and publishes the event.
事件流的消费者不允许拒绝事件:当消费者看到事件时,它已经是日志不可变的一部分,而且可能已经被其他消费者看到了。因此,任何命令的验证都需要同步发生,即在成为事件之前,通过使用序列化事务原子验证命令并发布事件。
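The command/event distinction can be sketched as follows, using the username-registration example. This toy version validates in-process; a real system would need the validation and the event publication to happen atomically (e.g., in a serializable transaction), as noted above. All names here are illustrative:

```python
# Commands vs. events: a command is validated synchronously, and only if it
# succeeds is an immutable event appended to the log. Consumers of the log
# never see rejected commands.

event_log = []
taken_usernames = set()

def handle_register_command(username):
    """Validate the command; on success, record a durable, immutable event."""
    if username in taken_usernames:
        return False                      # command rejected: no event written
    taken_usernames.add(username)
    event_log.append({"type": "username_registered", "username": username})
    return True

assert handle_register_command("martin") is True
assert handle_register_command("martin") is False   # duplicate: rejected
assert len(event_log) == 1                          # only the accepted fact remains
```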
Alternatively, the user request to reserve a seat could be split into two events: first a tentative reservation, and then a separate confirmation event once the reservation has been validated (as discussed in “Implementing linearizable storage using total order broadcast” ). This split allows the validation to take place in an asynchronous process.
或者,用户预订座位的请求可以拆分为两个事件:首先是一个暂定的预订,然后在预订通过验证后是一个单独的确认事件(如“使用全序广播实现线性一致的存储”中所讨论的)。这种拆分允许验证在异步过程中进行。
State, Streams, and Immutability
We saw in Chapter 10 that batch processing benefits from the immutability of its input files, so you can run experimental processing jobs on existing input files without fear of damaging them. This principle of immutability is also what makes event sourcing and change data capture so powerful.
在第10章中,我们看到批处理从其输入文件的不变性中获益,因此您可以在现有输入文件上运行实验处理作业,而不用担心损坏它们。这种不变性原则也使得事件溯源和变更数据捕获非常强大。
We normally think of databases as storing the current state of the application—this representation is optimized for reads, and it is usually the most convenient for serving queries. The nature of state is that it changes, so databases support updating and deleting data as well as inserting it. How does this fit with immutability?
我们通常认为数据库存储应用程序的当前状态——对于读取而言,该表示方式被优化,并且通常最方便用于提供查询。状态的特性在于它会发生变化,因此数据库支持更新、删除数据以及插入数据。这与不变性如何相适应?
Whenever you have state that changes, that state is the result of the events that mutated it over time. For example, your list of currently available seats is the result of the reservations you have processed, the current account balance is the result of the credits and debits on the account, and the response time graph for your web server is an aggregation of the individual response times of all web requests that have occurred.
无论何时,只要你拥有会变化的状态,这个状态都是随时间推移改变它的那些事件的结果。例如,你当前可用座位的列表是你处理过的预订的结果,当前账户余额是账户上各笔借贷的结果,你的Web服务器的响应时间图则是所有已发生的网络请求各自响应时间的聚合。
No matter how the state changes, there was always a sequence of events that caused those changes. Even as things are done and undone, the fact remains true that those events occurred. The key idea is that mutable state and an append-only log of immutable events do not contradict each other: they are two sides of the same coin. The log of all changes, the changelog , represents the evolution of state over time.
无论状态如何改变,总有一系列事件导致了这些变化。即使事情被做了又被撤销,这些事件曾经发生过这一事实仍然成立。关键思想在于:可变状态与只追加的不可变事件日志并不矛盾,它们是同一枚硬币的两面。所有变更的日志,即更新日志(changelog),代表了状态随时间的演变。
If you are mathematically inclined, you might say that the application state is what you get when you integrate an event stream over time, and a change stream is what you get when you differentiate the state by time, as shown in Figure 11-6 [ 49 , 50 , 51 ]. The analogy has limitations (for example, the second derivative of state does not seem to be meaningful), but it’s a useful starting point for thinking about data.
如果你倾向于用数学来思考,你可能会说,应用程序状态是事件流随时间积分的结果,而变更流则是状态对时间求导的结果,如图11-6所示 [49, 50, 51]。这个类比有其局限性(例如,状态的二阶导数似乎没有什么意义),但它是思考数据的一个有用起点。
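To make the integral/derivative analogy concrete, here is a toy sketch (the numbers are invented): folding a stream of credit/debit events yields the state, and differencing successive states recovers the change stream.
为了让积分/求导的类比更具体,这里给出一个玩具示意(数字为虚构):对借贷事件流做折叠得到状态,对相邻状态求差即可还原出变更流。

```python
from functools import reduce

events = [+100, -30, +50, -20]  # credits and debits on an account

# "Integrate": fold the event stream into a single state value.
def integrate(events, initial=0):
    return reduce(lambda state, delta: state + delta, events, initial)

# "Differentiate": recover the change stream from successive states.
def differentiate(states):
    return [b - a for a, b in zip(states, states[1:])]

# The sequence of states after each prefix of the event stream.
states = [integrate(events[:i]) for i in range(len(events) + 1)]
# states == [0, 100, 70, 120, 100], and differentiate(states) == events
```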
If you store the changelog durably, that simply has the effect of making the state reproducible. If you consider the log of events to be your system of record, and any mutable state as being derived from it, it becomes easier to reason about the flow of data through a system. As Pat Helland puts it [ 52 ]:
如果您持久地存储更新日志,其效果就是让状态可以重现。如果您把事件日志视为记录系统,而把任何可变状态都视为从它派生而来,那么就可以更容易地推理系统中的数据流。正如Pat Helland所说 [52]:
Transaction logs record all the changes made to the database. High-speed appends are the only way to change the log. From this perspective, the contents of the database hold a caching of the latest record values in the logs. The truth is the log. The database is a cache of a subset of the log. That cached subset happens to be the latest value of each record and index value from the log.
事务日志记录了对数据库所做的所有更改。高速追加是更改日志的唯一方式。从这个角度来看,数据库的内容保存着日志中最新记录值的缓存。真相在日志里。数据库是日志某个子集的缓存,这个被缓存的子集恰好是日志中每条记录和索引值的最新值。
Log compaction, as discussed in “Log compaction” , is one way of bridging the distinction between log and database state: it retains only the latest version of each record, and discards overwritten versions.
日志压缩(如“日志压缩”一节所讨论)是弥合日志与数据库状态之间区别的一种方式:它只保留每条记录的最新版本,并丢弃被覆盖的版本。
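A minimal sketch of the idea (the key names are invented): keep only the latest value per key, treating a None value as a tombstone that marks deletion.
这一思想的最小示意(键名为虚构):每个键只保留最新的值,并把None值当作标记删除的墓碑(tombstone)。

```python
def compact(log):
    """Retain only the latest entry per key; later entries overwrite earlier ones."""
    latest = {}
    for key, value in log:
        latest[key] = value
    # Drop tombstones entirely; the survivors form the compacted log.
    return [(k, v) for k, v in latest.items() if v is not None]

log = [("user:1", "alice"), ("user:2", "bob"),
       ("user:1", "alicia"), ("user:2", None)]  # user:2 was deleted
# compact(log) == [("user:1", "alicia")]
```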
Advantages of immutable events
Immutability in databases is an old idea. For example, accountants have been using immutability for centuries in financial bookkeeping. When a transaction occurs, it is recorded in an append-only ledger , which is essentially a log of events describing money, goods, or services that have changed hands. The accounts, such as profit and loss or the balance sheet, are derived from the transactions in the ledger by adding them up [ 53 ].
数据库中的不可变性是一个古老的概念。例如,几个世纪以来,会计师一直在财务簿记中使用不可变性。当一笔交易发生时,它被记录在一个只追加的账本(ledger)中,账本本质上是描述资金、商品或服务易手的事件日志。而损益表、资产负债表等账目,则是通过对账本中的交易进行累加而派生出来的 [53]。
If a mistake is made, accountants don’t erase or change the incorrect transaction in the ledger—instead, they add another transaction that compensates for the mistake, for example refunding an incorrect charge. The incorrect transaction still remains in the ledger forever, because it might be important for auditing reasons. If incorrect figures, derived from the incorrect ledger, have already been published, then the figures for the next accounting period include a correction. This process is entirely normal in accounting [ 54 ].
如果出了差错,会计师不会擦除或更改账本中那笔错误的交易,而是再追加一笔交易来补偿这个错误,例如退还一笔错误的收费。出于审计的原因,这笔错误的交易将永远留在账本中。如果从错误账本派生出的错误数字已经对外公布,那么下一个会计期间的数字中将包含一笔更正。这个过程在会计中是完全正常的 [54]。
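The bookkeeping pattern can be sketched as follows (descriptions and amounts invented): entries are only ever appended, a mistake is compensated by a new entry, and the balance is always derived by summing the ledger.
这种簿记模式可以示意如下(描述与金额为虚构):账目条目只会被追加,错误通过新增条目来补偿,而余额始终通过对账本求和得出。

```python
ledger = []  # append-only: entries are never modified or removed

def record(description, amount):
    ledger.append({"description": description, "amount": amount})

def balance():
    # Derived state: always computed from the full ledger.
    return sum(entry["amount"] for entry in ledger)

record("invoice #42", 500)
record("incorrect charge", 75)
record("refund of incorrect charge", -75)  # compensating entry, not an erase

# The wrong entry stays in the ledger for auditing, but balance() == 500.
```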
Although such auditability is particularly important in financial systems, it is also beneficial for many other systems that are not subject to such strict regulation. As discussed in “Philosophy of batch process outputs” , if you accidentally deploy buggy code that writes bad data to a database, recovery is much harder if the code is able to destructively overwrite data. With an append-only log of immutable events, it is much easier to diagnose what happened and recover from the problem.
尽管这种可审计性在金融系统中尤为重要,但它对许多其他不受这类严格监管的系统同样有益。如“批处理输出的哲学”中所讨论的,如果你不小心部署了有缺陷的代码,向数据库写入了错误数据,而代码又能够破坏性地覆盖数据,那么恢复起来会困难得多。使用只追加的不可变事件日志,诊断问题和从问题中恢复都要容易得多。
Immutable events also capture more information than just the current state. For example, on a shopping website, a customer may add an item to their cart and then remove it again. Although the second event cancels out the first event from the point of view of order fulfillment, it may be useful to know for analytics purposes that the customer was considering a particular item but then decided against it. Perhaps they will choose to buy it in the future, or perhaps they found a substitute. This information is recorded in an event log, but would be lost in a database that deletes items when they are removed from the cart [ 42 ].
不可变事件还能捕捉比当前状态更多的信息。例如,在购物网站上,客户可能会把一件商品加入购物车,然后又把它移除。尽管从订单履行的角度来看,第二个事件抵消了第一个事件,但出于分析目的,知道客户曾经考虑过某件商品、后来又决定不买,可能是很有用的。也许他们将来会选择购买它,也许他们找到了替代品。这些信息会记录在事件日志中,但在商品被移出购物车时就将其删除的数据库中,这些信息将会丢失 [42]。
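A small illustration (the event names are invented): the final cart state loses the fact that an item was considered, while the event log preserves it.
一个小示例(事件名称为虚构):最终的购物车状态丢失了“某件商品曾被考虑过”这一事实,而事件日志则保留了它。

```python
events = [
    {"type": "item_added",   "item": "espresso machine"},
    {"type": "item_added",   "item": "coffee grinder"},
    {"type": "item_removed", "item": "espresso machine"},
]

def current_cart(events):
    """The mutable state: only what is in the cart right now."""
    cart = set()
    for e in events:
        if e["type"] == "item_added":
            cart.add(e["item"])
        elif e["type"] == "item_removed":
            cart.discard(e["item"])
    return cart

def items_considered(events):
    """Analytics over the log: every item the customer ever added."""
    return {e["item"] for e in events if e["type"] == "item_added"}

# current_cart(events) == {"coffee grinder"}, but the log still shows the
# customer considered the espresso machine before deciding against it.
```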
Deriving several views from the same event log
Moreover, by separating mutable state from the immutable event log, you can derive several different read-oriented representations from the same log of events. This works just like having multiple consumers of a stream ( Figure 11-5 ): for example, the analytic database Druid ingests directly from Kafka using this approach [ 55 ], Pistachio is a distributed key-value store that uses Kafka as a commit log [ 56 ], and Kafka Connect sinks can export data from Kafka to various different databases and indexes [ 41 ]. It would make sense for many other storage and indexing systems, such as search servers, to similarly take their input from a distributed log (see “Keeping Systems in Sync” ).
此外,通过将可变状态与不可变事件日志分离,您可以从同一个事件日志中派生出多种不同的面向读取的表示。这就像一个流拥有多个消费者一样(图11-5):例如,分析数据库Druid使用这种方法直接从Kafka摄取数据 [55],Pistachio是一个把Kafka用作提交日志的分布式键值存储 [56],而Kafka Connect的sink可以把数据从Kafka导出到各种不同的数据库与索引 [41]。对许多其他存储和索引系统(例如搜索服务器)来说,同样从分布式日志中获取输入也是合理的(请参阅“保持系统同步”)。
Having an explicit translation step from an event log to a database makes it easier to evolve your application over time: if you want to introduce a new feature that presents your existing data in some new way, you can use the event log to build a separate read-optimized view for the new feature, and run it alongside the existing systems without having to modify them. Running old and new systems side by side is often easier than performing a complicated schema migration in an existing system. Once the old system is no longer needed, you can simply shut it down and reclaim its resources [ 47 , 57 ].
在事件日志与数据库之间设置一个显式的转换步骤,可以让应用程序更容易随时间演进:如果您想引入一个以某种新方式呈现现有数据的新功能,您可以利用事件日志为新功能构建一个单独的读取优化视图,与现有系统并行运行,而无需修改它们。并行运行新旧系统,通常比在现有系统中执行复杂的模式迁移更容易。一旦不再需要旧系统,您可以简单地将其关闭并回收其资源 [47, 57]。
Storing data is normally quite straightforward if you don’t have to worry about how it is going to be queried and accessed; many of the complexities of schema design, indexing, and storage engines are the result of wanting to support certain query and access patterns (see Chapter 3 ). For this reason, you gain a lot of flexibility by separating the form in which data is written from the form it is read, and by allowing several different read views. This idea is sometimes known as command query responsibility segregation (CQRS) [ 42 , 58 , 59 ].
如果您不必担心数据如何查询和访问,那么存储数据通常相当简单;模式设计、索引和存储引擎的许多复杂性都是为了支持特定的查询和访问模式(请参见第3章)。由于这个原因,通过将数据写入和读取的方式分离,以及允许多个不同的读取视图,可以获得很多的灵活性。这个想法有时被称为命令查询责任分离(CQRS)[42,58,59]。
The traditional approach to database and schema design is based on the fallacy that data must be written in the same form as it will be queried. Debates about normalization and denormalization (see “Many-to-One and Many-to-Many Relationships” ) become largely irrelevant if you can translate data from a write-optimized event log to read-optimized application state: it is entirely reasonable to denormalize data in the read-optimized views, as the translation process gives you a mechanism for keeping it consistent with the event log.
传统的数据库和模式设计方法基于这样一种谬误:数据必须以与查询时相同的形式写入。如果您能把数据从写入优化的事件日志转换为读取优化的应用程序状态,那么关于规范化和非规范化的争论(请参见“多对一和多对多关系”)就变得基本无关紧要了:在读取优化的视图中对数据进行非规范化是完全合理的,因为转换过程为您提供了一种使其与事件日志保持一致的机制。
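A minimal CQRS-flavored sketch (all event and view names are invented): one write-optimized event log feeds two independent read-optimized views, each denormalized for its own query pattern, and either view can be rebuilt from the log at any time.
一个最小的CQRS风格示意(事件与视图名称均为虚构):一个写入优化的事件日志喂给两个相互独立的读取优化视图,每个视图都针对自己的查询模式做了非规范化,并且任一视图都可以随时从日志重建。

```python
events = [
    {"type": "tweet",  "user": "alice", "text": "hello"},
    {"type": "follow", "follower": "bob", "followee": "alice"},
    {"type": "tweet",  "user": "alice", "text": "world"},
]

def build_tweets_by_user(events):
    """Read view 1: tweets grouped by author, in log order."""
    view = {}
    for e in events:
        if e["type"] == "tweet":
            view.setdefault(e["user"], []).append(e["text"])
    return view

def build_follower_counts(events):
    """Read view 2: denormalized follower counts."""
    view = {}
    for e in events:
        if e["type"] == "follow":
            view[e["followee"]] = view.get(e["followee"], 0) + 1
    return view

# Both views are derived from the same log; adding a third view later
# requires no change to the log or to the existing views.
```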
In “Describing Load” we discussed Twitter’s home timelines, a cache of recently written tweets by the people a particular user is following (like a mailbox). This is another example of read-optimized state: home timelines are highly denormalized, since your tweets are duplicated in all of the timelines of the people following you. However, the fan-out service keeps this duplicated state in sync with new tweets and new following relationships, which keeps the duplication manageable.
在“描述负载”一节中,我们讨论了Twitter的主页时间线,它是某个用户所关注的人最近撰写的推文的缓存(类似于邮箱)。这是读取优化状态的另一个例子:主页时间线是高度非规范化的,因为你的推文会在所有关注你的人的时间线中重复出现。然而,扇出(fan-out)服务使这份重复的状态与新的推文、新的关注关系保持同步,从而使重复保持在可控范围内。
Concurrency control
The biggest downside of event sourcing and change data capture is that the consumers of the event log are usually asynchronous, so there is a possibility that a user may make a write to the log, then read from a log-derived view and find that their write has not yet been reflected in the read view. We discussed this problem and potential solutions previously in “Reading Your Own Writes” .
事件溯源和变更数据捕获的最大缺点是事件日志的消费者通常是异步的,因此存在一个可能性,即用户可能会将其写入日志,然后从日志派生的视图中读取,并发现其写入尚未在读取视图中反映出来。我们在“读取自己的写入”中曾经讨论过这个问题和潜在的解决方案。
One solution would be to perform the updates of the read view synchronously with appending the event to the log. This requires a transaction to combine the writes into an atomic unit, so either you need to keep the event log and the read view in the same storage system, or you need a distributed transaction across the different systems. Alternatively, you could use the approach discussed in “Implementing linearizable storage using total order broadcast” .
一个解决方案是在将事件追加到日志的同时,同步执行读取视图的更新。这需要用一个事务把这些写操作组合成一个原子单元,因此要么需要把事件日志和读取视图保存在同一个存储系统中,要么需要跨不同系统的分布式事务。或者,你也可以使用“使用全序广播实现线性化存储”中讨论的方法。
On the other hand, deriving the current state from an event log also simplifies some aspects of concurrency control. Much of the need for multi-object transactions (see “Single-Object and Multi-Object Operations” ) stems from a single user action requiring data to be changed in several different places. With event sourcing, you can design an event such that it is a self-contained description of a user action. The user action then requires only a single write in one place—namely appending the events to the log—which is easy to make atomic.
另一方面,从事件日志中获取当前状态也简化了并发控制的某些方面。大部分需要多对象事务(参见“单对象和多对象操作”)的原因在于单个用户操作需要在多个不同位置更改数据。使用事件源,您可以设计一个事件,使其成为用户操作的独立描述。然后,用户操作仅需要在一个地方进行单个写入 - 即将事件附加到日志中 - 这很容易使其原子化。
If the event log and the application state are partitioned in the same way (for example, processing an event for a customer in partition 3 only requires updating partition 3 of the application state), then a straightforward single-threaded log consumer needs no concurrency control for writes—by construction, it only processes a single event at a time (see also “Actual Serial Execution” ). The log removes the nondeterminism of concurrency by defining a serial order of events in a partition [ 24 ]. If an event touches multiple state partitions, a bit more work is required, which we will discuss in Chapter 12 .
如果事件日志和应用状态以相同的方式分区(例如,处理分区3中某位客户的事件,只需要更新应用状态的分区3),那么一个简单的单线程日志消费者就不需要针对写入的并发控制——按其构造方式,它一次只处理一个事件(另请参阅“实际串行执行”)。日志通过定义分区内事件的串行顺序,消除了并发带来的不确定性 [24]。如果一个事件涉及多个状态分区,则需要做更多的工作,我们将在第12章中讨论。
Limitations of immutability
Many systems that don’t use an event-sourced model nevertheless rely on immutability: various databases internally use immutable data structures or multi-version data to support point-in-time snapshots (see “Indexes and snapshot isolation” ). Version control systems such as Git, Mercurial, and Fossil also rely on immutable data to preserve version history of files.
许多不使用事件溯源模型的系统仍然依赖于不变性:各种数据库内部使用不可变数据结构或多版本数据来支持时间点快照(请参见“索引和快照隔离”)。版本控制系统如Git、Mercurial和Fossil也依赖于不变数据来保留文件版本历史记录。
To what extent is it feasible to keep an immutable history of all changes forever? The answer depends on the amount of churn in the dataset. Some workloads mostly add data and rarely update or delete; they are easy to make immutable. Other workloads have a high rate of updates and deletes on a comparatively small dataset; in these cases, the immutable history may grow prohibitively large, fragmentation may become an issue, and the performance of compaction and garbage collection becomes crucial for operational robustness [ 60 , 61 ].
永远保留所有变更的不可变历史,在多大程度上是可行的?答案取决于数据集的变动量。某些工作负载主要是添加数据,很少更新或删除;它们很容易做成不可变的。另一些工作负载则在相对较小的数据集上有很高的更新和删除率;在这些情况下,不可变历史可能会增长到令人望而却步的规模,碎片化可能成为问题,而压缩与垃圾回收的性能对于系统的稳健运行变得至关重要 [60, 61]。
Besides the performance reasons, there may also be circumstances in which you need data to be deleted for administrative reasons, in spite of all immutability. For example, privacy regulations may require deleting a user’s personal information after they close their account, data protection legislation may require erroneous information to be removed, or an accidental leak of sensitive information may need to be contained.
除了性能原因,由于管理原因,您可能需要删除数据,尽管存在不可变性。例如,隐私规定可能要求在用户关闭账户后删除其个人信息,数据保护法规可能要求删除错误信息,或者敏感信息的意外泄露可能需要被控制。
In these circumstances, it’s not sufficient to just append another event to the log to indicate that the prior data should be considered deleted—you actually want to rewrite history and pretend that the data was never written in the first place. For example, Datomic calls this feature excision [ 62 ], and the Fossil version control system has a similar concept called shunning [ 63 ].
在这种情况下,仅仅将另一个事件附加到日志中以指示应将先前的数据视为已删除是不足够的 - 你实际上想要重写历史并假装第一次没有写入数据。例如,Datomic称此功能为切除[62],而Fossil版本控制系统具有类似的概念称为shunning[63]。
Truly deleting data is surprisingly hard [ 64 ], since copies can live in many places: for example, storage engines, filesystems, and SSDs often write to a new location rather than overwriting in place [ 52 ], and backups are often deliberately immutable to prevent accidental deletion or corruption. Deletion is more a matter of “making it harder to retrieve the data” than actually “making it impossible to retrieve the data.” Nevertheless, you sometimes have to try, as we shall see in “Legislation and self-regulation” .
真正删除数据的难度出人意料 [64],因为副本可以存在于许多地方: 例如,存储引擎、文件系统和SSD通常在新位置写入而不是原地覆盖 [52],备份通常故意不可变,以防止意外删除或损坏。删除更多地是“让检索数据变得更难”而不是真正“使检索数据变得不可能”。虽然如此,有时您必须尝试,正如我们将在“立法和自我调节”中看到的那样。
Processing Streams
So far in this chapter we have talked about where streams come from (user activity events, sensors, and writes to databases), and we have talked about how streams are transported (through direct messaging, via message brokers, and in event logs).
到目前为止,本章已经谈论了流从何而来(用户活动事件、传感器,以及对数据库的写入),也谈论了流是如何传输的(通过直接消息传递、通过消息代理,以及通过事件日志)。
What remains is to discuss what you can do with the stream once you have it—namely, you can process it. Broadly, there are three options:
剩下的就是讨论一旦你拥有了这个数据流你可以做什么——也就是你可以对其进行处理。大致上,有三个选项:
-
You can take the data in the events and write it to a database, cache, search index, or similar storage system, from where it can then be queried by other clients. As shown in Figure 11-5 , this is a good way of keeping a database in sync with changes happening in other parts of the system—especially if the stream consumer is the only client writing to the database. Writing to a storage system is the streaming equivalent of what we discussed in “The Output of Batch Workflows” .
您可以把事件中的数据写入数据库、缓存、搜索索引或类似的存储系统,然后其他客户端就可以从那里查询这些数据。如图11-5所示,这是使数据库与系统其他部分发生的变更保持同步的好方法,尤其当流消费者是写入数据库的唯一客户端时。写入存储系统,是我们在“批处理工作流的输出”中所讨论内容在流式场景下的等价物。
-
You can push the events to users in some way, for example by sending email alerts or push notifications, or by streaming the events to a real-time dashboard where they are visualized. In this case, a human is the ultimate consumer of the stream.
你可以通过发送电子邮件或推送通知,或将事件流式传输到实时仪表板并进行可视化的方式向用户推送事件。在这种情况下,人类是流的终极消费者。
-
You can process one or more input streams to produce one or more output streams. Streams may go through a pipeline consisting of several such processing stages before they eventually end up at an output (option 1 or 2).
您可以处理一个或多个输入流,以产生一个或多个输出流。流可能会经过一条由多个这样的处理阶段组成的管道,最终到达某个输出(选项1或2)。
In the rest of this chapter, we will discuss option 3: processing streams to produce other, derived streams. A piece of code that processes streams like this is known as an operator or a job . It is closely related to the Unix processes and MapReduce jobs we discussed in Chapter 10 , and the pattern of dataflow is similar: a stream processor consumes input streams in a read-only fashion and writes its output to a different location in an append-only fashion.
在本章的其余部分中,我们将讨论选项3:处理流以生成其他派生流。像这样处理流的代码被称为操作符或作业。它与我们在第10章中讨论的Unix进程和MapReduce作业密切相关,数据流模式相似:流处理器以只读的方式消耗输入流,并以追加的方式将其输出写入不同的位置。
The patterns for partitioning and parallelization in stream processors are also very similar to those in MapReduce and the dataflow engines we saw in Chapter 10 , so we won’t repeat those topics here. Basic mapping operations such as transforming and filtering records also work the same.
流处理器中的分区和并行化模式与MapReduce和我们在第10章中看到的数据流引擎非常相似,因此我们不会在这里重复这些主题。基本的映射操作,如转换和过滤记录,也是相同的。
The one crucial difference to batch jobs is that a stream never ends. This difference has many implications: as discussed at the start of this chapter, sorting does not make sense with an unbounded dataset, and so sort-merge joins (see “Reduce-Side Joins and Grouping” ) cannot be used. Fault-tolerance mechanisms must also change: with a batch job that has been running for a few minutes, a failed task can simply be restarted from the beginning, but with a stream job that has been running for several years, restarting from the beginning after a crash may not be a viable option.
与批处理作业的一个关键区别是,流永不结束。这个差异有许多影响:正如本章开头所讨论的,对无界数据集进行排序没有意义,因此无法使用排序合并连接(参见“Reduce侧连接与分组”)。容错机制也必须改变:对于已运行几分钟的批处理作业,失败的任务可以简单地从头重启,但对于已运行数年的流作业,崩溃后从头重启可能并不可行。
Uses of Stream Processing
Stream processing has long been used for monitoring purposes, where an organization wants to be alerted if certain things happen. For example:
流处理长期以来一直被用于监控目的,其中组织希望在发生某些事情时得到警报。例如:
-
Fraud detection systems need to determine if the usage patterns of a credit card have unexpectedly changed, and block the card if it is likely to have been stolen.
欺诈检测系统需要确定信用卡使用模式是否意外更改,并在可能被盗时封锁该卡。
-
Trading systems need to examine price changes in a financial market and execute trades according to specified rules.
交易系统需要审查金融市场的价格变化,并按照规定的规则执行交易。
-
Manufacturing systems need to monitor the status of machines in a factory, and quickly identify the problem if there is a malfunction.
制造系统需要监视工厂内机器的状态,并且在机器出现故障时能够迅速识别问题。
-
Military and intelligence systems need to track the activities of a potential aggressor, and raise the alarm if there are signs of an attack.
军事和情报系统需要跟踪潜在敌人的活动,并在发现攻击迹象时发出警报。
These kinds of applications require quite sophisticated pattern matching and correlations. However, other uses of stream processing have also emerged over time. In this section we will briefly compare and contrast some of these applications.
这种应用需要相当复杂的模式匹配和相关性。然而,随着时间的推移,流处理的其他用途也不断出现。在本节中,我们将简要比较和对比这些应用。
Complex event processing
Complex event processing (CEP) is an approach developed in the 1990s for analyzing event streams, especially geared toward the kind of application that requires searching for certain event patterns [ 65 , 66 ]. Similarly to the way that a regular expression allows you to search for certain patterns of characters in a string, CEP allows you to specify rules to search for certain patterns of events in a stream.
复杂事件处理(CEP)是开发于1990年代的一种方法,用于分析事件流,特别适用于需要搜索特定事件模式的应用程序[65,66]。类似于正则表达式允许您在字符串中搜索特定模式的字符的方式,CEP允许您指定规则搜索事件流中的某些事件模式。
CEP systems often use a high-level declarative query language like SQL, or a graphical user interface, to describe the patterns of events that should be detected. These queries are submitted to a processing engine that consumes the input streams and internally maintains a state machine that performs the required matching. When a match is found, the engine emits a complex event (hence the name) with the details of the event pattern that was detected [ 67 ].
CEP系统通常使用高级声明性查询语言(如SQL)或图形用户界面来描述应检测到的事件模式。这些查询将提交给处理引擎,该引擎消耗输入流并内部维护状态机以执行所需的匹配。当找到匹配时,引擎发出一个带有被检测到的事件模式细节的复杂事件(因此得名)。
In these systems, the relationship between queries and data is reversed compared to normal databases. Usually, a database stores data persistently and treats queries as transient: when a query comes in, the database searches for data matching the query, and then forgets about the query when it has finished. CEP engines reverse these roles: queries are stored long-term, and events from the input streams continuously flow past them in search of a query that matches an event pattern [ 68 ].
在这些系统中,与普通数据库相比,查询和数据之间的关系被颠倒了。通常,数据库会持久地存储数据,并将查询视为短暂的:当有查询时,数据库将搜索与查询匹配的数据,然后在完成后就忘记了查询。CEP引擎颠倒了这些角色:查询被长期存储,而来自输入流的事件不断流过它们,以寻找与事件模式匹配的查询[68]。
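As a hedged sketch of the CEP idea (the rule format and event types are invented): queries are registered long-term as little state machines, and every incoming event flows past all of them; when a pattern completes, a complex event is emitted.
作为对CEP思想的一个示意(规则格式与事件类型均为虚构):查询作为小型状态机被长期存储,每个到来的事件都会流过所有查询;当某个模式完成匹配时,就发出一个复杂事件。

```python
class PatternQuery:
    """Matches when the given event types occur in order (not necessarily
    adjacent) in the stream; internal state tracks progress through the pattern."""
    def __init__(self, name, pattern):
        self.name = name
        self.pattern = pattern
        self.position = 0

    def feed(self, event):
        if event["type"] == self.pattern[self.position]:
            self.position += 1
            if self.position == len(self.pattern):
                self.position = 0  # reset so the query keeps running
                return {"complex_event": self.name, "trigger": event}
        return None

# The stored query: two failed logins followed by a successful one.
queries = [PatternQuery("suspicious", ["login_failed", "login_failed", "login_ok"])]

stream = [{"type": "login_failed"}, {"type": "page_view"},
          {"type": "login_failed"}, {"type": "login_ok"}]

# Events flow past the stored queries, the reverse of a normal database.
matches = [m for e in stream for q in queries if (m := q.feed(e))]
```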
Implementations of CEP include Esper [ 69 ], IBM InfoSphere Streams [ 70 ], Apama, TIBCO StreamBase, and SQLstream. Distributed stream processors like Samza are also gaining SQL support for declarative queries on streams [ 71 ].
CEP 的实现包括 Esper [69],IBM InfoSphere Streams [70],Apama,TIBCO StreamBase 和 SQLstream。分布式流处理器像 Samza 也在逐渐获得 SQL 对流查询的支持 [71]。
Stream analytics
Another area in which stream processing is used is for analytics on streams. The boundary between CEP and stream analytics is blurry, but as a general rule, analytics tends to be less interested in finding specific event sequences and is more oriented toward aggregations and statistical metrics over a large number of events—for example:
流处理的另一个应用领域是对流进行分析。CEP与流分析之间的界限是模糊的,但一般来说,分析对寻找特定的事件序列不太感兴趣,而更侧重于对大量事件的聚合与统计指标,例如:
-
Measuring the rate of some type of event (how often it occurs per time interval)
测量某种类型事件的频率(每个时间间隔发生的频率)
-
Calculating the rolling average of a value over some time period
计算某一时期内某个值的滚动平均值。
-
Comparing current statistics to previous time intervals (e.g., to detect trends or to alert on metrics that are unusually high or low compared to the same time last week)
将当前统计数据与以前的时间间隔进行比较(例如,以便检测趋势或警报指标是否与上周同期明显偏高或偏低)。
Such statistics are usually computed over fixed time intervals—for example, you might want to know the average number of queries per second to a service over the last 5 minutes, and their 99th percentile response time during that period. Averaging over a few minutes smoothes out irrelevant fluctuations from one second to the next, while still giving you a timely picture of any changes in traffic pattern. The time interval over which you aggregate is known as a window , and we will look into windowing in more detail in “Reasoning About Time” .
这类统计通常是在固定的时间间隔内计算的——例如,您可能想知道过去5分钟内某服务的平均每秒查询数,以及该时段内的第99百分位响应时间。在几分钟的时间上取平均,可以平滑掉秒与秒之间无关紧要的波动,同时仍能及时反映流量模式的任何变化。进行聚合的时间间隔被称为窗口(window),我们将在“时间推理”一节中更详细地研究窗口。
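A minimal sketch of windowed aggregation (the timestamps are invented): counting events per fixed one-minute tumbling window, keyed by the window's start time.
窗口化聚合的一个最小示意(时间戳为虚构):按固定的一分钟滚动窗口统计事件数量,以窗口的起始时间为键。

```python
from collections import defaultdict

def count_per_window(timestamps, window_size=60):
    """Count events per tumbling window; each event is a Unix timestamp."""
    counts = defaultdict(int)
    for t in timestamps:
        window_start = t - (t % window_size)  # floor to the window boundary
        counts[window_start] += 1
    return dict(counts)

events = [0, 10, 59, 60, 61, 125]
# count_per_window(events) == {0: 3, 60: 2, 120: 1}
```

A rate per second would then just divide each count by the window size; rolling averages and percentiles follow the same bucketing pattern.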
Stream analytics systems sometimes use probabilistic algorithms, such as Bloom filters (which we encountered in “Performance optimizations” ) for set membership, HyperLogLog [ 72 ] for cardinality estimation, and various percentile estimation algorithms (see “Percentiles in Practice” ). Probabilistic algorithms produce approximate results, but have the advantage of requiring significantly less memory in the stream processor than exact algorithms. This use of approximation algorithms sometimes leads people to believe that stream processing systems are always lossy and inexact, but that is wrong: there is nothing inherently approximate about stream processing, and probabilistic algorithms are merely an optimization [ 73 ].
流分析系统有时会使用概率算法,比如用于判断集合成员的布隆过滤器(我们在“性能优化”中遇到过)、用于基数估计的HyperLogLog [72],以及各种百分位数估计算法(参见“实践中的百分位数”)。概率算法产生的是近似结果,但其优点是在流处理器中所需的内存远少于精确算法。对近似算法的这种使用有时会让人误以为流处理系统总是有损和不精确的,但这是错误的:流处理本质上并不近似,概率算法只是一种优化 [73]。
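As an illustration of the memory/accuracy trade-off, here is a toy Bloom filter (the sizes and hashing scheme are chosen arbitrarily): it never yields false negatives, may occasionally yield false positives, and needs far less memory than storing the set itself.
作为对内存与精度权衡的说明,下面是一个玩具布隆过滤器(大小和散列方案为任意选择):它绝不会产生假阴性,偶尔可能产生假阳性,所需内存远少于存储集合本身。

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size  # fixed memory, regardless of set size

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # A miss is definitely absent; a hit may rarely be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
# bf.might_contain("alice") is always True
```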
Many open source distributed stream processing frameworks are designed with analytics in mind: for example, Apache Storm, Spark Streaming, Flink, Concord, Samza, and Kafka Streams [ 74 ]. Hosted services include Google Cloud Dataflow and Azure Stream Analytics.
许多开源分布式流处理框架都是以分析为重点设计的,例如Apache Storm、Spark Streaming、Flink、Concord、Samza和Kafka Streams [74]。托管服务包括Google Cloud Dataflow和Azure Stream Analytics。
Maintaining materialized views
We saw in “Databases and Streams” that a stream of changes to a database can be used to keep derived data systems, such as caches, search indexes, and data warehouses, up to date with a source database. We can regard these examples as specific cases of maintaining materialized views (see “Aggregation: Data Cubes and Materialized Views” ): deriving an alternative view onto some dataset so that you can query it efficiently, and updating that view whenever the underlying data changes [ 50 ].
在“数据库和流”中,我们看到可以使用数据库的更改流来保持派生数据系统(例如缓存、搜索索引和数据仓库)与源数据库保持最新。我们可以将这些示例视为维护物化视图的特定案例(请参见“聚合:数据立方体和物化视图”):导出一些数据集的替代视图,以便您可以高效地查询它,并在基础数据更改时更新该视图[50]。
Similarly, in event sourcing, application state is maintained by applying a log of events; here the application state is also a kind of materialized view. Unlike stream analytics scenarios, it is usually not sufficient to consider only events within some time window: building the materialized view potentially requires all events over an arbitrary time period, apart from any obsolete events that may be discarded by log compaction (see “Log compaction” ). In effect, you need a window that stretches all the way back to the beginning of time.
类似地,在事件溯源中,应用状态是通过应用事件日志来维护的;这里的应用状态同样是一种物化视图。与流分析场景不同,只考虑某个时间窗口内的事件通常是不够的:构建物化视图可能需要任意时间段内的所有事件,除了那些可能被日志压缩丢弃的过时事件之外(参见“日志压缩”)。实际上,您需要一个一直延伸到时间起点的窗口。
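A minimal sketch of materialized-view maintenance (the change format is invented): the view is the result of applying every change in the log from the beginning of time, not just a recent window.
物化视图维护的一个最小示意(变更格式为虚构):视图是从时间起点开始应用日志中每一条变更的结果,而不仅仅是最近某个窗口内的变更。

```python
def apply_change(view, change):
    """Apply one changelog entry to a key-value view."""
    if change["op"] == "delete":
        view.pop(change["key"], None)
    else:  # "put" covers both insert and update
        view[change["key"]] = change["value"]
    return view

changelog = [
    {"op": "put",    "key": "a", "value": 1},
    {"op": "put",    "key": "b", "value": 2},
    {"op": "put",    "key": "a", "value": 3},  # update overwrites in the view
    {"op": "delete", "key": "b"},
]

view = {}
for change in changelog:
    apply_change(view, change)
# view == {"a": 3}: replaying the full log from the start rebuilds the view
```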
In principle, any stream processor could be used for materialized view maintenance, although the need to maintain events forever runs counter to the assumptions of some analytics-oriented frameworks that mostly operate on windows of a limited duration. Samza and Kafka Streams support this kind of usage, building upon Kafka’s support for log compaction [ 75 ].
原则上,任何流处理器都可以用于物化视图的维护,尽管永久保留事件的需求,与一些主要在有限时长的窗口上操作的、面向分析的框架的假设相抵触。Samza和Kafka Streams支持这种用法,它们建立在Kafka对日志压缩的支持之上 [75]。
Search on streams
Besides CEP, which allows searching for patterns consisting of multiple events, there is also sometimes a need to search for individual events based on complex criteria, such as full-text search queries.
除了CEP,它允许搜索由多个事件组成的模式,有时还需要根据复杂条件搜索个别事件,例如全文搜索查询。
For example, media monitoring services subscribe to feeds of news articles and broadcasts from media outlets, and search for any news mentioning companies, products, or topics of interest. This is done by formulating a search query in advance, and then continually matching the stream of news items against this query. Similar features exist on some websites: for example, users of real estate websites can ask to be notified when a new property matching their search criteria appears on the market. The percolator feature of Elasticsearch [ 76 ] is one option for implementing this kind of stream search.
例如,媒体监控服务会订阅来自媒体机构的新闻文章与广播的推送,并搜索任何提及感兴趣的公司、产品或主题的新闻。这是通过预先制定一个搜索查询,然后持续地用这个查询去匹配新闻条目的流来实现的。一些网站上也存在类似的功能:例如,房地产网站的用户可以要求在市场上出现符合其搜索条件的新房源时收到通知。Elasticsearch的percolator功能 [76] 是实现这种流搜索的一种选择。
Conventional search engines first index the documents and then run queries over the index. By contrast, searching a stream turns the processing on its head: the queries are stored, and the documents run past the queries, like in CEP. In the simplest case, you can test every document against every query, although this can get slow if you have a large number of queries. To optimize the process, it is possible to index the queries as well as the documents, and thus narrow down the set of queries that may match [ 77 ].
传统的搜索引擎先为文档建立索引,然后在索引上运行查询。相比之下,在流上搜索把这个处理过程颠倒了过来:查询被存储下来,文档从查询旁流过,就像CEP中那样。在最简单的情况下,可以用每个查询来测试每个文档,但如果查询数量很大,这可能会变慢。为了优化这个过程,除了给文档建索引之外,还可以给查询建索引,从而缩小可能匹配的查询集合 [77]。
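A toy version of the idea (the query format is invented): full-text queries are stored, each incoming document is tested against them, and indexing the queries by their terms narrows down which ones can possibly match.
这个思路的一个玩具版本(查询格式为虚构):全文查询被存储下来,每个到来的文档都会被拿来与它们进行匹配,而按词项给查询建立索引可以缩小可能匹配的查询范围。

```python
# Stored queries: each matches documents containing all of its terms.
stored_queries = {
    "q1": {"acme", "merger"},
    "q2": {"housing", "berlin"},
}

# Index the queries by term, so a document only checks queries sharing a term.
term_index = {}
for qid, terms in stored_queries.items():
    for term in terms:
        term_index.setdefault(term, set()).add(qid)

def matching_queries(document_text):
    words = set(document_text.lower().split())
    candidates = set().union(*(term_index.get(w, set()) for w in words))
    return {qid for qid in candidates if stored_queries[qid] <= words}

# matching_queries("Acme announces merger with rival") == {"q1"}
```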
Message passing and RPC
In “Message-Passing Dataflow” we discussed message-passing systems as an alternative to RPC—i.e., as a mechanism for services to communicate, as used for example in the actor model. Although these systems are also based on messages and events, we normally don’t think of them as stream processors:
在“消息传递数据流”中,我们讨论了消息传递系统作为RPC的一种替代方案——即作为服务之间通信的机制,例如在actor模型中使用。虽然这些系统也基于消息和事件,但我们通常不将它们视为流处理器:
-
Actor frameworks are primarily a mechanism for managing concurrency and distributed execution of communicating modules, whereas stream processing is primarily a data management technique.
Actor框架主要是一种用于管理通信模块的并发和分布式执行的机制,而流处理主要是一种数据管理技术。
-
Communication between actors is often ephemeral and one-to-one, whereas event logs are durable and multi-subscriber.
Actor之间的通信往往是短暂的、一对一的,而事件日志则是持久的、支持多个订阅者的。
-
Actors can communicate in arbitrary ways (including cyclic request/response patterns), but stream processors are usually set up in acyclic pipelines where every stream is the output of one particular job, and derived from a well-defined set of input streams.
Actor可以以任意方式通信(包括循环的请求/响应模式),但流处理器通常被设置在无环的管道中,其中每个流都是某个特定作业的输出,并由一组明确定义的输入流派生而来。
That said, there is some crossover area between RPC-like systems and stream processing. For example, Apache Storm has a feature called distributed RPC , which allows user queries to be farmed out to a set of nodes that also process event streams; these queries are then interleaved with events from the input streams, and results can be aggregated and sent back to the user [ 78 ]. (See also “Multi-partition data processing” .)
尽管如此,类RPC系统与流处理之间存在一些交叉地带。例如,Apache Storm有一个称为分布式RPC的功能,允许把用户查询分派到一组同时也处理事件流的节点上;这些查询随后与来自输入流的事件交错处理,结果可以被聚合后发回给用户 [78]。(另请参阅“多分区数据处理”。)
It is also possible to process streams using actor frameworks. However, many such frameworks do not guarantee message delivery in the case of crashes, so the processing is not fault-tolerant unless you implement additional retry logic.
使用 actor 框架也可以处理数据流。然而,许多这样的框架在发生崩溃时无法保证消息传递,所以除非您实现了额外的重试逻辑,否则处理不具备容错能力。
Reasoning About Time
Stream processors often need to deal with time, especially when used for analytics purposes, which frequently use time windows such as “the average over the last five minutes.” It might seem that the meaning of “the last five minutes” should be unambiguous and clear, but unfortunately the notion is surprisingly tricky.
流处理器经常需要和时间打交道,尤其是用于分析目的时,这类分析经常使用“过去五分钟的平均值”之类的时间窗口。“过去五分钟”的含义看起来似乎应该是明确而清晰的,但不幸的是,这个概念却出奇地棘手。
In a batch process, the processing tasks rapidly crunch through a large collection of historical events. If some kind of breakdown by time needs to happen, the batch process needs to look at the timestamp embedded in each event. There is no point in looking at the system clock of the machine running the batch process, because the time at which the process is run has nothing to do with the time at which the events actually occurred.
在批处理过程中,处理任务快速地处理大量的历史事件。如果需要按时间进行分解,批处理需要查看每个事件中嵌入的时间戳。查看批处理运行的机器的系统时钟没有意义,因为处理运行的时间与事件实际发生的时间无关。
A batch process may read a year’s worth of historical events within a few minutes; in most cases, the timeline of interest is the year of history, not the few minutes of processing. Moreover, using the timestamps in the events allows the processing to be deterministic: running the same process again on the same input yields the same result (see “Fault tolerance” ).
批处理可以在几分钟内读完一年的历史事件;在大多数情况下,值得关注的时间线是那一年的历史,而不是处理所花的那几分钟。此外,使用事件中的时间戳可以使处理具有确定性:对同样的输入再次运行同样的处理过程,会得到同样的结果(参见“容错”)。
On the other hand, many stream processing frameworks use the local system clock on the processing machine (the processing time ) to determine windowing [ 79 ]. This approach has the advantage of being simple, and it is reasonable if the delay between event creation and event processing is negligibly short. However, it breaks down if there is any significant processing lag—i.e., if the processing may happen noticeably later than the time at which the event actually occurred.
另一方面,许多流处理框架使用处理机上的本地系统时钟(处理时间)来确定窗口化[79]。这种方法具有简单的优点,如果事件创建和事件处理之间的延迟可以忽略不计的话是合理的。然而,如果存在任何显着的处理延迟,即如果处理可能明显晚于事件实际发生的时间,它将失效。
Event time versus processing time
There are many reasons why processing may be delayed: queueing, network faults (see “Unreliable Networks” ), a performance issue leading to contention in the message broker or processor, a restart of the stream consumer, or reprocessing of past events (see “Replaying old messages” ) while recovering from a fault or after fixing a bug in the code.
处理可能被延迟的原因有很多:排队、网络故障(参见“不可靠的网络”)、导致消息代理或处理器发生争用的性能问题、流消费者的重启,以及在从故障中恢复或修复代码bug之后重新处理过去的事件(参见“重放旧消息”)。
Moreover, message delays can also lead to unpredictable ordering of messages. For example, say a user first makes one web request (which is handled by web server A), and then a second request (which is handled by server B). A and B emit events describing the requests they handled, but B’s event reaches the message broker before A’s event does. Now stream processors will first see the B event and then the A event, even though they actually occurred in the opposite order.
此外,消息延迟还会导致消息的顺序变得不可预测。例如,一个用户首先请求一个网页(由服务器A处理),然后请求第二个网页(服务器B处理)。A和B都会发布关于它们处理的请求的事件,但是B的事件先到达消息代理,然后才是A的事件。现在,流处理器会先看到B的事件,然后再看到A的事件,尽管它们实际上是相反的顺序发生的。
If it helps to have an analogy, consider the Star Wars movies: Episode IV was released in 1977, Episode V in 1980, and Episode VI in 1983, followed by Episodes I, II, and III in 1999, 2002, and 2005, respectively, and Episode VII in 2015 [ 80 ]. ii If you watched the movies in the order they came out, the order in which you processed the movies is inconsistent with the order of their narrative. (The episode number is like the event timestamp, and the date when you watched the movie is the processing time.) As humans, we are able to cope with such discontinuities, but stream processing algorithms need to be specifically written to accommodate such timing and ordering issues.
如果需要一个类比的话,可以考虑《星球大战》电影系列:第四集在1977年上映,第五集在1980年,第六集在1983年上映,接着是第一集、第二集和第三集,在1999年、2002年和2005年上映,最后是第七集在2015年上映[80]。如果您按照它们上映的顺序观看电影,那么您处理电影的顺序与它们叙述的顺序是不一致的。(剧集编号就像事件时间戳,您观看电影的日期就是处理时间。)作为人类,我们能够应对这种不连贯性,但流处理算法需要专门编写以适应这种时间和顺序问题。
Confusing event time and processing time leads to bad data. For example, say you have a stream processor that measures the rate of requests (counting the number of requests per second). If you redeploy the stream processor, it may be shut down for a minute and process the backlog of events when it comes back up. If you measure the rate based on the processing time, it will look as if there was a sudden anomalous spike of requests while processing the backlog, when in fact the real rate of requests was steady ( Figure 11-7 ).
混淆事件时间和处理时间会导致错误的数据。例如,假设你有一个流处理器用于测量请求速率(计算每秒的请求数)。如果你重新部署这个流处理器,它可能会停机一分钟,并在恢复之后处理积压的事件。如果你按处理时间来测量速率,看起来就好像在处理积压期间出现了一个突然的异常请求峰值,而实际上请求的真实速率是稳定的(图11-7)。
Knowing when you’re ready
A tricky problem when defining windows in terms of event time is that you can never be sure when you have received all of the events for a particular window, or whether there are some events still to come.
定义事件时间窗口时的一个棘手问题在于无法确定何时已接收到某个窗口的所有事件,或者是否还有一些事件需要到来。
For example, say you’re grouping events into one-minute windows so that you can count the number of requests per minute. You have counted some number of events with timestamps that fall in the 37th minute of the hour, and time has moved on; now most of the incoming events fall within the 38th and 39th minutes of the hour. When do you declare that you have finished the window for the 37th minute, and output its counter value?
例如,假设您正在将事件分组到每分钟窗口中,以便可以计算每分钟的请求数量。您已经计算出一些时间戳位于小时的第37分钟内的事件数量,但时间已经过去了。现在,大部分传入事件都在小时的第38和39分钟内。您应该在何时声明已完成第37分钟的窗口,并输出其计数器值?
You can time out and declare a window ready after you have not seen any new events for a while, but it could still happen that some events were buffered on another machine somewhere, delayed due to a network interruption. You need to be able to handle such straggler events that arrive after the window has already been declared complete. Broadly, you have two options [ 1 ]:
如果在一段时间内没有看到任何新事件,你可以超时并宣布窗口就绪,但仍然可能出现这样的情况:某些事件被缓冲在其他机器上的某个地方,由于网络中断而延迟到达。你需要能够处理这种在窗口已经被宣布完成之后才到达的掉队(straggler)事件。大体上,你有两种选择[1]:
-
Ignore the straggler events, as they are probably a small percentage of events in normal circumstances. You can track the number of dropped events as a metric, and alert if you start dropping a significant amount of data.
忽略掉队事件,因为在正常情况下它们可能只占事件的一小部分。你可以将被丢弃的事件数作为一个指标进行跟踪,并在开始丢弃大量数据时发出警报。
-
Publish a correction , an updated value for the window with stragglers included. You may also need to retract the previous output.
发布一个更正(correction),即把掉队事件也计算在内的窗口更新值。你可能还需要撤回之前的输出。
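These two options can be illustrated with a minimal sketch (the class and method names are ours, not any particular framework's API; timestamps are in seconds):

这两种选择可以用一个最小化的草图来说明(类名和方法名是我们假设的,并非任何特定框架的API;时间戳以秒为单位):

```python
from collections import defaultdict

class WindowCounter:
    """Counts events per 1-minute window (by event time). A straggler that
    arrives after the window is declared complete is either dropped and
    tracked as a metric (option 1) or turned into a correction (option 2)."""

    def __init__(self):
        self.counts = defaultdict(int)  # window (epoch minute) -> event count
        self.closed = set()             # windows already declared complete
        self.dropped = 0                # metric: stragglers ignored

    def on_event(self, event_ts, allow_corrections=True):
        window = event_ts // 60
        self.counts[window] += 1
        if window in self.closed:
            if allow_corrections:
                # Option 2: publish an updated value for the closed window.
                return ("correction", window, self.counts[window])
            # Option 1: undo the count, but track it so we can alert.
            self.counts[window] -= 1
            self.dropped += 1
        return None

    def close_window(self, window):
        self.closed.add(window)
        return ("result", window, self.counts[window])
```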
In some cases it is possible to use a special message to indicate, “From now on there will be no more messages with a timestamp earlier than t ,” which can be used by consumers to trigger windows [ 81 ]. However, if several producers on different machines are generating events, each with their own minimum timestamp thresholds, the consumers need to keep track of each producer individually. Adding and removing producers is trickier in this case.
在某些情况下,可以使用一种特殊消息来表示“从现在开始,不会再有时间戳早于t的消息了”,消费者可以用它来触发窗口[81]。然而,如果不同机器上的多个生产者都在生成事件,每个生产者都有自己的最小时间戳阈值,那么消费者就需要单独跟踪每个生产者。在这种情况下,添加和删除生产者都会更加棘手。
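Per-producer tracking of such "no more messages earlier than t" promises might be sketched as follows (names are hypothetical; a real system would also have to handle producers joining and leaving):

按生产者跟踪这种“不会再有早于t的消息”承诺的做法可以粗略勾勒如下(名称是假设的;真实系统还必须处理生产者的加入和离开):

```python
class WatermarkTracker:
    """Tracks the minimum-timestamp promise from each producer. A window
    ending at time t is safe to close only once *every* producer has
    promised not to send anything earlier than t."""

    def __init__(self):
        self.watermarks = {}  # producer id -> latest promised minimum timestamp

    def advance(self, producer, timestamp):
        # Promises only move forward; ignore out-of-order announcements.
        self.watermarks[producer] = max(self.watermarks.get(producer, 0), timestamp)

    def global_watermark(self):
        # Only the minimum over all producers is safe to rely on.
        return min(self.watermarks.values()) if self.watermarks else 0

    def can_close(self, window_end):
        return self.global_watermark() >= window_end
```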
Whose clock are you using, anyway?
Assigning timestamps to events is even more difficult when events can be buffered at several points in the system. For example, consider a mobile app that reports events for usage metrics to a server. The app may be used while the device is offline, in which case it will buffer events locally on the device and send them to a server when an internet connection is next available (which may be hours or even days later). To any consumers of this stream, the events will appear as extremely delayed stragglers.
当事件可能在系统中的多个位置被缓冲时,为事件分配时间戳就更加困难了。例如,考虑一个向服务器报告使用情况统计事件的移动应用。该应用可能在设备离线时被使用,在这种情况下它会在设备本地缓冲事件,并在下次互联网连接可用时(可能是几小时甚至几天之后)把它们发送给服务器。对于这个流的任何消费者而言,这些事件将表现为延迟极大的掉队事件。
In this context, the timestamp on the events should really be the time at which the user interaction occurred, according to the mobile device’s local clock. However, the clock on a user-controlled device often cannot be trusted, as it may be accidentally or deliberately set to the wrong time (see “Clock Synchronization and Accuracy” ). The time at which the event was received by the server (according to the server’s clock) is more likely to be accurate, since the server is under your control, but less meaningful in terms of describing the user interaction.
在这种情况下,事件的时间戳实际上应该是用户交互发生的时间,即按照移动设备的本地时钟。然而,用户控制的设备上的时钟往往不可信,因为它可能被意外或故意设置为错误的时间(参见“时钟同步与准确性”)。服务器接收到该事件的时间(按照服务器的时钟)更有可能是准确的,因为服务器在你的控制之下,但就描述用户交互而言,意义不大。
To adjust for incorrect device clocks, one approach is to log three timestamps [ 82 ]:
调整不正确的设备时钟的一种方法是记录三个时间戳 [82]:
-
The time at which the event occurred, according to the device clock
事件发生的时间,根据设备时钟。
-
The time at which the event was sent to the server, according to the device clock
根据设备时钟发送事件的时间。
-
The time at which the event was received by the server, according to the server clock
服务器接收该事件的时间,根据服务器时钟显示。
By subtracting the second timestamp from the third, you can estimate the offset between the device clock and the server clock (assuming the network delay is negligible compared to the required timestamp accuracy). You can then apply that offset to the event timestamp, and thus estimate the true time at which the event actually occurred (assuming the device clock offset did not change between the time the event occurred and the time it was sent to the server).
通过从第三个时间戳中减去第二个时间戳,可以估计设备时钟和服务器时钟之间的偏移量(假设网络延迟与所需时间戳精度相比可以忽略不计)。然后,可以将该偏移量应用于事件时间戳,从而估计事件实际发生的真正时间(假设设备时钟偏移量在该事件发生时和发送到服务器时没有发生变化)。
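The offset estimation is simple arithmetic; as a sketch (assuming all three timestamps are seconds on a common epoch, and the stated assumptions about network delay and clock drift hold):

这个偏移量估计只是简单的算术;一个草图如下(假设三个时间戳都以同一纪元的秒为单位,并且关于网络延迟和时钟漂移的假设成立):

```python
def estimate_event_time(device_event_ts, device_send_ts, server_recv_ts):
    """Estimate the true event time from the three logged timestamps.
    Assumes network delay is negligible relative to the required accuracy,
    and that the device clock offset did not change between the event
    occurring and the event being sent."""
    # How far the device clock deviates from the server clock:
    offset = server_recv_ts - device_send_ts
    # Apply the offset to the device's event timestamp:
    return device_event_ts + offset
```

For example, if the device clock runs one hour slow, an event it stamped at 1000 and sent at 1200, received by the server at 4800, is estimated to have truly occurred at 4600.

例如,如果设备时钟慢了一小时,一个被设备标记为1000、在1200发送、服务器在4800接收到的事件,其真实发生时间被估计为4600。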
This problem is not unique to stream processing—batch processing suffers from exactly the same issues of reasoning about time. It is just more noticeable in a streaming context, where we are more aware of the passage of time.
这个问题不仅仅是流处理所独有的——批量处理在关于时间的推断方面也有完全相同的问题。只是在流处理环境中更加显著,因为我们更加注意时间的流逝。
Types of windows
Once you know how the timestamp of an event should be determined, the next step is to decide how windows over time periods should be defined. The window can then be used for aggregations, for example to count events, or to calculate the average of values within the window. Several types of windows are in common use [ 79 , 83 ]:
一旦确定了事件的时间戳应如何确定,下一步就是决定如何定义时间段内的窗口。接下来可以使用窗口进行聚合,例如计算事件次数或计算窗口内数值的平均值。常用的几种窗口类型包括[79,83]:
- Tumbling window
-
A tumbling window has a fixed length, and every event belongs to exactly one window. For example, if you have a 1-minute tumbling window, all the events with timestamps between 10:03:00 and 10:03:59 are grouped into one window, events between 10:04:00 and 10:04:59 into the next window, and so on. You could implement a 1-minute tumbling window by taking each event timestamp and rounding it down to the nearest minute to determine the window that it belongs to.
滚动窗口具有固定长度,每个事件都属于恰好一个窗口。例如,如果您有一个1分钟的滚动窗口,则所有时间戳介于10:03:00和10:03:59之间的事件都分组为一个窗口,介于10:04:00和10:04:59之间的事件则分为下一个窗口,以此类推。您可以通过将每个事件时间戳向下舍入到最接近的一分钟来实现1分钟的滚动窗口。
- Hopping window
-
A hopping window also has a fixed length, but allows windows to overlap in order to provide some smoothing. For example, a 5-minute window with a hop size of 1 minute would contain the events between 10:03:00 and 10:07:59, then the next window would cover events between 10:04:00 and 10:08:59, and so on. You can implement this hopping window by first calculating 1-minute tumbling windows, and then aggregating over several adjacent windows.
跳跃窗口也有固定的长度,但允许窗口重叠以提供一些平滑。例如,带有1分钟跳跃大小的5分钟窗口将包含10:03:00至10:07:59之间的事件,然后下一个窗口将覆盖10:04:00至10:08:59之间的事件,以此类推。您可以通过先计算1分钟滚动窗口,然后聚合多个相邻窗口来实现此跳跃窗口。
- Sliding window
-
A sliding window contains all the events that occur within some interval of each other. For example, a 5-minute sliding window would cover events at 10:03:39 and 10:08:12, because they are less than 5 minutes apart (note that tumbling and hopping 5-minute windows would not have put these two events in the same window, as they use fixed boundaries). A sliding window can be implemented by keeping a buffer of events sorted by time and removing old events when they expire from the window.
滑动窗口包含在某段时间内发生的所有事件。例如,5分钟的滑动窗口将包括在10:03:39和10:08:12发生的事件,因为它们相隔不到5分钟(请注意,滚动和跳跃的5分钟窗口不会将这两个事件放在同一个窗口中,因为它们使用固定边界)。滑动窗口可以通过保留按时间排序的事件缓冲区来实现,并在它们从窗口中过期时删除旧事件。
- Session window
-
Unlike the other window types, a session window has no fixed duration. Instead, it is defined by grouping together all events for the same user that occur closely together in time, and the window ends when the user has been inactive for some time (for example, if there have been no events for 30 minutes). Sessionization is a common requirement for website analytics (see “GROUP BY” ).
与其他窗口类型不同,会话窗口没有固定的持续时间。相反,它是通过将同一用户在时间上紧密联系的所有事件组合在一起来定义的,当用户一段时间内处于非活动状态时窗口就会结束(例如,如果30分钟没有出现任何事件)。会话化是网站分析的常见要求(请参见“GROUP BY”)。
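As a rough illustration of the four window types described above (these are our own helper functions, with timestamps in seconds and events assumed to arrive in order, not any framework's API):

作为对上述四种窗口类型的粗略示意(这些是我们自己的辅助函数,时间戳以秒为单位,并假设事件按顺序到达,并非某个框架的API):

```python
from collections import Counter

def tumbling_window(ts, size=60):
    """Tumbling: round the timestamp down to its window's start."""
    return ts - ts % size

def hopping_counts(timestamps, size=300, hop=60):
    """Hopping: first bucket into `hop`-sized tumbling windows, then
    sum size // hop adjacent buckets for each overlapping window."""
    buckets = Counter(tumbling_window(ts, hop) for ts in timestamps)
    starts = range(min(buckets), max(buckets) + hop, hop)
    return {w: sum(buckets.get(w + i * hop, 0) for i in range(size // hop))
            for w in starts}

def sliding_window(buffer, ts, interval=300):
    """Sliding: keep a time-sorted buffer, evicting events that are no
    longer within `interval` of the newest event."""
    buffer.append(ts)
    while buffer[0] <= ts - interval:
        buffer.pop(0)
    return list(buffer)

def sessionize(timestamps, gap=1800):
    """Session: start a new session after `gap` seconds of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions
```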
Stream Joins
In Chapter 10 we discussed how batch jobs can join datasets by key, and how such joins form an important part of data pipelines. Since stream processing generalizes data pipelines to incremental processing of unbounded datasets, there is exactly the same need for joins on streams.
在第10章中,我们讨论了批处理作业如何按键连接数据集,以及这种连接如何构成数据管道的重要组成部分。由于流处理将数据管道推广为对无界数据集的增量处理,因此在流上也完全同样需要连接。
However, the fact that new events can appear anytime on a stream makes joins on streams more challenging than in batch jobs. To understand the situation better, let’s distinguish three different types of joins: stream-stream joins, stream-table joins, and table-table joins [ 84 ]. In the following sections we’ll illustrate each by example.
然而,流上随时可以出现新事件的事实,使得流上的连接比批处理作业更具挑战性。为了更好地理解情况,让我们区分三种不同的连接类型:流-流连接、流-表连接和表-表连接[84]。在接下来的章节中,我们将以示例说明每种类型。
Stream-stream join (window join)
Say you have a search feature on your website, and you want to detect recent trends in searched-for URLs. Every time someone types a search query, you log an event containing the query and the results returned. Every time someone clicks one of the search results, you log another event recording the click. In order to calculate the click-through rate for each URL in the search results, you need to bring together the events for the search action and the click action, which are connected by having the same session ID. Similar analyses are needed in advertising systems [ 85 ].
假设您的网站上有一个搜索功能,并且您想检测被搜索URL的最近趋势。每当有人键入搜索查询时,您记录一个包含该查询及其返回结果的事件。每当有人点击其中一个搜索结果时,您记录另一个表示点击的事件。为了计算搜索结果中每个URL的点击率,您需要把搜索动作和点击动作的事件汇集到一起,它们通过拥有相同的会话ID而关联。广告系统中也需要类似的分析[85]。
The click may never come if the user abandons their search, and even if it comes, the time between the search and the click may be highly variable: in many cases it might be a few seconds, but it could be as long as days or weeks (if a user runs a search, forgets about that browser tab, and then returns to the tab and clicks a result sometime later). Due to variable network delays, the click event may even arrive before the search event. You can choose a suitable window for the join—for example, you may choose to join a click with a search if they occur at most one hour apart.
如果用户放弃了搜索,点击可能永远不会到来;即使它到来了,搜索和点击之间的时间间隔也可能变化很大:在许多情况下可能是几秒钟,但也可能长达几天或几周(如果用户进行了搜索,忘记了那个浏览器标签页,过了一段时间后又回到该标签页并点击了一个结果)。由于网络延迟多变,点击事件甚至可能比搜索事件先到达。您可以为连接选择一个合适的窗口:例如,只要点击发生在搜索之后一小时以内,就可以把两者连接起来。
Note that embedding the details of the search in the click event is not equivalent to joining the events: doing so would only tell you about the cases where the user clicked a search result, not about the searches where the user did not click any of the results. In order to measure search quality, you need accurate click-through rates, for which you need both the search events and the click events.
请注意,把搜索的详细信息嵌入点击事件中并不等同于连接这两类事件:那样做只能告诉您用户点击了搜索结果的情况,而无法告诉您用户没有点击任何结果的那些搜索。为了衡量搜索质量,您需要准确的点击率,而这同时需要搜索事件和点击事件。
To implement this type of join, a stream processor needs to maintain state : for example, all the events that occurred in the last hour, indexed by session ID. Whenever a search event or click event occurs, it is added to the appropriate index, and the stream processor also checks the other index to see if another event for the same session ID has already arrived. If there is a matching event, you emit an event saying which search result was clicked. If the search event expires without you seeing a matching click event, you emit an event saying which search results were not clicked.
为实现此类连接,流处理器需要维护状态:例如,所有最近一小时发生的事件,按会话 ID 索引。每当搜索事件或点击事件发生时,它将被添加到相应的索引中,流处理器也将检查其他索引,以查看是否已经到达了与相同会话 ID 的另一个事件。如果存在匹配的事件,则发出一条事件,说明哪个搜索结果已被点击。如果搜索事件在您看不到匹配的点击事件的情况下过期,则发出一条事件,说明哪些搜索结果未被点击。
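A minimal sketch of this join state might look as follows (class and method names are ours; expiry of old entries and the "not clicked" output on expiry are omitted for brevity):

这种连接状态的最小草图可能如下(类名和方法名是我们假设的;为简洁起见,省略了旧条目的过期处理以及过期时的“未被点击”输出):

```python
class SearchClickJoin:
    """Stream-stream join on session ID: one index per input stream.
    Whichever event arrives second triggers the joined output."""

    def __init__(self):
        self.searches = {}  # session_id -> search query
        self.clicks = {}    # session_id -> clicked URL

    def on_search(self, session_id, query):
        self.searches[session_id] = query
        if session_id in self.clicks:  # click may have arrived first
            return (query, self.clicks[session_id])
        return None

    def on_click(self, session_id, url):
        self.clicks[session_id] = url
        if session_id in self.searches:
            return (self.searches[session_id], url)
        return None
```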
Stream-table join (stream enrichment)
In “Example: analysis of user activity events” ( Figure 10-2 ) we saw an example of a batch job joining two datasets: a set of user activity events and a database of user profiles. It is natural to think of the user activity events as a stream, and to perform the same join on a continuous basis in a stream processor: the input is a stream of activity events containing a user ID, and the output is a stream of activity events in which the user ID has been augmented with profile information about the user. This process is sometimes known as enriching the activity events with information from the database.
在“示例:用户活动事件分析”(图10-2)中,我们看到了一个连接两个数据集的批处理作业的例子:一组用户活动事件和一个用户档案数据库。很自然地可以把用户活动事件看作一个流,并在流处理器中持续地执行同样的连接:输入是包含用户ID的活动事件流,输出是其中的用户ID已被该用户的档案信息所扩充的活动事件流。这个过程有时被称为用数据库中的信息来丰富(enriching)活动事件。
To perform this join, the stream process needs to look at one activity event at a time, look up the event’s user ID in the database, and add the profile information to the activity event. The database lookup could be implemented by querying a remote database; however, as discussed in “Example: analysis of user activity events” , such remote queries are likely to be slow and risk overloading the database [ 75 ].
为了执行这个连接操作,流处理需要逐一查看每个活动事件,从数据库中查找该事件的用户ID并将个人资料信息添加到活动事件中。数据库的查找可以通过查询远程数据库来实现;然而,正如在“示例:用户活动事件分析”中所讨论的,这样的远程查询很可能很慢,有过载数据库的风险[75]。
Another approach is to load a copy of the database into the stream processor so that it can be queried locally without a network round-trip. This technique is very similar to the hash joins we discussed in “Map-Side Joins” : the local copy of the database might be an in-memory hash table if it is small enough, or an index on the local disk.
另一种方法是将数据库副本加载到流处理器中,以便可以在本地查询,而不需要网络往返。这种技术非常类似于我们在“Map-Side Joins”中讨论的哈希连接:如果本地数据库的大小足够小,那么它的本地副本可能是在内存中的哈希表,否则可能是本地磁盘上的索引。
The difference to batch jobs is that a batch job uses a point-in-time snapshot of the database as input, whereas a stream processor is long-running, and the contents of the database are likely to change over time, so the stream processor’s local copy of the database needs to be kept up to date. This issue can be solved by change data capture: the stream processor can subscribe to a changelog of the user profile database as well as the stream of activity events. When a profile is created or modified, the stream processor updates its local copy. Thus, we obtain a join between two streams: the activity events and the profile updates.
批处理任务的区别在于,批处理任务使用数据库的某个时间点的快照作为输入,而流处理器是长时间运行的,数据库的内容随着时间会发生变化,因此流处理器的本地副本需要保持最新。这个问题可以通过变更数据捕获来解决:流处理器可以订阅用户配置文件数据库的变更日志以及活动事件流。当配置文件被创建或修改时,流处理器会更新其本地副本。因此,我们得到了两个流的连接:活动事件和配置文件更新。
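A sketch of such a stream-table join, with the local table copy kept up to date from a changelog (all names hypothetical):

这样一个流-表连接的草图如下,其本地表副本通过变更日志保持最新(所有名称均为假设):

```python
class StreamTableJoin:
    """Enriches activity events with user profiles. The local profile
    replica is maintained by consuming the profile database's changelog,
    so no per-message remote lookup is needed."""

    def __init__(self):
        self.profiles = {}  # user_id -> profile dict (local replica)

    def on_profile_change(self, user_id, profile):
        # Change data capture: newer versions overwrite older ones.
        self.profiles[user_id] = profile

    def on_activity(self, event):
        # Local lookup instead of a network round-trip to the database.
        profile = self.profiles.get(event["user_id"])
        return {**event, "profile": profile}
```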
A stream-table join is actually very similar to a stream-stream join; the biggest difference is that for the table changelog stream, the join uses a window that reaches back to the “beginning of time” (a conceptually infinite window), with newer versions of records overwriting older ones. For the stream input, the join might not maintain a window at all.
流表联接实际上与流流联接非常相似;最大的区别在于对于表changelog流,联接使用窗口回溯到“时间的开始”(概念上的无限窗口),以新版本的记录覆盖旧的记录。对于流输入,联接可能根本不维护窗口。
Table-table join (materialized view maintenance)
Consider the Twitter timeline example that we discussed in “Describing Load” . We said that when a user wants to view their home timeline, it is too expensive to iterate over all the people the user is following, find their recent tweets, and merge them.
考虑我们在“描述负载”中讨论过的Twitter时间线示例。我们说当用户想要查看他们的主页时间线时,遍历用户关注的所有人,找到他们最近的推文并合并它们是太昂贵了。
Instead, we want a timeline cache: a kind of per-user “inbox” to which tweets are written as they are sent, so that reading the timeline is a single lookup. Materializing and maintaining this cache requires the following event processing:
相反,我们想要一个时间线缓存:一种每用户的“收件箱”,推文在发送时就被写入其中,这样读取时间线就只是一次查找。物化并维护这个缓存需要以下的事件处理:
-
When user u sends a new tweet, it is added to the timeline of every user who is following u .
当用户 u 发布新推文时,它将被添加到所有正在关注用户 u 的用户的时间线中。
-
When a user deletes a tweet, it is removed from all users’ timelines.
当用户删除一条推文时,它会从所有用户的时间线上删除。
-
When user u 1 starts following user u 2 , recent tweets by u 2 are added to u 1 ’s timeline.
当用户 u1 开始关注用户 u2 时,u2 的最近推文将被添加到 u1 的时间线中。
-
When user u 1 unfollows user u 2 , tweets by u 2 are removed from u 1 ’s timeline.
当用户u1取消关注用户u2时,u2的推文会从u1的时间线中移除。
To implement this cache maintenance in a stream processor, you need streams of events for tweets (sending and deleting) and for follow relationships (following and unfollowing). The stream process needs to maintain a database containing the set of followers for each user so that it knows which timelines need to be updated when a new tweet arrives [ 86 ].
为了在流处理器中实现此缓存维护,您需要具有推文事件(发送和删除)和关注关系(关注和取消关注)的事件流。流处理过程需要维护一个包含每个用户的关注者集合的数据库,以便知道在新推文到达时需要更新哪些时间线。[86]。
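A sketch of this cache maintenance (hypothetical names; tweet deletion, ordering by timestamp, and limiting "recent" tweets are omitted):

这个缓存维护的草图如下(名称是假设的;省略了推文删除、按时间戳排序以及对“最近”推文的限制):

```python
from collections import defaultdict

class TimelineCache:
    """Maintains per-user home timelines as a materialized join of the
    tweets and follows streams, updated incrementally on each event."""

    def __init__(self):
        self.followers = defaultdict(set)   # user -> set of their followers
        self.tweets = defaultdict(list)     # user -> tweets they sent
        self.timelines = defaultdict(list)  # user -> cached home timeline

    def on_tweet(self, sender, text):
        self.tweets[sender].append(text)
        for follower in self.followers[sender]:
            self.timelines[follower].append(text)

    def on_follow(self, follower, followee):
        self.followers[followee].add(follower)
        # Backfill the follower's timeline with the followee's tweets.
        self.timelines[follower].extend(self.tweets[followee])

    def on_unfollow(self, follower, followee):
        self.followers[followee].discard(follower)
        self.timelines[follower] = [t for t in self.timelines[follower]
                                    if t not in self.tweets[followee]]
```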
Another way of looking at this stream process is that it maintains a materialized view for a query that joins two tables (tweets and follows), something like the following:
另一种看待这个流程的方式是它为一个连接两个表(tweets和follows)的查询维护了一个物化视图,类似于以下内容:
SELECT follows.follower_id AS timeline_id,
  array_agg(tweets.* ORDER BY tweets.timestamp DESC)
FROM tweets
JOIN follows ON follows.followee_id = tweets.sender_id
GROUP BY follows.follower_id
The join of the streams corresponds directly to the join of the tables in that query. The timelines are effectively a cache of the result of this query, updated every time the underlying tables change. iii
流的联接直接对应于查询中表的联接。时间轴实际上是这个查询结果的缓存,在基础表变化时每次更新。
Time-dependence of joins
The three types of joins described here (stream-stream, stream-table, and table-table) have a lot in common: they all require the stream processor to maintain some state (search and click events, user profiles, or follower list) based on one join input, and query that state on messages from the other join input.
这里描述的三种连接类型(流-流、流-表、表-表)有很多共同点:它们都需要流处理器基于一个连接输入来维护某些状态(搜索和点击事件、用户档案或关注者列表),并针对来自另一个连接输入的消息查询该状态。
The order of the events that maintain the state is important (it matters whether you first follow and then unfollow, or the other way round). In a partitioned log, the ordering of events within a single partition is preserved, but there is typically no ordering guarantee across different streams or partitions.
维护状态的事件顺序很重要(首先关注再取消或者相反,顺序有影响)。在分区日志中,单个分区内的事件顺序被保留,但通常没有跨不同流或分区的顺序保证。
This raises a question: if events on different streams happen around a similar time, in which order are they processed? In the stream-table join example, if a user updates their profile, which activity events are joined with the old profile (processed before the profile update), and which are joined with the new profile (processed after the profile update)? Put another way: if state changes over time, and you join with some state, what point in time do you use for the join [ 45 ]?
这引发了一个问题:如果不同数据流上的事件在相近的时间内发生,它们的处理顺序是什么?在流表联接的例子中,如果用户更新他们的个人资料,哪些活动事件将与旧资料一起联接(在资料更新之前处理),哪些事件将与新资料一起联接(在资料更新之后处理)?换句话说:如果状态随时间变化,并且您与某些状态联接,那么对于联接使用哪个时间点呢?[45]
Such time dependence can occur in many places. For example, if you sell things, you need to apply the right tax rate to invoices, which depends on the country or state, the type of product, and the date of sale (since tax rates change from time to time). When joining sales to a table of tax rates, you probably want to join with the tax rate at the time of the sale, which may be different from the current tax rate if you are reprocessing historical data.
这种时间上的依赖性可能发生在很多地方。例如,如果你销售产品,你需要将适当的税率应用于发票上,这取决于国家或州、产品类型以及销售日期(因为税率会随着时间变化而变化)。当将销售与税率表结合时,你可能想要使用销售时的税率进行结合,这可能与当前的税率不同,特别是当你重新处理历史数据时。
If the ordering of events across streams is undetermined, the join becomes nondeterministic [ 87 ], which means you cannot rerun the same job on the same input and necessarily get the same result: the events on the input streams may be interleaved in a different way when you run the job again.
如果跨流事件的排序是不确定的,则连接将变得非确定性,这意味着您不能在相同的输入上重新运行相同的作业并且必须得到相同的结果:当您再次运行作业时,输入流中的事件可能以不同的方式交织。
In data warehouses, this issue is known as a slowly changing dimension (SCD), and it is often addressed by using a unique identifier for a particular version of the joined record: for example, every time the tax rate changes, it is given a new identifier, and the invoice includes the identifier for the tax rate at the time of sale [ 88 , 89 ]. This change makes the join deterministic, but has the consequence that log compaction is not possible, since all versions of the records in the table need to be retained.
在数据仓库中,这个问题被称为逐渐变化的维度(SCD),通常通过为特定版本的连接记录使用唯一标识符来解决。例如,每当税率发生变化,就会赋予一个新的标识符,并且发票包括销售时税率的标识符[88, 89]。这种改变使得连接是确定性的,但是有一个后果,即日志压缩是不可能的,因为表中所有版本的记录都需要被保留。
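The versioned-identifier approach can be sketched like this (hypothetical names; a real slowly changing dimension would typically also record validity dates for each version):

带版本标识符的方法可以这样勾勒(名称是假设的;真实的缓慢变化维度通常还会为每个版本记录生效日期):

```python
class TaxRates:
    """Slowly changing dimension: every tax-rate change gets a new
    immutable version ID, and each invoice records the ID in effect
    at the time of sale, making later re-joins deterministic."""

    def __init__(self):
        self.versions = {}   # version_id -> tax rate (never overwritten)
        self.current = None  # version in effect now
        self._next_id = 0

    def change_rate(self, rate):
        self._next_id += 1
        self.versions[self._next_id] = rate
        self.current = self._next_id
        return self._next_id

    def invoice(self, amount):
        # The invoice carries the version ID, not just the rate, so
        # reprocessing historical data joins with the rate at sale time.
        vid = self.current
        return {"amount": amount, "tax_rate_id": vid,
                "tax": amount * self.versions[vid]}
```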
Fault Tolerance
In the final section of this chapter, let’s consider how stream processors can tolerate faults. We saw in Chapter 10 that batch processing frameworks can tolerate faults fairly easily: if a task in a MapReduce job fails, it can simply be started again on another machine, and the output of the failed task is discarded. This transparent retry is possible because input files are immutable, each task writes its output to a separate file on HDFS, and output is only made visible when a task completes successfully.
在本章的最后一节中,让我们考虑流处理器如何容忍故障。我们在第10章中看到,批处理框架可以相对容易地容错:如果MapReduce作业中的任务失败,它可以简单地在另一台机器上重新启动,失败任务的输出将被丢弃。这种透明的重试是可能的,因为输入文件是不可变的,每个任务将其输出写入HDFS上的单独文件中,只有在任务成功完成时才会使输出可见。
In particular, the batch approach to fault tolerance ensures that the output of the batch job is the same as if nothing had gone wrong, even if in fact some tasks did fail. It appears as though every input record was processed exactly once—no records are skipped, and none are processed twice. Although restarting tasks means that records may in fact be processed multiple times, the visible effect in the output is as if they had only been processed once. This principle is known as exactly-once semantics , although effectively-once would be a more descriptive term [ 90 ].
特别地,批处理的容错方法确保批处理作业的输出与没有发生任何故障时相同,即使实际上某些任务确实失败了。看起来就好像每条输入记录都被恰好处理了一次:没有记录被跳过,也没有记录被处理两次。尽管重启任务意味着记录实际上可能被处理多次,但输出中的可见效果就像它们只被处理过一次一样。这个原则被称为恰好一次语义(exactly-once semantics),尽管有效一次(effectively-once)会是一个更贴切的术语[90]。
The same issue of fault tolerance arises in stream processing, but it is less straightforward to handle: waiting until a task is finished before making its output visible is not an option, because a stream is infinite and so you can never finish processing it.
在流处理中也会出现相同的容错问题,但处理起来要复杂一些:等待任务完成后再显示其输出不是一个选项,因为流是无限的,所以你永远无法完成对其的处理。
Microbatching and checkpointing
One solution is to break the stream into small blocks, and treat each block like a miniature batch process. This approach is called microbatching , and it is used in Spark Streaming [ 91 ]. The batch size is typically around one second, which is the result of a performance compromise: smaller batches incur greater scheduling and coordination overhead, while larger batches mean a longer delay before results of the stream processor become visible.
一种解决方案是将流分成小块,将每个块作为小批量处理。这种方法称为微批处理,它在Spark Streaming [91]中使用。批量大小通常约为一秒钟,这是性能折衷的结果:较小批次会产生更大的调度和协调开销,而较大批次意味着流处理器结果变得可见之前需要更长的延迟。
Microbatching also implicitly provides a tumbling window equal to the batch size (windowed by processing time, not event timestamps); any jobs that require larger windows need to explicitly carry over state from one microbatch to the next.
微批处理还隐含提供一个等于批量大小的滚动窗口(按处理时间而非事件时间戳进行窗口化);任何需要较大窗口的作业都需要显式地从一个微批处理传递状态到下一个。
A variant approach, used in Apache Flink, is to periodically generate rolling checkpoints of state and write them to durable storage [ 92 , 93 ]. If a stream operator crashes, it can restart from its most recent checkpoint and discard any output generated between the last checkpoint and the crash. The checkpoints are triggered by barriers in the message stream, similar to the boundaries between microbatches, but without forcing a particular window size.
另一种方法是在Apache Flink中使用周期性滚动检查点来存储状态,并将它们写入持久性存储 [92,93]。如果流操作员崩溃,则可以从其最近的检查点重新启动,并丢弃在最后一个检查点和崩溃之间生成的任何输出。检查点是由消息流中的屏障触发的,类似于微批次之间的边界,但不强制任何窗口大小。
Within the confines of the stream processing framework, the microbatching and checkpointing approaches provide the same exactly-once semantics as batch processing. However, as soon as output leaves the stream processor (for example, by writing to a database, sending messages to an external message broker, or sending emails), the framework is no longer able to discard the output of a failed batch. In this case, restarting a failed task causes the external side effect to happen twice, and microbatching or checkpointing alone is not sufficient to prevent this problem.
在流处理框架的范围内,微批处理和检查点方法提供了与批处理相同的恰好一次语义。然而,一旦输出离开了流处理器(例如,写入数据库、向外部消息代理发送消息或发送电子邮件),框架就无法再丢弃失败批次的输出了。在这种情况下,重启失败的任务会导致外部副作用发生两次,仅靠微批处理或检查点不足以防止这个问题。
Atomic commit revisited
In order to give the appearance of exactly-once processing in the presence of faults, we need to ensure that all outputs and side effects of processing an event take effect if and only if the processing is successful. Those effects include any messages sent to downstream operators or external messaging systems (including email or push notifications), any database writes, any changes to operator state, and any acknowledgment of input messages (including moving the consumer offset forward in a log-based message broker).
为了在故障存在的情况下呈现出精确一次处理的外观,我们需要确保所有事件的处理所产生的输出和副作用只有在处理成功时才会生效。这些影响包括发送到下游操作器或外部消息系统(包括电子邮件或推送通知)的任何消息、任何数据库写入、任何操作器状态更改以及对输入消息的确认(包括将消费者偏移量向前移动在基于日志的消息代理中)。
Those things either all need to happen atomically, or none of them must happen, but they should not go out of sync with each other. If this approach sounds familiar, it is because we discussed it in “Exactly-once message processing” in the context of distributed transactions and two-phase commit.
这些事情要么必须同时发生,要么就不能发生,但它们不应该与彼此失去同步。如果这种方法听起来很熟悉,那是因为我们曾在分布式事务和两阶段提交的“恰好一次消息处理”中讨论过它。
In Chapter 9 we discussed the problems in the traditional implementations of distributed transactions, such as XA. However, in more restricted environments it is possible to implement such an atomic commit facility efficiently. This approach is used in Google Cloud Dataflow [ 81 , 92 ] and VoltDB [ 94 ], and there are plans to add similar features to Apache Kafka [ 95 , 96 ]. Unlike XA, these implementations do not attempt to provide transactions across heterogeneous technologies, but instead keep them internal by managing both state changes and messaging within the stream processing framework. The overhead of the transaction protocol can be amortized by processing several input messages within a single transaction.
在第9章中,我们讨论了分布式事务的传统实现(如XA)中存在的问题。然而,在限制更多的环境中,是有可能高效地实现这种原子提交机制的。Google Cloud Dataflow[81,92]和VoltDB[94]使用了这种方法,Apache Kafka也计划添加类似的功能[95,96]。与XA不同,这些实现并不尝试提供跨异构技术的事务,而是通过在流处理框架内同时管理状态变更和消息传递,把事务保持在框架内部。事务协议的开销可以通过在单个事务中处理多条输入消息来摊销。
Idempotence
Our goal is to discard the partial output of any failed tasks so that they can be safely retried without taking effect twice. Distributed transactions are one way of achieving that goal, but another way is to rely on idempotence [ 97 ].
我们的目标是丢弃任何失败任务的部分输出,以便能够安全地重试,而不会让其生效两次。分布式事务是实现这一目标的一种方式,另一种方式是依赖幂等性(idempotence)[97]。
An idempotent operation is one that you can perform multiple times, and it has the same effect as if you performed it only once. For example, setting a key in a key-value store to some fixed value is idempotent (writing the value again simply overwrites the value with an identical value), whereas incrementing a counter is not idempotent (performing the increment again means the value is incremented twice).
幂等操作是指你可以多次执行的操作,而它的效果和只执行一次一样。例如,在键值存储中将键设置为某个固定值是幂等的(再次写入相同的值只会将该值覆盖),而增加计数器不是幂等的(再次进行增加意味着值增加了两次)。
Even if an operation is not naturally idempotent, it can often be made idempotent with a bit of extra metadata. For example, when consuming messages from Kafka, every message has a persistent, monotonically increasing offset. When writing a value to an external database, you can include the offset of the message that triggered the last write with the value. Thus, you can tell whether an update has already been applied, and avoid performing the same update again.
即使一个操作本质上不是幂等的,加入一些额外的元数据也可以让它变为幂等的。例如,在从Kafka接收消息时,每个消息都有一个持久化的、单调递增的偏移量。当将一个值写入外部数据库时,可以将触发最后一次写入的消息的偏移量与该值一起包含在内。这样,就可以判断一个更新是否已经被应用,避免再次执行相同的更新。
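A sketch of such an offset-based idempotent write (hypothetical names; a real implementation would make the read-compare-write atomic, e.g. with a conditional update in the database):

这种基于偏移量的幂等写入的草图如下(名称是假设的;真实实现需要让“读取-比较-写入”原子化,例如使用数据库中的条件更新):

```python
class IdempotentStore:
    """Key-value store where each write carries the offset of the message
    that produced it. A replayed (duplicate) write after a task restart
    is detected by its offset and skipped."""

    def __init__(self):
        self.data = {}  # key -> (value, offset of last applied message)

    def write(self, key, value, offset):
        _, last_offset = self.data.get(key, (None, -1))
        if offset <= last_offset:
            return False  # update already applied; safe to ignore on retry
        self.data[key] = (value, offset)
        return True
```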
The state handling in Storm’s Trident is based on a similar idea [ 78 ]. Relying on idempotence implies several assumptions: restarting a failed task must replay the same messages in the same order (a log-based message broker does this), the processing must be deterministic, and no other node may concurrently update the same value [ 98 , 99 ].
Storm的Trident中的状态处理就基于类似的思路[78]。依赖幂等性意味着几个假设:重启失败的任务必须以相同的顺序重放相同的消息(基于日志的消息代理可以做到这一点),处理必须是确定性的,并且不能有其他节点并发地更新同一个值[98,99]。
When failing over from one processing node to another, fencing may be required (see “The leader and the lock” ) to prevent interference from a node that is thought to be dead but is actually alive. Despite all those caveats, idempotent operations can be an effective way of achieving exactly-once semantics with only a small overhead.
当从一个处理节点故障转移至另一个处理节点时,可能需要进行隔离(请参见“领导者和锁定”),以防止来自被认为已死亡但实际上仍然活动的节点的干扰。尽管有所有这些警告,但幂等操作可以有效地实现精确一次性语义而只带来小的开销。
Rebuilding state after a failure
Any stream process that requires state—for example, any windowed aggregations (such as counters, averages, and histograms) and any tables and indexes used for joins—must ensure that this state can be recovered after a failure.
任何需要状态的流处理过程——例如任何窗口聚合(例如计数器、平均值和直方图)以及用于连接的任何表和索引——必须确保在故障后可以恢复此状态。
One option is to keep the state in a remote datastore and replicate it, although having to query a remote database for each individual message can be slow, as discussed in “Stream-table join (stream enrichment)” . An alternative is to keep state local to the stream processor, and replicate it periodically. Then, when the stream processor is recovering from a failure, the new task can read the replicated state and resume processing without data loss.
一种选择是把状态保存在远程数据存储中并进行复制,不过正如“流-表连接(流扩充)”中所讨论的,为每条消息都查询一次远程数据库可能会很慢。另一种选择是把状态保存在流处理器本地,并定期对其进行复制。这样,当流处理器从故障中恢复时,新任务可以读取被复制的状态,并在不丢失数据的情况下恢复处理。
For example, Flink periodically captures snapshots of operator state and writes them to durable storage such as HDFS [ 92 , 93 ]; Samza and Kafka Streams replicate state changes by sending them to a dedicated Kafka topic with log compaction, similar to change data capture [ 84 , 100 ]. VoltDB replicates state by redundantly processing each input message on several nodes (see “Actual Serial Execution” ).
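The changelog approach attributed to Samza and Kafka Streams can be sketched as follows, with a plain list standing in for the dedicated, log-compacted Kafka topic:

```python
# Sketch of changelog-based state replication: every local state change
# is also appended to a changelog; a recovering task rebuilds its state
# by replaying the changelog from the beginning.
changelog = []  # stand-in for a log-compacted Kafka topic

def update(state, key, value):
    state[key] = value
    changelog.append((key, value))  # replicate the change

def recover(log):
    # With log compaction only the latest entry per key would survive;
    # replaying in order produces the same final state either way.
    state = {}
    for key, value in log:
        state[key] = value
    return state

state = {}
update(state, "page:home", 3)
update(state, "page:about", 1)
update(state, "page:home", 4)   # later value wins on replay as well
rebuilt = recover(changelog)
```

This is the same mechanism as change data capture, just applied to the processor's own state rather than to an upstream database.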
In some cases, it may not even be necessary to replicate the state, because it can be rebuilt from the input streams. For example, if the state consists of aggregations over a fairly short window, it may be fast enough to simply replay the input events corresponding to that window. If the state is a local replica of a database, maintained by change data capture, the database can also be rebuilt from the log-compacted change stream (see “Log compaction” ).
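For the short-window case, no replication is needed at all: the state can be rebuilt by replaying only the input events that fall inside the current window. The timestamps and window bounds below are made-up example values.

```python
from collections import Counter

# Rebuild a windowed aggregation by replaying the input events for that
# window, instead of replicating the state itself.
events = [(1, "click"), (3, "view"), (7, "click"), (9, "view"), (12, "click")]

def rebuild_window_state(events, window_start, window_end):
    counts = Counter()
    for t, event_type in events:
        if window_start <= t < window_end:
            counts[event_type] += 1
    return counts

# After a crash, replay only the events for the current window [5, 15):
state = rebuild_window_state(events, window_start=5, window_end=15)
```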
However, all of these trade-offs depend on the performance characteristics of the underlying infrastructure: in some systems, network delay may be lower than disk access latency, and network bandwidth may be comparable to disk bandwidth. There is no universally ideal trade-off for all situations, and the merits of local versus remote state may also shift as storage and networking technologies evolve.
Summary
In this chapter we have discussed event streams, what purposes they serve, and how to process them. In some ways, stream processing is very much like the batch processing we discussed in Chapter 10 , but done continuously on unbounded (never-ending) streams rather than on a fixed-size input. From this perspective, message brokers and event logs serve as the streaming equivalent of a filesystem.
We spent some time comparing two types of message brokers:
- AMQP/JMS-style message broker
The broker assigns individual messages to consumers, and consumers acknowledge individual messages when they have been successfully processed. Messages are deleted from the broker once they have been acknowledged. This approach is appropriate as an asynchronous form of RPC (see also “Message-Passing Dataflow” ), for example in a task queue, where the exact order of message processing is not important and where there is no need to go back and read old messages again after they have been processed.
- Log-based message broker
The broker assigns all messages in a partition to the same consumer node, and always delivers messages in the same order. Parallelism is achieved through partitioning, and consumers track their progress by checkpointing the offset of the last message they have processed. The broker retains messages on disk, so it is possible to jump back and reread old messages if necessary.
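The two models can be contrasted in a toy sketch (illustrative only; real brokers are far more involved): the AMQP/JMS-style broker deletes each message once it is acknowledged, whereas the log-based broker retains messages and each consumer merely advances (and can rewind) an offset.

```python
from collections import deque

# AMQP/JMS style: the broker tracks unacknowledged deliveries, deletes a
# message on acknowledgment, and redelivers unacked messages.
class AckingBroker:
    def __init__(self):
        self.queue = deque()
        self.unacked = {}

    def publish(self, msg_id, message):
        self.queue.append((msg_id, message))

    def deliver(self):
        msg_id, message = self.queue.popleft()
        self.unacked[msg_id] = message
        return msg_id, message

    def ack(self, msg_id):
        del self.unacked[msg_id]  # gone for good

    def redeliver_unacked(self):  # e.g., after a consumer crash
        for msg_id, message in sorted(self.unacked.items(), reverse=True):
            self.queue.appendleft((msg_id, message))
        self.unacked.clear()

# Log-based style: messages are retained in order; a consumer just reads
# from a checkpointed offset, and can rewind to reread old messages.
class LogBroker:
    def __init__(self):
        self.log = []

    def publish(self, message):
        self.log.append(message)

    def read(self, offset):
        return self.log[offset:]

amqp = AckingBroker()
amqp.publish(1, "a")
msg_id, _ = amqp.deliver()
amqp.ack(msg_id)                    # "a" is now deleted from the broker

log = LogBroker()
for m in ["a", "b", "c"]:
    log.publish(m)
replayed = log.read(offset=0)       # old messages are still readable
```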
The log-based approach has similarities to the replication logs found in databases (see Chapter 5 ) and log-structured storage engines (see Chapter 3 ). We saw that this approach is especially appropriate for stream processing systems that consume input streams and generate derived state or derived output streams.
In terms of where streams come from, we discussed several possibilities: user activity events, sensors providing periodic readings, and data feeds (e.g., market data in finance) are naturally represented as streams. We saw that it can also be useful to think of the writes to a database as a stream: we can capture the changelog—i.e., the history of all changes made to a database—either implicitly through change data capture or explicitly through event sourcing. Log compaction allows the stream to retain a full copy of the contents of a database.
Representing databases as streams opens up powerful opportunities for integrating systems. You can keep derived data systems such as search indexes, caches, and analytics systems continually up to date by consuming the log of changes and applying them to the derived system. You can even build fresh views onto existing data by starting from scratch and consuming the log of changes from the beginning all the way to the present.
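As a toy illustration of that idea, here is a derived system (a tiny inverted index for search) kept up to date purely by consuming a changelog of document writes; replaying the log from the beginning builds a fresh view of the existing data.

```python
# Toy derived system: an inverted index maintained from a changelog.
# An update to a document first undoes its previous version, so the index
# always reflects the latest state of each document.
def apply_change(index, docs, doc_id, text):
    for word in docs.get(doc_id, "").split():   # undo the old version
        index[word].discard(doc_id)
    docs[doc_id] = text
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def build_view(changelog):
    index, docs = {}, {}
    for doc_id, text in changelog:
        apply_change(index, docs, doc_id, text)
    return index

changelog = [(1, "stream processing"),
             (2, "batch processing"),
             (1, "event stream")]       # an update to document 1
index = build_view(changelog)
```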
The facilities for maintaining state as streams and replaying messages are also the basis for the techniques that enable stream joins and fault tolerance in various stream processing frameworks. We discussed several purposes of stream processing, including searching for event patterns (complex event processing), computing windowed aggregations (stream analytics), and keeping derived data systems up to date (materialized views).
We then discussed the difficulties of reasoning about time in a stream processor, including the distinction between processing time and event timestamps, and the problem of dealing with straggler events that arrive after you thought your window was complete.
We distinguished three types of joins that may appear in stream processes:
- Stream-stream joins
Both input streams consist of activity events, and the join operator searches for related events that occur within some window of time. For example, it may match two actions taken by the same user within 30 minutes of each other. The two join inputs may in fact be the same stream (a self-join ) if you want to find related events within that one stream.
- Stream-table joins
One input stream consists of activity events, while the other is a database changelog. The changelog keeps a local copy of the database up to date. For each activity event, the join operator queries the database and outputs an enriched activity event.
- Table-table joins
Both input streams are database changelogs. In this case, every change on one side is joined with the latest state of the other side. The result is a stream of changes to the materialized view of the join between the two tables.
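The three join types can be sketched minimally as follows (all names and data are illustrative, not from any real framework):

```python
from collections import defaultdict

# 1. Stream-stream join: match events for the same key that occur within
#    some window of each other (30 minutes, as in the example above).
def stream_stream_join(actions_a, actions_b, window=30 * 60):
    return [(k1, t1, t2)
            for t1, k1 in actions_a
            for t2, k2 in actions_b
            if k1 == k2 and abs(t1 - t2) <= window]

# 2. Stream-table join: a changelog keeps a local table up to date, and
#    each activity event is enriched by a local lookup.
users = {}
def on_user_change(user_id, profile):
    users[user_id] = profile
def enrich(user_id, action):
    return (user_id, action, users.get(user_id))

# 3. Table-table join: every change on one side is joined with the latest
#    state of the other side, emitting a change to the materialized view.
#    (Only the new-tweet direction is shown; a full implementation would
#    also handle a new follower gaining the user's existing tweets.)
followers = defaultdict(set)
timeline_changes = []
def on_follow(follower, user):
    followers[user].add(follower)
def on_tweet(user, tweet_id):
    for f in followers[user]:  # join with the current followers
        timeline_changes.append(("deliver", f, tweet_id))

matches = stream_stream_join([(100, "u1"), (200, "u2")],
                             [(900, "u1"), (5000, "u2")])
on_user_change("u1", {"country": "DE"})
enriched = enrich("u1", "click")
on_follow("bob", "alice")
on_tweet("alice", 42)
```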
Finally, we discussed techniques for achieving fault tolerance and exactly-once semantics in a stream processor. As with batch processing, we need to discard the partial output of any failed tasks. However, since a stream process is long-running and produces output continuously, we can’t simply discard all output. Instead, a finer-grained recovery mechanism can be used, based on microbatching, checkpointing, transactions, or idempotent writes.
Footnotes
i It’s possible to create a load balancing scheme in which two consumers share the work of processing a partition by having both read the full set of messages, but one of them only considers messages with even-numbered offsets while the other deals with the odd-numbered offsets. Alternatively, you could spread message processing over a thread pool, but that approach complicates consumer offset management. In general, single-threaded processing of a partition is preferable, and parallelism can be increased by using more partitions.
ii Thank you to Kostas Kloudas from the Flink community for coming up with this analogy.
iii If you regard a stream as the derivative of a table, as in Figure 11-6, and regard a join as a product of two tables u·v, something interesting happens: the stream of changes to the materialized join follows the product rule (u·v)′ = u′v + uv′. In words: any change of tweets is joined with the current followers, and any change of followers is joined with the current tweets [ 49 , 50 ].
References
[ 1 ] Tyler Akidau, Robert Bradshaw, Craig Chambers, et al.: “ The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing ,” Proceedings of the VLDB Endowment , volume 8, number 12, pages 1792–1803, August 2015. doi:10.14778/2824032.2824076
[ 2 ] Harold Abelson, Gerald Jay Sussman, and Julie Sussman: Structure and Interpretation of Computer Programs , 2nd edition. MIT Press, 1996. ISBN: 978-0-262-51087-5, available online at mitpress.mit.edu
[ 3 ] Patrick Th. Eugster, Pascal A. Felber, Rachid Guerraoui, and Anne-Marie Kermarrec: “ The Many Faces of Publish/Subscribe ,” ACM Computing Surveys , volume 35, number 2, pages 114–131, June 2003. doi:10.1145/857076.857078
[ 4 ] Joseph M. Hellerstein and Michael Stonebraker: Readings in Database Systems , 4th edition. MIT Press, 2005. ISBN: 978-0-262-69314-1, available online at redbook.cs.berkeley.edu
[ 5 ] Don Carney, Uğur Çetintemel, Mitch Cherniack, et al.: “ Monitoring Streams – A New Class of Data Management Applications ,” at 28th International Conference on Very Large Data Bases (VLDB), August 2002.
[ 6 ] Matthew Sackman: “ Pushing Back ,” lshift.net , May 5, 2016.
[ 7 ] Vicent Martí: “ Brubeck, a statsd-Compatible Metrics Aggregator ,” githubengineering.com , June 15, 2015.
[ 8 ] Seth Lowenberger: “ MoldUDP64 Protocol Specification V 1.00 ,” nasdaqtrader.com , July 2009.
[ 9 ] Pieter Hintjens: ZeroMQ – The Guide . O’Reilly Media, 2013. ISBN: 978-1-449-33404-8
[ 10 ] Ian Malpass: “ Measure Anything, Measure Everything ,” codeascraft.com , February 15, 2011.
[ 11 ] Dieter Plaetinck: “ 25 Graphite, Grafana and statsd Gotchas ,” blog.raintank.io , March 3, 2016.
[ 12 ] Jeff Lindsay: “ Web Hooks to Revolutionize the Web ,” progrium.com , May 3, 2007.
[ 13 ] Jim N. Gray: “ Queues Are Databases ,” Microsoft Research Technical Report MSR-TR-95-56, December 1995.
[ 14 ] Mark Hapner, Rich Burridge, Rahul Sharma, et al.: “ JSR-343 Java Message Service (JMS) 2.0 Specification ,” jms-spec.java.net , March 2013.
[ 15 ] Sanjay Aiyagari, Matthew Arrott, Mark Atwell, et al.: “ AMQP: Advanced Message Queuing Protocol Specification ,” Version 0-9-1, November 2008.
[ 16 ] “ Google Cloud Pub/Sub: A Google-Scale Messaging Service ,” cloud.google.com , 2016.
[ 17 ] “ Apache Kafka 0.9 Documentation ,” kafka.apache.org , November 2015.
[ 18 ] Jay Kreps, Neha Narkhede, and Jun Rao: “ Kafka: A Distributed Messaging System for Log Processing ,” at 6th International Workshop on Networking Meets Databases (NetDB), June 2011.
[ 19 ] “ Amazon Kinesis Streams Developer Guide ,” docs.aws.amazon.com , April 2016.
[ 20 ] Leigh Stewart and Sijie Guo: “ Building DistributedLog: Twitter’s High-Performance Replicated Log Service ,” blog.twitter.com , September 16, 2015.
[ 21 ] “ DistributedLog Documentation ,” Twitter, Inc., distributedlog.io , May 2016.
[ 22 ] Jay Kreps: “ Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines) ,” engineering.linkedin.com , April 27, 2014.
[ 23 ] Kartik Paramasivam: “ How We’re Improving and Advancing Kafka at LinkedIn ,” engineering.linkedin.com , September 2, 2015.
[ 24 ] Jay Kreps: “ The Log: What Every Software Engineer Should Know About Real-Time Data’s Unifying Abstraction ,” engineering.linkedin.com , December 16, 2013.
[ 25 ] Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “ All Aboard the Databus! ,” at 3rd ACM Symposium on Cloud Computing (SoCC), October 2012.
[ 26 ] Yogeshwer Sharma, Philippe Ajoux, Petchean Ang, et al.: “ Wormhole: Reliable Pub-Sub to Support Geo-Replicated Internet Services ,” at 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI), May 2015.
[ 27 ] P. P. S. Narayan: “ Sherpa Update ,” developer.yahoo.com , June 8, .
[ 28 ] Martin Kleppmann: “ Bottled Water: Real-Time Integration of PostgreSQL and Kafka ,” martin.kleppmann.com , April 23, 2015.
[ 29 ] Ben Osheroff: “ Introducing Maxwell, a mysql-to-kafka Binlog Processor ,” developer.zendesk.com , August 20, 2015.
[ 30 ] Randall Hauch: “ Debezium 0.2.1 Released ,” debezium.io , June 10, 2016.
[ 31 ] Prem Santosh Udaya Shankar: “ Streaming MySQL Tables in Real-Time to Kafka ,” engineeringblog.yelp.com , August 1, 2016.
[ 32 ] “ Mongoriver ,” Stripe, Inc., github.com , September 2014.
[ 33 ] Dan Harvey: “ Change Data Capture with Mongo + Kafka ,” at Hadoop Users Group UK , August 2015.
[ 34 ] “ Oracle GoldenGate 12c: Real-Time Access to Real-Time Information ,” Oracle White Paper, March 2015.
[ 35 ] “ Oracle GoldenGate Fundamentals: How Oracle GoldenGate Works ,” Oracle Corporation, youtube.com , November 2012.
[ 36 ] Slava Akhmechet: “ Advancing the Realtime Web ,” rethinkdb.com , January 27, 2015.
[ 37 ] “ Firebase Realtime Database Documentation ,” Google, Inc., firebase.google.com , May 2016.
[ 38 ] “ Apache CouchDB 1.6 Documentation ,” docs.couchdb.org , 2014.
[ 39 ] Matt DeBergalis: “ Meteor 0.7.0: Scalable Database Queries Using MongoDB Oplog Instead of Poll-and-Diff ,” info.meteor.com , December 17, 2013.
[ 40 ] “ Chapter 15. Importing and Exporting Live Data ,” VoltDB 6.4 User Manual, docs.voltdb.com , June 2016.
[ 41 ] Neha Narkhede: “ Announcing Kafka Connect: Building Large-Scale Low-Latency Data Pipelines ,” confluent.io , February 18, 2016.
[ 42 ] Greg Young: “ CQRS and Event Sourcing ,” at Code on the Beach , August 2014.
[ 43 ] Martin Fowler: “ Event Sourcing ,” martinfowler.com , December 12, 2005.
[ 44 ] Vaughn Vernon: Implementing Domain-Driven Design . Addison-Wesley Professional, 2013. ISBN: 978-0-321-83457-7
[ 45 ] H. V. Jagadish, Inderpal Singh Mumick, and Abraham Silberschatz: “ View Maintenance Issues for the Chronicle Data Model ,” at 14th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), May 1995. doi:10.1145/212433.220201
[ 46 ] “ Event Store 3.5.0 Documentation ,” Event Store LLP, docs.geteventstore.com , February 2016.
[ 47 ] Martin Kleppmann: Making Sense of Stream Processing . Report, O’Reilly Media, May 2016.
[ 48 ] Sander Mak: “ Event-Sourced Architectures with Akka ,” at JavaOne , September 2014.
[ 49 ] Julian Hyde: personal communication , June 2016.
[ 50 ] Ashish Gupta and Inderpal Singh Mumick: Materialized Views: Techniques, Implementations, and Applications . MIT Press, 1999. ISBN: 978-0-262-57122-7
[ 51 ] Timothy Griffin and Leonid Libkin: “ Incremental Maintenance of Views with Duplicates ,” at ACM International Conference on Management of Data (SIGMOD), May 1995. doi:10.1145/223784.223849
[ 52 ] Pat Helland: “ Immutability Changes Everything ,” at 7th Biennial Conference on Innovative Data Systems Research (CIDR), January 2015.
[ 53 ] Martin Kleppmann: “ Accounting for Computer Scientists ,” martin.kleppmann.com , March 7, 2011.
[ 54 ] Pat Helland: “ Accountants Don’t Use Erasers ,” blogs.msdn.com , June 14, 2007.
[ 55 ] Fangjin Yang: “ Dogfooding with Druid, Samza, and Kafka: Metametrics at Metamarkets ,” metamarkets.com , June 3, 2015.
[ 56 ] Gavin Li, Jianqiu Lv, and Hang Qi: “ Pistachio: Co-Locate the Data and Compute for Fastest Cloud Compute ,” yahoohadoop.tumblr.com , April 13, 2015.
[ 57 ] Kartik Paramasivam: “ Stream Processing Hard Problems – Part 1: Killing Lambda ,” engineering.linkedin.com , June 27, 2016.
[ 58 ] Martin Fowler: “ CQRS ,” martinfowler.com , July 14, 2011.
[ 59 ] Greg Young: “ CQRS Documents ,” cqrs.files.wordpress.com , November 2010.
[ 60 ] Baron Schwartz: “ Immutability, MVCC, and Garbage Collection ,” xaprb.com , December 28, 2013.
[ 61 ] Daniel Eloff, Slava Akhmechet, Jay Kreps, et al.: “Re: Turning the Database Inside-out with Apache Samza ,” Hacker News discussion, news.ycombinator.com , March 4, 2015.
[ 62 ] “ Datomic Development Resources: Excision ,” Cognitect, Inc., docs.datomic.com .
[ 63 ] “ Fossil Documentation: Deleting Content from Fossil ,” fossil-scm.org , 2016.
[ 64 ] Jay Kreps: “ The irony of distributed systems is that data loss is really easy but deleting data is surprisingly hard, ” twitter.com , March 30, 2015.
[ 65 ] David C. Luckham: “ What’s the Difference Between ESP and CEP? ,” complexevents.com , August 1, 2006.
[ 66 ] Srinath Perera: “ How Is Stream Processing and Complex Event Processing (CEP) Different? ,” quora.com , December 3, 2015.
[ 67 ] Arvind Arasu, Shivnath Babu, and Jennifer Widom: “ The CQL Continuous Query Language: Semantic Foundations and Query Execution ,” The VLDB Journal , volume 15, number 2, pages 121–142, June 2006. doi:10.1007/s00778-004-0147-z
[ 68 ] Julian Hyde: “ Data in Flight: How Streaming SQL Technology Can Help Solve the Web 2.0 Data Crunch ,” ACM Queue , volume 7, number 11, December 2009. doi:10.1145/1661785.1667562
[ 69 ] “ Esper Reference, Version 5.4.0 ,” EsperTech, Inc., espertech.com , April 2016.
[ 70 ] Zubair Nabi, Eric Bouillet, Andrew Bainbridge, and Chris Thomas: “ Of Streams and Storms ,” IBM technical report, developer.ibm.com , April 2014.
[ 71 ] Milinda Pathirage, Julian Hyde, Yi Pan, and Beth Plale: “ SamzaSQL: Scalable Fast Data Management with Streaming SQL ,” at IEEE International Workshop on High-Performance Big Data Computing (HPBDC), May 2016. doi:10.1109/IPDPSW.2016.141
[ 72 ] Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier: “ HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm ,” at Conference on Analysis of Algorithms (AofA), June 2007.
[ 73 ] Jay Kreps: “ Questioning the Lambda Architecture ,” oreilly.com , July 2, 2014.
[ 74 ] Ian Hellström: “ An Overview of Apache Streaming Technologies ,” databaseline.wordpress.com , March 12, 2016.
[ 75 ] Jay Kreps: “ Why Local State Is a Fundamental Primitive in Stream Processing ,” oreilly.com , July 31, 2014.
[ 76 ] Shay Banon: “ Percolator ,” elastic.co , February 8, 2011.
[ 77 ] Alan Woodward and Martin Kleppmann: “ Real-Time Full-Text Search with Luwak and Samza ,” martin.kleppmann.com , April 13, 2015.
[ 78 ] “ Apache Storm 1.0.1 Documentation ,” storm.apache.org , May 2016.
[ 79 ] Tyler Akidau: “ The World Beyond Batch: Streaming 102 ,” oreilly.com , January 20, 2016.
[ 80 ] Stephan Ewen: “ Streaming Analytics with Apache Flink ,” at Kafka Summit , April 2016.
[ 81 ] Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, et al.: “ MillWheel: Fault-Tolerant Stream Processing at Internet Scale ,” at 39th International Conference on Very Large Data Bases (VLDB), August 2013.
[ 82 ] Alex Dean: “ Improving Snowplow’s Understanding of Time ,” snowplowanalytics.com , September 15, 2015.
[ 83 ] “ Windowing (Azure Stream Analytics) ,” Microsoft Azure Reference, msdn.microsoft.com , April 2016.
[ 84 ] “ State Management ,” Apache Samza 0.10 Documentation, samza.apache.org , December 2015.
[ 85 ] Rajagopal Ananthanarayanan, Venkatesh Basker, Sumit Das, et al.: “ Photon: Fault-Tolerant and Scalable Joining of Continuous Data Streams ,” at ACM International Conference on Management of Data (SIGMOD), June 2013. doi:10.1145/2463676.2465272
[ 86 ] Martin Kleppmann: “ Samza Newsfeed Demo ,” github.com , September 2014.
[ 87 ] Ben Kirwin: “ Doing the Impossible: Exactly-Once Messaging Patterns in Kafka ,” ben.kirw.in , November 28, 2014.
[ 88 ] Pat Helland: “ Data on the Outside Versus Data on the Inside ,” at 2nd Biennial Conference on Innovative Data Systems Research (CIDR), January 2005.
[ 89 ] Ralph Kimball and Margy Ross: The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling , 3rd edition. John Wiley & Sons, 2013. ISBN: 978-1-118-53080-1
[ 90 ] Viktor Klang: “ I’m coining the phrase ‘effectively-once’ for message processing with at-least-once + idempotent operations ,” twitter.com , October 20, 2016.
[ 91 ] Matei Zaharia, Tathagata Das, Haoyuan Li, et al.: “ Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters ,” at 4th USENIX Conference in Hot Topics in Cloud Computing (HotCloud), June 2012.
[ 92 ] Kostas Tzoumas, Stephan Ewen, and Robert Metzger: “ High-Throughput, Low-Latency, and Exactly-Once Stream Processing with Apache Flink ,” data-artisans.com , August 5, 2015.
[ 93 ] Paris Carbone, Gyula Fóra, Stephan Ewen, et al.: “ Lightweight Asynchronous Snapshots for Distributed Dataflows ,” arXiv:1506.08603 [cs.DC], June 29, 2015.
[ 94 ] Ryan Betts and John Hugg: Fast Data: Smart and at Scale . Report, O’Reilly Media, October 2015.
[ 95 ] Flavio Junqueira: “ Making Sense of Exactly-Once Semantics ,” at Strata+Hadoop World London , June 2016.
[ 96 ] Jason Gustafson, Flavio Junqueira, Apurva Mehta, Sriram Subramanian, and Guozhang Wang: “ KIP-98 – Exactly Once Delivery and Transactional Messaging ,” cwiki.apache.org , November 2016.
[ 97 ] Pat Helland: “ Idempotence Is Not a Medical Condition ,” Communications of the ACM , volume 55, number 5, page 56, May 2012. doi:10.1145/2160718.2160734
[ 98 ] Jay Kreps: “ Re: Trying to Achieve Deterministic Behavior on Recovery/Rewind ,” email to samza-dev mailing list, September 9, 2014.
[ 99 ] E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson: “ A Survey of Rollback-Recovery Protocols in Message-Passing Systems ,” ACM Computing Surveys , volume 34, number 3, pages 375–408, September 2002. doi:10.1145/568522.568525
[ 100 ] Adam Warski: “ Kafka Streams – How Does It Fit the Stream Processing Landscape? ,” softwaremill.com , June 1, 2016.
Chapter 12. The Future of Data Systems
If a thing be ordained to another as to its end, its last end cannot consist in the preservation of its being. Hence a captain does not intend as a last end, the preservation of the ship entrusted to him, since a ship is ordained to something else as its end, viz. to navigation.
(Often quoted as: If the highest aim of a captain were to preserve his ship, he would keep it in port forever.)
St. Thomas Aquinas, Summa Theologica (1265–1274)
So far, this book has been mostly about describing things as they are at present. In this final chapter, we will shift our perspective toward the future and discuss how things should be : I will propose some ideas and approaches that, I believe, may fundamentally improve the ways we design and build applications.
Opinions and speculation about the future are of course subjective, and so I will use the first person in this chapter when writing about my personal opinions. You are welcome to disagree with them and form your own opinions, but I hope that the ideas in this chapter will at least be a starting point for a productive discussion and bring some clarity to concepts that are often confused.
关于未来的看法和猜测当然是主观的,因此本章节中我将使用第一人称写下我个人的观点。您可以不同意我的观点,形成自己的看法,但我希望本章节中的一些想法可以作为有建设性的讨论的起点,并为经常令人困惑的概念带来一些清晰度。
The goal of this book was outlined in Chapter 1 : to explore how to create applications and systems that are reliable , scalable , and maintainable . These themes have run through all of the chapters: for example, we discussed many fault-tolerance algorithms that help improve reliability, partitioning to improve scalability, and mechanisms for evolution and abstraction that improve maintainability. In this chapter we will bring all of these ideas together, and build on them to envisage the future. Our goal is to discover how to design applications that are better than the ones of today—robust, correct, evolvable, and ultimately beneficial to humanity.
本书的目标在第1章中已经概述:探索如何创建可靠、可扩展、可维护的应用与系统。这些主题贯穿了所有章节:例如,我们讨论了许多有助于提高可靠性的容错算法、提高可扩展性的分区,以及提高可维护性的演化与抽象机制。在本章中,我们将把所有这些想法汇集起来,并在其基础上展望未来。我们的目标是探索如何设计出比今天更好的应用程序——健壮、正确、可演化,并最终有益于人类。
Data Integration
A recurring theme in this book has been that for any given problem, there are several solutions, all of which have different pros, cons, and trade-offs. For example, when discussing storage engines in Chapter 3 , we saw log-structured storage, B-trees, and column-oriented storage. When discussing replication in Chapter 5 , we saw single-leader, multi-leader, and leaderless approaches.
这本书的一个重复出现的主题是,对于任何给定的问题,都有几种解决方案,它们都有不同的优点、缺点和权衡。例如,在第3章讨论存储引擎时,我们看到了日志结构化存储、B树和列式存储。在第5章讨论复制时,我们看到了单个领导者、多个领导者和无领导者方法。
If you have a problem such as “I want to store some data and look it up again later,” there is no one right solution, but many different approaches that are each appropriate in different circumstances. A software implementation typically has to pick one particular approach. It’s hard enough to get one code path robust and performing well—trying to do everything in one piece of software almost guarantees that the implementation will be poor.
如果你有一个问题,比如“我想存储一些数据并在以后查找它们”,那么就没有一个正确的解决方案,而是有许多不同的方法,每种方法在不同的情况下都是合适的。软件实现通常必须选择一种特定的方法。让一个代码路径变得强大并且性能良好已经足够困难了 - 尝试在一个软件中做所有事情几乎保证实现会很差。
Thus, the most appropriate choice of software tool also depends on the circumstances. Every piece of software, even a so-called “general-purpose” database, is designed for a particular usage pattern.
因此,在选择软件工具时,最合适的选择也取决于情况。每个软件,即使是所谓的“通用”数据库,都是为特定的使用模式而设计的。
Faced with this profusion of alternatives, the first challenge is then to figure out the mapping between the software products and the circumstances in which they are a good fit. Vendors are understandably reluctant to tell you about the kinds of workloads for which their software is poorly suited, but hopefully the previous chapters have equipped you with some questions to ask in order to read between the lines and better understand the trade-offs.
面对如此繁多的选择,第一个挑战就是要弄清楚软件产品与其适用场景之间的对应关系。厂商自然不愿意告诉您哪些工作负载不适合自己的软件,但希望前面的章节已经为您提供了一些问题,帮助您读出字里行间的含义,更好地理解其中的权衡。
However, even if you perfectly understand the mapping between tools and circumstances for their use, there is another challenge: in complex applications, data is often used in several different ways. There is unlikely to be one piece of software that is suitable for all the different circumstances in which the data is used, so you inevitably end up having to cobble together several different pieces of software in order to provide your application’s functionality.
然而,即使你完全理解工具和使用环境之间的映射关系,还有另一个挑战:在复杂应用中,数据通常以几种不同的方式使用。不太可能有一款软件适用于数据使用的所有不同情况,因此你最终不得不将几个不同的软件组合在一起,以提供应用程序的功能。
Combining Specialized Tools by Deriving Data
For example, it is common to need to integrate an OLTP database with a full-text search index in order to handle queries for arbitrary keywords. Although some databases (such as PostgreSQL) include a full-text indexing feature, which can be sufficient for simple applications [ 1 ], more sophisticated search facilities require specialist information retrieval tools. Conversely, search indexes are generally not very suitable as a durable system of record, and so many applications need to combine two different tools in order to satisfy all of the requirements.
例如,通常需要将OLTP数据库与全文搜索索引集成,以处理任意关键字的查询。虽然一些数据库(如PostgreSQL)包括全文索引功能,可以满足简单应用程序的需求,但更复杂的搜索功能需要专业的信息检索工具。相反,搜索索引通常不适合作为持久的记录系统,因此许多应用程序需要结合两个不同的工具以满足所有要求。
We touched on the issue of integrating data systems in “Keeping Systems in Sync” . As the number of different representations of the data increases, the integration problem becomes harder. Besides the database and the search index, perhaps you need to keep copies of the data in analytics systems (data warehouses, or batch and stream processing systems); maintain caches or denormalized versions of objects that were derived from the original data; pass the data through machine learning, classification, ranking, or recommendation systems; or send notifications based on changes to the data.
我们在“保持系统同步”中提及过集成数据系统的问题。随着数据的不同表示形式增多,集成问题变得更加困难。除了数据库和搜索索引之外,您可能还需要在分析系统(数据仓库,或批处理与流处理系统)中保留数据副本;维护缓存或从原始数据派生的对象的非规范化版本;让数据经过机器学习、分类、排名或推荐系统;或基于数据的变化发送通知。
Surprisingly often I see software engineers make statements like, “In my experience, 99% of people only need X” or “…don’t need X” (for various values of X). I think that such statements say more about the experience of the speaker than about the actual usefulness of a technology. The range of different things you might want to do with data is dizzyingly wide. What one person considers to be an obscure and pointless feature may well be a central requirement for someone else. The need for data integration often only becomes apparent if you zoom out and consider the dataflows across an entire organization.
令人惊讶的是,我经常看到软件工程师说出这样的论断:“根据我的经验,99%的人只需要X”,或者“……不需要X”(X可以是各种东西)。我认为这类论断更多地反映了说话者自身的经验,而非技术的实际用处。你可能想用数据做的事情范围之广令人眼花缭乱。一个人眼中晦涩无用的功能,很可能是另一个人的核心需求。只有把视角拉远、考察整个组织范围的数据流,数据集成的需求才会变得明显。
Reasoning about dataflows
When copies of the same data need to be maintained in several storage systems in order to satisfy different access patterns, you need to be very clear about the inputs and outputs: where is data written first, and which representations are derived from which sources? How do you get data into all the right places, in the right formats?
当需要在多个存储系统中维护相同数据的副本以满足不同的访问模式时,您需要非常清楚输入和输出:首先在哪里写入数据,哪些表示来自哪些源?如何以正确的格式将数据放入所有正确的位置?
For example, you might arrange for data to first be written to a system of record database, capturing the changes made to that database (see “Change Data Capture” ) and then applying the changes to the search index in the same order. If change data capture (CDC) is the only way of updating the index, you can be confident that the index is entirely derived from the system of record, and therefore consistent with it (barring bugs in the software). Writing to the database is the only way of supplying new input into this system.
例如,您可以安排数据首先写入记录系统数据库,捕获对该数据库所做的变更(参见“变更数据捕获”),然后以相同的顺序将这些变更应用到搜索索引。如果变更数据捕获(CDC)是更新索引的唯一方式,您就可以确信索引完全派生自记录系统,因而与之保持一致(除非软件存在bug)。写入数据库是向该系统提供新输入的唯一途径。
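As a minimal illustration (a hypothetical Python sketch — the class and method names are invented for this example), the search index below can only be updated by consuming the database's ordered change log, so it remains entirely derived from the system of record:
作为一个最小示意(假设性的Python草图,其中的类名和方法名均为本例虚构):下面的搜索索引只能通过按序消费数据库的变更日志来更新,因此始终完全派生自记录系统:

```python
# Hypothetical sketch: the ordered change log is the only way to update the index.
class SystemOfRecord:
    def __init__(self):
        self.rows = {}
        self.change_log = []          # ordered CDC stream (append-only)

    def write(self, key, value):
        self.rows[key] = value
        self.change_log.append((key, value))

class SearchIndex:
    def __init__(self):
        self.docs = {}
        self.applied = 0              # log offset consumed so far

    def catch_up(self, log):
        # apply changes strictly in log order, resuming from the last offset
        for key, value in log[self.applied:]:
            self.docs[key] = value
        self.applied = len(log)

db = SystemOfRecord()
idx = SearchIndex()
db.write("u1", "alice")
db.write("u1", "alicia")              # the later write wins, in log order
idx.catch_up(db.change_log)
assert idx.docs["u1"] == "alicia"     # index agrees with the system of record
```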
Allowing the application to directly write to both the search index and the database introduces the problem shown in Figure 11-4 , in which two clients concurrently send conflicting writes, and the two storage systems process them in a different order. In this case, neither the database nor the search index is “in charge” of determining the order of writes, and so they may make contradictory decisions and become permanently inconsistent with each other.
允许应用程序直接同时向搜索索引和数据库写入,会引入图11-4所示的问题:两个客户端并发发送相互冲突的写入,而两个存储系统以不同的顺序处理它们。在这种情况下,数据库和搜索索引都不“负责”决定写入顺序,因此它们可能做出相互矛盾的决定,从而彼此永久地不一致。
If it is possible for you to funnel all user input through a single system that decides on an ordering for all writes, it becomes much easier to derive other representations of the data by processing the writes in the same order. This is an application of the state machine replication approach that we saw in “Total Order Broadcast” . Whether you use change data capture or an event sourcing log is less important than simply the principle of deciding on a total order.
如果您能让所有用户输入都汇入一个决定所有写入顺序的单一系统,那么只要以相同的顺序处理这些写入,就更容易派生出数据的其他表示形式。这是我们在“全序广播”中看到的状态机复制方法的一种应用。使用变更数据捕获还是事件溯源日志并不那么重要,重要的是确定一个全序这条原则本身。
Updating a derived data system based on an event log can often be made deterministic and idempotent (see “Idempotence” ), making it quite easy to recover from faults.
更新基于事件日志的派生数据系统通常可以使其具有确定性和幂等性(请参见“幂等性”),从而使其非常容易从故障中恢复。
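To make the determinism and idempotence point concrete, here is a hedged sketch (names invented): folding the same event log into derived state always yields the same result, so after a crash you can simply replay from the last known offset, or even from the start.
为了让确定性与幂等性这一点更具体,这里给出一个带有假设性质的草图(名称为虚构):把同一份事件日志折叠进派生状态总会得到相同的结果,因此崩溃后只需从上次已知的偏移量重放即可,甚至可以从头重放。

```python
def apply_log(state, log, from_offset=0):
    """Deterministically fold an event log into derived state (idempotent)."""
    for offset in range(from_offset, len(log)):
        key, value = log[offset]
        state[key] = value            # reapplying the same event gives the same result
    return state, len(log)

log = [("a", 1), ("b", 2), ("a", 3)]
state, offset = apply_log({}, log)
# simulate a crash and a full replay from scratch: same input, same output
state2, _ = apply_log({}, log)
assert state == state2 == {"a": 3, "b": 2}
```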
Derived data versus distributed transactions
The classic approach for keeping different data systems consistent with each other involves distributed transactions, as discussed in “Atomic Commit and Two-Phase Commit (2PC)” . How does the approach of using derived data systems fare in comparison to distributed transactions?
保持不同数据系统彼此一致的经典方法涉及分布式事务,如“原子提交与两阶段提交(2PC)”中所讨论的。相比之下,使用派生数据系统的方法表现如何?
At an abstract level, they achieve a similar goal by different means. Distributed transactions decide on an ordering of writes by using locks for mutual exclusion (see “Two-Phase Locking (2PL)” ), while CDC and event sourcing use a log for ordering. Distributed transactions use atomic commit to ensure that changes take effect exactly once, while log-based systems are often based on deterministic retry and idempotence.
在抽象层面上,它们以不同的方式实现了相似的目标。分布式事务通过使用锁进行互斥来决定写入顺序(参见“两阶段锁定(2PL)”),而CDC和事件溯源使用日志来排序。分布式事务使用原子提交来确保变更恰好生效一次,而基于日志的系统通常依赖确定性重试和幂等性。
The biggest difference is that transaction systems usually provide linearizability (see “Linearizability” ), which implies useful guarantees such as reading your own writes (see “Reading Your Own Writes” ). On the other hand, derived data systems are often updated asynchronously, and so they do not by default offer the same timing guarantees.
最大的区别在于,事务系统通常提供线性一致性(参见“线性一致性”),这蕴含着一些有用的保证,例如读己之写(参见“读己之写”)。另一方面,派生数据系统通常是异步更新的,因此默认情况下无法提供同样的时效性保证。
Within limited environments that are willing to pay the cost of distributed transactions, they have been used successfully. However, I think that XA has poor fault tolerance and performance characteristics (see “Distributed Transactions in Practice” ), which severely limit its usefulness. I believe that it might be possible to create a better protocol for distributed transactions, but getting such a protocol widely adopted and integrated with existing tools would be challenging, and unlikely to happen soon.
在愿意支付分布式事务成本的有限环境中,它们已经成功地被使用。然而,我认为XA协议具有较差的容错性和性能特征(见“实践中的分布式事务”), 这严重限制了它的实用性。我相信可能有可能创建一个更好的分布式事务协议,但是让这样的协议被广泛接受并与现有工具集成将是具有挑战性的,并且不太可能很快发生。
In the absence of widespread support for a good distributed transaction protocol, I believe that log-based derived data is the most promising approach for integrating different data systems. However, guarantees such as reading your own writes are useful, and I don’t think that it is productive to tell everyone “eventual consistency is inevitable—suck it up and learn to deal with it” (at least not without good guidance on how to deal with it).
在缺乏对良好分布式事务协议的广泛支持的情况下,我认为基于日志的派生数据是集成不同数据系统最有前途的方法。然而,像读己之写这样的保证确实有用,我认为告诉所有人“最终一致性不可避免——忍着点,学会适应”并没有什么成效(至少在缺乏如何应对的良好指导时是如此)。
In “Aiming for Correctness” we will discuss some approaches for implementing stronger guarantees on top of asynchronously derived systems, and work toward a middle ground between distributed transactions and asynchronous log-based systems.
在“追求正确性”一节中,我们将讨论一些在异步派生系统之上实现更强保证的方法,并朝着分布式事务与基于异步日志的系统之间的中间地带努力。
The limits of total ordering
With systems that are small enough, constructing a totally ordered event log is entirely feasible (as demonstrated by the popularity of databases with single-leader replication, which construct precisely such a log). However, as systems are scaled toward bigger and more complex workloads, limitations begin to emerge:
对于足够小的系统,构建一个全序的事件日志是完全可行的(单主复制数据库的流行就证明了这一点,它们构建的正是这样的日志)。然而,随着系统向更大、更复杂的工作负载扩展,其局限性开始显现:
-
In most cases, constructing a totally ordered log requires all events to pass through a single leader node that decides on the ordering. If the throughput of events is greater than a single machine can handle, you need to partition it across multiple machines (see “Partitioned Logs” ). The order of events in two different partitions is then ambiguous.
在大多数情况下,构建一个完全有序的日志需要所有事件经过一个决定顺序的领导节点。如果事件吞吐量大于一个机器的处理能力,您需要将其分区到多个机器上(请参阅“分区日志”)。然后,两个不同分区中的事件顺序就是不确定的。
-
If the servers are spread across multiple geographically distributed datacenters, for example in order to tolerate an entire datacenter going offline, you typically have a separate leader in each datacenter, because network delays make synchronous cross-datacenter coordination inefficient (see “Multi-Leader Replication” ). This implies an undefined ordering of events that originate in two different datacenters.
如果服务器分布在多个地理分布的数据中心中,例如为了容忍整个数据中心离线,通常每个数据中心都有一个单独的领导者,因为网络延迟使得跨数据中心同步协调低效(请参见“多领导者复制”)。这意味着来自两个不同数据中心的事件会有未定义的顺序。
-
When applications are deployed as microservices (see “Dataflow Through Services: REST and RPC” ), a common design choice is to deploy each service and its durable state as an independent unit, with no durable state shared between services. When two events originate in different services, there is no defined order for those events.
当应用程序部署为微服务(请参见“通过服务的数据流:REST和RPC”)时,常见的设计选择是将每个服务及其持久状态独立部署为一个独立单元,没有服务之间共享的持久状态。当两个事件来自不同的服务时,没有定义这些事件的顺序。
-
Some applications maintain client-side state that is updated immediately on user input (without waiting for confirmation from a server), and even continue to work offline (see “Clients with offline operation” ). With such applications, clients and servers are very likely to see events in different orders.
一些应用程序会在用户输入时立即更新客户端状态(无需等待服务器确认),甚至可以离线操作(参见“拥有离线操作的客户端”)。对于这样的应用程序,客户端和服务器很有可能看到不同顺序的事件。
In formal terms, deciding on a total order of events is known as total order broadcast , which is equivalent to consensus (see “Consensus algorithms and total order broadcast” ). Most consensus algorithms are designed for situations in which the throughput of a single node is sufficient to process the entire stream of events, and these algorithms do not provide a mechanism for multiple nodes to share the work of ordering the events. It is still an open research problem to design consensus algorithms that can scale beyond the throughput of a single node and that work well in a geographically distributed setting.
在形式化术语中,决定事件的全序被称为全序广播,它等价于共识(参见“共识算法与全序广播”)。大多数共识算法是为单个节点的吞吐量足以处理整个事件流的情形设计的,它们没有提供让多个节点分担事件排序工作的机制。设计既能扩展到超出单节点吞吐量、又能在地理分布环境中良好运行的共识算法,仍然是一个开放的研究问题。
Ordering events to capture causality
In cases where there is no causal link between events, the lack of a total order is not a big problem, since concurrent events can be ordered arbitrarily. Some other cases are easy to handle: for example, when there are multiple updates of the same object, they can be totally ordered by routing all updates for a particular object ID to the same log partition. However, causal dependencies sometimes arise in more subtle ways (see also “Ordering and Causality” ).
如果事件之间没有因果联系,缺乏完全排序并不是一个大问题,因为并发事件可以任意排序。有些其他情况很容易处理:例如,当有多个相同对象的更新时,它们可以通过将特定对象ID的所有更新路由到同一日志分区来进行完全排序。但是,因果依赖关系有时以更微妙的方式出现(也请参见“排序和因果性”)。
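The per-object ordering trick mentioned above can be sketched as follows (a hypothetical example; the hash scheme and partition count are invented): routing every update for a given object ID to the same partition gives a total order per object, while order across objects remains undefined.
上面提到的按对象排序的技巧可以用如下草图表示(假设性示例,哈希方案和分区数均为虚构):把同一对象ID的所有更新路由到同一分区,就能获得每个对象内部的全序,而对象之间的顺序仍然是未定义的。

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(object_id: str) -> int:
    # stable hash: every update to one object lands in the same log partition,
    # so updates to that object are totally ordered within the partition
    digest = hashlib.sha256(object_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# all updates for "user:42" go to one partition, deterministically
assert partition_for("user:42") == partition_for("user:42")
```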
For example, consider a social networking service, and two users who were in a relationship but have just broken up. One of the users removes the other as a friend, and then sends a message to their remaining friends complaining about their ex-partner. The user’s intention is that their ex-partner should not see the rude message, since the message was sent after the friend status was revoked.
例如,考虑一个社交网络服务,以及两个曾经处于恋爱关系但刚刚分手的用户。其中一人将对方从好友中删除,然后向剩下的朋友们发送一条抱怨前任的消息。该用户的意图是前任不应看到这条无礼的消息,因为消息是在好友状态被撤销之后发送的。
However, in a system that stores friendship status in one place and messages in another place, that ordering dependency between the unfriend event and the message-send event may be lost. If the causal dependency is not captured, a service that sends notifications about new messages may process the message-send event before the unfriend event, and thus incorrectly send a notification to the ex-partner.
然而,在一个把好友状态存储在一处、把消息存储在另一处的系统中,解除好友事件与消息发送事件之间的顺序依赖可能会丢失。如果因果依赖没有被捕获,负责发送新消息通知的服务可能会在处理解除好友事件之前先处理消息发送事件,从而错误地向前任发送通知。
In this example, the notifications are effectively a join between the messages and the friend list, making it related to the timing issues of joins that we discussed previously (see “Time-dependence of joins” ). Unfortunately, there does not seem to be a simple answer to this problem [ 2 , 3 ]. Starting points include:
在这个例子中,通知实际上是消息和好友列表之间的连接,这使它与我们之前讨论过的连接的时序问题相关[参见“连接的时间依赖性”]。不幸的是,这个问题似乎没有简单的答案[2,3]。起点包括:
-
Logical timestamps can provide total ordering without coordination (see “Sequence Number Ordering” ), so they may help in cases where total order broadcast is not feasible. However, they still require recipients to handle events that are delivered out of order, and they require additional metadata to be passed around.
逻辑时间戳可以在无需协调的情况下提供全序(参见“序列号排序”),因此在全序广播不可行时,它们可能会有所帮助。然而,它们仍然要求接收方处理乱序交付的事件,并且需要传递额外的元数据。
-
If you can log an event to record the state of the system that the user saw before making a decision, and give that event a unique identifier, then any later events can reference that event identifier in order to record the causal dependency [ 4 ]. We will return to this idea in “Reads are events too” .
如果您可以记录一个事件以记录用户在做出决策之前看到的系统状态,并为该事件分配一个唯一标识符,那么任何以后的事件都可以引用该事件标识符来记录因果依赖关系[4]。我们将在“读取也是事件”中回到这个想法。
-
Conflict resolution algorithms (see “Automatic Conflict Resolution” ) help with processing events that are delivered in an unexpected order. They are useful for maintaining state, but they do not help if actions have external side effects (such as sending a notification to a user).
冲突解决算法(见“自动冲突解决”)有助于处理以出乎意料的顺序传递的事件。 它们对于维护状态很有用,但如果操作具有外部副作用(例如向用户发送通知),则它们无法帮助。
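The first starting point in the list above, logical timestamps, can be sketched with a Lamport clock (a hypothetical minimal implementation; the class name is invented): each node keeps a counter, and `(counter, node_id)` pairs give a total order without any coordination.
上面列表中的第一个出发点——逻辑时间戳——可以用Lamport时钟来勾勒(假设性的最小实现,类名为虚构):每个节点维护一个计数器,`(counter, node_id)` 二元组无需任何协调即可给出全序。

```python
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        # local event: bump the counter and emit a totally ordered timestamp
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, ts):
        # on receiving a message, advance past the sender's timestamp
        self.counter = max(self.counter, ts[0])

a, b = LamportClock("A"), LamportClock("B")
t1 = a.tick()
b.observe(t1)
t2 = b.tick()
assert t1 < t2    # a causally later event gets a larger timestamp
```

Note that the converse does not hold: a smaller timestamp does not prove causal precedence, which is why recipients must still cope with out-of-order delivery.
注意反向并不成立:时间戳更小并不能证明因果在先,这正是接收方仍须处理乱序交付的原因。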
Perhaps, over time, patterns for application development will emerge that allow causal dependencies to be captured efficiently, and derived state to be maintained correctly, without forcing all events to go through the bottleneck of total order broadcast.
也许随着时间的推移,应用程序开发的模式将不断出现,可有效捕获因果依赖关系,并正确维护派生状态,而无需强制所有事件经过总顺序广播瓶颈。
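The second starting point from the list above — giving each event a unique identifier that later events can reference — might look like this (a hypothetical sketch using the unfriend example; the function and field names are invented):
上面列表中的第二个出发点——为每个事件分配唯一标识符、供后续事件引用——大致可以写成这样(借用解除好友的例子做的假设性草图,函数名与字段名均为虚构):

```python
import uuid

def record_event(log, payload, caused_by=None):
    """Append an event with a unique ID; later events reference their cause."""
    event = {"id": str(uuid.uuid4()), "payload": payload, "caused_by": caused_by}
    log.append(event)
    return event["id"]

log = []
unfriend_id = record_event(log, {"type": "unfriend", "user": "ex"})
# the message explicitly records that it depends on the unfriend event
record_event(log, {"type": "message", "text": "phew"}, caused_by=unfriend_id)
# a consumer must not process an event before the event it references
assert log[1]["caused_by"] == log[0]["id"]
```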
Batch and Stream Processing
I would say that the goal of data integration is to make sure that data ends up in the right form in all the right places. Doing so requires consuming inputs, transforming, joining, filtering, aggregating, training models, evaluating, and eventually writing to the appropriate outputs. Batch and stream processors are the tools for achieving this goal.
我想说,数据集成的目标是确保数据以正确的形式出现在所有正确的位置。做到这一点需要消费输入、转换、连接、过滤、聚合、训练模型、求值,并最终写入适当的输出。批处理器和流处理器就是实现这一目标的工具。
The outputs of batch and stream processes are derived datasets such as search indexes, materialized views, recommendations to show to users, aggregate metrics, and so on (see “The Output of Batch Workflows” and “Uses of Stream Processing” ).
批处理和流处理的输出是派生数据集,例如搜索索引、物化视图、向用户展示的推荐、聚合指标等(参见“批处理工作流的输出”和“流处理的应用”)。
As we saw in Chapter 10 and Chapter 11 , batch and stream processing have a lot of principles in common, and the main fundamental difference is that stream processors operate on unbounded datasets whereas batch process inputs are of a known, finite size. There are also many detailed differences in the ways the processing engines are implemented, but these distinctions are beginning to blur.
正如我们在第十章和第十一章所看到的那样,批量和流处理有很多共同的原则,主要的基本区别是流处理器处理无界数据集,而批处理输入是已知的有限大小。在处理引擎实现的方式上也有许多细节上的差异,但这些区别正在开始模糊化。
Spark performs stream processing on top of a batch processing engine by breaking the stream into microbatches , whereas Apache Flink performs batch processing on top of a stream processing engine [ 5 ]. In principle, one type of processing can be emulated on top of the other, although the performance characteristics vary: for example, microbatching may perform poorly on hopping or sliding windows [ 6 ].
Spark通过把流分解成微批次(microbatches),在批处理引擎之上执行流处理;而Apache Flink则在流处理引擎之上执行批处理[5]。原则上,一种处理可以在另一种之上模拟,但性能特征会有所不同:例如,微批处理在跳跃窗口或滑动窗口上的表现可能较差[6]。
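The microbatching idea can be illustrated with a toy sketch (hypothetical; real engines batch by arrival time and checkpoint state): events are grouped into fixed-width batches and each batch is then handed to a batch-style computation.
微批处理的思路可以用一个玩具草图来说明(假设性示例;真实引擎按到达时间切批并对状态做检查点):事件被分组到固定宽度的批次中,每个批次再交给批处理式的计算。

```python
def microbatches(events, batch_ms):
    """Group (timestamp_ms, payload) events into fixed-width microbatches."""
    batches = {}
    for ts, payload in events:
        batches.setdefault(ts // batch_ms, []).append(payload)
    # each batch can now be processed by an ordinary batch computation
    return [batches[k] for k in sorted(batches)]

stream = [(0, "a"), (120, "b"), (130, "c"), (450, "d")]
assert microbatches(stream, 200) == [["a", "b", "c"], ["d"]]
```

A sliding window that starts mid-batch cannot be expressed exactly in this model, which is one reason microbatching can perform poorly on hopping or sliding windows.
一个从批次中间开始的滑动窗口在此模型中无法被精确表达,这也是微批处理在跳跃或滑动窗口上表现可能较差的原因之一。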
Maintaining derived state
Batch processing has a quite strong functional flavor (even if the code is not written in a functional programming language): it encourages deterministic, pure functions whose output depends only on the input and which have no side effects other than the explicit outputs, treating inputs as immutable and outputs as append-only. Stream processing is similar, but it extends operators to allow managed, fault-tolerant state (see “Rebuilding state after a failure” ).
批处理具有相当强的函数式风格(即使代码不是用函数式编程语言编写的):它鼓励确定性的纯函数,其输出仅取决于输入,除显式输出外没有任何副作用,并把输入视为不可变、把输出视为仅追加。流处理与之类似,但它扩展了算子,允许受管理的、容错的状态(参见“故障后重建状态”)。
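A hedged illustration of that functional flavor (names invented for this sketch): the derivation step is a pure function that treats its input as immutable, so replaying the same log always rebuilds the same derived state.
对这种函数式风格的一个示意(草图中的名称为虚构):派生步骤是一个把输入视为不可变的纯函数,因此重放同一份日志总能重建出相同的派生状态。

```python
def derive(counts, event):
    """Pure function: output depends only on the inputs, no side effects."""
    new_counts = dict(counts)                  # treat the input state as immutable
    new_counts[event] = new_counts.get(event, 0) + 1
    return new_counts

log = ["click", "view", "click"]               # append-only input
state = {}
for e in log:
    state = derive(state, e)
assert state == {"click": 2, "view": 1}
# determinism: replaying the same log from scratch yields the same state
```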
The principle of deterministic functions with well-defined inputs and outputs is not only good for fault tolerance (see “Idempotence” ), but also simplifies reasoning about the dataflows in an organization [ 7 ]. No matter whether the derived data is a search index, a statistical model, or a cache, it is helpful to think in terms of data pipelines that derive one thing from another, pushing state changes in one system through functional application code and applying the effects to derived systems.
具有良好定义的输入和输出的确定性函数这一原则,不仅有利于容错(参见“幂等性”),也简化了对组织中数据流的推理[7]。无论派生数据是搜索索引、统计模型还是缓存,按数据管道的方式来思考都是有帮助的:从一份数据派生出另一份,通过函数式的应用代码推送一个系统中的状态变更,并将其效果应用到派生系统中。
In principle, derived data systems could be maintained synchronously, just like a relational database updates secondary indexes synchronously within the same transaction as writes to the table being indexed. However, asynchrony is what makes systems based on event logs robust: it allows a fault in one part of the system to be contained locally, whereas distributed transactions abort if any one participant fails, so they tend to amplify failures by spreading them to the rest of the system (see “Limitations of distributed transactions” ).
原则上,派生数据系统可以同步维护,就像关系数据库在与写入被索引表相同的事务中同步更新二级索引一样。然而,异步正是基于事件日志的系统健壮性的来源:它允许系统某一部分的故障被局部控制;而分布式事务只要任一参与者失败就会中止,因此往往会把故障扩散到系统的其余部分,从而放大故障(参见“分布式事务的局限性”)。
We saw in “Partitioning and Secondary Indexes” that secondary indexes often cross partition boundaries. A partitioned system with secondary indexes either needs to send writes to multiple partitions (if the index is term-partitioned) or send reads to all partitions (if the index is document-partitioned). Such cross-partition communication is also most reliable and scalable if the index is maintained asynchronously [ 8 ] (see also “Multi-partition data processing” ).
在“分区和二级索引”中,我们看到二级索引经常跨越分区边界。具有二级索引的分区系统需要将写操作发送到多个分区(如果索引是术语分区),或者将读操作发送到所有分区(如果索引是文档分区)。如果异步维护索引,则此类跨分区通信也是最可靠和可扩展的[8](另请参见“多分区数据处理”)。
Reprocessing data for application evolution
When maintaining derived data, batch and stream processing are both useful. Stream processing allows changes in the input to be reflected in derived views with low delay, whereas batch processing allows large amounts of accumulated historical data to be reprocessed in order to derive new views onto an existing dataset.
在维护派生数据时,批处理和流处理都很有用。流处理能够以较低的延迟把输入的变化反映到派生视图中,而批处理则可以重新处理大量累积的历史数据,以便在现有数据集上派生出新的视图。
In particular, reprocessing existing data provides a good mechanism for maintaining a system, evolving it to support new features and changed requirements (see Chapter 4 ). Without reprocessing, schema evolution is limited to simple changes like adding a new optional field to a record, or adding a new type of record. This is the case both in a schema-on-write and in a schema-on-read context (see “Schema flexibility in the document model” ). On the other hand, with reprocessing it is possible to restructure a dataset into a completely different model in order to better serve new requirements.
特别是,重新处理现有数据为维护系统、使其演化以支持新功能和变化的需求提供了一个良好的机制(参见第4章)。如果不重新处理,模式演化就仅限于简单的变更,例如向记录添加一个新的可选字段,或添加一种新的记录类型。无论是写时模式(schema-on-write)还是读时模式(schema-on-read)的场景都是如此(参见“文档模型中的模式灵活性”)。另一方面,借助重新处理,可以把数据集重构为一个完全不同的模型,以便更好地满足新的需求。
Derived views allow gradual evolution. If you want to restructure a dataset, you do not need to perform the migration as a sudden switch. Instead, you can maintain the old schema and the new schema side by side as two independently derived views onto the same underlying data. You can then start shifting a small number of users to the new view in order to test its performance and find any bugs, while most users continue to be routed to the old view. Gradually, you can increase the proportion of users accessing the new view, and eventually you can drop the old view [ 10 ].
派生视图允许逐步演变。如果您想重组数据集,您不需要进行突然的迁移。相反,您可以将旧模式和新模式作为两个独立的派生视图,同时指向相同的基础数据。然后,您可以开始将少量用户转移到新视图以测试其性能并查找任何错误,而大多数用户继续被路由到旧视图。逐渐地,您可以增加访问新视图的用户比例,最终可以放弃旧视图[10]。
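The gradual shift of users described above is often done with deterministic bucketing. A hypothetical sketch (the hash choice and function name are invented): each user is consistently assigned to the old or new view, and raising the percentage moves more users over without flapping.
上面描述的逐步迁移用户通常通过确定性分桶来实现。一个假设性的草图(哈希选择与函数名为虚构):每个用户被稳定地分配到旧视图或新视图,提高百分比就会把更多用户迁移过去,而不会来回抖动。

```python
import zlib

def route_to_new_view(user_id: str, rollout_percent: int) -> bool:
    # deterministic per-user bucketing: the same user always sees the same view
    bucket = zlib.crc32(user_id.encode()) % 100
    return bucket < rollout_percent

assert route_to_new_view("u1", 100) is True    # full rollout: everyone on new view
assert route_to_new_view("u1", 0) is False     # rollback: everyone on old view
```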
The beauty of such a gradual migration is that every stage of the process is easily reversible if something goes wrong: you always have a working system to go back to. By reducing the risk of irreversible damage, you can be more confident about going ahead, and thus move faster to improve your system [ 11 ].
这种逐步迁移的好处在于,如果出了问题,每个阶段的过程都可以很容易地逆转:你始终有一个可以回到正常工作状态的系统。通过降低不可逆损坏的风险,你可以更有信心地继续前进,从而更快地改进你的系统。
The lambda architecture
If batch processing is used to reprocess historical data, and stream processing is used to process recent updates, then how do you combine the two? The lambda architecture [ 12 ] is a proposal in this area that has gained a lot of attention.
如果批处理用于重新处理历史数据,而流处理用于处理最近的更新,那么如何把两者结合起来?Lambda架构[12]是这一领域中一个受到广泛关注的提议。
The core idea of the lambda architecture is that incoming data should be recorded by appending immutable events to an always-growing dataset, similarly to event sourcing (see “Event Sourcing” ). From these events, read-optimized views are derived. The lambda architecture proposes running two different systems in parallel: a batch processing system such as Hadoop MapReduce, and a separate stream-processing system such as Storm.
Lambda体系结构的核心思想是,传入的数据应该通过将不可变事件追加到始终增长的数据集中进行记录,类似于事件溯源(见“事件溯源”)。从这些事件中派生出了面向读取优化的视图。Lambda体系结构建议并行运行两个不同的系统:批处理系统(例如Hadoop MapReduce)和单独的流处理系统(例如Storm)。
In the lambda approach, the stream processor consumes the events and quickly produces an approximate update to the view; the batch processor later consumes the same set of events and produces a corrected version of the derived view. The reasoning behind this design is that batch processing is simpler and thus less prone to bugs, while stream processors are thought to be less reliable and harder to make fault-tolerant (see “Fault Tolerance” ). Moreover, the stream process can use fast approximate algorithms while the batch process uses slower exact algorithms.
在 Lambda 架构中,流处理器消耗事件并快速生成视图的近似更新;批处理器随后消耗相同的事件集,并生成导出视图的校正版本。设计背后的理由是批处理更简单,因此更少出错,而流处理器被认为不太可靠且更难以实现容错(参见“容错”)。此外,流处理可以使用快速的近似算法,而批处理使用更慢的精确算法。
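The serving-layer merge implied by this design can be sketched as follows (a hypothetical simplification — real lambda deployments merge per-query across two stores): exact batch results take precedence wherever they exist, and the approximate stream view fills in the most recent keys the batch job has not yet covered.
这一设计所隐含的服务层合并可以粗略勾勒如下(假设性的简化——真实的Lambda部署是在两套存储之间按查询合并):凡是存在精确批处理结果的地方以其为准,批作业尚未覆盖的最新键则由近似的流视图补充。

```python
def serve_counts(batch_view, stream_view):
    """Merge the two views: corrected batch results win, stream fills in recent keys."""
    merged = dict(stream_view)     # fast, approximate, covers recent events
    merged.update(batch_view)      # slower, exact, covers reprocessed history
    return merged

batch_view = {"2024-01-01": 100}                     # exact recount from the batch job
stream_view = {"2024-01-01": 97, "2024-01-02": 12}   # approximate + most recent
assert serve_counts(batch_view, stream_view) == {"2024-01-01": 100, "2024-01-02": 12}
```

Even this toy merge only works because the output is a keyed time series; as noted below, joins or sessionization make the merge much harder.
即便是这个玩具式的合并,也只因为输出是按键组织的时间序列才行得通;正如下文所述,连接或会话化会让合并困难得多。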
The lambda architecture was an influential idea that shaped the design of data systems for the better, particularly by popularizing the principle of deriving views onto streams of immutable events and reprocessing events when needed. However, I also think that it has a number of practical problems:
Lambda架构是一种具有影响力的想法,为设计数据系统提供了更好的方式,特别是通过普及依据不变事件流派生视图并在需要时重新处理事件的原则。然而,我认为它也有一些实际问题:
-
Having to maintain the same logic to run both in a batch and in a stream processing framework is significant additional effort. Although libraries such as Summingbird [ 13 ] provide an abstraction for computations that can be run in either a batch or a streaming context, the operational complexity of debugging, tuning, and maintaining two different systems remains [ 14 ].
必须维护同样的逻辑,使其既能在批处理框架中运行又能在流处理框架中运行,这是一笔可观的额外工作量。尽管Summingbird[13]之类的库提供了一种可在批处理或流式上下文中运行的计算抽象,但调试、调优和维护两套不同系统的运维复杂性依然存在[14]。
-
Since the stream pipeline and the batch pipeline produce separate outputs, they need to be merged in order to respond to user requests. This merge is fairly easy if the computation is a simple aggregation over a tumbling window, but it becomes significantly harder if the view is derived using more complex operations such as joins and sessionization, or if the output is not a time series.
由于流处理管道和批处理管道产生各自独立的输出,需要把它们合并才能响应用户请求。如果计算只是滚动窗口上的简单聚合,这种合并相当容易;但如果视图是通过连接、会话化(sessionization)等更复杂的操作派生的,或者输出不是时间序列,合并就会变得困难得多。
-
Although it is great to have the ability to reprocess the entire historical dataset, doing so frequently is expensive on large datasets. Thus, the batch pipeline often needs to be set up to process incremental batches (e.g., an hour’s worth of data at the end of every hour) rather than reprocessing everything. This raises the problems discussed in “Reasoning About Time” , such as handling stragglers and handling windows that cross boundaries between batches. Incrementalizing a batch computation adds complexity, making it more akin to the streaming layer, which runs counter to the goal of keeping the batch layer as simple as possible.
尽管能够重新处理整个历史数据集很棒,但在大型数据集上频繁这样做代价高昂。因此,批处理管道通常需要设置为处理增量批次(例如,在每小时结束时处理一小时的数据),而不是重新处理所有内容。这会引出“时间推理”中讨论过的问题,例如处理落后者(stragglers),以及处理跨越批次边界的窗口。将批处理计算增量化会增加复杂性,使其更像流处理层,这与保持批处理层尽可能简单的目标背道而驰。
Unifying batch and stream processing
More recent work has enabled the benefits of the lambda architecture to be enjoyed without its downsides, by allowing both batch computations (reprocessing historical data) and stream computations (processing events as they arrive) to be implemented in the same system [ 15 ].
最近的一些工作使人们能够享受Lambda架构的好处而免受其缺点之害:在同一个系统中同时实现批计算(重新处理历史数据)和流计算(在事件到达时处理它们)[15]。
Unifying batch and stream processing in one system requires the following features, which are becoming increasingly widely available:
统一批处理和流处理于一个系统中需要以下功能,这些功能变得越来越普遍可用:
-
The ability to replay historical events through the same processing engine that handles the stream of recent events. For example, log-based message brokers have the ability to replay messages (see “Replaying old messages” ), and some stream processors can read input from a distributed filesystem like HDFS.
能够通过处理最近事件的同一处理引擎来重播历史事件。例如,基于日志的消息代理可以回放消息(请参见“重播旧消息”),并且一些流处理器可以从分布式文件系统(如HDFS)读取输入。
-
Exactly-once semantics for stream processors—that is, ensuring that the output is the same as if no faults had occurred, even if faults did in fact occur (see “Fault Tolerance” ). Like with batch processing, this requires discarding the partial output of any failed tasks.
流处理器的恰好一次语义——即确保输出与没有发生任何故障时相同,即使事实上确实发生了故障(参见“容错”)。与批处理一样,这要求丢弃任何失败任务的部分输出。
-
Tools for windowing by event time, not by processing time, since processing time is meaningless when reprocessing historical events (see “Reasoning About Time” ). For example, Apache Beam provides an API for expressing such computations, which can then be run using Apache Flink or Google Cloud Dataflow.
按事件时间进行窗口化处理的工具,而不是根据处理时间,因为在重新处理历史事件时处理时间是没有意义的(参见“关于时间的推理”)。例如,Apache Beam提供了一个API来表达这样的计算,然后可以使用Apache Flink或Google Cloud Dataflow来运行。
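The event-time windowing described in this list can be sketched as follows (a hypothetical toy example, far simpler than Beam's windowing model): each event is assigned to a window by its own timestamp, so replaying history out of order produces the same windows.
这一条所说的按事件时间开窗可以粗略勾勒如下(假设性的玩具示例,远比Beam的窗口模型简单):每个事件按自身的时间戳归入窗口,因此乱序重放历史也会得到相同的窗口。

```python
def window_counts(events, window_ms):
    """Assign events to tumbling windows by event time, not arrival order."""
    counts = {}
    for event_time_ms in events:
        start = (event_time_ms // window_ms) * window_ms   # window start boundary
        counts[start] = counts.get(start, 0) + 1
    return counts

# events arrive out of order (as they would when replaying history),
# yet the result depends only on their event timestamps
arrived = [1500, 200, 900, 2100]
assert window_counts(arrived, 1000) == {0: 2, 1000: 1, 2000: 1}
```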
Unbundling Databases
At a most abstract level, databases, Hadoop, and operating systems all perform the same functions: they store some data, and they allow you to process and query that data [ 16 ]. A database stores data in records of some data model (rows in tables, documents, vertices in a graph, etc.) while an operating system’s filesystem stores data in files—but at their core, both are “information management” systems [ 17 ]. As we saw in Chapter 10 , the Hadoop ecosystem is somewhat like a distributed version of Unix.
在最抽象的层面上,数据库、Hadoop和操作系统都执行相同的功能:它们存储一些数据,并允许您处理和查询这些数据[16]。数据库以某种数据模型的记录形式存储数据(表中的行、文档、图中的顶点等),而操作系统的文件系统以文件形式存储数据——但就其核心而言,两者都是“信息管理”系统[17]。正如我们在第10章中所看到的,Hadoop生态系统有点像Unix的分布式版本。
Of course, there are many practical differences. For example, many filesystems do not cope very well with a directory containing 10 million small files, whereas a database containing 10 million small records is completely normal and unremarkable. Nevertheless, the similarities and differences between operating systems and databases are worth exploring.
当然,两者在实践中存在许多差异。例如,许多文件系统不能很好地处理包含1000万个小文件的目录,而包含1000万条小记录的数据库则完全正常、毫不稀奇。尽管如此,操作系统与数据库之间的异同仍然值得探讨。
Unix and relational databases have approached the information management problem with very different philosophies. Unix viewed its purpose as presenting programmers with a logical but fairly low-level hardware abstraction, whereas relational databases wanted to give application programmers a high-level abstraction that would hide the complexities of data structures on disk, concurrency, crash recovery, and so on. Unix developed pipes and files that are just sequences of bytes, whereas databases developed SQL and transactions.
Unix和关系型数据库以非常不同的哲学方法解决信息管理问题。 Unix认为其目的是向程序员提供一个逻辑但相当低级别的硬件抽象,而关系型数据库希望为应用程序员提供一个高级抽象,隐藏磁盘上的数据结构、并发性、崩溃恢复等复杂性。 Unix开发了管道和文件,它们只是字节序列,而数据库则开发了SQL和事务。
Which approach is better? Of course, it depends what you want. Unix is “simpler” in the sense that it is a fairly thin wrapper around hardware resources; relational databases are “simpler” in the sense that a short declarative query can draw on a lot of powerful infrastructure (query optimization, indexes, join methods, concurrency control, replication, etc.) without the author of the query needing to understand the implementation details.
哪种方法更好?当然,这取决于你想要什么。Unix的“简单”在于它是一个相当薄的硬件资源包装器;关系型数据库的“简单”在于,一个简短的声明性查询可以利用大量强大的基础设施(查询优化、索引、连接方法、并发控制、复制等等),而查询的作者无需理解实现细节。
The tension between these philosophies has lasted for decades (both Unix and the relational model emerged in the early 1970s) and still isn’t resolved. For example, I would interpret the NoSQL movement as wanting to apply a Unix-esque approach of low-level abstractions to the domain of distributed OLTP data storage.
这些哲学之间的紧张关系已经持续了数十年(Unix 和关系模型都出现在 20 世纪 70 年代初),但仍未得到解决。例如,我认为 NoSQL 运动想要将低级抽象的 Unix 式方法应用到分布式 OLTP 数据存储领域。
In this section I will attempt to reconcile the two philosophies, in the hope that we can combine the best of both worlds.
在这个部分,我将尝试调和这两种哲学,希望我们能将两种最好的东西结合起来。
Composing Data Storage Technologies
Over the course of this book we have discussed various features provided by databases and how they work, including:
在本书中,我们讨论了数据库提供的各种功能和它们的运作方式,包括:
-
Secondary indexes, which allow you to efficiently search for records based on the value of a field (see “Other Indexing Structures” )
二级索引可以根据字段值高效搜索记录(参见“其他索引结构”)。
-
Materialized views, which are a kind of precomputed cache of query results (see “Aggregation: Data Cubes and Materialized Views” )
物化视图,它是一种查询结果的预计算缓存(参见“聚合:数据立方体和物化视图”)。
-
Replication logs, which keep copies of the data on other nodes up to date (see “Implementation of Replication Logs” )
复制日志,它可以使其他节点的数据副本保持最新(参见“复制日志的实现”)。
-
Full-text search indexes, which allow keyword search in text (see “Full-text search and fuzzy indexes” ) and which are built into some relational databases [ 1 ]
全文搜索索引,允许关键词在文本中搜索(见“全文搜索和模糊索引”),并且已经内置在一些关系型数据库中[1]。
In Chapters 10 and 11 , similar themes emerged. We talked about building full-text search indexes (see “The Output of Batch Workflows” ), about materialized view maintenance (see “Maintaining materialized views” ), and about replicating changes from a database to derived data systems (see “Change Data Capture” ).
在第10章和第11章,出现了类似的主题。我们谈论了构建全文搜索索引(请参见“批处理工作流的输出”),关于物化视图的维护(请参见“维护物化视图”),以及从数据库到派生数据系统的复制更改(请参见“更改数据捕捉”)。
It seems that there are parallels between the features that are built into databases and the derived data systems that people are building with batch and stream processors.
似乎数据库中内置的功能和人们使用批处理和流处理器构建的派生数据系统之间存在相似之处。
Creating an index
Think about what happens when you run CREATE INDEX to create a new index in a relational database. The database has to scan over a consistent snapshot of a table, pick out all of the field values being indexed, sort them, and write out the index. Then it must process the backlog of writes that have been made since the consistent snapshot was taken (assuming the table was not locked while creating the index, so writes could continue). Once that is done, the database must continue to keep the index up to date whenever a transaction writes to the table.
当在关系型数据库中运行CREATE INDEX以创建新索引时,请考虑会发生什么。数据库必须扫描表的一致快照,挑选出所有正在索引的字段值,对它们进行排序,并编写索引。然后,它必须处理自一致快照以来产生的写入积压(假设在创建索引时没有锁定表格,因此可以继续写入)。完成此操作后,数据库必须继续在事务写入表时保持索引最新。
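As a hedged sketch of this three-phase process (illustrative Python with hypothetical names, not any real database's internals — and omitting the sorted on-disk structure and concurrency control a real index needs):

```python
# Illustrative sketch of index creation: snapshot scan, backlog replay,
# then ongoing maintenance. All names are hypothetical.

def build_index(snapshot, backlog, field):
    """Phases 1-2: scan a consistent snapshot, then replay buffered writes."""
    index = {}
    for record_id, record in snapshot.items():
        # Pick out the indexed field value for each record.
        index.setdefault(record[field], []).append(record_id)
    for record_id, record in backlog:
        # Writes that happened while the scan was running.
        apply_write(index, record_id, record, field)
    return index

def apply_write(index, record_id, record, field):
    """Phase 3: keep the index up to date on every subsequent write."""
    for value, ids in list(index.items()):
        if record_id in ids:          # drop the stale entry, if any
            ids.remove(record_id)
            if not ids:
                del index[value]
    index.setdefault(record[field], []).append(record_id)
```

A real implementation would keep the index sorted (a B-tree or SSTable) and hold appropriate locks; the sketch only shows the dataflow between the three phases.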
This process is remarkably similar to setting up a new follower replica (see “Setting Up New Followers” ), and also very similar to bootstrapping change data capture in a streaming system (see “Initial snapshot” ).
这个过程与设置新的从属复制(请参见“设置新的从属”)非常相似,也与在流系统中引导变更数据捕获(请参见“初始快照”)非常相似。
Whenever you run CREATE INDEX, the database essentially reprocesses the existing dataset (as discussed in “Reprocessing data for application evolution”) and derives the index as a new view onto the existing data. The existing data may be a snapshot of the state rather than a log of all changes that ever happened, but the two are closely related (see “State, Streams, and Immutability”).
每当运行CREATE INDEX时,数据库本质上重新处理现有数据集(如“重新处理应用程序演变中的数据”中所讨论的),并将索引视为现有数据的新视图。现有数据可能是状态的快照,而不是所有发生变化的日志,但两者密切相关(请参见“状态、流和不可变性”)。
The meta-database of everything
In this light, I think that the dataflow across an entire organization starts looking like one huge database [ 7 ]. Whenever a batch, stream, or ETL process transports data from one place and form to another place and form, it is acting like the database subsystem that keeps indexes or materialized views up to date.
从这个角度来看,我认为整个组织的数据流开始看起来像一个巨大的数据库[7]。每当批处理、流处理或 ETL 过程将数据从一个地方、一种形式传输到另一个地方、另一种形式时,它就像数据库子系统一样,使索引或物化视图保持最新。
Viewed like this, batch and stream processors are like elaborate implementations of triggers, stored procedures, and materialized view maintenance routines. The derived data systems they maintain are like different index types. For example, a relational database may support B-tree indexes, hash indexes, spatial indexes (see “Multi-column indexes” ), and other types of indexes. In the emerging architecture of derived data systems, instead of implementing those facilities as features of a single integrated database product, they are provided by various different pieces of software, running on different machines, administered by different teams.
从这个角度来看,批处理和流处理器就像是复杂的触发器、存储过程和物化视图维护程序的实现。它们维护的派生数据系统就像是不同的索引类型。例如,关系数据库可以支持B-tree索引、哈希索引、空间索引(参见“多列索引”)和其他类型的索引。在派生数据系统的新兴架构中,它们不再是单一集成数据库产品特性的实现,而是由各种不同的软件提供,运行在不同的机器上,由不同的团队管理。
Where will these developments take us in the future? If we start from the premise that there is no single data model or storage format that is suitable for all access patterns, I speculate that there are two avenues by which different storage and processing tools can nevertheless be composed into a cohesive system:
这些发展将带领我们走向何方?如果我们的前提是没有一个适用于所有访问模式的单一数据模型或存储格式,那么我猜测有两种途径不同的存储和处理工具可以被组合成一个有凝聚力的系统。
- Federated databases: unifying reads
-
It is possible to provide a unified query interface to a wide variety of underlying storage engines and processing methods—an approach known as a federated database or polystore [ 18 , 19 ]. For example, PostgreSQL’s foreign data wrapper feature fits this pattern [ 20 ]. Applications that need a specialized data model or query interface can still access the underlying storage engines directly, while users who want to combine data from disparate places can do so easily through the federated interface.
可以为各种底层存储引擎和处理方法提供统一的查询接口,这种方法被称为联合数据库或多存储[18,19]。 例如,PostgreSQL的外部数据封装功能符合此模式[20]。 需要专门的数据模型或查询接口的应用程序仍然可以直接访问底层存储引擎,而想要从不同地方合并数据的用户可以通过联合接口轻松实现。
A federated query interface follows the relational tradition of a single integrated system with a high-level query language and elegant semantics, but a complicated implementation.
一个联合查询界面遵循单一集成系统的关系传统,具有高级查询语言和优美的语义,但实现复杂。
- Unbundled databases: unifying writes
-
While federation addresses read-only querying across several different systems, it does not have a good answer to synchronizing writes across those systems. We said that within a single database, creating a consistent index is a built-in feature. When we compose several storage systems, we similarly need to ensure that all data changes end up in all the right places, even in the face of faults. Making it easier to reliably plug together storage systems (e.g., through change data capture and event logs) is like unbundling a database’s index-maintenance features in a way that can synchronize writes across disparate technologies [ 7 , 21 ].
虽然联邦查询解决了跨多个不同系统的只读查询问题,但对于跨这些系统同步写入,它并没有很好的答案。我们说过,在单个数据库中,创建一致的索引是一项内置功能。当我们组合多个存储系统时,同样需要确保所有数据变更都到达所有正确的位置,即使在出现故障的情况下也是如此。让存储系统能够更容易地可靠对接(例如,通过变更数据捕获和事件日志),就像是将数据库的索引维护功能拆分出来,使其能够跨不同技术同步写入 [7,21]。
The unbundled approach follows the Unix tradition of small tools that do one thing well [ 22 ], that communicate through a uniform low-level API (pipes), and that can be composed using a higher-level language (the shell) [ 16 ].
这种分解的方法遵循了Unix传统:使用小而精的工具 [22],它们通过一个统一的低级API(管道)进行通信,并且可以使用高级语言(shell)进行组合 [16]。
Making unbundling work
Federation and unbundling are two sides of the same coin: composing a reliable, scalable, and maintainable system out of diverse components. Federated read-only querying requires mapping one data model into another, which takes some thought but is ultimately quite a manageable problem. I think that keeping the writes to several storage systems in sync is the harder engineering problem, and so I will focus on it.
联邦化和分解是同一个硬币的两面:通过组合不同的组件来构建可靠、可扩展和易维护的系统。联邦只读查询需要将一个数据模型映射到另一个模型,这需要一些思考,但最终是一个可以管理的问题。我认为让多个存储系统的写入同步是更困难的工程问题,因此我将专注于此。
The traditional approach to synchronizing writes requires distributed transactions across heterogeneous storage systems [ 18 ], which I think is the wrong solution (see “Derived data versus distributed transactions” ). Transactions within a single storage or stream processing system are feasible, but when data crosses the boundary between different technologies, I believe that an asynchronous event log with idempotent writes is a much more robust and practical approach.
传统的同步写入方法需要在异构存储系统中进行分布式事务[18],我认为这是错误的解决方案(请参阅“派生数据与分布式事务”)。在单个存储或流处理系统中的事务是可行的,但当数据越过不同技术之间的界限时,我认为使用具有幂等写入的异步事件日志是更强大和实用的方法。
For example, distributed transactions are used within some stream processors to achieve exactly-once semantics (see “Atomic commit revisited” ), and this can work quite well. However, when a transaction would need to involve systems written by different groups of people (e.g., when data is written from a stream processor to a distributed key-value store or search index), the lack of a standardized transaction protocol makes integration much harder. An ordered log of events with idempotent consumers (see “Idempotence” ) is a much simpler abstraction, and thus much more feasible to implement across heterogeneous systems [ 7 ].
例如,一些流处理器在实现正好一次语义时使用分布式事务(请参见“重新审视原子提交”),这可以很好地工作。但是,当事务需要涉及由不同团队编写的系统时(例如,从流处理器将数据写入分布式键值存储或搜索索引时),缺乏标准化的事务协议会使集成变得更加困难。具有幂等消费者的事件有序日志(请参见“幂等性”)是一个更简单的抽象,因此更容易在异构系统中实现[7]。
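As a sketch of why this abstraction is simpler (illustrative Python; offset-based deduplication is one common way of achieving idempotence against an ordered log, not any specific product's API):

```python
# Illustrative sketch: an idempotent consumer of an ordered event log.
# Redelivered events are detected by their log offset, so applying the
# log with duplicates yields the same derived state as applying it once.

class IdempotentConsumer:
    def __init__(self):
        self.state = {}            # the derived key-value view
        self.last_offset = -1      # highest offset already applied

    def apply(self, offset, key, value):
        if offset <= self.last_offset:
            return                 # duplicate delivery: safely ignore
        self.state[key] = value
        self.last_offset = offset

# An ordered log in which some events are redelivered (e.g. after a retry):
log = [(0, "x", 1), (1, "y", 2), (1, "y", 2), (0, "x", 1)]
c = IdempotentConsumer()
for offset, key, value in log:
    c.apply(offset, key, value)
```

No cross-system commit protocol is needed: each consumer only has to persist its own offset together with its derived state.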
The big advantage of log-based integration is loose coupling between the various components, which manifests itself in two ways:
日志为基础的集成的重要优势在于各组件之间的松散耦合性,表现为两方面:
-
At a system level, asynchronous event streams make the system as a whole more robust to outages or performance degradation of individual components. If a consumer runs slow or fails, the event log can buffer messages (see “Disk space usage” ), allowing the producer and any other consumers to continue running unaffected. The faulty consumer can catch up when it is fixed, so it doesn’t miss any data, and the fault is contained. By contrast, the synchronous interaction of distributed transactions tends to escalate local faults into large-scale failures (see “Limitations of distributed transactions” ).
在系统层面上,异步事件流使整个系统更具鲁棒性,能够应对单个组件的故障或性能降级。如果消费者运行缓慢或失败,事件日志可以缓冲消息(参见“磁盘空间使用”),使生产者和任何其他消费者可以继续运行而不受影响。当有故障的消费者修复后,它可以赶上进度,因此不会错过任何数据,同时也可以将故障局限在一个地方。相比之下,分布式事务的同步交互往往会将本地故障升级为大规模故障(参见“分布式事务的限制”)。
-
At a human level, unbundling data systems allows different software components and services to be developed, improved, and maintained independently from each other by different teams. Specialization allows each team to focus on doing one thing well, with well-defined interfaces to other teams’ systems. Event logs provide an interface that is powerful enough to capture fairly strong consistency properties (due to durability and ordering of events), but also general enough to be applicable to almost any kind of data.
从人类的角度来看,数据系统的拆解允许不同的软件组件和服务由不同的团队独立地开发、改进和维护。专业化使得每个团队专注于做好一件事,并与其他团队的系统有良好定义的接口。事件日志提供了一个接口,它足够强大以捕捉相当强的一致性属性(由于事件的持久性和排序),同时又足够通用以适用于几乎任何类型的数据。
Unbundled versus integrated systems
If unbundling does indeed become the way of the future, it will not replace databases in their current form—they will still be needed as much as ever. Databases are still required for maintaining state in stream processors, and in order to serve queries for the output of batch and stream processors (see “The Output of Batch Workflows” and “Processing Streams” ). Specialized query engines will continue to be important for particular workloads: for example, query engines in MPP data warehouses are optimized for exploratory analytic queries and handle this kind of workload very well (see “Comparing Hadoop to Distributed Databases” ).
如果解耦确实成为未来的趋势,它不会取代当前形式的数据库 - 它们仍然像以往一样必不可少。 数据库仍然需要用于在流处理器中维护状态,并为批处理和流处理器的输出提供查询服务(参见“批处理工作流程的输出”和“处理流”)。 专业化的查询引擎将继续对特定工作负载很重要:例如,MPP数据仓库中的查询引擎经过优化,非常适合探索式分析查询负载(请参见“比较Hadoop和分布式数据库”)。
The complexity of running several different pieces of infrastructure can be a problem: each piece of software has a learning curve, configuration issues, and operational quirks, and so it is worth deploying as few moving parts as possible. A single integrated software product may also be able to achieve better and more predictable performance on the kinds of workloads for which it is designed, compared to a system consisting of several tools that you have composed with application code [ 23 ]. As I said in the Preface , building for scale that you don’t need is wasted effort and may lock you into an inflexible design. In effect, it is a form of premature optimization.
运行多个不同基础设施的复杂性可能是个问题:每个软件都有学习曲线、配置问题和操作怪癖,因此值得部署尽可能少的活动部件。与由多个工具加应用代码组合而成的系统相比,单一的集成软件产品也可能在其设计目标的工作负载类型上实现更好、更可预测的性能 [23]。正如我在前言中所说,为不需要的规模而构建是浪费精力,并且可能将你锁定在不灵活的设计中。实际上,这是一种过早优化的形式。
The goal of unbundling is not to compete with individual databases on performance for particular workloads; the goal is to allow you to combine several different databases in order to achieve good performance for a much wider range of workloads than is possible with a single piece of software. It’s about breadth, not depth—in the same vein as the diversity of storage and processing models that we discussed in “Comparing Hadoop to Distributed Databases” .
解绑的目标不是为了在特定工作负载的性能方面与单个数据库竞争;目标是允许您组合几个不同的数据库,以实现比单个软件更宽的工作负载范围的良好性能。它是关于广度,而不是深度 - 就像我们在“比较Hadoop和分布式数据库”中讨论的存储和处理模型的多样性一样。
Thus, if there is a single technology that does everything you need, you’re most likely best off simply using that product rather than trying to reimplement it yourself from lower-level components. The advantages of unbundling and composition only come into the picture when there is no single piece of software that satisfies all your requirements.
因此,如果有一种技术可以满足你所有需求,最好的选择是直接使用该产品,而不是尝试从更低级别的组件重新实现它。只有在没有单一软件满足您所有要求时,拆分和组合的优点才会出现。
What’s missing?
The tools for composing data systems are getting better, but I think one major part is missing: we don’t yet have the unbundled-database equivalent of the Unix shell (i.e., a high-level language for composing storage and processing systems in a simple and declarative way).
组合数据系统的工具正在变得越来越好,但我认为还缺少一个主要部分:我们还没有与 Unix shell 等价的非捆绑数据库版本(即一种高级语言,用于以简单而声明式的方式组合存储和处理系统)。
For example, I would love it if we could simply declare mysql | elasticsearch, by analogy to Unix pipes [22], which would be the unbundled equivalent of CREATE INDEX: it would take all the documents in a MySQL database and index them in an Elasticsearch cluster. It would then continually capture all the changes made to the database and automatically apply them to the search index, without us having to write custom application code. This kind of integration should be possible with almost any kind of storage or indexing system.
例如,如果我们能够像Unix管道[22]的类比那样简单地声明mysql | elasticsearch,那将是CREATE INDEX的未捆绑等效项:它将获取MySQL数据库中的所有文档并将它们索引到Elasticsearch群集中。然后,它将不断捕获对数据库所做的所有更改并自动将它们应用于搜索索引,而无需我们编写自定义应用程序代码。几乎任何类型的存储或索引系统都应该能够实现这种集成。
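A toy version of such a pipeline might look like the following (plain Python, with dicts standing in for MySQL and Elasticsearch; no real client APIs are used, and a trivial word split stands in for real text analysis):

```python
# Illustrative sketch of the hypothetical `mysql | elasticsearch` pipe:
# a change stream from one store is continually applied to a search index.

def apply_change(search_index, change):
    """Apply one change event from the database's log to the search index."""
    if change["op"] == "delete":
        search_index.pop(change["id"], None)
    else:  # insert or update: index the document's words
        search_index[change["id"]] = set(change["doc"].lower().split())

def search(search_index, word):
    return sorted(doc_id for doc_id, words in search_index.items()
                  if word in words)

index = {}
changes = [
    {"op": "insert", "id": 1, "doc": "hello world"},
    {"op": "insert", "id": 2, "doc": "hello stream"},
    {"op": "delete", "id": 1},
]
for change in changes:          # in reality: an unbounded change stream
    apply_change(index, change)
```

The point of the analogy is that the consumer loop never queries the source database directly: everything it needs arrives on the change stream.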
Similarly, it would be great to be able to precompute and update caches more easily. Recall that a materialized view is essentially a precomputed cache, so you could imagine creating a cache by declaratively specifying materialized views for complex queries, including recursive queries on graphs (see “Graph-Like Data Models” ) and application logic. There is interesting early-stage research in this area, such as differential dataflow [ 24 , 25 ], and I hope that these ideas will find their way into production systems.
同样,如果能够更容易地预先计算和更新缓存,那将非常好。回想一下,物化视图本质上是一个预计算的缓存,因此你可以想象通过声明式地为复杂查询指定物化视图来创建缓存,包括图上的递归查询(参见“图数据模型”)和应用逻辑。该领域有一些有趣的早期研究,例如差分数据流(differential dataflow)[24,25],我希望这些想法能进入生产系统。
Designing Applications Around Dataflow
The approach of unbundling databases by composing specialized storage and processing systems with application code is also becoming known as the “database inside-out” approach [ 26 ], after the title of a conference talk I gave in 2014 [ 27 ]. However, calling it a “new architecture” is too grandiose. I see it more as a design pattern, a starting point for discussion, and we give it a name simply so that we can better talk about it.
通过将专门的存储和处理系统与应用程序代码组合来分解数据库的方法,也被称为“数据库由内而外(database inside-out)”方法[26],得名于我在2014年一次会议演讲的标题[27]。然而,称其为“新架构”太过夸张。我认为它更像是一种设计模式,是讨论的起点,我们给它一个名称只是为了能更好地谈论它。
These ideas are not mine; they are simply an amalgamation of other people’s ideas from which I think we should learn. In particular, there is a lot of overlap with dataflow languages such as Oz [ 28 ] and Juttle [ 29 ], functional reactive programming (FRP) languages such as Elm [ 30 , 31 ], and logic programming languages such as Bloom [ 32 ]. The term unbundling in this context was proposed by Jay Kreps [ 7 ].
这些想法不是我的,只是其他人的想法的融合,我认为我们应该从中学习。特别地,与数据流语言(如Oz [28]和Juttle [29])、函数响应式编程(FRP)语言(如Elm [30,31])和逻辑编程语言(如Bloom [32])有很多重叠之处。在这种情况下,“解捆绑”这个术语是由Jay Kreps [7]提出的。
Even spreadsheets have dataflow programming capabilities that are miles ahead of most mainstream programming languages [ 33 ]. In a spreadsheet, you can put a formula in one cell (for example, the sum of cells in another column), and whenever any input to the formula changes, the result of the formula is automatically recalculated. This is exactly what we want at a data system level: when a record in a database changes, we want any index for that record to be automatically updated, and any cached views or aggregations that depend on the record to be automatically refreshed. You should not have to worry about the technical details of how this refresh happens, but be able to simply trust that it works correctly.
即使是电子表格也具有数据流编程能力,这方面甚至比大多数主流编程语言还要强大[33]。在电子表格中,你可以在一个单元格中输入公式(例如,另一列单元格的总和),并且每当公式的任何输入发生改变,公式的结果就会自动重新计算。这正是我们在数据系统级别想要的:当数据库中的记录发生更改时,我们希望自动更新该记录的任何索引,并自动刷新任何依赖于该记录的缓存视图或聚合。你不应该担心这种刷新是如何发生的技术细节,而是应该简单地相信它能正常工作。
Thus, I think that most data systems still have something to learn from the features that VisiCalc already had in 1979 [ 34 ]. The difference from spreadsheets is that today’s data systems need to be fault-tolerant, scalable, and store data durably. They also need to be able to integrate disparate technologies written by different groups of people over time, and reuse existing libraries and services: it is unrealistic to expect all software to be developed using one particular language, framework, or tool.
因此,我认为大多数数据系统仍需从VisiCalc在1979年已有的功能中学习[34]。与电子表格的不同之处在于,现代数据系统需要具备容错性、可扩展性和数据持久性存储。它们还需要能够整合不同时间段内由不同人编写的不同技术,并重用现有的库和服务:期望所有的软件都使用特定的语言、框架或工具进行开发是不现实的。
In this section I will expand on these ideas and explore some ways of building applications around the ideas of unbundled databases and dataflow.
在这个部分,我将进一步阐述这些想法并探讨围绕未捆绑的数据库和数据流构建应用程序的一些方法。
Application code as a derivation function
When one dataset is derived from another, it goes through some kind of transformation function. For example:
当一个数据集来自另一个数据集时,它会经过某种转换函数进行转换。例如:
-
A secondary index is a kind of derived dataset with a straightforward transformation function: for each row or document in the base table, it picks out the values in the columns or fields being indexed, and sorts by those values (assuming a B-tree or SSTable index, which are sorted by key, as discussed in Chapter 3 ).
二级索引是一种派生数据集,可以通过简单的转换函数实现:对于基础表中的每一行或文档,它会挑选出需要建立索引的列或字段的值,并按这些值排序(假设使用基于键的B树或SSTable索引,如第三章所讨论的)。
-
A full-text search index is created by applying various natural language processing functions such as language detection, word segmentation, stemming or lemmatization, spelling correction, and synonym identification, followed by building a data structure for efficient lookups (such as an inverted index).
通过应用各种自然语言处理功能,如语言检测、词语切分、词干提取或词形归并、拼写矫正和同义词识别,可以创建一个全文搜索索引,然后构建有效查找的数据结构(如倒排索引)。
-
In a machine learning system, we can consider the model as being derived from the training data by applying various feature extraction and statistical analysis functions. When the model is applied to new input data, the output of the model is derived from the input and the model (and hence, indirectly, from the training data).
在机器学习系统中,我们可以将模型视为通过应用各种特征提取和统计分析函数从训练数据中派生出来的。当该模型应用于新的输入数据时,模型的输出是从输入和模型(间接地从训练数据)派生出来的。
-
A cache often contains an aggregation of data in the form in which it is going to be displayed in a user interface (UI). Populating the cache thus requires knowledge of what fields are referenced in the UI; changes in the UI may require updating the definition of how the cache is populated and rebuilding the cache.
缓存通常包含聚合的数据,以它将要在用户界面(UI)中显示的形式为基础。因此,填充缓存需要知道在UI中引用了哪些字段;在UI中进行更改可能需要更新填充缓存的定义并重新构建缓存。
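The last of these examples can be sketched as a plain transformation function (illustrative Python; the field names and the aggregation are invented for the example):

```python
# Illustrative derivation function: base tables in, display-ready cache out.
# If the UI changes (say, it starts showing total spend), this function is
# redefined and the cache rebuilt from the base data.

def derive_profile_cache(users, orders):
    cache = {}
    for user_id, user in users.items():
        user_orders = [o for o in orders if o["user_id"] == user_id]
        cache[user_id] = {
            "display_name": user["name"].title(),  # field the UI references
            "order_count": len(user_orders),       # precomputed aggregation
        }
    return cache

users = {1: {"name": "ada lovelace"}}
orders = [{"user_id": 1, "total": 5}, {"user_id": 1, "total": 7}]
cache = derive_profile_cache(users, orders)
```

Viewed this way, the cache is just another derived dataset: a pure function of the base data that can be recomputed at any time.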
The derivation function for a secondary index is so commonly required that it is built into many databases as a core feature, and you can invoke it by merely saying CREATE INDEX. For full-text indexing, basic linguistic features for common languages may be built into a database, but the more sophisticated features often require domain-specific tuning. In machine learning, feature engineering is notoriously application-specific, and often has to incorporate detailed knowledge about the user interaction and deployment of an application [35].
次要索引的派生函数是如此常用,以至于它被构建进了许多数据库的核心功能,您只需说CREATE INDEX就可以调用它。对于全文索引,通用语言的基本语言特征可能被构建到数据库中,但更复杂的特征通常需要特定于领域的调整。在机器学习中,特征工程因应用程序而异,通常必须结合对用户交互和应用程序部署的详细知识[35]。
When the function that creates a derived dataset is not a standard cookie-cutter function like creating a secondary index, custom code is required to handle the application-specific aspects. And this custom code is where many databases struggle. Although relational databases commonly support triggers, stored procedures, and user-defined functions, which can be used to execute application code within the database, they have been somewhat of an afterthought in database design (see “Transmitting Event Streams” ).
当用于创建派生数据集的函数不是标准的模板函数,例如创建二级索引,需要编写定制代码来处理特定于应用程序的方面。这些定制代码是许多数据库的难点。尽管关系数据库通常支持触发器、存储过程和用户定义函数,可以用于在数据库中执行应用程序代码,但它们在数据库设计中有些被忽略(见“传输事件流”)。
Separation of application code and state
In theory, databases could be deployment environments for arbitrary application code, like an operating system. However, in practice they have turned out to be poorly suited for this purpose. They do not fit well with the requirements of modern application development, such as dependency and package management, version control, rolling upgrades, evolvability, monitoring, metrics, calls to network services, and integration with external systems.
在理论上,数据库可以是任意应用程序代码的部署环境,就像操作系统一样。然而,在实际应用中,它们被证明不适合这个目的。它们与现代应用程序开发的要求不相符,如依赖和包管理、版本控制、滚动升级、可演进性、监控、指标、调用网络服务和与外部系统集成。
On the other hand, deployment and cluster management tools such as Mesos, YARN, Docker, Kubernetes, and others are designed specifically for the purpose of running application code. By focusing on doing one thing well, they are able to do it much better than a database that provides execution of user-defined functions as one of its many features.
另一方面,部署和集群管理工具如Mesos、YARN、Docker、Kubernetes等都是专门为运行应用程序代码而设计的。通过专注于做好一件事情,它们能够比提供执行用户定义功能作为其众多功能之一的数据库做得更好。
I think it makes sense to have some parts of a system that specialize in durable data storage, and other parts that specialize in running application code. The two can interact while still remaining independent.
我认为在系统中有一些部分专门负责持久数据存储,其他部分专门负责运行应用程序代码是有意义的。这两者可以互相交互,同时仍然保持独立。
Most web applications today are deployed as stateless services, in which any user request can be routed to any application server, and the server forgets everything about the request once it has sent the response. This style of deployment is convenient, as servers can be added or removed at will, but the state has to go somewhere: typically, a database. The trend has been to keep stateless application logic separate from state management (databases): not putting application logic in the database and not putting persistent state in the application [ 36 ]. As people in the functional programming community like to joke, “We believe in the separation of Church and state” [ 37 ]. i
当今大多数Web应用程序都作为无状态服务部署,其中任何用户请求都可以路由到任何应用程序服务器,并且服务器在发送响应后就忘记了有关该请求的一切。这种部署风格很方便,因为可以随意添加或删除服务器,但状态必须存在于某个地方:通常是数据库。因此,趋势是将无状态的应用逻辑与状态管理(数据库)分开:不把应用逻辑放进数据库,也不把持久状态放进应用程序 [36]。正如函数式编程社区的人们喜欢开玩笑的那样:“我们信奉教会(Church)与状态(state)的分离” [37]。
In this typical web application model, the database acts as a kind of mutable shared variable that can be accessed synchronously over the network. The application can read and update the variable, and the database takes care of making it durable, providing some concurrency control and fault tolerance.
在这种典型的Web应用程序模型中,数据库充当一种可在网络上同步访问的可变共享变量。应用程序可以读取并更新该变量,数据库负责使其持久,提供一些并发控制和容错能力。
However, in most programming languages you cannot subscribe to changes in a mutable variable—you can only read it periodically. Unlike in a spreadsheet, readers of the variable don’t get notified if the value of the variable changes. (You can implement such notifications in your own code—this is known as the observer pattern —but most languages do not have this pattern as a built-in feature.)
然而,在大多数编程语言中,您无法订阅可变变量的更改 - 您只能定期读取它。与电子表格不同,变量的读者如果变量的值发生更改,不会收到通知。 (您可以在自己的代码中实现此类通知 - 这称为观察者模式 - 但大多数语言没有此模式作为内置功能。)
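A minimal sketch of that observer pattern in Python (hand-rolled, since it is not a built-in language feature; the class name is invented for the example):

```python
# Illustrative sketch: a mutable variable whose readers are notified on
# every change, instead of having to poll it periodically.

class ObservableVar:
    def __init__(self, value):
        self._value = value
        self._observers = []

    def subscribe(self, callback):
        self._observers.append(callback)

    def set(self, value):
        self._value = value
        for callback in self._observers:   # push the change to subscribers
            callback(value)

    def get(self):
        return self._value

seen = []
var = ObservableVar(0)
var.subscribe(seen.append)    # e.g. recompute a derived value on change
var.set(1)
var.set(2)
```

This is the spreadsheet model in miniature: the subscriber is told about every change, rather than rereading the variable and hoping it hasn't missed one.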
Databases have inherited this passive approach to mutable data: if you want to find out whether the content of the database has changed, often your only option is to poll (i.e., to repeat your query periodically). Subscribing to changes is only just beginning to emerge as a feature (see “API support for change streams” ).
数据库已经继承了这种对可变数据的被动方法:如果您想了解数据库内容是否更改,通常您唯一的选择是轮询(即定期重复查询)。订阅更改仅在最近作为一项功能出现(参见“支持变更流的API”)。
Dataflow: Interplay between state changes and application code
Thinking about applications in terms of dataflow implies renegotiating the relationship between application code and state management. Instead of treating a database as a passive variable that is manipulated by the application, we think much more about the interplay and collaboration between state, state changes, and code that processes them. Application code responds to state changes in one place by triggering state changes in another place.
考虑数据流应用意味着重新协商应用代码和状态管理之间的关系。我们不再将数据库视为应用程序操作的被动变量,而是更多地考虑状态、状态变化和处理它们的代码之间的相互作用和协作。应用代码通过在一个地方响应状态变化来触发另一个地方的状态变化。
We saw this line of thinking in “Databases and Streams” , where we discussed treating the log of changes to a database as a stream of events that we can subscribe to. Message-passing systems such as actors (see “Message-Passing Dataflow” ) also have this concept of responding to events. Already in the 1980s, the tuple spaces model explored expressing distributed computations in terms of processes that observe state changes and react to them [ 38 , 39 ].
我们在“数据库与流”中看到了这种思路:我们讨论了将数据库的变更日志视为可以订阅的事件流。诸如 Actor 之类的消息传递系统(见“消息传递数据流”)也有这种响应事件的概念。早在20世纪80年代,元组空间(tuple spaces)模型就已探索用观察状态变化并对其作出反应的进程来表达分布式计算 [38,39]。
As discussed, similar things happen inside a database when a trigger fires due to a data change, or when a secondary index is updated to reflect a change in the table being indexed. Unbundling the database means taking this idea and applying it to the creation of derived datasets outside of the primary database: caches, full-text search indexes, machine learning, or analytics systems. We can use stream processing and messaging systems for this purpose.
讨论过后,当触发器由于数据更改被触发,或者当二级索引更新以反映被索引的表的更改时,类似的事情会发生在数据库内部。将数据库解绑意味着将这个想法应用于在主数据库之外创建派生数据集的过程中:缓存、全文搜索索引、机器学习或分析系统。我们可以使用流处理和消息系统来实现这个目的。
The important thing to keep in mind is that maintaining derived data is not the same as asynchronous job execution, for which messaging systems are traditionally designed (see “Logs compared to traditional messaging” ):
需要记住的重要事情是,维护派生数据并不等同于异步作业执行,传统的消息系统正是为此而设计的(见“与传统消息相比的日志”):
-
When maintaining derived data, the order of state changes is often important (if several views are derived from an event log, they need to process the events in the same order so that they remain consistent with each other). As discussed in “Acknowledgments and redelivery” , many message brokers do not have this property when redelivering unacknowledged messages. Dual writes are also ruled out (see “Keeping Systems in Sync” ).
在维护派生数据时,状态更改的顺序通常很重要(如果从事件日志派生了多个视图,则它们需要按相同的顺序处理事件,以保持彼此一致)。如“确认和重新交付”中所讨论的,许多消息代理在重新传递未经确认的消息时没有此属性。双重写入也被排除(请参见“保持系统同步”)。
-
Fault tolerance is key for derived data: losing just a single message causes the derived dataset to go permanently out of sync with its data source. Both message delivery and derived state updates must be reliable. For example, many actor systems by default maintain actor state and messages in memory, so they are lost if the machine running the actor crashes.
容错性对派生数据至关重要:哪怕只丢失一条消息,都会导致派生数据集与其数据源永久失去同步。消息传递和派生状态更新都必须可靠。例如,许多 Actor 系统默认将 Actor 状态和消息保存在内存中,因此如果运行 Actor 的机器崩溃,它们就会丢失。
Stable message ordering and fault-tolerant message processing are quite stringent demands, but they are much less expensive and more operationally robust than distributed transactions. Modern stream processors can provide these ordering and reliability guarantees at scale, and they allow application code to be run as stream operators.
稳定的消息排序和容错的消息处理是非常严格的要求,但它们比分布式事务更便宜、更具操作韧性。现代的流处理器可以提供这些排序和可靠性保证,还允许应用程序代码作为流操作符运行。
This application code can do the arbitrary processing that built-in derivation functions in databases generally don’t provide. Like Unix tools chained by pipes, stream operators can be composed to build large systems around dataflow. Each operator takes streams of state changes as input, and produces other streams of state changes as output.
该应用程序代码可以进行内置派生函数通常不提供的任意处理。就像通过管道链接的Unix工具一样,流操作器可以组合以构建围绕数据流的大型系统。每个操作器都将状态更改流作为输入,并生成其他状态更改流作为输出。
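As an illustrative sketch of this composition (Python generators standing in for real stream processors; operator and field names are invented), two operators chained like Unix pipes:

```python
# Illustrative sketch: stream operators as composable functions over
# streams of state changes, each consuming one stream and producing another.

def filter_op(stream, predicate):
    """Pass through only the state changes matching the predicate."""
    for event in stream:
        if predicate(event):
            yield event

def enrich_op(stream, lookup):
    """Join each state change against a lookup table."""
    for event in stream:
        yield {**event, "extra": lookup[event["key"]]}

changes = [{"key": "a", "n": 1}, {"key": "b", "n": -1}, {"key": "a", "n": 2}]
lookup = {"a": "apple", "b": "banana"}

# Compose: input change stream -> filter -> enrich -> output change stream
out = list(enrich_op(filter_op(iter(changes), lambda e: e["n"] > 0), lookup))
```

Each operator is oblivious to what sits upstream or downstream of it, which is exactly what makes the pipeline composable.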
Stream processors and services
The currently trendy style of application development involves breaking down functionality into a set of services that communicate via synchronous network requests such as REST APIs (see “Dataflow Through Services: REST and RPC” ). The advantage of such a service-oriented architecture over a single monolithic application is primarily organizational scalability through loose coupling: different teams can work on different services, which reduces coordination effort between teams (as long as the services can be deployed and updated independently).
当前流行的应用程序开发风格是将功能分解为一组通过同步网络请求(例如 REST API)进行通信的服务(请参见“服务中的数据流:REST与RPC”)。与单体应用程序相比,这种面向服务架构的优势主要在于通过松耦合实现组织上的可扩展性:不同的团队可以负责不同的服务,从而减少团队之间的协调工作(只要服务可以独立部署和更新)。
Composing stream operators into dataflow systems has a lot of similar characteristics to the microservices approach [ 40 ]. However, the underlying communication mechanism is very different: one-directional, asynchronous message streams rather than synchronous request/response interactions.
把流运算符组成数据流系统与微服务方法有很多相似性[40]。然而,底层通信机制非常不同:单向的异步消息流而不是同步的请求/响应交互。
Besides the advantages listed in “Message-Passing Dataflow” , such as better fault tolerance, dataflow systems can also achieve better performance. For example, say a customer is purchasing an item that is priced in one currency but paid for in another currency. In order to perform the currency conversion, you need to know the current exchange rate. This operation could be implemented in two ways [ 40 , 41 ]:
除了“消息传递数据流”中列出的优点之外,例如更好的容错能力,数据流系统还可以实现更好的性能。例如,假设客户购买的商品定价为一种货币,但是用另一种货币支付。为了进行货币转换,您需要知道当前的汇率。此操作可以通过两种方式实现[40,41]:
-
In the microservices approach, the code that processes the purchase would probably query an exchange-rate service or database in order to obtain the current rate for a particular currency.
在微服务的方法中,处理购买的代码可能会查询一个汇率服务或数据库,以获取特定货币的当前汇率。
-
In the dataflow approach, the code that processes purchases would subscribe to a stream of exchange rate updates ahead of time, and record the current rate in a local database whenever it changes. When it comes to processing the purchase, it only needs to query the local database.
在数据流方法中,处理购买的代码将提前订阅汇率更新的流,并在汇率发生变化时将当前汇率记录在本地数据库中。在处理购买时,它只需要查询本地数据库。
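The dataflow variant of the two options above can be sketched in a few lines. The following is a minimal illustration with hypothetical names (not taken from any particular framework): the purchase processor subscribes to rate updates ahead of time and keeps the latest rate in a local table, so processing a purchase touches no remote service.

```python
# Sketch of the dataflow approach (illustrative names): the processor
# maintains a local replica of the exchange-rate table and performs a
# stream-table join when a purchase event arrives.

class PurchaseProcessor:
    def __init__(self):
        self.rates = {}  # local replica of the exchange-rate table

    def on_rate_update(self, event):
        # Called for every event on the rate-update stream
        self.rates[(event["from"], event["to"])] = event["rate"]

    def on_purchase(self, event):
        # Stream-table join: enrich the purchase from local state
        # instead of issuing a synchronous network request
        rate = self.rates[(event["currency"], event["pay_currency"])]
        return dict(event, amount_paid=event["price"] * rate)

processor = PurchaseProcessor()
processor.on_rate_update({"from": "USD", "to": "EUR", "rate": 0.9})
purchase = processor.on_purchase(
    {"price": 100.0, "currency": "USD", "pay_currency": "EUR"})
```

Here `on_rate_update` maintains the local copy of the table, and `on_purchase` is the stream-table join: the purchase is enriched without any call to a remote rate service.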
The second approach has replaced a synchronous network request to another service with a query to a local database (which may be on the same machine, even in the same process). Not only is the dataflow approach faster, but it is also more robust to the failure of another service. The fastest and most reliable network request is no network request at all! Instead of RPC, we now have a stream join between purchase events and exchange rate update events (see “Stream-table join (stream enrichment)” ).
第二种方法用对本地数据库的查询(可能就在同一台机器上,甚至在同一个进程中)取代了对另一个服务的同步网络请求。数据流方法不仅更快,而且在另一个服务失效时也更加稳健。最快、最可靠的网络请求就是根本没有网络请求!现在我们不再使用RPC,而是在购买事件和汇率更新事件之间进行流表连接(参见“流表连接(流扩充)”)。
The join is time-dependent: if the purchase events are reprocessed at a later point in time, the exchange rate will have changed. If you want to reconstruct the original output, you will need to obtain the historical exchange rate at the original time of purchase. No matter whether you query a service or subscribe to a stream of exchange rate updates, you will need to handle this time dependence (see “Time-dependence of joins” ).
连接是时间依赖性的:如果购买事件在以后的时间被重新处理,汇率将会发生变化。如果您想重建原始输出,您需要获得购买时刻的历史汇率。无论您查询服务还是订阅汇率更新流,您都需要处理这种时间依赖性(请参见“连接的时间依赖性”)。
Subscribing to a stream of changes, rather than querying the current state when needed, brings us closer to a spreadsheet-like model of computation: when some piece of data changes, any derived data that depends on it can swiftly be updated. There are still many open questions, for example around issues like time-dependent joins, but I believe that building applications around dataflow ideas is a very promising direction to go in.
订阅变更流,而不是在需要时查询当前状态,使我们更接近类似电子表格的计算模型:当某条数据发生变化时,依赖于它的任何派生数据都可以迅速更新。仍然有许多未解决的问题,例如围绕时间依赖连接的问题,但我相信围绕数据流思想构建应用程序是一个非常有前途的方向。
Observing Derived State
At an abstract level, the dataflow systems discussed in the last section give you a process for creating derived datasets (such as search indexes, materialized views, and predictive models) and keeping them up to date. Let’s call that process the write path : whenever some piece of information is written to the system, it may go through multiple stages of batch and stream processing, and eventually every derived dataset is updated to incorporate the data that was written. Figure 12-1 shows an example of updating a search index.
在抽象层面上,上一节中讨论的数据流系统为您提供了一种创建派生数据集(例如搜索索引、物化视图和预测模型)并使它们保持最新的过程。让我们把这个过程称为写入路径:每当某个信息被写入系统时,它可能经过多个批处理和流处理阶段,最终每个派生数据集都会更新以纳入被写入的数据。图12-1显示了更新搜索索引的示例。
But why do you create the derived dataset in the first place? Most likely because you want to query it again at a later time. This is the read path : when serving a user request you read from the derived dataset, perhaps perform some more processing on the results, and construct the response to the user.
但是你为什么首先要创建派生数据集呢?很可能是因为你想在以后的某个时间再次查询它。这是读取路径:在对用户请求进行服务时,您从派生数据集中读取,可能对结果进行更多处理,并构建向用户的响应。
Taken together, the write path and the read path encompass the whole journey of the data, from the point where it is collected to the point where it is consumed (probably by another human). The write path is the portion of the journey that is precomputed—i.e., that is done eagerly as soon as the data comes in, regardless of whether anyone has asked to see it. The read path is the portion of the journey that only happens when someone asks for it. If you are familiar with functional programming languages, you might notice that the write path is similar to eager evaluation, and the read path is similar to lazy evaluation.
总体而言,写路径和读路径涵盖了数据的整个旅程:从数据被收集,到数据被消费(可能是被另一个人)。写路径是旅程中被预先计算的部分,即只要数据进来就急切完成的部分,无论是否有人要求读取。读路径是旅程中只有当有人请求时才发生的部分。如果您熟悉函数式编程语言,可能会注意到写路径类似于急切求值(eager evaluation),读路径类似于惰性求值(lazy evaluation)。
The derived dataset is the place where the write path and the read path meet, as illustrated in Figure 12-1 . It represents a trade-off between the amount of work that needs to be done at write time and the amount that needs to be done at read time.
派生数据集是写路径和读路径相遇的地方,如图12-1所示。它代表了需要在写入时完成的工作量和需要在读取时完成的工作量之间的权衡。
Materialized views and caching
A full-text search index is a good example: the write path updates the index, and the read path searches the index for keywords. Both reads and writes need to do some work. Writes need to update the index entries for all terms that appear in the document. Reads need to search for each of the words in the query, and apply Boolean logic to find documents that contain all of the words in the query (an AND operator), or any synonym of each of the words (an OR operator).
全文搜索索引是一个很好的例子:写入路径更新索引,读取路径搜索关键字的索引。 读取和写入都需要做一些工作。 写入需要更新文档中出现的所有术语的索引条目。 读取需要搜索查询中的每个单词,并应用布尔逻辑以查找包含查询中所有单词(AND运算符)或每个单词的任何同义词(OR运算符)的文档。
If you didn’t have an index, a search query would have to scan over all documents (like grep), which would get very expensive if you had a large number of documents. No index means less work on the write path (no index to update), but a lot more work on the read path.
如果没有索引,搜索查询将不得不扫描所有文档(就像grep一样),如果您有大量文档,这将非常昂贵。 没有索引意味着在写路径上少一些工作(没有索引需要更新),但在读路径上需要更多的工作。
On the other hand, you could imagine precomputing the search results for all possible queries. In that case, you would have less work to do on the read path: no Boolean logic, just find the results for your query and return them. However, the write path would be a lot more expensive: the set of possible search queries that could be asked is infinite, and thus precomputing all possible search results would require infinite time and storage space. That wouldn’t work so well.
另一方面,你可以想象为所有可能的查询预先计算搜索结果。这样的话,读路径上的工作量会减少:不需要布尔逻辑,只需找到你的查询对应的结果并返回即可。然而,写路径会昂贵得多:可能被问到的搜索查询的集合是无穷的,因此预先计算所有可能的搜索结果需要无限的时间和存储空间,这是行不通的。
Another option would be to precompute the search results for only a fixed set of the most common queries, so that they can be served quickly without having to go to the index. The uncommon queries can still be served from the index. This would generally be called a cache of common queries, although we could also call it a materialized view, as it would need to be updated when new documents appear that should be included in the results of one of the common queries.
另一个选择是预先计算仅限一组最常见查询的搜索结果,以便可以快速提供它们,而无需访问索引。不常见的查询仍可以从索引中提供。这通常被称为常见查询的缓存,尽管我们也可以称之为物化视图,因为当新文档出现时,它需要更新以包含其中一个常见查询的结果。
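As a rough sketch of this idea (names are illustrative, not from any real search engine), the write path below maintains both the full index and a materialized view for a fixed set of common queries, while the read path serves the view when it can:

```python
# Sketch of a "cache of common queries" maintained as a materialized
# view: writes update both the index and the precomputed results for
# popular queries; reads prefer the view and fall back to the index.

from collections import defaultdict

COMMON_QUERIES = {"database"}  # fixed set of popular single-term queries

class SearchStore:
    def __init__(self):
        self.index = defaultdict(set)  # term -> set of document IDs
        self.view = {q: set() for q in COMMON_QUERIES}

    def add_document(self, doc_id, text):
        # Write path: update index entries for every term, and keep
        # the materialized view for common queries up to date
        for term in text.lower().split():
            self.index[term].add(doc_id)
            if term in self.view:
                self.view[term].add(doc_id)

    def search(self, term):
        # Read path: cheap precomputed result if available,
        # otherwise an ordinary index lookup
        if term in self.view:
            return self.view[term]
        return self.index.get(term, set())

store = SearchStore()
store.add_document(1, "a database stores data")
store.add_document(2, "streams of events")
```

The boundary between write path and read path is chosen per query here: common queries are precomputed eagerly, uncommon ones pay their cost at read time.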
From this example we can see that an index is not the only possible boundary between the write path and the read path. Caching of common search results is possible, and grep-like scanning without the index is also possible on a small number of documents. Viewed like this, the role of caches, indexes, and materialized views is simple: they shift the boundary between the read path and the write path. They allow us to do more work on the write path, by precomputing results, in order to save effort on the read path.
从这个例子中可以看到,索引并不是写路径和读路径之间唯一可能的边界:缓存常见的搜索结果是可行的,在文档数量很少时不用索引进行类似grep的扫描也是可行的。这样来看,缓存、索引和物化视图的作用很简单:它们移动了读路径和写路径之间的边界,允许我们通过预先计算结果在写路径上做更多的工作,从而节省读路径上的工作量。
Shifting the boundary between work done on the write path and the read path was in fact the topic of the Twitter example at the beginning of this book, in “Describing Load” . In that example, we also saw how the boundary between write path and read path might be drawn differently for celebrities compared to ordinary users. After 500 pages we have come full circle!
移动写路径和读路径之间的工作边界,实际上正是本书开头“描述负载”一节中Twitter示例的主题。在那个例子中,我们还看到,与普通用户相比,名人的写路径和读路径之间的边界可能需要以不同的方式划分。500页之后,我们兜了一圈又回到了原点!
Stateful, offline-capable clients
I find the idea of a boundary between write and read paths interesting because we can discuss shifting that boundary and explore what that shift means in practical terms. Let’s look at the idea in a different context.
我认为在写入和阅读路径之间设立边界的想法很有趣,因为我们可以讨论移动边界并探索它在实际方面的意义。让我们在不同的背景下看这个想法。
The huge popularity of web applications in the last two decades has led us to certain assumptions about application development that are easy to take for granted. In particular, the client/server model—in which clients are largely stateless and servers have the authority over data—is so common that we almost forget that anything else exists. However, technology keeps moving on, and I think it is important to question the status quo from time to time.
在过去二十年中,Web应用程序的巨大流行使我们对应用程序开发形成了一些容易被视为理所当然的假设。特别是客户端/服务器模型(其中客户端基本上是无状态的,而服务器拥有数据的权威)是如此普遍,以至于我们几乎忘记了还存在其他任何模式。然而,技术在不断进步,我认为有必要时常质疑现状。
Traditionally, web browsers have been stateless clients that can only do useful things when you have an internet connection (just about the only thing you could do offline was to scroll up and down in a page that you had previously loaded while online). However, recent “single-page” JavaScript web apps have gained a lot of stateful capabilities, including client-side user interface interaction and persistent local storage in the web browser. Mobile apps can similarly store a lot of state on the device and don’t require a round-trip to the server for most user interactions.
传统上,网络浏览器是无状态的客户端,只有在连接互联网时才能做有用的事情(离线时你几乎唯一能做的,就是在之前在线时加载的页面里上下滚动)。然而,近来的“单页”JavaScript Web应用已经获得了许多有状态的能力,包括客户端的用户界面交互,以及浏览器中的持久化本地存储。移动应用同样可以在设备上存储大量状态,而且大多数用户交互都不需要与服务器往返通信。
These changing capabilities have led to a renewed interest in offline-first applications that do as much as possible using a local database on the same device, without requiring an internet connection, and sync with remote servers in the background when a network connection is available [ 42 ]. Since mobile devices often have slow and unreliable cellular internet connections, it’s a big advantage for users if their user interface does not have to wait for synchronous network requests, and if apps mostly work offline (see “Clients with offline operation” ).
这些不断增强的能力重新激发了人们对离线优先(offline-first)应用的兴趣:这类应用尽可能地使用同一设备上的本地数据库,不需要互联网连接,并在网络连接可用时在后台与远程服务器同步[42]。由于移动设备的蜂窝网络连接通常缓慢且不可靠,如果用户界面不必等待同步的网络请求,并且应用大部分功能可以离线使用,对用户来说将是一个巨大的优势(参见“具有离线操作的客户端”)。
When we move away from the assumption of stateless clients talking to a central database and toward state that is maintained on end-user devices, a world of new opportunities opens up. In particular, we can think of the on-device state as a cache of state on the server . The pixels on the screen are a materialized view onto model objects in the client app; the model objects are a local replica of state in a remote datacenter [ 27 ].
当我们不再假设无状态的客户端与中央数据库交互,而转向维护在终端用户设备上的状态时,一个充满新机会的世界就打开了。特别是,我们可以把设备上的状态视为服务器上状态的缓存。屏幕上的像素是客户端应用中模型对象的物化视图;而模型对象则是远程数据中心中状态的本地副本[27]。
Pushing state changes to clients
In a typical web page, if you load the page in a web browser and the data subsequently changes on the server, the browser does not find out about the change until you reload the page. The browser only reads the data at one point in time, assuming that it is static—it does not subscribe to updates from the server. Thus, the state on the device is a stale cache that is not updated unless you explicitly poll for changes. (HTTP-based feed subscription protocols like RSS are really just a basic form of polling.)
在典型的网页中,如果您在浏览器中加载了页面,随后数据在服务器上发生了变化,浏览器在您重新加载页面之前并不会知道这一变化。浏览器只在某一时间点读取数据,假定它是静态的,并不会订阅来自服务器的更新。因此,设备上的状态是一份陈旧的缓存,除非你显式轮询变更,否则不会更新。(基于HTTP的信息流订阅协议,如RSS,实际上只是一种基本的轮询形式。)
More recent protocols have moved beyond the basic request/response pattern of HTTP: server-sent events (the EventSource API) and WebSockets provide communication channels by which a web browser can keep an open TCP connection to a server, and the server can actively push messages to the browser as long as it remains connected. This provides an opportunity for the server to actively inform the end-user client about any changes to the state it has stored locally, reducing the staleness of the client-side state.
较新的协议已经超越了HTTP的基本请求/响应模式:服务器推送事件(EventSource API)和WebSockets提供了通信渠道,通过它,Web浏览器可以保持与服务器的开放TCP连接,并且只要它保持连接,服务器就可以主动将消息推送到浏览器。这为服务器提供了机会,主动向最终用户客户端通知其本地存储的任何状态更改,减少客户端状态过时的可能性。
In terms of our model of write path and read path, actively pushing state changes all the way to client devices means extending the write path all the way to the end user. When a client is first initialized, it would still need to use a read path to get its initial state, but thereafter it could rely on a stream of state changes sent by the server. The ideas we discussed around stream processing and messaging are not restricted to running only in a datacenter: we can take the ideas further, and extend them all the way to end-user devices [ 43 ].
就我们的写路径和读路径模型而言,积极地将状态变化推送到客户端设备意味着将写路径延伸到最终用户。当客户端首次初始化时,它仍需要使用读路径来获取初始状态,但之后它可以依靠服务器发送的状态更改流。我们讨论过的关于流处理和消息传递的想法并不仅限于仅在数据中心运行:我们可以进一步发挥这些想法,并将它们延伸到最终用户设备[43]。
The devices will be offline some of the time, and unable to receive any notifications of state changes from the server during that time. But we already solved that problem: in “Consumer offsets” we discussed how a consumer of a log-based message broker can reconnect after failing or becoming disconnected, and ensure that it doesn’t miss any messages that arrived while it was disconnected. The same technique works for individual users, where each device is a small subscriber to a small stream of events.
设备会有一段时间处于离线状态,在此期间无法接收来自服务器的状态变更通知。但我们已经解决过这个问题:在“消费者偏移量”中,我们讨论了基于日志的消息代理的消费者如何在失败或断开连接后重新连接,并确保不会错过断开期间到达的任何消息。同样的技术也适用于单个用户:每个设备都是一个小型订阅者,订阅着一个小型事件流。
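A toy model of this technique (all names are hypothetical): the log retains events, and each device remembers the offset of the last event it has consumed, so a reconnecting device replays exactly what it missed while offline.

```python
# Sketch of consumer offsets applied to an offline-capable device:
# the broker's log keeps every event, and each device tracks its own
# offset so that missed events are replayed on reconnection.

class Log:
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def read_from(self, offset):
        # Return everything the subscriber missed, plus its new offset
        return self.events[offset:], len(self.events)

class Device:
    def __init__(self, log):
        self.log = log
        self.offset = 0   # offset of the last event this device saw
        self.state = {}   # local replica of server-side state

    def sync(self):
        missed, self.offset = self.log.read_from(self.offset)
        for event in missed:
            self.state.update(event)

log = Log()
phone = Device(log)
log.append({"unread": 1})
phone.sync()                 # online: sees the first change
log.append({"unread": 2})    # arrives while the phone is offline
phone.sync()                 # on reconnect, the missed event is replayed
```

Because the log is durable and the offset is per-subscriber, a device that was offline for a while catches up without any state change being lost.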
End-to-end event streams
Recent tools for developing stateful clients and user interfaces, such as the Elm language [ 30 ] and Facebook’s toolchain of React, Flux, and Redux [ 44 ], already manage internal client-side state by subscribing to a stream of events representing user input or responses from a server, structured similarly to event sourcing (see “Event Sourcing” ).
最近用于开发有状态客户端和用户界面的工具,例如Elm编程语言[30]和Facebook的React、Flux和Redux工具链[44],已经通过订阅一系列事件来管理客户端内部状态,这些事件代表用户输入或来自服务器的响应,类似于事件溯源(见“事件溯源”)。
It would be very natural to extend this programming model to also allow a server to push state-change events into this client-side event pipeline. Thus, state changes could flow through an end-to-end write path: from the interaction on one device that triggers a state change, via event logs and through several derived data systems and stream processors, all the way to the user interface of a person observing the state on another device. These state changes could be propagated with fairly low delay—say, under one second end to end.
将该编程模型扩展到允许服务器将状态更改事件推入客户端事件管道中,这将非常自然。因此,状态更改可以通过端到端的写入路径流动:从触发状态更改的设备上的交互,通过事件日志并通过多个派生数据系统和流处理器,一直到观察另一个设备上状态的人的用户界面。这些状态更改可以以相当低的延迟(例如,端到端低于1秒)传播。
Some applications, such as instant messaging and online games, already have such a “real-time” architecture (in the sense of interactions with low delay, not in the sense of “Response time guarantees” ). But why don’t we build all applications this way?
一些应用程序,例如即时通讯和在线游戏,已经拥有这样的“实时”架构(以低延迟的相互作用为特点,而不是“响应时间保证”的意义上)。但为什么我们不把所有的应用程序都建成这样呢?
The challenge is that the assumption of stateless clients and request/response interactions is very deeply ingrained in our databases, libraries, frameworks, and protocols. Many datastores support read and write operations where a request returns one response, but much fewer provide an ability to subscribe to changes—i.e., a request that returns a stream of responses over time (see “API support for change streams” ).
挑战在于,无状态客户端与请求/响应交互的假设已深深根植于我们的数据库、库、框架和协议中。许多数据存储支持一个请求返回一个响应的读写操作,但提供订阅变更能力(即一个请求随时间返回一系列响应)的数据存储要少得多(请参见“支持更改流的API”)。
In order to extend the write path all the way to the end user, we would need to fundamentally rethink the way we build many of these systems: moving away from request/response interaction and toward publish/subscribe dataflow [ 27 ]. I think that the advantages of more responsive user interfaces and better offline support would make it worth the effort. If you are designing data systems, I hope that you will keep in mind the option of subscribing to changes, not just querying the current state.
为了将写路径一直延伸到终端用户,我们需要从根本上重新思考构建许多这类系统的方式:从请求/响应交互转向发布/订阅数据流[27]。我认为,响应更灵敏的用户界面和更好的离线支持等优势,会让这些努力物有所值。如果你正在设计数据系统,我希望你记住订阅变更这个选项,而不仅仅是查询当前状态。
Reads are events too
We discussed that when a stream processor writes derived data to a store (database, cache, or index), and when user requests query that store, the store acts as the boundary between the write path and the read path. The store allows random-access read queries to the data that would otherwise require scanning the whole event log.
我们讨论过,当流处理器将派生数据写入存储(数据库、缓存或索引),并且用户请求查询该存储时,这个存储就充当了写路径和读路径之间的边界。该存储允许对数据进行随机访问的读取查询,否则就需要扫描整个事件日志。
In many cases, the data storage is separate from the streaming system. But recall that stream processors also need to maintain state to perform aggregations and joins (see “Stream Joins” ). This state is normally hidden inside the stream processor, but some frameworks allow it to also be queried by outside clients [ 45 ], turning the stream processor itself into a kind of simple database.
在许多情况下,数据存储与流处理系统是分开的。但回想一下,流处理器也需要维护状态以执行聚合和连接(参见“流连接”)。这种状态通常隐藏在流处理器内部,但一些框架也允许外部客户端查询这个状态[45],从而把流处理器本身变成了一种简单的数据库。
I would like to take that idea further. As discussed so far, the writes to the store go through an event log, while reads are transient network requests that go directly to the nodes that store the data being queried. This is a reasonable design, but not the only possible one. It is also possible to represent read requests as streams of events, and send both the read events and the write events through a stream processor; the processor responds to read events by emitting the result of the read to an output stream [ 46 ].
我想进一步探讨这个想法。到目前为止,对存储的写入都经过了事件日志,而读取则是瞬态的网络请求,直接发送到存储被查询的节点。这是一个合理的设计,但不是唯一可能的设计。还可以将读请求表示为事件流,将读事件和写事件都发送到流处理器;处理器通过将读事件的结果发射到输出流来响应读事件。
When both the writes and the reads are represented as events, and routed to the same stream operator in order to be handled, we are in fact performing a stream-table join between the stream of read queries and the database. The read event needs to be sent to the database partition holding the data (see “Request Routing” ), just like batch and stream processors need to copartition inputs on the same key when joining (see “Reduce-Side Joins and Grouping” ).
当读和写都被表示为事件,并被路由到同一个流运算符进行处理时,我们实际上是在读查询流和数据库之间执行流表连接。读事件需要被发送到持有该数据的数据库分区(参见“请求路由”),就像批处理和流处理器在连接时需要将输入按相同的键共同分区一样(参见“Reduce端连接与分组”)。
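A minimal sketch of reads-as-events (illustrative names, not a real framework's API): a single operator holds the state and consumes one input stream in which both writes and reads appear as events, answering read events on an output stream.

```python
# Sketch: representing read requests as events too. Both writes and
# reads are routed to the operator holding the state; a read event is
# answered by emitting its result onto the output stream, which makes
# request/response a stream-table join.

class StreamOperator:
    def __init__(self):
        self.table = {}   # state maintained by the operator
        self.output = []  # output stream of read results

    def handle(self, event):
        if event["type"] == "write":
            self.table[event["key"]] = event["value"]
        elif event["type"] == "read":
            # Join the read event against the table and emit the result
            self.output.append({"request_id": event["request_id"],
                                "value": self.table.get(event["key"])})

op = StreamOperator()
op.handle({"type": "write", "key": "x", "value": 42})
op.handle({"type": "read", "key": "x", "request_id": "r1"})
```

A one-off read passes through the operator once and is forgotten; a subscription would simply be a read event that stays joined against future writes to the same key.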
This correspondence between serving requests and performing joins is quite fundamental [ 47 ]. A one-off read request just passes the request through the join operator and then immediately forgets it; a subscribe request is a persistent join with past and future events on the other side of the join.
服务请求与执行连接之间的这种对应关系是相当根本的[47]。一次性的读请求只是把请求传过连接运算符,然后立即将其忘掉;而订阅请求则是一个持久化的连接,与连接另一侧的过去和未来的事件相连。
Recording a log of read events potentially also has benefits with regard to tracking causal dependencies and data provenance across a system: it would allow you to reconstruct what the user saw before they made a particular decision. For example, in an online shop, it is likely that the predicted shipping date and the inventory status shown to a customer affect whether they choose to buy an item [ 4 ]. To analyze this connection, you need to record the result of the user’s query of the shipping and inventory status.
记录所读事件的日志还可能具有跟踪系统中因果依赖和数据来源的好处:它将使您能够重建用户做出特定决策之前所看到的内容。例如,在在线商店中,预计的发货日期和向客户显示的库存状态可能会影响他们是否选择购买物品[4]。为了分析这种联系,您需要记录用户查询的发货和库存状态的结果。
Writing read events to durable storage thus enables better tracking of causal dependencies (see “Ordering events to capture causality” ), but it incurs additional storage and I/O cost. Optimizing such systems to reduce the overhead is still an open research problem [ 2 ]. But if you already log read requests for operational purposes, as a side effect of request processing, it is not such a great change to make the log the source of the requests instead.
将读取事件写入持久存储,从而实现更好的因果依赖跟踪(请参见“排序事件以捕获因果关系”),但会产生额外的存储和I/O成本。优化这种系统以减少开销仍然是一个开放的研究问题[2]。但如果您已经为运营目的记录了读取请求,作为请求处理的副作用,使日志成为请求的来源并不是一个很大的改变。
Multi-partition data processing
For queries that only touch a single partition, the effort of sending queries through a stream and collecting a stream of responses is perhaps overkill. However, this idea opens the possibility of distributed execution of complex queries that need to combine data from several partitions, taking advantage of the infrastructure for message routing, partitioning, and joining that is already provided by stream processors.
对于只涉及单个分区的查询,通过流发送查询并收集响应流的做法也许是小题大做。然而,这个想法开启了分布式执行复杂查询的可能性:这类查询需要合并来自多个分区的数据,而流处理器已经提供了消息路由、分区和连接的基础设施可供利用。
Storm’s distributed RPC feature supports this usage pattern (see “Message passing and RPC” ). For example, it has been used to compute the number of people who have seen a URL on Twitter—i.e., the union of the follower sets of everyone who has tweeted that URL [ 48 ]. As the set of Twitter users is partitioned, this computation requires combining results from many partitions.
Storm的分布式RPC功能支持这种使用模式(参见“消息传递和RPC”)。例如,它曾被用于计算在Twitter上看到过某个URL的人数,即所有转发过该URL的人的粉丝集合的并集[48]。由于Twitter的用户集是分区的,这个计算需要合并来自许多分区的结果。
Another example of this pattern occurs in fraud prevention: in order to assess the risk of whether a particular purchase event is fraudulent, you can examine the reputation scores of the user’s IP address, email address, billing address, shipping address, and so on. Each of these reputation databases is itself partitioned, and so collecting the scores for a particular purchase event requires a sequence of joins with differently partitioned datasets [ 49 ].
这种模式的另一个例子出现在欺诈防范中:为了评估某个购买事件是否为欺诈的风险,你可以检查该用户的IP地址、电子邮件地址、账单地址、收货地址等的信誉分数。这些信誉数据库本身都是分区的,因此为某个购买事件收集分数,需要与一系列以不同方式分区的数据集进行连接[49]。
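Sketching the fraud-scoring example (the partitioning scheme, score values, and field names are all made up for illustration): each reputation dataset is partitioned by its own key, so scoring one event is a sequence of routed lookups whose results are then combined.

```python
# Sketch: collecting reputation scores for one purchase event from
# several differently partitioned datasets. Each lookup is routed by
# that dataset's own partitioning key.

import zlib

def partition_for(key, n_partitions):
    # Deterministic hash routing of a key to a partition
    return zlib.crc32(key.encode()) % n_partitions

class PartitionedScores:
    def __init__(self, n_partitions=2):
        self.partitions = [{} for _ in range(n_partitions)]

    def put(self, key, score):
        self.partitions[partition_for(key, len(self.partitions))][key] = score

    def get(self, key, default=0.5):
        # Route the lookup to the partition holding this key
        return self.partitions[partition_for(key, len(self.partitions))].get(key, default)

ip_scores = PartitionedScores()      # partitioned by IP address
email_scores = PartitionedScores()   # partitioned by email address
ip_scores.put("203.0.113.7", 0.9)
email_scores.put("fraud@example.com", 0.2)

def fraud_score(purchase):
    # One routed lookup per dataset, then combine the results
    return (ip_scores.get(purchase["ip"]) +
            email_scores.get(purchase["email"])) / 2
```

In a real system each `get` would be a message routed to a remote partition; the point is that the routing-and-combining machinery is the same one stream processors already provide for joins.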
The internal query execution graphs of MPP databases have similar characteristics (see “Comparing Hadoop to Distributed Databases” ). If you need to perform this kind of multi-partition join, it is probably simpler to use a database that provides this feature than to implement it using a stream processor. However, treating queries as streams provides an option for implementing large-scale applications that run against the limits of conventional off-the-shelf solutions.
MPP数据库的内部查询执行图具有类似的特征(参见“Hadoop与分布式数据库的对比”)。如果你需要执行这种多分区连接,使用提供该功能的数据库可能比用流处理器来实现更简单。然而,将查询视为流,为实现那些触及传统现成解决方案极限的大规模应用提供了一种选择。
Aiming for Correctness
With stateless services that only read data, it is not a big deal if something goes wrong: you can fix the bug and restart the service, and everything returns to normal. Stateful systems such as databases are not so simple: they are designed to remember things forever (more or less), so if something goes wrong, the effects also potentially last forever—which means they require more careful thought [ 50 ].
对于仅读取数据的无状态服务而言,如果出了问题,也不会有太大的影响:可以修复错误并重新启动服务,一切都恢复正常。而针对数据库等有状态系统,则不那么简单:它们被设计为永久地(或多或少)保存信息,因此如果出现故障,其影响也具有潜在的永久性,这意味着需要更加谨慎地考虑。
We want to build applications that are reliable and correct (i.e., programs whose semantics are well defined and understood, even in the face of various faults). For approximately four decades, the transaction properties of atomicity, isolation, and durability ( Chapter 7 ) have been the tools of choice for building correct applications. However, those foundations are weaker than they seem: witness for example the confusion of weak isolation levels (see “Weak Isolation Levels” ).
我们希望构建可靠和正确的应用程序(即,程序具有明确定义和理解的语义,即使面对各种故障)。在大约四十年的时间里,原子性、隔离性和持久性(第7章)的事务属性一直是构建正确应用程序的首选工具。然而,这些基础比它们看起来要脆弱得多:例如,弱隔离级别的混淆(参见“弱隔离级别”)。
In some areas, transactions are being abandoned entirely and replaced with models that offer better performance and scalability, but much messier semantics (see for example “Leaderless Replication” ). Consistency is often talked about, but poorly defined (see “Consistency” and Chapter 9 ). Some people assert that we should “embrace weak consistency” for the sake of better availability, while lacking a clear idea of what that actually means in practice.
在某些领域,事务被完全抛弃,取而代之的是性能和可扩展性更好、但语义混乱得多的模型(例如参见“无主复制”)。一致性经常被人谈论,却往往定义不清(参见“一致性”和第9章)。有些人断言,为了更好的可用性,我们应当“拥抱弱一致性”,却缺乏对其在实践中究竟意味着什么的清晰认识。
For a topic that is so important, our understanding and our engineering methods are surprisingly flaky. For example, it is very difficult to determine whether it is safe to run a particular application at a particular transaction isolation level or replication configuration [ 51 , 52 ]. Often simple solutions appear to work correctly when concurrency is low and there are no faults, but turn out to have many subtle bugs in more demanding circumstances.
对于一个如此重要的主题,我们的理解和工程方法却出奇地薄弱。例如,很难确定在特定的事务隔离级别或复制配置下运行某个应用程序是否安全[51, 52]。通常,简单的方案在并发量低且没有故障时看似工作正常,但在要求更苛刻的情况下却会暴露出许多微妙的错误。
For example, Kyle Kingsbury’s Jepsen experiments [ 53 ] have highlighted the stark discrepancies between some products’ claimed safety guarantees and their actual behavior in the presence of network problems and crashes. Even if infrastructure products like databases were free from problems, application code would still need to correctly use the features they provide, which is error-prone if the configuration is hard to understand (which is the case with weak isolation levels, quorum configurations, and so on).
例如,Kyle Kingsbury的Jepsen实验[53]已经突显了一些产品所声称的安全保证与它们在网络问题和崩溃情况下的实际行为之间的明显差异。即使基础架构产品如数据库没有问题,应用程序代码仍然需要正确地使用它们提供的功能,如果配置难以理解,这很容易出现错误(这在弱隔离级别、仲裁配置等情况下是普遍存在的)。
If your application can tolerate occasionally corrupting or losing data in unpredictable ways, life is a lot simpler, and you might be able to get away with simply crossing your fingers and hoping for the best. On the other hand, if you need stronger assurances of correctness, then serializability and atomic commit are established approaches, but they come at a cost: they typically only work in a single datacenter (ruling out geographically distributed architectures), and they limit the scale and fault-tolerance properties you can achieve.
如果你的应用程序能容忍偶尔发生的数据损坏或不可预测的丢失,那么生活就简单多了,你可能只需祈祷一下,希望一切顺利。另一方面,如果你需要更强的正确性保证,那么串行化和原子提交是已经确立的方法,但它们也有成本:它们通常只在单个数据中心中有效(排除了地理分布式架构),并且限制了你可以实现的规模和容错性能。
While the traditional transaction approach is not going away, I also believe it is not the last word in making applications correct and resilient to faults. In this section I will suggest some ways of thinking about correctness in the context of dataflow architectures.
尽管传统的事务方法并没有消失,但我也认为它并不是使应用程序正确、能够抵御故障的最终定论。在本节中,我将提出在数据流架构的背景下思考正确性的一些方式。
The End-to-End Argument for Databases
Just because an application uses a data system that provides comparatively strong safety properties, such as serializable transactions, that does not mean the application is guaranteed to be free from data loss or corruption. For example, if an application has a bug that causes it to write incorrect data, or delete data from a database, serializable transactions aren’t going to save you.
仅仅因为应用程序使用了提供较强安全属性(例如可串行化事务)的数据系统,并不意味着该应用就一定不会发生数据丢失或损坏。例如,如果应用程序有一个bug,导致它写入错误的数据或从数据库中删除数据,可串行化事务也救不了你。
This example may seem frivolous, but it is worth taking seriously: application bugs occur, and people make mistakes. I used this example in “State, Streams, and Immutability” to argue in favor of immutable and append-only data, because it is easier to recover from such mistakes if you remove the ability of faulty code to destroy good data.
这个例子可能看起来无足轻重,但值得认真对待:应用程序会有bug,人也会犯错。我在“状态、流和不可变性”中用这个例子来支持不可变和仅追加的数据,因为如果消除了故障代码破坏良好数据的能力,从这类错误中恢复就会容易得多。
Although immutability is useful, it is not a cure-all by itself. Let’s look at a more subtle example of data corruption that can occur.
尽管不可变性很有用,但它本身并不能解决所有问题。让我们来看一个更微妙的数据破坏的例子。
Exactly-once execution of an operation
In “Fault Tolerance” we encountered an idea called exactly-once (or effectively-once ) semantics. If something goes wrong while processing a message, you can either give up (drop the message—i.e., incur data loss) or try again. If you try again, there is the risk that it actually succeeded the first time, but you just didn’t find out about the success, and so the message ends up being processed twice.
在“容错”中,我们遇到过一个叫做恰好一次(exactly-once,或实际效果一次,effectively-once)语义的概念。如果在处理消息时出现问题,你可以选择放弃(丢弃该消息,即造成数据丢失)或者重试。如果重试,就存在这样的风险:第一次其实已经成功了,只是你没有得知这次成功,结果消息被处理了两次。
Processing twice is a form of data corruption: it is undesirable to charge a customer twice for the same service (billing them too much) or increment a counter twice (overstating some metric). In this context, exactly-once means arranging the computation such that the final effect is the same as if no faults had occurred, even if the operation actually was retried due to some fault. We previously discussed a few approaches for achieving this goal.
处理两次是数据损坏的一种形式:为同一项服务向客户收费两次(多收了钱)或将计数器递增两次(夸大了某个指标)都是不可取的。在这种语境下,恰好一次意味着对计算进行安排,使得即使操作实际上由于某个故障而被重试,最终效果也与没有发生任何故障时相同。我们之前讨论过实现这一目标的几种方法。
One of the most effective approaches is to make the operation idempotent (see “Idempotence” ); that is, to ensure that it has the same effect, no matter whether it is executed once or multiple times. However, taking an operation that is not naturally idempotent and making it idempotent requires some effort and care: you may need to maintain some additional metadata (such as the set of operation IDs that have updated a value), and ensure fencing when failing over from one node to another (see “The leader and the lock” ).
最有效的方法之一是使操作幂等(参见“幂等性”),也就是确保无论执行一次还是多次,它都产生相同的效果。然而,把一个本质上不幂等的操作变为幂等需要一些努力和细心:你可能需要维护一些额外的元数据(例如已经更新过某个值的操作ID集合),并且在从一个节点故障切换到另一个节点时确保防护(fencing,参见“领导者与锁”)。
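As a minimal illustration of this bookkeeping (with hypothetical names), the counter below records the IDs of operations it has already applied, so a retried increment has no further effect:

```python
# Sketch: making a non-idempotent increment idempotent by tracking
# the set of operation IDs that have already been applied.

class Counter:
    def __init__(self):
        self.value = 0
        self.applied = set()  # extra metadata: IDs already applied

    def increment(self, op_id, amount=1):
        if op_id in self.applied:
            return self.value  # duplicate delivery: no further effect
        self.applied.add(op_id)
        self.value += amount
        return self.value

c = Counter()
c.increment("op-1")
c.increment("op-1")  # a retry of the same operation
c.increment("op-2")
```

Executing the same operation once or many times now yields the same final state, which is exactly the property a retrying caller needs.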
Duplicate suppression
The same pattern of needing to suppress duplicates occurs in many other places besides stream processing. For example, TCP uses sequence numbers on packets to put them in the correct order at the recipient, and to determine whether any packets were lost or duplicated on the network. Any lost packets are retransmitted and any duplicates are removed by the TCP stack before it hands the data to an application.
除了流处理之外,在许多其他地方也会出现同样的需要抑制重复的模式。例如,TCP使用数据包的序列号在接收端将它们按正确顺序重组,并判断网络上是否有数据包丢失或重复。丢失的数据包会被重传,重复的数据包会在数据交给应用程序之前被TCP协议栈删除。
However, this duplicate suppression only works within the context of a single TCP connection. Imagine the TCP connection is a client’s connection to a database, and it is currently executing the transaction in Example 12-1. In many databases, a transaction is tied to a client connection (if the client sends several queries, the database knows that they belong to the same transaction because they are sent on the same TCP connection). If the client suffers a network interruption and connection timeout after sending the COMMIT, but before hearing back from the database server, it does not know whether the transaction has been committed or aborted (Figure 8-1).
然而,这种重复抑制仅适用于单个TCP连接的上下文中。 想象一下TCP连接是客户端连接到数据库,目前正在执行示例12-1中的事务。在许多数据库中,事务与客户端连接相关联(如果客户端发送多个查询,则数据库知道它们属于同一事务,因为它们在同一TCP连接上发送)。如果客户端在提交后但在听到数据库服务器响应之前经历了网络中断和连接超时,则无法确定事务是已提交还是已中止(图8-1)。
Example 12-1. A nonidempotent transfer of money from one account to another
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance + 11.00 WHERE account_id = 1234;
UPDATE accounts SET balance = balance - 11.00 WHERE account_id = 4321;
COMMIT;
The client can reconnect to the database and retry the transaction, but now it is outside of the scope of TCP duplicate suppression. Since the transaction in Example 12-1 is not idempotent, it could happen that $22 is transferred instead of the desired $11. Thus, even though Example 12-1 is a standard example for transaction atomicity, it is actually not correct, and real banks do not work like this [ 3 ].
客户端可以重新连接到数据库并重试该事务,但此时已经超出了TCP重复抑制的范围。由于示例12-1中的事务不是幂等的,最终可能会转账22美元而不是预期的11美元。因此,尽管示例12-1是事务原子性的标准示例,它实际上并不正确,真正的银行并不会这样运作[3]。
Two-phase commit (see “Atomic Commit and Two-Phase Commit (2PC)” ) protocols break the 1:1 mapping between a TCP connection and a transaction, since they must allow a transaction coordinator to reconnect to a database after a network fault, and tell it whether to commit or abort an in-doubt transaction. Is this sufficient to ensure that the transaction will only be executed once? Unfortunately not.
两阶段提交协议(参见“原子提交和两阶段提交(2PC)”)打破了TCP连接与事务之间1:1的映射,因为它们必须允许事务协调器在网络故障之后重新连接到数据库,并告知其提交还是中止存疑事务。这是否足以确保事务只会执行一次?不幸的是,并不能。
Even if we can suppress duplicate transactions between the database client and server, we still need to worry about the network between the end-user device and the application server. For example, if the end-user client is a web browser, it probably uses an HTTP POST request to submit an instruction to the server. Perhaps the user is on a weak cellular data connection, and they succeed in sending the POST, but the signal becomes too weak before they are able to receive the response from the server.
即使我们可以在数据库客户端和服务器之间抑制重复事务,我们仍然需要担心终端用户设备和应用服务器之间的网络。例如,如果终端用户客户端是Web浏览器,它可能使用HTTP POST请求向服务器提交指令。也许用户正在使用信号很弱的蜂窝数据连接:他们成功发出了POST请求,但在能够收到服务器的响应之前,信号就变得太弱了。
In this case, the user will probably be shown an error message, and they may retry manually. Web browsers warn, “Are you sure you want to submit this form again?”—and the user says yes, because they wanted the operation to happen. (The Post/Redirect/Get pattern [ 54 ] avoids this warning message in normal operation, but it doesn’t help if the POST request times out.) From the web server’s point of view the retry is a separate request, and from the database’s point of view it is a separate transaction. The usual deduplication mechanisms don’t help.
在这种情况下,用户可能会看到一个错误消息,并可以手动重试。Web浏览器会提醒:“您确定要再次提交此表单吗?”-用户会选择“是”,因为他们想要进行操作。(Post/Redirect/Get模式[54]可以在正常操作中避免此警告消息,但如果POST请求超时则无法起作用。)从Web服务器的角度来看,重试是一个单独的请求,而从数据库的角度来看,它是一个单独的事务。通常的去重机制对此无效。
Operation identifiers
To make the operation idempotent through several hops of network communication, it is not sufficient to rely just on a transaction mechanism provided by a database—you need to consider the end-to-end flow of the request.
使操作在网络通信的多次跳跃中具有幂等性,仅依赖数据库提供的事务机制是不够的 - 您需要考虑请求的端到端流程。
For example, you could generate a unique identifier for an operation (such as a UUID) and include it as a hidden form field in the client application, or calculate a hash of all the relevant form fields to derive the operation ID [ 3 ]. If the web browser submits the POST request twice, the two requests will have the same operation ID. You can then pass that operation ID all the way through to the database and check that you only ever execute one operation with a given ID, as shown in Example 12-2 .
例如,您可以为操作生成唯一标识符(例如UUID),并将其包括为客户端应用程序中的隐藏表单字段,或计算所有相关表单字段的哈希以推导操作ID [3]。如果浏览器提交POST请求两次,则这两个请求将具有相同的操作ID。然后,您可以将该操作ID传递到数据库,并检查您只执行具有给定ID的一个操作,如示例12-2所示。
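As a sketch of how such an operation ID might be generated on the client side, the snippet below shows both options mentioned: a random UUID embedded once in the form, or a deterministic hash over the relevant form fields. The field names and the choice of SHA-256 are illustrative assumptions, not prescribed by the book.

```python
import hashlib
import uuid

def operation_id_from_fields(fields: dict) -> str:
    """Derive a deterministic operation ID by hashing the relevant form
    fields, so a retried submission of the same form yields the same ID.
    (Sketch; the canonical encoding is a hypothetical choice.)"""
    canonical = "&".join(f"{key}={fields[key]}" for key in sorted(fields))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Alternatively, generate a random UUID once when the form is rendered and
# embed it as a hidden field; a browser retry resubmits the same UUID.
form_uuid = str(uuid.uuid4())

fields = {"from_account": 4321, "to_account": 1234, "amount": "11.00"}
# The same fields always produce the same operation ID, so a duplicate
# POST carries a duplicate ID that the database can detect.
assert operation_id_from_fields(fields) == operation_id_from_fields(dict(fields))
```

Either way, the ID travels with the request all the way to the database, where a uniqueness constraint (as in Example 12-2) rejects the second attempt.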
Example 12-2. Suppressing duplicate requests using a unique ID
ALTER TABLE requests ADD UNIQUE (request_id);

BEGIN TRANSACTION;
INSERT INTO requests
  (request_id, from_account, to_account, amount)
  VALUES ('0286FDB8-D7E1-423F-B40B-792B3608036C', 4321, 1234, 11.00);
UPDATE accounts SET balance = balance + 11.00 WHERE account_id = 1234;
UPDATE accounts SET balance = balance - 11.00 WHERE account_id = 4321;
COMMIT;
Example 12-2 relies on a uniqueness constraint on the request_id column. If a transaction attempts to insert an ID that already exists, the INSERT fails and the transaction is aborted, preventing it from taking effect twice. Relational databases can generally maintain a uniqueness constraint correctly, even at weak isolation levels (whereas an application-level check-then-insert may fail under nonserializable isolation, as discussed in “Write Skew and Phantoms”).
示例12-2依赖于request_id列上的唯一性约束。如果一个事务试图插入已经存在的ID,那么插入操作失败,并且事务被中止,防止其重复生效。关系型数据库通常可以正确维护唯一性约束,即使在弱隔离级别下(而应用程序级别的检查-然后插入可能在不可串行隔离下失败,如“写偏斜和幻象”中讨论的那样)。
Besides suppressing duplicate requests, the requests table in Example 12-2 acts as a kind of event log, hinting in the direction of event sourcing (see “Event Sourcing”). The updates to the account balances don’t actually have to happen in the same transaction as the insertion of the event, since they are redundant and could be derived from the request event in a downstream consumer—as long as the event is processed exactly once, which can again be enforced using the request ID.
除了抑制重复请求之外,示例12-2中的requests表还起到了一种事件日志的作用,暗示了事件溯源的方向(参见“事件溯源”)。账户余额的更新实际上不必与事件的插入发生在同一个事务中,因为它们是冗余的,可以由下游消费者从请求事件中派生出来——只要该事件恰好被处理一次即可,而这一点同样可以通过请求ID来保证。
The end-to-end argument
This scenario of suppressing duplicate transactions is just one example of a more general principle called the end-to-end argument , which was articulated by Saltzer, Reed, and Clark in 1984 [ 55 ]:
抑制重复事务的这个场景,只是一个更普遍的原则(称为端到端原则)的一个例子,该原则由Saltzer、Reed和Clark于1984年提出[55]:
The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the endpoints of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible. (Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement.)
该功能只有在通信系统端点应用的知识和帮助下才能完整且正确地实现。因此,将该功能作为通信系统本身的特性提供是不可能的。(有时,通信系统提供的不完整版本功能可能有益于提高性能。)
In our example, the function in question was duplicate suppression. We saw that TCP suppresses duplicate packets at the TCP connection level, and some stream processors provide so-called exactly-once semantics at the message processing level, but that is not enough to prevent a user from submitting a duplicate request if the first one times out. By themselves, TCP, database transactions, and stream processors cannot entirely rule out these duplicates. Solving the problem requires an end-to-end solution: a transaction identifier that is passed all the way from the end-user client to the database.
在我们的例子中,所讨论的功能是重复抑制。我们看到,TCP在TCP连接层面抑制重复的数据包,一些流处理器在消息处理层面提供所谓的恰好一次语义,但这还不足以防止用户在第一个请求超时后提交重复请求。TCP、数据库事务和流处理器本身都无法完全排除这些重复。解决这个问题需要端到端的解决方案:一个从终端用户客户端一直传递到数据库的事务标识符。
The end-to-end argument also applies to checking the integrity of data: checksums built into Ethernet, TCP, and TLS can detect corruption of packets in the network, but they cannot detect corruption due to bugs in the software at the sending and receiving ends of the network connection, or corruption on the disks where the data is stored. If you want to catch all possible sources of data corruption, you also need end-to-end checksums.
端到端的原则同样适用于检查数据的完整性:以太网、TCP和TLS内置的校验和可以检测网络中数据包的损坏,但无法检测由网络连接两端的收发软件中的bug导致的损坏,也无法检测数据所在磁盘上的损坏。如果您想捕捉所有可能的数据损坏来源,还需要端到端的校验和。
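A minimal sketch of such an end-to-end checksum: the application attaches a hash when it writes a record and verifies it when reading, so corruption introduced anywhere in between is detected regardless of what the lower layers checked. The storage format and the choice of SHA-256 are illustrative assumptions.

```python
import hashlib
import json

def write_with_checksum(record: dict) -> dict:
    """Serialize a record and attach an application-level checksum.
    (Hypothetical storage format, for illustration only.)"""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return {"payload": payload, "sha256": hashlib.sha256(payload).hexdigest()}

def read_and_verify(stored: dict) -> dict:
    """Verify the checksum end to end before trusting the data."""
    if hashlib.sha256(stored["payload"]).hexdigest() != stored["sha256"]:
        raise ValueError("end-to-end checksum mismatch: data corrupted")
    return json.loads(stored["payload"])

stored = write_with_checksum({"account": 1234, "balance": 50})
# A bit flip anywhere between writer and reader now raises an error on read.
```

The check happens at the endpoints, so it catches corruption that network or disk checksums alone would miss.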
A similar argument applies with encryption [ 55 ]: the password on your home WiFi network protects against people snooping your WiFi traffic, but not against attackers elsewhere on the internet; TLS/SSL between your client and the server protects against network attackers, but not against compromises of the server. Only end-to-end encryption and authentication can protect against all of these things.
类似的论点也适用于加密[55]:家庭WiFi网络的密码可以防止他人窥探您的WiFi流量,但无法防范互联网上其他位置的攻击者;客户端和服务器之间的TLS/SSL可以防范网络攻击者,但无法防范服务器本身被攻破。只有端到端的加密和身份验证才能防范所有这些威胁。
Although the low-level features (TCP duplicate suppression, Ethernet checksums, WiFi encryption) cannot provide the desired end-to-end features by themselves, they are still useful, since they reduce the probability of problems at the higher levels. For example, HTTP requests would often get mangled if we didn’t have TCP putting the packets back in the right order. We just need to remember that the low-level reliability features are not by themselves sufficient to ensure end-to-end correctness.
尽管底层特征(TCP重复抑制,以太网校验和,WiFi加密)本身不能提供所需的端到端功能,但它们仍然很有用,因为它们可以减少高层级出现问题的概率。例如,如果没有TCP将数据包按正确顺序排列,HTTP请求就很容易出错。我们只需要记住,仅靠底层可靠性特征是不足以确保端到端正确性的。
Applying end-to-end thinking in data systems
This brings me back to my original thesis: just because an application uses a data system that provides comparatively strong safety properties, such as serializable transactions, that does not mean the application is guaranteed to be free from data loss or corruption. The application itself needs to take end-to-end measures, such as duplicate suppression, as well.
这让我回到我的原始论点:即使一个应用程序使用了一个提供了相对较强的安全属性的数据系统,比如可串行化事务,也不意味着该应用程序必然没有数据丢失或损坏的风险。应用程序本身需要采取端到端的措施,例如重复抑制。
That is a shame, because fault-tolerance mechanisms are hard to get right. Low-level reliability mechanisms, such as those in TCP, work quite well, and so the remaining higher-level faults occur fairly rarely. It would be really nice to wrap up the remaining high-level fault-tolerance machinery in an abstraction so that application code needn’t worry about it—but I fear that we have not yet found the right abstraction.
很遗憾,因为容错机制很难正确实现。低层可靠性机制(例如 TCP 中的机制)工作得相当不错,因此剩余的高层故障出现得相当罕见。将剩余的高层容错机制封装在一个抽象层中,以使应用程序代码无需担心它,这将非常好,但我担心我们还没有找到正确的抽象层。
Transactions have long been seen as a good abstraction, and I do believe that they are useful. As discussed in the introduction to Chapter 7 , they take a wide range of possible issues (concurrent writes, constraint violations, crashes, network interruptions, disk failures) and collapse them down to two possible outcomes: commit or abort. That is a huge simplification of the programming model, but I fear that it is not enough.
事务长期以来被视为一个良好的抽象,我也相信它们是有用的。正如第7章引言中所讨论的,它们把各种可能出现的问题(并发写入、约束违反、崩溃、网络中断、磁盘故障)归结为两种可能的结果:提交或中止。这是对编程模型的巨大简化,但我担心这还不够。
Transactions are expensive, especially when they involve heterogeneous storage technologies (see “Distributed Transactions in Practice” ). When we refuse to use distributed transactions because they are too expensive, we end up having to reimplement fault-tolerance mechanisms in application code. As numerous examples throughout this book have shown, reasoning about concurrency and partial failure is difficult and counterintuitive, and so I suspect that most application-level mechanisms do not work correctly. The consequence is lost or corrupted data.
事务很昂贵,尤其是在涉及异构存储技术时(参见“实践中的分布式事务”)。当我们因为分布式事务太昂贵而拒绝使用它们时,最终不得不在应用程序代码中重新实现容错机制。正如本书中大量例子所表明的,对并发和部分故障进行推理既困难又违反直觉,因此我怀疑大多数应用层面的机制并不能正确工作,其后果就是数据的丢失或损坏。
For these reasons, I think it is worth exploring fault-tolerance abstractions that make it easy to provide application-specific end-to-end correctness properties, but also maintain good performance and good operational characteristics in a large-scale distributed environment.
出于这些原因,我认为值得探索容错抽象,以实现易于提供特定于应用程序的端到端正确性属性,同时在大规模分布式环境中维护良好的性能和运行特性。
Enforcing Constraints
Let’s think about correctness in the context of the ideas around unbundling databases ( “Unbundling Databases” ). We saw that end-to-end duplicate suppression can be achieved with a request ID that is passed all the way from the client to the database that records the write. What about other kinds of constraints?
让我们在“数据库分离”这个概念的语境下考虑正确性。我们发现,通过将请求ID从客户端传递到记录写入操作的数据库,可以实现端到端的重复数据抑制。那么其他类型的约束条件呢?
In particular, let’s focus on uniqueness constraints—such as the one we relied on in Example 12-2 . In “Constraints and uniqueness guarantees” we saw several other examples of application features that need to enforce uniqueness: a username or email address must uniquely identify a user, a file storage service cannot have more than one file with the same name, and two people cannot book the same seat on a flight or in a theater.
特别地,让我们专注于唯一性约束——例如我们在示例12-2中所依赖的那个。在“约束与唯一性保证”中,我们看到过其他几个需要强制唯一性的应用功能的例子:用户名或电子邮件地址必须唯一地标识一个用户,文件存储服务中不能有多个同名文件,两个人不能预订航班或剧院中的同一个座位。
Other kinds of constraints are very similar: for example, ensuring that an account balance never goes negative, that you don’t sell more items than you have in stock in the warehouse, or that a meeting room does not have overlapping bookings. Techniques that enforce uniqueness can often be used for these kinds of constraints as well.
其他类型的限制非常相似:例如,确保账户余额从不为负,仓库中不售出超过库存的物品,或者会议室不出现时间冲突的预订。强制唯一性的技术通常也可以用于这些类型的限制。
Uniqueness constraints require consensus
In Chapter 9 we saw that in a distributed setting, enforcing a uniqueness constraint requires consensus: if there are several concurrent requests with the same value, the system somehow needs to decide which one of the conflicting operations is accepted, and reject the others as violations of the constraint.
在第9章中,我们看到在分布式环境中,强制唯一性约束需要达成一致:如果有多个具有相同值的并发请求,系统需要决定哪个相冲突的操作被接受,并拒绝其他违反约束的操作。
The most common way of achieving this consensus is to make a single node the leader, and put it in charge of making all the decisions. That works fine as long as you don’t mind funneling all requests through a single node (even if the client is on the other side of the world), and as long as that node doesn’t fail. If you need to tolerate the leader failing, you’re back at the consensus problem again (see “Single-leader replication and consensus” ).
达成这种共识最常见的方法是让单个节点成为领导者,由它负责做出所有决策。只要您不介意让所有请求流经单个节点(即使客户端在世界的另一端),并且只要该节点不发生故障,这种方法就能正常工作。如果您需要容忍领导者故障,就又回到了共识问题(参见“单领导者复制与共识”)。
Uniqueness checking can be scaled out by partitioning based on the value that needs to be unique. For example, if you need to ensure uniqueness by request ID, as in Example 12-2 , you can ensure all requests with the same request ID are routed to the same partition (see Chapter 6 ). If you need usernames to be unique, you can partition by hash of username.
可以通过基于需要保证唯一性的值进行分区来扩展唯一性检查。例如,如果您需要像示例12-2中所示通过请求ID保证唯一性,您可以确保具有相同请求ID的所有请求路由到同一个分区(参见第6章)。如果您需要用户名唯一,则可以通过用户名的哈希进行分区。
However, asynchronous multi-master replication is ruled out, because it could happen that different masters concurrently accept conflicting writes, and thus the values are no longer unique (see “Implementing Linearizable Systems” ). If you want to be able to immediately reject any writes that would violate the constraint, synchronous coordination is unavoidable [ 56 ].
然而,异步多主复制被排除在外,因为可能发生不同的主节点并发接受冲突写入的情况,导致值不再唯一(参见“实现可线性化的系统”)。如果您希望能够立即拒绝任何会违反约束的写入,同步协调就是不可避免的[56]。
Uniqueness in log-based messaging
The log ensures that all consumers see messages in the same order—a guarantee that is formally known as total order broadcast and is equivalent to consensus (see “Total Order Broadcast” ). In the unbundled database approach with log-based messaging, we can use a very similar approach to enforce uniqueness constraints.
日志确保所有消费者按照相同顺序查看消息,这是一种正式称为全序广播的保证,并等同于共识(参见“全序广播”)。在基于日志的消息传递的非捆绑数据库方法中,我们可以使用非常相似的方法来强制执行唯一性约束。
A stream processor consumes all the messages in a log partition sequentially on a single thread (see “Logs compared to traditional messaging” ). Thus, if the log is partitioned based on the value that needs to be unique, a stream processor can unambiguously and deterministically decide which one of several conflicting operations came first. For example, in the case of several users trying to claim the same username [ 57 ]:
流处理器在单个线程上按顺序消费日志分区中的所有消息(请参见“日志与传统消息传递相比”)。因此,如果日志是根据需要唯一的值进行分区的,流处理器就可以毫无歧义地、确定性地判断几个相互冲突的操作中哪一个先到达。例如,在多个用户试图抢注同一个用户名的情况下[57]:
-
Every request for a username is encoded as a message, and appended to a partition determined by the hash of the username.
每个用户名的请求都被编码为消息,并附加到由用户名哈希确定的分区。
-
A stream processor sequentially reads the requests in the log, using a local database to keep track of which usernames are taken. For every request for a username that is available, it records the name as taken and emits a success message to an output stream. For every request for a username that is already taken, it emits a rejection message to an output stream.
流处理器按顺序读取日志中的请求,并使用一个本地数据库来跟踪哪些用户名已被占用。每当请求的用户名仍然可用时,它就将该名字记录为已占用,并向输出流发出一条成功消息;每当请求的用户名已被占用时,它就向输出流发出一条拒绝消息。
-
The client that requested the username watches the output stream and waits for a success or rejection message corresponding to its request.
请求用户名的客户端会监视输出流,并等待与其请求对应的成功或拒绝信息。
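The three steps above might be sketched as follows. The message shapes, the number of partitions, and the in-memory set standing in for the processor’s local database are simplifying assumptions for illustration; a real system would use a durable store and a log such as Kafka.

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(username: str) -> int:
    # Step 1: route every claim for the same username to the same partition,
    # so conflicting claims appear in a single totally ordered log.
    digest = hashlib.sha256(username.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

class UsernamePartitionProcessor:
    """Step 2: a single-threaded consumer of one log partition. Because it
    reads the partition sequentially, it decides deterministically which
    conflicting claim came first."""
    def __init__(self):
        self.taken = set()  # stands in for the processor's local database

    def process(self, request: dict) -> dict:
        username = request["username"]
        if username in self.taken:
            return {"request_id": request["request_id"], "status": "rejected"}
        self.taken.add(username)
        return {"request_id": request["request_id"], "status": "success"}

# Two clients race for the same username; log order decides the winner.
processor = UsernamePartitionProcessor()
log = [{"request_id": "r1", "username": "martin"},
       {"request_id": "r2", "username": "martin"}]
outcomes = [processor.process(msg) for msg in log]
# Step 3: each client watches the output stream for the message
# matching its own request_id.
```

Since each partition is processed independently, throughput scales by adding partitions, as the next paragraph notes.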
This algorithm is basically the same as in “Implementing linearizable storage using total order broadcast” . It scales easily to a large request throughput by increasing the number of partitions, as each partition can be processed independently.
该算法与“使用全序广播实现可线性化存储”中的算法基本相同。通过增加分区数量,它可以轻松扩展到很大的请求吞吐量,因为每个分区都可以独立处理。
The approach works not only for uniqueness constraints, but also for many other kinds of constraints. Its fundamental principle is that any writes that may conflict are routed to the same partition and processed sequentially. As discussed in “What is a conflict?” and “Write Skew and Phantoms” , the definition of a conflict may depend on the application, but the stream processor can use arbitrary logic to validate a request. This idea is similar to the approach pioneered by Bayou in the 1990s [ 58 ].
这种方法不仅适用于唯一性约束,也适用于许多其他类型的约束。其基本原则是:任何可能冲突的写入都被路由到同一个分区并按顺序处理。正如“什么是冲突?”和“写偏斜与幻象”中所讨论的,冲突的定义可能取决于应用程序,但流处理器可以使用任意逻辑来验证请求。这个想法与1990年代Bayou所开创的方法类似[58]。
Multi-partition request processing
Ensuring that an operation is executed atomically, while satisfying constraints, becomes more interesting when several partitions are involved. In Example 12-2 , there are potentially three partitions: the one containing the request ID, the one containing the payee account, and the one containing the payer account. There is no reason why those three things should be in the same partition, since they are all independent from each other.
当涉及到多个分区时,确保操作在满足约束条件的情况下原子执行变得更加有趣。在示例12-2中,可能有三个分区:包含请求ID的分区,包含收款人账户的分区和包含付款人账户的分区。这三件事没有理由在同一个分区中,因为它们彼此独立。
In the traditional approach to databases, executing this transaction would require an atomic commit across all three partitions, which essentially forces it into a total order with respect to all other transactions on any of those partitions. Since there is now cross-partition coordination, different partitions can no longer be processed independently, so throughput is likely to suffer.
在传统的数据库方法中,执行这个事务需要跨所有三个分区的原子提交,这实质上迫使它相对于这些分区上的所有其他事务进入一个全序。由于存在跨分区协调,不同分区不再能独立处理,因此吞吐量很可能受到影响。
However, it turns out that equivalent correctness can be achieved with partitioned logs, and without an atomic commit:
然而,事实证明,使用分区日志同样可以实现等价的正确性,而无需原子提交:
-
The request to transfer money from account A to account B is given a unique request ID by the client, and appended to a log partition based on the request ID.
客户在从账户A转账至账户B时,为该请求指定唯一请求ID并根据该ID将其附加到日志分区中。
-
A stream processor reads the log of requests. For each request message it emits two messages to output streams: a debit instruction to the payer account A (partitioned by A), and a credit instruction to the payee account B (partitioned by B). The original request ID is included in those emitted messages.
流处理器读取请求日志。对于每个请求消息,它向输出流发出两条消息:一条向付款人账户A发出的借方指令(按A分区),以及一条向收款人账户B发出的贷方指令(按B分区)。原始请求ID包含在这些发出的消息中。
-
Further processors consume the streams of credit and debit instructions, deduplicate by request ID, and apply the changes to the account balances.
进一步的处理器消耗信用和借记指令流,按请求ID进行去重,并将更改应用于账户余额。
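The fan-out in step 2 can be sketched as a small deterministic function: one request message in, one debit and one credit instruction out, each carrying the end-to-end request ID. The message shapes are hypothetical assumptions for illustration; in a real system each instruction would be appended to the log partition for its account.

```python
def fan_out_transfer(request: dict) -> tuple:
    """Step 2: derive a debit and a credit instruction from a single,
    durably logged transfer request. Being deterministic, reprocessing
    the same request after a crash yields the same two instructions."""
    debit = {"request_id": request["request_id"], "kind": "debit",
             "account": request["from_account"], "amount": request["amount"]}
    credit = {"request_id": request["request_id"], "kind": "credit",
              "account": request["to_account"], "amount": request["amount"]}
    # Each instruction would be routed to the partition for its account;
    # here we simply return both.
    return debit, credit

req = {"request_id": "0286FDB8", "from_account": 4321,
       "to_account": 1234, "amount": 11.00}
debit, credit = fan_out_transfer(req)
```

Because the request is logged atomically as a single message first, no cross-partition atomic commit is needed to keep the two derived instructions consistent.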
Steps 1 and 2 are necessary because if the client directly sent the credit and debit instructions, it would require an atomic commit across those two partitions to ensure that either both or neither happen. To avoid the need for a distributed transaction, we first durably log the request as a single message, and then derive the credit and debit instructions from that first message. Single-object writes are atomic in almost all data systems (see “Single-object writes” ), and so the request either appears in the log or it doesn’t, without any need for a multi-partition atomic commit.
步骤1和2是必要的,因为如果客户端直接发送信用和借记指令,就需要跨这两个分区进行原子提交,以确保两者要么都发生、要么都不发生。为了避免分布式事务,我们首先把请求作为单条消息持久地记录下来,然后再从这第一条消息派生出信用和借记指令。在几乎所有的数据系统中,单对象写入都是原子的(参见“单对象写入”),因此该请求要么出现在日志中,要么不出现,无需多分区的原子提交。
If the stream processor in step 2 crashes, it resumes processing from its last checkpoint. In doing so, it does not skip any request messages, but it may process requests multiple times and produce duplicate credit and debit instructions. However, since it is deterministic, it will just produce the same instructions again, and the processors in step 3 can easily deduplicate them using the end-to-end request ID.
如果步骤2中的流处理器崩溃,它会从上一个检查点恢复处理。这样一来,它不会跳过任何请求消息,但可能会多次处理某些请求,从而产生重复的信用和借记指令。不过,由于它是确定性的,它只会再次产生相同的指令,而步骤3中的处理器可以使用端到端的请求ID轻松地对其去重。
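The deduplicating consumer of step 3 might look like the sketch below: it remembers which (request ID, instruction kind) pairs it has already applied, so replayed instructions after a crash have no further effect. The message shapes and the in-memory state are illustrative assumptions; a real processor would keep this state in a durable store updated atomically with the balance.

```python
class AccountPartitionProcessor:
    """Step 3: apply credit/debit instructions to account balances,
    suppressing duplicates by the end-to-end request ID."""
    def __init__(self):
        self.balances = {}   # account -> balance
        self.applied = set() # (request_id, kind) pairs already applied

    def apply(self, instr: dict) -> None:
        key = (instr["request_id"], instr["kind"])
        if key in self.applied:
            return  # duplicate from reprocessing after a checkpoint; ignore
        self.applied.add(key)
        delta = instr["amount"] if instr["kind"] == "credit" else -instr["amount"]
        account = instr["account"]
        self.balances[account] = self.balances.get(account, 0) + delta

processor = AccountPartitionProcessor()
instr = {"request_id": "abc", "kind": "debit", "account": 4321, "amount": 11.00}
processor.apply(instr)
processor.apply(instr)  # a replayed duplicate changes nothing
```

This is what makes the upstream processor’s at-least-once delivery effectively exactly-once: the idempotent apply, keyed on the request ID, absorbs the duplicates.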
If you want to ensure that the payer account is not overdrawn by this transfer, you can additionally have a stream processor (partitioned by payer account number) that maintains account balances and validates transactions. Only valid transactions would then be placed in the request log in step 1.
如果您想确保付款人账户不会因这笔转账而透支,还可以增加一个流处理器(按付款人账户号分区)来维护账户余额并验证交易。这样,只有有效的交易才会在步骤1中被放入请求日志。
By breaking down the multi-partition transaction into two differently partitioned stages and using the end-to-end request ID, we have achieved the same correctness property (every request is applied exactly once to both the payer and payee accounts), even in the presence of faults, and without using an atomic commit protocol. The idea of using multiple differently partitioned stages is similar to what we discussed in “Multi-partition data processing” (see also “Concurrency control” ).
通过将多分区事务分解为两个不同分区的阶段,并使用端到端的请求ID,我们实现了相同的正确性属性(每个请求都恰好一次地应用于付款人和收款人的账户),即使出现故障也是如此,而且无需使用原子提交协议。使用多个不同分区阶段的想法,类似于我们在“多分区数据处理”中讨论的内容(另见“并发控制”)。
Timeliness and Integrity
A convenient property of transactions is that they are typically linearizable (see “Linearizability” ): that is, a writer waits until a transaction is committed, and thereafter its writes are immediately visible to all readers.
事务的一个方便性质是它们通常是可线性化(见“线性化”)的:即写者等待事务提交后,它的写入立即对所有读者可见。
This is not the case when unbundling an operation across multiple stages of stream processors: consumers of a log are asynchronous by design, so a sender does not wait until its message has been processed by consumers. However, it is possible for a client to wait for a message to appear on an output stream. This is what we did in “Uniqueness in log-based messaging” when checking whether a uniqueness constraint was satisfied.
当把一个操作拆分到流处理器的多个阶段时,情况则并非如此:日志的消费者在设计上就是异步的,因此发送者不会等到它的消息被消费者处理完毕。但是,客户端可以等待某条消息出现在输出流上。这正是我们在“基于日志的消息传递中的唯一性”中检查唯一性约束是否满足时所做的。
In this example, the correctness of the uniqueness check does not depend on whether the sender of the message waits for the outcome. The waiting only has the purpose of synchronously informing the sender whether or not the uniqueness check succeeded, but this notification can be decoupled from the effects of processing the message.
在这个例子中,唯一性检查的正确性不取决于消息的发送者是否等待结果。等待只是为了同步地通知发送者唯一性检查是否成功,但这种通知可以与处理消息的效果分离。
More generally, I think the term consistency conflates two different requirements that are worth considering separately:
更一般地说,我认为一致性一词混淆了两个不同的要求,值得分别考虑:
- Timeliness
-
Timeliness means ensuring that users observe the system in an up-to-date state. We saw previously that if a user reads from a stale copy of the data, they may observe it in an inconsistent state (see “Problems with Replication Lag” ). However, that inconsistency is temporary, and will eventually be resolved simply by waiting and trying again.
及时性意味着确保用户观察到的是系统的最新状态。我们之前看到,如果用户从陈旧的数据副本中读取,可能会观察到不一致的状态(参见“复制延迟的问题”)。但是,这种不一致是暂时的,最终只需等待并重试即可解决。
The CAP theorem (see “The Cost of Linearizability” ) uses consistency in the sense of linearizability, which is a strong way of achieving timeliness. Weaker timeliness properties like read-after-write consistency (see “Reading Your Own Writes” ) can also be useful.
CAP定理(参见“可线性化的代价”)中所使用的一致性是可线性化意义上的一致性,这是实现及时性的一种很强的方式。像写后读一致性(参见“读取您自己的写入”)这样较弱的及时性属性也可能很有用。
- Integrity
-
Integrity means absence of corruption; i.e., no data loss, and no contradictory or false data. In particular, if some derived dataset is maintained as a view onto some underlying data (see “Deriving current state from the event log” ), the derivation must be correct. For example, a database index must correctly reflect the contents of the database—an index in which some records are missing is not very useful.
完整性意味着没有损坏:即没有数据丢失,也没有相互矛盾或错误的数据。特别是,如果某个派生数据集是作为底层数据的视图来维护的(参见“从事件日志中派生当前状态”),这种派生就必须是正确的。例如,数据库索引必须正确地反映数据库的内容——缺失了某些记录的索引没有多大用处。
If integrity is violated, the inconsistency is permanent: waiting and trying again is not going to fix database corruption in most cases. Instead, explicit checking and repair is needed. In the context of ACID transactions (see “The Meaning of ACID” ), consistency is usually understood as some kind of application-specific notion of integrity. Atomicity and durability are important tools for preserving integrity.
如果完整性被破坏,这种不一致就是永久性的:在大多数情况下,等待并重试并不能修复数据库的损坏,而是需要显式的检查和修复。在ACID事务的上下文中(参见“ACID的含义”),一致性通常被理解为某种应用程序特定的完整性概念。原子性和持久性是维护完整性的重要工具。
In slogan form: violations of timeliness are “eventual consistency,” whereas violations of integrity are “perpetual inconsistency.”
用口号的形式来说:违反及时性是“最终一致性”,而违反完整性则是“永久的不一致”。
I am going to assert that in most applications, integrity is much more important than timeliness. Violations of timeliness can be annoying and confusing, but violations of integrity can be catastrophic.
我要断言,在大多数应用中,完整性比及时性重要得多。违反及时性可能令人恼火和困惑,但违反完整性可能是灾难性的。
For example, on your credit card statement, it is not surprising if a transaction that you made within the last 24 hours does not yet appear—it is normal that these systems have a certain lag. We know that banks reconcile and settle transactions asynchronously, and timeliness is not very important here [ 3 ]. However, it would be very bad if the statement balance was not equal to the sum of the transactions plus the previous statement balance (an error in the sums), or if a transaction was charged to you but not paid to the merchant (disappearing money). Such problems would be violations of the integrity of the system.
例如,在您的信用卡账单上,如果您在过去的 24 小时内进行的交易尚未出现,这并不奇怪——这些系统通常会有一定的滞后。我们知道,银行会异步协调和结算交易,时间性并不是非常重要 [3]。然而,如果账单余额与交易金额加上上期账单余额不相等(求和错误),或者如果某笔交易已经从您的账户中扣款但没有支付给商家(款项消失),这将非常糟糕。这些问题会违反系统的完整性。
Correctness of dataflow systems
ACID transactions usually provide both timeliness (e.g., linearizability) and integrity (e.g., atomic commit) guarantees. Thus, if you approach application correctness from the point of view of ACID transactions, the distinction between timeliness and integrity is fairly inconsequential.
ACID事务通常同时提供及时性(例如可线性化)和完整性(例如原子提交)的保证。因此,如果您从ACID事务的角度来看待应用程序的正确性,那么及时性和完整性之间的区别就无关紧要了。
On the other hand, an interesting property of the event-based dataflow systems that we have discussed in this chapter is that they decouple timeliness and integrity. When processing event streams asynchronously, there is no guarantee of timeliness, unless you explicitly build consumers that wait for a message to arrive before returning. But integrity is in fact central to streaming systems.
另一方面,我们在本章中讨论的基于事件的数据流系统具有一个有趣的特性,即它们将时效性和完整性解耦。异步处理事件流时,除非显式构建等待消息到达的消费者,否则无法保证时效性。但事实上,完整性对于流系统而言是至关重要的。
Exactly-once or effectively-once semantics (see “Fault Tolerance” ) is a mechanism for preserving integrity. If an event is lost, or if an event takes effect twice, the integrity of a data system could be violated. Thus, fault-tolerant message delivery and duplicate suppression (e.g., idempotent operations) are important for maintaining the integrity of a data system in the face of faults.
恰好一次(或等效一次)语义(参见“容错”)是一种维护完整性的机制。如果某个事件丢失,或者某个事件生效了两次,数据系统的完整性就可能遭到破坏。因此,在面对故障时,容错的消息传递和重复抑制(例如幂等操作)对于维护数据系统的完整性非常重要。
As we saw in the last section, reliable stream processing systems can preserve integrity without requiring distributed transactions and an atomic commit protocol, which means they can potentially achieve comparable correctness with much better performance and operational robustness. We achieved this integrity through a combination of mechanisms:
正如我们在上一节中所看到的,可靠的流处理系统可以在不需要分布式事务和原子提交协议的情况下保持完整性,这意味着它们有可能以好得多的性能和操作健壮性实现与之相当的正确性。我们通过以下机制的组合实现了这种完整性:
-
Representing the content of the write operation as a single message, which can easily be written atomically—an approach that fits very well with event sourcing (see “Event Sourcing” )
将写操作的内容表示为单个消息,可以轻松地原子性地写入 - 这种方法非常适合事件溯源(请参阅“事件溯源”)。
-
Deriving all other state updates from that single message using deterministic derivation functions, similarly to stored procedures (see “Actual Serial Execution” and “Application code as a derivation function” )
使用确定性派生函数从那个单一的消息中推导出所有其他的状态更新,类似于存储过程(参见“实际串行执行”和“应用代码作为派生函数”)
-
Passing a client-generated request ID through all these levels of processing, enabling end-to-end duplicate suppression and idempotence
将由客户端生成的请求ID通过所有级别的处理,实现端到端的重复抑制和幂等性。
-
Making messages immutable and allowing derived data to be reprocessed from time to time, which makes it easier to recover from bugs (see “Advantages of immutable events” )
使消息不可变,并允许不时地重新处理派生数据,这使得从bug中恢复变得更容易(参见“不可变事件的优势”)。
This combination of mechanisms seems to me a very promising direction for building fault-tolerant applications in the future.
这种机制的组合在未来构建容错性应用方面似乎是一个非常有前途的方向。
Loosely interpreted constraints
As discussed previously, enforcing a uniqueness constraint requires consensus, typically implemented by funneling all events in a particular partition through a single node. This limitation is unavoidable if we want the traditional form of uniqueness constraint, and stream processing cannot avoid it.
如先前所讨论的,强制执行唯一性约束需要共识,通常通过将特定分区中的所有事件导向单个节点来实现。如果我们想要传统形式的唯一性约束,那么这个限制是无法避免的,流处理也无法避免它。
However, another thing to realize is that many real applications can actually get away with much weaker notions of uniqueness:
然而,另一个需要认识到的事实是,许多真实的应用程序实际上可以接受弱得多的唯一性概念:
-
If two people concurrently register the same username or book the same seat, you can send one of them a message to apologize, and ask them to choose a different one. This kind of change to correct a mistake is called a compensating transaction [ 59 , 60 ].
如果两个人同时注册了相同的用户名或预订了同一个座位,您可以向其中一人发送消息道歉,并请他们另选一个。这种用来纠正错误的变更被称为补偿性事务[59, 60]。
-
If customers order more items than you have in your warehouse, you can order in more stock, apologize to customers for the delay, and offer them a discount. This is actually the same as what you’d have to do if, say, a forklift truck ran over some of the items in your warehouse, leaving you with fewer items in stock than you thought you had [ 61 ]. Thus, the apology workflow already needs to be part of your business processes anyway, and so it might be unnecessary to require a linearizable constraint on the number of items in stock.
如果客户订购的商品数量超过了仓库中的库存,您可以补充进货,为延误向客户道歉,并给他们提供折扣。这实际上与叉车压坏了仓库中的部分商品、导致实际库存比您以为的要少时[61],您所必须做的事情是一样的。因此,道歉的工作流程本来就需要成为您业务流程的一部分,所以也许没有必要对库存数量施加可线性化的约束。
-
Similarly, many airlines overbook airplanes in the expectation that some passengers will miss their flight, and many hotels overbook rooms, expecting that some guests will cancel. In these cases, the constraint of “one person per seat” is deliberately violated for business reasons, and compensation processes (refunds, upgrades, providing a complimentary room at a neighboring hotel) are put in place to handle situations in which demand exceeds supply. Even if there was no overbooking, apology and compensation processes would be needed in order to deal with flights being cancelled due to bad weather or staff on strike—recovering from such issues is just a normal part of business [ 3 ].
同样,许多航空公司会超售机票,预计有些乘客会错过航班;许多酒店也会超额预订客房,预计有些客人会取消。在这些情况下,出于商业原因,“一人一座”的约束被故意违反,并设置了补偿流程(退款、升舱、在附近酒店提供免费客房)来处理需求超过供给的情况。即使没有超售,为了应对因恶劣天气或员工罢工而导致的航班取消,也需要道歉和补偿流程——从这类问题中恢复只是业务的正常组成部分[3]。
-
If someone withdraws more money than they have in their account, the bank can charge them an overdraft fee and ask them to pay back what they owe. By limiting the total withdrawals per day, the risk to the bank is bounded.
如果有人从他们的账户中提取比他们拥有的钱更多的钱,银行可以收取透支费并要求他们偿还所欠款项。通过限制每日的总提款额,银行的风险得到了限制。
In many business contexts, it is actually acceptable to temporarily violate a constraint and fix it up later by apologizing. The cost of the apology (in terms of money or reputation) varies, but it is often quite low: you can’t unsend an email, but you can send a follow-up email with a correction. If you accidentally charge a credit card twice, you can refund one of the charges, and the cost to you is just the processing fees and perhaps a customer complaint. Once money has been paid out of an ATM, you can’t directly get it back, although in principle you can send debt collectors to recover the money if the account was overdrawn and the customer won’t pay it back.
在许多商业环境中,暂时违反某个约束、稍后再通过道歉来弥补,实际上是可以接受的。道歉的成本(金钱或声誉上的)各不相同,但通常相当低:您无法撤回一封电子邮件,但可以发送一封更正的后续邮件。如果您不小心对信用卡重复扣款,可以退还其中一笔,您的成本只是手续费,或许再加上一次客户投诉。一旦钱从ATM机里取出,您就无法直接把它拿回来,不过原则上,如果账户透支且客户拒不偿还,您可以请收债人追讨这笔钱。
Whether the cost of the apology is acceptable is a business decision. If it is acceptable, the traditional model of checking all constraints before even writing the data is unnecessarily restrictive, and a linearizable constraint is not needed. It may well be a reasonable choice to go ahead with a write optimistically, and to check the constraint after the fact. You can still ensure that the validation occurs before doing things that would be expensive to recover from, but that doesn’t imply you must do the validation before you even write the data.
道歉的成本是否可以接受是一个商业决策。如果可以接受,那么在写入数据之前就检查所有约束的传统模型就是不必要的限制,也不需要可线性化的约束。乐观地先写入、事后再检查约束,很可能是一个合理的选择。您仍然可以确保在执行那些恢复代价高昂的操作之前进行验证,但这并不意味着必须在写入数据之前就进行验证。
These applications do require integrity: you would not want to lose a reservation, or have money disappear due to mismatched credits and debits. But they don’t require timeliness on the enforcement of the constraint: if you have sold more items than you have in the warehouse, you can patch up the problem after the fact by apologizing. Doing so is similar to the conflict resolution approaches we discussed in “Handling Write Conflicts” .
这些应用确实需要完整性:您不希望丢失预订,也不希望资金因借贷记录不匹配而凭空消失。但它们并不要求及时地执行约束:如果您卖出的商品超过了仓库中的存货,可以事后通过道歉来弥补。这种做法类似于我们在“处理写入冲突”中讨论过的冲突解决方法。
Coordination-avoiding data systems
We have now made two interesting observations:
我们现在得出了两个有趣的观察:
-
Dataflow systems can maintain integrity guarantees on derived data without atomic commit, linearizability, or synchronous cross-partition coordination.
数据流系统可以在不进行原子提交、线性化或同步跨分区协调的情况下,保持派生数据的完整性保证。
-
Although strict uniqueness constraints require timeliness and coordination, many applications are actually fine with loose constraints that may be temporarily violated and fixed up later, as long as integrity is preserved throughout.
尽管严格的唯一性约束需要及时性和协调,但是许多应用程序实际上可以使用宽松的约束,这些约束可能会暂时违反并稍后修复,只要在整个过程中维护完整性即可。
Taken together, these observations mean that dataflow systems can provide the data management services for many applications without requiring coordination, while still giving strong integrity guarantees. Such coordination-avoiding data systems have a lot of appeal: they can achieve better performance and fault tolerance than systems that need to perform synchronous coordination [ 56 ].
综上所述,这些观察意味着,数据流系统可以在不需要协调的情况下为许多应用提供数据管理服务,同时仍然给出强大的完整性保证。这种避免协调的数据系统非常有吸引力:它们可以比需要执行同步协调的系统实现更好的性能和容错性 [56]。
For example, such a system could operate distributed across multiple datacenters in a multi-leader configuration, asynchronously replicating between regions. Any one datacenter can continue operating independently from the others, because no synchronous cross-region coordination is required. Such a system would have weak timeliness guarantees—it could not be linearizable without introducing coordination—but it can still have strong integrity guarantees.
例如,这样一个系统可以在多个数据中心中以多个领导者的配置进行分布式运作,在区域之间异步复制。任何一个数据中心都可以继续独立运作,因为不需要同步的跨区域协调。这样一个系统将具有较弱的时效性保证——如果不引入协调,它就无法变成可线性化的,但它仍然可以具有强大的完整性保证。
In this context, serializable transactions are still useful as part of maintaining derived state, but they can be run at a small scope where they work well [ 8 ]. Heterogeneous distributed transactions such as XA transactions (see “Distributed Transactions in Practice” ) are not required. Synchronous coordination can still be introduced in places where it is needed (for example, to enforce strict constraints before an operation from which recovery is not possible), but there is no need for everything to pay the cost of coordination if only a small part of an application needs it [ 43 ].
在这种情况下,可串行化事务作为维护派生状态的一部分仍然有用,但它们可以在较小的范围内运行,在那里它们工作得很好 [8]。并不需要 XA 事务这样的异构分布式事务(见“实践中的分布式事务”)。在确实需要的地方仍然可以引入同步协调(例如,在执行无法恢复的操作之前强制执行严格约束),但如果只有应用的一小部分需要协调,就没有必要让所有部分都付出协调的代价 [43]。
Another way of looking at coordination and constraints: they reduce the number of apologies you have to make due to inconsistencies, but potentially also reduce the performance and availability of your system, and thus potentially increase the number of apologies you have to make due to outages. You cannot reduce the number of apologies to zero, but you can aim to find the best trade-off for your needs—the sweet spot where there are neither too many inconsistencies nor too many availability problems.
另一种看待协调和约束的方式:它们减少了由于不一致而必须道歉的次数,但可能会降低系统的性能和可用性,从而可能增加因故障而必须道歉的次数。你不能将道歉次数降到零,但可以努力寻找最佳平衡点,即既没有太多的不一致,也没有太多的可用性问题。
Trust, but Verify
All of our discussion of correctness, integrity, and fault-tolerance has been under the assumption that certain things might go wrong, but other things won’t. We call these assumptions our system model (see “Mapping system models to the real world”): for example, we should assume that processes can crash, machines can suddenly lose power, and the network can arbitrarily delay or drop messages. But we might also assume that data written to disk is not lost after fsync, that data in memory is not corrupted, and that the multiplication instruction of our CPU always returns the correct result.
我们所讨论的正确性、完整性和容错性都是在假设某些东西可能出错,但其他东西不会出错的前提下进行的。我们称这些假设为我们的系统模型(参见“将系统模型映射到真实世界”):例如,我们应该假设进程可能会崩溃,机器可能会突然断电,网络可能会任意延迟或丢弃消息。但我们也可以假设,经过fsync写入磁盘的数据不会丢失,内存中的数据不会损坏,我们的CPU的乘法指令始终返回正确的结果。
These assumptions are quite reasonable, as they are true most of the time, and it would be difficult to get anything done if we had to constantly worry about our computers making mistakes. Traditionally, system models take a binary approach toward faults: we assume that some things can happen, and other things can never happen. In reality, it is more a question of probabilities: some things are more likely, other things less likely. The question is whether violations of our assumptions happen often enough that we may encounter them in practice.
这些假设是相当合理的,因为它们大多数时候都是正确的,如果我们不停地担心电脑出错会很难继续工作。传统的系统模型对故障采取二元方法:我们假设有些问题可能发生,有些则永远不会发生。实际上,这更多地是一个概率问题:有些事情更可能发生,而其他事情则不太可能。问题在于我们的假设是否经常被违反,以至于我们在实践中会遇到它们。
We have seen that data can become corrupted while it is sitting untouched on disks (see “Replication and Durability” ), and data corruption on the network can sometimes evade the TCP checksums (see “Weak forms of lying” ). Maybe this is something we should be paying more attention to?
我们已经看到,数据静静地躺在磁盘上时也可能发生损坏(参见“复制与持久性”),而网络上的数据损坏有时能够逃过 TCP 校验和的检查(参见“撒谎的弱形式”)。也许这是我们应该更加关注的事情?
One application that I worked on in the past collected crash reports from clients, and some of the reports we received could only be explained by random bit-flips in the memory of those devices. It seems unlikely, but if you have enough devices running your software, even very unlikely things do happen. Besides random memory corruption due to hardware faults or radiation, certain pathological memory access patterns can flip bits even in memory that has no faults [ 62 ]—an effect that can be used to break security mechanisms in operating systems [ 63 ] (this technique is known as rowhammer ). Once you look closely, hardware isn’t quite the perfect abstraction that it may seem.
过去我参与开发的一个应用会收集来自客户端的崩溃报告,我们收到的一些报告只能用设备内存中的随机位翻转来解释。这看起来不太可能,但如果有足够多的设备在运行你的软件,即使概率极低的事情也会发生。除了因硬件故障或辐射导致的随机内存损坏之外,某些病态的内存访问模式甚至可以翻转没有故障的内存中的位 [62]——这种效应可以用来攻破操作系统的安全机制 [63](这种技术被称为 rowhammer)。一旦仔细观察,你就会发现硬件并不像看起来那样是完美的抽象。
To be clear, random bit-flips are still very rare on modern hardware [ 64 ]. I just want to point out that they are not beyond the realm of possibility, and so they deserve some attention.
需要明确的是,在现代硬件上,随机位翻转仍然非常罕见 [64]。我只是想指出,它们并非不可能发生,因此值得引起一些关注。
Maintaining integrity in the face of software bugs
Besides such hardware issues, there is always the risk of software bugs, which would not be caught by lower-level network, memory, or filesystem checksums. Even widely used database software has bugs: I have personally seen cases of MySQL failing to correctly maintain a uniqueness constraint [ 65 ] and PostgreSQL’s serializable isolation level exhibiting write skew anomalies [ 66 ], even though MySQL and PostgreSQL are robust and well-regarded databases that have been battle-tested by many people for many years. In less mature software, the situation is likely to be much worse.
除了这类硬件问题,总还存在软件 bug 的风险,而这些 bug 不会被底层的网络、内存或文件系统校验和捕获。即使是广泛使用的数据库软件也有 bug:我个人就见过 MySQL 未能正确维护唯一性约束 [65],以及 PostgreSQL 的可串行化隔离级别出现写偏差异常 [66] 的案例,尽管 MySQL 和 PostgreSQL 都是健壮且广受好评的数据库,经受了许多人多年的实战检验。在不太成熟的软件中,情况很可能糟糕得多。
Despite considerable efforts in careful design, testing, and review, bugs still creep in. Although they are rare, and they eventually get found and fixed, there is still a period during which such bugs can corrupt data.
尽管进行了仔细的设计、测试和审查,但错误仍会悄悄地潜入。虽然它们很少见,最终会被发现和修复,但在此期间,这些错误仍可能破坏数据。
When it comes to application code, we have to assume many more bugs, since most applications don’t receive anywhere near the amount of review and testing that database code does. Many applications don’t even correctly use the features that databases offer for preserving integrity, such as foreign key or uniqueness constraints [ 36 ].
谈到应用代码,我们必须假设存在多得多的 bug,因为大多数应用接受的审查和测试远远达不到数据库代码的水平。许多应用甚至没有正确使用数据库为维护完整性而提供的功能,例如外键或唯一性约束 [36]。
Consistency in the sense of ACID (see “Consistency” ) is based on the idea that the database starts off in a consistent state, and a transaction transforms it from one consistent state to another consistent state. Thus, we expect the database to always be in a consistent state. However, this notion only makes sense if you assume that the transaction is free from bugs. If the application uses the database incorrectly in some way, for example using a weak isolation level unsafely, the integrity of the database cannot be guaranteed.
ACID 意义上的一致性(参见“一致性”)基于这样一种想法:数据库从一个一致的状态出发,而事务将它从一个一致状态转换到另一个一致状态。因此,我们期望数据库始终处于一致状态。然而,这种想法只有在假定事务没有 bug 的前提下才说得通。如果应用以某种方式错误地使用数据库,例如不安全地使用较弱的隔离级别,数据库的完整性就无法得到保证。
Don’t just blindly trust what they promise
With both hardware and software not always living up to the ideal that we would like them to be, it seems that data corruption is inevitable sooner or later. Thus, we should at least have a way of finding out if data has been corrupted so that we can fix it and try to track down the source of the error. Checking the integrity of data is known as auditing .
由于硬件和软件不总是如我们期望的那样完美,数据损坏似乎迟早是不可避免的。因此,我们至少应该有一种方法,能够检测数据是否已被破坏,以便我们可以修复它并尝试追踪错误的源头。检查数据的完整性被称为审计。
As discussed in “Advantages of immutable events” , auditing is not just for financial applications. However, auditability is highly important in finance precisely because everyone knows that mistakes happen, and we all recognize the need to be able to detect and fix problems.
正如“不可变事件的优点”中所讨论的,审计不仅仅适用于财务应用。然而,审计在财务领域中非常重要,因为每个人都知道错误会发生,我们都认识到需要能够检测和解决问题。
Mature systems similarly tend to consider the possibility of unlikely things going wrong, and manage that risk. For example, large-scale storage systems such as HDFS and Amazon S3 do not fully trust disks: they run background processes that continually read back files, compare them to other replicas, and move files from one disk to another, in order to mitigate the risk of silent corruption [ 67 ].
成熟的系统通常考虑到不太可能出错的情况,并管理风险。例如,大规模存储系统(如HDFS和Amazon S3)不完全信任磁盘:它们运行后台进程,不断读取文件,将其与其他副本进行比较,并将文件从一个磁盘移动到另一个磁盘,以减轻静默损坏的风险 [67]。
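The background scrubbing that systems like HDFS perform can be sketched in a few lines. This is an illustrative toy, not the actual implementation: the in-memory `store` dict stands in for disk blocks, and the hypothetical `scrub` function stands in for the background process that re-reads data and compares it against known checksums.

HDFS 这类系统的后台校验过程可以用几行代码来示意。以下只是一个说明性的玩具示例,并非实际实现:内存中的 `store` 字典代表磁盘块,虚构的 `scrub` 函数代表重新读取数据并与已知校验和比较的后台进程。

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def scrub(store: dict, checksums: dict) -> list:
    """Re-read every blob and return the keys whose content no longer
    matches its recorded checksum (i.e., silent corruption detected)."""
    corrupted = []
    for key, data in store.items():
        if checksum(data) != checksums.get(key):
            corrupted.append(key)
    return corrupted

store = {"a": b"hello", "b": b"world"}
checksums = {k: checksum(v) for k, v in store.items()}
store["b"] = b"w0rld"  # simulate silent bit-rot on disk
print(scrub(store, checksums))  # → ['b']
```

A real system would then repair the corrupted block from another replica; the essential point is that corruption is only found because the data is actually read back and verified.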
If you want to be sure that your data is still there, you have to actually read it and check. Most of the time it will still be there, but if it isn’t, you really want to find out sooner rather than later. By the same argument, it is important to try restoring from your backups from time to time—otherwise you may only find out that your backup is broken when it is too late and you have already lost data. Don’t just blindly trust that it is all working.
如果您想确保数据仍然完好,就必须实际把它读出来检查。大多数时候它都还在,但万一不在了,您肯定希望尽早发现,而不是事后才知道。出于同样的道理,不时地尝试从备份中恢复也很重要——否则,可能等到数据已经丢失时,您才发现备份本身是坏的。不要盲目相信一切都在正常工作。
A culture of verification
Systems like HDFS and S3 still have to assume that disks work correctly most of the time—which is a reasonable assumption, but not the same as assuming that they always work correctly. However, not many systems currently have this kind of “trust, but verify” approach of continually auditing themselves. Many assume that correctness guarantees are absolute and make no provision for the possibility of rare data corruption. I hope that in the future we will see more self-validating or self-auditing systems that continually check their own integrity, rather than relying on blind trust [ 68 ].
像 HDFS 和 S3 这样的系统仍然必须假设磁盘在大多数时候工作正常——这是一个合理的假设,但并不等同于假设磁盘总是工作正常。然而,目前采用这种“信任但要验证”的方式持续自我审计的系统并不多。许多系统假定正确性保证是绝对的,对罕见的数据损坏不做任何准备。我希望未来能看到更多自我验证或自我审计的系统,持续检查自身的完整性,而不是依赖盲目的信任 [68]。
I fear that the culture of ACID databases has led us toward developing applications on the basis of blindly trusting technology (such as a transaction mechanism), and neglecting any sort of auditability in the process. Since the technology we trusted worked well enough most of the time, auditing mechanisms were not deemed worth the investment.
我担心ACID数据库文化导致我们盲目信任技术(如事务机制)来开发应用程序,忽略了任何形式的审计。由于我们信任的技术大部分时间表现良好,审计机制被认为不值得投资。
But then the database landscape changed: weaker consistency guarantees became the norm under the banner of NoSQL, and less mature storage technologies became widely used. Yet, because the audit mechanisms had not been developed, we continued building applications on the basis of blind trust, even though this approach had now become more dangerous. Let’s think for a moment about designing for auditability.
然而,数据库的格局后来发生了变化:在 NoSQL 的旗帜下,较弱的一致性保证成为常态,不太成熟的存储技术得到广泛使用。然而,由于审计机制尚未发展起来,我们继续在盲目信任的基础上构建应用,即使这种做法如今已变得更加危险。让我们思考一下如何针对可审计性进行设计。
Designing for auditability
If a transaction mutates several objects in a database, it is difficult to tell after the fact what that transaction means. Even if you capture the transaction logs (see “Change Data Capture” ), the insertions, updates, and deletions in various tables do not necessarily give a clear picture of why those mutations were performed. The invocation of the application logic that decided on those mutations is transient and cannot be reproduced.
如果一个事务修改了数据库中的多个对象,事后很难说清这个事务意味着什么。即使您捕获了事务日志(参见“变更数据捕获”),各个表中的插入、更新和删除操作也不一定能清楚地说明这些变更为什么会发生。决定这些变更的应用逻辑调用是转瞬即逝的,无法重现。
By contrast, event-based systems can provide better auditability. In the event sourcing approach, user input to the system is represented as a single immutable event, and any resulting state updates are derived from that event. The derivation can be made deterministic and repeatable, so that running the same log of events through the same version of the derivation code will result in the same state updates.
相比之下,基于事件的系统可以提供更好的可审计性。在事件溯源方法中,用户对系统的输入被表示为一个单一的不可变事件,任何相应的状态更新都是从该事件派生出来的。派生过程可以做到确定性和可重复:用同一版本的派生代码重放同一份事件日志,将得到相同的状态更新。
Being explicit about dataflow (see “Philosophy of batch process outputs” ) makes the provenance of data much clearer, which makes integrity checking much more feasible. For the event log, we can use hashes to check that the event storage has not been corrupted. For any derived state, we can rerun the batch and stream processors that derived it from the event log in order to check whether we get the same result, or even run a redundant derivation in parallel.
显式地表达数据流(参见“批处理输出的哲学”)可以让数据的来源更加清晰,从而使完整性检查更加可行。对于事件日志,我们可以使用哈希来检查事件存储是否已被损坏。对于任何派生状态,我们可以重新运行从事件日志派生它的批处理和流处理器,检查是否得到相同的结果,甚至可以并行运行一个冗余的派生过程。
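Both checks mentioned above — hashing the event log and re-running a deterministic derivation — can be sketched concisely. This is a minimal illustration, not a production design: the function names (`chain_hash`, `verify_log`, `derive_balance`) and the toy account-balance derivation are assumptions made for this example.

上面提到的两种检查——对事件日志取哈希,以及重新运行确定性的派生过程——都可以简洁地示意如下。这只是一个最小化的说明,并非生产级设计:函数名(`chain_hash`、`verify_log`、`derive_balance`)以及玩具性质的账户余额派生都是本例的假设。

```python
import hashlib
import json

def chain_hash(prev_hash: str, event: dict) -> str:
    # Each event's hash covers the previous hash, forming a tamper-evident chain.
    payload = prev_hash + json.dumps(event, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def verify_log(events, stored_hashes) -> bool:
    """Recompute the hash chain and compare against the stored hashes."""
    h = ""
    for event, stored in zip(events, stored_hashes):
        h = chain_hash(h, event)
        if h != stored:
            return False
    return True

def derive_balance(events) -> int:
    # Deterministic derivation: the same events always yield the same state,
    # so it can be rerun at any time to audit a stored materialized view.
    return sum(e["amount"] for e in events)

events = [{"amount": 10}, {"amount": -3}]
hashes, h = [], ""
for e in events:
    h = chain_hash(h, e)
    hashes.append(h)

assert verify_log(events, hashes)          # event storage is intact
assert derive_balance(events) == 7         # rederived state matches
```

Tampering with any event breaks every subsequent hash in the chain, and a mismatch between the stored state and a rerun of `derive_balance` points at corruption somewhere in the derivation path.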
A deterministic and well-defined dataflow also makes it easier to debug and trace the execution of a system in order to determine why it did something [ 4 , 69 ]. If something unexpected occurred, it is valuable to have the diagnostic capability to reproduce the exact circumstances that led to the unexpected event—a kind of time-travel debugging capability.
确定性的、定义良好的数据流也使调试和跟踪系统的执行变得更容易,以便确定系统为什么会做某件事 [4, 69]。如果发生了预料之外的事情,能够重现导致该事件的确切环境是非常宝贵的诊断能力——一种时间旅行式的调试能力。
The end-to-end argument again
If we cannot fully trust that every individual component of the system will be free from corruption—that every piece of hardware is fault-free and that every piece of software is bug-free—then we must at least periodically check the integrity of our data. If we don’t check, we won’t find out about corruption until it is too late and it has caused some downstream damage, at which point it will be much harder and more expensive to track down the problem.
如果我们不能完全信任系统的每个组件都没有受到损坏 - 每个硬件部件无故障,每个软件部分无漏洞 - 那么我们至少必须定期检查我们的数据的完整性。如果我们不检查,我们将无法发现损坏,直到它太晚并且已经引起了某些下游损失,此时,追踪问题将更加困难和昂贵。
Checking the integrity of data systems is best done in an end-to-end fashion (see “The End-to-End Argument for Databases” ): the more systems we can include in an integrity check, the fewer opportunities there are for corruption to go unnoticed at some stage of the process. If we can check that an entire derived data pipeline is correct end to end, then any disks, networks, services, and algorithms along the path are implicitly included in the check.
检查数据系统的完整性,最好以端到端的方式进行(参见“数据库的端到端论证”):完整性检查所能涵盖的系统越多,损坏在流程的某个环节不被察觉的机会就越少。如果我们能够端到端地检查整条派生数据流水线是否正确,那么这条路径上的所有磁盘、网络、服务和算法就都被隐式地包含在检查之中了。
Having continuous end-to-end integrity checks gives you increased confidence about the correctness of your systems, which in turn allows you to move faster [ 70 ]. Like automated testing, auditing increases the chances that bugs will be found quickly, and thus reduces the risk that a change to the system or a new storage technology will cause damage. If you are not afraid of making changes, you can much better evolve an application to meet changing requirements.
持续的端到端完整性检查让您对系统的正确性更加有信心,从而让您能够更快地前进[70]。 像自动化测试一样,审计增加了发现错误的机会,从而减少了对系统更改或新存储技术造成损害的风险。 如果您不害怕进行更改,您可以更好地发展一个应用程序,以满足不断变化的需求。
Tools for auditable data systems
At present, not many data systems make auditability a top-level concern. Some applications implement their own audit mechanisms, for example by logging all changes to a separate audit table, but guaranteeing the integrity of the audit log and the database state is still difficult. A transaction log can be made tamper-proof by periodically signing it with a hardware security module, but that does not guarantee that the right transactions went into the log in the first place.
目前,很少有数据系统将可审计性作为最高层面的关注点。一些应用自己实现了审计机制,例如把所有变更记录到单独的审计表中,但要保证审计日志与数据库状态的完整性仍然很困难。可以通过定期用硬件安全模块对事务日志签名来使其防篡改,但这并不能保证一开始写入日志的就是正确的事务。
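The idea of periodically signing a transaction log for tamper-evidence can be sketched with an HMAC standing in for the hardware security module. This is a simplified illustration under stated assumptions: the key name, `sign_checkpoint` function, and log format are invented, and a real HSM would keep the key inaccessible to the application.

定期对事务日志签名以实现防篡改的想法,可以用 HMAC 代替硬件安全模块来示意。以下是在既定假设下的简化说明:密钥名称、`sign_checkpoint` 函数和日志格式都是虚构的,真实的 HSM 会让应用程序无法直接接触密钥。

```python
import hashlib
import hmac
import json

KEY = b"hsm-protected-key"  # stand-in: a real HSM would hold this internally

def sign_checkpoint(entries) -> str:
    """Sign a snapshot of the log; later re-signing detects modification."""
    payload = json.dumps(entries, sort_keys=True).encode()
    return hmac.new(KEY, payload, hashlib.sha256).hexdigest()

log = [{"op": "debit", "amount": 5}]
sig = sign_checkpoint(log)

# Verification: the untouched log still matches its checkpoint signature...
assert hmac.compare_digest(sign_checkpoint(log), sig)

# ...but any later modification is detectable, because the signature changes.
log.append({"op": "debit", "amount": 999})
assert sign_checkpoint(log) != sig
```

Note what this does and does not give you, matching the caveat above: the signature proves the log was not altered after the checkpoint, but it cannot prove that the entries were correct when they were first written.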
It would be interesting to use cryptographic tools to prove the integrity of a system in a way that is robust to a wide range of hardware and software issues, and even potentially malicious actions. Cryptocurrencies, blockchains, and distributed ledger technologies such as Bitcoin, Ethereum, Ripple, Stellar, and various others [ 71 , 72 , 73 ] have sprung up to explore this area.
如果能用密码学工具证明系统的完整性,并且这种证明对各种硬件和软件问题、甚至潜在的恶意行为都保持健壮,那将是很有意思的。加密货币、区块链以及比特币、以太坊、Ripple、Stellar 等各种分布式账本技术 [71, 72, 73] 相继涌现,对这一领域进行了探索。
I am not qualified to comment on the merits of these technologies as currencies or mechanisms for agreeing contracts. However, from a data systems point of view they contain some interesting ideas. Essentially, they are distributed databases, with a data model and transaction mechanism, in which different replicas can be hosted by mutually untrusting organizations. The replicas continually check each other’s integrity and use a consensus protocol to agree on the transactions that should be executed.
我不具备评论这些技术作为货币或合同达成机制的优点的资格。然而,从数据系统的角度来看,它们包含了一些有趣的想法。本质上,它们是分布式数据库,具有数据模型和事务机制,在其中不同副本可以由相互不信任的组织托管。这些副本不断检查彼此的完整性,并使用共识协议来达成应执行的交易。
I am somewhat skeptical about the Byzantine fault tolerance aspects of these technologies (see “Byzantine Faults” ), and I find the technique of proof of work (e.g., Bitcoin mining) extraordinarily wasteful. The transaction throughput of Bitcoin is rather low, albeit for political and economic reasons more than for technical ones. However, the integrity checking aspects are interesting.
我对这些技术的拜占庭容错方面持一定的怀疑态度(参见“拜占庭故障”),而且我认为工作量证明技术(例如比特币挖矿)极其浪费。比特币的交易吞吐量相当低,尽管其原因更多在于政治和经济因素,而非技术本身。不过,其中的完整性检查方面还是很有趣的。
Cryptographic auditing and integrity checking often relies on Merkle trees [ 74 ], which are trees of hashes that can be used to efficiently prove that a record appears in some dataset (and a few other things). Outside of the hype of cryptocurrencies, certificate transparency is a security technology that relies on Merkle trees to check the validity of TLS/SSL certificates [ 75 , 76 ].
密码学审计和完整性检查通常依赖 Merkle 树 [74],它是一种哈希树,可以用来高效地证明某条记录出现在某个数据集中(以及其他一些用途)。在加密货币的炒作之外,证书透明性是一种依赖 Merkle 树来检查 TLS/SSL 证书有效性的安全技术 [75, 76]。
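A minimal Merkle tree with inclusion proofs can show the idea behind certificate transparency logs. This sketch is deliberately simplified: it assumes a power-of-two number of leaves and omits the leaf/node domain separation that real designs such as certificate transparency use to prevent second-preimage attacks.

一个带有包含性证明的最小 Merkle 树可以展示证书透明性日志背后的思想。这个示意刻意做了简化:它假设叶子数量为 2 的幂,并省略了证书透明性等真实设计中用于防止二次原像攻击的叶子/内部节点域分离。

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves) -> bytes:
    # Hash leaves, then repeatedly hash adjacent pairs up to a single root.
    level = [H(leaf) for leaf in leaves]
    while len(level) > 1:
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def inclusion_proof(leaves, index):
    """Collect the sibling hash at each level from the leaf up to the root."""
    level = [H(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        sibling = index ^ 1  # the other node in this pair
        proof.append((level[sibling], sibling < index))  # (hash, is_left)
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(leaf: bytes, proof, root: bytes) -> bool:
    # Recompute the path to the root; O(log n) hashes instead of the full tree.
    h = H(leaf)
    for sibling, is_left in proof:
        h = H(sibling + h) if is_left else H(h + sibling)
    return h == root

leaves = [b"a", b"b", b"c", b"d"]
root = merkle_root(leaves)
proof = inclusion_proof(leaves, 2)
assert verify(b"c", proof, root)  # b"c" is provably in the dataset
```

The appeal for auditing is the asymmetry: a verifier holding only the root hash can check a logarithmic-size proof, without trusting or downloading the whole dataset.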
I could imagine integrity-checking and auditing algorithms, like those of certificate transparency and distributed ledgers, becoming more widely used in data systems in general. Some work will be needed to make them equally scalable as systems without cryptographic auditing, and to keep the performance penalty as low as possible. But I think this is an interesting area to watch in the future.
我可以想象诸如证书透明度和分布式账本之类的完整性检查和审计算法在数据系统中更广泛地使用。需要进行一些工作,使它们与没有加密审计的系统同样具有可扩展性,并将性能惩罚降至最低。但我认为这是未来值得关注的一个有趣领域。
Doing the Right Thing
In the final section of this book, I would like to take a step back. Throughout this book we have examined a wide range of different architectures for data systems, evaluated their pros and cons, and explored techniques for building reliable, scalable, and maintainable applications. However, we have left out an important and fundamental part of the discussion, which I would now like to fill in.
在这本书的最后一节中,我想退后一步。在整本书中,我们审查了各种不同的数据系统架构,评估了它们的优缺点,并探索了建立可靠、可扩展和易于维护的应用程序的技术。然而,我们忽略了讨论的一个重要和基本部分,现在我想填补这个空缺。
Every system is built for a purpose; every action we take has both intended and unintended consequences. The purpose may be as simple as making money, but the consequences for the world may reach far beyond that original purpose. We, the engineers building these systems, have a responsibility to carefully consider those consequences and to consciously decide what kind of world we want to live in.
每个系统都是为一个目的而建立的;我们所采取的每个行动都有意图和无意图的后果。这个目的可能只是为了赚钱,但对于世界的影响可能远远超出了最初的目的。作为构建这些系统的工程师,我们有责任认真考虑这些后果,并有意识地决定我们想要生活在什么样的世界中。
We talk about data as an abstract thing, but remember that many datasets are about people: their behavior, their interests, their identity. We must treat such data with humanity and respect. Users are humans too, and human dignity is paramount.
我们谈论数据是一件抽象的事情,但是请记住,许多数据集关乎人们:他们的行为、他们的兴趣、他们的身份。我们必须以人性和尊重对待这样的数据。用户也是人类,人类的尊严至关重要。
Software development increasingly involves making important ethical choices. There are guidelines to help software engineers navigate these issues, such as the ACM’s Software Engineering Code of Ethics and Professional Practice [ 77 ], but they are rarely discussed, applied, and enforced in practice. As a result, engineers and product managers sometimes take a very cavalier attitude to privacy and potential negative consequences of their products [ 78 , 79 , 80 ].
软件开发越来越需要做出重要的道德选择。有一些指南可以帮助软件工程师应对这些问题,比如ACM的《软件工程师道德和职业实践准则》[77],但实际上很少有讨论、应用和执行。因此,工程师和产品经理有时会对隐私和潜在的负面影响采取非常草率的态度[78, 79, 80]。
A technology is not good or bad in itself—what matters is how it is used and how it affects people. This is true for a software system like a search engine in much the same way as it is for a weapon like a gun. I think it is not sufficient for software engineers to focus exclusively on the technology and ignore its consequences: the ethical responsibility is ours to bear also. Reasoning about ethics is difficult, but it is too important to ignore.
技术本身并无好坏之分——重要的是它如何被使用,以及它如何影响人们。这一点对搜索引擎这样的软件系统成立,正如对枪支这样的武器成立一样。我认为,软件工程师仅仅专注于技术本身而忽视其后果是不够的:我们同样要承担伦理责任。对伦理的思考很困难,但它太重要了,不容忽视。
Predictive Analytics
For example, predictive analytics is a major part of the “Big Data” hype. Using data analysis to predict the weather, or the spread of diseases, is one thing [ 81 ]; it is another matter to predict whether a convict is likely to reoffend, whether an applicant for a loan is likely to default, or whether an insurance customer is likely to make expensive claims. The latter have a direct effect on individual people’s lives.
例如,预测分析是“大数据”热潮的重要组成部分。利用数据分析来预测天气或疾病的传播是一回事 [81];预测一名罪犯是否可能再次犯罪、一位贷款申请人是否可能违约、或一位保险客户是否可能提出高额索赔,则是另一回事。后者会直接影响具体个人的生活。
Naturally, payment networks want to prevent fraudulent transactions, banks want to avoid bad loans, airlines want to avoid hijackings, and companies want to avoid hiring ineffective or untrustworthy people. From their point of view, the cost of a missed business opportunity is low, but the cost of a bad loan or a problematic employee is much higher, so it is natural for organizations to want to be cautious. If in doubt, they are better off saying no.
自然而然,支付网络希望防止欺诈交易,银行希望避免不良贷款,航空公司希望避免劫机,公司希望避免雇用无效或不可信任的人。从他们的角度来看,错过商机的成本很低,但不良贷款或有问题的员工的成本要高得多,因此组织希望保持谨慎。如果有疑问,他们最好拒绝。
However, as algorithmic decision-making becomes more widespread, someone who has (accurately or falsely) been labeled as risky by some algorithm may suffer a large number of those “no” decisions. Systematically being excluded from jobs, air travel, insurance coverage, property rental, financial services, and other key aspects of society is such a large constraint of the individual’s freedom that it has been called “algorithmic prison” [ 82 ]. In countries that respect human rights, the criminal justice system presumes innocence until proven guilty; on the other hand, automated systems can systematically and arbitrarily exclude a person from participating in society without any proof of guilt, and with little chance of appeal.
然而,随着算法决策越来越普及,被某种算法(准确或错误地)标记为有风险的人,可能会遭受大量这样的“否”的决定。被系统性地排除在工作机会、航空旅行、保险、房屋租赁、金融服务以及社会其他关键领域之外,是对个人自由如此巨大的约束,以至于被称为“算法监狱” [82]。在尊重人权的国家,刑事司法系统遵循无罪推定原则;相反,自动化系统却可以在没有任何有罪证明的情况下,系统性、任意地将一个人排除在社会参与之外,而且几乎没有申诉的机会。
Bias and discrimination
Decisions made by an algorithm are not necessarily any better or any worse than those made by a human. Every person is likely to have biases, even if they actively try to counteract them, and discriminatory practices can become culturally institutionalized. There is hope that basing decisions on data, rather than subjective and instinctive assessments by people, could be more fair and give a better chance to people who are often overlooked in the traditional system [ 83 ].
算法做出的决定不一定比人类做出的更好或更差。每个人都可能带有偏见,即使他们积极尝试克服这些偏见,而歧视性做法也可能在文化上被制度化。人们希望,基于数据而非人的主观、直觉判断来做决策,可以更加公平,并给传统体系中常被忽视的人更好的机会 [83]。
When we develop predictive analytics systems, we are not merely automating a human’s decision by using software to specify the rules for when to say yes or no; we are even leaving the rules themselves to be inferred from data. However, the patterns learned by these systems are opaque: even if there is some correlation in the data, we may not know why. If there is a systematic bias in the input to an algorithm, the system will most likely learn and amplify that bias in its output [ 84 ].
当我们开发预测分析系统时,我们不仅仅是通过软件来指定何时说“是”或“否”来自动化人类的决策;我们甚至将规则本身留给了从数据中推断。然而,这些系统学习到的模式是不透明的:即使数据中存在某种相关性,我们可能也不知道为什么。如果算法的输入存在系统性偏见,系统很可能会在其输出中学习并放大这种偏见。
In many countries, anti-discrimination laws prohibit treating people differently depending on protected traits such as ethnicity, age, gender, sexuality, disability, or beliefs. Other features of a person’s data may be analyzed, but what happens if they are correlated with protected traits? For example, in racially segregated neighborhoods, a person’s postal code or even their IP address is a strong predictor of race. Put like this, it seems ridiculous to believe that an algorithm could somehow take biased data as input and produce fair and impartial output from it [ 85 ]. Yet this belief often seems to be implied by proponents of data-driven decision making, an attitude that has been satirized as “machine learning is like money laundering for bias” [ 86 ].
在许多国家,反歧视法律禁止基于受保护的特征(如种族、年龄、性别、性取向、残疾或信仰)区别对待他人。个人数据中的其他特征可以被分析,但如果这些特征与受保护的特征相关,会发生什么呢?例如,在种族隔离严重的社区,一个人的邮政编码甚至 IP 地址都是种族的强预测因子。这样说来,认为算法能够以带有偏见的数据作为输入,却产生公平公正的输出,似乎是荒谬的 [85]。然而,数据驱动决策的支持者似乎常常隐含着这种信念,这种态度被讽刺为“机器学习就像偏见的洗钱” [86]。
Predictive analytics systems merely extrapolate from the past; if the past is discriminatory, they codify that discrimination. If we want the future to be better than the past, moral imagination is required, and that’s something only humans can provide [ 87 ]. Data and models should be our tools, not our masters.
预测分析系统只是从过去进行推断;如果过去存在歧视,它们会将这种歧视编码下来。如果我们想让未来比过去更好,就需要道德想象力,而这只有人类才能提供。数据和模型应该成为我们的工具,而不是我们的主人。
Responsibility and accountability
Automated decision making opens the question of responsibility and accountability [ 87 ]. If a human makes a mistake, they can be held accountable, and the person affected by the decision can appeal. Algorithms make mistakes too, but who is accountable if they go wrong [ 88 ]? When a self-driving car causes an accident, who is responsible? If an automated credit scoring algorithm systematically discriminates against people of a particular race or religion, is there any recourse? If a decision by your machine learning system comes under judicial review, can you explain to the judge how the algorithm made its decision?
自动化决策引发了责任和问责的问题[87]。如果人类犯错,他们可以被追究责任,而受到决策影响的人可以上诉。算法也会犯错,但如果出现问题,谁来负责[88]?自动驾驶汽车造成事故,谁应负责?如果自动信用评分算法系统性地歧视某个种族或宗教的人,有没有任何救济措施?如果你的机器学习系统的决策受到司法审查,你能向法官认真解释算法是如何做出决策的吗?
Credit rating agencies are an old example of collecting data to make decisions about people. A bad credit score makes life difficult, but at least a credit score is normally based on relevant facts about a person’s actual borrowing history, and any errors in the record can be corrected (although the agencies normally do not make this easy). However, scoring algorithms based on machine learning typically use a much wider range of inputs and are much more opaque, making it harder to understand how a particular decision has come about and whether someone is being treated in an unfair or discriminatory way [ 89 ].
信用评级机构是收集个人数据用于决策的一个老牌例子。糟糕的信用评分会让生活变得困难,但至少信用评分通常基于与个人实际借贷历史相关的事实,并且记录中的任何错误都可以被纠正(尽管这些机构通常不会让纠错变得容易)。然而,基于机器学习的评分算法通常使用范围广泛得多的输入,而且要不透明得多,这使得人们更难理解某个特定决策是如何得出的,以及某人是否受到了不公平或歧视性的对待 [89]。
A credit score summarizes “How did you behave in the past?” whereas predictive analytics usually work on the basis of “Who is similar to you, and how did people like you behave in the past?” Drawing parallels to others’ behavior implies stereotyping people, for example based on where they live (a close proxy for race and socioeconomic class). What about people who get put in the wrong bucket? Furthermore, if a decision is incorrect due to erroneous data, recourse is almost impossible [ 87 ].
信用评分概括的是“你过去表现如何?”,而预测分析通常基于“谁和你相似?与你相似的人过去表现如何?”。与他人的行为作类比,意味着对人进行刻板归类,例如基于他们的居住地(种族和社会经济阶层的一个近似替代指标)。那些被归错类的人怎么办?此外,如果决策因数据错误而出错,几乎不可能申诉 [87]。
Much data is statistical in nature, which means that even if the probability distribution on the whole is correct, individual cases may well be wrong. For example, if the average life expectancy in your country is 80 years, that doesn’t mean you’re expected to drop dead on your 80th birthday. From the average and the probability distribution, you can’t say much about the age to which one particular person will live. Similarly, the output of a prediction system is probabilistic and may well be wrong in individual cases.
许多数据具有统计性质,这就意味着即使整体的概率分布是正确的,个别情况也可能是错误的。例如,如果你所在国家的平均寿命是80岁,这并不意味着你在80岁生日那一天就会离世。从平均值和概率分布来看,你并不能太多地说出一个人会活多少岁。同样地,预测系统的输出是概率性的,也可能在个别情况下出现错误。
A blind belief in the supremacy of data for making decisions is not only delusional, it is positively dangerous. As data-driven decision making becomes more widespread, we will need to figure out how to make algorithms accountable and transparent, how to avoid reinforcing existing biases, and how to fix them when they inevitably make mistakes.
对于盲目相信数据占据决策至高无上地位的观点,不仅是一厢情愿的,而且是极其危险的。随着数据驱动的决策越来越普及,我们需要想办法让算法具有责任和透明度,避免强化现有偏见,以及在它们必然犯错时如何进行修复。
We will also need to figure out how to prevent data being used to harm people, and realize its positive potential instead. For example, analytics can reveal financial and social characteristics of people’s lives. On the one hand, this power could be used to focus aid and support to help those people who most need it. On the other hand, it is sometimes used by predatory business seeking to identify vulnerable people and sell them risky products such as high-cost loans and worthless college degrees [ 87 , 90 ].
我们还需要想办法防止数据被用来伤害人们,而是要发挥其正面潜力。例如,分析可以揭示人们生活的财务和社会特征。一方面,这种力量可以用来集中援助和支持,帮助那些最需要帮助的人。另一方面,这有时会被掠夺性企业用来识别脆弱的人并向他们销售高成本贷款和毫无价值的大学学位。
Feedback loops
Even with predictive applications that have less immediately far-reaching effects on people, such as recommendation systems, there are difficult issues that we must confront. When services become good at predicting what content users want to see, they may end up showing people only opinions they already agree with, leading to echo chambers in which stereotypes, misinformation, and polarization can breed. We are already seeing the impact of social media echo chambers on election campaigns [ 91 ].
即使是对人们的影响不那么直接深远的预测性应用,例如推荐系统,也存在我们必须面对的难题。当服务变得善于预测用户想看什么内容时,它们最终可能只向人们展示其已经认同的观点,形成回声室,让刻板印象、错误信息和极化在其中滋生。我们已经看到了社交媒体回声室对竞选活动的影响 [91]。
When predictive analytics affect people’s lives, particularly pernicious problems arise due to self-reinforcing feedback loops. For example, consider the case of employers using credit scores to evaluate potential hires. You may be a good worker with a good credit score, but suddenly find yourself in financial difficulties due to a misfortune outside of your control. As you miss payments on your bills, your credit score suffers, and you will be less likely to find work. Joblessness pushes you toward poverty, which further worsens your scores, making it even harder to find employment [ 87 ]. It’s a downward spiral due to poisonous assumptions, hidden behind a camouflage of mathematical rigor and data.
当预测性分析影响人们的生活时,由于自我加强的反馈循环,特别是会导致恶性问题。例如,考虑雇主使用信用评分来评估潜在的雇员。你可能是一位有着良好信用评分的好工作者,但是由于在你的控制之外的不幸事件而陷入财务困境。随着你未能按时支付账单,你的信用评分会受到影响,你将不太可能找到工作。失业将把你推向贫困,进一步恶化你的得分,使找到工作更加困难。这是由有毒的假设引起的恶性循环,隐藏在数学严谨和数据的伪装下。
We can’t always predict when such feedback loops happen. However, many consequences can be predicted by thinking about the entire system (not just the computerized parts, but also the people interacting with it)—an approach known as systems thinking [ 92 ]. We can try to understand how a data analysis system responds to different behaviors, structures, or characteristics. Does the system reinforce and amplify existing differences between people (e.g., making the rich richer or the poor poorer), or does it try to combat injustice? And even with the best intentions, we must beware of unintended consequences.
我们不能总是预测这样的反馈循环何时发生。然而,通过考虑整个系统(不仅仅是计算机化部分,还有与之交互的人),可以预测许多后果——这种方法被称为系统思维 [92]。我们可以尝试了解数据分析系统对不同的行为、结构或特征如何反应:系统是加强和放大了人们之间的现有差异(例如,让富人更富、穷人更穷),还是试图对抗不公正?即使怀着最好的意图,我们也必须提防意外后果。
Privacy and Tracking
Besides the problems of predictive analytics—i.e., using data to make automated decisions about people—there are ethical problems with data collection itself. What is the relationship between the organizations collecting data and the people whose data is being collected?
除了预测分析的问题——即使用数据来做出有关人的自动化决策之外,数据收集本身也存在伦理问题。收集数据的组织与数据被收集的人之间的关系是什么?
When a system only stores data that a user has explicitly entered, because they want the system to store and process it in a certain way, the system is performing a service for the user: the user is the customer. But when a user’s activity is tracked and logged as a side effect of other things they are doing, the relationship is less clear. The service no longer just does what the user tells it to do, but it takes on interests of its own, which may conflict with the user’s interests.
当系统仅存储用户显式输入的数据,因为他们希望系统以特定的方式存储和处理它时,系统正在为用户提供服务:用户是客户。但是,当用户的活动被跟踪和记录为他们正在做其他事情的副作用时,关系就不太清楚了。该服务不再仅仅是按照用户的指示执行操作,而是具有自己的利益,这可能与用户的利益冲突。
Tracking behavioral data has become increasingly important for user-facing features of many online services: tracking which search results are clicked helps improve the ranking of search results; recommending “people who liked X also liked Y” helps users discover interesting and useful things; A/B tests and user flow analysis can help indicate how a user interface might be improved. Those features require some amount of tracking of user behavior, and users benefit from them.
追踪行为数据对许多在线服务面向用户的功能来说变得越来越重要:跟踪哪些搜索结果被点击,有助于改进搜索结果排名;推荐“喜欢X的人也喜欢Y”,有助于用户发现有趣和有用的东西;A/B测试和用户流分析可以帮助指出用户界面可以如何改进。这些功能需要对用户行为进行一定程度的追踪,而用户也从中受益。
However, depending on a company’s business model, tracking often doesn’t stop there. If the service is funded through advertising, the advertisers are the actual customers, and the users’ interests take second place. Tracking data becomes more detailed, analyses become further-reaching, and data is retained for a long time in order to build up detailed profiles of each person for marketing purposes.
不过,根据企业的商业模式,跟踪往往不止于此。如果服务是通过广告资助的,那么广告客户才是真正的客户,而用户的利益则排在第二位。跟踪数据变得更加详细,分析变得更加深入,数据也会长时间保留,以建立每个人的详细营销档案。
Now the relationship between the company and the user whose data is being collected starts looking quite different. The user is given a free service and is coaxed into engaging with it as much as possible. The tracking of the user serves not primarily that individual, but rather the needs of the advertisers who are funding the service. I think this relationship can be appropriately described with a word that has more sinister connotations: surveillance .
现在,公司和被收集数据的用户之间的关系开始变得截然不同。用户获得免费服务,并被诱导尽可能多地与之互动。对用户的追踪所服务的主要不是该用户个人,而是资助这项服务的广告商的需求。我认为这种关系可以用一个含义更险恶的词来恰当地描述:监视。
Surveillance
As a thought experiment, try replacing the word data with surveillance , and observe if common phrases still sound so good [ 93 ]. How about this: “In our surveillance-driven organization we collect real-time surveillance streams and store them in our surveillance warehouse. Our surveillance scientists use advanced analytics and surveillance processing in order to derive new insights.”
作为一个思维实验,尝试把“数据”一词替换成“监视”,再观察常见的短语是否还那么动听 [93]。比如这样:“在我们的监视驱动组织中,我们收集实时监视流并将其存储在我们的监视仓库中。我们的监视科学家使用先进的分析和监视处理技术,以获取新的见解。”
This thought experiment is unusually polemic for this book, Designing Surveillance-Intensive Applications , but I think that strong words are needed to emphasize this point. In our attempts to make software “eat the world” [ 94 ], we have built the greatest mass surveillance infrastructure the world has ever seen. Rushing toward an Internet of Things, we are rapidly approaching a world in which every inhabited space contains at least one internet-connected microphone, in the form of smartphones, smart TVs, voice-controlled assistant devices, baby monitors, and even children’s toys that use cloud-based speech recognition. Many of these devices have a terrible security record [ 95 ].
这个思想实验在本书《设计监视密集型应用》中显得格外论战性,但我认为需要强烈的措辞来强调这一点。在我们努力让软件“吞噬世界”[94] 的过程中,我们建造了世界上前所未有的大规模监视基础设施。在向物联网狂奔的路上,我们正迅速接近这样一个世界:每个有人居住的空间都至少有一个联网的麦克风,其形式包括智能手机、智能电视、语音控制的助理设备、婴儿监视器,甚至使用基于云的语音识别的儿童玩具。其中很多设备的安全记录非常糟糕 [95]。
Even the most totalitarian and repressive regimes could only dream of putting a microphone in every room and forcing every person to constantly carry a device capable of tracking their location and movements. Yet we apparently voluntarily, even enthusiastically, throw ourselves into this world of total surveillance. The difference is just that the data is being collected by corporations rather than government agencies [ 96 ].
即使是最极权和压制性的政权也只能梦想着在每个房间里安装麦克风,并强迫每个人不断携带能够跟踪其位置和行动的设备。然而,我们显然自愿甚至热情地投身于这个完全监控的世界。不同之处仅在于数据是由企业而非政府机构收集的[96]。
Not all data collection necessarily qualifies as surveillance, but examining it as such can help us understand our relationship with the data collector. Why are we seemingly happy to accept surveillance by corporations? Perhaps you feel you have nothing to hide—in other words, you are totally in line with existing power structures, you are not a marginalized minority, and you needn’t fear persecution [ 97 ]. Not everyone is so fortunate. Or perhaps it’s because the purpose seems benign—it’s not overt coercion and conformance, but merely better recommendations and more personalized marketing. However, combined with the discussion of predictive analytics from the last section, that distinction seems less clear.
并非所有的数据收集都必然算作监视,但把它当作监视来审视,有助于我们理解自己与数据收集者的关系。为什么我们似乎乐于接受企业的监视?也许你觉得自己没什么可隐瞒的——换句话说,你完全与现有权力结构保持一致,你不是被边缘化的少数群体,也无需担心迫害 [97]。并非每个人都如此幸运。又或者是因为其目的看上去是善意的——不是公然的强迫与驯服,而仅仅是更好的推荐和更个性化的营销。然而,结合上一节关于预测性分析的讨论,这种区别就显得不那么清晰了。
We are already seeing car insurance premiums linked to tracking devices in cars, and health insurance coverage that depends on people wearing a fitness tracking device. When surveillance is used to determine things that hold sway over important aspects of life, such as insurance coverage or employment, it starts to appear less benign. Moreover, data analysis can reveal surprisingly intrusive things: for example, the movement sensor in a smartwatch or fitness tracker can be used to work out what you are typing (for example, passwords) with fairly good accuracy [ 98 ]. And algorithms for analysis are only going to get better.
我们已经看到汽车保险费与车内的追踪设备挂钩,也看到健康保险的承保取决于人们是否佩戴健身追踪设备。当监视被用来决定生活中的重要方面,比如保险承保或就业时,它就开始显得不那么善意了。此外,数据分析可以揭示出侵入性惊人的事情:例如,智能手表或健身追踪器中的运动传感器可以用来以相当高的准确率推断出你正在输入的内容(例如密码)[98]。而且,分析算法只会越来越好。
Consent and freedom of choice
We might assert that users voluntarily choose to use a service that tracks their activity, and they have agreed to the terms of service and privacy policy, so they consent to data collection. We might even claim that users are receiving a valuable service in return for the data they provide, and that the tracking is necessary in order to provide the service. Undoubtedly, social networks, search engines, and various other free online services are valuable to users—but there are problems with this argument.
我们可以主张,用户自愿选择使用跟踪他们活动的服务,他们已同意服务条款和隐私政策,所以他们同意数据收集。我们甚至可以声称用户通过提供数据获得了有价值的服务,而跟踪是为了提供该服务而必要的。毫无疑问,社交网络、搜索引擎和其他各种免费在线服务对用户来说是有价值的——但是这个论点存在问题。
Users have little knowledge of what data they are feeding into our databases, or how it is retained and processed—and most privacy policies do more to obscure than to illuminate. Without understanding what happens to their data, users cannot give any meaningful consent. Often, data from one user also says things about other people who are not users of the service and who have not agreed to any terms. The derived datasets that we discussed in this part of the book—in which data from the entire user base may have been combined with behavioral tracking and external data sources—are precisely the kinds of data of which users cannot have any meaningful understanding.
用户对自己输入到我们数据库中的数据,以及这些数据如何被保留和处理,知之甚少——而大多数隐私政策与其说是说明,不如说是遮掩。不了解自己的数据会怎样被处理,用户就无法给出任何有意义的同意。通常,来自一个用户的数据还会涉及另一些人,他们既不是该服务的用户,也从未同意任何条款。我们在本书这一部分讨论的派生数据集——其中整个用户群的数据可能已与行为追踪和外部数据源相结合——恰恰是用户无法有任何有意义理解的那类数据。
Moreover, data is extracted from users through a one-way process, not a relationship with true reciprocity, and not a fair value exchange. There is no dialog, no option for users to negotiate how much data they provide and what service they receive in return: the relationship between the service and the user is very asymmetric and one-sided. The terms are set by the service, not by the user [ 99 ].
此外,数据是通过单向过程从用户提取的,而不是通过真正互惠的关系和公平价值的交换。没有对话,用户没有选择权来谈判提供多少数据以及以何种服务回报:服务与用户之间的关系非常不对称和单向的。条款由服务方设定,而不是用户[99]。
For a user who does not consent to surveillance, the only real alternative is simply not to use a service. But this choice is not free either: if a service is so popular that it is “regarded by most people as essential for basic social participation” [ 99 ], then it is not reasonable to expect people to opt out of this service—using it is de facto mandatory. For example, in most Western social communities, it has become the norm to carry a smartphone, to use Facebook for socializing, and to use Google for finding information. Especially when a service has network effects, there is a social cost to people choosing not to use it.
对于不同意监视的用户来说,唯一真正的选择就是简单地不使用该服务。但是这种选择也并不自由:如果一个服务非常流行,以至于“大多数人都认为它是基本社交参与所必需的”[99],那么期望人们选择退出该服务是不合理的——使用它是事实上的强制性。例如,在大多数西方社交群体中,携带智能手机、使用Facebook社交以及使用Google搜索信息已经成为常态。尤其是当一个服务具有网络效应时,如果人们选择不使用它,将会面临社交成本。
Declining to use a service due to its tracking of users is only an option for the small number of people who are privileged enough to have the time and knowledge to understand its privacy policy, and who can afford to potentially miss out on social participation or professional opportunities that may have arisen if they had participated in the service. For people in a less privileged position, there is no meaningful freedom of choice: surveillance becomes inescapable.
拒绝使用一个追踪用户的服务只是少数人的选择,这些人足够有时间和知识理解其隐私政策,并有能力承担可能错失社交或职业机会的风险。对于处于相对较弱势地位的人来说,没有实质的选择自由:监视变得不可避免。
Privacy and use of data
Sometimes people claim that “privacy is dead” on the grounds that some users are willing to post all sorts of things about their lives to social media, sometimes mundane and sometimes deeply personal. However, this claim is false and rests on a misunderstanding of the word privacy .
有时人们声称“隐私已死”,理由是一些用户愿意在社交媒体上发布各种各样关于他们生活的事情,有时是平凡无奇的,有时是非常个人的。然而,这种说法是错误的,这基于对隐私这个词的误解。
Having privacy does not mean keeping everything secret; it means having the freedom to choose which things to reveal to whom, what to make public, and what to keep secret. The right to privacy is a decision right: it enables each person to decide where they want to be on the spectrum between secrecy and transparency in each situation [ 99 ]. It is an important aspect of a person’s freedom and autonomy.
拥有隐私并不意味着保持所有事情的秘密;它意味着有选择权,可以决定对谁透露什么,公开什么,保守什么。隐私权是一项决策权:它使每个人能够决定在每种情况下他们想要在保密和透明之间处于哪个区间[99]。它是一个人自由和自治的重要方面。
When data is extracted from people through surveillance infrastructure, privacy rights are not necessarily eroded, but rather transferred to the data collector. Companies that acquire data essentially say “trust us to do the right thing with your data,” which means that the right to decide what to reveal and what to keep secret is transferred from the individual to the company.
当通过监控基础设施从人们那里提取数据时,隐私权不一定会被侵蚀,而是转移到数据收集者手中。获取数据的公司基本上会说“相信我们会妥善处理你的数据”,这意味着选择何时透露、何时保密的权利已被从个人转移到公司手中。
The companies in turn choose to keep much of the outcome of this surveillance secret, because to reveal it would be perceived as creepy, and would harm their business model (which relies on knowing more about people than other companies do). Intimate information about users is only revealed indirectly, for example in the form of tools for targeting advertisements to specific groups of people (such as those suffering from a particular illness).
反过来,公司选择对这种监视的大部分结果保密,因为公开它们会显得令人不安,并会损害其商业模式(该模式依赖于比其他公司更了解人们)。关于用户的私密信息只会被间接地透露,例如以面向特定人群(比如患有某种疾病的人)投放定向广告的工具的形式。
Even if particular users cannot be personally reidentified from the bucket of people targeted by a particular ad, they have lost their agency about the disclosure of some intimate information, such as whether they suffer from some illness. It is not the user who decides what is revealed to whom on the basis of their personal preferences—it is the company that exercises the privacy right with the goal of maximizing its profit.
即使特定用户无法从特定广告针对的人群中被重新识别,他们仍会失去有关披露某些私人信息的权利,例如他们是否患有某种疾病。这不是用户根据个人喜好决定向谁透露信息,而是公司以最大化其利润为目标行使隐私权。
Many companies have a goal of not being perceived as creepy—avoiding the question of how intrusive their data collection actually is, and instead focusing on managing user perceptions. And even these perceptions are often managed poorly: for example, something may be factually correct, but if it triggers painful memories, the user may not want to be reminded about it [ 100 ]. With any kind of data we should expect the possibility that it is wrong, undesirable, or inappropriate in some way, and we need to build mechanisms for handling those failures. Whether something is “undesirable” or “inappropriate” is of course down to human judgment; algorithms are oblivious to such notions unless we explicitly program them to respect human needs. As engineers of these systems we must be humble, accepting and planning for such failings.
许多公司的目标是不被人觉得“令人不安”——回避其数据收集实际上有多么具有侵入性的问题,转而专注于管理用户的观感。而即使这些观感也常常管理得很差:例如,某些内容可能在事实上是正确的,但如果它触发了痛苦的记忆,用户可能并不想被提醒 [100]。对于任何类型的数据,我们都应该预料到它可能在某些方面是错误的、不受欢迎的或不合适的,并且需要建立处理这些失败的机制。某事物是否“不受欢迎”或“不合适”当然取决于人的判断;除非我们明确地编程让算法尊重人类需求,否则算法对这些概念一无所知。作为这些系统的工程师,我们必须保持谦逊,接受并为这些缺陷做好预案。
Privacy settings that allow a user of an online service to control which aspects of their data other users can see are a starting point for handing back some control to users. However, regardless of the setting, the service itself still has unfettered access to the data, and is free to use it in any way permitted by the privacy policy. Even if the service promises not to sell the data to third parties, it usually grants itself unrestricted rights to process and analyze the data internally, often going much further than what is overtly visible to users.
隐私设置允许在线服务的用户控制其他用户可以看到其数据的哪些方面,这是将一些控制权交还给用户的起点。然而,无论设置如何,服务本身仍能自由访问并使用此数据,且在隐私政策允许的情况下可以用于任何用途。即使服务承诺不向第三方出售数据,通常也会授予其自身无限制的内部处理和分析数据的权利,往往超出用户明显可见的范围。
This kind of large-scale transfer of privacy rights from individuals to corporations is historically unprecedented [ 99 ]. Surveillance has always existed, but it used to be expensive and manual, not scalable and automated. Trust relationships have always existed, for example between a patient and their doctor, or between a defendant and their attorney—but in these cases the use of data has been strictly governed by ethical, legal, and regulatory constraints. Internet services have made it much easier to amass huge amounts of sensitive information without meaningful consent, and to use it at massive scale without users understanding what is happening to their private data.
这种大规模的隐私权转移从个人到企业的现象,在历史上是前所未有的[99]。监视一直存在,但曾经很昂贵和手动化,而不是可扩展和自动化的。信任关系一直存在,例如患者和医生之间,或被告和律师之间的关系,但在这些情况下,数据使用受到了道德、法律和监管限制。互联网服务使得大规模收集敏感信息变得更加容易,而用户却没有真正的同意,并且在大规模使用其私人数据时,用户也没有理解发生了什么。
Data as assets and power
Since behavioral data is a byproduct of users interacting with a service, it is sometimes called “data exhaust”—suggesting that the data is worthless waste material. Viewed this way, behavioral and predictive analytics can be seen as a form of recycling that extracts value from data that would have otherwise been thrown away.
由于行为数据是用户与服务互动的副产品,因此有时被称为“数据废气”,暗示这些数据是无用的废弃物料。从这个角度来看,行为和预测分析可以被视为一种回收利用的形式,从本来会被丢弃的数据中提取价值。
More correct would be to view it the other way round: from an economic point of view, if targeted advertising is what pays for a service, then behavioral data about people is the service’s core asset. In this case, the application with which the user interacts is merely a means to lure users into feeding more and more personal information into the surveillance infrastructure [ 99 ]. The delightful human creativity and social relationships that often find expression in online services are cynically exploited by the data extraction machine.
更正确的做法是反过来看:从经济角度来看,如果定向广告是为服务买单的方式,那么关于人的行为数据就是该服务的核心资产。在这种情况下,用户与之交互的应用程序仅仅是一种诱饵,诱使用户向监视基础设施输入越来越多的个人信息 [99]。在线服务中常常得以表达的、令人愉悦的人类创造力和社会关系,被数据提取机器肆意利用。
The assertion that personal data is a valuable asset is supported by the existence of data brokers, a shady industry operating in secrecy, purchasing, aggregating, analyzing, inferring, and reselling intrusive personal data about people, mostly for marketing purposes [ 90 ]. Startups are valued by their user numbers, by “eyeballs”—i.e., by their surveillance capabilities.
个人数据是有价值的资产,这一说法因数据经纪商的存在而得到印证。这是一个在暗中运作的灰色行业,他们购买、汇总、分析、推断并转售关于人们的侵入性个人数据,主要用于营销目的 [90]。创业公司通常根据其用户数量、根据“眼球”——也就是根据其监视能力——来估值。
Because the data is valuable, many people want it. Of course companies want it—that’s why they collect it in the first place. But governments want to obtain it too: by means of secret deals, coercion, legal compulsion, or simply stealing it [ 101 ]. When a company goes bankrupt, the personal data it has collected is one of the assets that get sold. Moreover, the data is difficult to secure, so breaches happen disconcertingly often [ 102 ].
因为数据很有价值,许多人都想要它。当然,公司也想要它,这就是为什么他们首先收集它的原因。但政府也想获得它:通过秘密交易、强迫、法律约束或仅仅是偷窃[101]。 当公司破产时,它收集的个人数据是被出售的资产之一。此外,数据很难保护,所以泄漏发生的频率令人不安[102]。
These observations have led critics to saying that data is not just an asset, but a “toxic asset” [ 101 ], or at least “hazardous material” [ 103 ]. Even if we think that we are capable of preventing abuse of data, whenever we collect data, we need to balance the benefits with the risk of it falling into the wrong hands: computer systems may be compromised by criminals or hostile foreign intelligence services, data may be leaked by insiders, the company may fall into the hands of unscrupulous management that does not share our values, or the country may be taken over by a regime that has no qualms about compelling us to hand over the data.
这些观察结果导致评论家们称数据不仅仅是一种资产,而是“有毒资产”[101],或者至少是“有害物质”[103]。即使我们认为自己能够防止数据被滥用,每当我们收集数据时,我们都需要权衡收益与风险。计算机系统可能会被罪犯或敌对外国情报服务机构入侵,数据可能会被内部人员泄露,公司可能会落入不道德的管理层手中,他们不会分享我们的价值观,或者国家可能会被一个毫不犹豫地迫使我们交出数据的政权所接管。
When collecting data, we need to consider not just today’s political environment, but all possible future governments. There is no guarantee that every government elected in future will respect human rights and civil liberties, so “it is poor civic hygiene to install technologies that could someday facilitate a police state” [ 104 ].
在收集数据时,我们不仅需要考虑今天的政治环境,还要考虑所有可能的未来政府。无法保证未来选出的每一届政府都会尊重人权和公民自由,因此“安装某天可能助长警察国家的技术,是糟糕的公民卫生习惯”[104]。
“Knowledge is power,” as the old adage goes. And furthermore, “to scrutinize others while avoiding scrutiny oneself is one of the most important forms of power” [ 105 ]. This is why totalitarian governments want surveillance: it gives them the power to control the population. Although today’s technology companies are not overtly seeking political power, the data and knowledge they have accumulated nevertheless gives them a lot of power over our lives, much of which is surreptitious, outside of public oversight [ 106 ].
据古谚语所说,“知识就是力量”。而且,“深入审查他人,同时避免接受审查本身,是最重要的权力形式之一”[105]。这就是为什么极权政府需要监视:它给了他们控制人民的权力。尽管当今的科技公司并非公开寻求政治权力,但他们积累的数据和知识仍然在我们生活中拥有很大的影响力,其中很大一部分是隐秘的,没有公众监督[106]。
Remembering the Industrial Revolution
Data is the defining feature of the information age. The internet, data storage, processing, and software-driven automation are having a major impact on the global economy and human society. As our daily lives and social organization have changed in the past decade, and will probably continue to radically change in the coming decades, comparisons to the Industrial Revolution come to mind [ 87 , 96 ].
数据是信息时代的决定性特征。互联网、数据存储、处理和软件驱动的自动化正在对全球经济和人类社会产生重大影响。我们的日常生活与社会组织在过去十年中已经发生改变,并且很可能在未来几十年继续发生根本性变化,这不禁让人联想到工业革命 [87, 96]。
The Industrial Revolution came about through major technological and agricultural advances, and it brought sustained economic growth and significantly improved living standards in the long run. Yet it also came with major problems: pollution of the air (due to smoke and chemical processes) and the water (from industrial and human waste) was dreadful. Factory owners lived in splendor, while urban workers often lived in very poor housing and worked long hours in harsh conditions. Child labor was common, including dangerous and poorly paid work in mines.
工业革命源于重大的技术与农业进步,从长远来看,它带来了持续的经济增长和显著改善的生活水平。但它也带来了重大问题:空气污染(来自烟雾和化学过程)和水污染(来自工业和生活废物)十分可怕。工厂主过着奢华的生活,而城市工人往往住在非常糟糕的住房里,在恶劣的条件下长时间工作。童工很普遍,包括在矿井中从事危险而低薪的工作。
It took a long time before safeguards were established, such as environmental protection regulations, safety protocols for workplaces, outlawing child labor, and health inspections for food. Undoubtedly the cost of doing business increased when factories could no longer dump their waste into rivers, sell tainted foods, or exploit workers. But society as a whole benefited hugely, and few of us would want to return to a time before those regulations [ 87 ].
很长时间过去了,人们才建立了许多保障措施,比如环境保护法规、工作场所安全协议、禁止童工,并对食品进行卫生检查。当工厂不能再将废弃物排放到河流中、销售污染食品或剥削工人时,经商成本无疑会增加。但整个社会从中获益匪浅,很少有人愿意返回这些规定实施之前的时代。
Just as the Industrial Revolution had a dark side that needed to be managed, our transition to the information age has major problems that we need to confront and solve. I believe that the collection and use of data is one of those problems. In the words of Bruce Schneier [ 96 ]:
正如工业革命有需要管控的阴暗面一样,我们向信息时代的过渡也存在需要正视和解决的重大问题。我相信数据的收集与使用正是其中之一。用布鲁斯·施奈尔(Bruce Schneier)的话来说 [96]:
Data is the pollution problem of the information age, and protecting privacy is the environmental challenge. Almost all computers produce information. It stays around, festering. How we deal with it—how we contain it and how we dispose of it—is central to the health of our information economy. Just as we look back today at the early decades of the industrial age and wonder how our ancestors could have ignored pollution in their rush to build an industrial world, our grandchildren will look back at us during these early decades of the information age and judge us on how we addressed the challenge of data collection and misuse.
数据是信息时代的污染问题,而保护隐私则是相应的环境挑战。几乎所有的计算机都会产生信息。这些信息会一直留存、不断发酵。我们如何处理它——如何控制它、如何销毁它——是信息经济健康与否的核心。正如我们今天回望工业时代的最初几十年,不解先辈们在匆忙建设工业世界时怎么能无视污染一样,我们的子孙后代在回望信息时代的最初几十年时,也会以我们如何应对数据收集与滥用的挑战来评判我们。
We should try to make them proud.
我们应该努力让他们感到骄傲。
Legislation and self-regulation
Data protection laws might be able to help preserve individuals’ rights. For example, the 1995 European Data Protection Directive states that personal data must be “collected for specified, explicit and legitimate purposes and not further processed in a way incompatible with those purposes,” and furthermore that data must be “adequate, relevant and not excessive in relation to the purposes for which they are collected” [ 107 ].
数据保护法可以帮助保护个人权利。例如,1995年欧洲数据保护指令规定,个人数据必须“为特定、明确和合法的目的收集,不得以与这些目的不兼容的方式进一步处理”,并且数据必须“与收集目的相关,而且不应过度”。[107]。
However, it is doubtful whether this legislation is effective in today’s internet context [ 108 ]. These rules run directly counter to the philosophy of Big Data, which is to maximize data collection, to combine it with other datasets, to experiment and to explore in order to generate new insights. Exploration means using data for unforeseen purposes, which is the opposite of the “specified and explicit” purposes for which the user gave their consent (if we can meaningfully speak of consent at all [ 109 ]). Updated regulations are now being developed [ 89 ].
然而,在今天的互联网环境下,这种立法是否有效令人怀疑 [108]。这些规则与大数据的哲学背道而驰,后者要最大化地收集数据、将其与其他数据集结合、进行实验和探索,以产生新的洞见。探索意味着把数据用于未曾预见的目的,这与用户所同意的“特定且明确”的目的正好相反(如果我们还能有意义地谈论“同意”的话 [109])。更新后的法规目前正在制定中 [89]。
Companies that collect lots of data about people oppose regulation as being a burden and a hindrance to innovation. To some extent that opposition is justified. For example, when sharing medical data, there are clear risks to privacy, but there are also potential opportunities: how many deaths could be prevented if data analysis was able to help us achieve better diagnostics or find better treatments [ 110 ]? Over-regulation may prevent such breakthroughs. It is difficult to balance such potential opportunities with the risks [ 105 ].
大量收集人们数据的公司反对监管,认为监管是负担,会阻碍创新。这种反对在某种程度上是有道理的。例如,在共享医疗数据时,存在明显的隐私风险,但也存在潜在的机会:如果数据分析能够帮助我们实现更好的诊断或找到更好的治疗方法,能避免多少死亡 [110]?过度监管可能会阻碍这类突破。在这类潜在机会与风险之间取得平衡是很困难的 [105]。
Fundamentally, I think we need a culture shift in the tech industry with regard to personal data. We should stop regarding users as metrics to be optimized, and remember that they are humans who deserve respect, dignity, and agency. We should self-regulate our data collection and processing practices in order to establish and maintain the trust of the people who depend on our software [ 111 ]. And we should take it upon ourselves to educate end users about how their data is used, rather than keeping them in the dark.
从根本上讲,我认为技术产业需要转变文化,关注个人数据。我们应该停止把用户视为要优化的指标,而要记住他们是有尊严、权利和自主权的人。我们应该自我约束数据收集和处理实践,以建立和维护人们对我们软件的信任[111]。我们应该自己教育终端用户,让他们了解他们的数据如何被使用,而不是让他们一无所知。
We should allow each individual to maintain their privacy—i.e., their control over own data—and not steal that control from them through surveillance. Our individual right to control our data is like the natural environment of a national park: if we don’t explicitly protect and care for it, it will be destroyed. It will be the tragedy of the commons, and we will all be worse off for it. Ubiquitous surveillance is not inevitable—we are still able to stop it.
我们应该允许每个人保护自己的隐私——也就是掌控自己的数据——不能通过监控侵犯他们的掌控权。我们每个人掌控自己数据的权利就像国家公园的自然环境:如果我们不明确地保护和呵护它,它将被摧毁。这将是“公地悲剧”,我们所有人都会因此受到损害。无处不在的监控并非不可避免——我们仍然能够阻止它发生。
How exactly we might achieve this is an open question. To begin with, we should not retain data forever, but purge it as soon as it is no longer needed [ 111 , 112 ]. Purging data runs counter to the idea of immutability (see “Limitations of immutability” ), but that issue can be solved. A promising approach I see is to enforce access control through cryptographic protocols, rather than merely by policy [ 113 , 114 ]. Overall, culture and attitude changes will be necessary.
我们如何确切地达成这个目标仍然是个未知问题。首先,我们不应该永久地保留数据,而是在不再需要时立刻清除它[111,112]。清除数据违反了不可变性的概念(请参见“不可变性的限制”),但这个问题可以解决。我看到的一个有前途的方法是通过加密协议来执行访问控制,而不仅仅是通过策略[113,114]。总的来说,需要文化和态度上的改变。
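One known technique along these lines (not spelled out in the text; the names here are hypothetical, and the sketch uses a toy one-time pad purely for illustration, where a real system would use an authenticated cipher such as AES-GCM) is “crypto-shredding”: encrypt each person’s data under its own key, so that purging the data only requires destroying the key, even if ciphertext lingers in backups.

沿这一思路的一种已知技术是“加密粉碎”(crypto-shredding,正文并未具体展开;这里的名称均为假设,示意中仅为说明而使用玩具式一次性密码本,真实系统应使用 AES-GCM 之类的带认证加密算法):用每个人专属的密钥加密其数据,这样清除数据就只需要销毁密钥,即使密文仍残留在备份中也无法读取。

```python
import secrets

keys = {}   # per-user keys; in practice held in a separate key-management system
store = {}  # ciphertext store; may be replicated and backed up freely

def put(user, plaintext: bytes):
    """Encrypt a user's data under that user's own key."""
    key = keys.setdefault(user, secrets.token_bytes(len(plaintext)))
    # One-time pad purely for illustration; a real system would use an
    # authenticated cipher.
    store[user] = bytes(a ^ b for a, b in zip(plaintext, key))

def get(user) -> bytes:
    key = keys[user]  # raises KeyError once the key has been purged
    return bytes(a ^ b for a, b in zip(store[user], key))

def purge(user):
    """'Delete' the data by destroying the key (crypto-shredding)."""
    del keys[user]  # the retained ciphertext is now unreadable

put("alice", b"intimate data")
assert get("alice") == b"intimate data"
purge("alice")
# get("alice") now fails: without the key, the retained bytes are useless.
```

The appeal of this design is that deletion becomes enforceable by cryptography rather than merely by policy: backups and replicas need not be rewritten, only the key must be destroyed.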
Summary
In this chapter we discussed new approaches to designing data systems, and I included my personal opinions and speculations about the future. We started with the observation that there is no one single tool that can efficiently serve all possible use cases, and so applications necessarily need to compose several different pieces of software to accomplish their goals. We discussed how to solve this data integration problem by using batch processing and event streams to let data changes flow between different systems.
在本章中,我们讨论了设计数据系统的新方法,并包括了我个人对未来的看法和推测。我们从观察到没有一种单一的工具可以有效地为所有可能的用例服务开始,并且应用程序必须组合几个不同的软件来实现其目标。我们讨论了如何通过使用批处理和事件流解决这个数据集成问题,以让数据变化在不同的系统之间流动。
In this approach, certain systems are designated as systems of record, and other data is derived from them through transformations. In this way we can maintain indexes, materialized views, machine learning models, statistical summaries, and more. By making these derivations and transformations asynchronous and loosely coupled, a problem in one area is prevented from spreading to unrelated parts of the system, increasing the robustness and fault-tolerance of the system as a whole.
在这种方法中,某些系统被指定为记录系统,而其他数据则通过转换从它们派生。通过这种方式,我们可以维护索引、物化视图、机器学习模型、统计摘要等等。通过将这些派生和转换异步和松散耦合,防止一个区域的问题传播到系统的无关部分,从而增加整个系统的鲁棒性和容错性。
Expressing dataflows as transformations from one dataset to another also helps evolve applications: if you want to change one of the processing steps, for example to change the structure of an index or cache, you can just rerun the new transformation code on the whole input dataset in order to rederive the output. Similarly, if something goes wrong, you can fix the code and reprocess the data in order to recover.
将数据流表达为从一个数据集到另一个数据集的转换也有助于演变应用程序:如果您想更改其中一个处理步骤,例如更改索引或缓存的结构,您可以在整个输入数据集上重新运行新的转换代码,以便重新生成输出。同样,如果出现问题,您可以修复代码并重新处理数据以进行恢复。
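The idea above can be illustrated with a minimal sketch (all names hypothetical): if a derived index is a pure function of the system of record, then changing a processing step just means rerunning the new transformation code over the whole input to rederive the output.

上述思路可以用一个极简的示意来说明(其中的名称均为假设):如果派生索引是记录系统的纯函数,那么更改某个处理步骤,就只是用新的转换代码重新处理整个输入、重新生成输出。

```python
# The system of record: the authoritative input dataset.
records = [
    {"id": 1, "title": "Cats and dogs"},
    {"id": 2, "title": "Dog training"},
]

def build_index(docs, tokenize):
    """Derive an inverted index (term -> set of document ids) from the input."""
    index = {}
    for doc in docs:
        for term in tokenize(doc["title"]):
            index.setdefault(term, set()).add(doc["id"])
    return index

# Version 1 of the transformation: naive case-sensitive tokens.
v1 = build_index(records, lambda t: t.split())

# Changing the processing step (here: lowercasing terms) just means
# rerunning the new transformation over the same input to rederive
# the whole index; no migration of the old derived state is needed.
v2 = build_index(records, lambda t: t.lower().split())

assert v2["dog"] == {2}   # a query that only works under the new code
```

If the new transformation turns out to be buggy, the same property lets you fix the code and reprocess the input again to recover.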
These processes are quite similar to what databases already do internally, so we recast the idea of dataflow applications as unbundling the components of a database, and building an application by composing these loosely coupled components.
这些过程与数据库内部已经执行的过程非常相似,因此我们将数据流应用程序的理念重塑为解构数据库组件,并通过组合这些松散耦合的组件构建应用程序。
Derived state can be updated by observing changes in the underlying data. Moreover, the derived state itself can further be observed by downstream consumers. We can even take this dataflow all the way through to the end-user device that is displaying the data, and thus build user interfaces that dynamically update to reflect data changes and continue to work offline.
派生状态可以通过观察基础数据的变化来更新。此外,下游消费者还可以进一步观察派生状态本身。我们甚至可以将这种数据流传递到显示数据的最终用户设备上,因此构建动态更新以反映数据变化并继续在离线状态下运行的用户界面。
Next, we discussed how to ensure that all of this processing remains correct in the presence of faults. We saw that strong integrity guarantees can be implemented scalably with asynchronous event processing, by using end-to-end operation identifiers to make operations idempotent and by checking constraints asynchronously. Clients can either wait until the check has passed, or go ahead without waiting but risk having to apologize about a constraint violation. This approach is much more scalable and robust than the traditional approach of using distributed transactions, and fits with how many business processes work in practice.
接下来,我们讨论了如何确保所有这些处理在出现故障时仍然保持正确。我们看到,可扩展的强完整性保证可以通过异步事件处理来实现:使用端到端的操作标识符使操作幂等,并异步地检查约束。客户端可以等到检查通过,也可以不等待而继续,但要承担事后就违反约束而道歉的风险。这种方法比传统的分布式事务方法更具可扩展性和鲁棒性,也符合实践中许多业务流程的运作方式。
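As a hedged sketch of the idempotency idea (the names and in-memory store are hypothetical; a real system would keep the set of processed identifiers in durable storage alongside the data it protects), an end-to-end operation identifier generated once by the client lets every downstream processor suppress duplicates, so retries are safe:

下面是关于幂等性思路的一个示意草图(其中的名称和内存存储均为假设;真实系统会把已处理的标识符与受其保护的数据一起保存在持久存储中):客户端一次性生成的端到端操作标识符,使下游处理者能够抑制重复,从而让重试变得安全。

```python
import uuid

processed_ids = set()        # in practice: a durable table keyed by request ID
balances = {"alice": 100}

def transfer(request_id, account, amount):
    """Apply an operation at most once: duplicates carrying the same
    end-to-end identifier are suppressed, so network retries are safe."""
    if request_id in processed_ids:
        return "duplicate suppressed"
    processed_ids.add(request_id)
    balances[account] += amount
    return "applied"

req = str(uuid.uuid4())      # generated once by the client, end to end
transfer(req, "alice", -30)  # first delivery
transfer(req, "alice", -30)  # network retry: has no further effect
assert balances["alice"] == 70
```

The crucial design point is that the identifier is generated at the outermost layer (the client), not per hop: only then does deduplication cover the entire path, rather than a single link.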
By structuring applications around dataflow and checking constraints asynchronously, we can avoid most coordination and create systems that maintain integrity but still perform well, even in geographically distributed scenarios and in the presence of faults. We then talked a little about using audits to verify the integrity of data and detect corruption.
通过围绕数据流构建应用程序,并异步地检查约束,我们可以避免大多数协调工作,构建出即使在地理上分布、即使存在故障也能保持完整性且性能良好的系统。然后,我们简单讨论了使用审计来验证数据的完整性并检测数据损坏。
Finally, we took a step back and examined some ethical aspects of building data-intensive applications. We saw that although data can be used to do good, it can also do significant harm: making and justifying decisions that seriously affect people’s lives and are difficult to appeal against, leading to discrimination and exploitation, normalizing surveillance, and exposing intimate information. We also run the risk of data breaches, and we may find that a well-intentioned use of data has unintended consequences.
最后,我们退后一步,审视了构建数据密集型应用的一些伦理问题。我们看到,数据虽然可以用来做好事,但也可能造成重大伤害:作出严重影响人们生活且难以申诉的决策并为其辩护,导致歧视与剥削,使监视常态化,并暴露私密信息。我们还面临数据泄露的风险,并且可能发现,即使是善意的数据使用也会产生意想不到的后果。
As software and data are having such a large impact on the world, we engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in: a world that treats people with humanity and respect. I hope that we can work together toward that goal.
由于软件和数据正对世界产生如此巨大的影响,我们工程师必须记住,我们有责任去构建我们想要生活在其中的那种世界:一个以人道和尊重待人的世界。我希望我们可以共同朝着这个目标努力。
Footnotes
i Explaining a joke rarely improves it, but I don’t want anyone to feel left out. Here, Church is a reference to the mathematician Alonzo Church, who created the lambda calculus, an early form of computation that is the basis for most functional programming languages. The lambda calculus has no mutable state (i.e., no variables that can be overwritten), so one could say that mutable state is separate from Church’s work.
解释笑话很少能让笑话更好笑,但我不想让任何人感到被冷落。这里的“Church”指数学家Alonzo Church,他创造了Lambda演算,这是一种早期的计算形式,也是大多数函数式编程语言的基础。Lambda演算没有可变状态(即没有可以被覆写的变量),因此可以说可变状态与Church的工作是分离的。
ii In the microservices approach, you could avoid the synchronous network request by caching the exchange rate locally in the service that processes the purchase. However, in order to keep that cache fresh, you would need to periodically poll for updated exchange rates, or subscribe to a stream of changes—which is exactly what happens in the dataflow approach.
在微服务方法中,你可以通过在处理购买的服务中本地缓存汇率来避免同步网络请求。然而,为了保持缓存新鲜,你需要定期轮询最新汇率,或者订阅一个汇率变更流,而这正是数据流方法中所发生的事情。
iii Less facetiously, the set of distinct search queries with nonempty search results is finite, assuming a finite corpus. However, it would be exponential in the number of terms in the corpus, which is still pretty bad news.
说得正经一点:假设语料库有限,具有非空搜索结果的不同搜索查询的集合也是有限的。然而,它会随语料库中词项的数量呈指数增长,这仍然是个坏消息。
References
[ 1 ] Rachid Belaid: “ Postgres Full-Text Search is Good Enough! ,” rachbelaid.com , July 13, 2015.
[1] Rachid Belaid: “Postgres全文搜索足够好了!”, rachbelaid.com,2015年7月13日。
[ 2 ] Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “ Challenges to Adopting Stronger Consistency at Scale ,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.
[2] Philippe Ajoux、Nathan Bronson、Sanjeev Kumar等:“大规模采用更强一致性的挑战”,发表于第15届USENIX操作系统热点话题研讨会(HotOS),2015年5月。
[ 3 ] Pat Helland and Dave Campbell: “ Building on Quicksand ,” at 4th Biennial Conference on Innovative Data Systems Research (CIDR), January 2009.
[3] Pat Helland和Dave Campbell:“在流沙上建立”,发表于2009年1月的第四届创新数据系统研究双年会(CIDR)。
[ 4 ] Jessica Kerr: “ Provenance and Causality in Distributed Systems ,” blog.jessitron.com , September 25, 2016.
[4] Jessica Kerr: "分布式系统中的溯源和因果关系," blog.jessitron.com, 2016年9月25日.
[ 5 ] Kostas Tzoumas: “ Batch Is a Special Case of Streaming ,” data-artisans.com , September 15, 2015.
[5] Kostas Tzoumas:“批处理是流处理的一个特例”,data-artisans.com,2015年9月15日。
[ 6 ] Shinji Kim and Robert Blafford: “ Stream Windowing Performance Analysis: Concord and Spark Streaming ,” concord.io , July 6, 2016.
[6] Shinji Kim和Robert Blafford:“流窗口性能分析:Concord与Spark Streaming”,concord.io,2016年7月6日。
[ 7 ] Jay Kreps: “ The Log: What Every Software Engineer Should Know About Real-Time Data’s Unifying Abstraction ,” engineering.linkedin.com , December 16, 2013.
[7] Jay Kreps: “日志:关于实时数据统一抽象的每个软件工程师都应该知道的事情”, engineering.linkedin.com,2013年12月16日。
[ 8 ] Pat Helland: “ Life Beyond Distributed Transactions: An Apostate’s Opinion ,” at 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 2007.
[8] Pat Helland: “超越分布式事务: 一个异端的观点”,于2007年1月举行的第三届创新数据系统研究会议(CIDR)上发表。
[ 9 ] “ Great Western Railway (1835–1948) ,” Network Rail Virtual Archive, networkrail.co.uk .
[9] “大西部铁路(1835年至1948年),” 英国网络铁路虚拟档案, networkrail.co.uk.
[ 10 ] Jacqueline Xu: “ Online Migrations at Scale ,” stripe.com , February 2, 2017.
[10] Jacqueline Xu:“大规模在线迁移”,stripe.com,2017年2月2日。
[ 11 ] Molly Bartlett Dishman and Martin Fowler: “ Agile Architecture ,” at O’Reilly Software Architecture Conference , March 2015.
[11] Molly Bartlett Dishman 和 Martin Fowler: "敏捷架构",于 O'Reilly 软件架构会议,2015 年 3 月。
[ 12 ] Nathan Marz and James Warren: Big Data: Principles and Best Practices of Scalable Real-Time Data Systems . Manning, 2015. ISBN: 978-1-617-29034-3
[12] Nathan Marz和James Warren: 《大数据:可伸缩实时数据系统的原则和最佳实践》。曼宁出版社,2015年。ISBN: 978-1-617-29034-3。
[ 13 ] Oscar Boykin, Sam Ritchie, Ian O’Connell, and Jimmy Lin: “ Summingbird: A Framework for Integrating Batch and Online MapReduce Computations ,” at 40th International Conference on Very Large Data Bases (VLDB), September 2014.
[13] Oscar Boykin、Sam Ritchie、Ian O’Connell和Jimmy Lin:“Summingbird:一个集成批处理与在线MapReduce计算的框架”,发表于第40届超大型数据库国际会议(VLDB),2014年9月。
[ 14 ] Jay Kreps: “ Questioning the Lambda Architecture ,” oreilly.com , July 2, 2014.
[14] Jay Kreps:“质疑Lambda架构”,oreilly.com,2014年7月2日。
[ 15 ] Raul Castro Fernandez, Peter Pietzuch, Jay Kreps, et al.: “ Liquid: Unifying Nearline and Offline Big Data Integration ,” at 7th Biennial Conference on Innovative Data Systems Research (CIDR), January 2015.
[15] Raul Castro Fernandez、Peter Pietzuch、Jay Kreps等:“Liquid:统一近线与离线的大数据集成”,发表于第7届创新数据系统研究双年会(CIDR),2015年1月。
[ 16 ] Dennis M. Ritchie and Ken Thompson: “ The UNIX Time-Sharing System ,” Communications of the ACM , volume 17, number 7, pages 365–375, July 1974. doi:10.1145/361011.361061
[16] 丹尼斯·里奇和肯·汤普森:“UNIX分时系统”,《ACM通讯》,第17卷,第7期,第365-375页,1974年7月。doi:10.1145/361011.361061
[ 17 ] Eric A. Brewer and Joseph M. Hellerstein: “ CS262a: Advanced Topics in Computer Systems ,” lecture notes, University of California, Berkeley, cs.berkeley.edu , August 2011.
[17] 艾瑞克·A·布鲁尔和约瑟夫·M·赫勒斯坦: “CS262a:计算机系统高级主题”,讲义,加州大学伯克利分校cs.berkeley.edu, 2011年8月。
[ 18 ] Michael Stonebraker: “ The Case for Polystores ,” wp.sigmod.org , July 13, 2015.
[18] 迈克尔·斯通布雷克:《Polystore的案例》,wp.sigmod.org,2015年7月13日。
[ 19 ] Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, et al.: “ The BigDAWG Polystore System ,” ACM SIGMOD Record , volume 44, number 2, pages 11–16, June 2015. doi:10.1145/2814710.2814713
[19] 詹妮·达根、亚伦·J·埃尔莫、迈克尔·斯通布雷克等:《BigDAWG Polystore系统》,《ACM SIGMOD Record》,第44卷,第2期,第11-16页,2015年6月。doi:10.1145/2814710.2814713
[ 20 ] Patrycja Dybka: “ Foreign Data Wrappers for PostgreSQL ,” vertabelo.com , March 24, 2015.
[20] Patrycja Dybka:“PostgreSQL的外部数据包装器”,vertabelo.com,2015年3月24日。
[ 21 ] David B. Lomet, Alan Fekete, Gerhard Weikum, and Mike Zwilling: “ Unbundling Transaction Services in the Cloud ,” at 4th Biennial Conference on Innovative Data Systems Research (CIDR), January 2009.
[21] David B. Lomet、Alan Fekete、Gerhard Weikum和Mike Zwilling:“云中事务服务的解耦”,发表于第4届创新数据系统研究双年会(CIDR),2009年1月。
[ 22 ] Martin Kleppmann and Jay Kreps: “ Kafka, Samza and the Unix Philosophy of Distributed Data ,” IEEE Data Engineering Bulletin , volume 38, number 4, pages 4–14, December 2015.
[22] Martin Kleppmann和Jay Kreps:“Kafka、Samza和分布式数据的Unix哲学”,IEEE数据工程通报,卷38,第4期,页4-14,2015年12月。
[ 23 ] John Hugg: “ Winning Now and in the Future: Where VoltDB Shines ,” voltdb.com , March 23, 2016.
[23] John Hugg:“赢在现在与未来:VoltDB的闪光点”,voltdb.com,2016年3月23日。
[ 24 ] Frank McSherry, Derek G. Murray, Rebecca Isaacs, and Michael Isard: “ Differential Dataflow ,” at 6th Biennial Conference on Innovative Data Systems Research (CIDR), January 2013.
[24] Frank McSherry、Derek G. Murray、Rebecca Isaacs和Michael Isard:“差分数据流”,发表于第6届创新数据系统研究双年会(CIDR),2013年1月。
[ 25 ] Derek G Murray, Frank McSherry, Rebecca Isaacs, et al.: “ Naiad: A Timely Dataflow System ,” at 24th ACM Symposium on Operating Systems Principles (SOSP), pages 439–455, November 2013. doi:10.1145/2517349.2522738
[25] Derek G. Murray、Frank McSherry、Rebecca Isaacs等:“Naiad:一个及时数据流系统”,发表于第24届ACM操作系统原理研讨会(SOSP),第439-455页,2013年11月。doi:10.1145/2517349.2522738
[ 26 ] Gwen Shapira: “ We have a bunch of customers who are implementing ‘database inside-out’ concept and they all ask ‘is anyone else doing it? are we crazy?’ ” twitter.com , July 28, 2016.
[26] Gwen Shapira:“我们有一批客户正在实现‘数据库翻转’(database inside-out)的概念,他们都会问:‘还有别人在这么做吗?我们是不是疯了?’”,twitter.com,2016年7月28日。
[ 27 ] Martin Kleppmann: “ Turning the Database Inside-out with Apache Samza, ” at Strange Loop , September 2014.
[27] Martin Kleppmann:“用Apache Samza将数据库翻转”,发表于Strange Loop大会,2014年9月。
[ 28 ] Peter Van Roy and Seif Haridi: Concepts, Techniques, and Models of Computer Programming . MIT Press, 2004. ISBN: 978-0-262-22069-9
[28] Peter Van Roy和Seif Haridi:《计算机编程的概念、技术与模型》。MIT Press,2004年。ISBN:978-0-262-22069-9
[ 29 ] “ Juttle Documentation ,” juttle.github.io , 2016.
[29] "Juttle文档", juttle.github.io, 2016.
[ 30 ] Evan Czaplicki and Stephen Chong: “ Asynchronous Functional Reactive Programming for GUIs ,” at 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2013. doi:10.1145/2491956.2462161
[30] Evan Czaplicki和Stephen Chong:“面向GUI的异步函数式响应式编程”,发表于第34届ACM SIGPLAN编程语言设计与实现会议(PLDI),2013年6月。doi:10.1145/2491956.2462161
[ 31 ] Engineer Bainomugisha, Andoni Lombide Carreton, Tom van Cutsem, Stijn Mostinckx, and Wolfgang de Meuter: “ A Survey on Reactive Programming ,” ACM Computing Surveys , volume 45, number 4, pages 1–34, August 2013. doi:10.1145/2501654.2501666
[31] Engineer Bainomugisha、Andoni Lombide Carreton、Tom van Cutsem、Stijn Mostinckx和Wolfgang de Meuter:“响应式编程综述”,《ACM Computing Surveys》,第45卷,第4期,第1-34页,2013年8月。doi:10.1145/2501654.2501666
[ 32 ] Peter Alvaro, Neil Conway, Joseph M. Hellerstein, and William R. Marczak: “ Consistency Analysis in Bloom: A CALM and Collected Approach ,” at 5th Biennial Conference on Innovative Data Systems Research (CIDR), January 2011.
[32] Peter Alvaro、Neil Conway、Joseph M. Hellerstein和William R. Marczak:“Bloom中的一致性分析:一种CALM而冷静的方法”,发表于第5届创新数据系统研究双年会(CIDR),2011年1月。
[ 33 ] Felienne Hermans: “ Spreadsheets Are Code ,” at Code Mesh , November 2015.
[33] Felienne Hermans:“电子表格即代码”,发表于Code Mesh大会,2015年11月。
[ 34 ] Dan Bricklin and Bob Frankston: “ VisiCalc: Information from Its Creators ,” danbricklin.com .
[34] 丹·布里克林和鲍勃·弗兰克斯顿:“VisiCalc:来自其创造者的信息”,danbricklin.com。
[ 35 ] D. Sculley, Gary Holt, Daniel Golovin, et al.: “ Machine Learning: The High-Interest Credit Card of Technical Debt ,” at NIPS Workshop on Software Engineering for Machine Learning (SE4ML), December 2014.
[35] D. Sculley、Gary Holt、Daniel Golovin等:“机器学习:技术债务的高息信用卡”,发表于NIPS机器学习软件工程研讨会(SE4ML),2014年12月。
[ 36 ] Peter Bailis, Alan Fekete, Michael J Franklin, et al.: “ Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity ,” at ACM International Conference on Management of Data (SIGMOD), June 2015. doi:10.1145/2723372.2737784
[36] Peter Bailis,Alan Fekete,Michael J Franklin等:“野生并发控制:现代应用完整性的实证研究”,发表于2015年6月ACM数据管理国际会议(SIGMOD),doi:10.1145/2723372.2737784。
[ 37 ] Guy Steele: “ Re: Need for Macros (Was Re: Icon) ,” email to ll1-discuss mailing list, people.csail.mit.edu , December 24, 2001.
[37] Guy Steele: “关于需要宏 (回复:Icon),” 发送至ll1-discuss邮件列表的电子邮件,people.csail.mit.edu,2001年12月24日。
[ 38 ] David Gelernter: “ Generative Communication in Linda ,” ACM Transactions on Programming Languages and Systems (TOPLAS), volume 7, number 1, pages 80–112, January 1985. doi:10.1145/2363.2433
[38] 大卫·格勒特纳:“Linda中的生成式通信”,《ACM Transactions on Programming Languages and Systems》(TOPLAS),第7卷,第1期,第80-112页,1985年1月。doi:10.1145/2363.2433
[ 39 ] Patrick Th. Eugster, Pascal A. Felber, Rachid Guerraoui, and Anne-Marie Kermarrec: “ The Many Faces of Publish/Subscribe ,” ACM Computing Surveys , volume 35, number 2, pages 114–131, June 2003. doi:10.1145/857076.857078
[39] Patrick Th. Eugster、Pascal A. Felber、Rachid Guerraoui和Anne-Marie Kermarrec:“发布/订阅的多张面孔”,《ACM Computing Surveys》,第35卷,第2期,第114-131页,2003年6月。doi:10.1145/857076.857078
[ 40 ] Ben Stopford: “ Microservices in a Streaming World ,” at QCon London , March 2016.
[40] Ben Stopford: “流式世界中的微服务”,于2016年3月在QCon伦敦举行。
[ 41 ] Christian Posta: “ Why Microservices Should Be Event Driven: Autonomy vs Authority ,” blog.christianposta.com , May 27, 2016.
[41] Christian Posta:“为什么微服务应当是事件驱动的:自治与权威”,blog.christianposta.com,2016年5月27日。
[ 42 ] Alex Feyerke: “ Say Hello to Offline First ,” hood.ie , November 5, 2013.
[42] Alex Feyerke:“向离线优先问好”(Say Hello to Offline First),hood.ie,2013年11月5日。
[ 43 ] Sebastian Burckhardt, Daan Leijen, Jonathan Protzenko, and Manuel Fähndrich: “ Global Sequence Protocol: A Robust Abstraction for Replicated Shared State ,” at 29th European Conference on Object-Oriented Programming (ECOOP), July 2015. doi:10.4230/LIPIcs.ECOOP.2015.568
[43] Sebastian Burckhardt, Daan Leijen, Jonathan Protzenko, and Manuel Fähndrich:“全局序列协议: 用于复制共享状态的强大抽象”,发表于2015年7月的第29届欧洲面向对象编程会议(ECOOP)。doi:10.4230/LIPIcs.ECOOP.2015.568。
[ 44 ] Mark Soper: “ Clearing Up React Data Management Confusion with Flux, Redux, and Relay ,” medium.com , December 3, 2015.
[44] Mark Soper:“用Flux、Redux和Relay理清React数据管理的困惑”,medium.com,2015年12月3日。
[ 45 ] Eno Thereska, Damian Guy, Michael Noll, and Neha Narkhede: “ Unifying Stream Processing and Interactive Queries in Apache Kafka ,” confluent.io , October 26, 2016.
[45] Eno Thereska、Damian Guy、Michael Noll和Neha Narkhede:“在Apache Kafka中统一流处理与交互式查询”,confluent.io,2016年10月26日。
[ 46 ] Frank McSherry: “ Dataflow as Database ,” github.com , July 17, 2016.
[46] Frank McSherry:“数据流作为数据库”,github.com,2016年7月17日。
[ 47 ] Peter Alvaro: “ I See What You Mean ,” at Strange Loop , September 2015.
[47] Peter Alvaro:“我明白你的意思”,发表于Strange Loop大会,2015年9月。
[ 48 ] Nathan Marz: “ Trident: A High-Level Abstraction for Realtime Computation ,” blog.twitter.com , August 2, 2012.
[48] Nathan Marz: “Trident:一个用于实时计算的高层抽象”,blog.twitter.com,2012年8月2日。
[ 49 ] Edi Bice: “ Low Latency Web Scale Fraud Prevention with Apache Samza, Kafka and Friends ,” at Merchant Risk Council MRC Vegas Conference , March 2016.
[49] Edi Bice:“使用Apache Samza、Kafka及相关技术实现低延迟的Web规模欺诈预防”,发表于商户风险委员会MRC Vegas会议,2016年3月。
[ 50 ] Charity Majors: “ The Accidental DBA ,” charity.wtf , October 2, 2016.
[50] Charity Majors:“意外的DBA”,charity.wtf,2016年10月2日。
[ 51 ] Arthur J. Bernstein, Philip M. Lewis, and Shiyong Lu: “ Semantic Conditions for Correctness at Different Isolation Levels ,” at 16th International Conference on Data Engineering (ICDE), February 2000. doi:10.1109/ICDE.2000.839387
[51] Arthur J. Bernstein, Philip M. Lewis, and Shiyong Lu:“不同隔离级别的正确性的语义条件”,于2000年2月第16届国际数据工程会议(ICDE)。doi:10.1109/ICDE.2000.839387。
[ 52 ] Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan: “ Automating the Detection of Snapshot Isolation Anomalies ,” at 33rd International Conference on Very Large Data Bases (VLDB), September 2007.
[52] Sudhir Jorwekar, Alan Fekete,Krithi Ramamritham和S. Sudarshan:“自动检测快照隔离异常”,于2007年9月在第33届国际大型数据库会议(VLDB)上发表。
[ 53 ] Kyle Kingsbury: Jepsen blog post series , aphyr.com , 2013–2016.
[53] Kyle Kingsbury:Jepsen博客文章系列,aphyr.com,2013-2016年。
[ 54 ] Michael Jouravlev: “ Redirect After Post ,” theserverside.com , August 1, 2004.
[54] Michael Jouravlev:“POST后重定向”,theserverside.com,2004年8月1日。
[ 55 ] Jerome H. Saltzer, David P. Reed, and David D. Clark: “ End-to-End Arguments in System Design ,” ACM Transactions on Computer Systems , volume 2, number 4, pages 277–288, November 1984. doi:10.1145/357401.357402
[55] Jerome H. Saltzer、David P. Reed和David D. Clark:“系统设计中的端到端论证”,《ACM Transactions on Computer Systems》,第2卷,第4期,第277-288页,1984年11月。doi:10.1145/357401.357402
[ 56 ] Peter Bailis, Alan Fekete, Michael J. Franklin, et al.: “ Coordination-Avoiding Database Systems ,” Proceedings of the VLDB Endowment , volume 8, number 3, pages 185–196, November 2014.
[56] Peter Bailis、Alan Fekete、Michael J. Franklin等:“避免协调的数据库系统”,《Proceedings of the VLDB Endowment》,第8卷,第3期,第185-196页,2014年11月。
[ 57 ] Alex Yarmula: “ Strong Consistency in Manhattan ,” blog.twitter.com , March 17, 2016.
[57] Alex Yarmula:“曼哈顿的强一致性”,blog.twitter.com,2016年3月17日。
[ 58 ] Douglas B Terry, Marvin M Theimer, Karin Petersen, et al.: “ Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System ,” at 15th ACM Symposium on Operating Systems Principles (SOSP), pages 172–182, December 1995. doi:10.1145/224056.224070
[58] Douglas B. Terry、Marvin M. Theimer、Karin Petersen等:“在Bayou(一个弱连接的复制存储系统)中管理更新冲突”,发表于第15届ACM操作系统原理研讨会(SOSP),第172-182页,1995年12月。doi:10.1145/224056.224070
[ 59 ] Jim Gray: “ The Transaction Concept: Virtues and Limitations ,” at 7th International Conference on Very Large Data Bases (VLDB), September 1981.
[59] 吉姆·格雷:“事务概念:优点与局限”,发表于第7届超大型数据库国际会议(VLDB),1981年9月。
[ 60 ] Hector Garcia-Molina and Kenneth Salem: “ Sagas ,” at ACM International Conference on Management of Data (SIGMOD), May 1987. doi:10.1145/38713.38742
[60] 赫克托·加西亚-莫利纳和肯尼斯·萨勒姆:“Sagas”,发表于ACM数据管理国际会议(SIGMOD),1987年5月。doi:10.1145/38713.38742
[ 61 ] Pat Helland: “ Memories, Guesses, and Apologies ,” blogs.msdn.com , May 15, 2007.
[61] Pat Helland:“记忆、猜想和道歉”,blogs.msdn.com,2007年5月15日。
[ 62 ] Yoongu Kim, Ross Daly, Jeremie Kim, et al.: “ Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors ,” at 41st Annual International Symposium on Computer Architecture (ISCA), June 2014. doi:10.1145/2678373.2665726
[62] Yoongu Kim,Ross Daly,Jeremie Kim等:“不访问内存即可翻转位:关于DRAM干扰误差的实验研究”,发表于2014年6月的第41届国际计算机体系结构研讨会(ISCA),doi:10.1145/2678373.2665726。
[ 63 ] Mark Seaborn and Thomas Dullien: “ Exploiting the DRAM Rowhammer Bug to Gain Kernel Privileges ,” googleprojectzero.blogspot.co.uk , March 9, 2015.
[63] Mark Seaborn和Thomas Dullien:“利用DRAM Rowhammer漏洞获取内核权限”,googleprojectzero.blogspot.co.uk,2015年3月9日。
[ 64 ] Jim N. Gray and Catharine van Ingen: “ Empirical Measurements of Disk Failure Rates and Error Rates ,” Microsoft Research, MSR-TR-2005-166, December 2005.
[64] Jim N. Gray 和 Catharine van Ingen:《磁盘故障率和错误率的经验测量》,微软研究,MSR-TR-2005-166,2005 年 12 月。
[ 65 ] Annamalai Gurusami and Daniel Price: “ Bug #73170: Duplicates in Unique Secondary Index Because of Fix of Bug#68021 ,” bugs.mysql.com , July 2014.
[65] Annamalai Gurusami和Daniel Price:“Bug #73170:由于Bug #68021的修复导致唯一二级索引中出现重复项”,bugs.mysql.com,2014年7月。
[ 66 ] Gary Fredericks: “ Postgres Serializability Bug ,” github.com , September 2015.
[66] Gary Fredericks: “Postgres串行化漏洞”, github.com,2015年9月。
[ 67 ] Xiao Chen: “ HDFS DataNode Scanners and Disk Checker Explained ,” blog.cloudera.com , December 20, 2016.
[67] 肖琛:“HDFS 数据节点扫描器和磁盘检查器解释”,blog.cloudera.com,2016 年 12 月 20 日。
[ 68 ] Jay Kreps: “ Getting Real About Distributed System Reliability ,” blog.empathybox.com , March 19, 2012.
[68] Jay Kreps:“正视分布式系统的可靠性”,blog.empathybox.com,2012年3月19日。
[ 69 ] Martin Fowler: “ The LMAX Architecture ,” martinfowler.com , July 12, 2011.
[69] Martin Fowler:“LMAX 架构”,martinfowler.com,2011年7月12日。
[ 70 ] Sam Stokes: “ Move Fast with Confidence ,” blog.samstokes.co.uk , July 11, 2016.
[70] Sam Stokes:“怀着信心快速前进”,blog.samstokes.co.uk,2016年7月11日。
[ 71 ] “ Sawtooth Lake Documentation ,” Intel Corporation, intelledger.github.io , 2016.
[71] “锯齿湖(Sawtooth Lake)文档”,英特尔公司,intelledger.github.io,2016年。
[ 72 ] Richard Gendal Brown: “ Introducing R3 Corda™: A Distributed Ledger Designed for Financial Services ,” gendal.me , April 5, 2016.
[72] Richard Gendal Brown:“介绍R3 Corda™:一个专为金融服务设计的分布式账本”,gendal.me,2016年4月5日。
[ 73 ] Trent McConaghy, Rodolphe Marques, Andreas Müller, et al.: “ BigchainDB: A Scalable Blockchain Database ,” bigchaindb.com , June 8, 2016.
[73] Trent McConaghy, Rodolphe Marques, Andreas Müller等: “BigchainDB:可扩展的区块链数据库”,bigchaindb.com,2016年6月8日。
[ 74 ] Ralph C. Merkle: “ A Digital Signature Based on a Conventional Encryption Function ,” at CRYPTO ’87 , August 1987. doi:10.1007/3-540-48184-2_32
[74] Ralph C. Merkle:“一种基于常规加密函数的数字签名”,发表于CRYPTO '87,1987年8月。doi:10.1007/3-540-48184-2_32
[ 75 ] Ben Laurie: “ Certificate Transparency ,” ACM Queue , volume 12, number 8, pages 10-19, August 2014. doi:10.1145/2668152.2668154
[75] Ben Laurie:“证书透明性”,ACM Queue,第12卷第8期,10-19页,2014年8月。DOI:10.1145/2668152.2668154
[ 76 ] Mark D. Ryan: “ Enhanced Certificate Transparency and End-to-End Encrypted Mail ,” at Network and Distributed System Security Symposium (NDSS), February 2014. doi:10.14722/ndss.2014.23379
[76] Mark D. Ryan:“增强的证书透明度与端到端加密邮件”,发表于网络与分布式系统安全研讨会(NDSS),2014年2月。doi:10.14722/ndss.2014.23379
[ 77 ] “ Software Engineering Code of Ethics and Professional Practice ,” Association for Computing Machinery, acm.org , 1999.
[77] “软件工程道德与职业实践准则”,美国计算机协会(ACM),acm.org,1999年。
[ 78 ] François Chollet: “ Software development is starting to involve important ethical choices ,” twitter.com , October 30, 2016.
[78] François Chollet:“软件开发正开始涉及重要的伦理选择”,twitter.com,2016年10月30日。
[ 79 ] Igor Perisic: “ Making Hard Choices: The Quest for Ethics in Machine Learning ,” engineering.linkedin.com , November 2016.
[79] Igor Perisic:“做出艰难抉择:机器学习中的伦理探索”,engineering.linkedin.com,2016年11月。
[ 80 ] John Naughton: “ Algorithm Writers Need a Code of Conduct ,” theguardian.com , December 6, 2015.
[80] 约翰·诺顿:“算法作者需要行为准则”,theguardian.com,2015年12月6日。
[ 81 ] Logan Kugler: “ What Happens When Big Data Blunders? ,” Communications of the ACM , volume 59, number 6, pages 15–16, June 2016. doi:10.1145/2911975
[81] Logan Kugler:“大数据失误时会发生什么?”,《ACM通讯》,第59卷,第6期,第15-16页,2016年6月。doi:10.1145/2911975
[ 82 ] Bill Davidow: “ Welcome to Algorithmic Prison ,” theatlantic.com , February 20, 2014.
[82] 比尔·戴维多夫: “欢迎来到算法监狱”,theatlantic.com,2014年2月20日。
[ 83 ] Don Peck: “ They’re Watching You at Work ,” theatlantic.com , December 2013.
[83] 唐·佩克:“他们在工作中监视你”,theatlantic.com,2013年12月。
[ 84 ] Leigh Alexander: “ Is an Algorithm Any Less Racist Than a Human? ” theguardian.com , August 3, 2016.
[84] Leigh Alexander:“算法会比人类少一点种族主义吗?”,theguardian.com,2016年8月3日。
[ 85 ] Jesse Emspak: “ How a Machine Learns Prejudice ,” scientificamerican.com , December 29, 2016.
[85] Jesse Emspak:“机器如何习得偏见”,scientificamerican.com,2016年12月29日。
[ 86 ] Maciej Cegłowski: “ The Moral Economy of Tech ,” idlewords.com , June 2016.
[86] Maciej Cegłowski: “技术的道德经济”, idlewords.com,2016年6月。
[ 87 ] Cathy O’Neil: Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy . Crown Publishing, 2016. ISBN: 978-0-553-41881-1
[87] 凯西·奥尼尔:《数学毁灭性武器:大数据如何加剧不平等并威胁民主》。Crown Publishing,2016年。ISBN:978-0-553-41881-1
[ 88 ] Julia Angwin: “ Make Algorithms Accountable ,” nytimes.com , August 1, 2016.
[88] Julia Angwin:“让算法负起责任”,nytimes.com,2016年8月1日。
[ 89 ] Bryce Goodman and Seth Flaxman: “ European Union Regulations on Algorithmic Decision-Making and a ‘Right to Explanation’ ,” arXiv:1606.08813 , August 31, 2016.
[89] Bryce Goodman和Seth Flaxman:“欧洲联盟关于算法决策和‘解释权’的规定”,arXiv:1606.08813,2016年8月31日。
[ 90 ] “ A Review of the Data Broker Industry: Collection, Use, and Sale of Consumer Data for Marketing Purposes ,” Staff Report, United States Senate Committee on Commerce, Science, and Transportation , commerce.senate.gov , December 2013.
[90] “数据经纪行业回顾:出于营销目的对消费者数据的收集、使用与出售”,美国参议院商业、科学与运输委员会工作报告,commerce.senate.gov,2013年12月。
[ 91 ] Olivia Solon: “ Facebook’s Failure: Did Fake News and Polarized Politics Get Trump Elected? ” theguardian.com , November 10, 2016.
[91] Olivia Solon:“Facebook的失败:虚假新闻和政治极化让特朗普当选了吗?”,theguardian.com,2016年11月10日。
[ 92 ] Donella H. Meadows and Diana Wright: Thinking in Systems: A Primer . Chelsea Green Publishing, 2008. ISBN: 978-1-603-58055-7
[92] 唐妮拉·H·梅多斯和戴安娜·赖特:《系统思考入门》。切尔西格林出版社,2008年。ISBN:978-1-603-58055-7
[ 93 ] Daniel J. Bernstein: “ Listening to a ‘big data’/‘data science’ talk ,” twitter.com , May 12, 2015.
[93] 丹尼尔·J·伯恩斯坦:“聆听一场‘大数据/数据科学’演讲” ,twitter.com,2015年5月12日。
[ 94 ] Marc Andreessen: “ Why Software Is Eating the World ,” The Wall Street Journal , 20 August 2011.
[94] 马克·安德里森: “为什么软件正在吞噬世界”,《华尔街日报》,2011年8月20日。
[ 95 ] J. M. Porup: “ ‘Internet of Things’ Security Is Hilariously Broken and Getting Worse ,” arstechnica.com , January 23, 2016.
[95] J. M. Porup:“‘物联网’的安全性烂得可笑,而且还在变得更糟”,arstechnica.com,2016年1月23日。
[ 96 ] Bruce Schneier: Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World . W. W. Norton, 2015. ISBN: 978-0-393-35217-7
[96] 布鲁斯·施奈尔(Bruce Schneier):《数据与歌利亚:收集你的数据、控制你的世界的隐秘战争》。W. W. Norton,2015年。ISBN:978-0-393-35217-7
[ 97 ] The Grugq: “ Nothing to Hide ,” grugq.tumblr.com , April 15, 2016.
[97] The Grugq: “没有什么可以隐藏的”, grugq.tumblr.com,2016年4月15日。
[ 98 ] Tony Beltramelli: “ Deep-Spying: Spying Using Smartwatch and Deep Learning ,” Masters Thesis, IT University of Copenhagen, December 2015. Available at arxiv.org/abs/1512.05616
[98] Tony Beltramelli:“Deep-Spying:使用智能手表和深度学习进行间谍活动”,哥本哈根IT大学硕士论文,2015年12月。可以在arxiv.org/abs/1512.05616查阅。
[ 99 ] Shoshana Zuboff: “ Big Other: Surveillance Capitalism and the Prospects of an Information Civilization ,” Journal of Information Technology , volume 30, number 1, pages 75–89, April 2015. doi:10.1057/jit.2015.5
[99] Shoshana Zuboff:“大他者:监视资本主义和信息文明的前景”,《信息技术》杂志,第30卷,第1期,第75-89页,2015年4月。doi:10.1057/jit.2015.5
[ 100 ] Carina C. Zona: “ Consequences of an Insightful Algorithm ,” at GOTO Berlin , November 2016.
[100] Carina C. Zona:“一种精辟算法的后果”,于2016年11月在GOTO Berlin发表演讲。
[ 101 ] Bruce Schneier: “ Data Is a Toxic Asset, So Why Not Throw It Out? ,” schneier.com , March 1, 2016.
[101] 布鲁斯·施奈尔(Bruce Schneier):“数据是有毒资产,为什么不把它扔掉?”,schneier.com,2016年3月1日。
[ 102 ] John E. Dunn: “ The UK’s 15 Most Infamous Data Breaches ,” techworld.com , November 18, 2016.
[102] 约翰·E·邓恩:“英国15起最臭名昭著的数据泄露事件”,techworld.com,2016年11月18日。
[ 103 ] Cory Scott: “ Data is not toxic - which implies no benefit - but rather hazardous material, where we must balance need vs. want ,” twitter.com , March 6, 2016.
[103] Cory Scott:“数据并非有毒(那会意味着毫无益处),而是危险品,我们必须在‘需要’与‘想要’之间作出权衡”,twitter.com,2016年3月6日。
[ 104 ] Bruce Schneier: “ Mission Creep: When Everything Is Terrorism ,” schneier.com , July 16, 2013.
[104] Bruce Schneier:“使命偏离:当一切都成了恐怖主义”,schneier.com,2013年7月16日。
[ 105 ] Lena Ulbricht and Maximilian von Grafenstein: “ Big Data: Big Power Shifts? ,” Internet Policy Review , volume 5, number 1, March 2016. doi:10.14763/2016.1.406
[105] Lena Ulbricht和Maximilian von Grafenstein:“大数据:大权力转移?”,《Internet Policy Review》,第5卷,第1期,2016年3月。doi:10.14763/2016.1.406
[ 106 ] Ellen P. Goodman and Julia Powles: “ Facebook and Google: Most Powerful and Secretive Empires We’ve Ever Known ,” theguardian.com , September 28, 2016.
[106] Ellen P. Goodman和Julia Powles:“Facebook和Google:我们所知道的最强大和最神秘的帝国”,theguardian.com,2016年9月28日。
[ 107 ] Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and on the free movement of such data , Official Journal of the European Communities No. L 281/31, eur-lex.europa.eu , November 1995.
[107] 指令95/46/EC:关于在个人数据处理方面对个人的保护以及此类数据的自由流动,《欧洲共同体公报》第L 281/31号,eur-lex.europa.eu,1995年11月。
[ 108 ] Brendan Van Alsenoy: “ Regulating Data Protection: The Allocation of Responsibility and Risk Among Actors Involved in Personal Data Processing ,” Thesis, KU Leuven Centre for IT and IP Law, August 2016.
[108] Brendan Van Alsenoy:“调节数据保护:个人数据处理中相关参与者的责任和风险分配”,论文,卢汶大学信息技术与知识产权法律中心,2016年8月。
[ 109 ] Michiel Rhoen: “ Beyond Consent: Improving Data Protection Through Consumer Protection Law ,” Internet Policy Review , volume 5, number 1, March 2016. doi:10.14763/2016.1.404
[109] Michiel Rhoen:“超越同意:通过消费者保护法改进数据保护”,《Internet Policy Review》,第5卷,第1期,2016年3月。doi:10.14763/2016.1.404
[ 110 ] Jessica Leber: “ Your Data Footprint Is Affecting Your Life in Ways You Can’t Even Imagine ,” fastcoexist.com , March 15, 2016.
[110] Jessica Leber:“你的数据足迹正以你无法想象的方式影响你的生活”,fastcoexist.com,2016年3月15日。
[ 111 ] Maciej Cegłowski: “ Haunted by Data ,” idlewords.com , October 2015.
[111] Maciej Cegłowski:“被数据困扰”,idlewords.com,2015年10月。
[ 112 ] Sam Thielman: “ You Are Not What You Read: Librarians Purge User Data to Protect Privacy ,” theguardian.com , January 13, 2016.
[112] 萨姆·席尔曼: "你不是你所读的书:图书馆管理员清除用户数据以保护隐私," theguardian.com,2016年1月13日。
[ 113 ] Conor Friedersdorf: “ Edward Snowden’s Other Motive for Leaking ,” theatlantic.com , May 13, 2014.
[113] 康纳·弗里德斯多夫(Conor Friedersdorf):“爱德华·斯诺登泄露的另一个动机”,theatlantic.com,2014年5月13日。
[ 114 ] Phillip Rogaway: “ The Moral Character of Cryptographic Work ,” Cryptology ePrint 2015/1162, December 2015.
[114] Phillip Rogaway:“密码学工作的道德品格”,Cryptology ePrint 2015/1162,2015年12月。
Glossary
Note
Please note that the definitions in this glossary are short and simple, intended to convey the core idea but not the full subtleties of a term. For more detail, please follow the references into the main text.
请注意,这个词汇表中的定义是简短而简单的,旨在传达核心概念,而不是一个术语的所有细节。如需更多细节,请查看主文中的参考资料。
- asynchronous
-
Not waiting for something to complete (e.g., sending data over the network to another node), and not making any assumptions about how long it is going to take. See “Synchronous Versus Asynchronous Replication” , “Synchronous Versus Asynchronous Networks” , and “System Model and Reality” .
不等待某个操作完成(例如,向另一个节点发送数据),也不做任何关于操作所需时间的假设。请参考“同步与异步复制”、“同步与异步网络”和“系统模型与现实”。
- atomic
-
-
In the context of concurrent operations: describing an operation that appears to take effect at a single point in time, so another concurrent process can never encounter the operation in a “half-finished” state. See also isolation .
在并发操作的上下文中:描述一个看起来在单个时间点生效的操作,因此另一个并发进程永远不会遇到处于“半完成”状态的该操作。另请参阅隔离性(isolation)。
-
In the context of transactions: grouping together a set of writes that must either all be committed or all be rolled back, even if faults occur. See “Atomicity” and “Atomic Commit and Two-Phase Commit (2PC)” .
在事务的上下文中:将一组必须全部提交或全部回滚的写操作组合在一起,即使发生故障也是如此。请参阅“原子性”以及“原子提交与两阶段提交(2PC)”。
-
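The all-or-nothing behavior can be demonstrated with Python's built-in sqlite3 module (a hedged sketch; the accounts table and values are invented for illustration): a fault in the middle of a transaction rolls back every write in it.
Python内置的sqlite3模块可以演示这种“全有或全无”的行为(示意性草图,accounts表及数值纯属举例):事务中途发生故障时,其中的所有写入都会被回滚。

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute(
            "UPDATE accounts SET balance = balance - 10 WHERE name = 'alice'"
        )
        # Simulate a crash before the matching credit to bob is written:
        raise RuntimeError("fault in the middle of the transaction")
except RuntimeError:
    pass  # the connection context manager has already rolled back

# The debit was rolled back along with the rest of the transaction:
balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'"
).fetchone()[0]
print(balance)  # 100
```

Without atomicity, a crash at that point would leave the money debited but never credited.
如果没有原子性,在那个时间点崩溃会导致钱被扣除却从未入账。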
- backpressure
-
Forcing the sender of some data to slow down because the recipient cannot keep up with it. Also known as flow control . See “Messaging Systems” .
强制数据发送方减缓速度,因为接收方无法跟上。也称为流量控制。参见“消息系统”。
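A minimal sketch of backpressure (not from the book): a bounded queue blocks a fast producer until the slow consumer catches up.
一个关于背压的最小示意(并非出自本书):有界队列会阻塞过快的生产者,直到较慢的消费者跟上为止。

```python
import queue
import threading
import time

buf = queue.Queue(maxsize=2)  # bounded buffer: a full queue blocks the producer
produced, consumed = [], []

def producer():
    for i in range(6):
        buf.put(i)  # blocks while the queue is full -> backpressure
        produced.append(i)
    buf.put(None)  # sentinel: end of stream

def consumer():
    while True:
        item = buf.get()
        if item is None:
            break
        time.sleep(0.01)  # consumer is slower than the producer
        consumed.append(item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start()
t2.start()
t1.join()
t2.join()
print(consumed)  # [0, 1, 2, 3, 4, 5]
```

The producer never runs more than two items ahead of the consumer, so memory use stays bounded.
生产者最多只领先消费者两个元素,因此内存占用是有界的。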
- batch process
-
A computation that takes some fixed (and usually large) set of data as input and produces some other data as output, without modifying the input. See Chapter 10 .
一种计算,它以一些固定(通常很大)的数据集作为输入并产生一些其他数据作为输出,而不修改输入。请参见第十章。
- bounded
-
Having some known upper limit or size. Used for example in the context of network delay (see “Timeouts and Unbounded Delays” ) and datasets (see the introduction to Chapter 11 ).
具有某些已知上限或大小。例如,在网络延迟(参见“超时和无界延迟”)和数据集(参见第11章介绍)的上下文中使用。
- Byzantine fault
-
A node that behaves incorrectly in some arbitrary way, for example by sending contradictory or malicious messages to other nodes. See “Byzantine Faults” .
一个节点以某种任意的方式行为不当,例如向其他节点发送矛盾或恶意信息。请参见“拜占庭故障”。
- cache
-
A component that remembers recently used data in order to speed up future reads of the same data. It is generally not complete: thus, if some data is missing from the cache, it has to be fetched from some underlying, slower data storage system that has a complete copy of the data.
一个记住最近用过的数据、以加速将来对相同数据的读取的组件。缓存通常并不完整:因此,如果某些数据不在缓存中,就必须从某个拥有数据完整副本的、速度较慢的底层数据存储系统中获取。
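A read-through cache can be sketched in a few lines (hypothetical names, for illustration only): on a miss, the value is fetched from the slower backing store that holds a complete copy, and remembered for future reads.
读穿(read-through)缓存可以用几行代码示意(名称为假设性举例):缓存未命中时,从持有完整数据副本的较慢后备存储中取值,并记下来供将来读取。

```python
class ReadThroughCache:
    """Minimal read-through cache over a slower, complete backing store."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # has a complete copy of the data
        self.cache = {}                     # incomplete, recently used subset
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            return self.cache[key]          # fast path: cache hit
        self.misses += 1
        value = self.backing_store[key]     # slow path: fetch from the store
        self.cache[key] = value             # remember for future reads
        return value

store = {"user:1": "alice", "user:2": "bob"}
c = ReadThroughCache(store)
c.get("user:1")  # miss: fetched from the backing store
c.get("user:1")  # hit: served from the cache
print(c.misses)  # 1
```

A real cache would also bound its size (e.g. with LRU eviction) and handle invalidation when the backing data changes.
真实的缓存还需要限制自身大小(例如LRU淘汰),并在底层数据变化时处理失效。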
- CAP theorem
-
A widely misunderstood theoretical result that is not useful in practice. See “The CAP theorem” .
一个在实践中没有用处且被广泛误解的理论结果。请查看“CAP定理”。
- causality
-
The dependency between events that arises when one thing “happens before” another thing in a system. For example, a later event that is in response to an earlier event, or builds upon an earlier event, or should be understood in the light of an earlier event. See “The “happens-before” relationship and concurrency” and “Ordering and Causality” .
当系统中一件事“发生于”另一件事之前时,事件之间产生的依赖关系。例如,后面的事件是对前面事件的响应、建立在前面事件的基础上,或者应当根据前面的事件来理解。请参阅“‘发生于之前’关系与并发”以及“顺序与因果关系”。
- consensus
-
A fundamental problem in distributed computing, concerning getting several nodes to agree on something (for example, which node should be the leader for a database cluster). The problem is much harder than it seems at first glance. See “Fault-Tolerant Consensus” .
分布式计算中的一个基本问题,涉及让多个节点就某件事达成一致(例如,哪个节点应当作为数据库集群的领导者)。这个问题比乍看起来要困难得多。请参阅“容错共识”。
- data warehouse
-
A database in which data from several different OLTP systems has been combined and prepared to be used for analytics purposes. See “Data Warehousing” .
将来自多个不同OLTP系统的数据组合并准备好用于分析目的的数据库。 参见“数据仓库”。
- declarative
-
Describing the properties that something should have, but not the exact steps for how to achieve it. In the context of queries, a query optimizer takes a declarative query and decides how it should best be executed. See “Query Languages for Data” .
描述某物应该具有的特性,但不是如何实现它的确切步骤。在查询的上下文中,查询优化器会接收一个声明性的查询并决定如何最好地执行它。请参见“数据查询语言”。
- denormalize
-
To introduce some amount of redundancy or duplication in a normalized dataset, typically in the form of a cache or index , in order to speed up reads. A denormalized value is a kind of precomputed query result, similar to a materialized view. See “Single-Object and Multi-Object Operations” and “Deriving several views from the same event log” .
在规范化数据集中引入一定程度的冗余或重复(通常以缓存或索引的形式),以加速读取。非规范化的值是一种预先计算的查询结果,类似于物化视图。请参阅"单对象和多对象操作"和"从相同的事件日志派生多个视图"。
- derived data
-
A dataset that is created from some other data through a repeatable process, which you could run again if necessary. Usually, derived data is needed to speed up a particular kind of read access to the data. Indexes, caches, and materialized views are examples of derived data. See the introduction to Part III .
衍生数据集是通过可重复的过程从其他数据中创建的,如果需要,您可以再次运行该过程。通常,衍生数据用于加快对数据的特定读取访问速度。索引、缓存和物化视图是衍生数据的示例。请参见第III部分的介绍。
- deterministic
-
Describing a function that always produces the same output if you give it the same input. This means it cannot depend on random numbers, the time of day, network communication, or other unpredictable things.
描述一个函数,如果你给它相同的输入,它总是会产生相同的输出。这意味着它不能依赖于随机数、时间、网络通信或其他不可预测的事物。
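For illustration (a sketch not taken from the book; the function names are made up), a deterministic function can be contrasted with a nondeterministic one:

```python
import random

def deterministic_double(x):
    # Depends only on the input: the same x always yields the same output.
    return x * 2

def nondeterministic_jitter(x):
    # Depends on a random number, so repeated calls with the same input
    # may return different results.
    return x * 2 + random.random()
```

Determinism matters in practice because a deterministic computation can safely be re-run after a fault and produce the same result.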
- distributed
-
Running on several nodes connected by a network. Characterized by partial failures : some part of the system may be broken while other parts are still working, and it is often impossible for the software to know what exactly is broken. See “Faults and Partial Failures” .
运行在由网络连接的多个节点上。特点是部分故障:系统的一些部分可能出现故障,而其他部分仍在工作,软件通常无法知道哪个部分出现故障。请参见“故障和部分故障”。
- durable
-
Storing data in a way such that you believe it will not be lost, even if various faults occur. See “Durability” .
以某种方式存储数据,使您相信即使发生各种故障,数据也不会丢失。请参见"持久性"。
- ETL
-
Extract–Transform–Load. The process of extracting data from a source database, transforming it into a form that is more suitable for analytic queries, and loading it into a data warehouse or batch processing system. See “Data Warehousing” .
抽取-转换-加载。从源数据库中提取数据,将其转换为更适合进行分析查询的形式,并将其加载到数据仓库或批处理系统中的过程。参见“数据仓库”。
- failover
-
In systems that have a single leader, failover is the process of moving the leadership role from one node to another. See “Handling Node Outages” .
在只有一个领导者的系统中,故障转移是将领导角色从一个节点移动到另一个节点的过程。请参见“处理节点故障”。
- fault-tolerant
-
Able to recover automatically if something goes wrong (e.g., if a machine crashes or a network link fails). See “Reliability” .
能够在出现问题时(例如机器崩溃或网络链路故障)自动恢复。请参见"可靠性"。
- flow control
-
See backpressure .
见背压(backpressure)。
- follower
-
A replica that does not directly accept any writes from clients, but only processes data changes that it receives from a leader. Also known as a secondary , slave , read replica , or hot standby . See “Leaders and Followers” .
一种副本,不直接接受客户端的任何写入,只处理从领导者那里接收到的数据变更。也称为从库(secondary)、从节点(slave)、只读副本(read replica)或热备(hot standby)。参见"领导者与追随者"。
- full-text search
-
Searching text by arbitrary keywords, often with additional features such as matching similarly spelled words or synonyms. A full-text index is a kind of secondary index that supports such queries. See “Full-text search and fuzzy indexes” .
通过任意关键词搜索文本,通常带有匹配拼写相近的单词或同义词等附加功能。全文索引是一种支持此类查询的二级索引。参见"全文搜索和模糊索引"。
- graph
-
A data structure consisting of vertices (things that you can refer to, also known as nodes or entities ) and edges (connections from one vertex to another, also known as relationships or arcs ). See “Graph-Like Data Models” .
一种由顶点(可以指代的事物,也称为节点或实体)和边(从一个顶点到另一个顶点的连接,也称为关系或弧)组成的数据结构。参见"类图数据模型"。
- hash
-
A function that turns an input into a random-looking number. The same input always returns the same number as output. Two different inputs are very likely to have two different numbers as output, although it is possible that two different inputs produce the same output (this is called a collision ). See “Partitioning by Hash of Key” .
一个将输入转换为看似随机的数字的函数。相同的输入总是返回相同的数字作为输出。两个不同的输入很可能产生两个不同的输出数字,但也有可能产生相同的输出(这称为碰撞)。参见"按键的哈希分区"。
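As a minimal sketch of this idea (the helper name and the choice of MD5 are illustrative assumptions, not the book's prescription), a hash can map a key onto one of a fixed number of partitions:

```python
import hashlib

def partition_for_key(key: str, num_partitions: int) -> int:
    # Turn the key into a random-looking but stable number, then map it
    # onto a partition. The same key always lands on the same partition;
    # different keys are very likely (but not guaranteed) to differ.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions
```

Because the output looks random, keys spread roughly evenly across partitions, which is why hashing is used for partitioning.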
- idempotent
-
Describing an operation that can be safely retried; if it is executed more than once, it has the same effect as if it was only executed once. See “Idempotence” .
描述一个可以安全重试的操作;如果它被执行多次,它的效果与它只被执行一次相同。请参见“幂等性”。
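A common way to make an operation idempotent (a hypothetical sketch, not code from the book) is to remember which operation IDs have already been applied, so that a retried delivery has no further effect:

```python
class Counter:
    def __init__(self):
        self.value = 0
        self._applied = set()  # IDs of increments already processed

    def add(self, op_id: str, delta: int) -> int:
        # Safe to retry: a duplicate delivery of the same op_id is ignored,
        # so executing the operation twice has the same effect as once.
        if op_id not in self._applied:
            self._applied.add(op_id)
            self.value += delta
        return self.value
```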
- index
-
A data structure that lets you efficiently search for all records that have a particular value in a particular field. See “Data Structures That Power Your Database” .
一种数据结构,可让您有效地搜索特定字段中具有特定值的所有记录。请参见“驱动您的数据库的数据结构”。
- isolation
-
In the context of transactions, describing the degree to which concurrently executing transactions can interfere with each other. Serializable isolation provides the strongest guarantees, but weaker isolation levels are also used. See “Isolation” .
在事务的上下文中,描述并发执行的事务相互干扰的程度。可串行化隔离提供最强的保证,但较弱的隔离级别也很常用。参见"隔离性"。
- join
-
To bring together records that have something in common. Most commonly used in the case where one record has a reference to another (a foreign key, a document reference, an edge in a graph) and a query needs to get the record that the reference points to. See “Many-to-One and Many-to-Many Relationships” and “Reduce-Side Joins and Grouping” .
将有共同之处的记录汇集在一起。最常用于这种情况:一条记录引用另一条记录(外键、文档引用、图中的边),查询需要获取该引用所指向的记录。参见"多对一和多对多关系"和"Reduce侧连接与分组"。
- leader
-
When data or a service is replicated across several nodes, the leader is the designated replica that is allowed to make changes. A leader may be elected through some protocol, or manually chosen by an administrator. Also known as the primary or master . See “Leaders and Followers” .
当数据或服务被复制到多个节点时,领导者是被指定允许进行更改的副本。领导者可以通过某种协议选举产生,也可以由管理员手动选定。也称为主库(primary)或主节点(master)。请参见"领导者与追随者"。
- linearizable
-
Behaving as if there was only a single copy of data in the system, which is updated by atomic operations. See “Linearizability” .
表现得好像系统中只有单个数据副本,而该副本通过原子操作进行更新。请参见“线性化”。
- locality
-
A performance optimization: putting several pieces of data in the same place if they are frequently needed at the same time. See “Data locality for queries” .
一个性能优化:如果一些数据在同一时间频繁地需要,将这些数据放在同一个位置。请参阅“查询的数据局部性”。
- lock
-
A mechanism to ensure that only one thread, node, or transaction can access something, and anyone else who wants to access the same thing must wait until the lock is released. See “Two-Phase Locking (2PL)” and “The leader and the lock” .
确保只有一个线程、节点或事务可以访问某物,并且任何想要访问相同物品的人必须等待锁被释放的机制。参见“两阶段锁定(2PL)”和“领导者和锁”。
- log
-
An append-only file for storing data. A write-ahead log is used to make a storage engine resilient against crashes (see “Making B-trees reliable” ), a log-structured storage engine uses logs as its primary storage format (see “SSTables and LSM-Trees” ), a replication log is used to copy writes from a leader to followers (see “Leaders and Followers” ), and an event log can represent a data stream (see “Partitioned Logs” ).
一个仅能附加的文件用于存储数据。采用写前日志来使存储引擎能够抵御崩溃(见“使B树可靠”),采用日志结构的存储引擎将日志用作其主要存储格式(见“SSTables和LSM-Trees”),复制日志用于将写操作从领导者复制到跟随者(见“领导者和跟随者”),事件日志可以表示数据流(见“分区日志”)。
- materialize
-
To perform a computation eagerly and write out its result, as opposed to calculating it on demand when requested. See “Aggregation: Data Cubes and Materialized Views” and “Materialization of Intermediate State” .
急切地执行某项计算并写出其结果,而不是在被请求时按需计算。请参见"聚合:数据立方体与物化视图"和"中间状态的物化"。
- node
-
An instance of some software running on a computer, which communicates with other nodes via a network in order to accomplish some task.
某台计算机上运行的软件实例,通过网络与其他节点通信,以完成某项任务。
- normalized
-
Structured in such a way that there is no redundancy or duplication. In a normalized database, when some piece of data changes, you only need to change it in one place, not many copies in many different places. See “Many-to-One and Many-to-Many Relationships” .
以没有冗余或重复的方式进行结构化。在规范化的数据库中,当某项数据发生变化时,您只需要在一个地方修改它,而不是在许多不同地方的许多副本中修改。请参见"多对一和多对多关系"。
- OLAP
-
Online analytic processing. Access pattern characterized by aggregating (e.g., count, sum, average) over a large number of records. See “Transaction Processing or Analytics?” .
在线分析处理。访问模式通过对大量记录进行聚合(如计数,求和,平均值)来表征。参见“事务处理还是分析?”。
- OLTP
-
Online transaction processing. Access pattern characterized by fast queries that read or write a small number of records, usually indexed by key. See “Transaction Processing or Analytics?” .
在线事务处理。访问模式特征为快速查询读取或写入数量较少的记录,通常由键索引。请参阅“事务处理还是分析?”。
- partitioning
-
Splitting up a large dataset or computation that is too big for a single machine into smaller parts and spreading them across several machines. Also known as sharding . See Chapter 6 .
将太大无法在单个机器上处理的大型数据集或计算拆分成较小的部分,并将它们分散在多台机器上。也称为分片。请参见第6章。
- percentile
-
A way of measuring the distribution of values by counting how many values are above or below some threshold. For example, the 95th percentile response time during some period is the time t such that 95% of requests in that period complete in less than t , and 5% take longer than t . See “Describing Performance” .
一种通过计算有多少值高于或低于某个阈值来衡量值的分布的方法。例如,某段时间内的第95百分位响应时间是这样的时间t:该时段内95%的请求在t之内完成,5%的请求耗时超过t。请参阅"描述性能"。
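The definition above can be computed directly. The following is an illustrative nearest-rank sketch (not the book's code; production monitoring systems typically use approximate streaming algorithms instead of sorting all values):

```python
import math

def percentile(values, p):
    # Nearest-rank percentile: the smallest value v such that at least
    # p% of the measurements are less than or equal to v.
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-indexed rank
    return ordered[rank - 1]
```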
- primary key
-
A value (typically a number or a string) that uniquely identifies a record. In many applications, primary keys are generated by the system when a record is created (e.g., sequentially or randomly); they are not usually set by users. See also secondary index .
一个唯一标识记录的值(通常是数字或字符串)。在许多应用程序中,当记录被创建时(例如按顺序或随机),系统会生成主键;它们通常不是由用户设置的。另见二级索引。
- quorum
-
The minimum number of nodes that need to vote on an operation before it can be considered successful. See “Quorums for reading and writing” .
在一个操作被认为成功之前,需要对该操作投票的最少节点数。请参见"读写的法定人数"。
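For the read/write quorums described in "Quorums for reading and writing", the essential condition is that every read set overlaps with every write set. A minimal sketch (the function name is illustrative):

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    # With n replicas, writes confirmed by w nodes and reads consulting
    # r nodes, w + r > n guarantees at least one node in every read set
    # has seen the most recent successful write.
    return w + r > n
```

A common configuration is n = 3 with w = 2 and r = 2, which tolerates one unavailable node.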
- rebalance
-
To move data or services from one node to another in order to spread the load fairly. See “Rebalancing Partitions” .
将数据或服务从一个节点移动到另一个节点,以使负载分布更加公平。参见“重新平衡分区”。
- replication
-
Keeping a copy of the same data on several nodes ( replicas ) so that it remains accessible if a node becomes unreachable. See Chapter 5 .
将同一数据的副本保存在多个节点(副本)上,以便在某个节点不可达时数据仍然可以访问。见第五章。
- schema
-
A description of the structure of some data, including its fields and datatypes. Whether some data conforms to a schema can be checked at various points in the data’s lifetime (see “Schema flexibility in the document model” ), and a schema can change over time (see Chapter 4 ).
一些数据的结构描述,包括其字段和数据类型。可以在数据的生命周期的各个阶段检查某些数据是否符合模式(请参见“文档模型中的模式灵活性”),并且模式可以随时间改变(请参见第四章)。
- secondary index
-
An additional data structure that is maintained alongside the primary data storage and which allows you to efficiently search for records that match a certain kind of condition. See “Other Indexing Structures” and “Partitioning and Secondary Indexes” .
与主数据存储一起维护的额外数据结构,使您可以高效地搜索与某种条件相匹配的记录。请参见"其他索引结构"和"分区与二级索引"。
- serializable
-
A guarantee that if several transactions execute concurrently, they behave the same as if they had executed one at a time, in some serial order. See “Serializability” .
一种保证:如果多个事务并发执行,其行为与按某种串行顺序一次执行一个事务相同。请参见"可串行化"。
- shared-nothing
-
An architecture in which independent nodes—each with their own CPUs, memory, and disks—are connected via a conventional network, in contrast to shared-memory or shared-disk architectures. See the introduction to Part II .
一种体系结构,其中独立的节点——每个节点都有自己的CPU、内存和磁盘——通过传统网络连接,与共享内存或共享磁盘体系结构相反。请参阅第II部分的介绍。
- skew
-
-
Imbalanced load across partitions, such that some partitions have lots of requests or data, and others have much less. Also known as hot spots . See “Skewed Workloads and Relieving Hot Spots” and “Handling skew” .
分区之间的负载不平衡,导致一些分区有大量请求或数据,其他分区则较少。也称为热点。请参见“倾斜工作负载和缓解热点”和“处理倾斜”。
-
A timing anomaly that causes events to appear in an unexpected, nonsequential order. See the discussions of read skew in “Snapshot Isolation and Repeatable Read” , write skew in “Write Skew and Phantoms” , and clock skew in “Timestamps for ordering events” .
时间异常导致事件以意外的非顺序方式出现。请参阅“快照隔离和可重复读”中关于读取偏斜的讨论,“写偏斜和幻影”中关于写入偏斜的讨论,以及“事件排序的时间戳”中关于时钟偏差的讨论。
-
- split brain
-
A scenario in which two nodes simultaneously believe themselves to be the leader, and which may cause system guarantees to be violated. See “Handling Node Outages” and “The Truth Is Defined by the Majority” .
两个节点同时认为自己是领导者的情况可能会导致系统保障被违反。请参见“处理节点故障”和“真相由多数决定”。
- stored procedure
-
A way of encoding the logic of a transaction such that it can be entirely executed on a database server, without communicating back and forth with a client during the transaction. See “Actual Serial Execution” .
一种编码事务逻辑的方式,使其可以完全在数据库服务器上执行,而无需在事务期间与客户端来回通信。参见“实际串行执行”。
- stream process
-
A continually running computation that consumes a never-ending stream of events as input, and derives some output from it. See Chapter 11 .
一个持续运行的计算,消耗无止境的事件流作为输入,并从中派生一些输出。请参阅第11章。
- synchronous
-
The opposite of asynchronous .
异步(asynchronous)的反义词。
- system of record
-
A system that holds the primary, authoritative version of some data, also known as the source of truth . Changes are first written here, and other datasets may be derived from the system of record. See the introduction to Part III .
一个持有某些数据的主要权威版本的系统,也称为事实来源(source of truth)。更改首先写入此处,其他数据集可以从记录系统派生。请参见第三部分的介绍。
- timeout
-
One of the simplest ways of detecting a fault, namely by observing the lack of a response within some amount of time. However, it is impossible to know whether a timeout is due to a problem with the remote node, or an issue in the network. See “Timeouts and Unbounded Delays” .
一种最简单的故障检测方法,即观察在一定时间内没有收到响应。但是,无法知道超时是由于远程节点的问题,还是网络中的问题。请参阅"超时和无界延迟"。
- total order
-
A way of comparing things (e.g., timestamps) that allows you to always say which one of two things is greater and which one is lesser. An ordering in which some things are incomparable (you cannot say which is greater or smaller) is called a partial order . See “The causal order is not a total order” .
一种比较事物(例如时间戳)的方法,总能判断两个事物中哪一个更大、哪一个更小。某些事物之间无法比较(无法说哪个更大或更小)的排序称为偏序。请参见"因果顺序不是全序"。
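Version vectors (covered in Chapter 5) are a standard example of a partial order: two vectors can be incomparable, meaning the events they describe are concurrent. A comparison sketch (illustrative, not the book's code):

```python
def compare(a: dict, b: dict) -> str:
    # Compare two version vectors (node -> counter). Unlike plain
    # timestamps (a total order), the result can be "concurrent":
    # neither vector happened before the other.
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```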
- transaction
-
Grouping together several reads and writes into a logical unit, in order to simplify error handling and concurrency issues. See Chapter 7 .
将几次读写操作归为一个逻辑单元,以简化错误处理和并发问题。详见第七章。
- two-phase commit (2PC)
-
An algorithm to ensure that several database nodes either all commit or all abort a transaction. See “Atomic Commit and Two-Phase Commit (2PC)” .
一种确保多个数据库节点要么全部提交、要么全部中止某个事务的算法。参见"原子提交和两阶段提交(2PC)"。
- two-phase locking (2PL)
-
An algorithm for achieving serializable isolation that works by a transaction acquiring a lock on all data it reads or writes, and holding the lock until the end of the transaction. See “Two-Phase Locking (2PL)” .
一种实现可串行隔离的算法,其原理是事务在读取或写入数据时需获取锁,并在事务结束之前一直持有该锁。参考“两阶段锁定(2PL)”。
- unbounded
-
Not having any known upper limit or size. The opposite of bounded .
没有任何已知的上限或大小。与有界相反。
Index
A
- aborts (transactions) , Transactions , Atomicity
- in two-phase commit , Introduction to two-phase commit
- performance of optimistic concurrency control , Performance of serializable snapshot isolation
- retrying aborted transactions , Handling errors and aborts
- abstraction , Simplicity: Managing Complexity , Data Models and Query Languages , Transactions , Summary , Consistency and Consensus
- access path (in network model) , The network model , The SPARQL query language
- accidental complexity, removing , Simplicity: Managing Complexity
- accountability , Responsibility and accountability
- ACID properties (transactions) , Transaction Processing or Analytics? , The Meaning of ACID
- atomicity , Atomicity , Single-Object and Multi-Object Operations
- consistency , Consistency , Maintaining integrity in the face of software bugs
- durability , Durability
- isolation , Isolation , Single-Object and Multi-Object Operations
- acknowledgements (messaging) , Acknowledgments and redelivery
- active/active replication ( see multi-leader replication)
- active/passive replication ( see leader-based replication)
- ActiveMQ (messaging) , Message brokers , Message brokers compared to databases
- distributed transaction support , XA transactions
- ActiveRecord (object-relational mapper) , The Object-Relational Mismatch , Handling errors and aborts
- actor model , Distributed actor frameworks
- ( see also message-passing)
- comparison to Pregel model , The Pregel processing model
- comparison to stream processing , Message passing and RPC
- Advanced Message Queuing Protocol ( see AMQP)
- aerospace systems , Reliability , Human Errors , Byzantine Faults , Membership services
- aggregation
- data cubes and materialized views , Aggregation: Data Cubes and Materialized Views
- in batch processes , GROUP BY
- in stream processes , Stream analytics
- aggregation pipeline query language , MapReduce Querying
- Agile , Evolvability: Making Change Easy
- minimizing irreversibility , Philosophy of batch process outputs , Reprocessing data for application evolution
- moving faster with confidence , The end-to-end argument again
- Unix philosophy , The Unix Philosophy
- agreement , Fault-Tolerant Consensus
- ( see also consensus)
- Airflow (workflow scheduler) , MapReduce workflows
- Ajax , Dataflow Through Services: REST and RPC
- Akka (actor framework) , Distributed actor frameworks
- algorithms
- algorithm correctness , Correctness of an algorithm
- B-trees , B-Trees - B-tree optimizations
- for distributed systems , System Model and Reality
- hash indexes , Hash Indexes - Hash Indexes
- mergesort , SSTables and LSM-Trees , Distributed execution of MapReduce , Sort-merge joins
- red-black trees , Constructing and maintaining SSTables
- SSTables and LSM-trees , SSTables and LSM-Trees - Performance optimizations
- all-to-all replication topologies , Multi-Leader Replication Topologies
- AllegroGraph (database) , Graph-Like Data Models
- ALTER TABLE statement (SQL) , Schema flexibility in the document model , Encoding and Evolution
- Amazon
- Dynamo (database) , Leaderless Replication
- Amazon Web Services (AWS) , Hardware Faults
- Kinesis Streams (messaging) , Using logs for message storage
- network reliability , Network Faults in Practice
- postmortems , Software Errors
- RedShift (database) , The divergence between OLTP databases and data warehouses
- S3 (object storage) , MapReduce and Distributed Filesystems
- checking data integrity , Don’t just blindly trust what they promise
- amplification
- of bias , Bias and discrimination
- of failures , Limitations of distributed transactions , Maintaining derived state
- of tail latency , Describing Performance , Partitioning Secondary Indexes by Document
- write amplification , Advantages of LSM-trees
- AMQP (Advanced Message Queuing Protocol) , Message brokers compared to databases
- ( see also messaging systems)
- comparison to log-based messaging , Logs compared to traditional messaging , Replaying old messages
- message ordering , Acknowledgments and redelivery
- analytics , Transaction Processing or Analytics?
- comparison to transaction processing , Transaction Processing or Analytics?
- data warehousing ( see data warehousing)
- parallel query execution in MPP databases , Comparing Hadoop to Distributed Databases
- predictive ( see predictive analytics)
- relation to batch processing , The Output of Batch Workflows
- schemas for , Stars and Snowflakes: Schemas for Analytics - Stars and Snowflakes: Schemas for Analytics
- snapshot isolation for queries , Snapshot Isolation and Repeatable Read
- stream analytics , Stream analytics
- using MapReduce, analysis of user activity events (example) , Example: analysis of user activity events
- anti-caching (in-memory databases) , Keeping everything in memory
- anti-entropy , Read repair and anti-entropy
- Apache ActiveMQ ( see ActiveMQ)
- Apache Avro ( see Avro)
- Apache Beam ( see Beam)
- Apache BookKeeper ( see BookKeeper)
- Apache Cassandra ( see Cassandra)
- Apache CouchDB ( see CouchDB)
- Apache Curator ( see Curator)
- Apache Drill ( see Drill)
- Apache Flink ( see Flink)
- Apache Giraph ( see Giraph)
- Apache Hadoop ( see Hadoop)
- Apache HAWQ ( see HAWQ)
- Apache HBase ( see HBase)
- Apache Helix ( see Helix)
- Apache Hive ( see Hive)
- Apache Impala ( see Impala)
- Apache Jena ( see Jena)
- Apache Kafka ( see Kafka)
- Apache Lucene ( see Lucene)
- Apache MADlib ( see MADlib)
- Apache Mahout ( see Mahout)
- Apache Oozie ( see Oozie)
- Apache Parquet ( see Parquet)
- Apache Qpid ( see Qpid)
- Apache Samza ( see Samza)
- Apache Solr ( see Solr)
- Apache Spark ( see Spark)
- Apache Storm ( see Storm)
- Apache Tajo ( see Tajo)
- Apache Tez ( see Tez)
- Apache Thrift ( see Thrift)
- Apache ZooKeeper ( see ZooKeeper)
- Apama (stream analytics) , Complex event processing
- append-only B-trees , B-tree optimizations , Indexes and snapshot isolation
- append-only files ( see logs)
- Application Programming Interfaces (APIs) , Thinking About Data Systems , Data Models and Query Languages
- for batch processing , MapReduce workflows
- for change streams , API support for change streams
- for distributed transactions , XA transactions
- for graph processing , The Pregel processing model
- for services , Dataflow Through Services: REST and RPC - Data encoding and evolution for RPC
- ( see also services)
- evolvability , Data encoding and evolution for RPC
- RESTful , Web services
- SOAP , Web services
- application state ( see state)
- approximate search ( see similarity search)
- archival storage, data from databases , Archival storage
- arcs ( see edges)
- arithmetic mean , Describing Performance
- ASCII text , Thrift and Protocol Buffers , A uniform interface
- ASN.1 (schema language) , The Merits of Schemas
- asynchronous networks , Unreliable Networks , Glossary
- comparison to synchronous networks , Synchronous Versus Asynchronous Networks
- formal model , System Model and Reality
- asynchronous replication , Synchronous Versus Asynchronous Replication , Glossary
- conflict detection , Synchronous versus asynchronous conflict detection
- data loss on failover , Leader failure: Failover
- reads from asynchronous follower , Problems with Replication Lag
- Asynchronous Transfer Mode (ATM) , Can we not simply make network delays predictable?
- atomic broadcast ( see total order broadcast)
- atomic clocks (caesium clocks) , Clock readings have a confidence interval , Synchronized clocks for global snapshots
- ( see also clocks)
- atomicity (concurrency) , Glossary
- atomic increment-and-get , Implementing total order broadcast using linearizable storage
- compare-and-set , Compare-and-set , What Makes a System Linearizable?
- ( see also compare-and-set operations)
- replicated operations , Conflict resolution and replication
- write operations , Atomic write operations
- atomicity (transactions) , Atomicity , Single-Object and Multi-Object Operations , Glossary
- atomic commit , Distributed Transactions and Consensus
- avoiding , Multi-partition request processing , Coordination-avoiding data systems
- blocking and nonblocking , Three-phase commit
- in stream processing , Exactly-once message processing , Atomic commit revisited
- maintaining derived data , Keeping Systems in Sync
- for multi-object transactions , Single-Object and Multi-Object Operations
- for single-object writes , Single-object writes
- atomic commit , Distributed Transactions and Consensus
- auditability , Trust, but Verify - Tools for auditable data systems
- designing for , Designing for auditability
- self-auditing systems , A culture of verification
- through immutability , Advantages of immutable events
- tools for auditable data systems , Tools for auditable data systems
- availability , Hardware Faults
- ( see also fault tolerance)
- in CAP theorem , The CAP theorem
- in service level agreements (SLAs) , Describing Performance
- Avro (data format) , Avro - Code generation and dynamically typed languages
- code generation , Code generation and dynamically typed languages
- dynamically generated schemas , Dynamically generated schemas
- object container files , But what is the writer’s schema? , Archival storage , Philosophy of batch process outputs
- reader determining writer’s schema , But what is the writer’s schema?
- schema evolution , The writer’s schema and the reader’s schema
- use in Hadoop , Philosophy of batch process outputs
- awk (Unix tool) , Simple Log Analysis
- AWS ( see Amazon Web Services)
- Azure ( see Microsoft)
B
- B-trees (indexes) , B-Trees - B-tree optimizations
- append-only/copy-on-write variants , B-tree optimizations , Indexes and snapshot isolation
- branching factor , B-Trees
- comparison to LSM-trees , Comparing B-Trees and LSM-Trees - Downsides of LSM-trees
- crash recovery , Making B-trees reliable
- growing by splitting a page , B-Trees
- optimizations , B-tree optimizations
- similarity to dynamic partitioning , Dynamic partitioning
- backpressure , Messaging Systems , Glossary
- in TCP , Network congestion and queueing
- backups
- database snapshot for replication , Setting Up New Followers
- integrity of , Don’t just blindly trust what they promise
- snapshot isolation for , Snapshot Isolation and Repeatable Read
- use for ETL processes , Example: analysis of user activity events
- backward compatibility , Encoding and Evolution
- BASE, contrast to ACID , The Meaning of ACID
- bash shell (Unix) , Data Structures That Power Your Database , The Unix Philosophy , What’s missing?
- batch processing , Relational Model Versus Document Model , Batch Processing - Summary , Glossary
- combining with stream processing
- lambda architecture , The lambda architecture
- unifying technologies , Unifying batch and stream processing
- comparison to MPP databases , Comparing Hadoop to Distributed Databases - Designing for frequent faults
- comparison to stream processing , Processing Streams
- comparison to Unix , Philosophy of batch process outputs - Philosophy of batch process outputs
- dataflow engines , Dataflow engines - Discussion of materialization
- fault tolerance , Bringing related data together in the same place , Philosophy of batch process outputs , Fault tolerance , Messaging Systems
- for data integration , Batch and Stream Processing - Unifying batch and stream processing
- graphs and iterative processing , Graphs and Iterative Processing - Parallel execution
- high-level APIs and languages , MapReduce workflows , High-Level APIs and Languages - Specialization for different domains
- log-based messaging and , Replaying old messages
- maintaining derived state , Maintaining derived state
- MapReduce and distributed filesystems , MapReduce and Distributed Filesystems - Key-value stores as batch process output
- ( see also MapReduce)
- measuring performance , Describing Performance , Batch Processing
- outputs , The Output of Batch Workflows - Key-value stores as batch process output
- key-value stores , Key-value stores as batch process output
- search indexes , Building search indexes
- using Unix tools (example) , Batch Processing with Unix Tools - Sorting versus in-memory aggregation
- Bayou (database) , Uniqueness in log-based messaging
- Beam (dataflow library) , Unifying batch and stream processing
- bias , Bias and discrimination
- big ball of mud , Simplicity: Managing Complexity
- Bigtable data model , Data locality for queries , Column Compression
- binary data encodings , Binary encoding - The Merits of Schemas
- Avro , Avro - Code generation and dynamically typed languages
- MessagePack , Binary encoding - Binary encoding
- Thrift and Protocol Buffers , Thrift and Protocol Buffers - Datatypes and schema evolution
- binary encoding
- based on schemas , The Merits of Schemas
- by network drivers , The Merits of Schemas
- binary strings, lack of support in JSON and XML , JSON, XML, and Binary Variants
- BinaryProtocol encoding (Thrift) , Thrift and Protocol Buffers
- Bitcask (storage engine) , Hash Indexes
- crash recovery , Hash Indexes
- Bitcoin (cryptocurrency) , Tools for auditable data systems
- Byzantine fault tolerance , Byzantine Faults
- concurrency bugs in exchanges , Weak Isolation Levels
- bitmap indexes , Column Compression
- blockchains , Tools for auditable data systems
- Byzantine fault tolerance , Byzantine Faults
- blocking atomic commit , Three-phase commit
- Bloom (programming language) , Designing Applications Around Dataflow
- Bloom filter (algorithm) , Performance optimizations , Stream analytics
- BookKeeper (replicated log) , Allocating work to nodes
- Bottled Water (change data capture) , Implementing change data capture
- bounded datasets , Summary , Stream Processing , Glossary
- ( see also batch processing)
- bounded delays , Glossary
- in networks , Synchronous Versus Asynchronous Networks
- process pauses , Response time guarantees
- broadcast hash joins , Broadcast hash joins
- brokerless messaging , Direct messaging from producers to consumers
- Brubeck (metrics aggregator) , Direct messaging from producers to consumers
- BTM (transaction coordinator) , Introduction to two-phase commit
- bulk synchronous parallel (BSP) model , The Pregel processing model
- bursty network traffic patterns , Can we not simply make network delays predictable?
- business data processing , Relational Model Versus Document Model , Transaction Processing or Analytics? , Batch Processing
- byte sequence, encoding data in , Formats for Encoding Data
- Byzantine faults , Byzantine Faults - Weak forms of lying , System Model and Reality , Glossary
- Byzantine fault-tolerant systems , Byzantine Faults , Tools for auditable data systems
- Byzantine Generals Problem , Byzantine Faults
- consensus algorithms and , Fault-Tolerant Consensus
C
- caches , Keeping everything in memory , Glossary
- and materialized views , Aggregation: Data Cubes and Materialized Views
- as derived data , Derived Data , Composing Data Storage Technologies - What’s missing?
- database as cache of transaction log , State, Streams, and Immutability
- in CPUs , Memory bandwidth and vectorized processing , Linearizability and network delays , The move toward declarative query languages
- invalidation and maintenance , Keeping Systems in Sync , Maintaining materialized views
- linearizability , Linearizability
- CAP theorem , The CAP theorem - The CAP theorem , Glossary
- Cascading (batch processing) , Beyond MapReduce , High-Level APIs and Languages
- hash joins , Broadcast hash joins
- workflows , MapReduce workflows
- cascading failures , Software Errors , Operations: Automatic or Manual Rebalancing , Timeouts and Unbounded Delays
- Cascalog (batch processing) , The Foundation: Datalog
- Cassandra (database)
- column-family data model , Data locality for queries , Column Compression
- compaction strategy , Performance optimizations
- compound primary key , Partitioning by Hash of Key
- gossip protocol , Request Routing
- hash partitioning , Partitioning by Hash of Key - Partitioning by Hash of Key
- last-write-wins conflict resolution , Last write wins (discarding concurrent writes) , Timestamps for ordering events
- leaderless replication , Leaderless Replication
- linearizability, lack of , Linearizability and quorums
- log-structured storage , Making an LSM-tree out of SSTables
- multi-datacenter support , Multi-datacenter operation
- partitioning scheme , Partitioning proportionally to nodes
- secondary indexes , Partitioning Secondary Indexes by Document
- sloppy quorums , Sloppy Quorums and Hinted Handoff
- cat (Unix tool) , Simple Log Analysis
- causal context , Version vectors
- ( see also causal dependencies)
- causal dependencies , The “happens-before” relationship and concurrency - Version vectors
- capturing , Version vectors , Capturing causal dependencies , Ordering events to capture causality , Reads are events too
- by total ordering , The limits of total ordering
- causal ordering , Ordering and Causality
- in transactions , Decisions based on an outdated premise
- sending message to friends (example) , Ordering events to capture causality
- causality , Glossary
-
causal ordering
,
Ordering and Causality
-
Capturing causal dependencies
- linearizability and , Linearizability is stronger than causal consistency
- total order consistent with , Sequence Number Ordering , Lamport timestamps
- consistency with , Sequence Number Ordering - Lamport timestamps
- consistent snapshots , Ordering and Causality
- happens-before relationship , The “happens-before” relationship and concurrency
- in serializable transactions , Decisions based on an outdated premise - Detecting writes that affect prior reads
- mismatch with clocks , Timestamps for ordering events
- ordering events to capture , Ordering events to capture causality
- violations of , Consistent Prefix Reads , Multi-Leader Replication Topologies , Timestamps for ordering events , Ordering and Causality
- with synchronized clocks , Synchronized clocks for global snapshots
- CEP ( see complex event processing)
- certificate transparency , Tools for auditable data systems
- chain replication , Synchronous Versus Asynchronous Replication
- linearizable reads , Implementing linearizable storage using total order broadcast
- change data capture , Logical (row-based) log replication , Change Data Capture
- API support for change streams , API support for change streams
- comparison to event sourcing , Event Sourcing
- implementing , Implementing change data capture
- initial snapshot , Initial snapshot
- log compaction , Log compaction
- changelogs , State, Streams, and Immutability
- change data capture , Change Data Capture
- for operator state , Rebuilding state after a failure
- generating with triggers , Implementing change data capture
- in stream joins , Stream-table join (stream enrichment)
- log compaction , Log compaction
- maintaining derived state , Databases and Streams
- Chaos Monkey , Reliability , Network Faults in Practice
- checkpointing
- in batch processors , Fault tolerance , Fault tolerance
- in high-performance computing , Cloud Computing and Supercomputing
- in stream processors , Microbatching and checkpointing , Multi-partition request processing
- chronicle data model , Event Sourcing
- circuit-switched networks , Synchronous Versus Asynchronous Networks
- circular buffers , Disk space usage
- circular replication topologies , Multi-Leader Replication Topologies
- clickstream data, analysis of , Example: analysis of user activity events
- clients
- calling services , Dataflow Through Services: REST and RPC
- pushing state changes to , Pushing state changes to clients
- request routing , Request Routing
- stateful and offline-capable , Clients with offline operation , Stateful, offline-capable clients
- clocks , Unreliable Clocks - Limiting the impact of garbage collection
- atomic (caesium) clocks , Clock readings have a confidence interval , Synchronized clocks for global snapshots
- confidence interval , Clock readings have a confidence interval - Synchronized clocks for global snapshots
- for global snapshots , Synchronized clocks for global snapshots
- logical ( see logical clocks)
- skew , Relying on Synchronized Clocks - Clock readings have a confidence interval , Implementing Linearizable Systems
- slewing , Monotonic clocks
- synchronization and accuracy , Clock Synchronization and Accuracy - Clock Synchronization and Accuracy
- synchronization using GPS , Unreliable Clocks , Clock Synchronization and Accuracy , Clock readings have a confidence interval , Synchronized clocks for global snapshots
- time-of-day versus monotonic clocks , Monotonic Versus Time-of-Day Clocks
- timestamping events , Whose clock are you using, anyway?
- cloud computing , Distributed Data , Cloud Computing and Supercomputing
- need for service discovery , Service discovery
- network glitches , Network Faults in Practice
- shared resources , Network congestion and queueing
- single-machine reliability , Hardware Faults
- Cloudera Impala ( see Impala)
- clustered indexes , Storing values within the index
- CODASYL model , The network model
- ( see also network model)
- code generation
- with Avro , Code generation and dynamically typed languages
- with Thrift and Protocol Buffers , Thrift and Protocol Buffers
- with WSDL , Web services
- collaborative editing
- multi-leader replication and , Collaborative editing
- column families (Bigtable) , Data locality for queries , Column Compression
- column-oriented storage , Column-Oriented Storage - Writing to Column-Oriented Storage
- column compression , Column Compression
- distinction between column families and , Column Compression
- in batch processors , The move toward declarative query languages
- Parquet , Column-Oriented Storage , Archival storage , Philosophy of batch process outputs
- sort order in , Sort Order in Column Storage - Several different sort orders
- vectorized processing , Memory bandwidth and vectorized processing , The move toward declarative query languages
- writing to , Writing to Column-Oriented Storage
- comma-separated values ( see CSV)
- command query responsibility segregation (CQRS) , Deriving several views from the same event log
- commands (event sourcing) , Commands and events
- commits (transactions) , Transactions
- atomic commit , Atomic Commit and Two-Phase Commit (2PC) - From single-node to distributed atomic commit
- ( see also atomicity; transactions)
- read committed isolation , Read Committed
- three-phase commit (3PC) , Three-phase commit
- two-phase commit (2PC) , Introduction to two-phase commit - Coordinator failure
- commutative operations , Conflict resolution and replication
- compaction
- of changelogs , Log compaction
- ( see also log compaction)
- for stream operator state , Rebuilding state after a failure
- of log-structured storage , Hash Indexes
- issues with , Downsides of LSM-trees
- size-tiered and leveled approaches , Performance optimizations
- CompactProtocol encoding (Thrift) , Thrift and Protocol Buffers
- compare-and-set operations , Compare-and-set , What Makes a System Linearizable?
- implementing locks , Membership and Coordination Services
- implementing uniqueness constraints , Constraints and uniqueness guarantees
- implementing with total order broadcast , Implementing linearizable storage using total order broadcast
- relation to consensus , Linearizability and quorums , Implementing linearizable storage using total order broadcast , Implementing total order broadcast using linearizable storage , Summary
- relation to transactions , Single-object writes
- compatibility , Encoding and Evolution , Modes of Dataflow
- calling services , Data encoding and evolution for RPC
- properties of encoding formats , Summary
- using databases , Dataflow Through Databases - Archival storage
- using message-passing , Distributed actor frameworks
- compensating transactions , From single-node to distributed atomic commit , Advantages of immutable events , Loosely interpreted constraints
- complex event processing (CEP) , Complex event processing
- complexity
- distilling in theoretical models , Mapping system models to the real world
- hiding using abstraction , Data Models and Query Languages
- of software systems, managing , Simplicity: Managing Complexity
- composing data systems ( see unbundling databases)
- compute-intensive applications , Reliable, Scalable, and Maintainable Applications , Cloud Computing and Supercomputing
- concatenated indexes , Multi-column indexes
- in Cassandra , Partitioning by Hash of Key
- Concord (stream processor) , Stream analytics
- concurrency
- actor programming model , Distributed actor frameworks , Message passing and RPC
- ( see also message-passing)
- bugs from weak transaction isolation , Weak Isolation Levels
- conflict resolution , Handling Write Conflicts , Custom conflict resolution logic
- detecting concurrent writes , Detecting Concurrent Writes - Version vectors
- dual writes, problems with , Keeping Systems in Sync
- happens-before relationship , The “happens-before” relationship and concurrency
- in replicated systems , Problems with Replication Lag - Version vectors , Linearizability - Linearizability and network delays
- lost updates , Preventing Lost Updates
- multi-version concurrency control (MVCC) , Implementing snapshot isolation
- optimistic concurrency control , Pessimistic versus optimistic concurrency control
- ordering of operations , What Makes a System Linearizable? , The causal order is not a total order
- reducing, through event logs , Implementing linearizable storage using total order broadcast , Concurrency control , Dataflow: Interplay between state changes and application code
- time and relativity , The “happens-before” relationship and concurrency
- transaction isolation , Isolation
- write skew (transaction isolation) , Write Skew and Phantoms - Materializing conflicts
- conflict-free replicated datatypes (CRDTs) , Custom conflict resolution logic
- conflicts
- conflict detection , Synchronous versus asynchronous conflict detection
- causal dependencies , The “happens-before” relationship and concurrency , Capturing causal dependencies
- in consensus algorithms , Epoch numbering and quorums
- in leaderless replication , Detecting Concurrent Writes
- in log-based systems , Implementing linearizable storage using total order broadcast , Uniqueness constraints require consensus
- in nonlinearizable systems , Capturing causal dependencies
- in serializable snapshot isolation (SSI) , Detecting writes that affect prior reads
- in two-phase commit , A system of promises , Limitations of distributed transactions
- conflict resolution
- automatic conflict resolution , Custom conflict resolution logic
- by aborting transactions , Pessimistic versus optimistic concurrency control
- by apologizing , Loosely interpreted constraints
- convergence , Converging toward a consistent state - Custom conflict resolution logic
- in leaderless systems , Merging concurrently written values
- last write wins (LWW) , Last write wins (discarding concurrent writes) , Timestamps for ordering events
- using atomic operations , Conflict resolution and replication
- using custom logic , Custom conflict resolution logic
- determining what is a conflict , What is a conflict? , Uniqueness in log-based messaging
- in multi-leader replication , Handling Write Conflicts - What is a conflict?
- avoiding conflicts , Conflict avoidance
- lost updates , Preventing Lost Updates - Conflict resolution and replication
- materializing , Materializing conflicts
- relation to operation ordering , Ordering Guarantees
- write skew (transaction isolation) , Write Skew and Phantoms - Materializing conflicts
- congestion (networks)
- avoidance , Network congestion and queueing
- limiting accuracy of clocks , Clock readings have a confidence interval
- queueing delays , Network congestion and queueing
- consensus , Consistency and Consensus , Fault-Tolerant Consensus - Summary , Glossary
- algorithms , Consensus algorithms and total order broadcast - Epoch numbering and quorums
- preventing split brain , Single-leader replication and consensus
- safety and liveness properties , Fault-Tolerant Consensus
- using linearizable operations , Implementing total order broadcast using linearizable storage
- cost of , Limitations of consensus
- distributed transactions , Distributed Transactions and Consensus - Summary
- in practice , Distributed Transactions in Practice - Limitations of distributed transactions
- two-phase commit , Atomic Commit and Two-Phase Commit (2PC) - Three-phase commit
- XA transactions , XA transactions - Limitations of distributed transactions
- impossibility of , Distributed Transactions and Consensus
- membership and coordination services , Membership and Coordination Services - Membership services
- relation to compare-and-set , Linearizability and quorums , Implementing linearizable storage using total order broadcast , Implementing total order broadcast using linearizable storage , Summary
- relation to replication , Synchronous Versus Asynchronous Replication , Using total order broadcast
- relation to uniqueness constraints , Uniqueness constraints require consensus
- consistency , Consistency , Timeliness and Integrity
- across different databases , Leader failure: Failover , Keeping Systems in Sync , Deriving several views from the same event log , Derived data versus distributed transactions
- causal , Ordering and Causality - Timestamp ordering is not sufficient , Ordering events to capture causality
- consistent prefix reads , Consistent Prefix Reads - Consistent Prefix Reads
- consistent snapshots , Setting Up New Followers , Snapshot Isolation and Repeatable Read - Repeatable read and naming confusion , Synchronized clocks for global snapshots , Initial snapshot , Creating an index
- ( see also snapshots)
- crash recovery , Making B-trees reliable
- enforcing constraints ( see constraints)
- eventual , Problems with Replication Lag , Consistency Guarantees
- ( see also eventual consistency)
- in ACID transactions , Consistency , Maintaining integrity in the face of software bugs
- in CAP theorem , The CAP theorem
- linearizability , Linearizability - Linearizability and network delays
- meanings of , Consistency
- monotonic reads , Monotonic Reads - Monotonic Reads
- of secondary indexes , The need for multi-object transactions , Indexes and snapshot isolation , Atomic Commit and Two-Phase Commit (2PC) , Reasoning about dataflows , Creating an index
- ordering guarantees , Ordering Guarantees - Implementing total order broadcast using linearizable storage
- read-after-write , Reading Your Own Writes - Reading Your Own Writes
- sequential , Implementing linearizable storage using total order broadcast
- strong ( see linearizability)
- timeliness and integrity , Timeliness and Integrity
- using quorums , Limitations of Quorum Consistency , Linearizability and quorums
- consistent hashing , Partitioning by Hash of Key
- consistent prefix reads , Consistent Prefix Reads
- constraints (databases) , Consistency , Characterizing write skew
- asynchronously checked , Loosely interpreted constraints
- coordination avoidance , Coordination-avoiding data systems
- ensuring idempotence , Operation identifiers
- in log-based systems , Enforcing Constraints - Multi-partition request processing
- across multiple partitions , Multi-partition request processing
- in two-phase commit , From single-node to distributed atomic commit , A system of promises
- relation to consensus , Summary , Uniqueness constraints require consensus
- relation to event ordering , Timestamp ordering is not sufficient
- requiring linearizability , Constraints and uniqueness guarantees
- Consul (service discovery) , Service discovery
- consumers (message streams) , Message brokers , Transmitting Event Streams
- backpressure , Messaging Systems
- consumer offsets in logs , Consumer offsets
- failures , Acknowledgments and redelivery , Consumer offsets
- fan-out , Describing Load , Multiple consumers , Logs compared to traditional messaging
- load balancing , Multiple consumers , Logs compared to traditional messaging
- not keeping up with producers , Messaging Systems , Disk space usage , Making unbundling work
- context switches , Describing Performance , Process Pauses
- convergence (conflict resolution) , Converging toward a consistent state - Custom conflict resolution logic , Consistency Guarantees
- coordination
- avoidance , Coordination-avoiding data systems
- cross-datacenter , Multi-datacenter operation , The limits of total ordering
- cross-partition ordering , Partitioning , Synchronized clocks for global snapshots , Total Order Broadcast , Multi-partition request processing
- services , Locking and leader election , Membership and Coordination Services - Membership services
- coordinator (in 2PC) , Introduction to two-phase commit
- failure , Coordinator failure
- in XA transactions , XA transactions - Limitations of distributed transactions
- recovery , Recovering from coordinator failure
- copy-on-write (B-trees) , B-tree optimizations , Indexes and snapshot isolation
- CORBA (Common Object Request Broker Architecture) , The problems with remote procedure calls (RPCs)
- correctness , Thinking About Data Systems
- auditability , Trust, but Verify - Tools for auditable data systems
- Byzantine fault tolerance , Byzantine Faults , Tools for auditable data systems
- dealing with partial failures , Faults and Partial Failures
- in log-based systems , Enforcing Constraints - Multi-partition request processing
- of algorithm within system model , Correctness of an algorithm
- of compensating transactions , From single-node to distributed atomic commit
- of consensus , Epoch numbering and quorums
- of derived data , The lambda architecture , Designing for auditability
- of immutable data , Advantages of immutable events
- of personal data , Responsibility and accountability , Privacy and use of data
- of time , Multi-Leader Replication Topologies , Clock Synchronization and Accuracy - Synchronized clocks for global snapshots
- of transactions , Consistency , Aiming for Correctness , Maintaining integrity in the face of software bugs
- timeliness and integrity , Timeliness and Integrity - Coordination-avoiding data systems
- corruption of data
- detecting , The end-to-end argument , Don’t just blindly trust what they promise - Tools for auditable data systems
- due to pathological memory access , Trust, but Verify
- due to radiation , Byzantine Faults
- due to split brain , Leader failure: Failover , The leader and the lock
- due to weak transaction isolation , Weak Isolation Levels
- formalization in consensus , Consensus algorithms and total order broadcast
- integrity as absence of , Timeliness and Integrity
- network packets , Weak forms of lying
- on disks , Durability
- preventing using write-ahead logs , Making B-trees reliable
- recovering from , Philosophy of batch process outputs , Advantages of immutable events
- Couchbase (database)
- durability , Keeping everything in memory
- hash partitioning , Partitioning by Hash of Key - Partitioning by Hash of Key , Fixed number of partitions
- rebalancing , Operations: Automatic or Manual Rebalancing
- request routing , Request Routing
- CouchDB (database)
- B-tree storage , Indexes and snapshot isolation
- change feed , API support for change streams
- document data model , The Object-Relational Mismatch
- join support , Many-to-One and Many-to-Many Relationships
- MapReduce support , MapReduce Querying , Distributed execution of MapReduce
- replication , Clients with offline operation , Custom conflict resolution logic
- covering indexes , Storing values within the index
- CPUs
- cache coherence and memory barriers , Linearizability and network delays
- caching and pipelining , Memory bandwidth and vectorized processing , The move toward declarative query languages
- increasing parallelism , Query Languages for Data
- CRDTs ( see conflict-free replicated datatypes)
- CREATE INDEX statement (SQL) , Other Indexing Structures , Creating an index
- credit rating agencies , Responsibility and accountability
- Crunch (batch processing) , Beyond MapReduce , High-Level APIs and Languages
- hash joins , Broadcast hash joins
- sharded joins , Handling skew
- workflows , MapReduce workflows
- cryptography
- defense against attackers , Byzantine Faults
- end-to-end encryption and authentication , The end-to-end argument , Legislation and self-regulation
- proving integrity of data , Tools for auditable data systems
- CSS (Cascading Style Sheets) , Declarative Queries on the Web
- CSV (comma-separated values) , Data Structures That Power Your Database , JSON, XML, and Binary Variants , A uniform interface
- Curator (ZooKeeper recipes) , Locking and leader election , Allocating work to nodes
- curl (Unix tool) , Current directions for RPC , Separation of logic and wiring
- cursor stability , Atomic write operations
- Cypher (query language) , The Cypher Query Language
- comparison to SPARQL , The SPARQL query language
D
- data corruption ( see corruption of data)
- data cubes , Aggregation: Data Cubes and Materialized Views
- data formats ( see encoding)
- data integration , Data Integration - Unifying batch and stream processing , Summary
- batch and stream processing , Batch and Stream Processing - Unifying batch and stream processing
- lambda architecture , The lambda architecture
- maintaining derived state , Maintaining derived state
- reprocessing data , Reprocessing data for application evolution
- unifying , Unifying batch and stream processing
- by unbundling databases , Unbundling Databases - Multi-partition data processing
- comparison to federated databases , The meta-database of everything
- combining tools by deriving data , Combining Specialized Tools by Deriving Data - Ordering events to capture causality
- derived data versus distributed transactions , Derived data versus distributed transactions
- limits of total ordering , The limits of total ordering
- ordering events to capture causality , Ordering events to capture causality
- reasoning about dataflows , Reasoning about dataflows
- need for , Derived Data
- data lakes , Diversity of storage
- data locality ( see locality)
- data models , Data Models and Query Languages - Summary
- graph-like models , Graph-Like Data Models - The Foundation: Datalog
- Datalog language , The Foundation: Datalog - The Foundation: Datalog
- property graphs , Property Graphs
- RDF and triple-stores , Triple-Stores and SPARQL - The SPARQL query language
- query languages , Query Languages for Data - MapReduce Querying
- relational model versus document model , Relational Model Versus Document Model - Convergence of document and relational databases
- data protection regulations , Legislation and self-regulation
- data systems , Reliable, Scalable, and Maintainable Applications
- about , Thinking About Data Systems
- concerns when designing , Thinking About Data Systems
- future of , The Future of Data Systems - Summary
- correctness, constraints, and integrity , Aiming for Correctness - Tools for auditable data systems
- data integration , Data Integration - Unifying batch and stream processing
- unbundling databases , Unbundling Databases - Multi-partition data processing
- heterogeneous, keeping in sync , Keeping Systems in Sync
- maintainability , Maintainability - Evolvability: Making Change Easy
- possible faults in , Transactions
- reliability , Reliability - How Important Is Reliability?
- hardware faults , Hardware Faults
- human errors , Human Errors
- importance of , How Important Is Reliability?
- software errors , Software Errors
- scalability , Scalability - Approaches for Coping with Load
- unreliable clocks , Unreliable Clocks - Limiting the impact of garbage collection
- data warehousing , Data Warehousing - Stars and Snowflakes: Schemas for Analytics , Glossary
- comparison to data lakes , Diversity of storage
- ETL (extract-transform-load) , Data Warehousing , Diversity of storage , Keeping Systems in Sync
- keeping data systems in sync , Keeping Systems in Sync
- schema design , Stars and Snowflakes: Schemas for Analytics
- slowly changing dimension (SCD) , Time-dependence of joins
- data-intensive applications , Reliable, Scalable, and Maintainable Applications
- database triggers ( see triggers)
- database-internal distributed transactions , Distributed Transactions in Practice , Limitations of distributed transactions , Atomic commit revisited
- databases
- archival storage , Archival storage
- comparison of message brokers to , Message brokers compared to databases
- dataflow through , Dataflow Through Databases
- end-to-end argument for , The end-to-end argument - Applying end-to-end thinking in data systems
- checking integrity , The end-to-end argument again
- inside-out , Designing Applications Around Dataflow
- ( see also unbundling databases)
- output from batch workflows , Key-value stores as batch process output
- relation to event streams , Databases and Streams - Limitations of immutability
- ( see also changelogs)
- API support for change streams , API support for change streams , Separation of application code and state
- change data capture , Change Data Capture - API support for change streams
- event sourcing , Event Sourcing - Commands and events
- keeping systems in sync , Keeping Systems in Sync - Keeping Systems in Sync
- philosophy of immutable events , State, Streams, and Immutability - Limitations of immutability
- unbundling , Unbundling Databases - Multi-partition data processing
- composing data storage technologies , Composing Data Storage Technologies - What’s missing?
- designing applications around dataflow , Designing Applications Around Dataflow - Stream processors and services
- observing derived state , Observing Derived State - Multi-partition data processing
- datacenters
- geographically distributed , Distributed Data , Reading Your Own Writes , Unreliable Networks , The limits of total ordering
- multi-tenancy and shared resources , Network congestion and queueing
- network architecture , Cloud Computing and Supercomputing
- network faults , Network Faults in Practice
- replication across multiple , Multi-datacenter operation
- leaderless replication , Multi-datacenter operation
- multi-leader replication , Multi-datacenter operation , The Cost of Linearizability
- dataflow , Modes of Dataflow - Distributed actor frameworks , Designing Applications Around Dataflow - Stream processors and services
- correctness of dataflow systems , Correctness of dataflow systems
- differential , What’s missing?
- message-passing , Message-Passing Dataflow - Distributed actor frameworks
- reasoning about , Reasoning about dataflows
- through databases , Dataflow Through Databases
- through services , Dataflow Through Services: REST and RPC - Data encoding and evolution for RPC
- dataflow engines , Dataflow engines - Discussion of materialization
- comparison to stream processing , Processing Streams
- directed acyclic graphs (DAG) , Graphs and Iterative Processing
- partitioning, approach to , Summary
- support for declarative queries , The move toward declarative query languages
- Datalog (query language) , The Foundation: Datalog - The Foundation: Datalog
- datatypes
- binary strings in XML and JSON , JSON, XML, and Binary Variants
- conflict-free , Custom conflict resolution logic
- in Avro encodings , Avro
- in Thrift and Protocol Buffers , Datatypes and schema evolution
- numbers in XML and JSON , JSON, XML, and Binary Variants
- Datomic (database)
- B-tree storage , Indexes and snapshot isolation
- data model , Graph-Like Data Models , The semantic web
- Datalog query language , The Foundation: Datalog
- excision (deleting data) , Limitations of immutability
- languages for transactions , Pros and cons of stored procedures
- serial execution of transactions , Actual Serial Execution
- deadlocks
- detection, in two-phase commit (2PC) , Limitations of distributed transactions
- in two-phase locking (2PL) , Implementation of two-phase locking
- Debezium (change data capture) , Implementing change data capture
- declarative languages , Query Languages for Data , Glossary
- Bloom , Designing Applications Around Dataflow
- CSS and XSL , Declarative Queries on the Web
- Cypher , The Cypher Query Language
- Datalog , The Foundation: Datalog
- for batch processing , The move toward declarative query languages
- recursive SQL queries , Graph Queries in SQL
- relational algebra and SQL , Query Languages for Data
- SPARQL , The SPARQL query language
- delays
- bounded network delays , Synchronous Versus Asynchronous Networks
- bounded process pauses , Response time guarantees
- unbounded network delays , Timeouts and Unbounded Delays
- unbounded process pauses , Process Pauses
- deleting data , Limitations of immutability
- denormalization (data representation) , Many-to-One and Many-to-Many Relationships , Glossary
- costs , Which data model leads to simpler application code?
- in derived data systems , Derived Data
- materialized views , Aggregation: Data Cubes and Materialized Views
- updating derived data , Single-Object and Multi-Object Operations , The need for multi-object transactions , Combining Specialized Tools by Deriving Data
- versus normalization , Deriving several views from the same event log
- derived data , Derived Data , Stream Processing , Glossary
- from change data capture , Implementing change data capture
- in event sourcing , Deriving current state from the event log - Deriving current state from the event log
- maintaining derived state through logs , Databases and Streams - API support for change streams , State, Streams, and Immutability - Concurrency control
- observing, by subscribing to streams , End-to-end event streams
- outputs of batch and stream processing , Batch and Stream Processing
- through application code , Application code as a derivation function
- versus distributed transactions , Derived data versus distributed transactions
- deterministic operations , Pros and cons of stored procedures , Faults and Partial Failures , Glossary
- accidental nondeterminism , Fault tolerance
- and fault tolerance , Fault tolerance , Fault tolerance
- and idempotence , Idempotence , Reasoning about dataflows
- computing derived data , Maintaining derived state , Correctness of dataflow systems , Designing for auditability
- in state machine replication , Using total order broadcast , Databases and Streams , Deriving current state from the event log
- joins , Time-dependence of joins
- DevOps , The Unix Philosophy
- differential dataflow , What’s missing?
- dimension tables , Stars and Snowflakes: Schemas for Analytics
- dimensional modeling ( see star schemas)
- directed acyclic graphs (DAGs) , Graphs and Iterative Processing
- dirty reads (transaction isolation) , No dirty reads
- dirty writes (transaction isolation) , No dirty writes
- discrimination , Bias and discrimination
- disks ( see hard disks)
- distributed filesystems , MapReduce and Distributed Filesystems - MapReduce and Distributed Filesystems
- decoupling from query engines , Diversity of processing models
- indiscriminately dumping data into , Diversity of storage
- use by MapReduce , MapReduce workflows
- distributed systems , The Trouble with Distributed Systems - Summary , Glossary
- Byzantine faults , Byzantine Faults - Weak forms of lying
- cloud versus supercomputing , Cloud Computing and Supercomputing
- detecting network faults , Detecting Faults
- faults and partial failures , Faults and Partial Failures - Cloud Computing and Supercomputing
- formalization of consensus , Fault-Tolerant Consensus
- impossibility results , The CAP theorem , Distributed Transactions and Consensus
- issues with failover , Leader failure: Failover
- limitations of distributed transactions , Limitations of distributed transactions
- multi-datacenter , Multi-datacenter operation , The Cost of Linearizability
- network problems , Unreliable Networks - Can we not simply make network delays predictable?
- quorums, relying on , The Truth Is Defined by the Majority
- reasons for using , Distributed Data , Replication
- synchronized clocks, relying on , Relying on Synchronized Clocks - Synchronized clocks for global snapshots
- system models , System Model and Reality - Mapping system models to the real world
- use of clocks and time , Unreliable Clocks
- distributed transactions ( see transactions)
- Django (web framework) , Handling errors and aborts
- DNS (Domain Name System) , Request Routing , Service discovery
- Docker (container manager) , Separation of application code and state
- document data model , The Object-Relational Mismatch - Convergence of document and relational databases
- comparison to relational model , Relational Versus Document Databases Today - Convergence of document and relational databases
- document references , Comparison to document databases , Reduce-Side Joins and Grouping
- document-oriented databases , The Object-Relational Mismatch
- many-to-many relationships and joins , Are Document Databases Repeating History?
- multi-object transactions, need for , The need for multi-object transactions
- versus relational model
- convergence of models , Convergence of document and relational databases
- data locality , Data locality for queries
- document-partitioned indexes , Partitioning Secondary Indexes by Document , Summary , Building search indexes
- domain-driven design (DDD) , Event Sourcing
- DRBD (Distributed Replicated Block Device) , Leaders and Followers
- drift (clocks) , Clock Synchronization and Accuracy
- Drill (query engine) , The divergence between OLTP databases and data warehouses
- Druid (database) , Deriving several views from the same event log
- Dryad (dataflow engine) , Dataflow engines
- dual writes, problems with , Keeping Systems in Sync , Dataflow: Interplay between state changes and application code
- duplicates, suppression of , Duplicate suppression
- ( see also idempotence)
- using a unique ID , Operation identifiers , Multi-partition request processing
- durability (transactions) , Durability , Glossary
- duration (time) , Unreliable Clocks
- measurement with monotonic clocks , Monotonic clocks
- dynamic partitioning , Dynamic partitioning
- dynamically typed languages
- analogy to schema-on-read , Schema flexibility in the document model
- code generation and , Code generation and dynamically typed languages
- Dynamo-style databases ( see leaderless replication)
E
- edges (in graphs) , Graph-Like Data Models , Reduce-Side Joins and Grouping
- property graph model , Property Graphs
- edit distance (full-text search) , Full-text search and fuzzy indexes
- effectively-once semantics , Fault Tolerance , Exactly-once execution of an operation
- ( see also exactly-once semantics)
- preservation of integrity , Correctness of dataflow systems
- elastic systems , Approaches for Coping with Load
- Elasticsearch (search server)
- document-partitioned indexes , Partitioning Secondary Indexes by Document
- partition rebalancing , Fixed number of partitions
- percolator (stream search) , Search on streams
- usage example , Thinking About Data Systems
- use of Lucene , Making an LSM-tree out of SSTables
- ElephantDB (database) , Key-value stores as batch process output
- Elm (programming language) , Designing Applications Around Dataflow , End-to-end event streams
- encodings (data formats) , Encoding and Evolution - The Merits of Schemas
- Avro , Avro - Code generation and dynamically typed languages
- binary variants of JSON and XML , Binary encoding
- compatibility , Encoding and Evolution
- calling services , Data encoding and evolution for RPC
- using databases , Dataflow Through Databases - Archival storage
- using message-passing , Distributed actor frameworks
- defined , Formats for Encoding Data
- JSON, XML, and CSV , JSON, XML, and Binary Variants
- language-specific formats , Language-Specific Formats
- merits of schemas , The Merits of Schemas
- representations of data , Formats for Encoding Data
- Thrift and Protocol Buffers , Thrift and Protocol Buffers - Datatypes and schema evolution
- end-to-end argument , Cloud Computing and Supercomputing , The end-to-end argument - Applying end-to-end thinking in data systems
- checking integrity , The end-to-end argument again
- publish/subscribe streams , End-to-end event streams
- enrichment (stream) , Stream-table join (stream enrichment)
- Enterprise JavaBeans (EJB) , The problems with remote procedure calls (RPCs)
- entities ( see vertices)
- epoch (consensus algorithms) , Epoch numbering and quorums
- epoch (Unix timestamps) , Time-of-day clocks
- equi-joins , Reduce-Side Joins and Grouping
- erasure coding (error correction) , MapReduce and Distributed Filesystems
- Erlang OTP (actor framework) , Distributed actor frameworks
- error handling
- for network faults , Network Faults in Practice
- in transactions , Handling errors and aborts
- error-correcting codes , Cloud Computing and Supercomputing , MapReduce and Distributed Filesystems
- Esper (CEP engine) , Complex event processing
- etcd (coordination service) , Membership and Coordination Services - Membership services
- linearizable operations , Implementing Linearizable Systems
- locks and leader election , Locking and leader election
- quorum reads , Implementing linearizable storage using total order broadcast
- service discovery , Service discovery
- use of Raft algorithm , Using total order broadcast , Distributed Transactions and Consensus
- Ethereum (blockchain) , Tools for auditable data systems
- Ethernet (networks) , Cloud Computing and Supercomputing , Unreliable Networks , Can we not simply make network delays predictable?
- packet checksums , Weak forms of lying , The end-to-end argument
- Etherpad (collaborative editor) , Collaborative editing
- ethics , Doing the Right Thing - Legislation and self-regulation
- code of ethics and professional practice , Doing the Right Thing
- legislation and self-regulation , Legislation and self-regulation
- predictive analytics , Predictive Analytics - Feedback loops
- amplifying bias , Bias and discrimination
- feedback loops , Feedback loops
- privacy and tracking , Privacy and Tracking - Legislation and self-regulation
- consent and freedom of choice , Consent and freedom of choice
- data as assets and power , Data as assets and power
- meaning of privacy , Privacy and use of data
- surveillance , Surveillance
- respect, dignity, and agency , Legislation and self-regulation , Summary
- unintended consequences , Doing the Right Thing , Feedback loops
- ETL (extract-transform-load) , Data Warehousing , Example: analysis of user activity events , Keeping Systems in Sync , Glossary
- use of Hadoop for , Diversity of storage
- event sourcing , Event Sourcing - Commands and events
- commands and events , Commands and events
- comparison to change data capture , Event Sourcing
- comparison to lambda architecture , The lambda architecture
- deriving current state from event log , Deriving current state from the event log
- immutability and auditability , State, Streams, and Immutability , Designing for auditability
- large, reliable data systems , Operation identifiers , Correctness of dataflow systems
- Event Store (database) , Event Sourcing
- event streams ( see streams)
- events , Transmitting Event Streams
- deciding on total order of , The limits of total ordering
- deriving views from event log , Deriving several views from the same event log
- difference to commands , Commands and events
- event time versus processing time , Event time versus processing time , Microbatching and checkpointing , Unifying batch and stream processing
- immutable, advantages of , Advantages of immutable events , Designing for auditability
- ordering to capture causality , Ordering events to capture causality
- reads as , Reads are events too
- stragglers , Knowing when you’re ready , The lambda architecture
- timestamp of, in stream processing , Whose clock are you using, anyway?
- EventSource (browser API) , Pushing state changes to clients
- eventual consistency , Replication , Problems with Replication Lag , Safety and liveness , Consistency Guarantees
- ( see also conflicts)
- and perpetual inconsistency , Timeliness and Integrity
- evolvability , Evolvability: Making Change Easy , Encoding and Evolution
- calling services , Data encoding and evolution for RPC
- graph-structured data , Property Graphs
- of databases , Schema flexibility in the document model , Dataflow Through Databases - Archival storage , Deriving several views from the same event log , Reprocessing data for application evolution
- of message-passing , Distributed actor frameworks
- reprocessing data , Reprocessing data for application evolution , Unifying batch and stream processing
- schema evolution in Avro , The writer’s schema and the reader’s schema
- schema evolution in Thrift and Protocol Buffers , Field tags and schema evolution
- schema-on-read , Schema flexibility in the document model , Encoding and Evolution , The Merits of Schemas
- exactly-once semantics , Exactly-once message processing , Fault Tolerance , Exactly-once execution of an operation
- parity with batch processors , Unifying batch and stream processing
- preservation of integrity , Correctness of dataflow systems
- exclusive mode (locks) , Implementation of two-phase locking
- eXtended Architecture transactions ( see XA transactions)
- extract-transform-load ( see ETL)
F
- Facebook
- Presto (query engine) , The divergence between OLTP databases and data warehouses
- React, Flux, and Redux (user interface libraries) , End-to-end event streams
- social graphs , Graph-Like Data Models
- Wormhole (change data capture) , Implementing change data capture
- fact tables , Stars and Snowflakes: Schemas for Analytics
- failover , Leader failure: Failover , Glossary
- ( see also leader-based replication)
- in leaderless replication, absence of , Writing to the Database When a Node Is Down
- leader election , The leader and the lock , Total Order Broadcast , Distributed Transactions and Consensus
- potential problems , Leader failure: Failover
- failures
- amplification by distributed transactions , Limitations of distributed transactions , Maintaining derived state
- failure detection , Detecting Faults
- automatic rebalancing causing cascading failures , Operations: Automatic or Manual Rebalancing
- perfect failure detectors , Three-phase commit
- timeouts and unbounded delays , Timeouts and Unbounded Delays , Network congestion and queueing
- using ZooKeeper , Membership and Coordination Services
- faults versus , Reliability
- partial failures in distributed systems , Faults and Partial Failures - Cloud Computing and Supercomputing , Summary
- fan-out (messaging systems) , Describing Load , Multiple consumers
- fault tolerance , Reliability - How Important Is Reliability? , Glossary
- abstractions for , Consistency and Consensus
- formalization in consensus , Fault-Tolerant Consensus - Limitations of consensus
- use of replication , Single-leader replication and consensus
- human fault tolerance , Philosophy of batch process outputs
- in batch processing , Bringing related data together in the same place , Philosophy of batch process outputs , Fault tolerance , Fault tolerance
- in log-based systems , Applying end-to-end thinking in data systems , Timeliness and Integrity - Correctness of dataflow systems
- in stream processing , Fault Tolerance - Rebuilding state after a failure
- atomic commit , Atomic commit revisited
- idempotence , Idempotence
- maintaining derived state , Maintaining derived state
- microbatching and checkpointing , Microbatching and checkpointing
- rebuilding state after a failure , Rebuilding state after a failure
- of distributed transactions , XA transactions - Limitations of distributed transactions
- transaction atomicity , Atomicity , Atomic Commit and Two-Phase Commit (2PC) - Exactly-once message processing
- faults , Reliability
Reliability
- Byzantine faults , Byzantine Faults - Weak forms of lying
- failures versus , Reliability
- handled by transactions , Transactions
- handling in supercomputers and cloud computing , Cloud Computing and Supercomputing
- hardware , Hardware Faults
- in batch processing versus distributed databases , Designing for frequent faults
- in distributed systems , Faults and Partial Failures - Cloud Computing and Supercomputing
- introducing deliberately , Reliability , Network Faults in Practice
- network faults , Network Faults in Practice - Detecting Faults
- asymmetric faults , The Truth Is Defined by the Majority
- detecting , Detecting Faults
- tolerance of, in multi-leader replication , Multi-datacenter operation
- software errors , Software Errors
- tolerating ( see fault tolerance)
- federated databases , The meta-database of everything
- fence (CPU instruction) , Linearizability and network delays
- fencing (preventing split brain) , Leader failure: Failover , The leader and the lock - Fencing tokens
- generating fencing tokens , Using total order broadcast , Membership and Coordination Services
- properties of fencing tokens , Correctness of an algorithm
- stream processors writing to databases , Idempotence , Exactly-once execution of an operation
- Fibre Channel (networks) , MapReduce and Distributed Filesystems
- field tags (Thrift and Protocol Buffers) , Thrift and Protocol Buffers - Field tags and schema evolution
- file descriptors (Unix) , A uniform interface
- financial data , Advantages of immutable events
- Firebase (database) , API support for change streams
- Flink (processing framework) , Dataflow engines - Discussion of materialization
- dataflow APIs , High-Level APIs and Languages
- fault tolerance , Fault tolerance , Microbatching and checkpointing , Rebuilding state after a failure
- Gelly API (graph processing) , The Pregel processing model
- integration of batch and stream processing , Batch and Stream Processing , Unifying batch and stream processing
- machine learning , Specialization for different domains
- query optimizer , The move toward declarative query languages
- stream processing , Stream analytics
- flow control , Network congestion and queueing , Messaging Systems , Glossary
- FLP result (on consensus) , Distributed Transactions and Consensus
- FlumeJava (dataflow library) , MapReduce workflows , High-Level APIs and Languages
- followers , Leaders and Followers , Glossary
- ( see also leader-based replication)
- foreign keys , Comparison to document databases , Reduce-Side Joins and Grouping
- forward compatibility , Encoding and Evolution
- forward decay (algorithm) , Describing Performance
- Fossil (version control system) , Limitations of immutability
- shunning (deleting data) , Limitations of immutability
- FoundationDB (database)
- serializable transactions , Serializable Snapshot Isolation (SSI) , Performance of serializable snapshot isolation , Limitations of distributed transactions
- fractal trees , B-tree optimizations
- full table scans , Reduce-Side Joins and Grouping
- full-text search , Glossary
- and fuzzy indexes , Full-text search and fuzzy indexes
- building search indexes , Building search indexes
- Lucene storage engine , Making an LSM-tree out of SSTables
- functional reactive programming (FRP) , Designing Applications Around Dataflow
- functional requirements , Summary
- futures (asynchronous operations) , Current directions for RPC
- fuzzy search ( see similarity search)
G
- garbage collection
- immutability and , Limitations of immutability
- process pauses for , Describing Performance , Process Pauses - Limiting the impact of garbage collection , The Truth Is Defined by the Majority
- ( see also process pauses)
- genome analysis , Summary , Specialization for different domains
- geographically distributed datacenters , Distributed Data , Reading Your Own Writes , Unreliable Networks , The limits of total ordering
- geospatial indexes , Multi-column indexes
- Giraph (graph processing) , The Pregel processing model
- Git (version control system) , Custom conflict resolution logic , The causal order is not a total order , Limitations of immutability
- GitHub, postmortems , Leader failure: Failover , Leader failure: Failover , Mapping system models to the real world
- global indexes ( see term-partitioned indexes)
- GlusterFS (distributed filesystem) , MapReduce and Distributed Filesystems
- GNU Coreutils (Linux) , Sorting versus in-memory aggregation
- GoldenGate (change data capture) , Trigger-based replication , Multi-datacenter operation , Implementing change data capture
- ( see also Oracle)
- Google
- Bigtable (database)
- data model ( see Bigtable data model)
- partitioning scheme , Partitioning , Partitioning by Key Range
- storage layout , Making an LSM-tree out of SSTables
- Chubby (lock service) , Membership and Coordination Services
- Cloud Dataflow (stream processor) , Stream analytics , Atomic commit revisited , Unifying batch and stream processing
- ( see also Beam)
- Cloud Pub/Sub (messaging) , Message brokers compared to databases , Using logs for message storage
- Docs (collaborative editor) , Collaborative editing
- Dremel (query engine) , The divergence between OLTP databases and data warehouses , Column-Oriented Storage
- FlumeJava (dataflow library) , MapReduce workflows , High-Level APIs and Languages
- GFS (distributed file system) , MapReduce and Distributed Filesystems
- gRPC (RPC framework) , Current directions for RPC
- MapReduce (batch processing) , Batch Processing
- ( see also MapReduce)
- building search indexes , Building search indexes
- task preemption , Designing for frequent faults
- Pregel (graph processing) , The Pregel processing model
- Spanner ( see Spanner)
- TrueTime (clock API) , Clock readings have a confidence interval
- gossip protocol , Request Routing
- government use of data , Data as assets and power
- GPS (Global Positioning System)
- use for clock synchronization , Unreliable Clocks , Clock Synchronization and Accuracy , Clock readings have a confidence interval , Synchronized clocks for global snapshots
- GraphChi (graph processing) , Parallel execution
- graphs , Glossary
- as data models , Graph-Like Data Models - The Foundation: Datalog
- example of graph-structured data , Graph-Like Data Models
- property graphs , Property Graphs
- RDF and triple-stores , Triple-Stores and SPARQL - The SPARQL query language
- versus the network model , The SPARQL query language
- processing and analysis , Graphs and Iterative Processing - Parallel execution
- fault tolerance , Fault tolerance
- Pregel processing model , The Pregel processing model
- query languages
- Cypher , The Cypher Query Language
- Datalog , The Foundation: Datalog - The Foundation: Datalog
- recursive SQL queries , Graph Queries in SQL
- SPARQL , The SPARQL query language - The SPARQL query language
- Gremlin (graph query language) , Graph-Like Data Models
- grep (Unix tool) , Simple Log Analysis
- GROUP BY clause (SQL) , GROUP BY
- grouping records in MapReduce , GROUP BY
- handling skew , Handling skew
H
- Hadoop (data infrastructure)
- comparison to distributed databases , Batch Processing
- comparison to MPP databases , Comparing Hadoop to Distributed Databases - Designing for frequent faults
- comparison to Unix , Philosophy of batch process outputs - Philosophy of batch process outputs , Unbundling Databases
- diverse processing models in ecosystem , Diversity of processing models
- HDFS distributed filesystem ( see HDFS)
- higher-level tools , MapReduce workflows
- join algorithms , Reduce-Side Joins and Grouping - MapReduce workflows with map-side joins
- ( see also MapReduce)
- MapReduce ( see MapReduce)
- YARN ( see YARN)
- happens-before relationship , Ordering and Causality
- capturing , Capturing the happens-before relationship
- concurrency and , The “happens-before” relationship and concurrency
- hard disks
- access patterns , Advantages of LSM-trees
- detecting corruption , The end-to-end argument , Don’t just blindly trust what they promise
- faults in , Hardware Faults , Durability
- sequential write throughput , Hash Indexes , Disk space usage
- hardware faults , Hardware Faults
- hash indexes , Hash Indexes - Hash Indexes
- broadcast hash joins , Broadcast hash joins
- partitioned hash joins , Partitioned hash joins
- hash partitioning , Partitioning by Hash of Key - Partitioning by Hash of Key , Summary
- consistent hashing , Partitioning by Hash of Key
- problems with hash mod N , How not to do it: hash mod N
- range queries , Partitioning by Hash of Key
- suitable hash functions , Partitioning by Hash of Key
- with fixed number of partitions , Fixed number of partitions
- HAWQ (database) , Specialization for different domains
- HBase (database)
- bug due to lack of fencing , The leader and the lock
- bulk loading , Key-value stores as batch process output
- column-family data model , Data locality for queries , Column Compression
- dynamic partitioning , Dynamic partitioning
- key-range partitioning , Partitioning by Key Range
- log-structured storage , Making an LSM-tree out of SSTables
- request routing , Request Routing
- size-tiered compaction , Performance optimizations
- use of HDFS , Diversity of processing models
- use of ZooKeeper , Membership and Coordination Services
- HDFS (Hadoop Distributed File System) , MapReduce and Distributed Filesystems - MapReduce and Distributed Filesystems
- ( see also distributed filesystems)
- checking data integrity , Don’t just blindly trust what they promise
- decoupling from query engines , Diversity of processing models
- indiscriminately dumping data into , Diversity of storage
- metadata about datasets , MapReduce workflows with map-side joins
- NameNode , MapReduce and Distributed Filesystems
- use by Flink , Rebuilding state after a failure
- use by HBase , Dynamic partitioning
- use by MapReduce , MapReduce workflows
- HdrHistogram (numerical library) , Describing Performance
- head (Unix tool) , Simple Log Analysis
- head vertex (property graphs) , Property Graphs
- head-of-line blocking , Describing Performance
- heap files (databases) , Storing values within the index
- Helix (cluster manager) , Request Routing
- heterogeneous distributed transactions , Distributed Transactions in Practice , Limitations of distributed transactions
- heuristic decisions (in 2PC) , Recovering from coordinator failure
- Hibernate (object-relational mapper) , The Object-Relational Mismatch
- hierarchical model , Are Document Databases Repeating History?
- high availability ( see fault tolerance)
- high-frequency trading , Clock Synchronization and Accuracy , Limiting the impact of garbage collection
- high-performance computing (HPC) , Cloud Computing and Supercomputing
- hinted handoff , Sloppy Quorums and Hinted Handoff
- histograms , Describing Performance
- Hive (query engine) , Beyond MapReduce , High-Level APIs and Languages
- for data warehouses , The divergence between OLTP databases and data warehouses
- HCatalog and metastore , MapReduce workflows with map-side joins
- map-side joins , Broadcast hash joins
- query optimizer , The move toward declarative query languages
- skewed joins , Handling skew
- workflows , MapReduce workflows
- Hollerith machines , Batch Processing
- hopping windows (stream processing) , Types of windows
- ( see also windows)
- horizontal scaling ( see scaling out)
- HornetQ (messaging) , Message brokers , Message brokers compared to databases
- distributed transaction support , XA transactions
- hot spots , Partitioning of Key-Value Data
- due to celebrities , Skewed Workloads and Relieving Hot Spots
- for time-series data , Partitioning by Key Range
- in batch processing , Handling skew
- relieving , Skewed Workloads and Relieving Hot Spots
- hot standbys ( see leader-based replication)
- HTTP, use in APIs ( see services)
- human errors , Human Errors , Network Faults in Practice , Philosophy of batch process outputs
- HyperDex (database) , Multi-column indexes
- HyperLogLog (algorithm) , Stream analytics
I
- I/O operations, waiting for , Process Pauses
- IBM
- DB2 (database)
- distributed transaction support , XA transactions
- recursive query support , Graph Queries in SQL
- serializable isolation , Repeatable read and naming confusion , Implementation of two-phase locking
- XML and JSON support , The Object-Relational Mismatch , Convergence of document and relational databases
- electromechanical card-sorting machines , Batch Processing
- IMS (database) , Are Document Databases Repeating History?
- imperative query APIs , Declarative Queries on the Web
- InfoSphere Streams (CEP engine) , Complex event processing
- MQ (messaging) , Message brokers compared to databases
- distributed transaction support , XA transactions
- System R (database) , The Slippery Concept of a Transaction
- WebSphere (messaging) , Message brokers
- idempotence , The problems with remote procedure calls (RPCs) , Idempotence , Glossary
- by giving operations unique IDs , Operation identifiers , Multi-partition request processing
- idempotent operations , Exactly-once execution of an operation
- immutability
- advantages of , Advantages of immutable events , Designing for auditability
- deriving state from event log , State, Streams, and Immutability - Limitations of immutability
- for crash recovery , Hash Indexes
- in B-trees , B-tree optimizations , Indexes and snapshot isolation
- in event sourcing , Event Sourcing
- inputs to Unix commands , Transparency and experimentation
- limitations of , Limitations of immutability
- Impala (query engine)
- for data warehouses , The divergence between OLTP databases and data warehouses
- hash joins , Broadcast hash joins
- native code generation , The move toward declarative query languages
- use of HDFS , Diversity of processing models
- impedance mismatch , The Object-Relational Mismatch
- imperative languages , Query Languages for Data
- setting element styles (example) , Declarative Queries on the Web
- in doubt (transaction status) , Coordinator failure
- holding locks , Holding locks while in doubt
- orphaned transactions , Recovering from coordinator failure
- in-memory databases , Keeping everything in memory
- durability , Durability
- serial transaction execution , Actual Serial Execution
- incidents
- cascading failures , Software Errors
- crashes due to leap seconds , Clock Synchronization and Accuracy
- data corruption and financial losses due to concurrency bugs , Weak Isolation Levels
- data corruption on hard disks , Durability
- data loss due to last-write-wins , Converging toward a consistent state , Timestamps for ordering events
- data on disks unreadable , Mapping system models to the real world
- deleted items reappearing , Custom conflict resolution logic
- disclosure of sensitive data due to primary key reuse , Leader failure: Failover
- errors in transaction serializability , Maintaining integrity in the face of software bugs
- gigabit network interface with 1 Kb/s throughput , Summary
- network faults , Network Faults in Practice
- network interface dropping only inbound packets , Network Faults in Practice
- network partitions and whole-datacenter failures , Faults and Partial Failures
- poor handling of network faults , Network Faults in Practice
- sending message to ex-partner , Ordering events to capture causality
- sharks biting undersea cables , Network Faults in Practice
- split brain due to 1-minute packet delay , Leader failure: Failover , Network Faults in Practice
- vibrations in server rack , Describing Performance
- violation of uniqueness constraint , Maintaining integrity in the face of software bugs
- indexes , Data Structures That Power Your Database , Glossary
- and snapshot isolation , Indexes and snapshot isolation
- as derived data , Derived Data , Composing Data Storage Technologies - What’s missing?
- B-trees , B-Trees - B-tree optimizations
- building in batch processes , Building search indexes
- clustered , Storing values within the index
- comparison of B-trees and LSM-trees , Comparing B-Trees and LSM-Trees - Downsides of LSM-trees
- concatenated , Multi-column indexes
- covering (with included columns) , Storing values within the index
- creating , Creating an index
- full-text search , Full-text search and fuzzy indexes
- geospatial , Multi-column indexes
- hash , Hash Indexes - Hash Indexes
- index-range locking , Index-range locks
- multi-column , Multi-column indexes
- partitioning and secondary indexes , Partitioning and Secondary Indexes - Partitioning Secondary Indexes by Term , Summary
- secondary , Other Indexing Structures
- ( see also secondary indexes)
- problems with dual writes , Keeping Systems in Sync , Reasoning about dataflows
- SSTables and LSM-trees , SSTables and LSM-Trees - Performance optimizations
- updating when data changes , Keeping Systems in Sync , Maintaining materialized views
- Industrial Revolution , Remembering the Industrial Revolution
- InfiniBand (networks) , Can we not simply make network delays predictable?
- InfiniteGraph (database) , Graph-Like Data Models
- InnoDB (storage engine)
- clustered index on primary key , Storing values within the index
- not preventing lost updates , Automatically detecting lost updates
- preventing write skew , Characterizing write skew , Implementation of two-phase locking
- serializable isolation , Implementation of two-phase locking
- snapshot isolation support , Snapshot Isolation and Repeatable Read
- inside-out databases , Designing Applications Around Dataflow
- ( see also unbundling databases)
- integrating different data systems ( see data integration)
- integrity , Timeliness and Integrity
- coordination-avoiding data systems , Coordination-avoiding data systems
- correctness of dataflow systems , Correctness of dataflow systems
- in consensus formalization , Fault-Tolerant Consensus
- integrity checks , Don’t just blindly trust what they promise
- ( see also auditing)
- end-to-end , The end-to-end argument , The end-to-end argument again
- use of snapshot isolation , Snapshot Isolation and Repeatable Read
- maintaining despite software bugs , Maintaining integrity in the face of software bugs
- Interface Definition Language (IDL) , Thrift and Protocol Buffers , Avro
- intermediate state, materialization of , Materialization of Intermediate State - Discussion of materialization
- internet services, systems for implementing , Cloud Computing and Supercomputing
- invariants , Consistency
- ( see also constraints)
- inversion of control , Separation of logic and wiring
- IP (Internet Protocol)
- unreliability of , Cloud Computing and Supercomputing
- ISDN (Integrated Services Digital Network) , Synchronous Versus Asynchronous Networks
- isolation (in transactions) , Isolation , Single-Object and Multi-Object Operations , Glossary
- correctness and , Aiming for Correctness
- for single-object writes , Single-object writes
- serializability , Serializability - Performance of serializable snapshot isolation
- actual serial execution , Actual Serial Execution - Summary of serial execution
- serializable snapshot isolation (SSI) , Serializable Snapshot Isolation (SSI) - Performance of serializable snapshot isolation
- two-phase locking (2PL) , Two-Phase Locking (2PL) - Index-range locks
- violating , Single-Object and Multi-Object Operations
- weak isolation levels , Weak Isolation Levels - Materializing conflicts
- preventing lost updates , Preventing Lost Updates - Conflict resolution and replication
- read committed , Read Committed - Implementing read committed
- snapshot isolation , Snapshot Isolation and Repeatable Read - Repeatable read and naming confusion
- iterative processing , Graphs and Iterative Processing - Parallel execution
J
- Java Database Connectivity (JDBC)
- distributed transaction support , XA transactions
- network drivers , The Merits of Schemas
- Java Enterprise Edition (EE) , The problems with remote procedure calls (RPCs) , Introduction to two-phase commit , XA transactions
- Java Message Service (JMS) , Message brokers compared to databases
- ( see also messaging systems)
- comparison to log-based messaging , Logs compared to traditional messaging , Replaying old messages
- distributed transaction support , XA transactions
- message ordering , Acknowledgments and redelivery
- Java Transaction API (JTA) , Introduction to two-phase commit , XA transactions
- Java Virtual Machine (JVM)
- bytecode generation , The move toward declarative query languages
- garbage collection pauses , Process Pauses
- process reuse in batch processors , Dataflow engines
- JavaScript
- in MapReduce querying , MapReduce Querying
- setting element styles (example) , Declarative Queries on the Web
- use in advanced queries , MapReduce Querying
- Jena (RDF framework) , The RDF data model
- Jepsen (fault tolerance testing) , Aiming for Correctness
- jitter (network delay) , Network congestion and queueing
- joins , Glossary
- by index lookup , Reduce-Side Joins and Grouping
- expressing as relational operators , The move toward declarative query languages
- in relational and document databases , Many-to-One and Many-to-Many Relationships
- MapReduce map-side joins , Map-Side Joins - MapReduce workflows with map-side joins
- broadcast hash joins , Broadcast hash joins
- merge joins , Map-side merge joins
- partitioned hash joins , Partitioned hash joins
- MapReduce reduce-side joins , Reduce-Side Joins and Grouping - Handling skew
- handling skew , Handling skew
- sort-merge joins , Sort-merge joins
- parallel execution of , Comparing Hadoop to Distributed Databases
- secondary indexes and , Other Indexing Structures
- stream joins , Stream Joins - Time-dependence of joins
- stream-stream join , Stream-stream join (window join)
- stream-table join , Stream-table join (stream enrichment)
- table-table join , Table-table join (materialized view maintenance)
- time-dependence of , Time-dependence of joins
- support in document databases , Convergence of document and relational databases
- JOTM (transaction coordinator) , Introduction to two-phase commit
- JSON
- Avro schema representation , Avro
- binary variants , Binary encoding
- for application data, issues with , JSON, XML, and Binary Variants
- in relational databases , The Object-Relational Mismatch , Convergence of document and relational databases
- representing a résumé (example) , The Object-Relational Mismatch
- Juttle (query language) , Designing Applications Around Dataflow
K
- k-nearest neighbors , Specialization for different domains
- Kafka (messaging) , Message brokers , Using logs for message storage
- Kafka Connect (database integration) , API support for change streams , Deriving several views from the same event log
- Kafka Streams (stream processor) , Stream analytics , Maintaining materialized views
- fault tolerance , Rebuilding state after a failure
- leader-based replication , Leaders and Followers
- log compaction , Log compaction , Maintaining materialized views
- message offsets , Using logs for message storage , Idempotence
- request routing , Request Routing
- transaction support , Atomic commit revisited
- usage example , Thinking About Data Systems
- Ketama (partitioning library) , Partitioning proportionally to nodes
- key-value stores , Data Structures That Power Your Database
- as batch process output , Key-value stores as batch process output
- hash indexes , Hash Indexes - Hash Indexes
- in-memory , Keeping everything in memory
- partitioning , Partitioning of Key-Value Data - Skewed Workloads and Relieving Hot Spots
- by hash of key , Partitioning by Hash of Key , Summary
- by key range , Partitioning by Key Range , Summary
- dynamic partitioning , Dynamic partitioning
- skew and hot spots , Skewed Workloads and Relieving Hot Spots
- Kryo (Java) , Language-Specific Formats
- Kubernetes (cluster manager) , Designing for frequent faults , Separation of application code and state
L
- lambda architecture , The lambda architecture
- Lamport timestamps , Lamport timestamps
- Large Hadron Collider (LHC) , Summary
- last write wins (LWW) , Converging toward a consistent state , Implementing Linearizable Systems
- discarding concurrent writes , Last write wins (discarding concurrent writes)
- problems with , Timestamps for ordering events
- prone to lost updates , Conflict resolution and replication
- late binding , Separation of logic and wiring
- latency
- instability under two-phase locking , Performance of two-phase locking
- network latency and resource utilization , Can we not simply make network delays predictable?
- response time versus , Describing Performance
- tail latency , Describing Performance , Partitioning Secondary Indexes by Document
- leader-based replication , Leaders and Followers - Trigger-based replication
- ( see also replication)
- failover , Leader failure: Failover , The leader and the lock
- handling node outages , Handling Node Outages
- implementation of replication logs
- change data capture , Change Data Capture - API support for change streams
- ( see also changelogs)
- statement-based , Statement-based replication
- trigger-based replication , Trigger-based replication
- write-ahead log (WAL) shipping , Write-ahead log (WAL) shipping
- linearizability of operations , Implementing Linearizable Systems
- locking and leader election , Locking and leader election
- log sequence number , Setting Up New Followers , Consumer offsets
- read-scaling architecture , Problems with Replication Lag
- relation to consensus , Single-leader replication and consensus
- setting up new followers , Setting Up New Followers
- synchronous versus asynchronous , Synchronous Versus Asynchronous Replication - Synchronous Versus Asynchronous Replication
- leaderless replication , Leaderless Replication - Version vectors
- ( see also replication)
- detecting concurrent writes , Detecting Concurrent Writes - Version vectors
- capturing happens-before relationship , Capturing the happens-before relationship
- happens-before relationship and concurrency , The “happens-before” relationship and concurrency
- last write wins , Last write wins (discarding concurrent writes)
- merging concurrently written values , Merging concurrently written values
- version vectors , Version vectors
- multi-datacenter , Multi-datacenter operation
- quorums , Quorums for reading and writing - Limitations of Quorum Consistency
- consistency limitations , Limitations of Quorum Consistency - Monitoring staleness , Linearizability and quorums
- sloppy quorums and hinted handoff , Sloppy Quorums and Hinted Handoff
- read repair and anti-entropy , Read repair and anti-entropy
- leap seconds , Software Errors , Clock Synchronization and Accuracy
- in time-of-day clocks , Time-of-day clocks
- leases , Process Pauses
- implementation with ZooKeeper , Membership and Coordination Services
- need for fencing , The leader and the lock
- ledgers , Advantages of immutable events
- distributed ledger technologies , Tools for auditable data systems
- legacy systems, maintenance of , Maintainability
- less (Unix tool) , Transparency and experimentation
- LevelDB (storage engine) , Making an LSM-tree out of SSTables
- leveled compaction , Performance optimizations
- Levenshtein automata , Full-text search and fuzzy indexes
- limping (partial failure) , Summary
- linearizability , Linearizability - Linearizability and network delays , Glossary
- cost of , The Cost of Linearizability - Linearizability and network delays
- CAP theorem , The CAP theorem
- memory on multi-core CPUs , Linearizability and network delays
- definition , What Makes a System Linearizable? - What Makes a System Linearizable?
- implementing with total order broadcast , Implementing linearizable storage using total order broadcast
- in ZooKeeper , Membership and Coordination Services
- of derived data systems , Derived data versus distributed transactions , Timeliness and Integrity
- avoiding coordination , Coordination-avoiding data systems
- of different replication methods , Implementing Linearizable Systems - Linearizability and quorums
- using quorums , Linearizability and quorums
- relying on , Relying on Linearizability - Cross-channel timing dependencies
- constraints and uniqueness , Constraints and uniqueness guarantees
- cross-channel timing dependencies , Cross-channel timing dependencies
- locking and leader election , Locking and leader election
- stronger than causal consistency , Linearizability is stronger than causal consistency
- using to implement total order broadcast , Implementing total order broadcast using linearizable storage
- versus serializability , What Makes a System Linearizable?
- LinkedIn
- Azkaban (workflow scheduler) , MapReduce workflows
- Databus (change data capture) , Trigger-based replication , Implementing change data capture
- Espresso (database) , The Object-Relational Mismatch , But what is the writer’s schema? , Different values written at different times , Leaders and Followers , Request Routing
- Helix (cluster manager) ( see Helix)
- profile (example) , The Object-Relational Mismatch
- reference to company entity (example) , Many-to-One and Many-to-Many Relationships
- Rest.li (RPC framework) , Current directions for RPC
- Voldemort (database) ( see Voldemort)
- Linux, leap second bug , Software Errors , Clock Synchronization and Accuracy
- liveness properties , Safety and liveness
- LMDB (storage engine) , B-tree optimizations , Indexes and snapshot isolation
- load
- approaches to coping with , Approaches for Coping with Load
- describing , Describing Load
- load testing , Describing Performance
- load balancing (messaging) , Multiple consumers
- local indexes ( see document-partitioned indexes)
- locality (data access) , The Object-Relational Mismatch , Data locality for queries , Glossary
- in batch processing , Distributed execution of MapReduce , Example: analysis of user activity events , Dataflow engines
- in stateful clients , Clients with offline operation , Stateful, offline-capable clients
- in stream processing , Stream-table join (stream enrichment) , Rebuilding state after a failure , Stream processors and services , Uniqueness in log-based messaging
- location transparency , The problems with remote procedure calls (RPCs)
- in the actor model , Distributed actor frameworks
- locks , Glossary
- deadlock , Implementation of two-phase locking
- distributed locking , The leader and the lock - Fencing tokens , Locking and leader election
- fencing tokens , Fencing tokens
- implementation with ZooKeeper , Membership and Coordination Services
- relation to consensus , Summary
- for transaction isolation
- in snapshot isolation , Implementing snapshot isolation
- in two-phase locking (2PL) , Two-Phase Locking (2PL) - Index-range locks
- making operations atomic , Atomic write operations
- performance , Performance of two-phase locking
- preventing dirty writes , Implementing read committed
- preventing phantoms with index-range locks , Index-range locks , Detecting writes that affect prior reads
- read locks (shared mode) , Implementing read committed , Implementation of two-phase locking
- shared mode and exclusive mode , Implementation of two-phase locking
- in two-phase commit (2PC)
- deadlock detection , Limitations of distributed transactions
- in-doubt transactions holding locks , Holding locks while in doubt
- materializing conflicts with , Materializing conflicts
- preventing lost updates by explicit locking , Explicit locking
- log sequence number , Setting Up New Followers , Consumer offsets
- logic programming languages , Designing Applications Around Dataflow
- logical clocks , Timestamps for ordering events , Sequence Number Ordering , Ordering events to capture causality
- for read-after-write consistency , Reading Your Own Writes
- logical logs , Logical (row-based) log replication
- logs (data structure) , Data Structures That Power Your Database , Glossary
- advantages of immutability , Advantages of immutable events
- compaction , Hash Indexes , Performance optimizations , Log compaction , State, Streams, and Immutability
- for stream operator state , Rebuilding state after a failure
- creating using total order broadcast , Using total order broadcast
- implementing uniqueness constraints , Uniqueness in log-based messaging
- log-based messaging , Partitioned Logs - Replaying old messages
- comparison to traditional messaging , Logs compared to traditional messaging , Replaying old messages
- consumer offsets , Consumer offsets
- disk space usage , Disk space usage
- replaying old messages , Replaying old messages , Reprocessing data for application evolution , Unifying batch and stream processing
- slow consumers , When consumers cannot keep up with producers
- using logs for message storage , Using logs for message storage
- log-structured storage , Data Structures That Power Your Database - Performance optimizations
- log-structured merge tree ( see LSM-trees)
- replication , Leaders and Followers , Implementation of Replication Logs - Trigger-based replication
- change data capture , Change Data Capture - API support for change streams
- ( see also changelogs)
- coordination with snapshot , Setting Up New Followers
- logical (row-based) replication , Logical (row-based) log replication
- statement-based replication , Statement-based replication
- trigger-based replication , Trigger-based replication
- write-ahead log (WAL) shipping , Write-ahead log (WAL) shipping
- scalability limits , The limits of total ordering
- loose coupling , Separation of logic and wiring , Materialization of Intermediate State , Making unbundling work
- lost updates ( see updates)
- LSM-trees (indexes) , Making an LSM-tree out of SSTables - Performance optimizations
- comparison to B-trees , Comparing B-Trees and LSM-Trees - Downsides of LSM-trees
- Lucene (storage engine) , Making an LSM-tree out of SSTables
- building indexes in batch processes , Building search indexes
- similarity search , Full-text search and fuzzy indexes
- Luigi (workflow scheduler) , MapReduce workflows
- LWW ( see last write wins)
M
- machine learning
- ethical considerations , Bias and discrimination
- ( see also ethics)
- iterative processing , Graphs and Iterative Processing
- models derived from training data , Application code as a derivation function
- statistical and numerical algorithms , Specialization for different domains
- MADlib (machine learning toolkit) , Specialization for different domains
- magic scaling sauce , Approaches for Coping with Load
- Mahout (machine learning toolkit) , Specialization for different domains
- maintainability , Maintainability - Evolvability: Making Change Easy , The Future of Data Systems
- defined , Summary
- design principles for software systems , Maintainability
- evolvability ( see evolvability)
- operability , Operability: Making Life Easy for Operations
- simplicity and managing complexity , Simplicity: Managing Complexity
- many-to-many relationships
- in document model versus relational model , Which data model leads to simpler application code?
- modeling as graphs , Graph-Like Data Models
- many-to-one and many-to-many relationships , Many-to-One and Many-to-Many Relationships - Many-to-One and Many-to-Many Relationships
- many-to-one relationships , Many-to-One and Many-to-Many Relationships
- MapReduce (batch processing) , Batch Processing , MapReduce Job Execution - MapReduce Job Execution
- accessing external services within job , Example: analysis of user activity events , Key-value stores as batch process output
- comparison to distributed databases
- designing for frequent faults , Designing for frequent faults
- diversity of processing models , Diversity of processing models
- diversity of storage , Diversity of storage
- comparison to stream processing , Processing Streams
- comparison to Unix , Philosophy of batch process outputs - Philosophy of batch process outputs
- disadvantages and limitations of , Beyond MapReduce
- fault tolerance , Bringing related data together in the same place , Philosophy of batch process outputs , Fault tolerance
- higher-level tools , MapReduce workflows , High-Level APIs and Languages
- implementation in Hadoop , Distributed execution of MapReduce - MapReduce workflows
- the shuffle , Distributed execution of MapReduce
- implementation in MongoDB , MapReduce Querying - MapReduce Querying
- machine learning , Specialization for different domains
- map-side processing , Map-Side Joins - MapReduce workflows with map-side joins
- broadcast hash joins , Broadcast hash joins
- merge joins , Map-side merge joins
- partitioned hash joins , Partitioned hash joins
- mapper and reducer functions , MapReduce Job Execution
- materialization of intermediate state , Materialization of Intermediate State - Discussion of materialization
- output of batch workflows , The Output of Batch Workflows - Key-value stores as batch process output
- building search indexes , Building search indexes
- key-value stores , Key-value stores as batch process output
- reduce-side processing , Reduce-Side Joins and Grouping - Handling skew
- analysis of user activity events (example) , Example: analysis of user activity events
- grouping records by same key , GROUP BY
- handling skew , Handling skew
- sort-merge joins , Sort-merge joins
- workflows , MapReduce workflows
- marshalling ( see encoding)
- massively parallel processing (MPP) , Parallel Query Execution
- comparison to composing storage technologies , Unbundled versus integrated systems
- comparison to Hadoop , Comparing Hadoop to Distributed Databases - Designing for frequent faults , The move toward declarative query languages
- master-master replication ( see multi-leader replication)
- master-slave replication ( see leader-based replication)
- materialization , Glossary
- aggregate values , Aggregation: Data Cubes and Materialized Views
- conflicts , Materializing conflicts
- intermediate state (batch processing) , Materialization of Intermediate State - Discussion of materialization
- materialized views , Aggregation: Data Cubes and Materialized Views
- as derived data , Derived Data , Composing Data Storage Technologies - What’s missing?
- maintaining, using stream processing , Maintaining materialized views , Table-table join (materialized view maintenance)
- Maven (Java build tool) , The move toward declarative query languages
- Maxwell (change data capture) , Implementing change data capture
- mean , Describing Performance
- media monitoring , Search on streams
- median , Describing Performance
- meeting room booking (example) , More examples of write skew , Predicate locks , Enforcing Constraints
- membership services , Membership services
- Memcached (caching server) , Thinking About Data Systems , Keeping everything in memory
- memory
- in-memory databases , Keeping everything in memory
- durability , Durability
- serial transaction execution , Actual Serial Execution
- in-memory representation of data , Formats for Encoding Data
- random bit-flips in , Trust, but Verify
- use by indexes , Hash Indexes , SSTables and LSM-Trees
- memory barrier (CPU instruction) , Linearizability and network delays
- MemSQL (database)
- in-memory storage , Keeping everything in memory
- read committed isolation , Implementing read committed
- memtable (in LSM-trees) , Constructing and maintaining SSTables
- Mercurial (version control system) , Limitations of immutability
- merge joins, MapReduce map-side , Map-side merge joins
- mergeable persistent data structures , Custom conflict resolution logic
- merging sorted files , SSTables and LSM-Trees , Distributed execution of MapReduce , Sort-merge joins
- Merkle trees , Tools for auditable data systems
- Mesos (cluster manager) , Designing for frequent faults , Separation of application code and state
- message brokers ( see messaging systems)
- message-passing , Message-Passing Dataflow - Distributed actor frameworks
- advantages over direct RPC , Message-Passing Dataflow
- distributed actor frameworks , Distributed actor frameworks
- evolvability , Distributed actor frameworks
- MessagePack (encoding format) , Binary encoding
- messages
- exactly-once semantics , Exactly-once message processing , Fault Tolerance
- loss of , Messaging Systems
- using total order broadcast , Total Order Broadcast
- messaging systems , Stream Processing - Replaying old messages
- ( see also streams)
- backpressure, buffering, or dropping messages , Messaging Systems
- brokerless messaging , Direct messaging from producers to consumers
- event logs , Partitioned Logs - Replaying old messages
- comparison to traditional messaging , Logs compared to traditional messaging , Replaying old messages
- consumer offsets , Consumer offsets
- replaying old messages , Replaying old messages , Reprocessing data for application evolution , Unifying batch and stream processing
- slow consumers , When consumers cannot keep up with producers
- message brokers , Message brokers - Acknowledgments and redelivery
- acknowledgements and redelivery , Acknowledgments and redelivery
- comparison to event logs , Logs compared to traditional messaging , Replaying old messages
- multiple consumers of same topic , Multiple consumers
- reliability , Messaging Systems
- uniqueness in log-based messaging , Uniqueness in log-based messaging
- Meteor (web framework) , API support for change streams
- microbatching , Microbatching and checkpointing , Batch and Stream Processing
- microservices , Dataflow Through Services: REST and RPC
- ( see also services)
- causal dependencies across services , The limits of total ordering
- loose coupling , Making unbundling work
- relation to batch/stream processors , Batch Processing , Stream processors and services
- Microsoft
- Azure Service Bus (messaging) , Message brokers compared to databases
- Azure Storage , Synchronous Versus Asynchronous Replication , MapReduce and Distributed Filesystems
- Azure Stream Analytics , Stream analytics
- DCOM (Distributed Component Object Model) , The problems with remote procedure calls (RPCs)
- MSDTC (transaction coordinator) , Introduction to two-phase commit
- Orleans ( see Orleans)
- SQL Server ( see SQL Server)
- migrating (rewriting) data , Schema flexibility in the document model , Different values written at different times , Deriving several views from the same event log , Reprocessing data for application evolution
- modulus operator (%) , How not to do it: hash mod N
- MongoDB (database)
- aggregation pipeline , MapReduce Querying
- atomic operations , Atomic write operations
- BSON , Data locality for queries
- document data model , The Object-Relational Mismatch
- hash partitioning (sharding) , Partitioning by Hash of Key - Partitioning by Hash of Key
- key-range partitioning , Partitioning by Key Range
- lack of join support , Many-to-One and Many-to-Many Relationships , Convergence of document and relational databases
- leader-based replication , Leaders and Followers
- MapReduce support , MapReduce Querying , Distributed execution of MapReduce
- oplog parsing , Implementing change data capture , API support for change streams
- partition splitting , Dynamic partitioning
- request routing , Request Routing
- secondary indexes , Partitioning Secondary Indexes by Document
- Mongoriver (change data capture) , Implementing change data capture
- monitoring , Human Errors , Operability: Making Life Easy for Operations
- monotonic clocks , Monotonic clocks
- monotonic reads , Monotonic Reads
- MPP ( see massively parallel processing)
- MSMQ (messaging) , XA transactions
- multi-column indexes , Multi-column indexes
- multi-leader replication , Multi-Leader Replication - Multi-Leader Replication Topologies
- ( see also replication)
- handling write conflicts , Handling Write Conflicts
- conflict avoidance , Conflict avoidance
- converging toward a consistent state , Converging toward a consistent state
- custom conflict resolution logic , Custom conflict resolution logic
- determining what is a conflict , What is a conflict?
- linearizability, lack of , Implementing Linearizable Systems
- replication topologies , Multi-Leader Replication Topologies - Multi-Leader Replication Topologies
- use cases , Use Cases for Multi-Leader Replication
- clients with offline operation , Clients with offline operation
- collaborative editing , Collaborative editing
- multi-datacenter replication , Multi-datacenter operation , The Cost of Linearizability
- multi-object transactions , Single-Object and Multi-Object Operations
- need for , The need for multi-object transactions
- Multi-Paxos (total order broadcast) , Consensus algorithms and total order broadcast
- multi-table index cluster tables (Oracle) , Data locality for queries
- multi-tenancy , Network congestion and queueing
- multi-version concurrency control (MVCC) , Implementing snapshot isolation , Summary
- detecting stale MVCC reads , Detecting stale MVCC reads
- indexes and snapshot isolation , Indexes and snapshot isolation
- mutual exclusion , Pessimistic versus optimistic concurrency control
- ( see also locks)
- MySQL (database)
- binlog coordinates , Setting Up New Followers
- binlog parsing for change data capture , Implementing change data capture
- circular replication topology , Multi-Leader Replication Topologies
- consistent snapshots , Setting Up New Followers
- distributed transaction support , XA transactions
- InnoDB storage engine ( see InnoDB)
- JSON support , The Object-Relational Mismatch , Convergence of document and relational databases
- leader-based replication , Leaders and Followers
- performance of XA transactions , Distributed Transactions in Practice
- row-based replication , Logical (row-based) log replication
- schema changes in , Schema flexibility in the document model
- snapshot isolation support , Repeatable read and naming confusion
- ( see also InnoDB)
- statement-based replication , Statement-based replication
- Tungsten Replicator (multi-leader replication) , Multi-datacenter operation
- conflict detection , Multi-Leader Replication Topologies
N
- nanomsg (messaging library) , Direct messaging from producers to consumers
- Narayana (transaction coordinator) , Introduction to two-phase commit
- NATS (messaging) , Message brokers
- near-real-time (nearline) processing , Batch Processing
- ( see also stream processing)
- Neo4j (database)
- Cypher query language , The Cypher Query Language
- graph data model , Graph-Like Data Models
- Nephele (dataflow engine) , Dataflow engines
- netcat (Unix tool) , Separation of logic and wiring
- Netflix Chaos Monkey , Reliability , Network Faults in Practice
- Network Attached Storage (NAS) , Distributed Data , MapReduce and Distributed Filesystems
- network model , The network model
- graph databases versus , The SPARQL query language
- imperative query APIs , Declarative Queries on the Web
- Network Time Protocol ( see NTP)
- networks
- congestion and queueing , Network congestion and queueing
- datacenter network topologies , Cloud Computing and Supercomputing
- faults ( see faults)
- linearizability and network delays , Linearizability and network delays
- network partitions , Network Faults in Practice , The CAP theorem
- timeouts and unbounded delays , Timeouts and Unbounded Delays
- next-key locking , Index-range locks
- nodes (in graphs) ( see vertices)
- nodes (processes) , Glossary
- handling outages in leader-based replication , Handling Node Outages
- system models for failure , System Model and Reality
- noisy neighbors , Network congestion and queueing
- nonblocking atomic commit , Three-phase commit
- nondeterministic operations
- accidental nondeterminism , Fault tolerance
- partial failures in distributed systems , Faults and Partial Failures
- nonfunctional requirements , Summary
- nonrepeatable reads , Snapshot Isolation and Repeatable Read
- ( see also read skew)
- normalization (data representation) , Many-to-One and Many-to-Many Relationships , Glossary
- executing joins , Which data model leads to simpler application code? , Convergence of document and relational databases , Reduce-Side Joins and Grouping
- foreign key references , The need for multi-object transactions
- in systems of record , Derived Data
- versus denormalization , Deriving several views from the same event log
- NoSQL , The Birth of NoSQL , Unbundling Databases
- transactions and , The Slippery Concept of a Transaction
- Notation3 (N3) , Triple-Stores and SPARQL
- npm (package manager) , The move toward declarative query languages
- NTP (Network Time Protocol) , Unreliable Clocks
- accuracy , Clock Synchronization and Accuracy , Timestamps for ordering events
- adjustments to monotonic clocks , Monotonic clocks
- multiple server addresses , Weak forms of lying
- numbers, in XML and JSON encodings , JSON, XML, and Binary Variants
O
- object-relational mapping (ORM) frameworks , The Object-Relational Mismatch
- error handling and aborted transactions , Handling errors and aborts
- unsafe read-modify-write cycle code , Atomic write operations
- object-relational mismatch , The Object-Relational Mismatch
- observer pattern , Separation of application code and state
- offline systems , Batch Processing
- ( see also batch processing)
- stateful, offline-capable clients , Clients with offline operation , Stateful, offline-capable clients
- offline-first applications , Stateful, offline-capable clients
- offsets
- consumer offsets in partitioned logs , Consumer offsets
- messages in partitioned logs , Using logs for message storage
- OLAP (online analytic processing) , Transaction Processing or Analytics? , Glossary
- data cubes , Aggregation: Data Cubes and Materialized Views
- OLTP (online transaction processing) , Transaction Processing or Analytics? , Glossary
- analytics queries versus , The Output of Batch Workflows
- workload characteristics , Actual Serial Execution
- one-to-many relationships , The Object-Relational Mismatch
- JSON representation , The Object-Relational Mismatch
- online systems , Batch Processing
- ( see also services)
- Oozie (workflow scheduler) , MapReduce workflows
- OpenAPI (service definition format) , Web services
- OpenStack
- Nova (cloud infrastructure)
- use of ZooKeeper , Membership and Coordination Services
- Swift (object storage) , MapReduce and Distributed Filesystems
- operability , Operability: Making Life Easy for Operations
- operating systems versus databases , Unbundling Databases
- operation identifiers , Operation identifiers , Multi-partition request processing
- operational transformation , Custom conflict resolution logic
- operators , Dataflow engines
- flow of data between , Graphs and Iterative Processing
- in stream processing , Processing Streams
- optimistic concurrency control , Pessimistic versus optimistic concurrency control
- Oracle (database)
- distributed transaction support , XA transactions
- GoldenGate (change data capture) , Trigger-based replication , Multi-datacenter operation , Implementing change data capture
- lack of serializability , Isolation
- leader-based replication , Leaders and Followers
- multi-table index cluster tables , Data locality for queries
- not preventing write skew , Characterizing write skew
- partitioned indexes , Partitioning Secondary Indexes by Term
- PL/SQL language , Pros and cons of stored procedures
- preventing lost updates , Automatically detecting lost updates
- read committed isolation , Implementing read committed
- Real Application Clusters (RAC) , Locking and leader election
- recursive query support , Graph Queries in SQL
- snapshot isolation support , Snapshot Isolation and Repeatable Read , Repeatable read and naming confusion
- TimesTen (in-memory database) , Keeping everything in memory
- WAL-based replication , Write-ahead log (WAL) shipping
- XML support , The Object-Relational Mismatch
- ordering , Ordering Guarantees - Implementing total order broadcast using linearizable storage
- by sequence numbers , Sequence Number Ordering - Timestamp ordering is not sufficient
- causal ordering , Ordering and Causality - Capturing causal dependencies
- partial order , The causal order is not a total order
- limits of total ordering , The limits of total ordering
- total order broadcast , Total Order Broadcast - Implementing total order broadcast using linearizable storage
- Orleans (actor framework) , Distributed actor frameworks
- outliers (response time) , Describing Performance
- Oz (programming language) , Designing Applications Around Dataflow
P
- package managers , The move toward declarative query languages , Separation of application code and state
- packet switching , Can we not simply make network delays predictable?
- packets
- corruption of , Weak forms of lying
- sending via UDP , Direct messaging from producers to consumers
- PageRank (algorithm) , Graph-Like Data Models , Graphs and Iterative Processing
- paging ( see virtual memory)
- ParAccel (database) , The divergence between OLTP databases and data warehouses
- parallel databases ( see massively parallel processing)
- parallel execution
- of graph analysis algorithms , Parallel execution
- queries in MPP databases , Parallel Query Execution
- Parquet (data format) , Column-Oriented Storage , Archival storage
- ( see also column-oriented storage)
- use in Hadoop , Philosophy of batch process outputs
- partial failures , Faults and Partial Failures , Summary
- limping , Summary
- partial order , The causal order is not a total order
- partitioning , Partitioning - Summary , Glossary
- and replication , Partitioning and Replication
- in batch processing , Summary
- multi-partition operations , Multi-partition data processing
- enforcing constraints , Multi-partition request processing
- secondary index maintenance , Maintaining derived state
- of key-value data , Partitioning of Key-Value Data - Skewed Workloads and Relieving Hot Spots
- by key range , Partitioning by Key Range
- skew and hot spots , Skewed Workloads and Relieving Hot Spots
- rebalancing partitions , Rebalancing Partitions - Operations: Automatic or Manual Rebalancing
- automatic or manual rebalancing , Operations: Automatic or Manual Rebalancing
- problems with hash mod N , How not to do it: hash mod N
- using dynamic partitioning , Dynamic partitioning
- using fixed number of partitions , Fixed number of partitions
- using N partitions per node , Partitioning proportionally to nodes
- replication and , Distributed Data
- request routing , Request Routing - Parallel Query Execution
- secondary indexes , Partitioning and Secondary Indexes - Partitioning Secondary Indexes by Term
- document-based partitioning , Partitioning Secondary Indexes by Document
- term-based partitioning , Partitioning Secondary Indexes by Term
- serial execution of transactions and , Partitioning
- Paxos (consensus algorithm) , Consensus algorithms and total order broadcast
- ballot number , Epoch numbering and quorums
- Multi-Paxos (total order broadcast) , Consensus algorithms and total order broadcast
- percentiles , Describing Performance , Glossary
- calculating efficiently , Describing Performance
- importance of high percentiles , Describing Performance
- use in service level agreements (SLAs) , Describing Performance
- Percona XtraBackup (MySQL tool) , Setting Up New Followers
- performance
- describing , Describing Performance
- of distributed transactions , Distributed Transactions in Practice
- of in-memory databases , Keeping everything in memory
- of linearizability , Linearizability and network delays
- of multi-leader replication , Multi-datacenter operation
- perpetual inconsistency , Timeliness and Integrity
- pessimistic concurrency control , Pessimistic versus optimistic concurrency control
- phantoms (transaction isolation) , Phantoms causing write skew
- materializing conflicts , Materializing conflicts
- preventing, in serializability , Predicate locks
- physical clocks ( see clocks)
- pickle (Python) , Language-Specific Formats
- Pig (dataflow language) , Beyond MapReduce , High-Level APIs and Languages
- replicated joins , Broadcast hash joins
- skewed joins , Handling skew
- workflows , MapReduce workflows
- Pinball (workflow scheduler) , MapReduce workflows
- pipelined execution , Discussion of materialization
- in Unix , The Unix Philosophy
- point in time , Unreliable Clocks
- polyglot persistence , The Birth of NoSQL
- polystores , The meta-database of everything
- PostgreSQL (database)
- BDR (multi-leader replication) , Multi-datacenter operation
- causal ordering of writes , Multi-Leader Replication Topologies
- Bottled Water (change data capture) , Implementing change data capture
- Bucardo (trigger-based replication) , Trigger-based replication , Custom conflict resolution logic
- distributed transaction support , XA transactions
- foreign data wrappers , The meta-database of everything
- full text search support , Combining Specialized Tools by Deriving Data
- leader-based replication , Leaders and Followers
- log sequence number , Setting Up New Followers
- MVCC implementation , Implementing snapshot isolation , Indexes and snapshot isolation
- PL/pgSQL language , Pros and cons of stored procedures
- PostGIS geospatial indexes , Multi-column indexes
- preventing lost updates , Automatically detecting lost updates
- preventing write skew , Characterizing write skew , Serializable Snapshot Isolation (SSI)
- read committed isolation , Implementing read committed
- recursive query support , Graph Queries in SQL
- representing graphs , Property Graphs
- serializable snapshot isolation (SSI) , Serializable Snapshot Isolation (SSI)
- snapshot isolation support , Snapshot Isolation and Repeatable Read , Repeatable read and naming confusion
- WAL-based replication , Write-ahead log (WAL) shipping
- XML and JSON support , The Object-Relational Mismatch , Convergence of document and relational databases
- pre-splitting , Dynamic partitioning
- Precision Time Protocol (PTP) , Clock Synchronization and Accuracy
- predicate locks , Predicate locks
- predictive analytics , Predictive Analytics - Feedback loops
- amplifying bias , Bias and discrimination
- ethics of ( see ethics)
- feedback loops , Feedback loops
- preemption
- of datacenter resources , Designing for frequent faults
- of threads , Process Pauses
- Pregel processing model , The Pregel processing model
- primary keys , Other Indexing Structures , Glossary
- compound primary key (Cassandra) , Partitioning by Hash of Key
- primary-secondary replication ( see leader-based replication)
- privacy , Privacy and Tracking - Legislation and self-regulation
- consent and freedom of choice , Consent and freedom of choice
- data as assets and power , Data as assets and power
- deleting data , Limitations of immutability
- ethical considerations ( see ethics)
- legislation and self-regulation , Legislation and self-regulation
- meaning of , Privacy and use of data
- surveillance , Surveillance
- tracking behavioral data , Privacy and Tracking
- probabilistic algorithms , Describing Performance , Stream analytics
- process pauses , Process Pauses - Limiting the impact of garbage collection
- processing time (of events) , Reasoning About Time
- producers (message streams) , Transmitting Event Streams
- programming languages
- dataflow languages , Designing Applications Around Dataflow
- for stored procedures , Pros and cons of stored procedures
- functional reactive programming (FRP) , Designing Applications Around Dataflow
- logic programming , Designing Applications Around Dataflow
- Prolog (language) , The Foundation: Datalog
- ( see also Datalog)
- promises (asynchronous operations) , Current directions for RPC
- property graphs , Property Graphs
- Cypher query language , The Cypher Query Language
- Protocol Buffers (data format) , Thrift and Protocol Buffers - Datatypes and schema evolution
- field tags and schema evolution , Field tags and schema evolution
- provenance of data , Designing for auditability
- publish/subscribe model , Messaging Systems
- publishers (message streams) , Transmitting Event Streams
- punch card tabulating machines , Batch Processing
- pure functions , MapReduce Querying
- putting computation near data , Distributed execution of MapReduce
Q
- Qpid (messaging) , Message brokers compared to databases
- quality of service (QoS) , Can we not simply make network delays predictable?
- Quantcast File System (distributed filesystem) , MapReduce and Distributed Filesystems
- query languages , Query Languages for Data - MapReduce Querying
- aggregation pipeline , MapReduce Querying
- CSS and XSL , Declarative Queries on the Web
- Cypher , The Cypher Query Language
- Datalog , The Foundation: Datalog
- Juttle , Designing Applications Around Dataflow
- MapReduce querying , MapReduce Querying - MapReduce Querying
- recursive SQL queries , Graph Queries in SQL
- relational algebra and SQL , Query Languages for Data
- SPARQL , The SPARQL query language
- query optimizers , The relational model , The move toward declarative query languages
- queueing delays (networks) , Network congestion and queueing
- head-of-line blocking , Describing Performance
- latency and response time , Describing Performance
- queues (messaging) , Message brokers
- quorums , Quorums for reading and writing - Limitations of Quorum Consistency , Glossary
- for leaderless replication , Quorums for reading and writing
- in consensus algorithms , Epoch numbering and quorums
- limitations of consistency , Limitations of Quorum Consistency - Monitoring staleness , Linearizability and quorums
- making decisions in distributed systems , The Truth Is Defined by the Majority
- monitoring staleness , Monitoring staleness
- multi-datacenter replication , Multi-datacenter operation
- relying on durability , Mapping system models to the real world
- sloppy quorums and hinted handoff , Sloppy Quorums and Hinted Handoff
R
- R-trees (indexes) , Multi-column indexes
- RabbitMQ (messaging) , Message brokers , Message brokers compared to databases
- leader-based replication , Leaders and Followers
- race conditions , Isolation
- ( see also concurrency)
- avoiding with linearizability , Cross-channel timing dependencies
- caused by dual writes , Keeping Systems in Sync
- dirty writes , No dirty writes
- in counter increments , No dirty writes
- lost updates , Preventing Lost Updates - Conflict resolution and replication
- preventing with event logs , Concurrency control , Dataflow: Interplay between state changes and application code
- preventing with serializable isolation , Serializability
- write skew , Write Skew and Phantoms - Materializing conflicts
- Raft (consensus algorithm) , Consensus algorithms and total order broadcast
- sensitivity to network problems , Limitations of consensus
- term number , Epoch numbering and quorums
- use in etcd , Distributed Transactions and Consensus
- RAID (Redundant Array of Independent Disks) , Hardware Faults , MapReduce and Distributed Filesystems
- railways, schema migration on , Reprocessing data for application evolution
- RAMCloud (in-memory storage) , Keeping everything in memory
- ranking algorithms , Graphs and Iterative Processing
- RDF (Resource Description Framework) , The semantic web
- querying with SPARQL , The SPARQL query language
- RDMA (Remote Direct Memory Access) , Cloud Computing and Supercomputing
- read committed isolation level , Read Committed - Implementing read committed
- implementing , Implementing read committed
- multi-version concurrency control (MVCC) , Implementing snapshot isolation
- no dirty reads , No dirty reads
- no dirty writes , No dirty writes
- read path (derived data) , Observing Derived State
- read repair (leaderless replication) , Read repair and anti-entropy
- for linearizability , Linearizability and quorums
- read replicas ( see leader-based replication)
- read skew (transaction isolation) , Snapshot Isolation and Repeatable Read , Summary
- as violation of causality , Ordering and Causality
- read-after-write consistency , Reading Your Own Writes , Timeliness and Integrity
- cross-device , Reading Your Own Writes
- read-modify-write cycle , Preventing Lost Updates
- read-scaling architecture , Problems with Replication Lag
- reads as events , Reads are events too
- real-time
- collaborative editing , Collaborative editing
- near-real-time processing , Batch Processing
- ( see also stream processing)
- publish/subscribe dataflow , End-to-end event streams
- response time guarantees , Response time guarantees
- time-of-day clocks , Time-of-day clocks
- rebalancing partitions , Rebalancing Partitions - Operations: Automatic or Manual Rebalancing , Glossary
- ( see also partitioning)
- automatic or manual rebalancing , Operations: Automatic or Manual Rebalancing
- dynamic partitioning , Dynamic partitioning
- fixed number of partitions , Fixed number of partitions
- fixed number of partitions per node , Partitioning proportionally to nodes
- problems with hash mod N , How not to do it: hash mod N
- recency guarantee , Linearizability
- recommendation engines
- batch process outputs , Key-value stores as batch process output
- batch workflows , MapReduce workflows , Materialization of Intermediate State
- iterative processing , Graphs and Iterative Processing
- statistical and numerical algorithms , Specialization for different domains
- records , MapReduce Job Execution
- events in stream processing , Transmitting Event Streams
- recursive common table expressions (SQL) , Graph Queries in SQL
- redelivery (messaging) , Acknowledgments and redelivery
- Redis (database)
- atomic operations , Atomic write operations
- durability , Keeping everything in memory
- Lua scripting , Pros and cons of stored procedures
- single-threaded execution , Actual Serial Execution
- usage example , Thinking About Data Systems
- redundancy
- hardware components , Hardware Faults
- of derived data , Derived Data
- ( see also derived data)
- Reed–Solomon codes (error correction) , MapReduce and Distributed Filesystems
- refactoring , Evolvability: Making Change Easy
- ( see also evolvability)
- regions (partitioning) , Partitioning
- register (data structure) , What Makes a System Linearizable?
- relational data model , Relational Model Versus Document Model - Convergence of document and relational databases
- comparison to document model , Relational Versus Document Databases Today - Convergence of document and relational databases
- graph queries in SQL , Graph Queries in SQL
- in-memory databases with , Keeping everything in memory
- many-to-one and many-to-many relationships , Many-to-One and Many-to-Many Relationships
- multi-object transactions, need for , The need for multi-object transactions
- NoSQL as alternative to , The Birth of NoSQL
- object-relational mismatch , The Object-Relational Mismatch
- relational algebra and SQL , Query Languages for Data
- versus document model
- convergence of models , Convergence of document and relational databases
- data locality , Data locality for queries
- relational databases
- eventual consistency , Problems with Replication Lag
- history , Relational Model Versus Document Model
- leader-based replication , Leaders and Followers
- logical logs , Logical (row-based) log replication
- philosophy compared to Unix , Unbundling Databases , The meta-database of everything
- schema changes , Schema flexibility in the document model , Encoding and Evolution , Different values written at different times
- statement-based replication , Statement-based replication
- use of B-tree indexes , B-Trees
- relationships ( see edges)
- reliability , Reliability - How Important Is Reliability? , The Future of Data Systems
- building a reliable system from unreliable components , Cloud Computing and Supercomputing
- defined , Thinking About Data Systems , Summary
- hardware faults , Hardware Faults
- human errors , Human Errors
- importance of , How Important Is Reliability?
- of messaging systems , Messaging Systems
- software errors , Software Errors
- Remote Method Invocation (Java RMI) , The problems with remote procedure calls (RPCs)
- remote procedure calls (RPCs) , The problems with remote procedure calls (RPCs) - Data encoding and evolution for RPC
- ( see also services)
- based on futures , Current directions for RPC
- data encoding and evolution , Data encoding and evolution for RPC
- issues with , The problems with remote procedure calls (RPCs)
- using Avro , But what is the writer’s schema? , Current directions for RPC
- using Thrift , Current directions for RPC
- versus message brokers , Message-Passing Dataflow
- repeatable reads (transaction isolation) , Repeatable read and naming confusion
- replicas , Leaders and Followers
- replication , Replication - Summary , Glossary
- and durability , Durability
- chain replication , Synchronous Versus Asynchronous Replication
- conflict resolution and , Conflict resolution and replication
- consistency properties , Problems with Replication Lag - Solutions for Replication Lag
- consistent prefix reads , Consistent Prefix Reads
- monotonic reads , Monotonic Reads
- reading your own writes , Reading Your Own Writes
- in distributed filesystems , MapReduce and Distributed Filesystems
- leaderless , Leaderless Replication - Version vectors
- detecting concurrent writes , Detecting Concurrent Writes - Version vectors
- limitations of quorum consistency , Limitations of Quorum Consistency - Monitoring staleness , Linearizability and quorums
- sloppy quorums and hinted handoff , Sloppy Quorums and Hinted Handoff
- monitoring staleness , Monitoring staleness
- multi-leader , Multi-Leader Replication - Multi-Leader Replication Topologies
- across multiple datacenters , Multi-datacenter operation , The Cost of Linearizability
- handling write conflicts , Handling Write Conflicts - What is a conflict?
- replication topologies , Multi-Leader Replication Topologies - Multi-Leader Replication Topologies
- partitioning and , Distributed Data , Partitioning and Replication
- reasons for using , Distributed Data , Replication
- single-leader , Leaders and Followers - Trigger-based replication
- failover , Leader failure: Failover
- implementation of replication logs , Implementation of Replication Logs - Trigger-based replication
- relation to consensus , Single-leader replication and consensus
- setting up new followers , Setting Up New Followers
- synchronous versus asynchronous , Synchronous Versus Asynchronous Replication - Synchronous Versus Asynchronous Replication
- state machine replication , Using total order broadcast , Databases and Streams
- using erasure coding , MapReduce and Distributed Filesystems
- with heterogeneous data systems , Keeping Systems in Sync
- replication logs ( see logs)
- reprocessing data , Reprocessing data for application evolution , Unifying batch and stream processing
- ( see also evolvability)
- from log-based messaging , Replaying old messages
- request routing , Request Routing - Parallel Query Execution
- approaches to , Request Routing
- parallel query execution , Parallel Query Execution
- resilient systems , Reliability
- ( see also fault tolerance)
- response time
- as performance metric for services , Describing Performance , Batch Processing
- guarantees on , Response time guarantees
- latency versus , Describing Performance
- mean and percentiles , Describing Performance
- user experience , Describing Performance
- responsibility and accountability , Responsibility and accountability
- REST (Representational State Transfer) , Web services
- ( see also services)
- RethinkDB (database)
- document data model , The Object-Relational Mismatch
- dynamic partitioning , Dynamic partitioning
- join support , Many-to-One and Many-to-Many Relationships , Convergence of document and relational databases
- key-range partitioning , Partitioning by Key Range
- leader-based replication , Leaders and Followers
- subscribing to changes , API support for change streams
- Riak (database)
- Bitcask storage engine , Hash Indexes
- CRDTs , Custom conflict resolution logic , Merging concurrently written values
- dotted version vectors , Version vectors
- gossip protocol , Request Routing
- hash partitioning , Partitioning by Hash of Key - Partitioning by Hash of Key , Fixed number of partitions
- last-write-wins conflict resolution , Last write wins (discarding concurrent writes)
- leaderless replication , Leaderless Replication
- LevelDB storage engine , Making an LSM-tree out of SSTables
- linearizability, lack of , Linearizability and quorums
- multi-datacenter support , Multi-datacenter operation
- preventing lost updates across replicas , Conflict resolution and replication
- rebalancing , Operations: Automatic or Manual Rebalancing
- search feature , Partitioning Secondary Indexes by Term
- secondary indexes , Partitioning Secondary Indexes by Document
- siblings (concurrently written values) , Merging concurrently written values
- sloppy quorums , Sloppy Quorums and Hinted Handoff
- ring buffers , Disk space usage
- Ripple (cryptocurrency) , Tools for auditable data systems
- rockets , Human Errors , Are Document Databases Repeating History? , Byzantine Faults
- RocksDB (storage engine) , Making an LSM-tree out of SSTables
- leveled compaction , Performance optimizations
- rollbacks (transactions) , Transactions
- rolling upgrades , Hardware Faults , Encoding and Evolution
- routing ( see request routing)
- row-oriented storage , Column-Oriented Storage
- row-based replication , Logical (row-based) log replication
- rowhammer (memory corruption) , Trust, but Verify
- RPCs ( see remote procedure calls)
- Rubygems (package manager) , The move toward declarative query languages
- rules (Datalog) , The Foundation: Datalog
S
- safety and liveness properties , Safety and liveness
- in consensus algorithms , Fault-Tolerant Consensus
- in transactions , Transactions
- sagas ( see compensating transactions)
- Samza (stream processor) , Stream analytics , Maintaining materialized views
- fault tolerance , Rebuilding state after a failure
- streaming SQL support , Complex event processing
- sandboxes , Human Errors
- SAP HANA (database) , The divergence between OLTP databases and data warehouses
- scalability , Scalability - Approaches for Coping with Load , The Future of Data Systems
- approaches for coping with load , Approaches for Coping with Load
- defined , Summary
- describing load , Describing Load
- describing performance , Describing Performance
- partitioning and , Partitioning
- replication and , Problems with Replication Lag
- scaling up versus scaling out , Distributed Data
- scaling out , Approaches for Coping with Load , Distributed Data
- ( see also shared-nothing architecture)
- scaling up , Approaches for Coping with Load , Distributed Data
- scatter/gather approach, querying partitioned databases , Partitioning Secondary Indexes by Document
- SCD (slowly changing dimension) , Time-dependence of joins
- schema-on-read , Schema flexibility in the document model
- comparison to evolvable schema , The Merits of Schemas
- in distributed filesystems , Diversity of storage
- schema-on-write , Schema flexibility in the document model
- schemaless databases ( see schema-on-read)
- schemas , Glossary
- Avro , Avro - Code generation and dynamically typed languages
- reader determining writer’s schema , But what is the writer’s schema?
- schema evolution , The writer’s schema and the reader’s schema
- dynamically generated , Dynamically generated schemas
- evolution of , Reprocessing data for application evolution
- affecting application code , Encoding and Evolution
- compatibility checking , But what is the writer’s schema?
- in databases , Dataflow Through Databases - Archival storage
- in message-passing , Distributed actor frameworks
- in service calls , Data encoding and evolution for RPC
- flexibility in document model , Schema flexibility in the document model
- for analytics , Stars and Snowflakes: Schemas for Analytics - Stars and Snowflakes: Schemas for Analytics
- for JSON and XML , JSON, XML, and Binary Variants
- merits of , The Merits of Schemas
- schema migration on railways , Reprocessing data for application evolution
- Thrift and Protocol Buffers , Thrift and Protocol Buffers - Datatypes and schema evolution
- schema evolution , Field tags and schema evolution
- traditional approach to design, fallacy in , Deriving several views from the same event log
- searches
- building search indexes in batch processes , Building search indexes
- k-nearest neighbors , Specialization for different domains
- on streams , Search on streams
- partitioned secondary indexes , Partitioning and Secondary Indexes
- secondaries ( see leader-based replication)
- secondary indexes , Other Indexing Structures , Glossary
- partitioning , Partitioning and Secondary Indexes - Partitioning Secondary Indexes by Term , Summary
- document-partitioned , Partitioning Secondary Indexes by Document
- index maintenance , Maintaining derived state
- term-partitioned , Partitioning Secondary Indexes by Term
- problems with dual writes , Keeping Systems in Sync , Reasoning about dataflows
- updating, transaction isolation and , The need for multi-object transactions
- secondary sorts , Sort-merge joins
- sed (Unix tool) , Simple Log Analysis
- self-describing files , Code generation and dynamically typed languages
- self-joins , Summary
- self-validating systems , A culture of verification
- semantic web , The semantic web
- semi-synchronous replication , Synchronous Versus Asynchronous Replication
- sequence number ordering , Sequence Number Ordering - Timestamp ordering is not sufficient
- generators , Synchronized clocks for global snapshots , Noncausal sequence number generators
- insufficiency for enforcing constraints , Timestamp ordering is not sufficient
- Lamport timestamps , Lamport timestamps
- use of timestamps , Timestamps for ordering events , Synchronized clocks for global snapshots , Noncausal sequence number generators
- sequential consistency , Implementing linearizable storage using total order broadcast
- serializability , Isolation , Weak Isolation Levels , Serializability - Performance of serializable snapshot isolation , Glossary
- linearizability versus , What Makes a System Linearizable?
- pessimistic versus optimistic concurrency control , Pessimistic versus optimistic concurrency control
- serial execution , Actual Serial Execution - Summary of serial execution
- partitioning , Partitioning
- using stored procedures , Encapsulating transactions in stored procedures , Using total order broadcast
- serializable snapshot isolation (SSI) , Serializable Snapshot Isolation (SSI) - Performance of serializable snapshot isolation
- detecting stale MVCC reads , Detecting stale MVCC reads
- detecting writes that affect prior reads , Detecting writes that affect prior reads
- distributed execution , Performance of serializable snapshot isolation , Limitations of distributed transactions
- performance of SSI , Performance of serializable snapshot isolation
- preventing write skew , Decisions based on an outdated premise - Detecting writes that affect prior reads
- two-phase locking (2PL) , Two-Phase Locking (2PL) - Index-range locks
- index-range locks , Index-range locks
- performance , Performance of two-phase locking
- Serializable (Java) , Language-Specific Formats
- serialization , Formats for Encoding Data
- ( see also encoding)
- service discovery , Current directions for RPC , Request Routing , Service discovery
- using DNS , Request Routing , Service discovery
- service level agreements (SLAs) , Describing Performance
- service-oriented architecture (SOA) , Dataflow Through Services: REST and RPC
- ( see also services)
- services , Dataflow Through Services: REST and RPC - Data encoding and evolution for RPC
- microservices , Dataflow Through Services: REST and RPC
- causal dependencies across services , The limits of total ordering
- loose coupling , Making unbundling work
- relation to batch/stream processors , Batch Processing , Stream processors and services
- remote procedure calls (RPCs) , The problems with remote procedure calls (RPCs) - Data encoding and evolution for RPC
- issues with , The problems with remote procedure calls (RPCs)
- similarity to databases , Dataflow Through Services: REST and RPC
- web services , Web services , Current directions for RPC
- session windows (stream processing) , Types of windows
- ( see also windows)
- sessionization , GROUP BY
- sharding ( see partitioning)
- shared mode (locks) , Implementation of two-phase locking
- shared-disk architecture , Distributed Data , MapReduce and Distributed Filesystems
- shared-memory architecture , Distributed Data
- shared-nothing architecture , Approaches for Coping with Load , Distributed Data - Distributed Data , Glossary
- ( see also replication)
- distributed filesystems , MapReduce and Distributed Filesystems
- ( see also distributed filesystems)
- partitioning , Partitioning
- use of network , Unreliable Networks
- sharks
- biting undersea cables , Network Faults in Practice
- counting (example) , MapReduce Querying - MapReduce Querying
- finding (example) , Query Languages for Data
- website about (example) , Declarative Queries on the Web
- shredding (in relational model) , Which data model leads to simpler application code?
- siblings (concurrent values) , Merging concurrently written values , Conflict resolution and replication
- ( see also conflicts)
- similarity search
- edit distance , Full-text search and fuzzy indexes
- genome data , Summary
- k-nearest neighbors , Specialization for different domains
- single-leader replication ( see leader-based replication)
- single-threaded execution , Atomic write operations , Actual Serial Execution
- in batch processing , Bringing related data together in the same place , Dataflow engines , Parallel execution
- in stream processing , Logs compared to traditional messaging , Concurrency control , Uniqueness in log-based messaging
- size-tiered compaction , Performance optimizations
- skew , Glossary
- clock skew , Relying on Synchronized Clocks - Clock readings have a confidence interval , Implementing Linearizable Systems
- in transaction isolation
- read skew , Snapshot Isolation and Repeatable Read , Summary
- write skew , Write Skew and Phantoms - Materializing conflicts , Decisions based on an outdated premise - Detecting writes that affect prior reads
- ( see also write skew)
- meanings of , Snapshot Isolation and Repeatable Read
- unbalanced workload , Partitioning of Key-Value Data
- compensating for , Skewed Workloads and Relieving Hot Spots
- due to celebrities , Skewed Workloads and Relieving Hot Spots
- for time-series data , Partitioning by Key Range
- in batch processing , Handling skew
- slaves ( see leader-based replication)
- sliding windows (stream processing) , Types of windows
- ( see also windows)
- sloppy quorums , Sloppy Quorums and Hinted Handoff
- ( see also quorums)
- lack of linearizability , Implementing Linearizable Systems
- slowly changing dimension (data warehouses) , Time-dependence of joins
- smearing (leap seconds adjustments) , Clock Synchronization and Accuracy
- snapshots (databases)
- causal consistency , Ordering and Causality
- computing derived data , Creating an index
- in change data capture , Initial snapshot
- serializable snapshot isolation (SSI) , Serializable Snapshot Isolation (SSI) - Performance of serializable snapshot isolation , What Makes a System Linearizable?
- setting up a new replica , Setting Up New Followers
- snapshot isolation and repeatable read , Snapshot Isolation and Repeatable Read - Repeatable read and naming confusion
- implementing with MVCC , Implementing snapshot isolation
- indexes and MVCC , Indexes and snapshot isolation
- visibility rules , Visibility rules for observing a consistent snapshot
- synchronized clocks for global snapshots , Synchronized clocks for global snapshots
- snowflake schemas , Stars and Snowflakes: Schemas for Analytics
- SOAP , Web services
- ( see also services)
- evolvability , Data encoding and evolution for RPC
- software bugs , Software Errors
- maintaining integrity , Maintaining integrity in the face of software bugs
- solid state drives (SSDs)
- access patterns , Advantages of LSM-trees
- detecting corruption , The end-to-end argument , Don’t just blindly trust what they promise
- faults in , Durability
- sequential write throughput , Hash Indexes
- Solr (search server)
- building indexes in batch processes , Building search indexes
- document-partitioned indexes , Partitioning Secondary Indexes by Document
- request routing , Request Routing
- usage example , Thinking About Data Systems
- use of Lucene , Making an LSM-tree out of SSTables
- sort (Unix tool) , Simple Log Analysis , Sorting versus in-memory aggregation , The Unix Philosophy
- sort-merge joins (MapReduce) , Sort-merge joins
- Sorted String Tables ( see SSTables)
- sorting
- sort order in column storage , Sort Order in Column Storage
- source of truth ( see systems of record)
- Spanner (database)
- data locality , Data locality for queries
- snapshot isolation using clocks , Synchronized clocks for global snapshots
- TrueTime API , Clock readings have a confidence interval
- Spark (processing framework) , Dataflow engines - Discussion of materialization
- bytecode generation , The move toward declarative query languages
- dataflow APIs , High-Level APIs and Languages
- fault tolerance , Fault tolerance
- for data warehouses , The divergence between OLTP databases and data warehouses
- GraphX API (graph processing) , The Pregel processing model
- machine learning , Specialization for different domains
- query optimizer , The move toward declarative query languages
- Spark Streaming , Stream analytics
- microbatching , Microbatching and checkpointing
- stream processing on top of batch processing , Batch and Stream Processing
- SPARQL (query language) , The SPARQL query language
- spatial algorithms , Specialization for different domains
- split brain , Leader failure: Failover , Glossary
- in consensus algorithms , Distributed Transactions and Consensus , Single-leader replication and consensus
- preventing , Consistency and Consensus , Implementing Linearizable Systems
- using fencing tokens to avoid , The leader and the lock - Fencing tokens
- spreadsheets, dataflow programming capabilities , Designing Applications Around Dataflow
- SQL (Structured Query Language) , Simplicity: Managing Complexity , Relational Model Versus Document Model , Query Languages for Data
- advantages and limitations of , Diversity of processing models
- distributed query execution , MapReduce Querying
- graph queries in , Graph Queries in SQL
- isolation levels standard, issues with , Repeatable read and naming confusion
- query execution on Hadoop , Diversity of processing models
- résumé (example) , The Object-Relational Mismatch
- SQL injection vulnerability , Byzantine Faults
- SQL on Hadoop , The divergence between OLTP databases and data warehouses
- statement-based replication , Statement-based replication
- stored procedures , Pros and cons of stored procedures
- SQL Server (database)
- data warehousing support , The divergence between OLTP databases and data warehouses
- distributed transaction support , XA transactions
- leader-based replication , Leaders and Followers
- preventing lost updates , Automatically detecting lost updates
- preventing write skew , Characterizing write skew , Implementation of two-phase locking
- read committed isolation , Implementing read committed
- recursive query support , Graph Queries in SQL
- serializable isolation , Implementation of two-phase locking
- snapshot isolation support , Snapshot Isolation and Repeatable Read
- T-SQL language , Pros and cons of stored procedures
- XML support , The Object-Relational Mismatch
- SQLstream (stream analytics) , Complex event processing
- SSDs ( see solid state drives)
- SSTables (storage format) , SSTables and LSM-Trees - Performance optimizations
- advantages over hash indexes , SSTables and LSM-Trees
- concatenated index , Partitioning by Hash of Key
- constructing and maintaining , Constructing and maintaining SSTables
- making LSM-Tree from , Making an LSM-tree out of SSTables
- staleness (old data) , Reading Your Own Writes
- cross-channel timing dependencies , Cross-channel timing dependencies
- in leaderless databases , Writing to the Database When a Node Is Down
- in multi-version concurrency control , Detecting stale MVCC reads
- monitoring for , Monitoring staleness
- of client state , Pushing state changes to clients
- versus linearizability , Linearizability
- versus timeliness , Timeliness and Integrity
- standbys ( see leader-based replication)
- star replication topologies , Multi-Leader Replication Topologies
- star schemas , Stars and Snowflakes: Schemas for Analytics - Stars and Snowflakes: Schemas for Analytics
- similarity to event sourcing , Event Sourcing
- Star Wars analogy (event time versus processing time) , Event time versus processing time
- state
- derived from log of immutable events , State, Streams, and Immutability
- deriving current state from the event log , Deriving current state from the event log
- interplay between state changes and application code , Dataflow: Interplay between state changes and application code
- maintaining derived state , Maintaining derived state
- maintenance by stream processor in stream-stream joins , Stream-stream join (window join)
- observing derived state , Observing Derived State - Multi-partition data processing
- rebuilding after stream processor failure , Rebuilding state after a failure
- separation of application code and , Separation of application code and state
- state machine replication , Using total order broadcast , Databases and Streams
- statement-based replication , Statement-based replication
- statically typed languages
- analogy to schema-on-write , Schema flexibility in the document model
- code generation and , Code generation and dynamically typed languages
- statistical and numerical algorithms , Specialization for different domains
- StatsD (metrics aggregator) , Direct messaging from producers to consumers
- stdin, stdout , A uniform interface , Separation of logic and wiring
- Stellar (cryptocurrency) , Tools for auditable data systems
- stock market feeds , Direct messaging from producers to consumers
- STONITH (Shoot The Other Node In The Head) , Leader failure: Failover
- stop-the-world ( see garbage collection)
- storage
- composing data storage technologies , Composing Data Storage Technologies - What’s missing?
- diversity of, in MapReduce , Diversity of storage
- Storage Area Network (SAN) , Distributed Data , MapReduce and Distributed Filesystems
- storage engines , Storage and Retrieval - Summary
- column-oriented , Column-Oriented Storage - Writing to Column-Oriented Storage
- column compression , Column Compression - Memory bandwidth and vectorized processing
- defined , Column-Oriented Storage
- distinction between column families and , Column Compression
- Parquet , Column-Oriented Storage , Archival storage
- sort order in , Sort Order in Column Storage - Several different sort orders
- writing to , Writing to Column-Oriented Storage
- comparing requirements for transaction processing and analytics , Transaction Processing or Analytics? - Column-Oriented Storage
- in-memory storage , Keeping everything in memory
- durability , Durability
- row-oriented , Data Structures That Power Your Database - Keeping everything in memory
- B-trees , B-Trees - B-tree optimizations
- comparing B-trees and LSM-trees , Comparing B-Trees and LSM-Trees - Downsides of LSM-trees
- defined , Column-Oriented Storage
- log-structured , Hash Indexes - Performance optimizations
- stored procedures , Trigger-based replication , Encapsulating transactions in stored procedures - Pros and cons of stored procedures , Glossary
- and total order broadcast , Using total order broadcast
- pros and cons of , Pros and cons of stored procedures
- similarity to stream processors , Application code as a derivation function
- Storm (stream processor) , Stream analytics
- distributed RPC , Message passing and RPC , Multi-partition data processing
- Trident state handling , Idempotence
- straggler events , Knowing when you’re ready , The lambda architecture
- stream processing , Processing Streams - Summary , Glossary
- accessing external services within job , Stream-table join (stream enrichment) , Microbatching and checkpointing , Idempotence , Exactly-once execution of an operation
- combining with batch processing
- lambda architecture , The lambda architecture
- unifying technologies , Unifying batch and stream processing
- comparison to batch processing , Processing Streams
- complex event processing (CEP) , Complex event processing
- fault tolerance , Fault Tolerance - Rebuilding state after a failure
- atomic commit , Atomic commit revisited
- idempotence , Idempotence
- microbatching and checkpointing , Microbatching and checkpointing
- rebuilding state after a failure , Rebuilding state after a failure
- for data integration , Batch and Stream Processing - Unifying batch and stream processing
- maintaining derived state , Maintaining derived state
- maintenance of materialized views , Maintaining materialized views
- messaging systems ( see messaging systems)
- reasoning about time , Reasoning About Time - Types of windows
- event time versus processing time , Event time versus processing time , Microbatching and checkpointing , Unifying batch and stream processing
- knowing when window is ready , Knowing when you’re ready
- types of windows , Types of windows
- relation to databases ( see streams)
- relation to services , Stream processors and services
- search on streams , Search on streams
- single-threaded execution , Logs compared to traditional messaging , Concurrency control
- stream analytics , Stream analytics
- stream joins , Stream Joins - Time-dependence of joins
- stream-stream join , Stream-stream join (window join)
- stream-table join , Stream-table join (stream enrichment)
- table-table join , Table-table join (materialized view maintenance)
- time-dependence of , Time-dependence of joins
- streams , Stream Processing - Replaying old messages
- end-to-end, pushing events to clients , End-to-end event streams
- messaging systems ( see messaging systems)
- processing ( see stream processing)
- relation to databases , Databases and Streams - Limitations of immutability
- ( see also changelogs)
- API support for change streams , API support for change streams
- change data capture , Change Data Capture - API support for change streams
- derivative of state by time , State, Streams, and Immutability
- event sourcing , Event Sourcing - Commands and events
- keeping systems in sync , Keeping Systems in Sync - Keeping Systems in Sync
- philosophy of immutable events , State, Streams, and Immutability - Limitations of immutability
- topics , Transmitting Event Streams
- strict serializability , What Makes a System Linearizable?
- strong consistency ( see linearizability)
- strong one-copy serializability , What Makes a System Linearizable?
- subjects, predicates, and objects (in triple-stores) , Triple-Stores and SPARQL
- subscribers (message streams) , Transmitting Event Streams
- ( see also consumers)
- supercomputers , Cloud Computing and Supercomputing
- surveillance , Surveillance
- ( see also privacy)
- Swagger (service definition format) , Web services
- swapping to disk ( see virtual memory)
- synchronous networks , Synchronous Versus Asynchronous Networks , Glossary
- comparison to asynchronous networks , Synchronous Versus Asynchronous Networks
- formal model , System Model and Reality
- synchronous replication , Synchronous Versus Asynchronous Replication , Glossary
- chain replication , Synchronous Versus Asynchronous Replication
- conflict detection , Synchronous versus asynchronous conflict detection
- system models , Knowledge, Truth, and Lies , System Model and Reality - Mapping system models to the real world
- assumptions in , Trust, but Verify
- correctness of algorithms , Correctness of an algorithm
- mapping to the real world , Mapping system models to the real world
- safety and liveness , Safety and liveness
- systems of record , Derived Data , Glossary
- change data capture , Implementing change data capture , Reasoning about dataflows
- treating event log as , State, Streams, and Immutability
- systems thinking , Feedback loops
T
- t-digest (algorithm) , Describing Performance
- table-table joins , Table-table join (materialized view maintenance)
- Tableau (data visualization software) , Diversity of processing models
- tail (Unix tool) , Using logs for message storage
- tail vertex (property graphs) , Property Graphs
- Tajo (query engine) , The divergence between OLTP databases and data warehouses
- Tandem NonStop SQL (database) , Partitioning
- TCP (Transmission Control Protocol) , Cloud Computing and Supercomputing
- comparison to circuit switching , Can we not simply make network delays predictable?
- comparison to UDP , Network congestion and queueing
- connection failures , Detecting Faults
- flow control , Network congestion and queueing , Messaging Systems
- packet checksums , Weak forms of lying , The end-to-end argument , Trust, but Verify
- reliability and duplicate suppression , Duplicate suppression
- retransmission timeouts , Network congestion and queueing
- use for transaction sessions , Single-Object and Multi-Object Operations
- telemetry ( see monitoring)
- Teradata (database) , The divergence between OLTP databases and data warehouses , Partitioning
- term-partitioned indexes , Partitioning Secondary Indexes by Term , Summary
- termination (consensus) , Fault-Tolerant Consensus
- Terrapin (database) , Key-value stores as batch process output
- Tez (dataflow engine) , Dataflow engines - Discussion of materialization
- fault tolerance , Fault tolerance
- support by higher-level tools , High-Level APIs and Languages
- thrashing (out of memory) , Process Pauses
- threads (concurrency)
- actor model , Distributed actor frameworks , Message passing and RPC
- ( see also message-passing)
- atomic operations , Atomicity
- background threads , Hash Indexes , Downsides of LSM-trees
- execution pauses , Can we not simply make network delays predictable? , Process Pauses - Process Pauses
- memory barriers , Linearizability and network delays
- preemption , Process Pauses
- single ( see single-threaded execution)
- three-phase commit , Three-phase commit
- Thrift (data format) , Thrift and Protocol Buffers - Datatypes and schema evolution
- BinaryProtocol , Thrift and Protocol Buffers
- CompactProtocol , Thrift and Protocol Buffers
- field tags and schema evolution , Field tags and schema evolution
- throughput , Describing Performance , Batch Processing
- TIBCO , Message brokers
- Enterprise Message Service , Message brokers compared to databases
- StreamBase (stream analytics) , Complex event processing
- time
- concurrency and , The “happens-before” relationship and concurrency
- cross-channel timing dependencies , Cross-channel timing dependencies
- in distributed systems , Unreliable Clocks - Limiting the impact of garbage collection
- ( see also clocks)
- clock synchronization and accuracy , Clock Synchronization and Accuracy
- relying on synchronized clocks , Relying on Synchronized Clocks - Synchronized clocks for global snapshots
- process pauses , Process Pauses - Limiting the impact of garbage collection
- reasoning about, in stream processors , Reasoning About Time - Types of windows
- event time versus processing time , Event time versus processing time , Microbatching and checkpointing , Unifying batch and stream processing
- knowing when window is ready , Knowing when you’re ready
- timestamp of events , Whose clock are you using, anyway?
- types of windows , Types of windows
- system models for distributed systems , System Model and Reality
- time-dependence in stream joins , Time-dependence of joins
- time-of-day clocks , Time-of-day clocks
- timeliness , Timeliness and Integrity
- coordination-avoiding data systems , Coordination-avoiding data systems
- correctness of dataflow systems , Correctness of dataflow systems
- timeouts , Unreliable Networks , Glossary
- dynamic configuration of , Network congestion and queueing
- for failover , Leader failure: Failover
- length of , Timeouts and Unbounded Delays
- timestamps , Sequence Number Ordering
- assigning to events in stream processing , Whose clock are you using, anyway?
- for read-after-write consistency , Reading Your Own Writes
- for transaction ordering , Synchronized clocks for global snapshots
- insufficiency for enforcing constraints , Timestamp ordering is not sufficient
- key range partitioning by , Partitioning by Key Range
- Lamport , Lamport timestamps
- logical , Ordering events to capture causality
- ordering events , Timestamps for ordering events , Noncausal sequence number generators
- Titan (database) , Graph-Like Data Models
- tombstones , Hash Indexes , Merging concurrently written values , Log compaction
- topics (messaging) , Message brokers , Transmitting Event Streams
- total order , The causal order is not a total order , Glossary
- limits of , The limits of total ordering
- sequence numbers or timestamps , Sequence Number Ordering
- total order broadcast , Total Order Broadcast - Implementing total order broadcast using linearizable storage , The limits of total ordering , Uniqueness in log-based messaging
- consensus algorithms and , Consensus algorithms and total order broadcast - Epoch numbering and quorums
- implementation in ZooKeeper and etcd , Membership and Coordination Services
- implementing with linearizable storage , Implementing total order broadcast using linearizable storage
- using , Using total order broadcast
- using to implement linearizable storage , Implementing linearizable storage using total order broadcast
- tracking behavioral data , Privacy and Tracking
- ( see also privacy)
- transaction coordinator ( see coordinator)
- transaction manager ( see coordinator)
- transaction processing , Relational Model Versus Document Model , Transaction Processing or Analytics? - Stars and Snowflakes: Schemas for Analytics
- comparison to analytics , Transaction Processing or Analytics?
- comparison to data warehousing , The divergence between OLTP databases and data warehouses
- transactions , Transactions - Summary , Glossary
- ACID properties of , The Meaning of ACID
- atomicity , Atomicity
- consistency , Consistency
- durability , Durability
- isolation , Isolation
- compensating ( see compensating transactions)
- concept of , The Slippery Concept of a Transaction
- distributed transactions , Distributed Transactions and Consensus - Limitations of distributed transactions
- avoiding , Derived data versus distributed transactions , Making unbundling work , Enforcing Constraints - Coordination-avoiding data systems
- failure amplification , Limitations of distributed transactions , Maintaining derived state
- in doubt/uncertain status , Coordinator failure , Holding locks while in doubt
- two-phase commit , Atomic Commit and Two-Phase Commit (2PC) - Three-phase commit
- use of , Distributed Transactions in Practice - Exactly-once message processing
- XA transactions , XA transactions - Limitations of distributed transactions
- OLTP versus analytics queries , The Output of Batch Workflows
- purpose of , Transactions
- serializability , Serializability - Performance of serializable snapshot isolation
- actual serial execution , Actual Serial Execution - Summary of serial execution
- pessimistic versus optimistic concurrency control , Pessimistic versus optimistic concurrency control
- serializable snapshot isolation (SSI) , Serializable Snapshot Isolation (SSI) - Performance of serializable snapshot isolation
- two-phase locking (2PL) , Two-Phase Locking (2PL) - Index-range locks
- single-object and multi-object , Single-Object and Multi-Object Operations - Handling errors and aborts
- handling errors and aborts , Handling errors and aborts
- need for multi-object transactions , The need for multi-object transactions
- single-object writes , Single-object writes
- snapshot isolation ( see snapshots)
- weak isolation levels , Weak Isolation Levels - Materializing conflicts
- preventing lost updates , Preventing Lost Updates - Conflict resolution and replication
- read committed , Read Committed - Snapshot Isolation and Repeatable Read
- transitive closure (graph algorithm) , Graphs and Iterative Processing
- trie (data structure) , Full-text search and fuzzy indexes
- triggers (databases) , Trigger-based replication , Transmitting Event Streams
- implementing change data capture , Implementing change data capture
- implementing replication , Trigger-based replication
- triple-stores , Triple-Stores and SPARQL - The SPARQL query language
- SPARQL query language , The SPARQL query language
- tumbling windows (stream processing) , Types of windows
- ( see also windows)
- in microbatching , Microbatching and checkpointing
- tuple spaces (programming model) , Dataflow: Interplay between state changes and application code
- Turtle (RDF data format) , Triple-Stores and SPARQL
- Twitter
- constructing home timelines (example) , Describing Load , Deriving several views from the same event log , Table-table join (materialized view maintenance) , Materialized views and caching
- DistributedLog (event log) , Using logs for message storage
- Finagle (RPC framework) , Current directions for RPC
- Snowflake (sequence number generator) , Synchronized clocks for global snapshots
- Summingbird (processing library) , The lambda architecture
- two-phase commit (2PC) , Distributed Transactions and Consensus , Introduction to two-phase commit - Coordinator failure , Glossary
- confusion with two-phase locking , Introduction to two-phase commit
- coordinator failure , Coordinator failure
- coordinator recovery , Recovering from coordinator failure
- how it works , A system of promises
- issues in practice , Limitations of distributed transactions
- performance cost , Distributed Transactions in Practice
- transactions holding locks , Holding locks while in doubt
- two-phase locking (2PL) , Two-Phase Locking (2PL) - Index-range locks , What Makes a System Linearizable? , Glossary
- confusion with two-phase commit , Introduction to two-phase commit
- index-range locks , Index-range locks
- performance of , Performance of two-phase locking
- type checking, dynamic versus static , Schema flexibility in the document model
U
- UDP (User Datagram Protocol)
- comparison to TCP , Network congestion and queueing
- multicast , Direct messaging from producers to consumers
- unbounded datasets , Stream Processing , Glossary
- ( see also streams)
- unbounded delays , Glossary
- in networks , Timeouts and Unbounded Delays
- process pauses , Process Pauses
- unbundling databases , Unbundling Databases - Multi-partition data processing
- composing data storage technologies , Composing Data Storage Technologies - What’s missing?
- federation versus unbundling , The meta-database of everything
- need for high-level language , What’s missing?
- designing applications around dataflow , Designing Applications Around Dataflow - Stream processors and services
- observing derived state , Observing Derived State - Multi-partition data processing
- materialized views and caching , Materialized views and caching
- multi-partition data processing , Multi-partition data processing
- pushing state changes to clients , Pushing state changes to clients
- uncertain (transaction status) ( see in doubt)
- uniform consensus , Fault-Tolerant Consensus
- ( see also consensus)
- uniform interfaces , A uniform interface
- union type (in Avro) , Schema evolution rules
- uniq (Unix tool) , Simple Log Analysis
- uniqueness constraints
- asynchronously checked , Loosely interpreted constraints
- requiring consensus , Uniqueness constraints require consensus
- requiring linearizability , Constraints and uniqueness guarantees
- uniqueness in log-based messaging , Uniqueness in log-based messaging
- Unix philosophy , The Unix Philosophy - Transparency and experimentation
- command-line batch processing , Batch Processing with Unix Tools - Sorting versus in-memory aggregation
- Unix pipes versus dataflow engines , Discussion of materialization
- comparison to Hadoop , Philosophy of batch process outputs - Philosophy of batch process outputs
- comparison to relational databases , Unbundling Databases , The meta-database of everything
- comparison to stream processing , Processing Streams
- composability and uniform interfaces , The Unix Philosophy
- loose coupling , Separation of logic and wiring
- pipes , The Unix Philosophy
- relation to Hadoop , Unbundling Databases
- UPDATE statement (SQL) , Schema flexibility in the document model
- updates
- preventing lost updates , Preventing Lost Updates - Conflict resolution and replication
- atomic write operations , Atomic write operations
- automatically detecting lost updates , Automatically detecting lost updates
- compare-and-set operations , Compare-and-set
- conflict resolution and replication , Conflict resolution and replication
- using explicit locking , Explicit locking
- preventing write skew , Write Skew and Phantoms - Materializing conflicts
V
- validity (consensus) , Fault-Tolerant Consensus
- vBuckets (partitioning) , Partitioning
- vector clocks , Version vectors
- ( see also version vectors)
- vectorized processing , Memory bandwidth and vectorized processing , The move toward declarative query languages
- verification , Trust, but Verify - Tools for auditable data systems
- avoiding blind trust , Don’t just blindly trust what they promise
- culture of , A culture of verification
- designing for auditability , Designing for auditability
- end-to-end integrity checks , The end-to-end argument again
- tools for auditable data systems , Tools for auditable data systems
- version control systems, reliance on immutable data , Limitations of immutability
- version vectors , Multi-Leader Replication Topologies , Version vectors
- capturing causal dependencies , Capturing causal dependencies
- versus vector clocks , Version vectors
- Vertica (database) , The divergence between OLTP databases and data warehouses
- handling writes , Writing to Column-Oriented Storage
- replicas using different sort orders , Several different sort orders
- vertical scaling ( see scaling up)
- vertices (in graphs) , Graph-Like Data Models
- property graph model , Property Graphs
- Viewstamped Replication (consensus algorithm) , Consensus algorithms and total order broadcast
- view number , Epoch numbering and quorums
- virtual machines , Distributed Data
- ( see also cloud computing)
- context switches , Process Pauses
- network performance , Network congestion and queueing
- noisy neighbors , Network congestion and queueing
- reliability in cloud services , Hardware Faults
- virtualized clocks in , Clock Synchronization and Accuracy
- virtual memory
- process pauses due to page faults , Describing Performance , Process Pauses
- versus memory management by databases , Keeping everything in memory
- VisiCalc (spreadsheets) , Designing Applications Around Dataflow
- vnodes (partitioning) , Partitioning
- Voice over IP (VoIP) , Network congestion and queueing
- Voldemort (database)
- building read-only stores in batch processes , Key-value stores as batch process output
- hash partitioning , Partitioning by Hash of Key - Partitioning by Hash of Key , Fixed number of partitions
- leaderless replication , Leaderless Replication
- multi-datacenter support , Multi-datacenter operation
- rebalancing , Operations: Automatic or Manual Rebalancing
- reliance on read repair , Read repair and anti-entropy
- sloppy quorums , Sloppy Quorums and Hinted Handoff
- VoltDB (database)
- cross-partition serializability , Partitioning
- deterministic stored procedures , Pros and cons of stored procedures
- in-memory storage , Keeping everything in memory
- output streams , API support for change streams
- secondary indexes , Partitioning Secondary Indexes by Document
- serial execution of transactions , Actual Serial Execution
- statement-based replication , Statement-based replication , Rebuilding state after a failure
- transactions in stream processing , Atomic commit revisited
W
- WAL (write-ahead log) , Making B-trees reliable
- web services ( see services)
- Web Services Description Language (WSDL) , Web services
- webhooks , Direct messaging from producers to consumers
- webMethods (messaging) , Message brokers
- WebSocket (protocol) , Pushing state changes to clients
- windows (stream processing) , Stream analytics , Reasoning About Time - Types of windows
- infinite windows for changelogs , Maintaining materialized views , Stream-table join (stream enrichment)
- knowing when all events have arrived , Knowing when you’re ready
- stream joins within a window , Stream-stream join (window join)
- types of windows , Types of windows
- winners (conflict resolution) , Converging toward a consistent state
- WITH RECURSIVE syntax (SQL) , Graph Queries in SQL
- workflows (MapReduce) , MapReduce workflows
- outputs , The Output of Batch Workflows - Philosophy of batch process outputs
- key-value stores , Key-value stores as batch process output
- search indexes , Building search indexes
- with map-side joins , MapReduce workflows with map-side joins
- working set , Sorting versus in-memory aggregation
- write amplification , Advantages of LSM-trees
- write path (derived data) , Observing Derived State
- write skew (transaction isolation) , Write Skew and Phantoms - Materializing conflicts
- characterizing , Write Skew and Phantoms - Phantoms causing write skew , Decisions based on an outdated premise
- examples of , Write Skew and Phantoms , More examples of write skew
- materializing conflicts , Materializing conflicts
- occurrence in practice , Maintaining integrity in the face of software bugs
- phantoms , Phantoms causing write skew
- preventing
- in snapshot isolation , Decisions based on an outdated premise - Detecting writes that affect prior reads
- in two-phase locking , Predicate locks - Index-range locks
- options for , Characterizing write skew
- write-ahead log (WAL) , Making B-trees reliable , Write-ahead log (WAL) shipping
- writes (database)
- atomic write operations , Atomic write operations
- detecting writes affecting prior reads , Detecting writes that affect prior reads
- preventing dirty writes with read committed , No dirty writes
- WS-* framework , Web services
- ( see also services)
- WS-AtomicTransaction (2PC) , Introduction to two-phase commit
X
- XA transactions , Introduction to two-phase commit , XA transactions - Limitations of distributed transactions
- heuristic decisions , Recovering from coordinator failure
- limitations of , Limitations of distributed transactions
- xargs (Unix tool) , Simple Log Analysis , A uniform interface
- XML
- binary variants , Binary encoding
- encoding RDF data , The RDF data model
- for application data, issues with , JSON, XML, and Binary Variants
- in relational databases , The Object-Relational Mismatch , Convergence of document and relational databases
- XSL/XPath , Declarative Queries on the Web
Y
- Yahoo!
- Pistachio (database) , Deriving several views from the same event log
- Sherpa (database) , Implementing change data capture
- YARN (job scheduler) , Diversity of processing models , Separation of application code and state
- preemption of jobs , Designing for frequent faults
- use of ZooKeeper , Membership and Coordination Services
Z
- Zab (consensus algorithm) , Consensus algorithms and total order broadcast
- use in ZooKeeper , Distributed Transactions and Consensus
- ZeroMQ (messaging library) , Direct messaging from producers to consumers
- ZooKeeper (coordination service) , Membership and Coordination Services - Membership services
- generating fencing tokens , Fencing tokens , Using total order broadcast , Membership and Coordination Services
- linearizable operations , Implementing Linearizable Systems , Implementing linearizable storage using total order broadcast
- locks and leader election , Locking and leader election
- service discovery , Service discovery
- use for partition assignment , Request Routing , Allocating work to nodes
- use of Zab algorithm , Using total order broadcast , Distributed Transactions and Consensus , Consensus algorithms and total order broadcast
About the Author
Martin Kleppmann is a researcher in distributed systems at the University of Cambridge, UK. Previously he was a software engineer and entrepreneur at internet companies including LinkedIn and Rapportive, where he worked on large-scale data infrastructure. In the process he learned a few things the hard way, and he hopes this book will save you from repeating the same mistakes.
马丁·克莱普曼是英国剑桥大学的分布式系统研究员。此前,他曾在领英(LinkedIn)和 Rapportive 等互联网公司担任软件工程师和创业者,从事大规模数据基础设施的开发。在此过程中,他以惨痛的教训学到了不少东西,他希望这本书能帮助大家避免重蹈覆辙。
Martin is a regular conference speaker, blogger, and open source contributor. He believes that profound technical ideas should be accessible to everyone, and that deeper understanding will help us develop better software.
马丁经常在会议上演讲,也是博客作者和开源贡献者。他相信深刻的技术思想应当让所有人都能理解,而更深入的理解将帮助我们开发出更好的软件。
Colophon
The animal on the cover of Designing Data-Intensive Applications is an Indian wild boar ( Sus scrofa cristatus ), a subspecies of wild boar found in India, Myanmar, Nepal, Sri Lanka, and Thailand. They are distinctive from European boars in that they have higher back bristles, no woolly undercoat, and a larger, straighter skull.
《设计数据密集型应用》封面上的动物是印度野猪(Sus scrofa cristatus),一种分布在印度、缅甸、尼泊尔、斯里兰卡和泰国的野猪亚种。它们与欧洲野猪的不同之处在于背部鬃毛更高、没有绒毛状的底毛,并且头骨更大更直。
The Indian wild boar has a coat of gray or black hair, with stiff bristles running along the spine. Males have protruding canine teeth (called tushes) that are used to fight with rivals or fend off predators. Males are larger than females, but the species averages 33–35 inches tall at the shoulder and 200–300 pounds in weight. Their natural predators include bears, tigers, and various big cats.
印度野猪有一身灰色或黑色的毛,脊背上长着坚硬的鬃毛。雄性有突出的犬齿(称为獠牙),用于与对手搏斗或抵御捕食者。雄性比雌性大,但该物种平均肩高33-35英寸,体重200-300磅。它们的天敌包括熊、老虎和各种大型猫科动物。
These animals are nocturnal and omnivorous—they eat a wide variety of things, including roots, insects, carrion, nuts, berries, and small animals. Wild boars are also known to root through garbage and crop fields, causing a great deal of destruction and earning the enmity of farmers. They need to eat 4,000–4,500 calories a day. Boars have a well-developed sense of smell, which helps them forage for underground plant material and burrowing animals. However, their eyesight is poor.
这些动物是夜行性的杂食动物——它们吃各种各样的东西,包括植物根、昆虫、腐肉、坚果、浆果和小动物。众所周知,野猪还会在垃圾堆和农田里拱食,造成大量破坏,因而招致农民的敌视。它们每天需要摄入4,000-4,500卡路里。野猪嗅觉发达,有助于它们寻觅地下的植物和穴居动物,但视力很差。
Wild boars have long held significance in human culture. In Hindu lore, the boar is an avatar of the god Vishnu. In ancient Greek funerary monuments, it was a symbol of a gallant loser (in contrast to the victorious lion). Due to its aggression, it was depicted on the armor and weapons of Scandinavian, Germanic, and Anglo-Saxon warriors. In the Chinese zodiac, it symbolizes determination and impetuosity.
野猪在人类文化中长期具有重要意义。在印度教传说中,野猪是神祇毗湿奴的化身。在古希腊葬礼纪念碑上,野猪是有勇气的失败者的象征(与胜利的狮子形成对比)。由于其攻击性,它被描绘在斯堪的纳维亚、日耳曼和盎格鲁-撒克逊战士的盔甲和武器上。在中国的生肖中,它象征着决心和鲁莽。
Many of the animals on O’Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com .
O’Reilly 封面上的许多动物都濒临灭绝;它们对这个世界都很重要。要了解如何提供帮助,请访问 animals.oreilly.com。
The cover image is from Shaw’s Zoology . The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the font in diagrams is Adobe Myriad Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag’s Ubuntu Mono.
封面图片来自《肖氏动物学》(Shaw's Zoology)。封面字体为 URW Typewriter 和 Guardian Sans。正文字体为 Adobe Minion Pro;图表字体为 Adobe Myriad Pro;标题字体为 Adobe Myriad Condensed;代码字体为 Dalton Maag 的 Ubuntu Mono。