This is the fourth in a series of posts presenting arguments for asynchronous architectures as the optimal way to build high-performance, scalable systems for a distributed environment.
In a QCon conference presentation on "Availability and Consistency, or how the CAP theorem ruins it all", Werner Vogels, Amazon CTO, examines the tension between availability and consistency in large-scale distributed systems, and presents a model for reasoning about the trade-offs between different solutions.
I recommend you find time to watch the entire 52-minute video. The flash streaming technology that InfoQ uses is subject to buffering hiccups, and you may have to restart it a few times. So in case you want to jump to a specific section, I've assembled copies of Werner's slides, with short timestamped notes on the content of each section. Werner did not present his slides in their numbered order, so in my notes I identify slides using the numbers printed on them, not their presentation order.
0:50: CTOs must match business with technology. Most really big IT shops must push the edge of what commercial technology can do. Technology has a very long adoption cycle -- it takes about 10 to 15 years for new technology to mature and be effective. For leading companies like Amazon that's too slow; their scalability challenges are so great that they demand advanced solutions. So such shops are forced (in effect) to do their own research, taking advanced steps just to succeed in a competitive marketplace.
2:15: Werner noted that his viewpoint disagreed strongly with that of Cameron Purdy, CEO of Tangosol, who was an advocate of database technology.
3:00: He introduced Eric Brewer's CAP theorem -- more later. [See the end of my previous post in this series, Asynchronous Architectures. The CAP theorem was first propounded in a 1998 presentation -- Lessons from Internet Services: ACID vs. BASE -- by Dr. Eric Brewer of Inktomi, now a professor at UC Berkeley].
3:45: What is Scalability? [slide 2]
3:45: The meat of Werner's QCon presentation really begins here. Proportional is the key word in these definitions. Adding resources should deliver increased capacity proportional to the added resources. Or if the intent was to deliver better performance, the gains should be proportional to the added resources. Performance here is not just about response time; it could mean transferring more data or larger datasets.
4:40: Another reason for needing scalability is to achieve fault-tolerance. Adding resources to achieve redundancy should not hurt your performance. Traditional technologies (like databases) won't give you this kind of scalability, because overheads increase as you scale up. These are subjects I discussed at length when explaining The Parallelism Principle in High-Performance Client/Server:
13.1 The Parallelism Principle: Exploit parallel processing
Processing related pieces of work in parallel typically introduces additional synchronization overheads, and often introduces contention among the parallel streams. Use parallelism to overcome bottlenecks, provided the processing speedup offsets the additional costs introduced.
13.2 Scalability and speed up
Scalability refers to the capacity of the system to perform more total work in the same elapsed time, when its processing power is increased.
Speed up refers to the capacity of the system to perform a particular task in a shorter time, when its processing power is increased.
In a system with linear scalability and speed up, any increase in processing power generates a proportional improvement in throughput or response time.
--High-Performance Client/Server, Chapter 13, pp383-385.
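The Parallelism Principle quoted above can be put in toy numeric form. A minimal sketch (the function name and the overhead parameter are invented for illustration) of how a fixed per-pair synchronization cost turns linear scalability into sub-proportional gains:

```python
# Hypothetical illustration: linear vs. overhead-limited scaling.
# Assumes each added node contributes 1 unit of capacity but also
# incurs a pairwise coordination cost (a made-up parameter).

def effective_capacity(nodes: int, overhead_per_pair: float = 0.0) -> float:
    """Total useful capacity after subtracting coordination overhead.

    With overhead_per_pair = 0 scaling is perfectly linear; any
    positive overhead makes the gain sub-proportional, which is the
    effect the Parallelism Principle warns about.
    """
    raw = nodes * 1.0                                        # ideal, proportional capacity
    coordination = overhead_per_pair * nodes * (nodes - 1) / 2  # pairwise sync cost
    return max(0.0, raw - coordination)

# Linear case: doubling nodes doubles capacity.
assert effective_capacity(8) == 2 * effective_capacity(4)

# With synchronization overhead, 8 nodes deliver less than 2x of 4 nodes.
assert effective_capacity(8, 0.01) < 2 * effective_capacity(4, 0.01)
```

The same function shows why adding redundancy for fault-tolerance can quietly eat your performance gains if the coordination overhead is non-trivial.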
8:15: Scalability for Real Systems [slide 3]
7:00: Slide 3 is the conclusion of a cost discussion that begins before the slide is shown. The biggest threat to availability is bugs, which are a cost factor introduced by humans. So operating costs must not grow as you scale up.
8:45: But ... [slide 4]
8:30: Traditional technologies, databases, two-phase commit may work for 2-4 nodes, but they will not scale to hundreds (let alone 10,000) of nodes. You may not have 10,000 nodes like Amazon, but you will run into these scalability challenges at 50-100 nodes.
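A quick back-of-the-envelope sketch (not from the talk) of why agreement protocols stop scaling: a two-phase commit needs every participant up at once, so if each node is independently available with probability p, the whole transaction succeeds with probability p**n.

```python
# Why two-phase commit becomes an "unavailability algorithm" at scale:
# all n participants must be available simultaneously.

def commit_availability(p: float, n: int) -> float:
    """Probability that all n independent participants are up at once."""
    return p ** n

# 99.9% per-node availability looks great for a handful of nodes...
assert commit_availability(0.999, 2) > 0.998

# ...but across 100 nodes the transaction succeeds only ~90% of the time,
# and across 1000 nodes barely a third of the time.
assert commit_availability(0.999, 100) < 0.91
assert commit_availability(0.999, 1000) < 0.37
```

This is the arithmetic behind Werner's later remark that two-phase commit is guaranteed to fail as you scale up the number of participating services.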
10:05: Principles for Scalable Service Design [slide 13]
10:05: Guidelines for services design at Amazon -- a checklist of lessons learned through hard experience:
- 10:05: Decentralize. Any algorithm that requires agreement will eventually become a bottleneck. Two-phase commit is in effect an unavailability algorithm; it is guaranteed to fail as you scale up the number of participating services.
- 10:50: Asynchrony. Make progress under all circumstances, even if the world is burning around you. Even if fulfillment services are burning down, you want people to be able to place orders. So work locally, don't worry about the rest of the system.
- 11:40: Autonomy: Each node should be able to make decisions based only on local state. If you need to reach agreement based on global conditions at high load, you are lost. Nodes may be failing, coming up, going down all the time. Probabilistic techniques work well in these circumstances.
- 12:35: Controlled concurrency. Reduce concurrency as much as possible, so that you do not need to use locking.
- 13:15: Controlled parallelism. Control traffic going to each node using careful load balancing; nodes must have spare CPU and I/O capacity so that they can do other tasks (like load re-balancing) in the background.
- 14:10: Symmetry. Things work really well if all nodes do exactly the same thing. It is easy to add more nodes if nodes do not have to be identified as a directory node, a data-storage node, etc. Ideally, you just install the software and run it, and it responds to any client request and does whatever task is needed, or maybe forwards the request if necessary. This is the logical way to address a requirement that I first documented in a 1993 paper, and later included in Chapter 16, Architecture for High Performance, of my book:
For a large organization, moving to an enterprise client/server system represents a major shift from monolithic systems with fixed distribution to dynamic, heterogeneous, pervasively networked environments. The next generation of systems will be an order of magnitude more dynamic--always running, always changing--with thousands of machines in huge networks.
In such an environment, content components (service providers like DBMSs) and service consumers (e.g. GUIs) must be continually added and removed.
The key to doing this is the middle tier, the hub of the three tier architecture. In the first place, this central layer acts in a connecting role to let individual clients access multiple content servers, and (of course) servers support multiple clients. A separate central tier can also:
- Provide a set of services that can be dynamically invoked by both the consumer and content layers
- Allow new services to be added without major reconfiguration
- Allow services to be removed dynamically without affecting any participant not using those services
- Allow one service provider to be replaced by another
These are all vital characteristics in a distributed computing environment.
--High-Performance Client/Server, Chapter 16, pp514-515.
15:10: Algorithms that force you to obtain agreement will become a bottleneck. So avoid using two-phase commit, perhaps by denormalizing to make sure your transaction always runs within a single node, or by splitting your task into multi-transaction workflows. You have to take an end-to-end look at the business transaction and decide. [I have always advocated this approach -- see the conclusions of the second post in this series].
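A hedged sketch of the multi-transaction-workflow alternative to two-phase commit: each step commits locally and hands work to the next service through a durable queue, so no step ever waits on global agreement. All names here (place_order, reserve_stock, the in-memory stand-ins for the queue and the services' stores) are invented for illustration.

```python
from collections import deque

queue: deque = deque()           # stands in for a durable message queue
orders: dict = {}                # local state of the "orders" service
stock: dict = {"book": 5}        # local state of the "inventory" service

def place_order(order_id: str, item: str) -> None:
    """Step 1: commit locally, then enqueue work for the next service."""
    orders[order_id] = {"item": item, "status": "accepted"}
    queue.append(("reserve", order_id))

def reserve_stock() -> None:
    """Step 2: a separate local transaction, possibly run much later."""
    _, order_id = queue.popleft()
    item = orders[order_id]["item"]
    if stock.get(item, 0) > 0:
        stock[item] -= 1
        orders[order_id]["status"] = "reserved"
    else:
        orders[order_id]["status"] = "backordered"   # compensate, don't abort

place_order("o1", "book")
reserve_stock()
assert orders["o1"]["status"] == "reserved"
assert stock["book"] == 4
```

Note the design choice: a failure in step 2 is handled by a compensating status, not by rolling back step 1 -- the order was accepted and stays accepted.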
17:40: You can reuse some of those principles in building teams. Small teams are best, so that each team is responsible for a well-understood piece. Team effectiveness is just as important as architectural consistency.
19:00: We call this the two-pizza rule -- if you can't feed a team with two pizzas, it's too big. OK, hungry just-out-of-college students do eat more, but they work harder too! As soon as you need more than 8 people, it's hard to understand what everyone is doing. Bigger teams, of 12 or more, must have meetings, and must spend a much larger percentage of their time communicating. This discussion harks back to the famous observation by Fred Brooks:
The Mythical Man-Month
Men and months are interchangeable commodities only when a task can be partitioned among many workers with no communication among them...
In tasks that can be partitioned but which require communication among the subtasks, the effort of communication must be added to the amount of work to be done...
The added burden of communication is made up of two parts, training and intercommunication...
If each part of the task must be separately coordinated with each other part, the effort increases as n(n-1)/2. Three workers require three times as much pairwise intercommunication as two, four require six times as much as two. If, moreover, there need to be conferences among three, four, etc., workers to resolve things jointly, matters get worse yet. The added effort of communicating may fully counteract the division of the original task.
--The Mythical Man-Month, Frederick P. Brooks, 20th Anniversary Edition pp17-18, Addison Wesley, 1995.
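Brooks's n(n-1)/2 growth, stated as code (purely illustrative):

```python
def pairwise_channels(n: int) -> int:
    """Number of distinct worker pairs that may need to coordinate."""
    return n * (n - 1) // 2

# "Three workers require three times as much pairwise intercommunication
# as two, four require six times as much as two."
assert pairwise_channels(2) == 1
assert pairwise_channels(3) == 3 * pairwise_channels(2)
assert pairwise_channels(4) == 6 * pairwise_channels(2)

# A 12-person team has 66 channels -- eleven times as many as a
# 6-channel 4-person team, which is why the two-pizza rule bites.
assert pairwise_channels(12) == 66
```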
21:20: Not all of Amazon's services (1,000 or more) support Web interactions at Amazon.com. Many are back-end systems: supply chain, fulfillment, enterprise services, handling feeds from third-party suppliers, item management, recommendations, personalization.
22:15: Example of statistically improbable phrases (SIPs), an interesting digression about text analysis being implemented as yet another service. (Too long. The details are a plug for Amazon, but take time away from the main thread of the presentation).
24:30: Conclusion of the SIP discussion -- you need dependency management, and contracts. Servers can publish an SLA, based on workload conditions, that clients must honour. Automatic dependency discovery -- Amazon has home-grown tools that can show where dependencies exist, and the effects of failures in a network of nodes.
26:00: Scalability Through Smart System Engineering [slide 12]
26:00: Use scalable primitives. For example, RPC is not scalable. Don't conceal heterogeneity. We can pretend that systems don't fail, but in practice they do. That's the problem with RPC: it pretends to be a procedure call but it isn't, so the transparency does not really work. Failures do happen, and performance differences do exist -- so don't conceal these differences.
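A small sketch of the point about RPC transparency: a remote call is not a procedure call, so the caller should see timeouts and failures as normal outcomes rather than have them concealed behind a local-call illusion. The service function and the fallback value here are invented for illustration.

```python
def call_remote(simulate_failure: bool) -> str:
    """Stand-in for a network call that can fail or time out."""
    if simulate_failure:
        raise TimeoutError("remote service did not respond")
    return "ok"

def fetch_with_fallback(simulate_failure: bool) -> str:
    """The caller treats failure as an expected outcome and degrades
    gracefully, instead of pretending the call cannot fail."""
    try:
        return call_remote(simulate_failure)
    except TimeoutError:
        return "cached-default"   # serve stale/default data rather than block

assert fetch_with_fallback(False) == "ok"
assert fetch_with_fallback(True) == "cached-default"
```

The design point is that the failure-handling policy lives visibly in the caller, where it can differ per use, rather than invisibly in an RPC layer.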
28:00: Configuration management. If you have 1000 services, configuration becomes really important. If your applications depend on strong consistency properties, those dependencies can create problems when the people who understood them leave your team.
29:00: Repair and recovery. Check out the work on Recovery Oriented Computing (at Stanford and Berkeley) by Dave Patterson and others. Can you design services to restart fast, maybe by keeping log information? If that really works well, then you can design your entire system around the principles of recovery and restart. If you don't like the behavior or performance of a service, you can kill it and let it restart. If it can be functioning again in a minute, this systems design approach can be very effective.
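A toy, in-process sketch of the recovery-oriented idea: if a service can rebuild its entire state by replaying a log, the supervisor can simply kill it and relaunch it. The class and the in-memory "log" are invented for illustration.

```python
log: list = []      # durable append-only log (here: an in-memory list)

class CounterService:
    """A service whose whole state can be rebuilt by replaying the log."""

    def __init__(self) -> None:
        self.count = 0
        for entry in log:        # fast restart = replay the log
            self.count += entry

    def increment(self, n: int) -> None:
        log.append(n)            # write-ahead: log first...
        self.count += n          # ...then apply to in-memory state

svc = CounterService()
svc.increment(3)
svc.increment(4)

# "Kill" the service and restart it: state is recovered from the log.
svc = CounterService()
assert svc.count == 7
```

If restart really is this cheap, "kill it and let it restart" becomes a legitimate response to a misbehaving service, which is the systems-design approach Werner describes.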
30:40: CAP Conjecture [slide 8]
30:40: At Amazon, all data applications are dominated by this theorem. Traditional data applications assumed that if you stored something in a database it would never go away [Durability, the "D" in the ACID properties]. In reality, because of redundancy, many nodes can be working in parallel and storing information, and then bad things can happen.
32:00: A Clash of Cultures [slide 5]
32:00: There's nothing wrong with transactions; they create a nice clean programming paradigm, and are good for programmers. But you must design for failure cases, because transactions can fail. So the ACID properties are great if you can get them, but getting these guarantees is costly. It may be fine if only a single node accesses the data, but not if tens or hundreds of machines need access.
33:30: In that case another, more fuzzy, approach may be better -- BASE, in which data is basically available, more or less. Applications maintain a soft state, in which data is eventually consistent.
34:05: ACID vs BASE [unnumbered]
34:05: ACID takes a pessimistic approach: it will fail if it cannot reach the guarantees that you want. Availability is less important than consistency. For BASE systems, availability is the most important, and you are willing to sacrifice something to ensure it. So, for example, in a Web application you will design an application to accept and store shopping-cart input from a customer, and deal with minor problems in the data later. It's a weaker level of consistency, but you never want to tell the customer you can't accept their input.
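A sketch of the BASE trade-off for the shopping-cart example: every write is accepted locally, and divergent replicas are reconciled later by merging. Here the merge is a simple set union (similar in spirit to how Amazon's later Dynamo paper describes cart reconciliation); the function names are invented.

```python
def add_item(replica: set, item: str) -> None:
    """Always accept the customer's input locally; never reject it."""
    replica.add(item)

def reconcile(a: set, b: set) -> set:
    """Eventual consistency: merge divergent replicas. With a union,
    an 'add' is never lost -- matching 'never tell the customer no'."""
    return a | b

replica_1, replica_2 = set(), set()
add_item(replica_1, "book")          # customer's request hits one replica
add_item(replica_2, "lamp")          # a later request lands elsewhere

merged = reconcile(replica_1, replica_2)
assert merged == {"book", "lamp"}
```

The cost of this choice is visible too: a union can resurrect a deleted item, which is exactly the kind of "minor problem in the data" the application must be designed to deal with later.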
35:50: Why the Divide? [Slide 7]
35:50: CAP stands for Consistency, Availability, and Partition tolerance. Eric Brewer came up with the conjecture that systems can only possess two of these three characteristics, which was subsequently proved to be true (by Seth Gilbert and Nancy Lynch in 2002). That means systems designers must make choices; they must decide how to handle data reads and writes. If you insist on always enforcing consistency, your system may have to reject data interactions, making the system (in effect) unavailable at certain times. If you value availability and want to always accept user interactions, your application must then deal with the fact that some of its responses may later turn out to have been inconsistent.
38:45: Sometimes applications can deal with this. In the Web environment, the common technique of customer stickiness may help you to operate with lower levels of consistency. Once they have begun a session, customers are typically redirected to the same server cluster, or data center. So a local level of data consistency is sufficient; global consistency is unnecessary.
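The stickiness technique can be sketched in a few lines: hashing a stable session id always routes a customer to the same cluster, so only that cluster's data needs to be consistent for the duration of the session. The cluster names are made up for illustration.

```python
import hashlib

CLUSTERS = ["us-east", "us-west", "eu"]

def route(session_id: str) -> str:
    """Deterministic routing: the same session always lands on the
    same cluster, giving the session a single consistent view."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return CLUSTERS[digest[0] % len(CLUSTERS)]

# Repeated requests from one session hit one cluster...
assert route("session-42") == route("session-42")
# ...so local consistency there suffices; global consistency can lag.
assert route("session-42") in CLUSTERS
```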
39:30: Consistency and Availability [slide 9]
39:30: Many applications have a workflow behavior. First the customer interacts with the shopping cart. At this time, availability is the most important. After that you do all kinds of things with that data, and during those activities, consistency is the most important. Now, because you are not interacting with the customer directly, if you can't obtain consistency for one data item, you can move on to process a different data item and come back later. Then you get to the shipment and delivery phase, and at this time the database is mostly read only. A data architecture that forces you to use the same powerful tool -- like a big relational database -- for all these different activities is not ideal. If you select data storage solutions that are appropriate for each phase, it's easier to scale your solution.
44:00: Partition-Tolerance and Availability [slide 10]
44:00: It's hard to program for weaker levels of consistency. Amazon has developed some APIs, but Werner had no time to discuss these solutions. Slide 10 lists some examples, which he discussed briefly. The core design approach involves guaranteeing the durability of data inputs while relaxing consistency enforcement, then returning later to deal with any inconsistencies.
45:40: Techniques [slide 11]
45:40: Read the slide, because Werner does not talk about it!
We used to use a lot of DB technology at Amazon. It works really well, especially if most applications manipulate single data records using their primary keys. You can still create accessors that iterate over the entire database, but these should be relegated to a lower-priority, background status. The primary interfaces should offer only simple get/put accesses. In a production database that supports transactions, you don't need to also support queries, especially if the data is just XML text anyway. If engineers know what's inside those XML records, they may start coding against it!
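The "primary interfaces should offer only simple get/put" point can be sketched as a narrow key-value facade: primary-key access on the request path, with full scans deliberately relegated to a clearly marked background-only method. The class and method names are invented for illustration.

```python
from typing import Iterator, Optional, Tuple

class RecordStore:
    """A store whose public, scalable surface is get/put by key only."""

    def __init__(self) -> None:
        self._rows: dict = {}

    # --- primary, scalable interface: primary-key access only ---
    def put(self, key: str, value: str) -> None:
        self._rows[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._rows.get(key)

    # --- background-only: never on the request path ---
    def background_scan(self) -> Iterator[Tuple[str, str]]:
        """Full iteration, reserved for low-priority batch jobs."""
        return iter(self._rows.items())

store = RecordStore()
store.put("item-1", "<item><title>book</title></item>")   # payload stays opaque
assert store.get("item-1") == "<item><title>book</title></item>"
assert store.get("missing") is None
```

Keeping the payload an opaque string is deliberate: if the interface exposed the XML's internals, engineers would start coding against them.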
48:00: Whatever data management software you are running, databases that require high performance need specialists to configure, operate, and manage them. Engineers can't do it effectively; you need DBAs, even for very simple access patterns. [Guideline 19.5 in High-Performance Client/Server: Although DBMSs may offer similar features, implementations usually differ. Never assume that a design rule of thumb learned for one DBMS can be applied unchanged to another.]
50:30: What does this mean for the data architecture?
50:30: Again, read the slide, because in the edited presentation stream, Werner appears to be speaking to a different slide altogether. And then he ran out of time, so ...
... that's it. A really insightful and informative talk by Werner Vogels. No doubt his presentation could have been improved, given more time, or even just better use of the time available. But all the same, it is very stimulating (I think) and well worth several listens, until you grasp the central points -- all of which I agree with. In fact, Werner's conclusions circle back to the conclusion of my book, and my opening statement in the first post in this series:
Decoupled processes and multi-transaction workflows are the optimal starting point for the design of high-performance (distributed) systems.
Documenting this talk has been both educational and satisfying. But -- since I can't type nearly as quickly as Werner can talk -- I may have misquoted him somewhere. If you spot a mistake please let me know, and I'll correct it.
This series of posts contains some material first published in High-Performance Client/Server. My 1998 book is out of print now, and contains some outdated examples and references. But most of the discussions of performance principles are timeless, and you can pick up a used copy for about $3.00 at Amazon.
Tags: Werner Vogels, Amazon, QCon, distributed systems, asynchronous architecture, Web services, SOA, performance, scalability, synchronization, autonomy, multi-transaction, workflow, David Patterson, Recovery Oriented Computing, Fred Brooks, Mythical Man Month, Eric Brewer, ACID properties, two-phase commit, BASE, CAP theorem, performance matters