Part 0
- Preface
- Introduction
- Aiming for the Right Target
- Use the Force
- Quality of Life
- The Scope of the Challenge
- A Million Dollars Here, a Million There
- Pragmatic Architecture
Part I - Stability
- The Exception That Grounded an Airline
(an anecdotal story setting up problems whose solutions are discussed in the next chapter)
- 2.1 The Outage
- 2.2 Consequences
- 2.3 Post-mortem
- 2.4 The Smoking Gun
- 2.5 An Ounce of Prevention?
- Introducing Stability
(software must be cynical: it must expect bad things to happen and never be surprised when they do. failures cost business, money, and reputation.)
- 3.1 Defining Stability
(transient impulses and/or persistent stress produce strain, which can show up as increased RAM usage, high load, etc.)
- 3.2 Failure Modes
(*something* will fail first. that crack in the system can propagate and lead to a failure mode. systems always have a variety of failure modes. Pearson anecdote: think subpub... and then usercomposite... and then sms)
- 3.3 Cracks Propagate
- 3.4 Chain of Failure
- 3.5 Patterns and Antipatterns
- Stability Antipatterns
(antipatterns create and accelerate cracks in a system. this chapter discusses common antipatterns you should avoid.)
- 4.1 Integration Points
(integration points are the #1 killer of systems. counter these problems with the "Circuit Breaker" and "Decoupling Middleware" patterns. build a test harness that simulates problems in the 3rd-party system so you can check your own stability.)
- Socket Based Protocols
(refused connections, connections that are slow to be refused, slow reads. a slow response is often worse than no response.)
- HTTP Protocols
(the JDK's built-in URLConnection had NO read timeout before Java 5, which added setReadTimeout. jakarta commons HttpClient has one. always set a timeout.)
- Vendor APIs
(little to no control on these... treat them as suspect)
- 4.2 Chain Reactions
(a problem inherent in an app. one app in a cluster fails and the others must pick up the slack, which increases their likelihood of failing. avoid these with "Bulkheads")
- 4.3 Cascading Failures
(when problems in one layer cause problems in its callers. think database failure. avoid these with the "Circuit Breaker" and "Timeouts" patterns)
- 4.4 Users
(keep as little in sessions as possible to minimize memory consumption. use SoftReferences for large objects. identify your most expensive transactions.)
- 4.5 Blocked Threads
(you can get into a situation where every thread is blocked waiting for some impossible outcome. synchronize only when necessary. scrutinize resource pools. use timeouts.)
- 4.6 Attacks of Self-Denial
- 4.7 Scaling Effects
- 4.8 Unbalanced Capacities
(make sure a big cluster doesn't overload a smaller cluster it depends on)
- 4.9 Slow Responses
(track your own response time. consider failing fast when it exceeds your SLA.)
- 4.10 SLA Inversion
(don't offer a better SLA than any one of your dependencies!)
- 4.11 Unbounded Result Sets
(beware a result that is bigger than you can handle.)
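
The "always have a timeout" advice from the integration point and blocked thread notes can be sketched in Java roughly as follows. This is a minimal illustration, not code from the book: `TimedCall` and `callWithTimeout` are hypothetical names, and the remote call is assumed to be wrapped in a `Callable`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not from the book): bounds how long a caller waits on an integration point.
final class TimedCall {
    // Daemon threads so a hung remote call cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable, "timed-call");
        thread.setDaemon(true);
        return thread;
    });

    private TimedCall() {}

    // Runs remoteCall on a worker thread and waits at most timeoutMillis for the result.
    static <T> T callWithTimeout(Callable<T> remoteCall, long timeoutMillis) throws TimeoutException {
        Future<T> future = POOL.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker; the calling thread is free again
            throw e;
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("remote call failed", e.getCause());
        }
    }
}
```

The sketch uses an unbounded cached thread pool to stay short. A production version would probably bound the pool as well, since workers stuck on a dead socket would otherwise pile up.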
- Stability Patterns
- 5.1 Use Timeouts
(networks are unreliable. any resource pool that can block a thread must have a timeout. queue work for a slow retry later.)
- 5.2 Circuit Breaker
(detect problems and open the circuit. prevent the operation instead of executing it and having it fail)
- 5.3 Bulkheads
(protect critical clients by giving them their own pool of resources.)
- 5.4 Steady State
(don't fiddle around with a system. remember to purge junk data. cache carefully and remember invalidation.)
- 5.5 Fail Fast
(don't waste resources if you are just going to throw out the result. don't do useless work)
- 5.6 Handshaking
(can support throttling)
- 5.7 Test Harness
(emulate out-of-spec failures, stress the caller via slow responses or no responses)
- 5.8 Decoupling Middleware
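
The Circuit Breaker note ("detect problems, open the circuit, prevent the operation") might look something like the sketch below. It is a simplified illustration under stated assumptions: the class, the consecutive-failure threshold, the retry interval, and the injected clock are all choices made here, not details taken from the notes. The half-open trial state is part of the usual description of the pattern.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Simplified sketch: trips after N consecutive failures, rejects calls while open,
// and allows a trial call (half-open) once retryAfterMillis has elapsed.
final class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long retryAfterMillis;
    private final LongSupplier clock; // injected so the breaker can be tested without sleeping
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    CircuitBreaker(int failureThreshold, long retryAfterMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.retryAfterMillis = retryAfterMillis;
        this.clock = clock;
    }

    synchronized State state() {
        return state;
    }

    <T> T call(Supplier<T> operation) {
        beforeCall(); // throws without running the operation while the circuit is open
        try {
            T result = operation.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            throw e;
        }
    }

    private synchronized void beforeCall() {
        if (state != State.OPEN) {
            return;
        }
        if (clock.getAsLong() - openedAt < retryAfterMillis) {
            throw new IllegalStateException("circuit open: call rejected without executing");
        }
        state = State.HALF_OPEN; // enough time has passed: let a trial call through
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

A caller would typically construct it with `System::currentTimeMillis` as the clock and wrap each call to the integration point in `call(...)`. One simplification to be aware of: several threads can slip through together during the half-open state.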
- Stability Summary
Part II - Capacity
- Trampled by Your Own Customers
- 7.1 Countdown and Launch
(nice anecdotal story about a website crashing)
- 7.2 Aiming for QA
(don't build things to pass QA, build them to run in production)
- 7.3 Load Testing
(tough to know real prod traffic.)
- 7.4 Murder by the Masses
(the number of sessions killed the site. there were tons of bad clients.)
- 7.5 The Testing Gap
(test with stupid and ill-behaved clients!!)
- 7.6 Aftermath
(be able to black-list IPs. be able to "fail fast" and return a simple error page when overloaded.)
- Introducing Capacity
- 8.1 Defining Capacity
(performance: how fast the app handles one transaction, in isolation or under load. throughput: the number of transactions the app handles in a given timespan. capacity: the maximum throughput a system can sustain while maintaining an acceptable response time.)
- 8.2 Constraints
(one constraint determines capacity: whichever limiting factor hits its ceiling first. find your system's constraint. use it to plan capacity improvements.)
- 8.3 Interrelations
- 8.4 Scalability
(horizontal scaling vs vertical scaling)
- 8.5 Myths About Capacity
(cpu ain't cheap. storage ain't cheap. bandwidth ain't cheap. don't code like a jerk)
- 8.6 Summary
(capacity planning needs monitoring and optimization.)
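
One way to read the "one constraint determines capacity" note: if each limiting factor is expressed as the highest throughput it can sustain at an acceptable response time, the system's capacity is the smallest of those ceilings. The sketch below assumes exactly that simplification. `CapacityPlanner`, the resource names, and the numbers are made up for illustration.

```java
import java.util.Map;

// Hypothetical sketch: the constraint is the resource whose ceiling is hit first,
// i.e. the one with the lowest sustainable throughput.
final class CapacityPlanner {
    private CapacityPlanner() {}

    // ceilings maps a resource name to the max transactions/second it can sustain.
    static Map.Entry<String, Double> constraint(Map<String, Double> ceilings) {
        return ceilings.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .orElseThrow(() -> new IllegalArgumentException("no resources given"));
    }
}
```

With made-up ceilings of 400 tps for database connections, 650 for CPU, and 900 for bandwidth, the constraint is the database connections at 400 tps. Adding CPU would not raise capacity until that constraint moves.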
- Capacity Antipatterns
- 9.1 Resource Pool Contention
(contention for resources can cause your CPU to waste its time waiting. make your resource pool size equal to the number of threads.)
- 9.2 Excessive JSP Fragments
- 9.3 AJAX Overkill
(avoid needless requests. make sure each ajax request doesn't create a new session. minimize response sizes.)
- 9.4 Overstaying Sessions
(set session timeout to one standard deviation past the average think time. keep keys, not whole objects.)
- 9.5 Wasted Space in HTML
(omit needless characters. remove whitespace.)
- 9.6 The Reload Button
- 9.7 Handcrafted SQL
(don't do it, often it will be a non-indexed query. test queries against prod-like database sizes.)
- 9.8 Database Eutrophication
(archive and eliminate. don't mix transactions and reporting! create indexes.)
- 9.9 Integration Point Latency
- 9.10 Cookie Monsters
(don't trust cookies. minimize cookies, as they are sent with each http request. use cookies for identifiers, not entire objects.)
- 9.11 Summary
(stay in school, don't do drugs)
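
The session timeout rule from 9.4 (one standard deviation past the average think time) is simple arithmetic. A rough sketch, assuming think times are measured in seconds and using the population standard deviation; `SessionTimeout` is a hypothetical name and the sample values are invented:

```java
// Hypothetical sketch of the "average think time + one standard deviation" rule.
final class SessionTimeout {
    private SessionTimeout() {}

    static double suggestedTimeoutSeconds(double[] thinkTimesSeconds) {
        if (thinkTimesSeconds.length == 0) {
            throw new IllegalArgumentException("need at least one think-time sample");
        }
        double sum = 0.0;
        for (double t : thinkTimesSeconds) {
            sum += t;
        }
        double mean = sum / thinkTimesSeconds.length;

        double squaredDeviations = 0.0;
        for (double t : thinkTimesSeconds) {
            squaredDeviations += (t - mean) * (t - mean);
        }
        // Population standard deviation (divide by n); an assumption of this sketch.
        double stdDev = Math.sqrt(squaredDeviations / thinkTimesSeconds.length);

        return mean + stdDev;
    }
}
```

For the invented samples 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the standard deviation is 2, so the suggested timeout is 7 seconds. Real think times would be measured in minutes and taken from production logs.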
- Capacity Patterns
("we should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." balance small gains against complexity.)
- 10.1 Pool Connections
(connection pool size is critical. an undersized pool leads to resource pool contention; an oversized one can put excess stress on the DB. tune it. also, ensure timeouts.)
- 10.2 Use Caching Carefully
(monitor hit rates. if they are low, your caching isn't helping and may be hurting. don't leave cache size unbounded. build a flush mechanism.)
- 10.3 Precompute Content
(precompute pieces that change infrequently. serve that piece many times!)
- 10.4 Tune the Garbage Collector
(untuned apps can spend 10% of their time in garbage collection; tuning can reduce that to around 2%. use jconsole to show heap usage and time spent in garbage collection. adjust the relative sizes of the generations.)
- 10.5 Summary
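
The three caching notes from 10.2 (monitor hit rates, bound the size, build a flush mechanism) fit in one small class. The sketch below is illustrative only: `BoundedCache` is a hypothetical name, eviction is least-recently-used via `LinkedHashMap`, and loaders are assumed to return non-null values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: a size-bounded LRU cache that counts hits and misses.
final class BoundedCache<K, V> {
    private final Map<K, V> entries;
    private long hits;
    private long misses;

    BoundedCache(final int maxEntries) {
        // accessOrder=true makes iteration order least-recently-used first.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries; // evict once the bound is exceeded
            }
        };
    }

    // Returns the cached value, or loads and caches it on a miss.
    synchronized V get(K key, Function<K, V> loader) {
        V cached = entries.get(key);
        if (cached != null) {
            hits++;
            return cached;
        }
        misses++;
        V loaded = loader.apply(key);
        entries.put(key, loaded);
        return loaded;
    }

    // Flush mechanism: drop everything, e.g. after the underlying data changes.
    synchronized void flush() {
        entries.clear();
    }

    synchronized double hitRate() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

Exposing `hitRate()` through whatever monitoring is already in place makes the "is this cache helping" question answerable. A persistently low value suggests the cache is only costing memory.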
Part III - General Design Issues
- Networking
- 11.1 Multihomed Servers
(put admin and monitoring traffic on its own highly secure network. partition backup traffic onto its own network segment. bind apps to the non-admin network.)
- 11.2 Routing
- 11.3 Virtual IP Addresses
(can be moved from one NIC to another)
- Security
- 12.1 The Principle of Least Privilege
(a process should have the lowest level of privilege needed to accomplish its task. developers too! each major app should have its own user.)
- 12.2 Configured Passwords
(keep passwords separate from config files.)
- Availability
(do not divorce a Want from its Cost.)
- 13.1 Gathering Availability Requirements
(for uptime, weigh the actual cost against the avoided losses.)
- 13.2 Documenting Availability Requirements
(better to define the SLAs in terms of specific features. see the list of SLA bullet points)
- 13.3 Load Balancing
(DNS round-robin. reverse proxy and the X-Forwarded-For header. hardware like F5 BIG-IP.)
- 13.4 Clustering
(ok if there are no other alternatives for redundancy. doesn't really address scalability)
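
On the reverse proxy note: behind a proxy, the app sees the proxy's address as the remote address, and the original client is conventionally carried in the X-Forwarded-For header as a comma-separated list with the client first. A minimal sketch of reading it, where `ForwardedFor` and `clientAddress` are hypothetical names and the addresses are documentation examples:

```java
// Hypothetical sketch: recover the client address behind a reverse proxy.
final class ForwardedFor {
    private ForwardedFor() {}

    // xForwardedFor is the raw header value (may be null); remoteAddr is the socket peer.
    static String clientAddress(String xForwardedFor, String remoteAddr) {
        if (xForwardedFor == null) {
            return remoteAddr;
        }
        String[] hops = xForwardedFor.split(",");
        if (hops.length == 0) {
            return remoteAddr; // header contained only separators
        }
        String first = hops[0].trim(); // by convention the left-most entry is the client
        return first.isEmpty() ? remoteAddr : first;
    }
}
```

The header is supplied by whoever sent the request, so it is only trustworthy when the proxy in front of the app sets or overwrites it.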
- Administration
- 14.1 “Does QA Match Production?”
(keep apps separated via vms if need be. one-to-one vs one-to-many.)
- 14.2 Configuration Files
(keep production config away from plumbing config.)
- 14.3 Start-up and Shutdown
(build clean startup sequences and don't accept connections until startup is complete.)
- 14.4 Administrative Interfaces
(GUIs are intuitive, but suck to script and automate)
Part IV - Operations
- Phenomenal Cosmic Powers, Itty-Bitty Living Space
- 16.1 Peak Season
- 16.2 Baby’s First Christmas
- 16.3 Taking the Pulse
- 16.4 Thanksgiving Day
- 16.5 Black Friday
- 16.6 Vital Signs
- 16.7 Diagnostic Tests
- 16.8 Call in a Specialist
- 16.9 Compare Treatment Options
- 16.10 Does the Condition Respond to Treatment?
- 16.11 Winding Down
- Transparency
(transparency refers to the qualities that allow people to understand the system's historical trends and present state. component-level visibility is important.)
- 17.1 Perspectives
(different people need different perspectives: historical trending, predicting the future, present state, instantaneous behavior)
- 17.2 Designing for Transparency
(transparency needs to be built in.)
- 17.3 Enabling Technologies
(black-box sits outside. white-box sits inside and is integrated during development)
- 17.4 Logging
(logs are still invaluable. 'error' or 'severe' should require ops intervention. make logs human-readable in both format and content.)
- 17.5 Monitoring Systems
(want external monitoring in case your main process is hung.)
- 17.6 Standards, De Jure and De Facto
(SNMP, JMX, CIM)
- 17.7 Operations Database
(for metrics vs production data.)
- 17.8 Supporting Processes
(review problems weekly. solve the most time-consuming ones.)
- 17.9 Summary
(transparency can be the difference between a system that improves and a system that decays.)
- Adaptation
- 18.1 Adaptation Over Time
- 18.2 Adaptable Software Design
- 18.3 Adaptable Enterprise Architecture
- 18.4 Releases Shouldn’t Hurt
- 18.5 Summary