Part 0

Preface
Introduction
- Aiming for the Right Target
- Use the Force
- Quality of Life
- The Scope of the Challenge
- A Million Dollars Here, a Million There
- Pragmatic Architecture

Part I - Stability

The Exception That Grounded an Airline
(an anecdote that sets up the problems whose solutions are discussed in the chapters that follow)
- 2.1 The Outage
- 2.2 Consequences
- 2.3 Post-mortem
- 2.4 The Smoking Gun
- 2.5 An Ounce of Prevention?
Introducing Stability
(software must be cynical. it must expect bad things to happen and never be surprised when they do. failures cause loss of business and money and reputation)
- 3.1 Defining Stability
  (transient impulses and/or persistent stress manifest in strain, which could be increased RAM usage, high load, etc)
- 3.2 Failure Modes
  (*something* will fail first. that crack in the system can propagate and lead to a failure mode. systems always have a variety of failure modes. Pearson Anecdote: think subpub... and then usercomposite... and then sms)
- 3.3 Cracks Propagate
- 3.4 Chain of Failure
- 3.5 Patterns and Antipatterns
Stability Antipatterns
(antipatterns create, accelerate cracks in a system. this chapter discusses common antipatterns you should avoid.)
- 4.1 Integration Points
  (integration points are the #1 killer of systems. counter these problems with "Circuit Breaker" and "Decoupling Middleware" patterns. make a test harness to simulate problems in 3rd party system so you can check your stability.)
  - Socket Based Protocols
    (refused connections, slow-to-respond-refused-connections, slow reads. slow responses are worse than no response often)
  - HTTP Protocols
    (the jdk's built-in URL client offered NO read timeout before Java 5, which added URLConnection.setReadTimeout. jakarta commons HttpClient has one. always have a timeout.)
  - Vendor APIs
    (little to no control on these... treat them as suspect)
- 4.2 Chain Reactions
  (problem inherent in an app. one app in a cluster fails, and the others must pickup the slack.. increasing their likelihood of failing. avoid these with "Bulkheads")
- 4.3 Cascading Failures
  (when problems in one layer cause problems in callers. think database failure. avoid these with "Circuit Breaker" and "Timeouts" pattern)
- 4.4 Users
  (keep as little in sessions as possible to minimize memory consumption. use SoftReferences to large objects. identify your most expensive transactions. )
- 4.5 Blocked Threads
  (can get into a situation where every thread is blocked waiting for some impossible outcome. synchronize only when necessary. scrutinize resources pools. use timeouts.)
- 4.6 Attacks of Self-Denial
- 4.7 Scaling Effects
- 4.8 Unbalanced Capacities
  (make sure a big cluster doesnt overload a smaller cluster it is dependent upon)
- 4.9 Slow Responses
  (track your own response time.. consider failing-fast when it exceeds your SLA.)
- 4.10 SLA Inversion
  (dont offer a better SLA than one of your dependencies!)
- 4.11 Unbounded Result Sets
  (beware a result that is bigger than you can handle.)
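One way to guard against the unbounded result sets in 4.11 is to cap what the caller will accept and to notice when the cap was hit. This is a minimal sketch, not something from the book: `fetch_bounded` and `MAX_ROWS` are hypothetical names, and sqlite3 stands in for whatever database is actually in use.

```python
import sqlite3

MAX_ROWS = 100  # hypothetical ceiling; pick one the caller can really handle


def fetch_bounded(conn, query, params=(), max_rows=MAX_ROWS):
    """Return at most max_rows rows plus a flag saying whether more existed."""
    cursor = conn.execute(query, params)
    # Ask for one extra row so truncation can be detected without
    # pulling the whole result set into memory.
    rows = cursor.fetchmany(max_rows + 1)
    truncated = len(rows) > max_rows
    return rows[:max_rows], truncated


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?)", [(i,) for i in range(250)])
    rows, truncated = fetch_bounded(conn, "SELECT id FROM orders")
    print(len(rows), truncated)
```

The caller decides what to do when `truncated` is true (log it, page, or refuse), which is better than discovering the size of the result after it has already eaten the heap.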
Stability Patterns
- 5.1 Use Timeouts
  (networks are unreliable. any resource pool that can block a thread Must have a timeout. queue work for a slow retry later.)
- 5.2 Circuit Breaker
  (detect problems, open the circuit. prevent the operation instead of executing it and having it fail)
- 5.3 Bulkheads
  (protect critical clients by giving them their own pool of resources.)
- 5.4 Steady State
  (dont fiddle around with a system. remember to purge junk data. cache carefully and remember invalidation.)
- 5.5 Fail Fast
  (dont waste resources if you are just gonna throw out the result. dont do useless work)
- 5.6 Handshaking
  (can support throttling)
- 5.7 Test Harness
  (emulate out-of-spec failures, stress the caller via slow responses or no responses)
- 5.8 Decoupling Middleware
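The Circuit Breaker note in 5.2 can be sketched roughly as follows. This is an illustration under assumptions, not the book's implementation: the class name, the `failure_threshold` and `reset_after` parameters, and the injectable `clock` are all hypothetical choices made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to attempt the call."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # seconds to stay open
        self.clock = clock                          # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, operation, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: prevent the operation instead of letting it fail again.
                raise CircuitOpenError("circuit open; call not attempted")
            # Otherwise half-open: let a single trial call through.
        try:
            result = operation(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each integration point in its own breaker keeps one dead dependency from tying up every calling thread. A production version would also need locking for concurrent callers, which this sketch leaves out.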
Stability Summary

Part II - Capacity

Trampled by Your Own Customers
- 7.1 Countdown and Launch
  (nice anecdotal story about a website crashing)
- 7.2 Aiming for QA
  (dont build things to pass QA, build them to run in production)
- 7.3 Load Testing
  (tough to know real prod traffic.)
- 7.4 Murder by the Masses
  (number of sessions had killed the site. there were tons of bad clients.)
- 7.5 The Testing Gap
  (test with stupid and ill behaved clients!!)
- 7.6 Aftermath
  (be able to black-list IPs. be able to "fail fast" and return a simple error page when overloaded.)
Introducing Capacity
- 8.1 Defining Capacity
  (performance: how fast the app handles one transaction. can be isolated or under load. throughput: number of transactions the app handles in a timespan. capacity: max throughput a system can sustain while maintaining an ok response time.)
- 8.2 Constraints
  (one constraint determines capacity. it's whichever limiting factor hits its ceiling first. find your system constraint. use it to plan capacity improvements.)
- 8.3 Interrelations
- 8.4 Scalability
  (horizontal scaling vs vertical scaling)
- 8.5 Myths About Capacity
  (cpu aint cheap. storage aint cheap. bandwidth aint cheap. dont code like a jerk)
- 8.6 Summary
  (capacity planning needs monitoring and optimization.)
Capacity Antipatterns
- 9.1 Resource Pool Contention
  (contention for resources can cause your CPU to waste its time waiting. make your resource pool size equal to the number of threads.)
- 9.2 Excessive JSP Fragments
- 9.3 AJAX Overkill
  (avoid needless requests. make sure each ajax request doesnt create a new session. minimize response sizes.)
- 9.4 Overstaying Sessions
  (set session timeout to one standard deviation past the average think time. keep keys, not whole objects.)
- 9.5 Wasted Space in HTML
  (omit needless characters. remove whitespace.)
- 9.6 The Reload Button
- 9.7 Handcrafted SQL
  (dont do it, often it will be a non-indexed query. test queries against prod-like database sizes.)
- 9.8 Database Eutrophication
  (archive and eliminate old data. dont run transactions and reporting against the same database! create indexes.)
- 9.9 Integration Point Latency
- 9.10 Cookie Monsters
  (dont trust cookies. minimize cookies, as they are sent with each http request. use cookies for identifiers, not entire objects.)
- 9.11 Summary
  (stay in school, dont do drugs)
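The session-timeout rule in 9.4 (one standard deviation past the average think time) is simple arithmetic. A small sketch, assuming think times are measured in seconds and using the sample standard deviation; the function name and the sample numbers are made up for illustration.

```python
import statistics


def session_timeout_seconds(think_times):
    """Average think time plus one sample standard deviation.

    Needs at least two observations, since statistics.stdev
    raises StatisticsError on fewer.
    """
    return statistics.mean(think_times) + statistics.stdev(think_times)


if __name__ == "__main__":
    # Hypothetical think times between requests, in seconds.
    observed = [20, 30, 40]
    print(session_timeout_seconds(observed))
```

With the hypothetical observations 20, 30 and 40 the mean is 30 and the sample standard deviation is 10, so the timeout comes out at 40 seconds. Real think times should come from production access logs, not from guesses.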
Capacity Patterns
("we should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." balance small gains against complexity.)
- 10.1 Pool Connections
  (connection pooling size is critical. an undersized pool leads to resource pool contention, oversized can lead to excess stress on DB. tune it. also, ensure timeouts.)
- 10.2 Use Caching Carefully
  (monitor hit rates. if they are low, your caching isnt helping and may be hurting. dont leave cache size unbounded. build a flush mechanism.)
- 10.3 Precompute Content
  (precompute pieces that change infrequently. serve that piece many times!)
- 10.4 Tune the Garbage Collector
  (untuned apps can spend 10% of their time in garbage collection. tuning can reduce that to around 2%. use jconsole to show heap usage and time spent in garbage collection. adjust the relative sizes of the generations.)
- 10.5 Summary
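The caching advice in 10.2 (bound the size, watch the hit rate, build a flush mechanism) could look something like this. It is a sketch under assumptions: `BoundedCache` is a hypothetical class, and least-recently-used eviction is one choice among several, not something the notes prescribe.

```python
from collections import OrderedDict


class BoundedCache:
    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()  # insertion order doubles as recency order
        self.hits = 0
        self.misses = 0

    def get(self, key, compute):
        """Return the cached value for key, computing and storing it on a miss."""
        if key in self._data:
            self.hits += 1
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        self.misses += 1
        value = compute(key)
        self._data[key] = value
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used entry
        return value

    def flush(self):
        """Drop every entry, e.g. after the underlying data changes."""
        self._data.clear()

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

If `hit_rate()` stays low in production, the cache is costing memory without saving work, and removing it may be the better optimization.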

Part III - General Design Issues

Networking
- 11.1 Multihomed Servers
  (have admin&monitoring on its own highly secure network. partition backup traffic onto its own network segment. bind apps to the Non admin network.)
- 11.2 Routing
- 11.3 Virtual IP Addresses
  (can be moved from one NIC to another)
Security
- 12.1 The Principle of Least Privilege
  (process should have the lowest level of privilege needed to accomplish a task. developers too! each major app should have its own user. )
- 12.2 Configured Passwords
  (keep passwords separate from config files.)
Availability
(do not divorce a Want from its Cost.)
- 13.1 Gathering Availability Requirements
  (with uptime, determine actual cost vs avoided losses.)
- 13.2 Documenting Availability Requirements
  (better to define the SLAs in terms of specific features. see the list of SLA bulletpoints)
- 13.3 Load Balancing
  (DNS roundrobin. Reverse Proxy and x-forwarded-for header. hardware like F5 BigIP.)
- 13.4 Clustering
  (ok if no other alternatives for redundancy. doesnt really address scalability)
Administration
- 14.1 “Does QA Match Production?”
  (keep apps separated via vms if need be. one-to-one vs one-to-many.)
- 14.2 Configuration Files
  (keep production config away from plumbing config.)
- 14.3 Start-up and Shutdown
  (build clean startup sequences and dont accept connections until startup is complete. )
- 14.4 Administrative Interfaces
  (GUIs are intuitive, but suck to script and automate)

Part IV - Operations

Phenomenal Cosmic Powers, Itty-Bitty Living Space
- 16.1 Peak Season
- 16.2 Baby’s First Christmas
- 16.3 Taking the Pulse
- 16.4 Thanksgiving Day
- 16.5 Black Friday
- 16.6 Vital Signs
- 16.7 Diagnostic Tests
- 16.8 Call in a Specialist
- 16.9 Compare Treatment Options
- 16.10 Does the Condition Respond to Treatment?
- 16.11 Winding Down
Transparency
(transparency refers to the qualities that allow people to understand the system's historical trends and present state. component level visibility is important.)
- 17.1 Perspectives
  (different people need different perspectives: historical trending, predicting the future, present state, instantaneous behavior)
- 17.2 Designing for Transparency
  (transparency needs to be built in.)
- 17.3 Enabling Technologies
  (blackbox technology sits outside the system. whitebox technology sits inside and is integrated during development)
- 17.4 Logging
  (logs are still invaluable. 'error' or 'severe' should require ops intervention. make logs human readable both in format and content.)
- 17.5 Monitoring Systems
  (want external monitoring in case your main process is hung. )
- 17.6 Standards, De Jure and De Facto
  (SNMP, JMX, CIM)
- 17.7 Operations Database
  (for metrics vs production data.)
- 17.8 Supporting Processes
  (review problems weekly. solve the most time consuming.)
- 17.9 Summary
  (transparency can be the difference between a system that improves and a system that decays.)
Adaptation
- 18.1 Adaptation Over Time
- 18.2 Adaptable Software Design
- 18.3 Adaptable Enterprise Architecture
- 18.4 Releases Shouldn’t Hurt
- 18.5 Summary

SubwooferBeachChair

Sunday, January 24, 2016

Book - Release It
