Sunday, January 24, 2016

Book - Release It!

Part 0

  • Preface
  • Introduction
    • Aiming for the Right Target
    • Use the Force
    • Quality of Life
    • The Scope of the Challenge
    • A Million Dollars Here, a Million There
    • Pragmatic Architecture

Part I - Stability

  • The Exception That Grounded an Airline
    (an anecdotal story setting up problems to discuss solutions for in the next chapter)
    • 2.1   The Outage
    • 2.2   Consequences
    • 2.3   Post-mortem
    • 2.4   The Smoking Gun
    • 2.5   An Ounce of Prevention?
  • Introducing Stability
    (software must be cynical. it must expect bad things to happen and never be surprised when they do. failures cause loss of business and money and reputation)
    • 3.1   Defining Stability
      (transient impulses and/or persistent stress manifest as strain, which could be increased RAM usage, high load, etc.)
    • 3.2   Failure Modes
      (*something* will fail first. that crack in the system can propagate and lead to a failure mode. systems always have a variety of failure modes. Pearson anecdote: think subpub...and then usercomposite... and then sms)
    • 3.3   Cracks Propagate
    • 3.4   Chain of Failure
    • 3.5   Patterns and Antipatterns 
  • Stability Antipatterns
    (antipatterns create or accelerate cracks in a system. this chapter discusses common antipatterns you should avoid.)
    • 4.1   Integration Points
      (integration points are the #1 killer of systems. counter these problems with the "Circuit Breaker" and "Decoupling Middleware" patterns. make a test harness that simulates problems in third-party systems so you can check your stability.)
      • Socket Based Protocols
        (refused connections, slow-to-respond refused connections, slow reads. slow responses are often worse than no response.)
      • HTTP Protocols
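        (older JDKs offered no read timeout on the built-in HTTP client; since Java 5, URLConnection has setConnectTimeout and setReadTimeout. jakarta commons HttpClient has timeouts too. always set a timeout.)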
      • Vendor APIs
        (little to no control on these... treat them as suspect)
    • 4.2   Chain Reactions
      (problem inherent in an app. one app in a cluster fails, and the others must pick up the slack, increasing their likelihood of failing. avoid these with "Bulkheads".)
    • 4.3   Cascading Failures
      (when problems in one layer cause problems in callers.  think database failure.  avoid these with "Circuit Breaker" and "Timeouts" pattern)
    • 4.4   Users
      (keep as little in sessions as possible to minimize memory consumption. use SoftReferences for large objects. identify your most expensive transactions.)
    • 4.5   Blocked Threads
      (can get into a situation where every thread is blocked waiting for some impossible outcome. synchronize only when necessary. scrutinize resource pools. use timeouts.)
    • 4.6   Attacks of Self-Denial
    • 4.7   Scaling Effects
    • 4.8   Unbalanced Capacities
      (make sure a big cluster doesn't overload a smaller cluster it depends on.)
    • 4.9   Slow Responses
      (track your own response time. consider failing fast when it exceeds your SLA.)
    • 4.10  SLA Inversion
      (don't offer a better SLA than the worst of your dependencies!)
    • 4.11  Unbounded Result Sets
      (beware a result that is bigger than you can handle.)
  • Stability Patterns
    • 5.1   Use Timeouts
      (networks are unreliable. any resource pool that can block a thread must have a timeout. queue work for a slow retry later.)
    • 5.2   Circuit Breaker
      (detect problems, open the circuit.  prevent the operation instead of executing it and having it fail)
    • 5.3   Bulkheads
      (protect critical clients by giving them their own pool of resources.)
    • 5.4   Steady State
      (don't fiddle with a running system by hand. remember to purge junk data. cache carefully and remember invalidation.)
    • 5.5   Fail Fast
      (don't waste resources if you are just going to throw out the result. don't do useless work.)
    • 5.6   Handshaking
      (can support throttling)
    • 5.7   Test Harness
      (emulate out-of-spec failures, stress the caller via slow responses or no responses)
    • 5.8   Decoupling Middleware
  • Stability Summary
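
The Circuit Breaker and Fail Fast notes above (5.2 and 5.5) can be sketched in a few lines. This is a minimal illustration in Python, not the book's implementation: the class names, the failure threshold, and the reset timeout are all assumptions picked for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without attempting it."""


class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, retries after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening (assumed value)
        self.reset_timeout = reset_timeout          # seconds to stay open (assumed value)
        self.clock = clock                          # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, operation, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of executing the operation.
                raise CircuitOpenError("circuit open; call not attempted")
            # Half-open: the timeout elapsed, so let one trial call through.
        try:
            result = operation(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each call to an integration point in `breaker.call(...)` means a dead dependency costs one fast exception instead of a blocked thread. A production breaker would also need thread safety and some way to report state changes to ops, which this sketch leaves out.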

Part II - Capacity

  • Trampled by Your Own Customers
    • 7.1   Countdown and Launch
      (nice anecdotal story about a website crashing.)
    • 7.2   Aiming for QA
      (don't build things to pass QA; build them to run in production.)
    • 7.3   Load Testing
      (tough to know real prod traffic.)
    • 7.4   Murder by the Masses
      (the number of sessions killed the site. there were tons of badly behaved clients.)
    • 7.5   The Testing Gap
      (test with stupid and ill-behaved clients!)
    • 7.6   Aftermath
      (be able to black-list IPs.  be able to "fail fast" and return a simple error page when overloaded.)
  • Introducing Capacity
    • 8.1   Defining Capacity
      (performance: how fast the app handles one transaction, measured in isolation or under load. throughput: number of transactions the app handles in a timespan. capacity: max throughput a system can sustain while maintaining an acceptable response time.)
    • 8.2   Constraints
      (one constraint determines capacity: whichever limiting factor hits its ceiling first. find your system's constraint and use it to plan capacity improvements.)
    • 8.3   Interrelations
    • 8.4   Scalability
      (horizontal scaling vs vertical scaling)
    • 8.5   Myths About Capacity
      (cpu ain't cheap. storage ain't cheap. bandwidth ain't cheap. don't code like a jerk.)
    • 8.6   Summary
      (capacity planning needs monitoring and optimization.)
  • Capacity Antipatterns
    • 9.1   Resource Pool Contention
      (contention for pooled resources leaves threads waiting instead of working. make your resource pool size equal to the number of threads.)
    • 9.2   Excessive JSP Fragments
    • 9.3   AJAX Overkill
      (avoid needless requests. make sure each ajax request doesn't create a new session. minimize response sizes.)
    • 9.4   Overstaying Sessions
      (set session timeout to one standard deviation past the average think time. keep keys, not whole objects.)
    • 9.5   Wasted Space in HTML
      (omit needless characters. remove whitespace.)
    • 9.6   The Reload Button 
    • 9.7   Handcrafted SQL
      (avoid it; it will often be a non-indexed query. test queries against prod-like database sizes.)
    • 9.8   Database Eutrophication
      (archive and eliminate old data. don't mix transactions and reporting on the same database! create indexes.)
    • 9.9   Integration Point Latency
    • 9.10  Cookie Monsters
      (don't trust cookies. minimize cookies, as they are sent with each http request. use cookies for identifiers, not entire objects.)
    • 9.11  Summary
      (stay in school, don't do drugs.)
  • Capacity Patterns
    ("we should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." balance small gains against complexity.)
    • 10.1  Pool Connections
      (connection pool size is critical. an undersized pool leads to resource pool contention; an oversized one can put excess stress on the DB. tune it. also, ensure timeouts.)
    • 10.2  Use Caching Carefully
      (monitor hit rates. if they are low, your caching isn't helping and may be hurting. don't leave cache size unbounded. build a flush mechanism.)
    • 10.3  Precompute Content
      (precompute pieces that change infrequently. serve each piece many times!)
    • 10.4  Tune the Garbage Collector
      (untuned apps can spend 10% time in garbage collection. tuning can reduce to around 2%.  use jconsole to show heap usage and time in garbage collection.  adjust ratios of relative sizes for generations)
    • 10.5  Summary 
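
The caching advice in 10.2 above (monitor hit rates, bound the size, build a flush mechanism) can be sketched as a small least-recently-used cache. This is an illustrative sketch only: the `BoundedCache` name, the LRU eviction policy, and the default size are assumptions, not something the book prescribes.

```python
from collections import OrderedDict


class BoundedCache:
    """Hypothetical LRU cache with a size cap, hit-rate counters, and a flush."""

    def __init__(self, max_entries=1000):
        self.max_entries = max_entries  # never let the cache grow without bound
        self._data = OrderedDict()      # ordered oldest-used first
        self.hits = 0
        self.misses = 0

    def get(self, key, loader):
        """Return the cached value, calling loader(key) on a miss."""
        if key in self._data:
            self.hits += 1
            self._data.move_to_end(key)     # mark as most recently used
            return self._data[key]
        self.misses += 1
        value = loader(key)
        self._data[key] = value
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used entry
        return value

    def hit_rate(self):
        """Fraction of lookups served from the cache (0.0 if none yet)."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def flush(self):
        """Drop every entry, e.g. when the underlying data changes."""
        self._data.clear()
```

Exposing `hit_rate()` to whatever monitoring you already have is the point: a persistently low number says the cache is spending memory without saving work.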

Part III - General Design Issues

  • Networking
    • 11.1  Multihomed Servers
      (put admin & monitoring on its own highly secure network. partition backup traffic onto its own network segment. bind apps to the non-admin network.)
    • 11.2  Routing
    • 11.3  Virtual IP Addresses
      (can be moved from one NIC to another.)
  • Security
    • 12.1  The Principle of Least Privilege
      (a process should have the lowest level of privilege needed to accomplish its task. developers too! each major app should have its own user.)
    • 12.2  Configured Passwords
      (keep passwords separate from config files.)
  • Availability
    (do not divorce a Want from its Cost.)
    • 13.1  Gathering Availability Requirements
      (for each level of uptime, weigh the actual cost against the avoided losses.)
    • 13.2  Documenting Availability Requirements
      (better to define the SLAs in terms of specific features. see the list of SLA bullet points.)
    • 13.3  Load Balancing
      (DNS round-robin. reverse proxy and the X-Forwarded-For header. hardware like F5 BIG-IP.)
    • 13.4  Clustering
      (ok if there are no other alternatives for redundancy. doesn't really address scalability.)
  • Administration
    • 14.1  “Does QA Match Production?”
      (keep apps separated via VMs if need be. one-to-one vs one-to-many.)
    • 14.2  Configuration Files
      (keep production config away from plumbing config.)
    • 14.3  Start-up and Shutdown
      (build clean startup sequences and don't accept connections until startup is complete.)
    • 14.4  Administrative Interfaces
      (GUIs are intuitive, but suck to script and automate.)
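
The cost-versus-avoided-losses note in 13.1 above comes down to simple arithmetic. The sketch below shows one way to run the numbers; the function names and every dollar figure are hypothetical, and a year is taken as 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours of downtime per year permitted by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100.0)


def net_benefit(current_percent, target_percent, loss_per_hour, yearly_cost):
    """Avoided losses minus the yearly cost of reaching the higher target."""
    hours_avoided = (downtime_hours_per_year(current_percent)
                     - downtime_hours_per_year(target_percent))
    return hours_avoided * loss_per_hour - yearly_cost


# Hypothetical figures: moving from 99.9% to 99.99% when an hour of
# downtime loses $10,000 and the upgrade costs $100,000 per year.
result = net_benefit(99.9, 99.99, loss_per_hour=10_000, yearly_cost=100_000)
print(f"net benefit per year: ${result:,.0f}")
```

With these made-up inputs the extra "nine" avoids roughly 7.9 hours of downtime a year, which is worth less than it costs. That is the kind of comparison to put in front of whoever is asking for the higher number.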

Part IV - Operations

  • Phenomenal Cosmic Powers, Itty-Bitty Living Space
    • 16.1  Peak Season
    • 16.2  Baby’s First Christmas
    • 16.3  Taking the Pulse
    • 16.4  Thanksgiving Day
    • 16.5  Black Friday
    • 16.6  Vital Signs
    • 16.7  Diagnostic Tests
    • 16.8  Call in a Specialist
    • 16.9  Compare Treatment Options
    • 16.10 Does the Condition Respond to Treatment?
    • 16.11 Winding Down
  • Transparency
    (transparency refers to the qualities that allow people to understand the system's historical trends and present state. component-level visibility is important.)
    • 17.1  Perspectives
      (different people need different perspectives: historical trending, predicting the future, present state, instantaneous behavior.)
    • 17.2  Designing for Transparency
      (transparency needs to be built in.)
    • 17.3  Enabling Technologies
      (black-box tools sit outside the system. white-box tools sit inside and are integrated during development.)
    • 17.4  Logging
      (logs are still invaluable. 'error' or 'severe' should require ops intervention. make logs human-readable in both format and content.)
    • 17.5  Monitoring Systems
      (want external monitoring in case your main process is hung.)
    • 17.6  Standards, De Jure and De Facto
      (SNMP, JMX, CIM)
    • 17.7  Operations Database
      (a separate database for metrics, kept apart from production data.)
    • 17.8  Supporting Processes
      (review problems weekly.  solve the most time consuming.) 
    • 17.9  Summary
      (transparency can be the difference between a system that improves and a system that decays.)
  • Adaptation
    • 18.1  Adaptation Over Time
    • 18.2  Adaptable Software Design
    • 18.3  Adaptable Enterprise Architecture
    • 18.4  Releases Shouldn’t Hurt
    • 18.5  Summary
