Gremlin Systematic Resilience Testing of Microservices

  • Date:01 Jul 2020
  • Pages:10
  • Size:280.35 KB

Transcription:

[...] injection rules to be applied to the network messages exchanged between microservices. Gremlin's data plane consists of network proxies that intercept, log, and manipulate messages exchanged between microservices. The control plane configures the network proxies for fault injection based on the rules generated for a recipe. After emulating the failure, the control plane analyzes the observation logs from the network proxies to validate the assertions specified in the recipe. Gremlin recipes can be executed and checked in a matter of seconds, thereby providing quick feedback to the operator. This low-latency feedback enables the operator to create correlated failure scenarios by conditionally chaining different types of failures and assertion checks.

Our case study shows that Gremlin requires a minimal learning curve: developers at IBM found unhandled corner-case failure scenarios in a production enterprise application without modifying the application code. Furthermore, Gremlin correctly identified the lack of failure handling in an actively used library designed specifically for abstracting away failure-handling code. Controlled experiments indicate that Gremlin is fast, introduces low overhead, and is suitable for resiliency testing of operational microservice applications.

Fault injection ideas from Gremlin and their software-defined architecture have been integrated into IBM's cloud offerings for microservice applications. Integration of other parts, such as validation of assertions, is planned for the future. The source code for Gremlin is publicly available in GitHub at https://github.com/ResilienceTesting.

In summary, our contributions are as follows:

  • A systematic resiliency testing framework for creating and executing recipes that capture a rich variety of high-level failure scenarios and assertions in microservice-based applications.
  • A framework that can be integrated easily into production or production-like environments (e.g., shadow deployments) without modifications to application code.
  • Recipes capable of tolerating the rapid evolution of microservice code by taking advantage of the standardized interaction patterns between services.

2 Background and Motivation

Large-scale Internet applications such as Netflix, Facebook, Amazon store, etc. have demonstrated that in order to achieve scalability, robustness, and agility, it is beneficial to split a monolithic web application into a collection of fine-grained web services, called microservices [15]. Figure 1 illustrates the architecture of a typical microservice-based web application deployed in a cloud platform such as Amazon AWS, IBM Bluemix, Microsoft Azure, etc. Each microservice is a simple REST [6] based web service that interacts with other services using HTTP. Modern applications leverage both managed services offered by the hosting cloud platform (e.g., relational databases, key-value stores) and third-party services (e.g., Facebook, Twitter).

Figure 1: Typical architecture of a microservice-based application. The application leverages services provided by the hosting cloud platform (e.g., managed databases, message queues, data analytics) and integrates with Internet services such as social networking, mobile backends, geolocation, etc.

When compared to traditional service-oriented architectures, microservices are very loosely coupled with one another: they can be updated and deployed independently of other microservices as long as the APIs they expose are backward compatible. To achieve loose coupling, microservices use standard application protocols such as HTTP to facilitate easy integration with other microservices. Organizationally, each microservice is owned and operated by an independent team of developers. The ability to immediately integrate updates into the production deployment [5] has led to a continuous software delivery model [10], where development teams incrementally deliver features while incorporating user feedback.

2.1 Designing for Failure: Current Best Practices

To remain available in the face of infrastructure outages in the cloud, a microservice must guard itself from failures of its dependencies. Best design practices advocate incorporating resiliency design patterns such as timeouts, bounded retries, circuit breakers, and bulkheads [9, 16].

  • Timeouts ensure that an API call to a microservice completes in bounded time, to maintain responsiveness and release resources associated with the API call in a timely fashion.
  • Bounded retries handle transient failures in the system by retrying the API calls with the expectation that the fault is temporary. The API calls are retried a bounded number of times and are usually accompanied with an exponential backoff strategy to avoid overloading the callee microservice.
  • Circuit breakers prevent failures from cascading across the microservice chain. When repeated calls to a microservice fail, the circuit breaker transitions to open mode and the caller service returns a cached or default response to its upstream microservice. After a fixed time period, the caller attempts to re-establish connectivity with the failed downstream service. If successful, the circuit is closed again, resuming normal operation. The definition of success is implementation dependent, e.g., response times within a threshold, absence of errors in a time period, etc.
  • Bulkheads provide fault isolation within a microservice. If a shared thread pool is used to make API calls to multiple microservices, thread pool resources can be quickly exhausted when one of the downstream services degrades. Resource exhaustion renders the service incapable of processing new requests. The bulkhead pattern mitigates this issue by assigning an independent thread pool for each type of dependent microservice being called.

Our study of the postmortem reports of recent outages (recall Table 1) shows that while developers may have implemented failure recovery measures in their application logic, they remain unaware whether their microservice can tolerate failures until the failure actually occurs. To our knowledge, there are no tools that provide systematic feedback indicating whether failure recovery measures work as expected in a microservice application.

2.2 Challenges in Resiliency Testing of Microservices

A microservice-based application is fundamentally a distributed application. However, it differs from distributed file systems, databases, co-ordination services, etc. The latter group of applications have complex distributed state machines with a large number of possible state transitions. Existing tools for resiliency testing cater to the needs of these traditional low-level distributed applications [4, 7, 8, 11]. We find these tools to be unsuitable for use in web/mobile focused microservice applications, due to the following challenges:

C1 Runtime heterogeneity: An application may be composed of microservices written in different programming languages. Microservices may also be rewritten at any time using a different programming language as long as they expose the same set of APIs to other services. Consequently, approaches that rely on language-specific capabilities (e.g., dynamic code injection in Java) for fault injection and verification [7, 11] are infeasible in such heterogeneous environments, since few runtimes provide these capabilities.

C2 High code churn: Microservices are autonomously managed by independent teams. New versions of microservices are deployed 10-100 times a day, independently of other services. Exhaustive checkers [13] cannot keep up with this time scale.

C3 Automatic validation: The key to the agility of a microservice-based architecture is automation. Resiliency testing tools such as Netflix's Chaos Monkey [34] inject unpredictable faults automatically. However, manual validation that the microservices survived the failure as expected is still required. When services fail to recover, it could result in lengthy troubleshooting.

A useful resiliency testing tool must be automated, systematic, and agnostic to the application's runtime platform. Furthermore, it is crucial that both the fault injection and behavior validation are automated.

3 Gremlin Overview

To tackle the challenges described earlier, we propose a systematic resiliency testing framework called Gremlin. The key observations behind Gremlin's approach are as follows:

O1 Touch the network, not the app: Irrespective of the runtime heterogeneity (C1), all communication in the application happens entirely over the network. Multiple microservices work in coalition to generate the response to an end user's request. There are two important ramifications of this increased reliance on the network. First, common types of failures can be easily emulated by manipulating the network interactions. For example, appearance of an overloaded service can be created by delaying requests between two microservices. Second, the failure recovery of a microservice can be observed from the network. For example, by observing the network interactions, we can infer whether a microservice handles transient network outages by retrying its API calls to the destination microservice. We leverage this network observability property to automatically validate (C3) the recovery behavior of a collection of microservices.

O2 Volatile code with standard interactions: Despite the rapid rate at which the microservice application evolves in a daily fashion (C2), the interaction between different microservices can be characterized using a few simple, standard patterns such as request-response, publish-subscribe, etc. The semantics of these application layer transport protocols and the interaction patterns are well understood. Therefore, it is possible to elicit a failure-related reaction from any microservice, irrespective of its application logic or runtime, by manipulating these interactions directly. For example, an overloaded server can be emulated by intercepting the client's HTTP request and responding to it with the HTTP status code 503 Service unavailable.

Gremlin's design leverages these observations to provide a resiliency testing tool that is purely network-oriented and agnostic to the application code and runtime.

3.1 Fault Model

In a microservice-based application, response to a user request is a composition of responses from different microservices that communicate over the network. We confine our fault model to failures that are observable from the network by other microservices. Gremlin supports emulation of fail-stop/crash failures, performance/omission failures, and crash-recovery failures [1], the most common types of failures encountered in modern-day cloud deployments. We do not formally prove coverage of these failure types. We also do not test the resilience of the application against malicious attacks.

From the perspective of a microservice making an API call, failures in a remote microservice or the network manifest in the form of delayed responses, error responses (e.g., HTTP 404, HTTP 503), invalid responses, connection timeouts, and failure to establish the connection. The failure incidents described in Table 1 (in Section 1) can be emulated by the failure modes currently supported by Gremlin, even though our system does not cover emulating OS-level errors such as failed system calls.

Figure 2: High-level overview of the Gremlin framework. The Recipe Translator breaks down high-level failure scenarios into fault injection rules and assertions using the logical application graph. The Failure Orchestrator programs the Gremlin agents in the physical deployment to inject faults on matching request flows. The Assertion Checker executes assertions in the translated recipe against event logs provided by Gremlin agents.

Interface   Mandatory Parameters          Description
Abort       Src, Dst, Error, Pattern      Abort messages from Src to Dst, where messages match pattern Pattern. Return an application-level Error code to Src.
Delay       Src, Dst, Interval, Pattern   Delay forwarding of messages from Src to Dst that match pattern Pattern, by specified Interval.

3.2 Using Gremlin

Before we delve into the design of Gremlin, we provide the reader with a sample of Gremlin's capabilities. In Gremlin, the human operator (e.g., developer or tester) writes a Gremlin recipe: a test description written in Python, which consists of the outage scenario to be created and assertions to be checked. Assertions specify expected behavior of microservices during the outage. An operator can orchestrate elaborate failure scenarios and validate complex application behaviors in short and easy-to-understand recipes.

Consider a simple application consisting of two HTTP-based microservices, namely ServiceA and ServiceB, where ServiceA makes API calls to ServiceB. An operator might wish to test the resiliency of ServiceA against any degradation of ServiceB, with the expectation that ServiceA would retry failed API calls no more than five times. With Gremlin, she can conduct this resiliency test using the following recipe (boilerplate code omitted for brevity):

Example 1: Overload test

    1 Overload(ServiceB)
    2 HasBoundedRetries(ServiceA, ServiceB, 5)

In line 1, Gremlin emulates the overloaded state of ServiceB, without actually impacting ServiceB. When traffic is injected into the application, ServiceA would experience [...]
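
To make the assertion in Example 1 more concrete, the following is a minimal sketch of how a check in the spirit of HasBoundedRetries could be evaluated against event logs collected at the network proxies. It is not Gremlin's implementation: the Event record, its field names, the per-request ID used to group attempts, and the helper overload_log are all assumptions introduced here for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One observation logged by a proxy (hypothetical schema)."""
    request_id: str  # assumed ID that ties retries to one request flow
    src: str         # calling microservice
    dst: str         # called microservice
    status: int      # HTTP status code returned to the caller


def has_bounded_retries(events, src, dst, max_retries):
    """Return True if no request flow from src to dst exceeded max_retries retries."""
    attempts = defaultdict(int)
    for ev in events:
        if ev.src == src and ev.dst == dst:
            attempts[ev.request_id] += 1
    # The first attempt is not a retry, so retries = attempts - 1.
    return all(count - 1 <= max_retries for count in attempts.values())


def overload_log(src, dst, request_id, attempts):
    """Build a synthetic log in which every attempt is answered with HTTP 503."""
    return [Event(request_id, src, dst, 503) for _ in range(attempts)]
```

For instance, has_bounded_retries(overload_log("ServiceA", "ServiceB", "r1", 6), "ServiceA", "ServiceB", 5) holds, because six attempts amount to one initial call plus five retries, while a log with seven attempts fails the check. Note that this sketch passes vacuously when the log contains no matching traffic, so a real checker would presumably also confirm that test traffic was observed.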
