Introduction
Have you ever wondered what it takes to drive a car from point A to point B? Yes, you may travel quickly by taking the highway. What about that roadway, though? How many technical marvels were required to keep the journey as safe as possible for years? Planning, design, right-of-way acquisition, site preparation, earthwork, subgrade preparation, pavement, drainage, markings, testing, and inspection are all included in the building process.
Creating a microservice is similar to building a highway, as it involves numerous stages that require careful consideration and planning so that the HTTP request you transmit eventually yields a meaningful response, 99.9% of the time.
To attain a significant level of stability and uptime, a microservice must implement numerous strategies to autonomously address most unexpected problems, utilizing a tailored service response to avoid service interruptions. Some examples include application pods going down for multiple reasons, excessive network traffic, temporary database failures, concurrency issues, distributed denial-of-service attacks, security vulnerabilities, and data integrity concerns, among others.
The well-known 12-factor app methodology aims to increase awareness by establishing a series of guidelines for different aspects of the application, helping to prevent systemic issues encountered in contemporary app development.
In addition to the 12-factor app principles, this document aims to incorporate further elements that enhance the reliability of a microservice. We are focusing solely on the development aspect of microservice engineering. Future posts will address architecture and team management.
We aimed for two things:
This post tries to highlight some of the key features that an application requires to enhance the reliability of a microservice system during both design and runtime. It seeks to identify issues that could arise from the lack or improper execution of every phase.
It does not encompass every aspect of a particular feature, nor does it include all the solutions available at a given time.
Establish the project framework
Having a solid project structure from the start, from both coding and organizational points of view, is crucial, since it determines the overall shape of the implementation. A badly structured system leads to problems throughout the entire software lifecycle. Consequently, it is advisable to assign only senior engineers, tech leads, or software architects to this phase.
This phase includes but is not limited to the following:
Add Docker support
Docker offers an intuitive way for development, debugging, and testing to flexibly configure every aspect of the system: databases operating in containers with designated testing data, identity servers for authentication and authorization with specific predefined users and roles, caching, or distributed synchronization across application instances.
Docker containers are also suitable for production environments, supported by infrastructure tools such as Docker Swarm, Kubernetes, or cloud services like Azure or AWS.
External services deployment with Docker Compose
By incorporating deployment specifications into the project setup, a developer can launch the entire system immediately on the local development machine.
The aim is also to have the main microservice running locally without Docker, while the other dependent services are started together with the help of Docker Compose.
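As a minimal sketch of that setup, a Compose file could declare the dependent services while the microservice itself runs natively on the developer's machine. The service names, database name, and credentials below are illustrative assumptions, not part of any particular project:

```yaml
version: "3.8"
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: orders          # hypothetical database name
      POSTGRES_PASSWORD: dev-only  # development-only credentials
    ports:
      - "5432:5432"
  redis:
    image: redis:7                 # caching / distributed synchronization
    ports:
      - "6379:6379"
```

A single `docker compose up` then brings the dependencies online, and the locally running microservice connects to them over the published ports.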
Security considerations
To enhance the security of the microservice comprehensively, we must address both design-time security considerations and those that occur at runtime. Extra security must be implemented at the infrastructure layer, where hosting, networks, and even the physical security must be considered.
Authentication & authorization
Authentication and authorization serve as the initial barrier against possible threats. Authorization guarantees that, following authentication, users can access not all resources, but only those appropriate for their roles.
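One common way to enforce the role check after authentication is a guard around each handler. The following is a minimal sketch, assuming a user object already produced by the authentication step; `require_role`, `delete_account`, and the role names are hypothetical:

```python
from functools import wraps

class ForbiddenError(Exception):
    """Raised when an authenticated user lacks the required role."""

def require_role(role):
    """Allow the call only when the authenticated user carries the role."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(user, *args, **kwargs):
            if role not in user.get("roles", ()):
                raise ForbiddenError(f"role '{role}' required")
            return handler(user, *args, **kwargs)
        return wrapper
    return decorator

@require_role("admin")
def delete_account(user, account_id):
    # Business logic runs only for users holding the "admin" role.
    return f"account {account_id} deleted"
```

In a real framework the user would come from a validated token rather than a plain dictionary, but the shape of the check stays the same.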
Security aspects at design time
Security in design phases must be incorporated into the development lifecycle. A system ought to incorporate supplementary tools such as Snyk or SonarQube for evaluating the code against recognized security flaws.
Additionally, a microservice should mitigate potential overuse attempts by leveraging proactive and reactive measures like timeouts, rate limiting, circuit breakers, or fallbacks.
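As one concrete example of such a proactive measure, a token-bucket rate limiter rejects requests that exceed a sustained rate. This is a self-contained sketch of the technique, not any specific library's API:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: refills `rate` tokens per second up to
    `capacity`; each request consumes one token or is rejected."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to the time elapsed since the last check.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A request pipeline would call `allow()` before processing and answer HTTP 429 when it returns `False`; circuit breakers and fallbacks follow the same pattern of deciding before (or instead of) doing the expensive work.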
Runtime security concerns
Runtime security is the most tangible concern, because it includes not just the security of the code, but also the infrastructure and all real-world risks. For instance, flooding and static electricity (caused by humidity levels) represent potential hazards at this stage, necessitating that data centers choose locations and implement measures to mitigate these threats.
Maintenance security
Maintenance encompasses both physical and virtual actions applied to the microservice's runtime to guarantee smooth functionality and minimize current and future risks.
Concerns regarding security related to datacenters and the reliability of maintenance staff are additional factors to take into account here.
Database design & versioning
The primary objective is to establish a database design system that enables version control of changes in alignment with the application.
Consider the rollback situations where an earlier application version needs to be deployed in production, necessitating a database downgrade as well.
The most dependable scenarios for database upgrades involve scripts, as they ensure a uniform approach for all database tasks: schema alterations, data modifications, audits, etc., leading to more consistent deployment processes.
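The script-based approach can be sketched as a tiny migration runner: ordered, versioned scripts live in source control, and a version table records which ones have been applied. The table layout and DDL below are illustrative assumptions; downgrade support would mirror it with a matching set of "down" scripts:

```python
import sqlite3

# Ordered, versioned migration scripts kept in source control (illustrative DDL).
MIGRATIONS = {
    1: "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)",
    2: "ALTER TABLE customer ADD COLUMN email TEXT",
}

def migrate(conn, target_version):
    """Apply every script above the recorded schema version, in order."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    current = conn.execute(
        "SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version in sorted(MIGRATIONS):
        if current < version <= target_version:
            conn.execute(MIGRATIONS[version])
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
    conn.commit()
```

Because the runner only ever applies scripts above the recorded version, deployments become idempotent: running the same release twice changes nothing.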
Logging support
Logging demonstrates its value when an incident occurs and everything must be traced back to the root cause.
An organized logging system categorizes logs using a unique correlation ID that connects them to a specific business operation, even one spanning several microservice interactions. It further enriches the log messages with various attributes of the atomic business operation it describes, including timestamp (to the millisecond), thread ID, processor time, memory, and user ID.
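A minimal sketch of correlation-ID enrichment, using Python's standard `logging` module and a context variable so the ID follows the request through the call stack (the logger name "orders" and the ID format are assumptions):

```python
import contextvars
import io
import logging

# The current request's correlation ID, set once at the edge of the service.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

def build_logger(stream):
    logger = logging.getLogger("orders")  # hypothetical service name
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s.%(msecs)03d [%(correlation_id)s] %(threadName)s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S"))
    handler.addFilter(CorrelationFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Every message then carries the ID automatically, so a single `grep` over aggregated logs reconstructs the whole business operation across calls.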
Exception handling
It is very important to have concise and predictable error handling.
It systematically ensures that any error is correctly propagated to the right follow-up flow, whether it is, for instance, a business, validation, or database error.
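One way to keep that propagation predictable is a central mapping from exception type to an HTTP-style response, so no error class is handled ad hoc. The exception names and status codes below are illustrative assumptions:

```python
class ValidationError(Exception):
    """Input failed validation."""

class NotFoundError(Exception):
    """Requested resource does not exist."""

# Central mapping from exception type to HTTP status and public message.
ERROR_MAP = {
    ValidationError: (400, "invalid request"),
    NotFoundError: (404, "resource not found"),
}

def to_response(exc):
    """Translate any raised error into a predictable response shape."""
    for exc_type, (status, message) in ERROR_MAP.items():
        if isinstance(exc, exc_type):
            return {"status": status, "error": message, "detail": str(exc)}
    # Unknown errors never leak internals to the caller.
    return {"status": 500, "error": "internal server error", "detail": ""}
```

A single top-level handler then calls `to_response` for everything, guaranteeing that every failure mode produces the same response shape.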
Transient error handling
Transient errors call for a special kind of error handling that allows a specific application flow to be retried.
Whether it is a transient database error, a service-to-service communication failure, or an IoT device malfunction, each gives the business flow the option to be retried with different foreseen outcomes, usually a successful one.
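The canonical shape for this is a retry loop with exponential backoff; only errors marked as transient are retried, and the final failure is re-raised. A minimal sketch, with `TransientError` as an assumed marker type:

```python
import time

class TransientError(Exception):
    """Marker for errors worth retrying (deadlocks, timeouts, dropped connections)."""

def retry(operation, attempts=3, base_delay=0.01):
    """Retry `operation` on TransientError with exponential backoff;
    re-raise after the final attempt."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise
            # 0.01s, 0.02s, 0.04s, ... between attempts.
            time.sleep(base_delay * 2 ** attempt)
```

Production variants usually add jitter to the delay so many instances retrying at once do not synchronize into a thundering herd.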
Localization & globalization
Some API consumers request HTTP responses in their native language.
This functionality allows translating especially business-facing messages in different languages, specified by a locale parameter, mostly placed as an HTTP request header.
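The mechanism can be sketched as a catalog lookup keyed by the locale parsed from the request header. The message keys and translations below are made up for illustration; a real service would load resource files instead of hard-coding them:

```python
# Minimal message catalog keyed by locale.
MESSAGES = {
    "en": {"order.confirmed": "Your order is confirmed."},
    "fr": {"order.confirmed": "Votre commande est confirmée."},
}

def translate(key, accept_language="en"):
    """Resolve a message key for the locale taken from an Accept-Language
    header, falling back to English for unknown locales or keys."""
    # "fr-FR,fr;q=0.9" -> "fr" (primary language tag of the first entry)
    locale = accept_language.split(",")[0].split("-")[0].strip().lower()
    catalog = MESSAGES.get(locale, MESSAGES["en"])
    return catalog.get(key, MESSAGES["en"].get(key, key))
```

Keeping the fallback chain explicit (requested locale, then English, then the raw key) means a missing translation degrades gracefully instead of failing the request.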
Caching
Caching is one of the most performance-centric features, allowing significant gains, especially for data-related operations.
There is a fine balance between caching, the number of application instances, and the performance of the database server (SPOT) in terms of scaling and costs. In such scenarios, the Architecture Decision Records should contain at least base data quantifications such as the maximum number of concurrent requests supported by the system and the ones required business-wise. This allows tackling different caching strategies that increase the performance of different microservice aspects.
Data caching
Data caching allows reducing the roundtrip time to data storage, whether it is a database or a blob. Some ORMs allow multiple levels of caching, some of them customizable enough to add a distributed cache.
Method-level caches
Method caching aims to reduce the processing time and power (processor time) of different methods within the application.
As a word of caution, for any caching type a system might use, there is a downside of data aging, which may not be acceptable for certain microservices, such as trading systems.
Additionally, every create, update, or delete operation must guarantee that the cache is in sync with the actual data. This makes any caching mechanism more labor-intensive.
Concurrency control
Concurrency control aims to protect the integrity of the data within a system, ultimately affecting the reliability of the entire business.
Imagine a financial service whose transactions are being wrongfully calculated because some of the items are edited at the same time.
Database concurrency scenarios
There are quite a lot of issues caused by multiple processing pipelines trying to modify the same database resource.
Lost Update: Two transactions read the same data, modify it, and write it back, causing the first update to be overwritten. Example: Two users booking the last seat on a flight simultaneously.
Dirty Read: A transaction reads data that has been updated by another, uncommitted transaction. Example: A reporting query reads a new user record that is immediately rolled back due to an error, leading to inaccurate data.
Unrepeatable Read (Non-repeatable Read): A transaction reads the same row twice but gets different results because another committed transaction modified it in between. Example: A bank user checks their balance, a deposit is made, and they check again, seeing a different amount.
Phantom Read: A transaction runs a query, but a concurrently committed transaction inserts/deletes rows, causing the original query to produce different results. Example: A report calculates total sales, but a new sale is inserted during the calculation, changing the total.
Deadlock: Two transactions wait for each other to release locks, causing neither to proceed. Example: Transaction A locks Row 1 and needs Row 2, while Transaction B locks Row 2 and needs Row 1.
The microservice must select the concurrency-handling strategy best suited to the business cases it targets in order to get around all of these problems. Although tolerating the "Lost Update" (last write wins) works in the majority of situations, it might not be the ideal choice when the user anticipates a data validation for the input.
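When lost updates are unacceptable, optimistic concurrency is a common answer: each row carries a version, and an update succeeds only when the version has not moved since the read. A minimal in-memory sketch of the technique (a SQL variant would be `UPDATE ... WHERE id = ? AND version = ?`); the flight-seat naming is illustrative:

```python
class ConcurrencyError(Exception):
    """Raised when the row was modified since it was read."""

class Repository:
    """In-memory stand-in for a table with a version column."""

    def __init__(self):
        self.rows = {}  # row_id -> (value, version)

    def read(self, row_id):
        return self.rows[row_id]

    def update(self, row_id, new_value, expected_version):
        value, version = self.rows[row_id]
        if version != expected_version:
            raise ConcurrencyError(
                f"row {row_id} changed (v{version} != v{expected_version})")
        self.rows[row_id] = (new_value, version + 1)
```

With this in place, the second of two simultaneous seat bookings fails loudly instead of silently overwriting the first, and the caller can re-read and decide what to do.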
Distributed transactions & locking
Distributed locking tries to fix data integrity issues by allowing a single actor to perform and keep a data modification. If other processes attempted the same, a single one of them would commit the transaction, and the rest of them would roll back any data processing for that particular data.
When processing a credit card transaction, for example, a Kafka consumer may need to lock a database record to prevent interference from other concurrent consumers.
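The semantics of such a lock can be sketched with expiring leases: one owner at a time, and the lease expires automatically so a crashed consumer cannot hold the lock forever. This is an in-memory stand-in to show the contract; production systems would back it with Redis, ZooKeeper, or a database row:

```python
import time

class LeaseLock:
    """Single-owner lock with an expiring lease (in-memory sketch)."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.owner = None
        self.expires = 0.0

    def acquire(self, owner):
        """Take the lock if free or if the previous lease has expired."""
        now = time.monotonic()
        if self.owner is None or now >= self.expires:
            self.owner, self.expires = owner, now + self.ttl
            return True
        return False

    def release(self, owner):
        """Only the current owner may release."""
        if self.owner == owner:
            self.owner = None
```

In the credit-card example, the consumer that fails to acquire the lease simply skips or requeues the record, which is exactly the commit-one, roll-back-the-rest behavior described above.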
Request validation
A microservice request validation targets the same data integrity checks as distributed locking, and additionally, it ensures data enters the system in a predictable manner.
Validating the initial data that enters the system - that is, the data contained in the HTTP request itself - is essential.
Some projects attempted to validate the business entities resulting from request mapping instead; this ended in a lot of junk code tracing values back to the HTTP request itself.
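Validating the raw payload first keeps the error messages in the caller's vocabulary. A minimal sketch; the field names and rules are hypothetical:

```python
def validate_request(payload):
    """Validate the raw HTTP payload before mapping it to business entities.
    Returns a list of error messages; an empty list means the request is valid."""
    errors = []
    email = payload.get("email")
    if not isinstance(email, str) or "@" not in email:
        errors.append("email: a valid address is required")
    amount = payload.get("amount")
    if not isinstance(amount, (int, float)) or amount <= 0:
        errors.append("amount: must be a positive number")
    return errors
```

Because the checks run against the request itself, every error message points directly at a field the caller sent, with no back-tracing through mapped entities.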
Health checks
Health checks are one of the main leverage points in ensuring a good indication of the service status.
It should cover all direct dependencies of the main microservice, such as database status, caching status, etc.
It brings tremendous help to the monitoring team in creating alerts for various system parameters, especially those outside a normal-case scenario (e.g., the number of transient database errors, or the number of requests rejected by rate limiters or circuit breakers).
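The aggregation itself is simple: probe each dependency, and report overall health only when every probe passes. A sketch of what a health endpoint might return (dependency names are illustrative):

```python
def health(checks):
    """Run each dependency probe; overall status is healthy only when
    every dependency reports healthy. A probe that raises counts as down."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = "healthy" if probe() else "unhealthy"
        except Exception:
            results[name] = "unhealthy"
    overall = ("healthy"
               if all(v == "healthy" for v in results.values())
               else "unhealthy")
    return {"status": overall, "dependencies": results}
```

Exposing the per-dependency breakdown, not just the overall flag, is what lets the monitoring team alert on a specific failing subsystem.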
Testing
Testing is a vital part of any software development.
In the context of microservices, testing typically employs the following levels: unit tests for isolated logic, integration tests against real dependencies, contract tests between services, end-to-end tests of whole business flows, and load or performance tests.
Deployment
Deployment accumulates all the work done for a specific version and publishes it to specific environments.
For a consistent approach, automation is mandatory. The best approach is to also leverage deployment scripts in source control, allowing manual deployments as well.
Documentation
Documentation is often minimized by project owners because it doesn't bring immediate value while the people involved already know the project. There are a few scenarios, however, that benefit fully from documentation: onboarding new team members, investigating incidents long after a feature was built, and handing the project over to another team.
Monitoring & Support
Monitoring in conjunction with health checks allows early discovery of potential issues, as well as reacting promptly to client incidents.
It also verifies performance, especially at peak API usage times, against the API's expectations.
As an ongoing effort in a project, monitoring could be carried out by the development team or a different team altogether.
Big projects could employ different levels of monitoring and support, to ensure different types of issues are handled by the proper teams.
Fixing Live Issues
When live issues occur (and they will!), developers must have that live version of the service available, both as code and on a server where it can be tested.
That's why, besides a staging server (hosting a not-yet-released version), a UAT (User Acceptance Testing) server is also very important. It mirrors the configuration of the live system, but with testing data, allowing a developer to quickly test the issue before and after the fix.
To better mirror the live system, some businesses permit the live data to be sent to a UAT server. The data would have the same format as the live data, but it might be anonymized, for obvious security reasons.
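Anonymization for that purpose can be as simple as masking the sensitive fields while preserving the record's shape and lengths. The field names below are assumptions for illustration:

```python
def anonymize(record, sensitive=("name", "email", "card_number")):
    """Return a copy of a live record with sensitive fields masked,
    keeping the format and length so UAT behaves like production."""
    masked = dict(record)
    for field in sensitive:
        value = masked.get(field)
        if isinstance(value, str) and value:
            # Keep the first character and the length; hide the rest.
            masked[field] = value[0] + "*" * (len(value) - 1)
    return masked
```

Real pipelines typically go further (deterministic pseudonyms so joins still work, format-preserving masks for card numbers), but the principle is the same: same shape, no real identities.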
I've come across techniques that automatically build temporary, fully functional environments for a specific issue investigation. These are usually expensive approaches, especially if the environments are in the cloud. They are suitable for large projects and large databases, with specific Service Level Agreements that imply quick resolution of such problems.
References