To turn NFRs (non-functional requirements) into a predefined template/checklist, we came up with a few critical points to start with. The checklist is auto-populated whenever someone creates a story in the project.
The idea is to push NFR discussions into the initial phases, such as design and development, with QA acting as a cross-check. Beyond the predefined template/checklist, anyone can work on other points too; a checklist for those has already been published in Confluence under Guidelines. Having a predefined checklist in each story ensures that we discuss NFRs alongside functional requirements for any deliverable going to production.
NFR List | Checklist Points | Comments (if any) |
Logging | Have we ensured that we are not logging access logs? | Access logs are the request logs containing the API path, status code, latencies, and any other information about the request. We can avoid logging these since the same information is already available in the istio-proxy logs. |
Logging | Have we ensured that no secrets (DB passwords, keys, etc.) are written to the logs? | |
Logging | Have we ensured that the payload is logged in the event of an error? | |
Logging | Have we ensured that the logging level can be configured dynamically? | |
Logging | Have we ensured that the entire sequence of events in a particular flow can be identified using an identifier such as orderId? | Logs should be meaningful enough that anyone looking at them, with or without context on the code, can understand the flow. For new features, it may be important to log at info level to help confirm the feature is working as expected in production. Once we are confident that the feature works as expected, these logs can be changed to debug unless they are still required. Devs can take a call based on the requirement. |
Logging | Have we ensured that we are using logging levels appropriately? | |
Timeouts | Have we ensured that we have set a timeout for database calls? | |
Timeouts | Have we ensured that we have set a timeout for API calls? | |
Timeouts | Have we ensured that timeouts are derived from dependent component timeouts? | An API may depend internally on a few other components (APIs, DB queries, etc.). The overall API timeout should be set only after careful consideration of the dependent component timeouts. |
Timeouts | Have we ensured that we have set an HTTP timeout? | Today, most of our services set timeouts at the client (caller). We should also start setting timeouts for requests on the server (callee). This way the server kills a request that exceeds its timeout, regardless of whether the client closes the connection. |
Response Codes | Have we ensured that we are sending 2xx only for successful scenarios? | |
Response Codes | Have we ensured that we are sending 500 only for unexpected errors (excluding timeouts)? | |
Response Codes | Have we ensured that we are sending 504 for a timeout error? | |
Perf | Have we ensured that we perf tested any new API we build, so that we have a benchmark to compare against expectations and to track going forward? | The perf test should identify the following parameters, plus any additional information as needed: max number of requests a pod can handle with the allocated resources; CPU usage; memory usage; response times. |
Perf | Have we ensured that we perf tested existing APIs when there are changes around them, to make sure we didn’t impact existing benchmark results? | |
Feature Toggle | Have we ensured that new features have a feature toggle, so that we can go back to the old state at any point until we are confident in the new changes? We may also need toggles that enable a feature only for specific users or cities. | |
Resiliency | Have we ensured that we are resilient to failures of dependent components (database, services)? | |
Metrics | Have we ensured that we are capturing the right metrics in Prometheus? | Some metrics that could be captured based on need or criticality: business metrics (example: number of payment gateway failures); business logic failures (example: number of rider prioritization requests that failed); any other errors that would help assess the impact in a critical flow. |
Security | Have we ensured that the right authentication scheme is active at the gateway level? | Applicable when adding any endpoint on Kong (gateway). One of the authentication plugins (jwt, key-auth, basic-auth) must be defined at either the route level or the service level. For Kong gateway endpoints, the acl plugin must be added and the same group must be present on the consumer definition. |
Security | Have we ensured that proper rate limiting is applied at the gateway level? | Applicable when adding any endpoint on Kong (gateway). Team leads are the code owners, so one of them has to check this when approving the PR. The rate limiting plugin needs to be enabled at the route or service level in the PR raised against kong-config. |
Security | Have we ensured that we are retrieving the userId from the JWT? | If the request comes through Kong, the userId in the request body should match the one in the headers. Or, when fetching any user-related information, we have to read the userId only from the header populated by Kong (x-consumer-username). |
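As an illustration of the Timeouts and Response Codes rows, here is a minimal Python sketch, not our actual service code. It assumes an asyncio-based handler and that dependencies are called one after another; the names `handle_request`, `DB_TIMEOUT_S`, `DOWNSTREAM_API_TIMEOUT_S`, `HEADROOM_S`, and all the timeout values are hypothetical. It shows the overall API timeout being derived from the dependency timeouts, a server-side deadline that cancels the work, and 504 being kept separate from 500.

```python
import asyncio
from http import HTTPStatus

# Hypothetical per-dependency timeouts in seconds (illustrative values only).
DB_TIMEOUT_S = 0.2
DOWNSTREAM_API_TIMEOUT_S = 0.3
HEADROOM_S = 0.1  # assumed allowance for the handler's own processing

# Assumes the dependencies are called sequentially, so the overall budget is
# their sum plus headroom. For calls made in parallel, max() would apply.
API_TIMEOUT_S = DB_TIMEOUT_S + DOWNSTREAM_API_TIMEOUT_S + HEADROOM_S


async def handle_request(work, timeout_s=API_TIMEOUT_S):
    """Run the coroutine function `work` under a server-side deadline.

    Returns a (status, body) tuple.
    """
    try:
        # wait_for cancels the in-flight task once the deadline passes,
        # whether or not the client is still connected.
        body = await asyncio.wait_for(work(), timeout=timeout_s)
        return HTTPStatus.OK, body
    except asyncio.TimeoutError:
        # Timeouts get their own status code and are not reported as 500.
        return HTTPStatus.GATEWAY_TIMEOUT, "request timed out"
    except Exception:
        # Anything else that was not anticipated is a 500.
        return HTTPStatus.INTERNAL_SERVER_ERROR, "unexpected error"


async def slow_dependency():
    await asyncio.sleep(1)  # deliberately longer than the deadline used below
    return "done"


if __name__ == "__main__":
    status, body = asyncio.run(handle_request(slow_dependency, timeout_s=0.05))
    print(int(status), body)
```

In a real service, the server-side deadline would more likely come from the web framework or the gateway than from a hand-written wrapper. Either way, the overall budget should be derived from the dependency timeouts and not chosen independently of them.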
The checklist will be populated in all Jira stories across projects as a predefined NFR checklist, as shown in the screenshot below.