Tech Unpacked – Research & Fundamentals with Nitin Sharma

Tuesday, July 20, 2021

NFR (Non-Functional Requirement) testing in Distributed Systems

Before jumping into discussing resiliency in distributed systems, let’s quickly refresh some basic terminology:

Basic Terminology

Resiliency
The capacity of any system to recover from difficulties.

Distributed Systems
These are networked components which communicate with each other by
passing messages, most often to achieve a common goal.

Availability
Probability that any system is operating at time `t`.

Faults vs Failures

A fault is an incorrect internal state in your system.
Some common examples of faults in systems include:

1. Slowing down of storage layer
2. Memory leaks in application
3. Blocked threads
4. Dependency failures
5. Bad data propagating in the system (most often because there are not enough
validations on input data)

A failure, on the other hand, is the inability of the system to perform its intended job.

Failure means loss of up-time and availability. Faults, if not contained from propagating, can lead to failures.

Resiliency is all about preventing faults from turning into failures.

Why do we care about resiliency in our systems?

A system’s up-time and availability depend directly on its resiliency. The more resilient a system is, the more available it is to serve users.

Failing to be resilient can affect companies in many ways. Not being resilient can mean:

1. Financial losses for the company
2. Losing customers to competitors
3. Degraded services for customers

Resiliency in distributed systems is hard

We all understand that ‘being available’ is critical. And to be available, we need to build in resiliency from the ground up so that faults in our systems auto-heal.

But building resiliency into a complex micro-services architecture, with multiple distributed systems communicating with each other, is difficult.
Some of the things which make this hard are:

1. The network is unreliable
2. Dependencies can always fail
3. User behavior is unpredictable

Though building resiliency is hard, it’s not impossible. Following some
of the patterns while building distributed systems can help us achieve
high up-time across our services. We will discuss some of these patterns
going forward:

Pattern[0] = nocode

The best way to write reliable and secure applications is to write no code
at all. Write nothing and deploy nowhere. (Kelsey Hightower)

The most resilient piece of code you ever write will be the code you
never wrote.

The less code you write, the fewer ways it has to break.

Pattern[1] = Timeouts

Stop waiting for an answer.

Let’s consider this scenario:

You have a healthy service ‘A’ dependent on service ‘B’ for serving its requests. Service ‘B’ is affected and is slow.

The default Go HTTP client has no timeout. To handle every request, Go spawns a goroutine; when a downstream service is slow or has failed, that goroutine waits forever for the reply, so the application leaks goroutines. To avoid this problem, it’s important to add timeouts at every integration point in the application.

This helps you fail fast if any of your downstream services does not reply within, say, 1ms.
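
The scenario above is described for Go; the same fail-fast idea can be sketched in Python with asyncio. This is a minimal sketch, not the original service code: `slow_dependency` is a hypothetical stand-in for service ‘B’, and the delays are made up for illustration.

```python
import asyncio


async def slow_dependency(delay_s: float) -> str:
    """Hypothetical stand-in for service 'B'; sleeps to simulate a slow reply."""
    await asyncio.sleep(delay_s)
    return "reply from B"


async def call_with_timeout(delay_s: float, timeout_s: float) -> str:
    """Wait at most timeout_s for the dependency, then give up."""
    try:
        # wait_for cancels the pending call once the deadline passes,
        # so nothing is left waiting forever on the slow dependency.
        return await asyncio.wait_for(slow_dependency(delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        return "timed out"
```

Calling `asyncio.run(call_with_timeout(1.0, 0.05))` gives up after roughly 50ms instead of waiting the full second.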
Timeouts in an application help in the following ways:

Preventing cascading failures
Cascading failures are failures which propagate very quickly to other
parts of the system.

Timeouts help us prevent these failures by failing fast. When downstream
services fail or are slower (violating their SLA), instead of waiting for the answer forever, you fail early to save your system as well as the systems which are dependent on yours.

Providing failure isolation
Failure isolation is the concept of confining failures to only some part
of a system or a subsystem.

Timeouts give you failure isolation by not making some other
system’s problem your problem.

How should timeouts be set?
Timeouts must be based on the SLAs provided by your dependencies. For
example, this could be around the dependency’s 99.9th percentile latency.
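
As a rough illustration of deriving a timeout from observed latencies, the sketch below computes a nearest-rank percentile over a list of samples. The function names and the idea of feeding it recorded latencies in milliseconds are assumptions for the example, not something the SLA discussion above prescribes.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # round() guards against float noise such as 999.0000000000001 before ceil().
    rank = math.ceil(round(pct * len(ordered) / 100, 9))
    return ordered[min(max(rank, 1), len(ordered)) - 1]


def timeout_ms_for(samples_ms: Sequence[float], pct: float = 99.9) -> float:
    """Suggest a timeout equal to the chosen percentile of observed latencies."""
    return percentile(samples_ms, pct)
```

In practice the result would still be checked against the SLA your dependency actually commits to.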

Pattern[2] = Retries

If you fail once, try again

Retries can help reduce recovery time. They are very effective when
dealing with intermittent failures.

Retries work well in conjunction with timeouts: when you time out, you retry the request.
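
A minimal sketch of a retry wrapper follows. The exponential backoff with jitter is an addition of this example (the text above only says to retry), and the `retry` helper, its parameters, and the choice of `TimeoutError`/`ConnectionError` as retryable errors are all hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], attempts: int = 3, base_delay_s: float = 0.1,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            # Back off exponentially, with jitter so clients do not retry in lockstep.
            delay = base_delay_s * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so that tests can replace the real delay.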

Retrying immediately might not always be useful
Dependency failures take time to recover from, in which case retrying immediately could lead
to longer wait times for your users. To avoid these long wait times, we could queue these requests and retry them later wherever possible. For example, the system sends out an OTP SMS message when you try to log in. Instead of sending the SMS synchronously through our telecom providers, we queue these requests and retry them. This helps us decouple our systems from failures of our telecom providers.
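
The queue-and-retry idea can be sketched with an in-memory queue. This is only an illustration: a real system would use a durable queue, and `drain_queue`, the `(message, attempts)` tuple layout and the dead-letter list are assumptions of the example.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def drain_queue(pending: Deque[Tuple[str, int]], send: Callable[[str], None],
                max_attempts: int = 3) -> List[str]:
    """Try each queued message once; requeue failures until max_attempts is reached."""
    dead_letters = []
    for _ in range(len(pending)):  # one pass over the items queued right now
        message, attempts = pending.popleft()
        try:
            send(message)
        except ConnectionError:
            if attempts + 1 < max_attempts:
                pending.append((message, attempts + 1))  # retry on a later pass
            else:
                dead_letters.append(message)  # give up and hand over for follow-up
    return dead_letters
```

The caller that enqueues the OTP request returns immediately; a background worker calls `drain_queue` periodically.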

Idempotency is important
Wikipedia says:

Idempotence is the property of certain operations that they can be
applied multiple times without changing the result beyond the initial
application.

Consider a scenario in which a server processed a request but failed to reply with a result. In this case, the client retries the same operation. If the operation is not idempotent, it will lead to inconsistent states across your systems.

For example: non-idempotent operations in the booking creation
flow can lead to multiple bookings being created for the same user as well
as the same driver being allocated to multiple bookings.
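
One common way to make such a flow safe to retry is an idempotency key sent by the client. The sketch below is a toy, in-memory version; `BookingService` and its fields are hypothetical and not the actual booking system.

```python
import itertools
from typing import Dict


class BookingService:
    """Toy booking store that deduplicates requests by an idempotency key."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._by_key: Dict[str, dict] = {}

    def create_booking(self, idempotency_key: str, user: str, driver: str) -> dict:
        # A retried request carries the same key, so return the stored result
        # instead of creating a second booking.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        booking = {"id": next(self._ids), "user": user, "driver": driver}
        self._by_key[idempotency_key] = booking
        return booking
```

A client that timed out and retries with the same key gets the original booking back rather than a duplicate.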

Pattern[3] = Fallbacks

Degrade gracefully

When there are faults in your systems, they can choose to use alternative
mechanisms to respond with a degraded response instead of failing
completely.

The Curious case of Maps Service
We use the Google Maps service for a variety of reasons: calculating the route path of our customers from their pickup location to their destination, estimating fares, and so on. We have a Maps service which is the interface for all of our calls to Google. Initially, we used to have booking creation failures because of slowdowns in the Google Maps API service. Our systems were not fault tolerant against these increases in latency. This is what the route path looks like when systems are operating as expected:

[Figure: route path on the map when systems are operating as expected]

The solution we went with was to fall back to a route approximation for
routing. When this fallback kicks in, systems depending on the Maps service work in a degraded mode and the route on the map looks something like this:

[Figure: approximated route on the map in degraded mode]

Fallback in the above scenario helped us prevent catastrophic failures across our systems which were potentially affecting our critical booking flows.
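
A fallback of this kind might look like the sketch below. The straight-line (haversine) approximation is an assumption made for the example; the post does not say how the route approximation was actually computed, and `fetch_route` is a hypothetical client for the maps provider.

```python
import math
from typing import Callable, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def straight_line_km(a: Point, b: Point) -> float:
    """Great-circle (haversine) distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def route_with_fallback(fetch_route: Callable[[Point, Point], dict],
                        pickup: Point, dropoff: Point) -> dict:
    """Use the maps provider when it answers; otherwise return a degraded estimate."""
    try:
        return fetch_route(pickup, dropoff)
    except (TimeoutError, ConnectionError):
        # Degraded response: a two-point path and a straight-line distance.
        return {"path": [pickup, dropoff],
                "distance_km": straight_line_km(pickup, dropoff),
                "degraded": True}
```

The `degraded` flag lets callers decide whether an approximate route is good enough for their use.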

It is important to think of fallback at all the integration points.

Pattern[4] = Circuit Breakers

Trip the circuit to protect your dependencies

Circuit breakers are used in households to guard against sudden surges in current
and prevent the house from burning down. They trip the circuit and stop the flow of current.

This same concept could be applied to our distributed systems wherein you stop making calls to downstream services when you know that the system is unhealthy and failing and allow it to recover.

The state transitions of a typical circuit breaker (CB) look like this:

[Figure: circuit breaker state transitions between closed, open and half-open]

Initially, when systems are healthy, the CB is in the closed state. In this state, it makes calls to downstream services. When a certain number of requests fail, the CB trips the circuit and goes into the open state. In this state, the CB stops making requests to the failing downstream service. After a certain sleep threshold, the CB attempts a reset by going into the half-open state. If the next request in this state is successful, it goes to the closed state. If this call fails, it goes back to the open state.
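
The state machine described above can be sketched in a few lines. This is an illustrative toy, not Hystrix: it counts consecutive failures (one possible reading of "a certain number of requests fail"), and the class name, parameters and injected `clock` are assumptions of the example.

```python
import time
from typing import Callable


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, sleep_window_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.sleep_window_s = sleep_window_s
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, operation: Callable[[], object]) -> object:
        if self.state == "open":
            if self.clock() - self.opened_at < self.sleep_window_s:
                raise CircuitOpenError("dependency is marked unhealthy")
            self.state = "half_open"  # sleep window elapsed: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures while closed, opens the circuit.
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.state = "closed"
        self.failures = 0
        return result
```

While the circuit is open, callers get `CircuitOpenError` immediately and can serve a fallback instead.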

Hystrix by Netflix is a popular implementation of this pattern.

Circuit breakers are required at integration points. They help prevent cascading
failures and allow the failing service to recover. You can also add a fallback for the circuit breaker to use when it goes into the open state.

You also need good metrics and monitoring around this to detect state
transitions across the various integration points. Hystrix has dashboards
which help you visualize state transitions.

Pattern[5] = Resiliency Testing

Test to Break

It is important to simulate various failure conditions within your system, for example: various kinds of network failures, latencies in the network, and dependencies being slow or dead. After determining the failure modes, you codify them by creating some kind of test harness around them. These tests help you exercise some failure modes on every change to the code.

Injecting failures
Injecting failures into your system is a technique to induce faults purposefully to test resiliency. These kinds of failures help us exercise a lot of unknown unknowns in our architectures.
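
At the smallest scale, fault injection can be a wrapper around a dependency inside a test harness. The sketch below is a toy under stated assumptions: `inject_faults` and `handle_request` are hypothetical names, and a seeded `random.Random` keeps the injected failures repeatable.

```python
import random
from typing import Callable


def inject_faults(operation: Callable[[], str], failure_rate: float,
                  rng: random.Random) -> Callable[[], str]:
    """Wrap operation so that a fraction of calls fail before reaching it."""
    def wrapped() -> str:
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return operation()
    return wrapped


def handle_request(dependency: Callable[[], str]) -> str:
    """Code under test: must degrade instead of crashing when the dependency fails."""
    try:
        return dependency()
    except ConnectionError:
        return "fallback response"
```

A test then runs `handle_request` against the wrapped dependency many times and asserts that it never raises.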

Netflix has championed this approach with tools like Chaos Monkey and Latency Monkey, which are part of the Simian Army suite of applications.

In Conclusion:

Though following some of these patterns will help us achieve resiliency,
there is no silver bullet. Systems do fail, and the sad truth is we have
to deal with these failures. These patterns, if exercised, can help us achieve significant up-time and availability on our services.

Tuesday, June 22, 2021

API Security Testing - Cheat Sheet

REST evolved as Fielding wrote the HTTP/1.1 and URI specs and has been proven to be well-suited for developing distributed hypermedia applications. While REST is more widely applicable, it is most commonly used within the context of communicating with services via HTTP.

The key abstraction of information in REST is a resource. A REST API resource is identified by a URI, usually an HTTP URL. REST components use connectors to perform actions on a resource by using a representation to capture the current or intended state of the resource and transferring that representation.

The primary connector types are client and server; secondary connectors include cache, resolver and tunnel.

REST APIs are stateless. Stateful APIs do not adhere to the REST architectural style. State in the REST acronym refers to the state of the resource which the API accesses, not the state of a session within which the API is called. While there may be good reasons for building a stateful API, it is important to realize that managing sessions is complex and difficult to do securely.

Stateful services are out of scope of this Cheat Sheet: Passing state from client to backend, while making the service technically stateless, is an anti-pattern that should also be avoided as it is prone to replay and impersonation attacks.

In order to implement flows with REST APIs, resources are typically created, read, updated and deleted. For example, an ecommerce site may offer methods to create an empty shopping cart, to add items to the cart and to check out the cart. Each of these REST calls is stateless and the endpoint should check whether the caller is authorized to perform the requested operation.

Another key feature of REST applications is the use of standard HTTP verbs and error codes in the pursuit of removing unnecessary variation among different services.

Another key feature of REST applications is the use of HATEOAS or Hypermedia As The Engine of Application State. This provides REST applications a self-documenting nature making it easier for developers to interact with a REST service without prior knowledge.

HTTPS

Secure REST services must only provide HTTPS endpoints. This protects authentication credentials in transit, for example passwords, API keys or JSON Web Tokens. It also allows clients to authenticate the service and guarantees integrity of the transmitted data.

See the Transport Layer Protection Cheat Sheet for additional information.

Consider the use of mutually authenticated client-side certificates to provide additional protection for highly privileged web services.

Access Control

Non-public REST services must perform access control at each API endpoint. Web services in monolithic applications implement this by means of user authentication, authorisation logic and session management. This has several drawbacks for modern architectures which compose multiple microservices following the RESTful style.

In order to minimize latency and reduce coupling between services, the access control decision should be taken locally by REST endpoints.
User authentication should be centralised in an Identity Provider (IdP), which issues access tokens.

JWT

There seems to be a convergence towards using JSON Web Tokens (JWT) as the format for security tokens. JWTs are JSON data structures containing a set of claims that can be used for access control decisions. A cryptographic signature or message authentication code (MAC) can be used to protect the integrity of the JWT.

Ensure JWTs are integrity protected by either a signature or a MAC. Do not allow unsecured JWTs: {"alg":"none"}.
In general, signatures should be preferred over MACs for integrity protection of JWTs.

If MACs are used for integrity protection, every service that is able to validate JWTs can also create new JWTs using the same key. This means that all services using the same key have to mutually trust each other. Another consequence of this is that a compromise of any service also compromises all other services sharing the same key.

The relying party or token consumer validates a JWT by verifying its integrity and the claims it contains.

A relying party must verify the integrity of the JWT based on its own configuration or hard-coded logic. It must not rely on the information in the JWT header to select the verification algorithm.
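
The sketch below shows what pinning the algorithm can look like using only the Python standard library. It uses an HMAC (HS256) purely because that needs no third-party package; the advice above still prefers signatures, and production code would normally use a maintained JWT library. `EXPECTED_ALG` and `verify_hs256` are names made up for the example.

```python
import base64
import hashlib
import hmac
import json

EXPECTED_ALG = "HS256"  # fixed by the relying party's configuration, never taken from the token


def _b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def verify_hs256(token: str, key: bytes) -> dict:
    """Return the claims if the MAC checks out; raise ValueError otherwise."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        signature = _b64url_decode(signature_b64)
    except (ValueError, TypeError) as exc:
        raise ValueError("malformed token") from exc
    # The header may only confirm the configured algorithm; "none" or any
    # other value is rejected rather than acted upon.
    if not isinstance(header, dict) or header.get("alg") != EXPECTED_ALG:
        raise ValueError("unexpected algorithm")
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(key, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    return json.loads(_b64url_decode(payload_b64))
```

Note that the MAC is always computed with HMAC-SHA256 regardless of what the header says; the header check only rejects tokens that claim anything else.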

Some claims have been standardised and should be present in JWTs used for access control. At least the following standard claims should be verified:

iss or issuer - is this a trusted issuer? Is it the expected owner of the signing key?
aud or audience - is the relying party in the target audience for this JWT?
exp or expiration time - is the current time before the end of the validity period of this token?
nbf or not before time - is the current time after the start of the validity period of this token?
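
The four checks above can be sketched as a single function over already-decoded claims. This assumes the integrity check has been done first; `validate_claims` and its parameters are hypothetical, and the sketch treats `nbf` as optional while requiring `exp`.

```python
import time
from typing import Optional


def validate_claims(claims: dict, trusted_issuer: str, audience: str,
                    now: Optional[float] = None) -> None:
    """Raise ValueError unless iss, aud, exp and nbf are acceptable."""
    now = time.time() if now is None else now
    if claims.get("iss") != trusted_issuer:
        raise ValueError("untrusted issuer")
    aud = claims.get("aud")
    # The aud claim may be a single string or a list of strings.
    audiences = [aud] if isinstance(aud, str) else (aud or [])
    if audience not in audiences:
        raise ValueError("token not intended for this audience")
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or now >= exp:
        raise ValueError("token expired or exp missing")
    nbf = claims.get("nbf")
    if nbf is not None and (not isinstance(nbf, (int, float)) or now < nbf):
        raise ValueError("token not yet valid")
```

Passing `now` explicitly makes the time-based checks easy to test.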

As JWTs contain details of the authenticated entity (user etc.), a disconnect can occur between the JWT and the current state of the user’s session, for example, if the session is terminated earlier than the expiration time due to an explicit logout or an idle timeout. When an explicit session termination event occurs, a digest or hash of any associated JWTs should be submitted to a block list on the API which will invalidate that JWT for any requests until the expiration of the token. See the JSON_Web_Token_for_Java_Cheat_Sheet for further details.
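
A block list keyed by token digest could be sketched as follows. It is in-memory only and the class and method names are invented for the example; a real deployment would need shared storage so that every API instance sees the same list.

```python
import hashlib
from typing import Dict


class TokenBlocklist:
    """Remembers digests of revoked tokens until those tokens would expire anyway."""

    def __init__(self) -> None:
        self._expiry_by_digest: Dict[str, float] = {}

    @staticmethod
    def _digest(token: str) -> str:
        # Store a hash rather than the token itself.
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def revoke(self, token: str, expires_at: float) -> None:
        self._expiry_by_digest[self._digest(token)] = expires_at

    def is_revoked(self, token: str, now: float) -> bool:
        # Drop entries whose token has expired and would be rejected on exp anyway.
        self._expiry_by_digest = {d: e for d, e in self._expiry_by_digest.items() if e > now}
        return self._digest(token) in self._expiry_by_digest
```

On logout the API would call `revoke(token, exp)`, and every request would check `is_revoked` before trusting the token.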

API Keys

Public REST services without access control run the risk of being farmed, leading to excessive bills for bandwidth or compute cycles. API keys can be used to mitigate this risk. They are also often used by organisations to monetize APIs; instead of blocking high-frequency calls, clients are given access in accordance with a purchased access plan.

API keys can reduce the impact of denial-of-service attacks. However, when they are issued to third-party clients, they are relatively easy to compromise.

Require API keys for every request to the protected endpoint.
Return 429 Too Many Requests HTTP response code if requests are coming in too quickly.
Revoke the API key if the client violates the usage agreement.
Do not rely exclusively on API keys to protect sensitive, critical or high-value resources.
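
The first three rules can be illustrated with a small fixed-window limiter keyed by API key. Everything here is a sketch: `ApiKeyGate` is a made-up name, the limits are arbitrary, and returning 401 for an unknown key is an assumption of the example.

```python
from typing import Dict, Optional, Set, Tuple


class ApiKeyGate:
    """Checks the API key on every request and applies a fixed-window rate limit."""

    def __init__(self, valid_keys: Set[str], limit: int, window_s: float) -> None:
        self.valid_keys = set(valid_keys)
        self.limit = limit
        self.window_s = window_s
        self._windows: Dict[str, Tuple[float, int]] = {}  # key -> (window start, count)

    def check(self, api_key: Optional[str], now: float) -> int:
        """Return the HTTP status the endpoint should answer with."""
        if api_key not in self.valid_keys:
            return 401  # missing, unknown or revoked key
        start, count = self._windows.get(api_key, (now, 0))
        if now - start >= self.window_s:
            start, count = now, 0  # a new window begins
        if count >= self.limit:
            return 429  # Too Many Requests
        self._windows[api_key] = (start, count + 1)
        return 200

    def revoke(self, api_key: str) -> None:
        self.valid_keys.discard(api_key)
```

Per-key limits of this kind are also where a purchased access plan would plug in.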

Restrict HTTP methods

Apply an allow list of permitted HTTP Methods e.g. GET, POST, PUT.
Reject all requests not matching the allow list with HTTP response code 405 Method Not Allowed.
Make sure the caller is authorised to use the incoming HTTP method on the resource collection, action, and record.
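
As a framework-agnostic sketch of these three rules, the function below checks a per-route allow list and then the caller’s permissions. The routes, the `ALLOWED_METHODS` table and the shape of `caller_permissions` are all hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical allow list: which methods each route accepts at all.
ALLOWED_METHODS: Dict[str, Set[str]] = {
    "/carts": {"GET", "POST"},
    "/carts/{id}": {"GET", "PUT", "DELETE"},
}


def check_method(route: str, method: str, caller_permissions: Set[Tuple[str, str]]) -> int:
    """Return the HTTP status for this route, method and caller."""
    method = method.upper()
    allowed = ALLOWED_METHODS.get(route)
    if allowed is None:
        return 404
    if method not in allowed:
        return 405  # Method Not Allowed
    if (route, method) not in caller_permissions:
        return 403  # method exists, but this caller may not use it
    return 200
```

Keeping the allow list separate from the authorisation check means an unexpected verb is rejected before any permission logic runs.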

In Java EE in particular, this can be difficult to implement properly. See Bypassing Web Authentication and Authorization with HTTP Verb Tampering for an explanation of this common misconfiguration.

Input validation

Do not trust input parameters/objects.
Validate input: length / range / format and type.
Achieve an implicit input validation by using strong types like numbers, booleans, dates, times or fixed data ranges in API parameters.
Constrain string inputs with regexps.
Reject unexpected/illegal content.
Make use of validation/sanitization libraries or frameworks in your specific language.
Define an appropriate request size limit and reject requests exceeding the limit with HTTP response status 413 Payload Too Large.
Consider logging input validation failures. Assume that someone who is performing hundreds of failed input validations per second is up to no good.
Have a look at input validation cheat sheet for comprehensive explanation.
Use a secure parser for parsing the incoming messages. If you are using XML, make sure to use a parser that is not vulnerable to XXE and similar attacks.
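
A few of these rules (size limit, type, range and regular-expression checks) can be sketched by hand as below. The field names, the limits and `validate_signup` itself are invented for the example; in practice a validation library for your framework would usually do this work.

```python
import re
from typing import Tuple

MAX_BODY_BYTES = 10_000                           # hypothetical request size limit
USERNAME_RE = re.compile(r"[a-z0-9_]{3,20}")      # hypothetical username format


def validate_signup(raw_body: bytes, fields: dict) -> Tuple[int, str]:
    """Return (HTTP status, message) for a parsed request body."""
    if len(raw_body) > MAX_BODY_BYTES:
        return 413, "request too large"
    username = fields.get("username")
    # fullmatch rejects anything with extra characters before or after the pattern.
    if not isinstance(username, str) or not USERNAME_RE.fullmatch(username):
        return 400, "invalid username"
    age = fields.get("age")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if not isinstance(age, int) or isinstance(age, bool) or not 13 <= age <= 120:
        return 400, "invalid age"
    return 200, "ok"
```

The error messages stay generic on purpose and do not echo the rejected input back to the caller.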

Validate content types

A REST request or response body should match the intended content type in the header. Otherwise this could cause misinterpretation at the consumer/producer side and lead to code injection/execution.

Document all supported content types in your API.

Validate request content types

Reject requests containing unexpected or missing content type headers with HTTP response status 406 Not Acceptable or 415 Unsupported Media Type.
For XML content types ensure appropriate XML parser hardening, see the XXE cheat sheet.
Avoid accidentally exposing unintended content types by explicitly defining content types, e.g. Jersey (Java) @Consumes("application/json"); @Produces("application/json"). This avoids XXE-attack vectors for example.
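
Outside of a framework that declares content types for you, the request-side check can be sketched as below. `SUPPORTED_REQUEST_TYPES` and `check_request_content_type` are hypothetical names, and the sketch assumes the headers arrive in a dict keyed by their canonical names.

```python
SUPPORTED_REQUEST_TYPES = {"application/json"}  # hypothetical: the only type this API consumes


def check_request_content_type(headers: dict) -> int:
    """Return 200 if the declared request content type is supported, else 415."""
    raw = headers.get("Content-Type")
    if not raw:
        return 415  # a missing header is rejected, not guessed
    # Ignore parameters such as "; charset=utf-8" and compare case-insensitively.
    media_type = raw.split(";", 1)[0].strip().lower()
    return 200 if media_type in SUPPORTED_REQUEST_TYPES else 415
```

Because XML is simply not in the supported set, an XML body never reaches a parser.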

Send safe response content types

It is common for REST services to allow multiple response types (e.g. application/xml or application/json), and the client specifies the preferred order of response types by the Accept header in the request.

Do NOT simply copy the Accept header to the Content-Type header of the response.
Reject the request (ideally with a 406 Not Acceptable response) if the Accept header does not specifically contain one of the allowable types.

Services including script code (e.g. JavaScript) in their responses must be especially careful to defend against header injection attack.

Ensure the content type header you send matches your body content, e.g. application/json and not application/javascript.
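
One way to avoid echoing the Accept header is to pick the response type from a server-side list, as in the sketch below. It deliberately ignores q-values and wildcards to stay short, and `PRODUCIBLE` and `choose_response_type` are hypothetical names.

```python
from typing import Optional

PRODUCIBLE = ("application/json", "application/xml")  # hypothetical: types this API can produce


def choose_response_type(accept_header: str) -> Optional[str]:
    """Return one of PRODUCIBLE, or None when the request should get 406 Not Acceptable."""
    requested = [part.split(";", 1)[0].strip().lower() for part in accept_header.split(",")]
    for media_type in requested:  # walk the client's list in the order it was sent
        if media_type in PRODUCIBLE:
            # The returned value is a normalised match from our own list,
            # never the raw text of the client's header.
            return media_type
    return None
```

The caller sets Content-Type from the returned value and answers 406 when it is `None`.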

Management endpoints

Avoid exposing management endpoints via the Internet.
If management endpoints must be accessible via the Internet, make sure that users must use a strong authentication mechanism, e.g. multi-factor.
Expose management endpoints via different HTTP ports or hosts preferably on a different NIC and restricted subnet.
Restrict access to these endpoints by firewall rules or use of access control lists.

Error handling

Respond with generic error messages - avoid revealing details of the failure unnecessarily.
Do not pass technical details (e.g. call stacks or other internal hints) to the client.
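
A common way to follow both rules is to log the details server-side and return only a generic message with a reference ID. The sketch below assumes a logger named "api" and an `incident_id` field, neither of which comes from the text above.

```python
import logging
import uuid

logger = logging.getLogger("api")


def error_response(exc: Exception) -> dict:
    """Log the full exception server-side; give the client only a generic message."""
    incident_id = uuid.uuid4().hex
    # The exception details and any traceback go to the server log only.
    logger.error("unhandled error, incident %s", incident_id, exc_info=exc)
    return {
        "status": 500,
        "body": {"error": "Internal server error", "incident_id": incident_id},
    }
```

Support staff can look up the incident ID in the logs without the client ever seeing the technical details.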

Audit logs

Write audit logs before and after security related events.
Consider logging token validation errors in order to detect attacks.
Take care of log injection attacks by sanitising log data beforehand.
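
Log injection usually relies on line breaks or other control characters in user-supplied values. A minimal sanitiser, offered as a sketch with a hypothetical function name, escapes every non-printable character before the value is written:

```python
def sanitize_for_log(value: str) -> str:
    """Escape non-printable characters so a value cannot start a forged log line."""
    return "".join(ch if ch.isprintable() else f"\\x{ord(ch):02x}" for ch in value)
```

For example, a username of `"alice\nINFO admin login ok"` is logged on a single line with the newline shown as `\x0a`.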

Security Headers

There are a number of security related headers that can be returned in the HTTP responses to instruct browsers to act in specific ways. However, some of these headers are intended to be used with HTML responses, and as such may provide little or no security benefits on an API that does not return HTML.

The following headers should be included in all API responses:

Header	Rationale
`Cache-Control: no-store`	Prevent sensitive information from being cached.
`Content-Security-Policy: frame-ancestors 'none'`	To protect against drag-and-drop style clickjacking attacks.
`Content-Type`	To specify the content type of the response. This should be `application/json` for JSON responses.
`Strict-Transport-Security`	To require connections over HTTPS and to protect against spoofed certificates.
`X-Content-Type-Options: nosniff`	To prevent browsers from performing MIME sniffing, and inappropriately interpreting responses as HTML.
`X-Frame-Options: DENY`	To protect against drag-and-drop style clickjacking attacks.
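
The table above can be applied in one place, for example in a response hook. The sketch below is framework-agnostic; the `max-age` value for Strict-Transport-Security is an assumption (the table gives no value), and `with_security_headers` is a hypothetical helper.

```python
API_SECURITY_HEADERS = {
    "Cache-Control": "no-store",
    "Content-Security-Policy": "frame-ancestors 'none'",
    "Content-Type": "application/json",
    "Strict-Transport-Security": "max-age=31536000",  # assumed value: one year
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
}


def with_security_headers(headers: dict) -> dict:
    """Return a copy of headers with any missing security headers filled in."""
    merged = dict(headers)
    for name, value in API_SECURITY_HEADERS.items():
        # Keep a value the handler set on purpose (e.g. a different Content-Type).
        merged.setdefault(name, value)
    return merged
```

Applying the defaults centrally means individual handlers cannot forget them.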

The headers below are only intended to provide additional security when responses are rendered as HTML. As such, if the API will never return HTML in responses, then these headers may not be necessary. However, if there is any uncertainty about the function of the headers, or the types of information that the API returns (or may return in future), then it is recommended to include them as part of a defence-in-depth approach.

Header	Rationale
`Content-Security-Policy: default-src 'none'`	The majority of CSP functionality only affects pages rendered as HTML.
`Feature-Policy: 'none'`	Feature policies only affect pages rendered as HTML.
`Referrer-Policy: no-referrer`	Non-HTML responses should not trigger additional requests.

CORS

Cross-Origin Resource Sharing (CORS) is a W3C standard to flexibly specify what cross-domain requests are permitted. By delivering appropriate CORS Headers your REST API signals to the browser which domains, AKA origins, are allowed to make JavaScript calls to the REST service.

Disable CORS headers if cross-domain calls are not supported/expected.
Be as specific as possible and as general as necessary when setting the origins of cross-domain calls.
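
Being specific about origins typically means an explicit allow list rather than a wildcard. In the sketch below the origin and the `cors_headers` helper are hypothetical; an empty result means no CORS headers are sent at all.

```python
from typing import Dict, Optional

ALLOWED_ORIGINS = {"https://app.example.com"}  # hypothetical list of trusted origins


def cors_headers(origin: Optional[str]) -> Dict[str, str]:
    """Return CORS headers for an allowed origin, or no CORS headers at all."""
    if origin in ALLOWED_ORIGINS:
        # Echo only origins that matched the allow list exactly, and tell caches
        # that the response varies by Origin.
        return {"Access-Control-Allow-Origin": origin, "Vary": "Origin"}
    return {}
```

With an empty `ALLOWED_ORIGINS`, the same function effectively disables CORS.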

Sensitive information in HTTP requests

RESTful web services should be careful to prevent leaking credentials. Passwords, security tokens, and API keys should not appear in the URL, as URLs can be captured in web server logs, which makes those logs intrinsically valuable.

In POST/PUT requests sensitive data should be transferred in the request body or request headers.
In GET requests sensitive data should be transferred in an HTTP Header.

OK:

https://example.com/resourceCollection/[ID]/action

https://twitter.com/vanderaj/lists

NOT OK:

https://example.com/controller/123/action?apiKey=a53f435643de32 because the API key is in the URL.

HTTP Return Code

HTTP defines status codes. When designing a REST API, don't just use 200 for success or 404 for error. Always use the semantically appropriate status code for the response.

Here is a non-exhaustive selection of security related REST API status codes. Use it to ensure you return the correct code.

Code	Message	Description
200	OK	Response to a successful REST API action. The HTTP method can be GET, POST, PUT, PATCH or DELETE.
201	Created	The request has been fulfilled and resource created. A URI for the created resource is returned in the Location header.
202	Accepted	The request has been accepted for processing, but processing is not yet complete.
301	Moved Permanently	Permanent redirection.
304	Not Modified	Caching related response that is returned when the client has the same copy of the resource as the server.
307	Temporary Redirect	Temporary redirection of resource.
400	Bad Request	The request is malformed, such as message body format error.
401	Unauthorized	Wrong or no authentication ID/password provided.
403	Forbidden	Used when authentication succeeded but the authenticated user doesn't have permission for the requested resource.
404	Not Found	When a non-existent resource is requested.
405	Method Not Allowed	The error for an unexpected HTTP method. For example, the REST API is expecting HTTP GET, but HTTP PUT is used.
406	Not Acceptable	The client presented a content type in the Accept header which is not supported by the server API.
413	Payload Too Large	Use it to signal that the request size exceeded the given limit e.g. regarding file uploads.
415	Unsupported Media Type	The requested content type is not supported by the REST service.
429	Too Many Requests	Used when a possible DoS attack is detected or the request is rejected due to rate limiting.
500	Internal Server Error	An unexpected condition prevented the server from fulfilling the request. Be aware that the response should not reveal internal information that helps an attacker, e.g. detailed error messages or stack traces.
501	Not Implemented	The REST service does not implement the requested operation yet.
503	Service Unavailable	The REST service is temporarily unable to process the request. Used to inform the client it should retry at a later time.
