Thursday, July 22, 2021

API Security Testing - How to Hack Your APIs (and How to Stop It): Part 1

 So, you’ve created an exhaustive regression test suite for your APIs that runs as part of your continuous build and deploy process. You’ve run and even automated (cool!) load-tests that simulate magnitudes more users than your API will probably ever (but maybe) have. You’ve set up monitors that will catch any bug that sneaks past all these lines of defense. Hey - you’ve even automated the validation of the metadata that gets generated for your API every time you commit some changes to your code (high five)! Your API is ready for Primetime! (…or not.)

You probably know where this is going – but it’s somebody else’s problem, right? Isn’t there a CSO (Chief Security Officer) at your company who has this covered, with a long list of API security tools? Aren’t you using the latest updates to your frameworks? How could there be any security flaws in them? They are surely written by super-smart developers who avoid SQL Injection attacks just as they would avoid crossing the street on a red light. And your API management vendor uses the latest OAuth implementation, with tokens and nonces flying through the ether like bats in the night. All this talk about API Security is just a scare by vendors that want to sell you more tools. Right?

But deep down you know that API Security is something you need to take seriously – just like Facebook, SnapChat, Twitter, Bitly, Sony, Microsoft, Tinder, Apple, NBC, Evernote and many others decidedly did not. Nobody is going to bail you out if your customers’ credit card numbers are stolen, or your customers’ users’ personal dating data is published on a torrent website. And deep down you’re right.

So what to do? Just like you do when validating functionality and performance, try to break things – put your hacker cloak on and make the developers of your API (you?) shiver as you approach for the attack. And since even hackers need a little structure to their doings – let’s attempt to break this down somewhat – you wouldn’t want to fail at hacking your API, would you?

1) Know Thy Target

If you’re going to attack an API, then you must understand its perimeters… because the gate is where you often sneak in the Trojan horse.

  • HTTP: Most APIs today use the HTTP protocol, which goes for both REST and SOAP. HTTP is a text-based protocol, which fortunately makes it very easy to read. Take, for example, the following HTTP request:

[Image: a sample HTTP request]

and the corresponding response:

[Image: the corresponding HTTP response]

As you can see – the Request and Status lines, Request and Response Headers, and Request/Response messages are all plain text – easily readable, and easily customizable for performing a security attack.

  • Message Formats: Messages sent over the web use some message format. JSON is predominant in the REST world, while XML is mandatory in the SOAP world. Understand these formats (they’re easy too!) and how their peculiarities can be used to form an attack (we’ll get back to that later). And of course most formats – PDF, image formats like JPG and PNG, etc. – can open the door to vulnerabilities if handled incorrectly. The sketch right after this list shows how easily a hand-crafted request with a hostile JSON body can be scripted.
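To make this concrete, here is a minimal sketch in Go of probing an endpoint with a hand-crafted request and a deliberately hostile JSON body. The target URL, headers and field names are made up for illustration – point this only at your own API.

    package main

    import (
        "bytes"
        "fmt"
        "net/http"
        "strings"
    )

    func main() {
        // A deliberately hostile JSON body: an overlong field probes missing input
        // validation, and deeply nested arrays can trip up naive parsers.
        nested := strings.Repeat("[", 10000) + strings.Repeat("]", 10000)
        body := fmt.Sprintf(`{"name":%q,"payload":%s}`, strings.Repeat("A", 65536), nested)

        // Everything below is plain text, so any part of the request can be tampered with.
        req, err := http.NewRequest("POST", "https://api.example.com/v1/users", bytes.NewBufferString(body))
        if err != nil {
            panic(err)
        }
        req.Header.Set("Content-Type", "application/json")
        req.Header.Set("X-Forwarded-For", "127.0.0.1") // spoofed client address header

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            fmt.Println("request failed:", err)
            return
        }
        defer resp.Body.Close()

        // A 500, a stack trace in the body, or a long hang are all hints that the
        // API does not handle unexpected input gracefully.
        fmt.Println("status:", resp.Status)
    }

Run variations of this against your own endpoints and watch how the API reacts to input it was never designed to receive.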

2) There is api security, and there is API Security

Security is a vague term; claiming an API is secure because it uses SSL or OAuth is false – there is more to an API than its transport layer (although admittedly SSL goes a long way):

  • Different Authorization/Authentication standards are at play for REST and SOAP: OAuth 1.x and 2.x, SAML, WS-Security, OpenID Connect, etc.
  • SSL is great for transport-level security – but what if one’s message data needs to be encrypted (so no one can read it) or signed (so you can be sure it hasn’t been tampered with) beyond the transport itself? Perhaps you should be encrypting credit card numbers or sensitive customer data in your NoSQL database so that it’s useless if it should come into the wrong hands? SOAP APIs can shine in this regard; WS-Security is a mature and comprehensive standard handling most of these requirements. REST APIs, on the other hand, have to rely on younger initiatives like JWT (JSON Web Tokens) or homegrown solutions (a minimal signing sketch follows this list).
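As a rough illustration of message-level protection – not WS-Security or JWT, just a generic sketch assuming both parties share a secret key – the following Go snippet signs a payload with HMAC-SHA256 so the receiver can detect tampering regardless of what happened on the transport:

    package main

    import (
        "crypto/hmac"
        "crypto/sha256"
        "encoding/hex"
        "fmt"
    )

    // sign computes an HMAC-SHA256 signature over the message body with a shared
    // secret, so the receiver can detect tampering independently of the transport.
    func sign(secret, body []byte) string {
        mac := hmac.New(sha256.New, secret)
        mac.Write(body)
        return hex.EncodeToString(mac.Sum(nil))
    }

    // verify recomputes the signature and compares it in constant time.
    func verify(secret, body []byte, signature string) bool {
        expected, err := hex.DecodeString(signature)
        if err != nil {
            return false
        }
        mac := hmac.New(sha256.New, secret)
        mac.Write(body)
        return hmac.Equal(mac.Sum(nil), expected)
    }

    func main() {
        secret := []byte("shared-secret")             // hypothetical shared key
        body := []byte(`{"card":"4111111111111111"}`) // the message to protect

        sig := sign(secret, body)
        fmt.Println("signature:", sig)

        // Flip one character in the body and the signature no longer verifies.
        tampered := []byte(`{"card":"4111111111111112"}`)
        fmt.Println("original verifies:", verify(secret, body, sig))
        fmt.Println("tampered verifies:", verify(secret, tampered, sig))
    }

As an attacker, the interesting question is what the API does when the signature is missing, truncated or simply wrong.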

As a hacker, you will be looking for these standards to be used improperly – or not at all where they should be. Perhaps getting access to someone’s credit card numbers is as simple as reusing a session token to get an authenticated user’s account information that isn’t encrypted in the message itself (more on incorrect session logic in a later post).

3) API Attack Surface Detection

Now that you’ve mastered the basics of web APIs and you’ve decided on an API to attack (your own API - don’t lose focus), you need to know where launch the attack; what is the “Attack Surface” of your API?

This can be tricky. Finding an Attack Surface for a UI-based solution (for example a web or mobile app) is straightforward: you can actually see the different input fields, buttons, file-uploads, etc. all waiting to be targeted during an attack. For an API, things are different - there is no UI to look at, just an API endpoint. But to launch a “successful” attack on an API, we need to know as much as possible about the API’s endpoints, messages, parameters and behavior. The more we know, the merrier our attack will be.

Fortunately, there are a number of “helpful” API technologies out there to facilitate our malignancies:

  • API Metadata and documentation has a lot of momentum currently; API providers are putting strong efforts into providing API consumers with detailed technical descriptions of an API, including all we need for our attack - paths, parameters, message formats, etc. Several standards are at play:

    • Swagger, RAML, API-Blueprint, I/O Docs, etc for REST APIs
    • WSDL/XML-Schema for SOAP APIs
    • JSON-LD, Siren, Hydra, etc for Hypermedia APIs

Have a look at the following Swagger definition for example:

[Image: an example Swagger definition]

As you can see, a helpful Swagger specification also tells us a lot about an API’s possible vulnerabilities, helping us target the attack; the sketch after this list shows how such a spec can be mined programmatically.

  • API Discovery: what if you have no metadata for the API you want to compromise? An alternative way to get an initial attack surface is to record interactions with the API using an existing client. For example, you might have an API consumed by a mobile app; set up a local recording proxy (there are several free options available) and direct your mobile phone to use this proxy when accessing the API – all calls will be recorded and give you an understanding of the API’s usage (paths, parameters, etc). There are even tools out there that can take recorded traffic and generate a metadata specification for you. As a hacker, it’s just as useful to you as it is to developers or honest testers.
  • Brute Force: full disclosure: most developers aren’t famed for their creativity when deciding on API paths, arguments, etc. More often than not, you can guess at an API’s paths like /api, /api/v1, /apis.json, etc. – which might at least give you something to start with. And if the target API is a Hypermedia API, then you’re in luck; Hypermedia APIs strive to return possible links and parameters related to an API response with the response itself, which for a hacker means that it will nicely tell you about all its attack surfaces as you consume it.
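The sketch below (in Go, with a hypothetical target URL and a made-up spec location) combines the last two ideas: it tries to enumerate the attack surface from a Swagger 2.0-style definition, and falls back to guessing common paths when no metadata is published.

    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "time"
    )

    // swaggerDoc captures just enough of a Swagger 2.0 / OpenAPI document to list
    // its paths; real specs contain far more detail (parameters, schemas, etc.).
    type swaggerDoc struct {
        BasePath string                     `json:"basePath"`
        Paths    map[string]json.RawMessage `json:"paths"`
    }

    var client = &http.Client{Timeout: 5 * time.Second}

    // pathsFromSwagger tries to download a spec and enumerate its endpoints.
    func pathsFromSwagger(specURL string) ([]string, error) {
        resp, err := client.Get(specURL)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()

        var doc swaggerDoc
        if err := json.NewDecoder(resp.Body).Decode(&doc); err != nil {
            return nil, err
        }
        var paths []string
        for p := range doc.Paths {
            paths = append(paths, doc.BasePath+p)
        }
        return paths, nil
    }

    // guessPaths probes a few unimaginative but common locations.
    func guessPaths(base string) []string {
        candidates := []string{"/api", "/api/v1", "/api/v2", "/apis.json", "/swagger.json"}
        var found []string
        for _, p := range candidates {
            resp, err := client.Get(base + p)
            if err != nil {
                continue
            }
            resp.Body.Close()
            if resp.StatusCode != http.StatusNotFound {
                found = append(found, fmt.Sprintf("%s (%d)", p, resp.StatusCode))
            }
        }
        return found
    }

    func main() {
        base := "https://api.example.com" // hypothetical target: your own API
        if paths, err := pathsFromSwagger(base + "/swagger.json"); err == nil {
            fmt.Println("paths from metadata:", paths)
            return
        }
        fmt.Println("no metadata found, guessing:", guessPaths(base))
    }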

So now you’re all set with core API technologies, security standards and your API’s Attack Surface. You know what API to strike and where to hit, but how do you make your attack?

Tuesday, July 20, 2021

NFR (Non-Functional Requirement) testing in Distributed Systems

Before jumping into discussing resiliency in distributed systems, let’s quickly refresh some basic terminology:

Basic Terminology

Resiliency
The capacity of any system to recover from difficulties.

Distributed Systems
These are networked components which communicate with each other by passing messages, most often to achieve a common goal.

Availability
Probability that any system is operating at time `t`.

Faults vs Failures

A fault is an incorrect internal state in your system.
Some common examples of faults in systems include:

1. Slowing down of storage layer
2. Memory leaks in application
3. Blocked threads
4. Dependency failures
5. Bad data propagating in the system (most often because there is not enough validation of input data)

A failure, on the other hand, is the inability of the system to perform its intended job.

Failure means loss of up-time and availability on systems. Faults, if not contained from propagating, can lead to failures.

Resiliency is all about preventing faults from turning into failures

Why do we care about resiliency in our systems ?

The resiliency of a system is directly proportional to its up-time and availability. The more resilient a system is, the more available it is to serve users.

Failing to be resilient can affect companies in many ways.

Not being resilient means:

1. Financial losses for the company
2. Losing customers to competitors
3. Degraded services for customers

Resiliency in distributed systems is hard

We all understand that ‘being available’ is critical. And to be available, we need to build in resiliency from ground up so that faults in our systems auto-heal.

But building resiliency in a complex micro-services architecture with multiple distributed systems communicating with each other is difficult.
Some of the things which make this hard are:

1. The network is unreliable
2. Dependencies can always fail
3. User behavior is unpredictable

Though building resiliency is hard, it’s not impossible. Following some
of the patterns while building distributed systems can help us achieve
high up-time across our services. We will discuss some of these patterns
going forward:

Pattern[0] = nocode

The best way to write reliable and secure applications is to write no code at all — Write Nothing and Deploy Nowhere. — Kelsey Hightower

The most resilient piece of code you ever write will be the code you never wrote. The less code you write, the fewer reasons there are for it to break.

Pattern[1] = Timeouts

Stop waiting for an answer.

Let’s consider this scenario:

You have a healthy service ‘A’ dependent on service ‘B’ for serving its requests. Service ‘B’ is affected and is slow.

The default Go HTTP client has no timeout. This can cause your application to leak goroutines (Go spawns a goroutine to handle every request): when a downstream service is slow or has failed, the goroutine waits forever for the reply. To avoid this problem, it’s important to add timeouts at every integration point in our application.

This will help you fail fast if any of your downstream services does not reply back within, say, 1ms.
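A minimal Go sketch of this, assuming a hypothetical downstream URL; the only important line is the Timeout on the client:

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    func main() {
        // Never use http.DefaultClient for calls to downstream services: it has
        // no timeout and will happily wait forever on a slow dependency.
        client := &http.Client{
            Timeout: 1 * time.Second, // total time allowed for the whole request
        }

        resp, err := client.Get("https://service-b.internal/api/resource") // hypothetical dependency
        if err != nil {
            // A timeout surfaces as an error here; fail fast instead of
            // letting goroutines pile up behind a slow service B.
            fmt.Println("failing fast:", err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }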
Timeouts in an application can help in the following ways:

Preventing cascading failures
Cascading failures are failures which propagate very quickly to other
parts of the system.

Timeouts help us prevent these failures by failing fast. When downstream services fail or are slow (violating their SLA), instead of waiting for the answer forever, you fail early to save your system as well as the systems that depend on yours.

Providing failure isolation
Failure isolation is the concept of isolating failures to only some part
of a system or a sub system.

Timeouts allow you to have failure isolation by not making some other system’s problem your problem.

How should timeouts be set ?
Timeouts must be based on the SLAs provided by your dependencies. For example, this could be around the dependency’s 99.9th percentile response time.

Pattern[2] = Retries

If you fail once, try again

Retries can help reduce recovery time. They are very effective when
dealing with intermittent failures.

Retries work well in conjunction with timeouts: when you time out, you retry the request.
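A small Go sketch of this combination, with a hypothetical downstream URL and arbitrary retry/backoff numbers:

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    var client = &http.Client{Timeout: 1 * time.Second}

    // getWithRetries retries a timed-out or failed GET a few times, backing off
    // between attempts so a struggling dependency gets room to recover.
    func getWithRetries(url string, attempts int) (*http.Response, error) {
        backoff := 100 * time.Millisecond
        var lastErr error
        for i := 0; i < attempts; i++ {
            resp, err := client.Get(url)
            if err == nil && resp.StatusCode < 500 {
                return resp, nil
            }
            if err == nil {
                resp.Body.Close()
                lastErr = fmt.Errorf("server error: %s", resp.Status)
            } else {
                lastErr = err
            }
            time.Sleep(backoff)
            backoff *= 2 // exponential backoff: 100ms, 200ms, 400ms, ...
        }
        return nil, fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
    }

    func main() {
        resp, err := getWithRetries("https://service-b.internal/api/resource", 3) // hypothetical URL
        if err != nil {
            fmt.Println(err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }

Note that blindly retrying is only safe when the operation is idempotent, which is exactly why the idempotency discussion below matters.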

Retrying immediately might not always be useful
Dependency failures take time to recover from, in which case retrying could lead to longer wait times for your users. To avoid these long wait times, we could potentially queue and retry these requests wherever possible. For example, the system sends out an OTP SMS when you try to log in. Instead of trying to send SMSes synchronously via our telecom providers, we queue these requests and retry them. This helps us decouple our systems from failures of our telecom providers.

Idempotency is important
Wikipedia says:

Idempotence is the property of certain operations that they can be
applied multiple times without changing the result beyond the initial
application.

Consider a scenario in which a request was processed by some server, but the server failed to reply back with a result. In this case, the client retries the same operation. If the operation is not idempotent, this will lead to inconsistent states across your systems.

For example: non-idempotent operations in the booking creation
flow can lead to multiple bookings being created for the same user as well
as the same driver being allocated to multiple bookings.
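One common way to get this property is to have the client send an idempotency key with each logical operation and have the server deduplicate on it. A minimal in-memory Go sketch follows; the booking type, key format and store are made up for illustration, a real system would persist the keys.

    package main

    import (
        "fmt"
        "sync"
    )

    // Booking and the in-memory store are illustrative stand-ins for a real
    // persistence layer; the point is the dedup-by-key behaviour.
    type Booking struct {
        ID  string
        Key string
    }

    type BookingService struct {
        mu     sync.Mutex
        byKey  map[string]Booking // idempotency key -> previously created booking
        nextID int
    }

    func NewBookingService() *BookingService {
        return &BookingService{byKey: make(map[string]Booking)}
    }

    // CreateBooking returns the existing booking if the same idempotency key has
    // been seen before, so client retries never create duplicates.
    func (s *BookingService) CreateBooking(idempotencyKey string) Booking {
        s.mu.Lock()
        defer s.mu.Unlock()

        if b, ok := s.byKey[idempotencyKey]; ok {
            return b // retry of a request we already processed
        }
        s.nextID++
        b := Booking{ID: fmt.Sprintf("booking-%d", s.nextID), Key: idempotencyKey}
        s.byKey[idempotencyKey] = b
        return b
    }

    func main() {
        svc := NewBookingService()
        first := svc.CreateBooking("rider-42-attempt-1")
        retry := svc.CreateBooking("rider-42-attempt-1") // client retried after a lost response
        fmt.Println(first.ID == retry.ID)                // true: no duplicate booking
    }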

Pattern[3] = Fallbacks

Degrade gracefully

When there are faults in your systems, they can choose to use alternative
mechanisms to respond with a degraded response instead of failing
completely.

The Curious Case of the Maps Service
We use the Google Maps service for a variety of reasons: to calculate the route path of our customers from their pickup location to their destination, to estimate fares, etc. We have a Maps service which acts as an interface for all of our calls to Google. Initially, we used to have booking creation failures because of slowdowns in the Google Maps API service; our systems were not fault tolerant against these increases in latency. Under normal operation, the route shown on the map follows the actual road path.

The solution we went with was to fall back to a route approximation. When this fallback kicks in, systems depending on the Maps service work in a degraded mode, and the route shown on the map is a rough approximation rather than the exact road path.

The fallback in the above scenario helped us prevent catastrophic failures across our systems, failures which could have affected our critical booking flows.

It is important to think of fallbacks at all the integration points.
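A stripped-down Go sketch of the idea, with made-up service names and a straight-line (haversine) distance standing in for whatever approximation your fallback actually uses:

    package main

    import (
        "errors"
        "fmt"
        "math"
    )

    type Point struct{ Lat, Lng float64 }

    // routeDistance would normally call the external Maps service; here it just
    // fails, to show the fallback path. Real code would make an HTTP call with a timeout.
    func routeDistance(from, to Point) (float64, error) {
        return 0, errors.New("maps service timed out")
    }

    // haversineKM is a crude straight-line approximation used as a degraded response.
    func haversineKM(from, to Point) float64 {
        const earthRadiusKM = 6371
        dLat := (to.Lat - from.Lat) * math.Pi / 180
        dLng := (to.Lng - from.Lng) * math.Pi / 180
        lat1 := from.Lat * math.Pi / 180
        lat2 := to.Lat * math.Pi / 180
        a := math.Sin(dLat/2)*math.Sin(dLat/2) + math.Cos(lat1)*math.Cos(lat2)*math.Sin(dLng/2)*math.Sin(dLng/2)
        return 2 * earthRadiusKM * math.Asin(math.Sqrt(a))
    }

    // distanceWithFallback degrades gracefully instead of failing the booking flow.
    func distanceWithFallback(from, to Point) float64 {
        if d, err := routeDistance(from, to); err == nil {
            return d
        }
        return haversineKM(from, to) // degraded but good enough to keep bookings flowing
    }

    func main() {
        pickup := Point{Lat: 12.9716, Lng: 77.5946}
        drop := Point{Lat: 12.9352, Lng: 77.6245}
        fmt.Printf("estimated distance: %.2f km\n", distanceWithFallback(pickup, drop))
    }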

Pattern [4] = Circuit Breakers

Trip the circuit to protect your dependencies

Circuit breakers are used in households to prevent a sudden surge in current from burning the house down: they trip the circuit and stop the flow of current.

This same concept could be applied to our distributed systems wherein you stop making calls to downstream services when you know that the system is unhealthy and failing and allow it to recover.

The state transitions on a typical circuit breaker (CB) look like this:

Initially, when systems are healthy, the CB is in the closed state and makes calls to downstream services. When a certain number of requests fail, the CB trips the circuit and goes into the open state. In this state, the CB stops making requests to the failing downstream service. After a certain sleep threshold, the CB attempts a reset by going into the half-open state. If the next request in this state is successful, it goes back to the closed state; if this call fails, it stays in the open state.
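A minimal Go sketch of that state machine follows; the thresholds and sleep window are arbitrary example values, and a production system would normally use a battle-tested library rather than hand-rolling this.

    package main

    import (
        "errors"
        "fmt"
        "sync"
        "time"
    )

    type state int

    const (
        closed state = iota
        open
        halfOpen
    )

    // CircuitBreaker is a minimal, illustrative implementation of the state
    // machine described above.
    type CircuitBreaker struct {
        mu           sync.Mutex
        st           state
        failures     int
        failureLimit int           // consecutive failures before tripping
        sleepWindow  time.Duration // how long to stay open before probing
        openedAt     time.Time
    }

    func New(failureLimit int, sleepWindow time.Duration) *CircuitBreaker {
        return &CircuitBreaker{failureLimit: failureLimit, sleepWindow: sleepWindow}
    }

    var ErrOpen = errors.New("circuit breaker is open")

    // Call wraps a downstream call, short-circuiting while the breaker is open.
    func (cb *CircuitBreaker) Call(fn func() error) error {
        cb.mu.Lock()
        if cb.st == open {
            if time.Since(cb.openedAt) < cb.sleepWindow {
                cb.mu.Unlock()
                return ErrOpen // fail fast, let the dependency recover
            }
            cb.st = halfOpen // sleep window elapsed: allow one probe request
        }
        cb.mu.Unlock()

        err := fn()

        cb.mu.Lock()
        defer cb.mu.Unlock()
        if err != nil {
            cb.failures++
            if cb.st == halfOpen || cb.failures >= cb.failureLimit {
                cb.st = open
                cb.openedAt = time.Now()
            }
            return err
        }
        // Success closes the breaker and resets the failure count.
        cb.st = closed
        cb.failures = 0
        return nil
    }

    func main() {
        cb := New(3, 2*time.Second)
        for i := 0; i < 5; i++ {
            err := cb.Call(func() error { return errors.New("downstream failed") })
            fmt.Println("attempt", i, "->", err) // later attempts short-circuit with ErrOpen
        }
    }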

Hystrix by Netflix is a popular implementation of this pattern.


Circuit breakers are required at integration points; they help prevent cascading failures and allow the failing service to recover. You can also add a fallback for the circuit breaker to use when it goes into the open state.

You also need good metrics/monitoring around this to detect the various state transitions across integration points. Hystrix has dashboards which help you visualize these state transitions.


Pattern[5] = Resiliency Testing

Test to Break

It is important to simulate various failure conditions within your system: for example, various kinds of network failures, network latencies, dependencies being slow or dead, etc. After determining the various failure modes, you codify them by creating some kind of test harness around them. These tests help you exercise those failure modes on every change to the code.
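As a small example of codifying one such failure mode, the following Go test (in a _test.go file) injects latency into a fake dependency with httptest and asserts that our client fails fast instead of hanging; the timeout values are arbitrary.

    package main

    import (
        "net/http"
        "net/http/httptest"
        "testing"
        "time"
    )

    // TestClientFailsFastWhenDependencyIsSlow codifies one failure mode: the
    // downstream service responds too slowly, and our client must time out
    // instead of hanging.
    func TestClientFailsFastWhenDependencyIsSlow(t *testing.T) {
        // A fake dependency that takes far longer than our SLA allows.
        slow := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            time.Sleep(500 * time.Millisecond) // injected latency
            w.WriteHeader(http.StatusOK)
        }))
        defer slow.Close()

        client := &http.Client{Timeout: 100 * time.Millisecond}

        start := time.Now()
        resp, err := client.Get(slow.URL)
        if err == nil {
            resp.Body.Close()
            t.Fatal("expected a timeout error, got a successful response")
        }
        if elapsed := time.Since(start); elapsed > 300*time.Millisecond {
            t.Fatalf("client took %v to fail; it should fail fast", elapsed)
        }
    }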

Injecting failures
Injecting failures into your system is a technique to induce faults purposefully in order to test resiliency. These kinds of failures help us exercise a lot of the unknown unknowns in our architectures.

Netflix has championed this approach with tools like Chaos Monkey, Latency Monkey, etc., which are part of the Simian Army suite of applications.

In Conclusion:

Though following some of these patterns will help us achieve resiliency, there is no silver bullet. Systems do fail, and the sad truth is that we have to deal with these failures. These patterns, if exercised, can help us achieve significant up-time/availability on our services.
