
Thursday, June 16, 2022

SPOF (Single Point of Failure)

A single point of failure (SPOF) in computing is a critical component whose failure can take down the entire system. A lot of resources and time are spent on removing single points of failure from an architecture/design. 



Single points of failure often pop up when setting up coordinators and proxies. These services help distribute load and discover services as they join and leave the system. Because of their critical, centralized tasks, these services are more prone to being SPOFs.


One way to mitigate the problem is to run multiple instances of every component in the service. The dependency graph then becomes more flexible, allowing the system to resiliently switch to another instance instead of failing requests.


Another approach is to keep backups that allow a quick switchover on failure. Backups are especially useful for components dealing with data, like databases.


Allocating more resources, distributing the system, and replication are some ways of mitigating the problem of SPOFs. Hence designs include horizontal scaling capabilities and partitioning.
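As a toy illustration of how replication removes a SPOF, here is a sketch of a client that fails over between replica instances. The replica addresses and the `fetch` function are made-up examples, not a real API:

```python
# Hypothetical sketch: a client that tries each replica in turn instead of
# treating a single server as a point of failure.

class AllReplicasFailed(Exception):
    pass

def call_with_failover(replicas, fetch):
    """Try each replica in turn; fail only if every one is down."""
    last_error = None
    for replica in replicas:
        try:
            return fetch(replica)
        except ConnectionError as e:
            last_error = e  # this replica is down; try the next one
    raise AllReplicasFailed(f"all {len(replicas)} replicas failed") from last_error

# Example: the first replica is down, the second one answers.
def fake_fetch(replica):
    if replica == "10.0.0.1":
        raise ConnectionError("replica down")
    return f"response from {replica}"

result = call_with_failover(["10.0.0.1", "10.0.0.2"], fake_fetch)
```

The request succeeds as long as any one replica is healthy, which is exactly the resilience the dependency-graph flexibility buys us.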

Wednesday, June 15, 2022

CDN (Content Delivery Network) Explained

 Let's discuss CDNs in detail. Below are the points you would generally consider while serving static pages, images, etc.

Use-case

  • An example is your server serving static and dynamic HTML pages, images, etc.

Caching

  • To make serving faster and more efficient, the first approach you would take is to cache the details and serve them accordingly.

Device Customised Data

  • Different HTML pages and images would be served to different devices (desktop, mobile, etc.). Say 5 different device types and 100 different countries: that is 500 different data points to be served from cache.

Performance Consideration

  • You want to serve your pages fast; users lose interest in the product if it takes time to load.
Global Cache
  • You would cache the information outside the server to serve content fast. To avoid a single point of failure, you would go with a distributed cache, where the data is spread across multiple cache servers.
Shard Caches
  • To serve requests even faster given the many combinations discussed above, you might shard the cache based on location, country, or request type. For example, requests from the US would go to a dedicated set of cache boxes that serve US-related content.
Localized Caches
  • One more problem: say the company and its cache servers are in the US, but the user base spans many countries. To serve requests efficiently for those countries, we have to make localized caches, e.g. one data centre in India to serve requests from India, and so on.
Why should you use a CDN?
  • Well, if you had to design this on your own, you would have to take care of all the points above, which majorly conclude as below:
    • Available in different countries
    • Follows regulations
    • Serves the latest content
What are the benefits of a CDN?
  • A specialised solution like a CDN takes care of all of the above points, so you can focus on your business logic and expand it further.
    • One good example is Akamai, which specialises in:
      • Hosting boxes close to the users.
      • Following regulations.
      • Allowing content to be published to the boxes via a UI.
      • Expiry times in CDNs
        • Sometimes you only need to cache for 60 seconds or 60 minutes, etc. Everything is handled and provided via the UI.
    • Another good example is Amazon S3.
      • Super cheap
      • Very reliable
      • Easy to use
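The expiry-time behaviour mentioned above can be sketched in a few lines. This is a toy in-memory TTL cache, not a real CDN API:

```python
# Minimal sketch of CDN-style expiry: each cached entry carries a TTL,
# after which it is treated as missing and must be fetched again.
import time

class TTLCache:
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired, as if the CDN evicted it
            return None
        return value

cache = TTLCache()
cache.set("/index.html", "<html>...</html>", ttl_seconds=60)
cache.set("/stale.css", "body{}", ttl_seconds=0)  # expires immediately
```

A real CDN exposes the same knob through its UI or API headers (e.g. a 60-second vs 60-minute expiry per object).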

A more detailed explanation of CDNs: https://learnwithnitin.blogspot.com/2014/02/content-delivery-network-cdn.html


Happy Learning :) 

Sunday, June 5, 2022

Why do Databases fail? AntiPatterns to avoid !

Databases are often used to store various types of information, but one case where this becomes a problem is when a database is used as a message broker.

A database is rarely designed to deal with messaging features, and hence is a poor substitute for a specialized message queue. When designing a system, this usage is considered an anti-pattern. 




Here are possible drawbacks:

  • Polling intervals have to be set correctly: too long makes the system inefficient; too short puts the database under heavy read load.
  • The DB becomes both read- and write-heavy, whereas databases are usually good at one of the two.
  • Manual delete procedures have to be written to remove read messages.
  • Scaling is difficult, conceptually and physically.
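To make the drawbacks concrete, here is a rough sketch of the anti-pattern itself, with SQLite standing in for the database. The table and column names are made up for illustration:

```python
# Illustrative sketch of using a database table as a message queue.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE messages "
    "(id INTEGER PRIMARY KEY, body TEXT, consumed INTEGER DEFAULT 0)"
)
db.execute("INSERT INTO messages (body) VALUES ('hello'), ('world')")

def poll_once(db):
    """One polling cycle: read unconsumed messages, then mark them read."""
    rows = db.execute(
        "SELECT id, body FROM messages WHERE consumed = 0"
    ).fetchall()
    for msg_id, _ in rows:
        # the "manual delete/cleanup procedure" drawback, in miniature
        db.execute("UPDATE messages SET consumed = 1 WHERE id = ?", (msg_id,))
    return [body for _, body in rows]

# A consumer would run this in a loop with a sleep -- the polling-interval
# trade-off described above (too long: stale; too short: heavy read load).
batch = poll_once(db)
```

Every consumer hammers the same table with reads and writes, which is exactly why this pattern scales poorly compared to a purpose-built queue.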


Disadvantages of a Message Queue:

  • Adds more moving parts to the system.
  • Cost of setting up the MQ along with training is large.
  • May be overkill for a small service.


It is important to be able to reason about why a system does or does not need a message queue. These reasons allow us to argue the merits and demerits of the two approaches.


However, there are blogs arguing that databases are perfectly fine as message queues too. A deep understanding of the pros and cons helps evaluate how effective each would be for a given scenario. 


In general, for a small application, databases are fine as they bring no additional moving part to the system. For complex message sending requirements, it is useful to have an abstraction such as a message queue handle message delivery for us.

Publisher Subscriber Model

Microservices benefit from loose data coupling, which is provided by a publish subscribe model. In this model, events are produced by a publishing service and consumed by downstream services.

Designing the microservice interactions involves event handling and consistency checks. We look into a pub-sub architecture to evaluate its advantages and disadvantages compared to a request-response architecture.



This type of architecture relies on message queues to handle event passing; examples are RabbitMQ and Kafka. The architecture is common in real-life scenarios and interviews.


If there is no strong consistency guarantee to be made for transactions, an event model works well in microservices. Here are the main advantages:

  • Decouples a system's services.
  • Subscribers and publishers can be added easily without informing the other side.
  • Consolidates multiple points of failure into a single point of failure (the message broker).
  • Interaction logic can be moved into the services/message broker.


Disadvantages:

  • An extra layer of interaction slows services down.
  • Cannot be used in systems requiring strong consistency of data.
  • Additional cost to the team for redesigning, learning, and maintaining the message queues.


This model provides the basis for event driven systems.
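A minimal in-process sketch of the pub-sub model described above. The topic name and event are illustrative; a real system would use a broker like RabbitMQ or Kafka:

```python
# Tiny pub-sub sketch: publishers and subscribers only know the broker,
# not each other -- the loose coupling described above.
from collections import defaultdict

class Broker:
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        # deliver the event to every subscriber of the topic
        for callback in self._subscribers[topic]:
            callback(event)

broker = Broker()
received = []
broker.subscribe("order_placed", received.append)   # e.g. a billing service
broker.subscribe("order_placed", lambda e: None)    # e.g. an email service
broker.publish("order_placed", {"order_id": 42})
```

Note how a new subscriber can be added without the publisher changing at all, which is the decoupling advantage from the list above.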

Saturday, June 4, 2022

Capacity Planning and Estimation

 


Eg: Estimate the hardware requirements to set up a system like YouTube.

Eg: Estimate the number of petrol pumps in the city of Mumbai.


Let's start with storage requirements:

About 1 billion active users.

I assume 1 in 1,000 users produces a video a day.

Which means 1 million new videos a day.


What's the size of each video?

Assume the average length of a video to be 10 minutes. 

Assume a 10 minute video to be of size 1 GB. Or...

A video is a bunch of images. 10 minutes is 600 seconds, and each second has 24 frames. So a video has 24 * 600 = 14,400 frames.

Each frame is of size 1 MB, which means (1.44 * 10^4) * (10^6) bytes ≈ 14.4 GB.

This estimate is very inaccurate, since frames in a video are heavily compressed, so we must either revise our estimate or hope the interviewer corrects us. A normal video of 10 minutes is about 700 MB.


As each video is about 1 GB, we assume the storage requirement per day is 1 GB * 1 million = 1 PB. 



This is the bare minimum storage requirement to store the original videos. If we want to have redundancy for fault tolerance and performance, we have to store copies. I'll choose 3 copies. 

That's 3 petabytes of raw data storage.

What about video formats and encoding? Let's assume a single type of encoding, mp4; besides the original 720p video we store 480p, 360p, 240p, and 144p versions, each roughly half the size of the previous one.


If X is the original storage requirement = 1 PB,

We have X + X/2 + X/4 + X/8 == 2*X.

With redundancy, that's 2X * 3 = 6*X.


That's 6 PB (processed) + 3 PB (raw) ≈ 10 PB of data per day. At 10 TB per drive, that's about 1,000 hard drives. The cost of this system is about 1 million per day.


For a 3 year plan, we can expect a 1 billion dollar storage price.
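The arithmetic above can be sanity-checked with a few lines. Every value here is an assumption from the text, not measured data:

```python
# Back-of-the-envelope check of the storage estimate above.
users = 1_000_000_000
videos_per_day = users // 1000              # 1 in 1,000 users uploads daily
raw_pb_per_day = videos_per_day * 1 / 1e6   # 1 GB per video; 1e6 GB ~ 1 PB

# Lower resolutions roughly halve each step: X + X/2 + X/4 + X/8 ~= 2X
processed_pb = raw_pb_per_day * (1 + 0.5 + 0.25 + 0.125)

# 3 copies of both processed and raw data for redundancy
total_pb_per_day = 3 * processed_pb + 3 * raw_pb_per_day
```

This lands at about 8.6 PB/day, which the post rounds up to roughly 10 PB; rounding each intermediate step up is normal for this kind of estimation.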


Now let's look at the real numbers:

Video upload speed = 3 * 10^4 minutes of footage per minute.

That's 3 * 10^4 * 1440 minutes of footage per day ≈ 4.5 * 10^7 minutes.

Video encoding can reduce a 1-hour film to 1 GB. 4.5 * 10^7 minutes is 7.5 * 10^5 hours, so the requirement is roughly a million GB. That's about 1 PB.


So the original cost is similar to what the real numbers say.




Friday, May 20, 2022

Whatsapp System Design - High Level Architecture

 Let's design whatsapp :) 

Prioritized requirements

  • We must implement one-to-one chat messaging.
  • We must also show users what stage a message is currently at. (Sent, Delivered and Read receipts)
  • Group messaging is also allowed.
  • Users can share image, audio and video files.
  • We will also show the Online/Last seen status of users.
  • Chats will be temporary. (i.e. they will be stored on the client side)

One to One messaging and Read Receipts


Whenever a user wants to send a message, they send a request to our server. This request is received by the gateway service; the client applications maintain a TCP connection with the gateway service to send messages.


Once the server has accepted the message, our system must notify the sender, so we send a parallel response to the sender acknowledging it. (Note: to ensure the message will be delivered, we store it in the database and keep retrying until the recipient gets it.) This takes care of Sent receipts.


When the recipient receives the message, it sends a response (or acknowledgement) to our system. This response is routed to the session service, which finds the sender from the mapping and sends the Delivery receipt.


The process for Read receipts is the same: as soon as the user reads the message, we perform the above process.


Note: The response from the client consists of sender and receiver fields.


Components required

  • Gateway Service
    • This service consists of multiple servers.
    • It will receive all the requests from the users.
    • It maintains the TCP connections with the users.
    • Furthermore, it also interacts with all the internal services.
  • Session Service The gateway service is also distributed, so if we want to send messages from one user to another we must know which user is connected to which gateway server. This is handled by the session service, which maps each user (userID) to a particular gateway server.
  • Database All the mappings must be persisted in non-volatile storage. For that we need a database.
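The session-service mapping can be sketched as follows. Names are illustrative, and a real implementation would persist the mapping in the database as noted above:

```python
# Sketch of the session service: userID -> gateway server mapping,
# used to find which gateway holds the recipient's TCP connection.
class SessionService:
    def __init__(self):
        self._gateway_of = {}  # userID -> gateway server id

    def on_connect(self, user_id, gateway_id):
        """Record which gateway a user's TCP connection landed on."""
        self._gateway_of[user_id] = gateway_id

    def route(self, recipient_id):
        """Return the gateway for the recipient, or None if offline."""
        return self._gateway_of.get(recipient_id)

sessions = SessionService()
sessions.on_connect("alice", "gateway-1")
sessions.on_connect("bob", "gateway-3")
target = sessions.route("bob")
```

This is also why the trade-off below favours the session service over the gateways: the mapping lives in one logical place instead of being duplicated on every gateway server.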

Trade-offs

  • Storing the mapping in gateway service v/s Storing it in session service
    • If we store the mapping in the gateway service, we can access it faster; to get the mappings from the session service we have to make a network call.
    • Gateway servers have limited memory. If we store the mapping in the gateway, we have to reduce the number of TCP connections.
    • The gateway service is distributed across multiple servers, so there would be a lot of duplication, and every update would have to be applied on each and every server.
  • So we can conclude that storing the mapping in the session service is a better idea.


  • Using HTTP for messaging v/s WebSockets (WSS)
    • HTTP can only send messages from client to server. The only way we can allow messaging is by constantly sending requests to the server to check if there is any new message (long polling).
    • WebSockets give a full-duplex connection that allows client and server to send messages to each other at any time.
  • As we do not need to constantly send requests to the server, using WebSockets will be more efficient.

Diagram




Last Seen Timestamps of users


We want to show other users whether a user is online or when they were last seen. To implement this we can store a table that maps userID to lastSeenTimestamp. Whenever a user performs an activity (like sending or reading a message), that request is sent to the server, and we update the key-value pair with the time of the request. We must also consider requests sent by the application rather than the user (like polling for messages); these do not count as user activity, so we won't log them. We can have an additional flag (something like application_activity) to differentiate the two.


We also need to define a threshold: if the last-seen time is below the threshold, then instead of showing the exact time difference we will just show online.


For e.g. if user X was last seen 3 sec ago and the threshold is 5 sec, then other users will see X as online.
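The threshold rule can be written as a tiny function; the 5-second threshold is the example value from the text:

```python
# Presence display rule: below the threshold, show "online";
# otherwise show how long ago the user was last seen.
def presence_label(seconds_since_last_seen, threshold_seconds=5):
    if seconds_since_last_seen <= threshold_seconds:
        return "online"
    return f"last seen {seconds_since_last_seen}s ago"

label_recent = presence_label(3)    # the 3-second example above
label_stale = presence_label(120)
```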


Components Required

  • Last Seen service Every time there is user activity, it is routed to this service, which persists the key-value pair in a non-volatile database.
  • Database


Group Messaging


Each group has many users. Whenever a participant in a group sends a message, we first find the list of users present in the group. Once the session service has the list of users, it finds the gateway servers the users are connected to and then sends the message.


Note: We should also limit the number of users in a group. If there are a lot of users, a single message can cause a large fan-out. We could instead ask the client applications to pull new messages from our system, but our messages won't be real-time in that case.

  • We do not want the gateway service to parse messages, because we want to minimize its memory usage and maximize the number of TCP connections it can hold. So we will use a message parser to convert the unparsed payload into a sensible message.
  • We have a mapping of groupID to userIDs, a one-to-many relationship. The group messaging service has multiple servers, so there can be data redundancy. In order to reduce redundancy we use consistent hashing: we hash the groupID and send the request to the server the hash points to.
  • We also need to use a message queue in case there are any failures while sending requests. Once we hand a request to the message queue, it ensures the message will be sent. If the maximum number of retries is reached, it tells the sender that it failed and we can notify the user.
  • While sending messages in a group we must take care of three things:
    • Retries - The message queue takes care of that.
    • Idempotency - Each message should be processed only once. We can achieve this by delivering messages to the queue at least once, with each message carrying a unique ID. If the service has already seen the ID, the message has already been sent, so the duplicate is ignored.
    • Ordering of messages - Messages in a group must be ordered by their timestamps. To ensure this we always assign the messages of a particular group to a particular thread from the thread pool.
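The consistent-hashing idea above can be sketched like this. The server names are made up, and real systems add virtual nodes for better balance:

```python
# Minimal consistent-hash ring for routing a groupID to a
# group-messaging server, as described above.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers):
        # place each server at a point on the ring
        self._ring = sorted((self._hash(s), s) for s in servers)
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, group_id):
        """Walk clockwise to the first server at/after the key's position."""
        idx = bisect.bisect(self._keys, self._hash(group_id)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["gm-server-1", "gm-server-2", "gm-server-3"])
server = ring.server_for("group-42")  # same groupID always maps to same server
```

Because a given groupID always hashes to the same server, that server can also own the group's ordering and idempotency bookkeeping.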

Components Required

  • Group Messaging service It stores the mapping of groupID to userID and provides this data to the session service.
  • Message parser service It receives the unparsed message from the gateway service and converts it to sensible format before sending it to other services.
  • Message queue

Diagram



Sending Image, Audio and Video files


We can use a distributed file service to store the files, as it is much more efficient and cost-effective than storing images as BLOBs in a database. Every time a user sends an image, we store it in the file service, and we fetch it when we need to send it.


Components required

  • Distributed File System

Diagram



Some more optimizations

  • Graceful degradation On some occasions our system might get so many messages that it becomes overloaded. In such cases we can temporarily shut down services that are not critical (like sending read receipts or last seen status).
  • Rate Limiting In some situations we cannot handle any more requests. In such cases we can rate limit the number of requests and drop the extra ones. However, this results in a worse user experience.
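One common way to implement the rate limiting mentioned above is a token bucket; here is a toy sketch (the rate and capacity values are illustrative):

```python
# Token-bucket rate limiter: each request consumes a token; tokens refill
# at a fixed rate; when the bucket is empty, the request is dropped.
import time

class TokenBucket:
    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill tokens for the time elapsed since the last request
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # drop the extra request

bucket = TokenBucket(rate_per_sec=1, capacity=2)
results = [bucket.allow() for _ in range(3)]  # burst of 3 back-to-back requests
```

With a burst capacity of 2, the third back-to-back request is dropped, which is the "drop extra requests" behaviour described above.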

Happy Learning :) 

Tuesday, May 17, 2022

Distributed Caching - Key Features

Caching in distributed systems is an important aspect of designing scalable systems. We first discuss what a cache is and why we use it. We then talk about the key features of a cache in a distributed system.

The cache management policies of LRU and sliding window are mentioned here. For high performance, the cache eviction policy must be chosen carefully. To keep data consistent and the memory footprint low, we must choose a write-through or write-back consistency policy.


Cache management is important because of its relation to cache hit ratios and performance. We talk about various scenarios in a distributed environment.



Use-cases of Cache

  • Save network calls
  • Avoid recomputations
  • Reduce db load
Store everything in cache?
  • As we know, response times are much faster when fetching details from a cache instead of the DB, so does that mean we can store a lot of data in the cache?
    • Well, you can't, for multiple reasons
      • Firstly, the hardware a cache runs on is usually much more expensive than that of a normal database.
      • Secondly, if you store a ton of data in the cache, search time will increase, and as search time keeps increasing it makes less and less sense to use the cache.
When to load and evict data from cache?
  • It entirely depends on the cache policy we use.
    • The first popular policy is called LRU (Least Recently Used).
      • Kick out the bottom-most (least recently used) entries.
        • As an example, if a celebrity makes a post/comment, people will want to load it, and over time it slowly becomes one of the least used entries.
    • There is one more, LFU (Least Frequently Used), but it's not used frequently in the real world :)
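An LRU cache along these lines can be sketched with an OrderedDict. This is a toy in-memory version, not a distributed cache:

```python
# LRU eviction sketch: reads move an entry to the "recently used" end;
# inserts beyond capacity kick out the bottom-most (least recent) entry.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("celebrity_post", "viral content")
cache.put("old_post", "stale content")
cache.get("celebrity_post")          # touch it so it stays hot
cache.put("new_post", "fresh content")  # evicts "old_post"
```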
What problems poor eviction policy can cause?
  • Imagine you ask the cache for something and it says "I don't have it" most of the time, so you go back to the DB, making more network calls.
    • So the first problem is extra calls.
  • The second problem: when you have a very small cache, imagine making an entry for X, then making an entry for Y that evicts X, over and over.
    • This concept is called thrashing.
  • Data Consistency
    • As an example, server 2 makes an update call and updates the DB; if server 1 then asks for X's profile from its cache, it will fetch an outdated profile. (This would be even more severe for things like password updates.)


Where cache can be placed?
  • It can be placed close to the database or close to the server.
    • There are benefits and drawbacks to both.
  • If you want to place it close to the server, how close can you place it? Well, you can place it in memory itself.
    • If you do this, some of the memory in your server is going to be used up by your cache.
      • If the number of results is really small and you need to save on network calls, you can just keep it in memory.
      • If, let's say, server 2 fails, its in-memory cache also fails.
      • The data on S1 and the data on S2 may also become inconsistent, meaning they are not in sync.
  • Putting the cache near the DB makes it like a global cache.
    • In this case, even if S2 crashes, S1 will keep serving requests and there won't be any data inconsistency.
    • It will be slightly slower, but it's more accurate.
    • You can also scale it independently, and the servers would be resilient too.

How to make sure data is consistent in cache?
  • There are two approaches to achieve it
    • Write-through
      • You update the entry in the cache and synchronously update the database as well.
      • A possible problem: with servers having in-memory caches, if S1 makes an update call and updates its own cache, the data in S2's cache would still be inconsistent. The synchronous database write also adds latency.
    • Write-back
      • You update the cache first and write the entry back to the database later, asynchronously.
      • A possible problem with write-back is losing updates if the cache fails before the data is written to the database.
  • Both approaches have advantages and disadvantages.
    • A hybrid sort of solution would be best, based on the use-cases.

Happy Learning :) 

Thursday, May 12, 2022

Appium Architecture - Core Concepts

What is Appium?

It’s a NodeJS based open-source tool for automating mobile applications. It supports native, mobile web, and hybrid applications on iOS mobile, Android mobile, and Windows desktop platforms.

Using Appium, you can run automated tests on physical devices or emulators, or both.



Let’s understand the above Appium architecture diagram.

  • Appium has a client-server architecture. The Appium server communicates with the client through the HTTP-based JSON Wire Protocol using JSON objects.
  • Once it receives the request, it creates a session and returns the session ID, which will be used for communication so that all automation actions will be performed in the context of the created session.
  • Appium uses the UIAutomator test framework to execute commands on Android devices and emulators.
  • Appium uses the XCUITest test framework to execute commands on Apple mobile devices and simulators.
  • Appium uses WinAppDriver to execute commands for Windows Desktop apps. It is bundled with Appium and does not need to be installed separately.

Appium - Android visual interaction flow

Let’s understand the interaction flow between the code and the Android device via the Appium server.

  • The client sends the request to the Appium server through the HTTP JSONWire Protocol using JSON objects.
  • Appium sends the request to UIAutomator2.
  • UIAutomator2 communicates with a real device/emulator using bootstrap.jar, which acts as a TCP server.
  • bootstrap.jar executes the command on the device and sends the response back.
  • Appium server sends back the command execution response to the client.
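As a rough illustration of the first step, here is the kind of JSON body a client sends to create a session over the JSON Wire Protocol. The capability values (device name, app path) are made-up examples, and exact capability names vary by client and Appium version:

```python
# Sketch of a "new session" request body for the HTTP JSONWire protocol.
import json

new_session_request = {
    "desiredCapabilities": {
        "platformName": "Android",
        "deviceName": "Pixel 5",          # assumed example device
        "automationName": "UiAutomator2",
        "app": "/path/to/app.apk",        # assumed example app path
    }
}

# The client POSTs this body to the server's /session endpoint; the server
# replies with a sessionId, and every later command is scoped to that session.
body = json.dumps(new_session_request)
```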

Appium - iOS visual interaction flow

Let’s understand the interaction flow between the code and the iOS device via the Appium server.

  • The client sends the request to the Appium server through the HTTP JSONWire Protocol using JSON objects.
  • Appium sends the request to XCUITest.
  • XCUITest communicates with a real device/simulator using bootstrap.js, which acts as a TCP server.
  • bootstrap.js executes the command on the device and sends the response back.
  • The Appium server sends the command execution response to the client.

Whiteboard Sessions
  • iOS flow architecture

  • Android flow architecture

  • Drivers which Appium supports
    • UIAutomator2 (Android)
    • Espresso (Android)
    • WinAppDriver (Windows)
    • Mac Driver (macOS)
    • XCUITest (iOS 9.3 and above)
    • UIAutomation (iOS below 9.3)
    • Tizen (for Samsung devices)


Happy Learning :) 

Wednesday, May 11, 2022

System Design : Tinder as a microservice architecture

Requirements

Prioritized requirements
  • System should store all the relevant profile data like name, age, location and profile images.
  • Users should be recommended matches based on their previous choices.
  • System should store the details when a match occurs.
  • System should allow direct messaging between two users if they have matched.
Requirements not a part of our design
  • Allowing moderators to remove profiles.
  • Payment and subscriptions for extra features.
  • Allowing only a limited number of swipes for users without a subscription.

Estimation

  • We will store 5 profile images per user.
  • We are assuming the number of active users is 10 million.
  • We are assuming the number of matches is 0.1% of the total active users, i.e., 10,000 matches daily.

Requirement 1: Profile creation, authentication and storage

Description

First, the system should allow a new user to create an account; once the account has been created, it needs to provide the user with an authentication token. This token will be used by the API gateway to validate the user.

System needs to store profile name, age, location and description in a relational database. However, there are 2 ways to store images.

  • We can store images as file in File systems.
  • We can store images as BLOB in databases.

Components required

  • API Gateway Service
    • It will balance load across all the instances.
    • It will interact with all the services.
    • It will also validate the users' authentication tokens.
    • It will redirect users to the required service as per their request.
  • Distributed File System to store images
    • We will use CDN to serve the static images faster.
  • Relational database to store user information
  • We will store the user ID, name, age, location, gender, description, user preferences, etc.

Trade-offs

  • Storing images as File v/s Storing images as BLOB

          Features provided by a database when storing images as BLOBs

      • Mutability : When we store an image as a BLOB in the database we can change its value. However, this is useless because an update to an image is never just a few bits; we can simply create a new image file.
      • Transaction guarantee : We are not going to do atomic operations on an image, so this feature is useless.
      • Indexes : We are not going to search images by their content (which is just 0's and 1's), so this is useless.
      • Access control : Storing images as BLOBs in the database provides us access control, but we can achieve the same mechanisms using the file system.

          Features provided by a file system when storing images as files

    • They are comparatively cheap.
    • They are comparatively faster because they store large files separately.
    • Files are static, so we can use a CDN for faster access.
  • So for storing user images we will use a Distributed File System.

Diagram

Requirement 2: One to one Chat messaging

Description

System should allow one-to-one chat messaging between two users if and only if they have matched. So we have to connect one client to another.

To achieve this we use the XMPP protocol, which allows the two sides to push messages to each other.

Components required

  • Relational database
    • We will use this database to store the user ID and connection ID.
  • Cache
    • We do not want to access the database every time a client sends a message, so we can use a cache to store the user ID and connection ID.

Trade-offs

  • Use of HTTP for chat v/s Use of XMPP for one to one messaging
    • With HTTP we can only send messages from client to server. The only way we can allow messaging is by constantly sending requests to the server to check if there is any new message (polling).
    • XMPP allows client and server to push messages to each other.
    • As we do not need to constantly send requests to the server, using XMPP will be more efficient.

Diagram

Requirement 3: Matching right swiped users

Description

Server should store the following information:

  • Users who have matched (both have right-swiped each other).
  • One or both of the users have left-swiped the other.

This service also allows the chat service to check whether two users are matched before allowing one-to-one chat messaging.

Components required

  • Relational database
    • We will use this database to store the user IDs of both users.
    • We will use indexes on the user ID to make queries faster.

Trade-offs

  • Storing match details on the client v/s Storing match details on the server
    • One benefit of storing the match details on the client is that we save storage on the server side. However, as we are storing only the user IDs, this is not significant.
    • If we store match data on the client side, all data is lost when the user uninstalls the application; if we store it on the server, the data is not lost.
    • The benefit of storing the details on the server side is that it becomes the source of truth. And as the details on the server cannot be tampered with, it is more secure.
    • So we store the relevant details on the server side.

Diagram

Requirement 4: Server recommendations to users

Description

Server should be able to recommend profiles to users. These recommendations should take into consideration the age and gender preferences. Server should also be able to recommend profiles that are geographically close to the user.

Components required

  • Relational database
    • We can do horizontal partitioning (sharding) on the database based on location. Also, we can put indexes on the name and age, so we can do efficient query processing.
    • For every database partition we will have a master-slave architecture. This will allow the application to work even if the primary database fails.

Diagram

Database design

API Design

Profile service
  • POST /user/signup - Creates a new account
  • GET /user/login - Sends the user an authentication token
  • GET /user/:userID - Gets the profile of the user ID
  • PUT /user/:userID - Updates user details
  • DELETE /user/:userID - Removes the user account
Session service
  • GET /session/users/:connectionID - Returns both the users that have the connection ID.
  • DELETE /session/connection/:connectionID - Deletes all the data that has the connection ID.
  • POST /session/connection/:userID1/:userID2 - Adds user ID1 and user ID2 with the same connection ID.
Matcher service
  • GET /match - Returns all the matches of the logged-in user.
  • DELETE /match/:userID - Deletes the user ID from the match list.
Recommendation service
  • GET /recommendation - Returns a collection of the most appropriate profiles for the logged-in user.

Happy Learning :) 
