
Friday, May 20, 2022

Whatsapp System Design - High Level Architecture

 Let's design WhatsApp :) 

Prioritized requirements

  • We must implement one-to-one chat messaging.
  • We must also show users what stage a message is currently in (Sent, Delivered, and Read receipts).
  • Group messaging is also allowed.
  • Users can share image, audio, and video files.
  • We will also show the Online/Last seen status of users.
  • Chats will be temporary (i.e. they will be stored on the client side).

One to One messaging and Read Receipts


Each client application maintains a TCP connection with the gateway service. Whenever a user wants to send a message, the request travels over this connection and is received by the gateway service.


Once our server receives the message, it sends a parallel response back to the sender acknowledging that the message has reached our system. (Note: to ensure the message is eventually delivered, we store it in a database and keep retrying until the recipient receives it.) This takes care of Sent receipts.
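The "store and retry" guarantee above can be sketched as follows. This is a minimal illustration, not WhatsApp's real implementation: `DeliveryQueue`, `send`, and the in-memory `pending` list (standing in for the message database) are all assumed names.

```python
# Hypothetical sketch of "persist, then retry until acknowledged".
class DeliveryQueue:
    def __init__(self, send, max_retries=3):
        self.send = send            # callable: returns True if the recipient acked
        self.max_retries = max_retries
        self.pending = []           # stands in for the message database

    def deliver(self, message):
        self.pending.append(message)            # persist before attempting
        for _ in range(self.max_retries):
            if self.send(message):              # recipient acknowledged
                self.pending.remove(message)    # safe to forget the message
                return True
        return False                            # still pending; retry later

# Usage: a flaky transport that fails twice, then succeeds.
attempts = {"n": 0}
def flaky_send(msg):
    attempts["n"] += 1
    return attempts["n"] >= 3

q = DeliveryQueue(flaky_send)
assert q.deliver("hello") is True
assert q.pending == []          # delivered messages are removed from storage
```

The key design point is that the message is persisted *before* the first delivery attempt, so a crash mid-retry loses nothing.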


When the recipient receives the message, it sends a response (or acknowledgement) back to our system. This response is routed to the session service, which finds the sender from the mapping and forwards the Delivered receipt.


The process for Read receipts is the same: as soon as the recipient reads the message, we perform the steps above again.


Note: The response from the client contains both the sender and receiver fields.
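To make the receipt flow concrete, here is an illustrative shape for a message and its acknowledgement. The field names are assumptions for this sketch, not WhatsApp's actual wire format; the point is that the ack carries both sender and receiver so the session service can route the receipt back.

```python
# Illustrative payloads (field names are assumptions, not a real protocol).
message = {
    "id": "m-1001",
    "sender": "alice",
    "receiver": "bob",
    "body": "hi!",
}

ack = {
    "message_id": "m-1001",
    "sender": "alice",      # original sender: who should get the receipt
    "receiver": "bob",      # who is acknowledging
    "status": "DELIVERED",  # SENT -> DELIVERED -> READ
}

def receipt_target(ack):
    # The session service uses the sender field to look up which
    # gateway server the original sender is connected to.
    return ack["sender"]

assert receipt_target(ack) == "alice"
```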


Components required

  • Gateway Service
    • This service consists of multiple servers.
    • It will receive all the requests from the users.
    • It maintains the TCP connections with the users.
    • Furthermore, it also interacts with all the internal services.
  • Session Service
    • The gateway service is distributed, so to route a message from one user to another we must know which user is connected to which gateway server.
    • The session service handles this: it maps each user (userID) to a particular gateway server.
  • Database
    • All the mappings must be persisted in non-volatile storage. For that we need a database.

Trade-offs

  • Storing the mapping in the gateway service v/s storing it in the session service
    • If we store the mapping in the gateway service, we can access it faster; to get it from the session service we have to make a network call.
    • Gateway servers have limited memory. If we store the mapping on the gateway, we have to reduce the number of TCP connections each server can hold.
    • The gateway service is distributed across multiple servers, so the mapping would be heavily duplicated, and every update would have to be applied on each and every server.
  • So we can conclude that storing the mapping in the session service is a better idea.
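The session service's userID-to-gateway mapping can be sketched as a simple lookup table. The class and method names here are illustrative; in the actual design this state lives in the session service and is persisted to the database.

```python
# Minimal sketch of the session service's userID -> gateway-server mapping.
class SessionService:
    def __init__(self):
        self.user_to_gateway = {}

    def connect(self, user_id, gateway_id):
        # Called when a user's TCP connection lands on a gateway server.
        self.user_to_gateway[user_id] = gateway_id

    def route(self, user_id):
        # Which gateway server should this user's messages be forwarded to?
        return self.user_to_gateway.get(user_id)

svc = SessionService()
svc.connect("alice", "gateway-3")
svc.connect("bob", "gateway-7")
assert svc.route("bob") == "gateway-7"
assert svc.route("carol") is None   # not currently connected
```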


  • Using HTTP for messaging v/s Websockets (WSS)
    • HTTP only lets the client initiate requests. The only way to receive messages over plain HTTP is to constantly poll the server for new messages (long polling).
    • WSS (WebSockets) gives us a persistent, full-duplex connection over which the client and server can push messages to each other.
  • As we do not need to constantly poll the server, using WebSockets will be more efficient.

Diagram




Last Seen Timestamps of users


We want to show other users whether a user is online or when they were last seen. To implement this we can store a table mapping userID to lastSeenTimestamp. Whenever a user performs an activity (like sending or reading a message), that request hits the server, and we update the key-value pair with the time the request was received.

We must also consider requests sent by the application rather than by the user (like polling for messages). These do not count as user activity, so we won't log them; an additional flag (something like application_activity) can differentiate the two.


We also need to define a threshold: if the time since last seen is below the threshold, we show the user as online instead of the exact time difference.


For example, if user X was last seen 3 seconds ago and the threshold is 5 seconds, other users will see X as online.
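The threshold logic above fits in a few lines. This is a sketch; the 5-second threshold is just the example value from the text, not a production number.

```python
# Show "online" when the last activity is within the threshold,
# otherwise show how long ago the user was last seen.
def presence(seconds_since_last_seen, threshold=5):
    if seconds_since_last_seen < threshold:
        return "online"
    return f"last seen {seconds_since_last_seen}s ago"

assert presence(3) == "online"                     # the example from the text
assert presence(120) == "last seen 120s ago"
```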


Components Required

  • Last Seen service
    • Every time there is user activity, the request is routed to this service. It persists the key-value pair in a non-volatile database.
  • Database


Group Messaging


Each group will have many users. Whenever a participant in a group sends a message, we first find the list of users in that group. Once the session service has the list, it finds the gateway servers those users are connected to and sends the message to each of them.


Note: We should also limit the number of users in a group. With very large groups, the fanout (one incoming message turning into thousands of outgoing ones) becomes expensive. We could instead ask client applications to pull new messages from our system, but then messages would no longer be real-time.

  • We do not want the gateway service to parse messages, because we want to minimize its memory usage and maximize the number of TCP connections it can hold. So we use a message parser to convert the raw messages into a structured format.
  • We have a mapping of groupID to userIDs, a one-to-many relationship. The group messaging service runs on multiple servers, so there can be data redundancy. To reduce it we use consistent hashing: we hash the groupID and route the request to the server the hash points to.
  • We also need a message queue in case there are failures while sending requests. Once we hand a request to the message queue, it ensures the message will be sent. If the maximum number of retries is reached, it reports the failure and we can notify the user.
  • While sending messages in a group we must take care of three things:
    • Retries - the message queue takes care of these.
    • Idempotency - each message should be processed only once, even though the queue may deliver it more than once. We achieve this by giving each message a unique ID; if the service has already seen that ID, the message has already been sent, so the duplicate is ignored.
    • Ordering of messages - messages in a group must be ordered by their timestamps. To ensure this, we always assign the messages of a particular group to the same thread from the thread pool.
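The idempotency and ordering ideas above can be sketched together. This is illustrative, not the real service: duplicates are dropped by message ID, and every message for a given group hashes to the same worker, so per-group order is preserved even with many workers.

```python
import hashlib

NUM_WORKERS = 4
seen_ids = set()                                 # IDs already processed
worker_logs = [[] for _ in range(NUM_WORKERS)]   # one ordered log per worker

def worker_for(group_id):
    # Same group always hashes to the same worker -> per-group ordering.
    digest = hashlib.md5(group_id.encode()).hexdigest()
    return int(digest, 16) % NUM_WORKERS

def handle(message_id, group_id, body):
    if message_id in seen_ids:       # already processed: ignore duplicate
        return False
    seen_ids.add(message_id)
    worker_logs[worker_for(group_id)].append(body)
    return True

assert handle("m1", "g42", "hello") is True
assert handle("m1", "g42", "hello") is False        # duplicate dropped
handle("m2", "g42", "world")
w = worker_for("g42")
assert worker_logs[w] == ["hello", "world"]         # in order on one worker
```

(A production consistent-hashing scheme would use a hash ring with virtual nodes so that adding a worker moves only a fraction of the groups; the modulo here is a simplification.)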

Components Required

  • Group Messaging service
    • It stores the mapping of groupID to userIDs and provides this data to the session service.
  • Message parser service
    • It receives raw messages from the gateway service and converts them into a structured format before sending them to other services.
  • Message queue

Diagram



Sending Image, Audio and Video files


We can use a distributed file system to store the files, as it is much more efficient and cost-effective than storing them as BLOBs in a database. Every time a user sends an image we store it in the file system, and we fetch it from there when we need to deliver it.
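One common consequence of this design is that the chat message itself carries only a file ID (or URL), while the bytes live in the file system. The sketch below models the distributed file system as a plain dict; the names and ID scheme are assumptions for illustration.

```python
import uuid

file_store = {}   # stands in for the distributed file system

def upload(data: bytes) -> str:
    # Store the bytes and hand back an opaque ID to embed in the message.
    file_id = str(uuid.uuid4())
    file_store[file_id] = data
    return file_id

def download(file_id: str) -> bytes:
    return file_store[file_id]

fid = upload(b"\x89PNG...")           # sender uploads the image first
message = {"sender": "alice", "receiver": "bob", "file_id": fid}
assert download(message["file_id"]) == b"\x89PNG..."   # recipient fetches it
```

This keeps messages small, and the same stored file can be fanned out to an entire group without re-uploading it.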


Components required

  • Distributed File System

Diagram



Some more optimizations

  • Graceful degradation
    • On some occasions our system might receive so many messages that it gets overloaded. In such cases we can temporarily shut down features that are not critical (like read receipts or last seen status).
  • Rate limiting
    • In some situations we simply cannot handle any more requests. We can then rate-limit incoming requests and drop the excess. This hurts the user experience, but keeps the system up.

Happy Learning :) 

Tuesday, May 17, 2022

Distributed Caching - Key Features

Caching in distributed systems is an important aspect of designing scalable systems. We first discuss what a cache is and why we use it. We then talk about the key features of a cache in a distributed system.

The cache management policies LRU and sliding window are mentioned here. For high performance, the cache eviction policy must be chosen carefully. To keep data consistent and the memory footprint low, we must choose a write-through or write-back consistency policy.


Cache management is important because of its relation to cache hit ratios and performance. We talk about various scenarios in a distributed environment.



Use-cases of Cache

  • Save network calls
  • Avoid recomputations
  • Reduce db load
Store everything in cache?
  • As we know, response times are much faster when fetching data from a cache instead of the database. So does that mean we can store lots of data in the cache?
    • Well, you can't, for multiple reasons:
      • Firstly, the hardware a cache runs on is usually much more expensive than that of a normal database.
      • Secondly, if you store a ton of data in the cache, search times will increase; and as search time keeps increasing, it makes less and less sense to use the cache.
When to load and evict data from cache?
  • It entirely depends on the cache policy we use.
    • The first popular policy is LRU (Least Recently Used).
      • Kick out the bottom-most (least recently used) entries.
        • As an example, if a celebrity makes a post/comment, people will want to load it; over time it becomes one of the least recently used entries and is evicted.
    • There is also LFU (Least Frequently Used), but it's not frequently used in the real world :)
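An LRU cache is a short exercise with Python's `OrderedDict`: reads move an entry to the "recently used" end, and when the cache is full the entry at the other end is evicted.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)      # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")          # "a" is now the most recently used
cache.put("c", 3)       # evicts "b", the least recently used
assert cache.get("b") is None
assert cache.get("a") == 1
```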
What problems poor eviction policy can cause?
  • Imagine you ask the cache for something and most of the time it says "I don't have it", so you go to the DB anyway and end up making more network calls.
    • So the first problem is extra calls.
  • The second problem: with a very small cache, imagine inserting an entry for X, then inserting an entry for Y evicts X, then asking for X again reloads it and evicts Y, and so on.
    • This constant churn is called thrashing.
  • Data consistency
    • As an example, server 2 makes an update call and updates the DB; if server 1 then asks its cache for X's profile, it will fetch the outdated profile. (This is even more severe for things like password updates.)


Where cache can be placed?
  • It can be placed close to the database or close to the server.
    • Both placements have benefits and drawbacks.
  • If you want to place it close to the server, how close can you get? Well, you can place it in the server's memory itself.
    • If you do this, some of your server's memory is used up by the cache.
      • If the number of results is really small and you need to save on network calls, you can just keep it in memory.
      • If, say, server S2 fails, its in-memory cache fails with it.
      • What if the data on S1 and the data on S2 are not consistent, i.e. not in sync?
  • Putting the cache near the DB works like a global cache.
    • In this case, even if S2 crashes, S1 will keep serving requests and there won't be any data inconsistency.
    • It will be slightly slower, but it's more accurate.
    • You can also scale it independently, and the servers become more resilient too.

How to make sure data is consistent in cache?
  • There are two approaches to achieve it:
    • Write-through
      • You update the entry in the cache and update the database as part of the same write.
      • Possible problem: with per-server in-memory caches, if S1 makes an update call and updates its own cache and the DB, the stale copy in S2's cache is still inconsistent.
    • Write-back
      • You update the entry in the cache first and persist it to the database later (asynchronously, or when the entry is evicted).
      • Possible problems: you can lose data if the cache fails before the write reaches the database, and the DB is stale in the meantime.
  • Both approaches have advantages and disadvantages.
    • A hybrid solution is usually best, depending on the use case.
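The two policies above can be contrasted in a toy model. This is an illustrative sketch (the `Store` class and its dicts are assumptions): write-through updates cache and DB together, write-back updates only the cache and defers the DB write until a flush.

```python
class Store:
    def __init__(self):
        self.cache, self.db = {}, {}
        self.dirty = set()   # keys written to cache but not yet to the DB

    def write_through(self, key, value):
        self.cache[key] = value
        self.db[key] = value            # synchronous: DB is always up to date

    def write_back(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)             # DB write deferred

    def flush(self):
        # If the cache dies before flush(), these dirty entries are lost.
        for key in self.dirty:
            self.db[key] = self.cache[key]
        self.dirty.clear()

s = Store()
s.write_through("x", 1)
assert s.db["x"] == 1               # visible in the DB immediately

s.write_back("y", 2)
assert "y" not in s.db              # DB is stale until we flush
s.flush()
assert s.db["y"] == 2
```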

Happy Learning :) 
