This article presents an economically feasible approach to the building of efficient, cost-friendly high-level architecture for geographically distributed systems and the benefits of in-memory distributed caching based on the examples of Ehcache, Apache Cassandra, and Apache Kafka.
According to the McKinsey report, big data is the “next frontier for innovation, competition, and productivity.” More and more companies are investing their resources into technologies for big data processing. McKinsey forecasts the exponential growth of big data technologies in the short term as well as the inevitable spread of such technologies in all industries worldwide. For tech companies, this means great growth and development opportunities, but at the same time, it will challenge their expertise and experience.
First, when processing big data, there must be uninterrupted data flow. That is, the system receiving clients’ requests should be able to get, store, and process data 24х7, 365 days a year without interruptions or failures and be as tolerant as possible to hardware and power issues.
Second, the system should be able to process a large number of requests per time unit, as the processing speed directly affects the volume of data received.
Third, the system should be ready to handle current computing resources and any additional computational powers on the fly without any delays or failures.
Fourth, the architecture of the system should allow the building of a geographically distributed data-processing network.
The main purpose of any tech company that designs such systems is to ensure that all these conditions are met. However, when building big data systems, you should ignore technological issues and consider the business interests of your client. In particular, you should focus on the efficiency enhancement of complex mission critical systems with relatively small changes in infrastructure maintenance and support costs. For example, it is common knowledge that SDD is a “faster” and more effective (though far more expensive) data storage system compared to HDD. Therefore, a fast and budget-friendly data storage system is a cornerstone of many designed systems, and in-memory distributed caching technologies also help solve these problems.
At present, there are a number of proprietary solutions on the market—these are preconfigured solutions for big data processing, and they have their own advantages and limitations. But as we are trying to create a cost-effective solution, I’ll try to examine the architectural model of such a system based on open-source solutions. My choice is based on the widespread acceptance of these solutions, their availability (no license required or cost of license is negligible), and the prompt support of the professional community providing bug fixes, consultations, and updates.
First, let’s take a look at the high-level components of the big data processing system.
The main cluster consists of several geographically distributed data processing centers. Its main purpose is to receive users’ requests via a load balancer. Based on the data centers’ load availability, the geographical origin of the request, or another set of customizable parameters, the balancer redirects the request to the appropriate data center. At this stage, balancing manages the utilization of the available computational resources with the highest possible efficiency.
The data centers’ high-level architecture usually has the following subsystems:
When speaking of open-source solutions, I suggest using the following time-proven and effective technologies:
Distributed message queue
For a distributed message queue, I suggest using Apache Kafka, a distributed messaging system developed by LinkedIn and later submitted to the Apache Incubator.
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. The design is heavily influenced by transaction logs.
The main advantages of Apache Kafka:
- Was designed with an eye to distributed environments
- Solutions based on this technology are highly scalable
- Provides a high-throughput message capacity
- Automatically balances subscribers in case of incidents
- Messages are stored on the HDD and can be used for packaged processing
- Messages are synchronized between data centers, so their processing continues even in the case of one node’s disconnection
Using Apache Kafka as a message broker allows us to ensure message delivery until there is at least one working server in the entire cluster. It is important for 24×7 operations and guaranteed data integrity.
A distributed caching system is very important to ensure a high processing rate of users’ requests. Appropriate data caching in distributed systems improves the efficiency of caching operations and thus reduces request processing time.
Ehcache is a widely used open source solution for Java distributed cache. It features memory and disk storage. Ehcache is available under an Apache open source license and is actively supported by the developers’ community. Ehcache was originally developed by Greg Luck in 2003. In 2009, the project was purchased by Terracotta.
For example, data center Alpha receives a request from city M: User-1 requests the report on the tangible assets (category A) flows for the period of January 2014. Data center Alpha prepares the data, places it into the distributed cache, and sends it to the user. The execution time of this request is 30 seconds. After a while, user-2 from city N sends the same request to data center Beta. If the cache wasn’t distributed, user-2 would have had to wait for the same 30 seconds until the system prepared the necessary data and sent it. But our caching system already has the data prepared according to the criteria “category A, January 2014,” so it simply sends the data to the user without calling the storage to prepare the report. In this case, user 2 has to wait for just 0.5 seconds to receive the requested data.
I highly recommend using Ehcache from Terracotta for this layer. This product allows organizing a distributed storage of cached data in the cluster with the possibility of adding new nodes in a few clicks. My choice here is based on the following:
- The good scalability and stability of the solution
- Sufficient flexibility and setup simplicity
- Comprehensive tech support provided by Terracotta
Distributed NoSQL storage
Distributed NoSQL storage caches data from the relational storage and performs data selection based on custom criteria. Besides, using the NoSQL solution provides faster data addition to the storage.
As a solution for this node of the data center, I suggest using Apache Cassandra. This product provides powerful capabilities for building highly available and scalable geographically distributed systems. Suffice to say that such world-renowned services as Netflix (according to 2013 data, around 3.14 petabytes of data) and eBay (daily load on 15,000 servers, around 2 petabytes) already use this solution in their systems.
Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients. Apache Cassandra was initially developed at Facebook. It was released as an open source project on Google code in July 2008. In March 2009, it became an Apache Incubator project. Cisco, IBM, Cloudkick, Reddit, Digg, Rackspace and Twitter use Cassandra-based industrial solutions.
If we take a look at performance benchmarks, we will see that the performance of Cassandra grows almost exponentially when the number of nodes is increased, which is also considered to be an economic justification for the use of this solution. Besides, Cassandra has the following features:
- Elastic scalability
- Flexible data storage system
- Support of peer-to-peer architecture
- Column-oriented DBMS
- High horizontal performance
- Simple and reliable data distribution
- High data compression ratio (up to 80% in some cases)
Thus, the main components of our distributed system meet the system requirements, in particular:
- Geographically distributed cluster support
- Failover guarantee
- High performance criteria
To illustrate the efficiency of the above-mentioned solutions and products, I’d like to share a real-life project developed in accordance with the described architecture. This system was designed by Auriga within the frame of development of a distributed intercontinental payment system for a fast-growing financial organization.
Our engineers were tasked with implementing an electronic payment processing system and payment kiosk solution in accordance with the following requirements:
- Processing of up to 10,000 requests per second
- Tolerance to the shutting down of separate servers or even an entire data center
- Possibility to analyze data on kiosk status and payment transactions in a timely manner
Based on the technology research used for building big data systems, Auriga’s team concluded that the best decision would be to build a system based on in-memory distributed caching using NoSQL storage to support the requested rate of operations. The parameters of the implemented systems are presented in the table below.
|Number of data centers||3|
|Number of nodes in the cluster||18|
|Number of requests per second||15,000|
|Volume of data processed in a day||40 Tb|
The system was implemented gradually, systematically adding new nodes to the cluster and transferring certain categories of customers to the new processing technology.
Thus, the use of the in-memory distributed caching technology and the distributed message queue based on Ehcache, Apache Cassandra, and Apache Kafka solutions allowed us to build a stable, 24х7 available geographically distributed system with high potential for scaling out.