The need to process huge datasets and extract valuable insights from them through detailed analysis has never been more pronounced: companies now read into seemingly insignificant numbers to learn about their customers and adapt to the demands of the market. The data they gather from web services, often extracted using web crawlers, feeds business intelligence, market studies, and internal data-processing requirements. Handling such large datasets demands cost-effective, high-performance computing capabilities that parallel RDBMSs have failed to deliver. Several major organizations have therefore developed technologies that harness the computational power of large clusters of commodity servers to process and analyze large amounts of data (big data).
So what are clusters?
Clusters consist of hundreds or even thousands of commodity machines connected by high-bandwidth networks. Technologies built on them include Hadoop, the LexisNexis HPCC (our focus in this series), Google's MapReduce, Sector/Sphere, and others.
High-Performance Computing, or HPC, describes computing environments that use supercomputers and computer clusters to address complex computational requirements and to support applications that need large amounts of processing time, whether they are data-intensive (operating on large volumes of data) or compute-intensive (having large computational requirements, but usually working on small volumes of data).
Supercomputers have traditionally been associated with scientific research and compute-intensive problems, but supercomputer technology is increasingly applied to workloads that are both compute-intensive and data-intensive, often referred to as “Big Data” applications.
Supercomputers exhibit a very high degree of internal parallelism. They use specialized multiprocessors with memory architectures customized to optimize numerical calculations, and exploiting their full performance potential requires sophisticated parallel programming techniques.
Today, a high-end desktop has more computational power than some of the best supercomputers of the 1990s, which has led to a new trend in supercomputing: utilizing the computational capabilities of clusters of commodity computers with independent processors.
Once the data is divided in a logical manner, independent processing nodes can work on separate parts of the problem, and the final results from each node can be combined. This approach is called data parallelism (also known as horizontal partitioning, or building data-parallel applications) and can be applied even when the data processing requirements reach petabyte scale. Data parallelism can be defined as a computation applied independently to each item in a set of data, which means that the degree of parallelization scales with the volume of data.
The main reason for using data parallelism when developing applications is the scalability it offers in high-performance computation: because parallelism grows with the amount and division of the data, it can yield performance improvements of several orders of magnitude.
Some of the key issues in data-parallel applications are:
- The strategy for data decomposition and load balancing on each processing node, i.e., how the data is divided and distributed to the separate nodes for independent processing
- Communication between the processing nodes
- The overall accuracy of the results
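To make the decomposition and combination steps concrete, here is a minimal sketch in plain Python (not HPCC's ECL; the round-robin partitioning strategy and the sum-of-squares workload are illustrative assumptions, not anything prescribed by the platforms named above):

```python
# A minimal sketch of data parallelism: decompose a dataset into
# roughly equal partitions (simple load balancing), apply the same
# function to each partition independently, then combine the results.

def partition(data, n_nodes):
    """Round-robin split into n_nodes chunks; chunk sizes stay within
    one item of each other, a naive load-balancing strategy."""
    chunks = [[] for _ in range(n_nodes)]
    for i, item in enumerate(data):
        chunks[i % n_nodes].append(item)
    return chunks

def process(chunk):
    """The per-node work: here, the sum of squares of the chunk's items."""
    return sum(x * x for x in chunk)

def run(data, n_nodes=4):
    partials = [process(c) for c in partition(data, n_nodes)]  # independent
    return sum(partials)  # combine step

print(run(range(10)))  # same answer as the serial computation: 285
```

Because each partition is processed with no knowledge of the others, the loop over `partials` could be distributed across real nodes without changing the logic; only the final combine step needs to see all the partial results.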
Another issue we face while building data-parallel applications is potentially high program complexity, since the problem must be defined in terms of the available programming tools and the limitations of the target architecture.
Commodity Computing Clusters
A company is said to achieve an economy of scale when its costs per unit fall as the company grows and production increases, i.e., more goods or services can be produced on a larger scale with lower input costs.
The use of multiple independent processing units as nodes in the design of modern supercomputers has helped achieve an economy of scale in computational power relative to cost, which is why commodity computer clusters are used to satisfy high-performance computing requirements.
A computer cluster is essentially a group of independent computers linked by high-speed communication lines in a local area network topology (typically gigabit Ethernet switches or InfiniBand). Clusters run system software that provides a parallel processing environment and can divide processing among the different nodes of the cluster. Clusters are typically deployed to:
- Improve performance beyond what a single computer can deliver
- Provide higher availability and reliability
- Offer far better cost-effectiveness than equivalent supercomputer systems
Though the hardware architecture is an important determinant of the capability, performance, and throughput of a computing cluster, the key to taking full advantage of the available hardware is the system software and the accessible tools that provide the parallel job execution environment. This is why a programming language with implicit parallel-processing features, enabling a high degree of optimization, is vital for both high performance and programmer productivity.
Clusters allow the data in use to be partitioned among the available computing resources, i.e., distributed to individual processing nodes, each a system with its own processor and local memory. The data is then processed independently on each node, achieving high performance and data-volume-based scalability.
Because this particular parallel processing approach involves no communication between the processing nodes, it is called the shared-nothing approach: each node has its own processor, local memory, and disk, and shares nothing with any other node in the cluster.
Clustering is most effective when the “shared nothing” approach can be implemented. Why? Because clusters work best when the task is easily partitioned into a number of parallel subtasks that each processing node can handle individually, with no communication between the nodes, i.e., no interdependence between the tasks assigned to each node, other than the overall management of the tasks.
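The shared-nothing idea can be sketched in Python by modelling each "node" as a worker process that receives only its own partition and keeps only local state (the word-count workload and function names here are illustrative assumptions; a real cluster would run the workers on separate machines rather than `multiprocessing` workers on one host):

```python
# Shared-nothing sketch: each worker process gets its own partition and
# its own memory; the only coordination is merging results at the end.

from multiprocessing import Pool
from collections import Counter

def node_word_count(lines):
    """Runs entirely on local data: counts words in this node's partition."""
    local = Counter()
    for line in lines:
        local.update(line.split())
    return local

def cluster_word_count(lines, n_nodes=3):
    # Partition: node i receives every n_nodes-th line (round robin).
    parts = [lines[i::n_nodes] for i in range(n_nodes)]
    # Each pool worker is a separate process with its own memory,
    # standing in for an independent cluster node.
    with Pool(n_nodes) as pool:
        partials = pool.map(node_word_count, parts)
    # The only "communication": combining the per-node partial results.
    total = Counter()
    for p in partials:
        total.update(p)
    return total

if __name__ == "__main__":
    text = ["to be or not to be", "be here now"]
    print(cluster_word_count(text)["be"])  # 3
```

Note that `node_word_count` touches no global state, which is exactly what makes the tasks independently schedulable: any node could fail and be rerun on its partition without affecting the others.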
HPCCs are revolutionizing data-dependent industries by speeding up processes that would otherwise take weeks or even months, improving reliability, and giving the majority of researchers access to computational resources previously out of reach, from the confines of one’s home and irrespective of the sophistication of the personal computer used, thereby opening up endless possibilities by facilitating innovation.