Big data software has tremendous potential, but the infrastructure clusters that support these new platforms challenge the service-oriented architecture standards of most data centers. As data analytics requirements explode, technology infrastructure and its associated costs can quickly grow out of control.
Beyond the most important question of all ("How do I quantify the value of this big data project?"), these infrastructures have highlighted another chasm inside most organizations considering an investment in Hadoop: a wide gap between the existing core IT infrastructure and the next-generation, cluster-based architecture best suited to support the Hadoop ecosystem and the big data platforms and applications that reside there.
We call this the Big Data Gap!
In this article, we share a proven approach for bridging the Big Data Gap, one that not only addresses concerns such as cost containment, risk, and operational efficiency, but also provides a next-generation reference architecture, complete with TCO modeling, cost-benefit analysis, and comparative analysis.
Big Data Clusters Are the Foundation and the Future
Many organizations are exploring how big data clusters can analyze their unstructured data to uncover new insights into their business and find new revenue streams. Some have concluded, "I don't see the value," and many never will. We at Headwaters, however, take the bullish view and expect investment in and adoption of Hadoop-based architectures to grow over the next 2-3 years.
Additionally, big data cluster technologies and delivery mechanisms require a unique approach to infrastructure delivery and operations. Organizations moving toward a big data approach to analytics must adapt their service catalog standards to enable a cost-effective, low-risk approach to providing the right infrastructure at the right time, so as not to overwhelm their budgets or the capabilities of their staff.
Leveraging the original Hadoop hardware ideals, big data cluster software is typically designed to operate on commodity servers that localize compute, network, and storage, providing a low-latency, highly available cluster with compute scalability built in. The standard use of Hadoop's MapReduce functionality originally leveraged this approach wonderfully: MapReduce is a batch processing engine, so using many data nodes with large, inexpensive disks and limited compute and memory resources worked very well. As big data has evolved, however, secondary and tertiary software layers have begun shifting the bulk of cluster workloads toward more compute- and memory-intensive jobs.
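To make the batch model concrete, here is a minimal sketch of the MapReduce pattern using the classic word-count example. This is a local simulation of the map, shuffle, and reduce phases, not Hadoop itself; on a real cluster the same mapper and reducer functions would run across many data nodes (for example, via Hadoop Streaming).

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    """Reduce phase: sum the counts for each word.

    Sorting the pairs simulates the shuffle step, which on a real
    cluster groups all values for a key onto the same reducer node.
    """
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    sample = ["big data clusters", "big disks cheap compute"]
    for word, n in reducer(mapper(sample)):
        print(word, n)
```

Because each mapper touches only its local slice of the data and reducers only see pre-grouped keys, the work spreads naturally across many inexpensive disk-heavy nodes, which is exactly why the original commodity-hardware approach fit MapReduce so well.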
Tools like Pig, Spark, and Hive are gaining popularity with analytics teams, forcing infrastructure teams to respond with larger, costlier systems to satisfy cluster requirements. As these advanced tools gain momentum, per-node costs have risen dramatically.
It is now common for cluster nodes to require massive amounts of memory, additional compute cores, and SSD storage. As cluster requirements grow, infrastructure costs can soar.
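A back-of-the-envelope model illustrates the shift. The sketch below compares a classic batch-oriented data node against a memory- and SSD-heavy node profile; all component prices and node specifications are illustrative assumptions, not vendor quotes.

```python
def node_cost(cores, ram_gb, storage_tb, ssd=False,
              core_price=50.0, ram_gb_price=8.0,
              hdd_tb_price=30.0, ssd_tb_price=200.0):
    """Rough hardware cost for one cluster node (USD).

    All unit prices are hypothetical defaults used for illustration.
    """
    storage_price = ssd_tb_price if ssd else hdd_tb_price
    return cores * core_price + ram_gb * ram_gb_price + storage_tb * storage_price

# Classic MapReduce data node: modest compute, large cheap disks.
batch_node = node_cost(cores=8, ram_gb=64, storage_tb=48)

# Spark/Hive-era node: more cores, far more memory, SSD storage.
memory_node = node_cost(cores=32, ram_gb=512, storage_tb=8, ssd=True)

print(f"batch node:  ${batch_node:,.0f}")
print(f"memory node: ${memory_node:,.0f}")
print(f"cost ratio:  {memory_node / batch_node:.1f}x")
```

Even with these made-up numbers, the memory-optimized profile costs a multiple of the batch profile per node, which compounds quickly across a cluster of dozens or hundreds of nodes.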
Big data clusters require a transformational architecture approach within most traditional data centers. As big data gains traction in an organization, it is incumbent upon the IT infrastructure organization to create a new reference architecture to support the unique requirements of big data clusters. It is important for IT organizations to be able to communicate clearly with big data and analytics stakeholders to create the best reference architecture for their organizational requirements.
Understanding the foundational pillars of how and why big data will be used within the organization is critical. Fundamental questions to ask when starting a big data reference architecture discussion include:
- What is the base big data software construct?
- Will the software and its associated jobs be more batch-oriented or more transactional?
- Where will most of the analytics occur: in memory or on disk?
- Where will data be ingested from and how?
- How quickly will the analytics resources grow?
- Will post-processed data need to be replicated?
- Will other lines of business or other software platforms need access to post-processed data?
These questions can begin to set a foundation for an infrastructure team to create an extensible, scalable reference architecture that enables big data analytics within the organization. As reference architectures start to take shape, organizations should also consider where to host their big data clusters. The elastic nature of analytics, and of the clusters that support it, makes big data an interesting candidate for cloud or off-premises hosted facilities.
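One lightweight way to operationalize these planning questions is to track them as a weighted scorecard during the reference architecture discussion. The sketch below is a hypothetical illustration; the wording of the items and their weights are assumptions, not a formal methodology.

```python
# Planning questions from the discussion above, each with an
# illustrative weight reflecting its architectural impact.
READINESS_QUESTIONS = [
    ("Base big data software construct identified", 2),
    ("Batch vs. transactional workload mix understood", 2),
    ("In-memory vs. on-disk analytics profile known", 1),
    ("Data ingestion sources and methods mapped", 2),
    ("Analytics resource growth rate estimated", 1),
    ("Post-processed data replication needs defined", 1),
    ("Downstream access to post-processed data scoped", 1),
]

def readiness_score(answers):
    """Return the weighted fraction of questions answered (0.0 to 1.0)."""
    total = sum(weight for _, weight in READINESS_QUESTIONS)
    earned = sum(weight for question, weight in READINESS_QUESTIONS
                 if answers.get(question))
    return earned / total

if __name__ == "__main__":
    # Example: the first four questions have been answered.
    answers = {question: True for question, _ in READINESS_QUESTIONS[:4]}
    print(f"readiness: {readiness_score(answers):.0%}")
```

A scorecard like this gives infrastructure and analytics stakeholders a shared, concrete view of which foundational questions remain open before hosting and sizing decisions are made.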
As we’ve discussed, big data clusters can and will scale, and perhaps contract, very quickly over their lifespan. This elastic scalability, along with support for next-generation analytics software platforms, requires a new approach to operational efficiency. Infrastructure operations processes and policies built around traditional client-server applications hosted in virtualized environments must change significantly to support a big data environment.
Server deployment standards and methodologies must be adapted to match the reference architecture and the progressive nature of big data deployments. Additional operational efficiency can sometimes be realized in off-premises hosted deployments, but operations staff must still ensure quality control as an intermediary for the organization’s analytics staff. A new category of infrastructure-focused DevOps tools is quickly maturing to help IT infrastructure teams achieve an appropriate level of operational efficiency.
New platforms always bring risk, and big data cluster complexity and scalability can dramatically increase it. By carefully considering platform choices and operational models, IT organizations can get a head start on infrastructure alignment and support that dramatically reduces that risk. There are also straightforward ways to introduce enterprise IT functionality to big data clusters: data replication, high availability, and server analytics can all be implemented to complement big data software’s built-in functionality, delivering a consistent service that significantly reduces organizational risk.
The Readiness Assessment
To help our clients bridge this gap, Headwaters Group has developed a Big Data Readiness Assessment that addresses these issues, delivers a next-generation reference architecture in less than two months, and makes the engagement self-funding.