Open-source platforms for big data have exploded in popularity. And in the past few months, it seems like nearly everyone is feeling the fallout.
Cost, flexibility and the availability of trained personnel are major reasons for the open-source boom. Hadoop, R and NoSQL are now the supporting pillars of many enterprises’ big data strategies, whether they involve managing unstructured data or performing complex statistical analyses on it.
It’s almost hard to keep up: SAP AG recently released a new product, SAP BusinessObjects Predictive Analysis, software that integrates algorithms from the open-source R language, which is used extensively in the academic community for advanced statistical modelling.
A few weeks before that, Teradata Corp. announced that its new integrated analytics portfolio would include R functionality as well as a connection to GeoServer, a Java-based open-source geolocation platform. Countless other companies are rushing to build links to Hadoop.
Widespread adoption, feverish innovation
James Kobielus, then an analyst at Forrester Research Inc. (he’s now senior program director for product marketing of big data analytics solutions at IBM Corp.), wrote in an e-mail message that “open-source approaches have the momentum of the most widespread adoption and the most feverish innovation.”
But what’s the rush?
First of all, Kobielus explains, just as open-source products ranging from Mozilla to Android have earned widespread acceptance in the IT community after some birth pains, open-source data storage and analysis software has now matured (“no longer the risky bet they were just a year or two ago,” as he puts it).
Secondly, Kobielus wrote, platforms like Hadoop, R and NoSQL have enjoyed an advantage over proprietary software because they were able to evolve faster. They’re also being continuously developed and refined by many different parties. Pretty soon, he predicts, open-source will begin to dominate the big data world.
“As the footprint of closed-source software shrinks in many data/analytics environments, many incumbent vendors will evolve their business models toward open-source approaches,” he wrote, “and also ramp up professional services and systems integration to assist customers in their moves towards open-source, cloud-oriented analytics, much of it focused on Hadoop and R.
“Forrester regards Hadoop, for example, as the nucleus of the next-generation enterprise data warehouse (EDW) in the cloud, and R as a key codebase in the coming wave of integrated big data development tools. We also expect various open-source NoSQL databases and tools to coalesce into rich alternatives to closed-source content analytics offerings.”
The Red Hat model
Different enterprises are approaching open-source integration in different ways. Some, like SAP, have opted to use their own in-house expertise to develop products with Hadoop or R functionality, while others, like Teradata, hand over much of the work to firms like Revolution Analytics Inc., a company that is somewhat like the Red Hat Inc. of big data. The company offers a commercialized version of R geared towards enterprises, much as Red Hat does with Linux.
A small company standing among big data giants, the firm specializes in modifying R for distinct business processes, says David Smith, vice-president of marketing and community at Revolution Analytics. “In particular,” he says, “we make it run with really big data sets.”
Using open-source in their products is a way for companies to differentiate themselves in the market, says Smith. “By definition,” he says, “it means that you’re not doing what your competitors are doing.”
Smith says that for organizations that take a progressive, scientific approach to big data analysis, open-source technologies are a natural choice. “Those companies that have a bit of a culture around data science, around exploration and curiosity with data, have really gravitated towards open-source technologies because they’re so flexible and they lend themselves to these different ways of just thinking about working with data and exploring different things you can do with that.”
Scott Gnau, president of Teradata Labs, which has partnered with Revolution Analytics, says large enterprises will benefit most from commercial packages of open-source technology so they can keep their focus on their particular line of business.
“There is a lot of value to be created in adopting some of the newer technologies that are developed in a Hadoop and MapReduce environment, but to deploy them as an enterprise-class kind of software, where there’s dependable version control, and there’s dependable scalability and there’s support available.
“It’s got to be packaged and dependable to get into the mainstream because the mainstream doesn’t want to be a software development house,” he says.
Will Davis, product marketing manager at EMC Greenplum, agrees. Larger companies, he says, need more stable, reliable incarnations of open-source big data platforms, whether they add the polish themselves or rely on others to do it for them.
“A lot of the enterprises… traditional customers of EMC, these sort of large Fortune 500 companies, really need their deployment of this technology to be enterprise-ready, to meet strict SLAs, to be always available,” he says.
Some early adopters of open-source technology developed the expertise to go it alone, but “the second wave” of companies, he says, is anxious to get up and running quickly and might not have the internal talent for a do-it-yourself approach.
Big data talent is indeed in great demand these days, and companies are realizing that by running open-source platforms, they’ll be in the best position to attract trained people. Open-source technologies, particularly R, are widely used in academia.
These data scientists, moreover, work better with open-source platforms. Imran Ahmad is a data scientist who has developed his own grid-computing algorithm, a Hadoop competitor called Bileg, which is based on the open-source Globus toolkit (GT4). The president of Cloudanum Inc., a Toronto-based company that develops data analysis technologies for cloud environments, he says the fundamental advantage in an open-source platform is that people like him can see its underlying mathematical basis.