Calling Apache Spark “the most important new open source project in a decade that is being defined by data,” IBM today announced that it will embed the compute engine into its analytics and commerce platforms and offer Spark as a service on IBM Bluemix.

As part of its new commitment to Spark, Big Blue also says it will assign more than 3,500 IBM researchers and developers to work on Spark-related projects at more than a dozen labs worldwide and will donate its IBM SystemML machine learning technology to the Spark open source ecosystem. It also pledged to educate more than one million data scientists and data engineers on Spark.

Adding Spark to Hadoop Distributed File System

Spark is a cluster computing framework designed to run on top of the Hadoop Distributed File System (HDFS) in place of Hadoop MapReduce. With support for in-memory cluster computing, Spark can run up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. The compute engine is geared toward data processing workflows, advanced analytics, stream processing and business intelligence/visual analytics.
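To make the in-memory distinction concrete, here is a minimal toy sketch in plain Python. It does not use Spark or Hadoop APIs; the function names (`word_counts`, `run_disk_style`, `run_in_memory`) and the sample data are hypothetical, chosen only for illustration. It assumes a two-stage job (count words, then pick the most frequent) and contrasts writing the intermediate result to disk between stages with keeping it in memory.

```python
import json
import os
import tempfile
from collections import Counter

# Hypothetical sample input; a real job would read these lines from HDFS.
LINES = ["spark on hdfs", "spark in memory", "hdfs on disk"]


def word_counts(lines):
    """Stage 1: split each line into words and count occurrences."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def top_words(counts, n=2):
    """Stage 2: return the n most frequent words, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def run_disk_style(lines, workdir):
    """Disk-style pipeline: stage 1 output is written to disk and read back by stage 2."""
    path = os.path.join(workdir, "stage1.json")
    with open(path, "w") as f:
        json.dump(dict(word_counts(lines)), f)
    with open(path) as f:
        counts = json.load(f)
    return top_words(counts)


def run_in_memory(lines):
    """In-memory pipeline: the intermediate counts never leave memory."""
    return top_words(word_counts(lines))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(run_disk_style(LINES, workdir))
    print(run_in_memory(LINES))
```

Both pipelines produce the same answer; the difference is the round trip to storage between stages, which is the cost an in-memory engine tries to avoid when a job chains many stages or reuses the same dataset repeatedly.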
