What is Apache Spark

In short, Apache Spark is a fast, general-purpose, in-memory data processing engine.

A survey asking companies why they use an in-memory computing framework like Apache Spark produced clear results:

  • 91% use Apache Spark because of its performance gains.
  • 77% use Apache Spark as it is easy to use.
  • 71% use Apache Spark due to the ease of deployment.
  • 64% use Apache Spark to leverage advanced analytics.
  • 52% use Apache Spark for real-time streaming.


Apache Spark key features

  • Spark can be run in a standalone cluster mode that simply requires the Apache Spark framework and a JVM on each machine in your cluster.
  • It provides APIs for Scala, Java, and Python, with additional support for R and SQL, as well as built-in libraries for data streaming, machine learning, and graph processing.
  • Spark integrates well with the Hadoop ecosystem and common data sources (HDFS, Amazon S3, Hive, HBase, Cassandra, etc.).
  • Spark can run on clusters under Hadoop YARN or Apache Mesos, and also in stand-alone mode.
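The standalone mode mentioned above can be brought up with the launch scripts shipped in Spark's `sbin/` directory. A rough sketch (the install path `/opt/spark`, the host name `master-host`, and the application names are placeholders; older Spark releases named the worker script `start-slave.sh`):

```shell
# Sketch only: assumes Spark is unpacked under /opt/spark and a JVM is on each machine.

# Start the master on one machine; it logs a URL like spark://master-host:7077.
/opt/spark/sbin/start-master.sh

# On each worker machine, point a worker at the master's URL.
/opt/spark/sbin/start-worker.sh spark://master-host:7077

# Submit an application to the standalone cluster.
/opt/spark/bin/spark-submit \
  --master spark://master-host:7077 \
  --class com.example.MyApp \
  /path/to/my-app.jar
```

The same `spark-submit` command works against YARN or Mesos by changing the `--master` argument, which is what makes the deployment options listed above largely interchangeable from the application's point of view.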

Spark best use cases

  • Data integration and ETL
  • Interactive analytics or business intelligence
  • High performance batch computation
  • Machine learning and advanced analytics
  • Real-time stream processing

Check our article on How to Install and Configure an Apache Spark Cluster.
