In short, Apache Spark is a fast, general-purpose, in-memory data processing engine.
A survey of Spark users asked why companies use an in-memory computing framework like Apache Spark, and the results are telling:
- 91% use Apache Spark because of its performance gains.
- 77% use Apache Spark as it is easy to use.
- 71% use Apache Spark due to the ease of deployment.
- 64% use Apache Spark to leverage advanced analytics.
- 52% use Apache Spark for real-time streaming.
Apache Spark key features
- Spark can be run in a standalone cluster mode that simply requires the Apache Spark framework and a JVM on each machine in your cluster.
- It provides APIs for Scala, Java, and Python, with support for R and SQL as well, plus built-in libraries for data streaming, machine learning, and graph processing.
- Spark integrates well with the Hadoop ecosystem and common data sources (HDFS, Amazon S3, Hive, HBase, Cassandra, and others).
- Spark can run on clusters managed by Hadoop YARN or Apache Mesos, as well as in standalone mode.
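The API surface behind these features is small: transformations (such as flatMap, map, and reduceByKey in the RDD API) describe a computation, and an action triggers it across the cluster. As a minimal sketch of that style, here is a word count in plain Python, with local lists standing in for distributed datasets so it runs without Spark or a cluster; the comments name the PySpark operation each step corresponds to:

```python
from functools import reduce

# Input lines; in Spark these would come from sc.textFile(...) on HDFS, S3, etc.
lines = ["spark is fast", "spark is easy to use"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each word
def merge(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(merge, pairs, {})
print(word_counts)  # per-word totals, e.g. 'spark' appears twice
```

In PySpark the same pipeline would be `rdd.flatMap(...).map(...).reduceByKey(...)`, with Spark distributing each step over the cluster and keeping intermediate data in memory.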
Spark best use cases
- Data integration and ETL
- Interactive analytics or business intelligence
- High performance batch computation
- Machine learning and advanced analytics
- Real-time stream processing
Check out our article on How to Install and Configure an Apache Spark Cluster.