Spark Intro
- reference: Apache Spark: core concepts, architecture and internals
Keywords
- RDD
- Operation
- Create RDD
- Action: Count, Collect, Take
- Transformation: Map, Filter
- Partition, Dependency
1. RDD (Resilient Distributed Datasets)
- Key properties: Distributed / Immutable / Resilient
- Distributed: each RDD is split into multiple pieces (partitions), and the partitions are spread across the nodes of the cluster
- Immutable (read-only): cannot be changed after creation, which removes the potential problems caused by updates from multiple threads
- Resilient: if any node in the cluster goes down, the lost parts of the RDD can be recovered, because they can be recreated at any time from their 'lineage'; this provides fault tolerance
- Basic concept
- can contain any type of object, such as Python, Java, or Scala objects, including user-defined classes
- is an encapsulation around a very large dataset
- Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them
- parallelizes applications (compute-intensive apps, apps requiring input from data streams, etc.) across clusters
- RDD workflow
- Create initial RDDs from external data
- Apply Transformation
- Launch Actions
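The properties above (partitioned, immutable, recoverable from lineage) can be sketched in plain Python. This is only a toy model under stated assumptions, not Spark's implementation: `ToyRDD`, `from_list` and the `cache` dictionary are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not a Spark class)."""
    num_partitions: int
    compute: Callable[[int], list]      # how to (re)build one partition by index
    parent: Optional["ToyRDD"] = None   # lineage: the RDD this one was derived from

    def map(self, func):
        # A transformation never mutates self; it returns a new ToyRDD.
        return ToyRDD(self.num_partitions,
                      lambda i: [func(x) for x in self.compute(i)],
                      parent=self)


def from_list(data, num_partitions):
    # Split the source data into contiguous slices, one per partition.
    size = -(-len(data) // num_partitions)  # ceiling division
    return ToyRDD(num_partitions, lambda i: data[i * size:(i + 1) * size])


source = from_list([1, 2, 3, 4, 5, 6], num_partitions=3)
doubled = source.map(lambda x: x * 2)

# Pretend each partition was materialized on a different node.
cache = {i: doubled.compute(i) for i in range(doubled.num_partitions)}
del cache[1]                     # the node holding partition 1 goes down
cache[1] = doubled.compute(1)    # only that partition is rebuilt from lineage
recovered = [cache[i] for i in range(doubled.num_partitions)]
```

Because `ToyRDD` is frozen, the only way to "change" it is to derive a new one, and the `parent` link is what makes recomputing a lost partition possible.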
How to create an RDD
- 1) (Not practical for large datasets) Take an existing collection and pass it to SparkContext's parallelize method.
- 2) Load RDDs from external storage by calling the textFile method on SparkContext.
- External storage: AWS S3, HDFS, etc. (JDBC, Cassandra, Elasticsearch)
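To make the two creation routes concrete, here is a rough plain-Python imitation of what they produce. The functions below are toy stand-ins written for these notes, not the real SparkContext methods, and the slicing arithmetic and the `sample.txt` file are assumptions chosen for illustration.

```python
import os
import tempfile


def parallelize(collection, num_slices):
    """Toy version: cut an in-memory collection into contiguous partitions."""
    items = list(collection)
    n = len(items)
    return [items[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


def text_file(path):
    """Toy version: one record per line of the file (a single partition here)."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


# Route 1: start from a collection that already lives in the driver's memory.
from_collection = parallelize(range(10), num_slices=3)

# Route 2: start from a file in storage (a temporary local file in this sketch).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")   # hypothetical input file
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("first line\nsecond line\n")
    from_storage = text_file(sample_path)
```

The first route needs the whole collection in the driver's memory, which is why it is not practical for large datasets; the second reads records from storage instead.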
2. Spark Architecture
Driver Program (the Spark application, e.g. the Spark shell)
- Drives the application
- Consists of SparkContext and user code
SparkContext
- Takes the job, breaks it into tasks and distributes them to the worker nodes
- Represents the connection to Spark's Cluster Manager, which allocates resources across applications
- The Driver Program accesses Spark through the SparkContext object
Cluster Manager
- Does not need to be on the same machine as the Driver Program
- Manages jobs within the cluster
- Allocates resources
Executor
- Executes the tasks on the partitioned RDDs
- Launched at the beginning of the Spark app
- Returns results to SparkContext
- Interacts with the storage system
DAG (Directed Acyclic Graph)
- User code containing RDD transformations forms a DAG
- The DAGScheduler splits the DAG into stages of tasks
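As a rough illustration of how a chain of transformations might be cut into stages, the sketch below starts a new stage whenever a step needs a shuffle. It is a simplification under stated assumptions: the `pipeline` list and `split_into_stages` are hypothetical, and the real DAGScheduler works on RDD dependencies rather than on a flat list of names.

```python
# Each step is (transformation name, dependency type).
# Only "wide" steps need a shuffle; the labels are assumptions for this sketch.
pipeline = [
    ("map", "narrow"),
    ("filter", "narrow"),
    ("reduceByKey", "wide"),
    ("map", "narrow"),
]


def split_into_stages(steps):
    """Group consecutive steps into stages, opening a new stage at each wide step."""
    stages = [[]]
    for name, dependency in steps:
        if dependency == "wide" and stages[-1]:
            stages.append([])      # shuffle boundary: close the current stage
        stages[-1].append(name)
    return stages


stages = split_into_stages(pipeline)
```

Here the two narrow steps before `reduceByKey` end up in one stage, and everything from the shuffle onwards lands in a second stage.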
3. Spark Components
Main Components
- Spark Driver
- Executors
- Cluster Manager
Other Components: responsible for translating user code into actual jobs executed on the cluster
- SparkContext
- DAGScheduler
- TaskScheduler
- SchedulerBackend
- BlockManager
4. Operation
- 1) Action: compute a result (e.g. count, collect) based on an RDD and return it to the Driver Program or write it to storage
- 2) Transformation: apply a function to the data in an RDD to create a new RDD
- filter: return a new RDD with a subset of the data in the original RDD (the elements for which func() returns true); used, for example, to remove invalid rows when cleaning up data
- map: pass each element through the function; the resulting RDD holds the new value of each element
- 3) Transformation vs Actions: transformations are lazy (Lazy Execution), so no RDD is computed until it is used in an action (e.g. collect(), first())
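The difference between transformations and actions can be mimicked in a few lines of plain Python. This is a toy model only: `LazyRDD` is a hypothetical class written for these notes, and the `calls` list exists just to show when the mapped function actually runs.

```python
class LazyRDD:
    """Toy model of lazy execution: transformations are recorded, actions run them."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps          # recorded transformations, not yet applied

    def map(self, func):
        return LazyRDD(self._data, self._steps + (("map", func),))

    def filter(self, func):
        return LazyRDD(self._data, self._steps + (("filter", func),))

    def collect(self):
        # Action: only now is the recorded chain applied to the data.
        result = iter(self._data)
        for kind, func in self._steps:
            # Inside a method, map/filter refer to the Python built-ins.
            result = map(func, result) if kind == "map" else filter(func, result)
        return list(result)


calls = []                           # records every time square() runs


def square(x):
    calls.append(x)
    return x * x


rdd = LazyRDD([1, 2, 3, 4]).map(square).filter(lambda x: x > 4)
calls_before_action = len(calls)     # nothing has been computed yet
result = rdd.collect()               # the action triggers the computation
```

Building `rdd` only records the two steps; `square` is not called until `collect()` runs.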
5. Others
- Dependencies
- Wide (shuffle): multiple child partitions may depend on one parent partition, so a child partition may require data from all parent partitions, e.g. groupByKey(), reduceByKey()
- Narrow: each partition of the parent RDD is used by at most one partition of the child RDD, e.g. map(), filter()
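The two kinds of dependency can be contrasted on plain Python lists standing in for partitions. This is a sketch under stated assumptions: `shuffle_by_key` and its `ord(key) % num_out` partitioner are hypothetical choices made for the example (they only work for single-character keys) and are not how Spark partitions data.

```python
# Two parent partitions of (key, value) pairs.
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow: each child partition is built from exactly one parent partition.
mapped = [[(k, v * 10) for k, v in part] for part in partitions]


def shuffle_by_key(parts, num_out):
    """Wide: every output partition may need records from every parent partition."""
    out = [dict() for _ in range(num_out)]
    for part in parts:
        for key, value in part:
            target = ord(key) % num_out   # hypothetical partitioner for this sketch
            out[target].setdefault(key, []).append(value)
    return out


grouped = shuffle_by_key(mapped, num_out=2)
```

The values for key "a" start out in two different parent partitions and only meet after the shuffle, which is why a wide dependency forces data to move between nodes.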
Operation Code from Spark Doc
Transformations
The following table lists some of the common transformations supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.
Actions
The following table lists some of the common actions supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.
Action | Meaning |
---|---|
reduce(func) | Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel. |
collect() | Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data. |
count() | Return the number of elements in the dataset. |
first() | Return the first element of the dataset (similar to take(1)). |
take(n) | Return an array with the first n elements of the dataset. |
takeSample(withReplacement, num, [seed]) | Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed. |
takeOrdered(n, [ordering]) | Return the first n elements of the RDD using either their natural order or a custom comparator. |
saveAsTextFile(path) | Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file. |
saveAsSequenceFile(path) (Java and Scala) | Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc). |
saveAsObjectFile(path) (Java and Scala) | Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile(). |
countByKey() | Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key. |
foreach(func) | Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details. |
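
The requirement in the reduce(func) row that func be commutative and associative can be checked with a small plain-Python experiment. This is a toy model, not Spark's reduce: `distributed_reduce` is a hypothetical helper that reduces each partition separately and then combines the partial results.

```python
from functools import reduce


def distributed_reduce(partitions, func):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


data = [1, 2, 3, 4, 5, 6]
two_parts = [data[:3], data[3:]]
three_parts = [data[:2], data[2:4], data[4:]]


def add(a, b):
    return a + b          # commutative and associative


def sub(a, b):
    return a - b          # neither commutative nor associative


sum_two = distributed_reduce(two_parts, add)
sum_three = distributed_reduce(three_parts, add)
diff_two = distributed_reduce(two_parts, sub)
diff_three = distributed_reduce(three_parts, sub)
```

With `add`, both partitionings give the same total; with `sub`, the result depends on how the data happened to be split, so the answer is not well defined in a parallel setting.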