[Spark_3] Spark SQL Basics


Spark SQL basics

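RDD vs Dataset/DataFrame: both are still used, but Dataset/DataFrame usage is now more common
- Dataset: a distributed collection of data
- DataFrame: a Dataset organized into named columns
- An RDD can be converted to a Dataset or a DataFrame
- Python has no Dataset concept (the Dataset API exists in Scala and Java; PySpark works with DataFrames)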
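
The RDD-to-DataFrame conversion noted above could look roughly like this in PySpark. This is a minimal sketch, assuming PySpark and a local Java runtime are installed; the function name `rdd_to_dataframe_demo`, the app name, the sample rows, and the column names are all made up for illustration.

```python
ROWS = [("alice", 34), ("bob", 45)]   # illustrative sample data
COLUMNS = ["name", "age"]             # illustrative column names


def rdd_to_dataframe_demo():
    """Turn an RDD of tuples into a DataFrame, then read the rows back."""
    # Imported here so the module itself loads even without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")            # assumption: run locally on one thread
        .appName("rdd-to-df-sketch")   # hypothetical app name
        .getOrCreate()
    )
    try:
        # An RDD is a distributed collection without column names.
        rdd = spark.sparkContext.parallelize(ROWS)

        # Attaching column names gives a DataFrame.
        df = spark.createDataFrame(rdd, COLUMNS)
        df.printSchema()

        # The DataFrame still exposes its rows as an RDD of Row objects.
        return [tuple(row) for row in df.rdd.collect()]
    finally:
        spark.stop()
```

Calling `rdd_to_dataframe_demo()` prints the inferred schema and returns the rows as plain tuples. `rdd.toDF(COLUMNS)` is an equivalent shortcut once a SparkSession exists.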

SparkContext: the old entry point of Spark; used to work with RDDs.
SparkSession: the new entry point of Spark (since Spark 2.0); it holds a SparkContext internally. (It is essentially a combination of SQLContext, HiveContext and, in the future, StreamingContext.)
SparkContext: the Spark application's connection to the cluster manager.
SparkSession: the entry point to Spark SQL.
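
To make the relationship between the two entry points concrete, here is a small PySpark sketch. It assumes a local Spark installation; the function name `entry_points_demo` and the app name are hypothetical.

```python
def entry_points_demo():
    """Use one SparkSession for SQL and its inner SparkContext for RDDs."""
    # Imported here so the module itself loads even without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")               # assumption: local mode, one thread
        .appName("entry-points-sketch")   # hypothetical app name
        .getOrCreate()
    )
    try:
        # The session holds a SparkContext; RDD work goes through it.
        sc = spark.sparkContext
        rdd_total = sc.parallelize([1, 2, 3, 4]).sum()

        # Spark SQL queries go through the session itself.
        row = spark.sql("SELECT 1 + 1 AS two").collect()[0]

        return rdd_total, row["two"]
    finally:
        spark.stop()
```

In new code there is usually no need to construct a SparkContext by hand; building the session and reading `spark.sparkContext` covers both styles.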

Spark SQL / Streaming / MLlib / GraphX
What is a library?
- Library: needed functionality packaged as classes and functions; a collection of various packages. e.g. requests
- Framework: a set of libraries assembled to implement a specific kind of functionality; a collection of libraries. e.g. Django
- API: an interface for interacting with another program without being directly coupled to it. e.g. pyspark
  - An API may also be referred to as an interface. APIs exist at many levels, including system, library, framework, program, and application. An API should be defined before the code that implements it is written.
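
The idea of defining an API before its implementation can be sketched in plain Python with an abstract base class. Everything here (`KeyValueStore`, `InMemoryStore`, `remember`) is a hypothetical example, not part of Spark or any real library.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class KeyValueStore(ABC):
    """The API: declares what callers may do, with no implementation."""

    @abstractmethod
    def put(self, key: str, value: str) -> None:
        ...

    @abstractmethod
    def get(self, key: str) -> Optional[str]:
        ...


class InMemoryStore(KeyValueStore):
    """One implementation, written after the API was fixed."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        # Returns None when the key was never stored.
        return self._data.get(key)


def remember(store: KeyValueStore, key: str, value: str) -> None:
    """Caller code depends only on the API, not on a concrete class."""
    store.put(key, value)
```

Because `remember` is written against `KeyValueStore`, a different implementation could be swapped in later without changing the caller.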