[Spark_3] Spark SQL basics

Spark SQL basics

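  • RDD vs Dataset/DataFrame: both are still used, but Dataset/DataFrame usage is now more common
    • Dataset: a distributed collection of data
    • DataFrame: a Dataset organized into named columns
    • An RDD can be converted to a Dataset or a DataFrame
    • Python has no Dataset concept (PySpark exposes the DataFrame API only)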

SparkContext vs SparkSession

  • SparkContext: the old entry point of Spark. Used to work with RDDs
  • SparkSession: the new entry point of Spark. It holds a SparkContext internally. (It is essentially a combination of SQLContext, HiveContext and future StreamingContext)
  • SparkContext: how a Spark application connects to the cluster manager
  • SparkSession: the entry point to Spark SQL
  • Spark SQL / Streaming / MLlib / GraphX
  • library?
    • Library: needed functionality packaged as classes and functions. A collection of various packages. e.g. requests
    • Framework: a set of libraries assembled to implement a specific kind of functionality. A collection of libraries. e.g. Django
    • API: an interface for interacting with another program without being directly connected to it. e.g. pyspark
      • An API may be referred to as an interface. APIs exist at many levels, including system, library, framework, program, and application. APIs should be defined before the code that implements them is written.

Reference

Spark Cluster Manager

  • Standalone Deploy Mode
  • Apache Mesos
  • Hadoop YARN
  • Kubernetes
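
Each cluster manager listed above is selected through the master URL given to the application. The sketch below only builds those URL strings in plain Python; the `master_url` helper, the host names, and the ports are placeholders for illustration, not part of Spark.

```python
# Master URL patterns for the four cluster managers listed above.
MASTER_URL_PATTERNS = {
    "standalone": "spark://{host}:{port}",
    "mesos": "mesos://{host}:{port}",
    "yarn": "yarn",  # cluster location comes from the Hadoop configuration
    "kubernetes": "k8s://https://{host}:{port}",
}


def master_url(manager: str, host: str = "", port: int = 0) -> str:
    """Hypothetical helper: build the master URL for a cluster manager."""
    pattern = MASTER_URL_PATTERNS[manager]
    return pattern.format(host=host, port=port)


# Example values only; replace with the real master host and port.
standalone_url = master_url("standalone", "master.example.com", 7077)
yarn_url = master_url("yarn")
```

The resulting string would typically be passed to `SparkSession.builder.master(...)` or to the `--master` option of `spark-submit`.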