[Hadoop_Cluster] Tutorial with GCP (Google Cloud Platform)

Tutorial: Building a Hadoop cluster with GCP instances

  • This post is a write-up based on the referenced material.

Process

  • Create a GCP VM instance and set up the Hadoop environment and SSH key
  • Create a snapshot of that instance
  • Create additional instances for the slave nodes from the snapshot

Step 1. Install and Configure Hadoop

  • Prerequisite: GCP VM instance
    $ sudo add-apt-repository ppa:webupd8team/java
    $ sudo apt-get update && sudo apt-get install -y build-essential oracle-java8-set-default

    $ wget http://apache.claz.org/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
    $ tar -xzvf hadoop-3.1.1.tar.gz
    $ sudo mv hadoop-3.1.1 /usr/local/hadoop

    $ sudo vi /etc/environment

    PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
    JAVA_HOME="/usr/lib/jvm/java-8-oracle/jre"

    $ source /etc/environment  # or: export JAVA_HOME=/usr/lib/jvm/java-8-oracle/jre

    # Run test application
    $ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount /usr/local/hadoop/LICENSE.txt ~/output
    $ cat ~/output/part-r-*

cf. Official tutorial

# Compile and create the jar
$ bin/hadoop com.sun.tools.javac.Main WordCount.java
$ jar cf wc.jar WordCount*.class

# Input dir
$ bin/hadoop fs -ls /user/joe/wordcount/input/
>> /user/joe/wordcount/input/file01
>> /user/joe/wordcount/input/file02

$ bin/hadoop fs -cat /user/joe/wordcount/input/file01
>> Hello World Bye World

$ bin/hadoop fs -cat /user/joe/wordcount/input/file02
>> Hello Hadoop Goodbye Hadoop
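
The official tutorial then runs the compiled job and reads the result; given the two input files above, the word counts come out as follows.

# Run the application and check the output
$ bin/hadoop jar wc.jar WordCount /user/joe/wordcount/input /user/joe/wordcount/output
$ bin/hadoop fs -cat /user/joe/wordcount/output/part-r-00000
>> Bye 1
>> Goodbye 1
>> Hadoop 2
>> Hello 2
>> World 2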

1) hdfs-site.xml

  • Configurations for Name/DataNode
$ sudo vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml

<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/nameNode</value>
</property>

<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/dataNode</value>
</property>

<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
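
Once this file is saved, the value can be sanity-checked from the client side; hdfs getconf reads the local configuration and does not need a running cluster.

# Should print the replication factor configured above
$ hdfs getconf -confKey dfs.replication
>> 2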

2) core-site.xml

  • NameNode URI
  • Size of R/W buffer, etc
$ sudo vi /usr/local/hadoop/etc/hadoop/core-site.xml

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
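
Note that fs.default.name is a deprecated key; the current name is fs.defaultFS, so the same setting can equivalently be written as:

<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>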

3) yarn-site.xml

  • Configurations for ResourceManager and NodeManager
$ sudo vi /usr/local/hadoop/etc/hadoop/yarn-site.xml

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>

<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>

4) mapred-site.xml

  • Configurations for MapReduce App
$ sudo vi /usr/local/hadoop/etc/hadoop/mapred-site.xml

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
</configuration>

5) Define Master/Workers

$ sudo vi /usr/local/hadoop/etc/hadoop/workers

slave1
slave2

$ sudo vi /usr/local/hadoop/etc/hadoop/masters

master

Step 3. Set SSH

  • Allow the master node to connect to the slave nodes over SSH without a password
    # generate key
    $ ssh-keygen -t rsa

    # append it to 'authorized_keys'
    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

    # check ssh
    $ ssh localhost
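
  • Because the slave instances are created from a snapshot of this machine in Step 4, they already carry the same key pair and authorized_keys, so no extra key distribution is needed. If a slave were provisioned another way, the public key could instead be pushed with ssh-copy-id (the username below is a placeholder):
    $ ssh-copy-id <user>@slave1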

Step 4. Create a Snapshot (GCP)

  • Create a snapshot and launch two more instances from it (see the gcloud sketch below)
  • Create an instance group
    • Group the instances together, instead of configuring something like an AWS Security Group
    • Bind all three instances
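
  • These steps can be done in the GCP console; a rough gcloud equivalent is sketched below (the disk, snapshot, and zone names are placeholders, not from the original tutorial):
    # Snapshot the master's boot disk
    $ gcloud compute disks snapshot <master-boot-disk> --snapshot-names=hadoop-base --zone=<zone>

    # Create each slave's boot disk from the snapshot, then boot an instance from it
    $ gcloud compute disks create slave1-disk --source-snapshot=hadoop-base --zone=<zone>
    $ gcloud compute instances create slave1 --disk=name=slave1-disk,boot=yes --zone=<zone>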

Step 5. SSH Connection

$ sudo vi /etc/hosts

# Internal (private) IPs -- replace with each instance's actual internal IP
127.0.0.1 localhost
10.0.0.2 master
10.0.0.3 slave1
10.0.0.4 slave2

# test connection
$ ssh slave1
$ exit
$ ssh slave2
$ exit

# update each instance's hosts files
$ cat /etc/hosts | ssh slave1 "sudo sh -c 'cat >/etc/hosts'"
$ cat /etc/hosts | ssh slave2 "sudo sh -c 'cat >/etc/hosts'"
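
If you are unsure of each node's internal IP, it can be read from the GCP console or listed with gcloud:

# Lists all instances with their internal and external IPs
$ gcloud compute instances list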

Step 6. Run Hadoop Cluster

# test cluster
$ hdfs namenode -format
$ start-dfs.sh
$ jps
$ hadoop fs -mkdir /test
$ hadoop fs -ls /
$ hdfs dfsadmin -report

# run cluster manager
$ start-yarn.sh
$ yarn node -list

# upload data and run wordcount
$ hadoop fs -put /usr/local/hadoop/LICENSE.txt /test/
$ yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount hdfs:///test/LICENSE.txt /test/output

# read data
$ hadoop fs -text /test/output/*
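
When you are done, the daemons can be stopped with the standard scripts; with the default ports, the NameNode and ResourceManager web UIs should also be reachable on ports 9870 and 8088 of the master node.

# stop the cluster
$ stop-yarn.sh
$ stop-dfs.sh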