HDFS Installation (Three Modes)

HDFS: The Hadoop Distributed File System

  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

Hadoop HDFS 2.x Installation

Hadoop HDFS 2.x supports three installation modes:

  1. Standalone (local mode)
  2. Pseudo-Distributed Operation (pseudo-distributed mode)
  3. Cluster (fully distributed mode)

Standalone

By default, Hadoop is configured to run in non-distributed mode, as a single Java process.

export JAVA_HOME=/usr/local/java/openjdk1.8/   # point Hadoop at the JDK
cd hadoop_home/                                # the Hadoop installation directory
mkdir input
cp etc/hadoop/*.xml input                      # use the bundled config files as sample input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
cat output/*                                   # print the matches found by the example grep job

Pseudo-Distributed Operation

Hadoop can also run on a single node in so-called pseudo-distributed mode, where each Hadoop daemon runs as a separate Java process.

Configuration

Without YARN
etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Set up passphraseless SSH login:

ssh localhost                                       # if this prompts for a password, set up keys below
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa            # generate a key pair with an empty passphrase
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys     # authorize the new public key
chmod 0600 ~/.ssh/authorized_keys


Initializing and Using Pseudo-Distributed HDFS

The following instructions are to run a MapReduce job locally.

  1. Format the filesystem (this initializes HDFS and only needs to be done once):

    bin/hdfs namenode -format
  2. Start NameNode daemon and DataNode daemon:

    sbin/start-dfs.sh

The Hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).
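
To confirm that the daemons actually started, you can use the jps tool that ships with the JDK to list the running Java processes; on a pseudo-distributed node the output should look roughly like this (process IDs will differ):

jps
# 12001 NameNode
# 12102 DataNode
# 12235 SecondaryNameNode
# 12488 Jps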

  3. Browse the web interface for the NameNode; by default it is available at:
    NameNode - http://localhost:50070/
  4. Make the HDFS directories required to execute MapReduce jobs:

    bin/hdfs dfs -mkdir /user
    bin/hdfs dfs -mkdir /user/<username>
  5. Copy the input files into the distributed filesystem:

    bin/hdfs dfs -put etc/hadoop input
  6. Run some of the examples provided:

    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
  7. Examine the output files: copy the output files from the distributed filesystem to the local filesystem and examine them:

    bin/hdfs dfs -get output output
    cat output/*

Or view the output files directly on the distributed filesystem:

bin/hdfs dfs -cat output/*

  8. When you’re done, stop the daemons with:

    sbin/stop-dfs.sh

With YARN

To run MapReduce on YARN on a single node, edit the following files.
etc/hadoop/mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

etc/hadoop/yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

  1. Start ResourceManager daemon and NodeManager daemon:

    sbin/start-yarn.sh
  2. Browse the web interface for the ResourceManager; by default it is available at:
    ResourceManager - http://localhost:8088/

  3. Run a MapReduce job.
    The HDFS commands are essentially the same as in the non-YARN walkthrough above; see the sketch after this list.
  4. When you’re done, stop the daemons with:
    sbin/stop-yarn.sh
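
As referenced in step 3 above, here is a minimal sketch of running the example job on YARN. It reuses the commands from the non-YARN walkthrough and assumes HDFS is already running (sbin/start-dfs.sh) and that <username> is replaced with your own user:

bin/hdfs dfs -mkdir -p /user/<username>
bin/hdfs dfs -rm -r -f input output            # clear any leftovers from the earlier walkthrough
bin/hdfs dfs -put etc/hadoop input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
bin/hdfs dfs -cat output/*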

Cluster Setup

Reference: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
Download the Hadoop 2.7.3 tarball and extract it on the master node; the extraction path is referred to as ${Hadoop_Install} below.
Configure passwordless SSH access between the nodes of the Hadoop cluster.
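
A minimal sketch of one way to set this up from the master node, assuming the hosts are named hadoop-nn (master), hadoop-dn1, and hadoop-dn2 (placeholder names for this example):

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa                 # generate a key pair with an empty passphrase
for host in hadoop-nn hadoop-dn1 hadoop-dn2; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub "$host"             # append the public key to each node's authorized_keys
done
ssh hadoop-dn1 hostname                                  # should not prompt for a password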

Configure Hadoop Cluster

In the ${Hadoop_Install}/etc/hadoop/ directory, edit the configuration files: core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
core-site.xml : important site-wide parameters, such as the default filesystem URI

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop-nn:9000</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>

hdfs-site.xml : configuration for the NameNode and DataNodes

<configuration>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hadoop/dfs/name/data</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hadoop/dfs/name</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.blocksize</name>
        <value>10240</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>

mapred-site.xml : configuration for MapReduce applications and the MapReduce JobHistory Server

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

yarn-site.xml : configuration for the ResourceManager, NodeManagers, and History Server

<configuration>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>hadoop-nn:8025</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>hadoop-nn:8035</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>hadoop-nn:8050</value>
    </property>
</configuration>
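
The start-dfs.sh / start-yarn.sh utility scripts used below read the list of worker hostnames from ${Hadoop_Install}/etc/hadoop/slaves (one host per line). A minimal example, assuming two worker nodes with the placeholder names hadoop-dn1 and hadoop-dn2:

hadoop-dn1
hadoop-dn2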

Hadoop Cluster Startup

Format a new distributed filesystem:

[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format <cluster_name>

This creates the name directory, which stores the fsimage and edit log files that record the metadata of the entire cluster filesystem.
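
For illustration only: with dfs.namenode.name.dir set to /opt/hadoop/dfs/name as above, the freshly formatted metadata directory typically looks something like this (exact file names vary by version):

ls /opt/hadoop/dfs/name/current
# VERSION  fsimage_0000000000000000000  fsimage_0000000000000000000.md5  seen_txid
# (edits_* files appear once the NameNode is running)
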
Start the HDFS NameNode:

[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode

Start the HDFS DataNodes:

[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemons.sh --config $HADOOP_CONF_DIR --script hdfs start datanode

Alternatively, start all HDFS daemons at once with the utility script (requires etc/hadoop/slaves and passwordless SSH):

[hdfs]$ $HADOOP_PREFIX/sbin/start-dfs.sh

Start the YARN ResourceManager:

[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager

Start the YARN NodeManagers:

[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemons.sh --config $HADOOP_CONF_DIR start nodemanager

Start the YARN WebAppProxy server if necessary:

[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start proxyserver

Alternatively, start all YARN daemons at once with the utility script:

[yarn]$ $HADOOP_PREFIX/sbin/start-yarn.sh

Start the MapReduce JobHistory Server:

[mapred]$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver
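
As a quick sanity check after startup, run jps on each node; which daemons appear depends on the node's role. The sketch below assumes the NameNode, ResourceManager, and JobHistory Server all run on the master (hadoop-nn) and that each worker runs a DataNode and a NodeManager (process IDs omitted):

jps        # on the master node
# NameNode
# ResourceManager
# JobHistoryServer
# SecondaryNameNode (if started via start-dfs.sh)
# Jps

jps        # on each worker node
# DataNode
# NodeManager
# Jps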

Hadoop Cluster Web Interfaces

NameNode                        http://hadoop_namenode:50070/
ResourceManager                 http://hadoop_resourcemanager:8088/
MapReduce JobHistory Server     http://jobhistory_server:19888/

Hadoop Cluster: Excluding/Decommissioning DataNodes

Configure hdfs-site.xml:

<property>
    <name>dfs.hosts.exclude</name>
    <value>/home/hadoop/hdfs_exclude.txt</value>
    <description>DFS exclude</description>
</property>

Then write the hostname of the DataNode to be decommissioned (slave2) into the hdfs_exclude.txt file.
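
For instance (assuming slave2 is the hostname under which that DataNode registered with the NameNode; adjust to your environment):

echo "slave2" >> /home/hadoop/hdfs_exclude.txt
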
Finally, force the NameNode to reload its configuration and check the decommissioning status:

hdfs dfsadmin -refreshNodes
hdfs dfsadmin -report