HDFS Installation (Three Modes)

HDFS: The Hadoop Distributed File System

  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

Hadoop HDFS 2.x Installation

Hadoop HDFS 2.x supports three installation modes:

  1. Standalone (local mode)
  2. Pseudo-Distributed Operation (pseudo-distributed mode)
  3. Cluster (fully distributed mode)

Standalone

By default, Hadoop is configured to run in non-distributed mode, as a single Java process.

export JAVA_HOME=/usr/local/java/openjdk1.8/   # point Hadoop at the JDK
cd hadoop_home/                                # the Hadoop installation directory
mkdir input
cp etc/hadoop/*.xml input                      # use the bundled config files as sample input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
cat output/*                                   # print the matches found by the example grep job

Pseudo-Distributed Operation

Hadoop can also run on a single node in so-called pseudo-distributed mode, where each Hadoop daemon runs as a separate Java process.

Configuration

Without YARN
etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Set up passphraseless SSH login:

ssh localhost                                       # if this prompts for a password, set up keys below
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa            # generate a key pair with an empty passphrase
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys     # authorize the new public key
chmod 0600 ~/.ssh/authorized_keys


Initializing and Using Pseudo-Distributed HDFS

The following instructions are to run a MapReduce job locally.

  1. Format the filesystem (this initializes HDFS and only needs to be done once):

    bin/hdfs namenode -format
  2. Start NameNode daemon and DataNode daemon:

    sbin/start-dfs.sh

The Hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).
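
To confirm that the daemons actually started, you can use the jps tool that ships with the JDK to list the running Java processes; on a pseudo-distributed node the output should look roughly like this (process IDs will differ):

jps
# 12001 NameNode
# 12102 DataNode
# 12235 SecondaryNameNode
# 12488 Jps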

  3. Browse the web interface for the NameNode; by default it is available at:
    NameNode - http://localhost:50070/
  4. Make the HDFS directories required to execute MapReduce jobs:

    bin/hdfs dfs -mkdir /user
    bin/hdfs dfs -mkdir /user/<username>
  5. Copy the input files into the distributed filesystem:

    bin/hdfs dfs -put etc/hadoop input
  6. Run some of the examples provided:

    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
  7. Examine the output files: copy the output files from the distributed filesystem to the local filesystem and examine them:

    bin/hdfs dfs -get output output
    cat output/*

Or view the output files directly on the distributed filesystem:

bin/hdfs dfs -cat output/*

  8. When you’re done, stop the daemons with:

    sbin/stop-dfs.sh

With YARN

To run MapReduce on YARN on a single node, edit the following files.
etc/hadoop/mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

etc/hadoop/yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

  1. Start ResourceManager daemon and NodeManager daemon:

    sbin/start-yarn.sh
  2. Browse the web interface for the ResourceManager; by default it is available at:
    ResourceManager - http://localhost:8088/

  3. Run a MapReduce job.
    The HDFS commands are essentially the same as in the non-YARN walkthrough above; see the sketch after this list.
  4. When you’re done, stop the daemons with:
    sbin/stop-yarn.sh
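
As referenced in step 3 above, here is a minimal sketch of running the example job on YARN. It reuses the commands from the non-YARN walkthrough and assumes HDFS is already running (sbin/start-dfs.sh) and that <username> is replaced with your own user:

bin/hdfs dfs -mkdir -p /user/<username>
bin/hdfs dfs -rm -r -f input output            # clear any leftovers from the earlier walkthrough
bin/hdfs dfs -put etc/hadoop input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
bin/hdfs dfs -cat output/*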

Cluster Setup

Reference: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
Download the Hadoop 2.7.3 tarball and extract it on the master node; the extraction path is referred to as ${Hadoop_Install} below.
Configure passwordless SSH access between the nodes of the Hadoop cluster.
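
A minimal sketch of one way to set this up from the master node, assuming the hosts are named hadoop-nn (master), hadoop-dn1, and hadoop-dn2 (placeholder names for this example):

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa                 # generate a key pair with an empty passphrase
for host in hadoop-nn hadoop-dn1 hadoop-dn2; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub "$host"             # append the public key to each node's authorized_keys
done
ssh hadoop-dn1 hostname                                  # should not prompt for a password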

Configure Hadoop Cluster

In the ${Hadoop_Install}/etc/hadoop/ directory, edit the configuration files: core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
core-site.xml : important site-wide parameters, such as the default filesystem URI

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop-nn:9000</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>

hdfs-site.xml : configuration for the NameNode and DataNodes

<configuration>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hadoop/dfs/name/data</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hadoop/dfs/name</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.blocksize</name>
        <value>10240</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>

mapred-site.xml : configuration for MapReduce applications and the MapReduce JobHistory Server

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

yarn-site.xml : configuration for the ResourceManager, NodeManagers, and History Server

<configuration>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>hadoop-nn:8025</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>hadoop-nn:8035</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>hadoop-nn:8050</value>
    </property>
</configuration>
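
The start-dfs.sh / start-yarn.sh utility scripts used below read the list of worker hostnames from ${Hadoop_Install}/etc/hadoop/slaves (one host per line). A minimal example, assuming two worker nodes with the placeholder names hadoop-dn1 and hadoop-dn2:

hadoop-dn1
hadoop-dn2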

Hadoop Cluster Startup

Format a new distributed filesystem:

[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format <cluster_name>

This creates the name directory, which stores the fsimage and edit log files that record the metadata of the entire cluster filesystem.
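
For illustration only: with dfs.namenode.name.dir set to /opt/hadoop/dfs/name as above, the freshly formatted metadata directory typically looks something like this (exact file names vary by version):

ls /opt/hadoop/dfs/name/current
# VERSION  fsimage_0000000000000000000  fsimage_0000000000000000000.md5  seen_txid
# (edits_* files appear once the NameNode is running)
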
Start the HDFS NameNode:

[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode

Start the HDFS DataNodes:

[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemons.sh --config $HADOOP_CONF_DIR --script hdfs start datanode

Alternatively, start all HDFS daemons at once with the utility script (requires etc/hadoop/slaves and passwordless SSH):

[hdfs]$ $HADOOP_PREFIX/sbin/start-dfs.sh

Start the YARN ResourceManager:

[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager

Start the YARN NodeManagers:

[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemons.sh --config $HADOOP_CONF_DIR start nodemanager

Start the YARN WebAppProxy server if necessary:

[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start proxyserver

Alternatively, start all YARN daemons at once with the utility script:

[yarn]$ $HADOOP_PREFIX/sbin/start-yarn.sh

Start the MapReduce JobHistory Server:

[mapred]$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver
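
As a quick sanity check after startup, run jps on each node; which daemons appear depends on the node's role. The sketch below assumes the NameNode, ResourceManager, and JobHistory Server all run on the master (hadoop-nn) and that each worker runs a DataNode and a NodeManager (process IDs omitted):

jps        # on the master node
# NameNode
# ResourceManager
# JobHistoryServer
# SecondaryNameNode (if started via start-dfs.sh)
# Jps

jps        # on each worker node
# DataNode
# NodeManager
# Jps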

Hadoop Cluster Web Interfaces

NameNode                        http://hadoop_namenode:50070/
ResourceManager                 http://hadoop_resourcemanager:8088/
MapReduce JobHistory Server     http://jobhistory_server:19888/

Hadoop Cluster: Excluding/Decommissioning DataNodes

Configure hdfs-site.xml:

<property>
    <name>dfs.hosts.exclude</name>
    <value>/home/hadoop/hdfs_exclude.txt</value>
    <description>DFS exclude</description>
</property>

Then write the hostname of the DataNode to be decommissioned (slave2) into the hdfs_exclude.txt file.
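
For instance (assuming slave2 is the hostname under which that DataNode registered with the NameNode; adjust to your environment):

echo "slave2" >> /home/hadoop/hdfs_exclude.txt
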
Finally, force the NameNode to reload its configuration and check the decommissioning status:

hdfs dfsadmin -refreshNodes
hdfs dfsadmin -report