HDFS (Hadoop Distributed File System)
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Installing Hadoop HDFS 2.x
Hadoop HDFS 2.x supports three installation modes:
- Standalone (local mode)
- Pseudo-Distributed Operation (single node)
- Cluster (fully distributed)
Standalone
By default, Hadoop is configured to run in non-distributed mode, as a single Java process.
Pseudo-Distributed Operation
Hadoop can also run on a single node in so-called pseudo-distributed mode, where each Hadoop daemon runs as a separate Java process.
Configuration
Without YARN
etc/hadoop/core-site.xml:
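A minimal core-site.xml for pseudo-distributed mode, matching the official single-node guide (port 9000 is the conventional default there):

```xml
<configuration>
    <!-- default filesystem URI; all HDFS paths resolve against this -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
```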
etc/hadoop/hdfs-site.xml:
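And the matching hdfs-site.xml, which sets the replication factor to 1 since a single node has only one DataNode:

```xml
<configuration>
    <!-- one replica per block; there is only one DataNode -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
```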
Setup passphraseless ssh login:
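If `ssh localhost` prompts for a password, the usual key setup (RSA keys here; the single-node guide uses the same pattern) is:

```bash
# generate a passphraseless key pair and authorize it for localhost logins
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
```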
Initializing and Using Pseudo-Distributed HDFS
The following instructions are to run a MapReduce job locally.
Format the filesystem (initialization!):

```bash
bin/hdfs namenode -format
```

Start NameNode daemon and DataNode daemon:

```bash
sbin/start-dfs.sh
```
The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).
- Browse the web interface for the NameNode; by default it is available at:
NameNode - http://localhost:50070/
- Make the HDFS directories required to execute MapReduce jobs:

```bash
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/<username>
```

- Copy the input files into the distributed filesystem:

```bash
bin/hdfs dfs -put etc/hadoop input
```

- Run some of the examples provided:

```bash
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
```

- Examine the output files: copy the output files from the distributed filesystem to the local filesystem and examine them:

```bash
bin/hdfs dfs -get output output
cat output/*
```
or
View the output files on the distributed filesystem:
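The corresponding command from the official single-node guide:

```bash
bin/hdfs dfs -cat output/*
```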
- When you're done, stop the daemons with:

```bash
sbin/stop-dfs.sh
```
With YARN
Configuring YARN on a Single Node
etc/hadoop/mapred-site.xml:
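A minimal configuration, as in the official single-node guide, telling MapReduce to run on YARN:

```xml
<configuration>
    <!-- submit MapReduce jobs to YARN instead of running them locally -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
```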
etc/hadoop/yarn-site.xml:
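And the matching yarn-site.xml, enabling the shuffle service that MapReduce needs:

```xml
<configuration>
    <!-- auxiliary service used by MapReduce for the shuffle phase -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
```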
Start ResourceManager daemon and NodeManager daemon:
```bash
sbin/start-yarn.sh
```

- Browse the web interface for the ResourceManager; by default it is available at:
ResourceManager - http://localhost:8088/
- Run a MapReduce job. The HDFS usage commands are the same as without YARN; see the examples above.
- When you're done, stop the daemons with:

```bash
sbin/stop-yarn.sh
```
Cluster Setup
Reference: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
Download the Hadoop 2.7.3 release tarball and extract it on the master node; the extraction path is referred to below as ${Hadoop_Install}.
Configure passwordless SSH access between the nodes of the Hadoop cluster.
Configure Hadoop Cluster
In the ${Hadoop_Install}/etc/hadoop/ directory, edit the configuration files core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml (a minimal sketch of all four follows the list below):
core-site.xml: configuration for important site-wide parameters
hdfs-site.xml: configuration for the NameNode and DataNodes
mapred-site.xml: configuration for MapReduce applications and the MapReduce JobHistory Server
yarn-site.xml: configuration for the ResourceManager, NodeManagers, and History Server
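A minimal sketch of the four files, assuming a hypothetical NameNode/ResourceManager host named master and data directories under /data/hadoop (both assumptions; adjust to your cluster):

```xml
<!-- core-site.xml: default filesystem URI (host/port are assumptions) -->
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>

<!-- hdfs-site.xml: NameNode/DataNode storage paths (hypothetical) -->
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/data/hadoop/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/data/hadoop/data</value>
    </property>
</configuration>

<!-- mapred-site.xml: run MapReduce on YARN -->
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

<!-- yarn-site.xml: ResourceManager host and MapReduce shuffle service -->
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
```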
Hadoop Cluster Startup
Format a new distributed filesystem:
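Per the ClusterSetup guide referenced above, run on the NameNode (the <cluster_name> argument is optional):

```bash
$HADOOP_PREFIX/bin/hdfs namenode -format <cluster_name>
```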
This creates a name directory holding the fsimage and edit log files, which record the metadata of the cluster's entire filesystem.
Start the HDFS NameNode:
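Per the ClusterSetup guide, assuming $HADOOP_PREFIX points at ${Hadoop_Install} and $HADOOP_CONF_DIR at its etc/hadoop directory:

```bash
$HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
```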
Start the HDFS DataNodes:
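From the same guide, for the designated DataNode hosts:

```bash
$HADOOP_PREFIX/sbin/hadoop-daemons.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
```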
Start all HDFS daemons at once (requires etc/hadoop/slaves and passwordless ssh):
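The utility script from the guide:

```bash
$HADOOP_PREFIX/sbin/start-dfs.sh
```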
Start the YARN ResourceManager:
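From the guide, run on the ResourceManager node (assuming $HADOOP_YARN_HOME also points at ${Hadoop_Install}):

```bash
$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
```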
Start the YARN NodeManagers:
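Likewise from the guide, for the designated NodeManager hosts:

```bash
$HADOOP_YARN_HOME/sbin/yarn-daemons.sh --config $HADOOP_CONF_DIR start nodemanager
```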
Start the YARN WebAppProxy server if necessary:
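Per the guide, on the proxy's designated host:

```bash
$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start proxyserver
```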
Start all YARN daemons at once:
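The utility script from the guide:

```bash
$HADOOP_PREFIX/sbin/start-yarn.sh
```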
Start the MapReduce JobHistory Server:
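From the guide, run on the designated history server host:

```bash
$HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver
```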
Hadoop Cluster Web Interfaces
| Daemon | Web Interface | Default Port |
| --- | --- | --- |
| NameNode | http://nn_host:port/ | 50070 |
| ResourceManager | http://rm_host:port/ | 8088 |
| MapReduce JobHistory Server | http://jhs_host:port/ | 19888 |
Hadoop Cluster: Excluding/Decommissioning DataNodes
Configure hdfs-site.xml:
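A sketch of the relevant property; the file path is an assumption, and any absolute path readable by the NameNode works:

```xml
<!-- path to the exclude file is hypothetical; use your own location -->
<property>
    <name>dfs.hosts.exclude</name>
    <value>${Hadoop_Install}/etc/hadoop/hdfs_exclude.txt</value>
</property>
```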
Then write the hostname of the DataNode to decommission (slave2) into the hdfs_exclude.txt file, for example:
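A one-liner, assuming the hypothetical path used in the property above:

```bash
echo "slave2" >> ${Hadoop_Install}/etc/hadoop/hdfs_exclude.txt
```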
Finally, force the NameNode to reload the configuration:
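The refresh command:

```bash
bin/hdfs dfsadmin -refreshNodes
```

The NameNode web UI should then show slave2 as decommissioning until its blocks have been replicated to other DataNodes.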