
Hadoop Commands Cheat Sheet - HDFS, Administration, and MapReduce


Like many buzzwords, what people mean when they say “big data” is not always clear. At its core, big data is a way of describing data problems that are unsolvable using traditional tools, because of the volume of data involved, the variety of that data, or the time constraints faced by those trying to use that data. Hadoop emerged as a nontraditional tool to solve what was thought to be unsolvable by providing an open source software framework for the parallel processing of massive amounts of data. To get that software framework to work for you, you’ll need to master a bunch of commands.

Hadoop Distributed File System Shell Commands

The Hadoop shell is a family of commands that you can run from your operating system’s command line. The shell has two sets of commands: one for file manipulation (similar in purpose and syntax to Linux commands that many of us know and love) and one for Hadoop administration. The following table summarizes the first set of commands for you.

Command

What It Does

Usage

Examples

cat

Copies source paths to stdout.

hdfs dfs -cat URI [URI …]

hdfs dfs -cat hdfs://<path>/file1; hdfs dfs -cat file:///file2 /user/hadoop/file3

chgrp

Changes the group association of files. With -R, makes the change recursively by way of the directory structure. The user must be the file owner or the superuser.

hdfs dfs -chgrp [-R] GROUP URI [URI …]

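A representative invocation (the group name analysts and the path here are illustrative):
hdfs dfs -chgrp -R analysts /user/hadoop/shared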

chmod

Changes the permissions of files. With -R, makes the change recursively by way of the directory structure. The user must be the file owner or the superuser.

hdfs dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI …]

hdfs dfs -chmod 777 test/data1.txt

chown

Changes the owner of files. With -R, makes the change recursively by way of the directory structure. The user must be the superuser.

hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI [URI …]

hdfs dfs -chown -R hduser2 /opt/hadoop/logs

copyFromLocal

Works similarly to the put command, except that the source is restricted to a local file reference.

hdfs dfs -copyFromLocal <localsrc> URI

hdfs dfs -copyFromLocal input/docs/data2.txt hdfs://localhost/user/rosemary/data2.txt

copyToLocal

Works similarly to the get command, except that the destination is restricted to a local file reference.

hdfs dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

hdfs dfs -copyToLocal data2.txt data2.copy.txt

count

Counts the number of directories, files, and bytes under the paths that match the specified file pattern.

hdfs dfs -count [-q] <paths>

hdfs dfs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2

cp

Copies one or more files from a specified source to a specified destination. If you specify multiple sources, the specified destination must be a directory.

hdfs dfs -cp URI [URI …] <dest>

hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

du

Displays the size of the specified file, or the sizes of files and directories that are contained in the specified directory. If you specify the -s option, displays an aggregate summary of file sizes rather than individual file sizes. If you specify the -h option, formats the file sizes in a "human-readable" way.

hdfs dfs -du [-s] [-h] URI [URI …]

hdfs dfs -du /user/hadoop/dir1 /user/hadoop/file1

dus

Displays a summary of file sizes; equivalent to hdfs dfs -du -s.

hdfs dfs -dus <args>

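A representative invocation (the path here is illustrative):
hdfs dfs -dus /user/hadoop/dir1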

expunge

Empties the trash. When you delete a file, it isn’t removed immediately from HDFS, but is renamed to a file in the /trash directory. As long as the file remains there, you can undelete it if you change your mind, though only the latest copy of the deleted file can be restored.

hdfs dfs -expunge


get

Copies files to the local file system. Files that fail a cyclic redundancy check (CRC) can still be copied if you specify the -ignorecrc option. The CRC is a common technique for detecting data transmission errors. CRC checksum files have the .crc extension and are used to verify the data integrity of another file. These files are copied if you specify the -crc option.

hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>

hdfs dfs -get /user/hadoop/file3 localfile

getmerge

Concatenates the files in src and writes the result to the specified local destination file. To add a newline character at the end of each file, specify the addnl option.

hdfs dfs -getmerge <src> <localdst> [addnl]

hdfs dfs -getmerge /user/hadoop/mydir/ ~/result_file addnl

ls

Returns statistics for the specified files or directories.

hdfs dfs -ls <args>

hdfs dfs -ls /user/hadoop/file1

lsr

Serves as the recursive version of ls; similar to the Unix command ls -R.

hdfs dfs -lsr <args>

hdfs dfs -lsr /user/hadoop

mkdir

Creates directories on one or more specified paths. Its behavior is similar to the Unix mkdir -p command, which creates all directories that lead up to the specified directory if they don’t exist already.

hdfs dfs -mkdir <paths>

hdfs dfs -mkdir /user/hadoop/dir5/temp

moveFromLocal

Works similarly to the put command, except that the source is deleted after it is copied.

hdfs dfs -moveFromLocal <localsrc> <dest>

hdfs dfs -moveFromLocal localfile1 localfile2 /user/hadoop/hadoopdir


mv

Moves one or more files from a specified source to a specified destination. If you specify multiple sources, the specified destination must be a directory. Moving files across file systems isn’t permitted.

hdfs dfs -mv URI [URI …] <dest>

hdfs dfs -mv /user/hadoop/file1 /user/hadoop/file2

put

Copies files from the local file system to the destination file system. This command can also read input from stdin and write to the destination file system.

hdfs dfs -put <localsrc> ... <dest>

hdfs dfs -put localfile1 localfile2 /user/hadoop/hadoopdir;
hdfs dfs -put - /user/hadoop/hadoopdir (reads input from stdin)

rm

Deletes one or more specified files. This command doesn’t delete directories; for recursive deletes, use the rmr command. To bypass the trash (if it’s enabled) and delete the specified files immediately, specify the -skipTrash option.

hdfs dfs -rm [-skipTrash] URI [URI …]

hdfs dfs -rm hdfs://nn.example.com/file9

rmr

Serves as the recursive version of rm.

hdfs dfs -rmr [-skipTrash] URI [URI …]

hdfs dfs -rmr /user/hadoop/dir

setrep

Changes the replication factor for a specified file or directory. With -R, makes the change recursively by way of the directory structure.

hdfs dfs -setrep <rep> [-R] <path>

hdfs dfs -setrep 3 -R /user/hadoop/dir1

stat

Displays information about the specified path.

hdfs dfs -stat URI [URI …]

hdfs dfs -stat /user/hadoop/dir1

tail

Displays the last kilobyte of a specified file to stdout. The syntax supports the Unix -f option, which enables the specified file to be monitored. As new lines are added to the file by another process, tail updates the display.

hdfs dfs -tail [-f] URI

hdfs dfs -tail /user/hadoop/dir1

test

Returns attributes of the specified file or directory. Specify -e to determine whether the file or directory exists; -z to determine whether the file or directory is empty; and -d to determine whether the URI is a directory.

hdfs dfs -test -[ezd] URI

hdfs dfs -test -e /user/hadoop/dir1

text

Outputs a specified source file in text format. Valid input file formats are zip and TextRecordInputStream.

hdfs dfs -text <src>

hdfs dfs -text /user/hadoop/file8.zip

touchz

Creates a new, empty file of size 0 in the specified path.

hdfs dfs -touchz <path>

hdfs dfs -touchz /user/hadoop/file12

Hadoop Administration Commands

Any Hadoop administrator worth his salt must master a comprehensive set of commands for cluster administration. The following table summarizes the most important commands. Know them, and you will advance a long way along the path to Hadoop wisdom.

Command

What It Does

Syntax

Example

balancer

Runs the cluster-balancing utility. The specified threshold value, which represents a percentage of disk capacity, is used to override the default threshold value (10 percent). To stop the rebalancing process, press Ctrl+C.

hadoop balancer [-threshold <threshold>]

hadoop balancer -threshold 20

daemonlog

Gets or sets the log level for each daemon (also known as a service). Connects to http://host:port/logLevel?log=name and prints or sets the log level of the daemon that’s running at host:port. Hadoop daemons generate log files that help you determine what’s happening on the system, and you can use the daemonlog command to temporarily change the log level of a Hadoop component when you’re debugging the system. The change isn’t persistent; the daemon reverts to its configured log level when it restarts.

hadoop daemonlog -getlevel <host:port> <name>;
hadoop daemonlog -setlevel <host:port> <name> <level>

hadoop daemonlog -getlevel 10.250.1.15:50030 org.apache.hadoop.mapred.JobTracker;
hadoop daemonlog -setlevel 10.250.1.15:50030 org.apache.hadoop.mapred.JobTracker DEBUG


datanode

Runs the HDFS DataNode service, which coordinates storage on each slave node. If you specify -rollback, the DataNode is rolled back to the previous version. Stop the DataNode and distribute the previous Hadoop version before using this option.

hadoop datanode [-rollback]

hadoop datanode -rollback

dfsadmin

Runs a number of Hadoop Distributed File System (HDFS) administrative operations. Use the -help option to see a list of all supported options. The generic options are a common set of options supported by several commands.

hadoop dfsadmin [GENERIC_OPTIONS] [-report]
 [-safemode enter | leave | get | wait] [-refreshNodes]
 [-finalizeUpgrade] [-upgradeProgress status | details | force]
 [-metasave filename] [-setQuota <quota> <dirname>...<dirname>]
 [-clrQuota <dirname>...<dirname>]
 [-restoreFailedStorage true|false|check] [-help [cmd]]

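Representative invocations (any of the options shown in the syntax column can be substituted):
hadoop dfsadmin -report;
hadoop dfsadmin -safemode get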

mradmin

Runs a number of MapReduce administrative operations. Use the -help option to see a list of all supported options. Again, the generic options are a common set of options that are supported by several commands. If you specify -refreshServiceAcl, reloads the service-level authorization policy file (JobTracker reloads the authorization policy file); -refreshQueues reloads the queue access control lists (ACLs) and state (JobTracker reloads the mapred-queues.xml file); -refreshNodes refreshes the hosts information at the JobTracker; -refreshUserToGroupsMappings refreshes user-to-groups mappings; -refreshSuperUserGroupsConfiguration refreshes superuser proxy groups mappings; and -help [cmd] displays help for the given command or for all commands if none is specified.

hadoop mradmin [GENERIC_OPTIONS] [-refreshServiceAcl] [-refreshQueues]
 [-refreshNodes] [-refreshUserToGroupsMappings]
 [-refreshSuperUserGroupsConfiguration] [-help [cmd]]

hadoop mradmin -help -refreshNodes

jobtracker

Runs the MapReduce JobTracker node, which coordinates the data processing system for Hadoop. If you specify -dumpConfiguration, the configuration that’s used by the JobTracker and the queue configuration in JSON format are written to standard output.

hadoop jobtracker [-dumpConfiguration]

hadoop jobtracker -dumpConfiguration

namenode

Runs the NameNode, which coordinates the storage for the whole Hadoop cluster. If you specify -format, the NameNode is started, formatted, and then stopped; with -upgrade, the NameNode starts with the upgrade option after a new Hadoop version is distributed; with -rollback, the NameNode is rolled back to the previous version (remember to stop the cluster and distribute the previous Hadoop version before using this option); with -finalize, the previous state of the file system is removed, the most recent upgrade becomes permanent, rollback is no longer available, and the NameNode is stopped; finally, with -importCheckpoint, an image is loaded from the checkpoint directory (as specified by the fs.checkpoint.dir property) and saved into the current directory.

hadoop namenode [-format] | [-upgrade] | [-rollback] | [-finalize] | [-importCheckpoint]

hadoop namenode -finalize

secondarynamenode

Runs the secondary NameNode. If you specify -checkpoint, a checkpoint on the secondary NameNode is performed if the size of the EditLog (a transaction log that records every change that occurs to the file system metadata) is greater than or equal to fs.checkpoint.size; specify -force and a checkpoint is performed regardless of the EditLog size; specify -geteditsize and the EditLog size is printed.

hadoop secondarynamenode [-checkpoint [force]] | [-geteditsize]

hadoop secondarynamenode -geteditsize

tasktracker

Runs a MapReduce TaskTracker node.

hadoop tasktracker

hadoop tasktracker


The Hadoop dfsadmin Command Options

The dfsadmin tools are a specific set of tools designed to help you root out information about your Hadoop Distributed File System (HDFS). As an added bonus, you can use them to perform some administration operations on HDFS as well.

Option

What It Does

-report

Reports basic file system information and statistics.

-safemode enter | leave | get | wait

Manages safe mode, a NameNode state in which changes to the name space are not accepted and blocks can be neither replicated nor deleted. The NameNode is in safe mode during start-up so that it doesn’t prematurely start replicating blocks even though there are already enough replicas in the cluster.
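
For example, to check whether the NameNode is currently in safe mode:
hadoop dfsadmin -safemode get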

-refreshNodes

Forces the NameNode to reread its configuration, including the dfs.hosts.exclude file. The NameNode decommissions nodes after their blocks have been replicated onto machines that will remain active.

-finalizeUpgrade

Completes the HDFS upgrade process. DataNodes and the NameNode delete working directories from the previous version.

-upgradeProgress status | details | force

Requests the standard or detailed current status of the distributed upgrade, or forces the upgrade to proceed.
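
For example, to request the standard status of a distributed upgrade:
hadoop dfsadmin -upgradeProgress status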

-metasave filename

Saves the NameNode’s primary data structures to filename in a directory that’s specified by the hadoop.log.dir property. The file filename, which is overwritten if it already exists, contains one line for each of these items: a) DataNodes that are exchanging heartbeats with the NameNode; b) blocks that are waiting to be replicated; c) blocks that are being replicated; and d) blocks that are waiting to be deleted.
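
For example (the file name meta.log is illustrative; the file is written under the hadoop.log.dir directory):
hadoop dfsadmin -metasave meta.log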

-setQuota <quota> <dirname>...<dirname>

Sets an upper limit on the number of names in the directory tree. You can set this limit (a long integer) for one or more directories simultaneously.
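
For example (the quota value and directory name are illustrative):
hadoop dfsadmin -setQuota 100000 /user/hadoop/dir1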

-clrQuota <dirname>...<dirname>

Clears the upper limit on the number of names in the directory tree. You can clear this limit for one or more directories simultaneously.
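
For example (the directory name is illustrative):
hadoop dfsadmin -clrQuota /user/hadoop/dir1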

-restoreFailedStorage true | false | check

Turns on or off the automatic attempts to restore failed storage replicas. If a failed storage location becomes available again, the system attempts to restore edits and the fsimage during a checkpoint. The check option returns the current setting.
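
For example, to display the current setting:
hadoop dfsadmin -restoreFailedStorage check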

-help [cmd]

Displays help information for the given command or for all commands if none is specified.