Hadoop

How to run Hadoop without using SSH

The start-all.sh and stop-all.sh scripts in the hadoop/bin directory use SSH to launch some of the Hadoop daemons. If SSH is not available on the server, follow the steps below to run Hadoop without it.

The goal is to replace every call to "hadoop-daemons.sh" with "hadoop-daemon.sh". The "hadoop-daemons.sh" script simply runs "hadoop-daemon.sh" on each host through SSH.
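The same substitution can be made with sed, which is equivalent to performing steps 1-4 below by hand. This is a minimal sketch; the hadoop/bin path is an assumption for your install, and the original scripts are backed up first:

    cd /usr/lib/hadoop/bin        # adjust to your hadoop/bin directory
    for f in start-dfs.sh stop-dfs.sh start-mapred.sh stop-mapred.sh; do
        cp $f $f.orig             # keep a backup of the original script
        sed -i 's/hadoop-daemons\.sh/hadoop-daemon.sh/g' $f
    done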

  1. Modify start-dfs.sh script:

    from:
    "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt
    "$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
    "$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode

     

    to:
    "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt
    "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
    "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode
     
     
  2. Modify stop-dfs.sh script:

    from:
    "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop namenode
    "$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR stop datanode
    "$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters stop secondarynamenode

     

    to:
    "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop namenode
    "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop datanode
    "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR --hosts masters stop secondarynamenode

     

  3. Modify start-mapred.sh script:

    from:
    "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
    "$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker


    to:

    "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
    "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start tasktracker

  4. Modify stop-mapred.sh script:

    from:
    "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop jobtracker
    "$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR stop tasktracker


    to:

    "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop jobtracker
    "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop tasktracker

     

Note that after this change, start-all.sh and stop-all.sh will no longer start or stop any Hadoop daemons on other servers remotely. All remote slaves must be started and stopped manually, directly on those servers, for example as shown below.
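A minimal sketch of doing this by hand on a slave node, assuming a standard install where $HADOOP_HOME points to the Hadoop installation directory:

    # Run locally on each slave (DataNode/TaskTracker) host:
    "$HADOOP_HOME"/bin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode
    "$HADOOP_HOME"/bin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start tasktracker

    # And to stop them:
    "$HADOOP_HOME"/bin/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop tasktracker
    "$HADOOP_HOME"/bin/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop datanode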

How To Back Up and Restore Cloudera Manager Configuration Settings?

You can backup and restore the Cloudera configuration settings by using the APIs referenced in the following Cloudera documentation:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Introduction/cm5i_api.html


Steps to back up and restore CM configuration settings:

In the commands below, $ADMIN_UNAME is your CM admin username (for example, "admin"), $ADMIN_PASS is the password for that user, and $CM_HOST is the Cloudera Manager hostname.

On the Cloudera Manager server host, log on as root:
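The variables can be set in the shell first; the values below are placeholders for your environment:

# export ADMIN_UNAME=admin
# export ADMIN_PASS='<admin-password>'
# export CM_HOST=<cm-hostname>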


1. To back up/export the configuration through the CM API to a JSON file:

# curl -u $ADMIN_UNAME:$ADMIN_PASS "http://$CM_HOST:7180/api/v6/cm/deployment" > <path-to-file>/cm-deployment.json


2. To restore/import the configuration through the CM API from the previously exported JSON file.

Please be *AWARE* that this command stops all cluster services, so do not run it while jobs are running.

# curl --upload-file <path-to-file>/cm-deployment.json -u $ADMIN_UNAME:$ADMIN_PASS http://$CM_HOST:7180/api/v6/cm/deployment?deleteCurrentDeployment=true
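Optionally, you can first confirm which API version your Cloudera Manager release supports and use that version in the URLs above instead of v6 (a hedged example; the /api/version endpoint returns the highest supported API version):

# curl -u $ADMIN_UNAME:$ADMIN_PASS "http://$CM_HOST:7180/api/version"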

Disabling Anonymous Usage Data Collection in Cloudera Manager

When the Cloudera Manager web page loads, it makes a request to google-analytics.com that attempts to send anonymous usage statistics back to Cloudera via Google Analytics. This is undesirable because the BDA normally runs with no access to the internet, and the request shows up in the browser status bar and may cause concern even though it is harmless. BDA customers may also not want anonymous statistics sent back to Cloudera even if their BDA is connected to the internet.

This issue will be fixed in the next release of the BDA Mammoth software. Internal bug 16434583 has been filed to track it.

To disable anonymous usage data collection in V2.0.1:

  1. Open CM using http://<node3-name>:7180

  2. Click the gear icon (top right-hand corner) to display the Administration page.

  3. On the Properties tab, under the Other category, unset the Allow Usage Data Collection option to disable anonymous usage data collection. 

  4. Click Save Changes.

Oracle NoSQL Database for Hadoop

The Oracle NoSQL Database is a network-accessible multi-terabyte distributed key-value pair database offering scalable throughput and performance. It is designed to provide highly reliable, scalable and available data storage across a configurable set of systems that function as storage nodes.

Data is stored in a flexible key-value format, where the key consists of one or more major components and optional minor components, and the associated value is represented as an opaque set of bytes. The key-value pairs are written to particular storage nodes based on the hashed value of the primary key. Storage nodes are replicated to ensure high availability, rapid failover in the event of a node failure, and optimal load balancing of queries.

Oracle NoSQL Database offers full Create, Read, Update and Delete (CRUD) operations with adjustable durability guarantees. It is designed to be highly available, with excellent throughput and latency, while requiring minimal administrative interaction. Customer applications are written using an easy-to-use Java API to read and write data. The NoSQL Database links with the customer application, providing access to the data via the appropriate storage node for the requested key-value.

A typical application is a web application servicing requests across the traditional three-tier architecture: web server, application server, and back-end database. In this configuration, Oracle NoSQL Database is installed behind the application server, where it either takes the place of the back-end database or works alongside it.

 

Oracle NoSQL Database Components Include:

  • Simple Data Model

      • Key-value pair data structure; keys are composed of Major & Minor keys

      • Easy-to-use Java API with simple Put, Delete and Get operations

  • Scalability

      • Automatic, hash-function-based data partitioning and distribution

      • Intelligent NoSQL Database driver is topology and latency aware, providing optimal data access and transparent load balancing

      • Designed to scale out to thousands of nodes

  • Predictable behavior

      • ACID transactions, configurable globally and per operation

      • Bounded latency via B-tree caching and efficient query dispatching

      • Highly tuned memory management

  • High Availability

      • No single point of failure

      • Node-level backup and restore

      • Built-in, configurable replication for fault tolerance, failover, and read scalability

      • Resilient to single and multi-storage-node failures

      • Disaster recovery via data center replication

  • Easy Administration

      • Web console or command-line interface

      • System and node management

      • Shows system topology, status, current load, trailing and average latency, events and alerts


Oracle NoSQL Database Major Benefits Include:

  • High throughput

  • Bounded latency (sub-millisecond)

  • Near-linear scalability

  • High Availability

  • Short time to deployment

  • No conflict resolution requirement

  • Commercial Grade Software and Support

  • Optimized Hardware (Oracle Big Data Appliance)

Hadoop Oracle Big Data Connectors

Oracle Big Data Connectors facilitate data access between data stored in a Hadoop cluster and Oracle Database. They can be licensed for use on either Oracle Big Data Appliance or a Hadoop cluster running on commodity hardware.

These are the 5 connectors currently available:

Oracle Loader for Hadoop:
Oracle Loader for Hadoop is an efficient and high-performance loader for fast movement of data from a Hadoop cluster into a table in an Oracle database. Oracle Loader for Hadoop prepartitions the data if necessary and transforms it into a database-ready format. It optionally sorts records by primary key or user-defined columns before loading the data or creating output files. Oracle Loader for Hadoop is a MapReduce application that is invoked as a command-line utility. It accepts the generic command-line options that are supported by the org.apache.hadoop.util.Tool interface.

Oracle SQL Connector for Hadoop Distributed File System (formerly known as Oracle Direct Connector for HDFS):
Oracle SQL Connector for Hadoop Distributed File System enables Oracle Database to access data stored in Hadoop Distributed File System (HDFS) files or a Hive table. The data can remain in HDFS or the Hive table, or it can be loaded into an Oracle database. Oracle SQL Connector for HDFS is a command-line utility that accepts generic command line arguments supported by the org.apache.hadoop.util.Tool interface. It also provides a preprocessor for Oracle external tables.

Oracle R Connector for Hadoop:
Oracle R Connector for Hadoop is an R package that provides an interface between a local R environment, Oracle Database, and Hadoop, allowing speed-of-thought, interactive analysis on all three platforms. Oracle R Connector for Hadoop is designed to work independently, but if the enterprise data for your analysis is also stored in Oracle Database, then the full power of this connector is achieved when it is used with Oracle R Enterprise.

Oracle Data Integrator Application Adapter for Hadoop:
Oracle Data Integrator (ODI) extracts, transforms, and loads data into Oracle Database from a wide range of sources. In Oracle Data Integrator, a knowledge module (KM) is a code template dedicated to a specific task in the data integration process. You use ODI Studio to load, select, and configure the KMs for your particular application. More than 150 KMs are available to help you acquire data from a wide range of third-party databases and other data repositories. You only need to load a few KMs for any particular job. Oracle Data Integrator Application Adapter for Hadoop contains the KMs specifically for use with big data. They stage the data in Hive, a data warehouse built on Hadoop, for the best performance.

Oracle XQuery for Hadoop:
Oracle XQuery for Hadoop runs transformations expressed in the XQuery language by translating them into a series of MapReduce jobs, which are executed in parallel on the Hadoop cluster. The input data can be located in a file system accessible through the Hadoop File System API, such as the Hadoop Distributed File System (HDFS), or stored in Oracle NoSQL Database. Oracle XQuery for Hadoop can write the transformation results to HDFS, Oracle NoSQL Database, or Oracle Database.

Individual connectors may require that software components be installed in Oracle Database and either the Hadoop cluster or an external system set up as a Hadoop client for the cluster. Users may also need additional access privileges in Oracle Database.

What Is MapReduce?

MapReduce refers to a framework that runs on a computational cluster to mine large datasets. The name derives from the application of map() and reduce() functions repurposed from functional programming languages.

  • “Map” applies to all the members of the dataset and returns a list of results
  • “Reduce” collates and resolves the results from one or more mapping operations executed in parallel
  • Very large datasets are split into large subsets called splits 
  • A parallelized operation performed on all splits yields the same results as if it were executed against the larger dataset before turning it into splits
  • Implementations separate business logic from multiprocessing logic
  • MapReduce framework developers focus on process dispatching, locking, and logic flow
  • App developers focus on implementing the business logic without worrying about infrastructure or scalability issues 

Implementation patterns

The Map(k1, v1) -> list(k2, v2) function is applied to every item in the split. It produces a list of (k2, v2) pairs for each call. The framework groups all the results with the same key together in a new split.

The Reduce(k2, list(v2)) -> list(v3) function is applied to each intermediate results split to produce a collection of values v3 in the same domain. This collection may have zero or more values. The desired result consists of all the v3 collections, often aggregated into one result file.
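As an illustration of the pattern only (this is a plain shell sketch, not Hadoop code; input.txt is a placeholder file), the classic word-count example maps each word to a (word, 1) pair, groups pairs by key, and reduces each group by summing:

# map phase: emit one (word, 1) pair per line
awk '{for (i = 1; i <= NF; i++) print $i "\t1"}' input.txt > mapped.txt

# shuffle/sort: bring records with the same key together
sort mapped.txt > grouped.txt

# reduce phase: sum the values for each key
awk -F'\t' '{count[$1] += $2} END {for (w in count) print w "\t" count[w]}' grouped.txt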

Hadoop Commands Cheat Sheet - HDFS, Administration and MapReduce

Like many buzzwords, what people mean when they say “big data” is not always clear. At its core, big data is a way of describing data problems that are unsolvable using traditional tools —because of the volume of data involved, the variety of that data, or the time constraints faced by those trying to use that data. Hadoop emerged as an untraditional tool to solve what was thought to be unsolvable by providing an open source software framework for the parallel processing of massive amounts of data. To get that software framework to work for you, you’ll need to master a bunch of commands.

Hadoop Distributed File System Shell Commands

The Hadoop shell is a family of commands that you can run from your operating system’s command line. The shell has two sets of commands: one for file manipulation (similar in purpose and syntax to Linux commands that many of us know and love) and one for Hadoop administration. The following table summarizes the first set of commands for you.
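The exact set of commands and options varies with the Hadoop version, so it can help to check the built-in help on your own cluster first (a minimal example):

hdfs dfs -help
hdfs dfs -help ls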

Command

What It Does

Usage

Examples

cat

Copies source paths to stdout.

hdfs dfs -cat URI [URI …]

hdfs dfs -cat hdfs://<path>/file1; hdfs dfs -cat file:///file2 /user/hadoop/file3

chgrp

Changes the group association of files. With -R, makes the change recursively by way of the directory structure. The user must be the file owner or the superuser.

hdfs dfs -chgrp [-R] GROUP URI [URI …]

 

chmod

Changes the permissions of files. With -R, makes the change recursively by way of the directory structure. The user must be the file owner or the superuser.

hdfs dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI …]

hdfs dfs -chmod 777 test/data1.txt

chown

Changes the owner of files. With -R, makes the change recursively by way of the directory structure. The user must be the superuser.

hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI [URI ]

hdfs dfs -chown -R hduser2 /opt/hadoop/logs

copyFromLocal

Works similarly to the put command, except that the source is restricted to a local file reference.

hdfs dfs -copyFromLocal <localsrc> URI

hdfs dfs -copyFromLocal input/docs/data2.txt hdfs://localhost/user/rosemary/data2.txt

copyToLocal

Works similarly to the get command, except that the destination is restricted to a local file reference.

hdfs dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

hdfs dfs -copyToLocal data2.txt data2.copy.txt

count

Counts the number of directories, files, and bytes under the paths that match the specified file pattern.

hdfs dfs -count [-q] <paths>

hdfs dfs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2

cp

Copies one or more files from a specified source to a specified destination. If you specify multiple sources, the specified destination must be a directory.

hdfs dfs -cp URI [URI …] <dest>

hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

du

Displays the size of the specified file, or the sizes of files and directories that are contained in the specified directory. If you specify the -s option, displays an aggregate summary of file sizes rather than individual file sizes. If you specify the -h option, formats the file sizes in a "human-readable" way.

hdfs dfs -du [-s] [-h] URI [URI …]

hdfs dfs -du /user/hadoop/dir1 /user/hadoop/file1

dus

Displays a summary of file sizes; equivalent to hdfs dfs -du -s.

hdfs dfs -dus <args>

 

expunge

Empties the trash. When you delete a file, it isn't removed immediately from HDFS, but is renamed to a file in the /trash directory. As long as the file remains there, you can undelete it if you change your mind, though only the latest copy of the deleted file can be restored.

hdfs dfs -expunge

 

get

Copies files to the local file system. Files that fail a cyclic redundancy check (CRC) can still be copied if you specify the -ignorecrc option. The CRC is a common technique for detecting data transmission errors. CRC checksum files have the .crc extension and are used to verify the data integrity of another file. These files are copied if you specify the -crc option.

hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>

hdfs dfs -get /user/hadoop/file3 localfile

getmerge

Concatenates the files in src and writes the result to the specified local destination file. To add a newline character at the end of each file, specify the addnl option.

hdfs dfs -getmerge <src> <localdst> [addnl]

hdfs dfs -getmerge /user/hadoop/mydir/ ~/result_file addnl

ls

Returns statistics for the specified files or directories.

hdfs dfs -ls <args>

hdfs dfs -ls /user/hadoop/file1

lsr

Serves as the recursive version of ls; similar to the Unix command ls -R.

hdfs dfs -lsr <args>

hdfs dfs -lsr /user/hadoop

mkdir

Creates directories on one or more specified paths. Its behavior is similar to the Unix mkdir -p command, which creates all directories that lead up to the specified directory if they don’t exist already.

hdfs dfs -mkdir <paths>

hdfs dfs -mkdir /user/hadoop/dir5/temp

moveFromLocal

Works similarly to the put command, except that the source is deleted after it is copied.

hdfs dfs -moveFromLocal <localsrc> <dest>

hdfs dfs -moveFromLocal localfile1 localfile2 /user/hadoop/hadoopdir

    

mv

Moves one or more files from a specified source to a specified destination. If you specify multiple sources, the specified destination must be a directory. Moving files across file systems isn’t permitted.

hdfs dfs -mv URI [URI …] <dest>

hdfs dfs -mv /user/hadoop/file1 /user/hadoop/file2

put

Copies files from the local file system to the destination file system. This command can also read input from stdin and write to the destination file system.

hdfs dfs -put <localsrc> ... <dest>

hdfs dfs -put localfile1 localfile2 /user/hadoop/hadoopdir;
hdfs dfs -put - /user/hadoop/hadoopdir (reads input from stdin)

rm

Deletes one or more specified files. This command doesn't delete empty directories or files. To bypass the trash (if it's enabled) and delete the specified files immediately, specify the -skipTrash option.

hdfs dfs -rm [-skipTrash] URI [URI …]

hdfs dfs -rm hdfs://nn.example.com/file9

rmr

Serves as the recursive version of -rm.

hdfs dfs -rmr [-skipTrash] URI [URI …]

hdfs dfs -rmr /user/hadoop/dir

setrep

Changes the replication factor for a specified file or directory. With -R, makes the change recursively by way of the directory structure.

hdfs dfs -setrep <rep> [-R] <path>

hdfs dfs -setrep 3 -R /user/hadoop/dir1

stat

Displays information about the specified path.

hdfs dfs -stat URI [URI …]

hdfs dfs -stat /user/hadoop/dir1

tail

Displays the last kilobyte of a specified file to stdout. The syntax supports the Unix -f option, which enables the specified file to be monitored. As new lines are added to the file by another process, tail updates the display.

hdfs dfs -tail [-f] URI

hdfs dfs -tail /user/hadoop/dir1

test

Returns attributes of the specified file or directory. Specify -e to determine whether the file or directory exists; -z to determine whether the file or directory is empty; and -d to determine whether the URI is a directory.

hdfs dfs -test -[ezd] URI

hdfs dfs -test /user/hadoop/dir1

text

Outputs a specified source file in text format. Valid input file formats are zip and TextRecordInputStream.

hdfs dfs -text <src>

hdfs dfs -text /user/hadoop/file8.zip

touchz

Creates a new, empty file of size 0 in the specified path.

hdfs dfs -touchz <path>

hdfs dfs -touchz /user/hadoop/file12
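Putting several of these commands together, a typical session might look like the following; all paths and file names here are placeholders:

hdfs dfs -mkdir /user/hadoop/demo
hdfs dfs -put localdata.txt /user/hadoop/demo
hdfs dfs -ls /user/hadoop/demo
hdfs dfs -cat /user/hadoop/demo/localdata.txt
hdfs dfs -get /user/hadoop/demo/localdata.txt ./localdata.copy.txt
hdfs dfs -rm /user/hadoop/demo/localdata.txt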

Hadoop Administration Commands

Any Hadoop administrator worth his salt must master a comprehensive set of commands for cluster administration. The following table summarizes the most important commands. Know them, and you will advance a long way along the path to Hadoop wisdom.

Command

What It Does

Syntax

Example

balancer

Runs the cluster-balancing utility. The specified threshold value, which represents a percentage of disk capacity, is used to override the default threshold value (10 percent). To stop the rebalancing process, press Ctrl+C.

hadoop balancer [-threshold <threshold>]

hadoop balancer -threshold 20

daemonlog

Gets or sets the log level for each daemon (also known as a service). Connects to http://host:port/logLevel?log=name and prints or sets the log level of the daemon that's running at host:port. Hadoop daemons generate log files that help you determine what's happening on the system, and you can use the daemonlog command to temporarily change the log level of a Hadoop component when you're debugging the system. The change becomes effective when the daemon restarts.

hadoop daemonlog -getlevel <host:port> <name>; hadoop daemonlog -setlevel <host:port> <name> <level>

hadoop daemonlog -getlevel 10.250.1.15:50030 org.apache.hadoop.mapred.JobTracker; hadoop daemonlog -setlevel 10.250.1.15:50030 org.apache.hadoop.mapred.JobTracker DEBUG

 

 

datanode

Runs the HDFS DataNode service, which coordinates storage on each slave node. If you specify -rollback, the DataNode is rolled back to the previous version. Stop the DataNode and distribute the previous Hadoop version before using this option.

hadoop datanode [-rollback]

hadoop datanode -rollback

dfsadmin

Runs a number of Hadoop Distributed File System (HDFS) administrative operations. Use the -help option to see a list of all supported options. The generic options are a common set of options supported by several commands.

hadoop dfsadmin [GENERIC_OPTIONS] [-report] [-safemode enter | leave | get | wait] [-refreshNodes] [-finalizeUpgrade] [-upgradeProgress status | details | force] [-metasave filename] [-setQuota <quota> <dirname>...<dirname>] [-clrQuota <dirname>...<dirname>] [-restoreFailedStorage true|false|check] [-help [cmd]]

 

mradmin

Runs a number of MapReduce administrative operations. Use the -help option to see a list of all supported options. Again, the generic options are a common set of options that are supported by several commands. If you specify -refreshServiceAcl, reloads the service-level authorization policy file (JobTracker reloads the authorization policy file); -refreshQueues reloads the queue access control lists (ACLs) and state (JobTracker reloads the mapred-queues.xml file); -refreshNodes refreshes the hosts information at the JobTracker; -refreshUserToGroupsMappings refreshes user-to-groups mappings; -refreshSuperUserGroupsConfiguration refreshes superuser proxy groups mappings; and -help [cmd] displays help for the given command or for all commands if none is specified.

hadoop mradmin [GENERIC_OPTIONS] [-refreshServiceAcl] [-refreshQueues] [-refreshNodes] [-refreshUserToGroupsMappings] [-refreshSuperUserGroupsConfiguration] [-help [cmd]]

hadoop mradmin -help -refreshNodes

jobtracker

Runs the MapReduce JobTracker node, which coordinates the data processing system for Hadoop. If you specify -dumpConfiguration, the configuration that’s used by the JobTracker and the queue configuration in JSON format are written to standard output.

hadoop jobtracker [-dumpConfiguration]

hadoop jobtracker -dumpConfiguration

namenode

Runs the NameNode, which coordinates the storage for the whole Hadoop cluster. If you specify -format, the NameNode is started, formatted, and then stopped; with -upgrade, the NameNode starts with the upgrade option after a new Hadoop version is distributed; with -rollback, the NameNode is rolled back to the previous version (remember to stop the cluster and distribute the previous Hadoop version before using this option); with -finalize, the previous state of the file system is removed, the most recent upgrade becomes permanent, rollback is no longer available, and the NameNode is stopped; finally, with -importCheckpoint, an image is loaded from the checkpoint directory (as specified by the fs.checkpoint.dir property) and saved into the current directory.

hadoop namenode [-format] | [-upgrade] | [-rollback] | [-finalize] | [-importCheckpoint]

hadoop namenode -finalize

secondarynamenode

Runs the secondary NameNode. If you specify -checkpoint, a checkpoint on the secondary NameNode is performed if the size of the EditLog (a transaction log that records every change that occurs to the file system metadata) is greater than or equal to fs.checkpoint.size; specify -force and a checkpoint is performed regardless of the EditLog size; specify -geteditsize and the EditLog size is printed.

hadoop secondarynamenode [-checkpoint [force]] | [-geteditsize]

hadoop secondarynamenode -geteditsize

tasktracker

Runs a MapReduce TaskTracker node.

hadoop tasktracker

hadoop tasktracker

 

 

The Hadoop dfsadmin Command Options

The dfsadmin tools are a specific set of tools designed to help you root out information about your Hadoop Distributed File System (HDFS). As an added bonus, you can use them to perform some administration operations on HDFS as well.

Option

What It Does

-report

Reports basic file system information and statistics.

-safemode enter | leave | get | wait

Manages safe mode, a NameNode state in which changes to the name space are not accepted and blocks can be neither replicated nor deleted. The NameNode is in safe mode during start-up so that it doesn’t prematurely start replicating blocks even though there are already enough replicas in the cluster.

-refreshNodes

Forces the NameNode to reread its configuration, including the dfs.hosts.exclude file. The NameNode decommissions nodes after their blocks have been replicated onto machines that will remain active.

-finalizeUpgrade

Completes the HDFS upgrade process. DataNodes and the NameNode delete working directories from the previous version.

-upgradeProgress status | details | force

Requests the standard or detailed current status of the distributed upgrade, or forces the upgrade to proceed.

-metasave filename

Saves the NameNode's primary data structures to filename in a directory that's specified by the hadoop.log.dir property. The file filename, which is overwritten if it already exists, contains one line for each of these items: a) DataNodes that are exchanging heartbeats with the NameNode; b) blocks that are waiting to be replicated; c) blocks that are being replicated; and d) blocks that are waiting to be deleted.

-setQuota <quota> <dirname>...<dirname>

Sets an upper limit on the number of names in the directory tree. You can set this limit (a long integer) for one or more directories simultaneously.

-clrQuota <dirname>...<dirname>

Clears the upper limit on the number of names in the directory tree. You can clear this limit for one or more directories simultaneously.

-restoreFailedStorage true | false | check

Turns on or off the automatic attempts to restore failed storage replicas. If a failed storage location becomes available again, the system attempts to restore edits and the fsimage during a checkpoint. The check option returns the current setting.

-help [cmd]

Displays help information for the given command or for all commands if none is specified.
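For example, a routine check and a few administrative operations using these options might look like the following; the directory and file names are placeholders:

hadoop dfsadmin -report
hadoop dfsadmin -safemode get
hadoop dfsadmin -setQuota 100000 /user/hadoop/project1
hadoop dfsadmin -metasave namenode-meta.txt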

Options to Upload Data to the Hadoop Distributed File System (HDFS) Using Oracle R Connector for Hadoop

APPLIES TO:

Oracle R Connector for Hadoop - Version 1.0 to 1.0 [Release 1.0]
Linux x86-64

PURPOSE

This document provides sample code on how to upload data to the Hadoop Distributed File System (HDFS) from OS files, database tables, and ORE/data frames using Oracle R Connector for Hadoop (ORCH).

Oracle R Connector for Hadoop provides an interface between a local R environment, Oracle Database, and the Hadoop Distributed File System (HDFS), allowing speed-of-thought, interactive analysis on all three platforms.

Oracle R Connector for Hadoop (ORCH) is designed to work independently, but if the enterprise data for your analysis is also stored in Oracle Database, then the full power of this connector is achieved when it is used with Oracle R Enterprise (ORE).

REQUIREMENTS

Generic R console with ORE and ORCH packages installed

CONFIGURING

For more information on installing R, ORE, and ORCH on the client server, refer to Doc ID 1477347.1.

INSTRUCTIONS

Open the R command-line console. You can paste the content of the *.R files into the R console or execute them using the source command.

CAUTION

This sample code is provided for educational purposes only and not supported by Oracle Support Services. It has been tested internally, however, and works as documented. We do not guarantee that it will work for you, so be sure to test it in your environment before relying on it.

Proofread this sample code before using it! Due to the differences in the way text editors, e-mail packages and operating systems handle text formatting (spaces, tabs and carriage returns), this sample code may not be in an executable state when you first receive it. Check over the sample code to ensure that errors of this type are corrected.

SAMPLE CODE

SAMPLE TO UPLOAD OS FILE TO HDFS

Here is sample code to upload a .dat file from the OS file system to HDFS.

Note: This sample code is executed as the oracle OS user on the BDA. If you intend to execute the sample code as a different OS user, set hdfs.setroot("/user/<OSUserName>") to point to that OS user's home directory.

CUpload.R

cat("Using generic R and ORCH functions.\n")
cat("Check the current OS directory and list the contents ..\n")
print(getwd())
print(list.files())
cat("Create an OS directory ..\n")
dir.create("orchtest")
print(list.files())

cat("cd to the newly created directory ..\n")
setwd("orchtest")
print(getwd())

cat("cars is a sample data frame \n")
class(cars)
print(names(cars))

cat("write cars data frame to an OS File \n")
write.csv(cars, "cars_test.dat", row.names = FALSE)
print(list.files())

cat("Load ORCH library ...\n")
library(ORCH)

cat("Set root directory and list contents ...\n")
hdfs.setroot("/user/oracle")
print(hdfs.pwd())
print(hdfs.ls())

cat("Command to remove csample1 directory on HDFS ...\n")
hdfs.rmdir('csample1')

cat("Create a new csample1 directory on HDFS ...\n")
hdfs.mkdir('csample1', cd=T)
print(hdfs.pwd())

cat("Upload the dat file to HDFS ...\n")
irs.dfs_File <- hdfs.upload('cars_test.dat', dfs.name='cars_F', header=T)

cat("ORCH commands to check the file size and sample data ...\n")
print(hdfs.ls())
print(hdfs.size("cars_F"))
print(hdfs.parts("cars_F"))
print(hdfs.sample("cars_F",lines=3))

SAMPLE TO UPLOAD OS FILE TO DATA FRAME AND THEN TO HDFS

Here is sample code to upload a .dat file from the OS file system into a data frame and then to HDFS.

CUpload2.R

cat("Using generic R and ORCH functions.\n")
cat("Commands to cd to directory where the .dat/csv file resides ..\n")
getwd()
setwd("orchtest")
print(getwd())
print(list.files())

cat("Create data frame from OS File  \n")
dcars <- read.csv(file="cars_test.dat",head=TRUE,sep=",")
print(names(dcars))

cat("Load ORCH library ...\n")
library(ORCH)

cat("Set root directory and list contents ...\n")
hdfs.setroot("/user/oracle")
print(hdfs.pwd())
print(hdfs.ls())

cat("Command to remove csample2 directory on HDFS ...\n")
hdfs.rmdir('csample2')

cat("Create a new csample2 directory on HDFS ...\n")
hdfs.mkdir('csample2', cd=T)
print(hdfs.pwd())

cat("Upload Data Frame to HDFS ...\n")
myfile <- hdfs.put(dcars, dfs.name='cars_F2')

cat("ORCH commands to check the file size and sample data ...\n")
print(hdfs.exists("cars_F2"))
print(hdfs.size("cars_F2"))
print(hdfs.parts("cars_F2"))
print(hdfs.sample("cars_F2",lines=3))

The data frame created from the OS file (dcars in the sample above) can be used with ore.create to create a table in the database. Refer to the Oracle R Enterprise User's Guide for sample code.

SAMPLE TO UPLOAD DATA FRAME TO HDFS

Here is the code to upload a Data Frame to HDFS.

DUpload.R

cat("Using generic R and ORCH functions.\n")
cat("cars is a sample data frame \n")
class(cars)

cat("Load ORCH library ...\n")
library(ORCH)

cat("Set root directory and list contents ...\n")
hdfs.setroot("/user/oracle")
print(hdfs.pwd())
print(hdfs.ls())

cat("Command to remove csample3 directory on HDFS ...\n")
hdfs.rmdir('csample3')

cat("Create a new csample3 directory on HDFS ...\n")
hdfs.mkdir('csample3', cd=T)
print(hdfs.pwd())

cat("Upload Data Frame to HDFS ...\n")
myfile <- hdfs.put(cars, dfs.name='cars_D')

cat("ORCH commands to check the file size and sample data ...\n")
print(hdfs.ls())
print(hdfs.size("cars_D"))
print(hdfs.parts("cars_D"))
print(hdfs.sample("cars_D",lines=3))

 

SAMPLE TO UPLOAD DATABASE TABLE TO HDFS

Here is sample code to create an ORE frame/data frame from a database table and then upload it to HDFS.

This sample uses Oracle R Enterprise functions along with generic R and ORCH functions.

For sample code on how to create database tables from R data frames using ore.create, refer to the Oracle R Enterprise User's Guide.

Refer to Doc ID 1490291.1 for sample code on how to create the table (DF_TABLE) used in this sample.

Modify dbsid, dbhost, port, and RQPASS to match your environment. RQUSER is the user created by demo_user.sh as part of the ORE server install; the username and password may differ in your environment.

Also, when executing this script in an ORE server environment, uncomment the .libPaths line and change <ORACLE_HOME> to the absolute path of the Oracle home. The ORE server installs the needed R libraries/packages in $ORACLE_HOME/R/library, whereas the ORE client installs them in $R_HOME/library.

TUpload.R

cat("Using generic R, ORE and ORCH functions.\n")
cat("Load ORE and connect.\n")
# .libPaths("<ORACLE_HOME>/R/library")
library(ORE)
ore.connect("RQUSER","<dbsid>","<dbhost>","RQPASS", <port>)
ore.sync()
ore.attach()
cat("List the tables in RQUSER schema.\n")
print(ore.ls())

cat("Load ORCH and connect.\n")
library(ORCH)
orch.connect("<dbhost>","RQUSER","<dbsid>","RQPASS", <port> , secure=F)

cat("Set root directory and list contents ...\n")
hdfs.setroot("/user/oracle")
print(hdfs.pwd())
print(hdfs.ls())

cat("Command to remove csample4 directory on HDFS ...\n")
hdfs.rmdir('csample4')

cat("Create a new csample4 directory on HDFS ...\n")
hdfs.mkdir('csample4')
hdfs.cd('csample4')
print(hdfs.pwd())

cat("Create ORE Frame for DF_TABLE \n")
df_t <- DF_TABLE
print(class(df_t))
print(names(df_t))

cat("Upload ORE Frame to HDFS .. \n")
df.dfs <-  hdfs.push(df_t,  dfs.name='df_T', split.by="A")

cat("ORCH commands to check the file size and sample data ...\n")
print(hdfs.exists("df_T"))
print(hdfs.size("df_T"))
print(hdfs.parts("df_T"))
print(hdfs.sample("df_T",lines=3))

 

SAMPLE OUTPUT

 

Sample Output of Uploading OS file to HDFS

Open the R command-line console. You can paste the content of CUpload.R into the R console or execute it using the source command.

> dir()
[1] "CUpload.R"
> source("CUpload.R")
Using generic R and ORCH functions.
Check the current OS directory and list the contents ..
[1] "/refresh/home/RTest"
[1] "CUpload.R"
Create an OS directory ..
[1] "CUpload.R" "orchtest" 
cd to the newly created directory ..
[1] "/refresh/home/RTest/orchtest"
cars is a sample data frame 
[1] "speed" "dist" 
write cars data frame to an OS File 
[1] "cars_test.dat"
Load ORCH library ...
Oracle R Connector for Hadoop 0.1.8 (rev. 104)
Hadoop 0.20.2-cdh3u4 is up
Sqoop 1.3.0-cdh3u4 is up
Loading required package: ROracle
Loading required package: DBI

Attaching package: 'OREbase'

The following object(s) are masked from 'package:base':

    cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
    rbind, table

Loading required package: MASS
Set root directory and list contents ...
[1] "/user/oracle/RTest"
[1] "csample1"
Command to remove csample1 directory on HDFS ...
Create a new csample1 directory on HDFS ...
[1] "/user/oracle/RTest/csample1"
Upload the dat file to HDFS ...
ORCH commands to check the file size and sample data ...
[1] "cars_F"
[1] 293
[1] 1
  val1 val2
1   24   93
2   24  120
3   25   85

 

Sample Output of Uploading OS file to Data Frame and then to HDFS

Open the R command-line console. You can paste the content of CUpload2.R into the R console or execute it using the source command.

> dir()
[1] "CUpload2.R" "CUpload.R"  "orchtest"  
> source("CUpload2.R")
Using generic R and ORCH functions.
Commands to cd to directory where the .dat/csv file resides ..
[1] "/refresh/home/RTest/orchtest"
[1] "cars_test.dat"
Create data frame from OS File  
[1] "speed" "dist" 
Load ORCH library ...
Oracle R Connector for Hadoop 0.1.8 (rev. 104)
Hadoop 0.20.2-cdh3u4 is up
Sqoop 1.3.0-cdh3u4 is up
Loading required package: ROracle
Loading required package: DBI

Attaching package: 'OREbase'

The following object(s) are masked from 'package:base':

    cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
    rbind, table

Loading required package: MASS
Set root directory and list contents ...
[1] "/user/oracle/RTest"
[1] "csample1" "csample2"
Command to remove csample2 directory on HDFS ...
Create a new csample2 directory on HDFS ...
[1] "/user/oracle/RTest/csample2"
Upload Data Frame to HDFS ...
ORCH commands to check the file size and sample data ...
[1] TRUE
[1] 343
[1] 1
  speed dist
1    24   93
2    24  120
3    25   85

 

Sample Output of Uploading Data Frame to HDFS

Open the R command-line console. You can paste the content of DUpload.R into the R console or execute it using the source command.

> dir()
[1] "CUpload2.R" "CUpload.R"  "DUpload.R"  "orchtest"  
> source("DUpload.R")
Using generic R and ORCH functions.
cars is a sample data frame 
Load ORCH library ...
Oracle R Connector for Hadoop 0.1.8 (rev. 104)
Hadoop 0.20.2-cdh3u4 is up
Sqoop 1.3.0-cdh3u4 is up
Loading required package: ROracle
Loading required package: DBI

Attaching package: 'OREbase'

The following object(s) are masked from 'package:base':

    cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
    rbind, table

Loading required package: MASS
Set root directory and list contents ...
[1] "/user/oracle/RTest"
[1] "csample1" "csample2" "csample3"
Command to remove csample3 directory on HDFS ...
DBG: 21:54:29 [ER] failed to remove "/user/oracle/RTest/csample3"
Create a new csample3 directory on HDFS ...
[1] "/user/oracle/RTest/csample3"
Upload Data Frame to HDFS ...
ORCH commands to check the file size and sample data ...
[1] "cars_D"
[1] 343
[1] 1
  speed dist
1    24   93
2    24  120
3    25   85

 

Sample Output of Uploading Database Table to HDFS

Open the R command-line console. You can paste the content of TUpload.R into the R console or execute it using the source command.

> dir()
[1] "CTab.R"        "CUpload2.R"    "CUpload.R"     "DUpload.R"    
[5] "orchtest"      "TUpload1.R"    "TUpload.R"     "TUpload.R.old"
> source("TUpload.R")
Using generic R, ORE and ORCH functions.
Load ORE and connect.
Loading required package: OREbase
Loading required package: ROracle
Loading required package: DBI

Attaching package: 'OREbase'

The following object(s) are masked from 'package:base':

    cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
    rbind, table

Loading required package: OREstats
Loading required package: MASS
Loading required package: OREgraphics
Loading required package: OREeda
Loading required package: ORExml
List the tables in RQUSER schema.
[1] "CARS_TABLE"   "CARS_VTAB"    "CARS_VTAB1"   "DF_TABLE"     "IRIS_TABLE"  
[6] "ONTIME_S"     "ONTIME_S2000" "WADERS_TABLE"
Load ORCH and connect.
Oracle R Connector for Hadoop 0.1.8 (rev. 104)
Hadoop 0.20.2-cdh3u4 is up
Sqoop 1.3.0-cdh3u4 is up
Connecting ORCH to RDBMS via [sqoop]
    Host: celvpint0603
    Port: 1521
    SID:  orcl
    User: RQUSER
Connected.
Set root directory and list contents ...
[1] "/user/oracle/RTest"
[1] "csample1" "csample2" "csample3" "csample4"
Command to remove csample4 directory on HDFS ...
Create a new csample4 directory on HDFS ...
[1] "/user/oracle/RTest/csample4"
Create ORE Frame for DF_TABLE 
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
[1] "A" "B"
Upload ORE Frame to HDFS .. 
ORCH commands to check the file size and sample data ...
[1] TRUE
[1] 121
[1] 4
    A B
1 13 m
2 26 z
3  7 g
>

How to Install R Packages as a Non-Root User for Oracle R Connector for Hadoop

This document describes how to install R packages as a non-root user for use with Oracle R Connector for Hadoop (ORCH).

SOLUTION

Since Oracle R Distribution (ORD) is installed as root or via sudo, this raises the question of how to install other R packages as a non-root user.

R is meant to be a shared application: when packages are installed, they are placed in a global library and are available to all users. The default global directory is $R_HOME, i.e. /usr/lib64/R. If you install as root or via sudo, packages are installed into $R_HOME/library. If you do not install packages as root, you will not have permission to write into the global library directory ($R_HOME/library) and you will be prompted to create a personal library that is writable by your user ID (and thus accessible to you only). Another option is to create a separate global directory writable by all R users. If that common directory is "/a/b/c", the syntax for setting .libPaths() is:

  .libPaths("/a/b/c")


Setting .libPaths() does not persist between R sessions, so it is common practice to place this call in a .Rprofile or Rprofile.site file so that it is executed each time R starts. (From within R, see ?.libPaths for more information.)
 

Note: Installing R packages into a global directory writable by all R users is considered a best practice.


Steps to Install R Packages as Non-root:


1. Use the default non-root location provided by R to install your package. For example, in Oracle Distribution of R version 2.15.1, that location is:

    ~/R/x86_64-unknown-linux-gnu-library/2.15
Note: If ~/R/x86_64-unknown-linux-gnu-library/2.15 doesn't exist, it will be created after the package is installed.

Output when using default non-root location

$ R
Oracle Distribution of R version 2.15.1  (--) -- "Roasted Marshmallows"
...
> install.packages("png")
Installing package(s) into /usr/lib64/R/library
...
Warning in install.packages("png") :
  'lib = "/usr/lib64/R/library"' is not writable
Would you like to use a personal library instead?  (y/n) y
Would you like to create a personal library
~/R/x86_64-unknown-linux-gnu-library/2.15
to install packages into?  (y/n) y
...

This installs the package into ~/R/x86_64-unknown-linux-gnu-library/2.15/<packagename>

To uninstall a package installed in this directory, issue the command below from within R and the package will be removed from the default location:

remove.packages("<package_name>") 


2. Within R, use .libPaths() to define your global path, for example .libPaths("/a/b/c"):
$ R
...
.libPaths("/a/b/c")
> install.packages("<package_name>")

Output using this case looks like:

$ R
Oracle Distribution of R version 2.15.1  (--) -- "Roasted Marshmallows"
...
.libPaths("/a/b/c")
> install.packages("<package_name>")
Installing package(s) into "/a/b/c"
...

This installs the package into "/a/b/c".

If you uninstall a package installed in the directory defined by .libPaths(), make sure to set .libPaths() before removing the package from R with:

remove.packages("<package_name>")



3. Add the .libPaths() call to an existing .Rprofile/Rprofile.site file, or create a new .Rprofile/Rprofile.site file if one does not already exist, by editing the file and adding, for example:

.libPaths(c("/a/b/c",.libPaths())) 

Note that by default Rprofile.site is located at $R_HOME/etc/Rprofile.site.  It is a site-wide R initialization file.  .Rprofile is a local initialization file.  R searches for a file called .Rprofile in the current directory or in the user home directory (in that order) and sources it into the user workspace. 

Using a .Rprofile as an example:

$ more .Rprofile
...
.libPaths(c("/a/b/c",.libPaths()))
...

 

Output using this case looks like:

$ R
Oracle Distribution of R version 2.15.1  (--) -- "Roasted Marshmallows"
...
> install.packages("<package_name>")
Installing package(s) into "/a/b/c"
...

This installs the package into "/a/b/c" as set by .Rprofile.

 

If you uninstall a package installed in the directory defined by the .Rprofile, make sure the file exists before removing the package from R by issuing:

remove.packages("<package_name>") 

The same holds true if Rprofile.site is used.

Oracle Loader for Hadoop

Oracle Loader for Hadoop is an efficient and high-performance loader for fast movement of data from a Hadoop cluster into a table in an Oracle database. Oracle Loader for Hadoop prepares data for loading into a database table, pre-partitioning the data if necessary and transforming it into an Oracle-ready format. It optionally sorts records before loading the data or creating output files. Oracle Loader for Hadoop is a MapReduce application that is invoked as a command-line utility and accepts the generic command-line options supported by the org.apache.hadoop.util.Tool interface.

After the pre-partitioning and transforming steps, there are two modes for loading the data into an Oracle database from a Hadoop cluster:

Online database mode: The data is loaded into the database using either a JDBC output format or an OCI Direct Path output format. The OCI Direct Path output format performs a high performance direct path load of the target table. The JDBC output format performs a conventional path load. In both cases, the reducer tasks connect to the database in parallel.

Offline database mode: The reducer tasks create binary or text format output files. The Data Pump output format creates binary format files that are ready to be loaded into an Oracle database using Oracle Direct Connector for HDFS. The Delimited Text output format creates text files in delimited record format. (This is usually called comma-separated value (CSV) format when the delimiter is a comma.) These text files are ready to be loaded into an Oracle database using Oracle Direct Connector for HDFS. Alternatively, these files can be copied to the database system and loaded manually. For Data Pump files, Oracle Loader for Hadoop produces a SQL script that contains the commands to create an external table that may be used to load the Data Pump files. Delimited text files may be manually loaded using either SQL*Loader or external tables. For each delimited text file, Oracle Loader for Hadoop produces a SQL*Loader control file that may be used to load the delimited text file. It also produces a single SQL script to load the delimited text file(s) into the target external table.

Oracle Loader for Hadoop is installed and runs on the Hadoop cluster. It resides on a node from which you submit MapReduce jobs.
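As a hedged sketch of what invoking the loader from that node looks like (the jar location under $OLH_HOME/jlib and the configuration file name are assumptions for this example; the configuration file specifies the input/output formats and the target table):

hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader -conf myLoaderConf.xml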

Oracle Big Data Connectors must be licensed separately from Oracle Big Data Appliance. If Oracle Big Data Connectors are licensed and you have chosen the option to install the connectors in the configuration script, then the Mammoth utility installs Oracle Loader for Hadoop on all nodes of the non-primary racks of the Oracle Big Data Appliance.
