APPLIES TO:
Oracle R Connector for Hadoop - Version 1.0 to 1.0 [Release 1.0]
Linux x86-64
PURPOSE
This document provides sample code showing how to upload data to the Hadoop Distributed File System (HDFS) from OS files, database tables, and ORE/R data frames using Oracle R Connector for Hadoop (ORCH).
Oracle R Connector for Hadoop provides an interface between a local R environment, Oracle Database, and the Hadoop Distributed File System (HDFS), allowing speed-of-thought, interactive analysis across all three platforms.
Oracle R Connector for Hadoop (ORCH) is designed to work independently, but if the enterprise data for your analysis is also stored in Oracle Database, the full power of the connector is achieved when it is used with Oracle R Enterprise (ORE).
REQUIREMENTS
Generic R console with ORE and ORCH packages installed
CONFIGURING
For more information on installing R, ORE, and ORCH on the client server, refer to Doc ID 1477347.1.
INSTRUCTIONS
Open an R command-line console. You can paste the contents of the *.R files into the R console or execute them using the source() command.
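As a minimal illustration of the second approach, assuming a script file named CUpload.R exists in the current working directory:

```r
# Run a script file in the current R session.
# echo = TRUE prints each command as it is executed,
# which makes the sample output below easier to follow.
source("CUpload.R", echo = TRUE)
```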
CAUTION
Proofread this sample code before using it! Due to differences in the way text editors, e-mail packages, and operating systems handle text formatting (spaces, tabs, and carriage returns), this sample code may not be in an executable state when you first receive it. Check the sample code to ensure that errors of this type are corrected.
SAMPLE CODE
SAMPLE TO UPLOAD OS FILE TO HDFS
Here is sample code to upload a .dat file from the OS file system to HDFS.
Note: This sample code is executed as the oracle OS user on the BDA. If you intend to execute the sample code as a different OS user, set hdfs.setroot("/user/<OSUserName>") to point to that OS user's home directory.
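For example, a sketch of that adjustment for a hypothetical OS user named analyst (the user name is an assumption for illustration; substitute your own):

```r
library(ORCH)
# Point ORCH at the HDFS home directory of the "analyst" OS user
# ("analyst" is a hypothetical user name used only for this example).
hdfs.setroot("/user/analyst")
print(hdfs.pwd())
```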
CUpload.R
cat("Check the current OS directory and list the contents ..\n")
print(getwd())
print(list.files())
cat("Create an OS directory ..\n")
dir.create("orchtest")
print(list.files())
cat("cd to the newly created directory ..\n")
setwd("orchtest")
print(getwd())
cat("cars is a sample data frame \n")
class(cars)
print(names(cars))
cat("write cars data frame to an OS File \n")
write.csv(cars, "cars_test.dat", row.names = FALSE)
print(list.files())
cat("Load ORCH library ...\n")
library(ORCH)
cat("Set root directory and list contents ...\n")
hdfs.setroot("/user/oracle")
print(hdfs.pwd())
print(hdfs.ls())
cat("Command to remove csample1 directory on HDFS ...\n")
hdfs.rmdir('csample1')
cat("Create a new csample1 directory on HDFS ...\n")
hdfs.mkdir('csample1', cd=T)
print(hdfs.pwd())
cat("Upload the dat file to HDFS ...\n")
cars.dfs_File <- hdfs.upload('cars_test.dat', dfs.name='cars_F', header=T)
cat("ORCH commands to check the file size and sample data ...\n")
print(hdfs.ls())
print(hdfs.size("cars_F"))
print(hdfs.parts("cars_F"))
print(hdfs.sample("cars_F",lines=3))
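The upload can also be verified from an OS shell on the cluster. A minimal sketch, assuming the /user/oracle root and csample1 directory used in the script above (the exact layout of the files under cars_F may vary by ORCH version):

```shell
# List the HDFS object created by hdfs.upload
hadoop fs -ls /user/oracle/csample1
# Peek at the first few lines of the uploaded data
hadoop fs -cat /user/oracle/csample1/cars_F/* | head -3
```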
SAMPLE TO UPLOAD OS FILE TO DATA FRAME AND THEN TO HDFS
Here is sample code to load a .dat file from the OS file system into an R data frame and then upload the data frame to HDFS.
CUpload2.R
cat("Commands to cd to directory where the .dat/csv file resides ..\n")
getwd()
setwd("orchtest")
print(getwd())
print(list.files())
cat("Create data frame from OS File \n")
dcars <- read.csv(file="cars_test.dat", header=TRUE, sep=",")
print(names(dcars))
cat("Load ORCH library ...\n")
library(ORCH)
cat("Set root directory and list contents ...\n")
hdfs.setroot("/user/oracle")
print(hdfs.pwd())
print(hdfs.ls())
cat("Command to remove csample2 directory on HDFS ...\n")
hdfs.rmdir('csample2')
cat("Create a new csample2 directory on HDFS ...\n")
hdfs.mkdir('csample2', cd=T)
print(hdfs.pwd())
cat("Upload Data Frame to HDFS ...\n")
myfile <- hdfs.put(dcars, dfs.name='cars_F2')
cat("ORCH commands to check the file size and sample data ...\n")
print(hdfs.exists("cars_F2"))
print(hdfs.size("cars_F2"))
print(hdfs.parts("cars_F2"))
print(hdfs.sample("cars_F2",lines=3))
SAMPLE TO UPLOAD DATA FRAME TO HDFS
Here is the code to upload a Data Frame to HDFS.
DUpload.R
cat("cars is a sample data frame \n")
class(cars)
cat("Load ORCH library ...\n")
library(ORCH)
cat("Set root directory and list contents ...\n")
hdfs.setroot("/user/oracle")
print(hdfs.pwd())
print(hdfs.ls())
cat("Command to remove csample3 directory on HDFS ...\n")
hdfs.rmdir('csample3')
cat("Create a new csample3 directory on HDFS ...\n")
hdfs.mkdir('csample3', cd=T)
print(hdfs.pwd())
cat("Upload Data Frame to HDFS ...\n")
myfile <- hdfs.put(cars, dfs.name='cars_D')
cat("ORCH commands to check the file size and sample data ...\n")
print(hdfs.ls())
print(hdfs.size("cars_D"))
print(hdfs.parts("cars_D"))
print(hdfs.sample("cars_D",lines=3))
SAMPLE TO UPLOAD DATABASE TABLE TO HDFS
Here is sample code to create an ORE frame from a database table and then upload it to HDFS.
This sample uses Oracle R Enterprise functions in addition to generic R and ORCH functions.
For sample code on how to create database tables from R data frames using ore.create, refer to the Oracle R Enterprise User's Guide.
Refer to Doc ID 1490291.1 for sample code on how to create the table (DF_TABLE) used in this sample.
Modify dbsid, dbhost, port, and RQPASS to match your environment. RQUSER is the user created using demo_user.sh, which is supplied as part of the ORE server install. The username and password may differ in your environment.
Also, when executing this script in an ORE server environment, uncomment the .libPaths call and change <ORACLE_HOME> to the absolute path of the Oracle home. ORE server installs the needed R libraries/packages in $ORACLE_HOME/R/library, whereas the ORE client installs them in $R_HOME/library.
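To confirm which library tree a given package is loaded from, a quick check can help (a sketch; the paths printed depend on your installation):

```r
# Library paths R searches, in order; the ORE server tree
# ($ORACLE_HOME/R/library) should appear first if .libPaths was set.
print(.libPaths())
# Directory the ORE package would be loaded from
print(find.package("ORE"))
```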
TUpload.R
cat("Load ORE and connect.\n")
# .libPaths("<ORACLE_HOME>/R/library")
library(ORE)
ore.connect("RQUSER","<dbsid>","<dbhost>","RQPASS", <port>)
ore.sync()
ore.attach()
cat("List the tables in RQUSER schema.\n")
print(ore.ls())
cat("Load ORCH and connect.\n")
library(ORCH)
orch.connect("<dbhost>","RQUSER","<dbsid>","RQPASS", <port> , secure=F)
cat("Set root directory and list contents ...\n")
hdfs.setroot("/user/oracle")
print(hdfs.pwd())
print(hdfs.ls())
cat("Command to remove csample4 directory on HDFS ...\n")
hdfs.rmdir('csample4')
cat("Create a new csample4 directory on HDFS ...\n")
hdfs.mkdir('csample4')
hdfs.cd('csample4')
print(hdfs.pwd())
cat("Create ORE Frame for DF_TABLE \n")
df_t <- DF_TABLE
print(class(df_t))
print(names(df_t))
cat("Upload ORE Frame to HDFS .. \n")
df.dfs <- hdfs.push(df_t, dfs.name='df_T', split.by="A")
cat("ORCH commands to check the file size and sample data ...\n")
print(hdfs.exists("df_T"))
print(hdfs.size("df_T"))
print(hdfs.parts("df_T"))
print(hdfs.sample("df_T",lines=3))
SAMPLE OUTPUT
Sample Output of Uploading OS file to HDFS
Open an R command-line console. You can paste the content of CUpload.R into the R console or execute it using the source command.
[1] "CUpload.R"
> source("CUpload.R")
Using generic R and ORCH functions.
Check the current OS directory and list the contents ..
[1] "/refresh/home/RTest"
[1] "CUpload.R"
Create an OS directory ..
[1] "CUpload.R" "orchtest"
cd to the newly created directory ..
[1] "/refresh/home/RTest/orchtest"
cars is a sample data frame
[1] "speed" "dist"
write cars data frame to an OS File
[1] "cars_test.dat"
Load ORCH library ...
Oracle R Connector for Hadoop 0.1.8 (rev. 104)
Hadoop 0.20.2-cdh3u4 is up
Sqoop 1.3.0-cdh3u4 is up
Loading required package: ROracle
Loading required package: DBI
Attaching package: 'OREbase'
The following object(s) are masked from 'package:base':
cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
rbind, table
Loading required package: MASS
Set root directory and list contents ...
[1] "/user/oracle/RTest"
[1] "csample1"
Command to remove csample1 directory on HDFS ...
Create a new csample1 directory on HDFS ...
[1] "/user/oracle/RTest/csample1"
Upload the dat file to HDFS ...
ORCH commands to check the file size and sample data ...
[1] "cars_F"
[1] 293
[1] 1
val1 val2
1 24 93
2 24 120
3 25 85
Sample Output of Uploading OS file to Data Frame and then to HDFS
Open an R command-line console. You can paste the content of CUpload2.R into the R console or execute it using the source command.
[1] "CUpload2.R" "CUpload.R" "orchtest"
> source("CUpload2.R")
Using generic R and ORCH functions.
Commands to cd to directory where the .dat/csv file resides ..
[1] "/refresh/home/RTest/orchtest"
[1] "cars_test.dat"
Create data frame from OS File
[1] "speed" "dist"
Load ORCH library ...
Oracle R Connector for Hadoop 0.1.8 (rev. 104)
Hadoop 0.20.2-cdh3u4 is up
Sqoop 1.3.0-cdh3u4 is up
Loading required package: ROracle
Loading required package: DBI
Attaching package: 'OREbase'
The following object(s) are masked from 'package:base':
cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
rbind, table
Loading required package: MASS
Set root directory and list contents ...
[1] "/user/oracle/RTest"
[1] "csample1" "csample2"
Command to remove csample2 directory on HDFS ...
Create a new csample2 directory on HDFS ...
[1] "/user/oracle/RTest/csample2"
Upload Data Frame to HDFS ...
ORCH commands to check the file size and sample data ...
[1] TRUE
[1] 343
[1] 1
speed dist
1 24 93
2 24 120
3 25 85
Sample Output of Uploading Data Frame to HDFS
Open an R command-line console. You can paste the content of DUpload.R into the R console or execute it using the source command.
[1] "CUpload2.R" "CUpload.R" "DUpload.R" "orchtest"
> source("DUpload.R")
Using generic R and ORCH functions.
cars is a sample data frame
Load ORCH library ...
Oracle R Connector for Hadoop 0.1.8 (rev. 104)
Hadoop 0.20.2-cdh3u4 is up
Sqoop 1.3.0-cdh3u4 is up
Loading required package: ROracle
Loading required package: DBI
Attaching package: 'OREbase'
The following object(s) are masked from 'package:base':
cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
rbind, table
Loading required package: MASS
Set root directory and list contents ...
[1] "/user/oracle/RTest"
[1] "csample1" "csample2" "csample3"
Command to remove csample3 directory on HDFS ...
DBG: 21:54:29 [ER] failed to remove "/user/oracle/RTest/csample3"
Create a new csample3 directory on HDFS ...
[1] "/user/oracle/RTest/csample3"
Upload Data Frame to HDFS ...
ORCH commands to check the file size and sample data ...
[1] "cars_D"
[1] 343
[1] 1
speed dist
1 24 93
2 24 120
3 25 85
Sample Output of Uploading Database Table to HDFS
Open an R command-line console. You can paste the content of TUpload.R into the R console or execute it using the source command.
[1] "CTab.R" "CUpload2.R" "CUpload.R" "DUpload.R"
[5] "orchtest" "TUpload1.R" "TUpload.R" "TUpload.R.old"
> source("TUpload.R")
Using generic R, ORE and ORCH functions.
Load ORE and connect.
Loading required package: OREbase
Loading required package: ROracle
Loading required package: DBI
Attaching package: 'OREbase'
The following object(s) are masked from 'package:base':
cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
rbind, table
Loading required package: OREstats
Loading required package: MASS
Loading required package: OREgraphics
Loading required package: OREeda
Loading required package: ORExml
List the tables in RQUSER schema.
[1] "CARS_TABLE" "CARS_VTAB" "CARS_VTAB1" "DF_TABLE" "IRIS_TABLE"
[6] "ONTIME_S" "ONTIME_S2000" "WADERS_TABLE"
Load ORCH and connect.
Oracle R Connector for Hadoop 0.1.8 (rev. 104)
Hadoop 0.20.2-cdh3u4 is up
Sqoop 1.3.0-cdh3u4 is up
Connecting ORCH to RDBMS via [sqoop]
Host: celvpint0603
Port: 1521
SID: orcl
User: RQUSER
Connected.
Set root directory and list contents ...
[1] "/user/oracle/RTest"
[1] "csample1" "csample2" "csample3" "csample4"
Command to remove csample4 directory on HDFS ...
Create a new csample4 directory on HDFS ...
[1] "/user/oracle/RTest/csample4"
Create ORE Frame for DF_TABLE
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
[1] "A" "B"
Upload ORE Frame to HDFS ..
ORCH commands to check the file size and sample data ...
[1] TRUE
[1] 121
[1] 4
A B
1 13 m
2 26 z
3 7 g
>