Hadoop

Hadoop Frequently Asked Questions

This document provides answers to frequently asked questions about Hadoop distributed by Cloudera for use on the Oracle Big Data Appliance(BDA).

QUESTIONS AND ANSWERS

Is the environment variable $HADOOP_HOME used in CDH 4.1.2 ?

On BDA V2.0.1 with CDH 4.1.2, $HADOOP_HOME has been deprecated. It is good practice to unset it if it was previously set.

In lieu of the environment variable $HADOOP_HOME what should be used in CDH 4.1.2 ?

On BDA V2.0.1 with CDH 4.1.2, use $HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce.

Should OS disks (/dev/sda, /dev/sdb) be used to store local data? HDFS data?

No, this is not recommended.

How can data on the OS disks be cleaned up, since storing it there is not recommended?

Simply delete the data and it will be automatically cleaned upon the mirrored disk as well.

Does the Cloudera CDH Client have to be installed on all Exadata DB nodes?

If using Oracle SQL Connector for HDFS then yes, the CDH Client needs to be installed on all Exadata DB nodes. If using Oracle Loader for Hadoop then the CDH client does not have to be installed on all Exadata DB nodes.

If a disk goes bad and is replaced can you verify the disk is functional with regards to HDFS?

If a disk goes bad and is replaced you can verify the disk is functional with regards to the local file system by doing something like:

## copy a file to filesystem on that disk
# cp /<path>/<file> /u03/
# sync

## check for differences between the 2 files
# echo Checking copied file.
# diff /<path>/<file> /u03/.

## copy the file back and check for differences
# cp /u03/<file> /tmp/<file>
# sync
# echo Checking file after copying back.
# diff /<path>/<file> /tmp/<file>

There is no similar sequence which can be done to verify the disk is functional with regards to HDFS. This type of functionality is built into HDFS with in-built per block md5 checksums which are read back over time by the blockScanner on each DataNode and repaired via recopy of a healthy block as needed. Nothing needs to be done on to verify the disk is functional with regards to HDFS other than check that the disk is writable from the OS level as described.

If one of the services managed by Cloudera Manager(CM) goes into "BAD" health, is there a recommended order for checking the status of services?

Yes. If any of the services goes into "BAD" Health in CM check the status of services in the order below:

1) First check the Zookeeper service Status. If Zookeeper is in "BAD" health then the cluster will not be stable. The Zookeeper service will need to be fixed prior to fixing any other service. Check the Zookeeper logs for additional details on Zookeeper status.

If the Zookeeper service is in "Good" health then continue to (2)

2. Check the status of the HDFS service.
a) First check the Failover Controller Status. If the Failover Controller service is in "BAD" health then then check the log files.

b) Check the NameNode service status. If the NameNode service is in "BAD" health then check the NameNode logs.

If the Zookeeper and HDFS services are in "Good" health then continue to (3)

3) Check the logs of the service that is bad.
You can upload the output from "bdadiag" to an Oracle Support SR for review.

If the nodes of the BDA cluster have been up for close to 200 days is a reboot recommended?

In BDA versions V2.0.1 - V2.2.0 yes all nodes approaching 200 days of uptime need reboot.

Generally first reboot the node where standby NameNode resides and make sure it's Healthy. Once the standby NameNode is Healthy then manually failover Active NameNode to Standby After the Active NameNode is switched to standby then reboot that node.

Can you decommision non-critical nodes from BDA HDFS cluster , inorder to install NoSQL ?

Currently it's not supported on BDA to remove/decommission nodes from a deployed HDFS cluster.

For HA testing is it possible to relocate Hive services to a different node after a Hive node failure?

Migration of Hadoop roles (JT, NN, Hive, ZK, etc.) is not currently supported on the BDA. For now you need to stick to the layout of services provided. The software checks will start reporting errors when you move Hadoop roles controlled by Mammoth to different locations.

What options are available for migrating service roles on the BDA?

Since the documentation on Oracle Big Data Appliance Restrictions on Use states that migration of Hadoop roles (NN, Hive, ZK, etc.) is not currently supported on the BDA, what options are available for migrating service roles on the BDA?

The BDA is a fully top-to-bottom supported Hadoop Appliance. We do not support arbitrary movement of specific Hadoop roles since this may result in configurations that are not supportable ( e.g. both NameNodes on the same host) or are less than optimal in terms of performance.

However, our goal is to support enough flexibility to meet most requirements. Clearly there are situations where it is necessary to be able to move master roles off of a particular node.

- when adding a new rack to an existing cluster
- in case of catastrophic failure of a server
- for scheduled maintenance of a rack or of particular servers

We are working on supporting in Mammoth the ability to move all master roles (NameNode, JournalNode, MySQL, Cloudera Manager etc) off a particular server and onto another server (which was previously a regular slave node). The previous master node would become a regular slave node (DataNode + NodeManager) if it was still up. We believe that this ability will support the three cases listed above (and in the case of adding a new rack to an existing cluster we will automatically distribute the 4 master nodes between the 2 racks). This functionality is planned for a future release.

What are the options for destroying i.e. performing a non-recoverable delete all the data stored on the DataNodes in HDFS?

The fastest way to delete all HDFS data in a Hadoop cluster is to run:

$ hadoop fs -rm -R -skipTrash "/*"

This will remove all HDFS data and skip the trash option so the deletes are finalized. The NameNode will still have work to do in that it will have to purge the blocks on all DataNodes after a short period of time.

When destroying HDFS data is there an option for replacing the data blocks on all DataNodes with some random pattern of bytes (0s/1s or something else)? In other words is there a way to securely delete sensitive data from HDFS by overwriting the physical disk locations with new data i.e. with randomly generated output?

No, this would be considered a "secure wipe" and this functionality is not present in HDFS.

Running a very long reducer seems to be filling one DataNode. Why would that be?

Reducers will write their data to the DataNode on which they are running, which will cause growth on that particular DataNode. The NameNode will make replicas of the blocks written but the original blocks will remain there.

How to run Hadoop without using SSH

The start-all.sh and stop-all.sh scripts in the hadoop/bin directory will use SSH to launch some of the hadoop daemons. If for some reason SSH is not available on the server, please follow the steps below to run hadoop without using SSH.

SOLUTION

The goal is to modify all "hadoop-daemons.sh" with "hadoop-daemon.sh". The "hadoop-daemons.sh" simply runs "hadoop-daemon.sh" through SSH.

Modify start-dfs.sh script:

from:
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode

to:
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode
Modify stop-dfs.sh script:

from:
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop namenode
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR stop datanode
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters stop secondarynamenode

to:
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop namenode
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop datanode
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR --hosts masters stop secondarynamenode
Modify start-mapred.sh script:

from:
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker

to:
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start tasktracker
Modify stop-mapred.sh script:

from:
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop jobtracker
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR stop tasktracker

to:
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop jobtracker
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop tasktracker

Note that after this change, start-all.sh and stop-all.sh will not start/stop any other hadoop nodes outside of this server remotely. All other remote slaves must be start/stopped manually, directly on those servers.

How To Modify Hadoop Log Level

APPLIES TO:

Oracle WebCenter Sites - Version 2.5 and later
Information in this document applies to any platform.

GOAL

By default, Hadoop's log level is set to INFO. This can be too much for most instances, as it will generate huge log files, even in an environment with low to moderate traffic.

Changing the root logger in log4j.properties file in Hadoop will not change the log level.

Follow the steps below for changing the log level of Hadoop.

FIX

Shut down hadoop if it is still running (Note that all Analytics daemons must be restarted if hadoop is restarted. Please refer to KM 1509643.1 for the procedure on how to restart Analytics)
Open the [hadoop_home]/bin/hadoop-daemon.sh file
Look for the following line:

export HADOOP_ROOT_LOGGER="INFO,DRFA"
Modify that line to a lower or higher log level (WARN should be sufficient if the goal is to limit the size of the log file rather than debugging an issue)

export HADOOP_ROOT_LOGGER="WARN,DRFA"
Save the file and start hadoop

Error during installation of Hadoop on Oracle WebCenter

APPLIES TO:

Oracle WebCenter Sites - Version 7.6.0 and later
Information in this document applies to any platform.

GOAL

When trying to run the command "$ bin/hadoop namenode -format" during hadoop installation the user sees the below error message

bin/hadoop: line 234: C:\Program: command not found

SOLUTION

There is an issue with the environment configuration used. The problem is caused by the whitespace between “Program Files” in the Windows folder structure. In other words if your JAVA_HOME is “C:\Program Files\java”, there is whitespace between “Program" and "Files”. So one way to solve the problem is put your jdk in a different folder, something like c:\java\jdk should fix the issue.

You are here

Hadoop

QUESTIONS AND ANSWERS

Is the environment variable $HADOOP_HOME used in CDH 4.1.2 ?

In lieu of the environment variable $HADOOP_HOME what should be used in CDH 4.1.2 ?

Should OS disks (/dev/sda, /dev/sdb) be used to store local data? HDFS data?

How can data on the OS disks be cleaned up, since storing it there is not recommended?

Does the Cloudera CDH Client have to be installed on all Exadata DB nodes?

If a disk goes bad and is replaced can you verify the disk is functional with regards to HDFS?

If one of the services managed by Cloudera Manager(CM) goes into "BAD" health, is there a recommended order for checking the status of services?

Can you decommision non-critical nodes from BDA HDFS cluster , inorder to install NoSQL ?

For HA testing is it possible to relocate Hive services to a different node after a Hive node failure?

What options are available for migrating service roles on the BDA?

Running a very long reducer seems to be filling one DataNode. Why would that be?

SOLUTION

APPLIES TO:

GOAL

FIX

APPLIES TO:

GOAL

SOLUTION

Pages

Smart TV Universal Remote Codes

TV Support Guides

User Manuals for TV and Consumer Electronics

Smart TV Universal Remote Codes By TV Brand

sub woofer - multi channel home theatre - audio video companies near me

Home Theater codes

Remote Codes For Audio Receivers

Universal Remote Codes By Brand

Remote Codes For BLU RAY Players

Remote Codes For DVD Players

Streaming Media Player Remote Codes

Cable Box

Digital Converter Remote Codes

Remote Codes For Video Projectors

Universal remote codes

Universal Remote

Connecting Devices To TV

Replacement Remotes For TVs

Replacement Remotes For DVR

DVD Blu-Ray Player Region Codes

Universal remote code pages