
Wednesday, December 26, 2012

Install Sqoop and HBase on MacBook Pro, OS X 10.8.2

Apache Sqoop helps transfer data between Hadoop and datastores (such as relational databases like Oracle, DB2, and a number of others). Read more about Sqoop here:
http://sqoop.apache.org/

If you are just getting started with Hadoop, you may want to refer to my earlier post on installing Hadoop:
http://springandgrailsmusings.blogspot.com/2012/12/install-hadoop-111-on-macbook-pro-os-x.html
and installing hive:
http://springandgrailsmusings.blogspot.com/2012/12/installing-hive-on-on-macbook-pro-os-x.html

As I mentioned in my previous posts, Homebrew provides a simple way to install almost anything, in this case Sqoop.

Open a terminal and install Sqoop with this command:
brew install sqoop

Homebrew takes care of installing all related dependencies for you, which for Sqoop are HBase and ZooKeeper.



Your terminal output should be similar to this:

$ brew install sqoop
==> Installing sqoop dependency: hbase
==> Downloading http://www.apache.org/dyn/closer.cgi?path=hbase/hbase-0.94.2/hbase-0.94.2.tar.gz
==> Best Mirror http://www.poolsaboveground.com/apache/hbase/hbase-0.94.2/hbase-0.94.2.tar.gz
######################################################################## 100.0%
==> Caveats
Requires Java 1.6.0 or greater.

You must also edit the configs in:
  /usr/local/Cellar/hbase/0.94.2/libexec/conf
to reflect your environment.

For more details:
  http://wiki.apache.org/hadoop/Hbase
==> Summary
/usr/local/Cellar/hbase/0.94.2: 3086 files, 115M, built in 3.9 minutes
==> Installing sqoop dependency: zookeeper
==> Downloading http://www.apache.org/dyn/closer.cgi?path=zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz
==> Best Mirror http://www.fightrice.com/mirrors/apache/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz
######################################################################## 100.0%
/usr/local/Cellar/zookeeper/3.4.5: 193 files, 12M, built in 18 seconds
==> Installing sqoop
==> Downloading http://apache.mirror.iphh.net/sqoop/1.4.2/sqoop-1.4.2.bin__hadoop-1.0.0.tar.gz
######################################################################## 100.0%
==> Caveats
Hadoop, Hive, HBase and ZooKeeper must be installed and configured
for Sqoop to work.
==> Summary
/usr/local/Cellar/sqoop/1.4.2: 60 files, 4.4M, built in 24 seconds
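Per the HBase caveat in the output above, you must edit the configs in /usr/local/Cellar/hbase/0.94.2/libexec/conf before using HBase. As an illustration, a minimal standalone hbase-site.xml might set only the data directory; the path below is an assumption, not from the caveat, so adjust it to your environment:

```xml
<configuration>
  <property>
    <!-- Where HBase stores its data; a file:// URI means local standalone mode. -->
    <name>hbase.rootdir</name>
    <value>file:///usr/local/var/hbase</value>
  </property>
</configuration>
```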



Now you are all set to use Sqoop with any supported data store.
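For example, once your database's JDBC driver jar is on Sqoop's classpath, a first command might be listing the tables Sqoop can see. The connection URL, database name, and username below are placeholders, not from this post:

```shell
# Hypothetical example: list tables in a local MySQL database.
# Guarded so the sketch is harmless on machines without sqoop on the PATH.
CONNECT_URL="jdbc:mysql://localhost/mydb"   # placeholder connection string
if command -v sqoop >/dev/null 2>&1; then
  # -P prompts for the password instead of putting it on the command line
  sqoop list-tables --connect "$CONNECT_URL" --username myuser -P
fi
```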
Have fun.


Installing Hive 0.9 on MacBook Pro, OS X 10.8.2



If you are reading this post, I assume you are interested in getting started with Hive on your MacBook and already have Hadoop installed. For details on installing Hadoop, please refer to my post here:
http://springandgrailsmusings.blogspot.com/2012/12/install-hadoop-111-on-macbook-pro-os-x.html

Again, Homebrew provides an easy way to get Hive on your Mac.
Run this from your terminal:
brew install hive

Brew installs Hive, and you will see output similar to the one below:
==> Downloading http://www.apache.org/dyn/closer.cgi?path=hive/hive-0.9.0/hive-0.9.0-bin.tar.gz
==> Best Mirror http://apache.claz.org/hive/hive-0.9.0/hive-0.9.0-bin.tar.gz
######################################################################## 100.0%
==> Caveats
Hadoop must be in your path for hive executable to work.
After installation, set $HIVE_HOME in your profile:
  export HIVE_HOME=/usr/local/Cellar/hive/0.9.0/libexec

You may need to set JAVA_HOME:
  export JAVA_HOME="$(/usr/libexec/java_home)"
==> Summary
/usr/local/Cellar/hive/0.9.0: 276 files, 25M, built in 13 seconds

Export HIVE_HOME and JAVA_HOME from your terminal, as prompted:

export HIVE_HOME=/usr/local/Cellar/hive/0.9.0/libexec
export JAVA_HOME="$(/usr/libexec/java_home)"
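These exports last only for the current terminal session. To make them permanent, you can append them to your shell profile (~/.profile is an assumption here; use whichever profile file your shell actually reads):

```shell
# Persist the Hive environment variables across terminal sessions.
cat >> ~/.profile <<'EOF'
export HIVE_HOME=/usr/local/Cellar/hive/0.9.0/libexec
export JAVA_HOME="$(/usr/libexec/java_home)"
EOF
```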

Now you can start Hive as follows:

/usr/local/Cellar/hive/0.9.0/bin/hive
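To check the install without entering the interactive shell, Hive's -e flag runs a single statement and exits. The sketch below is guarded in case the binary is not at this path on your machine:

```shell
# Smoke test: run one HiveQL statement non-interactively.
HIVE_BIN=/usr/local/Cellar/hive/0.9.0/bin/hive
if [ -x "$HIVE_BIN" ]; then
  "$HIVE_BIN" -e 'SHOW TABLES;'
fi
```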


You should be all set at this point to work with Hive.
Hope this helps.


Sunday, December 23, 2012

Install Hadoop 1.1.1 on MacBook Pro, OS X 10.8.2

I recently installed Hadoop on my new MacBook, and here are the steps I followed to get it working.
I write this with the hope that someone might find it useful.

First up, there are a few very nice posts on this topic that helped me get it done:
http://ragrawal.wordpress.com/2012/04/28/installing-hadoop-on-mac-osx-lion
http://dennyglee.com/2012/05/08/installing-hadoop-on-osx-lion-10-7/
http://geekiriki.blogspot.com/2011/10/flume-and-hadoop-on-os-x.html
I mainly followed these three (mixing steps from a couple of them) to get my installation working.

I used Homebrew to install Hadoop:
brew install hadoop

I enabled Remote Login on my Mac (System Preferences > Sharing) and created an RSA key using ssh-keygen.
Finally, I tested that I was able to ssh by running ssh localhost.
I used RSA, but DSA can be used for ssh as well.
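The ssh setup above can be sketched as commands. This assumes Remote Login is already enabled, and it generates a key only if one does not already exist:

```shell
# One-time passwordless ssh setup for the local machine.
mkdir -p ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Verify (should log in and exit without a password prompt):
ssh -o BatchMode=yes localhost exit || echo "ssh to localhost failed"
```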

This is how my conf files look (located in the /usr/local/Cellar/hadoop/1.1.1/libexec/conf folder).
The links provided above detail these; I have not made any changes of my own except for the Hadoop install directory.



core-site.xml



<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Note: I had to create two folders, as the original poster indicates:


mkdir /usr/local/Cellar/hadoop/hdfs
mkdir /usr/local/Cellar/hadoop/hdfs/tmp


hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Note: change dfs.replication according to your needs; 1 is fine for a single-node setup.



mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9010</value>
  </property>
</configuration>


In hadoop-env.sh (in the same conf folder), find the line
# export HADOOP_OPTS=-server
and add this line below it:
export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
This works around the "Unable to load realm info from SCDynamicStore" warning that Hadoop otherwise prints on OS X.
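If you prefer, the same edit can be scripted. The path below assumes the Homebrew install location used in this post, and the append is guarded so nothing happens if the file is not there:

```shell
# Append the Kerberos workaround to hadoop-env.sh (only if the file exists).
CONF=/usr/local/Cellar/hadoop/1.1.1/libexec/conf/hadoop-env.sh
if [ -f "$CONF" ]; then
  echo 'export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"' >> "$CONF"
fi
```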





Format the Hadoop Namenode using:
hadoop namenode -format





Start Hadoop by running the script:
/usr/local/Cellar/hadoop/1.1.1/libexec/bin/start-all.sh

Run
ps ax | grep hadoop | wc -l
If you see 6 as output (the five Hadoop daemons plus the grep process itself), you are all set.
If not, check the logs at
ls /usr/local/Cellar/hadoop/1.1.1/libexec/logs/

Health can be checked at http://localhost:50070/dfshealth.jsp
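The same health check can be done from the terminal; this simply reports the HTTP status code, and the cluster must be running for it to return 200:

```shell
# Check the NameNode web UI from the command line.
URL="http://localhost:50070/dfshealth.jsp"
curl -s -o /dev/null -w "%{http_code}\n" "$URL" || echo "namenode not reachable"
```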

You can run one of the bundled examples like this:
cd /usr/local/Cellar/hadoop/1.1.1/libexec
hadoop jar /usr/local/Cellar/hadoop/1.1.1/libexec/hadoop-examples-1.1.1.jar pi 10 100

You should see output similar to the following


Number of Maps  = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
12/12/23 16:31:00 INFO mapred.FileInputFormat: Total input paths to process : 10
12/12/23 16:31:00 INFO mapred.JobClient: Running job: job_201212231524_0003
12/12/23 16:31:01 INFO mapred.JobClient:  map 0% reduce 0%
12/12/23 16:31:04 INFO mapred.JobClient:  map 20% reduce 0%
12/12/23 16:31:06 INFO mapred.JobClient:  map 40% reduce 0%
12/12/23 16:31:08 INFO mapred.JobClient:  map 60% reduce 0%
12/12/23 16:31:09 INFO mapred.JobClient:  map 80% reduce 0%
12/12/23 16:31:11 INFO mapred.JobClient:  map 100% reduce 0%
12/12/23 16:31:12 INFO mapred.JobClient:  map 100% reduce 26%
12/12/23 16:31:18 INFO mapred.JobClient:  map 100% reduce 100%
12/12/23 16:31:19 INFO mapred.JobClient: Job complete: job_201212231524_0003
12/12/23 16:31:19 INFO mapred.JobClient: Counters: 27
12/12/23 16:31:19 INFO mapred.JobClient:   Job Counters
12/12/23 16:31:19 INFO mapred.JobClient:     Launched reduce tasks=1
12/12/23 16:31:19 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=16432
12/12/23 16:31:19 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/12/23 16:31:19 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/12/23 16:31:19 INFO mapred.JobClient:     Launched map tasks=10
12/12/23 16:31:19 INFO mapred.JobClient:     Data-local map tasks=10
12/12/23 16:31:19 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=13728
12/12/23 16:31:19 INFO mapred.JobClient:   File Input Format Counters
12/12/23 16:31:19 INFO mapred.JobClient:     Bytes Read=1180
12/12/23 16:31:19 INFO mapred.JobClient:   File Output Format Counters
12/12/23 16:31:19 INFO mapred.JobClient:     Bytes Written=97
12/12/23 16:31:19 INFO mapred.JobClient:   FileSystemCounters
12/12/23 16:31:19 INFO mapred.JobClient:     FILE_BYTES_READ=226
12/12/23 16:31:19 INFO mapred.JobClient:     HDFS_BYTES_READ=2560
12/12/23 16:31:19 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=267335
12/12/23 16:31:19 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=215
12/12/23 16:31:19 INFO mapred.JobClient:   Map-Reduce Framework
12/12/23 16:31:19 INFO mapred.JobClient:     Map output materialized bytes=280
12/12/23 16:31:19 INFO mapred.JobClient:     Map input records=10
12/12/23 16:31:19 INFO mapred.JobClient:     Reduce shuffle bytes=280
12/12/23 16:31:19 INFO mapred.JobClient:     Spilled Records=40
12/12/23 16:31:19 INFO mapred.JobClient:     Map output bytes=180
12/12/23 16:31:19 INFO mapred.JobClient:     Total committed heap usage (bytes)=1931190272
12/12/23 16:31:19 INFO mapred.JobClient:     Map input bytes=240
12/12/23 16:31:19 INFO mapred.JobClient:     Combine input records=0
12/12/23 16:31:19 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1380
12/12/23 16:31:19 INFO mapred.JobClient:     Reduce input records=20
12/12/23 16:31:19 INFO mapred.JobClient:     Reduce input groups=20
12/12/23 16:31:19 INFO mapred.JobClient:     Combine output records=0
12/12/23 16:31:19 INFO mapred.JobClient:     Reduce output records=0
12/12/23 16:31:19 INFO mapred.JobClient:     Map output records=20
Job Finished in 19.303 seconds
Estimated value of Pi is 3.14800000000000000000

Hope this helps.