
Friday, November 20, 2009

HBase on Cloudera Training Virtual Machine (0.3.2)

Note: This is a follow-up to my earlier post. Since then Cloudera has released a new VM that includes the current 0.20 branch of Hadoop. Below is the same post adjusted to work with that new release. Please note that there are subtle changes - for example, the NameNode port has changed. So if you still have the older post around, please disregard it and follow this one for the new VM version.

You might want to run HBase on Cloudera's Virtual Machine to get a quick start on a prototyping setup. In theory you download the VM, start it, and you are ready to go. The main issue, though, is that the current Hadoop Training VM does not include HBase at all (yet?). Apart from that, installing a local HBase instance is a straightforward process.

Here are the steps to get HBase running on Cloudera's VM:
  1. Download VM

    Get it from Cloudera's website.

  2. Start VM

    As the above page states: "To launch the VMWare image, you will either need VMware Player for windows and linux, or VMware Fusion for Mac."

    Note: I have Parallels for Mac and wanted to use that. I used Parallels Transporter to convert "cloudera-training-0.3.2.vmx" to a new "cloudera-training-0.2-cl4-000001.hdd", then created a new VM in Parallels, selecting Ubuntu Linux as the OS and the newly created .hdd as the disk image. Boot up the VM and you are up and running. I gave it a bit more graphics memory so I could switch the VM to 1440x900, the native screen resolution of the MacBook Pro I am using.

    Finally follow the steps explained on the page above, i.e. open a Terminal and issue:
    $ cd ~/git
    $ ./update-exercises --workspace
    

  3. Pull HBase branch

    We are using the brand new HBase 0.20.2 release. Open a new Terminal (or issue a $ cd .. in the open one), then:
    $ sudo -u hadoop git clone http://git.apache.org/hbase.git /home/hadoop/hbase
    $ sudo -u hadoop sh -c "cd /home/hadoop/hbase ; git checkout origin/tags/0.20.2"
    Note: moving to "origin/tags/0.20.2" which isn't a local branch
    If you want to create a new branch from this checkout, you may do so
    (now or later) by using -b with the checkout command again. Example:
      git checkout -b <new_branch_name>
    HEAD is now at 777fb63... HBase release 0.20.2
    

    First we clone the repository, then switch to the actual release tag. You will notice that I am using sudo -u hadoop because Hadoop itself is started under that account, so I wanted the ownership to match. Also, the default "training" account does not have SSH set up as explained in Hadoop's quick-start guide. When sudo asks for a password, use the default, which is "training".

    You can ignore the messages git prints out while performing the checkout.
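
    If you want to double-check that the working copy really is on the 0.20.2 release, the latest commit should match the 777fb63 hash shown above:
    $ sudo -u hadoop sh -c "cd /home/hadoop/hbase ; git log -1"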

  4. Build Branch

    Continue in Terminal:
    $ sudo -u hadoop sh -c "cd /home/hadoop/hbase/ ; export PATH=$PATH:/usr/share/apache-ant-1.7.1/bin ; ant package"
    ...
    BUILD SUCCESSFUL
    
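    The build creates a complete HBase layout under build/, which is what all the remaining steps use. A quick listing should show at least the bin, conf and lib directories:
    $ sudo -u hadoop ls /home/hadoop/hbase/build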

  5. Configure HBase

    There are a few edits to be made to get HBase running.
    $ sudo -u hadoop vim /home/hadoop/hbase/build/conf/hbase-site.xml
    
    <configuration>
    
      <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:8022/hbase</value>
      </property>
    
    </configuration>
    
    $ sudo -u hadoop vim /home/hadoop/hbase/build/conf/hbase-env.sh 
    
    # The java implementation to use.  Java 1.6 required.
    # export JAVA_HOME=/usr/java/jdk1.6.0/
    export JAVA_HOME=/usr/lib/jvm/java-6-sun
    ...
    
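    The port in hbase.rootdir has to match the port the NameNode is listening on - that is exactly the part that changed between VM releases. If you want to verify it rather than take the value above on faith, look it up in the VM's Hadoop configuration; I am assuming it lives under /etc/hadoop/conf, and on some setups the property may still sit in hadoop-site.xml:
    $ grep -A 1 fs.default.name /etc/hadoop/conf/core-site.xml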

  6. Rev up the Engine!

    The final thing is to start HBase:
    $ sudo -u hadoop /home/hadoop/hbase/build/bin/start-hbase.sh
    
    $ sudo -u hadoop /home/hadoop/hbase/build/bin/hbase shell
    HBase Shell; enter 'help<RETURN>' for list of supported commands.
    Version: 0.20.2, r777fb63ff0c73369abc4d799388a45b8bda9e5fd, Thu Nov 19 15:32:17 PST 2009
    hbase(main):001:0>
    

    Done!
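
    Before opening the shell you can check that the HBase daemon actually came up. With the Sun JDK configured above, jps should list an HMaster process next to the Hadoop daemons (what else shows up depends on your VM):
    $ sudo -u hadoop /usr/lib/jvm/java-6-sun/bin/jps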

    Let's create a table and check if it was created OK.
    hbase(main):001:0> list
    0 row(s) in 0.0910 seconds
    
    hbase(main):002:0> create 't1', 'f1', 'f2', 'f3'
    0 row(s) in 6.1260 seconds
    
    hbase(main):003:0> list                         
    t1                                                                                                            
    1 row(s) in 0.0470 seconds
    
    hbase(main):004:0> describe 't1'                
    DESCRIPTION                                                             ENABLED                               
     {NAME => 't1', FAMILIES => [{NAME => 'f1', COMPRESSION => 'NONE', VERS true                                  
     IONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => '                                       
     false', BLOCKCACHE => 'true'}, {NAME => 'f2', COMPRESSION => 'NONE', V                                       
     ERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY =                                       
     > 'false', BLOCKCACHE => 'true'}, {NAME => 'f3', COMPRESSION => 'NONE'                                       
     , VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR                                       
     Y => 'false', BLOCKCACHE => 'true'}]}                                                                        
    1 row(s) in 0.0750 seconds
    hbase(main):005:0> 
    
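    From here you can exercise the table with a few basic shell commands - put a cell, read it back, scan the table, then disable and drop it when you are done. Row key, column, and value below are just examples (prompt abbreviated):
    hbase> put 't1', 'row1', 'f1:c1', 'value1'
    hbase> get 't1', 'row1'
    hbase> scan 't1'
    hbase> disable 't1'
    hbase> drop 't1'
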
This sums it up. I hope you give HBase on the Cloudera Training VM a whirl as it also has Eclipse installed and therefore provides a quick start into Hadoop and HBase.

Just keep in mind that this is for prototyping only! With such a setup you will only be able to insert a handful of rows. If you overdo it you will bring it to its knees very quickly. But you can safely use it to play around with the shell to create tables or use the API to get used to it and test changes in your code etc.

Finally a screenshot of the running HBase UI:

Tuesday, October 20, 2009

HBase on Cloudera Training Virtual Machine (0.3.1)

You might want to run HBase on Cloudera's Virtual Machine to get a quick start on a prototyping setup. In theory you download the VM, start it, and you are ready to go. There are a few issues though, the worst being that the current Hadoop Training VM does not include HBase at all. Also, Cloudera uses a specific version of Hadoop that it deems stable and maintains its own release cycle, so Cloudera's Hadoop is at 0.18.3. HBase, however, needs Hadoop 0.20 - but we are in luck, as Andrew Purtell of TrendMicro maintains a special branch of HBase 0.20 that works with Cloudera's release.

Here are the steps to get HBase running on Cloudera's VM:
  1. Download VM

    Get it from Cloudera's website.
  2. Start VM

    As the above page states: "To launch the VMWare image, you will either need VMware Player for windows and linux, or VMware Fusion for Mac."

    Note: I have Parallels for Mac and wanted to use that. I used Parallels Transporter to convert "cloudera-training-0.3.1.vmx" to a new "cloudera-training-0.2-cl3-000002.hdd", then created a new VM in Parallels, selecting Ubuntu Linux as the OS and the newly created .hdd as the disk image. Boot up the VM and you are up and running. I gave it a bit more graphics memory so I could switch the VM to 1440x900, the native screen resolution of the MacBook Pro I am using.

    Finally follow the steps explained on the page above, i.e. open a Terminal and issue:
    $ cd ~/git
    $ ./update-exercises --workspace
    
  3. Pull HBase branch

    Open a new Terminal (or issue a $ cd .. in the open one), then:
    $ sudo -u hadoop git clone http://git.apache.org/hbase.git /home/hadoop/hbase
    $ sudo -u hadoop sh -c "cd /home/hadoop/hbase ; git checkout origin/0.20_on_hadoop-0.18.3"
    ...
    HEAD is now at c050f68... pull up to release
    

    First we clone the repository, then switch to the actual branch. You will notice that I am using sudo -u hadoop because Hadoop itself is started under that account, so I wanted the ownership to match. Also, the default "training" account does not have SSH set up as explained in Hadoop's quick-start guide. When sudo asks for a password, use the default, which is "training".
  4. Build Branch

    Continue in Terminal:
    $ sudo -u hadoop sh -c "cd /home/hadoop/hbase/ ; export PATH=$PATH:/usr/share/apache-ant-1.7.1/bin ; ant package"
    ...
    BUILD SUCCESSFUL
    
  5. Configure HBase

    There are a few edits to be made to get HBase running.
    $ sudo -u hadoop vim /home/hadoop/hbase/build/conf/hbase-site.xml
    
    <configuration>
    
      <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:8020/hbase</value>
      </property>
    
    </configuration>
    
    $ sudo -u hadoop vim /home/hadoop/hbase/build/conf/hbase-env.sh 
    
    # The java implementation to use.  Java 1.6 required.
    # export JAVA_HOME=/usr/java/jdk1.6.0/
    export JAVA_HOME=/usr/lib/jvm/java-6-sun
    ...
    

    Note: There is a small glitch in revision 826669 of that Cloudera-specific HBase branch. The master UI (on port 60010 on localhost) will not start, because the startup script scans a library path that does not exist in this branch and therefore misses the Jetty packages. You can fix it by editing the startup script and changing the path it scans:
    $ sudo -u hadoop vim /home/hadoop/hbase/build/bin/hbase
    

    Replace
    for f in $HBASE_HOME/lib/jsp-2.1/*.jar; do
    with
    for f in $HBASE_HOME/lib/jetty-ext/*.jar; do

    This is only needed until the developers fix it in the branch (compare the revision I used, r813052, with what you get). If you do not need the UI, you can ignore this and the resulting error in the logs; HBase will still run, just without its web-based interface.
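
    If you prefer not to edit the file by hand, a one-line sed makes the same substitution (same file, same change as described above):
    $ sudo -u hadoop sed -i 's|lib/jsp-2\.1|lib/jetty-ext|' /home/hadoop/hbase/build/bin/hbase
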
  6. Rev up the Engine!

    The final thing is to start HBase:
    $ sudo -u hadoop /home/hadoop/hbase/build/bin/start-hbase.sh
    $ sudo -u hadoop /home/hadoop/hbase/build/bin/hbase shell
    
    HBase Shell; enter 'help<RETURN>' for list of supported commands.
    Version: 0.20.0-0.18.3, r813052, Mon Oct 19 06:51:57 PDT 2009
    hbase(main):001:0> list
    0 row(s) in 0.2320 seconds
    hbase(main):002:0>
    

    Done!

This sums it up. I hope you give HBase on the Cloudera Training VM a whirl as it also has Eclipse installed and therefore provides a quick start into Hadoop and HBase.

Just keep in mind that this is for prototyping only! With such a setup you will only be able to insert a handful of rows. If you overdo it you will bring it to its knees very quickly. But you can safely use it to play around with the shell to create tables or use the API to get used to it and test changes in your code etc.

Update: Updated title to include version number, fixed XML

Thursday, February 5, 2009

Apache fails on Semaphores

In the last few years I have twice had an issue with our Apache web servers where all of a sudden they would crash and not start again. When the configuration is screwed up the reason is obvious, but there are also cases where you simply do not know why the server will not restart: there is enough drive space and RAM, and no other process is locking the port (I even checked with lsof).

All you get is an error message in the log saying:

[Fri May 21 15:34:22 2008] [crit] (28)No space left on device: mod_rewrite: could not create rewrite_log_lock
Configuration Failed


After some digging, the issue turned out to be that all semaphores were used up and had to be deleted before Apache could create new ones. Here is a script I use to do that (the awk pattern skips the ipcs header lines so only real semaphore IDs reach ipcrm):
echo "Semaphores found: "
ipcs -s | awk '/^0x/ { print $2 }' | wc -l
ipcs -s | awk '/^0x/ { print $2 }' | xargs -n 1 ipcrm sem
echo "Semaphores found after removal: "
ipcs -s | awk '/^0x/ { print $2 }' | wc -l
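
If you want to see how close you are to the limit before Apache falls over, the kernel's semaphore limits and the current usage can be checked as below. And if other services on the box also use semaphores, it is safer to restrict the removal to the Apache user - www-data is just an example, adjust it to your setup:
# system-wide semaphore limits (SEMMSL SEMMNS SEMOPM SEMMNI)
cat /proc/sys/kernel/sem
# current semaphore usage and limits
ipcs -ls
# remove only the semaphore arrays owned by the Apache user
ipcs -s | awk '/^0x/ && $3 == "www-data" { print $2 }' | xargs -n 1 ipcrm sem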

Sometimes you really wonder what else could go wrong.

Monday, January 19, 2009

VServer is not Xen, Part 2

Another oddity about VServer is that it does not have a true init process. Or rather, the whole startup is not what you are used to from other Linux systems.

While you can read about the different Init Styles, there is one crucial issue: the startup scripts, usually located in /etc/rc.<n>, are executed either outside of the VM or inside it, so you can either "see" the VM starting up from the master or not, respectively. While this is OK and usable for most applications, it has a major problem: you cannot run DJB's daemontools.

This is because while the above startup styles execute the init scripts, they do not execute anything else from the inittab configuration file - most importantly the last line in the following excerpt from /etc/inittab:
...
# Example how to put a getty on a serial line (for a terminal)
#
#T0:23:respawn:/sbin/getty -L ttyS0 9600 vt100
#T1:23:respawn:/sbin/getty -L ttyS1 9600 vt100

# Example how to put a getty on a modem line.
#
#T3:23:respawn:/sbin/mgetty -x0 -s 57600 ttyS3

SV:123456:respawn:/command/svscanboot

The last line is what starts the root daemontools process, which in turn starts all the services it maintains. In VServer it simply will not start.

The issue for me started a lot earlier; I really should have seen this coming. When I tried the initial setup I went down the usual route (at least for me): get the daemontools-installer Debian package and build the binaries. I did this in the VM, obviously, because that is where I wanted to install daemontools. Here is what happened:
$ build-daemontools       

This script unpacks the daemontools source into a directory, and
compiles it to produce a binary daemontools*.deb file.
...
Press ENTER to continue...
Attempting to apply patches located in
/usr/src/daemontools-installer/patches...
/usr/src/daemontools-installer/patches/errno.patch
patching file src/error.h
/usr/src/daemontools-installer/patches/fileutils.patch
patching file src/rts.tests
dh_testdir
package/compile
Linking ./src/* into ./compile...
Compiling everything in ./compile...
make[1]: Entering directory `/tmp/daemontools/admin/daemontools-0.76/compile'
sh find-systype.sh > systype
rm -f compile
sh print-cc.sh > compile
chmod 555 compile
./compile byte_chr.c
./compile byte_copy.c
./compile byte_cr.c
./compile byte_diff.c
...
make[1]: Leaving directory `/tmp/daemontools/admin/daemontools-0.76/compile'
Copying commands into ./command...
touch build-stamp
dh_testdir
dh_testroot
dh_clean -k
dh_clean: Compatibility levels before 4 are deprecated.
dh_installdirs
dh_installdirs: Compatibility levels before 4 are deprecated.
mkdir -p debian/daemontools/package/admin/daemontools-0.76
mkdir -p debian/daemontools/command
mkdir -p debian/daemontools/usr/share/daemontools
mkdir -p debian/daemontools/service
cp -a command debian/daemontools/package/admin/daemontools-0.76
cp -a compile debian/daemontools/package/admin/daemontools-0.76
cp -a package debian/daemontools/package/admin/daemontools-0.76
cp -a src debian/daemontools/package/admin/daemontools-0.76
dh_link package/admin/daemontools-0.76/package usr/share/daemontools/package
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76 package/admin/daemontools
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76/command/envdir command/envdir
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76/command/envuidgid command/envuidgid
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76/command/fghack command/fghack
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76/command/multilog command/multilog
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76/command/pgrphack command/pgrphack
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76/command/readproctitle
command/readproctitle
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76/command/setlock command/setlock
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76/command/setuidgid command/setuidgid
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76/command/softlimit command/softlimit
...
dh_gencontrol
dh_gencontrol: Compatibility levels before 4 are deprecated.
dh_md5sums
dh_md5sums: Compatibility levels before 4 are deprecated.
dpkg-deb -b debian/daemontools ..
dpkg-deb: building package `daemontools' in `../daemontools_0.76-9_i386.deb'.

It seems that all went ok

Do you want to remove all files in /tmp/daemontools,
except daemontools_0.76-9_i386.deb now? [Yn]
Removing files... done

Do you want to install daemontools_0.76-9_i386.deb now? [Yn] n

Do you want to purge daemontools-installer now? [yN]

Good luck!

So the compile succeeded, but the subsequent packaging step failed with "dh_link: Compatibility levels before 4 are deprecated." errors. By the looks of it the makefile was not built to handle these kinds of errors, because at the end it told me all seems OK - which is of course not the case: the package is empty.

Well, I managed to build it somewhere else and install the binaries that way into the Virtual Machine. But then I noticed the issue above, in other words the services would not run because the root process was not started.

After searching around on the web I found - of course - a post outlining the same issue. As usual you go through the same steps and pain just to find out that someone else found the same problem and already fixed it.

The solution is to start the root daemontools process just like any other service. The post has a script that I include below (in case it gets lost in the Intertubes):
$ cat /etc/init.d/svscanboot 

#! /bin/sh
#
# daemontools for launching /etc/svscanboot from sysvinit instead of /sbin/init.
#
# author: dkg

set -e

PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
DESC="daemontools"
NAME=svscanboot
DAEMON=/command/svscanboot

PIDFILE=/var/run/$NAME.pid
SCRIPTNAME=/etc/init.d/$NAME

# Gracefully exit if the package has been removed.
test -x $DAEMON || exit 0

#
# Function that starts the daemon/service.
#
d_start() {
    start-stop-daemon --start --background --make-pidfile --quiet --pidfile $PIDFILE \
        --exec $DAEMON
}

#
# Function that stops the daemon/service.
#
d_stop() {
    start-stop-daemon --stop --quiet --pidfile $PIDFILE \
        --name $NAME
    echo "not cleaning up svscan and readproctitle subprocesses
appropriately. dkg is lazy."
}

#
# Function that sends a SIGHUP to the daemon/service.
#
d_reload() {
    start-stop-daemon --stop --quiet --pidfile $PIDFILE \
        --name $NAME --signal 1
}

case "$1" in
    start)
        echo -n "Starting $DESC: $NAME"
        d_start
        echo "."
        ;;
    stop)
        echo -n "Stopping $DESC: $NAME"
        d_stop
        echo "."
        ;;
    restart|force-reload)
        #
        # If the "reload" option is implemented, move the "force-reload"
        # option to the "reload" entry above. If not, "force-reload" is
        # just the same as "restart".
        #
        echo -n "Restarting $DESC: $NAME"
        d_stop
        sleep 1
        d_start
        echo "."
        ;;
    *)
        # echo "Usage: $SCRIPTNAME {start|stop|restart|reload|force-reload}" >&2
        echo "Usage: $SCRIPTNAME {start|stop|restart|force-reload}" >&2
        exit 1
        ;;
esac

exit 0
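
To actually use the script on a Debian system, make it executable and hook it into the default runlevels - roughly like this, with the standard Debian tools (this part is not from the original post):
$ chmod +x /etc/init.d/svscanboot
$ update-rc.d svscanboot defaults
$ /etc/init.d/svscanboot start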

Now, other posts mention that there is also a "fakeinit" style - but it did not work for me, and I rather believe that this is the old name for the "plain" style mentioned in the Init Styles document I linked above.

Goes to show that a lot is unclear about VServer. But that is often the case with open-source tools and systems. It is up to us, the IT people, to help out and close those gaps while contributing to the community.

Saturday, January 10, 2009

VServer is not Xen

Recently I had to work on a Virtual Machine (VM) running on VServer. In the past I used Xen to create virtual machines, but due to the nature of the task VServer seemed more appropriate: I only have to run two Debian Etch VMs on a Debian Etch host. Because of its much narrower interface to the Operating System (OS), VServer lets the guests run with much less overhead - and therefore faster as well.

There are a few things that are quite nice about the lesser abstraction of VServer compared to Xen. For example, I found that copying a Virtual Machine is much simpler, and files can be copied into place from the master, because the file systems of the VMs are simply directories within the master's file system.

One thing I did notice, though, is that it is much more difficult to run certain daemons in the VMs and on the master at the same time. Xen completely separates master and VM at the kernel level, so running the same daemon on the same port in both is a natural fit - nothing to be done. Not so with VServer.

I tried to run SSH, NTP, and SNMP on the master and on the two VMs I was setting up. The first issue I ran into was SSH. SSH on the master listens on all network addresses, configured as such:
ListenAddress 0.0.0.0

When you now try to start the SSH daemon on the VMs, you get an error that the address is already in use - by the master, of course! The master and the Virtual Machines share the network layer, and that is what causes the problem.

The issue in itself is solved by setting the listening address to a specific one, namely the address of the master:
ListenAddress 192.168.1.100

Then it binds only to that interface, and the VMs are free to bind their daemons to their own IPs.
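
In practice that means every sshd gets its own ListenAddress - something along these lines, with the guest IPs obviously being just examples for my setup:
# master: /etc/ssh/sshd_config
ListenAddress 192.168.1.100

# each guest: /etc/ssh/sshd_config, using that guest's own IP
ListenAddress 192.168.1.101

# restart sshd on the master and in each guest afterwards
/etc/init.d/ssh restart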

The second issue I ran into was NTP. I tried to handle it the same way as the SSH daemon, but since the listening address is not something you can specify in /etc/ntp.conf, the NTP daemon binds to all interfaces and we get the same error on the VMs as mentioned above.

I found it best to remove NTP completely from the VMs and only run it on the master. After a few weeks of observation it seems the time is "passed on" to the VMs; in other words, their time stays in sync. This makes some sense considering the thin layer VServer uses to run the Virtual Machines: they simply use the same internal clock, and if the master is in sync then so are the VMs.
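
A simple way to keep an eye on this is to check the sync state on the master and compare its clock with a guest's - roughly like this, where vm1 is a placeholder for the guest's VServer name:
# on the master: show the NTP peers and their offsets
ntpq -p
# compare the master's clock with a guest's clock
date ; vserver vm1 exec date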

Friday, January 9, 2009

Odd "ps" output

While trying to figure out when I started a particular process, I noticed that the normal "ps aux" or "ps -eF" does not show the actual start date; depending on how long the task has already been running, it shows only a time, a month/day, or just the year. For example:
[02:09:36 root@lv1-cpq-bl-17 bin]# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 1944 656 ? Ss 2008 0:01 init [2]
root 2 0.0 0.0 0 0 ? S 2008 0:00 [migration/0]
root 3 0.0 0.0 0 0 ? SN 2008 0:00 [ksoftirqd/0]
root 4 0.0 0.0 0 0 ? S 2008 0:00 [events/0]
root 5 0.0 0.0 0 0 ? S 2008 0:00 [khelper]
...
root 2851 0.0 3.9 1243532 40740 ? Sl Jan07 1:14 /usr/lib/jvm/java-1.5.0-sun/jre/bin/java
root 3521 0.1 4.4 1250100 45828 ? Sl Jan07 2:22 /usr/lib/jvm/java-1.5.0-sun/jre/bin/java
root 3629 0.0 4.1 1237900 42880 ? Sl Jan07 0:28 /usr/lib/jvm/java-1.5.0-sun/jre/bin/java
root 3799 0.1 5.9 1268268 61260 ? Sl Jan07 3:17 /usr/lib/jvm/java-1.5.0-sun/jre/bin/java
root 12274 0.0 0.0 3432 880 pts/4 R+ 03:25 0:00 ps aux

So the START column varies from the time of day for processes started today, to a month/day combination, all the way to just a year for processes started last year.

But when exactly?

Digging into the "man ps" details and using a trial-and-error approach, I found that a custom output format gets me what I needed:
[root]# ps -e -o user,pid,pcpu,start,stime,time,vsz,rssize,ni,args
USER PID %CPU STARTED STIME TIME VSZ RSS NI COMMAND
root 1 0.0 Jul 01 2008 00:00:01 1944 656 0 init [2]
root 2 0.0 Jul 01 2008 00:00:00 0 0 - [migration/0]
root 3 0.0 Jul 01 2008 00:00:00 0 0 19 [ksoftirqd/0]
root 4 0.0 Jul 01 2008 00:00:00 0 0 -5 [events/0]
root 5 0.0 Jul 01 2008 00:00:00 0 0 -5 [khelper]
...
root 2851 0.0 Jan 07 Jan07 00:01:14 1243532 40740 0 /usr/lib/jvm/java-1.5.0-sun/jre/bin/java
root 3521 0.1 Jan 07 Jan07 00:02:22 1250100 45828 0 /usr/lib/jvm/java-1.5.0-sun/jre/bin/java
root 3629 0.0 Jan 07 Jan07 00:00:28 1237900 42880 0 /usr/lib/jvm/java-1.5.0-sun/jre/bin/java
root 3799 0.1 Jan 07 Jan07 00:03:17 1268268 61260 0 /usr/lib/jvm/java-1.5.0-sun/jre/bin/java
root 12275 0.0 03:25:38 03:25 00:00:00 3432 880 0 ps -e -o user,pid,pcpu,start,
stime,time,vsz,rssize,ni,args

The "start" format option results in the "STARTED" column above and shows what I needed. The last thing, I guess, would be to set the "PS_FORMAT" environment variable if I wanted this permanently.
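
For what it is worth, two follow-ups that should work with the procps ps used here: the lstart format specifier prints the full, unabbreviated start timestamp of a process, and PS_FORMAT can hold the custom column list so it does not have to be typed every time (PID 3521 is just the example from the listing above):
# full start timestamp for a single process
[root]# ps -o pid,lstart,args -p 3521

# make the custom layout the default, e.g. in ~/.bashrc
[root]# export PS_FORMAT="user,pid,pcpu,start,stime,time,vsz,rssize,ni,args"
[root]# ps -e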