
Tuesday, March 31, 2009

10 Years in one Project

For about ten years now I have been the CTO at WorldLingo. During those years I have seen quite a few people join and eventually leave us. Below is a small snapshot of how that time has passed. Obviously I am quite proud to be something of a rock in the sea.


If you would like to know how the video was created, read on.

I downloaded the source of the code_swarm project following the description, i.e. I used


svn checkout http://codeswarm.googlecode.com/svn/trunk/ codeswarm-read-only
cd codeswarm-read-only
ant all

to get the code and then ran ant all in its root directory:

C:\CODESW~1>ant all
Buildfile: build.xml

init:
[echo] Running INIT

build:
[echo] Running BUILD
[mkdir] Created dir: C:\CODESW~1\build
[javac] Compiling 18 source files to C:\CODESW~1\build
[copy] Copying 1 file to C:\CODESW~1\build

jar:
[echo] Running JAR
[mkdir] Created dir: C:\CODESW~1\dist
[jar] Building jar: C:\CODESW~1\dist\code_swarm.jar

all:
[echo] Building ALL

BUILD SUCCESSFUL
Total time: 6 seconds

Note that this is on my Windows machine. After the build you will have to edit the config file so that your settings and regular expressions match your project. I simply took the supplied sample config file, copied it, and modified these lines:

# This is a sample configuration file for code_swarm

...

# Input file
InputFile=data/wl-repevents.xml

...

# Project time per frame
#MillisecondsPerFrame=51600000

...

# Optional Method instead of MillisecondsPerFrame
FramesPerDay=2

...

ColorAssign1="wlsystem",".*wlsystem.*", 0,0,255, 0,0,255
ColorAssign2="www",".*www.*", 0,255,0, 0,255,0
ColorAssign3="docs",".*docs.*", 102,0,255, 102,0,255
ColorAssign4="serverconfig",".*serverconf.*", 255,0,0, 255,0,0

# Save each frame to an image?
TakeSnapshots=true

...

DrawNamesHalos=true

...

This just adjusts the labels and turns on the snapshots so that a video can be created at the end. I found a tutorial that explained how to set this up.

What I could not get to work was mencoder. I downloaded the MPlayer Windows installer from its official site, and although it is supposed to include mencoder, it does not. Or I am blind.

So, I simply ran

mkdir frames
runrepositoryfetch.bat data\wl.config

to fetch the history of our repository, spanning about ten years - going from Visual SourceSafe to CVS, and currently Subversion. One further problem was that the output file of the above script was not named as I had specified in the config file, so I had to rename it like so:
cd data
ren realtime_sample1157501935.xml wl-repevents.xml

After that I was able to use run.bat data\wl.config to see the full movie in real time.

With the snapshots created, and not willing to dig further into the absence of mencoder, I fired up my trusted MacBook Pro and used QuickTime to create the movie from the image sequence.

Once QuickTime had done its magic I saved the .mov file and used VisualHub to convert it to a proper video format for uploading to Vimeo. And that was it really.
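
If you have neither mencoder nor QuickTime at hand, ffmpeg can stitch the frames together as well. This is only a sketch - the frame name pattern and the frame rate are assumptions, so match them to your SnapshotLocation setting:

# assuming the frames are named code_swarm-00000.png, code_swarm-00001.png, ...
ffmpeg -r 24 -i frames/code_swarm-%05d.png -vcodec libx264 wl-history.mp4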

Saturday, March 7, 2009

HBase vs. CouchDB in Berlin

I had the pleasure of presenting our involvement with HBase at the 4th Berlin Hadoop Get Together, which was organized by Isabel Drost. Thanks again to Isabel for having me there; I thoroughly enjoyed it. First off, here are the slides:



HBase @ WorldLingo

The second talk was given by Jan Lehnardt, a CouchDB team member. I have been looking into Erlang for the last few months to see how we could use it for our own efforts. CouchDB is one of the projects you come across when reading articles about Erlang. So it was really great to have Jan present too.

At the end of both our talks it was great to see how the questions from the audience at times tried to compare the two. So is HBase better or worse than CouchDB? Of course, you cannot compare them directly. While they share common features (well, they store data, right?) they are made to solve different problems. CouchDB offers schema-free storage with built-in replication, which can even be used to create offline clients that sync their changes with another site once they have connectivity again. One of the features that intrigues me most is the ability to use it to serve your own applications to the world. You create the pages and scripts you need and push them into the database using CouchApp. Since the database already has a built-in web server, it can handle your application's requirements implicitly. Nice!

I asked Jan if he had concerns about scaling this, or if it wouldn't be better to use Apache or Nginx to serve the static content. His argument was that Erlang can handle many, many more concurrent requests than Apache can, for example. I read up on Yaws and saw his point. So I guess it is then a question of memory and CPU requirements. The former is apparently another strength of CouchDB, which has been shown to serve thousands of concurrent requests while needing only about 10MB of RAM - how awesome is that?! I am not sure about the CPU side - but I would wager it is equally sane.

Another Erlang project I am interested in is Dynomite, an Erlang-based Amazon Dynamo "clone" (or rather implementation). Talking to Cliff, it seems it is just as awesome, leveraging Erlang's OTP abilities to create something that a normal Java developer and their JRE are just not used to.

And that brings me to HBase. I told the crowd in Berlin that as of version 0.18.0 HBase is ready for anyone to get started with - provided they read the Wiki to set the file handles right, plus a few other bits and pieces.

Note: I was actually thinking about suggesting an improvement to the HBase team: a command line check that can be invoked separately, or that runs when "start-hbase.sh" is called, and that checks a few of these common parameters and prints warnings to the user. I know that the file handle count is printed in the log files, but for a newbie that is a bit too deep down. What could be checked? First off, the file handles being, say, 32K. The next thing is newer resource limits that were introduced with Hadoop and that now need tweaking - an example is the "xciever" (sic) value. This again is documented in the Wiki, but who reads it, right? Another common issue is RAM. If the master knows the number of regions (or while it is scanning the META to determine it), it could warn if the JRE is not given enough memory. Sure, there are no hard boundaries, but it is better to see a warning like: "Found x regions. Your configured memory for the JRE seems too low for the system to run stably."
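
A very rough sketch of what such a pre-flight check could look like - the 32K file handle threshold and the xcievers property come from the paragraph above, while the configuration path and variable names are assumptions you would adjust to your install:

#!/bin/bash
# hypothetical pre-flight check, run manually or from start-hbase.sh
HADOOP_CONF=${HADOOP_CONF:-/usr/local/hadoop/conf}

# check the open file handle limit of the current user
limit=$(ulimit -n)
if [ "$limit" -lt 32768 ]; then
  echo "Warning: open file limit is $limit, consider raising it to 32768 or more."
fi

# check that the DataNode xcievers limit has been raised at all
if ! grep -q "dfs.datanode.max.xcievers" $HADOOP_CONF/hadoop-site.xml; then
  echo "Warning: dfs.datanode.max.xcievers is not set in hadoop-site.xml."
fi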

Back to HBase. I also told the audience that as of HBase 0.19.0 scanning is much improved speed-wise, and that I am happy with where we are nowadays in terms of stability and speed. Sure, it could be faster for random reads, so that I might be able to drop my MemCached layer - and the team is working on that. So, here's hoping that we will see the best HBase ever in the upcoming version. I myself am 100% sure that the HBase guys can deliver - they have done so in the past and will do so now as well. All I can say is - give it a shot!

So, CouchDB is lean and mean while HBase is, from my experience, a resource hog. But it is also built to scale to petabyte-sized data. With CouchDB you would have to add sharding on top of it, including all the issues that come with it, for example rebalancing, fail-over, recovery, adding more servers and so on. For me HBase is the system of choice - for this particular problem. That does not mean I could not use CouchDB, or even Erlang for that matter, in a separate area. Until then I will keep a very close eye on this exciting (though, in the case of Erlang, not new!) technology. May the open-source projects rule and live long and prosper!

Thursday, February 5, 2009

Apache fails on Semaphores

In the last few years I have twice had an issue with our Apache web servers where all of a sudden they would crash and not start again. While there are obvious reasons when the configuration is screwed up, there are also cases where you simply do not know why it will not restart. There is enough drive space and RAM, and no other process is running that locks the port (I even checked with lsof).

All you get is an error message in the log saying:

[Fri May 21 15:34:22 2008] [crit] (28)No space left on device: mod_rewrite: could not create rewrite_log_lock
Configuration Failed


After some digging it turned out that all semaphores were used up and had to be deleted first. Here is a script I use to do that:
echo "Semaphores found: "
ipcs -s | awk '{ print $2 }' | wc -l
ipcs -s | awk '{ print $2 }' | xargs -n 1 ipcrm sem
echo "Semaphores found after removal: "
ipcs -s | awk '{ print $2 }' | wc -l
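
If other applications on the same box also use semaphores, it may be safer to limit the removal to those owned by the Apache user. A variation like the following should work - it assumes Apache runs as www-data, so adjust the user name to your setup:

# only remove semaphore arrays owned by the Apache user (assumed to be www-data)
ipcs -s | awk '/^0x/ && $3 == "www-data" { print $2 }' | xargs -n 1 ipcrm sem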

Sometimes you really wonder what else could go wrong.

Wednesday, February 4, 2009

String starts with a number in XSL

I needed a way to test if a particular value in an XML file started with a letter-based prefix. If not, then the value would start with a number and needed to be prefixed first before being output. While I found a great post on how to remove leading zeros, I could not find out how to check whether the first character is of a particular type, for example a letter or a digit. In Java you can do that easily, like so (using BeanShell here):

bsh % String s1 = "1234";
bsh % String s2 = "A1234";
bsh % print(Character.isLetter(s1.charAt(0)));
false
bsh % print(Character.isLetter(s2.charAt(0)));
true

This is of course Unicode safe. With XSL, though, I could not find a similar feature, but for my purposes it was sufficient to reverse the check and see if the first character was a digit. Here is how:
<xsl:template match="/person/personnumber">
  <reference>
    <xsl:variable name="num">
      <xsl:value-of select="."/>
    </xsl:variable>
    <xsl:choose>
      <xsl:when test="contains('0123456789', substring($num, 1, 1))">
        <xsl:variable name="snum">
          <xsl:call-template name="removeLeadingZeros">
            <xsl:with-param name="originalString" select="$num"/>
          </xsl:call-template>
        </xsl:variable>
        <xsl:value-of select="concat('PE', $snum)"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="."/>
      </xsl:otherwise>
    </xsl:choose>
  </reference>
</xsl:template>
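
To apply the template, any XSLT 1.0 processor will do. For example with xsltproc (the file names here are just placeholders):

xsltproc person.xsl person.xml > person-out.xml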

Monday, February 2, 2009

Hadoop Scripts

If you work with HBase and Hadoop in particular, you start off doing most things on the command line. After a while this gets tedious and - in the end - becomes a nuisance. And it is error prone! While I wish there were an existing and established solution out there that helps manage a Hadoop cluster, I find that there are few you can use right now. One of the few that come to mind is the "Hadoop on Demand" (HOD) package residing in the contribution folder of the Hadoop releases. The other is ZooKeeper.

Interesting things are in the pipeline though, for example Simon from Yahoo!.

What I often need are small helpers that allow me to clean up behind me or help me deploy new servers. There are various solutions, usually involving some combination of SSH and rsync. Tools I found - and some of which I even tried - are SmartFrog, Puppet, and Tentakel.

Especially in the beginning you often find that you do not know these tools well enough, or they do one thing but not another. Of course, you can combine them and make that work somehow. I usually resort to a set of well-known and proven scripts that I created over time to simplify working with a particular system. With Hadoop most of these scripts are run on the master, and since it is already set up to use SSH to talk to all slaves, it is easy to use the same mechanism.

The first one is to show all Java processes across the machines to see that they are up - or all shut down before attempting a new start:
#!/bin/bash
# $Revision: 1.0 $
#
# Shows all Java processes on the Hadoop cluster.
#
# Created 2008/01/07 by Lars George
#

servers="$(cat /usr/local/hadoop/conf/masters /usr/local/hadoop/conf/slaves)"

for srv in $servers; do
  echo "Sending command to $srv...";
  ssh $srv "ps aux | grep -v grep | grep java"
done

echo "done."

The next one is a poor-man's deployment thingamajig. It helps copying a new release across the machines and setting up the symbolic link I use for the current version in production. Of course this all varies with your setup.
#!/bin/bash
# $Revision: 1.0 $
#
# Rsync's Hadoop files across all slaves. Must run on namenode.
#
# Created 2008/01/03 by Lars George
#

if [ "$#" != "2" ]; then
echo "usage: $(basename $0) <dir-name> <ln-name>"
echo " example: $(basename $0) hbase-0.1 hbase"
exit 1
fi

for srv in $(cat /usr/local/hadoop/conf/slaves); do
echo "Sending command to $srv...";
rsync -vaz --exclude='logs/*' /usr/local/$1 $srv:/usr/local/
ssh $srv "rm -fR /usr/local/$2 ; ln -s /usr/local/$1 /usr/local/$2"
done

echo "done."

I basically download a new version on the master (or build one) and issue a

$ rsyncnewhadoop /usr/local/hadoop-0.19.0 hadoop

It copies the directory across and changes the "/usr/local/hadoop" symbolic link to point to this new release.

Another helper I use quite often is to diff an existing and a new version before I actually copy them across the cluster. It can be used like so:

$ diffnewversion /usr/local/hbase-0.19.0 /usr/local/hadoop-0.19.0

Again I assume that the current version is symlinked as explained above. Otherwise you would obviously have to make adjustments.
#!/bin/bash
#
# Diff's the configuration files between the current symlinked versions and the given one.
#
# Created 2009/01/23 by Lars George
#

if [[ $# == 0 ]]; then
  echo "usage: $(basename $0) <new_dir> [<new_dir>]"
  exit 1;
fi

DIRS="conf bin"

for path in $*; do
  if [[ "$1" == *hadoop* ]]; then
    kind="hadoop"
  else
    kind="hbase"
  fi
  for dir in $DIRS; do
    echo
    echo "Comparing $kind $dir directory..."
    echo
    for f in /usr/local/$kind/$dir/*; do
      echo
      echo
      echo "Checking $(basename $f)"
      diff -w $f $1/$dir/$(basename $f)
      if [[ $? == 0 ]]; then
        echo "Files are the same..."
      fi
      echo
      echo "================================================================"
    done
  done
  shift 1
done

echo "done."

The last one I am posting here helps remove the Distributed File System (DFS) data after, for example, a complete corruption (I didn't say those happen) or when you want to have a clean start.

Note: It assumes that the data is stored under "/data1/hadoop" and "/data2/hadoop" - that is where I have my data. If yours is different then adjust the paths or - if you like - grep/awk the hadoop-site.xml and parse the paths out of the "dfs.name.dir" and "dfs.data.dir" values respectively; a small sketch of that follows after the script.
#!/bin/bash
# $Revision: 1.0 $
#
# Deletes all files and directories pertaining to the Hadoop DFS.
#
# Created 2008/12/12 by Lars George
#

servers="$(cat /usr/local/hadoop/conf/masters /usr/local/hadoop/conf/slaves)"
# optionally allow single server use
if [[ $# > 0 ]]; then
  servers="$*"
fi
first="$(echo $servers | head -n 1 | awk -F. '{ print $1 }')"
dirs="/tmp/hbase* /tmp/hsperfdata* /tmp/task* /tmp/Jetty* /data1/hadoop/* /data2/hadoop/*"

echo "IMPORTANT: Are you sure you want to delete the DFS starting with $first?"
echo "Type \"yes\" to continue:"
read yes
if [ "$yes" == "yes" ]; then
  for srv in $servers; do
    echo "Sending command to $srv...";
    for dir in $dirs; do
      pa=$(dirname $dir)
      fn=$(basename $dir)
      echo "removing $pa/$fn...";
      ssh $srv "find $pa -name \"$fn\" -type f -delete ; rm -fR $pa/$fn"
    done
  done
else
  echo "aborted."
fi

echo "done."

I have a few others that, for example, let me kill runaway Java processes, sync only configuration changes across the cluster machines, start and stop everything safely, and so on. I won't post them here as they are pretty trivial, like the ones above, or do not differ much. Let me know if you have similar scripts or better ones!

Wednesday, January 28, 2009

Changing HBase Tables in Code

One thing I did early on was to store the HBase table descriptions in an external XML file, like a database schema. For example:
<table>
  <name>documents</name>
  <table_name>docs</table_name>
  <description>Stores the actual documents.</description>
  <column_family>
    <name>contents</name>
    <description>Holds the actual raw data.</description>
    <!-- Default: 3 -->
    <max_versions></max_versions>
    <!-- Default: DEFAULT_COMPRESSION_TYPE -->
    <compression_type></compression_type>
    <!-- Default: false -->
    <in_memory></in_memory>
    <!-- Default: false -->
    <block_cache_enabled/>
    <!-- Default: -1 (forever) -->
    <time_to_live/>
    <!-- Default: 2147483647 -->
    <max_value_length></max_value_length>
    <!-- Default: DEFAULT_BLOOM_FILTER_DESCRIPTOR -->
    <bloom_filter></bloom_filter>
  </column_family>
  <column_family>
    <name>mimetype</name>
    <description>Holds the MIME type of the data.</description>
    <!-- Default: 3 -->
    <max_versions></max_versions>
    <!-- Default: DEFAULT_COMPRESSION_TYPE -->
    <compression_type></compression_type>
    <!-- Default: false -->
    <in_memory></in_memory>
    <!-- Default: false -->
    <block_cache_enabled/>
    <!-- Default: -1 (forever) -->
    <time_to_live/>
    <!-- Default: 2147483647 -->
    <max_value_length></max_value_length>
    <!-- Default: DEFAULT_BLOOM_FILTER_DESCRIPTOR -->
    <bloom_filter></bloom_filter>
  </column_family>
</table>

While this adds extra work to maintain the schemas, it does give a central place where all the metadata about the HBase tables is stored.

In the code I added the functionality to read these XML files into internal classes that represent each table. For example:
/**
 * Describes a table HBase independent to be used by calling class.
 */
public class TableSchema {

  private String name = null;
  private String description = null;
  private String tableName = null;
  private HashMap<String, ColumnDefinition> columns = new HashMap<String, ColumnDefinition>();

  public String getName() {
    return name;
  }

  public void setName(String name) {
    this.name = name;
  }

  public String getDescription() {
    return description;
  }

  public void setDescription(String description) {
    this.description = description;
  }

  public String getTableName() {
    return tableName;
  }

  public void setTableName(String tableName) {
    this.tableName = tableName;
  }

  public void addColumn(ColumnDefinition column) {
    columns.put(column.getName(), column);
  }

  public Collection<ColumnDefinition> getColumns() {
    return columns.values();
  }

  public ColumnDefinition getColumnDefinition(String name) {
    return columns.get(name);
  }

  @Override
  public String toString() {
    return "name -> " + name + "\n description -> " + description +
      "\n tableName -> " + tableName + "\n columns -> " + columns;
  }

} // TableSchema

In addition I added a function to convert these instances into those that HBase understands. I also added a generic helper to get a table reference:
/**
 * Converts the XML based schema to a version HBase can take natively.
 *
 * @param schema The schema with all the tables.
 * @return The converted schema as a HBase object.
 */
private HTableDescriptor convertSchemaToDescriptor(TableSchema schema) {
  HTableDescriptor desc;
  desc = new HTableDescriptor(schema.getTableName());
  Collection<ColumnDefinition> cols = schema.getColumns();
  for (ColumnDefinition col : cols) {
    HColumnDescriptor cd = new HColumnDescriptor(Bytes.toBytes(col.getColumnName()), col.getMaxVersions(),
      col.getCompressionType(), col.isInMemory(), col.isBlockCacheEnabled(), col.getMaxValueLength(),
      col.getTimeToLive(), col.isBloomFilter());
    desc.addFamily(cd);
  }
  return desc;
} // convertSchemaToDescriptor

/**
 * Returns a table descriptor or <code>null</code> if it does not exist.
 *
 * @param name The name of the table.
 * @return The table descriptor or <code>null</code>.
 * @throws IOException When the communication to HBase fails.
 */
private HTableDescriptor getHBaseTable(String name) throws IOException {
  HTableDescriptor[] tables = _hbaseAdmin.listTables();
  for (int i = 0; i < tables.length; i++)
    if (tables[i].getNameAsString().equals(name)) return tables[i];
  return null;
} // getHBaseTable

Now I can ask for a table, and if it does not exist I can create it. But what if it does exist already? Then I face the problem of checking whether the table schema is different from the table that is deployed. If it is the same, fine, simply load it; but if it is different, you have to compare the column definitions and change those columns that have changed. Here is an approach:
/**
 * Returns a HBase table. The table is either opened, created or updated.
 *
 * @param schema The external schema describing the table.
 * @param create True means create table if non existent.
 * @return The internal table container.
 * @throws IOException When the table creation fails.
 */
private TableContainer createTable(TableSchema schema, boolean create)
    throws IOException {
  TableContainer res = new TableContainer();
  res.setSchema(schema);
  HTableDescriptor desc = null;
  if (_hbaseAdmin.tableExists(schema.getTableName())) {
    desc = getHBaseTable(schema.getTableName());
    // only check for changes if we are allowed to
    if (create) {
      HTableDescriptor d = convertSchemaToDescriptor(schema);
      // compute differences
      List<HColumnDescriptor> modCols = new ArrayList<HColumnDescriptor>();
      for (HColumnDescriptor cd : desc.getFamilies()) {
        HColumnDescriptor cd2 = d.getFamily(cd.getName());
        if (cd2 != null && !cd.equals(cd2)) modCols.add(cd2);
      }
      List<HColumnDescriptor> delCols = new ArrayList<HColumnDescriptor>(desc.getFamilies());
      delCols.removeAll(d.getFamilies());
      List<HColumnDescriptor> addCols = new ArrayList<HColumnDescriptor>(d.getFamilies());
      addCols.removeAll(desc.getFamilies());
      // check if we had a column that was changed, added or deleted
      if (modCols.size() > 0 || addCols.size() > 0 || delCols.size() > 0) {
        // yes, then disable table and iterate over changes
        _hbaseAdmin.disableTable(schema.getTableName());
        for (HColumnDescriptor col : modCols)
          _hbaseAdmin.modifyColumn(schema.getTableName(), col.getNameAsString(), col);
        for (HColumnDescriptor col : addCols)
          _hbaseAdmin.addColumn(schema.getTableName(), col);
        for (HColumnDescriptor col : delCols)
          _hbaseAdmin.deleteColumn(schema.getTableName(), col.getNameAsString() + ":");
        // enable again and reload details
        _hbaseAdmin.enableTable(schema.getTableName());
        desc = getTable(schema.getTableName(), false);
      }
    }
  } else if (create) {
    desc = convertSchemaToDescriptor(schema);
    _hbaseAdmin.createTable(desc);
  }
  res.setDescription(desc);
  HTable table = null;
  if (desc != null) table = new HTable(_hbaseConfig, desc.getName());
  res.setTable(table);
  return res;
} // createTable

That's it, I guess. Please note that this is my attempt at solving it; I am not sure yet whether it works. I will test it as soon as I can and update here accordingly. But I thought I would throw it out anyway - who knows, maybe it helps someone, or someone can help me. :)

Oh, for completeness' sake, here is the returned class I created to hold my table details:
/**
 * Container to hold a table's details.
 */
class TableContainer {

  private HTable table;
  private HTableDescriptor description;
  private TableSchema schema;

  public HTable getTable() {
    return table;
  }

  public void setTable(HTable table) {
    this.table = table;
  }

  public HTableDescriptor getDescription() {
    return description;
  }

  public void setDescription(HTableDescriptor description) {
    this.description = description;
  }

  public TableSchema getSchema() {
    return schema;
  }

  public void setSchema(TableSchema schema) {
    this.schema = schema;
  }

  @Override
  public String toString() {
    return "table -> " + table + ", description -> " + description +
      ", schema -> " + schema;
  }

} // TableContainer

/**
 * Describes a column and its features.
 */
public class ColumnDefinition {

  /** The divider between the column family name and a label. */
  public static final String DIV_COLUMN_LABEL = ":";
  /** Default values for HBase. */
  private static final int DEF_MAX_VERSIONS = HColumnDescriptor.DEFAULT_VERSIONS;
  /** Default values for HBase. */
  private static final CompressionType DEF_COMPRESSION_TYPE = HColumnDescriptor.DEFAULT_COMPRESSION;
  /** Default values for HBase. */
  private static final boolean DEF_IN_MEMORY = HColumnDescriptor.DEFAULT_IN_MEMORY;
  /** Default values for HBase. */
  private static final boolean DEF_BLOCKCACHE_ENABLED = HColumnDescriptor.DEFAULT_BLOCKCACHE;
  /** Default values for HBase. */
  private static final int DEF_MAX_VALUE_LENGTH = HColumnDescriptor.DEFAULT_LENGTH;
  /** Default values for HBase. */
  private static final int DEF_TIME_TO_LIVE = HColumnDescriptor.DEFAULT_TTL;
  /** Default values for HBase. */
  private static final boolean DEF_BLOOM_FILTER = HColumnDescriptor.DEFAULT_BLOOMFILTER;

  private String name;
  private String tableName;
  private String description;
  private int maxVersions = DEF_MAX_VERSIONS;
  private CompressionType compressionType = DEF_COMPRESSION_TYPE;
  private boolean inMemory = DEF_IN_MEMORY;
  private boolean blockCacheEnabled = DEF_BLOCKCACHE_ENABLED;
  private int maxValueLength = DEF_MAX_VALUE_LENGTH;
  private int timeToLive = DEF_TIME_TO_LIVE;
  private boolean bloomFilter = DEF_BLOOM_FILTER;

  public String getColumnName() {
    return name.endsWith(":") ? name : name + ":";
  }

  public String getName() {
    return name;
  }

  public void setName(String name) {
    this.name = name;
  }

  public String getTableName() {
    return tableName;
  }

  public void setTableName(String tableName) {
    this.tableName = tableName;
  }

  public String getDescription() {
    return description;
  }

  public void setDescription(String description) {
    this.description = description;
  }

  public int getMaxVersions() {
    return maxVersions;
  }

  public void setMaxVersions(int maxVersions) {
    this.maxVersions = maxVersions;
  }

  public CompressionType getCompressionType() {
    return compressionType;
  }

  public void setCompressionType(CompressionType compressionType) {
    this.compressionType = compressionType;
  }

  public boolean isInMemory() {
    return inMemory;
  }

  public void setInMemory(boolean inMemory) {
    this.inMemory = inMemory;
  }

  /**
   * @return Returns the blockCacheEnabled.
   */
  public boolean isBlockCacheEnabled() {
    return blockCacheEnabled;
  }

  /**
   * @param blockCacheEnabled The blockCacheEnabled to set.
   */
  public void setBlockCacheEnabled(boolean blockCacheEnabled) {
    this.blockCacheEnabled = blockCacheEnabled;
  }

  /**
   * @return Returns the timeToLive.
   */
  public int getTimeToLive() {
    return timeToLive;
  }

  /**
   * @param timeToLive The timeToLive to set.
   */
  public void setTimeToLive(int timeToLive) {
    this.timeToLive = timeToLive;
  }

  /**
   * @return Returns the bloomFilter.
   */
  public boolean isBloomFilter() {
    return bloomFilter;
  }

  /**
   * @param bloomFilter The bloomFilter to set.
   */
  public void setBloomFilter(boolean bloomFilter) {
    this.bloomFilter = bloomFilter;
  }

  public int getMaxValueLength() {
    return maxValueLength;
  }

  public void setMaxValueLength(int maxValueLength) {
    this.maxValueLength = maxValueLength;
  }

  @Override
  public String toString() {
    return "name -> " + name +
      "\n tableName -> " + tableName +
      "\n description -> " + description +
      "\n maxVersions -> " + maxVersions +
      "\n compressionType -> " + compressionType +
      "\n inMemory -> " + inMemory +
      "\n blockCacheEnabled -> " + blockCacheEnabled +
      "\n maxValueLength -> " + maxValueLength +
      "\n timeToLive -> " + timeToLive +
      "\n bloomFilter -> " + bloomFilter;
  } // toString

} // ColumnDefinition

Not much to it obviously, but hey.

Update: I fixed the code to handle added and removed columns properly. The previous version would only handle changed columns.

Tuesday, January 27, 2009

How to use HBase with Hadoop


I have been using Hadoop and HBase for a while now. The "raw" directory dump on the right may give you a rough idea. ;)

With each new release I went through the iterations of the supplied helper classes to scan an HBase table from within a Map/Reduce job. What I did not find was a description of how to use these classes. That has improved thanks to the relentless work of the volunteers and Michael Stack, who put it all together. Without him and the rest of the team HBase would be nowhere. Anyhow, here is my spin on this topic, including the things I am still not sure how to handle:

The Hadoop Tool class is the launcher application and its purpose is to read the command line parameters and then set up a JobConf instance that holds the job details, such as which classes to use to read the input, which mapper and reducer classes to use, what the key and value classes are for each of these, and so on. The command line parameters usually specify how many Map and Reduce tasks should be run on the cluster, which table to process, which columns, and so on.

Once the Tool has set up the JobConf, it runs the job on the cluster. With HBase there are special Map/Reduce classes that serve as helpers or starting points to process HBase tables. They are in the HBase distribution under the "mapred" package. Their names are TableInputFormat, TableReduce, TableMap, IdentityTableMap and so on.

After the job has started, the first thing that happens is the preparation for the Map phase, which is done in the InputFormat class. It serves as the filter that reads the raw data and passes it into the Map phase as key/value pairs. For an HBase job this is done in the supplied TableInputFormat class. What it does is split the table you are scanning into chunks that can be handed to the Map instances. By the way, you can only scan one single table at a time, but any columns you want to process out of that table.

In HBase a table is physically divided into many regions, which are in turn served by different RegionServers. The splitting maps each split to exactly one region of the table you are scanning. Because of that you may end up with a couple of thousand splits for a single table. For example, I have more than 6000 regions for one table.

The HBase team recommends matching the number of Map instances to the number of splits (aka table regions). Instead of always having to check the number first, using for example the HBase UI, I have opted to automate the computation of the number of splits to use. I simply ask the table how many "start keys" it knows of. This should equal the number of regions, as each region has a new starting key:
/**
 * Computes the number of regions per table.
 *
 * @return The total number of regions per table.
 * @throws IOException When the table is not created.
 */
private int getNumberOfRegions(String tableName) throws IOException {
  // sanity check
  if (tableName == null) return -1;
  HTable table = new HTable(hbaseConfig, tableName);
  byte[][] startKeys = table.getStartKeys();
  return startKeys.length;
}

Each split is one region, and one region holds a start and end key to be processed. As each split is read by the Map instance, an HBase Scanner is created to scan the rows of the split's keys.

Each of these rows is handed to the TableMap class, or rather a class that implements this interface. You can use the supplied IdentityTableMap since often you are simply passing on the rows to the Map step. As per the Map/Reduce process, the rows then get sorted during the next phase and eventually passed on to the TableReduce class.

So what you have now is the row and all columns that were listed, usually as a parameter to the Tool class implementation. For example, if you specified "contents:,mimetype:" then those two column families are handed to you - and in this special case with all labels! If you had specified "contents:en,mimetype:en" then you would have gotten exactly those two columns with that particular label. So leaving out the label defaults to wildcarding all labels (because the HBase Scanner class implements it that way).

In the Reduce phase you perform what work you have to and then pass on the result to the TableOutputFormat class. Here you write out what you need and you are done.

During the process you can call upon counters to count anything you like. That is what I do to count how many documents I have in total and how many for each respective target language etc. At the end of the job run I read the counters and store the values back into HBase or MemCacheDB as meta data.

Now the question you may have is, where do I do what work? In the Map or Reduce phase? And for the latter, why do I need a formatter?

I had the same questions. My understanding is that HBase is a special Hadoop Map/Reduce case. This is because keys in HBase are unique, so mapping first and then sorting the keys so that they can be reduced is not necessary. In fact I have one job for which I only use a Map phase, like so:
job.setNumReduceTasks(0);
job.setInputFormat(MyTextInputFormat.class);
job.setMapperClass(MyMapper.class);
job.setOutputFormat(NullOutputFormat.class);

So it is a decision you have to make on your own: do I need a particular phase or not? In the above I am not scanning HBase tables but rather reading a text file stored in Hadoop's DFS, where each line is an update instruction for an existing document stored in HBase. There is no need to do the sorting or reducing.
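
For completeness, getting such an instruction file into the DFS in the first place is a one-liner; the file name and target path below are made up for illustration:

hadoop fs -put updates.txt /user/lars/updates.txt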

As far as HBase scans are concerned, you may want to keep the Input/Split/Map/Sort/Reduce/Output phases; that is also why there are those base classes the excellent HBase team supplied matching that concept. Usually the IdentityTableMap class is used to pass on the rows and columns, and all the work is done in the Reduce phase.

That leaves one thing: why have both a TableReduce and a TableOutputFormat class? The reason is that in the Reduce you decide what needs to be "saved" - but not how. You can therefore run two very similar jobs that only differ in how they save the data, simply by replacing the output format class.

Again, I have cases where I do not output but save back to HBase. I could easily write the records back into HBase in the reduce step, so why pass them on first? I think this is in some cases just common sense or being a "good citizen". I still have code where I am torn as to where to process the final output. Sometimes I lean this way, sometimes the other.

Other notes:
1) With HBase 0.19.0 there is now a central helper class called TableMapReduceUtil which helps set up jobs like so:
job.setMapperClass(MyMap.class);
TableMapReduceUtil.initTableReduceJob(TABLE_NAME, MyReduce.class, job);
...

It helps you set up all the required details for Map and/or Reduce jobs based on the supplied helper classes.

2) Writing back to the same HBase table is OK when doing it in the Reduce phase, as all scanning has concluded in the Map phase beforehand. All rows and columns are saved to an intermediate Hadoop SequenceFile internally, so by the time you process these and write back to HBase there is no scanner for the same job still open reading the table.

Otherwise it is OK to write to a different HBase table even during the Map phase.

Hope that helps a little.

Monday, January 19, 2009

VServer is not Xen, Part 2

Another oddity about VServer is that it does not have a true init process. Or rather, the whole startup is not what you are used to from other Linux systems.

While you can read about the different Init Styles, there is one crucial issue: the startup scripts, usually located in /etc/rc.<n>, are executed either outside of the VM or inside it, so that you can either "see" the VM starting up from the master or not, respectively. While this is OK and usable for most applications, it has a major problem: you cannot run DJB's daemontools.

This is because while the above startup styles execute the init scripts, they do not execute anything else from the inittab configuration file. Most importantly, the last line in the following excerpt from /etc/inittab:
...
# Example how to put a getty on a serial line (for a terminal)
#
#T0:23:respawn:/sbin/getty -L ttyS0 9600 vt100
#T1:23:respawn:/sbin/getty -L ttyS1 9600 vt100

# Example how to put a getty on a modem line.
#
#T3:23:respawn:/sbin/mgetty -x0 -s 57600 ttyS3

SV:123456:respawn:/command/svscanboot

The last line is what starts the root daemontools process, which in turn starts all services it maintains. In VServer it simply will not start.

The issue for me started a lot earlier; I should have seen this coming, really. When I tried the initial setup I went down the usual route (at least for me) of getting the daemontools-installer Debian package and building the binaries. I did this in the VM, obviously, because that is where I wanted to install daemontools. Here is what happened:
$ build-daemontools       

This script unpacks the daemontools source into a directory, and
compiles it to produce a binary daemontools*.deb file.
...
Press ENTER to continue...
Attempting to apply patches located in
/usr/src/daemontools-installer/patches...
/usr/src/daemontools-installer/patches/errno.patch
patching file src/error.h
/usr/src/daemontools-installer/patches/fileutils.patch
patching file src/rts.tests
dh_testdir
package/compile
Linking ./src/* into ./compile...
Compiling everything in ./compile...
make[1]: Entering directory `/tmp/daemontools/admin/daemontools-0.76/compile'
sh find-systype.sh > systype
rm -f compile
sh print-cc.sh > compile
chmod 555 compile
./compile byte_chr.c
./compile byte_copy.c
./compile byte_cr.c
./compile byte_diff.c
...
make[1]: Leaving directory `/tmp/daemontools/admin/daemontools-0.76/compile'
Copying commands into ./command...
touch build-stamp
dh_testdir
dh_testroot
dh_clean -k
dh_clean: Compatibility levels before 4 are deprecated.
dh_installdirs
dh_installdirs: Compatibility levels before 4 are deprecated.
mkdir -p debian/daemontools/package/admin/daemontools-0.76
mkdir -p debian/daemontools/command
mkdir -p debian/daemontools/usr/share/daemontools
mkdir -p debian/daemontools/service
cp -a command debian/daemontools/package/admin/daemontools-0.76
cp -a compile debian/daemontools/package/admin/daemontools-0.76
cp -a package debian/daemontools/package/admin/daemontools-0.76
cp -a src debian/daemontools/package/admin/daemontools-0.76
dh_link package/admin/daemontools-0.76/package usr/share/daemontools/package
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76 package/admin/daemontools
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76/command/envdir command/envdir
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76/command/envuidgid command/envuidgid
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76/command/fghack command/fghack
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76/command/multilog command/multilog
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76/command/pgrphack command/pgrphack
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76/command/readproctitle
command/readproctitle
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76/command/setlock command/setlock
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76/command/setuidgid command/setuidgid
dh_link: Compatibility levels before 4 are deprecated.
dh_link package/admin/daemontools-0.76/command/softlimit command/softlimit
...
dh_gencontrol
dh_gencontrol: Compatibility levels before 4 are deprecated.
dh_md5sums
dh_md5sums: Compatibility levels before 4 are deprecated.
dpkg-deb -b debian/daemontools ..
dpkg-deb: building package `daemontools' in `../daemontools_0.76-9_i386.deb'.

It seems that all went ok

Do you want to remove all files in /tmp/daemontools,
except daemontools_0.76-9_i386.deb now? [Yn]
Removing files... done

Do you want to install daemontools_0.76-9_i386.deb now? [Yn] n

Do you want to purge daemontools-installer now? [yN]

Good luck!

So the compile succeeded, but the subsequent packaging failed with "dh_link: Compatibility levels before 4 are deprecated." errors. By the looks of it the build script was not made to handle these kinds of errors, because at the end I was told that all seems OK - which is of course not the case; the package is empty.

Well, I managed to build it somewhere else and install the binaries that way into the Virtual Machine. But then I noticed the issue above, in other words the services would not run because the root process was not started.

After searching around on the web I found - of course - a post outlining the same issue. As usual you go through the same steps and pain just to find out that someone else found the same problem and already fixed it.

The solution is to start the root daemontools process just like any other process. The post has a script that I include below (in case it gets lost in the Intertubes):
$ cat /etc/init.d/svscanboot 

#! /bin/sh
#
# daemontools for launching /etc/svscanboot from sysvinit instead of /sbin/init.
#
# author: dkg

set -e

PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
DESC="daemontools"
NAME=svscanboot
DAEMON=/command/svscanboot

PIDFILE=/var/run/$NAME.pid
SCRIPTNAME=/etc/init.d/$NAME

# Gracefully exit if the package has been removed.
test -x $DAEMON || exit 0

#
# Function that starts the daemon/service.
#
d_start() {
  start-stop-daemon --start --background --make-pidfile --quiet --pidfile $PIDFILE \
    --exec $DAEMON
}

#
# Function that stops the daemon/service.
#
d_stop() {
  start-stop-daemon --stop --quiet --pidfile $PIDFILE \
    --name $NAME
  echo "not cleaning up svscan and readproctitle subprocesses
appropriately. dkg is lazy."
}

#
# Function that sends a SIGHUP to the daemon/service.
#
d_reload() {
  start-stop-daemon --stop --quiet --pidfile $PIDFILE \
    --name $NAME --signal 1
}

case "$1" in
  start)
    echo -n "Starting $DESC: $NAME"
    d_start
    echo "."
    ;;
  stop)
    echo -n "Stopping $DESC: $NAME"
    d_stop
    echo "."
    ;;
  restart|force-reload)
    #
    # If the "reload" option is implemented, move the "force-reload"
    # option to the "reload" entry above. If not, "force-reload" is
    # just the same as "restart".
    #
    echo -n "Restarting $DESC: $NAME"
    d_stop
    sleep 1
    d_start
    echo "."
    ;;
  *)
    # echo "Usage: $SCRIPTNAME {start|stop|restart|reload|force-reload}" >&2
    echo "Usage: $SCRIPTNAME {start|stop|restart|force-reload}" >&2
    exit 1
    ;;
esac

exit 0
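
To have the script start at boot time it needs to be executable and hooked into the runlevels; on Debian that would be something along these lines:

chmod +x /etc/init.d/svscanboot
update-rc.d svscanboot defaults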

Now, other posts mention that there is also a "fakeinit" style - but it did not work for me, and I rather believe that it is the old name for the "plain" style mentioned in the Init Styles document I linked above.

Goes to show that a lot is unclear about VServer. But that is often the case with open-source tools and systems. It is up to us, the IT people, to help out and close those gaps while contributing to the community.

Saturday, January 10, 2009

VServer is not Xen

Recently I had to work on a Virtual Machine (VM) running on VServer. In the past I used Xen to create virtual machines, but due to the nature of the task VServer seemed more appropriate. I only have to run two Debian Etch VMs on a Debian Etch host. Because of the much narrower interface to the Operating System (OS), VServer guests run without much of the overhead - and therefore faster as well.

There are a few things that are quite nice about the lesser abstraction of VServer compared to Xen. For example, copying a Virtual Machine is much simpler, I found, and files can be copied into place from the master because the file systems of the VMs are simply directories of the master file system.

One thing I noticed, though, is that it is much more difficult to run certain daemons in the VMs and/or the master at the same time. The separation in Xen completely splits master and VM at the kernel level, so running the same daemon on the same port is a natural fit. Nothing to be done. Not so with VServer.

I tried to run SSH, NTP and SNMP on the master and the two VMs I was setting up. The first issue I ran into was SSH. The SSH daemon on the master listens on all network addresses, configured as such:
ListenAddress 0.0.0.0

When you now try to start the SSH daemon on the VM's you get an error that the address is already in use - by the master of course! Master and Virtual Machines share the network layer and this is now causing a problem.

The issue in itself is solved by setting the listening address to a specific one, namely the address of the master:
ListenAddress 192.168.1.100

Then it binds to the default socket only on that interface and the VM's are free to bind their daemons to their IP.
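
A quick way to verify which address the SSH daemon is actually bound to on the master (assuming the default port 22):

netstat -tlnp | grep ':22 '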

The second issue I ran into was NTP. I tried to run it the same way as the SSH daemon, but since the listening address is not something you can specify in /etc/ntp.conf, the NTP daemon binds to all interfaces and we get the same error on the VMs as mentioned above.

I found it best to remove NTP completely from the VMs and only run it on the master. After a few weeks of observation it seems that the time is "passed on" to the VMs, in other words their time stays in sync. This makes sense considering the thin layer VServer uses to run the Virtual Machines. They simply use the same internal clock, and if the master is in sync then so are the VMs.

Friday, January 9, 2009

Odd "ps" output

While trying to figure out when I started a particular process, I noticed that the normal "ps aux" or "ps -eF" does not show the actual start date, but - depending on how long the task has been running - only a time, a month/day, or just the year. For example:
[02:09:36 root@lv1-cpq-bl-17 bin]# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 1944 656 ? Ss 2008 0:01 init [2]
root 2 0.0 0.0 0 0 ? S 2008 0:00 [migration/0]
root 3 0.0 0.0 0 0 ? SN 2008 0:00 [ksoftirqd/0]
root 4 0.0 0.0 0 0 ? S 2008 0:00 [events/0]
root 5 0.0 0.0 0 0 ? S 2008 0:00 [khelper]
...
root 2851 0.0 3.9 1243532 40740 ? Sl Jan07 1:14 /usr/lib/jvm/java-1.5.0-sun/jre/bin/java
root 3521 0.1 4.4 1250100 45828 ? Sl Jan07 2:22 /usr/lib/jvm/java-1.5.0-sun/jre/bin/java
root 3629 0.0 4.1 1237900 42880 ? Sl Jan07 0:28 /usr/lib/jvm/java-1.5.0-sun/jre/bin/java
root 3799 0.1 5.9 1268268 61260 ? Sl Jan07 3:17 /usr/lib/jvm/java-1.5.0-sun/jre/bin/java
root 12274 0.0 0.0 3432 880 pts/4 R+ 03:25 0:00 ps aux

So this varies from the time today when it was started, to a month/day combination all the way to just a year, because the process was started last year.

But when exactly?

Digging into the "man ps" details and using the "trial and error" approach, I found out that a custom layout allows me to get what I needed:
[root]# ps -e -o user,pid,pcpu,start,stime,time,vsz,rssize,ni,args
USER PID %CPU STARTED STIME TIME VSZ RSS NI COMMAND
root 1 0.0 Jul 01 2008 00:00:01 1944 656 0 init [2]
root 2 0.0 Jul 01 2008 00:00:00 0 0 - [migration/0]
root 3 0.0 Jul 01 2008 00:00:00 0 0 19 [ksoftirqd/0]
root 4 0.0 Jul 01 2008 00:00:00 0 0 -5 [events/0]
root 5 0.0 Jul 01 2008 00:00:00 0 0 -5 [khelper]
...
root 2851 0.0 Jan 07 Jan07 00:01:14 1243532 40740 0 /usr/lib/jvm/java-1.5.0-sun/jre/bin/java
root 3521 0.1 Jan 07 Jan07 00:02:22 1250100 45828 0 /usr/lib/jvm/java-1.5.0-sun/jre/bin/java
root 3629 0.0 Jan 07 Jan07 00:00:28 1237900 42880 0 /usr/lib/jvm/java-1.5.0-sun/jre/bin/java
root 3799 0.1 Jan 07 Jan07 00:03:17 1268268 61260 0 /usr/lib/jvm/java-1.5.0-sun/jre/bin/java
root 12275 0.0 03:25:38 03:25 00:00:00 3432 880 0 ps -e -o user,pid,pcpu,start,
stime,time,vsz,rssize,ni,args

The "start" format option resulting in the "STARTED" column above and is showing what I needed. Last thing would be I guess to set the "PS_FORMAT" environment variable if I needed this permanently.