
Thursday, April 09, 2015

Running the CloudStack Simulator in Docker

CloudStack comes with a simulator. It is very handy for testing purposes; we use it to run our smoke tests on TravisCI for each commit to the code base. However, if you want to run the simulator, you need to compile from source using some special maven profiles. That requires you to check out the code and set up your working environment with the dependencies for a successful CloudStack build.

With Docker you can skip all of that and simply download the cloudstack/simulator image from the Docker Hub. Start a container from that image and expose port 8080 where the dashboard is being served. Once the container is running, you can use docker exec to configure a simulated data center. This will allow you to start fake virtual machines, create security groups and so on. You can do all of this through the dashboard or using the CloudStack API.

So you want to give CloudStack a try ? Use Docker :)

$ docker pull cloudstack/simulator

The image is a bit big and we need to work on slimming it down, but once the image is pulled, starting the container will be almost instant. If you feel like sending a little PR, just look at the Dockerfile; there might be a few obvious ways to slim down the image.

$ docker run -d -p 8080:8080 --name cloudstack cloudstack/simulator

The application needs a few minutes to start, however, something that I have not had time to investigate; we probably need to give the container more memory. Once you can access the dashboard at http://localhost:8080/client you can configure the simulated data center. You can choose between a basic zone, which gives you L3 network isolation, or an advanced zone, which gives you VLAN-based isolation:

$ docker exec -ti cloudstack python /root/tools/marvin/marvin/deployDataCenter.py -i /root/setup/dev/basic.cfg

Once the configuration completes, head over to the dashboard at http://localhost:8080/client and check your simulated infrastructure.
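
If you would rather poke at the simulated cloud programmatically, here is a minimal Python 3 sketch of a signed listZones call. It assumes you have generated an API key pair for the admin account in the dashboard; the key values below are placeholders.

#!/usr/bin/env python
# Minimal sketch (Python 3): a signed listZones call against the simulator.
# Generate an API key pair for the admin account in the dashboard first;
# the values below are placeholders.
import base64
import hashlib
import hmac
import urllib.parse
import urllib.request

endpoint = "http://localhost:8080/client/api"
api_key = "YOUR_API_KEY"
secret_key = "YOUR_SECRET_KEY"

params = {"command": "listZones", "response": "json", "apikey": api_key}

# CloudStack request signing: sort the parameters, lowercase the query
# string, HMAC-SHA1 it with the secret key and base64 encode the digest.
query = "&".join("%s=%s" % (k, urllib.parse.quote(params[k], safe=""))
                 for k in sorted(params))
digest = hmac.new(secret_key.encode(), query.lower().encode(), hashlib.sha1).digest()
signature = urllib.parse.quote(base64.b64encode(digest), safe="")

url = "%s?%s&signature=%s" % (endpoint, query, signature)
print(urllib.request.urlopen(url).read().decode())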

Enjoy the CloudStack simulator brought to you by Docker.

Thursday, October 02, 2014

CloudStack simulator on Docker

Docker is a lot of fun; one of its strengths is the portability of applications. This gave me the idea to package the CloudStack management server as a Docker image.

CloudStack has a simulator that can fake a data center infrastructure. It can be used to test some of the basic functionalities. We use it to run our integration tests, like the smoke tests on TravisCI. The simulator allows us to configure an advanced or basic networking zone with fake hypervisors.

So I bootstrapped the CloudStack management server, configured the MySQL database with an advanced zone and created a Docker image with Packer. The resulting image is on Docker Hub, and I realized after the fact that four other great minds had already done something similar :)

On a machine running docker:

docker pull runseb/cloudstack
docker run -t -i -p 8080:8080 runseb/cloudstack:0.1.4 /bin/bash
# service mysql restart
# cd /opt/cloudstack
# mvn -pl client jetty:run -Dsimulator

Then open your browser on http://<IP_of_docker_host>:8080/client and enjoy!

Tuesday, September 30, 2014

On Docker and Kubernetes on CloudStack

Docker has pushed containers to a new level, making it extremely easy to package and deploy applications within containers. Containers are not new; Solaris containers and OpenVZ are among several container technologies going back to 2005. But Docker has caught on quickly, as mentioned by @adrianco. The startup speed is not surprising for containers, and the portability is reminiscent of the Java goal to "write once run anywhere". What is truly interesting with Docker is the availability of Docker registries (e.g. Docker Hub) to share containers and the potential to change application deployment workflows.

Rightly so, we should soon see IT move to a Docker-based application deployment, where developers package their applications and make them available to Ops, very much like we have been using war files. Embracing a DevOps mindset/culture should be easier with Docker. Where it becomes truly interesting is when we start thinking about an infrastructure whose sole purpose is to run containers. We can envision a bare operating system with a single goal: to manage Docker-based services. This should make the sys admin's life easier.

The role of the Cloud with Docker

While the buzz around Docker has been truly amazing and a community has grown overnight, some may think that this signals the end of the cloud. I think it is far from the truth, as Docker may indeed become the killer app of the cloud.

An IaaS layer is what it is: an infrastructure orchestration layer, while Docker and its ecosystem will become the application orchestration layer.

The question then becomes: how do I run Docker in the cloud? And there is a straightforward answer: just install Docker in your cloud templates. Whether on AWS or GCE or Azure or your private cloud, you can prepare Linux-based templates that provide Docker support. If you are aiming for the bare operating system whose sole purpose is to run Docker, then the new CoreOS Linux distribution might be your best pick. CoreOS provides rolling upgrades of the kernel, systemd based services, a distributed key value store (i.e. etcd) and a distributed service scheduling system (i.e. fleet).

exoscale, an Apache CloudStack based public cloud, recently announced the availability of CoreOS templates.

Like exoscale, any cloud provider, be it public or private, can make CoreOS templates available, providing Docker within the cloud instantly.

Docker application orchestration, here comes Kubernetes

Running one container is easy, but running multiple coordinated containers across a cluster of machines is not yet a solved problem. If you think of an application as a set of containers, starting these on multiple hosts, replicating some of them, accessing distributed databases, providing proxy services and fault tolerance is the true challenge.

However, Google came to the rescue and announced Kubernetes, a system that solves these issues and makes managing scalable, fault-tolerant container based apps doable :)

Kubernetes is of course available on Google's public cloud GCE, but also on Rackspace, Digital Ocean and Azure. It can also be deployed on CoreOS easily.

Kubernetes on CloudStack

Kubernetes is under heavy development; you can test it with Vagrant. Under the hood, aside from the Go code :), most of the deployment solutions use SaltStack recipes, but if you are a Chef, Puppet or Ansible user I am sure we will see recipes for those configuration management solutions soon.

But you surely got the same idea that I had :) Since Kubernetes can be deployed on CoreOS and CoreOS is available on exoscale, we are just a breath away from running Kubernetes on CloudStack.

It took a little more than a breath, but you can clone kubernetes-exoscale and you will be up and running in 10 minutes, with a 3 node etcd cluster and a 5 node Kubernetes cluster running a Flannel overlay.

CloudStack supports EC2-like user data, and the CoreOS templates on exoscale support cloud-init, hence passing some cloud-config files to the instance deployment was straightforward. I used libcloud to provision all the nodes, created the proper security groups to let the Kubernetes nodes access the etcd cluster and let the Kubernetes nodes talk to each other, and in particular opened a UDP port to build a networking overlay with Flannel. Fleet is used to launch all the Kubernetes services. Try it out.
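
To give an idea of what the provisioning code looks like, here is a minimal libcloud sketch, not the exact code from kubernetes-exoscale: the offering name, template match, keypair, security group and node.yml file below are placeholder assumptions.

#!/usr/bin/env python
# Rough sketch: deploy one CoreOS node on exoscale (a CloudStack API) with
# libcloud, passing a cloud-config file as user data. The offering name,
# template match, keypair, security group and node.yml file are placeholders.
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

cls = get_driver(Provider.CLOUDSTACK)
driver = cls(key="YOUR_API_KEY", secret="YOUR_SECRET_KEY",
             host="api.exoscale.ch", path="/compute", secure=True)

size = [s for s in driver.list_sizes() if s.name == "Medium"][0]
image = [i for i in driver.list_images() if "CoreOS" in i.name][0]

# cloud-config with the etcd, fleet and flannel units
with open("node.yml") as f:
    userdata = f.read()

node = driver.create_node(name="k8s-node-1", size=size, image=image,
                          ex_keyname="exoscale",              # existing SSH keypair
                          ex_security_groups=["kubernetes"],  # pre-created group
                          ex_userdata=userdata)
print("%s %s" % (node.name, node.public_ips))

The real deployment also creates the security groups and opens the UDP port needed by Flannel before the nodes are launched.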

Conclusions.

Docker is extremely easy to use, and taking advantage of a cloud you can get started quickly. CoreOS will put your Docker work on steroids with the ability to start apps as systemd services over a distributed cluster. Kubernetes will up that by giving you replication of your containers and proxy services for free (time).

We might see pure Docker based public clouds (e.g. think of a Mesos cluster with a Kubernetes framework). These will look much more like PaaS, especially if they integrate a Docker registry and a way to automatically build Docker images (e.g. think of a continuous deployment pipeline).

But a "true" IaaS is actually very complimentary, providing multi-tenancy, higher security as well as multiple OS templates. So treating docker as a standard cloud workload is not a bad idea. With three dominant public clouds in AWS, GCE and Azure and a multitude of "regional" ones like exoscale it appears that building a virtualization based cloud is a solved problem (at least with Apache CloudStack :)).

So instead of talking about cloudifying your application, maybe you should start thinking about Dockerizing your applications and letting them loose on CloudStack.

Friday, July 11, 2014

GCE Interface to CloudStack

Gstack, a GCE compatible interface to CloudStack

Google Compute Engine (GCE) is the Google public cloud. In December 2013, Google announced the General Availability (GA) of GCE. With AWS and Microsoft Azure, it is one of the three leading public clouds in the market. Apache CloudStack now has a brand new GCE compatible interface (Gstack) that lets users use the GCE clients (i.e. gcloud and gcutil) to access their CloudStack cloud. This has been made possible through the Google Summer of Code program.

Last summer Ian Duffy, a student from Dublin City University, participated in GSoC through the Apache Software Foundation (ASF) and worked on an LDAP plugin to CloudStack. He did such a great job that he finished early and was made an Apache CloudStack committer. Since he was done with his original GSoC project I encouraged him to take on a new one :), and he brought in a friend for the ride: Darren Brogan. Both of them worked for fun on the GCE interface to CloudStack and learned Python doing so.

They remained engaged with the CloudStack community, and as a third year project they worked on an Amazon EC2 interface to CloudStack using what they learned from the GCE interface. They got an A :). Since they loved it so much, Darren applied to the GSoC program again and proposed to go back to Gstack, improve it, extend the unit tests and make it compatible with the GCE v1 API.

Technically, Gstack is a Python Flask application that provides a REST API compatible with the GCE API and forwards the requests to the corresponding CloudStack API calls. The source is available on GitHub and the binary is downloadable via PyPI. Let's see how to use it.

Installation and Configuration of Gstack

You can grab the Gstack binary package from PyPI using pip in a single command.

pip install gstack

Or, if you plan to explore the source and work on it, you can clone the repository and install it by hand. Pull requests are of course welcome.

git clone https://github.com/NOPping/gstack.git
sudo python ./setup.py install

Both of these installation methods will install a gstack and a gstack-configure binary in your path. Before running Gstack you must configure it. To do so run:

gstack-configure

And enter your configuration information when prompted. You will need to specify the host and port where you want gstack to run on, as well as the CloudStack endpoint that you want gstack to forward the requests to. In the example below we use the exoscale cloud:

$ gstack-configure
gstack bind address [0.0.0.0]: localhost
gstack bind port [5000]: 
Cloudstack host [localhost]: api.exoscale.ch
Cloudstack port [8080]: 443
Cloudstack protocol [http]: https
Cloudstack path [/client/api]: /compute

The information will be stored in a configuration file available at ~/.gstack/gstack.conf:

$ cat ~/.gstack/gstack.conf 
PATH = 'compute/v1/projects/'
GSTACK_BIND_ADDRESS = 'localhost'
GSTACK_PORT = '5000'
CLOUDSTACK_HOST = 'api.exoscale.ch'
CLOUDSTACK_PORT = '443'
CLOUDSTACK_PROTOCOL = 'https'
CLOUDSTACK_PATH = '/compute'

You are now ready to start Gstack in the foreground with:

gstack

That's all there is to running Gstack. To be able to use it as if you were talking to GCE however, you need to use gcutil and configure it a bit.

Using gcutil with Gstack

The current version of Gstack only works with the stand-alone version of gcutil. Do not use the version of gcutil bundled in the Google Cloud SDK; instead install the 0.14.2 version of gcutil. Gstack comes with a self-signed certificate for the local endpoint, gstack/data/server.crt; copy the certificate to the gcutil certificates file gcutil/lib/httplib2/httplib2/cacerts.txt. A bit dirty I know, but that's a work in progress.

At this stage your CloudStack apikey and secretkey need to be entered in the gcutil auth_helper.py file at gcutil/lib/google_compute_engine/gcutil/auth_helper.py.

Again not ideal, but hopefully gcutil or the Cloud SDK will soon be able to configure those without touching the source. Darren and Ian opened a feature request with Google to pass the client_id and client_secret as options to gcutil; hopefully future releases of gcutil will allow us to do so.

Now, create a cached parameters file for gcutil, assuming you are running gstack on your local machine with the defaults suggested during the configuration phase. Modify ~/.gcutil_params with the following:

--auth_local_webserver
--auth_host_port=9999
--dump_request_response
--authorization_uri_base=https://localhost:5000/oauth2
--ssh_user=root
--fetch_discovery
--auth_host_name=localhost
--api_host=https://localhost:5000/

Warning: Make sure to set the --auth_host_name variable to the same value as GSTACK_BIND_ADDRESS in your ~/.gstack/gstack.conf file. Otherwise you will see certificate errors.

With this setup complete, gcutil will issue requests to the local Flask application, get an OAuth token, issue requests to your CloudStack endpoint and return the responses in a GCE compatible format.

Example with exoscale.

You can now start issuing standard gcutil commands. For illustration purposes we use exoscale. Since there are several semantic differences, you will notice that we use the account information from CloudStack as the project; hence we pass our email address as the project value.

Let's start by listing the availability zones:

$ gcutil --cached_flags_file=~/.gcutil_params --project=runseb@gmail.com listzones
+----------+--------+------------------+
| name     | status | next-maintenance |
+----------+--------+------------------+
| ch-gva-2 | UP     | None scheduled   |
+----------+--------+------------------+

Let's list the machine types (in CloudStack terminology, the compute service offerings) and the available images.

$ ./gcutil --cached_flags_file=~/.gcutil_params --project=runseb@gmail.com listimages
$ gcutil --cached_flags_file=~/.gcutil_params --project=runseb@gmail.com listmachinetypes
+-------------+----------+------+-----------+-------------+
| name        | zone     | cpus | memory-mb | deprecation |
+-------------+----------+------+-----------+-------------+
| Micro       | ch-gva-2 |    1 |       512 |             |
+-------------+----------+------+-----------+-------------+
| Tiny        | ch-gva-2 |    1 |      1024 |             |
+-------------+----------+------+-----------+-------------+
| Small       | ch-gva-2 |    2 |      2048 |             |
+-------------+----------+------+-----------+-------------+
| Medium      | ch-gva-2 |    2 |      4096 |             |
+-------------+----------+------+-----------+-------------+
| Large       | ch-gva-2 |    4 |      8182 |             |
+-------------+----------+------+-----------+-------------+
| Extra-large | ch-gva-2 |    4 |     16384 |             |
+-------------+----------+------+-----------+-------------+
| Huge        | ch-gva-2 |    8 |     32184 |             |
+-------------+----------+------+-----------+-------------+

You can also list firewalls, which gstack maps to CloudStack security groups. To create a security group, use the firewall commands:

$ ./gcutil --cached_flags_file=~/.gcutil_params --project=runseb@gmail.com addfirewall ssh --allowed=tcp:22
$ ./gcutil --cached_flags_file=~/.gcutil_params --project=runseb@gmail.com getfirewall ssh

To start an instance you can follow the interactive prompt given by gcutil. You will need to pass the --permit_root_ssh flag, another one of those semantic and access configuration differences that need to be ironed out. The interactive prompt will let you choose the machine type and the image that you want; it will then start the instance:

$ ./gcutil --cached_flags_file=~/.gcutil_params --project=runseb@gmail.com addinstance foobar
Selecting the only available zone: CH-GV2
1: Extra-large  Extra-large 16384mb 4cpu
2: Huge Huge 32184mb 8cpu
3: Large    Large 8192mb 4cpu
4: Medium   Medium 4096mb 2cpu
5: Micro    Micro 512mb 1cpu
6: Small    Small 2048mb 2cpu
7: Tiny Tiny 1024mb 1cpu
7
1: CentOS 5.5(64-bit) no GUI (KVM)
2: Linux CentOS 6.4 64-bit
3: Linux CentOS 6.4 64-bit
4: Linux CentOS 6.4 64-bit
5: Linux CentOS 6.4 64-bit
6: Linux CentOS 6.4 64-bit
<...snip>
INFO: Waiting for insert of instance . Sleeping for 3s.
INFO: Waiting for insert of instance . Sleeping for 3s.

Table of resources:

+--------+--------------+--------------+----------+---------+
| name   | network-ip   | external-ip  | zone     | status  |
+--------+--------------+--------------+----------+---------+
| foobar | 185.1.2.3    | 185.1.2.3    | ch-gva-2 | RUNNING |
+--------+--------------+--------------+----------+---------+

Table of operations:

+--------------+--------+--------------------------+----------------+
| name         | status | insert-time              | operation-type |
+--------------+--------+--------------------------+----------------+
| e4180d83-31d0| DONE   | 2014-06-09T10:31:35+0200 | insert         |
+--------------+--------+--------------------------+----------------+

You can of course list instances (with listinstances) and delete them:

$ ./gcutil --cached_flags_file=~/.gcutil_params --project=runseb@gmail.com deleteinstance foobar
Delete instance foobar? [y/n]
y 
WARNING: Consider passing '--zone=CH-GV2' to avoid the unnecessary zone lookup which requires extra API calls.
INFO: Waiting for delete of instance . Sleeping for 3s.
+--------------+--------+--------------------------+----------------+
| name         | status | insert-time              | operation-type |
+--------------+--------+--------------------------+----------------+
| d421168c-4acd| DONE   | 2014-06-09T10:34:53+0200 | delete         |
+--------------+--------+--------------------------+----------------+

Gstack is still a work in progress, but it is now compatible with the GCE GA v1.0 API. The few differences in API semantics need to be investigated further, and additional API calls need to be supported. However, it provides a solid base to start working on hybrid solutions between the GCE public cloud and a CloudStack based private cloud.

GSoC has been terrific for Ian and Darren; they both learned how to work with an open source community and ultimately became part of it through their work. They learned tools like JIRA, git and Review Board, and became less shy about working publicly on mailing lists. Their work on Gstack and EC2stack is certainly of high value to CloudStack and should become the base for interesting products that will use hybrid clouds.

Tuesday, June 03, 2014

Eutester with CloudStack

Eutester

An interesting tool based on Boto is Eutester. It was created by the folks at Eucalyptus to provide a framework to create functional tests for AWS zones and Eucalyptus based clouds. What is interesting with Eutester is that it can be used to compare the AWS compatibility of multiple clouds. Therefore the interesting question you are going to ask is: can we use Eutester with CloudStack? And the answer is yes. Certainly it could use more work, but the basic functionality is there.

Install eutester with:

pip install eutester

Then start a Python/IPython interactive shell or write a script that will import ec2ops and create a connection object to your AWS EC2 compatible endpoint. For example, using ec2stack:

    #!/usr/bin/env python

    from eucaops import ec2ops
    from IPython.terminal.embed import InteractiveShellEmbed

    accesskey="my api key"
    secretkey="my secret key"

    conn = ec2ops.EC2ops(endpoint="localhost",
                         aws_access_key_id=accesskey,
                         aws_secret_access_key=secretkey,
                         is_secure=False,
                         port=5000,
                         path="/",
                         APIVersion="2014-02-01")

    ipshell = InteractiveShellEmbed(banner1="Hello Cloud Shell !!")
    ipshell()

Eutester at the time of this writing has 145 methods. Only the methods available through the CloudStack AWS EC2 interface will be available. For example, get_zones and get_instances would return:

In [3]: conn.get_zones()
Out[3]: [u'ch-gva-2']

In [4]: conn.get_instances()
[2014-05-21 05:39:45,094] [EUTESTER] [DEBUG]: 
--->(ec2ops.py:3164)Starting method: get_instances(self, state=None, 
     idstring=None, reservation=None, rootdevtype=None, zone=None,
     key=None, pubip=None, privip=None, ramdisk=None, kernel=None,
     image_id=None, filters=None)
Out[4]: 
[Instance:5a426582-3aa3-49e0-be3f-d2f9f1591f1f,
 Instance:95ee8534-b171-4f79-9e23-be48bf1a5af6,
 Instance:f18275f1-222b-455d-b352-3e7b2d3ffe9d,
 Instance:0ea66049-9399-4763-8d2f-b96e9228e413,
 Instance:7b2f63d6-66ce-4e1b-a481-e5f347f7e559,
 Instance:46d01dfd-dc81-4459-a4a8-885f05a87d07,
 Instance:7158726e-e76c-4cd4-8207-1ed50cc4d77a,
 Instance:14a0ce40-0ec7-4cf0-b908-0434271369f6]

This example shows that I am running eight instances at the moment in a zone called ch-gva-2, our familiar exoscale. Selecting one of these instance objects will give you access to all the methods available for instances. You can also list, delete and create keypairs, as well as security groups, and so on.

Eutester is meant for building integration tests and easily creating test scenarios. If you are looking for a client to build an application with, use Boto.
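
For reference, here is a minimal sketch of what a plain Boto connection to the same ec2stack endpoint used above could look like; the keys are placeholders for your CloudStack API key pair.

#!/usr/bin/env python
# Minimal sketch: plain Boto pointed at the same ec2stack endpoint used in
# the eutester example above. The keys are placeholders for your CloudStack
# API key pair.
import boto
from boto.ec2.regioninfo import RegionInfo

conn = boto.connect_ec2(aws_access_key_id="my api key",
                        aws_secret_access_key="my secret key",
                        is_secure=False,
                        region=RegionInfo(name="cloudstack", endpoint="localhost"),
                        port=5000,
                        path="/",
                        api_version="2014-02-01")

print(conn.get_all_zones())
print(conn.get_all_instances())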

The master branch of Eutester may still cause problems when listing images from a CloudStack cloud. I recently patched a fork of the testing branch and opened an issue on their GitHub page. You might want to check its status if you want to use Eutester heavily.

Tuesday, March 04, 2014

Why CloudStack is not a Citrix project

I was at CloudExpo Europe in London last week for the Open Cloud Forum, to give a tutorial on CloudStack tools. A decent crowd showed up, all carrying phones, which was kind of problematic for a tutorial where I wanted the audience to install Python packages and actually work :) Luckily I made it self-paced so you can follow at home. Giles from Shapeblue was there too and he was part of a panel on Open Cloud. He was told once again "But Apache CloudStack is a Citrix project !" This in itself is a paradox, and as @jzb told me on Twitter yesterday, "Citrix donated CloudStack to Apache, the end". Apache projects do not have any company affiliation.

I don't blame folks; with all the vendors seemingly supporting OpenStack, it does seem that CloudStack is a one-supporter project. The commit stats are also pretty clear, with 39% of commits coming from Citrix. This number is probably higher, since those stats report gmail and apache as domains contributing 20% and 15% respectively; let's say 60% is from Citrix. But nonetheless, this is ignoring and misunderstanding what Apache is, and looking at the glass half empty.

When Citrix donated CloudStack to the Apache Software Foundation (ASF) it relinquished control of the software and the brand. This actually put Citrix in a bind, not being able to easily promote the CloudStack project. Indeed, CloudStack is now a trademark of the ASF and Citrix had to rename its own product CloudPlatform (powered by Apache CloudStack). Citrix cannot promote CloudStack directly; it needs approval to donate sponsorship and must follow the ASF trademark guidelines. Every committer, and especially the PMC members of Apache CloudStack, is now supposed to work to protect the CloudStack brand as part of the ASF and make sure that any confusion is cleared. This is what I am doing here.

Of course when the software was donated, an initial set of committers was defined, all from Citrix and mostly from the former cloud.com startup. Part of the incubating process at the ASF is to make sure that we can add committers from other organizations and attract a community. "Community over Code" is the bread and butter of the ASF, and so this is what we have all been working on: expanding the community outside Citrix, welcoming anyone who thinks CloudStack is interesting enough to contribute a little bit of time and effort. Looking at the glass half empty is saying that CloudStack is a Citrix project: "Hey look, 60% of their commits are from Citrix". Looking at it half full like I do is saying: "Oh wow, in the year since graduation they have diversified the committer base, 40% are not from Citrix". Is 40% enough? Of course not. I wish it were the other way around; I wish Citrix were only a minority in the development of CloudStack.

A couple of other numbers: out of the 26 members of the project management committee (PMC) only seven are from Citrix, and looking at mailing list participation since the beginning of the year, 20% of the folks on the users mailing list and 25% on the developer list are from Citrix. We have diversified the community a great deal, but the "hand-over", that moment when new community members are actually writing more code than the folks who started it, has not happened yet. A community is not just about writing code, but I will give it to you that it is not good for a single company to "control" 60% of the development; this is not where we/I want to be.

This whole discussion is actually against Apache's modus operandi, since one of the biggest tenets of the foundation is non-affiliation. When I participate on the list I am Sebastien, I am not a Citrix employee. Certainly this can put some folks in conflicting situations at times, but the bottom line is that we do not and should not take into account company affiliation when working and making decisions for the project. But if you really want some company name dropping, let's commit an ASF faux-pas and look at a few features:

The Nicira/NSX and OpenDaylight SDN integrations were done by Schuberg Philis, the OpenContrail plugin was done by Juniper, and Midokura created its own plugin for Midonet, as did Stratosphere, giving us great SDN coverage. The LXC integration was done by Gilt, Klarna is contributing to the ecosystem with the Vagrant and Packer plugins, CloudOps has been doing a terrific job with Chef recipes, the Palo Alto Networks integration and Netscaler support, a Google Summer of Code intern did a brand new LDAP plugin and another GSoC student did the GRE support for KVM. Red Hat contributed the Gluster plugin and PCextreme contributed the Ceph interface, while Basho of course contributed the S3 plugin for secondary storage as well as major design decisions on the storage refactor. The SolidFire plugin was done by, well, SolidFire, and NetApp has developed a plugin as well for its virtual storage console. NTT contributed the CloudFoundry interface via BOSH. On the user side, Shapeblue is the leading user support company. So no, it's not just Citrix.

Are all these companies members of the CloudStack project? No. There is no such thing as a company being a member of an ASF project. There is no company affiliation, there is no lock-in, just a bunch of guys trying to make good software and build a community. And yes, I work for Citrix, and my job here will be done when Citrix contributes only 49% of the commits. Citrix is paying me to make sure it loses control of the software, that a healthy ecosystem develops and that CloudStack keeps on becoming a strong and vibrant Apache project. I hope one day folks will understand what CloudStack has become: an ASF project, like HTTP, Hadoop, Mesos, Ant, Maven, Lucene, Solr and 150 other projects. Come to Denver for #apachecon and you will see! The end.

Tuesday, January 21, 2014

PaaS with CloudStack

A few talks from CCC in Amsterdam

In November at the CloudStack Collaboration Conference I was pleased to see several talks on PaaS. We had Uri Cohen (@uri1803) from Gigaspaces, Marc-Elian Begin (@lemeb) from Sixsq and Alex Heneveld (@ahtweetin) from CloudSoft. We also had some related talks, depending on your definition of PaaS, about Docker and Vagrant.

PaaS variations

The differences between PaaS solutions are best explained by this picture from the AWS FAQ about application management.
There is clearly a spectrum that goes from operational control to pure application deployment. We could argue that a true PaaS abstracts the operational details and that management of the underlying infrastructure should be totally hidden; that said, automation of virtual infrastructure deployment has reached such a sophisticated state that it blurs the line between IaaS and PaaS. Not surprisingly, AWS offers services that cover the entire spectrum.
Since I am more on the operations side, I tend to see a PaaS as an infrastructure automation framework. For instance, I look for tools to deploy a MongoDB cluster or a RiakCS cluster. I am not looking for an abstract platform that has MongoDB pre-installed and where I can turn a knob to increase the size of the cluster or manage my shards. An application person will prefer to look at something like Google App Engine and its open source version AppScale. I will get back to all these differences in a later post on PaaS, but this article by @DavidLinthicum that just came out is a good read.

Support for CloudStack

What is interesting for the CloudStack community is to look at the support for CloudStack in all these different solutions wherever they are in the application management spectrum.
  • Cloudify from Gigaspaces was all over Twitter about its support for OpenStack, and I was getting slightly bothered by the lack of CloudStack support. That's why it was great to see Uri Cohen in Amsterdam. He delivered a great talk and gave me a demo of Cloudify. I was very impressed of course by the slick UI, but above all by the ability to provision complete application/infrastructure definitions on clouds. Underneath it uses Apache jclouds, so there was no reason that it could not talk to CloudStack. Over Christmas Uri did a terrific job and the CloudStack support is now tested and documented. It works not only with the commercial version from Citrix, CloudPlatform, but also with Apache CloudStack. And of course it works with my neighbors' cloud, exoscale.
  • Slipstream is not widely known but is worth a look. At CCC @lemeb demoed a CloudStack driver. Since then, they have started offering a hosted version of their Slipstream cloud orchestration framework, which turns out to be hosted on the exoscale CloudStack cloud. Slipstream is more of a cloud broker than a PaaS, but it automates application deployment on multiple clouds, abstracting the various cloud APIs and offering application templates for deployments of virtual infrastructure. Check it out.
  • Cloudsoft's main application deployment engine is Brooklyn; it originated from Alex Heneveld's contribution to Apache Whirr that I wrote about a couple of times. But it can use OpenShift for an additional level of PaaS. I will need to check with Alex how they are doing this, as I believe OpenShift uses LXC. Since CloudStack has LXC support, one ought to be able to use Brooklyn to deploy an LXC cluster on CloudStack and then use OpenShift to manage the deployed applications.
  • A quick note on OpenShift. As far as I understand, it actually uses a static cluster. The scalability comes from the use of containers in the nodes. So technically you could create an OpenShift cluster in CloudStack, but I don't think we will see OpenShift talking directly to the CloudStack API to add nodes; OpenShift bypasses the IaaS APIs. Of course I have not looked at it in a while and I may be wrong :)
  • Talking about PaaS for Vagrant is probably a bit far-fetched, but it fits the infrastructure deployment criteria and can be compared with AWS OpsWorks. Vagrant helps to define reproducible machines so that devs and ops can actually work on the same base servers. But Vagrant, with its plugins, can also help deploy on public clouds and can handle multiple server definitions. So one can look at a Vagrantfile as a template definition for a virtual infrastructure deployment. As a matter of fact, there are many Vagrant boxes out there to deploy things like Apache Mesos clusters, MongoDB, RiakCS clusters etc. It's not meant to manage that stack in production, but at a minimum it can help develop it. Vagrant has a CloudStack plugin, demoed by Hugo Correia from Klarna at CCC. Exoscale took the bait and created a set of exoboxes; that's real gold for developers deploying on exoscale, and any CloudStack provider should follow suit.
  • Which brings me on to Docker. There is currently no support for Docker in CloudStack. We do have LXC support, therefore it would not be too hard to have a 'docker' cluster in CloudStack. You could even install Docker within an image and deploy that on KVM or Xen. Of course some would argue that using containers within VMs defeats the purpose. In any case, with the Docker remote API you could then manage your containers. OpenStack already has a Docker integration; we will dig deeper into Docker functionality to see how best to integrate it in CloudStack.
  • AWS, as I mentioned, has several PaaS-like layers with OpsWorks, CloudFormation and Beanstalk. CloudStack has an EC2 interface but also a third party solution to enable CloudFormation. This is still under development but pretty close to full functionality; check out stackmate and its web interface stacktician. With a CF interface to CloudStack we could see an OpsWorks and a Beanstalk interface coming in the future.
  • Finally, not present at CCC but the leader of PaaS for the enterprise is CloudFoundry. I am going to see Andy Piper (@andypiper) in London next week and will make sure to talk to him about the recent CloudStack support that was merged in the CloudFoundry community repo. It came from folks in Japan and I have not had time to test it. Certainly we as a community should look at this very closely to make sure there is outstanding support for CloudFoundry in ACS.
It is not clear where the frontier between PaaS and IaaS lies; it is highly dependent on the context, who you are and what you are trying to achieve. But CloudStack offers several interfaces to PaaS, or shall I say, PaaS solutions offer several connectors to CloudStack :)

Thursday, December 19, 2013

Clojure with CloudStack

CloStack

CloStack is a Clojure client for Apache CloudStack. Clojure is a dynamic programming language for the Java Virtual Machine (JVM). It is compiled directly to JVM bytecode but offers the dynamic and interactive nature of an interpreted language like Python. Clojure is a dialect of LISP and as such is mostly a functional programming language.

You can try Clojure in your browser and get familiar with its read eval print loop (REPL). To get started, you can follow the tutorial for non-LISP programmers through this web based REPL.

To give you a taste for it, here is how you would add 2 and 2:

user=> (+ 2 2)
4

And how you would define a function:

user=> (defn f [x y]
  #_=> (+ x y))
#'user/f
user=> (f 2 3)
5

This should give you a taste of functional programming :)

Install Leiningen

leiningen is a tool for managing Clojure projects easily. With lein you can create the skeleton of a Clojure project as well as start a read eval print loop (REPL) to test your code.

Installing the latest version of leiningen is easy: get the script and put it in your path. Make it executable and you are done.

The first time you run lein repl it will bootstrap itself:

$ lein repl
Downloading Leiningen to /Users/sebgoa/.lein/self-installs/leiningen-2.3.4-standalone.jar now...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.0M  100 13.0M    0     0  1574k      0  0:00:08  0:00:08 --:--:-- 2266k
nREPL server started on port 58633 on host 127.0.0.1
REPL-y 0.3.0
Clojure 1.5.1
    Docs: (doc function-name-here)
          (find-doc "part-of-name-here")
  Source: (source function-name-here)
 Javadoc: (javadoc java-object-or-class-here)
    Exit: Control+D or (exit) or (quit)
 Results: Stored in vars *1, *2, *3, an exception in *e

user=> exit
Bye for now!

Download CloStack

To install CloStack, clone the github repository and start lein repl:

git clone https://github.com/pyr/clostack.git
$ lein repl
Retrieving codox/codox/0.6.4/codox-0.6.4.pom from clojars
Retrieving codox/codox.leiningen/0.6.4/codox.leiningen-0.6.4.pom from clojars
Retrieving leinjacker/leinjacker/0.4.1/leinjacker-0.4.1.pom from clojars
Retrieving org/clojure/core.contracts/0.0.1/core.contracts-0.0.1.pom from central
Retrieving org/clojure/core.unify/0.5.3/core.unify-0.5.3.pom from central
Retrieving org/clojure/clojure/1.4.0/clojure-1.4.0.pom from central
Retrieving org/clojure/core.contracts/0.0.1/core.contracts-0.0.1.jar from central
Retrieving org/clojure/core.unify/0.5.3/core.unify-0.5.3.jar from central
Retrieving org/clojure/clojure/1.4.0/clojure-1.4.0.jar from central
Retrieving codox/codox/0.6.4/codox-0.6.4.jar from clojars
Retrieving codox/codox.leiningen/0.6.4/codox.leiningen-0.6.4.jar from clojars
Retrieving leinjacker/leinjacker/0.4.1/leinjacker-0.4.1.jar from clojars
Retrieving org/clojure/clojure/1.3.0/clojure-1.3.0.pom from central
Retrieving org/clojure/data.json/0.2.2/data.json-0.2.2.pom from central
Retrieving http/async/client/http.async.client/0.5.2/http.async.client-0.5.2.pom from clojars
Retrieving com/ning/async-http-client/1.7.10/async-http-client-1.7.10.pom from central
Retrieving io/netty/netty/3.4.4.Final/netty-3.4.4.Final.pom from central
Retrieving org/clojure/data.json/0.2.2/data.json-0.2.2.jar from central
Retrieving com/ning/async-http-client/1.7.10/async-http-client-1.7.10.jar from central
Retrieving io/netty/netty/3.4.4.Final/netty-3.4.4.Final.jar from central
Retrieving http/async/client/http.async.client/0.5.2/http.async.client-0.5.2.jar from clojars
nREPL server started on port 58655 on host 127.0.0.1
REPL-y 0.3.0
Clojure 1.5.1
    Docs: (doc function-name-here)
          (find-doc "part-of-name-here")
  Source: (source function-name-here)
 Javadoc: (javadoc java-object-or-class-here)
    Exit: Control+D or (exit) or (quit)
 Results: Stored in vars *1, *2, *3, an exception in *e

user=> exit

The first time that you start the REPL lein will download all the dependencies of clostack.

Prepare environment variables and make your first clostack call

Export a few environment variables to define the cloud you will be using, namely:

export CLOUDSTACK_ENDPOINT=http://localhost:8080/client/api
export CLOUDSTACK_API_KEY=HGWEFHWERH8978yg98ysdfghsdfgsagf
export CLOUDSTACK_API_SECRET=fhdsfhdf869guh3guwghseruig

Then relaunch the REPL

$ lein repl
nREPL server started on port 59890 on host 127.0.0.1
REPL-y 0.3.0
Clojure 1.5.1
    Docs: (doc function-name-here)
          (find-doc "part-of-name-here")
  Source: (source function-name-here)
 Javadoc: (javadoc java-object-or-class-here)
    Exit: Control+D or (exit) or (quit)
 Results: Stored in vars *1, *2, *3, an exception in *e

user=> (use 'clostack.client)
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
nil

You can safely discard the warning message which only indicates that 'clostack' is meant to be used as a library in a clojure project.
Define a client to your CloudStack endpoint

user=> (def cs (http-client))
#'user/cs

And call an API like so:

user=> (list-zones cs)
{:listzonesresponse {:count 1, :zone [{:id "1128bd56-b4d9-4ac6-a7b9-c715b187ce11", :name "CH-GV2", :networktype "Basic", :securitygroupsenabled true, :allocationstate "Enabled", :zonetoken "ccb0a60c-79c8-3230-ab8b-8bdbe8c45bb7", :dhcpprovider "VirtualRouter", :localstorageenabled true}]}}

To explore the API calls that you can make, the REPL features tab completion. Enter list or de and press the tab key; you should see:

user=> (list
list                                list*                               list-accounts                       list-async-jobs                     
list-capabilities                   list-disk-offerings                 list-event-types                    list-events                         
list-firewall-rules                 list-hypervisors                    list-instance-groups                list-ip-forwarding-rules            
list-iso-permissions                list-isos                           list-lb-stickiness-policies         list-load-balancer-rule-instances   
list-load-balancer-rules            list-network-ac-ls                  list-network-offerings              list-networks                       
list-os-categories                  list-os-types                       list-port-forwarding-rules          list-private-gateways               
list-project-accounts               list-project-invitations            list-projects                       list-public-ip-addresses            
list-remote-access-vpns             list-resource-limits                list-security-groups                list-service-offerings              
list-snapshot-policies              list-snapshots                      list-ssh-key-pairs                  list-static-routes                  
list-tags                           list-template-permissions           list-templates                      list-virtual-machines               
list-volumes                        list-vp-cs                          list-vpc-offerings                  list-vpn-connections                
list-vpn-customer-gateways          list-vpn-gateways                   list-vpn-users                      list-zones                          
list?

user=> (de
dec                           dec'                          decimal?                      declare                       def                           
default-data-readers          definline                     definterface                  defmacro                      defmethod                     
defmulti                      defn                          defn-                         defonce                       defprotocol                   
defrecord                     defreq                        defstruct                     deftype                       delay                         
delay?                        delete-account-from-project   delete-firewall-rule          delete-instance-group         delete-ip-forwarding-rule     
delete-iso                    delete-lb-stickiness-policy   delete-load-balancer-rule     delete-network                delete-network-acl            
delete-port-forwarding-rule   delete-project                delete-project-invitation     delete-remote-access-vpn      delete-security-group         
delete-snapshot               delete-snapshot-policies      delete-ssh-key-pair           delete-static-route           delete-tags                   
delete-template               delete-volume                 delete-vpc                    delete-vpn-connection         delete-vpn-customer-gateway   
delete-vpn-gateway            deliver                       denominator                   deploy-virtual-machine        deref                         
derive                        descendants                   destroy-virtual-machine       destructure                   detach-iso                    
detach-volume

To pass arguments to a call follow the syntax:

user=> (list-templates cs :templatefilter "executable")

Start a virtual machine

To deploy a virtual machine you need to get the serviceofferingid (or instance type), the templateid (also known as the image id) and the zoneid. The call is then very similar to CloudMonkey's and returns a jobid:

user=> (deploy-virtual-machine cs :serviceofferingid "71004023-bb72-4a97-b1e9-bc66dfce9470" :templateid "1d961c82-7c8c-4b84-b61b-601876dab8d0" :zoneid "1128bd56-b4d9-4ac6-a7b9-c715b187ce11")
{:deployvirtualmachineresponse {:id "d0a887d2-e20b-4b25-98b3-c2995e4e428a", :jobid "21d20b5c-ea6e-4881-b0b2-0c2f9f1fb6be"}}

You can pass additional parameters to the deploy-virtual-machine call, such as the keypair and the securitygroupname:

user=> (deploy-virtual-machine cs :serviceofferingid "71004023-bb72-4a97-b1e9-bc66dfce9470" :templateid "1d961c82-7c8c-4b84-b61b-601876dab8d0" :zoneid "1128bd56-b4d9-4ac6-a7b9-c715b187ce11" :keypair "exoscale")
{:deployvirtualmachineresponse {:id "b5fdc41f-e151-43e7-a036-4d87b8536408", :jobid "418026fc-1009-4e7a-9721-7c9ad47b49e4"}}

To query the asynchronous job, you can use the query-async-job-result API call:

user=> (query-async-job-result cs :jobid "418026fc-1009-4e7a-9721-7c9ad47b49e4")
{:queryasyncjobresultresponse {:jobid "418026fc-1009-4e7a-9721-7c9ad47b49e4", :jobprocstatus 0, :jobinstancetype "VirtualMachine", :accountid "b8c0baab-18a1-44c0-ab67-e24049212925", :jobinstanceid "b5fdc41f-e151-43e7-a036-4d87b8536408", :created "2013-12-16T12:25:21+0100", :jobstatus 0, :jobresultcode 0, :cmd "com.cloud.api.commands.DeployVMCmd", :userid "968f6b4e-b382-4802-afea-dd731d4cf9b9"}}

And finally to destroy the virtual machine you would pass the id of the VM to the destroy-virtual-machine call like so:

user=> (destroy-virtual-machine cs :id "d0a887d2-e20b-4b25-98b3-c2995e4e428a")
{:destroyvirtualmachineresponse {:jobid "8fc8a8cf-9b54-435c-945d-e3ea2f183935"}}

With these simple basics you can keep on exploring clostack and the CloudStack API.

Use CloStack within your own clojure project

Hello World in clojure

To write your own Clojure project that makes use of clostack, use leiningen to create a project skeleton:

lein new toto

Lein will automatically create a src/toto/core.clj file; edit it to replace the function foo with -main. This dummy function returns Hello, World!. Let's try to execute it. First we will need to define the main namespace in the project.clj file. Edit it like so:

(defproject toto "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :main toto.core
  :dependencies [[org.clojure/clojure "1.5.1"]])

Note the :main toto.core

You can now execute the code with lein run john. Indeed, if you check the -main function in src/toto/core.clj you will see that it takes an argument. Surprisingly, you should see the following output:

$ lein run john
john Hello, World!

Let's now add the CloStack dependency and modify the main function to return the zone of the CloudStack cloud.

Adding the Clostack dependency

Edit the project.clj to add a dependency on clostack and a few logging packages:

:dependencies [[org.clojure/clojure "1.5.1"]
               [clostack "0.1.3"]
               [org.clojure/tools.logging "0.2.6"]
               [org.slf4j/slf4j-log4j12   "1.6.4"]
               [log4j/apache-log4j-extras "1.0"]
               [log4j/log4j               "1.2.16"
                :exclusions [javax.mail/mail
                             javax.jms/jms
                             com.sun.jdkmk/jmxtools
                             com.sun.jmx/jmxri]]])
                             

lein should have created a resources directory. In it, create a log4j.properties file like so:

$ more log4j.properties 
# Root logger option
log4j.rootLogger=INFO, stdout

# Direct log messages to stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n

A discussion on logging is beyond the scope of this recipe; we merely add it to the configuration for a complete example.

Now you can edit the code in src/toto/core.clj with some basic calls.

(ns toto.core
  (:require [clostack.client :refer [http-client list-zones]]))

(defn foo
  "I don't do a whole lot."
  [x]
  (println x "Hello, World!"))

(def cs (http-client))

(defn -main [args]
  (println (list-zones cs))
  (println args "Hey Wassup")
  (foo args)
)

Simply run this Clojure code with lein run joe from the root of your project. And that's it, you have successfully discovered the very basics of Clojure and used the CloudStack client clostack to write your first Clojure code. Now for something more significant, look at Pallet.

Wednesday, December 18, 2013

2014 Cloud Predictions

Warning: this is written with a glass of wine in one hand, two days before vacation ... :)

1. CloudStack will abandon semantic versioning and adopt super hero names for its releases, this will make upgrade paths more understandable.

2. Someone will take Euca API server and stick CloudStack backend beneath it, adding Opennebula packaging will make this the best cloud distro of all.

3. I will finally make sense of NetflixOSS plethora of software and reach nirvana by integrating CloudStack in Asgard.

4. AWS will opensource its software killing OpenStack, and we will realize that in fact they use CloudStack with Euca in front.

5. I will understand what NFV, VNF and SDN really mean and come up with a new acronym that will set twitter on fire.

6. We will actually see some code in Solum.

7. bitcoin will crash and come back up at least five times.

8. Citrix stock will jump 100% on acquisition by IBM.

9. My boss will stop asking me for statistics.

10. Facebook will die on a Snowden revelation.

I will stop at 10 otherwise this could go on all night :)

Happy Holidays everyone

Friday, December 06, 2013

Veewee, Vagrant and CloudStack

Coming back from the CloudStack conference, the feeling that this is not about building clouds got stronger. This is really about what to do with them and how they bring you agility and faster time to market, and allow you to focus on innovation in your core business. A large component of this is culture and a change in how we do IT. The DevOps movement is the embodiment of this change. Over in Amsterdam I was stoked to meet folks that I had seen at other locations throughout Europe in the last 18 months: folks from PaddyPower, Schuberg Philis and Inuits, who all embrace DevOps. I also met new folks, including Hugo Correia from Klarna (CloudStack users) who came by to talk about the vagrant-cloudstack plugin. His talk and a demo by Roland Kuipers from Schuberg Philis were enough to kick my butt and get me to finally check out Vagrant. I sprinkled a bit of Veewee and of course some CloudStack on top of it all. Have fun reading.

Automation is key to a reproducible, failure-tolerant infrastructure. Cloud administrators should aim to automate all the steps of building their infrastructure and be able to re-provision everything with a single click. This is possible through a combination of configuration management, monitoring and provisioning tools. To get started creating appliances that will be automatically configured and provisioned, two tools stand out in the arsenal: Veewee and Vagrant.

Veewee: Veewee is a tool to easily create appliances for different hypervisors. It fetches the .iso of the distribution you want and builds the machine with a kickstart file. It integrates with providers like VirtualBox so that you can build these appliances on your local machine. It supports most commonly used OS templates. Coupled with VirtualBox, it allows admins and devs to create reproducible base appliances. Getting started with Veewee is a 10 minute exercise. The README is great and there is also a very nice post that guides you through building your first box.

Most folks will have no issues cloning Veewee from GitHub and building it; you will need ruby 1.9.2 or above. You can get it via `rvm` or your favorite ruby version manager.

git clone https://github.com/jedi4ever/veewee
gem install bundler
bundle install

Setting up an alias is handy at this point: `alias veewee="bundle exec veewee"`. You will need a virtual machine provider (e.g. VirtualBox, VMware Fusion, Parallels, KVM). I personally use VirtualBox, but pick one and install it if you don't have it already. You will then be able to start using `veewee` on your local machine. Check the sub-commands available (for VirtualBox):

$ veewee vbox
Commands:
  veewee vbox build [BOX_NAME]                     # Build box
  veewee vbox copy [BOX_NAME] [SRC] [DST]          # Copy a file to the VM
  veewee vbox define [BOX_NAME] [TEMPLATE]         # Define a new basebox starting from a template
  veewee vbox destroy [BOX_NAME]                   # Destroys the virtualmachine that was built
  veewee vbox export [BOX_NAME]                    # Exports the basebox to the vagrant format
  veewee vbox halt [BOX_NAME]                      # Activates a shutdown the virtualmachine
  veewee vbox help [COMMAND]                       # Describe subcommands or one specific subcommand
  veewee vbox list                                 # Lists all defined boxes
  veewee vbox ostypes                              # List the available Operating System types
  veewee vbox screenshot [BOX_NAME] [PNGFILENAME]  # Takes a screenshot of the box
  veewee vbox sendkeys [BOX_NAME] [SEQUENCE]       # Sends the key sequence (comma separated) to the box. E.g for testing the :boot_cmd_sequence
  veewee vbox ssh [BOX_NAME] [COMMAND]             # SSH to box
  veewee vbox templates                            # List the currently available templates
  veewee vbox undefine [BOX_NAME]                  # Removes the definition of a basebox 
  veewee vbox up [BOX_NAME]                        # Starts a Box
  veewee vbox validate [BOX_NAME]                  # Validates a box against vagrant compliancy rules
  veewee vbox winrm [BOX_NAME] [COMMAND]           # Execute command via winrm

Options:
          [--debug]           # enable debugging
  -w, --workdir, [--cwd=CWD]  # Change the working directory. (The folder containing the definitions folder).
                              # Default: /Users/sebgoa/Documents/gitforks/veewee

Pick a template from the `templates` directory and `define` your first box:

veewee vbox define myfirstbox CentOS-6.5-x86_64-minimal

You should see that a `definitions/` directory has been created; browse to it and inspect the `definition.rb` file. You might want to comment out some lines, like removing `chef` or `puppet`. If you don't change anything and build the box, you will then be able to `validate` it with `veewee vbox validate myfirstbox`. To build the box simply do:

veewee vbox build myfirstbox

Everything should be successful, and you should see a running VM in your VirtualBox UI. To export it for use with `Vagrant`, `veewee` provides an export mechanism (really a VBoxManage command): `veewee vbox export myfirstbox`. At the end of the export, a .box file should be present in your directory.

Vagrant: Picking up from where we left off with `veewee`, we can now add the box to Vagrant and customize it with shell scripts or, much better, with Puppet recipes or Chef cookbooks. First let's add the box file to Vagrant:

vagrant box add 'myfirstbox' '/path/to/box/myfirstbox.box'

Then in a directory of your choice, create the Vagrant "project":

 
vagrant init 'myfirstbox'

This will create a `Vagrantfile` that we will later edit to customize the box. You can boot the machine with `vagrant up` and once it's up, you can ssh to it with `vagrant ssh`.

While `veewee` is used to create a base box with almost no customization (except potentially a chef and/or puppet client), `vagrant` is used to customize the box using the Vagrantfile. For example, to customize the `myfirstbox` that we just built, set the memory to 2 GB, add a host-only interface with IP 192.168.56.10, use the apache2 Chef cookbook and finally run a `bootstrap.sh` script, we will have the following `Vagrantfile`:

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|

  # Every Vagrant virtual environment requires a box to build off of.
  config.vm.box = "myfirstbox"
  config.vm.provider "virtualbox" do |vb|
    vb.customize ["modifyvm", :id, "--memory", 2048]
  end

  #host-only network setup
  config.vm.network "private_network", ip: "192.168.56.10"

  # Chef solo provisioning
  config.vm.provision "chef_solo" do |chef|
     chef.add_recipe "apache2"
  end

  #Test script to install CloudStack
  #config.vm.provision :shell, :path => "bootstrap.sh"
  
end

The cookbook will be in a `cookbooks` directory and the bootstrap script will be in the root directory of this Vagrant definition. For more information, check the Vagrant website and experiment.

Vagrant and CloudStack: What is very interesting with Vagrant is that you can use various plugins to deploy machines on public clouds. There is a `vagrant-aws` plugin and of course a `vagrant-cloudstack` plugin. You can get the latest CloudStack plugin from GitHub, and you can install it directly with the `vagrant` command line:

vagrant plugin install vagrant-cloudstack

Or, if you are building it from source, clone the git repository, build the gem and install it in `vagrant`:

git clone https://github.com/klarna/vagrant-cloudstack.git
gem build vagrant-cloudstack.gemspec
gem install vagrant-cloudstack-0.1.0.gem
vagrant plugin install /Users/sebgoa/Documents/gitforks/vagrant-cloudstack/vagrant-cloudstack-0.1.0.gem
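
Either way, you can check that the plugin is properly installed with:

vagrant plugin list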

The only drawback that I see is that one would ideally want to upload the local box (created in the previous section) and use it; instead one has to create `dummy boxes` that use existing templates available on the public cloud. This is easy to do, but it creates a gap between local testing and production deployments. To build a dummy box, simply create a `Vagrantfile` and a `metadata.json` file like so:

$ cat metadata.json 
{
    "provider": "cloudstack"
}
$ cat Vagrantfile 
# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.provider :cloudstack do |cs|
    cs.template_id = "a17b40d6-83e4-4f2a-9ef0-dce6af575789"
  end
end

Here `cs.template_id` is the UUID of a CloudStack template in your cloud. CloudStack users will know how to easily get those UUIDs with `CloudMonkey`. Then create a `box` file with `tar cvzf cloudstack.box ./metadata.json ./Vagrantfile`. Note that you can add additional CloudStack parameters in this box definition, like the host, path, etc. (something to think about :) ). Then simply add the box to `Vagrant` with:

vagrant box add ./cloudstack.box

You can now create a new `Vagrant` project:

mkdir cloudtest
cd cloudtest
vagrant init

And edit the newly created `Vagrantfile` to use the `cloudstack` box. Add additional parameters like the `ssh` configuration, if the box does not use the Vagrant defaults, plus `service_offering_id`, etc. Remember to use your own API and secret keys and to change the name of the box to the one you created. For example, on exoscale:

# -*- mode: ruby -*-
# vi: set ft=ruby :

# Vagrantfile API/syntax version. Don't touch unless you know what you're doing!
VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|

  # Every Vagrant virtual environment requires a box to build off of.
  config.vm.box = "cloudstack"

  config.vm.provider :cloudstack do |cs, override|
    cs.host = "api.exoscale.ch"
    cs.path = "/compute"
    cs.scheme = "https"
    cs.api_key = "PQogHs2sk_3..."
    cs.secret_key = "...NNRC5NR5cUjEg"
    cs.network_type = "Basic"

    cs.keypair = "exoscale"
    cs.service_offering_id = "71004023-bb72-4a97-b1e9-bc66dfce9470"
    cs.zone_id = "1128bd56-b4d9-4ac6-a7b9-c715b187ce11"

    override.ssh.username = "root" 
    override.ssh.private_key_path = "/path/to/private/key/id_rsa_example"
  end

  # Test bootstrap script
  config.vm.provision :shell, :path => "bootstrap.sh"

end

The machine is brought up with:

vagrant up --provider=cloudstack

You should see output similar to the following:

$ vagrant up --provider=cloudstack
Bringing machine 'default' up with 'cloudstack' provider...
[default] Warning! The Cloudstack provider doesn't support any of the Vagrant
high-level network configurations (`config.vm.network`). They
will be silently ignored.
[default] Launching an instance with the following settings...
[default]  -- Service offering UUID: 71004023-bb72-4a97-b1e9-bc66dfce9470
[default]  -- Template UUID: a17b40d6-83e4-4f2a-9ef0-dce6af575789
[default]  -- Zone UUID: 1128bd56-b4d9-4ac6-a7b9-c715b187ce11
[default]  -- Keypair: exoscale
[default] Waiting for instance to become "ready"...
[default] Waiting for SSH to become available...
[default] Machine is booted and ready for use!
[default] Rsyncing folder: /Users/sebgoa/Documents/exovagrant/ => /vagrant
[default] Running provisioner: shell...
[default] Running: /var/folders/76/sx82k6cd6cxbp7_djngd17f80000gn/T/vagrant-shell20131203-21441-1ipxq9e
Tue Dec  3 14:25:49 CET 2013
This works

Which is a perfect execution of my amazing bootstrap script:

#!/usr/bin/env bash

/bin/date
echo "This works"

You can now start playing with Chef cookbooks, Puppet recipes or SaltStack formulas and automate the configuration of your cloud instances, thanks to Veewee, Vagrant and CloudStack.

Thursday, October 17, 2013

Why I will go to CCC13 in Amsterdam?

Aside from the fact that I work full-time on Apache CloudStack, that I am on the organizing committee and that my boss would kill me if I did not go to the CloudStack Collaboration Conference, there are many great reasons why I want to go as an open source enthusiast. Here is why:

It's Amsterdam and we are going to have a blast (the city of Amsterdam is even sponsoring the event). The venue, Beurs van Berlage, is terrific; it is the same venue where the Hadoop Summit is held and where the AWS Benelux Summit took place a couple of weeks ago. We are going to have a 24/7 developer room (thanks to CloudSoft) where we can meet to hack on CloudStack and its ecosystem, three parallel tracks in other rooms and great evening events. The event is made possible by the amazing local support from the team at Schuberg Philis, a company that has devops in its veins and organized DevOps Days Amsterdam. I am not being very subtle in acknowledging our sponsors here, but hey, without them this would not be possible.

On the first day (November 20th) is the Hackathon sponsored by exoscale. In parallel to the hackathon, new users of CloudStack will be able to attend a full-day bootcamp run by the super competent guys from Shapeblue; they also play guitar and drink beer, so make sure to hang out with them :). Just as cool, the CloudStack community recognizes that building a cloud takes many components, so we will have a Jenkins workshop and an elasticsearch workshop. I am a big fan of elasticsearch, not only for keeping your infrastructure logs but also for other types of data; I actually store all CloudStack emails in an elasticsearch cluster. Jenkins, of course, is at the heart of everyone's continuous integration systems these days. Seeing these two workshops, it will be no surprise to find a DevOps track over the next two days.

Kicking off the second day, the first day of talks, we will have a keynote by Patrick Debois, the Jedi master of DevOps. We will then break up into a user track, a developer track, a commercial track and, for this day only, a devops track with a 'culture' flavor. The hard work will begin: choosing which talk to attend. I am not going to go through every talk; we received a lot of great submissions and choosing was hard. New CloudStack users, or people looking into using CloudStack, will gain a lot from the case studies presented in the user track, while the developers will get a deep dive into the advanced networking features of CloudStack, including SDN support, right off the bat. In the afternoon, the case studies will continue in the user track, including a talk from NTT about how they built an AWS-compatible cloud. I will have to head to the developer track for a session on 'interfaces', with a talk on jclouds, a new GCE interface that I worked on, and my own talk on Apache libcloud, for which I worked a lot on the CloudStack driver. The DevOps track will have an entertaining talk by Michael Ducy from Opscode, some real-world experiences by John Turner and Noel King from Paddy Power, and the VP of engineering for Citrix CloudPlatform will lead an interactive session on how best to work with the open source community of Apache CloudStack.

After recovering from the night's events, we will head into the second day with another entertaining keynote, by John Willis. Here the choice will be hard between the storage session in the commercial track and the 'Future of CloudStack' session in the developer track. With talks from NetApp and SolidFire, who have each developed a plugin for CloudStack, plus our own Wido den Hollander (PMC member), who wrote the Ceph integration, the storage session will rock; but the 'Future of CloudStack' session will be key for developers, covering frameworks, integration testing, system VMs... After lunch the user track will feature several intro-to-networking talks; networking is the most difficult concept to grasp in clouds (IMHO). The storage session will continue with a talk by Basho on RiakCS (also integrated in CloudStack) and a panel. The dev track will be dedicated to discussions on PaaS, not to be missed if you ask me, as PaaS is the next step in clouds. To wrap things up, I will have to decide between a session on metering/billing, a discussion on hypervisor choice and support, and a presentation on the CloudStack community in Japan, after Ruv Cohen talks about trading cloud commodities.

The agenda is loaded and ready to fire. It will be tough to decide which sessions to attend, but you will come out refreshed and energized, with lots of new ideas to evolve your IT infrastructure. So, one word: Register.

And of course many thanks to our sponsors: Citrix, Schuberg Philis, Juniper, Sungard, Shapeblue, NetApp, cloudSoft, Nexenta, iKoula, leaseweb, solidfire, greenqloud, atom86, apalia, elasticsearch, 2source4, iamsterdam, cloudbees and 42on

Tuesday, October 01, 2013

A look at RIAK-CS from BASHO

Playing with Basho Riak CS Object Store

CloudStack deals with the compute side of an IaaS; the storage side, which for most of us these days consists of a scalable, fault-tolerant object store, is left to other software. Ceph, led by Inktank, and RiakCS from Basho are the two most talked-about object stores these days. In this post we look at RiakCS and take it for a quick whirl. CloudStack integrates with RiakCS for secondary storage, and together they can offer an EC2 and a true S3 interface backed by a scalable object store. So here it is.

While RiakCS (Cloud Storage) can be seen as an S3 backend implementation, it is based on Riak. Riak is a highly available, distributed NoSQL database. The use of a consistent hashing algorithm allows Riak to re-balance the data when nodes disappear (e.g. fail) and when nodes appear (e.g. capacity is added); it also lets Riak manage replication of data with an eventual consistency principle, typical of large-scale distributed storage systems which favor availability over consistency.

To get a functioning RiakCS store we need Riak, RiakCS and Stanchion. Stanchion is an interface that serializes HTTP requests made to RiakCS.

A taste of Riak

To get started, let's play with Riak and build a cluster on our local machine. Basho has some great documentation; the toughest thing will be to install Erlang (and by tough I mean a two-minute deal), but again the docs are very helpful and give step-by-step instructions for almost every OS.

There is no need for me to re-create step-by-step instructions since the docs are so great, but the gist is that with the quickstart guide we can create a Riak cluster on `localhost`. We are going to start five Riak nodes (we could start more) and join them into a cluster. This is as simple as:

    bin/riak start
    bin/riak-admin cluster join dev1@127.0.0.1
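
The same pattern applies to the remaining nodes. As a minimal sketch, assuming the five dev releases (`dev1` to `dev5`) were built as described in the quickstart guide and using the staged clustering commands of recent Riak releases:

    # dev1 is already running; start the other nodes and stage a join to dev1
    for d in dev2 dev3 dev4 dev5; do
        $d/bin/riak start
        $d/bin/riak-admin cluster join dev1@127.0.0.1
    done
    # review and then apply the staged membership changes
    dev1/bin/riak-admin cluster plan
    dev1/bin/riak-admin cluster commit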

Here `dev1` is the first Riak node that was started. Creating this cluster will re-balance the ring:

    ================================= Membership ==================================
    Status     Ring    Pending    Node
    -------------------------------------------------------------------------------
    valid     100.0%     20.3%    'dev1@127.0.0.1'
    valid       0.0%     20.3%    'dev2@127.0.0.1'
    valid       0.0%     20.3%    'dev3@127.0.0.1'
    valid       0.0%     20.3%    'dev4@127.0.0.1'
    valid       0.0%     18.8%    'dev5@127.0.0.1'

The `riak-admin` command is a nice CLI to manage the cluster. We can check the membership of the cluster we just created; after some time the ring will have re-balanced to the expected state.

    dev1/bin/riak-admin member-status
    ================================= Membership ==================================
    Status     Ring    Pending    Node
    -------------------------------------------------------------------------------
    valid      62.5%     20.3%    'dev1@127.0.0.1'
    valid       9.4%     20.3%    'dev2@127.0.0.1'
    valid       9.4%     20.3%    'dev3@127.0.0.1'
    valid       9.4%     20.3%    'dev4@127.0.0.1'
    valid       9.4%     18.8%    'dev5@127.0.0.1'
    -------------------------------------------------------------------------------
    Valid:5 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
   
    dev1/bin/riak-admin member-status
    ================================= Membership ==================================
    Status     Ring    Pending    Node
    -------------------------------------------------------------------------------
    valid      20.3%      --      'dev1@127.0.0.1'
    valid      20.3%      --      'dev2@127.0.0.1'
    valid      20.3%      --      'dev3@127.0.0.1'
    valid      20.3%      --      'dev4@127.0.0.1'
    valid      18.8%      --      'dev5@127.0.0.1'
    -------------------------------------------------------------------------------

You can then test your cluster by putting an image into it, as explained in the docs, and retrieving it in a browser (i.e. an HTTP GET):

    curl -XPUT http://127.0.0.1:10018/riak/images/1.jpg \
         -H "Content-type: image/jpeg" \
         --data-binary @image_name_.jpg

Open your browser to `http://127.0.0.1:10018/riak/images/1.jpg` and the image should be served back. As easy as 1..2..3.
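
Or retrieve it from the command line with a plain GET, for example:

    curl -o 1.jpg http://127.0.0.1:10018/riak/images/1.jpg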

Installing everything on Ubuntu 12.04

To move forward and build a complete S3-compatible object store, let's set everything up on an Ubuntu 12.04 machine. Back to installing `riak`: get the repo key and set up a `basho.list` repository:

    curl http://apt.basho.com/gpg/basho.apt.key | sudo apt-key add -
    bash -c "echo deb http://apt.basho.com $(lsb_release -sc) main > /etc/apt/sources.list.d/basho.list"
    apt-get update

And grab `riak`, `riak-cs` and `stanchion`. I am not sure why but their great docs make you download the .debs separately and use `dpkg`.

    sudo apt-get install riak riak-cs stanchion
 
Check that the binaries are in your path with `which riak`, `which riak-cs` and `which stanchion`; you should find everything in `/usr/sbin`. All configuration will be in `/etc/riak`, `/etc/riak-cs` and `/etc/stanchion`; inspect especially the `app.config` files, which we are going to modify before starting everything. Note that all binaries have a nice usage description which includes a console, a ping method and a restart, among others:

    Usage: riak {start | stop| restart | reboot | ping | console | attach | 
                        attach-direct | ertspath | chkconfig | escript | version | 
                        getpid | top [-interval N] [-sort reductions|memory|msg_q] [-lines N] }

Configuration

Before starting anything we are going to configure every component, which means editing the `app.config` file in each respective directory. For `riak-cs` I only made sure to set `{anonymous_user_creation, true}`; I did nothing to configure `stanchion`, as I used the default ports and ran everything on `localhost` without `ssl`. Just make sure that you are not running any other application on port `8080`, as `riak-cs` will use this port by default. For configuring `riak` see the documentation; it sets a different backend than what we used in the `tasting` phase :) With all this configuration done, you should be able to start all three components:

    riak start
    riak-cs start
    stanchion start

You can `ping` every component with `riak ping`, `riak-cs ping` and `stanchion ping`, and each also has a console; I let you figure out the console access. Now create an admin user for `riak-cs`:

    curl -H 'Content-Type: application/json' -X POST http://localhost:8080/riak-cs/user \
         --data '{"email":"foobar@example.com", "name":"admin user"}'
 
If this returns successfully, it is a good indication that your setup is working properly. In the response we recognize the API and secret keys:

    {"email":"foobar@example.com",
     "display_name":"foobar",
     "name":"admin user",
     "key_id":"KVTTBDQSQ1-DY83YQYID",
     "key_secret":"2mNGCBRoqjab1guiI3rtQmV3j2NNVFyXdUAR3A==",
     "id":"1f8c3a88c1b58d4b4369c1bd155c9cb895589d24a5674be789f02d3b94b22e7c",
     "status":"enabled"}
 
Let's take those and put them in our `riak-cs` configuration file; there are `admin_key` and `admin_secret` variables to set. Then restart with `riak-cs restart`. Don't forget to also add them to the `stanchion` configuration file `/etc/stanchion/app.config` and restart it with `stanchion restart`.
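
For reference, a minimal sketch of what the relevant lines of `/etc/riak-cs/app.config` might look like with the keys returned above (the exact layout of the `riak_cs` section can differ between versions):

    {riak_cs, [
        %% ...
        {anonymous_user_creation, true},
        {admin_key, "KVTTBDQSQ1-DY83YQYID"},
        {admin_secret, "2mNGCBRoqjab1guiI3rtQmV3j2NNVFyXdUAR3A=="},
        %% ...
    ]},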

Using our new Cloud Storage with Boto

Since RiakCS is an S3-compatible cloud storage solution, we should be able to use an S3 client like Python boto to create buckets and store data. Let's try. You will need boto of course (`apt-get install python-boto`); then open an interactive shell with `python`. Import the modules and create a connection to `riak-cs`:

    >>> from boto.s3.key import Key
    >>> from boto.s3.connection import S3Connection
    >>> from boto.s3.connection import OrdinaryCallingFormat
    >>> apikey='KVTTBDQSQ1-DY83YQYID'
    >>> secretkey='2mNGCBRoqjab1guiI3rtQmV3j2NNVFyXdUAR3A=='
    >>> cf=OrdinaryCallingFormat()
    >>> conn=S3Connection(aws_access_key_id=apikey,aws_secret_access_key=secretkey,
                          is_secure=False,host='localhost',port=8080,calling_format=cf)
 
Now you can list the buckets (the list will be empty at first), then create a bucket and store content in it under various keys:

    >>> conn.get_all_buckets()
    []
    >>> bucket=conn.create_bucket('riakbucket')
    >>> k=Key(bucket)
    >>> k.key='firstkey'
    >>> k.set_contents_from_string('Object from first key')
    >>> k.key='secondkey'
    >>> k.set_contents_from_string('Object from second key')
    >>> b=conn.get_all_buckets()[0]
    >>> k=Key(b)
    >>> k.key='secondkey'
    >>> k.get_contents_as_string()
    'Object from second key'
    >>> k.key='firstkey'
    >>> k.get_contents_as_string()
    'Object from first key'


And that's it: an S3-compatible object store backed by a distributed NoSQL database that uses consistent hashing, all of it in Erlang. Automate all of it with a Chef recipe, hook it up to your CloudStack EC2-compatible cloud, use it as secondary storage to hold templates or make it a public-facing offering, and you have the second leg of the cloud: storage. Sweet... In the next post I will show you how to use it with CloudStack.

Tuesday, September 03, 2013

CloudStack Google Summer of Code projects

Google Summer of Code is entering the final stretch, with pencils down on Sept 16th and final evaluation on Sept 27th. Of the five projects CloudStack had this summer, one failed at mid-term and one led to committer status a couple of weeks ago. That's 20% failure and 20% outstanding results, on par with GSoC-wide statistics I believe.

The LDAP integration has been the most productive project. Ian Duffy, a 20-year-old from Dublin, did an outstanding job, developing his new feature in a feature branch, building a Jenkins pipeline to test everything and submitting a merge request to master a couple of weeks ago. With 90% unit-test coverage, static code analysis with Sonar in his Jenkins pipeline and automatic publishing of RPMs to a local yum repo, Ian exceeded expectations. His code has even already been backported to the 4.1.1 release in the CloudSand distro of CloudStack.

The SDN extension project was about taking the native GRE controller in CloudStack and extending it to support XCP and KVM. Nguyen from Vietnam has done an excellent job, quickly adding support for XCP thanks to his expertise with Xen. He is now putting the final touches on KVM support and building L3 services with OpenDaylight. The entire GRE controller was re-factored to be a plugin, similar to the Nicira NVP, MidoNet and BigSwitch BVS plugins. While native to CloudStack, this controller brings another SDN solution to CloudStack. I expect to see his merge request before pencils down for what will be an extremely valuable project.

While the CloudStack UI is great, it was actually written as a demonstration of how the CloudStack API could be used to build a user-facing portal. With the "new UI" project, Shiva Teja from India used Bootstrap and Angular to create a new UI. Originally the project suggested using Backbone, but after feedback from the community Shiva switched to Angular. Shiva's efforts are to be commended, as he truly worked on his own, with inconsistent network connectivity and no local mentoring. Shiva is a bachelor student and had to learn Bootstrap, Angular and also Flask on his own. It must be paying off, since he is interviewing with Amazon and Google for internships next summer. His code, being independent of the CloudStack code base, has been committed to master in our tools directory. This creates a solid framework for others to build on and create their own CloudStack UI.

Perhaps the most research-oriented project has been the one from Meng Han from Florida. This was no standard coding project, as it required not only learning new technologies (aside from CloudStack) but also investigating the Amazon EMR API. Meng had to implement EMR in CloudStack using Apache Whirr. Whirr is a Java library for provisioning virtual machines on cloud providers; it uses Apache jclouds and can interact with most cloud providers out there. Meng developed a new set of CloudStack APIs to launch Hadoop clusters on demand. At the start she had to learn CloudStack and install it, then learn the Whirr library, and subsequently create a new API in CloudStack which uses Whirr to coordinate multi-node deployments. Meng's code is working but still falls a bit short of our goal of having an AWS EMR interface. This is partly my fault, as this project could have used more mentoring. In any case, the work will go on and I expect to see an EMR implementation in CloudStack in the coming months.

All students faced the same challenge: not a code-writing challenge, but the OSS challenge, and specifically learning the Apache Way. Apache is about consensus and public discussions on the mailing list. With several hundred participants every month and very active discussion, the sheer amount of email traffic can be intimidating, and sharing issues and asking for help on a public mailing list is still a bit frightening. IRC, intense emailing, JIRA and git are basic tools in every Apache project, but they are seldom used in academic settings. Learning these development tools and participating in a project with over a million lines of code was the toughest challenge for the students, and the goal of GSoC. I am glad we got five students to join CloudStack this summer and tackle these challenges; if anything it is a terrific experience that will benefit their academic endeavors and later their entire careers. Great job Ian, Meng, Nguyen, Shiva and Dharmesh; we are not done yet, but I wish you all the best.

Wednesday, July 03, 2013

Apache Whirr and CloudStack for Big Data in the Clouds

This post is a little more formal than usual, as I wrote it for a tutorial on how to run Hadoop in the cloud, but I thought it was very useful so I am posting it here for everyone's benefit (hopefully).

When CloudStack graduated from the Apache Incubator in March 2013 it joined Hadoop as a Top-Level Project (TLP) within the Apache Software Foundation (ASF). This made the ASF the only open source foundation that contains both a cloud platform and a big data solution. Moreover, a closer look at the projects making up the ASF shows that approximately 30% of the Apache Incubator and 10% of the TLPs are "Big Data" related. Projects such as HBase, Hive, Pig and Mahout are sub-projects of the Hadoop TLP. Ambari, Kafka, Falcon and Mesos are part of the Incubator and all related to the Hadoop ecosystem.

To complement CloudStack, API wrappers such as Libcloud, Deltacloud and jclouds are also part of the ASF. To connect CloudStack and Hadoop, two interesting projects are also in the ASF: Apache Whirr, a TLP, and Provisionr, currently in incubation. Both Whirr and Provisionr aim to provide an abstraction layer to define big data infrastructure based on Hadoop and to instantiate that infrastructure on clouds, including Apache CloudStack based clouds. This co-existence of CloudStack and the entire Hadoop ecosystem under the same open source foundation means that the same governance, processes and development principles apply to both projects, bringing great synergy and promising an even better complementarity.

In this tutorial we introduce Apache Whirr, an application that can be used to define, provision and configure big data solutions on CloudStack based clouds. Whirr automatically starts instances in the cloud and bootstraps Hadoop on them. It can also add packages such as Hive, HBase and YARN for map-reduce jobs.

Whirr [1] is a "set of libraries for running cloud services" and specifically big data services. Whirr is based on jclouds [2]. Jclouds is a java based abstraction layer that provides a common interface to a large set of Cloud Services and providers such as Amazon EC2, Rackspace servers and CloudStack. As such all Cloud providers supported in Jclouds are supported in Whirr. The core contributors of Whirr include four developers from Cloudera the well-known Hadoop distribution. Whirr can also be used as a command line tool, making it straightforward for users to define and provision Hadoop clusters in the Cloud.

As an Apache project, Whirr comes as a source tarball and can be downloaded from one of the Apache mirrors [3]. Similarly to CloudStack, Whirr community members can host packages; Cloudera hosts Whirr packages to ease the installation. For instance, on Ubuntu and Debian based systems you can add the Cloudera repository by creating /etc/apt/sources.list.d/cloudera.list and putting the following contents in it:

deb [arch=amd64] http://archive.cloudera.com/cdh4/-cdh4 contrib 
deb-src http://archive.cloudera.com/cdh4/-cdh4 contrib

With this repository in place, one can install whirr with:

$sudo apt-get install whirr

The whirr command will now be available. Developers can use the latest version of Whirr by cloning the software repository, writing new code and submitting patches the same way that they would submit patches to CloudStack. To clone the git repository of Whirr do:

$git clone git://git.apache.org/whirr.git

They can then build their own version of whirr using maven:

$mvn install

The whirr binary will be located under the bin directory of the source tree. Adding it to one's path with:

$export PATH=$PATH:/path/to/whirr/bin

Will make the whirr command available in the user's environment. Successful installation can be checked by simply entering:

$whirr --help

With whirr installed, one now needs to specify the credentials of the cloud that will be used to create the Hadoop infrastructure. A ~/.whirr/credentials file has been created during the installation phase. The type of provider (e.g. cloudstack), the endpoint of the cloud and the access and secret keys need to be entered in this credentials file like so:

PROVIDER=cloudstack
IDENTITY=
CREDENTIAL=
ENDPOINT=

For instance on Exoscale [4], a CloudStack based cloud in Switzerland, the endpoint would be https://api.exoscale.ch/compute.
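
A filled-in `~/.whirr/credentials` for such a cloud would then look something like this (the identity and credential values are placeholders for your own API and secret keys):

PROVIDER=cloudstack
IDENTITY=<your API key>
CREDENTIAL=<your secret key>
ENDPOINT=https://api.exoscale.ch/compute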

Now that the CloudStack endpoint and keys have been configured, the Hadoop cluster that we want to instantiate needs to be defined. This is done in a properties file using a set of Whirr-specific configuration variables [5]. Below is the content of the file with explanations inline:

---------------------------------------
# Set the name of your hadoop cluster
whirr.cluster-name=hadoop

# Change the name of cluster admin user
whirr.cluster-user=${sys:user.name}

# Change the number of machines in the cluster here
# Below we define one hadoop namenode and 3 hadoop datanode
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker

# Specify which distribution of hadoop you want to use
# Here we choose to use the Cloudera distribution
whirr.env.repo=cdh4
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop

# Use a specific instance type.
# Specify the uuid of the CloudStack service offering to use for the instances of your hadoop cluster
whirr.hardware-id=b6cd1ff5-3a2f-4e9d-a4d1-8988c1191fe8

# If you use ssh key pairs to access instances in the cloud
# Specify them like so
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa_exoscale
whirr.public-key-file=${whirr.private-key-file}.pub

# Specify the template to use for the instances
# This is the uuid of the CloudStack template
whirr.image-id=1d16c78d-268f-47d0-be0c-b80d31e765d2
------------------------------------------------------

To launch this Hadoop cluster use the whirr command line:

$whirr launch-cluster --config hadoop.properties

The following example output shows the instances being started and bootstrapped. At the end of the provisioning, whirr returns the ssh commands that can be used to access the Hadoop instances.

-------------------
Running on provider cloudstack using identity mnH5EbKcKeJd456456345634563456345654634563456345
Bootstrapping cluster
Configuring template for bootstrap-hadoop-datanode_hadoop-tasktracker
Configuring template for bootstrap-hadoop-namenode_hadoop-jobtracker
Starting 3 node(s) with roles [hadoop-datanode, hadoop-tasktracker]
Starting 1 node(s) with roles [hadoop-namenode, hadoop-jobtracker]
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(b9457a87-5890-4b6f-9cf3-1ebd1581f725)
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(9d5c46f8-003d-4368-aabf-9402af7f8321)
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(6727950e-ea43-488d-8d5a-6f3ef3018b0f)
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-namenode_hadoop-jobtracker} on node(6a643851-2034-4e82-b735-2de3f125c437)
<< success executing InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(b9457a87-5890-4b6f-9cf3-1ebd1581f725): {output=This function does nothing. It just needs to exist so Statements.call("retry_helpers") doesn't call something which doesn't exist
Get:1 http://security.ubuntu.com precise-security Release.gpg [198 B]
Get:2 http://security.ubuntu.com precise-security Release [49.6 kB]
Hit http://ch.archive.ubuntu.com precise Release.gpg
Get:3 http://ch.archive.ubuntu.com precise-updates Release.gpg [198 B]
Get:4 http://ch.archive.ubuntu.com precise-backports Release.gpg [198 B]
Hit http://ch.archive.ubuntu.com precise Release
..../snip/.....
You can log into instances using the following ssh commands:
[hadoop-datanode+hadoop-tasktracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no sebastiengoasguen@185.xx.yy.zz
[hadoop-datanode+hadoop-tasktracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no sebastiengoasguen@185.zz.zz.rr
[hadoop-datanode+hadoop-tasktracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no sebastiengoasguen@185.tt.yy.uu
[hadoop-namenode+hadoop-jobtracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no sebastiengoasguen@185.ii.oo.pp
-----------

To destroy the cluster from your client do:

$whirr destroy-cluster --config hadoop.properties

Whirr gives you the ssh commands to connect to the instances of your Hadoop cluster. Log in to the namenode and browse the Hadoop file system that was created:

$ hadoop fs -ls /
Found 5 items
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:11 /hadoop
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:10 /hbase
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:10 /mnt
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:11 /tmp
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:11 /user

Create a directory to put your input data in:

$ hadoop fs -mkdir input
$ hadoop fs -ls /user/sebastiengoasguen
Found 1 items
drwxr-xr-x   - sebastiengoasguen supergroup          0 2013-06-21 20:15 /user/sebastiengoasguen/input

Create a test input file and put it in the Hadoop file system:

$ cat foobar 
this is a test to count the words
$ hadoop fs -put ./foobar input
$ hadoop fs -ls /user/sebastiengoasguen/input
Found 1 items
-rw-r--r--   3 sebastiengoasguen supergroup         34 2013-06-21 20:17 /user/sebastiengoasguen/input/foobar

Define the map-reduce environment. Note that this default Cloudera distribution installation uses MRv1; to use YARN one would have to edit the hadoop.properties file.

$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce

Start the map-reduce job:

$ hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar wordcount input output
13/06/21 20:19:59 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/06/21 20:20:00 INFO input.FileInputFormat: Total input paths to process : 1
13/06/21 20:20:00 INFO mapred.JobClient: Running job: job_201306212011_0001
13/06/21 20:20:01 INFO mapred.JobClient:  map 0% reduce 0%
13/06/21 20:20:11 INFO mapred.JobClient:  map 100% reduce 0%
13/06/21 20:20:17 INFO mapred.JobClient:  map 100% reduce 33%
13/06/21 20:20:18 INFO mapred.JobClient:  map 100% reduce 100%
13/06/21 20:20:21 INFO mapred.JobClient: Job complete: job_201306212011_0001
13/06/21 20:20:22 INFO mapred.JobClient: Counters: 32
13/06/21 20:20:22 INFO mapred.JobClient:   File System Counters
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of bytes read=133
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of bytes written=766347
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of read operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of write operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of bytes read=157
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of bytes written=50
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of read operations=2
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of write operations=3
13/06/21 20:20:22 INFO mapred.JobClient:   Job Counters 
13/06/21 20:20:22 INFO mapred.JobClient:     Launched map tasks=1
13/06/21 20:20:22 INFO mapred.JobClient:     Launched reduce tasks=3
13/06/21 20:20:22 INFO mapred.JobClient:     Data-local map tasks=1
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=10956
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=15446
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/06/21 20:20:22 INFO mapred.JobClient:   Map-Reduce Framework
13/06/21 20:20:22 INFO mapred.JobClient:     Map input records=1
13/06/21 20:20:22 INFO mapred.JobClient:     Map output records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Map output bytes=66
13/06/21 20:20:22 INFO mapred.JobClient:     Input split bytes=123
13/06/21 20:20:22 INFO mapred.JobClient:     Combine input records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Combine output records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce input groups=8
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce shuffle bytes=109
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce input records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce output records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Spilled Records=16
13/06/21 20:20:22 INFO mapred.JobClient:     CPU time spent (ms)=1880
13/06/21 20:20:22 INFO mapred.JobClient:     Physical memory (bytes) snapshot=469413888
13/06/21 20:20:22 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=5744541696
13/06/21 20:20:22 INFO mapred.JobClient:     Total committed heap usage (bytes)=207687680

And you can finally check the output:

$ hadoop fs -cat output/part-* | head
this 1
to  1
the  1
a  1
count 1
is  1
test 1
words 1

Of course this is a silly example of a map-reduce job, and you will want to run much more serious workloads. To benchmark your cluster, Hadoop comes with an examples jar.

To benchmark your Hadoop cluster you can use the TeraSort tools available in the Hadoop distribution. Generate 100 MB of input data with TeraGen (one million 100-byte rows):

$hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar teragen 1000000 output3

Sort it with TeraSort:

$ hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar terasort output3 output4

And then validate the results with TeraValidate:

$hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar teravalidate output4 outvalidate

Performance of map-reduce jobs run in cloud-based Hadoop clusters will be highly dependent on the Hadoop configuration, on the template and the service offering being used, and of course on the underlying hardware of the cloud. Hadoop was not designed to run in the cloud, and therefore some assumptions were made that do not fit the cloud model; see [6] for more information. Deploying Hadoop in the cloud, however, is a viable solution for on-demand map-reduce applications. Development work is currently under way within the Google Summer of Code program to provide CloudStack with an Amazon Elastic MapReduce (EMR) compatible service. This service will be based on Whirr or on a new Amazon CloudFormation compatible interface called StackMate [7].

[1] http://whirr.apache.org
[2] http://jclouds.incubator.apache.org
[3] http://www.apache.org/dyn/closer.cgi/whirr/
[4] http://exoscale.ch
[5] http://whirr.apache.org/docs/0.8.2/configuration-guide.html
[6] http://wiki.apache.org/hadoop/Virtual%20Hadoop
[7] https://github.com/chiradeep/stackmate