Please join us on 8th March at 5.00pm GMT for a live juju Charm School webinar:

To sign up click here: Juju charm school webinar. This charm school will cover how to create charms and use them with juju, and how to submit charms to the larger server community so that your software is easily deployable in the cloud. If you are deploying to the cloud or have software that you’d like to make easily available to Ubuntu Server users then this is the event for you!

Attendees are encouraged to watch the first webinar so that we can concentrate on more advanced topics for this Charm School.

Can’t make it? We’ve got in person Charm Schools throughout the year if you’re interested in attending, or you can just contact us.

The Ubuntu Server Survey is finally ready to be published, and it makes for a fascinating read. It is the third survey of its kind, and again the response has been overwhelming, with over 6,000 surveys completed throughout 2011. A heartfelt thanks to all who took the time to complete the comprehensive survey.

The overwhelming impression is of the widespread use of Ubuntu, both geographically, as you might expect with respondents from across the globe, and in the broad range of workloads in which Ubuntu Server finds itself used. Every category from web and data servers to cloud shows up strongly, albeit with a strong bias towards traditional workloads.

As we approach an LTS, we again see evidence of the popularity of the extended support releases. Given that we have now run this survey three times over the past three years, we begin to see strong evidence of switching from one LTS to the next, particularly as the deployment platform, so our user base is certainly staying with us as we introduce new features and support them in the long term.

Virtualization and cloud are now key elements of Ubuntu use, and for the first time we see KVM overtake Xen as the preferred virtualization technology for Ubuntu users, which is significant as the platform was the first to make the switch to supporting KVM as the native technology. That said, VMware remains the most cited virtualization technology, showing a healthy mixture of open source and other technologies in use in the Ubuntu user base.

The respondents’ consideration of cloud makes for interesting reading too. There is significant interest, but the use of Ubuntu Server on bare metal remains the primary use case for most users today. There is strong recognition, though, of the emergence of this powerful technology, and with the plans for ease of installation and orchestration in 12.04 LTS it will be interesting to see how this moves the dial with regard to uptake in the Ubuntu base. A deeper analysis shows a bias towards larger companies (i.e. respondents with more servers) using cloud technologies, which is to be expected, and overwhelmingly there is recognition of the suitability of Ubuntu Cloud as a basis for those efforts.

Enjoy the full report; it can be found here, and it would be very interesting to hear your comments.

 

Amazon Web Services (AWS), as the trailblazing provider of Infrastructure as a Service (IaaS), has changed the dialog about computing infrastructure. Today, instead of simply assuming that you’ll be buying and operating your own servers, storage and networking, AWS is always an option to consider, and for many new businesses, it’s simply the default choice.

I’m a huge fan of cloud computing in general and AWS in particular. But I’ve long had an instinct that the economics of the choice between self-hosted and cloud provider had more texture to it than the patently attractive sounding “10 cents an hour,” particularly as a function of demand distribution. As a case in point, Zynga has made it known that for economic reasons, they now use their own infrastructure for baseline loads and use Amazon for peaks and variable loads surrounding new game introductions.

An analysis of the load profiles

To tease out a more nuanced view of the economics, I’ve built a detailed Excel model that analyzes the relative costs and sensitivities of AWS versus self-hosted in the context of different load profiles. By “load profiles,” I mean the distribution of demand over the day/month as well as relative needs for bandwidth versus compute resources. The load profile is the key factor influencing the economic choice because it determines what resources are required and how heavily these resources are utilized.

The model provides a simple way to analyze various load profiles and allows one to skew the load between bandwidth-heavy, compute-heavy or any combination. In addition, the model presents the cost of operating 100 percent on AWS, 100 percent self-hosted as well as all hybrid mixes in between.

In a subsequent post, I will share the model and describe how you can use it for scenarios of interest to you. But for this post, I will outline some of the conclusions that I’ve derived from looking at many different scenarios. In most cases, the analysis illustrates why intuition is right (for example, that a highly variable compute load is a slam dunk for AWS). In other cases, certain high-sensitivity factors become evident and drive the economic answer. There are also cases where a hybrid infrastructure is at least worthy of consideration.

To frame an example analysis, here is the daily distribution of a typical Internet application. In the model, traffic distribution is an input from which bandwidth requirements are computed. The distribution over the day reflects the behavior of the user base (in this case, one with a high U.S. business-hour activity peak). Computing load is assumed to follow traffic according to a linear relationship, i.e. higher traffic implies higher compute load.

Note that while labor costs are included in the model, I am leaving them out of this example for simplicity. Because labor is a mostly fixed cost for each alternative, it will tend not to impact the relative comparison of the two alternatives. Rather, it will impact where the actual break-even point lies. If you use the model to examine your own situation, then of course I would recommend including the labor costs on each side.

For this example, to compute costs for Amazon, I have assumed Standard Extra Large instances and ELB load balancer for the Northern California region. The model computes the number of instances required for each hour of the day. Whenever the economics dictate it, the model applies as many AWS Reserved Instances (capacity contracts with lower variable costs) as justified and fills in with on-demand instances as required. Charges for data are computed according to the progressive pricing schedule that Amazon publishes. To compute costs for self-hosting, I assume co-location with the peak number of Std-XL-equivalent servers required, each loaded to no more than 80 percent of capacity. The costs of hardware are amortized over 36 months. Power is assumed to be included with rackspace fees. Bandwidth is assumed to be obtained on a 95th percentile price basis.
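The reserved-versus-on-demand decision described above can be sketched in a few lines. This is only an illustration of the idea, not the Excel model itself: the function name is hypothetical, and the rates are placeholders rather than Amazon's published prices. For each instance "slot" (the first instance, the second, and so on) it counts how many hours of the day that slot is busy, then reserves the slot only when the amortized reservation fee plus the lower hourly rate beats paying on-demand.

```python
def split_reserved_on_demand(hourly_instances, on_demand_rate,
                             reserved_rate, reserved_fixed_per_day):
    """Return (reserved_count, daily_cost) for one day of demand.

    hourly_instances: instances required in each hour of the day.
    on_demand_rate / reserved_rate: hypothetical dollars per instance-hour.
    reserved_fixed_per_day: upfront reservation fee amortized to one day.
    """
    peak = max(hourly_instances)
    reserved = 0
    cost = 0.0
    for slot in range(1, peak + 1):
        # Hours in which at least `slot` instances must be running.
        hours_used = sum(1 for n in hourly_instances if n >= slot)
        on_demand_cost = hours_used * on_demand_rate
        reserved_cost = reserved_fixed_per_day + hours_used * reserved_rate
        if reserved_cost < on_demand_cost:
            reserved += 1
            cost += reserved_cost
        else:
            cost += on_demand_cost
    return reserved, cost


# Placeholder demand: 2 instances for 16 hours, 6 during an 8-hour peak.
demand = [2] * 16 + [6] * 8
print(split_reserved_on_demand(demand, 1.0, 0.5, 6.0))
```

With these placeholder numbers the two always-on slots are worth reserving, while the four slots needed only during the peak are cheaper on demand.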

Now let’s look at a sensitivity analysis. Notice in the above example that a bit more than half of the total cost for each alternative is for bandwidth/data transfer charges ($35,144 for self-hosted at $8/Mbps and $36,900 for AWS). This is important because while Amazon pricing is fixed and published, 95th percentile pricing is highly variable and competitive.
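For readers unfamiliar with 95th percentile billing, here is a minimal sketch of how the billable rate is commonly derived: the provider samples throughput at regular intervals, discards the top 5 percent of samples, and bills on the highest remaining one. The function names are hypothetical, and the nearest-rank rounding used here is an assumption; providers differ in the details.

```python
def percentile_95(samples_mbps):
    """Billable rate: the highest sample left after dropping the top 5%."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    # Nearest-rank method with integer arithmetic: ceil(0.95 * n).
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def monthly_bandwidth_cost(samples_mbps, price_per_mbps):
    """Monthly charge at a negotiated price per Mbps."""
    return percentile_95(samples_mbps) * price_per_mbps


# Short bursts above the 95th percentile are not billed.
samples = [100] * 95 + [900] * 5
print(percentile_95(samples))
```

At the $8/Mbps used in the example, the $35,144 self-hosted bandwidth figure corresponds to a billable rate of 4,393 Mbps.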

The chart above shows total costs as a function of co-location bandwidth pricing. AWS costs are independent of this and thus flat. What this chart shows is that self-hosting costs less for any bandwidth pricing under about $9.50 per Mbps per month. And if you can negotiate a price as low as $4, you’d save more than 40 percent by self-hosting. I’ll leave discussion of the hybrid to another post.
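The shape of that chart follows from simple arithmetic: the self-hosted total is a fixed part plus the billable bandwidth times the negotiated price, while the AWS total is a constant. A rough sketch, using round placeholder inputs rather than the model's actual figures (the function names are hypothetical as well):

```python
def self_hosted_total(fixed_cost, billable_mbps, price_per_mbps):
    """Monthly self-hosted cost: non-bandwidth costs plus bandwidth."""
    return fixed_cost + billable_mbps * price_per_mbps


def break_even_price(aws_total, fixed_cost, billable_mbps):
    """Bandwidth price at which self-hosting costs the same as AWS."""
    return (aws_total - fixed_cost) / billable_mbps


def savings_fraction(aws_total, fixed_cost, billable_mbps, price_per_mbps):
    """Fraction of the AWS bill saved by self-hosting at a given price."""
    total = self_hosted_total(fixed_cost, billable_mbps, price_per_mbps)
    return 1 - total / aws_total


# Placeholder inputs, chosen only to illustrate the calculation.
aws_total, fixed_cost, mbps = 60000, 18000, 4000
print(break_even_price(aws_total, fixed_cost, mbps))
print(savings_fraction(aws_total, fixed_cost, mbps, 4))
```

Any bandwidth price below the break-even value favors self-hosting, and the savings grow linearly as the negotiated price falls.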

This should provide a bit of a feel for how I’ve been conducting these analyses. Above is a visual summary of how different scenarios tend to shake out. The intuitive conclusion that the more spiky the load, the better the economics of the AWS on-demand solution is confirmed. And similarly, the flatter or less variable the load distribution, the more self-hosting appears to make sense. And if you’ve got a situation that uses a lot of bandwidth, you need to look more closely at potential self-hosted savings that could be feasible with negotiated bandwidth reductions.

Charlie Oppenheimer is a serial-CEO and currently an executive-in-residence at venture-capital firm Matrix Partners. His most recent company, Digital Fountain, was acquired by Qualcomm, and his previous company, Aptivia, was acquired by Yahoo. He blogs at stratamotion.com






First let me make it clear that Open Cloud Initiative (OCI) is not dead and it is going to stay for a long time advocating openness. Also, I will fight all I can to keep it going. Having said that I am writing this post to re-emphasize something which I have been saying all along. I am also going to use this post as a reference whenever a myth is promoted in the social media circles on this topic. Unlike the cathedrals of proprietary vendors, all the debates about open source and other open topics occur in the open (pun intended). In the typical open source spirit, I will vent my thoughts (once again) on this topic here. For beginners, I have already written about this in my introductory post on OCI.

On the other hand, I am neutral because open source is included as an afterthought in the requirements. There are two schools of thought among those who advocate openness in the cloud world. One school, spearheaded by Tim O’Reilly, emphasizes open protocols, open formats, open architecture, etc. as the necessary conditions for openness. They claim that licensing is irrelevant in the cloud services world. The other school, slightly old-fashioned and in the minority, claims that open source is equally important in ensuring openness in the cloud-based world. I belong to the second group, and I have argued in favor of the importance of open source in the cloud world here and in other fora. For me, open source becomes a requirement because it is the only way we can have a more federated, interoperable cloud ecosystem. In the absence of open source, the barriers to participation become very high and we may face the prospect of a monopoly of cloud providers offering services.

I also highlighted the same thing in a talk at a Cloud Bootcamp in Santa Clara on the sidelines of Cloud Expo, and my slides from the talk can be found here.

Argument: When you move from software to services, open source doesn’t matter and only open standards matter

My counterargument: I do agree that open standards (open protocols, open formats, etc.) are the key to eliminating cloud lock-in. The biggest concern about large-scale cloud adoption is the risk of getting locked into proprietary clouds. Open standards are key to avoiding such lock-in. There is no doubt about it, and it is extremely important that we raise awareness about open standards so that cloud users are protected. However, dismissing open source as irrelevant is shortsighted at best. Yes, open standards might keep users from getting locked into a single vendor but, in the absence of open source, they will be locked into a handful of vendors. We saw what happened with the US wireless industry when only a handful of players met the needs of an entire country. They stymied innovation for a long time because they were hell-bent on protecting their existing cash cow rather than letting their services be used for further innovation. Open source doesn’t guarantee innovation in the technology field, but it lowers the barriers so much that it opens up the opportunity for others to get into the market, innovate and, more importantly, ensure that end users are not taken for a ride. Would we have seen the cloud as AWS introduced it to the world in the absence of open source licenses? Do you think Microsoft would have been flexible enough with its licenses to let Amazon develop a service that would eventually come back to bite it? Open source is critical for cloud computing and it is now helping, in the form of OpenStack, CloudFoundry and others, to ensure that there is not just a handful of cloud providers who could eventually grow their market power to stymie innovation like the US wireless companies. I strongly believe in demanding open standards, but it is quite possible to work around their absence if there is open source, a fact once again highlighted by this brilliant post by Brad Hedlund.
No, I am nowhere close to claiming that we don’t need to focus on open standards but I am only arguing that ignoring open source and focussing only on open standards is s h o r t s i g h t e d. Period.

Argument: Why would a consumer of a service need its source code?

My counterargument: The biggest problem with opponents and some proponents of open source is that they really don’t get it. Open source is not about consumption but about its power of enablement. Whether it is software or a service, it is the same. Even in the software world, not every user of open source software took the source code and looked at it. Only a small percentage of users, those who wanted to modify the source code to scratch their own itch, really used the code. It is clearly a case of enablement rather than consumption in the software world, and it is going to be the same in the services world. Consumers of services are not going to give a damn about source code, much like the consumers of software, but the availability of code is going to enable many providers to scratch the itch and offer services to meet the more diverse needs not addressed by the original set of service providers. My point is: it doesn’t matter whether we are talking about software or services; open source is an enabler of openness (and innovation) and, therefore, it is as critical as open standards.

Open standards are about not getting your data locked in, but open source is needed if you want to enable users to run their workloads after that. What is the point in having my data out of a provider if I don’t have the resources available (at a cost affordable to me) to have applications that can act on that data? A truly open cloud should allow me to not just take my data out but also give me opportunities to use the data elsewhere without being held hostage by any group (of providers). If the definition of open cloud doesn’t give me this opportunity, then it is meaningless as far as I am concerned.

Argument: But, hey, we demand that at least one implementation should be open source

My counterargument: This afterthought addition of open source in the open cloud definition is what frustrates me the most. I really, really don’t get this argument. Why would a proprietary cloud vendor spend critical resources (including tons of money) building an open source implementation just to get certified as an open cloud by OCI? If market pressures force the vendor to support open protocols, they will just enable that and satisfy the needs of their market. If the market pressure doesn’t exist, they will not give a damn about open source or open standards anyhow. Microsoft is a good example of market pressures, rather than some certification agency, forcing a vendor to open up. If, instead, OCI puts open source at the center, along with open standards, of the very definition of open clouds, it will at least motivate the large open source cloud ecosystem (it is growing by leaps and bounds every day) to get certified by OCI. Believe me, I have spoken to at least 5 service providers and platform vendors in this open source cloud ecosystem and they just don’t care about OCI for the very reasons I have highlighted above. They feel that they need not get OCI certified to be seen as a player embracing openness. I am pretty sure this is the thinking with many others in that ecosystem.

Argument: What OCI has is the middle ground that will help bring proprietary cloud vendors on board

My counterargument: As I said above, what is the incentive for them to come to OCI? If a company believes in the proprietary approach (believe me, it is not a wrong approach at all and what matters is that customers should have choices and proprietary software is one such choice), why would they even worry about openness unless there is market pressure? When there is market pressure they will anyhow adopt open standards and meet the needs. They really don’t give a damn about embracing the openness mantra through OCI certification. However, this approach of OCI is a big put-off for companies which have openness at heart and have open source at the core of their clouds. In today’s world, they are a big part of the cloud ecosystem and they feel OCI is not needed to showcase their openness because they have open source in their DNA. OCI can create the market pressure needed to force proprietary cloud vendors to embrace open standards ONLY if it can convince these open source cloud vendors to come on board in large numbers. Why am I not hearing any excitement about OCI in the OpenStack community? The only group that will really benefit from this “middle ground” is those proprietary vendors who are lagging behind in the marketplace but want to use the openness mantra to catch up. Yes, the biggest beneficiaries will be those who want to open wash.

If OCI’s intention is to put pressure on proprietary cloud providers to open up, they are doing it all wrong because whatever they are doing with this so-called “middle approach” is not going to add the necessary market pressure. Rather, it has the danger of making OCI irrelevant as more and more open source providers jump in and create the market pressure on their own. I really want OCI to succeed but my efforts to make them see the larger picture are not making a dent. This blog post is my attempt to get the larger community to put pressure on OCI to really open up.

Tear down that wall Mr. Johnston!!

During the Ubuntu precise development cycle the Canonical Platform Server Team have been working on automating testing of Openstack on Ubuntu.

The scope of this work was:

  1. Per-commit testing of Openstack trunk to evaluate the current state of the upstream codebase in conjunction with the current packaging in Ubuntu precise and the current Juju charms to deploy Openstack.
  2. SRU testing for Openstack Diablo on Ubuntu 11.10.

Openstack do a lot of pre-commit testing through the use of gerrit with Jenkins; we wanted to supplement this with Ubuntu focused testing to provide another dimension to the testing already completed upstream.

So grab a coffee and make yourself comfortable; this is not a short read….

Lab Setup

The Ubuntu Openstack QA lab consists of 12 servers; the primary server in the solution is an Ubuntu 11.10 install providing the following functions:

  1. Juju – used to deploy Openstack charms in the Lab
  2. Cobbler to support server provisioning (using the Ubuntu Orchestra packages in Oneiric)
  3. Jenkins CI – provides triggering based on upstream commits to github repositories and general job control and reporting.
  4. Schroots for Oneiric and Precise for building packages locally
  5. A reprepro managed local archive for Oneiric and Precise
  6. Squid based archive caching to reduce installation times in the lab

This server also acts as the gateway into and out of the Lab (it’s set up as a NAT router).

The other 11 servers are registered in Cobbler. All servers are connected to a Sentry CDU (Cabinet Distribution Unit) which allows full power control from Cobbler; thanks go to Andres Rodriguez for developing the required fence component for Cobbler to support this type of CDU.

Preseeded LVM Snapshot Installs

Initiating a new integration test run requires all machines to be powered down and re-provisioned from scratch. It is essential that our deployment and test runs can cope with the frequency of upstream commits, particularly as the frequency increases as Openstack approaches milestones and releases. After getting the initial lab setup in place, we were able to tear down all machines, re-provision and deploy Openstack in ~30 minutes.

It was important that we were able to minimize the time taken to complete the testing cycle. To do so, we’ve employed LVM snapshotting and restoration of the root partition during the netboot installation. The process is as follows:

  1. Test run begins
  2. Juju deploys a service (e.g. nova-compute)
  3. A machine is netbooted and a preseeded LVM-based Ubuntu installation takes place onto /dev/qalab/root
  4. At the end of the installation, the root filesystem is moved to /dev/qalab/pristine-[release]-root and a snapshot created at /dev/qalab/root
  5. The machine reboots, runs Juju and deploys nova-compute as part of the rest of the Openstack deployment. This deployment is smoke tested.
  6. The next test run begins. All machines are terminated. Juju redeploys nova-compute, a machine is netbooted and Ubuntu installation kicks off.
  7. The installation checks for the existence of a logical volume at /dev/qalab/pristine-[release]-root. If it exists, it creates a new snapshot at /dev/qalab/root and reboots. If it does not, it continues with the installation and goes to step 4.
  8. System reboots, Juju installs and redeploys nova-compute to a fresh Ubuntu installation.
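The decision made in steps 4 and 7 can be sketched as a small planning function. This is an illustration only: the real logic lives in the debian-installer preseeds and Cobbler snippets, the function name is hypothetical, the snapshot size is a made-up default, and removing a stale snapshot before re-creating it is an assumption about how the fast path is kept repeatable. The first entry in the slow path is a placeholder, not a real command.

```python
def plan_node_install(existing_lvs, release, vg="qalab", snapshot_size="10G"):
    """Return the ordered steps for provisioning one node.

    existing_lvs: names of logical volumes already present in the group.
    snapshot_size: hypothetical copy-on-write space for the snapshot.
    """
    pristine = f"pristine-{release}-root"
    make_snapshot = ["lvcreate", "--snapshot", "--name", "root",
                     "--size", snapshot_size, f"/dev/{vg}/{pristine}"]

    if pristine in existing_lvs:
        # Fast path (step 7): skip the install, just restore a clean root.
        steps = []
        if "root" in existing_lvs:
            # Assumption: the snapshot dirtied by the last run is dropped.
            steps.append(["lvremove", "--force", f"/dev/{vg}/root"])
        steps += [make_snapshot, ["reboot"]]
        return steps

    # Slow path (steps 3 and 4): full install, then keep a pristine copy.
    return [
        [f"<full preseeded install onto /dev/{vg}/root>"],  # placeholder
        ["lvrename", vg, "root", pristine],
        make_snapshot,
        ["reboot"],
    ]


for step in plan_node_install({"root", "pristine-precise-root"}, "precise"):
    print(" ".join(step))
```

Returning the plan as data rather than executing it keeps the sketch safe to run anywhere and makes the two paths easy to compare.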

This process takes place on all nodes in parallel.  With it in place, we were able to cut down the time it took to tear-down and re-provision a node from ~30 minutes to 10 to 15 minutes depending on the service being deployed.

By taking this approach we also minimize the chance of any nodes hitting an archive inconsistency during installation. This is a known issue when deploying the development release and halts installation on any node that hits it, failing the entire deployment.

All of this is embedded in debian-installer preseeds via Cobbler snippets.  The snippets and kick starts are available at lp:~openstack-ubuntu-testing/+junk/cobbler-lvm-snapshot.

In the future, we’ll be investigating the use of kexec as an alternative to reboot after snapshot restoration to reduce the time spent waiting on servers to boot.  This should minimize the test cycle even more. Credit to James Blair for the idea (see http://amo-probos.org/post/11).

Management of Jenkins

All of the projects in Jenkins are managed using Jinja2 XML templates in conjunction with python-jenkins; this makes it really easy to set up new jobs in the lab and reconfigure existing ones as required (as well as providing a great backup!).
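The template-driven approach can be sketched roughly as follows. To keep the example free of third-party dependencies, the standard library's string.Template stands in for Jinja2 here, and the XML is a deliberately tiny, made-up job definition rather than one of the lab's real templates. The sync_job helper is hypothetical; it expects an object offering job_exists, create_job and reconfig_job, which are methods that python-jenkins's Jenkins class provides.

```python
from string import Template

# Stand-in for a Jinja2 template; the job body is illustrative only.
JOB_TEMPLATE = Template("""<?xml version='1.0' encoding='UTF-8'?>
<project>
  <description>Build $component from upstream $branch</description>
</project>
""")


def render_job(component, branch="master"):
    """Fill in the template for one component."""
    return JOB_TEMPLATE.substitute(component=component, branch=branch)


def sync_job(server, name, config_xml):
    """Create the job if it is missing, otherwise reconfigure it."""
    if server.job_exists(name):
        server.reconfig_job(name, config_xml)
        return "reconfigured"
    server.create_job(name, config_xml)
    return "created"


print(render_job("nova"))
```

Because the same call handles both new and existing jobs, re-running the script over every template is idempotent, which is what makes the templates usable as a backup of the Jenkins configuration.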

Templates and management scripts can be found in lp:~openstack-ubuntu-testing/+junk/jenkins-qa-lab

Testing Openstack Essex on Ubuntu Precise

This testing was the first to be setup in the lab.  Jenkins (using the git plugin) monitors the upstream github.com repositories for commits on the master branch.  When a change is detected the following process is triggered:

Build

Objective: Validate that upstream trunk still builds OK with current packaging for Ubuntu.

  1. A new snapshot upstream tarball is generated based on the latest commit to the upstream component.
  2. The latest archive packaging for the component is pulled in from lp:~ubuntu-server-dev/<COMPONENT>/essex
  3. Any changes in the testing packaging for the component are merged from lp:~openstack-ubuntu-testing/<COMPONENT>/essex
  4. New changelog entries are automatically created for the new upstream commits.
  5. The source package is generated and built in a clean schroot using sbuild locally.

On the assumption that the package built OK locally:

  1. The source package is uploaded to the Testing PPA (ppa:openstack-ubuntu-testing/testing)
  2. The testing packaging branch is pushed back to lp:~openstack-ubuntu-testing/<COMPONENT>/essex.
  3. The binary packages from the sbuild are installed into the local reprepro managed archive.

This process is managed by a single script (tarball.sh); credit to Chuck Short for pulling together this part of the process based on work from Openstack upstream.

For changes to the nova project the deploy phase is then executed.

Deploy

Objective: Validate that packages install, can be configured and reach a known good state prior to execution of testing.

This phase of testing uses Juju with Cobbler to deploy Openstack into the QA lab infrastructure. It utilizes branches of the Openstack charms to support use of a local archive, along with a deployer wrapper around Juju written by Adam Gandelman which executes the actual deployment using Juju and monitors for errors.

The deployer is configured to know where to get the right codebase for the Openstack charms, which services to deploy and which relations to setup between services. As you can see from the above diagram this is non-trivial but the charms and Juju do most of the hard work.
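To give a feel for what such a configuration boils down to, here is a minimal sketch that turns a list of services and relations into the Juju commands a wrapper might issue. It is not Adam's deployer and does not reflect its configuration format; the function name, the service selection and the relation shown are illustrative assumptions.

```python
def plan_deployment(services, relations):
    """Return the juju commands for a set of services and relations.

    services: charm names to deploy, in order.
    relations: (service_a, service_b) pairs to relate after deployment.
    """
    endpoints = {end for pair in relations for end in pair}
    unknown = sorted(endpoints - set(services))
    if unknown:
        # Catch typos before any machine is provisioned.
        raise ValueError(f"relations reference undeployed services: {unknown}")

    commands = [["juju", "deploy", name] for name in services]
    commands += [["juju", "add-relation", a, b] for a, b in relations]
    return commands


# Illustrative subset; a full Openstack deployment has many more relations.
for cmd in plan_deployment(["keystone", "glance", "nova-compute"],
                           [("glance", "keystone")]):
    print(" ".join(cmd))
```

Validating the relation endpoints up front is a cheap guard in a lab where each failed run costs a full re-provisioning cycle.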

Once Openstack is deployed successfully the test phase is then executed.
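The gating between the build, deploy and test phases can be summarized in a short sketch. In the lab this flow is driven by Jenkins jobs, so the function below is only a hypothetical model of the control flow: each phase is a callable returning True on success, and only projects listed in deploy_projects go on to a deployment.

```python
def run_pipeline(project, phases, deploy_projects=("nova",)):
    """Run build, then deploy and test when each earlier phase passes.

    phases: dict with "build", "deploy" and "test" callables.
    Returns a dict of the phases that ran and whether they succeeded.
    """
    results = {"build": phases["build"]()}
    if not results["build"] or project not in deploy_projects:
        return results

    results["deploy"] = phases["deploy"]()
    if results["deploy"]:
        # Tests only make sense against a healthy deployment.
        results["test"] = phases["test"]()
    return results


always_ok = {"build": lambda: True, "deploy": lambda: True, "test": lambda: True}
print(run_pipeline("nova", always_ok))
print(run_pipeline("glance", always_ok))
```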

Test

Objective: Validate that the Openstack deployment in the lab actually works!

At this point, we can run any integration tests we wish against the newly deployed cloud.  This testing is able to help us achieve multiple goals:

  • Early detection of upstream bugs that break Openstack functionality on Ubuntu
  • Verification that packaging branches in the development version of Ubuntu are compatible with upstream trunk.
  • Using these packages, verification that our Juju charms are deploying a functional Openstack cloud and are up-to-date with any deployment-related configuration changes upstream.

At the moment this phase looks like this:

  1. Configure the Openstack deployment (Adam’s deployer script provides some utility functions for locating specific services in the environment)
    • Create network configuration in Nova for the private instance network as well as a pool of public floating IPs.
    • Upload an image into the Glance server for use during testing.
    • Create EC2 credentials in the Keystone server for use during testing.
  2. Run the devstack exercise test scripts which ensure basic functionality of the deployment. Currently, this includes:
    • Basic euca-tools EC2 API for starting and stopping instances
    • EC2 AMI bundle uploads
    • Floating IP allocation, association and connectivity to instance
    • Volume creation and attachment to instance

Note: These are the same sets of tests that are currently run against proposed commits to gerrit upstream.

Longer term we aim to use the Openstack Tempest test suite in the lab; Adam is currently working on getting this up and running.

Reporting

The Jenkins instance in the QA lab is not publicly accessible; however all jobs run in the lab are published out (using the Jenkins build-publisher plugin) to http://jenkins.qa.ubuntu.com so that people can see the current state of the testing packaging in Ubuntu precise.

We are also working on setting up email notifications.

Success so far

Juju charms deploy Openstack components in a configuration that is compatible with upstream trunk prior to updates to packaging in Ubuntu.  Previously packages were updated in the archive first while Juju charm updates lagged behind as incompatibilities were uncovered after the fact.

We enabled automated testing 2 days prior to the 3rd Essex milestone release. We were able to uncover and help fix a handful of bugs upstream before the release, including critical bugs like 921784. In the past, these bugs were typically uncovered after the release (both upstream and in Ubuntu).

Since E3, there have been even more critical bugs uncovered by this testing and fixed upstream, some of which are only applicable to Ubuntu-specific configurations (not tested upstream) and would have been uncovered by users after code hit the Ubuntu archive (See 922232).

Further Plans for the Lab

Pre-commit testing of changes to stable branches: the Ubuntu Server team are working upstream on maintaining the stable branches of released versions of OpenStack. This work will validate patches proposed to stable branches in review.openstack.org against the current version of the packaging in released versions of Ubuntu. Initially this will target Diablo on Ubuntu 11.10 but will also support Essex on Ubuntu 12.04 once released. Ideally the testing process will provide feedback on review.openstack.org to help the stable release team review proposed patches.

References

Jenkins job configurations: lp:~openstack-ubuntu-testing/+junk/jenkins-qa-lab

Scripts supporting the lab: lp:~openstack-ubuntu-testing/+junk/jenkins-scripts

LVM snapshot preseeds and Cobbler snippets: lp:~openstack-ubuntu-testing/+junk/cobbler-lvm-snapshot

All other relevant scripts, charm branches, etc: https://code.launchpad.net/~openstack-ubuntu-testing/

Credits

Overall management of delivery and general whip cracking: Dave Walker

Lab installation and base configuration: Pete Graner, Tim Gardner, Brad Figg, James Page

Fence agent for network power control of servers: Andres Rodriguez

Source package creation and build process: Chuck Short and James Page

Deployment testing using Juju: Adam Gandelman

Testing of Openstack: Adam Gandelman

Jenkins packaging, configuration and management: James Page

Gerrit Plugin for pre-commit testing and generally great ideas: Monty Taylor and James Blair

Writing and reviewing this post: Adam Gandelman, Chuck Short and Dave Walker.

Big data is among the hottest trends in IT right now, and Hadoop stands front and center in the discussion of how to implement a big data strategy. There’s just one problem that keeps cropping up: many people don’t seem to know exactly what it means when somebody says “Hadoop.”

The problem surfaced again Monday in the form of complaints over Forrester’s new report titled “Enterprise Hadoop Solution, Q1 2012.” InformationWeek spoke with a few vendors that didn’t like how their products were assessed, and database industry analyst Curt Monash says the report “compares apples, peaches, almonds, and peanuts.” I thought the same thing when I saw a copy of the report last week. They all focus on Hadoop, but Hortonworks is not Datameer is not HStreaming.

Allow me to explain. Hopefully, this provides a foundation for parsing what people talk about when they talk about Hadoop, and for differentiating one type of product from another.

What Hadoop is

I went into this in more detail in a GigaOM Pro report published last March (sub req’d), but the long and short is that Hadoop is, at its core, an Apache Software Foundation project consisting of two primary subprojects — Hadoop MapReduce and the Hadoop Distributed File System. MapReduce is the parallel-processing engine that allows Hadoop to churn through large data sets in relatively short order. HDFS is the distributed file system that lets Hadoop scale across commodity servers and, importantly, store data on the compute nodes in order to boost performance (and potentially save money). These are the two must-have components for any Hadoop distribution.
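For readers who have never seen the programming model, a toy word count shows the three steps that MapReduce is named for. This is plain single-process Python, not the Hadoop API, and the function names are made up; in a real cluster the map and reduce steps run in parallel on the nodes that hold the data in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(word_count(["big data", "Big Hadoop"]))
```

Because each map call sees only one record and each reduce call sees only one key, the work can be split across as many machines as there are records and keys, which is where the scalability comes from.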

There are also a number of Apache projects related to Hadoop, often built atop either Hadoop MapReduce or HDFS. These include — but are not limited to — Hive and Pig, two SQL-like query languages to provide data-warehouse-like capabilities to a Hadoop cluster, and HBase, a NoSQL database that leverages HDFS as its distributed storage engine.

Hadoop distributions

These are packaged software products that aim to ease deployment and management of Hadoop clusters compared with simply downloading the various Apache code bases and trying to cobble together a system. Presently, Cloudera, Hortonworks, MapR and EMC all offer their own Hadoop distributions. Although they’re all unique (sometimes very unique, as with MapR’s proprietary file system), they all package a set of Hadoop projects (MapReduce, Hive, Sqoop, Pig, etc.) in a way that, in theory, makes them integrate more naturally and run both smoothly and securely.

Many Hadoop distributions integrate with various data warehouses, databases and other data-management products, with the goal of moving data between Hadoop clusters and other environments so each might process or query data stored in the other.

Hadoop management software

Just as the wording implies, Hadoop management software is designed to make it easier to manage and troubleshoot a Hadoop cluster. Such products are usually sold or offered by companies peddling Hadoop distributions, because even when commercially packaged, Hadoop is still a complex architecture and somewhat foreign to most IT personnel and products. However, third parties such as Platform Computing (now part of IBM) and Zettaset also sell software for managing Hadoop clusters, and their products are typically agnostic as to what distributions they support.

But distributions and management software are all about the infrastructure and the platform. Anyone actually wanting to use Hadoop still needs to know how to write applications that leverage the underlying architecture.

Hadoop application software (or, products that use Hadoop)

The Hadoop ecosystem gets really complex when we start looking at products that exist to help developers write Hadoop applications or otherwise analyze data stored within Hadoop in a manner other than writing traditional MapReduce jobs. These range from abstraction layers such as Karmasphere Analyst or IBM InfoSphere BigInsights, to Hadapt, which offers a single-platform product fusing a SQL data warehouse with a Hadoop cluster, to HStreaming, which promises real-time processing and analytics.

The one common thing among all these products, however, is that they are not Hadoop distributions, but sit atop platform software from Hortonworks, EMC or whomever. Some products that get thrown into the Hadoop fray, such as Outerthought Lily or Drawn to Scale Spire, are essentially scale-out databases built atop HBase (which itself is a separate project built atop HDFS). The image below, from Karmasphere, gives a particularly clear map of how a Hadoop environment might look.

The applications and analytics space is probably where we’ll see the biggest influx of new companies, as writing Hadoop applications is still tough, but it’s also how companies will actually start experiencing direct business benefits. In fact, it’s these types of higher-level products that are the focal point of Accel Partners’ new big data fund.

Related research and analysis from GigaOM Pro:
Subscriber content. Sign up for a free trial.




In this blog post the Launchpad team uses juju to deploy oops-tool, a Django-based tool that aggregates bug reports for the Launchpad project.

We typically talk about services that people commonly deploy, such as MediaWiki or WordPress. However, there is another use case for juju that is just as powerful: as a tool that helps you iterate faster on whatever you’re working on. oops-tool is not a general tool that most people will want to use; it’s very specialized.

However, the Launchpad team have encapsulated their service in a charm, so anyone can now deploy oops-tool in four commands. Think about a project you and your team are working on, the complexities of that service, and how wonderful it would be if anyone on any team could deploy any service in your project’s code base with that kind of ease. You’re codifying the management of your service so that as you work on a feature branch you can deploy, test, and then iterate.

juju strives to deploy your service in the same way that people strive to have their software build in one set of processes, but it’s more than just that. Deploy-and-forget is nice, but being able to manage a service over its lifetime is what people need in the cloud and you can do that with a juju charm.

Launchpad provides a myriad of services, and we’ll keep you posted on how that team is using juju to simplify their processes. Got more questions about juju and how we can help you manage in the cloud? Feel free to Contact Us and ask questions!

The biggest topic at this year’s Southern California Linux Expo (SCALE) conference was the OpenStack™ project. Everyone came away from the show appreciating that OpenStack is building momentum and is only going to get bigger and more popular. Jim Ash and Andrei Matei from the HP Cloud Services team stayed busy talking with people and signing them up for our private beta (HP Cloud Compute and HP Cloud Object Storage). To the SCALE attendees who gave us their opinions, HP’s involvement means that OpenStack will be a serious, viable option for businesses of all sizes and for developers who want a real choice in the market that competes with the existing proprietary cloud options.

People at the conference wanted to know more about the links between HP, OpenStack technology, Linux, and other open source projects. In a nutshell, OpenStack technology is the open source, open API, open development, and open orchestration layer powering HP Cloud Services. And OpenStack technology is built on Linux and open source technology. The OpenStack project and offerings like HP Cloud Services that integrate OpenStack technology bring open source technology and ideals to businesses of all sizes. We were excited about the warm reception HP Cloud Services got from people with a broad range of backgrounds in Linux and cloud – and from developers from all kinds of companies, from the smallest organizations to the largest enterprises.

AppFog Founder and CEO Lucas Carlson isn’t shy about touting platform-as-a-service as the ideal way for developers to access cloud computing resources, but he isn’t blind either. Although PaaS has been around for a couple years now and has already spurred hundreds of millions in M&A spending, Carlson knows it’s nowhere near the mainstream yet.

Carlson lays out his version of the evolution of cloud computing in the infographic below. Right now, API-based infrastructure-as-a-service offerings like those from Amazon Web Services and SysOps (or DevOps) tools are developers’ best friends in the cloud. Application-lifecycle platforms such as Cloud Foundry (the VMware-run open source project on which AppFog is built) and Red Hat’s OpenShift are poised to reach critical mass in 2012, whereas so-called “NoOps” platforms such as AppFog and Heroku will reach that point in 2013.

During a recent phone call, Carlson told me PaaS is the model of the future, not the present, because only about 2 to 4 percent of developers — the ones on the cutting edge — are actually using it right now. “As interesting as PaaS is, the majority of developers … have some very real concerns that are holding them back from actually going forward,” Carlson said.

Aside from illustrating the evolution of cloud-development tools, Carlson said the infographic also aims to clearly delineate the different layers of the cloud stack, something he opined on in a December blog post. PaaS isn’t a feature of IaaS, he explained, but “a full reinvention from the ground up.” Every layer has to fully understand the layers below because they must manage them, but the user experience and the resulting increase in developer productivity are what make the service.


For online auction powerhouse eBay, big data is serious business. The company has 100 million active users globally, 300 million live listings at any time (and it archives them all), receives 2 billion page views daily, and handles 250 million search queries and 75 billion database calls a day. How does eBay make sense of all this activity? With Hadoop, of course.

What a customer (or engineer) wants

Hugh Williams is VP of experience, search and platforms at eBay. His team is responsible for the entire eBay experience from the moment users hit the site until the moment they make a purchase, from code to data center automation to building new picture-hosting platforms. If it has to do with driving traffic to eBay and improving the customer experience, Williams’ team builds it. But in order to know what to build and how to build it, the team needs insight into what customers want and what they’re doing.

In order to figure this out, eBay first has to give its analysts and engineers the tools they want. It does this by operating a two-pronged big data attack consisting of a massive Teradata data warehouse and a fast-growing Hadoop environment. Financial analysts like SQL and more of a WYSIWYG experience, Williams said, which is why Teradata is so important. However, the majority of his engineers love Hadoop (which stores and processes unstructured data such as server logs, click-throughs and search queries) and make “enormous use” of it.

Huge data

Whichever one you’re talking about, Williams says eBay’s traffic volumes produce huge data, not just big data. In late 2010, eBay predicted its Teradata deployment would grow from about 10 petabytes to 20 petabytes (or 20,000 terabytes, equivalent to about 266 years’ worth of HD video) within a year. Its Hadoop environment is currently storing between 9 and 10 petabytes, according to Williams, and is always growing. In fact, the Hadoop environment doubled in size in the past year, in part from more user data streaming in and in part from analysts running lots of Hadoop jobs and creating new, larger data sets that also remain in the system.
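As a rough sanity check on the HD-video comparison, the short calculation below works out the video bitrate that the figure implies. The source does not state a bitrate, so this is only a back-of-the-envelope sketch: it assumes decimal petabytes (consistent with "20,000 terabytes") and continuous playback, and the function name `implied_bitrate_mbps` is hypothetical. Under those assumptions the result comes out at roughly 19 Mbit/s.

```python
PETABYTE_BYTES = 10**15                    # decimal petabyte (assumption)
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60   # average year, including leap days


def implied_bitrate_mbps(petabytes, years):
    """Bitrate (in Mbit/s) at which `years` of video would fill `petabytes`."""
    total_bits = petabytes * PETABYTE_BYTES * 8
    return total_bits / (years * SECONDS_PER_YEAR) / 1_000_000


if __name__ == "__main__":
    print(f"{implied_bitrate_mbps(20, 266):.1f} Mbit/s")
```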

“What we really use Hadoop for is to understand our customers and their needs,” Williams said. This happens both at a broad scale (say, improving the accuracy of its search engine) and also more narrowly around building specific features the data suggests customers would want. For example, Williams explained, Hadoop has proven helpful in deciphering patterns of misspelled words, so now eBay’s search engine knows to look instead for an actual word or product when users type certain queries incorrectly. Between those two extremes, Williams said, Hadoop helps eBay find out a lot about how it’s different and how it can become more distinctive by letting his team churn through those petabytes of unstructured data to uncover trends.

More than MapReduce

Beyond Hadoop’s sweet spot as a batch-processing engine using its native MapReduce framework (i.e., processing large data sets), Williams said eBay is also expanding its own Hadoop usage rather heavily into HBase, the NoSQL database that’s also an Apache Software Foundation project and leverages the Hadoop Distributed File System. HDFS, which is the default storage layer for Hadoop, also serves as the storage layer for HBase, which doesn’t process data like MapReduce but lets users quickly read from and write to large unstructured data sets.

HBase is already a piece of eBay’s new search engine, and Williams said there are few sites using it in production at eBay’s scale. Facebook is another site already making major use of HBase. Williams said HBase is fantastic, but it’s also the area within the Hadoop ecosystem where he’d like to see the most improvement. It’s fundamentally real-time, he explained, which is great, but eBay had to do a lot of work to make HBase scale and to make it fault-tolerant. Building a self-healing system out of Hadoop subprojects was very challenging.

Actually, Williams is generally excited about NoSQL, which refers to non-relational database technologies, as a way to handle eBay’s high traffic in data not necessarily ideal for traditional databases. “Cassandra and MongoDB are other great examples of the latest, innovative technologies for managing large data sets that we’re excited about at eBay,” he said.

Open source all the way … probably

For all its benefits, Williams acknowledges Hadoop can be a tough technology to learn, but any blood, sweat and tears are worth it to ensure his team really understands the data platform that underpins so much of eBay. “[T]o put it to its full potential, we have to be experts in it,” Williams said, a level of expertise that can really only come via open-source software that lets engineers “roll up [their] sleeves and [get] into the source code.”

Still, any sort of decision is the result of collaboration between the business team and the technology team, so Williams says he keeps an open mind as to how eBay’s big data environment might evolve. Right now it’s Teradata and Hadoop, but “I can imagine that landscape changing,” Williams said.

In October, we covered comments from eBay Senior Director of E-commerce Darren Bruntz, who said he would like to move to a single data platform and that he’d like to see “more focus and energy” from the Hadoop community. Asked at the time about whether such a platform is possible, Teradata Labs President Scott Gnau told me it’s not possible now — at least if you want all the advanced SQL analysis features of a product like Teradata for structured data — but that it might be in the future.

And although Teradata now has a product in Aster Data Systems that is something of a replacement for Hadoop, Gnau said “Hadoop or son of Hadoop or something else” will always be a big piece of the big data space because it has so much momentum and such a sweet spot around search and batch processing of unstructured data.

EBay’s Williams, though, maintains the sentiment of his team members will remain a major factor in any decision regarding the company’s data platform. “For a new platform to succeed, our technologists would have to be passionate about the platform, and the platform would have to enable us to innovate faster to build products for eBay’s customers,” he said. “If a new technology helps us achieve that goal, we would certainly evaluate the benefits.”

We’ll be talking a lot more about Hadoop, NoSQL and where they’re headed at our Structure: Data conference, which takes place March 21-22 in New York City. Speakers include some of the biggest names and brightest stars in the space, all of whom are trying to push the limits of what organizations can do with all the data they collect.
