Map Reduce is dead, long live Spark!

That’s the impression I, and I think most people attending the conference, walked away with after Strata NY 2014. Most of the interesting presentations centered on Spark; only the corporate IT presentations about “in-progress Hadoop implementations” were about Map Reduce.

So who’s working on Spark?  Cool startups and vendors (preparing for enterprise IT departments to move on to Spark in a year or two).

Who’s working on Map Reduce? Corporate IT departments migrating off legacy BI systems onto the promised land of Hadoop (a dream come true or a nightmare around the corner; I am not sure which it will be for them).

It makes sense. Map Reduce has been tested and is “safe” now for enterprise IT teams to start deploying into production systems. Spark is still very new and largely unproven, too risky for a Fortune 500 to bet on when replacing legacy systems with a still-in-diapers open source “solution.” Nonetheless, I am sure every technical worker will be drooling to prototype or build proofs of concept with Spark after this conference.
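
To make the contrast concrete, here is the kind of quick proof of concept people will be itching to try: a word count in PySpark. This is a minimal sketch with a placeholder input path, not anything shown at the conference, but it captures why Spark prototypes are so tempting compared to the boilerplate of a Java Map Reduce job.

    # Minimal PySpark word count sketch (Spark 1.x-era API); the input path is a placeholder.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "WordCountPOC")

    counts = (sc.textFile("hdfs:///data/sample.txt")   # hypothetical input file
                .flatMap(lambda line: line.split())    # split each line into words
                .map(lambda word: (word, 1))           # emit (word, 1) pairs
                .reduceByKey(lambda a, b: a + b))      # sum the counts per word

    for word, count in counts.take(10):                # peek at a few results
        print(word, count)

    sc.stop()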

Reflections on Strata NYC 2014

I had a chance to attend Strata in New York back in October. I had been wanting to attend Strata for a few years but had not had the chance until now. A few impressions, in the form of brief bullets:

  • It’s huge! (Over 3,000 attendees)
  • Very corporate! (A bit too corporate and stuffy; it seemed like legal departments had censored some presentations)
  • All the cool kids are using/learning Spark (and Scala)
  • Map Reduce is old news.
  • Enterprises move slowly, like dinosaurs, and are just figuring out what Map Reduce is
  • Way too many vendors
  • Not enough interesting/inspiring presentations

Those were just my impressions, others may have other opinions.

Pentaho and Vertica as Business Intelligence / Data Warehousing solution

Introduction

I recently wrapped up a BI/Data Warehouse implementation project where I was responsible for helping a rapidly expanding international e-commerce company replace their aging BI reporting tool with a new, more flexible solution. The old BI reporting tool was based on an “in-memory” reporting engine, was more of a “departmental solution” than an enterprise-grade one, and was not optimally designed. For example, users found themselves downloading data from different canned reports into Excel, where they ran VLOOKUPs and pivot tables to compute simple metrics such as average order value and average unit retail. Needless to say, despite the best of intentions, there had been a communication gap between business users and IT developers on reporting requirements during the implementation of the original BI tool.

In designing and implementing the new solution, I set the following strategic tenets / guiding principles:

  • leverage commercial off-the-shelf (COTS) software; minimize customization and emphasize configuration instead (i.e., chose to buy instead of build, and made sure not to build too much after buying)
  • involve all stakeholders and business users throughout the process
  • enable business users to use self-service BI tools as much as possible
  • train as needed; up-skilling the user base on self-service tools is better than hiring an army of BI analysts
  • leverage data warehouse for both internal and external reporting
  • minimize amount of aggregation in Data Warehouse (we did almost no aggregation)
  • maximize the processing power of the ROLAP engine by pairing it with a high-performance analytical database (i.e., columnar MPP database)
  • stick to Kimball data warehouse design approach as much as possible, but be pragmatic where needed; Star Schema, Star Schema, Star Schema! (no snowflakes here)
  • take an iterative approach where possible – need to “ship” on time – understand that 1st release will not be “perfect” but does need to meet business requirements
  • for external reporting, provide canned reports only initially; test user adoption and work with clients to understand and address reporting needs over time

We looked at traditional players, open source, emerging technologies, and cloud BI SaaS providers. I made sure business and IT stakeholders were part of the vendor selection process, ensuring they attended demos and vendor presentations. In the end, Pentaho best matched our needs, providing us with both a solid ETL engine and a solid BI reporting engine. Since we were looking at providing both internal and external reporting with this solution, traditional BI vendors were prohibitively expensive, and “cloud offerings” were not compatible with our IT capabilities and architecture at the time (our data was not in the cloud).

Solution Description – Vertica + Pentaho BI/PDI

I proposed and received approval from our senior management and company board of directors to use Pentaho and Vertica as our Business Intelligence (BI) / Data Warehouse (DW) solution.

Vertica

HP Vertica is a columnar MPP database that can run analytical queries 20-100 times faster than a traditional row-oriented database such as Oracle. HP Vertica is available in a Community Edition, allowing organizations to use all the features of the database for free for up to 1 TB of data on up to three nodes. You can also install the database on a single node, though for a true proof of concept you should use at least three nodes. We started with the Vertica 6.1 Community Edition for our proof of concept (POC) and later upgraded to an enterprise license when we went live in production.
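
As a rough sketch of what querying Vertica from Python looks like, here is a minimal example using the open source vertica_python client to compute average order value by month. The connection settings, schema, and column names are invented for illustration; substitute your own.

    # Sketch: querying Vertica with the vertica_python client.
    # Connection settings, table, and column names are hypothetical.
    import vertica_python

    conn_info = {
        'host': 'vertica-node1.example.com',
        'port': 5433,                  # Vertica's default client port
        'user': 'dbadmin',
        'password': 'changeme',
        'database': 'dw',
    }

    with vertica_python.connect(**conn_info) as conn:
        cur = conn.cursor()
        # Average order value computed straight off the fact table; no pre-aggregation needed.
        cur.execute("""
            SELECT d.fiscal_month,
                   SUM(f.order_amount) / COUNT(DISTINCT f.order_id) AS avg_order_value
            FROM fact_order_line f
            JOIN dim_date d ON f.order_date_key = d.date_key
            GROUP BY d.fiscal_month
            ORDER BY d.fiscal_month
        """)
        for row in cur.fetchall():
            print(row)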

Pentaho

Pentaho is an open source BI platform and ETL tool. I liked the fact that it is open source, allowing us to highly customize the BI implementation if we chose to, as well as develop our own ETL connectors and routines. Some of the client tools are a bit quirky, but I do not know what BI/ETL software isn’t, given the complexity of these tools. Overall the product is solid and delivers as expected. We got the Enterprise Edition for the additional features and for product support from Pentaho. One thing that is annoying is the configuration files spread all over the place; to be fair, this is probably more of a Java application configuration issue than a Pentaho issue.
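
If you drive PDI (Kettle) transformations from the command line for scheduling, a rough sketch looks like the following. The install path, transformation file, and flags are assumptions; check the pan.sh options against your PDI version.

    # Sketch: running a PDI (Kettle) transformation via the pan.sh launcher.
    # Paths and options are placeholders; verify the flags for your PDI version.
    import subprocess

    result = subprocess.run(
        [
            "/opt/pentaho/data-integration/pan.sh",   # hypothetical install location
            "-file=/etl/load_fact_order_line.ktr",    # hypothetical transformation file
            "-level=Basic",                           # logging verbosity
        ],
        capture_output=True,
        text=True,
    )

    print(result.stdout)
    if result.returncode != 0:                        # non-zero exit code signals failure
        raise RuntimeError("PDI transformation failed:\n" + result.stderr)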

When I tell people that I’m using Pentaho, they are usually surprised; then I find out they were using Pentaho 3.x, and I’m no longer surprised by their reaction. Pentaho 4.x is a big step up from previous major releases, and Pentaho 5.0 is looking really good (I like the UI redesign). I encourage anyone who last looked at an early version of Pentaho to take another look; the product has matured considerably.

When I was selecting a BI vendor, the thought “no one ever got fired for choosing IBM (Cognos)” crossed my mind. I could have gone the “safe” route and used one of those tools. However, I believe the combination of Vertica + Pentaho has delivered more value to the organization, in a shorter amount of time, than we would have realized with those other vendors. For our organization, for our business needs, and for the realities of our IT capabilities at the time, Pentaho + Vertica was the way to go. We delivered the project on time and within budget (and without astronomical first-year costs). We have 100% user adoption internally and are getting very positive feedback from our merchant clients.

Results

  • Recognized by CEO for on-time, on-budget implementation; received “A” grade on end-of-year Enterprise-wide Strategic Initiatives Scorecard
  • Excellent user adoption
  • Positive feedback from external clients
  • Reduced manual reporting tasks over 50% (and over 80% in certain departments)

Reflections on "Hadoop Certification – is it worth it" 18 months later

It has been over a year and a half since I took the Cloudera Hadoop Developer certification course and exam and posted my initial impressions of it on my blog. I have received more comments than I expected; thank you for reading and sending them! There have been a few trends in the comments, some public, others kept private. The main ones are:

  1. People really want to get their hands on the Cloudera training materials
  2. People are very eager to get Hadoop jobs
  3. People are trying to transition into Hadoop from different (technical) backgrounds
  4. People want to know if they need to know Java to work with Hadoop
  5. People really want to know if getting a certification in Hadoop will land them a job.

Here is an update to each of these trends:

#1) I cannot share the Cloudera training materials with you, sorry. I wish you the best, but I cannot distribute these materials. They are also pretty old at this point; chances are some of the content is outdated by now. It seems like many of the people asking me for the training materials haven’t picked up any books on the subject at all, so please check out the available online resources or pick up some books (Hadoop: The Definitive Guide comes to mind).

#2) There is a tremendous amount of interest in learning Hadoop (and getting the training materials) in India. If it is hard to find experienced Hadoop developers in the US right now, I imagine it must be even harder in India (for now, anyway), and there must be many, many job openings there right now. I can imagine the outsourcing firms trying to staff up to meet the unmet demand in the US and elsewhere. Almost all the comments and private messages asking me for training materials were from India. I do not know how much a training course costs in India, but there are plenty of training options, in addition to Cloudera’s and Hortonworks’ online offerings.

#3) Career switchers (or more accurately, technology-platform-switchers) will need to impress hiring managers with their transferable skill sets and show (not tell) their passion for technology and big data. This is true for any job applicant.

#4) Regarding Java: yes, it is good to know Java to work with Hadoop, but it is not required. You can use other languages, such as Python, through the Hadoop Streaming API. To work with big data, Python is a good language to know anyway (lots of companies are looking for people with a Linux/Python background), so learn Python while you are at it (learnpythonthehardway.com). If you know Python you will also be able to use Pig to interact with your data. Which language you will use will be determined by the solution architecture and design: if the company you want to work for has designed a solution around custom-coded Java Map Reduce jobs, then you would need to know Java; other places may use the Hadoop Streaming API with Python, so it may be possible to get a job there if you know Python.
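
For a feel of what the Streaming route looks like, here is a rough word count sketch in Python: the script acts as the mapper or the reducer depending on its first argument, reading stdin and writing stdout. The HDFS paths and jar location in the comments are placeholders.

    # wordcount.py: a single script used as both mapper and reducer for Hadoop Streaming.
    # A rough sketch; paths in the submission example are placeholders.
    import sys

    def mapper():
        # Emit "word<TAB>1" for every word read from stdin.
        for line in sys.stdin:
            for word in line.strip().split():
                print(word + "\t1")

    def reducer():
        # Streaming sorts mapper output by key, so all counts for a word arrive together.
        current_word, current_count = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t", 1)
            if word == current_word:
                current_count += int(count)
            else:
                if current_word is not None:
                    print(current_word + "\t" + str(current_count))
                current_word, current_count = word, int(count)
        if current_word is not None:
            print(current_word + "\t" + str(current_count))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

    # Example submission (the streaming jar name and paths vary by distribution):
    #   hadoop jar hadoop-streaming.jar \
    #     -input /user/me/input -output /user/me/output \
    #     -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" \
    #     -file wordcount.py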

#5) Having a certification in Hadoop won’t guarantee you a job. Most companies are looking for experienced Hadoop hires, which are hard to find unless they are poaching employees from other big data startups or tech firms (Yahoo, Google, etc.). When I interviewed technical job applicants, I was surprised (perhaps I shouldn’t have been) by how poorly they interviewed. So please, please practice your behavioral interviewing skills (“tell me about yourself”, “walk me through your resume”, “tell me about a time you had to solve a difficult problem”, “why do you want this job”, etc.). If someone has 50 certifications and can’t answer these simple questions, I will not consider them for the role. I have heard that some hiring managers consider too many certifications a cover-up for lack of skill (superstar developers don’t bother getting certified / don’t need to be certified). For the rest of us, certification can help, but it doesn’t guarantee success. The Cloudera Developer course is a good overview, but for it to be meaningful, you really do need a project to work on. Working on a pet project and being able to share code samples will help set you up for success when interviewing.

As for my own personal experience, I did not get a job working directly with Hadoop following the certification course, but I also was not considering only Hadoop developer roles. I am now leading a BI implementation project, for which I interviewed and hired a team of developers and analysts. We are using Pentaho and Vertica (as the analytic database), and I have been evangelizing Hadoop and other technologies at my company. I find it humorous when executives say the company needs to do more “big data” or “more Hadoop” without really knowing what that means. The certification course definitely helped me speak more authoritatively about the technology at my company and when networking with others.

Whether or not to take the certification course depends on your individual circumstances. If you are dead-set on getting a job as a Hadoop developer, then it may be worth it to you, but make sure to follow up with a personal project to continue learning and practicing. Many people focus on Hadoop itself and seem to forget the business applications of a technology like Hadoop (data science, improved ETL, data processing). Brushing up on those skills and that domain knowledge will make you a much more interesting job candidate overall. Good luck everyone!

Upcoming conference on node.js

Just signed up for node.ph on April 23rd, 2012. Looking forward to learning more about this event-driven framework and how to apply it to business challenges.

The schedule of events includes:

  • An introduction to the event-driven I/O framework that is changing the way we think about developing web applications.
  • Fully loaded Node! Lloyd Hilaiel will explain how to do a bunch of computation with Node.js, use all available CPUs, fail gracefully, and stay responsive.
  • Charlie Robbins will take us through real-world deployments in business-critical systems and why some of the world’s leading companies are choosing Node.
  • James Halliday and Daniel Shaw will show how to use Node.js to enable the real-time streaming web. Guaranteed to generate ideas for next-generation web applications.

Cloudera Hadoop Developer Training- Is it Worth it?

Is Cloudera Developer Training for Apache Hadoop worth it?

I just finished the Cloudera Developer Training for Apache Hadoop course and passed the Cloudera Certified Developer for Apache Hadoop exam. I feel good about passing the certification exam on the first try, but I have some mixed feelings, primarily around whether it is worth the course fee (upwards of $2,700 at the time of writing). In particular, what is the value to job seekers or professionals wanting to augment their skill set and polish their resumes? Is it worth it for Java developers to take this course? Don’t get me wrong, the instructor (Mark Fei) was excellent! He was very knowledgeable and engaging, and the course itself covers a lot of material in a digestible way. The question in my mind is: will you actually use everything taught in the Developer course, or would picking up a book on Hive (or Pig) be more than enough for what you need to do?

Cloudera Hadoop Developer Training: Day 1

Cloudera Hadoop Developer Training: First Impressions

Just wrapped up the first day of Cloudera Hadoop Developer Training… so far so good! Training lasts 4 days, with the option to take the certification exam within 30 days of course completion. Apparently the exam policy is changing in the next couple of months: right now the exam is a timed, multiple-choice, open-book, un-proctored exam taken online; soon it will become a Pearson-administered exam, which means registering in advance, taking the exam in a highly supervised test environment, and, I assume (though I am not sure), a closed-book format. I am taking the course in Columbia, Maryland, with Mark Fei as the instructor. He’s great! Passionate and knowledgeable about the material, and a good course leader and instructor. He has probably taught this class countless times, but he still has plenty of enthusiasm and interest in answering questions.

Day 1 included a high-level overview of the Hadoop ecosystem and the Map Reduce algorithm. We had a chance to do hands-on labs in the afternoon and ran our first Map Reduce (MR) job in Java: word count, the “hello world” example of Hadoop. Next, we wrote our own average word length Map Reduce algorithm. The course officially does not require Java knowledge, but it definitely helps to know how to code in Java or another object-oriented language. We discussed how to write Map Reduce algorithms in Python, Perl, and even UNIX shell scripts via the Hadoop Streaming API. Mark gave us an overview of the Hadoop ecosystem and discussed how Hive, Pig, Sqoop, and Oozie work with Hadoop Map Reduce in a production environment.
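
For anyone curious what these exercises look like without Java, here is a rough Python sketch of the average word length lab via the Hadoop Streaming API (my own sketch, not the course solution): the mapper emits each word’s starting letter with the word’s length, and the reducer averages the lengths per letter.

    # avg_word_length.py: mapper/reducer sketch for Hadoop Streaming (not the course solution).
    import sys

    def mapper():
        # Emit "first_letter<TAB>word_length" for each word on stdin.
        for line in sys.stdin:
            for word in line.strip().split():
                print(word[0].lower() + "\t" + str(len(word)))

    def reducer():
        # Input arrives sorted by key; average the lengths seen for each starting letter.
        current_letter, total, count = None, 0, 0
        for line in sys.stdin:
            letter, length = line.rstrip("\n").split("\t", 1)
            if letter != current_letter:
                if current_letter is not None:
                    print(current_letter + "\t" + str(total / count))
                current_letter, total, count = letter, 0, 0
            total += int(length)
            count += 1
        if current_letter is not None:
            print(current_letter + "\t" + str(total / count))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()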

Cloudera Hadoop Developer Training: Is it Worth it?

My biggest takeaway today was the anecdotal observation that roughly 75% of all Hadoop Map Reduce jobs in production are likely invoked via Hive and Pig. Hive provides a “SQL-like” front-end interface to Map Reduce: HiveQL statements are translated into Map Reduce jobs behind the scenes, allowing business analysts to easily query the Hadoop cluster. Pig provides a scripting-language interface to Hadoop, allowing users to write more complex queries (in “Pig Latin”) without having to write Java. At this point in the class I asked myself: wow, if 75% of all Map Reduce is being run through Hive or Pig, why am I here taking this Java-oriented course? Is the Cloudera Developer course still worth it for the “typical user”? I think for most users, especially those with solid SQL backgrounds who aren’t all that concerned with the inner workings of Hadoop, learning and using Hive is probably good enough. I like “getting under the hood” and truly understanding what’s going on, knowing how to optimize queries, and being able to handle the more challenging edge cases. I am also interested in learning how to use machine learning algorithms with Hadoop. So I think there is still value in learning the edge cases and complexity.
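
To make the Hive point concrete, here is a rough sketch of submitting a HiveQL query from Python using the hive command-line client’s -e option; Hive compiles the statement into Map Reduce jobs behind the scenes. The table and column names are invented for illustration.

    # Sketch: running a HiveQL query non-interactively; Hive turns it into Map Reduce jobs.
    # The table and column names are hypothetical.
    import subprocess

    query = """
        SELECT page, COUNT(*) AS hits
        FROM web_logs
        WHERE dt = '2012-04-01'
        GROUP BY page
        ORDER BY hits DESC
        LIMIT 10
    """

    # 'hive -e' executes a query string and writes the results to stdout.
    result = subprocess.run(["hive", "-e", query], capture_output=True, text=True, check=True)
    print(result.stdout)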

Another key takeaway was getting a good sense of how I could re-architect a production environment using Hadoop and its related tools. At Vostu we were looking at ways to improve the ETL process for our analytics database environment. We were working with large data volumes, and our ETL process, which had not been updated in a long time, was showing its age and its limits in scalability as we added more games and experienced viral user growth. In order to leverage Hadoop’s full potential, it seemed to me that an ETL redesign needed to be on the Production team’s radar as well, not just the Analytics team’s. It was good hearing Mark’s insights on how other Hadoop practitioners dealt with the very real political and organizational challenges of introducing Hadoop to their respective companies. In a world where data center managers associate big storage with big storage appliances (i.e., SAN or NAS installations), Hadoop relies on commodity hardware with direct-attached storage. Data center managers may not “get it” and may be resistant, especially since many infrastructure organizations have separate DBA and Storage teams, which raises the question of who would own an integrated database + storage solution. Selling and implementing Hadoop can be an uphill challenge in less-than-innovative IT organizations.

Cloudera Hadoop Developer Training: Event Details

The training location, Bridge Education, has reasonable facilities. After a while I noticed all the Sun server posters (server p0rn, anyone?) in the hallways, classrooms, kitchen pantry, basically everywhere. I hadn’t realized what a big deal Sun Java J2EE certifications were; that’s the only connection I could make to the wall art. Lunch was provided (Chipotle on the first day), so we didn’t need to waste time looking for a place to eat.

Cloudera provides electronic copies of all course materials, including a VMware virtual machine running CentOS with a Hadoop instance in pseudo-distributed mode. This was pretty cool: it runs the HDFS NameNode and DataNode, the MapReduce JobTracker and TaskTracker, and the other daemons as separate processes, allowing one to fully experience how to get data in and out of a Hadoop cluster without needing to configure an actual multi-node cluster. The lecture notes, a 500-page PowerPoint PDF file, are downloaded from the training site. The best part is the fully configured VM image, which saved me tons of time downloading and installing the framework for testing at home. It is very similar to the Cloudera image available on the web, so overall, the best way to test Hadoop and run the sample Map Reduce jobs is through a VM image.
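
As a quick sketch of the “getting data in and out” part on the class VM, the standard hadoop fs commands are all you need; here they are wrapped in a few lines of Python for convenience (local and HDFS paths are placeholders).

    # Sketch: moving data in and out of the pseudo-distributed cluster with 'hadoop fs'.
    # All paths are placeholders.
    import subprocess

    def hdfs(*args):
        # Thin wrapper around the 'hadoop fs' command line.
        subprocess.run(["hadoop", "fs"] + list(args), check=True)

    hdfs("-mkdir", "/user/training/input")                                    # create an HDFS directory
    hdfs("-put", "/home/training/shakespeare.txt", "/user/training/input/")   # copy a local file into HDFS
    hdfs("-ls", "/user/training/input")                                       # confirm it landed
    hdfs("-get", "/user/training/input/shakespeare.txt", "/tmp/copy.txt")     # pull it back out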

Crowdsourcing solutions for healthcare industry – Kaggle and the $3M Heritage Health Prize

Heritage launched the $3 million Heritage Health Prize with one goal in mind: to develop a breakthrough algorithm that uses available patient data, including health records and claims data, to predict and prevent unnecessary hospitalizations. Heritage believes that incentivized competition – one that includes the involvement of those with passionate minds that don’t know what can’t be done – is the best way to achieve the radical breakthroughs and innovations necessary to reform our health care system. Sponsoring this prize is simply one way that Heritage believes it can help solve a societal problem. Heritage is not an insurer and doesn’t stand to benefit directly by solving this problem – but Heritage is in the business of looking after the health of Americans and believes that corporations have a role in enabling change for the better.

This has the potential to catalyze big breakthroughs in healthcare analytics. At too many healthcare clients I have seen Informatics departments stuck doing operational reporting instead of the higher-value analytics work they were probably originally recruited for. Sure, every client is different, but a common experience in healthcare consulting is that data is 1) hard to get, 2) hard to interpret, and 3) hard to put to use. Health insurers have a hard enough time managing their databases and data warehouses given limited IT budgets and qualified resources, let alone doing significant value-add or R&D work to effectively mine their membership health and claims data. Other industries also have problems managing and drawing insights from “big data,” but it is especially difficult in healthcare due to HIPAA privacy rules and other government regulations.

So the fact that the Heritage provider network has partnered with Kaggle to create an analytics competition is great news indeed! Finally, medical data is available for data scientists and other analytics wizards to comb through, innovate with, and perhaps apply true out-of-the-box thinking to this problem: patient identification and member-level targeting that truly reduces cost (not just lip service and buzzwords to put on a vendor’s latest care management platform marketing collateral). IT and Informatics/Business Intelligence departments within healthcare companies are too busy with “business as usual” and “maintenance” projects, so crowdsourcing “anonymized” health data to scientists and data experts just makes sense. I hope to see more of these types of competitions in the healthcare space in the near future.

I have signed up for the competition and am looking forward to getting knee-deep in the member data and getting a real-world handle on the types of challenges that Informatics departments deal with on a daily basis.
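
As a hedged sketch of the kind of first baseline I expect to start with: a simple regression on aggregated claims features predicting next year’s hospitalization. The file, columns, and features below are invented; the actual competition data set has its own schema.

    # A rough first-baseline sketch for a hospitalization-prediction task.
    # File name, column names, and features are hypothetical, not the real Kaggle schema.
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("member_year_features.csv")      # hypothetical pre-aggregated member-year file
    features = ["claim_count", "total_paid", "chronic_condition_count", "age"]
    target = "days_in_hospital_next_year"

    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=0.2, random_state=42
    )

    model = LinearRegression().fit(X_train, y_train)
    print("R^2 on held-out members:", model.score(X_test, y_test))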

For more info, check out:

Large-Scale Data Storage and Processing for Scientists with Hadoop

Great overview of Hadoop and related “big data” tools.