Reflections on Innovation and a Recent Opinion Piece in the NY Times

People often ask me what metrics I would use to evaluate an organization’s level of “innovation”. Depending on how well I know the person, I sometimes flippantly respond with a question or two (the first being the more important):

  • Has the organization recently created an “Innovation Center” or team? This is usually a big red flag that no innovation culture permeates the organization, so the company creates an “innovation” organization and hires “innovation associates” to help the company “ideate” and “innovate”. The end result is more process, less innovation.
  • What percent of your individual contributors’ day is spent in meetings? When people who should be doing things, researching things, designing things, and building things are instead stuck in pointless meetings (you know which ones I mean), the organization has an execution problem that will come back to haunt it later. Their time would probably be better spent solving problems and implementing solutions.

On a related note, I thought the following quote from a recent New York Times opinion piece on Innovation and Bell Labs was particularly apropos:

By one definition, innovation is an important new product or process, deployed on a large scale and having a significant impact on society and the economy, that can do a job (as Mr. Kelly once put it) “better, or cheaper, or both.” Regrettably, we now use the term to describe almost anything. It can describe a smartphone app or a social media tool; or it can describe the transistor or the blueprint for a cellphone system. The differences are immense. One type of innovation creates a handful of jobs and modest revenues; another, the type Mr. Kelly and his colleagues at Bell Labs repeatedly sought, creates millions of jobs and a long-lasting platform for society’s wealth and well-being.

I would add “Regrettably, building ‘innovation centers’ passes for innovation today.” The author describes Bell Labs’ founders’ philosophy of innovation:

His fundamental belief was that an “institute of creative technology” like his own needed a “critical mass” of talented people to foster a busy exchange of ideas. But innovation required much more than that. Mr. Kelly was convinced that physical proximity was everything; phone calls alone wouldn’t do. Quite intentionally, Bell Labs housed thinkers and doers under one roof. Purposefully mixed together on the transistor project were physicists, metallurgists and electrical engineers; side by side were specialists in theory, experimentation and manufacturing.

I tend to agree with this approach. To get an innovative culture going, you need (empowered) cross-functional teams working cohesively to develop new solutions, and they need organizational resources: time and budget to build proofs of concept, test, and take risks. “Innovation centers” are often a symptom of siloed organizations. When employees bemoan going to another “innovation session,” that is usually a sign that the latest “corporate initiative” to promote innovation is not working. Sometimes the best thing to do is to admit you have a siloed organization and take steps to reshape it. This takes true leadership (at the most senior levels) and effective change management. It is easier said than done.


Cloudera Hadoop Developer Training: Is it Worth it?

Is Cloudera Developer Training for Apache Hadoop worth it?

I just finished the Cloudera Developer Training for Apache Hadoop course and passed the Cloudera Certified Developer for Apache Hadoop exam. I am feeling good about passing the certification exam on the first try, but I have some mixed feelings, primarily around one question: is it worth the course fee (upwards of $2,700 at the time of writing)? In particular, what is the value to job seekers or professionals wanting to augment their skill set and polish their resumes? Is it worth it for Java developers to take this course? Don’t get me wrong, the instructor (Mark Fei) was excellent! He was very knowledgeable and engaging. The course itself covers a lot of material in a digestible way. The question in my mind is, will you make use of all the knowledge from the Developer course, or would picking up a book on Hive (or Pig) be more than enough for what you need to do?

Cloudera Hadoop Developer Training: Day 1

Cloudera Hadoop Developer Training: First Impressions

Just wrapped the first day of Cloudera Hadoop Developer Training… so far so good! Training lasts 4 days, with the option to take the Certification Exam within 30 days of course completion. Apparently the exam policy is changing in the next couple of months. Right now, the exam is a timed, multiple-choice, open-book, un-proctored exam that is taken online. Soon it will become a Pearson-administered exam, which means registering in advance and taking the exam in a highly supervised test environment. I assume it will also be closed book, but I am not sure. I am taking the course in Columbia, Maryland, with Mark Fei as the instructor. He’s great! He is passionate and knowledgeable about the material, and a good course leader and instructor. He’s probably taught this class countless times, but he still has plenty of enthusiasm and interest in answering questions.

Day 1 included a high-level overview of the Hadoop ecosystem and the Map Reduce algorithm. We had a chance to do hands-on labs in the afternoon and ran our first Map Reduce (MR) jobs in Java, starting with a word count, the “hello world” example of Hadoop. Next, we wrote our own average word length Map Reduce job. The course officially does not require Java knowledge, but it definitely helps to know how to code in Java or another object-oriented language. We discussed how to write Map Reduce jobs in Python, Perl, and even UNIX shell scripts via the Hadoop Streaming API. Mark also discussed how Hive, Pig, Sqoop, and Oozie work with Hadoop Map Reduce in a production environment.
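To make the two lab exercises concrete, here is a minimal sketch in Python rather than the Java we used in class. It runs in a single process: the `run_job` helper is a hypothetical stand-in for Hadoop's shuffle step, not part of any Hadoop API. Keying the average word length by first letter is also my assumption about how such an exercise might be set up.

```python
from collections import defaultdict


def word_count_map(line):
    """Map step: emit (word, 1) for every word in a line of text."""
    for word in line.split():
        yield word.lower(), 1


def word_count_reduce(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def avg_length_map(line):
    """Map step: emit (first letter, word length) for every word.

    Keying by first letter is an assumption made for illustration.
    """
    for word in line.split():
        yield word[0].lower(), len(word)


def avg_length_reduce(key, values):
    """Reduce step: average the word lengths collected for one letter."""
    values = list(values)
    return key, sum(values) / len(values)


def run_job(lines, mapper, reducer):
    """Hypothetical local driver that imitates map -> shuffle -> reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)  # "shuffle": group values by key
    return {key: reducer(key, values)[1] for key, values in sorted(groups.items())}


if __name__ == "__main__":
    text = ["the quick brown fox", "the lazy dog"]
    print(run_job(text, word_count_map, word_count_reduce))
    print(run_job(text, avg_length_map, avg_length_reduce))
```

On a real cluster with Hadoop Streaming, the mapper and reducer would instead be separate scripts that read lines from standard input and write tab-separated key/value pairs to standard output, with Hadoop doing the grouping in between.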

Cloudera Hadoop Developer Training: Is it Worth it?

My biggest takeaway today was the anecdotal observation that 75% of all Hadoop Map Reduce jobs in production are likely invoked via Hive and Pig. Hive provides a “SQL-lite” front end interface to Map Reduce. HiveQL statements are translated into Map Reduce jobs behind the scenes, allowing business analysts to easily query the Hadoop cluster. Pig provides a scripted language interface to Hadoop, allowing users to write more complex queries (in “PigLatin”) without having to write Java. At this point in the class, I asked myself: wow, if 75% of all Map Reduce is being done by Hive or Pig, why am I here taking this Java course? Is the Cloudera Developer course still worth it for the “typical user”? I think for most users, especially those who have solid SQL backgrounds and aren’t all that concerned with the inner workings of Hadoop, learning and using Hive is probably good enough. I like “getting under the hood” and truly understanding what’s going on, knowing how to optimize queries, and being able to deal with the more challenging edge cases. Also, I am interested in learning how to use machine learning algorithms with Hadoop. So I think there is still value in learning the edge cases and complexity.
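As a rough illustration of what “translated into Map Reduce jobs” means, the sketch below hand-translates one SQL-style aggregate query into a map step and a reduce step in plain Python. This is not Hive's actual query plan or API. The table, column names, and sample rows are all hypothetical, and `run_query` is a made-up local driver.

```python
from collections import defaultdict

# Hypothetical rows of an "events" table (all names and values are invented).
EVENTS = [
    {"game": "poker", "country": "BR", "session_minutes": 30},
    {"game": "poker", "country": "BR", "session_minutes": 50},
    {"game": "poker", "country": "US", "session_minutes": 10},
    {"game": "farm", "country": "BR", "session_minutes": 20},
]

# The query being imitated:
#   SELECT game, AVG(session_minutes)
#   FROM events
#   WHERE country = 'BR'
#   GROUP BY game;


def map_row(row):
    """Map step: apply the WHERE filter, then emit (GROUP BY key, value)."""
    if row["country"] == "BR":
        yield row["game"], row["session_minutes"]


def reduce_group(game, minutes):
    """Reduce step: compute the aggregate (AVG) for one group."""
    return game, sum(minutes) / len(minutes)


def run_query(rows):
    """Hypothetical local driver: map every row, group by key, then reduce."""
    shuffled = defaultdict(list)
    for row in rows:
        for key, value in map_row(row):
            shuffled[key].append(value)
    return dict(reduce_group(key, values) for key, values in sorted(shuffled.items()))


if __name__ == "__main__":
    print(run_query(EVENTS))
```

The appeal of Hive is that an analyst writes only the query in the comment; everything below it is what they no longer have to code by hand.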

Another key takeaway was getting a good sense of how I could re-architect a production environment using Hadoop and its related tools. At Vostu we were looking at ways to improve the ETL process for our analytics database environment. We were working with Large Data, and our ETL process, which had not been updated in a long time, was showing its age and its limits in scalability as we added more games and experienced viral user growth. In order to leverage Hadoop’s full potential, it seemed to me that an ETL redesign needed to be on the Production team’s radar screen also, not just the Analytics team’s. It was good hearing Mark’s insights on how other Hadoop practitioners dealt with the very real political and organizational challenges of introducing Hadoop to their respective companies. Data center managers tend to associate big storage with big storage appliances (i.e., SAN or NAS installations), whereas Hadoop relies on commodity hardware with direct-attached storage. Data center managers may not “get it,” and may be resistant, especially since many Infrastructure teams have separate DBA and Storage teams. So who would be the owner of an integrated DB + Storage solution? Selling Hadoop to a less-than-innovative IT organization, and then implementing it there, can be an uphill battle.

Cloudera Hadoop Developer Training: Event Details

The training location, Bridge Education, has reasonable facilities. After a while I noticed all the Sun Server posters (server p0rn, anyone?) in the hallways, classrooms, kitchen pantry, basically everywhere. I hadn’t realized what a big deal Sun Java J2EE Certifications were; that’s the only connection I could make to the wall art. Lunch was provided (Chipotle on the first day), so we didn’t need to waste time looking for a place to eat.

Cloudera provides electronic copies of all course materials, including a VMware Virtual Machine running CentOS with a Hadoop instance running in pseudo-distributed mode. This was pretty cool: it runs the separate Hadoop daemons (the HDFS master and slave roles, i.e., the NameNode and DataNode, among others) as individual processes on one machine, allowing one to fully experience how to get data in and out of a Hadoop cluster without needing to actually configure a multi-node cluster environment. The lecture notes, a 500-page PDF of PowerPoint slides, are downloaded from the training site. The best part is the fully configured VM image, which saved me tons of time downloading and installing the framework for testing at home. It is very similar to the Cloudera image available on the web, so overall, the best way to test Hadoop and run the sample Map Reduce jobs is through a VM image.

Crowdsourcing solutions for healthcare industry – Kaggle and the $3M Heritage Health Prize

Heritage launched the $3 million Heritage Health Prize with one goal in mind: to develop a breakthrough algorithm that uses available patient data, including health records and claims data, to predict and prevent unnecessary hospitalizations. Heritage believes that incentivized competition – one that includes the involvement of those with passionate minds that don’t know what can’t be done – is the best way to achieve the radical breakthroughs and innovations necessary to reform our health care system. Sponsoring this prize is simply one way that Heritage believes it can help solve a societal problem. Heritage is not an insurer and doesn’t stand to benefit directly by solving this problem – but Heritage is in the business of looking after the health of Americans and believes that corporations have a role in enabling change for the better.

This has the potential to catalyze big breakthroughs in healthcare analytics. At too many healthcare clients I have seen Informatics departments stuck in operational reporting instead of the higher-value analytics work that they were probably originally recruited for. Sure, every client is different, but a common experience in healthcare consulting is that data is 1) hard to get, 2) hard to interpret, and 3) hard to put to use. Health insurers have a hard enough time managing their databases and data warehouses given limited IT budgets and qualified resources, let alone doing significant value-add or R&D work in effectively mining their membership health and claims data. Other industries are having problems managing and drawing insights from “big data,” but it is especially difficult for healthcare due to HIPAA privacy and other government regulations.

So, the fact that the Heritage provider network has partnered with Kaggle to create an analytics competition is great news, indeed! Finally, medical data is available for data scientists and other analytics wizards to comb through, innovate, and perhaps come up with true out-of-the-box thinking on this problem: patient identification and member-level targeting to truly reduce cost (and not just lip service and buzzwords to put on vendors’ latest care management platform marketing collateral). IT and Informatics/Business Intelligence departments within healthcare companies are too busy doing “business as usual” and “maintenance” projects… so crowdsourcing “anonymous” health data to scientists and data experts just makes sense. I hope to see more of these types of competitions within the healthcare space in the near future.

I have signed up for the competition and am looking forward to getting knee-deep in the member data and getting a real-world handle on the types of challenges that Informatics departments must deal with on a daily basis.

For more info check out

Large-Scale Data Storage and Processing for Scientists with Hadoop

Great overview of Hadoop and related “big-data” tools

China: GMIC And CHINICT Tech Conferences In Beijing: Learnings From China

Great article on TechCrunch about a Beijing tech conference. I was in town a month ago; too bad I missed it, as it would have been great to attend:

After exploring the mobile and Internet landscapes in Shanghai and Beijing, the GeeksOnAPlane (GOAP) group (30+ techies mostly from the Silicon Valley) continued their Asian field trip to Korea today. In Beijing, the GOAP attended two of China’s largest tech conferences: CHINICT, “the largest conference on China tech innovation” (which was livestreamed on TechCrunch), and the “Global Mobile Internet Conference” (GMIC), both of which are held in the city every year.

The GOAP got in touch with and gained unfiltered insight from dozens and dozens of local entrepreneurs, VCs and industry observers during the conferences and the events that took place around them. What follows are just a few learnings and impressions the GOAP group picked up during their China web crash course in Beijing (the size of the tech landscape is summarized in my previous post).

GeeksOnAPlane at the GMIC And CHINICT Tech Conferences In Beijing: Learnings From China.

Big Data Is Less About Size, And More About Freedom

My favorite quote in this interview:

Hal Varian, Google’s Chief Economist, recently said,

“The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it”

Unfortunately for those of us working on these problems in real life, it is not so simple. The archetypal data-renaissance man is mathematician, statistician, computer scientist, machine learner, and engineer all rolled into one. There are opportunities where you can lack some of these skills and work with a team that supplements your weak points—a startup is not one of those.

via TechCrunch: Big Data Is Less About Size, And More About Freedom.