Map Reduce is dead, long live Spark!

That’s the impression I, and I suspect most attendees, walked away with from Strata NY 2014. Most of the interesting presentations centered on Spark. Only the corporate IT presentations about in-progress Hadoop implementations were about Map Reduce.

So who’s working on Spark?  Cool startups and vendors (preparing for enterprise IT departments to move on to Spark in a year or two).

Who’s working on Map Reduce? Corporate IT departments migrating off legacy BI systems onto the promised land of Hadoop (a dream come true, or a nightmare around the corner; I’m not sure which it will be).

It makes sense. Map Reduce has been tested and is now ‘safe’ for enterprise IT teams to deploy into production systems. Spark is still very new and untested: too risky for a Fortune 500 to replace legacy systems with a still-in-diapers open source “solution.” Nonetheless, I am sure every technical worker left this conference drooling to “prototype” or build proofs of concept with Spark.

Reflections on Strata NYC 2014

I had a chance to attend Strata in New York back in October. I had been wanting to attend for a few years but had not had the chance until now. A few impressions, in the form of brief bullets:

  • It’s huge! (Over 3,000 attendees)
  • Very corporate! (A bit too corporate, too stuffy; it seemed like legal departments had censored some presentations)
  • All the cool kids are using/learning Spark (and Scala)
  • Map Reduce is old news.
  • Enterprises move slowly, like dinosaurs, and are just figuring out what Map Reduce is
  • Way too many vendors
  • Not enough interesting/inspiring presentations

Those were just my impressions; others may have come away with different ones.

Reflections on Big Data Roundtable hosted by JNK Securities

Last week I had the opportunity to attend a Big Data round table discussion organized by JNK Securities, a broker-dealer based in New York with offices in DC. The attendees were split roughly evenly between technologists/practitioners and finance professionals hoping to get a pulse on market trends. The conversation was moderated by Atul Chhabra, entrepreneur and former Director of Cloud Strategy at Verizon.

The finance professionals were eager to understand how Hadoop, NoSQL, and other Big Data technologies were going to disrupt (or not) existing technology vendors. One person asked how easy it would be for companies to replace their Oracle installations with Hadoop or a NoSQL database if Oracle licensing agreements were structured to penalize such a migration. As the crowd quickly pointed out, it is not “termination fees” that are the problem in moving away from Oracle, but the level of investment (i.e., cost) required to refactor existing code and applications so they keep functioning as expected.

One way RDBMS vendors increase their product’s “stickiness” and the cost of migration is to promote their database’s proprietary language (PL/SQL for Oracle, T-SQL for Microsoft) over ANSI standards. If an application relies heavily on these stored procedures, it will have to be rewritten in the new database’s language (or in standard ANSI SQL to make it more easily portable in the future). That assumes there are no hidden “gotchas” in the code itself, such as a programmer making a direct JDBC call to the database and hard-coding the SQL in the web application code. Bottom line: it would be very expensive to rewrite existing code, and very hard to justify doing so, since the rewrite by itself does not add any value to the company.

Additionally, as Atul pointed out, migrating off Oracle is unlikely to reduce licensing costs for enterprises, since these contracts are typically based on the number of employees or clients. Migrating one application off Oracle would not change that number, so the licensing costs remain the same, and in fact increase if the new technology carries licensing costs of its own (it usually does). What is more likely is that companies will build new tools and applications using emerging technologies and leave legacy systems as they are.

An interesting idea put forward by one of the attendees was that the way we think about coding and building applications will change dramatically now that we are in the age of Big Data and Big Compute. There is a fundamental shift in how we design applications: instead of coding for the limits of the hardware, assume “best case scenarios” of unbounded scalability and unending amounts of storage and RAM, thanks to developments in Big Data architecture, horizontal scaling, and massively parallel processing (MPP). For example, no longer code applications and file systems to purposely delay processing while waiting for hard drives to spin up or to perform file seeks; instead, assume near-instantaneous reads and writes thanks to SSDs, assume effectively infinite storage (through HDFS-like architectures), and assume unbounded parallelism (i.e., no longer bounded by the number of cores on one particular server).

Overall, it was a great event, good dinner conversation with smart people.  Looking forward to future events.

Cloudera Hadoop Developer Training: Day 1

Cloudera Hadoop Developer Training: First Impressions

Just wrapped up the first day of Cloudera Hadoop Developer Training… so far so good! Training lasts 4 days, with the option to take the Certification Exam within 30 days of course completion. Apparently the exam policy is changing in the next couple of months. Right now, the exam is a timed, multiple-choice, open-book, unproctored exam taken online. Soon it will become a Pearson-administered exam, which means registering in advance, taking it in a highly supervised test environment, and, I assume, a closed-book format (but I’m not sure). I am taking the course in Columbia, Maryland, with Mark Fei as the instructor. He’s great! Passionate and knowledgeable about the material, and a good course leader and instructor. He has probably taught this class countless times, but he still has plenty of enthusiasm and interest in answering questions.

Day 1 included a high-level overview of the Hadoop ecosystem and the Map Reduce algorithm. We did hands-on labs in the afternoon and ran our first Map Reduce (MR) jobs in Java, starting with a word count map reduce, the “hello world” example of Hadoop. Next, we wrote our own average-word-length map reduce. The course officially does not require Java knowledge, but it definitely helps to know Java or another object-oriented language. We discussed how to write Map Reduce jobs in Python, Perl, and even UNIX shell scripts via the Hadoop Streaming API. Mark also walked us through how Hive, Pig, Sqoop, and Oozie fit alongside Hadoop Map Reduce in a production environment.
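For reference, here is a minimal sketch of that word-count “hello world,” written against the standard org.apache.hadoop.mapreduce API (class names and driver boilerplate here are illustrative and may differ slightly from the lab materials): the mapper emits a (word, 1) pair for every token, and the reducer sums the counts for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: for each input line, emit (word, 1) for every token.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer: sum the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      // Driver: wire the mapper and reducer into a job and submit it.
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, it would be submitted with something like “hadoop jar wordcount.jar WordCount input output”, where input and output are HDFS paths (illustrative names, not the ones used in class).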

Cloudera Hadoop Developer Training: Is it Worth it?

My biggest takeaway today was the anecdotal observation that roughly 75% of all Hadoop Map Reduce jobs in production are likely invoked via Hive and Pig. Hive provides a SQL-like front end to Map Reduce: HiveQL statements are translated into Map Reduce jobs behind the scenes, allowing business analysts to easily query the Hadoop cluster. Pig provides a scripting-language interface to Hadoop, allowing users to write more complex queries (in Pig Latin) without having to write Java. At this point in the class, I asked myself: wow, if 75% of all Map Reduce is being done through Hive or Pig, why am I here taking this Java course? Is the Cloudera Developer course still worth it for the “typical user”? I think for most users, especially those with solid SQL backgrounds who aren’t all that concerned with the inner workings of Hadoop, learning and using Hive is probably good enough. I like “getting under the hood” and truly understanding what’s going on, knowing how to optimize queries, and being able to deal with the more challenging edge cases. I am also interested in learning how to use machine learning algorithms with Hadoop. So I think there is still value in learning the edge cases and complexity.

Another key takeaway was getting a good sense of how I could re-architect a production environment using Hadoop and its related tools. At Vostu we were looking at ways to improve the ETL process for our analytics database environment. We were working with Large Data, and our ETL process, which had not been updated in a long time, was showing its age and its limits in scalability as we added more games and experienced viral user growth. To leverage Hadoop’s full potential, it seemed to me that an ETL redesign needed to be on the Production team’s radar as well, not just the Analytics team’s. It was good hearing Mark’s insights on how other Hadoop practitioners dealt with the very real political and organizational challenges of introducing Hadoop to their companies. In a world where data center managers associate big storage with big storage appliances (i.e., SAN or NAS installations), Hadoop relies on commodity hardware with direct-attached storage. Data center managers may not “get it” and may be resistant, especially since many infrastructure groups have separate DBA and Storage teams; so who owns an integrated DB-plus-storage solution? Selling and implementing Hadoop can be an uphill challenge in less-than-innovative IT organizations.

Cloudera Hadoop Developer Training: Event Details

The training location, Bridge Education, has reasonable facilities. After a while I noticed the Sun server posters (server p0rn, anyone?) in the hallways, classrooms, kitchen pantry, basically everywhere. I hadn’t realized what a big deal Sun Java J2EE certifications were; that’s the only connection I could make to the wall art. Lunch was provided (Chipotle on the first day), so we didn’t need to waste time looking for a place to eat.

Cloudera provides electronic copies of all course materials, including a VMware virtual machine running CentOS with Hadoop in pseudo-distributed mode. This was pretty cool: the VM runs each Hadoop daemon (the HDFS NameNode and DataNode, plus the Map Reduce job-tracking daemons) as a separate process, letting you experience how to get data into and out of a Hadoop cluster without having to configure a true multi-node environment. The lecture notes, a 500-page PowerPoint PDF, are downloaded from the training site. The best part is the fully configured VM image, which saves me tons of time downloading and installing the framework for testing at home. It is very similar to the Cloudera image available on the web, so overall, the best way to test Hadoop and run the sample Map Reduce jobs is through a VM image.
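As a small example of what the pseudo-distributed VM makes easy to try, here is a sketch in Java using the standard org.apache.hadoop.fs.FileSystem API to copy a local file into HDFS and list the target directory (the file paths are hypothetical, not the ones from the course labs):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCopyExample {
      public static void main(String[] args) throws Exception {
        // Reads the cluster location from the VM's Hadoop config files,
        // so this talks to the pseudo-distributed HDFS, not the local disk.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths, for illustration only.
        Path local = new Path("/home/training/shakespeare.txt");
        Path remote = new Path("/user/training/input/shakespeare.txt");

        // Copy the local file into HDFS: the NameNode records the metadata,
        // the DataNode daemon stores the blocks.
        fs.copyFromLocalFile(local, remote);

        // List the HDFS directory to confirm the file landed there.
        for (FileStatus status : fs.listStatus(remote.getParent())) {
          System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        fs.close();
      }
    }

The same round trip can also be done from the command line with hadoop fs -put and hadoop fs -ls.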

Big Data Is Less About Size, And More About Freedom

My favorite quote in this interview:

Hal Varian, Google’s Chief Economist, recently said,

“The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it”

Unfortunately for those of us working on these problems in real life, it is not so simple. The archetypal data-renaissance man is mathematician, statistician, computer scientist, machine learner, and engineer all rolled into one. There are opportunities where you can lack some of these skills and work with a team that supplements your weak points—a startup is not one of those.

via TechCrunch: Big Data Is Less About Size, And More About Freedom.