Building a credit model

A coworker recently asked me to explain how one goes about building a credit risk model. It’s something my company does a lot of, but apparently it’s not taught during new-hire onboarding. It also made me wonder how I would explain the process end to end to someone interested in our industry who is not a practitioner. Curious, I searched Google in case anyone had already done so, and of course someone had! So, here’s a quite impressive deep dive into credit risk modelling, thanks to Natasha Mashanovich, Senior Data Scientist at World Programming: Credit Scoring: The Development Process from End to End

Credit Scores throughout the Customer Journey

This is a ten-part series of blog posts describing the entire process. Her company seems to be some sort of SAS competitor, and I am not endorsing her product or company in any way. That said, her write-up is pretty tool-agnostic and general, so it is worth a read if you are interested.

Personally, I would create a modelling pipeline in Python / PySpark (since we deal with large data sets) in a cloud environment (like AWS) instead of SAS, but not everyone in the financial services industry has moved to the cloud yet. I hope you find the link helpful…

Fishy Fun with Doc2Vec

Using a fishkeeping forum corpus with everyone’s favorite vector representation

I wanted to play around with word2vec but did not want to use the typical data sets (IMDB, etc.). So, I said, what if I were to do some web scraping of one of my favorite fishkeeping forums and attempt to apply word2vec to find “experts” within the forum. Well, turns out this is a much longer journey than I originally thought it would be, but an interesting one nonetheless.

This is the first of hopefully several blog posts about my adventures with word2vec/doc2vec. I have a few ideas on how to leverage this corpus using deep learning to auto-generate text, so stay tuned, and if interested, drop me a line or leave a comment!


Word2vec was originally developed by Google researchers, and many people have discussed the algorithm. It learns a vector representation for each word using a shallow (not deep) neural network. Doc2vec adds additional information (namely paragraph context) to the word embeddings by also learning a vector for each paragraph or document. The original paper on Paragraph Vector describes the approach. A quick literature search revealed that I wanted doc2vec rather than word2vec for my particular use case, since I wanted to compare user posts (essentially multiple paragraphs) instead of just words.
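
To make "comparing user posts" a bit more concrete: once each post has a vector, two posts can be compared with cosine similarity. This is only a minimal sketch in plain Python. The three-dimensional vectors are made-up stand-ins for real doc2vec output (which typically has tens to hundreds of dimensions), and the post names and helper functions are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for doc2vec output (assumption, not real data)
post_vectors = {
    "post_about_plants": [0.9, 0.1, 0.0],
    "post_about_aquascaping": [0.8, 0.2, 0.1],
    "post_about_filters": [0.0, 0.2, 0.9],
}


def most_similar(name, vectors):
    """Return (other_name, similarity) for the post closest to `name`."""
    target = vectors[name]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != name
    ]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    print(most_similar("post_about_plants", post_vectors))
```

With real embeddings, the same comparison could be run per user (for example, on the average of a user's post vectors) to see whose posts cluster around a topic.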

Later, I found this very informative online video from PyData Berlin 2017 where another data scientist used doc2vec to analyze comments on news websites. I thought that was cool, and it further fueled my interest in tinkering with this algorithm in my spare time… fast forward a few hours, and it’s almost daylight and I’m still here typing away…

I highly recommend watching this video for additional context:   

What I’m trying to do

I’d like to do the following:

  • analyze user posts on the forum to identify who the “experts” on fishkeeping and plants/aquascaping are
  • have fun with doc2vec while doing this


Computer Vision meets Fish Tank

One day I got curious… what if I programmed my computer to track the fish swimming in my fish tank? That led me to tinkering with an open source software library called OpenCV. I fiddled around with the settings, tried a few things, and saved the output as a video, seen below. There’s a lot of research in computer science around object recognition and identification … this mini-project was just an attempt to have some fun poking around with some “older” computer vision technologies. Let me know what you think!


Python API to AqAdvisor


Approximately 10% of American households have fish as pets. It is estimated that 95% of fish deaths can be attributed to improper housing or nutrition. Fish are often sold or given away without any guidance to the new pet owner, such as goldfish giveaways at carnivals or at birthdays. Some fish have myths associated with them, such as the betta (Siamese fighting fish), which supposedly can live in dirty water in small bowls.

AqAdvisor is a website that helps aquarists plan how to stock their fish tank. Users specify their tank size, their filtration, and what fish they intend to keep in the tank. The site calculates the stocking level and filtration capacity from those inputs. This is a useful tool for getting a rough estimate of a fish tank’s stocking level; it even tells you whether the fish are compatible with one another if you have more than one species in the tank. AqAdvisor is sometimes criticized for “not being accurate”, so its output should not be treated as gospel; nonetheless, it gives a reasonable starting point and is generally very useful for beginner fishkeepers.

Why I created this tool

I started using AqAdvisor and got annoyed at the archaic design. It’s not a RESTful API; it’s a clunky web site that takes a while to load. I was doing lots of research and found myself wanting a better user experience. I also had some free time on my hands one long holiday weekend, so I decided to give myself a little programming exercise: creating a Python API to the site.

How to use the tool

The easiest way to use the tool is to start from the IPython notebook. First, create a stocking, then a tank, and then make a call to the AqAdvisor service. Because of the clunky web interface, multiple calls to AqAdvisor must be made if you want more than one fish species in a tank (as would be the case for a community tank). The auto-generated AqAdvisor URL is printed for each call out to the website. This is useful if you want to jump over to the web UI: just copy and paste the URL into your web browser and continue from there.

Use the common (English) name for the fish you are looking for. PyAqAdvisor will do a “fuzzy match” against AqAdvisor’s species list and pick the closest one. This way you can specify your stocking list as “cardinal tetra” and not worry about the scientific name.
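
For readers curious what a “fuzzy match” can look like, here is a minimal sketch using Python’s standard-library difflib. It is not PyAqAdvisor’s actual matching code (I am not claiming the library uses difflib). The small species table and the match_species helper are hypothetical, made up for illustration.

```python
import difflib

# Hypothetical species table (common name -> scientific name);
# the real list would come from AqAdvisor itself.
SPECIES = {
    "cardinal tetra": "Paracheirodon axelrodi",
    "neon tetra": "Paracheirodon innesi",
    "panda cory": "Corydoras panda",
    "lemon tetra": "Hyphessobrycon pulchripinnis",
}


def match_species(query, species=SPECIES, cutoff=0.6):
    """Return the closest known common name, or None if nothing is close."""
    # Normalize case and separators so 'Lemon_Tetra' and 'lemon tetra' agree
    normalized = query.lower().replace("_", " ").strip()
    matches = difflib.get_close_matches(
        normalized, list(species), n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


if __name__ == "__main__":
    for name in ["cardnal tetra", "lemon_tetra", "shark"]:
        print(name, "->", match_species(name))
```

The cutoff controls how forgiving the match is: lower values accept sloppier spellings but raise the risk of picking the wrong species.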

Please look at examples/ and examples/example.ipynb for more information.

Here’s an example of how easy it is to use the new API:

from pyaqadvisor import Tank, Stocking

if __name__ == '__main__':

  # Build the stocking list from common names and quantities
  stocking = Stocking().add('cardinal tetra', 5)\
   .add('panda cory', 6)\
   .add('lemon_tetra', 12)\
   .add('pearl gourami', 4)

  print("My user-specified stocking is: {}".format(stocking))
  print("I translate this into: {}".format(stocking.aqadvisor_stock_list))

  # Describe the tank, attach the stocking, and query AqAdvisor
  t = Tank('55g').add_filter("AquaClear 30").add_stocking(stocking)
  print("AqAdvisor tells me: {}".format(t.get_stocking_level()))

GitHub Repo: PyAqAdvisor


  • PyAqAdvisor currently only works for freshwater fish species. If you are interested in saltwater fish, please contact me.

Generate heart rate charts from MapMyRide TCX files

So I had some free time over Columbus Day weekend and figured, why not spend it on a fun programming project? My politically-incorrectly named GhettoTCX project emerged after some quick fussing around with TCX (XML) files.

Ghetto TCX

GhettoTCX will parse a TCX file from Garmin, MapMyRide, etc. and generate some basic plots. The most interesting plot type is the heart rate zone chart. It can also create a panel of plots by parsing all the files in a given directory.

It’s called GhettoTCX because it’s a no-frills, nothing-fancy tool, not even a true TCX file parser. It simply searches for some keywords and pulls out heartbeat info and lat/long data. And not even at the same time: you need to read the file twice if you want to plot both.
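
To illustrate the “search for keywords” idea, here is a minimal sketch that pulls heart rate and lat/long values out of TCX text with regular expressions, in two separate passes. This is not GhettoTCX’s actual code. The function names and the sample data are made up, and only the element names (HeartRateBpm, Value, LatitudeDegrees, LongitudeDegrees) come from the TCX format.

```python
import re

# Tiny made-up TCX fragment, for illustration only
SAMPLE_TCX = """
<Trackpoint>
  <Time>2014-10-13T14:00:00Z</Time>
  <Position>
    <LatitudeDegrees>40.7128</LatitudeDegrees>
    <LongitudeDegrees>-74.0060</LongitudeDegrees>
  </Position>
  <HeartRateBpm><Value>128</Value></HeartRateBpm>
</Trackpoint>
<Trackpoint>
  <Time>2014-10-13T14:00:05Z</Time>
  <Position>
    <LatitudeDegrees>40.7130</LatitudeDegrees>
    <LongitudeDegrees>-74.0058</LongitudeDegrees>
  </Position>
  <HeartRateBpm><Value>142</Value></HeartRateBpm>
</Trackpoint>
"""

# [^>]* tolerates attributes on the HeartRateBpm tag
HEART_RATE = re.compile(r"<HeartRateBpm[^>]*>\s*<Value>(\d+)</Value>")
LAT_LONG = re.compile(
    r"<LatitudeDegrees>([-\d.]+)</LatitudeDegrees>\s*"
    r"<LongitudeDegrees>([-\d.]+)</LongitudeDegrees>"
)


def extract_heart_rates(text):
    """First pass: heart rate samples in beats per minute."""
    return [int(value) for value in HEART_RATE.findall(text)]


def extract_positions(text):
    """Second pass: (latitude, longitude) pairs."""
    return [(float(lat), float(lon)) for lat, lon in LAT_LONG.findall(text)]


if __name__ == "__main__":
    print(extract_heart_rates(SAMPLE_TCX))
    print(extract_positions(SAMPLE_TCX))
```

A regex scan like this ignores XML structure entirely, which is exactly why it is quick to write and also why a real XML parser is the safer choice for anything beyond a weekend project.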

Heart Rate plots

The example code and Python code repository can be found on the project’s GitHub page.

There are “better” TCX/XML file parsers out there. This one was meant to do one thing (actually two things), quickly and easily: plot heart rate (and heart rate zones). It can also plot lat/long data points on a scatterplot, but that is seriously no-frills when you can get nice Google Maps charts on MapMyRide and practically any other fitness app out there.
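
For anyone wondering how heart rate samples turn into a zone chart, here is a minimal sketch of the bucketing step. The zone boundaries (five zones starting at 50%, 60%, 70%, 80%, and 90% of maximum heart rate) are a common convention and an assumption here. GhettoTCX and MapMyRide may define zones differently, and the function names are hypothetical.

```python
from collections import Counter

# (zone number, lower bound as a fraction of max heart rate), highest first.
# These boundaries are an assumed convention, not taken from GhettoTCX.
ZONE_LOWER_BOUNDS = [(5, 0.90), (4, 0.80), (3, 0.70), (2, 0.60), (1, 0.50)]


def zone_for(bpm, max_hr):
    """Map one heart rate sample to a zone (0 means below zone 1)."""
    fraction = bpm / max_hr
    for zone, lower_bound in ZONE_LOWER_BOUNDS:
        if fraction >= lower_bound:
            return zone
    return 0


def zone_counts(heart_rates, max_hr):
    """Count how many samples fall in each zone, 0 through 5."""
    counts = Counter(zone_for(bpm, max_hr) for bpm in heart_rates)
    return {zone: counts.get(zone, 0) for zone in range(6)}


if __name__ == "__main__":
    samples = [90, 100, 130, 150, 170, 190, 195]
    print(zone_counts(samples, max_hr=200))
```

If the samples are recorded at a fixed interval, the counts are proportional to time spent in each zone, which is what a zone bar chart plots.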

It started out (and ended) as a fun weekend programming project… if you are curious about your heart rate zone, and are too cost-conscious to pay the monthly subscription fee to MapMyRide for the heart rate zone chart, you can use this free tool instead. Enjoy!

Map Reduce is dead, long live Spark!

Map Reduce is dead, long live Spark!

That’s the impression I, and I think most people attending the conference, walked away with after Strata NY 2014. Most of the interesting presentations were centered on Spark. Only the corporate IT presentations about “in progress Hadoop implementations” were about Map Reduce.

So who’s working on Spark?  Cool startups and vendors (preparing for enterprise IT departments to move on to Spark in a year or two).

Who’s working on Map Reduce? Corporate IT departments migrating off legacy BI systems onto the promised land of Hadoop (dream come true, or nightmare around the corner, not sure which one it will be for people).

It makes sense. Map Reduce has been tested and is ‘safe’ now for enterprise IT teams to start deploying into production systems. Spark is still very new and untested. It is too risky for a Fortune 500 company to dive into replacing legacy systems with a still-in-diapers open source software “solution.” Nonetheless, I am sure every technical worker will be drooling to “prototype” or create proofs of concept with Spark after this conference.

Reflections on Strata NYC 2014

I had a chance to attend Strata in New York back in October.  I had been wanting to attend Strata for a few years, but had not had a chance until now.  A few impressions: (in the form of brief bullets)

  • It’s huge! (Over 3,000 attendees)
  • Very corporate!  (A bit too corporate, too stuffy, seemed like legal departments censored some presentations)
  • All the cool kids are using/learning Spark (and Scala)
  • Map Reduce is old news.
  • Enterprises move slow like dinosaurs, are just figuring out what Map Reduce is
  • Way too many vendors
  • Not enough interesting/inspiring presentations

Those were just my impressions, others may have other opinions.

Pentaho and Vertica as Business Intelligence / Data Warehousing solution


I recently wrapped up a BI/Data Warehouse implementation project where I was responsible for helping a rapidly expanding international e-commerce company replace their aging BI reporting tool with a new, more flexible solution. The old BI reporting tool was based on an “in-memory” reporting engine, was more of a “departmental solution” than an enterprise-grade one, and was not optimally designed. For example, users found themselves downloading data from different canned reports to Excel, where they ran VLOOKUPs and pivot tables to compute simple metrics such as average order value and average unit retail. Needless to say, despite the best of intentions, there had been a communication gap between business users and IT developers on reporting requirements during the implementation of the original BI tool.

In designing and implementing the new solution, I set the following strategic tenets / guiding principles:

  • leverage commercial off-the-shelf (COTS) software; minimize customization and emphasize configuration instead (i.e., chose to buy instead of build, and made sure not to build too much after buying)
  • involve all stakeholders and business users throughout process
  • enable business users to use self-service BI tools as much as possible
  • train as needed; up-skilling user base on self-service tools is better than hiring army of BI analysts
  • leverage data warehouse for both internal and external reporting
  • minimize amount of aggregation in Data Warehouse (we did almost no aggregation)
  • maximize the processing power of the ROLAP engine by pairing it with a high-performance analytical database (i.e., columnar MPP database)
  • stick to Kimball data warehouse design approach as much as possible, but be pragmatic where needed; Star Schema, Star Schema, Star Schema! (no snowflakes here)
  • take an iterative approach where possible – need to “ship” on time – understand that 1st release will not be “perfect” but does need to meet business requirements
  • for external reporting, provide canned reports only initially; test user adoption and work with clients to understand and address reporting needs over time
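
To illustrate the star schema and “almost no aggregation” principles above, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for a real analytical database. The table names, column names, and sample rows are all made up for illustration and are not from the actual project. Facts are stored at order-line grain, and metrics such as average order value and average unit retail are computed at query time rather than pre-aggregated.

```python
import sqlite3

SCHEMA_AND_DATA = """
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY, calendar_date TEXT, month TEXT);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
-- Fact table at order-line grain; no pre-computed aggregates
CREATE TABLE fact_order_line (
    order_id INTEGER,
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    units INTEGER,
    revenue REAL);
INSERT INTO dim_date VALUES
    (20140101, '2014-01-01', '2014-01'),
    (20140201, '2014-02-01', '2014-02');
INSERT INTO dim_product VALUES
    (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'), (3, 'Manual', 'Books');
INSERT INTO fact_order_line VALUES
    (1001, 20140101, 1, 2, 50.0),
    (1001, 20140101, 3, 1, 25.0),
    (1002, 20140201, 2, 1, 40.0),
    (1003, 20140201, 1, 4, 100.0);
"""

MONTHLY_METRICS = """
SELECT d.month,
       SUM(f.revenue)                              AS revenue,
       SUM(f.revenue) / COUNT(DISTINCT f.order_id) AS avg_order_value,
       SUM(f.revenue) / SUM(f.units)               AS avg_unit_retail
FROM fact_order_line AS f
JOIN dim_date AS d ON d.date_key = f.date_key
GROUP BY d.month
ORDER BY d.month
"""


def monthly_metrics():
    """Build the toy star schema in memory and aggregate at query time."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA_AND_DATA)
        return conn.execute(MONTHLY_METRICS).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in monthly_metrics():
        print(row)
```

Because the fact table keeps the finest grain, the same rows can answer questions by product category or any other dimension just by changing the join and GROUP BY, which is the job a ROLAP engine paired with a fast columnar database does at scale.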

We looked at traditional players, open source, emerging technologies, and Cloud BI SaaS providers. I made sure business and IT stakeholders were part of the vendor selection process, ensuring they attended demos and vendor presentations. In the end, Pentaho best matched all our needs, providing us with both a solid ETL tool and a solid BI reporting engine. Since we were looking at providing both internal and external reporting with this solution, traditional BI vendors were prohibitively expensive, and “cloud offerings” were not compatible with our current IT capabilities and architecture (our data was not in the cloud).

Solution Description – Vertica + Pentaho BI/PDI

I proposed and received approval from our senior management and company board of directors to use Pentaho and Vertica as our Business Intelligence (BI) / Data Warehouse (DW) solution.


HP Vertica is a columnar MPP database that is reportedly 20-100 times faster than Oracle for analytical workloads. HP Vertica is available in a Community Edition, allowing organizations to use all the features of the database for free for data up to 1TB on three nodes. You can also install the database on a single node, though for a true proof of concept, you should get at least 3 nodes. We started with Vertica 6.1 Community Edition for our proof of concept (POC) and later upgraded to an enterprise license when we went live in production.


Pentaho is an open source BI platform and ETL tool. I liked the fact that it was open source, allowing us to highly customize the BI implementation if we chose to, as well as develop our own ETL connectors and routines. Some of the client tools are a bit quirky, but I do not know what BI/ETL software isn’t, given the complexity of these tools. Overall the product is solid and delivers as expected. We got the enterprise edition for the additional features and product support from Pentaho. One annoyance is that the configuration files are spread all over the place. To be fair, this is probably more of a Java application configuration issue than a Pentaho issue.

When I tell people that I’m using Pentaho, they are usually surprised; then I find out they were using Pentaho 3.x, and I’m no longer surprised by their reaction. Pentaho 4.x is a big step up from previous major releases, and Pentaho 5.0 is looking really good (I like their UI redesign). I encourage anyone who looked at an early version of Pentaho to take another look; the product has matured.

When I was selecting a BI vendor, the thought “no one ever got fired for choosing IBM (Cognos)” crossed my mind. I could have gone the “safe” route and used one of these other tools. However, I believe the combination of Vertica + Pentaho has delivered more value to the organization in a shorter amount of time than we would have realized with these other vendors. For our organization, for our business needs, and for the realities of our IT capabilities at the time, Pentaho + Vertica was the way to go. We delivered the project on time and within budget (and without astronomical first-year costs). We have 100% user adoption internally, and are getting very positive feedback from our merchant clients.


  • Recognized by CEO for on-time, on-budget implementation; received “A” grade on end-of-year Enterprise-wide Strategic Initiatives Scorecard
  • Excellent user adoption
  • Positive feedback from external clients
  • Reduced manual reporting tasks over 50% (and over 80% in certain departments)

Reflections on Big Data Roundtable hosted by JNK Securities

Last week I had the opportunity to attend a Big Data round table discussion organized by JNK Securities, a broker-dealer based in NY w/ offices in DC. The attendees seemed to be evenly split between technologists/practitioners and finance professionals hoping to get a pulse on market trends. The conversation was moderated by Atul Chhabra, entrepreneur and formerly Director of Cloud Strategy at Verizon.

The finance professionals were eager to understand how Hadoop, NoSQL, and other Big Data technologies were going to disrupt (or not) existing technology vendors. One person asked how easy it would be for existing companies to replace their Oracle installs with Hadoop or a NoSQL database if Oracle licensing agreements were structured to penalize such a migration. As the crowd quickly pointed out, the problem in moving away from Oracle is not “termination fees” but the level of investment (i.e., cost) needed to refactor existing code and applications so that they still function as expected. One way RDBMS vendors increase their product’s “stickiness” and cost of migration is to promote their database’s proprietary language (PL/SQL for Oracle, T-SQL for Microsoft) over ANSI standards. If an application relies heavily on stored procedures written in these languages, they will have to be rewritten in the new database’s language (or in standard ANSI SQL to make them more easily transferable in the future). Of course, that’s assuming there are no hidden “gotchas” in the code itself, such as a programmer making a direct JDBC call to a database and hard-coding the SQL in the web application code. Bottom line: it would be very expensive to rewrite existing code, and very hard to justify doing so, since by itself the rewrite adds no value to the company.

Additionally, as Atul pointed out, migrating off Oracle is unlikely to reduce licensing costs for enterprises, since these licensing contracts are typically based on the number of employees or clients. Migrating one application off Oracle would not change the number of employees, so the licensing costs remain the same, and total costs in fact increase if there are licensing costs for the new technology (there usually are). What is more likely is that companies will build new tools and applications using emerging technologies and leave legacy systems as is.

An interesting idea put out by one of the attendees was that the way we think about coding and building applications will dramatically change now that we are in the age of Big Data and Big Compute. There is a fundamental shift in how we think about designing applications: instead of coding for the limits of hardware, assume “best case scenarios” of unbounded scalability and unending amounts of storage and RAM, thanks to developments in Big Data architecture, horizontal scaling, and massively parallel processing (MPP). For example, no longer code applications and file systems to purposely delay processing while waiting for hard drives to spin up or to perform file seek operations; instead assume instantaneous read/write thanks to SSDs, assume infinite storage (through HDFS-like architecture), and assume unbounded parallelism (i.e., no longer bounded by the number of cores on one particular server).

Overall, it was a great event, good dinner conversation with smart people.  Looking forward to future events.