Fishy Fun with Doc2Vec

Using a fishkeeping forum corpus with everyone’s favorite vector representation

I wanted to play around with word2vec but did not want to use the typical data sets (IMDB, etc.). So I asked myself: what if I scraped one of my favorite fishkeeping forums and tried to apply word2vec to find “experts” within the forum? Well, it turns out this is a much longer journey than I originally thought it would be, but an interesting one nonetheless.

This is the first of hopefully several blog posts about my adventures with word2vec/doc2vec. I have a few ideas on how to leverage this corpus using deep learning to auto-generate text, so stay tuned, and if you're interested, drop me a line or leave a comment!

Background

Word2vec was originally developed by Google researchers, and many people have discussed the algorithm. It learns a vector representation for each word using a shallow (not deep) neural network. Doc2vec adds additional information (namely the paragraph context) to the word embeddings. The original paper on Paragraph Vector can be found at https://cs.stanford.edu/~quocle/paragraph_vector.pdf. A quick literature search convinced me to use doc2vec instead of word2vec for my particular use case, since I wanted to compare user posts (essentially multiple paragraphs) instead of just words.

Later, I found this very informative online video from PyData Berlin 2017 where another data scientist used doc2vec to analyze comments on news websites. I thought that was cool, and it further fueled my interest in tinkering with this algorithm in my spare time… fast forward a few hours, and it’s almost daylight and I’m still here typing away…

I highly recommend watching this video for additional context:   

What I’m trying to do

I’d like to do the following:

  • analyze user posts on Fishlore.com to identify who the “experts” on fishkeeping and plants/aquascaping are
  • have fun with doc2vec while doing this


Generate heart rate charts from MapMyRide TCX files

So I had some free time over Columbus Day weekend and figured why not spend it on a fun programming project. My politically-incorrectly named GhettoTCX project emerged after some quick fussing around with TCX (XML) files.

Ghetto TCX

GhettoTCX will parse a TCX file from Garmin, MapMyRide, etc. and generate some basic plots. The most interesting plot type is the heart rate zone chart. It can also create a panel of plots by parsing all the files in a given directory.

It’s called GhettoTCX because it’s a no-frills, nothing-fancy tool, not even a true TCX file parser. It simply searches for some keywords and pulls out heartbeat info and lat/long data. And not even at the same time: you need to read the file twice if you want to plot both.
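To give a feel for the keyword-search approach described above, here is a minimal sketch using only the Python standard library. It is not the actual GhettoTCX code: the function names, the zone boundaries, and the maximum heart rate of 190 are all assumptions made up for illustration. The tag names (HeartRateBpm, LatitudeDegrees, LongitudeDegrees) are the ones found in TCX files.

```python
import re
from collections import Counter

# Hypothetical zone lower bounds as fractions of max heart rate.
# These are illustrative assumptions, not GhettoTCX's real settings.
ZONE_BOUNDS = [(0.5, "Z1"), (0.6, "Z2"), (0.7, "Z3"), (0.8, "Z4"), (0.9, "Z5")]

# Keyword-style patterns: no real XML parsing, just text matching.
HR_PATTERN = re.compile(r"<HeartRateBpm[^>]*>\s*<Value>(\d+)</Value>")
LATLON_PATTERN = re.compile(
    r"<LatitudeDegrees>([-\d.]+)</LatitudeDegrees>\s*"
    r"<LongitudeDegrees>([-\d.]+)</LongitudeDegrees>"
)


def extract_heart_rates(tcx_text):
    """Return every heart rate reading (bpm) found in the text, in order."""
    return [int(value) for value in HR_PATTERN.findall(tcx_text)]


def extract_positions(tcx_text):
    """Return (latitude, longitude) pairs found in the text, in order."""
    return [(float(lat), float(lon)) for lat, lon in LATLON_PATTERN.findall(tcx_text)]


def zone_for(bpm, max_hr):
    """Return the zone label for one reading, or None if below the lowest zone."""
    fraction = bpm / max_hr
    label = None
    for lower_bound, name in ZONE_BOUNDS:
        if fraction >= lower_bound:
            label = name
    return label


def zone_counts(rates, max_hr):
    """Count how many readings fall into each zone."""
    zones = (zone_for(rate, max_hr) for rate in rates)
    return Counter(zone for zone in zones if zone is not None)


# A tiny hand-written TCX fragment standing in for a real file.
SAMPLE = """
<Trackpoint>
  <Position><LatitudeDegrees>40.1</LatitudeDegrees>
  <LongitudeDegrees>-83.0</LongitudeDegrees></Position>
  <HeartRateBpm><Value>100</Value></HeartRateBpm>
</Trackpoint>
<Trackpoint><HeartRateBpm><Value>150</Value></HeartRateBpm></Trackpoint>
<Trackpoint><HeartRateBpm><Value>175</Value></HeartRateBpm></Trackpoint>
"""

if __name__ == "__main__":
    rates = extract_heart_rates(SAMPLE)
    print(rates)
    print(zone_counts(rates, max_hr=190))
    print(extract_positions(SAMPLE))
```

Like the real tool, this sketch makes one pass over the text for heart rate and a separate pass for lat/long. The per-zone counts are what you would feed into a bar chart to get a heart rate zone plot.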

Heart Rate plots

The example code and Python code repository can be found on the project’s GitHub page.

There are “better” TCX/XML file parsers out there. This one was meant to do one thing (actually two things), quickly and easily: plot heart rate (and heart rate zones). It can also plot lat/long data points onto a scatterplot, but that feature is seriously no-frills when you can get nice Google Maps charts on MapMyRide and practically any other fitness app out there.

It started out (and ended) as a fun weekend programming project… if you are curious about your heart rate zones, and are too cost-conscious to pay the monthly subscription fee to MapMyRide for the heart rate zone chart, you can use this free tool instead. Enjoy!