Using a fishkeeping forum corpus with everyone’s favorite vector representation
I wanted to play around with word2vec but did not want to use the typical datasets (IMDB reviews, etc.). So I thought: what if I scraped one of my favorite fishkeeping forums and applied word2vec to find “experts” within the forum? Well, it turns out this is a much longer journey than I originally thought it would be, but an interesting one nonetheless.
This is the first of hopefully several blog posts about my adventures with word2vec/doc2vec. I have a few ideas on how to leverage this corpus using deep learning to auto-generate text, so stay tuned, and if interested, drop me a line or leave a comment!
Background
Word2vec was originally developed by Google researchers, and the algorithm has been widely discussed. It produces vector representations of words by training a shallow (not deep) neural network on sequences of words. Doc2vec extends this by learning an additional paragraph (document) vector that captures the context of a whole passage alongside the word embeddings. The original Paragraph Vector paper can be found at https://cs.stanford.edu/~quocle/paragraph_vector.pdf. A quick literature search suggested doc2vec rather than word2vec for my particular use case, since I want to compare entire user posts (essentially multiple paragraphs) instead of just words.
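To make that a bit more concrete, here is a hedged sketch of what doc2vec gives you, using the gensim library (4.x API); this is not code from this project, and the toy posts and tag names are made up:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Toy corpus: each document gets a unique tag (hypothetical post IDs).
docs = [
    TaggedDocument(simple_preprocess("My betta flares at his reflection all day"), ["post_1"]),
    TaggedDocument(simple_preprocess("Dosing ferts and CO2 finally fixed my dwarf hairgrass"), ["post_2"]),
    TaggedDocument(simple_preprocess("Cycling a 20 gallon tank with pure ammonia"), ["post_3"]),
]

# A shallow network under the hood: vector_size is the embedding dimension.
model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a paragraph vector for unseen text and find the most similar post.
vec = model.infer_vector(simple_preprocess("pressurized co2 and fertilizer dosing"))
print(model.dv.most_similar([vec], topn=1))
```

The key point is that whole documents, not just words, end up as vectors you can compare.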
Later, I found this very informative online video from PyData Berlin 2017, where another data scientist used doc2vec to analyze comments on news websites. I thought that was cool, and it further fueled my interest to tinker with this algorithm in my spare time… fast forward a few hours, and it’s almost daylight and I’m still here typing away…
I highly recommend watching this video for additional context:
What I’m trying to do
I’d like to do the following:
- analyze user posts on Fishlore.com to identify who the “experts” are on fishkeeping and plants/aquascaping (a rough sketch of one approach follows this list)
- have fun with doc2vec while doing this
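For the first bullet, here is a hedged sketch of one way this could look with gensim’s Doc2Vec: tag every post with its author’s username so the model learns one document vector per user, then rank users by similarity to a topic query. The `scraped_posts` iterable is a hypothetical placeholder for whatever the forum scraper returns, not an actual part of this project yet.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# scraped_posts: hypothetical iterable of (username, post_text) pairs from the scraper.
tagged = [
    TaggedDocument(simple_preprocess(text), [username])
    for username, text in scraped_posts
]

model = Doc2Vec(vector_size=200, min_count=5, epochs=20)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# Rank users by how close their aggregate post vector sits to a planted-tank query.
query = model.infer_vector(simple_preprocess("planted tank co2 dosing aquascape fertilizer"))
print(model.dv.most_similar([query], topn=10))
```

Whether “closest to a topic query” is a good proxy for expertise is exactly the kind of question I want to explore in later posts.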