Fishy Fun with Doc2Vec

Using a fishkeeping forum corpus with everyone’s favorite vector representation

I wanted to play around with word2vec but did not want to use the typical data sets (IMDB, etc.). So, I said, what if I were to do some web scraping of one of my favorite fishkeeping forums and attempt to apply word2vec to find “experts” within the forum. Well, turns out this is a much longer journey than I originally thought it would be, but an interesting one nonetheless.

This is a first blog post of hopefully several of my adventures with word2vec/doc2vec. I have a few ideas on how to leverage this corpus using deep learning to auto-generate text, so stay tuned, and if interested, drop me a line or leave a comment!

Background

So word2vec was originally developed by Google researchers and many people have discussed the algorithm. Word2vec provides a vector representation of a sequence of words using a not-deep neural network. Doc2vec adds additional information (namely context, or paragraph context) to the word embeddings. The original paper on Paragraph Vector can be found at https://cs.stanford.edu/~quocle/paragraph_vector.pdf A quick literature search revealed I wanted to use doc2vec instead of word2vec for my particular use case since I wanted to compare user posts (essentially multiple paragaphs) instead of just words.

Later, I found this very informative online video from PyData Berlin 2017 where another data scientist used doc2vec to analyze comments on news websites. I thought that was cool, and further fueled my interest to tinker with this algorithm in my spare time… fast forward a few hours, and its almost daylight and I’m still here typing away…

I highly recommend watching this video for additional context:   

What I’m trying to do

I’d like to do the following:

  • analyze user posts on Fishlore.com to identify who are the “experts” on fishkeeping and plants/aquascaping
  • have fun with doc2vec while doing this


Some challenges:

  • no corpus -> need to scrape the site myself (this was relatively easy to do w/ scrapy)
  • no training data -> there isn’t a training set for who are “experts” on this site. Many “newbies” have a handful of posts. Perhaps # posts could be a proxy for “expert” but there are some very prolific posters who are new to the hobby.
  • relatively small data set -> only ~17,000 users. A million or so posts (that I scraped). Might be good enough for fun, but its not Google scale here by any means
  • highly related documents -> they’re almost all about fish! so same vocabulary, hard to differentiate.
  • highly specialized vocabulary -> many entities (cardinal tetra, lemon tetra, other fish and plant species, etc.) that may need to be encoded or tagged (or just let doc2vec figure it out w/ enough input data, right?)

Let’s have some fun and forgive me if I cut corners here… it was getting very late at night 😉

What happened

Used gensim python library for word2vec/doc2vec functionality. I jumped right in, did some google searches trying to follow some tutorials. So, many of the tutorials I found were out of date (old gensim API). Of course since I was trying to do this as fast as possible, I didn’t read thru the gensim code until much later. The tutorials online are out of date, the API has changed a bit. The best resources were the sample ipython notebooks in the gensim library itself (https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb), so if you are interested, follow those first, and forget the online blog posts (except mine since it’s really current and my code actually runs 😉

Pre-work:

I scraped fishlore.com using scrapy. I wrote some custom spiders to download threads in various forums areas, including:

  • beginners
  • plants
  • fish disease
  • other interests

Then, I saved the output by username; that is, one file per user, with one line per user post. This came out to about to 383MB uncompressed text. This wasn’t a complete scrape of fishlore.com, but I figured it was enough to have some interesting results and without them yelling at me. It yielded about posts by about 24,594 users. I am tempted to create some plots/charts about the data set (# posts per user, etc.) but for now I decided to skip that since I really wanted to get to working with doc2vec. I should mentioned I scraped the site over several days.

Since I want to eventually find “experts” based on their posts, I define a document to be all the posts by a particular user. There may be other/better ways to define a document (such as, by user-topic, i.e., Bob-plants, Bob-fishdisease, etc.) but I figured this was a reasonable first approach.

Implementation

I followed the Doc2Vec-IMDB example (https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb), and modified it for my corpus as follows (see iPython notebook):

Setup python notebooks and helper functions

In [1]:
# import relevant libraries
import glob
import re
from six import iteritems
import itertools
from collections import OrderedDict, namedtuple
import multiprocessing
import random
import numpy as np

import gensim
from gensim import corpora, models, similarities
from gensim.models import Doc2Vec
import gensim.models.doc2vec

# Make sure we have Cython installed properly and we setup parallelism
cores = multiprocessing.cpu_count()-1
assert gensim.models.doc2vec.FAST_VERSION > -1

# Turn on logging since it may take a while to train models
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

def lookup_result(target):
    """
    Let us lookup the user name given a document ID.
    (numerical IDs are more memory efficient than string-based document tags)
    """
    for u, doc_id_ in allusers_index.items():
        if doc_id_ == target:
            return u

def preprocess_text(s):
    """
    Perform some basic text cleanup
    """
    s = re.sub(r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)", 
               " _WEBSITE_ ", s) # replace web URLs  https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url
    s = re.sub(r"""said.*Click to expand\.\.\.""", "  ", s) # get rid of quote blocks
    s = re.sub(r"""\$(([1-9]\d{0,2}(,\d{3})*)|(([1-9]\d*)?\d))(\.\d\d)?$""", 
               " _DOLLAR_AMOUNT_ ", s) # replace dollar amounts  https://stackoverflow.com/questions/17864213/java-regular-expression-to-match-dollar-amounts
    s = re.sub(r"\.\.\.", r" . ", s) # convert ellipsis to period
    s = re.sub(r"\.\.", r" . ", s) # convert ellipsis to period
    s = re.sub(r"gallons", r"gallon", s) # domain specific transformation
    s = re.sub(r"([,.!?:])", r" \1 ", s) # convert sentence markers to words
    s = re.sub(r"""[\"~\(\)]""","", s) # get rid of unusual punctuation
    s = s.lower() # convert to lowercase
    return s

Let’s define FishyDocument as our document type. Notice it’s just a namedtuple. We’ll load the user data, apply some basic data transformations and tag the “documents” with a label. Notice we’re reading it all into memory. Works OK for now, but not super scalable. I filtered out the documents for longer-length documents (at least 5000 words) to keep training speed fast.

In [2]:
# Gensim 3.2 uses TaggedDocuments which are actually namedtuples:
FishyDocument = namedtuple('FishyDocument', 'words tags')
In [3]:
user_dir = "/Users/david/git/data/fishlore/users"
user_files = glob.iglob(user_dir + "/*")
In [4]:
# Let's use the IMDB Doc2Vec example ipython notebook as a template
# and customize for our example use case:
allusers_index = {} # Will function as a lookup table so we can find the username corresponding to a doc_ID
alldocs = []  # Will hold all docs in original order
counter = 0
for file_number, fn in enumerate(user_files):
    username = fn.split("/")[-1]
    with open(fn) as f:
        allwords = []
        #tags = [file_number]
        for line in f:
            line = preprocess_text(line)
            tokens = gensim.utils.to_unicode(line).split()
            words = tokens[1:] 
            if len(words) > 5: # we dont want short posts like "Bump" or "look great!"
                allwords.extend(words)
        if len(allwords) > 5000: # let's look at longer documents for now
            allusers_index[username] = counter
            tags = [counter]
            counter += 1
            alldocs.append(FishyDocument(allwords, tags))
    
    if file_number%2000==0: print "Loading file #{0}".format(file_number)

print "Read in {0} files. Used {1} files".format(file_number, counter)
Loading file #0
Loading file #2000
Loading file #4000
Loading file #6000
Loading file #8000
Loading file #10000
Loading file #12000
Loading file #14000
Loading file #16000
Loading file #18000
Loading file #20000
Loading file #22000
Loading file #24000
Read in 24594 files. Used 1883 files

Let’s build some models. Here I’m again using the IMDB ipython notebook for inspiration / code reuse.

In [5]:
simple_models = [
    # PV-DM w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DBOW 
    Doc2Vec(dm=0, size=100, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DM w/ average
    Doc2Vec(dm=1, dm_mean=1, size=100, window=10, negative=5, hs=0, min_count=2, workers=cores),

    # My attempts to do something different
    Doc2Vec(min_count=1, window=10, size=400, sample=1e-3, negative=5, workers=cores),
    Doc2Vec(min_count=1, window=10, size=400, sample=1e-3, negative=5, dm=0, workers=cores)
]

# Speed up setup by sharing results of the 1st model's vocabulary scan
simple_models[0].build_vocab(alldocs)  # PV-DM w/ concat requires one special NULL word so it serves as template
print(simple_models[0])
for model in simple_models[1:]:
    model.reset_from(simple_models[0])
    print(model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)
2018-01-07 00:41:22,578 : INFO : collecting all words and their counts
2018-01-07 00:41:22,580 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2018-01-07 00:41:31,427 : INFO : collected 205021 word types and 1883 unique tags from a corpus of 1883 examples and 40475453 words
2018-01-07 00:41:31,428 : INFO : Loading a fresh vocabulary
2018-01-07 00:41:32,935 : INFO : min_count=2 retains 81238 unique words (39% of original 205021, drops 123783)
2018-01-07 00:41:32,936 : INFO : min_count=2 leaves 40351670 word corpus (99% of original 40475453, drops 123783)
2018-01-07 00:41:33,195 : INFO : deleting the raw counts dictionary of 205021 items
2018-01-07 00:41:33,212 : INFO : sample=0.001 downsamples 53 most-common words
2018-01-07 00:41:33,214 : INFO : downsampling leaves estimated 28488881 word corpus (70.6% of prior 40351670)
2018-01-07 00:41:33,215 : INFO : estimated required memory for 81238 words and 100 dimensions: 431314600 bytes
2018-01-07 00:41:33,545 : INFO : using concatenative 1100-dimensional layer1
2018-01-07 00:41:33,545 : INFO : resetting layer weights
2018-01-07 00:41:34,660 : INFO : resetting layer weights
Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t3)
2018-01-07 00:41:35,762 : INFO : resetting layer weights
Doc2Vec(dbow,d100,n5,mc2,s0.001,t3)
2018-01-07 00:41:36,809 : INFO : resetting layer weights
Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t3)
2018-01-07 00:41:38,301 : INFO : resetting layer weights
Doc2Vec(dm/m,d400,n5,w10,s0.001,t3)
Doc2Vec(dbow,d400,n5,s0.001,t3)

Let’s check out some results!

In [6]:
doc_id = np.random.randint(simple_models[0].docvecs.count)  # pick random doc, re-run cell for more examples
doc_id = allusers_index['TexasDomer']
#doc_id = allusers_index['bigdreams']
#doc_id = allusers_index['jenmur']
#doc_id = allusers_index['Lchi87']


model = random.choice(simple_models)  # and a random model
sims = model.docvecs.most_similar(doc_id, topn=model.docvecs.count)  # get *all* similar documents

print "model", model
print "target:",
print lookup_result(doc_id)
print 
print "most similar:\n"
for i in range(10):
    print lookup_result(sims[i][0])
    
print "\nleast similar:\n"
for i in range(10):
    print lookup_result(sims[len(sims) - 1 - i][0])

print
print

print(u'TARGET (%d): «%s»\n' % (doc_id, ' '.join(alldocs[doc_id].words[:1000])))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(alldocs[sims[index][0]].words)[:1000]))
2018-01-07 00:41:39,861 : INFO : precomputing L2-norms of doc weight vectors
model Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t3)
target: TexasDomer

most similar:

deus ex machina
CricketKeeper
NighttHawk
StephH
midnightwolf
ark_fish
ASquidabs0727
AtreyusMom
Dave125g
TheKiwi

least similar:

Scoutsfish
MaeKay
Norman
CLam
Scott2848
Bhopkins1311
Flowingfins
fishnob
skar
Kenho21

Results

So I happen to know TexasDomer is an expert on fishlore.com. She is very knowledgeable about fishkeeping/husbandry, plants, and fish disease. Here is an excerpt from her document:

here’s are some fish stocking options : website don’t know what that light is . this light is good for low light plants : website you have lots of options for plants . jungle vals , swords , crypts , moneywort , water sprite , water lettuce , java fern , java moss , anubias , etc . than a hood , look for a glass lid . i like versa glass tops , but perfecto works too ! you do need to change water normally in a planted tank . both the fish and plants need fresh water with minerals . and depends on what you want . most people consider ramshorn and bladder snails to be pest snails . they breed rapidly and may eat your plants . malaysian trumpet snails stir up the substrate , but also breed quickly . don’t think mts eat plants though someone correct me if i’m wrong !

The most similar document yielded reasonable results:

rubber banding is a good way to anchor plants to decorations then ? cause i was thinking of using thread . unless i could find some fishing line . have pebble like gravel at the top of my substrate in my 29 gallon , i have managed to plant 4 java fern by pinning it between the pebbles and not burying into the rest of the substrate , they’re holding well . i just like to know whether thats fine to do ? the top half and is completely exposed , only the low roots have been covered because of the bottom of the substrate pressing against it . am i too lightly planted with only 4 java fern , no fish yet ?

The least similar document was more impressive:

duck has a leg issue on one leg and my rooster is having well . serious foot and toe issues . please note my animals are very well taken care of and they are all pets , even down to the chickens , ducks and roosters . they live in a large dirt pen that used to be covered but was taken down due to us having to expand the run , and the chicken wire roof was coming down anyway . there are 3 houses , two small wood tables for shade and a kiddie pool for summer . several water containers and a food bowl . theyre fed chicken scratch and were recently switched to chicken crumbles from ifa to the walmart brand due to it being cheaper . i’ll start with my duck . i have 7 ducks , 2 indian runner ducks , another is a khaki campbell male theyre 3years now i believe

The least similar document seems to have a lot of posts from the “other interests” and “other pets” subforums. So yes, ducks are not similar to fish 😉 More importantly perhaps, TexasDomer writes about different topics than Scoutsfish

Next steps

Before we can declare victory though, it is important to note that garbage in, garbage out applies here. While the initial results are fun, not sure we can really prove anything. It’s more entertainment value at this point.

I’d like to continue and attempt to fit a machine learning model on this data to see if we can use the word embeddings to predict who an expert is. However, we do not have trained data for this, so not sure how feasible this really is. I’m going to spend the next several days/weeks thinking about this, assuming I have some spare time. Please comment/write me if you found this interesting!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s