How would you programmatically parse a sentence and decide whether to answer “that’s what she said”?

Answer by Edwin Chen:

I spent a few hours building my own TWSS classifier a couple weekends ago, so I'll describe my experience/flesh out some of the other suggestions.

Training data

I briefly looked at using Twitter as a corpus, as Christopher Lin's excellent answer also mentions, but decided that it was too noisy (most TWSS tweets aren't that funny, it's hard to tell what phrase the TWSS is in response to, and not all tweets containing "TWSS" are even TWSS jokes). Instead, taking a (modified) cue from the Kiddon and Brun paper, I used 1000 sentences from twssstories.com for positive training data, and 500 sentences from each of textsfromlastnight.com and fmylife.com for negative training data. I also normalized all the sentences by removing punctuation and converting to lowercase.
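In Python, that normalization step might look something like this (a minimal sketch; `normalize` is just an illustrative name, not code from the actual project):

```python
import re

def normalize(sentence):
    """Lowercase a sentence and strip punctuation."""
    sentence = sentence.lower()
    # Keep letters, digits, and whitespace; drop everything else.
    return re.sub(r"[^a-z0-9\s]", "", sentence).strip()

print(normalize("Wow, it's too big!"))  # -> "wow its too big"
```

(This is also why the bigram features below read "its too" and "its so": stripping punctuation removes the apostrophes.)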

Naive Bayes

Next, I trained a unigram Naive Bayes classifier (with add-one smoothing). I also tried a bigram classifier, but the unigram classifier performed better with the data I had; here's a precision-recall curve comparing the two:

[Figure: precision-recall curves for the unigram and bigram classifiers]
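For concreteness, here's roughly what a unigram Naive Bayes classifier with add-one smoothing looks like (a minimal sketch with illustrative names, not the exact code used here):

```python
import math
from collections import Counter

class NaiveBayes:
    def __init__(self, positives, negatives):
        # positives / negatives: lists of normalized sentences.
        self.pos_counts = Counter(w for s in positives for w in s.split())
        self.neg_counts = Counter(w for s in negatives for w in s.split())
        self.pos_total = sum(self.pos_counts.values())
        self.neg_total = sum(self.neg_counts.values())
        self.vocab = set(self.pos_counts) | set(self.neg_counts)

    def p_twss(self, sentence, prior=0.5):
        # Work in log space to avoid underflow; add-one smoothing
        # keeps unseen words from zeroing out a class.
        V = len(self.vocab)
        log_pos, log_neg = math.log(prior), math.log(1 - prior)
        for w in sentence.split():
            log_pos += math.log((self.pos_counts[w] + 1) / (self.pos_total + V))
            log_neg += math.log((self.neg_counts[w] + 1) / (self.neg_total + V))
        # Posterior p(twss | sentence) from the two log scores.
        return 1 / (1 + math.exp(log_neg - log_pos))
```

With equal priors of 0.5, a sentence gets the TWSS label when `p_twss` clears a chosen threshold (0.99 in the results below).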
To get a better idea of what's going on, here are some of the most predictive features under each:

unigram          p(twss|unigram)
 * pull            0.9724889822144924
 * bigger          0.9614677503890157
 * wet             0.959004244327654
 * hard            0.9527628206878138
 * stick           0.9505783678914388
 * hole            0.9443870318715991
 * oh              0.9432941279908561
 * replied         0.943294127990856
 * fast            0.943294127990856
 * longer          0.9397415371025485

bigram          p(twss|bigram)
 * it in           0.9801434151851175
 * START wow       0.9705079286853889
 * START oh        0.9473580156961879
 * its too         0.9350522640444204
 * pull out        0.9187779331523677
 * too big         0.9187779331523677
 * START man       0.9113755525471394
 * hard END        0.9113755525471394
 * put it          0.9071442285463515
 * that thing      0.9024886021363793
 * stick it        0.9024886021363793
 * my god          0.9024886021363793
 * go in           0.8916207044791409
 * START ugh       0.8916207044791409
 * make it         0.8916207044791409
 * its so          0.8916207044791409

(So yeah, next time someone starts a sentence with "ugh", or says "it's so…", get ready.)
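The per-feature scores in these tables can be read off the same smoothed counts: with equal priors, Bayes' rule reduces to a ratio of the per-class word probabilities. A sketch using the `NaiveBayes` class above (assuming this is how the scores were computed; the exact smoothing details may differ):

```python
def p_twss_given_word(nb, w):
    # Smoothed probability of the word under each class...
    V = len(nb.vocab)
    p_w_pos = (nb.pos_counts[w] + 1) / (nb.pos_total + V)
    p_w_neg = (nb.neg_counts[w] + 1) / (nb.neg_total + V)
    # ...then Bayes' rule with equal priors.
    return p_w_pos / (p_w_pos + p_w_neg)

# Rank the vocabulary to reproduce a table like the one above:
# sorted(nb.vocab, key=lambda w: p_twss_given_word(nb, w), reverse=True)[:10]
```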

Under a particular choice of parameters (a TWSS classification threshold of 0.99, with equal prior probabilities of 0.5 for TWSS and not-TWSS), the unigram classifier gives 0.97 precision and 0.82 recall (823 true positives, 177 false negatives, 974 true negatives, 26 false positives) on an out-of-sample test set consisting of equal numbers of positive and negative examples. [This is roughly the same performance as the bvandenbos classifier that Charlie Cheever linked to: mine performs slightly better on my test set, and his performs slightly better on his. (Unsurprising, given that we're both using Naive Bayes.)]
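Those headline numbers follow directly from the confusion-matrix counts:

```python
tp, fn, tn, fp = 823, 177, 974, 26
precision = tp / (tp + fp)  # 823/849  ≈ 0.969
recall    = tp / (tp + fn)  # 823/1000 = 0.823
```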

I also briefly tried using logistic regression and decision trees, but the unigram classifier easily beat them both. (YMMV with more data or better tuning, though.)
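One way to run that same comparison with scikit-learn (a hypothetical setup, not the tooling actually used; `sentences` and `labels` stand in for the training data described above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# sentences: normalized strings; labels: 1 = TWSS, 0 = not (placeholders).
X = CountVectorizer().fit_transform(sentences)
for model in (MultinomialNB(alpha=1.0),        # unigram NB, add-one smoothing
              LogisticRegression(max_iter=1000),
              DecisionTreeClassifier()):
    scores = cross_val_score(model, X, labels, scoring="f1")
    print(type(model).__name__, scores.mean())
```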

Faerie Tail

To see how well the classifier generalizes to a different source of data, I ran the algorithm on some fairy tales pulled from Project Gutenberg. Here are some of the sentences it TWSSed (a sketch of this scoring step follows the examples):

 * “The African magician carries it carefully wrapt up in his bosom,” said the princess; “and this I can assure you, because he pulled it out before me, and showed it to me in triumph.” (Aladdin)
 * It is vanished; but I had no concern in its removal. (Aladdin)
 * “My son,” said he, “what a man you are to do such surprising things always in the twinkling of an eye!” (Aladdin)
 * “Sire,” replied Aladdin, “I have not the least reason to complain of your conduct, since you did nothing but what your duty required.”
 * One was too long, another too short; so she tried them all till she came to the seventh, and that was so comfortable that she laid herself down, and was soon fast asleep. (Snow White)
 * “Oh yes, I will try,” said Snow-white.
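Scanning a text like this is just the classification step behind a crude sentence splitter (a sketch reusing `normalize` and the `NaiveBayes` class above; real prose deserves a proper tokenizer):

```python
import re

def twss_candidates(text, nb, threshold=0.99):
    # Crude split on terminal punctuation; dialogue and
    # abbreviations would trip this up in practice.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if nb.p_twss(normalize(sentence)) >= threshold:
            yield sentence

# for line in twss_candidates(aladdin_text, nb):  # aladdin_text: a placeholder
#     print(line)
```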

Finally, I threw up a demo on Heroku (http://twss-classifier.heroku.com/) so you can play around with the classifier. (Note: the demo optimizes for precision over recall.)
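The demo is just a thin web layer over the classifier; a minimal sketch with Flask (one plausible setup, not necessarily what the live demo runs):

```python
from flask import Flask, request

app = Flask(__name__)
nb = ...  # the trained NaiveBayes model from the sketches above

@app.route("/")
def classify():
    sentence = request.args.get("sentence", "")
    score = nb.p_twss(normalize(sentence))
    # A high threshold trades recall for precision, as noted above.
    return {"twss": score >= 0.99, "score": score}
```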
Written May 22, 2011