One of the biggest challenges for dating apps is categorizing users into groups so that they can be matched. This clustering is heavily influenced by the questions asked during profile creation. OKCupid is unique in that it asks its users lengthier questions, referred to as “essay questions”. The goal of this project was to analyze how these essay questions influence the clustering of users compared to the more traditional questions.
The data for this project was obtained from https://github.com/lujonathanh/cos424-s2019-OKCupid. Using k-means and Latent Dirichlet Allocation (LDA) to cluster users, we found that the non-essay questions produce one large cluster, whereas the essay questions split users more uniformly across clusters.
Principal component analysis (PCA) is used to project the clustered data onto 2 dimensions for visualization purposes. The images below show the resulting clusters for k-means and LDA.
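As a rough illustration of the pipeline described above, the sketch below clusters a handful of essay-style texts with k-means and LDA and then projects them to 2 dimensions with PCA. It assumes scikit-learn and NumPy, which may not be what the project itself used. The toy `essays` list, the bag-of-words features, and the choice of 2 clusters are hypothetical stand-ins, not the project's data or settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy essays standing in for real essay answers.
essays = [
    "i love hiking camping and mountain trails",
    "hiking and camping in the mountains every weekend",
    "mountain trails and camping trips with friends",
    "i enjoy cooking pasta and baking bread",
    "baking bread and cooking dinner for friends",
    "cooking pasta and trying new bread recipes",
]

# Bag-of-words counts: one row per user, one column per word.
# (Assumed feature representation; the project may have used another.)
counts = CountVectorizer(stop_words="english").fit_transform(essays)

# k-means assigns each user to exactly one of k clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(counts)

# LDA gives each user a distribution over topics; here the most
# probable topic is treated as that user's cluster.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_weights = lda.fit_transform(counts)
lda_labels = topic_weights.argmax(axis=1)

# PCA projects the word-count features onto 2 dimensions for plotting.
coords = PCA(n_components=2).fit_transform(counts.toarray())

for text, km, ld, xy in zip(essays, kmeans_labels, lda_labels, coords):
    print(f"kmeans={km} lda={ld} pca=({xy[0]:.2f}, {xy[1]:.2f})  {text}")
```

Each row of `coords` can then be drawn as a point in a scatter plot and colored by `kmeans_labels` or `lda_labels` to compare how the two methods group the same users.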


Further details of this project are available in the report below.