Reproduce the dating-app swipe study and inspect the 16 EDA figures
Reuse the EFA, PCA, t-SNE, and UMAP latent-variable pipeline on a similar rated dataset
Compare eight classifier baselines plus a GAM and SHAP explanations on a small dataset
Study a worked example of within-cohort behavioural data analysis with a research-style report
Notebooks must be run in order because each writes parquet files the next consumes, and the stack pulls in XGBoost, LightGBM, CatBoost, PyGAM, SHAP, factor-analyzer, and UMAP.
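The run-order constraint can be made explicit with a small guard at the top of each notebook. The following is only a sketch: the stage names and parquet file names are hypothetical placeholders, not the repository's actual names, and the helper `missing_inputs` is an illustration, not part of the project.

```python
from pathlib import Path

# Hypothetical mapping from a notebook stage to the upstream files it reads.
# The real stage and file names must be taken from the repository.
REQUIRED_INPUTS = {
    "02_eda": ["profiles_clean.parquet"],
    "03_features": ["profiles_clean.parquet"],
    "04_latent": ["profiles_clean.parquet", "features.parquet"],
    "05_models": ["features.parquet", "latent_factors.parquet"],
}


def missing_inputs(stage: str, data_dir: Path) -> list[str]:
    """Return the upstream files that `stage` needs but that are not on disk yet."""
    return [
        name
        for name in REQUIRED_INPUTS.get(stage, [])
        if not (data_dir / name).exists()
    ]
```

A notebook could call `missing_inputs("04_latent", Path("data"))` in its first cell and raise `FileNotFoundError` when the list is non-empty, so that running out of order fails immediately with a clear message instead of partway through.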
This project is a Jupyter notebook pipeline that takes a hand-annotated dataset of 123 male dating-app profiles and examines which features predict a right swipe. The profiles were rated by five women described in the README as securely-attached, and 23.6% of the profiles received a right swipe. The author treats the result as a within-cohort study, not a population average.

The repository is organised as five notebooks that run in order: data cleaning and parquet export; exploratory data analysis with 16 figures; deeper feature-level analysis; latent-variable analysis using exploratory factor analysis plus PCA, t-SNE, and UMAP; and a final modelling notebook with eight machine-learning models, SHAP, a GAM, and a prescription table.

The headline finding reported in the README is that two latent factors, labelled Psychological Safety and Visual Appeal, account for 99.3% of the swipe decisions in this dataset. The strongest individual predictor is the rater-inferred emotional_stability score. All eight models reach an AUC of 1.0 on the held-out test set, which the author attributes to high rater agreement rather than overfitting.

Several common beliefs are reported as not supported by the data. Height shows no statistically significant correlation with swipe outcome in this sample. Shirtless photos in the sample receive a 0% swipe rate. Status shows a raw correlation with swipes, but the correlation is no longer statistically significant after controlling for perceived attractiveness. The README also notes that photo quality and warmth matter more than the number of photos.

To reproduce the work, the README lists Python 3.10 or newer plus pandas, scikit-learn, XGBoost, LightGBM, CatBoost, PyGAM, SHAP, factor-analyzer, UMAP, openpyxl, and pyarrow. Notebooks must be run in sequence because each one writes parquet files that the next one reads. A 13-section research-style report covering the methods, findings, and limitations is included as report.md.
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.