VectorCluster
AbstractClustering is a final-year NLP/ML project focused on grouping research paper abstracts into meaningful thematic clusters.
What it does
- Ingests large datasets of paper metadata and abstracts.
- Cleans and preprocesses text for downstream vectorization.
- Builds word and sentence representations using multiple embedding choices.
- Runs clustering experiments across algorithms and tracks quality metrics.
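As an illustration of the last step, here is a minimal sketch of a clustering experiment that tracks one quality metric. It is not the project's own code: the helper name `sweep_k`, the choice of silhouette score as the metric, and the synthetic `make_blobs` data standing in for real abstract embeddings are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def sweep_k(embeddings, k_values, seed=0):
    """Fit K-Means once per candidate k and return {k: silhouette score}.

    `sweep_k` is a hypothetical helper, not part of the project.
    """
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(embeddings)
        # Silhouette lies in [-1, 1]; higher means tighter, better-separated clusters.
        scores[k] = float(silhouette_score(embeddings, labels))
    return scores


# Synthetic stand-in for sentence embeddings (assumption: real data would be
# one row per abstract, one column per embedding dimension).
X, _ = make_blobs(n_samples=300, centers=4, n_features=16,
                  cluster_std=0.5, random_state=0)

scores = sweep_k(X, range(2, 8))
best_k = max(scores, key=scores.get)  # k with the highest silhouette score
print(scores, best_k)
```

The same loop shape would apply to other algorithms or metrics; only the model constructor and the scoring call change.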
How it is built
- Core modules separate preprocessing, word embedding loading, sentence embedding creation, and data conversion.
- Aggregates word vectors into sentence embeddings using p-means (power-mean) style pooling.
- Supports several clustering workflows (K-Means, spectral, hierarchical, DBSCAN, and deep embedded clustering notebooks).
- Includes templates/notebooks for parameter sweeps and optimal K analysis, with model metadata logging.
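The p-means aggregation mentioned above can be sketched roughly as follows. This is an assumption-laden illustration rather than the project's implementation: the function names `power_mean` and `p_means_embedding`, the default set of powers (mean, max, min), and the signed-root handling of odd powers are all choices made for this example.

```python
import numpy as np


def power_mean(vectors, p):
    """Element-wise power mean over the rows (word vectors) of `vectors`.

    p = 1 is the arithmetic mean, p = +inf the maximum, p = -inf the minimum.
    Other values are limited here to odd positive integers so that negative
    components stay real-valued (an assumption of this sketch).
    """
    vectors = np.asarray(vectors, dtype=float)
    if p == np.inf:
        return vectors.max(axis=0)
    if p == -np.inf:
        return vectors.min(axis=0)
    if p == 1:
        return vectors.mean(axis=0)
    if p != int(p) or p < 1 or int(p) % 2 == 0:
        raise ValueError("this sketch supports only 1, +/-inf, and odd positive integers")
    m = np.mean(vectors ** int(p), axis=0)
    # Signed root keeps the result real when the mean of powers is negative.
    return np.sign(m) * np.abs(m) ** (1.0 / p)


def p_means_embedding(word_vectors, powers=(1, np.inf, -np.inf)):
    """Concatenate one power mean per p into a single sentence vector."""
    return np.concatenate([power_mean(word_vectors, p) for p in powers])


# Two toy 2-dimensional "word vectors" for one sentence.
words = np.array([[1.0, -2.0],
                  [3.0, 4.0]])
emb = p_means_embedding(words)
print(emb.shape)  # (6,): 3 power means x 2 dimensions
```

Concatenating several power means makes the sentence vector larger than the word vectors (here 3 times the dimension), which is the trade-off for keeping more information than a plain average.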
Tech stack
Python, NumPy, pandas, scikit-learn, TensorFlow/Keras, NLTK/Gensim, Jupyter notebooks, SQLite.
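To make the model metadata logging concrete, the sketch below records one clustering run in SQLite using only the standard library. The table name `runs`, its columns, and the helper `log_run` are hypothetical; the project's actual schema is not described here, and the metric value in the usage lines is a placeholder, not a reported result.

```python
import json
import sqlite3


def log_run(conn, algorithm, params, metric_name, metric_value):
    """Insert one experiment record and return its row id.

    Hypothetical schema: parameters are stored as a JSON string so that
    different algorithms can share one table.
    """
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS runs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            algorithm TEXT NOT NULL,
            params TEXT NOT NULL,
            metric_name TEXT NOT NULL,
            metric_value REAL NOT NULL
        )
        """
    )
    cur = conn.execute(
        "INSERT INTO runs (algorithm, params, metric_name, metric_value) "
        "VALUES (?, ?, ?, ?)",
        (algorithm, json.dumps(params, sort_keys=True), metric_name, metric_value),
    )
    conn.commit()
    return cur.lastrowid


# In-memory database for the example; a file path would persist the log.
conn = sqlite3.connect(":memory:")
run_id = log_run(conn, "kmeans", {"n_clusters": 8}, "silhouette", 0.42)  # placeholder value
row = conn.execute(
    "SELECT algorithm, params, metric_name, metric_value FROM runs WHERE id = ?",
    (run_id,),
).fetchone()
print(run_id, row)
```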