Investigating the Topic Capacity of Document Embeddings

This project explores the topic capacity of document embeddings, the numerical representations of text produced by large language models. Embeddings are widely used for tasks such as information retrieval and clustering, but it remains unclear how accurately a single embedding can represent a document that covers multiple topics.

The research focuses on three components:
– Literature Review: establishing a theoretical foundation on how embeddings are constructed across models.
– Experimental Evaluation: systematically testing how retrieval accuracy changes as the number of topics in a document increases.
– Alternative Strategies: investigating whether multiple topic-focused embeddings improve retrieval performance.
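
As a rough illustration of the kind of measurement the experimental evaluation describes, the following toy simulation tracks top-1 retrieval accuracy as the number of topics per document grows. It is a minimal sketch under strong assumptions and not the project's actual method: random unit vectors stand in for real topic embeddings, a document embedding is assumed to behave like the mean of its topic vectors, and the function name `top1_accuracy` and all parameter values are hypothetical.

```python
import numpy as np


def unit_rows(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def top1_accuracy(k, n_docs=200, n_pool=100, dim=64, noise=1.0, seed=0):
    """Toy top-1 retrieval accuracy when each document mixes k topics."""
    rng = np.random.default_rng(seed)
    # Assumption: random unit vectors stand in for real topic embeddings.
    targets = unit_rows(rng.normal(size=(n_docs, dim)))  # one unique topic per document
    pool = unit_rows(rng.normal(size=(n_pool, dim)))     # shared distractor topics

    docs = np.empty((n_docs, dim))
    for i in range(n_docs):
        extra = pool[rng.permutation(n_pool)[: k - 1]]   # k - 1 distractor topics
        # Assumption: a document embedding behaves like the mean of its topic vectors.
        docs[i] = np.vstack([targets[i : i + 1], extra]).mean(axis=0)
    docs = unit_rows(docs)

    # Each query is a noisy copy of one document's unique topic.
    queries = unit_rows(targets + noise * rng.normal(size=(n_docs, dim)) / np.sqrt(dim))
    best = (queries @ docs.T).argmax(axis=1)             # nearest document by cosine
    return float((best == np.arange(n_docs)).mean())


accuracy_by_k = {k: top1_accuracy(k) for k in (1, 2, 4, 8, 16)}
for k, acc in accuracy_by_k.items():
    print(f"{k:>2} topics per document: top-1 accuracy = {acc:.2f}")
```

In this simplified setting the unique topic contributes only one part in k to the averaged vector, so a noisy query finds the correct document less reliably as k grows. Real embedding models may behave quite differently, which is the question the experiments are intended to answer.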

By examining the relationship between topic count and embedding quality, the project will generate new insights into the limitations and strengths of embeddings. These findings will have practical value for retrieval and clustering applications, while preparing the student for advanced research at Honours or Masters level.

Tristan Trieu

Western Sydney University

Tristan Trieu is a second-year Bachelor of Data Science student at Western Sydney University, driven by curiosity and a deep fascination with uncovering hidden patterns through mathematics and data science. Although new to the field, he finds genuine joy and fulfillment in the process of learning — especially in those moments of realization when a complex idea finally clicks or a new perspective emerges. For Tristan, these moments give meaning to his studies and inspire him to keep exploring how mathematics and the right models can be used to reveal insights, guide decisions, and make a real impact.

He aims to continue developing his skills and knowledge, with ambitions to pursue postgraduate studies — a master’s or even a PhD — in machine learning and data-driven research. Outside of his academic pursuits, he enjoys playing football and listening to music, finding in them the same rhythm, flow, and sense of discovery that fuel his passion for data.
