Machine learning projects

The wide availability of high-level libraries and pre-curated public datasets has made machine learning unusually accessible. It is possible to train a model, generate predictions, and report an accuracy score in a matter of hours without a deep understanding of the underlying theory. This accessibility is both an attraction and a risk. At the level of an honours or masters project, the ease of entry can lead students into producing work that may not meet the academic standards required for a pass.

The triviality problem

A common and serious mistake is to treat a machine learning project as an exercise in comparing algorithms: select a dataset from Kaggle or a similar repository, apply several standard classifiers, and report which produces the best accuracy. This approach does not constitute academic research and is likely to fail at honours or masters level regardless of how competently it is executed. The reason is that the purpose of an academic project is the creation of new knowledge, not the demonstration of vocational competence. A project that could be completed by following a standard tutorial - fitting a classifier to the Titanic, MNIST, or Iris datasets, for example - does not demonstrate research skills, because those datasets have been exhaustively studied in the literature and offer no opportunity for original contribution. It also does not demonstrate understanding, because the same results can be produced by someone who does not know what the model is doing internally.

Markers at honours and masters level look for evidence that you understand the theoretical basis for your technical choices, that you have engaged critically with relevant prior work, and that your findings go beyond what a library's documentation describes. A project that lacks these qualities risks being classified as a marginal fail at best, not because the code does not run, but because it does not constitute research. The distinction is sometimes described as the difference between a Kaggle mindset (optimise a metric on a given dataset) and a research mindset (ask why a particular approach works, under what conditions it fails, and what that tells us about the problem domain). Both are legitimate activities, but only the second satisfies the requirements of a dissertation.

Finding a research problem

Domain-specific data challenges

One of the most effective ways to avoid a trivial project is to work with data that has genuine idiosyncratic characteristics - properties that depart significantly from the idealised conditions assumed in tutorial exercises, and that therefore require creative modelling decisions. Healthcare is a prime example. Electronic health records are often sparse, heterogeneous, and irregularly sampled; critically, missing values in clinical data are frequently not missing at random, meaning that the absence of a measurement may itself carry clinical information. A model that handles such data naively will produce systematically biased results. The implications of this are explored by Ghassemi et al. (2020) in the context of outcome prediction from EHR data.
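As a minimal illustration of handling informative missingness (using pandas on a hypothetical two-column EHR fragment; the column names and values are invented for this sketch), the missingness pattern can be recorded as explicit indicator features before imputation, so the "not missing at random" signal is retained rather than discarded:

```python
import pandas as pd
import numpy as np

# Hypothetical EHR fragment: lab values with potentially informative gaps.
df = pd.DataFrame({
    "lactate": [2.1, np.nan, 4.8, np.nan],
    "creatinine": [1.0, 1.4, np.nan, 0.9],
})

# Because values may be missing not at random, record the missingness
# pattern as explicit indicator features, then impute the raw values.
for col in ["lactate", "creatinine"]:
    df[f"{col}_missing"] = df[col].isna().astype(int)
    df[col] = df[col].fillna(df[col].median())
```

The indicator columns let a downstream model learn from the fact that a measurement was never taken, which a plain imputation step would erase.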

| Domain | Idiosyncratic challenge | Research opportunity |
| --- | --- | --- |
| Electronic health records | Sparsity, heterogeneity, irregular time intervals | Architectures that handle non-deterministic temporal gaps |
| Medical imaging | Class imbalance, need for interpretability | Explainable AI to validate model decisions for clinicians |
| Biomedical signals | High noise, subject-specific variability | Transfer learning from population data to individuals |
| Ecological monitoring | Model drift over time, lack of labelled data | Few-shot or active learning for rare-species identification |
| Low-resource NLP | Data scarcity, linguistic diversity | Adapting pre-trained models to under-resourced languages |

Working in any of these domains places you in contact with problems that the literature has not fully resolved, which is precisely the space where a student project can make a genuine contribution.

Identifying research gaps

A research gap is a mismatch between what a particular domain needs and what current methods can provide. Finding a credible gap requires systematic engagement with the literature rather than guesswork. Survey papers are a productive starting point because they typically include sections on open problems and unsolved challenges; the conclusions and limitations sections of high-influence papers at venues such as NeurIPS or ICML serve a similar purpose. Examining benchmark results can also be informative: where state-of-the-art models fall significantly short of human performance on a well-defined task, there is likely an interesting problem to investigate. A practical guide to this process is provided by Springer Nature.

A gap that is observable, documented in the literature, and testable with resources available to you is a sound foundation for an academic project. A gap that exists only because nobody has tried something does not, by itself, constitute a research contribution; there needs to be a reason to expect the attempt to be informative.

Methodological depth

Ablation studies

Applying a machine learning technique and reporting a performance metric is not, by itself, a research contribution. To demonstrate that a contribution is genuine, you need to show that the specific choices you made are responsible for the results you obtained. The standard method for doing this is the ablation study, in which individual components of the proposed system (a particular layer, a loss function, a data augmentation step, or an architectural modification) are removed or varied systematically while the effect on performance is measured. If an attention mechanism improves results, an ablation study confirms that the improvement comes from that mechanism rather than from some other latent factor such as increased parameter count. A thorough treatment of ablation study design is available in Meyes et al. (2019).
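One way to structure such a study is as a grid over component toggles, holding everything else fixed. The sketch below uses scikit-learn on synthetic data, with feature scaling and interaction features standing in for the components under ablation; the components and dataset are illustrative, not a prescription:

```python
from itertools import product
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

def build_model(use_scaling, use_interactions):
    """Assemble the pipeline with individual components toggled on or off."""
    steps = []
    if use_scaling:
        steps.append(StandardScaler())
    if use_interactions:
        steps.append(PolynomialFeatures(degree=2, interaction_only=True))
    steps.append(LogisticRegression(max_iter=1000))
    return make_pipeline(*steps)

# Ablation grid: evaluate every configuration under identical conditions,
# so any performance difference is attributable to the toggled component.
results = {}
for scaling, interactions in product([True, False], repeat=2):
    model = build_model(scaling, interactions)
    results[(scaling, interactions)] = cross_val_score(model, X, y, cv=5).mean()
```

Reporting the full grid, rather than only the best configuration, is what distinguishes an ablation study from cherry-picking.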

Evaluation beyond accuracy

Accuracy is a useful summary statistic, but it is insufficient for most academic purposes and actively misleading in some. In high-stakes domains such as medicine or finance, the calibration of probability scores matters as much as their rank order: a model that is confidently wrong is worse than one that is less accurate but appropriately uncertain. Evaluating uncertainty quantification - the degree to which a model's confidence reflects its actual error rate - is therefore an important dimension of rigour in these contexts. Metrics such as the area under the receiver operating characteristic curve (auROC) and the area under the precision-recall curve (auPRC) provide a more complete picture than accuracy alone. More broadly, as Jarvis (2026) reports, over-aggregated metrics can conceal important variation in model behaviour across subgroups or operating conditions. A thoughtful evaluation section, in which you justify your choice of metrics and acknowledge their limitations, is one of the clearest signals of academic maturity.
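A sketch of such an evaluation, using scikit-learn on synthetic imbalanced data (the metric choices follow the text; the dataset and model are illustrative stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

# Imbalanced synthetic task: roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Rank-based metrics: how well scores order positives above negatives.
auroc = roc_auc_score(y_te, probs)
auprc = average_precision_score(y_te, probs)  # more informative under imbalance

# Calibration: do the predicted probabilities match observed frequencies?
brier = brier_score_loss(y_te, probs)
```

Reporting all three, with a sentence on what each does and does not capture, is a far stronger evaluation than a single accuracy figure.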

Causal reasoning

Most machine learning is associative: it identifies statistical patterns in data without making claims about underlying mechanisms. In many research contexts, however, the question of interest is causal - not what tends to co-occur with what, but what would happen if a specific intervention were made. A model trained on observational data may produce spurious associations if the treatment policies in the training set are confounded with the outcome. Incorporating causal reasoning into a project, for example by using potential outcomes frameworks or causal graph methods to adjust for confounding, represents a level of analytical sophistication that is clearly distinguishable from the associative baseline.
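The problem can be illustrated with a small simulation in which a confounder drives both treatment assignment and outcome, so a naive group contrast is biased, while inverse propensity weighting (one potential-outcomes adjustment method) recovers something close to the true effect. All quantities here are simulated for illustration only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20000

# Confounder z drives both treatment assignment and the outcome.
z = rng.normal(size=n)
t = (rng.random(n) < 1 / (1 + np.exp(-2 * z))).astype(int)  # high-z units treated more
y = 1.0 * t + 2.0 * z + rng.normal(size=n)                  # true effect of t is 1.0

# Naive contrast is biased: treated and untreated groups differ in z.
naive = y[t == 1].mean() - y[t == 0].mean()

# Inverse propensity weighting: reweight so z is balanced across groups.
ps = LogisticRegression().fit(z.reshape(-1, 1), t).predict_proba(z.reshape(-1, 1))[:, 1]
w = t / ps + (1 - t) / (1 - ps)
ipw = np.average(y, weights=t * w) - np.average(y, weights=(1 - t) * w)
```

The naive estimate overstates the effect substantially, while the weighted estimate lands near the true value of 1.0; in real observational data the propensity model is itself a modelling decision that needs justification.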

Working with data

Active learning

When labelled data is scarce or expensive to produce, active learning offers a structured approach to reducing the labelling burden. The model identifies which unlabelled instances would be most informative to label (typically those near a decision boundary, where the model is most uncertain) and directs annotation effort accordingly. This is a legitimate research topic in its own right, particularly in domains where annotation requires specialist expertise. An example of active learning approaches in text classification is provided by Miller et al. (2020), though the principles extend to other modalities.
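A minimal uncertainty-sampling loop, sketched with scikit-learn on synthetic data (in a real project the oracle supplying labels would be a human annotator, and the seed-set size, query budget, and model are all choices to justify):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)

# Start from a small randomly labelled seed set; the rest form the pool.
labelled = list(rng.choice(len(X), size=20, replace=False))
pool = [i for i in range(len(X)) if i not in labelled]

for _ in range(5):  # five rounds of annotation
    model = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])
    probs = model.predict_proba(X[pool])[:, 1]
    # Uncertainty sampling: query the instance closest to the boundary.
    query = pool[int(np.argmin(np.abs(probs - 0.5)))]
    labelled.append(query)  # the oracle supplies y[query]
    pool.remove(query)
```

A research-grade version would compare this query strategy against random sampling under a fixed labelling budget, which is exactly the kind of controlled comparison markers look for.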

Synthetic data

In domains where real data is sensitive or difficult to obtain (such as rare disease research or fraud detection), synthetic data generation has become an important area of study. Generative models such as GANs and variational autoencoders can produce artificial datasets that mimic the statistical properties of real data without exposing individuals. A non-trivial project does not stop at generation; it requires a validation framework that assesses both data fidelity and privacy preservation. The tension between these two objectives, where data that is more realistic may also be less private, is a genuine research problem in its own right. A comprehensive review of methods and challenges is provided by Pezoulas et al. (2024).
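A toy sketch of both sides of the fidelity/privacy tension for a single feature, using stand-in Gaussian samples in place of real and generated data (a full validation framework would cover joint distributions and formal privacy metrics, not just these two checks):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real = rng.normal(loc=5.0, scale=2.0, size=1000)       # stand-in for a real feature
synthetic = rng.normal(loc=5.1, scale=2.1, size=1000)  # stand-in for generated data

# Fidelity: per-feature distributional match (Kolmogorov-Smirnov distance).
stat, p_value = ks_2samp(real, synthetic)

# A crude privacy probe: distance from each synthetic point to its nearest
# real point; suspiciously small distances suggest memorised records.
nearest = np.min(np.abs(synthetic[:, None] - real[None, :]), axis=1)
min_gap = nearest.min()
```

A synthetic dataset that passes the fidelity check while failing the nearest-neighbour probe illustrates exactly the trade-off the text describes.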

Robustness and failure analysis

Bias and fairness

Models trained on historical data can inherit and amplify the biases present in that data. In student-facing or healthcare applications this is a particularly serious concern. A systematic approach to bias evaluation involves specifying sensitive attributes of interest, then measuring whether model performance differs significantly across subgroups defined by those attributes. This is a stronger form of evaluation than reporting overall accuracy, and it surfaces the kind of failure that aggregate metrics conceal. A systematic review of bias and unfairness in machine learning is available in Pagano et al. (2023). A curated catalogue of real-world machine learning failures is maintained at Failed-ML and provides useful framing for thinking about how and why systems fail in practice.
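Disaggregated evaluation can be sketched in a few lines; the labels, predictions, and group memberships below are invented for illustration:

```python
import numpy as np

# Hypothetical predictions with a binary sensitive attribute (0/1).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])
group  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

def subgroup_accuracy(y_true, y_pred, group):
    """Report the metric per subgroup instead of one aggregate number."""
    return {g: float((y_pred[group == g] == y_true[group == g]).mean())
            for g in np.unique(group)}

acc = subgroup_accuracy(y_true, y_pred, group)
gap = abs(acc[0] - acc[1])  # a simple disparity measure
```

The same pattern applies to any metric (recall, false positive rate, calibration error); which disparity measure is appropriate depends on the application and should be argued for in the dissertation.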

Adversarial robustness

A model that performs well under standard test conditions may be vulnerable to inputs that have been deliberately perturbed to cause misclassification. For projects with a security or safety angle, evaluating adversarial robustness is a well-motivated research direction. A taxonomy of machine learning attack types and threat models (Kumar et al., 2019) provides a useful framework for scoping this kind of investigation. Applying Failure Modes and Effects Analysis (FMEA) to a machine learning pipeline is one way to approach this systematically, cataloguing predicted failure modes, ranking them by severity, and proposing corrective actions for each.
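For a linear model, a fast-gradient-sign-style attack has a closed form, which makes the core idea easy to sketch without a deep learning framework (the epsilon value, dataset, and model are illustrative; FGSM itself was proposed for neural networks):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# FGSM-style perturbation: step each feature in the direction that
# increases the loss, bounded by epsilon in the L-infinity norm.
w, b = clf.coef_[0], clf.intercept_[0]
p = 1 / (1 + np.exp(-(X @ w + b)))     # predicted P(y=1)
grad = (p - y)[:, None] * w[None, :]   # d(log-loss)/dx for each sample
X_adv = X + 0.5 * np.sign(grad)        # epsilon = 0.5

clean_acc = clf.score(X, y)
adv_acc = clf.score(X_adv, y)          # accuracy under attack drops
```

The gap between clean and adversarial accuracy, plotted against epsilon, is a standard way to characterise robustness in a dissertation.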

Edge deployment

An assumption implicit in most tutorial-level machine learning is that compute and memory are essentially unconstrained. This is not the case for models deployed on embedded or edge devices, where memory may be measured in kilobytes and power consumption is a binding constraint. TinyML - the development of models that train and run inference on microcontrollers - is an active research area that combines algorithmic innovation with hardware-aware design. Techniques such as weight quantisation, sparse updates, and neuromorphic architectures address this constraint in different ways. For students with a background in embedded systems, this area offers projects that are technically demanding and genuinely novel. An accessible overview is provided in this MIT News article (Zewe, 2022).
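Post-training affine quantisation, one of the techniques mentioned above, can be sketched in a few lines of NumPy (the weight tensor is a random stand-in for a trained layer):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=1000).astype(np.float32)  # stand-in layer

# Affine 8-bit quantisation: map floats onto [0, 255] via a scale and
# zero point, then reconstruct approximate floats for inference.
lo, hi = weights.min(), weights.max()
scale = (hi - lo) / 255.0
zero_point = np.round(-lo / scale).astype(np.int32)

q = np.clip(np.round(weights / scale) + zero_point, 0, 255).astype(np.uint8)
dequant = (q.astype(np.float32) - zero_point) * scale

# Storage drops 4x (float32 -> uint8) at the cost of bounded rounding error.
max_err = np.abs(weights - dequant).max()
```

A project in this space would measure how this rounding error propagates to end-task accuracy, and compare against alternatives such as per-channel scales or quantisation-aware training.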

Scoping and managing the project

The primary risk for an ambitious project is not complexity itself but unmanageable complexity. A project scoped too broadly is almost certain to produce superficial results because no part of it can be pursued in sufficient depth. The most reliable mitigation is to define a single, well-bounded research question and to build a zero model early: a minimal, working implementation using the simplest reasonable approach to the problem. If the zero model performs no better than a random or majority-class baseline, this is a valuable signal that the dataset or the problem formulation needs to change, and it is far better to discover this in the first two weeks than in the final two. A guide to identifying problems that are both tractable and meaningful is provided by Akinkugbe (2025).
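The zero-model comparison can be sketched with scikit-learn's DummyClassifier; the dataset here is synthetic and stands in for the project's real data:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Imbalanced stand-in dataset: roughly 80% majority class.
X, y = make_classification(n_samples=300, weights=[0.8], random_state=0)

# The zero model: the simplest credible approach, built in week one.
zero_score = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Majority-class baseline: the floor any real model must clear.
baseline = DummyClassifier(strategy="most_frequent")
baseline_score = cross_val_score(baseline, X, y, cv=5).mean()
```

Note that on an 80/20 split the majority-class baseline already scores around 0.8 accuracy, which is exactly why a headline accuracy figure without a baseline comparison is uninformative.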

The table below gives an indicative breakdown of how time in a twelve-week project might be distributed across the main phases.

| Phase | Typical duration | Key milestones |
| --- | --- | --- |
| Literature review and gap identification | 2-3 weeks | Mapping the sub-area; identifying contradictions; selecting benchmarks |
| Scoping and zero model | 1-2 weeks | Baseline implementation; data quality validation |
| Experimental iterations | 4-5 weeks | Model modifications; ablation studies; incremental evaluation |
| Evaluation and analysis | 1-2 weeks | Uncertainty, fairness, or robustness evaluation; interpretation |
| Dissertation writing | 2-3 weeks | Synthesising results; documentation; critical reflection |

These phases overlap in practice, and writing in particular should begin earlier than the table implies. The literature review and scoping phases are not overhead to be minimised but the foundation on which the rest of the project rests.

Further reading

Identifying AI/ML research problems worth solving

Identifying research gaps effectively

Ablation study design (Sheikholeslami, 2019)

Uncertainty quantification in AI (survey)

Active learning for text labelling

Causality and missingness in healthcare ML

Bias and unfairness in machine learning: a systematic review

Taxonomy of ML failure modes and threat modelling

Synthetic data generation: methods and challenges

On-device training for microcontrollers (TinyML)