Projects

Maya

As part of the Cohere For AI community, trained a novel vision-language model on a multilingual instruction dataset curated by the team. Published the paper on arXiv.

Vision-Language Models, PyTorch, Multilingual AI

HuggingFace 🤗 Contributions

Contributed notebooks and documentation to several HuggingFace initiatives, including a notebook on Knowledge Distillation for Computer Vision for the Computer Vision Community Course and a notebook on PII detection for LLMs for the HuggingFace Cookbook.

Computer Vision, Knowledge Distillation, LLMs, PII Detection

Topic Auto-label

Released a simple pip package that automatically labels text, image, and video data, using LLMs to identify topics in the corpus. Runs on local LLMs through an Ollama integration and uses Pydantic for structured output.

Python, LLMs, Ollama, Pydantic

Manifest Climate

Partnered with Manifest Climate on the business challenge of identifying key metrics in climate disclosures. Led a team from the University of Waterloo Data Science Club that built a data labelling tool, labelled custom data, and fine-tuned a custom DistilBERT model, enabling an additional 16 data points for clients to act upon and cutting LLM API costs by ~99.9%.

DistilBERT, Data Labeling, Climate Tech, NLP

Text2SQL

Fine-tuned an LLM on synthetic data to answer natural language questions about a SQLite database by generating a SQL query and interpreting the results. The fine-tuned model passed 86% of test cases.

LLM Fine-tuning, SQLite, Natural Language Processing
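
One way a test-case pass rate for generated SQL could be measured is to run each generated query and a reference query against the same SQLite database and compare the returned rows. The sketch below is an illustration only and not necessarily how the project scored its model: the `orders` table, the example queries, and the helper names `run_query` and `pass_rate` are all hypothetical.

```python
import sqlite3


def run_query(conn, sql):
    """Execute a SQL string and return all rows, or None if SQLite rejects it."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def pass_rate(conn, cases):
    """Fraction of (generated_sql, reference_sql) pairs that return the same rows.

    Row order is ignored, and reference queries are assumed to be valid.
    """
    passed = 0
    for generated_sql, reference_sql in cases:
        got = run_query(conn, generated_sql)
        expected = run_query(conn, reference_sql)
        if got is not None and sorted(got) == sorted(expected):
            passed += 1
    return passed / len(cases)


# Hypothetical in-memory database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("ada", 20.0), ("ada", 5.0), ("bob", 12.5)],
)

# Each pair is (query a model might generate, hand-written reference query).
cases = [
    (
        "SELECT customer, SUM(total) FROM orders GROUP BY customer",
        "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer",
    ),
    (
        "SELECT COUNT(*) FROM orders WHERE total > 10",
        "SELECT COUNT(*) FROM orders WHERE total > 10",
    ),
    # Refers to a column that does not exist, so it counts as a failure.
    ("SELECT name FROM orders", "SELECT customer FROM orders"),
]

score = pass_rate(conn, cases)
print(f"pass rate: {score:.0%}")
```

Comparing result rows rather than SQL text means two differently written but equivalent queries both count as a pass.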

DotaLLM

Trained a YOLO model for enemy detection, then used the predicted enemy locations to prompt Cohere's Command-R+ to move and attack enemies.

YOLO, Object Detection, Cohere, Game AI
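
A rough sketch of how detector output might be turned into text an LLM can act on is shown below. It does not call YOLO or the Cohere API. The `Detection` class, the 3x3 screen grid, the confidence threshold, and the prompt wording are all assumptions made for illustration, not the project's actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical container for one predicted box, in pixel coordinates."""
    label: str
    confidence: float
    x1: float
    y1: float
    x2: float
    y2: float


def region(cx, cy, width, height):
    """Map a point to one cell of a 3x3 grid over the frame, e.g. 'top-left'."""
    cols = ["left", "center", "right"]
    rows = ["top", "middle", "bottom"]
    col = cols[min(int(3 * cx / width), 2)]
    row = rows[min(int(3 * cy / height), 2)]
    return f"{row}-{col}"


def build_prompt(detections, width, height, min_confidence=0.5):
    """Describe confident detections by screen region and ask for an action."""
    lines = []
    for det in detections:
        if det.confidence < min_confidence:
            continue  # drop low-confidence boxes
        cx = (det.x1 + det.x2) / 2
        cy = (det.y1 + det.y2) / 2
        lines.append(f"- {det.label} at {region(cx, cy, width, height)}")
    if not lines:
        return "No enemies are visible. Choose where to move next."
    return (
        "Enemies detected on screen:\n"
        + "\n".join(lines)
        + "\nChoose a movement direction and an enemy to attack."
    )


detections = [
    Detection("enemy", 0.91, 100, 50, 200, 150),
    Detection("enemy", 0.78, 900, 500, 1020, 580),
    Detection("enemy", 0.20, 1700, 900, 1800, 1000),  # below threshold
]
prompt = build_prompt(detections, width=1920, height=1080)
print(prompt)
```

Coarse region names keep the prompt short. Passing raw pixel coordinates to the model would be an equally plausible design.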

Dreambella

Dreambella was a diffusion model fine-tuned on pictures of my dog Bella. Used DreamBooth to fine-tune the model efficiently with limited data samples (even though I probably have a thousand pictures of Bella lol).

Stable Diffusion, DreamBooth, Fine-tuning

Titanic Challenge in Production

Created synthetic data with simulated data drift for an introductory video lesson aimed at newcomers to data science who want to learn about the challenges of production systems. The video covers TensorFlow Extended, data drift, and CTGAN for synthetic tabular data generation.

TensorFlow Extended, CTGAN, Data Drift, Educational Content
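
For readers new to the idea, data drift can be illustrated with nothing more than the standard library. The sketch below is a simplified stand-in and uses neither CTGAN nor TensorFlow Extended. It draws a hypothetical "age" feature from a Gaussian, shifts its mean to simulate drift, and compares the samples with a two-sample Kolmogorov-Smirnov statistic. The feature, the distribution parameters, and the sample sizes are assumptions chosen for illustration.

```python
import bisect
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the empirical CDFs of a and b,
    from 0.0 (identical samples) to 1.0 (no overlap at all).
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    largest_gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical "age" feature as seen at training time.
train = [rng.gauss(30, 10) for _ in range(500)]
# Fresh production data from the same distribution (no drift).
stable = [rng.gauss(30, 10) for _ in range(500)]
# Production data whose mean has shifted upward (simulated drift).
drifted = [rng.gauss(38, 10) for _ in range(500)]

print("KS, no drift:", round(ks_statistic(train, stable), 3))
print("KS, drifted: ", round(ks_statistic(train, drifted), 3))
```

A noticeably larger statistic for the drifted sample is the kind of signal a monitoring job could alert on.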