Projects
Maya
As part of the Cohere 4 AI community, trained a novel vision-language model on a multilingual instruction dataset curated by the team. Published the paper on arXiv.
HuggingFace 🤗 Contributions
Contributed notebooks and documentation to several HuggingFace initiatives, including a notebook on Knowledge Distillation for Computer Vision for the Computer Vision Community Course and a notebook on PII detection for LLMs for the HuggingFace Cookbook.
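For readers unfamiliar with knowledge distillation, the core idea is that a small student model is trained to match a larger teacher's softened output distribution. The snippet below is a minimal, dependency-free sketch of that loss, not the code from the notebook; the function names and the temperature value are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature**2 factor is the usual rescaling that keeps gradient
    magnitudes comparable across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        t * math.log(t / s)
        for t, s in zip(p_teacher, p_student)
        if t > 0
    )
    return kl * temperature ** 2


# Hypothetical logits for a 3-class image classifier.
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]
loss = distillation_loss(student, teacher)
```

In practice this term is usually combined with the ordinary cross-entropy loss on the ground-truth labels, with a weighting factor between the two.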
Topic Auto-label
Released a simple pip package that uses LLMs to identify topics in a corpus and automatically label text, image, and video data. Runs on local LLMs through an Ollama integration and uses pydantic for structured output.
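As a rough illustration of what structured output means here, the sketch below parses a model's JSON reply into a typed label and rejects malformed replies. It is not the package's API: a standard-library dataclass stands in for a pydantic model so the example has no dependencies, and the `TopicLabel` class, its `topic` and `confidence` fields, and `parse_label` are all hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass
class TopicLabel:
    """Hypothetical schema for one labelled item."""
    topic: str
    confidence: float


def parse_label(raw_reply: str) -> TopicLabel:
    """Parse and validate a JSON reply such as '{"topic": "...", "confidence": 0.9}'."""
    try:
        data = json.loads(raw_reply)
        topic = data["topic"]
        confidence = float(data["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError(f"reply does not match the schema: {exc}") from exc
    if not isinstance(topic, str) or not topic.strip():
        raise ValueError("topic must be a non-empty string")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return TopicLabel(topic=topic.strip(), confidence=confidence)


# A reply that an LLM might return when asked for JSON (made-up example).
label = parse_label('{"topic": "climate policy", "confidence": 0.82}')
```

Validating the reply at the boundary means a malformed response raises immediately and can be retried, rather than leaking bad labels into the dataset.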
Manifest Climate
Partnered with Manifest Climate on the business challenge of identifying key metrics in climate disclosures. Led a team of University of Waterloo Data Science Club members that built a data labelling tool, labelled custom data, and fine-tuned a custom DistilBERT model, enabling an additional 16 data points for clients to act upon and cutting LLM API costs by ~99.9%.
Text2SQL
Fine-tuned an LLM on synthetic data to answer natural-language questions about a SQLite database by generating a SQL query and interpreting the results. The fine-tuned model passed 86% of test cases.
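How the test cases were scored is not described above. One common approach is execution matching: run the generated query and a reference query against the same database and compare the returned rows. The sketch below assumes that approach; the `orders` table, the queries, and the helper names are all made up for illustration.

```python
import sqlite3


def run_query(conn, sql):
    """Return the query's rows in a canonical order, or None if the SQL fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return sorted(rows, key=repr)  # ignore row order when comparing


def pass_rate(conn, cases):
    """Fraction of (generated_sql, reference_sql) pairs that return the same rows."""
    passed = 0
    for generated_sql, reference_sql in cases:
        expected = run_query(conn, reference_sql)
        actual = run_query(conn, generated_sql)
        if actual is not None and actual == expected:
            passed += 1
    return passed / len(cases)


# Hypothetical in-memory database and test cases.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
INSERT INTO orders (customer, total) VALUES ('ana', 20.0), ('ben', 35.5), ('ana', 4.5);
""")

cases = [
    # Equivalent aggregations written differently: should pass.
    ("SELECT customer, SUM(total) FROM orders GROUP BY customer",
     "SELECT customer, SUM(total) FROM orders GROUP BY 1"),
    # Equivalent counts: should pass.
    ("SELECT COUNT(id) FROM orders", "SELECT COUNT(*) FROM orders"),
    # Misspelled column, so the generated query errors: should fail.
    ("SELECT nam FROM orders", "SELECT customer FROM orders"),
]

rate = pass_rate(conn, cases)
```

Sorting the rows before comparing makes the check insensitive to row order, which is the wrong choice for questions that explicitly ask for sorted results; a stricter harness would compare ordered rows in those cases.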
DotaLLM
Trained a YOLO model for enemy detection, then passed the predicted enemy locations to Cohere's Command-R+ in a prompt so it could choose movements and attack enemies.
Dreambella
Dreambella was a diffusion model fine-tuned on my dog Bella. Used DreamBooth to fine-tune the model efficiently with limited data samples (even though I probably have a thousand pictures of Bella lol).
Titanic Challenge in Production
Generated synthetic data with simulated data drift for an introductory video lesson aimed at people new to data science who want to learn about the challenges of production systems. The video covers TensorFlow Extended, data drift, and CTGAN for synthetic tabular data generation.
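To make the idea of data drift concrete, the sketch below simulates a numeric feature whose distribution shifts between training and production, then measures the shift with a two-sample Kolmogorov-Smirnov statistic. This is an illustration only, not the method from the video: the passenger-age feature, the distribution parameters, and the choice of the KS statistic are all assumptions.

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Hypothetical passenger ages: training data, fresh data from the same
# distribution, and production data whose mean has drifted upward.
train_ages = [rng.gauss(30, 12) for _ in range(2000)]
stable_ages = [rng.gauss(30, 12) for _ in range(2000)]
drifted_ages = [rng.gauss(38, 12) for _ in range(2000)]

stable_score = ks_statistic(train_ages, stable_ages)
drift_score = ks_statistic(train_ages, drifted_ages)
```

A monitoring job would compare the score against a threshold chosen for the sample size and alert when the drifted feature exceeds it; here `drift_score` comes out well above `stable_score`.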