University of British Columbia, March - April 2019
- Applied unsupervised learning models to cluster unlabeled documents into groups and identified latent topics - Trained classification models by K-Means clustering and Latent Dirichlet allocation (LDA) - Tokenized, stemmed and removed stop-words, and extracted features by term frequency-inverse document frequency method for text preprocessing
University of British Columbia, March - April 2019
- Applied text analytics on Google Books Ngram dataset to obtain the number of occurrences of specific words by year - Created S3 bucket and used Amazon elastic map reduce (EMR) cluster to reduce the running time of this analysis
University of British Columbia, March - April 2019
- Conducted a research on an exploratory question that whether the number of times a student goes to office hours affect the average number of hours they spend working on labs per week - Collected data via surveys, performed exploratory data analysis (EDA), and conducted statistical hypothesis testing to conclude that 1 visit increase in office hours is expected to increase hours spent on labs per week by 1.1 times - Wrote a 5-page report to explain this project thoroughly, including background, method and conclusions
University of British Columbia, March - April 2019
- Obtained the time series of stock prices, volume, and technical indicators and created daily, weekly, and monthly plots - Created a wrapper in Python to handle the Alpha Vantage API requests to retrieve information of a given stock, which makes working with the API easier
University of British Columbia, March - April 2019
- Retrieved prerequisite information of each course in MDS program from UBC website using a BeautifulSoup object - Created a diagram of all courses with direction arrows pointing from prerequisites to give an overall picture of courses
University of British Columbia, February - March 2019
- Developed a Python and an R package that can be used to emboss, compress images, and return dimensional information - Applied git version control to manage multi-person project with Agile method and wrote comprehensive test suites - These packages are open sourced on GitHub that can be found at BeautyPy and BeautyR.
University of British Columbia, January - February 2019
- Built a supervised model to classify handwritten-digit images from MNIST dataset - Implemented a 5-layer convolutional neutral network (CNN) using Keras with 55 million pixels of training data and achieved validation accuracy over 97%
University of British Columbia, January - February 2019
- Built a supervised engine to predict scores for unreleased movies and an unsupervised engine to recommend movies - Applied linear regression, and multivariate inference to predict moviesโ scores, selected and modified predictors by combining stepwise method and Box-Cox transformation and used absolute error for evaluation - Implemented k-Nearest Neighbors (KNN) to find the most similar movies for recommendation, using and comparing both Euclidean distance and cosine similarity
University of British Columbia, January - February 2019
- Applied a pre-trained animal classification convolutional neural network (CNN) as a starting point, fine-tuned the original CNN, and further built a model to classify dog breed using Keras - Increased the model accuracy from 22% to 76% comparing to a scratch CNN model without using transfer learning
University of British Columbia, January - February 2019
- Designed and implemented R Shiny app to allow users to interactively use the interface to select a state or city to visualize trends and patterns of different types of crime over time - Implemented hover feature, which allows users to hover over the graph to see more details about each data point - Published this app online for everyone to use, which can be found here
University of British Columbia, January - February 2019
- Implemented and tuned parameters of principal component analysis (PCA) to extract information from face images under different lighting conditions - Reduced dimensionality of each face image from 1024 components to 10 components while still capturing 85% of the information in the image