As part of my undergraduate thesis, I implemented a convolutional neural network for few-example object detection under the supervision of Professor Nejat.
With the senior population growing faster than every younger age group, our society needs to adapt to this demographic shift. One possible application is assistive robots that guide seniors around supermarkets and help them locate products. A typical supermarket carries on average 50,000 different products, and the frequent relocation of items often decreases the satisfaction of older customers. Existing designs focus primarily on navigation, using radio-frequency identification to escort the visually impaired to the general location of a product; there is limited research on intelligent product identification systems in retail environments.
The wide range of categories and items in a supermarket would greatly increase training complexity and storage requirements. Furthermore, it would be difficult to obtain a large number of samples for each class to train an accurate network. Hence, I designed a network that requires only one training example to estimate how similar an item is to an object of interest.
With supermarkets constantly changing and importing new products, it is necessary to train a network that can detect objects from very few examples and identify unseen classes. My solution uses a Siamese Network, which passes two images through twin convolutional neural networks with shared weights and outputs a similarity score based on their feature vectors.
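To make the twin-network idea concrete, here is a minimal PyTorch sketch; the layer sizes and the absolute-difference similarity head are illustrative assumptions, not my exact thesis architecture.

    import torch
    import torch.nn as nn

    class SiameseNetwork(nn.Module):
        """Twin CNN: one encoder with shared weights embeds both images."""
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 64, kernel_size=3), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(64, 128, kernel_size=3), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Flatten(),
                nn.LazyLinear(256),  # feature vector for each image
            )
            self.head = nn.Linear(256, 1)  # similarity from |f(a) - f(b)|

        def forward(self, a, b):
            fa, fb = self.encoder(a), self.encoder(b)  # shared weights
            return torch.sigmoid(self.head(torch.abs(fa - fb)))

Because both images pass through the same encoder, the network learns a single embedding in which similar items land close together, rather than learning each class separately.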
The Siamese Network performs one-shot classification, making a prediction on a test instance after observing only one example from each class. I incorporated this method into a state-of-the-art object detection network, RetinaNet, a one-stage detector that is faster than two-stage, region-based algorithms while achieving very high accuracy.
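At inference time, one-shot classification reduces to comparing the query against the single support example of each class and taking the best similarity score. A hedged sketch, assuming the SiameseNetwork above and one support image per candidate class:

    def one_shot_predict(model, query, support_images, support_labels):
        """Assign the query the class whose lone support image scores
        most similar. support_images holds one image per class."""
        model.eval()
        with torch.no_grad():
            q = query.unsqueeze(0).expand_as(support_images)
            scores = model(q, support_images).squeeze(1)  # one score per class
        return support_labels[scores.argmax().item()]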
I annotated my own training and testing images using the Omniglot dataset, which consists of over 1,600 handwritten characters with 20 examples of each. Compared to the baseline accuracy of 71.88% obtained by passing the inputs into the two networks separately, my network achieved 89.43%.
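For anyone reproducing the setup: Omniglot ships with torchvision, and the verification task needs same/different character pairs. A minimal sketch of one reasonable pair-sampling scheme, offered as an assumption rather than my exact pipeline:

    import random
    from torchvision import datasets, transforms

    omniglot = datasets.Omniglot(root="data", background=True, download=True,
                                 transform=transforms.ToTensor())

    # Group sample indices by character class (simple but slow: this
    # loads each image once just to read its label).
    by_class = {}
    for idx in range(len(omniglot)):
        _, label = omniglot[idx]
        by_class.setdefault(label, []).append(idx)

    def sample_pair():
        """Return (image_a, image_b, target): target is 1.0 if both
        images show the same character, 0.0 otherwise."""
        if random.random() < 0.5:
            i, j = random.sample(by_class[random.choice(list(by_class))], 2)
            target = 1.0
        else:
            c1, c2 = random.sample(list(by_class), 2)
            i, j = random.choice(by_class[c1]), random.choice(by_class[c2])
            target = 0.0
        (a, _), (b, _) = omniglot[i], omniglot[j]
        return a, b, target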
Prior to my thesis work, I had taken a basic neural network course and trained a simple fully connected network to classify faces. The immediate progression from a few-layer architecture to a multi-layer, multi-network implementation was difficult to absorb. It was especially challenging to translate the research papers into code: specific details were left out of the papers, and I had to fill the gaps on my own.
However, after reading many papers and going through online repositories, I noticed the similarities in how object detection networks organize their code. I learned how to build networks with PyTorch (a Python library) and to leverage Python classes and argument parsers to build configurable models. I also gained a stronger understanding of tensors and their operations, network gradients and losses, and the classification and regression tasks inside object detection algorithms.
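The configurable-model pattern I have in mind looks roughly like the following; the flag names and defaults are illustrative assumptions, not my thesis CLI:

    import argparse

    parser = argparse.ArgumentParser(description="Train the detector")
    parser.add_argument("--backbone", default="resnet50",
                        choices=["resnet18", "resnet50"])
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--epochs", type=int, default=50)
    args = parser.parse_args()

    # One script drives many experiments: the parsed flags select the
    # backbone and training schedule without any code changes.
    print(f"Training {args.backbone} for {args.epochs} epochs, lr={args.lr}")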
Lastly, I developed strong independent research and problem-solving skills, which I believe will be essential in my future career.
My network offers end-to-end training of all parameters and feature vectors and generalizes to unseen test examples. It was an enriching experience to explore real-world convolutional networks and apply them to my own research.
As a next step to build on my work, I would recommend testing the network on a more complex dataset and tuning the hyperparameters to improve accuracy.
Feel free to contact me if you have any questions.
angela.ye@mail.utoronto.ca