- Thu., May 17: Project Milestone, due with HW3
- Thu., May 31, 9am-12pm: Poster Session
- Thu., June 7: Project Report
- You may work in pairs or individually. If you work in pairs, you must do both parts of the project.
- You must submit a project report, as specified below. Each of you must submit your own report on Canvas, even if you work in pairs.
- You must present at the poster session.
The following readings are important for understanding the setup and the motivation, along with modern techniques.
- If you are not familiar with ConvNets, then read this (and if you want a refresher on convolutions, see the wiki page).
- Fei-Fei Li, Andrej Karpathy & Justin Johnson (2016) cs231n, Lecture 8-Slide 8, Spatial Localization and Detection. Available here
- T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick. "Microsoft COCO: Common Objects in Context". ECCV, 2014. paper
- R. Girshick, J. Donahue, T. Darrell, J. Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation". CVPR, 2014. paper
- R. Girshick. "Fast R-CNN". arXiv:1504.08083, 2015. paper
- S. Ren, K. He, R. Girshick, J. Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks". IEEE TPAMI, 39(6):1137-1149, 2017. paper
- K. He, X. Zhang, S. Ren, J. Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition". IEEE TPAMI, 2015. paper
- V. Dumoulin, F. Visin. "A guide to convolution arithmetic for deep learning". arXiv:1603.07285, 2016. paper
Overview
Introduction and background: In this class, we will work with the MS-COCO dataset [Lin et al. 2014] and understand several aspects of building machine learning pipelines that deal with issues stemming from processing large amounts of high-dimensional data. MS-COCO is one of the current state-of-the-art datasets used for developing and benchmarking several computer vision tasks, including (a) semantic segmentation, (b) object localization and detection, (c) multi-label learning, and (d) image retrieval, amongst other interesting and practically relevant questions. Homeworks 1 and 2 were useful warmups to get familiar with the dataset for binary and multilabel classification. For the project milestone due on May 17, you will build an object detector on the small dataset. The final project will build on this with nearest neighbor retrieval and/or metric learning.
Setup: The instructors have set up the dataset in a way that abstracts away complications stemming from data/feature extraction, so that you can maximize the time spent on developing and evaluating the machine learning models. Features have been provided from a convolutional neural network, following ideas from [Girshick 2015; He et al. 2015].
Dataset splits: MS-COCO has 12 super-categories, with multiple categories in each super-category. First, we consider just 2 super-categories and solve all the downstream tasks as a simple problem of predicting category (a), category (b), or neither. Then, we will consider all categories that fall under these 2 super-categories for multi-label classification and object detection.
Starter code: Image features (see above) and code to read these features have already been provided with hw1. We have also provided code that allows you to (a) find rectangles that are likely to contain objects in each image (''regions of interest'') using the selective search functionality in OpenCV, and (b) extract features corresponding to any rectangle in an image. You are welcome to use your own code instead.
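If you do write your own, the sketch below shows one way to obtain such rectangles with OpenCV's selective search. This is a minimal sketch assuming the opencv-contrib-python package; the provided starter code's interface may differ, and `get_region_proposals` and `max_boxes` are placeholder names.

```python
import cv2

def get_region_proposals(image_path, max_boxes=200):
    """Return up to max_boxes candidate rectangles (x, y, w, h) for an image."""
    img = cv2.imread(image_path)
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img)
    ss.switchToSelectiveSearchFast()  # trades some recall for speed
    rects = ss.process()              # numpy array of (x, y, w, h) proposals
    return rects[:max_boxes]
```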
Task details: Here is a brief overview of the project. Broadly, the project is composed of two parts: the first part deals with object detection/localization; the second part consists of nearest neighbor search and retrieval.
- Object detection/localization: The task of object detection involves localizing an object (by placing a tight bounding box around it) and classifying it as belonging to a category of interest. Here is a rough pipeline to follow: (a) first, a detector is trained for every category using a simple binary classifier (e.g., a linear SVM) built on positive examples of a given class and appropriately sampled negatives; then, (b) the model in (a) is strengthened through ''hard-negative mining'': training for each category proceeds in an iterative fashion (with warm starts), where each iteration uses the previously learnt model to construct a new set of negative examples that it misclassifies, and these samples are added to the training set to retrain the classifier and strengthen its performance on the given category (see the sketch after this list). The performance metric we shall use is the average precision (AP) per category and the mean average precision (mAP), which is the AP per category averaged across categories. You can get creative with the details, for example the choice of binary classifier (or alternatives) and the hard-negative mining scheme. You will have to figure out how to deal with class imbalance between positives and negatives for any one category, and with different data sizes across categories, amongst other challenges.
- Nearest Neighbor search/retrieval: Given features from a particular image patch/bounding box, the goal is to extract similar patches from the training dataset. We take two approaches here: (a) build a scalable solution using Locality sensitive hashing or random partition trees, and (b) learn a metric over the feature space. For full details, see the section on project options below.
- Faster optimization methods: You are free to try out faster methods like SDCA or SVRG. Using SDCA in a block coordinate ascent fashion can be particularly effective.
- Richer models: You are free to try non-convex procedures as well, provided you are able to train them fast enough for this task. Figure out first whether you have the compute resources before you try this option.
- Parallelization: You are free to explore other approaches to parallelization. For this, I don't have particular suggestions: much of applied ML is either embarrassingly easy to parallelize, or we don't know how to parallelize it well.
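To make the hard-negative mining loop from the first bullet concrete, here is a minimal sketch. It assumes features live in NumPy arrays and uses scikit-learn's LinearSVC, which refits from scratch each round rather than warm-starting; `pos_feats`, `neg_pool`, and the hyper-parameters are placeholders, not prescribed values.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_detector(pos_feats, neg_pool, n_rounds=3, init_negs=5000, seed=0):
    """Train one category's detector with iterative hard-negative mining.

    pos_feats: (n_pos, d) features of positive patches for the category.
    neg_pool:  (n_neg, d) features of all candidate negative patches.
    """
    rng = np.random.default_rng(seed)
    # Subsample negatives initially to keep the class imbalance manageable.
    idx = rng.choice(len(neg_pool), size=min(init_negs, len(neg_pool)), replace=False)
    negs = neg_pool[idx]
    clf = LinearSVC(C=1e-3)
    for _ in range(n_rounds):
        X = np.vstack([pos_feats, negs])
        y = np.concatenate([np.ones(len(pos_feats)), -np.ones(len(negs))])
        clf.fit(X, y)
        # Hard negatives: patches the current model places inside or beyond
        # the margin (i.e., it is wrong or barely right about them).
        hard = neg_pool[clf.decision_function(neg_pool) > -1.0]
        if len(hard):
            negs = np.vstack([negs, hard])
    return clf
```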
Useful pointers: It is important to note that AWS credits must be rationed and used throughout the course. Running out of AWS credits is a sign that your rationing and management of credits have not been good enough. The assignments/projects have been designed so that a budgeted utilization will ensure a student does not run out of credits. Always use your own laptop for debugging and unit testing your code. Use AWS only when you are sure that the code is stable and, to the best of your knowledge, free of bugs.
You must choose one of the two following options. See the next section for full details.
- Option 1: Object detection + retrieval with a nearest neighbor data structure: You must implement nearest neighbor search using your own implementation of RPT, LSH, or another effective nearest neighbor data structure (of your own design or from the literature) on the larger dataset that we used for HW2. You have the option to try 'boosted' RPTs as discussed in class.
- Option 2: Object detection + retrieval with metric learning: You must implement an existing metric learning method (from a research paper) or come up with your own variant of a metric learning algorithm, which is then used in a brute-force method for retrieval of similar objects on the smaller dataset that we used in HW1.
Once the project reports are submitted, we will post a summary on Canvas, along with excerpts showing the highest performing methods for each option (as judged by the evaluation metrics below).
Further Instructions
- We shall use nearest neighbor (NN) classification as an alternate means to perform object detection. Note that this pipeline is independent of the initial detection task performed via binary classification. Option 1 uses fast approximate NN search with Euclidean distance as the metric while option 2 uses brute force NN search with a learnt metric.
- You are given a set of candidate bounding boxes for each image in bboxes_retrieval.tgz. These are a subset of the bounding boxes returned by selective search, handpicked by us to make the retrieval task more interesting.
- Each image patch will be represented by its CNN features (you have already used these for the milestone). Assign each patch one of 19 labels: background or one of the 18 classes. (In the rare event that a patch has IoU > 0.5 with two different ground truth boxes, assign one of the two labels arbitrarily.) A minimal sketch of this assignment follows this list.
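Here is a minimal sketch of the IoU computation and label assignment just described, assuming boxes in (x, y, w, h) format; `gt_boxes` and `gt_labels` are placeholder names for the ground truth.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def assign_label(patch_box, gt_boxes, gt_labels, thresh=0.5):
    """Return the class of the best-overlapping ground truth box, or
    'background' if no overlap exceeds the IoU threshold. Ties are broken
    arbitrarily (by argmax), matching the instruction above."""
    overlaps = np.array([iou(patch_box, g) for g in gt_boxes])
    if overlaps.size == 0 or overlaps.max() <= thresh:
        return "background"
    return gt_labels[int(overlaps.argmax())]
```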
Instructions specific to option 1:
- Create your data structure of choice for approximate NN search (e.g., Locality Sensitive Hashing (LSH), Random Partition Trees (RPT), etc.) over patch features corresponding to the given bounding boxes from the training set.
- For a given val or test image patch, find the K nearest image patches from the set of training image patches using your data structure. The score for a class C is (number of retrieved patches that belong to class C) / K. Retrieve at most one patch per image (this is to avoid having multiple overlapping patches). A minimal sketch of this pipeline follows this list.
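As one possible shape for this pipeline, here is a minimal random-partition-tree sketch together with the class-scoring rule above. This is an illustration only, not a reference implementation: LSH is an equally valid choice, the hyper-parameters are placeholders, and the one-patch-per-image constraint is not enforced here.

```python
import numpy as np
from collections import Counter

class RPTree:
    """Random partition tree: split on a random direction at the median
    until a node holds at most leaf_size points; query by routing to a leaf."""

    def __init__(self, X, leaf_size=100, seed=0):
        self.X = X
        self.rng = np.random.default_rng(seed)
        self.root = self._build(np.arange(len(X)), leaf_size)

    def _build(self, idx, leaf_size):
        if len(idx) <= leaf_size:
            return idx  # leaf: array of training indices
        u = self.rng.standard_normal(self.X.shape[1])
        proj = self.X[idx] @ u
        t = np.median(proj)
        left, right = idx[proj <= t], idx[proj > t]
        if len(left) == 0 or len(right) == 0:
            return idx  # degenerate split; stop early
        return (u, t, self._build(left, leaf_size), self._build(right, leaf_size))

    def query(self, q, k):
        """Return (indices of k approximate NNs, number of distance computations)."""
        node = self.root
        while isinstance(node, tuple):
            u, t, left, right = node
            node = left if q @ u <= t else right
        dists = np.linalg.norm(self.X[node] - q, axis=1)
        return node[np.argsort(dists)[:k]], len(node)

def class_scores(neighbor_labels, k):
    """Score for class C = (# of the k retrieved patches labeled C) / k."""
    return {c: n / k for c, n in Counter(neighbor_labels).items()}
```

Note that `query` also returns the number of distance computations it performed, which is exactly the speed measure defined under the evaluation metric below.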
Evaluation Metric:
- Quality of the retrieval is measured via (a) the average distance to the retrieved K nearest neighbors (averaged over the validation or test set) and (b) Average Precision (AP) per class and mean AP (mAP); see the AP sketch at the end of this subsection. Note that the Euclidean distance might not be an appropriate distance metric (option 2 fixes this problem), so performing a more thorough search might not lead to improved mAP. But it will lead to a smaller average distance to nearest neighbors.
- Speed of search: To be agnostic to the hardware and quality of implementation, we shall measure the speed of the search as the number of distance computations in the original feature space that your algorithm needs to perform. For instance, exhaustive search requires N distance computations, where N is the size of the training set, while RPTs with a leaf size of n require only n distance computations.
- Plots: There is a tradeoff between the speed and the quality of the search. You can control hyper-parameters such as the leaf size in RPTs. Plot:
- Plot average distance to the K nearest neighbors vs search time and mAP vs the search time (where search time is measured by number of distance computations), by varying the hyper-parameters of your approximate NN algorithm but keeping K fixed.
- Plot mAP vs K (for a fixed setting of hyper-parameters) for K in {10, 50}.
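For the AP numbers in these plots, here is a minimal sketch of per-class average precision under one common definition (the mean of precision@i over the ranks i of the true positives, after sorting by decreasing score); check that this matches the definition used for grading.

```python
import numpy as np

def average_precision(scores, is_positive):
    """AP for one class: sort by decreasing score, then average the
    precision measured at each true positive's rank."""
    order = np.argsort(-np.asarray(scores))
    rel = np.asarray(is_positive, dtype=float)[order]
    if rel.sum() == 0:
        return 0.0
    hits = np.cumsum(rel)
    ranks = np.flatnonzero(rel) + 1  # 1-based ranks of the positives
    return float(np.mean(hits[rel == 1] / ranks))

# mAP is then the mean of the per-class APs, e.g.:
# mAP = np.mean([average_precision(scores[c], labels == c) for c in classes])
```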
Bonus:
- Real-time NN search during the poster session. Perform your search in real time and visualize the K nearest neighbors for us to see.
Instructions specific to option 2:
- The nearest neighbor search is based on the metric that is induced by the metric learning algorithm that you have implemented.
- For a given val or test image patch, find the K nearest image patches from the set of training image patches under the metric learnt through the metric learning algorithm; the NN search here is exhaustive. The score for a class C is (number of retrieved patches that belong to class C) / K.
- For any metric learning algorithm that you choose, there should be a way to control the complexity of the learnt metric. For example, in the information theoretic metric learning framework, one can have a metric that is low rank. The rank here represents the "complexity" of the metric, because there are natural ways in which the number of computations during test time (retrieval) depends on the rank of the learnt metric rather than the ambient dimension. If you are unsure how your metric learning algorithm's complexity can be controlled, please reach out to the instructors to figure this out. A toy sketch follows this list.
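To make the rank/complexity point concrete, here is a toy sketch (not ITML or any other published method): a simple pairwise hinge objective over a low-rank linear map L, so that d(x, x') = ||Lx - Lx'||. All names and hyper-parameters are placeholder assumptions.

```python
import numpy as np

def learn_low_rank_metric(X, y, rank=10, lr=1e-2, n_steps=10000, margin=1.0, seed=0):
    """SGD over random pairs: shrink distances for same-label pairs and push
    different-label pairs out to the margin. L has shape (rank, d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    L = rng.standard_normal((rank, d)) / np.sqrt(d)
    for _ in range(n_steps):
        i, j = rng.integers(n, size=2)
        diff = X[i] - X[j]
        z = L @ diff              # embedded difference
        dist2 = z @ z             # squared learnt distance
        if y[i] == y[j]:          # pull together: descend on dist2
            L -= lr * 2 * np.outer(z, diff)
        elif dist2 < margin:      # push apart: hinge on (margin - dist2)
            L += lr * 2 * np.outer(z, diff)
    return L
```

Retrieval then happens in the rank-dimensional embedding: precompute Z_train = X_train @ L.T once, and each query comparison costs O(rank) after a single O(d x rank) projection, which is how the rank (rather than the original dimension) controls test-time cost.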
Evaluation Metric:
- See the instructions in option 1 for AP scoring.
- We can compute average precision (per class) and the mean average precision.
- Plots: free parameters are K = {10, 50} (for example) [K = number of retrieved nearest neighbors] and the embedding dimension "dim" (or any other means to control the quality of the learnt metrics). Retrieve at most one patch per image (this is to avoid having multiple overlapping patches).
- For each embedding dimension dim, plot average precision versus K (in top-K neighbors).
- For fixed K, plot average precision versus dim, for dim in (say) {5, 10, 50, 100}.
Bonus:
- Demonstrate retrieval using your learnt metric during the poster session, and compare it visually to the nearest neighbors under the Euclidean distance.
The grading of the final project is split between two deliverables:
- A project poster presentation (30% of the grade).
- A final report (70% of the grade).
Your final report will be evaluated by the following criteria:
- Scientific merit: Do you have sound reasoning? Can you draw quantitative conclusions from your work? Are you taking a justifiably simple approach or, if you are choosing a more complicated method, do you have sound reasoning for doing this?
- Technical depth: How technically challenging was what you did? Did you use a package or write your own code? It is fine if you use a package, though this means other aspects of your project must be more ambitious.
- Presentation: How well did you explain what you did, your results, and interpret the outcomes? Did you use good graphs and visualizations? How clear was the writing? Did you justify your approach?
We will hold a poster session in the Atrium of the Paul Allen Center. Each team will be given a stand to present a poster summarizing the project motivation, methodology, and results. The poster session will give you a chance to show off the hard work you put into your project, and to learn about the projects of your peers.
Here are some details on the poster format:
- We will provide poster boards that are 32x40.
- Either one large poster (recommended) or several pinned pages is fine. The TAs can help with printing your posters, provided you give them enough notice.
You must submit your poster on Canvas.
Your final submission will be a project report on Canvas. Your write-up should be 8 pages maximum in NIPS format, not including references. You must use the NIPS LaTeX format. You should describe the task you solved, your approach, the algorithms, the results, and the conclusions of your analysis. Note that, as with any conference, the page limits are strict. Papers over the limit will not be considered.