\documentclass{article}
% Packages
\usepackage{amsmath,amsfonts,amsthm,amssymb,amsopn,bm}
\usepackage[margin=.9in]{geometry}
\usepackage{graphicx}
\usepackage{url}
\usepackage[dvipsnames]{xcolor}
\usepackage{fancyhdr}
\usepackage{multirow}
\usepackage{hyperref}
\usepackage{listings}
\usepackage{booktabs}
% New colors defined below
\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.98,0.98,0.98}
% Code listing style named "mystyle"
\lstdefinestyle{mystyle}{
backgroundcolor=\color{backcolour}, commentstyle=\color{codegreen},
keywordstyle=\color{magenta},
numberstyle=\tiny\color{codegray},
stringstyle=\color{codepurple},
basicstyle=\ttfamily\footnotesize,
breakatwhitespace=false,
breaklines=true,
captionpos=b,
keepspaces=true,
numbersep=5pt,
showspaces=false,
showstringspaces=false,
showtabs=false,
tabsize=2
}
%"mystyle" code listing set
\lstset{style=mystyle}
% For enumerate environment
\usepackage{enumitem}
\renewcommand{\theenumi}{\alph{enumi}}
\renewcommand{\labelenumi}{(\theenumi)}
% Math commands
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\Var}{\mathrm{Var}}
\def\rvx{{\mathbf{x}}}
\def\rvy{{\mathbf{y}}}
\newcommand{\softmax}{\mathrm{softmax}}
\newcommand{\inv}{^{-1}}
% Formatting
\newcommand{\grade}[1]{\small\textcolor{magenta}{[#1 points]} \normalsize}
\date{{}}
% Solutions
\usepackage{ifthen}
\newboolean{showSolutions}
\setboolean{showSolutions}{false} % Change this to toggle solutions
\newcommand{\solution}[1]{\ifthenelse {\boolean{showSolutions}} {{\leavevmode\color{blue}\textbf{Solution:} #1}}{}}
% Comments
\newcommand{\hugh}{\textcolor{blue}}
\newcommand{\ian}{\textcolor{red}}
% No indent
\usepackage[parfill]{parskip}
\begin{document}
\title{Homework \#3}
\author{\normalsize{CSEP 590B: Explainable AI}\\
\normalsize{Prof. Su-In Lee} \\
\normalsize{Due: 6/1/22 11:59 PM}}
\maketitle
\section{Review questions (20 points)}
\begin{enumerate}
\item \grade{5} Linear models with a large number of features are considered less interpretable than those with a small number of features. Describe how having a large number of features affects the three meanings of ``model interpretability'' discussed in class: \textit{simulatability}, \textit{decomposability}, and \textit{algorithmic transparency}.
\item \grade{5} Describe the different requirements for concept labels in Concept Bottleneck Models and TCAV. Which approach requires more concept labels?
\item \grade{5} Describe the optimization procedure used to visualize what activates a neuron within a neural network (activation maximization). What are the advantages and disadvantages of this approach over visualizing the neuron using dataset examples?
\item \grade{5} Describe what \textit{leverage scores} represent in the linear regression context. How are they calculated given a dataset $X \in \R^{n \times d}$ and $Y \in \R^n$?
\end{enumerate}
\section{Inherently interpretable models (45 points)}
In this problem, we'll train several inherently interpretable models and see what we can learn from the results. First, download the wine quality dataset from \href{https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009?resource=download}{here}, and then load the dataset using the following code:
\begin{lstlisting}[language=python]
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('winequality-red.csv')
X = df.drop(columns=['quality'])
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
\end{lstlisting}
\begin{enumerate}
\item \grade{6} Train a standard linear regression model to predict the wine quality label, and report the root mean squared error (RMSE) on the test data. Then, show a bar plot of the coefficients for each feature. \textbf{Hint:} use \texttt{sklearn.linear\_model.LinearRegression}.
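As a rough sketch (the function name is illustrative, and \texttt{X\_train}, \texttt{y\_train}, \texttt{X\_test}, \texttt{y\_test} come from the loading code above), the fit-and-evaluate step might look like:
\begin{lstlisting}[language=python]
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def fit_linear(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares; return the model and its test RMSE."""
    model = LinearRegression().fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    return model, rmse
\end{lstlisting}
The fitted coefficients are then available as \texttt{model.coef\_}, ready for \texttt{matplotlib.pyplot.bar}.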
\item \grade{8} It is generally a bad idea to directly interpret coefficients in a linear model, because the coefficients depend on each feature's scale. Instead, create a new linear regression model trained on \textit{normalized} input data---that is, normalize each feature by subtracting its mean and then dividing by its standard deviation. Report the RMSE and make a bar chart of the coefficients in this model. Based on this plot, which features seem important for predicting wine quality?
\textbf{Hint:} calculate the mean and standard deviation using the training data, and use these to normalize both the training and testing data.
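One possible sketch of the normalization step (the function name is ours, not required code):
\begin{lstlisting}[language=python]
def normalize(X_train, X_test):
    """Standardize features using statistics from the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return (X_train - mu) / sigma, (X_test - mu) / sigma
\end{lstlisting}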
\item \grade{8} Regularization reduces the risk of overfitting and can improve interpretability by encouraging sparsity. Fit lasso and ridge regression models using \texttt{alpha=0.1} for lasso and \texttt{alpha=10.0} for ridge, again using the normalized data from part~(b). Report the RMSE for each model and make bar charts of the coefficients.
\textbf{Hint:} use \texttt{sklearn.linear\_model.Lasso} and \texttt{sklearn.linear\_model.Ridge}.
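A minimal sketch of the fitting step, again assuming the normalized data from part~(b) (names are illustrative):
\begin{lstlisting}[language=python]
from sklearn.linear_model import Lasso, Ridge

def fit_regularized(X_train, y_train, lasso_alpha=0.1, ridge_alpha=10.0):
    """Fit lasso and ridge regression on (normalized) training data."""
    lasso = Lasso(alpha=lasso_alpha).fit(X_train, y_train)
    ridge = Ridge(alpha=ridge_alpha).fit(X_train, y_train)
    return lasso, ridge
\end{lstlisting}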
\item \grade{8} The key hyperparameter in regularized linear regression is \texttt{alpha} (denoted as $\lambda$ in the lecture slides), which controls the weight of the penalty term. Fit separate lasso regression models with the following alpha values: \texttt{[0.001, 0.01, 0.1, 1.0]}. Then, generate three scatter plots: (1)~alpha values on the x-axis and the test RMSE on the y-axis, (2)~alpha values on the x-axis and the number of zero coefficients on the y-axis, and (3)~number of zero coefficients on the x-axis and test RMSE on the y-axis. Consider using a log scale for the axes where appropriate.
If sparsity is viewed as a surrogate for interpretability, what does the third plot say about the accuracy-interpretability tradeoff?
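The sweep itself can be sketched as a simple loop (names illustrative; plotting is left to you):
\begin{lstlisting}[language=python]
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

def lasso_sweep(X_train, y_train, X_test, y_test,
                alphas=(0.001, 0.01, 0.1, 1.0)):
    """Test RMSE and number of zero coefficients for each alpha."""
    rmses, n_zero = [], []
    for alpha in alphas:
        model = Lasso(alpha=alpha).fit(X_train, y_train)
        rmses.append(np.sqrt(mean_squared_error(
            y_test, model.predict(X_test))))
        n_zero.append(int(np.sum(model.coef_ == 0)))
    return rmses, n_zero
\end{lstlisting}
For the alpha axes, \texttt{matplotlib.pyplot.xscale('log')} is useful.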
\item \grade{8} Instead of assuming linear relationships between features and the output, we will now try fitting a GAM. Use \texttt{pygam} to fit a \texttt{LinearGAM} consisting of univariate splines for each feature (see \href{https://pygam.readthedocs.io/en/latest/notebooks/tour_of_pygam.html#Regression}{here}). Report the RMSE, and then create plots visualizing each univariate spline. Do the results agree with those from part~(c)? Based on an internet search of how alcohol and volatile acidity affect wine quality, do your results agree with the experts? \textbf{Note:} when creating the model, use one spline term for each input variable and no factor terms.
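One way to build the spline terms, assuming \texttt{pygam} is installed (the helper function is a sketch, not required code):
\begin{lstlisting}[language=python]
from pygam import LinearGAM, s

def fit_gam(X_train, y_train):
    """Fit a LinearGAM with one univariate spline term per feature."""
    terms = s(0)
    for i in range(1, X_train.shape[1]):
        terms = terms + s(i)
    return LinearGAM(terms).fit(X_train, y_train)
\end{lstlisting}
Each spline can then be visualized with \texttt{gam.generate\_X\_grid(term=i)} and \texttt{gam.partial\_dependence(term=i, X=XX)}, as in the linked tour.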
\item \grade{7} Finally, to contextualize the previous models' accuracy, let's compare to both a weak baseline and a strong baseline. As a weak baseline, consider the simplest possible model: predicting the mean label from the training dataset. Report the test RMSE from this baseline model. Then, as a strong baseline, report the test RMSE from a gradient boosting machine. How do these compare to the simpler models from the previous questions? \textbf{Hint:} use \texttt{sklearn.ensemble.GradientBoostingRegressor}.
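Both baselines can be sketched in a few lines (names illustrative):
\begin{lstlisting}[language=python]
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

def baseline_rmses(X_train, y_train, X_test, y_test):
    """Test RMSE of the mean predictor and a default-settings GBM."""
    mean_pred = np.full(len(y_test), np.mean(y_train))
    weak = np.sqrt(mean_squared_error(y_test, mean_pred))
    gbm = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
    strong = np.sqrt(mean_squared_error(y_test, gbm.predict(X_test)))
    return weak, strong
\end{lstlisting}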
\end{enumerate}
\section{Instance explanations (35 points)}
In this problem, we'll implement one of the simplest instance explanations: leave-one-out. First, load a modified version of the census income (adult) dataset using the SHAP package as follows:
\begin{lstlisting}[language=python]
import shap
import numpy as np
from sklearn.model_selection import train_test_split
def load_mislabeled_data(test_size=0.99, n_flips=50, seed=1904):
    """Load census dataset and mislabel samples.

    Args:
        test_size: fraction of data to reserve for test data.
            Use this parameter to shrink the size of the training data.
        n_flips: number of training labels to flip.
        seed: random seed to randomly choose samples to flip.

    Returns:
        X_train: training input data.
        X_test: testing input data.
        y_train: training output.
        y_test: testing output.
        is_flip: boolean array representing mislabeled samples.
    """
    X, y = shap.datasets.adult()

    # Train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=7)

    # Randomly mislabel training data
    np.random.seed(seed)
    flip_inds = np.random.choice(
        range(X_train.shape[0]), n_flips, replace=False)
    is_flip = np.zeros(X_train.shape[0]).astype('bool')
    is_flip[flip_inds] = True
    y_train[flip_inds] = 1 - y_train[flip_inds]

    return X_train, X_test, y_train, y_test, is_flip
\end{lstlisting}
\begin{enumerate}
\item \grade{3} Load a \textit{small} version of the census data with 50 mislabels and a small training dataset:
\begin{lstlisting}[language=python]
data = load_mislabeled_data(test_size=0.99)
\end{lstlisting}
How many training samples are in this dataset? Looking at the loading function, how are we synthetically mislabeling the training data?
\item \grade{8} In this problem, our goal is to identify the mislabeled samples. We can treat this as a classification problem where we aim to rank samples according to the likelihood that they are mislabeled. We can then use metrics such as \textit{area under the receiver operating characteristic curve} (AUROC) or \textit{area under the precision recall curve} (AUPR) to evaluate the ordering against the mislabels, which we know because we created them (see the \texttt{is\_flip} variable returned by \texttt{load\_mislabeled\_data}).
As a baseline, generate a random importance score for each training example by sampling from a uniform distribution. Evaluate these importance scores through their ability to identify mislabeled samples using AUROC and AUPR. In addition, generate side-by-side boxplots of the importance values for normal samples and mislabeled samples.
\textbf{Hint:} use \texttt{sklearn.metrics.roc\_auc\_score} and \texttt{sklearn.metrics.average\_precision\_score} for AUROC and AUPR, and \texttt{matplotlib.pyplot.boxplot} for the boxplot.
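The two metrics can be wrapped in a small helper (a sketch; the convention here is that higher scores mean ``more likely mislabeled''):
\begin{lstlisting}[language=python]
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate_scores(scores, is_flip):
    """AUROC and AUPR for detecting mislabeled samples from scores."""
    return (roc_auc_score(is_flip, scores),
            average_precision_score(is_flip, scores))
\end{lstlisting}
For the random baseline, \texttt{scores = np.random.uniform(size=len(is\_flip))}, and the boxplots can be drawn with \texttt{plt.boxplot([scores[\textasciitilde is\_flip], scores[is\_flip]])}.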
\item \grade{8} Implement the leave-one-out
instance explanation approach to score each example in the training dataset. Specifically, begin by training a model $f$ using all the data. Then, for each sample $(x_i, y_i)$ in the training data, train a new model $f_i$ without $(x_i, y_i)$. The difference between the new model's test loss and the full model's test loss represents that sample's importance. For this problem, the model should be \texttt{sklearn.linear\_model.LogisticRegression} and the loss should be \texttt{sklearn.metrics.log\_loss}. Evaluate the leave-one-out importance scores using AUROC/AUPR and generate side-by-side boxplots of the importance scores for normal and mislabeled samples, similar to part~(b).
\textbf{Hint:} If you encounter convergence issues with the logistic regression, try standardizing the data or increasing the \texttt{max\_iter} parameter.
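A sketch of the leave-one-out loop (slow but straightforward; names are illustrative):
\begin{lstlisting}[language=python]
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def loo_importance(X_train, y_train, X_test, y_test):
    """Change in test loss when each training sample is removed.
    Mislabeled samples tend to get negative scores, because
    removing them improves the model."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    full = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    base = log_loss(y_test, full.predict_proba(X_test)[:, 1])
    scores = np.empty(len(y_train))
    mask = np.ones(len(y_train), dtype=bool)
    for i in range(len(y_train)):
        mask[i] = False  # leave sample i out
        f_i = LogisticRegression(max_iter=1000).fit(
            X_train[mask], y_train[mask])
        scores[i] = log_loss(y_test, f_i.predict_proba(X_test)[:, 1]) - base
        mask[i] = True
    return scores
\end{lstlisting}
When computing AUROC/AUPR, remember that \textit{lower} (more negative) scores indicate likely mislabels, so rank by \texttt{-scores}.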
\item \grade{8} The leave-one-out approach should have worked fairly well in the previous setting. Now, we'll try it with larger training datasets. Report the AUROC/AUPR metrics and create boxplots for normal samples and mislabeled samples using random scores and leave-one-out scores (as in (b) and (c)) with the following two datasets:
\begin{itemize}
\item A \textit{medium} dataset with 50 mislabeled examples:
\begin{lstlisting}[language=python]
data = load_mislabeled_data(test_size=0.95)
\end{lstlisting}
\item A \textit{large} dataset with 50 mislabeled examples:
\begin{lstlisting}[language=python]
data = load_mislabeled_data(test_size=0.90)
\end{lstlisting}
\end{itemize}
How does leave-one-out perform for these larger datasets? Why is this the case, and how could a method like Data Shapley provide better results?
\item \grade{8} Finally, one very simple method to identify mislabeled samples is to train a model on the full data and use the per-sample loss. Intuitively, bad samples are more likely to be misclassified. Implement this approach for the \textit{large} dataset from part (d) and evaluate it as in the previous question (AUROC/AUPR and boxplots). How does this compare to the random and leave-one-out approaches?
\textbf{Hint:} note that \texttt{sklearn.metrics.log\_loss} automatically averages the loss over all samples.
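This heuristic fits a single model, and the per-sample cross-entropy can be computed by hand to avoid the averaging (a sketch; names are illustrative):
\begin{lstlisting}[language=python]
import numpy as np
from sklearn.linear_model import LogisticRegression

def per_sample_losses(X_train, y_train):
    """Cross-entropy of each training sample under the full model."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    p = np.clip(model.predict_proba(X_train)[:, 1], 1e-12, 1 - 1e-12)
    y = np.asarray(y_train).astype(float)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
\end{lstlisting}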
\end{enumerate}
\end{document}