Objectives¶
The goal of this lesson is to examine how data science can go wrong through a series of case studies. This reading is a bit longer than usual, but we think exploring the historical and cultural context of these examples is important preparation for your future careers. These case studies don’t cover every way data science can go wrong, but they can act as reference points or reminders when you explore your own applications.
In this reading, we will talk about four applications of data science (in addition to the case studies we read on Monday and Wednesday).
- Potholes and data collection bias
- Facial recognition and surveillance
- Using AI-generated art in a TV title sequence
- Labor concerns around ChatGPT
There are also further readings and articles posted under each topic. You do not need to understand every single detail of each example, but you should at least look over them all to understand the key takeaways. Your responsibility at the end of this lesson is to make a post sharing an opinion on a discussion board thread, so you shouldn’t feel overwhelmed by all the small details presented.
Setting up¶
Since this is not a coding lesson, there is no notebook to follow along with. For an offline copy of this lesson, feel free to print out this page!
Case Studies¶
Why have we been asking you to reflect on case studies? Ethics in data and computing can sometimes feel abstract. Principles like algorithmic fairness and differential privacy are important, but they’re notoriously difficult to illustrate with simple, self-contained examples: there is no miniature version of systemic bias, loss of privacy, or other ethical harms. Case studies are a great way to explore specific examples of these broader topics. While some of them resist tidy, structured exercises in a course setting, the (somewhat) good news is that there are plenty of real-life examples showing how these principles play out in practice.
Further, many of these case studies are not the typical readings we’ve had in previous lessons. Across these case studies, you have read news articles, opinion pieces, technical blog posts, and peer-reviewed research. Learning to navigate that range is a skill you’ll use constantly as a data practitioner (and part of why we’ve had Reading Assignments in this course)! The world is full of data reporting and tech commentary, and being a critical reader of data-dense media matters. That’s why we’ve included Food for Thought questions throughout!
You’ll notice that all of our case studies are at least a few years old. That’s also intentional. More recent cases are still being actively debated or developed, which makes it harder to get a clear picture of the stakes, solutions, and who’s affected. Older cases have had time to accumulate secondary literature, independent analyses, and retrospective commentary, and there is value in engaging with these examples from a critical distance.
Previous Case Studies¶
In Lessons 22 and 23, we read about the following case studies:
- COMPAS (Lesson 22)
- Predicting Criminality (Lesson 22)
- Tracking for Safety (Lesson 23)
- Rides of Glory (Lesson 23)
Here, we’ll introduce a few more case studies and some guiding questions.
Potholes in Baltimore¶
The city of Baltimore has problems with potholes (holes in the street that are not fun to drive over). Part of the city’s responsibility is to fix these potholes to make roads safer. There has always been a system in place for people to report potholes, but the process was slow. The city invested in building a smartphone app to report potholes automatically and reduce the time it takes to fix them. The idea was to use a phone’s GPS and accelerometer to report a pothole’s location as someone drives over it. At this point, it seems like a relatively straightforward data science problem to take this incoming data and predict where the potholes are.
While this case study sounds less risky at first (maybe even like a straightforwardly useful application of data science), it demonstrates a very dangerous pitfall data scientists face. Benefiting from this technology requires owning a smartphone. That means areas where residents are less likely to have smartphones are less likely to have automatic reports sent in. The real fear is that poorer communities will be left behind as more resources are sent to the more affluent neighborhoods with more reports, purely because more people there have smartphones.
In some sense, the city added a reporting bias to its system. A reporting bias exists when there is some reason the answers reported differ from the truth. An example of reporting bias is asking a married person, “Have you cheated on your spouse?” The answers are most likely biased toward “no” since there is a risk in reporting truthfully. Here, the reporting bias comes from differing levels of technological access.
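Although this is not a coding lesson, the effect of this reporting bias is easy to sketch in a few lines of Python. The numbers below (smartphone ownership rates, pothole counts) are hypothetical, chosen only to illustrate the mechanism:

```python
import random

random.seed(0)

def reported_fraction(n_potholes, smartphone_rate, trials=2000):
    """Average fraction of potholes reported, assuming a pothole is
    only reported when the driver crossing it carries a smartphone."""
    reported = 0
    for _ in range(trials):
        reported += sum(random.random() < smartphone_rate
                        for _ in range(n_potholes))
    return reported / (trials * n_potholes)

# Two neighborhoods with the SAME number of potholes, but different
# (hypothetical) rates of smartphone ownership.
affluent = reported_fraction(n_potholes=100, smartphone_rate=0.9)
low_income = reported_fraction(n_potholes=100, smartphone_rate=0.4)

# A city allocating repair crews by report volume would send more than
# twice the resources to the affluent neighborhood, despite identical need.
print(f"affluent: {affluent:.2f}, low-income: {low_income:.2f}")
```

Notice that the bias here comes entirely from data collection, not from any model: a predictor trained on these reports would inherit the same skew.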
When designing a data analysis, application, or model, you need to think carefully about how it impacts people of different races, genders, physical or mental abilities, socioeconomic status, etc. (and how it can affect intersecting identities). Thinking of diversity and inclusion is crucial for a data scientist since we want to make artifacts that benefit all people.
Food for thought:
- A common counter-point to this critique is to appeal to designing to the “average” person. Someone might say “Well most people have a smartphone, so isn’t it the right thing to consider this average case?” In what ways might designing for the average be helpful or harmful to users?
- Listen to this podcast by 99% Invisible about designing for averages. What recommendations are made, and how might you incorporate them in your data analyses?
- Pick a few definitions of biases from the Catalogue of Bias. What other biases do you see in this case study? In others?
Facial Recognition¶
Read this article on the use of facial recognition in machine learning. (Or read the archived article if you don’t have access.)
The data we use to train our model affects the results of our model, and this can lead to discriminatory practices: if our model underrepresents or overrepresents different groups, that can lead to inequitable outcomes. However, with facial recognition and any other technology, we must ask ourselves: to what end are we designing this technology? Is it ethical to develop facial recognition technology for improving our daily lives if the same technology enables a surveillance state? In just the last few months, tech employees at companies like Google, Amazon, Meta, and OpenAI have taken collective action to urge their employers to cancel contracts with US Immigration and Customs Enforcement. You can read more about collective action from tech workers in this Wired article and this Guardian article.
Food for thought:
- Watch this Ted Talk by Joy Buolamwini on how her research helps fight bias in algorithms. How does it relate to what you have read about in this case study?
- Facial recognition was also part of the technology in the Predicting Criminality case study. What similarities and differences do you see between the two case studies?
- This article by Lee and Chin-Rothmann discusses the connection between police surveillance and facial recognition and communities of color. Think back to our conversations about biases in machine learning—what biases do Lee and Chin-Rothmann point out, and what are the impacts of perpetuating those biases?
Secret Invasion¶
Read this article about the use of generative AI in Marvel’s Secret Invasion.
Marvel’s Secret Invasion is a superhero-spy-thriller series that faced backlash for its AI-generated title sequence. The director of the series thought the bizarre and shifting images from the AI-generated sequence aligned with the themes of Secret Invasion, which featured questions of identity and shape-shifting aliens.
Many online called for a boycott of the series, questioning the ethics of using a computer to generate the title sequence instead of hiring a team of artists, particularly by a large corporation like Marvel (Disney). Others pointed out that generative AI models for creative work like visual art are often trained on images scraped from the web without individual artists’ consent. So while it’s fun to generate images through tools like Midjourney or DALL-E, it’s hard to tell where the training images were sourced and whether their original creators consented to having their work used. Many online art communities and competitions have implemented and enforced strict rules banning AI-generated artwork.
Food for thought:
- For a more general overview of using generative AI in creative work, see this article by Diego Ruiz on Medium. How does this impact your understanding of the Secret Invasion article?
- Is it ethical to use machines that might exploit artists on the internet? Can art really be ‘owned’ by anyone, creator or computer?
- What happens in cases where artists themselves are banned from creative spaces for too-close resemblance to AI-generated pieces (as in the case of Ben Moran)?
ChatGPT and Labor¶
Read this article about the working conditions of the Kenyan workers who helped make ChatGPT less toxic.
ChatGPT is a generative language model from OpenAI that has gained much popularity over the past few years. One aspect that surprised many users and researchers is its resistance to toxicity. Earlier chatbots devolved into racist, sexist, and hateful speech when users fed them such content. ChatGPT, however, would produce warning messages or ostensibly refuse to engage in such behavior.
It was revealed in early 2023 that part of the reason ChatGPT was so good at resisting toxicity was extensive and exhausting human labor to filter the training data for toxic content. The workers in the article had to read through thousands of hateful and toxic comments and pieces of internet content with very little pay, emotional support, or breaks.
Other concerns about ChatGPT stem from uncertainty about how user data is collected and where training data has been sourced. Additionally, in academic settings (like this class!), an increasing number of students have used ChatGPT or other generative AI for homework help, exam answers, and general breaches of academic integrity. Universities and institutions are still developing policies to keep up with the widespread use of generative AI in classroom settings.
(Needless to say, if we’re doing our jobs as your instructors properly, there should be no need to consult ChatGPT!)
Food for thought:
- How does understanding the different types of labor involved in creating a technology (LLM or otherwise) affect someone’s choice to use it? Is it ethical to continue using innovative AI tools that rely on exploitative labor? Should the work be made fairer and more accommodating, or should it be stopped altogether?
- Here’s an article about ethical uses of ChatGPT. Do the recommendations and scenarios here align with your understanding of ethical AI use? Why or why not?
- Here’s an article about DAN, a “dark” ChatGPT. How do you feel about the existence of “dark” or “jailbroken” chatbots? What might they be used for?
⏸️ Pause and 🧠 Think¶
Take a moment to review the following concepts and reflect on your own understanding. A good temperature check for your understanding is asking yourself whether you might be able to explain these concepts to a friend outside of this class.
Here’s what we covered in this lesson:
- Why case studies?
- Review of Lesson 22 and 23 case studies
- New case studies:
- Potholes
- Facial recognition and surveillance
- Digital art and generative AI
- LLM training and labor
Here are some other guiding exercises and questions to help you reflect on what you’ve seen so far:
- In your own words, write a few sentences summarizing what you learned in this lesson.
- What did you find challenging in this lesson? Come up with some questions you might ask your peers or the course staff to help you better understand that concept.
- What was familiar about what you saw in this lesson? How might you relate it to things you have learned before?
- Throughout the lesson, there were a few Food for thought questions. Try exploring one or more of them and see what you find.
In-Class¶
When you come to class, we will discuss the case studies in this lesson. You will be responsible for posting your reflection in the Canvas Quiz for today, and you are welcome to share your thoughts on the designated Ed thread to continue these conversations!
Canvas Quiz¶
All done with the lesson? Complete the Canvas Quiz linked here!