Content Moderation
Optimizing algorithms with a composition of data structures.
Table of contents
Social media platforms rely on algorithms to draw attention to the user-generated content that is most likely to engage. Content moderation plays an understated but integral role in determining which pieces of content will be shown to the platform’s users. Unlike traditional publishers, online platforms are protected under Section 230 from civil liability for user-generated content.
No provider or user of an interactive computer service shall be treated as the publisher or speaker of any information provided by another information content provider.
Furthermore, if platforms choose to moderate, they are not held liable for their moderation decisions. Despite the centrality of content moderation, many social media platforms have marginalized and outsourced content moderation to third-party vendors. Content moderators review content flagged by machine learning systems and reported by users.
In this project, we will compare 4 implementations of priority queues for moderating a stream of user-generated content. To get started, join your project team on Canvas. One team member should Fork
Learning objectives
- Design and compose multiple data structures to solve complex problems.
- Analyze the affordances and values embodied within abstractions.
- Critique the implications of content moderation on information society.
Design
Interface
The java.util
standard library includes a binary heap PriorityQueue
class. Unlike the Java collection interfaces Set
, Map
, and Queue
, PriorityQueue
is a class rather than an interface!
The elements of the priority queue are ordered according to their natural ordering, or by a
Comparator
provided at queue construction time, depending on which constructor is used. A priority queue does not permitnull
elements. A priority queue relying on natural ordering also does not permit insertion of non-comparable objects (doing so may result inClassCastException
).The head of this queue is the least element with respect to the specified ordering. If multiple elements are tied for least value, the head is one of those elements – ties are broken arbitrarily. The queue retrieval operations
poll
,remove
,peek
, andelement
access the element at the head of the queue.
For this project, we will introduce the ExtrinsicMinPQ
interface: a more complex version of the MinPQ
interface presented in the lesson. Priority values are extrinsic to each item: rather than relying on the compareTo
method, the priority value must be given as an argument to the add
and changePriority
methods.
void add(T item, double priority)
- Adds an item with the given priority value.
boolean contains(T item)
- Returns true if this priority queue contains the given item; false otherwise.
T peekMin()
- Returns the item with the lowest priority value.
T removeMin()
- Removes and returns the item with the lowest priority value.
void changePriority(T item, double priority)
- Updates the priority of the given item.
int size()
- Returns the number of items in this priority queue.
boolean isEmpty()
- Returns true if this priority queue is empty; false otherwise. This is a default method provided by the interface that calls the
size
method.
Unlike the MinPQ
interface presented in the lesson, ExtrinsicMinPQ
can only contain unique items. There cannot be two copies of the same item, though it’s possible for multiple items to be assigned the same priority value. In this case, the priority queue may break ties arbitrarily.
The PriorityNode
class represents an item together with its extrinsic priority value. This is useful for working with Java collections that need to specify particular data types. Rather than maintain two lists (one for each element and the other for each priority value), we can instead use a single list of priority nodes.
List<PriorityNode<T>> items = new ArrayList<>();
items.add(new PriorityNode<>(item, 0));
Two PriorityNode
instances are considered equal if their items are equal. (Their priority
values do not need to be equal!) The following contains
call will return true
, and the remove
call will successfully remove the priority node added earlier.
items.contains(new PriorityNode<>(item, 1));
items.remove(new PriorityNode<>(item, 2));
Reference implementation
DoubleMapMinPQ
- Each unique priority value is associated with the set of items that share that priority value in a red-black tree. Retrieving a lowest-priority item involves searching for the leftmost set of items in the tree and then choosing any element from that set. The red-black tree is supported by a
HashMap
that associates each item with its priority value in order to speed-upcontains
andchangePriority
.
Implementation task
Design and implement 3 other implementations of the ExtrinsicMinPQ
interface.
UnsortedArrayMinPQ
- Items are added to an array or
ArrayList
in an arbitrary order. Most operations need to search over the entire array orArrayList
. HeapMinPQ
- A priority queue that delegates all method calls to the Java
PriorityQueue
. Each instance ofHeapMinPQ
only contains one field assigned to a JavaPriorityQueue
instance, and everyExtrinsicMinPQ
operation is implemented by calling the underlyingPriorityQueue
. OptimizedHeapMinPQ
- An enhanced binary heap priority queue supported by a
HashMap
that associates each item with its heap index to speed-upcontains
andchangePriority
. In order to access the heap array, it is necessary to implement our own custom binary heap. Identify the relevant portions of theMinPQ
class and modify the code to efficiently implement theExtrinsicMinPQ
interface. Each heap array update will require a correspondingHashMap
update. - Video
- Explain one part of the project implementation that you’re particularly proud of developing.
Run, check, debug
Open Terminal Moderator
class. Use the keyboard shortcut Ctrl+C to stop program execution at any time. Once the Moderator
class has finished execution, the rm
command removes the compiled class files. The clear
command clears the console.
- Warning
- The provided datasets contain text that may be considered profane, vulgar, or offensive. The
Moderator
class prioritizes the most toxic content.
javac Moderator.java && java Moderator; rm *.class
clear
By default, the Moderator
class uses the DoubleMapMinPQ
reference implementation. Modify the class to test your own implementation for debugging purposes and check that it lines up with the matches returned by DoubleMapMinPQ
.
For debugging, use print statements to output the state of important variables. Study this 4-minute video on debugging with print statements. Note how they spend almost all of the time diagnosing the problem and developing a thorough explanation for why the program is not working as expected. The actual bugfix only takes a few seconds of typing. This is normal practice for all programmers, though programmers who have seen similar bugs before will be able to identify and resolve problems more quickly.
To compare several implementations at once, run the provided ModeratorMultiTest
, which validates the implementations against DoubleMapMinPQ
. If some of your implementations are not ready for testing, comment them out of the implementations
map.
javac ModeratorMultiTest.java && java ModeratorMultiTest; rm *.class
Analyze
Affordance analysis
Prioritizing review of the most “toxic” content alone is not a universal good for everyone. How does the prioritization of certain content affect the way society defines truth? What does that prioritization emphasize, and who does it benefit? Consider the positionality of these actors that interact with the platform.
- Users of the platform, particularly the ways in which certain identities may be further marginalized.
- Content moderators who are each reviewing several hundred pieces of toxic content every day.
- Legal teams who want to mitigate government regulation and legislation that are not aligned with corporate interests.
- Users who want to leverage the way certain content is prioritized in order to shape public opinion.
- Video
- Give an affordance analysis of priority queues for content moderation. Pick a couple conflicts from among these actors and describe potential modifications to the priority values (or the priority queue as a whole) that would help resolve the conflict by combining their values. Are modifications to the priority values or priority queue alone sufficient to address the conflicts? If they’re not sufficient, briefly describe another approach that might be better.
Asymptotic analysis
- Video
- Give a big-theta bound for the worst-case runtime of the
removeMin
andchangePriority
methods of each implementation, includingDoubleMapMinPQ
, with respect to N, the size of the priority queue. Explain the runtime of each implementation in 1 or 2 sentences while referencing the code.
PriorityQueue
relies on a resizing array-representation binary heap. TreeSet
and TreeMap
rely on a self-balancing red-black tree. HashSet
and HashMap
rely on a resizing separate-chaining hash table with a tree bucket optimization that is applied when the size of a linked list bucket exceeds the TREEIFY_THRESHOLD
(8 elements).
* This map usually acts as a binned (bucketed) hash table, but * when bins get too large, they are transformed into bins of * TreeNodes, each structured similarly to those in * java.util.TreeMap. [...]
For hash tables, assume the hash function distributes elements evenly so that the size of each bin is constant. (Consider how this assumption affects the tree bin analysis.) For all array-representation data structures, assume that a removeMin
or changePriority
call will not require a resize.
Critique
Connect content moderation to discussions and opinions occurring in real-world contexts. Here are a few articles and essays to begin your research.
- To Apply Machine Learning Responsibly, We Use It in Moderation
- The New York Times comment moderation team raises important concerns about implicit bias in the machine-learning systems that rate content. How can the New York Times leverage algorithms that rely on machine learning outputs while mitigating risks?
- Custodians of the Internet
- What are platforms’ legal obligations to the public interest? If platforms operate an attention economy, then “content moderation constitutes the platform.”
- Who Moderates the Social Media Giants? A Call to End Outsourcing
- Without content moderation, “platforms would be inundated by harmful content.” What are the consequences of outsourcing content moderation to third-party vendors?
- Video
- Synthesize at least one of the above articles together with a couple additional sources. Examine the positionality of the various articles by considering some of the following questions (and any of your own questions). How does the author’s lived experiences lead them to their point of view? How do the distribution of benefits and harms affect different people and society?
Submission
The final deliverable for your project is a video no longer than about 10 minutes that addresses all of the tasks labeled Video
All project team members must be present in the video, which you can do by recording it over a Zoom meeting. The presentation video must be organized around presentation slides. Upload your video to the Canvas assignment. After the submission is complete, add a submission comment with two links:
- Your Ed Workspace.
- Your presentation slides.