November 2, 2022
Hello Fellow Mutineers!
I hope the first few weeks of our first mutiny have been thought-provoking, fun, and maybe even a bit challenging! We’ve been getting some great questions and comments from the community, and wanted to share some thoughts here.
It’s important to have an honest discussion about the implications of ML models and how difficult it is to create a fair and just application of this technology.
Our challenge is controversial by design: as practitioners in this field, we wanted to create a real-world scenario, not a feel-good one. A wise person once told me that our job is to be the ‘truth-tellers,’ and I take that role seriously. It means our role as practitioners is to build a bridge of understanding between social impact and pragmatic application.
Your challenge is to create a model that addresses issues of demographic bias in datasets, and we’ve chosen some thought-provoking ones. For those unfamiliar with the background and history of the demographics we have chosen, I hope this blog post serves as an early primer to the complexity of the issues at hand. Congratulations, you’ve chosen one of the most difficult and important practices to engage in.
I. Skin Tone.
There is a long history of ML models failing to appropriately identify individuals because of their skin tone. For decades, best practice was simply to sort people into 4-6 racial categories, producing 50+ years of biased studies. The Monk Skin Tone scale was developed thoughtfully to describe skin tone more accurately for evaluation.
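In practice, a scale like Monk's is used to disaggregate evaluation: instead of one aggregate accuracy number, you report performance per skin-tone bucket so gaps become visible. Below is a minimal, hypothetical sketch of that idea — the bucket numbers follow the 10-point Monk scale, but the records, labels, and function name are illustrative, not from the challenge dataset.

```python
# Hypothetical sketch of disaggregated evaluation by Monk Skin Tone (MST)
# bucket. MST buckets run 1-10; the sample records below are made up.
from collections import defaultdict

def accuracy_by_skin_tone(records):
    """records: iterable of (mst_bucket, y_true, y_pred) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for bucket, y_true, y_pred in records:
        total[bucket] += 1
        correct[bucket] += int(y_true == y_pred)
    # Per-bucket accuracy, keyed by MST bucket
    return {b: correct[b] / total[b] for b in sorted(total)}

records = [
    (1, "face", "face"), (1, "face", "face"),
    (9, "face", "no_face"), (9, "face", "face"),
]
print(accuracy_by_skin_tone(records))  # {1: 1.0, 9: 0.5}
```

The aggregate accuracy here is 75%, which hides the fact that the model performs twice as well on bucket 1 as on bucket 9 — exactly the kind of gap an audit should surface.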
Gender Shades began this conversation and we have seen skin tone based biases manifest in other applications, including the image cropping bias scenario that sparked the first-ever algorithmic bias bounty at Twitter.
However, the answer isn’t as easy as simply creating a better skin tone identifier. Skin tone often serves as a proxy for race, and inappropriate and unethical uses abound, such as identifying individuals for surveillance. From a social impact perspective, practitioners should deeply engage with and understand how skin tone identifiers can be used against Black and brown communities. Similarly, the rise of surveillance technologies that enable the carceral state has sparked more conversations about data privacy as it relates to ethical use.
II. Perceived Gender.
Our classifier uses the term ‘perceived gender’ to acknowledge that ML models cannot identify an individual’s gender identity; at best, they approximate how a stranger might perceive someone’s gender expression based on social constructs and norms. Or to put it more plainly, most people think pink means girl and blue means boy (or long hair means girl and short hair means boy), even if reality is more beautifully complex. Identifying gender can be helpful in some situations, such as identifying human trafficking or child sexual abuse content. It is also used in malicious ways, such as attempting to identify trans individuals. We recommend reading this insightful paper by Queer in AI, and in particular:
“…[Gender identification] systems are also inherently flawed and cannot benefit from participatory design because of 1) their invalid assumption that one’s expression can predict their gender identity and 2) their treatment of gender as an immutable, discrete quantity, which grossly mischaracterizes the flexible and fluid nature of one’s inner sense of their gender…
Furthermore, these gender detection systems often poorly, or entirely don’t, consider intersectionality: 1) they lack data of non-binary persons facing intervening vectors of marginality 2) they suffer from exnomination, in which researchers implicitly rely on the visual definition of non-binary persons as Western, white, androgynous people with stereotypically queer experiences 3) they treat “non-binary” as a monolithic, third gender, when in reality non-binary genders comprise all genders beyond the binary.
Most machine learning systems, such as those employed in ad targeting and commercial gender recognition, and attempts to mitigate their harms, focus on binary gender. The collection of binary gender data and inference of binary gender forces non-binary individuals to misgender themselves or be misgendered by systems, as well as suffer cyclical erasure, in which the assumption of gender as binary is encoded into machine learning models, thereby reinforcing and perpetuating dangerous, hegemonic ideas about gender being binary.”
III. Age.
Age as a category on its own is not as fraught with overtly harmful applications. However, intersecting age with other categories can introduce intersectional harm. For example, assessing the perceived gender of infants is problematic, as gender here is entirely a social construct imposed upon the child. In short, babies don’t express gender beyond the simplistic cues adults impose on them, so what a model would classify and reinforce is the social norm.
Intersecting age and perceived gender on the other end of the spectrum can introduce some uncomfortable ageist biases. Older women, for example, tend to be tagged as men, due to perfectly normal aging processes and styling choices.
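An audit can surface these intersectional gaps by disaggregating error rates over the cross-product of subgroups rather than each attribute alone. The sketch below is hypothetical — the age brackets, gender labels, and sample rows are illustrative stand-ins, not the challenge’s actual annotation scheme — but it shows how a failure mode like older women being mislabeled would show up.

```python
# Hypothetical sketch: error rates per (age group, perceived gender)
# intersection. The labels and rows below are illustrative only.
from collections import Counter

def error_rate_by_subgroup(rows):
    """rows: iterable of (age_group, perceived_gender, is_error) tuples."""
    errors, totals = Counter(), Counter()
    for age, gender, is_error in rows:
        key = (age, gender)
        totals[key] += 1
        errors[key] += int(is_error)
    # Error rate per intersectional subgroup
    return {k: errors[k] / totals[k] for k in totals}

rows = [
    ("65+", "woman", True), ("65+", "woman", True), ("65+", "woman", False),
    ("25-34", "woman", False), ("25-34", "woman", False),
]
rates = error_rate_by_subgroup(rows)
# ("65+", "woman") -> 2/3, ("25-34", "woman") -> 0.0
```

Looking only at the “woman” column in aggregate would report a 40% error rate and miss that the errors are concentrated entirely in the older subgroup.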
Why bother with this exercise, then? First and foremost, this exercise reflects the kinds of challenges practitioners may face in real-world algorithmic auditing, and learning to navigate them thoughtfully is critical. Second, as the regulatory environment in AI expands, we will see bias identification mandates for protected classes and at-risk communities, which include people of color, the elderly, children, and more. Third, and at the core of this challenge, is the need for non-exploitative methods of image annotation and dataset creation—but more on that in our next blog post. In the meantime, if you’re interested in the details of our dataset, please check out our datasheet.
Happy Hacking!
Captain Rumman Chowdhury