
KYD - Know Ye Data

December 5, 2022

Ahoy! As we eagerly await announcing the winners of the 8-Bit Bias Bounty, let’s talk a bit about dataset curation. Most of us came to ML modeling via similar paths—poking around the standard suite of datasets, whether it’s ImageNet, or German Credit Scoring, or other publicly available datasets. We tend to hyperfocus on the model we are using and optimize model performance based on our algorithmic parameters, and data tends to be an afterthought—an ingredient you use in your recipe, but don’t really put much thought into. Only recently have we started having real discussions about the role of datasets in developing good or bad model outcomes.

It’s a mistake to ignore our datasets. The provenance of our data is critically important. Groundbreaking work, like Datasheets for Datasets, emphasized the need to be thoughtful about our data, where it comes from, and how we use it (see our datasheet here). We also wanted to provide a view a level deeper—into the art and practice of dataset construction. In this blog post, I interview Ben Colman, founder and CEO of Reality Defender and co-founder of the 8-bit bounty challenge in Bias Buccaneers! Ben and his team at Reality Defender donated the dataset you have been hacking at diligently for the past few weeks. We wanted to give you a peek into the process of curating and tagging an image dataset. Here’s Ben in his own words:

Rumman: First off, it’s no easy task to build a dataset from scratch, especially one with images. How did you get this together?

Ben: We started with a goal of 100k faces to create a robust challenge. However, we quickly determined that curating data ethically is a huge task: many images online are taken improperly or without consent. We personally reached out to each provider so that we could get, in their own words, written commitments about the data. That immediately removed 50% of the data from the funnel, when providers could not (or would not) fully confirm. Even when we reminded them that their platforms had already confirmed, they quietly brought our wildest fears to life: they were not sure about provenance. WTF.

Very quickly, our data set was reduced from 2m+ to 25k, because we required ourselves to be 100% confident on permissioning, release, provenance, and compliance with existing laws (such as the Illinois Biometric Information Privacy Act).

Rumman: It seems like dataset curation is a big challenge—other than provenance and permission, was the rest easy?

Ben: I wish! Our next issue was tagging the dataset. We needed to create a process that was transparent, repeatable, and reduced biases. Dataset tagging is a human-driven process. People have to literally stare at images and decide what to tag them. To deal with variation across annotators, we created a pipeline that collects multiple tags for the same image, which are then passed through two rounds of QA to ensure minimal divergence between taggers and QA. As a final step, I personally checked 5% of all images as a third-order QA.
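The multi-annotator step Ben describes can be sketched in a few lines. This is a minimal illustration, not Reality Defender's actual pipeline: it assumes each image gets tags from several annotators, takes the majority tag as the consensus, and flags any image whose agreement falls below a threshold for QA review. The function name and data layout are hypothetical.

```python
from collections import Counter

def consolidate_tags(annotations, min_agreement=1.0):
    """Merge per-image tags from multiple annotators.

    annotations: dict mapping image_id -> list of tags (one per annotator).
    Returns (consensus, flagged): consensus maps image_id -> majority tag;
    flagged lists image_ids whose agreement fell below min_agreement
    and should be routed to a human QA reviewer.
    """
    consensus, flagged = {}, []
    for image_id, tags in annotations.items():
        top_tag, count = Counter(tags).most_common(1)[0]
        consensus[image_id] = top_tag
        if count / len(tags) < min_agreement:
            flagged.append(image_id)
    return consensus, flagged

# Hypothetical example: three annotators per image
anns = {
    "img_001": ["smiling", "smiling", "smiling"],   # unanimous -> accepted
    "img_002": ["young", "young", "middle-aged"],   # divergent -> QA review
}
consensus, flagged = consolidate_tags(anns)
```

In a real pipeline the flagged images would go through the two QA rounds Ben mentions rather than being resolved automatically; the point is that divergence between taggers is measured, not ignored.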

Rumman: Right, and this is the purpose of the challenge—to create a programmatic way to label images, because it’s pretty rare that dataset creators are spending the amount of time and energy on creating properly curated, compliant, and transparently tagged datasets. For example, ImageNet crowdsources its annotation and uses Amazon’s Mechanical Turk. It’s hard to do this work at scale and we don’t have good solutions. So what do you want practitioners to take away from this challenge, and to learn from what you and your team went through?

Ben: We hope for two things:

  1. The data set can do good in reducing algorithmic bias—models are only as good as the data that feeds them, and this is often not explored.

  2. We also want to bring attention to the challenge of creating data sets. We hope that practitioners of algorithmic auditing also ask the hard questions of the data their models are based on, and are not focused only on the models they are building.

Datasets are critically important to creating good model output. If you’re someone who wants to explore algorithmic auditing as a field, you’ll quickly realize that many issues of bias stem from the core data. As we all know, garbage in, garbage out.

The last thought we’d like to leave you with is that there will likely never be a fully technical solution to the problems of biased datasets. Considering the many potential sources of bias in datasets, we’ve asked you to think about only three, and limited complexity significantly to remove interactions. Our challenge scratches the surface, but hopefully also provides helpful scaled approaches to supplement human annotation.

We’re looking forward to announcing the winners of our competition later this week!
