First things first

A quick note: I'm going to take a brief newsletter break, but will resume publishing the week of July 9. At that time, or shortly thereafter, I will likely begin publishing 4 times per week (Monday-Thursday). You'll receive twice as many emails, but each issue will be significantly less dense. This issue is particularly dense.

There's a lot to talk about this week, but I want to begin with the topic of bias in artificial intelligence models. It's an issue that gets an awful lot of attention -- including in this feature from Fortune -- but that I think the general public misunderstands. There were also two other items this week that got me thinking about this:

The problem is not with AI at all, but rather with the data we feed the models and, sometimes, with the people whose job is to use the products of those models.

Really, the problem of bias goes back at least as far as the term "big data," and it was recognized even earlier in the old saying, "garbage in, garbage out." The whole idea of big data was to collect and analyze as much of it as possible, and then draw conclusions about behavior, trends or whatever else the data could tell you. This could be at an aggregate level, an individual level, or both (e.g., placing things in cohort groups).

Bigger was better, the story went, because it both increased accuracy and helped prevent overfitting. At massive scale, the effects of any anomalies would be mitigated and you should start to see a flattening toward the overall reality.

The problem, as smart folks were already pointing out several years ago -- including in this 2014 White House report, this 2016 follow-up report, and in papers, talks and think tanks -- is that accuracy is not the same as objectivity. You can have a massive dataset that's fundamentally flawed because the data is incomplete, doesn't include certain groups, unintentionally reinforces stereotypes or flawed policies, or has any number of other issues.
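
To make the accuracy-versus-objectivity point concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the records are synthetic, the groups "A" and "B" are hypothetical, and the "model" is just a majority-vote rule rather than anything a real system would use. It shows how a model can look great in aggregate while being wrong for every member of an underrepresented group.

```python
from collections import Counter

# Hypothetical synthetic records: (group, feature, label).
# Group A makes up 95% of the data. In group A, feature 1 goes with label 1;
# in the much smaller group B, the relationship is reversed.
records = (
    [("A", 1, 1)] * 475 + [("A", 0, 0)] * 475
    + [("B", 1, 0)] * 25 + [("B", 0, 1)] * 25
)

def fit_majority_rule(rows):
    """For each feature value, learn the most common label across all rows."""
    counts = {}
    for _group, feature, label in rows:
        counts.setdefault(feature, Counter())[label] += 1
    return {feature: c.most_common(1)[0][0] for feature, c in counts.items()}

def accuracy(rule, rows):
    """Fraction of rows where the rule's prediction matches the label."""
    correct = sum(1 for _group, feature, label in rows if rule[feature] == label)
    return correct / len(rows)

rule = fit_majority_rule(records)
overall = accuracy(rule, records)
by_group = {
    group: accuracy(rule, [r for r in records if r[0] == group])
    for group in ("A", "B")
}

print(f"overall: {overall:.2f}")      # overall: 0.95
for group, acc in by_group.items():
    print(f"group {group}: {acc:.2f}")  # group A: 1.00, group B: 0.00
```

The aggregate number looks excellent precisely because group B is too small to move it. Adding more data with the same skew would not change the picture; only checking results per group reveals the problem.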

AI, especially deep learning, is really just an advancement of big data. Models ingest lots of data, detect patterns, and then serve predictions. Yeah, there's some black-box stuff happening that obscures exactly why a model makes the decisions it makes, but in the end it all boils down to the data.

Algorithms like this are not biased because they're the product of intelligent-but-racist (or classist or ageist or sexist) systems. They're biased because that's the data they were fed, and nobody noticed, thought enough about or was able to fix the problem. In some cases, the systems might only appear biased, because they've been tweaked to optimize for a certain desired outcome.

I'm as happy as anybody that AI is finally bringing the issue of algorithmic bias into the public eye, but let's please not call it an AI problem. Let's call it what it is: a problem with data and a problem with people. Solving those is harder, and also entirely necessary if we really want AI to lead us into a future that's not a disaster.

Silicon Valley vs. the United States

The issue of tech employees protesting their employers' work with the U.S. military, ICE and law enforcement has been in the news quite a bit lately, and I've discussed it a couple of times myself (starting with the Google AI controversy):

This week brought some minor good news for folks protesting law enforcement use of AWS facial recognition tools, when the Orlando Police Department said it's no longer using the Rekognition service. However, it's worth noting that the department didn't say it wouldn't re-up with AWS at some point, nor did it say it will stop using facial recognition altogether. So the actual news is that it's not using Rekognition at the moment.

Salesforce employees upset about the U.S. policy of separating families crossing the border also urged that company to cease its work with border patrol agencies and generally be more mindful going forward. However, Salesforce will not end its contract with U.S. Customs and Border Protection, despite CEO Marc Benioff's personal disgust for the policies his employees are lobbying against.

More or less, this is how I would expect most of these situations to play out. It happened with Microsoft re: border patrol, it happened with AWS re: facial recognition, and it happened with Salesforce. Either the products aren't directly involved in any questionable practices (Salesforce sells CRM software, after all) or the companies don't want to police every way in which their customers are using their products. It's heartening to see employees and even executives expressing themselves on these matters, but at the end of the day and absent extenuating circumstances, money is going to talk loudest.

California's new data privacy law

You probably read something about this. If not, the short version is that California has a new privacy law in the vein of the EU's GDPR that will take effect in 2020. I haven't read it in its entirety, but the gist is definitely the same, including provisions around deleting customer data and ensuring what companies do store is private.

It's easy to condemn something like this as bad for business -- and if this dire report on the publishing industry is any indication, many companies still aren't GDPR-compliant -- but let's not overlook how badly the United States needs some privacy legislation with teeth. Here's this week alone:

There can't be a data-driven world with AI everywhere if that data isn't safe and users don't have any control over it. If self-regulation won't work, the only other real fix is legislative.

This pretty much sums up how to think about AI for your business

From Andreessen Horowitz partner Benedict Evans: Ways to think about machine learning. You should read the whole thing, but this section sums it up and resonates with how I've been trying to explain the value for years now:

Five years ago, if you gave a computer a pile of photos, it couldn’t do much more than sort them by size. A ten year old could sort them into men and women, a fifteen year old into cool and uncool and an intern could say ‘this one’s really interesting’. Today, with ML, the computer will match the ten year old and perhaps the fifteen year old. It might never get to the intern. But what would you do if you had a million fifteen year olds to look at your data? What calls would you listen to, what images would you look at, and what file transfers or credit card payments would you inspect?

That is, machine learning doesn't have to match experts or decades of experience or judgement. We’re not automating experts. Rather, we’re asking ‘listen to all the phone calls and find the angry ones’. ‘Read all the emails and find the anxious ones’. ‘Look at a hundred thousand photos and find the cool (or at least weird) people’.

In a sense, this is what automation always does; Excel didn't give us artificial accountants, Photoshop and Indesign didn’t give us artificial graphic designers and indeed steam engines didn’t give us artificial horses. (In an earlier wave of ‘AI’, chess computers didn’t give us a grumpy middle-aged Russian in a box.) Rather, we automated one discrete task, at massive scale.

AI and machine learning

Cloud and infrastructure

Data and analytics