Announced less than two weeks ago, the AI tool Copilot is already surrounded by legal questions. Aimed at helping devs, Copilot's underlying Codex model can amplify bias and may not share code legally.
At the end of June, GitHub and OpenAI announced a new tool for devs. Copilot was built to help developers write code by generating function suggestions in real time. To do this, Copilot uses Codex, an OpenAI model trained on open-source code. When announced, it was only available as a technical preview, which is probably for the best, because questions surround the new tool, the code and AI it uses, and the legality of how it shares that code with users.
When it comes to artificial intelligence, we've long known it isn't perfect. We've already learned that AI can be fooled by written words, that AI carries bias, and that the regulations around it are few and paltry. So it shouldn't come as a huge surprise that there is bias in Copilot's AI, something OpenAI revealed in a paper published on July 7. OpenAI concedes that Copilot may have significant limitations, including biases, which can be a really big problem for developers.
With the global push for equality for all, bias in AI that's used for business purposes is bad. Really, really bad. Early research shows that Codex, like other language models, "generates responses as similar as possible to its training data, leading to obfuscated code that looks good on inspection but actually does something undesirable. Specifically, OpenAI found that Codex, like GPT-3, can be prompted to generate racist and otherwise harmful outputs as code. Given the prompt 'def race(x):,' OpenAI reports that Codex assumes a small number of mutually exclusive race categories in its completions, with 'White' being the most common, followed by 'Black' and 'Other.' And when writing code comments with the prompt 'Islam,' Codex often includes the words 'terrorist' and 'violent' at a greater rate than with other religious groups." This is something devs need to watch for when using Copilot.
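To make the failure mode concrete, here is a hypothetical sketch of the kind of completion the paper describes. This is not actual Codex output, just an illustration of what a hard-coded, mutually exclusive category list looks like in code, and why a reviewer should stop and question it before accepting the suggestion:

```python
# Hypothetical illustration of the bias pattern OpenAI reports for the
# prompt "def race(x):" -- NOT actual Codex output. The problem is the
# closed, mutually exclusive category list baked into the function.
def race(x):
    # A fixed mapping with a catch-all "Other" encodes assumptions about
    # which categories exist and which get collapsed together.
    categories = {0: "White", 1: "Black"}
    return categories.get(x, "Other")
```

A human reviewer should ask whether these categories, their ordering, and the catch-all bucket are appropriate for the application, or whether the code should exist at all. That judgment is exactly what the autocomplete cannot make for you.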
OpenAI says it has found a way to improve large language models to reduce such bias, but it's hard to say whether the fix will actually work. The paper also notes that Codex is sample-inefficient: a novice programmer who has seen far less code can still be expected to out-reason it on many problems, so developers will need to keep their critical thinking skills sharp. On top of that, Codex's suggestions can look appropriate on the surface, while a deeper dive shows they don't do what the developer intended. It has also been found to recommend compromised packages as dependencies and to invoke functions insecurely.
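As an illustration of "invokes functions insecurely," consider a pattern an autocomplete trained on old open-source code could plausibly surface: unsalted MD5 for password hashing. This is a hypothetical example, not actual Copilot output, contrasted with the safer standard-library alternative a reviewer should substitute:

```python
import hashlib
import os

# Hypothetical insecure suggestion (not actual Copilot output):
# unsalted MD5 is fast and common in old codebases, but unsafe for passwords.
def hash_password_insecure(password):
    return hashlib.md5(password.encode()).hexdigest()

# What a reviewer should replace it with: a salted, deliberately slow
# key-derivation function from the standard library.
def hash_password(password, salt=None):
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest
```

The point isn't this one API; it's that a generated suggestion can compile, run, and still be the wrong call. Every security-sensitive completion needs the same scrutiny as hand-written code.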
So here you have just about everything we mentioned the other day: insecure coding, bias in AI, incorrect function suggestions, and so on. If those were the only questions, the tool's potential to grow and improve over time would still look strong. But those aren't the only problems Copilot is facing. GitHub CEO Nat Friedman has written in forum posts, "Training machine learning models on publicly available data is considered fair use across the machine learning community."
The problem with this statement is that it's not really true. Certain practices may have been accepted up until now, but no US legal precedent upholds training on publicly available data as fair use. Friedman would likely point to the Google Books case: Google downloaded and indexed more than 20 million books to create a literary search database, which Copilot's supporters will say is akin to training an algorithm. But the controversy isn't really over the ability to ingest copyrighted material. It's the output the machine produces from that material that blurs the line. There is no way to verify that whoever is using that output should actually have access to it.
Now we're in a place where we have more questions than answers. Problems have been identified but not rectified. Laws around publicly available training data are nonexistent, and the gap between what's assumed to be legal and what courts eventually decide could change how the entire tool works.
Before you dive head-first into this, or anything else technical, be sure to do your research. Learn everything you can about the benefits and drawbacks, keep an eye on any legal proceedings or regulations that emerge, and make sure you understand your responsibilities as a business owner. Remember, the FTC has already said that a business using a biased algorithm is on the hook for it, regardless of where it came from, and that will apply to AI-generated code as well. This article will give you some insight into how devs who received the technical preview think it works, so pay attention to their views. Your team will come across similar scenarios, so note what those devs find frustrating and what they like.
Don't fall prey to the newest, shiniest tool out there. Remember that it is untested in the real world, much about it is unknown, and there is even less clarity about how it will work once regulations catch up. Seeing as it's not available for full use yet, keep your eye on this one. Make sure the dust has settled before you make too many moves.