
Debevoise & Plimpton Discusses the Myth of Artificial Intelligence Errors

Machines are increasingly making important decisions that have traditionally been made by humans, such as who should get a job interview or who should receive a loan. For valid legal, reputational, and technical reasons, many organizations and regulators do not fully trust machines to make these judgments by themselves. As a result, humans usually remain involved in AI decision making, an arrangement referred to as keeping a “human in the loop.” For example, in the detection of skin cancer, the process may now involve an AI machine reviewing a photograph of a mole and making a preliminary assessment of cancer risk, followed by a dermatologist either confirming or rejecting that determination.

For these kinds of decisions, where human safety is at risk and there is an objectively correct answer (i.e., whether a mole is cancerous or not), human review of AI decisions is appropriate, and indeed, may be required by regulation. But little has been written about exactly when and how humans should review AI decisions, and how that review should be conducted for decisions with no objectively correct answer (e.g., who deserves a job interview). This article is an attempt to fill in some of those gaps by proposing a framework for achieving optimal joint human-machine decision making that goes beyond assuming that human judgment should always prevail over machine decisions.

Regulatory Requirements for Human Review of AI Decisions

Most regulations that address machine decision making require humans to review machine decisions that carry significant risks. For example, Article 22 of the European Union’s General Data Protection Regulation provides that EU citizens should not be subject to “solely automated” decisions that significantly affect them. The European Commission’s proposed AI Act similarly provides that high-risk AI systems should be designed and developed so that they can be effectively overseen by natural persons, including by enabling humans to intervene in or interrupt certain AI operations. In the United States, the Biden Administration’s recently released Blueprint for an AI Bill of Rights (which is not binding but will likely influence any future U.S. AI regulations) provides that AI systems should be monitored by humans as a check in the event that an automated system fails or produces an error.

These and other AI regulations require some level of human oversight of autonomous decision making to catch what are referred to as “algorithmic errors,” which are mistakes made by machines. Such errors certainly do occur, but these regulations are flawed to the extent that they imply that whenever a human and a machine make different decisions or reach different conclusions, the machine is necessarily wrong and the human is necessarily right. As discussed below, in many instances, there is no objectively right decision, and resolving the disagreement in favor of the human does not always lead to the optimal result.

In a recent Forbes article on AI ethics and autonomous systems, Lance Eliot offered several alternatives for resolving human-machine disagreements rather than defaulting to the view that the human is always correct.

Eliot rightly points out that over thousands of years, societies have developed several ways to efficiently resolve human-human disagreements, and, in fact, we often design processes to surface such disagreements in order to foster better overall decision making. Creating a similar system for identifying and resolving human-machine disagreements will be one of the fundamental challenges of AI deployment and regulatory oversight in the next five years. As discussed below, for many AI uses, it makes sense to require a human to review a machine’s decisions and override it when the human disagrees. But for some decisions, that approach will result in more errors, reduced efficiencies, and increased liability risks, and a different dispute resolution framework should be adopted.

Not All Errors Are Equal – False Positives vs. False Negatives

For many decisions, there are two different types of errors – false positives and false negatives – and they may have very different consequences. For example, assume that a dermatologist correctly determines whether a mole is cancerous or benign for 90 out of every 100 patients. For the patients who receive an incorrect diagnosis, it would be much better if the doctor’s mistake is wrongly identifying a benign mole as cancerous (i.e., a false positive), rather than wrongly identifying a cancerous mole as benign (i.e., a false negative). A false positive may result in an unnecessary biopsy that concludes that the mole is benign, which involves some additional inconvenience and cost. But that is clearly preferable to a false negative (i.e., a missed cancer diagnosis), which can have catastrophic results, including delayed treatment or even untimely death.

Now suppose that a machine that checks photographs of moles for skin cancer is also 90% accurate, but because the machine has been trained very differently from the doctor and does not consider the image in the context of other medical information (e.g., family history), it makes different mistakes than the doctor. What is the optimal result for patients when the doctor and the machine disagree as to whether a mole is benign or cancerous? Considering the relatively minor cost and inconvenience of a false positive, the optimal result may be that if either the doctor or the machine believes that the mole is cancerous, it gets shaved and sent for a biopsy. Adding a machine into the decision-making process and treating it as the doctor’s equal therefore increases the total number of errors, but because this resolution framework reduces the number of potentially catastrophic errors, the overall decision-making process is improved. If instead the human’s decision always prevailed, there would be cases where the machine detected cancer but the human did not, no biopsy was taken, and the cancer was discovered only later, perhaps with extremely negative consequences for the patient from the delayed diagnosis. That would obviously be a less desirable outcome, with increased costs, liability risks, and, most significantly, patient harm.
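
To make the trade-off concrete, here is a minimal numerical sketch of the “either one flags it, biopsy it” rule described above. The 90% accuracy figure comes from the example; the 10% cancer prevalence and the assumption that the doctor’s and the machine’s errors are independent are illustrative assumptions, not clinical facts.

```python
# Sketch only: assumed prevalence and independent errors, per the lead-in above.
import random

random.seed(0)
N = 100_000
PREVALENCE = 0.10   # assumed share of moles that are actually cancerous
ACCURACY = 0.90     # each reviewer calls a given case correctly 90% of the time

def review(is_cancer: bool) -> bool:
    """Return a reviewer's call, which is correct with probability ACCURACY."""
    correct = random.random() < ACCURACY
    return is_cancer if correct else not is_cancer

fn_doctor = fp_doctor = fn_either = fp_either = 0
for _ in range(N):
    is_cancer = random.random() < PREVALENCE
    doctor_says = review(is_cancer)
    machine_says = review(is_cancer)   # machine errors assumed independent of the doctor's
    either_says = doctor_says or machine_says

    fn_doctor += is_cancer and not doctor_says
    fp_doctor += (not is_cancer) and doctor_says
    fn_either += is_cancer and not either_says
    fp_either += (not is_cancer) and either_says

print(f"Doctor alone:    {fn_doctor} missed cancers, {fp_doctor} unnecessary biopsies")
print(f"Either flags it: {fn_either} missed cancers, {fp_either} unnecessary biopsies")
```

With these assumed numbers, missed cancers fall roughly tenfold while unnecessary biopsies roughly double – more total errors, but far fewer catastrophic ones, which is exactly the trade described above.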

Not All Decisions Are Right or Wrong – Sorting vs. Selecting

There are times when machines make mistakes. If a semi-autonomous car wrongly identifies a harvest moon as a red traffic light and slams on the brakes, that is an error, and the human driver should be able to override that incorrect decision. Conversely, if the human driver is heavily intoxicated or has fallen asleep, their driving decisions are likely to be wrong and should not prevail.

But many machine decisions do not lend themselves to a binary right/wrong assessment. For example, consider credit and lending decisions. A loan application from a person with a very limited credit history may be rejected by a human banker, but that loan may be accepted by an AI tool that considers non-traditional factors, such as cash flow transactions from peer-to-peer money transfer apps. For these kinds of decisions, it is difficult to characterize either the human or the machine as right or wrong. First, if the loan is rejected, there is no way to know whether it would have been paid off had it been granted, so the denial decision cannot be evaluated as right or wrong. In addition, determining which view should prevail depends on various factors, such as whether the bank is trying to expand its pool of borrowers and whether false positives (i.e., lending to individuals who are likely to default) carry more or less risk than false negatives (i.e., not lending to individuals who are likely to repay their loans in full).
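
As a purely hypothetical illustration of why neither party is simply “right,” the sketch below compares the expected cost of approving versus rejecting a thin-file applicant under made-up repayment estimates and error costs; every number is an assumption for illustration, not an actual underwriting figure.

```python
# Hypothetical figures only, as noted in the lead-in above.
COST_DEFAULT = 10_000   # assumed loss if an approved borrower defaults (a false positive)
LOST_MARGIN = 1_500     # assumed forgone profit if a good borrower is rejected (a false negative)

def expected_cost(p_repay: float, approve: bool) -> float:
    """Expected cost of approving or rejecting, given an estimated repayment probability."""
    if approve:
        return (1 - p_repay) * COST_DEFAULT
    return p_repay * LOST_MARGIN

# A thin-file applicant: the banker implicitly estimates a lower repayment
# probability than the model, which weighs non-traditional cash-flow data.
for label, p_repay in [("banker's estimate", 0.80), ("model's estimate", 0.93)]:
    approve_cost = expected_cost(p_repay, approve=True)
    reject_cost = expected_cost(p_repay, approve=False)
    best = "approve" if approve_cost < reject_cost else "reject"
    print(f"{label}: approve={approve_cost:,.0f}  reject={reject_cost:,.0f}  ->  {best}")
```

Under the banker’s assumed estimate, rejecting looks cheaper; under the model’s, approving does. That is why the useful question is how the disagreement should be resolved, not which party committed an “error.”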

The False Choice of Rankings and the Need for Effective Equivalents

Many AI-based decisions represent binary yes/no choices (e.g., whether to underwrite a loan or whether a mole should be tested for cancer). But some AI systems are used to prioritize among candidates or to allocate limited resources. For example, algorithms are often used to rank job applicants or to prioritize which patients should receive the limited number of organs available for transplant. In these AI sorting systems, applicants are scored and ranked. However, as David Robinson points out in his book, Voices in the Code, it seems arbitrary and unfair to treat a person whom an algorithm scored at 9.542 as a superior candidate for a kidney transplant compared with a person who received a nearly identical score of 9.541. This is an example of the precision of the ranking algorithm creating the illusion of a meaningful choice, when in reality the two candidates are effectively equal and some other method should be used to select between them.
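
One way to act on Robinson’s point is to treat scores within some tolerance of each other as a tie and select among the tied candidates by another method. The sketch below does this with a lottery; the 0.01 tolerance, the candidate names, and the lottery itself are illustrative assumptions rather than features of any actual allocation system.

```python
# Sketch only: tolerance, names, and lottery tie-break are assumed for illustration.
import random

def effectively_equal(a: float, b: float, tolerance: float = 0.01) -> bool:
    """Treat scores within the tolerance of each other as indistinguishable."""
    return abs(a - b) <= tolerance

def select(candidates: dict[str, float]) -> str:
    """Pick the top-scoring candidate, but break effective ties by lottery."""
    top_score = max(candidates.values())
    tied = [name for name, score in candidates.items()
            if effectively_equal(score, top_score)]
    return random.choice(tied)

print(select({"Candidate A": 9.542, "Candidate B": 9.541, "Candidate C": 8.970}))
```

A real allocation system would need a defensible tolerance and tie-breaking rule, but the point stands: the third decimal place should not be doing the deciding.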

Humans and Machines Playing to Their Strengths and the Promise of Joint Decision Making

Despite significant effort to use AI to improve the identification of cancer in mammograms or MRIs, these automated screening tools have struggled to make diagnostic gains over human physicians. Doctors reading mammograms reportedly miss between 15% and 35% of breast cancers, but AI tools often underperform doctors. The challenge of AI for mammogram analysis is different from the skin cancer screening discussed above because mammograms have a much higher cost for false positives; a biopsy of breast tissue is more invasive, time-consuming, painful, and costly than shaving a skin mole.

However, a recent study published in The Lancet indicates that a complex joint-decision framework, with doctors and AI tools working together, and checking each other’s decisions, can lead to better results for the review of mammograms – both in terms of reducing false positives (i.e., mammograms wrongly categorized as showing cancer when no cancer is present) and reducing false negatives (i.e., mammograms wrongly categorized as showing no cancer when cancer is present).

According to this study, the suggested optimal workflow involves the machine being trained to sort mammograms into three categories: (1) confident normal, (2) not confident, and (3) confident cancerous.

This complex workflow achieves superior results because it assigns each element of the decision to whichever contributor is better at it. The machine is much better at quickly and consistently determining which scans are clearly not of concern. The radiologists are better at determining which potentially suspicious scans are genuinely concerning, but the doctors, being human, are not better all the time. In certain circumstances (e.g., when the doctors are tired, distracted, or rushed), the machine may be better, so a safety net is inserted into the process to capture those situations and thereby optimize the overall decision process. This is a good example of why the workflow of human-machine decisions needs to be tailored to the particular problem. Here again, having the human decisions always prevail would not achieve the best results. Only through a complex framework, with decision-makers playing to their strengths and covering for each other’s weaknesses (e.g., the machines never getting tired or bored), can results be significantly improved.
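
For illustration only, the sketch below shows one way such a triage could be wired together, with the machine clearing or flagging the confident cases, routing the uncertain ones to a radiologist, and applying a safety-net second read when the human clears a scan the model still rates as suspicious. The thresholds, function names, and safety-net rule are assumptions; they are not the parameters of the Lancet study’s actual workflow.

```python
# Sketch only: thresholds and safety-net rule are assumed, as noted above.
from typing import Callable

def route_mammogram(model_score: float,
                    first_reader: Callable[[], bool],
                    second_reader: Callable[[], bool],
                    low: float = 0.05, high: float = 0.95) -> bool:
    """Return True if the scan is flagged for follow-up, False if it is cleared."""
    if model_score <= low:        # (1) confident normal: the machine clears it outright
        return False
    if model_score >= high:       # (3) confident cancerous: flag it without waiting
        return True
    flagged = first_reader()      # (2) not confident: the radiologist decides
    if not flagged and model_score >= 0.5:
        # Safety net (assumed design): if the human clears a scan the model still
        # rates as fairly suspicious, route it to a second human reader.
        flagged = second_reader()
    return flagged

# Example: a borderline scan the first reader clears but the model scores at 0.7.
print(route_mammogram(0.7, first_reader=lambda: False, second_reader=lambda: True))
```

The value of the sketch is only to show where the human and the machine each sit in the workflow; the study’s actual thresholds and routing rules are more involved.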

Creating a Framework for Resolving Human-Machine Disagreements

Again, much of the regulatory focus on AI decisions is aimed at requiring humans to review certain decisions being made by machines and to correct the machine’s errors. For low-stakes decisions that need to be made quickly and in large volume, that is often the right decision framework, even if it is not always the most accurate one. One of the primary benefits of AI is speed, and in designing any human-machine decision framework, one must be careful not to trade substantial losses in overall efficiency for insignificant gains in accuracy.

But in many circumstances, the decisions being made by AI significantly affect people’s lives, and efficiency is therefore not as important as accuracy or rigor. In those cases, when humans and machines have a legitimate disagreement, assuming that the human is right and should prevail is not always the best approach. Instead, the optimal results come from an analysis of the particular dispute and the implementation of a resolution framework that is tailored to the particular decision-making process and automated technology.

Below are some examples of how different human-machine decision frameworks may be appropriate depending on the circumstances.

Option #1 – Human in the Loop: The machine reviews a large number of candidates and makes an initial assessment by ranking them, but the actual selection is made by a human

Decision Examples:

Factors:

Option #2 – Human Over the Loop: The machine makes an initial decision without human involvement, which can be quickly overridden by a human if necessary

Decision Examples:

Factors:

Option #3 – Machine Authority: The machine prevails in a disagreement with a human

Decision Examples:

Factors:

Option #4 – Human and Machine Equality: If either the human or the machine decides X, then X is done

Decision Examples:

Factors:

Option #5 – Hybrid Human-Machine Decisions: Human and machine check each other’s decisions, sometimes without prior knowledge of what the other decided; for high-risk decisions where the human and machine clearly disagree, the human (or another human) is alerted and asked to review the decision again (a rough sketch of this option appears below).

Decision Examples:

Factors:
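
As a rough sketch of Option #5, the snippet below records the human’s and the machine’s independent decisions, accepts them when they agree, escalates high-risk disagreements to a second human reviewer, and otherwise lets the first human’s call stand. The risk threshold and the escalation rule are illustrative assumptions, not a prescribed design.

```python
# Sketch only: risk threshold and escalation rule are assumed, as noted above.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    risk: float          # assumed 0-1 weighting of how consequential the decision is
    human_says: bool
    machine_says: bool

def resolve(case: Case, escalate: Callable[[Case], bool]) -> bool:
    """Return the final decision for a single case."""
    if case.human_says == case.machine_says:
        return case.human_says              # agreement: nothing to resolve
    if case.risk >= 0.7:                    # assumed high-risk threshold
        return escalate(case)               # a second human re-reviews the disagreement
    return case.human_says                  # low risk: the first human's call stands

decision = resolve(Case(risk=0.9, human_says=False, machine_says=True),
                   escalate=lambda c: True)  # hypothetical second reviewer sides with the machine
print(decision)
```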

Conclusion

These examples demonstrate that there are several viable options for resolving disputes between machines and humans. Sometimes, circumstances call for a simple workflow, with human decisions prevailing. In other cases, however, a more complicated framework may be needed because the humans and the machines excel at different aspects of the decision, which do not easily fit together.

In the coming years, efforts to improve people’s lives through the adoption of AI will accelerate. Accordingly, it will become increasingly important for regulators and policy makers to recognize that several options exist to optimize human-machine decision making. Requiring a human to review every significant AI decision, and to always substitute their decision for the machine’s if they disagree, may unnecessarily constrain certain innovations and will not yield the best results in many cases. Instead, the law should require AI developers and users to assess and adopt the human-machine dispute-resolution framework that most effectively unlocks the value of the AI by reducing the risks of both human and machine errors, improving efficiencies, and providing appropriate opportunities to challenge or learn from past mistakes. In many cases, that framework will involve human decisions prevailing over machines, but not in all of them.

This post comes to us from Debevoise & Plimpton LLP. It is based on a post on the firm’s Data Blog, “When Humans and Machines Disagree – The Myth of ‘AI Errors’ and Unlocking the Promise of AI Through Optimal Decision Making,” available here.
