No More Excuses for Not Using Predictive Coding

By Herbert L. Roitblat
November 28, 2011

Predictive coding is the latest of several innovations that have fundamentally altered the process of e-discovery, some of them even leading to changes in the Federal Rules. Such innovations as native file review, keyword search, concept search and clustering have led to dramatic decreases in the cost and effort required for e-discovery. Predictive coding, though, promises to overshadow these earlier innovations, even further reducing the amount of time, cost and effort required for e-discovery. Predictive coding has become a must-have e-discovery capability.

How It Works

With predictive coding, a single expert can make decisions about the responsiveness of millions of documents, with a transparency and level of accuracy not achievable using other approaches. The computer takes over much of the drudgery of reviewing documents while implementing the expert's judgments with a high level of fidelity and transparency.

Several different machine-learning technologies have been used for predictive coding. For example, support vector machines place a dividing line between responsive and nonresponsive documents. Probabilistic approaches compute, from the words in a document, whether the document is more likely to be responsive or nonresponsive. Probability, here, does not mean the flip of a figurative coin, but rather that any piece of evidence, for example, each word, is more likely to indicate a responsive or a nonresponsive document. Nearest neighbor approaches ask whether a given document is more like (“near to”) a document known to be responsive or more like one known to be nonresponsive. Clustering, sometimes arguably described as predictive coding, gathers together groups of similar documents that a user can then tag as a whole group, as desired.
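
The nearest neighbor idea can be made concrete with a toy sketch, assuming bag-of-words features and cosine similarity; the sample documents and labels below are invented for illustration and do not come from any real review or product.

```python
# Toy nearest-neighbor document classifier (illustration only; real
# predictive coding systems use richer features and trained models).
from collections import Counter
import math

def vectorize(text):
    """Bag-of-words term frequencies for a document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(doc, labeled):
    """Give a document the label of its most similar labeled example."""
    vec = vectorize(doc)
    nearest = max(labeled, key=lambda pair: cosine(vec, vectorize(pair[0])))
    return nearest[1]

labeled = [
    ("merger agreement draft terms", "responsive"),
    ("fantasy football league picks", "nonresponsive"),
]
label = classify("revised merger agreement attached", labeled)  # "responsive"
```

A probabilistic system would instead accumulate, word by word, the odds that each term favors one label over the other; the nearest neighbor sketch simply adopts the label of the closest known example.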

Some predictive coding systems require the user to come up with a seed set of responsive documents. Others rely on random sampling to provide a representative set of training examples, without having to know ahead of time which documents are responsive.

Predictive coding often involves an iterative learning process, where the computer predicts a sample of documents as responsive or not, and an expert corrects those predictions. The iterative process can use active learning, where the system selects for human judgments only those documents that are most ambiguous, or it can continue to present random samples.
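
The active-learning variant of that loop can be sketched as follows; `score` (a stand-in model confidence between 0 and 1) and `expert` (the human judgment) are hypothetical placeholders of my own, not features of any particular product.

```python
def active_learning_loop(unlabeled, score, expert, rounds=3, batch=2):
    """Each round, select the documents whose model score is most
    ambiguous (closest to 0.5) and send them to the expert for judgment."""
    labeled = {}
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        pool.sort(key=lambda d: abs(score(d) - 0.5))  # most ambiguous first
        for doc in pool[:batch]:
            labeled[doc] = expert(doc)  # expert corrects or confirms
        pool = pool[batch:]
        # A real system would retrain its model on `labeled` here,
        # changing the scores used to pick the next round's sample.
    return labeled

docs = ["merger deal terms", "lunch plans", "merger memo", "weekend trip"]
score = lambda d: 0.9 if "merger" in d else 0.5   # stand-in model score
expert = lambda d: "merger" in d                  # stand-in human judgment
judgments = active_learning_loop(docs, score, expert, rounds=1)
```

A random-sampling system would simply replace the ambiguity sort with a random draw from the pool.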

Accuracy

As a group, predictive coding systems have been found to achieve levels of accuracy that match or exceed those of human review teams. For example, in a peer-reviewed study published in the Journal of the American Society for Information Science and Technology (Roitblat, Kershaw & Oot, 2010), we reported that two computer systems each agreed with an original human review on about 83% of the documents. By comparison, two new human teams agreed on about 73% of the documents. We made the point that computerized predictive coding was certainly no worse than human review. We also reported that if one team of reviewers finds a document responsive, the odds that another team working independently will find the same document responsive are about 50:50.

Other studies have also addressed the “myth” that reviews by teams of reviewers are inherently more reliable than could be obtained with machine assistance (e.g., Grossman & Cormack, 2011; Baron, et al., 2009). The evidence seems very clear that using computers to aid in the categorization of documents (predictive coding) can be very effective.

Moving Toward Predictive Coding

Several factors have come together over the last 18-24 months to finally make attorneys willing to consider predictive coding as a reasonable alternative to human review:

  • The volume of documents or electronically stored information (ESI) continues to mushroom;
  • The recession has impacted all parts of the economy. Companies have felt the need to conserve their funds to the point where they avoid investing in lawsuits. Business litigation is expensive, even if the party believes that it will prevail;
  • Outsourcing to review management companies has become more acceptable. Once the responsibility for first-pass review has been assigned to an outside resource, it becomes easier to conceive of other methods that could give comparable results;
  • There has been a growing recognition that the e-discovery process can be measured and one can determine how well two groups, two systems or two people are doing at their tasks;
  • Papers have been published that do, in fact, measure the accuracy of human and machine review (predictive coding) and find that the machines do at least as well as people; and
  • The IBM computer system, Watson, which beat out human Jeopardy champions so handily, helped people to understand that computers were capable of fairly sophisticated information understanding.

Because the guiding decisions in predictive coding are always made by an expert reviewer, they always reflect that expert's judgment. The attorney never loses control over the decision-making because the computer does not create the decision rules; it merely implements the implicit rules employed by the expert.

Predictive coding, then, is never a choice of people or machine, but a combination of the two. The judgment is the expert's; the bulk of the labor is the machine's. Unlike human reviewers, though, computers do not get tired, they don't get distracted and they don't treat the same information in inconsistent ways. They don't get headaches or take vacations.

Answering the Skeptics

Despite the studies, there is still some doubt about using predictive coding in e-discovery. To some lawyers, it seems that they are being asked to take a leap of faith that a computer system can do what its proponents say it can.

Several studies have found that predictive coding is effective, but the best strategy is to measure directly the accuracy of predictive coding in each particular case. This measurement is fairly easy to accomplish and not particularly burdensome. Measuring the actual outcome means that you don't have to trust in a black box. About the simplest method is to take a random sample of a few hundred documents that have been predictively coded and have them reviewed again, this time by an authoritative reviewer. If the level of accuracy is comparable to or better than what could be achieved with human reviewers, then it is reasonable to say that you have done as well as you would have done with more traditional coding methods. Recall that teams of reviewers are less than perfect as well. The goal of review is to be reasonable, not necessarily perfect.
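
As a sketch of the arithmetic behind such a sample, assuming simple random sampling and a normal approximation (my illustration, not a prescribed protocol): if an authoritative reviewer agrees with 340 of 400 sampled machine calls, the agreement rate is 85%, give or take about 3.5 points at 95% confidence.

```python
import math

def agreement_rate(machine_labels, expert_labels):
    """Fraction of sampled documents on which the machine's call matches
    the authoritative reviewer's, with a normal-approximation 95% margin."""
    n = len(machine_labels)
    agree = sum(m == e for m, e in zip(machine_labels, expert_labels))
    p = agree / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)  # 95% CI half-width
    return p, margin

# Hypothetical 400-document sample with 340 agreements:
# 85% agreement, margin of roughly 0.035.
p, margin = agreement_rate([1] * 340 + [0] * 60, [1] * 400)
```

Larger samples narrow the margin; the point is only that the check is a few hundred documents, not a second full review.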

Some predictive coding skeptics are concerned about what judges might say about its use. There has never been a ruling, to my knowledge, about whether it is acceptable to use predictive coding; but then, neither has there been a ruling on whether it is acceptable to use keyword searching. The courts generally do not endorse specific technologies, they typically do not rule on processes that work, and they rule only when there is a dispute. There have been cases about what searches were conducted, but never about whether searching is acceptable. In an Oct. 1 article in Law Technology News (LTN), Judge Andrew Peck predicted that no ruling on predictive coding is likely to be forthcoming. (See, “Search, Forward,” http://bit.ly/mSCBkd.)

The View from the Bench

In “Ten Key E-Discovery Issues to Watch in 2011” (published online by Huron Consulting, available at http://bit.ly/vDseEB), Judge Peck, with David J. Lender, argues that the lack of overt judicial approval from the bench should not stop the use of such tools.

Even if the steps suggested in the William A. Gross [William A. Gross Const. Assc., Inc. v. American Mfrs. Mut. Ins. Co., 256 F.R.D. 134, 136 (S.D.N.Y. 2009)] decision are followed, keyword searching will produce less than 50% of responsive ESI. There are more sophisticated search tools available, such as clustering and concept searching techniques instead of, or in combination with, keyword searches that may be considered. We are not aware of any published judicial decision addressing these tools, but that does not mean that they should not be considered, especially if done transparently and under the cooperation model.

In the context of considering Rule 502 of the Federal Rules of Evidence, Peck and Lender note that the courts have found in some cases that the use of technology did not meet the standard of reasonable care. They argue, though, that this was because of specific deficiencies in how the technology was used, rather than the use of technology per se.

[T]hose decisions have indicated that technological solutions may be sufficient so long as proper quality assurance testing is employed and can be established. This means that both the privileged and responsive sets need to be tested with statistically significant samples to ensure that the key words or other techniques (e.g., clustering, logic-based algorithms) utilized reasonably worked to isolate the privileged documents from the remainder of the production.

Judge Peck recommends transparency, sharing with the other side that predictive coding will be used, and perhaps sharing the documents used to train the predictive coding.

Judge Paul Grimm, writing with Lisa Yurwit Bergstrom and Matthew P. Kraeuter, provided another favorable opinion regarding the use of advanced technology, such as predictive coding, again in the context of Rule 502:

It is hoped that future courts will be receptive and accommodating to the use of these screening methods to prevent disclosure of privileged and protected information. While these methods are not perfect, there is growing evidence that they are as good, or far better than, “eyes on” review of all digital information by an attorney or paralegal.

“Federal Rule of Evidence 502: Has It Lived Up to Its Potential?,” University of Richmond Journal of Law & Technology, Spring 2011 (http://bit.ly/rUd4SB).

Some skeptics are concerned about the possibility of a Daubert challenge to the use of predictive coding.

Judge Peck, in the LTN article, comments that he does not think that concerns about a Daubert challenge are well founded. Daubert, he says, concerns the admissibility of scientific evidence in a trial. In contrast, whether a document found by predictive coding or any other means is admissible is a function of the document itself, not the means by which it was found.

On the other hand, predictive coding is based on sound, broadly accepted science. There are many papers in widely read peer-reviewed journals measuring the effectiveness of a range of document categorizers, so even a Daubert-type challenge would probably be easily resolved.

Judge Peck concludes his LTN article by saying: “In my opinion, computer-assisted coding should be used in those cases where it will help 'secure the just, speedy, and inexpensive' (Fed. R. Civ. P. 1) determination of cases in our e-discovery world.”

Conclusion

Hungry penguins on the Antarctic ice floes gather at the water's edge and jostle each other. Their food is in the water, but so are the seals and sharks that prey on them. No one wants to be first. I think that the evidence is now sufficiently clear that predictive coding works, that judges don't object to its use, and that there is no longer any reason to avoid it. Unlike the penguins, we can see that there are no predators waiting to pounce on those who use predictive coding. Moreover, the consequences of not using it are so significant for clients and the legal system that we can ill afford to wait any longer.

Clients are demanding creative and effective ways of meeting their e-discovery needs without going bankrupt in the process. Predictive coding can be an essential tool for meeting those needs and is one of those innovations that is going to stick and change the nature of e-discovery.


Herbert L. Roitblat, Ph.D., is the CTO and Chief Scientist for OrcaTec, an international document decisioning technology company based in Atlanta. Dr. Roitblat is a member of the Sedona working group on Electronic Document Retention and Production, on the Advisory Board of the Georgetown Legal Center Advanced eDiscovery Institute, a member of the program committee for the 2011 Georgetown Advanced eDiscovery Institute, and the chair of the Electronic Discovery Institute.
