The Battle of the Pens: Insights into Detecting AI-Generated Text

Read SHL Labs’ latest insights from analyzing 1 million data points to detect human and AI-generated text.

By Kuldeep Yadav

Apr 11, 2023

We are all intrigued by the capabilities that ChatGPT has to offer. With its ability to generate human-like text and answer just about any question, what does this mean for the future of assessments?

ChatGPT is a large language model released by OpenAI a few months ago. It can answer questions across different domains, handle follow-ups, and even solve complex problems in areas such as coding and logical reasoning. However, it also has significant limitations, notably hallucinating and making up facts.

The virality of ChatGPT has sparked discussions about the future of tests and assessments in schools, universities, and workplaces. ChatGPT has demonstrated its ability to accurately solve many assessments, including the following examples:

Similarly, several scientific publication venues have blocked the use of ChatGPT in manuscripts. ChatGPT has also been accused of fake generations, which experts refer to as "hallucinations", for example when it provides an answer backed by bogus citations.

The risks mentioned above are not limited to ChatGPT; they apply to other language models as well. Many language models are publicly available, and they are becoming more advanced every day.

Now the million-dollar question is:

Can an AI-generated text detection algorithm accurately distinguish between human-written and AI-generated text, and where do the current state-of-the-art techniques stand?

The last few weeks have seen many techniques that claim to accurately differentiate between human-written and AI-generated text. To answer the burning question, we present insights from one of the largest studies of its kind, examining different techniques on more than 1 million essays, some written by humans and some generated by AI.

Dataset and algorithms

We acquired a dataset of 1 million human-written essays/articles with an average length of 250 words. All these essays were written in response to 10 different prompts between January 2019 and October 2022. Further, we generated about 16,000 essays using large language models such as OPT, Bloom, GPT-Neo, GPT-3, and ChatGPT. The following pie chart provides a breakdown of the dataset.

We evaluated four detectors: GLTR, the OpenAI GPT-2 detector, a detector based on perplexity (PPL) features (similar to GPTZero), and the HC3-Roberta model (publicly released on January 18, 2023). We will discuss the implementation details of the compared AI-text detection techniques in a future arXiv study.
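To give a feel for what a perplexity-based detector works with, here is a minimal sketch of the standard perplexity formula (the exponential of the negative mean token log-probability). It is not the implementation used in our study. The function names, the threshold value, and the toy log-probabilities are hypothetical; in practice the log-probabilities would come from a scoring language model, and the usual intuition is that text the model finds very predictable (low perplexity) is more likely to be machine-generated.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability) over a text's tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def looks_ai_generated(token_logprobs, threshold=20.0):
    """Flag text whose perplexity is below a threshold.

    The threshold of 20.0 is a hypothetical placeholder, not a tuned value.
    """
    return perplexity(token_logprobs) < threshold


# Toy input: four tokens, each given probability 0.5 by the scoring model.
uniform_half = [math.log(0.5)] * 4
print(perplexity(uniform_half))
print(looks_ai_generated(uniform_half))
```

A single threshold is the simplest possible decision rule; a real detector would typically combine several perplexity-derived features and calibrate them on labeled data.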

What are we measuring?

To keep things simple and understandable for a diverse audience, we work with two metrics.

Error in detecting human-written essays (%): Fraction of essays that are human-written but that the algorithm classifies as AI-generated. Ideally, this should be ZERO so that writers who do their work honestly are never penalized.
Error in detecting AI-generated essays (%): Fraction of essays that are AI-generated but the algorithm classifies them as human-written.
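The two metrics above can be sketched in a few lines. This is an illustrative example only: the function name, the "human"/"ai" label strings, and the toy labels are assumptions made for the sketch, not data or code from our study.

```python
def detection_errors(true_labels, predicted_labels):
    """Return (human_error_pct, ai_error_pct) for labels "human" and "ai".

    human_error_pct: share of human-written essays classified as AI-generated.
    ai_error_pct: share of AI-generated essays classified as human-written.
    """
    human_total = human_wrong = ai_total = ai_wrong = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == "human":
            human_total += 1
            if predicted == "ai":
                human_wrong += 1
        elif truth == "ai":
            ai_total += 1
            if predicted == "human":
                ai_wrong += 1
    human_error = 100.0 * human_wrong / human_total if human_total else 0.0
    ai_error = 100.0 * ai_wrong / ai_total if ai_total else 0.0
    return human_error, ai_error


# Toy labels: four human-written essays followed by two AI-generated ones.
truth = ["human", "human", "human", "human", "ai", "ai"]
preds = ["human", "human", "human", "ai", "ai", "human"]
print(detection_errors(truth, preds))
```

Keeping the two errors separate matters because the classes are very unequal in size, so a single overall accuracy figure would be dominated by the human-written essays.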

What did we find?

Insight #1: On average, human-written essays are misclassified as AI-generated nearly 1.5% of the time. About 15,000 essays (from our 1M dataset) were detected as AI-written. This error is considerably high and could penalize valid submissions.

Insight #2: AI-written essays are misclassified as human-written nearly 48% of the time. This is a large gap: nearly half of AI-generated content goes undetected by current detectors.
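As a rough back-of-the-envelope check, the reported rates can be turned into essay counts. The sketch below uses only figures quoted in this post (1 million human-written essays, about 16,000 AI-generated essays, and error rates of roughly 1.5% and 48%); the helper name is hypothetical, and because the rates are rounded, the resulting counts are approximate.

```python
HUMAN_ESSAYS = 1_000_000  # human-written essays in the dataset
AI_ESSAYS = 16_000        # "about 16,000" AI-generated essays
HUMAN_ERROR_PCT = 1.5     # Insight #1, rounded
AI_ERROR_PCT = 48         # Insight #2, rounded


def misclassified(count, error_pct):
    """Approximate number of essays misclassified at a given error rate."""
    return round(count * error_pct / 100)


human_flagged = misclassified(HUMAN_ESSAYS, HUMAN_ERROR_PCT)
ai_missed = misclassified(AI_ESSAYS, AI_ERROR_PCT)
print(human_flagged, ai_missed)
```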

Insight #3: There is no consistent winner across both AI-generated and human-written text. Perplexity performs best at recognizing human-written text but fares poorly in detecting AI-generated text. GLTR provides the lowest error in detecting AI-generated text.

Insight #4: The efficacy of detection techniques can be influenced by the language model used to generate the text. Text generated by ChatGPT or GPT-3 is generally easier to detect than text generated by lesser-known models such as OPT and Bloom, which may be less familiar outside of the AI community.
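One way to surface this kind of effect is to break the error on AI-generated essays down by the model that produced each essay. The sketch below shows that grouping; the function name, the placeholder generator names ("model_a", "model_b"), and the toy records are hypothetical and do not reflect results from our study.

```python
from collections import defaultdict


def miss_rate_by_generator(records):
    """Percent of AI-generated essays classified as "human", per generator.

    records: iterable of (generator_name, predicted_label) pairs, where every
    essay is AI-generated and predicted_label is "human" or "ai".
    """
    totals = defaultdict(int)
    misses = defaultdict(int)
    for generator, predicted in records:
        totals[generator] += 1
        if predicted == "human":
            misses[generator] += 1
    return {name: 100.0 * misses[name] / totals[name] for name in totals}


# Toy records with placeholder generator names (not real results).
toy_records = [
    ("model_a", "ai"), ("model_a", "ai"), ("model_a", "human"), ("model_a", "ai"),
    ("model_b", "human"), ("model_b", "human"), ("model_b", "ai"), ("model_b", "human"),
]
print(miss_rate_by_generator(toy_records))
```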

The final verdict is that current techniques are struggling to match the generative capabilities of large language models (LLMs), and sustained research efforts are needed. Our findings align with OpenAI's recent announcement, which reported an average error rate of 74% in detecting AI-generated text, compared to 9% for human-written text. OpenAI's larger corpus and more varied content are expected to contribute to its higher error rates.

This blog post is the first in a series on this topic, and we plan to publish additional insights from our study in a scientific paper on arXiv. Using our large corpus of human-written text to train detectors will be one of our ongoing efforts. We also plan to explore open questions: for example, what happens when humans and AI collaborate to write a piece of text, or when text is paraphrased using tools such as Grammarly?

For more insights like these, visit the SHL Labs page.

Kuldeep Yadav is Senior Vice President and leads the SHL Labs and AI teams. His team works to implement AI-backed innovations in SHL's platforms, assessments, and products to help customers hire the best and most diverse candidates. He holds a PhD in computer science and previously worked at Xerox Research and Microsoft Research Labs.

Yash Vadi is a research intern at SHL Labs with a focus on NLP and deep learning. He has practical experience conducting NLP research and is always seeking new opportunities to apply his skills to real-world problems. He strives to keep abreast of the latest advancements in these fields.

Kuldeep Yadav

Yash Vadi
