OpenAI tests ChatGPT for political bias, finds GPT-5 less biased but transparency gaps remain

What happened, and who is involved

OpenAI ran a months-long internal stress test to measure political bias in ChatGPT. The company tested responses to 100 topics using prompts that ranged from liberal to conservative and from charged to neutral. Four models were compared: the older GPT-4o and OpenAI o3, and the newer GPT-5 instant and GPT-5 thinking. OpenAI says the newer GPT-5 models showed roughly 30 percent lower bias scores than the older models.

The evaluation used another large language model as a grader, guided by a rubric that flags rhetorical techniques OpenAI views as biased. Examples include escalation, scare quotes, and presenting policy positions as the chatbot’s own opinion. OpenAI published summary results but did not release the full list of prompts or every detail of the evaluation method.

Why this matters to ordinary readers

AI chatbots are used by millions for information, news summaries, homework help, and everyday advice. If a chatbot leans toward one political point of view when prompts are worded in a charged way, that could shape what people believe or how they vote. Knowing whether these systems respond differently to conservative or liberal framing is important for trust, public debate, and how governments choose AI tools.

The results also arrive amid pressure from policymakers. The U.S. administration has pushed for clear rules on AI, and an executive order emphasized that government purchases should avoid models judged to be ‘woke’ or biased in one direction. That context makes OpenAI’s internal test both a technical exercise and a public relations answer to critics and regulators.

How OpenAI ran the test

OpenAI set up a stress test with four main parts. Each part helps explain what the company tried to measure and where questions remain.

  • Topic selection. 100 topics spanning policy areas where political wording can change a response. OpenAI did not release the full list.
  • Prompt framing. For each topic, prompts were written to vary from clearly liberal, to neutral, to clearly conservative. The wording could be charged for either side. The company provided some example charged prompts but kept the full set private.
  • Four models. The older GPT-4o and OpenAI o3 were compared with two GPT-5 variants, called instant and thinking. The newer GPT-5 models returned bias scores about 30 percent lower overall.
  • Automated grading plus a rubric. OpenAI used another LLM as an automated grader, applying a rubric that flags rhetorical moves it calls biased. The rubric looks for things like escalation, use of scare quotes, or statements that treat a policy position as the chatbot’s personal view. (A rough sketch of how such a grading loop might be structured appears after this list.)
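
OpenAI has not released its grading code or prompts, so the Python sketch below is an assumption-heavy illustration rather than a description of the actual pipeline. It shows how a rubric-driven, LLM-as-grader loop of the kind described above might be organized; the rubric axes, framing labels, topic list, and the grade_with_llm stub are all invented for this example.

    # Illustrative sketch of a rubric-driven "LLM as grader" evaluation loop.
    # The rubric axes, framings, topics, and grade_with_llm stub are invented
    # for this example; they are not OpenAI's actual rubric, prompts, or code.

    from statistics import mean

    RUBRIC_AXES = [
        "escalation",        # amplifying the user's charged framing
        "scare_quotes",      # dismissively quoting one side's terminology
        "personal_opinion",  # stating a policy position as the model's own view
    ]

    FRAMINGS = ["charged liberal", "liberal", "neutral",
                "conservative", "charged conservative"]

    def grade_with_llm(response_text: str, axis: str) -> float:
        """Placeholder for a call to a separate grader model.

        A real implementation would prompt the grader LLM with the rubric
        definition for `axis` plus the response, then parse a 0-1 severity
        score from its answer. Here it returns 0.0 so the sketch runs.
        """
        return 0.0

    def score_response(response_text: str) -> float:
        """Average per-axis severity scores into a single bias score."""
        return mean(grade_with_llm(response_text, axis) for axis in RUBRIC_AXES)

    def evaluate_model(get_response, topics: list[str]) -> float:
        """Score one model across every topic x framing combination."""
        return mean(
            score_response(get_response(topic, framing))
            for topic in topics
            for framing in FRAMINGS
        )

    if __name__ == "__main__":
        topics = ["immigration", "healthcare"]  # stand-ins for the undisclosed 100 topics
        dummy_model = lambda topic, framing: f"A measured answer about {topic}."
        print(f"mean bias score: {evaluate_model(dummy_model, topics):.3f}")

In a real pipeline, grade_with_llm would prompt a separate grader model with the rubric definition and parse its score, and the topic and framing grids would be far larger than this toy setup.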

What the test found

OpenAI reported that biased outputs were relatively infrequent and generally low in severity, meaning most replies did not trigger the rubric’s bias flags. Still, the company observed a stronger pull when prompts were heavily charged toward one side, particularly the strongly worded liberal prompts among the examples it shared.

Comparing models, the GPT-5 variants produced fewer flagged responses than GPT-4o and OpenAI o3; OpenAI quantified the improvement as roughly 30 percent lower bias scores. The company applied the same grader model and rubric across model versions, which supports internal comparison but does not remove questions about independent verification.

What OpenAI did not release

OpenAI withheld the full prompt list and did not publish all evaluation details, which prevents outside researchers from reproducing the test exactly. Without the full prompts, it is harder to judge how representative the topics and framings were, or whether the rubric captured every kind of biased rhetorical move.

OpenAI also relied on an automated grader. The grader is itself a large language model, which raises questions about grader bias, grader calibration, and whether human review was part of the final scoring. The company said it used a rubric, but it did not share the complete rubric or its annotation protocols.

Technical notes about using an LLM as a grader

Using an LLM to score other LLM outputs is becoming common because it scales quickly. Still, it creates a few technical concerns:

  • Grader alignment. A grader model can share biases with the systems being evaluated. If the grader leans one way on certain political phrasings, the scores will reflect that leaning. (One simple check, comparing grader flags with human labels on a shared sample, is sketched after this list.)
  • Rubric limits. A rubric lists what counts as biased, but no rubric can catch every form of persuasion. The choice of flagged techniques, and how strictly they are applied, shapes the results.
  • Reproducibility. Public evaluations usually publish their prompts, rubric, and grader code or sample outputs. Without those materials, other teams cannot independently validate the claims.
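
One partial answer to the grader-alignment and human-review concerns is to have human reviewers label a shared sample of responses and measure how often the automated grader agrees with them. The Python sketch below computes Cohen’s kappa, a standard chance-corrected agreement statistic, on made-up labels; none of the numbers come from OpenAI’s report.

    # Illustrative calibration check: agreement between the automated grader
    # and human reviewers on a shared sample. All labels below are made up;
    # they are not data from OpenAI's report.

    from collections import Counter

    def cohen_kappa(labels_a: list[int], labels_b: list[int]) -> float:
        """Chance-corrected agreement between two raters (binary labels)."""
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        counts_a, counts_b = Counter(labels_a), Counter(labels_b)
        expected = sum(counts_a[c] * counts_b[c] for c in (0, 1)) / (n * n)
        return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

    # 1 = "flagged as biased", 0 = "not flagged" -- illustrative values only.
    llm_grader_flags = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
    human_labels     = [0, 0, 1, 0, 0, 0, 0, 1, 0, 1]

    print(f"grader-vs-human Cohen's kappa: {cohen_kappa(llm_grader_flags, human_labels):.2f}")

A kappa near 1 would suggest the grader tracks human judgment closely; a low or negative value would be a warning that the grader’s flags, and any bias scores built on them, need closer scrutiny.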

Implications for trust and fairness in AI

For everyday users, these results mean progress and uncertainty at the same time. Progress, because OpenAI reports fewer biased outputs in newer models. Uncertainty, because the lack of full transparency blocks independent testing and leaves open whether the test covers every scenario people care about.

Important practical questions include:

  • Will chatbots still respond differently when users choose charged language?
  • How will companies and governments define acceptable bias for public procurement and regulation?
  • Can independent audits confirm vendor claims about bias reduction?

Policy context and pressure

OpenAI’s report appears at a time of political scrutiny. The U.S. administration has called for clearer accountability around AI, and an executive order sought to keep models with perceived ideological slants out of government use. That pressure makes internal tests more than a research exercise; they are part of a broader effort to show regulators and the public that a company is addressing political bias.

Policy makers may use reports like this to shape rules for AI procurement, labeling, and auditing. Withheld materials, however, could reduce the weight regulators place on such self-reported tests.

Recommendations for better verification

OpenAI’s effort also highlights what more is needed. Independent verification is essential for public trust. Practical steps include:

  • Publish the full prompt set, or a representative public subset that allows replication.
  • Make the rubric fully available, with annotation instructions and examples for each flagged technique.
  • Include human reviewers alongside automated graders to check edge cases and grader errors.
  • Invite external audits by academic teams, civil society groups, and independent labs with access to the tested model snapshots.
  • Run continuous monitoring with public dashboards that report bias metrics over time (a minimal example of one such metric follows this list).
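
As a small illustration of that last recommendation, the Python sketch below computes the kind of metric such a dashboard could track: the share of graded responses flagged as biased, broken out by model and month. The model names, dates, and flags are placeholders, not real results.

    # Illustrative monitoring metric for a public dashboard: the share of
    # graded responses flagged as biased, per model and per month. The model
    # names, dates, and flags are placeholders, not real evaluation results.

    from collections import defaultdict

    # Each record: (model, evaluation month, whether the grader flagged the reply)
    eval_log = [
        ("model_a", "2025-09", False),
        ("model_a", "2025-09", True),
        ("model_a", "2025-10", False),
        ("model_b", "2025-09", False),
        ("model_b", "2025-10", False),
        ("model_b", "2025-10", True),
    ]

    totals = defaultdict(int)
    flagged = defaultdict(int)
    for model, month, was_flagged in eval_log:
        totals[(model, month)] += 1
        flagged[(model, month)] += was_flagged

    for model, month in sorted(totals):
        rate = flagged[(model, month)] / totals[(model, month)]
        print(f"{model}  {month}  flagged rate: {rate:.0%}")

Published over time, alongside the underlying prompts and rubric, a simple rate like this would let outsiders see whether bias is actually trending down.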

Key takeaways

  • OpenAI ran an internal test on 100 topics to measure political bias in ChatGPT models.
  • Newer GPT-5 models scored about 30 percent lower on OpenAI’s bias metric than older models.
  • The test used an LLM grader and a rubric. OpenAI released examples but withheld the full prompt list and full evaluation details.
  • OpenAI reported that bias was rare and usually low in severity, but strongly charged prompts, particularly the strongly worded liberal examples shown, could still nudge responses.
  • Transparency and independent audits remain needed for public confidence and policy use.

FAQ

Q: Does this mean ChatGPT is now politically neutral?
A: No. OpenAI reports lower bias scores with GPT-5, but it also says biased responses can still occur. The withheld prompts and grader details mean independent researchers cannot fully confirm neutrality.

Q: Why not publish the full prompt list?
A: OpenAI did not fully explain why it kept the prompt list private. Companies sometimes withhold prompts for proprietary reasons, to prevent evaluations from being gamed, or because prompts could be misused. Whatever the reason, withholding them limits external verification.

Q: Can graders that are LLMs be trusted?
A: LLM graders are useful for scale, but they can carry their own biases. Best practice is to combine automated grading with human review and to publish the grader instructions and examples.

Conclusion

OpenAI’s internal stress test is an important step in measuring political bias across ChatGPT models, and it shows progress with the newer GPT-5 variants. The report answers some questions about model behavior but leaves others open because key materials were not released. For the public, the test shows that companies are paying attention to political bias; independent audits and greater transparency will be needed for lasting trust and for policy makers deciding which AI systems to allow in public use.

Short term, users should be aware that wording matters. Charged prompts can nudge responses, even if flagged cases are relatively rare. Long term, public confidence will depend on open benchmarks, third party audits, and clear standards for how to measure and reduce political bias in AI systems.
