11/12/25
Table of Contents
Undocumented changes
ChatGPT-5 / OpenAI
Start Time: 06:42 AM AEST
Yesterday we noticed significant changes: ChatGPT 5.1 began interpreting every prompt as a rigid command scheme.
Given multiple reports that 5.2 would be released this week, we needed to pre-empt the parts of our various workflows that would be impacted.
A backend model rev was rolled through a few hours ago that impacts:
Reasoning depth
Tool-use timing
Safety routing
Further evidenced in posts from other users.
The post was deleted; we were notified at 12:34 PM by a user who had been seeking that information.
The post is reproduced below, and we have stopped updating external-facing documents.
1. Behaviour Change: Literalism spike
How to Verify: Ask “Summarise this + list risks.” It will either do only one part or ask for formatting instructions.
Impact: CHAT gives partial outputs; API multi-step instructions break; AGENTS loop or stall.
Expected Duration: 6–24 hours.
Reasoning: Triggered by safety/routing realignment; stabilises once new weights settle.
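The literalism check above can be scripted as a local heuristic run over the model's raw reply. This is a sketch of our own (function names and thresholds are illustrative, and it assumes replies arrive as plain text with markdown-style lists), not vendor tooling:

```python
import re

def check_compound_reply(reply: str) -> dict:
    """Heuristic: did a 'summarise this + list risks' reply address BOTH parts?

    A summary is assumed present if there is at least one non-list paragraph;
    risks are assumed present if 'risk' appears and a bullet/numbered list
    exists. Crude on purpose -- it only flags obviously partial outputs.
    """
    has_risks = bool(re.search(r"risk", reply, re.IGNORECASE)) and bool(
        re.search(r"^\s*(?:[-*]|\d+[.)])\s+", reply, re.MULTILINE)
    )
    # Non-list, non-empty lines are treated as summary prose.
    paragraphs = [
        p for p in reply.split("\n")
        if p.strip() and not re.match(r"^\s*(?:[-*]|\d+[.)])", p)
    ]
    has_summary = bool(paragraphs)
    return {"summary": has_summary, "risks": has_risks,
            "partial": not (has_summary and has_risks)}
```

Run the same prompt several times and tally `partial`; during the instability window the rate should be visibly elevated.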
2. Behaviour Change: Context shortening
How to Verify: Give three facts and ask a question requiring all three; it will drop or distort one.
Impact: CHAT long threads wobble; API loses detail; AGENTS regress or oversimplify.
Expected Duration: 12–48 hours.
Reasoning: Summarisation heuristics recalibrate slowly with live user patterns.
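The three-fact check can likewise be automated with naive keyword matching (a sketch under our own assumptions; a real harness would use fuzzy or semantic matching, and the facts here are illustrative):

```python
def missing_facts(facts: list[str], answer: str) -> list[str]:
    """Return the seeded facts that do not appear in the model's answer.

    Pure case-insensitive substring matching -- each 'fact' should be a
    distinctive keyword, not a full sentence.
    """
    answer_lower = answer.lower()
    return [f for f in facts if f.lower() not in answer_lower]

# Example: seed three facts, then ask a question that needs all three.
facts = ["blue", "42", "Friday"]
answer = "The box is blue and holds 42 items."
dropped = missing_facts(facts, answer)  # one fact was dropped
```

Anything returned by `missing_facts` is a fact the model dropped or distorted, which is the wobble described above.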
3. Behaviour Change: Tool-routing threshold shift
How to Verify: Ask a borderline tool-worthy question (web searches, connectors, etc.); tool calls will be inconsistent, firing too early or not at all.
Impact: CHAT shows weird tool availability; API gets unexpected tool calls; AGENTS fragment tasks.
Expected Duration: 12–36 hours.
Reasoning: Tool gating needs fresh interaction data and global usage to stabilise.
4. Behaviour Change: Reduced implicit navigation
How to Verify: Ask “open the last doc”; it will refuse or demand explicit identifiers.
Impact: CHAT/API now require exact references; AGENTS break on doc workflows; CONNECTORS show more access refusals.
Expected Duration: 24–72 hours.
Reasoning: Caused by tightened connector-scoping + safety constraints; these relax slowly.
5. Behaviour Change: Safety false positives
How to Verify: Ask for manipulation/deception analysis. The model may refuse or hedge without giving a reason.
Impact: CHAT/API inconsistent; AGENTS enter decline loops and stall.
Expected Duration: 12–72 hours.
Reasoning: Safety embedding tightened; loosens only after overrides propagate + usage patterns recalibrate.
6. Behaviour Change: Multi-step planning instability
How to Verify: Ask for a 5-step breakdown; watch for missing or merged middle steps.
Impact: CHAT outputs shallow; API automations break; AGENTS produce incomplete tasks.
Expected Duration: 6–24 hours.
Reasoning: Downstream of literalism + compression; planning returns once those stabilise.
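Missing or merged middle steps can be caught mechanically by extracting the step numbers from the reply and comparing against the requested count (a sketch of our own; it assumes the model answers with a `1.` / `2.` style numbered list):

```python
import re

def numbered_steps(reply: str) -> list[int]:
    """Extract the leading numbers of a numbered list from a reply."""
    return [int(m) for m in re.findall(r"^\s*(\d+)[.)]\s+", reply, re.MULTILINE)]

def plan_is_complete(reply: str, expected: int) -> bool:
    """True only if the reply contains exactly steps 1..expected, in order.

    A dropped step (1, 2, 4, 5) or a merged one (fewer items than asked
    for) both fail this check.
    """
    return numbered_steps(reply) == list(range(1, expected + 1))
```

Ask for a 5-step breakdown and run `plan_is_complete(reply, 5)` on each attempt; failures during the window are the instability described above.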
7. Behaviour Change: Latency/cadence shift
How to Verify: Ask a complex question; expect hesitation before the first token.
Impact: Mostly UX; API tight-loop processes feel slower.
Expected Duration: <12 hours.
Reasoning: Cache warming and routing churn; usually clears quickly.
8. Behaviour Change: Tag / mode-signal sensitivity
How to Verify: Send a mode tag (e.g., analysis, audit); model may ignore it or misinterpret.
Impact: CHAT with custom protocols suffers most; API lightly affected; AGENTS variable.
Expected Duration: 12–48 hours.
Reasoning: Depends on how quickly the model re-learns your signalling patterns; consistent use accelerates recovery.
9. Behaviour Change: Memory recall / memory writing wobble
How to Verify: Ask it to restate a stored memory or save a new one; expect hesitation or misclassification.
Impact: CHAT recall inconsistent; API/AGENTS degrade if workflows depend on memory alignment.
Expected Duration: 12–48 hours.
Reasoning: Temporary mismatch between updated routing heuristics and long-form reasoning; system over-prunes until gating stabilises with real usage.
UPDATE 1:
1. Projects – SEVERITY: HIGH
What breaks: multi-step reasoning, file context, tool routing, code/test workflows
Why: dependent on stable planning + consistent heuristics
Duration: 12–48h
2. Custom GPTs – SEVERITY: MED–HIGH
What breaks: instruction following, connector behaviour, persona stability, multi-step tasks
Why: literalism + compression distort the system prompt
Duration: 12–36h
3. Agents – SEVERITY: EXTREME
What breaks: planning, decomposition, tool selection, completion logic
Why: autonomous chains rely on the most unstable parts of the model
Duration: 24–48h
Other similar reports:
https://www.reddit.com/r/ChatGPTPro/comments/1pio6uw/is_it_52_under_the_hood/
https://www.reddit.com/r/ChatGPTPro/comments/1pj9wxn/how_do_you_handle_persistent_context_across/
https://www.reddit.com/r/singularity/comments/1pjdec0/why_does_chatgpt_say_he_cant_read_any_tables/
Gemini Forensics
Gemini 3 / Google
Start Time: 01:34PM AEST
Created a new account for Valehart purposes. Running the calibration tool at 01:38PM
Extraction poses an issue: Gemini's format is unusual in that a copy-and-paste does not separate user and AI messages. The chat contained significant sensitive topics, so we wanted to isolate where the issue began. We uncovered a larger issue and have reported it to local Google security leads and senior teams.
We attempted to reach out to Simon Kriss, who is labelled as “Australia's leading business AI expert, and Co‑Founder of Sovereign AI”. While we did not divulge details or vendor names, we highlighted privacy concerns and an existing report lodged with the OAIC, particularly as a push was announced this morning. Source.
We were blocked, and we remain concerned about the hype around AI in Australia and the cost it has had, and will have, for its citizens.

