16/12/25


Vendor Reporting of Security Incidents

16/12/2025 07:19AM

Only outcomes count. Precursors don’t.

Vendors consistently lack dedicated triage teams capable of handling systemic faults, particularly in AI systems. We have spent significant unpaid time attempting to report major bugs with real security and compliance implications. The problem is not discovery quality or reproducibility. The problem is systemic intake and response failure, observed repeatedly across vendors.

Standard reporting process followed:

  • Incidents documented and preserved using DFIR standards (chain of custody, evidence integrity).

  • Reports submitted via official bug bounty / vulnerability programs.

  • Assets consistently provided: PoC, issue description, reproduction steps, impact analysis, videos of live replication.

Common failure pattern:

  1. Ticket closed or marked “not reproducible.”

  2. Reporter required to invest additional unpaid time explaining the issue to the teams responsible for triage, often correcting flawed testing or misreadings of the supplied material.

  3. Repeated back-and-forth despite complete evidence.

  4. Ticket closed again.

  5. Underlying issue remains active.

These are not minor flaws. Reports explicitly map issues to security, compliance, and governance breaches, often with broader or global impact considerations. All reports are reviewed internally multiple times to avoid overstatement.

Observed outcomes:

  • Vendor A: Issue active for 3+ months, now escalated to OAIC.

  • Vendor B: Issue active for 7+ days; progress only due to an individual outside AI/VRP taking responsibility.

  • Average vendor response-to-action time: ~48.5 days.

Conclusion:
“One-size-fits-all” vulnerability intake models do not work for AI systems. They are structurally incapable of handling early, systemic, cross-boundary risks.

AI safety failures follow the pattern of other high-risk systems with delayed harm: prevention signals are ignored until outcomes occur. Vendors publicly commit to safety, but operationally defer responsibility until damage is visible.

Next step (intent):
This pattern is the basis for developing a public escalation and transparency framework documenting:

  • reporting steps taken,

  • timelines,

  • escalation thresholds,

  • and responsible authorities.

The goal is not attention or confrontation. The goal is to make it explicit when safety warnings were raised, ignored, or deferred — and to remove “how would we know” as a post-incident excuse.
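As a working sketch, each disclosure in that framework could be captured as a structured record along the following lines. This is our own illustrative schema in Python; the field names, thresholds and values are examples only, not a reference to any specific vendor or case.

```python
# Illustrative shape of a single disclosure record in the planned framework.
# Field names and thresholds are examples only.
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass
class DisclosureRecord:
    vendor: str                      # anonymised, e.g. "Vendor A"
    reported: date                   # first report via the official channel
    evidence: list[str]              # PoC, repro steps, impact analysis, video
    escalation_threshold_days: int   # days of inaction before escalating
    escalated_to: str | None = None  # e.g. a regulator, once the threshold passes
    status: str = "open"
    timeline: list[tuple[date, str]] = field(default_factory=list)

    def days_open(self, today: date) -> int:
        return (today - self.reported).days

    def should_escalate(self, today: date) -> bool:
        # Escalation becomes mechanical, not discretionary:
        # threshold passed and still no resolution.
        return self.status == "open" and self.days_open(today) >= self.escalation_threshold_days
```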

Rewritten by AI because the original was not suitable for public sharing.

AI Audio + LLM

16/12/2025 11:20AM

What people call “accent” differences are mostly register compression. Spoken language runs in lossy mode: phonemes drop, function words collapse, grammar markers disappear. Humans resolve it via context; LLMs and ASR models trained on clean text systematically misparse it unless trained on actual speech, not transcripts pretending speech is text.

Many people overlook the impact of language on LLMs, particularly how audio is handled and what that does to reasoning capabilities:

* Reasoning variance between text-to-speech and voice modes.
* Linguistic rules. We switch languages mid-conversation and incorporate slang during our tests.
With languages, here are some considerations:
- Learned French is typically formal, a written register. Spoken French is compressed and elided; well, as the saying goes, "why say many words when few do trick?"
Example: "I don't know" | Learned/formal: Je ne sais pas
Actual:
• Chais pas is fast and informal, heard in Paris, on social media, etc.
• J'sais pas is slower and clearer, still informal, but used commonly across all regions, city and rural.
- AU: we have some odd English usages, e.g. "thongs" actually refers to flip flops.
- Arabic: certain tones between countries help identify origin, and some words are not slang yet still carry different meanings. "What?", for example:
Egypt: إيه؟ (ēh?)
Saudi / Gulf: إيش؟ (ēsh?)
Levant: شو؟ (shū?)

Or even the word "key" (same written word, different vowels):
Egypt: مفتاح (meftāḥ)
Saudi / Gulf: مفتاح (miftāḥ)
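When observations like these get turned into test cases, the structure is simple: the same intent expressed in a formal register and in compressed or regional variants, checked against what the pipeline under test actually returns. A minimal sketch in Python; the field names and the `transcribe` callable are placeholders for whatever ASR/LLM stack is being evaluated.

```python
# Register / dialect test cases of the kind described above.
# `transcribe` is a placeholder for the ASR + LLM pipeline under test.
from dataclasses import dataclass


@dataclass
class RegisterCase:
    language: str
    intent: str        # what the speaker means
    formal: str        # textbook / written register
    spoken: list[str]  # compressed or regional variants


CASES = [
    RegisterCase("fr", "I don't know", "Je ne sais pas", ["Chais pas", "J'sais pas"]),
    RegisterCase("en-AU", "flip flops", "flip flops", ["thongs"]),
    RegisterCase("ar", "what?", "ماذا؟", ["إيه؟", "إيش؟", "شو؟"]),
]


def evaluate(transcribe) -> dict[str, int]:
    """Count how often the compressed variants resolve to the same intent
    as the formal form once they pass through the pipeline."""
    hits, total = 0, 0
    for case in CASES:
        expected = transcribe(case.formal)
        for variant in case.spoken:
            total += 1
            if transcribe(variant) == expected:
                hits += 1
    return {"matched": hits, "total": total}
```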

Here is an excerpt from a reply to a user we helped a few months back, where we explained how audio is processed.

Hey there. It cannot actually hear your audio, but to give you further clarification:

When you upload an audio file or speak via the audio option, a GPT cannot perceive tone, emotion or timbre. What it does is process data extracted from the waveform, not the sound itself.

The system uses a spectral or feature analyser behind the scenes to identify things like pitch, tempo, energy levels and speech segments. It then isolates and classifies these into vocals vs background music based on those patterns.

Now, let's talk speech-to-text. This almost always goes through a transcription pass, usually through a model like Whisper. It turns your speech into text, and the LLM (GPT in your case) processes that text plus any structured metadata, for example: music: yes, voice clarity: low.

With uploads and speech-to-text, a confidence score is provided for each recognised word or segment. The exact numbers below are illustrative and certainly vary, but if we treated a score of 0.80 as the legibility cut-off:

  • 0.98 confidence = clean, clear speech.

  • 0.65 = background noise or accent interference.

  • 0.30 = the model’s basically guessing.

When that information is available, a model reads those scores as metadata and makes inferences and suggestions.

The reason your GPT suggested cleaning up your vocals is that it likely detected background sound, poor vocal separation or other issues in the signal pattern. It would have analysed the metadata and inferred low speech-intelligibility metrics or misrecognition patterns.

Honestly? Pretty crazy that all this happens in a couple of seconds!

In short: No ears, only maths. No opinions, just probability. No music, only geometry.
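For the technically inclined, here is a minimal sketch of that pipeline. It assumes the open-source `librosa` and `openai-whisper` packages; the metadata fields, the confidence proxy and the 0.80 threshold are illustrative, not how any vendor actually wires this up.

```python
# Sketch of the "audio -> features -> transcript -> text model" flow described above.
# Assumes `librosa` and `openai-whisper`; names and thresholds are illustrative.
import math

import librosa
import whisper


def analyse_audio(path: str) -> dict:
    """Coarse signal features: the model never hears the file,
    it only ever sees numbers like these."""
    y, sr = librosa.load(path, sr=16000)
    energy = float(librosa.feature.rms(y=y).mean())                  # rough loudness
    noisiness = float(librosa.feature.zero_crossing_rate(y).mean())  # crude noise proxy
    return {"duration_s": len(y) / sr, "energy": energy, "noisiness": noisiness}


def transcribe_with_confidence(path: str) -> dict:
    """One Whisper transcription pass, with a rough per-segment confidence
    (exp of the average log-probability)."""
    model = whisper.load_model("base")
    result = model.transcribe(path)
    segments = [
        {"text": seg["text"].strip(), "confidence": math.exp(seg["avg_logprob"])}
        for seg in result["segments"]
    ]
    return {"text": result["text"], "segments": segments}


def build_llm_input(path: str, threshold: float = 0.80) -> dict:
    """What the text model actually receives: a transcript plus structured
    metadata, with low-confidence segments flagged rather than trusted."""
    transcript = transcribe_with_confidence(path)
    flagged = [s for s in transcript["segments"] if s["confidence"] < threshold]
    return {
        "transcript": transcript["text"],
        "metadata": analyse_audio(path),
        "low_confidence_segments": flagged,
    }
```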

Project Calm

Date: 16/12/25

During recent testing, we observed that separating these roles (keeping reasoning read-only and constraining execution to tightly scoped actions) dramatically changes the risk profile. Not because the model is "smarter", but because exposure is reduced and behaviour becomes traceable.

This matters in a moment where vendors are pushing agent access into increasingly sensitive domains (infrastructure, operations, governance) while guidance remains vague. Capability alone isn’t the issue; unbounded capability is.

Preparedness here doesn’t mean racing to automate more. It means designing systems where:

  • reasoning cannot silently mutate state,

  • permissions are explicit and revocable,

  • actions are logged, not assumed,

  • and humans remain the authority, not a fallback.

The calm before the storm isn’t inactivity.
It’s deciding where intelligence is allowed to act, and where it absolutely isn’t.

Comparative analysis of AI systems is often reduced to performance metrics such as speed, accuracy, or fluency. These metrics are insufficient for evaluating suitability in operational or high-risk contexts.

Observed differences between major AI platforms primarily emerge in boundary handling, not reasoning quality. When identical tasks are executed under similar constraints, platforms diverge in how they interpret permissions, enforce scope, and surface failure states.

Key comparative dimensions observed:

  • Permission enforcement consistency

  • Scope adherence versus silent task expansion

  • Transparency of tool use and intermediate steps

  • Quality and completeness of audit logs

  • Failure signalling and escalation behaviour
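One way to keep these dimensions comparable across runs is to record them per task in a fixed schema. The sketch below is our own illustrative structure in Python; nothing in it maps to a specific platform's output.

```python
# Illustrative per-run record for the dimensions listed above.
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    platform: str                # anonymised label, e.g. "Platform 1"
    task_id: str
    completed: bool              # produced the expected output
    permissions_respected: bool  # stayed within the granted permissions
    scope_expanded: bool         # silently did more than was asked
    tools_disclosed: bool        # surfaced tool use and intermediate steps
    audit_log_complete: bool     # every action traceable after the fact
    failures_signalled: bool     # errors escalated rather than papered over
    notes: list[str] = field(default_factory=list)

    def risk_flags(self) -> list[str]:
        """Completion alone is not the signal; these are."""
        flags = []
        if self.scope_expanded:
            flags.append("silent scope expansion")
        if not self.audit_log_complete:
            flags.append("incomplete audit trail")
        if not self.failures_signalled:
            flags.append("failures not surfaced")
        return flags
```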

In multiple cases, task completion alone was not a reliable indicator of safe operation. Systems that successfully produced expected outputs varied significantly in their ability to explain actions taken, constraints applied, or deviations encountered.

Anonymised comparison is necessary to preserve analytical integrity. Attribution introduces incentives that distort evaluation, including selective disclosure and optimisation for perception rather than risk control.

From an operational perspective, the relevant comparison is not which system performs best, but which system exhibits predictable, auditable, and constrained behaviour under identical task envelopes.

These characteristics define the practical risk envelope of an AI system and should be evaluated independently of model branding or capability claims.
