Ghosts in the Training Data: When Old Breaches Poison New AI
The idea behind Ghosts in the Training Data is simple and uncomfortable. The most dangerous failures in artificial intelligence (AI) over the next few years will not come from exotic new model architectures. They will come from crooked histories baked into the data those models learned from. Old incidents, sloppy labels, and stolen context have quietly become the ground truth behind many defensive and business systems. Leaders who treat AI as a mysterious engine with magical capabilities will miss that. Leaders who treat AI as a reflection of the data supply chain they are accountable for will start to see the ghosts and, more importantly, do something about them.
Most organizations still talk about a breach as if it were a single bad event. There is the day of the compromise, the assessment period, the remediation work, the board briefing, and, in time, the desire to move on. In reality, the incident does not end when you close the ticket. Code snippets that attackers stole, misconfigured templates they discovered, chat logs from your crisis channel, and credential dumps that eventually hit a broker all leave your control. They are copied, scraped, and repackaged into places you will never see. Months or years later, they show up as part of “representative real world data” that someone uses to train a model. What felt like one miserable quarter has quietly turned into part of the baseline that future systems will learn from.
This is not limited to headline-making mega breaches. A small software-as-a-service vendor’s leaked support tickets can easily end up in a generic “enterprise IT” dataset. A series of phishing campaigns against a regional bank can be harvested into a labeled “security incidents” corpus. Public code repositories, some of them forks of previously compromised projects, are pulled into training sets for code generation tools. Over time, the original context disappears. The model does not know which snippet was a rushed hotfix written at three in the morning and which was a carefully reviewed pattern. All it sees are tokens and frequencies. Old hard-coded secrets, brittle authentication flows, and strange logging choices are turned into examples to copy instead of warnings to avoid.
Inside your own organization, a similar process often plays out on a smaller but equally important scale. Historical logs, chat transcripts, email archives, incident timelines, and ticket histories are poured into data lakes because “we might train on this later.” Very little curation happens on the way in. Labeling is inconsistent, telemetry is missing during the most stressful moments, and undocumented exceptions linger for years. When the business finally decides to build or buy AI for security or operations, that lake is treated as a gold mine. In reality, it is a messy swamp full of ghosts from past outages and compromises. Those ghosts are about to be promoted from artifacts of yesterday’s breach to the training data for tomorrow’s models.
Most leaders have heard that modern models are trained on “internet-scale” corpora. What is rarely said out loud is that the internet itself is saturated with the residue of breaches. Stolen code gets reposted in public repositories. Credential dumps are mirrored across dozens of sites. Incident write-ups include more internal detail than anyone originally intended. Data broker products quietly incorporate information that started its life inside compromised systems. When a vendor proudly says their platform is trained on “billions of lines of real world code” or “massive real world security data,” that often means the model has absorbed patterns born in compromise, not just patterns born in careful engineering. Over time, the difference blurs.
You see the impact most clearly when these blended corpora sit behind friendly interfaces. Developer copilots that generate boilerplate authentication flows, key management routines, and logging defaults are drawing on those mixed histories. If insecure idioms appear often enough in the training set, the model will happily recreate them whenever the surrounding context feels similar. Security analysts using A I helpers to draft detection rules or response procedures face the same risk. A model steeped in leaky breach data and outdated playbooks can sound perfectly authoritative while suggesting fragile conditions, weak indicators, or “best practices” that were never truly best. The failure mode is not a dramatic exploit announced on social media. It is the quiet normalization of bad patterns.
Even when vendors make honest efforts to clean their corpora, their focus tends to be narrow. They work hard to scrub obvious categories: personal information, offensive language, explicit content. Almost no one has turned “code that passes unit tests but is security toxic” into a systematic exclusion rule. Almost no one has built robust tooling to find and downrank business logic that bakes in yesterday’s risky access assumptions. That gap is not driven by malice. It exists because security teams and AI platform teams use different language, measure different things, and report to different leaders. Until someone at the executive level starts asking precise questions about data provenance, curation criteria, and the ability to exclude or demote suspect sources, the organization will keep importing other people’s breaches as if they were a neutral representation of how modern systems should behave.
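To make that gap concrete, here is a minimal sketch of what a “security toxic” exclusion and downranking rule could look like during data curation. The red-flag patterns, the halving of sampling weight per flag, and the suspect-source list are illustrative assumptions, not a vetted ruleset.

```python
# A minimal curation sketch, assuming training candidates arrive as plain-text
# code snippets with a source tag. Patterns, weights, and thresholds are
# illustrative only.
import re
from dataclasses import dataclass

# Illustrative red flags: hard-coded secrets, disabled certificate checks,
# weak hashing. A real ruleset would be broader and jointly maintained by
# security and AI platform teams.
RED_FLAGS = {
    "hardcoded_secret": re.compile(r"(?i)(api[_-]?key|password|secret)\s*=\s*['\"][^'\"]{8,}['\"]"),
    "tls_verification_disabled": re.compile(r"verify\s*=\s*False"),
    "weak_hash": re.compile(r"(?i)\b(md5|sha1)\s*\("),
}

@dataclass
class CurationDecision:
    source: str
    flags: list
    weight: float   # sampling weight used when the snippet is kept
    excluded: bool  # drop entirely rather than downrank

def curate_snippet(snippet: str, source: str, suspect_sources: set) -> CurationDecision:
    """Score one candidate snippet and decide whether to exclude or downrank it."""
    flags = [name for name, pattern in RED_FLAGS.items() if pattern.search(snippet)]
    # Exclude outright when a snippet is both flagged and from a suspect source,
    # for example a feed known to include breach-derived material.
    excluded = bool(flags) and source in suspect_sources
    # Otherwise downrank: halve the sampling weight per red flag (illustrative choice).
    weight = 0.0 if excluded else 0.5 ** len(flags)
    return CurationDecision(source=source, flags=flags, weight=weight, excluded=excluded)

if __name__ == "__main__":
    decision = curate_snippet(
        'requests.get(url, verify=False)\napi_key = "sk-live-1234567890"',
        source="public_repo_mirror",
        suspect_sources={"public_repo_mirror"},
    )
    print(decision)
```

Even a crude pass like this has a second benefit: it forces security and AI platform teams to agree, in writing, on what counts as a red flag, which is exactly the shared language the paragraph above says is missing.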
The same pattern shows up on the defensive side, where many teams now buy or build AI for detection and response. Behind the marketing, those systems are often trained on exports from Security Information and Event Management (SIEM) platforms, Endpoint Detection and Response (EDR) tools, ticketing systems, and incident timelines. None of those sources were ever designed as training sets. They are full of gaps and distortions. Logging gets turned down during high load, noisy rules are disabled or tuned without documentation, temporary exceptions become permanent, and incident labels are applied quickly because the team is exhausted. When you feed that history into a model, it digests the whole thing as if it were a coherent story about how attacks unfold and how defenders respond.
The ghosts in that crooked history affect how your defensive models behave. If a particular abuse pattern was repeatedly whitelisted for business reasons, a model trained on that environment may implicitly learn that the pattern is normal behavior. If critical telemetry was missing during the worst incidents, the model’s sense of what “high severity” looks like will be anchored in partial evidence. Security Orchestration, Automation, and Response (SOAR) playbooks and response tickets that understate impact or misclassify attacker objectives become labels the model uses to generalize. Over time, it learns a skewed baseline. What your tools failed to see, your models will struggle to imagine. That is a dangerous place to be when attackers are constantly searching for the gaps at the edge of your visibility.
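One way to surface those blind spots before they harden into labels is to compare telemetry volume during each labeled incident against a quiet baseline window, and to review or downweight incidents where coverage collapsed. The sketch below assumes incident exports already carry per-source event counts; the field names, the example incident, and the 50 percent threshold are illustrative assumptions.

```python
# A minimal telemetry-gap check, assuming per-source event counts are available
# for both the incident window and a comparable quiet window. Names and
# thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class Incident:
    incident_id: str
    severity_label: str              # the label a future model would learn from
    during_counts: dict[str, int]    # events per telemetry source during the incident
    baseline_counts: dict[str, int]  # events per source in a comparable quiet window

def telemetry_gaps(incident: Incident, min_ratio: float = 0.5) -> list[str]:
    """Return telemetry sources whose volume collapsed during the incident window."""
    gaps = []
    for source, baseline in incident.baseline_counts.items():
        during = incident.during_counts.get(source, 0)
        if baseline > 0 and during / baseline < min_ratio:
            gaps.append(source)
    return gaps

if __name__ == "__main__":
    inc = Incident(
        incident_id="IR-2031",  # hypothetical example
        severity_label="medium",
        during_counts={"edr": 120, "dns": 0, "proxy": 40},
        baseline_counts={"edr": 100, "dns": 5000, "proxy": 90},
    )
    # dns and proxy collapsed during the window: the "medium" label rests on
    # partial evidence and should be reviewed before it becomes training data.
    print(telemetry_gaps(inc))
```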
Leaders often feel the effect of these ghosts without naming them. A detection copilot that grew up on crooked histories may confidently de-prioritize early indicators because they do not look like the partial patterns it was taught to care about. It may overtrust a legacy allow list, or flood analysts with alerts in areas where there is plenty of clean historical data but little actual risk. From the outside, this does not look like a dramatic machine learning failure. It feels like a familiar story: the security operations center still misses important things, and the new AI did not meaningfully change the false positive rate. The difference is that now those distortions are built into a system that can scale its influence much faster than you can retrain human analysts. Unless leaders treat defensive training data as something to audit, curate, and sometimes amputate, they will keep funding smarter engines running on bent rails.
Leaders rarely see the full chain between “someone else’s breach” and “our decision engine” because it runs through contracts, API integrations, and dashboards rather than through incident reports. Procurement teams buy feeds on the promise of better conversion or lower fraud loss. Data science teams focus on coverage, signal strength, and lift. Security and privacy teams are often invited in late, if at all, and even then, they are given high-level marketing descriptions of source data instead of concrete lineage. Unless leaders insist on understanding where the most influential external feeds come from, what rights attach to them, and how they can be switched off, they will end up with business AI that quietly monetizes stolen context while they stand in front of customers and regulators promising responsible data practices.
Once you have mapped those dependencies, practical moves open up. You can ask for a kind of bill of materials for key models: a high-level breakdown of data sources, the relative weight of each category, and any known use of breach-sourced or high-risk feeds. You can work with legal and procurement teams to update contracts with data vendors and AI providers so they include explicit statements about origin, allowed uses, and obligations to notify you when their sources change or are discovered to include compromised material. Inside your own environment, you can apply the same mindset to data lakes and feature stores. Teams responsible for logs, tickets, and historical records can categorize which collections are suitable for training, which require redaction or aggregation, and which should never be promoted without heavy curation.
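Even a coarse bill of materials can start as one structured record per data source. The sketch below is a hypothetical shape for such a record; the field names, categories, and example entries are assumptions that your own data, security, and procurement teams would need to define together.

```python
# A minimal data bill-of-materials sketch for one model. Everything here,
# including source names and risk categories, is an illustrative assumption.
from dataclasses import dataclass, field

@dataclass
class DataSource:
    name: str
    category: str                  # e.g. "public_code", "internal_tickets", "broker_feed"
    weight: float                  # approximate share of training examples
    provenance: str                # "first_party", "licensed", "scraped", "unknown"
    breach_risk: str = "unknown"   # "none_known", "suspected", "confirmed", "unknown"
    training_approved: bool = False

@dataclass
class ModelBOM:
    model_name: str
    sources: list[DataSource] = field(default_factory=list)

    def review_queue(self) -> list[DataSource]:
        """Sources that should block promotion until someone signs off."""
        return [s for s in self.sources
                if not s.training_approved
                or s.provenance == "unknown"
                or s.breach_risk in ("suspected", "confirmed", "unknown")]

bom = ModelBOM(model_name="detection-copilot-v2", sources=[
    DataSource("internal_soc_tickets", "internal_tickets", 0.45, "first_party",
               breach_risk="none_known", training_approved=True),
    DataSource("public_repo_mirror", "public_code", 0.35, "scraped"),
    DataSource("third_party_threat_feed", "broker_feed", 0.20, "licensed",
               breach_risk="suspected", training_approved=True),
])
for source in bom.review_queue():
    print(f"needs review: {source.name} ({source.provenance}, breach_risk={source.breach_risk})")
```

The point of the review check is not precision. It is that an unapproved or unknown-provenance source blocks promotion until a named person has looked at it, which is the accountability the paragraph above is asking for.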
The final element is designing for change instead of pretending training data is fixed forever. Leaders should push for the ability to correct and partially forget when new ghosts are discovered. That might mean downranking specific sources, retraining on relabeled incidents, or carving out particular time windows and feeds that turn out to be badly distorted. It might mean sponsoring red-teaming focused on data artifacts, not only on prompts and model behavior. You can deliberately inject known bad patterns to see how models respond, or simulate the removal of a compromised feed to measure its impact. None of this will make training data perfectly clean. What it will do is turn those ghosts into visible dependencies that can be governed, traded off, and, when necessary, cut loose.
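To make “simulate the removal of a compromised feed” concrete, a small ablation harness can score a model trained with and without that feed. The sketch below assumes training records are tagged by source and that quality can be reduced to a single score; the toy scorer and feed names are illustrative stand-ins for a real training and evaluation loop.

```python
# A minimal feed-ablation sketch. The scoring function, record shape, and feed
# names are illustrative assumptions, not a real pipeline.
from typing import Callable, Iterable

Record = dict  # each training record carries a "source" tag plus its payload

def ablate_feed(records: Iterable[Record], feed: str) -> list[Record]:
    """Return the training set with every record from the named feed removed."""
    return [r for r in records if r.get("source") != feed]

def feed_impact(records: list[Record], feed: str,
                train_and_score: Callable[[list[Record]], float]) -> float:
    """Score the model with and without the feed; a positive delta means the feed helps."""
    baseline = train_and_score(records)
    without = train_and_score(ablate_feed(records, feed))
    return baseline - without

if __name__ == "__main__":
    # Toy stand-in for a real train-and-evaluate loop: the fraction of records
    # that carry any label at all. Replace with your actual pipeline.
    def toy_score(recs: list[Record]) -> float:
        return sum(1 for r in recs if r.get("label")) / max(len(recs), 1)

    training_records = [
        {"source": "internal_soc_tickets", "label": "true_positive"},
        {"source": "third_party_threat_feed", "label": None},
        {"source": "internal_soc_tickets", "label": "false_positive"},
    ]
    delta = feed_impact(training_records, "third_party_threat_feed", toy_score)
    print(f"impact of feed on score: {delta:+.3f}")
```

If a suspect feed can be dropped for a small, measured cost, removal becomes a governable decision rather than an argument about sunk investment.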
At its heart, this story is not about models at all. It is about whether you see AI as a clever engine sitting above your world, or as a fragile mirror of the data that you, your partners, and your adversaries have spilled into that world over many years. Defensive copilots, fraud engines, and decision systems may look sleek from the dashboard, but they are only as honest as the histories they were trained on. Old breaches, leaked code, crooked logs, and brokered context are not background noise. They are active ingredients. Once you see that, it becomes much harder to nod along when a vendor treats training data as an opaque blob that is “handled elsewhere” in the stack.
When leaders internalize this, the conversation around AI shifts. Questions about data provenance, labeling quality, and usage rights move from the appendix of the slide deck into the main discussion, next to performance and cost. Budget decisions change as well. Instead of pouring every extra dollar into “more data and bigger models,” some of that investment moves into curating incident histories, governing external feeds, and building the ability to correct or retire bad sources. Accountability becomes shared. Security, data, legal, procurement, and product teams all recognize that they have a stake in whether the ghosts in the training data are understood and managed, rather than tolerated as the price of innovation.
From here, the most useful step is not a sweeping new framework. It is a set of sharper questions. Which of your most important models depend on data whose lineage you could not confidently explain to a regulator or a major customer? Where have you implicitly trusted vendors, platforms, or brokers to make risk decisions about training data on your behalf, without ever stating that trade-off out loud? If you and your leadership team can hold those questions in the open and keep asking them as new AI projects appear, you will already be ahead of many peers still admiring their models without looking at the ghosts inside them. The goal is not to eliminate every ghost. It is to make sure the ones that remain are there by conscious choice, not by accident.