Get Data for Democracy

The NSA knows what you Googled last Tuesday, but most of what the government has done is not even on the internet.

The federal government knows more about you than you know about it. That's not a slogan — it's a structural fact, and the gap is getting worse.

Federal agencies can surveil your movements, purchase your data from brokers, intercept your communications, and run computational analysis on your life at a scale that would have been unimaginable a generation ago. Meanwhile, the records of what your government has actually done — the decisions it made, the wars it started, the programs it ran, the money it spent, the rights it violated — are sitting in boxes. Paper boxes. In climate-controlled warehouses in College Park, Maryland, deteriorating on microfilm, nitrate stock, and acid-eaten paper that is literally disintegrating.

The National Archives holds 13.5 billion pages of textual records, along with 41 million photographs, 40 million aerial photographs, 10 million maps and drawings, and more than 450 million feet of motion picture, sound, and video recordings. [1] As of today, approximately 455 million pages have been digitized and made available to the public through the National Archives Catalog. [1] That's roughly 3.4 percent of the textual holdings alone. The other 96.6 percent — the declassified intelligence files, the regulatory proceedings, the internal memos, the deliberative records of agencies that shaped the lives of hundreds of millions of people — might as well not exist for anyone who can't physically travel to a reading room and request a box.
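
For readers who want to check the math, the percentages above follow directly from the two page counts. What follows is a minimal sketch of that arithmetic using only the figures cited in this article; the variable and function names are illustrative and do not come from any National Archives tool.

```python
# Page counts as cited above from the National Archives [1].
TOTAL_TEXTUAL_PAGES = 13_500_000_000  # 13.5 billion pages of textual records
DIGITIZED_PAGES = 455_000_000         # approximately 455 million pages online


def digitized_share(digitized: int, total: int) -> float:
    """Return the digitized portion of the holdings as a percentage."""
    return 100 * digitized / total


share = digitized_share(DIGITIZED_PAGES, TOTAL_TEXTUAL_PAGES)
remaining = 100 - share

print(f"Digitized: {share:.1f}%")
print(f"Not digitized: {remaining:.1f}%")
```

The calculation covers textual pages only. Photographs, maps, and audiovisual holdings are counted in different units and are left out.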

The MAD Act would change that. Title V authorizes the first mass digitization of the federal government's records — all of them — and puts over a billion dollars behind it.

The Scale of What We're Losing

This isn't an abstract preservation problem. Records are dying.

Microfilm curls and cracks. Nitrate-based film stock is chemically unstable — it doesn't just degrade, it can spontaneously combust. [2] [3] Cellulose nitrate decomposition causes the film base to shrink, yellow, and become brittle, releasing nitrogen oxides that accelerate the process. In the advanced stages of deterioration, self-ignition can occur at sustained temperatures only slightly above 100°F. [3] Nitrate fires burn rapidly, produce their own oxygen, and are virtually impossible to extinguish. [2] The 1937 Fox vault fire in New Jersey — caused by spontaneous combustion of decomposing stock during a heat wave — destroyed millions of feet of irreplaceable film and killed one person. [4] As recently as 1993, a fire at the National Archives' Suitland facility destroyed nitrate holdings that had survived decades of storage. [5]

Thermofax paper fades to nothing. Acid-degraded paper crumbles when you touch it. Cellulose acetate "safety film" — once promoted as the solution to the nitrate problem — develops vinegar syndrome, shrinking and warping as it decomposes. [2] The National Archives, the Library of Congress, and the Government Publishing Office all hold massive collections on media with projected lifespans of fewer than 50 years, and much of that clock started ticking decades ago.

The Library of Congress holds more than 178 million items across 470 languages — the largest library collection on Earth. [6] [7] Among those holdings are substantial collections of government documents, congressional records, and legal materials that remain undigitized. The Library's digitization budget has been approximately $6 to $8 million per year, and only about 15 million items — less than 10 percent of the total collection — have been digitized. [8]

The Government Publishing Office maintains over 2 million federal publications on govinfo.gov, but its digital-native collections overwhelmingly begin with the 103rd Congress (1993–1994). [9] The substantial historical record of federal government publications predating 1994 — including committee reports, legislative histories, and executive agency publications stretching back more than two centuries — remains largely trapped in analog form.

Every year that passes without digitization is a year closer to permanent loss. Not redaction, not classification — loss. The physical destruction of the only copies of records that document how the United States governed itself.

The Asymmetry That Should Concern Everyone

Here is the fact that should make this a priority regardless of your politics: the federal government has spent the last two decades building one of the most sophisticated surveillance apparatuses in human history — and it has aimed that apparatus squarely at its own citizens.

The NSA's bulk collection programs, exposed by Edward Snowden in 2013, vacuumed up the phone records of hundreds of millions of Americans. [14] The FBI has used geofence and keyword search warrants to identify suspects from Google's databases. [15] ICE and CBP have purchased commercial location data to track people's movements without judicial authorization, a practice the MAD Act's own data broker title would ban. DHS has monitored social media accounts, assembled dossiers on journalists, and tracked the movements of protesters. [16] Federal agencies have deployed facial recognition technology, automatic license plate readers, and cell-site simulators across American cities. The government can, right now, reconstruct where you've been, who you've talked to, what you've searched for, and what you believe, often without ever appearing before a judge.

That is the government's capacity to surveil you. Now consider your capacity to surveil it.

You cannot computationally analyze the records of your own government. You can't run machine learning models across a century of regulatory decisions. You can't use natural language processing to identify patterns in how agencies have interpreted their statutory mandates over time. You can't cross-reference decades of adjudicative precedent to detect inconsistencies in how the government has treated similarly situated people. You can't read what the Founders of this country thought when it came time to debate their own laws. You can't do any of this because 96.6 percent of the record is paper sitting in a warehouse.

The government built a digital panopticon aimed at the public. The public can't even read the government's filing cabinets.

This is what the bill's findings call the asymmetry — and it's about to get dramatically worse. Artificial intelligence is transforming the ability to organize, search, and extract insight from large datasets. Every advancement in AI makes the surveillance apparatus more powerful. If the government's own records stay analog while its surveillance capabilities go digital, the imbalance becomes structural and permanent: the government will have ever-greater capacity to monitor, profile, and act on information about its citizens, while the citizens' capacity to oversee, audit, and hold accountable the government doing the monitoring will remain frozen in the twentieth century.

The National Archives budget tells this story in dollars. A 2022 analysis by the National Security Archive at George Washington University documented that NARA's budget, adjusted for inflation, has essentially flatlined for three decades — even as its responsibilities have expanded dramatically. [10] The agency has been given more presidential libraries to manage, more electronic records to process, and the entire National Declassification Center to stand up, with few additional resources. The result is a FOIA backlog so severe that one presidential library informed a requester that their request might be completed in 12 years. [10]

Digitizing the federal record doesn't just preserve history. It rebalances the information relationship between the government and the governed.

What the Bill Would Do

The MAD Act directs three institutions — the National Archives, the Library of Congress, and the Government Publishing Office — to enter into competitively awarded contracts for mass digitization of their analog holdings. Not selective digitization. Not on-demand scanning when a researcher files a request. Mass digitization — high-throughput, assembly-line conversion of everything.

The bill appropriates $1 billion to the National Archives, $30 million to the Library of Congress, and $20 million to the Government Publishing Office, with all funds available through fiscal year 2051. It also authorizes at least $10 million per year for 25 years for digital storage, format migration, and long-term retention of the digitized materials. The money is fenced: it can only be used for digitization, quality assurance, metadata creation, and public deposit. Not a dollar can be redirected to general operations, staffing, or facilities. And the bill explicitly prohibits any of the three institutions from cutting their existing budget requests on the grounds that digitization funding is now available. The money is additive.
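
The "over a billion dollars" figure can be verified by adding up the amounts described above. This is a rough sketch of that sum; the dictionary keys and constant names are illustrative labels, not language from the bill, and the retention figure is a floor because the bill says "at least."

```python
# Dollar amounts as described in this article; names are illustrative only.
APPROPRIATIONS = {
    "National Archives": 1_000_000_000,
    "Library of Congress": 30_000_000,
    "Government Publishing Office": 20_000_000,
}
RETENTION_FLOOR_PER_YEAR = 10_000_000  # "at least $10 million per year"
RETENTION_YEARS = 25

# Direct digitization appropriations across the three institutions.
digitization_total = sum(APPROPRIATIONS.values())

# Minimum authorized for storage, format migration, and long-term retention.
retention_minimum = RETENTION_FLOOR_PER_YEAR * RETENTION_YEARS

print(f"Digitization appropriations: ${digitization_total:,}")
print(f"Minimum retention authorization: ${retention_minimum:,}")
```

The two totals are kept separate on purpose: one is an appropriation and the other is an authorization, so they are not the same kind of money.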

The prioritization scheme is straightforward. At-risk records — anything on deteriorating media — come first. Declassified records that have been processed for public release by the National Declassification Center but never put online in machine-readable form come second. High-demand records, as measured by actual reference requests and FOIA processing data, come third. Administrative and program records with potential for AI and automation applications come fourth. Pre-1994 government publications come fifth. Everything else follows.
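
The ordering above amounts to a simple priority queue. The sketch below shows one way it could be modeled. It assumes that a record group falling into several categories is handled at its highest applicable tier, which is an interpretation and not something the bill is quoted as saying here. The class names and sample record groups are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Lower value means earlier digitization, in the order described above."""
    AT_RISK = 1
    DECLASSIFIED_PROCESSED = 2
    HIGH_DEMAND = 3
    AI_AUTOMATION_POTENTIAL = 4
    PRE_1994_PUBLICATION = 5
    OTHER = 6


@dataclass(frozen=True)
class RecordGroup:
    name: str               # hypothetical label for a body of records
    categories: frozenset   # set of Priority values that apply


def effective_priority(group: RecordGroup) -> Priority:
    # Assumption: the highest applicable tier (smallest number) wins.
    # A group matching no category falls into "everything else".
    return min(group.categories, default=Priority.OTHER)


def digitization_queue(groups):
    # sorted() is stable, so groups in the same tier keep their input order.
    return sorted(groups, key=effective_priority)


groups = [
    RecordGroup("agency program files", frozenset({Priority.AI_AUTOMATION_POTENTIAL})),
    RecordGroup("nitrate newsreels", frozenset({Priority.AT_RISK, Priority.HIGH_DEMAND})),
    RecordGroup("1950s committee reports", frozenset({Priority.PRE_1994_PUBLICATION})),
    RecordGroup("uncategorized correspondence", frozenset()),
    RecordGroup("released intelligence files", frozenset({Priority.DECLASSIFIED_PROCESSED})),
]

queue = digitization_queue(groups)
for group in queue:
    print(effective_priority(group).name, "-", group.name)
```

In this toy example the nitrate newsreels move to the front because deteriorating media outranks every other consideration, even though they also count as high-demand.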

The bill sets milestone targets: all at-risk records digitized within 10 years, all processed declassification backlogs within 12, all pre-1994 government publications within 15. If an institution can't hit a target, it has to explain why and provide a revised timeline. The GAO reviews the entire program every five years. Inspectors General conduct biennial audits. Any contract worth more than $10 million gets a separate IG review within 60 days of award.

To put the scale in perspective: the largest comparable mass digitization effort in history, Google Books, took roughly 15 years and an estimated $750 million to scan more than 40 million volumes, a milestone Google marked in 2019. [11] [12] The federal government's analog holdings dwarf that. This is a generational undertaking.

The Part That Matters Most

The single most important provision in this title isn't the money. It's the mandatory public deposit requirement.

Every contract awarded under this bill must require the contractor — at its own expense — to deposit a complete, unaltered copy of every digitized record with at least one independent, nongovernmental public access platform. Not just the government's own servers. An independent platform — a university archive, a nonprofit digital library, any qualified nongovernmental repository — that operates outside the control of any government.

This matters because government-operated platforms can be taken down. Administrations change. Budgets get cut. Websites get restructured and records quietly disappear. The bill ensures that once a record is digitized, a copy exists somewhere the government can't unilaterally remove it.

If a contractor fails to complete the deposit within 365 days of finishing the digitization, it's a material breach of contract. The bill specifies liquidated damages for noncompliance, and authorizes the covered institution — upon 60 days' notice — to complete the deposit itself using the contractor's deliverables and recover the costs from amounts otherwise owed to that contractor. This isn't aspirational language. It's an enforceable mechanism with teeth.

The bill goes further: no contract can grant any company exclusive rights to the digitized records. No embargoes. No periods of exclusive commercial access where a private company gets to charge the public for access to the public's own records. The government retains unlimited rights to everything, and everything goes to the public in open, machine-readable formats — free, unrestricted, no paywalls.

This is a direct response to what happened with previous public-private digitization partnerships at the National Archives. Starting in 2006, NARA entered into agreements with commercial partners — including Ancestry.com and Fold3 — that digitized records at no cost to the government in exchange for periods of exclusive commercial access. [13] The result was that digitized copies of public-domain government records were available to the public only through paid subscriptions for extended periods. The bill says that model is over for any new work. Existing agreements are grandfathered — their embargo periods can run to their original expiration — but no new embargoes can be imposed on any record digitized after the date of enactment, and any renewal or successor agreement must comply with the new rules.

Why This Is a National Security Issue

The bill's findings make a point that deserves attention: no other nation on Earth possesses a comparable body of democratically generated, archivally curated, provenance-verified government records spanning more than two centuries of continuous self-governance. That corpus — if digitized and made machine-readable — would be one of the most valuable public-domain datasets in existence for training AI systems capable of legal reasoning, policy analysis, national security assessment, and evidence-based governance.

Right now, it's paper.

The emergence of AI has created unprecedented demand for high-quality, provenance-verified textual datasets. The federal government's permanent records — individually evaluated and selected for retention precisely because of their enduring value — constitute exactly the kind of curated, trustworthy corpus that AI development needs. Making it available wouldn't just serve the public interest in transparency. It would give the United States a strategic advantage in the global competition for AI leadership that no amount of compute power or private-sector investment can substitute for, because no other country has this archive.

Why This Should Be Unanimous

This is a bill that gives researchers access to primary sources. It gives journalists the ability to audit government conduct at scale. It gives AI developers a uniquely valuable public-domain training dataset. It gives federal agencies the ability to computationally analyze their own institutional history. It gives genealogists access to census, military, and immigration records that currently require a trip to College Park. It gives Congress itself the ability to search the full breadth of legislative precedent instead of whatever happens to be online.

And it does all of this while protecting the physical originals, safeguarding classified material, respecting copyright, and ensuring that no private company gets to put a paywall between the public and its own government's records.

The federal government spent more than two centuries creating these records and hundreds of billions of dollars preserving them. The additional investment to make them accessible is modest by comparison. The question is whether the public gets to use what it already paid for, or whether 96.6 percent of the American record stays locked in a box, deteriorating in the dark, while the government that created it builds ever more powerful tools to watch the people who can't watch back.

Sources

A note to readers: We are committed to providing the public with accurate, factually grounded information. If you identify any errors of fact, gaps in sourcing, or flaws in the reasoning presented in this article, we would be grateful if you would bring them to our attention so they can be corrected. Mistakes are possible in any work of this kind, and we take corrections seriously.
