Tether Data’s AI research division, QVAC, released what it claims is the largest synthetic dataset created for artificial intelligence training under a new initiative called QVAC Genesis. This initial release, Genesis I, a collection of 41 billion text tokens, is meant to help the world build “smarter, more capable, and highly precise STEM-focused language models.”
Each “text token” is a tiny fragment of language, one of the building blocks that AI models use to understand and generate text. By training “on 41 billion of these tokens from QVAC Genesis’s dataset, models grasp not just words, but the relationships and logic that connect them.”
This dataset has been validated across educational and scientific benchmarks, demonstrating reasoning and problem-solving performance “in subjects such as mathematics, physics, biology, and medicine.”
It is presented as the largest publicly available synthetic dataset of its kind, specifically built and rigorously validated for “education-specific content, offering comprehensive coverage across key STEM domains where today’s public training datasets fall short.”
More than just a technical milestone, this release is said to be a statement about who should own the “future of intelligence.” As AI becomes centralized, with models trained, hosted, and controlled by a few corporations, QVAC Genesis I is working “to return that power to the people by providing open, high-quality data for scientific research advancement.”
Tether Data also released its consumer app, QVAC Workbench, a workspace that demonstrates the potential of local, on-device artificial intelligence. QVAC Workbench is “targeting AI enthusiasts, advanced users, and researchers. It already supports a wide variety of LLMs and other AI Models, including Llama, Medgemma, Qwen, SmolVLM, Whisper, and many more.”
The app is currently available for smartphones (Android for now, with iOS to follow within a few days) as well as desktop platforms (Windows, macOS, and Linux), providing what is billed as the most comprehensive on-device support among current offerings.
With QVAC Workbench, all chats and interactions with the AI models stay local on the device, “where data is owned by the user and remains 100% private.”
It also provides a feature called “Delegated Inference,” which lets a user connect their mobile Workbench app peer-to-peer with the Workbench desktop app to “fully utilize the power and resources of their home or office workstations.”
By making the QVAC Genesis dataset public, QVAC aims to encourage researchers to build and use models that may “compete with, and even surpass, proprietary systems.”
The dataset was created using a multi-stage generation and validation process that “turns high-quality scientific and educational materials into structured learning data.”
The result is a training resource that “helps models reason, solve problems, and think critically, rather than merely imitate language.”
The release of the first two QVAC projects is said to be part of a wider mission to transform how AI exists in the real world, introducing a new paradigm of ‘local intelligence,’ where tools are able to “learn and evolve directly on any device.”
The complete technical breakdown of the dataset, code-named QVAC Genesis I, is accessible now.