Survey: Synthetic data is essential to building capable AI models

Sept. 17, 2021
Synthetic data refers to computer-generated images and simulations used to train computer vision models.

Synthesis AI, in conjunction with Vanson Bourne, has released a new report showing that 89% of technology executives view synthetic data as a key emerging technology for creating more capable models, cutting the cost of data labeling, improving access to data, and reducing the time it takes to build AI models.

Industry leaders believe that, on average, 59% of their industry will utilize synthetic data in five years, either independently or in combination with 'real-world' data. This suggests that synthetic data will play an important role in the development of next-generation AI models.

The survey report, Adapt or Be Left Behind: 89 Percent of Tech Execs See Synthetic Data As a Key to Transforming Their Industry, is based on a survey of 100 senior technology executives. It covers their perceptions of synthetic data, the potential benefits of and barriers to implementation, and what industry leaders think it will take to continue driving the adoption of synthetic data.

Synthetic data refers to computer-generated images and simulations used to train computer vision models. It is emerging as an essential element in building accurate and capable AI models because it provides developers with vast amounts of perfectly labeled data on demand.

"AI is driven by the amount, quality and speed of training data. Synthetic training data is already making waves in several industries including autonomous vehicles and robotics. There is a critical need for more education on the underlying technology and benefits to drive broader industry adoption," said Yashar Behzadi, CEO and founder of Synthesis AI. "Building core synthetic data capability will be the key to whether or not some companies adapt or fall behind in the future. Synthetic data has the potential to deliver perfectly labeled data on-demand, potentially cutting millions of dollars and months of work related to the current process of collecting, preparing, and manually labeling training data."

Andy Thurai, vice president and principal analyst at Constellation Research, added, "Today's AI models are limited by real-world data for a couple of reasons—collecting real-world data is very expensive, and most companies don't have the time and resources to collect the volume of data that is required to train models that the tech giants do. The survey results indicate synthetic data is a new market where there is a knowledge gap that needs to be addressed. A blend of the real world and synthetic data will provide the best combination that is impossible to match just by raw data collection. If a model can handle all possible scenarios based on assumptions, then it is ready for real-world scenarios."

Synthetic data adoption is increasing, but further adoption depends on a better understanding of this emerging technology across the board, from the C-suite to machine learning engineers. Only half (51%) of the respondents were knowledgeable about state-of-the-art synthetic data approaches, indicating a critical gap.

Respondents who were aware of recent advances in synthetic data expressed confidence in the technology's ability to address key issues with current 'real-world' data approaches. This indicates that if the knowledge gap is reduced, many more will likely see and understand synthetic data's benefits.

The most prominent reported barriers to synthetic data adoption are limited organizational knowledge and slow buy-in from colleagues. Other barriers to adoption included:

  • Concerns that models built with synthetic data are not as good as those built with 'real-world' data (46%)
  • Difficulty in creating high-quality synthetic data for complex systems (45%)
  • The costs of integration and implementation (42%)