Is Big Tech wrong to train AI models on ‘messy’ public data? A chat with synthetic data evangelist Ali Golshan.

Companies at the forefront of the technology, like OpenAI, Meta, and Google, are scouring the internet and troves of books, podcasts, and videos searching for data to train their models.

Some industry leaders, however, worry this kind of “land grab” for publicly available data isn’t the right approach, especially since it puts companies at risk of copyright lawsuits. Instead, they’re pushing for companies to train their models on synthetic data.

Synthetic data is artificially generated rather than collected from the real world. It can be generated by machine learning algorithms with little more than a seed of original data.

Business Insider chatted with Ali Golshan, CEO and cofounder of Gretel, who one might call an evangelist for synthetic data. Gretel allows companies to experiment and build with synthetic data. It is working with major players in the healthcare space, such as genomics company Illumina, consulting firms like Ernst & Young, and consumer companies like Riot Games.

Golshan says synthetic data is a safer and more private alternative to “messy” public data, and that it can shepherd most companies into the next era of generative AI development.

The following conversation has been edited for clarity.

Why is synthetic data better than raw public data?

Raw data is just that: raw. It’s often filled with holes, inconsistencies, and biases from the processes used to capture, label, and leverage it. Synthetic data, on the other hand, allows us to fill those gaps, expand into areas that can’t be captured in the wild, and intentionally design the data needed for specific applications.

This level of control, with humans in the loop designing and refining the data, is crucial for pushing GenAI to new heights in a responsible, transparent, and secure manner. Synthetic data enables us to create datasets that are more comprehensive, balanced, and tailored to specific AI training needs, which leads to more accurate and reliable models.

Great, are there any cons to synthetic data?

Where synthetic data isn’t very good is that, at the end of the day, if you have no data or clarity, you can’t just have it create perfect data for you so you can experiment endlessly. So there is that scope that needs to be created.

Ultimately, the other part of it is that synthetic data is very good at privacy if you have enough data. So, if you have only a few hundred records and want ultimate privacy, that comes at a huge cost to utility and correctness because the data is very limited. So, when it comes to having absolutely zero data and wanting a domain-specific task, or having very limited data and wanting great privacy and accuracy, those are just incompatible with the approaches.

What are the challenges of using public data?

Public data presents several challenges, especially for specialized use cases in healthcare. Imagine trying to train an AI model for predicting COVID-19 outcomes using only publicly available case count data. You’d be missing crucial specifics like patient comorbidities, treatment protocols, and detailed clinical progression. This absence of comprehensive data severely limits the model’s effectiveness and reliability.

Adding to this challenge is the growing regulatory pushback against data collection practices. The Federal Trade Commission and other regulatory bodies are increasingly pushing back against web scraping and unauthorized data access, and rightly so. As AI becomes more powerful, the risk of re-identifying individuals from supposedly anonymized data is higher than ever.

There’s also the critical issue of data freshness across all industries. In today’s fast-paced business environment, organizations need real-time data to remain competitive and train models that respond rapidly to changing market conditions, consumer behaviors, and emerging trends. Public domain data often lags by weeks, months, or even years, making it less valuable for cutting-edge AI applications that require up-to-the-minute insights.

What do you think about companies like Meta and OpenAI that are willing to risk copyright lawsuits to get access to public data?

The era of ‘move fast and break things’ is over, especially in the age of GenAI, where there’s too much at stake to operate in such a flippant manner. We’re advocating for an approach that leads with privacy. By prioritizing privacy from the start and embedding it into the company’s AI products and services by design, you get faster, more sustainable, and defensible AI development. That’s what our partners and, ultimately, their customers want. In this sense, privacy is a catalyst for GenAI innovation.

This privacy-first approach is why partners like Google, AWS, EY, and Databricks work with us. They know that current methods are unsustainable and the future of AI will be driven by consensual, licensed data and thoughtful data-driven design, not by grasping at every bit of public data available. It’s about creating a foundation of trust with your users and stakeholders, which is crucial for long-term success in AI development.

Companies are scrambling to build models that unlock insights from proprietary information. Where does synthetic data fit into that equation?

By some estimates, companies use only 1-10% of the information they collect. The rest is stored and siloed so that few can even access or experiment with it. This creates additional costs and data breach risks with no return value. Now, imagine if a company could safely open access to that unused 90% of data. Cross-functional teams could collaborate and experiment with it to extract value without creating additional privacy or security risks. That level of knowledge sharing would be a huge boon for innovation.

It’s like we’re moving on from the fable of the blind men trying to describe an elephant to each other. Each only has a grasp and understanding of the part they can touch; the rest is a black box. Providing an entire organization with shared access to the ‘crown jewels’ and the opportunity to surface new insights from that data would be a paradigm shift in how companies and products are built. This is what people mean when they speak of ‘democratizing’ data.

There are already ways of training smaller models with a fraction of the data we may have once used that yield great results. Where are we headed regarding the amount of data we need for training generative AI?

The idea of throwing the kitchen sink, in terms of data, at training a large language model is part of the problem and reflects the old ‘move fast and break things’ mentality. It’s a land grab by companies with the means to do that, while AI regulations are still being hashed out.

Now that the dust is settling, people are realizing that the future lies in smaller, more specialized models targeted to very specific tasks and orchestrating the actions of these models through an agentic, systematic approach. This specialized model approach provides more transparency and removes much of the ‘black box’ nature of AI models since you’re designing the models from the ground up, piece by piece.

It’s also where regulation is headed. After all, how else will companies adhere to ‘risk-based’ regulations if we can’t even quantify AI risks for each task we assign them to?

This shift toward more focused, efficient models aligns perfectly with differential privacy and synthetic data. We can generate precisely the data needed for these narrow AI models, ensuring high performance without the ethical and practical issues of massive data collection. It’s about smart, targeted development rather than the brute-force approach companies have taken.
