Artificial intelligence (AI), like the human brain, requires vast and varied data to provide useful insights and make informed decisions. For example, a generative AI tool like ChatGPT can offer higher quality responses by analysing a broad range and large volume of data relevant to user requests or questions.
What types of data does AI require and how can organisations meet these needs without adding complexity to IT management or introducing risks to the business?
Unstructured data
Historically, AI worked with structured data, typically numbers in rows and columns. However, structured data lacks the richness and depth of unstructured data, such as text, images, audio and video, which offers much more for analysis.
Despite its value, unstructured data remains an underutilised asset. Organisations recognise that unstructured data can help improve efficiency (cited by 62% of respondents) and drive innovation (31%), yet only 16% use a tool designed to deliver insights from unstructured data, and most of those efforts are in the early or pilot stages, according to Qlik’s Unstructured Data and GenAI Survey. This limits the potential of AI, especially generative AI, which can tap the valuable information hidden in unstructured data.
Due to the rise of generative AI, organisations are increasingly seeing the relevance and value of unstructured data, Sharad Kumar, Qlik’s field CTO for Data in Americas, tells DigitalEdge on the sidelines of the recent Qlik Connect in Orlando, Florida.
“In the past, people may not care if there are four versions of a document because the human will examine those documents and decide how to interpret or use them. But when we start leveraging generative AI, we need to think how generative AI tools would know which of those four versions is most relevant to the assigned task,” he says.
“Organisations must start governing unstructured data, looking at data versioning (or systematically tracking changes made to data over time), ensuring that any sensitive information in the unstructured data is protected and more. So, all the good practices we apply to structured data must be brought to unstructured data. Moreover, unstructured data must be presented as vectors (instead of tables, which is the common format for structured data) for large language models to use and power generative AI applications.”
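Kumar’s point about vector formats can be made concrete with a short sketch. The example below is illustrative only and not tied to any Qlik product: it assumes the open-source sentence-transformers library, an arbitrary embedding model and a simple in-memory store, and shows how version metadata can travel with each embedding so that only the latest version of a document is used for retrieval.

```python
# Illustrative sketch only: turning unstructured text into vectors with
# version metadata, so a retrieval layer can prefer the latest version.
# The model name and the in-memory "store" are arbitrary choices.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    {"doc_id": "policy-travel", "version": 3, "text": "Employees may book economy class..."},
    {"doc_id": "policy-travel", "version": 4, "text": "Employees may book premium economy..."},
]

store = []
for doc in documents:
    vector = model.encode(doc["text"])        # unstructured text -> fixed-length vector
    store.append({**doc, "vector": vector})   # keep version metadata alongside the embedding

# Governance step: when several versions of the same document exist,
# keep only the latest one for retrieval.
latest = {}
for item in store:
    current = latest.get(item["doc_id"])
    if current is None or item["version"] > current["version"]:
        latest[item["doc_id"]] = item

print([(d["doc_id"], d["version"]) for d in latest.values()])
```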
Recognising these challenges, Qlik has launched Qlik Answers, a generative AI-powered knowledge assistant designed to help businesses harness unstructured data. It delivers reliable and personalised answers from companies’ private sources, such as knowledge libraries and document repositories, ensuring instant, relevant insights.
Fully plug-and-play and self-service, Qlik Answers integrates seamlessly into existing systems, enabling workers to make informed decisions in real time. It offers full explainability so that users know the origin of answers, maintaining trust and transparency. Additionally, Qlik Answers offers “best-in-class security and governance, making it a complete, easy-to-deploy solution for leveraging unstructured data to drive improved business performance”, says Qlik.
Making data consumable
Data needs to be discoverable and consumable to benefit both AI and users.
Although organisations are moving data into centralised platforms, many have yet to see [the move helping them be data-centric or AI-driven] because there is a disconnect between data producers and data consumers. The new data product concept can eliminate that gap, bringing ownership, accountability and trust to data.
Sharad Kumar, field CTO for Data in Americas, Qlik
A data product is a reliable dataset designed for sharing and reuse within an organisation. It aims to make data accessible, consumable, insightful and actionable for a growing number of stakeholders and generative AI systems that rely on data for decision-making. Data products are purpose-built with defined shapes, consumption interfaces, maintenance and refresh cycles to serve specific needs effectively.
“For example, creating a data product around all customer interactions enables various teams and departments to use it easily [for analytics or AI] and with the confidence that the data is of good quality and does not contain sensitive or private data. This can help a salesperson better understand how to upsell to a certain customer while a business leader can gain insights on which markets to expand or enter,” Kumar adds.
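To make the idea less abstract, here is a minimal, hypothetical sketch of what a data product “contract” might record: its owner, shape, refresh cycle and sensitivity handling. The field names and values are assumptions for illustration, not Qlik’s implementation.

```python
# Hypothetical sketch of a data product contract: ownership, shape,
# refresh cycle and sensitivity handling, so consumers know what they get.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str                 # e.g. "customer_interactions"
    owner: str                # accountable team or data product manager
    schema: dict              # column name -> type: the product's defined shape
    refresh_cycle: str        # e.g. "hourly", "daily"
    pii_columns: list = field(default_factory=list)  # fields to mask before sharing
    consumers: list = field(default_factory=list)    # teams allowed to subscribe

customer_interactions = DataProduct(
    name="customer_interactions",
    owner="crm-data-team",
    schema={
        "customer_id": "string",
        "channel": "string",
        "interaction_ts": "timestamp",
        "summary": "string",
    },
    refresh_cycle="hourly",
    pii_columns=["summary"],  # free text may contain personal data
    consumers=["sales-analytics", "market-expansion"],
)
```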
According to Kumar, a data product management team should comprise analysts, data engineers, user experience designers and data architects. It is led by a data product manager, who steers the lifecycle of these products to meet business objectives.
The team works closely with sales, marketing and human resources teams to understand their needs and translate them into actionable data products. It also enables self-service interaction, allowing data consumers to search for, understand and access data products on their own. Additionally, the team continuously enhances each data product with new dimensions, features and functionalities based on feedback to ensure ongoing business value.
“There are six principles to make data ready for AI. Data needs to be diverse and unbiased; timely and up-to-date; accurate, reliable and trustworthy; secure; discoverable and easy to understand in context; and in a format easy for AI [and employees] to consume,” says Kumar. Qlik has been helping organisations, including the United Nations Framework Convention on Climate Change (UNFCCC), to achieve those goals and become AI- and data-driven.
Leveraging Qlik’s solutions enables UNFCCC to better manage and reconcile vast amounts of global data, which often vary in quality and context across borders. This enhances UNFCCC’s ability to standardise diverse data sources, facilitating more cohesive and comprehensive climate policies. The organisation also plans to introduce Qlik’s AI tools for continuous analysis, enabling faster response times and more precise decision-making.
“Partnering with Qlik allows us to take significant strides in our climate action efforts by leveraging Qlik’s advanced data integration and AI capabilities. It will enable us to implement more immediate, data-informed actions and better understand our diplomacy, political and climate data. We are particularly excited about exploring AI’s potential in analysing data of all types, including unstructured data, [to further] drive advancements in global climate policy and action,” says UNFCCC’s data scientist Joaquim Barris.
Real-time data
Up-to-date data is also necessary for AI (including generative AI), especially in use cases requiring swift decision-making based on the most current data, such as AI-powered recommendation engines on e-commerce sites. This is why data streaming is crucial for AI.
Data streaming involves IT architecture and solutions in which real-time data continuously feeds into critical business systems and AI solutions. This approach lets organisations inject real-time context at the point an AI query is executed, while allowing them to experiment, scale and innovate with greater agility.
Suvig Sharma, area vice president for Asean, Confluent
Many companies using generative AI are adopting data streaming to create seamless customer experiences and optimise business operations in real time, says Suvig Sharma, Confluent’s area vice president for Asean. Data streaming enhances RAG-enabled workloads by integrating reliable, contextual real-time data from core business systems. RAG, or retrieval-augmented generation, is increasingly used to blend enterprise data with AI so that responses are contextually relevant and grounded in current information.
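The RAG pattern itself is simple to sketch. In the hedged example below, `embed()` and `generate_answer()` are hypothetical stand-ins for whatever embedding model and LLM an organisation actually uses; the point is that fresh, relevant records are retrieved and handed to the model as context rather than relying on the model’s training data alone.

```python
# Bare-bones sketch of the retrieval-augmented generation (RAG) pattern:
# rank stored records by similarity to the question, then pass the best
# matches to the model as context. embed() and generate_answer() are
# hypothetical placeholders.
import numpy as np

def retrieve(query_vec, records, top_k=3):
    """Return the top_k records most similar to the query vector."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(records, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return ranked[:top_k]

def build_prompt(question, retrieved):
    """Assemble a prompt that constrains the model to the retrieved context."""
    context = "\n".join(r["text"] for r in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Usage, assuming embed() and generate_answer() exist:
# hits = retrieve(embed(question), knowledge_base)
# answer = generate_answer(build_prompt(question, hits))
```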
Regardless of company size and industry, adopters of data streaming are reaping direct business benefits in productivity, revenue, customer experience and satisfaction. Confluent’s 2024 Data Streaming Report reveals that 93% of adopters cite a two- to ten-times return on their data streaming investments.
Over 90% of adopters in Singapore agree that data streaming platforms broaden access to various data sources, helping contextualise models and ensure that the data ingested meets appropriate quality standards. This is crucial for generating accurate and relevant results from AI. Shifting data from an “at rest” state to one that is transient and reflects up-to-date information will provide businesses with accurate real-time services and applications.
The right data streaming platform
What should organisations look for when selecting a data streaming platform? “Organisations should seek a data streaming platform capable of providing the relevant, real-time data necessary for AI applications in a timely, secure and scalable manner across the business. The platform needs to be able to stream, connect, process and govern data however and wherever it’s needed,” says Sharma.
“As data moves through the platform, data pipelines are built to shape and deliver data in real time, creating a system of ‘data in motion’. The platform connects and unlocks enterprise data from source systems wherever they reside and serves it as continuously streamed, processed, fully governed and ready-to-use data products. These real-time data products become data assets that are instantly discoverable, contextualised, trustworthy and reusable across many use cases, saving organisations cost and time.”
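A minimal sketch of “data in motion”, assuming a Kafka-compatible platform and the open-source confluent-kafka Python client, might look like the following. The broker address, topic name and payload are placeholder assumptions; the pattern is simply that a source system publishes each event as it happens and a downstream AI pipeline consumes the stream continuously instead of waiting for a batch load.

```python
# Minimal sketch of streaming events to a topic and consuming them
# downstream (e.g. to refresh an AI application's context in real time).
# Broker, topic and payload are illustrative placeholders.
import json
from confluent_kafka import Producer, Consumer

conf = {"bootstrap.servers": "localhost:9092"}

# Producer side: a source system publishes each event as it happens.
producer = Producer(conf)
event = {"customer_id": "c-101", "action": "viewed_product", "sku": "sku-42"}
producer.produce("customer-events", key=event["customer_id"], value=json.dumps(event))
producer.flush()

# Consumer side: a downstream service (feature pipeline, RAG indexer, etc.)
# reads the stream continuously rather than in nightly batches.
consumer = Consumer({**conf, "group.id": "ai-context-refresher", "auto.offset.reset": "earliest"})
consumer.subscribe(["customer-events"])
msg = consumer.poll(5.0)
if msg is not None and msg.error() is None:
    print("received:", json.loads(msg.value()))
consumer.close()
```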
To help organisations harness AI’s potential, Confluent delivers a “complete data streaming platform” that helps customers build a shared source of real-time truth for sophisticated model building and fine-tuning.
Sharma adds: “Our AI Model Inference on Confluent Cloud for Apache Flink simplifies building and launching AI and machine learning (ML) applications through a single platform for data processing and AI/ML tasks. It reduces the operational complexity by enabling seamless coordination between data processing and AI workflows and enables accurate, real-time AI-driven decision-making.”
Trust Bank is among the organisations that have benefited from Confluent’s solution. By using Confluent’s data streaming platform for its event-driven architecture, the different teams in the Singapore bank can produce, share and consume self-service data products in real-time streams.
“Across the organisation, this drives innovation, improves agility, reduces the total cost of ownership and ensures appropriate quality controls and security policies are applied consistently, no matter the team,” Sharma adds.
Artificially generated data
Obtaining sufficient real-world data for training and evaluating AI models can be challenging due to limitations in availability, cost and regulatory restrictions such as privacy and security. As a result, Gartner estimates that by 2030, synthetic data will surpass real-world data in training AI models.
Synthetic data is generated using algorithms that mimic original data properties, ensuring similarity to real-world data without sensitive or personally identifiable information.
Synthetic data is an important asset as it is information generated on a computer to augment or replace real data to improve AI models, protect sensitive data and mitigate bias. It’s cheaper to produce, automatically labelled and sidesteps many of the logistical, ethical and privacy issues that come with training deep learning models on real-world examples.
Tan Ser Yean, CTO, IBM Singapore
In banking, synthetic data can simulate financial records such as payments, credit card transactions and customer profiles. These datasets are used to develop machine learning models for fraud detection, credit risk scoring and Know Your Customer (KYC) processes, improving the efficiency and personalisation of financial services.
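As an illustration (not IBM’s tooling), the sketch below generates a small synthetic transaction dataset whose broad statistical properties, such as the amount distribution and fraud rate, are chosen to mimic a real portfolio without copying any real customer’s records. The chosen distributions and the 1% fraud rate are assumptions for the example.

```python
# Illustrative only: generate labelled synthetic card transactions whose
# statistical shape mimics a real dataset without reusing real records.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)
n = 10_000

synthetic = pd.DataFrame({
    "customer_id": rng.integers(1, 2_000, size=n),            # no link to real customers
    "amount": np.round(rng.lognormal(mean=3.5, sigma=1.0, size=n), 2),
    "merchant_category": rng.choice(["grocery", "travel", "online", "fuel"], size=n),
    "is_fraud": rng.random(n) < 0.01,                          # labels come "for free"
})

print(synthetic["is_fraud"].mean(), synthetic["amount"].median())
# The labelled frame can now train a fraud-detection model without
# exposing real transaction histories.
```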
Retailers can also use synthetic data like location data, consumer purchase history, point-of-sale information and inventory to train machine learning models for demand forecasting, inventory management, targeted marketing and personalised recommendations. This optimises operations, lowers client acquisition costs and enhances personalised shopping experiences.
Despite its advantages, there are limitations and drawbacks to using synthetic data for AI. Tan shares that synthetic data may be unsuitable for closed-loop systems where feedback is provided to the data generator. For instance, a synthetic data generation algorithm that consistently generates male patients named “John Smith” would not accurately represent the diverse patient population in a healthcare setting.
Synthetic data can also introduce new biases if the generation algorithm is not properly designed. Lack of diversity and randomness in synthetic data may misrepresent the real-world scenario, causing the AI model to be biased or overfit.
In addition, inconsistencies or a lack of context in the synthetic data generation process can lead to inaccurate AI models. “For example, a synthetic data generation algorithm that generates synthetic images of people but fails to account for lighting conditions, facial features or clothing may produce synthetic images that are not representative of real-world images. These inaccuracies can negatively impact the performance of AI models, especially in applications where precision is crucial, such as medical diagnosis or fraud detection,” says Tan.
Synthetic data generation, he adds, can be resource-intensive and time-consuming. “Although synthetic data can save time and resources compared to collecting and labelling real-world data, data scientists still need to spend their time and efforts validating the synthetic data before it is used. This validation process is essential to ensure that the synthetic data accurately represents the real-world data and does not introduce new biases or inaccuracies.”
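One simple form that validation can take is a distributional check of each synthetic column against its real counterpart. The sketch below assumes SciPy’s two-sample Kolmogorov-Smirnov test and a conventional 0.05 threshold, and uses generated stand-in arrays in place of real and synthetic data; it flags columns where the generator has drifted from the real distribution.

```python
# Sketch of a validation step: compare a synthetic column's distribution
# against the real one before using it for training. The 0.05 threshold
# is a conventional choice, not a universal rule.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real_amounts = rng.lognormal(mean=3.5, sigma=1.0, size=5_000)        # stand-in for real data
synthetic_amounts = rng.lognormal(mean=3.5, sigma=1.05, size=5_000)  # stand-in for generated data

result = ks_2samp(real_amounts, synthetic_amounts)
if result.pvalue < 0.05:
    print(f"Distributions differ (KS={result.statistic:.3f}); revisit the generator.")
else:
    print(f"No significant difference detected (KS={result.statistic:.3f}); proceed to training.")
```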
Value from synthetic data
To maximise the value of synthetic data for AI while overcoming its limitations, organisations must ensure their synthetic data generators utilise a wide array of real-world data sources, including structured, semi-structured and unstructured data. That way, organisations can confidently use synthetic data that accurately represents the complexity and diversity of the original data in various applications and scenarios.
He also points out that the synthetic data generator should utilise various techniques and machine learning algorithms like generative adversarial networks, variational autoencoders and recurrent neural networks. These methods enhance the generated synthetic data’s quality, accuracy and realism. “These algorithms can help capture the underlying patterns, structures and relationships in the original data and generate synthetic data that closely resembles the original data.”
“Moreover, synthetic data generators should continuously monitor and evaluate the quality of the generated synthetic data and improve the synthetic data generation algorithm as needed. This may involve incorporating new data sources, refining the algorithms and addressing any limitations or drawbacks identified during the evaluation process. By continuously improving the synthetic data generation process, organisations can ensure that their synthetic data remains accurate, reliable and up-to-date,” Tan says.
Beyond technology, organisations must ensure they have the right processes, people and organisational culture to fully utilise synthetic data for AI. According to Tan, they should:
- Establish clear data governance practices to ensure synthetic data is generated and used responsibly, transparently and ethically. This includes establishing policies and procedures for data collection, storage, access and sharing, and ensuring appropriate security measures are in place to protect synthetic data.
- Provide training courses for data scientists and other professionals involved in synthetic data generation to ensure they have the necessary skills and expertise. This may include training on the latest techniques and algorithms for generating synthetic data and best practices for data governance and security.
- Foster a culture of innovation to encourage the development of new ideas and approaches for generating synthetic data. This may involve investing in research and development, partnering with external experts and promoting a culture of experimentation and exploration.
- Encourage collaboration between data science, engineering and other teams to share ideas, failures and best practices. Collaboration can help improve the organisation’s current technology, processes and people and ensure the synthetic data generation process remains efficient, effective and innovative.