Evaluating Generative AI for text generation

Evaluating the technology of startups focused on generative AI differs in many ways from evaluating traditional AI technology. Generative AI, especially when used for text generation, brings unique considerations and complexities that demand a different evaluation lens. These are the key points to consider when evaluating generative AI text products as a Venture Capitalist or Angel Investor.


State of the art models
The current speed of AI technology evolution is incredibly fast. As of 2023, we can say that strong generative AI models for text generation rely on transformer-based architectures such as GPT-3, GPT-4, or BERT.

In many respects, these models outperform more traditional architectures such as Recurrent Neural Networks. Transformer-based models offer a much better contextual understanding of relationships in language (thanks to attention mechanisms and positional encoding), greater training efficiency (thanks to parallel computation), and strong generalization through pretraining combined with fine-tuning on specific downstream tasks.
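
To make this tangible, here is a minimal sketch of generating text with a pretrained transformer. It assumes the Hugging Face transformers library and the publicly available GPT-2 checkpoint purely for illustration; the startups you evaluate will typically use larger proprietary or fine-tuned models.

```python
# Minimal sketch: generating text with a pretrained transformer.
# Assumes the Hugging Face `transformers` library and the public GPT-2
# checkpoint; a startup's actual stack will differ.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Generative AI startups should be evaluated on"
outputs = generator(
    prompt,
    max_new_tokens=40,
    do_sample=True,          # sample so we can draw several candidates
    num_return_sequences=2,
)

for i, out in enumerate(outputs, 1):
    print(f"Sample {i}: {out['generated_text']}")
```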

What Generative AI startups should share with you: Info about the model architecture.

Model Layer & Architecture Dependency
It is already becoming evident that in the future there will be three types of Generative AI startups. The most robust ones rely on their own Large Language Models. These teams act flexibly and independently, but they have to invest more effort in pretraining.

The majority of Gen AI startups will rely on the Large Language Models provided by OpenAI/Microsoft and Google, enriching them with their own data for the specific use case.

A third group consists of fast-to-market teams that rely solely on pretrained LLMs without dedicating much time to fine-tuning. The dependency on external factors is significantly greater in this case.

A bonus point is an approach that relies on multiple pretrained Large Language Models at once. Diversifying LLMs can improve the versatility and robustness of the technology and enable better adaptability to different text generation tasks.
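
As a hypothetical illustration of what such a multi-LLM approach implies technically, the sketch below shows a thin provider-agnostic layer. The provider classes and names here are placeholders, not any specific startup's design; real implementations would wrap vendor SDKs or self-hosted models.

```python
# Hypothetical sketch of a provider-agnostic layer over multiple LLMs.
# The backend classes are placeholders; real ones would wrap vendor SDKs
# (OpenAI, Google, a self-hosted model, etc.).
from abc import ABC, abstractmethod


class TextGenerator(ABC):
    """Common interface so product code never depends on a single vendor."""

    @abstractmethod
    def generate(self, prompt: str) -> str: ...


class HostedAPIBackend(TextGenerator):
    def generate(self, prompt: str) -> str:
        return f"[hosted-API completion for: {prompt!r}]"  # placeholder


class SelfHostedBackend(TextGenerator):
    def generate(self, prompt: str) -> str:
        return f"[self-hosted completion for: {prompt!r}]"  # placeholder


class Router:
    """Pick a backend per task; fall back to the default if one fails."""

    def __init__(self, backends: dict[str, TextGenerator], default: str):
        self.backends = backends
        self.default = default

    def generate(self, prompt: str, task: str | None = None) -> str:
        backend = self.backends.get(task or self.default, self.backends[self.default])
        try:
            return backend.generate(prompt)
        except Exception:
            return self.backends[self.default].generate(prompt)


router = Router(
    backends={"summarization": HostedAPIBackend(), "default": SelfHostedBackend()},
    default="default",
)
print(router.generate("Summarize this pitch deck.", task="summarization"))
```

The design choice to ask about here is how easily the startup can swap or combine models behind such an interface without rewriting its product.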

What Generative AI startups should share with you: Info about the Large Language Models in use and the level of fine-tuning.

Model accuracy and performance
There are several metrics generative AI startups can show you to demonstrate their models’ performance.

Evaluating generative AI models for text generation comes with a number of challenges. Whether a text is right or wrong depends on the context and the specific task, and the correctness of a text can be quite subjective.

While there are several metrics that provide quantitative or qualitative indicators, it’s always recommended to rely on a combination of various metrics to get a holistic view of a generative AI model’s performance.

Startups should provide you with a mix of the metrics below to let you compare and assess their generative AI performance for text generation.

  • BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores are commonly used metrics for evaluating the quality of machine-generated text. The BLEU score is precision-oriented, meaning it measures how many n-grams (contiguous sequence of n items from a given sample of text) in the machine-generated text are also in the reference text. A higher BLEU score indicates a higher similarity between the generated text and the reference, suggesting a more accurate model.
    Typical BLEU score visualization. Source: “Transformer-based End-to-End Question Generation”, Lopez & Cruz; published on ResearchGate.

    On the other hand, ROUGE is a recall-oriented measure that focuses on how much of the reference text is captured in the generated text. This metric is particularly useful for tasks like text summarization, where covering all the key points from the source text is essential.

    The relevance of these scores in evaluating generative AI models for text generation lies in their ability to quantitatively assess the quality of generated text. They provide insights into the model’s performance in terms of linguistic correctness and content overlap with the reference text. However, they should not be used in isolation, as they may not fully capture semantic correctness, coherence, or creativity, factors that are equally important in high-quality text generation. Using these scores alongside other metrics and human evaluation can offer a more comprehensive assessment of a model’s performance.

  • Perplexity is a measurement in natural language processing that quantifies how well a probability model predicts a sample. In the context of generative AI models for text generation, perplexity essentially measures the uncertainty of a language model in predicting the next word in a sentence. Lower perplexity indicates that the model is less ‘perplexed’ and more certain about its predictions, meaning it is a more accurate model. Therefore, when evaluating generative AI models, a lower perplexity score typically indicates a better model. However, like other metrics, perplexity should not be used in isolation. While it offers insights into the model’s capability of understanding the language, it might not fully capture the quality of the generated text in terms of coherence, creativity, or context-appropriateness, which are also important aspects of text generation. (A short code sketch of these automatic metrics follows this list.)
  • Last but not least, task-based evaluation, domain knowledge, and testing a demo version are crucial components for evaluating the performance of generative AI models for text generation.

    Unlike metrics like BLEU, ROUGE, or perplexity, which provide a general quantitative measure of the model’s performance, task-based evaluation focuses on how well the model performs in the context of a particular application or task. It can be especially useful for understanding the model’s real-world applicability and effectiveness.

    By involving domain experts in the evaluation process, you can accurately gauge the quality and relevance of the generated text, ensuring that it aligns with the subject matter and meets industry-specific requirements. For instance, a model trained to generate medical text would need to be evaluated by someone with knowledge in the medical field to ensure accuracy and appropriateness of terminology.
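
For reference, here is a minimal sketch of how the automatic metrics above can be computed. It assumes the nltk, rouge-score, transformers, and torch packages, and uses the public GPT-2 checkpoint only as a stand-in scoring model; a startup’s actual evaluation harness will be more elaborate.

```python
# Sketch of the automatic metrics discussed above: BLEU, ROUGE, perplexity.
# Assumes the `nltk`, `rouge-score`, `transformers`, and `torch` packages.
import torch
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

reference = "the startup trains its model on curated legal documents"
candidate = "the startup trains its model on curated legal texts"

# BLEU: n-gram precision of the candidate against the reference.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: recall-oriented overlap (unigrams and longest common subsequence).
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

# Perplexity: exponential of the average negative log-likelihood under GPT-2.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
inputs = tokenizer(candidate, return_tensors="pt")
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss
perplexity = torch.exp(loss).item()

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}, ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
print(f"Perplexity: {perplexity:.1f}")
```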

What Generative AI startups should share with you:
1. Performance metrics to evaluate the generative AI model
2. Working demo ready for domain experts to test.

Explainability
Explainability in the context of generative AI models for text generation refers to the model’s ability to provide clear insights into why and how it generates certain outputs. Given the ‘black box’ nature of many AI models, understanding the decision-making process can be complex. This is where explainability becomes crucial. An explainable model allows investors, users, and regulators to trust the system as they can comprehend how outputs are produced.

Typical visualization of LIME output to explain AI model decisions. Source: “Explainable AI: A Review of Machine Learning Interpretability Methods”, Linardatos et al.; published on ResearchGate.


If a model is explainable, it’s easier for you as a VC or Angel investor to assess its potential for success, ethical considerations, and regulatory compliance.

To provide explainability and transparency, startups and AI teams use techniques like LIME (Local Interpretable Model-agnostic Explanations) or SHAP (SHapley Additive exPlanations). These techniques can be used to identify which features in the input are most influential in the model’s output. This can be particularly useful for understanding which words or phrases the model considers most important.
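
As a minimal sketch of the idea, the example below applies LIME to a toy scikit-learn text classifier, which stands in for a startup’s actual model (LIME explains models that output prediction probabilities). It assumes the lime and scikit-learn packages; the training sentences are hypothetical.

```python
# Sketch of LIME on a text model, using a toy scikit-learn classifier as a
# stand-in for a startup's actual model. Assumes `lime` and `scikit-learn`.
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set (hypothetical data).
texts = ["great product and fast support", "terrible quality, very slow",
         "excellent value, works well", "awful experience, broken on arrival"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

explainer = LimeTextExplainer(class_names=["negative", "positive"])
explanation = explainer.explain_instance(
    "fast support but awful quality",
    model.predict_proba,   # LIME perturbs the text and queries the model
    num_features=4,
)
# Words with the largest weights are the most influential for the prediction.
print(explanation.as_list())
```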

What Gen AI startups should share with you: An answer to the question of why the model made certain decisions.

Model Development Stage
Understanding the startup’s model development stage is crucial. Is it still in the experimentation phase, or has it already been deployed in the cloud, with real-world testing and validation? Mature models, especially those that have been deployed and tested, often carry less risk.

What Generative AI startups should share with you: Info about the current stage of testing and deployment.

The Data
The quality and quantity of training data are crucial factors in the performance and robustness of generative AI models for text generation. Startups must continually seek new, diverse, and relevant data sources to train their models, which could involve web scraping, partnerships with other companies, or user-generated data. However, acquiring data is just the first step; much of the work lies in cleaning and preprocessing, including removing irrelevant data, handling missing values, and resolving inconsistencies. This not only improves the quality of data but also enhances the model’s performance.
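
To illustrate what such a cleaning pass looks like in practice, here is a minimal sketch over a hypothetical table of scraped text samples; the column names and thresholds are assumptions for illustration.

```python
# Sketch of a basic cleaning/preprocessing pass over scraped text data.
# The column names and thresholds are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "text": ["<p>Great summary of the deal</p>", "Great summary of the deal",
             None, "ok", "Detailed analysis of the term sheet"],
    "source": ["web", "web", "partner", "web", "partner"],
})

df["text"] = df["text"].str.replace(r"<[^>]+>", "", regex=True).str.strip()  # strip HTML remnants
df = df.dropna(subset=["text"])                  # handle missing values
df = df[df["text"].str.split().str.len() >= 3]   # drop too-short, low-signal samples
df = df.drop_duplicates(subset=["text"])         # resolve duplicate entries

print(df)
```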

Training data should encompass a broad spectrum of language styles, domains, and topics, which helps to ensure that the AI model can handle varied inputs. It also aids in preventing model bias and bolstering the model’s generalizability. Startups can also look towards collaboration with academia, industry, and open-source communities to access publicly available datasets and shared resources, augmenting their own data collections.
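
One simple way to probe this is a quick audit of how the corpus is distributed across domains and styles. The sketch below assumes a hypothetical corpus structure and is only meant to show the kind of check worth asking a startup about.

```python
# Sketch of a quick audit of training-data diversity by domain and style.
# The corpus structure here is hypothetical.
from collections import Counter

corpus = [
    {"text": "...", "domain": "legal", "style": "formal"},
    {"text": "...", "domain": "marketing", "style": "casual"},
    {"text": "...", "domain": "legal", "style": "formal"},
    {"text": "...", "domain": "support", "style": "conversational"},
]

domain_counts = Counter(doc["domain"] for doc in corpus)
style_counts = Counter(doc["style"] for doc in corpus)

print("Documents per domain:", dict(domain_counts))
print("Documents per style:", dict(style_counts))
# A heavily skewed distribution is an early warning sign for bias and for
# poor generalization to underrepresented domains.
```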

Looking forward, startups must plan for future data acquisition, considering both quantity and variety. The approach to data collection should be ethically sound, respecting privacy and adhering to regulatory compliance, which is crucial for maintaining user trust and avoiding legal pitfalls.

The quality and volume of a startup’s data can significantly influence the model’s effectiveness and scalability, directly impacting the potential success of the startup. Moreover, you need to ensure that startups are adhering to ethical and legal standards in their data collection methods. Hence, understanding the startup’s strategies for maintaining and expanding their data repositories becomes a critical aspect of the investment decision-making process.

What Generative AI startups should share with you:
1. Metadata about their training data
2. Data strategy abstract
3. Roadmap of future data to integrate
4. Overview of data sources and data collection approach.

Conclusion
Evaluating generative AI startups focused on text generation requires a unique approach and understanding. Key considerations include the sophistication and architecture of the AI models in use, with a preference for transformer-based models such as GPT-3 and GPT-4. Also essential is the level of dependency on Large Language Models and the ability to fine-tune them. Performance and accuracy should be evaluated using a variety of metrics, including BLEU, ROUGE, and perplexity, complemented by task-based evaluation and testing by domain experts. Additionally, explainability of the model is key to understanding its decision-making process, and knowing the model’s development stage helps to ascertain its maturity and potential risk. Lastly, the volume and quality of data used for training significantly influence the model’s efficacy and scalability, with ethical and legal considerations surrounding data collection being paramount. Therefore, understanding these facets will ensure a comprehensive assessment of a generative AI startup’s potential as an investment opportunity.