A Budget Guide for Analyzing AI Company Funding with AI
[Note from Randy: This is a guest post from Howe about pulling and analyzing funding data, which I feel falls under the "sounds simple but the weeds get thorny quick" class of projects. Enjoy! Also, if you're interested in sharing a story or post on this newsletter, send me an email.]
Categorized and Analyzed AI Fundraising Output
Many of us in enterprise data science have likely attempted to convert unstructured text into structured tabular data. Personally, much of my work involves aggregating, categorizing, and interpreting various public data sources into actionable and understandable information, specifically regarding how the public sector spends on different technologies (Procure.FYI and FrontierOptic).
AI tools like LLMs significantly enhance efficiency by automating time-consuming tasks, but cost and accuracy still limit how far they can go in data processing. What is the right blend of human and AI task allocation? We believe we've finally reached a workable balance in both workflow and results. Here is a link to one year of AI fundraising data by category, region, and funding stage, and this article explains how we got there.
Why Is Categorization and Summary Useful for Consumer Data Products?
A consumer data product isn't useful without the proper context needed to infer insights and make comparisons. Think of a great consumer data product, like Kelley Blue Book. Car prices are only comparable for very similar types of cars with similar mileage and age. The same principle applies to general business intelligence from unstructured sources. What does a company or business actually do? What stage are they at? Who are their comparable companies, tools, or partners? Without comparable metrics, the data is just a number and can't be acted upon.
It takes time because you, or an analyst, typically have to go through multiple news articles and check the information against comparable companies to produce categorizations and summaries. Since our main job is curating niche public technology spending and financing data (Procure.FYI and FrontierOptic), this is the hardest part of what we do, but it significantly improves data usability when it works. Can we automatically categorize the companies and products we track and summarize transaction details? For most of the past year, it didn't work, partly because of my lack of optimization techniques and partly because of the limitations of language models. Recently, we finally achieved a practical level of accuracy in a small experiment, thanks largely to the advice and help of Lucas Tao, who helped write the optimization techniques.
Using AI Fundraising Data for Experimentation
We wanted a topic with abundant, regularly updated text to experiment on, so that we could eventually apply the process to other aggregated data once it worked well enough. We settled on tracking AI fundraising news. Everyone loves big price tags, and there are new fundraises every day or week. More eyeballs on the output also means faster feedback on accuracy. And although these companies can be complex, fundraising news typically comes with detailed descriptions of their businesses.
By feeding these soundbites to an LLM, can we summarize and categorize the data, converting unstructured information into categorized, searchable records? The goal is a table with company names, business verticals, stages, locations, and summaries, so people can quickly find the information they need, and we want to get there with as much automation and as little cost as possible.
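For concreteness, here is a rough sketch of the target record as a Pydantic model. The field names are illustrative placeholders rather than our exact production schema; note that the funding amount is captured by the virtual assistants described below, not by the LLM.

from typing import Optional
from pydantic import BaseModel

class FundingRecord(BaseModel):
    """One row of the target table; field names are illustrative."""
    company_name: str
    business_vertical: str                      # e.g. an infrastructure or vertical-application category
    funding_stage: str                          # e.g. "Seed", "Series B"
    location: str                               # headquarters city or region
    amount_raised_usd: Optional[float] = None   # quantitative data, extracted by humans in our pipeline
    summary: str                                # short description of the business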
Where the Information Comes From
Different business sites send out AI fundraising alerts because what’s the point of raising money if you don’t broadcast it to the universe? You can also get fundraising updates from Google Alerts, Crunchbase and Feedly. Additionally, AI- or data-focused blogs regularly discuss AI funding, such as Supervised, Interconnected, Twitter, or newsletters like Chief AI Officer that focus specifically on this topic. The information is everywhere, and there’s no shortage of companies making big announcements.
How Do We Extract Data on a Budget?
To ensure efficiency and flexibility in our small-scale project, we employ a combination of automated data pipelines and virtual assistants. We rely on virtual assistants, costing $3-5 per hour, to handle initial information extraction, particularly for relevant text selection and quantitative information. Virtual assistants can be sourced from platforms like Upwork.
Fully automating our information pipeline isn't practical due to accuracy concerns. Human oversight is needed, especially for quantitative data, where our error margin needs to be near zero; language models are not always accurate at extracting the figures we need. This minimal manual input reduces the errors often associated with fully automated systems.
We know that automating this process to achieve a certain level of accuracy through extensive few-shot training is possible. However, it requires significant engineering effort and can be costly, particularly due to the unstructured nature of news data. More structured data sources are easier to automate, but virtual assistants excel at repetitive tasks that would be too complex or expensive for language models. They handle extracting quantitative data and relevant text segments, which lightens the load on language models.
Cost Math… Tokens are not Cheap
At the same time, there are cost concerns: LLMs, especially at scale, are not cheap. LLMs work with tokens, whose length depends on the tokenizer being used. Using OpenAI's rough estimate of 1 token ~= 4 characters for their APIs, a 500-character payload translates to roughly 125 tokens. The main cost-optimization question is how much instruction overhead you carry relative to the data payload. If it takes 1,000 tokens of instruction prompting and few-shot examples to process 125 tokens of data, you end up with almost 10x the tokens of a system that simply runs the task on your 125-token payload. This matters mainly when you are charged per input token rather than a fixed monthly SaaS fee.
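The 4-characters-per-token heuristic is fine for budgeting, but exact counts are easy to get with OpenAI's tiktoken library. A small helper along these lines is what the get_tokens call in the chunking code later assumes (this version is our illustration, not the exact production helper):

import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models
encoding = tiktoken.get_encoding("cl100k_base")

def get_tokens(text: str) -> int:
    """Return the exact number of tokens in a piece of text."""
    return len(encoding.encode(text))

print(get_tokens("Acme AI raises $20M Series A to build agents for accountants."))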
For the categorization of new companies that raised funds during the week of April 28, for example, we collected information on 42 companies. The total number of data tokens was 2,347, which had to be broken into 5 chunks of 512 or fewer tokens. Each chunk required 783 tokens' worth of instruction prompting (category descriptions, prompt instructions, error-handling cases), for a total overhead of 3,915 tokens across all invocations. This is not a prompt optimized for maximal cost/performance, but it is relatively lean, with no few-shot examples. At GPT-4 Turbo's input rate, the overhead alone translates to about 3.9 cents, and while that is inexpensive for this use case, naive linear scaling puts 1M examples at roughly $928. In addition, if task complexity increases, few-shot examples become essential and a more descriptive prompt is needed; if fully automated, agents also need to scan large bodies of text to identify the relevant information. In such scenarios, it's important to consider the costs involved.
Just yesterday, GPT-4o was released, priced at $0.005 per 1,000 input tokens and $0.015 per 1,000 output tokens. Applying the same usage scenario, the cost would be half that of GPT-4 Turbo, or approximately $464 for 1M examples, making GPT-4o the more cost-effective option.
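To make the arithmetic concrete, here is a small back-of-envelope script using the figures from the April 28 batch and the per-1K input rates quoted above. It counts only the instruction overhead (as the 3.9-cent figure does) and scales linearly, so expect small rounding differences from the numbers in the text.

# Rough cost estimate for the April 28 batch, using the figures cited above
instruction_tokens = 783     # prompt overhead repeated for every chunk
num_chunks = 5               # chunks of <= 512 data tokens
num_examples = 42            # companies processed that week

overhead_tokens = instruction_tokens * num_chunks   # 3,915 tokens

# Input-token prices per 1K tokens (GPT-4 Turbo vs. GPT-4o at launch)
prices = {"gpt-4-turbo": 0.01, "gpt-4o": 0.005}

for model_name, price_per_1k in prices.items():
    weekly_cost = overhead_tokens / 1000 * price_per_1k
    cost_at_1m = weekly_cost / num_examples * 1_000_000   # naive linear scaling
    print(f"{model_name}: ${weekly_cost:.3f} for the week, ~${cost_at_1m:,.0f} at 1M examples")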
LLM Optimization Techniques We Used
After the early data extraction and relevant-text selection, which yields a clean, categorized set of text snippets containing the key details of each company's fundraising, we use an LLM to summarize each company's business and assign it to a business category. Let's talk about the techniques we used to improve LLM performance. We had been iterating toward usable results for the better part of a year, and they were consistently disappointing. A combination of improvements finally got us there: what worked this April was chunking, forced JSON output, and GPT-4 Turbo, thanks again to Lucas Tao.
- Chunking
Chunking is a method for breaking down large text data into manageable segments, known as chunks, to enable language models (LLMs) to better process and analyze them. This technique is especially valuable for handling extensive texts like documents, articles, or books. By applying effective chunking strategies, LLMs can better recognize linguistic patterns and relationships within the text.
Chunking reduces the amount of text a model needs to process at once by splitting it into manageable segments, each sent with its own copy of the instructions and a subset of the data. Transformer-based models have a fixed context window (512 tokens for many older models, far larger for recent GPT models), and even within that window, long unfocused inputs make it harder for the model to maintain context.
- Choosing the Right Chunk Size
Chunk size plays a significant role in determining how well a model handles text. Larger chunks, like whole paragraphs, maintain broader context but may overlook finer details. Smaller chunks, such as sentences or phrases, help capture precise meanings and nuances, improving information processing and retrieval. This finer granularity allows the model to discern user intent more accurately. Smaller chunks are ideal for our use case because they enable models to focus on details within paragraphs. By reducing text from potentially thousands of tokens to 512-token chunks, we ensure that each segment contains sufficient detail for the model to work effectively without being overwhelmed.
To find the best chunk size, we conducted a hyperparameter search within a predefined range. If multiple hyperparameters need optimization, a grid search would be appropriate. The objective is to maximize chunk size, reducing costs by minimizing instructional overhead, while keeping the error rate low to ensure accurate results.
To gauge the effectiveness of chunking, we evaluate how well the model understands and responds based on these chunks. Effective chunking allows the model to capture the intended nuances of the text, delivering appropriate responses that reflect a successful breakdown of input data.
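Here is a sketch of what that search can look like. The helpers chunk_rows, run_llm_on_chunks, and error_rate, and the hand-labeled validation set, are hypothetical stand-ins for pipeline pieces, not code from our repo.

# Hypothetical chunk-size sweep: maximize chunk size while keeping errors low
candidate_sizes = [128, 256, 512, 1024]   # token budgets to try
max_error_rate = 0.05                     # tolerance against a hand-labeled validation set

best_size = None
for size in candidate_sizes:
    chunks = chunk_rows(validation_df, chunk_size=size)    # split rows into <= size-token chunks
    outputs = run_llm_on_chunks(chunks)                    # categorize/summarize each chunk
    err = error_rate(outputs, validation_labels)           # compare against human labels
    print(f"chunk_size={size}: error={err:.2%}, chunks={len(chunks)}")
    if err <= max_error_rate:
        best_size = size   # larger sizes mean fewer chunks, hence less instruction overhead

print("Selected chunk size:", best_size)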
Our Chunking Code
Copy and paste away
# Assumes a pandas DataFrame df_clean, a get_tokens() token counter,
# and a parallel_llm() worker that runs the prompt on one chunk
import itertools
from multiprocessing import Pool

# Initialize chunking variables
start_idx = 0
curr_chunk_len = 0
chunks = []
# Token count per row (all columns concatenated as strings)
row_lens = df_clean.astype(str).sum(axis=1).apply(lambda x: get_tokens(x))

# Chunk the dataframe
for idx, length in enumerate(row_lens):
    if curr_chunk_len + length > chunk_size:
        chunks.append((df_clean.iloc[start_idx:idx], start_idx, idx, model, "column"))
        start_idx = idx       # Set new start index
        curr_chunk_len = 0    # Reset current chunk length
    curr_chunk_len += length

# Add the last chunk if not empty
if start_idx < len(df_clean):
    chunks.append((df_clean.iloc[start_idx:], start_idx, len(df_clean), model, "column"))

# Limit the number of chunks if specified
if num_examples:
    chunks = chunks[:num_examples]

# Process chunks in parallel using a pool of workers
with Pool(10) as p:
    result = p.map(parallel_llm, chunks)

# Combine results from all chunks
results = list(itertools.chain.from_iterable([output['outputs'] for output in result]))

- Forced JSON Output
Language models, by default, can generate outputs in various formats based on the input prompt's structure and the internal logic developed during training. This flexibility is generally useful but can be problematic in a production environment where consistent and structured output is needed. For instance, if a model sometimes returns a plain text summary and other times a list of key-value pairs, it would complicate downstream processes like data parsing and integration into applications.
Forcing the output to be in JSON format means that every output will follow a structured, predictable schema (e.g., key-value pairs). This consistency is vital for ease of integration: Systems that consume the model's output can reliably parse and process the data without needing complex logic to handle various formats.
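Before the file-writing code below, here is a minimal sketch of how the model call itself can be constrained, assuming the OpenAI Python SDK (v1), its JSON mode, and a hypothetical CompanySummary schema; chunk_text is a placeholder snippet, not real pipeline data.

from openai import OpenAI
from pydantic import BaseModel

# Hypothetical schema for a single summarized company
class CompanySummary(BaseModel):
    company_name: str
    category: str
    summary: str

client = OpenAI()

chunk_text = "Acme AI, a San Francisco startup building coding agents, raised a $20M Series A."  # placeholder

response = client.chat.completions.create(
    model="gpt-4-turbo",
    # JSON mode: constrains the model to emit syntactically valid JSON
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Return a JSON object with keys company_name, category, and summary."},
        {"role": "user", "content": chunk_text},
    ],
)

# Validate the structure with Pydantic; a ValidationError flags a malformed response for retry
record = CompanySummary.model_validate_json(response.choices[0].message.content)
print(record.category, "-", record.summary)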
- Forced JSON Output Code
import json

# Create a dictionary to store metadata and results
final = {
    "metadata": {
        # Time taken to execute the entire process
        "exec_time": execution_time,
        # Total number of data tokens processed
        "data_tokens": data_tokens,
        # Number of instruction tokens used per chunk
        "instruction_tokens": instruction_tokens,
        # Total instruction-token overhead across all chunk invocations
        "overhead": instruction_tokens * len(chunks)
    },
    # Collect the results, converting each Pydantic result object to a JSON-serializable dict
    "results": [out.model_dump(mode='json') for out in results]
}

# Write the 'final' dictionary to a JSON file
# The filename is built from the model and chunk size variables
with open(f"{model}_{chunk_size}_summary.json", "w") as f:
    # 'indent=2' keeps the output readable; 'sort_keys=True' sorts keys alphabetically
    json.dump(final, f, indent=2, sort_keys=True)

- GPT-4 Turbo and GPT-4o Performance
Earlier in April 2024, GPT-4 Turbo received an update that significantly improved its performance. It supports a context length of 128K tokens, equivalent to over 300 pages of text at once, which helps it comprehend lengthy articles, complex reports, and detailed conversations and produce more coherent, logically consistent text. Unlike its predecessors, its training data extends to December 2023, so it better reflects recent societal context, cultural concepts, and technological developments and gives more accurate, relevant responses.
Just yesterday, GPT-4o was released. We have not yet run it on our specific use case, but initial reports suggest its performance is on par with GPT-4 Turbo. Since GPT-4o is also cheaper, it is a potentially cost-effective option for anyone looking to reduce expenses while maintaining similar quality.
- If you're looking to reduce costs even further
OpenAI offers a "Batch API" that allows you to send asynchronous batches of requests at 50% lower costs. The only trade-off is that turnaround times can be up to 24 hours. This was launched on April 15th 2024 and announced on OpenAI's community forum, is still relatively unknown among developers.
To use this API, you will need to set up a process that periodically checks the status and retrieves the latest results. Despite the added effort, the potential cost savings can be substantial, making it an attractive option for those who prioritize budget over speed.
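A minimal sketch of that polling flow with the OpenAI Python SDK, assuming your requests have already been written to a JSONL file (one chat-completion request per line); the file name and the 10-minute polling cadence are illustrative.

import time
from openai import OpenAI

client = OpenAI()

# 1. Upload a JSONL file where each line is one chat-completion request
batch_input = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

# 2. Create the batch against the chat completions endpoint
batch = client.batches.create(
    input_file_id=batch_input.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll periodically; results can take up to 24 hours
while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(600)  # check every 10 minutes

# 4. Download the results file (also JSONL, one response per line)
if batch.status == "completed":
    results_jsonl = client.files.content(batch.output_file_id).text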
So this is how we created this database on a budget. From here, you can parse the data into your favorite analysis tool and work with it. We hope this guide provides useful insights for anyone looking to create their own data analytics pipeline without breaking the bank.
Howe Wang is the cofounder of Meow Catalog, a product studio offering investment and due diligence data products like Procure.FYI and FrontierOptic. He specializes in converting disparate public data into actionable products for investment due diligence. Previously, he worked at Lyft in analytics, machine learning, and data product management.