What the agents wrote

Exhaust Systems for Mazda Miata (MX‑5): The Ultimate Buyer’s Guide for NA, NB, NC & ND

Meta description:

Find the best exhaust for your Miata — sound, power, fitment, and legal tips for NA/NB/NC/ND. Brands, dyno expectations, and a practical buying checklist.

More …

Building agents with CrewAI

What was the goal?

The high-level goals for this experiment with CrewAI were pretty straightforward:

  1. Get some first-hand experience with CrewAI by using it to create agents.
  2. Understand how CrewAI works, with the aim of comparing it against other frameworks over the next few weeks/months.
  3. Learn enough to use it both for work and for side projects that might arise in the future.

The actual code / agents / tasks were created to write an authoritative, well-researched article / blog post on a subject near and dear to my heart: Miatas (see the post about a future/past Miata-bot here). Could I get a set of agents to write an interesting post about a Miata-related topic I was familiar with (namely aftermarket exhaust systems)? This way, I would not only be in a position to assess the accuracy and quality of what was written, but hopefully learn something new as well.

More …

Building GenAI applications

I have been building two GenAI applications with a client for the last six months, so this post is a reflection on some of the things I have been thinking about, without being too specific about the features or details of those applications. I am a big fan of perplexity.ai, so I will use it to illustrate some points where applicable.

More …

Miata-bot - Part 4 - Choosing a LM

A common design pattern involves creating a wrapper around a commercial provider's language models using their APIs and a framework like LangChain or LlamaIndex. Providers include OpenAI (GPT-n), Google (Gemini/Gemma), Cohere, Anthropic (Claude), etc. Fundamentally, this is a reasonable approach if the following are true:

  • The task’s value justifies the cost of employing closed-source models at scale.
  • The model provides satisfactory results without requiring fine-tuning.
  • Your domain permits sharing data with these providers to enhance their models.
  • Your organization accepts the reduced transparency associated with closed-source commercial models.

The deliberate choice being made for this Miata-bot project is to not use any commercial GenAI models or tools, in order to retain the flexibility (to swap LMs and frameworks) and cost-effectiveness that come from using open-source models and tools.
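That swap-friendly design can be sketched as a thin, provider-agnostic interface. This is a minimal sketch, not the project's actual code: `StubLocalModel` is a hypothetical stand-in for whatever open-source backend (e.g. a llama.cpp or transformers pipeline) is eventually chosen.

```python
from typing import Protocol


class LanguageModel(Protocol):
    """The only LM surface the rest of the bot depends on."""

    def generate(self, prompt: str) -> str: ...


class StubLocalModel:
    """Hypothetical stand-in for a locally hosted open-source model;
    a real backend would wrap e.g. a llama.cpp or transformers call."""

    def generate(self, prompt: str) -> str:
        return f"[stub answer to: {prompt}]"


def ask(model: LanguageModel, question: str) -> str:
    # Callers never touch a provider SDK directly, so swapping the
    # backend (open-source today, commercial later) is a one-line change.
    return model.generate(question)


print(ask(StubLocalModel(), "What oil does an NA Miata take?"))
```

Because callers only see the `LanguageModel` protocol, replacing the backend, or the framework around it, never ripples through the rest of the codebase.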

More …

Miata-bot - Part 3 - Data Prep

Now that some data has been collected (see details here), what are some considerations for the next steps? As it turns out, how the data is prepared really depends both on which task you're training your language model (LM) for and on which framework you're using.

Data Processing

So let’s start with what the task is and what this means for the data:

  • The current design for the Miata-bot is to use a Question-Answering (QA) Language Model (LM): users ask it questions about any generation of Miata and the bot provides ‘useful’ answers (more about this later).
    • So the data used to fine-tune the LM has to be structured as question-answer pairs.
    • The original/first post is the question, and every response post is assumed to be part of the answer.
  • Should all posts be included? The short answer is probably not, as recent improvements in LM performance have largely been attributed to better training data. So, if the focus is on using only high-quality data, how do we identify high-quality threads, posts, and replies? Some possibilities include:
    • Using only threads with a minimum number of replies (at least 5 replies in this dataset).
    • Keeping only responses that contain at least 1 sentence with at least 7 words.
    • Using a readability measure such as the Gunning fog index and keeping only responses at grade level 8 or higher.
  • What about metadata, if available? Should it be included as part of the context for each prompt?
    • Posts often contain quotes. Quotes are included in the thread, but should they go into the context or into the response?
    • Posts often contain links. Should the bot fetch the text/PDFs behind the links, and then:
      • Provide the links as part of its response?
      • Or use the text from the link (HTML/PDF) as context to improve its response?
    • Posts often contain images. Do we handle them or ignore them? We’re ignoring images for now, until the rest of the bot is figured out.
  • What about pre-processing the data, as done in regular NLP tasks? Typically, other than some basic hygiene, text input to an LM is mostly not pre-processed, as the models themselves have a tokenizer that takes care of any pre-processing that’s needed.
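The quality heuristics above can be sketched as simple filters. This is a rough sketch, assuming threads arrive as lists of post strings with the question first; the Gunning fog calculation uses a crude vowel-group syllable count, so it is only an approximation of the real index.

```python
import re


def looks_useful(reply: str, min_words: int = 7) -> bool:
    """Heuristic reply filter: at least one sentence with >= min_words words."""
    sentences = re.split(r"[.!?]+", reply)
    return any(len(s.split()) >= min_words for s in sentences)


def approx_fog_index(text: str) -> float:
    """Rough Gunning fog estimate; 'complex' word = 3+ vowel groups."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()] or [text]
    complex_words = [w for w in words
                     if len(re.findall(r"[aeiouy]+", w.lower())) >= 3]
    if not words:
        return 0.0
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))


def keep_thread(posts: list[str], min_replies: int = 5,
                min_grade: float = 8.0) -> bool:
    """Thread-level filter combining the heuristics discussed above.

    posts[0] is assumed to be the question; everything after it is
    treated as part of the answer.
    """
    replies = posts[1:]
    if len(replies) < min_replies:
        return False
    body = " ".join(replies)
    return looks_useful(body) and approx_fog_index(body) >= min_grade
```

In practice a library such as textstat would give a more faithful fog score, but the thresholds (reply count, sentence length, grade level) are the tunable part.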

Data splitting

As it turns out, there are 4 generations of the Miata that are referred to as the NA, NB, NC and ND.

  • Further, the forums / subreddits themselves are divided by generation, which makes sense as each generation is different and has different specifications.
  • So, succinctly, the plan is to split the data into 3 sets (NA+NB, since those two generations are very similar; NC; and ND) and train 1 LM per generation. An example of a generation-specific question would be: How do you install an exhaust system upgrade on an ND Miata?
  • We also have non-generation-specific data; all of that data will be used to train a general LM that is not generation-specific. An example of a non-generation-specific question would be: What are some fun routes to drive my Miata in PA around Philadelphia?
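The split above amounts to a small routing table from forum section to training bucket. A minimal sketch, with hypothetical section names and questions:

```python
# NA and NB are similar enough to share one LM; anything without a
# recognised generation label falls through to the general model.
GENERATION_BUCKETS = {
    "na": "na_nb",
    "nb": "na_nb",
    "nc": "nc",
    "nd": "nd",
}


def bucket_for(section: str) -> str:
    """Map a forum/subreddit section name to a training bucket."""
    return GENERATION_BUCKETS.get(section.strip().lower(), "general")


# Hypothetical (section, question) pairs standing in for real threads.
threads = [
    ("NA", "How do I adjust the timing on a 1.6?"),
    ("ND", "How do you install an exhaust upgrade?"),
    ("General", "Fun routes near Philadelphia?"),
]

buckets: dict[str, list[str]] = {}
for section, question in threads:
    buckets.setdefault(bucket_for(section), []).append(question)

print(buckets)
```

Each bucket then becomes the fine-tuning set for its own LM, with the "general" bucket feeding the non-generation-specific model.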

Reddit data

I was very hopeful that data (text, images, links) from different subreddits was going to be useful for LM training. As it turns out, there are some problems with using data from Reddit, as detailed below. This once again illustrates how getting high-quality, clean data is a major challenge when training LMs.

  • Signal to noise: The biggest challenge with using data from subreddits is quality. The signal-to-noise ratio is pretty low, and building a signal detector just for Reddit data is a ‘tomorrow’ problem for now.
  • Reddit API limitations: The official Reddit API limits listing results to 1000 submissions, meaning that instead of getting all submissions to a particular subreddit, you can only retrieve the newest/top/most controversial 1000 submissions, depending on which sorting method you use. This severely limits what can be done.
  • Data poisoning: Recently, there have been reports of Reddit users adding nonsensical content to poison LM training datasets.

Considering the concerns above, it makes sense to take a different approach to getting data from Reddit. One idea I'm considering is using the Reddit search API to retrieve top submissions and posts as a data source.

Data Stats

  • Number of conversations: 74,360
  • Number of posts: 4,639,712
  • Number of external links: 7,573,639
  • Number of images: 8,810,540
