
Lessons after a half-billion GPT tokens

CTO @ Truss | Former VP of Engineering and Head of Security @ FiscalNote | ex-PKC co-founder | princeton tiger '11 | writes on engineering, management, and security.

My startup Truss (gettruss.io) released a few LLM-heavy features in the last six months, and the narrative around LLMs that I read on Hacker News is now starting to diverge from my reality, so I thought I’d share some of the more “surprising” lessons after churning through just north of 500 million tokens, by my estimate.

Some details first:

– we’re using the OpenAI models, see the Q&A at the bottom if you want my opinion of the others

– our usage is 85% GPT-4, and 15% GPT-3.5

– we deal exclusively with text, so no gpt-4-vision, Sora, whisper, etc.

– we have a B2B use case – strongly focused on summarize/analyze-extract, so YMMV

– 500M tokens actually isn’t as much as it seems – it’s about 750,000 pages of text, to put it in perspective

Lesson 1: When it comes to prompts, less is more

We consistently found that not enumerating an exact list or set of instructions in the prompt produced better results, as long as the thing in question was already common knowledge. GPT is not dumb, and it actually gets confused if you over-specify.

This is fundamentally different from coding, where everything has to be explicit.

Here’s an example where this bit us:

One part of our pipeline reads some block of text and asks GPT to classify it as relating to one of the 50 US states, or the Federal government. This is not a hard task – we probably could have used string/regex, but there are enough weird corner cases that it would’ve taken longer. So our first attempt was (roughly) something like this:

Here's a block of text. One field should be "locality_id", and it should be the ID of one of the 50 states, or federal, using this list:
[{"locality: "Alabama", "locality_id": 1}, {"locality: "Alaska", "locality_id": 2} ... ]

This worked sometimes (I’d estimate >98% of the time), but failed enough that we had to dig deeper.

While we were investigating, we noticed that another field, name, was consistently returning the full name of the state…the correct state – even though we hadn’t explicitly asked it to do that.

So we switched to a simple string search on the name to find the state, and it’s been working beautifully ever since.

In summary, I think a better approach would’ve been: “You obviously know the 50 states, GPT, so just give me the full name of the state this pertains to, or Federal if this pertains to the US government.”
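To make that concrete, here’s a minimal sketch of the “ask for the name, map it yourself” approach – the ask_gpt helper and the abridged LOCALITY_IDS table below are illustrative, not our production code:

LOCALITY_IDS = {"Alabama": 1, "Alaska": 2, "Federal": 51}  # abridged – the real table would list all 50 states

def classify_locality(text, ask_gpt):
    """Ask GPT for the full state name, then do the ID lookup ourselves."""
    prompt = (
        "You obviously know the 50 US states. Give me just the full name of the state "
        "this text pertains to, or 'Federal' if it pertains to the US government.\n\n"
        f"Text:\n{text}"
    )
    answer = ask_gpt(prompt).strip().lower()
    # Simple string search on the answer instead of making GPT pick an ID
    for name, locality_id in LOCALITY_IDS.items():
        if name.lower() in answer:
            return locality_id
    return None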

Why is this crazy? Well, it’s crazy that GPT’s quality and generalization can improve when you’re more vague – this is a quintessential marker of higher-order delegation / thinking.

(Random side note one: GPT was failing most often with the M states — Maryland, Maine, Massachusetts, Michigan — which you might expect of a fundamentally stochastic model.)

(Random side note two: when we asked GPT to choose an ID from a list of items, it got confused a lot less when we sent the list as prettified JSON, where each state was on its own line. I think \n is a stronger separator than a comma.)
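Concretely, the prettified version is trivial to produce – a quick Python illustration, not our pipeline code:

import json

localities = [
    {"locality": "Alabama", "locality_id": 1},
    {"locality": "Alaska", "locality_id": 2},
    # ...the rest of the 50 states
]

# indent=2 puts each state on its own line, so "\n" does the separating
prompt_list = json.dumps(localities, indent=2)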

Lesson 2: You don’t need langchain. You probably don’t even need anything else OpenAI has released in their API in the last year. Just chat API. That’s it.

Langchain is the perfect example of premature abstraction. We started out thinking we had to use it because the internet said so. Instead, millions of tokens and probably 3-4 very diverse LLM features in production later, our openai_service file still has only one 40-line function in it:

def extract_json(prompt, variable_length_input, number_retries)

The only API we use is chat. We always extract json. We don’t need JSON mode, or function calling, or assistants (though we do all that). Heck, we don’t even use system prompts (maybe we should…). When gpt-4-turbo was released, we updated one string in the codebase.

This is the beauty of a powerful generalized model – less is more.

Most of the 40 lines in that function are error handling for the OpenAI API’s regular 500s/socket-closed errors (though it’s gotten better, and given their load, it’s not surprising).
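For the curious, here’s roughly the shape of that function as a Python sketch – the model name, backoff, and error handling below are illustrative, not our exact implementation:

import json
import time
from openai import OpenAI  # the official openai Python SDK

client = OpenAI()

def extract_json(prompt, variable_length_input, number_retries=3):
    # One entry point: send a prompt, expect JSON back, retry transient failures.
    full_prompt = f"{prompt}\n\n{variable_length_input}"
    last_err = None
    for attempt in range(number_retries):
        try:
            resp = client.chat.completions.create(
                model="gpt-4-turbo",  # swapping models is a one-string change
                messages=[{"role": "user", "content": full_prompt}],
            )
            return json.loads(resp.choices[0].message.content)
        except Exception as err:  # 500s, closed sockets, malformed JSON – retry them all here
            last_err = err
            time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"extract_json failed after {number_retries} attempts") from last_err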

There’s some auto-truncating we built in, so we don’t have to worry about context length limits. We have our own proprietary token-length estimator. Here it is:

if s.length > model_context_size * 3
  s = s[0, model_context_size * 3] # truncate it! (assumes ~3 characters per token)
end

It fails in corner cases when there are a LOT of periods or numbers (the token ratio is < 3 characters/token for those). So there’s another very proprietary piece of try/catch retry logic:

if response_error_code == "context_length_exceeded"
  s = s[0, (model_context_size * 3 / 1.3).to_i] # retry with a more conservative chars-per-token ratio
end

We’ve gotten quite far with this approach, and it’s been flexible enough for our needs.

Lesson 3: improving latency with the streaming API and showing users variable-speed typed words is actually a big UX innovation with ChatGPT.

We thought this was a gimmick, but users react very positively to variable-speed “typed” characters – this feels like the mouse/cursor UX moment for AI.
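If you haven’t tried it, streaming takes very little code – a minimal Python sketch against the chat API (the model and prompt here are placeholders):

from openai import OpenAI  # the official openai Python SDK

client = OpenAI()

# Flush each chunk as it arrives so the user sees the "typing" effect
# instead of staring at a spinner until the full response lands.
stream = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Summarize this block of text: ..."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)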

Lesson 4: GPT is really bad at producing the null hypothesis

“Return an empty output if you don’t find anything” is probably the most error-prone prompting language we came across. (Editorial note based on HN comments: a lot of people are taking this literally – we don’t actually ask it to return nothing; it returns some string representation of blank, e.g. {value: “”}. So this is not an issue of GPT failing to limit its output.) It seems GPT actually has trouble discerning the null hypothesis, from what I can tell, no matter what prompt-hackery you use. Not only does GPT often choose to hallucinate rather than return nothing, it also just lacks confidence a lot, returning blank more often than it should.

Most of our prompts are in the form:

“Here’s a block of text that’s making a statement about a company, I want you to output JSON that extracts these companies. If there’s nothing relevant, return a blank. Here’s the text: [block of text]”

For a time, we had a bug where [block of text] could be empty. The hallucinations were bad. Incidentally, GPT loves to hallucinate bakeries; here are some great ones:

  • Sunshine Bakery
  • Golden Grain Bakery
  • Bliss Bakery

Fortunately, the solution was to fix the bug and not send a prompt at all if there was no text (duh!). But it’s trickier when “it’s empty” is harder to define programmatically, and you actually do need GPT to weigh in.
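For the simple case, the guard is about as boring as you’d expect – roughly this, with illustrative names:

def extract_companies(block_of_text, ask_gpt):
    # Don't call GPT at all if there's nothing to analyze – that's our null result.
    if not block_of_text or not block_of_text.strip():
        return []
    prompt = ("Here's a block of text that's making a statement about a company. "
              "Output JSON that extracts these companies. If there's nothing relevant, "
              f"return a blank. Here's the text: {block_of_text}")
    return ask_gpt(prompt)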

Lesson 5: “Context windows” are a misnomer – and they are only growing larger for input, not output

Little known fact: GPT-4 may have a 128k-token window for input, but its output window is still a measly 4k! Calling it a “context window” is confusing, clearly.

But the problem is even worse – we often ask GPT to give us back a list of JSON objects. Nothing complicated, mind you: think an array of JSON tasks, where each task has a name and a label.

GPT really cannot give back more than 10 items. Trying to have it give you back 15 items? Maybe it does it 15% of the time.

We originally thought this was because of the 4k context window, but we’d hit 10 items at only maybe 700-800 tokens, and GPT would just stop.

Now, you can of course trade in output for input by giving it a prompt, ask for a single task, then give it (prompt + task), ask for the next task, etc. But now you’re playing a game of telephone with GPT, and have to deal with things like Langchain.
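To sketch what that telephone game looks like (illustrative Python – ask_gpt is assumed to return parsed JSON, and the prompt wording isn’t ours):

import json

def extract_tasks_one_at_a_time(document, ask_gpt, max_tasks=20):
    # Trade output tokens for input tokens: ask for a single task per call,
    # feeding back what's already been extracted so GPT doesn't repeat itself.
    tasks = []
    for _ in range(max_tasks):
        prompt = (
            "Here is a document and the tasks extracted from it so far.\n\n"
            f"Document:\n{document}\n\n"
            f"Already extracted: {json.dumps(tasks)}\n\n"
            'Return JSON for exactly one new task, or {"done": true} if there are none left.'
        )
        result = ask_gpt(prompt)
        if result.get("done"):
            break
        tasks.append(result)
    return tasks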

Lesson 6: vector databases and RAG/embeddings are mostly useless for us mere mortals

I tried. I really did. But every time I thought I had a killer use case for RAG / embeddings, I was confounded.

I think vector databases / RAG are really meant for Search. And only search. Not search as in “oh – retrieving chunks is kind of like search, so it’ll work!”, but real google-and-bing search. Here are some reasons why:

  1. there’s no cutoff for relevancy. There are some solutions out there, and you can create your own cutoff heuristics for relevancy, but they’re going to be unreliable. This really kills RAG in my opinion – you always risk either poisoning your retrieval with irrelevant results or, by being too conservative, missing important results.
  2. why would you put your vectors in a specialized, proprietary database, away from all your other data? Unless you are dealing at a google/bing scale, this loss of context absolutely isn’t worth the tradeoff.
  3. unless you are doing a very open-ended search, of say – the whole internet – users typically don’t like semantic searches that return things they didn’t directly type. For most applications of search within business apps, your users are domain experts – they don’t need you to guess what they might have meant – they’ll let you know!

It seems to me (this is untested) that a much better use of LLMs for most search cases is to use a normal completion prompt to convert a user’s search into a faceted search, or even a more complex query (or heck, even SQL!). But this is not RAG at all.
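A hand-wavy sketch of what I mean (untested – the facets and helper names here are made up):

FACET_PROMPT = """Translate the user's search into filters for a faceted search.
Return JSON with keys: "state" (full state name or null), "company" (string or null),
"date_from" and "date_to" (ISO dates or null), and "keywords" (a list of strings).

User search: {query}
"""

def search_to_facets(query, ask_gpt):
    # ask_gpt is assumed to return parsed JSON (e.g. an extract_json-style helper)
    return ask_gpt(FACET_PROMPT.format(query=query))

# The resulting filters go straight into a normal database query / SQL – no embeddings involved.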

Lesson 7: Hallucination basically doesn’t happen.

Every use case we have is essentially “Here’s a block of text, extract something from it.” As a rule, if you ask GPT to give you the names of companies mentioned in a block of text, it will not give you a random company (unless there are no companies in the text – there’s that null hypothesis problem!).

Similarly — and I’m sure you’ve noticed this if you’re an engineer — GPT doesn’t really hallucinate code – in the sense that it doesn’t make up variables, or randomly introduce a typo in the middle of re-writing a block of code you sent it. It does hallucinate the existence of standard library functions when you ask it to give you something, but again, I see that more as the null hypothesis. It doesn’t know how to say “I don’t know”.

But if your use case is entirely, “here’s the full context of details, analyze / summarize / extract” – it’s extremely reliable. I think you can see a lot of product releases recently that emphasize this exact use case.

So it’s all about good data in, good GPT responses out.

Conclusion: where do I think all this is heading?

Rather than responding with some long-form post (edit: I then subsequently did write a follow up post with more thoughts, because there were some very thought-provoking / inspiring points in the HN comments), here’s a quick from-the-hip Q&A:

Are we going to achieve AGI?

No. Not with this transformers + the data of the internet + $XB infrastructure approach.

Is GPT-4 actually useful, or is it all marketing?

It is 100% useful. This is the early days of the internet still. Will it fire everyone? No. Primarily, I see this lowering the barrier of entry to ML/AI that was previously only available to Google.

Have you tried Claude, Gemini, etc?

Yeah, meh. Actually in all seriousness, we haven’t done any serious A/B testing, but I’ve tested these with my day to day coding, and it doesn’t feel even close. It’s the subtle things mostly, like intuiting intention.

How do I keep up to date with all the stuff happening with LLMs/AI these days?

You don’t need to. I’ve been thinking a lot about The Bitter Lesson – that general improvements to model performance outweigh niche improvements. If that’s true, all you need to worry about is when GPT-5 is coming out. Nothing else matters, and everything else being released by OpenAI in the meantime (not including Sora, etc, that’s a whooolle separate thing) is basically noise.

So when will GPT-5 come out, and how good will it be?

I’ve been trying to read the signs with OpenAI, as has everyone else. I think we’re going to see incremental improvement, sadly. I don’t have a lot of hope that GPT-5 is going to “change everything”. There are fundamental economic reasons for that: between GPT-3 and GPT-3.5, I thought we might be in a scenario where the models were getting hyper-linear improvement with training: train it 2x as hard, it gets 2.2x better.

But that’s not the case, apparently. Instead, what we’re seeing is logarithmic. And in fact, token speed and cost per token are growing exponentially for incremental improvements.

If that’s the case, there’s some Pareto-optimal curve we’re on, and GPT-4 might be optimal: whereas I was willing to pay 20x for GPT-4 over GPT-3.5, I honestly don’t think I’d pay 20x per token to go from GPT-4 to GPT-5, not for the set of tasks that GPT-4 is used for.

GPT-5 may break that. Or, it may be the iPhone 5 to the iPhone 4. I don’t think that’s a loss!

25 Comments

  1. David Vandervort

    Some good learning here. I think the reason GPT-5 won’t be soon or super-impressive is that we’ve gotten most of the improvement from adding training data that we’re going to get. Instead, the next leap in capability is going to require some kind of enhancement to the transformer model itself. (But if I knew what that was, I would probably be developing it myself and getting ready to become a billionaire.)

    Thanks for this insightful post.

  2. Yacov Lewis

    Great piece! My experience around Langchain/RAG differs, so wanted to dig deeper:
    Putting some logic around handling relevant results helps us produce useful output. Curious what differs on your folks’ end.

    • Ken

      This is a great point! YMMV is very important here – it’s very possible your use-case for RAG is just something we haven’t had to deal with yet! Maybe some details would interest you:

      1. Our use case is (1) entirely text based, and (2) exclusively “un-creative”/bounded – meaning, we can assume a lot about the user’s inputs for a given prompt.
      2. In the whole precision vs recall spectrum, we tend towards…(I had to look this up) – precision, I think! ;). Basically, we tend to be extremely conservative – we’d rather discard potentially relevant results just in case than keep all the relevant results plus, occasionally, some irrelevant ones. RAG doesn’t seem to be very conducive or helpful if you want precision.

  3. Wilhelm

    > Well, it’s crazy that GPT’s quality and generalization can improve when you’re more vague – this is a quintessential marker of higher-order delegation / thinking.

    Would you mind justifying this statement so I can understand what you mean?

    • Ken

      For sure! I do realize that was a bit out of nowhere.

      Generalization
      I tell my engineers that progression to more senior levels fundamentally is about increasing levels of delegation. What are the levels of delegation? Well, for entry-level folk, they get delegated tasks that have all the steps spelled out explicitly. As that entry-level engineer grows, they can handle tasks that are increasingly vague. At the senior level, the tasks can basically be:

      “Add a payment system, we need it badly, but not so badly that if Johnny needs help on a PR, do that first. Oh, and probably use Stripe unless you find something better.”

      When you’re CTO, your task that you’ve been entrusted with by the CEO is basically “find ways we can use technology to create value.”

      Quality
      On top of that, at these higher levels of delegation, the quality of the outcome can improve. Because it’s vague, you can get solutions that you’d never have specified to a junior person doing the task. So not only are you giving more vague guidance, the results you expect are better than if you gave more specific guidance.

      That’s why you hear so many engineering leaders say “Give your people freedom to explore” – it’s a recognition of this dynamic.

      So, to see GPT improve quality when you’re more vague really does seem impressive to me, when it happens!

    • Audiala AI Tour Guide

      We noticed the same thing: giving more instructions that would better detail the task would sometimes give better results, but would fail more often. The simpler, the better. But you also need to think a bit like an LLM does to improve your results: returning the country code is harder than returning the full country name, for example, since abbreviating is another layer of complexity and there are probably fewer examples of the abbreviated names than of the full names in the data set. Choose what must have been the most prevalent item in the training data.

  4. Civitello

    Regarding null hypothesis for asking for a list of companies in a block of text, would this work:
    Make it two steps, first:
    > Does this block of text mention a company?
    If no, good you’ve got your null result.
    If yes:
    > Please list the names of companies in this block of text.

    • Ashlynn Antrobus

      40 lines is waaaaaaay too long for a single function. That should be multiple functions.

      That said, based on your use case described, and what I’ve been experimenting with, I think you will like Gemini 1.5

  5. Michał Flak

    I think you could benefit from two things, especially since “Our use case is […] exclusively “un-creative”/bounded”:
    1) Using OpenAI JSON mode, it’s made exactly for your use case and could save you retries
    2) Spinning up an open source LLM on your machine – they work really well for tasks like this, especially when coupled with a constrained generation tool like Outlines or Guidance. You can guarantee adherence to schema and avoid wasting time on “fluff” tokens like parentheses or keys, only generating the value tokens. It could greatly save costs.

    I’ve written about it some time ago and tooling has progressed since then, but you may want to have a look: https://monadical.com/posts/how-to-make-llms-speak-your-language.html

  6. alex sharp

    > Heck, we don’t even use system prompts (maybe we should…).

    I’m using 3.5-turbo and 4 for similar-ish use-cases to extract json, and various text processing and classification tasks. I found both models were much better aligned generally for both the classification tasks and the “hey always give me json” (i’ve since moved to function calling, highly recommend) when using the system prompts on both model versions. 3.5-turbo especially was a big improvement when moving to a system prompt. Hope this helps. Cheers.

  7. Phil

    Awesome post. Thanks for the insights and practical takeaways (not over engineering things). Super useful article I’ll be sending to my coworkers.

  8. Phillip Carter

    This was a great post. Genuinely enjoyed another one from someone learning real lessons!

    About RAG, I feel like it’s certainly useful for mere mortals. Me and my company are one of them. However, several of our features boil down to an explicit search + generation process, and so RAG is quite useful for us there. It’s really simple though. Just a bunch of vectors stored in Redis. We don’t even use Redis’ built-in vector search because it was so easy to just write code that fetches a group of embeddings (each group is per-user, effectively) and run search in memory. It’s fast and nowhere near the performance bottleneck.

    However, I think people try to apply RAG to very complex problems and find that it’s a lot harder than just cosine similarity on large blobs of text. In our case, that’s actually plenty suitable for the job we have to do. And further, we find that there’s often a sweet spot for GPT to sort through what’s _actually_ relevant. Vector search can often yield results that aren’t actually useful, but if you increase the window big enough (but not too big) it often will include what you’d actually want, and GPT can figure it out. I wish there was a way where you could get a sense for how useful RAG would be without having to test it so much, though.

  9. Bill

    Hi Ken,

    Would you pay more per token if the price per token is increased?

    • Ken

      This is the right question to ask. I’ll answer it in three ways:

      1. In an overly-literal sense: No I wouldn’t like to pay more, because I’d be experiencing loss-aversion (see Prospect Theory – humans hate paying more for something they paid less for previously). I’d probably take a much more serious look at other models, and do A/B testing.
      2. More how you probably meant it, I was paying about 3x more a few months back, because GPT-4 was more than GPT-4-turbo. So, yes I’d pay at least 3x more.
      3. Say GPT-5 came out and it made our use case quite a bit better / fewer errors / less careful prompting, and more importantly, removed some of the issues I mentioned in this article. How much more would I pay? My guess is, not 20x. Probably 4x is stretching it? At 4x I’d probably take optimizations more seriously and be more discerning about when we use GPT-3.5 vs 4 vs 5. Also, now that we’re in production and not building a POC, I know exactly how much more 10x would cost us, and it just wouldn’t be tenable (which is very classically a build vs buy trade-off decision – so GPT isn’t fundamentally different than our other technologies).

      great question!

  10. Felarof

    If you want to A/B test GPT and Claude 3, try out http://www.clashofgpts.com

  11. Asim Shrestha

    Thanks for the post! A few questions

    1. What do you folks do to measure/validate performance? Are you using any evals currently?

    2. What exactly were you using RAG for within the use case you described?

    • Ken

      good questions.

      The first one, I’ll repost what I put on linkedin:

      This is one of those semi-embarrassing answers – we honestly didn’t have time to evaluate things systematically when we were building the features. Instead, my co-founder and I basically came up with this logic/hack:

      We can avoid the extra time it’d take to evaluate / tinker if we choose to build with what we believe is the best model that money can buy – GPT-4. It may cost us more and be overkill to do this in some cases, but the tradeoff is that we’ll never have to think “Ugh, would another model work better?” when we’re stuck on a terrible prompt problem during development. If an approach won’t work with GPT-4, it just won’t work period, and we knew we’d have to try a different approach – not just a different model.

      That decision was surprisingly freeing when we were building.

      For the second one, we were trying to chunk long-form documents and generate particular tasks based on whether the document mentioned certain things. Our RAG flow would have been:
      1. chunk the document into paragraphs / tables / sections
      2. embed that chunk
      3. for each category of task, run a query against the embeddings to retrieve relevant chunks
      4. pass those chunks to GPT along with a prompt to generate the task.

      The problem is, the queries would never not retrieve results. And there was no good way of handling this, as far as I could tell. Most relevancy scores are intentionally not normalized, and things like step-function cutoffs just weren’t reliable when we tried. We were running this against maybe 50 categories, and the “hit rate” was supposed to be sparse – maybe only 20% of those categories would match any given document – so even a low false-positive rate became magnified. And since we were always passing in chunks along with a prompt, relevant or not, GPT would too often give us back hallucinated tasks. It’s not that it never worked, it’s just that it failed occasionally – which, when building a real feature with output that people have to deal with, didn’t work in our case.

      Hope that helps!

  12. sungho

    What do you think about fine tuning, fine tuning cheaper models to reduce costs?
    GPT is expensive, so it seems like a lot of people are trying to do that, and you seem to think that cost is an issue, but you don’t do that?

    • Matt

      Exactly! We are now able to fine-tune GPT 3.5. Why not fine-tune a bunch of specific use case models and have gpt 4 as a classification layer to reduce costs?

  13. Joe

    Regarding your first example, I’ve done a nearly identical task recently. For sure, giving a large table in the prompt is overwhelming it. You do that to get machine readable output. But there’s off the shelf protocols for that, and GPT knows them. My fix was to ask for ISO 3166 codes, most of which (all the ones I checked) are _single token emissions_ (“OR”, “CA”, “FL”, “US”, etc.) Where appropriate, RFC & ISO numbers are a very handy way to _succinctly_ request machine readable outputs with ChatGPT 3.5 Turbo.

    • Ken

      Oo, using ISO 3166 codes is a really great shorthand – genius! I will have to give that a try!

  14. walter

    inject a fake null hypothesis in search cases, like “Waldo Bakery”, and then proceed to ask for business names – this is like paracetamol against hallucinations

    then you just filter it out

    • Civitello

      That’s even better than my solution!

  15. Paul Cardno

    This is pretty spooky, as if I’d have written down our experience at our Fortune 100, specifically in the Marketing arena where I’m leading our GenAI initiative, it would be identical to this. Our main use-cases are creating content for products or campaigns and translation. We’re mostly leveraging 3.5, as when we started building last year, GPT-4 was so restricted in query rate limits, it was basically unusable from a development standpoint. To be fair, that was maybe because we had it before general availability.

    I love your comment about context windows and the 10 items. Our experience is, repeatably, that the Open AI models need to be treated as a slightly forgetful professor. Incredibly smart, but if you let them talk for too long they’ll forget what they’re doing and results will vary, often in small but important ways. Classic example: getting it to return properly formatted JSON data so you can just do a JSON.parse() in Javascript. Get it to return maybe 500 tokens and all is well. Get it above some arbitrary number and it just starts forgetting to properly escape embedded quotes, even though it was doing it fine in the first part of the response.

    Also, in terms of other models? I’m constantly getting people saying “But what about Claude? Or Gemini?”. My answer stands: “Have you hit a point where GPT3.5 or GPT4 from Azure Open AI Services isn’t doing what you need it to?”.
    The answer is universally “No”. They just want the one that goes to 11, irrespective of whether it’s actually useful and the cost / difficulties of maintaining anything outside of the Azure ecosystem / Microsoft’s Terms and Conditions (which were super easy to get approved by Legal, because they didn’t actually change).

  16. Lawrence Fitzpatrick

    Regarding the null hypothesis prompt: did you try 2-step workflow where the first prompt asks “does this text have any company names in it, answer yes or no?” Followed by a second prompt to enumerate the companies if the answer to the first prompt is yes. Do you think this would work?
