Ken Kantzer's Blog

logging my thoughts on technology, security & management

GPT is the Heroku of AI

I read a comment on HN that sparked this article: GPT is kind of like DevOps from the early 2000s.

Here’s the hot take: I don’t see the primary value of GPT being in its ability to help me develop novel use cases or features – at least not right now.

The primary value is that it MASSIVELY lowers the barrier of entry to machine learning features for startups.

What’s my line of reasoning? Well, here are some surprising things about how we use it:

(But first, a caveat: there are a LOT of ways to use GPT in products – I think we haven’t even discovered the most valuable ones yet. A lot of the more novel ways to use GPT, though, like assistants and agents, we just haven’t found very useful in practice, at least in B2B SaaS. They’re just unreliable and too wonky. Not yet anyway!)

In our product, we primarily use GPT for these 4 cases, ranked in decreasing order of value:

Classification. Given a block of text, what type is it, from this list?

Data extraction. Here’s a JSON schema and a block of text, fill out the JSON schema.

Long-summary. Write an email that summarizes this block of text.

Short-summary. Give me 2-3 words to describe this block of text, so I can use it as a header (think, the ChatGPT summaries of each convo listed on the right).

Notice something interesting? The top 2 use-cases are things traditional ML can do really well.

So why are we using GPT and not traditional ML?

My previous company did a lot of traditional ML, and my main takeaway was that it was incredibly expensive to produce something valuable. This made it hard to experiment. It made it hard to maintain these features.

But now, I can literally spend a few minutes writing a prompt.
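
To make that concrete, here’s roughly what “a few minutes writing a prompt” looks like for the top two use cases above. This is a minimal illustrative sketch using the OpenAI Python SDK, not code from our product; the model name, labels, and schema are placeholders:

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def classify(text, labels):
    # Use case 1: classification. Ask for exactly one label from a fixed list.
    prompt = (
        "Classify the following text as exactly one of: "
        + ", ".join(labels)
        + ". Reply with the label only.\n\nText:\n"
        + text
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def extract(text, schema_example):
    # Use case 2: data extraction. Hand over a JSON skeleton and ask GPT to fill it in.
    prompt = (
        "Fill out this JSON object using the text below. Reply with JSON only.\n"
        + json.dumps(schema_example)
        + "\n\nText:\n"
        + text
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

That’s the entire “ML feature”: no labeled training data, no model hosting, no feature pipeline.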

So why is it the new Heroku?

What is (was) special about Heroku? It’s a very, very expensive infrastructure platform (relative to rolling it yourself) that promises (and delivers) on the value proposition: no devops needed.

You can be a normal engineer and have a very scalable, very stable, very robust app (complete with logging, restarts, alerts, patches, high availability, zero trust, secret management, etc, etc) without needing to know any devops.

And the expensiveness is ok, because it scales directly with usage, pay-as-you-go.

This is exactly what OpenAI has given developers with GPT: a very, very expensive way to do ML features without needing an ML team. It’s actually not even that expensive relative to the value, but it is expensive compared to cheaper, locally run LLMs, or to a traditional ML model once it’s been trained.

There’s one final overlooked aspect to this

Even if the personnel costs of hiring ML engineers weren’t prohibitive to startups, on top of that, traditional ML is impossible to do without large amounts of training data. That’s a huge moat.

Startups have a bootstrapping problem when it comes to the training data for ML features.

But GPT is zero shot. That is huge. It means that the barrier to entry to ML features is now effectively zero.

Conclusion

I like thinking about OpenAI this way because it also explains why Google is having such a problem capturing this space.

Google fundamentally doesn’t have the problem that GPT (the API) primarily solves. It has gobs of money. It has gobs of ML expertise. It doesn’t want to build something that chips away at that moat.

Of course, GPT does a lot more than substitute in for traditional ML operations. But that’s where I’ve seen the value in practice right now, and boy is it valuable! It’ll be fascinating to see where all this goes.

Lessons after a half-billion GPT tokens

My startup Truss (gettruss.io) released a few LLM-heavy features in the last six months, and the narrative around LLMs that I read on Hacker News is now starting to diverge from my reality, so I thought I’d share some of the more “surprising” lessons after churning through just north of 500 million tokens, by my estimate.

Some details first:

– we’re using the OpenAI models, see the Q&A at the bottom if you want my opinion of the others

– our usage is 85% GPT-4, and 15% GPT-3.5

– we deal exclusively with text, so no gpt-4-vision, Sora, whisper, etc.

– we have a B2B use case – strongly focused on summarize/analyze-extract, so YMMV

– 500M tokens actually isn’t as much as it seems – it’s about 750,000 pages of text, to put it in perspective

Lesson 1: When it comes to prompts, less is more

We consistently found that not enumerating an exact list or instructions in the prompt produced better results, if that thing was already common knowledge. GPT is not dumb, and it actually gets confused if you over-specify.

This is fundamentally different than coding, where everything has to be explicit.

Here’s an example where this bit us:

One part of our pipeline reads some block of text and asks GPT to classify it as relating to one of the 50 US states, or the Federal government. This is not a hard task – we probably could have used string/regex, but there’s enough weird corner cases that that would’ve taken longer. So our first attempt was (roughly) something like this:

Here's a block of text. One field should be "locality_id", and it should be the ID of one of the 50 states, or federal, using this list:
[{"locality": "Alabama", "locality_id": 1}, {"locality": "Alaska", "locality_id": 2} ... ]

This worked sometimes (I’d estimate >98% of the time), but failed enough that we had to dig deeper.

While we were investigating, we noticed that another field, name, was consistently returning the full name of the state…the correct state – even though we hadn’t explicitly asked it to do that.

So we switched to a simple string search on the name to find the state, and it’s been working beautifully ever since.

I think in summary, a better approach would’ve been “You obviously know the 50 states, GPT, so just give me the full name of the state this pertains to, or Federal if this pertains to the US government.”
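
Roughly, the combination I have in mind looks like this (an illustrative sketch; the prompt wording and names are placeholders, and the string search is the part doing the real work):

PROMPT = (
    "You obviously know the 50 US states. Give me the full name of the state "
    "this text pertains to, or 'Federal' if it pertains to the US government."
)

US_STATES = ["Alabama", "Alaska", "Arizona", "Arkansas"]  # ...and the other 46

def find_locality(gpt_answer):
    # Simple string search on the returned name; no ID list in the prompt at all.
    for state in US_STATES:
        if state.lower() in gpt_answer.lower():
            return state
    return "Federal"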

Why is this crazy? Well, it’s crazy that GPT’s quality and generalization can improve when you’re more vague – this is a quintessential marker of higher-order delegation / thinking.

(Random side note one: GPT was failing most often with the M states — Maryland, Maine, Massachusetts, Michigan — which you might expect of a fundamentally stochastic model.)

(Random side note two: when we asked GPT to choose an ID from a list of items, it got confused a lot less when we sent the list as prettified JSON, where each state was on its own line. I think \n is a stronger separator than a comma.)

Lesson 2: You don’t need langchain. You probably don’t even need anything else OpenAI has released in their API in the last year. Just chat API. That’s it.

Langchain is the perfect example of premature abstraction. We started out thinking we had to use it because the internet said so. Instead, millions of tokens and probably 3-4 very diverse LLM features in production later, our openai_service file still has only one 40-line function in it:

def extract_json(prompt, variable_length_input, number_retries)

The only API we use is chat. We always extract json. We don’t need JSON mode, or function calling, or assistants (though we do all that). Heck, we don’t even use system prompts (maybe we should…). When gpt-4-turbo was released, we updated one string in the codebase.

This is the beauty of a powerful generalized model – less is more.

Most of the 40 lines in that function are error handling around the OpenAI API’s regular 500s / socket-closed errors (though it’s gotten better, and given their load, it’s not surprising).
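
For the curious, here is a compressed sketch of what a single function like that can look like. This is illustrative Python rather than our actual code, and it catches exceptions broadly for brevity (in practice you’d catch the SDK’s specific error classes):

import json
import time
from openai import OpenAI

client = OpenAI()

def extract_json(prompt, variable_length_input, number_retries=3):
    # One chat call, JSON out, with dumb retries around transient API errors.
    full_prompt = prompt + "\n\n" + variable_length_input
    for attempt in range(number_retries):
        try:
            resp = client.chat.completions.create(
                model="gpt-4-turbo",
                messages=[{"role": "user", "content": full_prompt}],
                temperature=0,
            )
            return json.loads(resp.choices[0].message.content)
        except Exception:  # 500s, closed sockets, malformed JSON, etc.
            if attempt == number_retries - 1:
                raise
            time.sleep(2 ** attempt)  # simple backoff before retrying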

There’s some auto-truncating we built in, so we don’t have to worry about context length limits. We have our own proprietary token-length estimator. Here it is:

# rough heuristic: roughly 3 characters per token
if s.length > model_context_size * 3
  s = s[0, model_context_size * 3] # truncate it!
end

It fails in corner cases when there are a LOT of periods, or numbers (the token ratio is < 3 characters / token for those). So there’s another very proprietary try/catch retry logic:

if response_error_code == "context_length_exceeded"
  s = s.truncate((model_context_size * 3 / 1.3).to_i) # retry with a more conservative chars-per-token ratio
end

We’ve gotten quite far with this approach, and it’s been flexible enough for our needs.

Lesson 3: Improving perceived latency with the streaming API and showing users variable-speed typed words is actually a big UX innovation from ChatGPT.

We thought this was a gimmick, but users react very positively to variable-speed “typed” characters – this feels like the mouse/cursor UX moment for AI.
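
If you haven’t tried it, streaming is roughly a one-flag change on the chat API. A sketch, assuming a client object and a prompt string are already in hand; in a web app you’d push each delta to the browser instead of printing it:

stream = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}],
    stream=True,  # the only change versus a normal call
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # deltas can be empty
        print(delta, end="", flush=True)  # or push to the client over SSE / websockets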

Lesson 4: GPT is really bad at producing the null hypothesis

“Return an empty output if you don’t find anything” is probably the most error-prone prompting language we came across.1Editorial note based on HN comments: a lot of people are taking this literally – we don’t actually ask it to return nothing; it returns some string representation of blank (ex. {value: “”}), so this is not an issue of GPT failing to limit its output. From what I can tell, GPT actually has trouble discerning the null hypothesis, no matter what prompt-hackery you use. Not only does it often choose to hallucinate rather than return nothing, this kind of prompt also causes it to lack confidence a lot, returning blank more often than it should.

Most of our prompts are in the form:

“Here’s a block of text that’s making a statement about a company, I want you to output JSON that extracts these companies. If there’s nothing relevant, return a blank. Here’s the text: [block of text]”

For a time, we had a bug where [block of text] could be empty. The hallucinations were bad. Incidentally, GPT loves to hallucinate bakeries, here are some great ones:

  • Sunshine Bakery
  • Golden Grain Bakery
  • Bliss Bakery

Fortunately, the solution was to fix the bug and not send it a prompt at all if there was no text (duh!). But it’s harder when “it’s empty” is harder to define programmatically, and you actually do need GPT to weigh in.
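
The cheap guard, for the cases where “empty” really is easy to define programmatically (a sketch reusing our extract_json function from Lesson 2; the field names are illustrative):

def extract_companies(block_of_text):
    # Don't call GPT at all if there's nothing to extract from.
    if not block_of_text or not block_of_text.strip():
        return []
    result = extract_json(
        "Here's a block of text that's making a statement about a company. "
        'Output JSON like {"companies": []} listing the companies mentioned. '
        "If there's nothing relevant, return an empty list. Here's the text:",
        block_of_text,
    )
    return result.get("companies", [])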

Lesson 5: “Context windows” are a misnomer – and they are only growing larger for input, not output

Little known fact: GPT-4 may have a 128k token window for input, but its output window is still a measly 4k! Calling it a “context window” is confusing, clearly.

But the problem is even worse – we often ask GPT to give us back a list of JSON objects. Nothing complicated, mind you: think an array of JSON tasks, where each task has a name and a label.

GPT really cannot give back more than 10 items. Trying to have it give you back 15 items? Maybe it does it 15% of the time.

We originally thought this was because of the 4k context window, but we were hitting 10 items, and it’d only be maybe 700-800 tokens, and GPT would just stop.

Now, you can of course trade in output for input by giving it a prompt, ask for a single task, then give it (prompt + task), ask for the next task, etc. But now you’re playing a game of telephone with GPT, and have to deal with things like Langchain.
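
Here’s a sketch of that telephone game, for when you really do need a long list back (again reusing our extract_json function from Lesson 2, plus the json import; the schema is illustrative):

def get_tasks_one_at_a_time(source_text, max_tasks=25):
    # Work around the small output window by asking for one task per call.
    tasks = []
    while len(tasks) < max_tasks:
        result = extract_json(
            "Here are the JSON tasks extracted so far: "
            + json.dumps(tasks)
            + '. Return only the single next task as {"name": "", "label": ""}, '
            + 'or {"done": true} if there are no more. Here is the text:',
            source_text,
        )
        if result.get("done"):
            break
        tasks.append(result)
    return tasks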

Lesson 6: vector databases and RAG/embeddings are mostly useless for us mere mortals

I tried. I really did. But every time I thought I had a killer use case for RAG / embeddings, I was confounded.

I think vector databases / RAG are really meant for search. And only search. Not search as in “oh – retrieving chunks is kind of like search, so it’ll work!”, but real Google-and-Bing search. Here are some reasons why:

  1. there’s no cutoff for relevancy. There are some solutions out there, and you can create your own cutoff heuristics for relevancy, but they’re going to be unreliable. This really kills RAG in my opinion – you always risk poisoning your retrieval with irrelevant results, or, by being too conservative, missing important results.
  2. why would you put your vectors in a specialized, proprietary database, away from all your other data? Unless you are dealing at a google/bing scale, this loss of context absolutely isn’t worth the tradeoff.
  3. unless you are doing a very open-ended search, of say – the whole internet – users typically don’t like semantic searches that return things they didn’t directly type. For most applications of search within business apps, your users are domain experts – they don’t need you to guess what they might have meant – they’ll let you know!

It seems to me (this is untested) that a much better use of LLMs for most search cases is to use a normal completion prompt to convert a user’s search into a faceted search, or even a more complex query (or heck, even SQL!). But this is not RAG at all.
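
To sketch what I mean (untested, like the claim itself): a single completion that turns the user’s search into filters your existing query layer already understands, no embeddings anywhere. The keys and the example are invented for illustration:

def search_to_filters(user_query):
    # Turn a free-text search into faceted filters for a normal database query.
    return extract_json(
        "Convert this search into JSON filters with keys "
        '"company", "state", "date_from", "date_to" (use null for anything '
        "not specified). Search:",
        user_query,
    )

# e.g. "acme filings in texas since 2021" might come back as
# {"company": "Acme", "state": "Texas", "date_from": "2021-01-01", "date_to": null}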

Lesson 7: Hallucination basically doesn’t happen.

Every use case we have is essentially “Here’s a block of text, extract something from it.” As a rule, if you ask GPT to give you the names of companies mentioned in a block of text, it will not give you a random company (unless there are no companies in the text – there’s that null hypothesis problem!).

Similarly — and I’m sure you’ve noticed this if you’re an engineer — GPT doesn’t really hallucinate code – in the sense that it doesn’t make up variables, or randomly introduce a typo in the middle of re-writing a block of code you sent it. It does hallucinate the existence of standard library functions when you ask it to give you something, but again, I see that more as the null hypothesis. It doesn’t know how to say “I don’t know”.

But if your use case is entirely, “here’s the full context of details, analyze / summarize / extract” – it’s extremely reliable. I think you can see a lot of product releases recently that emphasize this exact use case.

So it’s all about good data in, good GPT responses out.

Conclusion: where do I think all this is heading?

Rather than responding with some long-form post (edit: I then subsequently did write a follow up post with more thoughts, because there were some very thought-provoking / inspiring points in the HN comments), here’s a quick from-the-hip Q&A:

Are we going to achieve AGI?

No. Not with the current approach of transformers + the data of the internet + $XB of infrastructure.

Is GPT-4 actually useful, or is it all marketing?

It is 100% useful. We’re still in the “early internet” days of this stuff. Will it put everyone out of a job? No. Primarily, I see this lowering the barrier of entry to ML/AI that was previously only available to Google.

Have you tried Claude, Gemini, etc?

Yeah, meh. Actually in all seriousness, we haven’t done any serious A/B testing, but I’ve tested these with my day to day coding, and it doesn’t feel even close. It’s the subtle things mostly, like intuiting intention.

How do I keep up to date with all the stuff happening with LLMs/AI these days?

You don’t need to. I’ve been thinking a lot about The Bitter Lesson – that general improvements to model performance outweigh niche improvements. If that’s true, all you need to worry about is when GPT-5 is coming out. Nothing else matters, and everything else OpenAI releases in the meantime (not including Sora, etc – that’s a whooolle separate thing) is basically noise.

So when will GPT-5 come out, and how good will it be?

I’ve been trying to read the signs with OpenAI, as has everyone else. I think we’re going to see incremental improvement, sadly. I don’t have a lot of hope that GPT-5 is going to “change everything”. There are fundamental economic reasons for that: between GPT-3 and GPT-3.5, I thought we might be in a scenario where the models were getting hyper-linear improvement with training: train it 2x as hard, it gets 2.2x better.

But that’s not the case, apparently. Instead, what we’re seeing is logarithmic. And in fact, token speed and cost per token are growing exponentially for incremental improvements.

If that’s the case, there’s some Pareto-optimal curve we’re on, and GPT-4 might be optimal: whereas I was willing to pay 20x for GPT-4 over GPT-3.5, I honestly don’t think I’d pay 20x per token to go from GPT-4 to GPT-5, not for the set of tasks that GPT-4 is used for.

GPT-5 may break that. Or, it may be the iPhone 5 to the iPhone 4. I don’t think that’s a loss!

The Parable of the Wise Hiring Manager

One day, while The Manager was walking back from a morning coffee run, a group of frazzled engineers came near and spake unto him, saying: “Most Esteemed Boss, we are unable to hire Talent – and many of our candidates refuse to take our coding challenges! The labor market is tight and our staff are burning out, should we not ease our process and forgo coding challenges, so that we might secure butts-in-seats more quickly?”

The Manager turned to them, and upon seeing that they were truly desperate, sought to teach them using this parable:

—–

A hiring manager interviewed four engineers for a new role, and as each one approached, the hiring manager explained that the candidate must first pass a simple, non-leetcode challenge in order to advance to the next stage. The first candidate heard this and scornfully mocked the hiring manager,

“Bro, I’m an experienced engineer, if you can’t look at my github and portfolio, and my work history and see that I can code, I don’t want to work with you.”

This made the hiring manager feel sad.

The second candidate was nicer and agreed to take the challenge. But upon failing, the candidate offered up many excuses and complaints about their local dev environment, troubles with the coding platform, and that they were just dealing with a stupid bug.

The third candidate accepted the challenge gladly, but upon seeing the problem and attempting it, hung their head and wept, for the problem was beyond their skill to complete. The hiring manager had pity on them, and gave them an easier problem, which they completed gladly.

The final candidate accepted the challenge gladly, completed it successfully in one-tenth the required time, and even managed to work in a few tests and an optimization before time expired. The hiring manager was overjoyed and the candidate moved on to the next stage.

The other candidates were outraged, and they spake angrily unto the hiring manager, saying:

“We were qualified too, bro (brothers? brethren?), if you just give us a chance! Not everyone can pass a coding interview under pressure, and furthermore, the problem you gave involved pirates and recursion, and pirates are clearly not what we’d be working on on a daily basis.”

But the wise hiring manager simply smiled and said, “Thank you for your time.”

——-

When the Manager finished the parable, the engineers asked him its meaning. So he explained it to them, saying, “The wise hiring manager knew that each of these candidates would approach their work much as they approached the interview.

The first candidate is the engineer who doesn’t seem to understand that they add value to the company by coding well and fast. Instead, they spend their time writing careful specs, crafting stories, practicing over-the-top-TDD, and writing pages of documentation that no one will ever read. But somehow, when it comes time to do work, they always seem to produce code at a snail’s pace. They work between 12 and 18 months at each job, careful to move on before their work product catches up to them and they’re fired. And they manage their job hunt carefully, so that each new job is slightly better-looking than the last, ensuring them a steady upward trajectory.

The second candidate is the engineer who will eagerly accept a new assignment or project, but once they encounter the typical obstacles, they will come up with every excuse they can think of, blaming anything or anyone else for why they’re still not done. They’re a pro at offering these excuses, and they can often keep stringing you along with very believable excuses for month after month, before they too move along to their next victim job.

The third candidate is the engineer who is extremely driven and willing to do whatever it takes, but at the end of the day, just can’t get the job done. It’s heartbreaking, because this job will be a disappointment to them and to their manager. But this is a professional team, and lowering standards will hurt everyone on the team.

The final candidate is the engineer who is both humble and confident enough to be ok with a test of their skills, because they know that the way they add value in their job is by coding well, and they’re happy to prove it. Furthermore, they demonstrated problem solving by wrangling the buggy/broken tools they’re given, under pressure, which may not be their day to day job, but is great when the team is in a tight corner or when things go sideways in production on some legacy system.

This engineer is worth 10, sometimes 100 times the other developers.”

The engineers left, and everyone marveled at the common sense wisdom of The Manager.

Learnings from 5 years of tech startup code audits

While I was at PKC, our team did upwards of twenty code audits, many of them for startups that were just around their Series A or B (that was usually when they had cash and realized that it’d be good to take a deeper look at their security, after the do-or-die focus on product market fit).

It was fascinating work – we dove deep on a great cross-section of stacks and architectures, across a wide variety of domains. We found all sorts of security issues, ranging from catastrophic to just plain interesting. And we also had a chance to chat with senior engineering leadership and CTOs more generally about the engineering and product challenges they were facing as they were just starting to scale.

It’s also been fascinating to see which of those startups have done well and which have faded, now that some of those audits are 7-8 years behind us.

I want to share some of the more surprising things I’ve internalized from these observations, roughly ordered from most general to most security specific.

  1. You don’t need hundreds of engineers to build a great product. I wrote a longer piece about this, but essentially, despite the general stage of startup we audited being pretty similar, the engineering team sizes varied a lot. Surprisingly, sometimes the most impressive products with the broadest scope of features were built by the smaller teams. And it was these same “small but mighty” teams that, years later, are crushing their markets.
  2. Simple Outperformed Smart. As a self-admitted elitist, it pains me to say this, but it’s true: the startups we audited that are now doing the best usually had an almost brazenly ‘Keep It Simple’ approach to engineering. Cleverness for cleverness’ sake was abhorred. On the flip side, the companies where we were like “woah, these folks are smart as hell” for the most part kind of faded. Generally, the major foot-gun (which I talk about more in a previous post on foot-guns) that got a lot of places in trouble was the premature move to microservices, architectures that relied on distributed computing, and messaging-heavy designs.
  3. Our highest impact findings would always come within the first and last few hours of the audit. If you think about it, this makes sense: in the first few hours of the audit, you find the lowest-hanging fruit. Things that stick out like a sore thumb just from grepping the code and testing some basic functionality. During the last few hours, you’ve fully contexted in to the new codebase, and things begin to click.
  4. Writing secure software has gotten remarkably easier in the last 10 years. I don’t have statistically sound evidence to back this up, but it seems like code written before around 2012 tended to have a lot more vulnerabilities per SLOC than code written after 2012 (we started auditing in 2014). Maybe it was the Web 2.0 frameworks, or increased security awareness amongst devs. Whatever it was, I think this means that security really has improved on a fundamental basis in terms of the tools and defaults software engineers now have available.
  5. All the really bad security vulnerabilities were obvious. In probably a fifth of the code audits we did, we’d find The Big One – a vulnerability so bad that we’d call up our clients and tell them to fix it immediately. I can’t remember a single case where that vulnerability was very clever. In fact, that’s part of what made the worst vulnerabilities bad — we were worried primarily because they’d be easy to find and exploit. “Discoverability” has been a component of impact analysis for a while, so this isn’t new. But I do think that discoverability should be much more heavily weighted. Discoverability is everything, when it comes to actual exposure. Hackers are lazy and they look for the lowest-hanging fruit. They won’t care about finagling even a very severe heap-spray vulnerability if they can reset a user’s password because the reset token was in the response (as Uber found out circa 2016). The counterargument to this is that heavily weighting discoverability perpetuates “Security by Obscurity,” since it relies so heavily on guessing what an attacker can or should know. But again, personal experience strongly suggests that in practice, discoverability is a great predictor of actual exploitation.
  6. Secure-by-default features in frameworks and infrastructure massively improved security. I wrote a longer piece about this too, but essentially, things like React default escaping all HTML to avoid cross-site scripting, and serverless stacks taking configuration of operating system and web server out of the hands of developers, dramatically improved the security of the companies that used them. Compare this to our PHP audits, which were riddled with XSS. These newer stacks/frameworks are not impenetrable, but their attackable surface area is smaller in precisely the places that make a massive difference in practice.
  7. Monorepos are easier to audit. Speaking from the perspective of security researcher ergonomics, it was easier to audit a monorepo than a series of services split up into different code bases. There was no need to write wrapper scripts around the various tools we had. It was easier to determine if a given piece of code was used elsewhere. And best of all, there was no need to worry about a common library version being different on another repo.
  8. You could easily spend an entire audit going down the rabbit trail of vulnerable dependency libraries. It’s incredibly hard to tell if a given vulnerability in a dependency is exploitable. We as an industry are definitely underinvesting in securing foundational libraries, which is why things like Log4j were so impactful. Node and npm were absolutely terrifying in this regard—the dependency chains were just not auditable. It was a huge boon when GitHub released dependabot because we could for the most part just tell our clients to upgrade things in priority order.
  9. Never deserialize untrusted data. This happened the most in PHP, because for some reason, PHP developers love to serialize/deserialize objects instead of using JSON, but I’d say almost every case we saw where a server was deserializing a client object and parsing it led to a horrible exploit. For those of you who aren’t familiar, PortSwigger has a good breakdown of what can go wrong (incidentally, focused on PHP. Coincidence?). In short, the common thread in all deserialization vulnerabilities is that giving a user the ability to manipulate an object that is subsequently used by the server is an extremely powerful capability with a wide surface area. It’s conceptually similar to both prototype pollution and user-generated HTML templates. The fix? It’s far better to allow a user to send a JSON object (it has so few possible data types), and to manually construct the object based on the fields in that object. It’s slightly more work, but well worth it! (There’s a short sketch of this pattern right after this list.)
  10. Business logic flaws were rare, but when we found one they tended to be epically bad. Think about it — flaws in business logic are guaranteed to affect the business. An interesting corollary is that even if your protocol is built to provide provably-secure properties, human error in the form of bad business logic is surprisingly common (you need look no further than the series of absolutely devastating exploits that take advantage of badly written smart contracts).
  11. Custom fuzzing was surprisingly effective. A couple years into our code auditing, I started requiring all our code audits to include building custom fuzzers to test product APIs, authentication, etc. This is somewhat commonly done, and I stole this idea from Thomas Ptacek, which he alludes to in his Hiring Post. Before we did this, I actually thought it was a waste of time—I just always figured it was an example of misapplied engineering, and that audit hours were better spent reading code and trying out various hypotheses. But it turns out fuzzing was surprisingly effective and efficient in terms of hours spent, especially on the larger codebases.
  12. Acquisitions complicated security quite a bit. There were more code patterns to review, more AWS accounts to look at, more variety in SDLC tooling. And of course, usually the acquisition meant an entirely new language and/or framework with its own patterns in use.
  13. There was always at least one closet security enthusiast amongst the software engineers. It was always surprising who it was, and they almost never knew it was them! As security skillsets get more software-skewed, there’s huge arbitrage here if these folks can be reliably identified.
  14. Quick turnarounds on fixing vulnerabilities usually correlated with general engineering operational excellence. The best cases were clients who asked us to just give them a constant feed of anything we found, and they’d fix it right away.
  15. Almost no one got JWT tokens and webhooks right on the first try. With webhooks, people almost always forgot to authenticate incoming requests (or the service they were using didn’t allow for authentication…which was pretty messed up!). This class of problem led Josh, one of our researchers, to begin asking a series of questions that led to a DefCON/Blackhat talk. JWT is notoriously hard to get right, even if you’re using a library, and there were a lot of implementations that failed to properly expire tokens on logout, incorrectly checked the JWT for authenticity, or simply trusted it by default.
  16. There’s still a lot of MD5 in use out there, but it’s mostly false positives. It turns out MD5 is used for a lot of other things besides an (in)sufficiently collision-resistant password hash. For example, because it’s so fast, it’s often used in automated testing to quickly generate a whole lot of pseudo-random GUIDs. In these cases, the insecure properties of MD5 don’t matter, despite what your static analysis tool may be screaming at you.

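To make point 9 above concrete, here’s the shape of the fix, sketched in Python rather than PHP (the class and field names are invented purely for illustration):

import json
from dataclasses import dataclass

@dataclass
class Order:
    item_id: int
    quantity: int

# The dangerous pattern is the equivalent of PHP's unserialize($raw_body) or
# Python's pickle.loads(raw_body), which let the client pick the object being built.

def parse_order(raw_body):
    # The fix: parse JSON (it has very few possible data types), then explicitly
    # copy only the fields you expect onto an object you construct yourself.
    data = json.loads(raw_body)
    return Order(item_id=int(data["item_id"]), quantity=int(data["quantity"]))
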
I’m curious if you’ve seen any of these, as well as others! Or, drop me a note if you disagree!

The Unreasonable Effectiveness of Secure-by-default

This is one in a series of deeper-dives into various Learnings from 5 years of tech startup code audits. In that article, I list several observations I had during the course of doing code audits for 20-30 tech startups at or around the Series A / B mark.

Creating the “pit of success” as a way to prevent the most severe security mistakes has by and large been an enormous success. We started doing code audits in 2014. React had come out in March of the previous year, and AWS Lambda came out that November. We did a fair amount of serverless code audits and audits of React front-ends + API backends. We usually didn’t find much, and I think a large part of this had to do with major security features that were baked into React and its successors (i.e. escaping by default, preventing XSS), and AWS Lambda (taking web server security configuration and OS configuration out of developers’ hands).

It’s possible that this was because the exploitation methods that got around React and other frameworks’ default-escaping approach took a while to emerge, and are relatively difficult to exploit. But it’s also possible that the new wave of Javascript frameworks that built in these security properties actually made things more secure by default! This type of win-win, where security was introduced in a way that didn’t make developers’ lives harder, should be considered a massive win for the security community, and my hat goes off to the framework developers for making the web a better place.1Some folks might point out that default-escaping was not invented by React. Rails, for example, has had this since Rails 3 (~2008). However, we found that there were a lot more instances of Rails code breaking convention and being vulnerable than you would suppose. My theory is that it’s because there is a lot of bad code floating around Rails forums, which makes it easier to copy/paste bad code into even new codebases.

Lambdas and serverless were also taking off at the time, and the same observations hold true there: a whole class of very serious security vulnerabilities (configuration-based web server and OS concerns) are just not a problem on serverless. Sure, the attack surface is by no means zero, and things like RCE and SSRF are still deadly, and new exploit types will be found, but the security properties of serverless represent a real, step-function increase in security across the board.

You Don’t Need Hundreds of Engineers to Build a Great Product

This is one in a series of deeper-dives into various Learnings from 5 years of tech startup code audits. In that article, I list several observations I had during the course of doing code audits for 20-30 tech startups at or around the Series A / B mark.

We did several code audits for companies that rapidly scaled their engineering orgs relatively early on (we’re talking 50-100 engineers, maybe 10-35M annual revenue, series A/B). None of them are doing well right now, and some are out of business.

What made this observation interesting is how different it is from stories of the well-known counterexamples (Uber, FB, etc, etc, etc), where it seems like the ability to double headcount every 6 months is assumed to be a key component of their ability to scale rapidly with hypergrowth.

But I think we are taking the wrong lesson from these success stories. Here’s the real lesson:

You don’t grow engineering headcount like crazy in order to achieve hypergrowth: these companies underwent hypergrowth first, and were then forced to grow headcount rapidly.

Typically the architectures we saw from these large engineering groups were not bad: they just were not really necessary. We often looked at the code and the infrastructure and kind of scratched our heads: why were they doing all this? It just didn’t seem to make sense. But then we talked to the very smart engineers they had, and it seemed like there were good explanations for most things. Looking back, I think a lot of the complexity was frankly busy work: you bring in a lot of engineers without enough truly business-essential work to do, and they will come up with things to do. This is not a crack on them, it’s just human nature.

Technology ROI Discussions are Broken

A note before we begin: I’m arguing that technology ROI discussions are broken, not that ROI as a decision-making tool is broken. A solid understanding of how to calculate and use ROI is an essential skill for any tech executive, and when done right, it’s a powerful decision-making tool. This post is about how technology discussions that exclusively look at ROI often result in a one-eyed analysis that lacks depth.

Technical leaders need a wider range of tools for communicating the value of technology, and especially technology innovation. Communicating the value of technology is not a trivial task—and the point of this post is that exclusive reliance on the most commonly used tool for communicating value—Return on Investment (ROI)—will lead to broken discussions.

One solution to this is to start discussing technology innovation in terms of how it will break the existing ROI model and whether that thesis is believable.

It’s important to say straight off the bat that ROI itself is in no way broken! It should be part of every investment discussion. But ROI does fail to capture some very important elements of evaluating technology investments.

Two of the primary assumptions that feed the raw data that goes into ROI calculations —namely the operational model of the business and strategic tradeoffs that underpin that model—are sometimes the very things that will change with a successful technology innovation.

I believe the approach I’ll discuss actually requires a much higher standard of excellence from technologists. This is a good thing: as the wave of internet technologies hits full maturity, it’s necessary to become increasingly discerning—skeptical, even—of overly simplistic, ROI-based business-case arguments for certain types of technology investments.

What’s Wrong with ROI?

Almost all budget processes require new investments from technology to be framed in terms of the return those investments will produce.

How to portray ROI is very nuanced1Typically, you use Internal Rate of Return, factoring in net present value of all cash flows and whatever discount rate you apply to those cash flows. Sensitivity analysis is also helpful for generating discussions about how robust the ROI thesis is and what value drivers actually matter., but a simplified explanation of how it works will suffice for our purposes. Say you’re trying to decide whether to migrate infrastructure to the cloud. You’d tabulate yearly costs on current infrastructure (summed over a 2-5 year period, depending on how finance wants it), and include an estimate of soft-costs to maintain and complete activities on that infrastructure. Then you calculate the costs to migrate, and the migration-complete costs. If the business will break even and begin saving money within an acceptable window of time, say three years, then the business case is viable.

It’ll end up looking something like this:

[Figure: a typical ROI business case for a technology investment. Source: https://wisconsin.edu/systemwide-it]
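
As a toy version of that calculation (every number below is invented purely to show the shape of the math):

current_yearly_cost = 1_200_000  # existing infrastructure, licenses, soft costs
cloud_yearly_cost = 900_000      # estimated post-migration run rate
migration_cost = 700_000         # one-time cost to migrate

yearly_savings = current_yearly_cost - cloud_yearly_cost  # 300,000 per year
breakeven_years = migration_cost / yearly_savings         # ~2.3 years

# Under a three-year threshold (and ignoring discounting), this case is "viable".
print(f"Break-even in {breakeven_years:.1f} years")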

If technology was all about operational efficiency, this would be great. But the problem with ROI is that sometimes technology can also deliver value that can’t be described in terms of your company’s current ROI models.

For example, how do you know if a new technology that enables a completely new way of engaging with customers will generate more revenue? You don’t (if you’re honest), and even if you do, how do you quantify that in a financial model?

“Does it Change the Math?”

One of my former bosses used to ask this question whenever we’d propose a new technology: “yes, but does it change the math?”

What he meant was, is this something that changes how we can operate in some fundamental way, such that the equations and models are actually different now? If the answer is no, then the discourse on the change should be based on efficiency, so we’d go ahead and talk about ROI.

Things that “change the math” are almost inherently problematic for ROI discussions: if you’re arguing the model itself is going to change, representing the change in one, consistent model is very difficult. You may have new variables that didn’t even exist in the old model, and those new variables (if you’re being honest) have no historical precedent.

But the part that you shouldn’t miss is that when my boss asked the question this way, it allowed for a much richer (and more interesting) discussion about the value of a given technology.

How to Properly Discuss Technological Value

So, if ROI is sometimes insufficient, what can work? Well, I like to use this graph to frame the discussion about value:

[Figure: an alternative to ROI – a Pareto-style curve showing the tradeoff between two strategic objectives held in tension]

Usually, viable business models stabilize on a balance between two strategic objectives that are at inherent tension with each other. For example, Lower Costs vs Higher Quality. If you skimp too much on cost, quality goes down. If you pursue quality at all costs, your costs will rise.

Your business has not only picked a “sweet spot” on that curve in terms of profitability, but has also defined a curve that gives it the ability to adjust to changes in the environment:

If a new competitor enters the market, and they’re offering crazy quality, you can respond by “moving up the curve” and opening up a premium-service line that costs more. If a low-cost competitor enters the market and surprises you by gaining a lot of traction, you can “move down the curve” and cut costs at the expense of quality.

But what you can’t do is break that curve. Or can you?

That’s where technology is so important. Technology that’s truly innovative helps businesses break that curve:

There’s no great way to properly represent this in a traditional ROI analysis (that doesn’t stop everyone from trying, though!).

I think what we should do is actually produce a document that attempts to convince leadership that the model and tradeoffs your business has been operating under can be broken with the given technology. This forces technology leadership to explicitly answer some rather hard questions:

  • In what ways do you think the model is broken?
  • How, exactly, will the new technology change that?
  • What does the new model look like, and how is it better?

These questions are the important ones to get everyone aligned on. These questions should be fought over and scrutinized from every angle. Because these are the questions that will actually predict the value of whatever ROI calculations you end up doing.

Wrap Up

I love this approach, because it forces technology leaders to explicitly define the innovation in opposition to the current business model. It jolts us out of our formulaic, boring, by-the-book ROI calculations that often aren’t accurate anyway.

Another benefit of this approach is that it often exposes the problem with the potential innovation investments right away: when we draw this graph and pick the axes, the other business leaders say: “well, I don’t actually care about the two axes you defined, because while they exist, they aren’t fundamental to the business vis a vis the investment you’re proposing.” Now you’ve arrived at truth, fairly quickly!

Let’s go back to the discussion about infrastructure migration. The ROI case we sketched out is very typical. There’s probably even templates floating around that CIOs and CTOs could use to make the case.

But I think if we’re being honest, the argument is missing something, isn’t it? If the ROI discussion was really so straightforward, when Netflix moved to AWS, everyone would have immediately piled on. But there were skeptics for years. What were they skeptical about?

Well, the reality is that cloud infrastructure isn’t a little better because it makes it a bit cheaper to run servers. It’s way better, because it largely removes infrastructure as the bottleneck for innovation. Your graph, then, looks like this:

[Figure: Netflix’s move to AWS allowed for higher hardware utilization while simultaneously and dramatically improving operational flexibility]

How do you represent that as ROI? And yet, it’s the fundamental factor in cloud-enabled technology.

5 Software Engineering Foot-guns

“In the brain of all brilliant minds resides in the corner a fool.”

Aristotle

Writing about “Best Practices” can get boring, so I thought I’d take a break this week, and write about some bad engineering practices that I’ve found the absolute hardest to undo once done. Real foot-guns, you could say.

Each of these is a bit controversial in its own way, and that’s how it should be—I’d welcome any counter-views. The prose in this post is a bit more irreverent than normal—in most cases, I’m poking fun at myself (both past and present!), as I’ve been guilty of each of these foot-guns—and a lot of them, frankly I still struggle with. Hopefully this post will generate some “motivation through transparency” 🙂

Engineering Foot-gun #1—Writing clever code instead of clear code

It’s because optimizing is fun. https://xkcd.com/1691

It’s unreasonably fun to write clever code. We even came up with a clever name for it: elegant code.1Truly elegant code (here’s a wonderful example: https://norvig.com/spell-correct.html) is simultaneously clever and easy to follow, but there are a lot of great ideas that have truly awful failure modes (ex. anarchy, communism, monarchy), and poorly-executed elegant code also has horrible failure modes and is more common than I’d like to admit. Who could say no to elegant code without appearing barbaric and ignorant? If instead we called it “smarty-pants code,” or “tricky code,” maybe we’d stand a chance of keeping away from it. But no—it’s elegant code—and elegance is just amazing and irresistible, all the time.

Here are some guilty pleasures that I’ve indulged in the relatively few times I’ve been cursed with a clever idea:

  • “I wonder if I can do this in a one-liner?” This never results in clear code.
  • fancy pointer-arithmetic routines
  • clever indexing
  • fancy map/reduces
  • recursion
  • even *gasp* ternary operators (rule of thumb: if the syntax causes you to pause for even a second to mentally confirm it works as you expect, it’s not worth it)

But my favorite self-argument about why it’s ok to inflict future pain on one’s unsuspecting colleagues is this one: “Well, it’s worth it because the code is more efficient.”

Efficient? If I’m honest with myself, half the time this isn’t true because the weird hack I’ve just written takes me off the happy path for any compiler optimizations in whatever language I’m writing in. And the other half of the time, the faster code I wrote gets called once per API call, so the user never notices the few nanoseconds I cleverly shaved off (purely for their benefit of course!).
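
A made-up example of the trade I’m describing: the one-liner I’m proud of versus the version a colleague can read six months from now.

groups = [["a", "b"], ["b", "", "c"]]  # hypothetical input

# "Clever": flatten, drop blanks, and dedupe (preserving order) in one line.
unique = list(dict.fromkeys(x for xs in groups for x in xs if x))

# Clear: the same result, but the next reader doesn't have to decode it.
unique = []
for xs in groups:
    for x in xs:
        if x and x not in unique:
            unique.append(x)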

If I’m really honest, this habit is hard to break because it’s hard to let go of a clever thought and not share it. And after all, what harm does some occasionally difficult-to-understand code really do? Well, a lot actually. One second-order effect of the great resignation is that more software engineers are reading more new lines of code for the first time than ever before. This means that the value of writing clear code has never been higher.2There’s a separate rant-motif I chose not to include in the main text of this post about how several of these stubborn habits are things I picked up in college. I’ll include these footnote-rants at the end of relevant sections for those interested.

In the case of #1 clever code—in an academic setting, you’re richly rewarded for cleverness that comes at the expense of clarity. Clarity is appreciated (usually after a hundred years, when the cleverness of something old is not enough to keep it around), but cleverness will get you tenure.

Engineering Foot-gun #2—Not being willing to “throw it all away”

“Never form attachments to your code” they said. “Avoid the Sunk Cost Fallacy” they said. Hah! Easier said than done. I’ve found that sometimes it’s really hard to abandon a certain approach and start from scratch. This is especially baffling because the few times in my life where I’ve had the misfortune to completely destroy a draft paper or a day or two of code, rewriting it has been nothing short of pleasurable.

But somehow, deciding—willfully—to erase working code is hard. Rather than start over, I continue adding on layer after layer of hacks to my existing code until it resembles an unrecognizable frankenstein ruin, much like this orc from Lord of the Rings:

My code, after I try to save it four or five times. https://lotr.fandom.com/wiki/Gothmog_(Lieutenant_of_Morgul)

I think it has to do with some subconscious part of myself that likes to think that it’s impossible that the first solution I thought of wasn’t the perfect one! If only I were so lucky or smart.

Engineering Foot-gun #3—Creating abstractions prematurely

I think as hunter/gatherers, we were programmed to be absolutely paranoid about preparing for the future. That makes sense when you don’t know where your next meal is coming from. But this attitude is disastrous when you’re coding. Refactoring software is the cheapest activity humankind has ever invented, in terms of the ratio between initial effort and change effort. Doing it, even often, is not a bad thing.

I try to remind myself every time I start griping about refactoring code to please talk to Bonanno Pisano, the architect of the Tower of Pisa. Or talk to the scientists and engineers involved in the herculean efforts required to apply corrective fixes to the Hubble Space Telescope. These folks had it rough! Refactoring, even major refactoring, is simple by comparison, and these days, we have a lot of great advice on how to go about doing it. 3Academia is basically built around recognizing the creation of abstractions as valuable. This isn’t a bad thing in an academic setting (usually), but in the workplace, you don’t get extra points for creating an AbstractSingletonBuilderFactory—abstractions are only valuable insofar as the output they produce or the simplicity they bring to the overall system.

Engineering Foot-gun #4—Not properly respecting the complexity of distributed systems

I always thought the RCAs we did for outages involving queues and microservices were tinged with irony. The causes sounded so…familiar:

  • “our users experienced delays in jobs X because of downstream service queues failing to process”
  • “a sequence of larger-than-expected messages on the queue were constantly retrying before going to DLQ, causing it to hang while responding to new enqueue requests”
  • “we lost messages because the queue was full”

What’s familiar about these? They’re the same problems I was trying to solve by going to a distributed system!4Ok, typical caveat, clearly there are times when distributed systems are necessary and clearly, messaging queues are necessary, especially if your load profile is extremely variable, or if you need many GB- or TB- sized space to hold the messages. RAM is truly great, but it can’t solve everything.

So let’s talk about distributed message queues. Why do I use queues? Well, because I want to drop off a message in some persistent place and return immediately, without waiting till it’s done processing. That is a great property. I have a buffer, and some breathing room if things go wrong.

But I haven’t actually done anything about the underlying problem. In true Goldratt’s Theory of Constraints style, I haven’t removed the underlying constraint. All I’ve done is introduce another system (actually two systems, if you include the network) where that constraint can manifest (because after all, the queue itself can get overwhelmed and stop processing my requests…)

The siren call of distributed message queues is very great. It feels like the Elegant Thing to Do. But why not wait till this really is a BAD problem, before taking on the problems of a distributed system? And if it really is a major problem, why not look at threading or in-memory queues? Or parallelizing execution of the receiver?
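
For scale, here’s how little code the boring in-process alternative can be, a sketch with Python’s standard library (you lose persistence across restarts, which is exactly the tradeoff to weigh):

import queue
import threading

def process(job):
    print("processing", job)  # stand-in for whatever the receiver does today

jobs = queue.Queue(maxsize=1000)  # bounded, so producers get back-pressure instead of silent loss

def worker():
    while True:
        job = jobs.get()  # blocks until a job is available
        try:
            process(job)
        finally:
            jobs.task_done()

for _ in range(4):  # parallelize the receiver with a few threads
    threading.Thread(target=worker, daemon=True).start()

jobs.put("job-123")  # "enqueue" returns immediately, just like the distributed version
jobs.join()          # wait until everything has been processed

If the load outgrows a single process, that’s the point where a managed queue starts earning its complexity.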

I think part of the reason I have trouble shaking this one is because the costs are paid down the road. Usually, messaging queues handle the small amount of load you initially put on them beautifully—it’s only as the system matures that the faults in the distributed system start manifesting themselves.

Engineering Foot-gun #5—Waiting too long to ask for help

The big problem with waiting too long to ask for help is that I end up solving my own problem without help. Wait, what? Why is this bad? Isn’t solving things myself best?

It actually isn’t. The most surprising and impactful technical things I’ve ever learned came after asking for help. I say surprising, because rarely was the thing I learned a direct answer to my question. It was usually completely unexpected.

Notice, I also said after asking for help. That’s because a lot of great conversations happen in the context of discussing a problem with other people, after it’s been solved. Take a look at these cases:

“Oh, you don’t actually have to do what you’re trying to do at all, here’s a really simple core function that does exactly what you were trying to do in this whole piece of code”

Or, “Yeah, so you’re looking for X. Just so you know, this is actually a specific case of this general class of problem called…”

Or, “Ah, yes you can fix it this way. Also, did you know this really interesting concept that this made me think of?”

I wrote a couple weeks ago about Why experience is primarily about removing your Unknown Unknowns. This is exactly what’s at stake in breaking this habit. Interestingly, now that I code less, and do more leadership, I think this one is probably even more true now than it was before.5Academia values originality above almost anything else. And you can’t be original if you ask for help. While it’s true that being original in the workplace does have true (sometimes tremendous) value, impact always trumps originality.

Let me know what bad engineering habits you’re trying to break! 🙂

The Backlog Peter Principle

A few years ago, I was in one of my ruts. Everything I was working on seemed to be bogged down or low-leverage. What was so frustrating was that this had come on the heels of a few amazingly productive months, where I had gotten a lot done. Worse yet, this seemed to happen cyclically: periods of productivity and a sense of accomplishment were followed by periods of delays and a sense of frustration.

Coincidentally, around the same time, I had just heard of the Peter Principle, which goes something like this:

“Every employee tends to rise to their level of incompetence”

– Laurence J. Peter, The Peter Principle (1969)

It’s brought up often in organization theory as an explanation for why there are so many ineffective managers and executives. “Employees are promoted based on their success in previous jobs until they reach a level at which they are no longer competent, as skills in one job do not necessarily translate to another.”

Basically, the idea is that advancement within an organization is an unstable system that eventually stabilizes in a deteriorated state where most people have been promoted to jobs they can’t perform.

That’s when I realized something: that same sense of an unstable system stabilizing in a deteriorated state perfectly described these cyclical ruts I would find myself in. My backlog of work was getting peter-principled!1This may not be a particularly deep insight, but I found that when I framed it in terms of the Peter Principle, it stuck with me and has become a useful shorthand.

Graphic courtesy of excalidraw.com

Here’s what happens:

Let’s say you start out with 10 projects in your queue. You’ll tackle several of them and knock them out right away: either because they lent themselves to easy solutions or they were in your sweet spot of capability.

But as these projects are completed, all that remains are the tasks that either are really hard, or aren’t things you ever wanted to do in the first place. New projects get added, but just contribute to this situation, as the ones that are easy flow in and out of your backlog rapidly, and the ones that are bad continue to accumulate.

Your backlog continues to worsen, day after day, until you wake up one morning and find that every single project you have on your list is horrible. Every project portfolio becomes filled with useless, crappy projects.

The Solution: Creative Destruction

Some amount of creative destruction is essential to counteract this: periodically, you have to look at your list of tasks and acknowledge that some of them just won’t get done by you. You either need to take them off your list, or, if they have to get done, pass them on to someone who can finish them.

Something else I noticed is that taking a long vacation often cleaned out my queue in an organic way. I think that explains (to a large degree) why vacations are often followed by a period of hyper productivity – more than can be explained simply by “feeling more refreshed.” The nature of the work had changed. So this is another great way to un-peter-principle yourself.

This Applies to a Lot of Backlogs and Other Things!

The mechanic that the Peter Principle describes actually helps explain a lot of “groups that have things flowing through them”: product backlogs, initiatives, committees, etc. Curious if you’ve noticed this in any systems you’ve observed!2I just came across another real life example of this a couple weeks after writing this article, this time involving our kid’s sippy cups. We have three or four duplicates of these, and generally if we find one on the ground in an obvious place, we replenish it into our stockpile. But slowly, each one disappears, so that in 3 to 4 weeks all of them get lost. And when they get lost, they get lost in very peculiar places we would not normally look! A cleansing begins to find all of them, and surprises and hilarity ensue.3On 5/13/22 I came across another example of the generalized Peter Principle in a Robert Sedgewick lecture on binary heaps. He compares the swim operation on a max-oriented binary heap to the Peter Principle.

How to find great senior engineers

Hiring experienced engineers is one of the most difficult and important things that engineering leaders have to pull off. But it’s hard to gauge experience in a series of short interviews. I’ve definitely worked with some amazing engineers who probably wouldn’t have been hired in some of my previous hiring pipelines.

Here are some tips on things I’ve found that work and don’t work.

✅ Things that Work

  1. Case Studies. I love this approach, though it requires a major investment to come up with a good case study and it requires a bit more commitment from interviewees to prepare. Basically, you prepare a 1-2 page story that lays out a particular technical scenario in deliberately broad brushstrokes, and then asks the candidate to figure out what they’d do. The specific problems are left a bit vague, because half of what you’re trying to figure out is how they go about framing the problem (that’s a key indicator for experience). I’ve found this approach to give really high signal: in other words, candidates tend to do really well or really poorly, and there aren’t a lot of “mehs.”
  2. Three Why’s Technique. Gauging experience is not about producing a list of technologies or problem-solution anecdotes. The details really, really matter. I landed on something I’m calling the “Three Why’s” (like 5 Why’s) – it’s the practice of asking someone to describe something, and then pressing them three more times for more details. It’s amazing what kinds of clarity this brings. I’ve had candidates tell me about what seemed like a really boring, generic project, and after pressing for details, it turned out to be filled with extremely interesting problems, tons of learned heuristics and gotchas that they assumed were just not appropriate to include in a typical time-strapped interview.
  3. Ask them to Break the Rules. This is a more specific instantiation of Peter Thiel’s famous interview question “What important truth do very few people agree with you on?” Anyone can tell you about design patterns or best practices. It’s a mark of hard-won experience to get a detailed, well-reasoned answer about when not to follow the rules and go off the beaten path. There can be wrong answers here1I know, there’s “no wrong answer” in any interview, but…we all know there are – not everyone can turn experience into the right intuition. Also, watch out for answers that represent an overcorrection for a time they’ve been badly burned by some mistake, or a dressed up generic opinion that’s in vogue at the time.

❌ Things that Don’t Work

  1. Trusting job history over your gut instinct. There have been a few times where a candidate looked great on paper, worked at FANG, etc, but during the interview, seemed strangely out of touch with their own profession and what solutions actually work. Don’t ignore your gut! I’ve been burned by ignoring these signs. A sparkling resume is not a guarantee of success. Why is that? Maybe it’s because being at a successful company can blur the lines between things that went well due to your contributions and things that were sheer luck (or due to someone else’s contribution that was out of your line of sight). Think of it this way: you know the classic adage that some people have 10 years of experience repeating the same year? Unfortunately, it’s almost impossible to tell from someone’s resume how much experience they’ve actually acquired.
  2. Asking about mistakes they’ve made. I used to ask this question, but don’t any more. My logic behind asking this was well-meaning: no one gains real experience without having made and learned from some serious mistakes. This one really seems like it should produce high signal, but in practice I found that it always seems to result in more, rather than less, confusion about a candidate’s qualities. I think the problem with this question is that explaining mistakes in a compelling, positive way is really, really hard, and so this line of questioning actually produces more signal about whether someone’s good at managing up.
  3. “Tell me about a big project where you were a major contributor.” I used to ask this all the time, and the intent behind it is correct: “I want to make sure they know how to find and complete impactful work that cuts across teams, so why not just directly ask?” It’s not that this question is bad – it’s just not-good: it gives very little signal. Why doesn’t it work? First of all, people are either completely unprepared for this one, or have over-prepared for it, complete with a totally synthetic, self-aggrandizing answer that frankly you have no way of verifying. This means at best, you’re getting signal on their level of preparation, not their actual experience – not a useful signal. Secondly, experience is more about triangulating and synthesizing learnings from multiple experiences in the context of a new problem than it is having done one very big thing. Not always true, but true enough that the signal you’ll get with this question isn’t worth the time.

© 2024 Ken Kantzer's Blog
