Hackathon Autumn 2023: Machine Learning Assisted Development (or Prompt Engineering)
One of the most interesting things about agency culture is the freedom for employees to show their personality.
The best part of a Hackathon is the freedom to take on a new challenge at the edge of, or even outside, your skill set and see what you can come up with. 'Freedom to Fail' is occasionally touted as a key way to learn, but actual freedom to fail in the workplace is rare, so it's a treat to experience it now and again. The theme this time was 'Prompt Engineering', a term I find somewhat dubious. Based on conversations with friends and colleagues, there is definitely a skill to interacting with Large Language Models and getting potentially useful output back, but going in I wasn't sure I would call 'Prompt Engineering' a 'real' engineering skill. I won't bury the lede: I still don't. It can be useful, and the output it generates can be useful, but at best I think of it as an augment to, not a replacement for, actually knowing what you're doing.
Here I am only attempting to address the utility of Machine Learning Assisted Development and my experience with it. There are real-world instances where blind reliance on LLMs such as ChatGPT can lead to legal trouble. There are potential ethical problems in using LLM tools whose training sets aren't fully vetted, at least until the legal issues around fair use and copyright are resolved. Most relevant to using LLM-generated code, those questions have led Microsoft to indemnify usage of its Copilot tool. Caveat usor.
So how was it to participate in, and watch the results of, a Hackathon involving 'Prompt Engineering'? I would rate it as an equal combination of cool and frustrating. Anyone who expects LLM 'Prompt Engineering' to generate large, functional systems of code is, at least today, fooling themselves. What it can do is quickly generate bits of code and small generic frameworks in popular languages. The key is 'popular': the output is heavily dependent on the training set, and the less content the training set has on a topic, the less likely the model is to return something useful. In my experience, it was particularly bad at generating 'connective code', where a generic framework needed to be altered to fit a specific implementation. The tool my team was using to generate a basic GraphQL implementation did not fit well with AWS's GraphQL requirements, for example. We were better off using the AWS tutorials and working from there. (Our team deliberately picked something we personally weren't familiar with. The non-Prompt-Engineering outcome of this was that three back-end/DevOps people made very poor front-end React developers, something that should surprise nobody.)
Where it did come in handy was giving us enough information to get a prototype service up and running more quickly than we could have researched and implemented it otherwise. Some of the other teams had better luck here, particularly in using LLMs to generate code that then interacted with LLMs to produce output. One nice technique: a team used an LLM to describe a variety of role-playing characters, producing a set of canned output for physical characteristics, personality traits, and powers. These could be combined in a variety of ways and then fed to another ML tool to generate avatar images of those characters. This played to the strengths of both the self-documenting nature of interacting with LLM tools and passing the editorialized content as input to the image generator. The editorial step was a useful one: any user-generated content (such as a prompt that generates an image) should theoretically be vetted before being passed to a second service whose output might be publicly available. Confining the second service's input to content generated by the first service and then reviewed by humans makes it less likely that the input would push the second service beyond its guard rails. The non-deterministic nature of the second service also became an asset: the same 'puzzle prompt' would produce different images over multiple runs.
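To make the shape of that two-stage pipeline concrete, here is a minimal sketch. The trait lists, function names, and review logic are all hypothetical stand-ins for the real LLM and image-generation services; the point is the flow of reviewed content from the first stage to the second.

```python
import random

# Hypothetical stand-in for the first LLM's output: canned, pre-generated
# trait lists rather than a live model call.
TRAITS = {
    "physical": ["tall", "scarred", "silver-haired"],
    "personality": ["stoic", "mischievous", "cautious"],
    "power": ["pyromancy", "telepathy", "shadow-step"],
}

def generate_character(rng: random.Random) -> dict:
    """Assemble a character by combining pre-generated traits."""
    return {kind: rng.choice(options) for kind, options in TRAITS.items()}

def human_review(character: dict, banned: set) -> bool:
    """The editorial step: reject any character containing a banned trait."""
    return not any(trait in banned for trait in character.values())

def build_image_prompt(character: dict) -> str:
    """Confine the image generator's input to reviewed, structured content."""
    return (f"A {character['physical']}, {character['personality']} adventurer "
            f"wielding {character['power']}")

rng = random.Random(42)
character = generate_character(rng)
prompt = ""
if human_review(character, banned=set()):
    # Only reviewed content ever reaches the (non-deterministic) second stage.
    prompt = build_image_prompt(character)
    print(prompt)
```

The design point is that the second service never sees free-form user text, only a prompt assembled from a vetted vocabulary, which is what keeps it inside its guard rails.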
The avatars themselves were entertaining (and occasionally hilarious) but were definitely 'of a theme'. They certainly weren't as original as if a set of artists had gone out and drawn their own, or if a user had commissioned an artist to create an avatar of their character. But it also means the avatars didn't require a moderation policy and team for identifying and removing inappropriate content, beyond whatever the second service itself generates. For a three-day hackathon, that's potentially not a bad trade-off (assuming the art-generation service itself creates no legal issues).
One other thing to note: in addition to using LLMs to generate code, one of the teams used a tool to generate the 'Marketing Speak' around introducing and hyping up their project. To its credit, when one of the participants read it aloud, it sounded exactly like the 'Exciting! New! Project!' copy a marketing department would produce. But it also underlined how little new information was being created or passed along: there was no differentiation or value add, just something we'd all heard before and had learned to discount.
My takeaway at the end of the hackathon falls into a few different categories:
Training sets are going to be an enormous issue: the permissions involved in assembling and using them, and the compensation owed to the people who create the content that goes into them.
LLMs work best when given a very large training set and when the code to be generated is roughly congruent with that training set. The more specialized the knowledge, and therefore the less likely it is to appear in the training set, the worse the results. The tool my team used did not generate GraphQL code that worked well with AWS's service, possibly due to a lack of training material.
The ability to use the tools to string together other, well documented services reminded me quite a lot of UNIX shell scripting. The implementation is more complex than a `|`, and the tools help. I've seen 'shell scripting' create a lot of value over the years, and I think there's something to be had here. One of the teams put together a tool which used Slack output to create and populate JIRA tickets, and the created tickets, while not perfect, were sufficiently good to cut down the labor of copy-pasting information from Slack over to JIRA.
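In the spirit of that shell-scripting analogy, here is a toy sketch of the Slack-to-JIRA idea. The message format, heuristic, and ticket fields are assumptions for illustration, not Slack's or JIRA's real APIs; a real version would call both services and still leave the drafts for a human to review.

```python
import re

def extract_action_items(slack_messages: list) -> list:
    """Pull out messages that look like action items (a crude heuristic)."""
    pattern = re.compile(r"^(TODO|ACTION):\s*(.+)", re.IGNORECASE)
    items = []
    for message in slack_messages:
        match = pattern.match(message.strip())
        if match:
            items.append(match.group(2))
    return items

def draft_ticket(summary: str) -> dict:
    """Shape an action item into a ticket draft; a human still reviews it."""
    return {
        "summary": summary[:80],           # keep the title short
        "description": summary,
        "labels": ["from-slack", "needs-review"],
    }

messages = [
    "good morning all",
    "TODO: rotate the staging credentials",
    "Action: document the GraphQL schema",
]
tickets = [draft_ticket(item) for item in extract_action_items(messages)]
print(len(tickets), "drafts ready for review")
```

Like a shell pipeline, each stage does one simple thing and passes structured output to the next; the LLM's role in the real tool was mostly to write the glue quickly.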
Expertise in building complex systems or nuanced output still counts. The avatar tool could generate some fun and interesting avatar pictures, but they aren't 'art' and I wouldn't try to pass them off as art. Any 'value add' they had was very low, though they could be seen as providing a feature that would cost too much (in moderation of user-uploaded content) if done another way.
The tools don't replace 'pair programming' or having another human to talk to when troubleshooting or trying to figure something out. They may augment it, but they only go so far. Would I use one of these tools in the future to speed up my own work? Yes, I would. But I wouldn't expect they would 'just do the job' for me.