Ben Dickson.
Since the release of ChatGPT, software engineers and organizations have been looking for ways to use large language models (LLMs) to increase productivity. There are many examples of LLMs generating code for complicated problems, but little information on how they are integrated into the software development process.
In a recent study, a team of researchers from Chalmers University of Technology – University of Gothenburg and RISE Research Institutes of Sweden observed 24 professional software engineers from 10 companies who used ChatGPT in their daily tasks over one week. Their findings shed light on the types of tasks software engineers use ChatGPT for and the factors that influence their experience.
The findings have important implications for enterprises looking to integrate LLMs into their workflows.
Software engineering tasks for ChatGPT
The study, which includes a review of the chat sessions with ChatGPT 3.5 and a survey, shows that software engineers use LLMs for three main categories of tasks:
Code generation and modification: This category, which the researchers refer to as “artifact manipulation,” includes tasks such as generating, refactoring, and fixing code. Artifact manipulation accounts for about one-third of the interactions with ChatGPT. These interactions are typically short, as users either quickly receive the expected results or abandon the attempt. However, longer dialogues occur when users persistently try to get ChatGPT to provide sources or correct errors in the generated solutions.
Expert consultation: Engineers often ask ChatGPT for resources, instructions, advice, or detailed information to assist them in their work tasks. The goal of these interactions is not to obtain a concrete solution but rather to receive a nudge in the right direction. In these interactions, ChatGPT serves as a virtual colleague or a more productive alternative to searching the internet. Consultation comprised 62% of the interactions of the software engineers who took part in the study.
Guidance and learning: Software engineers sometimes use ChatGPT to acquire broader theoretical or practical knowledge related to their work tasks. These dialogues account for a small portion of the interactions but often involve multiple follow-up queries to clarify previous answers.
Strengths and weaknesses
The biggest strength of ChatGPT was helping software engineers learn new concepts. Interacting with an LLM about its internal knowledge is much easier than searching for resources on the internet. The participants in the study also used ChatGPT to assist them in brainstorming sessions. LLMs can help generate multiple alternative solutions and ideas that can be valuable during the planning and design phases of software development.
On the other hand, some participants said they did not trust the generated artifacts, especially for complex and company-specific tasks. This lack of trust often led to thorough double-checking of any suggestions provided by ChatGPT, which could be counterproductive.
Another important problem is the lack of context. LLMs don’t know company-specific information and need to be provided with that context. Compiling and providing that context in the prompt adds friction that undermines the user experience. In some cases, privacy concerns and company policies prevent engineers from sharing detailed information, which can lead to frustration and incomplete interactions.
Another important tradeoff of using ChatGPT, which is less mentioned in other studies, is reduced team communication and focus. Participants sometimes used the chatbot to answer questions that might have been better directed to a colleague. ChatGPT can also reduce focus, as engineers may spend excessive time tweaking prompts to generate perfectly working code rather than fixing slightly defective outputs themselves.
What does it mean for enterprises?
If you’re employing software engineers, improving their productivity with LLMs hinges on amplifying these advantages and minimizing the tradeoffs. The study was carried out on ChatGPT 3.5, which is currently far behind frontier models such as GPT-4o and Claude 3 Opus. Current models have broader knowledge and are much better at avoiding hallucinations and false information.

However, when it comes to enterprise applications, there are a few problems that even frontier models don’t address. One of them is context: no matter how much training a model undergoes, it will know nothing about your company’s proprietary information.
Having chat interfaces that automatically provide contextual information to the model as your engineers interact with them will play a key role in taking the user experience to the next level. There are several ways to do this, including retrieval augmented generation (RAG), where contextual information is automatically added to the user’s prompt before being sent to the model. Alternatively, the LLM can be integrated into the user’s IDE, where it automatically uses code and project files as context when answering questions.
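The core of the RAG idea can be sketched in a few lines: rank your internal documents against the user's question, then prepend the best matches to the prompt. This is a minimal illustration with a toy bag-of-words similarity; a production system would use a neural embedding model and a vector database instead.

```python
from collections import Counter
import math

def embed(text):
    # Toy "embedding": word counts. Real RAG uses a neural embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    # Return the k documents most similar to the query.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, docs):
    # Prepend retrieved context to the user's question before sending to the LLM.
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical company-internal documents the base model knows nothing about.
internal_docs = [
    "The billing service retries failed payments three times.",
    "Deploys to production require approval from two reviewers.",
    "The auth service issues JWT tokens that expire after 15 minutes.",
]

prompt = build_prompt("How long do auth tokens last?", internal_docs)
print(prompt)
```

The augmented prompt now carries the relevant internal policy, so the model can answer a company-specific question it was never trained on.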
Another problem that needs to be solved is privacy and data-sharing restrictions. Many companies have strict rules about the kind of information you can share with third parties. This can limit the kinds of interactions that engineers can have with LLMs such as ChatGPT. A workaround is to use open models such as Llama 3. The open-source community has made impressive progress in parallel to private models. You can also run them on your servers, integrate them into your infrastructure, and make sure the data never leaves your organization.
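Many open-model servers expose an OpenAI-compatible chat endpoint, so switching to a self-hosted Llama 3 can be as simple as pointing your client at an internal URL. The sketch below only builds the request payload; the endpoint and model name are assumptions that depend on how your server is configured.

```python
import json

# Hypothetical self-hosted endpoint. Servers that wrap open models often
# expose an OpenAI-compatible route such as /v1/chat/completions.
LOCAL_ENDPOINT = "http://localhost:8000/v1/chat/completions"

def build_request(question, system="You are an internal engineering assistant."):
    # The payload is POSTed to LOCAL_ENDPOINT with any HTTP client,
    # so the data never leaves your own infrastructure.
    return json.dumps({
        "model": "llama-3-8b-instruct",  # assumed model name on the server
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
        "temperature": 0.2,
    })

payload = build_request("Summarize our deploy approval policy.")
print(payload)
```

Because the request shape matches the hosted APIs engineers already use, moving sensitive interactions in-house requires little change to tooling.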
Another point raised in the study is the energy engineers put into prompting the models. The way you frame your request and place your instructions has a significant impact on the LLM’s performance. Reducing the friction of prompt engineering can help improve the user experience and save the time engineers spend interacting with the LLM. One impressive direction in this regard is Anthropic’s Prompt Generator, which automatically creates the optimal prompt for the task you want to accomplish. Another example is OPRO, a technique developed by DeepMind that automatically optimizes prompts.
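The general loop behind techniques like OPRO is simple: score candidate prompts on a small evaluation set, keep the best one, and repeatedly ask an optimizer to propose revisions. The sketch below is a toy version, not DeepMind's implementation: the scorer is a keyword heuristic and the "optimizer" appends canned instructions, where a real system would grade model outputs and have an LLM rewrite the prompt.

```python
import random

def score(prompt, eval_set=None):
    # Stand-in for running the prompt on eval tasks and grading the outputs;
    # here we simply reward explicit, specific instructions (toy heuristic).
    keywords = ["step by step", "concise", "cite"]
    return sum(kw in prompt.lower() for kw in keywords)

def mutate(prompt):
    # Stand-in for an optimizer LLM proposing a revised prompt.
    additions = ["Think step by step.", "Be concise.", "Cite your sources."]
    return prompt + " " + random.choice(additions)

def optimize(seed_prompt, rounds=10, seed=0):
    # Greedy hill-climbing: keep a candidate only if it scores higher.
    random.seed(seed)
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(rounds):
        candidate = mutate(best)
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score

best, s = optimize("Explain the bug in this function.")
print(best, s)
```

The point is that prompt engineering becomes a search problem the machine can run on its own, instead of friction the engineer absorbs on every query.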
Finally, the study mentions the reduced focus caused by using ChatGPT. This challenge can be mitigated to a degree by integrating the LLM into the teamwork. An interesting example is Glue, a new corporate chat app that adds the LLM as an agent into discussion threads. Going from an isolated LLM experience to inserting the agent into group conversations can have really interesting results.
There is no doubt that LLMs will be great tools—but not replacements—for software engineers. Creating the right scaffolding for their use will make sure you get the most out of them.