Sunrise: A Higher-Order Prompting System

Abstract

This document details a statistical approach to problem-solving in large language models (LLMs) through the use of higher-order prompt functions. The techniques described in this document are based on the AI development from web engineers at The.com, and are intended to be applied to a collection of LLM-powered features known as Sunrise.

Open AI's GPT-3 is a powerful language model that can be used to generate novel and unique text responses for a wide variety of tasks. Large Language Models (LLMs) do not perform calculations, but instead predict the next word in a sequence of words. The primary example of this is the ability to generate text from an initial prompt:

Prompt: Mary had a little [blank]
Response: lamb

Prompt: The [blank] is blue.
Response: sky

Modern web development is heavily reliant on the automated generation HTML and CSS from predefined templates. Thus, the ability to generate templated code from a prompt is a natural fit for LLMs and is the basis for the Sunrise project.

Sunrise is capable of generating and parsing code to perform a variety of tasks, including:

AI-driven textual completions for a spreadsheet-like data structure, such as...
- Completions for the next value of a column or row to be entered, given the currently provided values.
- Completions to continue a sequence of values in a column or row of cells.
AI-driven code completions to aid in the creation of a website, such as...
- Completions for the next HTML element to be added to the website.
- Applying CSS to the existing elements of a website.
- Modifying existing HTML and CSS to match a desired visual design.

When combined with a web development framework, Sunrise can be used to generate a website from a non-technical description of the desired visual design, offer guidance for creating a desired visual layout of a website from a non-technical description, and provide suggestions applying relevant spreadsheet data to an automatically generated website.

Personifying language models has proven hazardous to our ability to effectively communicate how "AI" features exactly work when presented to non-technical human operators. However, we have found some utility in framing the thought-process of LLMs as if they were also a human, communicating with operator through a call-and-response system:

Prompt: Create a new HTML element with the tag name div. Set the text content of the element to 'Hello World'.
<div>Hello World</div>

The source of GPT-3's accuracy in response is the result of computational training data — a large corpus of text across varied domains of subject and thought. However, even when trained on large data sets, a language model's ability to accurately generate a relevant response is a function of training data as well as the prompt itself:

Prompt: Create two-column layout with a header and footer.
<div class="header">one</div>
         <footer>two</div>

Ironically, the large amount of training data that makes LLMs so powerful is also the source of their weakness: the training data is often too broad to be useful in a specific context. And the more specific a prompt, the more likely it is to be unique to a specific task, and the less likely it is to be useful in other contexts.

We have found that the best way to generate a relevant response is through further personification of the LLM as an answer engine querying in the specific domains of web development. The Sunrise project achieves accurate responses through a collection of utilities that automate the design of higher-order prompts, or prompts that are used to generate prompts...that are used to generate responses.

While it holds true that LLMs do not perform arithmetic, they are capable of performing a variety of tasks that are similar to arithmetic:

Prompt: What is 2 + 2?
Response: 2 + 2 is 4
Prompt: What is half of 10?
Response: 10 / 2 is 5

At first glance this may appear as if the LLM is capable of performing arithmetic, but this illusion vanishes under more difficult circumstances:

Prompt: What is the square root of 240493284.0463 / 50292053.726?
Response: 240493284.0463 / 50292053.726 is 4.75

LLMs are not capable of generating a response that is correct in the sense that it is the result of an arithmetic operation. Instead, LLMs are capable of generating a response that is similar to the result of an any arithmetic operation. The numbers in the above example are unique to the prompt, and are not present in the training data, and so the chance for a relevant response is low.

Ironically, humans perform arithmetic in a similar manner when presented with novel and large numbers: Simplify the problem to a smaller, more familiar task, and then apply the solution to the original. And when the solution to the original problem is not immediately obvious, we often resort to automation through a calculator. With this framing in mind, we can observe that humans are also capable of performing tasks that are similar to arithmetic.

And so much like a human is conditioned to reach for calculator or a spreadsheet to perform complex arithmetic, we can fine-tune an LLM to generate the JavaScript code necessary to perform complex arithmetic.

A second-order prompt is a prompt that is used to generate code that is used to generate a response. To create a second-order prompt, we need to preload some context to help the LLM understand how it should go about generating a response:

Prompt:

Do not perform calculations.

Do not compute the answer.

Use JavaScript Math and Date to perform calculations.

Respond with a performArithmetic function for the following prompt:

What is the square root of 240493284.0463 / 50292053.726?
function performArithmetic() {
  const a = 240493284.0463
  const b = 50292053.726
  const c = a / b
  const d = Math.sqrt(c)
  return d
}

With the given preamble of instructions, the LLM can accurately navigate through its training data to generate a JavaScript function that will perform accurate arithmetic. This is called meta-learning and is the primary tool we use to fine-tune LLMs to perform specific tasks.

The limits of second-order prompt engineering become apparant when an LLM is presented with complex and novel tasks. The LLM may generate a response that is not valid JavaScript, or may generate a response that is not the most efficient solution.

We can further fine-tune the LLM by providing additional upfront examples to help guide the generate a more accurate responses within the specific domain of arithmetic. In this example we include the first question and response from earlier. We then continue the pattern of providing a question and ask the LLM to complete the next performArithmetic function.

Prompt:

Do not perform calculations.

Do not compute the answer.

Use JavaScript Math and Date to perform calculations.

Respond with a performArithmetic function for the following prompt:

Question 1: What is the square root of 240493284.0463 / 50292053.726?
function performArithmetic() {
  const a = 240493284.0463
  const b = 50292053.726
  const c = a / b
  const d = Math.sqrt(c)
  return d
}
Question 2: 5 days a week, Jessie practices martial arts for 1.5 hours. Given that a week is 7 days, what is her average practice time per day each week?
function performArithmetic() {
  const daysPracticedInAWeek = 5
  const hoursPracticedInADay = 1.5
  const daysInAWeek = 7
  const totalHoursPracticedInAWeek = daysPracticedInAWeek * > hoursPracticedInADay
  const averagePracticeTimePerDay = totalHoursPracticedInAWeek / daysInAWeek
  return averagePracticeTimePerDay
}

With each successive example, the LLM is able to generate more accurate responses. However, the more examples we provide, the more likely it is that the LLM will begin to overfit to the examples, and generate responses that are not relevant to the original prompt.

Additionally, the LLM imposes a token-limit, usually around 1024 tokens depending on the model. The token-limit is the maximum number of tokens that can be within a single prompt. This means that the more examples we provide, the more likely it is that the LLM will reach the token-limit before it can generate a complete response.

Thus, a balance must be struck between the number of examples and the accuracy of the responses.

Generating second-order function for arithmetic is just but one of several modalities of response that can be generated by an LLM. A modality (or "mode") is a collection of call-and-response pairs that train the LLM of how it should reason about a specific domain of thought. In the example above, the modality of response is "functions that perform arithmetic." Other functional modalities include:

Generating and modifying HTML elements
Modifying CSS styles
Suggesting and implementing design patterns
Providing context-aware clues when the user is in a position to benefit from an unknown feature.
Predicting the user's next action, and providing a suggestion for how to achieve that action.

In all modes, the LLM does not perform the domain-specific task itself, but instead, determines the non-technical user's intentions to predefined set of programmatic internal utilities. And from the user's perspective, the LLM appears to provide responses that are novel, relevant, and accurate.

However, this still poses a meta-problem: How does the LLM determine the appropriate modality of response?

We can solve this problem by introducing a new meta-modality of response:

Prompt: Add a 1px border to all buttons.
Meta Prompt: Return confidence scores for the appropriate modality of response
{ "html": 0.8, "css": 0.5, "javascript": 0.3, "help": 0.1 }
<button style="border: 1px solid black">Button</button>

This approach is similar to how a assistant would delegate a novel but well-understood inquiry to a collegue with the appropriate subject-matter expertise. As the assistant is exposed to more inquiries, they become more accurate at determining the appropriate modality of response. The same is true for the meta-mode.

To our delight, fine-tuning a meta-mode requires no additional call-and-response examples. Rather, we can instead re-use the existing prompt data from the fine-tuned LLMs for each modality. With each sucessive mapping of a user prompt to a modality of response, the meta-mode becomes more accurate at determining the appropriate modality of response.

In the example above, the meta-mode determines the high confidence score of the HTML-mode due to the prompt's similarity to existing HTML-mode examples. When the difference between the confidence scores of the top two modalities is less than a threshold, the meta-mode asks the user to either choose from a mode manually or clarify their intent by restating the prompt.

Continuing with the ever increasing meta-ness, we can introduce a meta-meta-mode of response, i.e. Breaking down complex prompts into a series of simpler prompts, and then delegating each prompt to the appropriate modality of response.

Prompt: Get the sum of the first 1000 prime numbers and add it to the sum of the first 1000 even numbers.

Meta Meta Prompt: Break down the prompt into a series of simpler prompts
["Get the sum of the first 1000 prime numbers", "Add it to the sum of the first 1000 even numbers"]

Meta Prompt: Return confidence scores for the appropriate modality of response

[{ "javascript": 1 }, { "javascript": 1 }]

function getSumOfFirstNEvens(n) {
  let sum = 0
  for (let i = 0; i < n; i++) {
    sum += i * 2
  }
  return sum
}

function isPrime(n) {
  for (let i = 2; i < n; i++) {
    if (n % i === 0) {
      return false
    }
  }
  return true
}

function getSumOfFirstNPrimes(n) {
  let sum = 0
  let count = 0
  let i = 2
  while (count < n) {
    if (isPrime(i)) {
      sum += i
      count++
    }
    i++
  }
  return sum
}

This is again similar to how a human assistant would break down a complex inquiry into a series of simpler inquiries. This approach allows the LLM to work through a series of simpler prompts that are more likely to be understood, and allows us to work around technical limitations such as token-limits.

The domain of an fine-tuned LLM's responses can be thought of as a hyperplane in a n-dimensional space, where n is the number of examples provided. However, this process is not without its drawbacks: the LLM is only as accurate as the examples provided, and the more examples provided, the more likely it is that the LLM will overfit to the examples. In our experience, the optimal number of examples for a given modality is between 100 and 1000 examples, with more examples being necessary for more complex modalities.

Manually generating examples of second-order prompts is a daunting task, and was quickly determined to not scale to the dozens of modalities we aim to support. Through our research, we've developed an approach to automate this process by using a learning-to-learn to generate second-order prompts.

Generating second-order examples is initially performed through a human-supervised learning process, where the LLM is provided with a set of examples and prompted to generate a novel response that is similar to the examples. This process is again similar to how the operator would generate a second-order prompt, and is used to bootstrap the LLM's ability to generate second-order prompts. We call this process learning-to-learn and is the primary tool we use to automate the generation of second-order prompts.

As an human operator determines a modality's accuracy of response, the fine-tuned LLM is challenged with progressively more difficult and vague prompts. Inevitably, the LLM will begin to generate second-order functions that are either not relevant or incorrect. To combat this, we use a means regression to find consensus among a series of LLM's response to the same prompt with varying temperatures (i.e. the degree of randomness in the response):

Prompt: When I was 6 my sister was half my age. Now I'm 70. How old is my sister?

Response 1: (Temperature: 0.8)

function performArithmetic() {
  const myAge = 70
  const mySisterAge = 6
  const mySisterAgeWhenIWas6 = 3
  const mySisterAgeNow = mySisterAgeWhenIWas6 + (myAge - 6)

  return `Your sister is ${mySisterAgeNow} years old.`
}

Response 2: (Temperature: 0.5)

function performArithmetic() {
  const a = 70
  const b = 6
  const c = 3
  const d = a - b
  const e = c + d
  return e
}

Response 3: (Temperature: 0.1)

function performArithmetic() {
  return 3 + (70 - 6)
}

Response 4: (Temperature: 1)

function performArithmetic() {
  return 'sixty-seven'
}

This prompt is fairly straightforward and can be solved in a variety of ways. However, from the operator's perspective, we want to discard the responses that include sytnax errors, or functions that return irrelevant or incorrect information. Syntax errors are easily discarded by checking the response against a syntax checker, but determining the relevance and accuracy of a response requires a consensus among the returned values.

The operator is only interested in the mean occurence of a value, and can discard the responses that are not within a certain threshold of the mean. In the 4 responses above, we can execute each function and tally up the results:

"Your sister is 67 years old.": 1
67: 2
"sixty-seven": 1

Two of the four functions return the same value of 67, indicating a consensus. The subjective difference between the two functions is negligible: both achieve the same valid result. The operator can choose between one or more of these valid responses and add them to the fine-tuned LLM's examples.

If there is no consensus among the responses, this may indicate that the LLM is underperforming with the given prompt. This may actually be a good thing, as it indicates that the LLM is generating responses that are not only valid, but also novel. The operator can choose to add one or more of the responses to the fine-tuned LLM's examples, or discard them if they are not relevant to the prompt.

If all responses are invalid, the operator can choose to rephrase the prompt, or manually complete the response themselves. Alternatively, the invalid generated invalid responses can be stored for future analysis.

Our current approach to runtime security aligns with with well-known industry standards: Any data that is provided by an external source is considered unsafe and should not be executed in a privileged environment.

While this approach is suitable for typical user inputs, integrating LLM outputs requires additional measures to ensure the security of our systems. We must assume that all LLM-generated code is potentially harmful and executable only within an unprivileged environment. This is achieved by executing the LLM-generated code in a sandboxed environment, and only allowing the execution of a limited set of functions.

Sunrise: A Higher-Order Prompting System

A novel approach to AI prompt-design.

Abstract

Status of This Document

1. Introdution

1.1 Purpose

2. Prompt Engineering

2.1 Higher-order prompts

2.1.1 Second-order prompts

2.2 Fine-tuning

3. Modalities of response

3.1 Meta Modes

3.1.1 Meta Meta Modes

4. Learning-to-learn

4.1 Automation Through Means Regression

4.2 Security

5. Final Thoughts