Current Configuration
Judge/Synthesis Model: Claude 3 Haiku
Models Being Queried: GPT-4 (OpenAI), Claude 3 Sonnet (Anthropic)
Max Iterations: 3

Frequently Asked Questions

The LLM Consortium works by sending your prompt to multiple AI models (currently GPT-4 and Claude 3 Sonnet) in parallel. Each model processes your request independently, and then a judge model (Claude 3 Haiku) analyzes and synthesizes their responses to provide the best possible answer.
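As a rough illustration, the parallel fan-out step can be sketched in Python as below. The helper names and the stubbed query_model body are assumptions for illustration only; the real system calls the OpenAI and Anthropic APIs.

import asyncio

# Minimal sketch of the parallel fan-out, assuming a hypothetical query_model helper.
async def query_model(model_name: str, prompt: str) -> str:
    # Placeholder: a real implementation would call the provider's API here.
    return f"[{model_name} response to: {prompt!r}]"

async def query_all_members(prompt: str) -> dict[str, str]:
    members = ["gpt-4", "claude-3-sonnet"]
    # Every member model answers the same prompt independently and concurrently.
    responses = await asyncio.gather(*(query_model(m, prompt) for m in members))
    return dict(zip(members, responses))

# Example: responses = asyncio.run(query_all_members("Summarize the CAP theorem"))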

If the judge model's confidence in the synthesized answer is below 0.8, the system will automatically initiate another iteration. In each iteration, the models receive refined prompts based on previous responses. This process continues until either the confidence threshold is met or the maximum of 3 iterations is reached.
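A hedged sketch of that control loop is below. The query_all and judge callables stand in for the query and synthesis steps; they are illustrative parameters, not the project's actual function names.

from typing import Callable

CONFIDENCE_THRESHOLD = 0.8  # threshold described above
MAX_ITERATIONS = 3          # matches the configured maximum

def run_with_iteration(prompt: str,
                       query_all: Callable[[str], dict[str, str]],
                       judge: Callable[[str, dict[str, str]], dict]) -> dict:
    # Repeat query -> judge rounds until the judge is confident enough or iterations run out.
    current_prompt = prompt
    verdict: dict = {"confidence": 0.0, "synthesis": "", "refinement_areas": ""}
    for _ in range(MAX_ITERATIONS):
        member_responses = query_all(current_prompt)
        verdict = judge(prompt, member_responses)
        if verdict["confidence"] >= CONFIDENCE_THRESHOLD:
            break
        # Refine the next round's prompt using the judge's feedback.
        current_prompt = (
            f"{prompt}\n\nPrevious synthesis:\n{verdict['synthesis']}\n"
            f"Please improve on: {verdict['refinement_areas']}"
        )
    return verdict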

The judge model (Claude 3 Haiku) analyzes responses based on multiple criteria:
  • Completeness and accuracy of information
  • Consistency between model responses
  • Presence of any dissenting views
  • Areas that might need refinement
This analysis results in a confidence score, and the system will iterate if needed to improve the response quality.

The synthesis process involves the following steps (see the prompt-construction sketch after this list):
  • Combining the best insights from all model responses
  • Identifying and resolving any contradictions
  • Highlighting important dissenting views
  • Providing a confidence score for the final answer
  • Suggesting areas that might need further exploration
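One way to picture this step is the judge-prompt construction below. The wording of the instructions is a paraphrase of the criteria above, not the project's actual prompt.

def build_judge_prompt(original_prompt: str, member_responses: dict[str, str]) -> str:
    # Collect every member model's answer into one block for the judge to compare.
    answers = "\n\n".join(
        f"--- Response from {model} ---\n{text}"
        for model, text in member_responses.items()
    )
    return (
        f"Original question:\n{original_prompt}\n\n"
        f"Candidate responses:\n{answers}\n\n"
        "Combine the best insights into a single answer, resolve any contradictions, "
        "note important dissenting views, give a confidence score between 0 and 1, "
        "and suggest areas that need further exploration."
    )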

Each model (GPT-4 and Claude 3 Sonnet) provides a confidence score (0-1) indicating how certain it is about its own response (see the extraction sketch after this list). This score is based on:
  • The model's understanding of the prompt
  • The completeness of its response
  • The reliability of information provided
  • Any ambiguities or uncertainties in the response
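Because each response carries its self-reported score inside <confidence> tags (see the prompt structure later in this FAQ), extracting it can be as simple as the sketch below. The fallback default is an assumption, not part of the system's behaviour.

import re

def extract_member_confidence(raw_response: str, default: float = 0.5) -> float:
    # Pull the self-reported 0-1 score out of the model's <confidence> tags.
    match = re.search(r"<confidence>\s*([01](?:\.\d+)?)\s*</confidence>", raw_response)
    return float(match.group(1)) if match else default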

The final confidence score is determined by the judge model (Claude 3 Haiku) based on multiple factors:
  • Agreement between model responses
  • Completeness of the synthesized answer
  • Quality of supporting evidence
  • Resolution of any contradictions
  • Overall coherence of the final response
This score represents the judge's confidence in the quality and reliability of the synthesized response.

Each model (GPT-4 and Claude 3 Sonnet) receives this structured prompt:
1. Begin by carefully considering the specific instructions provided.

2. Write your thought process inside <thought_process> tags, including:
   - Key aspects relevant to the query
   - Potential challenges or limitations
   - How response instructions affect the approach
   - Different angles and step-by-step logic

3. Provide your confidence level (0-1) in <confidence> tags

4. Present your final answer in <answer> tags

This ensures consistent, well-structured responses from all models.
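A minimal template producing this four-part structure might look like the sketch below; the exact wording is illustrative, not the verbatim prompt the system sends.

MEMBER_PROMPT_TEMPLATE = """\
1. Begin by carefully considering the specific instructions provided.
2. Write your thought process inside <thought_process> tags.
3. Provide your confidence level (0-1) inside <confidence> tags.
4. Present your final answer inside <answer> tags.

User query:
{query}
"""

def build_member_prompt(query: str) -> str:
    # Wrap the user's query in the structured instructions described above.
    return MEMBER_PROMPT_TEMPLATE.format(query=query)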

The judge (Claude 3 Haiku) receives an analysis prompt instructing it to review all member responses and provide:
1. A synthesized answer combining best insights
2. Confidence in synthesis (0-1)
3. Analysis of responses
4. Notable dissenting views
5. Whether further iteration is needed

The judge's response is structured with XML tags (parsed in the sketch after this example):
<synthesis>[Combined response]</synthesis>
<confidence>[0-1 score]</confidence>
<analysis>[Response analysis]</analysis>
<dissent>[Dissenting views]</dissent>
<needs_iteration>true/false</needs_iteration>
<refinement_areas>[Areas needing exploration]</refinement_areas>
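A small parser for this tag structure might look like the following; the field names mirror the tags above, and the defaults used for missing tags are assumptions.

import re

def _tag(name: str, text: str) -> str:
    # Return the contents of the first <name>...</name> block, or "" if absent.
    match = re.search(rf"<{name}>(.*?)</{name}>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def parse_judge_response(raw: str) -> dict:
    return {
        "synthesis": _tag("synthesis", raw),
        "confidence": float(_tag("confidence", raw) or 0.0),
        "analysis": _tag("analysis", raw),
        "dissent": _tag("dissent", raw),
        "needs_iteration": _tag("needs_iteration", raw).lower() == "true",
        "refinement_areas": _tag("refinement_areas", raw),
    }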

The final confidence score can be higher or lower than individual model scores because:
  • Strong agreement between models can increase confidence
  • Complementary insights from different models can create a more complete answer
  • The judge model validates and fact-checks the combined response
  • Contradictions between models might lower the final confidence
  • The synthesis process may resolve uncertainties present in individual responses