tinyllama-cli

Advanced Features

This guide covers advanced features of TinyLlama CLI including prompt tuning, training data export, and transcript management.

Prompt Tuning

How It Works

TinyLlama CLI automatically optimizes generation settings based on your input:

# Automatic tuning based on prompt analysis
def tune(user_input: str, turns: int) -> GenerationConfig:
    # Analyzes the prompt and adjusts settings
    pass

Tuning Triggers

Trigger Keywords	Settings Adjusted
explain, what is, why, how does, difference, compare	Temperature: 0.45, Top-P: 0.82, Max: 220
code, python, bug, error, function, class, api	Temperature: 0.40, Top-P: 0.85, Max: 280
story, poem, creative, brainstorm, ideas, fiction	Temperature: 0.88, Top-P: 0.95, Max: 320
calculate, solve, math, equation, evaluate	Temperature: 0.00, Top-P: 1.00, Max: 96

Manual Tuning

You can modify the automatic tuning in ai_cli.py:

class TinyLlamaOptimizer:
    FACTUAL_HINTS = ("explain", "what is", "why", "how does", "difference", "compare")
    CREATIVE_HINTS = ("story", "poem", "creative", "brainstorm", "ideas", "fiction")
    CODE_HINTS = ("code", "python", "bug", "error", "function", "class", "api")
    MATH_HINTS = ("calculate", "solve", "math", "equation", "evaluate", "multiply", "divide")

Adding Custom Triggers

Add your own keywords:

class TinyLlamaOptimizer:
    # Add custom triggers
    MY_HINTS = ("custom", "keywords", "here")
    
    @staticmethod
    def tune(user_input: str, turns: int) -> GenerationConfig:
        cfg = GenerationConfig()
        
        if any(hint in user_input.lower() for hint in TinyLlamaOptimizer.MY_HINTS):
            cfg.temperature = 0.5
            # ... more custom settings
        
        return cfg

Training Data Export

Automatic Export

Every time you use /save or exit the CLI, training data is exported:

training_data/tinyllama_sft.jsonl

Export Format

Each line is a JSON object:

{
  "id": "conv_2024-01-15_14-30-22",
  "source_transcript": "2024-01-15_143022.json",
  "created_at": "2024-01-15T14:30:22Z",
  "messages": [
    {"role": "system", "content": "You are a helpful AI assistant..."},
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a high-level..."}
  ]
}

Training Data Fields

Field	Description
`id`	Unique conversation ID
`source_transcript`	Original transcript filename
`created_at`	ISO 8601 timestamp
`messages`	Array of message objects

Customizing Export

Modify export behavior in ai_cli.py:

def _export_training_data(self) -> None:
    # Custom export logic here
    pass

Transcript Saving

Automatic Saving

Transcripts are saved:

When you type /save
When you type /exit
When you press Ctrl+D

Transcript Location

transcripts/
├── 2024-01-15_143022.json
├── 2024-01-16_091545.json
└── ...

Transcript Format

{
  "model": "TinyLlama-1.1B-Chat-v1.0",
  "model_path": "./models/TinyLlama-1.1B-Chat-v1.0",
  "started_at": "2024-01-15T14:30:22Z",
  "settings": {
    "temperature": 0.65,
    "top_p": 0.9,
    "max_new_tokens": 256
  },
  "messages": [
    {
      "role": "user",
      "content": "Hello!",
      "timestamp": "2024-01-15T14:30:25Z"
    },
    {
      "role": "assistant", 
      "content": "Hello! How can I help you today?",
      "timestamp": "2024-01-15T14:30:28Z"
    }
  ]
}

Loading Transcripts

To load a previous transcript (requires code modification):

# Example: Load transcript at startup
python ai_cli.py --load-transcript transcripts/2024-01-15_143022.json

Web Search Integration

When to Use

The CLI automatically decides when to search based on:

def should_search_web(query: str) -> bool:
    # Search for recent events, news, weather
    # Skip for general knowledge
    pass

Search Providers

The CLI supports multiple search backends:

Serper (default) - Google’s search results
Tavily - AI-powered search
DuckDuckGo - Free, no API key needed

Setting Up Search

# Serper API
export SERPER_API_KEY="your_api_key"

# Tavily API  
export TAVILY_API_KEY="your_api_key"

Using Web Search

Just ask about current events:

You: What's the latest news about AI?

[Searching the web...]

TinyLlama: Here's the latest news about AI...

Custom Prompt Templates

TinyLlama Template

<|system|>
{system_prompt}</s>
<|user|>
{user_message}</s>
<|assistant|>
{response}</s>

Custom Templates

Add your own template in ai_cli.py:

def _prompt_template(self, extra_system: str | None = None) -> str:
    # Your custom template
    template = f"""<|system|>
{self.history[0]['content']}</s>
<|user|>
{self.history[-1]['content']}</s>
<|assistant|>
"""
    return template

HuggingFace Chat Templates

Many models use standard chat templates:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_dir)
chat = [
    {"role": "system", "content": "You are helpful"},
    {"role": "user", "content": "Hello"}
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False)

Memory Management

Context Window

Each model has a maximum context length:

Model	Context Length
TinyLlama	2048 tokens
SmolLM2	2048 tokens
Qwen2.5	8192 tokens
Nemotron	4096 tokens

Dynamic Context

The CLI automatically manages context:

def _dynamic_max_new_tokens(self, prompt_tokens: int, requested_tokens: int) -> int:
    # Calculates safe token limits
    # Adjusts based on available RAM
    pass

RAM-Based Limits

Available RAM	Max Tokens
< 8 GB	96
< 12 GB	160
< 16 GB	224
< 24 GB	320
< 32 GB	448
> 32 GB	640

Performance Optimization

GPU Optimization

# Use float16 for faster inference on GPU
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

CPU Optimization

# Use quantized weights (if supported)
# Check your model's quantization support

Inference Speed

Tips for faster inference:

Use GPU if available
Use smaller models (TinyLlama, SmolLM2)
Reduce max_new_tokens
Use greedy decoding (do_sample=False)

Custom Commands

Adding Commands

Add custom commands in ai_cli.py:

def _handle_custom_command(self, cmd: str) -> bool:
    if cmd == "/mycommand":
        # Custom logic
        return True
    return False

Example: Context Stats

def _show_context_stats(self) -> None:
    """Show current context usage"""
    console.print(f"Messages: {len(self.history)}")
    console.print(f"Prompt tokens: {self.last_prompt_tokens}")
    console.print(f"Response tokens: {self.last_response_tokens}")

Next Steps

Usage Guide - Basic usage
Model Download - Download models
Configuration - Environment setup
API Reference - Full API documentation
Troubleshooting - Common issues

This site is open source. Improve this page.