This guide covers all configuration options for TinyLlama CLI.
Your HuggingFace API token (required for gated models):

```bash
# Linux/macOS
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
```

```powershell
# Windows (PowerShell)
$env:HF_TOKEN = "hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
```

Alternative variable name:

```bash
export HUGGINGFACEHUB_API_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
```
Custom directory for storing models (default: ./models):

```bash
export MODEL_DIR="/path/to/models"
```

Custom directory for chat transcripts (default: ./transcripts):

```bash
export TRANSCRIPTS_DIR="/path/to/transcripts"
```

Custom directory for training data export (default: ./training_data):

```bash
export TRAINING_DATA_DIR="/path/to/training_data"
```
Most public models download without any credentials. For gated models (like Llama), you'll also need an `HF_TOKEN` with access granted to the model on HuggingFace.
```bash
# Test authentication
python -c "from huggingface_hub import HfApi; api = HfApi(); print(api.whoami())"
```
```
tinyllama-cli/
├── models/              # Default model directory
│   ├── TinyLlama-1.1B-Chat-v1.0/
│   ├── NVIDIA-Nemotron-3-Nano-4B-GGUF/
│   └── ...
├── transcripts/         # Chat transcripts
├── training_data/       # Exported training data
└── ...
```
Modify the code to use custom directories:

```python
# In download_model.py
import os
from pathlib import Path

def model_dir_for(model_id: str) -> Path:
    """Resolve the local folder for a model, honoring MODEL_DIR."""
    folder_name = model_id.split("/")[-1]          # e.g. "TinyLlama-1.1B-Chat-v1.0"
    custom_dir = os.getenv("MODEL_DIR", "models")  # fall back to ./models
    return Path(custom_dir) / folder_name
```
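For example, with `MODEL_DIR` set, the helper resolves paths under the custom directory. A quick sanity check (the function is repeated here so the snippet is self-contained):

```python
import os
from pathlib import Path

def model_dir_for(model_id: str) -> Path:
    folder_name = model_id.split("/")[-1]
    custom_dir = os.getenv("MODEL_DIR", "models")
    return Path(custom_dir) / folder_name

os.environ["MODEL_DIR"] = "/data/llm-models"
print(model_dir_for("TinyLlama/TinyLlama-1.1B-Chat-v1.0"))
# → /data/llm-models/TinyLlama-1.1B-Chat-v1.0
```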
The CLI uses automatic tuning, but you can modify defaults in ai_cli.py:

```python
from dataclasses import dataclass

@dataclass
class GenerationConfig:
    temperature: float = 0.65        # Randomness (0 = deterministic, higher = more random)
    top_p: float = 0.9               # Nucleus sampling threshold
    top_k: int = 40                  # Top-k sampling
    repetition_penalty: float = 1.1  # Penalize repetition
    max_new_tokens: int = 256        # Maximum tokens to generate
    do_sample: bool = True           # Use sampling vs. greedy decoding
```
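These fields correspond to keyword arguments of `model.generate()` in transformers; a minimal sketch of overriding one field and unpacking the dataclass (the CLI's actual call site may differ):

```python
from dataclasses import dataclass, asdict

@dataclass
class GenerationConfig:
    temperature: float = 0.65
    top_p: float = 0.9
    top_k: int = 40
    repetition_penalty: float = 1.1
    max_new_tokens: int = 256
    do_sample: bool = True

cfg = GenerationConfig(temperature=0.2)  # a more deterministic run
kwargs = asdict(cfg)
print(kwargs["temperature"], kwargs["max_new_tokens"])
# → 0.2 256
# outputs = model.generate(**inputs, **kwargs)  # hypothetical call site
```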
The CLI uses different prompt templates for different models. For example:

```
<|system|>
{system_prompt}</s>
<|user|>
{user_message}</s>
<|assistant|>
```
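Filled in, the template above looks like this. A plain `str.format` sketch (the CLI may assemble the prompt differently):

```python
TEMPLATE = (
    "<|system|>\n{system_prompt}</s>\n"
    "<|user|>\n{user_message}</s>\n"
    "<|assistant|>\n"
)

prompt = TEMPLATE.format(
    system_prompt="You are a helpful, concise AI assistant.",
    user_message="What is nucleus sampling?",
)
print(prompt)
```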
You can modify the prompt template in ai_cli.py:

```python
def _prompt_template(self, extra_system: str | None = None) -> str:
    # Build and return the full prompt string here, appending
    # extra_system to the system prompt when provided.
    ...
```
The default system prompt is:

```
You are a helpful, concise AI assistant. Keep answers clear and practical.
When unsure, say what you are uncertain about.
```
Modify in ai_cli.py:

```python
SYSTEM_PROMPT = (
    "Your custom system prompt here. "
    "Be specific about the assistant's role and capabilities."
)
```
Or pass it at runtime (requires code modification).
The CLI automatically detects CUDA GPUs:

```python
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32
```
Force CPU mode:

```python
# In ai_cli.py, modify the device selection
device = "cpu"  # Force CPU (keep dtype as torch.float32 on CPU)
```
For better performance, half precision is selected automatically on modern GPUs, with full precision on CPU:

```python
# Uses float16 on GPU, float32 on CPU
dtype = torch.float16 if torch.cuda.is_available() else torch.float32
```
For web search features, you may need API keys:

```bash
# Optional: Serper API for web search
export SERPER_API_KEY="your_serper_api_key"

# Optional: Tavily API
export TAVILY_API_KEY="your_tavily_api_key"
```
The CLI automatically decides when to search:

```python
def should_search_web(query: str) -> bool:
    # Search for recent or time-sensitive information;
    # skip for factual/educational queries the model can answer itself.
    ...
```
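A minimal keyword heuristic sketch of that decision (illustrative only; the actual logic in ai_cli.py may weigh more signals):

```python
# Substrings that hint the answer depends on current information.
RECENCY_HINTS = ("today", "latest", "current", "news", "price", "2025")

def should_search_web(query: str) -> bool:
    """Search when the query hints at time-sensitive information."""
    q = query.lower()
    return any(hint in q for hint in RECENCY_HINTS)

print(should_search_web("What is the latest TinyLlama release?"))  # → True
print(should_search_web("Explain nucleus sampling"))               # → False
```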
Settings can also be kept in a .env file:

```bash
# Copy the example file
cp .env.example .env
```

Then edit .env:

```bash
# HuggingFace Token (required for gated models)
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Optional: Custom directories
MODEL_DIR=./models
TRANSCRIPTS_DIR=./transcripts
TRAINING_DATA_DIR=./training_data

# Optional: API Keys
# SERPER_API_KEY=your_key
# TAVILY_API_KEY=your_key
```
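If you'd rather not depend on a package like python-dotenv, a tiny loader along these lines covers the simple KEY=value format above (a sketch; it ignores quoting and multiline edge cases):

```python
import os

def load_dotenv_minimal(path: str = ".env") -> None:
    """Set KEY=value pairs from a .env file, skipping comments and blanks."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Real environment variables win over .env values.
            os.environ.setdefault(key.strip(), value.strip())
```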
For systems with limited RAM, half precision plus `low_cpu_mem_usage` reduces the memory footprint (true quantization, e.g. 4-bit, would require additional libraries):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    str(model_dir),
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,  # Reduces peak memory during loading
)
```
For better throughput you can process multiple inputs in batches (requires code changes). This mainly helps offline batch inference; it does little for interactive chat.
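Independent of any model code, splitting prompts into fixed-size batches can be as simple as this generic helper (not part of the CLI):

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

def batched(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])

prompts = ["q1", "q2", "q3", "q4", "q5"]
print(list(batched(prompts, 2)))
# → [['q1', 'q2'], ['q3', 'q4'], ['q5']]

# Hypothetical call site, batching tokenization and generation:
# for batch in batched(prompts, 2):
#     inputs = tokenizer(batch, padding=True, return_tensors="pt")
#     outputs = model.generate(**inputs)
```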
Enable verbose logging (requires code changes):

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
Logs are written to: