This guide covers common issues and their solutions for TinyLlama CLI.
**Error:**
```
bash: python: command not found
```

**Solution:**
```bash
# Install Python via Homebrew (macOS)
brew install python

# Or via apt (Ubuntu/Debian)
sudo apt update
sudo apt install python3 python3-venv python3-pip

# Or download from python.org
```
**Error:**
```
bash: pip: command not found
```

**Solution:**
```bash
# Install pip
python3 -m ensurepip --upgrade

# Or upgrade an existing installation
python3 -m pip install --upgrade pip
```
**Error:**
```
Error: venv module not found
```

**Solution:**
```bash
# Ubuntu/Debian
sudo apt install python3-venv

# macOS with Homebrew (venv is already included)
brew install python
```
**Error:**
```
HuggingFaceAuthenticationError: Invalid username or password
```

**Cause:** Gated model requires authentication.

**Solution:**
```bash
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
```
**Error:**
```
RepositoryNotFoundError: 404 Client Error: Not Found
```

**Cause:** Invalid model ID.

**Solution:**
```bash
# Check that the model exists on the Hub
# (list_models takes keyword arguments and returns an iterator)
python -c "from huggingface_hub import HfApi; print(list(HfApi().list_models(search='meta', limit=5)))"
```
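Many 404s come from a typo in the model ID rather than a missing repo. A quick local sanity check can catch malformed IDs before any network call; this is a minimal sketch (the `looks_like_model_id` helper is illustrative, not part of the CLI) that only validates the `namespace/name` shape, not whether the repo actually exists:

```python
import re

# Hub model IDs are "namespace/name" (a few legacy models use a bare name).
# This checks the shape only -- it cannot confirm the repo exists.
MODEL_ID_RE = re.compile(r"^[\w.-]+(/[\w.-]+)?$")

def looks_like_model_id(model_id: str) -> bool:
    """Return True if the string is shaped like a Hugging Face model ID."""
    return bool(MODEL_ID_RE.fullmatch(model_id))

print(looks_like_model_id("TinyLlama/TinyLlama-1.1B-Chat-v1.0"))  # True
print(looks_like_model_id("TinyLlama/TinyLlama/extra"))           # False
```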
**Error:**
```
OSError: [Errno 28] No space left on device
```

**Solution:**
```bash
# Check disk space
df -h

# Free up space by removing old models
rm -rf models/old-model-name/

# Or use a smaller model
python download_model.py --model smollm2
```
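Before deleting anything, it helps to see which model directories are actually eating the disk. A stdlib-only sketch, assuming downloaded models live in per-model subdirectories under `models/` as in the commands above (the `largest_models` helper name is ours):

```python
from pathlib import Path

def dir_size_bytes(path: Path) -> int:
    """Total size of all files under path."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

def largest_models(models_root: str = "models") -> list[tuple[str, int]]:
    """List model directories under models_root, largest first."""
    root = Path(models_root)
    if not root.is_dir():
        return []
    sizes = [(d.name, dir_size_bytes(d)) for d in root.iterdir() if d.is_dir()]
    return sorted(sizes, key=lambda item: item[1], reverse=True)

for name, size in largest_models():
    print(f"{size / 1e9:6.2f} GB  {name}")
```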
**Solution:**
```bash
# Downloads resume automatically (resume_download=True is already enabled),
# so just run the download again
python download_model.py --model tinyllama

# Or try during off-peak hours, or use a wired connection
```
**Error:**
```
Local model not found.
Run `python download_model.py` first.
```

**Solution:**
```bash
# Download a model first
python download_model.py

# Or verify models exist
ls -la models/
```
**Error:**
```
RuntimeError: CUDA out of memory
```

**Solution:**
```bash
# Free up GPU memory by closing other GPU applications

# Or use a smaller model
python download_model.py --model smollm2
```

```python
# Or force CPU mode in ai_cli.py:
device = "cpu"
```
**Error:**
```
RuntimeError: Expected all tensors to be on the same device
```

**Cause:** Mixed CPU/GPU tensors.

**Solution:**
```python
# In ai_cli.py, ensure device consistency
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```
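The device choice can be wrapped in a small helper so the same code also runs on machines without a GPU (or without torch installed at all). A sketch under that assumption; the `pick_device` name is ours, not part of `ai_cli.py`:

```python
def pick_device() -> str:
    """Choose "cuda" when a usable GPU is present, otherwise "cpu".

    Importing torch lazily means this also runs on machines where
    torch is not installed, e.g. while debugging the CLI itself.
    """
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"

device = pick_device()
# Move BOTH the model and every input tensor to the same device:
#   model = model.to(device)
#   inputs = {k: v.to(device) for k, v in inputs.items()}
print(device)
```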
**Error:**
```
RuntimeError: CPU out of memory
```

**Solution:**
```python
# In ai_cli.py, use lower memory settings
model = AutoModelForCausalLM.from_pretrained(
    str(self.model_dir),
    low_cpu_mem_usage=True,  # Add this
)
```
**Cause:** Large model, slow disk, or not enough RAM.

**Solution:**
```bash
# Check available RAM
free -h   # Linux
# On macOS, use Activity Monitor

# Use a smaller model
python download_model.py --model smollm2
```
**Cause:** Wrong prompt template or settings.

**Solution:**
```python
# Adjust temperature in ai_cli.py
temperature = 0.4  # Lower for more coherent output
```
**Cause:** Low repetition penalty.

**Solution:**
```python
# Increase repetition penalty
repetition_penalty = 1.15  # Default is 1.1
```
**Cause:** Max tokens limit.

**Solution:**
```python
# Increase max tokens
max_new_tokens = 512  # Default is 256
```
**Cause:** High max tokens or temperature.

**Solution:**
```python
# Lower max tokens
max_new_tokens = 128

# Or lower temperature
temperature = 0.4
```
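Why does lowering the temperature make output more coherent? Sampling divides the model's raw token scores (logits) by the temperature before the softmax, so values below 1.0 concentrate probability on the highest-scoring tokens. A pure-Python sketch with made-up logits for three hypothetical tokens:

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.4):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At `temperature = 0.4` the top token absorbs most of the probability mass, so sampling behaves almost greedily; at `1.0` the distribution stays flatter and output is more varied.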
**Problem:** Text not displaying correctly.

**Solution:**
```bash
# Check that the terminal supports Unicode
export LC_ALL=en_US.UTF-8

# Or use a different terminal:
# iTerm2 (macOS), Windows Terminal
```
**Problem:** `/help`, `/settings`, etc. not responding.

**Solution:** Type the full command, including the leading `/`, and check for typos:
```
/help    # Correct
help     # Wrong
```
**Error:**
```
APIError: API key required
```

**Solution:**
```bash
# Get an API key from Serper or Tavily
export SERPER_API_KEY="your_api_key"
# or
export TAVILY_API_KEY="your_api_key"
```
**Cause:** API issues or rate limiting.

**Solution:**
- Wait and try again
- Check that your API key is correct
- Try a different search provider
| Error | Meaning | Solution |
|---|---|---|
| `ModuleNotFoundError` | Missing package | `pip install package-name` |
| `ImportError` | Import failed | Check Python path and dependencies |
| `SyntaxError` | Code error | Check the code syntax |
| Error | Meaning | Solution |
|---|---|---|
| `HfHubHTTPError` 401 | Unauthorized | Set `HF_TOKEN` |
| `HfHubHTTPError` 403 | Forbidden | Request model access |
| `HfHubHTTPError` 404 | Not found | Check model ID |
| `HfHubHTTPError` 429 | Rate limited | Wait and retry |
| Error | Meaning | Solution |
|---|---|---|
| CUDA not available | No GPU | Install CUDA or use CPU |
| `OutOfMemoryError` | OOM | Use smaller model/settings |
| `RuntimeError` | Various | Check GPU drivers |
Enable verbose output:
```python
# Add to your code
import logging
logging.basicConfig(level=logging.DEBUG)
```
```bash
# Check Python version
python --version

# Check installed packages
pip list

# Check GPU availability
python -c "import torch; print(torch.cuda.is_available())"

# Check HuggingFace token
python -c "from huggingface_hub import HfApi; print(HfApi().whoami())"
```
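The checks above can be bundled into a single stdlib-only script whose output is easy to paste into a bug report. A sketch; the `environment_report` helper name is ours, and it only probes whether torch is installed rather than importing it (importing can be slow):

```python
import importlib.util
import platform
import shutil
import sys

def environment_report() -> dict:
    """Collect basic environment facts worth including in a bug report."""
    free_gb = shutil.disk_usage(".").free / 1e9
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "free_disk_gb": round(free_gb, 1),
        # find_spec checks availability without actually importing torch
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }

for key, value in environment_report().items():
    print(f"{key}: {value}")
```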
When reporting an issue, include the output of the diagnostic commands above.
| Model | Limitation |
|---|---|
| TinyLlama | Limited context (2048) |
| SmolLM2 | Smaller capacity |
| Qwen | May need more RAM |
| Nemotron | GGUF format may differ |
| Platform | Issue |
|---|---|
| WSL | May need CUDA setup |
| Docker | Needs GPU pass-through |
| ARM Mac | Limited GPU support |