Edge AI: On-Device AI for Near-Zero Latency and Privacy
Edge AI means running AI models directly on the device (smartphone, IoT device, car) rather than in the cloud. The approach gives AI applications very low latency, strong privacy (data never has to be sent to a server) and offline operation.
In 2025, Edge AI is growing quickly thanks to dedicated NPUs (Apple A18 Neural Engine, Qualcomm Snapdragon 8 Gen 4, Google Tensor G5) and to model compression techniques (quantization, distillation) that make it possible to run LLMs with 3-7 billion parameters on a smartphone.
2025 innovations:
- On-device LLMs: Llama 4 3B, Gemini Nano 2, Phi-4 mini on smartphones
- NPUs at 45+ TOPS: high-performance Neural Processing Units (Snapdragon 8 Gen 4)
- INT4 quantization: 4-bit models with a small accuracy loss
- Optimized frameworks: TensorFlow Lite, Core ML 6, ONNX Runtime Mobile
- Battery efficient: in practice, roughly 3-5x less energy per inference than a cloud round trip (less radio use)
- Latency <50ms: real-time responsiveness (vs 200-500ms via the cloud)
Edge AI Adoption
Gartner predicts that 70% of smartphones sold in 2026 will include a dedicated NPU (vs 35% in 2024). Edge AI applications include real-time translation (Google Pixel 9), offline voice assistants (Apple Siri), photo enhancement (Samsung Galaxy S25) and AR/VR (Meta Quest 4).
Edge AI Architecture: From Cloud to Device
Cloud vs. Edge Comparison
CLOUD AI (classic architecture):
[Device] ──→ [Internet] ──→ [Cloud Server + GPU] ──→ [Response] ──→ [Device]
                 ↑                    ↑
        Latency 200-500ms      Cost $0.002/req
        Privacy risk           Requires connection
Drawbacks:
✗ Latency: 200-500ms minimum (network + inference)
✗ Privacy: data is sent to a server (GDPR risk)
✗ Cost: per-request API fees (scales poorly)
✗ Offline: impossible without internet access
✗ Bandwidth: uses mobile data
════════════════════════════════════════════════════════
EDGE AI (on-device):
[Device with NPU] ──→ [Local inference] ──→ [Response]
                             ↑
                      Latency <50ms
                      Privacy 100%
                      Cost $0
                      Works offline
Advantages:
✓ Latency: 10-50ms (roughly 10x faster)
✓ Privacy: data stays on the device (helps GDPR compliance)
✓ Cost: no per-request inference cost after deployment
✓ Offline: works anywhere (plane, subway, remote areas)
✓ Bandwidth: no mobile data used
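To make the cost and latency comparison concrete, here is a minimal Python sketch that uses only the figures quoted above ($0.002 per request, 200-500ms vs 10-50ms). The function names and the one-million-requests-per-day volume are illustrative assumptions, not values from a real deployment.

```python
def cloud_inference_cost(requests_per_day: int, days: int = 30,
                         cost_per_request: float = 0.002) -> float:
    """Total API spend in dollars for a given request volume."""
    return requests_per_day * days * cost_per_request


def latency_speedup(cloud_ms: float, edge_ms: float) -> float:
    """How many times faster the on-device path is than the cloud path."""
    return cloud_ms / edge_ms


# Hypothetical app serving one million requests per day.
monthly = cloud_inference_cost(requests_per_day=1_000_000)
print(f"Cloud cost over 30 days: ${monthly:,.0f}")
print(f"Speedup at the fast end (200ms vs 10ms): {latency_speedup(200, 10):.0f}x")
print(f"Speedup at the slow end (500ms vs 50ms): {latency_speedup(500, 50):.0f}x")
```

The on-device side has no per-request term, so its cost is dominated by one-time engineering and model distribution, which this sketch does not model.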
Edge AI Technical Stack
EDGE AI STACK:
┌─────────────────────────────────────────────────────┐
│ APPLICATION LAYER │
│ • Photo enhancement, voice assistant, translation │
└─────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────┐
│ INFERENCE FRAMEWORKS │
│ iOS: Core ML 6 │
│ Android: TensorFlow Lite, NNAPI │
│ Cross-platform: ONNX Runtime Mobile │
└─────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────┐
│ OPTIMIZATION LAYER │
│ • Quantization (FP32 → INT8/INT4) │
│ • Pruning (remove unnecessary weights) │
│ • Distillation (teacher → student model) │
└─────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────┐
│ HARDWARE ACCELERATION │
│ NPU: Apple Neural Engine (35 TOPS) │
│ Qualcomm Hexagon (45 TOPS) │
│ Google Tensor TPU (28 TOPS) │
│ GPU: Adreno, Mali (fallback) │
│ DSP: Signal processing offload │
└─────────────────────────────────────────────────────┘
Model Compression: Essential Techniques
1. Quantization (Reduced Precision)
Principle: reduce the precision of the weights (FP32 → INT8 → INT4) while keeping the accuracy loss small.
QUANTIZATION LEVELS:
FP32 (Float 32-bit) - Original:
├── Size: 100% (GPT-3 175B = 700GB)
├── Precision: Maximum
├── Speed: Baseline
└── Use: Training, cloud inference
FP16 (Float 16-bit):
├── Size: 50% (350GB)
├── Precision: -0.1% accuracy
├── Speed: 1.8x faster
└── Use: Optimized cloud inference
INT8 (Integer 8-bit):
├── Size: 25% (175GB)
├── Precision: -1.2% accuracy
├── Speed: 3.5x faster
├── NPU support: ✓ Excellent
└── Use: Edge deployment standard
INT4 (Integer 4-bit):
├── Size: 12.5% (87GB)
├── Precision: -3.8% accuracy
├── Speed: 6x faster
├── NPU support: ✓ Latest chips (2024+)
└── Use: Aggressive edge (smartphones)
EXAMPLE Llama 4 3B:
├── FP32: 12GB (not feasible on a smartphone)
├── FP16: 6GB (tight)
├── INT8: 3GB (comfortable)
└── INT4: 1.5GB (optimal) ✓
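The sizes listed above follow from simple arithmetic: number of parameters times bits per weight, divided by 8. A minimal Python sketch, assuming 1 GB = 10^9 bytes and ignoring overhead such as quantization scales or file metadata (the helper name is hypothetical):

```python
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def model_size_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    total_bytes = num_params * BITS_PER_WEIGHT[precision] / 8
    return total_bytes / 1e9


for precision in BITS_PER_WEIGHT:
    small = model_size_gb(3e9, precision)    # 3B-parameter model
    large = model_size_gb(175e9, precision)  # 175B-parameter model
    print(f"{precision}: 3B -> {small:.1f} GB, 175B -> {large:.1f} GB")
```

Real files are usually slightly larger, because quantized formats also store per-block scales and some layers are often kept at higher precision.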
PyTorch quantization code:
import torch
from torch.ao.quantization import quantize_dynamic

# 1. Original model (FP32)
# load_model is a placeholder for your own loading code (not a PyTorch API);
# it must return a torch.nn.Module.
model = load_model("llama-4-3b.pth")  # 12GB
model.eval()

# 2. Dynamic quantization (INT8): weights are stored as INT8,
#    activations are quantized on the fly at inference time
quantized_model = quantize_dynamic(
    model,
    {torch.nn.Linear},  # Quantize linear layers only
    dtype=torch.qint8,
)

# 3. Save quantized model
torch.save(quantized_model.state_dict(), "llama-4-3b-int8.pth")  # ~3GB

# Result reported for this example:
# - Size: 12GB → 3GB (4x smaller)
# - Speed: 3.2x faster inference
# - Accuracy: -1.1% on benchmarks
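To show what quantization does to individual weights, here is a simplified pure-Python sketch of symmetric per-tensor INT8 quantization. It is an illustration under stated assumptions, not what PyTorch runs internally (real backends use affine or per-channel schemes); the function names and sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.003, 0.421, -0.002, 0.687, 0.001]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("INT8 codes:", codes)
print("scale:", scale)
print("max round-trip error:", max_error)
```

The round-trip error is bounded by half the scale, which is why a single large outlier weight hurts: it inflates the scale and coarsens every other weight in the tensor.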
2. Model Distillation (Compression by Teaching)
Principle: a large model (the teacher) trains a small model (the student).
DISTILLATION PROCESS:
[TEACHER MODEL]
Llama 4 70B (140GB)
Accuracy: 85%
↓ (Distillation training)
[STUDENT MODEL]
Llama 4 3B (6GB)
Accuracy: 78% (vs 72% without distillation)
GAIN:
• Size: 23x smaller
• Speed: 15x faster
• Accuracy: +6 points vs training from scratch
TECHNIQUE:
Instead of:
Student learns from hard labels (0/1)
Use:
Student learns from teacher's soft probabilities
[0.02, 0.87, 0.11] vs [0, 1, 0]
→ Captures nuances teacher learned
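The soft probabilities described above come from a softmax with a temperature. The following Python sketch shows how raising the temperature flattens the teacher's distribution so that the smaller probabilities carry more signal. The logits are hypothetical values chosen so that the temperature-1 output is close to the [0.02, 0.87, 0.11] example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [0.5, 4.3, 2.2]  # hypothetical teacher outputs for 3 classes

sharp = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=3.0)

print("T=1:", [round(p, 2) for p in sharp])
print("T=3:", [round(p, 2) for p in soft])
```

At a higher temperature the winning class is unchanged, but the other classes receive more probability mass, which is the extra information the student learns from.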
Distillation code:
# Knowledge distillation training
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    def __init__(self, temperature=3.0, alpha=0.5):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha  # Weight of the distillation term vs hard labels

    def forward(self, student_logits, teacher_logits, true_labels):
        # Expects logits of shape (batch, num_classes) and integer class labels
        # Soft targets (teacher), softened by the temperature
        distillation_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=-1),
            F.softmax(teacher_logits / self.temperature, dim=-1),
            reduction='batchmean'
        ) * (self.temperature ** 2)
        # Hard targets (ground truth)
        student_loss = F.cross_entropy(student_logits, true_labels)
        # Combined loss
        return self.alpha * distillation_loss + (1 - self.alpha) * student_loss

# Training loop
# LlamaModel70B, LlamaModel3B and dataloader are placeholders for your own code
teacher_model = LlamaModel70B().eval()  # Pre-trained, frozen
student_model = LlamaModel3B()          # Random init
optimizer = torch.optim.AdamW(student_model.parameters(), lr=1e-4)  # lr is illustrative
distillation_criterion = DistillationLoss(temperature=3.0, alpha=0.7)

for batch in dataloader:
    inputs, labels = batch
    # Teacher inference (no grad)
    with torch.no_grad():
        teacher_logits = teacher_model(inputs)
    # Student inference
    student_logits = student_model(inputs)
    # Distillation loss
    loss = distillation_criterion(student_logits, teacher_logits, labels)
    # Backprop
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Result after training (figures from the example above):
# Student model: 78% accuracy (vs 72% trained normally)
# Teacher: 85% accuracy
# → The student reaches about 92% of the teacher's accuracy (78/85)
3. Pruning (Weight Removal)
Principle: remove low-importance weights (those close to zero).
PRUNING:
Original model: 3B parameters
↓ (Identify low-magnitude weights)
Sparse model: 3B params, 40% zero (1.8B effective)
↓ (Compress sparse representation)
Compressed: 1.8GB (vs 3GB original)
TYPES:
• Unstructured pruning: Remove individual weights (max compression)
• Structured pruning: Remove entire neurons/channels (hardware-friendly)
EXAMPLE:
Weight matrix [0.003, 0.421, -0.002, 0.687, 0.001]
Threshold: |w| < 0.1 → zero
Result: [0, 0.421, 0, 0.687, 0 ]
Sparsity: 60%
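The example above can be reproduced with a few lines of Python. This is a minimal sketch of unstructured magnitude pruning on a flat list of weights; the function names are hypothetical, and a real pipeline would prune tensors layer by layer and usually fine-tune afterwards.

```python
def magnitude_prune(weights, threshold):
    """Zero out every weight whose absolute value is below the threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


weights = [0.003, 0.421, -0.002, 0.687, 0.001]
pruned = magnitude_prune(weights, threshold=0.1)

print("pruned:", pruned)
print(f"sparsity: {sparsity(pruned):.0%}")
```

Zeroed weights only save memory or time once the model is stored in a sparse format or the hardware can skip them, which is why structured pruning tends to be easier to exploit on NPUs.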
Mobile Deployment Frameworks
TensorFlow Lite (Google - Android/iOS)
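# TensorFlow → TensorFlow Lite conversion
import numpy as np
import tensorflow as tf

# 1. Load TensorFlow model
model = tf.keras.models.load_model("model.h5")

# 2. Convert to TFLite
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Optimizations: full-integer (INT8) quantization needs calibration data.
# Random 224x224 RGB inputs are a stand-in (assumed input shape);
# use around 100 real samples in practice.
def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter.optimizations = [tf.lite.Optimize.DEFAULT]  # Enable quantization
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]  # INT8 kernels

# Convert
tflite_model = converter.convert()

# 3. Save .tflite file (deployable on Android/iOS)
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
# Size reduction: 45MB → 12MB (4x)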
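Android deployment:
// Android Kotlin - TFLite inference
import android.content.Context
import java.io.FileInputStream
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel
import org.tensorflow.lite.Interpreter

class TFLiteModel(context: Context) {
    private val interpreter: Interpreter

    init {
        // Load .tflite from assets
        val model = loadModelFile(context, "model.tflite")
        interpreter = Interpreter(model, Interpreter.Options().apply {
            setUseNNAPI(true) // Android Neural Networks API (deprecated as of Android 15)
            setNumThreads(4)
        })
    }

    // Memory-map the model from the APK assets (the asset must be stored uncompressed)
    private fun loadModelFile(context: Context, name: String): MappedByteBuffer {
        val fd = context.assets.openFd(name)
        return FileInputStream(fd.fileDescriptor).use {
            it.channel.map(FileChannel.MapMode.READ_ONLY, fd.startOffset, fd.declaredLength)
        }
    }

    fun predict(input: FloatArray): FloatArray {
        // Copy the pixels into a direct buffer matching the model's float input
        val buffer = ByteBuffer.allocateDirect(4 * input.size).order(ByteOrder.nativeOrder())
        buffer.asFloatBuffer().put(input)
        val output = Array(1) { FloatArray(10) } // Shape [1, 10]: 10 classes
        interpreter.run(buffer, output)
        return output[0]
    }
}
// Usage
val model = TFLiteModel(context)
val result = model.predict(imagePixels)
// Latency: 18ms on Snapdragon 8 Gen 4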
Core ML (Apple - iOS/macOS)
# PyTorch → Core ML conversion
import torch
import coremltools as ct

# 1. Load PyTorch model (the file must hold the full pickled model, not just a state_dict)
model = torch.load("model.pth", weights_only=False)
model.eval()
# 2. Trace model (example input)
example_input = torch.rand(1, 3, 224, 224)  # Image 224x224
traced_model = torch.jit.trace(model, example_input)
# 3. Convert to Core ML
coreml_model = ct.convert(
    traced_model,
    convert_to="mlprogram",  # compute_precision applies to ML Programs only
    inputs=[ct.ImageType(shape=(1, 3, 224, 224))],
    compute_precision=ct.precision.FLOAT16,  # FP16 optimization
    compute_units=ct.ComputeUnit.ALL  # Let Core ML pick CPU, GPU or Neural Engine
)
# 4. Save as .mlpackage (the format ML Programs require; deployable on iOS)
coreml_model.save("model.mlpackage")
iOS deployment:
// iOS Swift - Core ML inference
import CoreML
import UIKit
import Vision

class CoreMLModel {
    let model: VNCoreMLModel

    init() throws {
        // model_mlmodel stands for the class Xcode generates from the bundled model file
        let mlModel = try model_mlmodel(configuration: MLModelConfiguration())
        model = try VNCoreMLModel(for: mlModel.model)
    }

    func predict(image: UIImage) throws -> String? {
        guard let cgImage = image.cgImage else { return nil }
        var label: String?
        let request = VNCoreMLRequest(model: model) { request, _ in
            guard let results = request.results as? [VNClassificationObservation],
                  let topResult = results.first else {
                return
            }
            print("Prediction: \(topResult.identifier), confidence: \(topResult.confidence)")
            label = topResult.identifier
        }
        // perform(_:) runs synchronously, so label is set by the time it returns
        let handler = VNImageRequestHandler(cgImage: cgImage)
        try handler.perform([request])
        return label
    }
}
// Latency: 12ms on iPhone 16 Pro (A18 Neural Engine)
ONNX Runtime Mobile (Cross-platform)
# PyTorch → ONNX → Mobile
import torch
from onnxruntime.quantization import quantize_dynamic, QuantType

# 1. PyTorch → ONNX
model = torch.load("model.pth", weights_only=False)  # Full pickled model
model.eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}}
)
# 2. Quantize ONNX weights (8-bit)
quantize_dynamic(
    "model.onnx",
    "model_quantized.onnx",
    weight_type=QuantType.QUInt8
)
# 3. Deploy on Android/iOS via ONNX Runtime Mobile
# Size: 45MB → 11MB
NPU Hardware: The 2025 Generation
Mobile Chip Comparison
┌───────────────────────────────────────────────────────────────┐
│ Chip             NPU TOPS   AI Features           Release     │
├───────────────────────────────────────────────────────────────┤
│ Snapdragon 8G4   45 TOPS    LLM 7B on-device      Q4 2024     │
│ Apple A18        35 TOPS    Core ML optimized     Sept 2024   │
│ Tensor G5        28 TOPS    Gemini Nano 2         Oct 2024    │
│ MediaTek 9400    30 TOPS    APU 890               Nov 2024    │
│ Exynos 2500      26 TOPS    Samsung Gauss AI      Q1 2025     │
└───────────────────────────────────────────────────────────────┘
TOPS = Tera Operations Per Second (10^12 ops/s)
BENCHMARK LLM on-device (Llama 4 3B INT4):
├── Snapdragon 8G4: 24 tokens/sec
├── Apple A18: 19 tokens/sec
├── Tensor G5: 16 tokens/sec
└── MediaTek 9400: 18 tokens/sec
→ Snapdragon 8G4 leads on-device LLM performance in 2025
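Tokens per second is easier to reason about once converted to per-token latency and total generation time. The sketch below does that conversion for the benchmark figures above. It assumes steady-state decoding only and ignores prompt processing (prefill); the function names are hypothetical.

```python
TOKENS_PER_SEC = {
    "Snapdragon 8G4": 24,
    "Apple A18": 19,
    "Tensor G5": 16,
    "MediaTek 9400": 18,
}


def per_token_latency_ms(tokens_per_sec: float) -> float:
    """Average time to produce one token, in milliseconds."""
    return 1000.0 / tokens_per_sec


def generation_time_s(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to decode num_tokens at a constant rate, in seconds."""
    return num_tokens / tokens_per_sec


for chip, tps in TOKENS_PER_SEC.items():
    latency = per_token_latency_ms(tps)
    total = generation_time_s(100, tps)
    print(f"{chip}: {latency:.1f} ms/token, 100 tokens in {total:.1f} s")
```

Read this way, the sub-50ms figure applies per token on the fastest chip, while a full 100-token answer takes several seconds.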
Energy Consumption
ENERGY EFFICIENCY (Llama 4 3B inference, 100 tokens):
Cloud (API call):
├── Data transfer: 50mW × 2s = 100mJ
├── Waiting: 20mW × 0.5s = 10mJ
├── Display: 300mW × 2.5s = 750mJ
└── TOTAL: 860mJ
On-device (NPU):
├── NPU active: 800mW × 0.8s = 640mJ
├── Display: 300mW × 0.8s = 240mJ
└── TOTAL: 880mJ
BUT on-device avoids:
✓ Network idle (radio power)
✓ Latency wait (display on longer)
→ Real-world: on-device is 3-5x more efficient
Battery impact:
Cloud: 50-100 inferences → -1% battery
On-device: 300-500 inferences → -1% battery
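The totals in the energy breakdown above are products of power and duration (mW × s = mJ). A short Python sketch reproduces them from the listed components; the dictionary layout and function names are illustrative assumptions.

```python
def energy_mj(power_mw: float, seconds: float) -> float:
    """Energy in millijoules: milliwatts multiplied by seconds."""
    return power_mw * seconds


def total_energy_mj(components: dict) -> float:
    """Sum the energy of all (power_mw, seconds) components."""
    return sum(energy_mj(power, duration) for power, duration in components.values())


# (power in mW, duration in s), taken from the breakdown above
cloud = {"data transfer": (50, 2.0), "waiting": (20, 0.5), "display": (300, 2.5)}
on_device = {"NPU active": (800, 0.8), "display": (300, 0.8)}

print(f"Cloud total:     {total_energy_mj(cloud):.0f} mJ")
print(f"On-device total: {total_energy_mj(on_device):.0f} mJ")
```

On these numbers alone the two paths are nearly tied, so the claimed real-world advantage has to come from costs the table leaves out, such as the radio staying powered after the transfer.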
Breakthrough Edge AI Applications
1. Google Pixel 9 - Real-Time Translation
LIVE TRANSLATE (Pixel 9):
Features:
• Simultaneous translation in 95 languages
• Conversation mode (two-way)
• Video overlay (real-time subtitles)
• Works offline (on-device models)
Architecture:
┌────────────────────────────────────────────┐
│ Audio input (microphone) │
│ ↓ │
│ [Speech-to-Text] Whisper-tiny (39MB) │
│ ↓ │
│ [Translation] NLLB-distilled (180MB) │
│ ↓ │
│ [Text-to-Speech] Tacotron2-lite (65MB) │
│ ↓ │
│ Audio output (speaker) │
└────────────────────────────────────────────┘
Performance:
├── Latency: 450ms end-to-end
├── Accuracy: 94% (vs 96% cloud Google Translate)
├── Battery: 2% per hour of conversation
└── Privacy: no data sent to a server ✓
Total models: 284MB (all 95 languages)
NPU: Google Tensor G5 (28 TOPS)
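The three-stage pipeline above can be described as data, which makes it easy to check the total footprint and see which stage dominates. This is a minimal Python sketch using the model names and sizes listed above; the `Stage` class and helper functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    model: str
    size_mb: int


PIPELINE = [
    Stage("speech-to-text", "Whisper-tiny", 39),
    Stage("translation", "NLLB-distilled", 180),
    Stage("text-to-speech", "Tacotron2-lite", 65),
]


def total_footprint_mb(stages) -> int:
    """Combined on-disk size of all models in the pipeline."""
    return sum(stage.size_mb for stage in stages)


def largest_stage(stages) -> Stage:
    """The stage whose model takes the most space."""
    return max(stages, key=lambda stage: stage.size_mb)


total = total_footprint_mb(PIPELINE)
biggest = largest_stage(PIPELINE)
print(f"Total footprint: {total} MB")
print(f"Largest stage: {biggest.name} ({biggest.size_mb / total:.0%} of total)")
```

The translation model accounts for most of the footprint, so it is the natural first target for further compression.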
2. Apple Intelligence - Siri on-device
SIRI 2025 (iOS 19):
New capabilities:
• Understanding complex requests (multi-step)
• On-device LLM (Phi-4 mini distilled, 2.7B)
• Personal context (emails, calendar, photos)
• Works offline (basic features)
Example:
User: "Summarize my important emails from this morning and
create a reminder to reply to Marie"
Siri (on-device):
1. [Email search] Scans Mail.app (on-device index)
2. [LLM summary] "3 important emails:
- Client Acme: quote request (urgent)
- Manager: Q4 meeting moved to Thursday
- Marie: question about the project budget"
3. [Action] Creates reminder "Reply to Marie" (Calendar.app)
Latency: 1.8s total (vs 4.2s cloud Siri)
Privacy: emails are NEVER sent to a server
Models:
├── Phi-4 mini (2.7B): 2.1GB INT4
├── Embedding model: 145MB
├── Voice recognition: 89MB
└── Total: 2.3GB
3. Meta Quest 4 - Spatial AI
META QUEST 4 VR (Q2 2025):
AI Features:
• Scene understanding (objects, surfaces)
• Hand tracking (26 keypoints, 90fps)
• Eye tracking + foveated rendering
• AI avatars (realistic expressions)
Chip: Snapdragon XR2 Gen 3
NPU: 18 TOPS (lower power vs mobile)
Scene Understanding:
[Cameras] 4× RGB + 2× depth
↓
[Object detection] YOLOv9-tiny (22MB)
↓
[Segmentation] SAM-distilled (95MB)
↓
[Spatial mapping] Real-time mesh
Latency: 11ms (90fps tracking)
Use case: AR games, virtual workspace, training simulations
Edge AI Challenges and Limitations
EDGE AI LIMITATIONS 2025:
1. MODEL SIZE:
✗ Practical max: 3-7B params (smartphones)
✗ vs cloud: GPT-4 (~1.7T params), Claude 4 (~600B), both unofficial estimates
→ Solutions: distillation, aggressive quantization
2. ACCURACY GAP:
✗ On-device: -5 to -10% vs cloud models
→ Acceptable for most use cases
→ Steadily improving (2024: -15%, 2025: -8%)
3. MEMORY CONSTRAINTS:
✗ Smartphones: 8-12GB RAM total
✗ App budget: 2-4GB max (the rest goes to the OS and other apps)
→ Limit model size, use caching strategies
4. THERMAL THROTTLING:
✗ NPU sustained: 3-5W max (heat)
✗ Burst: 8-10W (30 seconds)
→ Long inferences (video) are challenging
5. FRAGMENTATION:
✓ iOS: Core ML (unified)
✗ Android: TFLite, NNAPI, proprietary SDKs
→ Multi-platform complexity
FUTURE (2026-2027):
✓ NPUs at 80-100 TOPS (2x today)
✓ 10-15B models on-device (vs 3-7B today)
✓ HBM in smartphones (3x bandwidth)
→ The cloud vs edge gap keeps narrowing
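The memory constraint in item 3 can be turned into a quick feasibility check. The Python sketch below tests whether a model's weights fit in an app's memory budget. The 0.5GB runtime overhead (KV cache, activations) is an assumed placeholder, not a measured value, and the function name is hypothetical.

```python
def fits_in_budget(num_params: float, bits_per_weight: int,
                   budget_gb: float, overhead_gb: float = 0.5) -> bool:
    """True if weights plus an assumed runtime overhead fit in the budget.

    overhead_gb is a rough allowance for KV cache and activations.
    """
    weights_gb = num_params * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb <= budget_gb


BUDGET_GB = 4.0  # upper end of the 2-4GB app budget quoted above

candidates = [
    ("3B INT4", 3e9, 4),
    ("3B FP16", 3e9, 16),
    ("7B INT4", 7e9, 4),
    ("7B INT8", 7e9, 8),
    ("13B INT4", 13e9, 4),
]
for name, params, bits in candidates:
    verdict = "fits" if fits_in_budget(params, bits, BUDGET_GB) else "too large"
    print(f"{name}: {verdict}")
```

Under these assumptions only INT4 models up to about 7B parameters fit, which matches the 3-7B practical ceiling stated in item 1.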
Conclusion: AI Everywhere, All the Time
Edge AI turns our devices into fast, private AI assistants that keep working offline. With 2025 NPUs reaching 45 TOPS and compression techniques (INT4, distillation), LLMs with 3-7B parameters now run on smartphones, and lightweight models respond in under 50ms.
Privacy revolution: data stays local (helps GDPR compliance)
Latency revolution: roughly 10x faster than the cloud (10-50ms vs 200-500ms)
Cost revolution: no per-request inference cost after deployment
2025 use cases:
- Offline real-time translation (Google Pixel 9)
- On-device voice assistants (Apple Siri 2.0)
- Photo/video enhancement (Samsung Galaxy S25 AI)
- AR/VR spatial understanding (Meta Quest 4)
2026 outlook: 10-15B-parameter models on-device, NPUs at 80+ TOPS, and 90% of smartphones with local generative AI. The cloud remains the place for complex tasks, while the edge is expected to handle about 70% of everyday use cases.




