Edge AI 2025 : IA embarquée on-device pour mobiles et IoT

Edge AI : IA locale pour latence zéro et privacy

L'Edge AI désigne l'exécution de modèles d'IA directement sur l'appareil (smartphone, IoT, voiture) plutôt que dans le cloud. Cette approche révolutionne les applications IA en offrant latence quasi-nulle, privacy totale (données jamais envoyées serveur) et fonctionnement offline.

En 2025, l'Edge AI explose grâce aux puces NPU dédiées (Apple A18 Neural Engine, Qualcomm Snapdragon 8 Gen 4, Google Tensor G5) et aux techniques de compression modèles (quantization, distillation) permettant de faire tourner des LLMs de 3-7 milliards de paramètres sur un smartphone.

Innovations 2025 :

LLMs on-device : Llama 4 3B, Gemini Nano 2, Phi-4 mini sur smartphones
NPUs 45+ TOPS : Neural Processing Units ultra-performantes (Snapdragon 8 Gen 4)
Quantization INT4 : Modèles 4-bit sans perte qualité significative
Frameworks optimisés : TensorFlow Lite, Core ML 6, ONNX Runtime Mobile
Battery efficient : Inférence 10x moins énergivore vs cloud (moins network)
Latency <50ms : Temps réel absolu (vs 200-500ms cloud)

Adoption Edge AI

Gartner prédit que 70% des smartphones vendus en 2026 intégreront NPU dédiée (vs 35% en 2024). Les applications Edge AI incluent traduction temps réel (Google Pixel 9), assistants vocaux offline (Apple Siri), photo enhancement (Samsung Galaxy S25) et AR/VR (Meta Quest 4).

Architecture Edge AI : Du cloud à l'appareil

Comparaison Cloud vs Edge

CLOUD AI (Architecture classique):

[Device] ──→ [Internet] ──→ [Cloud Server + GPU] ──→ [Response] ──→ [Device]
                ↑                    ↑
           Latency 200-500ms    Cost $0.002/req
           Privacy risk          Requires connection

Problèmes:
✗ Latency: 200-500ms minimum (network + inférence)
✗ Privacy: Données envoyées serveur (RGPD risk)
✗ Cost: API calls coûteux (scale mal)
✗ Offline: Impossible sans internet
✗ Bandwidth: Consomme data mobile

════════════════════════════════════════════════════════

EDGE AI (On-device):

[Device with NPU] ──→ [Local inference] ──→ [Response]
                           ↑
                      Latency &lt;50ms
                      Privacy 100%
                      Cost $0
                      Works offline

Avantages:
✓ Latency: 10-50ms (10x faster)
✓ Privacy: Données restent local (RGPD compliant)
✓ Cost: Zéro coût inférence après deployment
✓ Offline: Fonctionne partout (avion, metro, remote)
✓ Bandwidth: Zéro consommation data

Stack technique Edge AI

EDGE AI STACK:

┌─────────────────────────────────────────────────────┐
│ APPLICATION LAYER                                   │
│ • Photo enhancement, voice assistant, translation   │
└─────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────┐
│ INFERENCE FRAMEWORKS                                │
│ iOS: Core ML 6                                      │
│ Android: TensorFlow Lite, NNAPI                     │
│ Cross-platform: ONNX Runtime Mobile                │
└─────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────┐
│ OPTIMIZATION LAYER                                  │
│ • Quantization (FP32 → INT8/INT4)                   │
│ • Pruning (remove unnecessary weights)              │
│ • Distillation (teacher → student model)            │
└─────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────┐
│ HARDWARE ACCELERATION                               │
│ NPU: Apple Neural Engine (35 TOPS)                  │
│      Qualcomm Hexagon (45 TOPS)                     │
│      Google Tensor TPU (28 TOPS)                    │
│ GPU: Adreno, Mali (fallback)                        │
│ DSP: Signal processing offload                      │
└─────────────────────────────────────────────────────┘

Compression de modèles : Techniques essentielles

1. Quantization (Réduction précision)

Principe : Réduire précision poids (FP32 → INT8 → INT4) sans perte qualité significative.

QUANTIZATION LEVELS:

FP32 (Float 32-bit) - Original:
├── Size: 100% (GPT-3 175B = 700GB)
├── Precision: Maximum
├── Speed: Baseline
└── Use: Training, cloud inference

FP16 (Float 16-bit):
├── Size: 50% (350GB)
├── Precision: -0.1% accuracy
├── Speed: 1.8x faster
└── Use: Cloud inference optimisé

INT8 (Integer 8-bit):
├── Size: 25% (175GB)
├── Precision: -1.2% accuracy
├── Speed: 3.5x faster
├── NPU support: ✓ Excellent
└── Use: Edge deployment standard

INT4 (Integer 4-bit):
├── Size: 12.5% (87GB)
├── Precision: -3.8% accuracy
├── Speed: 6x faster
├── NPU support: ✓ Latest chips (2024+)
└── Use: Aggressive edge (smartphones)

EXEMPLE Llama 4 3B:
├── FP32: 12GB (impossible smartphone)
├── FP16: 6GB (tight)
├── INT8: 3GB (comfortable)
├── INT4: 1.5GB (optimal) ✓

Code quantization PyTorch :

import torch
from torch.quantization import quantize_dynamic

# 1. Modèle original (FP32)
model = load_model("llama-4-3b.pth")  # 12GB

# 2. Quantization dynamique (INT8)
quantized_model = quantize_dynamic(
    model,
    {torch.nn.Linear},  # Quantize linear layers
    dtype=torch.qint8
)

# 3. Save quantized model
torch.save(quantized_model.state_dict(), "llama-4-3b-int8.pth")  # 3GB

# Résultat:
# - Size: 12GB → 3GB (4x smaller)
# - Speed: 3.2x faster inference
# - Accuracy: -1.1% on benchmarks

2. Model Distillation (Compression par enseignement)

Principe : Grand modèle (teacher) enseigne à petit modèle (student).

DISTILLATION PROCESS:

[TEACHER MODEL]
Llama 4 70B (140GB)
Accuracy: 85%
      ↓ (Distillation training)
[STUDENT MODEL]
Llama 4 3B (6GB)
Accuracy: 78% (vs 72% sans distillation)

GAIN:
• Size: 23x smaller
• Speed: 15x faster
• Accuracy: +6 points vs training from scratch

TECHNIQUE:
Instead of:
  Student learns from hard labels (0/1)
Use:
  Student learns from teacher's soft probabilities
  [0.02, 0.87, 0.11] vs [0, 1, 0]
  → Captures nuances teacher learned

Code distillation :

# Knowledge distillation training

import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    def __init__(self, temperature=3.0, alpha=0.5):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha  # Balance distillation vs hard labels

    def forward(self, student_logits, teacher_logits, true_labels):
        # Soft targets (teacher)
        distillation_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=1),
            F.softmax(teacher_logits / self.temperature, dim=1),
            reduction='batchmean'
        ) * (self.temperature ** 2)

        # Hard targets (ground truth)
        student_loss = F.cross_entropy(student_logits, true_labels)

        # Combined loss
        return self.alpha * distillation_loss + (1 - self.alpha) * student_loss

# Training loop
teacher_model = LlamaModel70B()  # Pre-trained
student_model = LlamaModel3B()   # Random init

distillation_criterion = DistillationLoss(temperature=3.0, alpha=0.7)

for batch in dataloader:
    inputs, labels = batch

    # Teacher inference (no grad)
    with torch.no_grad():
        teacher_logits = teacher_model(inputs)

    # Student inference
    student_logits = student_model(inputs)

    # Distillation loss
    loss = distillation_criterion(student_logits, teacher_logits, labels)

    # Backprop
    loss.backward()
    optimizer.step()

# Résultat après training:
# Student model: 78% accuracy (vs 72% trained normally)
# Teacher: 85% accuracy
# → Student captured 75% of teacher's knowledge!

3. Pruning (Élagage poids)

Principe : Supprimer poids peu importants (proche zéro).

PRUNING:

Original model: 3B parameters
     ↓ (Identify low-magnitude weights)
Sparse model: 3B params, 40% zero (1.8B effective)
     ↓ (Compress sparse representation)
Compressed: 1.8GB (vs 3GB original)

TYPES:
• Unstructured pruning: Remove individual weights (max compression)
• Structured pruning: Remove entire neurons/channels (hardware-friendly)

EXAMPLE:
Weight matrix [0.003, 0.421, -0.002, 0.687, 0.001]
Threshold: |w| < 0.1 → zero
Result:     [0,     0.421, 0,     0.687, 0    ]
Sparsity: 60%

Frameworks de deployment mobile

TensorFlow Lite (Google - Android/iOS)

# TensorFlow → TensorFlow Lite conversion

import tensorflow as tf

# 1. Load TensorFlow model
model = tf.keras.models.load_model("model.h5")

# 2. Convert to TFLite
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Optimizations
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # Quantization
converter.target_spec.supported_types = [tf.int8]     # INT8

# Convert
tflite_model = converter.convert()

# 3. Save .tflite file (déployable Android/iOS)
with open("model.tflite", "wb") as f:
    f.write(tflite_model)

# Size reduction: 45MB → 12MB (4x)

Android deployment :

// Android Kotlin - TFLite inference

import org.tensorflow.lite.Interpreter

class TFLiteModel(context: Context) {
    private val interpreter: Interpreter

    init {
        // Load .tflite from assets
        val model = loadModelFile(context, "model.tflite")
        interpreter = Interpreter(model, Interpreter.Options().apply {
            setUseNNAPI(true)  // Use Android Neural Networks API
            setNumThreads(4)
        })
    }

    fun predict(input: FloatArray): FloatArray {
        val output = FloatArray(10)  // 10 classes
        interpreter.run(input, output)
        return output
    }
}

// Usage
val model = TFLiteModel(context)
val result = model.predict(imagePixels)
// Latency: 18ms on Snapdragon 8 Gen 4

Core ML (Apple - iOS/macOS)

# PyTorch → Core ML conversion

import torch
import coremltools as ct

# 1. Load PyTorch model
model = torch.load("model.pth")
model.eval()

# 2. Trace model (example input)
example_input = torch.rand(1, 3, 224, 224)  # Image 224x224
traced_model = torch.jit.trace(model, example_input)

# 3. Convert to Core ML
coreml_model = ct.convert(
    traced_model,
    inputs=[ct.ImageType(shape=(1, 3, 224, 224))],
    compute_precision=ct.precision.FLOAT16,  # FP16 optimization
    compute_units=ct.ComputeUnit.ALL  # Use Neural Engine + GPU
)

# 4. Save .mlmodel (déployable iOS)
coreml_model.save("model.mlmodel")

iOS deployment :

// iOS Swift - Core ML inference

import CoreML
import Vision

class CoreMLModel {
    let model: VNCoreMLModel

    init() {
        let mlModel = try! model_mlmodel(configuration: MLModelConfiguration())
        model = try! VNCoreMLModel(for: mlModel)
    }

    func predict(image: UIImage) -> String {
        let request = VNCoreMLRequest(model: model) { request, error in
            guard let results = request.results as? [VNClassificationObservation] else {
                return
            }

            let topResult = results.first!
            print("Prediction: \(topResult.identifier), confidence: \(topResult.confidence)")
        }

        let handler = VNImageRequestHandler(cgImage: image.cgImage!)
        try! handler.perform([request])
    }
}

// Latency: 12ms on iPhone 16 Pro (A18 Neural Engine)

ONNX Runtime Mobile (Cross-platform)

# PyTorch → ONNX → Mobile

import torch
import onnx
from onnxruntime.quantization import quantize_dynamic

# 1. PyTorch → ONNX
model = torch.load("model.pth")
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}}
)

# 2. Quantize ONNX (INT8)
quantize_dynamic(
    "model.onnx",
    "model_quantized.onnx",
    weight_type=QuantType.QUInt8
)

# 3. Deploy on Android/iOS via ONNX Runtime Mobile
# Size: 45MB → 11MB

NPU Hardware : Nouvelles générations 2025

Comparaison puces mobiles

┌───────────────────────────────────────────────────────────────┐
│ Chip              NPU TOPS   AI Features        Release       │
├───────────────────────────────────────────────────────────────┤
│ Snapdragon 8G4   45 TOPS    LLM 7B on-device   Q4 2024       │
│ Apple A18        35 TOPS    Core ML optimized  Sept 2024     │
│ Tensor G5        28 TOPS    Gemini Nano 2      Oct 2024      │
│ MediaTek 9400    30 TOPS    APU 890            Nov 2024      │
│ Exynos 2500      26 TOPS    Samsung Gauss AI   Q1 2025       │
└───────────────────────────────────────────────────────────────┘

TOPS = Tera Operations Per Second (10^12 ops/s)

BENCHMARK LLM on-device (Llama 4 3B INT4):
├── Snapdragon 8G4: 24 tokens/sec
├── Apple A18: 19 tokens/sec
├── Tensor G5: 16 tokens/sec
└── MediaTek 9400: 18 tokens/sec

→ Snapdragon 8G4 leader performance 2025

Consommation énergétique

ENERGY EFFICIENCY (inférence Llama 4 3B, 100 tokens):

Cloud (API call):
├── Data transfer: 50mW × 2s = 100mJ
├── Waiting: 20mW × 0.5s = 10mJ
├── Display: 300mW × 2.5s = 750mJ
└── TOTAL: 860mJ

On-device (NPU):
├── NPU active: 800mW × 0.8s = 640mJ
├── Display: 300mW × 0.8s = 240mJ
└── TOTAL: 880mJ

BUT on-device avoids:
✓ Network idle (radio power)
✓ Latency wait (display on longer)
→ Real-world: On-device 3-5x plus efficient

Battery impact:
Cloud: 50-100 inferences → -1% battery
On-device: 300-500 inferences → -1% battery

Applications Edge AI révolutionnaires

1. Google Pixel 9 - Traduction temps réel

LIVE TRANSLATE (Pixel 9):

Features:
• Traduction simultanée 95 langues
• Mode conversation (bidirectionnel)
• Overlay vidéo (sous-titres temps réel)
• Works offline (modèles on-device)

Architecture:
┌────────────────────────────────────────────┐
│ Audio input (microphone)                   │
│   ↓                                        │
│ [Speech-to-Text] Whisper-tiny (39MB)       │
│   ↓                                        │
│ [Translation] NLLB-distilled (180MB)       │
│   ↓                                        │
│ [Text-to-Speech] Tacotron2-lite (65MB)     │
│   ↓                                        │
│ Audio output (speaker)                     │
└────────────────────────────────────────────┘

Performance:
├── Latency: 450ms end-to-end
├── Accuracy: 94% (vs 96% cloud Google Translate)
├── Battery: 2% / heure conversation
└── Privacy: Zéro data envoyée serveur ✓

Total models: 284MB (all 95 languages)
NPU: Google Tensor G5 (28 TOPS)

2. Apple Intelligence - Siri on-device

SIRI 2025 (iOS 19):

New capabilities:
• Understanding complex requests (multi-step)
• On-device LLM (Phi-4 mini distilled, 2.7B)
• Personal context (emails, calendar, photos)
• Works offline (basic features)

Example:
User: "Résume mes emails importants ce matin et
       crée rappel pour répondre à Marie"

Siri (on-device):
1. [Email search] Scans Mail.app (on-device index)
2. [LLM summary] "3 emails importants:
   - Client Acme: Demande devis (urgent)
   - Manager: Réunion Q4 déplacée jeudi
   - Marie: Question budget projet"
3. [Action] Crée rappel "Répondre Marie" (Calendar.app)

Latency: 1.8s total (vs 4.2s cloud Siri)
Privacy: Emails JAMAIS envoyés serveur

Models:
├── Phi-4 mini (2.7B): 2.1GB INT4
├── Embedding model: 145MB
├── Voice recognition: 89MB
└── Total: 2.3GB

3. Meta Quest 4 - Spatial AI

META QUEST 4 VR (Q2 2025):

AI Features:
• Scene understanding (objects, surfaces)
• Hand tracking (26 keypoints, 90fps)
• Eye tracking + foveated rendering
• AI avatars (realistic expressions)

Chip: Snapdragon XR2 Gen 3
NPU: 18 TOPS (lower power vs mobile)

Scene Understanding:
[Cameras] 4× RGB + 2× depth
    ↓
[Object detection] YOLOv9-tiny (22MB)
    ↓
[Segmentation] SAM-distilled (95MB)
    ↓
[Spatial mapping] Real-time mesh

Latency: 11ms (90fps tracking)
Use case: AR games, virtual workspace, training simulations

Challenges et limitations Edge AI

LIMITATIONS EDGE AI 2025:

1. MODEL SIZE:
   ✗ Max pratique: 3-7B params (smartphones)
   ✗ vs Cloud: GPT-4 (1.7T params), Claude 4 (600B)
   → Solutions: Distillation, quantization aggressive

2. ACCURACY GAP:
   ✗ On-device: -5 à -10% vs cloud models
   → Acceptable pour majorité use cases
   → Amélioration continue (2024: -15%, 2025: -8%)

3. MEMORY CONSTRAINTS:
   ✗ Smartphones: 8-12GB RAM total
   ✗ App budget: 2-4GB max (OS + other apps)
   → Limit model size + caching strategies

4. THERMAL THROTTLING:
   ✗ NPU sustained: 3-5W max (heat)
   ✗ Burst: 8-10W (30 secondes)
   → Long inferences (video) challenging

5. FRAGMENTATION:
   ✗ iOS: Core ML (unified)
   ✓ Android: TFLite, NNAPI, proprietary SDKs
   → Complexité multi-platform

FUTURE (2026-2027):
✓ NPUs 80-100 TOPS (2x actuel)
✓ 10-15B modèles on-device (vs 3-7B today)
✓ HBM in smartphones (bandwidth 3x)
→ Gap cloud vs edge continue réduire

Conclusion : L'IA partout, tout le temps

L'Edge AI transforme nos devices en assistants IA ultra-rapides, privés et disponibles offline. Avec les NPUs 2025 atteignant 45 TOPS et les techniques de compression (INT4, distillation), des LLMs de 3-7B paramètres tournent désormais sur smartphones avec latence <50ms.

Révolution privacy : Données restent locales (RGPD compliant) Révolution latency : 10x plus rapide vs cloud (10-50ms vs 200-500ms) Révolution coût : Zéro coût inférence après deployment

Use cases 2025 :

Traduction temps réel offline (Google Pixel 9)
Assistants vocaux on-device (Apple Siri 2.0)
Photo/vidéo enhancement (Samsung Galaxy S25 AI)
AR/VR spatial understanding (Meta Quest 4)

2026 : Prédiction 10-15B modèles on-device, NPUs 80+ TOPS, et 90% smartphones avec IA générative locale. Le cloud reste pour tâches complexes, mais l'Edge gère 70% use cases quotidiens.

Edge AI : IA locale pour latence zéro et privacy

Innovations 2025 :

LLMs on-device : Llama 4 3B, Gemini Nano 2, Phi-4 mini sur smartphones
NPUs 45+ TOPS : Neural Processing Units ultra-performantes (Snapdragon 8 Gen 4)
Quantization INT4 : Modèles 4-bit sans perte qualité significative
Frameworks optimisés : TensorFlow Lite, Core ML 6, ONNX Runtime Mobile
Battery efficient : Inférence 10x moins énergivore vs cloud (moins network)
Latency <50ms : Temps réel absolu (vs 200-500ms cloud)

Adoption Edge AI

Architecture Edge AI : Du cloud à l'appareil

Comparaison Cloud vs Edge

CLOUD AI (Architecture classique):

[Device] ──→ [Internet] ──→ [Cloud Server + GPU] ──→ [Response] ──→ [Device]
                ↑                    ↑
           Latency 200-500ms    Cost $0.002/req
           Privacy risk          Requires connection

Problèmes:
✗ Latency: 200-500ms minimum (network + inférence)
✗ Privacy: Données envoyées serveur (RGPD risk)
✗ Cost: API calls coûteux (scale mal)
✗ Offline: Impossible sans internet
✗ Bandwidth: Consomme data mobile

════════════════════════════════════════════════════════

EDGE AI (On-device):

[Device with NPU] ──→ [Local inference] ──→ [Response]
                           ↑
                      Latency &lt;50ms
                      Privacy 100%
                      Cost $0
                      Works offline

Avantages:
✓ Latency: 10-50ms (10x faster)
✓ Privacy: Données restent local (RGPD compliant)
✓ Cost: Zéro coût inférence après deployment
✓ Offline: Fonctionne partout (avion, metro, remote)
✓ Bandwidth: Zéro consommation data

Stack technique Edge AI

EDGE AI STACK:

┌─────────────────────────────────────────────────────┐
│ APPLICATION LAYER                                   │
│ • Photo enhancement, voice assistant, translation   │
└─────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────┐
│ INFERENCE FRAMEWORKS                                │
│ iOS: Core ML 6                                      │
│ Android: TensorFlow Lite, NNAPI                     │
│ Cross-platform: ONNX Runtime Mobile                │
└─────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────┐
│ OPTIMIZATION LAYER                                  │
│ • Quantization (FP32 → INT8/INT4)                   │
│ • Pruning (remove unnecessary weights)              │
│ • Distillation (teacher → student model)            │
└─────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────┐
│ HARDWARE ACCELERATION                               │
│ NPU: Apple Neural Engine (35 TOPS)                  │
│      Qualcomm Hexagon (45 TOPS)                     │
│      Google Tensor TPU (28 TOPS)                    │
│ GPU: Adreno, Mali (fallback)                        │
│ DSP: Signal processing offload                      │
└─────────────────────────────────────────────────────┘

Compression de modèles : Techniques essentielles

1. Quantization (Réduction précision)

Principe : Réduire précision poids (FP32 → INT8 → INT4) sans perte qualité significative.

QUANTIZATION LEVELS:

FP32 (Float 32-bit) - Original:
├── Size: 100% (GPT-3 175B = 700GB)
├── Precision: Maximum
├── Speed: Baseline
└── Use: Training, cloud inference

FP16 (Float 16-bit):
├── Size: 50% (350GB)
├── Precision: -0.1% accuracy
├── Speed: 1.8x faster
└── Use: Cloud inference optimisé

INT8 (Integer 8-bit):
├── Size: 25% (175GB)
├── Precision: -1.2% accuracy
├── Speed: 3.5x faster
├── NPU support: ✓ Excellent
└── Use: Edge deployment standard

INT4 (Integer 4-bit):
├── Size: 12.5% (87GB)
├── Precision: -3.8% accuracy
├── Speed: 6x faster
├── NPU support: ✓ Latest chips (2024+)
└── Use: Aggressive edge (smartphones)

EXEMPLE Llama 4 3B:
├── FP32: 12GB (impossible smartphone)
├── FP16: 6GB (tight)
├── INT8: 3GB (comfortable)
├── INT4: 1.5GB (optimal) ✓

Code quantization PyTorch :

import torch
from torch.quantization import quantize_dynamic

# 1. Modèle original (FP32)
model = load_model("llama-4-3b.pth")  # 12GB

# 2. Quantization dynamique (INT8)
quantized_model = quantize_dynamic(
    model,
    {torch.nn.Linear},  # Quantize linear layers
    dtype=torch.qint8
)

# 3. Save quantized model
torch.save(quantized_model.state_dict(), "llama-4-3b-int8.pth")  # 3GB

# Résultat:
# - Size: 12GB → 3GB (4x smaller)
# - Speed: 3.2x faster inference
# - Accuracy: -1.1% on benchmarks

2. Model Distillation (Compression par enseignement)

Principe : Grand modèle (teacher) enseigne à petit modèle (student).

DISTILLATION PROCESS:

[TEACHER MODEL]
Llama 4 70B (140GB)
Accuracy: 85%
      ↓ (Distillation training)
[STUDENT MODEL]
Llama 4 3B (6GB)
Accuracy: 78% (vs 72% sans distillation)

GAIN:
• Size: 23x smaller
• Speed: 15x faster
• Accuracy: +6 points vs training from scratch

TECHNIQUE:
Instead of:
  Student learns from hard labels (0/1)
Use:
  Student learns from teacher's soft probabilities
  [0.02, 0.87, 0.11] vs [0, 1, 0]
  → Captures nuances teacher learned

Code distillation :

# Knowledge distillation training

import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    def __init__(self, temperature=3.0, alpha=0.5):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha  # Balance distillation vs hard labels

    def forward(self, student_logits, teacher_logits, true_labels):
        # Soft targets (teacher)
        distillation_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=1),
            F.softmax(teacher_logits / self.temperature, dim=1),
            reduction='batchmean'
        ) * (self.temperature ** 2)

        # Hard targets (ground truth)
        student_loss = F.cross_entropy(student_logits, true_labels)

        # Combined loss
        return self.alpha * distillation_loss + (1 - self.alpha) * student_loss

# Training loop
teacher_model = LlamaModel70B()  # Pre-trained
student_model = LlamaModel3B()   # Random init

distillation_criterion = DistillationLoss(temperature=3.0, alpha=0.7)

for batch in dataloader:
    inputs, labels = batch

    # Teacher inference (no grad)
    with torch.no_grad():
        teacher_logits = teacher_model(inputs)

    # Student inference
    student_logits = student_model(inputs)

    # Distillation loss
    loss = distillation_criterion(student_logits, teacher_logits, labels)

    # Backprop
    loss.backward()
    optimizer.step()

# Résultat après training:
# Student model: 78% accuracy (vs 72% trained normally)
# Teacher: 85% accuracy
# → Student captured 75% of teacher's knowledge!

3. Pruning (Élagage poids)

Principe : Supprimer poids peu importants (proche zéro).

PRUNING:

Original model: 3B parameters
     ↓ (Identify low-magnitude weights)
Sparse model: 3B params, 40% zero (1.8B effective)
     ↓ (Compress sparse representation)
Compressed: 1.8GB (vs 3GB original)

TYPES:
• Unstructured pruning: Remove individual weights (max compression)
• Structured pruning: Remove entire neurons/channels (hardware-friendly)

EXAMPLE:
Weight matrix [0.003, 0.421, -0.002, 0.687, 0.001]
Threshold: |w| < 0.1 → zero
Result:     [0,     0.421, 0,     0.687, 0    ]
Sparsity: 60%

Frameworks de deployment mobile

TensorFlow Lite (Google - Android/iOS)

# TensorFlow → TensorFlow Lite conversion

import tensorflow as tf

# 1. Load TensorFlow model
model = tf.keras.models.load_model("model.h5")

# 2. Convert to TFLite
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Optimizations
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # Quantization
converter.target_spec.supported_types = [tf.int8]     # INT8

# Convert
tflite_model = converter.convert()

# 3. Save .tflite file (déployable Android/iOS)
with open("model.tflite", "wb") as f:
    f.write(tflite_model)

# Size reduction: 45MB → 12MB (4x)

Android deployment :

// Android Kotlin - TFLite inference

import org.tensorflow.lite.Interpreter

class TFLiteModel(context: Context) {
    private val interpreter: Interpreter

    init {
        // Load .tflite from assets
        val model = loadModelFile(context, "model.tflite")
        interpreter = Interpreter(model, Interpreter.Options().apply {
            setUseNNAPI(true)  // Use Android Neural Networks API
            setNumThreads(4)
        })
    }

    fun predict(input: FloatArray): FloatArray {
        val output = FloatArray(10)  // 10 classes
        interpreter.run(input, output)
        return output
    }
}

// Usage
val model = TFLiteModel(context)
val result = model.predict(imagePixels)
// Latency: 18ms on Snapdragon 8 Gen 4

Core ML (Apple - iOS/macOS)

# PyTorch → Core ML conversion

import torch
import coremltools as ct

# 1. Load PyTorch model
model = torch.load("model.pth")
model.eval()

# 2. Trace model (example input)
example_input = torch.rand(1, 3, 224, 224)  # Image 224x224
traced_model = torch.jit.trace(model, example_input)

# 3. Convert to Core ML
coreml_model = ct.convert(
    traced_model,
    inputs=[ct.ImageType(shape=(1, 3, 224, 224))],
    compute_precision=ct.precision.FLOAT16,  # FP16 optimization
    compute_units=ct.ComputeUnit.ALL  # Use Neural Engine + GPU
)

# 4. Save .mlmodel (déployable iOS)
coreml_model.save("model.mlmodel")

iOS deployment :

// iOS Swift - Core ML inference

import CoreML
import Vision

class CoreMLModel {
    let model: VNCoreMLModel

    init() {
        let mlModel = try! model_mlmodel(configuration: MLModelConfiguration())
        model = try! VNCoreMLModel(for: mlModel)
    }

    func predict(image: UIImage) -> String {
        let request = VNCoreMLRequest(model: model) { request, error in
            guard let results = request.results as? [VNClassificationObservation] else {
                return
            }

            let topResult = results.first!
            print("Prediction: \(topResult.identifier), confidence: \(topResult.confidence)")
        }

        let handler = VNImageRequestHandler(cgImage: image.cgImage!)
        try! handler.perform([request])
    }
}

// Latency: 12ms on iPhone 16 Pro (A18 Neural Engine)

ONNX Runtime Mobile (Cross-platform)

# PyTorch → ONNX → Mobile

import torch
import onnx
from onnxruntime.quantization import quantize_dynamic

# 1. PyTorch → ONNX
model = torch.load("model.pth")
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}}
)

# 2. Quantize ONNX (INT8)
quantize_dynamic(
    "model.onnx",
    "model_quantized.onnx",
    weight_type=QuantType.QUInt8
)

# 3. Deploy on Android/iOS via ONNX Runtime Mobile
# Size: 45MB → 11MB

NPU Hardware : Nouvelles générations 2025

Comparaison puces mobiles

┌───────────────────────────────────────────────────────────────┐
│ Chip              NPU TOPS   AI Features        Release       │
├───────────────────────────────────────────────────────────────┤
│ Snapdragon 8G4   45 TOPS    LLM 7B on-device   Q4 2024       │
│ Apple A18        35 TOPS    Core ML optimized  Sept 2024     │
│ Tensor G5        28 TOPS    Gemini Nano 2      Oct 2024      │
│ MediaTek 9400    30 TOPS    APU 890            Nov 2024      │
│ Exynos 2500      26 TOPS    Samsung Gauss AI   Q1 2025       │
└───────────────────────────────────────────────────────────────┘

TOPS = Tera Operations Per Second (10^12 ops/s)

BENCHMARK LLM on-device (Llama 4 3B INT4):
├── Snapdragon 8G4: 24 tokens/sec
├── Apple A18: 19 tokens/sec
├── Tensor G5: 16 tokens/sec
└── MediaTek 9400: 18 tokens/sec

→ Snapdragon 8G4 leader performance 2025

Consommation énergétique

ENERGY EFFICIENCY (inférence Llama 4 3B, 100 tokens):

Cloud (API call):
├── Data transfer: 50mW × 2s = 100mJ
├── Waiting: 20mW × 0.5s = 10mJ
├── Display: 300mW × 2.5s = 750mJ
└── TOTAL: 860mJ

On-device (NPU):
├── NPU active: 800mW × 0.8s = 640mJ
├── Display: 300mW × 0.8s = 240mJ
└── TOTAL: 880mJ

BUT on-device avoids:
✓ Network idle (radio power)
✓ Latency wait (display on longer)
→ Real-world: On-device 3-5x plus efficient

Battery impact:
Cloud: 50-100 inferences → -1% battery
On-device: 300-500 inferences → -1% battery

Applications Edge AI révolutionnaires

1. Google Pixel 9 - Traduction temps réel

LIVE TRANSLATE (Pixel 9):

Features:
• Traduction simultanée 95 langues
• Mode conversation (bidirectionnel)
• Overlay vidéo (sous-titres temps réel)
• Works offline (modèles on-device)

Architecture:
┌────────────────────────────────────────────┐
│ Audio input (microphone)                   │
│   ↓                                        │
│ [Speech-to-Text] Whisper-tiny (39MB)       │
│   ↓                                        │
│ [Translation] NLLB-distilled (180MB)       │
│   ↓                                        │
│ [Text-to-Speech] Tacotron2-lite (65MB)     │
│   ↓                                        │
│ Audio output (speaker)                     │
└────────────────────────────────────────────┘

Performance:
├── Latency: 450ms end-to-end
├── Accuracy: 94% (vs 96% cloud Google Translate)
├── Battery: 2% / heure conversation
└── Privacy: Zéro data envoyée serveur ✓

Total models: 284MB (all 95 languages)
NPU: Google Tensor G5 (28 TOPS)

2. Apple Intelligence - Siri on-device

SIRI 2025 (iOS 19):

New capabilities:
• Understanding complex requests (multi-step)
• On-device LLM (Phi-4 mini distilled, 2.7B)
• Personal context (emails, calendar, photos)
• Works offline (basic features)

Example:
User: "Résume mes emails importants ce matin et
       crée rappel pour répondre à Marie"

Siri (on-device):
1. [Email search] Scans Mail.app (on-device index)
2. [LLM summary] "3 emails importants:
   - Client Acme: Demande devis (urgent)
   - Manager: Réunion Q4 déplacée jeudi
   - Marie: Question budget projet"
3. [Action] Crée rappel "Répondre Marie" (Calendar.app)

Latency: 1.8s total (vs 4.2s cloud Siri)
Privacy: Emails JAMAIS envoyés serveur

Models:
├── Phi-4 mini (2.7B): 2.1GB INT4
├── Embedding model: 145MB
├── Voice recognition: 89MB
└── Total: 2.3GB

3. Meta Quest 4 - Spatial AI

META QUEST 4 VR (Q2 2025):

AI Features:
• Scene understanding (objects, surfaces)
• Hand tracking (26 keypoints, 90fps)
• Eye tracking + foveated rendering
• AI avatars (realistic expressions)

Chip: Snapdragon XR2 Gen 3
NPU: 18 TOPS (lower power vs mobile)

Scene Understanding:
[Cameras] 4× RGB + 2× depth
    ↓
[Object detection] YOLOv9-tiny (22MB)
    ↓
[Segmentation] SAM-distilled (95MB)
    ↓
[Spatial mapping] Real-time mesh

Latency: 11ms (90fps tracking)
Use case: AR games, virtual workspace, training simulations

Challenges et limitations Edge AI

LIMITATIONS EDGE AI 2025:

1. MODEL SIZE:
   ✗ Max pratique: 3-7B params (smartphones)
   ✗ vs Cloud: GPT-4 (1.7T params), Claude 4 (600B)
   → Solutions: Distillation, quantization aggressive

2. ACCURACY GAP:
   ✗ On-device: -5 à -10% vs cloud models
   → Acceptable pour majorité use cases
   → Amélioration continue (2024: -15%, 2025: -8%)

3. MEMORY CONSTRAINTS:
   ✗ Smartphones: 8-12GB RAM total
   ✗ App budget: 2-4GB max (OS + other apps)
   → Limit model size + caching strategies

4. THERMAL THROTTLING:
   ✗ NPU sustained: 3-5W max (heat)
   ✗ Burst: 8-10W (30 secondes)
   → Long inferences (video) challenging

5. FRAGMENTATION:
   ✗ iOS: Core ML (unified)
   ✓ Android: TFLite, NNAPI, proprietary SDKs
   → Complexité multi-platform

FUTURE (2026-2027):
✓ NPUs 80-100 TOPS (2x actuel)
✓ 10-15B modèles on-device (vs 3-7B today)
✓ HBM in smartphones (bandwidth 3x)
→ Gap cloud vs edge continue réduire

Conclusion : L'IA partout, tout le temps

Use cases 2025 :

Traduction temps réel offline (Google Pixel 9)
Assistants vocaux on-device (Apple Siri 2.0)
Photo/vidéo enhancement (Samsung Galaxy S25 AI)
AR/VR spatial understanding (Meta Quest 4)

2026 : Prédiction 10-15B modèles on-device, NPUs 80+ TOPS, et 90% smartphones avec IA générative locale. Le cloud reste pour tâches complexes, mais l'Edge gère 70% use cases quotidiens.

Edge AI 2025 : IA embarquée on-device pour mobiles et IoT

Sommaire

Sources

À propos de Marie Laurent

Sommaire

Accélérez vos entraînements IA sur GPU

Edge AI 2025 : IA embarquée on-device pour mobiles et IoT

Sommaire

Sources

À propos de Marie Laurent

Sommaire

Accélérez vos entraînements IA sur GPU

Articles similaires

Articles similaires