Sovereign open source Private AI Stack 2026

Overview of the Private AI Stack

The recommended stack is organised into 10 functional tiers. Each tier is decoupled and can be replaced independently, a fundamental principle for avoiding vendor lock-in, even with open source components.

┌─────────────────────────────────────────────────────────────┐
│  TIER 5 : UI / CHAT          Open WebUI 0.4               │
├─────────────────────────────────────────────────────────────┤
│  TIER 9 : API GATEWAY        Kong OSS 3.6                 │
├───────────────────────┬─────────────────────────────────────┤
│  TIER 3 : ORCHESTRAT. │  TIER 6 : AUTH                    │
│  LangChain 0.2        │  Keycloak 24                      │
│  LlamaIndex 0.10      │                                   │
├───────────────────────┼─────────────────────────────────────┤
│  TIER 2 : SERVING     │  TIER 4 : VECTORDB                │
│  vLLM 0.4.3           │  Qdrant 1.9                       │
├───────────────────────┼─────────────────────────────────────┤
│  TIER 1 : LLM MODELS  │  TIER 8 : OBJECT STORAGE          │
│  ELODIE 32B           │  MinIO RELEASE.2024-03            │
│  KEVINA 32B           │                                   │
│  Llama 3.3 70B        │                                   │
├───────────────────────┴─────────────────────────────────────┤
│  TIER 7 : MONITORING  Prometheus + Grafana + DCGM          │
├─────────────────────────────────────────────────────────────┤
│  TIER 10 : IaC        Terraform + Helm + Ubuntu 22.04 LTS  │
└─────────────────────────────────────────────────────────────┘

10 functional tiers

100% open source (free licences)

€0 in software licences

~€8,500/month infrastructure cost for 100 users

Tier 1: LLM models

The choice of model is the most visible decision, but unlike infrastructure choices it is reversible. Models can be swapped without touching the rest of the stack.

Model	Params	VRAM required (weights)	Licence	Strengths
ELODIE 32B	32B	64 GB (BF16)	Proprietary IP	Native French, sovereign, optimised for enterprise
KEVINA 32B	32B	64 GB (BF16)	Proprietary IP	Code, technical analysis, structured reasoning
Mistral Small 3.1	24B	48 GB (BF16)	Apache 2.0	Multimodal, very good performance-to-size ratio
Llama 3.3 70B Instruct	70B	140 GB (BF16)	Llama 3.3 License	Complex reasoning, state-of-the-art open benchmark results
Qwen2.5 72B Instruct	72B	144 GB (BF16)	Qwen License	Code, multilingual, very competitive
DeepSeek-R1 32B	32B	64 GB (BF16)	MIT	Reasoning, math, chain-of-thought
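The VRAM column follows from simple arithmetic: BF16 stores each parameter on 2 bytes, so the weights alone need roughly twice the parameter count in gigabytes. The sketch below reproduces the table's figures; the function name is illustrative, and the result excludes KV cache and activation memory, which come on top.

```python
def weights_vram_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes).

    bytes_per_param=2 corresponds to BF16/FP16; use 1 for an 8-bit format.
    """
    return params_billions * bytes_per_param


for name, params in [
    ("ELODIE 32B", 32),
    ("Mistral Small 3.1", 24),
    ("Llama 3.3 70B Instruct", 70),
    ("Qwen2.5 72B Instruct", 72),
]:
    print(f"{name}: {weights_vram_gb(params):.0f} GB in BF16")
```

In practice, plan for headroom beyond these numbers, since the serving engine also reserves GPU memory for the KV cache.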

# Download a model from the Hugging Face Hub
pip install huggingface_hub

# Download Llama 3.3 70B (gated model: Hugging Face access approval required)
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct \
  --local-dir /models/llama-3.3-70b \
  --local-dir-use-symlinks False \
  --token hf_YOUR_TOKEN

# Verify integrity (sha256 of the shards)
# Assumes a checksums.sha256 manifest is present in the model directory
cd /models/llama-3.3-70b
sha256sum -c checksums.sha256 2>&1 | grep -v OK
# Empty output: every file is intact

# Size of the downloaded model
du -sh /models/llama-3.3-70b
# 141G    /models/llama-3.3-70b

Tier 2: LLM serving with vLLM 0.4

vLLM is the reference engine for production LLM inference. Its key features are PagedAttention (KV cache management), continuous batching, an OpenAI-compatible API, and native multi-GPU support.
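To see why KV cache management matters, it helps to estimate how much memory one request consumes. The sketch below computes the KV cache size per token and the number of fixed-size blocks a paged allocator would hand out for a sequence. The architecture figures (80 layers, 8 key/value heads, head dimension 128, as published for Llama 3.3 70B) and the block size of 16 tokens are assumptions used for illustration, not values taken from this article.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def blocks_needed(seq_len: int, block_size: int = 16) -> int:
    """Number of fixed-size cache blocks for a sequence (ceiling division)."""
    return -(-seq_len // block_size)


# Assumed Llama 3.3 70B shape: 80 layers, 8 KV heads (GQA), head_dim 128, BF16.
per_token = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
context = 16384
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache for {context} tokens: {per_token * context / 2**30:.1f} GiB")
print(f"Blocks for {context} tokens: {blocks_needed(context)}")
```

Allocating the cache block by block, rather than reserving the full context window up front, is what lets the engine pack many concurrent requests of different lengths onto the same GPU.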

vLLM vs alternatives

Solution	Backend	Multi-GPU	OpenAI API	Relative perf.	Use case
vLLM 0.4	Python + CUDA	Yes (TP+PP)	Yes (native)	100% (reference)	Production, high load
Ollama 0.3	Go + llama.cpp	Partial	Yes (compatible)	~40-60%	Local development, single workstation
llama.cpp server	C++ + CUDA	Partial	Yes	~50-70%	Hybrid CPU/GPU
TGI (HF)	Rust + Python	Yes	Partial	~85-95%	Fine-tuning + serving
LiteLLM Proxy	Python	Via backend	Yes (proxy)	Depends on backend	Multi-provider routing

# Install vLLM 0.4.3 in an isolated environment
python3 -m venv /opt/vllm-env
source /opt/vllm-env/bin/activate

# vLLM 0.4.x wheels are built against CUDA 12.1
pip install vllm==0.4.3 \
  --extra-index-url https://download.pytorch.org/whl/cu121

# Quick test: a 7B model for validation
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --port 8000 &

# Test the API (wait for the model to finish loading first)
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"mistralai/Mistral-7B-Instruct-v0.3",
       "messages":[{"role":"user","content":"Bonjour !"}]}'

# Expected output:
# {"choices":[{"message":{"content":"Bonjour ! Comment puis-je vous aider ?"...}}]}

Tier 3: LLM orchestration, LangChain vs LlamaIndex

Criterion	LangChain 0.2	LlamaIndex 0.10
Main speciality	Agents, chains, tools	RAG, document indexing
Integrations	700+ (vast ecosystem)	200+ (data-focused)
RAG out of the box	Good (via chains)	Excellent (core feature)
Multi-step agents	Excellent (LangGraph)	Good (QueryPipeline)
Streaming performance	Good	Very good
Learning curve	Steep (many abstractions)	Moderate
Licence	MIT	MIT

Recommendation: use LlamaIndex for pure RAG pipelines (ingestion, indexing, retrieval). Use LangChain/LangGraph for complex multi-tool agents. Both integrate natively with vLLM through its OpenAI-compatible API.
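Both frameworks talk to vLLM through the same OpenAI-style HTTP contract, which is also what keeps the serving tier replaceable. As a minimal sketch of that contract, the snippet below builds (without sending) a chat completion request using only the standard library. The service URL and model name mirror the ones used elsewhere in this article; the helper name and the placeholder API key are hypothetical.

```python
import json
from urllib.request import Request


def build_chat_request(api_base: str, model: str, user_message: str,
                       api_key: str = "EMPTY") -> Request:
    """Build an OpenAI-style /chat/completions POST request (not sent here)."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return Request(
        url=f"{api_base.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_chat_request("http://vllm-service:8000/v1", "elodie-32b", "Bonjour !")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen(req)` would send it; any client or framework that produces this same request shape can sit in front of the serving tier.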

# llama_index_rag.py: complete RAG pipeline with LlamaIndex + vLLM + Qdrant
import qdrant_client
from llama_index.core import VectorStoreIndex, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai_like import OpenAILike
from llama_index.postprocessor.colbert_rerank import ColbertRerank
from llama_index.vector_stores.qdrant import QdrantVectorStore

# LLM configuration: vLLM exposes an OpenAI-compatible endpoint
Settings.llm = OpenAILike(
    model="elodie-32b",
    api_base="http://vllm-service:8000/v1",
    api_key="your-vllm-api-key",
    is_chat_model=True,
    context_window=16384,
    max_tokens=2048,
)

# Embedding model (local, sovereign)
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-m3",  # Multilingual FR/EN, 1024 dims
    device="cuda",
)

# Chunking applied to documents at ingestion time
Settings.node_parser = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=64,
)

# Qdrant connection
qdrant = qdrant_client.QdrantClient("http://qdrant:6333")
vector_store = QdrantVectorStore(
    client=qdrant,
    collection_name="knowledge-base",
)

# Build the index on top of the existing Qdrant collection
index = VectorStoreIndex.from_vector_store(vector_store)

# Query engine with reranking
query_engine = index.as_query_engine(
    similarity_top_k=10,  # Retrieve 10 candidates
    node_postprocessors=[
        ColbertRerank(top_n=3)  # Reranker → keep the 3 best
    ],
    streaming=True,
)

response = query_engine.query(
    "Quelle est notre politique de confidentialité des données clients ?"
)
# Print tokens as they stream in
for token in response.response_gen:
    print(token, end="", flush=True)
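The `chunk_size=512` and `chunk_overlap=64` settings above describe a sliding window: each chunk repeats the last 64 tokens of the previous one so that a sentence cut at a boundary still appears whole in one chunk. The sketch below shows that windowing on a plain token list. It is a simplified illustration, not LlamaIndex's `SentenceSplitter`, which additionally respects sentence boundaries.

```python
def split_with_overlap(tokens: list, chunk_size: int = 512,
                       chunk_overlap: int = 64) -> list:
    """Split a token list into windows of chunk_size sharing chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # The last window already reaches the end of the document
    return chunks


tokens = [f"t{i}" for i in range(1200)]  # Stand-in for a tokenised document
chunks = split_with_overlap(tokens)
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {chunk[0]}..{chunk[-1]} ({len(chunk)} tokens)")
```

A larger overlap improves recall at chunk boundaries at the cost of storing and embedding more redundant text.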

Tier 4: VectorDB, Qdrant recommended

Qdrant is our primary recommendation for the Private AI Stack 2026. Written in Rust, it combines native performance, easy deployment (single binary or Docker), and advanced metadata filtering, without the operational complexity of Milvus (which requires etcd and MinIO as dependencies).

# docker-compose-qdrant.yaml: two-node Qdrant cluster with replication
version: '3.8'
services:
  qdrant-node1:
    image: qdrant/qdrant:v1.9.2
    command: ./qdrant --uri http://qdrant-node1:6335
    ports:
      - "6333:6333"  # HTTP API
      - "6334:6334"  # gRPC API
      - "6335:6335"  # Internal cluster port
    volumes:
      - qdrant-storage-1:/qdrant/storage
    environment:
      QDRANT__CLUSTER__ENABLED: "true"  # Required for distributed mode
      QDRANT__LOG_LEVEL: INFO
      QDRANT__STORAGE__PERFORMANCE__MAX_SEARCH_THREADS: 8

  qdrant-node2:
    image: qdrant/qdrant:v1.9.2
    # Join the cluster by bootstrapping from node 1
    command: ./qdrant --bootstrap http://qdrant-node1:6335 --uri http://qdrant-node2:6335
    depends_on:
      - qdrant-node1
    ports:
      - "6343:6333"
      - "6344:6334"
      - "6345:6335"
    volumes:
      - qdrant-storage-2:/qdrant/storage
    environment:
      QDRANT__CLUSTER__ENABLED: "true"
volumes:
  qdrant-storage-1:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /nvme/qdrant/node1
  qdrant-storage-2:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /nvme/qdrant/node2

# qdrant_setup.py: create a collection with a tuned configuration
from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance, HnswConfigDiff, OptimizersConfigDiff,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType,
)

client = QdrantClient("http://qdrant:6333")

# Create a collection with HNSW + scalar quantization
client.create_collection(
    collection_name="knowledge-base",
    vectors_config=VectorParams(
        size=1024,           # BGE-M3 embedding size
        distance=Distance.COSINE,
    ),
    hnsw_config=HnswConfigDiff(
        m=16,                # Number of connections per HNSW node
        ef_construct=200,    # Index build quality
        full_scan_threshold=10000,
    ),
    optimizers_config=OptimizersConfigDiff(
        deleted_threshold=0.2,
        vacuum_min_vector_number=1000,
        memmap_threshold=50000,  # In kilobytes: larger segments are stored as mmap files
    ),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,
            always_ram=True,  # Keep the quantized vectors in RAM
        )
    ),
    replication_factor=2,  # Replicate across 2 nodes
    write_consistency_factor=1,
)

info = client.get_collection("knowledge-base")
print(f"Collection created with {info.points_count} points")
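The benefit of int8 scalar quantization is easy to quantify: each vector component shrinks from 4 bytes (float32) to 1 byte. The sketch below estimates raw vector memory for the 1024-dimension embeddings used here; the collection size of one million vectors is a hypothetical example, and the figures ignore the HNSW graph and payload overhead.

```python
def vector_memory_mb(num_vectors: int, dims: int = 1024,
                     bytes_per_component: int = 4) -> float:
    """Raw storage for the vectors alone, in MB (1 MB = 1e6 bytes)."""
    return num_vectors * dims * bytes_per_component / 1e6


n = 1_000_000  # Hypothetical collection size
full = vector_memory_mb(n)                              # float32 originals
quantized = vector_memory_mb(n, bytes_per_component=1)  # int8 copies
print(f"float32: {full:.0f} MB, int8: {quantized:.0f} MB "
      f"({full / quantized:.0f}x smaller)")
```

With `always_ram=True`, only the compact int8 copies need to stay in memory for search, while the full-precision originals can be memory-mapped from disk.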

Tier 5: UI and chat with Open WebUI

Open WebUI (formerly Ollama WebUI) is the most complete chat interface in the open source ecosystem. It natively supports the OpenAI API (and therefore vLLM), manages users, conversations, and RAG documents, and authenticates through Keycloak via OAuth2/OIDC.

# open-webui-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
  namespace: llm-apps
spec:
  replicas: 2
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui  # Must match spec.selector.matchLabels
    spec:
      containers:
      - name: open-webui
        image: ghcr.io/open-webui/open-webui:v0.4.5
        ports:
        - containerPort: 8080
        env:
        # LLM backend: reaches vLLM through the Kong gateway
        - name: OPENAI_API_BASE_URL
          value: "http://kong-gateway/v1"
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: llm-secrets
              key: api-key
        # OAuth2/OIDC authentication via Keycloak
        - name: OAUTH_CLIENT_ID
          value: "open-webui"
        - name: OAUTH_CLIENT_SECRET
          valueFrom:
            secretKeyRef:
              name: keycloak-secrets
              key: client-secret
        - name: OPENID_PROVIDER_URL
          value: "https://keycloak.interne/realms/intelligence-privee/.well-known/openid-configuration"
        - name: OAUTH_SCOPES
          value: "openid email profile"
        # UX configuration
        - name: DEFAULT_MODELS
          value: "elodie-32b"
        - name: DEFAULT_USER_ROLE
          value: "user"
        - name: ENABLE_SIGNUP
          value: "false"  # No local sign-up form
        - name: ENABLE_OAUTH_SIGNUP
          value: "true"   # Accounts are created on first Keycloak login
        - name: WEBUI_NAME
          value: "Intelligence Privée — Assistant IA"
        - name: WEBUI_URL
          value: "https://ai.entreprise.fr"
        volumeMounts:
        - name: webui-data
          mountPath: /app/backend/data
      volumes:
      - name: webui-data
        persistentVolumeClaim:
          claimName: open-webui-pvc

Tier 6: Authentication with Keycloak 24

# Install Keycloak 24 via Helm
helm repo add bitnami https://charts.bitnami.com/bitnami

# Unquoted EOF so the shell expands the password variables when writing the file
cat > keycloak-values.yaml << EOF
auth:
  adminUser: admin
  adminPassword: "${KEYCLOAK_ADMIN_PASSWORD}"

postgresql:
  enabled: true
  auth:
    database: keycloak
    username: keycloak
    password: "${PG_PASSWORD}"

extraEnvVars:
  - name: KC_HOSTNAME
    value: keycloak.interne
  - name: KC_PROXY
    value: edge
  - name: KC_HTTP_RELATIVE_PATH
    value: /

replicaCount: 2

resources:
  requests:
    cpu: 2
    memory: 2Gi
  limits:
    cpu: 4
    memory: 4Gi
EOF

helm install keycloak bitnami/keycloak \
  -f keycloak-values.yaml \
  --namespace auth \
  --create-namespace \
  --version 22.0.0

# Verification
kubectl get pods -n auth
# keycloak-0   1/1   Running   0   2m
# keycloak-1   1/1   Running   0   1m

Tier 7: Monitoring with Prometheus + Grafana + DCGM

# Install the full monitoring stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=200Gi \
  --set grafana.adminPassword="${GRAFANA_PASSWORD}" \
  --set alertmanager.enabled=true

# DCGM Exporter (GPU metrics), published in NVIDIA's dcgm-exporter chart repository
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --set serviceMonitor.enabled=true

# Import Grafana dashboards (public grafana.com IDs)
# The import endpoint expects the dashboard JSON itself, so download it first (requires jq)
for id in 12239 315; do  # 12239: NVIDIA DCGM dashboard, 315: Kubernetes cluster dashboard
  curl -s "https://grafana.com/api/dashboards/${id}/revisions/latest/download" \
    | jq '{dashboard: ., overwrite: true, folderId: 0}' \
    | curl -X POST "http://admin:${GRAFANA_PASSWORD}@grafana:3000/api/dashboards/import" \
        -H 'Content-Type: application/json' -d @-
done

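Tier 8: Object storage with MinIO

# Install the MinIO Operator and a Tenant via Helm
# The operator chart is published at operator.min.io
helm repo add minio https://operator.min.io
helm install minio-operator \
  minio/operator \
  --namespace minio-operator \
  --create-namespace

# Create a MinIO tenant (4 nodes, erasure coding)
kubectl create namespace minio
cat > minio-tenant.yaml << 'EOF'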
apiVersion: minio.min.io/v2
kind: Tenant
metadata:
  name: llm-storage
  namespace: minio
spec:
  image: minio/minio:RELEASE.2024-03-15T01-07-19Z
  pools:
  - servers: 4
    volumesPerServer: 2
    volumeClaimTemplate:
      spec:
        storageClassName: fast-nvme
        resources:
          requests:
            storage: 2Ti  # 2 TB per volume = 16 TB raw, about 8 TB usable with EC:4 parity
  mountPath: /data
  requestAutoCert: true
  features:
    bucketDNS: true
  buckets:
  - name: llm-models
  - name: llm-documents
  - name: llm-backups
EOF
kubectl apply -f minio-tenant.yaml
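The capacity figures in the manifest comment can be checked with a short calculation: erasure coding spends a fixed number of drives per stripe on parity, and only the remainder holds data. The sketch below assumes all 8 volumes form a single erasure set and takes the parity count as a parameter; the function name is illustrative.

```python
def usable_capacity_tb(servers: int, volumes_per_server: int,
                       volume_tb: float, parity: int) -> float:
    """Usable capacity when `parity` drives out of each stripe hold parity.

    Assumes every volume belongs to one erasure set (true for small pools).
    """
    drives = servers * volumes_per_server
    if not 0 <= parity <= drives // 2:
        raise ValueError("parity must be between 0 and half the drive count")
    raw_tb = drives * volume_tb
    return raw_tb * (drives - parity) / drives


# The tenant above: 4 servers x 2 volumes x 2 TB = 16 TB raw
print(usable_capacity_tb(4, 2, 2, parity=4))  # 4 data + 4 parity shards per stripe
print(usable_capacity_tb(4, 2, 2, parity=2))  # 6 data + 2 parity shards per stripe
```

Higher parity tolerates more simultaneous drive or node failures but leaves less usable space, so the 8 TB usable figure corresponds to the most fault-tolerant setting for this pool.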

Tier 9: API gateway with Kong OSS

# Install Kong OSS in DB-less mode (recommended for K8s)
helm repo add kong https://charts.konghq.com

cat > kong-values.yaml << 'EOF'
env:
  database: "off"  # DB-less mode
  router_flavor: expressions
  log_level: notice

ingressController:
  enabled: true
  ingressClass: kong

proxy:
  type: LoadBalancer
  tls:
    enabled: true
    servicePort: 443

resources:
  requests:
    cpu: 2
    memory: 2Gi
  limits:
    cpu: 4
    memory: 4Gi

replicaCount: 2

serviceMonitor:
  enabled: true  # Prometheus metrics
EOF

helm install kong kong/kong \
  -f kong-values.yaml \
  --namespace kong \
  --create-namespace \
  --version 2.38.0

Tier 10: Infrastructure as Code with Terraform + Helm

# Recommended layout for the IaC repository
tree /infra/
# infra/
# ├── terraform/
# │   ├── environments/
# │   │   ├── production/
# │   │   │   ├── main.tf
# │   │   │   ├── variables.tf
# │   │   │   └── terraform.tfvars
# │   │   └── staging/
# │   └── modules/
# │       ├── k8s-cluster/
# │       ├── gpu-nodes/
# │       └── networking/
# └── helm/
#     ├── environments/
#     │   ├── production.yaml
#     │   └── staging.yaml
#     └── helmfile.yaml  ← chart orchestration

# helmfile.yaml: orchestrated deployment of the whole stack
# Select an environment with: helmfile -e production apply
environments:
  production: {}
  staging: {}
---
repositories:
- name: prometheus-community
  url: https://prometheus-community.github.io/helm-charts
- name: nvdp
  url: https://nvidia.github.io/k8s-device-plugin
- name: gpu-helm-charts
  url: https://nvidia.github.io/dcgm-exporter/helm-charts
- name: kong
  url: https://charts.konghq.com
- name: bitnami
  url: https://charts.bitnami.com/bitnami
- name: minio
  url: https://operator.min.io
- name: qdrant
  url: https://qdrant.github.io/qdrant-helm

releases:
# Step 1: base infrastructure
- name: nvidia-device-plugin
  namespace: kube-system
  chart: nvdp/nvidia-device-plugin
  version: 0.15.0

# Step 2: storage and databases
- name: minio-operator
  namespace: minio-operator
  chart: minio/operator
  needs: [kube-system/nvidia-device-plugin]

# Step 3: authentication
- name: keycloak
  namespace: auth
  chart: bitnami/keycloak
  version: 22.0.0
  # Path is relative to this helmfile.yaml
  values: ["environments/{{ .Environment.Name }}.yaml"]
  needs: [minio-operator/minio-operator]

# Step 4: monitoring
- name: kube-prometheus-stack
  namespace: monitoring
  chart: prometheus-community/kube-prometheus-stack
  version: 57.2.0
  needs: [kube-system/nvidia-device-plugin]

# GPU metrics (the install script selects this release by name; pin a chart version)
- name: dcgm-exporter
  namespace: monitoring
  chart: gpu-helm-charts/dcgm-exporter
  needs: [monitoring/kube-prometheus-stack]

# Step 5: API gateway
- name: kong
  namespace: kong
  chart: kong/kong
  version: 2.38.0
  needs: [auth/keycloak]

# Step 6: LLM applications (custom deployment)
# Vector database (the install script selects this release by name; pin a chart version)
- name: qdrant
  namespace: llm-production
  chart: qdrant/qdrant

- name: vllm
  namespace: llm-production
  chart: ./charts/vllm
  needs: [kong/kong, monitoring/kube-prometheus-stack]

# Step 7: UI
- name: open-webui
  namespace: llm-apps
  chart: ./charts/open-webui
  needs: [llm-production/vllm, auth/keycloak]
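The `needs` entries form a dependency graph, and any valid installation order is a topological sort of it. The sketch below rebuilds that graph from the release definitions (namespaces omitted for brevity) and derives one such order with Python's standard `graphlib` module. It illustrates the ordering constraint only and is not how Helmfile itself is implemented.

```python
from graphlib import TopologicalSorter

# release -> releases it needs, as declared in helmfile.yaml
needs = {
    "nvidia-device-plugin": [],
    "minio-operator": ["nvidia-device-plugin"],
    "keycloak": ["minio-operator"],
    "kube-prometheus-stack": ["nvidia-device-plugin"],
    "kong": ["keycloak"],
    "vllm": ["kong", "kube-prometheus-stack"],
    "open-webui": ["vllm", "keycloak"],
}

# static_order() yields every release after all of its dependencies
# and raises graphlib.CycleError if the declarations are circular.
order = list(TopologicalSorter(needs).static_order())
for step, release in enumerate(order, start=1):
    print(f"{step}. {release}")
```

Several orders can satisfy the same graph (monitoring and authentication are independent of each other, for instance), which is why releases without a mutual dependency can be deployed in parallel.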

Installation order and dependencies

#!/bin/bash
# Full installation script (order matters)
set -euo pipefail

# Run every helmfile command against the same file and environment
HELMFILE="helmfile --file infra/helm/helmfile.yaml --environment production"

echo "[1/7] Preparing GPU nodes (NVIDIA drivers + containerd)"
ansible-playbook -i inventory/production gpu-nodes-setup.yaml

echo "[2/7] Kubernetes cluster (k3s or kubeadm, provisioned beforehand): NVIDIA device plugin"
$HELMFILE apply --selector name=nvidia-device-plugin
kubectl wait --for=condition=Ready node --all --timeout=300s

echo "[3/7] Deploying MinIO storage"
$HELMFILE apply --selector name=minio-operator
# Idempotent namespace creation for the tenant
kubectl create namespace minio --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f infra/k8s/minio-tenant.yaml
kubectl wait --for=condition=Ready tenant/llm-storage \
  -n minio --timeout=600s

echo "[4/7] Deploying Keycloak authentication"
$HELMFILE apply --selector name=keycloak
# The Bitnami chart deploys Keycloak as a StatefulSet (pods keycloak-0, keycloak-1)
kubectl rollout status statefulset/keycloak -n auth --timeout=300s
# Import the realm configuration
python3 scripts/keycloak_import_realm.py

echo "[5/7] Deploying monitoring"
$HELMFILE apply --selector name=kube-prometheus-stack
$HELMFILE apply --selector name=dcgm-exporter

echo "[6/7] Deploying Kong API Gateway"
$HELMFILE apply --selector name=kong
kubectl apply -f infra/k8s/kong-config.yaml

echo "[7/7] Deploying vLLM + Qdrant + Open WebUI"
# Prepare the models
python3 scripts/download_models.py --model elodie-32b
# Deploy
$HELMFILE apply --selector name=qdrant
$HELMFILE apply --selector name=vllm
$HELMFILE apply --selector name=open-webui

echo "Stack deployed successfully!"
echo "Open WebUI : https://ai.entreprise.fr"
echo "Grafana : https://monitoring.entreprise.fr"
echo "Keycloak : https://auth.entreprise.fr"

Infrastructure cost for 100 users

Component	Resources	Cost/month (French cloud)	Cost/month (on-prem, amortised)
GPU serving	2× H100 or 4× A100 80GB	€5,000–7,000	€2,500–3,500
CPU nodes (Keycloak, Kong, monitoring)	3× 16 vCPU, 64 GB RAM	€600–900	€300–500
NVMe storage (MinIO, Qdrant, models)	20 TB NVMe SSD	€400–600	€200–350
Network (egress, load balancer)	n/a	€100–200	€50–100
Backup and monitoring storage	5 TB	€100–150	€50–80
Monthly total	n/a	€6,200–8,850	€3,100–4,530
Cost per user/month	n/a	€62–88	€31–45
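The totals and per-user figures in the table can be recomputed from the component rows. The sketch below sums the low and high ends of each range and divides by the 100 users the sizing targets; the dictionary keys are shorthand labels chosen for this example.

```python
# (low, high) monthly cost in euros for each component row of the table
cloud = {
    "gpu": (5000, 7000), "cpu": (600, 900), "storage": (400, 600),
    "network": (100, 200), "backup": (100, 150),
}
on_prem = {
    "gpu": (2500, 3500), "cpu": (300, 500), "storage": (200, 350),
    "network": (50, 100), "backup": (50, 80),
}


def total_range(costs: dict) -> tuple:
    """Sum the low ends and the high ends of every component range."""
    return (sum(low for low, _ in costs.values()),
            sum(high for _, high in costs.values()))


def per_user(total: tuple, users: int = 100) -> tuple:
    """Divide a (low, high) total by the number of users."""
    return (total[0] / users, total[1] / users)


for label, costs in [("cloud", cloud), ("on-prem", on_prem)]:
    low, high = total_range(costs)
    u_low, u_high = per_user((low, high))
    print(f"{label}: {low}-{high} EUR/month, {u_low:.1f}-{u_high:.1f} EUR/user/month")
```

Because GPU serving dominates both columns, the per-user cost falls mainly when the same GPUs serve more users, not when the smaller line items are trimmed.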

Software cost vs infrastructure cost

These figures cover infrastructure only. The real cost also includes the engineering days for installation and maintenance (allow 0.5 FTE for a 100-user infrastructure), team training, and support. A provider such as Intelligence Privée can absorb these operational costs in a managed offering, bringing the real total cost to €40-70 per user per month, all inclusive.

Key takeaways

The complete stack relies on 10 tiers of proven open source components, with no software licence fees and no technology vendor lock-in; only the ELODIE and KEVINA models are listed under a proprietary licence.
Qdrant is recommended over Milvus for its operational simplicity (a single Rust binary, no external dependencies) with no compromise on performance.
Open WebUI is the reference interface: compatible with vLLM through the OpenAI API, Keycloak OIDC authentication, built-in RAG, and production-ready in a few hours.
Installation order is critical: storage → auth → monitoring → gateway → LLM → UI. Dependencies between layers are strong.
For 100 users: an infrastructure budget of €6,000–9,000/month on a French sovereign cloud, or €3,000–4,500 on-premise amortised over 5 years.

Your Private AI Stack deployed in 30 days

Intelligence Privée deploys and operates your complete sovereign AI stack, from bare metal to the user interface, with the ELODIE and KEVINA models integrated, French-language support, and a 99.9% SLA.


FAQ

Can Keycloak be replaced by another OIDC provider?

Yes. Any component that supports OpenID Connect (OIDC) works: Microsoft Entra ID (Azure AD), Okta, Auth0, or a self-hosted Authentik. The condition is that your IdP exposes a JWKS endpoint for validating JWT tokens. Configure Kong's jwt plugin with the signing keys your IdP publishes at that endpoint (the open source plugin validates tokens against public keys registered on Kong consumers rather than fetching the JWKS itself), and configure Open WebUI with the corresponding OAUTH_* variables. Keycloak is recommended because it is 100% open source, self-hostable, and supports native LDAP/AD federation.
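To make the JWKS requirement concrete: every JWT carries a `kid` (key ID) in its header, and the validator uses it to pick the matching public key from the IdP's JWKS document. The sketch below shows that lookup with the standard library only. The token and key set are fabricated demo values, and the code deliberately stops short of verifying the signature, which requires a cryptography library.

```python
import base64
import json


def jwt_header(token: str) -> dict:
    """Decode the (unverified) header segment of a JWT."""
    header_b64 = token.split(".")[0]
    padded = header_b64 + "=" * (-len(header_b64) % 4)  # Restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def select_jwk(jwks: dict, kid: str):
    """Return the key whose 'kid' matches, or None if the IdP does not list it."""
    return next((k for k in jwks.get("keys", []) if k.get("kid") == kid), None)


def _b64url(obj: dict) -> str:
    """Helper for the demo: base64url-encode a JSON object without padding."""
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()


# Demo values only: a token-shaped string and a JWKS-shaped document
demo_token = ".".join([
    _b64url({"alg": "RS256", "kid": "demo-key-1"}),
    _b64url({"sub": "alice"}),
    "signature-placeholder",
])
demo_jwks = {"keys": [{"kid": "old-key", "kty": "RSA"},
                      {"kid": "demo-key-1", "kty": "RSA"}]}

header = jwt_header(demo_token)
key = select_jwk(demo_jwks, header["kid"])
print(header["alg"], key["kid"] if key else "no matching key")
```

When an IdP rotates its signing keys, new tokens arrive with a `kid` the validator has not seen yet, which is why the registered keys must be refreshed from the JWKS endpoint after each rotation.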

Can Qdrant replace a traditional database for metadata?

No. Qdrant is optimised for vector search, not for complex relational queries, ACID transactions, or joins. In the Private AI Stack, Qdrant stores the embeddings and the simple metadata attached to them (source, date, department). For relational data (users, sessions, costs, logs), use PostgreSQL, which is already present as the Keycloak backend. The two coexist naturally.

How do you update stack components without reinstalling everything?

Helmfile simplifies updates: change the chart version in helmfile.yaml and run helmfile apply --selector name=vllm to update vLLM only. Each component can be updated independently. Respect the dependencies: a Keycloak update may require reconfiguring the OIDC clients in Kong and Open WebUI. Maintain a staging environment identical to production so that updates can be tested before deployment.
