
Q-Learning

An agent learns to navigate a grid through reinforcement learning

Statistics: Episode 0 · Steps 0 · Reward 0 · Epsilon 1.0

Controls

Q-Learning

This demo implements the Q-Learning reinforcement learning algorithm.

  1. Q-Table: Stores expected reward for each state-action pair
  2. Update Rule: Q(s,a) = Q(s,a) + α × (r + γ × max(Q(s',a')) - Q(s,a))
  3. Epsilon-Greedy: With probability ε, explore randomly; otherwise exploit best known action
  4. Episodes: Agent starts at top-left, seeks bottom-right goal while avoiding traps
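The update rule above reduces to a one-line pure function. This is a minimal sketch (function and parameter names are illustrative, not the demo's actual code):

```javascript
// One Q-Learning (Bellman) update for a single state-action pair.
// q: current Q(s,a); reward: immediate reward r;
// maxNextQ: max over a' of Q(s',a'); alpha: learning rate; gamma: discount.
function qUpdate(q, reward, maxNextQ, alpha, gamma) {
  return q + alpha * (reward + gamma * maxNextQ - q);
}

// Example: one step onto an empty cell (r = -0.1) from an untrained state
// moves Q(s,a) from 0 toward the step penalty: roughly -0.01.
const updated = qUpdate(0, -0.1, 0, 0.1, 0.9);
```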

Click cells to edit the grid layout (cycle: empty → wall → trap → goal).

© 2013 - 2026 Cylian 🤖 Claude

Prompt used to regenerate this page:

Page: Q-Learning
Description: "An agent learns to navigate a grid through reinforcement learning"
Category: artificial-intelligence
Icon: brain
Tags: reinforcement, q-learning, grid
Status: new

Front matter (index.md):
  title: "Q-Learning"
  description: "An agent learns to navigate a grid through reinforcement learning"
  icon: "brain"
  tags: ["reinforcement", "q-learning", "grid"]
  status: ["new"]

HTML structure (index.md):
  <section class="grid-container" id="grid-wrapper">
    <canvas id="grid-canvas"></canvas>
  </section>

Widget files:
- _stats.right.md (weight: 10): ##### Statistics
  <dl class="qlearn-stats"> with:
    - Episode: dd#stat-episode (initial "0")
    - Steps: dd#stat-steps (initial "0")
    - Reward: dd#stat-reward (initial "0")
    - Epsilon: dd#stat-epsilon (initial "1.0")

- _controls.right.md (weight: 20): ##### Controls
  <div class="qlearn-controls"> with:
    {{< button id="btn-play" icon="play" aria="Play" class="is-start" >}}
    {{< button id="btn-play" icon="pause" aria="Pause" class="is-stop" >}}
    {{< button id="btn-step" icon="skip-forward" aria="Step" >}}
    {{< button id="btn-reset-q" icon="refresh" aria="Reset Q-Table" >}}
  Note: both play/pause buttons share id="btn-play"
  Sliders:
    - Speed: input#slider-speed type=range min=1 max=100 value=10
    - Learning Rate (alpha): input#slider-alpha type=range min=0.01 max=1 step=0.01 value=0.1
    - Discount (gamma): input#slider-gamma type=range min=0.1 max=0.99 step=0.01 value=0.9
    - Epsilon (epsilon): input#slider-epsilon type=range min=0.1 max=1 step=0.01 value=1.0
  Checkbox:
    - input#check-decay checked: "Auto decay epsilon"

- _source.after.md (weight: 90): Explains Q-Learning algorithm: Q-Table, update rule Q(s,a)=Q(s,a)+alpha*(r+gamma*max(Q(s',a'))-Q(s,a)), epsilon-greedy, episodes from top-left to bottom-right goal. Click cells to edit: empty -> wall -> trap -> goal cycle.

Architecture (single file default.js):
  IIFE, imports: panic from '/_lib/panic_v3.js'

  Constants:
    GRID_SIZE=8, MAX_STEPS_PER_EPISODE=200
    ACTIONS: [{dx:0,dy:-1,label:'up-arrow'}, {dx:0,dy:1,label:'down-arrow'}, {dx:-1,dy:0,label:'left-arrow'}, {dx:1,dy:0,label:'right-arrow'}]
    Cell types: CELL_EMPTY='empty', CELL_WALL='wall', CELL_TRAP='trap', CELL_GOAL='goal', CELL_START='start'
    Rewards: REWARD_STEP=-0.1, REWARD_GOAL=10, REWARD_TRAP=-10
    Heatmap colors: COLOR_NEGATIVE={r:255,g:68,b:68}, COLOR_NEUTRAL={r:255,g:255,b:255}, COLOR_POSITIVE={r:68,g:255,b:68}
    CELL_CYCLE=[CELL_EMPTY, CELL_WALL, CELL_TRAP, CELL_GOAL]

  Cell class:
    constructor(type='empty'): stores type.
    getReward(): goal=10, trap=-10, default=-0.1.
    isTerminal(): true if goal or trap.
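The Cell class is small enough to sketch in full, using the reward constants from the spec (a sketch, not the shipped source):

```javascript
// Reward constants from the spec above.
const REWARD_STEP = -0.1, REWARD_GOAL = 10, REWARD_TRAP = -10;

// A grid cell: knows its type, its reward, and whether it ends the episode.
class Cell {
  constructor(type = 'empty') { this.type = type; }
  getReward() {
    if (this.type === 'goal') return REWARD_GOAL;
    if (this.type === 'trap') return REWARD_TRAP;
    return REWARD_STEP; // empty, wall, start: plain step penalty
  }
  isTerminal() { return this.type === 'goal' || this.type === 'trap'; }
}
```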

  Grid class:
    constructor(): creates 8x8 Cell array. Default layout:
      Start: (0,0). Goal: (7,7).
      Traps: (3,2), (5,4), (2,6), (6,1).
      Walls: (1,1), (2,3), (3,3), (4,5), (5,5), (6,2).
    getCell(x,y): returns Cell or null if out of bounds.
    setCell(x,y,type): sets Cell at position.
    isWall(x,y): true if null or wall.
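The out-of-bounds convention (getCell returns null, and isWall treats null as a wall) can be sketched like this, assuming a minimal Cell with only a type field:

```javascript
const GRID_SIZE = 8;

// Minimal stand-in for the Cell class: only the type field matters here.
class Cell { constructor(type = 'empty') { this.type = type; } }

class Grid {
  constructor() {
    // 8x8 array of empty cells, indexed cells[y][x].
    this.cells = Array.from({ length: GRID_SIZE }, () =>
      Array.from({ length: GRID_SIZE }, () => new Cell()));
  }
  getCell(x, y) {
    if (x < 0 || y < 0 || x >= GRID_SIZE || y >= GRID_SIZE) return null;
    return this.cells[y][x];
  }
  setCell(x, y, type) { const c = this.getCell(x, y); if (c) c.type = type; }
  // Out of bounds counts as a wall, so the agent can never leave the grid.
  isWall(x, y) { const c = this.getCell(x, y); return c === null || c.type === 'wall'; }
}
```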

  Agent class:
    constructor(startX=0, startY=0): stores start position and current position.
    move(actionIndex, grid): computes new position from ACTIONS[actionIndex]. If wall/OOB, stays. Returns {x, y, moved}.
    reset(): returns to start position.
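Agent.move's stay-in-place behaviour on walls can be sketched as follows (ACTIONS from the constants above; the grid is any object exposing isWall(x, y)):

```javascript
// Action deltas from the spec above (up, down, left, right).
const ACTIONS = [
  { dx: 0, dy: -1 }, { dx: 0, dy: 1 }, { dx: -1, dy: 0 }, { dx: 1, dy: 0 },
];

class Agent {
  constructor(startX = 0, startY = 0) {
    this.startX = startX; this.startY = startY;
    this.x = startX; this.y = startY;
  }
  // Try to apply an action; if the target is a wall (or out of bounds,
  // which the grid reports as a wall), stay put and report moved: false.
  move(actionIndex, grid) {
    const { dx, dy } = ACTIONS[actionIndex];
    const nx = this.x + dx, ny = this.y + dy;
    if (grid.isWall(nx, ny)) return { x: this.x, y: this.y, moved: false };
    this.x = nx; this.y = ny;
    return { x: nx, y: ny, moved: true };
  }
  reset() { this.x = this.startX; this.y = this.startY; }
}
```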

  QLearning class:
    constructor(gridSize, numActions): creates 3D Q-table [y][x][action] initialized to 0.
    _createTable(): returns 3D zero array.
    chooseAction(x, y, epsilon): epsilon-greedy. Random < epsilon -> random action, else getBestAction.
    update(x, y, action, reward, nextX, nextY, alpha, gamma, terminal): Bellman equation. currentQ + alpha * (reward + gamma * maxNextQ - currentQ). maxNextQ=0 if terminal.
    getBestAction(x, y): argmax over Q-values for state.
    getMaxQ(x, y): max Q-value for state.
    reset(): reinitializes Q-table to zeros.
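The epsilon-greedy choice in chooseAction can be sketched as two pure functions (argmax ties go to the lowest index; rng is injectable here purely so the explore path is testable; this is illustrative, not the demo's exact code):

```javascript
// Argmax over a per-state array of Q-values, ties broken toward index 0.
function getBestAction(qValues) {
  let best = 0;
  for (let a = 1; a < qValues.length; a++) {
    if (qValues[a] > qValues[best]) best = a;
  }
  return best;
}

// Epsilon-greedy: with probability epsilon pick a uniform random action,
// otherwise exploit the best known action.
function chooseAction(qValues, epsilon, rng = Math.random) {
  if (rng() < epsilon) {
    return Math.floor(rng() * qValues.length); // explore
  }
  return getBestAction(qValues);               // exploit
}
```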

  Renderer class:
    constructor(canvas): gets 2d context, dpr. State: agentX/Y, pulse animation (pulseTimer, pulseX/Y, pulseType, pulseAlpha). Cached colors: background, border, wall, text, trapTint, goalTint, startTint, agent.
    cacheColors(): reads --background-color-surface, --draw-color-surface, --text-color-secondary, --text-color-primary, --draw-color-primary from CSS.
    initSize(): HiDPI support. Logical size = min(container.clientWidth, 494). Sets canvas buffer to logical*dpr, CSS display size to logical, ctx.setTransform(dpr).
    width/height getters: canvas buffer / dpr.
    cellSize getter: width / GRID_SIZE.
    render(grid, qLearning, agent): clears canvas with border color. Finds maxAbsQ for normalization. Draws each cell: _drawCell (heatmap bg) + _drawArrow (best action). Then pulse overlay, then agent circle.
    renderAgent/renderQValues/renderArrows: no-op API compat methods, render() handles all.
    pulseCell(x, y, type): sets pulse state, alpha=0.6, clears after 200ms timeout.
    updateCellType(x, y, type): no-op (grid data is source of truth).
    eventToGrid(event): converts click to grid coords via getBoundingClientRect / cellSize.
    _drawCell(ctx, cx, cy, cw, ch, cell, qLearning, gx, gy, maxAbsQ): walls get wall color. Others: heatmap from normalized Q (maxQ/maxAbsQ clamped -1..1). Then tint overlay for trap (red), goal (green), start (blue).
    _drawArrow(ctx, cx, cy, cw, ch, cell, qLearning, gx, gy): skips walls/goal/trap. Skips if all Q=0. Draws Unicode arrow for best action. fontSize = max(12, cw*0.35), alpha 0.7.
    _drawPulse(ctx, size, gap): colored flash overlay (green for goal, red for trap) with pulseAlpha.
    _drawAgent(ctx, agent, size): circle at cell center, radius=6, agent color.
    _getMaxAbsQ(qLearning, grid): scans all non-wall cells, returns max absolute Q (min 1).
    _interpolateColor(value): value in [-1,1]. Negative: red->white. Positive: white->green. Returns rgb() string.
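The heatmap mapping in _interpolateColor (red for negative Q, white at zero, green for positive) can be sketched as a pure function over the color constants above:

```javascript
const COLOR_NEGATIVE = { r: 255, g: 68,  b: 68 };
const COLOR_NEUTRAL  = { r: 255, g: 255, b: 255 };
const COLOR_POSITIVE = { r: 68,  g: 255, b: 68 };

// Linear blend between two rgb colors; t in [0, 1].
function mix(a, b, t) {
  const ch = (x, y) => Math.round(x + (y - x) * t);
  return `rgb(${ch(a.r, b.r)}, ${ch(a.g, b.g)}, ${ch(a.b, b.b)})`;
}

// value in [-1, 1]: negatives blend white -> red, positives white -> green.
function interpolateColor(value) {
  const v = Math.max(-1, Math.min(1, value));
  if (v < 0) return mix(COLOR_NEUTRAL, COLOR_NEGATIVE, -v);
  return mix(COLOR_NEUTRAL, COLOR_POSITIVE, v);
}
```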

  Simulation state: grid, agent, qLearning, renderer. episodeCount, stepCount, totalReward. Parameters: speed=10, alpha=0.1, gamma=0.9, epsilon=1.0, decayEnabled=true. isRunning, timerId.

  Simulation logic:
    step(): chooseAction(epsilon-greedy), move agent, get reward, check terminal (goal/trap/maxSteps), update Q-table, pulse on terminal, render. If terminal: endEpisode(). Returns boolean.
    endEpisode(): increments episodeCount. Epsilon decay: epsilon *= 0.995 (min 0.01) if decayEnabled. Updates slider display. Resets agent/stepCount/totalReward. Renders.
    runSimulation(): stepsPerTick = max(1, floor(speed/10)). Runs steps, schedules setTimeout with interval = max(10, 1000/speed).
    togglePlay(): toggles isRunning, .is-running on .qlearn-controls. Starts/stops simulation.
    singleStep(): only when paused. Calls step() + updateStats().
    resetQTable(): stops if running. Resets Q-table, agent, counters. Reads epsilon from slider.
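The speed-to-timing arithmetic in runSimulation and the epsilon decay in endEpisode reduce to two small pure functions (a sketch using the numbers given in the spec):

```javascript
// Steps per timer tick and tick interval (ms), derived from the
// speed slider (range 1..100).
function tickParams(speed) {
  return {
    stepsPerTick: Math.max(1, Math.floor(speed / 10)),
    intervalMs: Math.max(10, 1000 / speed),
  };
}

// Multiplicative epsilon decay applied at the end of each episode,
// floored at 0.01 so the agent never stops exploring entirely.
function decayEpsilon(epsilon) {
  return Math.max(0.01, epsilon * 0.995);
}
```

At the default speed of 10 this yields one step every 100 ms; at speed 100, ten steps every 10 ms.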

  UI helpers:
    updateStats(): updates stat-episode, stat-steps, stat-reward (toFixed(1)), stat-epsilon (toFixed(3)).
    updateSliderDisplay(sliderId, value): sets slider.value and #sliderId-value textContent.
    handleCellClick(event): only when paused. eventToGrid, bounds check, protect (0,0). Cycles type through CELL_CYCLE. Updates grid, renders.
    bindSliders(): input events for slider-speed/alpha/gamma/epsilon + check-decay checkbox.
    bindControls(): click events for btn-play (togglePlay), btn-step (singleStep), btn-reset-q (resetQTable).
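The click-to-cycle logic in handleCellClick is a modular walk over CELL_CYCLE, which can be sketched as:

```javascript
const CELL_CYCLE = ['empty', 'wall', 'trap', 'goal'];

// Next cell type in the editing cycle: empty -> wall -> trap -> goal -> empty.
// Types outside the cycle (indexOf returns -1) fall back to 'empty'.
function nextCellType(current) {
  const i = CELL_CYCLE.indexOf(current);
  return CELL_CYCLE[(i + 1) % CELL_CYCLE.length];
}
```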

  Initialization:
    init(): gets grid-canvas. Creates Grid, Agent(0,0), QLearning(8,4), Renderer(canvas). cacheColors, initSize, render. Binds controls + sliders + canvas click. MutationObserver on documentElement data-theme attribute for recoloring. updateStats.
    Auto-init: readyState check, DOMContentLoaded fallback.

SCSS file (default.scss):
  .grid-container: flex, justify center, max-width 100%
  #grid-canvas: cursor pointer

  layout-main scope:
    .qlearn-stats: flex, justify center, gap 2rem, flex-wrap
      .stat: flex column centered, gap 0.25rem
        .label: 0.75rem, uppercase, muted
        .value: 1.5rem, weight 600, tabular-nums

    .qlearn-controls: flex row nowrap, justify center, gap 0.5rem
      .is-start: display block (visible)
      .is-stop: display none (hidden)
      &.is-running: .is-start none, .is-stop block

    .slider-group: flex row, align center, gap 0.5rem
      label: 0.85rem, primary text
      input[type="range"]: accent-color primary

    .check-group: flex row, align center, gap 0.5rem
      checkbox: accent-color primary, pointer
      label: 0.85rem, primary text, pointer

Page entirely generated and maintained by AI, with no human intervention.