RL Playground for Beginners

Reinforcement Learning: The Ultimate Beginner's Guide

Step 1: What is Reinforcement Learning?

Reinforcement Learning (RL) lets computers and robots learn by **trial and error**. It’s like playing a game: try something, see if it works, get a reward for good moves or a penalty for mistakes, and gradually get smarter! RL is used for self-driving cars, smart robots, video game agents, and more.

  • The key: The computer/robot is not told the answers ahead of time; it discovers them through experience.

Step 2: The Parts of RL

  • Agent: The "player" or learner (the robot 🤖 in the grid).
  • Environment: The world/arena where the agent moves (the grid with walls, treasure, traps).
  • State: The current situation (where the agent is, e.g., row 2, column 3).
  • Action: Any move the agent can make (Up, Down, Left, Right).
  • Reward: The feedback the agent receives for each move (positive for finding treasure, negative for hitting a trap or wall).
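These pieces can be sketched in a few lines of Python (the names `ACTIONS`, `MOVES`, and the sample state are illustrative, not the playground's actual variables):

```python
# The four actions, and how each one changes the agent's (row, col) state.
ACTIONS = ["Up", "Down", "Left", "Right"]
MOVES = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}

state = (2, 3)                      # the agent is at row 2, column 3
dr, dc = MOVES["Right"]             # pick an action...
next_state = (state[0] + dr, state[1] + dc)
print(next_state)                   # ...and land in a new state: (2, 4)
```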

Step 3: The Agent's Journey

  1. Start at some grid cell (state).
  2. Pick an action (move Up, Down, Left, Right).
  3. Land in a new state, receive a reward or penalty.
  4. Repeat, learning which state-action moves lead to success over time.
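The journey above is just a loop. Here is a toy version, assuming a 4x4 grid with the treasure at (3, 3) and the reward numbers used in this game (the `step` function is a sketch, not the playground's code):

```python
import random

MOVES = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}
SIZE, GOAL = 4, (3, 3)

def step(state, action):
    """Apply one action; return (next_state, reward, done)."""
    dr, dc = MOVES[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < SIZE and 0 <= c < SIZE):
        return state, -5, False          # bumped into a wall
    if (r, c) == GOAL:
        return (r, c), 10, True          # found the treasure
    return (r, c), -0.1, False           # ordinary empty step

# A totally random agent still reaches the goal eventually -- it just
# collects a lot of penalties along the way.
random.seed(0)
state, total = (0, 0), 0.0
for _ in range(10_000):                  # safety cap on the random walk
    state, reward, done = step(state, random.choice(list(MOVES)))
    total += reward
    if done:
        break
print(state, total)
```

Learning, covered in the next steps, is what turns this aimless wandering into a short, high-reward path.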

Step 4: What is a Policy?

A policy is the agent’s personal map: a rule telling it what action to take from every possible state. At first, the policy is random; after lots of learning, it becomes very smart (go straight to the goal, avoid traps, etc.).

Think of it as a game plan that gets better as the agent explores!
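One simple way to represent a policy is a lookup table from state to action. This hand-written example (the values are illustrative, not learned) walks the agent down the left edge of a 3x3 grid and then right along the bottom:

```python
MOVES = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}

# The policy: for each state we might visit, the action to take there.
policy = {
    (0, 0): "Down", (1, 0): "Down",
    (2, 0): "Right", (2, 1): "Right",   # (2, 2) is the goal
}

state = (0, 0)
path = [state]
while state in policy:                  # follow the game plan to the end
    dr, dc = MOVES[policy[state]]
    state = (state[0] + dr, state[1] + dc)
    path.append(state)
print(path)   # [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
```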

Step 5: What is the Q-Table and Q-Values?

The **Q-table** is like the agent’s memory book of best moves. For every position (“state”) in the grid, the agent keeps a score called a “Q-value” for each direction (action).

  • High Q-value: “This move led to reward!”
  • Low/negative Q-value: “This move led to a trap or wall.”
  • At the beginning, all Q-values are zero because the agent knows nothing.
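A Q-table can be built as a nested dictionary: one entry per state, and inside it one Q-value per action, all starting at zero (the 4x4 grid size here is an assumption):

```python
ACTIONS = ["Up", "Down", "Left", "Right"]
SIZE = 4

# Every (row, col) state starts with a zero Q-value for every action:
# the agent knows nothing yet.
q_table = {
    (r, c): {a: 0.0 for a in ACTIONS}
    for r in range(SIZE) for c in range(SIZE)
}

print(len(q_table))        # 16 states in a 4x4 grid
print(q_table[(0, 0)])     # {'Up': 0.0, 'Down': 0.0, 'Left': 0.0, 'Right': 0.0}
```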

Step 6: How Does Q-Learning Work?

As the agent explores, the Q-table is updated using **Q-learning**. Here’s how the agent gets smarter, step by step:

  1. The agent tries a move; receives a reward.
  2. It updates its Q-value for that move, using the magic formula:
    Q[state][action] ← Q[state][action] + learning_rate × [reward + discount × max(Q[next_state]) - Q[state][action]]
  3. The agent keeps repeating this, and Q-values grow “smarter” with each experience.
  • learning_rate (alpha): How quickly the agent changes its beliefs.
  • discount (gamma): How much it values future rewards over immediate ones.
  • max(Q[next_state]): The highest Q for the next spot (the best the agent *could* achieve from there).

Over many tries, the agent’s Q-table starts showing the fastest/safest routes to the treasure.
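The magic formula from step 2 above translates directly into Python. The numbers here (alpha, gamma, the sample reward, the pre-filled Q-value) are illustrative, chosen so the update has something to learn from:

```python
alpha, gamma = 0.1, 0.9    # learning_rate and discount

q_table = {
    (0, 0): {"Up": 0.0, "Down": 0.0, "Left": 0.0, "Right": 0.0},
    (0, 1): {"Up": 0.0, "Down": 0.0, "Left": 0.0, "Right": 2.0},
}

def q_update(state, action, reward, next_state):
    """One Q-learning update, exactly the formula from Step 6."""
    best_next = max(q_table[next_state].values())       # max(Q[next_state])
    td_target = reward + gamma * best_next              # reward + discount * best
    q_table[state][action] += alpha * (td_target - q_table[state][action])

# Moving Right from (0, 0) cost -0.1, but it led toward a promising state,
# so the Q-value for that move rises above zero:
q_update((0, 0), "Right", -0.1, (0, 1))
print(q_table[(0, 0)]["Right"])     # 0.1 * (-0.1 + 0.9 * 2.0) ≈ 0.17
```

Notice how the good news at (0, 1) "flows backward" into the Q-value at (0, 0). Repeated over many episodes, this is exactly how reward information spreads from the treasure back across the whole grid.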

Step 7: Example — Rewards in This Game

| If the agent...      | Reward |
| -------------------- | ------ |
| Finds the Treasure   | +10    |
| Falls into a Trap    | -10    |
| Bumps into a Wall    | -5     |
| Takes an empty step  | -0.1   |
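The table above can be written as one small function (the cell labels `"treasure"`, `"trap"`, `"wall"`, `"empty"` stand in for whatever the grid stores at each position):

```python
def reward_for(cell):
    """Map a cell's contents to the reward from the table above."""
    return {"treasure": 10, "trap": -10, "wall": -5, "empty": -0.1}[cell]

print(reward_for("treasure"))   # 10
print(reward_for("empty"))      # -0.1
```

The small -0.1 penalty for empty steps is what nudges the agent toward *short* paths: wandering is never free.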

Step 8: Why is RL so Powerful?

For small grids, humans can find a path by eye. But as the grid grows, with many traps and walls, it becomes too hard to simply memorize a route; you need to learn wisely, from experience. RL lets agents learn clever strategies for any layout, even ones they have never seen before!

  • Try making the playground harder above and see how the agent “learns” what to do!
  • Keep training—see Q-table arrows that point the smart way.

Step 9: Final Challenge — Try It Yourself!

In this playground, you can be the agent! Race against the AI. Tinker with the grid, number of traps, and more, and see how RL helps the agent master the maze. This is the start of all powerful AI learning!