Introduction
Our project focuses on robot cognition and collaboration to perform an optimization task. The goal was to create a restaurant environment and train robots to manage and optimize its operations. The restaurant is a pizza fast-food establishment where customers enter, place their orders, wait to be served, then pay and eat. The robots are divided into two types: waiter robots and cooker robots. They must learn to cooperate in order to meet demand as quickly as possible, keep the restaurant clean, and improve customer satisfaction to increase revenue. This report details the environment, the robots, the AI model we used, and our results.
The objective of the project is twofold: maximizing restaurant income and customer satisfaction. Income is determined by the difference between revenue and expenses, which depend on the number of customers, their orders, and their satisfaction levels. Higher customer satisfaction leads to increased patronage, but the daily number of customers remains uncertain. To optimize operations, robots must predict customer flow and arrival times, order the right quantity of food, and synchronize their tasks while managing energy levels to meet demand efficiently. Customer satisfaction is enhanced by minimizing delivery times. This project employs multi-agent reinforcement learning to develop optimal robot policies that maximize income and satisfaction over a set period. Traditional planning methods are insufficient due to the uncertainty in customer arrivals, order numbers, and delivery times, making it necessary to implement a complex environment and advanced planning techniques.
Reinforcement Learning
Reinforcement Learning (RL) is a branch of machine learning where an agent learns to make decisions by interacting with an environment to maximize a cumulative reward. Unlike supervised learning, where the model learns from labeled data, RL relies on trial and error, using a reward system to guide the learning process. The agent takes actions in a given state, receives feedback in the form of rewards or penalties, and adjusts its strategy accordingly to improve performance over time. Key components of RL include the agent (the decision-maker), the environment (where the agent operates), the state (a representation of the environment at a given time), the action (a choice the agent can make), and the reward (a signal indicating success or failure). RL algorithms, such as Q-learning and Deep Q-Networks (DQN), are widely used in applications like robotics, gaming, autonomous vehicles, and financial trading, where decision-making under uncertainty is crucial.
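To make these components concrete, here is a minimal, generic tabular Q-learning sketch (illustrative only, not our project code); the learning rate, discount factor, and exploration rate are arbitrary example values:

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch (generic illustration, not the project code).
# Q maps (state, action) pairs to estimated future rewards.
Q = defaultdict(float)
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount factor, exploration rate

def choose_action(state, actions):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, next_actions):
    """One-step Q-learning update derived from the Bellman equation."""
    best_next = max(Q[(next_state, a)] for a in next_actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```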

Environment
The environment is a restaurant that can host a maximum of 200 customers per day. The restaurant operates for 4 hours, which corresponds to 240 timestamps in our model. Each day, every customer decides whether to visit, with the probability of coming determined by a Bernoulli distribution based on their self-satisfaction. Initially, each customer has a 50% chance of visiting. Depending on their experience, their satisfaction can increase or decrease, influencing their likelihood of returning the next day. When a customer decides to come, their arrival timestamp follows a normal distribution with a mean of 120 timestamps and a variance of 60 timestamps.
The environment is partially observable for the robots managing the restaurant, as they do not know customer satisfaction, the number of customers arriving, or their arrival times. A scheduler is created daily, determining customer arrival times, but this information is not accessible to the robots.
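As an illustration of this arrival model, the sketch below generates a daily schedule the way the text describes it. The function and parameter names are hypothetical, and we read the "60" as the variance of the normal distribution:

```python
import numpy as np

def sample_daily_schedule(satisfactions, day_length=240, mean_arrival=120,
                          arrival_var=60, rng=None):
    """Hypothetical sketch of the daily customer scheduler described above.

    Each customer comes with probability equal to their satisfaction (Bernoulli),
    and each arriving customer gets a timestamp drawn from a normal distribution.
    """
    rng = rng or np.random.default_rng()
    schedule = {}
    for customer_id, satisfaction in enumerate(satisfactions):
        if rng.random() < satisfaction:                          # Bernoulli(satisfaction)
            t = rng.normal(mean_arrival, np.sqrt(arrival_var))   # mean 120, variance 60
            schedule[customer_id] = int(np.clip(round(t), 0, day_length - 1))
    return schedule  # customer_id -> arrival timestamp (hidden from the robots)

# Example: 200 potential customers, all starting at 50% satisfaction.
daily_schedule = sample_daily_schedule([0.5] * 200)
```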
The robots must efficiently manage the restaurant to respond to demand as quickly as possible. There are two types of robots: waiters and cookers. The cooker robot has three possible actions: cooking a pizza, washing a plate, or recharging itself. Each action takes exactly one timestamp. The energy consumed by the robot for cooking or washing differs, and the energy provided by one timestamp of charging is a configurable parameter. The waiter robot can take customer orders, deliver pizzas, clean the restaurant, and recharge itself. The number of robots of each type can be set as a parameter in the model, and each action takes one timestamp.
Each customer enters the restaurant at a random timestamp, and their satisfaction is calculated based on the time they wait from arrival until being served. If a customer waits longer than the average waiting time, their satisfaction decreases linearly with each passing timestamp. If they are not served by the end of the day, their satisfaction drops to zero, and they will never return. Additionally, satisfaction is influenced by the restaurant’s cleanliness level. Each time a robot cooks or a waiter delivers, a small amount of dirt accumulates, affecting the cleanliness score. This cleanliness score is multiplied by the waiting time score to determine the final customer satisfaction score.
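A hedged sketch of how such a satisfaction score could be computed is shown below; the linear decay and the example average waiting time are assumptions, and the exact formula in our code may differ:

```python
def satisfaction_score(wait_time, avg_wait=30, max_wait=240, cleanliness=1.0, served=True):
    """Illustrative satisfaction model: waiting-time score times cleanliness score.

    Assumed behaviour: full score while the wait stays below the average waiting
    time, then a linear decrease; customers never served drop to zero.
    """
    if not served:
        return 0.0  # not served by closing time: satisfaction collapses, customer never returns
    if wait_time <= avg_wait:
        wait_score = 1.0
    else:
        wait_score = max(0.0, 1.0 - (wait_time - avg_wait) / (max_wait - avg_wait))
    return wait_score * max(0.0, min(1.0, cleanliness))

# Example: a customer served after 50 timestamps in a fairly clean restaurant.
print(satisfaction_score(50, avg_wait=30, cleanliness=0.9))
```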
Each time a customer is served, they pay $10. Each time a cooker robot makes a pizza, it costs $2. The price, cost, average waiting time, and dirt score levels can all be configured in the model.
A simulation in our environment represents a year, with the number of days and hours of operation per day configurable. For this project, we trained our model based on a 4-hour opening time (240 timestamps) per day and a simulation period of 50 days.

Goal and Expectation
Our project aims to develop a model that will train our robots to efficiently manage the restaurant. The primary objective is to maximize the restaurant's annual income. To achieve this, the robots must collaborate to enhance customer satisfaction as much as possible. The more satisfied a customer is, the higher the likelihood they will return the next day, ultimately increasing the restaurant’s daily income. In this project, we will utilize reinforcement learning and implement a multi-agent Deep Q-learning algorithm to train the robots. Our goal is to determine the optimal policy for managing the restaurant and achieving the highest possible daily income. Additionally, we aim for our robots to maximize customer satisfaction while maintaining cleanliness in the restaurant.
Agent
Our agents are the robots, and our goal is to train them to manage the restaurant as efficiently as possible. Each robot has a specific type (either a cooker or a waiter) and a set of predefined actions it can perform, as described earlier. It is important to note that every action takes one timestamp, which represents one minute. The "cook" action results in the creation of one pizza. Additionally, there is a limited number of plates, which must be washed before they can be reused. When a cooker robot performs the "wash" action, a predefined number of plates are cleaned in one timestamp, and this value can be adjusted in the configuration. Similarly, for the waiter's "clean" action, the amount of dirt a robot can remove in one timestamp is also configurable. Our model was trained with three cooker robots and three waiter robots.
Deep Q-Learning
Deep Q-Learning (DQN) is a model-free reinforcement learning algorithm that combines the power of Q-learning with deep neural networks to solve complex tasks. In the context of our restaurant management problem, DQN allows the robots to learn optimal policies by interacting with the environment, receiving feedback, and adjusting their actions accordingly. The goal is to maximize the long-term reward, which in this case corresponds to maximizing restaurant income and customer satisfaction while minimizing costs and inefficiencies.
Model Architecture:
In DQN, the neural network acts as a function approximator for the Q-value function, which estimates the expected future rewards for a given state-action pair. The architecture typically consists of several layers:
- Input Layer: The input layer receives the current state of the environment, which in our case includes parameters such as the number of customers, the number of waiter and cooker robots available, the energy level of the robots, the current dirtiness of the restaurant, and other environmental factors. This input is encoded as a vector representing the state of the system at that particular timestamp.
- Hidden Layers:
  - The hidden layers consist of fully connected layers that transform the input into a more abstract feature space. These layers typically use ReLU (Rectified Linear Unit) activations, which help the model learn complex patterns in the data. In our case, the hidden layers help the network understand the relationships between the number of robots, customer arrival times, and the impact of the robots' actions on restaurant operations.
  - The first hidden layer could capture low-level features like robot activity or dirt levels, while deeper layers may capture more abstract features like the synchronization of robot actions, the waiting time of customers, and the relationship between cooking time and customer satisfaction.
- Output Layer: The output layer represents the Q-values corresponding to each possible action for a given state. In our setup, the possible actions for the robots are:
  - For the waiter: take order, deliver pizza, clean the restaurant, recharge.
  - For the cooker: cook pizza, wash plates, recharge.
  Each action corresponds to a Q-value, representing the expected future reward if that action is chosen in the current state. A minimal sketch of such a network is given after this list.
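As a concrete illustration of this architecture, a minimal Q-network with two hidden layers could be written in PyTorch as follows; the layer sizes and the choice of framework are assumptions, not a copy of our DQN.py:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Illustrative fully connected Q-network: state vector in, one Q-value per action out."""

    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),   # input layer: encoded restaurant/robot state
            nn.ReLU(),
            nn.Linear(hidden, hidden),      # hidden layers learn more abstract features
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # output layer: one Q-value per possible action
        )

    def forward(self, state):
        return self.net(state)

# Example: a waiter robot with a (hypothetical) 10-dimensional observation and 4 actions
# (take order, deliver pizza, clean, recharge).
waiter_q = QNetwork(state_dim=10, n_actions=4)
q_values = waiter_q(torch.zeros(1, 10))   # shape: (1, 4)
```

In our actual model, each robot owns its own network of this kind and only ever sees its partial observation of the restaurant.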
Rewards:
In our restaurant management problem, the reward signal is crucial for teaching the robots to optimize their behavior. Rewards are based on both income generation and customer satisfaction. The following reward structure can be defined:
- Income Reward: After every action (such as cooking a pizza, serving a customer, or cleaning the restaurant), a reward could be provided based on the amount of money the restaurant makes. For example, if a robot cooks a pizza, the reward might be +10 (corresponding to the payment from the customer), and a negative reward could be given for inefficient actions such as cooking pizzas that are never served.
- Customer Satisfaction Reward: The robots should also be rewarded for minimizing customer waiting time and maintaining a clean environment. A positive reward is given if the customer is served quickly and the restaurant's cleanliness is maintained. If the waiting time exceeds a threshold or the restaurant becomes too dirty, the reward becomes negative, reflecting the deterioration in customer satisfaction.
- Penalty for Unmet Demands: If the robots fail to satisfy demand (e.g., a customer leaves because they were not served in time), a penalty (negative reward) is assigned, encouraging the robots to optimize their coordination and efficiency.
The reward function for each robot could be as follows:
- For Cooker Robots: +10 for cooking a pizza, -2 for washing a plate (cost), -5 for idling or doing unnecessary tasks.
- For Waiter Robots: +15 for delivering a pizza promptly, +5 for cleaning the restaurant, -10 for letting the restaurant become too dirty, and -5 for delayed customer service.
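Purely as an illustration of the example values above (our final model, as described later, mainly uses an end-of-day reward based on average customer satisfaction), such a shaping table could be encoded as:

```python
# Illustrative per-action shaping rewards matching the example values above.
# These are hypothetical; the trained model primarily relies on an end-of-day
# reward based on average customer satisfaction.
COOKER_REWARDS = {
    "cook_pizza": +10,
    "wash_plate": -2,
    "idle": -5,
}
WAITER_REWARDS = {
    "deliver_pizza_promptly": +15,
    "clean_restaurant": +5,
    "restaurant_too_dirty": -10,
    "delayed_service": -5,
}

def shaping_reward(robot_type, event):
    """Look up the example shaping reward for a robot type and event."""
    table = COOKER_REWARDS if robot_type == "cooker" else WAITER_REWARDS
    return table.get(event, 0)
```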
Exploration and Exploitation:
One of the challenges in reinforcement learning is balancing exploration (trying new actions) and exploitation (choosing the best-known action). This balance is achieved by using an epsilon-greedy strategy:
- Exploration: With probability ε, the robot selects a random action.
- Exploitation: With probability 1 − ε, the robot selects the action with the highest Q-value.
As training progresses, ε is decayed, shifting the behavior of the robots from exploration to exploitation and allowing them to converge to an optimal policy.
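A minimal sketch of this decaying epsilon-greedy selection follows; the decay rate and floor are assumptions chosen to end near the 10% exploration level mentioned in the Results section:

```python
import random

def select_action(q_values, epsilon):
    """Epsilon-greedy: random action with probability epsilon, else the best Q-value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Assumed decay schedule: start fully exploratory and settle near 10% exploration.
epsilon, epsilon_min, decay = 1.0, 0.1, 0.999
for episode in range(10_000):
    # ... run one simulated day/year using select_action(...) ...
    epsilon = max(epsilon_min, epsilon * decay)
```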
Training Process:
The training process involves updating the Q-values based on the observed rewards. As the robots interact with the environment, they accumulate experiences in the form of state-action-reward-next state tuples, known as experience replay. These experiences are stored in a replay buffer and sampled randomly to train the network. This technique helps stabilize learning by breaking the correlation between consecutive experiences.
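A minimal replay buffer of the kind described here could look like the following sketch (capacity and batch size are arbitrary):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) tuples and samples them randomly."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # old experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive experiences.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```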
To summarize, the Deep Q-Learning model for our restaurant management problem uses a neural network to approximate the Q-value function. It learns the best policy by interacting with the environment, observing the outcomes of its actions, and adjusting based on a reward system that reflects both income generation and customer satisfaction. Through training, the robots optimize their actions to ensure the smooth operation of the restaurant, maximizing profits and customer satisfaction.
Loss Function:
The loss function in Deep Q-Learning is derived from the Bellman equation, which defines the relationship between the Q-values at different time steps. The goal is to minimize the difference between the predicted Q-values and the target Q-values (which come from the environment's feedback). The loss function is typically defined as the mean squared error between the predicted Q-value and the target Q-value:

L(θ) = E[ (r + γ · max_a′ Q(s′, a′; θ−) − Q(s, a; θ))² ]
Where:
- θ represents the weights of the neural network,
- r is the immediate reward received after performing action a,
- γ is the discount factor (which controls the importance of future rewards),
- Q(s′, a′) is the Q-value for the next state s′ and next action a′, and
- θ− represents the target network's weights (used to stabilize training).
The loss function encourages the neural network to minimize the discrepancy between predicted Q-values and the expected Q-values based on the rewards and the model’s future predictions.
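In PyTorch-style code, computing this loss could look like the sketch below; the batch layout, tensor shapes, and use of a frozen target network follow the standard DQN recipe and are assumptions about our implementation:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean squared error between predicted Q-values and Bellman targets."""
    states, actions, rewards, next_states, dones = batch  # batched tensors

    # Q(s, a; θ): predicted values for the actions that were actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # r + γ · max_a′ Q(s′, a′; θ−): targets from the frozen target network.
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next * (1 - dones)

    return F.mse_loss(q_pred, q_target)
```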
Code
Our code is structured into nine Python files, each serving a specific role in the simulation and training of our reinforcement learning model. Six of these files define the environment, agent, visual rendering, and functions related to actions, preconditions, and effects. One file implements the reinforcement learning algorithm, another is dedicated to training the model, and the last one handles result visualization.
Notably, we did not use Unified Planning, as it is not well-suited for reinforcement learning tasks. However, we maintained a structure similar to Unified Planning, utilizing fluents, actions with preconditions, and effects.
- Config.py: This file defines all environment parameters, including action efficiency, action energy cost, the maximum number of customers, opening hours, average waiting time, and more. It must be instantiated to start the model and allows users to configure real variables for their restaurant.
- State.py: This file manages fluents and creates numerical representations for each robot's state. The state representation differs depending on whether the robot is a cooker or a waiter. For cookers, the state includes battery level, available clean plates, and pending pizza orders. For waiters, the state tracks accumulated dirt, battery level, customer statuses, and waiting times.
- Action.py: This file defines the action space for the robots. It implements preconditions and effects, determines valid actions from a given state, and updates fluents after an action is executed. A locking mechanism prevents conflicts, ensuring that two robots do not perform the same action on the same resource (e.g., two waiters serving the same customer or two cookers preparing the same order).
- Modelization.py: This file constructs the visual representation of the environment at each timestep. It provides two views: one showing the environment before actions are taken and another displaying the environment after actions are executed. Key visual elements include:
  - Cooker robots (yellow) and waiter robots (orange)
  - Battery levels displayed on the left
  - Customer states on the right, where:
    - Pink indicates a customer who has not yet ordered
    - Green represents a waiting customer
  - Other indicators such as accumulated dirt, delivered customers, available plates, ordered pizzas, and pizzas ready to be served
  - Chosen actions, which appear in a grey rectangle at each timestep
- Environment.py: This file integrates all components (state, actions, visual model, and config) and implements the environment as a Gym-like environment with step, render, and reset functions. The step function simulates one timestep, retrieving the current state, observation space, and action space for each robot. The environment can be reset or restarted for a new simulation. (A minimal interface sketch appears after this list.)
- Utils.py: This file contains utility functions for selecting actions with the highest Q-value and generating random arrival times for customers.
- DQN.py: This file implements the multi-agent Deep Q-Learning algorithm. Each robot has an independent fully connected neural network (FNN) that receives its observation space at each timestep and outputs Q-values for all possible actions. The robot selects the action with the highest Q-value among valid options. The file also contains:
  - A predict function that returns Q-values based on current observations.
  - A learn function that stores experiences and updates the neural network via backpropagation.
- Simulator.py: This file handles model training. Robots simulate 2,000 years, each containing 50 days and 240 timesteps per day. During training, robots first explore their environment and later exploit learned strategies. The training process follows an epsilon-greedy policy, balancing exploration and exploitation. Robots aim to maximize customer satisfaction, as higher satisfaction leads to increased customer numbers and profitability. Rewards are given at the end of each day based on the average customer satisfaction score.
- TestSimulator.py: This file runs a test simulation of a day or an entire year after training. It evaluates model performance based on daily and yearly income results.
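To make the Environment.py description concrete, here is a hedged skeleton of such a Gym-like interface; the method names follow the text above, while everything else (attribute names, return values) is illustrative:

```python
class RestaurantEnv:
    """Skeleton of the Gym-like environment described above (illustrative only)."""

    def __init__(self, config):
        self.config = config   # parameters from Config.py (opening hours, costs, ...)

    def reset(self):
        """Start a new day: rebuild the state and draw a hidden customer schedule."""
        self.t = 0
        # ... re-initialize fluents, robots, and the customer scheduler ...
        return self._observations()

    def step(self, actions):
        """Advance one timestep given one action per robot."""
        # ... check preconditions, apply effects, update fluents ...
        self.t += 1
        done = self.t >= self.config.day_length
        return self._observations(), self._rewards(), done

    def render(self):
        """Draw the before/after views built by Modelization.py."""
        ...

    def _observations(self):
        # One partial observation per robot (cookers and waiters see different fluents).
        ...

    def _rewards(self):
        ...
```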
Model and Algorithm
Multi-agent reinforcement learning (MARL) is challenging, particularly for robots. We implemented a fully decentralized Deep Q-Learning algorithm, where each robot has its own fully connected neural network (FNN) with two hidden layers. Each robot makes independent decisions based on partial observations of the restaurant.
The goal is for robots to learn the optimal policy for maximizing customer satisfaction. During training, the predict function is called at each timestep to compute Q-values for all actions. Actions are chosen randomly during exploration and based on the neural network's predictions during exploitation.
Each timestep, robots store actions and observations. At the end of the day, they receive rewards based on customer satisfaction, and their neural networks are updated using backpropagation. After training, the robot models are saved for testing.
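A condensed sketch of this decentralized training loop, assuming the environment exposes reset, step, and an end-of-day average-satisfaction score, and that each agent exposes the predict and learn functions described for DQN.py (all names here are illustrative):

```python
import random

def train_decentralized(env, agents, num_days, epsilon_schedule):
    """Hypothetical outline of the decentralized training loop described above."""
    for day in range(num_days):
        observations = env.reset()
        trajectories = {robot_id: [] for robot_id in agents}
        epsilon = epsilon_schedule(day)                 # decaying exploration rate

        done = False
        while not done:
            actions = {}
            for robot_id, agent in agents.items():
                q_values = agent.predict(observations[robot_id])   # per-robot Q-values
                if random.random() < epsilon:                      # explore
                    actions[robot_id] = random.randrange(len(q_values))
                else:                                              # exploit
                    actions[robot_id] = max(range(len(q_values)), key=lambda a: q_values[a])
            next_observations, _, done = env.step(actions)
            for robot_id in agents:
                trajectories[robot_id].append(
                    (observations[robot_id], actions[robot_id], next_observations[robot_id])
                )
            observations = next_observations

        # End-of-day reward based on average customer satisfaction, shared by all robots.
        reward = env.average_satisfaction()
        for robot_id, agent in agents.items():
            agent.learn(trajectories[robot_id], reward)            # backpropagation update
```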
Rewards and Punishments
Defining an effective reward function was crucial. Initially, we considered using daily income as a reward but found it inadequate since customers might pay even if dissatisfied. Instead, we based rewards on global customer satisfaction at the end of each day.
Additional small rewards and penalties guide behavior:
- Cooker robots:
  - Penalized for idling when tasks are available.
  - Penalized for recharging unnecessarily if they still have sufficient battery.
- Waiter robots:
  - Rewarded for taking orders or delivering to customers at the top of the waiting list (higher rewards for prioritizing first-arrived customers).
  - Penalized for idling when tasks exist.
  - Rewarded for cleaning when dirt levels are high.
By reinforcing these behaviors, the robots learn strategies that maximize long-term customer satisfaction and profitability.

Results
We trained our algorithm for 10,000 simulated years, equivalent to 325 days, with 240 timestamps per day. The tradeoff between exploration and exploitation decreases over time, reaching a stable level of 10% exploration and 90% exploitation after year 7,000. The training process was lengthy, taking six days to complete, with each simulated year taking approximately one minute.
After training, we obtained the following results:
- The robots successfully handled 150 customers per day, achieving a satisfaction score of 0.8.
- After year 8,000, the restaurant's daily income ranged between $1,100 and $1,200, with $1,200 being the maximum possible income for 150 customers ($10 payment minus $2 pizza cost per customer).
- The restaurant remained above the first level of cleanliness most of the time.
- The robots learned to monitor their battery levels and recharge themselves when necessary.
- By year 4,500, the robots understood that to maximize customer satisfaction, all customers had to be served by the end of the day.
- By year 6,350, they adopted the optimal policy of serving customers in the order they arrived.
The training proved effective for managing a maximum of 200 customers with six robots. Additionally, the robots improved their coordination: for example, two cooking robots remained focused on cooking, while one was almost always washing dishes.
After training, we recorded both a full day and a full year of simulation with visualization to showcase the results. These results, along with modifiable parameters, can be accessed directly in the simulator.py file.
We also analyzed yearly results to assess how well the restaurant retained customers. In our environment, the more satisfied a customer is with their last visit, the higher their chances of returning the next day. With a maximum capacity of 200 customers and an initial satisfaction rate of 0.5, the robots started serving around 100 customers per day. By meeting demand and increasing satisfaction, they were able to attract more customers.
We observed that:
- When the number of customers was below 125, satisfaction remained between 0.53 and 0.65, an improvement over the initial level of 0.5. The robots managed to serve all customers, ensuring no one left unserved.
- When the number of customers exceeded 125, the robots struggled to maintain satisfaction above 0.5 and to serve everyone.
- After a year of service, customer satisfaction ranged between 0.55 and 0.6, with an average of 123 customers per day. The restaurant achieved a daily profit of $915, a clear improvement over the roughly 100 customers per day it started with.
Here is a link to a short video showcasing the results:
Limits and improvement
The results demonstrate that reinforcement learning can significantly improve the performance of robots managing a restaurant. Our findings confirm that this type of algorithm works effectively, providing robots with intelligence that can help solve complex tasks. However, we encountered some issues. For instance, the robots did not learn the customer arrival probability distribution (a normal distribution with a mean of 120 and a variance of 60). Instead of adapting to expected arrival times, they acted only on the customers present at a given moment.
Another challenge was that the robots seemed to take and deliver orders randomly, rather than following the order of arrival.
We believe that more training and exploration could help address these issues and improve the results. Additionally, there is a limitation on the number of customers the robots can handle, with a cap at 125 customers per day. We think that this limitation could be overcome by further training or by increasing the number of robots in the restaurant.