TwinLoop: Simulation-in-the-Loop Digital Twins for Online Multi-Agent Reinforcement Learning


Source arXiv
arXiv ID 2604.06610v1
Authors Nan Zhang, Zishuo Wang, Shuyu Huang et al.
Published Apr 8, 2026
Categories cs.LG, cs.AI
Curated by @stevek
Curated on Apr 20, 2026
Tags digital-twin


Nan Zhang†∗, Zishuo Wang†, Shuyu Huang†, Georgios Diamantopoulos∗†, Nikos Tziritas‡, Panagiotis Oikonomou‡ and Georgios Theodoropoulos∗†

Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology, Shenzhen, China

Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China

Department of Informatics and Telecommunications, University of Thessaly, Lamia, Greece

Abstract: Decentralised online learning enables runtime adaptation in cyber-physical multi-agent systems, but when operating conditions change, learned policies often require substantial trial-and-error interaction before recovering performance. To address this, we propose TwinLoop, a simulation-in-the-loop digital twin framework for online multi-agent reinforcement learning. When a context shift occurs, the digital twin is triggered to reconstruct the current system state, initialise from the latest agent policies, and perform accelerated policy improvement through simulated what-if analysis before synchronising updated parameters back to the agents in the physical system. We evaluate TwinLoop in a vehicular edge computing task-offloading scenario with changing workload and infrastructure conditions. The results suggest that digital twins can improve post-shift adaptation efficiency and reduce reliance on costly online trial-and-error.

Index Terms: Digital twins, multi-agent reinforcement learning, vehicular edge computing, task offloading.

I. INTRODUCTION

Decentralised self-adaptation is increasingly important in cyber-physical multi-agent systems, where agents must continuously adapt their decisions to dynamically changing environments [1], [2]. In such systems, online learning provides a natural basis for runtime adaptation, and a growing body of work has explored learning-based self-adaptation to cope with uncertainty and evolving operating conditions [3]–[6]. However, in highly dynamic environments, pure online learning approaches often suffer from costly exploration and slow convergence [4], [7]. This is exacerbated in multi-agent scenarios where evolving agent behaviour becomes an additional unpredictable factor that affects environment dynamics [6].

A key challenge is that, as the environment evolves, agent policies are increasingly likely to face conditions that they lack the knowledge to handle [8]. To recover performance in these cases, agents typically need to adapt through trial-and-error interactions with the physical environment (online learning) [9]. However, such real-world exploration can be costly: it can temporarily degrade system performance and, in safety- or resource-sensitive settings, even impose operational risks on the underlying infrastructure [7]. A mechanism is therefore needed to rehearse adaptation before costly trial-and-error fully unfolds in the physical system.

Digital Twins (DTs) offer a promising foundation for such a mechanism. A DT maintains a synchronised virtual replica of the physical system, enabling state monitoring, environment reconstruction, and simulation-based what-if exploration [10]. Although prior studies have explored the use of DTs for reinforcement learning (RL) and optimisation [11]–[13], an important question remains insufficiently studied: can a DT, when triggered at runtime, use what-if simulation to accelerate the adaptation of decentralised online learning agents after environmental change?

Task offloading in Vehicular Edge Computing (VEC) provides a representative testbed for studying this problem [14]. In VEC, vehicles continuously generate computation-intensive tasks, such as video processing and voice assistant services [15]. Since each vehicle has limited onboard computational capacity, tasks can either be executed locally or offloaded to nearby Road-Side Units (RSUs) with stronger processing capability. Each vehicle is associated with a local decision agent that dynamically determines where tasks should be executed in order to minimise task completion latency [15]. This setting is inherently dynamic due to temporally varying network conditions, changing topologies, vehicle mobility, and fluctuating task demand [14].

In this paper, we propose TwinLoop, a DT-supported online learning framework that accelerates the adaptation of decentralised online learning agents. At the core of the system is a digital twin that mirrors the operating conditions of the VEC system, including the agent policies. TwinLoop leverages faster what-if simulation so that policy exploration and adaptation take place in the digital twin, with adapted policies then synchronised back to the physical system. The benefit of this approach is twofold: exploration incurs no cost in the physical system, and the twin can cover a broader range of scenarios for better generalisation.

The main contributions of this paper are as follows:

  • We highlight the problem of costly adaptation in decentralised online learning after environmental shifts and motivate digital twins as a runtime what-if simulation mechanism for policy rehearsal.

  • We propose TwinLoop, a simulation-in-the-loop DT framework in which a runtime-triggered twin reconstructs the current system state, performs accelerated policy improvement in simulation, and synchronises updated parameters back to the physical system.

  • We evaluate TwinLoop in a VEC task-offloading scenario and show that it improves post-shift adaptation efficiency and reduces reliance on costly real-world trial-and-error.

This work was supported by the Research Institute of Trustworthy Autonomous Systems, the Guangdong Province Innovative and Entrepreneurial Team Programme (No. 2017ZT07X386) and by the MSc programme in Informatics and Computational Biomedicine of the University of Thessaly. N. Zhang, G. Diamantopoulos, and G. Theodoropoulos are the corresponding authors.

II. RELATED WORK

Fig. 1: System model under consideration. For each task, a vehicle's agent decides whether to offload: if not, the task is executed locally; if so, the agent chooses one of the edge servers, each of which carries its own load.

A. Online Decentralised Learning in Dynamic Environments

Existing work on decentralised self-adaptation through online learning mainly addresses trial-and-error coordination in real-world environments. Cardellini et al. [4] adopt a hierarchical design in which local learners make adaptation decisions and a central coordinator resolves conflicts, whereas D’Angelo et al. [5] advocate a fully decentralised scheme based on inter-agent information sharing and trust-aware updates. Dragan et al. [6] coordinate multiple learners through factored Q-functions over local and shared goals. However, these studies still rely on costly exploration in the real environment rather than pre-adaptation through simulation.

B. Digital Twins for Online Learning

Digital twins have increasingly been explored as a means to support online learning. In early work, we explored simulation for prescriptive what-if analysis and data-driven adaptation [16], [17]; more recently, we have investigated knowledge management and learning in cognitive digital twins [18]–[21].

Most prior DT-assisted online learning studies focus on single-agent settings. Lee et al. [9] use a DT to temporarily train a decision policy at runtime decision points before applying it to the real system. Wang et al. [11] let the agent execute one action in reality while exploring additional actions in the twin, and combine real and simulated rewards for policy improvement. Wu et al. [22] establish a continuous feedback loop in which real trajectories improve the DT, while the DT in turn supports reinforcement learning. Deng et al. [23] enable dynamic switching between expert knowledge and an online reinforcement learning policy.

In distributed and multi-agent learning, Sun et al. [24] use a descriptive DT to monitor IoT node status and adapt federated aggregation frequency. Overall, prior work either uses DTs for single-agent simulation-in-the-loop learning or adopts descriptive DTs for coordination support, with limited attention to continuously updated DTs for online simulation in decentralised multi-agent learning.

C. Digital Twin-Assisted Decentralised and Multi-Agent Learning for Task Offloading in VEC

In VEC task offloading, DT-assisted learning remains limited. Following the taxonomy introduced in our earlier work [10], existing studies are still largely confined to descriptive twins for current state mirroring and predictive twins for near-future estimation, while prescriptive DTs based on online what-if simulation-in-the-loop analysis remain largely unexplored. Most decentralised and multi-agent studies are descriptive. Zhang et al. [25] mirror network and resource states to estimate cooperation gains and enable adaptive agent aggregation, while other works mirror the per-vehicle task-processing context [26] or vehicle and RSU states [27]. By contrast, predictive DTs further estimate future task arrivals and throughput to support offloading and resource reservation [12].

III. SYSTEM MODEL AND PROBLEM FORMULATION

We consider a VEC scenario with $M$ vehicles and $N$ RSU edge servers, as illustrated in Fig. 1. Each vehicle generates computation tasks following a Poisson process with arrival rate $\lambda$. Upon task arrival, each vehicle independently decides whether to process the task locally or offload it to one of the servers via a Vehicle-to-Infrastructure (V2I) wireless link, with the objective of minimising end-to-end completion latency.
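As a minimal sketch of the task-generation assumption above, Poisson arrivals can be sampled by drawing exponentially distributed inter-arrival gaps with mean 1/λ. The function name `sample_arrival_times`, the 1000 s horizon, and the seed are illustrative choices rather than part of the paper's code; the rate 1/8 per second is the value reported in the experimental setup.

```python
import random

def sample_arrival_times(rate_per_s: float, horizon_s: float, seed: int = 0) -> list[float]:
    """Sample arrival times of a homogeneous Poisson process on [0, horizon_s)."""
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate_per_s)  # first gap ~ Exp(rate)
    while t < horizon_s:
        times.append(t)
        t += rng.expovariate(rate_per_s)  # subsequent gaps are i.i.d. Exp(rate)
    return times

# One vehicle over a 1000 s phase with lambda = 1/8 tasks per second.
arrivals = sample_arrival_times(rate_per_s=1 / 8, horizon_s=1000.0)
```

On average this yields about 125 tasks per vehicle per 1000 s phase, since the expected count of a Poisson process is rate times duration.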

A. System Model

Communication Model. Following [15], the uplink transmission rate between vehicle $v_i$ and server $s_j$ is modelled as:

$$ r_{ij} = B \log_2\left(1 + \frac{P\, d_{ij}^{-\alpha}}{\sigma^2}\right) \tag{1} $$

where $B$ is the V2I channel bandwidth, $P$ is the vehicle uplink transmit power, $d_{ij}$ is the Euclidean distance between $v_i$ and $s_j$, $\alpha$ denotes the path-loss exponent, and $\sigma^2$ represents the background noise power. The transmission delay for a task with input data size $b_i$ is (assuming negligible downlink delay):

$$ t^{tx}_{ij} = \frac{b_i}{r_{ij}} \tag{2} $$

Computation Model. Each server $s_j$ maintains a processing rate $f_j$ and a cumulative backlog $L_j$ of pending computation cycles from previously accepted tasks. The computation delay for a task with demand $D_i$ at server $s_j$ is modelled as:

$$ t^{comp}_{ij} = \frac{L_j + D_i}{f_j} \tag{3} $$

The local computational platform of a vehicle is similarly modelled using a local processing rate $f_i$ and its cumulative cycle backlog $L_i$. The computation delay for processing a task with demand $D_i$ locally on vehicle $i$ is:

$$ t^{loc}_{i} = \frac{L_i + D_i}{f_i} \tag{4} $$

Fig. 2: Architecture of the DT-assisted VEC system. The physical twin comprises the vehicles (task generation, task queue, CPU, and an offloading DRL agent with a policy network, online learning, and replay memory) and the edge servers (task queue and CPU). The digital twin comprises a physical system model (vehicle, edge server, task, and policy network models) and a training scenario generation and simulation component in which each vehicle's policy is trained online with its own replay memory. Vehicle states, edge server states, and agent policy networks flow from the physical twin to the digital twin, and updated policy networks flow back, closing the feedback loop.

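End-to-End Latency. The end-to-end task completion latency depends on the offloading decision $a_i$, where $a_i = 0$ denotes local execution and $a_i = j > 0$ denotes offloading to server $s_j$:

$$ t^{e2e}_{i} = \begin{cases} t^{loc}_{i}, & a_i = 0 \\ t^{tx}_{ij} + t^{comp}_{ij}, & a_i = j > 0 \end{cases} \tag{5} $$

where $t^{loc}_{i}$ is the local computation delay, $t^{tx}_{ij}$ is the transmission delay, and $t^{comp}_{ij}$ is the computation delay at server $s_j$.

Objective: The objective of each vehicle $i$ is to minimise the total end-to-end latency of all tasks:

$$ \min \sum_{k=1}^{K_i} t^{e2e}_{i,k} \tag{6} $$

where $K_i$ is the total number of tasks generated by vehicle $i$ and $t^{e2e}_{i,k}$ is the end-to-end latency of its $k$-th task.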
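The delay model of this subsection can be sketched numerically as follows. This is an illustrative reading, not the authors' implementation: it assumes a Shannon-type rate with distance-based path loss, $r_{ij} = B \log_2(1 + P d_{ij}^{-\alpha} / \sigma^2)$, and assumes that a new task waits for the existing backlog before its own cycles are processed. Function names, the dictionary layout, and the example state (distance, idle backlogs) are hypothetical; the channel constants are those reported in the experimental setup.

```python
import math

def uplink_rate(bandwidth_hz, tx_power_w, distance_m, path_loss_exp, noise_w):
    """Assumed Shannon-type uplink rate in bit/s with distance-based path loss."""
    snr = tx_power_w * distance_m ** (-path_loss_exp) / noise_w
    return bandwidth_hz * math.log2(1.0 + snr)

def compute_delay(backlog_cycles, demand_cycles, rate_hz):
    """Seconds needed to drain the existing backlog and then the new task."""
    return (backlog_cycles + demand_cycles) / rate_hz

def end_to_end_latency(action, data_bits, demand_cycles, local, servers):
    """action 0 = local execution; action j > 0 = offload to servers[j - 1]."""
    if action == 0:
        return compute_delay(local["backlog"], demand_cycles, local["rate"])
    server = servers[action - 1]
    tx_delay = data_bits / server["uplink_rate"]
    return tx_delay + compute_delay(server["backlog"], demand_cycles, server["rate"])

# Made-up state: an idle 1 GHz vehicle and an idle 3.5 GHz server 200 m away.
rate = uplink_rate(1e6, 1.0, 200.0, 4.0, 1e-13)  # P = 30 dBm = 1 W
local = {"rate": 1.0e9, "backlog": 0.0}
servers = [{"rate": 3.5e9, "backlog": 0.0, "uplink_rate": rate}]
t_local = end_to_end_latency(0, 5e5, 1e9, local, servers)
t_offload = end_to_end_latency(1, 5e5, 1e9, local, servers)
```

In this toy state offloading wins because the server is idle and close; a large server backlog or a distant RSU reverses the comparison, which is the trade-off each agent has to learn.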

B. Reinforcement Learning Formulation

We model the task offloading decision of each vehicle as a Markov Decision Process (MDP), defined by the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{S}', \mathcal{R} \rangle$, where $\mathcal{S}'$ denotes the next state space reached after executing an action. Each vehicle maintains an independent agent that observes local information and selects an offloading target without coordination with other vehicles.

State Space. At each decision step, vehicle $v_i$ constructs a local observation vector:

$$ s_i = \left[ f_i, L_i, b_i, D_i, \{ f_j, L_j, r_{ij} \}_{j=1}^{N} \right] \tag{7} $$

where $f_i$, $L_i$ are the vehicle’s local processing rate and backlog; $b_i$, $D_i$ are the task data size and computation demand; and $f_j$, $L_j$, $r_{ij}$ describe the processing rate, backlog, and uplink rate of each server $s_j$.

Action Space. The action $a_i \in \{0, 1, \dots, N\}$ selects the execution target: $a_i = 0$ denotes local execution, and $a_i = j > 0$ denotes offloading to server $s_j$.

Reward Function. The reward is the negative end-to-end latency of the completed task:

$$ r_i = -t^{e2e}_{i} \tag{8} $$

Maximising the cumulative discounted reward therefore directly minimises the objective in (6).
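A minimal sketch of how such a local observation and reward could be assembled is given below. It assumes the observation is a flat concatenation of the four vehicle and task features with three features per server, so its length is 4 + 3N (52 for N = 16). The helper names, the flat layout, and the placeholder values are assumptions for illustration.

```python
def build_observation(f_i, L_i, b_i, D_i, servers):
    """Flatten local, task, and per-server features into one list.

    `servers` is a sequence of (f_j, L_j, r_ij) tuples, one per RSU.
    """
    obs = [f_i, L_i, b_i, D_i]
    for f_j, L_j, r_ij in servers:
        obs.extend([f_j, L_j, r_ij])
    return obs

def reward(latency_s):
    """Negative end-to-end latency, so lower latency means higher reward."""
    return -latency_s

servers = [(3.5e9, 0.0, 1.0e7)] * 16  # placeholder values for N = 16 servers
obs = build_observation(1.0e9, 0.0, 5e5, 1e9, servers)
```

In practice the raw features span many orders of magnitude (cycles, bits, bit/s), so some normalisation before feeding a Q-network is likely needed; the paper does not specify one.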

Learning Algorithm. Each vehicle agent is trained independently using a Dueling Double DQN (D3QN) [28]. Unlike standard DQN, Dueling DQN decouples the estimation of the state-value function $V(s; \theta_V)$ and the state-dependent advantage function $A(s, a; \theta_A)$. The Q-value is given by:

$$ Q(s, a; \theta) = V(s; \theta_V) + A(s, a; \theta_A) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta_A) \tag{9} $$

where $\theta = \{\theta_V, \theta_A\}$ denotes the network parameters of a single agent and $|\mathcal{A}| = N + 1$ is the number of actions. The Double DQN component mitigates Q-value overestimation by decoupling action selection from action evaluation using separate online and target networks.
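The two D3QN ingredients can be illustrated in a few lines of plain Python. The sketch assumes the mean-subtracted aggregation proposed in the dueling-network paper [28] and the standard Double DQN target; the function names, the single-state list representation, and the discount factor are illustrative choices, not details from the paper.

```python
def dueling_q(value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a'), for one state."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + adv - mean_adv for adv in advantages]

def double_dqn_target(reward, done, q_next_online, q_next_target, gamma=0.99):
    """Online network selects the next action; target network evaluates it."""
    best = max(range(len(q_next_online)), key=q_next_online.__getitem__)
    bootstrap = 0.0 if done else gamma * q_next_target[best]
    return reward + bootstrap

q = dueling_q(1.0, [0.5, -0.5, 0.0])
target = double_dqn_target(
    reward=-2.0,
    done=False,
    q_next_online=[0.1, 0.9, 0.3],  # online net prefers action 1
    q_next_target=[5.0, 1.0, 7.0],  # target net's largest value is action 2
    gamma=0.5,
)
```

In the example, a plain DQN target would bootstrap from the target network's own maximum (7.0), whereas the Double DQN target uses the target network's value for the action chosen by the online network (1.0), which is how the overestimation is damped.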

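Exploration Policy. Action selection follows a Boltzmann policy. Given the Q-values $Q(s, a; \theta)$ for valid actions, the probability of selecting action $a$ is:

$$ \pi(a \mid s) = \frac{\exp\left(Q(s, a; \theta) / \tau\right)}{\sum_{a'} \exp\left(Q(s, a'; \theta) / \tau\right)} \tag{10} $$

where $\tau > 0$ is a temperature parameter; a higher $\tau$ encourages exploration, while $\tau \to 0$ leads to a greedy policy.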
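The effect of the temperature can be seen in a short sketch of Boltzmann (softmax) action selection. The max-shift inside the exponent is a standard numerical-stability trick and does not change the probabilities; the function names and example Q-values are illustrative.

```python
import math
import random

def boltzmann_probs(q_values, tau):
    """Softmax over Q-values with temperature tau > 0 (max-shifted for stability)."""
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def sample_action(q_values, tau, rng=random):
    """Draw an action index according to the Boltzmann distribution."""
    probs = boltzmann_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

q = [-4.0, -2.0, -3.0]  # e.g. negative expected latencies for three targets
hot = boltzmann_probs(q, tau=10.0)   # high temperature: close to uniform
cold = boltzmann_probs(q, tau=0.05)  # low temperature: close to greedy
```

With a high temperature the three actions are chosen almost equally often, while with a low temperature nearly all probability mass sits on the action with the largest Q-value.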

C. DT-Assisted Adaptive Offloading

The proposed framework consists of a Digital Twin (DT) and its corresponding Physical Twin (PT), where the PT denotes the physical system being mirrored and assisted by the DT, as illustrated in Fig. 2. The framework operates in three stages:

Stage 1: Snapshot. Upon triggering, the DT acquires a snapshot $\Phi$ of the current PT system state:

$$ \Phi = \left\langle \{f_j, L_j, p_j\}_{j=1}^{N},\ \{f_i, L_i, p_i\}_{i=1}^{M},\ \lambda,\ w,\ \Theta_{PT} \right\rangle \tag{11} $$

where $\{f_j, L_j, p_j\}$ are the servers’ processing rates, backlogs, and positions, $\{f_i, L_i, p_i\}$ are the vehicles’ processing rates, backlogs, and positions, $\lambda$ is the task arrival rate, $w$ is the task-type distribution, and $\Theta_{PT} = \{\theta_i\}_{i=1}^{M}$ is the current set of network weights of all agents. In a real distributed system, obtaining a globally consistent snapshot $\Phi$ can be challenging, as it requires synchronisation across multiple agents. In this work, we assume that such snapshots are available; their acquisition falls outside the scope of this paper.

Stage 2: DT Training. The DT uses $\Phi$ to initialise both the simulated environment and the agent network weights. The DT starts with a high exploration rate ($\tau = \tau_0$) at the beginning of training, allowing agents to rapidly discover strategies suited to the new environment conditions through simulation.

Stage 3: Weight Synchronisation. Upon completion of DT training, the updated weights $\Theta_{DT}$ are synchronised back to the PT. Each vehicle then updates its local agent with its weights from $\Theta_{DT}$ and continues online learning in the physical environment. The above procedure is formally described in Algorithm 1 and the workflow of the proposed approach is illustrated in Fig. 3.

Fig. 3: Workflow of the DT-assisted VEC system. Agents in the PT learn online; when the environment context changes, the DT is triggered, knowledge and state are synchronised to the DT, the agents in the DT revise their knowledge through what-if training, and the revised knowledge is injected back into the PT, restoring the amount of applicable knowledge over time.

Algorithm 1: DT-Assisted Adaptation

Require: DT training budget $T_{DT}$

Ensure: Updated agent weights $\Theta_{PT}$ in the PT

Stage 1: Snapshot

  • Acquire the current PT system snapshot $\Phi$

Stage 2: DT Training

  • Reconstruct and initialise the DT from $\Phi$

  • Set exploration temperature $\tau \leftarrow \tau_0$

  • for each step during $T_{DT}$ do

      • Observe $s$; select $a$ via Boltzmann policy

      • Execute $a$; observe $r = -t^{e2e}$ and $s'$

      • Store $(s, a, r, s')$ in replay buffer $\mathcal{B}$

      • Update $\Theta_{DT}$ using D3QN

  • end for

Stage 3: Weight Synchronisation

  • Synchronise weights back to the PT: $\Theta_{PT} \leftarrow \Theta_{DT}$

  • Resume online learning in the PT
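The three stages can be sketched as a small control loop. Everything below is a hypothetical interface written for illustration: `snapshot`, `load_weights`, `step`, and `weights` are assumed method names, the toy classes stand in for the SUMO-based PT and DT, the budget is counted in steps rather than simulated seconds, and the temperature is held at its initial value because no decay schedule is specified.

```python
import copy

def dt_assisted_adaptation(pt, make_twin, budget_steps, tau0):
    """One trigger of the snapshot -> train-in-twin -> synchronise cycle."""
    # Stage 1: snapshot of the PT state, including the current agent weights.
    phi = copy.deepcopy(pt.snapshot())
    # Stage 2: rebuild the twin from the snapshot and train with high exploration.
    twin = make_twin(phi)
    for _ in range(budget_steps):
        twin.step(tau=tau0)
    # Stage 3: copy the improved weights back; the PT then resumes online learning.
    pt.load_weights(twin.weights())
    return twin.weights()

class ToyPT:
    """Stand-in for the physical twin with a single scalar 'weight' per agent."""
    def __init__(self):
        self.w = {"vehicle_0": 0.0}
    def snapshot(self):
        return {"env": {"arrival_rate": 1 / 8}, "weights": self.w}
    def load_weights(self, weights):
        self.w = dict(weights)

class ToyTwin:
    """Stand-in for the digital twin; each step mimics one learning update."""
    def __init__(self, phi):
        self.w = dict(phi["weights"])  # start from the PT's latest policies
    def step(self, tau):
        for name in self.w:
            self.w[name] += 1.0  # placeholder for a D3QN update
    def weights(self):
        return self.w

pt = ToyPT()
dt_assisted_adaptation(pt, ToyTwin, budget_steps=3, tau0=1.0)
```

The deep copy in Stage 1 keeps training in the twin from touching the PT's live weights until the explicit synchronisation step, which mirrors the replace-on-sync behaviour described above.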

IV. EXPERIMENTAL SETUP

A. Simulation Settings

Experiments are conducted on a fixed 2 km × 2 km grid road network simulated using SUMO with $N = 16$ stationary RSU edge servers. To evaluate adaptation to dynamic conditions, we use a 3500 s simulation. It includes a 500 s warm-up phase (excluded from comparative analysis), followed by three 1000 s phases, each introducing an environmental change. The initial system configuration in the warm-up phase follows the setup in [15]. The wireless channel adopts a path-loss model ($P = 30$ dBm, $\alpha = 4.0$, $B = 1$ MHz, $\sigma^2 = 10^{-13}$ W). The 16 RSU servers comprise four low-tier (2.0 GHz), four mid-tier (2.5 GHz), and eight high-tier (3.5 GHz) nodes. We deploy 45 active vehicles with onboard processing capacities drawn uniformly from $[0.8, 1.2]$ GHz. Task arrivals follow a Poisson process with rate $\lambda = 1/8\ \mathrm{s}^{-1}$. Six task types are considered, with computation demands ranging from $10^{9}$ to $10^{10}$ cycles and data sizes of $5 \times 10^{5}$ to $10^{7}$ bits.

Fig. 4: 2 km × 2 km simulation scenario.

Each subsequent phase modifies the initial configuration as follows: Phase 1 (500–1500 s): Active vehicles increase to 65, and workloads shift toward computation-heavy tasks ($10^{10}$ cycles). Phase 2 (1500–2500 s): Half of the servers enter maintenance, degrading processing rates by 50–70%, while the task arrival rate increases by 20%. Phase 3 (2500–3500 s): All server capacities are restored, but the dominant workload shifts to lightweight tasks.

These phases aim to emulate realistic traffic and infrastructure fluctuations seen in urban Internet of Vehicles (IoV) systems. The increase in vehicles and computation-heavy workloads (Phase 1) mirrors rush-hour congestion where more users generate latency-sensitive applications, while partial server degradation (Phase 2) reflects maintenance windows or unexpected outages in edge infrastructure. The final shift to lightweight tasks (Phase 3) captures off-peak conditions where demand stabilises and applications become less resource-intensive.
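For reference, the phase timeline described above can be written down as a small lookup structure. This is only a convenience sketch: the `Phase` class, the `SCHEDULE` list, and the free-text `change` field are illustrative names, and the descriptions paraphrase the setup rather than encode the exact simulator parameters.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Phase:
    name: str
    start_s: int
    end_s: int
    change: str

SCHEDULE = [
    Phase("warm-up", 0, 500, "initial configuration, 45 vehicles"),
    Phase("phase 1", 500, 1500, "65 vehicles, computation-heavy tasks"),
    Phase("phase 2", 1500, 2500, "half the servers degraded by 50-70%, arrival rate +20%"),
    Phase("phase 3", 2500, 3500, "server capacity restored, lightweight tasks dominate"),
]

def phase_at(t_s: float) -> Phase:
    """Return the phase active at simulation time t_s (seconds)."""
    for phase in SCHEDULE:
        if phase.start_s <= t_s < phase.end_s:
            return phase
    raise ValueError(f"time {t_s} s is outside the 0-3500 s run")
```

A lookup like `phase_at(task_completion_time)` is one simple way to bucket per-task latencies by phase before computing phase-wise statistics.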

B. Experiment Settings

To verify the effectiveness of the DT in accelerating policy convergence, we compare the proposed DT-assisted approach against the following baselines: (a) Random: assigns tasks uniformly at random across all servers and local execution; (b) Online: performs online learning exclusively in the PT without DT assistance; (c) Offline: deploys the converged weights obtained from the full Online run; and (d) Exploit: operates identically to the proposed method, but switches the agents in the PT to exploitation ($\tau = \tau_{\min}$) after weight synchronisation. Both the PT and DT use SUMO¹ for vehicle mobility simulation, with wireless channel, task offloading, and RL logic implemented in Python. The DT runs 25× faster than the PT. Experiments² were run on an AMD Ryzen 7 6800H CPU with 16 GB RAM and an NVIDIA RTX 3060 GPU under Ubuntu 24.04.

To investigate the factors affecting DT effectiveness, we evaluate variants along three dimensions: DT triggers per phase ($k \in \{1, 2, 4\}$), DT training time per trigger ($T_{DT} \in \{250, 500\}$ s), and single-scenario versus multi-scenario simulation. All scenarios are initialised from the received PT snapshot, with vehicle routes randomly assigned in each case. In the multi-scenario setting, the DT runs a sequence of perturbed scenarios, carrying model weights over between them. Perturbations (5%) are applied to parameters such as vehicle speed, task computation demand, and task data size.

¹ https://eclipse.dev/sumo/

² Code, data, and experiment scripts are available at: https://github.com/asialab-sustech/TwinLoop

V. EXPERIMENTAL RESULTS

We evaluate the proposed DT-assisted framework against the baselines and variants defined in Section IV-B. Performance is evaluated using the per-task end-to-end latency $t^{e2e}$. We report the phase-wise mean, 90th-percentile (P90), and 99th-percentile (P99) latency. Table I summarises these statistics, while Fig. 5 provides a visual overview of the phase-wise mean latency across all methods.

TABLE I: Phase-wise end-to-end latency (seconds) for all methods.

| Category | Method | Warmup Mean | Warmup P90 | Warmup P99 | Phase 1 Mean | Phase 1 P90 | Phase 1 P99 | Phase 2 Mean | Phase 2 P90 | Phase 2 P99 | Phase 3 Mean | Phase 3 P90 | Phase 3 P99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baselines | Random | 7.404 | 15.180 | 37.575 | 113.712 | 349.580 | 606.062 | 1141.270 | 3324.453 | 4571.616 | 558.378 | 1576.397 | 1793.818 |
| Baselines | Online | 6.731 | 13.065 | 30.003 | 18.293 | 43.060 | 82.834 | 11.856 | 19.785 | 66.930 | **4.240** | **7.708** | 18.031 |
| Baselines | Offline | **4.282** | **7.045** | **13.793** | 27.031 | 55.224 | 95.262 | 36.974 | 74.408 | 133.062 | 5.741 | 10.168 | 49.572 |
| Exploit | Exploit ($T_{DT}$=250) | 6.342 | 12.896 | 24.270 | 11.626 | 22.922 | 44.726 | 14.020 | 22.594 | 66.986 | 4.395 | 7.770 | **17.228** |
| Exploit | Exploit ($T_{DT}$=500) | 6.944 | 13.335 | 36.308 | 9.678 | 17.431 | 36.053 | 10.979 | 18.629 | 35.848 | 4.567 | 8.496 | 18.340 |
| DT Single-Scenario | k=1, $T_{DT}$=500 | 6.689 | 13.075 | 27.602 | 10.187 | 19.396 | 39.223 | **10.529** | **16.761** | **29.889** | 4.532 | 8.570 | 18.134 |
| DT Single-Scenario | k=2, $T_{DT}$=500 | 6.338 | 12.645 | 23.575 | **8.320** | 15.071 | 30.482 | 10.698 | 17.372 | 34.961 | 4.380 | 7.905 | 18.619 |
| DT Single-Scenario | k=4, $T_{DT}$=500 | 6.699 | 13.848 | 24.999 | 9.676 | 16.705 | 34.632 | 12.151 | 20.517 | 45.506 | 4.683 | 8.600 | 18.753 |
| DT Multi-Scenario | k=1, $T_{DT}$=500 | 7.067 | 14.326 | 30.676 | 9.828 | 17.541 | 36.990 | 11.386 | 18.795 | 38.182 | 4.859 | 8.808 | 21.141 |
| DT Multi-Scenario | k=2, $T_{DT}$=500 | 6.508 | 12.980 | 24.445 | 8.448 | **14.238** | **27.291** | 12.986 | 21.114 | 44.444 | 4.523 | 8.433 | 18.401 |

Note: Warmup covers 0–500 s; Phase 1, 500–1500 s; Phase 2, 1500–2500 s; Phase 3, 2500–3500 s. Each phase covers 1,000 s of simulation time. Bold denotes the best value in each column. $k$: number of DT triggers per phase; $T_{DT}$: scenario duration; Multi-Scenario: 3 scenarios per trigger.

Fig. 5: Phase-wise mean latency across methods.

Fig. 6: Latency trajectory: Online vs. DT-assisted adaptation.

The results demonstrate that the DT functions as a convergence accelerator. DT-assisted training substantially reduces latency following phase transitions, with the benefit concentrated in the initial period following each environmental change (Fig. 6). Compared to the Online baseline, the proposed method (DT single-scenario, $k = 1$, $T_{DT} = 500$ s) yields large gains in Phase 1, lowering mean latency from 18.293 s to 10.187 s and P99 latency from 82.834 s to 39.223 s. In Phase 2, the mean gap narrows (11.856 s versus 10.529 s), yet the P99 reduction remains large (66.930 s versus 29.889 s). The Offline baseline shows severe performance degradation once conditions change, with mean latencies of 27.031 s in Phase 1 and 36.974 s in Phase 2. This is consistent with the premise that offline learning methods struggle in highly dynamic environments.

In Phase 3, all methods except Random exhibit a similar latency profile, with mean latencies between 4.2 s and 5.8 s. This is primarily attributed to the operating conditions of Phase 3, namely decreased task computational requirements and recovered edge server capacity, which model off-peak settings. These conditions reduce the pressure on the agents, allowing all approaches to achieve low latency.

To isolate the effect of the duration of training in the DT, the Exploit baseline is evaluated with $T_{DT} = 250$ s and $T_{DT} = 500$ s. Exploit is chosen to eliminate the noise added by exploration in the PT. The longer-trained variant achieves a substantial reduction in latency in the demanding phases: mean latency drops from 11.626 s to 9.678 s in Phase 1 and from 14.020 s to 10.979 s in Phase 2, where P99 latency also falls from 66.986 s to 35.848 s. This suggests that longer training times are beneficial, as they allow the policy to better adapt to the new conditions.

Regarding the DT trigger frequency ($k$), the results show that triggering DT adaptation can enhance performance under high environmental pressure, but overly frequent updates have destabilising effects. Specifically, in the single-scenario setting, increasing $k$ from 1 to 2 reduces mean latency in Phase 1 (from 10.187 s to 8.320 s), but a further increase to $k = 4$ degrades performance relative to $k = 2$ in all phases.
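To put the trigger frequency in perspective, the time the twin needs per phase can be estimated from figures given in the setup: the DT runs 25× faster than the PT, $T_{DT}$ is the scenario duration, and the multi-scenario variant runs 3 scenarios per trigger. The sketch below is a back-of-the-envelope estimate only; it assumes scenarios run sequentially, that the speed-up is constant, and that snapshot and synchronisation overheads are negligible. The function name is hypothetical.

```python
def dt_cost_in_pt_seconds(k, t_dt_s, scenarios_per_trigger=1, speedup=25.0):
    """PT-time seconds that elapse while the twin trains, per phase.

    Assumes sequential scenarios and a constant speed-up factor.
    """
    simulated_s = k * scenarios_per_trigger * t_dt_s
    return simulated_s / speedup

single_k1 = dt_cost_in_pt_seconds(k=1, t_dt_s=500)
single_k4 = dt_cost_in_pt_seconds(k=4, t_dt_s=500)
multi_k2 = dt_cost_in_pt_seconds(k=2, t_dt_s=500, scenarios_per_trigger=3)
```

Under these assumptions a single 500 s scenario costs 20 s of PT time, so even the heaviest evaluated variants occupy a modest share of each 1000 s phase; the observed degradation at $k = 4$ is therefore more plausibly related to repeated weight replacement than to the time spent in the twin, although the results reported here do not isolate the cause.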

Regarding multi-scenario versus single-scenario DT-assisted adaptation, the differences are marginal. This can be primarily attributed to the fact that the simulation scenario used for evaluation models a homogeneous urban setting, which diminishes the value of training on varying scenarios. We expect multi-scenario training to lead to more impactful results under heterogeneous scenarios covering both urban and rural settings; such evaluation constitutes future work.
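A minimal sketch of how perturbed scenarios for the multi-scenario setting might be generated is shown below. The paper states only that 5% perturbations are applied to parameters such as vehicle speed, task computation demand, and task data size; the uniform multiplicative noise, the parameter names, and the base values used here are assumptions for illustration.

```python
import random

def perturb_scenario(base, fraction=0.05, rng=None):
    """Return a copy of `base` with each value scaled by a factor in [1 - fraction, 1 + fraction]."""
    rng = rng or random.Random()
    return {
        key: value * rng.uniform(1.0 - fraction, 1.0 + fraction)
        for key, value in base.items()
    }

# Hypothetical base parameters taken from a PT snapshot.
base = {"vehicle_speed_mps": 13.9, "task_demand_cycles": 1e10, "task_data_bits": 1e7}
rng = random.Random(42)
scenarios = [perturb_scenario(base, 0.05, rng) for _ in range(3)]  # 3 scenarios per trigger
```

Because every scenario stays within 5% of the same snapshot, the variants remain close to one another, which is consistent with the marginal differences observed in a homogeneous urban setting.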

Overall, the results show that DT assistance is most effective during the most demanding periods (Phases 1 and 2), where it reduces mean latency and suppresses tail latencies. The results further suggest that trigger frequency plays a critical role: triggering the DT too rarely limits its adaptation benefit, whereas overly frequent triggering can harm performance. This highlights adaptive DT trigger scheduling as an important direction for future work.
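The mean and tail statistics discussed in this section (mean, P90, and P99 of per-task latency) can be computed per phase as sketched below. The paper does not state which percentile definition is used; the nearest-rank method here is an assumption, and the input data are synthetic.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def phase_summary(latencies_s):
    """Mean, P90 and P99 of the per-task latencies observed within one phase."""
    return {
        "mean": sum(latencies_s) / len(latencies_s),
        "p90": percentile(latencies_s, 90),
        "p99": percentile(latencies_s, 99),
    }

summary = phase_summary([float(x) for x in range(1, 101)])  # synthetic latencies 1..100 s
```

Interpolating definitions (such as the default in many numerical libraries) give slightly different values on small samples, so the chosen method should be kept fixed when comparing methods.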

VI. CONCLUSION AND FUTURE DIRECTION

This paper proposed a Digital Twin-assisted deep reinforcement learning framework to accelerate online learning convergence in multi-agent environments. Using vehicular edge computing (VEC) task offloading as a representative case study, the experimental results demonstrate that the proposed DT-assisted online policy adaptation can substantially accelerate policy convergence, leading to decreased end-to-end task execution latency. Future work directions include the development of an adaptive DT triggering mechanism based on detecting context shifts and evolving conditions; a policy merging scheme that allows policy augmentation rather than replacement; and further evaluation of the framework in highly heterogeneous network topologies to explore its generalisation capabilities.

REFERENCES

  • [1] F. Quin, D. Weyns, and O. Gheibi, “Decentralized Self-Adaptive Systems: A Mapping Study,” in 2021 International Symposium on Software Engineering for Adaptive and Self-Managing Systems , 2021, pp. 18–29.

  • [2] H. Muccini, M. Sharaf, and D. Weyns, “Self-adaptation for Cyber-physical Systems: A Systematic Literature Review,” in 11th International Symposium on Software Engineering for Adaptive and Self-Managing Systems. New York, NY, USA: ACM, 2016, pp. 75–81.

  • [3] O. Gheibi, D. Weyns, and F. Quin, “Applying Machine Learning in Self-adaptive Systems: A Systematic Literature Review,” ACM Transactions on Autonomous and Adaptive Systems, vol. 15, no. 3, pp. 9:1–9:37, 2021.

  • [4] V. Cardellini, F. Lo Presti, M. Nardelli, and G. Russo Russo, “Decentralized self-adaptation for elastic Data Stream Processing,” Future Generation Computer Systems , vol. 87, pp. 171–185, 2018.

  • [5] M. D’Angelo, M. Caporuscio, V. Grassi, and R. Mirandola, “Decentralized learning for self-adaptive QoS-aware service assembly,” Future Generation Computer Systems , vol. 108, pp. 210–227, Jul. 2020.

  • [6] P.-A. Dragan, A. Metzger, and K. Pohl, “Coordinated Online Reinforcement Learning for Self-Adaptive Systems Using Factored Q-Learning,” in 2025 IEEE International Conference on Autonomic Computing and Self-Organizing Systems (ACSOS) , Sep. 2025, pp. 76–87.

  • [7] A. Metzger, C. Quinton, Z. A. Mann, L. Baresi, and K. Pohl, “Realizing self-adaptive systems via online reinforcement learning and feature-model-guided exploration,” Computing, vol. 106, no. 4, Apr. 2024.

  • [8] S. Padakandla, “A Survey of Reinforcement Learning Algorithms for Dynamically Varying Environments,” ACM Computing Surveys , vol. 54, no. 6, pp. 127:1–127:25, Jul. 2021.

  • [9] D. Lee, Y.-S. Kang, and S. D. Noh, “Digital twin-driven deep reinforcement learning for real-time optimisation in dynamic AGV systems,” International Journal of Production Research , pp. 1–19, Aug. 2025.

  • [10] N. Zhang, R. Bahsoon, and G. Theodoropoulos, “Towards Engineering Cognitive Digital Twins with Self-Awareness,” in 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC) . IEEE, Oct. 2020, pp. 3891–3896.

  • [11] X. Wang, L. Ma, H. Li, Z. Yin, T. Luan, and N. Cheng, “Digital Twin-Assisted Efficient Reinforcement Learning for Edge Task Scheduling,” in 2022 IEEE 95th Vehicular Technology Conference, 2022, pp. 1–5.

  • [12] J. Zheng, Y. Zhang, T. H. Luan, P. K. Mu, G. Li, M. Dong, and Y. Wu, “Digital Twin Enabled Task Offloading for IoVs: A Learning-Based Approach,” IEEE Transactions on Network Science and Engineering , vol. 11, no. 1, pp. 659–672, Jan. 2024.

  • [13] G. Diamantopoulos, N. Tziritas, R. Bahsoon, and G. Theodoropoulos, “Dynamic data-driven digital twins for blockchain systems,” in International Conference on Dynamic Data Driven Applications Systems . Springer, 2022, pp. 283–292.

  • [14] A. Uddin, A. H. Sakr, and N. Zhang, “Intelligent Offloading in Vehicular Edge Computing: A Comprehensive Review of Deep Reinforcement Learning Approaches and Architectures,” Jun. 2025.

  • [15] X. Chen, B. Xiao, X. Lin, Z. Chen, and G. Min, “Multi-agent collaboration for vehicular task offloading using federated deep reinforcement learning,” IEEE Trans. Mobile Comput. , vol. 24, no. 9, 2025.

  • [16] C. Kennedy and G. Theodoropoulos, “Intelligent Management of Data Driven Simulations to Support Model Building in the Social Sciences,” in Computational Science – ICCS 2006 . Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 562–569.

  • [17] G. Theodoropoulos, C. Kennedy, P. Lee, C. Skelcher, E. Ferrari, and V. J. Sorge, “DDDAS in the social sciences,” in Handbook of Dynamic Data Driven Applications Systems: Volume 2 . Springer International Publishing, 2023, pp. 765–791.

  • [18] N. Zhang, R. Bahsoon, N. Tziritas, and G. Theodoropoulos, “Knowledge equivalence in digital twins of intelligent systems,” ACM Trans. Model. Comput. Simul. , vol. 34, no. 1, Jan. 2024.

  • [19] N. Zhang, C. Vergara-Marcillo, G. Diamantopoulos, J. Shen, N. Tziritas, R. Bahsoon, and G. Theodoropoulos, “Large language models for explainable decisions in dynamic digital twins,” in Dynamic Data Driven Applications Systems . Springer Nature Switzerland, 2026, pp. 81–89.

  • [20] N. Zhang, R. Bahsoon, N. Tziritas, and G. Theodoropoulos, “Explainable human-in-the-loop dynamic data-driven digital twins,” in Dynamic Data Driven Applications Systems . Springer Nature Switzerland, 2024, pp. 233–243.

  • [21] Z. Hua, P. Oikonomou, K. Djemame, N. Tziritas, and G. Theodoropoulos, “A digital twin-based multi-agent reinforcement learning framework for vehicle-to-grid coordination,” in Algorithms and Architectures for Parallel Processing . Springer Nature Singapore, 2026, pp. 512–530.

  • [22] J. Wu, Z. Huang, P. Hang, C. Huang, N. De Boer, and C. Lv, “Digital Twin-enabled Reinforcement Learning for End-to-end Autonomous Driving,” in 2021 IEEE 1st International Conference on Digital Twins and Parallel Intelligence (DTPI) , Jul. 2021, pp. 62–65.

  • [23] J. Deng, Q. Zheng, G. Liu, J. Bai, K. Tian, C. Sun, Y. Yan, and Y. Liu, “A Digital Twin Approach for Self-optimization of Mobile Networks,” in 2021 IEEE Wireless Communications and Networking Conference Workshops (WCNCW) . Nanjing, China: IEEE, Mar. 2021, pp. 1–6.

  • [24] W. Sun, S. Lei, L. Wang, Z. Liu, and Y. Zhang, “Adaptive Federated Learning and Digital Twin for Industrial Internet of Things,” IEEE Transactions on Industrial Informatics , vol. 17, no. 8, Aug. 2021.

  • [25] K. Zhang, J. Cao, and Y. Zhang, “Adaptive Digital Twin and Multiagent Deep Reinforcement Learning for Vehicular Edge Computing and Networks,” IEEE Transactions on Industrial Informatics , vol. 18, no. 2, pp. 1405–1413, Feb. 2022.

  • [26] Y. Xie, Q. Wu, and P. Fan, “Digital Twin Vehicular Edge Computing Network: Task Offloading and Resource Allocation,” in 2024 7th International Conference on Information Communication and Signal Processing (ICICSP) , Sep. 2024, pp. 1137–1141.

  • [27] P. Singh, B. Hazarika, K. Singh, W.-J. Huang, and T. Q. Duong, “GenAI-Enhanced Federated Multiagent DRL for Digital-Twin-Assisted IoV Networks,” IEEE Internet of Things Journal, vol. 12, no. 5, pp. 4834–4851, Mar. 2025.

  • [28] Z. Wang, T. Schaul, M. Hessel, H. V. Hasselt, M. Lanctot, and N. De Freitas, “Dueling network architectures for deep reinforcement learning,” in 33rd International conference on machine learning, 2016, pp. 1995–2003.