Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning

Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning

:page_facing_up: Download PDF | :link: View on arXiv

Source arXiv
arXiv ID 2603.08930v1
Authors Heesup Yun, Isaac Kazuo Uyehara, Earl Ranario et al.
Published Mar 9, 2026
Categories cs.CV, cs.AI
Curated by @stevek
Curated on Apr 21, 2026
Tags digital-twin, agriculture

This paper introduces a synthetic benchmark to evaluate the performance of vision language models (VLMs) in generating plant simulation configurations for digital twins. While functional-structural plant models (FSPMs) are useful tools for simulating biophysical processes in agricultural environment




Full-Text Markdown

Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning

Heesup Yun Isaac Kazuo Uyehara Earl Ranario Lars Lundqvist Christine H. Diepenbrock Brian N. Bailey J. Mason Earles University of California, Davis

{hspyun, ikuyehara, ewranario, llund, chdiepenbrock, bnbailey, jmearles}@ucdavis.edu

Abstract

This paper introduces a synthetic benchmark to evaluate the performance of vision language models (VLMs) in generating plant simulation configurations for digital twins. While functional-structural plant models (FSPMs) are useful tools for simulating biophysical processes in agricultural environments, their high complexity and low throughput create bottlenecks for deployment at scale. We propose a novel approach that leverages state-of-the-art open-source VLMs— Gemma 3 and Qwen3-VL—to directly generate simulation parameters in JSON format from drone-based remote sensing images. Using a synthetic cowpea plot dataset generated via the Helios 3D procedural plant generation library, we tested five in-context learning methods and evaluated the models across three categories: JSON integrity, geometric evaluations, and biophysical evaluations. Our results show that while VLMs can interpret structural metadata and estimate parameters like plant count and sun azimuth, they often exhibit performance degradation due to contextual bias or rely on dataset means when visual cues are insufficient. Validation on a real-world drone orthophoto dataset and an ablation study using a blind baseline further characterize the models’ reasoning capabilities versus their reliance on contextual priors. To the best of our knowledge, this is the first study to utilize VLMs to generate structural JSON configurations for plant simulations, providing a scalable framework for reconstruction 3D plots for digital twin in agriculture.

1. Introduction

Digital twins are systems that simulate a physical system and update its states based on sensor measurements of the real world [19]. In agriculture, digital twins can make a simulation of crops, environment, and management, enabling artificial intelligence (AI) driven plant simulation and “what-if” experiments across various crops and envi-

ronments [11, 14, 17, 21, 23, 37]. Crop modeling is considered a useful component of digital twins because it can simulate biophysical processes such as photosynthesis and water-use [40]. Specifically, detailed 3D representations of the field, including plant positions and canopy structure, are important for accurate crop simulation [18].

Digital twin models often use structured document formats, such as JavaScript object notation (JSON) or extensible markup language (XML) to express plant and environmental model definitions, data inputs, and stored data outputs to another system [26]. For example, Lobet et al. [34] introduced the root system markup language using XML to express plant root architecture data across different root phenotyping and modeling tools. Kim and Heo [30] used soil data from JSON and XML files as input to digital twin models. Baker et al. [9] used a JSON structure to interface their twin model with Unreal Engine. Alves et al. [2] used JSON to interface environmental context data with the smart farming management system.

Meanwhile, recent advances in large language models (LLMs) have enabled an understanding of the schema and content of structured documents and reproducing them from a given context [16, 20, 42, 46, 47]. In addition, vision language models (VLMs) can analyze images and generate responses, not only explaining their contents but also counting and localizing objects within them [28, 35, 38, 41, 44]. By combining this vision capability with LLM’s structured document generation, VLMs can generate structured outputs from an image. Kim et al. [29] proposed a method that converts an image into a JSON text document using an image encoder and text decoder, without optical character recognition. Lee et al. [31] proposed Pix2Struct, which converts a webpage screenshot into a structured HTML document. Liu et al. suggested DePlot [32] and MatCha [33] for translating a chart or graph image into a structured Markdown table or a structured dataset.

Driven by the versatile image-to-text capabilities of recent VLMs, specialized agricultural benchmarks have emerged to evaluate tasks such as disease identification

1

Figure 1. Overview of the data-driven synthetic data generation pipeline and real-to-sim evaluation framework. (1) Data-Driven Synthetic Dataset Generation: Spatial features and structural parameters were extracted from real-world field data to procedurally synthesize highfidelity cowpea plant plots. (2) Sim-to-Real Evaluation: The vision language model (VLM) was evaluated via few-shot in-context learning.

and quantification. Zhou et al. [48] proposed a vision language framework for crop leaf disease classification using the Qwen-VL model. Yang et al. [45] trained AgriGPTVL on Agri-3M-VL and evaluated it on AgriBench-VL-4K, which contains agricultural open-ended visual question answering (VQA) and image-grounded multiple choice question answering (MCQA). Awais et al. [5] trained AgroGPT on the AgroInstruct dataset to enable multi-turn multimodal dialogue about agricultural concepts (e.g., diseases, weeds, insects, fruits), and assessed VQA using the AgroEvals benchmark. Shinoda et al. [39] introduced AgroBench (Agronomist AI Benchmark), an expert-annotated multiplechoice QA benchmark that evaluates VLMs’ performance on seven agricultural tasks: disease, pest, weed identification, crop, disease, traditional management, and machine usage. Arshad et al. [4] introduced the AgEval benchmark for the identification, classification, and quantification of plant disease and stress levels.

Even though these previous studies covered a wide range of agricultural tasks, applying a vision-language model to generate representation of 3D plot models for digital twins has yet to be tested. Generating the 3D plot models involves multiple recognition tasks such as plant-type classification, environmental factor regression, plant localization, and biophysical parameter regression. Since most previous research has treated these tasks as separate, performing

them simultaneously will be extremely challenging. Therefore, generating a benchmark to test the VLM’s performance on a 3D plot simulation task for digital twins and designing an experiment identifying the factors that affect the generation performance are needed.

In this paper, we propose a benchmark to evaluate the performance of VLMs in generating configurations for 3D plot simulation, including synthetic and real datasets. We also present a novel approach that leverages VLMs to automate the generation of plant simulation parameters in JSON format, a task that has not been explored in the literature. We tested the VLMs using five in-context learning methods, and three proposed categories of evaluation metrics for the generated plant simulation configurations: JSON integrity, geometric evaluations, and biophysical evaluations. Finally, this study discussed factors affecting the evaluation metrics and our understanding of these factors.

2. Methods

2.1. Drone Remote Sensing Dataset

Drone-based remote sensing of a cowpea ( Vigna unguiculata (L.) Walp.) breeding experiment was conducted in 2025 at an experimental field in California. The plot dimensions were 1.5 m x 3.0 m with 1.5 m alleys, and the plants were planted in a single row. The field had 15 beds,

2

each composed of 12 plots when excluding border plots, and contained 60 cowpea genotypes, with three replicate blocks. The cowpeas were planted on May 27, 2025, and harvested on October 9, 2025.

The orthophotos were processed with Open Drone Map and exported as GeoTIFF files. Plot boundaries were annotated using QGIS, and the plot images were cropped accordingly. The exact plant count and locations were manually annotated in 10 DAP images and used as ground-truth.

2.2. Cowpea Plot Simulation Program

A plant simulator program was developed using the Helios 3D simulation library [7, 8]. Plant growth simulation was performed to generate 10 DAP images, and later-stage images were generated with flowers and pods. In addition, plant biophysical traits were simulated and represented through variations in leaf color, influenced by leaf pigments and structure, using the PROSPECT [27] leaf optics model.

The plant simulator gets a JSON file as input and generates a virtual cowpea plot. We designed a JSON file that not only contains the crucial information needed to simulate a cowpea plot but is also human-readable and easily reproduced by LLMs. The JSON file has six top-level metadata fields: random seed, metadata, environment, field, plant properties, and camera. The JSON file starts with a random seed, which dominates random sampling when automatically generating plants in the simulation. The metadata key includes year, location, plant type, and days after planting. The environment key contains soil and sun environment data, such as soil spectral data category and specular reflection coefficient, and sun elevation and azimuth degrees. The field key includes the field layout, such as plot size and number of beds. Plot keys are nested under the field key, and the plot key contains plot bed and row ID, and individual plant locations as an array. Then the plant property key includes PROSPECT model parameters determining the leaf color spectrum, such as the number of mesophyll layers, chlorophyll content, and carotenoid content. Image sensor and camera positioning related parameters are in the camera key, such as shutter speed, ISO, camera resolution and model, camera height, and a look-at vector.

2.3. Synthetic Cowpea Plot Dataset

For a scalable digital twin dataset generation, an image processing based plant detection algorithm was used. Fig. 1 (1) shows the overview of the process. The ExG [15] vegetation index was applied to segment plants from the background, and plant blobs were detected using OpenCV blob detection functions. The detected blobs were divided into individual plants based on the blob’s aspect ratio, and the center positions were refined to the center of gravity of each blob. The plant locations in meters were calculated from the plot dimensions and pixel locations, with the image center

set to (0,0).

The plant stand count and plant locations were detected from the drone orthophoto of 10 DAP cowpea, and the detected plant locations were saved in JSON format to reconstruct the plants in the plant simulator. The remaining plant simulation parameters were randomized by sampling values from predefined uniform distributions when generating a simulated plot’s JSON. The plant simulator produces a plot based on the input JSON file and simulates growth at 30, 50, 70, and 90 DAP. For each DAP, 224 simulated plot images were generated, resulting in a total of 1,120 images. The simulated plot images at 381x1080 resolution were exported as JPEG files.

2.4. Vision Language Models

To evaluate the VLM’s ability to generate a simulation config JSON file from an image, state-of-the-art vision language models were tested. Since generating a digital twin simulation configuration file from an image is not a typical use case for VLMs and requires many trial-and-error iterations, we started by using open-source instruction-following models, Gemma 3 [43] and Qwen3-VL [6]. Gemma 3 was released in March 2025, and is a lightweight open multimodal model with parameter counts ranging from 1B to 27B. Qwen3-VL was released in December 2025 and includes models from 2B to 235B parameters. We tested the 4B, 12B, and 27B models for Gemma3 and the 4B, 8B, and 30B models for Qwen3-VL. A self-hosted Ollama (https://ollama.com/) server was used to provide an API endpoint for interacting with the models via a Python script. The maximum context size was set to 32K to ensure it covers the maximum length of in-context learning prompts and generation results around 20K, and possibly longer when the model hallucinates.

2.5. In Context Learning

LLMs can learn from context and provide an answer based on the given information, without changing the model weights [13]. This is called few-shot learning or in-context learning. We tested in-context learning methods by gradually providing more context to the model and observed changes in JSON generation performance.

To maximize the model’s reasoning capabilities, models were instructed to add the “reasoning” key at the beginning of the JSON output. This two-stage process enables the model to reason freely in natural language, then complete the remaining JSON structure based on the analysis [42].

Summarization and examples of the five in-context learning methods is shown in Table 1. The simplest way to generate a structured output from LLM is to give format restriction instructions (FRI), such as “Response as JSON" or “Provide your output in the following valid JSON for-

3

Table 1. In-context learning configurations and example texts added to the prompt. ’
(more)’ signifies abbreviated content for presentation purposes and was not part of the prompt input.

Confguration Example
Method 1: You are a plant phenotyping expert analyzing Helios 3D plant simulator images. Extract all
Zero-shot JSON genera- simulation parameters that produced this image and output them as JSON. — Parameter Ref-
tion erence — 1. metadata 
 (more)
Answer:
Method 2: JSON SCHEMA: { “reasoning”: “string”, “seed”: “integer”,
Method 1 “metadata”: { “year”: “integer”, “location”: “string”,
+JSON schema “plant_type”: “string”, “dap”: “integer” }, 
(more) }
Method 3: Example
1:
{ “reasoning”: "Visual analysis: Plant maturity
_×_3
Method 2 suggests 10 days growth. 
(more) }
+few-shot JSONs
Method 4: Example 1: user: 〈IMAGEâŒȘ
A: { “reasoning”: "Visual analysis: Plant
_×_3
Method 3 maturity suggests 10 days growth. 
(more) }
+few-shot images
Method 5: Ground truth hints for target image: Plant age: 10 DAP, Plant count: 14, Sun position: 62.9°
Method 4 elev., 169.4° azim., Plant locations (rx, ry): [(0.163, 0.073), (0.162, 0.085), . . . , (0.174, 0.743)].
+grounding info Convert to meters:x= (rx −0.5)×1.3521,y =−(_ry −0.5)×3._8405

mat”. Therefore, the first method served as a baseline that defined the LLM’s role, task, goal, and JSON FRI. The second method added a JSON schema to the first, so the models can reference the JSON structure and variable types. The third method added a few-shot JSON examples to the second method, so that the model can see how the values were filled from the JSON schema. An important piece of information from the example JSON was the reasoning key text, which provided guidelines for building reasoning text step by step. The fourth method included few-shot images and JSON responses so that the models can learn how to extract visual features from images and output them as JSON. This context was treated as chat history between the user and the assistant, simulating N pairs of questions and answers.

The last method added grounding information that can be easily derived from an image and its metadata, such as the number of plants and their approximate locations. It also contains the sun’s elevation and azimuth angle, which can be calculated from the plot’s latitude, longitude, and timestamp. We designed the last method to fully utilize the available information from the image and field, and autocomplete the JSON. Providing this information can provide a shortcut to the model; however, there was still uncertainty about how well the models would find it.

2.6. LoRa Fine-tuning

To test the effect of fine-tuning in few-shot in-context learning performance, we performed parameter-efficient fine-tuning (PEFT) using LoRA [24]. The Qwen3-VL 32B model was fine-tuned with r =16 and α =16, resulting in 141.9M trainable parameters, which is 0.65% of the 32B model capacity. The training dataset consisted of 1,788

synthetic cowpea plot images, not including the in-context learning evaluation dataset. Unsloth [22] Python library was used to accelerate the training process and reduce the memory overhead. The model was trained for 3 epochs with an effective batch size of 64 on four NVIDIA A100 GPUs for about three hours.

2.7. Evaluation metrics

Mean Guess Baseline: As a naĂŻve baseline, mean guess baselines were calculated. In the synthetic dataset, all parameters except plant counts and locations were sampled from a uniform distribution, so the theoretical MAE for the mean-guess baseline is a quarter of the distribution range. The plant count and location distributions were nonuniform, resulting in a mean guess MAE that was less than a quarter of the distribution span.

JSON integrity: We evaluated JSON integrity metrics to assess performance differences in JSON generation across models and contexts, focusing on the accuracy of models’ responses from a natural language perspective. The first metric was the JSON syntax error rate, the ratio of results with JSON syntax errors that could not parse the JSON from the response. The second metric was the JSON key-missing rate, which counts the number of missing keys in the response and divides by the total number of JSON keys in the ground truth. Lastly, the BLEU-4 [36] score was computed to assess the similarity between the generated and groundtruth outputs.

Geometric evaluations: Plant growth led to major geometric changes in the cowpea plot, altering plant size, structure, and the visibility of plant organs. The DAP was evaluated using mean absolute error (MAE) between the gener-

4

Figure 2. Multi-model evaluation metric comparisons. Blue colors represent Gemma3 [43] models, orange colors represent [6] models, and green colors represent LoRA [24] fine-tuned Qwen3-VL models. Blue dotted lines represent mean guess baselines.

ated JSON and ground truth. To quantify the spatial alignment between the predicted S 1 and ground-truth S 2 plant locations, we used Chamfer Distance [10],

to calculate plant location error, where d( P, Q ) represents the average nearest-neighbor distance from set P to Q . Other scalar variables, such as the number of plants, sun elevation, sun azimuth, and leaf pitch, were evaluated using MAE.

Biophysical evaluations: Predictions of leaf compound concentrations, which are chlorophyll, carotenoid, and anthocyanin content, as well as water mass, dry matter, and leaf structure (N), were evaluated using MAE.

2.8. Real orthophoto evaluations

To test the sim-to-real gap when the synthetic data-based in-context-learning method was applied to real image, a real image dataset from drone orthophoto was evaluated. Fig. 1 (2) shows the overview of the real image evaluation. Based on the synthetic dataset evaluation result, the best-performing model was selected and evaluated on real

image dataset. Since the real image dataset provides only a subset of parameters available from the synthetic dataset, only DAP, plant count, plant locations, sun elevation angle, and azimuth angle were evaluated. The DAP was calculated from the planting date and image capture date. The plant count and locations were annotated by the author and saved in COCO JSON format. The sun elevation and azimuth angles were calculated using the pvlib [3] Python library, with the plot center’s latitude and longitude and the exact image capture timestamp.

3. Results

All the evaluation metrics shown with the with 95% confidence intervals. Statistical significance within the same incontext learning was assessed using the Kruskal-Wallis H- test, followed by pairwise Mann-Whitney U tests with Bonferroni correction for comparisons between models, DAPs, and input image types. Distinct lowercase letters denote significant differences (p < 0.05).

5

Figure 3. Synthetic dataset days after planting (DAP) effect on evaluation metrics. Orange colors represent [6] models, and green colors represent LoRA [24] fine-tuned Qwen3-VL models. Blue dotted lines represent mean guess baselines.

3.1. Synthetic dataset evaluation result

The synthetic dataset was evaluated with a synthetic fewshot learning context. Fig. 2 shows the evaluation results across three models, three model sizes, and five in-contextlearning methods. Adding grounding information significantly reduced errors across all metrics, providing a checkpoint that models can use to generate JSON given context.

The BLEU-4 score showed interactions across models and in-context learning prompts. Gemma3 models achieved a higher BLEU-4 score than Qwen3-VL models without a few-shot example, and similar scores after a few-shot example was provided. The fine-tuned Qwen3-Vl model showed the highest BLEU-4 scores except for the when grounding information was provided.

Geometric Evaluations: The models generally showed decreasing DAP MAEs as model size increased. However, adding few-shot examples did not lower the MAE values. Qwen3-VL models showed less MAE than Gemma3 models for most model sizes and in context-learning methods. Specifically, sometimes the Qwen3-VL 4B model showed lower MAE errors for DAP than the Gemma3 27B model. Grounding information reduced all models’ MAE by less than 1.9 days. The fine-tuned model showed lower MAE values than the original model in most contexts. In some cases, the larger fine-tuned model showed lower MAE values than the smaller model.

Plant count MAE values showed a less substantial effect of model size and in-context learning methods. However, the plant location Chamfer distance showed Qwen3VL models had lower error than Gemma models; larger and more context decreased the plant location error. Specifi-

cally, Qwen3-VL 4B model showed lower MAE values than Gemma3 27B model. Fine-tuning lowered the plant count MAE and location Chamfer distance when only baseline context or JSON schema were given.

Biophysical Evaluations: For most leaf pigment estimation tasks, models failed to estimate values and exhibited high MAE, regardless of model size or context. Chlorophyll content and leaf structure showed generally lower MAE values for the larger models, but there were no significant differences when more contexts were added. Furthermore, anthocyanin MAE increased with model size, suggesting a gap between the model’s knowledge and the dataset.

Effect of Dataset DAP: Fig. 3 shows the synthetic data evaluation result of JSON integrity and geometric evaluations from the Qwen3-VL 32B and it’s fine-tuned model, based on dataset DAP. Based on Fig. 3 (a), (b), and (c), there was no noticeable effect on JSON integrity from the DAP dataset, except for the BLUE-4 scores. DAP MAE values showed a pattern across the contexts, increasing MAE from DAP 10 to 30, decreasing from DAP 30 to 70, and increasing from DAP 70 to 90. The fine-tuned model did not show the same pattern and in some cases showed higher MAE values than the mean guess baseline.

Plant count MAE increased for both models when DAP increased. The fine-tuning improved the MAE from 10 to 50 DAP and showed similar MAE for 70 and 90 DAP. Plant location Chamfer distance increased when DAP progressed, and fine-tuning lowered the error when base prompt and JSON schema were given, but showed higher values after few-shot examples were given.

Sun elevation MAE values increased as the DAP progressed, but sun azimuth MAE values remained similar. The

6

Figure 4. Evaluations on the synthetic dataset, the real ortho dataset, and the blind baseline from the original and fine-tuned Qwen3VL model. Orange colors represent [6] models, and green colors represent LoRA [24] fine-tuned Qwen3-VL models. Blue dotted lines represent mean guess baselines.

fine tuning lowered the sun elevation MAE values, but did not improved the sun azimuth MAE. There were no visible effects on leaf pitch MAE across the dataset DAP and model fine-tuning.

3.2. Real Image Evaluation Result

The real images from drone orthophoto were evaluated with synthetic few-shot learning contexts. Fig. 4 shows the evaluation results of the original and fine-tuned Qwen3VL 32B model, three different image inputs, and five incontext-learning methods. Because the real dataset did not have complete ground truth JSONs, BLEU-4 score, leaf pitch, and biophysical evaluations were excluded.

JSON Integrity Evaluations: Providing a real image showed higher syntax error rates and key-missing rates, and the fine-tuned model showed lower error rates than the original model.

Geometric Evaluation: The real ortho dataset’s DAP MAE remained similar as more context was provided to the model, and was higher than the synthetic data evaluations, up to 4.7 DAP. Plant count MAE on a real image dataset was higher than that on the synthetic dataset up to 5.3 plants, but plant locations showed lower MAE values around 0.1m than on the synthetic dataset. Sun elevation and azimuth evaluations on the real image dataset yielded lower MAEs than on the synthetic dataset. The fine-tuned model showed lower plant count MAE for real images, but did not lower the plant location errors.

Visual Evaluation: Fig. 5 shows simulated cowpea plots generated by five in-context learning methods given an example real image. When only the baseline prompt was

given, the model generated vertically aligned plants with approximate DAP. Adding a JSON schema affected the generated JSON even if only the variable types and key orders were added to the context. Adding a few-shot JSON and image also changed the rendered output, but there was no clear pattern in which direction the few-shot examples shifted the simulation renderings. But adding few-shot images sometimes caused the model to generate double-row planted plots, even though the cowpea plots were singlerow planted. Adding grounding information to the context yielded the most similar simulated cowpea plots, highlighting the importance of plant count and localization accuracy.

3.3. Ablation Study

To verify whether the model references the image when generating the answer, the Qwen3-VL 32B model was tested by omitting the target image from the final prompt and send only “Answer now:”. The real image dataset’s ground truth data was used to calculate errors for the blind baseline.

However, in some cases, it achieved lower MAE values than those from the synthetic and real-dataset evaluations, especially when the evaluation results were close to the mean-guess baseline, such as plant count and sun elevation.

4. Discussion

Structured Output Generation Performance: A stricter method that forces LLMs to respond in JSON format is enabled by setting JSON mode in the API, which provides a template and generates structured output that fol-

7

Figure 5. Examples of simulated cowpea plot generation results based on in-context learning methods. Real images were given to Qwen3-VL 32B model to generate a cowpea plot simulation configuration, and the images were rendered by the simulation program.

lows the given schema [1]. Zhang et al. [47] argued that structured decoding can improve tool-use performance by removing errors in JSON generation, even without finetuning the model. Researchers reported that structured output could not only improve the performance of classification tasks such as information extraction and name entity recognition, but also decrease the number of generated tokens [12, 20]. However, forcing an LLM to generate responses in a structured format reduced performance on complex reasoning tasks such as mathematics, last-letter concatenation, and shuffling objects [42]. Also, Iwanowski and Gahbler [25] reported that forcing structured output for object localization tasks can lead to hallucinations, generating fake objects to follow the perfect structure. In Fig. 2, our results showed a maximum 6.6% syntax error and an 8.5% JSON key-missing error, both of which were easily fixable, such as omitting the last curly bracket or adding an additional

comma to the last element. Since research on generating structured outputs for agricultural tasks is limited, further research is needed to maximize task accuracy and minimize generation errors.

Effect of Model Size, Contextual Bias, and Blind (noimage) baseline : Performance across model sizes and context levels exhibited non-linear trends. Increasing model size occasionally worsened certain metrics, such as anthocyanin MAE, potentially because larger models focus on global context while smaller ones remain more sensitive to local patterns. This irregularity suggests that model scale alone does not guarantee improved accuracy on challenging agricultural tasks.

Furthermore, additional context often introduces contextual bias rather than improving reasoning. We observed that when models failed to extract reliable visual cues, they tended to default to the provided context, either by copying values directly from few-shot examples or following the parameter distribution of the prompt. This phenomenon was particularly evident in smaller models, where DAP and plant count errors increased after providing few-shot JSON examples.

The blind baseline further highlights this reliance on prompt-driven priors. When lower errors were seen from the blind baseline, it suggested that the model could not reliably capture signals in the provided images. In several cases, the blind baseline achieved lower error metrics than evaluations with real images by simply adhering to the distribution of the few-shot context. This suggests that adding image input can act as noise when the model fails to capture a reliable signal, leading it to prioritize contextual information over genuine visual inference.

5. Conclusion

We proposed a benchmark for cowpea plot simulation that includes simulated cowpea plot images with corresponding JSON configuration files, as well as real images to test the syn-to-real gap. We also suggested in-context learning methods that automatically generate 3D cowpea plot simulations. To the best of our knowledge, this is the first study to utilize VLMs to generate the structural JSON configurations required for plant simulations directly from images.

However, our result still has limitations: the VLMs have not yet been able to estimate or reduce errors to levels close to the human-annotated ground truth or even achieve a basic computer-vision-based approach. Therefore, in future research, improving parameter estimation accuracy will be achieved by incorporating more curated, detailed context. For example, adding a color book of every leaf color, based on leaf pigment, or providing a few more shot examples, with the context window extended to 128K tokens, can be tested to improve accuracy. Also, fine-tuning the model with the generated synthetic dataset will be

8

tested.

References

  • [1] Structured outputs · Ollama Blog. https://ollama.com/public/Structured outputs. 8

  • [2] Rafael Gomes Alves, Rodrigo Filev Maia, and FĂĄbio Lima. Development of a Digital Twin for smart farming: Irrigation management system for water saving. Journal of Cleaner Production , 388:135920, 2023. 1

  • [3] Kevin S. Anderson, Clifford W. Hansen, William F. Holmgren, Adam R. Jensen, Mark A. Mikofski, and Anton Driesse. Pvlib python: 2023 project update. Journal of Open Source Software , 8(92):5994, 2023. 5

  • [4] Muhammad Arbab Arshad, Talukder Zaki Jubery, Tirtho Roy, Rim Nassiri, Asheesh K. Singh, Arti Singh, Chinmay Hegde, Baskar Ganapathysubramanian, Aditya Balu, Adarsh Krishnamurthy, and Soumik Sarkar. Leveraging Vision Language Models for Specialized Agricultural Tasks. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , pages 6320–6329, Tucson, AZ, USA, 2025. IEEE. 2

  • [5] Muhammad Awais, Ali Husain Salem Abdulla Alharthi, Amandeep Kumar, Hisham Cholakkal, and Rao Muhammad Anwer. AgroGPT : Efficient Agricultural Vision-Language Model with Expert Tuning. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , pages 5687–5696, Tucson, AZ, USA, 2025. IEEE. 2

  • [6] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, and Ke Zhu. Qwen3-VL Technical Report, 2025. 3, 5, 6, 7

  • [7] Brian N. Bailey. Helios: A Scalable 3D Plant and Environmental Biophysical Modeling Framework. Frontiers in Plant Science , 10, 2019. 3

  • [8] Brian N. Bailey. A generalized framework for procedural generation of three-dimensional static and dynamic plant model geometries, 2025. 3

  • [9] Dirk Norbert Baker, Felix Maximilian Bauer, Mona Giraud, Andrea Schnepf, Jens Henrik Göbbert, Hanno Scharr, Ebba Þora Hvannberg, and Morris Riedel. A scalable pipeline to create synthetic datasets from functional– structural plant models for deep learning. in silico Plants , 6 (1):diad022, 2024. 1

  • [10] H G Barrow, J M Tenenbaum, R C Bolles, and H C Wolf. Parametric Correspondence and Chamfer Matching: Two New Techniques for Image Matching. 5

  • [11] Joaquim Bellvert, Ana PelechĂĄ, MagĂ­ Pamies-Sans, Jordi Virgili, Mireia Torres, and Jaume CasadesĂșs. Assimilation of Sentinel-2 Biophysical Variables into a Digital Twin for the Automated Irrigation Scheduling of a Vineyard. Water , 15(14):2506, 2023. 1

  • [12] Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. Prompting Is Programming: A Query Language for Large Language Models. Proceedings of the ACM on Programming Languages , 7(PLDI):1946–1969, 2023. 8

  • [13] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language Models are Few-Shot Learners, 2020. 3

  • [14] Stefano Cesco, Paolo Sambo, Maurizio Borin, Bruno Basso, Guido Orzes, and Fabrizio Mazzetto. Smart agriculture and digital twins: Applications and challenges in a vision of sustainability. European Journal of Agronomy , 146:126809, 2023. 1

  • [15] D. M. Woebbecke, G. E. Meyer, K. Von Bargen, and D. A. Mortensen. Color Indices for Weed Identification Under Various Soil, Residue, and Lighting Conditions. Transactions of the ASAE , 38(1):259–269, 1995. 3

  • [16] Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen. XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models, 2024. 1

  • [17] Marc EscribĂ -Gelonch, Shu Liang, Pieter van Schalkwyk, Ian Fisk, Nguyen Van Duc Long, and Volker Hessel. Digital Twins in Agriculture: Orchestration and Applications. Journal of Agricultural and Food Chemistry , 72(19):10737– 10752, 2024. 1

  • [18] Mathieu Gaillard, Chenyong Miao, James C. Schnable, and Bedrich Benes. Voxel carving-based 3D reconstruction of sorghum identifies genetic determinants of light interception efficiency. Plant Direct , 4(10):e00255, 2020. 1

  • [19] Baskar Ganapathysubramanian, Soumik Sarkar, Arti Singh, and Asheesh K. Singh. Digital twins for the plant sciences. Trends in Plant Science , 30(5):576–577, 2025. 1

  • [20] Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning, 2023. 1, 8

  • [21] V. Gonzalez-Dugo, P. Zarco-Tejada, E. NicolĂĄs, P. A. Nortes, J. J. AlarcĂłn, D. S. Intrigliolo, and E. Fereres. Using high resolution UAV thermal imagery to assess the variability in the water status of five fruit tree species within a commercial orchard. Precision Agriculture , 14(6):660–678, 2013. 1

  • [22] Daniel Han, Michael Han, and Unsloth team. Unsloth, 2023. 4

  • [23] Tobias Hank, Heike Bach, and Wolfram Mauser. Using a Remote Sensing-Supported Hydro-Agroecological Model for Field-Scale Simulation of Heterogeneous Crop Growth and

9

Yield: Application for Wheat in Central Europe. Remote Sensing , 7:3934–3965, 2015. 1

  • [24] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan AllenZhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models, 2021. 4, 5, 6, 7

  • [25] Marcin Iwanowski and Marcin Gahbler. Multiple Large AI Models’ Consensus for Object Detection—A Survey. Applied Sciences , 15(24):12961, 2025. 8

  • [26] Michael Jacoby and Thomas UslĂ€nder. Digital Twin and Internet of Things—Current Standards Landscape. Applied Sciences , 10(18):6519, 2020. 1

  • [27] S. Jacquemoud and F. Baret. PROSPECT: A model of leaf optical properties spectra. Remote Sensing of Environment , 34(2):75–91, 1990. 3

  • [28] Seunggu Kang, WonJun Moon, Euiyeon Kim, and Jae-Pil Heo. VLCounter: Text-Aware Visual Representation for Zero-Shot Object Counting. Proceedings of the AAAI Conference on Artificial Intelligence , 38(3):2714–2722, 2024. 1

  • [29] Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. OCR-Free Document Understanding Transformer. In Computer Vision – ECCV 2022 , pages 498–517. Springer Nature Switzerland, Cham, 2022. 1

  • [30] Steven Kim and Seong Heo. An agricultural digital twin for mandarins demonstrates the potential for individualized agriculture. Nature Communications , 15(1):1561, 2024. 1

  • [31] Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding, 2022. 1

  • [32] Fangyu Liu, Julian Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. DePlot: One-shot visual language reasoning by plot-to-table translation. In Findings of the Association for Computational Linguistics: ACL 2023 , pages 10381–10399, Toronto, Canada, 2023. Association for Computational Linguistics. 1

  • [33] Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, and Julian Eisenschlos. MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 12756–12770, Toronto, Canada, 2023. Association for Computational Linguistics. 1

  • [34] Guillaume Lobet, Michael P. Pound, Julien Diener, Christophe Pradal, Xavier Draye, Christophe Godin, Mathieu Javaux, Daniel Leitner, FĂ©licien Meunier, Philippe Nacry, Tony P. Pridmore, and Andrea Schnepf. Root System Markup Language: Toward a Unified Root Architecture Description Language. Plant Physiology , 167(3):617–627, 2015. 1

  • [35] Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching CLIP to

Count to Ten. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV) , pages 3147–3157, Paris, France, 2023. IEEE. 1

  • [36] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL ’02 , page 311, Philadelphia, Pennsylvania, 2001. Association for Computational Linguistics. 4

  • [37] Nikolaos Peladarinos, Dimitrios Piromalis, Vasileios Cheimaras, Efthymios Tserepas, Radu Adrian Munteanu, and Panagiotis Papageorgas. Enhancing Smart Agriculture by Implementing Digital Twins: A Comprehensive Review. Sensors (Basel, Switzerland) , 23(16):7128, 2023. 1

  • [38] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision, 2021. 1

  • [39] Risa Shinoda, Nakamasa Inoue, Hirokatsu Kataoka, Masaki Onishi, and Yoshitaka Ushiku. AgroBench: VisionLanguage Model Benchmark in Agriculture. 2

  • [40] Soualihou Soualiou, Zhiwei Wang, Weiwei Sun, Philippe de Reffye, Brian Collins, GaĂ«tan Louarn, and Youhong Song. Functional–Structural Plant Models Mission in Advancing Crop Science: Opportunities and Prospects. Frontiers in Plant Science , 12, 2021. 1

  • [41] Sanjay Subramanian, William Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, and Anna Rohrbach. ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 5198–5215, Dublin, Ireland, 2022. Association for Computational Linguistics. 1

  • [42] Zhi Rui Tam, Cheng-Kuang Wu, Yi-Lin Tsai, Chieh-Yen Lin, Hung-yi Lee, and Yun-Nung Chen. Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models, 2024. 1, 3, 8

  • [43] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre RamĂ©, Morgane RiviĂšre, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jeanbastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, GaĂ«l Liu, Francesco Visin, Kathleen Kenealy, Lucas Beyer, Xiaohai Zhai, Anton Tsitsulin, Robert Busa-Fekete, Alex Feng, Noveen Sachdeva, Benjamin Coleman, Yi Gao, Basil Mustafa, Iain Barr, Emilio Parisotto, David Tian, Matan Eyal, Colin Cherry, Jan-Thorsten Peter, Danila Sinopalnikov, Surya Bhupatiraju, Rishabh Agarwal, Mehran Kazemi, Dan Malkin, Ravin Kumar, David Vilar, Idan Brusilovsky, Jiaming Luo, Andreas Steiner, Abe Friesen, Abhanshu Sharma, Abheesht Sharma, Adi Mayrav Gilady, Adrian Goedeckemeyer, Alaa Saade, Alex Feng, Alexander Kolesnikov, Alexei Bendebury, Alvin Abdagic, Amit Vadi, AndrĂĄs György, AndrĂ© Susano Pinto, Anil Das, Ankur Bapna, Antoine Miech, Antoine Yang, Antonia Paterson, Ashish Shenoy, Ayan Chakrabarti, Bilal Piot,

10

Bo Wu, Bobak Shahriari, Bryce Petrini, Charlie Chen, Charline Le Lan, Christopher A. Choquette-Choo, C. J. Carey, Cormac Brick, Daniel Deutsch, Danielle Eisenbud, Dee Cattle, Derek Cheng, Dimitris Paparas, Divyashree Shivakumar Sreepathihalli, Doug Reid, Dustin Tran, Dustin Zelle, Eric Noland, Erwin Huizenga, Eugene Kharitonov, Frederick Liu, Gagik Amirkhanyan, Glenn Cameron, Hadi Hashemi, Hanna Klimczak-PluciŽnska, Harman Singh, Harsh Mehta, Harshal Tushar Lehri, Hussein Hazimeh, Ian Ballantyne, Idan Szpektor, Ivan Nardini, Jean Pouget-Abadie, Jetha Chan, Joe Stanton, John Wieting, Jonathan Lai, Jordi Orbay, Joseph Fernandez, Josh Newlan, Ju-yeong Ji, Jyotinder Singh, Kat Black, Kathy Yu, Kevin Hui, Kiran Vodrahalli, Klaus Greff, Linhai Qiu, Marcella Valentine, Marina Coelho, Marvin Ritter, Matt Hoffman, Matthew Watson, Mayank Chaturvedi, Michael Moynihan, Min Ma, Nabila Babar, Natasha Noy, Nathan Byrd, Nick Roy, Nikola Momchev, Nilay Chauhan, Noveen Sachdeva, Oskar Bunyan, Pankil Botarda, Paul Caron, Paul Kishan Rubenstein, Phil Culliton, Philipp Schmid, Pier Giuseppe Sessa, Pingmei Xu, Piotr Stanczyk, Pouya Tafti, Rakesh Shivanna, Renjie Wu, Renke Pan, Reza Rokni, Rob Willoughby, Rohith Vallu, Ryan Mullins, Sammy Jerome, Sara Smoot, Sertan Girgin, Shariq Iqbal, Shashir Reddy, Shruti Sheth, Siim PÔder, Sijal Bhatnagar, Sindhu Raghuram Panyam, Sivan Eiger, Susan Zhang, Tianqi Liu, Trevor Yacovone, Tyler Liechty, Uday Kalra, Utku Evci, Vedant Misra, Vincent Roseberry, Vlad Feinberg, Vlad Kolesnikov, Woohyun Han, Woosuk Kwon, Xi Chen, Yinlam Chow, Yuvein Zhu, Zichuan Wei, Zoltan Egyed, Victor Cotruta, Minh Giang, Phoebe Kirk, Anand Rao, Kat Black, Nabila Babar, Jessica Lo, Erica Moreira, Luiz Gustavo Martins, Omar Sanseviero, Lucas Gonzalez, Zach Gleicher, Tris Warkentin, Vahab Mirrokni, Evan Senter, Eli Collins, Joelle Barral, Zoubin Ghahramani, Raia Hadsell, Yossi Matias, D. Sculley, Slav Petrov, Noah Fiedel, Noam Shazeer, Oriol Vinyals, Jeff Dean, Demis Hassabis, Koray Kavukcuoglu, Clement Farabet, Elena Buchatskaya, JeanBaptiste Alayrac, Rohan Anil, Dmitry, Lepikhin, Sebastian Borgeaud, Olivier Bachem, Armand Joulin, Alek Andreev, Cassidy Hardin, Robert Dadashi, and Léonard Hussenot. Gemma 3 Technical Report, 2025. 3, 5

Don’t Fine-Tune, Decode: Syntax Error-Free Tool Use via Constrained Decoding, 2023. 1, 8

  • [48] Yueyue Zhou, Hongping Yan, Kun Ding, Tingting Cai, and Yan Zhang. Few-Shot Image Classification of Crop Diseases Based on Vision–Language Models. Sensors , 24(18):6109, 2024. 2

  • [44] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2016. 1

  • [45] Bo Yang, Yunkui Chen, Lanfei Feng, Yu Zhang, Xiao Xu, Jianyu Zhang, Nueraili Aierken, Runhe Huang, Hongjian Lin, Yibin Ying, and Shijian Li. AgriGPT-VL: Agricultural Vision-Language Understanding Suite, 2025. 2

  • [46] Jialin Yang, Dongfu Jiang, Lipeng He, Sherman Siu, Yuxuan Zhang, Disen Liao, Zhuofeng Li, Huaye Zeng, Yiming Jia, Haozhe Wang, Benjamin Schneider, Chi Ruan, Wentao Ma, Zhiheng Lyu, Yifei Wang, Yi Lu, Quy Duc Do, Ziyan Jiang, Ping Nie, and Wenhu Chen. StructEval: Benchmarking LLMs’ Capabilities to Generate Structural Outputs, 2026. 1

  • [47] Kexun Zhang, Hongqiao Chen, Lei Li, and William Wang.

11