Toward a unified system for crop analytics

Lessons from the Tidyverse

Dr. Emi Tanaka

Australian National University

28th October 2025

Sowing seeds for continuous development of a software suite and community of practice in crop analytics

Toward a unified system
for crop analytics

Yet another system…

Tidyverse

An opinionated collection of R packages designed for data science
All packages share an underlying design philosophy, grammar, and data structures

Tidyverse has been immensely impactful

Illustrative data: Lupin MET data

Yield of 9 varieties of lupin at different planting densities across 2 years and multiple locations.

data(verbyla.lupin, package = "agridat")
str(verbyla.lupin)

'data.frame':   1420 obs. of  13 variables:
 $ gen    : Factor w/ 9 levels "Danja","Gungurru",..: 3 2 4 2 2 8 1 2 1 4 ...
 $ site   : Factor w/ 11 levels "S01","S02","S03",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ rep    : Factor w/ 3 levels "R1","R2","R3": 1 1 1 1 1 1 1 1 1 1 ...
 $ rate   : int  40 60 40 50 40 40 10 10 60 10 ...
 $ row    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ col    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ serp   : Factor w/ 4 levels "SE1","SE2","SE3",..: 1 2 3 4 1 2 3 4 1 2 ...
 $ linrow : num  -0.75 -0.65 -0.55 -0.45 -0.35 -0.25 -0.15 -0.05 0.05 0.15 ...
 $ lincol : num  -2.5 -2.5 -2.5 -2.5 -2.5 -2.5 -2.5 -2.5 -2.5 -2.5 ...
 $ linrate: num  -0.169 1.831 -0.169 0.831 -0.169 ...
 $ yield  : num  0.617 1.194 1.099 0.941 0.983 ...
 $ year   : int  91 91 91 91 91 91 91 91 91 91 ...
 $ loc    : Factor w/ 8 levels "Badgingerra",..: 1 1 1 1 1 1 1 1 1 1 ...

Data wrangling syntax comparison

Base R

verbyla.lupin |> 
 subset(year == 91) |> 
 subset(select = c(gen, loc, yield)) |> 
 transform(yield = 10 * yield) |> 
 aggregate(\(x) mean(x), 
           by = . ~ gen + loc) |> 
 {\(d) d[order(d$gen, d$loc), ]}()

library(dplyr) (Tidyverse)

verbyla.lupin |> 
 filter(year == 91) |> 
 select(gen, loc, yield) |> 
 mutate(yield = 10 * yield) |> 
 summarise(yield = mean(yield), 
           .by = c(gen, loc)) |> 
 arrange(gen, loc)

Subset the data to the year 1991

Select the columns gen, loc, and yield

Multiply yield by 10 to convert t/ha to dt/ha (1 dt = 100 kg; kg/ha would need a factor of 1000)

Get the mean yield by genotype and location

Data wrangling syntax comparison

Base R

verbyla.lupin |> 
 subset(year == 91) |> 
 subset(select = c(gen, loc, yield)) |> 
 transform(yield = 10 * yield) |> 
 aggregate(\(x) mean(x), 
           by = . ~ gen + loc) |> 
 {\(d) d[order(-d$yield), ]}()

Arrange results by descending mean yield

library(dplyr) (Tidyverse)

verbyla.lupin |> 
 filter(year == 91) |> 
 select(gen, loc, yield) |> 
 mutate(yield = 10 * yield) |> 
 summarise(yield = mean(yield), 
           .by = c(gen, loc)) |> 
 arrange(desc(yield))

         gen         loc     yield
1     Merrit    MtBarker 17.175571
2   Gungurru    MtBarker 15.887556
3     Warrah    MtBarker 15.725571
4      Danja    MtBarker 14.991444
5     Yorrel   Newdegate 14.598500
6    Unicrop    MtBarker 14.170000
7      Danja    Corrigin 13.582083
8     Yandee    Corrigin 13.184667
9  Illyarrie    MtBarker 13.078857
10  Gungurru    Corrigin 12.657417
11     Danja   Newdegate 12.476917
12   Unicrop    Corrigin 12.067000
13    Yorrel    MtBarker 11.992286
14    Yorrel    Corrigin 11.767167
15    Merrit   Newdegate 11.762583
16 Illyarrie    Corrigin 11.655583
17    Warrah    Corrigin 10.219583
18     Danja WonganHills 10.051417
19 Illyarrie   Newdegate  9.702250
20    Yandee    MtBarker  9.593091
21    Yorrel WonganHills  9.233667
22    Yandee   Newdegate  8.911750
23  Gungurru WonganHills  8.591333
24   Unicrop WonganHills  8.044667
25 Illyarrie WonganHills  8.022750
26    Merrit Badgingerra  6.997583
27    Warrah   Newdegate  6.976417
28  Gungurru Badgingerra  6.601833
29     Danja Badgingerra  6.348167
30    Yandee Badgingerra  5.981750
31    Warrah Badgingerra  5.864500
32 Illyarrie Badgingerra  5.545417
33   Unicrop Badgingerra  4.624083
34    Yorrel Badgingerra  4.211500
35    Merrit    Corrigin        NA
36  Gungurru   Newdegate        NA
37   Unicrop   Newdegate        NA
38    Merrit WonganHills        NA
39    Yandee WonganHills        NA
40    Warrah WonganHills        NA
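
The NA rows at the bottom of this output are presumably genotype-by-location groups with at least one missing yield, since mean() returns NA whenever its input contains an NA. The sketch below uses hypothetical toy data (not the lupin data) to show that behaviour and the effect of na.rm = TRUE. Whether dropping missing plots is appropriate depends on why they are missing, so treat this as an illustration rather than a recommendation.

```r
library(dplyr)

# Hypothetical toy data (an assumption for illustration, not the lupin data):
# genotype "A" has one missing yield, genotype "B" has none.
toy <- data.frame(
  gen   = c("A", "A", "B", "B"),
  loc   = c("X", "X", "X", "X"),
  yield = c(1.0, NA, 0.8, 1.2)
)

# Default behaviour: a single NA in a group makes that group's mean NA.
with_na <- toy |>
  summarise(yield = mean(yield), .by = c(gen, loc))

# na.rm = TRUE averages only the observed values in each group.
without_na <- toy |>
  summarise(yield = mean(yield, na.rm = TRUE), .by = c(gen, loc))

print(with_na)
print(without_na)
```

In the pipeline above, the same change would be `mean(yield, na.rm = TRUE)` inside `summarise()`.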

Data wrangling syntax comparison

The performance is similar between the two approaches
The syntax above doesn’t seem that different either?
But to use dplyr, you need a tiny bit more effort (install and load the package)
Yet a considerable number of people use dplyr for data wrangling in R
Why?

Tidy design principles

“[Tidyverse’s] primary goal is to facilitate the conversation that a human has with a dataset, and we want to help dig a ‘pit of success’ where the least-effort path trends towards a positive outcome. The primary tool to dig the pit is API design: by carefully considering the external interface to a function, we can help guide the user towards success.”

Wickham (in his work-in-progress book) states that the tidyverse has four guiding principles:
- Human-centered
- Consistent
- Composable
- Inclusive

https://design.tidyverse.org
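
One possible reading of "consistent" and "composable" in terms of function interfaces can be sketched in base R. This is an illustration under stated assumptions, not a definition from the design guide: the helpers keep_year() and scale_yield() and the toy data are hypothetical. Each helper takes a data frame as its first argument and returns a data frame, so the steps chain with the native pipe in any order that makes sense.

```r
# Hypothetical helpers (not from any package). Both follow the same
# convention: data frame in as the first argument, data frame out.
keep_year <- function(data, yr) {
  data[data$year == yr, , drop = FALSE]
}

scale_yield <- function(data, multiplier) {
  data$yield <- multiplier * data$yield
  data
}

# Hypothetical toy data for illustration only.
toy <- data.frame(year = c(91, 91, 92), yield = c(0.5, 1.5, 1.0))

# Because the interfaces agree, the steps compose through the pipe.
result <- toy |>
  keep_year(91) |>
  scale_yield(10)

print(result)
```

The shared convention is what makes the pipeline readable: a user who has learned one helper can guess how the next one behaves.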

Examining the Interface Design of Tidyverse

Tanaka (2025) Australian & New Zealand Journal of Statistics (to appear)

“While Tidyverse has been lauded for adopting a user-centered design, arguably some elements of the design focus on the work domain instead of the end-user.”

https://arxiv.org/abs/2510.10382

Interface design approaches

Ecological interface design (EID)

EID has been successfully applied in a broad range of sociotechnical systems (e.g. power distribution, transportation, military, medicine, and network management) for over 30 years¹
The central idea of EID is to organise and make visible the system constraints and relationships within the work domain to users, effectively making the invisible visible²

SRK Behaviour Taxonomy

Skill-based behaviour

Automatic actions performed with little conscious thought

Rule-based behaviour

Actions guided by stored rules or procedures

Knowledge-based behaviour

Actions that require conscious problem solving and decision making

Take-aways

Why not “optimise the analysts” by developing “cognitive ergonomic” tools?

Tanaka (2025) Examining the Interface Design of Tidyverse. ANZJS (to appear) https://arxiv.org/abs/2510.10382

“We recommend that developers adopt an iterative design that is informed by user feedback, analysis and complete coverage of the work domain, and ensure perceptual visibility of system constraints and relationships.”

Get in touch with Jules, Fonti or me to contribute to “Sowing seeds for continuous development of a software suite and community of practice in crop analytics” 🌱
✉️ emi.tanaka@anu.edu.au 🌐 anu-aagi.github.io

These slides are available at emitanaka.org/slides/AAGI2025/ and made reproducibly using Quarto reveal.js