AI-assisted Data Cleaning Workflow
1 Introduction
1.1 Welcome
This site documents a reproducible, AI-assisted workflow for public health data analysis with R.
1.1.1 What this site covers
- AI-assisted code generation
- Data cleaning and validation, variable labeling and reference categories
- Exporting data for R, SPSS, and Stata
⚠️ This is a template document.
1.2 Setting Up
1.2.1 Install R
R is the core statistical computing environment used for data cleaning and analysis.
Visit the Comprehensive R Archive Network (CRAN): https://cran.r-project.org
Download and install R for your operating system:
- Windows
- macOS
- Linux
Accept default installation settings.
After installation, verify the install by opening R and running:
R.version.string
1.2.2 Install RStudio
RStudio is an integrated development environment (IDE) that makes working with R scripts, projects, and Quarto documents easier.
Download RStudio Desktop (Free version): https://posit.co/download/rstudio-desktop/
Install RStudio after R is installed.
Open RStudio and confirm it detects your R installation.
1.2.3 Install Required R Packages
Run the following once inside RStudio:
install.packages(c(
"dplyr",
"janitor",
"skimr",
"jsonlite",
"ellmer",
"haven",
"labelled",
"rio",
"ollamar"
))
1.2.4 Installing a Local LLM (Offline and Hospital-Safe)
For environments where cloud-based AI tools are restricted (for example hospitals and government institutions), a local large language model (LLM) can be used. This workflow supports Ollama, which runs entirely offline.
1.2.5 Install Ollama
Download Ollama from: https://ollama.com
Install Ollama for your operating system (Windows/macOS/Linux).
After installation, verify Ollama is running by opening a terminal and typing:
ollama --version
1.2.6 Install the Gemma Model
Gemma is a lightweight, open-source model suitable for structured code generation tasks.
In your system terminal, run:
ollama run codegemma
(You may also use larger variants if system resources permit.)
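If you prefer to download the model without opening an interactive chat session, `ollama pull` fetches the weights and `ollama list` confirms they are available locally (these commands assume Ollama is installed and on your PATH):

```shell
# Fetch the model weights without starting an interactive chat
ollama pull codegemma

# List locally available models to confirm the download
ollama list
```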
1.2.7 Verify Ollama + Gemma in R
Once Ollama is running in the background, test the connection from R:
library(ellmer)
model <- chat_ollama(
model = "codegemma"
)
model$chat("Respond with: Ollama is working")
If successful, the model will return a short confirmation message.
1.3 Licence
© EpiPulse
This work is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence.
You are free to share and adapt this material with appropriate attribution.
2 R Code With AI
2.1 AI Data Cleaning
2.1.1 Statement to add in methodology
“AI-assisted code generation was used for data cleaning, with all scripts reviewed and executed by the investigators.”
Always de-identify data before sending anything to AI.
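One way to de-identify before building the summary is to drop direct identifiers and substitute a neutral study ID (a minimal sketch; the columns `name` and `phone` are illustrative, not from the workflow's dataset):

```r
library(dplyr)

# Illustrative data frame containing direct identifiers
df_raw <- data.frame(
  name  = c("A. Khan", "B. Roy"),
  phone = c("555-0101", "555-0102"),
  age   = c(34, 51),
  stringsAsFactors = FALSE
)

# Drop direct identifiers and add a neutral study ID as the first column
df_deid <- df_raw |>
  select(-name, -phone) |>
  mutate(study_id = sprintf("ID%03d", row_number()), .before = 1)

names(df_deid)
```

Only the de-identified object (here `df_deid`) should ever feed the summary that is sent to the model.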
2.1.2 Load data and build a summary for AI
library(dplyr)
library(janitor)
library(skimr)
library(jsonlite)
df <- read.csv("data.csv", stringsAsFactors = FALSE)
df <- clean_names(df)
data_summary <- list(
structure = capture.output(str(df)),
missing_pct = round(colMeans(is.na(df)) * 100, 2),
unique_character_values = df |>
select(where(is.character)) |>
summarise(across(everything(), ~ paste(unique(.), collapse = ", "))),
numeric_ranges = df |>
select(where(is.numeric)) |>
summarise(across(everything(), ~ paste(range(., na.rm = TRUE), collapse = " - ")))
)
summary_text <- toJSON(data_summary, pretty = TRUE)
2.2 Setup model and API key
2.2.1 OpenAI
Sys.setenv(OPENAI_API_KEY = "paste_api_key")
library(ellmer)
ai_generate_cleaning_code <- function(summary_text, model) {
prompt <- paste0(
"You are a public health data analyst.\n\n",
"INSTRUCTIONS:\n",
"- Generate ONLY executable R code\n",
"- No explanations outside code\n",
"- Do NOT load libraries\n",
"- Do NOT delete rows\n",
"- Do NOT impute values\n",
"- Do NOT change outcome definitions\n",
"- Include comments in the R code\n\n",
"TASKS:\n",
"- Trim whitespace in character variables\n",
"- Standardize categorical labels\n",
"- Convert variables to appropriate types\n",
"- Flag impossible numeric values\n",
"- Create missingness flags (not imputation)\n\n",
"DATA SUMMARY:\n",
summary_text
)
# ellmer's $chat() returns the reply text directly
response <- model$chat(prompt)
response
}
model <- chat_openai(
  model = "gpt-4.1-mini",
  params = list(temperature = 0.1)
)
cleaning_code <- ai_generate_cleaning_code(summary_text, model)
cat(cleaning_code)
2.2.2 Local LLM (offline, hospital-safe)
No API key required for local LLM.
library(ollamar)
library(jsonlite)
ai_generate_cleaning_code <- function(summary_text,
model_name = "codegemma") {
messages <- list(
list(
role = "user",
content = paste0(
"INSTRUCTIONS:\n",
"- Generate ONLY executable R code\n",
"- No explanations outside code\n",
"- Do NOT load libraries\n",
"- Do NOT delete rows\n",
"- Do NOT impute values\n",
"- Do NOT change outcome definitions\n",
"- Include comments in the R code\n",
"TASKS:\n",
"- Trim whitespace in character variables\n",
"- Standardize categorical labels\n",
"- Convert variables to appropriate types\n",
"- Flag impossible numeric values\n",
"- Create missingness flags (not imputation)\n",
"DATA SUMMARY:\n",
summary_text
)
)
)
resp <- ollamar::chat(
model = model_name,
messages = messages
)
json <- jsonlite::fromJSON(rawToChar(resp$body))
code <- json$message$content
if (is.null(code) || code == "") {
stop("Model returned empty content")
}
code <- gsub("^```[rR]?\\s*|\\s*```\\s*$", "", code)
code
}
cleaning_code <- ai_generate_cleaning_code(summary_text)
cat(cleaning_code)
2.2.3 Execute after review
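Rather than evaluating the returned string directly, a safer habit is to write it to a script, review it line by line, and only then run it (a sketch; it assumes `cleaning_code` from the previous step, and the filename is arbitrary):

```r
# Save the AI-generated code to a script file for line-by-line review
writeLines(cleaning_code, "ai_cleaning_code.R")

# Open ai_cleaning_code.R, check every step, then run it once approved:
# source("ai_cleaning_code.R")
```

This also leaves an auditable copy of exactly what was executed alongside the analysis.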
eval(parse(text = cleaning_code))
2.3 AI Data Typing + Labels + Reference Categories
Mention this in methods (recommended):
“AI-assisted code generation was used to standardize variable types, labels, and reference categories. All code was reviewed and executed by the investigators.”
2.3.1 Load function
library(ollamar)
library(jsonlite)
ai_generate_typing_and_labels_code <- function(summary_text,
model_name = "codegemma") {
prompt <- paste0(
"You are a senior public health biostatistician and R instructor.
TASK:
Generate ONLY executable R code (no explanations outside code).
AUDIENCE:
Public health beginners (MBBS / MPH / MD Community Medicine).
REQUIREMENTS:
1. Ensure correct data types:
- numeric -> as.numeric
- categorical -> factor
2. Add VARIABLE LABELS (column descriptions)
3. Add VALUE LABELS for categorical variables
4. Explicitly set REFERENCE CATEGORIES using relevel()
5. Comment clearly why each reference category is chosen
6. Do NOT delete rows
7. Do NOT impute missing values
8. Do NOT load libraries
9. Assume dataframe name is `df`
REFERENCE CATEGORY RULES:
- Sex -> reference = \"Male\"
- Binary exposure -> reference = \"No\" / \"Unexposed\"
- Education -> lowest level
- Socioeconomic status -> lowest class
- Outcome variables -> DO NOT relevel
OUTPUT:
Return ONLY R code with clear comments.
DATA SUMMARY:
", summary_text
)
messages <- list(
list(role = "user", content = prompt)
)
resp <- ollamar::chat(
model = model_name,
messages = messages
)
json <- jsonlite::fromJSON(rawToChar(resp$body))
code <- json$message$content
if (is.null(code) || code == "") {
stop("Model returned empty content")
}
code <- gsub("^```[rR]?\\s*|\\s*```\\s*$", "", code)
code
}
2.3.2 Generate the code
typing_label_code <- ai_generate_typing_and_labels_code(summary_text)
cat(typing_label_code)
2.3.3 Execute code after review
eval(parse(text = typing_label_code))
3 Sample Code Output
3.1 Data inspection
3.1.1 Load libraries
library(dplyr)
library(janitor)
library(skimr)
3.1.2 Load data
df <- read.csv("data.csv", stringsAsFactors = FALSE)
3.1.3 Inspect
dim(df); names(df); str(df); summary(df)
3.1.4 Clean names
df <- clean_names(df)
3.1.5 Explore
skim(df)
3.1.6 Unique values
df |> select(where(is.character)) |> lapply(unique)
3.1.7 Missingness
colMeans(is.na(df)) * 100
3.1.8 Validate
table(df$sex)
range(df$age, na.rm = TRUE)
3.2 Data typing + labels + reference categories (sample)
# Ensure age is numeric
df <- df |>
mutate(age = as.numeric(age))
# Convert sex to factor and set reference category
# Reference = Male (standard epidemiologic practice)
df <- df |>
mutate(
sex = factor(sex, levels = c("Male", "Female")),
sex = relevel(sex, ref = "Male")
)
# Convert smoking status to factor
# Reference = No (unexposed group)
df <- df |>
mutate(
smoking = factor(smoking, levels = c("No", "Yes")),
smoking = relevel(smoking, ref = "No")
)
# Education level (reference = lowest education)
df <- df |>
mutate(
education = factor(
education,
levels = c("No formal education", "Primary", "Secondary", "Higher")
),
education = relevel(education, ref = "No formal education")
)
# Add variable labels (for tables and reports)
attr(df$age, "label") <- "Age in completed years"
attr(df$sex, "label") <- "Sex of the participant"
attr(df$smoking, "label") <- "Current smoking status"
attr(df$education, "label") <- "Highest educational attainment"
4 Download Final Data
“The final cleaned and labelled dataset was exported in R (.rds)/SPSS (.sav)/Stata (.dta) format to ensure interoperability across analytical platforms.”
4.0.0.1 Load library
# install.packages(c("haven", "labelled", "rio"))
library(haven)
library(labelled)
library(rio)
4.0.0.2 R format (best for reproducibility)
saveRDS(df, "final_dataset.rds")
# to load use:
# df <- readRDS("final_dataset.rds")
4.0.0.3 SPSS format (.sav)
write_sav(df, "final_dataset.sav")
4.0.0.4 Stata format (.dta)
write_dta(df, "final_dataset.dta")
4.0.0.5 CSV format (.csv)
Warning: variable labels and reference categories will be lost.
write.csv(df, "final_dataset.csv", row.names = FALSE)
# Best practice: export a data dictionary so labels are retained
data_dictionary <- tibble::tibble(
  variable = names(df),
  # Return NA (not NULL) for unlabelled columns so the column stays character
  label = vapply(
    df,
    function(x) {
      lbl <- attr(x, "label", exact = TRUE)
      if (is.null(lbl)) NA_character_ else lbl
    },
    character(1)
  ),
  # Collapse multi-class vectors (e.g. haven-labelled) into one string
  class = vapply(df, function(x) paste(class(x), collapse = ", "), character(1))
)
write.csv(data_dictionary, "data_dictionary.csv", row.names = FALSE)
4.0.0.6 Export all formats
export_all_formats <- function(df, name = "final_dataset") {
saveRDS(df, paste0(name, ".rds"))
write_sav(df, paste0(name, ".sav"))
write_dta(df, paste0(name, ".dta"))
write.csv(df, paste0(name, ".csv"), row.names = FALSE)
}
export_all_formats(df)
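After running the export, a quick existence check confirms that all four files were written (a small sketch using base R only; the filenames match the defaults above):

```r
# Confirm each export landed on disk; returns a named logical vector
out_files <- paste0("final_dataset", c(".rds", ".sav", ".dta", ".csv"))
setNames(file.exists(out_files), out_files)
```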