AI-assisted Data Cleaning Workflow

1 Introduction

1.1 Welcome

This site documents a reproducible, AI-assisted workflow for public health data analysis with R.

1.1.1 What this site covers

  1. AI-assisted code generation
  2. Data cleaning and validation, variable labeling and reference categories
  3. Exporting data for R, SPSS, and Stata

⚠️ This is a template document.

1.2 Setting Up

1.2.1 Install R

R is the core statistical computing environment used for data cleaning and analysis.

Visit the Comprehensive R Archive Network (CRAN): https://cran.r-project.org

Download and install R for your operating system:

  • Windows
  • macOS
  • Linux

Accept default installation settings.

After installation, verify R by opening it and running:

version

1.2.2 Install RStudio

RStudio is an integrated development environment (IDE) that makes working with R scripts, projects, and Quarto documents easier.

Download RStudio Desktop (Free version): https://posit.co/download/rstudio-desktop/

Install RStudio after R is installed.

Open RStudio and confirm it detects your R installation.

1.2.3 Install Required R Packages

Run the following once inside RStudio:

install.packages(c(
  "dplyr",
  "janitor",
  "skimr",
  "jsonlite",
  "ellmer",
  "haven",
  "labelled",
  "rio",
  "ollamar"
))

1.2.4 Installing a Local LLM (Offline and Hospital-Safe)

For environments where cloud-based AI tools are restricted (for example hospitals and government institutions), a local large language model (LLM) can be used. This workflow supports Ollama, which runs entirely offline.

1.2.5 Install Ollama

Download Ollama from: https://ollama.com

Install Ollama for your operating system (Windows/macOS/Linux).

After installation, verify Ollama is running by opening a terminal and typing:

ollama --version

1.2.6 Install the Gemma Model

Gemma is a lightweight, open-source model suitable for structured code generation tasks.

In your system terminal, run:

ollama run codegemma

(You may also use larger variants if system resources permit.)

1.2.7 Verify Ollama + Gemma in R

Once Ollama is running in the background, test the connection from R:

library(ellmer)

model <- chat_ollama(
  model = "codegemma"
)

chat(model, "Respond with: Ollama is working")

If successful, the model will return a short confirmation message.


1.3 Licence

© EpiPulse

This work is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence.

You are free to share and adapt this material with appropriate attribution.

CC BY 4.0

2 R Code With AI

2.1 AI Data Cleaning

2.1.1 Statement to add in methodology

“AI-assisted code generation was used for data cleaning, with all scripts reviewed and executed by the investigators.”

Always de-identify data before sending anything to AI.

2.1.2 Load and summary for AI

library(dplyr)
library(janitor)
library(skimr)
library(jsonlite)

df <- read.csv("data.csv", stringsAsFactors = FALSE)
df <- clean_names(df)

data_summary <- list(
  structure = str(df),
  missing_pct = round(colMeans(is.na(df)) * 100, 2),
  unique_character_values = df |>
    select(where(is.character)) |>
    summarise(across(everything(), ~ paste(unique(.), collapse = ", "))),
  numeric_ranges = df |>
    select(where(is.numeric)) |>
    summarise(across(everything(), ~ paste(range(., na.rm = TRUE), collapse = " - ")))
)

summary_text <- toJSON(data_summary, pretty = TRUE)

2.2 Setup model and API key

2.2.1 OpenAI

Sys.setenv(OPENAI_API_KEY = "paste_api_key")

library(ellmer)

ai_generate_cleaning_code <- function(summary_text, model) {
  prompt <- paste0(
    "You are a public health data analyst.\n\n",
    "INSTRUCTIONS:\n",
    "- Generate ONLY executable R code\n",
    "- No explanations outside code\n",
    "- Do NOT load libraries\n",
    "- Do NOT delete rows\n",
    "- Do NOT impute values\n",
    "- Do NOT change outcome definitions\n",
    "- Include comments in the R code\n\n",
    "TASKS:\n",
    "- Trim whitespace in character variables\n",
    "- Standardize categorical labels\n",
    "- Convert variables to appropriate types\n",
    "- Flag impossible numeric values\n",
    "- Create missingness flags (not imputation)\n\n",
    "DATA SUMMARY:\n",
    summary_text
  )

  response <- model$chat(prompt)
  response$content
}

prompt <- paste("Use the specified cleaning instructions and return R code only.")

response <- chat_openai(
  model = "gpt-4.1-mini",
  params = list(temperature = 0.1)
)$chat(prompt)

cleaning_code <- response$content
cat(cleaning_code)

2.2.2 Local LLM (offline, hospital-safe)

No API key required for local LLM.

library(ollamar)
library(jsonlite)

ai_generate_cleaning_code <- function(summary_text,
                                      model_name = "codegemma") {

  messages <- list(
    list(
      role = "user",
      content = paste0(
        "INSTRUCTIONS:\n",
        "- Generate ONLY executable R code\n",
        "- No explanations outside code\n",
        "- Do NOT load libraries\n",
        "- Do NOT delete rows\n",
        "- Do NOT impute values\n",
        "- Do NOT change outcome definitions\n",
        "- Include comments in the R code\n",
        "TASKS:\n",
        "- Trim whitespace in character variables\n",
        "- Standardize categorical labels\n",
        "- Convert variables to appropriate types\n",
        "- Flag impossible numeric values\n",
        "- Create missingness flags (not imputation)\n",
        "DATA SUMMARY:\n",
        summary_text
      )
    )
  )

  resp <- ollamar::chat(
    model = model_name,
    messages = messages
  )

  json <- jsonlite::fromJSON(rawToChar(resp$body))
  code <- json$message$content

  if (is.null(code) || code == "") {
    stop("Model returned empty content")
  }

  code <- gsub("^```[rR]?\\s*|```$", "", code)
  code
}

cleaning_code <- ai_generate_cleaning_code(summary_text)
cat(cleaning_code)

2.2.3 Execute after review

eval(parse(text = cleaning_code))

2.3 AI Data Typing + Labels + Reference Categories

Mention this in methods (recommended):

“AI-assisted code generation was used to standardize variable types, labels, and reference categories. All code was reviewed and executed by the investigators.”

2.3.1 Load function

library(ollamar)
library(jsonlite)

ai_generate_typing_and_labels_code <- function(summary_text,
                                               model_name = "codegemma") {

  prompt <- paste0(
"You are a senior public health biostatistician and R instructor.

TASK:
Generate ONLY executable R code (no explanations outside code).

AUDIENCE:
Public health beginners (MBBS / MPH / MD Community Medicine).

REQUIREMENTS:
1. Ensure correct data types:
   - numeric -> as.numeric
   - categorical -> factor
2. Add VARIABLE LABELS (column descriptions)
3. Add VALUE LABELS for categorical variables
4. Explicitly set REFERENCE CATEGORIES using relevel()
5. Comment clearly why each reference category is chosen
6. Do NOT delete rows
7. Do NOT impute missing values
8. Do NOT load libraries
9. Assume dataframe name is `df`

REFERENCE CATEGORY RULES:
- Sex -> reference = \"Male\"
- Binary exposure -> reference = \"No\" / \"Unexposed\"
- Education -> lowest level
- Socioeconomic status -> lowest class
- Outcome variables -> DO NOT relevel

OUTPUT:
Return ONLY R code with clear comments.

DATA SUMMARY:
", summary_text
  )

  messages <- list(
    list(role = "user", content = prompt)
  )

  resp <- ollamar::chat(
    model = model_name,
    messages = messages
  )

  json <- jsonlite::fromJSON(rawToChar(resp$body))
  code <- json$message$content

  if (is.null(code) || code == "") {
    stop("Model returned empty content")
  }

  code <- gsub("^```[rR]?\\s*|```$", "", code)
  code
}

2.3.2 Generate the code

typing_label_code <- ai_generate_typing_and_labels_code(summary_text)
cat(typing_label_code)

2.3.3 Execute code after review

eval(parse(text = typing_label_code))

3 Sample Code Output

3.1 Data inspection

3.1.1 Load libraries

library(dplyr)
library(janitor)
library(skimr)

3.1.2 Load data

df <- read.csv("data.csv", stringsAsFactors = FALSE)

3.1.3 Inspect

dim(df); names(df); str(df); summary(df)

3.1.4 Clean names

df <- clean_names(df)

3.1.5 Explore

skim(df)

3.1.6 Unique values

df |> select(where(is.character)) |> lapply(unique)

3.1.7 Missingness

colMeans(is.na(df)) * 100

3.1.8 Validate

table(df$sex)
range(df$age, na.rm = TRUE)

3.2 Data typing + labels + reference categories (sample)

# Ensure age is numeric
df <- df |>
  mutate(age = as.numeric(age))

# Convert sex to factor and set reference category
# Reference = Male (standard epidemiologic practice)
df <- df |>
  mutate(
    sex = factor(sex, levels = c("Male", "Female")),
    sex = relevel(sex, ref = "Male")
  )

# Convert smoking status to factor
# Reference = No (unexposed group)
df <- df |>
  mutate(
    smoking = factor(smoking, levels = c("No", "Yes")),
    smoking = relevel(smoking, ref = "No")
  )

# Education level (reference = lowest education)
df <- df |>
  mutate(
    education = factor(
      education,
      levels = c("No formal education", "Primary", "Secondary", "Higher")
    ),
    education = relevel(education, ref = "No formal education")
  )

# Add variable labels (for tables and reports)
attr(df$age, "label") <- "Age in completed years"
attr(df$sex, "label") <- "Sex of the participant"
attr(df$smoking, "label") <- "Current smoking status"
attr(df$education, "label") <- "Highest educational attainment"

4 Download Final Data

“The final cleaned and labelled dataset was exported in R (.rds)/SPSS (.sav)/Stata (.dta) format to ensure operability across analytical platforms.”

4.0.0.1 Load library

# install.packages(c("haven", "labelled", "rio"))
library(haven)
library(labelled)
library(rio)

4.0.0.2 R format (best for reproducibility)

save(df, file = "final_dataset.RData")
# to load use:
# load("final_dataset.RData")

4.0.0.3 SPSS format (.sav)

write_sav(df, "final_dataset.sav")

4.0.0.4 Stata format (.dta)

write_dta(df, "final_dataset.dta")

4.0.0.5 CSV format (.csv)

Warning: variable labels and reference categories will be lost.

write.csv(df, "final_dataset.csv", row.names = FALSE)

# Best practice: export a data dictionary so labels are retained
data_dictionary <- tibble::tibble(
  variable = names(df),
  label = sapply(df, function(x) attr(x, "label")),
  class = sapply(df, class)
)

write.csv(data_dictionary, "data_dictionary.csv", row.names = FALSE)

4.0.0.6 Export all formats

export_all_formats <- function(df, name = "final_dataset") {
  saveRDS(df, paste0(name, ".rds"))
  write_sav(df, paste0(name, ".sav"))
  write_dta(df, paste0(name, ".dta"))
  write.csv(df, paste0(name, ".csv"), row.names = FALSE)
}

export_all_formats(df)

5 Descriptive Data Analysis