Text-only training of Gemma 4 (Gemma4ForConditionalGeneration) requires three separate workarounds: (1) pass mm_token_type_ids=torch.zeros_like(input_ids) to forward(), because the multimodal forward signature requires this kwarg even for pure text; (2) the 'tokenizer' returned by from_pretrained is actually a Gemma4Processor whose first positional argument is 'images', not 'text', so extract the inner .tokenizer for text-only paths; (3) pop 'gemma4' from MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES, or DPOTrainer and other HF trainers will treat it as a vision model and expect an 'images' column.
Apply all three patches before training:
# 1. Extract text tokenizer from multimodal processor
# Fall back to the object itself if it is already a plain tokenizer
text_tokenizer = getattr(tokenizer, 'tokenizer', tokenizer)
# 2. Remove from vision model mapping
from transformers.models.auto.modeling_auto import MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES.pop('gemma4', None)
# 3. Patch forward for mm_token_type_ids
import torch
_original_forward = model.forward
def _patched_forward(*args, mm_token_type_ids=None, **kwargs):
    if mm_token_type_ids is None:
        # Explicit None checks: `or` on a multi-element tensor raises an error
        input_ids = kwargs.get('input_ids')
        if input_ids is None and args:
            input_ids = args[0]
        if input_ids is not None:
            mm_token_type_ids = torch.zeros_like(input_ids)
    return _original_forward(*args, mm_token_type_ids=mm_token_type_ids, **kwargs)
model.forward = _patched_forward

All three are needed. Omitting any one causes a different failure: (1) a TypeError in forward(), (2) the processor treats its first positional argument as images rather than text, (3) the trainer expects an 'images' dataset column.
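
The forward wrapper can be checked without loading the real model. The sketch below is a rough illustration, not part of the original patch: FakeModel, list_zeros_like, and make_patched_forward are hypothetical names, and plain lists stand in for tensors so the example runs without torch. With a real model, torch.zeros_like would presumably be passed in place of list_zeros_like. It uses an explicit None check rather than `or`, since a multi-element tensor cannot be evaluated as a boolean.

```python
import functools


def make_patched_forward(original_forward, zeros_like):
    """Wrap a forward callable so mm_token_type_ids defaults to zeros.

    zeros_like is injected so the wrapper can be exercised without torch;
    with a real model it would be torch.zeros_like (assumption).
    """
    @functools.wraps(original_forward)
    def patched_forward(*args, mm_token_type_ids=None, **kwargs):
        if mm_token_type_ids is None:
            # Look for input_ids as a keyword first, then as the first positional arg.
            input_ids = kwargs.get("input_ids")
            if input_ids is None and args:
                input_ids = args[0]
            if input_ids is not None:
                mm_token_type_ids = zeros_like(input_ids)
        return original_forward(*args, mm_token_type_ids=mm_token_type_ids, **kwargs)

    return patched_forward


class FakeModel:
    """Hypothetical stand-in whose forward requires mm_token_type_ids."""

    def forward(self, input_ids=None, *, mm_token_type_ids):
        return {"input_ids": input_ids, "mm_token_type_ids": mm_token_type_ids}


def list_zeros_like(ids):
    # List-based stand-in for torch.zeros_like.
    return [0] * len(ids)


model = FakeModel()
model.forward = make_patched_forward(model.forward, list_zeros_like)

out_kw = model.forward(input_ids=[5, 6, 7])      # keyword input_ids
out_pos = model.forward([5, 6])                  # positional input_ids
out_explicit = model.forward(input_ids=[1], mm_token_type_ids=[1])  # caller-supplied value is kept
```

An unpatched FakeModel().forward(input_ids=[1]) raises TypeError, which mirrors failure (1) above; after patching, both the keyword and positional call styles receive an all-zero mm_token_type_ids.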