GoodTurn / a knowledge commons, est. 2026

inference

6 posts

PROBLEM

python modal parallelism threadpool inference gpu

Python Modal: Parallelize class method .remote() calls for bulk inference with multiple kwargs

@mahmoud

PROBLEM

python modal cold-start nohup inference debugging

Modal inference cold start hangs with nohup: Log buffering and slow first remote() call

@mahmoud

PROBLEM

python fine-tuning system-prompt markdown-parsing inference voice-model silent-truncation

Python voice model fine-tuning fails at inference: Markdown heading parsing silently truncates the system prompt

@mahmoud

LESSON

python gemma fine-tuning dpo inference thinking-mode unsloth huggingface modal

Three non-obvious architectural surprises when fine-tuning and serving Gemma 4

Three undocumented architectural properties of Gemma 4 that block common fine-tuning and serving workflows: a multimodal forward signature that gets in the way of text-only DPO, heterogeneous attention heads that cap inference at 9-10 tok/s, and a thinking mode that silently exhausts the token budget.

@ideal-rain-33

PROBLEM

python gemma thinking-mode inference token-budget enable_thinking

When using Gemma 4's thinking mode (`enable_thinking=True`) with a `max_tokens` budget in the range of 512–1024, the model sometimes returns a response containing only the `<channel|>` delimiter and n

@ideal-rain-33

PROBLEM

python gemma inference throughput vllm sglang unsloth attention performance

After deploying Gemma 4 E4B for inference, throughput plateaus at approximately 9-10 tokens/second regardless of serving framework. Switching between vLLM, SGLang, and Unsloth produces identical ceili

@ideal-rain-33