Why do semantic embeddings fail to discriminate stylistic quality in stylometry with prompt-based text generation?

Semantic embeddings (e.g. all-MiniLM-L6-v2) fail to discriminate stylistic quality when all texts respond to the same prompt pool. Because the generated texts share topics, they cluster together in semantic space regardless of voice fidelity. MMD separation across model versions on semantic embeddings was 0.028, which is indistinguishable from noise. Writeprints stylometric features (14-dim: word length stats, vocabulary richness, sentence length distribution, punctuation rates) separated the same models by 0.382.
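For readers unfamiliar with the feature families named above, here is a minimal standard-library sketch of what such a feature vector might look like. It is an illustrative subset, not the exact 14-dim set used here. The function name `writeprints_features`, the tokenization regexes, the short/long word thresholds (3 and 7 characters), and the choice of punctuation marks are all assumptions.

```python
import re
import statistics
from collections import Counter

WORD_RE = re.compile(r"[A-Za-z']+")

def writeprints_features(text: str) -> dict[str, float]:
    """Illustrative subset of writeprints-style features (hypothetical helper)."""
    words = WORD_RE.findall(text.lower())
    # Naive sentence split on terminal punctuation; an assumption, not a real tokenizer.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    n_chars = max(len(text), 1)
    word_lengths = [len(w) for w in words] or [0]
    sent_lengths = [len(WORD_RE.findall(s)) for s in sentences] or [0]
    counts = Counter(words)
    hapax = sum(1 for c in counts.values() if c == 1)
    return {
        "avg_word_length": statistics.fmean(word_lengths),
        # Thresholds of <=3 and >=7 characters are assumed for illustration.
        "short_word_ratio": sum(n <= 3 for n in word_lengths) / n_words,
        "long_word_ratio": sum(n >= 7 for n in word_lengths) / n_words,
        "ttr": len(counts) / n_words,            # type-token ratio
        "hapax_ratio": hapax / n_words,          # words that occur exactly once
        "sentence_length_mean": statistics.fmean(sent_lengths),
        "sentence_length_stdev": statistics.pstdev(sent_lengths),
        "comma_rate": text.count(",") / n_chars,
        "semicolon_rate": text.count(";") / n_chars,
        "exclamation_rate": text.count("!") / n_chars,
    }

features = writeprints_features("The cat sat. The dog ran far away, quickly!")
```

None of these features depend on what the text is about, which is why they can still separate models when every text answers the same prompts.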

1 solution


✓ ACCEPTED

Use writeprints-style features (avg_word_length, short/long_word_ratio, ttr, hapax_ratio, sentence_length_stdev, punctuation rates, etc.) instead of, or blended with, semantic embeddings for style evaluation. Z-score normalize each feature against corpus statistics before computing distances, so that no single feature dominates because of its scale. For batch comparison, MMD with median-heuristic gamma on z-scored writeprints features gives more than 10x the model separation of semantic MMD (0.382 vs. 0.028). For per-sample scoring, compute the L2 distance to the top-5 nearest corpus texts in z-scored feature space and convert it to [0,1] via exp(-dist/scale). Blend with semantic similarity (0.3×semantic + 0.7×writeprints) to retain the topic coverage signal while making style the dominant discriminator.
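A rough sketch of the pipeline described above, using only the standard library. The assumptions are:

- All function names are hypothetical.
- The MMD is the biased squared estimate with an RBF kernel.
- The median heuristic is taken as gamma = 1 / median(squared pairwise distance), which is one common variant.
- The top-5 distances are averaged before the exponential.
- `scale` is a free parameter whose value is not given here.

```python
import math
import statistics

def zscore_fit(corpus):
    """Per-feature mean and std from corpus vectors; zero std falls back to 1.0."""
    cols = list(zip(*corpus))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return means, stds

def zscore_apply(rows, means, stds):
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def median_gamma(X, Y):
    """Median heuristic (assumed variant): gamma = 1 / median squared distance."""
    pooled = X + Y
    d2 = [sq_dist(pooled[i], pooled[j])
          for i in range(len(pooled)) for j in range(i + 1, len(pooled))]
    med = statistics.median(d2)
    return 1.0 / med if med > 0 else 1.0

def mmd2_rbf(X, Y, gamma=None):
    """Biased estimate of squared MMD between samples X and Y (RBF kernel)."""
    if gamma is None:
        gamma = median_gamma(X, Y)
    def mean_k(A, B):
        return statistics.fmean([math.exp(-gamma * sq_dist(a, b)) for a in A for b in B])
    return mean_k(X, X) + mean_k(Y, Y) - 2.0 * mean_k(X, Y)

def style_score(sample, corpus, k=5, scale=1.0):
    """exp(-d/scale), where d is the mean L2 distance to the k nearest corpus vectors."""
    nearest = sorted(math.sqrt(sq_dist(sample, c)) for c in corpus)[:k]
    return math.exp(-statistics.fmean(nearest) / scale)

def blended_score(semantic_sim, style, w_semantic=0.3, w_style=0.7):
    """Weighted blend; the 0.3 / 0.7 weights are the ones quoted above."""
    return w_semantic * semantic_sim + w_style * style
```

Inputs are plain lists of equal-length feature vectors, already z-scored with `zscore_apply` using statistics fitted on the reference corpus. Fitting on the corpus rather than on each batch keeps scores comparable across model versions.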

@mahmoud 2 months ago