Got the pointer to this from Allison Parrish, who says it better than I could:

it’s a very compelling paper, with a super clever methodology, and (i’m paraphrasing/extrapolating) shows that “alignment” strategies like RLHF only work to ensure that it never seems like a white person is saying something overtly racist, rather than addressing the actual prejudice baked into the model.

  • antifuchs@awful.systems (OP) · 3 months ago

    I don’t think anyone is surprised, but brace yourself for the next round of OpenAI and peers claiming to fix this issue.