• kromem@lemmy.world
    link
    fedilink
    English
    arrow-up
    0
    ·
    edit-2
    7 months ago

    Kind of. You can’t do it 100% because in theory an attacker controlling input and seeing output could reflect though intermediate layers, but if you add more intermediate steps to processing a prompt you can significantly cut down on the injection potential.

    For example, fine tuning a model to take unsanitized input and rewrite it into Esperanto without malicious instructions and then having another model translate back from Esperanto into English before feeding it into the actual model, and having a final pass that removes anything not appropriate.

    • redcalcium@lemmy.institute
      link
      fedilink
      arrow-up
      0
      ·
      7 months ago

      Won’t this cause subtle but serious issue? Kinda like how pomegranate translates to “granada” in Spanish, but when you translate “granada” back to English it translates to grenade?