zuzatm 2 hours ago

One notable difference from what one would expect from an LLM-RL paper is the use of test-time RL. I guess when you have a very strong verifier, you can specialize your network to solve only your problem. Curious if this can also be applied to natural language reasoning.