John Schulman (OpenAI Cofounder) — Reasoning, RLHF, & plan for 2027 AGI


Dwarkesh Podcast

2024-05-15

1 hour 35 minutes

Episode Summary

Chatted with John Schulman (cofounded OpenAI and led ChatGPT creation) on how post-training tames the shoggoth, and the nature of the progress to come.

Timestamps:

(00:00:00) - Pre-training, post-training, and future capabilities
(00:16:55) - Plan for AGI 2025
(00:29:18) - Teaching models to reason
(00:39:45) - The Road to ChatGPT
(00:51:07) - What makes for a good RL researcher?
(00:59:53) - Keeping humans in the loop
(01:14:11) - State of research, plateaus, and moats

Transcript

  • Today I have the pleasure to speak with John Schulman,

  • who is one of the co-founders of OpenAI and leads the post-training team there.

  • And he also led the creation of ChatGPT and is the author of many of the most important and widely cited papers in AI and RL,

  • including PPO and many others.

  • So John, really excited to chat with you.

  • Thanks for coming on the podcast.

  • Thanks for having me on the podcast.

  • I'm a big fan.

  • Oh, thank you.

  • Thank you for saying that.

  • So the first question I had is,

  • we have these distinctions between pre-training and post-training that go beyond what is actually happening in terms of loss functions and training regimes.

  • I'm just curious, taking a step back conceptually, like what kind of thing is pre-training creating?

  • What does post-training do on top of that?

  • In pre-training,

  • you're basically training to imitate all of the content on the internet or on the web,

  • including websites and code and so forth.

  • So you get a model that can basically generate content that looks like random web pages from the internet.

  • And the model is also trained to maximize likelihood where it has to put a probability on everything.

  • So the objective is basically predicting the next token given the previous tokens.
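The objective John describes can be sketched concretely. This is a minimal illustrative example (not OpenAI's actual training code): maximizing likelihood of the next token is equivalent to minimizing the average negative log-probability the model assigns to each token that actually comes next.

```python
import math

def next_token_loss(probs_per_step):
    """Average negative log-likelihood over a sequence.

    probs_per_step: for each position in the sequence, the probability
    the model assigned to the token that actually appeared next.
    Maximizing likelihood = minimizing this quantity.
    """
    return -sum(math.log(p) for p in probs_per_step) / len(probs_per_step)

# A model that puts high probability on the true next tokens gets a
# lower loss than one that spreads probability mass elsewhere.
confident = next_token_loss([0.9, 0.8, 0.95])
unsure = next_token_loss([0.2, 0.1, 0.3])
```

Because the model must "put a probability on everything," a hedged but accurate distribution over the whole vocabulary is rewarded, which is what drives the imitation of web-like content.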