ICML 2024 tutorial for project overview

Frequently Asked Questions

Question: Any collaboration possibilities?

Answer: At FAIR, we encourage collaborations with external researchers, especially students and professors from academia. However, due to company policy, we cannot share code or GPU resources. This means that unless you have your own GPU resources, we can only discuss ideas at a high level and cannot run experiments together. Currently, my only intern position is committed to Tian Ye, so I have no additional intern headcount. There are two exceptions:

If you're from Meta (any team) and are willing to work with us informally in your spare time (20+ hrs/week, with at least a 6-month commitment), then we can consider a collaboration. Zicheng Xu was one such amazing collaborator from Meta Ads, and we have written two papers together. Zicheng has gained 1,000,000 GPU hours of LLM pretraining experience through this collaboration.
[Deadline has passed] If you're an early-year UCB/NYU/CMU/UW PhD student, there may be a 2-year co-mentorship program with me (starting Sep 2025); the application deadline was Jan 10. I am one of the mentors and might have 1 headcount for this position.

Question: Any future timeline to share on this initiative?

Answer: The longer, in-depth videos of Parts 1, 2.1, 2.2, 3.1, and 3.2 are already available on our website and YouTube. We aim to release a similar in-depth video for Part 3.3 in the future. Regarding Part 4 and beyond, we are exploring several important directions.

Question: Any plans on code/data release?

Answer: We strongly believe in the importance of code and data sharing. However, as a small team with multiple priorities, we need to manage our time carefully. Obtaining legal approval at Meta takes time, and we have limited staffing: only 1 programmer for Parts 1/3, and 0.5 programmer for Part 2.

For Part 1, the data (CFG trees) are included in the PDF, and generating from the CFGs requires only simple code, which we are not providing due to time constraints.
For Part 3, we have detailed how to generate the data, which involves simple random generation of names and employers. We are not allowed to release the bioR data, as it is difficult to human-verify all Llama outputs for legal purposes. However, we plan to release the bioS data and the prompts used to generate the bioR data; this is currently pending both legal and ethical review.
For Part 2, we have released the code for generating the iGSM data. We have also provided all necessary pseudocode in the released PDF paper to help readers understand our data generation process.
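
For readers who want to reproduce the Part 1 data themselves, sampling from a CFG can be sketched in a few lines. The following is only an illustrative sketch and not our code: the grammar TOY_CFG, the function names, and the choice of picking rules uniformly at random are all assumptions made for this example. To use it, substitute the CFG trees given in the PDF and whatever rule-sampling scheme the paper specifies.

```python
import random

# Hypothetical toy grammar for illustration only; it is NOT one of the CFGs
# from the paper. Each nonterminal maps to a list of candidate right-hand
# sides. Any symbol that is not a key of the dict is treated as a terminal.
TOY_CFG = {
    "S": [["A", "B"], ["B", "A", "B"]],
    "A": [["1", "2"], ["3", "B"]],
    "B": [["2", "1"], ["3"]],
}


def sample(cfg, symbol, rng):
    """Expand `symbol` top-down into a list of terminal tokens.

    Assumption: at each nonterminal, one rule is chosen uniformly at random.
    """
    if symbol not in cfg:
        return [symbol]  # terminal symbol: emit as-is
    rule = rng.choice(cfg[symbol])
    tokens = []
    for child in rule:
        tokens.extend(sample(cfg, child, rng))
    return tokens


def generate(cfg, root="S", seed=None):
    """Return one sampled string, starting the expansion from `root`."""
    rng = random.Random(seed)
    return "".join(sample(cfg, root, rng))


if __name__ == "__main__":
    for i in range(3):
        print(generate(TOY_CFG, seed=i))
```

Passing a fixed seed makes each sample reproducible, which is convenient when regenerating the same dataset across runs.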
