Frequently Asked Questions
Question: Any collaboration possibilities?
Answer: At FAIR, we encourage collaborations with external researchers, especially students and professors from academia. However, due to company policy, we cannot share code or GPU resources. This means that unless you have your own GPU resources, we can only discuss ideas at a high level and cannot run experiments together. Currently, my only intern position is committed to Tian Ye, so there is no additional intern headcount. Two exceptions below:
If you're from Meta (any team) and are willing to work with us informally in your spare time (20+ hrs/week, with at least a 6-month commitment), then we can consider collaboration. Zicheng Xu was one such amazing collaborator from Meta Ads, and we have written two papers together. Zicheng has gained 1,000,000 GPU hours of LLM pretraining experience through this collaboration.
If you're an early-year UCB/NYU/CMU/UW PhD student, there may be a 2-year co-mentorship program with me (starting Sep 2025); the application deadline is Jan 10. I am one of the mentors and might have 1 headcount for this position.
Question: Any future timeline to share on this initiative?
Answer: The longer, in-depth videos of Parts 1, 2.1, 2.2, 3.1, and 3.2 are already available on our website and YouTube. We aim to release a similar in-depth video for Part 3.3 in the future. Regarding Part 4 and beyond, we are exploring several important directions.
Question: Any plans on code/data release?
Answer: We strongly believe in the importance of code and data sharing. However, as a small team with multiple priorities, we need to manage our time carefully. Obtaining legal approval takes time, and we have limited people: only 1 programmer for Parts 1 and 3, and 0.5 programmer for Part 2 (Tian Ye is at Meta only 6 months per year).
For Part 1, the data (CFG trees) are included in the PDF, and generating from the CFGs requires only simple code, which we are not providing due to time constraints.
For Part 3, we have detailed how to generate the data, which involves simple random generation of names and employers. We might not be able to release the bioR data as it is difficult to human-verify all Llama outputs. However, we will certainly release the bioS data and the prompts to generate the bioR data - this is pending both legal and ethical reviews right now.
For Part 2, we will release the code for generating the iGSM data. The code is ready (and refactored) and is now pending legal approval. In the meantime, we have provided all necessary pseudocode in the released PDF paper to help readers understand the data generation process.
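To give a sense of what "simple code" for generating from a CFG (Part 1) might look like, here is a minimal sketch. It is not our released code: the grammar `TOY_CFG` is a made-up toy example rather than one of the CFG trees from the PDF, the names `sample` and `generate` are hypothetical, and picking each rule uniformly at random is an assumption of this sketch.

```python
import random

# Hypothetical toy grammar, NOT one of the CFGs from the paper.
# Each nonterminal maps to a list of alternative right-hand sides;
# any symbol that is not a key of the dict is treated as a terminal.
TOY_CFG = {
    "S": [["A", "B"], ["B", "A", "B"]],
    "A": [["1", "2"], ["2", "B"]],
    "B": [["3"], ["1", "3"]],
}


def sample(cfg, symbol, rng):
    """Recursively expand `symbol` into a list of terminal symbols."""
    if symbol not in cfg:
        return [symbol]  # terminal: emit as-is
    # Assumption of this sketch: rules are chosen uniformly at random.
    rule = rng.choice(cfg[symbol])
    out = []
    for child in rule:
        out.extend(sample(cfg, child, rng))
    return out


def generate(cfg, root="S", seed=None):
    """Sample one string from the grammar, starting at `root`."""
    rng = random.Random(seed)
    return "".join(sample(cfg, root, rng))


if __name__ == "__main__":
    for i in range(3):
        print(generate(TOY_CFG, seed=i))
```

To use it with a grammar from the PDF, one would transcribe that grammar's rules into the same dictionary form and call `generate` with its root symbol. The toy grammar above has no recursive rules, so sampling always terminates.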