The Pile: An 800GB Dataset of Diverse Text for Language Modeling Paper • 2101.00027 • Published Dec 31, 2020 • 9