Maurice Weber
mauriceweber
Add paper citation
1
#30 opened about 1 year ago
by
davanstrien
RPV2 ccnet preprocessing
1
#29 opened over 1 year ago
by
bpwl0121
How can I download the sample-10B fastestly?
1
#28 opened over 1 year ago
by
zgxiao
defunct book subset
4
#28 opened about 2 years ago
by
polinaeterna
How much disk space would the whole HF dataset take?
1
#27 opened over 1 year ago
by
protossw512
rpv2-subsamples
1
#26 opened almost 2 years ago
by
mauriceweber
The doc_id in duplicates is should contain?
3
#24 opened almost 2 years ago
by
newbietuan
Deduplication steps
23
#15 opened about 2 years ago
by
ilyayudkovich
Here's a download script parallelized using Spark
1
#22 opened almost 2 years ago
by
srowen
what is the meaning of snapshots in redpajama-data-v2?
2
#21 opened almost 2 years ago
by
choidonghun
How to join documents and quality signals when downloading directly
3
#19 opened almost 2 years ago
by
tgshdyfuhuf
Missing duplicates parquet files
5
#18 opened about 2 years ago
by
bebensee
Script to download all files of 1B sample data locally
2
#13 opened about 2 years ago
by
ivanzhouyq
What is the total size, of the entirety of this dataset in TB?
1
#10 opened about 2 years ago
by
Bayaz
What's the concept on partitions
2
#5 opened about 2 years ago
by
SwatCat
quality_signals, minhash and duplicates missing
2
#3 opened about 2 years ago
by
sheshanshag
Request to add retries into RedPajama-Data-V2.py script
1
#16 opened about 2 years ago
by
yura38
How to obtain duplicates from minhash?
1
#8 opened about 2 years ago
by
cq
Obtaining Filtered Samples
4
#12 opened about 2 years ago
by
ssingh22
How big is the data size of en?
5
#6 opened about 2 years ago
by
newbietuan