Dataset enquiry
I'd like to ask: which dataset are you training on? How did you obtain it, and how do you train on it? I'm asking because I am doing a similar project as well.
I made some modifications to this dataset, adding new synthetic prompts (much shorter and more human-like):
redcathode/thingiverse-openscad
I trained on image-to-model pairs (using both renders and makes, but with a deduplication pass on image likeness).
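A minimal sketch of what "deduplication on likeness" can look like with perceptual hashes (using the `imagehash` and `Pillow` libraries; the directory, glob pattern, and distance threshold are assumptions for illustration, not the actual pipeline):

```python
from pathlib import Path
from PIL import Image
import imagehash

def dedupe_images(image_dir: str, max_distance: int = 5) -> list[Path]:
    """Keep one image per group of visually similar images."""
    kept_hashes: list[imagehash.ImageHash] = []
    kept_paths: list[Path] = []
    for path in sorted(Path(image_dir).glob("*.png")):
        h = imagehash.phash(Image.open(path))
        # Hamming distance between perceptual hashes approximates "likeness".
        if all(h - prev > max_distance for prev in kept_hashes):
            kept_hashes.append(h)
            kept_paths.append(path)
    return kept_paths
```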
Did you test the model?
I did not really get it to perform that well in my limited tests.
I used a Jupyter notebook close to the one provided by Unsloth for Qwen2-VL 4-bit training (roughly the setup sketched below).
(If you'd like, I'm open to trying out some of your ideas.)
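For reference, an Unsloth-style 4-bit (QLoRA) setup for Qwen2-VL looks roughly like this sketch; exact argument names can differ between Unsloth versions, and the model id and LoRA hyperparameters here are illustrative assumptions rather than the notebook's exact values:

```python
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2-VL-7B-Instruct",
    load_in_4bit=True,                     # 4-bit base weights (QLoRA)
    use_gradient_checkpointing="unsloth",  # saves VRAM on long sequences
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=False,    # language-only LoRA in this sketch
    finetune_language_layers=True,
    r=16,
    lora_alpha=16,
)
```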
Yeah, I tried to fine-tune on the same dataset as well, with Llama 3.2 3B in Unsloth 4-bit training. However, the results were not as expected either: if I give the fine-tuned Llama a long prompt, its response won't stop (it keeps generating repeated code). Right now I am investigating the cause of this behaviour, and I am considering whether full-parameter fine-tuning would work.
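One common cause of a fine-tune that never stops generating is that the training targets were never terminated with the EOS token, so the model never learns where an answer ends. A generic sketch of that fix (the dataset field names are placeholders, not the actual schema):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

def format_example(example: dict) -> dict:
    # Terminate every target with EOS so the loss teaches the model to stop
    # after the code instead of looping.
    text = example["prompt"] + "\n" + example["openscad_code"] + tokenizer.eos_token
    return {"text": text}
```

Checking that the chat template and the EOS/pad tokens match what the base model expects is usually worth ruling out before moving to full-parameter fine-tuning.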
Actually, I have just tested some other small-scale models as well, and I found that small models may not pick up enough of the trained knowledge. So right now I am still thinking about how to achieve this. :(
I did a full fine-tune of the Qwen 2.5 VL 3B model on my dataset of 1.7k samples. It seemed to grasp OpenSCAD syntax and my formatting, but it was way too dumb to actually do what I wanted.
Wow, that's impressive. I thought we would need a lot of data (like 30K samples) for the model to learn OpenSCAD syntax. Thanks for the information, it helps a lot. Right now I am trying to use recent RAG techniques to generate OpenSCAD code, and I am testing whether that will work. We can keep updating each other on our progress. Nice to meet you.
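The retrieval half of such a RAG setup can be sketched roughly like this: embed existing prompt-to-code pairs, pull the most similar ones for a new request, and prepend them as few-shot context. The embedding model and field names below are illustrative assumptions, not a description of the actual pipeline:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    {"prompt": "a simple gear with 12 teeth", "code": "/* ...openscad... */"},
    {"prompt": "a box with rounded corners", "code": "/* ...openscad... */"},
]
corpus_emb = embedder.encode([c["prompt"] for c in corpus], normalize_embeddings=True)

def build_rag_prompt(user_request: str, k: int = 2) -> str:
    query_emb = embedder.encode([user_request], normalize_embeddings=True)
    scores = corpus_emb @ query_emb.T  # cosine similarity (embeddings are normalized)
    top = np.argsort(-scores[:, 0])[:k]
    examples = "\n\n".join(
        f"// Request: {corpus[i]['prompt']}\n{corpus[i]['code']}" for i in top
    )
    return f"{examples}\n\n// Request: {user_request}\n// OpenSCAD code:"
```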
I want to ask you a question: how do you perform the modifications to the dataset? Manually, or in some other way?
I mostly use a set of Python scripts to manage the data. I do some static filtering on the base SCAD code (mostly stripping long or unnecessary comments and enforcing consistent line spacing and indents), plus synthetic data added on top of the existing dataset (prompting an LLM to generate user prompts, and some CoT chains for the code generation). In the end I just stitch it all together.
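A rough sketch of that kind of static cleanup step; the regexes, tab width, and blank-line handling are guesses for illustration, and this version strips all comments rather than only long or unnecessary ones:

```python
import re

def clean_scad(source: str) -> str:
    # Remove block comments and line comments.
    # (Naive: does not special-case "//" inside string literals.)
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", "", source)
    lines = []
    for line in source.splitlines():
        # Expand tabs and drop trailing whitespace for consistent indentation.
        lines.append(line.rstrip().expandtabs(4))
    # Collapse runs of blank lines to a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return cleaned.strip() + "\n"
```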
I also do some manual verification on samples of the dataset to make sure it is not entirely bad. That's how I realised I was getting a lot of refusals when generating the synthetic CoT chains (the model often said it wouldn't do it), so I started filtering out short CoT chains.
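The refusal filtering can be as simple as a length threshold plus a few marker phrases; the threshold and phrases below are assumptions for illustration:

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "as an ai")

def keep_cot(cot: str, min_chars: int = 200) -> bool:
    # Short chains were often refusals, so drop anything below a length floor
    # or containing an obvious refusal phrase.
    text = cot.lower()
    return len(cot) >= min_chars and not any(m in text for m in REFUSAL_MARKERS)

examples = [{"cot": "First, model the base plate as a cube, then ..."}]
filtered = [ex for ex in examples if keep_cot(ex["cot"])]
```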
I did the new fine-tune, and the 7B model with some CoT is way better than the 3B model was. I rarely get code with syntax issues, even if it still struggles a bit with 3D-space understanding. It probably needs more than 2k examples for a proper model. (I used r=16 and 2 epochs on a non-quantized LoRA fine-tune; a rough config sketch follows after the links below.)
model:
adrlau/qwen2.5-7B-vl-openscad-v1.5
dataset:
adrlau/openscad-vision2
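For anyone wanting to reproduce roughly that setup, the reported hyperparameters (LoRA r=16, 2 epochs, no quantization) map onto a PEFT/Transformers configuration along these lines; the target modules, alpha, batch size, and learning rate are typical values assumed for illustration, not settings taken from the model card:

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                        # reported LoRA rank
    lora_alpha=32,               # assumed; commonly 1-2x the rank
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical attention projections
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="qwen2.5-vl-openscad-lora",
    num_train_epochs=2,          # reported epoch count
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    bf16=True,                   # full-precision base weights, i.e. no 4-bit quantization
)
# These would be applied to the Qwen2.5-VL 7B base with peft.get_peft_model(...)
# and trained with a supervised trainer such as TRL's SFTTrainer.
```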
Oh, Thxxxxxx a lotttttt.