Dataset enquiry
I'd like to ask: which dataset are you training on? How did you obtain it, and how do you train on it? I'm asking because I am doing a similar project as well.
I made some modifications to this dataset, adding new synthetic prompts (much shorter and more human-like):
redcathode/thingiverse-openscad
I trained on image-to-model pairs (using both renders and makes, but with a deduplication pass on image likeness).
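A minimal sketch of what "deduplication on likeness" can look like with perceptual hashes (using the `imagehash` and `Pillow` libraries; the directory, glob pattern, and distance threshold are assumptions for illustration, not the actual pipeline):

```python
from pathlib import Path
from PIL import Image
import imagehash

def dedupe_images(image_dir: str, max_distance: int = 5) -> list[Path]:
    """Keep one image per group of visually similar images."""
    kept_hashes: list[imagehash.ImageHash] = []
    kept_paths: list[Path] = []
    for path in sorted(Path(image_dir).glob("*.png")):
        h = imagehash.phash(Image.open(path))
        # Hamming distance between perceptual hashes approximates "likeness".
        if all(h - prev > max_distance for prev in kept_hashes):
            kept_hashes.append(h)
            kept_paths.append(path)
    return kept_paths
```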
Did you test the model?
I did not really get it to perform that well in my limited tests.
I used a Jupyter notebook close to the one provided by Unsloth for Qwen2-VL 4-bit training (roughly the setup sketched below).
(If you'd like, I'm open to trying out some of your ideas.)
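For reference, an Unsloth-style 4-bit (QLoRA) setup for Qwen2-VL looks roughly like this sketch; exact argument names can differ between Unsloth versions, and the model id and LoRA hyperparameters here are illustrative assumptions rather than the notebook's exact values:

```python
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2-VL-7B-Instruct",
    load_in_4bit=True,                     # 4-bit base weights (QLoRA)
    use_gradient_checkpointing="unsloth",  # saves VRAM on long sequences
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=False,    # language-only LoRA in this sketch
    finetune_language_layers=True,
    r=16,
    lora_alpha=16,
)
```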
Yeah, I tried to fine-tune on the same dataset as well, with Llama 3.2 3B in Unsloth 4-bit training. However, the results were not as expected either: if I give the fine-tuned Llama a long prompt, its response won't stop (it keeps generating repeated code). Right now I am investigating the cause of this behaviour, and I am considering whether full-parameter fine-tuning would work.
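One common cause of a fine-tune that never stops generating is that the training targets were never terminated with the EOS token, so the model never learns where an answer ends. A generic sketch of that fix (the dataset field names are placeholders, not the actual schema):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

def format_example(example: dict) -> dict:
    # Terminate every target with EOS so the loss teaches the model to stop
    # after the code instead of looping.
    text = example["prompt"] + "\n" + example["openscad_code"] + tokenizer.eos_token
    return {"text": text}
```

Checking that the chat template and the EOS/pad tokens match what the base model expects is usually worth ruling out before moving to full-parameter fine-tuning.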
Actually, I have just tested some other small-scale models as well, and I found that small models may not pick up enough of the trained knowledge. So right now I am still thinking about how to achieve this. :(
I did a full fine-tune of the Qwen 2.5 VL 3B model on my dataset of 1.7k samples. It seemed to grasp OpenSCAD syntax and my formatting, but it was way too dumb to actually do what I wanted.
Wow, that's impressive. I thought we would need a lot of data (like 30K samples) for the model to learn OpenSCAD syntax. Thanks for the information, it helps a lot. Right now I am trying to use recent RAG techniques to generate OpenSCAD code, and I am testing whether that will work. We can keep updating each other on our progress. Nice to meet you.
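The retrieval half of such a RAG setup can be sketched roughly like this: embed existing prompt-to-code pairs, pull the most similar ones for a new request, and prepend them as few-shot context. The embedding model and field names below are illustrative assumptions, not a description of the actual pipeline:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    {"prompt": "a simple gear with 12 teeth", "code": "/* ...openscad... */"},
    {"prompt": "a box with rounded corners", "code": "/* ...openscad... */"},
]
corpus_emb = embedder.encode([c["prompt"] for c in corpus], normalize_embeddings=True)

def build_rag_prompt(user_request: str, k: int = 2) -> str:
    query_emb = embedder.encode([user_request], normalize_embeddings=True)
    scores = corpus_emb @ query_emb.T  # cosine similarity (embeddings are normalized)
    top = np.argsort(-scores[:, 0])[:k]
    examples = "\n\n".join(
        f"// Request: {corpus[i]['prompt']}\n{corpus[i]['code']}" for i in top
    )
    return f"{examples}\n\n// Request: {user_request}\n// OpenSCAD code:"
```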
I want to ask you a question: how do you perform the modifications to the dataset? Manually, or in some other way?
I mostly use a set of Python scripts to manage the data. I do some static filtering on the base SCAD code (mostly stripping long or unnecessary comments and enforcing consistent line spacing and indents), plus synthetic data added on top of the existing dataset (prompting an LLM to generate user prompts, and some CoT chains for the code generation). In the end I just stitch it all together.
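A rough sketch of that kind of static cleanup step; the regexes, tab width, and blank-line handling are guesses for illustration, and this version strips all comments rather than only long or unnecessary ones:

```python
import re

def clean_scad(source: str) -> str:
    # Remove block comments and line comments.
    # (Naive: does not special-case "//" inside string literals.)
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", "", source)
    lines = []
    for line in source.splitlines():
        # Expand tabs and drop trailing whitespace for consistent indentation.
        lines.append(line.rstrip().expandtabs(4))
    # Collapse runs of blank lines to a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return cleaned.strip() + "\n"
```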
I also do some manual verification on samples of the dataset to make sure it is not entirely bad. That's how I realised I was getting a lot of refusals when generating the synthetic CoT chains (the model often said it wouldn't do it), so I started filtering out short CoT chains.
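The refusal filtering can be as simple as a length threshold plus a few marker phrases; the threshold and phrases below are assumptions for illustration:

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "as an ai")

def keep_cot(cot: str, min_chars: int = 200) -> bool:
    # Short chains were often refusals, so drop anything below a length floor
    # or containing an obvious refusal phrase.
    text = cot.lower()
    return len(cot) >= min_chars and not any(m in text for m in REFUSAL_MARKERS)

examples = [{"cot": "First, model the base plate as a cube, then ..."}]
filtered = [ex for ex in examples if keep_cot(ex["cot"])]
```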
I did the new fine-tune, and the 7B model with some CoT is way better than the 3B model was. I rarely get code with syntax issues, even if it still struggles a bit with 3D-space understanding. It probably needs more than 2k examples for a proper model. (I used r=16 and 2 epochs on a non-quantized LoRA fine-tune; a rough config sketch follows after the links below.)
model:
adrlau/qwen2.5-7B-vl-openscad-v1.5
dataset:
adrlau/openscad-vision2
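For anyone wanting to reproduce roughly that setup, the reported hyperparameters (LoRA r=16, 2 epochs, no quantization) map onto a PEFT/Transformers configuration along these lines; the target modules, alpha, batch size, and learning rate are typical values assumed for illustration, not settings taken from the model card:

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                        # reported LoRA rank
    lora_alpha=32,               # assumed; commonly 1-2x the rank
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical attention projections
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="qwen2.5-vl-openscad-lora",
    num_train_epochs=2,          # reported epoch count
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    bf16=True,                   # full-precision base weights, i.e. no 4-bit quantization
)
# These would be applied to the Qwen2.5-VL 7B base with peft.get_peft_model(...)
# and trained with a supervised trainer such as TRL's SFTTrainer.
```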
Oh, Thxxxxxx a lotttttt.