preprocessing code and code2seq with astminer #1791
Replies: 3 comments
-
Hi Rudolf! code2seq / code2code sounds like a fascinating use case. Is your suggestion to add astminer as a new preprocessing step? Raw code strikes me as quite different from the existing feature types. After calling astminer to convert from code to JSON, we'd still need a conversion from the JSON format into raw tensors, using some tokenizer + vocabulary, perhaps similar to what text features already do. If you are interested in trying to prototype something quickly, I would recommend preprocessing your code snippets offline into the JSON format, and then using text features all the way through.
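If it helps, here's a minimal sketch of what I mean by "text features all the way through", using Ludwig's programmatic API. The column names and toy data are made up, and I haven't run this exact snippet:

```python
# Minimal sketch: treat preprocessed code (or its serialized JSON AST) as
# plain text on both the input and output side. Column names are made up.
import pandas as pd
from ludwig.api import LudwigModel

df = pd.DataFrame({
    "source_code": ["def add(a, b):\n    return a + b"],
    "target_code": ["def add(a: int, b: int) -> int:\n    return a + b"],
})

config = {
    "input_features": [{"name": "source_code", "type": "text"}],
    "output_features": [{"name": "target_code", "type": "text"}],
}

model = LudwigModel(config)
model.train(dataset=df)  # Ludwig builds the tokenizer + vocabulary for us
```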
-
Adding on to this discussion, I found the model CodeT5-small on Hugging Face: https://huggingface.co/Salesforce/codet5-small

What's interesting there is that they use the RobertaTokenizer for preparing the code tokens (see the snippet below). So I guess it would be possible to define a new encoder for Ludwig that uses the RobertaTokenizer? Though it looks like that encoder is already there: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/encoders/text_encoders.py#L937 🎉

And now, as I get to the end here, I've realized that with recent updates Ludwig supports the encoder out of the box: https://ludwig.ai/latest/configuration/features/text_features/#roberta-encoder
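For reference, the usage snippet on the model card looks roughly like this (quoted from memory, so double-check against the card itself):

```python
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-small")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-small")

text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# generate a completion for the masked span
generated_ids = model.generate(input_ids, max_length=8)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```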
-
Hi @rudolfolah, indeed roberta is supported. Please let us know if you run into any issues using it!
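Something like this should be a reasonable starting point (untested sketch; the nested encoder syntax follows the current docs, and the feature names are placeholders):

```python
from ludwig.api import LudwigModel

# Untested sketch: a text input feature backed by the built-in RoBERTa encoder.
# The pretrained checkpoint is shown explicitly here for clarity.
config = {
    "input_features": [
        {
            "name": "source_code",
            "type": "text",
            "encoder": {
                "type": "roberta",
                "pretrained_model_name_or_path": "roberta-base",
            },
        }
    ],
    "output_features": [{"name": "target_code", "type": "text"}],
}

model = LudwigModel(config)
```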
-
Hi! I'm trying to set up a small ML implementation that converts code snippets into other pieces of code (for example, refactored, optimized, or unit-test versions).

I was wondering if anyone has tried out astminer for pre-processing? It has a JSON format for storing the AST after parsing a code snippet, roughly along the lines of the sketch below. Which input feature type would this be? sequence? vector?
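A hypothetical example record (the field names here are my illustration, not astminer's exact schema):

```python
# Illustrative only: one parsed snippet stored as a node list, where each node
# keeps its token, a type label, and the indices of its children.
ast_record = {
    "label": "add",  # e.g., the method name or target sequence to predict
    "tree": [
        {"token": "<EMPTY>", "typeLabel": "MethodDeclaration", "children": [1, 2]},
        {"token": "a", "typeLabel": "Parameter", "children": []},
        {"token": "b", "typeLabel": "Parameter", "children": []},
    ],
}
```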