preprocessing code and code2seq with astminer #1791
Replies: 3 comments
-
Hi Rudolf! code2seq / code2code sounds like a fascinating use case. Is your suggestion to add astminer as a new preprocessing step? Raw code strikes me as quite different from the existing feature types. After calling astminer to convert from code to JSON, we'd still need a conversion from the JSON format into raw tensors, using some tokenizer + vocabulary, perhaps similar to what text features already do. If you are interested in trying to prototype something quickly, I would recommend preprocessing your code snippets offline into the JSON format, and then using text features all the way through.
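If it helps, here's a minimal sketch of what I mean by "text features all the way through", using Ludwig's programmatic API. The column names and toy data are made up, and I haven't run this exact snippet:

```python
# Minimal sketch: treat preprocessed code (or its serialized JSON AST) as
# plain text on both the input and output side. Column names are made up.
import pandas as pd
from ludwig.api import LudwigModel

df = pd.DataFrame({
    "source_code": ["def add(a, b):\n    return a + b"],
    "target_code": ["def add(a: int, b: int) -> int:\n    return a + b"],
})

config = {
    "input_features": [{"name": "source_code", "type": "text"}],
    "output_features": [{"name": "target_code", "type": "text"}],
}

model = LudwigModel(config)
model.train(dataset=df)  # Ludwig builds the tokenizer + vocabulary for us
```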
-
Adding on to this discussion, I found the model CodeT5-small on Hugging Face: https://huggingface.co/Salesforce/codet5-small

What's interesting there is that they use the RobertaTokenizer for preparing the code tokens (see the snippet below). So I guess it would be possible to define a new encoder for Ludwig that uses the RobertaTokenizer? Though it looks like that encoder is already there: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/encoders/text_encoders.py#L937 🎉

And now, as I get to the end here, I've realized that with recent updates Ludwig supports the encoder out of the box: https://ludwig.ai/latest/configuration/features/text_features/#roberta-encoder
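For reference, the usage snippet on the model card looks roughly like this (quoted from memory, so double-check against the card itself):

```python
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-small")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-small")

text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# generate a completion for the masked span
generated_ids = model.generate(input_ids, max_length=8)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```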
-
Hi @rudolfolah, indeed roberta is supported. Please let us know if you run into any issues using it!
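Something like this should be a reasonable starting point (untested sketch; the nested encoder syntax follows the current docs, and the feature names are placeholders):

```python
from ludwig.api import LudwigModel

# Untested sketch: a text input feature backed by the built-in RoBERTa encoder.
# The pretrained checkpoint is shown explicitly here for clarity.
config = {
    "input_features": [
        {
            "name": "source_code",
            "type": "text",
            "encoder": {
                "type": "roberta",
                "pretrained_model_name_or_path": "roberta-base",
            },
        }
    ],
    "output_features": [{"name": "target_code", "type": "text"}],
}

model = LudwigModel(config)
```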
-
Hi! I'm trying to set up a small ML implementation that converts code snippets into other pieces of code (for example, refactored, optimized, or unit-test versions).

I was wondering if anyone has tried out astminer for pre-processing? It has a JSON format for storing the AST after parsing a code snippet, roughly along the lines of the sketch below. Which input feature type would this be? sequence? vector?
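A hypothetical example record (the field names here are my illustration, not astminer's exact schema):

```python
# Illustrative only: one parsed snippet stored as a node list, where each node
# keeps its token, a type label, and the indices of its children.
ast_record = {
    "label": "add",  # e.g., the method name or target sequence to predict
    "tree": [
        {"token": "<EMPTY>", "typeLabel": "MethodDeclaration", "children": [1, 2]},
        {"token": "a", "typeLabel": "Parameter", "children": []},
        {"token": "b", "typeLabel": "Parameter", "children": []},
    ],
}
```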