Umbrella issue for ML team #231
Comments
Thank you very much for writing this up, @EgorBu. There's a lot to consider here, but this will be really helpful for our planning.
+ https://github.com/bblfsh/bblfshd/issues/236 (JS driver related)
2) was fixed in bblfsh/javascript-driver#50
I've checked the issues list and some of them are already resolved. @creachadair Maybe we can make a separate project for this to track the progress?
This would make a good milestone. I'll set that up. Edit: I started doing this, but it wound up opening up various design questions I didn't want to pre-answer, so I went with the simpler approach @bzz suggested below for now.
Short-term, and as a first step, we could convert the description to use checkboxes and mark the issues that are solved. Happy to do that, if nobody has any objections or better short-term suggestions.
It's a good idea, and I started working on this. So far I have mostly just rearranged things cosmetically, checking off items that are complete and separating out the FRs from the bugs.
Did a pass over the checkboxes in the description and updated the resolved ones.
I think we can close this one, assuming all feature requests have their own issue filed. |
It looks that way. If I'm wrong, let me know and I'll take care of it. |
Hello,
As we discussed in Slack here, this is the umbrella issue for the ML team. I will try to aggregate our wishes and problems here.
Issues
Let's start with issues:
bblfshd issues
#159 A lot of <defunct>
#168 A lot of files in /var/lib/bblfshd/tmp/
#219 Memory is not released, number of instances in bblfshctl driver list mismatch with htop
Go & Python client issues
python-client#125 Function supported_languages raises unimplemented exception
python-client#126 language argument in client.parse changes number of files significantly
Drivers
python-driver#150 Support Cython
SDK
Web
Miscellaneous
documentation#199 Babelfish (aka. bblfsh) link is broken
I believe that some of these issues can be closed already.
Wishlist
Let's start the second part, about wishes (it's probably a bit late for New Year's resolutions, but anyway).
It should show our priorities.
Feature extraction for source code
This part is very important for everybody in the MLonCode area, and it is still quite complicated to do.
In summary:
code -> nodes -> "".join(nodes) == code
... bblfsh by researchers in this area
What we need (and not only we) is code -> UAST -> code. It's not possible right now, so we used several tricks to work around this problem.
We wrote a tokenizer for source code (right now only JavaScript is supported) that does code -> {UAST, info about parents of nodes, tokens}, where tokens has the property "".join(tokens) == code.
To do it, we had to find the reserved keywords for the particular language and write logic that splits everything between UAST nodes with tokens into reserved keywords and spaces/quotes/newlines.
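To make the idea concrete, here is a minimal sketch of such a gap-filling tokenizer. It is not the actual implementation: the keyword set, the `tokenize` function and the `(start, end)` span format are assumptions made for illustration, and offsets are assumed to be character offsets into the string.

```python
import re

# Hypothetical, incomplete keyword set used only for illustration.
JS_KEYWORDS = {"function", "return", "var", "let", "const", "if", "else"}

# A gap is split into word runs, whitespace runs, or single other characters,
# so every character of the gap ends up in exactly one piece.
GAP_PATTERN = re.compile(r"\w+|\s+|[^\w\s]")


def classify_gap(piece):
    """Label a piece of text found between two UAST tokens."""
    if piece in JS_KEYWORDS:
        return "keyword"
    if piece.isspace():
        return "space"
    return "other"


def tokenize(code, node_spans):
    """Return (text, kind) pairs whose texts concatenate back to `code`.

    `node_spans` holds (start, end) offsets of UAST nodes that carry a token.
    """
    tokens = []
    cursor = 0
    for start, end in sorted(node_spans):
        if start < cursor:
            raise ValueError("overlapping node spans")
        # Everything between two UAST tokens: keywords, punctuation, spaces.
        for piece in GAP_PATTERN.findall(code[cursor:start]):
            tokens.append((piece, classify_gap(piece)))
        tokens.append((code[start:end], "node"))
        cursor = end
    # Trailing text after the last UAST token.
    for piece in GAP_PATTERN.findall(code[cursor:]):
        tokens.append((piece, classify_gap(piece)))
    return tokens


code = 'var x = "a";'
tokens = tokenize(code, [(4, 5), (8, 11)])
assert "".join(text for text, _ in tokens) == code
```

The round-trip assertion at the end is the property we care about: nothing from the original file is lost, whatever the UAST does or does not cover.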
Also, it's important to know which roles are used in a particular language (obviously, only a subset of all roles is used for any given language, and this can be used to reduce the dimensionality of features, like we did here).
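A rough sketch of how a per-language role subset can shrink the feature space. The role names below are placeholder strings, and `build_role_index` / `encode_roles` are hypothetical helpers, not part of any bblfsh client.

```python
def build_role_index(roles_per_node):
    """Map every role observed in a corpus to a dense feature index."""
    observed = {role for roles in roles_per_node for role in roles}
    return {role: i for i, role in enumerate(sorted(observed))}


def encode_roles(roles, role_index):
    """Multi-hot encode one node's roles; unseen roles are ignored."""
    vector = [0] * len(role_index)
    for role in roles:
        idx = role_index.get(role)
        if idx is not None:
            vector[idx] = 1
    return vector


# Roles collected from nodes of one language (placeholder values).
corpus = [["IDENTIFIER", "EXPRESSION"], ["LITERAL"], ["IDENTIFIER"]]
role_index = build_role_index(corpus)
features = encode_roles(["IDENTIFIER", "LITERAL"], role_index)
```

The vector length equals the number of roles actually seen for the language rather than the size of the full role list.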
Position information
We rely heavily on position information for identifiers and literals. When it's not available, it makes our life much harder,
so we have to make workarounds like this one for "[bug] Missed token for regular expressions".
How it can be checked: run long tests for each driver on the top 10-20-30 repositories for each language and check that
node.token == code[node.start_offset:node.end_offset]
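The check above could be sketched like this. The `Node` class is a stand-in with just the three fields used in the expression, not the real client type, and offsets are assumed to index into the same string that is being sliced.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical minimal node: only the fields needed for the check.
    token: str
    start_offset: int
    end_offset: int


def find_position_mismatches(code, nodes):
    """Return (node, actual_text) pairs where the token and its span disagree."""
    mismatches = []
    for node in nodes:
        if not node.token:
            continue  # nodes without a token have nothing to compare
        actual = code[node.start_offset:node.end_offset]
        if node.token != actual:
            mismatches.append((node, actual))
    return mismatches


code = "x = /ab+c/;"
nodes = [Node("x", 0, 1), Node("/ab+c/", 4, 10), Node("x", 2, 3)]
bad = find_position_mismatches(code, nodes)
```

Here only the last node is reported, because its span points at `=` instead of `x`. Running such a check over many files per driver would surface missing or shifted positions early.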
Integration/regression tests on popular repositories
Maybe process the top 10-20-30 repositories for each supported language.
Make long-running tests to check that everything is good with the drivers, SDK, etc. This would help avoid many problems like #219 (Memory is not released, number of instances in bblfshctl driver list mismatch with htop), which is probably related to #159 (A lot of <defunct>), and so on.
Language extensions
Follow-up to this discussion: https://src-d.slack.com/archives/C7UDG9VNY/p1533802242000193. In summary: https://github.com/babel/babel/ is excluded from our list of repos to check because it is more an issue of the native driver not covering all node types in this repo. Detecting this kind of language extension is probably a driver/enry issue (another example: https://github.com/samgozman/YoptaScript#%D0%9F%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D1%8B).
Partial parsing
It could help to check that code wasn't changed in a way that breaks UAST parsing. Right now the full content of a file is parsed again, even if only one space is added.