Conference’17, July 2017, Washington, DC, USA Li, Kim, and Chandra
https://github.com/00-00-00/ably-chat/archive/9bb2e36acc24f1cd684ef5d1b98d837055ba9cc8.zip
https://github.com/01sadra/Detoxiom/archive/c3fffd36989b0cd93bd09cbaa35123b9d605f989.zip
https://github.com/0411ameya/MPG_update/archive/27ac5531ca2c2f123e0cb854ebcb4d0441e2bc98.zip
https://github.com/0508994/MinesweeperGO/archive/ba0e0e45d2da21dde2365ce09277aad511de6885.zip
https://github.com/07101994/My-PPT-Presentation/archive/b89b17a962d5c3e5682fa751228a9f9ca593d77b.zip
https://github.com/0912718/ICT-lab/archive/d1d723edb722013cc83761f0f9df252cfd3361c3.zip
https://github.com/0Cubed/ZeroMediaPlayer/archive/d84c675f9dc8b16f823bb252db9ee368fbd5cd8e.zip
...
Listing 1. GitHub repositories download links example.
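Each link in Listing 1 appears to follow the pattern https://github.com/<owner>/<repository>/archive/<commit>.zip. The following is a minimal sketch, assuming that this pattern holds for every entry (it is inferred only from the examples shown); the helper name parse_archive_link and the RepoArchive type are hypothetical and not part of the dataset.

```python
import re
from typing import NamedTuple


class RepoArchive(NamedTuple):
    """Parts of a download link (hypothetical helper type)."""
    owner: str
    repo: str
    commit: str


# Assumed link shape, inferred from the examples in Listing 1.
_ARCHIVE_RE = re.compile(
    r"^https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)"
    r"/archive/(?P<commit>[0-9a-f]+)\.zip$"
)


def parse_archive_link(link: str) -> RepoArchive:
    """Split a download link into owner, repository name, and commit hash."""
    match = _ARCHIVE_RE.match(link.strip())
    if match is None:
        raise ValueError(f"unexpected link format: {link!r}")
    return RepoArchive(match["owner"], match["repo"], match["commit"])


first = parse_archive_link(
    "https://github.com/00-00-00/ably-chat/archive/"
    "9bb2e36acc24f1cd684ef5d1b98d837055ba9cc8.zip"
)
print(first.owner, first.repo, first.commit)
```

Because every link pins a commit hash, the same snapshot of a repository can be retrieved again later, as long as the repository remains publicly available.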
{
"id": 4716813,
"filepath": "Mindgames/VideoStreamServer/playersdk/src
/main/java/com/kaltura/playersdk/
PlayerViewController.java",
"method_name": "notifyKPlayerEvent",
"start_line": 506,
"end_line": 566,
"url": "https://github.com/Mindgames/VideoStreamServer
/blob/b7c73d2bcd296b3a24f83cf67d6a5998c7a1af6b/
playersdk/src/main/java/com/kaltura/playersdk/
PlayerViewController.java#L506-L566"
}
Listing 2. Search corpus example.
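In the record of Listing 2, the url field looks like a composition of the other fields: the filepath starts with the repository owner and name, and the URL fragment repeats start_line and end_line. The sketch below rebuilds the URL under that assumption, which is based on this single example only. The function build_method_url is hypothetical, and the commit hash is passed in separately because the record does not store it as its own field.

```python
def build_method_url(filepath: str, commit: str,
                     start_line: int, end_line: int) -> str:
    """Rebuild a GitHub blob URL for a method-level record.

    Assumes filepath has the form "<owner>/<repo>/<path inside repo>",
    as in the example of Listing 2.
    """
    owner, repo, path = filepath.split("/", 2)
    return (
        f"https://github.com/{owner}/{repo}/blob/{commit}/{path}"
        f"#L{start_line}-L{end_line}"
    )


url = build_method_url(
    filepath=(
        "Mindgames/VideoStreamServer/playersdk/src/main/java/"
        "com/kaltura/playersdk/PlayerViewController.java"
    ),
    commit="b7c73d2bcd296b3a24f83cf67d6a5998c7a1af6b",
    start_line=506,
    end_line=566,
)
print(url)
```

For the values of Listing 2, the result matches the url field of that record once its wrapped lines are joined.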
{
"stackoverflow_id": 1109022,
"question": "Close/hide the Android Soft Keyboard",
"question_url": "https://stackoverflow.com/questions
/1109022/close-hide-the-android-soft-keyboard",
"question_author": "Vidar Vestnes",
"question_author_url":
"https://stackoverflow.com/users/133858",
"answer": "// Check if no view has focus:\nView view =
this.getCurrentFocus();\nif (view != null) {
InputMethodManager imm = (InputMethodManager)
getSystemService(Context.INPUT_METHOD_SERVICE);
imm.hideSoftInputFromWindow(view.getWindowToken()
, 0);}",
"answer_url": "https://stackoverflow.com/a/1109108",
"answer_author": "Reto Meier",
"answer_author_url":
"https://stackoverflow.com/users/822",
"examples": [1841045, 1800067, 1271795],
"examples_url": [
"https://github.com/alextselegidis/easyappointments-
android-client/blob/39f1e8...",
"https://github.com/zelloptt/zello-android-client-
sdk/blob/87b45b6...",
"https://github.com/systers/conference-android/blob/
a67982abf54e0..."
]
}
Listing 3. Evaluation dataset example.
• answer_url: URL of the Stack Overflow answer to the
question.
• examples: 3 methods from the search corpus that best
answer the question (most similar to the Stack Overflow
answer).
• examples_url: GitHub links to the examples.
Note that there may be more acceptable answers to each
question. See Listing 3 for a concrete example of an
evaluation question in this dataset. The question and
answer pairs are extracted from the Stack Exchange Network
[4].
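One way such a record could be used is to scan a model's ranked list of method ids for the first id that appears in examples. The following is a minimal sketch under that assumption; the function first_hit_rank, its cutoff parameter, and the ranked list are hypothetical and serve only as illustration.

```python
from typing import Optional, Sequence


def first_hit_rank(ranked_ids: Sequence[int],
                   examples: Sequence[int],
                   cutoff: int = 50) -> Optional[int]:
    """Return the 1-based rank of the first returned method id that is
    listed in `examples`, looking only at the top `cutoff` results.
    Returns None when no listed example appears in that window."""
    accepted = set(examples)
    for rank, method_id in enumerate(ranked_ids[:cutoff], start=1):
        if method_id in accepted:
            return rank
    return None


# The examples field is taken from Listing 3; the ranking is made up.
examples = [1841045, 1800067, 1271795]
hypothetical_ranking = [4716813, 1800067, 1841045]
print(first_hit_rank(hypothetical_ranking, examples))
```

Since there may be acceptable answers beyond the three listed examples, a rank computed this way is a conservative estimate: a result that answers the question but is not listed would not be counted.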
2.4 NCS / UNIF Score Sheet
We provide the evaluation results for two code search
models of our creation, each with two variations:
• NCS: an unsupervised model that uses word embeddings
derived directly from the search corpus [6].
• NCS_postrank: an extension of the base NCS model that
performs a post-pass ranking, as explained in [6].
• UNIF_android, UNIF_stackoverflow: supervised extensions
of the NCS model that use a bag-of-words-based neural
network with attention. The supervision is learned using
the GitHub-Android-Train and StackOverflow-Android-Train
datasets, respectively, as described in [1].
We provide the rank of the first correct answer (FRank) for
each question in our evaluation dataset. The score sheet is
saved in a comma-delimited csv file as illustrated in
Listing 4.
No.,StackOverflow ID,NCS FRank,NCS_postrank FRank,UNIF_android FRank,UNIF_stackoverflow FRank
1,1109022,NF,1,1,1
2,4616095,17,1,31,19
3,3004515,2,1,5,2
4,1560788,1,4,5,1
5,3423754,5,1,22,10
6,1397361,NF,3,2,1
Listing 4. Score sheet example. "NF" means that no correct answer was found
in the top 50 returned results.
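The score sheet can be summarized by counting, per model, how many questions have an FRank of at most k. The following is a minimal sketch that reads the six rows of Listing 4, assuming that the header occupies a single line in the actual csv file. The function answered_in_top_k is hypothetical, and this count is only one possible summary, not a metric prescribed in this section.

```python
import csv
import io

# The six rows shown in Listing 4, with the header on one line (assumed).
SCORE_SHEET = "\n".join([
    "No.,StackOverflow ID,NCS FRank,NCS_postrank FRank,"
    "UNIF_android FRank,UNIF_stackoverflow FRank",
    "1,1109022,NF,1,1,1",
    "2,4616095,17,1,31,19",
    "3,3004515,2,1,5,2",
    "4,1560788,1,4,5,1",
    "5,3423754,5,1,22,10",
    "6,1397361,NF,3,2,1",
])


def answered_in_top_k(csv_text: str, column: str, k: int) -> int:
    """Count the questions whose FRank in `column` is at most k.
    "NF" entries (not found in the top 50) never count."""
    count = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row[column].strip()
        if value != "NF" and int(value) <= k:
            count += 1
    return count


for model in ("NCS", "NCS_postrank", "UNIF_android", "UNIF_stackoverflow"):
    column = f"{model} FRank"
    print(model,
          answered_in_top_k(SCORE_SHEET, column, 1),
          answered_in_top_k(SCORE_SHEET, column, 5))
```

On these six rows, NCS answers 1 question at rank 1 and 3 within the top 5, while NCS_postrank answers 4 and 6, respectively. Six questions are far too few to draw conclusions about the models.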
3 How We Obtained the Dataset
In this section, we describe the procedure by which we
obtained the data.
GitHub repositories. We obtained information about the
GitHub repositories through the GitHub REST API [3], and
the source files were downloaded using publicly available
links.
Search corpus. The search corpus was obtained by dividing
each file in the GitHub repositories at method-level
granularity.
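To illustrate what dividing a file at method-level granularity could look like, the sketch below extracts methods from Java source text with a regular expression and brace counting. This is a deliberately naive illustration and an assumption on our part: this section does not state which parsing tool was used, and a real pipeline would more likely rely on a proper Java parser. The function split_into_methods is hypothetical; the output keys mirror the fields of Listing 2.

```python
import re
from typing import Dict, List, Union

# Naive header pattern: a name, a parameter list without nested
# parentheses, an optional throws clause, then an opening brace.
_HEADER = re.compile(
    r"(?P<name>\w+)\s*\([^;{}()]*\)\s*(?:throws\s+[\w.,\s]+?)?\s*\{"
)
# Keywords that look like method headers to the pattern above.
_NOT_METHODS = {"if", "for", "while", "switch", "catch", "synchronized", "try"}


def split_into_methods(filepath: str,
                       source: str) -> List[Dict[str, Union[str, int]]]:
    """Return one record per detected method (constructors included).

    Limitations: braces inside strings or comments are miscounted, and
    anonymous classes or annotated parameters are handled incorrectly."""
    methods: List[Dict[str, Union[str, int]]] = []
    pos = 0
    while True:
        match = _HEADER.search(source, pos)
        if match is None:
            break
        if match.group("name") in _NOT_METHODS:
            pos = match.end()
            continue
        # Walk forward from the opening brace to its matching close.
        depth, end = 0, None
        for i in range(match.end() - 1, len(source)):
            if source[i] == "{":
                depth += 1
            elif source[i] == "}":
                depth -= 1
                if depth == 0:
                    end = i
                    break
        if end is None:
            break  # unbalanced braces: stop rather than guess
        methods.append({
            "filepath": filepath,
            "method_name": match.group("name"),
            # 1-based line of the method name and of the closing brace
            "start_line": source.count("\n", 0, match.start()) + 1,
            "end_line": source.count("\n", 0, end) + 1,
        })
        pos = end + 1
    return methods


JAVA_SOURCE = "\n".join([
    "public class Greeter {",
    "    private String name;",
    "",
    "    public String greet(String other) {",
    "        if (other == null) {",
    '            return "hi";',
    "        }",
    '        return "hi " + other;',
    "    }",
    "",
    "    void reset() throws Exception {",
    "        name = null;",
    "    }",
    "}",
])
for record in split_into_methods("owner/repo/Greeter.java", JAVA_SOURCE):
    print(record)
```

The example file path and the Greeter class are made up for this illustration.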
Evaluation dataset. The benchmark questions were
collected from a data dump publicly released by Stack Exchange
[4]. To select the set of Stack Overflow question and
answer pairs, we created a heuristics-based filtering pipeline
that discards open-ended, discussion-style questions.
We first obtained the most popular 17,000 questions on Stack