Neural Code Search Evaluation Dataset
Hongyu Li
Facebook, Inc.
U.S.A
hongyul@fb.com
Seohyun Kim
Facebook, Inc.
U.S.A
skim131@fb.com
Satish Chandra
Facebook, Inc.
U.S.A
satch@fb.com
Abstract
There has been increasing interest in code search using natural language. Assessing the performance of such code search models can be difficult without a readily available evaluation suite. In this paper, we present an evaluation dataset consisting of natural language query and code snippet pairs, with the hope that future work in this area can use this dataset as a common benchmark. We also provide the results of two code search models ([6] and [1]) from recent work as a benchmark.
1 Introduction
In recent years, learning the mapping between natural language and code snippets has been a popular field of research. In particular, [6], [1], and [2] have explored finding relevant code snippets given a natural language query, with models ranging from word embeddings and IR techniques to sophisticated neural networks. To evaluate the performance of these models, Stack Overflow question and code answer pairs are prime candidates, as Stack Overflow questions closely resemble what a developer may ask. One such example is "Close/hide the Android Soft Keyboard".¹,² One of the first answers³,⁴ on Stack Overflow correctly answers this question. However, collecting these questions can be tedious, and systematically comparing various models can pose a challenge.
To this end, we have constructed an evaluation dataset, which contains natural language queries and relevant code snippet answers from Stack Overflow. It also includes code snippet examples from the search corpus (public repositories from GitHub) that correctly answer each query. We hope that this dataset can serve as a benchmark to evaluate performance across various code search models.
The paper is organized as follows. First, we explain what data we are releasing in the dataset. Then, we describe the process for obtaining this dataset. Finally, we evaluate two code search models of our own creation, NCS and UNIF, on the evaluation dataset as a benchmark.
1 https://stackoverflow.com/questions/1109022/close-hide-the-android-soft-keyboard
2 Author: Vidar Vestnes. https://stackoverflow.com/users/133858/vidar-vestnes
3 https://stackoverflow.com/a/1109108
4 Author: Reto Meier. https://stackoverflow.com/users/822/reto-meier
2 Dataset Contents
In this section, we explain what data we are releasing.
2.1 GitHub Repositories
The most popular Android repositories on GitHub (ranked by the number of stars) are used to create the search corpus. For each repository that we indexed, we provide the link, specific to the commit that was used.⁵ In total, there are 24,549 repositories.⁶ We will release a text file containing the download links for these GitHub repositories. See Listing 1 for an example.
2.2 Search Corpus
The search corpus is indexed using all method bodies parsed from the 24,549 GitHub repositories. In total, there are 4,716,814 methods in this corpus. The code search model finds relevant code snippets (i.e., method bodies) from this corpus given a natural language query. In this data release, we provide the following information for each method in the corpus:
id: Each method in the corpus has a unique numeric identifier. This ID number is also referenced in our evaluation dataset.
filepath: The file path, in the format :owner/:repo/relative-file-path-to-the-repo.
method_name: Name of the method.
start_line: Starting line number of the method in the file.
end_line: Ending line number of the method in the file.
url: GitHub link to the method body, with the commit ID and line numbers encoded.
Listing 2 provides an example of a method in the search corpus.
2.3 Evaluation Dataset
The evaluation dataset is composed of 287 Stack Overflow question and answer pairs, for which we release the following information:
stackoverflow_id: ID of the Stack Overflow post.
question: Title of the Stack Overflow post.
question_url: URL of the Stack Overflow post.
answer: Code snippet answer to the question.
5 From August 2018.
6 There were originally 26,109 repositories; the difference is due to reasons outside of our control (e.g., repositories being deleted). Note that some of the links in this dataset may become unavailable in the future for similar reasons.
https://github.com/00-00-00/ably-chat/archive/9bb2e36acc24f1cd684ef5d1b98d837055ba9cc8.zip
https://github.com/01sadra/Detoxiom/archive/c3fffd36989b0cd93bd09cbaa35123b9d605f989.zip
https://github.com/0411ameya/MPG_update/archive/27ac5531ca2c2f123e0cb854ebcb4d0441e2bc98.zip
https://github.com/0508994/MinesweeperGO/archive/ba0e0e45d2da21dde2365ce09277aad511de6885.zip
https://github.com/07101994/My-PPT-Presentation/archive/b89b17a962d5c3e5682fa751228a9f9ca593d77b.zip
https://github.com/0912718/ICT-lab/archive/d1d723edb722013cc83761f0f9df252cfd3361c3.zip
https://github.com/0Cubed/ZeroMediaPlayer/archive/d84c675f9dc8b16f823bb252db9ee368fbd5cd8e.zip
...
Listing 1. GitHub repositories download links example.
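The links file shown in Listing 1 contains one archive URL per line. As a minimal sketch (not part of the released tooling; the file name repo_links.txt is assumed), the archives could be fetched as follows:

# Minimal sketch for downloading the repository archives listed in the
# released links file (assumed here to be named "repo_links.txt").
import os
import urllib.request

def download_archives(links_file="repo_links.txt", out_dir="repos"):
    os.makedirs(out_dir, exist_ok=True)
    with open(links_file) as f:
        for url in (line.strip() for line in f):
            if not url:
                continue
            # Derive a file name such as "owner_repo_commit.zip" from the URL.
            parts = url.split("/")
            name = "_".join([parts[3], parts[4], parts[6]])
            try:
                urllib.request.urlretrieve(url, os.path.join(out_dir, name))
            except Exception as e:
                # Some repositories may no longer be available (see footnote 6).
                print(f"skipped {url}: {e}")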
{
  "id": 4716813,
  "filepath": "Mindgames/VideoStreamServer/playersdk/src/main/java/com/kaltura/playersdk/PlayerViewController.java",
  "method_name": "notifyKPlayerEvent",
  "start_line": 506,
  "end_line": 566,
  "url": "https://github.com/Mindgames/VideoStreamServer/blob/b7c73d2bcd296b3a24f83cf67d6a5998c7a1af6b/playersdk/src/main/java/com/kaltura/playersdk/PlayerViewController.java#L506-L566"
}
Listing 2. Search corpus example.
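Assuming the search corpus is distributed with one JSON object per line (the exact release format may differ), a small sketch for building an id-to-method lookup could look like this:

# Sketch: load search-corpus entries and index them by their numeric id.
# Assumes a JSON-lines file "search_corpus.jsonl"; the actual release
# format may differ.
import json

def load_corpus(path="search_corpus.jsonl"):
    corpus = {}
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            corpus[entry["id"]] = entry
    return corpus

corpus = load_corpus()
print(corpus[4716813]["method_name"])  # e.g. "notifyKPlayerEvent" from Listing 2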
{
  "stackoverflow_id": 1109022,
  "question": "Close/hide the Android Soft Keyboard",
  "question_url": "https://stackoverflow.com/questions/1109022/close-hide-the-android-soft-keyboard",
  "question_author": "Vidar Vestnes",
  "question_author_url": "https://stackoverflow.com/users/133858",
  "answer": "// Check if no view has focus:\nView view = this.getCurrentFocus();\nif (view != null) { InputMethodManager imm = (InputMethodManager) getSystemService(Context.INPUT_METHOD_SERVICE); imm.hideSoftInputFromWindow(view.getWindowToken(), 0);}",
  "answer_url": "https://stackoverflow.com/a/1109108",
  "answer_author": "Reto Meier",
  "answer_author_url": "https://stackoverflow.com/users/822",
  "examples": [1841045, 1800067, 1271795],
  "examples_url": [
    "https://github.com/alextselegidis/easyappointments-android-client/blob/39f1e8...",
    "https://github.com/zelloptt/zello-android-client-sdk/blob/87b45b6...",
    "https://github.com/systers/conference-android/blob/a67982abf54e0..."
  ]
}
Listing 3. Evaluation dataset example.
answer_url: URL of the Stack Overflow answer to the question.
examples: 3 methods from the search corpus that best answer the question (most similar to the Stack Overflow answer).
examples_url: GitHub links to the examples.
Note that there may be more acceptable answers to each question. See Listing 3 for a concrete example of an evaluation question in this dataset. The question and answer pairs were extracted from the Stack Exchange Network [4].
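Because the examples field stores corpus ids, an evaluation entry can be joined back to concrete method bodies. A small sketch, again assuming JSON-lines files for both releases (the actual file names and format may differ):

# Sketch: resolve the "examples" ids of each evaluation entry to corpus methods.
# Assumes JSON-lines files "eval_dataset.jsonl" and "search_corpus.jsonl".
import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

corpus = {m["id"]: m for m in load_jsonl("search_corpus.jsonl")}
for q in load_jsonl("eval_dataset.jsonl"):
    methods = [corpus[i] for i in q["examples"]]
    print(q["question"], "->", [m["url"] for m in methods])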
2.4 NCS / UNIF Score Sheet
We provide the evaluation results for two code search models of our creation, each with two variations:
NCS: an unsupervised model which uses word embeddings derived directly from the search corpus [6].
NCS_postrank: an extension of the base NCS model that performs a post-pass ranking, as explained in [6].
UNIF_android, UNIF_stackoverflow: a supervised extension of the NCS model that uses a bag-of-words-based neural network with attention. The supervision is learned using the GitHub-Android-Train and StackOverflow-Android-Train datasets, respectively, as described in [1].
We provide the rank of the first correct answer (FRank) for each question in our evaluation dataset. The score sheet is saved in a comma-delimited csv file, as illustrated in Listing 4.
No.,StackOverflow ID,NCS FRank,NCS_postrank FRank,UNIF_android FRank,UNIF_stackoverflow FRank
1,1109022,NF,1,1,1
2,4616095,17,1,31,19
3,3004515,2,1,5,2
4,1560788,1,4,5,1
5,3423754,5,1,22,10
6,1397361,NF,3,2,1
Listing 4. Score sheet example. "NF" stands for correct answer not found in the top 50 returned results.
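For reference, a minimal sketch that reads such a score sheet, treating "NF" as a missing rank (the column names follow Listing 4; the file name is assumed):

# Sketch: parse the score sheet CSV, mapping "NF" (not found in top 50) to None.
import csv

def read_franks(path="score_sheet.csv", column="NCS_postrank FRank"):
    franks = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            value = row[column]
            franks.append(None if value == "NF" else int(value))
    return franks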
3 How We Obtained the Dataset
In this section, we describe the procedure for how we obtained the data.
GitHub repositories. We obtained the information of the GitHub repositories with the GitHub REST API [3], and the source files were downloaded using publicly available links.
Search corpus. The search corpus was obtained by dividing each file in the GitHub repositories at method-level granularity.
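The paper does not name the parser used for this split; purely as an illustration, locating method declarations in a Java file could be sketched with the third-party javalang library (recovering the exact end line would require extra token bookkeeping):

# Illustrative sketch only: locate method declarations in a Java source file
# with the javalang parser. The actual extraction pipeline is not specified.
import javalang

def list_methods(java_source):
    tree = javalang.parse.parse(java_source)
    for _, node in tree.filter(javalang.tree.MethodDeclaration):
        # javalang exposes the starting position of each declaration;
        # the end line is not reported directly.
        yield node.name, node.position.line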
Evaluation dataset. The benchmark questions were collected from a data dump publicly released by Stack Exchange [4]. To select the set of Stack Overflow question and answer pairs, we created a heuristics-based filtering pipeline in which we discarded open-ended, discussion-style questions. We first obtained the most popular 17,000 questions on Stack Overflow with "Android" and "Java" tags.
Table 1. Number of questions answered in the top 1, 5, and 10, and MRR, for NCS, NCS_postrank, UNIF_android, and UNIF_stackoverflow.

Model                Answered@1  Answered@5  Answered@10  MRR
NCS                   33          74          98          0.189
NCS_postrank          85         151         180          0.4
UNIF_android          25          74         110          0.178
UNIF_stackoverflow   104         164         188          0.465
The dataset was further filtered with the following criteria: 1) there exists an upvoted code answer, and 2) the ground truth code snippet has at least one match in the search corpus. From this pipeline, we obtained 518 questions. Finally, we manually went through these questions and filtered out questions with vague queries and/or code answers. The final dataset contains 287 Stack Overflow question and answer pairs.
NCS / UNIF score sheet. To judge whether a method body correctly answers the query, we compare how similar it is to the Stack Overflow answer. We do this systematically using a code-to-code similarity tool called Aroma [5]. Aroma gives a similarity score between two code snippets; if this score is above a certain threshold (0.25 in our case), we count it as a success. This similarity score aims to mimic manual assessment of the correctness of search results in an automatic and reproducible fashion, removing human judgment from the process. More details on how we chose this threshold can be found in [1].
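Aroma itself is described in [5]; the decision rule above reduces to a threshold test. The sketch below uses a placeholder aroma_similarity function, which is hypothetical and not an actual Aroma API:

# Sketch of the success criterion: a returned method counts as a correct
# answer if its Aroma similarity to the Stack Overflow answer exceeds 0.25.
# `aroma_similarity` is a hypothetical placeholder, not a real Aroma API.
THRESHOLD = 0.25

def first_correct_rank(results, ground_truth, aroma_similarity):
    """Return the 1-based FRank, or None if no returned result is a hit."""
    for rank, snippet in enumerate(results, start=1):
        if aroma_similarity(snippet, ground_truth) > THRESHOLD:
            return rank
    return None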
4 Evaluation
We provide the results for four models: NCS, NCS_postrank, UNIF_android, and UNIF_stackoverflow. Table 1 reports the number of questions answered within the top n returned code snippets, where n = 1, 5, and 10 (Answered@1, 5, and 10 in Table 1), as well as the Mean Reciprocal Rank (MRR).
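Given the per-question FRank values from the score sheet, Answered@n and MRR follow directly. A minimal sketch is shown below; treating "NF" questions as contributing a reciprocal rank of 0 to MRR is our assumption, as the paper does not spell this out:

# Sketch: compute Answered@n and MRR from a list of FRank values, where None
# marks "NF" (no correct answer in the top 50).
def answered_at(franks, n):
    return sum(1 for r in franks if r is not None and r <= n)

def mrr(franks):
    return sum(1.0 / r for r in franks if r is not None) / len(franks)

franks = [None, 17, 2, 1, 5, None]  # the NCS FRank column of Listing 4
print(answered_at(franks, 5), round(mrr(franks), 3))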
References
[1] Jose Cambronero, Hongyu Li, Seohyun Kim, Koushik Sen, and Satish Chandra. When deep learning met code search. CoRR, abs/1905.03813, 2019. URL: https://arxiv.org/abs/1905.03813, arXiv:1905.03813.
[2] Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. Deep code search. In Proceedings of the 40th International Conference on Software Engineering, pages 933–944. ACM, 2018.
[3] GitHub Inc. GitHub REST API v3. URL: https://developer.github.com/v3/search/.
[4] Stack Exchange Inc. Stack Exchange data dump, 2018. CC-BY-SA 3.0. URL: https://archive.org/details/stackexchange.
[5] Sifei Luan, Di Yang, Celeste Barnaby, Koushik Sen, and Satish Chandra. Aroma: Code recommendation via structural code search. CoRR, abs/1812.01158, 2018. URL: http://arxiv.org/abs/1812.01158, arXiv:1812.01158.
[6] Saksham Sachdev, Hongyu Li, Sifei Luan, Seohyun Kim, Koushik Sen, and Satish Chandra. Retrieval on source code: a neural code search. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 31–41. ACM, 2018.