Neural Code Search Evaluation Dataset
Hongyu Li
Facebook, Inc.
U.S.A
hongyul@fb.com
Seohyun Kim
Facebook, Inc.
U.S.A
skim131@fb.com
Satish Chandra
Facebook, Inc.
U.S.A
satch@fb.com
Abstract
There has been increasing interest in code search using natural language. Assessing the performance of such code search models can be difficult without a readily available evaluation suite. In this paper, we present an evaluation dataset consisting of natural language query and code snippet pairs, with the hope that future work in this area can use this dataset as a common benchmark. We also provide the results of two code search models ([6] and [1]) from recent work as a benchmark.
1 Introduction
In recent years, learning the mapping between natural language and code snippets has been a popular field of research. In particular, [6], [1], and [2] have explored finding relevant code snippets given a natural language query, with models ranging from word embeddings and IR techniques to sophisticated neural networks. To evaluate the performance of these models, Stack Overflow question and code answer pairs are prime candidates, as Stack Overflow questions closely resemble what a developer may ask. One such example is "Close/hide the Android Soft Keyboard".¹,² One of the first answers³,⁴ on Stack Overflow correctly answers this question. However, collecting these questions can be tedious, and systematically comparing various models can pose a challenge.
To this end, we have constructed an evaluation dataset, which contains natural language queries and relevant code snippet answers from Stack Overflow. It also includes code snippet examples from the search corpus (public repositories from GitHub) that correctly answer each query. We hope that this dataset can serve as a benchmark to evaluate performance across various code search models.
The paper is organized as follows. First, we explain what data we are releasing in the dataset. Then, we describe the process for obtaining this dataset. Finally, we evaluate two code search models of our own creation, NCS and UNIF, on the evaluation dataset as a benchmark.
1 https://stackoverflow.com/questions/1109022/close-hide-the-android-soft-keyboard
2 Author: Vidar Vestnes. https://stackoverflow.com/users/133858/vidar-vestnes
3 https://stackoverflow.com/a/1109108
4 Author: Reto Meier. https://stackoverflow.com/users/822/reto-meier
2 Dataset Contents
In this section, we explain what data we are releasing.
2.1 GitHub Repositories
The most popular Android repositories on GitHub (ranked by the number of stars) are used to create the search corpus. For each repository that we indexed, we provide the link, specific to the commit that was used.⁵ In total, there are 24,549 repositories.⁶ We will release a text file containing the download links for these GitHub repositories. See Listing 1 for an example.
2.2 Search Corpus
The search corpus is indexed using all method bodies parsed from the 24,549 GitHub repositories. In total, there are 4,716,814 methods in this corpus. The code search model finds relevant code snippets (i.e., method bodies) from this corpus given a natural language query. In this data release, we provide the following information for each method in the corpus:
id: Each method in the corpus has a unique numeric identifier. This ID number is also referenced in our evaluation dataset.
filepath: The file path, in the format :owner/:repo/relative-file-path-to-the-repo.
method_name: Name of the method.
start_line: Starting line number of the method in the file.
end_line: Ending line number of the method in the file.
url: GitHub link to the method body, with the commit ID and line numbers encoded.
Listing 2 provides an example of a method in the search corpus.
2.3 Evaluation Dataset
The evaluation dataset is composed of 287 Stack Overflow question and answer pairs, for which we release the following information:
stackoverflow_id: ID of the Stack Overflow post.
question: Title of the Stack Overflow post.
question_url: URL of the Stack Overflow post.
answer: Code snippet answer to the question.
5 From August 2018.
6 There were originally 26,109 repositories; the difference is due to reasons outside of our control (e.g., repositories being deleted). Note that some of the links in this dataset may become unavailable in the future for similar reasons.
https://github.com/00-00-00/ably-chat/archive/9bb2e36acc24f1cd684ef5d1b98d837055ba9cc8.zip
https://github.com/01sadra/Detoxiom/archive/c3fffd36989b0cd93bd09cbaa35123b9d605f989.zip
https://github.com/0411ameya/MPG_update/archive/27ac5531ca2c2f123e0cb854ebcb4d0441e2bc98.zip
https://github.com/0508994/MinesweeperGO/archive/ba0e0e45d2da21dde2365ce09277aad511de6885.zip
https://github.com/07101994/My-PPT-Presentation/archive/b89b17a962d5c3e5682fa751228a9f9ca593d77b.zip
https://github.com/0912718/ICT-lab/archive/d1d723edb722013cc83761f0f9df252cfd3361c3.zip
https://github.com/0Cubed/ZeroMediaPlayer/archive/d84c675f9dc8b16f823bb252db9ee368fbd5cd8e.zip
...
Listing 1. GitHub repositories download links example.
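The links file shown in Listing 1 contains one archive URL per line. As a minimal sketch (not part of the released tooling; the file name repo_links.txt is assumed), the archives could be fetched as follows:

# Minimal sketch for downloading the repository archives listed in the
# released links file (assumed here to be named "repo_links.txt").
import os
import urllib.request

def download_archives(links_file="repo_links.txt", out_dir="repos"):
    os.makedirs(out_dir, exist_ok=True)
    with open(links_file) as f:
        for url in (line.strip() for line in f):
            if not url:
                continue
            # Derive a file name such as "owner_repo_commit.zip" from the URL.
            parts = url.split("/")
            name = "_".join([parts[3], parts[4], parts[6]])
            try:
                urllib.request.urlretrieve(url, os.path.join(out_dir, name))
            except Exception as e:
                # Some repositories may no longer be available (see footnote 6).
                print(f"skipped {url}: {e}")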
{
  "id": 4716813,
  "filepath": "Mindgames/VideoStreamServer/playersdk/src/main/java/com/kaltura/playersdk/PlayerViewController.java",
  "method_name": "notifyKPlayerEvent",
  "start_line": 506,
  "end_line": 566,
  "url": "https://github.com/Mindgames/VideoStreamServer/blob/b7c73d2bcd296b3a24f83cf67d6a5998c7a1af6b/playersdk/src/main/java/com/kaltura/playersdk/PlayerViewController.java#L506-L566"
}
Listing 2. Search corpus example.
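Assuming the search corpus is distributed with one JSON object per line (the exact release format may differ), a small sketch for building an id-to-method lookup could look like this:

# Sketch: load search-corpus entries and index them by their numeric id.
# Assumes a JSON-lines file "search_corpus.jsonl"; the actual release
# format may differ.
import json

def load_corpus(path="search_corpus.jsonl"):
    corpus = {}
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            corpus[entry["id"]] = entry
    return corpus

corpus = load_corpus()
print(corpus[4716813]["method_name"])  # e.g. "notifyKPlayerEvent" from Listing 2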
{
  "stackoverflow_id": 1109022,
  "question": "Close/hide the Android Soft Keyboard",
  "question_url": "https://stackoverflow.com/questions/1109022/close-hide-the-android-soft-keyboard",
  "question_author": "Vidar Vestnes",
  "question_author_url": "https://stackoverflow.com/users/133858",
  "answer": "// Check if no view has focus:\nView view = this.getCurrentFocus();\nif (view != null) { InputMethodManager imm = (InputMethodManager) getSystemService(Context.INPUT_METHOD_SERVICE); imm.hideSoftInputFromWindow(view.getWindowToken(), 0);}",
  "answer_url": "https://stackoverflow.com/a/1109108",
  "answer_author": "Reto Meier",
  "answer_author_url": "https://stackoverflow.com/users/822",
  "examples": [1841045, 1800067, 1271795],
  "examples_url": [
    "https://github.com/alextselegidis/easyappointments-android-client/blob/39f1e8...",
    "https://github.com/zelloptt/zello-android-client-sdk/blob/87b45b6...",
    "https://github.com/systers/conference-android/blob/a67982abf54e0..."
  ]
}
Listing 3. Evaluation dataset example.
answer_url: URL of the Stack Overflow answer to the question.
examples: 3 methods from the search corpus that best answer the question (most similar to the Stack Overflow answer).
examples_url: GitHub links to the examples.
Note that there may be more acceptable answers to each question. See Listing 3 for a concrete example of an evaluation question in this dataset. The question and answer pairs were extracted from the Stack Exchange Network [4].
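Because the examples field stores corpus ids, an evaluation entry can be joined back to concrete method bodies. A small sketch, again assuming JSON-lines files for both releases (the actual file names and format may differ):

# Sketch: resolve the "examples" ids of each evaluation entry to corpus methods.
# Assumes JSON-lines files "eval_dataset.jsonl" and "search_corpus.jsonl".
import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

corpus = {m["id"]: m for m in load_jsonl("search_corpus.jsonl")}
for q in load_jsonl("eval_dataset.jsonl"):
    methods = [corpus[i] for i in q["examples"]]
    print(q["question"], "->", [m["url"] for m in methods])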
2.4 NCS / UNIF Score Sheet
We provide the evaluation results for two code search models of our creation, each with two variations:
NCS: an unsupervised model which uses word embeddings derived directly from the search corpus [6].
NCS_postrank: an extension of the base NCS model that performs a post-pass ranking, as explained in [6].
UNIF_android, UNIF_stackoverflow: a supervised extension of the NCS model that uses a bag-of-words-based neural network with attention. The supervision is learned using the GitHub-Android-Train and StackOverflow-Android-Train datasets, respectively, as described in [1].
We provide the rank of the first correct answer (FRank) for each question in our evaluation dataset. The score sheet is saved in a comma-delimited csv file, as illustrated in Listing 4.
No.,StackOverflow ID,NCS FRank,NCS_postrank FRank,UNIF_android FRank,UNIF_stackoverflow FRank
1,1109022,NF,1,1,1
2,4616095,17,1,31,19
3,3004515,2,1,5,2
4,1560788,1,4,5,1
5,3423754,5,1,22,10
6,1397361,NF,3,2,1
Listing 4. Score sheet example. "NF" stands for correct answer not found in the top 50 returned results.
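For reference, a minimal sketch that reads such a score sheet, treating "NF" as a missing rank (the column names follow Listing 4; the file name is assumed):

# Sketch: parse the score sheet CSV, mapping "NF" (not found in top 50) to None.
import csv

def read_franks(path="score_sheet.csv", column="NCS_postrank FRank"):
    franks = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            value = row[column]
            franks.append(None if value == "NF" else int(value))
    return franks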
3 How We Obtained the Dataset
In this section, we describe the procedure for how we obtained the data.
GitHub repositories. We obtained the information of the GitHub repositories with the GitHub REST API [3], and the source files were downloaded using publicly available links.
Search corpus. The search corpus was obtained by dividing each file in the GitHub repositories at method-level granularity.
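The paper does not name the parser used for this split; purely as an illustration, locating method declarations in a Java file could be sketched with the third-party javalang library (recovering the exact end line would require extra token bookkeeping):

# Illustrative sketch only: locate method declarations in a Java source file
# with the javalang parser. The actual extraction pipeline is not specified.
import javalang

def list_methods(java_source):
    tree = javalang.parse.parse(java_source)
    for _, node in tree.filter(javalang.tree.MethodDeclaration):
        # javalang exposes the starting position of each declaration;
        # the end line is not reported directly.
        yield node.name, node.position.line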
Evaluation dataset. The benchmark questions were collected from a data dump publicly released by Stack Exchange [4]. To select the set of Stack Overflow question and answer pairs, we created a heuristics-based filtering pipeline in which we discarded open-ended, discussion-style questions. We first obtained the most popular 17,000 questions on Stack Overflow with "Android" and "Java" tags.
Table 1. Number of questions answered in the top 1, 5, and 10, and MRR, for NCS, NCS_postrank, UNIF_android, and UNIF_stackoverflow.

Model                Answered@1  Answered@5  Answered@10  MRR
NCS                   33          74          98          0.189
NCS_postrank          85         151         180          0.4
UNIF_android          25          74         110          0.178
UNIF_stackoverflow   104         164         188          0.465
The dataset was further filtered with the following criteria: 1) there exists an upvoted code answer, and 2) the ground truth code snippet has at least one match in the search corpus. From this pipeline, we obtained 518 questions. Finally, we manually went through these questions and filtered out questions with vague queries and/or code answers. The final dataset contains 287 Stack Overflow question and answer pairs.
NCS / UNIF score sheet. To judge whether a method body correctly answers the query, we compare how similar it is to the Stack Overflow answer. We do this systematically using a code-to-code similarity tool called Aroma [5]. Aroma gives a similarity score between two code snippets; if this score is above a certain threshold (0.25 in our case), we count it as a success. This similarity score aims to mimic manual assessment of the correctness of search results in an automatic and reproducible fashion, removing human judgment from the process. More details on how we chose this threshold can be found in [1].
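Aroma itself is described in [5]; the decision rule above reduces to a threshold test. The sketch below uses a placeholder aroma_similarity function, which is hypothetical and not an actual Aroma API:

# Sketch of the success criterion: a returned method counts as a correct
# answer if its Aroma similarity to the Stack Overflow answer exceeds 0.25.
# `aroma_similarity` is a hypothetical placeholder, not a real Aroma API.
THRESHOLD = 0.25

def first_correct_rank(results, ground_truth, aroma_similarity):
    """Return the 1-based FRank, or None if no returned result is a hit."""
    for rank, snippet in enumerate(results, start=1):
        if aroma_similarity(snippet, ground_truth) > THRESHOLD:
            return rank
    return None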
4 Evaluation
We provide the results for four models: NCS, NCS_postrank, UNIF_android, and UNIF_stackoverflow. Table 1 reports the number of questions answered within the top n returned code snippets, where n = 1, 5, and 10 (Answered@1, 5, and 10 in Table 1), as well as the Mean Reciprocal Rank (MRR).
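Given the per-question FRank values from the score sheet, Answered@n and MRR follow directly. A minimal sketch is shown below; treating "NF" questions as contributing a reciprocal rank of 0 to MRR is our assumption, as the paper does not spell this out:

# Sketch: compute Answered@n and MRR from a list of FRank values, where None
# marks "NF" (no correct answer in the top 50).
def answered_at(franks, n):
    return sum(1 for r in franks if r is not None and r <= n)

def mrr(franks):
    return sum(1.0 / r for r in franks if r is not None) / len(franks)

franks = [None, 17, 2, 1, 5, None]  # the NCS FRank column of Listing 4
print(answered_at(franks, 5), round(mrr(franks), 3))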
References
[1] Jose Cambronero, Hongyu Li, Seohyun Kim, Koushik Sen, and Satish Chandra. When deep learning met code search. CoRR, abs/1905.03813, 2019. URL: https://arxiv.org/abs/1905.03813, arXiv:1905.03813.
[2] Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. Deep code search. In Proceedings of the 40th International Conference on Software Engineering, pages 933–944. ACM, 2018.
[3] GitHub Inc. GitHub REST API v3. URL: https://developer.github.com/v3/search/.
[4] Stack Exchange Inc. Stack Exchange data dump, 2018. CC-BY-SA 3.0. URL: https://archive.org/details/stackexchange.
[5] Sifei Luan, Di Yang, Celeste Barnaby, Koushik Sen, and Satish Chandra. Aroma: Code recommendation via structural code search. CoRR, abs/1812.01158, 2018. URL: http://arxiv.org/abs/1812.01158, arXiv:1812.01158.
[6] Saksham Sachdev, Hongyu Li, Sifei Luan, Seohyun Kim, Koushik Sen, and Satish Chandra. Retrieval on source code: a neural code search. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 31–41. ACM, 2018.