Statistical Unigram Analysis for Source Code Repository

Weifeng Xu, Dianxiang Xu, Omar A. El Ariss, Yunkai Liu, Abdulrahman Alatawi

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

The unigram is the fundamental element of n-gram models in natural language processing. However, unigrams collected from a natural language corpus are unsuitable for solving problems in the domain of computer programming languages. In this paper, we analyze the properties of unigrams collected from an ultra-large source code repository. Specifically, we collected 1.01 billion unigrams from 0.7 million open source projects hosted at GitHub.com. By analyzing these unigrams, we discovered statistical patterns regarding (1) how developers name variables, methods, and classes, and (2) how developers choose abbreviations. Our study describes a probabilistic model for solving a well-known problem in source code analysis: how to expand a given abbreviation to its original intended word. It shows that unigrams collected from source code repositories are an essential resource for solving domain-specific problems.
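
The abstract does not give the details of the probabilistic model, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that unigrams are obtained by splitting identifiers on camelCase and snake_case boundaries, that a candidate expansion is any longer word sharing the abbreviation's first letter and containing its letters in order, and that candidates are ranked by relative unigram frequency. The function names (split_identifier, is_abbreviation_of, expand) and the toy identifier list are hypothetical.

```python
import re
from collections import Counter


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase unigrams."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [p.lower() for p in parts]


def is_abbreviation_of(abbr, word):
    """Assumed rule: same first letter, shorter, letters appear in order."""
    if not abbr or not word or abbr[0] != word[0] or len(abbr) >= len(word):
        return False
    remaining = iter(word)
    # 'ch in remaining' consumes the iterator, so order is enforced.
    return all(ch in remaining for ch in abbr)


def expand(abbr, counts):
    """Rank candidate expansions by their share of the candidate counts."""
    candidates = {w: c for w, c in counts.items() if is_abbreviation_of(abbr, w)}
    total = sum(candidates.values())
    ranked = sorted(candidates.items(), key=lambda item: (-item[1], item[0]))
    return [(word, count / total) for word, count in ranked]


# Toy corpus standing in for identifiers mined from a repository (hypothetical).
identifiers = ["messageCount", "msg_count", "sendMessage",
               "message_queue", "massageInput", "maxValue"]
counts = Counter(u for name in identifiers for u in split_identifier(name))

print(expand("msg", counts))
```

On this toy corpus, "message" outranks "massage" as the expansion of "msg" simply because it occurs more often among the collected unigrams; a natural-language corpus could rank such candidates quite differently, which is the motivation the abstract gives for building source-code-specific unigram statistics.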

Original language: English (US)
Title of host publication: Proceedings - 2017 IEEE 3rd International Conference on Multimedia Big Data, BigMM 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1-8
Number of pages: 8
ISBN (Electronic): 9781509065493
DOI: https://doi.org/10.1109/BigMM.2017.13
State: Published - Jun 30 2017
Event: 3rd IEEE International Conference on Multimedia Big Data, BigMM 2017 - Laguna Hills, United States
Duration: Apr 19 2017 to Apr 21 2017

Fingerprint

Computer programming languages
Statistical methods
Processing
Statistical Models

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing
  • Media Technology

Cite this

Xu, W., Xu, D., El Ariss, O. A., Liu, Y., & Alatawi, A. (2017). Statistical Unigram Analysis for Source Code Repository. In Proceedings - 2017 IEEE 3rd International Conference on Multimedia Big Data, BigMM 2017 (pp. 1-8). [7966707] (Proceedings - 2017 IEEE 3rd International Conference on Multimedia Big Data, BigMM 2017). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BigMM.2017.13