We study the effects of compilation and decompilation on code clone detection in Java. Compilation followed by decompilation canonicalises syntactic changes made to source code and can therefore be used as a form of source code normalisation. We used NiCad to detect clones before and after decompilation in three open source software systems: JUnit, JFreeChart, and Tomcat. We filtered and compared the clones in the original and decompiled clone sets and found that 1,201 clone pairs (78.7%) are common to both sets, while 326 pairs (21.3%) appear in only one of them. A manual investigation identified 325 of the 326 pairs as true clones. The 252 original-only clone pairs contain a single false positive, while the 74 decompiled-only clone pairs are all true positives. Many clones in the original source code that are detected only after decompilation are type-3 clones that are difficult to detect because of added or deleted statements, keywords, or package names; flipped if-else statements; or changed loops. We suggest using decompilation as a normalisation step to complement clone detection. By combining the clones found before and after decompilation, one can achieve higher recall without losing precision.
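The comparison of the two clone sets described above can be illustrated with a minimal Python sketch. It is not the clone mapper tool from this project. It assumes that every clone fragment has already been mapped to an identifier that is stable across the original and decompiled code, and the function names (`normalise_pair`, `compare_clone_sets`, `share`) and the fragment identifiers in the toy data are hypothetical.

```python
def normalise_pair(a, b):
    """Order the two fragment identifiers so (a, b) and (b, a) compare equal."""
    return (a, b) if a <= b else (b, a)


def compare_clone_sets(original_pairs, decompiled_pairs):
    """Split clone pairs into common, original-only, decompiled-only, and combined sets."""
    original = {normalise_pair(a, b) for a, b in original_pairs}
    decompiled = {normalise_pair(a, b) for a, b in decompiled_pairs}
    return {
        "common": original & decompiled,
        "original_only": original - decompiled,
        "decompiled_only": decompiled - original,
        # Union of both sets: the combination suggested for higher recall.
        "combined": original | decompiled,
    }


def share(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1) if whole else 0.0


# Toy data with made-up fragment identifiers (not taken from the study).
original_pairs = [
    ("A.java:10-30", "B.java:5-25"),
    ("C.java:1-20", "D.java:40-60"),
]
decompiled_pairs = [
    ("B.java:5-25", "A.java:10-30"),  # same pair, reported in the opposite order
    ("E.java:7-19", "F.java:3-15"),
]

result = compare_clone_sets(original_pairs, decompiled_pairs)
for name, pairs in result.items():
    print(name, len(pairs))

# The percentages reported above follow from the pair counts:
total = 1201 + 326
print(share(1201, total), share(326, total))
```

Treating a clone pair as unordered matters here, because a detector may report the same two fragments in either order in the two runs.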
The 326 manually validated clone pairs can be downloaded here.
The source code of the clone mapper tool can be downloaded here.
The three selected systems for our study (junit4, jfreechart, tomcat) can be downloaded below:
git clone https://github.com/cragkhit/crjk-iwsc17.git
Please contact Chaiyong Ragkhitwetsagul (ucabagk at ucl dot ac dot uk) for any inquiries regarding this project.