TF-IDF Inspired Detection for Cross-Language Source Code Plagiarism and Collusion




Source Code Plagiarism and Collusion, Cross-Language Detection, TF-IDF, Computing Education, Information Retrieval


Several computing courses allow students to choose the programming language in which they complete a programming task. This can lead to cross-language code plagiarism and collusion, in which a copied code file is rewritten in another programming language. In response, this paper proposes a detection technique that can accurately compare code files written in different programming languages while requiring only limited development effort to accommodate each language. The only language-dependent component of the technique is the source code tokeniser, and no code conversion is applied. The impact of coincidental similarity is reduced by applying a TF-IDF inspired weighting in which rare matches are prioritised. Our evaluation shows that the technique outperforms techniques commonly used in academia at handling language-conversion disguises. Further, it is comparable to those techniques when dealing with conventional disguises.
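
To make the weighting idea concrete, the following is a minimal sketch of TF-IDF weighted comparison over token n-grams. It is not the paper's implementation: the regex tokeniser, the bigram size, the smoothed IDF formula, and the names `tokenise`, `ngrams`, `tfidf_vectors`, and `cosine` are all assumptions made for illustration. It shows only how matches on n-grams that occur in few files contribute more to the similarity score than matches on n-grams that occur in many.

```python
import math
import re
from collections import Counter


def tokenise(code):
    # Hypothetical stand-in for a language-specific tokeniser:
    # identifiers/keywords, integer literals, and single symbols.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def ngrams(tokens, n=2):
    # Overlapping token n-grams, joined into strings so they can be counted.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, n=2):
    # Term frequency per file, scaled by a smoothed inverse document
    # frequency (an assumed formula), so rare n-grams get larger weights.
    bags = [Counter(ngrams(tokenise(d), n)) for d in docs]
    df = Counter(term for bag in bags for term in bag)
    total = len(docs)
    idf = {t: math.log((1 + total) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in bag.items()} for bag in bags]


def cosine(a, b):
    # Cosine similarity between two sparse weight vectors stored as dicts.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy corpus: a Java-like fragment, a Python-like rewrite, an unrelated file.
docs = [
    "int total = 0; for (int i = 0; i < n; i++) { total += price[i] * qty[i]; }",
    "total = 0\nfor i in range(n):\n    total += price[i] * qty[i]",
    "print('hello world')",
]
vectors = tfidf_vectors(docs)
rewrite_score = cosine(vectors[0], vectors[1])
unrelated_score = cosine(vectors[0], vectors[2])
```

In this toy corpus the Java-like and Python-like fragments share several bigrams (for example the accumulation over `price` and `qty`), so `rewrite_score` exceeds `unrelated_score`. A real system would swap the regex for proper per-language tokenisers and would need to map language-specific tokens to a shared vocabulary, which this sketch does not attempt.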








How to Cite

Karnalim, O. (2020). TF-IDF Inspired Detection for Cross-Language Source Code Plagiarism and Collusion. Computer Science, 21(1).