TF-IDF Inspired Detection for Cross-Language Source Code Plagiarism and Collusion

Authors

DOI:

https://doi.org/10.7494/csci.2020.21.1.3389

Keywords:

Source Code Plagiarism and Collusion, Cross-Language Detection, TF-IDF, Computing Education, Information Retrieval

Abstract

Several computing courses allow students to choose which programming language they use to complete a programming task. This can lead to cross-language code plagiarism and collusion, in which a copied code file is rewritten in another programming language. In response, this paper proposes a detection technique that accurately compares code files written in various programming languages while requiring only limited effort to accommodate each language at the development stage. The only language-dependent component of the technique is the source code tokeniser, and no code conversion is applied. The impact of coincidental similarity is reduced by applying a TF-IDF inspired weighting in which rare matches are prioritised. Our evaluation shows that the technique outperforms techniques commonly used in academia at handling language-conversion disguises, and that it is comparable to those techniques when dealing with conventional disguises.
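The idea of prioritising rare matches can be illustrated with a minimal Python sketch. It is not the paper's implementation: the regex tokeniser, the smoothed IDF formula, and the cosine comparison are assumptions chosen for illustration, and the function names `tokenise`, `idf_weights`, and `weighted_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenise(code):
    """Hypothetical language-agnostic tokeniser (an assumption, not the paper's):
    identifiers, integer literals, and single non-space symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def idf_weights(token_lists):
    """Assumed smoothed IDF: tokens found in fewer files receive larger weights."""
    n = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each token once per file
    return {tok: math.log(n / df) + 1.0 for tok, df in doc_freq.items()}


def weighted_similarity(tokens_a, tokens_b, idf):
    """Cosine similarity between TF-IDF weighted token-count vectors."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    shared = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] * idf[t] ** 2 for t in shared)
    norm_a = math.sqrt(sum((c * idf[t]) ** 2 for t, c in tf_a.items()))
    norm_b = math.sqrt(sum((c * idf[t]) ** 2 for t, c in tf_b.items()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    # Illustrative inputs only: a Java loop, a Python rewrite, an unrelated file.
    sources = {
        "a.java": "int total = 0; for (int i = 0; i < n; i++) { total += i; }",
        "b.py": "total = 0\nfor i in range(n):\n    total += i",
        "c.py": "print('hello')",
    }
    tokens = {name: tokenise(code) for name, code in sources.items()}
    idf = idf_weights(list(tokens.values()))
    for x, y in [("a.java", "b.py"), ("a.java", "c.py")]:
        score = weighted_similarity(tokens[x], tokens[y], idf)
        print(f"{x} vs {y}: {score:.3f}")
```

In this sketch, symbols such as parentheses that appear in every file receive the minimum weight, so they contribute little to the score, while tokens shared by only a few files dominate it. A tokeniser built per language would replace the regex without changing the weighting step.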

Published

2020-01-27

How to Cite

Karnalim, O. (2020). TF-IDF Inspired Detection for Cross-Language Source Code Plagiarism and Collusion. Computer Science, 21(1). https://doi.org/10.7494/csci.2020.21.1.3389
