GDPKG-LLM: Integrating Gene, Disease, and Pharmacogenomics Knowledge Graphs for Cognitive Neuroscience Using Large Language Models
DOI:
https://doi.org/10.7494/csci.2025.26.3.6673

Abstract
Using large language models (LLMs) to build knowledge graphs that capture the relationships between entities in the cognitive and biological sciences has become an active area of research. Because of the vast implicit knowledge involved and the deep interconnections within this domain, traditional machine learning and deep learning approaches are insufficient. The main goal of this study is to create a comprehensive, integrated knowledge graph (KG) by combining three knowledge sources: Gene Ontology (GO), Disease Ontology (DO), and PharmKG. Large language models were used to construct this knowledge base, whose main purpose is to capture the relationships between genes, diseases, and drugs. The proposed approach, called GDPKG-LLM, comprises several key steps, including entity matching, similarity analysis, graph alignment, and the use of GPT-4. GDPKG-LLM extracted more than 16,800 nodes and 838,000 edges from these three knowledge bases, yielding a rich KG. The resulting graph provides meaningful relationships, making it a valuable resource for future research in personalized medicine and neuroscience. The evaluation criteria reviewed show the superiority of GDPKG-LLM, which strengthens the validity of the model.
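The entity-matching step described above can be illustrated with a minimal sketch: entities from two knowledge sources are embedded as vectors and aligned by cosine similarity above a threshold. This is only a toy illustration, not the paper's implementation — the function names, the threshold, and the character-trigram "embedding" (standing in for the LLM-based representations the paper actually uses) are all assumptions for the example.

```python
from collections import Counter
from math import sqrt

def ngram_vector(name, n=3):
    # Bag of character trigrams: a lightweight stand-in for an
    # LLM-derived embedding (hypothetical; the paper uses GPT-4).
    s = f"  {name.lower()}  "
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(c * v[k] for k, c in u.items() if k in v)
    nu = sqrt(sum(c * c for c in u.values()))
    nv = sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def match_entities(source_terms, target_terms, threshold=0.5):
    # Align each source entity with its most similar target entity,
    # keeping only matches above the similarity threshold.
    target_vecs = {t: ngram_vector(t) for t in target_terms}
    matches = {}
    for s in source_terms:
        sv = ngram_vector(s)
        best, best_sim = None, threshold
        for t, tv in target_vecs.items():
            sim = cosine(sv, tv)
            if sim > best_sim:
                best, best_sim = t, sim
        if best is not None:
            matches[s] = best
    return matches

# Toy GO-style and DO-style term labels (illustrative only).
go_terms = ["dopamine receptor binding", "synaptic transmission"]
do_terms = ["dopamine receptor activity", "neurotransmission disorder"]
print(match_entities(go_terms, do_terms))
```

In the full pipeline such pairwise matches would feed the subsequent similarity-analysis and graph-alignment steps, merging matched entities into a single node of the integrated KG.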
Copyright (c) 2025 Computer Science

This work is licensed under a Creative Commons Attribution 4.0 International License.