
Example of using the "pyensembl" module in a Jupyter notebook (Anaconda) [Python]

25.09.2019     erwan06      Anaconda Jupyter pyensembl 

We compare transcript references (Ensembl GRCh38 identifiers) against the following reference file, "VariantsNGS.tsv":
Gene name Accession Number
VHL ENST00000256474
SETD2 ENST00000409792
PBRM1 ENST00000337303
FHIT ENST00000341848
RASSF1 ENST00000357043
KDM5C ENST00000375401
MITF ENST00000394351
BAP1 ENST00000460680
KDM6A ENST00000377967
ABL1 ENST00000318560
AKT1 ENST00000349310
ALK ENST00000389048
APC ENST00000457016
ATM ENST00000278616
BRAF ENST00000288602
CDH1 ENST00000261769
CDKN2A ENST00000304494
CSF1R ENST00000286301
CTNNB1 ENST00000349496
DDR2 ENST00000367922
EGFR ENST00000275493
ERBB2 ENST00000269571
ERBB4 ENST00000342788
EZH2 ENST00000320356
FBXW7 ENST00000281708
FGFR1 ENST00000447712
FGFR2 ENST00000358487
FGFR3 ENST00000440486
FLT3 ENST00000241453
GNA11 ENST00000078429
GNAQ ENST00000286548
GNAS ENST00000371085
HNF1A ENST00000257555
HRAS ENST00000397596
IDH1 ENST00000345146
IDH2 ENST00000330062
JAK2 ENST00000381652
JAK3 ENST00000458235
KDR ENST00000263923
KIT ENST00000288135
KRAS ENST00000311936
MAP2K1 ENST00000307102
MET ENST00000318493
MLH1 ENST00000231790
MPL ENST00000372470
NOTCH1 ENST00000277541
NPM1 ENST00000517671
NRAS ENST00000369535
PDGFRA ENST00000257290
PIK3CA ENST00000263967
PTEN ENST00000371953
PTPN11 ENST00000351677
RB1 ENST00000267163
RET ENST00000355710
SMAD4 ENST00000342988
SMARCB1 ENST00000263121
SMO ENST00000249373
SRC ENST00000445403
STK11 ENST00000326873
TP53 ENST00000269305
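
A minimal sketch of loading this two-column table into a dictionary, assuming the file is tab-separated with a single header row (the notebook below splits on tabs and skips the first line). The helper name `read_reference` is hypothetical.

```python
import csv
import io


def read_reference(handle):
    """Map gene name -> reference transcript ID from a tab-separated table.

    Assumes one header row followed by rows of (gene name, accession number).
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    reference = {}
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or incomplete lines
        gene, transcript_id = row[0].strip(), row[1].strip()
        reference[gene] = transcript_id
    return reference


# Small in-memory sample using the first rows of the table above.
sample = io.StringIO(
    "Gene name\tAccession Number\n"
    "VHL\tENST00000256474\n"
    "SETD2\tENST00000409792\n"
)
reference = read_reference(sample)
print(reference["VHL"])
```

With the real file, the same helper could be called as `read_reference(open("VariantsNGS.tsv", encoding="utf8", newline=""))`, ideally inside a `with` block so the file is closed afterwards.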

#!/usr/bin/env python
# coding: utf-8

# # Installing PyEnsembl
# `PyEnsembl` is a Python interface to Ensembl reference genome metadata such as exons and transcripts. `PyEnsembl` downloads GTF and FASTA files from the Ensembl FTP server and loads them into a local database. `PyEnsembl` can also work with custom reference data specified using user-supplied GTF and FASTA files.
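
A minimal sketch of that download-and-index step done from Python rather than with the `pyensembl install` command. It assumes network access and that release 78 (a GRCh38 release) is the one wanted. The helper name `prepare_release` is hypothetical, and the import is kept inside the function so the cell can be defined before the package is installed.

```python
def prepare_release(release=78, species="human"):
    """Download and index one Ensembl release, then return the genome object."""
    # Imported lazily: requires `pip install pyensembl`.
    from pyensembl import EnsemblRelease

    genome = EnsemblRelease(release, species=species)
    genome.download()  # fetch the GTF and FASTA files if they are not cached yet
    genome.index()     # build the local database from the downloaded files
    return genome
```

Calling `prepare_release()` once is enough; later sessions reuse the cached files.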

# In[52]:

import pyensembl

# In[53]:

from pyensembl import EnsemblRelease

# In[54]:

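# Ensembl release 78 is a GRCh38 release; `data` is not used again below,
# where the lookups go through pyensembl.ensembl_grch38 instead
data = EnsemblRelease(78)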

# In[55]:

import os, sys

# In[56]:


# In[57]:

with open("TruSeq.txt", 'r', encoding='utf8') as f:
    # read the whole panel file, then close it
    L = f.readlines()

# In[59]:


# In[60]:

genes = [line.rstrip("\n") for line in L[1:]]  # skip the header line

# In[61]:

liste = [gene.split('\t') for gene in genes]  # split each line on tabs

# In[62]:

liste_simple = []
for fields in liste:
    liste_simple.extend(fields)  # flatten into a single list of names

# In[63]:


# ## get all gene IDs from the TruSeq Amplicon Cancer Panel

# In[64]:

tTruSeq = []

# the with-block closes the output file so every line is flushed to disk
with open("TruSeq_geneIDs.txt", 'w', encoding='utf8') as f:
    for nom in liste_simple:
        gene_courant = str(nom).rstrip()
        # genes_by_name returns a list of Gene objects; keep the first match
        gene = pyensembl.ensembl_grch38.genes_by_name(gene_courant)[0]
        f.write("gene: {} , gene_ID : {} \n".format(gene_courant, gene.gene_id))
        # prepend the gene ID and its list of Transcript objects
        tTruSeq = [gene.gene_id, gene.transcripts] + tTruSeq
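
The loop above stops at the first name that Ensembl does not know. A minimal sketch of a more tolerant lookup follows. It assumes that an unknown name shows up either as a `ValueError` or as an empty result list, and it takes the lookup function as a parameter so it can be tried without the Ensembl data installed. `collect_gene_ids`, `FakeGene` and `fake_lookup` are hypothetical names, and the gene ID in the demo is a placeholder, not a real Ensembl ID.

```python
from collections import namedtuple


def collect_gene_ids(names, genes_by_name):
    """Return ({name: gene_id}, [names not found]).

    `genes_by_name` is any callable that returns a list of objects with a
    `gene_id` attribute, e.g. pyensembl.ensembl_grch38.genes_by_name.
    """
    found, missing = {}, []
    for raw in names:
        name = raw.strip()
        if not name:
            continue  # skip empty fields left over from splitting
        try:
            matches = genes_by_name(name)
        except ValueError:
            matches = []  # assumption: unknown names raise ValueError
        if matches:
            found[name] = matches[0].gene_id  # keep the first match, as above
        else:
            missing.append(name)
    return found, missing


# Stand-in for the Ensembl lookup so the example runs offline.
FakeGene = namedtuple("FakeGene", "gene_id")
known = {"VHL": [FakeGene("GENE_ID_VHL")]}


def fake_lookup(name):
    if name not in known:
        raise ValueError("Gene name not found: " + name)
    return known[name]


found, missing = collect_gene_ids(["VHL", "NOT_A_GENE", ""], fake_lookup)
print(found, missing)
```

In the notebook, the call would be `collect_gene_ids(liste_simple, pyensembl.ensembl_grch38.genes_by_name)`, and `missing` then lists the names that need a manual check.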

# In[65]:

# output file for the transcript listing; nothing is written to it yet
f = open("genotype_transcripts.txt", 'w', encoding='utf8')
f.close()

# ## compare genotype Transcript IDs vs NGS phenotypic transcripts (VariantsNGS.tsv)

# In[66]:

import os, sys

# In[67]:

import pyensembl

# In[68]:


# In[69]:

with open("VariantsNGS.tsv", 'r', encoding='utf8') as f:
    # read the reference table, then close the file
    L2 = f.readlines()

# In[71]:


# In[72]:

genes = [line.rstrip("\n") for line in L2[1:]]  # skip the header line

# In[73]:

liste2 = [gene.split('\t') for gene in genes]  # [gene name, transcript ID] per line

# In[74]:

liste_simple2 = []
for fields in liste2:
    liste_simple2.extend(fields)  # flatten: name, ID, name, ID, ...

# In[75]:


# ## manual cross-check of each transcript ID

# In[76]:

VHL = pyensembl.ensembl_grch38.genes_by_name("VHL")

# In[77]:

tVHL = VHL[0].transcripts

# In[78]:


# same transcript ID found for VHL

# In[79]:

SETD2 = pyensembl.ensembl_grch38.genes_by_name("SETD2")
tSETD2 = SETD2[0].transcripts

# different transcript IDs found for SETD2: ENST00000431180 != ENST00000409792

# automating the comparison of the two flat lists is not finished yet
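
One way the unfinished comparison could be automated is sketched below. It assumes the reference is available as (gene name, transcript ID) pairs and checks whether each reference transcript appears anywhere among the gene's transcripts, instead of comparing against a single transcript. `compare_transcripts`, `ensembl_transcript_ids` and the stand-in data are hypothetical; only `genes_by_name`, `transcripts` and `transcript_id` come from pyensembl.

```python
def ensembl_transcript_ids(gene_name):
    """Transcript IDs of the first gene matching `gene_name` (needs pyensembl data)."""
    import pyensembl  # imported lazily so the rest runs without the package

    gene = pyensembl.ensembl_grch38.genes_by_name(gene_name)[0]
    return [t.transcript_id for t in gene.transcripts]


def compare_transcripts(reference_pairs, transcript_ids_of=ensembl_transcript_ids):
    """Split reference pairs into those found / not found among the gene's transcripts."""
    same, different = [], []
    for gene_name, reference_id in reference_pairs:
        candidates = transcript_ids_of(gene_name)
        if reference_id in candidates:
            same.append((gene_name, reference_id))
        else:
            # keep the candidates so the mismatch can be inspected by hand
            different.append((gene_name, reference_id, candidates))
    return same, different


# Offline demonstration with made-up transcript lists (not real Ensembl output).
fake_db = {"GENE_A": ["TX_1", "TX_2"], "GENE_B": ["TX_3"]}
same, different = compare_transcripts(
    [("GENE_A", "TX_2"), ("GENE_B", "TX_9")],
    transcript_ids_of=fake_db.__getitem__,
)
print(same)
print(different)
```

With the Ensembl data installed, the pairs could come from the flat list built above, for example `zip(liste_simple2[0::2], liste_simple2[1::2])`, assuming every line of the file has exactly two fields.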
