Scope and license
This tool, Gene Set to Diseases (GS2D), computes disease enrichment analysis on gene sets using biomedical literature data. Please note that returned results are only predictions and not curated data. License: free for non-commercial use; special conditions apply for commercial use or non-academic users (contact us for details).
Citation
This tool was created by Dr. Jean Fred Fontaine and Prof. Miguel A. Andrade-Navarro from the Johannes Gutenberg University of Mainz, Germany. Please cite: Andrade-Navarro MA, Fontaine JF (2016). Gene Set to Diseases (GS2D): Disease Enrichment Analysis on Human Gene Sets with Literature Data. Genomics and Computational Biology, 2(1): e33. Accessible from http://cbdm.uni-mainz.de/geneset2diseases
Data
The data comes from NCBI (e.g. Entrez Gene) and NLM databases (e.g. PubMed and MeSH). Genes are identified by official gene symbols or Entrez Gene IDs. Diseases are identified by MeSH C terms. Disease-related citations are retrieved from annotations in PubMed and gene-related citations are retrieved from the Entrez Gene database. This tool may not reflect the most current and accurate biomedical/scientific data available from NLM, although PubMed data is updated daily and other databases several times a year.
Parameters
Type of analysis: gene set to disease
From a gene set as input, the tool outputs a list of diseases enriched in the literature of those genes taken as a set. Genes must be listed using official gene symbols or Entrez Gene IDs. The search box facilitates the retrieval of gene symbols with an auto-completion feature.
Type of analysis: gene to disease / disease to gene
For each gene or disease provided as input, the tool returns corresponding individual gene-disease associations. Associations are defined as significant co-occurrences of gene and disease mentions in annotations of PubMed citations. Diseases must be listed using exact MeSH terms. The search box facilitates the retrieval of such terms with an auto-completion feature. This type of analysis is automatically selected for single-gene queries of a gene set analysis.
Search box
The search box finds genes or diseases having at least 1 significant association in the dataset. Official gene symbols or exact MeSH terms are retrieved. Selecting an item from the suggestions copies it to the list of items for the query.
Min number of disease-related citations for a gene (min=3)
By increasing the required minimum number of disease-related citations for a gene, the selection of individual gene-disease associations should be more precise but less sensitive. By decreasing this number, the selection of individual gene-disease associations should be less precise but more sensitive.
For a gene set, min number of genes significantly associated with a disease
By increasing the minimum number of genes from the gene set that must be significantly associated with a disease, the computed list of enriched diseases should be more precise but less sensitive. By decreasing this number, the computed list of enriched diseases should be less precise but more sensitive.
Max False Discovery Rate (FDR)
Filters the computed list of enriched diseases for a gene set by FDR. The lower the FDR cut-off, the more confident the results.
Max number of rows to display
Limits the number of rows in the first view of the HTML output table. All results are still available using a navigation system.
Type of output
The Web page output provides an interactive table; tab-separated values (TSV) are better suited for import into external software.
Output pages
The computation results in a table of diseases enriched for a gene set (type of analysis Gene set to disease) or in a table of significant gene-disease co-occurrences from the literature (type of analysis Gene to disease). The table may be divided into several pages accessible from a navigation bar at the bottom right-hand corner, rows can be filtered using the Search box at the top left-hand corner, and columns can be sorted by clicking on their headers (shift+click on another column header for multi-column sorting). The default sorting is by P-value, although sorting by gene-disease citations could be more relevant if an FDR cut-off is already applied (e.g. FDR<0.05). The table is followed by a legend and a reminder of the input parameters.
Type of analysis: gene set to disease
When querying a gene set in order to get a list of enriched diseases, the output page shows a table of diseases with the following columns:
- Disease: disease term from the MeSH vocabulary (links to related MeSH database entry)
- Genes count: number of input genes significantly associated with the disease in the literature (you may filter out low counts < 5)
- Genes percentage: number of input genes significantly associated with the disease in the literature / number of input genes
- Fold change: (number of input genes significantly associated with the disease in the literature / number of input genes) / (total number of genes significantly associated with the disease in the literature / total number of genes)
- P-value: computed by a Fisher's exact test (may not be reliable for low counts < 5)
- FDR: False Discovery Rate computed by the Benjamini-Hochberg method
- Gene symbols: list of symbols of input genes significantly associated with the disease and, in superscript, numbers of relevant citations in the literature (200 max.; links to PubMed)
In the text (TSV) version of this table, gene information is given in the following columns:
- Gene symbols: pipe-separated list of symbols of input genes significantly associated with the disease
- Entrez Gene IDs: pipe-separated list of Entrez IDs of input genes significantly associated with the disease. Items of this list correspond to gene symbols at the same position in the list of gene symbols.
- PMIDs: pipe-separated list of comma-separated lists of PMIDs of relevant citations in the literature for each gene. Comma-separated lists correspond to gene symbols at the same position in the list of gene symbols.
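The positional correspondence between the pipe-separated fields can be unpacked with a few lines of Python. This is a minimal sketch, not part of the tool: the function name `parse_gene_set_row` and the field values below are made up for illustration, and it assumes the three fields of one disease row have already been read as strings.

```python
def parse_gene_set_row(symbols, entrez_ids, pmids):
    """Align the three pipe-separated fields of one disease row by position."""
    symbol_list = symbols.split("|")
    id_list = entrez_ids.split("|")
    # Each pipe-separated item is itself a comma-separated list of PMIDs.
    pmid_lists = [group.split(",") if group else [] for group in pmids.split("|")]
    if not (len(symbol_list) == len(id_list) == len(pmid_lists)):
        raise ValueError("fields do not contain the same number of items")
    return [
        {"symbol": s, "entrez_id": i, "pmids": p}
        for s, i, p in zip(symbol_list, id_list, pmid_lists)
    ]


# Illustrative placeholder values, not real tool output.
genes = parse_gene_set_row("GENE_A|GENE_B", "1001|1002", "11111,22222|33333")
for gene in genes:
    print(gene["symbol"], gene["entrez_id"], len(gene["pmids"]))
```

Checking that the three lists have the same length guards against rows that were truncated or split incorrectly during import.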
Click on the following link to see an example: example output text page
Type of analysis: gene to disease / disease to gene
When querying a gene list or a disease list in order to get all related gene-disease associations, the output page shows a table with the following columns:
- Gene symbol: official gene symbol linked to Entrez Gene
- Disease: disease term from the MeSH vocabulary (links to related MeSH database entry)
- Gene-disease citations: number of gene-related citations associated with the disease in the literature (you may filter out low counts < 5) (links to a max of 200 relevant citations)
- Gene citations: number of gene-related citations
- Fold change: (number of gene-related citations associated with the disease in the literature / number of gene citations) / (number of disease-related citations / total number of citations)
- P-value: computed by a Fisher's exact test (may not be reliable for low counts < 5)
- FDR: False Discovery Rate computed by the Benjamini-Hochberg method
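The statistics in these columns can be reproduced approximately from four citation counts. The sketch below is an illustration only and not the tool's own code. It assumes a one-sided (enrichment) Fisher's exact test, which the documentation does not specify, and all function names and counts are hypothetical. The exact arithmetic with `math.comb` is practical only for small counts; for PubMed-scale totals a statistics library such as SciPy (`scipy.stats.fisher_exact`) would be the usual choice.

```python
from math import comb


def fold_change(gene_disease, gene, disease, total):
    """(gene_disease / gene) / (disease / total), as in the Fold change column."""
    return (gene_disease / gene) / (disease / total)


def fisher_one_sided(gene_disease, gene, disease, total):
    """P(X >= gene_disease) under the hypergeometric null (assumed enrichment tail)."""
    upper = min(gene, disease)
    numerator = sum(
        comb(disease, k) * comb(total - disease, gene - k)
        for k in range(gene_disease, upper + 1)
    )
    return numerator / comb(total, gene)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: 10 citations mention both the gene and the disease,
# out of 50 gene citations, 200 disease citations and 10000 citations overall.
print(fold_change(10, 50, 200, 10000))
print(fisher_one_sided(10, 50, 200, 10000))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

The adjusted values returned by `benjamini_hochberg` play the role of the FDR column: keeping rows whose adjusted value is below a cut-off such as 0.05 controls the expected proportion of false discoveries among the kept rows.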
Genes from this table can be analyzed as a set to find enriched diseases by clicking on the button below the table labeled 'Submit all genes to Gene set → diseases enrichment analysis'. Click on the following links to see examples: example output web page for a gene to disease analysis for 5 genes; example output web page for a disease to gene analysis for 1 disease. The text (TSV) version of this table contains 2 additional columns as follows:
- Entrez Gene ID: Entrez Gene ID of the gene.
- PMIDs: comma-separated list of PMIDs of relevant citations in the literature.
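A downloaded TSV file of gene-disease associations can be post-filtered in external software, for example with Python's standard `csv` module. The following is a minimal sketch under stated assumptions: the column header names (`FDR`, `Gene-disease citations`, `PMIDs`) are guesses based on the column descriptions above and may differ in the actual file, and the sample rows are placeholders rather than real results.

```python
import csv
import io

# Assumed header names; adjust them to match the actual TSV file.
COL_FDR = "FDR"
COL_COUNT = "Gene-disease citations"
COL_PMIDS = "PMIDs"


def filter_associations(tsv_text, max_fdr=0.05, min_citations=5):
    """Keep rows with FDR below max_fdr and at least min_citations citations."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        if float(row[COL_FDR]) < max_fdr and int(row[COL_COUNT]) >= min_citations:
            # Turn the comma-separated PMIDs field into a list.
            row[COL_PMIDS] = row[COL_PMIDS].split(",") if row[COL_PMIDS] else []
            kept.append(row)
    return kept


# Placeholder rows for illustration only.
sample = (
    "Gene symbol\tDisease\tGene-disease citations\tFDR\tEntrez Gene ID\tPMIDs\n"
    "GENE_A\tDisease X\t12\t0.001\t1001\t11111,22222\n"
    "GENE_B\tDisease X\t3\t0.2\t1002\t33333\n"
)
for row in filter_associations(sample):
    print(row["Gene symbol"], row["Disease"], row[COL_PMIDS])
```

The default thresholds mirror the suggestions in the column descriptions (filtering out counts below 5 and applying an FDR cut-off of 0.05); both are parameters so they can be tightened or relaxed.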