Pubmed Data Mining Tools

Scope and license

This tool, Gene set to diseases, computes disease enrichment analysis on gene sets using biomedical literature data. Please note that returned results are only predictions and not curated data. License: free for non-commercial use; special conditions for commercial use or non-academic users (contact us for details).

Citation

This tool was created by Dr. Jean Fred Fontaine and Prof. Miguel A. Andrade-Navarro from the Johannes Gutenberg University of Mainz, Germany. Please cite: Andrade-Navarro MA, Fontaine JF (2016). Gene Set to Diseases (GS2D): Disease Enrichment Analysis on Human Gene Sets with Literature Data. Genomics and Computational Biology, 2(1): e33. Accessible from http://cbdm.uni-mainz.de/geneset2diseases

Data

The data comes from NCBI (e.g. Entrez Gene) and NLM databases (e.g. PubMed and MeSH). Genes are identified by official gene symbols or Entrez Gene IDs. Diseases are identified by MeSH C terms. Disease-related citations are retrieved from annotations in PubMed and gene-related citations are retrieved from the Entrez Gene database. This tool may not reflect the most current and accurate biomedical/scientific data available from NLM, although PubMed data is updated daily and the other databases several times a year.

Parameters

Type of analysis: gene set to disease

From a gene set as input, the tool outputs a list of diseases enriched in the literature of those genes taken as a set. Genes must be listed using official gene symbols or Entrez Gene IDs. The search box facilitates the retrieval of gene symbols with an auto-completion feature.

Type of analysis: gene to disease / disease to gene

For each gene or disease provided as input, the tool returns corresponding individual gene-disease associations. Associations are defined as significant co-occurrences of gene and disease mentions in annotations of PubMed citations. Diseases must be listed using exact MeSH terms. The search box facilitates the retrieval of such terms with an auto-completion feature. This type of analysis is automatically selected for single-gene queries of a gene set analysis.

Search box

The search box finds genes or diseases having at least 1 significant association in the dataset. Official gene symbols or exact MeSH terms are retrieved. Selecting an item from the suggestions copies it to the list of items for the query.

Min number of disease-related citations for a gene (min=3)

By increasing the required minimum number of disease-related citations for a gene, the selection of individual gene-disease associations should be more precise but less sensitive. By decreasing this number, the selection of individual gene-disease associations should be less precise but more sensitive.

For a gene set, min number of genes significantly associated with a disease

By increasing the required minimum number of genes from the gene set to be significantly associated with a disease, the computed list of enriched diseases should be more precise but less sensitive. By decreasing this number, the computed list of enriched diseases should be less precise but more sensitive.

Max False Discovery Rate (FDR)

Filters the computed list of enriched diseases for a gene set by FDR. The lower the threshold, the more confident the results.

Max number of rows to display

Limits the number of rows in the first view of the HTML output table. All results are still available using a navigation system.

Type of output

The Web page output provides an interactive table; the tab-separated values (TSV) output is better suited for import into external software.

Output pages

The computation results in a table of diseases enriched for a gene set (type of analysis Gene set to disease) or in a table of significant gene-disease co-occurrences from the literature (type of analysis Gene to disease). The table may be divided into several pages accessible from a navigation bar at the bottom right-hand corner, rows can be filtered using the Search box at the top left-hand corner, and columns can be sorted by clicking on their headers (shift+click on another column header for multi-column sorting). The default sorting is by P-value, although sorting by gene-disease citations could be more relevant if an FDR cut-off is already applied (e.g. FDR<0.05). The table is followed by a legend and a reminder of the input parameters.
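The sorting advice above can also be applied offline to exported rows. The following is a minimal sketch, assuming the rows have already been loaded into dictionaries; the key names (`gene_disease_citations`, `p_value`, `fdr`) and all values are hypothetical placeholders, not the tool's actual field names or results.

```python
# Hypothetical rows standing in for a gene-to-disease result table.
# Key names and values are illustrative assumptions only.
ROWS = [
    {"gene": "GENE1", "disease": "Disease A",
     "gene_disease_citations": 12, "p_value": 1e-4, "fdr": 0.010},
    {"gene": "GENE2", "disease": "Disease A",
     "gene_disease_citations": 40, "p_value": 5e-3, "fdr": 0.030},
    {"gene": "GENE3", "disease": "Disease B",
     "gene_disease_citations": 7, "p_value": 2e-2, "fdr": 0.200},
]


def rank_associations(rows, max_fdr=0.05):
    """Keep rows with FDR below the cut-off, then sort them by number of
    gene-disease citations (descending), using the P-value to break ties."""
    kept = [row for row in rows if row["fdr"] < max_fdr]
    return sorted(
        kept,
        key=lambda row: (-row["gene_disease_citations"], row["p_value"]),
    )


if __name__ == "__main__":
    for row in rank_associations(ROWS):
        print(row["gene"], row["disease"], row["gene_disease_citations"])
```

Once the FDR filter has removed weak associations, the citation count ranks the remaining rows by the amount of supporting literature rather than by statistical significance alone.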

Type of analysis: gene set to disease

When querying a gene set in order to get a list of enriched diseases, the output page shows a table of diseases with the following columns:
  • Disease: disease term from the MeSH vocabulary (links to related MeSH database entry)
  • Genes count: number of input genes significantly associated with the disease in the literature (you may filter out low counts < 5)
  • Genes percentage: number of input genes significantly associated with the disease in the literature / number of input genes
  • Fold change: (number of input genes significantly associated with the disease in the literature / number of input genes) / (total number of genes significantly associated with the disease in the literature / total number of genes)
  • P-value: computed by a Fisher's exact test (may not be reliable for low counts < 5)
  • FDR: False Discovery Rate computed by the Benjamini-Hochberg method
  • Gene symbols: list of symbols of input genes significantly associated with the disease and, in superscript, numbers of relevant citations in the literature (links to a max of 200 citations in PubMed)
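The fold change and P-value columns above can be illustrated with a short calculation. This is a sketch under stated assumptions, not the tool's implementation: the documentation only says that a Fisher's exact test is used, so the one-sided (enrichment) alternative and the way the counts are arranged here are assumptions, and the function names and example counts are hypothetical.

```python
from math import comb


def fold_change(k, n, K, N):
    """Fold change as defined in the column list: (k / n) / (K / N).

    k: input genes significantly associated with the disease
    n: number of input genes
    K: all genes significantly associated with the disease
    N: total number of genes
    """
    return (k / n) / (K / N)


def fisher_right_tail(k, n, K, N):
    """One-sided Fisher's exact test P-value (assumed alternative: enrichment).

    Equals P(X >= k) where X is hypergeometric: the number of
    disease-associated genes in a random draw of n genes out of N,
    of which K are disease-associated.
    """
    upper = min(n, K)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / comb(N, n)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the functions.
    k, n, K, N = 4, 10, 50, 20000
    print("fold change:", fold_change(k, n, K, N))
    print("P-value:", fisher_right_tail(k, n, K, N))
```

With very small counts the tail sum has only a few terms, which is one way to see why the P-value may not be reliable for counts below 5.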
Click on the following link to see an example: example output web page

The TSV text version of this table contains a differently formatted 'Gene symbols' column and 2 additional columns, as follows:
  • Gene symbols: pipe-separated list of symbols of input genes significantly associated with the disease
  • Entrez Gene IDs: pipe-separated list of Entrez IDs of input genes significantly associated with the disease. Items of this list correspond to gene symbols at the same position in the list of gene symbols.
  • PMIDs: pipe-separated list of comma-separated lists of PMIDs of relevant citations in the literature for each gene. Comma-separated lists correspond to gene symbols at the same position in the list of gene symbols.

Click on the following link to see an example: example output text page
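The positional correspondence between the pipe-separated columns can be unpacked as follows. This is a minimal sketch: the header strings in `SAMPLE` are assumed to match the column names listed above and may differ in the real file, and the sample row contains placeholder values, not real genes or PMIDs.

```python
import csv
import io

# Hypothetical TSV content. Header names are assumed to match the column
# names described in the documentation; all values are placeholders.
SAMPLE = (
    "Disease\tGene symbols\tEntrez Gene IDs\tPMIDs\n"
    "Example Disease\tGENE1|GENE2\t111|222\t1001,1002|2001\n"
)


def parse_gene_set_tsv(handle):
    """Yield (disease, genes) pairs, where genes is a list of dictionaries
    built by aligning the three pipe-separated columns by position."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        symbols = row["Gene symbols"].split("|")
        entrez_ids = row["Entrez Gene IDs"].split("|")
        # Each pipe-separated item is itself a comma-separated PMID list.
        pmid_lists = [
            chunk.split(",") if chunk else []
            for chunk in row["PMIDs"].split("|")
        ]
        genes = [
            {"symbol": symbol, "entrez_id": entrez_id, "pmids": pmids}
            for symbol, entrez_id, pmids in zip(symbols, entrez_ids, pmid_lists)
        ]
        yield row["Disease"], genes


if __name__ == "__main__":
    for disease, genes in parse_gene_set_tsv(io.StringIO(SAMPLE)):
        print(disease, genes)
```

To read a downloaded file instead of the in-memory sample, pass an open file handle (opened with `newline=""`) to `parse_gene_set_tsv`.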

Type of analysis: gene to disease / disease to gene

When querying a gene list or a disease list in order to get all related gene-disease associations, the output page shows a table with the following columns:
  • Gene symbol: official gene symbol linked to Entrez Gene
  • Disease: disease term from the MeSH vocabulary (links to related MeSH database entry)
  • Gene-disease citations: number of gene-related citations associated with the disease in the literature (you may filter out low counts < 5) (links to a max of 200 relevant citations)
  • Gene citations: number of gene-related citations
  • Fold change: (number of gene-related citations associated with the disease in the literature / number of gene citations) / (number of disease-related citations / total number of citations)
  • P-value: computed by a Fisher's exact test (may not be reliable for low counts < 5)
  • FDR: False Discovery Rate computed by the Benjamini-Hochberg method
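The FDR column is described as a Benjamini-Hochberg adjustment of the P-values. The standard procedure can be sketched as below; this is an illustration of the textbook method with a hypothetical function name and made-up P-values, not code taken from the tool.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order.

    Each P-value is multiplied by m / rank (rank 1 = smallest P-value),
    then a running minimum from the largest rank downwards enforces
    monotonicity and caps values at 1.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical P-values for four gene-disease associations.
    example = [0.01, 0.04, 0.03, 0.5]
    for p, q in zip(example, benjamini_hochberg(example)):
        print(f"P-value={p:.3f}  FDR={q:.4f}")
```

Applying a cut-off such as FDR<0.05 to the adjusted values then corresponds to the Max False Discovery Rate parameter.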

Genes from this table can be analyzed as a set to find enriched diseases by clicking on the button below the table labeled 'Submit all genes to Gene set → diseases enrichment analysis'.

Click on the following link to see an example: example output web page for a gene to disease analysis for 5 genes.
Click on the following link to see an example: example output web page for a disease to gene analysis for 1 disease.

The TSV text version of this table contains 2 additional columns, as follows:
  • Entrez Gene ID: Entrez Gene ID of the gene.
  • PMIDs: comma-separated list of PMIDs of relevant citations in the literature.
Click on the following link to see an example: example output text page for a gene to disease analysis for 5 genes.
Click on the following link to see an example: example output text page for a disease to gene analysis for 1 disease.

Copyright

Copyright Fontaine 2015. Data processed by this tool are provided by NCBI. Please see NCBI's Disclaimer and Copyright notice. MeSH Headings are from MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine.

Disclaimer

THIS SERVICE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 'AS IS' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SERVICE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.