PubMed to spreadsheet made easy

August 15, 2010, 6:31 pm

Update, September 2010: This post refers to an Alpha version of PubMed2XL. You can get the latest version of the software here.

…

Some time ago – exactly a year ago, actually! – I shared a post on how to use XSLT to turn a PubMed XML file into an HTML table and in turn paste that into Microsoft Excel or OpenOffice Calc.

That's fine and all but that's still too "techy" for the average bear who just wants to get a list of articles into a spreadsheet. So, I've been working on some software called PubMed2XL to make the job super simple.

PubMed2XL's a GUI program written in Python and it uses PyQT:

… a set of Python bindings for Nokia's Qt application framework and runs on all platforms supported by Qt including Windows, MacOS/X and Linux.

Since the program's still in early stages there's no real documentation but if you want to just play around with it and you use Windows you can get it here. ~~If it doesn't work, it's probably because you need a file called MSVCR71.dll which I can't legally distribute but I think you can find it if you are resourceful~~.

Basically all you need to do is this:

Conduct searches in PubMed.
Send your articles to the Clipboard.
Send the results to "File" as XML.
Save the file as "pubmed_results.txt" which is the default name – of course, you can call the file something else if you want as long as it ends in ".txt" or ".xml".
Click on the file called PubMed2XL.exe and then choose FILE>SELECT PUBMED FILE.
Then "open" the file you downloaded from PubMed (pubmed_results.txt).
You should now see an XLS (Microsoft Excel) file in the same folder as pubmed_results.txt.

That should pretty much be it. And by the way the Help currently just points your browser to blog.humaneguitarist.org because, um, there's no help documentation yet.

If you're curious how this all works in the very general sense, I'm using a home-grown XML setup file (see below) that tells PubMed2XL which element or attribute value to extract from the pubmed_results.txt file. Then, the script uses the awesome pyExcelerator module to write the data to an XLS file.

By using this XML file advanced users can change the data as well as the spreadsheet column names that are generated in the resultant XLS file. I'm trying to make this software as open and mutable as possible but casual users won't have to worry about anything since the defaults should eventually work just fine.

Right now, the main work I have left to do is to overcome one glaring weakness. PubMed2XL can currently only retrieve data from non-repeating XML elements. In other words, elements like an author's <LastName> can't be extracted because there may be more than one author. What I'll eventually do is incorporate something in the setup file that tells PubMed2XL which occurrence of a repeating element to get data from: i.e. the last name of the primary author, etc.
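
As a rough idea of what that might look like, here is a small sketch using Python's built-in ElementTree module. The get_occurrence() function and its "position" argument are made up for illustration and aren't part of PubMed2XL, and the sample author names are invented too.

```python
import xml.etree.ElementTree as ET

# invented sample data shaped like a PubMed citation
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <Article>
      <AuthorList>
        <Author><LastName>Smith</LastName></Author>
        <Author><LastName>Jones</LastName></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""

def get_occurrence(article, path, position):
    """Return the text of the Nth (0-based) element matching path, or "" if absent."""
    matches = article.findall(path)
    if position < len(matches):
        return matches[position].text or ""
    return ""

article = ET.fromstring(SAMPLE)
path = "MedlineCitation/Article/AuthorList/Author/LastName"
first_author = get_occurrence(article, path, 0)   # primary author
second_author = get_occurrence(article, path, 1)
third_author = get_occurrence(article, path, 2)   # no third author, so ""
```

Returning an empty string for a missing occurrence would keep the spreadsheet columns lined up even when an article has fewer authors than the setup file asks for.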

If you are bored enough to download the zip file containing the program files, you'll notice the main executable file, PubMed2XL.exe, but also another file called PubMed2XL_CL.exe. Now this is exactly the same application but if you click on it you will see an ugly console window pop up in addition to PubMed2XL. The only reason I've included that file is to demonstrate that PubMed2XL can support command line arguments. In other words if you were to go to the command line and type in $ PubMed2XL_CL -h you would see a message pop up on the command line showing you the options for passing arguments to the software via the command line.

Basically what this means is that you can tell PubMed2XL which PubMed file to process and what to call the resultant spreadsheet while bypassing the program's graphical interface. Now if you're working on just one file, the GUI version is definitely the way to go, but by incorporating command line functionality the program becomes instantly usable for batch-processing multiple files and also becomes a viable tool to incorporate on a server. In other words, it could be used on the back end of a website. For example, users could just upload their PubMed file to a website while having the XLS file emailed to them or something like that.

Anyway, there's still lots to do and when I've taken care of the issues I mentioned I'll release the source code if anyone's interested – or if Linux or Mac users want to get this up and running on their systems.

Ideally, I'd like this to become a nifty tool that reference librarians could use to help their patrons. Now if something like this is already out there, please let me know. No need to re-invent the wheel.

<?xml version="1.0" encoding="UTF-8" ?>
<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="PubMed2XL-0.8.9.xsd">
	<spreadsheetHeader>
		<column xPath="PubmedArticle/MedlineCitation/PMID" type="element" linkPrefix="http://www.ncbi.nlm.nih.gov/pubmed/">PMID</column>
		<column xPath="PubmedArticle/MedlineCitation" type="attribute" attributeName="Owner" linkPrefix="none">Owner</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Year" type="element" linkPrefix="none">Publication Year</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Month" type="element" linkPrefix="none">Publication Month</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Journal/Title" type="element" linkPrefix="http://www.ncbi.nlm.nih.gov/pubmed?term=">Journal</column>
		<column xPath="PubmedArticle/MedlineCitation/MedlineJournalInfo/NlmUniqueID" type="element" linkPrefix="none">NLM ID</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/ArticleTitle" type="element" linkPrefix="none">Article Title</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Abstract/AbstractText" type="element" linkPrefix="none">Abstract</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Language" type="element" linkPrefix="none">Language</column>
	</spreadsheetHeader>
</config>
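
To give a feel for how a setup file like this can drive the extraction, here's a minimal sketch. It is not the actual PubMed2XL code: it uses Python's built-in ElementTree instead, a trimmed-down copy of the setup file, and an invented one-article sample. The extract_rows() function is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# trimmed-down setup file with one element column and one attribute column
CONFIG = """<config>
  <spreadsheetHeader>
    <column xPath="PubmedArticle/MedlineCitation/PMID" type="element" linkPrefix="none">PMID</column>
    <column xPath="PubmedArticle/MedlineCitation" type="attribute" attributeName="Owner" linkPrefix="none">Owner</column>
  </spreadsheetHeader>
</config>"""

# invented sample standing in for pubmed_results.txt
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation Owner="NLM">
      <PMID>12345</PMID>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

def extract_rows(config_xml, pubmed_xml):
    """Return (header, rows) as described by the <column> elements."""
    columns = ET.fromstring(config_xml).findall("spreadsheetHeader/column")
    header = [col.text for col in columns]
    rows = []
    for article in ET.fromstring(pubmed_xml).findall("PubmedArticle"):
        row = []
        for col in columns:
            # setup file paths start at "PubmedArticle"; make them relative to one article
            path = col.get("xPath").split("/", 1)[1]
            node = article.find(path)  # first match only, i.e. non-repeating elements
            if node is None:
                row.append("")
            elif col.get("type") == "attribute":
                row.append(node.get(col.get("attributeName"), ""))
            else:
                row.append(node.text or "")
        rows.append(row)
    return header, rows

header, rows = extract_rows(CONFIG, SAMPLE)
```

Each row comes out in the same order as the columns in the setup file, so renaming or reordering the <column> elements changes the spreadsheet without touching the code.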

↧

PubMed to Excel: PubMed2XL version 0.9

September 19, 2010, 5:03 pm

I've released the first Beta version of PubMed2XL, a Windows application that converts article lists from pubmed.gov into Microsoft Excel files.

If you'd like to use the software you can download it. Yes, it's free.

Here's a little video tutorial on installing and using the software:

PubMed2XL: Basic Installation and Use from nitin arora on Vimeo.

PubMed2XL's documentation is available at: blog.humaneguitarist.org/projects/pubmed2xl/.

The documentation includes a download link to the program files.

↧

and yet more PubMed to Excel news

November 13, 2010, 11:12 am

I've updated the documentation for PubMed2XL, a Windows application that converts article lists from pubmed.gov into Microsoft Excel files. The documentation isn't incredibly thorough, but I think it's enough to work for now.

Speaking of getting PubMed search results into a spreadsheet check this out:

Those who search PubMed regularly have often wished for a way to import search results into a program such as Excel. It’s here! A new tool called FLink (Frequency-weighted Links) is now accessible from the NIH National Center for Biotechnology Information (NCBI): http://www.ncbi.nlm.nih.gov/Structure/flink/docs/flink_about.html. FLink allows PubMed search results to be saved as a CSV, or comma-separated value, file which can be imported into a program like Excel.

source: Dragonfly » Blog Archive » FLink: A New Way to Save PubMed Search Results. Retrieved November 13, 2010, from http://nnlm.gov/pnr/dragonfly/2010/11/10/flink-a-new-way-to-save-pubmed-search-results/

For instructions, just click here.

Unfortunately, those instructions don't tell the user to import the CSV file with UTF-8 encoding, etc. Importing with the wrong character encoding could cause characters like the accents and umlauts that might appear in author names, for example, to show up as strange, nonsensical characters.
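
Here's a quick sketch of what goes wrong, written for Python 3 with an invented two-line CSV. The same UTF-8 bytes are decoded once correctly and once as Windows-1252 (cp1252), which is used here only as an example of a wrong guess.

```python
import csv
import io

# invented CSV data, encoded as UTF-8
raw = "Author,Title\nMüller,Example article\n".encode("utf-8")

def read_csv(data, encoding):
    """Decode the bytes with the given encoding and parse them as CSV."""
    text = io.StringIO(data.decode(encoding))
    return list(csv.reader(text))

good = read_csv(raw, "utf-8")   # author name comes through intact
bad = read_csv(raw, "cp1252")   # the umlaut turns into two junk characters
```

The file itself is fine in both cases; only the encoding chosen at import time differs.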

Also, the output format is fixed – i.e. I don't think the user has any control of what data gets exported to the CSV file. Some data is concatenated together in one spreadsheet cell and that can be a problem for those who need to parse the data at a more granular level. It's more difficult to split data and re-sort it than it is to concatenate data that is already parsed in a granular fashion.

By contrast, the PubMed2XL output can be customized – although it requires some skill with XML. It also places only one value in each cell, and I've never experienced any character encoding issues in the tests I've done.

Sure, I'm trying to compare the two approaches – just a touch, but in the end the best way will be for the users to have an easy interface offered directly from PubMed.gov and its related sites. I'm just saying that I hope they soon offer more options and a more user-friendly method for the sake of the user.

↧

dealing with a PubMed2XL bug

March 16, 2011, 4:26 pm

Björn from Sweden has been using PubMed2XL and has suggested some additional features that we are working on. More on that some other time …

But he also found a bug, or rather an oversight on my part. That needs to be dealt with first.

I didn't realize that some data in the PubMed.gov XML elements are insanely long. We encountered an abstract in one article nearly 50,000 characters long. That wasn't breaking PubMed2XL but the resultant spreadsheet had all kinds of problems – values in the wrong column, wrong cell, etc. I guess this is because – as I now know – Excel/OpenOffice don't let cells carry more than about 32k characters. I don't know if this is true of newer versions of MS Excel, but whatever. 32k is enough!

So in a test version of the application, I added a length checking and stoppage feature. This restricts the length of the data placed into a cell to 30,000 characters if the data to be placed is greater than 32,000 characters.

Eventually, I'll make it so that if the data is greater than 32k characters, the cell will contain colored text so the user can know that "Hey, this data is incomplete because it's so darn long!".

Anyway, as a note to myself, here's a code snippet that seems to be a quick patch. I'll upload the fixed version in a week or so. I'm moving and all, so my schedule's a bit wonky.

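# get the text of the current element
cell = getElement.text
# if it's too long for a spreadsheet cell, keep only the first 30,000 characters
if len(cell) > 32000:
	cell = cell[:30000]
writeExcel.write(rowIter, columnIter, cell)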
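
For the colored-text idea, one possible approach is to have the length check report whether it truncated anything. This is only a sketch: the fit_cell() function is hypothetical and isn't in PubMed2XL, and it also treats a missing value (None) as an empty string.

```python
def fit_cell(text, limit=32000, keep=30000):
    """Return (text, truncated) where text is short enough for a spreadsheet cell."""
    if text is None:
        return "", False
    if len(text) > limit:
        return text[:keep], True
    return text, False

short_value, short_flag = fit_cell("a short abstract")
long_value, long_flag = fit_cell("x" * 50000)
empty_value, empty_flag = fit_cell(None)
```

The second value could then decide which cell style to use when writing the cell.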

↧

PubMed2XL 0.9.1 available

April 3, 2011, 3:40 pm

I've uploaded a new version of PubMed2XL, a Windows application that converts article lists from pubmed.gov into Microsoft Excel files.

If you'd like to use the software you can download it for free.

For those who are interested, here's the changelog:

0.9.1
- worked with Björn Carlsson on a few things:
  - added length checker for <getElement> so that abstracts greater than 32k characters would get truncated to the first 30k characters.
    - see: http://blog.humaneguitarist.org/2011/03/16/dealing-with-a-pubmed2xl-bug/
  - added <getAttributeByElementPosition> element.
    - updated schema.
  - removed code that displayed the "aboutMessage" variable on the command line if command line options are used.
    - This is because the diacritic in Mr. Carlsson's name caused encoding errors with the default Windows command prompt.
  - added <hyperlinkSuffix> element so that alternate views of PubMed data could be passed via the URL.
    - updated schema.
    - For example, see this: http://www.ncbi.nlm.nih.gov/pubmed/21069543 then this: http://www.ncbi.nlm.nih.gov/pubmed/21069543?report=medline
      - The hyperlink suffix of ?report=medline changes the display!
    - For more information, see:
      - PubMed Help — PubMed Help — NCBI Bookshelf. Retrieved November 13, 2010, from http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helppubmed&part=pubmedhelp&rendertype=table&id=pubmedhelp.T40
      - pm_workbook.pdf. Retrieved November 13, 2010, from http://www.nlm.nih.gov/pubs/manuals/pm_workbook.pdf (see page 135).
- updated py2exe "setup.py" to automatically name the command line/console version correctly (i.e. with the "-CL" suffix).
- removed "src" folder and placed Python files in same folder as .exe's.
_______________________________________________________________________
0.9.0
- this was the first version - that worked!
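
To illustrate the hyperlink suffix idea from the 0.9.1 notes, here's a tiny sketch of how a link might be assembled from a prefix, a cell value, and a suffix. The build_link() function is hypothetical and not taken from the PubMed2XL source; the "none" prefix mirrors the linkPrefix="none" convention in the setup file.

```python
def build_link(value, link_prefix="none", hyperlink_suffix=""):
    """Return a URL for a cell value, or None when the column has no link."""
    if link_prefix == "none":
        return None
    return link_prefix + value + hyperlink_suffix

plain = build_link("21069543", "http://www.ncbi.nlm.nih.gov/pubmed/")
medline = build_link("21069543", "http://www.ncbi.nlm.nih.gov/pubmed/", "?report=medline")
no_link = build_link("eng")
```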

↧

PubMed2XL 1.0 available

June 18, 2011, 11:28 am

I've uploaded a new version of PubMed2XL, a Windows application that converts article lists from PubMed.gov into Microsoft Excel files.

Unlike the CSV you can download directly from PubMed.gov, PubMed2XL output can be customized (OK … by advanced users), but even the default format includes the abstract, links to each article, and even links to related articles and reviews.

Here's an example of a spreadsheet made with PubMed2XL and here's the source file used to make it. The source file was downloaded from PubMed.gov using a search for "Mexican flu".

If you'd like to use the software you can download it for free.

If you notice any bugs or have any questions or remarks, please feel free to leave a comment on the site. Thanks!

↧

the serpent, the apple, and Joe

January 7, 2012, 7:22 am

For better or worse, the one application of mine that people actually use is the one I wrote pretty casually with Python over a couple weekends from bed because I was too lazy or hungover to get moving on those days.

That software, PubMed2XL, lets people do a few things with downloaded citations from PubMed.gov that aren't currently offered directly from the site. I've gotten some nice feedback from librarians, researchers, and information-y people at companies that have found it useful.

This post isn't a plug though; it's more an acknowledgement of something that I didn't really realize in full at the time. And that is when one writes software that people go on to actually use, one better be prepared to support it. Now, the software's simple enough that there haven't been real bugs save one, but it does eat at me that I can't offer a simple way for it to work on multiple platforms.

While the Windows version is really easy to set up – thanks to py2exe and Inno Setup – getting it running on Linux is a bit more work, given all the distro variations and dependency installation. But getting it running on a Mac – particularly with an easy-to-use installer – isn't going to be possible unless I can find someone to compile it for a Mac who will also test it and compile future versions. Sure, there's the possibility of using Wine, but that's still asking a lot from end users.

Normally, I wouldn't care. Apple doesn't make it easy for people to develop for Macs unless you fork over the change for a Mac – and I ain't buying a copy of OSX and doing the Hackintosh bit. But, since the software is ultimately about health-related research, I do care.

Unfortunately I made – with the advantage of hindsight – two coding decisions that create problems.

First, I chose PyQT as the GUI toolkit for the software simply because it looks prettier than Python's native Tkinter. My reasoning at the time was that people were more likely to trust better looking software even though it's just a small window with some basic menu options. Eventually, I added a progress bar, too, so downgrading to Tkinter has become less of an option.

Second (and this is the big one), I used lxml since the PubMed2XL setup files employ XSL to tell the software what data to put in a spreadsheet cell. Granted, lxml is freakin' fantastic, but since it's not a pure Python module I can't just distribute it in a folder and import the module locally. Not that I had much of a choice: there's no built-in XSLT-capable module that ships with Python 'far as I know.

So I've been asking myself how to make the serpent (Python) and the apple (OSX) get along.

I've considered just making PubMed2XL a web-app, but that will entail expenses for me that simply offering people a desktop app doesn't entail.

So, I think the solution lies in a cup of Joe. That's to say that a Java app is the obvious solution, specifically using Jython.

That would leave me to replace PyQT with Swing. I'm fine with that. It's not like PyQT is all that Pythonic in the first place. There's a nice Jython/Swing tutorial here.

And as for the XSLT component, this tutorial on XSLT with Jython and native Java libraries should help immensely.

So, I should be able to use Jython to make a cross-platform version of PubMed2XL. I don't necessarily want to, but given the type of research I'd like to help facilitate (in a very small way, I know), I think I probably should.

↧

getting real-time values from imported modules with a Python GUI

February 7, 2013, 12:00 pm

Situation: Over a year ago I wrote a Python script called PubMed2XL to allow one to convert XML citations from PubMed.gov to a Microsoft Excel spreadsheet.

I know some people are using it here and there, so I wanted to make it better.

The main problem with the old script is it's just sloppy (I knew even less back then than I do now!). It's also a script that wraps everything into one: non-GUI and GUI. If you don't pass command line options, then it launches the GUI, etc. Anyway, that makes the code hard to read for me since it intermixes data parsing with command line option stuff and GUI stuff.

So, for the next version I'm working on, I started with the premise to write it as a Python library so that it can be imported and one can use the function to make a spreadsheet inside another Python script a la:

import pubmed2xl
pubmed2xl.makeSheet("pubmed.xml", "pubmed.xls") #pass input and output (Excel file to be written)

It's also set up to make it easy to use command line options a la:

python pubmed2xl.py pubmed.xml pubmed.xls

The function and command line options also support showing the progress of completion while the spreadsheet is being made. This can be called as such:

pubmed2xl.makeSheet("pubmed.xml", "pubmed.xls", showProgress=True)

python pubmed2xl.py pubmed.xml pubmed.xls --verbose
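
As a sketch of how one module can serve both uses, here's a bare-bones layout using Python's argparse. The makeSheet() below is only a stand-in that reports what it was asked to do, not the real conversion, and parseArgs() and main() are names I've made up for this example.

```python
import argparse

def makeSheet(inputFile, outputFile, showProgress=False):
    # stand-in for the real conversion; just reports what it was asked to do
    return {"input": inputFile, "output": outputFile, "showProgress": showProgress}

def parseArgs(argv):
    # two positional arguments plus an optional --verbose switch
    parser = argparse.ArgumentParser(description="Convert a PubMed XML file to a spreadsheet.")
    parser.add_argument("inputFile")
    parser.add_argument("outputFile")
    parser.add_argument("--verbose", action="store_true")
    return parser.parse_args(argv)

def main(argv=None):
    # argv=None makes argparse read the real command line
    args = parseArgs(argv)
    return makeSheet(args.inputFile, args.outputFile, showProgress=args.verbose)

# in a real script this would end with:
#   if __name__ == "__main__":
#       main()
result = main(["pubmed.xml", "pubmed.xls", "--verbose"])
```

Because the command line handling lives in its own functions, importing the module never triggers any argument parsing, and another script can call makeSheet() directly.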

The problem for me, then, was how to show the progress inside a GUI application. Essentially, I needed the value of the progress counter "variable" that was created inside a loop and updated each time the loop occurred – i.e. updating the progress counter. But I couldn't figure out how to retrieve the value of the progress counter variable in real time as the loop occurred. And I need it in real time so my GUI could show the progress update to the user – in real time!

I spent way too much time following leads that got me nowhere. I tried threads, running the python script as a sub-process, etc. but I could never access the variable "progressValue" that equates to the percentage of task completion as citations are getting processed into a spreadsheet.

So, somehow I found my way to realizing that if my original script had a class and my second script added a method to the class then I could get the value of "progressValue" in real time.

Anyway, I've got two scripts below. The "first.py" script emulates a progress calculator by simply counting to 100. The script also has a class, "callback" and a global dictionary "_CALLBACK_DICT" into which I can place key/value pairs for whatever variables I want to retrieve during the loop.

The function "canYouSeeMe()" inside "first.py" also tries to execute the method "_CALLBACK.callback()" during the loop. In other words, if the method's there, run it, otherwise just ignore it.

The second script "second.py" is a little Tkinter GUI app. It imports the first module and the instantiated class ("_CALLBACK"). It also has a function called "getCallback()" that does what I want: i.e. retrieve the progress count in real time and show it in the GUI in real time. I then equate "getCallback()" to the "_CALLBACK.callback()" method. So now, when I run the "second.py" script, the loop in "first.py" can give me the data I want to show in "second.py" in real time. Make sense? I hope so because it seems to be working OK.

Here's a screenshot from running "second.py" and below are the scripts themselves. I'd love any feedback on better ways of doing this, by the way.

Tkinter callback example

first.py

##### "first.py"

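#empty class; other scripts can attach a "callback" method to its instance
class callback():
  pass
_CALLBACK = callback()
_CALLBACK_DICT = {} #holds whatever values the callback needs to read

rangers = range(0, 101, 10)
def canYouSeeMe():
  for ranger in rangers:
    _CALLBACK_DICT["this_ranger"] = str(ranger)
    try:
      _CALLBACK.callback() #run the callback if one has been attached
    except AttributeError: #no callback attached, so just carry on
      pass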

second.py

##### "second.py"

#import first module
import first
from first import _CALLBACK

#import Tkinter
from Tkinter import *

#create function and add as method to class "_CALLBACK"
def getCallback():
    importedValue = first._CALLBACK_DICT["this_ranger"]
    t.insert(END, importedValue + "%\n")
    if importedValue == "100":
        t.insert(END, "\nDone.")
    t.see(END)
    t.update_idletasks()
_CALLBACK.callback = getCallback #adding method to class

#create GUI buttons
class buttons():
   
    def __init__(self, root):
 
        #make frame/button
        frame = Frame(root)
        frame.pack()
       
        buttonText = "go"
        buttonAction = self.go
        self.makeButton = Button(frame, text=buttonText, command=buttonAction)
        self.makeButton.pack()
       
    #run go()
    def go(self):
      first.canYouSeeMe()

#create GUI
root = Tk()
buttons = buttons(root)
t = Text(root, background="black", foreground="blue")
t.pack()
geo = ("150x250")
root.geometry(geo)
root.mainloop()
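
One alternative, offered only as a sketch, is to skip the shared class and dictionary and pass the callback straight into the function that does the looping. The countTo100() function and its "onProgress" parameter are made up for this example, and a plain list stands in for the GUI text widget.

```python
def countTo100(onProgress=None):
    """Count to 100 in steps of 10, reporting each value to onProgress if given."""
    for value in range(0, 101, 10):
        if onProgress is not None:
            onProgress(value)

#stand-in for the GUI: collect the reported values in a list
seen = []
countTo100(seen.append)

#calling without a callback still works and reports nothing
countTo100()
```

In a GUI, the function passed in would update the text widget or progress bar instead of appending to a list, and the looping module wouldn't need any module-level state for the GUI to reach into.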

↧

PubMed2XL 2.0 now available for download

July 7, 2013, 1:11 pm

PubMed2XL 2.0 is now available.

You can read the documentation and download the newest version here.

There are a few notable changes to the graphical user interface (GUI) and lots of huge changes under the hood.

As for the GUI, the visible changes are:

You will now get notified by the software if a newer version of the software is available;
You can now turn the processing of book (non-journal) citations on/off;
You can toggle between Excel 2007 (.xls) and OpenDocument (.ods) output;
You can now save your preferences;
You can inspect a stylesheet and see the column title names before selecting that stylesheet;
… and there's even a simple logo (below) thanks to klukeart's icon on IconArchive.

And for programmers there is now a "pm2xl.py" Python library that has lots of functions that I hope might be useful for folks. The GUI is now built on top of the library so there's a clear separation of concerns between data processing functions and user interface. Non-Python programmers can also call the library functions via the command line.

OK, here's the logo (made with Inkscape):

PubMed2XL 2.0 logo

I'd also like to say congrats to Andy Murray for winning Wimbledon today!

↧

PubMed2XL 2.01 available

October 5, 2013, 9:01 am

A new version of PubMed2XL is available.

You can download it here.

This update should not affect anyone using the graphical desktop software but I recommend updating anyway. Make sure not to lose any spreadsheets you made with the old version.

I should also mention that I'm still new to distributing software, so while I'm hoping everything works OK, it might not. If you notice anything weird, let me know and I'll try to address it in a timely manner.

It's a bit strange to release this during a US government shutdown, but PubMed.gov still seems like it's working fine.

↧

user contributed content: getting PubMed2XL to work on MacOS

December 28, 2015, 9:19 am

In the last couple of months, I've had two users tell me they have been able to use PubMed2XL on their Apple/Macs with WineBottler.

One user, Bob, sent me email instructions for how he got PubMed2XL 2.0 working on his Apple computer.

I've uploaded a PDF of his email here.

While I can't verify any of this because I don't have a Mac, I'd love to know if it works for other Mac users. Anyone using these instructions accepts full responsibility for their actions, of course. That goes without saying even though I just said it.

If it works for you and you want to leave a comment, please do. If you want to create your own tutorial or video tutorial, that would be great as well. Let me know.

Thanks Bob!

↧

PubMed2XL review in JMLA

March 27, 2016, 9:16 am

Just a quick one …

A user of PubMed2XL recently published a review of the application in the Journal of the Medical Library Association. You can read it here.

Thanks David.

↧