Open Notebook Science BCCE 2008

Jean-Claude Bradley: Thanks very much for giving me the opportunity to talk about Cheminformatics and Open Notebook Science. Before I get started with the whole Cheminformatics component, let me give you a little bit of background about what I mean by Open Notebook Science.


There's been a trend in the past few years to go from more closed systems, in research and in teaching, to more open systems. We're all familiar with the traditional lab notebook, where all the stuff that's in there is not going to become public unless somebody actually puts it together and writes a paper.


And of course, the traditional journal article is more open, but still, people have to pay for it to get access to the information. Recently, open access has become pretty popular. The advantage of that is people don't have to pay to access the articles, but the format of the articles is still the same as the traditional journal.


So, when we talk about Open Notebook Science, we're talking about making the actual laboratory notebook and all associated processes completely open in real-time, or as close to real-time as we can possibly get. There's also a teaching aspect to this, which I'm not going to talk about too much. But, I do record my lectures, and I make them publicly available as much as I can.


An interesting little statistic that I found out recently, after I made these YouTube recordings for my organic chemistry questions, was that the strongest demographic was older males, 45-55. I would have assumed it would be the students, but there's something else going on here. It's very interesting what happens when you make this stuff open.


I'm not going to talk too much about the details of the organic chemistry. But, just to give you a global idea, we are trying to make anti-malarial compounds. So, my group is in the middle at Drexel, and we do the synthesis of compounds. Before that we have Rajarshi Guha at Indiana University, who's doing docking for us. He's telling us which compounds to make, and once we make those compounds, we ship them off to various places.


Recently, Phil Rosenthal, at UCSF, has been kind enough to do testing against falcipain-2, the enzyme that he's isolated in his lab, as well as general anti-malarial activity.


We're trying to do this as openly as possible, and that's sort of the idea. Can we do drug development in a completely open way? In order to do this, I'm not going to use one technology. I'm actually going to piece together all kinds of different technologies, the vast majority of which are completely free and very well put together. I will talk a little bit about using the blog, the wiki, Google Docs, CDD, ChemSpider and even the mailing list. The 'old school' mailing list actually still has a role to play in all of this.


So, where does this actually come from? If you're an organic chemist and you've ever tried to repeat a procedure and tried to follow the experimental, and found that you just simply couldn't figure out how they managed to do what they claimed they did, it's because there's some information missing. If you had access to the lab notebook, you could actually see exactly what the problem is. Maybe they made a mistake, or maybe you're making a mistake. And so, it's really the question of where's the beef, in terms of the experimental.


Here's a blog post where I talk about us making some falcipain-2 targets. And, at the bottom there in red, you see it says, "See Experiment 150". So, when you click on that link, it takes you to our lab wiki. The wiki is basically just a collection of pages. Each page is similar to a paper notebook, except I get to have links, I can embed images, and I can embed videos. I can do all kinds of things, and I'm just going to give you a sample of what these things are.


I just clicked on Experiment 150, and it shows me the scheme, and it gives me the objective. If I click on one of the links for the compound, it takes me to ChemSpider. ChemSpider is a free database. It has about 20 million compounds on it, and it's very handy for an organic chemist because it automatically gives you the InChIKey. It gives you the InChI and the SMILES, and you can search by all of these. You can do substructure searches.


So, if we keep our compounds on ChemSpider, I don't have to run software on my machines to do sub-structure searches. And, I prefer that because I'm an organic chemist, and I don't want to manage software.


And, we also have links. Remember when I told you about Rajarshi who does the docking for us? We try to make that as open as possible as well. If you actually click on these links, it takes you to a Google Doc that has the list of SMILES codes in the order that Rajarshi calculated for this particular docking run. So, if you wanted to actually see what he did and try to repeat it, or investigate it, you certainly could.


All I'm doing here is taking a long page and dividing it up into sections. There's also a procedure section, typically, and this is something that we'd like to be able to copy and paste when this makes it to a traditional paper. We would like to be able to copy and paste that to the experimental sections.


And, I'm a huge fan of JCAMP, which was mentioned a little bit earlier. This is really nice because with NMR data in JCAMP, you can actually use JSpecView through a browser, and people don't have to download any software, or even know that there's software involved. It uses Java, and they basically click on the spectrum and it pops up. And, if you drag the mouse across the spectrum, it will actually expand any area.


So, this is far superior to printing out endless expansions. You just need to save it once, and then you can get your J constants at any point in time. And also, you notice in this spectrum, it's kind of hard to tell what's going on. I know that sometimes you see the supplementary sections in papers, and you wish that you could expand to see if there's a rotamer, or some kind of impurity there. Here anyone can do that; it's totally open.


And ChemSpider - I'm going to revisit ChemSpider a couple of times - can do some very useful things. Once we've characterized the compound, we can actually upload the spectrum in JCAMP format. Right on the record of that compound, you can expand, and you can zoom into the various peaks. You can also upload pictures. So, this is a picture of this particular precipitate. You can upload any of the spectra. You can get JCAMP format from IR, NMR, and mass spec. It's certainly the most convenient open format that we've found.


And, ChemSpider recently gained the ability to predict an NMR spectrum. So, there's a button for this record, and if you click on it, it generates this spectrum. I've actually put it next to our experimental spectrum to show that you still need to be a chemist, and you still need to have some common sense. But, it's certainly pretty helpful to use your background just to get a rough idea of where those peaks should be. It's pretty neat for a free service.


Of course, on a page there has to be a log. So it's my job to teach my students how to keep a log, in terms of noting what they've done and what they have observed. So, it could take some time to figure out exactly what that is, but this is live and it's real. As things get corrected you get to see that process.


So, finally, and we're still on that same page that I started with at the beginning of my talk, we have the conclusion. This Ugi product was obtained at 59% yield. The nice thing about this is that you don't have to trust me. You don't have to trust my students. You can go to the raw data. You can analyze the NMRs. You can look at the log. You can see if we made a mistake or...


So, this is how science can be done more efficiently. If we had access to each other's lab notebooks, I think that we would spend a lot less time repeating stuff that shouldn't be repeated. OK.


So, there's a bunch of different ways to access this information. There is, of course, a table of contents, which is very paper-like. It's just a list of the experiments.


But there are other ways to find it. Since there is a Cheminformatics theme here, I'm going to talk here about InChIs and InChIKeys. We use InChIs for most of the small molecules. We've been using more and more InChIKeys for the larger molecules. We want our compounds to be found in Google.


It turns out that the InChIs for the larger molecules, let's say 500 daltons, don't get indexed properly in Google. The InChIKey is basically a hash of the InChI. They look like this right here. No matter how big the molecule is, the InChIKey will be the same length, so it gets indexed really nicely by Google.


You can even do Google custom searches. That's a free service that Google offers. You can say: only search this domain. So you can get a lot of flexibility with Google for free. That's it right here.


So what I'm doing here is searching for the first part of the InChIKey. If you're not familiar with the InChIKey, the first part gives you the connectivity of the atoms and the second part gives you the stereochemistry. So, if you only search for the first part, you are sure to get all the stereoisomers.
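As a rough illustration of the search just described, the sketch below pulls the first block out of an InChIKey and builds a domain-restricted Google query from it. The two keys are made-up placeholders that only follow the standard 14-10-1 letter layout, and `example.org` stands in for the notebook's domain; none of these values come from the talk.

```python
import re

# Layout of a standard InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

# Placeholder keys (NOT real compounds): same first block, different second block,
# which is how two stereoisomers of one skeleton would look.
KEY_A = "ABCDEFGHIJKLMN-OPQRSTUVSA-N"
KEY_B = "ABCDEFGHIJKLMN-ZYXWVUTSSA-N"


def connectivity_block(inchikey: str) -> str:
    """Return the first block of an InChIKey, which encodes atom connectivity."""
    if not INCHIKEY_PATTERN.match(inchikey):
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return inchikey.split("-")[0]


def site_search_query(inchikey: str, domain: str) -> str:
    """Build a search string limited to one domain using Google's site: operator."""
    return f'"{connectivity_block(inchikey)}" site:{domain}'


if __name__ == "__main__":
    # Both placeholder keys produce the same query, so one search finds both records.
    print(site_search_query(KEY_A, "example.org"))
    print(site_search_query(KEY_B, "example.org"))
```

Because the first block is shared, a page that lists either full key would match the same query.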


There are some other ways that people might be finding our experiments. These are some of the search terms that I looked at for about a week to see how people were doing this. There are four basic categories.


First, there are specific compounds. So, someone will search for methylene chloride or tosyl cyanide, or they will be searching for experimental conditions, like kinetics or Boc protection. We have done a lot of NMR monitoring of the removal of the Boc group. So if you are looking for that, you're in luck, because you have got piles of data that you can grab.


People are also looking for general educational stuff, like how to make a poster in Second Life or 3D periodic tables. Those are some of the things that appeared on my blog, for example. And of course, big picture things like Cheminformatics project proposal. I do put my proposals up publicly so if people want to see what we plan on doing, they can also grab it through there.


I'd like to give you an example that is just two or three weeks old, which I think is a great way to show how a wiki can be used. This is a search that someone from the UK did: "purification of phenylacetaldehyde." The first hit from the Google search is our Experiment 37. When you hit that, it turns out that it is by my undergrad, James. And the title of this experiment is "Purification of Phenylacetaldehyde." He is doing a distillation.


So let's go through the experiment and see what happened. OK, so you find this procedure. Simple enough. He took a picture. I can see where the thermometer is. I can see exactly what is going on. He put up an NMR. This is before we started to use JCAMP, so this is not expandable. This is just an image.


You can see that there is some stuff going on here. Mainly, it does look like it is phenylacetaldehyde, but it certainly is not as pure a compound as I'd be happy with. And of course, he has his log. So I can see exactly how long everything took during the distillation.


So, if you look at the discussion, actually it's not so simple. He reported that the distillation happened at 162 degrees, but the reported boiling point is 195 according to this reference, the MSDS sheet. So there is something strange going on. The conclusion is that we can't use this distillate. In fact, there were globules of liquid inside of it.


Basically, I don't know. Maybe we're the only people in the history of the world to have this problem with phenylacetaldehyde, or maybe not. So this might be helpful. Now, let's take a look at the real power of the wiki. What you just looked at was the final state of the wiki.


Let's see the history, how it came to be that way. This experiment started on Oct. 27th, 2006. And the wiki history gives you all the different people who modified it. If I click on that first entry, you'll notice here that the conclusion is different. It says phenylacetaldehyde was indeed purified with only little traces of impurities. Something happened between this and what it looks like today.


We can see the difference between the different versions. So on Oct. 29th I modified it. The way it works on Wikispaces is, the green stuff is what was added and the red stuff is what was deleted. So, you know from this that I deleted his conclusion and I put in this conclusion that, in fact, it wasn't a good distillation.
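To make the added-versus-deleted view concrete, here is a minimal sketch using Python's standard `difflib` module. It is not how the wiki itself computes its history, and the two page versions are short paraphrases written for this example, not the actual wiki text.

```python
import difflib

# Paraphrased stand-ins for two revisions of a page (not the real wiki text).
OLD_VERSION = [
    "Conclusion",
    "The aldehyde was purified with only small traces of impurities.",
]
NEW_VERSION = [
    "Conclusion",
    "The distillate cannot be used; the boiling point did not match the reference.",
]


def revision_changes(old_lines, new_lines):
    """Return (added, deleted) lines between two revisions of a page."""
    added, deleted = [], []
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("+ "):
            added.append(line[2:])    # would be shown in green
        elif line.startswith("- "):
            deleted.append(line[2:])  # would be shown in red
        # lines starting with "  " are unchanged; "? " lines are hints we skip
    return added, deleted


if __name__ == "__main__":
    added, deleted = revision_changes(OLD_VERSION, NEW_VERSION)
    print("added:", added)
    print("deleted:", deleted)
```

Keeping every revision and diffing on demand is what lets a reader see who changed a conclusion and when.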


There's a lot of that kind of thing going on with our wiki. It's a very good way for me to interact with my students. In this particular case, I actually wrote the answer, but often I will actually pose a question. I will say that something doesn't make sense, maybe look into this a little more, and that actually is a really good way to interact.


Looking further down, in June, my graduate student came in and put a link to the ChemSpider entry, and he also added tags. So he added the [indecipherable] tag. And then, actually a year later, James answered this question. If you remember that picture, there were no tubes attached to the condenser. I thought that was a little weird. So I asked him about it, and he said that actually that picture was not of that run while it was happening. So he answered the question.


Now, we can keep going on like this until infinity. And that is what research is really like. It is sort of a fiction to think that we do research, it stops, and it turns into a fact and then it becomes a paper. But that is not really what happens. We keep at this process until we feel comfortable enough to put it in a paper, but we're never completely satisfied. And this really captures that whole process. So students can actually write reports on what they did and they can link to their own experiments. And they can link to their labmates' experiments very nicely.


So here Shannon is writing her report. She's comparing her results with Emily's and with Tim's. And there's no confusion whatsoever about who did what, because the wiki captures every single addition from each one of the students. That's something I really like: if I need to assess a student, I can tell what kind of discussions they wrote, whether they did conclusions, and whether they did the experiment.


So, real quick, I'll just go through the other ways of getting to that information. So far I just used the wiki and I just used Google, but that's not really the perfect way to get to this information. We're repeating the same reaction, the Ugi reaction, and we keep changing the conditions slightly.


Wouldn't it be nice to be able to compare all the reactions in one place? We use a Google Doc for that, where each one of these experiments has an entry in the table. So I can go to this table, I can sort by the amount of acid, and each entry links back to the wiki page.
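The summary table can be pictured as a list of rows, each pointing back to its notebook page. The sketch below is only an illustration of that idea: the experiment labels, amounts, yields, and `example.org` links are hypothetical and do not come from the real spreadsheet.

```python
from dataclasses import dataclass


@dataclass
class Run:
    experiment: str       # notebook label (hypothetical)
    acid_mmol: float      # amount of acid used (hypothetical)
    yield_percent: float  # isolated yield (hypothetical)
    wiki_url: str         # link back to the full notebook page (placeholder)


# Made-up rows standing in for the shared spreadsheet.
RUNS = [
    Run("EXP-A", 2.0, 41.0, "https://example.org/wiki/EXP-A"),
    Run("EXP-B", 0.5, 12.0, "https://example.org/wiki/EXP-B"),
    Run("EXP-C", 1.0, 30.0, "https://example.org/wiki/EXP-C"),
]


def sort_by_acid(runs):
    """Return the runs ordered by the amount of acid, smallest first."""
    return sorted(runs, key=lambda run: run.acid_mmol)


if __name__ == "__main__":
    for run in sort_by_acid(RUNS):
        # The link is the important column: the table is a summary, not the record.
        print(f"{run.experiment}  acid={run.acid_mmol} mmol  "
              f"yield={run.yield_percent}%  {run.wiki_url}")
```

The row deliberately carries only a few fields plus the link, since the full detail stays on the wiki page.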


So if I wanted to dig deeper, I could. So, remember, this does not have all the information. A table can never completely represent an experiment. We pretend it does in a paper, but in fact every table in every paper that has ever appeared should have an asterisk next to each entry, because each one was a little bit different from the others.


This allows you to do that in complete honesty, because if you want more information, you can just click back to the wiki. So, that's another way.


Another way that we can spread our information is with CDD, Collaborative Drug Discovery. Once we do our anti-malarial assays, we can report the assays on this database. And this is publicly available as well.


We've also used YouTube to record. I actually don't want my students to write. I just want to know what happened. And it turns out it is much easier for them to do a video of the experimental setup and just write in the log, "I set up the experiment," because I can actually see all the stuff that they wouldn't think to write. I just find it interesting that this little video of a round-bottom flask has 5600 views. So there is something going on. People are interested in this.


We also use mailing lists. Mailing lists are basically useful for collaborating between groups and for working out the details; every experiment has those. On the cheminformatics side, if you are acquainted with Second Life, we now have tools in Second Life to construct 3D molecules based only on the SMILES or InChI. So if you are interested in that, talk to me. But my students are doing projects now where they don't have to do scripting, because we have tools that do that automatically and minimize the 3D structure.


You can also talk to molecules. This is an aldehyde and an amine. You tell them to react and they go through each intermediate to form the imine. Each intermediate has been minimized so that it is realistic in its 3D shape.


We can even look at an NMR spectrum, or any spectrum that is in JCAMP format. You can talk to the spectrum in Second Life and it will zoom in to the region that you want. So Second Life is really becoming a pretty cool environment for discussing some pretty high-level chemistry. You really could do that with your students.


Another way that we're disseminating information: we have the CIF files from our X-ray crystallography. Drexel now has an eCrystals site, which is freely available. Other people have started to do this kind of Open Notebook Science. Gus Rosania at Michigan has also converted his group to this wiki interface, where his students are reporting in real time what they are doing.


Cameron Neylon is also a very strong proponent of Open Notebook Science. He doesn't use a wiki. He uses his own customized software, but it's the same kind of concept. And real quick, here are some of the other things we have done; if you are interested, you can talk to me about this later.


With JCAMP again, you can write scripts in Visual Basic for Excel and you can have it calculate reaction rates. Finally, where we're headed with all this: I think we want to move from an environment where humans are interacting with each other to having machines interact with each other. But the only way we're going to get there is if we represent the execution of the experiment in a machine-friendly way.
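The scripts mentioned here run inside Excel; as a stand-in, the sketch below shows the kind of calculation such a script might do, written in Python. It assumes, purely for illustration, that a peak integral has already been read out of each spectrum and that the reaction follows first-order kinetics; the talk specifies neither, and the numbers in the demo are synthetic.

```python
import math


def first_order_rate_constant(times, integrals):
    """Estimate k for I(t) = I0 * exp(-k * t) by a least-squares fit of ln(I) vs t.

    times:     time points (any consistent unit, e.g. minutes)
    integrals: peak integrals of the starting material at those times (all > 0)
    """
    if len(times) != len(integrals) or len(times) < 2:
        raise ValueError("need at least two matching (time, integral) pairs")
    logs = [math.log(value) for value in integrals]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    slope = sxy / sxx
    return -slope  # the slope of ln(I) against t is -k


if __name__ == "__main__":
    # Synthetic data generated with k = 0.05 per minute, not measured values.
    demo_times = [0, 10, 20, 30, 40]
    demo_integrals = [100.0 * math.exp(-0.05 * t) for t in demo_times]
    print(first_order_rate_constant(demo_times, demo_integrals))
```

With real spectra the integrals would be noisy, so the fitted value is only as good as the first-order assumption behind it.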


So what we've been doing is translating our logs using a strict terminology. So, for example, if we want to specify methanol, we would actually use an InChI. So this script is pretty simple for a machine to understand, and it should be able to replicate what actually happened in this experiment.
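One way to picture a log written in a strict terminology is a record that only accepts actions from a fixed list and names compounds by InChI rather than by free text. The sketch below is an assumption about what such a record could look like; the action list, field names, and amounts are hypothetical and are not the group's actual vocabulary. The InChI shown is the standard one for methanol.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical controlled vocabulary; the real list of actions is not given in the talk.
ALLOWED_ACTIONS = {"add", "stir", "filter", "wash"}

# Standard InChI for methanol.
METHANOL_INCHI = "InChI=1S/CH4O/c1-2/h2H,1H3"


@dataclass(frozen=True)
class LogStep:
    action: str
    inchi: Optional[str] = None     # compound identified by InChI, not by name
    amount: Optional[float] = None
    unit: Optional[str] = None

    def __post_init__(self):
        # Reject anything outside the strict terminology.
        if self.action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")
        if self.inchi is not None and not self.inchi.startswith("InChI="):
            raise ValueError("compounds must be given as InChI strings")

    def render(self) -> str:
        """Render the step as one unambiguous line of text."""
        parts = [self.action.upper()]
        if self.inchi:
            parts.append(self.inchi)
        if self.amount is not None:
            parts.append(f"{self.amount:g} {self.unit}")
        return " ".join(parts)


if __name__ == "__main__":
    steps = [
        LogStep("add", METHANOL_INCHI, 1.0, "mL"),  # hypothetical amount
        LogStep("stir", amount=30, unit="min"),     # hypothetical duration
    ]
    for step in steps:
        print(step.render())
```

Validating at construction time means a free-text entry fails immediately instead of being silently stored.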


And we've had the good luck of having a MiniMapper, a robot, in our lab that was on loan to us. We have been able to do 48 experiments at the same time. These are little tubes that have a filter in them. So in this experiment we basically mix chemicals together and the product comes out as a precipitate.


So the idea here is to try to figure out how to get the highest yield and how to get as much pure product as possible. So, we can program the MiniMapper to execute those experiments. In the spirit of openness, we use Google Docs.


We can just copy the Google Docs to the Mettler Toledo software, so that the person running the Mettler Toledo software doesn't necessarily have to know the experimental design to be able to provide the results.
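A design for 48 parallel tubes can be generated as a plain table that pastes into a spreadsheet. The sketch below builds one from three factors; the factors, their levels, and the column names are hypothetical choices that happen to multiply out to 48, not the design used in the lab, and nothing here reflects the robot software's real input format.

```python
import csv
import io
from itertools import product

# Hypothetical factors: 3 solvents x 4 concentrations x 4 acid loadings = 48 tubes.
SOLVENTS = ["methanol", "ethanol", "THF"]
CONCENTRATIONS_M = [0.2, 0.5, 1.0, 2.0]
ACID_EQUIV = [1.0, 1.5, 2.0, 3.0]


def build_design():
    """Return one row per tube covering every combination of the factors."""
    rows = []
    combos = product(SOLVENTS, CONCENTRATIONS_M, ACID_EQUIV)
    for tube, (solvent, conc, equiv) in enumerate(combos, start=1):
        rows.append({
            "tube": tube,
            "solvent": solvent,
            "concentration_M": conc,
            "acid_equiv": equiv,
        })
    return rows


def to_csv(rows):
    """Serialize the design as CSV text that can be pasted into a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    design = build_design()
    print(len(design), "tubes")
    print(to_csv(design).splitlines()[0])  # header row
```

Because the design is just a table, whoever runs the instrument can execute it without knowing why those conditions were chosen.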


So the idea here is that you can have true crowdsourcing, where people from around the world who have expertise in the experimental design of this experiment could request some experiments to be done within the confines of our experimental setup. Because we are making our results public, they could also benefit from that without us even really being fully aware of everything that is going on.


Anyways, that's sort of the bigger picture of this. And the machine, of course, understands XML so that makes it much, much easier than trying to read the human generated logs. That's the one advantage that we have with machines obviously.
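To show why XML is easier for a machine than a human-written log, here is a minimal sketch that reads a small XML log with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`experiment`, `step`, `compound`, `action`, and so on) are invented for this example and are not the schema used by the group; the amounts are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented example document; element names and values are assumptions.
LOG_XML = """<experiment id="EXP-demo">
  <step action="add" amount="1.0" unit="mL">
    <compound inchi="InChI=1S/CH4O/c1-2/h2H,1H3"/>
  </step>
  <step action="stir" amount="30" unit="min"/>
</experiment>"""


def read_steps(xml_text):
    """Parse the log and return each step as a plain dictionary."""
    root = ET.fromstring(xml_text)
    steps = []
    for step in root.findall("step"):
        compound = step.find("compound")
        steps.append({
            "action": step.get("action"),
            "amount": float(step.get("amount")),
            "unit": step.get("unit"),
            # Steps such as stirring have no compound attached.
            "inchi": compound.get("inchi") if compound is not None else None,
        })
    return steps


if __name__ == "__main__":
    for parsed in read_steps(LOG_XML):
        print(parsed)
```

No guessing about wording is needed: every quantity sits in a named attribute, which is the advantage over free text.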


So, yeah, that's basically the bigger concept here. I think we are going to move towards a world where machines are going to be able to design experiments, execute them and analyze them, but I don't think it will happen within the same unit. I think it's going to happen in a distributed way around the world. Much in the same way the Internet works now. Everybody contributes their own little part.


And my message is: it's fun to talk about these things, but let's start to do them. We can discuss standards like JCAMP that emerge, but we didn't decide on JCAMP before doing this. We fell upon JCAMP as a convenient solution to our problem. And try to get your data out in as many different formats as you possibly can with this new technology.


Transcription by CastingWords