Freeing Scientists with Big Data, Making Scholarship Simpler


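Graduation season brings not only parties, group photos, beer, and tears, but also the thesis that drives everyone crazy. Of all the work a thesis involves, the literature review is the horror our editor remembers most vividly. Many readers have surely also been through finding, reading one by one, and synthesizing the hundreds or thousands of papers in a field. The entrepreneur featured in this issue of our Origin (原点) column is no exception: while completing his doctoral dissertation, he came up with the idea of having computers and algorithms do this tedious work for scholars. It took five years for the idea to go from conception to initial completion. Today, scholars in biomedicine can largely use this tool to trace the research history and key papers of a field.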


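Sam Molyneux named his website Sciencescape, hoping to build a user experience that breaks away from traditional academic search sites. He describes the site as "twitter-like", aiming to give readers an entirely new academic experience in both look and feel.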

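In today's Origin column, Big Data Digest (大数据文摘) is pleased to share an interview with Sciencescape founder Sam Molyneux. Join us to learn about this medical scientist's entrepreneurial journey and how he earned his first pot of gold while pursuing research.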


Watch the following video first to see what Sciencescape does (02:12)


English reading practice starts now.
Are you ready?
Q&A
1. Can you briefly introduce Sciencescape? What are the factors/motivations that drove you to set up Sciencescape?

Briefly, we set out to build Sciencescape, which is a place on the web that connects you to breaking research as it happens around the world, and also gives you some powerful tools to explore the entire history of biomedicine.

The motivation for building Sciencescape came from a number of experiences I had as a cancer genomics researcher, during a PhD I was working on a number of years ago. First, there were the challenges around how difficult it is to browse and explore the history of any field of research you are working in: identifying landmark papers and tracing the connections among papers through history. Second, there was how challenging, nearly impossible, it is to actually stream and follow research in a personalized manner. Based on that, we looked at the tools that were on the market at the time, which were essentially just search engines with email alerts and RSS feeds that you could link to from journal tables of contents.

We envisioned a platform that you could walk into and follow anything in your world of research, whether it is people, genes, diseases, proteins, products, places, affiliations, etc.: any entity, among millions of potential entities in the world of research, that relates to you. In a "twitter-like" experience, research would stream to you through the web and through mobile, and there would be some powerful interfaces for exploring history. That is what we envisioned, and five years later, we have finally made some progress on it.

- Technical & Product-Related Questions


2. Based on your short intro video and online information, we've learned that Sciencescape leverages Natural Language Processing and Content Identification techniques. How do you apply those techniques on your platform to provide real-time, customized recommendations for users?


A really good call. Beyond the platform and the product user experience, the core of what we work on at Sciencescape is a knowledge graph, which we have been building for almost half a decade now. Essentially, a knowledge graph is a network of linked entities around which you organize content and which you relate to each other; we apply that concept to the world of biomedical literature. We set out on this project to identify, automatically and using machine learning, all of the entities that are related to any paper. Those entities come from, essentially, informatics classes, some of which I just mentioned. To do that, you have to apply a family of natural language processing approaches, with different algorithms for genes versus diseases versus organisms versus products, etc., along with a number of other machine learning approaches, such as classification algorithms for disambiguating authors or affiliations. We apply many different machine learning approaches to the universal challenge of tagging papers correctly with entities and concepts.
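The idea of a knowledge graph that links papers through shared entities can be illustrated with a toy sketch. It assumes a tiny hand-written lexicon and simple word matching in place of the machine learning taggers described in the interview; `LEXICON`, `tag_entities`, and `KnowledgeGraph` are hypothetical names invented for this illustration and are not Sciencescape's code.

```python
import re
from collections import defaultdict

# Hypothetical lexicon: surface form -> (entity class, canonical name).
# A real system would learn this mapping; here it is written by hand.
LEXICON = {
    "tp53": ("gene", "TP53"),
    "p53": ("gene", "TP53"),
    "brca1": ("gene", "BRCA1"),
    "breast cancer": ("disease", "Breast cancer"),
    "mouse": ("organism", "Mus musculus"),
}


def tag_entities(text):
    """Return the set of (class, name) entities whose surface form occurs in text."""
    lowered = text.lower()
    found = set()
    for surface, entity in LEXICON.items():
        # Whole-word match so that "p53" does not fire inside "tp53".
        if re.search(r"\b" + re.escape(surface) + r"\b", lowered):
            found.add(entity)
    return found


class KnowledgeGraph:
    """Bipartite graph: papers on one side, entities on the other."""

    def __init__(self):
        self.entities_of = defaultdict(set)  # paper id -> entities
        self.papers_of = defaultdict(set)    # entity -> paper ids

    def add_paper(self, paper_id, abstract):
        for entity in tag_entities(abstract):
            self.entities_of[paper_id].add(entity)
            self.papers_of[entity].add(paper_id)

    def related_papers(self, paper_id):
        """Papers that share at least one entity with paper_id."""
        related = set()
        for entity in self.entities_of[paper_id]:
            related |= self.papers_of[entity]
        related.discard(paper_id)
        return related


kg = KnowledgeGraph()
kg.add_paper("paper-1", "TP53 mutations in breast cancer")
kg.add_paper("paper-2", "p53 signalling in the mouse")
kg.add_paper("paper-3", "BRCA1 and DNA repair")
print(kg.related_papers("paper-1"))
```

Here "paper-1" and "paper-2" are linked because both surface forms resolve to the same canonical gene, which is the kind of normalization that makes following an entity across papers possible.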


3. What kinds of metrics are used to value a paper? The number of citations? What else? How do you evaluate the importance of a paper?


A number of years ago, we looked at the metrics field and concluded that there is a lot that can be done with citation data alone, much of it not yet fully productized. We sought to identify a universal metric that would allow us to organize papers relative to each other, and organize entities relative to each other. We landed on a variant of the PageRank algorithm, which is well known as the algorithm that Google started on. The variant we identified is called Eigenfactor, which is very similar to PageRank; we simply applied it to the citation network. If you calculate that on a very large, extensive citation network (today I believe we have the fourth largest in the world), you end up with robust article-level metrics to use in information retrieval, search rankings, and other listings of papers. But those metrics also propagate to the entity level, so you can calculate an institute-level Eigenfactor, an author-level Eigenfactor, or a gene-level Eigenfactor. So we ended up with a flexible, powerful, and robust metric, which propagates through our system and underpins many of these experiences, such as ranked lists. The short answer is that we use Eigenfactor.


That is only where we started. We wanted to identify a base metric to build our initial system on; where we are going, of course, is a diverse set of metrics, possibly hundreds of them, pulled from across the web as well as from internal network metrics, to improve the experience in a personalized way for users and to let them explicitly rank papers and entities against each other. We have actually made a lot of progress on that as well.
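The ranking idea in this answer can be sketched as plain PageRank by power iteration over a citation network, followed by a roll-up to the author level. This is an illustration under stated assumptions, not Sciencescape's implementation: it uses standard PageRank rather than the exact Eigenfactor computation, whose details differ, and summing article scores per author is an assumed way for a metric to "propagate" to entities. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def pagerank(citations, damping=0.85, iterations=100):
    """citations maps each paper to the list of papers it cites."""
    papers = set(citations)
    for cited_list in citations.values():
        papers.update(cited_list)
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        # Rank held by papers that cite nothing is spread evenly over all papers.
        dangling = sum(rank[p] for p in papers if not citations.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in papers}
        for citing, cited_list in citations.items():
            if cited_list:
                share = damping * rank[citing] / len(cited_list)
                for cited in cited_list:
                    new_rank[cited] += share
        rank = new_rank
    return rank


def aggregate_by_entity(rank, entities_of):
    """Assumed roll-up: an entity's score is the sum of its papers' scores."""
    totals = defaultdict(float)
    for paper, entities in entities_of.items():
        for entity in entities:
            totals[entity] += rank.get(paper, 0.0)
    return dict(totals)


# Toy citation network: D cites A and C, C cites A and B, B cites A.
citations = {"A": [], "B": ["A"], "C": ["A", "B"], "D": ["A", "C"]}
authors_of = {"A": ["Smith"], "B": ["Smith", "Lee"], "C": ["Lee"], "D": ["Chen"]}

rank = pagerank(citations)
author_scores = aggregate_by_entity(rank, authors_of)
for paper in sorted(rank, key=rank.get, reverse=True):
    print(paper, round(rank[paper], 3))
```

The same roll-up works for any entity class (institutes, genes) once papers are tagged with those entities; only the `authors_of` mapping changes.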


4. Some of the algorithms you just mentioned, such as PageRank, are widely applied by other companies in the industry as well. Compared to big article search engines and competitors such as Google Scholar, what are your competitive advantages?


First, we do not spend a lot of time thinking about our competitors; we spend a lot of time thinking about how to build a different, transformed experience for streaming and exploring papers through history. I don't think there are any products, platforms, or services out there that fill that need, which is why it is worth being passionate about and why we work so intensively on it. We think it is a unifying platform, not necessarily a disruptive one, because the function is not well filled. Thinking about search: search is an effective way of accessing the literature. We think there is great search out there; Google Scholar, for example, has the most extensive citation search. But search is a solved problem for papers you know exist, if you can describe a paper well. If you have a fragment of the title, part of the abstract text, or a couple of the authors, it is trivial to find that paper by searching. You can use PubMed and Google Scholar to do this. There is great enterprise search as well. We don't work on or think about search too much, though we intend to have good search on our platform. We think about streaming and discovery a lot more, and we think the best opportunity really is to improve the scientific readership experience globally.


5. From some of your other interviews, we learned that your company also has partnerships with individuals and organizations, such as publishers. Compared to the product you mentioned before, this is a totally different business model. Can you talk a little more about that?


We have been building our partnerships with publishers for quite a while. Essentially, we work with publishers to help increase the discoverability of articles on a full-text basis. We think publishers are really important in terms of the on-time delivery of the weekly or monthly articles that researchers search for. In the current marketplace, that is essentially not happening. Researchers are underserved and publishers are underserved. So over the last couple of years we built a program for publishers. We accept their content and mine it very deeply to increase discoverability, applying the family of machine learning algorithms to each one of the articles. We are open to the publishing industry and value the relationships in which we are working together.


- Operation and Strategy Questions

6. Has the company generated revenue so far? How do you monetize your business model?

We are at the Series A stage. We have had some fantastic investors, including a large group of Canadian investors with whom we built the company during the Series A stage. For the next stage, we have venture capital firms in Canada, Hong Kong, and the US. We have 30 people in the company, and we are located in a beautiful start-up space in downtown Toronto. Our company is dominated by data science and engineering, and we have just started to build up our product team in terms of product management, marketing, and design. Our company has gone through a tremendous transformation internally. A lot of that work will start to expand externally over the next six months.


7. What is the biggest challenge your company has come across so far? How did you deal with it?


Our challenges are pretty similar to those of any other start-up. Fundraising is always a challenge. We are in a very important but specialized market. We are not providing a consumer-facing application in the sense of offering it to the broad consumer market. Our application is consumer-facing with respect to researchers and readers of the research literature. So there are not one billion readers of scientific literature; it is just a small number. But the people involved in reading the literature are tremendously important in terms of what they contribute to society, the budgets they command in research dollars, and their role in driving medical and other societal progress. Because of the specialized nature of the market, it is sometimes difficult to persuade people to invest in the company. I think that by now we have achieved a lot, and we have overcome a lot of those communication limitations.



8. Big Data Digest has over 150,000 followers who are interested in or currently working in various big data fields. Is China going to be one of your future markets? How can we help you build influence using our connections and resources?

We see a tremendous opportunity to help increase the discoverability of research for Chinese researchers, a market that is growing really fast. One of the companies we partner with focuses on some of these emerging research markets, and some of its data show that the growth of these markets is just explosive. So China is definitely a big market for us, and India as well. If there are any opportunities to help us grow our user base through your networks, we will be very grateful to work with you.

We have an Institutional Ambassador Program: researchers who are interested can join our company and work from abroad to promote their research through our product. We can provide contact information for readers who are interested in this program.


- Other Questions on the Future Market, Start-up Management, and Entrepreneurship


9. As a researcher, how do you envision cancer therapy in the future?


I think the program of personalized medicine and cancer genomics is a long-term one. It is going to take many years to elucidate the long tail of changes in each cancer genome. Each tumor is very different, and the cells within a tumor show a lot of differences. Heterogeneous change is the most constant thing about the cancer genome. But looking at this another way, despite the large number of mutations in different cancer genes, all the changes more or less resolve into a relatively small number of pathways. If we have enough pathway-targeting drugs and can use them in a combinatorial fashion, we can systematically and effectively suppress cancer across multiple pathways. Of course, we also need to take cancer evolution into account, but I think this is a productive approach. Recent cancer immunotherapy results are also staggering. Considering the nature of cancer and of cancer genome evolution, it is definitely better to have multiple lines of therapy. If we can combine them, we can probably suppress cancer in the long run.

10. Do you have any plans to expand your business to clinical informatics? Or other related medical informatics?

I think that is interesting. We have a lot of people with backgrounds in bioinformatics and in clinical/medical informatics as well. We are passionate about those approaches. But as a start-up, we must focus on one main problem and one main market, and get really good at something within that space. We think the data we yield in the process of mining scientific articles to create our knowledge graph can probably be used in precision medicine, and probably in bioinformatics as another data source to analyze. We will probably focus on providing a transformed data set that can be used by any academic. For next year, we are working on the ability for academics to use our data and build on it.



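Although Sam considers Sciencescape to be at an early stage, with much potential still to be tapped, the grand blueprint he has drawn for it comes through clearly in his words. He hopes that users searching the literature can roam the world of biomedical knowledge through Sciencescape and reach papers from any era more efficiently.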

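However, neither the user experience of Sciencescape's interface nor its current volume of literature is yet complete. To become an academic literature search tool that rivals Google Scholar while offering distinctive advantages of its own, Sciencescape still has a long way to go.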

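Facing the Chinese market and its huge potential, Sam made clear his eagerness to engage and cooperate. But as for whether Sciencescape can handle translation of non-English languages as accurately as Wikipedia, Sam only expressed a wish and has not put this work on the near-term agenda.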

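Perhaps for a start-up like Sciencescape, being able to stand out as different counts as a good start, but how far and how wide it can go will depend on its grasp of the market and its exploration of its core business.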


Original publication date: 2015-07-04

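This article comes from Big Data Digest (大数据文摘), a partner of the Yunqi Community (云栖社区). For related information, follow the "BigDataDigest" WeChat official account.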
