The GAssist Pittsburgh Learning Classifier System. Dr. J. Bacardit, N. Krasnogor G53BIO - Bioinformatics

Relaterede dokumenter
PARALLELIZATION OF ATTILA SIMULATOR WITH OPENMP MIGUEL ÁNGEL MARTÍNEZ DEL AMOR MINIPROJECT OF TDT24 NTNU

Basic statistics for experimental medical researchers

Vina Nguyen HSSP July 13, 2008

Particle-based T-Spline Level Set Evolution for 3D Object Reconstruction with Range and Volume Constraints

Linear Programming ١ C H A P T E R 2

Black Jack --- Review. Spring 2012

Strings and Sets: set complement, union, intersection, etc. set concatenation AB, power of set A n, A, A +

v Motivation v Multi- Atlas Segmentation v Learn Dictionary v Apply Dictionary v Results

Unitel EDI MT940 June Based on: SWIFT Standards - Category 9 MT940 Customer Statement Message (January 2004)

Differential Evolution (DE) "Biologically-inspired computing", T. Krink, EVALife Group, Univ. of Aarhus, Denmark

Observation Processes:

Aktivering af Survey funktionalitet

Improving data services by creating a question database. Nanna Floor Clausen Danish Data Archives

Hvor er mine runde hjørner?

On the complexity of drawing trees nicely: corrigendum

Project Step 7. Behavioral modeling of a dual ported register set. 1/8/ L11 Project Step 5 Copyright Joanne DeGroat, ECE, OSU 1

RoE timestamp and presentation time in past

Heuristics for Improving

IBM Network Station Manager. esuite 1.5 / NSM Integration. IBM Network Computer Division. tdc - 02/08/99 lotusnsm.prz Page 1

Small Autonomous Devices in civil Engineering. Uses and requirements. By Peter H. Møller Rambøll

CHAPTER 8: USING OBJECTS

Developing a tool for searching and learning. - the potential of an enriched end user thesaurus

what is this all about? Introduction three-phase diode bridge rectifier input voltages input voltages, waveforms normalization of voltages voltages?

A multimodel data assimilation framework for hydrology

Cross-Sectorial Collaboration between the Primary Sector, the Secondary Sector and the Research Communities

ECE 551: Digital System * Design & Synthesis Lecture Set 5

Skriftlig Eksamen Beregnelighed (DM517)

Handout 1: Eksamensspørgsmål

The X Factor. Målgruppe. Læringsmål. Introduktion til læreren klasse & ungdomsuddannelser Engelskundervisningen

Probabilistic properties of modular addition. Victoria Vysotskaya

Evaluating Germplasm for Resistance to Reniform Nematode. D. B. Weaver and K. S. Lawrence Auburn University

Portal Registration. Check Junk Mail for activation . 1 Click the hyperlink to take you back to the portal to confirm your registration

Bilag. Resume. Side 1 af 12

Avancerede bjælkeelementer med tværsnitsdeformation

Trolling Master Bornholm 2014

Privat-, statslig- eller regional institution m.v. Andet Added Bekaempelsesudfoerende: string No Label: Bekæmpelsesudførende

Generalized Probit Model in Design of Dose Finding Experiments. Yuehui Wu Valerii V. Fedorov RSU, GlaxoSmithKline, US

Kvant Eksamen December timer med hjælpemidler. 1 Hvad er en continuous variable? Giv 2 illustrationer.

Appendix 1: Interview guide Maria og Kristian Lundgaard-Karlshøj, Ausumgaard

Applications. Computational Linguistics: Jordan Boyd-Graber University of Maryland RL FOR MACHINE TRANSLATION. Slides adapted from Phillip Koehn

ESG reporting meeting investors needs

SOFTWARE PROCESSES. Dorte, Ida, Janne, Nikolaj, Alexander og Erla

IPv6 Application Trial Services. 2003/08/07 Tomohide Nagashima Japan Telecom Co., Ltd.

Statistik for MPH: 7

Er implementering af 3R = Dyrevelfærd?

LED STAR PIN G4 BASIC INFORMATION: Series circuit. Parallel circuit HOW CAN I UNDERSTAND THE FOLLOWING SHEETS?

Den nye Eurocode EC Geotenikerdagen Morten S. Rasmussen

Baltic Development Forum

2a. Conceptual Modeling Methods

Bachelorprojekt. Forår 2013 DMD10

Engelsk. Niveau C. De Merkantile Erhvervsuddannelser September Casebaseret eksamen. og

Skidding System. Challenge Us

Sortering fra A-Z. Henrik Dorf Chefkonsulent SAS Institute

Financial Literacy among 5-7 years old children

Design til digitale kommunikationsplatforme-f2013

Curve Modeling B-Spline Curves. Dr. S.M. Malaek. Assistant: M. Younesi

CS 4390/5387 SOFTWARE V&V LECTURE 5 BLACK-BOX TESTING - 2

Trolling Master Bornholm 2014

Vores mange brugere på musskema.dk er rigtig gode til at komme med kvalificerede ønsker og behov.

Outline CS 4387/5387 SOFTWARE V&V LECTURE 7 INTEGRATION TESTING. Integration Testing. Integrating OO Applications. Definition Strategies 6/20/2018

Titel: Barry s Bespoke Bakery

Det er muligt at chekce følgende opg. i CodeJudge: og

South Baileygate Retail Park Pontefract

Website review groweasy.dk

How Long Is an Hour? Family Note HOME LINK 8 2

Skriftlig Eksamen Beregnelighed (DM517)

Brug sømbrættet til at lave sjove figurer. Lav fx: Få de andre til at gætte, hvad du har lavet. Use the nail board to make funny shapes.

Trolling Master Bornholm 2012

Åbenrå Orienteringsklub

Angle Ini/al side Terminal side Vertex Standard posi/on Posi/ve angles Nega/ve angles. Quadrantal angle

Engelsk. Niveau D. De Merkantile Erhvervsuddannelser September Casebaseret eksamen. og

Skriftlig Eksamen Kombinatorik, Sandsynlighed og Randomiserede Algoritmer (DM528)

Sikkerhed & Revision 2013

USERTEC USER PRACTICES, TECHNOLOGIES AND RESIDENTIAL ENERGY CONSUMPTION

Exercise 6.14 Linearly independent vectors are also affinely independent.

Velkommen til IFF QA erfa møde d. 15. marts Erfaringer med miljømonitorering og tolkning af nyt anneks 1.

Special VFR. - ved flyvning til mindre flyveplads uden tårnkontrol som ligger indenfor en kontrolzone

Learnings from the implementation of Epic

United Nations Secretariat Procurement Division

INTEL INTRODUCTION TO TEACHING AND LEARNING AARHUS UNIVERSITET

On the Relations Between Fuzzy Topologies and α Cut Topologies

DIVAR VIGTIGT! / IMPORTANT! MÅL / DIMENSIONS. The DIVAR wall lamp comes standard. with 2.4 m braided cord and a plug in power supply (EU or UK).

LUL s Flower Power Vest dansk version

Reexam questions in Statistics and Evidence-based medicine, august sem. Medis/Medicin, Modul 2.4.

Maneurop reciprocating compressors

Sign variation, the Grassmannian, and total positivity

Sampling real algebraic varieties for topological data analysis

Opera Ins. Model: MI5722 Product Name: Pure Sine Wave Inverter 1000W 12VDC/230 30A Solar Regulator

MSE PRESENTATION 2. Presented by Srunokshi.Kaniyur.Prema. Neelakantan Major Professor Dr. Torben Amtoft

Userguide. NN Markedsdata. for. Microsoft Dynamics CRM v. 1.0

WIO200A INSTALLATIONS MANUAL Rev Dato:

Trolling Master Bornholm 2015

Resource types R 1 1, R 2 2,..., R m CPU cycles, memory space, files, I/O devices Each resource type R i has W i instances.

Basic Design Flow. Logic Design Logic synthesis Logic optimization Technology mapping Physical design. Floorplanning Placement Fabrication

Help / Hjælp

Motion på arbejdspladsen

Measuring the Impact of Bicycle Marketing Messages. Thomas Krag Mobility Advice Trafikdage i Aalborg,

Trolling Master Bornholm 2014

Women in STEM education in the Nordics

Our activities. Dry sales market. The assortment

Transkript:

The GAssist Pittsburgh Learning Classifier System Dr. J. Bacardit, N. Krasnogor G53BIO -

Outline bioinformatics Summary and future directions

Objectives of GAssist GAssist [Bacardit, 04] is a Pittsburgh Approach Learning Classifier System evolving variable-length rule sets The research done on this system has three objectives Generation of compact and accurate solutions Run-time reduction Representations for real valued attributes

Objectives of GAssist Representations for real valued attributes GAssist should be applicable to a range of problems as broad as possible This means that it should be able to handle continuous attributes Achieved by the The Adaptive Discretization Intervals (ADI) rule representation

GAssist has been applied to protein domains Proteins are biological molecules of primary importance to the functioning of living organisms Proteins are constructed as a chains of amino acid residues This chain folds to create a 3D structure

It is relatively easy to know the primary sequence of a protein, but much more difficult to know its 3D structure Can we predict the 3D structure of a protein from its primary sequence? Protein Structure Prediction (PSP) PSP problem is divided in several sub problems. We focus on Coordination Number (CN) prediction

Primary Structure = Sequence MKYNNHDKIRDFIIIEAYMFRFKKKVKPEVDMTIKEFILLTY LFHQQENTLPFKKIVSDLCYKQSDLVQHIKVLVKHSYISKV RSKIDERNTYISISEEQREKIAERVTLFDQIIKQFNLADQSE SQMIPKDSKEFLNLMMYTMYFKNIIKKHLTLSFVEFTILAIIT SQNKNIVLLKDLIETIHHKYPQTVRALNNLKKQGYLIKERS TEDERKILIHMDDAQQDHAEQLLAQVNQLLADKDHLHLVF E Secondary Structure Tertiary Local Interactions Global Interactions

Coordination Number (CN) prediction Two residues of a chain are said to be in contact if their distance is less than a certain threshold CN of a residue : number of contacts that a certain residue has

Kinjo et al. s definition of CN Distance between two residues is defined as the distance between their C β atoms (C α for Glycine) Uses a smooth definition of contact based on a sigmoid function instead of the usual crisp definition Discards local contacts

Classification approach We need to convert the real-valued CN into a finite set of categories We have tested two criteria based on the two usual unsupervised discretization methods: Uniform Frequency (UF) and Uniform Length (UL) UF UL

Protein dataset Used the same set used by Kinjo et al. 1050 protein chains 259768 residues Ten lists of the chains are available, first 950 chains in each list are for training, the rest for tests (10xbootstrap)

We have to transform the data into a regular structure so that it can be processed by standard machine learning techniques Each residue is characterized by several features. We use one (i.e., the AA type) or more of them as input information and one of them as target (CN) R i-5 R i-4 CN i-4 R i-3 CN i-3 R i-2 R i-1 R i R i+1 R i+2 R i+3 R i+4 R i+5 CN i-5 CN i-2 CN i-1 CN i CN i+3 CN i+1 CN i+2 CN i+4 CN i+5 R i-1,r i,r i+1 CN i R i,r i+1,r i+2 CN i+1 R i+1,r i+2,r i+3 CN i+2

Input information 3 types of input information Base information: The AA type of the residues included in the window around the target Global protein information Aim: providing information about the average CN of the protein chain 1st version: 21 attributes: length of the protein and frequency of appearance of the 20 AA types 2nd version: 1 attribute: predicted ave. CN using the 21 att. defined above as input Predicted SS of the residues included in the window

Summary of results All datasets using UL class definition have better performance than their UF equivalent (7-12% dif.) PredSS gives a 2-3% performance boost Global protein information gives a 1.5-2% performance boost

Interpretability analysis of GAssist Example of a rule set for the CN1-UL-2 classes dataset All AA types associated to the central residue are hydrophobic (core of a protein) D, E consistently do not appear in the predicates. They are negatively charges residues (surface of a protein)

Future directions Assessing the added value of the class definitions Testing other types of input information Extending the interpretability analysis Improving GAssist With purely ML techniques Feeding back information from the interpretability analysis to bias the search

Summary and future directions GAssist produces very compact but accurate rule sets This is done by combining the techniques described briefly in this presentation

Summary and future directions Future directions Smart recombination operators Develop theoretical models for all the components of the system