Perl: Matching four different files and obtaining particular information in an output file
I have 4 files. File 1 (named input_22.txt) is the input file containing 2 columns (space delimited). The first column is an alphabetically sorted list of ligand codes (a three letter/number code for a particular ligand). The second column is a list of PDB codes (Protein Data Bank codes), one for each ligand code (this list is unsorted, though).
File 1 (input_22.txt):

    803 1cqp
    amh 1b2i
    asc 1f9g
    ets 1cil
    mit 1dwc
    tfp 1ctr
    vdx 1db1
    zmr 1a4g

File 2 (named sd_2.txt) is an SDF (structure data file) containing the fragments of each ligand. A ligand can contain 1 or more fragments. For instance, here the ligand code 803 has 2 fragments. The file looks like this: four dollar signs ($$$$) followed by the ligand code (i.e. 803 in the example) on the next line. Every fragment follows the same pattern. Next, on the 5th line of each fragment (the third line after $$$$\n803), there is a number that represents the number of rows in the next block of rows, like 7 in the first fragment and 10 in the next fragment of the 803 ligand. Now, this next block of rows contains a column (61-62) holding a specific number that refers to the atoms in the fragment. For example, in the first fragment of 803, these numbers are 15, 16, 17, 19, 20, 21, 22. These numbers need to be matched in file 3.
File 2 (sd_2.txt) looks like:

    $$$$
    803
      SciTegic05101215222D

      7  7  0  0  0  0            999 V2000
        3.0215   -0.5775    0.0000 C   0  0  0  0  0  0  0  0  0 15  0  0
        2.3070   -0.9900    0.0000 C   0  0  0  0  0  0  0  0  0 16  0  0
        1.5926   -0.5775    0.0000 C   0  0  0  0  0  0  0  0  0 17  0  0
        1.5926    0.2475    0.0000 C   0  0  0  0  0  0  0  0  0 19  0  0
        2.3070    0.6600    0.0000 C   0  0  0  0  0  0  0  0  0 20  0  0
        2.3070    1.4850    0.0000 O   0  0  0  0  0  0  0  0  0 21  0  0
        3.0215    0.2475    0.0000 O   0  0  0  0  0  0  0  0  0 22  0  0
      1  2  1  0
      1  7  1  0
      2  3  1  0
      3  4  1  0
      4  5  1  0
      5  6  2  0
      5  7  1  0
    M  END
    > <name>
    803

    > <num_rings>
    1

    > <num_csp3>
    4

    > <fsp3>
    0.8

    > <fstereo>
    0

    $$$$
    803
      SciTegic05101215222D

     10 11  0  0  0  0            999 V2000
       -1.7992   -1.7457    0.0000 C   0  0  0  0  0  0  0  0  0  1  0  0
       -2.5137   -1.3332    0.0000 C   0  0  0  0  0  0  0  0  0  2  0  0
       -2.5137   -0.5082    0.0000 C   0  0  0  0  0  0  0  0  0  3  0  0
       -1.7992   -0.0957    0.0000 C   0  0  0  0  0  0  0  0  0  5  0  0
       -1.0847   -0.5082    0.0000 C   0  0  0  0  0  0  0  0  0  6  0  0
       -0.3702   -0.0957    0.0000 C   0  0  0  0  0  0  0  0  0  7  0  0
        0.3442   -0.5082    0.0000 C   0  0  0  0  0  0  0  0  0  8  0  0
        0.3442   -1.3332    0.0000 C   0  0  0  0  0  0  0  0  0  9  0  0
       -0.3702   -1.7457    0.0000 C   0  0  0  0  0  0  0  0  0 11  0  0
       -1.0847   -1.3332    0.0000 C   0  0  0  0  0  0  0  0  0 12  0  0
      1  2  1  0
      1 10  1  0
      2  3  1  0
      3  4  1  0
      4  5  2  0
      5  6  1  0
      5 10  1  0
      6  7  2  0
      7  8  1  0
      8  9  1  0
     10  9  1  0
    M  END
    > <name>
    803

    > <num_rings>
    2

    > <num_csp3>
    6

    > <fsp3>
    0.6

    > <fstereo>
    0.1

File 3 is a CIF (Crystallographic Information File). The file can be obtained from the following link: file_3. This file is a collection of individual CIF files for several ligand molecules. Each part of the file starts with data_ligandcode; for our example it is data_803. 46 lines after the start of each small file in the collection, there is a block that gives structural information about the molecule. The number of rows in this block is not fixed. However, the block ends with a hash sign (#). In this block 2 columns are important: 53-56 and 62-63. Columns 62-63 contain the numbers that can be matched with the numbers obtained from file 2. Columns 53-56 contain atom names like C1 (carbon 1) etc. This column can be used to match against file 4.
File 4 is a grow.out file that contains information about the interaction of each ligand with its target protein. The file name is the PDB code given in file 1 against each ligand. For example, for ligand 803 the PDB code is 1cqp, so the grow.out file has the name 1cqp. In the 1cqp file, the important rows are those that contain the ligand code (for example 803) and an atom name obtained from columns 53-56 of file 3.
Task: I need a script that reads a ligand code from file 1, goes to file 2 to search for $$$$\nligandcode, and then obtains the numbers in columns 61-62 for each fragment. In the next step the script should pass these numbers to file 3, match the rows containing these numbers in columns 62-63 of file 3, and then pull out the information in columns 53-56 (atom names). The last step is opening file 4, which has the name of the PDB code, and printing the rows containing the ligand code and the atom names obtained from file 3. The printing should be done to an output file.
I am a biomedical research student. I don't have a computer science background. However, I have to use Perl programming for this task. For the above mentioned task I wrote a script, but it is not working and I cannot find the reason behind it. The script I wrote is:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::Table;
    use Carp qw(croak);

    {
        my $a;
        my $b;
        my $input_file = "input_22.txt";
        my @lines = slurp($input_file);

        foreach my $line (@lines){
            my ($ligandcode, $pdbcode) = split(/\t/, $line);
            my $i=0; my $k=0; my @array; my @array1;
            open (FILE, '<', "sd_2.txt");
            while (<FILE>) {
                my $i=0; my $k=0; my @array; my @array1;
                if ( $_=~/\x24\x24\x24\x24/ . /\n$ligandcode/) {
                    my $nextline1 = <FILE>;
                    my $nextline2 = <FILE>;
                    my $nextline3 = <FILE>;
                    my $nextline4= <FILE>;
                    my $totalatoms= substr( $nextline4, 1,2);
                    print $totalatoms,"\n";
                    while ($i<$totalatoms) {
                        my $nextlines= <FILE>;
                        my $sub= substr($nextlines, 61, 2);
                        print $sub;
                        $array[$i] = $sub;
                        open (FH, '<', "components.txt");
                        while (my $ship=<FH>) {
                            my $var="data_$ligandcode";
                            if ($ship=~/$var/) {
                                while ($k<=44) {
                                    $k++;
                                    my $nextline = <FH>;
                                }
                                my $j=0;
                                my $nextline3;
                                do {
                                    $nextline3=<FH>;
                                    print $nextline3;
                                    my $part= substr($nextline3, 62, 2);
                                    my $part2= substr($nextline3, 53, 4);
                                    $array1[$j] = $part;
                                    if ($array1[$j] eq $array[$i]) {
                                        print $part2, "\n";
                                        open (GH, '<', "$pdbcode");
                                        open (OH, ">>out_grow.txt");
                                        while (my $grow = <GH>) {
                                            if ( $grow=~/$ligandcode/){
                                                print OH $grow if $grow=~/$part2/;
                                            }
                                        }
                                        close (GH);
                                        close (OH);
                                    }
                                    $j++;
                                } while $nextline3 !~/\x23/;
                            }
                        }
                        $i++;
                        close (FH);
                    }
                }
            }
            close (FILE);
        }
    }

    ##slurps a file into a list
    sub slurp {
        my ($file) = @_;
        my (@data, @data_chomped);
        open IN, "<", $file or croak "can't open $file\n";
        @data = <IN>;
        foreach my $line (@data){
            chomp($line);
            push (@data_chomped, $line);
        }
        close IN;
        return (@data_chomped);
    }

I want to make a script that works fast and works for 1000 fragments altogether, if I make a list of 400 molecules in file 1. Kindly help me to make this script work. I'll be grateful.
You need to break your code up into manageable steps.
- Create data structures from the files:

      use Slurp;
      my @input = map { [ split /\s+/, $_, 2 ] } slurp $input_filename;
      # etc

- Process each element of input_22.txt, using the data structures.
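As a rough sketch of the first step, the SD file can be read once into a hash keyed by ligand code, so it never has to be reopened for each line of input_22.txt. The helper names read_fragments and make_record are hypothetical and the sample records are synthetic. The column positions assume the standard fixed-width V2000 atom line, where the number the question finds at columns 61-62 sits in the three-character field at offsets 60-62; adjust them if your files differ.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read SD data once and return
#   { ligandcode => [ [atom numbers of fragment 1], [fragment 2], ... ] }
# Assumes every fragment starts with "$$$$" followed by the ligand code,
# as described in the question.
sub read_fragments {
    my ($fh) = @_;
    my %fragments;
    while ( defined( my $line = <$fh> ) ) {
        next unless $line =~ /^\$\$\$\$/;    # start of a fragment
        my $code = <$fh>;
        last unless defined $code;
        $code =~ s/^\s+|\s+$//g;             # trim whitespace and newline
        for ( 1 .. 2 ) { my $skipped = <$fh>; }    # program and comment lines
        my $counts = <$fh>;
        last unless defined $counts;
        my $n_atoms = substr( $counts, 0, 3 ) + 0; # first field: atom count
        my @numbers;
        for ( 1 .. $n_atoms ) {
            my $atom = <$fh>;
            last unless defined $atom;
            chomp $atom;
            next if length($atom) < 63;      # too short to hold the field
            push @numbers, substr( $atom, 60, 3 ) + 0;
        }
        push @{ $fragments{$code} }, \@numbers;
    }
    return \%fragments;
}

# Hypothetical helper that builds one synthetic fragment for the demo.
sub make_record {
    my ( $code, @numbers ) = @_;
    my $record = "\$\$\$\$\n$code\n  SciTegic05101215222D\n\n";
    $record .= sprintf "%3d%3d  0  0  0  0            999 V2000\n",
        scalar @numbers, 0;
    for my $number (@numbers) {
        $record .= sprintf "%10.4f%10.4f%10.4f %-3s%2d" . ( "%3d" x 11 ) . "\n",
            0, 0, 0, 'C', 0, (0) x 8, $number, 0, 0;
    }
    return $record . "M  END\n";
}

# Demo: two fragments of ligand 803, read from an in-memory string.
my $sdf = make_record( 803, 15, 16, 17 ) . make_record( 803, 1, 2, 3, 5 );
open my $fh, '<', \$sdf or die "Cannot open in-memory SD data: $!";
my $fragments = read_fragments($fh);
close $fh;

for my $i ( 0 .. $#{ $fragments->{803} } ) {
    print "803 fragment ", $i + 1, ": @{ $fragments->{803}[$i] }\n";
}
```

With the real file you would pass a handle opened on sd_2.txt instead of the in-memory string. The CIF file could be read the same way into a second hash (ligand code, then number, then atom name), after which processing each line of input_22.txt is just a pair of hash lookups.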
I think you should look into PerlMol. After all, half the reason to use Perl is CPAN.
Things you did well
- Using 3-arg open
- use strict; use warnings;
Things you shouldn't have done
- (Re)defined $a, $b. They are already defined for you.
- Reimplemented slurp (poorly).
- Read the same file in multiple times. You opened sd_2.txt once for every line of input_22.txt.
- Defined symbols outside of the scope where you use them. $j, $k, @array, @array1 are defined twice, with only one of the definitions being used.
- Used open, close without any sort of error checking. Either open ... or die; or use autodie;
- You used bareword filehandles: IN, FILE etc. Instead use open my $fh, ...
Most of those aren't that big of a deal though, for a one-off program.
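A minimal sketch of the points about error checking and filehandles, assuming the autodie pragma is available (it ships with recent Perls). The helper name read_pairs is hypothetical, and the temporary file only stands in for input_22.txt so the example runs on its own. It also uses split ' ', which accepts spaces or tabs; the question describes the file as space delimited while the script splits on /\t/.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use autodie;                      # open/close now die with a useful message
use File::Temp qw(tempfile);      # only needed to make the demo self-contained

# Hypothetical helper: read the ligand/PDB code pairs once.
sub read_pairs {
    my ($filename) = @_;
    open my $in, '<', $filename;  # lexical filehandle, checked by autodie
    my @pairs;
    while ( my $line = <$in> ) {
        next unless $line =~ /\S/;           # skip blank lines
        my ( $ligandcode, $pdbcode ) = split ' ', $line;
        push @pairs, [ $ligandcode, $pdbcode ];
    }
    close $in;
    return @pairs;
}

# Demo with a temporary stand-in for input_22.txt.
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
print {$tmp_fh} "803 1cqp\namh 1b2i\n";
close $tmp_fh;

my @pairs = read_pairs($tmp_name);
print "$_->[0] => $_->[1]\n" for @pairs;

# A missing file raises an exception instead of failing silently.
my $ok = eval { read_pairs("$tmp_name.missing"); 1 };
print "open failed as expected\n" unless $ok;
```

Because the handle $in is a lexical variable, it is scoped to the subroutine and cannot clash with a handle of the same name elsewhere, which is the main practical gain over bareword handles such as IN or FILE.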