Perl: Matching four different files and obtaining particular Information in output file


I have 4 files. File 1 (named input_22.txt) is an input file containing 2 columns (space delimited). The first column is an alphabetically sorted list of ligand codes (a three letter/number code for a particular ligand). The second column is a list of PDB codes (Protein Data Bank codes) corresponding to each ligand code (an unsorted list, though).

File 1 (input_22.txt):

    803 1cqp
    amh 1b2i
    asc 1f9g
    ets 1cil
    mit 1dwc
    tfp 1ctr
    vdx 1db1
    zmr 1a4g

File 2 (named sd_2.txt) is an SDF (structure data file) containing the fragments of each ligand. A ligand can contain 1 or more fragments. For instance, here the ligand code 803 has 2 fragments. The file looks like this: four dollar signs ($$$$) followed by the ligand code (i.e. 803 in this example) on the next line. Every fragment follows the same pattern. Next, the 5th line of each fragment (the third line after $$$$\n803) holds a number that represents the number of rows in the next block of rows, e.g. 7 in the first fragment and 10 in the next fragment of ligand 803. This next block of rows contains a column (61-62) with a specific number that refers to the atoms in the fragment. For example, in the first fragment of 803 these numbers are 15, 16, 17, 19, 20, 21, 22. These numbers need to be matched in file 3.

File 2 (sd_2.txt) looks like:

    $$$$
    803
      SciTegic05101215222D

      7  7  0  0  0  0            999 V2000
        3.0215   -0.5775    0.0000 C   0  0  0  0  0  0  0  0  0 15  0  0
        2.3070   -0.9900    0.0000 C   0  0  0  0  0  0  0  0  0 16  0  0
        1.5926   -0.5775    0.0000 C   0  0  0  0  0  0  0  0  0 17  0  0
        1.5926    0.2475    0.0000 C   0  0  0  0  0  0  0  0  0 19  0  0
        2.3070    0.6600    0.0000 C   0  0  0  0  0  0  0  0  0 20  0  0
        2.3070    1.4850    0.0000 O   0  0  0  0  0  0  0  0  0 21  0  0
        3.0215    0.2475    0.0000 O   0  0  0  0  0  0  0  0  0 22  0  0
      1  2  1  0
      1  7  1  0
      2  3  1  0
      3  4  1  0
      4  5  1  0
      5  6  2  0
      5  7  1  0
    M  END
    > <name>
    803

    > <num_rings>
    1

    > <num_csp3>
    4

    > <fsp3>
    0.8

    > <fstereo>
    0

    $$$$
    803
      SciTegic05101215222D

     10 11  0  0  0  0            999 V2000
       -1.7992   -1.7457    0.0000 C   0  0  0  0  0  0  0  0  0  1  0  0
       -2.5137   -1.3332    0.0000 C   0  0  0  0  0  0  0  0  0  2  0  0
       -2.5137   -0.5082    0.0000 C   0  0  0  0  0  0  0  0  0  3  0  0
       -1.7992   -0.0957    0.0000 C   0  0  0  0  0  0  0  0  0  5  0  0
       -1.0847   -0.5082    0.0000 C   0  0  0  0  0  0  0  0  0  6  0  0
       -0.3702   -0.0957    0.0000 C   0  0  0  0  0  0  0  0  0  7  0  0
        0.3442   -0.5082    0.0000 C   0  0  0  0  0  0  0  0  0  8  0  0
        0.3442   -1.3332    0.0000 C   0  0  0  0  0  0  0  0  0  9  0  0
       -0.3702   -1.7457    0.0000 C   0  0  0  0  0  0  0  0  0 11  0  0
       -1.0847   -1.3332    0.0000 C   0  0  0  0  0  0  0  0  0 12  0  0
      1  2  1  0
      1 10  1  0
      2  3  1  0
      3  4  1  0
      4  5  2  0
      5  6  1  0
      5 10  1  0
      6  7  2  0
      7  8  1  0
      8  9  1  0
     10  9  1  0
    M  END
    > <name>
    803

    > <num_rings>
    2

    > <num_csp3>
    6

    > <fsp3>
    0.6

    > <fstereo>
    0.1

File 3 is a CIF (crystallographic information file). The file can be obtained from the following link: File_3. This file is a collection of individual CIF files for several ligand molecules. Each part in the file starts with data_ligandcode; in our example that is data_803. 46 lines after the start of each small file in the collection, there is a block that gives structural information about the molecule. The number of rows in this block is not fixed; however, the block ends with a hash sign (#). In this block 2 columns are important: 53-56 and 62-63. Columns 62-63 contain numbers that can be matched with the numbers obtained from file 2, and columns 53-56 contain atom names such as C1 (carbon 1) etc. This column can be used to match against file 4.

File 4 is a grow.out file that contains information about the interaction of each ligand with its target protein. The file name is the PDB code given in file 1 against each ligand. For example, ligand 803 has the PDB code 1cqp, so its grow.out file is named 1cqp. In the 1cqp file, the important rows are those that contain the ligand code (for example 803) and an atom name obtained from columns 53-56 of file 3.

Task: I need a script that reads the ligand code from file 1, goes to file 2, searches for $$$$\nligandcode, and obtains the numbers from columns 61-62 for each fragment. In the next step the script should pass these numbers to file 3, match the rows containing these numbers in columns 62-63 of file 3, and pull out the information in columns 53-56 (atom names). The last step is opening file 4, named after the PDB code, and printing the rows that contain the ligand code and the atom names obtained from file 3. The printing should be done to an output file.

I am a biomedical research student and don't have a computer science background. However, I have to use Perl programming for this task. For the above mentioned task I wrote a script, but it is not working and I cannot find the reason behind it. The script I wrote is:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::Table;
    use Carp qw(croak);

    {
        my $a;
        my $b;
        my $input_file = "input_22.txt";
        my @lines = slurp($input_file);
        for my $line (@lines){
            my ($ligandcode, $pdbcode) = split(/\t/, $line);
            my $i=0;
            my $k=0;
            my @array;
            my @array1;

            open (FILE, '<', "sd_2.txt");
            while (<FILE>) {
                my $i=0;
                my $k=0;
                my @array;
                my @array1;

                if ( $_=~/\x24\x24\x24\x24/ . /\n$ligandcode/) {
                    my $nextline1 = <FILE>;
                    my $nextline2 = <FILE>;
                    my $nextline3 = <FILE>;
                    my $nextline4= <FILE>;
                    my $totalatoms= substr( $nextline4, 1,2);
                    print $totalatoms,"\n";
                    while ($i<$totalatoms)
                    {
                        my $nextlines= <FILE>;
                        my $sub= substr($nextlines, 61, 2);
                        print $sub;
                        $array[$i] = $sub;
                        open (FH, '<', "components.txt");
                        while (my $ship=<FH>) {
                            my $var="data_$ligandcode";
                            if ($ship=~/$var/)
                            {
                                while ($k<=44)
                                {
                                    $k++;
                                    my $nextline = <FH>;
                                }
                                my $j=0;
                                my $nextline3;
                                do
                                {
                                    $nextline3=<FH>;
                                    print $nextline3;
                                    my $part= substr($nextline3, 62, 2);
                                    my $part2= substr($nextline3, 53, 4);
                                    $array1[$j] = $part;
                                    if ($array1[$j] eq $array[$i])
                                    {
                                        print $part2, "\n";
                                        open (GH, '<', "$pdbcode");
                                        open (OH, ">>out_grow.txt");
                                        while (my $grow = <GH>)
                                        {
                                            if ( $grow=~/$ligandcode/){
                                                print OH $grow if $grow=~/$part2/;
                                            }}
                                        close (GH);
                                        close (OH);
                                    }
                                    $j++;
                                } while $nextline3 !~/\x23/;
                            }
                        }
                        $i++;
                        close (FH);
                    }
                }}
            close (FILE);
        }
    }

    ##Slurps a file into a list
    sub slurp {
        my ($file) = @_;
        my (@data, @data_chomped);
        open IN, "<", $file or croak "can't open $file\n";
        @data = <IN>;
        for my $line (@data){
            chomp($line);
            push (@data_chomped, $line);
        }
        close IN;
        return (@data_chomped);
    }

I want to make the script work fast, and to work for 1000 fragments altogether if I make a list of 400 molecules in file 1. Kindly help me make this script work. I'll be grateful.

You need to break your code into manageable steps.

  1. Create data structures from the files

        use Slurp;
        my @input = map{
          [ split /\s+/, $_, 2 ]
        } slurp $input_filename;
        # etc

  2. Process each element of input_22.txt, using the data structures.
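As a rough illustration of those two steps, here is a minimal sketch that builds the data structures once and then walks the input list. It is only one possible layout, not a drop-in fix: `parse_input`, `parse_sdf`, `atom_names`, `sdf_atom` and `cif_row` are hypothetical helper names, the fixed offsets (61-62, 53-56, 62-63) are taken from the `substr` calls in the question's script, and the demo data is synthetic (the atom names `C15` and `O21` are made up).

```perl
use strict;
use warnings;

# Parse "ligandcode pdbcode" lines into a list of [ligandcode, pdbcode] pairs.
sub parse_input {
    my ($text) = @_;
    return map { [ (split ' ', $_)[0, 1] ] } grep { /\S/ } split /\n/, $text;
}

# Index SDF fragments by ligand code in a single pass.
# Assumed layout (from the question): "$$$$", ligand code, program line,
# blank line, counts line, then one line per atom.
# Returns { ligandcode => [ [atom numbers of fragment 1], ... ] }.
sub parse_sdf {
    my ($text) = @_;
    my @lines = split /\n/, $text;
    my %fragments;
    for (my $i = 0; $i < @lines; $i++) {
        next unless $lines[$i] =~ /^\s*\$\$\$\$/;
        last if $i + 4 > $#lines;                     # truncated record
        my ($code)    = $lines[$i + 1] =~ /(\S+)/;
        my ($n_atoms) = substr($lines[$i + 4], 0, 3) =~ /(\d+)/;
        next unless defined $code && defined $n_atoms;
        my @numbers;
        for my $atom_line (@lines[$i + 5 .. $i + 4 + $n_atoms]) {
            next unless defined $atom_line && length $atom_line >= 63;
            # Offsets 61-62 hold the atom number (assumption from the question).
            my ($num) = substr($atom_line, 61, 2) =~ /(\d+)/;
            push @numbers, $num + 0 if defined $num;
        }
        push @{ $fragments{$code} }, \@numbers;
        $i += 4 + $n_atoms;                           # skip past the atom block
    }
    return \%fragments;
}

# Map atom number => atom name for one ligand's CIF atom rows.
# Assumes offsets 53-56 (name) and 62-63 (number), as in the question.
sub atom_names {
    my ($rows) = @_;
    my %name_for;
    for my $row (@$rows) {
        next if length $row < 64;
        my ($num) = substr($row, 62, 2) =~ /(\d+)/;
        next unless defined $num;
        (my $name = substr($row, 53, 4)) =~ s/\s+//g;
        $name_for{ $num + 0 } = $name;
    }
    return \%name_for;
}

# Hypothetical builders for fixed-width demo rows (synthetic data only).
sub sdf_atom {
    my ($symbol, $num) = @_;
    return sprintf "%10.4f%10.4f%10.4f %-3s%2d" . ("%3d" x 11),
        0, 0, 0, $symbol, 0, (0) x 8, $num, 0, 0;
}
sub cif_row {
    my ($name, $num) = @_;
    my $row = ' ' x 70;
    substr($row, 53, 4) = sprintf '%-4s', $name;
    substr($row, 62, 2) = sprintf '%2d',  $num;
    return $row;
}

my $sdf = join "\n",
    '$$$$', '803', '  SciTegic', '', '  2  1  0  0  0  0            999 V2000',
    sdf_atom('C', 15), sdf_atom('O', 21), '  1  2  1  0', 'M  END', '';

my @input     = parse_input("803 1cqp\namh 1b2i\n");
my $fragments = parse_sdf($sdf);
my $name_for  = atom_names([ cif_row('C15', 15), cif_row('O21', 21) ]);

# Step 2: process each input pair using the prebuilt structures.
for my $pair (@input) {
    my ($ligand, $pdb) = @$pair;
    for my $fragment (@{ $fragments->{$ligand} || [] }) {
        my @names = grep { defined } map { $name_for->{$_} } @$fragment;
        print "$ligand $pdb: @names\n";
    }
}
```

Because each file is parsed once into a hash, looking up a ligand is a hash access rather than a rescan of sd_2.txt, which should matter once the input list grows to hundreds of molecules.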

I think you should look into PerlMol. After all, half the reason to use Perl is CPAN.


Things you did well

  • using 3-arg open
  • use strict;
  • use warnings;

Things you shouldn't have done

  • (Re)defined $a and $b.
    They are already defined for you.
  • Reimplemented slurp (poorly).
  • Read the same file in multiple times.
    You opened sd_2.txt once for every line of input_22.txt.
  • Defined symbols outside of the scope where you use them.
    $j, $k, @array and @array1 are defined twice, with only one of the definitions being used.
  • Used open and close without any sort of error checking.
    Either open ... or die; or use autodie;
  • You used bareword filehandles: IN, FILE etc.
    Instead use open my $fh, ...
Most of these aren't that big of a deal though, for a one-off program.

