Playing with Hashes from an FTP flow in Perl

Posted 2019-07-01 10:52

OK, so I'm obviously having some issues understanding how to work with hashes. Long story short, I'm attempting to parse an FTP log and find the relevant flows for specific search criteria. Say I have an IP address or a user name: the script first does a simple grep to cut out the data I don't need and sends the output to an external file. If I'm searching for the username testing1, it greps for testing1 and writes the matches to a file called output.txt:

Dec  2 00:14:09 ftp1 ftpd[743]: USER testing1
Dec  2 00:14:09 ftp1 ftpd[743]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], testing1
Dec  2 00:30:08 ftp1 ftpd[1261]: USER testing1
Dec  2 00:30:09 ftp1 ftpd[1261]: FTP LOGIN FROM 192.168.0.4 [192.168.0.4], testing1
Dec  2 01:12:33 ftp1 ftpd[11804]: USER testing1
Dec  2 01:12:33 ftp1 ftpd[11804]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], testing1

And below is an example of the originating log data:

Dec  1 23:59:03 ftp1 ftpd[4152]: USER testing1
Dec  1 23:59:03 ftp1 ftpd[4152]: PASS password  
Dec  1 23:59:03 ftp1 ftpd[4152]: FTP LOGIN FROM 192.168.0.02 [192.168.0.2], testing1  
Dec  1 23:59:03 ftp1 ftpd[4152]: PWD  
Dec  1 23:59:03 ftp1 ftpd[4152]: CWD /test/data/  
Dec  1 23:59:03 ftp1 ftpd[4152]: TYPE Image

I then take every process ID I find, along with the time for that ID, and put them into a hash, which is what you see below:

$VAR1 = {
      '743' => [
                 '00:1'
               ],
      '20687' => [
                   '01:3'
                 ],
      '27186' => [
                   '15:3'
                 ],
      '6929' => [
                  '12:0'
                ],
      '24771' => [
                   '09:0'
                 ],
      '11804' => [
                   '01:1'
                 ],
      '27683' => [
                   '08:3'
                 ],
      '14976' => [
                   '04:3'
                 ],
};

It looks as if the time is being put into the hash as an array. I was unable to figure out why this is happening, so I decided to work with it as an array. The following is how the hash of arrays is created:

# -------------------------------------------------------
# Extract PIDs and Time from lines, take out doubles
# -------------------------------------------------------
my $infile3 = 'output.txt';
my %pids;
my $found;
my $var;

open (INPUT2, $infile3) or die "Couldn't read $infile3.\n";

while (my $line = <INPUT2>) {
    if($line =~ /(\d{2})\:(\d)/ ) {
        my $hhmm = $1 . ":" . $2;
        if ($line =~ /ftpd\[(.*?)\]/) {
            $found = 0;
            foreach $var(keys %pids){
                if(grep $1 =~ $var, keys %pids){
                    $found = 1;
                }
            }
            if ($found == 0){
                push @{$pids{$1}}, $hhmm;

            }
        }       
    }

}

To speed things up I have decided to read all the lines that have the matching PIDs, whether they fit the flow or not, into an array so I don't have to keep reading in the originating file.

##-------------------------------------------------------
## read each line from file into an array
##-------------------------------------------------------
open (INPUT, $infile2) or die "Couldn't read $infile2.\n";

my @messages;

while (my $line = <INPUT>){
    # if there is a match to the PID then put the line in the array
    if ($line =~ /ftpd\[(.*?)\]/){
        my $mPID = $1;
        foreach my $key (keys %pids){
            if ($key =~ $mPID){
                push @messages, $line;
            }
        }  
    }
}

I'm now trying to match each line up with the PID and the time to get the flow. I'm only matching the hh:m part of the time, both to have a better chance of getting the entire flow and because the chances of another flow having the same PID in the same timeframe are pretty slim. Eventually all these results will be sent to an internal web page.

# -------------------------------------------------------
#find flow based on PID that was found from criteria
#-------------------------------------------------------

foreach my $line(@messages){
    if(my($pid) = $line =~ m{ \[ \s*(\d+) \]: }x) {
        if($line =~ /(\d{2})\:(\d)/){
            my $time = $1 . ":" . $2;
            if ($pids{$pid}[0] =~ /$time/){
                 push $pids{$pid}[0], $line;
            }
        }
    }
}

Right now the above code for some reason is actually deleting the time from the hash once it is matched. I am unsure why this is happening.

I was able to get it working with a bash script, but it took decades to complete. Thanks to suggestions from people here I have decided to tackle it with Perl, so I am basically taking a crash course. I've read everything I can and have basic programming skills in C++, but I obviously still need a lot of work. I also got it working using arrays, but once again it was incredibly slow and I was getting a lot of flows that matched the process ID but were not the flows I was looking for. So after further suggestions I decided to work with hashes: have the process ID as the key, have a specific time referenced by that key, and then store the log lines that have both that key and that time as the flow. I have had multiple questions on this already but have (A) not explained myself clearly and (B) been trying different things as I learn. For the record, everyone here has helped me tremendously and I hope that one day I can do the same for others on this list. For some reason I just can't get this stuff through my thick skull.

Anyways, hopefully I covered everything, I'm sure I'm starting to get on people's nerves with all these questions so I apologize.

UPDATE:

Well, I think I figured out how to make it all hashes, but it doesn't look right. I changed push @{$pids{$1}}, $hhmm; to $pids{$1}{$x} = $hhmm; which creates the following:

$VAR1 = {
          '743' => {
                     '' => '00:1'
                   },
          '20687' => {
                       '' => '01:3'
        },

But it doesn't look like it's referencing correctly so when I do print $pids{743}; all it prints is HASH(0x4caf10)

UPDATE:

OK, I was able to put all the values into hashes by changing push @{$pids{$1}}, $hhmm; to $pids{$1} = $hhmm; which seems to be working:

$VAR1 = {
          '743' => '00:1',
          '20687' => '01:3',
};

But now how do I check to see if the value '00:1' matches another variable? This is what I currently have and is not working:

if($pids{$pid} == qr/$time/){
    $pids{$pid}{$time}[$y] = $line;
    $y++;
};

This is how it should look after the match is made:

$VAR1 = {
          '743' => '00:1',
          '4771' => {
                      '23:5' => [
                                  'Dec  1 23:59:23 ftp1 ftpd[4771]: USER test
',
                                  'Dec  1 23:59:23 ftp1 ftpd[4771]: PASS password
',
                                  'Dec  1 23:59:23 ftp1 ftpd[4771]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], test
',
                                  'Dec  1 23:59:23 ftp1 ftpd[4771]: CWD /home/test/
',
                                  'Dec  1 23:59:23 ftp1 ftpd[4771]: TYPE Image
',
                                  'Dec  1 23:59:23 ftp1 ftpd[4771]: PASV
',
                                  'Dec  1 23:59:23 ftp1 ftpd[4771]: RETR test
',
                                  'Dec  1 23:59:23 ftp1 ftpd[4771]: QUIT
',
                                  'Dec  1 23:59:23 ftp1 ftpd[4771]: FTP session closed
'
                                ]
                    },

1 answer
Anthone
#2 · 2019-07-01 11:21

You have a couple of errors in your code.

The first is that you're only pulling out one digit of the minutes:

    if($line =~ /(\d{2})\:(\d)/ ) {

should be

    if($line =~ /(\d{2})\:(\d{2})/ ) {

If I'm interpreting the intent of your code correctly, you're trying to find out whether you've already seen a time for a given pid so that you only set it the first time. If so, you don't need to loop through all the keys in %pids to do this. All you really need to do is

        if ($line =~ /ftpd\[(.*?)\]/) {
            $pids{$1}[0] = $hhmm unless exists $pids{$1};
        }

Notice that I'm doing an assignment rather than a push, so I will wind up with the time in the first element of the array reference.
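As a rough, self-contained sketch of that idea, the snippet below records only the first hh:mm seen for each PID. The sample lines are copied from the log format in the question; the combined regex that captures the time and the PID in one match is my own assumption about the line layout, not something taken from the original script.

```perl
use strict;
use warnings;

# Sample lines in the format shown in the question.
my @lines = (
    "Dec  2 00:14:09 ftp1 ftpd[743]: USER testing1\n",
    "Dec  2 00:14:09 ftp1 ftpd[743]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], testing1\n",
    "Dec  2 01:12:33 ftp1 ftpd[11804]: USER testing1\n",
    "Dec  2 01:25:10 ftp1 ftpd[743]: USER testing1\n",
);

my %pids;
for my $line (@lines) {
    # Assumed layout: hh:mm:ss, host name, then ftpd[PID].
    next unless $line =~ /(\d{2}):(\d{2}):\d{2} \S+ ftpd\[(\d+)\]/;
    my ($hhmm, $pid) = ("$1:$2", $3);

    # Store the time only the first time this PID is seen.
    $pids{$pid}[0] = $hhmm unless exists $pids{$pid};
}

print "$_ => $pids{$_}[0]\n" for sort { $a <=> $b } keys %pids;
```

The later 01:25 line for PID 743 does not overwrite the stored 00:14, because the key already exists by then.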

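I think you may have meant to type "==" instead of "=~" below. With "=~" the key is used as a pattern, so a new PID such as 17435 would count as already seen if 743 is a key: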

            if(grep $1 =~ $var, keys %pids){
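To make the difference concrete, here is a small sketch comparing the three ways of asking "have I seen this PID?". The PID 17435 and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my %pids    = (743 => ['00:1']);
my $new_pid = '17435';    # hypothetical PID that merely contains "743"

# Pattern match: each key is used as a regex, and '17435' contains '743'.
my $regex_hits = grep { $new_pid =~ $_ } keys %pids;

# Numeric comparison: only an identical PID counts.
my $exact_hits = grep { $new_pid == $_ } keys %pids;

# A direct hash lookup answers the same question without scanning the keys.
my $seen = exists $pids{$new_pid} ? 1 : 0;

print "regex: $regex_hits, exact: $exact_hits, exists: $seen\n";
```

Only the pattern match reports a hit here, which is the kind of false positive that would silently drop a PID from the hash.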

Presumably you need to capture more information than just the time, such as the user name, transfer type, etc. so you may find it better to use a hash reference instead of an array reference under the pid. That way you can tag and easily find your information:

        if ($line =~ /ftpd\[(.*?)\]/) {
            my $pid = $1;    # save the PID before another match overwrites $1
            $pids{$pid}{time} = $hhmm unless exists $pids{$pid};
            if ($line =~ /USER (\w+)/) {
                $pids{$pid}{user} = $1;
            }
        }
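A runnable sketch of the same hash-of-hashes layout follows. It uses the sample log lines from the question, and the single regex that splits each line into time, PID, and command is an assumption on my part. It also shows why printing $pids{...} on its own gives HASH(0x...): the value is a hash reference, so you have to name the field you want.

```perl
use strict;
use warnings;

my @lines = (
    "Dec  1 23:59:03 ftp1 ftpd[4152]: USER testing1\n",
    "Dec  1 23:59:03 ftp1 ftpd[4152]: PASS password\n",
    "Dec  1 23:59:03 ftp1 ftpd[4152]: PWD\n",
);

my %pids;
for my $line (@lines) {
    # Assumed layout: hh:mm:ss host ftpd[PID]: COMMAND
    next unless $line =~ /(\d{2}:\d{2}):\d{2} \S+ ftpd\[(\d+)\]: (.*)/;
    my ($hhmm, $pid, $command) = ($1, $2, $3);

    # Tag each piece of information under the PID.
    $pids{$pid}{time} = $hhmm unless exists $pids{$pid};
    $pids{$pid}{user} = $1 if $command =~ /^USER (\w+)/;
}

# $pids{4152} is a reference to the inner hash; dereference it by field name.
print "time: $pids{4152}{time}, user: $pids{4152}{user}\n";
```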

Of course, you'll want to index according to whatever makes most sense for your purposes to make your searches fast. For instance, you might keep a second hash indexed by time:

           $time{$hhmm}{pid} = $pid;

or even keep a list of all the pids associated with a given user:

           push @{$user{$1}}, $pid;
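
Putting the pieces together, one possible way to build the pid => time => lines structure from the end of the question is sketched below. This is only an illustration under several assumptions: bucket_and_pid is a hypothetical helper, %first_seen and %flows are names I made up, the sample lines are invented in the question's format, and the bucket is the question's hh:m prefix (swap in hh:mm if you prefer the stricter match). The stored time is compared with eq, since it is a plain string rather than a number or a pattern.

```perl
use strict;
use warnings;

# Lines already filtered by the search term (output.txt in the question).
my @matched = (
    "Dec  1 23:59:23 ftp1 ftpd[4771]: USER test\n",
);
# Lines from the full log (made-up sample data).
my @messages = (
    "Dec  1 23:59:23 ftp1 ftpd[4771]: USER test\n",
    "Dec  1 23:59:23 ftp1 ftpd[4771]: PASS password\n",
    "Dec  1 23:59:24 ftp1 ftpd[4771]: QUIT\n",
    "Dec  2 09:05:00 ftp1 ftpd[4771]: USER other\n",     # same PID, reused later
    "Dec  1 23:59:30 ftp1 ftpd[5000]: USER nobody\n",    # unrelated PID
);

# Hypothetical helper: returns (bucket, pid) for a line, or an empty list.
sub bucket_and_pid {
    my ($line) = @_;
    return $line =~ /(\d{2}:\d)\d:\d{2} \S+ ftpd\[(\d+)\]/;
}

my (%first_seen, %flows);

# Pass 1: remember the first time bucket for each PID of interest.
for my $line (@matched) {
    my ($bucket, $pid) = bucket_and_pid($line) or next;
    $first_seen{$pid} //= $bucket;
}

# Pass 2: collect every line whose PID and bucket both match.
for my $line (@messages) {
    my ($bucket, $pid) = bucket_and_pid($line) or next;
    next unless exists $first_seen{$pid};
    next unless $first_seen{$pid} eq $bucket;    # string equality
    push @{ $flows{$pid}{$bucket} }, $line;
}

printf "%s %s: %d lines\n", 4771, '23:5', scalar @{ $flows{4771}{'23:5'} };
```

Keeping the first-seen times in their own hash means the value under each PID never has to change type from a string to a nested structure, which is where the attempts in the question ran into trouble.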