

Workshop 2.3: Advanced Pandas#


Joins and merges [Ashwin]#

Pandas User Guide: Merge, join, concatenate and compare

Load some data and normalize it into:#

  • Processes

  • ParentProcesses

  • Users

import pandas as pd

procs_df = pd.read_csv(
    "../data/process_tree.csv",
    parse_dates=["TimeCreatedUtc", "TimeGenerated"],
    index_col=0
)
parents = procs_df[["ProcessId", "ParentProcessName"]].drop_duplicates()
procs = (
    procs_df[["NewProcessId", "NewProcessName", "CommandLine", "ProcessId", "TimeCreatedUtc", "SubjectUserSid"]]
    .drop_duplicates()
    .rename(columns={"ProcessId": "ParentProcessId"})
)
users = procs_df[['SubjectUserSid', 'SubjectUserName', 'SubjectDomainName']].drop_duplicates()

print("original", len(procs_df))
print("procs", len(procs))
print("parents", len(parents))
print("users", len(users))
display(procs.head(3))
display(parents)
display(users)
original 117
procs 117
parents 3
users 2
NewProcessId NewProcessName CommandLine ParentProcessId TimeCreatedUtc SubjectUserSid
0 0x1580 C:\Diagnostics\UserTmp\ftp.exe .\ftp -s:C:\RECYCLER\xxppyy.exe 0xbc8 2019-01-15 05:15:15.677 S-1-5-21-996632719-2361334927-4038480536-500
1 0x16fc C:\Diagnostics\UserTmp\reg.exe .\reg not /domain:everything that /sid:shines... 0xbc8 2019-01-15 05:15:16.167 S-1-5-21-996632719-2361334927-4038480536-500
2 0x1700 C:\Diagnostics\UserTmp\cmd.exe cmd /c "systeminfo && systeminfo" 0xbc8 2019-01-15 05:15:16.277 S-1-5-21-996632719-2361334927-4038480536-500
ProcessId ParentProcessName
0 0xbc8 C:\Windows\System32\cmd.exe
115 0x440 C:\Windows\System32\svchost.exe
116 0x1580 C:\Diagnostics\UserTmp\powershell.exe
SubjectUserSid SubjectUserName SubjectDomainName
0 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin MSTICAlertsWin1
115 S-1-5-18 MSTICAlertsWin1$ WORKGROUP

Joining on Index using pd.concat#

pd.concat([df1, df2...])

We saw how to use pd.concat to append rows in part 1; a quick row-wise sketch follows for contrast.
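For contrast, here is a minimal sketch of the default row-wise behaviour (axis="index"), using two arbitrary slices of the processes DataFrame:

# Sketch: stacking two row slices back together (row-wise is the default axis)
top = procs_df.head(2)
bottom = procs_df.tail(2)
stacked = pd.concat([top, bottom])
len(stacked)   # 4 rows, same columns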

# Do some processing on the original DF
dec_logon_id = (
    pd.DataFrame(procs_df.SubjectLogonId.apply(lambda x: int(x, base=16)))
    .rename(columns={"SubjectLogonId": "SubjectLogonId_dec"})
)

dec_logon_id.head(5)
SubjectLogonId_dec
0 16428071
1 16428071
2 16428071
3 16428071
4 16428071

pd.concat with axis="columns" or axis=1 joins column-wise (horizontally)#

(
    pd.concat([procs_df, dec_logon_id], axis="columns")
    .head()
    .filter(regex=".*Process.*|Sub.*")
)
SubjectUserSid SubjectUserName SubjectDomainName SubjectLogonId NewProcessId NewProcessName ProcessId ParentProcessName SubjectLogonId_dec
0 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin MSTICAlertsWin1 0xfaac27 0x1580 C:\Diagnostics\UserTmp\ftp.exe 0xbc8 C:\Windows\System32\cmd.exe 16428071
1 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin MSTICAlertsWin1 0xfaac27 0x16fc C:\Diagnostics\UserTmp\reg.exe 0xbc8 C:\Windows\System32\cmd.exe 16428071
2 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin MSTICAlertsWin1 0xfaac27 0x1700 C:\Diagnostics\UserTmp\cmd.exe 0xbc8 C:\Windows\System32\cmd.exe 16428071
3 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin MSTICAlertsWin1 0xfaac27 0x1728 C:\Diagnostics\UserTmp\rundll32.exe 0xbc8 C:\Windows\System32\cmd.exe 16428071
4 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin MSTICAlertsWin1 0xfaac27 0x175c C:\Diagnostics\UserTmp\rundll32.exe 0xbc8 C:\Windows\System32\cmd.exe 16428071

Key-based Joins#

df1.merge(df2, ...)
df1.join(df2, ...)

Source tables

display(procs.head())
display(users)
NewProcessId NewProcessName CommandLine ParentProcessId TimeCreatedUtc SubjectUserSid
0 0x1580 C:\Diagnostics\UserTmp\ftp.exe .\ftp -s:C:\RECYCLER\xxppyy.exe 0xbc8 2019-01-15 05:15:15.677 S-1-5-21-996632719-2361334927-4038480536-500
1 0x16fc C:\Diagnostics\UserTmp\reg.exe .\reg not /domain:everything that /sid:shines... 0xbc8 2019-01-15 05:15:16.167 S-1-5-21-996632719-2361334927-4038480536-500
2 0x1700 C:\Diagnostics\UserTmp\cmd.exe cmd /c "systeminfo && systeminfo" 0xbc8 2019-01-15 05:15:16.277 S-1-5-21-996632719-2361334927-4038480536-500
3 0x1728 C:\Diagnostics\UserTmp\rundll32.exe .\rundll32 /C 12345.exe 0xbc8 2019-01-15 05:15:16.340 S-1-5-21-996632719-2361334927-4038480536-500
4 0x175c C:\Diagnostics\UserTmp\rundll32.exe .\rundll32 /C c:\users\MSTICAdmin\12345.exe 0xbc8 2019-01-15 05:15:16.400 S-1-5-21-996632719-2361334927-4038480536-500
SubjectUserSid SubjectUserName SubjectDomainName
0 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin MSTICAlertsWin1
115 S-1-5-18 MSTICAlertsWin1$ WORKGROUP

Simple merge on common key#

procs.merge(users, on="SubjectUserSid")
NewProcessId NewProcessName CommandLine ParentProcessId TimeCreatedUtc SubjectUserSid SubjectUserName SubjectDomainName
0 0x1580 C:\Diagnostics\UserTmp\ftp.exe .\ftp -s:C:\RECYCLER\xxppyy.exe 0xbc8 2019-01-15 05:15:15.677 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin MSTICAlertsWin1
1 0x16fc C:\Diagnostics\UserTmp\reg.exe .\reg not /domain:everything that /sid:shines... 0xbc8 2019-01-15 05:15:16.167 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin MSTICAlertsWin1
2 0x1700 C:\Diagnostics\UserTmp\cmd.exe cmd /c "systeminfo && systeminfo" 0xbc8 2019-01-15 05:15:16.277 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin MSTICAlertsWin1
3 0x1728 C:\Diagnostics\UserTmp\rundll32.exe .\rundll32 /C 12345.exe 0xbc8 2019-01-15 05:15:16.340 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin MSTICAlertsWin1
4 0x175c C:\Diagnostics\UserTmp\rundll32.exe .\rundll32 /C c:\users\MSTICAdmin\12345.exe 0xbc8 2019-01-15 05:15:16.400 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin MSTICAlertsWin1
... ... ... ... ... ... ... ... ...
112 0x1434 C:\Diagnostics\UserTmp\rundll32.exe .\rundll32.exe /C c:\windows\fonts\conhost.exe 0xbc8 2019-01-15 05:15:14.613 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin MSTICAlertsWin1
113 0x123c C:\Diagnostics\UserTmp\regsvr32.exe .\regsvr32 /u /s c:\windows\fonts\csrss.exe 0xbc8 2019-01-15 05:15:14.693 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin MSTICAlertsWin1
114 0x240 C:\Windows\System32\tasklist.exe tasklist 0xbc8 2019-01-15 05:15:14.770 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin MSTICAlertsWin1
115 0x15a0 C:\Windows\System32\win32calc.exe "C:\Windows\System32\win32calc.exe" 0x1580 2019-01-15 05:15:13.053 S-1-5-21-996632719-2361334927-4038480536-500 MSTICAdmin MSTICAlertsWin1
116 0xbc8 C:\Windows\System32\cmd.exe cmd.exe /c c:\Diagnostics\WindowsSimulateDetec... 0x440 2019-01-15 05:15:03.047 S-1-5-18 MSTICAlertsWin1$ WORKGROUP

117 rows × 8 columns

Left joins (also right and outer)#

procs.merge(users[1:], on="SubjectUserSid")
NewProcessId NewProcessName CommandLine ParentProcessId TimeCreatedUtc SubjectUserSid SubjectUserName SubjectDomainName
0 0xbc8 C:\Windows\System32\cmd.exe cmd.exe /c c:\Diagnostics\WindowsSimulateDetec... 0x440 2019-01-15 05:15:03.047 S-1-5-18 MSTICAlertsWin1$ WORKGROUP
procs.merge(users[1:], on="SubjectUserSid", how="left")
NewProcessId NewProcessName CommandLine ParentProcessId TimeCreatedUtc SubjectUserSid SubjectUserName SubjectDomainName
0 0x1580 C:\Diagnostics\UserTmp\ftp.exe .\ftp -s:C:\RECYCLER\xxppyy.exe 0xbc8 2019-01-15 05:15:15.677 S-1-5-21-996632719-2361334927-4038480536-500 NaN NaN
1 0x16fc C:\Diagnostics\UserTmp\reg.exe .\reg not /domain:everything that /sid:shines... 0xbc8 2019-01-15 05:15:16.167 S-1-5-21-996632719-2361334927-4038480536-500 NaN NaN
2 0x1700 C:\Diagnostics\UserTmp\cmd.exe cmd /c "systeminfo && systeminfo" 0xbc8 2019-01-15 05:15:16.277 S-1-5-21-996632719-2361334927-4038480536-500 NaN NaN
3 0x1728 C:\Diagnostics\UserTmp\rundll32.exe .\rundll32 /C 12345.exe 0xbc8 2019-01-15 05:15:16.340 S-1-5-21-996632719-2361334927-4038480536-500 NaN NaN
4 0x175c C:\Diagnostics\UserTmp\rundll32.exe .\rundll32 /C c:\users\MSTICAdmin\12345.exe 0xbc8 2019-01-15 05:15:16.400 S-1-5-21-996632719-2361334927-4038480536-500 NaN NaN
... ... ... ... ... ... ... ... ...
112 0x1434 C:\Diagnostics\UserTmp\rundll32.exe .\rundll32.exe /C c:\windows\fonts\conhost.exe 0xbc8 2019-01-15 05:15:14.613 S-1-5-21-996632719-2361334927-4038480536-500 NaN NaN
113 0x123c C:\Diagnostics\UserTmp\regsvr32.exe .\regsvr32 /u /s c:\windows\fonts\csrss.exe 0xbc8 2019-01-15 05:15:14.693 S-1-5-21-996632719-2361334927-4038480536-500 NaN NaN
114 0x240 C:\Windows\System32\tasklist.exe tasklist 0xbc8 2019-01-15 05:15:14.770 S-1-5-21-996632719-2361334927-4038480536-500 NaN NaN
115 0xbc8 C:\Windows\System32\cmd.exe cmd.exe /c c:\Diagnostics\WindowsSimulateDetec... 0x440 2019-01-15 05:15:03.047 S-1-5-18 MSTICAlertsWin1$ WORKGROUP
116 0x15a0 C:\Windows\System32\win32calc.exe "C:\Windows\System32\win32calc.exe" 0x1580 2019-01-15 05:15:13.053 S-1-5-21-996632719-2361334927-4038480536-500 NaN NaN

117 rows × 8 columns
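An outer join keeps unmatched rows from both sides; passing indicator=True adds a _merge column showing where each row came from. A minimal sketch with the same frames:

# Sketch: outer join, keeping unmatched rows from both sides
merged = procs.merge(users[1:], on="SubjectUserSid", how="outer", indicator=True)
merged["_merge"].value_counts()   # both / left_only / right_only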

Joins where there is no common key column#

(
    procs.merge(parents, left_on="ParentProcessId", right_on="ProcessId")
    .head()
    .filter(regex=".*Process.*")
)
NewProcessId NewProcessName ParentProcessId ProcessId ParentProcessName
0 0x1580 C:\Diagnostics\UserTmp\ftp.exe 0xbc8 0xbc8 C:\Windows\System32\cmd.exe
1 0x16fc C:\Diagnostics\UserTmp\reg.exe 0xbc8 0xbc8 C:\Windows\System32\cmd.exe
2 0x1700 C:\Diagnostics\UserTmp\cmd.exe 0xbc8 0xbc8 C:\Windows\System32\cmd.exe
3 0x1728 C:\Diagnostics\UserTmp\rundll32.exe 0xbc8 0xbc8 C:\Windows\System32\cmd.exe
4 0x175c C:\Diagnostics\UserTmp\rundll32.exe 0xbc8 0xbc8 C:\Windows\System32\cmd.exe
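Because the key columns have different names, both ParentProcessId and ProcessId end up in the merged frame; a minimal sketch that drops the redundant right-hand key afterwards:

parent_procs = (
    procs.merge(parents, left_on="ParentProcessId", right_on="ProcessId")
    .drop(columns="ProcessId")   # duplicates ParentProcessId, so drop it
)
parent_procs.head(3)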

Using Styles [Ian]#

  • Max/min values

  • Value coloring

  • Inline bars

df.style.<method>(...)

net_df = pd.read_pickle("../data/az_net_comms_df.pkl")

# Generate a summary
summary_df = (
    net_df[["RemoteRegion", "TotalAllowedFlows", "L7Protocol"]]
    .groupby("RemoteRegion")
    .agg(
        FlowsSum = pd.NamedAgg("TotalAllowedFlows", "sum"),
        FlowsVar = pd.NamedAgg("TotalAllowedFlows", "var"),
        FlowsStdDev = pd.NamedAgg("TotalAllowedFlows", "std"),
        L7Prots = pd.NamedAgg("L7Protocol", "nunique"),
    )
)
summary_df
FlowsSum FlowsVar FlowsStdDev L7Prots
RemoteRegion
814.0 57.267027 7.567498 6
canadacentral 5103.0 29.811223 5.459965 1
centralus 236.0 4.675897 2.162382 1
eastus 602.0 1.646154 1.283025 3
eastus2 1502.0 4.830914 2.197934 1
northeurope 82.0 0.492438 0.701739 1
southcentralus 817.0 8.882186 2.980300 1
westcentralus 59.0 0.017241 0.131306 1
westus 38.0 0.782609 0.884652 1
westus2 7.0 0.300000 0.547723 1

highlight_max/highlight_min#

df.style.highlight_max(...)

df_style = summary_df.style.highlight_max(color="blue").highlight_min(color="green")
df_style
FlowsSum FlowsVar FlowsStdDev L7Prots
RemoteRegion
814.000000 57.267027 7.567498 6
canadacentral 5103.000000 29.811223 5.459965 1
centralus 236.000000 4.675897 2.162382 1
eastus 602.000000 1.646154 1.283025 3
eastus2 1502.000000 4.830914 2.197934 1
northeurope 82.000000 0.492438 0.701739 1
southcentralus 817.000000 8.882186 2.980300 1
westcentralus 59.000000 0.017241 0.131306 1
westus 38.000000 0.782609 0.884652 1
westus2 7.000000 0.300000 0.547723 1
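The styling methods also accept a subset parameter if you only want them applied to some columns; a minimal sketch (the column choice here is just illustrative):

(
    summary_df.style
    .highlight_max(subset=["FlowsSum", "L7Prots"], color="blue")
    .highlight_min(subset=["FlowsStdDev"], color="green")
)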

Color gradients#

df.style.background_gradient(...)

import seaborn as sns
cm = sns.light_palette("blue", as_cmap=True)

summary_df.style.background_gradient(cmap=cm).format("{:.1f}")
FlowsSum FlowsVar FlowsStdDev L7Prots
RemoteRegion
814.0 57.3 7.6 6.0
canadacentral 5103.0 29.8 5.5 1.0
centralus 236.0 4.7 2.2 1.0
eastus 602.0 1.6 1.3 3.0
eastus2 1502.0 4.8 2.2 1.0
northeurope 82.0 0.5 0.7 1.0
southcentralus 817.0 8.9 3.0 1.0
westcentralus 59.0 0.0 0.1 1.0
westus 38.0 0.8 0.9 1.0
westus2 7.0 0.3 0.5 1.0

Inline bars#

df.style.bar(...)

summary_df.style.bar(color="blue").format("{:.2f}")
FlowsSum FlowsVar FlowsStdDev L7Prots
RemoteRegion
814.00 57.27 7.57 6.00
canadacentral 5103.00 29.81 5.46 1.00
centralus 236.00 4.68 2.16 1.00
eastus 602.00 1.65 1.28 3.00
eastus2 1502.00 4.83 2.20 1.00
northeurope 82.00 0.49 0.70 1.00
southcentralus 817.00 8.88 2.98 1.00
westcentralus 59.00 0.02 0.13 1.00
westus 38.00 0.78 0.88 1.00
westus2 7.00 0.30 0.55 1.00
summary_df.style.set_properties(**{
    'background-color': 'black',
    'color': 'lawngreen',
    'font-family': 'consolas',
}).format("{:.2f}")
FlowsSum FlowsVar FlowsStdDev L7Prots
RemoteRegion
814.00 57.27 7.57 6.00
canadacentral 5103.00 29.81 5.46 1.00
centralus 236.00 4.68 2.16 1.00
eastus 602.00 1.65 1.28 3.00
eastus2 1502.00 4.83 2.20 1.00
northeurope 82.00 0.49 0.70 1.00
southcentralus 817.00 8.88 2.98 1.00
westcentralus 59.00 0.02 0.13 1.00
westus 38.00 0.78 0.88 1.00
westus2 7.00 0.30 0.55 1.00

Reshaping/preprocessing data [Ian]#

  • Dealing with nulls/NAs

  • Type conversion

  • Renaming columns

  • Pandas operations: melt, explode, transpose, indexing/stack/unstack

  • Dealing with complex Python objects - explode

  • Tidy data - melt

Dealing with nulls/NAs#

Working with missing data

pandas primarily uses NaN to represent missing data. NaN is a float value: the IEEE 754 floating-point representation of Not a Number.

Tip: Often you will see TypeError exceptions about not being able to perform an expected operation on a float (when you were expecting the type to be a string or other object type). This is very likely due to NaNs in your data.

NaT is the equivalent of NaN for DateTime data.

Python's None (a NoneType object) also appears as a missing value in object columns; a quick sketch showing how pandas treats all three follows.
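A minimal sketch (values chosen purely for illustration) showing that pd.isna() treats NaN, NaT and None the same way:

import numpy as np

missing = pd.Series([np.nan, pd.NaT, None, "value"])
pd.isna(missing)   # True, True, True, False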

import pandas as pd
net_df = pd.read_pickle("../data/az_net_comms_df.pkl")
len(net_df)
1360
print(f"Null elements in DataFrame: {net_df.isnull().values.sum()} \n\
Rows with null elements: {net_df.shape[0] - net_df.dropna().shape[0]}")
Null elements in DataFrame: 24 
Rows with null elements: 8

Which columns have NAs?#

df.isna()
series.isna()

net_df.isna().any()
TimeGenerated          False
FlowStartTime          False
FlowEndTime            False
FlowIntervalEndTime    False
FlowType               False
ResourceGroup           True
VMName                  True
VMIPAddress            False
PublicIPs               True
SrcIP                  False
DestIP                 False
L4Protocol             False
L7Protocol             False
DestPort               False
FlowDirection          False
AllowedOutFlows        False
AllowedInFlows         False
DeniedInFlows          False
DeniedOutFlows         False
RemoteRegion           False
VMRegion               False
AllExtIPs              False
TotalAllowedFlows      False
dtype: bool
net_df.ResourceGroup.value_counts()
asihuntomsworkspacerg    1352
Name: ResourceGroup, dtype: int64

Filtering to see which rows have NaNs#

You can use .isna() on the whole DataFrame or a single column.

net_df[net_df["PublicIPs"].isna()]
TimeGenerated FlowStartTime FlowEndTime FlowIntervalEndTime FlowType ResourceGroup VMName VMIPAddress PublicIPs SrcIP ... DestPort FlowDirection AllowedOutFlows AllowedInFlows DeniedInFlows DeniedOutFlows RemoteRegion VMRegion AllExtIPs TotalAllowedFlows
326 2019-02-13 01:23:47.634 2019-02-13 00:55:10 2019-02-13 00:57:33 2019-02-13 01:00:00 IntraVNet None None None 10.0.3.4 ... 445.0 O 3.0 0.0 0.0 0.0 eastus 10.0.3.5 3.0
327 2019-02-13 01:23:47.634 2019-02-13 00:55:10 2019-02-13 00:57:33 2019-02-13 01:00:00 IntraVNet None None None 10.0.3.4 ... 445.0 I 0.0 3.0 0.0 0.0 eastus 10.0.3.4 3.0
336 2019-02-12 22:23:05.399 2019-02-12 21:55:43 2019-02-12 21:55:43 2019-02-12 22:00:00 IntraVNet None None None 10.0.3.5 ... 22.0 O 1.0 0.0 0.0 0.0 eastus 10.0.3.4 1.0
345 2019-02-12 22:23:05.384 2019-02-12 21:54:04 2019-02-12 21:55:36 2019-02-12 22:00:00 IntraVNet None None None 10.0.3.5 ... 22.0 O 6.0 0.0 0.0 0.0 eastus 104.211.30.1 6.0
357 2019-02-12 23:23:59.515 2019-02-12 22:22:35 2019-02-12 22:55:37 2019-02-12 23:00:00 IntraVNet None None None 10.0.3.5 ... 22.0 O 12.0 0.0 0.0 0.0 eastus 104.211.30.1 12.0
413 2019-02-12 18:23:51.853 2019-02-12 17:26:19 2019-02-12 17:44:09 2019-02-12 18:00:00 IntraVNet None None None 10.0.3.4 ... 445.0 O 6.0 0.0 0.0 0.0 eastus 10.0.3.5 6.0
414 2019-02-12 18:23:51.853 2019-02-12 17:26:19 2019-02-12 17:44:09 2019-02-12 18:00:00 IntraVNet None None None 10.0.3.4 ... 445.0 I 0.0 6.0 0.0 0.0 eastus 10.0.3.4 6.0
466 2019-02-12 22:23:17.236 2019-02-12 21:55:43 2019-02-12 21:55:43 2019-02-12 22:00:00 IntraVNet None None None 10.0.3.5 ... 22.0 I 0.0 1.0 0.0 0.0 eastus 10.0.3.5 1.0

8 rows × 23 columns

Removing NaNs with .dropna#

df.dropna() # removes all rows with ANY NaNs
df.dropna(axis=1) # removes all columns with ANY NaNs

df.dropna(how="all") # removes all rows that are ALL NaNs
df.dropna(axis=1, how="all") # removes all cols that are ALL NaNs

dropna() also supports inplace=True. Don’t do it!!!

len(net_df.dropna())
1352
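Instead of inplace=True, the usual pattern is to assign the result to a new (or the same) variable; a minimal sketch:

net_df_no_na = net_df.dropna()
print(len(net_df), len(net_df_no_na))   # 1360 1352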

Replacing NaNs with values#

df.fillna(replacement) # replace NaNs with 'replacement'

df[column] = df[column].fillna(replacement) # replace NaNs in a single column

net_df2 = net_df.fillna(value="N/A")
net_df2.ResourceGroup.value_counts()
asihuntomsworkspacerg    1352
N/A                         8
Name: ResourceGroup, dtype: int64
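fillna also accepts a dict of per-column replacements when different columns need different fill values; a minimal sketch (the fill values are just illustrative):

net_df3 = net_df.fillna({"ResourceGroup": "unknown-rg", "VMName": "unknown-vm"})
net_df3[["ResourceGroup", "VMName"]].isna().any()   # both False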

Type Conversion#

net_df.dtypes
TimeGenerated          datetime64[ns]
FlowStartTime          datetime64[ns]
FlowEndTime            datetime64[ns]
FlowIntervalEndTime    datetime64[ns]
FlowType                       object
ResourceGroup                  object
VMName                         object
VMIPAddress                    object
PublicIPs                      object
SrcIP                          object
DestIP                         object
L4Protocol                     object
L7Protocol                     object
DestPort                      float64
FlowDirection                  object
AllowedOutFlows               float64
AllowedInFlows                float64
DeniedInFlows                 float64
DeniedOutFlows                float64
RemoteRegion                   object
VMRegion                       object
AllExtIPs                      object
TotalAllowedFlows             float64
dtype: object

df.column.astype(target) # convert the type of a single column (Series)

Target can be a numpy type, a pandas dtype or a friendly string:

  • “object”

  • “datetime”

  • “number”

df.astype({col1: type1, col2: type2, ...}) # convert multiple columns at once

net_df.TotalAllowedFlows = net_df.TotalAllowedFlows.astype('str')
net_df.TotalAllowedFlows.dtypes
dtype('O')
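The dict form converts several columns in one call; a minimal sketch, assigned to a new variable so it does not change the net_df used below:

converted = net_df.astype({"AllowedInFlows": "int64", "FlowType": "category"})
converted.dtypes[["AllowedInFlows", "FlowType"]]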

Convert using explicit pandas function#

Gives you more control over specific conversions (especially for DateTime).

net_df.TotalAllowedFlows = pd.to_numeric(net_df.TotalAllowedFlows)
#pd.to_datetime
#pd.to_timedelta
net_df.TotalAllowedFlows.dtypes
dtype('float64')
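For dates, pd.to_datetime gives explicit control over the expected format and over what happens to unparseable values; a minimal sketch on made-up strings:

raw_dates = pd.Series(["2019-01-15 05:15:15", "not a date"])
pd.to_datetime(raw_dates, format="%Y-%m-%d %H:%M:%S", errors="coerce")   # second value becomes NaT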

Renaming columns#

net_df.columns
Index(['TimeGenerated', 'FlowStartTime', 'FlowEndTime', 'FlowIntervalEndTime',
       'FlowType', 'ResourceGroup', 'VMName', 'VMIPAddress', 'PublicIPs',
       'SrcIP', 'DestIP', 'L4Protocol', 'L7Protocol', 'DestPort',
       'FlowDirection', 'AllowedOutFlows', 'AllowedInFlows', 'DeniedInFlows',
       'DeniedOutFlows', 'RemoteRegion', 'VMRegion', 'AllExtIPs',
       'TotalAllowedFlows'],
      dtype='object')

df.rename(columns={col1: col1_new, col2: ....}) # rename

net_df.rename(columns={"FlowStartTime": "FlowStartDateTime", "FlowEndTime": "FlowEndDateTime"}).columns
Index(['TimeGenerated', 'FlowStartDateTime', 'FlowEndDateTime',
       'FlowIntervalEndTime', 'FlowType', 'ResourceGroup', 'VMName',
       'VMIPAddress', 'PublicIPs', 'SrcIP', 'DestIP', 'L4Protocol',
       'L7Protocol', 'DestPort', 'FlowDirection', 'AllowedOutFlows',
       'AllowedInFlows', 'DeniedInFlows', 'DeniedOutFlows', 'RemoteRegion',
       'VMRegion', 'AllExtIPs', 'TotalAllowedFlows'],
      dtype='object')

df.rename(func, axis='columns')

net_df.rename(str.lower, axis='columns').columns
Index(['timegenerated', 'flowstarttime', 'flowendtime', 'flowintervalendtime',
       'flowtype', 'resourcegroup', 'vmname', 'vmipaddress', 'publicips',
       'srcip', 'destip', 'l4protocol', 'l7protocol', 'destport',
       'flowdirection', 'allowedoutflows', 'allowedinflows', 'deniedinflows',
       'deniedoutflows', 'remoteregion', 'vmregion', 'allextips',
       'totalallowedflows'],
      dtype='object')
net_df.columns
Index(['TimeGenerated', 'FlowStartTime', 'FlowEndTime', 'FlowIntervalEndTime',
       'FlowType', 'ResourceGroup', 'VMName', 'VMIPAddress', 'PublicIPs',
       'SrcIP', 'DestIP', 'L4Protocol', 'L7Protocol', 'DestPort',
       'FlowDirection', 'AllowedOutFlows', 'AllowedInFlows', 'DeniedInFlows',
       'DeniedOutFlows', 'RemoteRegion', 'VMRegion', 'AllExtIPs',
       'TotalAllowedFlows'],
      dtype='object')

Statically rename using assignment#

net_df.columns = [
    "timegenerated",
    "flowstarttime",
    "flowendtime",
    "flowintervalendtime",
    "flowtype",
    "resourcegroup",
    "vmname",
    "vmipaddress",
    "publicips",
    "srcip",
    "destip",
    "l4protocol",
    "l7protocol",
    "destport",
    "flowdirection",
    "allowedoutflows",
    "allowedinflows",
    "deniedinflows",
    "deniedoutflows",
    "remoteregion",
    "vmregion",
    "allextips",
    "totalallowedflows",
]

Pandas operations: melt, explode, transpose, indexing/stack/unstack [Ashwin]#

Dealing with complex Python objects - explode#

net_df = pd.read_pickle("../data/az_net_comms_df.pkl")
net_df.PublicIPs.head(10)
0                               [13.67.143.117]
1                                [40.77.232.95]
2                  [13.65.107.32, 40.124.45.19]
3                  [13.65.107.32, 40.124.45.19]
4                                [20.38.98.100]
5                               [13.67.143.117]
6                [13.71.172.128, 13.71.172.130]
7                [13.71.172.128, 13.71.172.130]
8    [65.55.44.109, 40.77.228.69, 65.55.44.108]
9    [65.55.44.109, 40.77.228.69, 65.55.44.108]
Name: PublicIPs, dtype: object
net_df.PublicIPs.count()
1352
net_df_ext = net_df.explode("PublicIPs")
net_df_ext.PublicIPs.head(10)
0    13.67.143.117
1     40.77.232.95
2     13.65.107.32
2     40.124.45.19
3     13.65.107.32
3     40.124.45.19
4     20.38.98.100
5    13.67.143.117
6    13.71.172.128
6    13.71.172.130
Name: PublicIPs, dtype: object
len(net_df_ext.PublicIPs.unique())
123

Tidy data - melt#

pandas.melt() unpivots a DataFrame from wide format to long format.
The melt() function is useful to massage a DataFrame into a format where one or more columns are identifier variables, while all other columns, considered measured variables, are unpivoted to the row axis, leaving just two non-identifier columns, variable and value.

net_df_min = net_df[["FlowType", "AllExtIPs", "TotalAllowedFlows"]]
pd.melt(net_df_min, 
         id_vars=['AllExtIPs'])
AllExtIPs variable value
0 13.67.143.117 FlowType AzurePublic
1 40.77.232.95 FlowType AzurePublic
2 13.65.107.32 FlowType AzurePublic
3 40.124.45.19 FlowType AzurePublic
4 20.38.98.100 FlowType AzurePublic
... ... ... ...
2715 13.71.172.130 TotalAllowedFlows 23.0
2716 40.77.232.95 TotalAllowedFlows 1.0
2717 52.168.138.145 TotalAllowedFlows 4.0
2718 23.215.98.90 TotalAllowedFlows 2.0
2719 72.21.81.240 TotalAllowedFlows 2.0

2720 rows × 3 columns

Transpose#

net_df_min.head().T
0 1 2 3 4
FlowType AzurePublic AzurePublic AzurePublic AzurePublic AzurePublic
AllExtIPs 13.67.143.117 40.77.232.95 13.65.107.32 40.124.45.19 20.38.98.100
TotalAllowedFlows 1.0 1.0 4.0 4.0 1.0

Indexing, Stack and Unstack#

net_df = pd.read_pickle("../data/az_net_comms_df.pkl")
net_df_agg = net_df.groupby("AllExtIPs").agg({"TotalAllowedFlows":['mean', 'min', 'max'],
                                              "AllowedOutFlows":['mean', 'min', 'max'],
                                              "AllowedInFlows":['mean', 'min', 'max']})
net_df_agg.head()
TotalAllowedFlows AllowedOutFlows AllowedInFlows
mean min max mean min max mean min max
AllExtIPs
10.0.3.4 3.333333 1.0 6.0 0.333333 0.0 1.0 3.000000 0.0 6.0
10.0.3.5 3.333333 1.0 6.0 3.000000 0.0 6.0 0.333333 0.0 1.0
104.211.30.1 9.000000 6.0 12.0 9.000000 6.0 12.0 0.000000 0.0 0.0
104.40.17.153 1.750000 1.0 2.0 1.750000 1.0 2.0 0.000000 0.0 0.0
104.43.212.12 2.166667 1.0 4.0 2.166667 1.0 4.0 0.000000 0.0 0.0
net_df_agg["TotalAllowedFlows"]["mean"]
AllExtIPs
10.0.3.4          3.333333
10.0.3.5          3.333333
104.211.30.1      9.000000
104.40.17.153     1.750000
104.43.212.12     2.166667
                   ...    
90.130.70.73      1.000000
99.84.104.63      7.000000
99.84.106.178    10.000000
99.84.106.27     10.000000
99.84.106.92     10.000000
Name: mean, Length: 125, dtype: float64
idx = pd.IndexSlice
net_df_agg.loc[:,idx[:,'mean']]
TotalAllowedFlows AllowedOutFlows AllowedInFlows
mean mean mean
AllExtIPs
10.0.3.4 3.333333 0.333333 3.000000
10.0.3.5 3.333333 3.000000 0.333333
104.211.30.1 9.000000 9.000000 0.000000
104.40.17.153 1.750000 1.750000 0.000000
104.43.212.12 2.166667 2.166667 0.000000
... ... ... ...
90.130.70.73 1.000000 1.000000 0.000000
99.84.104.63 7.000000 7.000000 0.000000
99.84.106.178 10.000000 10.000000 0.000000
99.84.106.27 10.000000 10.000000 0.000000
99.84.106.92 10.000000 10.000000 0.000000

125 rows × 3 columns

net_df_agg_stacked = net_df_agg.stack()
net_df_agg_stacked.head()
TotalAllowedFlows AllowedOutFlows AllowedInFlows
AllExtIPs
10.0.3.4 mean 3.333333 0.333333 3.000000
min 1.000000 0.000000 0.000000
max 6.000000 1.000000 6.000000
10.0.3.5 mean 3.333333 3.000000 0.333333
min 1.000000 0.000000 0.000000
net_df_agg_stacked.loc[("10.0.3.4","mean"),"TotalAllowedFlows"]
3.3333333333333335
net_df_agg_stacked.unstack().head()
TotalAllowedFlows AllowedOutFlows AllowedInFlows
mean min max mean min max mean min max
AllExtIPs
10.0.3.4 3.333333 1.0 6.0 0.333333 0.0 1.0 3.000000 0.0 6.0
10.0.3.5 3.333333 1.0 6.0 3.000000 0.0 6.0 0.333333 0.0 1.0
104.211.30.1 9.000000 6.0 12.0 9.000000 6.0 12.0 0.000000 0.0 0.0
104.40.17.153 1.750000 1.0 2.0 1.750000 1.0 2.0 0.000000 0.0 0.0
104.43.212.12 2.166667 1.0 4.0 2.166667 1.0 4.0 0.000000 0.0 0.0

Pivoting/pivot tables [Ashwin]#

Reshaping and Pivot Tables#

net_df = pd.read_pickle("../data/az_net_comms_df.pkl")
net_df.head()
TimeGenerated FlowStartTime FlowEndTime FlowIntervalEndTime FlowType ResourceGroup VMName VMIPAddress PublicIPs SrcIP ... DestPort FlowDirection AllowedOutFlows AllowedInFlows DeniedInFlows DeniedOutFlows RemoteRegion VMRegion AllExtIPs TotalAllowedFlows
0 2019-02-14 13:23:59.512 2019-02-14 12:21:58 2019-02-14 12:21:58 2019-02-14 13:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 10.0.3.5 [13.67.143.117] ... 443.0 O 1.0 0.0 0.0 0.0 centralus eastus 13.67.143.117 1.0
1 2019-02-14 13:23:59.512 2019-02-14 12:29:02 2019-02-14 12:29:02 2019-02-14 13:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 10.0.3.5 [40.77.232.95] ... 443.0 O 1.0 0.0 0.0 0.0 westcentralus eastus 40.77.232.95 1.0
2 2019-02-14 03:26:06.765 2019-02-14 02:08:46 2019-02-14 02:48:45 2019-02-14 03:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 10.0.3.5 [13.65.107.32, 40.124.45.19] ... 443.0 O 4.0 0.0 0.0 0.0 southcentralus eastus 13.65.107.32 4.0
3 2019-02-14 03:26:06.765 2019-02-14 02:08:46 2019-02-14 02:48:45 2019-02-14 03:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 10.0.3.5 [13.65.107.32, 40.124.45.19] ... 443.0 O 4.0 0.0 0.0 0.0 southcentralus eastus 40.124.45.19 4.0
4 2019-02-14 03:26:06.828 2019-02-14 02:30:56 2019-02-14 02:30:56 2019-02-14 03:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 10.0.3.5 [20.38.98.100] ... 443.0 O 1.0 0.0 0.0 0.0 eastus eastus 20.38.98.100 1.0

5 rows × 23 columns

# Prepare a grouped dataset to pivot: pivot() expects each index/column pair to be unique
net_df_grouped = net_df.groupby(['FlowIntervalEndTime','FlowType'])['AllExtIPs'].count().reset_index()
net_df_grouped.head()
FlowIntervalEndTime FlowType AllExtIPs
0 2019-02-07 13:00:00 AzurePublic 3
1 2019-02-07 14:00:00 AzurePublic 8
2 2019-02-07 14:00:00 ExternalPublic 1
3 2019-02-07 15:00:00 AzurePublic 5
4 2019-02-07 15:00:00 ExternalPublic 3
net_df_grouped.pivot(index="FlowIntervalEndTime", columns="FlowType", values="AllExtIPs")
FlowType AzurePublic ExternalPublic IntraVNet
FlowIntervalEndTime
2019-02-07 13:00:00 3.0 NaN NaN
2019-02-07 14:00:00 8.0 1.0 NaN
2019-02-07 15:00:00 5.0 3.0 NaN
2019-02-07 16:00:00 7.0 1.0 NaN
2019-02-07 17:00:00 8.0 NaN NaN
... ... ... ...
2019-02-14 09:00:00 8.0 1.0 NaN
2019-02-14 10:00:00 8.0 NaN NaN
2019-02-14 11:00:00 7.0 2.0 NaN
2019-02-14 12:00:00 9.0 NaN NaN
2019-02-14 13:00:00 2.0 NaN NaN

141 rows × 3 columns
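pivot_table does the group-and-pivot in a single step and handles duplicate index/column pairs by aggregating them; a minimal sketch that should reproduce the result above:

net_df.pivot_table(
    index="FlowIntervalEndTime",
    columns="FlowType",
    values="AllExtIPs",
    aggfunc="count",
)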


Time manipulation [Ashwin]#

  • Timezone conversions

  • Resample - Grouping by time

net_df = pd.read_pickle("../data/az_net_comms_df.pkl")
net_df.head()
TimeGenerated FlowStartTime FlowEndTime FlowIntervalEndTime FlowType ResourceGroup VMName VMIPAddress PublicIPs SrcIP ... DestPort FlowDirection AllowedOutFlows AllowedInFlows DeniedInFlows DeniedOutFlows RemoteRegion VMRegion AllExtIPs TotalAllowedFlows
0 2019-02-14 13:23:59.512 2019-02-14 12:21:58 2019-02-14 12:21:58 2019-02-14 13:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 10.0.3.5 [13.67.143.117] ... 443.0 O 1.0 0.0 0.0 0.0 centralus eastus 13.67.143.117 1.0
1 2019-02-14 13:23:59.512 2019-02-14 12:29:02 2019-02-14 12:29:02 2019-02-14 13:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 10.0.3.5 [40.77.232.95] ... 443.0 O 1.0 0.0 0.0 0.0 westcentralus eastus 40.77.232.95 1.0
2 2019-02-14 03:26:06.765 2019-02-14 02:08:46 2019-02-14 02:48:45 2019-02-14 03:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 10.0.3.5 [13.65.107.32, 40.124.45.19] ... 443.0 O 4.0 0.0 0.0 0.0 southcentralus eastus 13.65.107.32 4.0
3 2019-02-14 03:26:06.765 2019-02-14 02:08:46 2019-02-14 02:48:45 2019-02-14 03:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 10.0.3.5 [13.65.107.32, 40.124.45.19] ... 443.0 O 4.0 0.0 0.0 0.0 southcentralus eastus 40.124.45.19 4.0
4 2019-02-14 03:26:06.828 2019-02-14 02:30:56 2019-02-14 02:30:56 2019-02-14 03:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 10.0.3.5 [20.38.98.100] ... 443.0 O 1.0 0.0 0.0 0.0 eastus eastus 20.38.98.100 1.0

5 rows × 23 columns

Timezone considerations#

dti = pd.to_datetime(net_df['TimeGenerated'])
dti_utc = dti.dt.tz_localize("UTC")
print(dti_utc)
0      2019-02-14 13:23:59.512000+00:00
1      2019-02-14 13:23:59.512000+00:00
2      2019-02-14 03:26:06.765000+00:00
3      2019-02-14 03:26:06.765000+00:00
4      2019-02-14 03:26:06.828000+00:00
                     ...               
1355   2019-02-09 03:32:41.967000+00:00
1356   2019-02-09 03:32:51.124000+00:00
1357   2019-02-09 03:32:51.264000+00:00
1358   2019-02-09 03:32:45.608000+00:00
1359   2019-02-09 03:32:45.608000+00:00
Name: TimeGenerated, Length: 1360, dtype: datetime64[ns, UTC]
dti_pst = dti.dt.tz_localize("US/Pacific")
print(dti_pst)
0      2019-02-14 13:23:59.512000-08:00
1      2019-02-14 13:23:59.512000-08:00
2      2019-02-14 03:26:06.765000-08:00
3      2019-02-14 03:26:06.765000-08:00
4      2019-02-14 03:26:06.828000-08:00
                     ...               
1355   2019-02-09 03:32:41.967000-08:00
1356   2019-02-09 03:32:51.124000-08:00
1357   2019-02-09 03:32:51.264000-08:00
1358   2019-02-09 03:32:45.608000-08:00
1359   2019-02-09 03:32:45.608000-08:00
Name: TimeGenerated, Length: 1360, dtype: datetime64[ns, US/Pacific]
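tz_localize only attaches a timezone to naive timestamps; it does not shift them. To convert timezone-aware timestamps between zones, use tz_convert. A minimal sketch:

# Sketch: treat the naive timestamps as UTC, then convert to US/Pacific
dti_utc = dti.dt.tz_localize("UTC")
dti_utc.dt.tz_convert("US/Pacific").head()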

Grouping by time#

Resampling - resample() time-based groupby

net_df.set_index('TimeGenerated').resample('H')['FlowType'].count()
TimeGenerated
2019-02-07 13:00:00    3
2019-02-07 14:00:00    9
2019-02-07 15:00:00    8
2019-02-07 16:00:00    8
2019-02-07 17:00:00    8
                      ..
2019-02-14 09:00:00    9
2019-02-14 10:00:00    8
2019-02-14 11:00:00    9
2019-02-14 12:00:00    9
2019-02-14 13:00:00    2
Freq: H, Name: FlowType, Length: 169, dtype: int64
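resample returns a groupby-like object, so the usual aggregations work on it too; a minimal sketch summing the flow counters per hour:

(
    net_df.set_index("TimeGenerated")
    .resample("H")[["AllowedOutFlows", "AllowedInFlows"]]
    .sum()
    .head()
)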

Other Useful operations [Ian]#

  • Chaining multiple operations with “.”

  • Including external functions with pipe

  • Apply, assign, others ???

Chaining multiple operations with “.”#

net_df = pd.read_pickle("../data/az_net_comms_df.pkl")
net_df.head()
TimeGenerated FlowStartTime FlowEndTime FlowIntervalEndTime FlowType ResourceGroup VMName VMIPAddress PublicIPs SrcIP ... DestPort FlowDirection AllowedOutFlows AllowedInFlows DeniedInFlows DeniedOutFlows RemoteRegion VMRegion AllExtIPs TotalAllowedFlows
0 2019-02-14 13:23:59.512 2019-02-14 12:21:58 2019-02-14 12:21:58 2019-02-14 13:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 10.0.3.5 [13.67.143.117] ... 443.0 O 1.0 0.0 0.0 0.0 centralus eastus 13.67.143.117 1.0
1 2019-02-14 13:23:59.512 2019-02-14 12:29:02 2019-02-14 12:29:02 2019-02-14 13:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 10.0.3.5 [40.77.232.95] ... 443.0 O 1.0 0.0 0.0 0.0 westcentralus eastus 40.77.232.95 1.0
2 2019-02-14 03:26:06.765 2019-02-14 02:08:46 2019-02-14 02:48:45 2019-02-14 03:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 10.0.3.5 [13.65.107.32, 40.124.45.19] ... 443.0 O 4.0 0.0 0.0 0.0 southcentralus eastus 13.65.107.32 4.0
3 2019-02-14 03:26:06.765 2019-02-14 02:08:46 2019-02-14 02:48:45 2019-02-14 03:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 10.0.3.5 [13.65.107.32, 40.124.45.19] ... 443.0 O 4.0 0.0 0.0 0.0 southcentralus eastus 40.124.45.19 4.0
4 2019-02-14 03:26:06.828 2019-02-14 02:30:56 2019-02-14 02:30:56 2019-02-14 03:00:00 AzurePublic asihuntomsworkspacerg msticalertswin1 10.0.3.5 [20.38.98.100] ... 443.0 O 1.0 0.0 0.0 0.0 eastus eastus 20.38.98.100 1.0

5 rows × 23 columns

net_df[["TimeGenerated","AllExtIPs"]].groupby("AllExtIPs").agg("count")
TimeGenerated
AllExtIPs
10.0.3.4 3
10.0.3.5 3
104.211.30.1 2
104.40.17.153 4
104.43.212.12 12
... ...
90.130.70.73 2
99.84.104.63 1
99.84.106.178 1
99.84.106.27 1
99.84.106.92 1

125 rows × 1 columns

net_df[["RemoteRegion","AllExtIPs"]].groupby("AllExtIPs").agg("count").sort_values(by="RemoteRegion", ascending=False)
RemoteRegion
AllExtIPs
65.55.44.109 139
13.71.172.130 136
52.168.138.145 117
40.124.45.19 115
13.71.172.128 114
... ...
23.45.181.178 1
23.45.181.176 1
23.45.181.160 1
23.45.180.34 1
99.84.106.92 1

125 rows × 1 columns

Use parentheses to let you stack the functions vertically#

(
    net_df[["RemoteRegion","AllExtIPs"]]
    .groupby("AllExtIPs")
    .agg("count")
    .sort_values(by="RemoteRegion", ascending=False)
)
RemoteRegion
AllExtIPs
65.55.44.109 139
13.71.172.130 136
52.168.138.145 117
40.124.45.19 115
13.71.172.128 114
... ...
23.45.181.178 1
23.45.181.176 1
23.45.181.160 1
23.45.180.34 1
99.84.106.92 1

125 rows × 1 columns

(
    net_df[["RemoteRegion", "AllExtIPs"]]
    .groupby("AllExtIPs")
    .agg("count")
    .sort_values(
        by="RemoteRegion", ascending=False
    )
    .head(5)
)
RemoteRegion
AllExtIPs
65.55.44.109 139
13.71.172.130 136
52.168.138.145 117
40.124.45.19 115
13.71.172.128 114
(
    net_df[["RemoteRegion","AllExtIPs"]]
    .groupby("AllExtIPs")
    .agg("count")
    .sort_values(by="RemoteRegion", ascending=False)
    .head(5)
    .index
    .to_list()
)
['65.55.44.109',
 '13.71.172.130',
 '52.168.138.145',
 '40.124.45.19',
 '13.71.172.128']

External functions with .pipe#

df.pipe(function)

You can call external functions to process your data as part of a chain. The function must take a DataFrame as its first parameter and return a DataFrame.

# Define a couple of (not very useful) functions

def drop_duplicates(df, column_name):
    return df.drop_duplicates(subset=column_name)


def fill_missing_values(df):
    df_result = df.copy()
    for col in df_result.columns:
        df_result[col] = df_result[col].fillna("N/A")
    return df_result
display(net_df[["TimeGenerated", "ResourceGroup"]][net_df["ResourceGroup"].isna()])
print("rows with NaNs:", net_df.isnull().values.sum())
TimeGenerated ResourceGroup
326 2019-02-13 01:23:47.634 None
327 2019-02-13 01:23:47.634 None
336 2019-02-12 22:23:05.399 None
345 2019-02-12 22:23:05.384 None
357 2019-02-12 23:23:59.515 None
413 2019-02-12 18:23:51.853 None
414 2019-02-12 18:23:51.853 None
466 2019-02-12 22:23:17.236 None
rows with NaNs: 24
net_df.pipe(fill_missing_values).isnull().values.sum()
0

Using the drop_duplicates function

len(net_df)
1360
net_df.pipe(drop_duplicates, "AllExtIPs").shape
(125, 23)

Using both functions

net_df = pd.read_pickle("../data/az_net_comms_df.pkl")
len(net_df)
1360
net_df_cleaned = (
    net_df
    .pipe(drop_duplicates, "AllExtIPs")
    .pipe(fill_missing_values)
)
display(net_df[["TimeGenerated", "ResourceGroup"]][net_df["ResourceGroup"].isna()])
display(net_df_cleaned[["TimeGenerated", "ResourceGroup"]][net_df["ResourceGroup"].isna()])
TimeGenerated ResourceGroup
326 2019-02-13 01:23:47.634 None
327 2019-02-13 01:23:47.634 None
336 2019-02-12 22:23:05.399 None
345 2019-02-12 22:23:05.384 None
357 2019-02-12 23:23:59.515 None
413 2019-02-12 18:23:51.853 None
414 2019-02-12 18:23:51.853 None
466 2019-02-12 22:23:17.236 None
C:\Users\Ian\Anaconda3\envs\condadev\lib\site-packages\ipykernel_launcher.py:2: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  
TimeGenerated ResourceGroup
326 2019-02-13 01:23:47.634 N/A
327 2019-02-13 01:23:47.634 N/A
345 2019-02-12 22:23:05.384 N/A

External functions with .apply#

apply is relatively inefficient (it calls a Python function for every element or row), but it is sometimes unavoidable. Prefer a vectorized operation where one exists; a quick comparison sketch follows.
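Where a vectorized equivalent exists (for example the .str accessor for string operations), it is usually much faster than apply; a minimal sketch comparing the two:

# apply: calls a Python function once per element
upper_apply = net_df.RemoteRegion.apply(str.upper)

# vectorized: the pandas .str accessor does the same work without a Python-level loop
upper_vec = net_df.RemoteRegion.str.upper()

(upper_apply == upper_vec).all()   # True - same result, the vectorized form is typically faster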

.apply and pandas series#

series.apply(function)
df.column_name.apply(function)

display(
    # Note: the "." here is consumed by apply's second (convert_dtype) argument,
    # not passed to str.split, so the split is on whitespace; to pass a
    # separator you would use args=(".",)
    net_df.VMIPAddress.apply(str.split, ".").head()
)
display(
    net_df.VMRegion.apply(str.capitalize).head()
)
0    [10.0.3.5]
1    [10.0.3.5]
2    [10.0.3.5]
3    [10.0.3.5]
4    [10.0.3.5]
Name: VMIPAddress, dtype: object
0    Eastus
1    Eastus
2    Eastus
3    Eastus
4    Eastus
Name: VMRegion, dtype: object
# Using a lambda (inline) function
display(
    net_df.VMIPAddress.apply(
        lambda col: "_".join(col.split("."))
    )
    .head()
)
0    10_0_3_5
1    10_0_3_5
2    10_0_3_5
3    10_0_3_5
4    10_0_3_5
Name: VMIPAddress, dtype: object
import ipaddress

def to_ip(ip_str):
    if ip_str:
        ip = ipaddress.ip_address(ip_str)
        if ip.is_global:
            return "global", ip
        else:
            return "other", ip
    return "Unknown"

display(
    net_df.AllExtIPs.apply(to_ip)
    .head(10)
)
0    (global, 13.67.143.117)
1     (global, 40.77.232.95)
2     (global, 13.65.107.32)
3     (global, 40.124.45.19)
4     (global, 20.38.98.100)
5    (global, 13.67.143.117)
6    (global, 13.71.172.128)
7    (global, 13.71.172.130)
8     (global, 65.55.44.109)
9     (global, 40.77.228.69)
Name: AllExtIPs, dtype: object

.apply and DataFrames#

df.apply(function, axis=1) # apply by row
df.apply(function [, axis=0]) # apply by column

display(
    net_df
    .apply(lambda row: row.RemoteRegion.upper(), axis=1)
    .head(5)
)

display(
    net_df
    .apply(lambda row: row.RemoteRegion.capitalize() + ": " + str(hash(row.RemoteRegion)), axis=1)
    .head()
)
0         CENTRALUS
1     WESTCENTRALUS
2    SOUTHCENTRALUS
3    SOUTHCENTRALUS
4            EASTUS
dtype: object
0         Centralus: 3997470178254466550
1     Westcentralus: 2950529182713360191
2    Southcentralus: 2388453837175337402
3    Southcentralus: 2388453837175337402
4           Eastus: -5448835124403651518
dtype: object
def df_to_ip(row):
    for name in row.index:
        value = row[name]
        try:
            ip = ipaddress.ip_address(value)
            if ip.is_global:
                row[name] = f"IP global: {ip}"
            elif ip.is_private:
                row[name] = f"IP private: {ip}"
            else:
                row[name] = f"IP other: {ip}"
        except (ValueError, TypeError):  # value is not an IP address
            pass
    return row

net_df.apply(df_to_ip, axis=1).filter(regex=".*IP.*")
VMIPAddress PublicIPs SrcIP DestIP AllExtIPs
0 IP private: 10.0.3.5 [13.67.143.117] IP global: 13.67.143.117
1 IP private: 10.0.3.5 [40.77.232.95] IP global: 40.77.232.95
2 IP private: 10.0.3.5 [13.65.107.32, 40.124.45.19] IP global: 13.65.107.32
3 IP private: 10.0.3.5 [13.65.107.32, 40.124.45.19] IP global: 40.124.45.19
4 IP private: 10.0.3.5 [20.38.98.100] IP global: 20.38.98.100
... ... ... ... ... ...
1355 IP private: 10.0.3.5 [13.71.172.128, 13.71.172.130] IP global: 13.71.172.130
1356 IP private: 10.0.3.5 [40.77.232.95] IP global: 40.77.232.95
1357 IP private: 10.0.3.5 [52.168.138.145] IP global: 52.168.138.145
1358 IP private: 10.0.3.5 [23.215.98.90, 72.21.81.240] IP global: 23.215.98.90
1359 IP private: 10.0.3.5 [23.215.98.90, 72.21.81.240] IP global: 72.21.81.240

1360 rows × 5 columns


End of Session#

Break: 5 Minutes#