Workshop 2.2: Visualization in Jupyter Notebooks#

Disclaimer:#

This is not intended to be a comprehensive overview of visualization in Python/Jupyter. There are many libraries and techniques not covered here. The options shown are a few that we have used and liked, and that cover a wide range of use cases.


Basic plotting with pandas using Matplotlib#

Resources:

Cheatsheets:

Matplotlib Cheatsheets#

Bar charts#

Refer to the Bar Plots section for more examples and customization options.

import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
import pandas as pd
logons_full_df = pd.read_pickle("../data/host_logons.pkl")
net_full_df = pd.read_pickle("../data/az_net_comms_df.pkl")
logons_full_df.head()
Account EventID TimeGenerated Computer SubjectUserName SubjectDomainName SubjectUserSid TargetUserName TargetDomainName TargetUserSid TargetLogonId LogonType IpAddress WorkstationName TimeCreatedUtc
0 NT AUTHORITY\SYSTEM 4624 2019-02-12 04:56:34.307 MSTICAlertsWin1 MSTICAlertsWin1$ WORKGROUP S-1-5-18 SYSTEM NT AUTHORITY S-1-5-18 0x3e7 5 - - 2019-02-12 04:56:34.307
1 MSTICAlertsWin1\MSTICAdmin 4624 2019-02-12 04:37:25.340 MSTICAlertsWin1 - - S-1-0-0 MSTICAdmin MSTICAlertsWin1 S-1-5-21-996632719-2361334927-4038480536-500 0xc90e957 3 131.107.147.209 IANHELLE-DEV17 2019-02-12 04:37:25.340
2 MSTICAlertsWin1\MSTICAdmin 4624 2019-02-12 04:37:27.997 MSTICAlertsWin1 - - S-1-0-0 MSTICAdmin MSTICAlertsWin1 S-1-5-21-996632719-2361334927-4038480536-500 0xc90ea44 3 131.107.147.209 IANHELLE-DEV17 2019-02-12 04:37:27.997
3 MSTICAlertsWin1\MSTICAdmin 4624 2019-02-12 04:38:16.550 MSTICAlertsWin1 - - S-1-0-0 MSTICAdmin MSTICAlertsWin1 S-1-5-21-996632719-2361334927-4038480536-500 0xc912d62 3 131.107.147.209 IANHELLE-DEV17 2019-02-12 04:38:16.550
4 MSTICAlertsWin1\MSTICAdmin 4624 2019-02-12 04:38:21.370 MSTICAlertsWin1 - - S-1-0-0 MSTICAdmin MSTICAlertsWin1 S-1-5-21-996632719-2361334927-4038480536-500 0xc913737 3 131.107.147.209 IANHELLE-DEV17 2019-02-12 04:38:21.370
# Preprocess the data: group by LogonType and count the logon events (non-null Account values) per type
logontypebyacc = logons_full_df.groupby(['LogonType'])['Account'].count()
logontypebyacc.head()
LogonType
0      2
2     12
3     13
4      9
5    126
Name: Account, dtype: int64
logontypebyacc.plot(kind='bar')
<AxesSubplot:xlabel='LogonType'>
../../_images/d5f0a31f6b6fdb9d0825dd9abbf7bee8c8bf22eaef730e749cfa62796865ad3d.png
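The grouping step above uses count(), which counts rows rather than distinct accounts. The following is a minimal sketch of that difference, followed by a labeled bar chart. The data is made up for illustration and is not the workshop dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up logon events (illustrative only, not the workshop dataset)
events = pd.DataFrame(
    {
        "LogonType": [2, 3, 3, 3, 5, 5],
        "Account": ["alice", "alice", "alice", "bob", "SYSTEM", "SYSTEM"],
    }
)

grouped = events.groupby("LogonType")["Account"]
logon_events = grouped.count()         # number of rows (events) per logon type
distinct_accounts = grouped.nunique()  # number of different accounts per logon type

# Plot the event counts and label the chart
ax = logon_events.plot(kind="bar", rot=0, title="Logon events by logon type")
ax.set_xlabel("Logon type")
ax.set_ylabel("Number of logon events")
```

If the question is "how many accounts used each logon type", nunique() is probably the better aggregation. For "how many logons happened", count() is the one to use.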

Line charts#

# Preprocess the dataframe: index by TimeGenerated, resample by day, and count logon events per day
logonaccountbyday = logons_full_df.set_index('TimeGenerated').resample('D')['Account'].count()
logonaccountbyday.head()
TimeGenerated
2019-02-09     3
2019-02-10    11
2019-02-11     6
2019-02-12    72
2019-02-13    15
Freq: D, Name: Account, dtype: int64
logonaccountbyday.plot(figsize = (20,8))
<AxesSubplot:xlabel='TimeGenerated'>
../../_images/3b35918489606fa765b37febb81f03cc8f99427a0187d386cbaf3372da608abc.png
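To make the resampling step above easier to follow, here is a small sketch on made-up timestamps. It shows that resample("D") needs a datetime index and that days without events still appear in the result, with a count of zero.

```python
import pandas as pd

# Made-up event timestamps (illustrative only)
events = pd.DataFrame(
    {
        "TimeGenerated": pd.to_datetime(
            [
                "2019-02-09 08:15:00",
                "2019-02-09 17:40:00",
                "2019-02-11 03:05:00",
            ]
        ),
        "Account": ["alice", "bob", "alice"],
    }
)

# resample() needs a datetime index; "D" creates one bucket per calendar day
daily = events.set_index("TimeGenerated").resample("D")["Account"].count()
print(daily)
```

Because empty days are kept, the resulting line chart drops to zero on quiet days instead of joining the neighboring points directly.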

Customizations#

Annotate your charts by adding text, labels, and other customizations.

Docs:

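import matplotlib.pyplot as plt
# Matplotlib 3.6 renamed the seaborn styles to "seaborn-v0_8-*"; fall back to the old name on older versions
style = "seaborn-v0_8-whitegrid"
if style not in plt.style.available:
    style = "seaborn-whitegrid"
plt.style.use(style)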

plt.figure(figsize = (20,8))
plt.plot(logonaccountbyday, marker='o')
plt.title("Daily trend of account logons")
plt.xlabel("Date")
plt.ylabel("Logon Count")

# another example of customization with plot
# plt.plot(logonaccountbyday, color='green', marker='o', linestyle='dashed',linewidth=2)

plt.show()
../../_images/0cbad5ee4d87da0a3009fe4fc0d80e7733feff5aa7deb6acb4c76d412346164e.png
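The example above sets a title and axis labels. As a sketch of adding text to a chart, the snippet below marks the busiest day with an arrow. It reuses the five daily counts from the output shown earlier. The label text and arrow position are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# The five daily counts from the resample output earlier in this section
daily = pd.Series(
    [3, 11, 6, 72, 15],
    index=pd.date_range("2019-02-09", periods=5, freq="D"),
)

peak_day = daily.idxmax()
peak_value = daily.max()

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(daily.index, daily.values, marker="o")
ax.set_title("Daily trend of account logons")
ax.set_xlabel("Date")
ax.set_ylabel("Logon Count")

# Point an arrow at the busiest day; xytext is where the label itself is placed
annotation = ax.annotate(
    f"Peak: {peak_value} logons",
    xy=(peak_day, peak_value),
    xytext=(daily.index[1], peak_value * 0.9),
    arrowprops={"arrowstyle": "->"},
)
```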

Hvplot, Bokeh made easy(ier)#

Holoviews

Bokeh#

Bokeh is a very flexible JavaScript visualization framework. It produces beautiful interactive charts but is somewhat complex to use.

Example Bokeh Ridge plot

HoloViews#

HoloViews is a higher-level, declarative layer built on top of Bokeh (or Matplotlib).

Example Holoviews Violin plot

HVplot (HV == Holoviews)#

hvPlot exposes some of HoloViews' functionality as a pandas extension.
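A "pandas extension" here means that importing the library adds a new attribute (such as .hvplot) to DataFrames. The sketch below shows the general accessor mechanism that pandas offers for this. The accessor name demo_plot and its method are made up for illustration, and hvPlot's own internals may well differ.

```python
import pandas as pd


# "demo_plot" is a made-up accessor name used only for this sketch
@pd.api.extensions.register_dataframe_accessor("demo_plot")
class DemoPlotAccessor:
    """Toy accessor that adds DataFrame.demo_plot.<method>()."""

    def __init__(self, pandas_obj):
        self._df = pandas_obj

    def column_summary(self):
        """Return the number of non-null values in each column."""
        return self._df.count()


df = pd.DataFrame({"Account": ["alice", "bob", None], "LogonType": [3, 3, 5]})
print(df.demo_plot.column_summary())
```

Once the class is registered, every DataFrame in the session gains the demo_plot attribute. This is why a bare import statement is enough to enable this style of plotting.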

Installing and loading#

conda install -c pyviz hvplot
pip install hvplot

Examples#

import hvplot.pandas

count_of_logons = logons_full_df[["TimeGenerated", "Account"]].groupby("Account").count()
count_of_logons.hvplot.barh(height=300)
plot_df = (
    net_full_df[["L7Protocol", "AllExtIPs", "TotalAllowedFlows"]]
    .groupby(["L7Protocol", "TotalAllowedFlows"])
    .nunique()
)
display(plot_df.head(3))
plot_df.hvplot.scatter(by="L7Protocol")
L7Protocol  TotalAllowedFlows  AllExtIPs
ftp         1.0                1
http        1.0                12
http        2.0                16
plot_df = (
    logons_full_df[["TimeCreatedUtc", "Account", "LogonType"]]
    .assign(HourOfDay=logons_full_df.TimeCreatedUtc.dt.hour)
)
display(plot_df.head(3))
plot_df.hvplot.hist(y="HourOfDay", by="Account", title="Logons by Hour")
TimeCreatedUtc Account LogonType HourOfDay
0 2019-02-12 04:56:34.307 NT AUTHORITY\SYSTEM 5 4
1 2019-02-12 04:37:25.340 MSTICAlertsWin1\MSTICAdmin 3 4
2 2019-02-12 04:37:27.997 MSTICAlertsWin1\MSTICAdmin 3 4
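The histogram above relies on the derived HourOfDay column. As a minimal sketch on made-up timestamps, the snippet below shows how .dt.hour produces that column and which per-account counts a histogram would then be drawn from.

```python
import pandas as pd

# Made-up logon times (illustrative only)
logons = pd.DataFrame(
    {
        "TimeCreatedUtc": pd.to_datetime(
            ["2019-02-12 04:56:34", "2019-02-12 04:37:25", "2019-02-12 23:10:00"]
        ),
        "Account": ["SYSTEM", "admin", "admin"],
    }
)

# .dt.hour extracts the hour (0-23) from each timestamp
logons = logons.assign(HourOfDay=logons["TimeCreatedUtc"].dt.hour)

# Logons per account and hour: the numbers a per-account histogram is built from
per_hour = logons.groupby(["Account", "HourOfDay"]).size()
print(per_hour)
```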

Subplots#

plot_df.hvplot.hist(y="HourOfDay", by="Account", subplots=True, width=400).cols(2)

More parameters

plot_df.hvplot.hist(y="HourOfDay", by="Account", subplots=True, shared_axes=False, width=400).cols(2)
plot_df = (
    net_full_df[["L7Protocol", "AllExtIPs", "TotalAllowedFlows"]]
    .groupby(["L7Protocol", "TotalAllowedFlows"])
    .nunique()
)
display(plot_df.head(3))
plot_df.hvplot.violin(by="L7Protocol", height=600)
L7Protocol  TotalAllowedFlows  AllExtIPs
ftp         1.0                1
http        1.0                12
http        2.0                16

Combining plots#

plot_df = (
    net_full_df[["L7Protocol", "AllExtIPs", "TotalAllowedFlows"]]
    .groupby(["L7Protocol", "TotalAllowedFlows"])
    .nunique()
)


plot_df.hvplot.scatter(by="L7Protocol", height=600) + plot_df.hvplot.violin(by="L7Protocol", height=600)
plot2_df = (
    net_full_df[["FlowStartTime", "AllExtIPs", "L7Protocol", "RemoteRegion"]]
    .groupby(["RemoteRegion", pd.Grouper(key="FlowStartTime", freq="5min")])
    .agg({"L7Protocol": "nunique", "AllExtIPs": "nunique"})
    .sort_index()
    # .head(500)
    .reset_index()
)
plot2_df.hvplot.scatter(y="AllExtIPs", alpha=0.5, height=500, by="RemoteRegion") * plot2_df.hvplot.line(y="L7Protocol", color="blue")
plot_df = (
    net_full_df[["FlowStartTime", "L7Protocol", "RemoteRegion", "TotalAllowedFlows", "AllExtIPs"]]
    .assign(
        MinOfDay=net_full_df.FlowStartTime.dt.hour * 60 + net_full_df.FlowStartTime.dt.minute
    )
    .groupby(["FlowStartTime", "L7Protocol", "RemoteRegion", "TotalAllowedFlows"])
    .nunique()
    .reset_index()
)
plot_df.hvplot.box(y="TotalAllowedFlows", by="RemoteRegion", rot=30, height=400) * plot_df.hvplot.violin(y="TotalAllowedFlows", by="RemoteRegion")

Seaborn for specialized stats plots#

Intro: Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

Statistical specialization

Resources:

In the example below, we visualize regression models using a demo dataset provided by seaborn. The plot uses two quantitative variables, total_bill and tip, and shows how they are related to each other, with a separate fit for each value of smoker.

You can check more examples based on the data you have:

import seaborn as sns
sns.set_theme(style="darkgrid")

tips = sns.load_dataset("tips")
tips.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips, height= 8, aspect=15/8)
<seaborn.axisgrid.FacetGrid at 0x268db708048>
../../_images/02417220ffb50fc1f49be9c46f25d9adc871e25e80aec16e67b517007f203da1.png
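With hue set, lmplot draws one fitted regression line per group. To illustrate what such a per-group linear fit means, the sketch below computes a slope and intercept for each group with numpy.polyfit. This is a stand-in for illustration and not seaborn's internal code. The data is made up and perfectly linear, so the fitted values are easy to check by eye.

```python
import numpy as np
import pandas as pd

# Made-up, perfectly linear data (illustrative only, not the tips dataset)
bills = pd.DataFrame(
    {
        "total_bill": [10.0, 20.0, 30.0, 10.0, 20.0, 30.0],
        "tip": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0],
        "smoker": ["No", "No", "No", "Yes", "Yes", "Yes"],
    }
)

# Fit tip = slope * total_bill + intercept separately for each group
fits = {}
for smoker, group in bills.groupby("smoker"):
    slope, intercept = np.polyfit(group["total_bill"], group["tip"], deg=1)
    fits[smoker] = (float(slope), float(intercept))

for smoker, (slope, intercept) in fits.items():
    print(f"{smoker}: slope={slope:.3f}, intercept={intercept:.3f}")
```

Two lines with the same slope but different intercepts, as in this toy data, would appear as parallel lines in the lmplot output.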

Plotly#

Data Visualization Using Plotly: Python’s Visualization Library#

By Meenal Sarda.

Plotly is an open-source library that provides a whole set of chart types as well as tools to create dynamic dashboards. You can think of Plotly as a suite of tools, because it integrates with companion products such as Dash and Chart Studio to provide interactive dashboards. Plotly’s Python graphing library makes interactive, publication-quality graphs.

Plotly treats interactive charts and animations as first-class features, and this is the main difference between it and other visualization libraries such as Matplotlib or seaborn.

Main Properties of Plotly:

  • It can be used from several languages, including Python, R, and JavaScript.

  • No JavaScript knowledge is required at all. You code Plotly in your choice of supported languages.

  • Each Plotly visual is a JSON object. In this way, the visual can be accessed and used in different programming languages.

  • With Plotly you can also build dynamic dashboards using the Dash extension.

  • Chart Studio allows you to create and update the graphics you want without any coding. It has a very simple and useful interface. It is especially useful in areas such as business intelligence.

  • Plotly allows you to view the entire dataset in the same figure which is very important for the user experience.

  • Transforming Matplotlib charts to Plotly charts is supported.

  • Plotly is available as a pandas plotting backend (pandas 0.25 and later, with plotly 4.8 and later). By setting pd.options.plotting.backend to "plotly", we can plot directly from pandas without having to import Plotly Express.

Plotly Express#

The plotly.express module (usually imported as px) contains functions that can create entire figures at once, and is referred to as Plotly Express or PX. Plotly Express is a built-in part of the plotly library, and is the recommended starting point for creating most common figures.

  • Let’s import Plotly Express:

import plotly.express as px
  • We can create a bar chart with the px.bar function:

# Preparing Dataframe
df = logontypebyacc.to_frame(name = 'Frequency')
df.reset_index(inplace = True)
# Creating bar chart
fig = px.bar(df, x = 'LogonType', y = 'Frequency', title = 'Logon Frequency by Logon Type')
# Forcing the X axis to be categorical. Reference: https://plotly.com/python/categorical-axes/
fig.update_xaxes(type='category')
# Presenting chart
fig.show()
import pandas as pd
import json

# Open the log file; it contains one JSON object per line
with open('../data/combined_zeek.log', 'r') as zeek_data:
    # Create a list of dictionaries, one per log line
    zeek_list = [json.loads(line) for line in zeek_data]
# The with block closes the file automatically
# Create a dataframe
zeek_df = pd.DataFrame(data = zeek_list)
zeek_df.head()
@stream @system @proc ts uid id_orig_h id_orig_p id_resp_h id_resp_p proto ... is_64bit uses_aslr uses_dep uses_code_integrity uses_seh has_import_table has_export_table has_cert_table has_debug_data section_names
0 conn bobs.bigwheel.local zeek 1.588205e+09 Cvf4XX17hSAgXDdGEd 10.0.1.6 54243.0 10.0.0.4 53.0 udp ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 conn bobs.bigwheel.local zeek 1.588205e+09 CJ21Le4zsTUcyKKi98 10.0.1.6 56880.0 10.0.0.4 445.0 tcp ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 conn bobs.bigwheel.local zeek 1.588205e+09 CnOP7t1eGGHf6LFfuk 10.0.1.6 65108.0 10.0.0.4 53.0 udp ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 conn bobs.bigwheel.local zeek 1.588205e+09 CvxbPE3MuO7boUdSc8 10.0.1.6 138.0 10.0.1.255 138.0 udp ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 conn bobs.bigwheel.local zeek 1.588205e+09 CuRbE21APSQo2qd6rk 10.0.1.6 123.0 10.0.0.4 123.0 udp ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 148 columns
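As an alternative to parsing each line by hand, pandas can read newline-delimited JSON directly. The sketch below uses two made-up Zeek-style records held in memory rather than the workshop log file, and also converts the epoch-seconds ts column into timestamps.

```python
import io

import pandas as pd

# Two made-up Zeek-style records, one JSON object per line (illustrative only)
sample_log = io.StringIO(
    '{"ts": 1588205000.5, "id_orig_h": "10.0.1.6", "proto": "udp", "duration": 0.2}\n'
    '{"ts": 1588205060.0, "id_orig_h": "10.0.1.6", "proto": "tcp"}\n'
)

# lines=True parses newline-delimited JSON; fields missing from a record become NaN.
# convert_dates=False leaves date handling to the explicit conversion below.
sample_df = pd.read_json(sample_log, lines=True, convert_dates=False)

# Convert the epoch-seconds "ts" column into timestamps
sample_df["ts"] = pd.to_datetime(sample_df["ts"], unit="s")
print(sample_df)
```

The many NaN values in the dataframe above have the same cause: each Zeek log type carries different fields, so combining them produces a wide, sparse table.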

  • A histogram describes how the values of a variable are distributed. We can create one for the network connection duration with the px.histogram function.

# Creating histogram chart
fig = px.histogram(zeek_df, x = 'duration', title = 'Distribution of Frequencies', nbins = 1000)
# Presenting chart
fig.show()
  • Let’s now create a box plot to describe the variability of the network connection duration. We can use the px.box function to create box plots.

# Creating box plot
fig = px.box(zeek_df, x = 'id_resp_h', y = 'duration', title = 'Variability of Duration by Response IP Address')
# Presenting chart
fig.show()
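A box plot summarizes each group by its quartiles. The following sketch computes those numbers with pandas on made-up durations, to show what each box represents. Plotly's default quartile method may differ slightly from the pandas interpolation used here, so treat the values as approximate counterparts.

```python
import pandas as pd

# Made-up connection durations in seconds (illustrative only)
conns = pd.DataFrame(
    {
        "id_resp_h": ["10.0.0.4"] * 5 + ["10.0.1.255"] * 5,
        "duration": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
    }
)

# Quartiles per responder IP: the box spans q1 to q3 and the line inside is the median
quartiles = (
    conns.groupby("id_resp_h")["duration"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
print(quartiles)
```

A larger interquartile range (iqr) corresponds to a taller box, which means more variable connection durations for that address.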

MSTICPy visualizations#

Event timeline#

Basic plots#

import msticpy.vis.mp_pandas_plot

net_data = net_full_df.sort_values("FlowStartTime").tail(500)
net_data.mp_plot.timeline(time_column="FlowStartTime")
Column(
id = '8339', …)
net_data.mp_plot.timeline(
    time_column="FlowStartTime",
    source_columns=["L7Protocol", "RemoteRegion", "AllExtIPs"]
)
Column(
id = '8635', …)

Grouping#

net_data.mp_plot.timeline(
    time_column="FlowStartTime",
    source_columns=["L7Protocol", "RemoteRegion", "AllExtIPs"],
    group_by="L7Protocol",
)
Column(
id = '10304', …)

More parameters#

help(net_data.mp_plot.timeline)
Help on method timeline in module msticpy.vis.mp_pandas_plot:

timeline(**kwargs) -> bokeh.models.layouts.LayoutDOM method of msticpy.vis.mp_pandas_plot.MsticpyPlotAccessor instance
    Display a timeline of events.
    
    Parameters
    ----------
    time_column : str, optional
        Name of the timestamp column
        (the default is 'TimeGenerated')
    source_columns : list, optional
        List of default source columns to use in tooltips
        (the default is None)
    
    Other Parameters
    ----------------
    title : str, optional
        Title to display (the default is None)
    alert : SecurityAlert, optional
        Add a reference line/label using the alert time (the default is None)
    ref_event : Any, optional
        Add a reference line/label using the alert time (the default is None)
    ref_time : datetime, optional
        Add a reference line/label using `ref_time` (the default is None)
    group_by : str
        The column to group timelines on.
    legend: str, optional
        "left", "right", "inline" or "none"
        (the default is to show a legend when plotting multiple series
        and not to show one when plotting a single series)
    yaxis : bool, optional
        Whether to show the yaxis and labels (default is False)
    ygrid : bool, optional
        Whether to show the yaxis grid (default is False)
    xgrid : bool, optional
        Whether to show the xaxis grid (default is True)
    range_tool : bool, optional
        Show the the range slider tool (default is True)
    height : int, optional
        The height of the plot figure
        (the default is auto-calculated height)
    width : int, optional
        The width of the plot figure (the default is 900)
    color : str
        Default series color (default is "navy")
    overlay_data : pd.DataFrame:
        A second dataframe to plot as a different series.
    overlay_color : str
        Overlay series color (default is "green")
    ref_events : pd.DataFrame, optional
        Add references line/label using the event times in the dataframe.
        (the default is None)
    ref_time_col : str, optional
        Add references line/label using the this column in `ref_events`
        for the time value (x-axis).
        (this defaults the value of the `time_column` parameter or 'TimeGenerated'
        `time_column` is None)
    ref_col : str, optional
        The column name to use for the label from `ref_events`
        (the default is None)
    ref_times : List[Tuple[datetime, str]], optional
        Add one or more reference line/label using (the default is None)
    
    Returns
    -------
    LayoutDOM
        The bokeh plot figure.

Event duration#

net_data.mp_plot.timeline_duration(group_by="L7Protocol")
Column(
id = '9381', …)

Matrix plots#

Simple interactions

net_data.mp_plot.matrix(x="RemoteRegion", y="AllExtIPs")
Figure(
id = '9632', …)
(
    net_data[~net_data["L7Protocol"].isin(["http", "https"])]
    .mp_plot.matrix(x="L7Protocol", y="AllExtIPs", invert=True)
)
Figure(
id = '10603', …)
net_data.mp_plot.matrix(x="RemoteRegion", y="AllExtIPs", invert=True)
Figure(
id = '10704', …)

Process Trees#

process_df = pd.read_pickle("../data/processes_test.pkl")

process_df.mp_plot.process_tree(legend_col="Account")
(Figure(id='10806', ...), Row(id='10920', ...))

End of Session#

Break: 15 Minutes#