High-frequency econometrics

Two recently developed packages simplify programmatic access to LOBSTER data directly from your analysis environment:

R — lobsteR: Available on CRAN, lobsteR handles authentication, data requests, and downloads from the LOBSTER platform and integrates naturally with the highfrequency package for downstream analysis.
Python — lobsterdata: A lightweight Python client that lets you submit data construction requests, monitor their status, and download the resulting CSV files, covering the full asynchronous workflow of the LOBSTER API.

KU students can obtain access to LOBSTER data by writing to stefan.voigt@econ.ku.dk.

Read in and process Lobster files

This exercise requires you mainly to familiarize yourself with LOBSTER, an online limit order book data tool that provides easy-to-use, high-quality data for the entire universe of NASDAQ traded stocks. I requested data based on their online interface and stored it on Absalon before running the code below. Lobster files contain all trading messages (order submissions, cancellations, trades) for a pre-specified ticker that went through NASDAQ. For example, the sample file in Absalon contains the entire order book history for the ticker SAP until level 5 in January 2022. LOBSTER generates a ‘message’ and an ‘orderbook’ file for each active trading day of a selected ticker. The ‘orderbook’ file contains the evolution of the limit order book up to the requested levels. The ‘message’ file contains indicators for the type of event causing an update of the limit order book in the requested price range. All events are time-stamped in seconds after midnight, with decimal precision of at least milliseconds and up to nanoseconds depending on the requested period.

The ‘message’ and ‘order book’ files are provided in the .csv format and can easily be read with any statistical software package.

Exercises:

Familiarize yourself with the output documentation of Lobster files on their homepage. Download the file SAP_2022-01-01_2022-01-31_5.7z from Absalon and read them using read_csv. Lobster files always have the same naming convention, ticker_date_34200000_57600000_filetype_level.csv, where filetype denotes message or the corresponding order book snapshots. Take notes on what you think are essential cleaning steps to get a helpful dataset comprising order book and trading information.
Merge the two files after you apply some intuitive naming convention such that you can work with one file which traces all information about messages and the corresponding order book snapshots. Try to create a function process_orderbook, which performs essential cleaning of the data by making sure you filter out trading halts and aggregate order execution messages which may have triggered multiple limit orders. More specifically, the function should complete the following generic steps:
1. Identify potential trading stops (consult the Lobster documentation for this) and discard all observations recorded during the window when trading was interrupted.
2. Sometimes, messages related to the daily opening and closing auction are included in the Lobster file (especially during days when trading stops earlier, e.g., before holidays). Remove such observations (opening auctions are identified as messages with type == 6 and ID == -1, and closing auctions can be identified as messages with type == 6 and ID == -2).
3. Replace “empty” slots in the order book with NAs in case there was no liquidity quoted at higher order book levels. In that case, Lobster reports negative bid and large ask prices (above 199999).
4. Remove observations with seemingly crossed prices (when the best ask is below the best bid).
5. Market orders are sometimes executed against multiple limit orders. In that case, Lobster records multiple transactions with an identical time stamp. Aggregate these transactions and generate the corresponding trade size and the (size-weighted) execution price. Also, retain the order book snapshot after the transaction has been executed.
6. The direction of a trade is hard to evaluate (not observable) if the transaction has been executed against a hidden order (type == 5). Create a proxy for the direction based on the executed price relative to the last observed order book snapshot.

You can use this function when you work with raw order book data from Lobster to produce a clean, analysis-ready order book. In the solution part, I provide some further explanations of my implementation. - Use the function to clean the SAP Lobster output. - Summarize the order book by extracting the average midquote $q_t = (a_t + b_t)/2$ (where $a_t$ and $b_t$ denote the best bid and best ask), the average spread $S_t= (a_t - b_t)/q_t$ (in basis points relative to the concurrent midquote), the aggregate volume in USD and the quoted depth (in million USD 5 basis points from the concurrent best price on each side of the order book) for each 5-minute interval.
- Provide a meaningful illustration of the order book dynamics during the particular trading day. This is intentionally a very open question: There is so much information within an order book that it is impossible to highlight everything worth considering. I encourage you to post your visualization on the discussion board in Absalon.

Solutions:

library(dplyr)

Warning: package 'dplyr' was built under R version 4.5.3


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyr)
library(archive)
library(readr)
library(ggplot2)
library(gt)
library(lubridate)


Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

library(forcats)

I use the following code to read in (general) Lobster files - change ticker, date, and level in the code snippet below.

files_available <- archive::archive("SAP_2022-01-01_2022-01-31_5.7z") |>
  mutate(
    ticker = gsub("(.*)_(.*)_342.*_(.*).csv", "\\1", path),
    date = gsub("(.*)_(.*)_342.*_(.*).csv", "\\2", path),
    level = gsub("(.*)_(.*)_342.*_(.*).csv", "\\3", path)
  )

files_available

# A tibble: 40 × 5
   path                                                 size date   ticker level
   <chr>                                               <int> <chr>  <chr>  <chr>
 1 SAP_2022-01-28_34200000_57600000_message_5.csv          0 2022-… SAP    5    
 2 SAP_2022-01-28_34200000_57600000_orderbook_5.csv        0 2022-… SAP    5    
 3 SAP_2022-01-03_34200000_57600000_message_5.csv   14902108 2022-… SAP    5    
 4 SAP_2022-01-03_34200000_57600000_orderbook_5.csv 38240040 2022-… SAP    5    
 5 SAP_2022-01-04_34200000_57600000_message_5.csv   13248413 2022-… SAP    5    
 6 SAP_2022-01-04_34200000_57600000_orderbook_5.csv 33881101 2022-… SAP    5    
 7 SAP_2022-01-05_34200000_57600000_message_5.csv   12234669 2022-… SAP    5    
 8 SAP_2022-01-05_34200000_57600000_orderbook_5.csv 31189574 2022-… SAP    5    
 9 SAP_2022-01-06_34200000_57600000_message_5.csv   17311993 2022-… SAP    5    
10 SAP_2022-01-06_34200000_57600000_orderbook_5.csv 44240752 2022-… SAP    5    
# ℹ 30 more rows

Lobster files always follow the same naming convention; thus, the code above can be recycled in a more general fashion for other Lobster files. Next, we read in data for one(!) of the 20 trading days.

messages_raw <- read_csv(
  archive::archive_read("SAP_2022-01-01_2022-01-31_5.7z", 3),
  col_names = c(
    "ts",
    "type",
    "order_id",
    "m_size",
    "m_price",
    "direction",
    "null"
  ),
  col_type = cols(null = col_skip())
) |>
  mutate(
    ts = as.POSIXct(
      ts,
      origin = as.Date("2022-01-07"),
      tz = "GMT",
      format = "%Y-%m-%d %H:%M:%OS6"
    ),
    m_price = m_price / 10000
  )

messages_raw |>
  filter(type == 4 | type == 5) |>
  ggplot(aes(x = ts, y = m_price)) +
  geom_step() +
  geom_point(data = messages_raw |> tail(1), color = "red", size = 3) +
  labs(x = "", y = "USD", title = "Traded prices: SAP, January 3rd, 2022")

Figure 1: Traded prices for SAP on January 3rd, 2022, based on execution messages (types 4 and 5).

Note my naming convention for column names. The command col_type makes sure that there is easy processing afterward. By default, ts denotes the time in seconds since midnight (decimals are precise until the nanosecond level), and price always comes in 10.000 USD. type denotes the message type: 4, for instance, corresponds to the execution of a visible order. The remaining variables are explained in more detail here.

Next, the corresponding orderbook snapshots contain price and quoted size for each of the 5 levels.

Each message is associated with the corresponding order book snapshot. After merging the message and orderbook files, the entire data thus looks as follows.

The code below is rather flexible to handle Lobster files with more or less levels. Each column contains the corresponding prices and sizes on each side of the order book. More information is available on the documentation of Lobster. The output contains one file with all information available in the order book.

level <- 5
orderbook_raw <- read_csv(
  archive::archive_read("SAP_2022-01-01_2022-01-31_5.7z", 4),
  col_names = paste(
    rep(c("ask_price", "ask_size", "bid_price", "bid_size"), level),
    rep(1:level, each = 4),
    sep = "_"
  ),
  cols(.default = col_double())
) |>
  mutate_at(vars(contains("price")), ~ . / 10000)

orderbook <- bind_cols(messages_raw, orderbook_raw)
orderbook

# A tibble: 316,379 × 26
   ts                   type order_id m_size m_price direction ask_price_1
   <dttm>              <dbl>    <dbl>  <dbl>   <dbl>     <dbl>       <dbl>
 1 2022-01-07 09:30:00     3 34351556    500    140.        -1        140.
 2 2022-01-07 09:30:00     1 34456580    177    140.         1        140.
 3 2022-01-07 09:30:00     3 34390184    100    140.        -1        140.
 4 2022-01-07 09:30:00     1 34457908    160    140.        -1        140.
 5 2022-01-07 09:30:00     3 34302936    100    140.         1        140.
 6 2022-01-07 09:30:00     1 34458304    192    140.        -1        140.
 7 2022-01-07 09:30:00     3 34457908    160    140.        -1        140.
 8 2022-01-07 09:30:00     1 34463304    100    140.        -1        140.
 9 2022-01-07 09:30:00     2 34458304     57    140.        -1        140.
10 2022-01-07 09:30:00     2 34458304     29    140.        -1        140.
# ℹ 316,369 more rows
# ℹ 19 more variables: ask_size_1 <dbl>, bid_price_1 <dbl>, bid_size_1 <dbl>,
#   ask_price_2 <dbl>, ask_size_2 <dbl>, bid_price_2 <dbl>, bid_size_2 <dbl>,
#   ask_price_3 <dbl>, ask_size_3 <dbl>, bid_price_3 <dbl>, bid_size_3 <dbl>,
#   ask_price_4 <dbl>, ask_size_4 <dbl>, bid_price_4 <dbl>, bid_size_4 <dbl>,
#   ask_price_5 <dbl>, ask_size_5 <dbl>, bid_price_5 <dbl>, bid_size_5 <dbl>

The order book contains 316379 observations. While the columns ts to direction help understand the specific messages, the remaining columns contain a snapshot of the order book at that particular time.

Next, Lobster sometimes contains opening and closing auctions (in particular on days when trading hours deviate from the regular times). If you follow the documentation on Lobster, you will see that these times can be identified as I do in the code below.

opening_auction <- orderbook |>
  filter((type == 6 & order_id == -1) | row_number() == 1) |>
  filter(row_number() == n()) |>
  pull(ts)
closing_auction <- orderbook |>
  filter((type == 6 & order_id == -2) | row_number() == n()) |>
  filter(row_number() == 1) |>
  pull(ts)

c(opening_auction, closing_auction)

[1] "2022-01-07 09:30:02 GMT" "2022-01-07 15:59:59 GMT"

In line with the exercise, the function process_orderbook() performs several essential steps. You can call it from within R with the following command.

process_orderbook <- function(orderbook) {
  # Did a trading halt happen? ----
  halt_index <- orderbook |>
    filter(
      type == 7 &
        direction == -1 &
        m_price == -1 / 10000 |
        type == 7 & direction == -1 & m_price == 1 / 10000
    )

  while (nrow(halt_index) > 1) {
    # Filter out messages that occurred in between trading halts
    cat("Trading halt detected")
    orderbook <- orderbook |>
      filter(ts < halt_index$ts[1] | ts > halt_index$ts[2])
    halt_index <- halt_index |> filter(row_number() > 2)
  }

  # Opening and closing auction ----
  # Discard everything before type 6 & ID -1 and everything after type 6 & ID -2

  opening_auction <- orderbook |>
    filter(
      type == 6,
      order_id == -1
    ) |>
    pull(ts)

  closing_auction <- orderbook |>
    filter(
      type == 6,
      order_id == -2
    ) |>
    pull(ts)

  if (length(opening_auction) != 1) {
    opening_auction <- orderbook |>
      select(ts) |>
      head(1) |>
      pull(ts) -
      lubridate::seconds(0.1)
  }
  if (length(closing_auction) != 1) {
    closing_auction <- orderbook |>
      select(ts) |>
      tail(1) |>
      pull(ts) +
      lubridate::seconds(0.1)
  }

  orderbook <- orderbook |> filter(ts > opening_auction & ts < closing_auction)
  orderbook <- orderbook |> filter(type != 6 & type != 7)

  # Replace "empty" slots in the order book (0 volume) with NA prices ----
  orderbook <- orderbook |>
    mutate(
      across(contains("bid_price"), ~ replace(., . < 0, NA)),
      across(contains("ask_price"), ~ replace(., . >= 999999, NA))
    )

  # Remove crossed order book observations ----
  orderbook <- orderbook |>
    filter(ask_price_1 > bid_price_1)

  # Merge transactions with unique timestamp ----
  trades <- orderbook |>
    filter(type == 4 | type == 5) |>
    select(ts:direction)

  trades <- inner_join(
    trades,
    orderbook |>
      group_by(ts) |>
      filter(row_number() == 1) |>
      ungroup() |>
      transmute(
        ts,
        ask_price_1,
        bid_price_1,
        midquote = ask_price_1 / 2 + bid_price_1 / 2,
        lag_midquote = lag(midquote)
      ),
    by = "ts"
  )

  trades <- trades |>
    mutate(
      direction = case_when(
        type == 5 & m_price < lag_midquote ~ 1, # lobster convention: direction = 1 if executed against a limit buy order
        type == 5 & m_price > lag_midquote ~ -1,
        type == 4 ~ as.double(direction),
        TRUE ~ as.double(NA)
      )
    )

  # Aggregate transactions with size and volume-weighted price
  trade_aggregated <- trades |>
    group_by(ts) |>
    summarise(
      type = last(type),
      order_id = NA,
      m_price = sum(m_price * m_size) / sum(m_size),
      m_size = sum(m_size),
      direction = last(direction)
    )

  # Merge trades with the last observed order book snapshot
  trade_aggregated <- inner_join(
    trade_aggregated,
    orderbook |>
      select(ts, ask_price_1:last_col()) |>
      group_by(ts) |>
      filter(row_number() == n()),
    by = "ts"
  )

  orderbook <- orderbook |>
    filter(
      type != 4,
      type != 5
    ) |>
    bind_rows(trade_aggregated) |>
    arrange(ts) |>
    mutate(direction = if_else(type == 4 | type == 5, direction, as.double(NA)))

  return(orderbook)
}

You can process the raw data from before by simply calling.

orderbook <- process_orderbook(orderbook)

The function compute_depth below computes the number of traded shares within bp basis points of the associated best price, which can be used to measure liquidity.

compute_depth <- function(orderbook, side = "bid", bp = 0) {
  # Computes depth (in contracts) based on order book snapshots
  if (side == "bid") {
    value_bid <- (1 - bp / 10000) * orderbook |> select("bid_price_1")
    index_bid <- orderbook |>
      select(contains("bid_price")) |>
      mutate_all(function(x) {
        if_else(is.na(x), FALSE, x >= value_bid)
      })

    sum_vector <- (orderbook |> select(contains("bid_size")) * index_bid) |>
      rowSums(na.rm = TRUE)
  } else {
    value_ask <- (1 + bp / 10000) * orderbook |> select("ask_price_1")
    index_ask <- orderbook |>
      select(contains("ask_price")) |>
      mutate_all(function(x) {
        if_else(is.na(x), FALSE, x <= value_ask)
      })
    sum_vector <- (orderbook |> select(contains("ask_size")) * index_ask) |>
      rowSums(na.rm = TRUE)
  }
  return(sum_vector)
}

Next, I compute the midquote, bid-ask spread, aggregate volume, and depth (the amount of trade-able units in the order book):

Midquote $q_t = (a_t + b_t)/2$ (where $a_t$ and $b_t$ denote the best bid and best ask)
Spread $S_t= (a_t - b_t)$ (values below are computed in basis points relative to the concurrent midquote)
Volume is the aggregate sum of traded units of the stock. I do not differentiate between hidden (type==5) and visible volume.
Depth at the best level.

I aggregate the information into buckets of 5 minutes, using time-weighting for the order book variables.

orderbook_summaries <- orderbook |>
  transmute(
    ts,
    midquote = ask_price_1 / 2 + bid_price_1 / 2,
    trading_volume = if_else(
      type == 4 | type == 5,
      m_price * m_size,
      as.double(NA)
    ),
    depth0_bid = bid_size_1,
    depth0_ask = ask_size_1,
    spread = 10000 * (ask_price_1 - bid_price_1) / midquote
  ) |>
  mutate(
    ts_latency = as.numeric(lead(ts)) - as.numeric(ts), # time between messages
    ts_latency = if_else(is.na(ts_latency), 0, ts_latency), # 0 for first message
    ts_minute = lubridate::floor_date(ts, "5 minutes")
  ) |> # create 5 minute buckets (e.g. messages from 10:30:00 - 10:34:59.xx belong into 10:30:00)
  group_by(ts_minute) |>
  summarise(
    midquote = last(midquote),
    n_trades = sum(!is.na(trading_volume)),
    n = n(), # number of messages
    trading_volume = sum(trading_volume, na.rm = TRUE),
    trading_volume = if_else(
      is.na(trading_volume),
      as.double(0),
      trading_volume
    ),
    depth0_bid = weighted.mean(depth0_bid, ts_latency),
    depth0_ask = weighted.mean(depth0_ask, ts_latency),
    spread = weighted.mean(spread, ts_latency)
  )

orderbook_summaries |>
  gt() |>
  cols_label(
    ts_minute = "Time",
    midquote = "Midquote (USD)",
    n_trades = "Trades",
    n = "Messages",
    trading_volume = "Volume (USD)",
    depth0_bid = "Bid depth",
    depth0_ask = "Ask depth",
    spread = "Spread (bps)"
  ) |>
  tab_spanner(
    label = "Order book",
    columns = c(midquote, spread, depth0_bid, depth0_ask)
  ) |>
  tab_spanner(
    label = "Trading activity",
    columns = c(n_trades, n, trading_volume)
  ) |>
  fmt_time(columns = ts_minute, time_style = "hm") |>
  fmt_currency(columns = midquote, currency = "USD", decimals = 2) |>
  fmt_currency(columns = trading_volume, currency = "USD", decimals = 0) |>
  fmt_number(columns = c(depth0_bid, depth0_ask), decimals = 0) |>
  fmt_number(columns = spread, decimals = 2) |>
  fmt_integer(columns = c(n_trades, n))

Five-minute summaries of midquote, spread (bps), trade count, volume, and quoted depth for SAP, January 2022.
Time	Order book				Trading activity
Time	Midquote (USD)	Spread (bps)	Bid depth	Ask depth	Trades	Messages	Volume (USD)
9:30 AM	$140.38	3.62	485	534	61	12,926	$1,313,466
9:35 AM	$140.24	4.33	669	584	23	14,844	$602,060
9:40 AM	$140.15	4.31	761	427	24	11,150	$299,088
9:45 AM	$139.49	4.54	777	557	43	12,891	$711,751
9:50 AM	$139.26	4.76	713	720	45	20,975	$643,746
9:55 AM	$139.31	4.79	748	628	46	17,916	$666,569
10:00 AM	$139.81	5.05	749	671	37	17,288	$640,948
10:05 AM	$139.81	4.93	835	520	42	14,336	$517,071
10:10 AM	$140.16	5.33	805	796	32	11,862	$609,344
10:15 AM	$140.17	4.59	493	633	24	11,218	$218,798
10:20 AM	$140.25	4.31	567	652	69	10,213	$1,392,951
10:25 AM	$140.18	4.54	470	805	17	12,386	$193,138
10:30 AM	$140.19	4.43	447	715	27	12,563	$443,163
10:35 AM	$140.29	3.61	435	426	44	10,616	$408,664
10:40 AM	$140.49	4.04	554	548	91	10,862	$1,516,834
10:45 AM	$140.43	4.51	467	758	28	12,057	$249,738
10:50 AM	$140.24	4.32	407	746	36	11,627	$292,742
10:55 AM	$140.21	3.61	299	787	19	8,978	$363,035
11:00 AM	$140.18	2.79	178	950	45	8,117	$852,378
11:05 AM	$140.19	4.07	455	821	29	8,625	$788,170
11:10 AM	$140.23	3.13	366	513	57	9,471	$1,794,017
11:15 AM	$140.31	2.79	278	470	36	9,019	$547,094
11:20 AM	$140.37	3.72	528	594	40	7,751	$742,033
11:25 AM	$140.57	4.10	526	533	58	12,209	$1,843,627
11:30 AM	$140.39	6.24	219	522	47	2,706	$1,043,222
11:35 AM	$140.37	8.53	240	203	41	501	$926,211
11:40 AM	$140.37	8.63	173	166	16	207	$82,531
11:45 AM	$140.31	5.78	136	188	19	212	$201,726
11:50 AM	$140.14	6.04	82	56	10	405	$111,357
11:55 AM	$140.25	6.20	154	170	12	398	$142,420
12:00 PM	$140.22	6.09	138	119	10	234	$143,158
12:05 PM	$140.23	6.90	142	46	14	188	$171,964
12:10 PM	$140.13	5.57	188	72	14	157	$129,363
12:15 PM	$140.08	3.55	109	68	17	192	$191,348
12:20 PM	$140.17	3.94	120	44	9	143	$63,614
12:25 PM	$140.10	4.88	134	121	21	210	$173,409
12:30 PM	$140.00	6.99	138	165	26	286	$407,699
12:35 PM	$140.00	5.66	50	121	32	300	$586,622
12:40 PM	$139.94	5.94	229	76	8	198	$82,985
12:45 PM	$139.99	5.80	210	79	10	188	$27,437
12:50 PM	$139.94	4.92	288	103	13	255	$122,189
12:55 PM	$139.99	4.16	171	113	6	257	$59,784
1:00 PM	$139.95	4.83	255	167	13	240	$168,846
1:05 PM	$139.91	6.02	152	142	2	137	$6,995
1:10 PM	$140.07	7.01	192	122	3	310	$19,176
1:15 PM	$140.06	6.59	162	221	5	139	$21,009
1:20 PM	$140.17	3.50	213	26	10	189	$84,217
1:25 PM	$140.21	2.68	250	156	9	147	$79,495
1:30 PM	$140.28	2.34	237	100	12	196	$112,770
1:35 PM	$140.39	3.00	172	120	26	340	$322,772
1:40 PM	$140.63	6.59	274	88	18	483	$121,353
1:45 PM	$140.62	4.52	106	126	17	247	$172,959
1:50 PM	$140.68	5.11	251	71	7	178	$40,787
1:55 PM	$140.72	4.62	168	153	17	270	$193,457
2:00 PM	$140.75	4.61	285	150	10	127	$72,340
2:05 PM	$140.97	6.84	398	196	16	348	$154,616
2:10 PM	$141.11	4.68	278	110	32	285	$362,107
2:15 PM	$141.18	4.06	191	146	14	283	$176,133
2:20 PM	$141.12	2.63	129	125	25	375	$242,762
2:25 PM	$141.13	4.58	139	61	17	204	$196,725
2:30 PM	$141.03	3.71	183	74	12	201	$77,879
2:35 PM	$141.01	4.19	170	46	30	406	$330,383
2:40 PM	$141.19	8.42	163	88	11	247	$142,360
2:45 PM	$141.20	4.63	391	48	17	240	$155,433
2:50 PM	$141.24	5.42	297	86	27	358	$381,499
2:55 PM	$141.12	3.00	87	76	23	492	$263,047
3:00 PM	$141.14	5.22	145	147	19	577	$123,330
3:05 PM	$141.12	3.25	173	124	14	285	$199,390
3:10 PM	$141.20	2.59	303	19	23	430	$332,351
3:15 PM	$141.29	2.48	218	137	41	600	$451,828
3:20 PM	$141.29	2.08	181	174	56	517	$806,482
3:25 PM	$141.33	1.74	287	111	39	582	$559,393
3:30 PM	$141.44	2.30	198	117	64	1,028	$739,473
3:35 PM	$141.44	2.13	263	111	65	864	$886,603
3:40 PM	$141.29	2.51	206	103	57	1,252	$742,247
3:45 PM	$141.56	2.84	355	165	35	1,065	$242,671
3:50 PM	$141.29	2.50	193	151	154	2,458	$2,367,376
3:55 PM	$141.43	2.00	246	161	140	2,142	$1,857,272

Finally, some visualization of the data at hand: The code below shows the dynamics of the traded prices (red points) and the quoted prices at the higher order book levels.

orderbook_trades <- orderbook |>
  filter(type == 4 | type == 5) |>
  select(ts, m_price)

orderbook_quotes <- orderbook |>
  mutate(ts_new = floor_date(ts, "minute")) |>
  group_by(ts_new) |>
  summarise_all(dplyr::last) |>
  select(-ts_new) |>
  mutate(id = row_number()) |>
  select(ts, id, matches("bid|ask")) |>
  gather(level, price, -ts, -id) |>
  separate(level, into = c("side", "variable", "level"), sep = "_") |>
  mutate(level = as.numeric(level)) |>
  pivot_wider(names_from = variable, values_from = price)

ggplot() +
  theme_bw() +
  geom_point(
    data = orderbook_quotes,
    aes(x = ts, y = price, color = as_factor(level), size = size / max(size)),
    alpha = 0.1
  ) +
  geom_point(data = orderbook_trades, aes(x = ts, y = m_price), color = 'red') +
  labs(title = "Orderbook Dynamics", y = "Price", x = "") +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    legend.position = "none"
  ) +
  scale_y_continuous()

Figure 2: Order book dynamics for SAP: quoted prices at each level (shaded by relative depth) with executed trades overlaid in red.

import py7zr
import pandas as pd
import numpy as np
from plotnine import (
    ggplot, aes, geom_step, geom_point, labs,
    theme_bw, theme, element_blank,
)
from great_tables import GT

I use the following code to read in (general) Lobster files - change ticker, date, and level in the code snippet below.

archive_path = "SAP_2022-01-01_2022-01-31_5.7z"

with py7zr.SevenZipFile(archive_path, mode="r") as archive:
    file_info = archive.list()

files_available = pd.DataFrame(
    {"path": [f.filename for f in file_info], "size": [f.uncompressed for f in file_info]}
)
files_available[["ticker", "date", "level"]] = (
    files_available["path"].str.extract(r"(.*)_(.*)_342.*_(.*).csv")
)

files_available.head(10)

	path	size	ticker	date	level
0	SAP_2022-01-28_34200000_57600000_message_5.csv	0	SAP	2022-01-28	5
1	SAP_2022-01-28_34200000_57600000_orderbook_5.csv	0	SAP	2022-01-28	5
2	SAP_2022-01-03_34200000_57600000_message_5.csv	14902108	SAP	2022-01-03	5
3	SAP_2022-01-03_34200000_57600000_orderbook_5.csv	38240040	SAP	2022-01-03	5
4	SAP_2022-01-04_34200000_57600000_message_5.csv	13248413	SAP	2022-01-04	5
5	SAP_2022-01-04_34200000_57600000_orderbook_5.csv	33881101	SAP	2022-01-04	5
6	SAP_2022-01-05_34200000_57600000_message_5.csv	12234669	SAP	2022-01-05	5
7	SAP_2022-01-05_34200000_57600000_orderbook_5.csv	31189574	SAP	2022-01-05	5
8	SAP_2022-01-06_34200000_57600000_message_5.csv	17311993	SAP	2022-01-06	5
9	SAP_2022-01-06_34200000_57600000_orderbook_5.csv	44240752	SAP	2022-01-06	5

Lobster files always follow the same naming convention; thus, the code above can be recycled in a more general fashion for other Lobster files. The helper below extracts a single file from the archive into memory, mirroring archive::archive_read() in R.

def read_archive_csv(path, target_index, names=None, usecols=None):
    with py7zr.SevenZipFile(path, mode="r") as archive:
        file_info = archive.list()
        target_info = file_info[target_index]
        target = target_info.filename
        target_size = target_info.uncompressed
        if target_size is None or target_size < 0:
            raise ValueError(f"Could not determine uncompressed size for archive member: {target}")
        factory = py7zr.io.BytesIOFactory(max(target_size, 1))
        archive.extract(targets=[target], factory=factory)
    buf = factory.products[target]
    buf.seek(0)
    return pd.read_csv(buf, header=None, names=names, usecols=usecols)

Next, we read in data for one(!) of the trading days available in the archive.

messages_raw = read_archive_csv(
    archive_path,
    target_index=2,
    names=["ts", "type", "order_id", "m_size", "m_price", "direction", "null"],
    usecols=range(6),
)
messages_raw["ts"] = (
    pd.Timestamp("2022-01-03", tz="UTC")
    + pd.to_timedelta(messages_raw["ts"], unit="s")
)
messages_raw["m_price"] = messages_raw["m_price"] / 10000
trades = messages_raw[messages_raw["type"].isin([4, 5])]

(
    ggplot(
        trades,
        aes(x="ts", y="m_price"),
    )
    + geom_step()
    + geom_point(data=trades.tail(1), color="red", size=3)
    + labs(x="", y="USD", title="Traded prices: SAP, January 3rd, 2022")
)

<plotnine.ggplot.ggplot object at 0x00000233BAD70E50>

Note my naming convention for column names. The argument usecols=range(6) makes sure we drop the trailing null column for easy processing afterward. By default, ts denotes the time in seconds since midnight (decimals are precise until the nanosecond level), and price always comes in 10.000 USD. type denotes the message type: 4, for instance, corresponds to the execution of a visible order. The remaining variables are explained in more detail here.

Next, the corresponding orderbook snapshots contain price and quoted size for each of the 5 levels.

The code below is rather flexible to handle Lobster files with more or less levels. Each column contains the corresponding prices and sizes on each side of the order book. After merging the message and orderbook files, the entire data thus looks as follows.

level = 5
ob_cols = [
    f"{var}_{lvl}"
    for lvl in range(1, level + 1)
    for var in ("ask_price", "ask_size", "bid_price", "bid_size")
]

orderbook_raw = read_archive_csv(archive_path, target_index=3, names=ob_cols)
price_cols = [c for c in orderbook_raw.columns if "price" in c]
orderbook_raw[price_cols] = orderbook_raw[price_cols] / 10000

orderbook = pd.concat(
    [messages_raw.reset_index(drop=True), orderbook_raw.reset_index(drop=True)],
    axis=1,
)
pd.set_option("display.max_columns", 7)
orderbook.head(10)

	ts	type	order_id	...	ask_size_5	bid_price_5	bid_size_5
0	2022-01-03 09:30:00.041693104+00:00	3	34351556	...	696	140.23	800
1	2022-01-03 09:30:00.042287443+00:00	1	34456580	...	696	140.23	800
2	2022-01-03 09:30:00.042327460+00:00	3	34390184	...	764	140.23	800
3	2022-01-03 09:30:00.043255366+00:00	1	34457908	...	764	140.23	800
4	2022-01-03 09:30:00.043453093+00:00	3	34302936	...	764	140.23	800
5	2022-01-03 09:30:00.043699050+00:00	1	34458304	...	764	140.23	800
6	2022-01-03 09:30:00.049983479+00:00	3	34457908	...	764	140.23	800
7	2022-01-03 09:30:00.050566166+00:00	1	34463304	...	696	140.23	800
8	2022-01-03 09:30:00.050704660+00:00	2	34458304	...	696	140.23	800
9	2022-01-03 09:30:00.051073967+00:00	2	34458304	...	696	140.23	800

10 rows × 26 columns

The order book contains as many observations as the message file. While the columns ts to direction help understand the specific messages, the remaining columns contain a snapshot of the order book at that particular time.

opening_auction = orderbook.loc[
    (orderbook["type"] == 6) & (orderbook["order_id"] == -1), "ts"
]
if opening_auction.empty:
    opening_auction = orderbook["ts"].head(1)

closing_auction = orderbook.loc[
    (orderbook["type"] == 6) & (orderbook["order_id"] == -2), "ts"
]
if closing_auction.empty:
    closing_auction = orderbook["ts"].tail(1)

pd.concat([opening_auction, closing_auction]).tolist()

[Timestamp('2022-01-03 09:30:02.221711541+0000', tz='UTC'), Timestamp('2022-01-03 15:59:59.000480373+0000', tz='UTC')]

In line with the exercise, the function process_orderbook() performs several essential steps. You can call it from within Python with the following command.

def process_orderbook(orderbook):
    orderbook = orderbook.copy()

    # Did a trading halt happen? ----
    halt_mask = (
        (orderbook["type"] == 7)
        & (orderbook["direction"] == -1)
        & (orderbook["m_price"].isin([-1 / 10000, 1 / 10000]))
    )
    halt_ts = orderbook.loc[halt_mask, "ts"].reset_index(drop=True)

    while len(halt_ts) > 1:
        # Filter out messages that occurred in between trading halts
        print("Trading halt detected")
        orderbook = orderbook[
            (orderbook["ts"] < halt_ts.iloc[0]) | (orderbook["ts"] > halt_ts.iloc[1])
        ]
        halt_ts = halt_ts.iloc[2:].reset_index(drop=True)

    # Opening and closing auction ----
    # Discard everything before type 6 & ID -1 and everything after type 6 & ID -2
    opening = orderbook.loc[
        (orderbook["type"] == 6) & (orderbook["order_id"] == -1), "ts"
    ]
    closing = orderbook.loc[
        (orderbook["type"] == 6) & (orderbook["order_id"] == -2), "ts"
    ]
    open_ts = (
        opening.iloc[0]
        if len(opening) == 1
        else orderbook["ts"].iloc[0] - pd.Timedelta("0.1s")
    )
    close_ts = (
        closing.iloc[0]
        if len(closing) == 1
        else orderbook["ts"].iloc[-1] + pd.Timedelta("0.1s")
    )
    orderbook = orderbook[(orderbook["ts"] > open_ts) & (orderbook["ts"] < close_ts)]
    orderbook = orderbook[~orderbook["type"].isin([6, 7])]

    # Replace "empty" slots in the order book (0 volume) with NA prices ----
    bid_price_cols = [c for c in orderbook.columns if "bid_price" in c]
    ask_price_cols = [c for c in orderbook.columns if "ask_price" in c]
    orderbook[bid_price_cols] = orderbook[bid_price_cols].mask(orderbook[bid_price_cols] < 0)
    orderbook[ask_price_cols] = orderbook[ask_price_cols].mask(orderbook[ask_price_cols] >= 999999)

    # Remove crossed order book observations ----
    orderbook = orderbook[orderbook["ask_price_1"] > orderbook["bid_price_1"]]

    # Merge transactions with unique timestamp ----
    trades = orderbook[orderbook["type"].isin([4, 5])][
        ["ts", "type", "order_id", "m_size", "m_price", "direction"]
    ].copy()

    snapshots = (
        orderbook.drop_duplicates(subset="ts", keep="first")
        .assign(midquote=lambda d: (d["ask_price_1"] + d["bid_price_1"]) / 2)
        [["ts", "ask_price_1", "bid_price_1", "midquote"]]
        .sort_values("ts")
        .reset_index(drop=True)
    )
    snapshots["lag_midquote"] = snapshots["midquote"].shift(1)

    trades = trades.merge(snapshots, on="ts", how="inner")

    # lobster convention: direction = 1 if executed against a limit buy order
    trades["direction"] = np.where(
        (trades["type"] == 5) & (trades["m_price"] < trades["lag_midquote"]),
        1.0,
        np.where(
            (trades["type"] == 5) & (trades["m_price"] > trades["lag_midquote"]),
            -1.0,
            np.where(trades["type"] == 4, trades["direction"].astype(float), np.nan),
        ),
    )

    # Aggregate transactions with size and volume-weighted price
    trade_aggregated = trades.groupby("ts", as_index=False).apply(
        lambda g: pd.Series(
            {
                "type": g["type"].iloc[-1],
                "order_id": np.nan,
                "m_price": np.sum(g["m_price"] * g["m_size"]) / g["m_size"].sum(),
                "m_size": g["m_size"].sum(),
                "direction": g["direction"].iloc[-1],
            }
        ),
        include_groups=False,
    )

    # Merge trades with the last observed order book snapshot
    snapshot_cols = ["ts"] + [
        c for c in orderbook.columns if c.startswith(("ask_", "bid_"))
    ]
    last_snapshot = (
        orderbook[snapshot_cols]
        .drop_duplicates(subset="ts", keep="last")
        .reset_index(drop=True)
    )
    trade_aggregated = trade_aggregated.merge(last_snapshot, on="ts", how="inner")

    non_trades = orderbook[~orderbook["type"].isin([4, 5])]
    orderbook = (
        pd.concat([non_trades, trade_aggregated], ignore_index=True)
        .sort_values("ts")
        .reset_index(drop=True)
    )
    orderbook["direction"] = np.where(
        orderbook["type"].isin([4, 5]), orderbook["direction"], np.nan
    )
    return orderbook

You can process the raw data from before by simply calling.

orderbook = process_orderbook(orderbook)

The function compute_depth below computes the number of traded shares within bp basis points of the associated best price, which can be used to measure liquidity.

def compute_depth(orderbook, side="bid", bp=0):
    # Computes depth (in contracts) based on order book snapshots
    if side == "bid":
        ref = (1 - bp / 10000) * orderbook["bid_price_1"]
        price_cols = [c for c in orderbook.columns if "bid_price" in c]
        size_cols = [c for c in orderbook.columns if "bid_size" in c]
        idx = orderbook[price_cols].ge(ref, axis=0).fillna(False)
    else:
        ref = (1 + bp / 10000) * orderbook["ask_price_1"]
        price_cols = [c for c in orderbook.columns if "ask_price" in c]
        size_cols = [c for c in orderbook.columns if "ask_size" in c]
        idx = orderbook[price_cols].le(ref, axis=0).fillna(False)

    sizes = orderbook[size_cols].fillna(0).values
    return (sizes * idx.values).sum(axis=1)

Next, I compute the midquote, bid-ask spread, aggregate notional volume, and depth (the amount of trade-able units in the order book):

Midquote $q_t = (a_t + b_t)/2$ (where $a_t$ and $b_t$ denote the best bid and best ask)
Spread $S_t= (a_t - b_t)$ (values below are computed in basis points relative to the concurrent midquote)
Notional volume is the aggregate traded dollar value of the stock, computed as trade price times trade size. I do not differentiate between hidden (type==5) and visible trade volume.
Depth at the best level.

I aggregate the information into buckets of 5 minutes, using time-weighting for the order book variables.

summary = pd.DataFrame(
    {
        "ts": orderbook["ts"].values,
        "midquote": (orderbook["ask_price_1"] + orderbook["bid_price_1"]) / 2,
        "trading_volume": np.where(
            orderbook["type"].isin([4, 5]),
            orderbook["m_price"] * orderbook["m_size"],
            np.nan,
        ),
        "depth0_bid": orderbook["bid_size_1"].values,
        "depth0_ask": orderbook["ask_size_1"].values,
    }
)
summary["spread"] = (
    10000
    * (orderbook["ask_price_1"].values - orderbook["bid_price_1"].values)
    / summary["midquote"]
)
# time between messages (0 for last message)
summary["ts_latency"] = (
    (summary["ts"].shift(-1) - summary["ts"]).dt.total_seconds().fillna(0)
)
# 5-minute buckets (e.g. messages from 10:30:00 - 10:34:59.xx belong into 10:30:00)
summary["ts_minute"] = summary["ts"].dt.floor("5min")


def weighted_mean(values, weights):
    mask = ~values.isna()
    if not mask.any() or weights[mask].sum() == 0:
        return np.nan
    return np.average(values[mask], weights=weights[mask])


orderbook_summaries = summary.groupby("ts_minute", as_index=False).apply(
    lambda g: pd.Series(
        {
            "midquote": g["midquote"].iloc[-1],
            "n_trades": int(g["trading_volume"].notna().sum()),
            "n": len(g),
            "trading_volume": float(g["trading_volume"].sum()),
            "depth0_bid": weighted_mean(g["depth0_bid"], g["ts_latency"]),
            "depth0_ask": weighted_mean(g["depth0_ask"], g["ts_latency"]),
            "spread": weighted_mean(g["spread"], g["ts_latency"]),
        }
    ),
    include_groups=False,
)

orderbook_display = orderbook_summaries.copy()
orderbook_display["ts_minute"] = orderbook_display["ts_minute"].dt.time

(
    GT(orderbook_display)
    .cols_label(
        ts_minute="Time",
        midquote="Midquote (USD)",
        n_trades="Trades",
        n="Messages",
        trading_volume="Volume (USD)",
        depth0_bid="Bid depth",
        depth0_ask="Ask depth",
        spread="Spread (bps)",
    )
    .tab_spanner(
        label="Order book",
        columns=["midquote", "spread", "depth0_bid", "depth0_ask"],
    )
    .tab_spanner(
        label="Trading activity",
        columns=["n_trades", "n", "trading_volume"],
    )
    .fmt_time(columns="ts_minute", time_style="h_m_p")
    .fmt_currency(columns="midquote", currency="USD", decimals=2)
    .fmt_currency(columns="trading_volume", currency="USD", decimals=0)
    .fmt_number(columns=["depth0_bid", "depth0_ask"], decimals=0)
    .fmt_number(columns="spread", decimals=2)
    .fmt_integer(columns=["n_trades", "n"])
)

Table 1: Five-minute summaries of midquote, spread (bps), trade count, notional volume, and quoted depth for SAP, January 2022.

Time	Order book				Trading activity
Time	Midquote (USD)	Spread (bps)	Bid depth	Ask depth	Trades	Messages	Volume (USD)
9:30 AM	$140.38	3.62	482	533	61	12,926	$1,313,466
9:35 AM	$140.24	4.33	668	583	23	14,844	$602,060
9:40 AM	$140.15	4.32	760	424	24	11,150	$299,088
9:45 AM	$139.49	4.54	776	554	43	12,891	$711,751
9:50 AM	$139.26	4.76	711	717	45	20,975	$643,746
9:55 AM	$139.31	4.79	747	626	46	17,916	$666,569
10:00 AM	$139.81	5.05	747	671	37	17,288	$640,948
10:05 AM	$139.81	4.93	831	518	42	14,336	$517,071
10:10 AM	$140.16	5.33	804	795	32	11,862	$609,344
10:15 AM	$140.17	4.59	491	633	24	11,218	$218,798
10:20 AM	$140.25	4.31	566	651	69	10,213	$1,392,951
10:25 AM	$140.18	4.55	469	804	17	12,386	$193,138
10:30 AM	$140.19	4.44	446	714	27	12,563	$443,163
10:35 AM	$140.29	3.61	431	426	44	10,616	$408,664
10:40 AM	$140.49	4.04	551	547	91	10,862	$1,516,834
10:45 AM	$140.43	4.51	466	756	28	12,057	$249,738
10:50 AM	$140.24	4.32	405	745	36	11,627	$292,742
10:55 AM	$140.21	3.61	299	787	19	8,978	$363,035
11:00 AM	$140.18	2.79	178	950	45	8,117	$852,378
11:05 AM	$140.19	4.08	448	819	29	8,625	$788,170
11:10 AM	$140.23	3.13	363	510	57	9,471	$1,794,017
11:15 AM	$140.31	2.79	277	470	36	9,019	$547,094
11:20 AM	$140.37	3.72	527	593	40	7,751	$742,033
11:25 AM	$140.57	4.10	525	528	58	12,209	$1,843,627
11:30 AM	$140.39	6.24	219	522	47	2,706	$1,043,222
11:35 AM	$140.37	8.53	240	203	41	501	$926,211
11:40 AM	$140.37	8.63	173	166	16	207	$82,531
11:45 AM	$140.31	5.78	136	188	19	212	$201,726
11:50 AM	$140.14	6.04	82	56	10	405	$111,357
11:55 AM	$140.25	6.20	154	170	12	398	$142,420
12:00 PM	$140.22	6.09	138	119	10	234	$143,158
12:05 PM	$140.23	6.90	142	46	14	188	$171,964
12:10 PM	$140.13	5.57	188	72	14	157	$129,363
12:15 PM	$140.08	3.55	109	68	17	192	$191,348
12:20 PM	$140.17	3.94	120	44	9	143	$63,614
12:25 PM	$140.10	4.88	134	121	21	210	$173,409
12:30 PM	$140.00	6.99	138	165	26	286	$407,699
12:35 PM	$140.00	5.66	50	121	32	300	$586,622
12:40 PM	$139.94	5.94	229	76	8	198	$82,985
12:45 PM	$139.99	5.80	210	79	10	188	$27,437
12:50 PM	$139.94	4.92	288	103	13	255	$122,189
12:55 PM	$139.99	4.16	171	113	6	257	$59,784
1:00 PM	$139.95	4.83	255	167	13	240	$168,846
1:05 PM	$139.91	6.02	152	142	2	137	$6,995
1:10 PM	$140.07	7.01	192	122	3	310	$19,176
1:15 PM	$140.06	6.59	162	221	5	139	$21,009
1:20 PM	$140.17	3.50	213	26	10	189	$84,217
1:25 PM	$140.21	2.68	250	156	9	147	$79,495
1:30 PM	$140.28	2.34	237	100	12	196	$112,770
1:35 PM	$140.39	3.00	172	120	26	340	$322,772
1:40 PM	$140.63	6.59	274	88	18	483	$121,353
1:45 PM	$140.62	4.52	106	126	17	247	$172,959
1:50 PM	$140.68	5.11	251	71	7	178	$40,787
1:55 PM	$140.72	4.62	168	153	17	270	$193,457
2:00 PM	$140.75	4.61	285	150	10	127	$72,340
2:05 PM	$140.97	6.84	398	196	16	348	$154,616
2:10 PM	$141.11	4.68	278	110	32	285	$362,107
2:15 PM	$141.18	4.06	190	146	14	283	$176,133
2:20 PM	$141.12	2.63	129	125	25	375	$242,762
2:25 PM	$141.13	4.58	139	61	17	204	$196,725
2:30 PM	$141.03	3.71	183	74	12	201	$77,879
2:35 PM	$141.01	4.19	170	46	30	406	$330,383
2:40 PM	$141.19	8.42	163	88	11	247	$142,360
2:45 PM	$141.20	4.63	391	48	17	240	$155,433
2:50 PM	$141.24	5.42	297	86	27	358	$381,499
2:55 PM	$141.12	3.00	87	76	23	492	$263,047
3:00 PM	$141.14	5.22	144	147	19	577	$123,330
3:05 PM	$141.12	3.25	173	124	14	285	$199,390
3:10 PM	$141.20	2.59	303	19	23	430	$332,351
3:15 PM	$141.29	2.48	218	137	41	600	$451,828
3:20 PM	$141.29	2.08	181	174	56	517	$806,482
3:25 PM	$141.33	1.74	287	111	39	582	$559,393
3:30 PM	$141.44	2.30	198	117	64	1,028	$739,473
3:35 PM	$141.44	2.13	263	111	65	864	$886,603
3:40 PM	$141.29	2.51	206	103	57	1,252	$742,247
3:45 PM	$141.56	2.84	355	165	35	1,065	$242,671
3:50 PM	$141.29	2.50	193	151	154	2,458	$2,367,376
3:55 PM	$141.43	2.00	246	161	140	2,142	$1,857,272

Finally, some visualization of the data at hand: The code below shows the dynamics of the traded prices (red points) and the quoted prices at the higher order book levels.

orderbook_trades = orderbook.loc[orderbook["type"].isin([4, 5]), ["ts", "m_price"]]

ob_minute = (
    orderbook.assign(ts_new=lambda d: d["ts"].dt.floor("min"))
    .groupby("ts_new", as_index=False)
    .last()
    .drop(columns="ts_new")
)
quote_cols = [c for c in ob_minute.columns if c.startswith(("ask_", "bid_"))]
long_df = (
    ob_minute[["ts"] + quote_cols]
    .melt(id_vars="ts", var_name="key", value_name="value")
)
parts = long_df["key"].str.split("_", expand=True)
long_df["side"] = parts[0]
long_df["variable"] = parts[1]
long_df["level"] = parts[2].astype(int)
orderbook_quotes = (
    long_df.pivot_table(
        index=["ts", "side", "level"], columns="variable", values="value"
    )
    .reset_index()
    .dropna(subset=["price"])
)
orderbook_quotes["size_norm"] = orderbook_quotes["size"] / orderbook_quotes["size"].max()
orderbook_quotes["level"] = pd.Categorical(orderbook_quotes["level"])

(
    ggplot()
    + theme_bw()
    + geom_point(
        orderbook_quotes,
        aes(x="ts", y="price", color="level", size="size_norm"),
        alpha=0.1,
    )
    + geom_point(orderbook_trades, aes(x="ts", y="m_price"), color="red")
    + labs(title="Orderbook Dynamics", y="Price", x="")
    + theme(
        panel_grid_major=element_blank(),
        panel_grid_minor=element_blank(),
        legend_position="none",
    )
)

<plotnine.ggplot.ggplot object at 0x00000233BB62F090>

Brownian Motion and realized volatility

In this exercise, you will analyze the empirical properties of the (geometric) Brownian motion based on a simulation study. Such simulation studies can also be helpful for various applications in continuous time finance, e.g., option pricing.

Finance theory suggests that prices must be so-called semimartingales. The most commonly used semimartingale is the Geometric Brownian Motion (GBM), where we assume that the logarithm of the stock price $S_t$, $X_t = \log(S_t)$ follows the following dynamic \[ X_t = X_0 + \mu t + \sigma W_t \] where $\mu$ and $\sigma$ are constants and $W_t$ is a Brownian motion.

Exercises:

Write a function that simulates (log) stock prices according to the model above. Recall that the Brownian motion has independent increments which are normally distributed. The function should generate a path of the log stock price with a pre-defined number of observations (granularity), drift term, and volatility. Then, plot one sample path of the implied (log) stock price.
Use the function to simulate a large number of geometric Brownian motions. Illustrate the quadratic variation of the process.
Compute the realized volatility for each simulated process and illustrate the sampling distribution.
Finally, consider the infill asymptotics. Show the effect of sampling once per day, once per hour, once per minute, and so on on the variance of the realized volatility.

Solutions:

For this exercise, I use some functionalities from the xts and highfrequency packages, particularly for aggregation at different frequencies.

library(dplyr)
library(tidyr)
library(xts)

Loading required package: zoo


Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric


######################### Warning from 'xts' package ##########################
#                                                                             #
# The dplyr lag() function breaks how base R's lag() function is supposed to  #
# work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
# source() into this session won't work correctly.                            #
#                                                                             #
# Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
# conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
# dplyr from breaking base R's lag() function.                                #
#                                                                             #
# Code in packages is not affected. It's protected by R's namespace mechanism #
# Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
#                                                                             #
###############################################################################


Attaching package: 'xts'

The following objects are masked from 'package:dplyr':

    first, last

library(highfrequency)

Warning: package 'highfrequency' was built under R version 4.5.3

library(purrr)

library(ggplot2)
library(lubridate)

Note: When loading xts and the tidyverse jointly, you may run into serious issues if you do not consider naming conflicts! Especially the convenient functions last and first from dplyr may be affected. For that purpose, I consistently use dplyr::first and dplyr::last in this chapter.

First, the function below is a rather general version of a simulated generalized Brownian motion. Note that there is no immediate need to assign trading date and hour; I implement this because it will provide output closer to actual data, and therefore, parts of my code can be recycled for future projects. The function can be called with a specific seed to replicate the results. The relevant parameters are the initial values start and the geometric Brownian motion mu and sigma parameters. The input steps defines the simulated Brownian motion’s granularity (sampling frequency).

simulate_geom_brownian_motion <- function(
  seed = 3010,
  steps = 1000,
  mu = 0,
  sigma = 1,
  start = 0
) {
  set.seed(seed)
  date <- Sys.Date() + days(seed)
  hour(date) <- 9
  minute(date) <- 30
  date_end <- date
  hour(date_end) <- 16
  minute(date_end) <- 0

  tibble(
    step = 1:(steps + 1),
    time = seq(from = date, to = date_end, length.out = steps + 1),
    price = cumsum(c(
      start,
      rnorm(steps, mean = mu / steps, sd = sigma / sqrt(steps))
    ))
  )
}

Note the last line of the function doing most of the work: We exploit that increments of a Brownian motion are independent and normally distributed. The entire path is nothing but the cumulative sum of the increments.

Next, I simulate the stock price with 1000 intraday observations. Recall that the geometric Brownian motion implies dynamics for the log of the stock price, which is why I adjust the plotted value in the $y$-axis.

tab <- simulate_geom_brownian_motion(start = log(18))
tab |>
  ggplot(aes(x = time, y = exp(price))) +
  geom_line() +
  labs(x = "", y = "Stock price")

Figure 3: A single simulated intraday path of the geometric Brownian motion (1000 steps).

Next, we scale the simulation up. The code below simulates 250 paths of the geometric Brownian motion, all with the same starting value. The function illustrates the sampled (log) price paths.

true_sigma <- 2.4
steps <- 1000

simulation <- tibble(iteration = 1:250)
simulation <- simulation |>
  mutate(
    values = map(
      iteration,
      simulate_geom_brownian_motion,
      sigma = true_sigma,
      steps = steps
    )
  ) |>
  unnest(values)

simulation |>
  ggplot(aes(x = step, y = price, group = iteration)) +
  geom_line(alpha = 0.3) +
  labs(x = "Iteration", y = "(log) price")

Figure 4: Two hundred and fifty simulated log price paths of the geometric Brownian motion ($\sigma = 2.4$, 1000 intraday steps each).

Next, we investigate some of the properties of the geometric Brownian motion. First, by definition, the sample variation of $X_t = \int\limits_0^t\mu dt + \int\limits_0^t\sigma dW_t$ is $\int\limits_0^t\sigma^2 dt = \sigma^2t$. The figure below illustrates the sample variation over time and shows that the standard deviation across all simulated (log) prices increases linearly in $t$ as expected. For the simulation, I used $\sigma = 2.4$ (true_sigma), which corresponds to the slope of the line (adjusted by the number of observations).

simulation |>
  group_by(step) |>
  summarise(sd = sd(price)) |>
  ggplot(aes(x = step, y = sd^2)) +
  geom_line() +
  geom_abline(aes(slope = true_sigma^2 / steps, intercept = 0)) +
  labs(x = "Iteration", y = "Sample variation")

Figure 5: Sample variation of simulated GBM log prices over time; the straight line shows the theoretical $\sigma^2 t$.

What about the realized volatility? As defined in the lecture, the realized variance \[RV = \sum\limits_{i=0}^n \Delta X_t^2\] converges to the integrated variance (in the absence of jumps). The bar plot below illustrates the sampling distribution of the realized volatility, which, as expected, centers around true_sigma, the integrated variance of the simulated geometric Brownian motion. You can change the value true_sigma above and analyze the effect.

simulation |>
  group_by(iteration) |>
  summarise(rv = sum(((price - lag(price))^2), na.rm = TRUE)) |>
  ggplot(aes(x = rv)) +
  geom_histogram(bins = 100) +
  labs(x = "Realized Volatility") +
  geom_vline(aes(xintercept = true_sigma^2), color = "red", linetype = "dotted")

Figure 6: Sampling distribution of the realized variance across 250 simulated GBM paths; the dotted vertical line marks the true integrated variance.

Finally, some more analysis regarding the infill asymptotics. As should be clear by now, sampling at the highest frequency possible will ensure that the realized volatility recovers the integrated variance of the (log) price process. In other words, we can measure the variance. The following lines of code illustrate the effect: Whereas we simulated our process at a high frequency, we now simulate sub-samples at different frequencies, ranging from every second to once per day. The figure then illustrates the daily realized volatilities relative to the integrated variance of the process. Not surprisingly, the dispersion around the true parameter decreases substantially with higher sampling frequency.

Note that the code below illustrates how to work with the xts package and switch between the tidyverse and the high-frequency data format requirements. Of course, the procedure could be more convenient, but it is worth exploiting the functionalities of both worlds instead of recording everything in a tidy manner.

import numpy as np
import pandas as pd
from plotnine import *
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
plt.switch_backend("Agg")

def simulate_geom_brownian_motion(seed=3010, steps=1000, mu=0, sigma=1, start=0):
    np.random.seed(seed)

    date = datetime.now().date() + timedelta(days=seed)
    date = datetime.combine(date, datetime.min.time()).replace(hour=9, minute=30)
    date_end = date.replace(hour=16, minute=0)

    times = pd.date_range(start=date, end=date_end, periods=steps + 1)

    increments = np.random.normal(
        loc=mu / steps, scale=sigma / np.sqrt(steps), size=steps
    )
    price = np.cumsum(np.insert(increments, 0, start))

    return pd.DataFrame(
        {"step": np.arange(1, steps + 2), "time": times, "price": price}
    )

tab = simulate_geom_brownian_motion(start=np.log(18))

(
    ggplot(tab, aes(x="time", y=np.exp(tab["price"])))
    + geom_line()
    + labs(x="", y="Stock price")
)

<plotnine.ggplot.ggplot object at 0x0000023375AECCD0>

Next, we scale the simulation up. The code below simulates 250 paths of the geometric Brownian motion, all with the same starting value. The function illustrates the sampled (log) price paths.

true_sigma = 2.4
steps = 1000

simulations = []
for i in range(1, 251):
    df = simulate_geom_brownian_motion(seed=i, sigma=true_sigma, steps=steps)
    df["iteration"] = i
    simulations.append(df)

simulation = pd.concat(simulations, ignore_index=True)

(
    ggplot(simulation, aes(x="step", y="price", group="iteration"))
    + geom_line(alpha=0.3)
    + labs(x="Iteration", y="(log) price")
)

<plotnine.ggplot.ggplot object at 0x00000233FCDBA850>

var_df = (
    simulation
    .groupby("step")["price"]
    .std()
    .reset_index(name="sd")
    .assign(var=lambda d: d["sd"] ** 2)
)

(
    ggplot(var_df, aes(x="step", y="var"))
    + geom_line()
    + geom_abline(slope=true_sigma**2 / steps, intercept=0, linetype="dashed")
    + labs(x="Iteration", y="Sample variation")
)

<plotnine.ggplot.ggplot object at 0x0000023377014990>

rv = (
    simulation
    .assign(lag_price=lambda d: d.groupby("iteration")["price"].shift(1))
    .assign(sq_diff=lambda d: (d["price"] - d["lag_price"]) ** 2)
    .groupby("iteration")["sq_diff"]
    .sum()
    .dropna()
)

rv_df = rv.reset_index(name="rv")

(
    ggplot(rv_df, aes(x="rv"))
    + geom_histogram(bins=100)
    + geom_vline(xintercept=true_sigma**2, linetype="dotted", color="red")
    + labs(x="Realized Volatility")
)

<plotnine.ggplot.ggplot object at 0x0000023375954D10>

def infill_asymptotics(df, sampling_minutes=5):
    df = df.set_index("time").sort_index()

    # resample (similar to aggregatePrice)
    sampled = df["price"].resample(f"{sampling_minutes}min").last().dropna()

    rv = np.sum(np.diff(sampled) ** 2)
    return rv


results = []

freq_order = [
    "1 Second",
    "1 Minute",
    "5 Minutes",
    "15 Minutes",
    "1 Hour",
    "1 Day",
]

rvs = (
    simulation
    .groupby("iteration")
    .apply(
        lambda grp: pd.Series({
            "1 Second": infill_asymptotics(grp, 1 / 60),
            "1 Minute": infill_asymptotics(grp, 1),
            "5 Minutes": infill_asymptotics(grp, 5),
            "15 Minutes": infill_asymptotics(grp, 15),
            "1 Hour": infill_asymptotics(grp, 60),
            "1 Day": infill_asymptotics(grp, int(6.5 * 60)),
        })
    )
    .reset_index()
)

rvs_long = (
    rvs
    .melt(id_vars="iteration",
          var_name="sampling_frequency",
          value_name="rv")
    .assign(sqrt_rv=lambda d: np.sqrt(d["rv"]))
    .assign(
        sampling_frequency=lambda d: pd.Categorical(
            d["sampling_frequency"],
            categories=freq_order,
            ordered=True
        )
    )
)

(
    ggplot(rvs_long, aes(x="sqrt_rv", fill="sampling_frequency"))
    + geom_density(alpha=0.5)
    + geom_vline(xintercept=true_sigma, linetype="dotted")
    + facet_wrap("~sampling_frequency", scales="free_y")
    + labs(x="(Log) Realized Volatility", y="")
)

<plotnine.ggplot.ggplot object at 0x000002B8DF550D90>