High-frequency econometrics

Two recently developed packages simplify programmatic access to LOBSTER data directly from your analysis environment:

KU students can obtain access to LOBSTER data by writing to stefan.voigt@econ.ku.dk.

Read in and process Lobster files

This exercise requires you mainly to familiarize yourself with LOBSTER, an online limit order book data tool that provides easy-to-use, high-quality data for the entire universe of NASDAQ traded stocks. I requested data based on their online interface and stored it on Absalon before running the code below. Lobster files contain all trading messages (order submissions, cancellations, trades) for a pre-specified ticker that went through NASDAQ. For example, the sample file in Absalon contains the entire order book history for the ticker SAP until level 5 in January 2022. LOBSTER generates a ‘message’ and an ‘orderbook’ file for each active trading day of a selected ticker. The ‘orderbook’ file contains the evolution of the limit order book up to the requested levels. The ‘message’ file contains indicators for the type of event causing an update of the limit order book in the requested price range. All events are time-stamped in seconds after midnight, with decimal precision of at least milliseconds and up to nanoseconds depending on the requested period.

The ‘message’ and ‘order book’ files are provided in the .csv format and can easily be read with any statistical software package.

Exercises:

  • Familiarize yourself with the output documentation of Lobster files on their homepage. Download the file SAP_2022-01-01_2022-01-31_5.7z from Absalon and read them using read_csv. Lobster files always have the same naming convention, ticker_date_34200000_57600000_filetype_level.csv, where filetype denotes message or the corresponding order book snapshots. Take notes on what you think are essential cleaning steps to get a helpful dataset comprising order book and trading information.

  • Merge the two files after you apply some intuitive naming convention such that you can work with one file which traces all information about messages and the corresponding order book snapshots. Try to create a function process_orderbook, which performs essential cleaning of the data by making sure you filter out trading halts and aggregate order execution messages which may have triggered multiple limit orders. More specifically, the function should complete the following generic steps:

    1. Identify potential trading stops (consult the Lobster documentation for this) and discard all observations recorded during the window when trading was interrupted.
    2. Sometimes, messages related to the daily opening and closing auction are included in the Lobster file (especially during days when trading stops earlier, e.g., before holidays). Remove such observations (opening auctions are identified as messages with type == 6 and ID == -1, and closing auctions can be identified as messages with type == 6 and ID == -2).
    3. Replace “empty” slots in the order book with NAs in case there was no liquidity quoted at higher order book levels. In that case, Lobster reports negative bid and large ask prices (above 199999).
    4. Remove observations with seemingly crossed prices (when the best ask is below the best bid).
    5. Market orders are sometimes executed against multiple limit orders. In that case, Lobster records multiple transactions with an identical time stamp. Aggregate these transactions and generate the corresponding trade size and the (size-weighted) execution price. Also, retain the order book snapshot after the transaction has been executed.
    6. The direction of a trade is hard to evaluate (not observable) if the transaction has been executed against a hidden order (type == 5). Create a proxy for the direction based on the executed price relative to the last observed order book snapshot.

You can use this function when you work with raw order book data from Lobster to produce a clean, analysis-ready order book. In the solution part, I provide some further explanations of my implementation. - Use the function to clean the SAP Lobster output. - Summarize the order book by extracting the average midquote \(q_t = (a_t + b_t)/2\) (where \(a_t\) and \(b_t\) denote the best bid and best ask), the average spread \(S_t= (a_t - b_t)/q_t\) (in basis points relative to the concurrent midquote), the aggregate volume in USD and the quoted depth (in million USD 5 basis points from the concurrent best price on each side of the order book) for each 5-minute interval.
- Provide a meaningful illustration of the order book dynamics during the particular trading day. This is intentionally a very open question: There is so much information within an order book that it is impossible to highlight everything worth considering. I encourage you to post your visualization on the discussion board in Absalon.

Solutions:

library(dplyr)
Warning: package 'dplyr' was built under R version 4.5.3

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyr)
library(archive)
library(readr)
library(ggplot2)
library(gt)
library(lubridate)

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
library(forcats)

I use the following code to read in (general) Lobster files - change ticker, date, and level in the code snippet below.

files_available <- archive::archive("SAP_2022-01-01_2022-01-31_5.7z") |>
  mutate(
    ticker = gsub("(.*)_(.*)_342.*_(.*).csv", "\\1", path),
    date = gsub("(.*)_(.*)_342.*_(.*).csv", "\\2", path),
    level = gsub("(.*)_(.*)_342.*_(.*).csv", "\\3", path)
  )

files_available
# A tibble: 40 × 5
   path                                                 size date   ticker level
   <chr>                                               <int> <chr>  <chr>  <chr>
 1 SAP_2022-01-28_34200000_57600000_message_5.csv          0 2022-… SAP    5    
 2 SAP_2022-01-28_34200000_57600000_orderbook_5.csv        0 2022-… SAP    5    
 3 SAP_2022-01-03_34200000_57600000_message_5.csv   14902108 2022-… SAP    5    
 4 SAP_2022-01-03_34200000_57600000_orderbook_5.csv 38240040 2022-… SAP    5    
 5 SAP_2022-01-04_34200000_57600000_message_5.csv   13248413 2022-… SAP    5    
 6 SAP_2022-01-04_34200000_57600000_orderbook_5.csv 33881101 2022-… SAP    5    
 7 SAP_2022-01-05_34200000_57600000_message_5.csv   12234669 2022-… SAP    5    
 8 SAP_2022-01-05_34200000_57600000_orderbook_5.csv 31189574 2022-… SAP    5    
 9 SAP_2022-01-06_34200000_57600000_message_5.csv   17311993 2022-… SAP    5    
10 SAP_2022-01-06_34200000_57600000_orderbook_5.csv 44240752 2022-… SAP    5    
# ℹ 30 more rows

Lobster files always follow the same naming convention; thus, the code above can be recycled in a more general fashion for other Lobster files. Next, we read in data for one(!) of the 20 trading days.

messages_raw <- read_csv(
  archive::archive_read("SAP_2022-01-01_2022-01-31_5.7z", 3),
  col_names = c(
    "ts",
    "type",
    "order_id",
    "m_size",
    "m_price",
    "direction",
    "null"
  ),
  col_type = cols(null = col_skip())
) |>
  mutate(
    ts = as.POSIXct(
      ts,
      origin = as.Date("2022-01-07"),
      tz = "GMT",
      format = "%Y-%m-%d %H:%M:%OS6"
    ),
    m_price = m_price / 10000
  )

messages_raw |>
  filter(type == 4 | type == 5) |>
  ggplot(aes(x = ts, y = m_price)) +
  geom_step() +
  geom_point(data = messages_raw |> tail(1), color = "red", size = 3) +
  labs(x = "", y = "USD", title = "Traded prices: SAP, January 3rd, 2022")
Figure 1: Traded prices for SAP on January 3rd, 2022, based on execution messages (types 4 and 5).

Note my naming convention for column names. The command col_type makes sure that there is easy processing afterward. By default, ts denotes the time in seconds since midnight (decimals are precise until the nanosecond level), and price always comes in 10.000 USD. type denotes the message type: 4, for instance, corresponds to the execution of a visible order. The remaining variables are explained in more detail here.

Next, the corresponding orderbook snapshots contain price and quoted size for each of the 5 levels.

Each message is associated with the corresponding order book snapshot. After merging the message and orderbook files, the entire data thus looks as follows.

The code below is rather flexible to handle Lobster files with more or less levels. Each column contains the corresponding prices and sizes on each side of the order book. More information is available on the documentation of Lobster. The output contains one file with all information available in the order book.

level <- 5
orderbook_raw <- read_csv(
  archive::archive_read("SAP_2022-01-01_2022-01-31_5.7z", 4),
  col_names = paste(
    rep(c("ask_price", "ask_size", "bid_price", "bid_size"), level),
    rep(1:level, each = 4),
    sep = "_"
  ),
  cols(.default = col_double())
) |>
  mutate_at(vars(contains("price")), ~ . / 10000)

orderbook <- bind_cols(messages_raw, orderbook_raw)
orderbook
# A tibble: 316,379 × 26
   ts                   type order_id m_size m_price direction ask_price_1
   <dttm>              <dbl>    <dbl>  <dbl>   <dbl>     <dbl>       <dbl>
 1 2022-01-07 09:30:00     3 34351556    500    140.        -1        140.
 2 2022-01-07 09:30:00     1 34456580    177    140.         1        140.
 3 2022-01-07 09:30:00     3 34390184    100    140.        -1        140.
 4 2022-01-07 09:30:00     1 34457908    160    140.        -1        140.
 5 2022-01-07 09:30:00     3 34302936    100    140.         1        140.
 6 2022-01-07 09:30:00     1 34458304    192    140.        -1        140.
 7 2022-01-07 09:30:00     3 34457908    160    140.        -1        140.
 8 2022-01-07 09:30:00     1 34463304    100    140.        -1        140.
 9 2022-01-07 09:30:00     2 34458304     57    140.        -1        140.
10 2022-01-07 09:30:00     2 34458304     29    140.        -1        140.
# ℹ 316,369 more rows
# ℹ 19 more variables: ask_size_1 <dbl>, bid_price_1 <dbl>, bid_size_1 <dbl>,
#   ask_price_2 <dbl>, ask_size_2 <dbl>, bid_price_2 <dbl>, bid_size_2 <dbl>,
#   ask_price_3 <dbl>, ask_size_3 <dbl>, bid_price_3 <dbl>, bid_size_3 <dbl>,
#   ask_price_4 <dbl>, ask_size_4 <dbl>, bid_price_4 <dbl>, bid_size_4 <dbl>,
#   ask_price_5 <dbl>, ask_size_5 <dbl>, bid_price_5 <dbl>, bid_size_5 <dbl>

The order book contains 316379 observations. While the columns ts to direction help understand the specific messages, the remaining columns contain a snapshot of the order book at that particular time.

Next, Lobster sometimes contains opening and closing auctions (in particular on days when trading hours deviate from the regular times). If you follow the documentation on Lobster, you will see that these times can be identified as I do in the code below.

opening_auction <- orderbook |>
  filter((type == 6 & order_id == -1) | row_number() == 1) |>
  filter(row_number() == n()) |>
  pull(ts)
closing_auction <- orderbook |>
  filter((type == 6 & order_id == -2) | row_number() == n()) |>
  filter(row_number() == 1) |>
  pull(ts)

c(opening_auction, closing_auction)
[1] "2022-01-07 09:30:02 GMT" "2022-01-07 15:59:59 GMT"

In line with the exercise, the function process_orderbook() performs several essential steps. You can call it from within R with the following command.

process_orderbook <- function(orderbook) {
  # Did a trading halt happen? ----
  halt_index <- orderbook |>
    filter(
      type == 7 &
        direction == -1 &
        m_price == -1 / 10000 |
        type == 7 & direction == -1 & m_price == 1 / 10000
    )

  while (nrow(halt_index) > 1) {
    # Filter out messages that occurred in between trading halts
    cat("Trading halt detected")
    orderbook <- orderbook |>
      filter(ts < halt_index$ts[1] | ts > halt_index$ts[2])
    halt_index <- halt_index |> filter(row_number() > 2)
  }

  # Opening and closing auction ----
  # Discard everything before type 6 & ID -1 and everything after type 6 & ID -2

  opening_auction <- orderbook |>
    filter(
      type == 6,
      order_id == -1
    ) |>
    pull(ts)

  closing_auction <- orderbook |>
    filter(
      type == 6,
      order_id == -2
    ) |>
    pull(ts)

  if (length(opening_auction) != 1) {
    opening_auction <- orderbook |>
      select(ts) |>
      head(1) |>
      pull(ts) -
      lubridate::seconds(0.1)
  }
  if (length(closing_auction) != 1) {
    closing_auction <- orderbook |>
      select(ts) |>
      tail(1) |>
      pull(ts) +
      lubridate::seconds(0.1)
  }

  orderbook <- orderbook |> filter(ts > opening_auction & ts < closing_auction)
  orderbook <- orderbook |> filter(type != 6 & type != 7)

  # Replace "empty" slots in the order book (0 volume) with NA prices ----
  orderbook <- orderbook |>
    mutate(
      across(contains("bid_price"), ~ replace(., . < 0, NA)),
      across(contains("ask_price"), ~ replace(., . >= 999999, NA))
    )

  # Remove crossed order book observations ----
  orderbook <- orderbook |>
    filter(ask_price_1 > bid_price_1)

  # Merge transactions with unique timestamp ----
  trades <- orderbook |>
    filter(type == 4 | type == 5) |>
    select(ts:direction)

  trades <- inner_join(
    trades,
    orderbook |>
      group_by(ts) |>
      filter(row_number() == 1) |>
      ungroup() |>
      transmute(
        ts,
        ask_price_1,
        bid_price_1,
        midquote = ask_price_1 / 2 + bid_price_1 / 2,
        lag_midquote = lag(midquote)
      ),
    by = "ts"
  )

  trades <- trades |>
    mutate(
      direction = case_when(
        type == 5 & m_price < lag_midquote ~ 1, # lobster convention: direction = 1 if executed against a limit buy order
        type == 5 & m_price > lag_midquote ~ -1,
        type == 4 ~ as.double(direction),
        TRUE ~ as.double(NA)
      )
    )

  # Aggregate transactions with size and volume-weighted price
  trade_aggregated <- trades |>
    group_by(ts) |>
    summarise(
      type = last(type),
      order_id = NA,
      m_price = sum(m_price * m_size) / sum(m_size),
      m_size = sum(m_size),
      direction = last(direction)
    )

  # Merge trades with the last observed order book snapshot
  trade_aggregated <- inner_join(
    trade_aggregated,
    orderbook |>
      select(ts, ask_price_1:last_col()) |>
      group_by(ts) |>
      filter(row_number() == n()),
    by = "ts"
  )

  orderbook <- orderbook |>
    filter(
      type != 4,
      type != 5
    ) |>
    bind_rows(trade_aggregated) |>
    arrange(ts) |>
    mutate(direction = if_else(type == 4 | type == 5, direction, as.double(NA)))

  return(orderbook)
}

You can process the raw data from before by simply calling.

orderbook <- process_orderbook(orderbook)

The function compute_depth below computes the number of traded shares within bp basis points of the associated best price, which can be used to measure liquidity.

compute_depth <- function(orderbook, side = "bid", bp = 0) {
  # Computes depth (in contracts) based on order book snapshots
  if (side == "bid") {
    value_bid <- (1 - bp / 10000) * orderbook |> select("bid_price_1")
    index_bid <- orderbook |>
      select(contains("bid_price")) |>
      mutate_all(function(x) {
        if_else(is.na(x), FALSE, x >= value_bid)
      })

    sum_vector <- (orderbook |> select(contains("bid_size")) * index_bid) |>
      rowSums(na.rm = TRUE)
  } else {
    value_ask <- (1 + bp / 10000) * orderbook |> select("ask_price_1")
    index_ask <- orderbook |>
      select(contains("ask_price")) |>
      mutate_all(function(x) {
        if_else(is.na(x), FALSE, x <= value_ask)
      })
    sum_vector <- (orderbook |> select(contains("ask_size")) * index_ask) |>
      rowSums(na.rm = TRUE)
  }
  return(sum_vector)
}

Next, I compute the midquote, bid-ask spread, aggregate volume, and depth (the amount of trade-able units in the order book):

  • Midquote \(q_t = (a_t + b_t)/2\) (where \(a_t\) and \(b_t\) denote the best bid and best ask)
  • Spread \(S_t= (a_t - b_t)\) (values below are computed in basis points relative to the concurrent midquote)
  • Volume is the aggregate sum of traded units of the stock. I do not differentiate between hidden (type==5) and visible volume.
  • Depth at the best level.

I aggregate the information into buckets of 5 minutes, using time-weighting for the order book variables.

orderbook_summaries <- orderbook |>
  transmute(
    ts,
    midquote = ask_price_1 / 2 + bid_price_1 / 2,
    trading_volume = if_else(
      type == 4 | type == 5,
      m_price * m_size,
      as.double(NA)
    ),
    depth0_bid = bid_size_1,
    depth0_ask = ask_size_1,
    spread = 10000 * (ask_price_1 - bid_price_1) / midquote
  ) |>
  mutate(
    ts_latency = as.numeric(lead(ts)) - as.numeric(ts), # time between messages
    ts_latency = if_else(is.na(ts_latency), 0, ts_latency), # 0 for first message
    ts_minute = lubridate::floor_date(ts, "5 minutes")
  ) |> # create 5 minute buckets (e.g. messages from 10:30:00 - 10:34:59.xx belong into 10:30:00)
  group_by(ts_minute) |>
  summarise(
    midquote = last(midquote),
    n_trades = sum(!is.na(trading_volume)),
    n = n(), # number of messages
    trading_volume = sum(trading_volume, na.rm = TRUE),
    trading_volume = if_else(
      is.na(trading_volume),
      as.double(0),
      trading_volume
    ),
    depth0_bid = weighted.mean(depth0_bid, ts_latency),
    depth0_ask = weighted.mean(depth0_ask, ts_latency),
    spread = weighted.mean(spread, ts_latency)
  )

orderbook_summaries |>
  gt() |>
  cols_label(
    ts_minute = "Time",
    midquote = "Midquote (USD)",
    n_trades = "Trades",
    n = "Messages",
    trading_volume = "Volume (USD)",
    depth0_bid = "Bid depth",
    depth0_ask = "Ask depth",
    spread = "Spread (bps)"
  ) |>
  tab_spanner(
    label = "Order book",
    columns = c(midquote, spread, depth0_bid, depth0_ask)
  ) |>
  tab_spanner(
    label = "Trading activity",
    columns = c(n_trades, n, trading_volume)
  ) |>
  fmt_time(columns = ts_minute, time_style = "hm") |>
  fmt_currency(columns = midquote, currency = "USD", decimals = 2) |>
  fmt_currency(columns = trading_volume, currency = "USD", decimals = 0) |>
  fmt_number(columns = c(depth0_bid, depth0_ask), decimals = 0) |>
  fmt_number(columns = spread, decimals = 2) |>
  fmt_integer(columns = c(n_trades, n))
Five-minute summaries of midquote, spread (bps), trade count, volume, and quoted depth for SAP, January 2022.
Time
Order book
Trading activity
Midquote (USD) Spread (bps) Bid depth Ask depth Trades Messages Volume (USD)
9:30 AM $140.38 3.62 485 534 61 12,926 $1,313,466
9:35 AM $140.24 4.33 669 584 23 14,844 $602,060
9:40 AM $140.15 4.31 761 427 24 11,150 $299,088
9:45 AM $139.49 4.54 777 557 43 12,891 $711,751
9:50 AM $139.26 4.76 713 720 45 20,975 $643,746
9:55 AM $139.31 4.79 748 628 46 17,916 $666,569
10:00 AM $139.81 5.05 749 671 37 17,288 $640,948
10:05 AM $139.81 4.93 835 520 42 14,336 $517,071
10:10 AM $140.16 5.33 805 796 32 11,862 $609,344
10:15 AM $140.17 4.59 493 633 24 11,218 $218,798
10:20 AM $140.25 4.31 567 652 69 10,213 $1,392,951
10:25 AM $140.18 4.54 470 805 17 12,386 $193,138
10:30 AM $140.19 4.43 447 715 27 12,563 $443,163
10:35 AM $140.29 3.61 435 426 44 10,616 $408,664
10:40 AM $140.49 4.04 554 548 91 10,862 $1,516,834
10:45 AM $140.43 4.51 467 758 28 12,057 $249,738
10:50 AM $140.24 4.32 407 746 36 11,627 $292,742
10:55 AM $140.21 3.61 299 787 19 8,978 $363,035
11:00 AM $140.18 2.79 178 950 45 8,117 $852,378
11:05 AM $140.19 4.07 455 821 29 8,625 $788,170
11:10 AM $140.23 3.13 366 513 57 9,471 $1,794,017
11:15 AM $140.31 2.79 278 470 36 9,019 $547,094
11:20 AM $140.37 3.72 528 594 40 7,751 $742,033
11:25 AM $140.57 4.10 526 533 58 12,209 $1,843,627
11:30 AM $140.39 6.24 219 522 47 2,706 $1,043,222
11:35 AM $140.37 8.53 240 203 41 501 $926,211
11:40 AM $140.37 8.63 173 166 16 207 $82,531
11:45 AM $140.31 5.78 136 188 19 212 $201,726
11:50 AM $140.14 6.04 82 56 10 405 $111,357
11:55 AM $140.25 6.20 154 170 12 398 $142,420
12:00 PM $140.22 6.09 138 119 10 234 $143,158
12:05 PM $140.23 6.90 142 46 14 188 $171,964
12:10 PM $140.13 5.57 188 72 14 157 $129,363
12:15 PM $140.08 3.55 109 68 17 192 $191,348
12:20 PM $140.17 3.94 120 44 9 143 $63,614
12:25 PM $140.10 4.88 134 121 21 210 $173,409
12:30 PM $140.00 6.99 138 165 26 286 $407,699
12:35 PM $140.00 5.66 50 121 32 300 $586,622
12:40 PM $139.94 5.94 229 76 8 198 $82,985
12:45 PM $139.99 5.80 210 79 10 188 $27,437
12:50 PM $139.94 4.92 288 103 13 255 $122,189
12:55 PM $139.99 4.16 171 113 6 257 $59,784
1:00 PM $139.95 4.83 255 167 13 240 $168,846
1:05 PM $139.91 6.02 152 142 2 137 $6,995
1:10 PM $140.07 7.01 192 122 3 310 $19,176
1:15 PM $140.06 6.59 162 221 5 139 $21,009
1:20 PM $140.17 3.50 213 26 10 189 $84,217
1:25 PM $140.21 2.68 250 156 9 147 $79,495
1:30 PM $140.28 2.34 237 100 12 196 $112,770
1:35 PM $140.39 3.00 172 120 26 340 $322,772
1:40 PM $140.63 6.59 274 88 18 483 $121,353
1:45 PM $140.62 4.52 106 126 17 247 $172,959
1:50 PM $140.68 5.11 251 71 7 178 $40,787
1:55 PM $140.72 4.62 168 153 17 270 $193,457
2:00 PM $140.75 4.61 285 150 10 127 $72,340
2:05 PM $140.97 6.84 398 196 16 348 $154,616
2:10 PM $141.11 4.68 278 110 32 285 $362,107
2:15 PM $141.18 4.06 191 146 14 283 $176,133
2:20 PM $141.12 2.63 129 125 25 375 $242,762
2:25 PM $141.13 4.58 139 61 17 204 $196,725
2:30 PM $141.03 3.71 183 74 12 201 $77,879
2:35 PM $141.01 4.19 170 46 30 406 $330,383
2:40 PM $141.19 8.42 163 88 11 247 $142,360
2:45 PM $141.20 4.63 391 48 17 240 $155,433
2:50 PM $141.24 5.42 297 86 27 358 $381,499
2:55 PM $141.12 3.00 87 76 23 492 $263,047
3:00 PM $141.14 5.22 145 147 19 577 $123,330
3:05 PM $141.12 3.25 173 124 14 285 $199,390
3:10 PM $141.20 2.59 303 19 23 430 $332,351
3:15 PM $141.29 2.48 218 137 41 600 $451,828
3:20 PM $141.29 2.08 181 174 56 517 $806,482
3:25 PM $141.33 1.74 287 111 39 582 $559,393
3:30 PM $141.44 2.30 198 117 64 1,028 $739,473
3:35 PM $141.44 2.13 263 111 65 864 $886,603
3:40 PM $141.29 2.51 206 103 57 1,252 $742,247
3:45 PM $141.56 2.84 355 165 35 1,065 $242,671
3:50 PM $141.29 2.50 193 151 154 2,458 $2,367,376
3:55 PM $141.43 2.00 246 161 140 2,142 $1,857,272

Finally, some visualization of the data at hand: The code below shows the dynamics of the traded prices (red points) and the quoted prices at the higher order book levels.

orderbook_trades <- orderbook |>
  filter(type == 4 | type == 5) |>
  select(ts, m_price)

orderbook_quotes <- orderbook |>
  mutate(ts_new = floor_date(ts, "minute")) |>
  group_by(ts_new) |>
  summarise_all(dplyr::last) |>
  select(-ts_new) |>
  mutate(id = row_number()) |>
  select(ts, id, matches("bid|ask")) |>
  gather(level, price, -ts, -id) |>
  separate(level, into = c("side", "variable", "level"), sep = "_") |>
  mutate(level = as.numeric(level)) |>
  pivot_wider(names_from = variable, values_from = price)

ggplot() +
  theme_bw() +
  geom_point(
    data = orderbook_quotes,
    aes(x = ts, y = price, color = as_factor(level), size = size / max(size)),
    alpha = 0.1
  ) +
  geom_point(data = orderbook_trades, aes(x = ts, y = m_price), color = 'red') +
  labs(title = "Orderbook Dynamics", y = "Price", x = "") +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    legend.position = "none"
  ) +
  scale_y_continuous()
Figure 2: Order book dynamics for SAP: quoted prices at each level (shaded by relative depth) with executed trades overlaid in red.
import py7zr
import pandas as pd
import numpy as np
from plotnine import (
    ggplot, aes, geom_step, geom_point, labs,
    theme_bw, theme, element_blank,
)
from great_tables import GT

I use the following code to read in (general) Lobster files - change ticker, date, and level in the code snippet below.

archive_path = "SAP_2022-01-01_2022-01-31_5.7z"

with py7zr.SevenZipFile(archive_path, mode="r") as archive:
    file_info = archive.list()

files_available = pd.DataFrame(
    {"path": [f.filename for f in file_info], "size": [f.uncompressed for f in file_info]}
)
files_available[["ticker", "date", "level"]] = (
    files_available["path"].str.extract(r"(.*)_(.*)_342.*_(.*).csv")
)

files_available.head(10)
path size ticker date level
0 SAP_2022-01-28_34200000_57600000_message_5.csv 0 SAP 2022-01-28 5
1 SAP_2022-01-28_34200000_57600000_orderbook_5.csv 0 SAP 2022-01-28 5
2 SAP_2022-01-03_34200000_57600000_message_5.csv 14902108 SAP 2022-01-03 5
3 SAP_2022-01-03_34200000_57600000_orderbook_5.csv 38240040 SAP 2022-01-03 5
4 SAP_2022-01-04_34200000_57600000_message_5.csv 13248413 SAP 2022-01-04 5
5 SAP_2022-01-04_34200000_57600000_orderbook_5.csv 33881101 SAP 2022-01-04 5
6 SAP_2022-01-05_34200000_57600000_message_5.csv 12234669 SAP 2022-01-05 5
7 SAP_2022-01-05_34200000_57600000_orderbook_5.csv 31189574 SAP 2022-01-05 5
8 SAP_2022-01-06_34200000_57600000_message_5.csv 17311993 SAP 2022-01-06 5
9 SAP_2022-01-06_34200000_57600000_orderbook_5.csv 44240752 SAP 2022-01-06 5

Lobster files always follow the same naming convention; thus, the code above can be recycled in a more general fashion for other Lobster files. The helper below extracts a single file from the archive into memory, mirroring archive::archive_read() in R.

def read_archive_csv(path, target_index, names=None, usecols=None):
    with py7zr.SevenZipFile(path, mode="r") as archive:
        file_info = archive.list()
        target_info = file_info[target_index]
        target = target_info.filename
        target_size = target_info.uncompressed
        if target_size is None or target_size < 0:
            raise ValueError(f"Could not determine uncompressed size for archive member: {target}")
        factory = py7zr.io.BytesIOFactory(max(target_size, 1))
        archive.extract(targets=[target], factory=factory)
    buf = factory.products[target]
    buf.seek(0)
    return pd.read_csv(buf, header=None, names=names, usecols=usecols)

Next, we read in data for one(!) of the trading days available in the archive.

messages_raw = read_archive_csv(
    archive_path,
    target_index=2,
    names=["ts", "type", "order_id", "m_size", "m_price", "direction", "null"],
    usecols=range(6),
)
messages_raw["ts"] = (
    pd.Timestamp("2022-01-03", tz="UTC")
    + pd.to_timedelta(messages_raw["ts"], unit="s")
)
messages_raw["m_price"] = messages_raw["m_price"] / 10000
trades = messages_raw[messages_raw["type"].isin([4, 5])]

(
    ggplot(
        trades,
        aes(x="ts", y="m_price"),
    )
    + geom_step()
    + geom_point(data=trades.tail(1), color="red", size=3)
    + labs(x="", y="USD", title="Traded prices: SAP, January 3rd, 2022")
)
<plotnine.ggplot.ggplot object at 0x00000233BAD70E50>

Note my naming convention for column names. The argument usecols=range(6) makes sure we drop the trailing null column for easy processing afterward. By default, ts denotes the time in seconds since midnight (decimals are precise until the nanosecond level), and price always comes in 10.000 USD. type denotes the message type: 4, for instance, corresponds to the execution of a visible order. The remaining variables are explained in more detail here.

Next, the corresponding orderbook snapshots contain price and quoted size for each of the 5 levels.

The code below is rather flexible to handle Lobster files with more or less levels. Each column contains the corresponding prices and sizes on each side of the order book. After merging the message and orderbook files, the entire data thus looks as follows.

level = 5
ob_cols = [
    f"{var}_{lvl}"
    for lvl in range(1, level + 1)
    for var in ("ask_price", "ask_size", "bid_price", "bid_size")
]

orderbook_raw = read_archive_csv(archive_path, target_index=3, names=ob_cols)
price_cols = [c for c in orderbook_raw.columns if "price" in c]
orderbook_raw[price_cols] = orderbook_raw[price_cols] / 10000

orderbook = pd.concat(
    [messages_raw.reset_index(drop=True), orderbook_raw.reset_index(drop=True)],
    axis=1,
)
pd.set_option("display.max_columns", 7)
orderbook.head(10)
ts type order_id ... ask_size_5 bid_price_5 bid_size_5
0 2022-01-03 09:30:00.041693104+00:00 3 34351556 ... 696 140.23 800
1 2022-01-03 09:30:00.042287443+00:00 1 34456580 ... 696 140.23 800
2 2022-01-03 09:30:00.042327460+00:00 3 34390184 ... 764 140.23 800
3 2022-01-03 09:30:00.043255366+00:00 1 34457908 ... 764 140.23 800
4 2022-01-03 09:30:00.043453093+00:00 3 34302936 ... 764 140.23 800
5 2022-01-03 09:30:00.043699050+00:00 1 34458304 ... 764 140.23 800
6 2022-01-03 09:30:00.049983479+00:00 3 34457908 ... 764 140.23 800
7 2022-01-03 09:30:00.050566166+00:00 1 34463304 ... 696 140.23 800
8 2022-01-03 09:30:00.050704660+00:00 2 34458304 ... 696 140.23 800
9 2022-01-03 09:30:00.051073967+00:00 2 34458304 ... 696 140.23 800

10 rows × 26 columns

The order book contains as many observations as the message file. While the columns ts to direction help understand the specific messages, the remaining columns contain a snapshot of the order book at that particular time.

Next, Lobster sometimes contains opening and closing auctions (in particular on days when trading hours deviate from the regular times). If you follow the documentation on Lobster, you will see that these times can be identified as I do in the code below.

opening_auction = orderbook.loc[
    (orderbook["type"] == 6) & (orderbook["order_id"] == -1), "ts"
]
if opening_auction.empty:
    opening_auction = orderbook["ts"].head(1)

closing_auction = orderbook.loc[
    (orderbook["type"] == 6) & (orderbook["order_id"] == -2), "ts"
]
if closing_auction.empty:
    closing_auction = orderbook["ts"].tail(1)

pd.concat([opening_auction, closing_auction]).tolist()
[Timestamp('2022-01-03 09:30:02.221711541+0000', tz='UTC'), Timestamp('2022-01-03 15:59:59.000480373+0000', tz='UTC')]

In line with the exercise, the function process_orderbook() performs several essential steps. You can call it from within Python with the following command.

def process_orderbook(orderbook):
    orderbook = orderbook.copy()

    # Did a trading halt happen? ----
    halt_mask = (
        (orderbook["type"] == 7)
        & (orderbook["direction"] == -1)
        & (orderbook["m_price"].isin([-1 / 10000, 1 / 10000]))
    )
    halt_ts = orderbook.loc[halt_mask, "ts"].reset_index(drop=True)

    while len(halt_ts) > 1:
        # Filter out messages that occurred in between trading halts
        print("Trading halt detected")
        orderbook = orderbook[
            (orderbook["ts"] < halt_ts.iloc[0]) | (orderbook["ts"] > halt_ts.iloc[1])
        ]
        halt_ts = halt_ts.iloc[2:].reset_index(drop=True)

    # Opening and closing auction ----
    # Discard everything before type 6 & ID -1 and everything after type 6 & ID -2
    opening = orderbook.loc[
        (orderbook["type"] == 6) & (orderbook["order_id"] == -1), "ts"
    ]
    closing = orderbook.loc[
        (orderbook["type"] == 6) & (orderbook["order_id"] == -2), "ts"
    ]
    open_ts = (
        opening.iloc[0]
        if len(opening) == 1
        else orderbook["ts"].iloc[0] - pd.Timedelta("0.1s")
    )
    close_ts = (
        closing.iloc[0]
        if len(closing) == 1
        else orderbook["ts"].iloc[-1] + pd.Timedelta("0.1s")
    )
    orderbook = orderbook[(orderbook["ts"] > open_ts) & (orderbook["ts"] < close_ts)]
    orderbook = orderbook[~orderbook["type"].isin([6, 7])]

    # Replace "empty" slots in the order book (0 volume) with NA prices ----
    bid_price_cols = [c for c in orderbook.columns if "bid_price" in c]
    ask_price_cols = [c for c in orderbook.columns if "ask_price" in c]
    orderbook[bid_price_cols] = orderbook[bid_price_cols].mask(orderbook[bid_price_cols] < 0)
    orderbook[ask_price_cols] = orderbook[ask_price_cols].mask(orderbook[ask_price_cols] >= 999999)

    # Remove crossed order book observations ----
    orderbook = orderbook[orderbook["ask_price_1"] > orderbook["bid_price_1"]]

    # Merge transactions with unique timestamp ----
    trades = orderbook[orderbook["type"].isin([4, 5])][
        ["ts", "type", "order_id", "m_size", "m_price", "direction"]
    ].copy()

    snapshots = (
        orderbook.drop_duplicates(subset="ts", keep="first")
        .assign(midquote=lambda d: (d["ask_price_1"] + d["bid_price_1"]) / 2)
        [["ts", "ask_price_1", "bid_price_1", "midquote"]]
        .sort_values("ts")
        .reset_index(drop=True)
    )
    snapshots["lag_midquote"] = snapshots["midquote"].shift(1)

    trades = trades.merge(snapshots, on="ts", how="inner")

    # lobster convention: direction = 1 if executed against a limit buy order
    trades["direction"] = np.where(
        (trades["type"] == 5) & (trades["m_price"] < trades["lag_midquote"]),
        1.0,
        np.where(
            (trades["type"] == 5) & (trades["m_price"] > trades["lag_midquote"]),
            -1.0,
            np.where(trades["type"] == 4, trades["direction"].astype(float), np.nan),
        ),
    )

    # Aggregate transactions with size and volume-weighted price
    trade_aggregated = trades.groupby("ts", as_index=False).apply(
        lambda g: pd.Series(
            {
                "type": g["type"].iloc[-1],
                "order_id": np.nan,
                "m_price": np.sum(g["m_price"] * g["m_size"]) / g["m_size"].sum(),
                "m_size": g["m_size"].sum(),
                "direction": g["direction"].iloc[-1],
            }
        ),
        include_groups=False,
    )

    # Merge trades with the last observed order book snapshot
    snapshot_cols = ["ts"] + [
        c for c in orderbook.columns if c.startswith(("ask_", "bid_"))
    ]
    last_snapshot = (
        orderbook[snapshot_cols]
        .drop_duplicates(subset="ts", keep="last")
        .reset_index(drop=True)
    )
    trade_aggregated = trade_aggregated.merge(last_snapshot, on="ts", how="inner")

    non_trades = orderbook[~orderbook["type"].isin([4, 5])]
    orderbook = (
        pd.concat([non_trades, trade_aggregated], ignore_index=True)
        .sort_values("ts")
        .reset_index(drop=True)
    )
    orderbook["direction"] = np.where(
        orderbook["type"].isin([4, 5]), orderbook["direction"], np.nan
    )
    return orderbook

You can process the raw data from before by simply calling.

orderbook = process_orderbook(orderbook)

The function compute_depth below computes the number of traded shares within bp basis points of the associated best price, which can be used to measure liquidity.

def compute_depth(orderbook, side="bid", bp=0):
    # Computes depth (in contracts) based on order book snapshots
    if side == "bid":
        ref = (1 - bp / 10000) * orderbook["bid_price_1"]
        price_cols = [c for c in orderbook.columns if "bid_price" in c]
        size_cols = [c for c in orderbook.columns if "bid_size" in c]
        idx = orderbook[price_cols].ge(ref, axis=0).fillna(False)
    else:
        ref = (1 + bp / 10000) * orderbook["ask_price_1"]
        price_cols = [c for c in orderbook.columns if "ask_price" in c]
        size_cols = [c for c in orderbook.columns if "ask_size" in c]
        idx = orderbook[price_cols].le(ref, axis=0).fillna(False)

    sizes = orderbook[size_cols].fillna(0).values
    return (sizes * idx.values).sum(axis=1)

Next, I compute the midquote, bid-ask spread, aggregate notional volume, and depth (the amount of trade-able units in the order book):

  • Midquote \(q_t = (a_t + b_t)/2\) (where \(a_t\) and \(b_t\) denote the best bid and best ask)
  • Spread \(S_t= (a_t - b_t)\) (values below are computed in basis points relative to the concurrent midquote)
  • Notional volume is the aggregate traded dollar value of the stock, computed as trade price times trade size. I do not differentiate between hidden (type==5) and visible trade volume.
  • Depth at the best level.

I aggregate the information into buckets of 5 minutes, using time-weighting for the order book variables.

summary = pd.DataFrame(
    {
        "ts": orderbook["ts"].values,
        "midquote": (orderbook["ask_price_1"] + orderbook["bid_price_1"]) / 2,
        "trading_volume": np.where(
            orderbook["type"].isin([4, 5]),
            orderbook["m_price"] * orderbook["m_size"],
            np.nan,
        ),
        "depth0_bid": orderbook["bid_size_1"].values,
        "depth0_ask": orderbook["ask_size_1"].values,
    }
)
summary["spread"] = (
    10000
    * (orderbook["ask_price_1"].values - orderbook["bid_price_1"].values)
    / summary["midquote"]
)
# time between messages (0 for last message)
summary["ts_latency"] = (
    (summary["ts"].shift(-1) - summary["ts"]).dt.total_seconds().fillna(0)
)
# 5-minute buckets (e.g. messages from 10:30:00 - 10:34:59.xx belong into 10:30:00)
summary["ts_minute"] = summary["ts"].dt.floor("5min")


def weighted_mean(values, weights):
    mask = ~values.isna()
    if not mask.any() or weights[mask].sum() == 0:
        return np.nan
    return np.average(values[mask], weights=weights[mask])


orderbook_summaries = summary.groupby("ts_minute", as_index=False).apply(
    lambda g: pd.Series(
        {
            "midquote": g["midquote"].iloc[-1],
            "n_trades": int(g["trading_volume"].notna().sum()),
            "n": len(g),
            "trading_volume": float(g["trading_volume"].sum()),
            "depth0_bid": weighted_mean(g["depth0_bid"], g["ts_latency"]),
            "depth0_ask": weighted_mean(g["depth0_ask"], g["ts_latency"]),
            "spread": weighted_mean(g["spread"], g["ts_latency"]),
        }
    ),
    include_groups=False,
)

orderbook_display = orderbook_summaries.copy()
orderbook_display["ts_minute"] = orderbook_display["ts_minute"].dt.time

(
    GT(orderbook_display)
    .cols_label(
        ts_minute="Time",
        midquote="Midquote (USD)",
        n_trades="Trades",
        n="Messages",
        trading_volume="Volume (USD)",
        depth0_bid="Bid depth",
        depth0_ask="Ask depth",
        spread="Spread (bps)",
    )
    .tab_spanner(
        label="Order book",
        columns=["midquote", "spread", "depth0_bid", "depth0_ask"],
    )
    .tab_spanner(
        label="Trading activity",
        columns=["n_trades", "n", "trading_volume"],
    )
    .fmt_time(columns="ts_minute", time_style="h_m_p")
    .fmt_currency(columns="midquote", currency="USD", decimals=2)
    .fmt_currency(columns="trading_volume", currency="USD", decimals=0)
    .fmt_number(columns=["depth0_bid", "depth0_ask"], decimals=0)
    .fmt_number(columns="spread", decimals=2)
    .fmt_integer(columns=["n_trades", "n"])
)
Table 1: Five-minute summaries of midquote, spread (bps), trade count, notional volume, and quoted depth for SAP, January 2022.
Time Order book Trading activity
Midquote (USD) Spread (bps) Bid depth Ask depth Trades Messages Volume (USD)
9:30 AM $140.38 3.62 482 533 61 12,926 $1,313,466
9:35 AM $140.24 4.33 668 583 23 14,844 $602,060
9:40 AM $140.15 4.32 760 424 24 11,150 $299,088
9:45 AM $139.49 4.54 776 554 43 12,891 $711,751
9:50 AM $139.26 4.76 711 717 45 20,975 $643,746
9:55 AM $139.31 4.79 747 626 46 17,916 $666,569
10:00 AM $139.81 5.05 747 671 37 17,288 $640,948
10:05 AM $139.81 4.93 831 518 42 14,336 $517,071
10:10 AM $140.16 5.33 804 795 32 11,862 $609,344
10:15 AM $140.17 4.59 491 633 24 11,218 $218,798
10:20 AM $140.25 4.31 566 651 69 10,213 $1,392,951
10:25 AM $140.18 4.55 469 804 17 12,386 $193,138
10:30 AM $140.19 4.44 446 714 27 12,563 $443,163
10:35 AM $140.29 3.61 431 426 44 10,616 $408,664
10:40 AM $140.49 4.04 551 547 91 10,862 $1,516,834
10:45 AM $140.43 4.51 466 756 28 12,057 $249,738
10:50 AM $140.24 4.32 405 745 36 11,627 $292,742
10:55 AM $140.21 3.61 299 787 19 8,978 $363,035
11:00 AM $140.18 2.79 178 950 45 8,117 $852,378
11:05 AM $140.19 4.08 448 819 29 8,625 $788,170
11:10 AM $140.23 3.13 363 510 57 9,471 $1,794,017
11:15 AM $140.31 2.79 277 470 36 9,019 $547,094
11:20 AM $140.37 3.72 527 593 40 7,751 $742,033
11:25 AM $140.57 4.10 525 528 58 12,209 $1,843,627
11:30 AM $140.39 6.24 219 522 47 2,706 $1,043,222
11:35 AM $140.37 8.53 240 203 41 501 $926,211
11:40 AM $140.37 8.63 173 166 16 207 $82,531
11:45 AM $140.31 5.78 136 188 19 212 $201,726
11:50 AM $140.14 6.04 82 56 10 405 $111,357
11:55 AM $140.25 6.20 154 170 12 398 $142,420
12:00 PM $140.22 6.09 138 119 10 234 $143,158
12:05 PM $140.23 6.90 142 46 14 188 $171,964
12:10 PM $140.13 5.57 188 72 14 157 $129,363
12:15 PM $140.08 3.55 109 68 17 192 $191,348
12:20 PM $140.17 3.94 120 44 9 143 $63,614
12:25 PM $140.10 4.88 134 121 21 210 $173,409
12:30 PM $140.00 6.99 138 165 26 286 $407,699
12:35 PM $140.00 5.66 50 121 32 300 $586,622
12:40 PM $139.94 5.94 229 76 8 198 $82,985
12:45 PM $139.99 5.80 210 79 10 188 $27,437
12:50 PM $139.94 4.92 288 103 13 255 $122,189
12:55 PM $139.99 4.16 171 113 6 257 $59,784
1:00 PM $139.95 4.83 255 167 13 240 $168,846
1:05 PM $139.91 6.02 152 142 2 137 $6,995
1:10 PM $140.07 7.01 192 122 3 310 $19,176
1:15 PM $140.06 6.59 162 221 5 139 $21,009
1:20 PM $140.17 3.50 213 26 10 189 $84,217
1:25 PM $140.21 2.68 250 156 9 147 $79,495
1:30 PM $140.28 2.34 237 100 12 196 $112,770
1:35 PM $140.39 3.00 172 120 26 340 $322,772
1:40 PM $140.63 6.59 274 88 18 483 $121,353
1:45 PM $140.62 4.52 106 126 17 247 $172,959
1:50 PM $140.68 5.11 251 71 7 178 $40,787
1:55 PM $140.72 4.62 168 153 17 270 $193,457
2:00 PM $140.75 4.61 285 150 10 127 $72,340
2:05 PM $140.97 6.84 398 196 16 348 $154,616
2:10 PM $141.11 4.68 278 110 32 285 $362,107
2:15 PM $141.18 4.06 190 146 14 283 $176,133
2:20 PM $141.12 2.63 129 125 25 375 $242,762
2:25 PM $141.13 4.58 139 61 17 204 $196,725
2:30 PM $141.03 3.71 183 74 12 201 $77,879
2:35 PM $141.01 4.19 170 46 30 406 $330,383
2:40 PM $141.19 8.42 163 88 11 247 $142,360
2:45 PM $141.20 4.63 391 48 17 240 $155,433
2:50 PM $141.24 5.42 297 86 27 358 $381,499
2:55 PM $141.12 3.00 87 76 23 492 $263,047
3:00 PM $141.14 5.22 144 147 19 577 $123,330
3:05 PM $141.12 3.25 173 124 14 285 $199,390
3:10 PM $141.20 2.59 303 19 23 430 $332,351
3:15 PM $141.29 2.48 218 137 41 600 $451,828
3:20 PM $141.29 2.08 181 174 56 517 $806,482
3:25 PM $141.33 1.74 287 111 39 582 $559,393
3:30 PM $141.44 2.30 198 117 64 1,028 $739,473
3:35 PM $141.44 2.13 263 111 65 864 $886,603
3:40 PM $141.29 2.51 206 103 57 1,252 $742,247
3:45 PM $141.56 2.84 355 165 35 1,065 $242,671
3:50 PM $141.29 2.50 193 151 154 2,458 $2,367,376
3:55 PM $141.43 2.00 246 161 140 2,142 $1,857,272

Finally, some visualization of the data at hand: The code below shows the dynamics of the traded prices (red points) and the quoted prices at the higher order book levels.

orderbook_trades = orderbook.loc[orderbook["type"].isin([4, 5]), ["ts", "m_price"]]

ob_minute = (
    orderbook.assign(ts_new=lambda d: d["ts"].dt.floor("min"))
    .groupby("ts_new", as_index=False)
    .last()
    .drop(columns="ts_new")
)
quote_cols = [c for c in ob_minute.columns if c.startswith(("ask_", "bid_"))]
long_df = (
    ob_minute[["ts"] + quote_cols]
    .melt(id_vars="ts", var_name="key", value_name="value")
)
parts = long_df["key"].str.split("_", expand=True)
long_df["side"] = parts[0]
long_df["variable"] = parts[1]
long_df["level"] = parts[2].astype(int)
orderbook_quotes = (
    long_df.pivot_table(
        index=["ts", "side", "level"], columns="variable", values="value"
    )
    .reset_index()
    .dropna(subset=["price"])
)
orderbook_quotes["size_norm"] = orderbook_quotes["size"] / orderbook_quotes["size"].max()
orderbook_quotes["level"] = pd.Categorical(orderbook_quotes["level"])

(
    ggplot()
    + theme_bw()
    + geom_point(
        orderbook_quotes,
        aes(x="ts", y="price", color="level", size="size_norm"),
        alpha=0.1,
    )
    + geom_point(orderbook_trades, aes(x="ts", y="m_price"), color="red")
    + labs(title="Orderbook Dynamics", y="Price", x="")
    + theme(
        panel_grid_major=element_blank(),
        panel_grid_minor=element_blank(),
        legend_position="none",
    )
)
<plotnine.ggplot.ggplot object at 0x00000233BB62F090>

Brownian Motion and realized volatility

In this exercise, you will analyze the empirical properties of the (geometric) Brownian motion based on a simulation study. Such simulation studies can also be helpful for various applications in continuous time finance, e.g., option pricing.

Finance theory suggests that prices must be so-called semimartingales. The most commonly used semimartingale is the Geometric Brownian Motion (GBM), where we assume that the logarithm of the stock price \(S_t\), \(X_t = \log(S_t)\) follows the following dynamic \[ X_t = X_0 + \mu t + \sigma W_t \] where \(\mu\) and \(\sigma\) are constants and \(W_t\) is a Brownian motion.

Exercises:

  • Write a function that simulates (log) stock prices according to the model above. Recall that the Brownian motion has independent increments which are normally distributed. The function should generate a path of the log stock price with a pre-defined number of observations (granularity), drift term, and volatility. Then, plot one sample path of the implied (log) stock price.
  • Use the function to simulate a large number of geometric Brownian motions. Illustrate the quadratic variation of the process.
  • Compute the realized volatility for each simulated process and illustrate the sampling distribution.
  • Finally, consider the infill asymptotics. Show the effect of sampling once per day, once per hour, once per minute, and so on on the variance of the realized volatility.

Solutions:

For this exercise, I use some functionalities from the xts and highfrequency packages, particularly for aggregation at different frequencies.

library(dplyr)
library(tidyr)
library(xts)
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

######################### Warning from 'xts' package ##########################
#                                                                             #
# The dplyr lag() function breaks how base R's lag() function is supposed to  #
# work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
# source() into this session won't work correctly.                            #
#                                                                             #
# Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
# conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
# dplyr from breaking base R's lag() function.                                #
#                                                                             #
# Code in packages is not affected. It's protected by R's namespace mechanism #
# Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
#                                                                             #
###############################################################################

Attaching package: 'xts'
The following objects are masked from 'package:dplyr':

    first, last
library(highfrequency)
Warning: package 'highfrequency' was built under R version 4.5.3
library(purrr)

library(ggplot2)
library(lubridate)

Note: When loading xts and the tidyverse jointly, you may run into serious issues if you do not consider naming conflicts! Especially the convenient functions last and first from dplyr may be affected. For that purpose, I consistently use dplyr::first and dplyr::last in this chapter.

First, the function below is a rather general version of a simulated generalized Brownian motion. Note that there is no immediate need to assign trading date and hour; I implement this because it will provide output closer to actual data, and therefore, parts of my code can be recycled for future projects. The function can be called with a specific seed to replicate the results. The relevant parameters are the initial values start and the geometric Brownian motion mu and sigma parameters. The input steps defines the simulated Brownian motion’s granularity (sampling frequency).

simulate_geom_brownian_motion <- function(
  seed = 3010,
  steps = 1000,
  mu = 0,
  sigma = 1,
  start = 0
) {
  set.seed(seed)
  date <- Sys.Date() + days(seed)
  hour(date) <- 9
  minute(date) <- 30
  date_end <- date
  hour(date_end) <- 16
  minute(date_end) <- 0

  tibble(
    step = 1:(steps + 1),
    time = seq(from = date, to = date_end, length.out = steps + 1),
    price = cumsum(c(
      start,
      rnorm(steps, mean = mu / steps, sd = sigma / sqrt(steps))
    ))
  )
}

Note the last line of the function doing most of the work: We exploit that increments of a Brownian motion are independent and normally distributed. The entire path is nothing but the cumulative sum of the increments.

Next, I simulate the stock price with 1000 intraday observations. Recall that the geometric Brownian motion implies dynamics for the log of the stock price, which is why I adjust the plotted value in the \(y\)-axis.

tab <- simulate_geom_brownian_motion(start = log(18))
tab |>
  ggplot(aes(x = time, y = exp(price))) +
  geom_line() +
  labs(x = "", y = "Stock price")
Figure 3: A single simulated intraday path of the geometric Brownian motion (1000 steps).

Next, we scale the simulation up. The code below simulates 250 paths of the geometric Brownian motion, all with the same starting value. The function illustrates the sampled (log) price paths.

true_sigma <- 2.4
steps <- 1000

simulation <- tibble(iteration = 1:250)
simulation <- simulation |>
  mutate(
    values = map(
      iteration,
      simulate_geom_brownian_motion,
      sigma = true_sigma,
      steps = steps
    )
  ) |>
  unnest(values)

simulation |>
  ggplot(aes(x = step, y = price, group = iteration)) +
  geom_line(alpha = 0.3) +
  labs(x = "Iteration", y = "(log) price")
Figure 4: Two hundred and fifty simulated log price paths of the geometric Brownian motion (\(\sigma = 2.4\), 1000 intraday steps each).

Next, we investigate some of the properties of the geometric Brownian motion. First, by definition, the sample variation of \(X_t = \int\limits_0^t\mu dt + \int\limits_0^t\sigma dW_t\) is \(\int\limits_0^t\sigma^2 dt = \sigma^2t\). The figure below illustrates the sample variation over time and shows that the standard deviation across all simulated (log) prices increases linearly in \(t\) as expected. For the simulation, I used \(\sigma = 2.4\) (true_sigma), which corresponds to the slope of the line (adjusted by the number of observations).

simulation |>
  group_by(step) |>
  summarise(sd = sd(price)) |>
  ggplot(aes(x = step, y = sd^2)) +
  geom_line() +
  geom_abline(aes(slope = true_sigma^2 / steps, intercept = 0)) +
  labs(x = "Iteration", y = "Sample variation")
Figure 5: Sample variation of simulated GBM log prices over time; the straight line shows the theoretical \(\sigma^2 t\).

What about the realized volatility? As defined in the lecture, the realized variance \[RV = \sum\limits_{i=0}^n \Delta X_t^2\] converges to the integrated variance (in the absence of jumps). The bar plot below illustrates the sampling distribution of the realized volatility, which, as expected, centers around true_sigma, the integrated variance of the simulated geometric Brownian motion. You can change the value true_sigma above and analyze the effect.

simulation |>
  group_by(iteration) |>
  summarise(rv = sum(((price - lag(price))^2), na.rm = TRUE)) |>
  ggplot(aes(x = rv)) +
  geom_histogram(bins = 100) +
  labs(x = "Realized Volatility") +
  geom_vline(aes(xintercept = true_sigma^2), color = "red", linetype = "dotted")
Figure 6: Sampling distribution of the realized variance across 250 simulated GBM paths; the dotted vertical line marks the true integrated variance.

Finally, some more analysis regarding the infill asymptotics. As should be clear by now, sampling at the highest frequency possible will ensure that the realized volatility recovers the integrated variance of the (log) price process. In other words, we can measure the variance. The following lines of code illustrate the effect: Whereas we simulated our process at a high frequency, we now simulate sub-samples at different frequencies, ranging from every second to once per day. The figure then illustrates the daily realized volatilities relative to the integrated variance of the process. Not surprisingly, the dispersion around the true parameter decreases substantially with higher sampling frequency.

Note that the code below illustrates how to work with the xts package and switch between the tidyverse and the high-frequency data format requirements. Of course, the procedure could be more convenient, but it is worth exploiting the functionalities of both worlds instead of recording everything in a tidy manner.

import numpy as np
import pandas as pd
from plotnine import *
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
plt.switch_backend("Agg")

First, the function below is a rather general version of a simulated generalized Brownian motion. Note that there is no immediate need to assign trading date and hour; I implement this because it will provide output closer to actual data, and therefore, parts of my code can be recycled for future projects. The function can be called with a specific seed to replicate the results. The relevant parameters are the initial values start and the geometric Brownian motion mu and sigma parameters. The input steps defines the simulated Brownian motion’s granularity (sampling frequency).

def simulate_geom_brownian_motion(seed=3010, steps=1000, mu=0, sigma=1, start=0):
    np.random.seed(seed)

    date = datetime.now().date() + timedelta(days=seed)
    date = datetime.combine(date, datetime.min.time()).replace(hour=9, minute=30)
    date_end = date.replace(hour=16, minute=0)

    times = pd.date_range(start=date, end=date_end, periods=steps + 1)

    increments = np.random.normal(
        loc=mu / steps, scale=sigma / np.sqrt(steps), size=steps
    )
    price = np.cumsum(np.insert(increments, 0, start))

    return pd.DataFrame(
        {"step": np.arange(1, steps + 2), "time": times, "price": price}
    )

Note the last line of the function doing most of the work: We exploit that increments of a Brownian motion are independent and normally distributed. The entire path is nothing but the cumulative sum of the increments.

Next, I simulate the stock price with 1000 intraday observations. Recall that the geometric Brownian motion implies dynamics for the log of the stock price, which is why I adjust the plotted value in the \(y\)-axis.

tab = simulate_geom_brownian_motion(start=np.log(18))

(
    ggplot(tab, aes(x="time", y=np.exp(tab["price"])))
    + geom_line()
    + labs(x="", y="Stock price")
)
<plotnine.ggplot.ggplot object at 0x0000023375AECCD0>

Next, we scale the simulation up. The code below simulates 250 paths of the geometric Brownian motion, all with the same starting value. The function illustrates the sampled (log) price paths.

true_sigma = 2.4
steps = 1000

simulations = []
for i in range(1, 251):
    df = simulate_geom_brownian_motion(seed=i, sigma=true_sigma, steps=steps)
    df["iteration"] = i
    simulations.append(df)

simulation = pd.concat(simulations, ignore_index=True)

(
    ggplot(simulation, aes(x="step", y="price", group="iteration"))
    + geom_line(alpha=0.3)
    + labs(x="Iteration", y="(log) price")
)
<plotnine.ggplot.ggplot object at 0x00000233FCDBA850>

Next, we investigate some of the properties of the geometric Brownian motion. First, by definition, the sample variation of \(X_t = \int\limits_0^t\mu dt + \int\limits_0^t\sigma dW_t\) is \(\int\limits_0^t\sigma^2 dt = \sigma^2t\). The figure below illustrates the sample variation over time and shows that the standard deviation across all simulated (log) prices increases linearly in \(t\) as expected. For the simulation, I used \(\sigma = 2.4\) (true_sigma), which corresponds to the slope of the line (adjusted by the number of observations).

var_df = (
    simulation
    .groupby("step")["price"]
    .std()
    .reset_index(name="sd")
    .assign(var=lambda d: d["sd"] ** 2)
)

(
    ggplot(var_df, aes(x="step", y="var"))
    + geom_line()
    + geom_abline(slope=true_sigma**2 / steps, intercept=0, linetype="dashed")
    + labs(x="Iteration", y="Sample variation")
)
<plotnine.ggplot.ggplot object at 0x0000023377014990>

What about the realized volatility? As defined in the lecture, the realized variance \[RV = \sum\limits_{i=0}^n \Delta X_t^2\] converges to the integrated variance (in the absence of jumps). The bar plot below illustrates the sampling distribution of the realized volatility, which, as expected, centers around true_sigma, the integrated variance of the simulated geometric Brownian motion. You can change the value true_sigma above and analyze the effect.

rv = (
    simulation
    .assign(lag_price=lambda d: d.groupby("iteration")["price"].shift(1))
    .assign(sq_diff=lambda d: (d["price"] - d["lag_price"]) ** 2)
    .groupby("iteration")["sq_diff"]
    .sum()
    .dropna()
)

rv_df = rv.reset_index(name="rv")

(
    ggplot(rv_df, aes(x="rv"))
    + geom_histogram(bins=100)
    + geom_vline(xintercept=true_sigma**2, linetype="dotted", color="red")
    + labs(x="Realized Volatility")
)
<plotnine.ggplot.ggplot object at 0x0000023375954D10>

Finally, some more analysis regarding the infill asymptotics. As should be clear by now, sampling at the highest frequency possible will ensure that the realized volatility recovers the integrated variance of the (log) price process. In other words, we can measure the variance. The following lines of code illustrate the effect: Whereas we simulated our process at a high frequency, we now simulate sub-samples at different frequencies, ranging from every second to once per day. The figure then illustrates the daily realized volatilities relative to the integrated variance of the process. Not surprisingly, the dispersion around the true parameter decreases substantially with higher sampling frequency.

def infill_asymptotics(df, sampling_minutes=5):
    df = df.set_index("time").sort_index()

    # resample (similar to aggregatePrice)
    sampled = df["price"].resample(f"{sampling_minutes}min").last().dropna()

    rv = np.sum(np.diff(sampled) ** 2)
    return rv


results = []

freq_order = [
    "1 Second",
    "1 Minute",
    "5 Minutes",
    "15 Minutes",
    "1 Hour",
    "1 Day",
]

rvs = (
    simulation
    .groupby("iteration")
    .apply(
        lambda grp: pd.Series({
            "1 Second": infill_asymptotics(grp, 1 / 60),
            "1 Minute": infill_asymptotics(grp, 1),
            "5 Minutes": infill_asymptotics(grp, 5),
            "15 Minutes": infill_asymptotics(grp, 15),
            "1 Hour": infill_asymptotics(grp, 60),
            "1 Day": infill_asymptotics(grp, int(6.5 * 60)),
        })
    )
    .reset_index()
)

rvs_long = (
    rvs
    .melt(id_vars="iteration",
          var_name="sampling_frequency",
          value_name="rv")
    .assign(sqrt_rv=lambda d: np.sqrt(d["rv"]))
    .assign(
        sampling_frequency=lambda d: pd.Categorical(
            d["sampling_frequency"],
            categories=freq_order,
            ordered=True
        )
    )
)

(
    ggplot(rvs_long, aes(x="sqrt_rv", fill="sampling_frequency"))
    + geom_density(alpha=0.5)
    + geom_vline(xintercept=true_sigma, linetype="dotted")
    + facet_wrap("~sampling_frequency", scales="free_y")
    + labs(x="(Log) Realized Volatility", y="")
)
<plotnine.ggplot.ggplot object at 0x000002B8DF550D90>