The data module provides core data structures and functions for loading, processing, and enriching assessment and sales data.

Core Data Structures

SalesUniversePair

A container for the sales and universe DataFrames. Many functions operate on this data structure.
from openavmkit.data import SalesUniversePair

sup = SalesUniversePair(sales=df_sales, universe=df_universe)
Parameters:
- sales (pd.DataFrame, required): DataFrame containing sales data
- universe (pd.DataFrame, required): DataFrame containing universe (parcel) data

Methods

copy() Create a copy of the SalesUniversePair object.
sup_copy = sup.copy()
Returns:
- sup_copy (SalesUniversePair): A new SalesUniversePair object with copied DataFrames
set(key, value) Set the sales or universe DataFrame.
sup.set("sales", new_sales_df)
sup.set("universe", new_universe_df)
Parameters:
- key (str, required): Either "sales" or "universe"
- value (pd.DataFrame, required): The new DataFrame to set for the specified key
update_sales(new_sales, allow_remove_rows) Overlay new information onto the sales DataFrame without storing redundant data.
sup.update_sales(new_sales_df, allow_remove_rows=True)
Parameters:
- new_sales (pd.DataFrame, required): New sales DataFrame with updates
- allow_remove_rows (bool, required): If True, allows the update to remove rows from sales; if False, preserves all original rows
limit_sales_to_keys(new_sale_keys) Restrict the sales DataFrame to rows whose sale key appears in new_sale_keys.
sup.limit_sales_to_keys(["sale_123", "sale_456", "sale_789"])
Parameters:
- new_sale_keys (list[str], required): List of sale keys to filter to
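As an illustrative sketch only (plain pandas, not openavmkit's actual implementation), an overlay update like update_sales can be pictured as replacing rows that match on a sale key while leaving untouched rows alone; the key_sale column name here is a hypothetical stand-in:

```python
import pandas as pd

# Toy stand-ins for the sales DataFrame and an update to it.
df_sales = pd.DataFrame({
    "key_sale": ["s1", "s2", "s3"],
    "sale_price": [100_000, 200_000, 300_000],
})
new_sales = pd.DataFrame({
    "key_sale": ["s2", "s3"],
    "sale_price": [210_000, 310_000],
})

# Overlay: align on the sale key and overwrite matching rows with the
# new values; rows not present in the update are preserved, which is
# the allow_remove_rows=False behavior described above.
merged = df_sales.set_index("key_sale")
merged.update(new_sales.set_index("key_sale"))
df_updated = merged.reset_index()
```

With allow_remove_rows=True, the library's version may also drop rows whose keys are absent from the update; the sketch above only covers the preserving case.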

Data Loading Functions

load_dataframe()

Load a single DataFrame based on configuration settings.
from openavmkit.data import load_dataframe

df = load_dataframe(
    entry,
    settings,
    verbose=True,
    fields_cat=categorical_fields,
    fields_bool=boolean_fields,
    fields_num=numeric_fields
)
Parameters:
- entry (dict, required): Configuration entry for loading the dataframe
- settings (dict, required): Settings dictionary
- verbose (bool, default: False): If True, prints detailed logs during data loading
- fields_cat (list[str], default: None): List of categorical field names
- fields_bool (list[str], default: None): List of boolean field names
- fields_num (list[str], default: None): List of numeric field names

Returns:
- df (pd.DataFrame): The loaded DataFrame
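The fields_cat, fields_bool, and fields_num lists tell the loader which dtype each column should end up with. A minimal sketch of that kind of coercion in plain pandas (the column names and string-to-bool convention here are assumptions, not openavmkit's internals):

```python
import pandas as pd

# Hypothetical raw frame as it might come off disk, with everything
# read in as strings.
df = pd.DataFrame({
    "land_use": ["res", "com", "res"],
    "is_waterfront": ["true", "false", "true"],
    "land_area": ["1500", "2200", "980"],
})

fields_cat = ["land_use"]
fields_bool = ["is_waterfront"]
fields_num = ["land_area"]

# Coerce each declared field to its target dtype.
for col in fields_cat:
    df[col] = df[col].astype("category")
for col in fields_bool:
    df[col] = df[col].str.lower().eq("true")
for col in fields_num:
    df[col] = pd.to_numeric(df[col])
```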

Data Processing Functions

process_data()

Process raw dataframes according to settings and return a SalesUniversePair.
from openavmkit.data import process_data

sup = process_data(dataframes, settings, verbose=True)
Parameters:
- dataframes (dict[str, pd.DataFrame], required): Dictionary mapping keys to DataFrames
- settings (dict, required): Settings dictionary
- verbose (bool, default: False): If True, prints progress information

Returns:
- sup (SalesUniversePair): A SalesUniversePair containing processed sales and universe data

get_hydrated_sales_from_sup()

Merge the sales and universe DataFrames to “hydrate” the sales data.
from openavmkit.data import get_hydrated_sales_from_sup

df_hydrated = get_hydrated_sales_from_sup(sup)
Parameters:
- sup (SalesUniversePair, required): SalesUniversePair containing sales and universe DataFrames

Returns:
- df_hydrated (pd.DataFrame | gpd.GeoDataFrame): The merged (hydrated) sales DataFrame
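Conceptually, hydration joins parcel-level attributes from the universe onto each sale. A self-contained pandas sketch of that idea (the key and land_area column names are illustrative assumptions, not the library's schema):

```python
import pandas as pd

# Toy sales and universe frames sharing a parcel key.
df_sales = pd.DataFrame({
    "key": ["p1", "p2"],
    "sale_price": [100_000, 250_000],
})
df_universe = pd.DataFrame({
    "key": ["p1", "p2", "p3"],
    "land_area": [5_000, 7_500, 6_000],
})

# Left-join universe attributes onto the sales rows: every sale keeps
# its transaction fields and gains the parcel's characteristics.
df_hydrated = df_sales.merge(df_universe, on="key", how="left")
```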

get_sup_model_group()

Get a subset of a SalesUniversePair for a specific model group.
from openavmkit.data import get_sup_model_group

sup_mg = get_sup_model_group(sup, model_group_id)
Parameters:
- sup (SalesUniversePair, required): The SalesUniversePair to filter
- model_group_id (str, required): The model group identifier to filter by

Returns:
- sup_mg (SalesUniversePair): A new SalesUniversePair containing only the specified model group

Enrichment Functions

enrich_time()

Enrich the DataFrame by converting specified time fields to datetime and deriving additional fields.
from openavmkit.data import enrich_time

df = enrich_time(df, time_formats, settings)
Parameters:
- df (pd.DataFrame, required): Input DataFrame
- time_formats (dict, required): Dictionary mapping field names to datetime formats
- settings (dict, required): Settings dictionary

Returns:
- df (pd.DataFrame): DataFrame with enriched time fields
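A minimal sketch of the kind of enrichment this performs, using plain pandas (the sale_date field and the derived _year/_month names are assumptions for illustration, not openavmkit's exact output columns):

```python
import pandas as pd

df = pd.DataFrame({"sale_date": ["2021-03-15", "2022-11-02"]})

# time_formats maps each field name to its datetime format string.
time_formats = {"sale_date": "%Y-%m-%d"}

# Parse each configured field to datetime, then derive calendar fields.
for field, fmt in time_formats.items():
    df[field] = pd.to_datetime(df[field], format=fmt)
    df[f"{field}_year"] = df[field].dt.year
    df[f"{field}_month"] = df[field].dt.month
```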

enrich_sup_spatial_lag()

Enrich the sales and universe DataFrames with spatial lag features.
from openavmkit.data import enrich_sup_spatial_lag

sup = enrich_sup_spatial_lag(sup, settings, verbose=True)
Parameters:
- sup (SalesUniversePair, required): SalesUniversePair containing sales and universe DataFrames
- settings (dict, required): Settings dictionary
- verbose (bool, default: False): If True, prints progress information

Returns:
- sup (SalesUniversePair): Enriched SalesUniversePair with spatial lag features

enrich_df_streets()

Enrich a GeoDataFrame with street network data.
Note: this function can be very compute- and memory-intensive for large datasets.
from openavmkit.data import enrich_df_streets

df = enrich_df_streets(
    df,
    settings,
    spacing=1.0,
    max_ray_length=25.0,
    network_buffer=500.0,
    verbose=True
)
Parameters:
- df_in (gpd.GeoDataFrame, required): Input GeoDataFrame containing parcels
- settings (dict, required): Settings dictionary containing configuration for the enrichment
- spacing (float, default: 1.0): Spacing in meters for ray casting to calculate distances to streets
- max_ray_length (float, default: 25.0): Maximum length of rays to shoot for distance calculations, in meters
- network_buffer (float, default: 500.0): Buffer around the street network to consider for distance calculations, in meters
- verbose (bool, default: False): If True, prints progress information

Returns:
- df (gpd.GeoDataFrame): Enriched GeoDataFrame with additional columns for street-related metrics

Utility Functions

get_sale_field()

Determine the appropriate sale price field based on time adjustment settings.
from openavmkit.data import get_sale_field

sale_field = get_sale_field(settings, df)
Parameters:
- settings (dict, required): Settings dictionary
- df (pd.DataFrame, default: None): Optional DataFrame to check field existence

Returns:
- sale_field (str): Field name to be used for sale price (either "sale_price" or "sale_price_time_adj")
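The selection logic can be sketched as follows. This is an assumption about the behavior, not the library's code: prefer the time-adjusted price when time adjustment is enabled, but fall back to the raw price if the column is absent from the supplied DataFrame.

```python
import pandas as pd

def pick_sale_field(use_time_adjustment, df=None):
    """Hypothetical sketch: choose between the raw and time-adjusted
    sale price field, falling back when the column does not exist."""
    field = "sale_price_time_adj" if use_time_adjustment else "sale_price"
    if df is not None and field not in df.columns:
        field = "sale_price"
    return field

# A frame that has only the raw price column.
df = pd.DataFrame({"sale_price": [100_000]})
```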

get_vacant_sales()

Filter the sales DataFrame to return only vacant (unimproved) sales.
from openavmkit.data import get_vacant_sales

df_vacant = get_vacant_sales(df, settings, invert=False)
Parameters:
- df_in (pd.DataFrame, required): Input DataFrame
- settings (dict, required): Settings dictionary
- invert (bool, default: False): If True, return non-vacant (improved) sales

Returns:
- df_vacant (pd.DataFrame): DataFrame with an added is_vacant column

get_vacant()

Filter the DataFrame based on the ‘is_vacant’ column.
from openavmkit.data import get_vacant

df_vacant = get_vacant(df, settings, invert=False)
Parameters:
- df_in (pd.DataFrame, required): Input DataFrame
- settings (dict, required): Settings dictionary
- invert (bool, default: False): If True, return non-vacant rows

Returns:
- df_vacant (pd.DataFrame): DataFrame filtered by the is_vacant flag
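The invert flag amounts to negating a boolean mask on the is_vacant column. A self-contained pandas sketch of that pattern (illustrative only; the helper name is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["p1", "p2", "p3"],
    "is_vacant": [True, False, True],
})

def filter_vacant(df, invert=False):
    # Keep vacant rows by default; invert=True returns improved rows.
    mask = df["is_vacant"]
    if invert:
        mask = ~mask
    return df[mask]

df_vacant = filter_vacant(df)
df_improved = filter_vacant(df, invert=True)
```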

get_train_test_keys()

Get the training and testing keys for the sales DataFrame.
from openavmkit.data import get_train_test_keys

train_keys, test_keys = get_train_test_keys(df, settings)
Parameters:
- df_in (pd.DataFrame, required): Input DataFrame containing sales data
- settings (dict, required): Settings dictionary

Returns:
- keys_train (np.ndarray): Keys for training set
- keys_test (np.ndarray): Keys for testing set

get_train_test_masks()

Get the training and testing masks for the sales DataFrame.
from openavmkit.data import get_train_test_masks

mask_train, mask_test = get_train_test_masks(df, settings)
Parameters:
- df_in (pd.DataFrame, required): Input DataFrame containing sales data
- settings (dict, required): Settings dictionary

Returns:
- mask_train (pd.Series): Boolean mask for training set
- mask_test (pd.Series): Boolean mask for testing set
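A sketch of what complementary train/test masks look like, built with plain pandas and NumPy (the 75/25 ratio and seed are arbitrary illustrations; openavmkit's actual split is driven by settings):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key_sale": ["s1", "s2", "s3", "s4"],
    "sale_price": [1, 2, 3, 4],
})

# A reproducible random split expressed as boolean masks; the masks
# are complementary, so every sale lands in exactly one set.
rng = np.random.default_rng(42)
mask_train = pd.Series(rng.random(len(df)) < 0.75, index=df.index)
mask_test = ~mask_train

df_train = df[mask_train]
df_test = df[mask_test]
```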

Field Classification Functions

get_field_classifications()

Retrieve a mapping of field names to their classifications (land, improvement, or other) and types.
from openavmkit.data import get_field_classifications

field_map = get_field_classifications(settings)
Parameters:
- settings (dict, required): Settings dictionary

Returns:
- field_map (dict): Dictionary mapping field names to type and class

get_important_field()

Retrieve the important field name for a given field alias from settings.
from openavmkit.data import get_important_field

field_name = get_important_field(settings, "deed_id", df)
Parameters:
- settings (dict, required): Settings dictionary
- field_name (str, required): Identifier for the field
- df (pd.DataFrame, default: None): Optional DataFrame to check field existence

Returns:
- field_name (str | None): The mapped field name if found, else None

get_report_locations()

Retrieve report location fields from settings.
from openavmkit.data import get_report_locations

locations = get_report_locations(settings, df)
Parameters:
- settings (dict, required): Settings dictionary
- df (pd.DataFrame, default: None): Optional DataFrame to filter available locations

Returns:
- locations (list[str]): List of report location field names

Output Functions

write_parquet()

Write data to a parquet file.
from openavmkit.data import write_parquet

write_parquet(df, "out/data.parquet")
Parameters:
- df (pd.DataFrame, required): Data to be written
- path (str, required): File path for saving the parquet file

write_gpkg()

Write data to a GeoPackage file.
from openavmkit.data import write_gpkg

write_gpkg(gdf, "out/data.gpkg")
Parameters:
- df (gpd.GeoDataFrame, required): Data to be written
- path (str, required): File path for saving the GeoPackage

write_zipped_shapefile()

Write data to a zipped shapefile.
from openavmkit.data import write_zipped_shapefile

write_zipped_shapefile(gdf, "out/data.shp.zip")
Parameters:
- df (gpd.GeoDataFrame, required): Data to be written
- path (str, required): File path for saving the zipped shapefile

write_csv()

Write data to a CSV file.
from openavmkit.data import write_csv

write_csv(df, "out/data.csv")
Parameters:
- df (pd.DataFrame, required): Data to be written
- path (str, required): File path for saving the CSV