nature.com

Solar PV Generation and Consumption Dataset of an Estonian Residential Dwelling

Background & Summary

Energy generation and consumption datasets are essential for understanding, assessing, and optimizing residential microgrids and nanogrids. Given rapidly growing global energy demand, regional disputes, and the depletion of fossil fuels, many nations have adopted renewable energy sources to ensure energy security and sustainability. However, household renewable installations come with significant challenges, and datasets can help with their design. Several residential consumption and generation datasets are publicly available for different regions worldwide. For instance, the ENERTALK dataset includes 15 Hz resolution appliance-level measurements from 22 South Korean residences1. The ECD-UY dataset contains energy usage data from Uruguay at 1–15 minute resolution2. The Plegma dataset includes whole-house aggregate loads and appliance-level consumption measurements taken at 10-second intervals over one year from 13 different houses3. A further dataset provides hourly smart-meter consumption data from Spanish houses4.

Different applications require datasets with different resolutions. High-resolution datasets are employed in fault analysis, forecasting, and characteristic simulations, whereas low-resolution datasets are generally used in planning, policy formation, resource assessment, and economic analysis5,6. For instance, the 1-hour dataset from NASA POWER is widely used in energy planning, system optimization, and forecasting7. Different dataset resolutions and their applications are shown in Fig. 1.

Fig. 1 Electrical generation and consumption data resolution and application categories30,31.

There are numerous methods for measuring both aggregated and appliance-level energy usage in buildings. Aggregated consumption is logged by direct measurement. Appliance-level consumption can be measured in two ways. The first is direct measurement via consumer-grade smart plugs, IoT devices, or specialized measurement equipment. The second obtains appliance-level data through non-intrusive load monitoring (NILM), a computational approach that analyzes aggregate power signals from smart meter readings8. NILM can also be utilized for fault detection and demand management9,10. Additionally, such datasets are used to forecast both short- and long-term demand. Machine learning-based methodologies for forecasting short- and long-term load demand include support vector regression, artificial neural networks, Gaussian process regression, and meta-heuristic algorithms11,12,13.

Rooftop solar installations are becoming more common in many parts of Europe and the world. Some European countries are represented in the datasets listed in Table 1; however, there is currently no publicly available residential energy generation and consumption dataset specific to Estonia. This absence makes it difficult to model residential energy consumption patterns accurately in the Estonian climate, which is characterized by its unpredictability. Data from Distribution System Operators (DSOs) is available, but it lacks complete information on photovoltaic (PV) generation and is logged only at hourly intervals. To address this, this paper presents a comprehensive residential energy generation and consumption dataset for an Estonian dwelling, captured at a high temporal resolution of 10 seconds. The dataset aims to provide more precise insights into energy usage and generation dynamics under Estonia’s climatic conditions.

Table 1 Comparison of other datasets with the Estonian residential generation and consumption dataset.


Methods

The data was collected from a detached house near Tallinn. It is a two-story building with a total area of 130 m² that is used as a main residence by a single family with two adults and two children. There is one large room and a kitchen on the first floor and three smaller bedrooms on the second. The house is linked to the utility grid via a three-phase connection. It has all the standard domestic loads and a single-phase electric vehicle (EV) charger. The EV is used daily. There is also one air-to-water heat pump and one air conditioner installed in the house. The loads are not balanced symmetrically across the three phases. Aside from the loads, a 5 kWp rooftop PV array is connected to the grid via a 4.2 kW inverter. All the appliances and loads within the house, along with their nominal ratings, are listed in Table 2.

The data reported here was collected by a set of measurement devices, and there was no direct involvement of the residents in the form of surveys or questionnaires. The methodology therefore falls outside the scope of research requiring formal ethical review, and this work did not require ethical approval under the guidelines set by Tallinn University of Technology. The homeowner consented to the publication of data related to power generation and consumption, excluding any personal information. The energy monitoring sensors were installed with the help of a certified electrician, who ensured that safety and proper electrical guidelines were followed. The recorded data is for the year 2023, and data logging has been discontinued since then.

Table 2 Description of the electrical appliances within the house.


Two Camille Bauer PQ1000 power quality monitoring units were installed inside the house to accurately record PV generation, load usage, and grid energy import and export. One monitors the parameters directly at the PV inverter and is designated PV within the dataset. The other is installed before the energy meter to acquire load data and is referred to as L2. Both loggers are connected to a Windows laptop, where the measurement data is stored and can be accessed remotely. The measurement resolution was 10 s. Given the amount of data, the dataset is split by month and saved as CSV files. The connection diagram for the loggers is depicted in Fig. 2.

Fig. 2 Simplified schematic of the logger connection within the house.

As Fig. 2 shows, the current transformers (CTs) of the two loggers are oriented in opposite directions. This orientation was chosen deliberately so that all the parameters could be easily extracted from the dataset. To extract data from the raw logger output, the constraints listed in Table 3 were used. These constraints were applied on a per-phase basis to extract true power balance conditions.

Table 3 Constraints for extracting load consumption, power imported from, and power exported to the grid.


In addition, the CSV files produced by the logger contain some irrelevant characters in the timestamps as well as missing data points. To obtain the consumption data from this dataset, Python scripts (stored in the script subdirectory) perform several steps, including timestamp cleaning and resampling. There are five scripts. Scripts 1 and 2 (SC1_PV_auto_sort.py and SC2_L2_auto_sort.py) are sufficient to obtain a cleaned-up dataset from the raw data by correcting the timestamps and detecting missing points. Additional column selections can be added to those scripts to select the appropriate data. The remaining three scripts demonstrate how the data can be processed for an intended application. Scripts 1 and 2 resample the dataset at its original resolution (10 seconds). This restores timestamp continuity and makes the missing points appear as blank cells (NaN) in the spreadsheet. These missing data points can be filled using a variety of basic and advanced procedures. Here, the missing cells are imputed using the simple K-nearest neighbor (KNN) approach (Scripts 3 and 4). Once the gaps are filled, the constraints listed in Table 3 are applied via Script 5, resulting in consumption data. Basic KNN handles most cases in which relatively few consecutive data points are missing; for long gaps, the data should be processed with other approaches that involve advanced machine learning.
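The resampling and imputation steps can be illustrated with a minimal sketch. It is not the code of the repository scripts: the column name `P`, the synthetic readings, and the helper names `regularize` and `knn_fill` are assumptions made for illustration, and the KNN step is a simplified time-based variant that averages the `k` observed samples closest in time to each gap.

```python
import numpy as np
import pandas as pd

STEP = pd.Timedelta(seconds=10)  # logging interval stated in the text


def regularize(df: pd.DataFrame) -> pd.DataFrame:
    """Resample onto a regular 10 s grid; slots without a reading become NaN rows."""
    return df.sort_index().resample(STEP).mean()


def knn_fill(series: pd.Series, k: int = 2) -> pd.Series:
    """Fill each NaN with the mean of the k observed samples closest in time."""
    values = series.to_numpy(dtype=float)
    positions = np.arange(len(values))
    known = ~np.isnan(values)
    known_pos, known_val = positions[known], values[known]
    if known_pos.size == 0:
        return series.copy()  # nothing to learn from
    filled = values.copy()
    for i in positions[~known]:
        # Indices of the k nearest observed samples, measured in grid steps.
        nearest = np.argsort(np.abs(known_pos - i), kind="stable")[:k]
        filled[i] = known_val[nearest].mean()
    return pd.Series(filled, index=series.index, name=series.name)


# Synthetic readings (hypothetical values) with the 00:00:20 sample missing.
raw = pd.DataFrame(
    {"P": [1.0, 2.0, 4.0, 5.0]},  # "P" is a hypothetical column name
    index=pd.to_datetime(
        [
            "2023-06-01 00:00:00",
            "2023-06-01 00:00:10",
            "2023-06-01 00:00:30",
            "2023-06-01 00:00:40",
        ]
    ),
)

regular = regularize(raw)             # five rows, one of them NaN
filled = knn_fill(regular["P"], k=2)  # gap filled from its two neighbours
```

With a short gap, the fill is simply the average of the adjacent readings. For long gaps, every missing sample draws on the same few boundary values, which is one reason the text recommends other methods in that case.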

Data Records

Raw data obtained from the two loggers is saved into month-by-month CSV files and stored in separate folders, as illustrated in Fig. 3. The dataset was deposited in the TalTech Data Repository14. Original dataset files follow the naming pattern YYYY_MMM_.csv, and processed files follow the pattern MM_YYYY_.csv.

Fig. 3 File hierarchy of the dataset stored in the repository.

The PQ1000 is an AC power quality analyzer. It can measure instantaneous voltage, current, and power values, as well as reactive power, and supports harmonic analysis and energy balance analysis. Apart from logging data at predefined intervals, it also generates high-resolution data for abnormal conditions such as voltage dips, swells, and interruptions, which may be further analyzed to identify faults. In this dataset, only the individual phase voltage, active power, and power factor, together with their minimum and maximum values within the sampling time, were recorded. The recorded parameters and their header labels are explained in Table 4.

Table 4 Recorded variables and their descriptions.


Technical Validation

As previously stated, the initial dataset contains certain missing values, requiring their imputation. In this section, the missing data, discrepancies, and a short overview of the dataset are discussed.

Measurement Deviation

The dataset obtained from the logger was compared with the data provided by the DSO, which was gathered by the smart meter installed in the house. The historical data obtained from the DSO are energy data organized by daily and monthly usage and sampled at 1-hour intervals. Furthermore, the DSO data only covers grid consumption and surplus energy sold to the grid, not the full scope of PV generation. Data from the logger was resampled to 1 hour to match the resolution of the DSO data. It should also be noted that the logger has better measurement precision, with an accuracy of 0.5% for power measurement, while the utility meter generally has an accuracy of 1% for active power measurement15,16. The data for a full week are shown in Fig. 4 to illustrate the measurement deviation. The mean absolute error for the data shown in the figure is 0.005788 for energy import and 0.006802 for energy export. This week was chosen because it contains no missing points, allowing a fair comparison between the two data sources. The missing data points are discussed in the next subsection.

Fig. 4 A snippet of energy measurement by both logger and utility meter for seven days in June. The deviation of the energy measured by the meter-side logger compared to the utility meter is shown in the subplots. The largest deviations are 0.090 kWh for energy imports and 0.046 kWh for energy exports within this period.
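The comparison described above, resampling the 10-second logger data to hourly energy and computing the mean absolute error against the hourly DSO readings, can be sketched as follows. This is an illustrative sketch rather than the authors' script. It assumes that each logged value is the mean active power in watts over its 10-second interval, and all numbers are synthetic.

```python
import pandas as pd

SAMPLE_SECONDS = 10  # logger interval stated in the text


def hourly_energy_kwh(power_w: pd.Series) -> pd.Series:
    """Convert 10 s mean power (W, assumed unit) to energy per hour (kWh)."""
    energy_kwh = power_w * SAMPLE_SECONDS / 3600 / 1000  # W * s -> kWh
    # Note: an hour containing only missing samples sums to 0 here.
    return energy_kwh.resample(pd.Timedelta(hours=1)).sum()


def mean_absolute_error(a: pd.Series, b: pd.Series) -> float:
    """Mean absolute difference between two aligned series."""
    return float((a - b).abs().mean())


# Synthetic logger data: a constant 1800 W import for two hours.
index = pd.date_range(
    "2023-06-05 00:00:00", periods=720, freq=pd.Timedelta(seconds=SAMPLE_SECONDS)
)
logger_power_w = pd.Series(1800.0, index=index)

logger_hourly = hourly_energy_kwh(logger_power_w)  # 1.8 kWh in each hour

# Synthetic hourly utility-meter readings (kWh) on the same timestamps.
dso_hourly = pd.Series([1.79, 1.82], index=logger_hourly.index)

mae = mean_absolute_error(logger_hourly, dso_hourly)  # (0.01 + 0.02) / 2
```

Import and export would each be processed in this way as separate series. Hours with missing samples should be excluded before the comparison, which is why the text picks a week without gaps.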

Missing Datapoints

The missing data points are mainly attributed to network problems: the loggers intermittently lost their network connection, and a portion of the recorded values was lost as a result. Daily missing data points for the PV-side and meter-side loggers are illustrated in Figs. 5 and 6.

Fig. 5 Missing points on PV side logger. The missing points correspond to timestamps only, not each variable column. The total duration in hours of the missing data would be the missing point count divided by 360, as the sampling interval is 10 seconds. The largest missing data interval for PV generation is 2.18 hours on December 21st.

Fig. 6 Missing points on the meter-side logger, counted in the same way as for the PV-side logger. The longest period of missing data on the meter-side logger is approximately 10.1 hours on November 21st.
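The conversion from missing point counts to hours follows from the 10-second interval: one hour holds 3600 / 10 = 360 samples and one day holds 8640. A minimal sketch of the per-day count is shown below. The function name and the synthetic timestamps are assumptions for illustration, not part of the published scripts.

```python
from collections import Counter
from datetime import datetime, timedelta

SAMPLE_SECONDS = 10
POINTS_PER_HOUR = 3600 // SAMPLE_SECONDS  # 360
POINTS_PER_DAY = 24 * POINTS_PER_HOUR     # 8640


def missing_by_day(timestamps):
    """Return {date: (missing_points, missing_hours)} for each day with data.

    Days without a single reading do not appear in the result.
    """
    # Snap every timestamp to its 10 s slot and drop duplicates.
    slots = {
        ts.replace(microsecond=0) - timedelta(seconds=ts.second % SAMPLE_SECONDS)
        for ts in timestamps
    }
    observed = Counter(slot.date() for slot in slots)
    return {
        day: (POINTS_PER_DAY - n, (POINTS_PER_DAY - n) / POINTS_PER_HOUR)
        for day, n in sorted(observed.items())
    }


# Synthetic day of readings with a two-hour outage between 03:00 and 05:00.
start = datetime(2023, 1, 1)
full_day = [
    start + timedelta(seconds=SAMPLE_SECONDS * i) for i in range(POINTS_PER_DAY)
]
with_gap = [ts for ts in full_day if not (3 <= ts.hour < 5)]

report = missing_by_day(with_gap)
missing_points, missing_hours = report[start.date()]
```

Because the count is based on timestamps, a row that is present but has an empty variable column is not counted as missing, which matches the convention stated in the caption of Fig. 5.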

Seasonal Consumption and Generation

Estonia has four seasons: summer, autumn, winter, and spring. The average temperature ranges between 14.5 and 18.3°C in summer and between −5.3 and 1.9°C in winter17. During the winter, the sunshine duration is short and heavy snowfall further limits output, so rooftop solar installations generate very little power, as reflected in Fig. 7. In summer and spring, however, PV generation is usually sufficient to meet daily consumption.

Fig. 7 Season-wise daily generation and consumption. The green bars indicate daily PV generation, and the red bars indicate daily consumption. The maximum daily PV generation was 38 kWh during summer and late spring, while the highest daily consumption was 91.2 kWh during winter.

Usage Notes

This dataset can be accessed using any software capable of loading and manipulating CSV files. For this paper, data processing and visualization were largely done using Python libraries (Pandas, NumPy, Scikit-Learn, and Matplotlib). However, working with all of the data at once requires a PC with ample memory or a cloud computing platform. For month-wise data analysis and visualization, Binjr can be used, as it is a very fast and resource-efficient time-series viewer written in Java18. MATLAB can also be used to plot different variables for the whole year without requiring a high-performance PC.
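One way to keep memory use modest in Python is to process one monthly file at a time, load only the columns that are needed, and reduce each month to daily totals before combining them. The sketch below illustrates this under several assumptions: the column names `timestamp` and `P_W`, the unit (watts), and the demo file names are hypothetical and should be replaced with the header labels from Table 4 and the actual repository files.

```python
import tempfile
from pathlib import Path

import pandas as pd

SAMPLE_SECONDS = 10  # logging interval stated in the text


def daily_energy_kwh(csv_path, time_col="timestamp", power_col="P_W"):
    """Load two columns of one monthly file and return daily energy in kWh."""
    df = pd.read_csv(csv_path, usecols=[time_col, power_col], parse_dates=[time_col])
    power_w = df.set_index(time_col)[power_col]
    # Assumes mean power in W per 10 s sample; W * s -> kWh.
    return (power_w * SAMPLE_SECONDS / 3600 / 1000).resample("D").sum()


def yearly_daily_energy(folder):
    """Reduce each CSV in the folder to daily totals, then combine them."""
    parts = [daily_energy_kwh(path) for path in sorted(Path(folder).glob("*.csv"))]
    return pd.concat(parts).sort_index()


# Demo with two small synthetic files standing in for monthly CSVs.
with tempfile.TemporaryDirectory() as tmp:
    demo = [("demo_a.csv", "2023-01-01", 1000.0), ("demo_b.csv", "2023-02-01", 2000.0)]
    for name, day, watts in demo:
        stamps = pd.date_range(day, periods=360, freq=pd.Timedelta(seconds=SAMPLE_SECONDS))
        frame = pd.DataFrame({"timestamp": stamps, "P_W": watts, "unused": 0})
        frame.to_csv(Path(tmp) / name, index=False)
    daily = yearly_daily_energy(tmp)  # one hour at 1 kW, then one hour at 2 kW
```

Only the daily series stays in memory after each file is processed, so the peak footprint is roughly that of a single month rather than the full year.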
