Written by Paloma Recuerdo, original post in Spanish
In this post we will see an example of how we can perform simple descriptive analytics on a set of data, without having to resort to specific tools. It is a very widespread tool, but often we are not aware of its great power. As in a previous post, we have used it for data preparation and filtering tasks, in this example it is used as an analytical tool that allows us to answer the questions we ask about Bicimad.
In 2014, the Bicimad public bicycle service became operational. Like many other people in Madrid who suffer from daily traffic jams in Madrid, we thought it was good news to have other public transport alternatives. But we were worried about one issue: safety. Would there be more accidents? Would this information be transparently transmitted to citizens?
We decided to investigate this issue and so we searched the Open Data portal of the Madrid City Council for information about accidents related to bicycles. We found the dataset that collects information on “Traffic Accidents involving bicycles”. Although the service was started in 2014, we only have information as of January 2017, meaning the volume of data is not very large. However, this information is updated monthly. Although the size of the sample is small and, therefore, the conclusions that we obtain will not be very determining, we are interested in seeing how we can “talk” to this data to answer the questions that concern us.
|Figure 1: Open Data Portal of the Madrid City Council.|
Unfortunately, in July of 2017, the first fatal BiciMad accident occurred, at a location whose danger had already been documented by another user (via his blog) a year before. It is very important to detect these “danger spots” to see what measures can be taken to avoid accidents. We are going to analyse this set of data with Excel to find the answers to these questions about the security of BiciMAD for ourselves.
First, we downloaded the data set in Excel format. This file contains information about traffic accidents where at least one bicycle is involved, indicating day, time, number of victims, district, name of the road and type of accident.
The types of accident considered are as follows:
- Double Collision: Traffic accident between two vehicles in motion.
- Multiple collision: Traffic accident between more than two vehicles in motion.
- Impact with stationary object: Accident between a vehicle in motion with a driver and an immobile object that occupies the road or area away from it, whether a parked vehicle, tree, street lamp, etc.
- Atropello: Accident between a vehicle and a pedestrian occupying the roadway or passing through sidewalks, shelters, walkways or areas of public roads not intended for the movement of vehicles.
- Tipping: Accident suffered by a vehicle with more than two wheels and that for some reason its tires lose contact with the road being supported on one side or on the roof.
- Motorcycle crash: Accident suffered by a motorcycle, which at some point loses its balance, due to the driver or due to the circumstances of the road.
- Moped fall: Accident suffered by a moped, which at a certain moment loses its balance, due to the driver or the circumstances of the road.
- Bicycle fall: Accident suffered by a bicycle, which at a certain moment loses its balance, due to the driver or the circumstances of the road.
The downloaded file looks like this:
|Figure 2: File downloaded in Excel format.|
Let’s start by creating a table.
It is as simple as choosing the option “Insert table” in the ´What do you want to do? ´ menu. It automatically preselects the entire table and detects that it has headers- we just have to confirm them. (We have eliminated the first row before because the information is now redundant).
|Figure 3, We create a table|
We have something like this:
|Figure 4: Appearance of the table.|
We can insert total values. To do this, we activate the “Row of totals” box, which can be found in Table / Design Tools.
|Figure 5: Insert row of totals.|
Then, in the column that we need, we display and choose the function (sum, average, maximum, etc.). In this case, we choose “the account” of the accidents.
As we are going to analyse the distribution of cases by time slots, we will assign to each action to a label:
- Tomorrow: from 7 to 12
- Afternoon: from 13 to 20
- Night: from 21 to 6
To do this, first remove the “DE” with the function “Search-Replace” (from the Home menu)
|Figure 6: Use Search-Replace function.|
Then we extract the data “hour” from the column “Hourly section”. Since the interval is always 1 hour, we are only interested in the first numeric value of the column. We insert two additional columns, and from the Data menu, we select:
Data, Text in columns, delimited by “:”
We are left with the first column, which indicates the time and we eliminate the other two.
Now we want to assign a “morning / evening / night” tag, depending on the value of that column. For that, we can use nested SI functions. As they are sometimes a little complex, if you have any problems with them, these videos will be very useful.
Insert a new column, and with the nested SI function assign the corresponding labels:
= YES (O ([@ [TIME STOP]]> = 20; [@ [TIME STOP]] <= 6); “Night”; YES ([@ [TIME STOP]]> = 13; “Afternoon”; ” Morning”))
|Figure 7: Use of the nested SI function.|
Once the time bands are labelled, we will insert the dynamic table. These tables (pivot table) will allow us to create a simple “dashboard” to analyse the data dynamically.
Before inserting the dynamic table, we make sure that the names of the columns are correct, to facilitate the selection of fields in the next phase. Then, we select “Insert” – Dynamic Table, and a new sheet.
|Figure 8: Create dynamic table.|
Now, we will have to add the fields that interest us in our dynamic table. For example, by adding Schedule and Months in rows, and as values, Type of Vehicle (that being the “bicycle” helps us to count the cases):
|Figure 9: We choose fields.|
|Figure 10: Incidents per month.|
.. we can begin to see how the number of accidents registered in January (44) almost doubled in the summer months, with a maximum value of 80 incidents in June.
Remember that our goal is to create a “table” where we can answer questions such as:
- What are the most frequent accidents?
- When do they occur? Are there more at sometimes of the year than others? More in certain time slots than others?
- Where are they happening? Are they more frequent in some districts than in others? Are there “black spots” of accident accumulation?
1. We start by creating the first visualization. We add the fields:
- Type of accident (Rows)
- District (Filter)
- Hourly section (Columns)
- Number of incidents (values)
We can see how many accidents of each type occur according to time zone, district, or we can filter those that interest us the most.
|Figure 11: Accidents by time zone and district.|
2. We create a second table on the same sheet
Remember that our intention is to create a table. For this, when adding the new table from the sheet “accident list …” we indicate the name of the sheet and its position (simply by choosing a cell) in which we have created the first pivot table instead of “new sheet”.
In this second table we will analyse the accidents by district. Therefore, we are going to add the following fields:
- District (Rows)
- Months (Filter)
- Number of incidents (values)
We obtain this result:
|Figure 12, Incidents by district|
Now, let’s filter again by visual time slot. We want to create some “buttons” labelled “Tomorrow”, “Afternoon” and “Night”, and place them on each of the dynamic tables that we have created.
For this, in the field selector for the dynamic table, on the “Time section” option, we choose “Add as data segmentation” (slicing) in the context menu.
|Figure 13: We add segmentation by Section Schedule|
With this we obtain the “buttons”. Now, to make it look better on the panel, we put the buttons on the dynamic table and, using the menu “Data segmentation tools“, we chose the option “3 columns” to display it horizontally.
|Figure 14, Configuration of columns for “buttons”|
And it looks like this:
|Figure 15: Segmentation buttons on the table|
Thus, simply by pressing the button that interests us, we can see the data in the table segmented by that specific value. Now we want that segmentation to take affect on the other table as well. Since we are building a panel, we want all tables / diagrams to be “synchronized” in terms of time zone.
To do this, in the “Data segmentation tools” menu, select “Report connections” and also mark the first table (Dynamic table 8). We see that the segmentation can be synchronized even with dynamic tables that are located in other sheets, which allows us to make multipage reports.
|Figure 16: Report connections.|
3. Insert dynamic graphics.
Now that we have the two tables, and a segmentation that interests us, we can insert dynamic graphs that help us analyse the information in a more visual way. For example, in the first table we insert a “pie” dynamic graphic (pie chart).
To insert the dynamic chart, we put ourselves on the table that interests us, and select the option “Insert dynamic chart” from the PivotTable Tools menu, Analyse
|Figure 17: We insert a dynamic graph|
Thus, as we select the “buttons” that filter by time slot, the tables and dynamic graphics are updated. Therefore, we can see that, in the mornings, in the set of districts, the most common type of accident is the double collision.
|Figure 18, type of incident / tomorrow.|
We also see that, at night, the type of incidents diversifies.
|Figure 19: Dynamic graph type of incident / night|
Now we insert a second dynamic chart that shows us the number of cases per district. From the second table that we have created, we add a dynamic graphic as we did in the previous case. In this instance we will choose, for example, a bar diagram. As in the previous case, we will create a new data segmentation menu. In this case, we will segment by “type of accident”. Thus, we already have a panel in which we can analyse and visualize the data by choosing the segmentation that most interests us.
We will reposition the graphs and segmentation buttons in the way that is most accommodating to us (for example), and we already have a control panel that allows us to analyse the data with ease. When placing them, you must keep in mind that you cannot overlap some of the tables with others. Therefore, it will be necessary to leave the maximum space it will occupy in each segment for each table. These details are those that are optimized in the specific tools that we find in the market on data analytics, as well as having easier access the result as it is visually attractive. But for the purpose of this example, Excel is more than enough.
|Figure 20, Example of panel|
Using the two sets of buttons, the graphs and dynamic tables are updated automatically. Now we can answer the questions we raised at the beginning of this post.
- Which accident is the most frequent?
- At what times do the most accidents occur?
- In which district do more accidents occur? Is there a specific street?
- Does the number of accidents vary according to the season of the year?
We will answer these questions, but then I invite you to raise your own.
Which accident is the most frequent? At what times do the most accidents occur?
Selecting all accident types and all time slots, we can see that the most frequent are falls and double collisions, and that it is in the afternoon when more incidents occur.
In the mornings, the number of falls is slightly lower than that of double collisions, but they increase as the day progresses, being the most common at night. At night, the accidents decrease, but crashes with a fixed object increase.
In which district do more accidents occur? Any specific street?
If we observe the second dynamic table and its associated graph (we have changed it to vertical bars since we can modify the type of graphic we are looking at with more ease), we see, without doing any previous filtering by time or by district, that the higher number of incidents are concentrated in the district of Centro, followed, with some distance by the districts of Retiro and Arganzuela.
|Figure 21, Number of accidents per district|
Interestingly, if we analyse this same graph according to the time slot, we see how the pattern of incidents per district changes. The Centro district is always in first place, but the rest of the positions vary. This surely has to do with the fact that in some districts there is a greater concentration of leisure facilities, or commercial areas, offices etc.
For example, as a whole, the districts with the highest number of accidents are Centro, Arganzuela and Chambertí. However, in the afternoons, the districts with the highest number of incidents are, in this order, Centro, Retiro, Chamberí and Carabanchel. And in the mornings, they are Centro, Moncloa and Arganzuela.
Does the number of accidents vary according to the season of the year?
In the first dynamic table we created, we already saw the answer to this question. The number of incidents has gradually increased from January 30 to the maximum of 58, registered in June. It is logical that as the weather conditions improve, the number of incidents increases, since the good weather encourages more people to use the service.
We have left one question unanswered. A question that may be the most important, since it can help prevent accidents.
Can we detect any “black spots”? Is there a street where there are more accidents?
To answer this question, we will go back to the initial table “Bicycle Accident List”. Next to the column “Address” we will insert a new row with a function that tells us how many times a value is repeated in a column. The formula we will use is “COUNTIF.SI”:
= COUNTIF ([Place]; [@ Place])
Then, in the result column, we can filter through those streets with a greater number of occurrences. If we observe the greatest frequency of occurrence, “11”, we see that it corresponds to Calle Alcalá. However, when checking the column Number, we see that there are no coincidences, therefore it does not show any specific point of greater danger.
The same happens if we filter by occurrence numbers “6” and “5”. We find the streets Bravo Murillo and Paseo de la Castellana, which, after Calle Alcalá, are the second and third longest streets in Madrid.
|Figure 23: Incidents occurred on Calle Alcalá.|
It will be necessary to carefully follow the incidents that are recorded in biciMAD to be able to draw conclusions about this data, which may result in the improvement of the service. Today, the data collected does not indicate any specific black point in which there is a higher concentration of accidents. However, it would be advisable to investigate the points denounced by the users themselves, since, although incidents have not yet been recorded in those points, they may end up happening in the future.