Parcoursup use case example¶
In this example, we will do a study on the Parcoursup dataset, provided by the French government open data.
This dataset is built from the Parcoursup data of the 2022 campaign. The Parcoursup application is the national pre-registration platform established by the French Ministry of Higher Education, Research, and Innovation, allowing students to apply for university.
It covers all candidates who have at least one validated orientation wish in the main and/or complementary phase, among the 13,644 non-apprenticeship programs offered. It thus covers 967,664 candidates. A subset of the data specifically focuses on the 624,620 high school graduates among these candidates.
Importing the dataset¶
First we will import our dataset from our local machine into papAI.
Here is a demo of importing a dataset from local storage
Cleaning the dataset¶
The Drop Column feature simplifies data manipulation by letting users remove redundant or irrelevant columns from their dataset. In our dataset, the Session column contains the value 2022 in every row, so it is a good candidate for removal. To drop it, select the Drop Column option and specify the target column name.
We notice that the Session column is always 2022 in the dataset, so we can drop it.
Here is a video of the drop column operation in the cleaning module
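For readers who prefer code, the same clean-up can be sketched in pandas. This is only an illustration of the idea, not how papAI implements the operation: the `Session` column name comes from the dataset, while the `program` column and all values are made up.

```python
import pandas as pd

# Toy rows: only the Session column name comes from the tutorial;
# the "program" column and every value are hypothetical.
df = pd.DataFrame({
    "Session": [2022, 2022, 2022],
    "program": ["A", "B", "C"],
})

# A column with a single distinct value carries no information.
is_constant = df["Session"].nunique() == 1

# drop() returns a new DataFrame and leaves df untouched.
cleaned = df.drop(columns=["Session"]) if is_constant else df
print(list(cleaned.columns))
```

Checking `nunique()` first guards against dropping a column that only looks constant in the first few rows.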
Our dataset contains courses of three major types: High schools, Universities and other courses. It would be interesting to separate each type into its own dataset.
Here is a demo of filtering the university courses into a new dataset
You can filter out the two course types (High schools and other courses) by yourself
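As a rough pandas equivalent of this filter, assuming a hypothetical `course_type` column that holds the three types (the real column name and labels in the dataset may differ):

```python
import pandas as pd

# Hypothetical column names and values, for illustration only.
df = pd.DataFrame({
    "course_type": ["University", "High school", "Other", "University"],
    "program": ["Law", "CPGE", "BTS", "Maths"],
})

# Boolean mask: keep only the rows whose type is "University".
universities = df[df["course_type"] == "University"].reset_index(drop=True)

# The other two types can be separated the same way.
high_schools = df[df["course_type"] == "High school"].reset_index(drop=True)
others = df[df["course_type"] == "Other"].reset_index(drop=True)
```

Each filtered frame is an independent dataset, which mirrors the way the platform writes the filter result into a new dataset.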
Dataset Table visualization¶
papAI also lets you analyze a dataset's attributes directly through a data table, which shows a bar chart of the values of each attribute.
Here is a video showcasing the visualization of the courses by area attribute
We can clearly see that the Ile-de-France region is leading with the most offered courses to students (942)
The resulting data visualization can be exported in different image formats: png, jpg, svg.
Group By feature¶
The Group By feature in the papAI platform empowers users to gain valuable insights by grouping data based on specific criteria. By selecting a desired column or attribute, users can effortlessly aggregate and organize their dataset, enabling them to identify patterns, trends, and correlations within their data. Whether it's grouping sales data by region, customer segments, or time intervals, this feature provides a powerful tool for slicing and dicing data to uncover hidden relationships and extract meaningful information. The Group By feature enhances the user's ability to perform comprehensive analysis and make data-driven decisions with ease and efficiency.
Here is a video of a Group By operation aggregating the acceptance rate by specialty (acceptance rate per specialty)
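The aggregation in the video can be approximated in pandas as follows. The column names `specialty` and `acceptance_rate`, the values, and the choice of the mean as the aggregation function are all assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical columns and made-up rates (in percent).
df = pd.DataFrame({
    "specialty": ["Law", "Law", "Maths", "Maths"],
    "acceptance_rate": [20.0, 40.0, 50.0, 70.0],
})

# One row per specialty, holding the mean acceptance rate of its courses.
rate_per_specialty = (
    df.groupby("specialty", as_index=False)["acceptance_rate"].mean()
)
print(rate_per_specialty)
```

Note that this is an unweighted mean over courses. If courses differ widely in size, a rate weighted by the number of candidates may be more representative.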
The papAI interface incorporates a robust visualization feature that empowers users to effectively represent their data on various types of graphs. With just a few clicks, users can choose the appropriate graph type, such as bar charts, line plots, scatter plots, or pie charts, and map their data onto the X and Y axes. This versatile feature allows for the creation of visually compelling and insightful visualizations, enabling users to identify trends, compare data points, and communicate their findings effortlessly. Whether it's displaying the distribution of sales across different product categories or visualizing the correlation between two variables, our visualization feature provides users with the tools to present their data in a visually appealing and meaningful way, enhancing their data analysis and storytelling capabilities.
Here is a video of visualizing acceptance rates per specialties through a horizontal bar chart
papAI includes a convenient Dataset Download feature that allows users to easily export their analyzed data for further use or sharing. With just a few simple steps, users can select the specific dataset they have been working on and initiate the download process. Whether it's a CSV file, Excel spreadsheet, or any other preferred format, the solution ensures that the downloaded dataset retains its integrity and structure. This feature provides users with the flexibility to continue their analysis using other tools, share the data with collaborators, or store it for future reference.
Here is a video of downloading the high schools dataset
By offering a seamless and efficient download experience, our app empowers users to take their analyzed data beyond the platform and leverage it in various contexts.
Python recipe feature¶
This feature enables users to enhance their analysis by seamlessly integrating custom Python scripts into their workflow. By simply selecting the Python Recipe option and adding their script, users can leverage the full power of Python's extensive libraries and capabilities. Whether it's performing advanced statistical calculations, implementing complex machine learning algorithms, or creating customized visualizations, this feature empowers users to tailor their analysis to their unique requirements. With the ability to execute Python scripts directly within the app, users can unleash the full potential of their data and explore new possibilities for insights and decision-making. The "Python Recipe" feature provides a flexible and extensible framework, allowing users to extend the functionality of the app and unlock limitless analysis possibilities.
Here is a demo of adding a basic python recipe (pandas drop duplicates)
When you enter the Python recipe interface, our comments will guide you through your code.
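The pandas step used in the demo looks roughly like the sketch below. The recipe template that papAI generates (how the input dataset is read and how the output is written) is not reproduced here, so the DataFrame is built by hand with made-up columns and values.

```python
import pandas as pd

# Hand-built stand-in for the recipe's input dataset (hypothetical data).
df = pd.DataFrame({
    "program": ["Law", "Law", "Maths"],
    "city": ["Paris", "Paris", "Lyon"],
})

# Remove rows that are exact duplicates across all columns,
# keeping the first occurrence of each.
deduplicated = df.drop_duplicates().reset_index(drop=True)
print(len(df), "->", len(deduplicated))
```

To treat rows as duplicates based on a few columns only, `drop_duplicates` accepts a `subset` argument listing those columns.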
Extract using Separator Parser¶
This feature simplifies the extraction of specific string columns and splits them into different new columns based on a designated separator. With this feature, users can effortlessly parse and reorganize their data to extract valuable information. By specifying a separator, such as a comma, space, or custom character, users can efficiently split their string columns into separate columns, each containing the desired extracted data.
Here is a video of separating the latitude and longitude from the geo_localisation column containing the coordinates of each course
This feature is particularly useful when dealing with datasets that have concatenated values or when wanting to isolate specific elements within a column.
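A minimal pandas sketch of the same split, assuming the `geo_localisation` values are stored as `"latitude,longitude"` strings separated by a comma. The separator, the order of the two parts, and the sample coordinates are assumptions.

```python
import pandas as pd

# Made-up coordinates; only the geo_localisation column name is from the tutorial.
df = pd.DataFrame({"geo_localisation": ["48.8566,2.3522", "45.7640,4.8357"]})

# Split once on the separator; expand=True returns one column per part.
parts = df["geo_localisation"].str.split(",", n=1, expand=True)

# The new columns are still strings at this point.
df["latitude"] = parts[0].str.strip()
df["longitude"] = parts[1].str.strip()
print(df[["latitude", "longitude"]])
```

Limiting the split with `n=1` keeps the result at two columns even if a value unexpectedly contains a second separator.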
The Cast Column operation lets users convert the data types of their columns, ensuring compatibility and accuracy in their analysis. Whether it's converting a column from string to numeric, date to timestamp, or any other necessary conversion, papAI provides a user-friendly interface to facilitate the process. Once the data types match the nature of the underlying information, calculations, comparisons, and aggregations behave as expected, which gives the analysis a solid foundation of properly formatted and interpreted data.
Here is a video of casting the latitude and longitude columns into double type
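In pandas terms, casting string columns to a double-precision type could look like this. The sample values are made up, and `float64` is pandas' name for a double.

```python
import pandas as pd

# Coordinates stored as strings, as they would be right after a text split.
df = pd.DataFrame({
    "latitude": ["48.8566", "45.7640"],
    "longitude": ["2.3522", "4.8357"],
})

# Cast both columns to double precision in one call.
df = df.astype({"latitude": "float64", "longitude": "float64"})
print(df.dtypes)
```

`astype` raises an error on values that cannot be parsed. `pd.to_numeric(column, errors="coerce")` is an alternative that turns such values into missing values instead.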
papAI incorporates a powerful Heatmap Visualization feature that enables users to gain valuable insights by representing their data using color-coded matrices. With this feature, users can easily visualize patterns, correlations, and distributions within their dataset. By mapping data values to different color gradients, the heatmap provides an intuitive and visually appealing representation of the underlying trends. Whether it's identifying high or low values, detecting clusters, or spotting anomalies, the Heatmap Visualization feature allows users to explore complex relationships and make data-driven decisions. With customizable options for color schemes, labels, and tooltips, users can tailor the heatmap visualization to their specific needs, enhancing their ability to communicate findings and share impactful visualizations with others. The Heatmap plot is a powerful tool in our app's arsenal, empowering users to unlock hidden patterns and gain deeper insights from their data.
Here is a video of plotting a heatmap of the applications on the map of France, using the latitude and longitude extracted from the geo_localisation column
We can clearly see that the applications are concentrated mainly in the Ile-de-France region
Pie chart Visualization¶
papAI allows users to effectively represent categorical data using pie charts. With a simple selection of columns or variables, users can quickly generate pie charts that provide a clear visual representation of the data distribution. Each category is represented by a slice of the pie, with the size of the slice proportional to the corresponding data frequency or proportion. This intuitive visualization enables users to easily identify the relative contributions of different categories and make comparisons at a glance. With customizable labels, colors, and legends, users can tailor the pie chart to match their preferences and convey information in a visually appealing and informative manner. The Pie Chart Visualization feature in our platform enhances data exploration and presentation, allowing users to communicate insights and patterns in a concise and engaging way.
Here is a video of plotting a Pie chart displaying the distribution of acceptance rates in the Ile-de-France departments
We can clearly see that the acceptance rates are much higher in Paris (75) than in the other departments.
Some projects need large amounts of data, possibly unstructured, and you may be limited by the storage space available. For these cases, papAI offers another type of storage: bucket object storage.
Here is a video of creating a bucket containing some of the charts we did earlier with some other files about this project
Notice that a bucket can hold different types of files (jpg, png, pdf, csv...).
Get new dataset with Spark SQL recipe¶
Our platform also offers a powerful Spark SQL Recipe feature, empowering users to perform advanced data transformations and generate new datasets using the Spark SQL engine. With this feature, users can harness the full capabilities of Spark SQL to execute complex SQL queries, apply filters, aggregations, joins, and more. By leveraging the expressive SQL syntax, users can define their data manipulation logic with ease. Whether it's performing complex data transformations, creating derived columns, or summarizing data, the "Add Spark SQL Recipe" feature provides a robust framework for data manipulation and transformation. Users can effortlessly execute their Spark SQL recipes and obtain a new dataset that encapsulates the desired transformations. This feature enhances the flexibility and scalability of data analysis, enabling users to leverage the power of Spark SQL for efficient data processing and analysis.
Here is a video of creating a new dataset containing only Sorbonne Universities using Spark SQL
Follow our comments in the recipe interface to be guided while coding
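The kind of query used in such a recipe can be sketched as below. To keep the example runnable without a Spark installation, the SQL is executed against an in-memory SQLite table as a stand-in for Spark SQL. The table name `universities`, the column names, and the rows are all hypothetical.

```python
import sqlite3

# Plain SQL that would also be valid in Spark SQL.
# Table and column names are hypothetical.
QUERY = """
SELECT institution, program
FROM universities
WHERE institution LIKE '%Sorbonne%'
ORDER BY institution
"""

# In-memory SQLite table standing in for the papAI dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE universities (institution TEXT, program TEXT)")
conn.executemany(
    "INSERT INTO universities VALUES (?, ?)",
    [
        ("Institution Sorbonne A", "Maths"),
        ("Institution B Sorbonne", "Law"),
        ("Institution C", "History"),
    ],
)
sorbonne_rows = conn.execute(QUERY).fetchall()
conn.close()
print(sorbonne_rows)
```

One difference to keep in mind: `LIKE` is case-insensitive for ASCII characters in SQLite by default, whereas it is case-sensitive in Spark SQL.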
Predict features with linear regression¶
papAI also supports predictive analytics using linear regression modeling. With this feature, users can explore the relationships between variables and make predictions based on linear patterns observed in the data. By selecting the target variable and specifying the predictor variables, users can easily build a linear regression model that captures the underlying trends. The model estimates the coefficients and the intercept, which are then used to predict the value of the target variable from the provided predictors. This is particularly useful for forecasting sales, estimating future trends, or understanding the impact of different factors on a particular outcome. With intuitive visualizations and statistical metrics, users can evaluate the accuracy and significance of the model and decide how far to rely on its predictions. The "Predicting Features with Linear Regression" feature in our platform enhances the predictive capabilities of users, providing valuable insights and aiding in strategic planning and decision-making.
Here is a video of plotting a linear regression to predict the acceptance in Sorbonne Universities
Notice that the papAI ML interface provides different metrics for each model.
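To make the coefficient and intercept estimation concrete, here is a minimal ordinary least squares fit with NumPy. It is not papAI's implementation: the single predictor `x`, the target `y`, and their values are made up, and chosen to lie exactly on a line so the result is easy to check.

```python
import numpy as np

# Made-up toy data lying exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical predictor
y = np.array([3.0, 5.0, 7.0, 9.0])   # hypothetical target

# Design matrix: one column for the predictor, one column of ones for the intercept.
X = np.column_stack([x, np.ones_like(x)])

# Least squares solution of X @ [slope, intercept] = y.
(slope, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict the target for a new predictor value.
predicted = slope * 5.0 + intercept

# R squared: share of the target's variance explained by the model.
residuals = y - (slope * x + intercept)
r_squared = 1.0 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(slope, intercept, predicted, r_squared)
```

R squared is one example of the kind of metric a model evaluation screen typically reports. On real data it will be below 1, and the residuals are worth inspecting before trusting the predictions.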